by Mu-Chi Chen
What Is a Data Hub?
What Is a Data Hub: a one-stop shop for data consumers
What Is a Data Hub: Main Functions
Examples
from platform providers:
from an alternative/complementing definition of data hub:
from Google as a narrowly purposed product:
Example: Cumulocity
Highlights:
Example: MarkLogic
Highlights:
Example: Cloudera
workflow:
reference:
https://www.cloudera.com/products/data-hub.html#
https://docs.cloudera.com/data-hub/cloud/index.html
Things to Consider
how to move data into our data hub? (data ingestion & integration)
where do we store the data in our data hub?
implications of our infrastructure choices?
slides above were covered in
Data Hub Project Weekly Meeting on 9/28
Example: Data Hub as Data Catalog/ Metadata Platform
What is a Data Catalog?
According to IBM, “a data catalog is a detailed inventory of all data assets in an organization, designed to help data professionals quickly find the most appropriate data for any analytical or business purpose.”
reference: https://www.ibm.com/topics/data-catalog
Why does a Data Hub need this?
Aside from enabling people to get the data they need easily, data hubs should also help them find the data they need (this would essentially be the face of this project).
Example: Data Hub as Data Catalog/ Metadata Platform
Common functions of a data catalog
reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained
Example: LinkedIn & Acryl Data’s DataHub
open source
slides 12, 14-16 reference a LinkedIn Engineering blog post about metadata catalog architectures by Acryl Data’s CTO (previously the Principal Staff Software Engineer leading LinkedIn’s data infrastructure team)
Example: LinkedIn & Acryl Data’s DataHub
architecture
reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained
Example: LinkedIn & Acryl Data’s DataHub
architecture
reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained
Example: LinkedIn & Acryl Data’s DataHub
reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained
Example: Microsoft’s OneLake Data Hub
question:
What does MBZUAI need?
Example: Ads Data Hub (will skip in presentation)
Highlights:
For weekly meeting on 10/5/2023
topic: continue to survey data hub elsewhere
maybe
Example: Internet of Water Data Hub
What is the Internet of Water Coalition?
A group of organizations (e.g. Duke University Nicholas Institute, Lincoln Institute Center for Geospatial Solutions, Western States Water Council) working with federal, state, and local governments to improve water management by building a water data infrastructure.
In summary:
Better data means better management. Therefore, we need good infrastructure to make water data more discoverable, accessible, and usable
Example: Internet of Water (IoW) Data Hub
Example: Internet of Water (IoW) Data Hub
basic components of a Data Hub
individual data hubs are then connected by Geoconnex, which is essentially a global metadata catalog (this is another place data users can come in)
Example: Internet of Water (IoW) Data Hub
there are four types of data hubs, and this is type A
Example: Internet of Water (IoW) Data Hub
type B:
type C:
* data sharing creates a privacy issue
Example: Internet of Water (IoW) Data Hub
type D (which is more like our original vision):
* can the data producers trust the data hub to transform the data right? (which brings us back to the ETL v.s. ELT question)
Example: Internet of Water (IoW) Geoconnex
purpose: make it easier for data consumers to find/access the data they want/need. Roughly similar to what LinkedIn and Microsoft are doing
why I think it is not a good product
once everything above is done right, a web crawler can then visit these pages, extract the data it needs, and store it in a central database
Example: Internet of Water (IoW) Data Hub
IoW Data Hub examples:
https://newmexicowaterdata.org/
references:
https://internetofwater.org/wp-content/uploads/2022/09/Blog019_IoWWaterDataHubs_Watson.pdf (slides 20-25)
https://internetofwater.org/blog/geoconnex/ (slide 26)
Example: the Art Institute of Chicago Data Hub
background
Example: the Art Institute of Chicago Data Hub
design goals
Example: the Art Institute of Chicago Data Hub
architecture
Example: the Art Institute of Chicago Data Hub
challenges
references: https://mw19.mwconf.org/paper/building-a-data-hub-microservices-apis-and-system-integration-at-the-art-institute-of-chicago/index.html (slides 28-31)
Example: Red Hat’s Open Data Hub (ODH)
Example: Red Hat’s Open Data Hub (ODH)
Architecture highlights
Example: Red Hat’s Open Data Hub (ODH)
references:
https://opendatahub.io/docs/architecture/ (slide 32)
https://cloud.redhat.com/hubfs/Open-Data-Hub-A-Deep-Dive.pdf (slide 33)
Example: Google Analytics Hub
feature of BigQuery
short intro to BigQuery:
Example: Google Analytics Hub
How might it look like for MBZUAI?
Example: Google Analytics Hub
Alright, so how would I build it?
Scaling
What could be MBZUAI Data Hub’s unique offering?
Automatic Sampling Bias Detection
what is sampling bias?
sample unrepresentative of population
why do we care?
what role can data hub play in this?
by enabling automatic and real-time exploratory data analysis, we can
Automatic Sampling Bias Detection
how do we detect sampling bias?
what does detecting sampling bias mean?
uncovering patterns in data (arguably data mining, or knowledge discovery in data): e.g., 90% of the data collected today came from people who identify as cisgender men (assigned male at birth and identify as men)
Automatic Sampling Bias Detection
ways to uncover patterns in data
methods of unsupervised learning (a minimal clustering sketch follows this list):
exclusive (K-means)
hierarchical (agglomerative)
probabilistic (Gaussian Mixture, determines the probability of data belonging to a particular distribution)
Apriori algorithm (frequent data combinations)
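To make the clustering idea above concrete, here is a minimal sketch (my own illustration, assuming scikit-learn is available; the 0.5 share threshold is an arbitrary assumption) that flags clusters whose share of the data looks suspiciously dominant:

import numpy as np
from sklearn.cluster import KMeans

def flag_dominant_clusters(X, k=3, threshold=0.5):
    """Cluster the samples and report clusters whose share of the data
    exceeds `threshold`, a crude signal of possible sampling bias."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    shares = np.bincount(labels, minlength=k) / len(X)
    return [(cluster, share) for cluster, share in enumerate(shares) if share > threshold]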
Automatic Sampling Bias Detection
ways to uncover patterns in data
methods of unsupervised learning (an isolation-forest sketch follows this list):
local outlier factor (local density of the focal point vs. its neighbors)
isolation forest (builds an overfitted decision tree by splitting at random thresholds & features until every point is isolated; those isolated early are outliers)
autoencoders (perhaps for images; the encoder learns a latent representation of the input and the decoder reconstructs the input from the encoded output; a larger reconstruction error implies an anomaly)
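A minimal sketch of the isolation-forest idea above (my own example, assuming scikit-learn and an assumed 1% contamination rate):

import numpy as np
from sklearn.ensemble import IsolationForest

def find_outliers(X, contamination=0.01):
    """Return indices of points the isolation forest isolates early (outliers)."""
    preds = IsolationForest(contamination=contamination, random_state=0).fit_predict(X)
    return np.flatnonzero(preds == -1)   # fit_predict marks outliers as -1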
Automatic Sampling Bias Detection
ways to uncover patterns in data
methods of supervised learning:
understand the different ways in which your data can be categorized to adjust data collection strategy
e.g.
classification results seem geographical, and much more data is classified as coming from East Asia
KNN, K-nearest neighbors
decision tree & random forest
Academic Discussions on Data Hub
Technical Report: Developing a Working Data Hub
Vijay Gadepally (OSU Ph.D. in ECE, technical staff at MIT Lincoln Lab)
Jeremy Kepner (Princeton Ph.D. in Astrophysics, head of MIT Lincoln Lab)
MIT on the need of data hub
big data challenges:
evolving solution:
MIT on data hub architecture
at the bottom:
external data sources + your own data storage including databases, data warehouses, and data lakes
in the middle:
heterogeneous data management and transformation. enable users to query multiple sources at once
on the top:
data discovery and define data transformation rules
MIT on data hub system engineering
NOTE: “system engineering studies the development of complex systems (p.8)”
HIGHLIGHTS:
3a is an operation supported on databases while 3b is a strategy for file systems/ object storage/ block storage i.e. place for unstructured/ raw data like data lakes
=> similar to my data hub architecture proposal
MIT on raw data storage: Lustre
Lustre further reading
“Azure Managed Lustre: not your grandparents' parallel file system”
Ceph (used in Redhat’s Open Data Hub, originally part of Sage Weil’s doctoral dissertation at UC Santa Cruz, also used extensively in HPC) vs Lustre:
https://www.linkedin.com/pulse/ceph-lustre-yashar-esmaildokht/
AWS FSx for Lustre (integrates with s3)
https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html
official site:
MIT on DBMS
what is a DBMS?
software interface between users and the database
why use a DBMS?
MIT on DBMS
ACID vs. BASE
relational db: atomicity, consistency, isolation, and durability. OLTP
non-relational db: basically available, soft state, eventual consistency
CAP theorem
you can only choose two of consistency, availability, and partition tolerance
(conjectured by a professor at UC Berkeley, later proven by researchers at MIT)
relational db: C + A
non-relational db: C + P (MongoDB), A + P (Cassandra, developed at Facebook; one of its creators had previously worked on Amazon’s Dynamo, the basis for DynamoDB)
MIT on DBMS
scaling
relational db: vertical (improve the hardware, because they mainly run on a single node)
non-relational db: horizontal (add nodes, which creates a need for strong partition tolerance, which makes relaxing C and/or A inevitable if the CAP theorem is correct)
That said, there is controversy surrounding CAP
NewSQL (as opposed to SQL and NoSQL) movement that aims to build databases that are both ACID compliant and performant (i.e. horizontally scalable)
some examples include …
CockroachDB: https://www.cockroachlabs.com/
Google Cloud Spanner (which we have seen before when introducing BigQuery, now advertised as being half the cost of DynamoDB): https://cloud.google.com/spanner?hl=en
MIT on DBMS
Access Control: who can access what
granularity: table, row, to cell levels
principal: the “who,” can be role/ user/ groups
asset: the “what,” i.e. resources stored in database
view-based access control being most popular
query control strategies: restricts what query the principal is allowed to issue
MIT on DBMS
ways to add security
well … duh … encryption and/or masking
CryptDB (also by MIT):
MIT on heterogeneous data management
MIT on heterogeneous data management
BigDAWG (Big Data Working Group)
MIT on heterogeneous data management
MIT on key considerations when developing data hub
Technological considerations
MIT on key considerations when developing data hub
Infrastructure considerations
MIT on key considerations when developing data hub
Data formatting considerations
MIT on key considerations when developing data hub
Security Considerations
MIT on key considerations when developing data hub
Policy considerations
MIT on key considerations when developing data hub
User engagement
UCSC Xena Platform
https://www.nature.com/articles/s41587-020-0546-8
enables visualization and interpretation of cancer genomics data from disparate sources
Background
UCSC Xena Platform
Architectural Design
two components:
UCSC Xena Platform
Lessons from government agencies
Taiwan’s Health and Welfare Data Center (HWDC)
https://pubmed.ncbi.nlm.nih.gov/31118821/
Background
Taiwan’s Health and Welfare Data Center (HWDC)
Background
Taiwan’s Health and Welfare Data Center (HWDC)
Challenges: Privacy
Taiwan’s Health and Welfare Data Center (HWDC)
Challenges: Data Quality
Taiwan’s Health and Welfare Data Center (HWDC)
Challenges: Unmeasured or unavailable variables
Querying Encrypted Data
by Microsoft Research
Querying Encrypted Data: Demand
migration to cloud and thus storage on cloud
Querying Encrypted Data
enters encryption:
you won’t be able to understand/ use the data even if you managed to gain access to them
the quick brown fox … a pangram
Querying Encrypted Data
two fundamental techniques:
Querying Encrypted Data
more on the previous page
partial: one gate (+ or *)
full: multiple gates and unbounded depth (i.e. arbitrary functions)
Monomi: work of MIT CS and AI Lab (CSAIL) in 2013
CryptDB: from CSAIL too, 2011, Raluca Ada Popa is an Associate Professor at Berkeley EECS now
TrustedDB: Stony Brook University (NY), 2014
Cipherbase: Microsoft research, 2015
Querying Encrypted Data
symmetric encryption
Querying Encrypted Data
asymmetric encryption
Querying Encrypted Data
Advanced encryption standard cipher block chaining mode
Querying Encrypted Data
nondeterministic (more secure):
>> for i in range(2):
>>     print(encrypt("foo"))
1qaz2wsx
3edc4rfv
deterministic:
Querying Encrypted Data
homomorphic encryption:
>> encrypt(1) + encrypt(1) == encrypt(2)
order preserving encryption:
Querying Encrypted Data
to summarize
Querying Encrypted Data
authors’ opinion: outdated?
Querying Encrypted Data
Querying Encrypted Data: Trusted Client Architecture
client: applications performing CRUD operations against a server DBMS
Querying Encrypted Data: CryptDB as Trusted Client example
CryptDB:
Querying Encrypted Data: CryptDB as Trusted Client example
example operation
client app
>> SELECT SUM(grade) FROM students;
web proxy
>> decrypt(paillier_grade_sum)
DBMS
>> SELECT PAILLIER_SUM(paillier_grade) AS paillier_grade_sum
FROM students;
students table’s schema:
from user’s perspective
{
“id”: int,
“name”: str,
“grade”: int
}
on DBMS
{
“paillier_id”: str,
“name”: str,
“paillier_grade”: str
}
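The PAILLIER_SUM rewrite above works because Paillier encryption is additively homomorphic: multiplying ciphertexts yields an encryption of the sum of the plaintexts. A toy sketch of my own (insecure, tiny hard-coded primes, purely to illustrate the encrypt(1) + encrypt(1) == encrypt(2) idea from earlier; not CryptDB's actual implementation):

import math
import secrets

def keygen(p=293, q=433):                      # toy primes, NOT a secure key size
    n = p * q
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                       # works because we pick g = n + 1
    return n, (lam, mu, n)

def encrypt(n, m):
    r = secrets.randbelow(n - 2) + 2
    while math.gcd(r, n) != 1:                 # fresh randomness -> nondeterministic
        r = secrets.randbelow(n - 2) + 2
    return (pow(n + 1, m, n * n) * pow(r, n, n * n)) % (n * n)

def decrypt(priv, c):
    lam, mu, n = priv
    return (((pow(c, lam, n * n) - 1) // n) * mu) % n

n, priv = keygen()
c = (encrypt(n, 1) * encrypt(n, 1)) % (n * n)  # "adding" under encryption
assert decrypt(priv, c) == 2                   # homomorphic sum recovered

This is only meant to show why SUM can be pushed to the untrusted server while the key stays on the trusted side.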
Querying Encrypted Data: Blob Store as Trusted Client example
What is Blob Store?
students table’s schema:
{
“name”: str,
“blob_grade”: str,
“partition_id”: int
}
range partition:
grade | partition_id |
1 - 9 | 0 |
10 - 19 | 1 |
20 - 29 | 2 |
Querying Encrypted Data: Blob Store as Trusted Client example
Architecture
Querying Encrypted Data: Blob Store as Trusted Client example
client app
>> SELECT SUM(grade)
FROM students
WHERE grade > 10;
DBMS Shell (trusted, client side)
>> SELECT SUM(DECRYPT(grade_blob))
FROM students
WHERE DECRYPT(grade_blob) > 10;
DBMS (untrusted, server side)
>> SELECT grade_blob
FROM students
WHERE partition_id > 0;
example operation
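A small sketch (hypothetical names, my own simplification of the rewrite illustrated above): the trusted shell maps a plaintext range predicate onto the coarse partition ids the untrusted DBMS can filter on, then finishes the exact filtering after decryption:

PARTITIONS = [(1, 9, 0), (10, 19, 1), (20, 29, 2)]   # (lo, hi, partition_id), as in the table above

def candidate_partitions(threshold):
    """Partitions that may contain grades strictly greater than `threshold`."""
    return sorted(pid for lo, hi, pid in PARTITIONS if hi > threshold)

def client_sum_grades(rows, threshold, decrypt):
    # rows: (grade_blob, partition_id) pairs the untrusted server returned for
    # the coarse predicate `partition_id IN candidate_partitions(threshold)`
    grades = (decrypt(blob) for blob, _pid in rows)
    return sum(g for g in grades if g > threshold)    # exact filter on plaintext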
Querying Encrypted Data: Blob Store as Trusted Client example
challenge
Querying Encrypted Data: Monomi as Enhanced Blob Store
WHERE updated_at = created_at + 1 -> without the key, you don’t know encrypt(1)
example operation
client app
>> SELECT SUM(grade)
FROM students
WHERE grade > 10;
DBMS Shell (trusted, client side)
SUM(DECRYPT(grade_paillier))
DBMS (untrusted, server side)
>> SELECT grade_paillier
FROM students
WHERE grade_paillier > 10;
Querying Encrypted Data: Trusted Client Summary
Querying Encrypted Data: Secure In-Cloud Processing
we talked about trusted client in the previous slides
Querying Encrypted Data: Secure In-Cloud Processing
how do queries under different architecture work?
benefit of having smaller TCB?
notice how isolation (hardware, memory) affects size of base
Querying Encrypted Data: Secure In-Cloud Processing
larger TCB | smaller TCB |
less secure | more secure |
more administration | less administration |
according to this paper:
Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing
Highlights
SQLite: trusted storage for sensitive data
Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing
id | name | birthday_timestamp |
0 | Andrew | 866476800 |
1 | Bart | 961171200 |
2 | Casey | 1087401600 |
id | SSN |
0 | 123456789 |
1 | 123456790 |
2 | 123456791 |
MySQL (untrusted)
SQLite (trusted)
students table
Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing
example query
User Query
>> SELECT *
>> FROM students
>> WHERE birthday_timestamp >866476800 AND
>> SSN > 123456790
Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing
What Happens Under the Hood
>> SELECT *
>> FROM mysql_students
>> RIGHT JOIN (
>> SELECT id
>> FROM students
>> WHERE SSN > 123456790
>> ) AS sqlite_result
>> ON mysql_students.id = sqlite_result.id
>> WHERE mysql_students.birthday_timestamp >866476800
Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing
performance bottleneck:
Querying Encrypted Data: Cipherbase as an Example of Secure In-Cloud Processing
Querying Encrypted Data: Cipherbase as an Example of Secure In-Cloud Processing
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cipherbase.pdf
Querying Encrypted Data: Semantic Security
“An adversary is allowed to choose between two plaintexts, m0 and m1, and he receives an encryption of either one of the plaintexts. An encryption scheme is semantically secure, if an adversary cannot guess with better probability than 1/2 whether the given ciphertext is an encryption of message m0 or m1. The notion is also referred to as indistinguishability of encryptions. The encryption reveals no information no matter what kind of semantics (meanings) are embedded in the encryption.”
https://link.springer.com/referenceworkentry/10.1007/978-1-4419-5906-5_23
Querying Encrypted Data: Leakage
what information does encryption schemes reveal?
AES in CBC mode (non-deterministic)
deterministic
order preserving
Querying Encrypted Data: Leakage
memory access pattern leakage
malicious actors watching on the server end can study which storage locations different operations access
e.g.
observer learns the following workflow
query A -> access location 0 ~ 9 -> wires money to some account
which implies
location 0 ~ 9 must store information required for some transaction
Querying Encrypted Data: Leakage
Querying Encrypted Data: Access Pattern Oblivious Algorithms
bubble sort
for i in range(len(arr) - 1, -1, -1):
    for j in range(0, i):
        if arr[j] < arr[j + 1]:
            continue
        arr[j], arr[j + 1] = arr[j + 1], arr[j]
image source:
Querying Encrypted Data: Access Pattern Oblivious Algorithms
aggregator oblivious encryption
https://eprint.iacr.org/2017/479.pdf
oblivious simulation
an additional layer that makes the memory access pattern look random -> prevents benefiting from spatial and temporal locality
Querying Encrypted Data
(from the Cipherbase paper)
In conclusion, what do we want?
solution that balances
Transaction Processing on Confidential Data using Cipherbase
background
Transaction Processing on Confidential Data using Cipherbase
Abstract (on Cipherbase):
Transaction Processing on Confidential Data using Cipherbase
Architecture Benefits (aside from what was already mentioned last week)
Transaction Processing on Confidential Data using Cipherbase
Threat Model (the categories below are assumed to be passive, i.e. they do not tamper with database contents or processing)
strong adversary:
privileged OS and database access, can view the contents on memory/disk and all internal/external communication. However, cannot observe state and computations in secure hardware.
weak adversary:
can obtain a one-time snapshot of data-at-rest
in real world:
adversary usually lies between the two
Transaction Processing on Confidential Data using Cipherbase
Data Confidentiality (assuming all columns are strongly, i.e. non-deterministically, encrypted)
Transaction Processing on Confidential Data using Cipherbase
relational algebra notations:
projection - subset of columns
restriction (selection) - subset of rows
security guarantee of range index is similar to that of order preserving encryption
Transaction Processing on Confidential Data using Cipherbase
Data Confidentiality comparing to prior work
CryptDB (PHE)
e.g. to evaluate
CryptDB has to store the entire column deterministically encrypted, which leaks equivalence relation & distribution when weak adversary obtains snapshot
On the other hand, with Cipherbase
a scan-based plan only reveals unknown_predicate(A) over R’s tuples (because values can be non-deterministically encrypted)
Transaction Processing on Confidential Data using Cipherbase
Architecture & Overview of query/transaction processing
database processing is about
Transaction Processing on Confidential Data using Cipherbase
Cipherbase client
Transaction Processing on Confidential Data using Cipherbase
e.g. to evaluate
needs a stack program that
inputs encrypted A, outputs plaintext True/False
invoked once for each tuple of R
this usage pattern inspires the design => registered once, invoked using handle (id) returned by register method
has a data cache but is stateless apart from that
Trusted Module
evaluate expressions over encrypted data during query processing
Transaction Processing on Confidential Data using Cipherbase
Key Management
keys communicated to trusted module using standard key-exchange techniques: secure hardware device manufacturer assigns a public key
Data Encryption
supports cell-level (row + column specification) encryption
=> minimizes TM traffic
=> incurs storage penalty (the ciphertext of a 4-byte integer is 12 bytes in AES + CTR mode)
Transaction Processing on Confidential Data using Cipherbase
Indexing
range index:
=> which means weak adversary learns ordering but not equality relationship
=> strong adversary learns equality relationship of queried ranges
Transaction Processing on Confidential Data using Cipherbase
inserting 7 into a range index
Transaction Processing on Confidential Data using Cipherbase
Indexing
equality index:
Transaction Processing on Confidential Data using Cipherbase
transaction processing
First, client issues query Q
during the prepare step:
Transaction Processing on Confidential Data using Cipherbase
e.g.
user issues:
>> UPDATE Accounts
>> SET Balance = Balance + @Amt
>> WHERE Id = @Id
Cipherbase Client rewrites into:
>> UPDATE Accounts
>> SET Balance = TMEval(21, Balance, @Amt)
>> WHERE Id = @Id
TMEval: built-in function on UM (SQL server) added by Cipherbase to invoke stack programs on TM
After rewritten query submitted to UM, it prepares a query plan like the following:
Transaction Processing on Confidential Data using Cipherbase
PHE avoids round-trips to TM:
in the previous example, under a scan-based plan, if the Id column is deterministically encrypted, the query does not have to involve the TM
Transaction Processing on Confidential Data using Cipherbase
Trusted Module
Transaction Processing on Confidential Data using Cipherbase
System engineering and optimization details
Transaction processing challenges:
Transaction Processing on Confidential Data using Cipherbase
Transaction Processing on Confidential Data using Cipherbase
Optimizations
Transaction Processing on Confidential Data using Cipherbase
Transaction Processing on Confidential Data using Cipherbase
user issues
>> UPDATE Inventory
>> SET item_cnt = item_cnt + @qty,
>>     total_cnt = total_cnt + @qty
>> WHERE storeId = @Id
Naive Cipherbase rewrite
>> UPDATE Inventory
>> SET item_cnt = TMEval(0, item_cnt, @qty),
>>     total_cnt = TMEval(1, total_cnt, @qty)
>> WHERE storeId = @Id
expression folding
>> UPDATE Inventory
>> SET @var = TMEval(2, item_cnt, total_cnt, @qty),
>>     item_cnt = UMExtract(@var, 0),
>>     total_cnt = UMExtract(@var, 1)
>> WHERE storeId = @Id
i.e. common subexpression elimination
Transaction Processing on Confidential Data using Cipherbase
Hitting Pause on Cipherbase
Hitting Pause on Cipherbase
https://www.microsoft.com/en-us/research/project/microsoft-seal/
Hitting Pause on Cipherbase
however, what is SEAL actually capable of?
silent confession by Microsoft:
Hitting Pause on Cipherbase: Microsoft SEAL
=> this means a large portion of SQL queries is unachievable
this restriction is entirely contradictory to the idea of FHE
Hitting Pause on Cipherbase: Microsoft SEAL
this is rather a problem common to Cipherbase and all the previous technologies discussed
Switching to Data Observatory
Clarifying Survey Direction
questions to address:
Thinking Outside the Box: AI4VIS
https://arxiv.org/pdf/2102.01330.pdf
background
i.e. model => visualization
i.e. visualization => model
Thinking Outside the Box: AI4VIS
Why should MBZUAI care?
model => visualization => model
we are not only advancing AI in two directions but also giving ourselves more content to show on our screens
Thinking Outside the Box: AI4VIS
credence: author affiliation
Thinking Outside the Box: AI4VIS
definitions
Thinking Outside the Box: AI4VIS
why apply AI to visualization data (mentioned previously but now in more detail)
Thinking Outside the Box: AI4VIS
Thinking Outside the Box: AI4VIS
how is the research community applying AI to visualizations?
Thinking Outside the Box: AI4VIS
Thinking Outside the Box: AI4VIS
Thinking Outside the Box: AI4VIS
Thinking Outside the Box: AI4VIS
future research opportunities
AI4VIS: implementation example from Data2VIS => LIDA
Data2Vis => LIDA
links to papers
Data2Vis: https://arxiv.org/pdf/1804.03126.pdf
LIDA: https://arxiv.org/pdf/2303.02927.pdf
Data2Vis => LIDA
background
Data2Vis (2019, submitted on arXiv in 2018) author:
Victor Dibia when at IBM Research, Çağatay Demiralp at MIT CSAIL
LIDA (
submitted on arXiv in March, 2023,
open-sourced on GitHub only in August this year,
claimed by Victor as an update of Data2Vis
) author:
Victor Dibia at Microsoft Research
Data2Vis => LIDA
summary - Data2Vis
Data2Vis => LIDA
Visualizations created by Data2Vis
Data2Vis => LIDA
summary - LIDA
data to semantically rich natural language description;
LLMs suffer from hallucination, which sometimes leads them to produce output not grounded in data. An informing summary augments the model by providing grounding context for generating visualizations
question (hypothesis): what are we trying to understand
visualization: e.g. histogram of miles per gallon
rationale: why should we ask the question above?
Data2Vis => LIDA
generate, evaluate, repair, and execute visualization code
code scaffold constructor: imports dependent libraries and creates function stubs
code generator & executor: fills in the TODOs, gets the visualization
Data2Vis => LIDA
Formalizing Visualization Design Knowledge as Constraints: Draco
Visualization Design Knowledge as Constraints: Draco
Visualization Design Knowledge as Constraints: Draco
Visualization Design Knowledge as Constraints: Draco
taste of Answer Set Programming
A :- L_1, …, L_n => a rule
A => an atom, a proposition that may be true/false
L_i’s => literals; A is true only if all of them are true
bodiless rules encode facts, whereas headless rules impose integrity constraints (their body literals cannot all be satisfied)
e.g.
light_on :- power_on, not broken
Visualization Design Knowledge as Constraints: Draco
integrity constraint defines how attributes should interact with each other:
:- mark(bar), channel(E, y), continuous(E), not zero(E)
rules out vertical bar charts that do not use 0 as the baseline
set of aggregate rules: specifies attributes of a visualization (e.g. mark, encoding)
Visualization Design Knowledge as Constraints: Draco
finding the optimal completion of a partially specified visualization
Visualization Design Knowledge as Constraints: Draco
e.g.
:~ continuous(E), not zero(E). [5]
this says that the model prefers to include 0 for continuous fields and that violating this soft constraint incurs a cost of 5
=> Cost(v) = sum_i w_p_i * n_p_i(v)
where n_p_i(v) denotes the # of violations of soft constraint p_i by visualization v
Visualization Design Knowledge as Constraints: Draco
using the cost function defined on the last page,
training pairs are (v1, v2, y = sign(Cost(v1) - Cost(v2)))
=> y = -1 if v1 is preferred over v2
=> let x(v) be the vector [n_p_0(v), …, n_p_k(v)] and w the corresponding weight vector
=> Cost(v1) - Cost(v2) = w * x_1 - w * x_2 = w * (x_1 - x_2)
the weights are learned with a linear model under ridge (L2) regularization that minimizes the hinge loss (a minimal sketch of this pairwise learning follows)
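A minimal sketch of the pairwise learning described above (my own illustration, not Draco's code): each training pair contributes a hinge loss on w * (x_1 - x_2), plus L2 regularization on w.

import numpy as np

def learn_soft_constraint_weights(X_pref, X_rej, lr=0.01, lam=0.1, epochs=500):
    """X_pref[i], X_rej[i]: violation-count vectors of the preferred / rejected
    visualization in pair i. Learn w so that Cost(preferred) < Cost(rejected)."""
    w = np.zeros(X_pref.shape[1])
    for _ in range(epochs):
        diff = X_pref - X_rej                    # want w @ diff <= -1 (margin)
        active = (1 + diff @ w) > 0              # pairs violating the margin
        grad = diff[active].sum(axis=0) + lam * w
        w -= lr * grad
    return w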
Visualization Design Knowledge as Constraints: Draco
Draco 1 (2018, from UDub Interactive Data Lab)
https://idl.cs.washington.edu/files/2019-Draco-InfoVis.pdf
Draco 2 (2023, from UDub, University of Vienna, University of Maryland, CMU)
Experiments with Mac Studio
Last Week: exploring MLX on Mac Studio
Last Week: exploring MLX on Mac Studio
Last Week: llama 2 70B with MLX
Last Week: llama 2 70B with MLX
llama 2 -> MLX weights conversion example source code by Apple
Llama 2 with Hugging Face: quickstart module
allows running inference on
https://huggingface.co/meta-llama
either approach requires submitting a request to Meta and/or Hugging Face
Llama 2 with Hugging Face: quickstart module
https://github.com/ggerganov/llama.cpp
, which is mainly designed for quantized models (requires less computational and storage capacity)
Llama 2 with Hugging Face: 7B, sample prompt 1
Llama 2 with Hugging Face: 7B, sample prompt 2
Llama 2 with Hugging Face: 13B, sample prompt 1
Llama 2 with Hugging Face: 13B, sample prompt 2
Llama 2 with Hugging Face: 70B, sample prompt 1
Llama 2 with Hugging Face: 70B, sample prompt 2
Llama 2 with llama.cpp
why are we back?
https://huggingface.co/docs/transformers/benchmarks
Llama 2 with llama.cpp: 7B
“Input length is not significant for performance but important for hardware requirements”
source:
https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
Llama 2 with llama.cpp: interpreting console logs
explanation by project collaborator
Llama 2 with llama.cpp: 13B
Llama 2 with llama.cpp: 70B
Llama 2 with llama.cpp: comparison with other experiments
=> quantized to 4 bits
(originally 16)
source:
https://towardsdatascience.com/apple-m3-machine-learning-speed-test-f1346e23a1b2
Llama 2 with llama.cpp: comparison with other experiments
source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
Llama 2 with llama.cpp: comparison with other experiments
source:
Llama 2 with llama.cpp: comparison with other experiments
source:
https://medium.com/@andreask_75652/benchmarking-apples-mlx-vs-llama-cpp-bbbebdc18416
Llama 2 with llama.cpp: comparison with other experiments
“Overall latency scales sub-linearly with model size” - from the databricks blog
mentioned in slide 189
70 / 13 = 5.38, while 209.19 (70B eval ms/token) / 44.62 (13B eval ms/token) = 4.69
Question from last week: suspicious prompt eval speed
from llama.cpp maintainer (llama-2-7b fp16):
my local replication:
Question from last week: suspicious prompt eval speed
screenshot from slide 196:
Optimizing LLM Inference: batching and vLLM
why batch?
Optimizing LLM Inference: batching and vLLM
naive/ static batching
Optimizing LLM Inference: batching and vLLM
continuous batching
finished sequences leave the batch immediately, so sequences waiting for prefill can join while other sequences are still generating (a toy scheduler sketch follows below)
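A toy scheduler loop (assumed names, not vLLM's actual scheduler) showing the difference from static batching: finished requests free their slot right away and waiting requests are admitted at the next step.

def continuous_batching(waiting, decode_step, max_batch=8):
    """waiting: list of request objects with a `finished` attribute that
    decode_step() updates after generating one token for every running request."""
    running, done = [], []
    while running or waiting:
        while waiting and len(running) < max_batch:
            running.append(waiting.pop(0))       # admit new requests (prefill)
        decode_step(running)                     # one generation step for the batch
        done.extend(r for r in running if r.finished)
        running = [r for r in running if not r.finished]
    return done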
Optimizing LLM Inference: batching and vLLM
vLLM:
reference: https://blog.vllm.ai/2023/06/20/vllm.html
Optimizing LLM Inference: batching and vLLM
core problem addressed:
inefficient memory management of the KV cache
=> the KV pairs of all previous tokens are required to generate the next one
=> its size is large: up to 1.7 GB for a single sequence in LLaMA-13B
=> and dynamic: it depends on highly variable sequence lengths, which leads to waste from fragmentation and over-reservation
devised solution:
PagedAttention
=> allows the KV cache to be stored non-contiguously in memory by partitioning each sequence’s KV pairs into blocks, each dedicated to a fixed number of tokens
Optimizing LLM Inference: batching and vLLM
=> memory wastage only occurs at the last block of a sequence
=> enables efficient memory sharing: mapping logical blocks to the same physical block comes in handy when we need to generate multiple outputs from the same sequence (a toy sketch of the block-table idea follows below)
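A toy sketch of the block-table idea described above (my own simplification, not vLLM's implementation; BLOCK_SIZE is an assumed value): each sequence's logical blocks map to arbitrary physical blocks, so only the last, partially filled block can waste space.

BLOCK_SIZE = 16                                   # tokens per KV block (assumed)

class BlockTableAllocator:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
        self.tables = {}                          # seq_id -> list of physical block ids

    def slot_for(self, seq_id, token_pos):
        """Return (physical_block, offset) where this token's KV pair is stored."""
        table = self.tables.setdefault(seq_id, [])
        if token_pos // BLOCK_SIZE == len(table): # current block is full -> allocate
            table.append(self.free.pop())
        return table[token_pos // BLOCK_SIZE], token_pos % BLOCK_SIZE

    def release(self, seq_id):
        self.free.extend(self.tables.pop(seq_id, []))   # blocks become reusable immediately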
Optimizing LLM Inference: batching and vLLM
=> baseline: HuggingFace’s Pipelines library, which implements static batching
=> generation length sampled from exponential dist with mean = 128 tokens
=> FasterTransformer (an optimized model implementation) actually does pretty well
=> continuous batching clearly yields strong improvement over static, and so does vLLM over the other three on the graph
Optimizing LLM Inference: batching and vLLM
QPS: queries per second, the expected rate of the Poisson distribution that request arrival times are sampled from
Optimizing LLM Inference: batching and vLLM
takeaway from the two graphs:
MLPerf: trouble we are running in
how do we run it?
MLPerf: trouble we are running in
last time, we looked at MLPerf Training results on “GPT-3”
MLPerf: comforting news
CTuning (MLCommons founding member, https://ctuning.org/)
MLPerf: comforting news
running BERT-large on Mac Studio M2 Ultra
MLPerf: comforting news
comparison with results published by MLCommons (MLPerf Inference: Edge)
https://mlcommons.org/benchmarks/inference-edge/
system | framework | offline (sample/s) | single stream (latency in ms) |
Macbook Pro M1 | onnx runtime | 1.69 | 597.08 |
Mac Studio M2 Ultra | onnx runtime | 6.48 | 580.58 |
MLPerf … on Mac?
MLPerf … on Mac?
MLPerf … on Mac?
MLPerf … on Mac?
MLPerf … on Mac?
link to drawing above
https://docs.google.com/drawings/d/1OeIpVmCSgLQk4JSc1364l8wdFaXFKoEAQIWrftISNCQ/edit
MLPerf … on Mac?
main obstacle
https://github.com/ml-explore/mlx-examples/tree/main/llms/llama
7b & 13b worked, 70b returns weird output
https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm
MLPerf … on Mac?
discussion with Awni Hannun
MLPerf … on Mac?
model: llama-2-7b-chat
| prompt evaluation (t/s) | token generation (t/s) |
llama.cpp | 83.42 | 39.72 |
mlx | 25.455 | 29.416 |
llama.cpp/ mlx | 3.27 | 1.35 |
MLPerf … on Mac?
model: llama-2-13b-chat
| prompt evaluation (t/s) | token generation (t/s) |
llama.cpp | 51.45 | 22.41 |
mlx | 14.164 | 18.662 |
llama.cpp/ mlx | 3.63 | 1.2 |
MLPerf … on Mac?
point being …
model: llama-2-70b-chat
| prompt evaluation (t/s) | token generation (t/s) |
llama.cpp | 11.87 | 4.78 |
mlx | 1.063 | 0.201 |
llama.cpp/ mlx | 11.16 | 23.78 |
will continue investigating
MLPerf … on Mac?
mlx v. llama.cpp
performance anomaly?
mlx v. llama.cpp
performance anomaly?
MLX GPU profile
Llama.cpp GPU profile
Llama.cpp parallelization
MLX representative workload: on reboot perf anomaly
MLX representative workload: on perf with diff loading strategy
MLX v. Metal: on perf with diff loading strategy
experiments below were conducted in the following environment:
mlx 0.6.0
https://github.com/MBZUAI-Research-Office/research_muchi
commit: fd2376d405b8fe786fe2851b828e435f40421edb
Almost root to the issue?
How weights loading strategy affects application performance
batch
sep
upon closer examination, matmul performance seems identical across loading strategies
Almost root to the issue?
batch
sep
zooming out …
we notice that the spaces between computations are caused by driver (wire memory) activity, which makes sure there is physical memory backing the GPU resources consumed in the Metal code
(
where could 256 MiB come from? Maybe …
size of weights per layer
8192 * 8192 * 4 / (1024 **2)
= 256 MiB
)
Almost root to the issue?
well … how do those spaces/durations caused by driver activity change upon reboot?
for the first 2 seconds (after the computation starts), there was still plenty of driver (wire memory) activity
Almost root to the issue?
well … how do those spaces/durations caused by driver activity change upon reboot?
after that, however, driver (wire memory) activity ceased => lowering the average and thus making the first run post reboot seem much faster than subsequent runs
the spaces between computations diminish along with the decrease in driver (wire memory) activity!
Almost root to the issue?
2nd run post reboot
3rd run post reboot
What does this mean?
The performance degradation we observed with the sep loading strategy can also be accounted for by this driver (wire memory) activity
Sidenote: how does matmul in mlx work under the hood?
DBRX: distributed inference with M2 Ultras
What is DBRX?
DBRX: inference efficiency
setting:
What is MoE?
TL;DR
MoE models
reference: https://huggingface.co/blog/moe
What is MoE?
dense: refers to the feed-forward layer, which is really just linear combination followed by activation (in Llama-2: multi-layer perceptron)
traditional transformer encoder block
What is MoE?
traditional transformer
transformer with MoE
vs
the traditional/dense feed-forward network (FFN) layer is replaced by a sparse MoE layer that operates independently on the tokens in the sequence
- the router/gate network decides which expert each token goes to
- output of the MoE layer = router gate value * output of the selected expert
(a minimal sketch of such a layer follows below)
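A minimal, framework-free sketch of such a sparse MoE layer for a single token (my own illustration; real implementations batch tokens and the experts are learned MLPs, not arbitrary callables):

import numpy as np

def moe_layer(x, router_w, experts, top_k=2):
    """x: (d_model,) token activation; router_w: (d_model, n_experts);
    experts: list of callables mapping (d_model,) -> (d_model,)."""
    logits = x @ router_w
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax gate values
    chosen = np.argsort(probs)[-top_k:]           # indices of the top-k experts
    gates = probs[chosen] / probs[chosen].sum()   # renormalize over the chosen experts
    # output = sum over chosen experts of (gate value * expert output)
    return sum(g * experts[i](x) for g, i in zip(gates, chosen))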
What is MoE?
what does this remind you of?
in addition,
So, how do we run distributed inference with DBRX?
example:
transformer encoder block from Google’s GShard
- attention weights are replicated across devices
- each device is responsible for one FFN/expert (model parallelism: the model is partitioned across nodes)
reference: https://arxiv.org/abs/2006.16668
Requirements to run DBRX
132B params, each param being a bfloat16
=> 132 * 10^9 params * 2 bytes = 264 * 10^9 bytes = 264 GB
=> however, DBRX’s GitHub repo README recommends having at least 320 GB of memory

right now, model implementations are available in
- TensorRT-LLM (requires CUDA)
- vLLM (requires CUDA)
- MLX (requires Apple Silicon, focus, commitment, and sheer will)
interestingly,
HuggingFace’s transformers library integrates an experimental project called PiPPy that aims to enable easy pipeline parallelism (where each node gets a portion of the model’s layers) for PyTorch => solves the memory problem, but with seemingly little performance benefit
https://github.com/pytorch/PiPPy/
Potential Strategy with M2 Ultras
M2 Ultra Cluster Communication latency
recap:
each of our Mac Studio M2 Ultras has 192GB of memory, but DBRX inference requires 264GB (Databricks even recommends having at least 320GB of memory)
our goal is to run distributed inference on a cluster (2 - 8 nodes), which should provide us with more than enough computation & memory resources. However, we need to make sure that communication between nodes does not become a bottleneck
what information is being exchanged?
DBRX has 40 layers (L_0 … L_39); each layer contains attention (ATT), a router, and 16 experts (FFN_0 … FFN_15). For every token, the router chooses 4 of the 16 experts (e.g. experts 0, 4, 7, 15), and the activations dispatched to / aggregated from the experts are float16 vectors of dimension 6144.
M2 Ultra Cluster Communication latency
4 nodes (Thunderbolt with star topology, Ethernet with switch)
Protocol: gRPC (avg latency includes serialization and deserialization time)
DBRX: distributed inference with M2 Ultras
recap
DBRX: distributed inference with M2 Ultras
our progress so far
where did we get stuck? (like for quite a while)
DBRX: distributed architecture design
DBRX: distributed inference with M2 Ultras
0.5 t/s performance breakdown (per-layer performance view from Instruments: expert calculation, communication, and wait/stall)
DBRX: distributed inference with M2 Ultras
optimization strategies
updates from 4/30 PAS lab meeting
DBRX: distributed inference with M2 Ultras
latest employed optimization strategies
DBRX: distributed inference with M2 Ultras
model performance
workload: batch size = 1, input = 20 tokens, output = 64 tokens
| prompt evaluation | token generation |
week_0: baseline | 1.3 t/s | 0.6 t/s |
week_1: with expert parallelization, weights loaded in one chunk, dense | 3 t/s | 2.4 t/s |
05/09/2024 weekly report
outline
DBRX distributed inference architecture design
DBRX distributed inference performance breakdown
| per layer | | | per prompt |
| attention + router | MoE | wait / stall | token generation t/s |
v1, baseline | ~7 ms | ~30 ms | ~13 ms | ~0.5 t/s |
v1.5, load each expert’s weights as one chunk, dense | ~4 ms | ~3.5 ms | ~3 ms | ~2.4 t/s |
v2, everything from v1.5 + earlier driver eval + custom serialization | ~1.5 ms | ~3.5 ms | ~3 ms | ~3.1 t/s |
v3, everything from v2 + architecture redesign (½ communication, ½ expert computation) | estimated: ~1.5 ms | estimated: ~2 ms | estimated: ~1.5 ms | estimated: ~5 t/s |
filename: dbrx_poc_one_custom_serialization
filename: dbrx_poc_batch2
filename: dbrx_poc_perf
Obstacles encountered during v3 development:
driver activity surfacing from idleness
sleep in ms | incurs heavy driver activity |
1 | False |
10 | False |
100 | False |
500 | True |
1000 | True |
driver activity here refers to:
time driver spent on wiring down sufficient physical memory for GPU computation
where does idleness come from (more on this next page):
wait during all-reduce (the shard that finishes first has to wait for the one that finishes last)
observation from the v3 MoE layer design (4-choose-2 experts activated per shard):
waiting/sleeping between layers causes lengthy driver activity to surface
when sleep = 1000 ms, driver processing takes ~400 ms for a ~1 ms GPU computation
How idleness becomes a problem in v3
introducing v3.2,
our first step towards:
- streaming MoE processing (beneficial during prompt evaluation and when input batch size > 1)
- dynamic load balancing (beneficial even when batch size = 1, increases overlap between communication and computation during streaming)
new trouble with v3.2: needs warming up!
what do we mean by warm up?
sync computation time to reduce driver activity
distributed DBRX v3 performance evaluation
| num GPUs | price in USD per GPU | communication | fp16 TFLOPS per GPU | bfloat16 TFLOPS per GPU | single user throughput (t/s) |
M2 Ultra | 4 | ~8,500 ? | 10 GbE switch | ~54 | ??? | ~6 |
L40S | 8 | ~9,287 | PCIe Gen4 x16: 64GB/s bidirectional | ~362 | ~362 | ~60 ? |
H100 | 4? | ~31,710 | nvlink: 900GB/s PCIe Gen5: 128GB/s | ~1979 | ~1979 | ~115 ? |
Evaluation Goal:
A Performance Model that Takes Compute Power into Account
DBRX on A100: how fast is inter-GPU communication in terms of latency?
hardware setup: 1 node, 4 GPUs, 4 x 80 = 320GB memory, 600 GB/s GPU interconnect through NVLink
framework: TensorRT-LLM, tensor parallelism
communication pattern:
per token: 1 x AllGather after all 40 layers, to piece together the generated token
per layer: 2 x AllReduce (once before attention, the other before MoE)
DBRX on A100: how fast is inter-GPU communication in terms of latency?
communication latency breakdown:
compared to us:
operation | n times per token | avg latency in microsecond(s) | total latency per token in microsecond(s) |
AllReduce | 2 * 40 = 80 | 18 | 1,440 |
AllGather | 1 | 20 | 20 |
total: | 1,460 |
hardware setup | total communication latency per token in microsecond(s) |
A100 (1 node, 4 chips) | 1,460 |
M2 Ultra (2 node, 2 chips) | 60,000 |
DBRX performance model
Short-term Future Roadmap
resources available
several 4090s, configurable to single or multi node (are we interested in exploring RDMA NICs?)
topics to explore
Related Work: on insufficient memory problem
DeepSpeed Inference (Microsoft, June 2022)
Related Work: ZeRO-Inference
problem:
prefetching is difficult to achieve for MoE models because you don’t know which experts you need until the current layer’s router makes its decision
=> each expert’s weights are spread across all model layers (for Mixtral8x7B, expert_0’s weights at layer_0 are roughly 7B / 32 layers = 0.22B params, which is 0.22B * 2 bytes = 440 MB)
=> you could prefetch every expert’s weights at the next layer and it would still fit in GPU’s memory (Mixtral8x7b: 8 experts x 440 MB = 3.5 GB, RTX 4090 has 24 GB memory)
=> however, wastes memory space and hard to scale (what if we are using smaller GPUs? Also, the current trend is to use more experts)
Related Work: on efficient MoE inference
Towards MoE Deployment: Mitigating Inefficiencies in MoE Inference (Meta AI, June 2022)
Related Work: Towards MoE Deployment … by Meta AI
questions/ problems
Related Work: on efficient MoE inference
Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference (Ohio State University, Jan 2024, https://arxiv.org/pdf/2401.08383)
1. context coherent expert parallelism
expert parallelism under a centralized design warrants 2 all-to-all operations per MoE layer: once when router dispatches tokens to selected experts and another to aggregate expert outputs for weighted sum
our decentralized design replicates attention & router weights across nodes to allow independent but identical non-MoE calculation and thus eliminates the need for router dispatch
their decentralized design also replicates weights but removes repeated calculation to allow concurrent handling of multiple requests. However, this requires additional KV-cache syncing after each forward pass/ generation of one token
Related Work: Inter-Layer Expert Affinity … by OSU
how does context coherence work?
centralized
decentralized through context coherence
Related Work: Inter-Layer Expert Affinity … by OSU
problems with context coherence
so, doesn’t really reduce the number of all-to-all operations to 1!
Running a MoE model with 4 experts, parallelized on 2 GPUs. Each layer chooses 2 experts.
Related Work: Inter-Layer Expert Affinity … by OSU
2. expert affinity
problems
Related Work: on efficient MoE inference
FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models (Tsinghua University, April 2022, https://dl.acm.org/doi/pdf/10.1145/3503221.3508418)
Related Work: FasterMoE by Tsinghua
problems with dynamic expert shadowing
Related Work: FasterMoE by Tsinghua
communication and computation overlap
S: send
C: compute
R: receive
Related Work: on efficient MoE inference
DeepSpeed-MoE (Microsoft, July 2022)
Related Work: DeepSpeed MoE
Step 1
Step 2
Step 3
Categorizing Existing Optimizations
an important observation is that expert parallelism creates a lot of inconveniences.
overcoming GPU memory constraint: expert buffering; expert + tensor parallelism
load balance when using expert parallelism: evenly distributing hot experts; dynamic expert shadowing
minimizing communication in parallel systems: context coherent expert parallelism; expert affinity; parallelism coordinated communication; computation communication overlap
Proposing a new research direction
tensor parallelism is suitable for our target environment (lower-end hardware). Before each layer’s computation starts, tensor-parallelize every expert’s weights across all available GPUs (e.g. every GPU possesses a slice of every expert).
why?
Proposing a new research direction: Feasibility Analysis
DBRX has 16 experts and 40 layers.
running DBRX with tensor parallelism on 4 GPUs. In each layer, each GPU needs in its memory a slice of all 16 experts, which totals 1.59 GB. Here, we assume a target token generation throughput of 5 t/s, which means each token takes 200 ms and each of the 40 layers’ compute + communication takes 5 ms.
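A quick back-of-the-envelope check of the 1.59 GB figure above (my own arithmetic; d_model = 6144 as in the architecture sketch earlier, and an assumed FFN hidden size of 10752 with 3 weight matrices per expert, bf16):

d_model, ffn_hidden, n_experts, n_gpus, bytes_per_param = 6144, 10752, 16, 4, 2
per_expert_per_layer = 3 * d_model * ffn_hidden * bytes_per_param   # gate/up/down projections (assumed)
per_gpu_slice = per_expert_per_layer / n_gpus * n_experts           # slice of all 16 experts on one GPU
print(per_gpu_slice / 1000**3)                                      # ~1.59 GB per layer per GPU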
Weekly Updates
access guide:
Prospective Projects on A100 server
1. DBRX 132B & Mixtral8x22B
inference performance tuning & profiling in the context of
2. Google’s Switch-MoE (2048 experts, 1.6T params)
inference performance tuning & profiling in the context of
Explore model parallelization techniques that accommodate usage of heterogeneous accelerators
Prospective Projects on A100 server
3. fine-tune pre-trained MoE models’ routing network
to overcome the difficulty imposed by dynamic expert activation on weights pre-fetching
4. Llama-3-70B
inference / fine-tuning performance profiling & optimization
5. Llama-3-405B
inference / fine-tuning performance profiling & optimization
Weekly Updates
Weekly Updates (08/29/2024)
preliminary performance analysis on 4xRTX4090-24GB
token generation throughput comparison between 8xA100-40GB and 4xA100-80GB
throughput with different batch sizes on 1xRTX4090-24GB
POC v0 done:
1. runs on 1xRTX4090-24GB, expert weights and compute on CPU, self-attention & router weights and compute on GPU (https://github.com/muchi674/ntu_paslab_llm)
2. the following versions will focus on better utilizing the GPU
3. performance analysis: TBA
Recap on Phi3.5-MoE
Author: Microsoft
Release Date: 08/22/2024
Architecture: 16x3.8B parameters in BF16 with 6.6B active parameters when using 2 experts. The model is a mixture-of-experts decoder-only Transformer using a tokenizer with a vocabulary size of 32,064.
Context length: 128K tokens
GPUs: 512 H100-80G
Training time: 23 days
Training data: 4.9T tokens
Phi3.5 MoE Inference on 4 rtx4090
- CPU to GPU memory copy throughput: ~10 GB/s for each GPU
- token generation throughput (by naive model parallelism, i.e. putting groups of model layers on different GPUs):
batch size | token generation throughput (t/s) |
1 | 12 |
2 | 19 |
4 | 30 |
8 | 48 |
Note: pipeline parallelism has not been applied
Phi3.5 MoE Inference on 4 rtx4090
illustration of naive model parallelism by using Nsight systems:
=> problem of under-utilization is apparent
DBRX Inference Performance Comparison
| 4xA100-80GB (MBZUAI) | 8xA100-40GB (borrowed) |
token generation throughput in t/s, when batch size = 1 | ~57 | ~85 |
What could be causing this difference?
- with batch size this small, performance is memory-bound (depends on how fast we can load model parameters from GPU memory to cache)
- both systems above are equipped with NVLink, meaning that GPU interconnect is very cheap (all-to-all operations are ~20 microseconds)
- the system with 8xA100 has a higher aggregate memory bandwidth, and the performance gain from breaking down a memory-bound computation to more workers outweighs the additional communication latency
results below are obtained using tensor parallelism
Llama3.1-8B Inference Performance Analysis
results above are obtained from 1xRTX4090-24GB
questions to consider:
token generation throughput growth seems to plateau faster than that of prompt evaluation. Why?
Batch Size | Evaluation TP (t/s) | Generation TP (t/s) |
1 | 53.3759 | 46.75 |
2 | 103.3154 | 88.5 |
3 | 139.8756 | 130 |
4 | 193.2283 | 168 |
8 | 380.4522 | 288 |
12 | 595.3785 | 402 |
16 | 755.9556 | 462 |
32 | 1392.01663 | 624 |
64 | 2499.4972 | 760 |
128 | OOM | OOM |
Weekly Updates (09/05/2024)
Mixtral8x7B Inference on 1xRTX4090-24GB
- performance analysis on 4 versions of implementation
for all versions below, self-attention & router are on GPU
v0 (CPU only) performance analysis
batch size | prefill throughput (t/s) | decode throughput (t/s) |
1 | 65.02 | 3.83 |
2 | 92.48 | 4.71 |
4 | 130.3 | 6.15 |
8 | 172.25 | 9.58 |
16 | 219.65 | 15 |
32 | 220.84 | 26.94 |
64 | 216.37 | 50.25 |
128 | 225.25 | 78.28 |
workload: ~110 inputs tokens, generate 128 tokens
v1 (GPU only) performance analysis
batch size | prefill throughput (t/s) | decode throughput (t/s) |
1 | 38.19 | 1.27 |
2 | 82.08 | 1.48 |
4 | 159.35 | 1.9 |
8 | 298.14 | 2.9 |
16 | 537.48 | 5.26 |
32 | 924.12 | 10.36 |
64 | 1503.02 | 20.6 |
128 | 2262.64 | 40.89 |
workload: ~110 inputs tokens, generate 128 tokens
v2 (static collab) performance analysis
batch size | prefill throughput (t/s) | decode throughput (t/s) |
1 | 74.61 | 4.28 |
2 | 106.83 | 5.3 |
4 | 142.83 | 7.03 |
8 | 205.12 | 10.62 |
16 | 234.57 | 17.42 |
32 | 256.97 | 34.62 |
64 | 245.89 | 61.04 |
128 | 248.88 | 94.77 |
workload: ~110 inputs tokens, generate 128 tokens
v3 (dynamic collab) performance analysis
batch size | prefill throughput (t/s) | decode throughput (t/s) |
1 | 65.26 | 4.02 |
2 | 113.71 | 5.06 |
4 | 217.58 | 6.37 |
8 | 306.86 | 7.74 |
16 | 538.42 | 12.06 |
32 | 925.01 | 23.45 |
64 | 1503.23 | 40.15 |
128 | 2259.6 | 69.74 |
workload: ~110 inputs tokens, generate 128 tokens
TODOs
Weekly Updates (09/05/2024)
Llama3.1-70B-Instruct Inference
achieves 1.75 t/s token generation throughput on a single RTX-4090 with llama-2-70B,
leverages techniques from speculative decoding, uses llama-2-7B as the draft model
Briefing on llama3.1-70B-Instruct
dtype: bfloat16 (2 bytes)
n_layers: 80
vocab_size: 128,256
d_model (embedding size): 8,192
n_attention_heads: 64
n_kv_heads: 8
ffn_hidden_size: 28,672
per layer weights size (in GB):
(the remaining weights are mainly for the embedding layer & model output projection)
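A worked computation of the per-layer weights size from the numbers above (my own arithmetic; norm weights ignored):

d_model, n_kv_heads, head_dim, ffn_hidden, bytes_per_param = 8192, 8, 128, 28672, 2
attn = 2 * d_model * d_model + 2 * d_model * n_kv_heads * head_dim   # wq, wo + wk, wv (GQA)
ffn = 3 * d_model * ffn_hidden                                       # gate, up, down projections
print((attn + ffn) * bytes_per_param / 1000**3)   # ~1.71 GB per layer, ~137 GB for all 80 layers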
From past experiments:
model: llama-2-70b-chat
| prompt evaluation (t/s) | token generation (t/s) |
llama.cpp | 11.87 | 4.78 |
mlx | 1.063 | 0.201 |
llama.cpp/ mlx | 11.16 | 23.78 |
Analysis: Decode Stage
- workload: each input prompt contains ~100 tokens
- when batch size is large enough, longer output lengths (n_gen_tokens) start to create noticeable pressure during attention calculation
Analysis: Prefill Stage
- as observed with other models, latency increases sub-linearly with respect to batch size
Performance Comparison
Mu-Chi’s Updates (10/02/2024)
Llama-3 inference on a single RTX-4090 with EAGLE speculative decoding
(on a sidenote: llama3.2 released models with 1B, 3B, 11B, and 90B parameters)
outline:
Recap: the gist of speculative decoding
speculative decoding workflow:
1. draft / smaller model generates multiple draft tokens based on the input sequence
2. target / bigger model processes input sequence + all draft tokens in parallel in a single forward pass
What is EAGLE about?
premise: a good draft model is hard to find (it should use an architecture, training data, and tokenizer similar to those of the target model)
1. EAGLE trains its own light-weight draft model (0.24B for Llama-2-7B, 1B for Llama-2-70B, 0.28B for Mixtral8x7B. The 1B draft model was trained in 1-2 days on 4x A100 (40G) GPUs)
2. the draft model performs autoregressive inference on the feature level (as opposed to the token level)
3. grows a dynamic tree of draft tokens to increase the number of accepted tokens
What do you mean by feature level?
eagle argues that “autoregression at the feature level is simpler than at the token level [since] feature exhibits more regularity than tokens, which are transformations of natural language”
EAGLE draft model architecture
What do you mean by tree of draft tokens?
tree size: top_k + depth * top_k ^ 2
# forward passes to construct the tree: 1 + depth
llama-3-8B inference on a single RTX 4090 with EAGLE
performance comparison (end to end)
https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference
generation cycle latency breakdown
vanilla (t/s) | with Eagle (t/s) | speedup |
54.34 | 116.36 | 2.14x |
drafting (ms) | target model verification forward pass (ms) | misc. (ms) | |
12.57 | 22.99 | ~2 |
tree: top-k = 5, depth = 10,
size = 5 + 10 * 5 ^ 2 = 255
# draft model forward passes: 10
116.36 / (1000 / (12.57 + 22.99 + 2))
= 116.36 / 26.62
= 4.37 tokens per cycle
llama-3-70B inference on a single RTX 4090 with EAGLE & weights offloading
- the original EAGLE framework only supports scenarios where both the draft and the target model fit in GPU memory (allows multi-GPU)
- I modified EAGLE to store the draft model on GPU and memcpy the target model layer by layer from CPU to GPU memory during the forward pass (no cache; all computation happens on GPU).
- I mimicked Sequoia’s “optimized offloading engine” to allow overlap between memcpy and GPU computation
Offloaded Eagle
overlapping:
1. drafting and target model first layer’s memcpy
2. current target model layer’s computation and next layer’s memcpy
(uses 2 cuda streams, pinned memory, and double buffering on GPU memory)
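A simplified sketch of the double-buffering pattern described above (not the actual modified EAGLE code; assumes `cpu_layers` holds per-layer weight dicts in pinned CPU memory and `run_layer` applies one transformer layer):

import torch

def offloaded_forward(cpu_layers, x, run_layer, device="cuda"):
    compute = torch.cuda.current_stream()
    copy = torch.cuda.Stream()
    buffers = [None, None]                       # double buffer in GPU memory

    def prefetch(i):
        copy.wait_stream(compute)                # don't overwrite a buffer still in use
        with torch.cuda.stream(copy):
            buffers[i % 2] = {name: w.to(device, non_blocking=True)
                              for name, w in cpu_layers[i].items()}

    prefetch(0)
    for i in range(len(cpu_layers)):
        compute.wait_stream(copy)                # layer i's weights have arrived
        if i + 1 < len(cpu_layers):
            prefetch(i + 1)                      # overlap next copy with this compute
        x = run_layer(buffers[i % 2], x)
    return x

The key design point is that the copy stream only ever writes the buffer the compute stream is not currently reading.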
llama-3-70B inference on a single RTX 4090 with EAGLE & weights offloading
performance comparison (end to end)
generation cycle latency breakdown
1x L40 48GB | 1x RTX 4090 24GB | |
DeepSpeed-Zero-Inference: pure offloading (t/s) | Sequoia: 7B draft model (t/s) | Offloaded Eagle: 1B draft model (t/s) |
0.18 | 1.79 | 0.79 |
drafting (ms) | target model verification forward pass (ms) | misc. (ms) | |
33.28 | 5,235 | ~4.5 |
tree: top-k = 5, depth = 10,
size = 5 + 10 * 5 ^ 2 = 255
# draft model forward passes: 10
0.79 / (1000 / (33.28 + 5235 + 4.5))
= 0.79 / 0.19
= 4.16 tokens per cycle
Moving Forward: Speculative Decoding
recap
- during the last few weeks, we have been experimenting with speculative decoding as a way to accelerate LLM inference without degrading model accuracy in offloading settings where performance is often bottlenecked by CPU to GPU PCIe bandwidth
- last time, we introduced EAGLE, a speculative decoding framework that trains its own draft model instead of using one off the shelf
- By enabling offloading with basic optimizations such as delegating FFN computation to the CPU, we were able to achieve ~1.75 t/s throughput with EAGLE when running Llama-3-70B with a 1B self-trained draft model on 1x RTX-4090 24GB, which is on par with Sequoia’s results with Llama-3-8B as the draft model
Moving Forward: Speculative Decoding
recap
- this made us wonder: can we run Llama3.1-405B on a 1x RTX-4090 24GB?
- to do so, we need to:
 1. store weights in SSD & CPU memory => more PCIe limitations
2. incorporate hierarchical speculative decoding: big / medium / small models
- but before we rush into things, we should see whether others have already implemented this idea (spoiler alert: yes)
Moving Forward: Speculative Decoding
today’s outline
- brief introduction to related work:
 1. Cascade Speculative Drafting for Even Faster LLM Inference
    by a team at the University of Illinois Urbana-Champaign, December 2023
    https://arxiv.org/pdf/2312.11462
 2. TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
    by a team at CMU (Beidi Chen, https://www.andrew.cmu.edu/user/beidic/) and Meta AI FAIR lab, April 2024
    https://arxiv.org/pdf/2404.11912
- experiments:
 opportunities we discovered when examining TriForce and EAGLE side by side
Related Work: Cascade Speculative Decoding
- recall that, when verifying the draft sequence, if the token at the current position is rejected, all following tokens are discarded
- therefore, draft tokens at later positions in the sequence have lower chances of being accepted
- inspired by this observation, Cascade uses larger draft models for earlier positions and smaller ones for later positions
Related Work: TriForce
background (not from the paper but relevant)
- in LLM, context (input prompt + previously generated tokens) is used to generate the next token in the sequence
- to avoid repeating computation, the KV-cache, which stores the results of�matmul(input, K_matrix) and matmul(input, V_matrix), is used to accelerate the attention mechanism
- Llama-3.1 supports context length up to 128K (131,072 to be precise) tokens
- popularity of using Chain of Thought prompting to assist LLMs in reasoning also implies longer contexts
Related Work: TriForce
- note that KV-cache size grows linearly with batch size
- this means, when serving these SOTA models, we need to account for both model weights and KV-cache sizes to realize their full-potential
- seeing this and the fact that draft models grow less accurate with longer contexts (perhaps because they are not trained to memorize as much context as the target model. However, this is not the case for the llama3.1 family), TriForce proposes …
- how big is the KV-cache?
2 * n_layers * batch_size * max_sequence_len * n_kv_heads * head_dim * 2 (bytes)
- which means, one sequence of 128K tokens’ KV-cache occupies
 Llama-3.1-8B: 2 * 32 * 1 * 128,000 * 8 * 128 * 2 / 1000^3 = 16.78 GB
 Llama-3.1-70B: 2 * 80 * 1 * 128,000 * 8 * 128 * 2 / 1000^3 = 41.94 GB
 Llama-3.1-405B: 2 * 126 * 1 * 128,000 * 8 * 128 * 2 / 1000^3 = 66.06 GB
Related Work: TriForce
a three-tier architecture:
1. at the bottom, a tiny draft model that uses fixed-sized KV-cache to accelerate
2. a “self-speculating” mechanism in the middle that uses the original target model weights + ~3% of the full KV-cache, which is then used to speed-up
3. the target model with full KV-cache
Related Work: TriForce
TriForce argues self-speculation in the “middle” works because …
- attention sparsity is a well-explored topic that argues most attention score can be recovered from a small portion of the KV-cache
- TriForce observed that adjacently generated tokens tend to pay attention to the same part of the KV-cache (intuitively, e.g.: summarize the experiments setup section of this paper)
Experiments: TriForce
- we hypothesized that, due to self-speculation (which still uses the full target model weights), TriForce would not perform well with shorter contexts
- note that vanilla autoregressive decoding with Llama-3.1-8B is reported to reach 54 t/s on the same machine
context length | retrieval cache budget | # tokens per generation cycle |
130048 | 8192 | 7.56 |
32784 | 8192 | 9.84 |
8192 | 8192 | 11.74 |
2048 | 2048 | 11.87 |
512 | 512 | 11.22 |
128 | 128 | 11.78 |
Experiments: TriForce
interestingly, with shorter contexts, TriForce performs badly with a low retrieval cache budget. The best performance is recorded when the “middle” self-speculative mechanism is equivalent to the target model
Experiments: EAGLE
naturally, we became curious about how EAGLE performs with longer contexts
note that the results published in their paper (same as Sequoia) use a very short context: 128 tokens
context length | # tokens per generation cycle |
128 | 3.71 |
256 | 3.70 |
512 | 3.34 |
1024 | 3.13 |
1765 | 2.35 |
3072 | 1.85 |
Comparative Analysis
- EAGLE is bad at longer contexts
- TriForce is bad at shorter contexts
=> good synergy?
note that:
- we’re not using EAGLE with offloading here yet
- TriForce only offloads the target model’s full KV-cache. What happens when the draft model’s weights can’t fit in the GPU?
- aside from computation and memcpy overlap, few optimizations were introduced for the target model’s verification runs
interesting to know:
longer contexts, during token generation, make attention-score calculation even more memory-bound,
because most of the time will be spent loading the KV-cache
An intriguing idea: Accelerating MoE LLM with Self-Speculation
motivation
- when experimenting with Mixtral8x7B, 彥文 saw that the model could still perform well when using the top-1 & random 2nd experts (the original design chooses the top-2 out of 8 experts)
- this means we can choose experts already on the GPU as the 2nd expert to reduce execution time
- the goal of 彥文’s work is to balance performance improvement against accuracy degradation
- well, I would like to improve MoE LLM’s performance without any accuracy loss. Maybe we can use “top-1 + random 2nd ” as a draft model to accelerate the original model through speculative decoding
An intriguing idea: Accelerating MoE LLM with Self-Speculation
related work
- EAGLE (self trained draft model with 0.28B params) has attempted this but did not get very nice results
speedup & acceptance length possibly use a 60-token draft tree (no evaluation code);
#-alpha denotes the acceptance rate of the draft token at position # in the sequence (no tree);
temperature=0 implies greedy decoding, i.e. selecting the token with the highest probability
- their paper argued that MoE models are harder to speedup than their dense counterparts since the draft tokens can activate different parameters during the verification forward pass
Recap: how does speculative decoding achieves speedup
speculative decoding workflow:
1. drafting: the smaller draft model generates multiple draft tokens based on the input sequence
2. verification: the bigger target model processes the input sequence + all draft tokens in parallel in a single forward pass
when performing autoregressive inference, generating N tokens requires moving the same parameters from memory to cache N times
- with dense models and perfect drafting accuracy, you reduce data movement to 1 time
- however, with MoE models, the draft tokens can activate different experts, increasing data movement during verification (see the back-of-the-envelope simulation below)
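A back-of-the-envelope simulation of that last point (my own, not from the papers; it assumes the router picks experts uniformly at random, which real routers do not):

import random

def avg_unique_experts(n_experts=8, top_k=2, n_tokens=5, trials=10_000):
    """Average number of distinct experts one MoE layer touches when a
    verification pass processes `n_tokens` draft tokens at once."""
    total = 0
    for _ in range(trials):
        touched = set()
        for _ in range(n_tokens):
            touched.update(random.sample(range(n_experts), top_k))
        total += len(touched)
    return total / trials

# With Mixtral-like settings (8 experts, top-2) and 5 tokens per verification pass,
# this comes out to roughly 7-8 of the 8 experts, versus 2 for one autoregressive step.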
Performance Model: accelerating MoE LLM with speculative decoding
Performance Model: accelerating MoE LLM with speculative decoding
through a sample case: things to watch out for
Experiments: 4 implementations
- basic: draft model uses only the top-1 expert. Attention weights + expert0-1 to the 25th layer on GPU. Computation happens where the model weights are, no memcpy weights.
- clear_start: Attention weights + all 8 experts to the 7th layer on GPU. Uses top-2 for the first 7 layers and top-1 for the rest.
- random_2nd: draft model uses the top-1 expert + whichever expert is on GPU. Same weights placement as basic.
- dyn_draft_len: same as random_2nd; however,
 1. performs drafting + verification together in the first forward pass
 2. if the first draft token is rejected, terminates drafting early
Evaluation: 4 implementations
implementation | 6, non-greedy | 6, greedy | 8, non-greedy | 8, greedy |
basic | 0.80 | 0.61 | 0.69 | 0.51 |
clear_start | 0.75 | 0.62 | 0.69 | 0.53 |
random_2nd | 0.88 | 0.74 | 0.80 | 0.63 |
dyn_draft_len | 0.92 | 0.85 | 0.79 | 0.76 |
Acceptance Rates
here, we are not even using a tree! However, high draft accuracy comes at a cost
Evaluation: 4 implementations
in August, my implementation that enables static collaboration between the CPU and GPU achieved 4.28 t/s decode throughput, which translates to ~234 ms per-token latency
- we need to make draft model faster
- hierarchical architecture?