1 of 409

by Mu-Chi Chen

2 of 409

What Is a Data Hub?

mediator between various data sources and data consumers
how does it differ from normal data storage solutions (e.g. database, data lake) ?

3 of 409

What Is a Data Hub: a one stop shop for data consumers

4 of 409

What Is a Data Hub: Main Functions

5 of 409

Examples

from platform providers:

Cumulocity
MarkLogic
Cloudera

from an alternative/complementing definition of data hub:

LinkedIn’s DataHub
Microsoft’s OneLake data hub

from Google as a narrowly purposed product:

Ads Data Hub

6 of 409

Example: Cumulocity

Highlights:

IoT devices send data to mongoDB
Dremio (distributed SQL engine) as query interface & ETL orchestrator
stores data as Apache Parquet (relational) on data lakes

reference: https://cumulocity.com/guides/datahub/datahub-overview/

7 of 409

Example: MarkLogic

Highlights:

allows data ingestion from Apache Kafka
uses a multi-model database called MarkLogic Server

reference: https://www.marklogic.com/solutions/architectures/data-hub/

8 of 409

Example: Cloudera

workflow:

ingest data by creating a data hub cluster that hosts Apache Nifi (data flow manager, enables getting data via HTTP, s3, Kafka, Mongo …)
launches transformation/ analytic jobs on perhaps another cluster (supports Apache Spark for batch & stream processing)
clusters in the same environment (e.g. test, staging, production) can save data to a shared data lake where they also become accessible by consumers

reference:

https://www.cloudera.com/products/data-hub.html#

https://docs.cloudera.com/data-hub/cloud/index.html

9 of 409

Things to Consider

how to move data into our data hub? (data ingestion & integration)

Apache Nifi as open source solution
AWS Glue or Azure Data Factory depending on our cloud service provider

where do we store the data in our data hub?

data lake (azure, aws, GCP, Snowflake …)
multi-model database (open-source: Apache Ignite, ArangoDB, OrientDB, FoundationDB, paid: Azure Cosmos DB)

implications of our infrastructure choices?

10 of 409

slides above were covered in

Data Hub Project Weekly Meeting on 9/28

11 of 409

Example: Data Hub as Data Catalog/ Metadata Platform

What is a Data Catalog?

According to IBM, “a data catalog is a detailed inventory of all data assets in an organization, designed to help data professionals quickly find the most appropriate data for any analytical or business purpose.”

reference: https://www.ibm.com/topics/data-catalog

Why does a Data Hub need this?

Aside from enabling people to get the data they need easily, data hubs should also help them find the data they need (this would essentially be the face of this project).

encourages and therefore increases usage
facilitates data sharing and open research

12 of 409

Example: Data Hub as Data Catalog/ Metadata Platform

Common functions of a data catalog

IAM
search
data lineage (most likely requires logs)
data management (source, ingestion, retention, and export configurations)
data ops (pipeline execution and statistics)
data quality (rule definitions and pipeline execution results)
compliance (SOC 2, right to be forgotten, right of access)
explainable and reproducible AI

reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained

13 of 409

Example: LinkedIn & Acryl Data’s DataHub

open source

https://datahubproject.io/

slides 12, 14-16 references a LinkedIn Engineering Blog about metadata catalog architectures by Acryl Data’s CTO (previously LinkedIn’s Principal Staff Software Engineer leading the data infrastructure team)

14 of 409

Example: LinkedIn & Acryl Data’s DataHub

architecture

metadata are event-based, subscribable in real-time
metadata providers push to stream or service API
changes to metadata generate logs
logs can be materialized in different forms

reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained

15 of 409

Example: LinkedIn & Acryl Data’s DataHub

architecture

metadata should be strongly typed and have collaboratively defined relationships

reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained

16 of 409

Example: LinkedIn & Acryl Data’s DataHub

reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained

providers push metadata to change proposal stream (like a topic on Apache Kafka), where it is processed made available for subscription
metadata (main) service subscribes to processed change stream, make logs, and TL the metadata to different storage solutions depending on use cases
consumers e.g. data hub dashboard powers different applications using the various data storages

17 of 409

Example: Microsoft’s OneLake Data Hub

question:

What does MBZUAI need?

18 of 409

Example: Ads Data Hub (will skip in presentation)

Highlights:

reads data from: Google Ads, Youtube, your GCP BigQuery (data warehouse)
writes to your GCP project upon query
privacy & security by Google

19 of 409

For weekly meeting on 10/5/2023

topic: continue to survey data hub elsewhere

Internet of Water Data Hub
Art Institute of Chicago Data Hub

maybe

circle back to talk about data catalogs
Open Data Hub (originally an internal Red Hat project but then open sourced)

20 of 409

Example: Internet of Water Data Hub

What is the Internet of Water Coalition?

A group of organizations (e.g. Duke University Nicholas Institute, Lincoln Institute Center for Geospatial Solutions, Western States Water Council) working with federal, state, and local governments to improve water management by building a water data infrastructure.

In summary:

Better data means better management. Therefore, we need good infrastructure to make water data more discoverable, accessible, and usable

21 of 409

Example: Internet of Water (IoW) Data Hub

collects water data from disparate sources into one place
standardizes collected data according to a commonly adopted principle (which implies that it only stores structured data)
enables discovery and access by Geoconnex (metadata catalog also developed by IoW)

22 of 409

Example: Internet of Water (IoW) Data Hub

basic components of a Data Hub

data producers
data wrappers: (the transform and load jobs of an ETL process)
data store
metadata catalog: searchable collection of metadata that points to data in data store including API to access the data (this is where data users can come in)

individual data hubs are then connected by Geoconnex, which is essentially a global metadata catalog (this is another place data users can come in)

23 of 409

Example: Internet of Water (IoW) Data Hub

there are four types of data hub, and this is type A

distributed in the sense that the data producers are responsible for the entire ETL process & storage
here, data hub provides only a window to discover and access data.

24 of 409

Example: Internet of Water (IoW) Data Hub

type B:

data hub also stores the data
data producers push transformed data to data hub

type C:

same as above except that data hub is responsible for pulling transformed data from data producers

* data sharing creates a privacy issue

25 of 409

Example: Internet of Water (IoW) Data Hub

type D (which is more like our original vision):

data producers send raw data to data hub, who is responsible for the rest

* can the data producers trust the data hub to transform the data right? (which brings us back to the ETL v.s. ELT question)

26 of 409

Example: Internet of Water (IoW) Geoconnex

purpose: make it easier for data consumers to find/access the data they want/need. Roughly similar to what LinkedIn and Microsoft is doing

why I think it is not a good product

every org publishing data are required to create a web page for each location for which they have water data
every created web page needs to embed structured metadata in JSON-LD
every page needs its own unique ID. If you every modify the URL, you gotta go to Geoconnex to remap it

this is so that a web crawler can then go to these pages to extract the data it needs and store it to a central database once everything above is done right

27 of 409

Example: Internet of Water (IoW) Data Hub

IoW Data Hub examples:

https://newmexicowaterdata.org/

https://healthofthehaw.org/

references:

https://internetofwater.org/wp-content/uploads/2022/09/Blog019_IoWWaterDataHubs_Watson.pdf (slides 20-25)

https://internetofwater.org/blog/geoconnex/ (slide 26)

28 of 409

Example: the Art Institute of Chicago Data Hub

background

disparate data sources
some data transformation jobs repeated across different applications
wants robust support for website (web & mobile app) search (full-text search, autocomplete, and general SQL-like queries such as filtering and aggregation)

29 of 409

Example: the Art Institute of Chicago Data Hub

design goals

only stores data that are already publicly accessible, which preclude data privacy issues. This, they argue makes it okay for:
using cloud infrastructures
not care about authentication & authorization for now
not treat data hub as a central, unique, and permanent storage: that is the data producer’s job (not necessarily applicable to MBZUAI)
minimize need to ask data producers to modify their systems for data ingestion

30 of 409

Example: the Art Institute of Chicago Data Hub

architecture

hub and spoke model consisted of multiple microservices: APIs, HTTPS, JSON (you know, the yoozh …)
spokes
one microservice for each type of data source
each microservice is responsible for extracting the needed data, transform it right, and expose an API for reading the transformed data (pull instead of push)
can involve scraping data into local dbs for storage and access
hub
pulls, aggregate, and index data from spokes
mySQL for primary storage :(
Elasticsearch as search engine (enables full-text search, but is a document db itself like Mongo, no longer open-sourced since 2021)
exposes an API for data consumers: user submits query to hub, hub asks mySQL and/or Elasticsearch for data

31 of 409

Example: the Art Institute of Chicago Data Hub

challenges

performance: high website traffic (increase in data consumers) results in data hub dropping connections
auto-scaling capabilities
debugging: hard to trace where mistakes were made
importance of data lineage (done very well by LinkedIn) - where the data originated, how it has changed, and its ultimate destination

references: https://mw19.mwconf.org/paper/building-a-data-hub-microservices-apis-and-system-integration-at-the-art-institute-of-chicago/index.html (slides 28-31)

32 of 409

Example: Red Hat’s Open Data Hub (ODH)

end-to-end AI platform
contrary to all the data hubs discussed above, ODH provides a space for data scientists to develop their applications and tools for deploying them into production
this is a platform worth introducing because it is built mostly with open source projects

33 of 409

Example: Red Hat’s Open Data Hub (ODH)

Architecture highlights

uses ceph for (distributed file, object, block) storage -> an example of open source data lake
uses Apache Hive Metastore for metadata to enable data discovery
uses OpenShift as container management platform (powered by Kubernetes), which means the underlying compute resource can come from public/private cloud or on-premise data centers

34 of 409

Example: Red Hat’s Open Data Hub (ODH)

references:

https://opendatahub.io/docs/architecture/ (slide 32)

https://cloud.redhat.com/hubfs/Open-Data-Hub-A-Deep-Dive.pdf (slide 33)

35 of 409

Example: Google Analytics Hub

feature of BigQuery

short intro to BigQuery:

GCP’s fully managed data warehouse
separates storage (BigQuery storage) & compute (BigQuery analytics) engine
two main use cases
store & analyze data in BigQuery
use BigQuery to access your data where it lives (only supports
external table: Cloud Bigtable (NoSQL), Storage (data lake), Google Drive
federated queries: Cloud Spanner (relational database that scales like a non-relational one), and SQL

36 of 409

Example: Google Analytics Hub

How might it look like for MBZUAI?

37 of 409

Example: Google Analytics Hub

references:

slide 35

https://cloud.google.com/bigquery/docs/introduction

https://cloud.google.com/spanner?hl=en

slide 36

https://cloud.google.com/bigquery/docs/analytics-hub-introduction

https://beam.apache.org/documentation/io/connectors/

38 of 409

Alright, so how would I build it?

39 of 409

40 of 409

Scaling

41 of 409

What could be MBZUAI Data Hub’s unique offering?

42 of 409

Automatic Sampling Bias Detection

what is sampling bias?

sample unrepresentative of population

why do we care?

internal & external (generalizability) validity
fairness in machine learning

what role can data hub play in this?

by enabling automatic and real-time exploratory data analysis, we can

improve data collection ASAP
optimize storage by not saving data that are too similar
enhance transparency for users of resulting models

43 of 409

Automatic Sampling Bias Detection

how do we detect sampling bias?

what does detecting sampling bias mean?

uncovering patterns in data (arguably data mining, knowledge discovery in

data): 90% of the data collected today came from people who identifies as

cisgender man (assigned male at birth and identifies as man)

44 of 409

Automatic Sampling Bias Detection

ways to uncover patterns in data

methods of unsupervised learning:

clustering

exclusive (K-means)

hierarchical (agglomerative)

probabilistic (Gaussian Mixture, determines probability of data belonging to to a particular distribution)

association rules

Apriori algorithm (frequent data combination)

45 of 409

Automatic Sampling Bias Detection

ways to uncover patterns in data

methods of unsupervised learning:

outlier detection

local outlier factor (local density of focus point v.s. neighbor)

isolation forest (build overfitted decision tree by splitting at random threshold & feature until every point is isolated. Those isolated early are outliers)

autoencoders (perhaps for images. Encoders learns a latent representation of input and decoders reconstruct input from encoded output. Larger

reconstruction error implies anomaly)

46 of 409

Automatic Sampling Bias Detection

ways to uncover patterns in data

methods of supervised learning:

classification

understand the different ways in which your data can be categorized to adjust data collection strategy

e.g.

classification results seems geographical, and much more data are classified as coming from East Asia

KNN, K-nearest neighbors

decision tree & random forest

47 of 409

Academic Discussions on Data Hub

48 of 409

Technical Report: Developing a Working Data Hub

paper by MIT Lincoln Laboratory Supercomputing Center in March, 2020
authors:

Vijay Gadepally (OSU Ph.D. in ECE, technical staff at MIT Lincoln Lab)

Jeremy Kepner (Princeton Ph.D. in Astrophysics, head of MIT Lincoln Lab)

the Lincoln Lab never seem to have built a data hub of their own
the closest product MIT has helped build is probably BigDAWG (a polystore database prototyped at a hackathon hosted by Intel)
no concrete experiences to learn from, but interesting guidelines
https://arxiv.org/pdf/2004.00190.pdf

49 of 409

MIT on the need of data hub

big data challenges:

Volume, Velocity, Variety (three Vs), Veracity (Fourth V? preserving trust in data and analytic)
in summary, needs to extract value from lots of heterogeneously sourced data fast and accurately

evolving solution:

one-stop solution for storing (providing) and using (consuming) data
which is data hub

50 of 409

MIT on data hub architecture

at the bottom:

external data sources + your own data storage including databases, data warehouses, and data lakes

in the middle:

heterogeneous data management and transformation. enable users to query multiple sources at once

on the top:

data discovery and define data transformation rules

51 of 409

MIT on data hub system engineering

NOTE: “system engineering studies the development of complex systems (p.8)”

HIGHLIGHTS:

3a is an operation supported on databases while 3b is a strategy for file systems/ object storage/ block storage i.e. place for unstructured/ raw data like data lakes

=> similar to my data hub architecture proposal

52 of 409

MIT on raw data storage: Lustre

distributed parallel object storage that is great for HPC (kinda similar to s3 but open-source and has a file system interface)
stores objects (binary data stored in file) and associated metadata (ith node, filename, timestamp etc) in a flat structure
one metadata server (MDS), multiple object servers (OSSes)
clients request data from metadata server which saves pointer to data in object servers
usually uses (not built-in) RAID6 to prevent data loss (allows concurrent failures of up to 2 disks, on average imposes a 35% storage penalty)

53 of 409

Lustre further reading

“Azure Managed Lustre: not your grandparents' parallel file system”

https://techcommunity.microsoft.com/t5/azure-high-performance-computing/azure-managed-lustre-not-your-grandparents-parallel-file-system/ba-p/3889946

Ceph (used in Redhat’s Open Data Hub, originally part of Sage Weil’s doctoral dissertation at UC Santa Cruz, also used extensively in HPC) vs Lustre:

https://www.linkedin.com/pulse/ceph-lustre-yashar-esmaildokht/

AWS FSx for Lustre (integrates with s3)

https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html

official site:

https://www.lustre.org/

54 of 409

MIT on DBMS

what is DMBS?

software interface between users and database

why use DMBS?

define schema
CRUD data
access control
analytics

55 of 409

MIT on DBMS

ACID v.s. BASE

relational db: atomicity, consistency, isolation, and durability. OLTP

non-relational db: basically available, soft state, eventual consistency

CAP theorem

you can only choose 2 between consistency, availability, and partition tolerance

(introduced by professor at Cal, supported by professors at MIT)

relational db: C + A

non-relational db: C + P (MongoDB), A + P (Cassandra, developed at FB, and founder later went to AWS to build DynamoDB)

56 of 409

MIT on DBMS

scaling

relational db: vertical (improve hardware because they are mainly on single mode)

non-relational db: horizontal (add nodes, which creates need of strong partition tolerance, which makes relaxing C and/or A inevitable if the CAP theorem is correct)

That said, there are controversy surrounding CAP

NewSQL (as opposed to SQL and NoSQL) movement that aims to build databases that are both ACID compliant and performant (i.e. horizontally scalable)

some examples include …

CockroachDB: https://www.cockroachlabs.com/

Google Cloud Spanner (which we have seen before when introducing BigQuery, now advertised as being half the cost of DynamoDB): https://cloud.google.com/spanner?hl=en

57 of 409

MIT on DBMS

Access Control: who can access what

granularity: table, row, to cell levels

principal: the “who,” can be role/ user/ groups

asset: the “what,” i.e. resources stored in database

view-based access control being most popular

often implemented through metadata
often used with role-based strategies
principal requests access, db evaluates whether principal has rights to access requested data, if so, presents data with view

query control strategies: restricts what query the principal is allowed to issue

58 of 409

MIT on DBMS

ways to add security

well … duh … encryption and/or masking

CryptDB (also by MIT):

executes SQL queries over encrypted data
data can optionally only decryptable by password-holders, which means db admin does not get to access decrypted data

59 of 409

MIT on heterogeneous data management

modern applications often rely on diverse datasets with different models and sources
ETL is expensive:
converting everything to a single model may degrade performance
high maintenance cost, transformation rule has to change with data
“custom data stores provide 100X better performance than general purpose databases”
sometimes, data sharing policy may prohibit moving the data out to a centralized location
federating (like federal to state governments in the U.S.) multiple specialized data stores. Single or multiple query interfaces?
how to avoid writing and using a different connector for every selected store
how to integrate data

60 of 409

MIT on heterogeneous data management

BigDAWG (Big Data Working Group)

unified interface (connector)
islands: group databases with the same data model, trade off semantic completeness for location transparency (which implies having a single query language)
shim translates island queries to db engine’s native language
cast enables moving data between db engines

61 of 409

MIT on heterogeneous data management

BigDAWG currently supports PostgreSQL, MySQL, Vertica (?), Apache Accumulo (k-v store), SciDB (??), and S-Store (???)
interesting challenges that BigDAWG tackles
cross-engine query plans
query performance monitoring
cross-engine data migration
query execution

62 of 409

MIT on key considerations when developing data hub

Technological considerations

don’t rely on a centralized data lake
single point of failure
difficult to store operational data
data becomes stale (hard to update)
scalable storage architecture that federates multiple specialized data stores
provide access to computing infrastructure for data processing

63 of 409

MIT on key considerations when developing data hub

Infrastructure considerations

compute & storage nodes, networking equipments
cloud computing
almost unlimited resources as long as you have $
cloud clawback: hybrid cloud
vendor lock-in
** some tools may not run on sensitive networks **
data catalog that is updated automatically and perhaps even provide information on related previous work (information sharing & research replication)

64 of 409

MIT on key considerations when developing data hub

Data formatting considerations

80% ML/ AI research time are spent on data wrangling
data providers are a crucial source to understand the meaning of data
maximize data provider input (knowledge transfer) during data ingestion & wrangling

65 of 409

MIT on key considerations when developing data hub

Security Considerations

data providers and consumers should be able to trust you
work with Information Security Officers and Subject Matter Experts
data aggregation can make insensitive data sensitive: name and address by themselves may not be sensitive, but matching them together creates PII (personally identifiable information)
data use agreements (like cookies ~)

66 of 409

MIT on key considerations when developing data hub

Policy considerations

new data may warrant addition of new technology
open v.s. closes source: balance between long-term support and developer community
software licensing may prohibit technology sharing (MBZUAI to NTU for instance)

67 of 409

MIT on key considerations when developing data hub

User engagement

active and engaged users and developers
quick, transparent, and effective feedback loop
ask big projects to share data
remove costs of data sharing
invest in automatic data discovery: data hub should have the latest data to attract more users

68 of 409

UCSC Xena Platform

https://www.nature.com/articles/s41587-020-0546-8

enables visualization and interpretation of cancer genomics data from disparate sources

Background

large public datasets & small and private datasets from individual labs
web app (usually for public datasets) | desktop app (usually for private datasets)
driving use case: visualize public and private datasets together without having to upload private data to external servers.

69 of 409

UCSC Xena Platform

Architectural Design

two components:

frontend (Xena Browser): users create requests, frontend fetches and integrates data from different hubs. data source configuration is done via hub URLs
backend (Xena Hubs): can be installed on desktop to launch a local server or hosted on public servers which sometimes have their own access control via firewall. no integration here, private data made available by local server never gets uploaded to anywhere else

70 of 409

UCSC Xena Platform

71 of 409

Lessons from government agencies

72 of 409

Taiwan’s Health and Welfare Data Center (HWDC)

https://pubmed.ncbi.nlm.nih.gov/31118821/

Background

National Health Insurance launched in 1995, 99.99% population enrollment
National Health Research Institutes established in 2002, in charge of National Health Insurance Research Database (NHIRD)
NHIRD stores insurance claims, which means we have data to the entire population
NHIRD sits at the core of HWDC

73 of 409

Taiwan’s Health and Welfare Data Center (HWDC)

Background

relies heavily on personal identification numbers (PINs) to join data from different registries (beneficiaries, inpatient claims, prescriptions …)
has allowed international collaborative study
aggregates data from more than 70 databases
linkable (JOIN ON PIN)
de-identified publicly available data
takes roughly two years for data to be available on NHIRD
only handling structured data: missing out on medical images!

74 of 409

Taiwan’s Health and Welfare Data Center (HWDC)

Challenges: Privacy

top concern. got sued.
encrypts (transforms to unique identifiers) names: data masking?
access restricted to applicants from academia, research institutes, or hospitals
supports remote access to server but requires office visit after each session
for research purposes only
cannot request data from more than 10% of population

75 of 409

Taiwan’s Health and Welfare Data Center (HWDC)

Challenges: Data Quality

upcoding (health care providers submitting exaggerated diagnoses to guarantee reimbursement)
is this noise that can be overcome with large samples?
how to assess impact?
less of a problem when diagnoses were made to certify catastrophic illness (waives premium and co-pay)

76 of 409

Taiwan’s Health and Welfare Data Center (HWDC)

Challenges: Unmeasured or unavailable variables

lacks data on disease severity, a confounding variable
causes both dependent and independent variables
disease severity affects both treatment and effect
solution:
instrumental variables: Z -> X -> Y
active comparator: instead of placebo, compares against another commonly administered treatment
propensity score: compare probabilities of receiving treatment
predict disease severity from diagnoses when available (e.g. strokes)

77 of 409

Querying Encrypted Data

by Microsoft Research

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigmod-tutorial-final.pdf

78 of 409

Querying Encrypted Data: Demand

migration to cloud and thus storage on cloud

79 of 409

Querying Encrypted Data

enters encryption:

you won’t be able to understand/ use the data even if you managed to gain access to them

the quick brown fox … a pangram

80 of 409

Querying Encrypted Data

two fundamental techniques:

compute over encrypted data: homomorphic encryption
compute over plain text in “secure” location

81 of 409

Querying Encrypted Data

82 of 409

Querying Encrypted Data

symmetric encryption

83 of 409

Querying Encrypted Data

asymmetric encryption

84 of 409

Querying Encrypted Data

Advanced encryption standard cipher block chaining mode

85 of 409

Querying Encrypted Data

nondeterministic (more secure):

>> for i in range(2):

>> print(encrypt(“foo”))

>> 1qaz2wsx

>> 3edc4rfv

deterministic:

encrypted output is always the same
less secure: if we know encrypt(x) == y, we know c * encrypt(x)
enables equality check
mongoDB https://www.mongodb.com/blog/post/mongodb-announces-queryable-encryption

86 of 409

Querying Encrypted Data

homomorphic encryption:

>> encrypt(1) + encrypt(1) == encrypt(2)

can be deterministic or nondeterministic
question: https://crypto.stackexchange.com/questions/58047/equality-check-in-homomorphic-encryption
CKKS FHE (2016, Seoul National University): https://eprint.iacr.org/2016/421.pdf

order preserving encryption:

if x < y, encrypt(x) < encrypt(y)

87 of 409

Querying Encrypted Data

to summarize

88 of 409

Querying Encrypted Data

authors’ opinion: outdated?

89 of 409

Querying Encrypted Data

90 of 409

Querying Encrypted Data: Trusted Client Architecture

client: applications performing CRUD operations against a server DBMS

DBMS only stores encrypted data
client responsible for hosting encryption/decryption microservice and related query translator

91 of 409

Querying Encrypted Data: CryptDB as Trusted Client example

CryptDB:

uses partially homomorphic encryption
no computation on client side

92 of 409

Querying Encrypted Data: CryptDB as Trusted Client example

example operation

client app

>> SELECT SUM(grade) FROM students;

web proxy

>> decrypt(paillier_grade_sum)

DBMS

>> SELECT PAILLIER_SUM(paillier_grade) AS paillier_grade_sum

FROM students;

students table’s schema:

from user’s perspective

{

“id”: int,

“name”: str,

“grade”: int

}

on DBMS

{

“paillier_id”: str,

“name”: str,

“paillier_grade”: str

}

93 of 409

Querying Encrypted Data: Blob Store as Trusted Client example

What is Blob Store?

database design
blob: encrypted data (e.g. paillier_grade -> blob_grade)
blob used in conjunction with partition id

students table’s schema:

{

“name”: str,

“blob_grade”: str,

“partition_id”: int

}

range partition:

grade	parition_id
1 - 9	0
10 - 19	1
20 - 29	2

94 of 409

Querying Encrypted Data: Blob Store as Trusted Client example

Architecture

bit similar to CryptDB
difference? more query processing is moved to the client’s end (not just translating anymore)

95 of 409

Querying Encrypted Data: Blob Store as Trusted Client example

client app

>> SELECT SUM(grade)

FROM students

WHERE grade > 10;

DBMS Shell (trusted, client side)

>> SELECT SUM(DECRYPT(grade_blob))

FROM students

WHERE DECRYPT(grade_blob) > 10;

DMBS (untrusted, server side)

>> SELECT grade_blob

WHERE partition_id > 0;

example operation

96 of 409

Querying Encrypted Data: Blob Store as Trusted Client example

challenge

appropriate partitioning
optimal query splitting

97 of 409

Querying Encrypted Data: Monomi as Enhanced Blob Store

uses partially homomorphic encryption to push more computations on untrusted DBMS server
requires pre-computation for complex queries (e.g.

WHERE updated_at = created_at + 1 -> without key, you don’t know encrypt(1))

example operation

client app

>> SELECT SUM(grade)

FROM students

WHERE grade > 10;

DBMS Shell (trusted, client side)

SUM(DECRYPT(grade_paillier))

DMBS (untrusted, server side)

>> SELECT grade_paillier

WHERE grade_paillier > 10;

98 of 409

Querying Encrypted Data: Trusted Client Summary

does not require changing untrusted server DBMS
works well when amount of data transferred is small
relies on distributed query after all
pre-computation for complex query is expensive
generalizability?
partially homomorphic encryption is still expensive, requires 2048 bits to store int32 -> 64x expansion!
partially homomorphic encryption support a limited set of operations

99 of 409

Querying Encrypted Data: Secure In-Cloud Processing

we talked about trusted client in the previous slides

100 of 409

Querying Encrypted Data: Secure In-Cloud Processing

how do queries under different architecture work?

benefit of having smaller TCB?

notice how isolation (hardware, memory) affects size of base

101 of 409

Querying Encrypted Data: Secure In-Cloud Processing

larger TCB	smaller TCB
less secure	more secure
more administration	less administration

according to this paper:

102 of 409

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

Highlights

secure coprocessor as secure hardware
MySQL: untrusted main storage

SQLite: trusted storage for sensitive data

similar to what we talked about in blob store: distributed query processing
PCI: payment card industry

103 of 409

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

id	name	birthday_timestamp
0	Andrew	866476800
1	Bart	961171200
2	Casey	1087401600

id	SSN
0	123456789
1	123456790
2	123456791

MySQL (untrusted)

SQLite (trusted)

students table

104 of 409

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

example query

User Query

>> SELECT *

>> FROM students

>> WHERE birthday_timestamp >866476800 AND

>> SSN > 123456790

105 of 409

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

What Happens Under the Hood

>> SELECT *

>> FROM mysql_students

>> RIGHT JOIN (

>> SELECT id

>> FROM students

>> WHERE SSN > 123456790

>> ) AS sqlite_result

>> ON mysql_students.id = sqlite_result.id

>> WHERE mysql_students.birthday_timestamp >866476800

106 of 409

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

performance bottleneck:

secure coprocessors have limited computational and storage capacities
this becomes a problem when most data are confidential
the trusted part of the server is a full DBMS, how do you go about telling your customers that?

107 of 409

Querying Encrypted Data: Cipherbase as an Example of Secure In-Cloud Processing

should remind you of CryptDB (encrypted data on MySQL, trusted web proxy to perform query translation)
here, client proxy is replaced by FPGA, which by the way is on the server end

108 of 409

Querying Encrypted Data: Cipherbase as an Example of Secure In-Cloud Processing

“when all data is strongly encrypted (PHE), Cipherbase achieves 40% the throughput of plaintext SQL Server”
“when all columns in TPC-C (Transaction Processing Performance Council Benchmark C) are changed from plaintext to strongly encrypted, the performance drops by about a factor 2. This number can be around 100 for a client-based system”
more about this paper next week !!!

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cipherbase.pdf

109 of 409

Querying Encrypted Data: Semantic Security

“An adversary is allowed to choose between two plaintexts, m0 and m1, and he receives an encryption of either one of the plaintexts. An encryption scheme is semantically secure, if an adversary cannot guess with better probability than 1/2 whether the given ciphertext is an encryption of message m0 or m1. The notion is also referred to as indistinguishability of encryptions. The encryption reveals no information no matter what kind of semantics (meanings) are embedded in the encryption.”

https://link.springer.com/referenceworkentry/10.1007/978-1-4419-5906-5_23

110 of 409

Querying Encrypted Data: Leakage

what information does encryption schemes reveal?

AES in CBC mode (non-deterministic)

semantically secure
only reveal input length

deterministic

hacker can still understand little about data by counting number of unique values and their distribution

order preserving

order …

111 of 409

Querying Encrypted Data: Leakage

memory access pattern leakage

malicious actors watching on server end can study storage locations different operation access

e.g.

observer learns the following workflow

query A -> access location 0 ~ 9 -> wires money to some account

which implies

location 0 ~ 9 must store information required for some transaction

112 of 409

Querying Encrypted Data: Leakage

Encryption != no information leakage
In trusted-client architecture, we can hide sizes of query result by standardizing them to limit, but this seriously hinders performance
can we, instead, make memory access pattern oblivious (independent of input) ?

113 of 409

Querying Encrypted Data: Access Pattern Oblivious Algorithms

bubble sort

for i in range(len(arr) - 1, -1, -1):

for j in range(0, i):

if arr[j] < arr[j + 1]:

continue

arr[j], arr[j + 1] = arr[j + 1], arr[j]

image source:

https://www.geeksforgeeks.org/bubble-sort/

114 of 409

Querying Encrypted Data: Access Pattern Oblivious Algorithms

aggregator oblivious encryption

“an aggregator [that] can compute an aggregated sum of data and [without learning] anything else”
even if “homomorphic encryption is employed, the aggregator can still decrypt the ciphertexts being compute upon”
interesting paper to be further studied

https://eprint.iacr.org/2017/479.pdf

oblivious simulation

an additional layer that makes memory access pattern looks random -> disables benefitting from spatial and temporal locality

115 of 409

Querying Encrypted Data

(from the Cipherbase paper)

In conclusion, what do we want?

solution that balances

data confidentiality: minimize information leakage
generality: try to support full SQL operations, minimize change to legacy code
performance: throughput and latency

116 of 409

Transaction Processing on Confidential Data using Cipherbase

background

the previous paper is somewhat an introduction to this one
collaboration between Microsoft Research, Stanford University, and ETH Zurich (Einstein’s alma mater, also taught theoretical physics there)

117 of 409

Transaction Processing on Confidential Data using Cipherbase

Abstract (on Cipherbase):

database system
end-to-end data confidentiality through encryption
what’s new?
Architecture with smallest Trusted Computing Base at the time
combines industrial strength SQL database with lightweight encrypted data processing on secure hardware (FPGAs, novel usage they argued)
competitive performance for transactional workloads (OLAP v. OLTP)
discusses optimization strategies:
communication between main untrusted db server and trusted FPGA
limited computational and storage capacities on secure hardware

118 of 409

Transaction Processing on Confidential Data using Cipherbase

Architecture Benefits (aside from what was already mentioned last week)

inherits SQL functionalities for free: concurrency, recovery, stored procedures, and indexes => trusted module simply as a remote function to call
relies on PCIe-based FPGA board

119 of 409

Transaction Processing on Confidential Data using Cipherbase

Threat Model (categories below are assumed to be passive i.e. does not tamper with database contents or processing)

strong adversary:

privileged OS and database access, can view the contents on memory/disk and all internal/external communication. However, cannot observe state and computations in secure hardware.

weak adversary:

can obtain a one-time snapshot of data-at-rest

in real world:

adversary usually lies between the two

120 of 409

Transaction Processing on Confidential Data using Cipherbase

Data Confidentiality (assuming all columns are all strongly, or non-deterministically, encrypted)

no metadata (num tables and columns, data lengths) confidentiality
operational data confidentiality against strong adversary: operations against data leaks some information
equivalent to an abstract system that uses oracle to compute strongly encrypted values
non-boolean operations such as addition: encrypted input & output
boolean operations such as equality check: plaintext True/False

121 of 409

Transaction Processing on Confidential Data using Cipherbase

relational algebra notations:

projection - subset of columns

restriction (selection) - subset of rows

security guarantee of range index is similar to that of order preserving encryption

122 of 409

Transaction Processing on Confidential Data using Cipherbase

Data Confidentiality comparing to prior work

CryptDB (PHE)

operational information leaked even to weak adversary
scope of information leakage is an entire column and not just the data touched during query

e.g. to evaluate

CryptDB has to store the entire column deterministically encrypted, which leaks equivalence relation & distribution when weak adversary obtains snapshot

On the other hand, with Cipherbase

scan-based plan only reveal unknown_predicate(A) over R tuples (bc vals can be non-deterministically encrypted)

123 of 409

Transaction Processing on Confidential Data using Cipherbase

Architecture & Overview of query/transaction processing

database processing is about

data value semantics (expression evaluation, not easily done over encrypted data)
moving data
query setup/ result communication
concurrency control
recovery

124 of 409

Transaction Processing on Confidential Data using Cipherbase

Cipherbase client

presents a plaintext database interface
query encryption & result decryption
has encryption schema and key

125 of 409

Transaction Processing on Confidential Data using Cipherbase

e.g. to evaluate

needs a stack program that

inputs encrypted A, outputs plaintext True/False

invoked once for each tuple of R

this usage pattern inspires the design => registered once, invoked using handle (id) returned by register method

has a data cache but is stateless apart from that

Trusted Module

evaluate expressions over encrypted data during query processing

126 of 409

Transaction Processing on Confidential Data using Cipherbase

Key Management

keys communicated to trusted module using standard key-exchange techniques: secure hardware device manufacturer assigns a public key

Data Encryption

supports cell-level (row + column specification) encryption

=> minimizes TM traffic

=> incurs storage penalty (the ciphertext of a 4-byte integer is 12 bytes in AES + CTR mode)

127 of 409

Transaction Processing on Confidential Data using Cipherbase

Indexing

range index:

can be non-deterministically encrypted, but indexed keys in B-tree are ordered by their plaintext values
index building and lookups require routing to TM
TM stores pre-registered programs that return comparison result (<, =, >) in plaintext
B-tree stored in UM

=> which means weak adversary learns ordering but not equality relationship

=> strong adversary learns equality relationship of queried ranges

128 of 409

Transaction Processing on Confidential Data using Cipherbase

inserting 7 into a range index

129 of 409

Transaction Processing on Confidential Data using Cipherbase

Indexing

equality index:

index keys are deterministically encrypted first
then stored in range index
ordering is determined by ciphertext not plaintext
unclear about intention of design above

130 of 409

Transaction Processing on Confidential Data using Cipherbase

transaction processing

First, client issues query Q

during the prepare step:

client identifies TM programs required to evaluate expressions in Q
client registers program with TM, who in turn responds with an id that references said program
client gets response, rewrites Q to reference program registered using id

131 of 409

Transaction Processing on Confidential Data using Cipherbase

e.g.

user issues:

>> UPDATE Accounts

>> SET Balance = Balance + @Amt

>> WHERE Id = @Id

Cipherbase Client rewrites into:

>> UPDATE Accounts

>> SET Balance = TMEval(21, Balance, @Amt)

>> WHERE Id = @Id

TMEval: built-in function on UM (SQL server) added by Cipherbase to invoke stack programs on TM

After rewritten query submitted to UM, it prepares a query plan like the following:

assuming the Id column is indexed, identify record that matches filter
sends its balance and required params (encrypted) to program 21
program 21 returns encrypted sum
UM only thinks that it is replacing one binary value with another (encrypted data are aliased to binary data types)

132 of 409

Transaction Processing on Confidential Data using Cipherbase

PHE avoids round-trips to TM:

in the previous example, under a scan-based plan, and if Id column is deterministically encrypted, query does not have to involve TM

133 of 409

Transaction Processing on Confidential Data using Cipherbase

Trusted Module

in this paper, realized using FPGA, but does not have to
author discussed using Intel Software Guard Extensions (SGX), a set of instructions that allow creating a process within a protected address space called an enclave (where data and computation is shielded even from OS)
performance concerns regardless of installation:
communication bandwidth (between UM and TM)
latency incurred during communication

134 of 409

Transaction Processing on Confidential Data using Cipherbase

System engineering and optimization details

Transaction processing challenges:

transactions incur large number of TM accesses
good thing is that these calls only transmit a small amount of data (e.g. encrypted addition involves 2 * 12 bytes input and 12 bytes output)
however, round-trips to TM incur latency:
PCIe transfer latency
TM has limited computational resource

135 of 409

Transaction Processing on Confidential Data using Cipherbase

latency in expression evaluation indirectly reduces throughput (transactions hold on to locks longer, which increases data contention)
TM is a shared resource, which means concurrency can be a problem
we can increase concurrency by batching communication (multiple work units in a single communication)
but batching exposes computational resource and bandwidth as bottlenecks because stack programs and data transfer are invoked at a higher rate

136 of 409

Transaction Processing on Confidential Data using Cipherbase

Optimizations

increase parallelism by having multiple FPGA stack machines => process n calls independently
as mentioned previously, batching
increases throughput: stack machines (SM) spend more time computing, free bus for other SMs
sometimes increase latency when others have to wait for a longer running work unit in the same batch to complete. Empirically, authors argue that the benefits outweigh the potential drawbacks

137 of 409

Transaction Processing on Confidential Data using Cipherbase

expression folding (intra-transaction batching): reduce TMEval calls in the same query if possible => reduces TM workload, PCIe traffic, and latency (computation reuse)

138 of 409

Transaction Processing on Confidential Data using Cipherbase

user issues

>> UPDATE Inventory

>> SET item_cnt = item_cnt + @qty

>> SET total_cnt = total_cnt + @qty

>> WHERE storeId = @Id

Naive Cipherbase rewrite

>> UPDATE Inventory

>> SET item_cnt = TMEval(0, item_cnt, @qty)

>> SET total_cnt = TMEval(1, total_cnt, @qty)

>> WHERE storeId = @Id

expression folding

>> UPDATE Inventory

>> SET @var = TMEval(2, item_cnt, total_cnt, @qty)

>> SET item_cnt = UMExtract(@var, 0)

>> SET total_cnt = UMExtract(@var, 1)

>> WHERE storeId = @Id

i.e. common subexpression elimination

139 of 409

Transaction Processing on Confidential Data using Cipherbase

index lookup vectorization: requires traversing the B-tree
naive: TM call for every comparison
optimized: in each call
pass search key and the entire tree node (multiple indice in one node) to TM
binary search target key over node (which is sorted by construction)
only keys touched by binary search is decrypted, search key only decrypted once within a node

140 of 409

Hitting Pause on Cipherbase

solutions we have explored during the past few weeks (CryptDB, TrustedDB, Cipherbase, etc.) were built on partially homomorphic encryption, which supports only one type of gate, typically + or *.
due to the limited set of available operations, we saw 2 common approaches:
trusted client: encrypted data on db server, delegate much of query processing to client who holds the secret key
secure location (trusted computing base): encrypted data on main db, but server also holds key perform complex operations on plaintext data in secure hardware
question now is: does advancement in fully homomorphic encryption (FHE) render the above technologies obsolete?

141 of 409

Hitting Pause on Cipherbase

theoretical benefits of FHE
capable of evaluating arbitrary circuits, multiple gates, for unbounded depth (i.e. any function)
in other words, computation can be performed as normal on untrusted server database when data there are fully homomorphically encrypted.
only require slight modification of database: special addition, multiplication, etc.
in our meeting last week (11/30/2023), we looked at Microsoft SEAL, which promises to enable building “end-to-end encrypted data storage and computation services.” see their website below:

https://www.microsoft.com/en-us/research/project/microsoft-seal/

142 of 409

Hitting Pause on Cipherbase

however, what is SEAL actually capable of?

silent confession by Microsoft:

https://learn.microsoft.com/en-us/azure/architecture/solution-ideas/articles/homomorphic-encryption-seal

143 of 409

Hitting Pause on Cipherbase: Microsoft SEAL

allows + and * on encrypted integers or real numbers. Comparison and sorting are not supported.

=> this means a large portion of SQL queries is unachievable

supports 2 FHE schemes
BFV (2012): modular arithmetic, yields exact values
CKKS (2016): + and * on real or complex numbers, yields approximates

this restriction is entirely contradictory to the idea of FHE

unrecommended to “encrypt entire large database” with SEAL, computations should be “fairly lightweight”

144 of 409

Hitting Pause on Cipherbase: Microsoft SEAL

homomorphic encryption schemes typically have a single secret key, which means it is unsuitable for use cases where multiple private data owners (each with their own secret key) want to perform collaborate computations

this is rather a problem common to Cipherbase and all the previous technologies discussed

145 of 409

Switching to Data Observatory

146 of 409

Clarifying Survey Direction

questions to address:

what to put on the big screen
how can this hardware contribute to advancing artificial intelligence
visualization at the core of interpretability and explainability
towards trustworthy AI, improves accountability
helps build better models
where do we come in and how could this project be related to data hub
visualization depends on data
connect to local data center? small, independent copy?
local data hub software/ service that communicates with MBZUAI data center (similar to the Internet of Water Data Hub design)?

147 of 409

Thinking Outside the Box: AI4VIS

https://arxiv.org/pdf/2102.01330.pdf

background

on the previous slide, we talked about visualizing artificial intelligence,

i.e. model => visualization

what if we reverse the arrow above?

i.e. visualization => model

traditionally, visualizations are discussed as something created for humans
increasingly, however, we see visualizations “created, shared, collected, and reused”
so, why not treat visualization like any other data format (i.e. text and image)?
however, “why should we teach machines to read charts made for humans”?

148 of 409

Thinking Outside the Box: AI4VIS

Why should MBZUAI care?

we have a gigantic screen, but it seems like we are not so sure about what to show on it
heavier utilization helps further knowledge, but generating visualizations is a time-consuming task
on the previous slide, we mentioned reuse
main reasons of applying artificial intelligence to visualization data discussed by this paper:
visualization generation
visualization enhancement
visualization analysis
this means, if we build the following cycle:

model => visualization => model

we are not only advancing AI in two directions but giving more content to show on our screen

149 of 409

Thinking Outside the Box: AI4VIS

credence: author affiliation

three at Hong Kong University of Science and Technology
another three at Microsoft Research Asia
Dominik Moritz: Carnegie Mellon University (HCI Professor) and Apple (Research Scientist)

150 of 409

Thinking Outside the Box: AI4VIS

definitions

visualization defined as graphical representation of data
main categorizations
graphics: appear most frequently, raster (bitmaps)/ vector (describes visual elements in shapes and styles)
programs: instructions to construct the visualization
hybrid: embed programs/meta information into graphics
representation
internal: programs contain design details but less often semantic information. system warrants simpler and more structured format (e.g. {“data”: “cars.csv”, “type”: “hist”, “x”: “mpg”, “y”: “hp”})
feature: input to ML models, extracted by feature engineering and feature learning

151 of 409

Thinking Outside the Box: AI4VIS

why apply AI to visualization data (mentioned previously but now in more detail)

generation
central research question in community
data-based (extensively studied. given data, show me sth), anchor-based (given anchor visualization, show me next), design-based (inject data into template design), and context-based (given contextual info such as natural language description, show me sth)
enhancement
retarget to different environment (e.g. fit image to cell phone display while keeping all the important elements)
annotate, add captions
question answering
interactions

152 of 409

Thinking Outside the Box: AI4VIS

analysis
recent research have focused on creating visualization database
retrieval, help users find examples they need to retrieve-then-adapt
mine: understanding design patterns, population statistics (crawl on web to discover trends in visualizations)

153 of 409

Thinking Outside the Box: AI4VIS

how is the research community applying AI to visualizations?

transformation
program <=> graphics (reverse engineering is difficult and expensive)
decomposing: detect text, shapes, chart, element clustering (semantic groups e.g. axes, legends, annotations)
composing: figure out coordinates on scatter plot based on information obtained from decomposition
assessment
rank quality of visualizations
related to enhancement bc scoring metrics can be used as cost functions
challenging bc humans define when visualizations are “good” => requires large-scale empirical assessment

154 of 409

Thinking Outside the Box: AI4VIS

comparison
how “close” are the given visualizations?
can also be used as cost functions to build recommendation and query systems
difference: Graphscape (UDub) proposes directed graph model. Each edge represents an edit and each vertice denote the resulting visualizations. Edge weight/cost determined by human judgement
distance: project to feature vector, apply distance function
querying
how to specify user needs: keywords, natural language, structural queries, example
how to match needs: exact v. best match

155 of 409

Thinking Outside the Box: AI4VIS

reasoning
interpret visualizations to derive insights beyond data encodings such as semantic information
visual perceptual learning (e.g. neural network to predict human-perceived importance of visualization images)
chart summarization: natural language description/ caption on how to interpret the chart (e.g. shows average precipitation in Taipei over 2023)
visual question answering: traditionally, transform visualizations into tables and translate questions into queries
recommendation
seeDB recommends visualization by deviation with anchor
map data with encodings to recommend styling

156 of 409

Thinking Outside the Box: AI4VIS

mining
design patterns: should motivate design ideas instead of just showing simple statistics with little meaning
data patterns: respective to type of charts

157 of 409

Thinking Outside the Box: AI4VIS

future research opportunities

interoperability: how to combine visualizations outputted from different systems together
CV models for natural images do not work well for visualizations bc small errors could greatly distort meaning of visualization (e.g. misinterpreting the axes)
augmenting ML models with results from empirical studies
collecting data on how humans perceive, assess, and use visualization data on scale
striking better balance between human v. machine friendly visualization

158 of 409

AI4VIS: implementation example from Data2VIS => LIDA

159 of 409

Data2Vis => LIDA

links to papers

Data2Vis: https://arxiv.org/pdf/1804.03126.pdf

LIDA: https://arxiv.org/pdf/2303.02927.pdf

160 of 409

Data2Vis => LIDA

background

main theme of these two papers: data driven visualization generation approach (in AI4VIS paper’s terminology)
relevance between the two

Data2Vis (2019, submitted on arXiv in 2018) author:

Victor Dibia when at IBM Research, Çağatay Demiralp at MIT CSAIL

LIDA (

submitted on arXiv in March, 2023,

open-sourced on GitHub only in August this year,

claimed by Victor as an update of Data2Vis

) author:

Victor Dibia at Microsoft Research

161 of 409

Data2Vis => LIDA

summary - Data2Vis

formulation of visualization generation as a language translation problem (i.e. mapping data to declarative visualization program languages such as Vega-Lite)
model: bidirectional recurrent neural network with Long-Short Term Memory cells under encoder-decoder architecture with attention mechanism
results: difficult to quantify. Yet, qualitatively, the authors observed that model learns Vega-Lite syntax, appropriate transformation on different fields, data selection patterns in creating bivariate plots (e.g. group by age/ sex)
limitations
small training data: 4300 Vega-Lite visualizations from 11 datasets
model sometimes (in 15-20% of tests) select fields that does not exist
model sometimes select fields that have little information value

162 of 409

Data2Vis => LIDA

Visualizations created by Data2Vis

163 of 409

Data2Vis => LIDA

summary - LIDA

tool for generating grammar-agnostic visualization and infographics
solved by LLMs and IGMs (image generation models)
visualization generation as a multi-stage generation problem
summarizer:

data to semantically rich natural language description,

LLMs suffer from hallucination which sometimes lead them to produce

output not grounded in data. Informing summary augments model by provides grounding context for generating visualizations

goal explorer: generates a set of visualization goals

question (hypothesis): what are we trying to understand

visualization: e.g. histogram of miles per gallon

rationale: why should we ask the question above?

164 of 409

Data2Vis => LIDA

visGenerator

generate, evaluate, repair, and execute visualization code

code scaffold constructor: imports dependent libraries and create

function stubs

code generator & executor: fill in TODOs, get visualization

infographer: style generated visualization

165 of 409

Data2Vis => LIDA

results evaluation
visualization error rate: percentage of generated visualizations that does not compile
asks GPT-4 to score output based on empirically agreed metrics

166 of 409

Formalizing Visualization Design Knowledge as Constraints: Draco

167 of 409

Visualization Design Knowledge as Constraints: Draco

visualization research communities often study how people decode and interpret visualizations. However, their findings are slow to be adapted by practical tools
automated design systems do not reuse knowledge built up by their predecessors
warrants a medium for representing and acting upon visualization design knowledge
Draco is a formal model that represents
visualizations as sets of logical facts
design guidelines as hard/soft constraints over these facts

168 of 409

Visualization Design Knowledge as Constraints: Draco

constraint programming:
a constraint program (declarative) is a set of constraints defining relations among several variables that must/should be satisfied by its solutions
constraints restrict values variables can take (represent partial information)
solutions are computed by (off-the-shelf) constraint solvers via inference and search over the constraint space
Draco models visualization knowledge using Answer Set Programming and Clingo as answer set solver

169 of 409

Visualization Design Knowledge as Constraints: Draco

taste of Answer Set Programming

A :- L_1, …, L_n => a rule

A => atoms, proposition that may be true/false

L_i’s => literals, A is true if only all of them is true

bodiless rules encode a fact whereas headless rules impose integrity constraints (cannot all be satisfied)

e.g.

light_on => power_on, not broken

170 of 409

Visualization Design Knowledge as Constraints: Draco

integrity constraint defines how attributes should interact with each other:

:- mark(bar), channel(E, y), continuous(E), not zero(E)

rules out vertical bar charts that does not use 0 as baseline

set of aggregate rules: specifies attributes of a visualization (e.g. mark, encoding)

171 of 409

Visualization Design Knowledge as Constraints: Draco

primary use case:

finding optimal completion of

partially specified

visualization

last slide covered the “search space definition section”
preference model leads us to soft constraints

172 of 409

Visualization Design Knowledge as Constraints: Draco

preference model (soft constraints) learned from data and forms a Markov logic network where constraints are associated with learnable weights indicating preference levels

e.g.

:~ continuous(E), not zero(E). [5]

this says that model prefers to include 0 for continuous fields and that violating this constraint incurs a cost of 5

=>

where n_p_i denote # violations of soft constraint p_i

173 of 409

Visualization Design Knowledge as Constraints: Draco

model trained using RankSVM (learning to rank approach)
dataset: labeled visualization pairs

using the cost function defined on the last page,

(v1, v2, y = sign(Cost(v1) - Cost(v2)))

=> y = -1 if v1 preferred over v2

=> let

x be vector of [n_p_0(v), …, n_p_k(v)],

corresponding weight vector be w

=> Cost(v1) - Cost(v2) = w * x_1 - w * x_2 = w * (x_1 - x_2)

Linear regression with Ridge regularization, minimizes hinge loss

174 of 409

Visualization Design Knowledge as Constraints: Draco

Draco 1 (2018, from UDub Interactive Data Lab)

https://idl.cs.washington.edu/files/2019-Draco-InfoVis.pdf

Draco 2 (2023, from UDub, University of Vienna, University of Maryland, CMU)

https://arxiv.org/pdf/2308.14247.pdf

175 of 409

Experiments with Mac Studio

176 of 409

Last Week: exploring MLX on Mac Studio

references:

https://github.com/ml-explore/mlx

https://github.com/ml-explore/mlx-examples

https://github.com/ml-explore/mlx-examples/tree/main/llms/llama

177 of 409

Last Week: exploring MLX on Mac Studio

ran Mistral-7B, llama2-7B, Llama2-7B-chat successfully with MLX (weights need to be specially converted to fit into the framework)
llama2-70B returned weird output like the following “ensoensoensoenso…” for any attempted prompt
we suspected that something might have gone wrong during the weight conversion step (hardware resources should be sufficient)
upon further inspection

178 of 409

Last Week: llama 2 70B with MLX

179 of 409

Last Week: llama 2 70B with MLX

llama 2 -> MLX weights conversion example source code by Apple

only shard weights seem to be hardcoded
thoughts?

180 of 409

Llama 2 with Hugging Face: quickstart module

allows running inference on

self converted models stored locally
readily converted models stored remotely here:

https://huggingface.co/meta-llama

either approaches require submitting request to Meta and/or Hugging Face

181 of 409

Llama 2 with Hugging Face: quickstart module

Why Hugging Face?
Many blogs online try running llama-2 locally using

https://github.com/ggerganov/llama.cpp

, which is mainly designed for quantized models (requires less computational and storage capacity)

I doubt we are necessarily in a resource scarce environment, why not shoot for the stars?

182 of 409

Llama 2 with Hugging Face: 7B, sample prompt 1

183 of 409

Llama 2 with Hugging Face: 7B, sample prompt 2

184 of 409

Llama 2 with Hugging Face: 13B, sample prompt 1

185 of 409

Llama 2 with Hugging Face: 13B, sample prompt 2

186 of 409

Llama 2 with Hugging Face: 70B, sample prompt 1

187 of 409

Llama 2 with Hugging Face: 70B, sample prompt 2

188 of 409

Llama 2 with llama.cpp

why are we back?

as mentioned in link below, Hugging Face benchmarking tools are deprecated, they recommend to use external tools

https://huggingface.co/docs/transformers/benchmarks

thought about adding timers here and there to measure performance myself in that little module I wrote to run llama 2 models with the hf Transformers library
but, in my opinion, these libraries abstract away so many details that it is hard to know what you are actually measuring
there is an example from AWS where they tried to benchmark running llama 2 with the hf Transformers library on EC2 instances, but from their source code, most latency measurements were done on CloudWatch, a black box impossible to transfer knowledge from

189 of 409

Llama 2 with llama.cpp: 7B

-p: prompt (following llama.cpp main author’s example, shouldn’t matter much)

“Input length is not significant for performance but important for hardware requirements”

source:

https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices

-n: num tokens to predict (reflected in sample runs)
-ngl: any nonzero value enables Metal processing (Apple CUDA)
batch size defaults to 512 (tokens)

190 of 409

Llama 2 with llama.cpp: interpreting console logs

explanation by project collaborator

191 of 409

Llama 2 with llama.cpp: 13B

192 of 409

Llama 2 with llama.cpp: 70B

193 of 409

Llama 2 with llama.cpp: comparison with other experiments

y axis: time spent on generating output (i.e. eval time in llama.cpp console log)
Model: llama-2-7b-chat.Q4_0

=> quantized to 4 bits

(originally 16)

source:

https://towardsdatascience.com/apple-m3-machine-learning-speed-test-f1346e23a1b2

194 of 409

Llama 2 with llama.cpp: comparison with other experiments

source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

unverified: many are claiming online that M3 Max == M2 Ultra
from the bar chart on the previous slide, M3 Max yielding roughly 47 t/s on eval time on the 7B model quantized to 4 bits
on the other hand, we got, on our M2 Ultra, 39.72 t/s on the original 7B model (unquantized, 16 bits)

195 of 409

Llama 2 with llama.cpp: comparison with other experiments

from llama.cpp authors: typical run using M2 Ultra (13B, quantized to 4 bits)
our results: 51.45 t/s on prompt eval, 22.41 t/s on eval time

source:

https://github.com/ggerganov/llama.cpp

196 of 409

Llama 2 with llama.cpp: comparison with other experiments

hardware: M2 Max Mac Studio, 96 GB RAM

source:

https://medium.com/@andreask_75652/benchmarking-apples-mlx-vs-llama-cpp-bbbebdc18416

197 of 409

Llama 2 with llama.cpp: comparison with other experiments

what I worry about:

“Overall latency scales sub-linearly with model size” - from the databricks blog

mentioned in slide 189

from what we are seeing:

70 / 13 = 3.58�209.19 (70B eval t/ms) / 44.62 (13B eval t/ms) = 4.68

198 of 409

Question from last week: suspicious prompt eval speed

from llama.cpp maintainer (llama-2-7b fp16):

https://github.com/ggerganov/llama.cpp/discussions/4167#user-content-fn-2-e6c6be5781f97c64904423a487430321

my local replication:

199 of 409

Question from last week: suspicious prompt eval speed

screenshot from slide 196:

I overlooked that these are results from llama-2-7b
the experiments I ran last week were using llama-2-7b-chat (fine-tuned, yielded 83.42 t/s prompt eval time & 39.72 t/s eval time)
question to ponder: does fine-tuning has a direct effect on prompt evaluation time, or is something else at play?

200 of 409

Optimizing LLM Inference: batching and vLLM

why batch?

trade-off latency for throughput, i.e. process multiple sequences together but at lower speed
optimize memory bandwidth usage: with batch size one, model parameters need to be loaded for each new input sequence => load once and process many

reference: https://www.anyscale.com/blog/continuous-batching-llm-inference

201 of 409

Optimizing LLM Inference: batching and vLLM

naive/ static batching

used by HuggingFace’s Pipelines (their easy-to-use API that supports most models), NVIDIA’s FasterTransformer
batched sequences each generate 1 new token per iteration
here, GPU underutilization happens bc of variation in sequence generation length

202 of 409

Optimizing LLM Inference: batching and vLLM

continuous batching

once a sequence finishes in a batch, a new one takes its place
chellenging given that prefill/ prompt evaluation stage has different computation pattern than token generation => wants a good ratio of

sequences waiting prefill/ sequences still generating

203 of 409

Optimizing LLM Inference: batching and vLLM

vLLM:

created by a UC Berkeley team (Prof. Gonzalez, Prof. Stoica, …)
open-source, can only be ran on linux
“delivers up to 24x higher throughput than [HuggingFace’s Pipelines], 3.5x improvement over HuggingFace Text Generation Inference,” previous SOTA that powers live inference

reference: https://blog.vllm.ai/2023/06/20/vllm.html

204 of 409

Optimizing LLM Inference: batching and vLLM

core problem addressed:

inefficient memory management on KV cache�=> kv pairs of all prev tokens required to generate the next one�=> size is large: takes up to 1.7 GB for a single sequence in llama-13B

=> and dynamic: depends on highly variable sequence length, leads to wastage due to fragmentation and over-reservation

devised solution:

PagedAttention

=> allows kv cache stored non-contiguously in memory by partitioning the kv pairs of each sequence into blocks each dedicated to a fixed number of tokens

205 of 409

Optimizing LLM Inference: batching and vLLM

=> memory wastage only occurs at the last block

=> enables efficient memory sharing by mapping to the same physical block comes in handy when we need to generate multiple output from the same sequence

206 of 409

Optimizing LLM Inference: batching and vLLM

=> baseline: HuggingFace’s Pipelines library, which implements static batching

=> generation length sampled from exponential dist with mean = 128 tokens

=> FasterTransformer (optimized model implementation) actually do pretty well

=> continuous batching clearly yields strong improvement over static, and so does vLLM over the other three on the graph

207 of 409

Optimizing LLM Inference: batching and vLLM

QPS: query per second, expected rate of poisson distribution sampling from

208 of 409

Optimizing LLM Inference: batching and vLLM

takeaway from the two graphs:

when batching is employed, the continuous scheme alleviates the throughput latency tradeoff
as system becomes saturated, continuous batching’s advantage diminishes (harder to inject new sequences?)

209 of 409

MLPerf: trouble we are running in

how do we run it?

hardware vendors usually do this to build trust with customers
reference implementation:
defines representative tasks & workloads (models),
specifies model requirements (e.g. pretrained weights to use), validation datasets, performance & accuracy metrics
hardware vendors create their own implementation based on ref that
uses MLPerf LoadGen API to issue inference queries
uses their own hardware-specific API to process said queries
for us to run the benchmark,
hardware vendor implementations must be available for the desired task/workload
compatible hardware

210 of 409

MLPerf: trouble we are running in

last time, we looked at MLPerf Training results on “GPT-3”

but how? GPT-3, unlike llama, is proprietary
real model under test: Megatron-LM’s GPT-3
by applied deep learning research team at NVIDIA (researches training LLMs at scale)
largely follows OpenAI’s paper, but uses different tokenizer, attention layer, and model parameters
current (v3.1) submitted implementations are for H100, Intel Gaudi2, and Google’s TPU-v5e

211 of 409

MLPerf: comforting news

CTuning (MLCommons founding member, https://ctuning.org/)

created the MLCommons Collective Mind framework
main objective: make it easier to run MLPerf benchmarks across different hardwares
keynote introduced June 2023
typical usage: either with python virtual environment or docker

212 of 409

MLPerf: comforting news

running BERT-large on Mac Studio M2 Ultra

213 of 409

MLPerf: comforting news

comparison with results published by MLCommons (MLPerf Inference: Edge)

https://mlcommons.org/benchmarks/inference-edge/

upon closer review, some hyperparameters seem different
may require tuning to optimize resource utilization
need tools to measure resource utilization to determine potential on pushing inference speed

system	framework	offline (sample/s)	single stream (latency in ms)
Macbook Pro M1	onnx runtime	1.69	597.08
Mac Studio M2 Ultra	onnx runtime	6.48	580.58

214 of 409

MLPerf … on Mac?

215 of 409

MLPerf … on Mac?

216 of 409

MLPerf … on Mac?

217 of 409

MLPerf … on Mac?

218 of 409

MLPerf … on Mac?

link to drawing above

https://docs.google.com/drawings/d/1OeIpVmCSgLQk4JSc1364l8wdFaXFKoEAQIWrftISNCQ/edit

219 of 409

MLPerf … on Mac?

main obstacle

we need a reliable llama 2 model implementation using mlx framework
we did not before when we first surveyed MLX

https://github.com/ml-explore/mlx-examples/tree/main/llms/llama

7b & 13b worked, 70b returns weird output

mlx_lm library released for easy integration with HF models

https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm

seems working but incredibly slow?

https://github.com/ml-explore/mlx-examples/issues/437

220 of 409

MLPerf … on Mac?

discussion with Awni Hannun

top contributor to mlx
research scientist at Apple
Stanford PhD advised by the Andrew Ng (advised by David Blei, Michael Jordan)

221 of 409

MLPerf … on Mac?

model: llama-2-7b-chat

	prompt evaluation (t/s)	token generation (t/s)
llama.cpp	83.42	39.72
mlx	25.455	29.416
llama.cpp/ mlx	3.27	1.35

222 of 409

MLPerf … on Mac?

model: llama-2-13b-chat

	prompt evaluation (t/s)	token generation (t/s)
llama.cpp	51.45	22.41
mlx	14.164	18.662
llama.cpp/ mlx	3.63	1.2

223 of 409

MLPerf … on Mac?

point being …

model: llama-2-70b-chat

	prompt evaluation (t/s)	token generation (t/s)
llama.cpp	11.87	4.78
mlx	1.063	0.201
llama.cpp/ mlx	11.16	23.78

will continue investigating

224 of 409

MLPerf … on Mac?

225 of 409

mlx v. llama.cpp

performance anomaly?

mlx token generation almost as fast as llama.cpp at first run after reboot
performance degradation (seemingly exponential) in the following runs
llama.cpp unaffected by mlx

226 of 409

mlx v. llama.cpp

performance anomaly?

mlx prompt evaluation performance surpassing that of llama.cpp’s
less severe performance degradation
should rerun experiment with larger input (currently seeing large variations)

227 of 409

MLX GPU profile

228 of 409

Llama.cpp GPU profile

229 of 409

Llama.cpp parallelization

230 of 409

MLX representative workload: on reboot perf anomaly

231 of 409

MLX representative workload: on perf with diff loading strategy

232 of 409

MLX v. Metal: on perf with diff loading strategy

233 of 409

experiments below were conducted in�environment:�mlx 0.6.0�https://github.com/MBZUAI-Research-Office/research_muchi�commit: fd2376d405b8fe786fe2851b828e435f40421edb

234 of 409

Almost root to the issue?

How weights loading strategy affects application performance

batch

sep

upon closer examination, matmul performance seems identical across loading strategies

235 of 409

Almost root to the issue?

batch

sep

zooming out …�we notice the space between computations are caused by driver (wire memory) activities making sure that there are physical memory backing to GPU resources consumed in metal code

(

where could 256 MiB come from? Maybe …

size of weights per layer

8192 * 8192 * 4 / (1024 **2)

= 256 MiB

)

https://developer.apple.com/forums/thread/653038

https://developer.apple.com/library/archive/documentation/DeviceDrivers/Conceptual/IOKitFundamentals/DataMgmt/DataMgmt.html

236 of 409

Almost root to the issue?

well … how does that space/ duration caused by driver activity changes upon reboot?

for the first 2 seconds (after the computation starts), there were still plenty driver (wire memory) activities

237 of 409

Almost root to the issue?

well … how does that space/ duration caused by driver activity changes upon reboot?

after that, however, driver (wire memory) activities ceased => lowering the average and thus making the first run post reboot seem much faster than subsequent runs

space between computations diminishes along with decrease in driver (wire memory) activities!

238 of 409

Almost root to the issue?

2nd run post reboot

3rd run post reboot

What does this mean?�The performance degradation we observed from sep loading strategy can also be accounted by these driver (wire memory) activity

239 of 409

Sidenote: how does matmul in mlx work under the hood?

https://drive.google.com/file/d/1yEw00X1hdEMYM8pDyHSdnkd-_nYYl4iN/view?usp=drive_link

240 of 409

DBRX: distributed inference with M2 Ultras

241 of 409

What is DBRX?

transformer-based decoder-only LLM (like Llama-2 & GPT-4) that was trained using next-token prediction
fine-grained mixture-of-experts (MoE) architecture with 132B total parameters of which 36B parameters are active on any input (more on this later)
other prominent MoE models: Mixtral 8x7B, Grok-1
uses GPT-4 tokenizer, rotary position encodings, gated linear units, and grouped query attention (similar to Llama-2)
pretrained on 12T tokens (carefully curated and substantially helped improve model quality) with max context length: 32768 tokens (synonymous to max sequence len)
trained on 3072 NVIDIA H100s connected by 3.2Tbps Infiniband

reference: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

242 of 409

DBRX: inference efficiency

setting:

NVIDIA TensorRT-LLM (open-source library, also used in their MLPerf implementation) at 16-bit precision
input prompt: 2000 tokens, output: 256 tokens�(per response)
No mention of hardware specs?
seems faster than model-size would suggest due to its MoE architecture (more on this later!)
Mixtral has higher t/s because its model is smaller

243 of 409

What is MoE?

TL;DR

MoE models

are pre-trained much faster & have faster inference�compared to dense models (where all the parameters are used for all the inputs) with the same number of params
still, require high VRAM as all experts are loaded in memory
face many challenges in fine-tuning such as overfitting, but recent work with MoE instruction-tuning is promising

reference: https://huggingface.co/blog/moe

244 of 409

What is MoE?

dense: refers to the feed-forward layer, which is really just linear combination followed by activation (in Llama-2: multi-layer perceptron)

traditional transformer encoder block

reference:

https://medium.com/towards-data-science/transformers-explained-visually-part-1-overview-of-functionality-95a6dd460452

https://github.com/ml-explore/mlx-examples/blob/main/llms/mlx_lm/models/llama.py

245 of 409

What is MoE?

traditional transformer

transformer with MoE

vs

the traditional/ dense feed-forward network (FFN) layer is replaced by a

sparse MoE layer that operates independently on the tokens in the sequence

- the router/ gate network decides which expert the token goes

- output of MoE layer = router gate value * output of selected expert

246 of 409

What is MoE?

what does this remind you of?

ensemble methods (e.g. random forest)
MoE layer is a system composed of separate networks/ experts that each specializes in a different region of the input space

in addition,

each expert is usually a FFN
we can send a token to more than one expert
Mixtral 8x7B and Grok-1 have 8 experts and choses 2
DBRX have 16 smaller experts are chose 4 (which is what they mean by fine-grained) => allows 65x more combinations
the router is composed of learned parameters pre-trained at the same time as the rest of the network

247 of 409

So, how do we run distributed inference with DBRX?

example:�transformer encoder block from Google’s GShard�- attention weights are replicated across devices�- each device is responsible for a FFN/ expert (model parallelism: model partitioned across nodes)

reference: https://arxiv.org/abs/2006.16668

248 of 409

Requirements to run DBRX

132B params, each param being a bfloat16�=> 132 * 2 (bytes) = 264 * (10 ^ 9) = 264 GB�=> however, DBRX’s github repo README recommends having at least 320 GB of memory��right now, model implementations are available in�- TensorRT-LLM (requires CUDA)�- vLLM (requires CUDA)�- MLX (requires Apple Silicon, focus, commitment, and sheer will)

interestingly,�HuggingFace’s transformer library integrates an experiment project called PiPPy that aims to enable easy pipeline parallelism (where each node gets a portion of the model’s layers) for pyTorch => solves the memory problem, but seemingly little performance benefits�https://github.com/pytorch/PiPPy/

249 of 409

Potential Strategy with M2 Ultras

connect M2 Ultras with thunderbolt (should provide up to 40 Gbit per sec throughput)
communication between nodes via IP
each Mac Studio has 6 thunderbolt ports, which should enable up to a fully-connected cluster of 7 nodes
mimic GShard architecture by putting each shard in a container
communication between containers orchestrated by kubernetes

250 of 409

M2 Ultra Cluster Communication latency

recap:

each of our Mac Studio M2 Ultra has 192GB of memory, but DBRX inference requires 264GB (Databricks even recommends to have at least 320GB of memory)

our goal is to run distributed inference on a cluster (2 - 8 nodes), which should provide us more than enough computation & memory resources. However, we need to make sure that communication between nodes does not become a bottleneck

251 of 409

what information is being exchanged?

DBRX has …

40 layers (transformer decoder blocks)
input/output to each layer during inference is an 1x6144 float16 vector
one may ask … where is the autoregressive-ness => hidden in KV cache
under expert parallelism, 2 * 4 * 2 * 6144 / 1024 = 96 KiB of data is being exchanged per layer

L_0

L_39

ATT

FFN_0

FFN_1

FFN_15

.

…

router

B 6144

L_38

to 0, 4, 7, 15

A

6144

C 6144

+

B 6144

choosing 4 from 16 experts

float16

dispatch/ aggregate

252 of 409

M2 Ultra Cluster Communication latency

4 nodes (Thunderbolt with star topology, Ethernet with switch)

Protocol:

gRPC

(

avg latency includes serialization and deserialization time

)

253 of 409

DBRX: distributed inference with M2 Ultras

recap

running DBRX warrants at least 264GB of memory, yet a single M2 Ultra is only equipped with 192GB of memory
we cannot run it on one node, but any cluster of size >= 2 would do
we did a small feasibility study to ensure cluster communication overhead would not place an unreasonable performance cap
running DBRX inference in a distributed setting requires model rewrite

254 of 409

DBRX: distributed inference with M2 Ultras

our progress so far

finished rewriting DBRX’s MLX implementation to enable expert parallelism on a M2 Ultra cluster of >= 2 nodes
observed sub-par performance of
2 t/s for prompt evaluation
0.5 t/s for token generation
profiled performance with instruments to locate bottlenecks

where did we get stuck? (like for quite a while)

pre-processing weights with MLX leads to unexpected data corruption�mx.split() followed by mx.savez()
Therefore, we switched to pyTorch since it supports both safetensors and bfloat16.�This implementation, however, is slower because everything is ran on CPU (currently, MPS backend does not support bfloat16)

255 of 409

DBRX: distributed architecture design

256 of 409

DBRX: distributed inference with M2 Ultras

0.5 t/s performance breakdown

each token takes 2 seconds to process
DBRX has 40 layers, which means each layer takes, on average, ~50 ms
in each layer,

attention + router + weighted_sum usually takes ~7 ms (including driver activity that checks for wired memory)
expert calculation + communication (~30 ms), and
waits/stalls ~13 ms (needs further verification)

on average, 2 experts are assigned to each moe_shard per layer

one layer

performance view from Instruments

expert calc, comm

wait/stall

257 of 409

DBRX: distributed inference with M2 Ultras

optimization strategies

improve weights loading strategy to avoid driver activity�should reduce ~21 ms per layer
figure out and eliminate cause of stalls/ wait
parallelization within node so that all 4 chosen experts can be executed at the same time
currently, expert calculation within node is sequential
ALU utilization is ~30%

258 of 409

updates from 4/30 PAS lab meeting

https://docs.google.com/presentation/d/16Vp4hrvRlbK0NnqzO-yj2YpApBO_j98AFnqd5bsMIac/edit#slide=id.g2d0db823531_0_480

259 of 409

DBRX: distributed inference with M2 Ultras

latest employed optimization strategies

expert parallelization within nodes:
before: 4 nodes, 4 experts per node, experts ran sequentially
now: …, …, experts ran in parallel by spawning independent processes
[NEEDS WORK]: deeper performance analysis
improved weights loading strategy within expert (i.e. process in setting above)
before: 40 layers, 3 weight arrays per layer, loaded to memory separately (total size = 40 * 3 * 10752 * 6144 * 2 = roughly 15.85 GB)
now: …, …, 15.85 GB of weights loaded to memory in one chunk
[NEEDS WORK]: currently, every expert is busy in every layer to reap the benefits brought by above weights loading strategy => dense model : (

260 of 409

DBRX: distributed inference with M2 Ultras

model performance

workload: batch size = 1, input = 20 tokens, output = 64 tokens

	prompt evaluation	token generation
week_0:�baseline	1.3 t/s	0.6 t/s
week_1:�with expert parallelization, weights loading in one-chunk, dense	3 t/s	2.4 t/s

261 of 409

05/09/2024 weekly report

outline

detailed performance breakdown between versions
v2 & v3 architecture design

262 of 409

DBRX distributed inference architecture design

263 of 409

DBRX distributed inference performance breakdown

	per layer			per prompt
	attention + router	MoE	wait / stall	token generation t/s
v1, �- baseline	~7 ms	~30 ms	~13 ms	~0.5 t/s
v1.5, - load each expert’s weights as one-chunk - dense	~ 4 ms	~3.5 ms	~3 ms	~2.4 t/s
v2, - everything from v1.5 - earlier driver eval - custom serialization	~1.5 ms	~3.5 ms	~3 ms	~3.1 t/s
v3, - everything from v2 - architecture redesign - ½ communication - ½ expert computation	estimated: ~1.5 ms	estimated: ~2 ms	estimated: ~1.5 ms	estimated: ~5 t/s

filename: dbrx_poc_one_custom_serialization

filename: dbrx_poc_batch2

filename: dbrx_poc_perf

264 of 409

Obstacles encountered during v3 development:

driver activity surfacing from idleness

sleep in ms	incurs heavy driver activity
1	False
10	False
100	False
500	True
1000	True

driver activity here refers to:

time driver spent on wiring down sufficient physical memory for GPU computation

where does idleness come from (more on this next page):

wait during all-reduce (the shard that finishes first has to wait for the one that finishes last)

observation from v3 MoE layer design (4 chooses 2 experts activated per shard):

waiting/sleeping between layers causes lengthy driver activity to surface

when sleep = 1000 ms, driver processing takes ~400 ms for a ~1 ms GPU computation

265 of 409

How idleness becomes a problem in v3

driver activity is unavoidable during first layer, and its processing time has high variance
these long waits, as shown in the previous page, trigger more driver activity => inescapable loop!
for reasons unknown, the last shard to finish also tend to wait before next layer’s calculation
for reasons unknown, moe design from v2 seems to mitigate this disturbance pretty well

266 of 409

introducing v3.2,

our first step towards:

- streaming MoE processing (beneficial during prompt evaluation and when input batch size > 1)

- dynamic load balancing (beneficial even when batch size = 1, increases overlap between communication and computation during streaming)

267 of 409

new trouble with v3.2: needs warming up!

what do we mean by warm up?

sync computation time to reduce driver activity

268 of 409

distributed DBRX v3 performance evaluation

	num GPUs	price in USD per GPU	communication	fp16 TFLOPS per GPU	bfloat16 TFLOPS per GPU	single user throughput (t/s)
M2 Ultra	4	~8,500 ?	10 GbE switch	~54	???	~6
L40S	8	~9,287	PCIe Gen4 x16: 64GB/s bidirectional	~362	~362	~60 ?
H100	4?	~31,710	nvlink: 900GB/s PCIe Gen5: 128GB/s	~1979	~1979	~115 ?

L40S : https://resources.nvidia.com/en-us-l40s/l40s-datasheet-28413?_gl=1*twh71g*_gcl_au*Mzk5MjAyNDU3LjE3MTcwNDU3MjY.

H100: https://www.nvidia.com/en-us/data-center/h100/

60 t/s: https://crusoe.ai/blog/achieving-60-t-s-with-dbrx-in-15-minutes/

115 t/s: https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm

269 of 409

Evaluation Goal:

A Performance Model that Takes Compute Power into Account

270 of 409

DBRX on A100: how fast is inter-GPU communication in terms of latency?

hardware setup: 1 node, 4 GPUs, 4 x 80 = 320GB memory, 600 GB/s GPU interconnect through NVLink

framework: TensorRT-LLM, tensor parallelism

communication pattern:

1 x AllGather after all 40 layers

to piece together the generated token

1 token

1 layer

2 x AllReduce:

once before attention,

the other before MoE

271 of 409

DBRX on A100: how fast is inter-GPU communication in terms of latency?

communication latency breakdown:

compared to us:

operation	n times per token	avg latency in microsecond(s)	total latency per token in microsecond(s)
AllReduce	2 * 40 = 80	18	1,440
AllGather	1	20	20
total:			1,460

hardware setup	total communication latency per token in microsecond(s)
A100 (1 node, 4 chips)	1,460
M2 Ultra (2 node, 2 chips)	60,000

272 of 409

DBRX performance model

since H100 has 8 GPUs in 1 node, we estimate its comm latency to be 2x of A100’s

predicted / realized is roughly 1.6 for each row

273 of 409

Short-term Future Roadmap

resources available

several 4090s, configurable to single or multi node (are we interested in exploring RDMA NICs?)

topics to explore

efficient LLM deployment on lower-end Nvidia hardwares
would like to start from inference -> fine tuning (requires data, algorithmic knowledge) -> training (requires data, compute, and algorithmic knowledge)
from our past survey, inference with large MoE models under a resource constrained setting is relatively under-explored (efficiency with higher grade hardwares is hard enough already)

274 of 409

Related Work: on insufficient memory problem

DeepSpeed Inference (Microsoft, June 2022)

mentioned in last week’s lab meeting (linked below)
proposes a series of optimization called ZeRO-Inference
offloads all weights to CPU memory (DRAM) or NVMe (SSD), copy to GPU memory only when needed
overlaps weights prefetching with previous layers’ computation to hide memory copy latency
parallelizes prefetching across multiple GPUs and leverage faster than PCIe interconnect to all-gather the weights

https://arxiv.org/pdf/2207.00032

https://docs.google.com/presentation/d/1zh_EH8Xs4QQnHGb35uVuJvd-7-waKei054ZT8vOVqJc/edit?usp=sharing

275 of 409

Related Work: ZeRO-Inference

problem:

prefetching is difficult to achieve for MoE models because you don’t know which experts you need until the current layer’s router makes its decision

=> each expert’s weights are spread across all model layers (�for Mixtral8x7b, expert_0’s weights at layer_0 is�7B / 32 layers = 0.22B, which is 0.22B * 2 bytes = 440 MB)

=> you could prefetch every expert’s weights at the next layer and it would still fit in GPU’s memory (Mixtral8x7b: 8 experts x 440 MB = 3.5 GB, RTX 4090 has 24 GB memory)

=> however, wastes memory space and hard to scale (what if we are using smaller GPUs? Also, the current trend is to use more experts)

276 of 409

Related Work: on efficient MoE inference

Towards MoE Deployment: Mitigating Inefficiencies in MoE Inference (Meta AI, June 2022)

Expert Buffering: only cache a small number of experts in GPU memory, and the rest of the experts’ weights are offloaded to CPU memory. Expert weights copied to GPU memory when there is a cache miss. LIFO eviction strategy
In the context of expert parallelism: expert placement driven by historical data for better load balance. Frequently selected experts should reside on different GPUs

https://arxiv.org/pdf/2303.06182

277 of 409

Related Work: Towards MoE Deployment … by Meta AI

questions/ problems

loads an entire expert (all layers)? Otherwise, only caching an expert’s weights at the current layer is useless for the next layer
performance comparison with loading the selected experts’ layer weights just in time?
their load balancing is useless if model training already accounts for expert selection imbalance (e.g. DBRX)

278 of 409

Related Work: on efficient MoE inference

Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference (Ohio State University, Jan 2024, https://arxiv.org/pdf/2401.08383)

1. context coherent expert parallelism

expert parallelism under a centralized design warrants 2 all-to-all operations per MoE layer: once when router dispatches tokens to selected experts and another to aggregate expert outputs for weighted sum

our decentralized design replicates attention & router weights across nodes to allow independent but identical non-MoE calculation and thus eliminates the need for router dispatch

their decentralized design also replicates weights but removes repeated calculation to allow concurrent handling of multiple requests. However, this requires additional KV-cache syncing after each forward pass/ generation of one token

279 of 409

Related Work: Inter-Layer Expert Affinity … by OSU

how does context coherence work?

before each forward pass during token generation phase, perform an all-gather to replicate/sync every request’s KV-cache on every GPU
this allows attention & router to be done anywhere after expert computation, eliminating the need to send local expert outputs back to original dispatcher

centralized

decentralized through context coherence

280 of 409

Related Work: Inter-Layer Expert Affinity … by OSU

problems with context coherence

useless when the model selects > 1 experts every layer.
although attention & router can be done on any GPU, you still need to acquire expert outputs from other GPUs. In other words, an all-reduce is still needed

so, doesn’t really reduce the number of all-to-all operations to 1!

Running a MoE model with 4 experts, parallelized on 2 GPUs. Each layer chooses 2 experts.

281 of 409

Related Work: Inter-Layer Expert Affinity … by OSU

2. expert affinity

argument/ observation: given that a token is routed to expert_i at the current layer, certain experts exhibit a higher probability of being selected next
the goal is to place weights strategically according to inferred expert selection path to minimize communication (i.e. you don’t need to dispatch tokens to other machines if the next selected experts’ weights are available locally)

problems

requires training to deduce “expert affinity.” Needs to retraining for models with different architecture (e.g. more experts/ increased granularity)
their conditional probability model assumes Markov property (next state only depends on the current state) and is inapplicable to cases where multiple experts are selected per layer
this problem becomes harder with more complex MoE design

282 of 409

Related Work: on efficient MoE inference

FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models (Tsinghua University, April 2022, https://dl.acm.org/doi/pdf/10.1145/3503221.3508418)

dynamic expert shadowing: difference in expert popularity leads to load imbalance when adopting naive expert parallelism. When an expert is overloaded, we can replicate its weights on other GPUs to distribute the work. When data parallelism is also employed, this also eliminates the need to all-dispatch input tokens.

283 of 409

Related Work: FasterMoE by Tsinghua

problems with dynamic expert shadowing

replicating model weights introduces a lot of overhead (even more so during training because of the need for gradient update)
FasterMoE addresses these challenges by deciding whether to shadow experts during runtime (for every training iteration/ gradient update cycle). Experts are only shadowed when it takes longer to transfer input tokens than weights (could occur with large batch sizes) OR when time saved on computation is more than the added communication latency
if applied to model inference, this mechanism would seldom be activated since batch size is unlikely to be that big

284 of 409

Related Work: FasterMoE by Tsinghua

communication and computation overlap

well, we do this as well with DBRX
difference between B and C below is that FasterMoE adjust scheduling depending on mini-batch size: schedules longer communication earlier to allow more overlap
we didn’t have the adjustment above because our group size is always 1

S: send

C: compute

R: receive

285 of 409

Related Work: on efficient MoE inference

DeepSpeed-MoE (Microsoft, July 2022)

Expert + Tensor Parallelism for MoE block (configuration could depend on whether GPUs reside on the same node)
Data + Tensor Parallelism for non-MoE components

https://arxiv.org/pdf/2201.0559 6

286 of 409

Related Work: DeepSpeed MoE

hierarchical all-to-all operations (within node first, between nodes later)
parallelism coordinated communication

Step 1

Step 2

Step 3

287 of 409

Categorizing Existing Optimizations

an important observation is that expert parallelism creates a lot of inconveniences.

overcoming GPU memory constraint		load balance when using expert parallelism		minimizing communication in parallel systems
expert buffering	expert + tensor parallelism	evenly distributing hot experts	dynamic expert shadowing	context coherence expert parallelism	expert affinity	parallelism coordinated communication	computation communication overlap

288 of 409

Proposing a new research direction

tensor parallelism is suitable for our target environment (lower-end hardwares). Before each layer’s computation starts, tensor parallelize every expert’s weights on all available GPUs (e.g. every GPU possesses a slice of every expert).

why?

load balance wise: no issue here at all. Computation is distributed evenly across the entire parallel system.
communication wise: only requires 1 all-to-all operation per layer, same as expert parallelism
memory constraint wise: offload to CPU memory and even SSD. Load weights by layer to GPU memory in streaming fashion (start pre-fetching many layers ago). No need to guess through some fancy tricks what experts will be selected in the current layer

289 of 409

Proposing a new research direction: Feasibility Analysis

DBRX has 16 experts and 40 layers.

In each layer, the 16 experts’ weights total 2 * 16 * 3 * 6144 * 10752 = 6.35 GB
assuming tensor parallelism level of 4, each GPU needs to load�6.35 / 4 = 1.59 GB of weights each layer
according to our preliminary experiments, our machine can achieve 25 GB/s bandwidth copying data from CPU to GPU memory
from 2 & 3, loading each layer’s expert weights to GPU takes�1.59 / 25 = 0.0636 seconds = 63.6 ms
assuming that our target inference throughput is 5 t/s during output generation stage, we need to complete each layer in 1000 / 5 / 40 = 5 ms
from 4 & 5, we need to start pre-fetching 63.6 / 5 = 13 layers before
that means, our GPU needs to be able to buffer 13 layers of weights, which is 13 * 1.59 = 20.67 GB of weights, which fits inside our RTX 4090 (24GB memory)

290 of 409

running DBRX with tensor parallelism on 4 GPUs. In each layer, each GPU needs in its memory a slice of all 16 experts, which totals 1.59 GB. Here, we assume a target token generation throughput of 5 t/s, which means each layer’s compute + communication takes 5 ms.

291 of 409

Weekly Updates

Some company has kindly lent us their 8 x A100 40GB server

access guide:

install openVPN client: https://openvpn.net/client/
get VPN profile / certificate & ssh credentials from me
right now, we have 2 ongoing efforts:
profiling DBRX’s performance on A100 using TensorRT-LLM & expert / tensor parallelism (comparison with results we got from MBZUAI’s 4 x A100 80GB setup?)
running Mixtral8x7B 16-bit on 1 x RTX 4090 24GB through offloading weights AND computation to CPU. Later, I would like to combine this with expert / tensor parallelism on 4 x 4090s.
earlier this week, I mentioned in our group chat that Fiddler by UDub looks like a scam: https://arxiv.org/abs/2402.07033
however, from my experiments, I was able to finish one Mixtral8x7B’s MoE layer (2 selected experts) in 5.6 ms => actually quite promising
my approach: combine different good ideas and see where to go from there

292 of 409

Prospective Projects on A100 server

1. DBRX 132B & Mixtral8x22B

inference performance tuning & profiling in the context of

expert + tensor parallelism
multiple concurrent users�

2. Google’s Switch-MoE (2048 experts 1.6T params)�inference performance tuning & profiling in the context of

weights offloading to CPU memory / SSD
predict expert activation
expert compression through unstructured pruning / mixed precision quantization
expert buffering (i.e. treating GPU memory as cache)

Explore model parallelization techniques that accommodates usage of heterogeneous accelerators

offloading computation to CPU

293 of 409

Prospective Projects on A100 server

3. fine-tune pre-trained MoE models’ routing network

to overcome difficulty imposed by dynamic expert activation on weights pre-fetching�

4. Llama-3-70B

inference / fine-tuning performance profiling & optimization

Comparison with experiences on MoE: What techniques are transferable?�

5. Llama-3-405B

inference / fine-tuning performance profiling & optimization

Is there any special performance issues related to running monster-scale dense models in a resource constrained environment?
How is it different from running a >100B MoE model on 4x RTX 4090 24GB?
How do we formulate our approaches with different-sized models on different-sized machines as a methodology?

294 of 409

Weekly Updates

issues encountered with 8 x A100 40GB server access
unstable SSH connection (terminal often freezes, connection timed out)
insufficient network bandwidth (download speed can be < 100 KiB/s)
DBRX progress on A100
just got it running today
experimenting with different parallelism strategies for performance analysis
Mixtral8x7B on RTX 4090
experimenting HtoD memcpy, CPU compute, and GPU compute overlap (using the model’s MoE weights and workload)
profiling with nsight systems to confirm feasibility of our approach
also actively re-writing Mistral’s open-sourced inference engine
phase 1 targets single GPU, will expand to multi GPUs after this POC

295 of 409

Weekly Updates (08/29/2024)

Phi3.5 MoE Inference

preliminary performance analysis on 4xRTX4090-24GB

DBRX Inference

token generation throughput comparison between 8xA100-40GB and 4xA100-80GB

Llama3.1 8B Inference (first step towards training)

throughput with different batch sizes on 1xRTX4090-24GB

Mixtral8x7B Inference

POC v0 done:�1. runs on 1xRTX4090-24GB, expert weights and compute on CPU, self-attention & router weights and compute on GPU (https://github.com/muchi674/ntu_paslab_llm)

2. the following versions will focus on better utilizing the GPU

3. performance analysis: TBA

296 of 409

Recap on Phi3.5-MoE

Author: Microsoft

Release Date: 08/22/2024

Architecture: 16x3.8B parameters in BF16 with 6.6B active parameters when using 2 experts. The model is a mixture-of-expert decoder-only Transformer model using the tokenizer with vocabulary size of 32,064.

Context length: 128K tokens

GPUs: 512 H100-80G

Training time: 23 days

Training data: 4.9T tokens

https://huggingface.co/microsoft/Phi-3.5-MoE-instruct

297 of 409

Phi3.5 MoE Inference on 4 rtx4090

- CPU to GPU memory copy throughput: ~10 GB/s for each GPU

- token generation throughput (by naive model parallelism, or putting groups of model layers to different GPUs):

batch size	token generation throughput (t/s)
1	12
2	19
4	30
8	48

Note: pipeline parallelism has not being applied

298 of 409

Phi3.5 MoE Inference on 4 rtx4090

illustration of naive model parallelism by using Nsight systems:

=> problem of under-utilization is apparent

299 of 409

DBRX Inference Performance Comparison

	4xA100-80GB (MBZUAI)	8xA100-40GB (borrowed)
token generation throughput in t/s, when batch size = 1	~57	~85

What could be causing this difference?

- with batch size this small, performance is memory-bound (depends on how fast we can load model parameters from GPU memory to cache)

- both systems above are equipped with NVLink, meaning that GPU interconnect is very cheap (all-to-all operations are ~20 microseconds)

- the system with 8xA100 has a higher aggregate memory bandwidth, and the performance gain from breaking down a memory-bound computation to more workers outweighs the additional communication latency

results below are obtained using tensor parallelism

300 of 409

Llama3.1-8B Inference Performance Analysis

results above are obtained from 1xRTX4090-24GB

questions to consider:�token generation throughput growth seems to be plateauing faster than that of prompt evaluation. Why?

Batch Size	Evaluation TP (t/s)	Geneneration TP (t/s)
1	53.3759	46.75
2	103.3154	88.5
3	139.8756	130
4	193.2283	168
8	380.4522	288
12	595.3785	402
16	755.9556	462
32	1392.01663	624
64	2499.4972	760
128	OOM	OOM

301 of 409

Weekly Updates (09/05/2024)

Mixtral8x7B Inference on 1xRTX4090-24GB

- performance analysis on 4 versions of implementation

for all versions below, self-attention & router are on GPU

v0 (CPU experts): experts weights and computation on CPU
v1 (GPU experts): experts weights on CPU, memcpy to GPU for computation when selected, no cache
v2 (static collaboration): 7/8 expert weights on CPU, 1/8 on GPU. Computation happens where the weights are
v3 (dynamic collaboration): Same as v2, but memcpy weights to GPU for when communication cost < computation savings (done with 2 priorly determined hyper-parameters).

302 of 409

v0 (CPU only) performance analysis

batch size	prefill throughput (t/s)	decode throughput (t/s)
1	65.02	3.83
2	92.48	4.71
4	130.3	6.15
8	172.25	9.58
16	219.65	15
32	220.84	26.94
64	216.37	50.25
128	225.25	78.28

workload: ~110 inputs tokens, generate 128 tokens

303 of 409

v1 (GPU only) performance analysis

batch size	prefill throughput (t/s)	decode throughput (t/s)
1	38.19	1.27
2	82.08	1.48
4	159.35	1.9
8	298.14	2.9
16	537.48	5.26
32	924.12	10.36
64	1503.02	20.6
128	2262.64	40.89

workload: ~110 inputs tokens, generate 128 tokens

304 of 409

v2 (static collab) performance analysis

batch size	prefill throughput (t/s)	decode throughput (t/s)
1	74.61	4.28
2	106.83	5.3
4	142.83	7.03
8	205.12	10.62
16	234.57	17.42
32	256.97	34.62
64	245.89	61.04
128	248.88	94.77

workload: ~110 inputs tokens, generate 128 tokens

305 of 409

v3 (dynamic collab) performance analysis

batch size	prefill throughput (t/s)	decode throughput (t/s)
1	65.26	4.02
2	113.71	5.06
4	217.58	6.37
8	306.86	7.74
16	538.42	12.06
32	925.01	23.45
64	1503.23	40.15
128	2259.6	69.74

workload: ~110 inputs tokens, generate 128 tokens

306 of 409

307 of 409

308 of 409

309 of 409

310 of 409

311 of 409

312 of 409

TODOs

grouping similar requests in the same batch. Similar as in activates the similar experts
using “second-best” experts that reside on GPU. Assuming that the model uses top-2, but top-1st & 3rd on GPU while top-2nd on CPU, what’s the accuracy penalty for using the top-3rd expert?
while employing both prefetching and caching, how do we analyze cache miss information in time to correct the pre-fetching strategy?
what’s the optimal cache hit rate assuming that you know which experts are going to be selected in the future? (for different cache sizes)
what’s the best eviction strategy? (Moscow uses LRU, Meta AI uses LIFO)
with large batch sizes, the expert activation statistics from previous requests might be more applicable to the next one (wouldn’t we see a uniform distribution with batch sizes that are too big?)
further investigation into expert’s “temporal locality”. How do we exploit this to achieve optimal static expert placement?

313 of 409

Weekly Updates (09/05/2024)

Llama3.1-70B-Instruct Inference

goal: optimizing performance on a single / cluster of RTX-4090s (24GB memory)
first step: an up and running implementation => v0_cpu_only (everything: attention & ffn is done on the CPU)
before deciding what work to offload to the GPU, we could profile the CPU only version to understand the model’s performance characteristics during inference
related work (https://infini-ai-lab.github.io/Sequoia-Page/):�by CMU, Together AI, Meta AI�published in Feb, 2024

achieves 1.75 t/s token generation throughput on a single RTX-4090 with llama-2-70B,� leverages techniques from speculative decoding, uses llama-2-7B as draft model

our POC version achieves 0.64 t/s during token generation

314 of 409

Briefing on llama3.1-70B-Instruct

dtype: bfloat16 (2 bytes)

n_layers: 80

vocab_size: 128,256

d_model (embedding size): 8,192

n_attention_heads: 64

n_kv_heads: 8

ffn_hidden_size: 28,672

per layer weights size (in GB):

attention: ((8192 * 64 * 128 * 2) + (8192 * 8 * 128 * 2)) * 2 / 1000^3 = 0.3
ffn: 8192 * 28672 * 3 * 2 / 1000^3 = 1.4

(the remaining weights are mainly for the embedding layer & model output projection)

315 of 409

From past experiments:

model: llama-2-70b-chat

	prompt evaluation (t/s)	token generation (t/s)
llama.cpp	11.87	4.78
mlx	1.063	0.201
llama.cpp/ mlx	11.16	23.78

316 of 409

Analysis: Decode Stage

- workload: each input prompt contains ~100 tokens�- when batch size is large enough, longer output length (n_gen_tokens) start to create noticeable pressure during attention calculation

317 of 409

Analysis: Prefill Stage

- as observed with other models, latency increases sub-linearly with respect to batch size

318 of 409

Performance Comparison

319 of 409

Mu-Chi’s Updates (10/02/2024)

Llama-3 inference on a single RTX-4090 with EAGLE speculative decoding

(on a sidenote: llama3.2 released models with 1B, 3B, 11B, and 90B parameters)

outline:

brief intro to the EAGLE speculative decoding framework
experiment results with llama-3-7B
experiment results with llama-3-70B (by modifying EAGLE to enable offloading)

320 of 409

Recap: the gist of speculative decoding

speculative decoding workflow:

1. draft / smaller model generate multiple draft tokens based on input sequence

2. target / bigger model processes input sequence + all draft tokens in parallel in a single forward pass

https://hackmd.io/@shaoeChen/rJESTVr40

321 of 409

What is EAGLE about?

premise: good draft model is hard to find (should use architecture, training data, and tokenizer similar to that of the target model)

1. EAGLE trains its own light-weight draft model (0.24B for Llama-2-7B, 1B for Llama-2-70B, 0.28B for Mixtral8x7B. The 1B draft model was trained in 1-2 days on 4x A100 (40G) GPUs)

2. the draft model performs autoregressive inference on the feature level (as opposed to the token level)

3. grows a dynamic tree of draft tokens to increase the number of accepted tokens

322 of 409

What do you mean by feature level?

eagle argues that “autoregression at the feature level is simpler than at the token level [since] feature exhibits more regularity than tokens, which are transformations of natural language”

323 of 409

EAGLE draft model architecture

324 of 409

What do you mean by tree of draft tokens?

tree size: top_k + depth * topk ^ 2�# forward passes to construct tree: 1 + depth

325 of 409

llama-3-8B inference on a single RTX 4090 with EAGLE

performance comparison (end to end)

https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

generation cycle latency breakdown

vanilla (t/s)	with Eagle (t/s)	speedup
54.34	116.36	2.14x

drafting (ms)	target model verification forward pass (ms)		misc. (ms)
12.57	22.99		~2

tree:�top-k = 5,�depth = 10,

size = 5 + 10 * 5 ^ 2 = 255

# draft model forward passes: 10

116.36 / (1000 / (12.57 + 22.99 + 2))

= 116.36 / 26.62

= 4.37 tokens per cycle

326 of 409

llama-3-70B inference on a single RTX 4090 with EAGLE & weights offloading

- original EAGLE framework only support scenarios where both the draft and the target model fit on GPU memory (allows multi-GPU)

- I modified EAGLE to store the draft model on GPU and memcpy the target model layer by layer from CPU to GPU memory during forward pass (no cache. all computation happen on GPU).

- I mimicked Sequoia’s “optimized offloading engine” to allow overlap between memcpy and GPU computation

327 of 409

Offloaded Eagle

overlapping:

1. drafting and target model first layer’s memcpy

2. current target model layer’s computation and next layer’s memcpy

(uses 2 cuda streams, pinned memory, and double buffering on GPU memory)

328 of 409

llama-3-70B inference on a single RTX 4090 with EAGLE & weights offloading

performance comparison (end to end)

https://arxiv.org/pdf/2402.12374

generation cycle latency breakdown

1x L40 48GB		1x RTX 4090 24GB
DeepSpeed-Zero-Inference: pure offloading (t/s)	Sequoia: 7B draft model (t/s)	Offloaded Eagle: 1B draft model (t/s)
0.18	1.79	0.79

drafting (ms)	target model verification forward pass (ms)		misc. (ms)
33.28	5,235		~4.5

tree:�top-k = 5,�depth = 10,

size = 5 + 10 * 5 ^ 2 = 255

# draft model forward passes: 10

0.79 / (1000 / (33.28 + 5235 + 4.5))

= 0.79 / 0.19

= 4.16 tokens per cycle

329 of 409

Moving Forward: Speculative Decoding

recap

- during the last few weeks, we have been experimenting with speculative decoding as a way to accelerate LLM inference without degrading model accuracy in offloading settings where performance is often bottlenecked by CPU to GPU PCIe bandwidth

- last time, we introduced EAGLE, a speculative decoding framework that trains its own draft model instead of using one off the shelf

- By enabling offloading with basic optimizations such as delegating FFN computation to the CPU, we were able to achieve ~1.75 t/s throughput with EAGLE when running Llama-3-70B with a 1B self-trained draft model on 1x RTX-4090 24GB, which is on par with Sequoia’s results with Llama-3-8B as the draft model

330 of 409

Moving Forward: Speculative Decoding

recap

- this made us wonder: can we run Llama3.1-405B on a 1x RTX-4090 24GB?

- to do so, we need to:� 1. store weights in SSD & CPU memory => more PCIe limitations

2. incorporate hierarchical speculative decoding: big / medium / small models

- but before we rush into things, we should see whether others have already implemented this idea (spoiler alert: yes)

331 of 409

Moving Forward: Speculative Decoding

today’s outline

- brief introduction to related work:� 1. Cascade Speculative Drafting for Even Faster LLM Inference� by team at University of Illinois Urbana-Champaign, December, 2023� https://arxiv.org/pdf/2312.11462

2. TRIFORCE: Lossless Acceleration of Long Sequence Generation with � Hierarchical Speculative Decoding� by team at CMU (Beidi Chen, https://www.andrew.cmu.edu/user/beidic/), Meta AI FAIR lab, April, 2024� https://arxiv.org/pdf/2404.11912

- experiments:� opportunities we discovered when examining TriForce and EAGLE side by side

https://hackmd.io/@aben20807/HyKAHCfg0/https%3A%2F%2Fhackmd.io%2F%40aben20807%2FHJZICWrgC#Speculative-Decoding

332 of 409

Related Work: Cascade Speculative Decoding

- recall that, when verifying the draft sequence, if token at the current position is rejected, all following tokens are discarded�- therefore, draft tokens at later positions in the sequence have lower chances of being accepted�- inspired by this observation, Cascade uses larger draft models for earlier positions and smaller ones later

333 of 409

Related Work: TriForce

background (not from the paper but relevant)

- in LLM, context (input prompt + previously generated tokens) is used to generate the next token in the sequence

- to avoid repeating computation, the KV-cache, which stores the results of�matmul(input, K_matrix) and matmul(input, V_matrix), is used to accelerate the attention mechanism

- Llama-3.1 supports context length up to 128K (131,072 to be precise) tokens

- popularity of using Chain of Thought prompting to assist LLMs in reasoning also implies longer contexts

334 of 409

Related Work: TriForce

- note that KV-cache size grows linearly with batch size

- this means, when serving these SOTA models, we need to account for both model weights and KV-cache sizes to realize their full-potential

- seeing this and the fact that draft models grow less accurate with longer contexts (perhaps because they are not trained to memorize as much context as the target model. However, this is not the case for the llama3.1 family), TriForce proposes …

- how big is the KV-cache?�2 * n_layers * batch_size * max_sequence_len * n_kv_heads * head_dim * 2 (bytes)

- which means, 1x sequence of 128K tokens’ KV-cache occupies� Llama-3.1-8B: 2 * 32 * 1 * 128,000 * 8 * 128 * 2 / 1000^3 = 16.78 GB� Llama-3.1-70B: 2 * 80 * 1 * 128,000 * 8 * 128 * 2 / 1000^3 = 41.94 GB� Llama-3.1-405B: 2 * 126 * 1 * 128,000 * 8 * 128 * 2 / 1000^3 = 66.06 GB

335 of 409

Related Work: TriForce

a three-tier architecture:

1. at the bottom, a tiny draft model that uses fixed-sized KV-cache to accelerate

2. a “self-speculating” mechanism in the middle that uses the original target model weights + ~3% of the full KV-cache, which is then used to speed-up

3. the target model with full KV-cache

336 of 409

Related Work: TriForce

TriForce argues self-speculation in the “middle” works because …

- attention sparsity is a well-explored topic that argues most attention score can be recovered from a small portion of the KV-cache

- TriForce observed that adjacently generated tokens tend to pay attention to the same part of the KV-cache (intuitively, e.g.: summarize the experiments setup section of this paper)

337 of 409

Experiments: TriForce

- we hypothesized that, due to self-speculation (which still uses the full target model weights), TriForce would not perform well with shorter contexts�- note that vanilla autoregressive decoding with Llama-3.1-8B is reported to reach 54 t/s on same machine

context length	retrieval cache budget	# tokens per generation cycle
130048	8192	7.56
32784	8192	9.84
8192	8192	11.74
2048	2048	11.87
512	512	11.22
128	128	11.78

338 of 409

Experiments: TriForce

interestingly, with shorter contexts, TriForce is performing badly with low retrieval cache budget. The best performance is recorded when the “middle” self-speculative mechanism is equivalent to the target model

339 of 409

Experiments: EAGLE

naturally, we became curious about how EAGLE performs with longer contexts�note that results published by their paper (same as Sequoia) uses very short context: 128 tokens

context length	# tokens per generation cycle
128	3.71
256	3.70
512	3.34
1024	3.13
1765	2.35
3072	1.85

340 of 409

Comparative Analysis

- EAGLE is bad at longer contexts�- TriForce is bad at shorter contexts�=> good synergy?

note that:�- we’re not using EAGLE with offloading here yet�- TriForce only offloads the target model’s full KV-cache. What happens when the draft model’s weight can’t fit in GPU?�- aside from computation and memcpy overlap, little optimizations were introduced for the target model’s verification runs

interesting to know:�longer contexts, during token generation, makes attention-score calculation even more memory-bound�because most time will be spent on loading KV-cache

341 of 409

An intriguing idea: Accelerating MoE LLM with Self-Speculation

motivation

- when experimenting with Mixtral8x7B, 彥文 saw that the model could still perform well when using the top-1 & random 2nd experts (the original design chooses the top-2 out of 8 experts)

- this means we can choose experts already on the GPU as the 2nd expert to reduce execution time

- the goal of 彥文’s work is to balance performance improvement against accuracy degradation

- well, I would like to improve MoE LLM’s performance without any accuracy loss. Maybe we can use “top-1 + random 2nd ” as a draft model to accelerate the original model through speculative decoding

342 of 409

An intriguing idea: Accelerating MoE LLM with Self-Speculation

related work

- EAGLE (self trained draft model with 0.28B params) has attempted this but did not get very nice results

speedup & acceptance length possibly uses a 60-token draft tree (no evaluation code),�# - alpha denote acceptance rate of draft token at position # in sequence (no tree)�temperature=0 implies greedy decoding, i.e. selects token with highest probability

- their paper argued that MoE models are harder to speedup than their dense counterparts since the draft tokens can activate different parameters during the verification forward pass

343 of 409

Recap: how does speculative decoding achieves speedup

speculative decoding workflow:

1. drafting: the smaller draft model generate multiple draft tokens based on the input sequence

2. verification: the bigger target model processes the input sequence + all draft tokens in parallel in a single forward pass

https://hackmd.io/@shaoeChen/rJESTVr40

when performing autoregressive inference, generating N tokens require moving the same parameters from memory to cache N times

- with dense models and perfect drafting accuracy, you reduce data movement to 1 time

- however, with MoE models, the draft tokens can activate different experts, increasing data movement during verification

344 of 409

Performance Model: accelerating MoE LLM with speculative decoding

345 of 409

Performance Model: accelerating MoE LLM with speculative decoding

through a sample case: things to watch out for

346 of 409

Experiments: 4 implementations

- basic: draft model uses only the top-1 expert. Attention weights + expert0-1 to the 25th layer on GPU. Computation happens where the model weights are, no memcpy weights.

- clear_start: Attention weights + all 8 experts to the 7th layer on GPU. Uses top-2 for the first 7 layers and top-1 for the rest.

- random_2nd: draft model uses the top-1 expert + whichever expert is on GPU. Same weights placement as basic.

- dyn_draft_len: same as random_2nd, however,� 1. performs drafting + verification together in the first forward pass� 2. If the first draft token is rejected, terminate drafting early.

347 of 409

Evaluation: 4 implementations

implementation	6, non-greedy	6, greedy	8, non-greedy	8, greedy
basic	0.80	0.61	0.69	0.51
clear_start	0.75	0.62	0.69	0.53
random_2nd	0.88	0.74	0.80	0.63
dnm_draft_len	0.92	0.85	0.79	0.76

Acceptance Rates

here, we are not even using a tree! However, high draft accuracy comes at a cost

348 of 409

Evaluation: 4 implementations

in August, my implementation that enables static collaboration between the CPU and GPU achieves 4.28 t/s decode throughput, which translates to ~234 ms per token latency�

- we need to make draft model faster

- hierarchical architecture?

349 of 409

12/5/2024: Path Forward

- continue building our framework

- time to transfer our experiments to an on-chip setting (model fits inside aggregate GPU memory)

- Mixtral-8x7B (46.7B, 93.4GB) barely fits inside server 46 (4x RTX-4090-24GB), so the plan is to adopt multiple nodes (server 51 + 46)

- paper review:�Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models (ACL 2024)�https://arxiv.org/pdf/2402.14800

350 of 409

Communication setup between server 51 and 46

- peer-to-peer topology, no switch

- theoretical bandwidth: 10 Gbits/sec

- results from iperf3 gives 9.41 Gbits/sec

- ping latency (transmitting 64 bytes):�0.148 ms

351 of 409

Server 51 expert + all-reduce execution time

- workload (mimics Mixtral-8x7B): 2 GPUs, each GPU activates 1x expert, perform 1x all-reduce after that

batch size	avg execution time (ms)
1	0.38
4	0.4
16	0.42
64	0.48
256	0.74
1024	2.83

tensor size (bytes)	avg execution time (ms)
1 * 32 * 2 = 64	0.018
1 * 4096 * 2 = 8192	0.018

all-reduce time with tensors of different size

352 of 409

Not All Experts are Equal: Methods

expert pruning:�uses a small calibration dataset post-training to heuristically determine which experts to prune, saves memory but not compute (each token is still processed by top-k experts)

dynamic expert skipping:�uses a small calibration dataset post-training to heuristically determine a hyper-parameter 𝛽.�During inference, if�top-2’s score < 𝛽 x top-1’s score,�top-2 expert is skipped

353 of 409

Not All Experts are Equal: a tad more details on methods

expert pruning:

perform inference with small calibration dataset, records input-output pairs of each MoE layer
let the original number of experts be N and the number of preserved experts be r,�for each layer,� find combination of N - r experts that minimizes the Frobenius norm of the difference� between recorded output and output with N - r experts.

354 of 409

Not All Experts are Equal: a tad more details on methods

expert skipping:

the hyper-parameter 𝛽 (used to check if top-2’s score < 𝛽 x top-1’s score) for each layer�is determined by performing inference with calibration dataset and recording�the median of top-2’s score / top-1’s score

355 of 409

Not All Experts are Equal: accuracy evaluation

accuracy of proposed pruning technique compared to other methods.�Random: randomly selects experts to prune in each expert�Frequency: prunes experts based on activation frequency from inference on calibration data

356 of 409

Not All Experts are Equal: performance evaluation

LM-eval: average accuracy of 8 zero-shot tasks

in conclusion, our team should be able to do this too and even better

357 of 409

[12/12/2024 updates]: Mixtral-8x7B on 2 nodes, 6 GPUs

background

the MERLIN project assumes that model weights fit inside GPU memory and proposes a three-tier framework that begins with multi-dimensional parallelism. Therefore, to get things going, we built a distributed computing system that serves Mixtral-8x7B on multiple nodes/GPUs using tensor parallelism.

outline

- setup�- inference system architecture�- algorithmic details�- evaluation

358 of 409

Mixtral-8x7B on 2 nodes, 6 GPUs

setup

2 nodes connected with 10 Gbits/sec ethernet:� - server 51: 2x RTX-4090-24GB� - server 46: 4x RTX-4090-24GB�

everything implemented in PyTorch� - uses torch.distributed as communication package, which uses NCCL under� the hood

359 of 409

Mixtral-8x7B on 2 nodes, 6 GPUs

inference system architecture�https://github.com/muchi674/ntu_paslab_llm/tree/main/merlin

weight partitioner: one time only. splits weights according to parallelization strategy for easy distribution across the cluster
model implementation
system launcher [kudos to 瑄哥]: uses torchrun to launch a process for each GPU

360 of 409

361 of 409

Mixtral-8x7B on 2 nodes, 6 GPUs: evaluation

in comparison,others have reported getting�- 23 t/s on 2x A100-80GB (https://github.com/vllm-project/vllm/issues/2098)�- ~50 t/s on 4x H100-sxm5-80GB (https://arxiv.org/abs/2411.00136)

batch size	prefill throughput (t/s)	decode throughput (t/s)
1	927	24
2	1190	42
4	1345	72
8	1665	121
16	1759	236
32	1846	391
64	1875	649

362 of 409

[1/2/2025] Continued: Mixtral-8x7B distributed Inference

- 2 weeks ago, we tried tensor parallelism on experts’ weights (v0)

- this week, we came up with 3 other implementations:

(v1) intra-node tensor + inter-node expert parallelism on experts’ weights
(v2) v1 + intra-node tensor parallel attention
(v3) v1 + inter-node data parallelism

363 of 409

[Recap] v0: experts inter-node TP architecture

note:

e_i,j denotes slice j of expert_i

here, each expert is sliced into 6 pieces, one for each GPU�

364 of 409

v1: experts intra-node TP + inter-node EP architecture

365 of 409

v1: experts intra-node TP + inter-node EP evaluation

batch size	prefill throughput (t/s)	decode throughput (t/s)
1	971	28
2	1246	50
4	1450	84
8	1665	145
16	1777	254
32	1822	434
64	OOM	OOM

366 of 409

batch size	prefill throughput (t/s)	decode throughput (t/s)
1	927	24
2	1190	42
4	1345	72
8	1665	121
16	1759	236
32	1846	391
64	1875	649

batch size	prefill throughput (t/s)	decode throughput (t/s)
1	971	28
2	1246	50
4	1450	84
8	1665	145
16	1777	254
32	1822	434
64	OOM	OOM

batch size	prefill throughput (t/s)	decode throughput (t/s)
1	847	26
2	1060	46
4	1193	80
8	1334	142
16	1382	259
32	1426	444
64	1422	663
128	1457	913
256	OOM	OOM

batch size	prefill throughput (t/s)	decode throughput (t/s)
1	283	22
2	449	39
4	942	73
8	1079	128
16	1529	242
32	1579	428
64	1852	752
128	1811	1080
256	1879	1395
512	OOM	OOM

367 of 409

v2:�experts intra-node TP + inter-node EP & attention intra-node TP�architecture

368 of 409

v2 evaluation

batch size	prefill throughput (t/s)	decode throughput (t/s)
1	847	26
2	1060	46
4	1193	80
8	1334	142
16	1382	259
32	1426	444
64	1422	663
128	1457	913
256	OOM	OOM

369 of 409

v3:�experts intra-node TP + inter-node EP & inter-node DP�architecture

370 of 409

v3 evaluation

batch size	prefill throughput (t/s)	decode throughput (t/s)
1	283	22
2	449	39
4	942	73
8	1079	128
16	1529	242
32	1579	428
64	1852	752
128	1811	1080
256	1879	1395
512	OOM	OOM

371 of 409

Pipeline Parallelism by Mistral: evaluation

batch size	prefill throughput (t/s)	decode throughput (t/s)
1	667	14
2	1189	22
4	1959	35
8	2169	58
16	2931	107
32	2909	207
64	3111	367
128	2909	606
256	OOM	OOM

372 of 409

[02.06.2024 update]

MoE Inference Performance Model

- the grandeur goal:�given an arbitrary hardware environment and a target MoE model, what is the best multi-dimensional parallelization strategy for inference?

- more realistic one (which we intend to pursue):�given a cluster consisted of the same type of GPU (arbitrary n_nodes, n_gpus per node, network equipment/topology. assumes sufficient aggregate GPU memory) and a few selected representative MoE models, what is the multi-dimensional parallelization strategy optimized for inference throughput?

373 of 409

[02/27/2025 update]: minimizing synchronization overhead in distributed inference with cuda graph

“GPU kernel launch timing can be unpredictable due to various CPU load and operating system factors”
this issue could lead to unstable & subpar performance when the workload in interest is consisted of multiple small GPU kernels (instability from kernel bubbles, low GPU utilization from high scheduling latency for microsecond level GPU operation)
in a distributed/parallel setting, the unpredictable kernel bubbles causes synchronization overhead, which could significantly degrades performance
together, issues above create immense difficulty in modeling distributed inference performance

https://pytorch.org/blog/accelerating-pytorch-with-cuda-graphs/

374 of 409

kernel bubbles & synchronization overhead

red box:�attention & router

blue box:�expert input sorting

green box:�expert computations

#NOTE:�all-reduce cannot start until everyone finishes their work

375 of 409

cuda graph: how could it help?

without graph, high scheduling latency could cause significant gaps between kernels
with cuda graph, kernels are scheduled together as a bundle, minimizing CPU scheduling latency and kernel gaps (also increases GPU utilization)

376 of 409

the cost/restrictions of cuda graph

building said graph takes time (but the cost is amortized across iterations of graph execution, e.g. token generation forward pass)
building said graph takes memory (but multiple graphs are allowed to share the same memory pool)
have to re-write the application to be “graph-safe”:�1. ops that synchronize the CPU with the GPU are prohibited�2. every replay reads from and writes to the same (virtual) memory addresses�3. dynamic control flow (based on CPU or GPU data) is prohibited�4. dynamic shapes are prohibited. The graph assumes every tensor in the captured op sequence has the same size and layout in every replay.

377 of 409

to see whether cuda graph could improve Mixtral8x7B distributed inference,�I rewrote the Attention module to be graph safe (borrowing many ideas from Sequoia)

original implementation

graph-safe version

378 of 409

Caveats to my graph-safe attention ...

original implementation:�keys & values used in attention score calculation grows with token generation length

graph-safe version (leads to increased computation):�keys & values needs to have fixed size, therefore I had to use�max_seq_len = max(prompt length of this batch) + n_tokens_to_generate

379 of 409

evaluation: efficacy of cuda graph

picture on the left shows profiler results from server 51, 2x RTX-4090-24GB
running Mixtral8x7B distributed inference with tensor parallel experts on 2 nodes (server 51&46, 6x GPUs in total)
this is during decode phase when batch_size = 1
cuda graph (marked in orange) encapsulates attention + router computation
as shown, minimizes synchronization overhead

380 of 409

Performance comparison [Prefill]: original vs. graphed version

batch size	orig_prefill	graphed_prefill
1	927	1349
2	1190	1944
4	1345	2492
8	1665	3300
16	1759	4583

running Mixtral8x7B distributed inference with tensor parallel experts on 2 nodes (server 51&46, 6x GPUs in total)

381 of 409

Performance comparison [Decode]: original vs. graphed version

running Mixtral8x7B distributed inference with tensor parallel experts on 2 nodes (server 51&46, 6x GPUs in total)

batch size	orig_decode	graphed_decode
1	24	42
2	42	62
4	72	84
8	121	108
16	236	187

382 of 409

Future Work

- it remains unclear why the graphed version’s decode throughput is deteriorating with larger batch sizes

- we are still trying to figure out whether the graph version more closely aligns with our performance model’s predictions

383 of 409

[03/05/2025] Paper Review: LLM reasoning

outline

- definition & overview of recent efforts on LLM reasoning

UIUC (May 2023): https://arxiv.org/pdf/2212.10403�Leiden University, Netherlands (July 2024): https://arxiv.org/pdf/2407.11511

- case study: s1, improving LLM reasoning with extra test-time computation

Stanford (March 2025): https://arxiv.org/pdf/2501.19393�

384 of 409

Introduction to LLM reasoning

we will focus mainly on Techniques today

385 of 409

What is reasoning?

(or, rather, how is this term defined by scholars working on LLM reasoning)

“Reasoning is the process of thinking about something in a logical and systematic way, using evidence and past experiences to reach a conclusion or make a decision”

- inductive, deductive, abductive reasoning�- analogical, causal, probabilistic reasoning�- formal, informal reasoning

386 of 409

Techniques to elicit the ability to reason in LLM

UIUC survey wrote: “Recent research has suggested that reasoning ability may emerge in language models at a certain scale, such as models with over 100 billion parameters”�=> clearly sounds outdated now��1. supervised fine-tuning (usually on smaller models, used in DeepSeek-v2, replaced by RL in DeepSeek-R1)�the gist: training models to explain themselves (e.g. provide steps to conclusion)�limitation: requires curating dataset that contains explicit reasoning, and fine-tuned model may not generalize well (which could indicate that any resemblance of reasoning may be the model relying on artifacts)

387 of 409

2. Chain of Thought (CoT) prompting

guiding LLMs to generating “reasoning” explicitly (assumes the model has inherent reasoning abilities)�N-Shot-CoT: in CoT prompting, <input, output> demonstrations are replaced with <input, chain of thought, output> triples

388 of 409

Further into Chain of Thought Prompting

Rationale Exploration:

Complex problems often admit multiple ways of thinking that can lead to their unique correct answer.

A paper in 2022 proposed a decoding strategy called self-consistency to replace greedy decoding (temperature = 0) in traditional CoT. This strategy involves sampling a diverse set of rationales on top of the greedy one, and selecting the most consistent answer by marginalizing out the sampled rationales.��extended reading:�Forest-of-Thought by Huawei: https://arxiv.org/pdf/2412.09078

389 of 409

3. Hybrid Method

trains the model more to endow reasoning capabilities + prompt engineering like CoT to bring out said abilities

Bootstrapping as a way to self-improve:�the Self-Taught Reasoner (StaR) project in 2022 experimented training and refining a model iteratively with its own output.�1. use CoT to get the model to generate rationales�2. fine-tune the model with rationales that lead to correct answers�(there is follow-up work that demonstrates possibility of self-supervising this process)

390 of 409

[Case Study] s1: Simple test-time scaling

main points:

- this paper uses the “Hybrid” technique stated earlier�- “test-time” is synonymous to “during inference”�- the idea is to use extra test-time compute to improve performance�- only uses 1K dataset with reasoning traces for supervised fine-tuning�- completed training in 26 minutes on 16x H100 GPUs�- employs a budget-forcing strategy to control test-time compute�- uses Qwen-2.5-32B-Instruct as base model. The fine-tuned model is called s1-32B�- uses AIME24 as benchmark

391 of 409

s1: dataset curation

- initially collected 59K questions from 16 sources, but found that�quality >> quantity�- uses Gemini Flash Thinking API to generate reasoning trace for each question

interesting data selection scheme:�to ensure question difficulty, uses Qwen2.5-7B-Instruct and Qwen2.5-32B-Instruct to generate answer, Claude 3.5 Sonnet to evaluate answers against reference solution. Questions that either model can answer correctly are removed (too easy)

392 of 409

s1: test-time scaling methods

2 major scaling methods:�- sequential (later computation depends on earlier ones, which this paper adopts)�- parallel (majority voting)��budget forcing strategy:�- min & max number of thinking tokens�- min: asks LLM to “Wait”�- max: end-of-thinking token (like EOS token) or append “Final Answer: ”��compares against:�token-conditional, step-conditional, and class-conditional control. Rejection Sampling�

393 of 409

s1: evaluation

improvement with test-time scaling eventually flattens out, and the context window of the underlying language model constrains it

394 of 409

s1: discussion

395 of 409

[03.27.2025]: Modeling the performance of MoE LLM distributed inference with multi-dimensional parallelism

outline

motivation
overview of our performance model
Performance model evaluation: big trends
Performance model evaluation: breakdown

396 of 409

Brief Review: multi-dimensional parallelism

reference: https://huggingface.co/blog/moe#parallelism

note: here, model parallelism is equivalent to tensor parallelism

397 of 409

Motivation: performance tuning is a tedious and time consuming task

in the past few months, we have experimented with combinations of

pipeline parallelism (PP), tensor parallelism (TP), expert parallelism (EP), and even data parallelism (DP) in different hardware environments (e.g. h100, RTX-4090)��note that these parallelization strategy can be intra or inter node!

=> so many combinations�=> so many implementations�=> takes so much time to profile every one of them�

398 of 409

Performance Model Overview: perhaps there is a better way …

399 of 409

Few assumptions of our performance model

all gpus in the given cluster are the same (not considering heterogeneous computing)
parallelization of attention layer is intra-node, parallelization of expert layer is inter-node
expert activation is perfectly uniform regardless of sample size

400 of 409

Case Study: deep dive into the differences of�experts-TP v.s. experts-attn-TP�through our performance model

401 of 409

Quick Review: experts-TP

note: in the figure above, we are running Mixtral-8x7B (46.7B weights in BF16) on 6x RTX-4090 on 2 nodes

basic idea: since experts account for >95% of MoE LLM’s weights, we could distribute their weights evenly across our GPUs to lessen the memory pressure

402 of 409

Quick Review: experts-attn-TP

basic idea: attn weights are much smaller than that of experts’.�Parallelizing attn weights inter-node may introduce more communication overhead than computation savings. However, intra-node communication via PCIe is much faster. Perhaps�parallelizing attn weights intra-node could save execution time by reducing computation while introducing little communication overhead.

403 of 409

Evaluation [big trends]: whether our performance model captures the difference between experts-TP and experts-attn-TP in terms of throughput

batch size	actual�prefill t/s	predicted prefill t/s	actual�decode t/s	predicted decode t/s
1	1516	1726	46	63
2	1913	2082	70	89
4	2420	2995	108	116
8	3016	3388	170	200
16	3253	3540	304	369

batch size	actual�prefill t/s	predicted�prefill t/s	actual�decode t/s	predicted decode t/s
1	1279	1458	44	64
2	1539	1601	70	90
4	1880	2061	107	117
8	2244	2226	175	203
16	2251	2258	304	372

404 of 409

Evaluation [breakdown, decode phase]: how does our performance model explain the observed trends on the previous slide?

summary: computation savings from parallelizing attn weights is offset by communication overhead

405 of 409

Evaluation [breakdown, prefill phase]: how does our performance model explain the observed trends on the previous slide?

summary: computation savings from parallelizing attn weights�is very little compared to the communication overhead

406 of 409

Next Steps

adding support for expert parallelism and pipeline parallelism
would share performance comparison against vLLM next week

407 of 409

[04.03.2025]: Continuing on LLM Reasoning

outline

recap: What is LLM Reasoning?
we have started some survey, let us continue …�Forest-of-Thought: Scaling Test-Time Compute for Enhancing LLM Reasoning (Huawei Noah’s Ark Lab, March 26th 2025)�https://arxiv.org/pdf/2412.09078�

408 of 409

CoT v.s. ToT v.s. FoT

old dogs can’t learn new tricks? (ensemble method, speculative decoding …)