1 of 348

by Mu-Chi Chen

2 of 348

What Is a Data Hub?

  • mediator between various data sources and data consumers
  • how does it differ from normal data storage solutions (e.g. databases, data lakes)?

3 of 348

What Is a Data Hub: a one stop shop for data consumers

4 of 348

What Is a Data Hub: Main Functions

5 of 348

Examples

from platform providers:

  • Cumulocity
  • MarkLogic
  • Cloudera

from an alternative/complementary definition of data hub:

  • LinkedIn’s DataHub
  • Microsoft’s OneLake data hub

from Google as a narrowly purposed product:

  • Ads Data Hub

6 of 348

Example: Cumulocity

Highlights:

  • IoT devices send data to MongoDB
  • Dremio (distributed SQL engine) as query interface & ETL orchestrator
  • stores data as Apache Parquet (a columnar format for structured data) on data lakes

7 of 348

Example: MarkLogic

Highlights:

  • allows data ingestion from Apache Kafka
  • uses a multi-model database called MarkLogic Server

8 of 348

Example: Cloudera

workflow:

  • ingest data by creating a data hub cluster that hosts Apache NiFi (a data flow manager; enables getting data via HTTP, S3, Kafka, MongoDB, …)
  • launch transformation/analytics jobs on (perhaps another) cluster (supports Apache Spark for batch & stream processing)
  • clusters in the same environment (e.g. test, staging, production) can save data to a shared data lake, where it also becomes accessible to consumers

reference:

https://www.cloudera.com/products/data-hub.html#

https://docs.cloudera.com/data-hub/cloud/index.html

9 of 348

Things to Consider

how to move data into our data hub? (data ingestion & integration)

  • Apache NiFi as an open-source solution
  • AWS Glue or Azure Data Factory, depending on our cloud service provider

where do we store the data in our data hub?

  • data lake (Azure, AWS, GCP, Snowflake, …)
  • multi-model database (open source: Apache Ignite, ArangoDB, OrientDB, FoundationDB; paid: Azure Cosmos DB)

implications of our infrastructure choices?

10 of 348

slides above were covered in

Data Hub Project Weekly Meeting on 9/28

11 of 348

Example: Data Hub as Data Catalog/ Metadata Platform

What is a Data Catalog?

According to IBM, “a data catalog is a detailed inventory of all data assets in an organization, designed to help data professionals quickly find the most appropriate data for any analytical or business purpose.”

reference: https://www.ibm.com/topics/data-catalog

Why does a Data Hub need this?

Aside from enabling people to get the data they need easily, data hubs should also help them find the data they need (this would essentially be the face of this project).

  • encourages and therefore increases usage
  • facilitates data sharing and open research

12 of 348

Example: Data Hub as Data Catalog/ Metadata Platform

Common functions of a data catalog

  • IAM
  • search
  • data lineage (most likely requires logs)
  • data management (source, ingestion, retention, and export configurations)
  • data ops (pipeline execution and statistics)
  • data quality (rule definitions and pipeline execution results)
  • compliance (SOC 2, right to be forgotten, right of access)
  • explainable and reproducible AI

reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained

13 of 348

Example: LinkedIn & Acryl Data’s DataHub

open source

https://datahubproject.io/

slides 12 and 14-16 reference a LinkedIn Engineering Blog post about metadata catalog architectures by Acryl Data's CTO (previously LinkedIn's Principal Staff Software Engineer leading the data infrastructure team)

14 of 348

Example: LinkedIn & Acryl Data’s DataHub

architecture

  • metadata are event-based, subscribable in real-time
  • metadata providers push to stream or service API
  • changes to metadata generate logs
  • logs can be materialized in different forms

reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained

15 of 348

Example: LinkedIn & Acryl Data’s DataHub

architecture

  • metadata should be strongly typed and have collaboratively defined relationships

reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained

16 of 348

Example: LinkedIn & Acryl Data’s DataHub

reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained

  • providers push metadata to a change proposal stream (like a topic on Apache Kafka), where it is processed and made available for subscription (see the sketch after this list)
  • the metadata (main) service subscribes to the processed change stream, makes logs, and transforms/loads the metadata into different storage solutions depending on the use case
  • consumers (e.g. the DataHub dashboard) power different applications using the various data stores
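A minimal sketch of that push model, assuming a local Kafka broker and the kafka-python package; the topic name and payload below are hypothetical and much simpler than DataHub's real event schema.

import json
from kafka import KafkaProducer  # kafka-python

# Serialize metadata change proposals as JSON (illustrative payload).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

proposal = {
    "entity": "dataset",
    "urn": "urn:li:dataset:(hive,logging_events,PROD)",
    "change": {"owner": "data-platform-team"},
}

# Providers push; the metadata service (and any other consumer) subscribes.
producer.send("metadata-change-proposals", value=proposal)
producer.flush()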

17 of 348

Example: Microsoft’s OneLake Data Hub

question:

What does MBZUAI need?

18 of 348

Example: Ads Data Hub (will skip in presentation)

Highlights:

  • reads data from: Google Ads, YouTube, your GCP BigQuery (data warehouse)
  • writes to your GCP project upon query
  • privacy & security by Google

19 of 348

For weekly meeting on 10/5/2023

topic: continue to survey data hub elsewhere

  • Internet of Water Data Hub
  • Art Institute of Chicago Data Hub

maybe

  • circle back to talk about data catalogs
  • Open Data Hub (originally an internal Red Hat project but then open sourced)

20 of 348

Example: Internet of Water Data Hub

What is the Internet of Water Coalition?

A group of organizations (e.g. Duke University Nicholas Institute, Lincoln Institute Center for Geospatial Solutions, Western States Water Council) working with federal, state, and local governments to improve water management by building a water data infrastructure.

In summary:

Better data means better management. Therefore, we need good infrastructure to make water data more discoverable, accessible, and usable

21 of 348

Example: Internet of Water (IoW) Data Hub

  • collects water data from disparate sources into one place
  • standardizes collected data according to a commonly adopted principle (which implies that it only stores structured data)
  • enables discovery and access by Geoconnex (metadata catalog also developed by IoW)

22 of 348

Example: Internet of Water (IoW) Data Hub

basic components of a Data Hub

  • data producers
  • data wrappers (the transform and load jobs of an ETL process)
  • data store
  • metadata catalog: searchable collection of metadata that points to data in data store including API to access the data (this is where data users can come in)

individual data hubs are then connected by Geoconnex, which is essentially a global metadata catalog (this is another place data users can come in)

23 of 348

Example: Internet of Water (IoW) Data Hub

there are four types of data hub, and this is type A

  • distributed in the sense that the data producers are responsible for the entire ETL process & storage
  • here, the data hub provides only a window to discover and access data

24 of 348

Example: Internet of Water (IoW) Data Hub

type B:

  • data hub also stores the data
  • data producers push transformed data to data hub

type C:

  • same as above, except that the data hub is responsible for pulling transformed data from data producers

* data sharing creates a privacy issue

25 of 348

Example: Internet of Water (IoW) Data Hub

type D (which is more like our original vision):

  • data producers send raw data to the data hub, which is responsible for the rest

* can the data producers trust the data hub to transform the data correctly? (which brings us back to the ETL vs. ELT question)

26 of 348

Example: Internet of Water (IoW) Geoconnex

purpose: make it easier for data consumers to find/access the data they want/need. Roughly similar to what LinkedIn and Microsoft are doing

why I think it is not a good product

  • every org publishing data is required to create a web page for each location for which they have water data
  • every created web page needs to embed structured metadata in JSON-LD
  • every page needs its own unique ID, and if you ever modify the URL, you have to go to Geoconnex to remap it

all of this is so that, once everything above is done right, a web crawler can go to these pages, extract the data it needs, and store it in a central database

27 of 348

Example: Internet of Water (IoW) Data Hub

28 of 348

Example: the Art Institute of Chicago Data Hub

background

  • disparate data sources
  • some data transformation jobs repeated across different applications
  • wants robust support for website (web & mobile app) search (full-text search, autocomplete, and general SQL-like queries such as filtering and aggregation)

29 of 348

Example: the Art Institute of Chicago Data Hub

design goals

  • only stores data that are already publicly accessible, which precludes data privacy issues. This, they argue, makes it okay to:
  • use cloud infrastructure
  • not care about authentication & authorization for now
  • not treat the data hub as a central, unique, and permanent storage: that is the data producer's job (not necessarily applicable to MBZUAI)
  • minimize the need to ask data producers to modify their systems for data ingestion

30 of 348

Example: the Art Institute of Chicago Data Hub

architecture

  • hub-and-spoke model consisting of multiple microservices: APIs, HTTPS, JSON (you know, the usual …)
  • spokes
  • one microservice for each type of data source
  • each microservice is responsible for extracting the needed data, transforming it correctly, and exposing an API for reading the transformed data (pull instead of push); see the sketch after this list
  • can involve scraping data into local DBs for storage and access
  • hub
  • pulls, aggregates, and indexes data from spokes
  • MySQL for primary storage :(
  • Elasticsearch as the search engine (enables full-text search, but is itself a document DB like MongoDB; no longer open source since 2021)
  • exposes an API for data consumers: the user submits a query to the hub, and the hub asks MySQL and/or Elasticsearch for the data
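A minimal sketch of what one spoke microservice could look like, assuming Flask and a hypothetical events.csv export from one data source; the real Art Institute spokes are considerably more involved.

import csv
from flask import Flask, jsonify

app = Flask(__name__)

def extract_and_transform(path="events.csv"):
    # Extract from this spoke's data source and normalize field names
    # so the hub sees a consistent shape across spokes (hypothetical schema).
    with open(path, newline="") as f:
        return [
            {"id": row["event_id"], "title": row["name"].strip().title()}
            for row in csv.DictReader(f)
        ]

@app.route("/events")
def events():
    # The hub pulls from this read-only endpoint on its own schedule.
    return jsonify(extract_and_transform())

if __name__ == "__main__":
    app.run(port=5001)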

31 of 348

Example: the Art Institute of Chicago Data Hub

challenges

  • performance: high website traffic (increase in data consumers) results in data hub dropping connections
  • auto-scaling capabilities
  • debugging: hard to trace where mistakes were made
  • importance of data lineage (done very well by LinkedIn) - where the data originated, how it has changed, and its ultimate destination

references: https://mw19.mwconf.org/paper/building-a-data-hub-microservices-apis-and-system-integration-at-the-art-institute-of-chicago/index.html (slides 28-31)

32 of 348

Example: Red Hat’s Open Data Hub (ODH)

  • end-to-end AI platform
  • contrary to all the data hubs discussed above, ODH provides a space for data scientists to develop their applications and tools for deploying them into production
  • this is a platform worth introducing because it is built mostly with open source projects

33 of 348

Example: Red Hat’s Open Data Hub (ODH)

Architecture highlights

  • uses Ceph for (distributed file, object, block) storage -> an example of an open-source data lake
  • uses Apache Hive Metastore for metadata to enable data discovery
  • uses OpenShift as the container management platform (powered by Kubernetes), which means the underlying compute resources can come from public/private clouds or on-premise data centers

34 of 348

Example: Red Hat’s Open Data Hub (ODH)

35 of 348

Example: Google Analytics Hub

feature of BigQuery

short intro to BigQuery:

  • GCP’s fully managed data warehouse
  • separates storage (BigQuery storage) & compute (BigQuery analytics) engine
  • two main use cases
  • store & analyze data in BigQuery
  • use BigQuery to access your data where it lives, which only supports:
  • external tables: Cloud Bigtable (NoSQL), Cloud Storage (data lake), Google Drive
  • federated queries: Cloud Spanner (a relational database that scales like a non-relational one) and Cloud SQL

36 of 348

Example: Google Analytics Hub

What might it look like for MBZUAI?

37 of 348

Example: Google Analytics Hub

38 of 348

Alright, so how would I build it?

39 of 348

40 of 348

Scaling

41 of 348

What could be MBZUAI Data Hub’s unique offering?

42 of 348

Automatic Sampling Bias Detection

what is sampling bias?

a sample that is unrepresentative of the population

why do we care?

  • internal & external (generalizability) validity
  • fairness in machine learning

what role can data hub play in this?

by enabling automatic and real-time exploratory data analysis, we can

  • improve data collection ASAP
  • optimize storage by not saving data that are too similar
  • enhance transparency for users of resulting models

43 of 348

Automatic Sampling Bias Detection

how do we detect sampling bias?

what does detecting sampling bias mean?

uncovering patterns in data (arguably data mining, or knowledge discovery in data), e.g. "90% of the data collected today came from people who identify as cisgender men (assigned male at birth and identifying as men)"

44 of 348

Automatic Sampling Bias Detection

ways to uncover patterns in data

methods of unsupervised learning:

  • clustering (see the sketch after this list)

exclusive (K-means)

hierarchical (agglomerative)

probabilistic (Gaussian Mixture; determines the probability of data belonging to a particular distribution)

  • association rules

Apriori algorithm (finds frequent item combinations)
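A minimal clustering sketch with scikit-learn, assuming numeric demographic features have already been extracted from incoming records; heavily skewed cluster sizes hint at an over-represented group.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric features per record, e.g. [age, income, region_code].
X = np.array([[25, 40_000, 1], [27, 42_000, 1], [26, 41_000, 1],
              [60, 90_000, 3], [24, 39_000, 1], [23, 38_000, 1]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X)
)

# Cluster sizes: a dominant cluster suggests the sample over-represents one group.
print(np.bincount(labels))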

45 of 348

Automatic Sampling Bias Detection

ways to uncover patterns in data

methods of unsupervised learning:

  • outlier detection (see the sketch after this list)

local outlier factor (compares the local density of the focal point vs. that of its neighbors)

isolation forest (builds overfitted decision trees by splitting at random thresholds & features until every point is isolated; points isolated early are outliers)

autoencoders (perhaps for images; the encoder learns a latent representation of the input and the decoder reconstructs the input from the encoded output; a larger reconstruction error implies an anomaly)
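A minimal outlier-detection sketch using scikit-learn's IsolationForest on the same kind of hypothetical feature matrix; -1 marks points isolated early, i.e. likely outliers.

import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical feature matrix; the last row is deliberately unusual.
X = np.array([[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.05, 2.05], [8.0, -3.0]])

# fit_predict returns 1 for inliers and -1 for outliers.
flags = IsolationForest(contamination=0.2, random_state=0).fit_predict(X)
print(flags)  # e.g. [ 1  1  1  1 -1 ]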

46 of 348

Automatic Sampling Bias Detection

ways to uncover patterns in data

methods of supervised learning:

  • classification

understand the different ways in which your data can be categorized to adjust the data collection strategy (see the sketch after this list)

e.g.

classification results seem geographical, and much more data is classified as coming from East Asia

KNN, K-nearest neighbors

decision tree & random forest
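A minimal classification sketch with scikit-learn's KNeighborsClassifier, assuming some records already carry a region label; the distribution of predicted labels on new data flags where collection is skewed.

from collections import Counter
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical feature vectors of labeled records and their region labels.
X_train = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y_train = ["east_asia", "east_asia", "europe", "europe"]

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# Classify newly collected (unlabeled) records and inspect the balance.
X_new = [[0.15, 0.18], [0.12, 0.22], [0.85, 0.82]]
print(Counter(clf.predict(X_new)))  # e.g. Counter({'east_asia': 2, 'europe': 1})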

47 of 348

Academic Discussions on Data Hub

48 of 348

Technical Report: Developing a Working Data Hub

  • paper by the MIT Lincoln Laboratory Supercomputing Center, March 2020
  • authors:

Vijay Gadepally (OSU Ph.D. in ECE, technical staff at MIT Lincoln Lab)

Jeremy Kepner (Princeton Ph.D. in Astrophysics, head of the MIT Lincoln Laboratory Supercomputing Center)

  • the Lincoln Lab never seems to have built a data hub of its own
  • the closest product MIT has helped build is probably BigDAWG (a polystore database prototyped at a hackathon hosted by Intel)
  • no concrete experience to learn from, but interesting guidelines
  • https://arxiv.org/pdf/2004.00190.pdf

49 of 348

MIT on the need of data hub

big data challenges:

  • Volume, Velocity, Variety (the three Vs), plus Veracity (a fourth V? preserving trust in data and analytics)
  • in summary, the need to extract value from lots of heterogeneously sourced data quickly and accurately

evolving solution:

  • one-stop solution for storing (providing) and using (consuming) data
  • which is data hub

50 of 348

MIT on data hub architecture

at the bottom:

external data sources + your own data storage including databases, data warehouses, and data lakes

in the middle:

heterogeneous data management and transformation. enable users to query multiple sources at once

on the top:

data discovery and define data transformation rules

51 of 348

MIT on data hub system engineering

NOTE: “system engineering studies the development of complex systems (p.8)”

HIGHLIGHTS:

3a is an operation supported on databases while 3b is a strategy for file systems/ object storage/ block storage i.e. place for unstructured/ raw data like data lakes

=> similar to my data hub architecture proposal

52 of 348

MIT on raw data storage: Lustre

  • distributed parallel object storage that is great for HPC (kinda similar to s3 but open-source and has a file system interface)
  • stores objects (binary data stored in file) and associated metadata (ith node, filename, timestamp etc) in a flat structure
  • one metadata server (MDS), multiple object servers (OSSes)
  • clients request data from metadata server which saves pointer to data in object servers
  • usually uses (not built-in) RAID6 to prevent data loss (allows concurrent failures of up to 2 disks, on average imposes a 35% storage penalty)
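A rough check on that storage-penalty figure, assuming a small RAID6 group (the exact overhead depends on the layout chosen):

\[ \text{overhead} = \frac{p}{d + p} \quad\Rightarrow\quad d = 4 \text{ data disks},\ p = 2 \text{ parity disks gives } \frac{2}{6} \approx 33\% \]

A wider 8+2 layout would drop the overhead to 20%, so the ~35% figure corresponds to fairly small groups.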

53 of 348

Lustre further reading

“Azure Managed Lustre: not your grandparents' parallel file system”

https://techcommunity.microsoft.com/t5/azure-high-performance-computing/azure-managed-lustre-not-your-grandparents-parallel-file-system/ba-p/3889946

Ceph (used in Redhat’s Open Data Hub, originally part of Sage Weil’s doctoral dissertation at UC Santa Cruz, also used extensively in HPC) vs Lustre:

https://www.linkedin.com/pulse/ceph-lustre-yashar-esmaildokht/

AWS FSx for Lustre (integrates with s3)

https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html

official site:

https://www.lustre.org/

54 of 348

MIT on DBMS

what is a DBMS?

the software interface between users and a database

why use a DBMS? (see the sketch after this list)

  • define schema
  • CRUD data
  • access control
  • analytics
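A minimal illustration of those four roles using Python's built-in sqlite3 (schema definition, CRUD, and a view as a crude access-control/analytics surface); purely a sketch, not a data hub design.

import sqlite3

conn = sqlite3.connect(":memory:")

# Define schema
conn.execute("CREATE TABLE readings (site TEXT, value REAL, taken_at TEXT)")

# Create / update / read / delete
conn.execute("INSERT INTO readings VALUES ('site-a', 3.2, '2023-09-28')")
conn.execute("UPDATE readings SET value = 3.4 WHERE site = 'site-a'")
conn.execute("DELETE FROM readings WHERE value IS NULL")

# A view doubles as a simple analytics + access-control surface:
# consumers query the view, not the raw table.
conn.execute("""CREATE VIEW site_daily_avg AS
                SELECT site, taken_at, AVG(value) AS avg_value
                FROM readings GROUP BY site, taken_at""")
print(conn.execute("SELECT * FROM site_daily_avg").fetchall())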

55 of 348

MIT on DBMS

ACID vs. BASE

relational DBs: atomicity, consistency, isolation, and durability. OLTP

non-relational DBs: basically available, soft state, eventual consistency

CAP theorem

you can only choose 2 out of consistency, availability, and partition tolerance

(conjectured by a professor at UC Berkeley, later proven by researchers at MIT)

relational DBs: C + A

non-relational DBs: C + P (MongoDB), A + P (Cassandra, developed at Facebook by a co-author of Amazon's Dynamo)

56 of 348

MIT on DBMS

scaling

relational DBs: vertical (improve the hardware, because they mostly run on a single node)

non-relational DBs: horizontal (add nodes, which creates the need for strong partition tolerance, which in turn makes relaxing C and/or A inevitable if the CAP theorem is correct)

that said, there is controversy surrounding CAP

the NewSQL (as opposed to SQL and NoSQL) movement aims to build databases that are both ACID compliant and performant (i.e. horizontally scalable)

some examples include …

CockroachDB: https://www.cockroachlabs.com/

Google Cloud Spanner (which we have seen before when introducing BigQuery, now advertised as costing half as much as DynamoDB): https://cloud.google.com/spanner?hl=en

57 of 348

MIT on DBMS

Access Control: who can access what

granularity: table, row, down to cell level

principal: the "who," can be a role/user/group

asset: the "what," i.e. the resources stored in the database

view-based access control is the most popular

  • often implemented through metadata
  • often used with role-based strategies
  • the principal requests access, the DB evaluates whether the principal has the right to access the requested data, and if so presents the data through a view

query control strategies: restrict what queries the principal is allowed to issue

58 of 348

MIT on DBMS

ways to add security

well … duh … encryption and/or masking

CryptDB (also by MIT):

  • executes SQL queries over encrypted data
  • data can optionally be made decryptable only by password holders, which means the DB admin cannot access the decrypted data

59 of 348

MIT on heterogeneous data management

  • modern applications often rely on diverse datasets with different models and sources
  • ETL is expensive:
  • converting everything to a single model may degrade performance
  • high maintenance cost: transformation rules have to change with the data
  • "custom data stores provide 100X better performance than general purpose databases"
  • sometimes, data sharing policies may prohibit moving the data out to a centralized location
  • federating (like the federal-to-state relationship in the U.S.) multiple specialized data stores. Single or multiple query interfaces?
  • how to avoid writing and using a different connector for every selected store
  • how to integrate data

60 of 348

MIT on heterogeneous data management

BigDAWG (Big Data Working Group)

  • unified interface (connector)
  • islands: group databases with the same data model, trade off semantic completeness for location transparency (which implies having a single query language)
  • shim translates island queries to db engine’s native language
  • cast enables moving data between db engines

61 of 348

MIT on heterogeneous data management

  • BigDAWG currently supports PostgreSQL, MySQL, Vertica (?), Apache Accumulo (k-v store), SciDB (??), and S-Store (???)
  • interesting challenges that BigDAWG tackles
  • cross-engine query plans
  • query performance monitoring
  • cross-engine data migration
  • query execution

62 of 348

MIT on key considerations when developing data hub

Technological considerations

  • don’t rely on a centralized data lake
  • single point of failure
  • difficult to store operational data
  • data becomes stale (hard to update)
  • scalable storage architecture that federates multiple specialized data stores
  • provide access to computing infrastructure for data processing

63 of 348

MIT on key considerations when developing data hub

Infrastructure considerations

  • compute & storage nodes, networking equipment
  • cloud computing
  • almost unlimited resources as long as you have $
  • cloud clawback: hybrid cloud
  • vendor lock-in
  • ** some tools may not run on sensitive networks **
  • a data catalog that is updated automatically and perhaps even provides information on related previous work (information sharing & research replication)

64 of 348

MIT on key considerations when developing data hub

Data formatting considerations

  • 80% of ML/AI research time is spent on data wrangling
  • data providers are a crucial source for understanding the meaning of the data
  • maximize data provider input (knowledge transfer) during data ingestion & wrangling

65 of 348

MIT on key considerations when developing data hub

Security Considerations

  • data providers and consumers should be able to trust you
  • work with Information Security Officers and Subject Matter Experts
  • data aggregation can make insensitive data sensitive: name and address by themselves may not be sensitive, but matching them together creates PII (personally identifiable information)
  • data use agreements (like cookies ~)

66 of 348

MIT on key considerations when developing data hub

Policy considerations

  • new data may warrant addition of new technology
  • open vs. closed source: balance between long-term support and developer community
  • software licensing may prohibit technology sharing (MBZUAI to NTU for instance)

67 of 348

MIT on key considerations when developing data hub

User engagement

  • active and engaged users and developers
  • quick, transparent, and effective feedback loop
  • ask big projects to share data
  • remove costs of data sharing
  • invest in automatic data discovery: data hub should have the latest data to attract more users

68 of 348

UCSC Xena Platform

https://www.nature.com/articles/s41587-020-0546-8

enables visualization and interpretation of cancer genomics data from disparate sources

Background

  • large public datasets & small and private datasets from individual labs
  • web app (usually for public datasets) | desktop app (usually for private datasets)
  • driving use case: visualize public and private datasets together without having to upload private data to external servers.

69 of 348

UCSC Xena Platform

Architectural Design

two components:

  • frontend (Xena Browser): users create requests, frontend fetches and integrates data from different hubs. data source configuration is done via hub URLs
  • backend (Xena Hubs): can be installed on a desktop to launch a local server, or hosted on public servers which sometimes have their own access control via firewalls. no integration here; private data made available by a local server never gets uploaded anywhere else

70 of 348

UCSC Xena Platform

71 of 348

Lessons from government agencies

72 of 348

Taiwan’s Health and Welfare Data Center (HWDC)

https://pubmed.ncbi.nlm.nih.gov/31118821/

Background

  • National Health Insurance launched in 1995, 99.99% population enrollment
  • National Health Research Institutes established in 2002, in charge of National Health Insurance Research Database (NHIRD)
  • NHIRD stores insurance claims, which means we have data on the entire population
  • NHIRD sits at the core of HWDC

73 of 348

Taiwan’s Health and Welfare Data Center (HWDC)

Background

  • relies heavily on personal identification numbers (PINs) to join data from different registries (beneficiaries, inpatient claims, prescriptions, …)
  • has enabled international collaborative studies
  • aggregates data from more than 70 databases
  • linkable (JOIN ON PIN)
  • de-identified, publicly available data
  • takes roughly two years for data to become available in the NHIRD
  • only handles structured data: missing out on medical images!

74 of 348

Taiwan’s Health and Welfare Data Center (HWDC)

Challenges: Privacy

  • the top concern; they got sued.
  • encrypts names (transforms them into unique identifiers): data masking?
  • access restricted to applicants from academia, research institutes, or hospitals
  • supports remote access to the server but requires an office visit after each session
  • for research purposes only
  • cannot request data on more than 10% of the population

75 of 348

Taiwan’s Health and Welfare Data Center (HWDC)

Challenges: Data Quality

  • upcoding (health care providers submitting exaggerated diagnoses to guarantee reimbursement)
  • is this noise that can be overcome with large samples?
  • how to assess impact?
  • less of a problem when diagnoses were made to certify catastrophic illness (waives premium and co-pay)

76 of 348

Taiwan’s Health and Welfare Data Center (HWDC)

Challenges: Unmeasured or unavailable variables

  • lacks data on disease severity, a confounding variable
  • a confounder affects both the dependent and independent variables
  • disease severity affects both the treatment and the outcome
  • solutions (see the sketch after this list):
  • instrumental variables: Z -> X -> Y
  • active comparator: instead of a placebo, compare against another commonly administered treatment
  • propensity score: compare probabilities of receiving treatment
  • predict disease severity from diagnoses when available (e.g. strokes)
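A minimal propensity-score sketch with scikit-learn, assuming a covariate matrix and a binary treatment indicator are available in the claims data; the estimated probability of receiving treatment can then be used for matching or weighting.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical covariates per patient, e.g. [age, num_prior_visits].
X = np.array([[70, 12], [35, 2], [60, 8], [40, 3], [75, 15], [30, 1]])
treated = np.array([1, 0, 1, 0, 1, 0])  # 1 = received the treatment

# Propensity score = estimated P(treatment | covariates).
model = LogisticRegression().fit(X, treated)
propensity = model.predict_proba(X)[:, 1]

# Patients with similar scores but different treatment status can be compared.
print(np.round(propensity, 2))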

77 of 348

Querying Encrypted Data

by Microsoft Research

78 of 348

Querying Encrypted Data: Demand

migration to the cloud, and thus storage in the cloud

79 of 348

Querying Encrypted Data

enters encryption:

you won't be able to understand/use the data even if you manage to gain access to it

the quick brown fox … a pangram

80 of 348

Querying Encrypted Data

two fundamental techniques:

  • compute over encrypted data: homomorphic encryption
  • compute over plain text in “secure” location

81 of 348

Querying Encrypted Data

more on the previous page

partial: one gate (+ or *)

full: multiple gates and unbounded depth (i.e. arbitrary functions)

Monomi: work of MIT CS and AI Lab (CSAIL) in 2013

CryptDB: from CSAIL too, 2011, Raluca Ada Popa is an Associate Professor at Berkeley EECS now

TrustedDB: Stony Brook University (NY), 2014

Cipherbase: Microsoft research, 2015

82 of 348

Querying Encrypted Data

symmetric encryption

83 of 348

Querying Encrypted Data

asymmetric encryption

84 of 348

Querying Encrypted Data

Advanced encryption standard cipher block chaining mode

85 of 348

Querying Encrypted Data

nondeterministic (more secure): the same plaintext encrypts to a different ciphertext each time

>> for i in range(2):
>>     print(encrypt("foo"))
>> 1qaz2wsx
>> 3edc4rfv

deterministic: the same plaintext always encrypts to the same ciphertext

>> for i in range(2):
>>     print(encrypt("foo"))
>> 1qaz2wsx
>> 1qaz2wsx

86 of 348

Querying Encrypted Data

homomorphic encryption (see the sketch after these examples):

>> encrypt(1) + encrypt(1) == encrypt(2)

order preserving encryption:

  • if x < y, encrypt(x) < encrypt(y)
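A minimal sketch of the additive (partially) homomorphic property using the python-paillier (phe) package, assuming it is installed; order-preserving encryption is not shown.

from phe import paillier  # python-paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Anyone holding only the public key can add ciphertexts ...
c = public_key.encrypt(1) + public_key.encrypt(1)

# ... but only the private-key holder can see that the result is 2.
assert private_key.decrypt(c) == 2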

87 of 348

Querying Encrypted Data

to summarize

88 of 348

Querying Encrypted Data

authors’ opinion: outdated?

89 of 348

Querying Encrypted Data

90 of 348

Querying Encrypted Data: Trusted Client Architecture

client: applications performing CRUD operations against a server DBMS

  • DBMS only stores encrypted data
  • client responsible for hosting encryption/decryption microservice and related query translator

91 of 348

Querying Encrypted Data: CryptDB as Trusted Client example

CryptDB:

  • uses partially homomorphic encryption
  • no computation on client side

92 of 348

Querying Encrypted Data: CryptDB as Trusted Client example

example operation (see the sketch at the end of this slide)

client app

>> SELECT SUM(grade) FROM students;

web proxy

>> decrypt(paillier_grade_sum)

DBMS

>> SELECT PAILLIER_SUM(paillier_grade) AS paillier_grade_sum

FROM students;

students table's schema:

from the user's perspective

{
  "id": int,
  "name": str,
  "grade": int
}

on the DBMS

{
  "paillier_id": str,
  "name": str,
  "paillier_grade": str
}
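A small sketch of the flow above using python-paillier, assuming the web proxy holds the keys and the DBMS only ever sees ciphertexts; PAILLIER_SUM is emulated by adding ciphertexts on the server side.

from phe import paillier

# --- web proxy (trusted): holds the keypair -------------------------------
public_key, private_key = paillier.generate_paillier_keypair()
grades = [88, 92, 75]
paillier_grade_column = [public_key.encrypt(g) for g in grades]  # sent to DBMS

# --- DBMS (untrusted): computes PAILLIER_SUM without any key --------------
paillier_grade_sum = paillier_grade_column[0]
for c in paillier_grade_column[1:]:
    paillier_grade_sum = paillier_grade_sum + c

# --- web proxy: decrypts the aggregate for the client app -----------------
print(private_key.decrypt(paillier_grade_sum))  # 255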

93 of 348

Querying Encrypted Data: Blob Store as Trusted Client example

What is Blob Store?

  • database design
  • blob: encrypted data (e.g. paillier_grade -> blob_grade)
  • blob used in conjunction with partition id

students table’s schema:

{
  "name": str,
  "blob_grade": str,
  "partition_id": int
}

range partition:

grade     partition_id
1 - 9     0
10 - 19   1
20 - 29   2

94 of 348

Querying Encrypted Data: Blob Store as Trusted Client example

Architecture

  • a bit similar to CryptDB
  • difference? more query processing is moved to the client's side (not just translation anymore)

95 of 348

Querying Encrypted Data: Blob Store as Trusted Client example

example operation (see the sketch below)

client app

>> SELECT SUM(grade)

FROM students

WHERE grade > 10;

DBMS Shell (trusted, client side)

>> SELECT SUM(DECRYPT(grade_blob))

FROM students

WHERE DECRYPT(grade_blob) > 10;

DBMS (untrusted, server side)

>> SELECT grade_blob

WHERE partition_id > 0;

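A small sketch of that query split in Python, assuming the partition table from slide 93 and a toy encrypt/decrypt pair held only by the trusted client; the server filters coarsely by partition_id, the client applies the exact predicate and computes the SUM.

KEY = 42
def encrypt(v): return v ^ KEY   # stand-in for a real cipher, client-side only
def decrypt(b): return b ^ KEY

PARTITIONS = [(0, 1, 9), (1, 10, 19), (2, 20, 29)]  # (partition_id, lo, hi)

# --- untrusted server: rows of (grade_blob, partition_id) -----------------
server_rows = [(encrypt(g), pid) for g in (5, 12, 25)
               for pid, lo, hi in PARTITIONS if lo <= g <= hi]

def server_scan(min_partition_id):
    # The server only understands the coarse partition predicate.
    return [blob for blob, pid in server_rows if pid >= min_partition_id]

# --- trusted client: rewrite "grade > 10" into a partition filter ---------
candidate_pids = [pid for pid, lo, hi in PARTITIONS if hi > 10]
blobs = server_scan(min(candidate_pids))

# Exact filtering and aggregation happen on decrypted values, client-side.
print(sum(g for g in map(decrypt, blobs) if g > 10))  # 12 + 25 = 37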

96 of 348

Querying Encrypted Data: Blob Store as Trusted Client example

challenge

  • appropriate partitioning
  • optimal query splitting

97 of 348

Querying Encrypted Data: Monomi as Enhanced Blob Store

  • uses partially homomorphic encryption to push more computation onto the untrusted DBMS server
  • requires pre-computation for complex queries (e.g.

WHERE updated_at = created_at + 1 -> without the key, you don't know encrypt(1))

example operation

client app

>> SELECT SUM(grade)

FROM students

WHERE grade > 10;

DBMS Shell (trusted, client side)

SUM(DECRYPT(grade_paillier))

DBMS (untrusted, server side)

>> SELECT grade_paillier

WHERE grade_paillier > 10;

98 of 348

Querying Encrypted Data: Trusted Client Summary

  • does not require changing the untrusted server DBMS
  • works well when the amount of data transferred is small
  • relies on distributed query processing after all
  • pre-computation for complex queries is expensive
  • generalizability?
  • partially homomorphic encryption is still expensive: it takes 2048 bits to store an int32 -> 64x expansion!
  • partially homomorphic encryption supports a limited set of operations

99 of 348

Querying Encrypted Data: Secure In-Cloud Processing

we talked about trusted client in the previous slides

100 of 348

Querying Encrypted Data: Secure In-Cloud Processing

how do queries under different architectures work?

benefit of having a smaller TCB?

notice how isolation (hardware, memory) affects the size of the base

101 of 348

Querying Encrypted Data: Secure In-Cloud Processing

according to this paper:

larger TCB: less secure, more administration

smaller TCB: more secure, less administration

102 of 348

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

Highlights

  • secure coprocessor as secure hardware
  • MySQL: untrusted main storage

SQLite: trusted storage for sensitive data

  • similar to what we talked about in blob store: distributed query processing
  • PCI: payment card industry

103 of 348

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

students table

MySQL (untrusted):

id    name    birthday_timestamp
0     Andrew  866476800
1     Bart    961171200
2     Casey   1087401600

SQLite (trusted):

id    SSN
0     123456789
1     123456790
2     123456791

104 of 348

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

example query

User Query

>> SELECT *

>> FROM students

>> WHERE birthday_timestamp > 866476800 AND

>> SSN > 123456790

105 of 348

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

What Happens Under the Hood

>> SELECT *

>> FROM mysql_students

>> RIGHT JOIN (

>> SELECT id

>> FROM students

>> WHERE SSN > 123456790

>> ) AS sqlite_result

>> ON mysql_students.id = sqlite_result.id

>> WHERE mysql_students.birthday_timestamp > 866476800

106 of 348

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

performance bottleneck:

  • secure coprocessors have limited computational and storage capacities
  • this becomes a problem when most of the data is confidential
  • the trusted part of the server is a full DBMS (a large TCB); how do you go about convincing your customers to trust that?

107 of 348

Querying Encrypted Data: Cipherbase as an Example of Secure In-Cloud Processing

  • should remind you of CryptDB (encrypted data on MySQL, a trusted web proxy performing query translation)
  • here, the client proxy is replaced by an FPGA, which, by the way, sits on the server side

108 of 348

Querying Encrypted Data: Cipherbase as an Example of Secure In-Cloud Processing

  • "when all data is strongly encrypted (PHE), Cipherbase achieves 40% of the throughput of plaintext SQL Server"
  • "when all columns in TPC-C (Transaction Processing Performance Council Benchmark C) are changed from plaintext to strongly encrypted, the performance drops by about a factor of 2. This number can be around 100 for a client-based system"
  • more about this paper next week !!!

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cipherbase.pdf

109 of 348

Querying Encrypted Data: Semantic Security

“An adversary is allowed to choose between two plaintexts, m0 and m1, and he receives an encryption of either one of the plaintexts. An encryption scheme is semantically secure, if an adversary cannot guess with better probability than 1/2 whether the given ciphertext is an encryption of message m0 or m1. The notion is also referred to as indistinguishability of encryptions. The encryption reveals no information no matter what kind of semantics (meanings) are embedded in the encryption.”

https://link.springer.com/referenceworkentry/10.1007/978-1-4419-5906-5_23

110 of 348

Querying Encrypted Data: Leakage

what information does encryption schemes reveal?

AES in CBC mode (non-deterministic)

  • semantically secure
  • only reveal input length

deterministic

  • an attacker can still learn a bit about the data by counting the number of unique values and their distribution

order preserving

  • order …

111 of 348

Querying Encrypted Data: Leakage

memory access pattern leakage

malicious actors watching on the server end can study which storage locations different operations access

e.g.

observer learns the following workflow

query A -> access location 0 ~ 9 -> wires money to some account

which implies

location 0 ~ 9 must store information required for some transaction

112 of 348

Querying Encrypted Data: Leakage

  • encryption != no information leakage
  • in a trusted-client architecture, we can hide the size of query results by padding them to a fixed limit, but this seriously hinders performance
  • can we, instead, make the memory access pattern oblivious (independent of the input)?

113 of 348

Querying Encrypted Data: Access Pattern Oblivious Algorithms

bubble sort (each pass reads arr[j] and arr[j + 1] in the same fixed order, regardless of the data)

for i in range(len(arr) - 1, -1, -1):
    for j in range(0, i):
        # the pair (j, j + 1) is always read; only the swap is conditional
        if arr[j] < arr[j + 1]:
            continue
        arr[j], arr[j + 1] = arr[j + 1], arr[j]

114 of 348

Querying Encrypted Data: Access Pattern Oblivious Algorithms

aggregator oblivious encryption

  • "an aggregator [that] can compute an aggregated sum of data and [without learning] anything else"
  • even if ordinary "homomorphic encryption is employed, the aggregator can still decrypt the ciphertexts being computed upon"
  • an interesting paper to be studied further

https://eprint.iacr.org/2017/479.pdf

oblivious simulation

an additional layer that makes the memory access pattern look random -> prevents benefiting from spatial and temporal locality

115 of 348

Querying Encrypted Data

(from the Cipherbase paper)

In conclusion, what do we want?

solution that balances

  • data confidentiality: minimize information leakage
  • generality: try to support full SQL operations, minimize change to legacy code
  • performance: throughput and latency

116 of 348

Transaction Processing on Confidential Data using Cipherbase

background

  • the previous paper is somewhat an introduction to this one
  • collaboration between Microsoft Research, Stanford University, and ETH Zurich (Einstein's alma mater; he also taught theoretical physics there)

117 of 348

Transaction Processing on Confidential Data using Cipherbase

Abstract (on Cipherbase):

  • a database system
  • end-to-end data confidentiality through encryption
  • what's new?
  • the architecture with the smallest Trusted Computing Base (TCB) at the time
  • combines an industrial-strength SQL database with lightweight encrypted data processing on secure hardware (FPGAs, a novel usage, they argue)
  • competitive performance for transactional workloads (OLTP as opposed to OLAP)
  • discusses optimization strategies for:
  • communication between the main untrusted DB server and the trusted FPGA
  • the limited computational and storage capacities of the secure hardware

118 of 348

Transaction Processing on Confidential Data using Cipherbase

Architecture Benefits (aside from what was already mentioned last week)

  • inherits SQL functionality for free: concurrency, recovery, stored procedures, and indexes => the trusted module is simply a remote function to call
  • relies on a PCIe-based FPGA board

119 of 348

Transaction Processing on Confidential Data using Cipherbase

Threat Model (the categories below are assumed to be passive, i.e. they do not tamper with database contents or processing)

strong adversary:

has privileged OS and database access, can view the contents of memory/disk and all internal/external communication. However, cannot observe state and computation inside the secure hardware.

weak adversary:

can obtain a one-time snapshot of the data at rest

in the real world:

the adversary usually lies somewhere between the two

120 of 348

Transaction Processing on Confidential Data using Cipherbase

Data Confidentiality (assuming all columns are strongly, i.e. non-deterministically, encrypted)

  • no metadata confidentiality (number of tables and columns, data lengths)
  • operational data confidentiality against a strong adversary: operations over the data leak some information
  • equivalent to an abstract system that uses an oracle to compute over strongly encrypted values
  • non-boolean operations such as addition: encrypted input & output
  • boolean operations such as equality checks: plaintext True/False

121 of 348

Transaction Processing on Confidential Data using Cipherbase

relational algebra notations:

projection - subset of columns

restriction (selection) - subset of rows

security guarantee of range index is similar to that of order preserving encryption

122 of 348

Transaction Processing on Confidential Data using Cipherbase

Data Confidentiality comparing to prior work

CryptDB (PHE)

  • operational information is leaked even to a weak adversary
  • the scope of the leakage is an entire column, not just the data touched during the query

e.g. to evaluate an equality predicate over a column A of R,

CryptDB has to store the entire column deterministically encrypted, which leaks the equivalence relation & distribution when a weak adversary obtains a snapshot

On the other hand, with Cipherbase

a scan-based plan only reveals the outcome of an unknown predicate over A for R's tuples (because values can be non-deterministically encrypted)

123 of 348

Transaction Processing on Confidential Data using Cipherbase

Architecture & Overview of query/transaction processing

database processing is about

  • data value semantics (expression evaluation, not easily done over encrypted data)
  • moving data
  • query setup/ result communication
  • concurrency control
  • recovery

124 of 348

Transaction Processing on Confidential Data using Cipherbase

Cipherbase client

  • presents a plaintext database interface
  • query encryption & result decryption
  • has encryption schema and key

125 of 348

Transaction Processing on Confidential Data using Cipherbase

Trusted Module

evaluates expressions over encrypted data during query processing

e.g. to evaluate a predicate over column A of R, it needs a stack program that takes encrypted A as input, outputs plaintext True/False, and is invoked once for each tuple of R

this usage pattern inspires the design => a program is registered once, then invoked using the handle (id) returned by the register method

the TM has a data cache but is otherwise stateless

126 of 348

Transaction Processing on Confidential Data using Cipherbase

Key Management

keys communicated to trusted module using standard key-exchange techniques: secure hardware device manufacturer assigns a public key

Data Encryption

supports cell-level (row + column specification) encryption

=> minimizes TM traffic

=> incurs storage penalty (the ciphertext of a 4-byte integer is 12 bytes in AES + CTR mode)

127 of 348

Transaction Processing on Confidential Data using Cipherbase

Indexing

range index:

  • keys can be non-deterministically encrypted, but the indexed keys in the B-tree are ordered by their plaintext values
  • index building and lookups require round-trips to the TM
  • the TM stores pre-registered programs that return the comparison result (<, =, >) in plaintext
  • the B-tree itself is stored in the UM

=> which means a weak adversary learns the ordering but not the equality relationships

=> a strong adversary learns the equality relationships of the queried ranges

128 of 348

Transaction Processing on Confidential Data using Cipherbase

inserting 7 into a range index

129 of 348

Transaction Processing on Confidential Data using Cipherbase

Indexing

equality index:

  • index keys are deterministically encrypted first
  • then stored in range index
  • ordering is determined by ciphertext not plaintext
  • unclear about intention of design above

130 of 348

Transaction Processing on Confidential Data using Cipherbase

transaction processing

First, client issues query Q

during the prepare step:

  • the client identifies the TM programs required to evaluate the expressions in Q
  • the client registers each program with the TM, which in turn responds with an id that references said program
  • the client gets the response and rewrites Q to reference the registered program by its id

131 of 348

Transaction Processing on Confidential Data using Cipherbase

e.g.

user issues:

>> UPDATE Accounts

>> SET Balance = Balance + @Amt

>> WHERE Id = @Id

Cipherbase Client rewrites into:

>> UPDATE Accounts

>> SET Balance = TMEval(21, Balance, @Amt)

>> WHERE Id = @Id

TMEval: built-in function on UM (SQL server) added by Cipherbase to invoke stack programs on TM

After the rewritten query is submitted to the UM, it prepares a query plan like the following:

  • assuming the Id column is indexed, identify the record that matches the filter
  • send its balance and the required (encrypted) params to program 21
  • program 21 returns the encrypted sum
  • the UM only thinks that it is replacing one binary value with another (encrypted data is aliased to binary data types)

132 of 348

Transaction Processing on Confidential Data using Cipherbase

PHE avoids round-trips to TM:

in the previous example, under a scan-based plan, and if Id column is deterministically encrypted, query does not have to involve TM

133 of 348

Transaction Processing on Confidential Data using Cipherbase

Trusted Module

  • in this paper, realized using an FPGA, but it does not have to be
  • the authors discuss using Intel Software Guard Extensions (SGX), a set of instructions that allow creating a process within a protected address space called an enclave (where data and computation are shielded even from the OS)
  • performance concerns regardless of the hardware used:
  • communication bandwidth (between UM and TM)
  • latency incurred during communication

134 of 348

Transaction Processing on Confidential Data using Cipherbase

System engineering and optimization details

Transaction processing challenges:

  • transactions incur large number of TM accesses
  • good thing is that these calls only transmit a small amount of data (e.g. encrypted addition involves 2 * 12 bytes input and 12 bytes output)
  • however, round-trips to TM incur latency:
  • PCIe transfer latency
  • TM has limited computational resource

135 of 348

Transaction Processing on Confidential Data using Cipherbase

  • latency in expression evaluation indirectly reduces throughput (transactions hold on to locks longer, which increases data contention)
  • TM is a shared resource, which means concurrency can be a problem
  • we can increase concurrency by batching communication (multiple work units in a single communication)
  • but batching exposes computational resource and bandwidth as bottlenecks because stack programs and data transfer are invoked at a higher rate

136 of 348

Transaction Processing on Confidential Data using Cipherbase

Optimizations

  • increase parallelism by having multiple FPGA stack machines => process n calls independently
  • as mentioned previously, batching
  • increases throughput: stack machines (SMs) spend more time computing and free the bus for other SMs
  • sometimes increases latency, when work units have to wait for a longer-running unit in the same batch to complete. Empirically, the authors argue that the benefits outweigh the potential drawbacks

137 of 348

Transaction Processing on Confidential Data using Cipherbase

  • expression folding (intra-transaction batching): reduce TMEval calls in the same query if possible => reduces TM workload, PCIe traffic, and latency (computation reuse)

138 of 348

Transaction Processing on Confidential Data using Cipherbase

user issues

>> UPDATE Inventory

>> SET item_cnt = item_cnt + @qty,

>>     total_cnt = total_cnt + @qty

>> WHERE storeId = @Id

Naive Cipherbase rewrite

>> UPDATE Inventory

>> SET item_cnt = TMEval(0, item_cnt, @qty),

>>     total_cnt = TMEval(1, total_cnt, @qty)

>> WHERE storeId = @Id

expression folding

>> UPDATE Inventory

>> SET @var = TMEval(2, item_cnt, total_cnt, @qty),

>>     item_cnt = UMExtract(@var, 0),

>>     total_cnt = UMExtract(@var, 1)

>> WHERE storeId = @Id

i.e. common subexpression elimination

139 of 348

Transaction Processing on Confidential Data using Cipherbase

  • index lookup vectorization: requires traversing the B-tree (see the sketch after this list)
  • naive: a TM call for every comparison
  • optimized: in each call
  • pass the search key and the entire tree node (multiple keys in one node) to the TM
  • binary-search the target key over the node (which is sorted by construction)
  • only the keys touched by the binary search are decrypted, and the search key is decrypted only once per node
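A rough Python sketch of the vectorized lookup under stated assumptions (a toy xor "cipher" standing in for the real decryption inside the TM, and a node whose keys are sorted by plaintext); the point is that only the keys the binary search touches get decrypted, and the search key is decrypted once per node.

KEY = 42
def encrypt(v): return v ^ KEY
def decrypt(c): return c ^ KEY   # stand-in for decryption inside the TM

def tm_vectorized_lookup(enc_search_key, enc_node_keys):
    """One TM call: binary-search an entire B-tree node, return the child slot."""
    search_key = decrypt(enc_search_key)          # decrypted once per node
    touched = 0
    lo, hi = 0, len(enc_node_keys)
    while lo < hi:
        mid = (lo + hi) // 2
        touched += 1                              # only touched keys are decrypted
        if decrypt(enc_node_keys[mid]) <= search_key:
            lo = mid + 1
        else:
            hi = mid
    return lo, touched

node = [encrypt(k) for k in (10, 20, 30, 40, 50, 60, 70)]  # sorted by plaintext
slot, touched = tm_vectorized_lookup(encrypt(33), node)
print(slot, touched)  # child slot 3; only ~log2(7) keys were decrypted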

140 of 348

Hitting Pause on Cipherbase

  • the solutions we have explored during the past few weeks (CryptDB, TrustedDB, Cipherbase, etc.) were built on partially homomorphic encryption, which supports only one type of gate, typically + or *.
  • due to the limited set of available operations, we saw 2 common approaches:
  • trusted client: encrypted data on the DB server; delegate much of the query processing to the client, who holds the secret key
  • secure location (trusted computing base): encrypted data on the main DB, but the server also holds the key and performs complex operations on plaintext data inside secure hardware
  • the question now is: does advancement in fully homomorphic encryption (FHE) render the above technologies obsolete?

141 of 348

Hitting Pause on Cipherbase

  • theoretical benefits of FHE
  • capable of evaluating arbitrary circuits with multiple gates of unbounded depth (i.e. any function)
  • in other words, computation can be performed as normal on the untrusted server database when the data there is fully homomorphically encrypted
  • only requires slight modification of the database: special addition, multiplication, etc.
  • in our meeting last week (11/30/2023), we looked at Microsoft SEAL, which promises to enable building "end-to-end encrypted data storage and computation services." see their website below:

https://www.microsoft.com/en-us/research/project/microsoft-seal/

142 of 348

Hitting Pause on Cipherbase

however, what is SEAL actually capable of?

silent confession by Microsoft:

https://learn.microsoft.com/en-us/azure/architecture/solution-ideas/articles/homomorphic-encryption-seal

143 of 348

Hitting Pause on Cipherbase: Microsoft SEAL

  • allows + and * on encrypted integers or real numbers. Comparison and sorting are not supported.

=> this means a large portion of SQL queries is unachievable

  • supports 2 FHE schemes
  • BFV (2012): modular arithmetic, yields exact values
  • CKKS (2016): + and * on real or complex numbers, yields approximate results

this restriction is entirely contradictory to the idea of FHE

  • it is not recommended to "encrypt entire large databases" with SEAL; computations should be "fairly lightweight"

144 of 348

Hitting Pause on Cipherbase: Microsoft SEAL

  • homomorphic encryption schemes typically have a single secret key, which makes them unsuitable for use cases where multiple private data owners (each with their own secret key) want to perform collaborative computations

this is a problem common to Cipherbase and all the previously discussed technologies

145 of 348

Switching to Data Observatory

146 of 348

Clarifying Survey Direction

questions to address:

  • what to put on the big screen
  • how can this hardware contribute to advancing artificial intelligence?
  • visualization is at the core of interpretability and explainability
  • towards trustworthy AI; improves accountability
  • helps build better models
  • where do we come in, and how could this project be related to the data hub?
  • visualization depends on data
  • connect to the local data center? a small, independent copy?
  • a local data hub software/service that communicates with the MBZUAI data center (similar to the Internet of Water Data Hub design)?

147 of 348

Thinking Outside the Box: AI4VIS

https://arxiv.org/pdf/2102.01330.pdf

background

  • on the previous slide, we talked about visualizing artificial intelligence,

i.e. model => visualization

  • what if we reverse the arrow above?

i.e. visualization => model

  • traditionally, visualizations are discussed as something created for humans
  • increasingly, however, we see visualizations “created, shared, collected, and reused”
  • so, why not treat visualization like any other data format (i.e. text and image)?
  • however, “why should we teach machines to read charts made for humans”?

148 of 348

Thinking Outside the Box: AI4VIS

Why should MBZUAI care?

  • we have a gigantic screen, but it seems like we are not so sure about what to show on it
  • heavier utilization helps further knowledge, but generating visualizations is time-consuming
  • on the previous slide, we mentioned reuse
  • the main reasons for applying artificial intelligence to visualization data discussed by this paper:
  • visualization generation
  • visualization enhancement
  • visualization analysis
  • this means that if we build the following cycle:

model => visualization => model

we are not only advancing AI in two directions but also giving ourselves more content to show on our screen

149 of 348

Thinking Outside the Box: AI4VIS

credence: author affiliation

  • three at Hong Kong University of Science and Technology
  • another three at Microsoft Research Asia
  • Dominik Moritz: Carnegie Mellon University (HCI Professor) and Apple (Research Scientist)

150 of 348

Thinking Outside the Box: AI4VIS

definitions

  • visualization is defined as a graphical representation of data
  • main categorizations
  • graphics: appear most frequently; raster (bitmaps) / vector (describes visual elements as shapes and styles)
  • programs: instructions to construct the visualization
  • hybrid: embed programs/meta-information into graphics
  • representation
  • internal: programs contain design details but less often semantic information; the system warrants a simpler and more structured format (e.g. {"data": "cars.csv", "type": "hist", "x": "mpg", "y": "hp"})
  • feature: input to ML models, extracted by feature engineering and feature learning

151 of 348

Thinking Outside the Box: AI4VIS

why apply AI to visualization data (mentioned previously but now in more detail)

  • generation
  • the central research question in the community
  • data-based (extensively studied: given data, show me something), anchor-based (given an anchor visualization, show me the next one), design-based (inject data into a template design), and context-based (given contextual info such as a natural language description, show me something)
  • enhancement
  • retarget to a different environment (e.g. fit an image to a cell phone display while keeping all the important elements)
  • annotate, add captions
  • question answering
  • interactions

152 of 348

Thinking Outside the Box: AI4VIS

  • analysis
  • recent research has focused on creating visualization databases
  • retrieval: help users find the examples they need (retrieve-then-adapt)
  • mining: understanding design patterns and population statistics (crawl the web to discover trends in visualizations)

153 of 348

Thinking Outside the Box: AI4VIS

how is the research community applying AI to visualizations?

  • transformation
  • program <=> graphics (reverse engineering is difficult and expensive)
  • decomposing: detect text, shapes, charts, and element clusters (semantic groups, e.g. axes, legends, annotations)
  • composing: figure out coordinates on a scatter plot based on the information obtained from decomposition
  • assessment
  • rank the quality of visualizations
  • related to enhancement because scoring metrics can be used as cost functions
  • challenging because humans define when visualizations are "good" => requires large-scale empirical assessment

154 of 348

Thinking Outside the Box: AI4VIS

  • comparison
  • how "close" are the given visualizations?
  • can also be used as cost functions to build recommendation and query systems
  • difference: GraphScape (University of Washington) proposes a directed graph model. Each edge represents an edit and each vertex denotes a resulting visualization. Edge weights/costs are determined by human judgement
  • distance: project to a feature vector, apply a distance function
  • querying
  • how to specify user needs: keywords, natural language, structural queries, examples
  • how to match needs: exact vs. best match

155 of 348

Thinking Outside the Box: AI4VIS

  • reasoning
  • interpret visualizations to derive insights beyond the data encodings, such as semantic information
  • visual perceptual learning (e.g. a neural network to predict the human-perceived importance of visualization images)
  • chart summarization: a natural language description/caption on how to interpret the chart (e.g. "shows average precipitation in Taipei over 2023")
  • visual question answering: traditionally, transform visualizations into tables and translate questions into queries
  • recommendation
  • SeeDB recommends visualizations by their deviation from an anchor
  • map data with encodings to recommend styling

156 of 348

Thinking Outside the Box: AI4VIS

  • mining
  • design patterns: should motivate design ideas instead of just showing simple statistics with little meaning
  • data patterns: respective to type of charts

157 of 348

Thinking Outside the Box: AI4VIS

future research opportunities

  • interoperability: how to combine visualizations output from different systems
  • CV models for natural images do not work well for visualizations, because small errors could greatly distort the meaning of a visualization (e.g. misinterpreting the axes)
  • augmenting ML models with results from empirical studies
  • collecting data on how humans perceive, assess, and use visualization data at scale
  • striking a better balance between human- vs. machine-friendly visualizations

158 of 348

AI4VIS: implementation example from Data2VIS => LIDA

159 of 348

Data2Vis => LIDA

160 of 348

Data2Vis => LIDA

background

  • main theme of these two papers: a data-driven visualization generation approach (in the AI4VIS paper's terminology)
  • relevance between the two

Data2Vis (2019, submitted on arXiv in 2018) authors:

Victor Dibia (then at IBM Research) and Çağatay Demiralp (MIT CSAIL)

LIDA (submitted on arXiv in March 2023, open-sourced on GitHub only in August 2023, and described by Victor as an update of Data2Vis) author:

Victor Dibia at Microsoft Research

161 of 348

Data2Vis => LIDA

summary - Data2Vis

  • formulation of visualization generation as a language translation problem (i.e. mapping data to declarative visualization program languages such as Vega-Lite)
  • model: bidirectional recurrent neural network with Long-Short Term Memory cells under encoder-decoder architecture with attention mechanism
  • results: difficult to quantify. Qualitatively, though, the authors observed that the model learns Vega-Lite syntax, appropriate transformations on different fields, and data selection patterns in creating bivariate plots (e.g. group by age/ sex)
  • limitations
  • small training data: 4300 Vega-Lite visualizations from 11 datasets
  • model sometimes (in 15-20% of tests) selects fields that do not exist
  • model sometimes selects fields that have little information value

162 of 348

Data2Vis => LIDA

Visualizations created by Data2Vis

163 of 348

Data2Vis => LIDA

summary - LIDA

  • tool for generating grammar-agnostic visualizations and infographics
  • solved with LLMs and IGMs (image generation models)
  • visualization generation as a multi-stage generation problem
  • summarizer: turns the data into a semantically rich natural language description. LLMs suffer from hallucination, which sometimes leads them to produce output not grounded in the data; the summary provides grounding context for generating visualizations

  • goal explorer: generates a set of visualization goals

question (hypothesis): what are we trying to understand

visualization: e.g. histogram of miles per gallon

rationale: why should we ask the question above?

164 of 348

Data2Vis => LIDA

  • visGenerator (a minimal usage sketch of the whole pipeline follows below)

generates, evaluates, repairs, and executes visualization code

code scaffold constructor: imports dependent libraries and creates function stubs

code generator & executor: fills in the TODOs and produces the visualization

  • infographer: styles the generated visualization
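For reference, a minimal sketch of how this pipeline might be driven from Python using the open-source lida package; the Manager methods and arguments below reflect my reading of the project README and are assumptions, not verified against the current release:

```python
# hedged sketch of LIDA's summarize -> goals -> visualize pipeline (API names assumed)
from lida import Manager

lida = Manager()                                  # constructor args (LLM backend) omitted here
summary = lida.summarize("data/cars.csv")         # summarizer: grounding context for the LLM
goals = lida.goals(summary, n=3)                  # goal explorer: question + visualization + rationale
charts = lida.visualize(summary=summary, goal=goals[0])  # visGenerator: scaffold, fill, execute
print(goals[0], charts[0].code)                   # assumes the result object exposes the generated code
```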

165 of 348

Data2Vis => LIDA

  • results evaluation
  • visualization error rate: percentage of generated visualizations that do not compile/ run
  • asks GPT-4 to score outputs based on empirically agreed-upon metrics

166 of 348

Formalizing Visualization Design Knowledge as Constraints: Draco

167 of 348

Visualization Design Knowledge as Constraints: Draco

  • visualization research communities often study how people decode and interpret visualizations. However, their findings are slow to be adopted by practical tools
  • automated design systems do not reuse knowledge built up by their predecessors
  • this warrants a medium for representing and acting upon visualization design knowledge
  • Draco is a formal model that represents
  • visualizations as sets of logical facts
  • design guidelines as hard/soft constraints over these facts

168 of 348

Visualization Design Knowledge as Constraints: Draco

  • constraint programming:
  • a constraint program (declarative) is a set of constraints defining relations among several variables that must/should be satisfied by its solutions
  • constraints restrict values variables can take (represent partial information)
  • solutions are computed by (off-the-shelf) constraint solvers via inference and search over the constraint space
  • Draco models visualization knowledge using Answer Set Programming and Clingo as answer set solver

169 of 348

Visualization Design Knowledge as Constraints: Draco

taste of Answer Set Programming

A :- L_1, …, L_n => a rule

A => an atom, a proposition that may be true/false

L_i’s => literals; A is true only if all of them are true

bodiless rules encode facts, whereas headless rules impose integrity constraints (their bodies cannot all be satisfied)

e.g.

light_on :- power_on, not broken.

170 of 348

Visualization Design Knowledge as Constraints: Draco

integrity constraint defines how attributes should interact with each other:

:- mark(bar), channel(E, y), continuous(E), not zero(E)

rules out vertical bar charts that do not use 0 as a baseline

set of aggregate rules: specifies attributes of a visualization (e.g. mark, encoding)
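To make this concrete, here is a small runnable sketch using clingo’s Python bindings (assuming pip install clingo); the facts are illustrative only and are not Draco’s actual rule set:

```python
# encode a visualization as facts and apply the integrity constraint from this slide
from clingo import Control

program = """
mark(bar).
channel(e1, y).
continuous(e1).
% bar charts with a continuous y-channel must include zero
:- mark(bar), channel(E, y), continuous(E), not zero(E).
"""

ctl = Control()
ctl.add("base", [], program)
ctl.ground([("base", [])])
result = ctl.solve(on_model=lambda m: print("answer set:", m))
print("satisfiable:", result.satisfiable)  # False here; add the fact zero(e1). and it becomes True
```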

171 of 348

Visualization Design Knowledge as Constraints: Draco

  • primary use case: finding the optimal completion of a partially specified visualization
  • the last slide covered the “search space definition” section
  • the preference model leads us to soft constraints

172 of 348

Visualization Design Knowledge as Constraints: Draco

  • preference model (soft constraints) learned from data and forms a Markov logic network where constraints are associated with learnable weights indicating preference levels

e.g.

:~ continuous(E), not zero(E). [5]

this says that the model prefers to include 0 for continuous fields, and that violating this constraint incurs a cost of 5

=> Cost(v) = sum over soft constraints p_i of w_{p_i} * n_{p_i}(v)

where n_{p_i}(v) denotes the number of violations of soft constraint p_i by visualization v

173 of 348

Visualization Design Knowledge as Constraints: Draco

  • model trained using RankSVM (learning to rank approach)
  • dataset: labeled visualization pairs

using the cost function defined on the last page,

(v1, v2, y = sign(Cost(v1) - Cost(v2)))

=> y = -1 if v1 preferred over v2

=> let x(v) be the vector [n_{p_0}(v), …, n_{p_k}(v)] and w be the corresponding weight vector

=> Cost(v1) - Cost(v2) = w * x_1 - w * x_2 = w * (x_1 - x_2)

=> a linear model with L2 (ridge) regularization, minimizing the hinge loss (sketched below)
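A toy sketch of this learning-to-rank step using scikit-learn’s linear SVM on pairwise feature differences (a common RankSVM formulation); the violation counts below are made-up numbers, not Draco’s data:

```python
import numpy as np
from sklearn.svm import LinearSVC

# made-up violation-count features: one row per visualization, one column per soft constraint p_i
x1 = np.array([[0, 1, 2], [1, 0, 0], [2, 2, 1]], dtype=float)  # first element of each labeled pair
x2 = np.array([[1, 0, 1], [0, 2, 1], [0, 1, 0]], dtype=float)  # second element of each pair
y = np.array([-1, 1, 1])                                       # -1 => v1 preferred over v2

# RankSVM trick: a linear SVM (hinge loss + L2 penalty) trained on the difference vectors x1 - x2
model = LinearSVC(loss="hinge", C=1.0, fit_intercept=False)
model.fit(x1 - x2, y)

w = model.coef_.ravel()                 # learned soft-constraint weights
cost = lambda x: float(w @ x)           # Cost(v) = sum_i w_i * n_{p_i}(v)
print(w, cost(x1[0]) - cost(x2[0]))     # the sign should agree with the first label
```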

174 of 348

Visualization Design Knowledge as Constraints: Draco

Draco 1 (2018, from UDub Interactive Data Lab)

https://idl.cs.washington.edu/files/2019-Draco-InfoVis.pdf

Draco 2 (2023, from UDub, University of Vienna, University of Maryland, CMU)

https://arxiv.org/pdf/2308.14247.pdf

175 of 348

Experiments with Mac Studio

176 of 348

Last Week: exploring MLX on Mac Studio

177 of 348

Last Week: exploring MLX on Mac Studio

  • ran Mistral-7B, llama2-7B, Llama2-7B-chat successfully with MLX (weights need to be specially converted to fit into the framework)
  • llama2-70B returned weird output like the following “ensoensoensoenso…” for any attempted prompt
  • we suspected that something might have gone wrong during the weight conversion step (hardware resources should be sufficient)
  • upon further inspection

178 of 348

Last Week: llama 2 70B with MLX

179 of 348

Last Week: llama 2 70B with MLX

llama 2 -> MLX weights conversion example source code by Apple

  • only shard weights seem to be hardcoded
  • thoughts?

180 of 348

Llama 2 with Hugging Face: quickstart module

allows running inference on

  • self converted models stored locally
  • readily converted models stored remotely here:

https://huggingface.co/meta-llama

either approach requires submitting a request to Meta and/or Hugging Face

181 of 348

Llama 2 with Hugging Face: quickstart module

  • Why Hugging Face?
  • Many blogs online try running llama-2 locally using

https://github.com/ggerganov/llama.cpp

, which is mainly designed for quantized models (requires less computational and storage capacity)

  • since we are not necessarily in a resource-scarce environment, why not shoot for the stars?

182 of 348

Llama 2 with Hugging Face: 7B, sample prompt 1

183 of 348

Llama 2 with Hugging Face: 7B, sample prompt 2

184 of 348

Llama 2 with Hugging Face: 13B, sample prompt 1

185 of 348

Llama 2 with Hugging Face: 13B, sample prompt 2

186 of 348

Llama 2 with Hugging Face: 70B, sample prompt 1

187 of 348

Llama 2 with Hugging Face: 70B, sample prompt 2

188 of 348

Llama 2 with llama.cpp

why are we back?

  • as mentioned at the link below, Hugging Face’s benchmarking tools are deprecated; they recommend using external tools

https://huggingface.co/docs/transformers/benchmarks

  • I thought about adding timers here and there to measure performance myself in the little module I wrote to run llama 2 models with the HF Transformers library
  • but, in my opinion, these libraries abstract away so many details that it is hard to know what you are actually measuring
  • there is an example from AWS where they benchmarked llama 2 with the HF Transformers library on EC2 instances, but judging from their source code, most latency measurements were done in CloudWatch, a black box that is hard to transfer knowledge from

189 of 348

Llama 2 with llama.cpp: 7B

  • -p: prompt (following llama.cpp main author’s example, shouldn’t matter much)

“Input length is not significant for performance but important for hardware requirements”

source:

https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices

  • -n: num tokens to predict (reflected in sample runs)
  • -ngl: number of layers to offload to the GPU; any nonzero value enables Metal processing (Apple’s analogue of CUDA)
  • batch size defaults to 512 (tokens)

190 of 348

Llama 2 with llama.cpp: interpreting console logs

explanation by project collaborator

191 of 348

Llama 2 with llama.cpp: 13B

192 of 348

Llama 2 with llama.cpp: 70B

193 of 348

Llama 2 with llama.cpp: comparison with other experiments

  • y axis: time spent on generating output (i.e. eval time in llama.cpp console log)
  • Model: llama-2-7b-chat.Q4_0 => quantized to 4 bits (originally 16)

source:

https://towardsdatascience.com/apple-m3-machine-learning-speed-test-f1346e23a1b2

194 of 348

Llama 2 with llama.cpp: comparison with other experiments

source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

  • unverified: many are claiming online that M3 Max == M2 Ultra
  • from the bar chart on the previous slide, the M3 Max yields roughly 47 t/s eval time on the 7B model quantized to 4 bits
  • on the other hand, we got, on our M2 Ultra, 39.72 t/s on the original 7B model (unquantized, 16 bits)

195 of 348

Llama 2 with llama.cpp: comparison with other experiments

  • from llama.cpp authors: typical run using M2 Ultra (13B, quantized to 4 bits)
  • our results: 51.45 t/s on prompt eval, 22.41 t/s on eval time

source:

https://github.com/ggerganov/llama.cpp

196 of 348

Llama 2 with llama.cpp: comparison with other experiments

  • hardware: M2 Max Mac Studio, 96 GB RAM

source:

https://medium.com/@andreask_75652/benchmarking-apples-mlx-vs-llama-cpp-bbbebdc18416

197 of 348

Llama 2 with llama.cpp: comparison with other experiments

  • what I worry about:

“Overall latency scales sub-linearly with model size” - from the databricks blog

mentioned in slide 189

  • from what we are seeing:

70 / 13 = 5.38
209.19 ms per token (70B eval) / 44.62 ms per token (13B eval) = 4.68

198 of 348

Question from last week: suspicious prompt eval speed

my local replication:

199 of 348

Question from last week: suspicious prompt eval speed

screenshot from slide 196:

  • I overlooked that these are results from llama-2-7b
  • the experiments I ran last week were using llama-2-7b-chat (fine-tuned, yielded 83.42 t/s prompt eval time & 39.72 t/s eval time)
  • question to ponder: does fine-tuning have a direct effect on prompt evaluation time, or is something else at play?

200 of 348

Optimizing LLM Inference: batching and vLLM

why batch?

  • trade latency for throughput, i.e. process multiple sequences together, each at a lower speed
  • optimize memory bandwidth usage: with batch size one, model parameters need to be loaded for each new input sequence => load once and process many

201 of 348

Optimizing LLM Inference: batching and vLLM

naive/ static batching

  • used by HuggingFace’s Pipelines (their easy-to-use API that supports most models), NVIDIA’s FasterTransformer
  • batched sequences each generate 1 new token per iteration
  • here, GPU underutilization happens because of variation in generation length across sequences

202 of 348

Optimizing LLM Inference: batching and vLLM

continuous batching

  • once a sequence finishes in a batch, a new one takes its place
  • challenging given that the prefill/ prompt evaluation stage has a different computation pattern than token generation => wants a good ratio of

sequences waiting for prefill/ sequences still generating (a toy scheduling loop is sketched below)
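A toy scheduling loop illustrating the idea (pure simulation; real servers such as vLLM handle prefill/decode interleaving, memory, and preemption far more carefully):

```python
import collections
import random

waiting = collections.deque(f"req{i}" for i in range(8))  # sequences waiting for prefill
running = {}                                              # seq id -> tokens left to generate
MAX_BATCH = 4

step = 0
while waiting or running:
    # continuous batching: as soon as a slot frees up, back-fill it with a waiting sequence
    while waiting and len(running) < MAX_BATCH:
        running[waiting.popleft()] = random.randint(1, 5)  # pretend generation length
    for seq in list(running):                              # one decode iteration for the whole batch
        running[seq] -= 1
        if running[seq] == 0:
            del running[seq]                               # finished mid-"batch": slot reused next step
    step += 1
print("total decode iterations:", step)
```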

203 of 348

Optimizing LLM Inference: batching and vLLM

vLLM:

  • created by a UC Berkeley team (Prof. Gonzalez, Prof. Stoica, …)
  • open-source, can only be run on Linux
  • “delivers up to 24x higher throughput than [HuggingFace’s Pipelines], 3.5x improvement over HuggingFace Text Generation Inference,” previous SOTA that powers live inference

204 of 348

Optimizing LLM Inference: batching and vLLM

core problem addressed:

inefficient memory management of the KV cache
=> the KV pairs of all previous tokens are required to generate the next one
=> its size is large: it takes up to 1.7 GB for a single sequence in llama-13B

=> and dynamic: it depends on highly variable sequence lengths, leading to waste from fragmentation and over-reservation

devised solution:

PagedAttention

=> allows the KV cache to be stored non-contiguously in memory by partitioning each sequence’s KV pairs into blocks, each dedicated to a fixed number of tokens

205 of 348

Optimizing LLM Inference: batching and vLLM

=> memory wastage only occurs at the last block of each sequence

=> enables efficient memory sharing by mapping logical blocks to the same physical block, which comes in handy when we need to generate multiple outputs from the same sequence
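A toy sketch of the bookkeeping idea (a block table mapping logical to physical KV-cache blocks); this is illustrative only and not vLLM’s implementation:

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class Allocator:
    """Hands out fixed-size physical blocks from a global pool."""
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))
    def alloc(self):
        return self.free.pop()

class SequenceBlockTable:
    """Maps a sequence's logical blocks to physical blocks; waste is bounded by the last block."""
    def __init__(self):
        self.blocks = []                  # physical block ids, in logical order
        self.tokens_in_last = BLOCK_SIZE  # forces an allocation on the first token

    def append_token(self, allocator):
        if self.tokens_in_last == BLOCK_SIZE:     # last block full -> grab a new physical block
            self.blocks.append(allocator.alloc())
            self.tokens_in_last = 0
        self.tokens_in_last += 1

alloc = Allocator(num_physical_blocks=1024)
seq = SequenceBlockTable()
for _ in range(40):                               # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token(alloc)
print(seq.blocks, seq.tokens_in_last)             # 3 physical blocks, 8 tokens used in the last one
```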

206 of 348

Optimizing LLM Inference: batching and vLLM

=> baseline: HuggingFace’s Pipelines library, which implements static batching

=> generation length sampled from exponential dist with mean = 128 tokens

=> FasterTransformer (an optimized model implementation) actually does pretty well

=> continuous batching clearly yields strong improvement over static, and so does vLLM over the other three on the graph

207 of 348

Optimizing LLM Inference: batching and vLLM

QPS: queries per second, the expected rate of the Poisson distribution that requests are sampled from

208 of 348

Optimizing LLM Inference: batching and vLLM

takeaway from the two graphs:

  • when batching is employed, the continuous scheme alleviates the throughput latency tradeoff
  • as system becomes saturated, continuous batching’s advantage diminishes (harder to inject new sequences?)

209 of 348

MLPerf: trouble we are running in

how do we run it?

  • hardware vendors usually do this to build trust with customers
  • reference implementation:
  • defines representative tasks & workloads (models),
  • specifies model requirements (e.g. pretrained weights to use), validation datasets, performance & accuracy metrics
  • hardware vendors create their own implementation based on ref that
  • uses MLPerf LoadGen API to issue inference queries
  • uses their own hardware-specific API to process said queries
  • for us to run the benchmark,
  • hardware vendor implementations must be available for the desired task/workload
  • compatible hardware

210 of 348

MLPerf: trouble we are running in

last time, we looked at MLPerf Training results on “GPT-3”

  • but how? GPT-3, unlike llama, is proprietary
  • real model under test: Megatron-LM’s GPT-3
  • by applied deep learning research team at NVIDIA (researches training LLMs at scale)
  • largely follows OpenAI’s paper, but uses different tokenizer, attention layer, and model parameters
  • current (v3.1) submitted implementations are for H100, Intel Gaudi2, and Google’s TPU-v5e

211 of 348

MLPerf: comforting news

CTuning (MLCommons founding member, https://ctuning.org/)

  • created the MLCommons Collective Mind framework
  • main objective: make it easier to run MLPerf benchmarks across different hardware
  • keynote introduced in June 2023
  • typical usage: either with a Python virtual environment or Docker

212 of 348

MLPerf: comforting news

running BERT-large on Mac Studio M2 Ultra

213 of 348

MLPerf: comforting news

comparison with results published by MLCommons (MLPerf Inference: Edge)

https://mlcommons.org/benchmarks/inference-edge/

  • upon closer review, some hyperparameters seem different
  • may require tuning to optimize resource utilization
  • need tools to measure resource utilization to determine potential on pushing inference speed

system | framework | offline (samples/s) | single stream (latency in ms)
Macbook Pro M1 | onnx runtime | 1.69 | 597.08
Mac Studio M2 Ultra | onnx runtime | 6.48 | 580.58

214 of 348

MLPerf … on Mac?

215 of 348

MLPerf … on Mac?

216 of 348

MLPerf … on Mac?

217 of 348

MLPerf … on Mac?

218 of 348

MLPerf … on Mac?

219 of 348

MLPerf … on Mac?

main obstacle

  • we need a reliable llama 2 model implementation using the mlx framework
  • we did not have one when we first surveyed MLX

https://github.com/ml-explore/mlx-examples/tree/main/llms/llama

7b & 13b worked, 70b returns weird output

  • the mlx_lm library was released for easy integration with HF models

https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm

  • seems to work, but is incredibly slow?

https://github.com/ml-explore/mlx-examples/issues/437

220 of 348

MLPerf … on Mac?

discussion with Awni Hannun

  • top contributor to mlx
  • research scientist at Apple
  • Stanford PhD advised by Andrew Ng (advised by David Blei, Michael Jordan)

221 of 348

MLPerf … on Mac?

model: llama-2-7b-chat

                | prompt evaluation (t/s) | token generation (t/s)
llama.cpp       | 83.42                   | 39.72
mlx             | 25.455                  | 29.416
llama.cpp / mlx | 3.27                    | 1.35

222 of 348

MLPerf … on Mac?

model: llama-2-13b-chat

                | prompt evaluation (t/s) | token generation (t/s)
llama.cpp       | 51.45                   | 22.41
mlx             | 14.164                  | 18.662
llama.cpp / mlx | 3.63                    | 1.2

223 of 348

MLPerf … on Mac?

point being …

model: llama-2-70b-chat

                | prompt evaluation (t/s) | token generation (t/s)
llama.cpp       | 11.87                   | 4.78
mlx             | 1.063                   | 0.201
llama.cpp / mlx | 11.16                   | 23.78

will continue investigating

224 of 348

MLPerf … on Mac?

225 of 348

mlx v. llama.cpp

performance anomaly?

  • mlx token generation almost as fast as llama.cpp at first run after reboot
  • performance degradation (seemingly exponential) in the following runs
  • llama.cpp unaffected by mlx

226 of 348

mlx v. llama.cpp

performance anomaly?

  • mlx prompt evaluation performance surpassing that of llama.cpp’s
  • less severe performance degradation
  • should rerun experiment with larger input (currently seeing large variations)

227 of 348

MLX GPU profile

228 of 348

Llama.cpp GPU profile

229 of 348

Llama.cpp parallelization

230 of 348

MLX representative workload: on reboot perf anomaly

231 of 348

MLX representative workload: on perf with diff loading strategy

232 of 348

MLX v. Metal: on perf with diff loading strategy

233 of 348

experiments below were conducted in environment:
mlx 0.6.0
https://github.com/MBZUAI-Research-Office/research_muchi
commit: fd2376d405b8fe786fe2851b828e435f40421edb

234 of 348

Almost root to the issue?

How weights loading strategy affects application performance

batch

sep

upon closer examination, matmul performance seems identical across loading strategies

235 of 348

Almost root to the issue?

batch

sep

zooming out …
we notice that the space between computations is caused by driver (wire memory) activity making sure that there is physical memory backing the GPU resources consumed in Metal code

(where could 256 MiB come from? Maybe the size of the weights per layer: 8192 * 8192 * 4 / (1024 ** 2) = 256 MiB)

https://developer.apple.com/forums/thread/653038

https://developer.apple.com/library/archive/documentation/DeviceDrivers/Conceptual/IOKitFundamentals/DataMgmt/DataMgmt.html

236 of 348

Almost root to the issue?

well … how does that space/ duration caused by driver activity change upon reboot?

for the first 2 seconds (after the computation starts), there was still plenty of driver (wire memory) activity

237 of 348

Almost root to the issue?

well … how does that space/ duration caused by driver activity change upon reboot?

after that, however, driver (wire memory) activity ceased => lowering the average and thus making the first run post reboot seem much faster than subsequent runs

the space between computations diminishes along with the decrease in driver (wire memory) activity!

238 of 348

Almost root to the issue?

2nd run post reboot

3rd run post reboot

What does this mean?
The performance degradation we observed with the sep loading strategy can also be accounted for by this driver (wire memory) activity

239 of 348

Sidenote: how does matmul in mlx work under the hood?

240 of 348

DBRX: distributed inference with M2 Ultras

241 of 348

What is DBRX?

  • transformer-based decoder-only LLM (like Llama-2 & GPT-4) that was trained using next-token prediction
  • fine-grained mixture-of-experts (MoE) architecture with 132B total parameters of which 36B parameters are active on any input (more on this later)
  • other prominent MoE models: Mixtral 8x7B, Grok-1
  • uses GPT-4 tokenizer, rotary position encodings, gated linear units, and grouped query attention (similar to Llama-2)
  • pretrained on 12T tokens (carefully curated, which substantially helped improve model quality) with a max context length of 32,768 tokens (synonymous with max sequence length)
  • trained on 3072 NVIDIA H100s connected by 3.2Tbps Infiniband

242 of 348

DBRX: inference efficiency

setting:

  • NVIDIA TensorRT-LLM (open-source library, also used in their MLPerf implementation) at 16-bit precision
  • input prompt: 2000 tokens, output: 256 tokens (per response)
  • No mention of hardware specs?
  • seems faster than model-size would suggest due to its MoE architecture (more on this later!)
  • Mixtral has higher t/s because its model is smaller

243 of 348

What is MoE?

TL;DR

MoE models

  • are pre-trained much faster & have faster inference compared to dense models (where all the parameters are used for all the inputs) with the same number of params
  • still, require high VRAM as all experts are loaded in memory
  • face many challenges in fine-tuning such as overfitting, but recent work with MoE instruction-tuning is promising

244 of 348

What is MoE?

dense: refers to the feed-forward layer, which is really just linear combination followed by activation (in Llama-2: multi-layer perceptron)

traditional transformer encoder block

245 of 348

What is MoE?

traditional transformer

transformer with MoE

vs

the traditional/ dense feed-forward network (FFN) layer is replaced by a sparse MoE layer that operates independently on the tokens in the sequence

- the router/ gate network decides which expert each token goes to

- output of the MoE layer = router gate value * output of the selected expert (a toy sketch follows below)
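A toy PyTorch sketch of a top-k MoE layer to make the routing and weighted sum concrete (illustrative only; this is not the DBRX/Mixtral implementation, and the per-token loop is written for clarity, not speed):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=16, top_k=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)   # the gate network
        self.top_k = top_k

    def forward(self, x):                              # x: (n_tokens, d_model)
        gate_scores = self.router(x)                   # (n_tokens, n_experts)
        weights, chosen = gate_scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)           # normalize over the chosen experts only
        out = torch.zeros_like(x)
        for t in range(x.shape[0]):                    # dispatch each token to its chosen experts
            for w, e in zip(weights[t], chosen[t]):
                out[t] += w * self.experts[e](x[t])    # gate value * expert output
        return out

print(ToyMoE()(torch.randn(8, 64)).shape)              # 8 tokens, each routed to 4 of 16 experts
```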

246 of 348

What is MoE?

what does this remind you of?

  • ensemble methods (e.g. random forest)
  • MoE layer is a system composed of separate networks/ experts that each specializes in a different region of the input space

in addition,

  • each expert is usually a FFN
  • we can send a token to more than one expert
  • Mixtral 8x7B and Grok-1 have 8 experts and choose 2
  • DBRX has 16 smaller experts and chooses 4 (which is what they mean by fine-grained) => allows 65x more combinations
  • the router is composed of learned parameters pre-trained at the same time as the rest of the network

247 of 348

So, how do we run distributed inference with DBRX?

example: transformer encoder block from Google’s GShard
- attention weights are replicated across devices
- each device is responsible for a FFN/ expert (model parallelism: the model is partitioned across nodes)

248 of 348

Requirements to run DBRX

132B params, each param being a bfloat16
=> 132 * 10^9 * 2 (bytes) = 264 * 10^9 bytes = 264 GB
=> however, DBRX’s github repo README recommends having at least 320 GB of memory

right now, model implementations are available in
- TensorRT-LLM (requires CUDA)
- vLLM (requires CUDA)
- MLX (requires Apple Silicon, focus, commitment, and sheer will)

interestingly,
HuggingFace’s transformers library integrates an experimental project called PiPPy that aims to enable easy pipeline parallelism (where each node gets a portion of the model’s layers) for PyTorch => solves the memory problem, but with seemingly little performance benefit
https://github.com/pytorch/PiPPy/

249 of 348

Potential Strategy with M2 Ultras

  • connect M2 Ultras with thunderbolt (should provide up to 40 Gbit per sec throughput)
  • communication between nodes via IP
  • each Mac Studio has 6 thunderbolt ports, which should enable up to a fully-connected cluster of 7 nodes
  • mimic GShard architecture by putting each shard in a container
  • communication between containers orchestrated by kubernetes

250 of 348

M2 Ultra Cluster Communication latency

recap:

each of our Mac Studio M2 Ultras has 192GB of memory, but DBRX inference requires 264GB (Databricks even recommends having at least 320GB of memory)

our goal is to run distributed inference on a cluster (2 - 8 nodes), which should provide us more than enough computation & memory resources. However, we need to make sure that communication between nodes does not become a bottleneck

251 of 348

what information is being exchanged?

DBRX has …

  • 40 layers (transformer decoder blocks)
  • the input/output of each layer during inference is a 1x6144 float16 vector
  • one may ask … where is the autoregressive-ness? => hidden in the KV cache
  • under expert parallelism, 2 * 4 * 2 * 6144 / 1024 = 96 KiB of data is exchanged per layer (dispatch + aggregate, 4 chosen experts, 2 bytes per float16 element, 6144-dim vectors)

[diagram: layers L_0 … L_39; within a layer, attention (A) and the router send the 1x6144 float16 activation (B) to the 4 chosen of 16 experts (FFN_0 … FFN_15, e.g. to 0, 4, 7, 15), whose outputs are aggregated (C = sum of the B’s) before being passed to the next layer]

252 of 348

M2 Ultra Cluster Communication latency

4 nodes (Thunderbolt with star topology, Ethernet with switch)

Protocol: gRPC
(avg latency includes serialization and deserialization time)

253 of 348

DBRX: distributed inference with M2 Ultras

recap

  • running DBRX warrants at least 264GB of memory, yet a single M2 Ultra is only equipped with 192GB of memory
  • we cannot run it on one node, but any cluster of size >= 2 would do
  • we did a small feasibility study to ensure cluster communication overhead would not place an unreasonable performance cap
  • running DBRX inference in a distributed setting requires model rewrite

254 of 348

DBRX: distributed inference with M2 Ultras

our progress so far

  • finished rewriting DBRX’s MLX implementation to enable expert parallelism on a M2 Ultra cluster of >= 2 nodes
  • observed sub-par performance of
  • 2 t/s for prompt evaluation
  • 0.5 t/s for token generation
  • profiled performance with instruments to locate bottlenecks

where did we get stuck? (for quite a while)

  • pre-processing weights with MLX leads to unexpected data corruption: mx.split() followed by mx.savez()
  • therefore, we switched to PyTorch since it supports both safetensors and bfloat16. This implementation, however, is slower because everything runs on the CPU (currently, the MPS backend does not support bfloat16)

255 of 348

DBRX: distributed architecture design

256 of 348

DBRX: distributed inference with M2 Ultras

0.5 t/s performance breakdown

  • each token takes 2 seconds to process
  • DBRX has 40 layers, which means each layer takes, on average, ~50 ms
  • in each layer,
    • attention + router + weighted_sum usually takes ~7 ms (including driver activity that checks for wired memory)
    • expert calculation + communication (~30 ms), and
    • waits/stalls ~13 ms (needs further verification)
  • on average, 2 experts are assigned to each moe_shard per layer

[Instruments performance view of one layer: expert calculation + communication, followed by wait/ stall]

257 of 348

DBRX: distributed inference with M2 Ultras

optimization strategies

  • improve the weights loading strategy to avoid driver activity; should reduce ~21 ms per layer
  • figure out and eliminate the cause of the stalls/ waits
  • parallelization within node so that all 4 chosen experts can be executed at the same time
  • currently, expert calculation within node is sequential
  • ALU utilization is ~30%

258 of 348

updates from 4/30 PAS lab meeting

259 of 348

DBRX: distributed inference with M2 Ultras

latest employed optimization strategies

  • expert parallelization within nodes:
  • before: 4 nodes, 4 experts per node, experts ran sequentially
  • now: …, …, experts run in parallel by spawning independent processes
  • [NEEDS WORK]: deeper performance analysis
  • improved weights loading strategy within each expert (i.e. each process in the setting above)
  • before: 40 layers, 3 weight arrays per layer, loaded into memory separately (total size = 40 * 3 * 10752 * 6144 * 2 bytes ≈ 15.85 GB)
  • now: …, …, the 15.85 GB of weights are loaded into memory in one chunk
  • [NEEDS WORK]: currently, every expert is kept busy in every layer to reap the benefits of the above loading strategy => effectively a dense model :(

260 of 348

DBRX: distributed inference with M2 Ultras

model performance

workload: batch size = 1, input = 20 tokens, output = 64 tokens

                                                                          | prompt evaluation | token generation
week_0: baseline                                                          | 1.3 t/s           | 0.6 t/s
week_1: with expert parallelization, weights loading in one chunk, dense | 3 t/s             | 2.4 t/s

261 of 348

05/09/2024 weekly report

outline

  • detailed performance breakdown between versions
  • v2 & v3 architecture design

262 of 348

DBRX distributed inference architecture design

263 of 348

DBRX distributed inference performance breakdown

(attention + router, MoE, and wait / stall are per layer; token generation t/s is per prompt)

version | attention + router | MoE | wait / stall | token generation t/s
v1: baseline | ~7 ms | ~30 ms | ~13 ms | ~0.5 t/s
v1.5: load each expert’s weights as one chunk, dense | ~4 ms | ~3.5 ms | ~3 ms | ~2.4 t/s
v2: everything from v1.5, earlier driver eval, custom serialization | ~1.5 ms | ~3.5 ms | ~3 ms | ~3.1 t/s
v3: everything from v2, architecture redesign, ½ communication, ½ expert computation | estimated ~1.5 ms | estimated ~2 ms | estimated ~1.5 ms | estimated ~5 t/s

filenames: dbrx_poc_one_custom_serialization, dbrx_poc_batch2, dbrx_poc_perf

264 of 348

Obstacles encountered during v3 development:

driver activity surfacing from idleness

sleep in ms | incurs heavy driver activity
1    | False
10   | False
100  | False
500  | True
1000 | True

driver activity here refers to:

time driver spent on wiring down sufficient physical memory for GPU computation

where does idleness come from (more on this next page):

wait during all-reduce (the shard that finishes first has to wait for the one that finishes last)

observation from v3 MoE layer design (4 chooses 2 experts activated per shard):

waiting/sleeping between layers causes lengthy driver activity to surface

when sleep = 1000 ms, driver processing takes ~400 ms for a ~1 ms GPU computation

265 of 348

How idleness becomes a problem in v3

  • driver activity is unavoidable during the first layer, and its processing time has high variance
  • these long waits, as shown on the previous page, trigger more driver activity => an inescapable loop!
  • for reasons unknown, the last shard to finish also tends to wait before the next layer’s calculation
  • for reasons unknown, the MoE design from v2 seems to mitigate this disturbance pretty well

266 of 348

introducing v3.2,

our first step towards:

- streaming MoE processing (beneficial during prompt evaluation and when input batch size > 1)

- dynamic load balancing (beneficial even when batch size = 1, increases overlap between communication and computation during streaming)

267 of 348

new trouble with v3.2: needs warming up!

what do we mean by warm up?

sync computation time to reduce driver activity

268 of 348

distributed DBRX v3 performance evaluation

         | num GPUs | price in USD per GPU | communication | fp16 TFLOPS per GPU | bfloat16 TFLOPS per GPU | single user throughput (t/s)
M2 Ultra | 4  | ~8,500 ?  | 10 GbE switch                        | ~54   | ???   | ~6
L40S     | 8  | ~9,287    | PCIe Gen4 x16: 64GB/s bidirectional  | ~362  | ~362  | ~60 ?
H100     | 4? | ~31,710   | NVLink: 900GB/s, PCIe Gen5: 128GB/s  | ~1979 | ~1979 | ~115 ?

269 of 348

Evaluation Goal:

A Performance Model that Takes Compute Power into Account

270 of 348

DBRX on A100: how fast is inter-GPU communication in terms of latency?

hardware setup: 1 node, 4 GPUs, 4 x 80 = 320GB memory, 600 GB/s GPU interconnect through NVLink

framework: TensorRT-LLM, tensor parallelism

communication pattern:

per token: 1 x AllGather after all 40 layers to piece together the generated token

per layer: 2 x AllReduce, once before attention and once before the MoE block

271 of 348

DBRX on A100: how fast is inter-GPU communication in terms of latency?

communication latency breakdown:

operation | n times per token | avg latency (microseconds) | total latency per token (microseconds)
AllReduce | 2 * 40 = 80 | 18 | 1,440
AllGather | 1           | 20 | 20
total:    |             |    | 1,460

compared to us:

hardware setup | total communication latency per token (microseconds)
A100 (1 node, 4 chips)      | 1,460
M2 Ultra (2 nodes, 2 chips) | 60,000

272 of 348

DBRX performance model

  • since H100 has 8 GPUs in 1 node, we estimate its comm latency to be 2x of A100’s

  • predicted / realized is roughly 1.6 for each row

273 of 348

Short-term Future Roadmap

resources available

several 4090s, configurable to single or multi node (are we interested in exploring RDMA NICs?)

topics to explore

  • efficient LLM deployment on lower-end Nvidia hardware
  • would like to start from inference -> fine-tuning (requires data, algorithmic knowledge) -> training (requires data, compute, and algorithmic knowledge)
  • from our past survey, inference with large MoE models in a resource-constrained setting is relatively under-explored (efficiency on higher-grade hardware is hard enough already)

274 of 348

Related Work: on insufficient memory problem

DeepSpeed Inference (Microsoft, June 2022)

  • mentioned in last week’s lab meeting (linked below)
  • proposes a series of optimizations called ZeRO-Inference
  • offloads all weights to CPU memory (DRAM) or NVMe (SSD), copying them to GPU memory only when needed
  • overlaps weight prefetching with previous layers’ computation to hide memory copy latency
  • parallelizes prefetching across multiple GPUs and leverages faster-than-PCIe interconnects to all-gather the weights

275 of 348

Related Work: ZeRO-Inference

problem:

prefetching is difficult to achieve for MoE models because you don’t know which experts you need until the current layer’s router makes its decision

=> each expert’s weights are spread across all model layers (for Mixtral8x7b, expert_0’s weights at layer_0 are 7B / 32 layers = 0.22B params, which is 0.22B * 2 bytes = 440 MB)

=> you could prefetch every expert’s weights for the next layer and it would still fit in GPU memory (Mixtral8x7b: 8 experts x 440 MB = 3.5 GB, and an RTX 4090 has 24 GB of memory)

=> however, this wastes memory space and is hard to scale (what if we are using smaller GPUs? Also, the current trend is to use more experts)

276 of 348

Related Work: on efficient MoE inference

Towards MoE Deployment: Mitigating Inefficiencies in MoE Inference (Meta AI, June 2022)

  • Expert Buffering: only cache a small number of experts in GPU memory, and the rest of the experts’ weights are offloaded to CPU memory. Expert weights copied to GPU memory when there is a cache miss. LIFO eviction strategy
  • In the context of expert parallelism: expert placement driven by historical data for better load balance. Frequently selected experts should reside on different GPUs

277 of 348

Related Work: Towards MoE Deployment … by Meta AI

questions/ problems

  • does it load an entire expert (all layers)? Otherwise, caching only an expert’s weights at the current layer is useless for the next layer
  • how does it compare against loading the selected experts’ layer weights just in time?
  • their load balancing is useless if model training already accounts for expert selection imbalance (e.g. DBRX)

278 of 348

Related Work: on efficient MoE inference

Exploiting Inter-Layer Expert Affinity for Accelerating Mixture-of-Experts Model Inference (Ohio State University, Jan 2024, https://arxiv.org/pdf/2401.08383)

1. context coherent expert parallelism

expert parallelism under a centralized design warrants 2 all-to-all operations per MoE layer: once when router dispatches tokens to selected experts and another to aggregate expert outputs for weighted sum

our decentralized design replicates attention & router weights across nodes to allow independent but identical non-MoE calculation and thus eliminates the need for router dispatch

their decentralized design also replicates weights but removes repeated calculation to allow concurrent handling of multiple requests. However, this requires additional KV-cache syncing after each forward pass/ generation of one token

279 of 348

Related Work: Inter-Layer Expert Affinity … by OSU

how does context coherence work?

  • before each forward pass during token generation phase, perform an all-gather to replicate/sync every request’s KV-cache on every GPU
  • this allows attention & router to be done anywhere after expert computation, eliminating the need to send local expert outputs back to original dispatcher

[diagram: centralized vs. decentralized (context coherent) expert parallelism]

280 of 348

Related Work: Inter-Layer Expert Affinity … by OSU

problems with context coherence

  • useless when the model selects > 1 expert every layer
  • although attention & router can be done on any GPU, you still need to acquire expert outputs from other GPUs. In other words, an all-reduce is still needed

so, it doesn’t really reduce the number of all-to-all operations to 1!

Running a MoE model with 4 experts, parallelized on 2 GPUs. Each layer chooses 2 experts.

281 of 348

Related Work: Inter-Layer Expert Affinity … by OSU

2. expert affinity

  • argument/ observation: given that a token is routed to expert_i at the current layer, certain experts exhibit a higher probability of being selected next
  • the goal is to place weights strategically according to inferred expert selection path to minimize communication (i.e. you don’t need to dispatch tokens to other machines if the next selected experts’ weights are available locally)

problems

  • requires training to deduce “expert affinity,” and needs retraining for models with a different architecture (e.g. more experts/ increased granularity)
  • their conditional probability model assumes the Markov property (the next state only depends on the current state) and is inapplicable to cases where multiple experts are selected per layer
  • this problem becomes harder with more complex MoE designs

282 of 348

Related Work: on efficient MoE inference

FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models (Tsinghua University, April 2022, https://dl.acm.org/doi/pdf/10.1145/3503221.3508418)

  • dynamic expert shadowing: difference in expert popularity leads to load imbalance when adopting naive expert parallelism. When an expert is overloaded, we can replicate its weights on other GPUs to distribute the work. When data parallelism is also employed, this also eliminates the need to all-dispatch input tokens.

283 of 348

Related Work: FasterMoE by Tsinghua

problems with dynamic expert shadowing

  • replicating model weights introduces a lot of overhead (even more so during training because of the need for gradient update)
  • FasterMoE addresses these challenges by deciding whether to shadow experts during runtime (for every training iteration/ gradient update cycle). Experts are only shadowed when it takes longer to transfer input tokens than weights (could occur with large batch sizes) OR when time saved on computation is more than the added communication latency
  • if applied to model inference, this mechanism would seldom be activated since batch size is unlikely to be that big

284 of 348

Related Work: FasterMoE by Tsinghua

communication and computation overlap

  • well, we do this as well with DBRX
  • the difference between B and C below is that FasterMoE adjusts scheduling depending on mini-batch size: it schedules longer communications earlier to allow more overlap
  • we didn’t need that adjustment because our group size is always 1

S: send

C: compute

R: receive

285 of 348

Related Work: on efficient MoE inference

DeepSpeed-MoE (Microsoft, July 2022)

  • Expert + Tensor Parallelism for MoE block (configuration could depend on whether GPUs reside on the same node)
  • Data + Tensor Parallelism for non-MoE components

286 of 348

Related Work: DeepSpeed MoE

  • hierarchical all-to-all operations (within node first, between nodes later)
  • parallelism coordinated communication

[figure: the hierarchical all-to-all proceeds in three steps]

287 of 348

Categorizing Existing Optimizations

an important observation is that expert parallelism creates a lot of inconveniences.

overcoming GPU memory constraint

load balance when using expert parallelism

minimizing communication in parallel systems

expert buffering

expert + tensor parallelism

evenly distributing hot experts

dynamic expert shadowing

context coherence expert parallelism

expert affinity

parallelism coordinated communication

computation communication overlap

288 of 348

Proposing a new research direction

tensor parallelism is suitable for our target environment (lower-end hardware). Before each layer’s computation starts, tensor-parallelize every expert’s weights across all available GPUs (e.g. every GPU possesses a slice of every expert).

why?

  • load balance wise: no issue here at all. Computation is distributed evenly across the entire parallel system.
  • communication wise: only requires 1 all-to-all operation per layer, same as expert parallelism
  • memory constraint wise: offload to CPU memory and even SSD. Load weights by layer into GPU memory in a streaming fashion (start prefetching many layers ahead). No need to guess, through some fancy trick, which experts will be selected in the current layer

289 of 348

Proposing a new research direction: Feasibility Analysis

DBRX has 16 experts and 40 layers.

  • In each layer, the 16 experts’ weights total 2 * 16 * 3 * 6144 * 10752 bytes ≈ 6.35 GB
  • assuming a tensor parallelism degree of 4, each GPU needs to load 6.35 / 4 = 1.59 GB of weights per layer
  • according to our preliminary experiments, our machine can achieve 25 GB/s bandwidth copying data from CPU to GPU memory
  • from 2 & 3, loading each layer’s expert weights to the GPU takes 1.59 / 25 = 0.0636 seconds = 63.6 ms
  • assuming our target inference throughput is 5 t/s during the output generation stage, we need to complete each layer in 1000 / 5 / 40 = 5 ms
  • from 4 & 5, we need to start prefetching 63.6 / 5 ≈ 13 layers ahead
  • that means our GPU needs to be able to buffer 13 layers of weights, which is 13 * 1.59 = 20.67 GB, and that fits inside our RTX 4090 (24GB memory); a quick sanity check of this arithmetic is sketched below
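A quick sanity check of the numbers above, with the slide’s assumptions baked in as defaults:

```python
import math

def prefetch_plan(expert_bytes_per_layer=6.35e9, tp_degree=4, h2d_gbps=25,
                  target_tps=5, n_layers=40, gpu_mem_gb=24):
    per_gpu_per_layer_gb = expert_bytes_per_layer / tp_degree / 1e9   # ~1.59 GB
    load_ms = per_gpu_per_layer_gb / h2d_gbps * 1000                  # ~63.6 ms per layer
    layer_budget_ms = 1000 / target_tps / n_layers                    # 5 ms per layer at 5 t/s
    lookahead = math.ceil(load_ms / layer_budget_ms)                  # ~13 layers of prefetch
    buffer_gb = lookahead * per_gpu_per_layer_gb                      # ~20.6 GB of GPU buffer
    return lookahead, round(buffer_gb, 2), buffer_gb <= gpu_mem_gb

print(prefetch_plan())   # (13, ~20.6, True) -> the plan fits in a 24 GB RTX 4090
```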

290 of 348

running DBRX with tensor parallelism on 4 GPUs. In each layer, each GPU needs in its memory a slice of all 16 experts, which totals 1.59 GB. Here, we assume a target token generation throughput of 5 t/s, which means each layer’s compute + communication takes 5 ms.

291 of 348

Weekly Updates

  • Some company has kindly lent us their 8 x A100 40GB server

access guide:

  • install openVPN client: https://openvpn.net/client/
  • get VPN profile / certificate & ssh credentials from me
  • right now, we have 2 ongoing efforts:
  • profiling DBRX’s performance on the A100s using TensorRT-LLM & expert/ tensor parallelism (comparison with the results we got from MBZUAI’s 4 x A100 80GB setup?)
  • running Mixtral8x7B 16-bit on 1 x RTX 4090 24GB by offloading weights AND computation to the CPU. Later, I would like to combine this with expert/ tensor parallelism on 4 x 4090s.
  • earlier this week, I mentioned in our group chat that Fiddler by UDub looks like a scam: https://arxiv.org/abs/2402.07033
  • however, from my experiments, I was able to finish one of Mixtral8x7B’s MoE layers (2 selected experts) in 5.6 ms => actually quite promising
  • my approach: combine different good ideas and see where to go from there

292 of 348

Prospective Projects on A100 server

1. DBRX 132B & Mixtral8x22B

inference performance tuning & profiling in the context of

  • expert + tensor parallelism
  • multiple concurrent users

2. Google’s Switch-MoE (2048 experts, 1.6T params)

inference performance tuning & profiling in the context of

  • weights offloading to CPU memory/ SSD
  • predicting expert activation
  • expert compression through unstructured pruning/ mixed precision quantization
  • expert buffering (i.e. treating GPU memory as a cache)

Explore model parallelization techniques that accommodate the usage of heterogeneous accelerators

  • offloading computation to the CPU

293 of 348

Prospective Projects on A100 server

3. fine-tune pre-trained MoE models’ routing network

to overcome the difficulty imposed by dynamic expert activation on weights pre-fetching

4. Llama-3-70B

inference / fine-tuning performance profiling & optimization

  • Comparison with our experiences on MoE: what techniques are transferable?

5. Llama-3-405B

inference / fine-tuning performance profiling & optimization

  • Are there any special performance issues related to running monster-scale dense models in a resource constrained environment?
  • How is it different from running a >100B MoE model on 4x RTX 4090 24GB?
  • How do we formulate our approaches with different-sized models on different-sized machines as a methodology?

294 of 348

Weekly Updates

  • issues encountered with 8 x A100 40GB server access
  • unstable SSH connection (terminal often freezes, connection times out)
  • insufficient network bandwidth (download speed can be < 100 KiB/s)
  • DBRX progress on the A100s
  • just got it running today
  • experimenting with different parallelism strategies for performance analysis
  • Mixtral8x7B on RTX 4090
  • experimenting with HtoD memcpy, CPU compute, and GPU compute overlap (using the model’s MoE weights and workload)
  • profiling with Nsight Systems to confirm the feasibility of our approach
  • also actively re-writing Mistral’s open-sourced inference engine
  • phase 1 targets a single GPU; will expand to multi-GPU after this POC

295 of 348

Weekly Updates (08/29/2024)

  • Phi3.5 MoE Inference

preliminary performance analysis on 4xRTX4090-24GB

  • DBRX Inference

token generation throughput comparison between 8xA100-40GB and 4xA100-80GB

  • Llama3.1 8B Inference (first step towards training)

throughput with different batch sizes on 1xRTX4090-24GB

  • Mixtral8x7B Inference

POC v0 done:
1. runs on 1xRTX4090-24GB, expert weights and compute on CPU, self-attention & router weights and compute on GPU (https://github.com/muchi674/ntu_paslab_llm)

2. the following versions will focus on better utilizing the GPU

3. performance analysis: TBA

296 of 348

Recap on Phi3.5-MoE

Author: Microsoft

Release Date: 08/22/2024

Architecture: 16x3.8B parameters in BF16 with 6.6B active parameters when using 2 experts. The model is a mixture-of-experts decoder-only Transformer using a tokenizer with a vocabulary size of 32,064.

Context length: 128K tokens

GPUs: 512 H100-80G

Training time: 23 days

Training data: 4.9T tokens

297 of 348

Phi3.5 MoE Inference on 4 rtx4090

- CPU to GPU memory copy throughput: ~10 GB/s for each GPU

- token generation throughput (with naive model parallelism, i.e. putting groups of model layers on different GPUs):

batch size | token generation throughput (t/s)
1 | 12
2 | 19
4 | 30
8 | 48

Note: pipeline parallelism has not been applied

298 of 348

Phi3.5 MoE Inference on 4 rtx4090

illustration of naive model parallelism by using Nsight systems:

=> problem of under-utilization is apparent

299 of 348

DBRX Inference Performance Comparison

results below were obtained using tensor parallelism

token generation throughput in t/s, when batch size = 1:

4xA100-80GB (MBZUAI) | 8xA100-40GB (borrowed)
~57 | ~85

What could be causing this difference?

- with a batch size this small, performance is memory-bound (it depends on how fast we can load model parameters from GPU memory into cache)

- both systems above are equipped with NVLink, meaning that GPU interconnect is very cheap (all-to-all operations take ~20 microseconds)

- the 8xA100 system has a higher aggregate memory bandwidth, and the performance gain from spreading a memory-bound computation across more workers outweighs the additional communication latency

300 of 348

Llama3.1-8B Inference Performance Analysis

results below were obtained on 1xRTX4090-24GB

question to consider: token generation throughput growth seems to plateau faster than prompt evaluation throughput. Why?

Batch Size | Evaluation TP (t/s) | Generation TP (t/s)
1   | 53.3759    | 46.75
2   | 103.3154   | 88.5
3   | 139.8756   | 130
4   | 193.2283   | 168
8   | 380.4522   | 288
12  | 595.3785   | 402
16  | 755.9556   | 462
32  | 1392.01663 | 624
64  | 2499.4972  | 760
128 | OOM        | OOM

301 of 348

Weekly Updates (09/05/2024)

Mixtral8x7B Inference on 1xRTX4090-24GB

- performance analysis on 4 versions of implementation

for all versions below, self-attention & router are on the GPU

  • v0 (CPU experts): expert weights and computation on the CPU
  • v1 (GPU experts): expert weights on the CPU, memcpy’d to the GPU for computation when selected, no cache
  • v2 (static collaboration): 7/8 of the expert weights on the CPU, 1/8 on the GPU. Computation happens where the weights are
  • v3 (dynamic collaboration): same as v2, but memcpy weights to the GPU when the communication cost < the computation savings (decided with 2 pre-determined hyper-parameters)

302 of 348

v0 (CPU only) performance analysis

batch size | prefill throughput (t/s) | decode throughput (t/s)
1   | 65.02  | 3.83
2   | 92.48  | 4.71
4   | 130.3  | 6.15
8   | 172.25 | 9.58
16  | 219.65 | 15
32  | 220.84 | 26.94
64  | 216.37 | 50.25
128 | 225.25 | 78.28

workload: ~110 input tokens, generate 128 tokens

303 of 348

v1 (GPU only) performance analysis

batch size | prefill throughput (t/s) | decode throughput (t/s)
1   | 38.19   | 1.27
2   | 82.08   | 1.48
4   | 159.35  | 1.9
8   | 298.14  | 2.9
16  | 537.48  | 5.26
32  | 924.12  | 10.36
64  | 1503.02 | 20.6
128 | 2262.64 | 40.89

workload: ~110 input tokens, generate 128 tokens

304 of 348

v2 (static collab) performance analysis

batch size | prefill throughput (t/s) | decode throughput (t/s)
1   | 74.61  | 4.28
2   | 106.83 | 5.3
4   | 142.83 | 7.03
8   | 205.12 | 10.62
16  | 234.57 | 17.42
32  | 256.97 | 34.62
64  | 245.89 | 61.04
128 | 248.88 | 94.77

workload: ~110 input tokens, generate 128 tokens

305 of 348

v3 (dynamic collab) performance analysis

batch size | prefill throughput (t/s) | decode throughput (t/s)
1   | 65.26   | 4.02
2   | 113.71  | 5.06
4   | 217.58  | 6.37
8   | 306.86  | 7.74
16  | 538.42  | 12.06
32  | 925.01  | 23.45
64  | 1503.23 | 40.15
128 | 2259.6  | 69.74

workload: ~110 input tokens, generate 128 tokens

306 of 348

307 of 348

308 of 348

309 of 348

310 of 348

311 of 348

312 of 348

TODOs

  • grouping similar requests in the same batch (similar as in: they activate similar experts)
  • using “second-best” experts that reside on the GPU. Assuming the model uses top-2, but the top-1 & top-3 experts are on the GPU while the top-2 expert is on the CPU, what’s the accuracy penalty for using the top-3 expert?
  • while employing both prefetching and caching, how do we analyze cache miss information in time to correct the prefetching strategy?
  • what’s the optimal cache hit rate, assuming you know which experts are going to be selected in the future? (for different cache sizes)
  • what’s the best eviction strategy? (Moscow uses LRU, Meta AI uses LIFO)
  • with large batch sizes, the expert activation statistics from previous requests might be more applicable to the next one (but wouldn’t we see a uniform distribution with batch sizes that are too big?)
  • further investigation into experts’ “temporal locality”. How do we exploit it to achieve optimal static expert placement?

313 of 348

Weekly Updates (09/05/2024)

Llama3.1-70B-Instruct Inference

  • goal: optimizing performance on a single RTX-4090/ a cluster of RTX-4090s (24GB memory)
  • first step: an up-and-running implementation => v0_cpu_only (everything, attention & FFN, is done on the CPU)
  • before deciding what work to offload to the GPU, we can profile the CPU-only version to understand the model’s performance characteristics during inference
  • related work (https://infini-ai-lab.github.io/Sequoia-Page/): by CMU, Together AI, Meta AI, published in Feb 2024

achieves 1.75 t/s token generation throughput on a single RTX-4090 with llama-2-70B; leverages techniques from speculative decoding, using llama-2-7B as the draft model

  • our POC version achieves 0.64 t/s during token generation

314 of 348

Briefing on llama3.1-70B-Instruct

dtype: bfloat16 (2 bytes)

n_layers: 80

vocab_size: 128,256

d_model (embedding size): 8,192

n_attention_heads: 64

n_kv_heads: 8

ffn_hidden_size: 28,672

per layer weights size (in GB):

  • attention: ((8192 * 64 * 128 * 2) + (8192 * 8 * 128 * 2)) * 2 / 1000^3 = 0.3
  • ffn: 8192 * 28672 * 3 * 2 / 1000^3 = 1.4

(the remaining weights are mainly for the embedding layer & model output projection)
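The same arithmetic as a tiny script (interpreting the two attention terms as the Q/output and K/V projections, which is consistent with the numbers above):

```python
# per-layer weight sizes for llama3.1-70B in bfloat16 (2 bytes per parameter)
d_model, n_heads, n_kv_heads, head_dim, d_ff = 8192, 64, 8, 128, 28672

attn_params = (d_model * n_heads * head_dim * 2        # Q and output projections
               + d_model * n_kv_heads * head_dim * 2)  # K and V projections (GQA: 8 kv heads)
ffn_params = d_model * d_ff * 3                        # gate, up, and down projections

print(round(attn_params * 2 / 1e9, 2), "GB attention,",    # ~0.3 GB per layer
      round(ffn_params * 2 / 1e9, 2), "GB ffn per layer")  # ~1.41 GB per layer
```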

315 of 348

From past experiments:

model: llama-2-70b-chat

                | prompt evaluation (t/s) | token generation (t/s)
llama.cpp       | 11.87                   | 4.78
mlx             | 1.063                   | 0.201
llama.cpp / mlx | 11.16                   | 23.78

316 of 348

Analysis: Decode Stage

- workload: each input prompt contains ~100 tokens
- when the batch size is large enough, longer output lengths (n_gen_tokens) start to create noticeable pressure during attention calculation

317 of 348

Analysis: Prefill Stage

- as observed with other models, latency increases sub-linearly with respect to batch size

318 of 348

Performance Comparison

319 of 348

Mu-Chi’s Updates (10/02/2024)

Llama-3 inference on a single RTX-4090 with EAGLE speculative decoding

(on a sidenote: llama3.2 released models with 1B, 3B, 11B, and 90B parameters)

outline:

  • brief intro to the EAGLE speculative decoding framework
  • experiment results with llama-3-7B
  • experiment results with llama-3-70B (by modifying EAGLE to enable offloading)

320 of 348

Recap: the gist of speculative decoding

speculative decoding workflow:

1. the draft/ smaller model generates multiple draft tokens based on the input sequence

2. the target/ bigger model processes the input sequence + all draft tokens in parallel in a single forward pass (a toy accept/ reject step is sketched below)
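A toy sketch of the accept/ reject step for the greedy case (simplified; real implementations verify against the target distribution rather than just the argmax):

```python
def accept_prefix(draft_tokens, target_argmax):
    """Keep draft tokens while they match the target model's greedy choice at each position."""
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d == t:
            accepted.append(d)      # draft agrees with the target model: keep it
        else:
            accepted.append(t)      # first disagreement: take the target's token and stop
            break
    return accepted

print(accept_prefix([3, 7, 9, 2], [3, 7, 4, 8]))  # [3, 7, 4] -> 3 tokens gained in one cycle
```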

321 of 348

What is EAGLE about?

premise: good draft model is hard to find (should use architecture, training data, and tokenizer similar to that of the target model)

1. EAGLE trains its own light-weight draft model (0.24B for Llama-2-7B, 1B for Llama-2-70B, 0.28B for Mixtral8x7B. The 1B draft model was trained in 1-2 days on 4x A100 (40G) GPUs)

2. the draft model performs autoregressive inference on the feature level (as opposed to the token level)

3. grows a dynamic tree of draft tokens to increase the number of accepted tokens

322 of 348

What do you mean by feature level?

eagle argues that “autoregression at the feature level is simpler than at the token level [since] feature exhibits more regularity than tokens, which are transformations of natural language”

323 of 348

EAGLE draft model architecture

324 of 348

What do you mean by tree of draft tokens?

tree size: top_k + depth * top_k ^ 2
# forward passes to construct the tree: 1 + depth
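As a quick check of that bookkeeping (using the formula stated above):

```python
def draft_tree(top_k, depth):
    size = top_k + depth * top_k ** 2   # number of draft tokens in the tree
    forward_passes = 1 + depth          # draft-model forward passes to construct it
    return size, forward_passes

print(draft_tree(5, 10)[0])             # 255 draft tokens for the top-k = 5, depth = 10 tree
```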

325 of 348

llama-3-8B inference on a single RTX 4090 with EAGLE

vanilla (t/s) | with EAGLE (t/s) | speedup
54.34 | 116.36 | 2.14x

generation cycle latency breakdown:
drafting (ms) | target model verification forward pass (ms) | misc. (ms)
12.57 | 22.99 | ~2

tree: top-k = 5, depth = 10, size = 5 + 10 * 5 ^ 2 = 255
# draft model forward passes: 10

116.36 / (1000 / (12.57 + 22.99 + 2)) = 116.36 / 26.62 = 4.37 tokens per cycle

326 of 348

llama-3-70B inference on a single RTX 4090 with EAGLE & weights offloading

- the original EAGLE framework only supports scenarios where both the draft and the target model fit in GPU memory (multi-GPU is allowed)

- I modified EAGLE to store the draft model on the GPU and memcpy the target model layer by layer from CPU to GPU memory during the forward pass (no cache; all computation happens on the GPU)

- I mimicked Sequoia’s “optimized offloading engine” to allow overlap between memcpy and GPU computation

327 of 348

Offloaded Eagle

overlapping:

1. drafting with the target model’s first-layer memcpy

2. the current target-model layer’s computation with the next layer’s memcpy

(uses 2 CUDA streams, pinned memory, and double buffering in GPU memory; a minimal sketch of this overlap pattern follows)
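A minimal PyTorch sketch of that overlap pattern (illustrative only, not the actual modified-EAGLE code; layer_forward is a stand-in for a real transformer layer):

```python
import torch

copy_stream = torch.cuda.Stream()
buffers = [None, None]                                  # double buffer in GPU memory

def prefetch(layer_cpu, slot):
    # layer_cpu: dict of pinned CPU tensors for one "layer"; queued on the side stream
    copy_stream.wait_stream(torch.cuda.current_stream())   # don't overwrite a slot still in use
    with torch.cuda.stream(copy_stream):
        buffers[slot] = {k: v.to("cuda", non_blocking=True) for k, v in layer_cpu.items()}

def layer_forward(x, w):                                # stand-in compute: one matmul per "layer"
    return x @ w["weight"]

def forward_offloaded(layers_cpu, x):
    prefetch(layers_cpu[0], 0)                          # first copy can overlap with drafting
    for i in range(len(layers_cpu)):
        torch.cuda.current_stream().wait_stream(copy_stream)   # layer i's weights are resident
        if i + 1 < len(layers_cpu):
            prefetch(layers_cpu[i + 1], (i + 1) % 2)    # copy the next layer while computing this one
        x = layer_forward(x, buffers[i % 2])
    return x

layers_cpu = [{"weight": torch.randn(64, 64).pin_memory()} for _ in range(4)]
print(forward_offloaded(layers_cpu, torch.randn(1, 64, device="cuda")).shape)
```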

328 of 348

llama-3-70B inference on a single RTX 4090 with EAGLE & weights offloading

performance comparison (end to end)
(reference numbers from https://arxiv.org/pdf/2402.12374; hardware: 1x L40 48GB and 1x RTX 4090 24GB)

DeepSpeed-Zero-Inference: pure offloading (t/s) | Sequoia: 7B draft model (t/s) | Offloaded EAGLE: 1B draft model (t/s)
0.18 | 1.79 | 0.79

generation cycle latency breakdown:
drafting (ms) | target model verification forward pass (ms) | misc. (ms)
33.28 | 5,235 | ~4.5

tree: top-k = 5, depth = 10, size = 5 + 10 * 5 ^ 2 = 255
# draft model forward passes: 10

0.79 / (1000 / (33.28 + 5235 + 4.5)) = 0.79 / 0.19 = 4.16 tokens per cycle

329 of 348

Moving Forward: Speculative Decoding

recap

- during the last few weeks, we have been experimenting with speculative decoding as a way to accelerate LLM inference without degrading model accuracy in offloading settings where performance is often bottlenecked by CPU to GPU PCIe bandwidth

- last time, we introduced EAGLE, a speculative decoding framework that trains its own draft model instead of using one off the shelf

- By enabling offloading with basic optimizations such as delegating FFN computation to the CPU, we were able to achieve ~1.75 t/s throughput with EAGLE when running Llama-3-70B with a 1B self-trained draft model on 1x RTX-4090 24GB, which is on par with Sequoia’s results with Llama-3-8B as the draft model

330 of 348

Moving Forward: Speculative Decoding

recap

- this made us wonder: can we run Llama3.1-405B on a 1x RTX-4090 24GB?

- to do so, we need to:
 1. store weights in SSD & CPU memory => more PCIe limitations

2. incorporate hierarchical speculative decoding: big / medium / small models

- but before we rush into things, we should see whether others have already implemented this idea (spoiler alert: yes)

331 of 348

Moving Forward: Speculative Decoding

today’s outline

- brief introduction to related work:
 1. Cascade Speculative Drafting for Even Faster LLM Inference
    by a team at the University of Illinois Urbana-Champaign, December 2023
    https://arxiv.org/pdf/2312.11462

 2. TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
    by a team at CMU (Beidi Chen, https://www.andrew.cmu.edu/user/beidic/) and Meta AI FAIR lab, April 2024
    https://arxiv.org/pdf/2404.11912

- experiments: opportunities we discovered when examining TriForce and EAGLE side by side

332 of 348

Related Work: Cascade Speculative Decoding

- recall that, when verifying the draft sequence, if the token at the current position is rejected, all following tokens are discarded
- therefore, draft tokens at later positions in the sequence have lower chances of being accepted
- inspired by this observation, Cascade uses larger draft models for earlier positions and smaller ones for later positions (a toy sketch follows)
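The position-dependent drafter choice can be illustrated with a toy sketch (the `large_draft` / `small_draft` callables are hypothetical; this is not the paper’s full algorithm, which combines several cascading strategies).

```python
# Toy illustration of the idea above: spend the expensive drafter on the early
# positions, whose tokens are most likely to survive verification.
def cascade_draft(prefix, draft_len, large_draft, small_draft, switch_at=2):
    tokens, drafted = list(prefix), []
    for pos in range(draft_len):
        drafter = large_draft if pos < switch_at else small_draft
        nxt = drafter(tokens)   # each drafter returns the next token id (assumed interface)
        drafted.append(nxt)
        tokens.append(nxt)
    return drafted
```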

333 of 348

Related Work: TriForce

background (not from the paper but relevant)

- in LLMs, the context (input prompt + previously generated tokens) is used to generate the next token in the sequence

- to avoid repeating computation, the KV-cache, which stores the results of matmul(input, K_matrix) and matmul(input, V_matrix) for every processed token, is used to accelerate the attention mechanism

- Llama-3.1 supports context lengths of up to 128K tokens (131,072 to be precise)

- the popularity of Chain-of-Thought prompting to assist LLMs in reasoning also implies longer contexts

334 of 348

Related Work: TriForce

- note that KV-cache size grows linearly with batch size

- this means that, when serving these SOTA models, we need to account for both the model weights and the KV-cache size to realize their full potential

- given this, and the fact that draft models grow less accurate with longer contexts (perhaps because they are not trained to handle as much context as the target model; this is, however, not the case for the Llama-3.1 family), TriForce proposes …

- how big is the KV-cache?
 2 * n_layers * batch_size * max_sequence_len * n_kv_heads * head_dim * 2 (bytes)

- which means the KV-cache of a single 128K-token sequence occupies
 Llama-3.1-8B:   2 * 32  * 1 * 128,000 * 8 * 128 * 2 / 1000^3 = 16.78 GB
 Llama-3.1-70B:  2 * 80  * 1 * 128,000 * 8 * 128 * 2 / 1000^3 = 41.94 GB
 Llama-3.1-405B: 2 * 126 * 1 * 128,000 * 8 * 128 * 2 / 1000^3 = 66.06 GB
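The arithmetic above can be reproduced with a small helper (defaults reflect the Llama-3.1 GQA configuration used on this slide: 8 KV heads, head_dim 128, 2-byte fp16/bf16 elements).

```python
# KV-cache size per the formula above: 2 (K and V) * n_layers * batch_size *
# seq_len * n_kv_heads * head_dim * bytes per element, reported in decimal GB.
def kv_cache_gb(n_layers, seq_len, n_kv_heads=8, head_dim=128, batch_size=1, dtype_bytes=2):
    return 2 * n_layers * batch_size * seq_len * n_kv_heads * head_dim * dtype_bytes / 1000**3

print(kv_cache_gb(n_layers=32,  seq_len=128_000))   # ≈ 16.78 GB  (Llama-3.1-8B)
print(kv_cache_gb(n_layers=80,  seq_len=128_000))   # ≈ 41.94 GB  (Llama-3.1-70B)
print(kv_cache_gb(n_layers=126, seq_len=128_000))   # ≈ 66.06 GB  (Llama-3.1-405B)
```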

335 of 348

Related Work: TriForce

a three-tier architecture:

1. at the bottom, a tiny draft model with a fixed-size KV-cache, used to accelerate

2. a “self-speculating” tier in the middle that uses the original target model weights + ~3% of the full KV-cache, which in turn speeds up

3. the target model with the full KV-cache

336 of 348

Related Work: TriForce

TriForce argues self-speculation in the “middle” works because …

- attention sparsity is a well-explored topic: most of the attention score mass can be recovered from a small portion of the KV-cache

- TriForce observed that adjacently generated tokens tend to attend to the same part of the KV-cache (intuitively, think of a request like “summarize the experiment setup section of this paper”: consecutive output tokens keep looking at the same region); a rough sketch of this retrieval idea follows
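The sketch below conveys the retrieval-cache intuition only; it is not TriForce’s actual implementation, and the function name, chunking scheme, and scoring rule are all assumptions made for illustration.

```python
import torch

# Score fixed-size chunks of the cached keys against the current query and keep
# only enough top-scoring chunks to fill the retrieval budget; because adjacent
# tokens attend to similar regions, the selection can be reused for several steps.
def select_retrieval_cache(query, k_cache, v_cache, chunk_size=128, budget=8192):
    # query: (head_dim,); k_cache, v_cache: (seq_len, head_dim) for one head
    n_chunks = k_cache.shape[0] // chunk_size
    scores = (k_cache[: n_chunks * chunk_size] @ query).view(n_chunks, chunk_size)
    chunk_scores = scores.mean(dim=-1)                       # one relevance score per chunk
    keep = torch.topk(chunk_scores, k=min(budget // chunk_size, n_chunks)).indices
    idx = (keep[:, None] * chunk_size
           + torch.arange(chunk_size, device=keep.device)).flatten()
    return k_cache[idx], v_cache[idx]
```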

337 of 348

Experiments: TriForce

- we hypothesized that, due to self-speculation (which still uses the full target model weights), TriForce would not perform well with shorter contexts
- note that vanilla autoregressive decoding with Llama-3.1-8B is reported to reach 54 t/s on the same machine

context length   retrieval cache budget   # tokens per generation cycle
130048           8192                     7.56
32784            8192                     9.84
8192             8192                     11.74
2048             2048                     11.87
512              512                      11.22
128              128                      11.78

338 of 348

Experiments: TriForce

interestingly, with shorter contexts, TriForce performs poorly when the retrieval cache budget is low; its best performance is recorded when the “middle” self-speculative mechanism is effectively equivalent to the target model (retrieval cache budget = context length)

339 of 348

Experiments: EAGLE

naturally, we became curious about how EAGLE performs with longer contexts
note that the results published in their paper (as in Sequoia’s) use a very short context: 128 tokens

context length   # tokens per generation cycle
128              3.71
256              3.70
512              3.34
1024             3.13
1765             2.35
3072             1.85

340 of 348

Comparative Analysis

- EAGLE is bad at longer contexts
- TriForce is bad at shorter contexts
=> good synergy?

note that:
- we’re not using EAGLE with offloading here yet
- TriForce only offloads the target model’s full KV-cache. What happens when the draft model’s weights can’t fit in GPU memory?
- aside from overlapping computation and memcpy, few optimizations were introduced for the target model’s verification runs

interesting to know:
longer contexts make attention-score calculation during token generation even more memory-bound, because most of the time is spent loading the KV-cache

341 of 348

An intriguing idea: Accelerating MoE LLM with Self-Speculation

motivation

- when experimenting with Mixtral-8x7B, 彥文 observed that the model could still perform well when using the top-1 expert plus a random 2nd expert (the original design chooses the top 2 out of 8 experts)

- this means we can pick an expert that is already on the GPU as the 2nd expert to reduce execution time

- the goal of 彥文’s work is to balance performance improvement against accuracy degradation

- well, I would like to improve MoE LLM’s performance without any accuracy loss. Maybe we can use “top-1 + random 2nd ” as a draft model to accelerate the original model through speculative decoding

342 of 348

An intriguing idea: Accelerating MoE LLM with Self-Speculation

related work

- EAGLE (self-trained draft model with 0.28B params) has attempted this but did not get very good results

notes on their reported numbers:
- speedup & acceptance length possibly use a 60-token draft tree (no evaluation code released)
- #-alpha denotes the acceptance rate of the draft token at position # in the sequence (no tree)
- temperature = 0 implies greedy decoding, i.e. always selecting the token with the highest probability

- their paper argued that MoE models are harder to speed up than their dense counterparts, since the draft tokens can activate different parameters during the verification forward pass

343 of 348

Recap: how does speculative decoding achieve speedup?

speculative decoding workflow:

1. drafting: the smaller draft model generates multiple draft tokens based on the input sequence

2. verification: the bigger target model processes the input sequence + all draft tokens in parallel in a single forward pass

when performing autoregressive inference, generating N tokens requires moving the same parameters from memory to cache N times

- with dense models and perfect drafting accuracy, speculative decoding reduces that data movement to a single pass

- however, with MoE models, the draft tokens can activate different experts, increasing data movement during verification (a rough performance model of this effect follows)
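A back-of-envelope, memory-bound model makes the argument concrete; every parameter below is an illustrative assumption (costs are in units of “one vanilla decoding step”, i.e. one full pass over the activated weights), not a measurement.

```python
# If verification touches more experts per layer than a vanilla top-2 step would,
# the extra weight movement eats into the gain from accepting several tokens per cycle.
def moe_spec_decode_speedup(accepted_per_cycle, experts_touched_per_layer,
                            experts_per_layer_vanilla=2, draft_overhead=0.2):
    verify_cost = experts_touched_per_layer / experts_per_layer_vanilla
    cycle_cost = verify_cost + draft_overhead        # relative to one vanilla step
    return accepted_per_cycle / cycle_cost

print(moe_spec_decode_speedup(3.0, experts_touched_per_layer=2))   # 2.5   -> drafts reuse the same experts
print(moe_spec_decode_speedup(3.0, experts_touched_per_layer=6))   # ~0.94 -> diverging experts erase the gain
```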

344 of 348

Performance Model: accelerating MoE LLM with speculative decoding

345 of 348

Performance Model: accelerating MoE LLM with speculative decoding

through a sample case: things to watch out for

346 of 348

Experiments: 4 implementations

- basic: the draft model uses only the top-1 expert. Attention weights + experts 0-1 up to the 25th layer are kept on the GPU. Computation happens wherever the weights reside; no weight memcpy.

- clear_start: attention weights + all 8 experts up to the 7th layer are kept on the GPU. Uses top-2 routing for the first 7 layers and top-1 for the rest.

- random_2nd: the draft model uses the top-1 expert + whichever expert is already on the GPU. Same weight placement as basic (see the sketch below).

- dyn_draft_len: same as random_2nd, however,
 1. performs drafting + verification together in the first forward pass
 2. if the first draft token is rejected, terminates drafting early
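A minimal sketch of the random_2nd routing tweak is given below; the helper name and interface are hypothetical (the real change lives inside the Mixtral MoE routing code), but it shows the idea: keep the router’s top-1 expert and substitute a GPU-resident expert as the 2nd one, so drafting triggers no extra weight transfers.

```python
import torch

def draft_expert_choice(router_logits: torch.Tensor, experts_on_gpu: set[int]):
    # router_logits: (n_experts,) routing scores for one token at one MoE layer
    top1 = int(torch.argmax(router_logits))
    resident = [e for e in experts_on_gpu if e != top1]
    if resident:
        # pick the best-scoring expert among those already in GPU memory
        # (could equally be chosen at random, as in the original observation)
        second = max(resident, key=lambda e: float(router_logits[e]))
    else:
        second = int(torch.topk(router_logits, 2).indices[1])   # fall back to the true top-2
    return top1, second
```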

347 of 348

Evaluation: 4 implementations

implementation   6, non-greedy   6, greedy   8, non-greedy   8, greedy
basic            0.80            0.61        0.69            0.51
clear_start      0.75            0.62        0.69            0.53
random_2nd       0.88            0.74        0.80            0.63
dyn_draft_len    0.92            0.85        0.79            0.76

Acceptance Rates

here, we are not even using a tree! However, high draft accuracy comes at a cost

348 of 348

Evaluation: 4 implementations

in August, my implementation enabling static collaboration between the CPU and GPU achieved 4.28 t/s decode throughput, which translates to ~234 ms per-token latency

- we need to make the draft model faster

- hierarchical architecture?