1 of 272

by Mu-Chi Chen

2 of 272

What Is a Data Hub?

  • mediator between various data sources and data consumers
  • how does it differ from conventional data storage solutions (e.g. databases, data lakes)?

3 of 272

What Is a Data Hub: a one stop shop for data consumers

4 of 272

What Is a Data Hub: Main Functions

5 of 272

Examples

from platform providers:

  • Cumulocity
  • MarkLogic
  • Cloudera

from an alternative/complementing definition of data hub:

  • LinkedIn’s DataHub
  • Microsoft’s OneLake data hub

from Google as a narrowly purposed product:

  • Ads Data Hub

6 of 272

Example: Cumulocity

Highlights:

  • IoT devices send data to MongoDB
  • Dremio (a distributed SQL engine) serves as the query interface & ETL orchestrator
  • stores data as Apache Parquet (a columnar format) on data lakes

7 of 272

Example: MarkLogic

Highlights:

  • allows data ingestion from Apache Kafka
  • uses a multi-model database called MarkLogic Server

8 of 272

Example: Cloudera

workflow:

  • ingest data by creating a data hub cluster that hosts Apache NiFi (a data flow manager that can pull data via HTTP, S3, Kafka, Mongo …)
  • launch transformation/ analytic jobs on perhaps another cluster (supports Apache Spark for batch & stream processing)
  • clusters in the same environment (e.g. test, staging, production) can save data to a shared data lake, where it also becomes accessible to consumers

reference:

https://www.cloudera.com/products/data-hub.html#

https://docs.cloudera.com/data-hub/cloud/index.html

9 of 272

Things to Consider

how to move data into our data hub? (data ingestion & integration)

  • Apache NiFi as an open-source solution
  • AWS Glue or Azure Data Factory, depending on our cloud service provider

where do we store the data in our data hub?

  • data lake (Azure, AWS, GCP, Snowflake …)
  • multi-model database (open-source: Apache Ignite, ArangoDB, OrientDB, FoundationDB, paid: Azure Cosmos DB)

implications of our infrastructure choices?

10 of 272

slides above were covered in

Data Hub Project Weekly Meeting on 9/28

11 of 272

Example: Data Hub as Data Catalog/ Metadata Platform

What is a Data Catalog?

According to IBM, “a data catalog is a detailed inventory of all data assets in an organization, designed to help data professionals quickly find the most appropriate data for any analytical or business purpose.”

reference: https://www.ibm.com/topics/data-catalog

Why does a Data Hub need this?

Aside from enabling people to get the data they need easily, data hubs should also help them find the data they need (this would essentially be the face of this project).

  • encourages and therefore increases usage
  • facilitates data sharing and open research

12 of 272

Example: Data Hub as Data Catalog/ Metadata Platform

Common functions of a data catalog

  • IAM
  • search
  • data lineage (most likely requires logs)
  • data management (source, ingestion, retention, and export configurations)
  • data ops (pipeline execution and statistics)
  • data quality (rule definitions and pipeline execution results)
  • compliance (SOC 2, right to be forgotten, right of access)
  • explainable and reproducible AI

reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained

13 of 272

Example: LinkedIn & Acryl Data’s DataHub

open source

https://datahubproject.io/

slides 12 and 14-16 reference a LinkedIn Engineering Blog post about metadata catalog architectures by Acryl Data’s CTO (previously LinkedIn’s Principal Staff Software Engineer leading the data infrastructure team)

14 of 272

Example: LinkedIn & Acryl Data’s DataHub

architecture

  • metadata are event-based, subscribable in real-time
  • metadata providers push to stream or service API
  • changes to metadata generate logs
  • logs can be materialized in different forms

reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained

15 of 272

Example: LinkedIn & Acryl Data’s DataHub

architecture

  • metadata should be strongly typed and have collaboratively defined relationships

reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained

16 of 272

Example: LinkedIn & Acryl Data’s DataHub

reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained

  • providers push metadata to a change proposal stream (like a topic on Apache Kafka), where it is processed and made available for subscription
  • the metadata (main) service subscribes to the processed change stream, writes logs, and transforms & loads the metadata into different storage solutions depending on use case
  • consumers (e.g. the data hub dashboard) power different applications using the various data stores

17 of 272

Example: Microsoft’s OneLake Data Hub

question:

What does MBZUAI need?

18 of 272

Example: Ads Data Hub (will skip in presentation)

Highlights:

  • reads data from: Google Ads, Youtube, your GCP BigQuery (data warehouse)
  • writes to your GCP project upon query
  • privacy & security by Google

19 of 272

For weekly meeting on 10/5/2023

topic: continue to survey data hub elsewhere

  • Internet of Water Data Hub
  • Art Institute of Chicago Data Hub

maybe

  • circle back to talk about data catalogs
  • Open Data Hub (originally an internal Red Hat project but then open sourced)

20 of 272

Example: Internet of Water Data Hub

What is the Internet of Water Coalition?

A group of organizations (e.g. Duke University Nicholas Institute, Lincoln Institute Center for Geospatial Solutions, Western States Water Council) working with federal, state, and local governments to improve water management by building a water data infrastructure.

In summary:

Better data means better management; therefore, we need good infrastructure to make water data more discoverable, accessible, and usable.

21 of 272

Example: Internet of Water (IoW) Data Hub

  • collects water data from disparate sources into one place
  • standardizes collected data according to a commonly adopted principle (which implies that it only stores structured data)
  • enables discovery and access by Geoconnex (metadata catalog also developed by IoW)

22 of 272

Example: Internet of Water (IoW) Data Hub

basic components of a Data Hub

  • data producers
  • data wrappers: the transform and load jobs of an ETL process
  • data store
  • metadata catalog: searchable collection of metadata that points to data in data store including API to access the data (this is where data users can come in)

individual data hubs are then connected by Geoconnex, which is essentially a global metadata catalog (this is another place data users can come in)

23 of 272

Example: Internet of Water (IoW) Data Hub

there are four types of data hub, and this is type A

  • distributed in the sense that the data producers are responsible for the entire ETL process & storage
  • here, data hub provides only a window to discover and access data.

24 of 272

Example: Internet of Water (IoW) Data Hub

type B:

  • data hub also stores the data
  • data producers push transformed data to data hub

type C:

  • same as above except that data hub is responsible for pulling transformed data from data producers

* data sharing creates a privacy issue

25 of 272

Example: Internet of Water (IoW) Data Hub

type D (which is more like our original vision):

  • data producers send raw data to the data hub, which is responsible for the rest

* can the data producers trust the data hub to transform the data right? (which brings us back to the ETL v.s. ELT question)

26 of 272

Example: Internet of Water (IoW) Geoconnex

purpose: make it easier for data consumers to find/access the data they want/need. Roughly similar to what LinkedIn and Microsoft are doing

why I think it is not a good product

  • every org publishing data is required to create a web page for each location for which it has water data
  • every created web page needs to embed structured metadata in JSON-LD
  • every page needs its own unique ID; if you ever modify the URL, you have to go to Geoconnex to remap it

once everything above is done right, a web crawler can visit these pages, extract the data it needs, and store it in a central database

27 of 272

Example: Internet of Water (IoW) Data Hub

28 of 272

Example: the Art Institute of Chicago Data Hub

background

  • disparate data sources
  • some data transformation jobs repeated across different applications
  • wants robust support for website (web & mobile app) search (full-text search, autocomplete, and general SQL-like queries such as filtering and aggregation)

29 of 272

Example: the Art Institute of Chicago Data Hub

design goals

  • only stores data that are already publicly accessible, which precludes data privacy issues. This, they argue, makes it okay to:
  • use cloud infrastructures
  • not care about authentication & authorization for now
  • not treat the data hub as a central, unique, and permanent storage: that is the data producer’s job (not necessarily applicable to MBZUAI)
  • minimize the need to ask data producers to modify their systems for data ingestion

30 of 272

Example: the Art Institute of Chicago Data Hub

architecture

  • hub and spoke model consisting of multiple microservices: APIs, HTTPS, JSON (you know, the usual …)
  • spokes
  • one microservice for each type of data source
  • each microservice is responsible for extracting the needed data, transforming it correctly, and exposing an API for reading the transformed data (pull instead of push)
  • can involve scraping data into local dbs for storage and access
  • hub
  • pulls, aggregates, and indexes data from spokes
  • MySQL for primary storage :(
  • Elasticsearch as search engine (enables full-text search, but is a document db itself like Mongo; no longer open source since 2021)
  • exposes an API for data consumers: a user submits a query to the hub, and the hub asks MySQL and/or Elasticsearch for the data

31 of 272

Example: the Art Institute of Chicago Data Hub

challenges

  • performance: high website traffic (increase in data consumers) results in data hub dropping connections
  • auto-scaling capabilities
  • debugging: hard to trace where mistakes were made
  • importance of data lineage (done very well by LinkedIn) - where the data originated, how it has changed, and its ultimate destination

references: https://mw19.mwconf.org/paper/building-a-data-hub-microservices-apis-and-system-integration-at-the-art-institute-of-chicago/index.html (slides 28-31)

32 of 272

Example: Red Hat’s Open Data Hub (ODH)

  • end-to-end AI platform
  • contrary to all the data hubs discussed above, ODH provides a space for data scientists to develop their applications and tools for deploying them into production
  • this is a platform worth introducing because it is built mostly with open source projects

33 of 272

Example: Red Hat’s Open Data Hub (ODH)

Architecture highlights

  • uses Ceph for (distributed file, object, block) storage -> an example of an open source data lake
  • uses Apache Hive Metastore for metadata to enable data discovery
  • uses OpenShift as container management platform (powered by Kubernetes), which means the underlying compute resource can come from public/private cloud or on-premise data centers

34 of 272

Example: Red Hat’s Open Data Hub (ODH)

35 of 272

Example: Google Analytics Hub

feature of BigQuery

short intro to BigQuery:

  • GCP’s fully managed data warehouse
  • separates storage (BigQuery storage) & compute (BigQuery analytics) engine
  • two main use cases
  • store & analyze data in BigQuery
  • use BigQuery to access your data where it lives, which only supports:
  • external tables: Cloud Bigtable (NoSQL), Cloud Storage (data lake), Google Drive
  • federated queries: Cloud Spanner (a relational database that scales like a non-relational one) and Cloud SQL

36 of 272

Example: Google Analytics Hub

How might it look for MBZUAI?

37 of 272

Example: Google Analytics Hub

38 of 272

Alright, so how would I build it?

39 of 272

40 of 272

Scaling

41 of 272

What could be MBZUAI Data Hub’s unique offering?

42 of 272

Automatic Sampling Bias Detection

what is sampling bias?

a sample unrepresentative of the population

why do we care?

  • internal & external (generalizability) validity
  • fairness in machine learning

what role can data hub play in this?

by enabling automatic and real-time exploratory data analysis, we can

  • improve data collection ASAP
  • optimize storage by not saving data that are too similar
  • enhance transparency for users of resulting models

43 of 272

Automatic Sampling Bias Detection

how do we detect sampling bias?

what does detecting sampling bias mean?

uncovering patterns in the data (arguably data mining, or knowledge discovery in data), e.g.: 90% of the data collected today came from people who identify as cisgender men (assigned male at birth and identify as men)

44 of 272

Automatic Sampling Bias Detection

ways to uncover patterns in data

methods of unsupervised learning:

  • clustering

exclusive (K-means)

hierarchical (agglomerative)

probabilistic (Gaussian Mixture, determines the probability of data belonging to a particular distribution)

  • association rules

Apriori algorithm (frequent data combination)
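To make the exclusive-clustering idea concrete, here is a minimal pure-Python K-means (Lloyd's algorithm) on hypothetical 1-D respondent ages; the data, function name, and parameters are all invented for illustration, not part of any product above.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Lloyd's algorithm on 1-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point joins its nearest centroid
        labels = [min(range(k), key=lambda j: abs(p - centroids[j])) for p in points]
        # update step: move each centroid to the mean of its members
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = sum(members) / len(members)
    return centroids, labels

# hypothetical respondent ages: collection is skewed toward two age groups
ages = [19, 21, 22, 20, 18, 59, 61, 60, 62]
centroids, labels = kmeans(ages, k=2)
# cluster sizes (5 vs 4 here) hint at which groups dominate the sample
```

Comparing cluster sizes against known population proportions is one cheap way to flag the kind of skew described above.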

45 of 272

Automatic Sampling Bias Detection

ways to uncover patterns in data

methods of unsupervised learning:

  • outlier detection

local outlier factor (local density of focus point v.s. neighbors)

isolation forest (build overfitted decision trees by splitting at random thresholds & features until every point is isolated. Those isolated early are outliers)

autoencoders (perhaps for images. The encoder learns a latent representation of the input and the decoder reconstructs the input from the encoded output. Larger reconstruction error implies anomaly)
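A minimal sketch of the isolation-forest intuition above (points that random splits isolate early are outliers), in pure Python on hypothetical 1-D data; `path_length` and `anomaly_depth` are made-up names for illustration.

```python
import random

def path_length(point, data, rng, depth=0, max_depth=10):
    """Depth at which random axis-aligned splits isolate `point` (1-D data)."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    lo, hi = min(data), max(data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # keep only the points on the same side of the split as `point`
    same_side = [x for x in data if (x < split) == (point < split)]
    return path_length(point, same_side, rng, depth + 1, max_depth)

def anomaly_depth(point, data, n_trees=100, seed=0):
    """Average isolation depth over many random trees; small = likely outlier."""
    rng = random.Random(seed)
    return sum(path_length(point, data, rng) for _ in range(n_trees)) / n_trees

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 100.0]
outlier_depth = anomaly_depth(100.0, data)   # isolated after very few splits
inlier_depth = anomaly_depth(0.5, data)      # takes several more splits
```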

46 of 272

Automatic Sampling Bias Detection

ways to uncover patterns in data

methods of supervised learning:

  • classification

understand the different ways in which your data can be categorized to adjust data collection strategy

e.g.

classification results seem geographical, and much more data is classified as coming from East Asia

KNN, K-nearest neighbors

decision tree & random forest
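To illustrate the classification idea above, a toy K-nearest-neighbors sketch on hypothetical geographic-looking features; the data, labels, and `knn_predict` helper are all invented for illustration.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label); majority vote among k nearest."""
    sq_dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda item: sq_dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# hypothetical (longitude, latitude)-like features with region labels
train = [((120, 24), "east_asia"), ((121, 25), "east_asia"), ((119, 23), "east_asia"),
         ((10, 50), "europe"), ((2, 48), "europe")]
regions = [knn_predict(train, q) for q in [(118, 22), (5, 49)]]
# tallying predicted regions over incoming data would expose geographic skew
```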

47 of 272

Academic Discussions on Data Hub

48 of 272

Technical Report: Developing a Working Data Hub

  • paper by the MIT Lincoln Laboratory Supercomputing Center in March 2020
  • authors:

Vijay Gadepally (OSU Ph.D. in ECE, technical staff at MIT Lincoln Lab)

Jeremy Kepner (Princeton Ph.D. in Astrophysics, head of the MIT Lincoln Laboratory Supercomputing Center)

  • the Lincoln Lab never seems to have built a data hub of its own
  • the closest product MIT has helped build is probably BigDAWG (a polystore database prototyped at a hackathon hosted by Intel)
  • no concrete experiences to learn from, but interesting guidelines
  • https://arxiv.org/pdf/2004.00190.pdf

49 of 272

MIT on the need of data hub

big data challenges:

  • Volume, Velocity, Variety (the three Vs), Veracity (a fourth V? preserving trust in data and analytics)
  • in summary, needs to extract value from lots of heterogeneously sourced data fast and accurately

evolving solution:

  • one-stop solution for storing (providing) and using (consuming) data
  • which is data hub

50 of 272

MIT on data hub architecture

at the bottom:

external data sources + your own data storage including databases, data warehouses, and data lakes

in the middle:

heterogeneous data management and transformation; enables users to query multiple sources at once

on the top:

data discovery and define data transformation rules

51 of 272

MIT on data hub system engineering

NOTE: “system engineering studies the development of complex systems (p.8)”

HIGHLIGHTS:

3a is an operation supported on databases while 3b is a strategy for file systems/ object storage/ block storage i.e. place for unstructured/ raw data like data lakes

=> similar to my data hub architecture proposal

52 of 272

MIT on raw data storage: Lustre

  • distributed parallel object storage that is great for HPC (kinda similar to S3, but open source and with a file system interface)
  • stores objects (binary data stored in files) and associated metadata (ith node, filename, timestamp, etc.) in a flat structure
  • one metadata server (MDS), multiple object storage servers (OSSes)
  • clients request metadata from the metadata server, which stores pointers to the data on the object storage servers
  • usually uses (not built-in) RAID6 to prevent data loss (allows concurrent failures of up to 2 disks, on average imposes a 35% storage penalty)

53 of 272

Lustre further reading

“Azure Managed Lustre: not your grandparents' parallel file system”

https://techcommunity.microsoft.com/t5/azure-high-performance-computing/azure-managed-lustre-not-your-grandparents-parallel-file-system/ba-p/3889946

Ceph (used in Redhat’s Open Data Hub, originally part of Sage Weil’s doctoral dissertation at UC Santa Cruz, also used extensively in HPC) vs Lustre:

https://www.linkedin.com/pulse/ceph-lustre-yashar-esmaildokht/

AWS FSx for Lustre (integrates with s3)

https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html

official site:

https://www.lustre.org/

54 of 272

MIT on DBMS

what is DMBS?

software interface between users and database

why use a DBMS?

  • define schema
  • CRUD data
  • access control
  • analytics

55 of 272

MIT on DBMS

ACID v.s. BASE

relational db: atomicity, consistency, isolation, and durability. OLTP

non-relational db: basically available, soft state, eventual consistency

CAP theorem

you can only choose 2 between consistency, availability, and partition tolerance

(introduced by a professor at Berkeley, later proven by researchers at MIT)

relational db: C + A

non-relational db: C + P (MongoDB), A + P (Cassandra, developed at FB by an engineer who had previously worked on Amazon’s Dynamo, a precursor of AWS’s DynamoDB)

56 of 272

MIT on DBMS

scaling

relational db: vertical (improve hardware, because they mainly run on a single node)

non-relational db: horizontal (add nodes, which creates a need for strong partition tolerance, which makes relaxing C and/or A inevitable if the CAP theorem is correct)

That said, there is controversy surrounding CAP

NewSQL (as opposed to SQL and NoSQL): a movement that aims to build databases that are both ACID compliant and performant (i.e. horizontally scalable)

some examples include …

CockroachDB: https://www.cockroachlabs.com/

Google Cloud Spanner (which we have seen before when introducing BigQuery, now advertised as being half the cost of DynamoDB): https://cloud.google.com/spanner?hl=en

57 of 272

MIT on DBMS

Access Control: who can access what

granularity: table, row, to cell levels

principal: the “who,” can be role/ user/ groups

asset: the “what,” i.e. resources stored in database

view-based access control is the most popular

  • often implemented through metadata
  • often used with role-based strategies
  • the principal requests access; the db evaluates whether the principal has rights to the requested data, and if so, presents the data through a view

query control strategies: restrict which queries the principal is allowed to issue
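A toy sketch of view-based access control using Python's sqlite3: sensitive columns are hidden behind a view, and a deliberately naive gatekeeper decides which relations a principal may touch. The table, principals, and `query_as` helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE students (id INTEGER, name TEXT, ssn TEXT, grade INTEGER);
    INSERT INTO students VALUES (0, 'Andrew', '123-45-6789', 90),
                                (1, 'Bart',   '123-45-6790', 75);
    -- the view exposes only non-sensitive columns
    CREATE VIEW students_public AS SELECT id, name, grade FROM students;
""")

def query_as(principal, sql):
    """Toy gatekeeper: each principal may only touch its allowed relations."""
    allowed = {"admin": ("students", "students_public"),
               "analyst": ("students_public",)}
    if not any(relation in sql for relation in allowed[principal]):
        raise PermissionError(f"{principal} may not run: {sql}")
    return conn.execute(sql).fetchall()

rows = query_as("analyst", "SELECT name, grade FROM students_public")
```

The substring check is obviously bypassable; real systems evaluate rights against parsed queries and metadata, which is what the metadata-driven evaluation above refers to.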

58 of 272

MIT on DBMS

ways to add security

well … duh … encryption and/or masking

CryptDB (also by MIT):

  • executes SQL queries over encrypted data
  • data can optionally be made decryptable only by password-holders, which means the db admin does not get access to decrypted data

59 of 272

MIT on heterogeneous data management

  • modern applications often rely on diverse datasets with different models and sources
  • ETL is expensive:
  • converting everything to a single model may degrade performance
  • high maintenance cost: transformation rules have to change with the data
  • “custom data stores provide 100X better performance than general purpose databases”
  • sometimes, data sharing policy may prohibit moving the data out to a centralized location
  • federating (like federal to state governments in the U.S.) multiple specialized data stores. Single or multiple query interfaces?
  • how to avoid writing and using a different connector for every selected store
  • how to integrate data

60 of 272

MIT on heterogeneous data management

BigDAWG (Big Data Working Group)

  • unified interface (connector)
  • islands: group databases with the same data model, trade off semantic completeness for location transparency (which implies having a single query language)
  • shim translates island queries to db engine’s native language
  • cast enables moving data between db engines

61 of 272

MIT on heterogeneous data management

  • BigDAWG currently supports PostgreSQL, MySQL, Vertica (?), Apache Accumulo (k-v store), SciDB (??), and S-Store (???)
  • interesting challenges that BigDAWG tackles
  • cross-engine query plans
  • query performance monitoring
  • cross-engine data migration
  • query execution

62 of 272

MIT on key considerations when developing data hub

Technological considerations

  • don’t rely on a centralized data lake
  • single point of failure
  • difficult to store operational data
  • data becomes stale (hard to update)
  • scalable storage architecture that federates multiple specialized data stores
  • provide access to computing infrastructure for data processing

63 of 272

MIT on key considerations when developing data hub

Infrastructure considerations

  • compute & storage nodes, networking equipment
  • cloud computing
  • almost unlimited resources as long as you have $
  • cloud clawback: hybrid cloud
  • vendor lock-in
  • ** some tools may not run on sensitive networks **
  • data catalog that is updated automatically and perhaps even provide information on related previous work (information sharing & research replication)

64 of 272

MIT on key considerations when developing data hub

Data formatting considerations

  • 80% of ML/ AI research time is spent on data wrangling
  • data providers are a crucial source to understand the meaning of data
  • maximize data provider input (knowledge transfer) during data ingestion & wrangling

65 of 272

MIT on key considerations when developing data hub

Security Considerations

  • data providers and consumers should be able to trust you
  • work with Information Security Officers and Subject Matter Experts
  • data aggregation can make insensitive data sensitive: name and address by themselves may not be sensitive, but matching them together creates PII (personally identifiable information)
  • data use agreements (like cookies ~)

66 of 272

MIT on key considerations when developing data hub

Policy considerations

  • new data may warrant addition of new technology
  • open v.s. closed source: balance between long-term support and developer community
  • software licensing may prohibit technology sharing (MBZUAI to NTU for instance)

67 of 272

MIT on key considerations when developing data hub

User engagement

  • active and engaged users and developers
  • quick, transparent, and effective feedback loop
  • ask big projects to share data
  • remove costs of data sharing
  • invest in automatic data discovery: data hub should have the latest data to attract more users

68 of 272

UCSC Xena Platform

https://www.nature.com/articles/s41587-020-0546-8

enables visualization and interpretation of cancer genomics data from disparate sources

Background

  • large public datasets & small and private datasets from individual labs
  • web app (usually for public datasets) | desktop app (usually for private datasets)
  • driving use case: visualize public and private datasets together without having to upload private data to external servers.

69 of 272

UCSC Xena Platform

Architectural Design

two components:

  • frontend (Xena Browser): users create requests; the frontend fetches and integrates data from different hubs. data source configuration is done via hub URLs
  • backend (Xena Hubs): can be installed on a desktop to launch a local server, or hosted on public servers, which sometimes have their own access control via firewall. no integration happens here; private data made available by a local server never gets uploaded anywhere else

70 of 272

UCSC Xena Platform

71 of 272

Lessons from government agencies

72 of 272

Taiwan’s Health and Welfare Data Center (HWDC)

https://pubmed.ncbi.nlm.nih.gov/31118821/

Background

  • National Health Insurance launched in 1995, 99.99% population enrollment
  • National Health Research Institutes established in 2002, in charge of National Health Insurance Research Database (NHIRD)
  • NHIRD stores insurance claims, which means we have data on the entire population
  • NHIRD sits at the core of HWDC

73 of 272

Taiwan’s Health and Welfare Data Center (HWDC)

Background

  • relies heavily on personal identification numbers (PINs) to join data from different registries (beneficiaries, inpatient claims, prescriptions …)
  • has allowed international collaborative study
  • aggregates data from more than 70 databases
  • linkable (JOIN ON PIN)
  • de-identified publicly available data
  • takes roughly two years for data to be available on NHIRD
  • only handling structured data: missing out on medical images!

74 of 272

Taiwan’s Health and Welfare Data Center (HWDC)

Challenges: Privacy

  • top concern. got sued.
  • encrypts (transforms to unique identifiers) names: data masking?
  • access restricted to applicants from academia, research institutes, or hospitals
  • supports remote access to the server but requires an office visit after each session
  • for research purposes only
  • cannot request data from more than 10% of population

75 of 272

Taiwan’s Health and Welfare Data Center (HWDC)

Challenges: Data Quality

  • upcoding (health care providers submitting exaggerated diagnoses to guarantee reimbursement)
  • is this noise that can be overcome with large samples?
  • how to assess impact?
  • less of a problem when diagnoses were made to certify catastrophic illness (waives premium and co-pay)

76 of 272

Taiwan’s Health and Welfare Data Center (HWDC)

Challenges: Unmeasured or unavailable variables

  • lacks data on disease severity, a confounding variable
  • causes both dependent and independent variables
  • disease severity affects both treatment and effect
  • solution:
  • instrumental variables: Z -> X -> Y
  • active comparator: instead of placebo, compares against another commonly administered treatment
  • propensity score: compare probabilities of receiving treatment
  • predict disease severity from diagnoses when available (e.g. strokes)

77 of 272

Querying Encrypted Data

by Microsoft Research

78 of 272

Querying Encrypted Data: Demand

migration to cloud and thus storage on cloud

79 of 272

Querying Encrypted Data

enters encryption:

you won’t be able to understand/ use the data even if you managed to gain access to them

the quick brown fox … a pangram

80 of 272

Querying Encrypted Data

two fundamental techniques:

  • compute over encrypted data: homomorphic encryption
  • compute over plain text in “secure” location

81 of 272

Querying Encrypted Data

more on the previous page

partial: one gate (+ or *)

full: multiple gates and unbounded depth (i.e. arbitrary functions)

Monomi: work of MIT CS and AI Lab (CSAIL) in 2013

CryptDB: from CSAIL too, 2011, Raluca Ada Popa is an Associate Professor at Berkeley EECS now

TrustedDB: Stony Brook University (NY), 2014

Cipherbase: Microsoft research, 2015

82 of 272

Querying Encrypted Data

symmetric encryption

83 of 272

Querying Encrypted Data

asymmetric encryption

84 of 272

Querying Encrypted Data

Advanced encryption standard cipher block chaining mode

85 of 272

Querying Encrypted Data

nondeterministic (more secure):

>> for i in range(2):
>>     print(encrypt("foo"))
1qaz2wsx
3edc4rfv

deterministic:

86 of 272

Querying Encrypted Data

homomorphic encryption:

>> encrypt(1) + encrypt(1) == encrypt(2)

order preserving encryption:

  • if x < y, encrypt(x) < encrypt(y)
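The additive homomorphism above can be made concrete with a toy Paillier implementation (tiny, insecure primes, purely for illustration): multiplying two ciphertexts yields a ciphertext of the sum of the plaintexts.

```python
import math
import random

# toy Paillier keypair with tiny, insecure primes -- illustration only
p, q = 17, 19
n = p * q
n2 = n * n
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                # valid because g = n + 1

def encrypt(m, rng=random.Random(0)):
    # ciphertext: g^m * r^n mod n^2, with random r coprime to n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    L = lambda x: (x - 1) // n
    return (L(pow(c, lam, n2)) * mu) % n

# additive homomorphism: multiplying ciphertexts adds the plaintexts
c = (encrypt(12) * encrypt(30)) % n2
assert decrypt(c) == 42
```

The random factor r also makes the scheme nondeterministic, tying it back to the previous slide: two encryptions of the same value look different.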

87 of 272

Querying Encrypted Data

to summarize

88 of 272

Querying Encrypted Data

authors’ opinion: outdated?

89 of 272

Querying Encrypted Data

90 of 272

Querying Encrypted Data: Trusted Client Architecture

client: applications performing CRUD operations against a server DBMS

  • DBMS only stores encrypted data
  • client responsible for hosting encryption/decryption microservice and related query translator

91 of 272

Querying Encrypted Data: CryptDB as Trusted Client example

CryptDB:

  • uses partially homomorphic encryption
  • no computation on client side

92 of 272

Querying Encrypted Data: CryptDB as Trusted Client example

example operation

client app

>> SELECT SUM(grade) FROM students;

web proxy

>> decrypt(paillier_grade_sum)

DBMS

>> SELECT PAILLIER_SUM(paillier_grade) AS paillier_grade_sum

FROM students;

students table’s schema:

from user’s perspective

{ "id": int, "name": str, "grade": int }

on DBMS

{ "paillier_id": str, "name": str, "paillier_grade": str }

93 of 272

Querying Encrypted Data: Blob Store as Trusted Client example

What is Blob Store?

  • database design
  • blob: encrypted data (e.g. paillier_grade -> blob_grade)
  • blob used in conjunction with partition id

students table’s schema:

{ "name": str, "blob_grade": str, "partition_id": int }

range partition:

grade    partition_id
1 - 9    0
10 - 19  1
20 - 29  2

94 of 272

Querying Encrypted Data: Blob Store as Trusted Client example

Architecture

  • bit similar to CryptDB
  • difference? more query processing is moved to the client’s end (not just translation anymore)

95 of 272

Querying Encrypted Data: Blob Store as Trusted Client example

example operation

client app

>> SELECT SUM(grade)
>> FROM students
>> WHERE grade > 10;

DBMS Shell (trusted, client side)

>> SELECT SUM(DECRYPT(blob_grade))
>> FROM students
>> WHERE DECRYPT(blob_grade) > 10;

DBMS (untrusted, server side)

>> SELECT blob_grade
>> FROM students
>> WHERE partition_id > 0;
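The split above can be sketched in a few lines of Python, with a stand-in XOR "cipher" (not real encryption) so the example stays self-contained: the untrusted server filters coarsely on partition_id, and the trusted client decrypts and applies the exact predicate.

```python
# stand-in XOR "cipher" so the sketch stays self-contained;
# a real system would use proper encryption here
KEY = 0x5A
encrypt = lambda grade: grade ^ KEY
decrypt = lambda blob: blob ^ KEY

partition_id = lambda grade: grade // 10     # 1-9 -> 0, 10-19 -> 1, 20-29 -> 2

# the untrusted server sees only blobs plus coarse partition ids
server_rows = [
    {"name": name, "blob_grade": encrypt(grade), "partition_id": partition_id(grade)}
    for name, grade in [("Andrew", 8), ("Bart", 15), ("Casey", 27)]
]

# server side: coarse filter (grade > 10 can only live in partitions > 0)
candidates = [row for row in server_rows if row["partition_id"] > 0]

# trusted client side: decrypt, apply the exact predicate, aggregate
total = sum(g for g in (decrypt(row["blob_grade"]) for row in candidates) if g > 10)
```

Note the coarse filter can return false positives (grade 10 lives in partition 1 but fails grade > 10), which is why the client must re-apply the predicate, and why the partitioning challenge on the next slide matters.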

96 of 272

Querying Encrypted Data: Blob Store as Trusted Client example

challenge

  • appropriate partitioning
  • optimal query splitting

97 of 272

Querying Encrypted Data: Monomi as Enhanced Blob Store

  • uses partially homomorphic encryption to push more computation onto the untrusted DBMS server
  • requires pre-computation for complex queries (e.g. WHERE updated_at = created_at + 1 -> without the key, you don’t know encrypt(1))

example operation

client app

>> SELECT SUM(grade)

FROM students

WHERE grade > 10;

DBMS Shell (trusted, client side)

>> SUM(DECRYPT(grade_paillier))

DBMS (untrusted, server side)

>> SELECT grade_paillier
>> FROM students
>> WHERE grade_paillier > 10;

98 of 272

Querying Encrypted Data: Trusted Client Summary

  • does not require changing untrusted server DBMS
  • works well when amount of data transferred is small
  • relies on distributed query processing after all
  • pre-computation for complex query is expensive
  • generalizability?
  • partially homomorphic encryption is still expensive, requires 2048 bits to store int32 -> 64x expansion!
  • partially homomorphic encryption supports a limited set of operations

99 of 272

Querying Encrypted Data: Secure In-Cloud Processing

we talked about trusted client in the previous slides

100 of 272

Querying Encrypted Data: Secure In-Cloud Processing

how do queries under different architectures work?

benefit of having smaller TCB?

notice how isolation (hardware, memory) affects size of base

101 of 272

Querying Encrypted Data: Secure In-Cloud Processing

according to this paper:

larger TCB: less secure, more administration

smaller TCB: more secure, less administration

102 of 272

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

Highlights

  • secure coprocessor as secure hardware
  • MySQL: untrusted main storage

SQLite: trusted storage for sensitive data

  • similar to what we talked about in blob store: distributed query processing
  • PCI: payment card industry

103 of 272

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

students table

MySQL (untrusted):

id | name   | birthday_timestamp
---|--------|-------------------
0  | Andrew | 866476800
1  | Bart   | 961171200
2  | Casey  | 1087401600

SQLite (trusted):

id | SSN
---|----------
0  | 123456789
1  | 123456790
2  | 123456791

104 of 272

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

example query

User Query

>> SELECT *

>> FROM students

>> WHERE birthday_timestamp > 866476800 AND

>> SSN > 123456790

105 of 272

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

What Happens Under the Hood

>> SELECT *

>> FROM mysql_students

>> RIGHT JOIN (

>> SELECT id

>> FROM students

>> WHERE SSN > 123456790

>> ) AS sqlite_result

>> ON mysql_students.id = sqlite_result.id

>> WHERE mysql_students.birthday_timestamp > 866476800

106 of 272

Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing

performance bottleneck:

  • secure coprocessors have limited computational and storage capacities
  • this becomes a problem when most data are confidential
  • the trusted part of the server is a full DBMS (a large TCB): how do you convince customers to trust that much code?

107 of 272

Querying Encrypted Data: Cipherbase as an Example of Secure In-Cloud Processing

  • should remind you of CryptDB (encrypted data on MySQL, trusted web proxy to perform query translation)
  • here, client proxy is replaced by FPGA, which by the way is on the server end

108 of 272

Querying Encrypted Data: Cipherbase as an Example of Secure In-Cloud Processing

  • “when all data is strongly encrypted (PHE), Cipherbase achieves 40% the throughput of plaintext SQL Server”
  • “when all columns in TPC-C (Transaction Processing Performance Council Benchmark C) are changed from plaintext to strongly encrypted, the performance drops by about a factor 2. This number can be around 100 for a client-based system”
  • more about this paper next week !!!

https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cipherbase.pdf

109 of 272

Querying Encrypted Data: Semantic Security

“An adversary is allowed to choose between two plaintexts, m0 and m1, and he receives an encryption of either one of the plaintexts. An encryption scheme is semantically secure, if an adversary cannot guess with better probability than 1/2 whether the given ciphertext is an encryption of message m0 or m1. The notion is also referred to as indistinguishability of encryptions. The encryption reveals no information no matter what kind of semantics (meanings) are embedded in the encryption.”

https://link.springer.com/referenceworkentry/10.1007/978-1-4419-5906-5_23

110 of 272

Querying Encrypted Data: Leakage

what information do encryption schemes reveal?

AES in CBC mode (non-deterministic)

  • semantically secure
  • only reveals input length

deterministic

  • an attacker can still learn something about the data by counting the number of unique values and their distribution (equal plaintexts map to equal ciphertexts)

order preserving

  • order …
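The leakage from deterministic encryption is easy to demonstrate. Below, a keyed hash stands in for the encryption scheme (an assumption for illustration, not a real deterministic cipher such as AES-SIV): an observer without the key still recovers the full value distribution.

```python
import hashlib
from collections import Counter

SECRET_KEY = b"k"  # hypothetical key, never shown to the observer

def det_encrypt(value: str) -> str:
    # Deterministic: the same plaintext always maps to the same ciphertext.
    return hashlib.sha256(SECRET_KEY + value.encode()).hexdigest()[:16]

grades = ["A", "A", "A", "B", "C", "A", "B"]
ciphertexts = [det_encrypt(g) for g in grades]

# Without the key, an observer still recovers the frequency histogram:
# 3 distinct values with counts 4, 2, 1.
print(sorted(Counter(ciphertexts).values(), reverse=True))  # -> [4, 2, 1]
```

With outside knowledge (e.g. that "A" is the most common grade), the attacker can then map the most frequent ciphertext back to its plaintext.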

111 of 272

Querying Encrypted Data: Leakage

memory access pattern leakage

malicious actors watching on the server end can study which storage locations different operations access

e.g.

observer learns the following workflow

query A -> access location 0 ~ 9 -> wires money to some account

which implies

location 0 ~ 9 must store information required for some transaction

112 of 272

Querying Encrypted Data: Leakage

  • Encryption != no information leakage
  • In trusted-client architecture, we can hide the sizes of query results by padding them to a fixed limit, but this seriously hinders performance
  • can we, instead, make memory access pattern oblivious (independent of input) ?

113 of 272

Querying Encrypted Data: Access Pattern Oblivious Algorithms

bubble sort

for i in range(len(arr) - 1, -1, -1):
    for j in range(0, i):
        if arr[j] < arr[j + 1]:
            continue
        arr[j], arr[j + 1] = arr[j + 1], arr[j]
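Note that the bubble sort above reads the same pairs in the same order for any input, but its writes still depend on the data (the swap only happens on some branches). A branch-free variant, sketched here for numeric arrays using arithmetic selection (our construction, not from the slides), makes both reads and writes independent of the input:

```python
def oblivious_bubble_sort(arr):
    """Bubble sort whose reads AND writes are independent of the input values."""
    n = len(arr)
    for i in range(n - 1, -1, -1):
        for j in range(i):
            a, b = arr[j], arr[j + 1]
            swap = int(a > b)  # 1 if the pair is out of order, else 0
            # Write both slots on every iteration, selecting arithmetically,
            # so the memory access pattern never branches on the data.
            arr[j] = swap * b + (1 - swap) * a
            arr[j + 1] = swap * a + (1 - swap) * b
    return arr
```

The price of obliviousness is the fixed O(n^2) schedule: the algorithm cannot terminate early on nearly-sorted input, precisely because doing so would leak information.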

114 of 272

Querying Encrypted Data: Access Pattern Oblivious Algorithms

aggregator oblivious encryption

  • “an aggregator [that] can compute an aggregated sum of data [without learning] anything else”
  • even if “homomorphic encryption is employed, the aggregator can still decrypt the ciphertexts being comput[ed] upon”
  • interesting paper to be further studied

https://eprint.iacr.org/2017/479.pdf

oblivious simulation

an additional layer that makes the memory access pattern look random -> prevents benefiting from spatial and temporal locality

115 of 272

Querying Encrypted Data

(from the Cipherbase paper)

In conclusion, what do we want?

solution that balances

  • data confidentiality: minimize information leakage
  • generality: try to support full SQL operations, minimize change to legacy code
  • performance: throughput and latency

116 of 272

Transaction Processing on Confidential Data using Cipherbase

background

  • the previous paper is somewhat an introduction to this one
  • collaboration between Microsoft Research, Stanford University, and ETH Zurich (Einstein’s alma mater, where he also taught theoretical physics)

117 of 272

Transaction Processing on Confidential Data using Cipherbase

Abstract (on Cipherbase):

  • database system
  • end-to-end data confidentiality through encryption
  • what’s new?
  • Architecture with smallest Trusted Computing Base at the time
  • combines industrial strength SQL database with lightweight encrypted data processing on secure hardware (FPGAs, novel usage they argued)
  • competitive performance for transactional workloads (OLTP, as opposed to analytical OLAP workloads)
  • discusses optimization strategies:
  • communication between main untrusted db server and trusted FPGA
  • limited computational and storage capacities on secure hardware

118 of 272

Transaction Processing on Confidential Data using Cipherbase

Architecture Benefits (aside from what was already mentioned last week)

  • inherits SQL functionalities for free: concurrency, recovery, stored procedures, and indexes => trusted module simply as a remote function to call
  • relies on PCIe-based FPGA board

119 of 272

Transaction Processing on Confidential Data using Cipherbase

Threat Model (categories below are assumed to be passive, i.e. they do not tamper with database contents or processing)

strong adversary:

privileged OS and database access, can view the contents of memory/disk and all internal/external communication. However, cannot observe state and computations inside the secure hardware.

weak adversary:

can obtain a one-time snapshot of data-at-rest

in real world:

adversary usually lies between the two

120 of 272

Transaction Processing on Confidential Data using Cipherbase

Data Confidentiality (assuming all columns are strongly, i.e. non-deterministically, encrypted)

  • no metadata confidentiality (number of tables and columns, data lengths)
  • operational data confidentiality against strong adversary: operations over data leak some information
  • equivalent to an abstract system that uses an oracle to compute over strongly encrypted values
  • non-boolean operations such as addition: encrypted input & output
  • boolean operations such as equality check: plaintext True/False

121 of 272

Transaction Processing on Confidential Data using Cipherbase

relational algebra notations:

projection - subset of columns

restriction (selection) - subset of rows

security guarantee of range index is similar to that of order preserving encryption

122 of 272

Transaction Processing on Confidential Data using Cipherbase

Data Confidentiality comparing to prior work

CryptDB (PHE)

  • operational information leaked even to weak adversary
  • scope of information leakage is an entire column and not just the data touched during query

e.g. to evaluate

CryptDB has to store the entire column deterministically encrypted, which leaks equivalence relation & distribution when weak adversary obtains snapshot

On the other hand, with Cipherbase

a scan-based plan only reveals unknown_predicate(A) over R’s tuples (bc vals can be non-deterministically encrypted)

123 of 272

Transaction Processing on Confidential Data using Cipherbase

Architecture & Overview of query/transaction processing

database processing is about

  • data value semantics (expression evaluation, not easily done over encrypted data)
  • moving data
  • query setup/ result communication
  • concurrency control
  • recovery

124 of 272

Transaction Processing on Confidential Data using Cipherbase

Cipherbase client

  • presents a plaintext database interface
  • query encryption & result decryption
  • has encryption schema and key

125 of 272

Transaction Processing on Confidential Data using Cipherbase

e.g. to evaluate

needs a stack program that

inputs encrypted A, outputs plaintext True/False

invoked once for each tuple of R

this usage pattern inspires the design => registered once, invoked using handle (id) returned by register method

has a data cache but is stateless apart from that

Trusted Module

evaluate expressions over encrypted data during query processing

126 of 272

Transaction Processing on Confidential Data using Cipherbase

Key Management

keys communicated to trusted module using standard key-exchange techniques: secure hardware device manufacturer assigns a public key

Data Encryption

supports cell-level (row + column specification) encryption

=> minimizes TM traffic

=> incurs storage penalty (the ciphertext of a 4-byte integer is 12 bytes in AES + CTR mode)

127 of 272

Transaction Processing on Confidential Data using Cipherbase

Indexing

range index:

  • can be non-deterministically encrypted, but indexed keys in B-tree are ordered by their plaintext values
  • index building and lookups require routing to TM
  • TM stores pre-registered programs that return comparison result (<, =, >) in plaintext
  • B-tree stored in UM

=> which means weak adversary learns ordering but not equality relationship

=> strong adversary learns equality relationship of queried ranges

128 of 272

Transaction Processing on Confidential Data using Cipherbase

inserting 7 into a range index

129 of 272

Transaction Processing on Confidential Data using Cipherbase

Indexing

equality index:

  • index keys are deterministically encrypted first
  • then stored in range index
  • ordering is determined by ciphertext not plaintext
  • unclear about intention of design above

130 of 272

Transaction Processing on Confidential Data using Cipherbase

transaction processing

First, client issues query Q

during the prepare step:

  • client identifies TM programs required to evaluate expressions in Q
  • client registers program with TM, who in turn responds with an id that references said program
  • client gets response, rewrites Q to reference program registered using id

131 of 272

Transaction Processing on Confidential Data using Cipherbase

e.g.

user issues:

>> UPDATE Accounts

>> SET Balance = Balance + @Amt

>> WHERE Id = @Id

Cipherbase Client rewrites into:

>> UPDATE Accounts

>> SET Balance = TMEval(21, Balance, @Amt)

>> WHERE Id = @Id

TMEval: built-in function on UM (SQL server) added by Cipherbase to invoke stack programs on TM

After rewritten query submitted to UM, it prepares a query plan like the following:

  • assuming the Id column is indexed, identify record that matches filter
  • sends its balance and required params (encrypted) to program 21
  • program 21 returns encrypted sum
  • UM only thinks that it is replacing one binary value with another (encrypted data are aliased to binary data types)
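The register/handle/TMEval flow can be sketched as a toy in-process model. All names here are illustrative, and a single-byte XOR stands in for real encryption (Cipherbase uses AES, with the TM realized on an FPGA):

```python
# Toy model of the UM/TM split (illustration only, not Cipherbase's code).
SECRET = 0x5A  # TM-side key; one-byte XOR stands in for real encryption

def enc(x: int) -> int:
    return x ^ SECRET

def dec(x: int) -> int:
    return x ^ SECRET

class TrustedModule:
    def __init__(self):
        self._programs = {}

    def register(self, fn):
        """Register a stack program once; the UM refers to it by the handle."""
        handle = len(self._programs)
        self._programs[handle] = fn
        return handle

    def tm_eval(self, handle, *encrypted_args):
        # Decrypt inputs, run the registered program, re-encrypt the result.
        args = [dec(a) for a in encrypted_args]
        return enc(self._programs[handle](*args))

tm = TrustedModule()
add_id = tm.register(lambda a, b: a + b)   # e.g. the "Balance + @Amt" program

# The UM only ever sees opaque encrypted values:
balance, amt = enc(100), enc(25)
new_balance = tm.tm_eval(add_id, balance, amt)
print(dec(new_balance))  # -> 125, readable only where the key lives
```

From the UM's perspective, `tm_eval` just swaps one opaque binary value for another, which is exactly why the SQL engine needs no changes.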

132 of 272

Transaction Processing on Confidential Data using Cipherbase

PHE avoids round-trips to TM:

in the previous example, under a scan-based plan, and if Id column is deterministically encrypted, query does not have to involve TM

133 of 272

Transaction Processing on Confidential Data using Cipherbase

Trusted Module

  • in this paper, realized using FPGA, but does not have to
  • author discussed using Intel Software Guard Extensions (SGX), a set of instructions that allow creating a process within a protected address space called an enclave (where data and computation is shielded even from OS)
  • performance concerns regardless of the hardware chosen:
  • communication bandwidth (between UM and TM)
  • latency incurred during communication

134 of 272

Transaction Processing on Confidential Data using Cipherbase

System engineering and optimization details

Transaction processing challenges:

  • transactions incur large number of TM accesses
  • good thing is that these calls only transmit a small amount of data (e.g. encrypted addition involves 2 * 12 bytes input and 12 bytes output)
  • however, round-trips to TM incur latency:
  • PCIe transfer latency
  • TM has limited computational resource

135 of 272

Transaction Processing on Confidential Data using Cipherbase

  • latency in expression evaluation indirectly reduces throughput (transactions hold on to locks longer, which increases data contention)
  • TM is a shared resource, which means concurrency can be a problem
  • we can increase concurrency by batching communication (multiple work units in a single communication)
  • but batching exposes computational resource and bandwidth as bottlenecks because stack programs and data transfer are invoked at a higher rate

136 of 272

Transaction Processing on Confidential Data using Cipherbase

Optimizations

  • increase parallelism by having multiple FPGA stack machines => process n calls independently
  • as mentioned previously, batching
  • increases throughput: stack machines (SM) spend more time computing, free bus for other SMs
  • sometimes increases latency when other calls must wait for a longer-running work unit in the same batch to complete. Empirically, the authors argue that the benefits outweigh the potential drawbacks

137 of 272

Transaction Processing on Confidential Data using Cipherbase

  • expression folding (intra-transaction batching): reduce TMEval calls in the same query if possible => reduces TM workload, PCIe traffic, and latency (computation reuse)

138 of 272

Transaction Processing on Confidential Data using Cipherbase

user issues

>> UPDATE Inventory

>> SET item_cnt = item_cnt + @qty,

>>     total_cnt = total_cnt + @qty

>> WHERE storeId = @Id

Naive Cipherbase rewrite

>> UPDATE Inventory

>> SET item_cnt = TMEval(0, item_cnt, @qty),

>>     total_cnt = TMEval(1, total_cnt, @qty)

>> WHERE storeId = @Id

expression folding

>> UPDATE Inventory

>> SET @var = TMEval(2, item_cnt, total_cnt, @qty),

>>     item_cnt = UMExtract(@var, 0),

>>     total_cnt = UMExtract(@var, 1)

>> WHERE storeId = @Id

i.e. common subexpression elimination

139 of 272

Transaction Processing on Confidential Data using Cipherbase

  • index lookup vectorization: requires traversing the B-tree
  • naive: TM call for every comparison
  • optimized: in each call
  • pass the search key and the entire tree node (multiple keys in one node) to the TM
  • binary search the target key over the node (which is sorted by construction)
  • only keys touched by the binary search are decrypted; the search key is decrypted only once per node
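A sketch of the saving (a toy shift "encryption" and a decryption counter stand in for real TM round-trips; all numbers are illustrative): shipping the whole node and binary-searching inside the TM cuts decryptions from O(node size) to O(log node size).

```python
# Toy encrypted B-tree node: keys stored encrypted, ordered by plaintext.
SECRET = 1000
enc = lambda x: x + SECRET      # stand-in for real encryption (assumption)
dec = lambda x: x - SECRET

node_keys = [enc(k) for k in [10, 20, 30, 40, 50, 60, 70]]
target = 55

decryptions = 0
def tm_decrypt(c):
    """Each decryption models one unit of TM work."""
    global decryptions
    decryptions += 1
    return dec(c)

# Naive: one TM decryption per comparison -> O(len(node)).
decryptions = 0
pos_naive = sum(1 for c in node_keys if tm_decrypt(c) <= target)
naive_cost = decryptions

# Vectorized: ship the whole node in ONE call; binary-search inside the TM,
# decrypting only the keys the search touches -> O(log len(node)).
def tm_search(encrypted_node, key):
    lo, hi = 0, len(encrypted_node)
    while lo < hi:
        mid = (lo + hi) // 2
        if tm_decrypt(encrypted_node[mid]) <= key:
            lo = mid + 1
        else:
            hi = mid
    return lo

decryptions = 0
pos_vec = tm_search(node_keys, target)
vectorized_cost = decryptions

print(pos_naive, pos_vec, naive_cost, vectorized_cost)  # -> 5 5 7 3
```

Both strategies land on the same child pointer (position 5), but the vectorized lookup also collapses many PCIe round-trips into one, which the paper identifies as the dominant latency cost.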

140 of 272

Hitting Pause on Cipherbase

  • solutions we have explored during the past few weeks (CryptDB, TrustedDB, Cipherbase, etc.) were built on partially homomorphic encryption, which supports only one type of gate, typically + or *.
  • due to the limited set of available operations, we saw 2 common approaches:
  • trusted client: encrypted data on db server, delegate much of query processing to client who holds the secret key
  • secure location (trusted computing base): encrypted data on the main db, but the server also holds the key and performs complex operations on plaintext data inside secure hardware
  • question now is: does advancement in fully homomorphic encryption (FHE) render the above technologies obsolete?

141 of 272

Hitting Pause on Cipherbase

  • theoretical benefits of FHE
  • capable of evaluating arbitrary circuits, multiple gates, for unbounded depth (i.e. any function)
  • in other words, computation can be performed as normal on untrusted server database when data there are fully homomorphically encrypted.
  • requires only slight modification of the database: special addition, multiplication, etc.
  • in our meeting last week (11/30/2023), we looked at Microsoft SEAL, which promises to enable building “end-to-end encrypted data storage and computation services.” see their website below:

https://www.microsoft.com/en-us/research/project/microsoft-seal/

142 of 272

Hitting Pause on Cipherbase

however, what is SEAL actually capable of?

silent confession by Microsoft:

https://learn.microsoft.com/en-us/azure/architecture/solution-ideas/articles/homomorphic-encryption-seal

143 of 272

Hitting Pause on Cipherbase: Microsoft SEAL

  • allows + and * on encrypted integers or real numbers. Comparison and sorting are not supported.

=> this means a large portion of SQL queries is unachievable

  • supports 2 FHE schemes
  • BFV (2012): modular arithmetic, yields exact values
  • CKKS (2016): + and * on real or complex numbers, yields approximate results

this restriction is entirely contradictory to the idea of FHE

  • unrecommended to “encrypt entire large database” with SEAL, computations should be “fairly lightweight”

144 of 272

Hitting Pause on Cipherbase: Microsoft SEAL

  • homomorphic encryption schemes typically have a single secret key, which makes them unsuitable for use cases where multiple private data owners (each with their own secret key) want to perform collaborative computations

this is rather a problem common to Cipherbase and all the previous technologies discussed

145 of 272

Switching to Data Observatory

146 of 272

Clarifying Survey Direction

questions to address:

  • what to put on the big screen
  • how can this hardware contribute to advancing artificial intelligence
  • visualization at the core of interpretability and explainability
  • towards trustworthy AI, improves accountability
  • helps build better models
  • where do we come in and how could this project be related to data hub
  • visualization depends on data
  • connect to local data center? small, independent copy?
  • local data hub software/ service that communicates with MBZUAI data center (similar to the Internet of Water Data Hub design)?

147 of 272

Thinking Outside the Box: AI4VIS

https://arxiv.org/pdf/2102.01330.pdf

background

  • on the previous slide, we talked about visualizing artificial intelligence,

i.e. model => visualization

  • what if we reverse the arrow above?

i.e. visualization => model

  • traditionally, visualizations are discussed as something created for humans
  • increasingly, however, we see visualizations “created, shared, collected, and reused”
  • so, why not treat visualization like any other data format (i.e. text and image)?
  • however, “why should we teach machines to read charts made for humans”?

148 of 272

Thinking Outside the Box: AI4VIS

Why should MBZUAI care?

  • we have a gigantic screen, but it seems like we are not so sure about what to show on it
  • heavier utilization helps further knowledge, but generating visualizations is a time-consuming task
  • on the previous slide, we mentioned reuse
  • main reasons of applying artificial intelligence to visualization data discussed by this paper:
  • visualization generation
  • visualization enhancement
  • visualization analysis
  • this means, if we build the following cycle:

model => visualization => model

we are not only advancing AI in two directions but giving more content to show on our screen

149 of 272

Thinking Outside the Box: AI4VIS

credence: author affiliation

  • three at Hong Kong University of Science and Technology
  • another three at Microsoft Research Asia
  • Dominik Moritz: Carnegie Mellon University (HCI Professor) and Apple (Research Scientist)

150 of 272

Thinking Outside the Box: AI4VIS

definitions

  • visualization defined as graphical representation of data
  • main categorizations
  • graphics: appear most frequently, raster (bitmaps)/ vector (describes visual elements in shapes and styles)
  • programs: instructions to construct the visualization
  • hybrid: embed programs/meta information into graphics
  • representation
  • internal: programs contain design details but less often semantic information. system warrants simpler and more structured format (e.g. {“data”: “cars.csv”, “type”: “hist”, “x”: “mpg”, “y”: “hp”})
  • feature: input to ML models, extracted by feature engineering and feature learning

151 of 272

Thinking Outside the Box: AI4VIS

why apply AI to visualization data (mentioned previously but now in more detail)

  • generation
  • central research question in community
  • data-based (extensively studied. given data, show me sth), anchor-based (given anchor visualization, show me next), design-based (inject data into template design), and context-based (given contextual info such as natural language description, show me sth)
  • enhancement
  • retarget to different environment (e.g. fit image to cell phone display while keeping all the important elements)
  • annotate, add captions
  • question answering
  • interactions

152 of 272

Thinking Outside the Box: AI4VIS

  • analysis
  • recent research has focused on creating visualization databases
  • retrieval: help users find the examples they need (retrieve-then-adapt)
  • mine: understanding design patterns, population statistics (crawl on web to discover trends in visualizations)

153 of 272

Thinking Outside the Box: AI4VIS

how is the research community applying AI to visualizations?

  • transformation
  • program <=> graphics (reverse engineering is difficult and expensive)
  • decomposing: detect text, shapes, chart, element clustering (semantic groups e.g. axes, legends, annotations)
  • composing: figure out coordinates on scatter plot based on information obtained from decomposition
  • assessment
  • rank quality of visualizations
  • related to enhancement bc scoring metrics can be used as cost functions
  • challenging bc humans define when visualizations are “good” => requires large-scale empirical assessment

154 of 272

Thinking Outside the Box: AI4VIS

  • comparison
  • how “close” are the given visualizations?
  • can also be used as cost functions to build recommendation and query systems
  • difference: GraphScape (UDub) proposes a directed graph model. Each edge represents an edit and each vertex denotes a resulting visualization. Edge weights/costs are determined by human judgement
  • distance: project to feature vector, apply distance function
  • querying
  • how to specify user needs: keywords, natural language, structural queries, example
  • how to match needs: exact v. best match

155 of 272

Thinking Outside the Box: AI4VIS

  • reasoning
  • interpret visualizations to derive insights beyond data encodings such as semantic information
  • visual perceptual learning (e.g. neural network to predict human-perceived importance of visualization images)
  • chart summarization: natural language description/ caption on how to interpret the chart (e.g. shows average precipitation in Taipei over 2023)
  • visual question answering: traditionally, transform visualizations into tables and translate questions into queries
  • recommendation
  • seeDB recommends visualization by deviation with anchor
  • map data with encodings to recommend styling

156 of 272

Thinking Outside the Box: AI4VIS

  • mining
  • design patterns: should motivate design ideas instead of just showing simple statistics with little meaning
  • data patterns: respective to type of charts

157 of 272

Thinking Outside the Box: AI4VIS

future research opportunities

  • interoperability: how to combine visualizations outputted from different systems together
  • CV models for natural images do not work well for visualizations bc small errors could greatly distort meaning of visualization (e.g. misinterpreting the axes)
  • augmenting ML models with results from empirical studies
  • collecting data on how humans perceive, assess, and use visualization data on scale
  • striking better balance between human v. machine friendly visualization

158 of 272

AI4VIS: implementation example from Data2VIS => LIDA

159 of 272

Data2Vis => LIDA

160 of 272

Data2Vis => LIDA

background

  • main theme of these two papers: data driven visualization generation approach (in AI4VIS paper’s terminology)
  • relevance between the two

Data2Vis (2019, submitted on arXiv in 2018) author:

Victor Dibia when at IBM Research, Çağatay Demiralp at MIT CSAIL

LIDA (

submitted on arXiv in March, 2023,

open-sourced on GitHub only in August this year,

claimed by Victor as an update of Data2Vis

) author:

Victor Dibia at Microsoft Research

161 of 272

Data2Vis => LIDA

summary - Data2Vis

  • formulation of visualization generation as a language translation problem (i.e. mapping data to declarative visualization program languages such as Vega-Lite)
  • model: bidirectional recurrent neural network with Long-Short Term Memory cells under encoder-decoder architecture with attention mechanism
  • results: difficult to quantify. Yet, qualitatively, the authors observed that model learns Vega-Lite syntax, appropriate transformation on different fields, data selection patterns in creating bivariate plots (e.g. group by age/ sex)
  • limitations
  • small training data: 4300 Vega-Lite visualizations from 11 datasets
  • model sometimes (in 15-20% of tests) selects fields that do not exist
  • model sometimes selects fields that have little information value

162 of 272

Data2Vis => LIDA

Visualizations created by Data2Vis

163 of 272

Data2Vis => LIDA

summary - LIDA

  • tool for generating grammar-agnostic visualization and infographics
  • solved by LLMs and IGMs (image generation models)
  • visualization generation as a multi-stage generation problem
  • summarizer:

converts data into a semantically rich natural language description; LLMs suffer from hallucination, which sometimes leads them to produce output not grounded in the data, and the summary provides grounding context for generating visualizations

  • goal explorer: generates a set of visualization goals

question (hypothesis): what are we trying to understand

visualization: e.g. histogram of miles per gallon

rationale: why should we ask the question above?

164 of 272

Data2Vis => LIDA

  • visGenerator

generate, evaluate, repair, and execute visualization code

code scaffold constructor: imports dependent libraries and creates function stubs

code generator & executor: fill in TODOs, get visualization

  • infographer: style generated visualization

165 of 272

Data2Vis => LIDA

  • results evaluation
  • visualization error rate: percentage of generated visualizations that do not compile
  • asks GPT-4 to score output based on empirically agreed metrics

166 of 272

Formalizing Visualization Design Knowledge as Constraints: Draco

167 of 272

Visualization Design Knowledge as Constraints: Draco

  • visualization research communities often study how people decode and interpret visualizations. However, their findings are slow to be adopted by practical tools
  • automated design systems do not reuse knowledge built up by their predecessors
  • warrants a medium for representing and acting upon visualization design knowledge
  • Draco is a formal model that represents
  • visualizations as sets of logical facts
  • design guidelines as hard/soft constraints over these facts

168 of 272

Visualization Design Knowledge as Constraints: Draco

  • constraint programming:
  • a constraint program (declarative) is a set of constraints defining relations among several variables that must/should be satisfied by its solutions
  • constraints restrict values variables can take (represent partial information)
  • solutions are computed by (off-the-shelf) constraint solvers via inference and search over the constraint space
  • Draco models visualization knowledge using Answer Set Programming and Clingo as answer set solver

169 of 272

Visualization Design Knowledge as Constraints: Draco

taste of Answer Set Programming

A :- L_1, …, L_n => a rule

A => the head, an atom (a proposition that may be true or false)

L_i’s => body literals; A is derived only if all of them hold

bodiless rules encode facts, whereas headless rules impose integrity constraints (their bodies must never all be satisfied)

e.g.

light_on :- power_on, not broken.

170 of 272

Visualization Design Knowledge as Constraints: Draco

integrity constraint defines how attributes should interact with each other:

:- mark(bar), channel(E, y), continuous(E), not zero(E)

rules out vertical bar charts that do not use 0 as a baseline

set of aggregate rules: specifies attributes of a visualization (e.g. mark, encoding)
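The integrity constraint above can be checked over a fact set directly. A minimal Python sketch of its semantics (Draco itself hands such rules to the Clingo solver; the fact encoding here is our illustration):

```python
# A visualization as a set of logical facts; the bar-chart baseline
# constraint ":- mark(bar), channel(E, y), continuous(E), not zero(E)"
# fires when every positive literal holds and zero(E) is absent.
def violates_bar_baseline(facts):
    for fact in facts:
        if fact[0] == "channel" and len(fact) == 3 and fact[2] == "y":
            e = fact[1]  # the encoding bound to the y channel
            if (("mark", "bar") in facts
                    and ("continuous", e) in facts
                    and ("zero", e) not in facts):
                return True  # constraint violated: this design is ruled out
    return False

bad = {("mark", "bar"), ("channel", "e1", "y"), ("continuous", "e1")}
good = bad | {("zero", "e1")}
print(violates_bar_baseline(bad), violates_bar_baseline(good))  # -> True False
```

In Clingo the same check falls out of the solver for free; the point of the encoding is that adding one fact, zero(e1), flips the design from inadmissible to admissible.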

171 of 272

Visualization Design Knowledge as Constraints: Draco

  • primary use case:

finding optimal completion of

partially specified

visualization

  • last slide covered the “search space definition section”
  • preference model leads us to soft constraints

172 of 272

Visualization Design Knowledge as Constraints: Draco

  • preference model (soft constraints) learned from data and forms a Markov logic network where constraints are associated with learnable weights indicating preference levels

e.g.

:~ continuous(E), not zero(E). [5]

this says that model prefers to include 0 for continuous fields and that violating this constraint incurs a cost of 5

=>

Cost(v) = sum_i w_p_i * n_p_i(v)

where n_p_i denotes the number of violations of soft constraint p_i and w_p_i its learned weight

173 of 272

Visualization Design Knowledge as Constraints: Draco

  • model trained using RankSVM (learning to rank approach)
  • dataset: labeled visualization pairs

using the cost function defined on the last page,

(v1, v2, y = sign(Cost(v1) - Cost(v2)))

=> y = -1 if v1 preferred over v2

=> let

x be vector of [n_p_0(v), …, n_p_k(v)],

corresponding weight vector be w

=> Cost(v1) - Cost(v2) = w * x_1 - w * x_2 = w * (x_1 - x_2)

a linear model with ridge (L2) regularization, trained by minimizing the hinge loss (as in RankSVM)
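A minimal sketch of this pairwise training setup (plain subgradient descent stands in for an off-the-shelf RankSVM solver; the data and hyperparameters are illustrative, not Draco's):

```python
# Each training pair yields a feature difference d = x1 - x2 and a label
# y = -1 when v1 is preferred (should have LOWER cost). Learn w minimizing
# hinge(y * w.d) plus a ridge penalty on w.
def train_rank_svm(pairs, dim, lr=0.1, reg=0.01, epochs=200):
    w = [0.0] * dim
    for _ in range(epochs):
        for x1, x2, y in pairs:
            d = [a - b for a, b in zip(x1, x2)]
            margin = y * sum(wi * di for wi, di in zip(w, d))
            for i in range(dim):
                grad = reg * w[i]          # ridge term
                if margin < 1:             # hinge subgradient when margin violated
                    grad -= y * d[i]
                w[i] -= lr * grad
    return w

# Two soft constraints; v_bad violates constraint 0 twice, v_good never.
v_good, v_bad = [0, 1], [2, 1]
pairs = [(v_good, v_bad, -1)]              # v_good preferred -> y = -1
w = train_rank_svm(pairs, dim=2)

cost = lambda v: sum(wi * ni for wi, ni in zip(w, v))
print(cost(v_good) < cost(v_bad))  # -> True: learned weights rank as labeled
```

Because Cost is linear in the violation counts, the pairwise objective reduces to a standard linear classification problem on the difference vectors, which is why an off-the-shelf RankSVM suffices.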

174 of 272

Visualization Design Knowledge as Constraints: Draco

Draco 1 (2018, from UDub Interactive Data Lab)

https://idl.cs.washington.edu/files/2019-Draco-InfoVis.pdf

Draco 2 (2023, from UDub, University of Vienna, University of Maryland, CMU)

https://arxiv.org/pdf/2308.14247.pdf

175 of 272

Experiments with Mac Studio

176 of 272

Last Week: exploring MLX on Mac Studio

177 of 272

Last Week: exploring MLX on Mac Studio

  • ran Mistral-7B, llama2-7B, Llama2-7B-chat successfully with MLX (weights need to be specially converted to fit into the framework)
  • llama2-70B returned weird output like the following “ensoensoensoenso…” for any attempted prompt
  • we suspected that something might have gone wrong during the weight conversion step (hardware resources should be sufficient)
  • upon further inspection

178 of 272

Last Week: llama 2 70B with MLX

179 of 272

Last Week: llama 2 70B with MLX

llama 2 -> MLX weights conversion example source code by Apple

  • only shard weights seem to be hardcoded
  • thoughts?

180 of 272

Llama 2 with Hugging Face: quickstart module

allows running inference on

  • self converted models stored locally
  • readily converted models stored remotely here:

https://huggingface.co/meta-llama

either approach requires submitting a request to Meta and/or Hugging Face

181 of 272

Llama 2 with Hugging Face: quickstart module

  • Why Hugging Face?
  • Many blogs online try running llama-2 locally using

https://github.com/ggerganov/llama.cpp

, which is mainly designed for quantized models (requires less computational and storage capacity)

  • I doubt we are necessarily in a resource scarce environment, why not shoot for the stars?

182 of 272

Llama 2 with Hugging Face: 7B, sample prompt 1

183 of 272

Llama 2 with Hugging Face: 7B, sample prompt 2

184 of 272

Llama 2 with Hugging Face: 13B, sample prompt 1

185 of 272

Llama 2 with Hugging Face: 13B, sample prompt 2

186 of 272

Llama 2 with Hugging Face: 70B, sample prompt 1

187 of 272

Llama 2 with Hugging Face: 70B, sample prompt 2

188 of 272

Llama 2 with llama.cpp

why are we back?

  • as mentioned in the link below, Hugging Face’s benchmarking tools are deprecated; they recommend using external tools

https://huggingface.co/docs/transformers/benchmarks

  • thought about adding timers here and there in the little module I wrote (which runs llama 2 models with the HF Transformers library) to measure performance myself
  • but, in my opinion, these libraries abstract away so many details that it is hard to know what you are actually measuring
  • there is an example from AWS where they benchmarked llama 2 with the HF Transformers library on EC2 instances, but judging from their source code, most latency measurements were done in CloudWatch, a black box that is impossible to transfer knowledge from

189 of 272

Llama 2 with llama.cpp: 7B

  • -p: prompt (following llama.cpp main author’s example, shouldn’t matter much)

“Input length is not significant for performance but important for hardware requirements”

source:

https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices

  • -n: num tokens to predict (reflected in sample runs)
  • -ngl: number of layers to offload to GPU; any nonzero value enables Metal processing (Apple’s GPU compute API, roughly its counterpart to CUDA)
  • batch size defaults to 512 (tokens)

190 of 272

Llama 2 with llama.cpp: interpreting console logs

explanation by project collaborator

191 of 272

Llama 2 with llama.cpp: 13B

192 of 272

Llama 2 with llama.cpp: 70B

193 of 272

Llama 2 with llama.cpp: comparison with other experiments

  • y axis: time spent on generating output (i.e. eval time in llama.cpp console log)
  • Model: llama-2-7b-chat.Q4_0

=> quantized to 4 bits

(originally 16)

source:

https://towardsdatascience.com/apple-m3-machine-learning-speed-test-f1346e23a1b2

194 of 272

Llama 2 with llama.cpp: comparison with other experiments

source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF

  • unverified: many are claiming online that M3 Max == M2 Ultra
  • from the bar chart on the previous slide, the M3 Max yielded roughly 47 t/s eval time on the 7B model quantized to 4 bits
  • on the other hand, we got, on our M2 Ultra, 39.72 t/s on the original 7B model (unquantized, 16 bits)

195 of 272

Llama 2 with llama.cpp: comparison with other experiments

  • from llama.cpp authors: typical run using M2 Ultra (13B, quantized to 4 bits)
  • our results: 51.45 t/s on prompt eval, 22.41 t/s on eval time

source:

https://github.com/ggerganov/llama.cpp

196 of 272

Llama 2 with llama.cpp: comparison with other experiments

  • hardware: M2 Max Mac Studio, 96 GB RAM

source:

https://medium.com/@andreask_75652/benchmarking-apples-mlx-vs-llama-cpp-bbbebdc18416

197 of 272

Llama 2 with llama.cpp: comparison with other experiments

  • what I worry about:

“Overall latency scales sub-linearly with model size” - from the databricks blog

mentioned in slide 189

  • from what we are seeing:

70 / 13 ≈ 5.38 (parameter-count ratio)

209.19 (70B eval ms/token) / 44.62 (13B eval ms/token) ≈ 4.69
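Spelling out the arithmetic (ms/token here is just 1000 divided by the t/s numbers from our llama.cpp runs):

```python
# The scaling check: compare how much bigger the model got vs how much
# slower eval became, per the Databricks blog's "sub-linear" claim.

size_ratio = 70 / 13              # parameter-count ratio, 70B vs 13B
latency_ratio = 209.19 / 44.62    # eval ms/token ratio, 70B vs 13B

print(round(size_ratio, 2))       # 5.38
print(round(latency_ratio, 2))    # 4.69
# sub-linear scaling means latency should grow slower than model size
print(latency_ratio < size_ratio)
```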

198 of 272

Question from last week: suspicious prompt eval speed

my local replication:

199 of 272

Question from last week: suspicious prompt eval speed

screenshot from slide 196:

  • I overlooked that these are results from llama-2-7b
  • the experiments I ran last week were using llama-2-7b-chat (fine-tuned, yielded 83.42 t/s prompt eval time & 39.72 t/s eval time)
  • question to ponder: does fine-tuning have a direct effect on prompt evaluation time, or is something else at play?

200 of 272

Optimizing LLM Inference: batching and vLLM

why batch?

  • trade-off latency for throughput, i.e. process multiple sequences together but at lower speed
  • optimize memory bandwidth usage: with batch size one, model parameters need to be loaded for each new input sequence => load once and process many

201 of 272

Optimizing LLM Inference: batching and vLLM

naive/ static batching

  • used by HuggingFace’s Pipelines (their easy-to-use API that supports most models), NVIDIA’s FasterTransformer
  • batched sequences each generate 1 new token per iteration
  • here, GPU underutilization happens because of variation in sequence generation length

202 of 272

Optimizing LLM Inference: batching and vLLM

continuous batching

  • once a sequence finishes in a batch, a new one takes its place
  • challenging, given that the prefill/ prompt evaluation stage has a different computation pattern than token generation => wants a good ratio of

sequences awaiting prefill / sequences still generating
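The scheduling difference can be illustrated with a toy simulation (an idealized sketch of the idea, not how any particular engine schedules):

```python
# Toy comparison of static vs continuous batching. Each sequence needs
# `length` decode steps; the GPU runs `batch_size` slots per step.

def static_batching_steps(lengths, batch_size):
    # fixed batches: a batch occupies the GPU until its LONGEST
    # sequence finishes, so short sequences waste their slots
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    # as soon as a sequence finishes, a waiting one takes its slot
    pending = list(lengths)
    slots = []
    steps = 0
    while pending or slots:
        while pending and len(slots) < batch_size:
            slots.append(pending.pop(0))
        steps += 1
        slots = [r - 1 for r in slots if r - 1 > 0]
    return steps

lengths = [1, 8, 2, 7, 3, 6, 4, 5]
print(static_batching_steps(lengths, batch_size=4))      # 14
print(continuous_batching_steps(lengths, batch_size=4))  # 12
```

The gap widens as generation lengths get more variable, which is exactly the underutilization the previous slide describes.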

203 of 272

Optimizing LLM Inference: batching and vLLM

vLLM:

  • created by a UC Berkeley team (Prof. Gonzalez, Prof. Stoica, …)
  • open-source, can only be run on Linux
  • “delivers up to 24x higher throughput than [HuggingFace’s Pipelines], 3.5x improvement over HuggingFace Text Generation Inference,” previous SOTA that powers live inference

204 of 272

Optimizing LLM Inference: batching and vLLM

core problem addressed:

inefficient memory management of the KV cache

=> the kv pairs of all previous tokens are required to generate the next one

=> its size is large: up to 1.7 GB for a single sequence in llama-13B

=> and dynamic: it depends on highly variable sequence length, which leads to waste from fragmentation and over-reservation
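A back-of-envelope check of the 1.7 GB figure, assuming Llama-13B's published shape (40 layers, hidden size 5120, fp16) and its 2048-token max context:

```python
# K and V each store one hidden-size vector per token per layer, at
# 2 bytes per element in fp16.

def kv_cache_bytes(num_layers, hidden_size, seq_len, bytes_per_elem=2):
    return 2 * num_layers * hidden_size * bytes_per_elem * seq_len  # 2 = K and V

size_gb = kv_cache_bytes(num_layers=40, hidden_size=5120, seq_len=2048) / 1e9
print(round(size_gb, 2))  # 1.68, matching the "up to 1.7 GB" claim
```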

devised solution:

PagedAttention

=> allows the kv cache to be stored non-contiguously in memory by partitioning the kv pairs of each sequence into blocks, each dedicated to a fixed number of tokens

205 of 272

Optimizing LLM Inference: batching and vLLM

=> memory wastage only occurs at the last block

=> enables efficient memory sharing by mapping multiple logical blocks to the same physical block, which comes in handy when we need to generate multiple outputs from the same sequence
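The block-table bookkeeping can be sketched as a toy allocator (block size and pool are illustrative, not vLLM's actual values):

```python
# Each sequence's KV pairs are partitioned into fixed-size blocks that
# can live anywhere in physical memory; only the LAST block can hold
# unused slots.

BLOCK_SIZE = 16  # tokens per block (illustrative)

class BlockTable:
    def __init__(self, free_blocks):
        self.free_blocks = free_blocks
        self.blocks = []      # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:   # last block full, or no block yet
            self.blocks.append(self.free_blocks.pop())  # any free physical block
        self.num_tokens += 1

    def wasted_slots(self):
        return len(self.blocks) * BLOCK_SIZE - self.num_tokens

pool = list(range(100))         # physical block pool
seq = BlockTable(pool)
for _ in range(37):             # a 37-token sequence
    seq.append_token()

print(len(seq.blocks))     # 3 blocks: 16 + 16 + 5 tokens
print(seq.wasted_slots())  # 11 unused slots, all in the last block
```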

206 of 272

Optimizing LLM Inference: batching and vLLM

=> baseline: HuggingFace’s Pipelines library, which implements static batching

=> generation length sampled from exponential dist with mean = 128 tokens

=> FasterTransformer (optimized model implementation) actually does pretty well

=> continuous batching clearly yields strong improvement over static, and so does vLLM over the other three on the graph

207 of 272

Optimizing LLM Inference: batching and vLLM

QPS: queries per second, the expected rate of the Poisson distribution that request arrivals are sampled from

208 of 272

Optimizing LLM Inference: batching and vLLM

takeaway from the two graphs:

  • when batching is employed, the continuous scheme alleviates the throughput latency tradeoff
  • as system becomes saturated, continuous batching’s advantage diminishes (harder to inject new sequences?)

209 of 272

MLPerf: trouble we are running in

how do we run it?

  • hardware vendors usually do this to build trust with customers
  • reference implementation:
    • defines representative tasks & workloads (models)
    • specifies model requirements (e.g. pretrained weights to use), validation datasets, performance & accuracy metrics
  • hardware vendors create their own implementations based on ref that
    • use the MLPerf LoadGen API to issue inference queries
    • use their own hardware-specific API to process said queries
  • for us to run the benchmark, we need
    • hardware vendor implementations for the desired task/workload
    • compatible hardware

210 of 272

MLPerf: trouble we are running in

last time, we looked at MLPerf Training results on “GPT-3”

  • but how? GPT-3, unlike llama, is proprietary
  • real model under test: Megatron-LM’s GPT-3
    • by the applied deep learning research team at NVIDIA (which researches training LLMs at scale)
    • largely follows OpenAI’s paper, but uses a different tokenizer, attention layer, and model parameters
  • current (v3.1) submitted implementations are for H100, Intel Gaudi2, and Google’s TPU-v5e

211 of 272

MLPerf: comforting news

CTuning (MLCommons founding member, https://ctuning.org/)

  • created the MLCommons Collective Mind framework
  • main objective: make it easier to run MLPerf benchmarks across different hardware
  • keynote introduced June 2023
  • typical usage: either with python virtual environment or docker

212 of 272

MLPerf: comforting news

running BERT-large on Mac Studio M2 Ultra

213 of 272

MLPerf: comforting news

comparison with results published by MLCommons (MLPerf Inference: Edge)

https://mlcommons.org/benchmarks/inference-edge/

  • upon closer review, some hyperparameters seem different
  • may require tuning to optimize resource utilization
  • need tools to measure resource utilization to determine potential on pushing inference speed

system               framework     offline (sample/s)   single stream (latency in ms)
Macbook Pro M1       onnx runtime  1.69                 597.08
Mac Studio M2 Ultra  onnx runtime  6.48                 580.58

214 of 272

MLPerf … on Mac?

215 of 272

MLPerf … on Mac?

216 of 272

MLPerf … on Mac?

217 of 272

MLPerf … on Mac?

218 of 272

MLPerf … on Mac?

219 of 272

MLPerf … on Mac?

main obstacle

  • we need a reliable llama 2 model implementation using mlx framework
  • we did not have one when we first surveyed MLX

https://github.com/ml-explore/mlx-examples/tree/main/llms/llama

7b & 13b worked, 70b returns weird output

  • mlx_lm library released for easy integration with HF models

https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm

  • seems to work, but incredibly slow?

https://github.com/ml-explore/mlx-examples/issues/437

220 of 272

MLPerf … on Mac?

discussion with Awni Hannun

  • top contributor to mlx
  • research scientist at Apple
  • Stanford PhD advised by Andrew Ng (himself advised by Michael Jordan)

221 of 272

MLPerf … on Mac?

model: llama-2-7b-chat

                 prompt evaluation (t/s)   token generation (t/s)
llama.cpp        83.42                     39.72
mlx              25.455                    29.416
llama.cpp / mlx  3.27                      1.35

222 of 272

MLPerf … on Mac?

model: llama-2-13b-chat

                 prompt evaluation (t/s)   token generation (t/s)
llama.cpp        51.45                     22.41
mlx              14.164                    18.662
llama.cpp / mlx  3.63                      1.2

223 of 272

MLPerf … on Mac?

point being …

model: llama-2-70b-chat

                 prompt evaluation (t/s)   token generation (t/s)
llama.cpp        11.87                     4.78
mlx              1.063                     0.201
llama.cpp / mlx  11.16                     23.78

will continue investigating

224 of 272

MLPerf … on Mac?

225 of 272

mlx v. llama.cpp

performance anomaly?

  • mlx token generation almost as fast as llama.cpp at first run after reboot
  • performance degradation (seemingly exponential) in the following runs
  • llama.cpp unaffected by mlx

226 of 272

mlx v. llama.cpp

performance anomaly?

  • mlx prompt evaluation performance surpassing that of llama.cpp’s
  • less severe performance degradation
  • should rerun experiment with larger input (currently seeing large variations)

227 of 272

MLX GPU profile

228 of 272

Llama.cpp GPU profile

229 of 272

Llama.cpp parallelization

230 of 272

MLX representative workload: on reboot perf anomaly

231 of 272

MLX representative workload: on perf with diff loading strategy

232 of 272

MLX v. Metal: on perf with diff loading strategy

233 of 272

experiments below were conducted in environment:

mlx 0.6.0

https://github.com/MBZUAI-Research-Office/research_muchi

commit: fd2376d405b8fe786fe2851b828e435f40421edb

234 of 272

Almost root to the issue?

How weights loading strategy affects application performance

batch

sep

upon closer examination, matmul performance seems identical across loading strategies

235 of 272

Almost root to the issue?

batch

sep

zooming out, we notice that the spaces between computations are caused by driver (wire memory) activity making sure that there is physical memory backing the GPU resources consumed in Metal code

(where could 256 MiB come from? maybe the size of the weights per layer: 8192 * 8192 * 4 / (1024 ** 2) = 256 MiB)

https://developer.apple.com/forums/thread/653038

https://developer.apple.com/library/archive/documentation/DeviceDrivers/Conceptual/IOKitFundamentals/DataMgmt/DataMgmt.html

236 of 272

Almost root to the issue?

well … how does that space/ duration caused by driver activity changes upon reboot?

for the first 2 seconds (after the computation starts), there was still plenty of driver (wire memory) activity

237 of 272

Almost root to the issue?

well … how does that space/ duration caused by driver activity changes upon reboot?

after that, however, driver (wire memory) activity ceased => lowering the average and thus making the first run post reboot seem much faster than subsequent runs

the space between computations diminishes along with the decrease in driver (wire memory) activity!

238 of 272

Almost root to the issue?

2nd run post reboot

3rd run post reboot

What does this mean?

The performance degradation we observed with the sep loading strategy can also be accounted for by this driver (wire memory) activity

239 of 272

Sidenote: how does matmul in mlx work under the hood?

240 of 272

DBRX: distributed inference with M2 Ultras

241 of 272

What is DBRX?

  • transformer-based decoder-only LLM (like Llama-2 & GPT-4) that was trained using next-token prediction
  • fine-grained mixture-of-experts (MoE) architecture with 132B total parameters of which 36B parameters are active on any input (more on this later)
  • other prominent MoE models: Mixtral 8x7B, Grok-1
  • uses GPT-4 tokenizer, rotary position encodings, gated linear units, and grouped query attention (similar to Llama-2)
  • pretrained on 12T tokens (carefully curated, which substantially helped improve model quality) with max context length 32768 tokens (synonymous with max sequence len)
  • trained on 3072 NVIDIA H100s connected by 3.2Tbps Infiniband

242 of 272

DBRX: inference efficiency

setting:

  • NVIDIA TensorRT-LLM (open-source library, also used in their MLPerf implementation) at 16-bit precision
  • input prompt: 2000 tokens, output: 256 tokens (per response)
  • No mention of hardware specs?
  • seems faster than model-size would suggest due to its MoE architecture (more on this later!)
  • Mixtral has higher t/s because its model is smaller

243 of 272

What is MoE?

TL;DR

MoE models

  • are pre-trained much faster & have faster inference compared to dense models (where all the parameters are used for all the inputs) with the same number of params
  • still, require high VRAM as all experts are loaded in memory
  • face many challenges in fine-tuning such as overfitting, but recent work with MoE instruction-tuning is promising

244 of 272

What is MoE?

dense: refers to the feed-forward layer, which is really just linear combination followed by activation (in Llama-2: multi-layer perceptron)

traditional transformer encoder block

245 of 272

What is MoE?

traditional transformer

transformer with MoE

vs

the traditional/ dense feed-forward network (FFN) layer is replaced by a

sparse MoE layer that operates independently on the tokens in the sequence

- the router/ gate network decides which expert each token goes to

- output of MoE layer = router gate value * output of selected expert

246 of 272

What is MoE?

what does this remind you of?

  • ensemble methods (e.g. random forest)
  • MoE layer is a system composed of separate networks/ experts that each specializes in a different region of the input space

in addition,

  • each expert is usually a FFN
  • we can send a token to more than one expert
  • Mixtral 8x7B and Grok-1 have 8 experts and choose 2
  • DBRX has 16 smaller experts and chooses 4 (which is what they mean by fine-grained) => allows 65x more combinations
  • the router is composed of learned parameters pre-trained at the same time as the rest of the network
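The routing step above can be sketched in plain Python (dimensions and weights are random stand-ins for illustration, not DBRX's):

```python
import math
import random

# Toy top-k MoE layer: a router scores the experts, the top-k run,
# and the output is the gate-weighted sum of their outputs.

random.seed(0)
d_model, num_experts, top_k = 8, 16, 4

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

# router + experts are learned parameters, pre-trained with the rest
# of the network; random stand-ins here
router_w = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(num_experts)]
experts = [[[random.gauss(0, 1) for _ in range(d_model)] for _ in range(d_model)]
           for _ in range(num_experts)]

def moe_layer(token):
    gates = softmax(matvec(router_w, token))                          # router scores
    chosen = sorted(range(num_experts), key=lambda e: gates[e])[-top_k:]
    out = [0.0] * d_model
    for e in chosen:  # output += router gate value * output of selected expert
        y = matvec(experts[e], token)
        out = [o + gates[e] * y_i for o, y_i in zip(out, y)]
    return out, chosen

out, chosen = moe_layer([random.gauss(0, 1) for _ in range(d_model)])
print(sorted(chosen))

# "fine-grained": 16-choose-4 offers 65x more combinations than 8-choose-2
print(math.comb(16, 4) // math.comb(8, 2))  # 65
```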

247 of 272

So, how do we run distributed inference with DBRX?

example: transformer encoder block from Google’s GShard

- attention weights are replicated across devices

- each device is responsible for one FFN/ expert (model parallelism: the model is partitioned across nodes)

248 of 272

Requirements to run DBRX

132B params, each param being a bfloat16

=> 132 * (10 ^ 9) * 2 (bytes) = 264 * (10 ^ 9) bytes = 264 GB

=> however, DBRX’s github repo README recommends having at least 320 GB of memory

right now, model implementations are available in

- TensorRT-LLM (requires CUDA)

- vLLM (requires CUDA)

- MLX (requires Apple Silicon, focus, commitment, and sheer will)

interestingly, HuggingFace’s transformers library integrates an experimental project called PiPPy that aims to enable easy pipeline parallelism (where each node gets a portion of the model’s layers) for PyTorch => solves the memory problem, but with seemingly little performance benefit

https://github.com/pytorch/PiPPy/
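The weight-memory arithmetic above, as a tiny helper (bfloat16 = 2 bytes per parameter):

```python
# Raw weight memory = parameter count x bytes per parameter; the repo's
# 320 GB recommendation presumably adds headroom for KV cache,
# activations, and runtime overhead.

def model_memory_gb(num_params, bytes_per_param=2):
    return num_params * bytes_per_param / 1e9

weights_gb = model_memory_gb(132e9)
print(weights_gb)        # 264.0
print(weights_gb > 192)  # True: does not fit on a single 192 GB M2 Ultra
```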

249 of 272

Potential Strategy with M2 Ultras

  • connect M2 Ultras with thunderbolt (should provide up to 40 Gbit per sec throughput)
  • communication between nodes via IP
  • each Mac Studio has 6 thunderbolt ports, which should enable up to a fully-connected cluster of 7 nodes
  • mimic GShard architecture by putting each shard in a container
  • communication between containers orchestrated by kubernetes

250 of 272

M2 Ultra Cluster Communication latency

recap:

each of our Mac Studio M2 Ultras has 192GB of memory, but DBRX inference requires 264GB (Databricks even recommends having at least 320GB of memory)

our goal is to run distributed inference on a cluster (2 - 8 nodes), which should provide us more than enough computation & memory resources. However, we need to make sure that communication between nodes does not become a bottleneck

251 of 272

what information is being exchanged?

DBRX has …

  • 40 layers (transformer decoder blocks)
  • the input/output to each layer during inference is a 1x6144 float16 vector
  • one may ask … where is the autoregressive-ness? => hidden in the KV cache
  • under expert parallelism, 2 * 4 * 2 * 6144 / 1024 = 96 KiB of data is exchanged per layer

[diagram: 40 transformer decoder blocks L_0 … L_39; within a layer, the attention (ATT) output A (1x6144, float16) goes through the router, which chooses 4 of the 16 experts (FFN_0 … FFN_15, e.g. experts 0, 4, 7, 15), dispatches the activation B (1x6144) to each, and aggregates (+) the weighted expert outputs into C (1x6144)]

252 of 272

M2 Ultra Cluster Communication latency

4 nodes (Thunderbolt with star topology, Ethernet with switch)

Protocol: gRPC

(avg latency includes serialization and deserialization time)

253 of 272

DBRX: distributed inference with M2 Ultras

recap

  • running DBRX warrants at least 264GB of memory, yet a single M2 Ultra is only equipped with 192GB of memory
  • we cannot run it on one node, but any cluster of size >= 2 would do
  • we did a small feasibility study to ensure cluster communication overhead would not place an unreasonable performance cap
  • running DBRX inference in a distributed setting requires model rewrite

254 of 272

DBRX: distributed inference with M2 Ultras

our progress so far

  • finished rewriting DBRX’s MLX implementation to enable expert parallelism on a M2 Ultra cluster of >= 2 nodes
  • observed sub-par performance of
  • 2 t/s for prompt evaluation
  • 0.5 t/s for token generation
  • profiled performance with instruments to locate bottlenecks

where did we get stuck? (like for quite a while)

  • pre-processing weights with MLX leads to unexpected data corruption: mx.split() followed by mx.savez()
  • therefore, we switched to PyTorch, since it supports both safetensors and bfloat16. This implementation, however, is slower because everything runs on the CPU (currently, the MPS backend does not support bfloat16)

255 of 272

DBRX: distributed architecture design

256 of 272

DBRX: distributed inference with M2 Ultras

0.5 t/s performance breakdown

  • each token takes 2 seconds to process
  • DBRX has 40 layers, which means each layer takes, on average, ~50 ms
  • in each layer,
    • attention + router + weighted_sum usually takes ~7 ms (including driver activity that checks for wired memory)
    • expert calculation + communication (~30 ms), and
    • waits/stalls ~13 ms (needs further verification)
  • on average, 2 experts are assigned to each moe_shard per layer

[performance view from Instruments: one layer, annotated with spans for expert calc + comm and wait/stall]
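A quick sanity check on the breakdown above:

```python
NUM_LAYERS = 40

tokens_per_s = 0.5
ms_per_token = 1000 / tokens_per_s      # 2000 ms per token
ms_per_layer = ms_per_token / NUM_LAYERS
print(ms_per_layer)                     # 50.0

# per-layer components read off the Instruments trace
attention_router, expert_calc_comm, wait_stall = 7, 30, 13
print(attention_router + expert_calc_comm + wait_stall == ms_per_layer)  # True
```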

257 of 272

DBRX: distributed inference with M2 Ultras

optimization strategies

  • improve weights loading strategy to avoid driver activity (should reduce ~21 ms per layer)
  • figure out and eliminate cause of stalls/ wait
  • parallelization within node so that all 4 chosen experts can be executed at the same time
  • currently, expert calculation within node is sequential
  • ALU utilization is ~30%

258 of 272

updates from 4/30 PAS lab meeting

259 of 272

DBRX: distributed inference with M2 Ultras

latest employed optimization strategies

  • expert parallelization within nodes:
  • before: 4 nodes, 4 experts per node, experts run sequentially
  • now: …, …, experts run in parallel by spawning independent processes
  • [NEEDS WORK]: deeper performance analysis
  • improved weights loading strategy within expert (i.e. process in setting above)
  • before: 40 layers, 3 weight arrays per layer, loaded to memory separately (total size = 40 * 3 * 10752 * 6144 * 2 = roughly 15.85 GB)
  • now: …, …, 15.85 GB of weights loaded to memory in one chunk
  • [NEEDS WORK]: currently, every expert is kept busy in every layer to reap the benefits of the above weights loading strategy => effectively a dense model :(
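The per-expert weights arithmetic, spelled out (the three arrays per layer are presumably the expert FFN's weight matrices; shapes from the slide):

```python
# 40 layers x 3 weight arrays per layer x one 10752x6144 bfloat16
# (2-byte) array each.

def expert_weights_gb(layers=40, arrays_per_layer=3,
                      rows=10752, cols=6144, bytes_per_elem=2):
    return layers * arrays_per_layer * rows * cols * bytes_per_elem / 1e9

print(round(expert_weights_gb(), 2))  # 15.85
```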

260 of 272

DBRX: distributed inference with M2 Ultras

model performance

workload: batch size = 1, input = 20 tokens, output = 64 tokens

                                    prompt evaluation   token generation
week_0 (baseline)                   1.3 t/s             0.6 t/s
week_1 (expert parallelization,
one-chunk weights loading, dense)   3 t/s               2.4 t/s

261 of 272

05/09/2024 weekly report

outline

  • detailed performance breakdown between versions
  • v2 & v3 architecture design

262 of 272

DBRX distributed inference architecture design

263 of 272

DBRX distributed inference performance breakdown

                                            |------------------ per layer ------------------|   per prompt
version                                     attention + router   MoE          wait / stall      token generation t/s
v1 (baseline)                               ~7 ms                ~30 ms       ~13 ms            ~0.5 t/s
v1.5 (load each expert’s weights as
one chunk; dense)                           ~4 ms                ~3.5 ms      ~3 ms             ~2.4 t/s
v2 (everything from v1.5; earlier driver
eval; custom serialization)                 ~1.5 ms              ~3.5 ms      ~3 ms             ~3.1 t/s
v3 (everything from v2; architecture
redesign; ½ communication; ½ expert
computation)                                est. ~1.5 ms         est. ~2 ms   est. ~1.5 ms      est. ~5 t/s

filename: dbrx_poc_one_custom_serialization

filename: dbrx_poc_batch2

filename: dbrx_poc_perf

264 of 272

Obstacles encountered during v3 development:

driver activity surfacing from idleness

sleep in ms   incurs heavy driver activity
1             False
10            False
100           False
500           True
1000          True

driver activity here refers to:

time driver spent on wiring down sufficient physical memory for GPU computation

where does idleness come from (more on this next page):

wait during all-reduce (the shard that finishes first has to wait for the one that finishes last)

observation from v3 MoE layer design (4 chooses 2 experts activated per shard):

waiting/sleeping between layers causes lengthy driver activity to surface

when sleep = 1000 ms, driver processing takes ~400 ms for a ~1 ms GPU computation

265 of 272

How idleness becomes a problem in v3

  • driver activity is unavoidable during the first layer, and its processing time has high variance
  • these long waits, as shown on the previous page, trigger more driver activity => an inescapable loop!
  • for reasons unknown, the last shard to finish also tends to wait before the next layer’s calculation
  • for reasons unknown, the moe design from v2 seems to mitigate this disturbance pretty well

266 of 272

introducing v3.2,

our first step towards:

- streaming MoE processing (beneficial during prompt evaluation and when input batch size > 1)

- dynamic load balancing (beneficial even when batch size = 1, increases overlap between communication and computation during streaming)

267 of 272

new trouble with v3.2: needs warming up!

what do we mean by warm up?

sync computation time to reduce driver activity

268 of 272

distributed DBRX v3 performance evaluation

            num GPUs   price in USD per GPU   communication                          fp16 TFLOPS per GPU   bfloat16 TFLOPS per GPU   single user throughput (t/s)
M2 Ultra    4          ~8,500 ?               10 GbE switch                          ~54                   ???                       ~6
L40S        8          ~9,287                 PCIe Gen4 x16: 64GB/s bidirectional    ~362                  ~362                      ~60 ?
H100        4?         ~31,710                NVLink: 900GB/s, PCIe Gen5: 128GB/s    ~1979                 ~1979                     ~115 ?

269 of 272

Evaluation Goal:

A Performance Model that Takes Compute Power into Account

270 of 272

DBRX on A100: how fast is inter-GPU communication in terms of latency?

hardware setup: 1 node, 4 GPUs, 4 x 80 = 320GB memory, 600 GB/s GPU interconnect through NVLink

framework: TensorRT-LLM, tensor parallelism

communication pattern:

- per layer: 2 x AllReduce (once before attention, the other before MoE)

- per token: 1 x AllGather after all 40 layers, to piece together the generated token

271 of 272

DBRX on A100: how fast is inter-GPU communication in terms of latency?

communication latency breakdown:

operation    n times per token   avg latency in microsecond(s)   total latency per token in microsecond(s)
AllReduce    2 * 40 = 80         18                              1,440
AllGather    1                   20                              20
total:                                                           1,460

compared to us:

hardware setup                total communication latency per token in microsecond(s)
A100 (1 node, 4 chips)        1,460
M2 Ultra (2 nodes, 2 chips)   60,000
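Reproducing the totals:

```python
NUM_LAYERS = 40

allreduce_us = (2 * NUM_LAYERS) * 18   # 2 AllReduce per layer, 18 us average each
allgather_us = 1 * 20                  # one AllGather per token
total_us = allreduce_us + allgather_us
print(total_us)                        # 1460

m2_ultra_us = 60_000
print(round(m2_ultra_us / total_us, 1))  # ~41x gap vs our cluster
```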

272 of 272

DBRX performance model

  • since H100 has 8 GPUs in 1 node, we estimate its comm latency to be 2x of A100’s

  • predicted / realized is roughly 1.6 for each row