by Mu-Chi Chen
What Is a Data Hub?
What Is a Data Hub: a one-stop shop for data consumers
What Is a Data Hub: Main Functions
Examples
from platform providers:
from an alternative/complementing definition of data hub:
from Google as a narrowly purposed product:
Example: Cumulocity
Highlights:
Example: MarkLogic
Highlights:
Example: Cloudera
workflow:
reference:
https://www.cloudera.com/products/data-hub.html#
https://docs.cloudera.com/data-hub/cloud/index.html
Things to Consider
how to move data into our data hub? (data ingestion & integration)
where do we store the data in our data hub?
implications of our infrastructure choices?
slides above were covered in
Data Hub Project Weekly Meeting on 9/28
Example: Data Hub as Data Catalog/ Metadata Platform
What is a Data Catalog?
According to IBM, “a data catalog is a detailed inventory of all data assets in an organization, designed to help data professionals quickly find the most appropriate data for any analytical or business purpose.”
reference: https://www.ibm.com/topics/data-catalog
Why does a Data Hub need this?
Aside from enabling people to get the data they need easily, data hubs should also help them find the data they need (this would essentially be the face of this project).
Example: Data Hub as Data Catalog/ Metadata Platform
Common functions of a data catalog
reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained
Example: LinkedIn & Acryl Data’s DataHub
open source
slides 12, 14-16 reference a LinkedIn Engineering Blog post about metadata catalog architectures by Acryl Data’s CTO (previously LinkedIn’s Principal Staff Software Engineer leading the data infrastructure team)
Example: LinkedIn & Acryl Data’s DataHub
architecture
reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained
Example: LinkedIn & Acryl Data’s DataHub
architecture
reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained
Example: LinkedIn & Acryl Data’s DataHub
reference: https://engineering.linkedin.com/blog/2020/datahub-popular-metadata-architectures-explained
Example: Microsoft’s OneLake Data Hub
question:
What does MBZUAI need?
Example: Ads Data Hub (will skip in presentation)
Highlights:
For weekly meeting on 10/5/2023
topic: continue to survey data hub elsewhere
maybe
Example: Internet of Water Data Hub
What is the Internet of Water Coalition?
A group of organizations (e.g. Duke University Nicholas Institute, Lincoln Institute Center for Geospatial Solutions, Western States Water Council) working with federal, state, and local governments to improve water management by building a water data infrastructure.
In summary:
Better data means better management. Therefore, we need good infrastructure to make water data more discoverable, accessible, and usable
Example: Internet of Water (IoW) Data Hub
basic components of a Data Hub
individual data hubs are then connected by Geoconnex, which is essentially a global metadata catalog (this is another place data users can come in)
Example: Internet of Water (IoW) Data Hub
there are four types of data hubs, and this is type A
Example: Internet of Water (IoW) Data Hub
type B:
type C:
* data sharing creates a privacy issue
Example: Internet of Water (IoW) Data Hub
type D (which is more like our original vision):
* can the data producers trust the data hub to transform the data right? (which brings us back to the ETL vs. ELT question)
Example: Internet of Water (IoW) Geoconnex
purpose: make it easier for data consumers to find/access the data they want/need. Roughly similar to what LinkedIn and Microsoft are doing
why I think it is not a good product
this is so that a web crawler can then go to these pages to extract the data it needs and store it in a central database once everything above is done right
Example: Internet of Water (IoW) Data Hub
IoW Data Hub examples:
https://newmexicowaterdata.org/
references:
https://internetofwater.org/wp-content/uploads/2022/09/Blog019_IoWWaterDataHubs_Watson.pdf (slides 20-25)
https://internetofwater.org/blog/geoconnex/ (slide 26)
Example: the Art Institute of Chicago Data Hub
background
Example: the Art Institute of Chicago Data Hub
design goals
Example: the Art Institute of Chicago Data Hub
architecture
Example: the Art Institute of Chicago Data Hub
challenges
references: https://mw19.mwconf.org/paper/building-a-data-hub-microservices-apis-and-system-integration-at-the-art-institute-of-chicago/index.html (slides 28-31)
Example: Red Hat’s Open Data Hub (ODH)
Example: Red Hat’s Open Data Hub (ODH)
Architecture highlights
Example: Red Hat’s Open Data Hub (ODH)
references:
https://opendatahub.io/docs/architecture/ (slide 32)
https://cloud.redhat.com/hubfs/Open-Data-Hub-A-Deep-Dive.pdf (slide 33)
Example: Google Analytics Hub
feature of BigQuery
short intro to BigQuery:
Example: Google Analytics Hub
What might it look like for MBZUAI?
Example: Google Analytics Hub
Alright, so how would I build it?
Scaling
What could be MBZUAI Data Hub’s unique offering?
Automatic Sampling Bias Detection
what is sampling bias?
sample unrepresentative of population
why do we care?
what role can data hub play in this?
by enabling automatic and real-time exploratory data analysis, we can
Automatic Sampling Bias Detection
how do we detect sampling bias?
what does detecting sampling bias mean?
uncovering patterns in data (arguably data mining, or knowledge discovery in data): e.g. 90% of the data collected today came from people who identify as cisgender men (assigned male at birth and identifying as men)
Automatic Sampling Bias Detection
ways to uncover patterns in data
methods of unsupervised learning:
exclusive (K-means)
hierarchical (agglomerative)
probabilistic (Gaussian Mixture, determines the probability of data belonging to a particular distribution)
Apriori algorithm (mines frequent item combinations)
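The clustering idea above can be sketched without any ML library: a toy 1-D K-means over a hypothetical `ages` sample, whose two tight clusters would suggest the sample over-represents two cohorts.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Toy 1-D K-means: alternate assigning points to the nearest center
    and recomputing each center as its cluster's mean."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # keep the old center if a cluster ends up empty
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

# hypothetical respondent ages: two tight clusters hint that the sample
# over-represents two particular cohorts
ages = [21, 22, 23, 22, 24, 61, 63, 62, 60, 64]
print(kmeans(ages, 2))  # → [22.4, 62.0]
```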
Automatic Sampling Bias Detection
ways to uncover patterns in data
methods of unsupervised learning:
local outlier factor (compares the local density of a point with that of its neighbors)
isolation forest (builds overfitted decision trees by splitting on random features at random thresholds until every point is isolated; points isolated early are outliers)
autoencoders (perhaps for images. The encoder learns a latent representation of the input and the decoder reconstructs the input from it; a larger reconstruction error implies an anomaly)
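As a minimal stdlib-only illustration of the density intuition behind LOF (not the actual LOF formula), score each point by its mean distance to its k nearest neighbors; points in sparse regions get large scores. Data here is invented.

```python
def knn_outlier_scores(points, k=2):
    """Score each point by the mean distance to its k nearest neighbors;
    points in sparse regions (low local density) get large scores."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(abs(p - q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# hypothetical sensor readings with one obvious anomaly
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 42.0]
scores = knn_outlier_scores(readings)
worst = max(range(len(scores)), key=scores.__getitem__)
print(worst, readings[worst])  # → 5 42.0
```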
Automatic Sampling Bias Detection
ways to uncover patterns in data
methods of supervised learning:
understand the different ways in which your data can be categorized to adjust data collection strategy
e.g.
classification results seem geographic, and much more data is classified as coming from East Asia
KNN, K-nearest neighbors
decision tree & random forest
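A toy KNN sketch of the East Asia example above: classify newly collected samples with a nearest-neighbor vote, then inspect the label mix for skew. The (longitude, region) training pairs and the batch are entirely hypothetical.

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Toy 1-D k-nearest-neighbor vote over (feature, label) pairs."""
    nearest = sorted(train, key=lambda t: abs(t[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# hypothetical (longitude, region) training pairs
train = [(121, "east_asia"), (139, "east_asia"), (116, "east_asia"),
         (2, "europe"), (13, "europe")]

# classify a batch of newly collected samples, then look at the label mix;
# a heavy skew toward one region flags a biased collection strategy
batch = [120, 139, 118, 125, 5, 122]
counts = Counter(knn_predict(train, x) for x in batch)
print(counts)
```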
Academic Discussions on Data Hub
Technical Report: Developing a Working Data Hub
Vijay Gadepally (Ph.D. in ECE from OSU, technical staff at MIT Lincoln Laboratory)
Jeremy Kepner (Ph.D. in Astrophysics from Princeton, head of the MIT Lincoln Laboratory Supercomputing Center)
MIT on the need of data hub
big data challenges:
evolving solution:
MIT on data hub architecture
at the bottom:
external data sources + your own data storage including databases, data warehouses, and data lakes
in the middle:
heterogeneous data management and transformation. enable users to query multiple sources at once
on the top:
data discovery and define data transformation rules
MIT on data hub system engineering
NOTE: “system engineering studies the development of complex systems (p.8)”
HIGHLIGHTS:
3a is an operation supported on databases, while 3b is a strategy for file systems/object storage/block storage, i.e. places for unstructured/raw data like data lakes
=> similar to my data hub architecture proposal
MIT on raw data storage: Lustre
Lustre further reading
“Azure Managed Lustre: not your grandparents' parallel file system”
Ceph (used in Red Hat’s Open Data Hub, originally part of Sage Weil’s doctoral dissertation at UC Santa Cruz, also used extensively in HPC) vs. Lustre:
https://www.linkedin.com/pulse/ceph-lustre-yashar-esmaildokht/
AWS FSx for Lustre (integrates with s3)
https://docs.aws.amazon.com/fsx/latest/LustreGuide/what-is.html
official site:
MIT on DBMS
what is a DBMS?
software interface between users and a database
why use a DBMS?
MIT on DBMS
ACID vs. BASE
relational db: atomicity, consistency, isolation, and durability. OLTP
non-relational db: basically available, soft state, eventual consistency
CAP theorem
you can only guarantee 2 of consistency, availability, and partition tolerance
(conjectured by Eric Brewer at Berkeley, later proven by Gilbert and Lynch at MIT)
relational db: C + A
non-relational db: C + P (MongoDB), A + P (Cassandra, developed at Facebook by an engineer who had previously worked on Amazon’s Dynamo)
MIT on DBMS
scaling
relational db: vertical (improve hardware, because they mainly run on a single node)
non-relational db: horizontal (add nodes, which creates the need for strong partition tolerance, making relaxing C and/or A inevitable if the CAP theorem holds)
That said, there is controversy surrounding CAP
NewSQL (as opposed to SQL and NoSQL): a movement that aims to build databases that are both ACID-compliant and performant (i.e. horizontally scalable)
some examples include …
CockroachDB: https://www.cockroachlabs.com/
Google Cloud Spanner (which we have seen before when introducing BigQuery, now advertised as being half the cost of DynamoDB): https://cloud.google.com/spanner?hl=en
MIT on DBMS
Access Control: who can access what
granularity: from table and row down to cell level
principal: the “who,” can be role/ user/ groups
asset: the “what,” i.e. resources stored in database
view-based access control being most popular
query control strategies: restrict what queries the principal is allowed to issue
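A minimal sketch of view-based access control using Python’s sqlite3: the principal queries a view exposing only non-sensitive columns and rows of the underlying asset. SQLite has no GRANT statement, so the view stands in for the full grant machinery; the table and names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INT)")
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [("ann", "hr", 90), ("bob", "eng", 120), ("cat", "eng", 130)])

# the view is the asset the principal is granted: engineering rows only,
# and no salary column
conn.execute("CREATE VIEW eng_public AS "
             "SELECT name, dept FROM employees WHERE dept = 'eng'")

rows = conn.execute("SELECT name FROM eng_public ORDER BY name").fetchall()
print(rows)  # → [('bob',), ('cat',)]
```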
MIT on DBMS
ways to add security
well … duh … encryption and/or masking
CryptDB (also by MIT):
MIT on heterogeneous data management
BigDAWG (Big Data Working Group)
MIT on heterogeneous data management
MIT on key considerations when developing data hub
Technological considerations
MIT on key considerations when developing data hub
Infrastructure considerations
MIT on key considerations when developing data hub
Data formatting considerations
MIT on key considerations when developing data hub
Security Considerations
MIT on key considerations when developing data hub
Policy considerations
MIT on key considerations when developing data hub
User engagement
UCSC Xena Platform
https://www.nature.com/articles/s41587-020-0546-8
enables visualization and interpretation of cancer genomics data from disparate sources
Background
UCSC Xena Platform
Architectural Design
two components:
UCSC Xena Platform
Lessons from government agencies
Taiwan’s Health and Welfare Data Center (HWDC)
https://pubmed.ncbi.nlm.nih.gov/31118821/
Background
Taiwan’s Health and Welfare Data Center (HWDC)
Background
Taiwan’s Health and Welfare Data Center (HWDC)
Challenges: Privacy
Taiwan’s Health and Welfare Data Center (HWDC)
Challenges: Data Quality
Taiwan’s Health and Welfare Data Center (HWDC)
Challenges: Unmeasured or unavailable variables
Querying Encrypted Data
by Microsoft Research
Querying Encrypted Data: Demand
migration to the cloud and thus storage in the cloud
Querying Encrypted Data
enters encryption:
you won’t be able to understand/use the data even if you manage to gain access to it
the quick brown fox … a pangram
Querying Encrypted Data
two fundamental techniques:
Querying Encrypted Data
more on the previous page
partial: supports one type of gate (+ or *)
full: multiple gate types at unbounded depth (i.e. arbitrary functions)
Monomi: work of MIT CS and AI Lab (CSAIL) in 2013
CryptDB: from CSAIL too, 2011, Raluca Ada Popa is an Associate Professor at Berkeley EECS now
TrustedDB: Stony Brook University (NY), 2014
Cipherbase: Microsoft research, 2015
Querying Encrypted Data
symmetric encryption
Querying Encrypted Data
asymmetric encryption
Querying Encrypted Data
Advanced encryption standard cipher block chaining mode
Querying Encrypted Data
nondeterministic (more secure):
>> for i in range(2):
>>     print(encrypt("foo"))
1qaz2wsx
3edc4rfv
deterministic:
Querying Encrypted Data
homomorphic encryption:
>> encrypt(1) + encrypt(1) == encrypt(2)
order preserving encryption:
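The homomorphic property above (encrypt(1) combined with encrypt(1) decrypting to 2) can be demonstrated with a textbook Paillier sketch; in Paillier, multiplying ciphertexts adds the plaintexts. Tiny primes are used for readability only, so this is NOT secure.

```python
import math, random

# tiny primes for readability; real Paillier uses ~2048-bit moduli
p, q = 61, 53
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:   # r must be invertible mod n
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n * mu) % n

# multiplying ciphertexts adds the underlying plaintexts
c = (encrypt(1) * encrypt(1)) % n2
print(decrypt(c))  # → 2
```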
Querying Encrypted Data
to summarize
Querying Encrypted Data
authors’ opinion: outdated?
Querying Encrypted Data
Querying Encrypted Data: Trusted Client Architecture
client: applications performing CRUD operations against a server DBMS
Querying Encrypted Data: CryptDB as Trusted Client example
CryptDB:
Querying Encrypted Data: CryptDB as Trusted Client example
example operation
client app
>> SELECT SUM(grade) FROM students;
web proxy
>> decrypt(paillier_grade_sum)
DBMS
>> SELECT PAILLIER_SUM(paillier_grade) AS paillier_grade_sum
FROM students;
students table’s schema:
from user’s perspective
{
"id": int,
"name": str,
"grade": int
}
on DBMS
{
"paillier_id": str,
"name": str,
"paillier_grade": str
}
Querying Encrypted Data: Blob Store as Trusted Client example
What is Blob Store?
students table’s schema:
{
"name": str,
"blob_grade": str,
"partition_id": int
}
range partition:
grade   | partition_id |
1 - 9   | 0            |
10 - 19 | 1            |
20 - 29 | 2            |
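A sketch of how a trusted client could use such a partition table: rewrite "grade > 10" into a superset predicate over partition_id (all the server can evaluate), then filter exactly after decryption. Names and data here are hypothetical.

```python
# (lo, hi, partition_id) — the coarse ranges from the table above
PARTITIONS = [(1, 9, 0), (10, 19, 1), (20, 29, 2)]

def partition_of(grade):
    return next(pid for lo, hi, pid in PARTITIONS if lo <= grade <= hi)

def candidate_partitions(threshold):
    # every partition that *may* hold grades strictly above the threshold
    return {pid for lo, hi, pid in PARTITIONS if hi > threshold}

rows = [("ann", 8), ("bob", 12), ("cat", 25)]            # plaintext view
server_rows = [(n, partition_of(g)) for n, g in rows]    # what the DBMS sees

wanted = candidate_partitions(10)                        # {1, 2}
candidates = [n for n, pid in server_rows if pid in wanted]   # server side
result = [n for n, g in rows if n in candidates and g > 10]   # client side
print(result)  # → ['bob', 'cat']
```

The server-side predicate is a superset (partition 1 also contains grade 10 itself), so the trusted client must refine the candidate set locally.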
Querying Encrypted Data: Blob Store as Trusted Client example
Architecture
Querying Encrypted Data: Blob Store as Trusted Client example
client app
>> SELECT SUM(grade)
FROM students
WHERE grade > 10;
DBMS Shell (trusted, client side)
>> SELECT SUM(DECRYPT(grade_blob))
FROM students
WHERE DECRYPT(grade_blob) > 10;
DBMS (untrusted, server side)
>> SELECT grade_blob
>> FROM students
>> WHERE partition_id > 0;
example operation
Querying Encrypted Data: Blob Store as Trusted Client example
challenge
Querying Encrypted Data: Monomi as Enhanced Blob Store
WHERE updated_at = created_at + 1 -> without the key, you don’t know encrypt(1)
example operation
client app
>> SELECT SUM(grade)
FROM students
WHERE grade > 10;
DBMS Shell (trusted, client side)
>> SELECT SUM(DECRYPT(grade_paillier))
DBMS (untrusted, server side)
>> SELECT grade_paillier
>> FROM students
>> WHERE grade_paillier > 10;
Querying Encrypted Data: Trusted Client Summary
Querying Encrypted Data: Secure In-Cloud Processing
we talked about trusted client in the previous slides
Querying Encrypted Data: Secure In-Cloud Processing
how do queries under different architectures work?
benefit of having smaller TCB?
notice how isolation (hardware, memory) affects the size of the base
Querying Encrypted Data: Secure In-Cloud Processing
larger TCB | smaller TCB |
less secure | more secure |
more administration | less administration |
according to this paper:
Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing
Highlights
SQLite: trusted storage for sensitive data
Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing
students table, split across the two stores:
MySQL (untrusted):
id | name   | birthday_timestamp |
0  | Andrew | 866476800          |
1  | Bart   | 961171200          |
2  | Casey  | 1087401600         |
SQLite (trusted):
id | SSN       |
0  | 123456789 |
1  | 123456790 |
2  | 123456791 |
Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing
example query
User Query
>> SELECT *
>> FROM students
>> WHERE birthday_timestamp > 866476800 AND
>> SSN > 123456790
Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing
What Happens Under the Hood
>> SELECT *
>> FROM mysql_students
>> RIGHT JOIN (
>> SELECT id
>> FROM students
>> WHERE SSN > 123456790
>> ) AS sqlite_result
>> ON mysql_students.id = sqlite_result.id
>> WHERE mysql_students.birthday_timestamp > 866476800
Querying Encrypted Data: TrustedDB as an Example of Secure In-Cloud Processing
performance bottleneck:
Querying Encrypted Data: Cipherbase as an Example of Secure In-Cloud Processing
Querying Encrypted Data: Cipherbase as an Example of Secure In-Cloud Processing
https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cipherbase.pdf
Querying Encrypted Data: Semantic Security
“An adversary is allowed to choose between two plaintexts, m0 and m1, and he receives an encryption of either one of the plaintexts. An encryption scheme is semantically secure, if an adversary cannot guess with better probability than 1/2 whether the given ciphertext is an encryption of message m0 or m1. The notion is also referred to as indistinguishability of encryptions. The encryption reveals no information no matter what kind of semantics (meanings) are embedded in the encryption.”
https://link.springer.com/referenceworkentry/10.1007/978-1-4419-5906-5_23
Querying Encrypted Data: Leakage
what information do encryption schemes reveal?
AES in CBC mode (non-deterministic)
deterministic
order preserving
Querying Encrypted Data: Leakage
memory access pattern leakage
malicious actors watching on the server end can study which storage locations different operations access
e.g.
the observer learns the following workflow
query A -> accesses locations 0 ~ 9 -> wires money to some account
which implies
locations 0 ~ 9 must store information required for that transaction
Querying Encrypted Data: Leakage
Querying Encrypted Data: Access Pattern Oblivious Algorithms
bubble sort
for i in range(len(arr) - 1, 0, -1):
    for j in range(i):
        if arr[j] > arr[j + 1]:
            arr[j], arr[j + 1] = arr[j + 1], arr[j]
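For contrast, a sketch of a more access-pattern-friendly variant: the inner loop always writes both slots, computing the swap arithmetically instead of behind a data-dependent branch. (The comparison result still exists in Python, so truly oblivious code would go further with constant-time selects; this only illustrates the idea.)

```python
def oblivious_sort(arr):
    """Bubble sort whose memory access sequence is independent of the data:
    every compare writes both slots, with the swap chosen arithmetically."""
    a = list(arr)
    for i in range(len(a) - 1, 0, -1):
        for j in range(i):
            swap = int(a[j] > a[j + 1])        # 1 if out of order, else 0
            lo = swap * a[j + 1] + (1 - swap) * a[j]
            hi = swap * a[j] + (1 - swap) * a[j + 1]
            a[j], a[j + 1] = lo, hi            # same writes on every path
    return a

print(oblivious_sort([3, 1, 2]))  # → [1, 2, 3]
```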
Querying Encrypted Data: Access Pattern Oblivious Algorithms
aggregator oblivious encryption
https://eprint.iacr.org/2017/479.pdf
oblivious simulation
an additional layer that makes memory access patterns look random -> prevents benefiting from spatial and temporal locality
Querying Encrypted Data
(from the Cipherbase paper)
In conclusion, what do we want?
solution that balances
Transaction Processing on Confidential Data using Cipherbase
background
Transaction Processing on Confidential Data using Cipherbase
Abstract (on Cipherbase):
Transaction Processing on Confidential Data using Cipherbase
Architecture Benefits (aside from what was already mentioned last week)
Transaction Processing on Confidential Data using Cipherbase
Threat Model (categories below are assumed to be passive, i.e. they do not tamper with database contents or processing)
strong adversary:
privileged OS and database access; can view the contents of memory/disk and all internal/external communication, but cannot observe state and computations inside the secure hardware
weak adversary:
can obtain a one-time snapshot of data-at-rest
in real world:
adversary usually lies between the two
Transaction Processing on Confidential Data using Cipherbase
Data Confidentiality (assuming all columns are strongly, i.e. non-deterministically, encrypted)
Transaction Processing on Confidential Data using Cipherbase
relational algebra notations:
projection - subset of columns
restriction (selection) - subset of rows
security guarantee of range index is similar to that of order preserving encryption
Transaction Processing on Confidential Data using Cipherbase
Data Confidentiality comparing to prior work
CryptDB (PHE)
e.g. to evaluate
CryptDB has to store the entire column deterministically encrypted, which leaks the equivalence relation & distribution when a weak adversary obtains a snapshot
On the other hand, with Cipherbase
a scan-based plan only reveals unknown_predicate(A) over R’s tuples (because values can be non-deterministically encrypted)
Transaction Processing on Confidential Data using Cipherbase
Architecture & Overview of query/transaction processing
database processing is about
Transaction Processing on Confidential Data using Cipherbase
Cipherbase client
Transaction Processing on Confidential Data using Cipherbase
e.g. to evaluate
needs a stack program that
takes encrypted A as input and outputs plaintext True/False
is invoked once for each tuple of R
this usage pattern inspires the design => registered once, invoked using the handle (id) returned by the register method
has a data cache but is otherwise stateless
Trusted Module
evaluate expressions over encrypted data during query processing
Transaction Processing on Confidential Data using Cipherbase
Key Management
keys are communicated to the trusted module using standard key-exchange techniques: the secure hardware device’s manufacturer assigns a public key
Data Encryption
supports cell-level (row + column) encryption
=> minimizes TM traffic
=> incurs a storage penalty (the ciphertext of a 4-byte integer is 12 bytes in AES CTR mode)
Transaction Processing on Confidential Data using Cipherbase
Indexing
range index:
=> which means weak adversary learns ordering but not equality relationship
=> strong adversary learns equality relationship of queried ranges
Transaction Processing on Confidential Data using Cipherbase
inserting 7 into a range index
Transaction Processing on Confidential Data using Cipherbase
Indexing
equality index:
Transaction Processing on Confidential Data using Cipherbase
transaction processing
First, client issues query Q
during the prepare step:
Transaction Processing on Confidential Data using Cipherbase
e.g.
user issues:
>> UPDATE Accounts
>> SET Balance = Balance + @Amt
>> WHERE Id = @Id
Cipherbase Client rewrites into:
>> UPDATE Accounts
>> SET Balance = TMEval(21, Balance, @Amt)
>> WHERE Id = @Id
TMEval: a built-in function on the UM (SQL Server) added by Cipherbase to invoke stack programs on the TM
After the rewritten query is submitted to the UM, it prepares a query plan like the following:
Transaction Processing on Confidential Data using Cipherbase
PHE avoids round-trips to TM:
in the previous example, under a scan-based plan, if the Id column is deterministically encrypted, the query does not have to involve the TM
Transaction Processing on Confidential Data using Cipherbase
Trusted Module
Transaction Processing on Confidential Data using Cipherbase
System engineering and optimization details
Transaction processing challenges:
Transaction Processing on Confidential Data using Cipherbase
Optimizations
Transaction Processing on Confidential Data using Cipherbase
user issues
>> UPDATE Inventory
>> SET item_cnt = item_cnt + @qty,
>>     total_cnt = total_cnt + @qty
>> WHERE storeId = @Id
Naive Cipherbase rewrite
>> UPDATE Inventory
>> SET item_cnt = TMEval(0, item_cnt, @qty),
>>     total_cnt = TMEval(1, total_cnt, @qty)
>> WHERE storeId = @Id
expression folding
>> UPDATE Inventory
>> SET @var = TMEval(2, item_cnt, total_cnt, @qty),
>>     item_cnt = UMExtract(@var, 0),
>>     total_cnt = UMExtract(@var, 1)
>> WHERE storeId = @Id
i.e. common subexpression elimination
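A toy sketch of why folding helps, counting round trips to a stand-in trusted module (`tm_eval`, `naive_update`, and `folded_update` are invented names; the real cost is the UM <-> TM round trip, not the arithmetic):

```python
TM_CALLS = 0

def tm_eval(fn, *args):
    """Stand-in for TMEval: each call models one expensive UM <-> TM trip."""
    global TM_CALLS
    TM_CALLS += 1
    return fn(*args)

def naive_update(item_cnt, total_cnt, qty):
    # one TM invocation per encrypted column update -> 2 round trips
    new_item = tm_eval(lambda a, b: a + b, item_cnt, qty)
    new_total = tm_eval(lambda a, b: a + b, total_cnt, qty)
    return new_item, new_total

def folded_update(item_cnt, total_cnt, qty):
    # one stack program computes both sums; UMExtract-style unpacking follows
    packed = tm_eval(lambda a, b, q: (a + q, b + q), item_cnt, total_cnt, qty)
    return packed[0], packed[1]

naive_update(5, 100, 3)   # 2 round trips
folded_update(5, 100, 3)  # 1 round trip, same results
print(TM_CALLS)  # → 3
```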
Transaction Processing on Confidential Data using Cipherbase
Hitting Pause on Cipherbase
https://www.microsoft.com/en-us/research/project/microsoft-seal/
Hitting Pause on Cipherbase
however, what is SEAL actually capable of?
silent confession by Microsoft:
Hitting Pause on Cipherbase: Microsoft SEAL
=> this means a large portion of SQL queries is unachievable
this restriction is entirely contradictory to the idea of FHE
Hitting Pause on Cipherbase: Microsoft SEAL
this is rather a problem common to Cipherbase and all the previous technologies discussed
Switching to Data Observatory
Clarifying Survey Direction
questions to address:
Thinking Outside the Box: AI4VIS
https://arxiv.org/pdf/2102.01330.pdf
background
i.e. model => visualization
i.e. visualization => model
Thinking Outside the Box: AI4VIS
Why should MBZUAI care?
model => visualization => model
we are not only advancing AI in two directions but also gaining more content to show on our screens
Thinking Outside the Box: AI4VIS
credence: author affiliation
Thinking Outside the Box: AI4VIS
definitions
Thinking Outside the Box: AI4VIS
why apply AI to visualization data (mentioned previously but now in more detail)
Thinking Outside the Box: AI4VIS
Thinking Outside the Box: AI4VIS
how is the research community applying AI to visualizations?
Thinking Outside the Box: AI4VIS
future research opportunities
AI4VIS: implementation example from Data2VIS => LIDA
Data2Vis => LIDA
links to papers
Data2Vis: https://arxiv.org/pdf/1804.03126.pdf
LIDA: https://arxiv.org/pdf/2303.02927.pdf
Data2Vis => LIDA
background
Data2Vis (2019, submitted on arXiv in 2018) author:
Victor Dibia (then at IBM Research) and Çağatay Demiralp (MIT CSAIL)
LIDA (
submitted on arXiv in March, 2023,
open-sourced on GitHub only in August this year,
claimed by Victor as an update of Data2Vis
) author:
Victor Dibia at Microsoft Research
Data2Vis => LIDA
summary - Data2Vis
Data2Vis => LIDA
Visualizations created by Data2Vis
Data2Vis => LIDA
summary - LIDA
data to semantically rich natural language description
LLMs suffer from hallucination, which sometimes leads them to produce output not grounded in the data. Including the summary provides grounding context for generating visualizations
question (hypothesis): what are we trying to understand
visualization: e.g. histogram of miles per gallon
rationale: why should we ask the question above?
Data2Vis => LIDA
generate, evaluate, repair, and execute visualization code
code scaffold constructor: imports dependent libraries and creates function stubs
code generator & executor: fills in the TODOs and produces the visualization
Data2Vis => LIDA
Formalizing Visualization Design Knowledge as Constraints: Draco
Visualization Design Knowledge as Constraints: Draco
Visualization Design Knowledge as Constraints: Draco
Visualization Design Knowledge as Constraints: Draco
taste of Answer Set Programming
A :- L_1, …, L_n => a rule
A => an atom, a proposition that may be true/false
L_1, …, L_n => literals; A is true if all of them are true
bodiless rules encode facts, whereas headless rules impose integrity constraints (their literals cannot all be satisfied)
e.g.
light_on :- power_on, not broken
Visualization Design Knowledge as Constraints: Draco
an integrity constraint defines how attributes may interact with each other:
:- mark(bar), channel(E, y), continuous(E), not zero(E)
rules out vertical bar charts that do not use 0 as the baseline
a set of aggregate rules specifies the attributes of a visualization (e.g. mark, encoding)
Visualization Design Knowledge as Constraints: Draco
finding the optimal completion of a partially specified visualization
Visualization Design Knowledge as Constraints: Draco
e.g.
:~ continuous(E), not zero(E). [5]
this says the model prefers to include 0 for continuous fields, and that violating this constraint incurs a cost of 5
=> Cost(v) = Σ_i w_{p_i} · n_{p_i}(v), where n_{p_i} denotes the number of violations of soft constraint p_i
Visualization Design Knowledge as Constraints: Draco
using the cost function defined on the last page,
(v1, v2, y = sign(Cost(v1) - Cost(v2)))
=> y = -1 if v1 is preferred over v2
=> let
x be the vector [n_{p_0}(v), …, n_{p_k}(v)],
and w be the corresponding weight vector
=> Cost(v1) - Cost(v2) = w · x_1 - w · x_2 = w · (x_1 - x_2)
a linear model with ridge (L2) regularization, trained to minimize the hinge loss
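A minimal sketch of that loss, assuming a single soft constraint with weight 5 and one labeled preference pair (all numbers invented):

```python
def hinge_loss(w, pairs, l2=0.1):
    """Ranking loss over preference pairs: each pair contributes
    max(0, 1 - y * w·(x1 - x2)), plus L2 regularization on w."""
    loss = 0.0
    for x1, x2, y in pairs:
        margin = y * sum(wi * (a - b) for wi, a, b in zip(w, x1, x2))
        loss += max(0.0, 1.0 - margin)
    return loss + l2 * sum(wi * wi for wi in w)

w = [5.0]                    # weight of the single soft constraint
pairs = [([0], [1], -1)]     # v1 violates it 0 times, v2 once; v1 preferred
print(hinge_loss(w, pairs))  # → 2.5 (zero hinge term + 0.1 * 5**2)
```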
Visualization Design Knowledge as Constraints: Draco
Draco 1 (2018, from UDub Interactive Data Lab)
https://idl.cs.washington.edu/files/2019-Draco-InfoVis.pdf
Draco 2 (2023, from UDub, University of Vienna, University of Maryland, CMU)
Experiments with Mac Studio
Last Week: exploring MLX on Mac Studio
Last Week: llama 2 70B with MLX
llama 2 -> MLX weights conversion example source code by Apple
Llama 2 with Hugging Face: quickstart module
allows running inference on
https://huggingface.co/meta-llama
either approach requires submitting a request to Meta and/or Hugging Face
Llama 2 with Hugging Face: quickstart module
https://github.com/ggerganov/llama.cpp
, which is mainly designed for quantized models (which require less compute and storage capacity)
Llama 2 with Hugging Face: 7B, sample prompt 1
Llama 2 with Hugging Face: 7B, sample prompt 2
Llama 2 with Hugging Face: 13B, sample prompt 1
Llama 2 with Hugging Face: 13B, sample prompt 2
Llama 2 with Hugging Face: 70B, sample prompt 1
Llama 2 with Hugging Face: 70B, sample prompt 2
Llama 2 with llama.cpp
why are we back?
https://huggingface.co/docs/transformers/benchmarks
Llama 2 with llama.cpp: 7B
“Input length is not significant for performance but important for hardware requirements”
source:
https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
Llama 2 with llama.cpp: interpreting console logs
explanation by project collaborator
Llama 2 with llama.cpp: 13B
Llama 2 with llama.cpp: 70B
Llama 2 with llama.cpp: comparison with other experiments
=> quantized to 4 bits (originally 16)
source:
https://towardsdatascience.com/apple-m3-machine-learning-speed-test-f1346e23a1b2
Llama 2 with llama.cpp: comparison with other experiments
source: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF
Llama 2 with llama.cpp: comparison with other experiments
source:
Llama 2 with llama.cpp: comparison with other experiments
source:
https://medium.com/@andreask_75652/benchmarking-apples-mlx-vs-llama-cpp-bbbebdc18416
Llama 2 with llama.cpp: comparison with other experiments
“Overall latency scales sub-linearly with model size” - from the databricks blog
mentioned in slide 189
70 / 13 = 5.38 (model size ratio); 209.19 (70B eval ms/token) / 44.62 (13B eval ms/token) = 4.69
Question from last week: suspicious prompt eval speed
from llama.cpp maintainer (llama-2-7b fp16):
my local replication:
Question from last week: suspicious prompt eval speed
screenshot from slide 196:
Optimizing LLM Inference: batching and vLLM
why batch?
Optimizing LLM Inference: batching and vLLM
naive/ static batching
Optimizing LLM Inference: batching and vLLM
continuous batching
sequences waiting for prefill / sequences still generating
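A toy scheduler sketch of the idea: finished sequences free their batch slot immediately, so a waiting request starts mid-flight instead of waiting for the whole batch to drain (illustrative step counting only, not vLLM’s actual scheduler):

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Each step generates one token for every active sequence; a finished
    sequence frees its slot right away, admitting a waiting request."""
    waiting = deque(requests)        # (name, tokens_to_generate)
    active, steps = {}, 0
    while waiting or active:
        while waiting and len(active) < max_batch:   # admit new sequences
            name, need = waiting.popleft()
            active[name] = need
        for name in list(active):                    # one decode step each
            active[name] -= 1
            if active[name] == 0:
                del active[name]                     # slot freed immediately
        steps += 1
    return steps

# static batching would run [a, b] for 3 steps (b idles after step 1),
# then c for 1 step: 4 total; continuous batching admits c as soon as
# b's slot frees
print(continuous_batching([("a", 3), ("b", 1), ("c", 1)]))  # → 3
```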
Optimizing LLM Inference: batching and vLLM
vLLM:
reference: https://blog.vllm.ai/2023/06/20/vllm.html
Optimizing LLM Inference: batching and vLLM
core problem addressed:
inefficient memory management of the KV cache
=> the KV pairs of all previous tokens are required to generate the next one
=> its size is large: up to 1.7 GB for a single sequence in LLaMA-13B
=> and dynamic: it depends on highly variable sequence length, which leads to wastage due to fragmentation and over-reservation
devised solution:
PagedAttention
=> allows the KV cache to be stored non-contiguously in memory by partitioning each sequence’s KV pairs into blocks, each dedicated to a fixed number of tokens
Optimizing LLM Inference: batching and vLLM
=> memory wastage only occurs at the last block
=> enables efficient memory sharing by mapping to the same physical block, which comes in handy when we need to generate multiple outputs from the same sequence
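A toy block-table sketch of the PagedAttention bookkeeping (block size and class names invented; vLLM’s real allocator is far more involved):

```python
BLOCK_SIZE = 4  # tokens per KV-cache block (vLLM uses e.g. 16)

class PagedKVCache:
    """Toy allocator: each sequence maps logical block numbers to physical
    blocks, so the cache need not be contiguous and waste is limited to
    the unfilled tail of the last block."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}           # seq_id -> [physical block ids]
        self.lengths = {}          # seq_id -> tokens stored

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:    # last block full: grab a fresh block
            self.tables.setdefault(seq_id, []).append(self.free.pop(0))
        self.lengths[seq_id] = n + 1

cache = PagedKVCache(num_blocks=8)
for _ in range(6):
    cache.append_token("seq0")     # 6 tokens -> 2 blocks, 2 slots unused
print(cache.tables["seq0"], cache.lengths["seq0"])  # → [0, 1] 6
```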
Optimizing LLM Inference: batching and vLLM
=> baseline: Hugging Face’s Pipelines library, which implements static batching
=> generation lengths sampled from an exponential distribution with mean = 128 tokens
=> FasterTransformer (an optimized model implementation) actually does pretty well
=> continuous batching clearly yields a strong improvement over static batching, and so does vLLM over the other three on the graph
Optimizing LLM Inference: batching and vLLM
QPS: queries per second, the expected rate of the Poisson distribution that request arrivals are sampled from
Optimizing LLM Inference: batching and vLLM
takeaway from the two graphs:
MLPerf: trouble we are running in
how do we run it?
MLPerf: trouble we are running in
last time, we looked at MLPerf Training results on “GPT-3”
MLPerf: comforting news
CTuning (MLCommons founding member, https://ctuning.org/)
MLPerf: comforting news
running BERT-large on Mac Studio M2 Ultra
MLPerf: comforting news
comparison with results published by MLCommons (MLPerf Inference: Edge)
https://mlcommons.org/benchmarks/inference-edge/
system | framework | offline (sample/s) | single stream (latency in ms) |
Macbook Pro M1 | onnx runtime | 1.69 | 597.08 |
Mac Studio M2 Ultra | onnx runtime | 6.48 | 580.58 |
MLPerf … on Mac?
link to drawing above
https://docs.google.com/drawings/d/1OeIpVmCSgLQk4JSc1364l8wdFaXFKoEAQIWrftISNCQ/edit
MLPerf … on Mac?
main obstacle
https://github.com/ml-explore/mlx-examples/tree/main/llms/llama
7b & 13b worked, 70b returns weird output
https://github.com/ml-explore/mlx-examples/tree/main/llms/mlx_lm
MLPerf … on Mac?
discussion with Awni Hannun
MLPerf … on Mac?
model: llama-2-7b-chat
framework | prompt evaluation (t/s) | token generation (t/s) |
llama.cpp | 83.42 | 39.72 |
mlx | 25.455 | 29.416 |
llama.cpp/ mlx | 3.27 | 1.35 |
MLPerf … on Mac?
model: llama-2-13b-chat
framework | prompt evaluation (t/s) | token generation (t/s) |
llama.cpp | 51.45 | 22.41 |
mlx | 14.164 | 18.662 |
llama.cpp/ mlx | 3.63 | 1.2 |
MLPerf … on Mac?
point being …
model: llama-2-70b-chat
framework | prompt evaluation (t/s) | token generation (t/s) |
llama.cpp | 11.87 | 4.78 |
mlx | 1.063 | 0.201 |
llama.cpp/ mlx | 11.16 | 23.78 |
will continue investigating
MLPerf … on Mac?
mlx v. llama.cpp
performance anomaly?
mlx v. llama.cpp
performance anomaly?
MLX GPU profile
Llama.cpp GPU profile
Llama.cpp parallelization
MLX representative workload: on reboot perf anomaly
MLX representative workload: on perf with diff loading strategy
MLX v. Metal: on perf with diff loading strategy
experiments below were conducted in environment:
mlx 0.6.0
https://github.com/MBZUAI-Research-Office/research_muchi
commit: fd2376d405b8fe786fe2851b828e435f40421edb
Almost at the root of the issue?
How weights loading strategy affects application performance
batch
sep
upon closer examination, matmul performance seems identical across loading strategies
Almost at the root of the issue?
batch
sep
zooming out …
we notice the space between computations is caused by driver (wire memory) activities making sure that there is physical memory backing the GPU resources consumed in Metal code
(
where could 256 MiB come from? Maybe …
size of weights per layer
8192 * 8192 * 4 / (1024 **2)
= 256 MiB
)
Almost at the root of the issue?
well … how does that space/ duration caused by driver activity change upon reboot?
for the first 2 seconds (after the computation starts), there were still plenty of driver (wire memory) activities
Almost at the root of the issue?
well … how does that space/ duration caused by driver activity change upon reboot?
after that, however, driver (wire memory) activities ceased => lowering the average and thus making the first run post reboot seem much faster than subsequent runs
space between computations diminishes along with decrease in driver (wire memory) activities!
Almost at the root of the issue?
2nd run post reboot
3rd run post reboot
What does this mean?
The performance degradation we observed with the sep loading strategy can also be accounted for by these driver (wire memory) activities
Sidenote: how does matmul in mlx work under the hood?
DBRX: distributed inference with M2 Ultras
What is DBRX?
DBRX: inference efficiency
setting:
What is MoE?
TL;DR
MoE models
reference: https://huggingface.co/blog/moe
What is MoE?
dense: refers to the feed-forward layer, which is really just linear combination followed by activation (in Llama-2: multi-layer perceptron)
traditional transformer encoder block
What is MoE?
traditional transformer vs transformer with MoE
the traditional/ dense feed-forward network (FFN) layer is replaced by a
sparse MoE layer that operates independently on the tokens in the sequence
- the router/ gate network decides which expert the token goes to
- output of MoE layer = router gate value * output of selected expert
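The two bullets above can be sketched as a toy top-k MoE forward pass for a single scalar "token" (everything here is made up for illustration: real experts are learned FFNs operating on vectors, and some implementations renormalize the gates over only the chosen k experts):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, experts, router_logits, top_k=2):
    """Sparse MoE layer for a single token (toy version).

    experts: list of callables, each a stand-in for an expert FFN
    router_logits: the router/ gate network's score for each expert
    """
    gates = softmax(router_logits)
    # pick the top-k experts by gate value
    chosen = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)[:top_k]
    # output of MoE layer = sum over chosen experts of (gate value * expert output)
    out = 0.0
    for i in chosen:
        out += gates[i] * experts[i](token)
    return out, chosen

# 4 toy "experts" that just scale their input
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
out, chosen = moe_forward(token=1.0, experts=experts,
                          router_logits=[0.1, 0.2, 2.0, 1.5], top_k=2)
```

Only the chosen experts are evaluated, which is exactly why the layer is "sparse": compute per token scales with k, not with the total number of experts.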
What is MoE?
what does this remind you of?
in addition,
So, how do we run distributed inference with DBRX?
example:
transformer encoder block from Google’s GShard
- attention weights are replicated across devices
- each device is responsible for a FFN/ expert (model parallelism: model partitioned across nodes)
reference: https://arxiv.org/abs/2006.16668
Requirements to run DBRX
132B params, each param being a bfloat16
=> 132 × 10^9 params × 2 bytes = 264 × 10^9 bytes = 264 GB
=> however, DBRX’s github repo README recommends having at least 320 GB of memory

right now, model implementations are available in
- TensorRT-LLM (requires CUDA)
- vLLM (requires CUDA)
- MLX (requires Apple Silicon, focus, commitment, and sheer will)
interestingly,
HuggingFace’s transformers library integrates an experimental project called PiPPy that aims to enable easy pipeline parallelism (where each node gets a portion of the model’s layers) for PyTorch => solves the memory problem, but with seemingly little performance benefit
https://github.com/pytorch/PiPPy/
Potential Strategy with M2 Ultras
M2 Ultra Cluster Communication latency
recap:
each of our Mac Studio M2 Ultras has 192GB of memory, but DBRX inference requires 264GB (Databricks even recommends having at least 320GB of memory)
our goal is to run distributed inference on a cluster (2 - 8 nodes), which should provide more than enough compute & memory resources; however, we need to make sure that communication between nodes does not become a bottleneck
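The recap's memory arithmetic as a quick sanity check (parameter count, bfloat16 width, and per-node memory all taken from the text above):

```python
import math

params = 132e9        # DBRX parameter count
bytes_per_param = 2   # bfloat16
weights_gb = params * bytes_per_param / 1e9   # 264 GB for the weights alone

m2_ultra_gb = 192     # unified memory per Mac Studio M2 Ultra
min_nodes = math.ceil(weights_gb / m2_ultra_gb)   # at least 2 nodes needed
```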
what information is being exchanged?
DBRX has …
[diagram: each of DBRX’s 40 layers (L_0 … L_39) contains attention (ATT), a router, and 16 experts (FFN_0 … FFN_15); for each token, the router chooses 4 of the 16 experts (e.g. to experts 0, 4, 7, 15); the 6144-dimensional float16 activations are dispatched to the chosen experts and their outputs aggregated]
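From the figures in the diagram (6144-dim float16 activations, 4 of 16 experts chosen per token), a back-of-the-envelope estimate of the per-token dispatch/ aggregate traffic (this assumes every chosen expert lives on a remote node, which overstates the real case when some experts are local):

```python
HIDDEN = 6144         # activation dimension from the diagram
BYTES_FP16 = 2
TOP_K = 4             # experts chosen per token (out of 16)

dispatch_bytes = TOP_K * HIDDEN * BYTES_FP16   # token activation sent to each chosen expert
aggregate_bytes = TOP_K * HIDDEN * BYTES_FP16  # each expert's output sent back
total = dispatch_bytes + aggregate_bytes       # => 98304 bytes (~96 KiB) per token per MoE layer
```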
M2 Ultra Cluster Communication latency
4 nodes (Thunderbolt with star topology, Ethernet with switch)
Protocol: gRPC
(avg latency includes serialization and deserialization time)
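As a rough feel for serialization cost at this payload size, one can time packing and unpacking a 6144-element float16 vector with Python's struct module (a toy stand-in; the actual measurements above use gRPC's own protobuf serialization, so these numbers are not directly comparable):

```python
import struct
import time

DIM = 6144                 # hidden dimension from the diagram
values = [0.5] * DIM

N = 100                    # repetitions to average over
start = time.perf_counter()
for _ in range(N):
    payload = struct.pack(f"{DIM}e", *values)    # 'e' = IEEE 754 half precision
    restored = struct.unpack(f"{DIM}e", payload)
elapsed_us = (time.perf_counter() - start) / N * 1e6  # avg microseconds per round trip

# each dispatch/ aggregate vector is 6144 * 2 = 12,288 bytes on the wire
```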
DBRX: distributed inference with M2 Ultras
recap
DBRX: distributed inference with M2 Ultras
our progress so far
where did we get stuck? (like for quite a while)
DBRX: distributed architecture design
DBRX: distributed inference with M2 Ultras
0.5 t/s performance breakdown
one layer
performance view from Instruments
expert calc, comm
wait/stall
DBRX: distributed inference with M2 Ultras
optimization strategies
updates from 4/30 PAS lab meeting
DBRX: distributed inference with M2 Ultras
latest employed optimization strategies
DBRX: distributed inference with M2 Ultras
model performance
workload: batch size = 1, input = 20 tokens, output = 64 tokens
version | prompt evaluation | token generation |
week_0: baseline | 1.3 t/s | 0.6 t/s |
week_1: with expert parallelization, weights loading in one-chunk, dense | 3 t/s | 2.4 t/s |
05/09/2024 weekly report
outline
DBRX distributed inference architecture design
DBRX distributed inference performance breakdown
version | attention + router (per layer) | MoE (per layer) | wait/ stall (per layer) | token generation (per prompt) |
v1: baseline | ~7 ms | ~30 ms | ~13 ms | ~0.5 t/s |
v1.5: load each expert’s weights as one-chunk; dense | ~4 ms | ~3.5 ms | ~3 ms | ~2.4 t/s |
v2: everything from v1.5; earlier driver eval; custom serialization | ~1.5 ms | ~3.5 ms | ~3 ms | ~3.1 t/s |
v3: everything from v2; architecture redesign (½ communication, ½ expert computation) | estimated ~1.5 ms | estimated ~2 ms | estimated ~1.5 ms | estimated ~5 t/s |
filename: dbrx_poc_one_custom_serialization
filename: dbrx_poc_batch2
filename: dbrx_poc_perf
Obstacles encountered during v3 development:
driver activity surfacing from idleness
sleep in ms | incurs heavy driver activity |
1 | False |
10 | False |
100 | False |
500 | True |
1000 | True |
driver activity here refers to:
time driver spent on wiring down sufficient physical memory for GPU computation
where does idleness come from (more on this next page):
wait during all-reduce (the shard that finishes first has to wait for the one that finishes last)
observation from v3 MoE layer design (of the 4 chosen experts, 2 are activated per shard):
waiting/sleeping between layers causes lengthy driver activity to surface
when sleep = 1000 ms, driver processing takes ~400 ms for a ~1 ms GPU computation
How idleness becomes a problem in v3
introducing v3.2,
our first step towards:
- streaming MoE processing (beneficial during prompt evaluation and when input batch size > 1)
- dynamic load balancing (beneficial even when batch size = 1, increases overlap between communication and computation during streaming)
new trouble with v3.2: needs warming up!
what do we mean by warm up?
sync computation time to reduce driver activity
distributed DBRX v3 performance evaluation
system | num GPUs | price in USD per GPU | communication | fp16 TFLOPS per GPU | bfloat16 TFLOPS per GPU | single user throughput (t/s) |
M2 Ultra | 4 | ~8,500 ? | 10 GbE switch | ~54 | ??? | ~6 |
L40S | 8 | ~9,287 | PCIe Gen4 x16: 64GB/s bidirectional | ~362 | ~362 | ~60 ? |
H100 | 4? | ~31,710 | NVLink: 900GB/s, PCIe Gen5: 128GB/s | ~1979 | ~1979 | ~115 ? |
Evaluation Goal:
A Performance Model that Takes Compute Power into Account
DBRX on A100: how fast is inter-GPU communication in terms of latency?
hardware setup: 1 node, 4 GPUs, 4 x 80 = 320GB memory, 600 GB/s GPU interconnect through NVLink
framework: TensorRT-LLM, tensor parallelism
communication pattern:
- per layer (× 40): 2 x AllReduce, once before attention, the other before MoE
- per token: 1 x AllGather after all 40 layers, to piece together the generated token
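This pattern implies 2 × 40 AllReduces plus 1 AllGather per generated token; combined with the measured per-operation averages (18 µs per AllReduce, 20 µs per AllGather, from the latency breakdown), the per-token total follows directly:

```python
layers = 40
allreduce_per_layer = 2          # one before attention, one before MoE
allreduce_count = layers * allreduce_per_layer   # 80 AllReduces per token
allgather_count = 1              # once after all layers, to piece together the token

ALLREDUCE_US = 18   # measured average latency (microseconds)
ALLGATHER_US = 20

total_us = allreduce_count * ALLREDUCE_US + allgather_count * ALLGATHER_US  # 1,460 us
```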
DBRX on A100: how fast is inter-GPU communication in terms of latency?
communication latency breakdown:
compared to us:
operation | n times per token | avg latency in microsecond(s) | total latency per token in microsecond(s) |
AllReduce | 2 * 40 = 80 | 18 | 1,440 |
AllGather | 1 | 20 | 20 |
total | | | 1,460 |
hardware setup | total communication latency per token in microsecond(s) |
A100 (1 node, 4 chips) | 1,460 |
M2 Ultra (2 nodes, 2 chips) | 60,000 |
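For scale, the gap between the two rows above (simple arithmetic on the table, no new measurement):

```python
a100_us = 1_460      # per-token communication latency, A100 node with NVLink
m2_us = 60_000       # per-token communication latency, 2 M2 Ultras over the network

ratio = m2_us / a100_us   # the M2 Ultra cluster pays ~41x more latency per token
```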
DBRX performance model