Putting it all together
Systems design
In this lecture
Big Scale
Lego Blocks
Roadmap
Hashing
Sorting
Primary data structures/algorithms
HashTables
(hash(key) --> location)
BucketSort, QuickSort
MergeSort
MergeSortedFiles
HashFunctions
(hash(key) --> location)
IO analysis
Recap
| Operation | IO Cost | Why? |
|---|---|---|
| Sort relation T with N pages, with B+1 buffer size | ~2N * (1 + ceil(log_B (N / 2B))) | vs. n log n for n tuples in RAM: CPU cost is negligible for large data vs. IO, which is much, much slower |
| Sort N pages when N ~= B | ~2N | because log_B 0.5 < 0: a single read/write pass |
| Sort N pages when N ~= 2*B^2 | ~4N | because log_B B = 1: two passes |
| SortMerge Join or HashJoin of R and S | ~3 * (P(R) + P(S)) + OUT | where P(R) and P(S) are the number of pages in R and S, when you have enough RAM |
| SMJ if inputs already sorted; HashJoin if already partitioned | ~1 * (P(R) + P(S)) + OUT | only one read pass over each input |
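These recap formulas can be sanity-checked with a small calculator. A sketch (the helper names `sort_ios` and `join_ios` are hypothetical, and the pass count uses the lecture's ceil(log_B(N/2B)) approximation):

```python
import math

def sort_ios(n_pages, B):
    # Lecture approximation: ~2N IOs per pass; extra merge passes ~= ceil(log_B(N / 2B))
    passes = 1 + max(0, math.ceil(math.log(n_pages / (2 * B), B)))
    return 2 * n_pages * passes

def join_ios(p_r, p_s, out, preprocessed=False):
    # SMJ/HashJoin: ~3*(P(R)+P(S)) + OUT; ~1*(...) + OUT if already sorted/partitioned
    return (1 if preprocessed else 3) * (p_r + p_s) + out

B = 3000
print(sort_ios(B, B))          # N ~= B     -> ~2N (6,000 IOs, one pass)
print(sort_ios(2 * B * B, B))  # N ~= 2*B^2 -> ~4N (72,000,000 IOs, two passes)
```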
Systems Design
Example:
User Cohorts
Billion products
User searches for “coffee machine”
SMJs then GROUP BYs?
// Query: Demographic split of users, by product interest
SELECT ..., COUNT(*) ...
FROM Product p, UserViews u
WHERE p.productid = u.productid
GROUP BY u.age, u.gender, u.city
In SMJ, what do you sort/merge on?
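For the query above, SMJ sorts both inputs on the join key, productid, then merges. A minimal in-memory sketch of that merge logic (field names assumed from the query; a real SMJ sorts runs on disk):

```python
def sort_merge_join(product, user_views):
    # Sort BOTH inputs on the join key: productid
    product = sorted(product, key=lambda p: p["productid"])
    user_views = sorted(user_views, key=lambda v: v["productid"])
    out, i, j = [], 0, 0
    while i < len(product) and j < len(user_views):
        pid, vid = product[i]["productid"], user_views[j]["productid"]
        if pid < vid:
            i += 1
        elif pid > vid:
            j += 1
        else:
            # emit all user_views rows matching this productid
            k = j
            while k < len(user_views) and user_views[k]["productid"] == pid:
                out.append((product[i], user_views[k]))
                k += 1
            i += 1
    return out

product = [{"productid": 1, "name": "coffee machine"},
           {"productid": 2, "name": "grinder"}]
views = [{"productid": 2, "userid": 7},
         {"productid": 1, "userid": 9}]
print(len(sort_merge_join(product, views)))  # 2 joined rows
```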
Chaining Sorts, SMJs, HPs
Algorithms can be chained (like functions)
(e.g., SMJ, Sorts (aka ExternalSorts), HPs, HPJs)
General Composition
[Figure: Input → Split → chains of operators (Sort1, SMJ2, HP3, …, Sortk, …, HPJn) → Output]
Special cases
Distributed File System
Chain algorithms
In this lecture
Systems Design
Example:
Product
Search & CoOccur
Billion products
User searches for “coffee machine”
Product recommendations
Systems Design
Example:
Product
Search & CoOccur
Counting popular product-pairs
Story: Amazon/Walmart/Alibaba (AWA) want to sell products
⇒ Goal: compute product pairs and their co-occur count, across all users
Data input:
Data Systems Design
Example:
Product Search&
CoOccur
Goal
CoOccur Products
UserViews
CoOccur Index
Products
Products index
App/web browser
User search for product
1. Lookup Products Index
2. Lookup Products image
3. Lookup CoOccur Index
[1] + [2] + [3] < 100 msecs??
4. Capture user browsing info to UserViews
Plan #1
Counting in RAM
Counting product views for billion product PAIRs
UserViews(UserId, ProductID, …, ...)
| Nespresso Coffee |
| Bread maker |
| Kenwood Espresso |
| ... |
Nespresso Coffee | Bread Maker | 301 |
Bread maker | Kenwood Espresso | 24597 |
Kenwood Espresso | Bike | 22 |
| ... | ... | ... |
CoOccur(ProductID1, ProductID2, count)
Algorithm: For each user, and for each pair of products (pi, pj) that user viewed:
CoOccur[pi, pj] += 1
Idea: if CoOccur[pi, pj] > 1 million (e.g.), the pair is ‘related.’ A classic idea; tweak it for AI inferencing/personalization.
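The counting loop, sketched in RAM (assuming UserViews rows arrive as (userId, productId) pairs; in practice the threshold would be the slide's ~1 million, not the tiny demo counts here):

```python
from collections import defaultdict
from itertools import combinations

def cooccur_counts(user_views):
    # Group each user's viewed products
    by_user = defaultdict(set)
    for user_id, product_id in user_views:
        by_user[user_id].add(product_id)
    # CoOccur[pi, pj] += 1 for every product pair a user viewed together
    counts = defaultdict(int)
    for products in by_user.values():
        for pi, pj in combinations(sorted(products), 2):
            counts[pi, pj] += 1
    return counts

views = [(1, "Nespresso Coffee"), (1, "Bread maker"),
         (2, "Nespresso Coffee"), (2, "Bread maker"), (2, "Bike")]
print(cooccur_counts(views)[("Bread maker", "Nespresso Coffee")])  # 2
```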
Pre-cs145
Compute in RAM
(Engineering approximations)
Counting product views
Input size (4 bytes for user, 4 bytes for productid) | ~1 Billion * [4 + 4] = 8 GB
Output size (4 bytes for productid, 4 bytes for count) | ~1 Billion * [4 + 4] = 8 GB
Trivial
Counting product pair views
Plan #1: ~P * P/2 matrix for counters in RAM (4 bytes for count)
Plan #1 (on disk): Let OS page into memory as needed
(if you seek to update each counter)
⇒ Need to do some design work, with cs145
[Figure: 1-Billion × 1-Billion upper-triangular matrix of pair counters, p1 … p1b on each axis]
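Why Plan #1 needs design work, as quick back-of-the-envelope arithmetic (an engineering approximation):

```python
P = 10**9                # ~1 Billion products
counters = P * P // 2    # upper-triangular ~P * P/2 pair matrix
ram_bytes = 4 * counters # 4 bytes per count
print(ram_bytes)         # 2_000_000_000_000_000_000 bytes = 2 exabytes
# Far beyond any machine's RAM (or a single disk); random counter
# updates paged in by the OS would thrash, hence Plan #2 and #3.
```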
Systems Design
Example:
Product CoOccur
Counting popular product-pairs
Your mission: Compute a table – CoOccur <productID1, productID2, count>, with co-occur counts from last week’s data
Optimize,
Evaluate design plan 1, plan 2, plan 3, ...
Cost in I/O, resources?
To query, maintain?
Build Query Plans
Analyze Plans
Post CS 145
Query
(SQL)
Relational Algebra (RA) Plan
Optimized RA Plan
Execution
Write a query like
SELECT TOP(..)…
FROM UserViews v1, UserViews v2
WHERE ...
GROUP BY v1.productId, v2.productId
HAVING count(*) > 1 million
Query plan#2
Systems Design
Example:
Product CoOccur
Pre-design
| Item | Size | Why? |
|---|---|---|
| ProductId | 4 bytes | 1 Billion products ⇒ need at least 30 bits (2^30 ~= 1 Billion) to represent each product uniquely; so use 4 bytes. |
| UserID | 4 bytes | Same reasoning. |
| ViewID | 8 bytes | 10 Billion product views. |
| Products | P(Products) = 15.625M pages | 1 Billion products of 1 MB each. Size = 1 PB = 15.625M pages. NumSearchKeys(Products) = 1 Billion. |
| UserViews | P(UserViews) = 3750 pages | Each record is <userID, productID, viewID, viewtime>. Assume 8 bytes for viewtime, so 24 bytes per record. 10 Billion * [4+4+8+8] bytes = 240 GB = 3750 pages. |
| ProductPairs for all UserViews (we’ll see how to use this in 3 slides) | P(Temp) = 5625 pages | Given: blowup-factor = 4.5, and 10 Billion UserViews. Each record is <productID, productID>, at 8 bytes per record: 45 Billion * 8 bytes = 360 GB = 5625 pages. |
| CoOccur (for top 3 Billion) | P(CoOccur) = 563 pages | Each record is <productID1, productID2, count>: 12 bytes per record (4 + 4 for the two productIDs, 4 for the count). 3 Billion * 12 bytes = 36 GB = 563 pages. NumSearchKeys(CoOccur) = 1 Billion. |
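The page counts in the table follow from straightforward arithmetic, assuming 64 MB pages in decimal units (1 PB / 15.625M pages = 64 MB):

```python
PAGE = 64_000_000                           # 64 MB per page (decimal units)
views = 10**10                              # 10 Billion UserViews
user_views_bytes = views * (4 + 4 + 8 + 8)  # <userID, productID, viewID, viewtime>
temp_bytes = int(views * 4.5) * 8           # blowup-factor 4.5; 8 bytes per <pid, pid>
cooccur_bytes = 3 * 10**9 * (4 + 4 + 4)     # top 3 Billion pairs, 12 bytes each

print(user_views_bytes // PAGE)             # 3750 pages
print(temp_bytes // PAGE)                   # 5625 pages
print(-(-cooccur_bytes // PAGE))            # 563 pages (562.5, rounded up)
```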
Computing CoOccurs [B = 3000]
Systems Design
Example:
Product CoOccur
Plan #2
- OUTPUT Temp=<v1.productid, v2.productid>
IO Cost = SMJ(UserViews, UserViews)
~= 3*P(UserViews) + P(Temp)
IO Cost = Sort(Temp)
~= 2*P(Temp) + P(CoOccur)
Engineering Cost Approximation [B = 3000]
Systems Design
Example:
Product CoOccur
Plan #2
[See Notebook for
impact of B values]
Systems Design
Example:
Product CoOccur
Plan #2
| Steps | Cost (IOs) | Why? |
|---|---|---|
| SMJ(UserViews, UserViews) on UserID → Temp | 3*P(UserViews) + P(Temp) = 3*3750 + 5625 = 16,875 IOs (~1080 GB @ 100 MB/sec ≈ 10,800 secs) | P(UserViews) = 3750, P(Temp) = 5625 |
| Sort(Temp), GROUP BY → CoOccur | 2*P(Temp) + P(CoOccur) = 2*5625 + 563 = 11,813 IOs (~756 GB @ 100 MB/sec ≈ 7,560 secs) | P(CoOccur) = 563 |
| Total IO cost | 28,688 IOs ~= 18,647 secs | ~1.8 TB @ 100 MB/sec, plus ~287 secs of 10-msec accesses |

Recall: HDD Scan at 100 MB/sec, Access = 10 msecs
⇒ Could optimize by ~2x with more tweaks. Good enough for cs145. (Went from $6B or 1 million years to the above.)
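Reproducing the totals as arithmetic (64 MB pages, HDD scan at 100 MB/sec, 10 msec access per IO):

```python
p_userviews, p_temp, p_cooccur = 3750, 5625, 563
smj_ios  = 3 * p_userviews + p_temp    # 3*P(UserViews) + P(Temp) = 16,875 IOs
sort_ios = 2 * p_temp + p_cooccur      # 2*P(Temp) + P(CoOccur)  = 11,813 IOs
total_ios = smj_ios + sort_ios         # 28,688 IOs

scan_secs   = total_ios * 64 / 100     # 64 MB/page @ 100 MB/sec
access_secs = total_ios * 0.010        # 10 msec access per IO
print(total_ios, round(scan_secs + access_secs))  # 28688 18647
```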
Systems Design
Example:
Product CoOccur
B+ tree index
Build indexes with search key=productId. (Assume: data not clustered)
CoOccur
<pid, cooccur_pid, count>
...
...
CoOccur Index
<pid, pointer>
...
Products
<pid, image, …>
...
...
Products Index
<pid, pointer>
...
NumSearchKeys(CoOccur) = 1 Billion
P(CoOccur) = 563
NumSearchKeys(Products) = 1 Billion
P(Products) = 15.6M
What’s f and h?
(4 byte productId + 8 byte pointer @ 64MB/page)
Cost of lookup? (Worst case, with only root in RAM)
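A sketch of the f-and-h arithmetic, assuming 64 MB pages and the 12-byte index entry stated above:

```python
import math

page  = 64_000_000                 # 64 MB page
entry = 4 + 8                      # 4-byte productId + 8-byte pointer
f = page // entry                  # fanout f ~= 5.3 million entries per node
h = math.ceil(math.log(10**9, f))  # height to cover 1 Billion search keys
print(f, h)                        # 5333333 2
# Worst-case lookup with only the root in RAM: ~h index-page IOs,
# plus one more IO to fetch the record itself (data not clustered).
```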
Data Systems Design
Example:
Product CoOccur
CoOccur Products
Product Views
CoOccur Index
Products
Product index
App/web browser
In RAM
Fast access
In datacenter
User latency?
1. Lookup Product Index
2. Lookup Products image
3. Lookup CoOccur Index
[1] + [2] + [3] < 100 msecs
Ans:
Overall, ~= 40 (+60) msecs even with HDD
Data Systems Design
Popular Systems design pattern
Popular problems
Histograms & IO Cost Estimation
Optimization
Roadmap
Cost in I/O, resources?
To query, maintain?
Build Query Plans
Analyze Plans
Example
Stats for spatial and temporal data
(E.g., carpool or traffic models; user traffic)
Histograms
Example
Values
Frequency
How do we compute how many values between 8 and 10?
(Yes, it’s obvious)
Problem: counts take up too much space!
What if we kept average only?
Space vs Error tradeoff
E.g., uniform count = 3 (average)
Real
Approximate
Fundamental Tradeoffs
So how do we compute the “bucket” sizes?
Equi-width
Partition buckets into roughly same width (value range)
(E.g., 1st 3 values have average = 2.67, vs next 3 with avg = 1.33. Keep (2.67, 1.33, 5, 1, 5) as the averages for the 5 buckets)
Equi-depth
Partition buckets for roughly same number of items (total frequency)
(E.g., bucket totals 2+3+3+1 = 9, 8+4 = 12, 9, and 1+2+1+5 = 9: roughly equal total frequency per bucket)
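Both bucketing schemes can be sketched generically (helper names hypothetical; greedy equi-depth is one simple heuristic, and the frequencies below are illustrative rather than the slide's exact histogram):

```python
def equi_width(freqs, n_buckets):
    # Split the value domain into equal-width ranges; keep each range's average.
    width = -(-len(freqs) // n_buckets)  # ceiling division
    return [sum(freqs[i:i + width]) / len(freqs[i:i + width])
            for i in range(0, len(freqs), width)]

def equi_depth(freqs, n_buckets):
    # Greedily split so each bucket holds ~the same total frequency.
    target = sum(freqs) / n_buckets
    buckets, cur = [], []
    for f in freqs:
        cur.append(f)
        if sum(cur) >= target and len(buckets) < n_buckets - 1:
            buckets.append(cur)
            cur = []
    buckets.append(cur)
    return buckets

print(equi_width([2, 3, 3, 1, 2, 1], 2))  # averages ~ [2.67, 1.33]
```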
Histograms
Maintaining Histograms
Compressed Histograms
One popular approach
People continue to try fancy techniques here: wavelets, graphical models, entropy models, …
Optimization
Roadmap
Cost in I/O, resources?
To query, maintain?
Build Query Plans
Analyze Plans
Data models
Semi-structured
(e.g. user profile, web site activity)
[aka JSON documents]
Structured
(e.g. ads, purchases, product tables)
[aka relational tables]
Key Starter Questions for:
Scale, Scale, Scale
Hybrid Data Systems
Hashing, Sorting
Indexing, Map-Reduce
JOINs, Aggregates
Unstructured
(e.g. images, videos, docs)
Semi-structured
(e.g. user profile, web site activity)
Structured
(e.g. ads, purchases, product tables)
In reality, a mix of systems -- e.g., Amazon/e-commerce site
MySql
Oracle
SQL Server
BigQuery
Aurora
DynamoDB
BigTable
Cassandra
Mongo
Memcached
DB Service
Structured Relational
Semi-structured
Example: Alibaba’s data analysis design
Hybrid future
Data
Algorithms
Language/Tools