� Putting it all together
Systems design
In this lecture
Feedback on real world applications
Two reactions
⇒ Main takeaways
Key Starter Questions for:
Scale, Scale, Scale
Data models
Semi-structured
(e.g. user profile, web site activity)
[aka JSON documents]
Structured
(e.g. ads, purchases, product tables)
[aka relational tables]
Hybrid Data Systems
[SQL → noSQL → newSQL →
Hybrid SQL (now)]
Hashing, Sorting
Indexing, Map-Reduce
JOINs, Aggregates
Unstructured
(e.g. images, videos, docs)
Semi-structured
(e.g. user profile, web site activity)
Structured
(e.g. ads, purchases, product tables)
In reality, a mix of systems -- e.g., Amazon/e-commerce site
MySql
Oracle
SQL Server
BigQuery
Aurora
DynamoDB
BigTable
Cassandra
Mongo
Memcached
“DB Service”
Structured Relational
Semi-structured
Example: Alibaba’s data analysis design
Hybrid future
Big Scale
Lego Blocks
Roadmap
Hashing
Sorting
Counting
Primary data structures/algorithms
HashTables
(hashi(key) --> location)
BucketSort, QuickSort
MergeSort
HashTable + Counter
(hashi(key) --> <count>)
MergeSortedFiles
HashFunctions
(hashi(key) --> location)
HashFunctions
(hashi(key) --> location)
MergeSort
MergeSort
IO analysis
Recap
Sort N pages with B+1 buffer size
(vs n log n, for n tuples in RAM. Negligible for large data, vs IO -- much, much slower)
Sort N pages when N ~= B
(because (LogB 0.5) < 0)
Sorting of relational T with N pages
SortMerge and HashJoin for R & S
Where P(R) and P(S) are number of pages in R and S, when you have enuf RAM
Sort N pages when N ~= 2*B^2
(because (LogB B) = 1)
For SortMerge, if already sorted
For HashJoin, if already partitioned
~ 2 N
~3 * (P(R) + P(S)) + OUT
~ 4 N
~1 * (P(R) + P(S)) + OUT
B+ Trees
Recap
Let:
→ We need a B+ tree of height h =
In this lecture
Systems Design
Example:
Product CoOccur
Billion products
Product recommendations
Systems Design
Example:
Product CoOccur
Counting popular product-pairs
Story: Amazon/Walmart/Alibaba (AWA) need to compute ‘related products’ for all products so their users can explore and buy new products.
⇒ Goal: compute product pairs and their co-occur count, across all users
Data input: AWA’s product catalog is 1 billion items. Each product record is ~1MB with a product description and image. AWA has 10 billion user-product views each week, from 1 billion users. Each log record stores <userID, productID, viewID, viewTime>.
CoOccur Product Count
Product Views
Web browser
Counting in RAM
(pre-CS145)
Counting product views for billion product PAIRs
UserViews(UserId, ProductID)
|  | Nespresso Coffee | 
|  | Bread maker | 
|  | Kenwood Espresso | 
|  | ... | 
| Nespresso Coffee | Bread Maker | 301 | 
| Bread maker | Kenwood Espresso | 24597 | 
| Kenwood Espresso | Bike | 22 | 
|  |  | ... | 
CoOccur(ProductID, ProductID, count)
Algorithm: For each user, product pi pj
CoOccur[pi, pj] += 1
Pre-cs145
Compute in RAM
(Engg approximations)
Counting product views
| Input size (4 bytes for user, 4 bytes for productid) | ~1Bil * [4 + 4] = 8 GB | 
| Output size (4 bytes for productid, 4 bytes for count) | ~1Bil * [4 + 4] = 8 GB | 
Trivial
Counting product pair views
‘Trivial?’
(if you have ~25 Billion$, at 100$/16GB RAM)
Plan #1: P * P matrix for counters in RAM (4 bytes for count)
Plan #1 (on disk): Let OS page into memory as needed
(if you seek to update each counter)
Systems Design
Example:
Product CoOccur
Counting popular product-pairs
Your mission: Design an efficient system to compute co-occur counts on Sundays from weekly logs and produce a CoOccurCount table <productID, productID, count>
Post CS 145
Query
(SQL)
Relational Algebra (RA) Plan
Optimized RA Plan
Execution
Write a query like
SELECT …
FROM LogOfViews v1, LogOfViews v2
WHERE ...
GROUP BY v1.productId, v2.productId
HAVING count(*) > 1 billion
Data Systems Design
Example:
Product CoOccur
Popular Systems design pattern
CoOccur Product Count
Product Views
Web browser
CoOccur Index
Lookup
Index
Related
Product
Optimize,
Evaluate design plans
Cost in I/O, resources?
To query, maintain?
Build Query Plans
Analyze Plans
Systems Design
Example:
Product CoOccur
Pre-design
|  | Size | Why? | 
| ProductId |  |  | 
| UserID |  | “ | 
| LogOfViewsID |  |  | 
| Product |  |  | 
| LogOfViews |  |  | 
| CoOccur |  |  | 
| TempCoOccur (with UserInterest assumption, of ~10 views/user) |  |  | 
1 Billion products ⇒ Need at least 30 bits (2^30 ~= 1 Billion) to represent each product uniquely. So use 4 bytes.
10 Billion product views.
1 Billion products of 1 MB each
Each record is <userID, productID, viewID, viewtime>. Assume: we use 8 bytes for viewTime. So that’s 24 bytes per record. 10 Billion*24 bytes = 240 GBs.
The output should be <productID, productID, count> for the co-occur counts. That is, 12 bytes per record (4 + 4 + 4 for the two productIDs and 4 bytes for count). To keep top billion product pairs (as recommended by AWA data quality), you need 1 billion * 12 bytes = 12 GBs.
# product pairs produced: 1 billion users * 10^2 = 100 billion
Size @8 bytes/record = 800 GBs
4 bytes
4 bytes
8 bytes
1 PB
240 GB
(4 M pages)
12 GB
800 GB
(12.5 M pages)
Systems Design
Example:
Product CoOccur
Plans?
|  | 
Plan#2: With 1 machine, Analyze with UserInterest assumption.
Plan#1: With 1 machine, use RAM to count (Cost = 25B$ or > 100 milllion years).
TempCoOccurLog
(After Step 1)
Sorted TempCoOccurLog
(After Step 2)
Count sorted TempCoOccurLog
(After Step 3)
Plan 2
Systems Design
Example:
Product CoOccur
Plan #2
| Steps | Cost (time) | Why? | 
| Scan LogOfViews  Append <p_i, p_j> to TempCoOccurLog | ~2400 secs  ~8000 secs | 240GB @100 MB/sec  800 GB @100 MB/sec | 
| Externally sort TempCoOccurLog on disk  (Assume sort cost is ~2N, where N is number of pages for table and B is number of buffers, and B ~~ N) | ~16,000 secs   | IO cost is (appx) 2 * (1 seek + scan cost for 12.5 million pages* 64 KB/per page) = 2* scan cost of 800 GBs. That is, 16000 secs (2*800 GB @100 MB/sec).  Assume TempCoOccurLog (and runs) are stored sequentially. | 
| Scan TempCoOccurLog (sorted) and keep counts in CoOccur | ~8000 secs | 800 GB @100 MB/sec | 
Systems Design
Example:
Product CoOccur
Plan #2
| Steps | Cost (IO) | Why? | 
| Scan LogOfViews  Append <p_i, p_j> to TempCoOccurLog | 4 M  12.5M | 240GB (4 M pages)  800 GB (12.5M pages) | 
| Externally sort TempCoOccurLog on disk  (Assume sort cost is ~2N, where N is number of pages for table and B is number of buffers, and B ~~ N) | 25M   | IO cost is (appx) = 2*N = 2*12.5M  | 
| Scan TempCoOccurLog (sorted) and keep counts in CoOccur | 12.5M | 800 GB | 
Total IO cost = (4M+ 12.5M +25M + 12.5M) = 54M
Recall: If stored sequentially (i.e., scan at 100 MBps, then time (secs)
= (54M * 64 KB) /100 MBps = ~34.5K secs
Optimize,
Evaluate design plan 1, plan2, plan 3, ...
Cost in I/O, resources?
To query, maintain?
Build Query Plans
Analyze Plans
Data Systems Design
Example:
Product CoOccur
CoOccur Products
Product Views
Web browser
CoOccur Index
Lookup
Index
Related
Product
Systems Design
Example:
Product CoOccur
B+ tree index
Evaluate the cost of lookups in a B+ tree, with search key=productId. How many IO lookups can we expect if we had 1 GB of RAM for the index? (CoOccur data may not be clustered)
CoOccur
<pid, cooccur_pid, count>
...
...
CoOccur Index
<pid, pointer>
M: 1 billion tuples * (4 + 4+ 4) bytes each (~167k pages)
N: 1 billion tuples * (4 + 8) bytes each (~= 167k pages)
N/F = 250,000 (with slack)
...
Contrast Example 2: Index on Product data? [Recall: 1 billion tuples * 1 M B each = 1 PB]
Systems Design
Example:
Product CoOccur
B+ tree index
| N/F | 250,000 | From previous page | 
| How large is f? | ~5460 |  | 
4 bytes for productId + 8 bytes for pointers @ 64KB/page�f * 4 + f*8 <= 64k ⇒ f ~= 5460
CoOccur
<pid, cooccur_pid, count>
..
...
CoOccur Index
<pid, pointer>
...
Leaf: N/F = 250,000 pages (167k, F=2/3)
(h = 2, Each leaf can point upto 5460 search keys)
Root: 1 page, with 5460 pointers
Level 1: <= 5460 pages, with 5460 ptrs each to next level
Data: 1 billion records with 12 bytes each
Recall We need a B+ tree of height h =
Data Systems Design
Example:
Product CoOccur
CoOccur Products
Product Views
CoOccur Index
Products
Product index
App/web browser
N/F = 250,000
N/F = 250,000
M = 167,000
M = 15.6 Billion
~0.5M index
Pages ~= 32 GBs
⇒ Keep in RAM
In RAM
Fast access
In datacenter
User latency?
1. Lookup Product Index
2. Lookup Products image
3. Lookup CoOccur Index
[1] + [2] + [3] < 100 msecs
Data Systems Design
Example:
Bigger Product CoOccur
Problem so far
Consider 1000x Bigger problem!
⇒ What changes?
Systems Design
Example:
Product CoOccur
Pre-design
|  | Size | Why? | 
| ProductId | 8 bytes |  | 
| UserID | 4 bytes | “ | 
| LogOfViewsID | 8 bytes |  | 
| Product | 1000 PB |  | 
| Users | Unknown |  | 
| LogOfViews | 280 GB |  | 
| CoOccur | 20 GBs |  | 
| TempCoOccur (with UserSession assumption, of ~10 views/user) | 1600 GB |  | 
1 trillion products ⇒ Need at least 40 bits (2^40 ~= 1 Trillion) to represent each product uniquely. So use 8 bytes (i.e 64 bits).
10 Billion product views.
1 Trillion products of 1 MB each
Each record is <userID, productID, viewID, viewtime>. Assume: we use 8 bytes for viewTime. So that’s 28 bytes per record. 10 Billion*28 bytes = 280 GBs.
The output should be <productID, productID, count> for the co-occur counts. That is, 20 bytes per record (8 + 8 + 4 for the two productIDs and 4 bytes for count). To keep top billion product pairs (as recommended by AWA data quality), you need 1 billion * 20 bytes = 20 GBs.
# product pairs produced: 1 billion users * 10^2 = 100 billion
Size @16 bytes/record = 1600 GBs.
Data Systems Design
Popular Systems design pattern
Popular problems
Key Goal:
Scale, Scale, Scale
Histograms & IO Cost Estimation
Optimization
Roadmap
Cost in I/O, resources?
To query, maintain?
Build Query Plans
Analyze Plans
Example
Stats for spatial and temporal data
E.g.carpool or
Traffic model
E.g. user
traffic
Histograms
Example
Values
Frequency
How do we compute how many values between 8 and 10?
(Yes, it’s obvious)
Problem: counts take up too much space!
What if we kept average only?
How much space do the full counts (bucket_size=1) take?
How much space do the uniform counts (bucket_size=ALL) take?
And Average Error?
E.g., uniform count = 3 (average)
Fundamental Tradeoffs
So how do we compute the “bucket” sizes?
Equi-width
Partition buckets into roughly same width (value range)
Equi-depth
Partition buckets for roughly same number of items (total frequency)
Histograms
Maintaining Histograms
Compressed Histograms
One popular approach
People continue to try fancy techniques here wavelets, graphical models, entropy models,…
Optimization
Roadmap
Cost in I/O, resources?
To query, maintain?
Build Query Plans
Analyze Plans