1 of 40

Putting it all together

Systems design

2 of 40

In this lecture

  • Quick recap of scale

  • A full systems example putting things together
    • How to build Amazon’s Products and Recommendations page

3 of 40

Big Scale Roadmap: Lego Blocks

[Diagram: the primary data structures/algorithms ("Lego blocks") for big scale:
  • Hashing: HashFunctions, HashTables (hash(key) --> location)
  • Sorting: BucketSort, QuickSort, MergeSort, MergeSortedFiles]

4 of 40

IO analysis Recap

  • Sorting of relational T with N pages, with B+1 buffer size:
    ~2N * (1 + ⌈log_B(N/2B)⌉) IOs
    (vs n log n for n tuples in RAM. CPU cost is negligible for large data vs IO, which is much, much slower)
    • Sort N pages when N ~= B: ~2N (because log_B(0.5) < 0, so one pass suffices)
    • Sort N pages when N ~= 2*B^2: ~4N (because log_B(B) = 1, so two passes)

  • SortMerge and HashJoin for R & S: ~3 * (P(R) + P(S)) + OUT,
    where P(R) and P(S) are the number of pages in R and S, when you have enough RAM
    • For SortMerge, if already sorted: ~1 * (P(R) + P(S)) + OUT
    • For HashJoin, if already partitioned: ~1 * (P(R) + P(S)) + OUT
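To make the recap concrete, here is a minimal Python sketch of this cost model (the function name and the B = 3000 value are illustrative; B = 3000 matches the buffer count used later in this deck):

    import math

    def sort_ios(n_pages, b):
        """External merge sort: pass 0 makes initial runs of ~2B pages,
        then each pass merges B runs, so IOs ~= 2N * (1 + ceil(log_B(N/2B)))."""
        merge_passes = max(0, math.ceil(math.log(n_pages / (2 * b), b)))
        return 2 * n_pages * (1 + merge_passes)

    B = 3000
    N1, N2 = B, 2 * B * B
    print(sort_ios(N1, B) / N1)   # N ~= B:     2.0 -> ~2N (one pass)
    print(sort_ios(N2, B) / N2)   # N ~= 2*B^2: 4.0 -> ~4N (two passes)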

5 of 40

Systems Design Example: User Cohorts

  • Billion products
  • User searches for “coffee machine”

6 of 40

SMJs then GROUP BYs?

// Query: Demographic split of users, by product interest

SELECT ..., COUNT(*) ...
FROM Product p, UserViews u
WHERE p.productid = u.productid
GROUP BY u.age, u.gender, u.city

In SMJ, what do you sort/merge on?

  • (a) productid? (b) <u.age, u.gender, u.city>?

7 of 40

Chaining Sorts, SMJs, HPs

  • Step 1: SMJ(p.productid = u.productid). Specifically,
    1. Sort(Product by productid), Sort(UserViews by productid)
    2. MergeJoin and produce Temp=<productid, u.age, u.gender, u.city>

  • Step 2: // Handle groupbys
    • Sort(Temp, <u.age, u.gender, u.city>)
    • Scan sortedTemp, and make GROUPBYs for u.age, u.gender, u.city
    • [Notice: in Step 2(a), you could also use HP(Temp, by <u.age, u.gender, u.city>)]

  • Step 3: Further process… (e.g., for indexing or for a sub-select, etc.)

Algorithms can be chained (like functions), as the sketch below shows

(e.g., SMJ, Sorts (aka ExternalSorts), HPs, HPJs)
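A minimal in-memory Python sketch of the chain above (toy data; sorted() and Python lists stand in for ExternalSort and on-disk files):

    from itertools import groupby
    from operator import itemgetter

    def merge_join(r, s, rk, sk):
        """Merge-join two relations already sorted on columns rk and sk."""
        out, j = [], 0
        for rrow in r:
            while j < len(s) and s[j][sk] < rrow[rk]:
                j += 1
            k = j
            while k < len(s) and s[k][sk] == rrow[rk]:
                out.append(rrow + s[k])          # concatenate matching rows
                k += 1
        return out

    product = sorted([(7, 'coffee machine'), (9, 'bread maker')])    # <productid, name>
    userviews = sorted([(7, 31, 'F', 'SF'), (9, 31, 'F', 'SF'),
                        (7, 24, 'M', 'NY')])                         # <productid, age, gender, city>

    # Step 1: both inputs sorted on productid above; MergeJoin -> Temp
    temp = [(row[0], row[3], row[4], row[5])
            for row in merge_join(product, userviews, 0, 0)]

    # Step 2: Sort Temp on <age, gender, city>, scan, and count each group
    temp.sort(key=itemgetter(1, 2, 3))
    for grp, rows in groupby(temp, key=itemgetter(1, 2, 3)):
        print(grp, len(list(rows)))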

8 of 40

General Composition

[Diagram: Input → Split → chains of operators (Sort1, SMJ2, HP3, … Sortk, … HPJn) → Output, all over a Distributed File System. Chain algorithms.]

Special cases

  1. MapReduce: Input → [Split → HP(keys) → HPJ] → Output
  2. Shuffle: HP (e.g., on next step’s GROUP BYs)
    1. Typical optimization – save time by using only RAM, without writing OUT to HDD/SSD.
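A toy Python sketch of the MapReduce special case: hash-partition ("shuffle") on the next step's GROUP BY key, then aggregate each partition independently (all names here are illustrative):

    from collections import Counter, defaultdict

    def hash_partition(records, key, n_parts):
        """HP step: route each record to a partition by hashing its key."""
        parts = defaultdict(list)
        for rec in records:
            parts[hash(key(rec)) % n_parts].append(rec)
        return parts

    views = [('u1', 'p1'), ('u1', 'p2'), ('u2', 'p1'), ('u3', 'p1')]
    # Shuffle on productid; each partition could now be reduced on its own machine.
    for part in hash_partition(views, key=lambda r: r[1], n_parts=4).values():
        print(Counter(pid for _, pid in part))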

9 of 40

In this lecture

  • Quick recap of scale

  • A full systems example putting things together
    • How to build Amazon’s Products and Recommendations page

10 of 40

Systems Design Example: Product Search & CoOccur

  • Billion products
  • User searches for “coffee machine”
  • Product recommendations

11 of 40

Systems Design Example: Product Search & CoOccur -- Counting popular product-pairs

Story: Amazon/Walmart/Alibaba (AWA) want to sell products

  1. Problem 1: AWA wants fast user searches for products
  2. Problem 2: AWA shows ‘related products’ for all products
    • Using collaborative filtering (‘wisdom of crowds’) from historical website logs.
    • Each time a user views a set of products, those products are related (they co-occur)

⇒ Goal: compute product pairs and their co-occur count, across all users

Data input:

  • AWA has 1 billion products. Each product record is ~1MB (descriptions, images, etc.).
  • AWA has 10 billion UserViews each week, from 1 billion users. Stored in UserViews, each row has <userID, productID, viewID, viewTime>.

12 of 40

Data Systems Design Example: Product Search & CoOccur -- Goal

[Diagram: App/web browser ↔ serving system.
  • Serving path: 1. Lookup Products Index; 2. Lookup Products image; 3. Lookup CoOccur Index. Target: [1] + [2] + [3] < 100 msecs??
  • 4. Capture user browsing info to UserViews.
  • Batch path: UserViews → CoOccur Products → CoOccur Index; Products → Products index.]

13 of 40

Plan #1: Counting in RAM

Counting product views for billion product PAIRs

UserViews(UserId, ProductID, …, ...) feeds a pair-count table CoOccur(ProductID1, ProductID2, count). E.g.:

  • <Nespresso Coffee, Bread Maker>: 301
  • <Bread Maker, Kenwood Espresso>: 24597
  • <Kenwood Espresso, Bike>: 22
  • ...

Algorithm: For each user, for each product pair <pi, pj> the user viewed: CoOccur[pi, pj] += 1

Idea: If CoOccur[pi, pj] > 1 million (e.g.), the pair is ‘related.’ Classic idea. Tweak for AI inferencing/personalization.
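A direct in-RAM rendering of this algorithm in Python (toy data and a toy threshold; the next slides show why this exact plan cannot scale to a billion products):

    from collections import Counter, defaultdict
    from itertools import combinations

    user_views = [('u1', 'NespressoCoffee'), ('u1', 'BreadMaker'),
                  ('u2', 'BreadMaker'), ('u2', 'KenwoodEspresso'),
                  ('u3', 'BreadMaker'), ('u3', 'KenwoodEspresso')]

    products_by_user = defaultdict(set)
    for user, product in user_views:
        products_by_user[user].add(product)

    cooccur = Counter()
    for viewed in products_by_user.values():
        for pi, pj in combinations(sorted(viewed), 2):   # count(a,b) == count(b,a)
            cooccur[pi, pj] += 1

    THRESHOLD = 1                                        # slide uses 1 million
    print([pair for pair, n in cooccur.items() if n > THRESHOLD])
    # [('BreadMaker', 'KenwoodEspresso')] -- viewed together by two users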

14 of 40

Pre-cs145: Compute in RAM (Engg approximations)

Counting product views

  • Input size (4 bytes for user, 4 bytes for productid): ~1Bil * [4 + 4] = 8 GB
  • Output size (4 bytes for productid, 4 bytes for count): ~1Bil * [4 + 4] = 8 GB
  • Trivial

Counting product pair views

Plan #1: ~P * P/2 matrix of counters in RAM (4 bytes per count)

  • RAM size = 1 Billion * 1 Billion/2 * 4 = 2 Million TBs
    • // (count(a, b) = count(b, a))
  • Trivial? If you have ~$6 Billion (RAM at ~$100/32GB)

Plan #1 (on disk): Let the OS page into memory as needed

  • Worst case: > 1 million years on HDD (if you seek to update each counter)

⇒ Need to do some design work, with cs145

[Diagram: the p1 … p1b × p1 … p1b counter matrix]
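The back-of-the-envelope numbers above, checked in Python (using the slide's assumed prices):

    P = 10**9                          # products
    ram = P * (P / 2) * 4              # triangular counter matrix, 4-byte counts
    print(ram / 1e12)                  # 2,000,000 TB -> "2 Million TBs"
    print(ram / 32e9 * 100 / 1e9)      # ~6.25 -> "~$6 Billion" at $100 per 32 GB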

15 of 40

Systems Design Example: Product CoOccur -- Counting popular product-pairs

Your mission: Compute a table – CoOccur <productID1, productID2, count>, with co-occur counts from last week’s data

  1. AWA’s data quality magicians recommend:
    1. (a) Keep only the top three recommendations per product, and (b) drop pairs with co-occur counts less than a million.
    2. On avg, users view ten products each week (a user is interested in ~10 products/week, not 1000s). (For compactness, using u1 for a userID, and p1..p10 for productIDs.)
      • ⇒ {<u1, p1, …>, <u1, p2, ..>, <u1, p3, …> … <u1, p10>}; 10 rows/user in UserViews
      • ⇒ We’d need to count the following 10*9/2 = 45 (n choose 2) product pairs, per u1:
        • <p1, p2>, <p1, p3>, … <p1, p10>
        • <p2, p3>, <p2, p4> … <p2, p10>
        • <p3, p4>, <p3, p5>, … <p3, p10>
        • …
        • <p9, p10>
      • Alternately, we say the blowup-factor = 4.5 (= 45/10 for the cross-product)
        • (i.e., for ‘k’ avg views, the blowup-factor = (k-1)/2)
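The blowup factor is easy to check with Python's itertools:

    from itertools import combinations

    views = [f'p{i}' for i in range(1, 11)]      # u1's ten viewed products
    pairs = list(combinations(views, 2))         # <p1,p2>, <p1,p3>, ... <p9,p10>
    print(len(pairs), len(pairs) / len(views))   # 45 pairs; blowup-factor = 4.5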

16 of 40

Optimize: evaluate design plan 1, plan 2, plan 3, ...

  • For SFW, Join queries
    • Sort? Hash? Count? Brute-force?
    • Pre-build an index? B+ tree, Hash?

  • What statistics can I keep to optimize?
    • E.g., selectivity of columns, values

[Diagram: Build Query Plans → Analyze Plans; cost in I/O and resources, to query and to maintain]

17 of 40

Post CS 145

[Pipeline: Query (SQL) → Relational Algebra (RA) Plan → Optimized RA Plan → Execution]

Write a query like

SELECT TOP(..) …
FROM UserViews v1, UserViews v2
WHERE ...
GROUP BY v1.productId, v2.productId
HAVING COUNT(*) > 1 million

Query plan #2:

  1. SMJ(UserViews, UserViews) on userid
  2. Sort(SortedUserViews) on <v1.productid, v2.productid> for GROUP BY & HAVING
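One plausible way to flesh out that skeleton, runnable against SQLite on toy data. The join predicate follows the plan's SMJ on userid; the pair-deduplication predicate and the lowered HAVING threshold are illustrative, and TOP is written as ORDER BY since SQLite lacks it:

    import sqlite3

    con = sqlite3.connect(':memory:')
    con.execute('CREATE TABLE UserViews (userId TEXT, productId TEXT)')
    con.executemany('INSERT INTO UserViews VALUES (?, ?)',
                    [('u1', 'p1'), ('u1', 'p2'), ('u2', 'p1'),
                     ('u2', 'p2'), ('u2', 'p3')])

    rows = con.execute('''
        SELECT v1.productId, v2.productId, COUNT(*) AS cnt
        FROM UserViews v1, UserViews v2
        WHERE v1.userId = v2.userId
          AND v1.productId < v2.productId      -- count each pair once
        GROUP BY v1.productId, v2.productId
        HAVING COUNT(*) > 1                    -- the slide's "> 1 million"
        ORDER BY cnt DESC''').fetchall()
    print(rows)                                # [('p1', 'p2', 2)]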

18 of 40

Systems Design Example: Product CoOccur -- Pre-design

Field sizes:

  • ProductId: 4 bytes. 1 Billion products ⇒ need at least 30 bits (2^30 ~= 1 Billion) to represent each product uniquely, so use 4 bytes.
  • UserID: 4 bytes.
  • ViewID: 8 bytes. 10 Billion product views.

Table sizes (64 MB pages):

  • Products: 1 Billion products of 1 MB each. Size = 1 PB. P(Product) = 15.625M pages. NumSearchKeys(Product) = 1 Billion.
  • UserViews: each record is <userID, productID, viewID, viewTime>. Assume 8 bytes for viewTime, so 24 bytes per record. 10 Billion * [4+4+8+8] bytes = 240 GBs. P(UserViews) = 3750 pages.
  • ProductPairs for all userViews (we’ll see how to use this in 3 slides): given blowup-factor = 4.5 and 10 Billion UserViews, that is 45 Billion records of <productID, productID> at 8 bytes each. 45 Billion * 8 bytes = 360 GB. P(ProductPairs) = 5625 pages.
  • CoOccur (for top 3 Billion): the output is <productID, productID, count>, i.e., 12 bytes per record (4 + 4 for the two productIDs and 4 bytes for the count). 3 Billion * 12 bytes = 36 GBs. P(CoOccur) = 563 pages. NumSearchKeys(CoOccur) = 1 Billion.
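These sizes reduce to a few lines of arithmetic (64 MB pages, in the decimal units the slides use):

    PAGE = 64e6                                # 64 MB page

    def pages(nbytes):
        return nbytes / PAGE

    print(pages(1e9 * 1e6) / 1e6)              # Products: 1 PB -> 15.625M pages
    print(pages(10e9 * (4 + 4 + 8 + 8)))       # UserViews: 240 GB -> 3750 pages
    print(pages(4.5 * 10e9 * 8))               # ProductPairs: 360 GB -> 5625 pages
    print(pages(3e9 * 12))                     # CoOccur: 36 GB -> 562.5 ~= 563 pages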

19 of 40

Systems Design Example: Product CoOccur -- Plan #2: Computing CoOccurs [B = 3000]

  • Step 1: SMJ(v1.userid = v2.userid). Specifically,
    • SMJ(UserViews, UserViews by userid)
    • // Note: Sort and merge UserViews once
    • // Normally, SMJ Cost ~= 3*(P(R)+P(S)).
    • // So SMJ Cost ~= 3*P(UserViews), due to the self-join on UserViews.
    • // You could optimize further. Let’s use the above IO costs for simplicity.

  • Step 2: // Handle groupbys
    • Sort(Temp, <v1.productid, v2.productid>)
    • Scan sortedTemp, and make GROUPBYs for <v1.productid, v2.productid>

20 of 40

Systems Design Example: Product CoOccur -- Plan #2: Engg Cost Approximation [B = 3000]

  • Step 1: SMJ(v1.userid = v2.userid). Specifically,
    • SMJ(UserViews, UserViews by userid)
    • OUTPUT Temp=<v1.productid, v2.productid>
    • // Note: Sort and merge UserViews once
    • // Normally, SMJ Cost ~= 3*(P(R)+P(S)).
    • // So SMJ Cost ~= 3*P(UserViews), due to the self-join on UserViews.
    • IO Cost = SMJ(UserViews, UserViews) ~= 3*P(UserViews) + P(Temp)

  • Step 2: // Handle groupbys
    • Sort(Temp, <v1.productid, v2.productid>)
    • Scan sortedTemp, and make GROUPBYs for <v1.productid, v2.productid>
    • IO Cost = Sort(Temp) ~= 2*P(Temp) + P(CoOccur)

[See Notebook for impact of B values]


21 of 40

Systems Design Example: Product CoOccur -- Plan #2

Steps, cost (IOs), and why:

  1. SMJ(UserViews, UserViews) on UserID: 3*3750 + 5625 IOs // 3*P(UserViews) + P(Temp)
  2. Sort, GroupBY -> CoOccur: 2*5625 + 563 IOs // 2*P(Temp) + P(CoOccur)
  • Total IO cost: ~28,688 IOs
  • (P(UserViews) = 3750, P(Temp) = 5625, P(CoOccur) = 563)
  • (Scan-time scale: 240 GB @ 100 MB/sec ~= 2400 secs; ~800 GB @ 100 MB/sec ~= 8000 secs)

Recall: HDD Scan at 100 MBps, Access = 10 msecs

  • Time = Access (28688 * 0.01 secs) + Scan (28688 * 64MB / 100MBps) ~= 18,647 secs

  • Total cost ~= $250
    • Cost(B = 64GB RAM at $100/32GB) ~= $200
    • Cost(HDD = 2 TB at $100/4TB) ~= $50

⇒ Could optimize by ~2x with more tweaks. Good enough for cs145. (We went from ~$6 Billion, or 1 million years, to the above.)

[See Notebook for impact of B values]
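The whole cost estimate, reproduced in a few lines of Python (same constants as above):

    P_UV, P_TEMP, P_CO = 3750, 5625, 563
    smj  = 3 * P_UV + P_TEMP              # Step 1: SMJ self-join -> 16,875 IOs
    sort = 2 * P_TEMP + P_CO              # Step 2: sort Temp, write CoOccur -> 11,813 IOs
    total = smj + sort                    # 28,688 IOs

    # HDD model: 10 msec access per IO + scan of a 64 MB page at 100 MB/sec
    secs = total * 0.01 + total * 64 / 100
    print(total, round(secs))             # 28688 IOs, ~18,647 secs (~5 hours)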

22 of 40

Systems Design Example: Product CoOccur -- B+ tree index

Build indexes with search key = productId. (Assume: data not clustered.)

[Diagram: CoOccur <pid, cooccur_pid, count> with its CoOccur Index <pid, pointer>; Products <pid, image, …> with its Products Index <pid, pointer>]

  • NumSearchKeys(CoOccur) = 1 Billion; P(CoOccur) = 563
  • NumSearchKeys(Products) = 1 Billion; P(Products) = 15.6M

What’s f and h (same for both indexes)?

  • f = 5.6 million (4 byte productId + 8 byte pointer @ 64MB/page)
  • h = 2

Cost of lookup? (Worst case, with only the root in RAM)

  • CoOccur: 1 IO (for the 1st level) + 1 IO for CoOccur data (assume: clustered, for the 3 recommendations) = 2 IOs
  • Products: 1 IO (for the 1st level) + 1 IO for data = 2 IOs
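The f and h numbers follow from the page and entry sizes. A short check in Python (the 64 MB page is taken in binary units here, which is what yields the slide's ~5.6 million):

    import math

    PAGE = 64 * 2**20                     # 64 MB page (binary)
    f = PAGE // (4 + 8)                   # fanout: <4-byte productId, 8-byte pointer> entries
    h = math.ceil(math.log(1e9, f))       # levels needed to index 1 Billion search keys
    print(f, h)                           # 5592405 (~5.6 million), 2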

23 of 40

Data Systems Design Example: Product CoOccur

[Diagram: App/web browser ↔ datacenter. CoOccur Index and Products index in RAM (fast access); CoOccur, Products, and Product Views stored behind them.]

User latency?

  1. Lookup Products Index
  2. Lookup Products image
  3. Lookup CoOccur Index

[1] + [2] + [3] < 100 msecs

Ans:

  1. 2 IOs for [1] + [2]
  2. 2 IOs for [3]
  3. (+ ~3 * 2 IOs for the 3 recommendations)

Overall, ~= 40 (+60) msecs even with HDD
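The latency arithmetic, spelled out (assuming the deck's 10 msec HDD access per random IO):

    ACCESS_MS = 10                       # per random HDD IO
    base = (2 + 2) * ACCESS_MS           # [1]+[2] via Products index, [3] via CoOccur index
    recs = 3 * 2 * ACCESS_MS             # optionally fetch the 3 recommended products too
    print(base, base + recs)             # 40 msecs, 100 msecs -> "~40 (+60) msecs"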

24 of 40

Data Systems Design: Popular systems design pattern

  1. Efficiently compute a ‘batch’ of data (sort, hash, count)
  2. Build a lookup index on the result (B+ tree, hash table)
  3. For ‘streaming’ data, update with ‘micro batches’

Popular problems

  • Related videos (YouTube), people (Facebook), pages (web)
  • Security threats, malware (security), correlation analysis

25 of 40

Histograms & IO Cost Estimation

26 of 40

Optimization Roadmap

  • For SFW, Join queries
    • Brute-force? Sort? Hash? Count?
    • Pre-build an index? B+ tree, Hash?

  • What statistics can I keep to optimize?
    • E.g., selectivity of columns, values

[Diagram: Build Query Plans → Analyze Plans; cost in I/O and resources, to query and to maintain]

27 of 40

Example: stats for spatial and temporal data

[Diagram: e.g., a carpool or traffic model; e.g., user traffic]
28 of 40

Histograms

  • A histogram is a set of value ranges (“buckets”) and the frequencies of values in those buckets

  • How to choose the buckets?
    • Equi-width & Equi-depth

  • High-frequency values are very important (e.g., related products)

29 of 40

Example

[Chart: values on the x-axis, per-value frequency on the y-axis]

How do we compute how many values are between 8 and 10?

(Yes, it’s obvious)

Problem: the full counts take up too much space!
30 of 40

What if we kept average only?

Space vs Error tradeoff

E.g., uniform count = 3 (average)

Real

Approximate

31 of 40

Fundamental Tradeoffs

  • Want high resolution (like the full counts)

  • Want low space (like uniform)

  • Histograms are a compromise!

So how do we compute the “bucket” sizes?

32 of 40

Equi-width

Partition buckets into roughly the same width (value range)

(E.g., the 1st 3 values have average = 2.67, vs the next 3 with avg = 1.33. Keep (2.67, 1.33, 5, 1, 5) as the averages for the 5 buckets.)

33 of 40

Equi-depth

Partition buckets for roughly the same number of items (total frequency)

(E.g., bucket totals: 2+3+3+1 = 9, 8+4 = 12, 9, and 1+2+1+5 = 9.)
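A small Python sketch contrasting the two schemes. The value-to-frequency data is illustrative, chosen so the equi-depth totals match the 9, 12, 9, 9 above; the bucket boundaries are hardcoded for clarity:

    # Illustrative data: frequency of each value 1..11
    freqs = {1: 2, 2: 3, 3: 3, 4: 1, 5: 8, 6: 4, 7: 9, 8: 1, 9: 2, 10: 1, 11: 5}

    def summarize(buckets):
        """Per (lo, hi) bucket, keep only the total and average frequency."""
        for lo, hi in buckets:
            counts = [freqs.get(v, 0) for v in range(lo, hi + 1)]
            print((lo, hi), 'total =', sum(counts),
                  'avg =', round(sum(counts) / len(counts), 2))

    # Equi-width: every bucket spans roughly the same value range.
    summarize([(1, 3), (4, 6), (7, 9), (10, 11)])
    # Equi-depth: ranges chosen so buckets hold roughly equal total frequency.
    summarize([(1, 4), (5, 6), (7, 7), (8, 11)])   # totals 9, 12, 9, 9

To answer slide 29's "how many values between 8 and 10?", a histogram assumes values are uniform within each bucket and returns the covered fraction of each overlapping bucket's total, trading accuracy for space.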

34 of 40

Histograms

  • Simple, intuitive and popular

  • Parameters: # of buckets and type

  • Can extend to many attributes (multidimensional)

35 of 40

Maintaining Histograms

  • Histograms require that we update them!
    • Typically, you must run/schedule a command to update statistics on the database
    • Outdated histograms can be bad!

  • Research on self-tuning histograms and the use of query feedback

36 of 40

Compressed Histograms

One popular approach:

    • Store the most frequent values and their counts explicitly
    • Keep an equi-width or equi-depth histogram for the rest of the values

People continue to try fancy techniques here: wavelets, graphical models, entropy models, …

37 of 40

Optimization Roadmap

  • For SFW, Join queries
    • Brute-force? Sort? Hash? Count?
    • Pre-build an index? B+ tree, Hash?

  • What statistics can I keep to optimize?
    • E.g., selectivity of columns, values

[Diagram: Build Query Plans → Analyze Plans; cost in I/O and resources, to query and to maintain]

38 of 40

Data models

  • Semi-structured (e.g. user profile, web site activity) [aka JSON documents]
  • Structured (e.g. ads, purchases, product tables) [aka relational tables]

39 of 40

Key Starter Questions for Scale, Scale, Scale

  • How to scale to large data sets?
    • Is data relational, or unstructured or ...?
    • Is data in Row or Column store?
    • Is data sorted or not?

  • How do we organize search values?
    • E.g., Hash indices, B+ trees

  • How to JOIN multiple datasets?
    • E.g., SortMerge, HashJoins

40 of 40

Hybrid Data Systems

In reality, a mix of systems -- e.g., an Amazon/e-commerce site -- all built from the same Lego blocks: Hashing, Sorting, Indexing, Map-Reduce, JOINs, Aggregates.

[Diagram: example systems by data model:
  • Structured Relational (e.g. ads, purchases, product tables): MySQL, Oracle, SQL Server, BigQuery, Aurora
  • Semi-structured (e.g. user profile, web site activity): DynamoDB, BigTable, Cassandra, Mongo, Memcached
  • Unstructured (e.g. images, videos, docs); DB Service]

Hybrid future

  • Built on same Lego blocks
  • Past: SQL → noSQL → newSQL
  • Now: hybrid SQL + pandas/ML

Data | Algorithms | Language/Tools