1 of 44

Putting it all together

Systems design

2 of 44

In this lecture

  1. Quick recap of scale

  2. A full systems example putting things together
    • How to compute data sizes?
    • Computing a large join?
    • Building an index

3 of 44

Feedback on real world applications

Two reactions

  • “Great. It helps a lot to put our learning into a much bigger context.”

  • “OMG. I got < half of what they were talking about.”

Main takeaways

  1. SQL systems are universal for scale and managing complexity (distributed/federated data sources)
  2. We learn Lego blocks in CS145. Folks build complex structures in industry
  3. Don't expect to grok every single detail (watch again to “peel the onion”)

4 of 44

Key Starter Questions for:

Scale, Scale, Scale

  • How to scale to large data sets?
    • Is data relational, or unstructured or ...?
    • Is data in Row or Column store?
    • Is data sorted or not?

  • How do we organize search values?
    • E.g., Hash indices, B+ trees

  • How to JOIN multiple datasets?
    • E.g., SortMerge, HashJoins

5 of 44

Data models

Semi-structured

(e.g. user profile, web site activity)

[aka JSON documents]

Structured

(e.g. ads, purchases, product tables)

[aka relational tables]

6 of 44

Hybrid Data Systems

[SQL → noSQL → newSQL →

Hybrid SQL (now)]

Hashing, Sorting

Indexing, Map-Reduce

JOINs, Aggregates

Unstructured

(e.g. images, videos, docs)

Semi-structured

(e.g. user profile, web site activity)

Structured

(e.g. ads, purchases, product tables)

In reality, a mix of systems -- e.g., an Amazon/e-commerce site (“DB Service” mix):

  • Structured relational: MySql, Oracle, SQL Server, BigQuery, Aurora
  • Semi-structured: DynamoDB, BigTable, Cassandra, Mongo, Memcached

Hybrid future

  • Blended systems
  • Built on same Lego blocks

7 of 44

Big Scale

Lego Blocks

| Roadmap | Primary data structures/algorithms |
| Hashing | HashTables, HashFunctions (hash_i(key) → location) |
| Sorting | BucketSort, QuickSort, MergeSort, MergeSortedFiles |
| Counting | HashTable + Counter (hash_i(key) → <count>) |

8 of 44

IO analysis

Recap

Sort N pages with B+1 buffer pages. (CPU cost is n log n for n tuples in RAM; negligible for large data next to IO, which is much, much slower.)

| Operation | IO cost | Why |
| Sort N pages when N ~= B | ~2N | One pass: (log_B 0.5) < 0 ⇒ no merge passes needed |
| Sort N pages when N ~= 2*B^2 | ~4N | Two passes: (log_B B) = 1 ⇒ one merge pass |
| SortMerge or HashJoin of R and S, with enough RAM | ~3 * (P(R) + P(S)) + OUT | P(R) and P(S) are the number of pages in R and S |
| SortMerge if already sorted; HashJoin if already partitioned | ~1 * (P(R) + P(S)) + OUT | Single matching pass over both inputs |
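These recap numbers all come from one external-sort cost model. A minimal sketch, assuming pass 0 writes initial runs of ~2(B+1) pages (consistent with the “log_B 0.5” and “log_B B” notes above); this is an approximation for the lecture's cost model, not a full implementation:

```python
import math

def external_sort_io(n_pages, n_buffers):
    """Approximate IO cost (page reads + writes) of external merge sort.

    Assumes pass 0 produces sorted runs of ~2*(B+1) pages, and each
    merge pass fans in B runs; every pass reads and writes all N pages.
    """
    runs = math.ceil(n_pages / (2 * (n_buffers + 1)))
    merge_passes = max(0, math.ceil(math.log(runs, n_buffers)))
    return 2 * n_pages * (1 + merge_passes)

B = 1000
print(external_sort_io(B, B))          # N ~= B: one pass, 2N = 2000
print(external_sort_io(2 * B * B, B))  # N ~= 2B^2: two passes, 4N = 8,000,000
```

Plugging in N ~= B gives ~2N, and N ~= 2B² gives ~4N, matching the table.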

9 of 44

B+ Trees

Recap

Let:

    • f = fanout (we’ll assume it’s constant for our cost model…)
    • N = total number of pages we need to index
    • F = fill-factor (usually ~= 2/3)
    • M = number of data pages

  • Our B+ Tree needs to have room to index N / F pages!
    • We have the fill factor in order to leave some open slots for faster insertions
    • E.g., if N = 100 and F = ⅔, you use 150 pages (each ⅔ full)

→ We need a B+ tree of height h = ⌈log_f (N / F)⌉
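Reading the height formula with numbers from the CoOccur example later in the lecture (f ~= 5460, N ~= 167K pages, F = 2/3); a quick sketch, not part of the slides:

```python
import math

def btree_height(n_pages, fanout, fill_factor=2/3):
    """Height h such that fanout^h covers the leaf level, where the
    leaves span n_pages / fill_factor pages (slack left for inserts)."""
    leaves = math.ceil(n_pages / fill_factor)
    return math.ceil(math.log(leaves, fanout))

# CoOccur index: ~167K pages to index, fanout ~5460, F = 2/3 -> h = 2
print(btree_height(167_000, 5460))
```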

10 of 44

In this lecture

  • Quick recap of scale

  • A full systems example putting things together
    • How to compute data sizes?
    • Computing a large join?
    • Building an index

11 of 44

Systems Design

Example:

Product CoOccur

Billion products

Product recommendations

12 of 44

Systems Design

Example:

Product CoOccur

Counting popular product-pairs

Story: Amazon/Walmart/Alibaba (AWA) need to compute ‘related products’ for all products so their users can explore and buy new products.

  • AWA use collaborative filtering (‘wisdom of crowds’) from their website logs.
  • Each time a user views a set of products, those products are related (co-occur)

⇒ Goal: compute product pairs and their co-occur count, across all users

Data input: AWA’s product catalog is 1 billion items. Each product record is ~1MB with a product description and image. AWA has 10 billion user-product views each week, from 1 billion users. Each log record stores <userID, productID, viewID, viewTime>.

CoOccur Product Count

Product Views

Web browser

13 of 44

Counting in RAM

(pre-CS145)

Counting product views for a billion product PAIRs

UserViews(UserId, ProductID): for each user, the products they viewed (e.g., Nespresso Coffee, Bread maker, Kenwood Espresso, ...)

CoOccur(ProductID, ProductID, count):

| Nespresso Coffee | Bread Maker | 301 |
| Bread maker | Kenwood Espresso | 24597 |
| Kenwood Espresso | Bike | 22 |
| ... | ... | ... |

Algorithm: For each user, and each pair of products (pi, pj) that user viewed:

CoOccur[pi, pj] += 1
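A minimal in-RAM sketch of this counter (product names and views are made up; ordering each pair as pi < pj is our choice so each unordered pair gets a single counter):

```python
from collections import defaultdict
from itertools import combinations

# Tiny stand-in for UserViews(UserId, ProductID)
user_views = {
    "u1": ["nespresso", "bread_maker", "kenwood"],
    "u2": ["bread_maker", "kenwood"],
}

co_occur = defaultdict(int)  # CoOccur[(pi, pj)] -> count
for products in user_views.values():
    # Canonical order (pi < pj): one counter per unordered pair
    for pi, pj in combinations(sorted(set(products)), 2):
        co_occur[(pi, pj)] += 1

print(co_occur[("bread_maker", "kenwood")])  # 2: co-viewed by u1 and u2
```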

14 of 44

Pre-cs145

Compute in RAM

(Engg approximations)

Counting product views

Input size (4 bytes for user, 4 bytes for productid)

~1Bil * [4 + 4] = 8 GB

Output size (4 bytes for productid, 4 bytes for count)

~1Bil * [4 + 4] = 8 GB

Trivial

Counting product pair views

‘Trivial?’

(if you have ~$25 billion, at $100 per 16 GB of RAM)

Plan #1: P * P matrix for counters in RAM (4 bytes for count)

    • RAM size = 1 Billion * 1Billion * 4 = 4 Million TBs
    • [Optimization: Half the counters]

Plan #1 (on disk): Let OS page into memory as needed

  • Worst case #1 > 100 million years

(if you seek to update each counter)
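A quick sanity check of Plan #1's “trivial?” numbers (all values taken from the slide):

```python
# Full P x P matrix of 4-byte counters for 1 billion products
P = 1_000_000_000
ram_bytes = P * P * 4
print(ram_bytes / 1e12)  # 4,000,000 TB = 4 million TBs

# Price it out at $100 per 16 GB of RAM
dollars = ram_bytes / (16 * 1e9) * 100
print(dollars / 1e9)     # ~25 (billion dollars)
```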

15 of 44

Systems Design

Example:

Product CoOccur

Counting popular product-pairs

Your mission: Design an efficient system to compute co-occur counts on Sundays from weekly logs and produce a CoOccurCount table <productID, productID, count>

  1. AWA’s data quality magicians recommend:
    • (a) retain only the top billion popular pairs, and (b) drop product pairs with co-occur counts less than a million.
    • Also, assume users view ten products on average each week (the UserInterest assumption).

  2. For simplicity, LogOfViews is stored sorted by <userID, productID>.
    • You can sequentially scan the log and produce co-occurring product pairs for each user. In other words, output (pi, pj) if a user viewed pi and pj.
    • This “stream” of tuples (TempCoOccur) may then be (a) stored on disk or (b) discarded after updating any data structures.

16 of 44

Post CS 145

Query

(SQL)

Relational Algebra (RA) Plan

Optimized RA Plan

Execution

Write a query like

SELECT …

FROM LogOfViews v1, LogOfViews v2

WHERE ...

GROUP BY v1.productId, v2.productId

HAVING count(*) > 1 million

17 of 44

Data Systems Design

Example:

Product CoOccur

Popular Systems design pattern

  1. Efficiently compute ‘batch’ of data

  2. Build Lookup index on result

  3. [For ‘streaming’ data, update with ‘micro batches’]

CoOccur Product Count

Product Views

Web browser

CoOccur Index

Lookup

Index

Related

Product

18 of 44

Optimize,

Evaluate design plans

  • For SFW, Joins queries
    • Sort? Hash? Count? Brute-force?
    • Pre-build an index? B+ tree, Hash?

  • What statistics can I keep to optimize?
    • E.g. Selectivity of columns, values

Cost in I/O, resources?

To query, maintain?

Build Query Plans

Analyze Plans

19 of 44

Systems Design

Example:

Product CoOccur

Pre-design

| Item | Size | Why |
| ProductId | 4 bytes | 1 billion products ⇒ need at least 30 bits (2^30 ~= 1 billion) to represent each product uniquely, so use 4 bytes |
| UserID | 4 bytes | 1 billion users |
| LogOfViewsID | 8 bytes | 10 billion product views |
| Product | 1 PB | 1 billion products of 1 MB each |
| LogOfViews | 240 GB (4M pages) | Each record is <userID, productID, viewID, viewTime>; assume 8 bytes for viewTime ⇒ 4 + 4 + 8 + 8 = 24 bytes per record; 10 billion × 24 bytes = 240 GBs |
| CoOccur | 12 GB | <productID, productID, count> = 12 bytes per record (4 + 4 for the two productIDs, 4 for count); top billion pairs (per AWA data quality) × 12 bytes = 12 GBs |
| TempCoOccur (with UserInterest assumption, of ~10 views/user) | 800 GB (12.5M pages) | 1 billion users × 10^2 = 100 billion pairs; @8 bytes/record = 800 GBs |
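The pre-design sizes are simple arithmetic; a short sketch to sanity-check them (all record layouts from the slide):

```python
GB = 10**9

views = 10 * 10**9                 # 10 billion weekly product views
log_record = 4 + 4 + 8 + 8         # userID + productID + viewID + viewTime
print(views * log_record // GB)    # LogOfViews: 240 GB

pairs = 10**9 * 10**2              # 1B users x ~10 views -> ~100 pairs each
print(pairs * 8 // GB)             # TempCoOccur: 800 GB at 8 bytes/pair

print(10**9 * 12 // GB)            # CoOccur: top 1B pairs x 12 bytes = 12 GB
```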

20 of 44

Systems Design

Example:

Product CoOccur

Plans?

Plan #1: With 1 machine, use RAM to count (cost = $25B, or > 100 million years on disk).

Plan #2: With 1 machine, analyze with the UserInterest assumption:

  1. Scan LogOfViews. For each user, append <p_i, p_j> to a file TempCoOccurLog if the user has viewed products p_i and p_j (i.e., produce per-user co-occur product pairs; append to log ⇒ no seeks). [Produces TempCoOccurLog]
  2. Externally sort TempCoOccurLog on disk, so identical product pairs are adjacent to each other in the sorted file. [Produces sorted TempCoOccurLog]
  3. Scan sorted TempCoOccurLog. With a single pass, you can count co-occur pairs. Drop co-occur pairs with count < 1 million. [Produces counted, sorted TempCoOccurLog]
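A toy end-to-end sketch of Plan #2 (made-up data; the in-memory `sort()` stands in for the external sort, and `min_count = 2` stands in for the 1M threshold):

```python
from itertools import combinations, groupby

# LogOfViews, already sorted by <userID, productID>
log_of_views = [("u1", "p1"), ("u1", "p2"), ("u1", "p3"),
                ("u2", "p1"), ("u2", "p2")]

# Step 1: one scan -> append per-user co-occur pairs to TempCoOccurLog
temp_log = []
for user, group in groupby(log_of_views, key=lambda r: r[0]):
    products = sorted({pid for _, pid in group})
    temp_log.extend(combinations(products, 2))

# Step 2: sort so identical pairs are adjacent (external sort at scale)
temp_log.sort()

# Step 3: a single pass over the sorted log counts each pair
min_count = 2
co_occur = [(pair, len(list(g))) for pair, g in groupby(temp_log)]
co_occur = [(pair, c) for pair, c in co_occur if c >= min_count]
print(co_occur)  # [(('p1', 'p2'), 2)]
```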

21 of 44

Systems Design

Example:

Product CoOccur

Plan #2

| Steps | Cost (time) | Why |
| Scan LogOfViews | ~2400 secs | 240 GB @ 100 MB/sec |
| Append <p_i, p_j> to TempCoOccurLog | ~8000 secs | 800 GB @ 100 MB/sec |
| Externally sort TempCoOccurLog on disk (assume sort cost is ~2N, where N is the number of pages and B ~= N buffers; TempCoOccurLog and runs stored sequentially) | ~16,000 secs | IO cost ≈ 2 × (1 seek + scan of 12.5 million pages × 64 KB/page) = 2 × scan cost of 800 GB, i.e., 16,000 secs (2 × 800 GB @ 100 MB/sec) |
| Scan sorted TempCoOccurLog and keep counts in CoOccur | ~8000 secs | 800 GB @ 100 MB/sec |

22 of 44

Systems Design

Example:

Product CoOccur

Plan #2

| Steps | Cost (IO) | Why |
| Scan LogOfViews | 4M | 240 GB (4M pages) |
| Append <p_i, p_j> to TempCoOccurLog | 12.5M | 800 GB (12.5M pages) |
| Externally sort TempCoOccurLog on disk (assume sort cost is ~2N, where N is the number of pages and B ~= N buffers) | 25M | IO cost ≈ 2 × N = 2 × 12.5M |
| Scan sorted TempCoOccurLog and keep counts in CoOccur | 12.5M | 800 GB (12.5M pages) |

Total IO cost = 4M + 12.5M + 25M + 12.5M = 54M

Recall: if stored sequentially (i.e., scan at 100 MBps), then time (secs) = (54M × 64 KB) / 100 MBps ≈ 34.5K secs
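A quick sanity check of the totals (64 KB taken as 64,000 bytes, matching the slide's arithmetic):

```python
page = 64 * 1000            # 64 KB page (decimal, as in the slide)
scan_rate = 100 * 10**6     # 100 MB/sec sequential scan

# scan + append + external sort + counting pass
total_pages = 4e6 + 12.5e6 + 25e6 + 12.5e6
print(total_pages / 1e6)    # 54.0 -> 54M page IOs

secs = total_pages * page / scan_rate
print(secs)                 # 34560.0 secs, i.e., ~34.5K secs (~9.6 hours)
```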

23 of 44

Optimize,

Evaluate design plans 1, 2, 3, ...

  • For SFW, Joins queries
    • Sort? Hash? Count? Brute-force?
    • Pre-build an index? B+ tree, Hash?

  • What statistics can I keep to optimize?
    • E.g. Selectivity of columns, values

Cost in I/O, resources?

To query, maintain?

Build Query Plans

Analyze Plans

24 of 44

Data Systems Design

Example:

Product CoOccur

CoOccur Products

Product Views

Web browser

CoOccur Index

Lookup

Index

Related

Product

25 of 44

Systems Design

Example:

Product CoOccur

B+ tree index

Evaluate the cost of lookups in a B+ tree with search key = productId. How many IO lookups can we expect if we had 1 GB of RAM for the index? (CoOccur data may not be clustered)

CoOccur <pid, cooccur_pid, count>
  • M: 1 billion tuples × (4 + 4 + 4) bytes each (~167K pages)

CoOccur Index <pid, pointer>
  • N: 1 billion tuples × (4 + 8) bytes each (~167K pages)
  • N/F = 250,000 pages (with slack)

Contrast Example 2: Index on Product data? [Recall: 1 billion tuples × 1 MB each = 1 PB]

  • M: 1 PB / 64 KB = 15.6 billion pages
  • N: <pid, pointer> = 167K pages

26 of 44

Systems Design

Example:

Product CoOccur

B+ tree index

| Quantity | Value | Why |
| N/F | 250,000 | From previous page |
| Fanout f | ~5460 | 4 bytes for productId + 8 bytes per pointer @ 64 KB/page: f × 4 + f × 8 <= 64K ⇒ f ~= 5460 |

CoOccur Index <pid, pointer> structure:

  • Root: 1 page, with 5460 pointers
  • Level 1: <= 5460 pages, with 5460 pointers each to the next level
  • Leaf: N/F = 250,000 pages (167K at F = 2/3); each leaf can hold up to 5460 search keys
  • Data: 1 billion CoOccur <pid, cooccur_pid, count> records, 12 bytes each

⇒ h = 2 (recall: we need a B+ tree of height h = ⌈log_f (N/F)⌉)

27 of 44

Data Systems Design

Example:

Product CoOccur

CoOccur Products

  • CoOccur Index: N/F = 250,000 pages; CoOccur data: M = 167,000 pages
  • Product index: N/F = 250,000 pages; Products data: M = 15.6 billion pages
  • Both indexes together: ~0.5M index pages ~= 32 GBs ⇒ keep in RAM (fast access, in the datacenter)

User latency (app/web browser)?

  1. Lookup Product index
  2. Lookup Products image
  3. Lookup CoOccur index

[1] + [2] + [3] < 100 msecs

28 of 44

Data Systems Design

Example:

Bigger Product CoOccur

Problem so far

  • AWA’s product catalog is 1 billion items. AWA has 10 billion product views each week, from 1 billion users. Each log record stores <userID, productID, viewID, viewtime>

Consider 1000x Bigger problem!

  • Product catalog is 1 trillion items. AWA has 10 billion product views. Rest stays same

⇒ What changes?

29 of 44

Systems Design

Example:

Product CoOccur

Pre-design

| Item | Size | Why |
| ProductId | 8 bytes | 1 trillion products ⇒ need at least 40 bits (2^40 ~= 1 trillion) to represent each product uniquely, so use 8 bytes (i.e., 64 bits) |
| UserID | 4 bytes | 1 billion users |
| LogOfViewsID | 8 bytes | 10 billion product views |
| Product | 1000 PB | 1 trillion products of 1 MB each |
| Users | Unknown | |
| LogOfViews | 280 GB | Each record is <userID, productID, viewID, viewTime>; 4 + 8 + 8 + 8 = 28 bytes per record; 10 billion × 28 bytes = 280 GBs |
| CoOccur | 20 GB | <productID, productID, count> = 20 bytes per record (8 + 8 for the two productIDs, 4 for count); top billion pairs × 20 bytes = 20 GBs |
| TempCoOccur (with UserSession assumption, of ~10 views/user) | 1600 GB | 1 billion users × 10^2 = 100 billion pairs; @16 bytes/record = 1600 GBs |

30 of 44

Data Systems Design

Popular Systems design pattern

  • Efficiently compute ‘batch’ of data (sort, hash, count)

  • Build Lookup index on result (b+ tree, hash table)

  • For ‘streaming’ data, update with ‘micro batches’

Popular problems

  • Related videos (youtube), people (Facebook), pages (web)

  • Security threats, malware (security), correlation analysis

31 of 44

Key Goal:

Scale, Scale, Scale

  • How to scale to large data sets?
    • Is data in Row or Column store?
    • Is data sorted or not?

  • How do we organize search values?
    • E.g., Hash indices, B+ trees

  • How to JOIN multiple datasets?
    • E.g., SortMerge, HashJoins

32 of 44

Histograms & IO Cost Estimation

33 of 44

Optimization

Roadmap

  • For SFW, Joins queries
    • Brute-force? Sort? Hash? Count?
    • Pre-build an index? B+ tree, Hash?

  • What statistics can I keep to optimize?
    • E.g. Selectivity of columns, values

Cost in I/O, resources?

To query, maintain?

Build Query Plans

Analyze Plans

34 of 44

Example

Stats for spatial and temporal data (e.g., carpool or traffic models; user traffic)
35 of 44

Histograms

  • A histogram is a set of value ranges (“buckets”) and the frequencies of values in those buckets

  • How to choose the buckets?
    • Equi-width & Equi-depth

  • High-frequency values are very important (e.g., related products)

36 of 44

Example

[Bar chart: frequency of each value, i.e., the full per-value counts]

How do we compute how many values are between 8 and 10?

(Yes, it’s obvious: sum the per-value counts)

Problem: counts take up too much space!

37 of 44

What if we kept only the average?

How much space do the full counts (bucket_size=1) take?

How much space do the uniform counts (bucket_size=ALL) take?

And Average Error?

E.g., uniform count = 3 (average)
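To make the tradeoff concrete, here is a made-up frequency table (hypothetical data, chosen so the average count is 3, as on the slide): full counts answer the range query exactly, while keeping only the average gives a uniform estimate.

```python
# Hypothetical per-value counts for values 1..12 (average happens to be 3)
counts = {1: 5, 2: 4, 3: 3, 4: 3, 5: 2, 6: 4,
          7: 1, 8: 2, 9: 6, 10: 3, 11: 1, 12: 2}

# Full counts (bucket_size = 1): exact answer for "values between 8 and 10"
exact = sum(counts[v] for v in range(8, 11))

# Uniform counts (bucket_size = ALL): keep only the average per value
avg = sum(counts.values()) / len(counts)
estimate = avg * len(range(8, 11))

print(exact)     # 11
print(estimate)  # 9.0: the uniform estimate misses the spike at value 9
```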

38 of 44

Fundamental Tradeoffs

  • Want high resolution (like the full counts)

  • Want low space (like uniform)

  • Histograms are a compromise!

So how do we compute the “bucket” sizes?

39 of 44

Equi-width

Partition buckets into roughly same width (value range)

40 of 44

Equi-depth

Partition buckets for roughly same number of items (total frequency)
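The two bucketing rules above can be sketched directly (our own helper functions on made-up data; equi-width fixes the value range per bucket, equi-depth fixes the item count per bucket):

```python
def equi_width(values, k):
    """k buckets of equal value-range width; returns (lo, hi, count),
    hi exclusive except that the last bucket includes the max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    buckets = [[lo + i * width, lo + (i + 1) * width, 0] for i in range(k)]
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp max into last bucket
        buckets[i][2] += 1
    return [tuple(b) for b in buckets]

def equi_depth(values, k):
    """k buckets of ~equal total frequency; returns (lo, hi, count)."""
    s = sorted(values)
    n = len(s)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [(s[bounds[i]], s[bounds[i + 1] - 1], bounds[i + 1] - bounds[i])
            for i in range(k)]

data = [1, 1, 1, 2, 2, 3, 5, 8, 9, 9, 10, 12]
print(equi_width(data, 2))  # equal ranges, unequal counts (7 vs 5)
print(equi_depth(data, 2))  # unequal ranges, equal counts (6 and 6)
```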

41 of 44

Histograms

  • Simple, intuitive and popular

  • Parameters: # of buckets and type

  • Can extend to many attributes (multidimensional)

42 of 44

Maintaining Histograms

  • Histograms require that we update them!
    • Typically, you must run/schedule a command to update statistics on the database
    • Out of date histograms can be terrible!

  • Research on self-tuning histograms and the use of query feedback

43 of 44

Compressed Histograms

One popular approach

    • Store the most frequent values and their counts explicitly
    • Keep an equiwidth or equidepth one for the rest of the values

People continue to try fancy techniques here: wavelets, graphical models, entropy models, …

44 of 44

Optimization

Roadmap

  • For SFW, Joins queries
    • Brute-force? Sort? Hash? Count?
    • Pre-build an index? B+ tree, Hash?

  • What statistics can I keep to optimize?
    • E.g. Selectivity of columns, values

Cost in I/O, resources?

To query, maintain?

Build Query Plans

Analyze Plans