1 of 44

Putting it all together

Systems design

2 of 44

In this lecture

  1. Quick recap of scale

  2. A full systems example putting things together
    • How to compute data sizes?
    • Computing a large join?
    • Building an index

3 of 44

Feedback on real world applications

Two reactions

  • “Great. It helps a lot to put our learning into a much bigger context.”

  • “OMG. I got < half of what they were talking about.”

Main takeaways

  1. SQL systems are universal for scale and managing complexity (distributed/federated data sources)
  2. We learn Lego blocks in CS145. Folks build complex structures in industry
  3. Don't expect to grok every single detail (watch again to “peel the onion”)

4 of 44

Key Starter Questions for:

Scale, Scale, Scale

  • How to scale to large data sets?
    • Is data relational, or unstructured or ...?
    • Is data in Row or Column store?
    • Is data sorted or not?

  • How do we organize search values?
    • E.g., Hash indices, B+ trees

  • How to JOIN multiple datasets?
    • E.g., SortMerge, HashJoins

5 of 44

Data models

Semi-structured

(e.g. user profile, web site activity)

[aka JSON documents]

Structured

(e.g. ads, purchases, product tables)

[aka relational tables]

6 of 44

Hybrid Data Systems

[SQL → noSQL → newSQL →

Hybrid SQL (now)]

Hashing, Sorting

Indexing, Map-Reduce

JOINs, Aggregates

Unstructured

(e.g. images, videos, docs)

Semi-structured

(e.g. user profile, web site activity)

Structured

(e.g. ads, purchases, product tables)

In reality, a mix of systems -- e.g., an Amazon/e-commerce site (“DB Service” mix):

  • Structured relational: MySql, Oracle, SQL Server, BigQuery, Aurora
  • Semi-structured: DynamoDB, BigTable, Cassandra, Mongo, Memcached

Hybrid future

  • Blended systems
  • Built on same Lego blocks

7 of 44

Big Scale

Lego Blocks

| Roadmap | Primary data structures/algorithms |
| Hashing | HashTables, HashFunctions (hash_i(key) → location) |
| Sorting | BucketSort, QuickSort, MergeSort, MergeSortedFiles |
| Counting | HashTable + Counter (hash_i(key) → <count>) |

8 of 44

IO analysis

Recap

Sort N pages with B+1 buffer pages. (CPU cost is n log n for n tuples in RAM; negligible for large data next to IO, which is much, much slower.)

| Operation | IO cost | Why |
| Sort N pages when N ~= B | ~2N | One pass: (log_B 0.5) < 0 ⇒ no merge passes needed |
| Sort N pages when N ~= 2*B^2 | ~4N | Two passes: (log_B B) = 1 ⇒ one merge pass |
| SortMerge or HashJoin of R and S, with enough RAM | ~3 * (P(R) + P(S)) + OUT | P(R) and P(S) are the number of pages in R and S |
| SortMerge if already sorted; HashJoin if already partitioned | ~1 * (P(R) + P(S)) + OUT | Single matching pass over both inputs |
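These recap numbers all come from one external-sort cost model. A minimal sketch, assuming pass 0 writes initial runs of ~2(B+1) pages (consistent with the “log_B 0.5” and “log_B B” notes above); this is an approximation for the lecture's cost model, not a full implementation:

```python
import math

def external_sort_io(n_pages, n_buffers):
    """Approximate IO cost (page reads + writes) of external merge sort.

    Assumes pass 0 produces sorted runs of ~2*(B+1) pages, and each
    merge pass fans in B runs; every pass reads and writes all N pages.
    """
    runs = math.ceil(n_pages / (2 * (n_buffers + 1)))
    merge_passes = max(0, math.ceil(math.log(runs, n_buffers)))
    return 2 * n_pages * (1 + merge_passes)

B = 1000
print(external_sort_io(B, B))          # N ~= B: one pass, 2N = 2000
print(external_sort_io(2 * B * B, B))  # N ~= 2B^2: two passes, 4N = 8,000,000
```

Plugging in N ~= B gives ~2N, and N ~= 2B² gives ~4N, matching the table.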

9 of 44

B+ Trees

Recap

Let:

    • f = fanout (we’ll assume it’s constant for our cost model…)
    • N = total number of pages we need to index
    • F = fill-factor (usually ~= 2/3)
    • M = number of data pages

  • Our B+ Tree needs to have room to index N / F pages!
    • We have the fill factor in order to leave some open slots for faster insertions
    • E.g., if N = 100 and F = ⅔, you use 150 pages (each ⅔ full)

→ We need a B+ tree of height h = ⌈log_f (N / F)⌉
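Reading the height formula with numbers from the CoOccur example later in the lecture (f ~= 5460, N ~= 167K pages, F = 2/3); a quick sketch, not part of the slides:

```python
import math

def btree_height(n_pages, fanout, fill_factor=2/3):
    """Height h such that fanout^h covers the leaf level, where the
    leaves span n_pages / fill_factor pages (slack left for inserts)."""
    leaves = math.ceil(n_pages / fill_factor)
    return math.ceil(math.log(leaves, fanout))

# CoOccur index: ~167K pages to index, fanout ~5460, F = 2/3 -> h = 2
print(btree_height(167_000, 5460))
```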

10 of 44

In this lecture

  • Quick recap of scale

  • A full systems example putting things together
    • How to compute data sizes?
    • Computing a large join?
    • Building an index

11 of 44

Systems Design

Example:

Product CoOccur

Billion products

Product recommendations

12 of 44

Systems Design

Example:

Product CoOccur

Counting popular product-pairs

Story: Amazon/Walmart/Alibaba (AWA) need to compute ‘related products’ for all products so their users can explore and buy new products.

  • AWA use collaborative filtering (‘wisdom of crowds’) from their website logs.
  • Each time a user views a set of products, those products are related (co-occur)

⇒ Goal: compute product pairs and their co-occur count, across all users

Data input: AWA’s product catalog is 1 billion items. Each product record is ~1MB with a product description and image. AWA has 10 billion user-product views each week, from 1 billion users. Each log record stores <userID, productID, viewID, viewTime>.

CoOccur Product Count

Product Views

Web browser

13 of 44

Counting in RAM

(pre-CS145)

Counting product views for a billion product PAIRs

UserViews(UserId, ProductID): for each user, the products they viewed (e.g., Nespresso Coffee, Bread maker, Kenwood Espresso, ...)

CoOccur(ProductID, ProductID, count):

| Nespresso Coffee | Bread Maker | 301 |
| Bread maker | Kenwood Espresso | 24597 |
| Kenwood Espresso | Bike | 22 |
| ... | ... | ... |

Algorithm: For each user, and each pair of products (pi, pj) that user viewed:

CoOccur[pi, pj] += 1
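A minimal in-RAM sketch of this counter (product names and views are made up; ordering each pair as pi < pj is our choice so each unordered pair gets a single counter):

```python
from collections import defaultdict
from itertools import combinations

# Tiny stand-in for UserViews(UserId, ProductID)
user_views = {
    "u1": ["nespresso", "bread_maker", "kenwood"],
    "u2": ["bread_maker", "kenwood"],
}

co_occur = defaultdict(int)  # CoOccur[(pi, pj)] -> count
for products in user_views.values():
    # Canonical order (pi < pj): one counter per unordered pair
    for pi, pj in combinations(sorted(set(products)), 2):
        co_occur[(pi, pj)] += 1

print(co_occur[("bread_maker", "kenwood")])  # 2: co-viewed by u1 and u2
```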

14 of 44

Pre-cs145

Compute in RAM

(Engg approximations)

Counting product views

Input size (4 bytes for user, 4 bytes for productid)

~1Bil * [4 + 4] = 8 GB

Output size (4 bytes for productid, 4 bytes for count)

~1Bil * [4 + 4] = 8 GB

Trivial

Counting product pair views

‘Trivial?’

(if you have ~$25 billion, at $100 per 16 GB of RAM)

Plan #1: P * P matrix for counters in RAM (4 bytes for count)

    • RAM size = 1 Billion * 1Billion * 4 = 4 Million TBs
    • [Optimization: Half the counters]

Plan #1 (on disk): Let OS page into memory as needed

  • Worst case #1 > 100 million years

(if you seek to update each counter)
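A quick sanity check of Plan #1's “trivial?” numbers (all values taken from the slide):

```python
# Full P x P matrix of 4-byte counters for 1 billion products
P = 1_000_000_000
ram_bytes = P * P * 4
print(ram_bytes / 1e12)  # 4,000,000 TB = 4 million TBs

# Price it out at $100 per 16 GB of RAM
dollars = ram_bytes / (16 * 1e9) * 100
print(dollars / 1e9)     # ~25 (billion dollars)
```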

15 of 44

Systems Design

Example:

Product CoOccur

Counting popular product-pairs

Your mission: Design an efficient system to compute co-occur counts on Sundays from weekly logs and produce a CoOccurCount table <productID, productID, count>

  1. AWA’s data quality magicians recommend:
    • (a) retain only the top billion popular pairs, and (b) drop product pairs with co-occur counts less than a million.
    • Also, assume users view ten products on average each week (the UserInterest assumption).

  2. For simplicity, LogOfViews is stored sorted by <userID, productID>.
    • You can sequentially scan the log and produce co-occurring product pairs for each user. In other words, output (pi, pj) if a user viewed pi and pj.
    • This “stream” of tuples (TempCoOccur) may then be (a) stored on disk or (b) discarded after updating any data structures.

16 of 44

Post CS 145

Query

(SQL)

Relational Algebra (RA) Plan

Optimized RA Plan

Execution

Write a query like

SELECT …

FROM LogOfViews v1, LogOfViews v2

WHERE ...

GROUP BY v1.productId, v2.productId

HAVING count(*) > 1 million

17 of 44

Data Systems Design

Example:

Product CoOccur

Popular Systems design pattern

  1. Efficiently compute ‘batch’ of data

  2. Build Lookup index on result

  3. [For ‘streaming’ data, update with ‘micro batches’]

CoOccur Product Count

Product Views

Web browser

CoOccur Index

Lookup

Index

Related

Product

18 of 44

Optimize,

Evaluate design plans

  • For SFW, Joins queries
    • Sort? Hash? Count? Brute-force?
    • Pre-build an index? B+ tree, Hash?

  • What statistics can I keep to optimize?
    • E.g. Selectivity of columns, values

Cost in I/O, resources?

To query, maintain?

Build Query Plans

Analyze Plans

19 of 44

Systems Design

Example:

Product CoOccur

Pre-design

| Item | Size | Why |
| ProductId | 4 bytes | 1 billion products ⇒ need at least 30 bits (2^30 ~= 1 billion) to represent each product uniquely, so use 4 bytes |
| UserID | 4 bytes | 1 billion users |
| LogOfViewsID | 8 bytes | 10 billion product views |
| Product | 1 PB | 1 billion products of 1 MB each |
| LogOfViews | 240 GB (4M pages) | Each record is <userID, productID, viewID, viewTime>; assume 8 bytes for viewTime ⇒ 4 + 4 + 8 + 8 = 24 bytes per record; 10 billion × 24 bytes = 240 GBs |
| CoOccur | 12 GB | <productID, productID, count> = 12 bytes per record (4 + 4 for the two productIDs, 4 for count); top billion pairs (per AWA data quality) × 12 bytes = 12 GBs |
| TempCoOccur (with UserInterest assumption, of ~10 views/user) | 800 GB (12.5M pages) | 1 billion users × 10^2 = 100 billion pairs; @8 bytes/record = 800 GBs |
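The pre-design sizes are simple arithmetic; a short sketch to sanity-check them (all record layouts from the slide):

```python
GB = 10**9

views = 10 * 10**9                 # 10 billion weekly product views
log_record = 4 + 4 + 8 + 8         # userID + productID + viewID + viewTime
print(views * log_record // GB)    # LogOfViews: 240 GB

pairs = 10**9 * 10**2              # 1B users x ~10 views -> ~100 pairs each
print(pairs * 8 // GB)             # TempCoOccur: 800 GB at 8 bytes/pair

print(10**9 * 12 // GB)            # CoOccur: top 1B pairs x 12 bytes = 12 GB
```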

20 of 44

Systems Design

Example:

Product CoOccur

Plans?

Plan #1: With 1 machine, use RAM to count (cost = $25B, or > 100 million years on disk).

Plan #2: With 1 machine, analyze with the UserInterest assumption:

  1. Scan LogOfViews. For each user, append <p_i, p_j> to a file TempCoOccurLog if the user has viewed products p_i and p_j (i.e., produce per-user co-occur product pairs; append to log ⇒ no seeks). [Produces TempCoOccurLog]
  2. Externally sort TempCoOccurLog on disk, so identical product pairs are adjacent to each other in the sorted file. [Produces sorted TempCoOccurLog]
  3. Scan sorted TempCoOccurLog. With a single pass, you can count co-occur pairs. Drop co-occur pairs with count < 1 million. [Produces counted, sorted TempCoOccurLog]
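A toy end-to-end sketch of Plan #2 (made-up data; the in-memory `sort()` stands in for the external sort, and `min_count = 2` stands in for the 1M threshold):

```python
from itertools import combinations, groupby

# LogOfViews, already sorted by <userID, productID>
log_of_views = [("u1", "p1"), ("u1", "p2"), ("u1", "p3"),
                ("u2", "p1"), ("u2", "p2")]

# Step 1: one scan -> append per-user co-occur pairs to TempCoOccurLog
temp_log = []
for user, group in groupby(log_of_views, key=lambda r: r[0]):
    products = sorted({pid for _, pid in group})
    temp_log.extend(combinations(products, 2))

# Step 2: sort so identical pairs are adjacent (external sort at scale)
temp_log.sort()

# Step 3: a single pass over the sorted log counts each pair
min_count = 2
co_occur = [(pair, len(list(g))) for pair, g in groupby(temp_log)]
co_occur = [(pair, c) for pair, c in co_occur if c >= min_count]
print(co_occur)  # [(('p1', 'p2'), 2)]
```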

21 of 44

Systems Design

Example:

Product CoOccur

Plan #2

| Steps | Cost (time) | Why |
| Scan LogOfViews | ~2400 secs | 240 GB @ 100 MB/sec |
| Append <p_i, p_j> to TempCoOccurLog | ~8000 secs | 800 GB @ 100 MB/sec |
| Externally sort TempCoOccurLog on disk (assume sort cost is ~2N, where N is the number of pages and B ~= N buffers; TempCoOccurLog and runs stored sequentially) | ~16,000 secs | IO cost ≈ 2 × (1 seek + scan of 12.5 million pages × 64 KB/page) = 2 × scan cost of 800 GB, i.e., 16,000 secs (2 × 800 GB @ 100 MB/sec) |
| Scan sorted TempCoOccurLog and keep counts in CoOccur | ~8000 secs | 800 GB @ 100 MB/sec |

22 of 44

Systems Design

Example:

Product CoOccur

Plan #2

| Steps | Cost (IO) | Why |
| Scan LogOfViews | 4M | 240 GB (4M pages) |
| Append <p_i, p_j> to TempCoOccurLog | 12.5M | 800 GB (12.5M pages) |
| Externally sort TempCoOccurLog on disk (assume sort cost is ~2N, where N is the number of pages and B ~= N buffers) | 25M | IO cost ≈ 2 × N = 2 × 12.5M |
| Scan sorted TempCoOccurLog and keep counts in CoOccur | 12.5M | 800 GB (12.5M pages) |

Total IO cost = 4M + 12.5M + 25M + 12.5M = 54M

Recall: if stored sequentially (i.e., scan at 100 MBps), then time (secs) = (54M × 64 KB) / 100 MBps ≈ 34.5K secs
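A quick sanity check of the totals (64 KB taken as 64,000 bytes, matching the slide's arithmetic):

```python
page = 64 * 1000            # 64 KB page (decimal, as in the slide)
scan_rate = 100 * 10**6     # 100 MB/sec sequential scan

# scan + append + external sort + counting pass
total_pages = 4e6 + 12.5e6 + 25e6 + 12.5e6
print(total_pages / 1e6)    # 54.0 -> 54M page IOs

secs = total_pages * page / scan_rate
print(secs)                 # 34560.0 secs, i.e., ~34.5K secs (~9.6 hours)
```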

23 of 44

Optimize,

Evaluate design plans 1, 2, 3, ...

  • For SFW, Joins queries
    • Sort? Hash? Count? Brute-force?
    • Pre-build an index? B+ tree, Hash?

  • What statistics can I keep to optimize?
    • E.g. Selectivity of columns, values

Cost in I/O, resources?

To query, maintain?

Build Query Plans

Analyze Plans

24 of 44

Data Systems Design

Example:

Product CoOccur

CoOccur Products

Product Views

Web browser

CoOccur Index

Lookup

Index

Related

Product

25 of 44

Systems Design

Example:

Product CoOccur

B+ tree index

Evaluate the cost of lookups in a B+ tree with search key = productId. How many IO lookups can we expect if we had 1 GB of RAM for the index? (CoOccur data may not be clustered)

CoOccur <pid, cooccur_pid, count>
  • M: 1 billion tuples × (4 + 4 + 4) bytes each (~167K pages)

CoOccur Index <pid, pointer>
  • N: 1 billion tuples × (4 + 8) bytes each (~167K pages)
  • N/F = 250,000 pages (with slack)

Contrast Example 2: Index on Product data? [Recall: 1 billion tuples × 1 MB each = 1 PB]

  • M: 1 PB / 64 KB = 15.6 billion pages
  • N: <pid, pointer> = 167K pages

26 of 44

Systems Design

Example:

Product CoOccur

B+ tree index

| Quantity | Value | Why |
| N/F | 250,000 | From previous page |
| Fanout f | ~5460 | 4 bytes for productId + 8 bytes per pointer @ 64 KB/page: f × 4 + f × 8 <= 64K ⇒ f ~= 5460 |

CoOccur Index <pid, pointer> structure:

  • Root: 1 page, with 5460 pointers
  • Level 1: <= 5460 pages, with 5460 pointers each to the next level
  • Leaf: N/F = 250,000 pages (167K at F = 2/3); each leaf can hold up to 5460 search keys
  • Data: 1 billion CoOccur <pid, cooccur_pid, count> records, 12 bytes each

⇒ h = 2 (recall: we need a B+ tree of height h = ⌈log_f (N/F)⌉)

27 of 44

Data Systems Design

Example:

Product CoOccur

CoOccur Products

  • CoOccur Index: N/F = 250,000 pages; CoOccur data: M = 167,000 pages
  • Product index: N/F = 250,000 pages; Products data: M = 15.6 billion pages
  • Both indexes together: ~0.5M index pages ~= 32 GBs ⇒ keep in RAM (fast access, in the datacenter)

User latency (app/web browser)?

  1. Lookup Product index
  2. Lookup Products image
  3. Lookup CoOccur index

[1] + [2] + [3] < 100 msecs

28 of 44

Data Systems Design

Example:

Bigger Product CoOccur

Problem so far

  • AWA’s product catalog is 1 billion items. AWA has 10 billion product views each week, from 1 billion users. Each log record stores <userID, productID, viewID, viewtime>

Consider 1000x Bigger problem!

  • Product catalog is 1 trillion items. AWA has 10 billion product views. Rest stays same

⇒ What changes?

29 of 44

Systems Design

Example:

Product CoOccur

Pre-design

| Item | Size | Why |
| ProductId | 8 bytes | 1 trillion products ⇒ need at least 40 bits (2^40 ~= 1 trillion) to represent each product uniquely, so use 8 bytes (i.e., 64 bits) |
| UserID | 4 bytes | 1 billion users |
| LogOfViewsID | 8 bytes | 10 billion product views |
| Product | 1000 PB | 1 trillion products of 1 MB each |
| Users | Unknown | |
| LogOfViews | 280 GB | Each record is <userID, productID, viewID, viewTime>; 4 + 8 + 8 + 8 = 28 bytes per record; 10 billion × 28 bytes = 280 GBs |
| CoOccur | 20 GB | <productID, productID, count> = 20 bytes per record (8 + 8 for the two productIDs, 4 for count); top billion pairs × 20 bytes = 20 GBs |
| TempCoOccur (with UserSession assumption, of ~10 views/user) | 1600 GB | 1 billion users × 10^2 = 100 billion pairs; @16 bytes/record = 1600 GBs |

30 of 44

Data Systems Design

Popular Systems design pattern

  • Efficiently compute ‘batch’ of data (sort, hash, count)

  • Build Lookup index on result (b+ tree, hash table)

  • For ‘streaming’ data, update with ‘micro batches’

Popular problems

  • Related videos (youtube), people (Facebook), pages (web)

  • Security threats, malware (security), correlation analysis

31 of 44

Key Goal:

Scale, Scale, Scale

  • How to scale to large data sets?
    • Is data in Row or Column store?
    • Is data sorted or not?

  • How do we organize search values?
    • E.g., Hash indices, B+ trees

  • How to JOIN multiple datasets?
    • E.g., SortMerge, HashJoins

32 of 44

Histograms & IO Cost Estimation

33 of 44

Optimization

Roadmap

  • For SFW, Joins queries
    • Brute-force? Sort? Hash? Count?
    • Pre-build an index? B+ tree, Hash?

  • What statistics can I keep to optimize?
    • E.g. Selectivity of columns, values

Cost in I/O, resources?

To query, maintain?

Build Query Plans

Analyze Plans

34 of 44

Example

Stats for spatial and temporal data (e.g., carpool or traffic models; user traffic)
35 of 44

Histograms

  • A histogram is a set of value ranges (“buckets”) and the frequencies of values in those buckets

  • How to choose the buckets?
    • Equi-width & Equi-depth

  • High-frequency values are very important (e.g., related products)

36 of 44

Example

[Bar chart: frequency of each value, i.e., the full per-value counts]

How do we compute how many values are between 8 and 10?

(Yes, it’s obvious: sum the per-value counts)

Problem: counts take up too much space!

37 of 44

What if we kept only the average?

How much space do the full counts (bucket_size=1) take?

How much space do the uniform counts (bucket_size=ALL) take?

And Average Error?

E.g., uniform count = 3 (average)
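To make the tradeoff concrete, here is a made-up frequency table (hypothetical data, chosen so the average count is 3, as on the slide): full counts answer the range query exactly, while keeping only the average gives a uniform estimate.

```python
# Hypothetical per-value counts for values 1..12 (average happens to be 3)
counts = {1: 5, 2: 4, 3: 3, 4: 3, 5: 2, 6: 4,
          7: 1, 8: 2, 9: 6, 10: 3, 11: 1, 12: 2}

# Full counts (bucket_size = 1): exact answer for "values between 8 and 10"
exact = sum(counts[v] for v in range(8, 11))

# Uniform counts (bucket_size = ALL): keep only the average per value
avg = sum(counts.values()) / len(counts)
estimate = avg * len(range(8, 11))

print(exact)     # 11
print(estimate)  # 9.0: the uniform estimate misses the spike at value 9
```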

38 of 44

Fundamental Tradeoffs

  • Want high resolution (like the full counts)

  • Want low space (like uniform)

  • Histograms are a compromise!

So how do we compute the “bucket” sizes?

39 of 44

Equi-width

Partition buckets into roughly same width (value range)

40 of 44

Equi-depth

Partition buckets for roughly same number of items (total frequency)
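The two bucketing rules above can be sketched directly (our own helper functions on made-up data; equi-width fixes the value range per bucket, equi-depth fixes the item count per bucket):

```python
def equi_width(values, k):
    """k buckets of equal value-range width; returns (lo, hi, count),
    hi exclusive except that the last bucket includes the max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    buckets = [[lo + i * width, lo + (i + 1) * width, 0] for i in range(k)]
    for v in values:
        i = min(int((v - lo) / width), k - 1)  # clamp max into last bucket
        buckets[i][2] += 1
    return [tuple(b) for b in buckets]

def equi_depth(values, k):
    """k buckets of ~equal total frequency; returns (lo, hi, count)."""
    s = sorted(values)
    n = len(s)
    bounds = [round(i * n / k) for i in range(k + 1)]
    return [(s[bounds[i]], s[bounds[i + 1] - 1], bounds[i + 1] - bounds[i])
            for i in range(k)]

data = [1, 1, 1, 2, 2, 3, 5, 8, 9, 9, 10, 12]
print(equi_width(data, 2))  # equal ranges, unequal counts (7 vs 5)
print(equi_depth(data, 2))  # unequal ranges, equal counts (6 and 6)
```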

41 of 44

Histograms

  • Simple, intuitive and popular

  • Parameters: # of buckets and type

  • Can extend to many attributes (multidimensional)

42 of 44

Maintaining Histograms

  • Histograms require that we update them!
    • Typically, you must run/schedule a command to update statistics on the database
    • Out of date histograms can be terrible!

  • Research on self-tuning histograms and the use of query feedback

43 of 44

Compressed Histograms

One popular approach

    • Store the most frequent values and their counts explicitly
    • Keep an equiwidth or equidepth one for the rest of the values

People continue to try fancy techniques here: wavelets, graphical models, entropy models, …

44 of 44

Optimization

Roadmap

  • For SFW, Joins queries
    • Brute-force? Sort? Hash? Count?
    • Pre-build an index? B+ tree, Hash?

  • What statistics can I keep to optimize?
    • E.g. Selectivity of columns, values

Cost in I/O, resources?

To query, maintain?

Build Query Plans

Analyze Plans