Scale, scale, scale
(Indexing + Sorting, Hashing, Counting)
This week
Scale, Scale, Scale
How to read/write indices?
Sorting, Counting, Hashing
(for RAM, Disk, Clusters)
Big Scaling (with Indexes)
Roadmap

| | Hashing | Sorting | Counting |
| RAM (primary data structures/algorithms) | HashTables (hash_i(x)) | BucketSort, QuickSort, MergeSort | HashTable + Counter (hash_i(key) → <count>) |
| Disk | Hashes for disk location (hash_i(x)) | SortFiles, MergeSortedFiles | ????? |
| Clusters | Hashes for machines, shards (hash_i(x)) | SortFiles, MergeSortedFiles | ????? |
Counting?
Counting product views for a billion products
Counting popular product-pairs
Counting in RAM
... | Nespresso Coffee: 5,003 | ... |
Counting product views for a billion products
UserViews(UserId, ProductID)
| Nespresso Coffee |
| Bread Maker |
| Kenwood Espresso |
| ... |

| Nespresso Coffee | 5,003 |
| Bread Maker | 20,007 |
| Kenwood Espresso | 45 |
| ... |
ProductViews(ProductID, count)
Algorithm: For each (user, product p_i) viewed:
    Counter[p_i] += 1
// Counter[..] is Python 'dict' notation
// e.g., h_1(p_i) denotes the bucket location in the hash table
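A minimal Python sketch of this in-RAM counter; user_views, an iterable of (UserId, ProductID) rows, is a hypothetical stand-in for the UserViews table:

    from collections import Counter

    def count_product_views(user_views):
        # user_views: iterable of (user_id, product_id) rows (assumed input)
        views = Counter()                  # hash table: product_id -> count
        for _user, product in user_views:
            views[product] += 1            # hashing product_id picks the bucket
        return views

    # count_product_views([(1, "Nespresso Coffee"), (2, "Nespresso Coffee")])
    #   -> Counter({'Nespresso Coffee': 2})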
Counting in RAM
... | (Nespresso Coffee, Bread Maker): 245 | ... |
Counting views for product PAIRS across a billion products
UserViews(UserId, ProductID)
| Nespresso Coffee |
| Bread Maker |
| Kenwood Espresso |
| ... |

| Nespresso Coffee | Bread Maker | 301 |
| Bread Maker | Kenwood Espresso | 24,597 |
| Kenwood Espresso | Bike | 22 |
| ... |
ProductViews(ProductID, ProductID, count)
Algorithm: For each user and product pair (p_i, p_j) they viewed:
    Counter[p_i, p_j] += 1
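A sketch of the pair counter, assuming the UserViews rows arrive grouped by user (itertools.groupby only merges adjacent rows); each user's ~q products yield ~q^2/2 pairs:

    from collections import Counter
    from itertools import combinations, groupby
    from operator import itemgetter

    def count_pair_views(user_views):
        # user_views: (user_id, product_id) rows, grouped by user_id (assumption)
        pairs = Counter()
        for _user, rows in groupby(user_views, key=itemgetter(0)):
            products = sorted({pid for _u, pid in rows})  # the user's ~q products
            for pi, pj in combinations(products, 2):
                pairs[pi, pj] += 1                        # Counter[pi, pj] += 1
        return pairs

Sorting each user's products first normalizes pair order, so (a, b) and (b, a) accumulate in a single counter.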
Sizing up the problem

Size data:
| Number of products | P | ~1 billion |
| Number of users | u | ~1 billion |
| Products viewed per user/session (avg) | q | ~10 |
| Number of product pairs | P^2 | ~1B × 1B = 10^18 |
| Number of product pairs with count > 1 | k·P^2 | k·10^18 |
| Bytes per ProductID (2^32 ≈ 4 billion IDs) | 4 |
| Bytes per UserID | 4 |
| Bytes per tuple (2 ProductIDs + count) | 12 |

Machine(s):
| RAM; page/disk block size | 64 GB; 64 KB |
| Disk seek; disk transfer rate | 10 msec; 100 MB/sec |
Look for data blowups...
Input data
Output data
Intermediate data (blowups)
Performance Analysis
(Engineering approximations)

Counting product views:
| Input size (4 B UserId + 4 B ProductID per view) | ~1 bil × (4 + 4) B = 8 GB |
| Output size (4 B ProductID + 4 B count) | ~1 bil × (4 + 4) B = 8 GB |
Trivial: the input streams through, and the 8 GB output fits in 64 GB of RAM.

Counting product-pair views:
| Output/intermediate size, worst case (a P × P matrix of 4 B counters) | ~1 bil × 1 bil × 4 B = 4 million TB |
'Trivial?'
(only if you have ~$25 billion, at $100 per 16 GB of RAM)
Design 1: A P × P matrix of counters in RAM.
Design 2: An array of products + a per-product linked list of <other product, counter>.
Designs 1 & 2 (on disk): let the OS page memory in as needed; but ~10^18 random counter updates at ~10 msec per seek is ~10^16 sec, roughly 300 million years.
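Toy sketches of the two layouts (illustrative only; a dict-of-dicts stands in for Design 2's array + linked lists, and at P ≈ 10^9 neither structure fits in RAM, which is exactly the problem):

    import numpy as np
    from collections import defaultdict

    P = 1000                                   # toy size; the real P is ~10^9

    # Design 1: dense P x P counter matrix, P^2 * 4 bytes in total.
    matrix = np.zeros((P, P), dtype=np.uint32)
    matrix[3, 7] += 1                          # Counter[p3, p7] += 1

    # Design 2: sparse per-product map of <other product, counter>.
    adj = defaultdict(dict)
    adj[3][7] = adj[3].get(7, 0) + 1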
Idea #1: smarter disk strategy
With Sorting
Design 3: Output u × q^2 (≈ 10^9 users × 100 pairs each = 10^11) tuples to a file
UserViews(UserId, ProductID)
| Nespresso Coffee |
| Bread Maker |
| Kenwood Espresso |
| ... |
Algorithm: For each user and product pair (p_i, p_j) they viewed:
    Append <p_i, p_j> to the file-to-sort
External Sort the file, then Count equal adjacent pairs
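A sketch of Design 3, delegating the external merge sort to Unix sort (the file names and the per-user-session input shape are assumptions, not part of the slides):

    import subprocess
    from itertools import combinations, groupby

    def emit_pairs(user_sessions, path="pairs.txt"):
        # user_sessions: iterable of per-user product lists (assumed input);
        # writes ~u * q^2 lines in total.
        with open(path, "w") as f:
            for products in user_sessions:
                for pi, pj in combinations(sorted(set(products)), 2):
                    f.write(f"{pi}\t{pj}\n")

    def external_sort_then_count(path="pairs.txt"):
        # External sort on disk, then count runs of equal adjacent lines
        # in a single sequential pass.
        subprocess.run(["sort", "-o", path + ".sorted", path], check=True)
        with open(path + ".sorted") as f:
            for pair, run in groupby(line.rstrip("\n") for line in f):
                yield pair, sum(1 for _ in run)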
Design 3
Recall Sorting
Side math:
B = 64 GB / 64 KB = 1 million buffer pages
N = 800 GB / 64 KB = 12.5 million data pages (u × q^2 = 10^11 pairs × 8 B = 800 GB)
Merge passes = ⌈log_B(N / 2B)⌉ = ⌈log_{1M}(12.5M / (2 × 1M))⌉ = ⌈0.13⌉ = 1
⇒ for B ≈ N, sorting I/O cost ≈ 2N pages
Sort file
⇒ Compute time ≈ 7.5 hrs !!
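The same side math as a back-of-envelope check, using the textbook external-merge-sort cost of 2N page I/Os per pass over (1 + #merge passes) passes; the slides' tighter ~2N estimate and ~7.5 hr figure use slightly different accounting, but the order of magnitude agrees:

    import math

    PAGE = 64 * 1024                        # 64 KB pages
    B = (64 * 2**30) // PAGE                # ~1 million buffer pages
    N = (800 * 10**9) // PAGE               # ~12.2 million data pages
    merge_passes = math.ceil(math.log(N / (2 * B), B))    # ceil(~0.13) = 1
    total_io = 2 * N * (1 + merge_passes) * PAGE          # bytes read + written
    print(merge_passes, total_io / (100 * 10**6) / 3600)  # 1 pass, ~8.9 hrs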
Idea #2: partition
Output smarter, with sorting (hashing + parallelism)
Design 4: Output the u × q^2 tuples, hash-partitioned across numFiles files
With parallel disks, bucket files can be written and sorted independently
UserViews(UserId, ProductID)
| Nespresso Coffee |
| Bread Maker |
| Kenwood Espresso |
| ... |
Algorithm: For each user and product pair (p_i, p_j) they viewed:
    x = hash(p_i, p_j) % numFiles   // bucket
    Append <p_i, p_j> to file f_x
External Sort each f_x, as you go
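A sketch of the partitioning step (num_files and the bucket file names are illustrative). Since hashing sends every occurrence of a pair to the same bucket, each bucket file can then be sorted and counted independently, on separate disks or machines:

    from itertools import combinations

    def partition_pairs(user_sessions, num_files=1024):
        # Assumes holding num_files file handles open at once is acceptable.
        files = [open(f"bucket_{x}.txt", "w") for x in range(num_files)]
        try:
            for products in user_sessions:
                for pi, pj in combinations(sorted(set(products)), 2):
                    x = hash((pi, pj)) % num_files   # bucket
                    files[x].write(f"{pi}\t{pj}\n")
        finally:
            for f in files:
                f.close()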
Design 4
Idea #3: simplify the problem, approximate the answer
Popular product pairs
Design 5: Cut down I/O time with sampling / probabilistic hashing (e.g., keep probability p′ = 1%)
UserViews(UserId, ProductID)
| Nespresso Coffee |
| Bread Maker |
| Kenwood Espresso |
| ... |
Algorithm: For each user and product pair (p_i, p_j) they viewed:
    x = hash(p_i, p_j) % numFiles   // bucket
    With probability p′, append <p_i, p_j> to file f_x
External Sort each f_x, Count as you go
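Design 4's partitioner with sampling bolted on (a sketch; p and the file naming are assumptions). An observed count c estimates a true count of about c / p, so rare pairs may vanish entirely, which is acceptable when only popular pairs matter:

    import random
    from itertools import combinations

    def partition_pairs_sampled(user_sessions, num_files=1024, p=0.01):
        files = [open(f"bucket_{x}.txt", "w") for x in range(num_files)]
        try:
            for products in user_sessions:
                for pi, pj in combinations(sorted(set(products)), 2):
                    if random.random() < p:              # keep ~1% of pairs
                        x = hash((pi, pj)) % num_files   # bucket
                        files[x].write(f"{pi}\t{pj}\n")
        finally:
            for f in files:
                f.close()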
Design 5
Summary
Scale, Scale, Scale
Sorting, hashing, counting toolkit
⇒ With the right scaling techniques, we went from
~$25 billion (or ~300 million years: 10^18 random seeks × 10 msec) ⇒ minutes/hours and under $10K
General note on query optimization (more in next lecture)