Scale, scale, scale
(Indexing + Sorting, Hashing, Counting)
This week
Scale, Scale, Scale
How to read/write indices?
Sorting, Counting, Hashing
(for RAM, Disk, Clusters)
Big Scaling (with Indexes)
Roadmap

| | Hashing | Sorting | Counting |
| RAM (primary data structures/algorithms) | HashTables (hash_i(x)) | BucketSort, QuickSort, MergeSort | HashTable + Counter (hash_i(key) → <count>) |
| Disk | Hashes for disk location (hash_i(x)) | SortFiles, MergeSortedFiles | ????? |
| Clusters | Hashes for machines, shards (hash_i(x)) | SortFiles, MergeSortedFiles | ????? |
Counting?
Counting product views for a billion products
Counting popular product-pairs
Counting in RAM
... | Nespresso Coffee: 5,003 | ... |
Counting product views for a billion products
UserViews(UserId, ProductID)
| Nespresso Coffee |
| Bread Maker |
| Kenwood Espresso |
| ... |

| Nespresso Coffee | 5,003 |
| Bread Maker | 20,007 |
| Kenwood Espresso | 45 |
| ... |
ProductViews(ProductID, count)
Algorithm: For each (user, product p_i) viewed:
    Counter[p_i] += 1
// Counter[..] is Python 'dict' notation
// e.g., h_1(p_i) denotes the bucket location in the hash table
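A minimal Python sketch of this in-RAM counter; user_views, an iterable of (UserId, ProductID) rows, is a hypothetical stand-in for the UserViews table:

    from collections import Counter

    def count_product_views(user_views):
        # user_views: iterable of (user_id, product_id) rows (assumed input)
        views = Counter()                  # hash table: product_id -> count
        for _user, product in user_views:
            views[product] += 1            # hashing product_id picks the bucket
        return views

    # count_product_views([(1, "Nespresso Coffee"), (2, "Nespresso Coffee")])
    #   -> Counter({'Nespresso Coffee': 2})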
Counting in RAM
... | (Nespresso Coffee, Bread Maker): 245 | ... |
Counting views for product PAIRS across a billion products
UserViews(UserId, ProductID)
| Nespresso Coffee |
| Bread Maker |
| Kenwood Espresso |
| ... |

| Nespresso Coffee | Bread Maker | 301 |
| Bread Maker | Kenwood Espresso | 24,597 |
| Kenwood Espresso | Bike | 22 |
| ... |
ProductViews(ProductID, ProductID, count)
Algorithm: For each user and product pair (p_i, p_j) they viewed:
    Counter[p_i, p_j] += 1
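A sketch of the pair counter, assuming the UserViews rows arrive grouped by user (itertools.groupby only merges adjacent rows); each user's ~q products yield ~q^2/2 pairs:

    from collections import Counter
    from itertools import combinations, groupby
    from operator import itemgetter

    def count_pair_views(user_views):
        # user_views: (user_id, product_id) rows, grouped by user_id (assumption)
        pairs = Counter()
        for _user, rows in groupby(user_views, key=itemgetter(0)):
            products = sorted({pid for _u, pid in rows})  # the user's ~q products
            for pi, pj in combinations(products, 2):
                pairs[pi, pj] += 1                        # Counter[pi, pj] += 1
        return pairs

Sorting each user's products first normalizes pair order, so (a, b) and (b, a) accumulate in a single counter.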
Sizing up the problem

Size data:
| Number of products | P | ~1 billion |
| Number of users | u | ~1 billion |
| Products viewed per user/session (avg) | q | ~10 |
| Number of product pairs | P^2 | ~1B × 1B = 10^18 |
| Number of product pairs with count > 1 | k·P^2 | k·10^18 |
| Bytes per ProductID (2^32 ≈ 4 billion IDs) | 4 |
| Bytes per UserID | 4 |
| Bytes per tuple (2 ProductIDs + count) | 12 |

Machine(s):
| RAM; page/disk block size | 64 GB; 64 KB |
| Disk seek; disk transfer rate | 10 msec; 100 MB/sec |
Look for data blowups...
Input data
Output data
Intermediate data (blowups)
Performance Analysis
(Engineering approximations)

Counting product views:
| Input size (4 B UserId + 4 B ProductID per view) | ~1 bil × (4 + 4) B = 8 GB |
| Output size (4 B ProductID + 4 B count) | ~1 bil × (4 + 4) B = 8 GB |
Trivial: the input streams through, and the 8 GB output fits in 64 GB of RAM.

Counting product-pair views:
| Output/intermediate size, worst case (a P × P matrix of 4 B counters) | ~1 bil × 1 bil × 4 B = 4 million TB |
'Trivial?'
(only if you have ~$25 billion, at $100 per 16 GB of RAM)
Design 1: A P × P matrix of counters in RAM.
Design 2: An array of products + a per-product linked list of <other product, counter>.
Designs 1 & 2 (on disk): let the OS page memory in as needed; but ~10^18 random counter updates at ~10 msec per seek is ~10^16 sec, roughly 300 million years.
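Toy sketches of the two layouts (illustrative only; a dict-of-dicts stands in for Design 2's array + linked lists, and at P ≈ 10^9 neither structure fits in RAM, which is exactly the problem):

    import numpy as np
    from collections import defaultdict

    P = 1000                                   # toy size; the real P is ~10^9

    # Design 1: dense P x P counter matrix, P^2 * 4 bytes in total.
    matrix = np.zeros((P, P), dtype=np.uint32)
    matrix[3, 7] += 1                          # Counter[p3, p7] += 1

    # Design 2: sparse per-product map of <other product, counter>.
    adj = defaultdict(dict)
    adj[3][7] = adj[3].get(7, 0) + 1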
Idea #1: smarter disk strategy
With Sorting
Design 3: Output u × q^2 (≈ 10^9 users × 100 pairs each = 10^11) tuples to a file
UserViews(UserId, ProductID)
| Nespresso Coffee |
| Bread Maker |
| Kenwood Espresso |
| ... |
Algorithm: For each user and product pair (p_i, p_j) they viewed:
    Append <p_i, p_j> to the file-to-sort
External Sort the file, then Count equal adjacent pairs
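A sketch of Design 3, delegating the external merge sort to Unix sort (the file names and the per-user-session input shape are assumptions, not part of the slides):

    import subprocess
    from itertools import combinations, groupby

    def emit_pairs(user_sessions, path="pairs.txt"):
        # user_sessions: iterable of per-user product lists (assumed input);
        # writes ~u * q^2 lines in total.
        with open(path, "w") as f:
            for products in user_sessions:
                for pi, pj in combinations(sorted(set(products)), 2):
                    f.write(f"{pi}\t{pj}\n")

    def external_sort_then_count(path="pairs.txt"):
        # External sort on disk, then count runs of equal adjacent lines
        # in a single sequential pass.
        subprocess.run(["sort", "-o", path + ".sorted", path], check=True)
        with open(path + ".sorted") as f:
            for pair, run in groupby(line.rstrip("\n") for line in f):
                yield pair, sum(1 for _ in run)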
Design 3
Recall Sorting
Side math:
B = 64 GB / 64 KB = 1 million buffer pages
N = 800 GB / 64 KB = 12.5 million data pages (u × q^2 = 10^11 pairs × 8 B = 800 GB)
Merge passes = ⌈log_B(N / 2B)⌉ = ⌈log_{1M}(12.5M / (2 × 1M))⌉ = ⌈0.13⌉ = 1
⇒ for B ≈ N, sorting I/O cost ≈ 2N pages
Sort file
⇒ Compute time ≈ 7.5 hrs !!
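The same side math as a back-of-envelope check, using the textbook external-merge-sort cost of 2N page I/Os per pass over (1 + #merge passes) passes; the slides' tighter ~2N estimate and ~7.5 hr figure use slightly different accounting, but the order of magnitude agrees:

    import math

    PAGE = 64 * 1024                        # 64 KB pages
    B = (64 * 2**30) // PAGE                # ~1 million buffer pages
    N = (800 * 10**9) // PAGE               # ~12.2 million data pages
    merge_passes = math.ceil(math.log(N / (2 * B), B))    # ceil(~0.13) = 1
    total_io = 2 * N * (1 + merge_passes) * PAGE          # bytes read + written
    print(merge_passes, total_io / (100 * 10**6) / 3600)  # 1 pass, ~8.9 hrs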
Idea #2: partition
Output smarter, with sorting (hashing + parallelism)
Design 4: Output the u × q^2 tuples, hash-partitioned across numFiles files
With parallel disks, bucket files can be written and sorted independently
UserViews(UserId, ProductID)
| Nespresso Coffee |
| Bread Maker |
| Kenwood Espresso |
| ... |
Algorithm: For each user and product pair (p_i, p_j) they viewed:
    x = hash(p_i, p_j) % numFiles   // bucket
    Append <p_i, p_j> to file f_x
External Sort each f_x, as you go
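A sketch of the partitioning step (num_files and the bucket file names are illustrative). Since hashing sends every occurrence of a pair to the same bucket, each bucket file can then be sorted and counted independently, on separate disks or machines:

    from itertools import combinations

    def partition_pairs(user_sessions, num_files=1024):
        # Assumes holding num_files file handles open at once is acceptable.
        files = [open(f"bucket_{x}.txt", "w") for x in range(num_files)]
        try:
            for products in user_sessions:
                for pi, pj in combinations(sorted(set(products)), 2):
                    x = hash((pi, pj)) % num_files   # bucket
                    files[x].write(f"{pi}\t{pj}\n")
        finally:
            for f in files:
                f.close()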
Design 4
Idea #3: simplify the problem, approximate the answer
Popular product pairs
Design 5: Cut down I/O time with sampling / probabilistic hashing (e.g., keep probability p′ = 1%)
UserViews(UserId, ProductID)
| Nespresso Coffee |
| Bread Maker |
| Kenwood Espresso |
| ... |
Algorithm: For each user and product pair (p_i, p_j) they viewed:
    x = hash(p_i, p_j) % numFiles   // bucket
    With probability p′, append <p_i, p_j> to file f_x
External Sort each f_x, Count as you go
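Design 4's partitioner with sampling bolted on (a sketch; p and the file naming are assumptions). An observed count c estimates a true count of about c / p, so rare pairs may vanish entirely, which is acceptable when only popular pairs matter:

    import random
    from itertools import combinations

    def partition_pairs_sampled(user_sessions, num_files=1024, p=0.01):
        files = [open(f"bucket_{x}.txt", "w") for x in range(num_files)]
        try:
            for products in user_sessions:
                for pi, pj in combinations(sorted(set(products)), 2):
                    if random.random() < p:              # keep ~1% of pairs
                        x = hash((pi, pj)) % num_files   # bucket
                        files[x].write(f"{pi}\t{pj}\n")
        finally:
            for f in files:
                f.close()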
Design 5
Summary
Scale, Scale, Scale
Sorting, hashing, counting toolkit
⇒ With the right scaling techniques, we went from
~$25 billion (or ~300 million years: 10^18 random seeks × 10 msec) ⇒ minutes/hours and under $10K
General note on query optimization (more in next lecture)