1 of 12

Scale, scale, scale

(Indexing + Sorting, Hashing, Counting)

2 of 12

This week

Scale, Scale, Scale

How to read/write indices?

Sorting, Counting, Hashing

(for RAM, Disk, Clusters)

3 of 12

Big Scaling (with Indexes)

Roadmap: primary data structures/algorithms

RAM
  • Hashing:  HashTables (hash_i(x))
  • Sorting:  BucketSort, QuickSort, MergeSort
  • Counting: HashTable + Counter (hash_i(key) --> <count>)

Disk
  • Hashing:  hashes for disk location (hash_i(x))
  • Sorting:  MergeSortedFiles, SortFiles
  • Counting: ?????

Cluster
  • Hashing:  hashes for machines, shards (hash_i(x))
  • Sorting:  MergeSortedFiles, SortFiles

4 of 12

Counting?

Counting product views for a billion products

Counting popular product-pairs

5 of 12

Counting in RAM

Counting product views for a billion products

Input: UserViews(UserId, ProductID)
  Nespresso Coffee, Bread maker, Kenwood Espresso, ...

Output: ProductViews(ProductID, count)
  Nespresso Coffee: 5,003
  Bread maker: 20,007
  Kenwood Espresso: 45
  ...

Algorithm: For each user, for each product pi viewed:
  Counter[pi] += 1
  // [ .. ] is Python 'dict' notation
  // e.g., h_1(pi) denotes the counter's location
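As a concrete reference, here is a minimal Python sketch of this in-RAM counter, assuming the views arrive as (user_id, product_id) pairs; the function and variable names are illustrative, not from the slides:

```python
from collections import defaultdict

def count_product_views(user_views):
    """Count views per product from an iterable of (user_id, product_id) pairs."""
    counter = defaultdict(int)             # hash table: product_id -> count
    for _user_id, product_id in user_views:
        counter[product_id] += 1           # Counter[pi] += 1, as on the slide
    return counter

# Example with made-up data:
views = [(1, "Nespresso Coffee"), (2, "Bread maker"), (3, "Nespresso Coffee")]
print(dict(count_product_views(views)))    # {'Nespresso Coffee': 2, 'Bread maker': 1}
```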

6 of 12

Counting in RAM

Counting product views for a billion product PAIRs

Input: UserViews(UserId, ProductID)
  Nespresso Coffee, Bread maker, Kenwood Espresso, ...

Output: ProductViews(ProductID, ProductID, count)
  Nespresso Coffee, Bread Maker: 301
  Bread maker, Kenwood Espresso: 24,597
  Kenwood Espresso, Bike: 22
  ...

Algorithm: For each user, for each product pair (pi, pj) viewed:
  Counter[pi, pj] += 1
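A minimal Python sketch of the in-RAM pair counter, assuming each user's views are already grouped into a per-user list of products (the grouping step and all names are illustrative):

```python
from collections import defaultdict
from itertools import combinations

def count_product_pairs(sessions):
    """Count co-viewed product pairs from an iterable of per-user product lists."""
    counter = defaultdict(int)                    # hash table: (pi, pj) -> count
    for products in sessions:
        # sorted(set(...)) dedups within a user and fixes a canonical pair order
        for pi, pj in combinations(sorted(set(products)), 2):
            counter[pi, pj] += 1                  # Counter[pi, pj] += 1, as on the slide
    return counter

# Example with made-up sessions:
sessions = [["Nespresso Coffee", "Bread maker"],
            ["Bread maker", "Kenwood Espresso", "Nespresso Coffee"]]
print(dict(count_product_pairs(sessions)))
```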

7 of 12

Sizing up the problem

Size the data (input, output):

  Number of products                          P        ~1 Billion
  Number of users                             u        ~1 Billion
  Products viewed per user/session (avg)      q        ~10
  Number of product-pairs                     P^2      ~1B * 1B = 10^18
  Number of product-pairs with count > 1      k * P^2  k * 10^18
  Bytes per productID (2^32 ~= 4 Billion)     4
  Bytes per userID                            4
  Bytes per tuple (2 productIDs + count)      12

Size the machine(s):

  RAM, page/disk block size                   64 GB, 64 KB
  Disk seek, disk IO                          10 msec, 100 MB/sec

Look for data blowups in:

  • Input data
  • Output data
  • Intermediate data (blowups)

8 of 12

Performance Analysis

(Engineering approximations)

Counting product views
  • Input size (4 bytes for userID, 4 bytes for productID): ~1 Bil * [4 + 4] = 8 GB
  • Output size (4 bytes for productID, 4 bytes for count): ~1 Bil * [4 + 4] = 8 GB
  • Trivial

Counting product-pair views
  • Output/intermediate data size, worst case: ~1 Bil * 1 Bil pairs, at 4 bytes per count = 4 Million TBs
    (a full tuple is 8 bytes for the productID pair + 4 bytes for the count)
  • 'Trivial?' (if you have ~$25 Billion, at $100 / 16 GB of RAM)

Design 1: P * P matrix of counters in RAM
  • RAM size = 4 Million TBs

Design 2: Array of products + per-product linked list of <other product, counter>
  • Worst case: u * q^2 * [8 bytes + 8 bytes for pointer] = 1.6 TB

Designs 1 & 2 on disk: let the OS page data into memory as needed
  • Worst case, Design 1: ~300 million years
  • Worst case, Design 2: O(u * q^2) seeks = 100 Billion seeks * 10 msec ~= 31 years (arithmetic checked in the sketch below)
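A quick Python sketch that mirrors the slide's arithmetic (the constants are the ones from the sizing slide; the 300-million-year figure for Design 1 on disk is quoted from the slide rather than re-derived here):

```python
# Back-of-envelope check of the estimates above.
P = 1_000_000_000           # products
u = 1_000_000_000           # users
q = 10                      # products viewed per user/session (avg)
SEEK = 0.010                # disk seek time, seconds
YEAR = 365 * 24 * 3600      # seconds per year
GB, TB = 10**9, 10**12

print("input size        ~", u * (4 + 4) / GB, "GB")         # ~8 GB   (userID + productID)
print("output size       ~", P * (4 + 4) / GB, "GB")         # ~8 GB   (productID + count)
print("Design 1 matrix   ~", P * P * 4 / TB, "TB")           # ~4 million TB of counters
print("Design 2 lists    ~", u * q**2 * 16 / TB, "TB")       # ~1.6 TB worst case
print("Design 2 on disk  ~", u * q**2 * SEEK / YEAR, "yrs")  # ~31 years of seeks
```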

9 of 12

Idea #1: smarter disk strategy (with sorting)

Design 3: Output the u * q^2 pair tuples to a file, then external-sort and count

  • Data size: u * q^2 * [8 bytes] = 800 GB
  • Time to write (@100 MB/sec) = 8000 secs (~2.5 hours)

Input: UserViews(UserId, ProductID)
  Nespresso Coffee, Bread maker, Kenwood Espresso, ...

Algorithm: For each user, for each product pair (pi, pj) viewed:
  Append <pi, pj> to the file-to-sort
Then external-sort the file and count (sketched below).

Recall external sorting (side math):
  B = 64 GB / 64 KB = 1 million buffer pages
  N = 800 GB / 64 KB = 12.5 million file pages
  log_B(N / (2B)) = log_{1,000,000}(12.5 Million / (2 * 1 Million)) ~= 0.13
  ⇒ for B ~= N, IO sorting cost ~= 2N pages

Sort the file:
  • Data size: u * q^2 * [8 bytes] = 800 GB
  • Time to read + write (@100 MB/sec) = 16000 secs (~5 hours)

⇒ Compute time ~= 7.5 hrs !!
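A small runnable Python sketch of Design 3, with the 800 GB external sort stood in by an in-memory sorted() purely for illustration (a real implementation would use an external merge sort or the Unix sort utility); the function names and tab-separated file format are illustrative:

```python
from itertools import combinations, groupby

def write_pairs(sessions, path):
    """Design 3, step 1: append each co-viewed pair <pi, pj> to a file-to-sort."""
    with open(path, "w") as f:
        for products in sessions:
            for pi, pj in combinations(sorted(set(products)), 2):
                f.write(f"{pi}\t{pj}\n")

def count_sorted_pairs(path):
    """Design 3, step 2: sort the file, then count runs of equal adjacent lines.
    sorted() stands in for the external sort of the real 800 GB file."""
    with open(path) as f:
        lines = sorted(f)                      # placeholder for an external merge sort
    for line, run in groupby(lines):
        yield line.rstrip("\n"), sum(1 for _ in run)

# Example with made-up sessions:
sessions = [["Nespresso Coffee", "Bread maker"],
            ["Nespresso Coffee", "Bread maker", "Bike"]]
write_pairs(sessions, "pairs.tsv")
for pair, count in count_sorted_pairs("pairs.tsv"):
    print(pair, count)
```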

10 of 12

Idea #2: partition the output smarter

With sorting (+ hashing + parallelism)

Design 4: Output the u * q^2 pair tuples into hash-partitioned files

  • Cuts out 1 extra read/write pass
  • Time to write (@100 MB/sec) = 8000 secs (~2.5 hours)
  • With parallel disks, time to write (10 @ 100 MB/sec) = 800 secs (~15 mins)

Input: UserViews(UserId, ProductID)
  Nespresso Coffee, Bread maker, Kenwood Espresso, ...

Algorithm: For each user, for each product pair (pi, pj) viewed:
  x = hash(pi, pj) % numFiles   // bucket
  Append <pi, pj> to file fx
External-sort each fx as you go (sketched below).
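A minimal Python sketch of this hash-partitioning step, assuming numFiles bucket files on local disk; the file naming and the use of Python's built-in hash() are illustrative choices (a production system would use a stable hash function):

```python
from itertools import combinations

def partition_pairs(sessions, num_files, prefix="pairs_bucket"):
    """Design 4: route each co-viewed pair to one of num_files bucket files by hash."""
    files = [open(f"{prefix}_{i}.tsv", "w") for i in range(num_files)]
    try:
        for products in sessions:
            for pi, pj in combinations(sorted(set(products)), 2):
                x = hash((pi, pj)) % num_files   # x = hash(pi, pj) % numFiles
                files[x].write(f"{pi}\t{pj}\n")  # append <pi, pj> to file fx
    finally:
        for f in files:
            f.close()
    # Each bucket is a small fraction of the 800 GB, so buckets can be
    # sorted and counted independently, and in parallel across disks.

# Example with made-up data:
partition_pairs([["Nespresso Coffee", "Bread maker", "Bike"]], num_files=4)
```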

11 of 12

Idea #3: simplify the problem, approximate the problem

Popular product pairs

Design 5: Cut down I/O time with sampling / probabilistic hashing (e.g., p' = 1%)

  • Time to write ~ minutes

Input: UserViews(UserId, ProductID)
  Nespresso Coffee, Bread maker, Kenwood Espresso, ...

Algorithm: For each user, for each product pair (pi, pj) viewed:
  x = hash(pi, pj) % numFiles   // bucket
  With probability p', append <pi, pj> to file fx
External-sort each fx and count as you go (sketched below).
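A minimal Python sketch of the sampled variant, assuming a sampling rate p' (p_sample below); scaling the sampled counts back up by 1/p' to estimate true counts is an assumption about how the output would be used, not something stated on the slide:

```python
import random
from itertools import combinations

def partition_pairs_sampled(sessions, num_files, p_sample=0.01, prefix="sampled_bucket"):
    """Design 5: keep each co-viewed pair with probability p_sample, bucketed by hash."""
    files = [open(f"{prefix}_{i}.tsv", "w") for i in range(num_files)]
    try:
        for products in sessions:
            for pi, pj in combinations(sorted(set(products)), 2):
                if random.random() < p_sample:        # with probability p', keep the pair
                    x = hash((pi, pj)) % num_files
                    files[x].write(f"{pi}\t{pj}\n")
    finally:
        for f in files:
            f.close()
    # Each bucket now holds ~p_sample of the pairs; sampled counts can be scaled
    # by 1/p_sample to estimate true counts (an assumption, not stated on the slide).

# Example: sample 1% of pairs into 4 bucket files
partition_pairs_sampled([["Nespresso Coffee", "Bread maker", "Bike"]], num_files=4, p_sample=0.01)
```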

12 of 12

Summary

Scale, Scale, Scale

Sorting, hashing, counting toolkit

  • E.g., smarter disk strategy (sorting)
  • Smarter partitioning (hashing, parallelism)
  • Simplify, approximate the problem

⇒ With the right scaling techniques, we went from

~$25 Billion or 300 million years ⇒ minutes/hours and < $10k

General note on query optimization (more in next lecture)

  • Data systems use such techniques to optimize queries
  • For super-expensive queries, developers reframe and hand-optimize query plans