A Production Quality Sketching Library for the Analysis of Big Data
Lee Rhodes
Verizon Media, Inc.
✦ Currently in Incubation
Some Very Common Queries … Are All Computationally Difficult
Unique Identifiers with Set Expressions: (A∪B)∩(C∪D) - E
Quantiles, CDFs
Histograms, PMFs
Frequent Items / Heavy Hitters
Reservoir Sampling: Uniform, Weighted
Graph Analysis
Vector & Matrix Operations: SVD, etc.
Mobile Telemetry
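As an illustration of one of the queries listed above, uniform reservoir sampling can be sketched in a few lines. This is the textbook "Algorithm R", not the library's implementation; the function name `reservoir_sample` and its parameters are hypothetical and chosen only for this example.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return up to k items drawn uniformly at random from an iterable.

    The stream is read once and its length need not be known in advance.
    `rng` is an optional random.Random instance (hypothetical parameter,
    added here so that runs can be made repeatable).
    """
    rng = rng or random.Random()
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill phase: the first k items are always kept.
            reservoir.append(item)
        else:
            # Item n replaces a random slot with probability k / n.
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                reservoir[j] = item
    return reservoir
```

Memory stays at k items no matter how long the stream is, which is the property that makes this kind of query feasible on unbounded data.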
The Sketch (a.k.a. Stochastic Streaming Algorithm)
5 Major Characteristics
[Diagram: Data Stream → Stream Processor (Random Selection) → Data Structure (size = f(k); Sizing, Resizing, Storing) → Query Processor → Query Results +/- ε, where ε = f(k). Merge / Set Operations: Sketch Stream → Result Sketch. Plot: Sketch Size vs. Stream Size, Sub-linear compared with Linear.]
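The parts named on this slide (random selection, a data structure whose size is a function of k, a query with error ε = f(k), and merging) can be made concrete with a minimal sketch. The example below assumes a K-Minimum-Values (KMV) design, one well-known way to estimate distinct counts; it is not presented as the DataSketches implementation, and the names `KmvSketch`, `update`, `estimate` and `merge` are hypothetical.

```python
import hashlib
import heapq


class KmvSketch:
    """Illustrative K-Minimum-Values sketch for distinct counting."""

    def __init__(self, k=1024):
        self.k = k            # size parameter: at most k hashes are retained
        self._heap = []       # max-heap (negated values) of the k smallest hashes
        self._members = set() # same hashes, for duplicate detection

    @staticmethod
    def _hash(item):
        # "Random selection": map each item to a pseudo-random point in [0, 1].
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0 ** 64

    def _offer(self, h):
        if h in self._members:
            return  # duplicates never change the sketch
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # h is smaller than the current k-th minimum: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        """Stream processor: fold one item into the sketch."""
        self._offer(self._hash(item))

    def estimate(self):
        """Query processor: estimated number of distinct items seen."""
        if len(self._heap) < self.k:
            return float(len(self._heap))  # still exact below k distinct items
        kth_min = -self._heap[0]
        return (self.k - 1) / kth_min

    def merge(self, other):
        """Return a new sketch representing the union of both streams."""
        result = KmvSketch(min(self.k, other.k))
        for h in self._members | other._members:
            result._offer(h)
        return result
```

Under these assumptions the relative standard error is roughly 1/sqrt(k), so a larger k trades memory for accuracy, and two sketches built on separate streams can be merged without revisiting the raw data.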
Case Study: Real-time, Before and After
| Metric | Before Sketches | After Sketches |
| VCS* / Mo. | ~80B | ~20B |
| Result Freshness | Daily: 2 to 8 hours; Weekly: ~3 days; Real-time Results Not Feasible! | 15 seconds! |
Big Wins!
Near Real-Time, Lower System $
* VCS: Virtual Core Seconds
Advantages of Sketch-based System Design
How Do We Do This?
We are a team of Scientists who love Engineering … and Engineers who love Science!
Core Team (VM: Verizon Media)
Extended Team
… And our Community is Growing!
THANK YOU!
Open Invitation for Collaboration & Committers
datasketches.apache.org