1 of 7

A Production Quality Sketching Library for the Analysis of Big Data

Lee Rhodes

Verizon Media, Inc.

✦ Currently in Incubation

2 of 7

Some Very Common Queries …

Frequent Items / Heavy Hitters

Reservoir Sampling: Uniform, Weighted

Quantiles, CDFs

Unique Identifiers

with Set Expressions: (A∪B)∩(C∪D) - E

Graph Analysis

Are All Computationally Difficult

Histograms, PMFs

 

Vector & Matrix Operations: SVD, etc.

Mobile Telemetry

3 of 7

5 Major Characteristics

  • Small Stored Size, Sub-linear in Space →
  • Single-pass, “One-Touch”
  • Mergeable
  • Approximate, Probabilistic
  • Mathematically Proven Error Bounds

[Plot: Sketch Size vs. Stream Size, comparing Linear and Sub-linear growth]

[Diagram: anatomy of a sketch]

  • Data Stream → Stream Processor (Random Selection) → Data Structure (size = f(k); Sizing, Resizing, Storing)
  • Query → Query Processor → Results +/- ε, where ε = f(k)
  • Sketch Stream → Merge / Set Operations → Result Sketch

The Sketch (a.k.a. Stochastic Streaming Algorithm)
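The characteristics on this slide can be illustrated with a toy distinct-count sketch. What follows is a minimal K-Minimum-Values (KMV) sketch in Python, written only for illustration; it is not the DataSketches library code. The class name `KmvSketch`, the helper `hash01`, the use of SHA-256 as the hash, and the sample stream are all assumptions made for this example. The sketch stores at most k hash values regardless of stream length, reads each item once, merges with other sketches of the same k, and returns an estimate whose relative standard error is on the order of 1/√k.

```python
import hashlib
import heapq
import math


def hash01(item):
    """Map an item to a pseudo-random float in [0, 1). SHA-256 is an arbitrary choice."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class KmvSketch:
    """Toy KMV sketch: retains only the k smallest distinct hash values seen."""

    def __init__(self, k):
        self.k = k
        self._heap = []        # max-heap of retained hashes (stored negated)
        self._members = set()  # the same hashes, used to skip duplicates

    def _offer(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # h is smaller than the largest retained hash: swap it in.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, item):
        """Single pass: each stream item is touched exactly once."""
        self._offer(hash01(item))

    def merge(self, other):
        """Fold another sketch (assumed to use the same k) into this one."""
        for h in other._members:
            self._offer(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))    # fewer than k distinct items: exact
        return (self.k - 1) / -self._heap[0]  # (k - 1) / k-th smallest hash


k = 1024
# Hypothetical stream: 100,000 events over 50,000 distinct user ids.
stream = [f"user-{i % 50_000}" for i in range(100_000)]

whole = KmvSketch(k)
for item in stream:
    whole.update(item)

# Two partitions sketched independently, then merged.
left, right = KmvSketch(k), KmvSketch(k)
for item in stream[0::2]:
    left.update(item)
for item in stream[1::2]:
    right.update(item)
left.merge(right)

print("single sketch estimate:", round(whole.estimate()))
print("merged sketch estimate:", round(left.estimate()))
print("rough relative standard error:", 1 / math.sqrt(k))
```

Because the merged sketch ends up holding the same k smallest hashes as the single-pass sketch, both produce the same estimate. Raising k shrinks the error and grows the stored size, which is the ε = f(k), size = f(k) trade-off named on the slide.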

4 of 7

Case Study: Real-time, Before and After

  • Customers: >250K Mobile App Developers
  • Data: 40-50 TB per day
  • Platform: 2 clusters × 80 nodes = 160 nodes
    • Node: 24 CPUs, 250 GB RAM (160 nodes × 250 GB = 40 TB RAM in total)

Before Sketches

  • VCS* / Mo.: ~80B
  • Result Freshness: Daily: 2 to 8 hours; Weekly: ~3 days. Real-time Results Not Feasible!

After Sketches

  • VCS* / Mo.: ~20B
  • Result Freshness: 15 seconds!

Big Wins!

Near-Real Time, Lower System $

* VCS: Virtual Core Seconds

5 of 7

Advantages of Sketch-based System Design

  • Architectural simplicity
    • Fewer processing steps
    • Smaller intermediate tables
    • Embarrassingly Parallelizable
  • Enables “Hyper-cube” architectures
  • Simple time windowing
  • Multiple Languages Supported: Java, C++, Python
  • Binary compatibility across languages and systems
  • Set expressions for extended analysis capabilities (Theta Sketches)
  • Fast! -- The Only Solution for Real-Time!
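To make the set-expression and time-windowing points concrete, here is a small Theta-style sketch in Python. It is a sketch of the general idea under stated assumptions, not the DataSketches Theta Sketch implementation: the names `ThetaLike`, `build`, `union`, `intersect`, and `a_not_b`, the SHA-256 hash, and the integer id ranges are all hypothetical. Each sketch keeps every hash below a threshold theta, and set operations work on the retained hashes under the smaller of the two thresholds.

```python
import hashlib


def hash01(item):
    """Map an item to a pseudo-random float in [0, 1). SHA-256 is an arbitrary choice."""
    digest = hashlib.sha256(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class ThetaLike:
    """Toy Theta-style sketch: retains all distinct hashes below theta."""

    def __init__(self, k, theta=1.0, entries=()):
        self.k = k
        self.theta = theta
        self.entries = set(entries)

    def shrink(self):
        """Keep the k smallest hashes; theta becomes the (k+1)-th smallest."""
        if len(self.entries) > self.k:
            ordered = sorted(self.entries)
            self.theta = ordered[self.k]
            self.entries = set(ordered[:self.k])

    def update(self, item):
        h = hash01(item)
        if h < self.theta:
            self.entries.add(h)
            if len(self.entries) > 2 * self.k:  # lazy shrink keeps updates cheap
                self.shrink()

    def estimate(self):
        return len(self.entries) / self.theta


def build(items, k=4096):
    sketch = ThetaLike(k)
    for item in items:
        sketch.update(item)
    return sketch


def union(a, b):
    theta = min(a.theta, b.theta)
    merged = (h for h in a.entries | b.entries if h < theta)
    out = ThetaLike(min(a.k, b.k), theta, merged)
    out.shrink()
    return out


def intersect(a, b):
    theta = min(a.theta, b.theta)
    common = (h for h in a.entries & b.entries if h < theta)
    return ThetaLike(min(a.k, b.k), theta, common)


def a_not_b(a, b):
    theta = min(a.theta, b.theta)
    remaining = (h for h in a.entries - b.entries if h < theta)
    return ThetaLike(min(a.k, b.k), theta, remaining)


# Hypothetical id ranges standing in for five user segments.
A = build(range(0, 40_000))
B = build(range(30_000, 70_000))
C = build(range(50_000, 90_000))
D = build(range(80_000, 120_000))
E = build(range(55_000, 60_000))

# (A ∪ B) ∩ (C ∪ D) - E, evaluated on sketches only.
result = a_not_b(intersect(union(A, B), union(C, D)), E)
print(round(result.estimate()))  # the exact answer for these ranges is 15,000
```

The same `union` function covers time windowing: if one sketch is stored per hour, a daily or weekly result is the union of the relevant hourly sketches, with no need to rescan raw data. Since each partition can be sketched independently before merging, the work parallelizes naturally.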

6 of 7

How Do We Do This?

We are a team of Scientists that love Engineering … and Engineers that love Science!

Core Team (VM: Verizon Media)

  • Lee Rhodes, Distinguished Architect, Yahoo/VM, started the internal DataSketches project in 2012
  • Alex Saydakov, Systems Developer, Yahoo/VM, joined 2015
  • Jon Malkin, Ph.D., Scientist, Developer, Yahoo/VM, joined 2016
  • Edo Liberty, Ph.D., Founder, HyperCube Technologies, joined 2015
  • Justin Thaler, Ph.D., Assistant Professor, Georgetown University, Computer Science, joined 2015
  • Roman Leventov, Systems Developer for Druid, Metamarkets, joined 2018

Extended Team

  • Graham Cormode, Ph.D., Professor, University of Warwick, Computer Science, joined 2017
  • Jelani Nelson, Ph.D., Professor, U.C. Berkeley, joined 2019
  • Daniel Ting, Ph.D., Sr Scientist, Tableau (now Salesforce), joined 2019
  • Christopher Musco, Ph.D., MIT ➔ NYU, Theoretical Computer Science, Linear Algebra, joined 2018

… And our Community is Growing!

7 of 7

THANK YOU!

Open Invitation for Collaboration & Committers

datasketches.apache.org