1 of 66

Open Problems

in Stream Processing

A Call to Action

Tyler Akidau

Software Engineer, Google (then) → Snowflake (now)

@takidau

DEBS 2019 Keynote�

http://s.apache.org/open-problems-stream-processing

2 of 66

Who am I? Why do I care about streaming?

Ten years of industry experience with massive-scale stream processing:

2010: Joined the 2-year old MillWheel team at Google
2019: Tech Lead for Data Processing Languages & Systems group at Google

Cloud Dataflow, Google’s Apache Beam efforts, Google Flume, MillWheel, MapReduce

2020: Driving the creation of a new streaming processing system at Snowflake

Various streaming publications:

2013 VLDB MillWheel paper (supporting author; Sam McVeety primary)
2015 VLDB Dataflow Model paper (primary author)
2019 SIGMOD One SQL to Rule Them All paper (true co-author)
Streaming 101 & Streaming 102 articles on O’Reilly website
Streaming Systems book from O’Reilly

3 of 66

What is the purpose of this talk?

Stream processing has come a long ways in the last decade
Rich ecosystem of industry systems:

Apache Beam / Google Cloud Dataflow
Apache Flink
Apache Kafka Streams
Apache Spark
Microsoft Trill
Many, many others…

Broad set of supported semantics, consistency guarantees, etc.
Are we done?

4 of 66

What is the purpose of this talk?

Share my opinions on where we are and where we need to go

Encourage future investment in streaming research

Learn some new things from all of you�

5 of 66

1�2�3�4�5�6�7

Graceful evolution�Operational ease of use�SQL�Formal semantics�Latency ↔ Cost ↔ Correctness�Batch + Streaming Interoperability�Database-style optimizations

Agenda

6 of 66

Why does it matter?

Where are we?

Things to explore

Per Topic

7 of 66

Graceful evolution

Because change is the only constant

8 of 66

Graceful evolution: why does it matter?

Streaming pipelines run “forever” by definition.
Stateless streaming is useful, but stateful streaming is hugely useful.
Stateful operations persist intermediate/incremental results over time.
Changes to the pipeline over time may require changes to the persisted state:

Bugs? Surely not!
Semantic evolution

Changing data types: int → float
Altering/evolving use cases: distributed count distinct → HyperLogLog approximation

Manual optimization

When the machines aren’t smart enough

Automatic optimization!

When the machines are too smart

9 of 66

Graceful evolution: where are we?

Batch (recall: many batch jobs are really incremental streaming)

Just run it again! Kinda manual. Mostly works.
Can sort of do this w/ replayable streams. Think Kafka.

Limits on retention
Additional time and cost required to reprocess

Streaming

Stateless

Easy! Nothing to do here. Move along.

Drain / Update

Kind of a big hammer. Not very subtle. Downtime. Partial results or stranded data.

Snapshots

Ahhhhh, starts to feel like batch again. Nice. Flink is state of the art here.

Snapshot migration

Run a batch job to mutate your snapshot to new format.
Still manual, but powerful.

10 of 66

Graceful evolution: things to explore

Principled approaches to stateful evolution
Automatic evolution
Online evolution
Backend evolution

But graceful evolution is just one piece of the larger story of...

11 of 66

Operational ease of use

Because building pipelines is the easy part

12 of 66

Operational ease of use: why does it matter?

Streaming pipelines run “forever” by definition.

TIME(build) << TIME(run & maintain)

(Lack of) operational ease of use: frequent pain point for streaming pipelines

Exhibit 1: see previous topic (graceful evolution)
Exhibit 2: “Just copy my config, these magic N flags make it work”
Exhibit 3: “It worked fine for the last N months, and nothing’s changed!”

Two related classes of issues here:

Systems should be self-tuning and self-repairing
When they’re not, they should be easy to diagnose, tune, and repair

13 of 66

Operational ease of use: where are we?

State of the art in some aspects: Apache Beam on Google Cloud Dataflow

Serverless batch & streaming data processing in Google Cloud
Quite good in the self-tuning dimension:

“No knobs”, everything auto-tuned and essentially just works out of the box, no ops.
User: Here’s my pipeline, please process 1 GB of data. Dataflow: Sure, no problem.
User: Here’s my pipeline, please process 1 TB of data. Dataflow: Sure, no problem.
User: Here’s my pipeline, please process 1 PB of data. Dataflow: Sure, no problem.
...

More work to do in the “what’s wrong when it stops working?” dimension

Too opaque. Working to surface more actionable insights.

Takeaways:

Tensions between “no knobs” and “I need to make the #$@% thing work”
Nothing is ever perfect, so what do you do when things go wrong?

14 of 66

Operational ease of use: things to explore

What are the right things to auto tune?
What are the right kinds of knobs to provide?
How do we make appropriate self-tuning more ubiquitous

Hint: robust, proven, open-source algorithms would likely go a long way

What type of operational pain is especially difficult to work around manually?

Multi-homing anyone?

Interesting question: are there different ways we could be building these systems that would make them easier to operate?

Ops are super important, but ease of use goes well beyond operations...

15 of 66

SQL

Because SQL makes data processing accessible

16 of 66

SQL: why does it matter?

Do I really need to justify this?
SQL is the de facto lingua franca for data processing and analysis
Declarative approach = lots of opportunities for automatic optimization
But…
Despite a loooooooooong history of streaming SQL research...
Streaming SQL still not really a solved problem.

17 of 66

SQL: where are we?

A brief (and incomplete) history of streaming SQL research:

Tapestry & TQL (1992)
NiagaraCQ (2000)
CQL (2003)
Aurora (2003) / Borealis (2004)
SPADE / SPL (2008)
CEDR (2007) / Trill (2014)

Lots of focus on streaming as something different from classic SQL
IMO, the best example of streaming in SQL today: materialized views
A couple of recent efforts:

Streams & Tables: Two Sides of the Same Coin (BIRTE 2018)
One SQL to Rule Them All: an Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables (SIGMOD 2019)

18 of 66

Deep dive on streaming SQL

19 of 66

Streams & tables (KSQL)

Streams & tables
SQL operations over Kafka streams
Key/value store tables
Processing-time windowing
Eventual-consistency semantics

Think materialized views

https://dl.acm.org/ft_gateway.cfm?id=3242155

20 of 66

One SQL to Rule Them All

Guiding principles

Unified semantics for SQL �over tables and streams
Minimal additions to the standard
Use as much as possible of SQL�in a streaming context�

Our proposal is threefold

Time-varying relations
Event time semantics
Controlling the materialization �of time-varying results

https://arxiv.org/pdf/1905.12133.pdf

21 of 66

Time-Varying Relations

22 of 66

Regular and Streaming SQL

Regular SQL queries process point-in-time relations

Transactions & isolation levels ensure consistency of relations�

Time is the new dimension of streaming SQL

Streaming SQL queries process relations that evolve over time

23 of 66

Time-Varying Relations (TVRs)

A time-varying relation (TVR) is a regular relation that changes over time

A table that is updated by transactional applications
A stream that is interpreted as the changelog of a table�

For each point in time, a TVR can return a static relation

The full set of SQL operations remains valid!�

A SQL query on a TVR is continuously evaluated and produces a result TVR

Result TVR can be computed in lock step with input TVRs
Equivalent to maintaining a materialized view

24 of 66

Key Insight

Streams and Tables are different representations �of the same semantic object - a TVR.

“Streams are Tables” instead of “Streams and Tables”

25 of 66

TVR Representations

TVR of auction bids

--------------------------�| bidtime | price | item |�--------------------------

26 of 66

TVR Representations

TVR of auction bids

--------------------------�| bidtime | price | item |�--------------------------

Stream representation of TVR

8:08 INSERT (8:07, $2, A)�8:12 INSERT (8:11, $3, B)�8:13 INSERT (8:05, $4, C)�8:15 INSERT (8:09, $5, D)�8:17 INSERT (8:13, $1, E)�8:18 INSERT (8:17, $6, F)

Time of change

Type of change

Changed data

27 of 66

TVR Representations

TVR of auction bids

--------------------------�| bidtime | price | item |�--------------------------

Stream representation of TVR

8:08 INSERT (8:07, $2, A)�8:12 INSERT (8:11, $3, B)�8:13 INSERT (8:05, $4, C)�8:15 INSERT (8:09, $5, D)�8:17 INSERT (8:13, $1, E)�8:18 INSERT (8:17, $6, F)

Table representation of TVR

8:14> SELECT * FROM bids;�--------------------------�| bidtime | price | item |�--------------------------�| 8:07 | $2 | A |�| 8:11 | $3 | B |�| 8:05 | $4 | C |�--------------------------

Time of query

28 of 66

TVR Representations

TVR of auction bids

--------------------------�| bidtime | price | item |�--------------------------

Stream representation of TVR

8:08 INSERT (8:07, $2, A)�8:12 INSERT (8:11, $3, B)�8:13 INSERT (8:05, $4, C)�8:15 INSERT (8:09, $5, D)�8:17 INSERT (8:13, $1, E)�8:18 INSERT (8:17, $6, F)

Table representation of TVR

8:20> SELECT * FROM bids;

--------------------------

| bidtime | price | item |

--------------------------

| 8:07 | $2 | A |

| 8:11 | $3 | B |

| 8:05 | $4 | C |

| 8:09 | $5 | D |

| 8:13 | $1 | E |

| 8:17 | $6 | F |

--------------------------

29 of 66

No SQL Extension

No SQL extensions needed!�
All SQL operations remain valid on TVRs

30 of 66

Event Time Semantics

31 of 66

Event Time vs. Processing Time

Time-based operations are very common in stream processing

Count clicks per URL and hour
Join events that are at most 5 minutes apart from each other�

An engine needs a notion of time to evaluate such queries

Using arrival time of events (a.k.a. processing time) results in arbitrary results�

Event time semantics are required for correct and consistent results

Using timestamps that are provided by/embedded in the data
Correct results when live data is delayed or out-of-order is processed
Correct results when recorded data is processed

32 of 66

Implementing Event Time Semantics

Event timestamps and watermarks* to implement event time semantics

Event timestamps define point in time of an event
Watermarks define a temporal margin of completeness for a stream�

General approach to support full breadth �of streaming use cases

Temporal aggregations
Notifications and alerts

*Millwheel (VLDB’13) proposed watermarks. Later adopted by Cloud Dataflow, Beam, and Flink

33 of 66

SQL Extension 1: Event Time Attributes

Add DDL syntax to declare watermarked event time attributes

Similar to PRIMARY KEY, UNIQUE, NOT NULL, or other constraints�

Event time attribute is a regular TIMESTAMP type

Attribute can be used like any other TIMESTAMP attribute
Attribute is “roughly” increasing.
Watermarks report minimum of future values�

Optimizer uses knowledge of event time attributes to build plans that leverage watermarks to reason about progress

34 of 66

SQL Extension 1: Event Time Attributes

8:07 WM → 8:05

8:08 INSERT (8:07, $2, A)

8:12 INSERT (8:11, $3, B)

8:13 INSERT (8:05, $4, C)

8:14 WM → 8:08

8:15 INSERT (8:09, $5, D)

8:16 WM → 8:12

8:17 INSERT (8:13, $1, E)

8:18 INSERT (8:17, $6, F)

8:21 WM → 8:20

Watermark

Event Timestamp

35 of 66

SQL Extension 2: Event Time Windowing Functions

Add build-in table-valued functions to assign rows to event time windows

TUMBLE function assigns each row to a fixed sized window

Enriches rows of a TVR with start and end timestamps

------------------------------------------

------------------------------------------

| 8:00 | 8:10 | 8:07 | $2 | A |

| 8:10 | 8:20 | 8:11 | $3 | B |

| 8:00 | 8:10 | 8:05 | $4 | C |

| 8:00 | 8:10 | 8:09 | $5 | D |

| 8:10 | 8:20 | 8:13 | $1 | E |

| 8:10 | 8:20 | 8:17 | $6 | F |

------------------------------------------

8:21> SELECT *

FROM

Tumble (

data => TABLE(Bid),

timecol => DESCRIPTOR(bidtime),

dur => INTERVAL '10 ' MINUTES);

36 of 66

SQL Extension 2: Event Time Windowing Functions

Aggregate rows per event time window

-------------------------

| wstart | wend | price |

-------------------------

| 8:00 | 8:10 | $11 |

| 8:10 | 8:20 | $10 |

-------------------------

8:21> SELECT MAX(wstart), wend, SUM(price)

FROM

Tumble (

data => TABLE(Bid),

timecol => DESCRIPTOR(bidtime),

dur => INTERVAL '10 ' MINUTES)

GROUP BY wend;

37 of 66

Controlling the �Materialization of Time-Varying Results

38 of 66

Control How and When to Materialize a TVR

A query on a TVR produces a result TVR

There are several options how to materialize a TVR�

Materialize a TVR as table or as stream?

Table materialization is the default
Stream materialization needs to be explicitly chosen�

Choose when or how often to materialize TVR changes

Only materialize complete results
Only materialize changes once per minute

39 of 66

SQL Extension 3: Stream Materialization

Add EMIT STREAM clause for stream materialization�
Materializes the changes of a TVR in a changelog TVR

All operations on TVR are supported

40 of 66

SQL Extension 3: Stream Materialization

8:08> SELECT ... EMIT STREAM;

----------------------------------------------

----------------------------------------------

| 8:00 | 8:10 | $2 | | 8:08 | 0 |

| 8:10 | 8:20 | $3 | | 8:12 | 0 |

| 8:00 | 8:10 | $2 | undo | 8:13 | 1 |

| 8:00 | 8:10 | $6 | | 8:13 | 2 |

| 8:00 | 8:10 | $6 | undo | 8:15 | 3 |

| 8:00 | 8:10 | $11 | | 8:15 | 4 |

| 8:10 | 8:20 | $3 | undo | 8:17 | 1 |

| 8:10 | 8:20 | $4 | | 8:17 | 2 |

| 8:10 | 8:20 | $4 | undo | 8:18 | 3 |

| 8:10 | 8:20 | $10 | | 8:18 | 4 |

...

8:08 INSERT (8:07, $2, A)�8:12 INSERT (8:11, $3, B)�8:13 INSERT (8:05, $4, C)�8:15 INSERT (8:09, $5, D)�8:17 INSERT (8:13, $1, E)�8:18 INSERT (8:17, $6, F)

8:20> SELECT ...;�-------------------------

| wstart | wend | price |

-------------------------

| 8:00 | 8:10 | $11 |

| 8:10 | 8:20 | $10 |

-------------------------

41 of 66

SQL Extension 4: Delay for Completeness

Add EMIT AFTER WATERMARK clause to materialize only complete results

Watermark indicates completeness

8:15> SELECT ... EMIT AFTER WATERMARK;�-------------------------

| wstart | wend | price |

-------------------------

8:17> SELECT ... EMIT AFTER WATERMARK;�-------------------------

| wstart | wend | price |

-------------------------�| 8:00 | 8:10 | $11 |

-------------------------

8:07 WM → 8:05

8:08 INSERT (8:07, $2, A)

8:12 INSERT (8:11, $3, B)

8:13 INSERT (8:05, $4, C)

8:14 WM → 8:08

8:15 INSERT (8:09, $5, D)

8:16 WM → 8:12

8:17 INSERT (8:13, $1, E)

8:18 INSERT (8:17, $6, F)

8:21 WM → 8:20

42 of 66

SQL Extension 4: Delay for Completeness

Add EMIT AFTER WATERMARK clause can be combined with EMIT STREAM

8:07 WM → 8:05

8:08 INSERT (8:07, $2, A)

8:12 INSERT (8:11, $3, B)

8:13 INSERT (8:05, $4, C)

8:14 WM → 8:08

8:15 INSERT (8:09, $5, D)

8:16 WM → 8:12

8:17 INSERT (8:13, $1, E)

8:18 INSERT (8:17, $6, F)

8:21 WM → 8:20

8:08> SELECT ... EMIT STREAM AFTER WATERMARK;

----------------------------------------------

----------------------------------------------

| 8:00 | 8:10 | $11 | | 8:16 | 0 |

| 8:10 | 8:20 | $10 | | 8:21 | 0 |

...

43 of 66

SQL Extension 5: Periodic Delays

Materializing every change of a TVR can result in many updates

Can overload downstream systems
Often not necessary / required�

Add EMIT AFTER DELAY clause to control frequency of updates

44 of 66

SQL Extension 5: Periodic Delays

8:08> SELECT ... EMIT STREAM

AFTER DELAY INTERVAL '6' MINUTES;

----------------------------------------------

----------------------------------------------

| 8:00 | 8:10 | $6 | | 8:14 | 0 |

| 8:10 | 8:20 | $10 | | 8:18 | 0 |

| 8:00 | 8:10 | $6 | undo | 8:21 | 1 |

| 8:00 | 8:10 | $11 | | 8:21 | 2 |

...

8:08 INSERT (8:07, $2, A)�8:12 INSERT (8:11, $3, B)�8:13 INSERT (8:05, $4, C)�8:15 INSERT (8:09, $5, D)�8:17 INSERT (8:13, $1, E)�8:18 INSERT (8:17, $6, F)

8:21> SELECT ...;�-------------------------

| wstart | wend | price |

-------------------------

| 8:00 | 8:10 | $11 |

| 8:10 | 8:20 | $10 |

-------------------------

8:15> SELECT ...;�-------------------------

| wstart | wend | price |

-------------------------

| 8:00 | 8:10 | $6 |

-------------------------

8:19> SELECT ...;�-------------------------

| wstart | wend | price |

-------------------------

| 8:00 | 8:10 | $6 |

| 8:10 | 8:20 | $10 |

-------------------------

45 of 66

Streaming SQL Recap

Guiding principles

Unified semantics for SQL over tables and streams
Minimal additions to the standard
Use as much as possible of SQL in a streaming context�

Our proposal
Time-varying relations
Event time semantics

Ext 1: Event time attributes
Ext 2: Event time windowing functions

Controlling the materialization of time-varying results

Ext 3: Stream materialization
Ext 4: Watermark delays for completeness
Ext 5: Periodic delays for rate limiting

← DDL for watermarks, etc.�← TUMBLE, HOP, etc. via table-valued functions��← EMIT STREAM�← EMIT AFTER WATERMARK�← EMIT AFTER DELAY

46 of 66

SQL: things to explore

Need to understand:

What else is truly missing?

Or are there different approaches that would be better?

How can we cleanly (and minimally) add new semantics to the SQL language?

But to do all that, we need a clearer understanding of what we’re doing...

47 of 66

Formal Semantics

Because guesses and intuition are a perilous foundation

48 of 66

Formal Semantics: why does it matter?

Math is the foundation on top of which we build everything else
Formal models provide frameworks for analyzing and discussing �practical approaches
The exercise of refining a formal model helps surface new insights �and simplify approaches
It’s also fun :-)

49 of 66

Formal Semantics: where are we now?

(mostly kidding)

50 of 66

Formal Semantics: where are we now?

Long history of research in this area.

See previous SQL topic

But still falls short

Lots of focus on streaming as something wholly different

That’s not going to be the right model

Lots of great ideas that need to be tied together properly into a big picture
Much lacking in the area of event-time semantics
Also not much enough formalization of practical run-time semantics

51 of 66

Formal Semantics: things to explore

What is streaming? What concepts are necessary and sufficient?
Example: watermarks

What do they mean and how do you use them?
How do we reason about different approaches to “watermarking”?

MillWheel / Cloud Dataflow vs Flink vs Spark: three very different approaches

Example: latency

What is it?

Time spent processing? Time since event happened?

What kinds of latency matter?

Max? Percentile? Rolling average?

How do we measure them? How do we adjust them? What are the tradeoffs?

On the subject of tradeoffs, let’s talk about the perpetual tensions between...

52 of 66

Latency ↔ Cost ↔ Correctness

Because you can’t have everything all at once (yet ;-)

53 of 66

Latency ↔ Cost ↔ Correctness: why does it matter?

Three dimensions!
Can choose 2!
Have to be flexible on the 3rd
Need clean abstracrtions for full freedom

Correctness

Low cost

Low latency

1. Fast & Correct

2. Cheap & Correct

3. Fast & Cheap

54 of 66

Latency ↔ Cost ↔ Correctness: where are we?

Industry state of the art:

Dataflow: batch & streaming engines

Same API, very different implementations and performance/efficiency/latency characteristics.

Batch: Correct & Cheap
Streaming: Fast & Correct

Flink snapshots vs zero-consistency mode

Can turn off consistency to go fast (and lose data)

Consistency on: Fast & Correct
Consistency off: Fast(er) & Cheap

Spark streaming batch sizes, in-memory only

Batch sizes: Manual tuning of Fast vs Cheap balance
In-memory only: Fast & Cheap (& Correct in certain circumstances)

55 of 66

Latency ↔ Cost ↔ Correctness: things to explore

What do real world users truly need, and what things are just nice to have?
What does it take to effectively serve a broad set of use cases?

Is one engine enough?
Two?
Three?
…?

Batch-like approaches?
The big question: how can we have all three all the time?
Meta question: what is the right term for correctness?

Precision? Completeness? Correctness?

And on the topic of batch-like approaches...

56 of 66

Batch + Streaming Interoperability

Because we each have a lot to learn from one another

57 of 66

Batch + Streaming Interoperability: why does it matter?

Batch is still huge, often because it’s cheaper and faster in cases where low latency is not top priority.

See previous topic.

Why is it cheaper and faster?

Because their engines are doing things that are fundamentally different:

Different optimization strategies
Different algorithms
Different types of machine resources
...

58 of 66

Batch + Streaming Interoperability: where are we?

Many batch users are actually doing streaming
Many approaches in the industry:

Dataflow

Unified API, separate batch and streaming engines

Spark

Unified API, unified batch-centric engine

True streaming engine in development

Flink

Separate APIs & engines now, moving towards unifying everything.

Batch becoming a special case of streaming

59 of 66

Batch + Streaming Interoperability: things to explore

How can we make batch and streaming play better together?

Imagine transparently using each for the things they’re best at:

Batch: lower cost, higher latency

Backfills / catchup, massive jobs

Streaming: higher cost, lower latency

Incremental / online processing, low-latency use cases (e.g., anomaly detection)

How can we help batch users move to streaming?

Cost is often a factor.
Complexity too.

Batch can be simple and straightforward. Streaming often is not.

How can we gain the advantages of batch systems in streaming systems, but without lots of manual work for users?

Speaking of manual work, why are we largely ignoring decades of research into...

60 of 66

Database-style optimizations

Because, let’s face it, so much of what we currently do was �invented by the database community decades ago anyway

61 of 66

Database-style optimizations: why does it matter?

Automatic optimization is cruical to SQL’s broad adoption
Huge body of optimization work in databases.
An arguably minimal amount of it has been applied in streaming systems.
Lots of talk in industry about “turning database inside out”
...but all the smarts are falling out as we do so.
Stream processor as a database ~= some business logic + a key value store
We can do so much better!

62 of 66

Database-style optimizations: where are we?

State of art in industry: Spark’s Catalyst Optimizer

Quite sophisticated, very similar to classic database optimizer
Built-in code gen is slick
Unclear if optimizer adapts over time for streaming

Possibly not too hard, given micro-batch architecture

Dataflow has an optimizer.

Currently has optimizations like fusion, combiner lifting, direct joins (batch)
Really valuable, but could do a lot more
Adapts over time for auto-scaling, but not for optimizations

Flink mostly has optimizations primarily in batch

This will be more unified soon

63 of 66

Database-style optimizations: things to explore

What classic database optimizations are applicable to streaming?

Do they need minor changes?
Do we need entirely new optimizations?

Can optimizations help adapt the pipeline over time?

Hot keys coming and going
Pipelines falling behind
Data shapes and sizes evolving over time
....

Can optimizations help solve the latency ↔ cost ↔ correctness conundrum?

And with that, we’re done.

1 of 66

2 of 66

3 of 66

4 of 66

5 of 66

6 of 66

7 of 66

8 of 66

9 of 66

10 of 66

11 of 66

12 of 66

13 of 66

14 of 66

15 of 66

16 of 66

17 of 66

18 of 66

19 of 66

20 of 66

21 of 66

22 of 66

23 of 66

24 of 66

25 of 66

26 of 66

27 of 66

28 of 66

29 of 66

30 of 66

31 of 66

32 of 66

33 of 66

34 of 66

35 of 66

36 of 66

37 of 66

38 of 66

39 of 66

40 of 66

41 of 66

42 of 66

43 of 66

44 of 66

45 of 66

46 of 66

47 of 66

48 of 66

49 of 66

50 of 66

51 of 66

52 of 66

53 of 66

54 of 66

55 of 66

56 of 66

57 of 66

58 of 66

59 of 66

60 of 66

61 of 66

62 of 66

63 of 66

64 of 66

65 of 66

66 of 66