1 of 66

Open Problems

in Stream Processing

A Call to Action

Tyler Akidau

Software Engineer, Google (then) → Snowflake (now)

@takidau

DEBS 2019 Keynote

http://s.apache.org/open-problems-stream-processing

2 of 66

Who am I? Why do I care about streaming?

Ten years of industry experience with massive-scale stream processing:

  • 2010: Joined the 2-year old MillWheel team at Google
  • 2019: Tech Lead for Data Processing Languages & Systems group at Google
    • Cloud Dataflow, Google’s Apache Beam efforts, Google Flume, MillWheel, MapReduce
  • 2020: Driving the creation of a new streaming processing system at Snowflake

Various streaming publications.

3 of 66

What is the purpose of this talk?

  • Stream processing has come a long way in the last decade
  • Rich ecosystem of industry systems:
    • Apache Beam / Google Cloud Dataflow
    • Apache Flink
    • Apache Kafka Streams
    • Apache Spark
    • Microsoft Trill
    • Many, many others…
  • Broad set of supported semantics, consistency guarantees, etc.
  • Are we done?
    • No!

4 of 66

What is the purpose of this talk?

Share my opinions on where we are and where we need to go

Encourage future investment in streaming research

Learn some new things from all of you

5 of 66

Agenda

1. Graceful evolution
2. Operational ease of use
3. SQL
4. Formal semantics
5. Latency ↔ Cost ↔ Correctness
6. Batch + Streaming Interoperability
7. Database-style optimizations

6 of 66

Why does it matter?

Where are we?

Things to explore

Per Topic

7 of 66

1

Graceful evolution

Because change is the only constant

8 of 66

Graceful evolution: why does it matter?

  • Streaming pipelines run “forever” by definition.
  • Stateless streaming is useful, but stateful streaming is hugely useful.
  • Stateful operations persist intermediate/incremental results over time.
  • Changes to the pipeline over time may require changes to the persisted state:
    • Bugs? Surely not!
    • Semantic evolution
      • Changing data types: int → float
      • Altering/evolving use cases: distributed count distinct → HyperLogLog approximation
    • Manual optimization
      • When the machines aren’t smart enough
    • Automatic optimization!
      • When the machines are too smart

9 of 66

Graceful evolution: where are we?

  • Batch (recall: many batch jobs are really incremental streaming)
    • Just run it again! Kinda manual. Mostly works.
    • Can sort of do this w/ replayable streams. Think Kafka.
      • Limits on retention
      • Additional time and cost required to reprocess
  • Streaming
    • Stateless
      • Easy! Nothing to do here. Move along.
    • Drain / Update
      • Kind of a big hammer. Not very subtle. Downtime. Partial results or stranded data.
    • Snapshots
      • Ahhhhh, starts to feel like batch again. Nice. Flink is state of the art here.
    • Snapshot migration
      • Run a batch job to mutate your snapshot to new format.
      • Still manual, but powerful.
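The snapshot-migration idea above can be sketched in a few lines of Python. This is purely illustrative (no real engine exposes this API; the schema and helper names are invented): a one-shot batch job rewrites each keyed state entry from the old format to a new one, here collapsing an exact per-key set of ids into a count, mimicking an exact-distinct → aggregated-state evolution.

```python
# Hypothetical sketch of snapshot migration: a batch job maps every
# persisted state entry from the old schema to the new one, and the
# pipeline restarts from the migrated snapshot instead of reprocessing
# history. Names and schemas are illustrative, not any real system's.

def migrate_entry(old_state):
    """Map one persisted state entry from the old schema to the new one."""
    # old schema: {"users": set_of_ids}  ->  new schema: {"distinct": int}
    return {"distinct": len(old_state["users"])}

def migrate_snapshot(snapshot):
    """Rewrite every keyed state entry in the snapshot."""
    return {key: migrate_entry(state) for key, state in snapshot.items()}

old = {
    "url-a": {"users": {1, 2, 3}},
    "url-b": {"users": {7}},
}
new = migrate_snapshot(old)
print(new)  # {'url-a': {'distinct': 3}, 'url-b': {'distinct': 1}}
```

The hard open problems start exactly where this sketch ends: doing such a rewrite automatically, online, and without stopping the pipeline.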

10 of 66

Graceful evolution: things to explore

  • Principled approaches to stateful evolution
  • Automatic evolution
  • Online evolution
  • Backend evolution

But graceful evolution is just one piece of the larger story of...

11 of 66

2

Operational ease of use

Because building pipelines is the easy part

12 of 66

Operational ease of use: why does it matter?

  • Streaming pipelines run “forever” by definition.
    • TIME(build) << TIME(run & maintain)
  • (Lack of) operational ease of use: frequent pain point for streaming pipelines
    • Exhibit 1: see previous topic (graceful evolution)
    • Exhibit 2: “Just copy my config, these magic N flags make it work”
    • Exhibit 3: “It worked fine for the last N months, and nothing’s changed!”
  • Two related classes of issues here:
    • Systems should be self-tuning and self-repairing
    • When they’re not, they should be easy to diagnose, tune, and repair

13 of 66

Operational ease of use: where are we?

  • State of the art in some aspects: Apache Beam on Google Cloud Dataflow
    • Serverless batch & streaming data processing in Google Cloud
    • Quite good in the self-tuning dimension:
      • “No knobs”, everything auto-tuned and essentially just works out of the box, no ops.
      • User: Here’s my pipeline, please process 1 GB of data. Dataflow: Sure, no problem.
      • User: Here’s my pipeline, please process 1 TB of data. Dataflow: Sure, no problem.
      • User: Here’s my pipeline, please process 1 PB of data. Dataflow: Sure, no problem.
      • ...
    • More work to do in the “what’s wrong when it stops working?” dimension
      • Too opaque. Working to surface more actionable insights.
  • Takeaways:
    • Tensions between “no knobs” and “I need to make the #$@% thing work”
    • Nothing is ever perfect, so what do you do when things go wrong?

14 of 66

Operational ease of use: things to explore

  • What are the right things to auto tune?
  • What are the right kinds of knobs to provide?
  • How do we make appropriate self-tuning more ubiquitous?
    • Hint: robust, proven, open-source algorithms would likely go a long way
  • What type of operational pain is especially difficult to work around manually?
    • Multi-homing anyone?
  • Interesting question: are there different ways we could be building these systems that would make them easier to operate?

Ops are super important, but ease of use goes well beyond operations...

15 of 66

3

SQL

Because SQL makes data processing accessible

16 of 66

SQL: why does it matter?

  • Do I really need to justify this?
  • SQL is the de facto lingua franca for data processing and analysis
  • Declarative approach = lots of opportunities for automatic optimization
  • But…
  • Despite a loooooooooong history of streaming SQL research...
  • Streaming SQL still not really a solved problem.

17 of 66

SQL: where are we?

18 of 66

Deep dive on streaming SQL

19 of 66

Streams & tables (KSQL)

  • Streams & tables
  • SQL operations over Kafka streams
  • Key/value store tables
  • Processing-time windowing
  • Eventual-consistency semantics
    • Think materialized views

https://dl.acm.org/ft_gateway.cfm?id=3242155

20 of 66

One SQL to Rule Them All

  • Guiding principles
    • Unified semantics for SQL over tables and streams
    • Minimal additions to the standard
    • Use as much as possible of SQL in a streaming context
  • Our proposal is threefold
    • Time-varying relations
    • Event time semantics
    • Controlling the materialization of time-varying results

https://arxiv.org/pdf/1905.12133.pdf

21 of 66

Time-Varying Relations

22 of 66

Regular and Streaming SQL

  • Regular SQL queries process point-in-time relations
    • Transactions & isolation levels ensure consistency of relations
  • Time is the new dimension of streaming SQL
    • Streaming SQL queries process relations that evolve over time

23 of 66

Time-Varying Relations (TVRs)

  • A time-varying relation (TVR) is a regular relation that changes over time
    • A table that is updated by transactional applications
    • A stream that is interpreted as the changelog of a table
  • For each point in time, a TVR can return a static relation
    • The full set of SQL operations remains valid!
  • A SQL query on a TVR is continuously evaluated and produces a result TVR
    • Result TVR can be computed in lock step with input TVRs
    • Equivalent to maintaining a materialized view

24 of 66

Key Insight

Streams and Tables are different representations of the same semantic object: a TVR.

“Streams are Tables” instead of “Streams and Tables”

25 of 66

TVR Representations

TVR of auction bids

--------------------------
| bidtime | price | item |
--------------------------

26 of 66

TVR Representations

TVR of auction bids

--------------------------
| bidtime | price | item |
--------------------------

Stream representation of TVR

8:08 INSERT (8:07, $2, A)
8:12 INSERT (8:11, $3, B)
8:13 INSERT (8:05, $4, C)
8:15 INSERT (8:09, $5, D)
8:17 INSERT (8:13, $1, E)
8:18 INSERT (8:17, $6, F)

(Each changelog entry records: time of change, type of change, changed data.)

27 of 66

TVR Representations

TVR of auction bids

--------------------------
| bidtime | price | item |
--------------------------

Stream representation of TVR

8:08 INSERT (8:07, $2, A)
8:12 INSERT (8:11, $3, B)
8:13 INSERT (8:05, $4, C)
8:15 INSERT (8:09, $5, D)
8:17 INSERT (8:13, $1, E)
8:18 INSERT (8:17, $6, F)

Table representation of TVR

8:14> SELECT * FROM bids;
--------------------------
| bidtime | price | item |
--------------------------
| 8:07 | $2 | A |
| 8:11 | $3 | B |
| 8:05 | $4 | C |
--------------------------

(The leading timestamp is the time of the query.)

28 of 66

TVR Representations

TVR of auction bids

--------------------------
| bidtime | price | item |
--------------------------

Stream representation of TVR

8:08 INSERT (8:07, $2, A)
8:12 INSERT (8:11, $3, B)
8:13 INSERT (8:05, $4, C)
8:15 INSERT (8:09, $5, D)
8:17 INSERT (8:13, $1, E)
8:18 INSERT (8:17, $6, F)

Table representation of TVR

8:14> SELECT * FROM bids;
--------------------------
| bidtime | price | item |
--------------------------
| 8:07 | $2 | A |
| 8:11 | $3 | B |
| 8:05 | $4 | C |
--------------------------

8:20> SELECT * FROM bids;
--------------------------
| bidtime | price | item |
--------------------------
| 8:07 | $2 | A |
| 8:11 | $3 | B |
| 8:05 | $4 | C |
| 8:09 | $5 | D |
| 8:13 | $1 | E |
| 8:17 | $6 | F |
--------------------------
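The stream/table duality on this slide can be sketched in a few lines of Python (an illustrative model, not any engine's API): the table representation of the TVR at query time t is simply the changelog replayed up to t.

```python
# Minimal sketch of stream/table duality: the same TVR stored as a
# changelog of INSERTs, with the table at time t obtained by replay.

changelog = [  # (change_time, (bidtime, price, item)), as on the slide
    ("8:08", ("8:07", 2, "A")),
    ("8:12", ("8:11", 3, "B")),
    ("8:13", ("8:05", 4, "C")),
    ("8:15", ("8:09", 5, "D")),
    ("8:17", ("8:13", 1, "E")),
    ("8:18", ("8:17", 6, "F")),
]

def table_at(changelog, t):
    """Materialize the table representation of the TVR at time t.
    Times are fixed-width "H:MM" strings, so string comparison works."""
    return [row for change_time, row in changelog if change_time <= t]

print(len(table_at(changelog, "8:14")))  # 3 rows: A, B, C
print(len(table_at(changelog, "8:20")))  # all 6 rows
```

Replaying a prefix of the changelog is exactly what the two SELECT snapshots on this slide show: the 8:14 query sees three rows, the 8:20 query sees all six.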

29 of 66

No SQL Extension

  • No SQL extensions needed!
  • All SQL operations remain valid on TVRs

30 of 66

Event Time Semantics

31 of 66

Event Time vs. Processing Time

  • Time-based operations are very common in stream processing
    • Count clicks per URL and hour
    • Join events that are at most 5 minutes apart from each other
  • An engine needs a notion of time to evaluate such queries
    • Using arrival time of events (a.k.a. processing time) yields arbitrary results
  • Event time semantics are required for correct and consistent results
    • Using timestamps that are provided by/embedded in the data
    • Correct results when live data is delayed or processed out of order
    • Correct results when recorded data is processed
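A tiny Python sketch (illustrative only) of the point above, using the bid stream that appears throughout this deck: the late bid with event time 8:05 that arrives at 8:13 is counted in the correct 8:00-8:10 bucket only under event time; bucketing by arrival time gives results that depend on delivery delays.

```python
# Sketch: count bids per 10-minute window under event time vs.
# processing (arrival) time. Times are minutes since midnight.

events = [  # (arrival_minute, event_minute) from the deck's bid stream
    (8*60+8, 8*60+7), (8*60+12, 8*60+11), (8*60+13, 8*60+5),
    (8*60+15, 8*60+9), (8*60+17, 8*60+13), (8*60+18, 8*60+17),
]

def counts_per_window(times, width=10):
    """Count items per fixed-size window keyed by window start."""
    out = {}
    for t in times:
        start = t - t % width
        out[start] = out.get(start, 0) + 1
    return out

by_event = counts_per_window([e for _, e in events])
by_arrival = counts_per_window([a for a, _ in events])
print(by_event)    # {480: 3, 490: 3} -> 3 bids in each 10-minute window
print(by_arrival)  # {480: 1, 490: 5} -> skewed by delivery delays
```

Same six bids, two different answers; only the event time answer is stable when the data is replayed.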

32 of 66

Implementing Event Time Semantics

  • Event timestamps and watermarks* to implement event time semantics
    • Event timestamps define point in time of an event
    • Watermarks define a temporal margin of completeness for a stream
  • General approach to support the full breadth of streaming use cases
    • Temporal aggregations
    • Notifications and alerts

*MillWheel (VLDB’13) proposed watermarks, later adopted by Cloud Dataflow, Beam, and Flink

33 of 66

SQL Extension 1: Event Time Attributes

  • Add DDL syntax to declare watermarked event time attributes
    • Similar to PRIMARY KEY, UNIQUE, NOT NULL, or other constraints
  • Event time attribute is a regular TIMESTAMP type
    • Attribute can be used like any other TIMESTAMP attribute
    • Attribute is “roughly” increasing
    • Watermarks report the minimum of future values
  • Optimizer uses knowledge of event time attributes to build plans that leverage watermarks to reason about progress
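The watermark contract above can be stated as executable Python (a sketch of the semantics, not a real engine interface): a watermark at time t is a promise that no later row will carry an event timestamp below t, which is what lets a planner reason about progress.

```python
# Sketch of the watermark contract: every row that arrives after a
# watermark must carry an event timestamp >= that watermark.

stream = [  # interleaved rows and watermarks, as on the next slide
    ("row", "8:07"), ("row", "8:11"), ("row", "8:05"),
    ("wm", "8:08"), ("row", "8:09"), ("wm", "8:12"),
    ("row", "8:13"), ("row", "8:17"), ("wm", "8:20"),
]

def check_watermarks(stream):
    """Verify each row's event timestamp is >= the last watermark seen.
    Fixed-width "H:MM" strings, so string comparison suffices here."""
    wm = None
    for kind, ts in stream:
        if kind == "wm":
            wm = ts
        elif wm is not None and ts < wm:
            return False  # a row arrived below the watermark: violation
    return True

print(check_watermarks(stream))  # True: this stream honors its watermarks
```

Note that the out-of-order row 8:05 is fine, because it arrives before any watermark has passed it; a watermark that advances too eagerly is what turns late data into a violation.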

34 of 66

SQL Extension 1: Event Time Attributes

8:07 WM → 8:05
8:08 INSERT (8:07, $2, A)
8:12 INSERT (8:11, $3, B)
8:13 INSERT (8:05, $4, C)
8:14 WM → 8:08
8:15 INSERT (8:09, $5, D)
8:16 WM → 8:12
8:17 INSERT (8:13, $1, E)
8:18 INSERT (8:17, $6, F)
8:21 WM → 8:20

(The left column is processing time; “WM → t” lines advance the watermark; the first field of each INSERT is the event timestamp.)

35 of 66

SQL Extension 2: Event Time Windowing Functions

  • Add built-in table-valued functions to assign rows to event time windows

  • TUMBLE function assigns each row to a fixed-size window
    • Enriches rows of a TVR with start and end timestamps

8:21> SELECT *
      FROM Tumble(
             data    => TABLE(Bid),
             timecol => DESCRIPTOR(bidtime),
             dur     => INTERVAL '10' MINUTES);
------------------------------------------
| wstart | wend | bidtime | price | item |
------------------------------------------
| 8:00 | 8:10 | 8:07 | $2 | A |
| 8:10 | 8:20 | 8:11 | $3 | B |
| 8:00 | 8:10 | 8:05 | $4 | C |
| 8:00 | 8:10 | 8:09 | $5 | D |
| 8:10 | 8:20 | 8:13 | $1 | E |
| 8:10 | 8:20 | 8:17 | $6 | F |
------------------------------------------

36 of 66

SQL Extension 2: Event Time Windowing Functions

  • Aggregate rows per event time window

8:21> SELECT MAX(wstart), wend, SUM(price)
      FROM Tumble(
             data    => TABLE(Bid),
             timecol => DESCRIPTOR(bidtime),
             dur     => INTERVAL '10' MINUTES)
      GROUP BY wend;
-------------------------
| wstart | wend | price |
-------------------------
| 8:00 | 8:10 | $11 |
| 8:10 | 8:20 | $10 |
-------------------------
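The windowed aggregation above is an ordinary GROUP BY over the tumbled rows; a Python sketch of the same computation (illustrative only) reproduces the slide's result table.

```python
# Sketch of the windowed aggregation: group TUMBLE output by window
# and sum the price, as the GROUP BY wend query on this slide does.

def window_sums(tumbled):
    sums = {}
    for wstart, wend, bidtime, price, item in tumbled:
        sums[(wstart, wend)] = sums.get((wstart, wend), 0) + price
    return sums

tumbled = [  # TUMBLE output for the deck's bid stream
    ("8:00", "8:10", "8:07", 2, "A"), ("8:10", "8:20", "8:11", 3, "B"),
    ("8:00", "8:10", "8:05", 4, "C"), ("8:00", "8:10", "8:09", 5, "D"),
    ("8:10", "8:20", "8:13", 1, "E"), ("8:10", "8:20", "8:17", 6, "F"),
]
print(window_sums(tumbled))
# {('8:00', '8:10'): 11, ('8:10', '8:20'): 10}
```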

37 of 66

Controlling the Materialization of Time-Varying Results

38 of 66

Control How and When to Materialize a TVR

  • A query on a TVR produces a result TVR
    • There are several options for how to materialize a TVR
  • Materialize a TVR as a table or as a stream?
    • Table materialization is the default
    • Stream materialization must be explicitly chosen
  • Choose when or how often to materialize TVR changes
    • Only materialize complete results
    • Only materialize changes once per minute

39 of 66

SQL Extension 3: Stream Materialization

  • Add EMIT STREAM clause for stream materialization
  • Materializes the changes of a TVR in a changelog TVR
    • All operations on TVR are supported

40 of 66

SQL Extension 3: Stream Materialization

8:08> SELECT ... EMIT STREAM;
----------------------------------------------
| wstart | wend | price | undo | ptime | ver |
----------------------------------------------
| 8:00 | 8:10 | $2 | | 8:08 | 0 |
| 8:10 | 8:20 | $3 | | 8:12 | 0 |
| 8:00 | 8:10 | $2 | undo | 8:13 | 1 |
| 8:00 | 8:10 | $6 | | 8:13 | 2 |
| 8:00 | 8:10 | $6 | undo | 8:15 | 3 |
| 8:00 | 8:10 | $11 | | 8:15 | 4 |
| 8:10 | 8:20 | $3 | undo | 8:17 | 1 |
| 8:10 | 8:20 | $4 | | 8:17 | 2 |
| 8:10 | 8:20 | $4 | undo | 8:18 | 3 |
| 8:10 | 8:20 | $10 | | 8:18 | 4 |
...

8:08 INSERT (8:07, $2, A)
8:12 INSERT (8:11, $3, B)
8:13 INSERT (8:05, $4, C)
8:15 INSERT (8:09, $5, D)
8:17 INSERT (8:13, $1, E)
8:18 INSERT (8:17, $6, F)

8:20> SELECT ...;
-------------------------
| wstart | wend | price |
-------------------------
| 8:00 | 8:10 | $11 |
| 8:10 | 8:20 | $10 |
-------------------------
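The EMIT STREAM changelog on this slide, with its undo rows and per-window version numbers, can be reproduced by a short Python sketch (an illustrative model of the semantics, not a real engine): each arriving bid first retracts the window's previous sum, then emits the new one.

```python
# Sketch of EMIT STREAM: every change to a window's running SUM(price)
# becomes a changelog row; the prior value is retracted with an "undo"
# row, and each window keeps its own version counter.

def emit_stream(arrivals, width=10):
    """arrivals: (ptime, event_minute, price); returns changelog rows
    of the form (wstart_minute, sum, undo_flag, ptime, version)."""
    sums, versions, log = {}, {}, []
    for ptime, minute, price in arrivals:
        wstart = minute - minute % width
        if wstart in sums:  # retract the previous result for this window
            log.append((wstart, sums[wstart], "undo", ptime, versions[wstart]))
            versions[wstart] += 1
        else:
            versions[wstart] = 0
        sums[wstart] = sums.get(wstart, 0) + price
        log.append((wstart, sums[wstart], "", ptime, versions[wstart]))
        versions[wstart] += 1
    return log

arrivals = [("8:08", 7, 2), ("8:12", 11, 3), ("8:13", 5, 4),
            ("8:15", 9, 5), ("8:17", 13, 1), ("8:18", 17, 6)]
for row in emit_stream(arrivals):
    print(row)  # $2, $3, undo $2, $6, ... mirroring the slide's table
```

Running this yields the same sequence of values, undo flags, ptimes, and versions as the table on this slide.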

41 of 66

SQL Extension 4: Delay for Completeness

  • Add EMIT AFTER WATERMARK clause to materialize only complete results
    • Watermark indicates completeness

8:15> SELECT ... EMIT AFTER WATERMARK;
-------------------------
| wstart | wend | price |
-------------------------
-------------------------

8:17> SELECT ... EMIT AFTER WATERMARK;
-------------------------
| wstart | wend | price |
-------------------------
| 8:00 | 8:10 | $11 |
-------------------------

8:07 WM → 8:05
8:08 INSERT (8:07, $2, A)
8:12 INSERT (8:11, $3, B)
8:13 INSERT (8:05, $4, C)
8:14 WM → 8:08
8:15 INSERT (8:09, $5, D)
8:16 WM → 8:12
8:17 INSERT (8:13, $1, E)
8:18 INSERT (8:17, $6, F)
8:21 WM → 8:20
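EMIT AFTER WATERMARK semantics can also be sketched in Python (illustrative only): a window's sum is materialized exactly once, when the watermark passes the window's end and the result is believed complete.

```python
# Sketch of EMIT AFTER WATERMARK: accumulate per-window sums, and emit
# a window only once the watermark passes its end. Times are minutes
# past 8:00.

def emit_after_watermark(stream, width=10):
    """stream: ("row", event_minute, price) or ("wm", minute) events.
    Emits (wstart, wend, sum) exactly once per window, on completeness."""
    sums, emitted, out = {}, set(), []
    for ev in stream:
        if ev[0] == "row":
            _, minute, price = ev
            wstart = minute - minute % width
            sums[wstart] = sums.get(wstart, 0) + price
        else:  # watermark advanced: flush complete, not-yet-emitted windows
            _, wm = ev
            for wstart, total in sorted(sums.items()):
                if wstart + width <= wm and wstart not in emitted:
                    out.append((wstart, wstart + width, total))
                    emitted.add(wstart)
    return out

stream = [("wm", 5), ("row", 7, 2), ("row", 11, 3), ("row", 5, 4),
          ("wm", 8), ("row", 9, 5), ("wm", 12), ("row", 13, 1),
          ("row", 17, 6), ("wm", 20)]
print(emit_after_watermark(stream))  # [(0, 10, 11), (10, 20, 10)]
```

As on the slide, the 8:00-8:10 window ($11) is released when the watermark reaches 8:12, and the 8:10-8:20 window ($10) when it reaches 8:20.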

42 of 66

SQL Extension 4: Delay for Completeness

  • The EMIT AFTER WATERMARK clause can be combined with EMIT STREAM

8:07 WM → 8:05
8:08 INSERT (8:07, $2, A)
8:12 INSERT (8:11, $3, B)
8:13 INSERT (8:05, $4, C)
8:14 WM → 8:08
8:15 INSERT (8:09, $5, D)
8:16 WM → 8:12
8:17 INSERT (8:13, $1, E)
8:18 INSERT (8:17, $6, F)
8:21 WM → 8:20

8:08> SELECT ... EMIT STREAM AFTER WATERMARK;
----------------------------------------------
| wstart | wend | price | undo | ptime | ver |
----------------------------------------------
| 8:00 | 8:10 | $11 | | 8:16 | 0 |
| 8:10 | 8:20 | $10 | | 8:21 | 0 |
...

43 of 66

SQL Extension 5: Periodic Delays

  • Materializing every change of a TVR can result in many updates
    • Can overload downstream systems
    • Often not necessary
  • Add EMIT AFTER DELAY clause to control frequency of updates

44 of 66

SQL Extension 5: Periodic Delays

8:08> SELECT ... EMIT STREAM AFTER DELAY INTERVAL '6' MINUTES;
----------------------------------------------
| wstart | wend | price | undo | ptime | ver |
----------------------------------------------
| 8:00 | 8:10 | $6 | | 8:14 | 0 |
| 8:10 | 8:20 | $10 | | 8:18 | 0 |
| 8:00 | 8:10 | $6 | undo | 8:21 | 1 |
| 8:00 | 8:10 | $11 | | 8:21 | 2 |
...

8:08 INSERT (8:07, $2, A)
8:12 INSERT (8:11, $3, B)
8:13 INSERT (8:05, $4, C)
8:15 INSERT (8:09, $5, D)
8:17 INSERT (8:13, $1, E)
8:18 INSERT (8:17, $6, F)

8:15> SELECT ...;
-------------------------
| wstart | wend | price |
-------------------------
| 8:00 | 8:10 | $6 |
-------------------------

8:19> SELECT ...;
-------------------------
| wstart | wend | price |
-------------------------
| 8:00 | 8:10 | $6 |
| 8:10 | 8:20 | $10 |
-------------------------

8:21> SELECT ...;
-------------------------
| wstart | wend | price |
-------------------------
| 8:00 | 8:10 | $11 |
| 8:10 | 8:20 | $10 |
-------------------------
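One way to model the EMIT AFTER DELAY behavior on this slide in Python (a sketch under an assumed timer policy, not the paper's normative definition): a per-window timer is armed `delay` minutes after the first change since the window's last emit, and the current sum is emitted when it fires. That policy reproduces the slide's emission times of 8:14, 8:18, and 8:21.

```python
# Sketch of EMIT AFTER DELAY with a per-window timer. Times are
# minutes past 8:00; the timer policy here is an assumption chosen to
# match the slide's output.

def emit_periodic(arrivals, width=10, delay=6, horizon=25):
    """arrivals: (ptime, event_minute, price). Returns emitted rows
    (wstart_minute, sum, emit_time), at most one per window per delay."""
    sums, timer, out, by_time = {}, {}, [], {}
    for ptime, minute, price in arrivals:
        by_time.setdefault(ptime, []).append((minute, price))
    for t in range(horizon):
        for minute, price in by_time.get(t, []):  # apply changes at t
            w = minute - minute % width
            sums[w] = sums.get(w, 0) + price
            timer.setdefault(w, t + delay)        # arm if not pending
        for w in [w for w, due in sorted(timer.items()) if due <= t]:
            out.append((w, sums[w], t))  # timer fired: emit current sum
            del timer[w]                 # re-armed by the next change
    return out

arrivals = [(8, 7, 2), (12, 11, 3), (13, 5, 4),
            (15, 9, 5), (17, 13, 1), (18, 17, 6)]
print(emit_periodic(arrivals))
# [(0, 6, 14), (10, 10, 18), (0, 11, 21)] -> $6@8:14, $10@8:18, $11@8:21
```

Note how rate limiting trades latency for update volume: the 8:00-8:10 window absorbs three changes but produces only two emissions.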

45 of 66

Streaming SQL Recap

  • Guiding principles
    • Unified semantics for SQL over tables and streams
    • Minimal additions to the standard
    • Use as much as possible of SQL in a streaming context
  • Our proposal
    • Time-varying relations
    • Event time semantics
      • Ext 1: Event time attributes
      • Ext 2: Event time windowing functions
    • Controlling the materialization of time-varying results
      • Ext 3: Stream materialization
      • Ext 4: Watermark delays for completeness
      • Ext 5: Periodic delays for rate limiting

← DDL for watermarks, etc.
← TUMBLE, HOP, etc. via table-valued functions
← EMIT STREAM
← EMIT AFTER WATERMARK
← EMIT AFTER DELAY

46 of 66

SQL: things to explore

  • Need to understand:
    • What else is truly missing?
      • Or are there different approaches that would be better?
    • How can we cleanly (and minimally) add new semantics to the SQL language?

But to do all that, we need a clearer understanding of what we’re doing...

47 of 66

4

Formal Semantics

Because guesses and intuition are a perilous foundation

48 of 66

Formal Semantics: why does it matter?

  • Math is the foundation on top of which we build everything else
  • Formal models provide frameworks for analyzing and discussing �practical approaches
  • The exercise of refining a formal model helps surface new insights �and simplify approaches
  • It’s also fun :-)

49 of 66

Formal Semantics: where are we now?

  • :-/

(mostly kidding)

50 of 66

Formal Semantics: where are we now?

  • Long history of research in this area.
    • See previous SQL topic
  • But still falls short
    • Lots of focus on streaming as something wholly different
      • That’s not going to be the right model
    • Lots of great ideas that need to be tied together properly into a big picture
    • Much is lacking in the area of event-time semantics
    • Also not enough formalization of practical run-time semantics

51 of 66

Formal Semantics: things to explore

  • What is streaming? What concepts are necessary and sufficient?
  • Example: watermarks
    • What do they mean and how do you use them?
    • How do we reason about different approaches to “watermarking”?
      • MillWheel / Cloud Dataflow vs Flink vs Spark: three very different approaches
  • Example: latency
    • What is it?
      • Time spent processing? Time since event happened?
    • What kinds of latency matter?
      • Max? Percentile? Rolling average?
    • How do we measure them? How do we adjust them? What are the tradeoffs?

On the subject of tradeoffs, let’s talk about the perpetual tensions between...

52 of 66

5

Latency ↔ Cost ↔ Correctness

Because you can’t have everything all at once (yet ;-)

53 of 66

Latency ↔ Cost ↔ Correctness: why does it matter?

  • Three dimensions!
  • Can choose 2!
  • Have to be flexible on the 3rd
  • Need clean abstractions for full freedom

(Diagram: three overlapping goals: Correctness, Low cost, Low latency. The pairwise intersections: 1. Fast & Correct, 2. Cheap & Correct, 3. Fast & Cheap.)

54 of 66

Latency ↔ Cost ↔ Correctness: where are we?

  • Industry state of the art:
    • Dataflow: batch & streaming engines
      • Same API, very different implementations and performance/efficiency/latency characteristics.
        • Batch: Correct & Cheap
        • Streaming: Fast & Correct
    • Flink snapshots vs zero-consistency mode
      • Can turn off consistency to go fast (and lose data)
        • Consistency on: Fast & Correct
        • Consistency off: Fast(er) & Cheap
    • Spark streaming batch sizes, in-memory only
      • Batch sizes: Manual tuning of Fast vs Cheap balance
      • In-memory only: Fast & Cheap (& Correct in certain circumstances)

55 of 66

Latency ↔ Cost ↔ Correctness: things to explore

  • What do real world users truly need, and what things are just nice to have?
  • What does it take to effectively serve a broad set of use cases?
    • Is one engine enough?
    • Two?
    • Three?
    • …?
  • Batch-like approaches?
  • The big question: how can we have all three all the time?
  • Meta question: what is the right term for correctness?
    • Precision? Completeness? Correctness?

And on the topic of batch-like approaches...

56 of 66

6

Batch + Streaming Interoperability

Because we each have a lot to learn from one another

57 of 66

Batch + Streaming Interoperability: why does it matter?

  • Batch is still huge, often because it’s cheaper and faster in cases where low latency is not top priority.
    • See previous topic.
  • Why is it cheaper and faster?
    • Because batch and streaming engines are doing things that are fundamentally different:
      • Different optimization strategies
      • Different algorithms
      • Different types of machine resources
      • ...

58 of 66

Batch + Streaming Interoperability: where are we?

  • Many batch users are actually doing streaming
  • Many approaches in the industry:
    • Dataflow
      • Unified API, separate batch and streaming engines
    • Spark
      • Unified API, unified batch-centric engine
        • True streaming engine in development
    • Flink
      • Separate APIs & engines now, moving towards unifying everything.
        • Batch becoming a special case of streaming

59 of 66

Batch + Streaming Interoperability: things to explore

  • How can we make batch and streaming play better together?
    • Imagine transparently using each for the things they’re best at:
      • Batch: lower cost, higher latency
        • Backfills / catchup, massive jobs
      • Streaming: higher cost, lower latency
        • Incremental / online processing, low-latency use cases (e.g., anomaly detection)
  • How can we help batch users move to streaming?
    • Cost is often a factor.
    • Complexity too.
      • Batch can be simple and straightforward. Streaming often is not.
  • How can we gain the advantages of batch systems in streaming systems, but without lots of manual work for users?

Speaking of manual work, why are we largely ignoring decades of research into...

60 of 66

7

Database-style optimizations

Because, let’s face it, so much of what we currently do was invented by the database community decades ago anyway

61 of 66

Database-style optimizations: why does it matter?

  • Automatic optimization is crucial to SQL’s broad adoption
  • Huge body of optimization work in databases.
  • An arguably minimal amount of it has been applied in streaming systems.
  • Lots of talk in industry about “turning the database inside out”
  • ...but all the smarts are falling out as we do so.
  • Stream processor as a database ~= some business logic + a key-value store
  • We can do so much better!

62 of 66

Database-style optimizations: where are we?

  • State of art in industry: Spark’s Catalyst Optimizer
    • Quite sophisticated, very similar to classic database optimizer
    • Built-in code gen is slick
    • Unclear if optimizer adapts over time for streaming
      • Possibly not too hard, given micro-batch architecture
  • Dataflow has an optimizer.
    • Currently has optimizations like fusion, combiner lifting, direct joins (batch)
    • Really valuable, but could do a lot more
    • Adapts over time for auto-scaling, but not for optimizations
  • Flink has optimizations primarily in batch
    • This will be more unified soon

63 of 66

Database-style optimizations: things to explore

  • What classic database optimizations are applicable to streaming?
    • Do they need minor changes?
    • Do we need entirely new optimizations?
  • Can optimizations help adapt the pipeline over time?
    • Hot keys coming and going
    • Pipelines falling behind
    • Data shapes and sizes evolving over time
    • ....
  • Can optimizations help solve the latency ↔ cost ↔ correctness conundrum?

And with that, we’re done.

64 of 66

In summary

1. Graceful evolution
2. Operational ease of use
3. SQL
4. Formal semantics
5. Latency ↔ Cost ↔ Correctness
6. Batch + Streaming Interoperability
7. Database-style optimizations

65 of 66

And really, there’s so much more...

  • Edge Computing / IoT
  • Streaming graph processing
  • Streaming/online ML
  • Streaming & Microservices
  • Streaming & Serverless Functions
  • Online algorithms
  • ...

66 of 66

Thank you!

Questions?

Complaints? ;-)

http://s.apache.org/open-problems-stream-processing