Kettle Beam

Matt Casters

Neo4j Chief Solutions Architect, Kettle Project Founder, Hop Lead Architect

Updated: Nov 22nd, 2019

Topics

  • Kettle essentials
  • What is Apache Beam?
  • What is the Kettle Beam project?
  • How does it work?
  • Which platforms are supported?
  • What’s next?
  • What does not work yet?
  • How can I contribute?
  • Q&A


Kettle Essentials

Define an ETL workload

  • GUI (workstation/browser)
  • XML standard
  • Java API (SDK)
  • ETL Metadata Injection

Run the ETL workload

  • Anywhere you have a JVM
  • GUI, Local, Remote, …
  • But also inside engines…
    • Kettle Clustering
    • Map/Reduce
    • Storm
    • Spark (EE)
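For illustration, a heavily abridged sketch of what the XML form of a transformation looks like (element names recalled from memory; real .ktr files carry many more settings and attributes):

```xml
<transformation>
  <info>
    <name>read-and-filter</name>
  </info>
  <step>
    <name>Read CSV</name>
    <type>TextFileInput</type>
    <!-- step-specific settings omitted -->
  </step>
  <step>
    <name>Filter rows</name>
    <type>FilterRows</type>
  </step>
  <order>
    <hop><from>Read CSV</from><to>Filter rows</to><enabled>Y</enabled></hop>
  </order>
</transformation>
```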


Kettle Essentials

How does it work?


[Diagram] Define Kettle metadata (Transformations, Jobs) → Configure options (log levels, parameters, …) → Execute (Local, Remote, Clustered). The “Run Configuration” ties these together: metadata-driven execution on an engine.

Apache Beam

  • Originates from a Google paper, “The Dataflow Model”

https://dl.acm.org/citation.cfm?doid=2824032.2824076

“The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing”


Apache Beam

  • Google released an open source SDK for it in 2014

→ The Apache Beam Direct runner

→ Local execution

  • Google released the SDK for their GCP DataFlow platform

→ The DataFlow runner

→ Execution on Google Cloud Platform - DataFlow

  • Spark and a bunch of other platforms followed quickly
  • Lots of work getting done on Apache Flink right now


Apache Beam

FOSDEM 2019 Presentation on Apache Beam

by Maximilian Michels

Article & Video of presentation

Slide deck

→ Technical!!


Apache Beam

  • 2018 top 3 Apache Project by commits
  • 2018 top project by dev@ list activity

2019 Technology of the Year Award Winner


Apache Beam

Capabilities matrix:

https://beam.apache.org/documentation/runners/capability-matrix

Capabilities depend on the underlying “Engine” or “Runner”


Apache Beam

The Dataflow model frames a pipeline around four questions:

  • What results are being calculated?
  • Where in event time?
  • When in processing time?
  • How do refinements of results relate?

Apache Beam

How does it work?


[Diagram] Define an Apache Beam pipeline (Java / Python) → Configure pipeline options (specific options per runner) → Execute on a runner (DirectRunner, DataFlowRunner, SparkRunner, FlinkRunner, …).
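The define → configure → execute pattern can be sketched as a plain-Java analogy (this is not the Beam API; `ToyPipeline`, `DIRECT` and `SPLIT` are invented for illustration): the pipeline is defined once, and interchangeable “runners” decide how to execute it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ToyPipeline {
    // A "pipeline" is just a transformation from input rows to output rows.
    interface Pipeline extends Function<List<String>, List<String>> {}

    // A "runner" decides HOW to execute a pipeline; the definition never changes.
    interface Runner { List<String> run(Pipeline p, List<String> input); }

    // Direct runner: execute locally, in one go.
    static final Runner DIRECT = (p, in) -> p.apply(in);

    // "Distributed" runner: split the input, run the same pipeline per split, merge.
    static final Runner SPLIT = (p, in) -> {
        int mid = in.size() / 2;
        List<String> out = new ArrayList<>(p.apply(in.subList(0, mid)));
        out.addAll(p.apply(in.subList(mid, in.size())));
        return out;
    };

    public static void main(String[] args) {
        // Define once...
        Pipeline upper = in -> in.stream().map(String::toUpperCase).collect(Collectors.toList());
        List<String> rows = Arrays.asList("kettle", "beam", "flink");
        // ...execute on any runner, with identical results.
        System.out.println(DIRECT.run(upper, rows)); // [KETTLE, BEAM, FLINK]
        System.out.println(SPLIT.run(upper, rows));  // [KETTLE, BEAM, FLINK]
    }
}
```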

Kettle Beam: Why?

  • Recognizes the architectural similarities between Kettle and Beam
  • Lets Kettle focus on user needs, metadata and code
  • Re-uses Beam’s abstraction layers
  • Does away with the need to bet on a single engine
  • Recognizes that engine innovation happens at an incredible pace
  • Apache Beam shields us from making bad bets


Kettle Beam: Diff

  • Any differences between Kettle and Beam?
  • Kettle has a strict separation of metadata and engine!
  • That separation actually made execution on Apache Beam easier!


[Diagram] Kettle and Beam side by side: Define Kettle metadata ↔ Define an Apache Beam pipeline; Configure options ↔ Configure pipeline options; Execute ↔ Execute on a runner.

Kettle Beam: a brief history in time

  • Started right after #PCM18 in Bologna
  • The Beam spark was officially ignited by the world famous Kettle advocate and personal troller Diethard Steiner
  • Originally I was supposed to show a

“Hello, world!\n”

  • But as these things go, things got out of hand…
  • Many nights, days and weekends
  • Haven’t had this much fun since I started Kettle


Kettle Beam: features

  • Kettle transformations
  • Beam File Definitions

→ No more cumbersome “split field” steps

  • Beam Job Configurations

→ Handles all the Apache Beam Pipeline Options

→ Generic approach but only Direct / DataFlow / Spark for now

  • Beam Run Configuration

→ Not possible with a Kettle plugin but Pentaho is working on it

  • Execution in a job, Pan, Kitchen, Maitre, Slave server, ...


Kettle Beam: GUI specifics

  • Pipeline metrics integration for Direct runner and DataFlow

→ Spark / Flink are harder, require some plumbing (demo later)

  • Step status & pipeline metrics in Spoon!


Kettle Beam: supported operations

  • Step-to-step data transfer
  • Copy data to multiple outputs (NOT distribute)
  • Combine data from multiple input steps
  • Source data from particular (info) step(s)
  • Target data to particular steps
  • ...
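The copy-vs-distribute distinction can be sketched in plain Java (invented helper names, not actual Kettle code): copying sends every row to every target step, while distributing would deal rows out round-robin.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowRouting {
    // Copy: every target step receives a full copy of every row.
    static List<List<String>> copy(List<String> rows, int targets) {
        List<List<String>> out = new ArrayList<>();
        for (int t = 0; t < targets; t++) out.add(new ArrayList<>(rows));
        return out;
    }

    // Distribute: rows are dealt out round-robin -- what Kettle Beam does NOT do here.
    static List<List<String>> distribute(List<String> rows, int targets) {
        List<List<String>> out = new ArrayList<>();
        for (int t = 0; t < targets; t++) out.add(new ArrayList<>());
        for (int i = 0; i < rows.size(); i++) out.get(i % targets).add(rows.get(i));
        return out;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3");
        System.out.println(copy(rows, 2));       // [[r1, r2, r3], [r1, r2, r3]]
        System.out.println(distribute(rows, 2)); // [[r1, r3], [r2]]
    }
}
```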


Kettle Beam: supported steps

  • All non-aggregating steps doing read / write
    • Calculator, Select Values, …
    • But also: Stream Lookup, Switch/Case, Filter Rows, ...
  • Merge Join: converted into Beam API
  • Memory Group By: converted into Beam API
  • Ability to reduce parallelism:
    • Input steps: run a non-Beam input step in a single thread
    • Output steps: recognize BEAM_SINGLE as the desire to run single-threaded
  • Batching up records: better performance in Kettle output steps
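The record-batching idea can be sketched in plain Java (an invented helper, not the actual Kettle Beam code): collect rows into fixed-size batches so an output step can write each batch in one call instead of row by row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RowBatcher {
    // Group rows into batches of at most `size`, preserving order.
    static List<List<String>> batch(List<String> rows, int size) {
        List<List<String>> batches = new ArrayList<>();
        for (int i = 0; i < rows.size(); i += size) {
            batches.add(new ArrayList<>(rows.subList(i, Math.min(i + size, rows.size()))));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        // A batch size of 2 yields batches of 2, 2 and 1 rows.
        System.out.println(batch(rows, 2)); // [[r1, r2], [r3, r4], [r5]]
    }
}
```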


Kettle Beam: Extra I/O

  • Beam Input / Output : Text file reading and writing
  • Beam Kafka Consume / Produce
  • Beam BigQuery Input / Output
  • Beam GCP PubSub Subscribe / Publish


Kettle Beam: Streaming

  • Apache Beam terminology
    • Bounded data source: a finite data set
    • Unbounded data source: an infinite stream
  • A “streaming” pipeline uses an unbounded data source
    • Kafka Consume, PubSub Subscribe, …
    • OR a bounded data source which is “timestamped”
  • Enter the Beam Timestamp step
  • Allows combining bounded and unbounded data
  • Example: split a large file up into chunks of 10 minutes
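The “chunks of 10 minutes” example boils down to assigning each timestamped row to the start of its chunk. A minimal plain-Java sketch (invented names, not Kettle Beam code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TimestampChunks {
    static final long TEN_MINUTES_MS = 10 * 60 * 1000L;

    // The chunk a timestamp belongs to, identified by the chunk's start time.
    static long chunkStart(long timestampMs, long chunkMs) {
        return timestampMs - (timestampMs % chunkMs);
    }

    // Group timestamped rows by their 10-minute chunk.
    static Map<Long, List<Long>> chunk(List<Long> timestamps) {
        Map<Long, List<Long>> chunks = new LinkedHashMap<>();
        for (long ts : timestamps) {
            chunks.computeIfAbsent(chunkStart(ts, TEN_MINUTES_MS), k -> new ArrayList<>()).add(ts);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // 0s, 9m59s and 10m01s: the first two share a chunk, the third starts a new one.
        System.out.println(chunk(Arrays.asList(0L, 599_000L, 601_000L)));
        // {0=[0, 599000], 600000=[601000]}
    }
}
```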


Kettle Beam: Windowing

  • A window allows you to consider parts of a data stream
    • Global window: everything
    • Fixed window: a fixed period of time
    • Sliding window: a fixed period, repeated at intervals
    • Session window: closed when no more data arrives for X minutes (minimum 5 minutes)
  • Enter the Beam Window step
  • Example: write Kafka data to files
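The window types can be illustrated in plain Java (invented helpers; Beam’s real windowing lives in its SDK): a fixed window is identified by its start time, and a session window closes once the gap since the last event reaches the configured duration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ToyWindows {
    // Fixed window: the start of the window containing a timestamp.
    static long fixedWindowStart(long ts, long sizeMs) {
        return ts - ts % sizeMs;
    }

    // Session windows: given timestamps in ascending order, a gap of at least
    // `gapMs` without data closes the current session and starts a new one.
    static List<List<Long>> sessions(List<Long> sortedTs, long gapMs) {
        List<List<Long>> out = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sortedTs) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) >= gapMs) {
                out.add(current);
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) out.add(current);
        return out;
    }

    public static void main(String[] args) {
        // One-minute fixed windows: 65s falls in the window starting at 60s.
        System.out.println(fixedWindowStart(65_000L, 60_000L)); // 60000
        // Five-minute session gap: the 400s event starts a new session.
        System.out.println(sessions(Arrays.asList(0L, 1_000L, 400_000L), 300_000L));
        // [[0, 1000], [400000]]
    }
}
```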


Kettle Beam: Demo!!!1!

  • The Kettle Beam Examples project
  • Show the File Definitions
  • Show the Beam Job Configs
  • Show the Run Configurations
  • Run unit tests in Kettle
  • Execute on the Flink runner locally
  • Execute on Dataflow (Google Cloud Dataflow)
  • Execute on Spark (AWS with Flintrock)
  • Execute on Flink in a VM


Kettle Beam: Next steps

  • Release 1.0
  • Remote execution, metrics/status updates
    • For Spark/Flink
    • Using Carte plugins
    • Remote pipeline metrics → Carte ← Spoon
  • Triggers?


Kettle Beam: Where is the light?

The Kettle Beam project code on GitHub

https://github.com/mattcasters/kettle-beam

Diethard’s blog post on Kettle Beam

http://diethardsteiner.github.io/pdi/2018/12/01/Kettle-Beam.html

My YouTube channel with status updates

https://www.youtube.com/mattcasters

Download

http://www.kettle.be


This presentation: http://beam.kettle.be

Kettle Beam: step into the light!

  • Try it yourself
  • Any feedback is good feedback!
    • “This doesn’t work!”
    • Write documentation
    • Request features
  • Spread the word about Kettle, Apache Beam and Kettle Beam
  • Help out coding
    • Easier than you think
    • Ask for help.


Q&A


Kettle Beam

Matt Casters

Neo4j Chief Solutions Architect

20190207 Kettle Beam - Google Slides