1 of 28

Go Parallel Data Processing

Golang, Apache Beam & Dataflow

Presented to: Atlanta Go User Group

2 of 28

Going to Cover

  • Intro
  • Why Beam
  • Beam Concepts
  • Code & Demo
  • Questions

Code & Notes https://github.com/GLStephen/GoApacheBeamDemo

3 of 28

Stephen Johnston Jr.

  • Founder/CTO PubWise
    • Creates privacy-preserving digital advertising optimization technologies
    • 3 patents for digital advertising optimization systems
  • 25+ yrs. building online applications
  • Gaming, Content Management, AI/ML & Data
  • Pragmatic Founder
    • Product, Operations, Growth
    • Which technologies? What are the requirements?
    • Which languages? Yes
  • Board Member of Atlanta Bonsai Society
  • https://www.linkedin.com/in/stephenjohnston2/

4 of 28

Quick Survey

  • Manage “big data” (the “five V’s”) in any capacity?
  • Have batch or streaming jobs?
  • Used In Anger?
    • Hadoop/Spark and Similar
    • Beam?
    • Cloud Dataflow?
  • Familiar with?
    • Hadoop/Spark and Similar
    • Beam?
    • Cloud Dataflow?

5 of 28

Why Beam

6 of 28

Realities of Data

  • Data is out of order
  • Batch is less expensive
  • Stream is more timely
  • Perfect is the enemy of good enough - speculative vs. final results
  • Parallel Processing is **hard** at appreciable scale
    • Workers
    • Sharding
    • Monitoring
    • Coordination
    • State

7 of 28

What is Beam

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.

8 of 28

What does Beam solve?

  • Speed vs. Scalability -> Speed & Scalability
  • Unifies Real Time & Historical Processing
  • Unifies Batch & Streaming
  • Encapsulate Complex Processing Concepts

[Chart: items processed vs. wall time]

9 of 28

Apache Beam Support

Beam SDKs

  • Java
  • Python
  • Golang
  • Scala (Scio)

Beam Runners

  • Apache Spark
  • Apache Flink
  • Apache Samza
  • Google Cloud Dataflow
  • Hazelcast Jet
  • Twister2
  • Local

10 of 28

Beam Concepts

Very, very high level…

11 of 28

12 of 28

Concepts

Components

  • PCollection
    • A multi-element, potentially distributed, dataset. The “data”
  • PTransform
    • An operation performed on the data by the code
  • Source
    • External data of one type
  • Sink
    • Destination of data, can be another type
  • Pipeline
    • Loosely “the code”
    • A DAG of PTransforms that carries data from Sources through PCollections and out to Sinks

Processing

  • Event Time
    • When occurred, vs when processed
  • Windowing
    • Enables group operations over unbound data
  • Watermarks
    • The system’s estimate of when all data up to a given event time has arrived in the pipeline
  • Trigger
    • When aggregated results are emitted, based on the relation of the watermark, event time, and processing time

https://beam.apache.org/documentation/programming-guide/
https://cloud.google.com/dataflow/docs/concepts/beam-programming-model
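
To tie the components together, here is a minimal, self-contained Go sketch (illustrative only, not from the demo repo): beam.Create stands in for a real Source and debug.Print for a real Sink.

package main

import (
	"context"
	"strings"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/debug"
)

func main() {
	beam.Init()

	// Pipeline: loosely "the code" - a DAG built from a root Scope.
	p := beam.NewPipeline()
	s := p.Root()

	// Source -> PCollection: an in-memory Create stands in for a real source.
	lines := beam.Create(s, "to be or not to be", "that is the question")

	// PTransform: an operation applied to each element of the PCollection.
	upper := beam.ParDo(s, strings.ToUpper, lines)

	// Sink: debug.Print stands in for a real sink such as textio.Write.
	debug.Print(s, upper)

	// Hand the DAG to a runner: direct runner by default, Dataflow via flags.
	if err := beamx.Run(context.Background(), p); err != nil {
		panic(err)
	}
}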

13 of 28

[Diagram: the Runner (Dataflow) distributes the Pipeline (Go Beam App) to multiple Workers (Compute Engine), each executing a copy of the pipeline.]

14 of 28

Windows

  • Divides the PCollection into configurable “aggregation” windows
  • Especially useful in streaming
  • Key for reasoning about streaming and batch with the same API
  • Unifies all Sorts of Use Cases
  • Global Window and Sub Windows
  • Types
    • Tumbling
    • Hopping
    • Session
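
A minimal Go sketch of the three types, assuming a hypothetical timestamped PCollection named events. In the Beam API, tumbling windows are called fixed and hopping windows are called sliding.

package demo

import (
	"time"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/core/graph/window"
)

// applyWindows applies the three window types from the following slides to a
// hypothetical timestamped PCollection of events.
func applyWindows(s beam.Scope, events beam.PCollection) (tumbling, hopping, sessions beam.PCollection) {
	// Tumbling ("fixed" in the Beam API): contiguous, non-overlapping 1-minute windows.
	tumbling = beam.WindowInto(s, window.NewFixedWindows(time.Minute), events)

	// Hopping ("sliding" in the Beam API): 1-minute windows starting every
	// 30 seconds, so each element can land in more than one window.
	hopping = beam.WindowInto(s, window.NewSlidingWindows(30*time.Second, time.Minute), events)

	// Session: per-key windows that close after a 5-minute gap with no events.
	sessions = beam.WindowInto(s, window.NewSessions(5*time.Minute), events)
	return tumbling, hopping, sessions
}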

15 of 28

Tumbling

16 of 28

Hopping

17 of 28

Session

18 of 28

Processing

  • Element Wise
    • ParDo, Map, etc.
    • Procedural flow of the pipeline (loosely)
    • ParDo - fundamental parallel processing unit - using a DoFn
  • Window or Key Wise
    • Combine, Reduce, GroupByKey
    • Where needed, Beam coordinates across workers to aggregate within windows, firing based on triggers
  • Implicit Global Window Exists
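
A minimal sketch of a key-wise aggregation, assuming a hypothetical PCollection named visits of KV<string, int> pairs; stats.SumPerKey comes from the Beam Go SDK's transforms/stats package.

package demo

import (
	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/transforms/stats"
)

// sumPerUser is a key-wise aggregation: Beam shuffles elements so that all
// values for a key meet on one worker, then sums them within each window.
// "visits" is assumed to be a PCollection of KV<string, int> pairs.
func sumPerUser(s beam.Scope, visits beam.PCollection) beam.PCollection {
	return stats.SumPerKey(s, visits)
}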

19 of 28

Simple DoFn - Parallel
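
The slide code lives in the linked repo; below is a representative sketch, not the repo's code. splitWords is the DoFn: Beam calls it once per element, potentially on many workers at once, and the emit function produces zero or more outputs.

package main

import (
	"context"
	"strings"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/debug"
)

// splitWords emits each whitespace-separated word of a line.
func splitWords(line string, emit func(string)) {
	for _, w := range strings.Fields(line) {
		emit(w)
	}
}

func main() {
	beam.Init()
	p := beam.NewPipeline()
	s := p.Root()

	lines := beam.Create(s, "the quick brown fox", "jumps over the lazy dog")
	words := beam.ParDo(s, splitWords, lines) // the fundamental parallel unit
	debug.Print(s, words)

	if err := beamx.Run(context.Background(), p); err != nil {
		panic(err)
	}
}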

20 of 28

Counts - Global Window - Aggregation
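
A sketch of a global-window count, assuming a hypothetical PCollection of words (for example, the output of the DoFn on the previous slide).

package demo

import (
	"fmt"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/transforms/stats"
)

// countWords aggregates in the implicit global window: with no WindowInto,
// every element of a bounded PCollection falls into a single global window.
func countWords(s beam.Scope, words beam.PCollection) beam.PCollection {
	counted := stats.Count(s, words) // KV<word, count>, one pair per distinct word
	return beam.ParDo(s, formatCount, counted)
}

// formatCount renders each KV pair as a printable line.
func formatCount(w string, c int) string {
	return fmt.Sprintf("%s: %d", w, c)
}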

21 of 28

Simple DAG
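
A sketch of a linear pipeline, with hypothetical input and output paths.

package demo

import (
	"strings"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/io/textio"
)

// simpleDAG is a straight line: one source, one transform, one sink.
func simpleDAG(s beam.Scope) {
	lines := textio.Read(s, "gs://my-bucket/input/*.txt")     // source (hypothetical path)
	upper := beam.ParDo(s, strings.ToUpper, lines)            // element-wise transform
	textio.Write(s, "gs://my-bucket/output/upper.txt", upper) // sink (hypothetical path)
}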

22 of 28

Complex DAG
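
A sketch of a branching pipeline: one PCollection can feed several transforms, and beam.Flatten unions branches back into a single PCollection.

package demo

import (
	"strings"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
)

// complexDAG branches and merges: the same PCollection is consumed twice,
// then the branches are flattened into one output.
func complexDAG(s beam.Scope, lines beam.PCollection) beam.PCollection {
	upper := beam.ParDo(s, strings.ToUpper, lines) // branch 1
	lower := beam.ParDo(s, strings.ToLower, lines) // branch 2
	return beam.Flatten(s, upper, lower)           // merge
}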

23 of 28

Element Wise

24 of 28

Per-Key or Window

25 of 28

26 of 28

Autoscaling Managed for You

  • Backlog = Scale Up Workers
    • If a streaming pipeline remains backlogged with sufficient parallelism on the workers for a couple of minutes, Dataflow scales up. Dataflow targets clearing the backlog in approximately 150 seconds after scaling up, given the current throughput per worker.
  • Lack of Backlog = Scaling Down Workers
    • If a streaming pipeline’s backlog is lower than 10 seconds and average worker CPU utilization stays below the target for a couple of minutes, Dataflow scales down.
  • Not if CPU Remains High
    • If there is no backlog but CPU usage is 75% or greater, the pipeline does not scale down.
  • Predictive Abilities
    • Predictive scaling: The streaming engine also uses a predictive Horizontal Autoscaling technique based on timer backlog. In the Dataflow model, the unbounded data in a streaming pipeline is divided into windows grouped by timestamps.
  • Adaptive Vertical Scaling
    • Dataflow has “adaptive” vertical scaling as well

27 of 28

Code & Demo

28 of 28

Questions?