1 of 28

Go Parallel Data Processing

Golang, Apache Beam & Dataflow

Presented to: Atlanta Go User Group

2 of 28

Going to Cover

  • Intro
  • Why Beam
  • Beam Concepts
  • Code & Demo
  • Questions

Code & Notes https://github.com/GLStephen/GoApacheBeamDemo

3 of 28

Stephen Johnston Jr.

  • Founder/CTO PubWise
    • Creates privacy-preserving digital advertising optimization technologies
    • 3 patents for digital advertising optimization systems
  • 25+ yrs. building online applications
  • Gaming, Content Management, AI/ML & Data
  • Pragmatic Founder
    • Product, Operations, Growth
    • Which technologies? What are the requirements?
    • Which languages? Yes
  • Board Member of Atlanta Bonsai Society
  • https://www.linkedin.com/in/stephenjohnston2/

4 of 28

Quick Survey

  • Manage “big data” (the “five V’s”) in any capacity?
  • Have batch or streaming jobs?
  • Used In Anger?
    • Hadoop/Spark and Similar
    • Beam?
    • Cloud Dataflow?
  • Familiar with?
    • Hadoop/Spark and Similar
    • Beam?
    • Cloud Dataflow?

5 of 28

Why Beam

6 of 28

Realities of Data

  • Data is out of order
  • Batch is less expensive
  • Stream is more timely
  • Perfect is the enemy of good enough - speculative vs. final results
  • Parallel Processing is **hard** at appreciable scale
    • Workers
    • Sharding
    • Monitoring
    • Coordination
    • State

7 of 28

What is Beam

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Flink, Apache Spark, and Google Cloud Dataflow.

8 of 28

What does Beam solve?

  • Speed vs. Scalability -> Speed & Scalability
  • Unifies Real Time & Historical Processing
  • Unifies Batch & Streaming
  • Encapsulate Complex Processing Concepts

[Chart: items processed vs. wall time]

9 of 28

Apache Beam Support

Beam SDKs

  • Java
  • Python
  • Golang
  • Scala (Scio)

Beam Runners

  • Apache Spark
  • Apache Flink
  • Apache Samza
  • Google Cloud Dataflow
  • Hazelcast Jet
  • Twister2
  • Local

10 of 28

Beam Concepts

Very, very high level…

11 of 28

12 of 28

Concepts

Components

  • PCollection
    • A multi-element, potentially distributed, dataset. The “data”
  • PTransform
    • An operation performed on the data by the code
  • Source
    • External data of one type
  • Sink
    • Destination of data, can be another type
  • Pipeline
    • Loosely “the code”
    • A DAG of PTransforms that carries data from Sources through PCollections and out to Sinks

Processing

  • Event Time
    • When occurred, vs when processed
  • Windowing
    • Enables group operations over unbound data
  • Watermarks
    • The system’s estimate of when all data up to a given event time has arrived in the pipeline
  • Trigger
    • When aggregated results are emitted, based on the relation of the watermark, event time, and processing time

https://beam.apache.org/documentation/programming-guide/
https://cloud.google.com/dataflow/docs/concepts/beam-programming-model
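
To tie the components together, here is a minimal, self-contained Go sketch (illustrative only, not from the demo repo): beam.Create stands in for a real Source and debug.Print for a real Sink.

package main

import (
	"context"
	"strings"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/debug"
)

func main() {
	beam.Init()

	// Pipeline: loosely "the code" - a DAG built from a root Scope.
	p := beam.NewPipeline()
	s := p.Root()

	// Source -> PCollection: an in-memory Create stands in for a real source.
	lines := beam.Create(s, "to be or not to be", "that is the question")

	// PTransform: an operation applied to each element of the PCollection.
	upper := beam.ParDo(s, strings.ToUpper, lines)

	// Sink: debug.Print stands in for a real sink such as textio.Write.
	debug.Print(s, upper)

	// Hand the DAG to a runner: direct runner by default, Dataflow via flags.
	if err := beamx.Run(context.Background(), p); err != nil {
		panic(err)
	}
}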

13 of 28

[Diagram: the Runner (Dataflow) distributes the Pipeline (Go Beam App) to multiple Workers (Compute Engine), each executing a copy of the pipeline.]

14 of 28

Windows

  • Divides the PCollection into configurable “aggregation” windows
  • Especially useful in streaming
  • Key for reasoning about streaming and batch with the same API
  • Unifies all Sorts of Use Cases
  • Global Window and Sub Windows
  • Types
    • Tumbling
    • Hopping
    • Session
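
A minimal Go sketch of the three types, assuming a hypothetical timestamped PCollection named events. In the Beam API, tumbling windows are called fixed and hopping windows are called sliding.

package demo

import (
	"time"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/core/graph/window"
)

// applyWindows applies the three window types from the following slides to a
// hypothetical timestamped PCollection of events.
func applyWindows(s beam.Scope, events beam.PCollection) (tumbling, hopping, sessions beam.PCollection) {
	// Tumbling ("fixed" in the Beam API): contiguous, non-overlapping 1-minute windows.
	tumbling = beam.WindowInto(s, window.NewFixedWindows(time.Minute), events)

	// Hopping ("sliding" in the Beam API): 1-minute windows starting every
	// 30 seconds, so each element can land in more than one window.
	hopping = beam.WindowInto(s, window.NewSlidingWindows(30*time.Second, time.Minute), events)

	// Session: per-key windows that close after a 5-minute gap with no events.
	sessions = beam.WindowInto(s, window.NewSessions(5*time.Minute), events)
	return tumbling, hopping, sessions
}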

15 of 28

Tumbling

16 of 28

Hopping

17 of 28

Session

18 of 28

Processing

  • Element Wise
    • ParDo, Map, etc.
    • Procedural flow of the pipeline (loosely)
    • ParDo - fundamental parallel processing unit - using a DoFn
  • Window or Key Wise
    • Combine, Reduce, GroupByKey
    • Where needed, Beam coordinates across workers to aggregate within windows, firing based on triggers
  • Implicit Global Window Exists
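
A minimal sketch of a key-wise aggregation, assuming a hypothetical PCollection named visits of KV<string, int> pairs; stats.SumPerKey comes from the Beam Go SDK's transforms/stats package.

package demo

import (
	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/transforms/stats"
)

// sumPerUser is a key-wise aggregation: Beam shuffles elements so that all
// values for a key meet on one worker, then sums them within each window.
// "visits" is assumed to be a PCollection of KV<string, int> pairs.
func sumPerUser(s beam.Scope, visits beam.PCollection) beam.PCollection {
	return stats.SumPerKey(s, visits)
}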

19 of 28

Simple DoFn - Parallel
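
The slide code lives in the linked repo; below is a representative sketch, not the repo's code. splitWords is the DoFn: Beam calls it once per element, potentially on many workers at once, and the emit function produces zero or more outputs.

package main

import (
	"context"
	"strings"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/beamx"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/x/debug"
)

// splitWords emits each whitespace-separated word of a line.
func splitWords(line string, emit func(string)) {
	for _, w := range strings.Fields(line) {
		emit(w)
	}
}

func main() {
	beam.Init()
	p := beam.NewPipeline()
	s := p.Root()

	lines := beam.Create(s, "the quick brown fox", "jumps over the lazy dog")
	words := beam.ParDo(s, splitWords, lines) // the fundamental parallel unit
	debug.Print(s, words)

	if err := beamx.Run(context.Background(), p); err != nil {
		panic(err)
	}
}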

20 of 28

Counts - Global Window - Aggregation
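
A sketch of a global-window count, assuming a hypothetical PCollection of words (for example, the output of the DoFn on the previous slide).

package demo

import (
	"fmt"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/transforms/stats"
)

// countWords aggregates in the implicit global window: with no WindowInto,
// every element of a bounded PCollection falls into a single global window.
func countWords(s beam.Scope, words beam.PCollection) beam.PCollection {
	counted := stats.Count(s, words) // KV<word, count>, one pair per distinct word
	return beam.ParDo(s, formatCount, counted)
}

// formatCount renders each KV pair as a printable line.
func formatCount(w string, c int) string {
	return fmt.Sprintf("%s: %d", w, c)
}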

21 of 28

Simple DAG
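
A sketch of a linear pipeline, with hypothetical input and output paths.

package demo

import (
	"strings"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
	"github.com/apache/beam/sdks/v2/go/pkg/beam/io/textio"
)

// simpleDAG is a straight line: one source, one transform, one sink.
func simpleDAG(s beam.Scope) {
	lines := textio.Read(s, "gs://my-bucket/input/*.txt")     // source (hypothetical path)
	upper := beam.ParDo(s, strings.ToUpper, lines)            // element-wise transform
	textio.Write(s, "gs://my-bucket/output/upper.txt", upper) // sink (hypothetical path)
}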

22 of 28

Complex DAG
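
A sketch of a branching pipeline: one PCollection can feed several transforms, and beam.Flatten unions branches back into a single PCollection.

package demo

import (
	"strings"

	"github.com/apache/beam/sdks/v2/go/pkg/beam"
)

// complexDAG branches and merges: the same PCollection is consumed twice,
// then the branches are flattened into one output.
func complexDAG(s beam.Scope, lines beam.PCollection) beam.PCollection {
	upper := beam.ParDo(s, strings.ToUpper, lines) // branch 1
	lower := beam.ParDo(s, strings.ToLower, lines) // branch 2
	return beam.Flatten(s, upper, lower)           // merge
}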

23 of 28

Element Wise

24 of 28

Per-Key or Window

25 of 28

26 of 28

Autoscaling Managed for You

  • Backlog = Scale Up Workers
    • If a streaming pipeline remains backlogged with sufficient parallelism on the workers for a couple of minutes, Dataflow scales up. Dataflow targets clearing the backlog in approximately 150 seconds after scaling up, given the current throughput per worker.
  • Lack of Backlog = Scaling Down Workers
    • If a streaming pipeline’s backlog is lower than 10 seconds and average worker CPU utilization stays below the target for a couple of minutes, Dataflow scales down.
  • Not if CPU Remains High
    • If there is no backlog but CPU usage is 75% or greater, the pipeline does not scale down.
  • Predictive Abilities
    • Predictive scaling: The streaming engine also uses a predictive Horizontal Autoscaling technique based on timer backlog. In the Dataflow model, the unbounded data in a streaming pipeline is divided into windows grouped by timestamps.
  • Adaptive Vertical Scaling
    • Dataflow has “adaptive” vertical scaling as well

27 of 28

Code & Demo

28 of 28

Questions?