Portability: Delivering on the Vision of Beam
Robert Bradshaw
robertwb@google.com
Stockholm - July 2019
The Beam Vision
Provide a comprehensive framework for data processing pipelines, one that allows you to write your pipeline once in your language of choice and run it with minimal effort on your �execution engine of choice.
The Beam Vision
⋮
input | CombinePerKey(sum)
Python
input.apply(
Sum.integersPerKey())
Java
stats.Sum(s, input)
Go
SELECT key, SUM(value) FROM input GROUP BY key
SQL
⋮
Cloud Dataflow
Apache Spark
Apache Flink
Apache Apex
Gearpump
Apache Samza
Apache Nemo (incubating)
IBM Streams
input.combineByKey(_ + _)
Scala
Beam Reality (pre 2018)
⋮
input | CombinePerKey(sum)
Python
input.apply(
Sum.integersPerKey())
Java
stats.Sum(s, input)
Go
SELECT key, SUM(value) FROM input GROUP BY key
SQL
⋮
Cloud Dataflow
Apache Spark
Apache Flink
Apache Apex
Gearpump
Apache Samza
Apache Nemo (incubating)
IBM Streams
input.combineByKey(_ + _)
Scala
Sum Per Key
Java objects
Sum Per Key
Python objects
?
Not a Viable Solution
⋮
input | CombinePerKey(sum)
Python
input.apply(
Sum.integersPerKey())
Java
stats.Sum(s, input)
Go
SELECT key, SUM(value) FROM input GROUP BY key
SQL
⋮
Cloud Dataflow
Apache Spark
Apache Flink
Apache Apex
Gearpump
Apache Samza
Apache Nemo (incubating)
IBM Streams
input.combineByKey(_ + _)
Scala
Sum Per Key
Java objects
Sum Per Key
Python objects
Go objects
Sum Per Key
⋮
Portability Framework
The Portability Framework is a language-agnostics way of representing and executing Beam Pipelines.
Runner
SDK Client
SDK Worker
Portability Framework
Runner
SDK Client
SDK Worker
Define Pipelines
(protobuf)
Submit and Monitor Pipelines
(grpc)
Execute User Code
(docker, grpc)
Portability Framework! [2019]
⋮
Apache Spark
Apache Flink
Apache Apex
Gearpump
Cloud Dataflow
Apache Samza
Apache Nemo (incubating)
IBM Streams
Sum Per Key
Java objects
Sum Per Key
Beam
Portable protos
⋮
input | CombinePerKey(sum)
Python
input.apply(
Sum.integersPerKey())
Java
stats.Sum(s, input)
Go
SELECT key, SUM(value) FROM input GROUP BY key
SQL
input.combineByKey(_ + _)
Scala
So, what's in it for me?
In addition to realizing the vision of a full cross-product of Runner and SDKs, the portability API also enables
Hermetic Worker Environment
Hermetic Worker Environment
Beam Environments
Beam Environments
Beam Environments
Beam Environments
Beam Environments
Cross-language transforms
Portability gives us
Cross-language transforms
Portability gives us
We are no longer bound to a single language/SDK in a given pipeline.
Cross-language transforms
Transforms can be shared among SDKs
More libraries available in language of your choice.
...
...
Cross-language transforms
Client SDK
Cross-language transforms
Client SDK
MyTransform(param=value)
Cross-language transforms
Client SDK
External SDK
Expansion
Service
MyTransform(param=value)
{param: int:value}
Cross-language transforms
External SDK
Client SDK
Expansion
Service
MyTransform(param=value)
MyTransformConfig {
int getParam();
}
{param: int:value}
Cross-language transforms
External SDK
Client SDK
Expansion
Service
Cross-language transforms
External SDK
Client SDK
Expansion
Service
Cross-language transforms
Client SDK
Cross-language transforms
Client SDK
Runner
Job Service
Worker
Worker
SDK Environment
SDK Environment
SDK Environment
Demo
TEST_COUNT_URN = "pytest:beam:transforms:count"
TEST_FILTER_URN = "pytest:beam:transforms:filter_less_than"
expansion_service = 'localhost:8096'
def pipeline(root):
(root
| ' PYTHON CREATE ' >> beam.Create(list(u'aaabccxyyzzz'), reshuffle=False)
| ' JAVA MAP ' >> beam.ExternalTransform(TEST_FILTER_URN, b'middle', expansion_service)
| ' JAVA COUNT ' >> beam.ExternalTransform(TEST_COUNT_URN, None, expansion_service)
| ' PYTHON MAP ' >> beam.Map(lambda kv: '%s: %s' % kv)
| beam.Map(lambda x: time.sleep(5)))
options = PipelineOptions(
experiments=['beam_fn_api'],
environment_type='LOOPBACK',
flink_master_url="localhost:8081")
FlinkRunner().run(pipeline, options=options)
New Features
Some new features exclusively available on Portability
New Features
Some new features exclusively available via Portability�
Portability is the future.
Status
Summary
The future is being built on top of Portability
Questions?