1 of 38

FastFlow Building Blocks

Massimo Torquati

<torquati@di.unipi.it>

Pisa, 2019

2 of 38

High-level Parallel Patterns

  • Pattern-based Parallel Programming is nowadays a well-recognized approach for raising the level of abstraction in parallel computing without sacrificing performance and performance portability.

Notable Parallel Patterns

3 of 38

Are they enough? Do we have all we need?

Do it yourself!

By combining and customizing core patterns and basic mechanisms

By using the structured parallel programming methodology even for building high-level parallel patterns as well as Run-Time Systems (RTSs)

4 of 38

FastFlow 3.0 stack

  • High-level Parallel Patterns
  • Parallel and Sequential Building Blocks
  • Concurrency graph transformer

M. Torquati. “Harnessing Parallelism in Multi/Many-Cores with Streams and Parallel Patterns.” Ph.D. Thesis in Computer Science, 2019

5 of 38

FastFlow Building Block layer

6 of 38

FastFlow Building Blocks

(Figure: sequential nodes connected by channels, their pipeline composition, and parallel nodes)

7 of 38

Why Building Blocks?

  • Promote LEGO-style parallel programming
  • They have a clear functional and parallel semantics
  • Promote Performance Portability
    • they may have different implementations for different platforms
  • They “simplify the life” of the RTS programmer
    • … while Parallel Patterns mainly “simplify the life” of Application Programmers
  • Simple/Efficient Producer-Consumer Data-Flow semantics
    • PC semantics enables memory/cache optimizations
  • The concurrency graph composed of building blocks can be decomposed and re-assembled in different ways, enabling performance optimizations and increasing flexibility
    • chaining/fusion, fission, pipeline composition

M. Aldinucci, S. Campa, M. Danelutto, P. Kilpatrick, and M. Torquati. “Design patterns percolating to parallel programming framework implementation.” International Journal of Parallel Programming, 2014.

8 of 38

FastFlow Building Blocks

  • Channels built on top of a lock-free SPSC FIFO queue
  • Bounded and unbounded capacity (configurable)
    • Backward (feedback) channels always have unbounded capacity
  • Concurrency control: blocking and non-blocking (automatic switching between the two is experimental)

9 of 38

FastFlow Building Blocks

(Figure: a composition of three sequential nodes svc1, svc2, svc3 computing r = svc3(svc2(svc1(x))) on each stream item x)
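
As a minimal sketch (the node names and the float payload are illustrative, not taken from the slides), the three sequential building blocks can be written as ff_node_t subclasses whose svc() methods implement svc1, svc2 and svc3; the pipeline that connects them is sketched on the next slide.

```cpp
// Minimal sketch: three sequential BBs whose svc() methods realize
// r = svc3(svc2(svc1(x))). Names and the float payload are illustrative.
#include <cstdio>
#include <ff/ff.hpp>
using namespace ff;

struct Svc1 : ff_node_t<float> {
    float* svc(float*) {                      // source: generates the stream of x
        for (long i = 0; i < 10; ++i) ff_send_out(new float(i));
        return EOS;                           // end-of-stream
    }
};
struct Svc2 : ff_node_t<float> {
    float* svc(float* x) { *x = *x * *x; return x; }   // transform each item
};
struct Svc3 : ff_node_t<float> {
    float* svc(float* x) {                    // sink: consumes the result r
        std::printf("r = %.1f\n", *x);
        delete x;
        return GO_ON;                         // keep waiting for more input
    }
};
```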

10 of 38

FastFlow Building Blocks

The pipeline BB plays two roles:

  1. Building blocks’ container
  2. Data-Flow pipeline executor
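
A hedged sketch of both roles, reusing the Svc1/Svc2/Svc3 nodes defined in the previous sketch: the pipeline object first nests the three BBs (container) and then executes the resulting data-flow graph (executor).

```cpp
// Sketch: the pipeline BB as container and executor of the Svc1/Svc2/Svc3
// nodes defined in the previous sketch.
#include <ff/pipeline.hpp>

int main() {
    Svc1 s1; Svc2 s2; Svc3 s3;
    ff_pipeline pipe;                      // container of building blocks
    pipe.add_stage(&s1);
    pipe.add_stage(&s2);
    pipe.add_stage(&s3);
    if (pipe.run_and_wait_end() < 0)       // executor of the data-flow graph
        return -1;
    return 0;
}
```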

11 of 38

FastFlow Building Blocks

  • Master-worker model variants
  • Configurable and user-definable scheduling and gathering policies
  • Concurrency throttling (extra-functional feature)
  • Feedback channel modifier (optional)
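
A sketch of how such a farm can be assembled from sequential BBs, assuming a user-defined Emitter that also generates the stream; node names and the long payload are illustrative.

```cpp
// Sketch of the farm BB: Emitter -> replicated Workers -> Collector.
// Node names and the task type are illustrative.
#include <vector>
#include <ff/ff.hpp>
#include <ff/farm.hpp>
using namespace ff;

struct Emitter : ff_node_t<long> {
    long* svc(long*) {                       // generates/schedules the stream
        for (long i = 0; i < 100; ++i) ff_send_out(new long(i));
        return EOS;
    }
};
struct Worker : ff_node_t<long> {
    long* svc(long* t) { *t *= 2; return t; }
};
struct Collector : ff_node_t<long> {
    long* svc(long* t) { delete t; return GO_ON; }   // gathers the results
};

int main() {
    Worker w1, w2, w3, w4;
    Emitter E;  Collector C;
    std::vector<ff_node*> W = { &w1, &w2, &w3, &w4 };

    ff_farm farm;
    farm.add_emitter(&E);                    // scheduling policy lives here
    farm.add_workers(W);                     // functional replication
    farm.add_collector(&C);                  // gathering policy lives here
    // farm.wrap_around();                   // optional feedback channel (master-worker)
    return farm.run_and_wait_end();
}
```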

12 of 38

FastFlow Building Blocks

  • Multi functional replication
  • No master node
  • Shuffle communication pattern
  • User-definable scheduling and gathering policies
  • Feedback channel (optional)
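
A sketch of the all-to-all BB with two L-Workers (multi-output nodes) and three R-Workers (multi-input nodes), run here as a stand-alone BB with the L-Workers acting as sources; the round-robin shuffle policy and the node names are illustrative.

```cpp
// Sketch of the all-to-all BB. L-Workers are multi-output (ff_monode_t),
// R-Workers are multi-input (ff_minode_t); names are illustrative.
#include <ff/ff.hpp>
#include <ff/all2all.hpp>
using namespace ff;

struct LWorker : ff_monode_t<long> {
    long* svc(long*) {                                   // acts as a source here
        for (long i = 0; i < 10; ++i)                    // shuffle towards the R-Workers
            ff_send_out_to(new long(i), i % get_num_outchannels());
        return EOS;
    }
};
struct RWorker : ff_minode_t<long> {
    long* svc(long* t) { delete t; return GO_ON; }
};

int main() {
    LWorker l1, l2;
    RWorker r1, r2, r3;
    ff_a2a a2a;
    a2a.add_firstset<LWorker>({ &l1, &l2 });             // L-Worker set
    a2a.add_secondset<RWorker>({ &r1, &r2, &r3 });       // R-Worker set
    return a2a.run_and_wait_end();
}
```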

13 of 38

Sequential BBs recap

  • Classified according to their input/output cardinality:
    • SISO, MISO, SIMO
  • The cardinality of MISO, SIMO is defined when they are connected to other nodes
  • A connection is a FIFO SPSC channel
    • Blocking and non-blocking concurrency control
  • Sequential nodes can be combined. The aim is to decouple nodes from their concrete implementations (i.e. threads)
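
One way to combine sequential nodes so that several of them share the same thread is the ff_comb combiner; the following is a minimal sketch assuming ff_comb from <ff/combine.hpp>, with illustrative node names.

```cpp
// Sketch: fusing two sequential BBs into one node executed by a single thread,
// assuming the ff_comb combiner of FastFlow 3. Node names are illustrative.
#include <ff/ff.hpp>
#include <ff/combine.hpp>
using namespace ff;

struct Source : ff_node_t<long> {
    long* svc(long*) { for (long i = 0; i < 10; ++i) ff_send_out(new long(i)); return EOS; }
};
struct AddOne : ff_node_t<long> { long* svc(long* t) { *t += 1; return t; } };
struct Double : ff_node_t<long> { long* svc(long* t) { *t *= 2; return t; } };
struct Sink   : ff_node_t<long> { long* svc(long* t) { delete t; return GO_ON; } };

int main() {
    Source src;  AddOne a;  Double d;  Sink snk;
    ff_comb fused(&a, &d);            // AddOne and Double now share one thread
    ff_pipeline pipe;                 // 3 threads instead of 4
    pipe.add_stage(&src);
    pipe.add_stage(&fused);
    pipe.add_stage(&snk);
    return pipe.run_and_wait_end();
}
```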

14 of 38

Parallel BBs recap

  • pipeline: it models the pipeline composition of BBs and it also acts as a container of BBs, thus enabling BB nesting
  • farm: it models functional replication of pipelines coordinated by one or two master nodes (called Emitter and Collector)
  • all-to-all (A2A): it models two distinct sets of pipelines (L-Workers and R-Workers) connected according to the shuffle communication pattern.
    • The L-Worker pipelines must terminate with a SIMO node
    • The R-Worker pipelines must start with a MISO node

15 of 38

Composition Rules

(Figure: the three composition rules, Rule1, Rule2 and Rule3)

16 of 38

Composition Rules

  • Rule1: two sequential building blocks can be connected in a pipeline regardless of their input/output cardinality
  • Rule2: a parallel building block can be connected to a sequential building block through a multi-output (SIMO) and a multi-input (MISO) node
  • Rule3: two parallel building blocks can be connected either directly, if they have the same number of nodes, regardless of their input/output cardinality, or through multi-input/multi-output nodes if they have a different number of nodes
    • All-to-all introduction rule between two pipelines

(Figure: Rule1, Rule2 and Rule3 illustrated)

17 of 38

Input/Output cardinality

  • The input and output channel cardinality of a single building block, or of a composition of building blocks, can be computed using the rules shown aside (the symbol “*” denotes a cardinality that is not statically defined)
  • A given building block may have a different cardinality depending on the number of edge nodes composing it and on the kind of sequential node used to implement those edge nodes

18 of 38

Graphs that cannot be built with Building Blocks

Only a subset of all possible concurrency graphs can be built …

… but all the interesting ones (structured approach)

19 of 38

Building Blocks composition example

20 of 38

Building Blocks composition examples

21 of 38

Simple usage examples

22 of 38

Concurrency Graph Transformer

23 of 38

Basic idea

  • The concurrency graph transformer provides functions to refactor the FastFlow graph
  • Objectives: to reduce the number of nodes (i.e. default Emitters/Collectors) and to enable advanced combinations of BBs
  • A simple interface function, called optimize_static(), is provided to the FastFlow programmer to automatically apply transformations to the pipeline composition of BBs
    • These transformations are applied only if specific conditions hold, so as not to modify the semantics of the transformed program (user hints)
    • The programmer can force some transformations through a dedicated API provided by this layer, thereby taking responsibility for their correctness
  • Automatic transformations currently supported: nodes reduction, farm fusion, farm combine, All-to-All introduction
  • Extra-functional operations supported: concurrency control (blocking vs non-blocking), initial barrier removal, default mapping of threads (enabled by default, can be disabled), ...
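
A usage sketch: a three-stage pipeline whose middle stage is a farm with default Emitter/Collector is handed to optimize_static(). The OptLevel field names used below are assumptions about the hint flags, not verbatim API (check ff/optimize.hpp); the node types are illustrative.

```cpp
// Sketch of the graph transformer interface. The OptLevel field names are
// assumptions (check ff/optimize.hpp); the nodes are illustrative.
#include <memory>
#include <vector>
#include <ff/ff.hpp>
#include <ff/farm.hpp>
#include <ff/optimize.hpp>
using namespace ff;

struct Source : ff_node_t<long> {
    long* svc(long*) { for (long i = 0; i < 100; ++i) ff_send_out(new long(i)); return EOS; }
};
struct Work : ff_node_t<long> { long* svc(long* t) { *t *= *t; return t; } };
struct Sink : ff_node_t<long> { long* svc(long* t) { delete t; return GO_ON; } };

int main() {
    Source src;  Sink snk;
    std::vector<std::unique_ptr<ff_node>> W;
    for (int i = 0; i < 4; ++i) W.push_back(std::make_unique<Work>());
    ff_Farm<long> farm(std::move(W));     // farm with default Emitter/Collector

    ff_pipeline pipe;
    pipe.add_stage(&src);
    pipe.add_stage(&farm);
    pipe.add_stage(&snk);

    OptLevel opt;                         // user hints for the transformer
    opt.remove_collector   = true;        // nodes reduction (assumed flag name)
    opt.merge_with_emitter = true;        // fuse 'src' with the Emitter (assumed flag name)
    opt.blocking_mode      = true;        // extra-functional: blocking concurrency control
    if (optimize_static(pipe, opt) < 0) return -1;

    return pipe.run_and_wait_end();
}
```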

24 of 38

Example

25 of 38

Example

  • Transformations that modify the original functional semantics might still be acceptable if acknowledged by the programmer

How to help the programmer introduce such transformations in the FF’s high-level layer? Specialized nodes?

26 of 38

Nodes reduction

  • The ff_Farm pattern uses, by default, Emitter and Collector nodes that are not user-defined
  • We may want to avoid having dedicated nodes performing only scheduling and gathering functionalities
  • We can merge them with the previous/subsequent pipeline node
  • Example: three-staged pipeline where the middle stage is a farm with default emitter/collector

27 of 38

Nodes reduction

  • Extra nodes added when nesting a farm BB into a farm or a pipeline
  • Top: the results produced by the inner farm Workers will be directly forwarded to the collector of the outermost farm which will do the collection
  • Bottom: each Worker is a pipeline where the farm is the last stage containing a default collector.

28 of 38

Farm fusion

  • Two farms are consecutive stages of a pipeline and they have the same parallelism degree (number of Workers)
  • The result of the fusion is a single farm merging the two original ones
  • Besides reducing the number of nodes, this optimization may be useful:
    • to reduce latency (number of hops)
    • to avoid bottlenecks between the Collector of the first farm and the Emitter of the second one
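
A sketch of the starting point: two farms with the same number of Workers composed in a pipeline and then passed to the transformer. Whether the fusion is actually applied depends on the conditions checked by optimize_static(); the merge_farms flag name below is an assumption, and the node types are illustrative.

```cpp
// Sketch: two farms with equal parallelism degree in pipeline; the transformer
// may fuse them into a single farm. The 'merge_farms' flag name is an assumption.
#include <memory>
#include <vector>
#include <ff/ff.hpp>
#include <ff/farm.hpp>
#include <ff/optimize.hpp>
using namespace ff;

struct Source : ff_node_t<long> {
    long* svc(long*) { for (long i = 0; i < 100; ++i) ff_send_out(new long(i)); return EOS; }
};
struct W1 : ff_node_t<long> { long* svc(long* t) { *t += 1; return t; } };
struct W2 : ff_node_t<long> { long* svc(long* t) { *t *= 2; return t; } };
struct Sink : ff_node_t<long> { long* svc(long* t) { delete t; return GO_ON; } };

int main() {
    Source src;  Sink snk;
    std::vector<std::unique_ptr<ff_node>> W1s, W2s;
    for (int i = 0; i < 4; ++i) {            // same parallelism degree
        W1s.push_back(std::make_unique<W1>());
        W2s.push_back(std::make_unique<W2>());
    }
    ff_Farm<long> farm1(std::move(W1s));     // default Emitter/Collector
    ff_Farm<long> farm2(std::move(W2s));

    ff_pipeline pipe;
    pipe.add_stage(&src);
    pipe.add_stage(&farm1);
    pipe.add_stage(&farm2);
    pipe.add_stage(&snk);

    OptLevel opt;
    opt.merge_farms = true;                  // ask for farm fusion (assumed flag name)
    if (optimize_static(pipe, opt) < 0) return -1;
    return pipe.run_and_wait_end();
}
```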

29 of 38

Farm combine

  • Two farms in pipeline without any constraint on their parallelism degree and on their emitter/collector nodes (they can be either default or custom nodes)
  • An operation similar to “farm fusion”, but limited to the Emitter and Collector nodes

Case1

Case2

Case3

30 of 38

Farm combine

  • An interesting case of the farm combine transformation is given when one of the two farms to be combined is an ordered farm instance
  • This case introduces some constraints in order to preserve the ordering of stream elements

  • Case1: the first farm is a standard farm while the second one is an ordered farm. This is the simplest case (not shown)
  • Case2: the opposite of Case1. The new Emitter of the first farm adds unique identifiers to the input elements; the Emitter of the second farm is combined with a special ordering collector that reorders the results and removes the identifiers

31 of 38

All-to-all introduction

  • Two farms in pipeline, regardless of their parallelism degree
  • We want to remove the potential bottleneck introduced by the Collector and Emitter without changing the amount of concurrency of the worker nodes

Example

  • The Workers of the first and second farm are automatically transformed into multi-output and multi-input nodes, respectively
  • The Workers of the first farm are added to the L-Worker set of the all-to-all while the Workers of the second farm are added to the R-Worker set of the same all-to-all building block

32 of 38

All-to-all introduction

  • What if the Emitter and/or the Collector are user-defined?
  • The transformation has to be “guided” by the user

  • In the figure: R=E and G=C.
  • R and G can be any sequential composition of nodes such that they are a multi-output node and a multi-input node, respectively

33 of 38

Ferret Example

  • The Ferret application from the PARSEC benchmark suite (Princeton Application Repository for Shared-Memory Computers)
  • Ferret: content-based similarity search of feature-rich data such as audio, images, video, and 3D shapes

34 of 38

Ferret FastFlow code

35 of 38


Ferret transformations

36 of 38

All-to-all introduction example

  • To evaluate the advantage of the A2A introduction, we consider a classical benchmark used to evaluate frameworks for big data analytics: WordCount (PiCO library)

  • A pipeline of two parallel transformations (map-reduce):
    • a “flatmap” operator in charge of producing, for each word of a text file, a pair (1, word)
    • a “reduce” operator merging all the pairs with the same word into a unique pair (n, word), where n is the number of pairs with that word as key
  • Shuffling can be applied to avoid bottlenecks (see the sketch below)
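
A sketch of how the flatmap and reduce operators can be mapped onto the all-to-all BB, with words hash-partitioned among the reducers; this is an illustrative reconstruction, not the PiCO implementation, and all node names are assumptions.

```cpp
// Illustrative sketch (not the PiCO code): WordCount as an all-to-all.
// L-Workers are the "flatmap" (multi-output), R-Workers the "reduce" (multi-input).
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <ff/ff.hpp>
#include <ff/all2all.hpp>
using namespace ff;

struct Pair { std::string word; long count; };

struct FlatMap : ff_monode_t<std::string, Pair> {
    Pair* svc(std::string* line) {
        std::istringstream ss(*line);
        std::string w;
        while (ss >> w)                               // emit a (1, word) pair per word,
            ff_send_out_to(new Pair{w, 1},            // shuffled by key towards a reducer
                           std::hash<std::string>{}(w) % get_num_outchannels());
        delete line;
        return GO_ON;
    }
};

struct Reduce : ff_minode_t<Pair> {
    std::map<std::string, long> counts;               // local (n, word) partition
    Pair* svc(Pair* p) { counts[p->word] += p->count; delete p; return GO_ON; }
};

struct LineSource : ff_monode_t<std::string> {
    std::string* svc(std::string*) {                  // generates the lines of text
        const char* text[] = { "a rose is a rose", "time flies like an arrow" };
        for (int i = 0; i < 2; ++i)
            ff_send_out_to(new std::string(text[i]), i % get_num_outchannels());
        return EOS;
    }
};

int main() {
    LineSource src;
    FlatMap f1, f2;
    Reduce  r1, r2, r3;
    ff_a2a a2a;
    a2a.add_firstset<FlatMap>({ &f1, &f2 });          // flatmap set (L-Workers)
    a2a.add_secondset<Reduce>({ &r1, &r2, &r3 });     // reduce set (R-Workers)
    ff_pipeline pipe;
    pipe.add_stage(&src);
    pipe.add_stage(&a2a);
    return pipe.run_and_wait_end();
}
```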

37 of 38

All-to-all introduction example

  • Results obtained using the PiCO framework implemented on top of FastFlow. WordCount test on an Intel Xeon platform (24 cores, 28 threads), with different numbers of keys, which change the cost of the distribution
  • More than 350K different words (keys) in the “many keys” scenario, only 10K in the “few keys” case
  • The parallelism degree of the FlatMap is fixed to 24. Slightly better performance and a more stable behavior are obtained by increasing the degree of parallelism of the reduce phase
  • In the test with few keys, the reduce phase uses a parallelism degree of 2. Using the A2A introduces no additional cost here

38 of 38

Further research on BBs (some options)

  1. Introduce more advanced static transformations: some transformations need “helper nodes” to fulfill the Composition Rules. These nodes could be introduced automatically so as to enlarge the options
  2. Transformations applied dynamically, while the program is running (e.g., the initial program can be compiled into multiple versions):
    • To adapt to fluctuating workloads (in addition to concurrency throttling)
    • To use them in different places of the program (as for the ParallelFor)