1 of 38

FastFlow Building Blocks

Massimo Torquati

<torquati@di.unipi.it>

Pisa, 2019

2 of 38

High-level Parallel Patterns

  • Pattern-based Parallel Programming is nowadays a well-recognized approach for raising the level of abstraction in parallel computing without sacrificing performance and performance portability.

Notable Parallel Patterns

3 of 38

Are they enough? Do we have all we need?

Do it yourself!

By combining and customizing core patterns and basic mechanisms

By using the structured parallel programming methodology even for building high-level parallel patterns as well as Run-Time Systems (RTSs)

4 of 38

FastFlow 3.0 stack

  • High-level Parallel Patterns
  • Parallel and Sequential Building Blocks
  • Concurrency graph transformer

M. Torquati. “Harnessing Parallelism in Multi/Many-Cores with Streams and Parallel Patterns.” Ph.D. Thesis in Computer Science, 2019

5 of 38

FastFlow Building Block layer

6 of 38

FastFlow Building Blocks

(Figure: sequential nodes connected by channels, their pipeline composition, and parallel nodes)

7 of 38

Why Building Blocks?

  • Promote LEGO-style parallel programming
  • They have a clear functional and parallel semantics
  • Promote Performance Portability
    • they may have different implementations for different platforms
  • They “simplify the life” of the RTS programmer
    • … while Parallel Patterns mainly “simplify the life” of Application Programmers
  • Simple/Efficient Producer-Consumer Data-Flow semantics
    • PC semantics enables memory/cache optimizations
  • The concurrency graph composed of building blocks can be decomposed and re-assembled in different ways, enabling performance optimizations and increasing flexibility
    • chaining/fusion, fission, pipeline composition

M. Aldinucci, S. Campa, M. Danelutto, P. Kilpatrick, and M. Torquati. “Design patterns percolating to parallel programming framework implementation.” International Journal of Parallel Programming, 2014.

8 of 38

FastFlow Building Blocks

  • Channels built on top of a lock-free SPSC FIFO queue
  • Bounded and unbounded capacity (configurable)
    • Backward (feedback) channels always have unbounded capacity
  • Concurrency control: blocking and non-blocking (automatic switching between the two is experimental)

9 of 38

FastFlow Building Blocks

(Figure: a composition of three sequential nodes svc1, svc2, svc3 computing r = svc3(svc2(svc1(x))) on each stream item x)
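
As a minimal sketch (the node names and the float payload are illustrative, not taken from the slides), the three sequential building blocks can be written as ff_node_t subclasses whose svc() methods implement svc1, svc2 and svc3; the pipeline that connects them is sketched on the next slide.

```cpp
// Minimal sketch: three sequential BBs whose svc() methods realize
// r = svc3(svc2(svc1(x))). Names and the float payload are illustrative.
#include <cstdio>
#include <ff/ff.hpp>
using namespace ff;

struct Svc1 : ff_node_t<float> {
    float* svc(float*) {                      // source: generates the stream of x
        for (long i = 0; i < 10; ++i) ff_send_out(new float(i));
        return EOS;                           // end-of-stream
    }
};
struct Svc2 : ff_node_t<float> {
    float* svc(float* x) { *x = *x * *x; return x; }   // transform each item
};
struct Svc3 : ff_node_t<float> {
    float* svc(float* x) {                    // sink: consumes the result r
        std::printf("r = %.1f\n", *x);
        delete x;
        return GO_ON;                         // keep waiting for more input
    }
};
```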

10 of 38

FastFlow Building Blocks

The pipeline BB plays two roles:

  1. Building blocks’ container
  2. Data-Flow pipeline executor
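
A hedged sketch of both roles, reusing the Svc1/Svc2/Svc3 nodes defined in the previous sketch: the pipeline object first nests the three BBs (container) and then executes the resulting data-flow graph (executor).

```cpp
// Sketch: the pipeline BB as container and executor of the Svc1/Svc2/Svc3
// nodes defined in the previous sketch.
#include <ff/pipeline.hpp>

int main() {
    Svc1 s1; Svc2 s2; Svc3 s3;
    ff_pipeline pipe;                      // container of building blocks
    pipe.add_stage(&s1);
    pipe.add_stage(&s2);
    pipe.add_stage(&s3);
    if (pipe.run_and_wait_end() < 0)       // executor of the data-flow graph
        return -1;
    return 0;
}
```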

11 of 38

FastFlow Building Blocks

  • Master-worker model variants
  • Configurable and user-definable scheduling and gathering policies
  • Concurrency throttling (extra-functional feature)
  • Feedback channel modifier (optional)
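
A sketch of how such a farm can be assembled from sequential BBs, assuming a user-defined Emitter that also generates the stream; node names and the long payload are illustrative.

```cpp
// Sketch of the farm BB: Emitter -> replicated Workers -> Collector.
// Node names and the task type are illustrative.
#include <vector>
#include <ff/ff.hpp>
#include <ff/farm.hpp>
using namespace ff;

struct Emitter : ff_node_t<long> {
    long* svc(long*) {                       // generates/schedules the stream
        for (long i = 0; i < 100; ++i) ff_send_out(new long(i));
        return EOS;
    }
};
struct Worker : ff_node_t<long> {
    long* svc(long* t) { *t *= 2; return t; }
};
struct Collector : ff_node_t<long> {
    long* svc(long* t) { delete t; return GO_ON; }   // gathers the results
};

int main() {
    Worker w1, w2, w3, w4;
    Emitter E;  Collector C;
    std::vector<ff_node*> W = { &w1, &w2, &w3, &w4 };

    ff_farm farm;
    farm.add_emitter(&E);                    // scheduling policy lives here
    farm.add_workers(W);                     // functional replication
    farm.add_collector(&C);                  // gathering policy lives here
    // farm.wrap_around();                   // optional feedback channel (master-worker)
    return farm.run_and_wait_end();
}
```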

12 of 38

FastFlow Building Blocks

  • Multi functional replication
  • No master node
  • Shuffle communication pattern
  • User-definable scheduling and gathering policies
  • Feedback channel (optional)
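
A sketch of the all-to-all BB with two L-Workers (multi-output nodes) and three R-Workers (multi-input nodes), run here as a stand-alone BB with the L-Workers acting as sources; the round-robin shuffle policy and the node names are illustrative.

```cpp
// Sketch of the all-to-all BB. L-Workers are multi-output (ff_monode_t),
// R-Workers are multi-input (ff_minode_t); names are illustrative.
#include <ff/ff.hpp>
#include <ff/all2all.hpp>
using namespace ff;

struct LWorker : ff_monode_t<long> {
    long* svc(long*) {                                   // acts as a source here
        for (long i = 0; i < 10; ++i)                    // shuffle towards the R-Workers
            ff_send_out_to(new long(i), i % get_num_outchannels());
        return EOS;
    }
};
struct RWorker : ff_minode_t<long> {
    long* svc(long* t) { delete t; return GO_ON; }
};

int main() {
    LWorker l1, l2;
    RWorker r1, r2, r3;
    ff_a2a a2a;
    a2a.add_firstset<LWorker>({ &l1, &l2 });             // L-Worker set
    a2a.add_secondset<RWorker>({ &r1, &r2, &r3 });       // R-Worker set
    return a2a.run_and_wait_end();
}
```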

13 of 38

Sequential BBs recap

  • Classified according to their input/output cardinality:
    • SISO, MISO, SIMO
  • The cardinality of MISO, SIMO is defined when they are connected to other nodes
  • A connection is a FIFO SPSC channel
    • Blocking and non-blocking concurrency control
  • Sequential nodes can be combined. The aim is to decouple nodes from their concrete implementations (i.e. threads)
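
One way to combine sequential nodes so that several of them share the same thread is the ff_comb combiner; the following is a minimal sketch assuming ff_comb from <ff/combine.hpp>, with illustrative node names.

```cpp
// Sketch: fusing two sequential BBs into one node executed by a single thread,
// assuming the ff_comb combiner of FastFlow 3. Node names are illustrative.
#include <ff/ff.hpp>
#include <ff/combine.hpp>
using namespace ff;

struct Source : ff_node_t<long> {
    long* svc(long*) { for (long i = 0; i < 10; ++i) ff_send_out(new long(i)); return EOS; }
};
struct AddOne : ff_node_t<long> { long* svc(long* t) { *t += 1; return t; } };
struct Double : ff_node_t<long> { long* svc(long* t) { *t *= 2; return t; } };
struct Sink   : ff_node_t<long> { long* svc(long* t) { delete t; return GO_ON; } };

int main() {
    Source src;  AddOne a;  Double d;  Sink snk;
    ff_comb fused(&a, &d);            // AddOne and Double now share one thread
    ff_pipeline pipe;                 // 3 threads instead of 4
    pipe.add_stage(&src);
    pipe.add_stage(&fused);
    pipe.add_stage(&snk);
    return pipe.run_and_wait_end();
}
```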

14 of 38

Parallel BBs recap

  • pipeline: it models the pipeline composition of BBs and it also acts as a container of BBs, thus enabling BB nesting
  • farm: it models functional replication of pipelines coordinated by one or two master nodes (called Emitter and Collector)
  • all-to-all (A2A): it models two distinct sets of pipelines (L-Workers and R-Workers) connected according to the shuffle communication pattern.
    • The L-Worker pipelines must terminate with a SIMO node
    • The R-Worker pipelines must start with a MISO node

15 of 38

Composition Rules

(Figure: the three composition rules, Rule1, Rule2 and Rule3)

16 of 38

Composition Rules

  • Rule1: two sequential building blocks can be connected in a pipeline regardless of their input/output cardinality
  • Rule2: a parallel building block can be connected to a sequential building block through a multi-output (SIMO) and a multi-input (MISO) node
  • Rule3: two parallel building blocks can be connected either directly, if they have the same number of nodes, regardless of their input/output cardinality, or through multi-input/multi-output nodes if they have a different number of nodes
    • All-to-all introduction rule between two pipelines

(Figure: Rule1, Rule2 and Rule3 illustrated)

17 of 38

Input/Output cardinality

  • The input and output channel cardinality of a single building block, or of a composition of building blocks, can be computed using the rules shown aside (the symbol “*” denotes a cardinality that is not statically defined)
  • A given building block may have a different cardinality depending on the number of edge nodes composing it and on the kind of sequential node used to implement those edge nodes

18 of 38

Graphs that cannot be built with Building Blocks

Only a subset of all possible concurrency graphs can be built …

… but all the interesting ones (structured approach)

19 of 38

Building Blocks composition example

20 of 38

Building Blocks composition examples

21 of 38

Simple usage examples

22 of 38

Concurrency Graph Transformer

23 of 38

Basic idea

  • The concurrency graph transformer provides functions to refactor the FastFlow graph
  • Objectives: to reduce the number of nodes (i.e. default Emitters/Collectors) and to enable advanced combinations of BBs
  • A simple interface function, called optimize_static(), is provided to the FastFlow programmer to automatically apply transformations to the pipeline composition of BBs
    • These transformations are applied only if specific conditions hold, so as not to modify the semantics of the transformed program (user hints)
    • The programmer can force some transformations through a dedicated API provided by this layer, thereby taking responsibility for their correctness
  • Automatic transformations currently supported: nodes reduction, farm fusion, farm combine, All-to-All introduction
  • Extra-functional operations supported: concurrency control (blocking vs non-blocking), initial barrier removal, default mapping of threads (enabled by default, can be disabled), ...
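
A usage sketch: a three-stage pipeline whose middle stage is a farm with default Emitter/Collector is handed to optimize_static(). The OptLevel field names used below are assumptions about the hint flags, not verbatim API (check ff/optimize.hpp); the node types are illustrative.

```cpp
// Sketch of the graph transformer interface. The OptLevel field names are
// assumptions (check ff/optimize.hpp); the nodes are illustrative.
#include <memory>
#include <vector>
#include <ff/ff.hpp>
#include <ff/farm.hpp>
#include <ff/optimize.hpp>
using namespace ff;

struct Source : ff_node_t<long> {
    long* svc(long*) { for (long i = 0; i < 100; ++i) ff_send_out(new long(i)); return EOS; }
};
struct Work : ff_node_t<long> { long* svc(long* t) { *t *= *t; return t; } };
struct Sink : ff_node_t<long> { long* svc(long* t) { delete t; return GO_ON; } };

int main() {
    Source src;  Sink snk;
    std::vector<std::unique_ptr<ff_node>> W;
    for (int i = 0; i < 4; ++i) W.push_back(std::make_unique<Work>());
    ff_Farm<long> farm(std::move(W));     // farm with default Emitter/Collector

    ff_pipeline pipe;
    pipe.add_stage(&src);
    pipe.add_stage(&farm);
    pipe.add_stage(&snk);

    OptLevel opt;                         // user hints for the transformer
    opt.remove_collector   = true;        // nodes reduction (assumed flag name)
    opt.merge_with_emitter = true;        // fuse 'src' with the Emitter (assumed flag name)
    opt.blocking_mode      = true;        // extra-functional: blocking concurrency control
    if (optimize_static(pipe, opt) < 0) return -1;

    return pipe.run_and_wait_end();
}
```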

24 of 38

Example

25 of 38

Example

  • Transformations that modify the original functional semantics might still be acceptable if acknowledged by the programmer

How to help the programmer introduce such transformations in the FF’s high-level layer? Specialized nodes?

26 of 38

Nodes reduction

  • The ff_Farm pattern uses, by default, Emitter and Collector nodes that are not user-defined
  • We may want to avoid having dedicated nodes performing only scheduling and gathering functionalities
  • We can merge them with the previous/subsequent pipeline node
  • Example: three-staged pipeline where the middle stage is a farm with default emitter/collector

27 of 38

Nodes reduction

  • Extra nodes added when nesting a farm BB into a farm or a pipeline
  • Top: the results produced by the inner farm Workers will be directly forwarded to the collector of the outermost farm which will do the collection
  • Bottom: each Worker is a pipeline where the farm is the last stage containing a default collector.

28 of 38

Farm fusion

  • Two farms are consecutive stages of a pipeline and they have the same parallelism degree (number of Workers)
  • The result of the fusion is a single farm merging the two original ones
  • Besides reducing the number of nodes, this optimization may be useful:
    • to reduce latency (number of hops)
    • to avoid bottlenecks between the Collector of the first farm and the Emitter of the second one
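
A sketch of the starting point: two farms with the same number of Workers composed in a pipeline and then passed to the transformer. Whether the fusion is actually applied depends on the conditions checked by optimize_static(); the merge_farms flag name below is an assumption, and the node types are illustrative.

```cpp
// Sketch: two farms with equal parallelism degree in pipeline; the transformer
// may fuse them into a single farm. The 'merge_farms' flag name is an assumption.
#include <memory>
#include <vector>
#include <ff/ff.hpp>
#include <ff/farm.hpp>
#include <ff/optimize.hpp>
using namespace ff;

struct Source : ff_node_t<long> {
    long* svc(long*) { for (long i = 0; i < 100; ++i) ff_send_out(new long(i)); return EOS; }
};
struct W1 : ff_node_t<long> { long* svc(long* t) { *t += 1; return t; } };
struct W2 : ff_node_t<long> { long* svc(long* t) { *t *= 2; return t; } };
struct Sink : ff_node_t<long> { long* svc(long* t) { delete t; return GO_ON; } };

int main() {
    Source src;  Sink snk;
    std::vector<std::unique_ptr<ff_node>> W1s, W2s;
    for (int i = 0; i < 4; ++i) {            // same parallelism degree
        W1s.push_back(std::make_unique<W1>());
        W2s.push_back(std::make_unique<W2>());
    }
    ff_Farm<long> farm1(std::move(W1s));     // default Emitter/Collector
    ff_Farm<long> farm2(std::move(W2s));

    ff_pipeline pipe;
    pipe.add_stage(&src);
    pipe.add_stage(&farm1);
    pipe.add_stage(&farm2);
    pipe.add_stage(&snk);

    OptLevel opt;
    opt.merge_farms = true;                  // ask for farm fusion (assumed flag name)
    if (optimize_static(pipe, opt) < 0) return -1;
    return pipe.run_and_wait_end();
}
```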

29 of 38

Farm combine

  • Two farms in pipeline without any constraint on their parallelism degree and on their emitter/collector nodes (they can be either default or custom nodes)
  • An operation similar to “farm fusion”, but limited to the Emitter and Collector nodes

Case1

Case2

Case3

30 of 38

Farm combine

  • An interesting case of the farm combine transformation is given when one of the two farms to be combined is an ordered farm instance
  • This case introduces some constraints in order to preserve the ordering of stream elements

  • Case1: the first farm is a standard farm while the second one is an ordered farm. This is the simplest case (not shown)
  • Case2: the opposite of Case1. The new Emitter of the first farm adds unique identifiers to the input elements; the Emitter of the second farm is combined with a special ordering collector that reorders the results and removes the identifiers

31 of 38

All-to-all introduction

  • Two farms in pipeline, regardless of their parallelism degree
  • We want to remove the potential bottleneck introduced by the Collector and Emitter without changing the amount of concurrency of the worker nodes

Example

  • The Workers of the first and second farm are automatically transformed into multi-output and multi-input nodes, respectively
  • The Workers of the first farm are added to the L-Worker set of the all-to-all while the Workers of the second farm are added to the R-Worker set of the same all-to-all building block

32 of 38

All-to-all introduction

  • What if the Emitter and/or the Collector are user-defined?
  • The transformation has to be “guided” by the user

  • In the figure: R=E and G=C.
  • R and G can be any sequential composition of nodes such that they are a multi-output node and a multi-input node, respectively

33 of 38

Ferret Example

  • The Ferret application from the PARSEC benchmark suite (Princeton Application Repository for Shared-Memory Computers)
  • Ferret: content-based similarity search of feature-rich data such as audio, images, video, and 3D shapes

34 of 38

Ferret FastFlow code

35 of 38


Ferret transformations

36 of 38

All-to-all introduction example

  • To evaluate the advantage of the A2A introduction, we consider a classical benchmark used to evaluate frameworks for big data analytics: WordCount (PiCO library)

  • A pipeline of two parallel transformations (map-reduce):
    • a “flatmap” operator in charge of producing, for each word of a text file, a pair (1, word)
    • a “reduce” operator merging all the pairs with the same word into a unique pair (n, word), where n is the number of pairs with that word as key
  • Shuffling can be applied to avoid bottlenecks (see the sketch below)
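
A sketch of how the flatmap and reduce operators can be mapped onto the all-to-all BB, with words hash-partitioned among the reducers; this is an illustrative reconstruction, not the PiCO implementation, and all node names are assumptions.

```cpp
// Illustrative sketch (not the PiCO code): WordCount as an all-to-all.
// L-Workers are the "flatmap" (multi-output), R-Workers the "reduce" (multi-input).
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <ff/ff.hpp>
#include <ff/all2all.hpp>
using namespace ff;

struct Pair { std::string word; long count; };

struct FlatMap : ff_monode_t<std::string, Pair> {
    Pair* svc(std::string* line) {
        std::istringstream ss(*line);
        std::string w;
        while (ss >> w)                               // emit a (1, word) pair per word,
            ff_send_out_to(new Pair{w, 1},            // shuffled by key towards a reducer
                           std::hash<std::string>{}(w) % get_num_outchannels());
        delete line;
        return GO_ON;
    }
};

struct Reduce : ff_minode_t<Pair> {
    std::map<std::string, long> counts;               // local (n, word) partition
    Pair* svc(Pair* p) { counts[p->word] += p->count; delete p; return GO_ON; }
};

struct LineSource : ff_monode_t<std::string> {
    std::string* svc(std::string*) {                  // generates the lines of text
        const char* text[] = { "a rose is a rose", "time flies like an arrow" };
        for (int i = 0; i < 2; ++i)
            ff_send_out_to(new std::string(text[i]), i % get_num_outchannels());
        return EOS;
    }
};

int main() {
    LineSource src;
    FlatMap f1, f2;
    Reduce  r1, r2, r3;
    ff_a2a a2a;
    a2a.add_firstset<FlatMap>({ &f1, &f2 });          // flatmap set (L-Workers)
    a2a.add_secondset<Reduce>({ &r1, &r2, &r3 });     // reduce set (R-Workers)
    ff_pipeline pipe;
    pipe.add_stage(&src);
    pipe.add_stage(&a2a);
    return pipe.run_and_wait_end();
}
```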

37 of 38

All-to-all introduction example

  • Results obtained using the PiCO framework implemented on top of FastFlow. WordCount test on an Intel Xeon platform (24 cores, 28 threads), with different numbers of keys, which change the cost of the distribution
  • More than 350K different words (keys) in the “many keys” scenario, only 10K in the “few keys” case
  • The parallelism degree of the FlatMap is fixed to 24. Slightly better performance and a more stable behavior are obtained by increasing the degree of parallelism of the reduce phase
  • In the test with few keys, the reduce phase uses a parallelism degree of 2. Using the A2A introduces no additional cost here

38 of 38

Further research on BBs (some options)

  1. Introduce more advanced static transformations: some transformations need “helper nodes” to fulfill the Composition Rules. These nodes could be introduced automatically so as to enlarge the options
  2. Transformations applied dynamically, while the program is running (e.g., the initial program can be compiled into multiple versions):
    • To adapt to fluctuating workloads (in addition to concurrency throttling)
    • To use them in different places of the program (as for the ParallelFor)