1 of 79

CINI HPC Summer School

16-20 June, 2025 - Naples, Italy

Programming Tools for High-Performance Data Analysis

Domenico Talia and Paolo Trunfio

{talia,trunfio}@dimes.unical.it

University of Calabria & CINI lab HPC-KTT


2 of 79

Motivations and goals

  • The world today generates an unprecedented amount of data, and the ability to extract valuable insights from these data is critical to success in many fields, including science, business, and government.
  • The best way to exploit the value of the massive amount of available data is to implement scalable data analysis applications that efficiently extract useful patterns, models, and trends from them.
  • Programming high-performance data analysis applications is a multifaceted task that requires a deep understanding of various concepts including data analytics, distributed computing, parallel processing, and machine learning.


D. Talia, P. Trunfio

Programming Tools for High-Performance Data Analysis

Tutorial @ HPC Summer School 2025

June 18th, 2025, Naples, Italy

3 of 79

Motivations and goals

  • Provide an essential guide for developers who want to build scalable big data applications on HPC systems, covering:
    • The main programming models for big data, which support users in expressing parallel algorithms and applications by providing an abstraction for a parallel computer architecture.
    • The most widely used programming tools for big data processing, dealing with different kinds of data (from structured data to graphs and streams) and domains (batch, streaming, graph-, and query-based applications).
    • The key features that support programmers in choosing the most appropriate framework, along with other important factors that can drive this choice, such as data type, infrastructure scale, and developer skills.


4 of 79

Programming Models for Big Data

Programming Models


5 of 79

Programming models for big data

  • “A programming model is an interface that separates high-level properties from low-level ones, providing specific operations to the programming level above and requiring implementations for all the operations on the architectural level below” (Skillicorn and Talia, 1998).
  • A parallel programming model is an abstraction for a parallel computer architecture that aids in the expression of parallel algorithms and applications. It can represent a variety of problems for different parallel and distributed systems.


6 of 79

Programming models for big data

  • Parallel programming models are often the core feature of big data frameworks since they impact the execution paradigm of big data processing engines as well as the way users design and create applications.
  • They enable the separation of software development issues from parallel execution concerns, providing abstraction and stability.
    • Abstraction is guaranteed because the model’s operations are at a higher level than those of the underlying architectures.
    • This simplifies the software structure and the difficulty of its development, also guaranteeing stability through a standard interface.
  • Therefore, a model can lower the implementation effort, making decisions once for each target system, rather than for each program.


7 of 79

Programming models for big data

  • Programming systems are implementations of one or more models and can be developed through different strategies:
    • Language Development: it involves creating new parallel programming languages or integrating parallel constructs and data structures into existing languages.
    • Annotation Approach: it uses specific symbols or keywords in annotations to identify parallel statements in the program code and tell the compiler which instructions must be executed concurrently.
    • Library Integration: this approach involves enhancing parallelism by including libraries in the application code, which is the most popular approach since it is orthogonal to host languages.


8 of 79

The MapReduce model

  • The MapReduce programming model was developed by Google to face the challenge of processing big data effectively.

  • Its paradigm was inspired by the map and reduce functions available in functional programming languages, such as LISP, and it allows designers to create distributed applications based on those two operations.


9 of 79

The MapReduce model

  • The MapReduce model largely exploits the divide-and-conquer strategy to tackle issues related to big data:
    1. divide the problem into smaller sub-problems,
    2. execute independent sub-problems in parallel using several workers,
    3. combine intermediate results from each individual worker.


10 of 79

The MapReduce model

  • The programmer defines two phases for MapReduce processing: map and reduce.
  • The map function takes a (key, value) pair as input and produces a list of intermediate (key, value) pairs:

map (k1, v1) → list(k2, v2)

  • The reduce function merges all intermediate values with the same intermediate key:

reduce (k2, list(v2)) → list(v3)
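The two signatures above can be sketched as a small, sequential Python emulation. This is only an illustration under stated assumptions: the driver `run_mapreduce`, the word-count functions, and the sample documents are hypothetical names introduced here, and a real MapReduce engine would distribute the map and reduce calls across many workers.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Sequentially emulate the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    # Map phase: each (k1, v1) record yields intermediate (k2, v2) pairs.
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            # Shuffle: group intermediate values by intermediate key.
            groups[k2].append(v2)
    # Reduce phase: each key and its list of values is reduced independently.
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}


def wc_map(doc_id, text):
    # map (k1, v1) -> list(k2, v2): emit (word, 1) for every word.
    return [(word, 1) for word in text.lower().split()]


def wc_reduce(word, counts):
    # reduce (k2, list(v2)) -> list(v3): sum the partial counts.
    return [sum(counts)]


documents = [("d1", "big data tools"), ("d2", "big data models")]
word_counts = run_mapreduce(documents, wc_map, wc_reduce)
print(word_counts)
```

Because each reduce call only sees the values of one key, the calls could run on different machines without coordination, which is the property the model relies on.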


11 of 79

Parallelism in MapReduce

  • Parallelism is achieved in both phases:
    • In the map phase, where keys can be processed in parallel by different computers (map calls are distributed across computers by sharding the input data).
    • In the reduce phase, where reducers working on distinct keys are executed concurrently.
  • As a consequence, MapReduce algorithms scale from a single server to hundreds of thousands of servers.
  • The MapReduce approach hides the details of the underlying parallelization from the programmer, making it simple to use.
  • Developers can focus on defining the computations, without considering the details of how they are performed or how data is sent to processors.


12 of 79

Example: inverted index

  • An example of a MapReduce application is the generation of an inverted index.
  • Given a set of documents, this index contains a set of words (or index terms), specifying the IDs of all the documents that contain each word.
  • A MapReduce approach can be effectively leveraged in this case:
    • The map function generates a sequence of <word, documentID> pairs for each input document.
    • The reduce function takes all the pairs for a given word, sorts the corresponding document IDs, and emits a <word, list (documentID)> pair.


13 of 79

Example: inverted index

  • The inverted index for the input documents is formed by the set of all output pairs created by the reduce function:
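As a minimal sketch of this example, the Python code below builds an inverted index for three hypothetical documents in memory. The document contents and the helper names `map_document` and `reduce_word` are assumptions made for illustration; emitting one pair per distinct word is also a choice of this sketch, not something prescribed by the model.

```python
from collections import defaultdict

# Hypothetical input: documentID -> text.
docs = {
    "doc1": "spark and hadoop",
    "doc2": "hadoop mapreduce",
    "doc3": "spark streaming",
}


def map_document(doc_id, text):
    # Emit one <word, documentID> pair per distinct word in the document.
    return [(word, doc_id) for word in set(text.split())]


def reduce_word(word, doc_ids):
    # Sort the document IDs and emit a <word, list(documentID)> pair.
    return word, sorted(doc_ids)


# Shuffle step: group the intermediate pairs by word.
intermediate = defaultdict(list)
for doc_id, text in docs.items():
    for word, emitted_id in map_document(doc_id, text):
        intermediate[word].append(emitted_id)

inverted_index = dict(reduce_word(w, ids) for w, ids in intermediate.items())
print(inverted_index)
```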


14 of 79

The workflow model

  • Workflows adopt a declarative approach to express the high-level logic of many types of applications while obscuring low-level details that are not essential for application design.
  • One significant benefit of workflows is their ability to be stored and retrieved, facilitating modification and/or re-execution. This allows users to design and reuse common patterns in multiple contexts.
  • A workflow management system (WMS) facilitates the definition, development, and execution of processes, with the coordination of activities (or enactment) playing a pivotal role during workflow execution.


15 of 79

Workflow patterns

  • Workflow tasks can be combined in various ways, allowing designers to address the needs of a wide range of application scenarios through the use of recurring and reusable constructs (sequential and parallel).
  • Workflow patterns provide a standardized way of organizing and orchestrating tasks within a process.
  • The main workflow patterns are:
    • Sequence
    • Branching
    • Synchronization
    • Repetition


16 of 79

Directed Acyclic Graphs

  • A Directed Acyclic Graph (DAG) is a workflow that is both:
    • Directed: if multiple tasks exist, each must have at least one previous or subsequent task, or both. However, some DAGs have multiple parallel tasks that lack direct interdependencies.
    • Acyclic: no task can depend, directly or indirectly, on its own output, because such a dependency could result in an infinite loop. This means that DAGs do not have any cycles.
  • DAGs are the most commonly used programming structure in workflow management, and they have proven to be extremely useful in a variety of big data frameworks, including Apache Spark.
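The two properties can be illustrated with Python's standard `graphlib` module (Python 3.9 or later). The task names below are hypothetical; the sketch only shows that a valid execution order exists for a DAG and that adding a back edge is rejected as a cycle.

```python
from graphlib import CycleError, TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical names).
workflow = {
    "clean": {"load"},
    "features": {"clean"},
    "train": {"features"},
    "report": {"train", "clean"},
}

# A DAG always admits an order in which every task follows its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)

# Making "load" depend on "report" closes a cycle, so this is no longer a DAG.
cyclic = dict(workflow, load={"report"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```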


17 of 79

Directed Acyclic Graphs

  • The DAG paradigm is effective for modeling complex data analysis processes, such as machine learning applications, which can be efficiently executed on distributed/parallel/cloud computing systems.
  • DAGs can easily model many different types of applications, in which the input, output, and tasks of one application are dependent on the tasks of another.


18 of 79

The message-passing model

  • The message-passing model is a paradigm for inter-process communication (IPC) in distributed computing, where each processing element has its own private memory.
  • IPC mechanisms, provided by the operating system, include shared memory and distributed memory or message passing.
  • Parallel programming models are generally categorized based on memory usage.


19 of 79

Shared-memory vs Message-passing model

  • The key distinction lies in how processes interact and share data: shared-memory relies on a common address space, while message passing relies on communication through explicit message exchange.
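The contrast can be sketched with Python threads. This is an emulation of the two styles only: threads in one process always share an address space, so the message-passing half merely imitates workers with private state that communicate through explicit messages, as processes on a distributed-memory machine would. All names are hypothetical.

```python
import queue
import threading

N_WORKERS, N_INCREMENTS = 4, 1000

# Shared-memory style: every worker updates the same variable,
# so a lock is needed to keep concurrent updates consistent.
shared = {"total": 0}
lock = threading.Lock()


def shared_worker():
    for _ in range(N_INCREMENTS):
        with lock:
            shared["total"] += 1


# Message-passing style: each worker keeps private state and
# sends its result as an explicit message; no lock is required.
mailbox = queue.Queue()


def messaging_worker():
    local_total = 0
    for _ in range(N_INCREMENTS):
        local_total += 1
    mailbox.put(local_total)


for target in (shared_worker, messaging_worker):
    threads = [threading.Thread(target=target) for _ in range(N_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

message_total = sum(mailbox.get() for _ in range(N_WORKERS))
print(shared["total"], message_total)
```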


20 of 79

The BSP model

  • Bulk Synchronous Parallel (BSP) is a parallel computation model developed by Leslie Valiant (1990).
  • Valiant proposed it as a bridging model that connects hardware and software for parallel machines, playing a role similar to that of the von Neumann model for sequential machines.
  • The BSP approach enables programmers to avoid expensive memory and communication management, achieving efficient parallel computation with a low degree of synchronization.


21 of 79

BSP computation

  • A computation in the BSP model is made up of a sequence of supersteps, where each processor is assigned a task involving local computing steps, message transmissions, and message arrivals.
  • A global check occurs every L time units (the periodicity parameter) to verify superstep completion by all processors before proceeding to the next superstep.

Figure: A BSP superstep (processors P0, P1, P2, P3, ..., Pn)
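The superstep structure can be sketched as a sequential simulation. The code below is an illustration under simplifying assumptions: the barrier is modeled by swapping message buffers after every processor has run, rather than by a periodic check every L time units, and `bsp_run` and `global_sum` are hypothetical names.

```python
def bsp_run(state, step_fn, n_supersteps):
    """Simulate BSP: local computation, message exchange, then a barrier."""
    n_procs = len(state)
    inboxes = [[] for _ in range(n_procs)]
    for step in range(n_supersteps):
        outboxes = [[] for _ in range(n_procs)]
        for pid in range(n_procs):
            # A processor only sees messages delivered at the previous barrier.
            for dest, msg in step_fn(step, pid, state, inboxes[pid]):
                outboxes[dest].append(msg)
        # Barrier: sent messages become visible in the next superstep.
        inboxes = outboxes
    return state


def global_sum(step, pid, state, inbox):
    if step == 0:
        # Superstep 0: every processor sends its local value to P0.
        return [(0, state[pid]["value"])]
    if step == 1 and pid == 0:
        # Superstep 1: P0 adds the values and sends the total to everyone.
        total = sum(inbox)
        return [(dest, total) for dest in range(len(state))]
    if step == 2:
        # Superstep 2: every processor stores the total it received.
        state[pid]["total"] = inbox[0]
    return []


procs = [{"value": v} for v in (3, 1, 4, 1, 5)]
bsp_run(procs, global_sum, n_supersteps=3)
print([p["total"] for p in procs])
```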


22 of 79

The SQL-like model

  • SQL-like systems aim at combining the efficiency of MapReduce programming with the simplicity of the SQL language.
  • MapReduce can address scalability issues and reduce query times, but its complexity may pose challenges for less skilled users.
  • SQL-like systems overcome MapReduce programming complexities for simple operations (e.g., row aggregations, selects, or counts) while maintaining query speeds and scalability.
  • In many cases, SQL-like systems automatically optimize queries on large repositories, using MapReduce under the hood.
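To show how compact such a simple operation is in SQL, the sketch below counts rows per group with a single declarative query. It uses Python's built-in `sqlite3` module purely as a stand-in: SQLite is not a big data engine, and the `visits` table and its rows are hypothetical. A hand-written MapReduce version of the same count would need a map function emitting (page, 1) pairs and a reduce function summing them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (page TEXT, user_id INTEGER)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [("home", 1), ("home", 2), ("docs", 1), ("home", 3), ("docs", 4)],
)

# The query states what is wanted; the engine decides how to compute it.
rows = conn.execute(
    "SELECT page, COUNT(*) AS n FROM visits GROUP BY page ORDER BY n DESC"
).fetchall()
conn.close()
print(rows)
```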


23 of 79

Why use SQL on big data?

  • SQL has become the go-to tool for developers, database managers, and data scientists, widely utilized in commercial products for data querying, modification, and visualization.
  • Its key benefits include:
    • Declarative Language: SQL is declarative, describing data transformations and operations, making it easily understandable.
    • Interoperability: Being a standardized language, it allows different systems to provide their own implementations while ensuring compatibility and a syntax that can be easily understood by users.
    • Data-driven: SQL operations reflect transformations and modifications of input datasets, making it a convenient programming model for data-centric applications in traditional and big data environments.


24 of 79

The PGAS model

  • Partitioned Global Address Space (PGAS) is a parallel programming model aimed at boosting programmer productivity while maintaining high performance.
  • Its core idea is to use a globally shared address space to enhance productivity, while keeping local and remote data accesses distinct.
  • This separation of data accesses is crucial for achieving performance improvements and ensuring scalability on large-scale parallel architectures.
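The idea of a global index space with an explicit local/remote distinction can be sketched with a toy class. This is not a real PGAS runtime: `PartitionedArray`, its block distribution, and the remote-access counter are assumptions made for illustration, and everything runs in a single process.

```python
class PartitionedArray:
    """Toy global array whose elements are block-partitioned across ranks."""

    def __init__(self, size, n_ranks):
        self.block = -(-size // n_ranks)  # ceiling division
        self.parts = [
            [0] * min(self.block, max(0, size - r * self.block))
            for r in range(n_ranks)
        ]
        self.remote_accesses = 0

    def owner(self, i):
        # Any rank can compute which rank holds global index i.
        return i // self.block

    def read(self, my_rank, i):
        if self.owner(i) != my_rank:
            self.remote_accesses += 1  # would be a network transfer
        return self.parts[self.owner(i)][i % self.block]

    def write(self, my_rank, i, value):
        if self.owner(i) != my_rank:
            self.remote_accesses += 1  # would be a network transfer
        self.parts[self.owner(i)][i % self.block] = value


arr = PartitionedArray(size=8, n_ranks=4)
arr.write(my_rank=1, i=2, value=10)   # local: rank 1 owns indices 2 and 3
arr.write(my_rank=1, i=3, value=20)   # local
neighbour = arr.read(my_rank=1, i=4)  # remote: index 4 belongs to rank 2
print(arr.remote_accesses, neighbour)
```

Every rank addresses the array with the same global indices, yet the cost of an access depends on who owns the element, which is why PGAS programs try to keep most accesses local.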


25 of 79

The PGAS model


26 of 79

Models for exascale systems

  • Exascale systems represent a promising opportunity, yet their design and implementation are intricate due to challenges such as scalability, network latency, reliability, and the robustness of data operations.
  • Efficiently handling vast data volumes requires scalable algorithms capable of partitioning and analyzing data through millions of parallel operations.
  • Modern HPC systems demand scalable programming models for optimal performance, supporting programmers to address the complexity of managing millions to billions of concurrent threads.


27 of 79

Requirements of Exascale Models

  • A scalable exascale programming model should incorporate the following mechanisms:
    • Parallel data access to improve data access bandwidth by accessing different elements concurrently.
    • Fault resiliency to deal with failures occurring during non-local communication.
    • Data-driven local communication to limit data exchange.
    • Data processing on limited groups of cores on specific exascale machines.
    • Near-data synchronization, reducing the overhead generated by synchronization among many distant cores.
    • In-memory analytics to decrease reaction time by caching data in the RAMs of processing nodes.
    • Locality-based data selection to reduce latency by keeping a subset of data locally available.
  • Solutions traditionally used in HPC systems (e.g., MPI and OpenMP) face significant challenges for programming software designed to run on exascale systems.


28 of 79

Models for exascale programming

  • Parallel applications on exascale systems must efficiently manage millions of threads running on a very large array of cores, necessitating strategies to minimize synchronization, reduce communication and remote memory usage, and address potential software and hardware faults.
  • Several programming models have been proposed to cope with the needs of exascale environments, such as:
    • Legion
    • Charm++
    • DCEx
    • X10
    • Chapel
    • UPC++


29 of 79

Legion

  • Legion is a distributed memory programming model designed for high performance on diverse parallel architectures.
  • Data organization is based on logical regions, which can be dynamically allocated, removed, and used to store groups of objects in data structures.
  • Regions can also be supplied as inputs to distinct functions, called tasks, which read data in specific regions and provide locality information.
  • Logical regions can be divided into either disjoint or aliased subregions, offering crucial information for assessing computation independence.


30 of 79

Charm++

  • Charm++ is a distributed memory programming model where a program defines collections of interacting objects dynamically mapped to processors by the runtime system.
  • It employs an asynchronous, message-driven, task-based approach with movable objects.
  • Objects can be migrated among processors, allowing operations to send data to logical objects rather than physical processors.
  • Charm++ utilizes overdecomposition to divide applications into many small objects representing coarse work and/or data units, which may greatly exceed the number of processors.


31 of 79

DCEx

  • DCEx is a PGAS-based programming model for implementing data-centric, large-scale parallel applications on exascale systems.
  • It is built on data-aware basic operations for data-intensive applications, allowing for the scalable usage of a massive number of processing elements.
  • Utilizing private data structures and minimizing data exchange between concurrent threads, DCEx employs near-data synchronization to enable computation threads to operate closely with data.
  • A DCEx program is structured into data-parallel blocks, serving as memory/storage units for shared- and distributed-memory parallel computation, communication, and migration.


32 of 79

X10

  • X10 is a programming model based on APGAS (Asynchronous PGAS), which introduces places as an abstraction of a computational context.
  • Places provide a locally synchronous view of shared memory.
  • In an X10 computation, multiple places are distributed, each storing data and performing one or more activities (lightweight threads) that can be dynamically created.
  • Activities can synchronously use one or more memory regions within the place where they reside.


33 of 79

Chapel

  • Chapel is an APGAS-based programming model using high-level language abstractions for general parallel programming, providing global-view data structures and a global view of control to improve the abstraction level for both data and control flow.
  • Global-view data structures include arrays and other aggregated data with sizes and indices represented globally, even if their implementations are distributed across parallel system locales.
  • A locale in Chapel is an abstraction of a target architecture's unit of uniform memory access, ensuring that all threads within a locale have similar access times to any single memory address.
  • The global view of control means that an application starts with a single logical thread of control and introduces parallelism through specific language concepts.


34 of 79

UPC++

  • UPC++ is a C++ library designed for PGAS programming, which includes tools for describing dependencies between asynchronous computations and data transfer.
  • The library facilitates efficient one-sided communication and enables moving computation to data through remote procedure calls, facilitating the implementation of complex distributed data structures.
  • The library features three primary programming concepts:
    • Global pointers, supporting efficient data locality exploitation.
    • RPC-based asynchronous programming, allowing for efficient development of asynchronous programs.
    • Futures to manage the availability of data coming from computations.


35 of 79

Programming Tools for Big Data Applications

Programming Tools


36 of 79

Apache Hadoop

  • Apache Hadoop* is the most popular open-source framework for implementing the MapReduce programming model.
  • Hadoop is designed for developing scalable data-intensive applications in various programming languages (e.g., Java, Python) to be executed on parallel and distributed systems.
  • The programming approach in Hadoop allows abstraction from classical distributed computing issues, including data locality, workload balancing, fault tolerance, and network bandwidth saving.

* https://hadoop.apache.org/


37 of 79

Software stack


38 of 79

Execution flow


39 of 79

Workflow-based programming tools

  • Workflows are used in a wide range of application domains, including scientific simulation, data analytics, and machine learning.
  • Various parallel and distributed frameworks have been proposed that use this programming model to describe and run applications.
  • Three representative frameworks are:
    • Apache Spark for developing general-purpose applications.
    • Apache Storm for streaming data.
    • Apache Airflow for facilitating the design and execution of workflow-based applications.


40 of 79

Apache Spark

  • Apache Spark* has established itself as the most popular open source framework for big data analytics, thanks to its in-memory programming feature and higher-level modules.
  • Initially developed in 2009 by Matei Zaharia at UC Berkeley’s AMPLab, Spark joined the Apache Software Foundation in 2013.
  • Several big companies, including eBay, Amazon, and Alibaba, use Spark in production.
  • Thanks to a very large community of users and contributors, the development of Spark is constantly expanding.

* https://spark.apache.org/


41 of 79

Software stack


42 of 79

RDDs

  • A Resilient Distributed Dataset (RDD) is a distributed memory abstraction, representing items across cluster nodes for parallel manipulation.
  • RDDs are immutable and fault-tolerant. Their items are divided into partitions that are kept in the memory of the cluster nodes or, when memory is insufficient, spilled to local disk.


43 of 79

Transformations and actions

  • A Spark application is written as a sequence of RDD operations, categorized into two types:
    • Transformations:
      • Coarse-grained operations creating new RDDs from existing ones.
      • Examples include map, filter, and join operations.
    • Actions:
      • Execute computations on RDDs or write data to storage.
      • Output values, e.g., count, collect, reduce, and save.
  • At runtime, a Spark application forms a Directed Acyclic Graph (DAG) comprising data sources, transformations, and actions.
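The difference between the two kinds of operations can be sketched with a toy class. `MiniRDD` below is a hypothetical, single-process stand-in and not the Spark API: it only mimics the idea that transformations record lineage without doing any work, while actions replay that lineage and return a value.

```python
from functools import reduce as fold


class MiniRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions evaluate."""

    def __init__(self, source, lineage=()):
        self._source = source    # the original data
        self._lineage = lineage  # recorded transformations, in order

    # Transformations: return a new MiniRDD and perform no computation.
    def map(self, fn):
        return MiniRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return MiniRDD(self._source, self._lineage + (("filter", pred),))

    # Actions: replay the recorded lineage and return a value.
    def collect(self):
        items = iter(self._source)
        for kind, fn in self._lineage:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        return len(self.collect())

    def reduce(self, fn):
        return fold(fn, self.collect())


numbers = MiniRDD(range(1, 11))
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

result = even_squares.collect()                # action: triggers evaluation
total = even_squares.reduce(lambda a, b: a + b)
print(result, total)
```

Since every transformation returns a new object, `numbers` is never modified, which mirrors the immutability of RDDs.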


44 of 79

Data locality

  • Data locality: Move computation close to data on the cluster node, avoiding large data transfers. Serialized code moves faster than data, reducing network congestion and boosting system throughput.


45 of 79

Apache Storm

  • Apache Storm* is a distributed real-time computation system that allows for the processing of unbounded data streams in a reliable way.
  • Before Storm, real-time processing systems were developed using queues for writing data and workers to read and process those data:
    • Most of the application logic had to do with where to send and receive messages, how to serialize and deserialize them, and how to make sure that the queues and workers were always alive.
  • Storm proved to be extremely scalable, easy to use, and capable of processing data with low latency.

* https://storm.apache.org/


46 of 79

Data and computation abstractions

  • The programming paradigm offered by Storm is based on five abstractions for data and computation:
    1. Tuple: it is the basic unit of data that can be processed. A tuple consists of a list of fields of various types (e.g., byte, char, integer, long).
    2. Stream: it represents an unbounded sequence of tuples, which is created or processed in parallel. Streams can be created using standard serializers (e.g., integers, doubles) or with custom ones.


47 of 79

Data and computation abstractions

    3. Spout: it is the data source of a stream. Data can be read from different external sources, such as social network APIs, sensor networks, and queuing systems (e.g., Java Message Service, Kafka, Redis). Then, they are fed into the application.
    4. Bolt: it represents the processing entity. Specifically, it can execute any type of task or algorithm (e.g., data cleaning, joins, queries).
    5. Topology: it represents a job. A generic topology is configured as a DAG, where spouts and bolts represent the graph vertices and streams act as their edges. It may run forever until it is stopped.
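The five abstractions can be mirrored with Python generators. This is a single-threaded sketch and not the Storm API: the function names and sample sentences are hypothetical, the topology is a simple chain rather than a general DAG, and nothing runs in parallel.

```python
from collections import Counter
from itertools import islice


def sentence_spout():
    """Spout: an unbounded source of tuples (here, a repeating sample)."""
    samples = ["storm processes streams", "streams of tuples"]
    while True:
        for sentence in samples:
            yield (sentence,)


def split_bolt(stream):
    """Bolt: turn each sentence tuple into one tuple per word."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word,)


def count_bolt(stream):
    """Bolt: keep a running count and emit (word, count) tuples."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


# Topology: spout -> split bolt -> count bolt, connected by streams.
topology = count_bolt(split_bolt(sentence_spout()))

# The topology would run forever, so only the first six tuples are taken.
first_six = list(islice(topology, 6))
print(first_six)
```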


48 of 79

Architecture


49 of 79

Apache Airflow

  • Apache Airflow* is an open-source platform designed to develop, schedule, and monitor workflows. It has been a top-level project of the Apache Software Foundation since 2019.
  • It can be used to create data processing applications as DAGs of tasks.
  • DAGs are defined as Python code, which gives the possibility to:
    • store workflows in version control and to roll back to previous versions
    • let multiple people develop workflows simultaneously
    • write tests to validate functionalities

* https://airflow.apache.org/


50 of 79

Apache Airflow

  • Airflow presents a high level of abstraction as programmers can easily build workflows by combining a set of tasks and by specifying dependencies among them.
  • The Airflow scheduler executes the tasks on an array of workers taking into account the dependencies specified by the DAG.
  • The runtime supports both:
    • data parallelism, when many tasks execute the same code in parallel on different data chunks, and
    • task parallelism, when different tasks (or operators) run in parallel.
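The following sketch shows, under simplifying assumptions, how a scheduler can dispatch ready tasks to a pool of workers while respecting DAG dependencies. It is not Airflow's scheduler: `run_dag`, the task names, and the data chunks are hypothetical, and threads stand in for workers.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter


def run_dag(dependencies, actions, max_workers=4):
    """Run each task once all of its upstream tasks have finished."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            ready = sorter.get_ready()  # tasks whose dependencies are done
            futures = {t: pool.submit(actions[t], results) for t in ready}
            finished = {t: f.result() for t, f in futures.items()}
            results.update(finished)
            sorter.done(*finished)
    return results


chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

dependencies = {
    "sum_0": set(), "sum_1": set(), "sum_2": set(),
    "total": {"sum_0", "sum_1", "sum_2"},
    "largest": {"sum_0", "sum_1", "sum_2"},
}

actions = {
    # Data parallelism: the same code applied to different chunks.
    "sum_0": lambda done: sum(chunks[0]),
    "sum_1": lambda done: sum(chunks[1]),
    "sum_2": lambda done: sum(chunks[2]),
    # Task parallelism: two different tasks that can run side by side.
    "total": lambda done: done["sum_0"] + done["sum_1"] + done["sum_2"],
    "largest": lambda done: max(done["sum_0"], done["sum_1"], done["sum_2"]),
}

results = run_dag(dependencies, actions)
print(results["total"], results["largest"])
```

The three `sum_*` tasks run in the first wave, and `total` and `largest` run together in the second wave once their inputs are available.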


51 of 79

Architecture


52 of 79

BSP-based programming tools

  • Traditional big data processing frameworks (e.g., Hadoop, Spark) are not the best option when dealing with graphs:
    • They do not consider the graph structure underlying the data.
    • Computation can lead to excessive data movement and performance degradation.
  • Ad hoc solutions, specifically designed for efficient graph-parallel computation, are thus needed.
  • Several frameworks, based on the Bulk Synchronous Parallel (BSP) model, are used for processing large datasets, with a particular focus on efficient graph computation.
  • Some of them are implementations of the Google Pregel model, a BSP messaging abstraction aimed at expressing graph-parallel iterative algorithms.
  • Examples are Apache Hama, Giraph, Flink’s Gelly API, and the GraphX library provided by Spark.
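As an illustration of the vertex-centric BSP style, the sketch below runs a Pregel-like computation in plain Python. It is a single-process simulation, not the API of any of the frameworks listed above; the function name and the sample graph are hypothetical. Each iteration of the loop is one superstep: vertices read the messages sent in the previous superstep, update their state, and send new messages that are delivered only after the barrier.

```python
def pregel_min_label(neighbors):
    """Label every vertex with the smallest vertex id in its component.

    `neighbors` maps each vertex id to the list of adjacent vertex ids
    (undirected graph: every edge appears in both directions).
    """
    labels = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v, adjacent in neighbors.items():
        for n in adjacent:
            inbox[n].append(labels[v])

    supersteps = 1
    while any(inbox.values()):  # stop when no messages are in flight
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            if messages and min(messages) < labels[v]:
                labels[v] = min(messages)
                for n in neighbors[v]:
                    outbox[n].append(labels[v])
        inbox = outbox  # synchronization barrier between supersteps
        supersteps += 1
    return labels, supersteps


graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels, supersteps = pregel_min_label(graph)
print(labels)
```

Vertices that receive no improving message stay silent, so the computation halts when every vertex has converged.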

BSP-based Programming Tools: Spark GraphX


53 of 79

Spark GraphX

  • Apache Spark GraphX* is a graph processing library for the scalable processing of large-scale graph data structures, with several key features:
    • Resilient Distributed Graphs (RDG): GraphX extends Spark’s RDDs with a graph abstraction, the RDG, designed to efficiently partition graph data across a cluster of machines.
    • Graph Algorithms: GraphX includes a collection of built-in graph algorithms, such as PageRank, connected components, and triangle counting.
    • Graph Operators: GraphX provides a set of operators that can be used to perform map operations, create subgraphs, reverse edge direction, or compute a masked version of a graph.
    • Integration with Spark: GraphX integrates seamlessly with other Spark components, allowing users to combine graph processing with data processing and machine learning within the same Spark application.

* https://spark.apache.org/graphx/


54 of 79

Graph abstraction

  • GraphX extends the Spark RDD by introducing the Resilient Distributed Graph, a new graph abstraction that consists of a directed multigraph with properties attached to each vertex and edge.

  • This abstraction provides a unified interface to represent data considering the underlying graph structure, while maintaining the efficiency of Spark RDDs. Indeed, it allows users to leverage:
    • Graph concepts and efficient primitives for graph computation.
    • Distributed data-parallel operations typical of Spark.
  • A Graph contains two distinct RDDs, one for edges and one for vertices.

Source: GraphX programming guide
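A rough, single-machine analogue of this abstraction can be sketched in Python. GraphX itself is a Scala library operating on distributed RDDs; the class and method names below are hypothetical and only mirror the idea of a directed multigraph that keeps separate vertex and edge collections and exposes graph operators.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    src: int
    dst: int
    attr: str


@dataclass
class PropertyGraph:
    vertices: dict  # vertex id -> vertex property
    edges: list     # Edge objects; parallel edges are allowed (multigraph)

    def map_vertices(self, fn):
        """Transform every vertex property, keeping the structure."""
        mapped = {vid: fn(vid, attr) for vid, attr in self.vertices.items()}
        return PropertyGraph(mapped, list(self.edges))

    def reverse(self):
        """Return a graph with every edge direction reversed."""
        flipped = [Edge(e.dst, e.src, e.attr) for e in self.edges]
        return PropertyGraph(dict(self.vertices), flipped)

    def subgraph(self, vertex_pred):
        """Keep the vertices that satisfy the predicate and the edges between them."""
        kept = {vid: a for vid, a in self.vertices.items() if vertex_pred(vid, a)}
        edges = [e for e in self.edges if e.src in kept and e.dst in kept]
        return PropertyGraph(kept, edges)


g = PropertyGraph(
    vertices={1: "alice", 2: "bob", 3: "carol"},
    edges=[Edge(1, 2, "follows"), Edge(2, 3, "follows"), Edge(1, 2, "likes")],
)

upper = g.map_vertices(lambda vid, name: name.upper())
reversed_g = g.reverse()
without_carol = g.subgraph(lambda vid, name: vid != 3)
print(len(without_carol.edges))
```

Every operator returns a new graph instead of modifying the original, which mirrors the immutability of Spark RDDs.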


55 of 79

SQL-like programming tools

  • SQL-like systems attempt to combine the effectiveness of Hadoop with the ease of use of SQL-like languages, to enable the development of simple and efficient data analysis applications.
  • Apache Hive, a data warehouse software built on top of Hadoop for reading, writing, and managing data in large-scale infrastructures, is one of the most used systems in this context.
  • Apache Pig is another Hadoop-based framework that exploits an SQL-like language for executing data flow applications in large-scale infrastructures.

SQL-like Programming Tools


56 of 79

Apache Hive

  • Apache Hive* is a Hadoop-based data warehouse system that enables users to write queries in an SQL-like declarative language called HiveQL; these queries are then compiled into MapReduce jobs and run on Hadoop.
  • Hive can be seen as an SQL engine capable of automatically compiling an SQL-like query into a set of MapReduce jobs that are executed on a Hadoop cluster, with additional features for data and metadata management.
  • Hive was developed because MapReduce, although a very flexible paradigm, is too low level for routine data analysis tasks.

* https://hive.apache.org/
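To give an intuition of what compiling a query into MapReduce means, the sketch below evaluates a simple aggregation as explicit map, shuffle, and reduce phases in plain Python. The table, its columns, and the query in the comment are hypothetical, and the code is a conceptual illustration rather than the plan Hive actually generates.

```python
from collections import defaultdict

# Conceptual query being evaluated:
#   SELECT dept, SUM(salary) FROM employees GROUP BY dept
employees = [
    {"dept": "sales", "salary": 100},
    {"dept": "it", "salary": 150},
    {"dept": "sales", "salary": 200},
]


def map_phase(records):
    """Map: emit a (grouping key, value) pair for every input row."""
    for row in records:
        yield row["dept"], row["salary"]


def shuffle_phase(pairs):
    """Shuffle: bring together all the values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: apply the aggregate function to each group."""
    return {key: sum(values) for key, values in groups.items()}


totals = reduce_phase(shuffle_phase(map_phase(employees)))
print(totals)
```

The user writes only the one-line query; the three phases are what a low-level MapReduce programmer would otherwise have to code by hand.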

SQL-like Programming Tools: Apache Hive


57 of 79

Main concepts

  • Hive is specifically designed for online analytical processing (OLAP) rather than online transaction processing (OLTP).
  • Unlike SQL Server, Hive does not provide real-time access to data.
  • Hive provides three different types of functions for data manipulation:
    • user-defined functions (UDFs)
    • user-defined aggregate functions (UDAFs)
    • user-defined table-generating functions (UDTFs)
  • These function types make it easy to extend Hive with custom logic written in different languages, such as Java or Python.
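The difference among the three function types lies in how many rows go in and how many come out. The plain-Python sketch below illustrates only these shapes; the function names are hypothetical and this is not the interface Hive requires for registering custom functions.

```python
def to_upper(value):
    """UDF shape: one row in, one value out."""
    return value.upper()


def total_length(values):
    """UDAF shape: many rows in, one aggregated value out."""
    return sum(len(v) for v in values)


def explode_words(value):
    """UDTF shape: one row in, zero or more rows out."""
    return value.split()


column = ["big data", "hive"]

udf_result = [to_upper(v) for v in column]
udaf_result = total_length(column)
udtf_result = [word for v in column for word in explode_words(v)]
print(udf_result, udaf_result, udtf_result)
```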


58 of 79

Architecture


59 of 79

Apache Pig

  • Apache Pig* is a high-level dataflow framework for executing MapReduce programs on Hadoop by using an SQL-like language.
  • Pig was proposed to bridge the gap between the high-level declarative querying of SQL and the low-level procedural style of the MapReduce programming model.
  • Queries are written using a custom language, called Pig Latin, and are then converted into execution plans that are performed as MapReduce jobs on Hadoop.

* https://pig.apache.org/

SQL-like Programming Tools: Apache Pig


60 of 79

Main concepts

  • Data model: Pig provides a nested data model, which allows for handling complex and non-normalized data. It supports scalar types, such as int, long, double, chararray (i.e., string), and bytearray types.
  • Moreover, it provides three complex data types:
    • map: an associative array, where a string is the key and the value can be any type.
    • tuple: an ordered list of data elements, also called fields, where each field is a piece of data. The elements of a tuple can be of any type, allowing nested complex types.
    • bag: a collection of tuples, similar to a table in a relational database. Tuples in a bag correspond to the rows of a table although, unlike a relational table, Pig bags do not require that each tuple contain the same number of fields. A bag is also referred to as a relation.
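The nested data model can be mimicked with built-in Python types, as in the sketch below. This is an analogy rather than Pig Latin: the sample data and the `group_by_field` helper are hypothetical, and the helper only imitates the way grouping a relation yields tuples that contain a key and a nested bag.

```python
# tuple: an ordered list of fields, which may themselves be complex types
student = ("alice", 21, ("naples", "italy"))

# bag: a collection of tuples that need not have the same number of fields
students = [("alice", 21), ("bob", 23, "erasmus"), ("carol", 21)]

# map: string keys associated with values of any type
profile = {"name": "alice", "courses": [("hpc",), ("ml",)]}


def group_by_field(bag, index):
    """Group a bag on one field, returning (key, nested bag) tuples."""
    groups = {}
    for item in bag:
        groups.setdefault(item[index], []).append(item)
    return [(key, members) for key, members in groups.items()]


by_age = group_by_field(students, 1)
print(by_age)
```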


61 of 79

Architecture


62 of 79

Choosing the Right Framework to Tame Big Data

Choosing the Right Framework


63 of 79

Comparative analysis of the system features

  • Several programming tools for big data analytics are available today, designed to meet diverse needs.
  • Choosing the right tool can be challenging and requires evaluating various aspects, such as:
    • the application class (batch, streaming, data querying, and graph-based)
    • the available budget
    • the type of parallelism
    • the abstraction level
    • and the code verbosity
  • Other factors to take into account include performance, scalability, usability, and suitability.


64 of 79

Summary of system features

  • This table presents a summary of system features, categorizing them based on their programming model, type of parallelism, level of abstraction, code verbosity, and main classes of applications.


65 of 79

Main factors

  • The main factors to be considered when selecting the appropriate framework for implementing a big data application include:
    • Input data: It mainly refers to data volume (both in terms of size and dimensionality of the input dataset), velocity, and variety.
    • Application class: It refers to the type of application that must be implemented (e.g., batch, stream, graph-based, and query-based applications).
    • Infrastructure: It refers to the storage and computing infrastructure that will be used to run the big data application (e.g., on-premise and cloud-based infrastructures).


66 of 79

Input data: Volume

  • Volume of data impacts the storage requirements of applications, as storing large amounts of data requires distributed storage solutions that provide data replication, fault tolerance, and scalability, such as Hadoop Distributed File System (HDFS).
  • Data volume also affects the processing requirements, as processing large volumes of data requires distributed computing systems that can scale horizontally, such as Hadoop, which is commonly used for parallel processing of big data, and Spark, which offers in-memory processing and is suitable for iterative algorithms and interactive data analysis.
  • High-dimensional data may require the use of dimensionality reduction techniques, such as principal component analysis (PCA) or singular value decomposition (SVD); a framework that supports these techniques is Spark through the MLlib library, which can make it easier to analyze and derive insights from high-dimensional data.
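To show what dimensionality reduction does, the sketch below computes the first principal component of a tiny dataset with power iteration and projects the points onto it. It is a pure-Python illustration, not the MLlib API: the function names and sample points are hypothetical, and it assumes the data have non-zero variance.

```python
import math


def first_principal_component(points, iterations=50):
    """Estimate the direction of maximum variance by power iteration."""
    n, dims = len(points), len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    centered = [[p[d] - means[d] for d in range(dims)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
        for i in range(dims)
    ]
    vec = [1.0] * dims
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
        norm = math.sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    return means, vec


def project(points, means, component):
    """Reduce each point to its coordinate along the component."""
    return [
        sum((p[d] - means[d]) * component[d] for d in range(len(component)))
        for p in points
    ]


points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # lie on y = 2x
means, component = first_principal_component(points)
projected = project(points, means, component)
print(component, projected)
```

Because the sample points lie on a line, a single coordinate per point preserves all of their variance; on real data the same projection keeps only the dominant direction.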


67 of 79

Input data: Velocity

  • Velocity, i.e., the speed at which data are generated, is another important characteristic of big data that demands programming models and systems that can capture, process, and analyze data in real time to enable timely decision-making.
  • The speed of the data affects the processing and analysis capabilities of the application, since techniques such as windowing and time-based aggregation are required to capture relevant information from a data stream generated at high velocity.
  • Furthermore, processing data in real time necessitates low-latency processing capabilities, such as in-memory processing, to enable the responsiveness of a big data application.
  • While micro-batch streaming systems such as Spark Streaming can be used in some scenarios, stream processing frameworks such as Storm are typically adopted to process data streams in real time, allowing applications to respond to events as they occur.
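Windowing and time-based aggregation can be sketched in a few lines of Python. The example below groups events into fixed, non-overlapping (tumbling) windows and sums the values in each; the function name, the event format, and the sample data are hypothetical, and a real stream processor would emit each window incrementally rather than after consuming the whole input.

```python
from collections import defaultdict


def tumbling_window_sum(events, window_seconds):
    """Sum event values per fixed-size time window.

    `events` is an iterable of (timestamp_in_seconds, value) pairs.
    The result maps each window start time to the sum of its values.
    """
    windows = defaultdict(float)
    for timestamp, value in events:
        window_start = (timestamp // window_seconds) * window_seconds
        windows[window_start] += value
    return dict(sorted(windows.items()))


events = [(0, 1.0), (3, 2.0), (10, 5.0), (12, 1.5), (25, 4.0)]
per_window = tumbling_window_sum(events, window_seconds=10)
print(per_window)
```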


68 of 79

Input data: Variety

  • Variety, i.e., the heterogeneity of data types, formats, and sources, requires handling data integration, transformation, and analysis in a flexible and adaptable manner.
  • Data from multiple sources and formats can be preprocessed before analysis using systems that support extract, transform, and load (ETL) operations, such as Hive and Pig.
  • Structured data can be analyzed using traditional relational database techniques, such as those provided by Hive and Pig.
  • Unstructured data (e.g., text) require special-purpose techniques (e.g., natural language processing (NLP) for textual data), which are provided by a few systems, including Spark.
  • Spark is the most versatile framework for processing heterogeneous data as it provides APIs for batch processing, stream processing, machine learning, graph processing, and DataFrames to work with different data types, such as CSV, JSON, and database tables.
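A small ETL step that reconciles heterogeneous inputs can be sketched with the Python standard library. The two sources, their field names, and the common record layout are hypothetical; the sketch only shows the extract and transform stages that bring CSV and JSON data to a single schema before analysis.

```python
import csv
import io
import json

# Two hypothetical sources describing the same kind of measurement.
csv_source = "id,temperature\n1,21.5\n2,19.0\n"
json_source = '[{"sensor": 3, "temp_c": 22.1}]'


def from_csv(text):
    """Extract rows from CSV text and map them to the common schema."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"id": int(row["id"]), "temperature": float(row["temperature"])}


def from_json(text):
    """Extract items from JSON text and map them to the common schema."""
    for item in json.loads(text):
        yield {"id": int(item["sensor"]), "temperature": float(item["temp_c"])}


records = list(from_csv(csv_source)) + list(from_json(json_source))
average = sum(r["temperature"] for r in records) / len(records)
print(len(records), round(average, 2))
```

Once the records share one layout, the analysis code no longer needs to know which format each record came from.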


69 of 79

Application class: Batch

  • Batch applications involve processing large amounts of data that are collected and analyzed together, typically during off-peak hours when the processing demand on the system is low.
  • They are particularly useful for analyzing historical data, generating reports, and performing complex analytics that require significant processing power and resources.
  • Spark and Hadoop are both widely used for batch processing due to their distributed storage and processing capabilities:
    • Hadoop includes fault-tolerant storage and a framework for distributed processing.
    • Spark offers a fast and flexible data processing engine with high-level APIs, machine learning, and more.
  • Airflow can be used for developing and monitoring batch-oriented, workflow-based applications.
    • It can be used to orchestrate and automate batch processing workflows, allowing end users to focus on generating meaningful insights from their data rather than managing the processing infrastructure.


70 of 79

Application class: Stream

  • Stream applications are designed to process and analyze data as they are received, without the need to store them in centralized repositories.
  • They are particularly useful in application domains where real-time data analysis is required, such as finance, telecommunications, and transportation.
  • Examples of stream big data applications include real-time data analytics, fraud detection, and real-time monitoring of sensors and IoT devices.
  • Storm and Spark are both used for stream processing of data:
    • Storm offers low-latency, scalable, and fault-tolerant real-time processing.
    • Spark provides micro-batch stream processing through specialized APIs.
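The difference between the two styles can be sketched as follows. In record-at-a-time processing each event is handled as soon as it arrives, whereas in micro-batch processing events are buffered and handled in small groups. The helper below is hypothetical and groups events by count for simplicity, while micro-batch engines typically cut batches by time interval.

```python
def micro_batches(stream, batch_size):
    """Group an incoming stream into small batches of at most batch_size events."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the last, possibly incomplete, batch
        yield batch


events = range(5)

# Record-at-a-time: one processing step per event.
per_event = [[event] for event in events]

# Micro-batch: one processing step per group of events.
per_batch = list(micro_batches(events, batch_size=2))
print(per_event, per_batch)
```

Batching reduces the per-event overhead at the price of added latency, since an event waits until its batch is complete before being processed.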


71 of 79

Application class: Graph-based

  • Graph-based applications are designed to process and analyze data that are interconnected in complex networks or graph structures.
  • This involves analyzing relationships between data nodes in a graph to uncover patterns that may not be apparent when using traditional analysis methods.
  • Examples of graph-based big data applications include social network analysis, recommendation engines, and fraud detection.
  • MPI and Spark are both suitable for graph processing of big data:
    • MPI offers low-level control over parallelism and communication.
    • Spark provides specialized high-level APIs (i.e., GraphX) for efficient and scalable graph processing.


72 of 79

Application class: Query-based

  • Query-based applications are designed to provide fast and efficient access to large volumes of data through query languages and search tools.
  • This involves storing data in a distributed system and using query languages, such as SQL, to retrieve data from the system.
  • Examples of query-based big data applications include business intelligence, data exploration, and ad hoc data analysis.
  • Hive, Pig, and Spark are suitable for query processing of large datasets.
    • Hive provides an SQL-like interface.
    • Pig provides a simple scripting language.
    • Spark SQL allows querying and analyzing data using an SQL syntax.


73 of 79

Infrastructure: On-premise

  • On-premise infrastructure refers to the deployment of hardware and software within an organization’s premises, which does not require transferring large amounts of data to a remote location.
  • In this scenario, data are processed and stored in a proprietary data center, allowing for higher security and easier compliance with stringent accessibility and privacy regulations.
  • On-premise infrastructures, especially for organizations with a limited IT budget, are often made up of interconnected machines equipped with commodity hardware:
    • Hadoop can be effectively used to process large datasets at a lower cost on heterogeneous commodity hardware, relying on any disk storage type for data processing.
    • HDFS is capable of distributing data on different machines running different operating systems without requiring special drivers.
  • For IT companies with larger budgets, Spark is an effective solution for quick in-memory processing of large amounts of data.
    • However, it operates at a higher cost because it requires large amounts of RAM to spin up nodes.


74 of 79

Infrastructure: Cloud-based

  • Cloud-based infrastructure refers to the use of cloud resources to store and process data.
  • Cloud services are usually adopted for their scalability and flexibility, which allow users to add and remove resources based on application needs.
  • They include specific services for big data processing, such as Amazon EMR, Azure HDInsight, and Google Cloud Dataproc, which offer fully managed, cloud-optimized deployments of big data frameworks such as Hadoop and Spark.
  • The use of cloud infrastructures poses many privacy and data management issues, including those relating to security, regulatory compliance, jurisdictional constraints, and data access control.
    • To avoid legal issues, it is critical that a public cloud infrastructure meet relevant data regulations, such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA).


75 of 79

Bibliography (1/4)

  • Bauer, M., Treichler, S., Slaughter, E., and Aiken, A. (2012). Legion: Expressing locality and independence with logical regions, in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12.
  • Belcastro, L., Marozzo, F., and Talia, D. (2019). Programming models and systems for big data analysis, International Journal of Parallel, Emergent and Distributed Systems 34, 632–652.
  • Byun, C. et al. (2022). pPython for parallel python programming, in 2022 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6.
  • Cattell, R. (2011). Scalable SQL and NoSQL data stores, ACM SIGMOD Record 39(4), 12–27.
  • Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., Von Praun, C., and Sarkar, V. (2005). X10: An object-oriented approach to non-uniform cluster computing, ACM SIGPLAN Notices 40(10), 519–538.
  • Da Costa et al. (2015). Exascale machines require new programming paradigms and runtimes, Supercomputing Frontiers and Innovations: International Journal 2(2), 6–27.
  • DeWael, M., Marr, S., De Fraine, B., Van Cutsem, T., and De Meuter,W. (2015). Partitioned global address space languages, ACM Computing Surveys 47(4), 1–27.
  • Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters, OSDI’04

75

D. Talia, P. Trunfio

Programming Tools for High-Performance Data Analysis

Tutorial @ HPC Summer School 2025

June 18th, 2025, Naples, Italy

76 of 79

Bibliography (2/4)

  • Deitz, S. J., Chamberlain, B. L., and Hribar, M. B. (2006). Chapel: Cascade high-productivity language an overview of the chapel parallel programming model, Cray User Group.
  • Flynn, M. J. (1972). Some computer organizations and their effectiveness, IEEE Transactions on Computers 100(9), 948–960.
  • Fuerlinger, K., Fuchs, T., and Kowalewski, R. (2016). DASH: A C++ PGAS library for distributed data structures and parallel algorithms, in 2016 IEEE 18th International Conference on High Performance Computing and Communications), pp. 983–990.
  • Gates, A. F., Natkovich, O., Chopra, S., Kamath, P., Narayanamurthy, S. M., Olston, C., Reed, B., Srinivasan, S., and Srivastava, U. (2009). Building a high-level dataflow system on top of Map-Reduce: The Pig experience, Proceedings of the VLDB Endowment 2(2), 1414–1425.
  • Gropp, W. and Snir, M. (2013). Programming for exascale computers, Computing in Science & Engineering 15(6), 27–35.
  • Huai, Y., Chauhan, A., Gates, A., Hagleitner, G., Hanson, E. N., O’Malley, O., Pandey, J., Yuan, Y., Lee, R., and Zhang, X. (2014). Major technical advancements in Apache Hive, in Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, pp. 1235–1246.
  • Kalé, L. and Krishnan, S. (1993). CHARM++: A portable concurrent object oriented system based on C++, in A. Paepcke (ed.), Proceedings of OOPSLA’93 (ACM Press), pp. 91–108.

76

D. Talia, P. Trunfio

Programming Tools for High-Performance Data Analysis

Tutorial @ HPC Summer School 2025

June 18th, 2025, Naples, Italy

77 of 79

Bibliography (3/4)

  • Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, I., Leiser, N., and Czajkowski, G. (2010). Pregel: A system for large-scale graph processing, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (ACM), pp. 135–146.
  • Marozzo, F., Talia, D., and Trunfio, P. (2012). P2P-MapReduce: Parallel data processing in dynamic cloud environments, Journal of Computer and System Sciences 78(5), 1382–1402.
  • Olston, C., Reed, B., Silberstein, A., and Srivastava, U. (2008). Automatic optimization of parallel dataflow programs. 2008 USENIX Annual Technical Conference.
  • Salloum, S., Dautov, R., Chen, X., Peng, P. X., and Huang, J. Z. (2016). Big data analytics on Apache Spark, International Journal of Data Science and Analytics 1(3), 145–164.
  • Sarkar, A., Ghosh, A., and Nath, D. A. (2015). MapReduce: A comprehensive study on applications, scope and challenges, Department of Computer Science, International Journal of Advance Research in Computer Science and Management Studies 3(7).
  • Skillicorn, D. B. and Talia, D. (1998). Models and languages for parallel computation, ACM Computing Surveys (CSUR) 30(2), 123–169.
  • Talia, D. (2013). Workflow systems for science: Concepts and tools, International Scholarly Research Notices 2013.
  • Talia, D. (2019). A view of programming scalable data analysis: From clouds to exascale, Journal of Cloud Computing 8(1), 1–16.

77

D. Talia, P. Trunfio

Programming Tools for High-Performance Data Analysis

Tutorial @ HPC Summer School 2025

June 18th, 2025, Naples, Italy

78 of 79

Bibliography (4/4)

  • Talia, D. and Trunfio, P. (2012). Service-Oriented Distributed Knowledge Discovery (Chapman and Hall/CRC).
  • Talia, D. et al. (2019). A novel data-centric programming model for large-scale parallel systems, in European Conference on Parallel Processing (Springer), pp. 452–463.
  • UPC Consortium (2005). UPC language specifications, v1. 2, Lawrence Berkeley National Lab tech report lbnl-59208, Technical Report, Berkeley, CA, USA.
  • Valiant, L. G. (1990). A bridging model for parallel computation, Communications of the ACM 33(8), 103–111.
  • Van der Aalst, W. M. P., ter Hofstede, A. H. M., Kiepuszewski, B., and Barros, A. P. (2003). Workflow patterns, Distributed and Parallel Databases 14(1), 5–51.
  • Verma, A., Mansuri, A. H., and Jain, N. (2016). Big data management processing with Hadoop MapReduce and Spark technology: A comparison, in 2016 Symposium on Colossal Data Analysis and Networking, pp. 1–4.
  • Wadkar, S., Siddalingaiah, M., and Venner, J. (2014). Pro Apache Hadoop (Apress).
  • Wu, D., Sakr, S., and Zhu, L. (2017). Big data programming models, in Handbook of Big Data Technologies (Springer), pp. 31–63.
  • Zheng, Y., Kamil, A., Driscoll, M. B., Shan, H., and Yelick, K. (2014). UPC++: A PGAS extension for C++, in 2014 IEEE 28th International Parallel and Distributed Processing Symposium, pp. 1105–1114.

78

D. Talia, P. Trunfio

Programming Tools for High-Performance Data Analysis

Tutorial @ HPC Summer School 2025

June 18th, 2025, Naples, Italy

79 of 79

Additional resources

  • 400+ slides (pptx and pdf) are available to the readers of our book:

  • An online repository including all the code and datasets used in the examples presented in the book is available at:

https://bigdataprogramming.github.io

  • The repository provides Docker containers and step-by-step instructions for the seamless execution of the examples.
