1 of 18

WP3 - Data Plane: Extreme Data Connectors

Leader: Dell Technologies (DELL)

EXTREME NEAR-DATA PROCESSING PLATFORM

2 of 18

Table of contents

1. XtremeHub: NEARDATA’s Data Plane
2. XtremeHub Compute: Lithops
3. XtremeHub Streams: Pravega
4. XtremeHub Security: Scone
5. XtremeHub Connectors
6. Wrap Up

3 of 18

WP3 Tasks

WP3: Development of serverless, stream, and HPC data connectors and compute/storage infrastructure for extreme data analytics.

  • T3.1: Extreme Data Connectors and Data Catalog
  • T3.2: Stream Storage Data connectors
  • T3.3: Video Stream Data connectors
  • T3.4: High-Performance Computing Data connectors
  • T3.5: Data Analytics Interconnection Layer

4 of 18

XtremeHub: NEARDATA’s Data Plane

  • Goal: To facilitate ingestion, management, and processing of massive unstructured data.
  • Key concepts in XtremeHub:
    • Extreme Data Types: Virtual objects that encapsulate unstructured data and specific metadata.
    • Data Connectors: Specialized components that manage extreme data types to facilitate data processing.
  • XtremeHub main components:
    • XtremeHub Compute: Lithops
    • XtremeHub Streams: Pravega
    • XtremeHub Security: Scone
    • XtremeHub Connectors

  • KPIs: Throughput/latency, data reduction, resource auto-scaling, security, and productivity.
  • Use-cases: Metabolomics, Surgery, Genomics, Federated learning.
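The "extreme data type" concept above can be pictured as a lightweight virtual object pairing a reference to unstructured data with the metadata a connector needs. A minimal illustrative sketch (all names here are hypothetical, not XtremeHub's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class ExtremeDataType:
    """Virtual object: a pointer to unstructured data plus connector metadata."""
    uri: str                                      # e.g. "s3://bucket/sample.fastq.gz"
    format: str                                   # logical format a connector understands
    metadata: dict = field(default_factory=dict)  # e.g. a byte-range index

    def describe(self) -> str:
        return f"{self.format} object at {self.uri} ({len(self.metadata)} metadata keys)"

# A connector would populate the metadata once, enabling near-data processing later.
obj = ExtremeDataType(uri="s3://bucket/sample.fastq.gz", format="FASTQ-GZip")
obj.metadata["index_offsets"] = [0, 1048576, 2097152]
```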

5 of 18

XtremeHub Compute: KPIs & Use-Cases

  • Experiment: Evaluate the performance of compute/storage backends.
    • Goal: Compare the performance of IBM and AWS clouds.
    • Setup: 1000 functions were invoked with a memory size of 1024MB.
    • Workload: FLOP benchmark (matrix multiplications).

  • KPI-1 (Throughput Improvements): Lithops can deliver multi-GFLOP compute performance (5-45 GFLOPS) as well as high I/O throughput (25-100 GB/s), depending on the cloud provider (IBM, AWS).
  • KPI-3 (Resource Auto-scaling): Lithops can seamlessly parallelize a single function from 1 to 1000s of lambdas depending on the workload demands.
  • Use-cases: Metabolomics, Genomics.
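The fan-out pattern behind this experiment can be sketched with Lithops' `FunctionExecutor`; a minimal sketch, assuming a configured Lithops backend (the pure-Python worker below is illustrative — the real benchmark uses optimized matrix kernels):

```python
def matmul_benchmark(n):
    """Multiply two n x n random matrices in pure Python; return approx. FLOPs done."""
    import random
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [[random.random() for _ in range(n)] for _ in range(n)]
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return 2 * n ** 3  # one multiply + one add per inner iteration

def run_benchmark(num_tasks=1000, n=512):
    """Fan the workload out to num_tasks serverless functions with 1024 MB each."""
    import lithops  # requires a configured compute/storage backend (e.g. AWS)
    fexec = lithops.FunctionExecutor(runtime_memory=1024)
    fexec.map(matmul_benchmark, [n] * num_tasks)
    return sum(fexec.get_result())  # total FLOPs executed across all lambdas
```

`run_benchmark` mirrors the experiment setup (1000 invocations, 1024 MB memory); calling it requires cloud credentials, so it is defined but not invoked here.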

Finol, Gerard, Gerard París, Pedro García-López, and Marc Sánchez-Artigas. "Exploiting inherent elasticity of serverless in algorithms with unbalanced and irregular workloads." Journal of Parallel and Distributed Computing 190 (2024): 104891.
Arjona, Aitor, Arnau Gabriel-Atienza, Sara Lanuza-Orna, Xavier Roca-Canals, Ayman Bourramouss, Tyler K. Chafin, Lucio Marcello, Paolo Ribeca, and Pedro García-López. "Scaling a Variant Calling Genomics Pipeline with FaaS." In Proceedings of the 9th International Workshop on Serverless Computing, pp. 59-64. 2023.

6 of 18

XtremeHub Streams: KPIs & Use-Cases

  • Experiment: Evaluating frame ingestion and batching strategies for clients.
    • Goal: Latency vs throughput behavior of GStreamer and Pravega/Kafka clients.
    • Setup: AWS deployment (3VMs with local NVMe drives for journaling and EBS).
    • Workload: GStreamer videotestsrc and Open-Messaging Benchmark workload.

  • KPI-2 (Data Speed Improvements): Pravega's Byte API achieves low end-to-end latency (p95 < 10 ms) for GStreamer byte-buffer streaming, well below AI inference latency, especially with local storage and network co-location.

  • KPI-5 (Simplicity and Productivity): Dynamic event batching in Pravega allows writers to achieve a good balance between latency and throughput compared to Kafka, while guaranteeing data durability.
  • Use-cases:
    • Surgery: Low latency ingestion of video frames and efficient sensor data ingestion without client configuration.
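The latency/throughput trade-off that dynamic batching navigates can be illustrated with a toy writer that flushes either when a batch fills or when a deadline expires (an illustrative model, not Pravega's actual batching algorithm):

```python
import time

class DynamicBatcher:
    """Flush when the batch reaches max_bytes or max_wait_s elapses, whichever first."""
    def __init__(self, flush, max_bytes=1024, max_wait_s=0.010):
        self.flush = flush            # callback receiving a list of events
        self.max_bytes = max_bytes    # larger batches -> higher throughput
        self.max_wait_s = max_wait_s  # shorter deadline -> bounded latency
        self.batch, self.size, self.deadline = [], 0, None

    def write(self, event: bytes):
        if not self.batch:
            self.deadline = time.monotonic() + self.max_wait_s
        self.batch.append(event)
        self.size += len(event)
        if self.size >= self.max_bytes or time.monotonic() >= self.deadline:
            self._flush()

    def _flush(self):
        if self.batch:
            self.flush(self.batch)   # durability ack would happen here
            self.batch, self.size, self.deadline = [], 0, None

batches = []
w = DynamicBatcher(batches.append, max_bytes=10, max_wait_s=1.0)
for _ in range(4):
    w.write(b"abc")   # 3 bytes each; the size threshold triggers on the 4th write
```

Tuning `max_bytes` up favors throughput; tuning `max_wait_s` down bounds per-event latency — the balance a dynamic policy adjusts automatically.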

Raúl Gracia-Tinedo, Flavio Junqueira, Tom Kaitchuck, and Sachin Joshi. "Pravega: A Tiered Storage System for Data Streams." In ACM/IFIP Middleware’23, pp. 165-177. 2023.
Raúl Gracia-Tinedo, Flavio Junqueira, and Tom Kaitchuck. "'Back to the Byte': Towards Byte-oriented Semantics for Streaming Storage." In ACM/IFIP Middleware’24 (Industry Track), to appear.

7 of 18

XtremeHub Streams: KPIs & Use-Cases

  • Experiment: Evaluate writer/segment parallelism.
    • Goal: Understand impact of parallelism on write performance.
    • Setup: AWS deployment (3VMs with local NVMe drives for journaling).
    • Workload: Open-Messaging Benchmark workload.

  • KPI-1 (Throughput Improvements): Pravega is the only system tested that achieves consistent throughput (e.g., 250 MB/s) across the numbers of producers and segments evaluated, while guaranteeing event ordering and data durability.
  • Use-cases:
    • Surgery: Ingestion of many parallel video and sensor streams.
    • Genomics: Ingestion of parallel genomic files (e.g., FASTQ) as data streams.

Raúl Gracia-Tinedo, Flavio Junqueira, Tom Kaitchuck, and Sachin Joshi. "Pravega: A Tiered Storage System for Data Streams." In ACM/IFIP Middleware’23, pp. 165-177. 2023.

8 of 18

XtremeHub Streams: KPIs & Use-Cases

  • Experiment: Batch reads of stream data.
    • Goal: Understand performance of historical read on Pulsar/Pravega.
    • Setup: AWS deployment (3VMs with local NVMe drives for journaling).
    • Workload: Open-Messaging Benchmark workload.

  • KPI-1 (Throughput Improvements): Pravega achieves much higher historical read throughput (e.g., 7x at peak) compared to Pulsar without user intervention or complex configuration settings.
  • Use-cases:
    • Surgery: High-performance reads of historical video for AI training.
    • Genomics: Batch analytics of genomic data streams.

Raúl Gracia-Tinedo, Flavio Junqueira, Tom Kaitchuck, and Sachin Joshi. "Pravega: A Tiered Storage System for Data Streams." In ACM/IFIP Middleware’23, pp. 165-177. 2023.

9 of 18

XtremeHub Streams: KPIs & Use-Cases

  • Experiment: Coordinated storage-compute auto-scaling.
    • Goal: Evaluate coordinated storage-compute auto-scaling of Pravega streams (0.8K events/second) and Flink task managers (1/segment).
    • Setup: 6VM Kubernetes cluster (block volumes).
    • Workload: Custom generator of sinusoidal workload.

  • KPI-3 (Resource Auto-scaling): Administrators can reason about the storage-compute auto-scaling of the streaming pipeline considering just two main parameters.
  • Use-cases:
    • Surgery: Handle pronounced workload variations in NCT surgery room utilization.

Raúl Gracia-Tinedo, Flavio Junqueira, Brian Zhou, Yimin Xiong, and Luis Liu. "Practical Storage-Compute Elasticity for Stream Data Processing." In ACM/IFIP Middleware’23: Industrial Track, pp. 1-7. 2023.

10 of 18

XtremeHub Security: Scone

  • Scone: Provides confidential computing for containerized applications.
    • Exploitation of Trusted Execution Environments (TEEs).
    • Support for concurrent executions and sub-processes.

  • Integration with Lithops for secure serverless confidential computing.

  • KPI-4 (Confidential Computing): SCONE supports TEE-based confidential execution of both Lithops and Flower ML, two systems developed in Python.
  • Use-cases: Surgery, Metabolomics, Genomics, Federated Learning.

*More details in WP4 presentation

Galanou, Anna, Khushboo Bindlish, Luca Preibsch, Yvonne-Anne Pignolet, Christof Fetzer, and Rüdiger Kapitza. "Trustworthy confidential virtual machines for the masses." In ACM/IFIP Middleware’23, pp. 316-328. 2023.
Gregor, Franz, Robert Krahn, Do Le Quoc, and Christof Fetzer. "SinClave: Hardware-assisted Singletons for TEEs." In ACM/IFIP Middleware’23, pp. 85-97. 2023.

11 of 18

XtremeHub Connectors: Dataplug

  • Dataplug: On-the-fly data partitioning connector for unstructured scientific data (Lithops, Ray, etc.).

  • Experiment: Dynamic partitioning of the FASTQGZip compressed genomic data format.
    • Goal: Evaluate the performance of dynamic partitioning against static pre-partitioning.
    • Setup: AWS EC2 t2.xlarge instance.
    • Workload: FASTQGZip files from UKHSA.

  • KPI-1 (Throughput Improvements): Dataplug delivers higher processing throughput by replacing static data pre-processing with on-the-fly metadata extraction, thereby reducing data transfers.
  • Use-cases: Metabolomics, Genomics.
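The idea of on-the-fly partitioning (as opposed to rewriting pre-partitioned copies of the data) can be illustrated with a toy indexer that scans record boundaries once and then hands out byte ranges for workers to fetch via independent range reads. This sketch is illustrative only; Dataplug's real connectors index compressed formats such as FASTQGZip:

```python
def index_records(data: bytes, delimiter: bytes = b"\n"):
    """One cheap metadata pass: record the byte offset of every record start."""
    offsets = [0]
    pos = data.find(delimiter)
    while pos != -1:
        offsets.append(pos + 1)
        pos = data.find(delimiter, pos + 1)
    if offsets[-1] >= len(data):  # drop a trailing offset past the end of data
        offsets.pop()
    return offsets

def partition(offsets, total_len, num_chunks):
    """Turn the offset index into num_chunks (start, end) byte ranges that respect
    record boundaries, so each worker can issue an independent range read."""
    per_chunk = max(1, len(offsets) // num_chunks)
    starts = [offsets[i] for i in range(0, len(offsets), per_chunk)][:num_chunks]
    ends = starts[1:] + [total_len]
    return list(zip(starts, ends))
```

Because only the small offset index is materialized, the raw object is never copied or rewritten — the source of the transfer reduction reported above.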

*More details in WP2 presentation

Aitor Arjona, Pedro García-López, Daniel Barcelona-Pons. “Dataplug: Unlocking extreme data analytics with on-the-fly dynamic partitioning of unstructured data”. In IEEE/ACM CCGrid’24.

12 of 18

XtremeHub Connectors: HPC Connectors

  • Multifactor Dimensionality Reduction (MDR) connector to detect variant pairs associated with diseases (e.g., Type 2 Diabetes).

  • Experiment:
    • Goal: Evaluate MDR implemented with MPI.
    • Setup: MareNostrum 4 and 5.
    • Workload: 70KforT2D dataset.

  • KPI-1 (Data Throughput Improvements): Our MPI implementation of MDR achieves processing rates of 1311 and 3057 combinations per second per core on MareNostrum 4 and 5, respectively.

  • Use-cases: Metabolomics, Genomics.

*More details in WP5 presentation

13 of 18

Research Outcomes

  • ~18 research publications as part of WP3 tasks (plus several more ongoing).
  • Lithops & Dataplug:
    • Contributions to Lithops by extending it with a novel data partitioning scheme.
    • Key feature for building a cutting-edge analytics service for object storage.

  • Pravega & video analytics:
    • Best paper award at ACM/IFIP Middleware‘23.
    • Novel auto-scaling mechanisms for Pravega streams and analytics engines.
    • Innovative application and evaluation of Pravega in video analytics use cases (NCT computer-assisted surgery).
  • Scone:
    • Pioneering research of applying Confidential Computing to Federated Learning use-cases (NCT).
    • Multiple publications in top conferences (e.g., 3x ACM/IFIP Middleware’23).

  • Connectors:
    • Promising early results of MDR implementation executed in real HPC environments (MareNostrum).

14 of 18

Wrap Up

  • M18 Outcomes:
    • Evaluation of Lithops in large-scale scenarios and Dataplug integration.
    • Analysis of Pravega in multiple use-case related workloads.
    • “Sconification” of use-case computations.
    • Release of data connectors (Dataplug, MDR).
    • Integration of several XtremeHub components (Scone-Lithops, Pravega-Lithops).

  • Next steps:
    • Work on further integrations within XtremeHub.
    • Explore further mechanisms that can help use cases (e.g., video search, indexing, etc.).
    • Larger-scale experiments with use-case workloads.

15 of 18

Thank you

16 of 18

Backup

17 of 18

XtremeHub Compute: Lithops

  • Lithops: Multi-cloud serverless computing framework.
    • Relies heavily on Python features to automate multiple management tasks.

  • Goal: Deploy serverless analytics and data management jobs transparently across multiple clouds.
    • “Imperative” programs can be easily parallelized via parallel functions (lambdas).
  • Containerized runtime for running user jobs:
    • It implements the APIs of the main Cloud serverless providers (e.g., AWS Lambda).

  • Object storage as the main foundation for unstructured data:
    • Implements multiple storage APIs and backends.

  • Advanced computing primitives: map-reduce, function chaining, etc.
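The map-reduce primitive mentioned above follows the same executor API as `map`; a minimal sketch, assuming a configured Lithops backend (word counting is a stand-in workload):

```python
def count_words(text):
    """Map stage: count words in one text chunk (runs in a serverless function)."""
    return len(text.split())

def total(counts):
    """Reduce stage: aggregate the per-chunk counts."""
    return sum(counts)

def run(chunks):
    import lithops  # requires a configured Lithops compute/storage backend
    fexec = lithops.FunctionExecutor()
    fexec.map_reduce(count_words, chunks, total)
    return fexec.get_result()
```

`run` is defined but not invoked here, since executing it needs cloud credentials; the map and reduce stages themselves are plain Python functions.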

18 of 18

XtremeHub Streams: Pravega

  • Pravega: Distributed, tiered storage system for data streams.

  • Pravega streams: Unbounded sequence of bytes with durability guarantees and good performance.
    • Streams are formed of one/multiple parallel segments.

  • Software-defined architecture:
    • Controller instances (metadata, stream operations).
    • Segment stores (segment IO operations).
    • Clients (readers and writers).

  • Pioneering system providing storage tiering for data streams:
    • Write-Ahead Log (WAL): Low-latency writes, append only.
    • Long-term Storage (LTS): Scale-out, cost-effective storage (e.g., object store).
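The WAL-to-LTS flow can be pictured with a toy two-tier stream: appends land in a low-latency log first and are asynchronously migrated to cheap long-term storage, while reads are served transparently from either tier (an illustrative model, not Pravega's implementation):

```python
class TieredStream:
    """Toy model of tiered stream storage: a fast WAL in front of long-term storage."""
    def __init__(self):
        self.wal = []   # Write-Ahead Log: low-latency, append-only
        self.lts = []   # Long-Term Storage: scale-out, cost-effective (e.g. object store)

    def append(self, event: bytes):
        self.wal.append(event)       # the write is acked once durable in the WAL

    def tier_down(self):
        """Background migration: move acknowledged events from WAL to LTS."""
        self.lts.extend(self.wal)
        self.wal.clear()

    def read(self, offset: int) -> bytes:
        """Transparent read across tiers: LTS holds the oldest events."""
        if offset < len(self.lts):
            return self.lts[offset]
        return self.wal[offset - len(self.lts)]
```

Readers never need to know which tier holds an event — the same property that lets Pravega serve both low-latency tail reads and high-throughput historical reads.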

Raúl Gracia-Tinedo, Flavio Junqueira, Tom Kaitchuck, and Sachin Joshi. "Pravega: A Tiered Storage System for Data Streams." In ACM/IFIP Middleware’23, pp. 165-177. 2023.