1 of 18

WP3 - Data Plane: Extreme Data Connectors

Leader: Dell Technologies (DELL)

EXTREME NEAR-DATA PROCESSING PLATFORM

2 of 18

Table of contents

1. XtremeHub: NEARDATA’s Data Plane
2. XtremeHub Compute: Lithops
3. XtremeHub Streams: Pravega
4. XtremeHub Security: Scone
5. XtremeHub Connectors
6. Wrap Up

3 of 18

WP3 Tasks

WP3: Development of serverless, stream, and HPC data connectors and compute/storage infrastructure for extreme data analytics.

  • T3.1: Extreme Data Connectors and Data Catalog
  • T3.2: Stream Storage Data connectors
  • T3.3: Video Stream Data connectors
  • T3.4: High-Performance Computing Data connectors
  • T3.5: Data Analytics Interconnection Layer

4 of 18

XtremeHub: NEARDATA’s Data Plane

  • Goal: To facilitate ingestion, management, and processing of massive unstructured data.
  • Key concepts in XtremeHub:
    • Extreme Data Types: Virtual objects that encapsulate unstructured data and specific metadata.
    • Data Connectors: Specialized components that manage extreme data types to facilitate data processing.
  • XtremeHub main components:
    • XtremeHub Compute: Lithops
    • XtremeHub Streams: Pravega
    • XtremeHub Security: Scone
    • XtremeHub Connectors

  • KPIs: Throughput/latency, data reduction, resource auto-scaling, security, and productivity.
  • Use-cases: Metabolomics, Surgery, Genomics, Federated learning.
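The "extreme data type" concept above can be pictured as a lightweight virtual object pairing a reference to unstructured data with the metadata a connector needs. A minimal illustrative sketch (all names here are hypothetical, not XtremeHub's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class ExtremeDataType:
    """Virtual object: a pointer to unstructured data plus connector metadata."""
    uri: str                                      # e.g. "s3://bucket/sample.fastq.gz"
    format: str                                   # logical format a connector understands
    metadata: dict = field(default_factory=dict)  # e.g. a byte-range index

    def describe(self) -> str:
        return f"{self.format} object at {self.uri} ({len(self.metadata)} metadata keys)"

# A connector would populate the metadata once, enabling near-data processing later.
obj = ExtremeDataType(uri="s3://bucket/sample.fastq.gz", format="FASTQ-GZip")
obj.metadata["index_offsets"] = [0, 1048576, 2097152]
```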

5 of 18

XtremeHub Compute: KPIs & Use-Cases

  • Experiment: Evaluate the performance of compute/storage backends.
    • Goal: Compare the performance of IBM and AWS clouds.
    • Setup: 1000 functions were invoked with a memory size of 1024MB.
    • Workload: FLOP benchmark (matrix multiplications).

  • KPI-1 (Throughput Improvements): Lithops can deliver multi-GFLOP compute performance (5-45 GFLOPS) as well as high I/O throughput (25-100 GB/s), depending on the cloud provider (IBM, AWS).
  • KPI-3 (Resource Auto-scaling): Lithops can seamlessly parallelize a single function from 1 to 1000s of lambdas depending on the workload demands.
  • Use-cases: Metabolomics, Genomics.
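The fan-out pattern behind this experiment can be sketched with Lithops' `FunctionExecutor`; a minimal sketch, assuming a configured Lithops backend (the pure-Python worker below is illustrative — the real benchmark uses optimized matrix kernels):

```python
def matmul_benchmark(n):
    """Multiply two n x n random matrices in pure Python; return approx. FLOPs done."""
    import random
    a = [[random.random() for _ in range(n)] for _ in range(n)]
    b = [[random.random() for _ in range(n)] for _ in range(n)]
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            aik = a[i][k]
            for j in range(n):
                c[i][j] += aik * b[k][j]
    return 2 * n ** 3  # one multiply + one add per inner iteration

def run_benchmark(num_tasks=1000, n=512):
    """Fan the workload out to num_tasks serverless functions with 1024 MB each."""
    import lithops  # requires a configured compute/storage backend (e.g. AWS)
    fexec = lithops.FunctionExecutor(runtime_memory=1024)
    fexec.map(matmul_benchmark, [n] * num_tasks)
    return sum(fexec.get_result())  # total FLOPs executed across all lambdas
```

`run_benchmark` mirrors the experiment setup (1000 invocations, 1024 MB memory); calling it requires cloud credentials, so it is defined but not invoked here.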

Finol, Gerard, Gerard París, Pedro García-López, and Marc Sánchez-Artigas. "Exploiting inherent elasticity of serverless in algorithms with unbalanced and irregular workloads." Journal of Parallel and Distributed Computing 190 (2024): 104891.
Arjona, Aitor, Arnau Gabriel-Atienza, Sara Lanuza-Orna, Xavier Roca-Canals, Ayman Bourramouss, Tyler K. Chafin, Lucio Marcello, Paolo Ribeca, and Pedro García-López. "Scaling a Variant Calling Genomics Pipeline with FaaS." In Proceedings of the 9th International Workshop on Serverless Computing, pp. 59-64. 2023.

6 of 18

XtremeHub Streams: KPIs & Use-Cases

  • Experiment: Evaluating frame ingestion and batching strategies for clients.
    • Goal: Latency vs throughput behavior of GStreamer and Pravega/Kafka clients.
    • Setup: AWS deployment (3VMs with local NVMe drives for journaling and EBS).
    • Workload: GStreamer videotestsrc and Open-Messaging Benchmark workload.

  • KPI-2 (Data Speed Improvements): Pravega's Byte API achieves low end-to-end latency (p95 < 10 ms) for GStreamer byte-buffer streaming, well below AI inference latency, especially with local storage and network co-location.

  • KPI-5 (Simplicity and Productivity): Dynamic event batching in Pravega allows writers to achieve a good balance between latency and throughput compared to Kafka, while guaranteeing data durability.
  • Use-cases:
    • Surgery: Low latency ingestion of video frames and efficient sensor data ingestion without client configuration.
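The latency/throughput trade-off that dynamic batching navigates can be illustrated with a toy writer that flushes either when a batch fills or when a deadline expires (an illustrative model, not Pravega's actual batching algorithm):

```python
import time

class DynamicBatcher:
    """Flush when the batch reaches max_bytes or max_wait_s elapses, whichever first."""
    def __init__(self, flush, max_bytes=1024, max_wait_s=0.010):
        self.flush = flush            # callback receiving a list of events
        self.max_bytes = max_bytes    # larger batches -> higher throughput
        self.max_wait_s = max_wait_s  # shorter deadline -> bounded latency
        self.batch, self.size, self.deadline = [], 0, None

    def write(self, event: bytes):
        if not self.batch:
            self.deadline = time.monotonic() + self.max_wait_s
        self.batch.append(event)
        self.size += len(event)
        if self.size >= self.max_bytes or time.monotonic() >= self.deadline:
            self._flush()

    def _flush(self):
        if self.batch:
            self.flush(self.batch)   # durability ack would happen here
            self.batch, self.size, self.deadline = [], 0, None

batches = []
w = DynamicBatcher(batches.append, max_bytes=10, max_wait_s=1.0)
for _ in range(4):
    w.write(b"abc")   # 3 bytes each; the size threshold triggers on the 4th write
```

Tuning `max_bytes` up favors throughput; tuning `max_wait_s` down bounds per-event latency — the balance a dynamic policy adjusts automatically.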

Raúl Gracia-Tinedo, Flavio Junqueira, Tom Kaitchuck, and Sachin Joshi. "Pravega: A Tiered Storage System for Data Streams." In ACM/IFIP Middleware’23, pp. 165-177. 2023.
Raúl Gracia-Tinedo, Flavio Junqueira, and Tom Kaitchuck. "'Back to the Byte': Towards Byte-oriented Semantics for Streaming Storage." In ACM/IFIP Middleware’24 (Industry Track), to appear.

7 of 18

XtremeHub Streams: KPIs & Use-Cases

  • Experiment: Evaluate writer/segment parallelism.
    • Goal: Understand impact of parallelism on write performance.
    • Setup: AWS deployment (3VMs with local NVMe drives for journaling).
    • Workload: Open-Messaging Benchmark workload.

  • KPI-1 (Throughput Improvements): Pravega is the only system tested that achieves consistent throughput (e.g., 250 MB/s) across the numbers of producers and segments evaluated, while guaranteeing event ordering and data durability.
  • Use-cases:
    • Surgery: Ingestion of many parallel video and sensor streams.
    • Genomics: Ingestion of parallel genomic files (e.g., FASTQ) as data streams.

Raúl Gracia-Tinedo, Flavio Junqueira, Tom Kaitchuck, and Sachin Joshi. "Pravega: A Tiered Storage System for Data Streams." In ACM/IFIP Middleware’23, pp. 165-177. 2023.

8 of 18

XtremeHub Streams: KPIs & Use-Cases

  • Experiment: Batch reads of stream data.
    • Goal: Understand performance of historical read on Pulsar/Pravega.
    • Setup: AWS deployment (3VMs with local NVMe drives for journaling).
    • Workload: Open-Messaging Benchmark workload.

  • KPI-1 (Throughput Improvements): Pravega achieves much higher historical read throughput (e.g., 7x at peak) compared to Pulsar without user intervention or complex configuration settings.
  • Use-cases:
    • Surgery: High-performance reads of historical video for AI training.
    • Genomics: Batch analytics of genomic data streams.

Raúl Gracia-Tinedo, Flavio Junqueira, Tom Kaitchuck, and Sachin Joshi. "Pravega: A Tiered Storage System for Data Streams." In ACM/IFIP Middleware’23, pp. 165-177. 2023.

9 of 18

XtremeHub Streams: KPIs & Use-Cases

  • Experiment: Coordinated storage-compute auto-scaling.
    • Goal: Evaluate coordinated storage-compute auto-scaling of Pravega streams (0.8K events/second) and Flink task managers (1/segment).
    • Setup: 6VM Kubernetes cluster (block volumes).
    • Workload: Custom generator of sinusoidal workload.

  • KPI-3 (Resource Auto-scaling): Administrators can reason about the storage-compute auto-scaling of the streaming pipeline considering just two main parameters.
  • Use-cases:
    • Surgery: Handle pronounced workload variations in NCT surgery room utilization.

Raúl Gracia-Tinedo, Flavio Junqueira, Brian Zhou, Yimin Xiong, and Luis Liu. "Practical Storage-Compute Elasticity for Stream Data Processing." In ACM/IFIP Middleware’23: Industrial Track, pp. 1-7. 2023.

10 of 18

XtremeHub Security: Scone

  • Scone: Provides confidential computing for containerized applications.
    • Exploitation of Trusted Execution Environments (TEEs).
    • Support for concurrent executions and sub-processes.

  • Integration with Lithops for secure serverless confidential computing.

  • KPI-4 (Confidential Computing): SCONE supports TEE-based confidential execution of both Lithops and Flower ML, two systems developed in Python.
  • Use-cases: Surgery, Metabolomics, Genomics, Federated Learning.

*More details in WP4 presentation

Galanou, Anna, Khushboo Bindlish, Luca Preibsch, Yvonne-Anne Pignolet, Christof Fetzer, and Rüdiger Kapitza. "Trustworthy confidential virtual machines for the masses." In ACM/IFIP Middleware’23, pp. 316-328. 2023.
Gregor, Franz, Robert Krahn, Do Le Quoc, and Christof Fetzer. "SinClave: Hardware-assisted Singletons for TEEs." In ACM/IFIP Middleware’23, pp. 85-97. 2023.

11 of 18

XtremeHub Connectors: Dataplug

  • Dataplug: On-the-fly data partitioning connector for unstructured scientific data (Lithops, Ray, etc.).

  • Experiment: Dynamic partitioning of the FASTQGZip compressed genomic data format.
    • Goal: Evaluate the performance of dynamic partitioning against static pre-partitioning.
    • Setup: AWS EC2 t2.xlarge instance.
    • Workload: FASTQGZip files from UKHSA.

  • KPI-1 (Throughput Improvements): Dataplug delivers higher processing throughput by replacing static data pre-processing with on-the-fly metadata extraction, thereby reducing data transfers.
  • Use-cases: Metabolomics, Genomics.
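The idea of on-the-fly partitioning (as opposed to rewriting pre-partitioned copies of the data) can be illustrated with a toy indexer that scans record boundaries once and then hands out byte ranges for workers to fetch via independent range reads. This sketch is illustrative only; Dataplug's real connectors index compressed formats such as FASTQGZip:

```python
def index_records(data: bytes, delimiter: bytes = b"\n"):
    """One cheap metadata pass: record the byte offset of every record start."""
    offsets = [0]
    pos = data.find(delimiter)
    while pos != -1:
        offsets.append(pos + 1)
        pos = data.find(delimiter, pos + 1)
    if offsets[-1] >= len(data):  # drop a trailing offset past the end of data
        offsets.pop()
    return offsets

def partition(offsets, total_len, num_chunks):
    """Turn the offset index into num_chunks (start, end) byte ranges that respect
    record boundaries, so each worker can issue an independent range read."""
    per_chunk = max(1, len(offsets) // num_chunks)
    starts = [offsets[i] for i in range(0, len(offsets), per_chunk)][:num_chunks]
    ends = starts[1:] + [total_len]
    return list(zip(starts, ends))
```

Because only the small offset index is materialized, the raw object is never copied or rewritten — the source of the transfer reduction reported above.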

*More details in WP2 presentation

Aitor Arjona, Pedro García-López, Daniel Barcelona-Pons. “Dataplug: Unlocking extreme data analytics with on-the-fly dynamic partitioning of unstructured data”. In IEEE/ACM CCGrid’24.

12 of 18

XtremeHub Connectors: HPC Connectors

  • Multifactor Dimensionality Reduction (MDR) connector to detect variant pairs associated with diseases (e.g., Type 2 Diabetes).

  • Experiment:
    • Goal: Evaluate MDR implemented with MPI.
    • Setup: MareNostrum 4 and 5.
    • Workload: 70KforT2D dataset.

  • KPI-1 (Data Throughput Improvements): Our MPI implementation of MDR achieves processing rates of 1311 and 3057 combinations per second per core on MareNostrum 4 and 5, respectively.

  • Use-cases: Metabolomics, Genomics.

*More details in WP5 presentation

13 of 18

Research Outcomes

  • ~18 research publications as part of WP3 tasks (plus several more ongoing).
  • Lithops & Dataplug:
    • Contributions to Lithops by extending it with a novel data partitioning scheme.
    • Key feature for building a cutting-edge analytics service for object storage.

  • Pravega & video analytics:
    • Best paper award at ACM/IFIP Middleware‘23.
    • Novel auto-scaling mechanisms for Pravega streams and analytics engines.
    • Innovative application and evaluation of Pravega in video analytics use cases (NCT computer-assisted surgery).
  • Scone:
    • Pioneering research of applying Confidential Computing to Federated Learning use-cases (NCT).
    • Multiple publications in top conferences (e.g., 3x ACM/IFIP Middleware’23).

  • Connectors:
    • Promising early results of MDR implementation executed in real HPC environments (MareNostrum).

14 of 18

Wrap Up

  • M18 Outcomes:
    • Evaluation of Lithops in large-scale scenarios and Dataplug integration.
    • Analysis of Pravega in multiple use-case related workloads.
    • “Sconification” of use-case computations.
    • Release of data connectors (Dataplug, MDR).
    • Integration of several XtremeHub components (Scone-Lithops, Pravega-Lithops).

  • Next steps:
    • Work on further integrations within XtremeHub.
    • Explore further mechanisms that can help use cases (e.g., video search, indexing, etc.).
    • Larger-scale experiments with use-case workloads.

15 of 18

Thank you

16 of 18

Backup

17 of 18

XtremeHub Compute: Lithops

  • Lithops: Multi-cloud serverless computing framework.
    • Relies heavily on Python features to automate multiple management tasks.

  • Goal: Deploy serverless analytics and data management jobs transparently across multiple clouds.
    • “Imperative” programs can be easily parallelized via parallel functions (lambdas).
  • Containerized runtime for running user jobs:
    • It implements the APIs of the main Cloud serverless providers (e.g., AWS Lambda).

  • Object storage as the main foundation for unstructured data:
    • Implements multiple storage APIs and backends.

  • Advanced computing primitives: map-reduce, function chaining, etc.
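The map-reduce primitive mentioned above follows the same executor API as `map`; a minimal sketch, assuming a configured Lithops backend (word counting is a stand-in workload):

```python
def count_words(text):
    """Map stage: count words in one text chunk (runs in a serverless function)."""
    return len(text.split())

def total(counts):
    """Reduce stage: aggregate the per-chunk counts."""
    return sum(counts)

def run(chunks):
    import lithops  # requires a configured Lithops compute/storage backend
    fexec = lithops.FunctionExecutor()
    fexec.map_reduce(count_words, chunks, total)
    return fexec.get_result()
```

`run` is defined but not invoked here, since executing it needs cloud credentials; the map and reduce stages themselves are plain Python functions.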

18 of 18

XtremeHub Streams: Pravega

  • Pravega: Distributed, tiered storage system for data streams.

  • Pravega streams: Unbounded sequence of bytes with durability guarantees and good performance.
    • Streams are formed of one/multiple parallel segments.

  • Software-defined architecture:
    • Controller instances (metadata, stream operations).
    • Segment stores (segment IO operations).
    • Clients (readers and writers).

  • Pioneering system providing storage tiering for data streams:
    • Write-Ahead Log (WAL): Low-latency writes, append only.
    • Long-term Storage (LTS): Scale-out, cost-effective storage (e.g., object store).
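The WAL-to-LTS flow can be pictured with a toy two-tier stream: appends land in a low-latency log first and are asynchronously migrated to cheap long-term storage, while reads are served transparently from either tier (an illustrative model, not Pravega's implementation):

```python
class TieredStream:
    """Toy model of tiered stream storage: a fast WAL in front of long-term storage."""
    def __init__(self):
        self.wal = []   # Write-Ahead Log: low-latency, append-only
        self.lts = []   # Long-Term Storage: scale-out, cost-effective (e.g. object store)

    def append(self, event: bytes):
        self.wal.append(event)       # the write is acked once durable in the WAL

    def tier_down(self):
        """Background migration: move acknowledged events from WAL to LTS."""
        self.lts.extend(self.wal)
        self.wal.clear()

    def read(self, offset: int) -> bytes:
        """Transparent read across tiers: LTS holds the oldest events."""
        if offset < len(self.lts):
            return self.lts[offset]
        return self.wal[offset - len(self.lts)]
```

Readers never need to know which tier holds an event — the same property that lets Pravega serve both low-latency tail reads and high-throughput historical reads.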

Raúl Gracia-Tinedo, Flavio Junqueira, Tom Kaitchuck, and Sachin Joshi. "Pravega: A Tiered Storage System for Data Streams." In ACM/IFIP Middleware’23, pp. 165-177. 2023.