1 of 74

Extending Composable Data Services to the Realm of Embedded Systems

Jianshen Liu

University of California, Santa Cruz
Dissertation Defense @ June 21, 2023

2 of 74

Challenges and Opportunities for Modern Data Processing


Data processing on general-purpose systems faces performance, cost, and energy challenges as modern applications increasingly rely on larger data volumes.

* The International Symposium on Computer Architecture

Word cloud of most common words in abstracts for ISCA* 1983 - 2023

[upasani2023]

Embedded systems present opportunities for optimization by offloading data-processing services.

But: the cost/benefit landscape for offloading is complex.

3 of 74

Types of Data Service Offloading


Examples:

filesystems

key/value stores

data queries

Examples:

data distribution

data recovery

load balancing

Data services require frequent data movement across system layers and components [kufeldt2018]

Example benefits of offloading data services

  • Minimize data movement cost to host systems
  • Maximize resource availability for host applications
  • Leverage efficient resources and mechanisms for data movement
  • Allow for service isolation without additional infrastructure

Involve compute, network, and energy overheads

4 of 74

Hypotheses and Scope


Scope

Provide a set of new tools to estimate and realize the benefits of offloading composable data services to embedded systems.

Essential building blocks for applications to implement data availability, validity, and curation

Hypothesis #1: Quantifying the benefits of data service offloading requires multiple and specialized evaluation metrics.

Hypothesis #2: Data service offloading has to be carefully tailored based on the function, the data, and the fluctuating and highly specialized resource availability on embedded systems.

Hypothesis #3: Harnessing the benefits of data service offloading requires efficient mechanisms for communication, computation, and scheduling.

Why: Metrics for Evaluation

What: Potential for Offloading

How: Strategies to Offload

5 of 74

Related Publications


  • [HPC-IODC ‘19] MBWU: Benefit quantification for data access function offloading. Jianshen Liu, Philip Kufeldt, and Carlos Maltzahn
  • [HotEdge '20] Scale-out edge storage systems with embedded storage nodes to get better availability and cost-efficiency at the same time. Jianshen Liu, Matthew Leon Curry, Carlos Maltzahn, and Philip Kufeldt
  • [SNL Report 2021] Performance Characteristics of the BlueField-2 SmartNIC. Jianshen Liu, Carlos Maltzahn, Craig D. Ulmer, Matthew Leon Curry
  • [HPEC ‘22] Processing Particle Data Flows with SmartNICs (Outstanding Student Paper). Jianshen Liu, Carlos Maltzahn, Matthew Leon Curry, Craig Ulmer
  • [COMPSYS ‘23] Extending Composable Data Services into SmartNICs (Best Paper). Craig Ulmer, Jianshen Liu, Carlos Maltzahn, Matthew Leon Curry
  • WIP for [HPEC ’23] (work on coordinating processing offloading)

(Picked up by The Next Platform)

[prickett2021]

6 of 74

Contributions


Why

What

How

before advancement

after advancement

7 of 74

Contributions


Why

What

How

Metrics

These metrics and methods provide a new toolkit for evaluating the cost-benefit of offloading data services, one that also applies to future embedded systems.

8 of 74

Quantifying the Benefits of Offloading [liu2019] collaborated with Philip Kufeldt


Metrics

MBWU

  • Throughput of a given workload for a given single device (without offloading)
  • Example: the non-offloaded throughput of a given workload on a single storage device is called a Media-Based Work Unit (MBWU)

Work Unit (WU)

As new devices emerge at increasing frequency

Automated determination of a (Media-Based) Work Unit

As long as the workload and device are fixed, we can compare throughput with and without offloading.
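The comparison this enables can be sketched as follows; the throughput numbers are illustrative, not measurements from the dissertation:

```python
def mbwus_per_second(workload_tput, single_device_tput):
    # One MBWU/s is the throughput one storage device sustains for this
    # workload, so the ratio counts device-equivalents of delivered work.
    return workload_tput / single_device_tput

# Hypothetical numbers (Kops/s):
baseline = 120.0     # one device, workload run directly on it
host_sys = 900.0     # host-based system without offloading
offloaded = 840.0    # embedded system with the function offloaded

host_mbwu = mbwus_per_second(host_sys, baseline)       # 7.5 MBWU/s
offload_mbwu = mbwus_per_second(offloaded, baseline)   # 7.0 MBWU/s
```

With the workload and device fixed, the two MBWU/s figures are directly comparable even though the systems differ.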

9 of 74

Data Availability

9

Metrics

MBWU ║ Data Availability

Challenges of small storage systems ➡ edge data centers

  • Environmental constraints (e.g., space, network and power)
  • Server maintenance costs rival redundancy provisioning expenses
  • Limited number of failure domains due to small deployment sizes

The more independent failure domains a failover mechanism spans, the more available the data becomes.

Server-based Storage System

Embedded Storage System

Embedded storage enables more failure domains under the same cost/space/power restrictions.

  • The higher the storage aggregation in servers, the greater the data availability benefit of embedded storage.
  • The higher the compute aggregation in servers, the greater the data availability benefit of embedded storage.
  • The lower the failure rate of storage devices used in storage systems, the greater the data availability benefit of embedded storage.
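The failure-domain argument can be sketched with a minimal availability model; independent domain failures and the probability used are illustrative assumptions:

```python
def unavailability(p_fail, replicas, domains):
    # Each failure domain fails independently with probability p_fail.
    # Replicas packed into the same domain share its fate, so only the
    # number of distinct domains spanned improves availability.
    return p_fail ** min(replicas, domains)

# 3 replicas squeezed onto 2 servers vs. spread over 3 embedded nodes:
server_based = unavailability(0.01, replicas=3, domains=2)  # ~1e-4
embedded = unavailability(0.01, replicas=3, domains=3)      # ~1e-6
```

Under the same cost/space/power budget, more (smaller) failure domains shrink the probability that every copy is down at once.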

10 of 74

Contributions


Why

What

How

Metrics

These metrics and methods provide a new toolkit for evaluating the cost-benefit of offloading data services, one that also applies to future embedded systems.

11 of 74

Contributions


Why

What

How

Metrics

Potential

Realizing benefits goes beyond the use of specialized metrics. It requires exploring the potential of embedded systems in both computation and communication dimensions.

12 of 74

Evaluation Platform (SmartNIC)


Potential

  • Programmable NICs have been shown to optimize HPC/HPDA workflows [kung1989]
  • Programmable NICs are already capable of north-south and east-west communication
    • As both a specific use case and a placeholder for other future embedded systems
  • Multiple vendors have created powerful SmartNICs
    • Roadmaps show rapid performance improvement (10x every two years!)
    • Emerging HPC platforms include SmartNICs
    • Existing market for SmartNICs to enable traditional storage devices to become computational

NVIDIA BlueField-2 SmartNIC

  • 100 Gb/s InfiniBand
  • 8x Arm A72 cores @ 2.75 GHz
  • 16 GB DRAM
  • 60 GB Flash
  • Accelerators: compression, encryption, regular expression
  • Ubuntu 20.04
  • Operation modes: embedded function, separated


Platform

13 of 74

Performance Comparison using Microbenchmarks


Potential

Platform ║ Performance Delineation

  • Testing with realistic workloads often misses device pros and cons
  • Microbenchmarks focus on specific operations
  • Microbenchmarks serve as placeholders for offloaded and non-offloaded functions

Stress-ng Microbenchmark [ubuntustressng]

  • 250 stressors are designed to cover a broad spectrum of system operations (e.g., mmap, vecmath, sock)
  • Stressors are categorized into classes (e.g., OS, VM, FILESYSTEM, and SCHEDULER)

Operation latency-based performance metrics are not directly comparable


(executed on the SmartNIC vs. on the host)

Normalize the stressor results relative to the Raspberry Pi 4B’s performance

Raspberry-Pi-based Work Unit, or RPWU
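The RPWU normalization can be sketched as follows; the rates are illustrative bogo-ops/s values, not measured results:

```python
def to_rpwu(results, rpi4_results):
    # Express each stressor's rate as a multiple of the Raspberry Pi
    # 4B's rate for the same stressor (1 RPWU), making latency-based
    # results from different machines comparable.
    return {name: rate / rpi4_results[name]
            for name, rate in results.items() if name in rpi4_results}

rpi4 = {"vecmath": 100.0, "mmap": 50.0}
smartnic = {"vecmath": 250.0, "mmap": 40.0}
normalized = to_rpwu(smartnic, rpi4)  # {'vecmath': 2.5, 'mmap': 0.8}
```

A value above 1 RPWU means the device beats the Raspberry Pi 4B on that stressor; below 1, it trails.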

14 of 74

Performance Characterization


Compare BlueField-2 SmartNIC with a variety of host systems available at CloudLab [ricci2014]

Triangle data points denote the relative performance of the BlueField-2 SmartNIC

Potential

Platform ║ Performance Delineation


Offloading Potentials

BlueField-2 SmartNIC exhibits advantageous performance compared to hosts in:

  • Memory contention operations
  • Hardware accelerated operations (e.g., cryptographic and compression)
  • IPC operations
  • SIMD vectorized operations

15 of 74

Network Processing Headroom


Potential

Platform ║ Performance Delineation ║ Network Process Headroom

  • Determines maximum CPU time for offloaded functions while preserving network performance
  • Impacted by hardware configuration and the overhead of different networking stacks
  • Needs user- or kernel-space benchmarks to evaluate upper-bound performance
  • “Processing Headroom” is also relevant for storage devices if “non-computational” storage steals cycles from on-device processors

Linux pktgen [linuxpktgen]

  • Generates/injects UDP packets in kernel space with multi-core parallelism and multi-queue support


Evaluation Method

  1. Find the minimum thread count that maximizes bandwidth for the given packet and burst sizes
  2. Measure the duration of “busy retry” by introducing a “delay” between bursts
  3. The maximum delay that does not impede throughput is the processing headroom

Yet another specialized metric for evaluation
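The evaluation steps above can be sketched as a search loop; the model below is a toy stand-in for an actual pktgen measurement, and its break-even point is chosen for illustration:

```python
def processing_headroom(measure_tput, peak_tput, burst_time_us,
                        tolerance=0.01, step_us=1):
    # Grow the inter-burst delay until throughput drops below
    # (1 - tolerance) of peak; the last good delay is the headroom.
    delay = 0
    while measure_tput(delay + step_us) >= peak_tput * (1 - tolerance):
        delay += step_us
    return delay, delay / burst_time_us

# Toy stand-in for a pktgen run: throughput holds until the idle gap
# exceeds 21 us of a 100 us burst, then collapses.
model = lambda d: 100.0 if d <= 21 else 90.0
delay_us, share = processing_headroom(model, peak_tput=100.0,
                                      burst_time_us=100)
# delay_us == 21; share == 0.21 of the burst time
```

In the real setup, `measure_tput` corresponds to rerunning pktgen with the given delay and reading the achieved bandwidth.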

16 of 74

Processing Headroom Evaluation


Potential

Platform ║ Performance Delineation ║ Network Process Headroom


The BlueField-2:

  • saturates only 60% of the bandwidth, even with all 8 cores, a large burst size, and the largest packet size (10 KiB)
  • at half bandwidth utilization, has a processing headroom of just 21% of the burst time

17 of 74

Processing Headroom Evaluation


The BlueField-2:

  • 78.5% of CPU time available with the kernel IP stack
  • 87.5% of CPU time available with DPDK

Potential

Platform ║ Performance Delineation ║ Network Process Headroom


Offloading Potentials

  • High-performance networking stacks are crucial to the performance of offloading functions
  • Specific hardware configurations can greatly boost offloaded function performance

with DPDK

with kernel IP stack

Performance Leap!

18 of 74

Offloading Potential for Collective Acting SmartNICs


What additional potential do SmartNICs have when they can act collectively?

Problem: data services on compute nodes compete for resources with applications, causing potential delays [tseng2016]

Opportunity: Offload to SmartNICs

But: Require SmartNICs to act collectively to expand processing capability

Platform ║ Performance Delineation ║ Network Process Headroom ║ Potential for E/W Offloading

Service Placement

  • In-situ: sacrifices application resources; causes fate sharing
  • In-vitro: adds communication overhead; increases node count
  • In-storage [chakraborty2022]: prevented by system policies
  • Co-locate data processing elements with hosts?

Background: HPC Scientific Computing Workflows

  • Simulation tasks periodically generate data output
  • Data consumers want customized data
  • Data consumers and producers differ in their expectations of data organization
  • Composable data services are responsible for data tailoring

Potential


19 of 74

Collective Acting SmartNICs


Software Stack

  • Data processing: Apache Arrow [arrow2023], with SIMD support, provides serialization, compression, projection, and aggregation on tabular data
  • Network transfer: Faodel [ulmer2018], built on RDMA, provides distributed-memory Key/Blob API and computation dispatching.

How do these data service libraries perform on the SmartNIC?

Expand in-transit data processing capability with multiple SmartNICs

Sorted by�position and time

Sorted by�ID and time

Offload data reorganization service and online query service

  • Log-structured merge (LSM) tree sorts data

Platform ║ Performance Delineation ║ Network Process Headroom ║ Potential for E/W Offloading

Potential


20 of 74

Particle Data Partitioning Performance


  • Algorithm: unpack tabular data, split and repack into 2-16 partitions based on particle ID
  • Time to process 1 GB of data on compute node* vs. BlueField-2
  • Evaluate on three “particle” datasets: scientific, airplanes, and ships
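A minimal sketch of the unpack/split/repack step; hash partitioning by particle ID is assumed here, and the per-partition compression stage is omitted:

```python
from collections import defaultdict

def partition_by_id(rows, num_partitions):
    # Split rows of (particle_id, payload) into per-partition lists
    # keyed by a hash of the ID (modulo here for simplicity).
    parts = defaultdict(list)
    for particle_id, payload in rows:
        parts[particle_id % num_partitions].append((particle_id, payload))
    return dict(parts)

rows = [(0, "a"), (1, "b"), (2, "c"), (5, "d")]
parts = partition_by_id(rows, 2)
# parts[0] holds IDs 0 and 2; parts[1] holds IDs 1 and 5
```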

Overhead for partitioning without compression

Host is ~4x faster than the SmartNIC

Timing breakdown on BlueField-2 with compression

Handling compression in software incurs significant overhead

Offloading Potential

  • Offload data compression to hardware accelerator

Platform ║ Performance Delineation ║ Network Process Headroom ║ Potential for E/W Offloading ⇒ Data Partitioning

Potential


21 of 74

Multi-threaded Service Performance


Bookkeeping overhead on the SmartNIC

Faodel LocalKV test uses multi-threading for put/get/delete operations on a 2D in-memory hashmap

32-core AMD EPYC 7543P

68-core

Query Arrow Data

  • Filtering query selects particle data in a ⅛ bounding box
  • Aggregation query calculates min/max squared magnitude velocities of particles
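The two queries can be sketched in plain Python; the box corner (0.5 per axis, giving 1/8 of the unit volume) and the field names are assumptions for illustration:

```python
def filter_bbox(particles, hi=0.5):
    # Filtering query: keep particles inside a cube covering 1/8 of
    # the unit volume (0.5 per axis).
    return [p for p in particles
            if p["x"] < hi and p["y"] < hi and p["z"] < hi]

def minmax_speed2(particles):
    # Aggregation query: min/max of the squared velocity magnitude
    # (the squared form avoids a sqrt per particle).
    mags = [p["vx"] ** 2 + p["vy"] ** 2 + p["vz"] ** 2 for p in particles]
    return min(mags), max(mags)

data = [
    {"x": 0.1, "y": 0.2, "z": 0.3, "vx": 1.0, "vy": 0.0, "vz": 0.0},
    {"x": 0.9, "y": 0.1, "z": 0.2, "vx": 0.0, "vy": 2.0, "vz": 0.0},
]
inside = filter_bbox(data)        # only the first particle
lo, hi_mag = minmax_speed2(data)  # (1.0, 4.0)
```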

32-core Xeon E5-2698

At 8 threads, the host is only 37.8% faster than the SmartNIC

Servers are roughly four times faster than the SmartNIC when using the same number of threads

Offloading Potentials

  • Can efficiently scale performance with parallelization
  • Sufficient to handle low-rate, asynchronous data processing tasks

Platform ║ Performance Delineation ║ Network Process Headroom ║ Potential for E/W Offloading ⇒ Data Partitioning ⇒ Parallel Processing

Potential


22 of 74

Contributions


Why

What

How

Metrics

Potential

Realizing benefits goes beyond the use of specialized metrics. It requires exploring the potential of embedded systems in both computation and communication dimensions.

23 of 74

Contributions


Why

What

How

Metrics

Potential

Strategies

Efficient solutions for communication, computation, and scheduling complexities are essential to harness the benefits of offloading data services to embedded devices.

24 of 74

Bitar: Optimizing Data Compression for Serialization



Strategies

Bitar

  • Compression has significant overhead but plays a crucial role in data communication
  • How to build a convenient interface between data services and a compression accelerator?
    • Hide the communication complexity with the hardware
    • Explore the hardware capability (e.g., multiqueue and multithreading)
    • Ensure performance efficiency (e.g., minimize memory allocation)
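A minimal sketch of the interface idea, with zlib standing in for the compression accelerator; a real Bitar-style build would post buffers to hardware queues and poll completions rather than call a software codec:

```python
import zlib

class CompressDevice:
    # Hides the accelerator behind a plain compress/decompress API so
    # data services never touch queue setup or completion handling.
    def __init__(self, level=1):
        self.level = level

    def compress(self, buf: bytes) -> bytes:
        return zlib.compress(buf, self.level)

    def decompress(self, buf: bytes) -> bytes:
        return zlib.decompress(buf)

dev = CompressDevice()
payload = b"particle-record," * 1024
packed = dev.compress(payload)
restored = dev.decompress(packed)  # round-trips losslessly
```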

Speedups shown in the figure: 8.6x, 5.7x, 2.8x, 1.9x

Bitar simplifies the use of compression hardware to improve the performance of data services offloaded to SmartNICs.

Bitar implementation

Ensure the hardware accelerator is efficiently utilized

25 of 74

Embedded Processing Pipeline



Strategies

Bitar ║ Embedded Processing Pipeline

  • How can we construct an environment to host data services on distributed SmartNICs?
  • Requirements for communication and computation
    • Addressability: Efficient endpoint access (Faodel)
    • Configurability: Customizable resource mapping and computation dispatch (Faodel)
    • Adaptability: Optimized data processing through common data representation (Apache Arrow)
  • Prototype: Distributed Particle Sifting
    • For reorganizing particle data
    • Uses an LSM tree to sort on a 100-node BlueField-2 SmartNIC cluster
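The merge step of the sifting pipeline can be sketched with sorted-run merging; Faodel transport and dispatch are omitted, and the tuples are illustrative (particle_id, time) pairs:

```python
import heapq

def sift(sorted_runs):
    # Merge per-node sorted runs into one globally ordered stream,
    # the LSM-style merge at the heart of the reorganization service.
    return list(heapq.merge(*sorted_runs))

runs = [[(1, 0.0), (3, 0.5)], [(2, 0.1), (3, 0.2)]]
merged = sift(runs)  # [(1, 0.0), (2, 0.1), (3, 0.2), (3, 0.5)]
```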

32-core host systems are roughly 4x faster than the SmartNICs

Particle sifting performance

* 100M particles

26 of 74

Enabling Dynamic Offloading of Data Service Workloads



Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning

  • Potential for overloading resource-constrained embedded systems
    • Need to dynamically offload online data query workloads based on overhead
  • Need SmartNICs and not hosts to decide when to offload
    • Data complexity affects the overhead of a query workload
    • SmartNICs manage the data, and the data is dynamic
  • Design goals of the decision mechanism (decision engine)
    • Flexible Specification: Define workloads with a flexible, architecture-neutral specification (Substrait [substrait2023])
    • Job Size Estimation: Estimate based on the workload definition and referred data (Cardinality Estimator)
    • Adaptable Decision: Fine-tune offloading decisions based on resource availability (Predictive Model)
    • Efficient Decision: Make offloading decisions efficiently
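The rule at the core of the decision engine can be sketched as a cost comparison; the estimates below are illustrative seconds, not the dissertation's numbers:

```python
def should_offload(est_smartnic_s, est_host_s, transfer_s):
    # Offload when running the query on the SmartNIC beats shipping
    # the data back to the host and running it there; the execution
    # estimates come from the per-system predictive models.
    return est_smartnic_s < est_host_s + transfer_s

offload = should_offload(2.0, est_host_s=0.5, transfer_s=3.0)   # transfer dominates
pushback = should_offload(5.0, est_host_s=0.5, transfer_s=1.0)  # pushing back wins
```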

27 of 74

Estimating the Cost of Query Workloads


  • Dynamic offloading scheduling
    • Reactive scheduling: could defer beneficial decisions per request
    • Predictive scheduling: commonly employed in database query optimizers [ioannidis1996]
  • Estimate time consumption for each component

Serialization Cost

  • Train a machine learning model using the random forest regression algorithm
  • BlueField-2 serialization prediction shows an error rate under 7%

Prediction performance for serialization time


Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation

deserialization

serialization

Deserialization incurs a constant cost, whereas serialization does not
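Fitting a serialization-cost model can be sketched as follows; an ordinary least-squares line stands in here for the random-forest regressor, and the training points are toy values:

```python
def fit_line(xs, ys):
    # Least-squares fit y = a*x + b over (batch size, time) samples.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Toy training set: serialization time (ms) grows with batch size (MiB);
# deserialization would instead be modeled as a constant.
sizes = [1.0, 2.0, 4.0, 8.0]
times = [1.1, 2.0, 4.1, 7.9]
a, b = fit_line(sizes, times)
predicted = a * 4.0 + b  # close to the observed 4.1 ms
```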

28 of 74

Estimating the Cost of Query Execution


Query Execution Cost

  • Cardinality Estimator: Estimate the job size of a query workload
  • Execution Time Prediction: Based on job size and execution context (e.g., # of available threads)
  • Model Training: One per system type (e.g., host and BlueField-2)
  • Model Delivery: Shared to each SmartNIC for placement decision-making
  • Extract column statistics for each table using Apache DataSketches [datasketches] (e.g., histogram and distinct-counting statistics)
  • Update statistics as data changes
  • Support queries with filtering, projection, aggregation, reducible conditions, and multiple sources.

Implementing Cardinality Estimator

Estimate the # of output rows and the call count for each operation in a query
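What the estimator computes can be sketched with a single-column histogram; uniformity within bins is assumed, and the bin layout is illustrative:

```python
def estimate_rows(histogram, low, high):
    # Histogram-based output-row estimate for `low <= col < high`;
    # partially covered bins contribute proportionally. This sketches
    # what the DataSketches column statistics feed into.
    selected = 0.0
    for (lo, hi), count in histogram.items():
        overlap = max(0.0, min(hi, high) - max(lo, low))
        if overlap > 0:
            selected += count * overlap / (hi - lo)
    return selected

hist = {(0.0, 0.5): 600, (0.5, 1.0): 400}  # rows per value range
est = estimate_rows(hist, 0.25, 0.75)       # 0.5*600 + 0.5*400 = 500.0
```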


Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation

29 of 74

Prediction Performance



Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation

simple selection

project with compute function

filter with reducible condition

aggregate with compute function

Q1

Q3

Q4

Q7

Result from aggregation of multiple statistics' biases

Performance of Cardinality Estimation on Test Queries

Cost = CPU time for estimation / CPU time for query execution

Query Execution Time Prediction Performance (actual vs. estimated time)


Estimation based on available threads

30 of 74

Case Study I


select * from particles where

x >= 0.7 and y < 0.3 and z <= 0.1

query

6,177,731 rows

table

Cardinality Estimator

Estimated output rows: 55,036.7 (0.87% difference)

Actual output rows: 55,517 (selectivity 0.9%)

  • Host system: two Intel Xeon 16-core E5-2698 CPUs running at 2.30GHz and 512 GB of memory
  • Host measurements and estimates utilize 32 threads, while the BlueField-2 SmartNIC uses 6 threads
  • Offloading is 74.6% faster than pushing back due to low-percentage selectivity
  • Main overhead of pushing back stems from network transfer
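The selectivity and estimator-difference figures on this slide can be reproduced directly:

```python
actual_rows = 55_517
estimated_rows = 55_036.7
table_rows = 6_177_731

selectivity = actual_rows / table_rows                          # ~0.9%
estimate_err = abs(actual_rows - estimated_rows) / actual_rows  # ~0.87%
```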


Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation ⇒ Case Studies

Model Prediction

Transfer table size: 3.79 MiB / 3.81 MiB (offloaded), 418.2 MiB / 424.2 MiB (pushed-back)

Prediction differences: 10.3% (offloaded), 5.1% (pushed-back)

31 of 74

Case Study II


select * from particles where

x >= 0.5 and y < 0.55 and z <= 0.67

query

6,177,731 rows

table

Crossover point:

  • Estimation suggests pushing back
  • Offloading is only 1.38% faster than pushing back due to higher-percentage selectivity


Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation ⇒ Case Studies

Model Prediction

Transfer table size: 75.3 MiB / 78.1 MiB (offloaded), 418.2 MiB / 424.2 MiB (pushed-back)

Prediction differences: 1.9% (offloaded), 3.8% (pushed-back)

Cardinality Estimator

Estimated output rows: 1,152,860 (1.4% difference)

Actual output rows: 1,136,847 (selectivity 18.4%)

32 of 74

Conclusion


  • Embedded systems present opportunities to address data processing challenges with general-purpose systems
  • Exploiting data service offloading benefits requires navigating between complexities in data, services, and systems
  • Contributions: Tools and methods to estimate and realize efficient data services offloading
    • Introduce meaningful metrics for quantifying offloading benefits
    • Identify potentials for offloading
    • Propose strategies for device-side offload optimization

Metrics

Potential

Strategies

Conclusion

33 of 74

What's Next?


Metrics

Potential

Strategies

Conclusion

  • Cost-benefit Quantification for East-West Data Service Offloading
    • The Work Unit methodology might apply
    • Distributed data service measurement and system configuration issues
  • Query Performance with Dynamic Offloading
    • Apply strategies and evaluate dynamically offloading online query workloads
    • Adapt to runtime resources and congestion control issues
  • Security and Performance Isolation
    • Important to share embedded system resources among multiple tenants
    • Opportunities such as eBPF [miano2018] and WebAssembly [menetrey2021]

34 of 74

Jianshen Liu (jliu120@ucsc.edu)

University of California, Santa Cruz
Dissertation Defense @ June 21, 2023

Thank you!

Mentors: Carlos Maltzahn, Scott Brandt, Peter Alvaro, Craig Ulmer, Matthew Curry, Philip Kufeldt, Jeff LeFevre, Paul Stamwitz, Ike Nassi, Shel Finkelstein

SRL team: Ivo Jimenez, Noah Watkins, Michael Sevilla, Aldrin Montana, Jayjeet Chakraborty, Holly Casaletto, Saheed Adepoju, Esmaeil Mirvakili, Farid Zakaria

35 of 74

Related Publications


  • [HPC-IODC ‘19] MBWU: Benefit quantification for data access function offloading. Jianshen Liu, Philip Kufeldt, and Carlos Maltzahn
  • [HotEdge '20] Scale-out edge storage systems with embedded storage nodes to get better availability and cost-efficiency at the same time. Jianshen Liu, Matthew Leon Curry, Carlos Maltzahn, and Philip Kufeldt
  • [SNL Report 2021] Performance Characteristics of the BlueField-2 SmartNIC. Jianshen Liu, Carlos Maltzahn, Craig D. Ulmer, Matthew Leon Curry
  • [HPEC ‘22] Processing Particle Data Flows with SmartNICs (Outstanding Student Paper). Jianshen Liu, Carlos Maltzahn, Matthew Leon Curry, Craig Ulmer
  • [COMPSYS ‘23] Extending Composable Data Services into SmartNICs (Best Paper). Craig Ulmer, Jianshen Liu, Carlos Maltzahn, Matthew Leon Curry
  • WIP for [HPEC ’23] (work on coordinating processing offloading)

(Picked up by The Next Platform)

[prickett2021]

36 of 74


[rong2016] Rong, Huigui, et al. "Optimizing energy consumption for data centers." Renewable and Sustainable Energy Reviews 58 (2016): 674-691.

[masanet2020] Masanet, Eric, et al. "Recalibrating global data center energy-use estimates." Science 367.6481 (2020): 984-986.

[dayarathna2015] Dayarathna, Miyuru, Yonggang Wen, and Rui Fan. "Data center energy consumption modeling: A survey." IEEE Communications Surveys & Tutorials 18.1 (2015): 732-794.

[openai2018] Amodei, Dario et al. "AI and compute.", https://openai.com/research/ai-and-compute.

[gupta2021] Gupta, Udit, et al. "Chasing carbon: The elusive environmental footprint of computing." 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021.

[nikolic2022] Nikolić, Tatjana R., et al. "From Single CPU to Multicore Systems." 2022 57th International Scientific Conference on Information, Communication and Energy Systems and Technologies (ICEST). IEEE, 2022.

[karkar2022] Karkar, Ammar, et al. "Thermal and performance efficient on-chip surface-wave communication for many-core systems in dark silicon era." ACM Journal on Emerging Technologies in Computing Systems (JETC) 18.3 (2022): 1-18.


[cidon2013] Cidon, Asaf, et al. "Copysets: Reducing the frequency of data loss in cloud storage." 2013 {USENIX} Annual Technical Conference ({USENIX}{ATC} 13). 2013.

[vishwanath2010] Vishwanath, Kashi Venkatesh, and Nachiappan Nagappan. "Characterizing cloud computing hardware reliability." Proceedings of the 1st ACM symposium on Cloud computing. 2010.

[xu2019lessons] Xu, Erci, et al. "Lessons and Actions: What We Learned from 10K SSD-Related Storage System Failures." USENIX Annual Technical Conference. 2019.

[ubuntustressng] King, Colin. Stress-Ng: A Tool to Load and Stress a Computer System. https://kernel.ubuntu.com/git/cking/stress-ng.git. Accessed 29 May 2023.

[ricci2014] Ricci, Robert, Eric Eide, and CloudLab Team. "Introducing CloudLab: Scientific infrastructure for advancing cloud architectures and applications." ;login: The USENIX Magazine 39.6 (2014): 36-38.

[linuxpktgen] “HOWTO for the Linux Packet Generator.” The Linux Kernel Documentation, https://docs.kernel.org/networking/pktgen.html. Accessed 29 May 2023.

[iperf999] Tirumala, Ajay. "Iperf: The TCP/UDP bandwidth measurement tool." http://dast.nlanr.net/Projects/Iperf/ (1999).

[nuttcp2014] Tierney, Brian. "Experiences with 40G/100G applications." Berkeley: ESnet (2014).

[upasani2023] Upasani, Gaurang, et al. "Fifty Years of ISCA: A data-driven retrospective on key trends." arXiv preprint arXiv:2306.03964 (2023).

[liu2019] Liu, Jianshen, Philip Kufeldt, and Carlos Maltzahn. "Mbwu: Benefit quantification for data access function offloading." High Performance Computing: ISC High Performance 2019 International Workshops, Frankfurt, Germany, June 16-20, 2019.

37 of 74


[netperf2012] Jones, Rick. "Netperf benchmark." http://www.netperf.org/ (2012).

[arrow2023] Richardson N, Cook I, Crane N, Dunnington D, François R, Keane J, Moldovan-Grünfeld D, Ooms J, Apache Arrow (2023). arrow: Integration to 'Apache' 'Arrow'.

[dpdk] Data Plane Development Kit (DPDK). Linux Foundation, https://www.dpdk.org/. Accessed 31 May 2023.

[ulmer2018] Ulmer, Craig, et al. "Faodel: Data management for next-generation application workflows." Proceedings of the 9th Workshop on Scientific Cloud Computing. 2018.

[ioannidis1996] Ioannidis, Yannis E. "Query optimization." ACM Computing Surveys (CSUR) 28.1 (1996): 121-123.

[substrait2023] Substrait: Cross-Language Serialization for Relational Algebra. https://substrait.io/. Accessed 2 June 2023.

[datasketches] Rhodes, Lee, et al. Apache DataSketches: A Software Library of Stochastic Streaming Algorithms. https://datasketches.apache.org/. Accessed 9 Apr. 2023.

[kufeldt2018] Kufeldt, Philip, et al. "Eusocial Storage Devices-Offloading Data Management to Storage Devices that Can Act Collectively." ; login: The USENIX Magazine 43.2 (2018): 16-22.

[miano2018] Miano, Sebastiano, et al. "Creating complex network services with ebpf: Experience and lessons learned." 2018 IEEE 19th International Conference on High Performance Switching and Routing (HPSR). IEEE, 2018.

[menetrey2021] Ménétrey, Jämes, et al. "Twine: An embedded trusted runtime for webassembly." 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 2021.

[kung1989] Kung, H. T. "Network-based multicomputers: redefining high performance computing in the 1990s." (1989).

[tseng2016] Tseng, Hung-Wei, et al. "Morpheus: Creating application objects efficiently for heterogeneous computing." ACM SIGARCH Computer Architecture News 44.3 (2016): 53-65.

[boboila2012] Boboila, Simona, et al. "Active flash: Out-of-core data analytics on flash storage." 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 2012.

[chakraborty2022] Chakraborty, Jayjeet, et al. "Skyhook: Towards an arrow-native storage system." 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, 2022.

[hennessy2017] Hennessy, John L., and David A. Patterson. Computer Architecture, Sixth Edition: A Quantitative Approach. 6th ed., Morgan Kaufmann Publishers Inc., 2017.

[samuels2019] Samuels, Allen. The Consequences of Infinite Storage Bandwidth. 27 Apr. 2016, https://www.youtube.com/watch?v=-X9BuepxGko.

[wikibitrates] Wikipedia contributors. List of Interface Bit Rates --- Wikipedia, The Free Encyclopedia. 2019.

[prickett2021] Prickett, Nicole Hemsoth. “Testing the Limits of the BlueField-2 SmartNIC.” The Next Platform, 24 May 2021, https://www.nextplatform.com/2021/05/24/testing-the-limits-of-the-bluefield-2-smartnic/.

38 of 74


[liu2020] Liu, Jianshen, and Matthew Leon Curry. "Scale-out edge storage systems with embedded storage nodes to get better availability and cost-efficiency at the same time." HotEdge'20 (2020).

39 of 74

Performance of deserialization + serialization vs. memcpy on the BlueField-2


131072 rows, 9 MiB object size

  • Similar performance even with small objects (< 10 MiB)
  • Leverage Apache Arrow's (de)serialization efficiency
  • Use our serialization model for object merge time prediction

40 of 74

Evolving Trends in Computer Hardware


Hardware Trends

  • Performance growth in computer hardware is not uniform
  • Big data fuels an increasing demand for computing power in more applications

Background

* The CPU performance is depicted in relation to the VAX 11/780 using the SPEC integer benchmarks.

[hennessy2017, samuels2019, wikibitrates]

Ceased improvement in single-core performance

Expedited improvement in network and storage

  • Trade functionality for performance
    • Asymmetric Processors (E.g., P-Cores and E-Cores)
    • Domain-specific Hardware (Embedded Systems)

GPU

DPU

41 of 74

Challenges in General-purpose Computing


  • The “Power Wall”
    • Increased power consumption threatens thermal runaway
    • Clock frequencies have plateaued around 4 GHz since 2006 [nikolic2022]
  • Multicore Architecture (confronting “dark silicon” after the end of Dennard scaling)

  • Energy Consumption
    • Data centers
      • US data centers: 61B kWh to 100B kWh (2006 - 2011) [rong2016]
      • Instances: rose 550% (2010 - 2018) [masanet2020]
    • Refrigeration systems: 40% of energy consumption
    • Power bills: substantial expense [dayarathna2015]
    • ML/AI compute: doubling every 3.4 months since 2012 [openai2018]
  • Environmental Impacts
    • Carbon: Energy = Emissions
    • Manufacturing > Operations [gupta2021]
  • Ratio of the parallelizable portion (Amdahl’s law)
  • Memory-bound workloads
  • Lock contention
  • False sharing
  • Inter-core communication

Dark silicon patterns on a chip

[karkar2022]

Background

Hardware Trends

42 of 74

Trends in Hardware Design Paradigms


Trading functionality for performance

  • Asymmetric Processors
    • E.g., P-Cores and E-Cores
  • Domain-specific Hardware (Embedded Systems)
    • E.g., GPU, TPU, DPU, FPGA
    • Power efficiency
    • Performance benefits stay ahead of next-gen general-purpose processors

P-Cores

E-Cores

Intel Raptor Lake Processor Die Shot

GPU

DPU

FPGA

Background

Hardware Trends

43 of 74

Composable Data Services


Essential building blocks for applications to implement data resiliency, availability, validity, and curation

  • Provide richer views of the underlying data in web applications
  • Facilitate data storage, organization, and processing in HPC applications
  • Secure data in storage and network services

Hardware Trends ║ Data Services

Types of Data Services [kufeldt2018]

Data services require frequent data movement across system layers and components

Examples:

filesystems

key/value stores

data queries

Examples:

data distribution

data recovery

load balancing

Background

Involve compute, network, and energy overheads

44 of 74

Contributions


Dynamic Offloading

Strategies to exploit data service offloading benefits

Accelerate Serialization

Distribute Offload Planning

Distribute Data Processing

HPC-IODC ’19 HPEC ‘22 @

HotEdge ‘20 COMPSYS ‘23 *

SNL Report In-progress

@ Outstanding Student Paper Award

* Best Paper Award

Metrics for quantifying offloading benefits

Work Unit

Data Availability

Potential for offloading data services

Performance Gap Delineation

Network Processing Headroom

Data Partitioning Service

Parallel Processing Service

45 of 74

Opportunities for Key-value Offloading


Data Access Functions

Metrics

Basic Architecture Of RocksDB

Read/write amplification

RocksDB Data Access Overhead

YCSB Workload A: 50/50 put and get operations

Device throughput ⬆ 20% with 1 more thread

No significant improvement in RocksDB throughput

6x data access amplification

Offloading to embedded storage devices can return the amplified resource consumption to host applications
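The amplification figure can be expressed as a simple ratio; the byte counts below are illustrative, matching the 6x factor on the slide:

```python
def amplification(device_bytes, app_bytes):
    # Bytes the device actually moves per byte the application asked
    # for; LSM compaction and index/filter reads inflate this in
    # key-value stores like RocksDB.
    return device_bytes / app_bytes

# Illustrative: the application issues 1 GiB of puts/gets but the
# device ends up transferring 6 GiB.
amp = amplification(6 * 2**30, 1 * 2**30)  # 6.0
```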

Opportunity


46 of 74

Cost-benefit Quantification for Key-value Offloading


Software Stack

Hardware Infrastructure

  • YCSB serves as the key-value workload generator
  • RocksDB APIs are exposed through Java RMI framework
  • YCSB communicates with RocksDB via RPC
  • Host is 33x more expensive than the embedded storage devices
  • SATA SSD as the storage media

Metrics

Data Access Functions ⇒ MBWU ⇒ Evaluation

B

47 of 74

Metrics for Quantifying Benefits of Offloading

47

For optimizing an offloaded function

  • Commonly used in existing research
  • Rely solely on external characteristics (e.g., workload, cost, and energy)
  • Examples: MB/sec and Kops/sec, $/op, J/op

Type 1 Metrics

For optimizing an embedded device

  • Facilitate cost-performance optimization of a device
  • Incorporate the domain-specific resource efficiency of a device
    • Storage media efficiency of an embedded storage device
    • Network bandwidth efficiency of an embedded network interface card

Type 2 Metrics

Metrics

Data Access Functions ⇒ MBWU

B

48 of 74

MBWU: Data Access Function Efficiency

48

Different workloads and storage media lead to diverse cost-optimal placements of data access functions

Metrics

Data Access Functions ⇒ MBWU

B

Examples of data access functions:

  • get/put in key/value stores
  • read/write in filesystems
  • select/project in DBMS

If k < x, system B’s storage resource is underutilized; its cost-effectiveness could be improved by

  • Increasing the computing resources
  • Reducing the number of storage media

Media-based Work Unit (MBWU)

  • A throughput-oriented normalization metric
  • Normalizes workload performance relative to the storage media’s workload capacity
  • Excludes caching effects

49 of 74

MBWU-based Efficiency Metrics

49

Evaluate the cost incurred in enabling the performance of the storage media on a system

  • $/MBWU measures cost-efficiency
  • kWh/MBWU measures energy-efficiency
  • m³/MBWU measures space-efficiency
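As a worked example, plugging the case study’s numbers into $/MBWU (host MBWUs of 5.95 vs. 0.5 per embedded device, with the host 33x the device cost) yields the roughly 64% cost reduction; a minimal sketch in normalized cost units:

```python
# Worked $/MBWU comparison using the case study's numbers: the host
# delivers 5.95 MBWUs at 33x the cost of one embedded device, which
# delivers 0.5 MBWUs. Costs are in normalized units (device cost = 1).
def dollars_per_mbwu(cost_units, mbwus):
    return cost_units / mbwus

host = dollars_per_mbwu(33.0, 5.95)       # ~5.55 cost units per MBWU
embedded = dollars_per_mbwu(1.0, 0.5)     # 2.0 cost units per MBWU

reduction = 1 - embedded / host
print(f"reduction in $/MBWU from offloading: {reduction:.0%}")  # ~64%
```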

Metrics

Data Access Functions ⇒ MBWU

B

Device throughput ⬆ 20% with 1 more thread

No significant improvement in RocksDB throughput

6x data access amplification

Key-value Data Access Overhead

A case study

Cost-benefit Quantification for Key-value Offloading

50 of 74

Reduce Time to Insights!

50

Metrics

Data Access Functions ⇒ MBWU ⇒ Case Study

  1. Construct the MBWU for the key-value workload
  2. Evaluate the MBWUs of systems for comparison

Evaluation Complexity:

  • Pre-condition storage devices
  • Rate-control data initialization
  • Monitor system resource utilization
  • Identify peak performance under different offloading configurations

B

Hardware Infrastructure

33x more expensive

Software Stack

RocksDB via Java RMI

RPC YCSB workloads

51 of 74

Insights of Key-value Offloading Landscapes

51

57.9%

73.4%

45.9%

39.6%

70.7%

Network Tests

Integrated Tests

Disaggregated Tests

Reduction in $/MBWU

Reduction in kWh/MBWU

64%

  • Imbalance in server compute vs. storage allocation offers offloading opportunities
  • Changes in offloading configurations directly affect the offloading benefits
  • Disaggregating storage in servers amplifies the offloading benefits

MBWUs (host, embedded)

5.95, 0.5

5.2, 0.37

3.28, 0.37

Benefits

Metrics

Data Access Functions ⇒ MBWU ⇒ Case Study

B

52 of 74

Evaluation with Solid-state Drives

52

Impact of Compute Aggregation

Impact of Storage Aggregation

Background

Metrics

Data Access Functions ⇒ MBWU ⇒ Evaluation ║ Data Availability ⇒ Mathematical Model ⇒ Evaluation

⯈ server-based system has (m=) 10 servers

⯈ each server has (n=) 4 storage devices

relative benefit is 20.7

53 of 74

Data Availability

53

Challenges of small storage systems ➡ edge data centers

  • Environmental constraints (e.g., space, network and power)
  • Server maintenance costs rival redundancy provisioning expenses
  • Limited number of failure domains

Embedded storage increases data availability

  • Lower resource aggregation enhances node reliability
  • Denser deployment within the same cost/space/power limits
  • Failover mechanism spans more failure domains

Metrics

Data Access Functions ⇒ MBWU ⇒ Case Study ║ Data Availability

B

Failure Domains

Server-based Storage System

Embedded Storage System

54 of 74

Mathematically Evaluate the Data Availability Benefit

54

Develop a mathematical model to compare server storage systems to embedded storage systems

Metrics

Data Access Functions ⇒ MBWU ⇒ Case Study ║ Data Availability ⇒ Mathematical Model

B

Model Assumptions

Higher Storage Aggregation

Higher Compute Aggregation

Evaluation for Spinning Media

Evaluation for Solid-state Drives

55 of 74

Insights of Disaggregating Storage and Compute

55

Relative Benefit = Pdata-loss(server-based storage system) / Pdata-loss(embedded storage system)

Evaluate the relative probability of data loss between the server-based and embedded storage systems:

Higher Relative Benefits when:

  • Higher server storage aggregation in the server-based system
  • Higher server compute aggregation in the server-based system
  • Lower storage device failure rate in storage systems
    • Data loss risk shifts from storage to other components

Metrics

Data Access Functions ⇒ MBWU ⇒ Case Study ║ Data Availability ⇒ Mathematical Model ⇒ Insights

B

the greater the ratio, the better the embedded storage

56 of 74

Relative Benefit

56

Relative Benefit = Pdata-loss(server-based storage system) / Pdata-loss(embedded storage system)

Evaluate the relative probability of data loss between the server-based and embedded storage systems:

  • f: the failure rate of the storage component over that of the non-storage components
    • spinning media: f = 2 [vishwanath2010]
    • solid-state drive: f = 0.06 [xu2019lessons]
  • w: the number of nodes that have a replica set relationship with a node
    • w = 4 [cidon2013]
  • m: # of general-purpose servers
  • n: the ratio of storage aggregation (# of storage devices in a server)
  • c: the ratio of compute aggregation (# of embedded storage devices / # of servers)
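Under the stated assumptions (copyset replication, independent Poisson node failures, 3-way replication), the relative-benefit ratio can be illustrated with a toy calculation; all rates below are hypothetical, and this is not the dissertation’s exact model:

```python
# Toy illustration of the relative-benefit ratio; NOT the dissertation's
# exact model. With copyset replication, data is lost only if all three
# replicas of a copyset fail together; node failures are Poisson, so
# P(node down within window t) = 1 - exp(-lam * t).
import math

def p_copyset_loss(lam, t, replicas=3):
    p_node = 1 - math.exp(-lam * t)
    return p_node ** replicas

# Hypothetical failure rates (per hour): a server aggregating many
# devices fails more often than a small embedded storage node.
p_server = p_copyset_loss(lam=1e-4, t=24)
p_embedded = p_copyset_loss(lam=2e-5, t=24)

relative_benefit = p_server / p_embedded  # ~ (1e-4 / 2e-5)**3 = 125
print(f"relative benefit ~ {relative_benefit:.0f}")
```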

Metrics

Data Access Functions ⇒ MBWU ⇒ Evaluation ║ Data Availability ⇒ Mathematical Model ⇒ Evaluation

B

the greater the ratio, the better the embedded storage

57 of 74

Mathematically Evaluate the Data Availability Benefit

57

[Model equations lost in export; they express the data-loss probabilities in terms of the ratio of failure rates, the ratio of computing performance, the ratio of storage performance, and 3-way replication]

Develop a mathematical model to compare a storage system built with general-purpose servers to one built with embedded storage nodes.

Metrics

Data Access Functions ⇒ MBWU ⇒ Evaluation ║ Data Availability ⇒ Mathematical Model

  • Storage Node: nodes of the same type have the same configuration
  • External Redundancy: all nodes have the same network and power redundancy
  • Data Redundancy: 3-way replication
  • Replication Schema: copyset scheme [cidon2013] ➡ reduces the number of joint failure domains that share replicas
  • Failure Correlation: independence of node failures ➡ model hardware failures with a Poisson distribution

System Configuration Assumptions

Model Parameters Assumptions

B

58 of 74

Evaluation with Spinning Media

58

Impact of Compute Aggregation

Impact of Storage Aggregation

Higher Storage Aggregation

Higher Compute Aggregation

c = n = 4 ➡ embedded storage system has (10x4=) 40 devices

⯈ server-based system has (m=) 10 servers

⯈ each server has (n=) 4 storage devices

relative benefit is 7.1

⯈ each server has 12 storage devices

⯈ server-based system has (m=) 10 servers

⯈ embedded storage system has (17x10=) 170 devices

relative benefit is 114.3

Metrics

Data Access Functions ⇒ MBWU ⇒ Evaluation ║ Data Availability ⇒ Mathematical Model ⇒ Evaluation

B

59 of 74

59

The discovered potentials and new metrics should guide offloading by aligning service needs with those potentials.

B

Potential

M

60 of 74

Processing Headroom Evaluation

60

The Host (two AMD EPYC 7542 CPUs @ 2.9 GHz, 512 GB DRAM):

  • saturates the full 100 Gbps bandwidth with 5 vCPU cores at a packet size of 832 B
  • leaves less than 1% processing headroom on those 5 vCPUs, while 123 vCPUs remain available
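The headroom figures above follow from simple arithmetic over the link rate and core count (EPYC 7542: 32 cores / 64 threads per socket):

```python
# Arithmetic behind the headroom numbers: packet rate needed to saturate
# 100 Gbps with 832 B packets (ignoring framing overhead), and vCPUs left
# over on the dual EPYC 7542 host.
link_bps = 100e9
pkt_bytes = 832
pps = link_bps / 8 / pkt_bytes      # ~15 million packets per second

total_vcpus = 2 * 64                # two 64-thread sockets
busy_vcpus = 5                      # vCPUs driving the network processing
free_vcpus = total_vcpus - busy_vcpus
print(f"{pps / 1e6:.1f} Mpps; {free_vcpus} vCPUs remain available")
```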

Potential

Platform ║ Performance Delineation ║ Network Process Headroom

M

61 of 74

Bitar Implementation

61

[Diagram: Bitar processing flow]

  ❶ Slice the input buffer into segments (seg size < 64 KB), each mapped to source mbufs (an mbuf chain allocated in huge pages)
  ❷ Assemble the output buffer from destination mbufs (an mbuf chain) drawn from a preallocated memzone pool (from huge pages)
  ❸ Enqueue a burst of operations (each operation holds the two mbuf chains plus metadata) from a preallocated ops pool
  ❹ Dequeue the burst of completed operations; the hardware moves the data via DMA
  ❺ Recycle operations and mbufs back to their preallocated pools
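The flow can be sketched in plain Python (names hypothetical; the real implementation drives DPDK hardware queues with mbuf chains):

```python
# Sketch of Bitar's five-step flow (hypothetical names; the real
# implementation enqueues/dequeues bursts on DPDK hardware queues).
from collections import deque

SEG_SIZE = 64 * 1024  # each segment must stay under 64 KB

def slice_input(buf):                         # step 1: slice input buffer
    return [buf[i:i + SEG_SIZE] for i in range(0, len(buf), SEG_SIZE)]

class CompressEngine:
    """Stands in for a hardware compression queue."""
    def __init__(self):
        self.queue = deque()

    def enqueue_burst(self, ops):             # step 3: enqueue operations
        self.queue.extend(ops)

    def dequeue_burst(self):                  # step 4: dequeue completed ops
        done, self.queue = list(self.queue), deque()
        return done

engine = CompressEngine()
segments = slice_input(b"x" * 200_000)
ops = [(seg, bytearray(len(seg))) for seg in segments]  # step 2: output bufs
engine.enqueue_burst(ops)
completed = engine.dequeue_burst()
print(len(segments), len(completed))          # step 5: then recycle buffers
```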

62 of 74

Particle Data Flows

62

B

Potential

Platform ║ Performance Delineation ║ Network Process Headroom ║ Data Partitioning

  • HPC applications often require complex network services for particle data
  • Discrepancies exist between data producers and consumers

Airplane data from spatial to temporal representation

Partition &

Distribute

SmartNIC

Host

SmartNIC

SmartNIC

Data reorganization with SmartNICs

M

Embedding Distributed Data Reorganization

  • Use Apache Arrow [arrow2023] for particle data representation to simplify movement and processing
    • serialization, compression, partition, projection, aggregation, etc.
  • Use SmartNICs as distributed processing elements

63 of 74

Bitar: Optimizing Data Compression for Serialization

63

M

P

Strategies

B

Bitar

  • Compression plays a crucial role in data communication
    • Increase the effective data throughput of the network
    • Compression codec support is in many I/O libraries (e.g., Apache Kafka, Apache Arrow)
  • How to build a general interface for data services to accelerate compression?
    • Hide the communication complexity with the hardware
    • Explore the hardware capabilities (e.g., multiqueue and multithreading)
    • Ensure performance efficiency (e.g., minimize memory allocation)

Implementation https://github.com/skyhookdm/bitar

  • Build on top of DPDK [dpdk] and Apache Arrow
  • Automatically map input data to DPDK buffers
  • Chain operations to improve performance
  • Maintain memory pools to reduce allocation overhead

Features

  • Features zero-copy (a)synchronous API
  • Supports multi-core/multi-device processing
  • Runs on either the host or the BlueField-2 card
  • Hardware/software compatible compression

64 of 74

Serialization Performance with Multi-threaded Compression

64

M

P

Strategies

B

Bitar ⇒ Performance

Evaluate software vs. hardware compression (Bitar) performance impact on data serialization

  • Input data uses Arrow IPC “airplanes” reference dataset

8.6x

5.7x

2.8x

1.9x

Hardware compression beats the performance of 35 host software threads

Hardware compression significantly improves serialization performance

Hardware compression provides a comparable compression ratio for particle data

Bitar simplifies the use of compression hardware to improve the performance of data services offloaded to SmartNICs.

65 of 74

Embedded Processing Pipeline

65

M

P

Strategies

B

Bitar ⇒ Performance ║ Embedded Processing Pipeline

  • How can we construct an environment to host the data service on SmartNICs?
  • Requirements on communication and computation
    • Addressability: Efficient endpoint access
    • Configurability: Customizable resource mapping and computation dispatch
    • Adaptability: Optimized data processing through common data representation

Compute node A

Compute node B

Dispatch data processing workload at runtime

Resource group

Explore computation with parallelization

Workload with common data representation

  • Prototype: Distributed Particle Sifting
    • Faodel: provide globally accessible endpoints, resource group policy, and computation dispatching
    • Apache Arrow: provide a standardized, efficient tabular data format with flexible data processing APIs

66 of 74

Distributed Particle Sifting

66

  • Simulation tasks inject particle data into local SmartNICs
  • SmartNICs sift particles using a Log-Structured Merge (LSM) tree
  • Particles are split and transferred to the next SmartNIC level during compaction
  • Tested on a 100-node BlueField-2 SmartNIC cluster
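The split-on-compaction step can be sketched as follows (hypothetical structure; the real pipeline uses Faodel and Arrow on BlueField-2):

```python
# Minimal sketch of split-on-compaction sifting across SmartNIC levels
# (hypothetical structure; the real pipeline uses Faodel and Arrow).
def compact_and_split(memtable, boundary):
    """Sort the memtable; particles past the boundary are forwarded to
    the next SmartNIC level, the rest stay local."""
    keep = sorted(p for p in memtable if p < boundary)
    forward = sorted(p for p in memtable if p >= boundary)
    return keep, forward

level0 = [0.8, 0.1, 0.5, 0.9, 0.3]          # injected particle keys
keep, forward = compact_and_split(level0, boundary=0.5)
print(keep, forward)  # [0.1, 0.3] [0.5, 0.8, 0.9]
```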

For 64M particles, the overall transfer rate is 1.32 GB/s

Local Injection performance

Transfer to the local SmartNIC takes 81% of the overall injection time

Particle sifting performance

32-core host systems are roughly 4x faster than the SmartNICs

M

P

Strategies

B

Bitar ⇒ Performance ║ Embedded Processing Pipeline

67 of 74

Estimating the Cost of Query Workloads — Network Transfer

67

Network Transfer Cost

  • Decouple network transfer cost from intricate data details
    • Offloaded workloads: Particle Table ➡ Data Size ➡ Transfer Time
    • Pushed-back workloads: aggregate the data object size
  • Collect training data via round-trip time of different-sized buffer requests to SmartNIC
  • Prediction error rates for table size and transfer time are within 6% and 7%, respectively
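The round-trip-time training step can be approximated, under the simplifying assumption of a linear latency-plus-bandwidth cost, with an ordinary least-squares fit (sample points below are synthetic):

```python
# Approximate the transfer-time predictor with ordinary least squares,
# assuming time = latency + size / bandwidth. Sample points are synthetic
# stand-ins for the round-trip measurements against the SmartNIC.
def fit_linear(samples):
    """OLS fit for time = a + b * size over (size, time) pairs."""
    n = len(samples)
    sx = sum(s for s, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(s * s for s, _ in samples)
    sxy = sum(s * t for s, t in samples)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

samples = [(1e6, 0.9), (4e6, 1.5), (16e6, 4.1), (64e6, 13.8)]  # (bytes, ms)
a, b = fit_linear(samples)

def predict_ms(size_bytes):
    return a + b * size_bytes

print(f"predicted transfer time for 32 MB: {predict_ms(32e6):.1f} ms")
```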

Prediction performance for table size

Prediction performance for transfer time

M

P

Strategies

B

Bitar ⇒ Performance ║ Embedded Processing Pipeline ║ Dynamic Offloading ⇒ Cost Estimation

68 of 74

Takeaway

68

Why

What

How

Metrics

MBWU ║ Data Availability ║ Takeaway

These metrics and methods provide a new toolkit for evaluating the cost-benefit of offloading data services, and they remain applicable to future embedded systems.

B

69 of 74

Takeaway

69

Why

What

How

B

M

Potential

Platform ║ Performance Delineation ║ Network Process Headroom ║ Data Processing Services ⇒ Performance ║ Takeaway

Modern SmartNICs are powerful enough to handle specific data services by aligning needs with potentials, providing desirable resource isolation and locality benefits.

70 of 74

Takeaway

70

Why

What

How

M

P

Strategies

B

Bitar ║ Embedded Processing Pipeline ║ Dynamic Offloading ⇒ Cost Estimation ⇒ Case Study ║ Takeaway

Efficient solutions for communication, computation, and scheduling complexities are essential to harness the benefits of offloading data services to embedded devices.

71 of 74

Enabling Dynamic Offloading of Data Service Workloads

71

M

P

Strategies

Bitar ║ Embedded Processing Pipeline ║ Dynamic Offloading

  • Potential for overloading resource-constrained embedded systems ➡ Need dynamic offloading
  • SmartNICs, not hosts, decide when to offload
  • Design goals of decision engine
    • Flexible Specification: Define workloads with a flexible, architecture-neutral specification (Substrait [substrait2023])
    • Job Size Estimation: Estimate based on the workload definition and referred data (Cardinality Estimator)
    • Adaptable Decision: Fine-tune offloading decisions based on resource availability (Predictive Model)
    • Efficient Decision: Make offloading decision efficiently
  • Ideal dynamic offloading use case: data query workloads (Substrait [substrait2023])

Efficiently estimate the job sizes then feed into the decision engine

Dissect workloads for offloading decision-making to maximize benefits

Refer to remote data sources

Serialize to architecture-agnostic, concise workload definition

72 of 74

Prediction Performance

72

Query execution time prediction with execution context information

Extremely efficient, with a relatively low error rate

Query types evaluated: aggregation, filtering with a reducible condition, simple filtering, projection

M

P

Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation

73 of 74

Case Study I

73

select * from particles where x >= 0.7 and y < 0.3 and z <= 0.1

query

6,177,731 rows

table

Cardinality Estimator

Estimated output rows: 55,036.7 (0.865% difference)

Actual output rows: 55,517 (selectivity 0.9%)

  • Host system: two Intel Xeon 16-core E5-2698 CPUs running at 2.30GHz and 512 GB of memory
  • Host measurements and estimates utilize 32 threads, while the BlueField-2 SmartNIC uses 6 threads
  • Offloading is 74.6% faster than pushing back due to low-percentage selectivity
  • Main overhead of pushing back stems from network transfer
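The estimate above can be sanity-checked with a simple independence assumption (coordinates independently uniform on [0, 1); an assumption of this sketch, not necessarily of the actual estimator):

```python
# Back-of-the-envelope check of the cardinality estimate. This sketch
# assumes particle coordinates are independently uniform on [0, 1);
# the real estimator works from collected statistics.
rows = 6_177_731

# select * from particles where x >= 0.7 and y < 0.3 and z <= 0.1
selectivity = (1 - 0.7) * 0.3 * 0.1   # 0.009, i.e. 0.9%
estimate = rows * selectivity
print(f"estimated rows ~ {estimate:,.0f}")  # close to the actual 55,517
```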

Transfer Table Size | Model Prediction | Error
3.79 MiB | 3.81 MiB | 0.49%
418.19 MiB | 424.19 MiB | 1.42%

M

P

Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation ⇒ Case Studies

74 of 74

Case Study II

74

select * from particles where x >= 0.5 and y < 0.55 and z <= 0.67

query

6,177,731 rows

table

Cardinality Estimator

Estimated output rows: 1,152,860 (1.41% difference)

Actual output rows: 1,136,847 (selectivity 18.4%)

Crossover point:

  • Estimation suggests pushing back
  • Offloading is only 1.38% faster than pushing back due to higher-percentage selectivity

Transfer Table Size | Model Prediction | Error
75.35 MiB | 78.06 MiB | 3.48%
418.19 MiB | 424.19 MiB | 1.42%
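A minimal sketch of the resulting decision, using case study II’s predicted transfer sizes with hypothetical bandwidth and compute times (the dissertation’s cost model is richer than this):

```python
# Hedged sketch of the offload vs. push-back decision. Transfer sizes are
# case study II's predictions; bandwidth and compute times are hypothetical
# stand-ins for the learned cost model.
def total_cost_s(transfer_mib, compute_s, bw_mib_s=2_000):
    """Transfer time plus compute time, in seconds."""
    return transfer_mib / bw_mib_s + compute_s

# Offload: SmartNIC filters (slower compute) but ships only the result.
offload = total_cost_s(78.06, compute_s=0.22)
# Push back: host filters (faster compute) but the raw table must move.
pushback = total_cost_s(424.19, compute_s=0.05)

gap = abs(offload - pushback) / max(offload, pushback)
choice = "offload" if offload < pushback else "push back"
print(f"{choice}, options only {gap:.1%} apart")  # near the crossover point
```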

M

P

Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation ⇒ Case Studies