1 of 74

Extending Composable Data Services to the Realm of Embedded Systems

Jianshen Liu

University of California, Santa Cruz
Dissertation Defense @ June 21, 2023

2 of 74

Challenges and Opportunities for Modern Data Processing


Data processing on general-purpose systems faces performance, cost, and energy challenges as modern applications increasingly rely on larger data volumes.

* The International Symposium on Computer Architecture

Word cloud of most common words in abstracts for ISCA* 1983 - 2023

[upasani2023]

Embedded systems present opportunities for optimization by offloading data-processing services.

But: the cost/benefit landscape for offloading is complex.

3 of 74

Types of Data Service Offloading


Examples:

filesystems

key/value stores

data queries

Examples:

data distribution

data recovery

load balancing

Data services require frequent data movement across system layers and components [kufeldt2018]

Example benefits of offloading data services

  • Minimize data movement cost to host systems
  • Maximize resource availability for host applications
  • Leverage efficient resources and mechanisms for data movement
  • Allow for service isolation without additional infrastructure

Involve compute, network, and energy overheads

4 of 74

Hypotheses and Scope


Scope

Provide a set of new tools to estimate and realize the benefits of offloading composable data services to embedded systems.

Essential building blocks for applications to implement data availability, validity, and curation

Hypothesis #1: Quantifying the benefits of data service offloading requires multiple and specialized evaluation metrics.

Hypothesis #2: Data service offloading has to be carefully tailored based on the function, the data, and the fluctuating and highly specialized resource availability on embedded systems.

Hypothesis #3: Harnessing the benefits of data service offloading requires efficient mechanisms for communication, computation, and scheduling.

Why: Metrics for Evaluation

What: Potential for Offloading

How: Strategies to Offload

5 of 74

Related Publications


  • [HPC-IODC ‘19] MBWU: Benefit quantification for data access function offloading. Jianshen Liu, Philip Kufeldt, and Carlos Maltzahn
  • [HotEdge '20] Scale-out edge storage systems with embedded storage nodes to get better availability and cost-efficiency at the same time. Jianshen Liu, Matthew Leon Curry, Carlos Maltzahn, and Philip Kufeldt
  • [SNL Report 2021] Performance Characteristics of the BlueField-2 SmartNIC. Jianshen Liu, Carlos Maltzahn, Craig D. Ulmer, Matthew Leon Curry
  • [HPEC ‘22] Processing Particle Data Flows with SmartNICs (Outstanding Student Paper). Jianshen Liu, Carlos Maltzahn, Matthew Leon Curry, Craig Ulmer
  • [COMPSYS ‘23] Extending Composable Data Services into SmartNICs (Best Paper). Craig Ulmer, Jianshen Liu, Carlos Maltzahn, Matthew Leon Curry
  • WIP for [HPEC ’23] (work on coordinating processing offloading)

(Picked up by The Next Platform)

[prickett2021]

6 of 74

Contributions


Why

What

How

before advancement

after advancement

7 of 74

Contributions


Why

What

How

Metrics

These metrics and methods provide a new toolkit for evaluating the cost-benefit of offloading data services, one that also applies to future embedded systems.

8 of 74

Quantifying the Benefits of Offloading [liu2019] collaborated with Philip Kufeldt


Metrics

MBWU

  • Throughput of a given workload for a given single device (without offloading)
  • Example: the non-offloaded throughput of a given workload on a single storage device is called a Media-Based Work Unit (MBWU)

Work Unit (WU)

As new devices emerge at increasing frequency

Automated determination of a (Media-Based) Work Unit

As long as the workload and device are fixed, we can compare throughput with and without offloading.
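The comparison this enables can be sketched as follows; the throughput numbers are illustrative, not measurements from the dissertation:

```python
def mbwus_per_second(workload_tput, single_device_tput):
    # One MBWU/s is the throughput one storage device sustains for this
    # workload, so the ratio counts device-equivalents of delivered work.
    return workload_tput / single_device_tput

# Hypothetical numbers (Kops/s):
baseline = 120.0     # one device, workload run directly on it
host_sys = 900.0     # host-based system without offloading
offloaded = 840.0    # embedded system with the function offloaded

host_mbwu = mbwus_per_second(host_sys, baseline)       # 7.5 MBWU/s
offload_mbwu = mbwus_per_second(offloaded, baseline)   # 7.0 MBWU/s
```

With the workload and device fixed, the two MBWU/s figures are directly comparable even though the systems differ.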

9 of 74

Data Availability

9

Metrics

MBWU ║ Data Availability

Challenges of small storage systems ➡ edge data centers

  • Environmental constraints (e.g., space, network and power)
  • Server maintenance costs rival redundancy provisioning expenses
  • Limited number of failure domains due to small deployment sizes

The more independent failure domains a failover mechanism spans, the more available the data becomes.

Server-based Storage System

Embedded Storage System

Embedded storage enables more failure domains under the same cost/space/power restrictions.

  • The higher the storage aggregation in servers, the greater the data availability benefit of embedded storage.
  • The higher the compute aggregation in servers, the greater the data availability benefit of embedded storage.
  • The lower the failure rate of storage devices used in storage systems, the greater the data availability benefit of embedded storage.
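The failure-domain argument can be sketched with a minimal availability model; independent domain failures and the probability used are illustrative assumptions:

```python
def unavailability(p_fail, replicas, domains):
    # Each failure domain fails independently with probability p_fail.
    # Replicas packed into the same domain share its fate, so only the
    # number of distinct domains spanned improves availability.
    return p_fail ** min(replicas, domains)

# 3 replicas squeezed onto 2 servers vs. spread over 3 embedded nodes:
server_based = unavailability(0.01, replicas=3, domains=2)  # ~1e-4
embedded = unavailability(0.01, replicas=3, domains=3)      # ~1e-6
```

Under the same cost/space/power budget, more (smaller) failure domains shrink the probability that every copy is down at once.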

10 of 74

Contributions


Why

What

How

Metrics

These metrics and methods provide a new toolkit for evaluating the cost-benefit of offloading data services, one that also applies to future embedded systems.

11 of 74

Contributions


Why

What

How

Metrics

Potential

Realizing benefits goes beyond the use of specialized metrics. It requires exploring the potential of embedded systems in both computation and communication dimensions.

12 of 74

Evaluation Platform (SmartNIC)


Potential

  • Programmable NICs have been shown to optimize HPC/HPDA workflows [kung1989]
  • Programmable NICs are already capable of north-south and east-west communication
    • As both a specific use case and a placeholder for other future embedded systems
  • Multiple vendors have created powerful SmartNICs
    • Roadmaps show rapid performance improvement (10x every two years!)
    • Emerging HPC platforms include SmartNICs
    • Existing market for SmartNICs to enable traditional storage devices to become computational

NVIDIA BlueField-2 SmartNIC

  • 100 Gb/s InfiniBand
  • 8x Arm A72 cores @ 2.75 GHz
  • 16 GB DRAM
  • 60 GB Flash
  • Accelerators: compression, encryption, regular expression
  • Ubuntu 20.04
  • Operation modes: embedded function, separated


Platform

13 of 74

Performance Comparison using Microbenchmarks


Potential

Platform ║ Performance Delineation

  • Testing with realistic workloads often misses device pros and cons
  • Microbenchmarks focus on specific operations
  • Microbenchmarks serve as placeholders for offloaded and non-offloaded functions

Stress-ng Microbenchmark [ubuntustressng]

  • 250 stressors are designed to cover a broad spectrum of system operations (e.g., mmap, vecmath, sock)
  • Stressors are categorized into classes (e.g., OS, VM, FILESYSTEM, and SCHEDULER)

Operation latency-based performance metrics are not directly comparable


(executed on the SmartNIC vs. on the host)

Normalize the stressor results relative to the Raspberry Pi 4B’s performance

Raspberry-Pi-based Work Unit, or RPWU
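The RPWU normalization can be sketched as follows; the rates are illustrative bogo-ops/s values, not measured results:

```python
def to_rpwu(results, rpi4_results):
    # Express each stressor's rate as a multiple of the Raspberry Pi
    # 4B's rate for the same stressor (1 RPWU), making latency-based
    # results from different machines comparable.
    return {name: rate / rpi4_results[name]
            for name, rate in results.items() if name in rpi4_results}

rpi4 = {"vecmath": 100.0, "mmap": 50.0}
smartnic = {"vecmath": 250.0, "mmap": 40.0}
normalized = to_rpwu(smartnic, rpi4)  # {'vecmath': 2.5, 'mmap': 0.8}
```

A value above 1 RPWU means the device beats the Raspberry Pi 4B on that stressor; below 1, it trails.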

14 of 74

Performance Characterization


Compare BlueField-2 SmartNIC with a variety of host systems available at CloudLab [ricci2014]

Triangle data points denote the relative performance of the BlueField-2 SmartNIC

Potential

Platform ║ Performance Delineation


Offloading Potentials

BlueField-2 SmartNIC exhibits advantageous performance compared to hosts in:

  • Memory contention operations
  • Hardware accelerated operations (e.g., cryptographic and compression)
  • IPC operations
  • SIMD vectorized operations

15 of 74

Network Processing Headroom


Potential

Platform ║ Performance Delineation ║ Network Process Headroom

  • Determines maximum CPU time for offloaded functions while preserving network performance
  • Impacted by hardware configuration and the overhead of different networking stacks
  • Needs user- or kernel-space benchmarks to evaluate upper-bound performance
  • “Processing Headroom” is also relevant for storage devices if “non-computational” storage steals cycles from on-device processors

Linux pktgen [linuxpktgen]

  • Generates/injects UDP packets in kernel space with multi-core parallelism and multi-queue support


Evaluation Method

  1. Find the minimum thread count that maximizes bandwidth for the given packet and burst sizes
  2. Measure the duration of “busy retry” by introducing a “delay” between bursts
  3. The maximum delay that does not impede throughput is the processing headroom

Yet another specialized metric for evaluation
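The evaluation steps above can be sketched as a search loop; the model below is a toy stand-in for an actual pktgen measurement, and its break-even point is chosen for illustration:

```python
def processing_headroom(measure_tput, peak_tput, burst_time_us,
                        tolerance=0.01, step_us=1):
    # Grow the inter-burst delay until throughput drops below
    # (1 - tolerance) of peak; the last good delay is the headroom.
    delay = 0
    while measure_tput(delay + step_us) >= peak_tput * (1 - tolerance):
        delay += step_us
    return delay, delay / burst_time_us

# Toy stand-in for a pktgen run: throughput holds until the idle gap
# exceeds 21 us of a 100 us burst, then collapses.
model = lambda d: 100.0 if d <= 21 else 90.0
delay_us, share = processing_headroom(model, peak_tput=100.0,
                                      burst_time_us=100)
# delay_us == 21; share == 0.21 of the burst time
```

In the real setup, `measure_tput` corresponds to rerunning pktgen with the given delay and reading the achieved bandwidth.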

16 of 74

Processing Headroom Evaluation


Potential

Platform ║ Performance Delineation ║ Network Process Headroom


The BlueField-2:

  • saturates only 60% of the bandwidth, even with all 8 cores, a large burst size, and the largest packet size (10 KiB)
  • at half bandwidth utilization, has a processing headroom of just 21% of the burst time

17 of 74

Processing Headroom Evaluation


The BlueField-2:

  • 78.5% of CPU time available with the kernel IP stack
  • 87.5% of CPU time available with DPDK

Potential

Platform ║ Performance Delineation ║ Network Process Headroom


Offloading Potentials

  • High-performance networking stacks are crucial to the performance of offloading functions
  • Specific hardware configurations can greatly boost offloaded function performance

with DPDK

with kernel IP stack

Performance Leap!

18 of 74

Offloading Potential for Collective Acting SmartNICs


What additional potential do SmartNICs have when they can act collectively?

Problem: data services on compute nodes compete for resources with applications, causing potential delays [tseng2016]

Opportunity: Offload to SmartNICs

But: Require SmartNICs to act collectively to expand processing capability

Platform ║ Performance Delineation ║ Network Process Headroom ║ Potential for E/W Offloading

Service Placement

  • In-situ: sacrifices application resources; causes fate sharing
  • In-vitro: adds communication overhead; increases node count
  • In-storage [chakraborty2022]: prevented by system policies
  • Co-locate data processing elements with hosts?

Background: HPC Scientific Computing Workflows

  • Simulation tasks periodically generate data output
  • Data consumers want customized data
  • Data consumers and producers differ in their expectations of data organization
  • Composable data services are responsible for data tailoring

Potential


19 of 74

Collective Acting SmartNICs


Software Stack

  • Data processing: Apache Arrow [arrow2023], with SIMD support, provides serialization, compression, projection, and aggregation on tabular data
  • Network transfer: Faodel [ulmer2018], built on RDMA, provides distributed-memory Key/Blob API and computation dispatching.

How do these data service libraries perform on the SmartNIC?

Expand in-transit data processing capability with multiple SmartNICs

Sorted by�position and time

Sorted by�ID and time

Offload data reorganization service and online query service

  • Log-structured merge (LSM) tree sorts data

Platform ║ Performance Delineation ║ Network Process Headroom ║ Potential for E/W Offloading

Potential


20 of 74

Particle Data Partitioning Performance


  • Algorithm: unpack tabular data, split and repack into 2-16 partitions based on particle ID
  • Time to process 1 GB of data on compute node* vs. BlueField-2
  • Evaluate on three “particle” datasets: scientific, airplanes, and ships
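A minimal sketch of the unpack/split/repack step; hash partitioning by particle ID is assumed here, and the per-partition compression stage is omitted:

```python
from collections import defaultdict

def partition_by_id(rows, num_partitions):
    # Split rows of (particle_id, payload) into per-partition lists
    # keyed by a hash of the ID (modulo here for simplicity).
    parts = defaultdict(list)
    for particle_id, payload in rows:
        parts[particle_id % num_partitions].append((particle_id, payload))
    return dict(parts)

rows = [(0, "a"), (1, "b"), (2, "c"), (5, "d")]
parts = partition_by_id(rows, 2)
# parts[0] holds IDs 0 and 2; parts[1] holds IDs 1 and 5
```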

Overhead for partitioning without compression

Host is ~4x faster than the SmartNIC

Timing breakdown on BlueField-2 with compression

Handling compression in software incurs significant overhead

Offloading Potential

  • Offload data compression to hardware accelerator

Platform ║ Performance Delineation ║ Network Process Headroom ║ Potential for E/W Offloading ⇒ Data Partitioning

Potential


21 of 74

Multi-threaded Service Performance


Bookkeeping overhead on the SmartNIC

Faodel LocalKV test uses multi-threading for put/get/delete operations on a 2D in-memory hashmap

32-core AMD EPYC 7543P

68-core

Query Arrow Data

  • Filtering query selects particle data in a ⅛ bounding box
  • Aggregation query calculates min/max squared magnitude velocities of particles
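The two queries can be sketched in plain Python; the box corner (0.5 per axis, giving 1/8 of the unit volume) and the field names are assumptions for illustration:

```python
def filter_bbox(particles, hi=0.5):
    # Filtering query: keep particles inside a cube covering 1/8 of
    # the unit volume (0.5 per axis).
    return [p for p in particles
            if p["x"] < hi and p["y"] < hi and p["z"] < hi]

def minmax_speed2(particles):
    # Aggregation query: min/max of the squared velocity magnitude
    # (the squared form avoids a sqrt per particle).
    mags = [p["vx"] ** 2 + p["vy"] ** 2 + p["vz"] ** 2 for p in particles]
    return min(mags), max(mags)

data = [
    {"x": 0.1, "y": 0.2, "z": 0.3, "vx": 1.0, "vy": 0.0, "vz": 0.0},
    {"x": 0.9, "y": 0.1, "z": 0.2, "vx": 0.0, "vy": 2.0, "vz": 0.0},
]
inside = filter_bbox(data)        # only the first particle
lo, hi_mag = minmax_speed2(data)  # (1.0, 4.0)
```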

32-core Xeon E5-2698

At 8 threads, the host is only 37.8% faster than the SmartNIC

Servers are roughly four times faster than the SmartNIC when using the same number of threads

Offloading Potentials

  • Can efficiently scale performance with parallelization
  • Sufficient to handle low-rate, asynchronous data processing tasks

Platform ║ Performance Delineation ║ Network Process Headroom ║ Potential for E/W Offloading ⇒ Data Partitioning ⇒ Parallel Processing

Potential


22 of 74

Contributions


Why

What

How

Metrics

Potential

Realizing benefits goes beyond the use of specialized metrics. It requires exploring the potential of embedded systems in both computation and communication dimensions.

23 of 74

Contributions


Why

What

How

Metrics

Potential

Strategies

Efficient solutions for communication, computation, and scheduling complexities are essential to harness the benefits of offloading data services to embedded devices.

24 of 74

Bitar: Optimizing Data Compression for Serialization



Strategies

Bitar

  • Compression has significant overhead but plays a crucial role in data communication
  • How to build a convenient interface between data services and a compression accelerator?
    • Hide the communication complexity with the hardware
    • Explore the hardware capability (e.g., multiqueue and multithreading)
    • Ensure performance efficiency (e.g., minimize memory allocation)
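A minimal sketch of the interface idea, with zlib standing in for the compression accelerator; a real Bitar-style build would post buffers to hardware queues and poll completions rather than call a software codec:

```python
import zlib

class CompressDevice:
    # Hides the accelerator behind a plain compress/decompress API so
    # data services never touch queue setup or completion handling.
    def __init__(self, level=1):
        self.level = level

    def compress(self, buf: bytes) -> bytes:
        return zlib.compress(buf, self.level)

    def decompress(self, buf: bytes) -> bytes:
        return zlib.decompress(buf)

dev = CompressDevice()
payload = b"particle-record," * 1024
packed = dev.compress(payload)
restored = dev.decompress(packed)  # round-trips losslessly
```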

Speedups shown in the figure: 8.6x, 5.7x, 2.8x, 1.9x

Bitar simplifies the use of compression hardware to improve the performance of data services offloaded to SmartNICs.

Bitar implementation

Ensure the hardware accelerator is efficiently utilized

25 of 74

Embedded Processing Pipeline



Strategies

Bitar ║ Embedded Processing Pipeline

  • How can we construct an environment to host data services on distributed SmartNICs?
  • Requirements for communication and computation
    • Addressability: Efficient endpoint access (Faodel)
    • Configurability: Customizable resource mapping and computation dispatch (Faodel)
    • Adaptability: Optimized data processing through common data representation (Apache Arrow)
  • Prototype: Distributed Particle Sifting
    • For reorganizing particle data
    • Uses an LSM tree to sort on a 100-node BlueField-2 SmartNIC cluster
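The merge step of the sifting pipeline can be sketched with sorted-run merging; Faodel transport and dispatch are omitted, and the tuples are illustrative (particle_id, time) pairs:

```python
import heapq

def sift(sorted_runs):
    # Merge per-node sorted runs into one globally ordered stream,
    # the LSM-style merge at the heart of the reorganization service.
    return list(heapq.merge(*sorted_runs))

runs = [[(1, 0.0), (3, 0.5)], [(2, 0.1), (3, 0.2)]]
merged = sift(runs)  # [(1, 0.0), (2, 0.1), (3, 0.2), (3, 0.5)]
```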

32-core host systems are roughly 4x faster than the SmartNICs

Particle sifting performance

* 100M particles

26 of 74

Enabling Dynamic Offloading of Data Service Workloads



Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning

  • Potential for overloading resource-constrained embedded systems
    • Need to dynamically offload online data query workloads based on overhead
  • Need SmartNICs and not hosts to decide when to offload
    • Data complexity affects the overhead of a query workload
    • SmartNICs manage the data, and the data is dynamic
  • Design goals of the decision mechanism (decision engine)
    • Flexible Specification: Define workloads with a flexible, architecture-neutral specification (Substrait [substrait2023])
    • Job Size Estimation: Estimate based on the workload definition and referred data (Cardinality Estimator)
    • Adaptable Decision: Fine-tune offloading decisions based on resource availability (Predictive Model)
    • Efficient Decision: Make offloading decisions efficiently
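The rule at the core of the decision engine can be sketched as a cost comparison; the estimates below are illustrative seconds, not the dissertation's numbers:

```python
def should_offload(est_smartnic_s, est_host_s, transfer_s):
    # Offload when running the query on the SmartNIC beats shipping
    # the data back to the host and running it there; the execution
    # estimates come from the per-system predictive models.
    return est_smartnic_s < est_host_s + transfer_s

offload = should_offload(2.0, est_host_s=0.5, transfer_s=3.0)   # transfer dominates
pushback = should_offload(5.0, est_host_s=0.5, transfer_s=1.0)  # pushing back wins
```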

27 of 74

Estimating the Cost of Query Workloads


  • Dynamic offloading scheduling
    • Reactive scheduling: could defer beneficial decisions per request
    • Predictive scheduling: commonly employed in database query optimizers [ioannidis1996]
  • Estimate time consumption for each component

Serialization Cost

  • Train a machine learning model using the random forest regression algorithm
  • BlueField-2 serialization prediction shows an error rate under 7%

Prediction performance for serialization time


Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation

deserialization

serialization

Deserialization incurs a constant cost, whereas serialization does not
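Fitting a serialization-cost model can be sketched as follows; an ordinary least-squares line stands in here for the random-forest regressor, and the training points are toy values:

```python
def fit_line(xs, ys):
    # Least-squares fit y = a*x + b over (batch size, time) samples.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# Toy training set: serialization time (ms) grows with batch size (MiB);
# deserialization would instead be modeled as a constant.
sizes = [1.0, 2.0, 4.0, 8.0]
times = [1.1, 2.0, 4.1, 7.9]
a, b = fit_line(sizes, times)
predicted = a * 4.0 + b  # close to the observed 4.1 ms
```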

28 of 74

Estimating the Cost of Query Execution


Query Execution Cost

  • Cardinality Estimator: Estimate the job size of a query workload
  • Execution Time Prediction: Based on job size and execution context (e.g., # of available threads)
  • Model Training: One per system type (e.g., host and BlueField-2)
  • Model Delivery: Shared to each SmartNIC for placement decision-making
  • Extract column statistics for each table using Apache DataSketches [datasketches] (e.g., histogram and distinct-counting statistics)
  • Update statistics as data changes
  • Support queries with filtering, projection, aggregation, reducible conditions, and multiple sources.

Implementing Cardinality Estimator

Estimate the # of output rows and the call count for each operation in a query
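What the estimator computes can be sketched with a single-column histogram; uniformity within bins is assumed, and the bin layout is illustrative:

```python
def estimate_rows(histogram, low, high):
    # Histogram-based output-row estimate for `low <= col < high`;
    # partially covered bins contribute proportionally. This sketches
    # what the DataSketches column statistics feed into.
    selected = 0.0
    for (lo, hi), count in histogram.items():
        overlap = max(0.0, min(hi, high) - max(lo, low))
        if overlap > 0:
            selected += count * overlap / (hi - lo)
    return selected

hist = {(0.0, 0.5): 600, (0.5, 1.0): 400}  # rows per value range
est = estimate_rows(hist, 0.25, 0.75)       # 0.5*600 + 0.5*400 = 500.0
```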


Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation

29 of 74

Prediction Performance



Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation

simple selection

project with compute function

filter with reducible condition

aggregate with compute function

Q1

Q3

Q4

Q7

Result from aggregation of multiple statistics' biases

Performance of Cardinality Estimation on Test Queries

Cost = CPU time for estimation / CPU time for query execution

Query Execution Time Prediction Performance (actual vs. estimated time)


Estimation based on available threads

30 of 74

Case Study I


select * from particles where

x >= 0.7 and y < 0.3 and z <= 0.1

query

6,177,731 rows

table

Cardinality Estimator

Estimated output rows: 55,036.7 (0.87% difference)

Actual output rows: 55,517 (selectivity 0.9%)

  • Host system: two Intel Xeon 16-core E5-2698 CPUs running at 2.30GHz and 512 GB of memory
  • Host measurements and estimates utilize 32 threads, while the BlueField-2 SmartNIC uses 6 threads
  • Offloading is 74.6% faster than pushing back due to low-percentage selectivity
  • Main overhead of pushing back stems from network transfer
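The selectivity and estimator-difference figures on this slide can be reproduced directly:

```python
actual_rows = 55_517
estimated_rows = 55_036.7
table_rows = 6_177_731

selectivity = actual_rows / table_rows                          # ~0.9%
estimate_err = abs(actual_rows - estimated_rows) / actual_rows  # ~0.87%
```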


Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation ⇒ Case Studies

Model Prediction

Transfer table size: 3.79 MiB / 3.81 MiB (offloaded), 418.2 MiB / 424.2 MiB (pushed-back)

Prediction differences: 10.3% (offloaded), 5.1% (pushed-back)

31 of 74

Case Study II


select * from particles where

x >= 0.5 and y < 0.55 and z <= 0.67

query

6,177,731 rows

table

Crossover point:

  • Estimation suggests pushing back
  • Offloading is only 1.38% faster than pushing back due to higher-percentage selectivity


Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation ⇒ Case Studies

Model Prediction

Transfer table size: 75.3 MiB / 78.1 MiB (offloaded), 418.2 MiB / 424.2 MiB (pushed-back)

Prediction differences: 1.9% (offloaded), 3.8% (pushed-back)

Cardinality Estimator

Estimated output rows: 1,152,860 (1.4% difference)

Actual output rows: 1,136,847 (selectivity 18.4%)

32 of 74

Conclusion


  • Embedded systems present opportunities to address data processing challenges with general-purpose systems
  • Exploiting data service offloading benefits requires navigating between complexities in data, services, and systems
  • Contributions: Tools and methods to estimate and realize efficient data services offloading
    • Introduce meaningful metrics for quantifying offloading benefits
    • Identify potentials for offloading
    • Propose strategies for device-side offload optimization

Metrics

Potential

Strategies

Conclusion

33 of 74

What's Next?


Metrics

Potential

Strategies

Conclusion

  • Cost-benefit Quantification for East-West Data Service Offloading
    • The Work Unit methodology might apply
    • Distributed data service measurement and system configuration issues
  • Query Performance with Dynamic Offloading
    • Apply strategies and evaluate dynamically offloading online query workloads
    • Adapt to runtime resources and congestion control issues
  • Security and Performance Isolation
    • Important to share embedded system resources among multiple tenants
    • Opportunities such as eBPF [miano2018] and WebAssembly [menetrey2021]

34 of 74

Jianshen Liu (jliu120@ucsc.edu)

University of California, Santa Cruz
Dissertation Defense @ June 21, 2023

Thank you!

Mentors: Carlos Maltzahn, Scott Brandt, Peter Alvaro, Craig Ulmer, Matthew Curry, Philip Kufeldt, Jeff LeFevre, Paul Stamwitz, Ike Nassi, Shel Finkelstein

SRL team: Ivo Jimenez, Noah Watkins, Michael Sevilla, Aldrin Montana, Jayjeet Chakraborty, Holly Casaletto, Saheed Adepoju, Esmaeil Mirvakili, Farid Zakaria

35 of 74

Related Publications


  • [HPC-IODC ‘19] MBWU: Benefit quantification for data access function offloading. Jianshen Liu, Philip Kufeldt, and Carlos Maltzahn
  • [HotEdge '20] Scale-out edge storage systems with embedded storage nodes to get better availability and cost-efficiency at the same time. Jianshen Liu, Matthew Leon Curry, Carlos Maltzahn, and Philip Kufeldt
  • [SNL Report 2021] Performance Characteristics of the BlueField-2 SmartNIC. Jianshen Liu, Carlos Maltzahn, Craig D. Ulmer, Matthew Leon Curry
  • [HPEC ‘22] Processing Particle Data Flows with SmartNICs (Outstanding Student Paper). Jianshen Liu, Carlos Maltzahn, Matthew Leon Curry, Craig Ulmer
  • [COMPSYS ‘23] Extending Composable Data Services into SmartNICs (Best Paper). Craig Ulmer, Jianshen Liu, Carlos Maltzahn, Matthew Leon Curry
  • WIP for [HPEC ’23] (work on coordinating processing offloading)

(Picked up by The Next Platform)

[prickett2021]

36 of 74


[rong2016] Rong, Huigui, et al. "Optimizing energy consumption for data centers." Renewable and Sustainable Energy Reviews 58 (2016): 674-691.

[masanet2020] Masanet, Eric, et al. "Recalibrating global data center energy-use estimates." Science 367.6481 (2020): 984-986.

[dayarathna2015] Dayarathna, Miyuru, Yonggang Wen, and Rui Fan. "Data center energy consumption modeling: A survey." IEEE Communications Surveys & Tutorials 18.1 (2015): 732-794.

[openai2018] Amodei, Dario et al. "AI and compute.", https://openai.com/research/ai-and-compute.

[gupta2021] Gupta, Udit, et al. "Chasing carbon: The elusive environmental footprint of computing." 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021.

[nikolic2022] Nikolić, Tatjana R., et al. "From Single CPU to Multicore Systems." 2022 57th International Scientific Conference on Information, Communication and Energy Systems and Technologies (ICEST). IEEE, 2022.

[karkar2022] Karkar, Ammar, et al. "Thermal and performance efficient on-chip surface-wave communication for many-core systems in dark silicon era." ACM Journal on Emerging Technologies in Computing Systems (JETC) 18.3 (2022): 1-18.


[cidon2013] Cidon, Asaf, et al. "Copysets: Reducing the frequency of data loss in cloud storage." 2013 {USENIX} Annual Technical Conference ({USENIX}{ATC} 13). 2013.

[vishwanath2010] Vishwanath, Kashi Venkatesh, and Nachiappan Nagappan. "Characterizing cloud computing hardware reliability." Proceedings of the 1st ACM symposium on Cloud computing. 2010.

[xu2019lessons] Xu, Erci, et al. "Lessons and Actions: What We Learned from 10K SSD-Related Storage System Failures." USENIX Annual Technical Conference. 2019.

[ubuntustressng] King, Colin. Stress-Ng: A Tool to Load and Stress a Computer System. https://kernel.ubuntu.com/git/cking/stress-ng.git. Accessed 29 May 2023.

[ricci2014] Ricci, Robert, Eric Eide, and CloudLab Team. "Introducing CloudLab: Scientific infrastructure for advancing cloud architectures and applications." ;login: The USENIX Magazine 39.6 (2014): 36-38.

[linuxpktgen] “HOWTO for the Linux Packet Generator.” The Linux Kernel Documentation, https://docs.kernel.org/networking/pktgen.html. Accessed 29 May 2023.

[iperf999] Tirumala, Ajay. "Iperf: The TCP/UDP bandwidth measurement tool." http://dast.nlanr.net/Projects/Iperf/ (1999).

[nuttcp2014] Tierney, Brian. "Experiences with 40G/100G applications." Berkeley: ESnet (2014).

[upasani2023] Upasani, Gaurang, et al. "Fifty Years of ISCA: A data-driven retrospective on key trends." arXiv preprint arXiv:2306.03964 (2023).

[liu2019] Liu, Jianshen, Philip Kufeldt, and Carlos Maltzahn. "Mbwu: Benefit quantification for data access function offloading." High Performance Computing: ISC High Performance 2019 International Workshops, Frankfurt, Germany, June 16-20, 2019.

37 of 74


[netperf2012] Jones, Rick. "Netperf benchmark." http://www.netperf.org/ (2012).

[arrow2023] Richardson N, Cook I, Crane N, Dunnington D, François R, Keane J, Moldovan-Grünfeld D, Ooms J, Apache Arrow (2023). arrow: Integration to 'Apache' 'Arrow'.

[dpdk] Data Plane Development Kit (DPDK). Linux Foundation, https://www.dpdk.org/. Accessed 31 May 2023.

[ulmer2018] Ulmer, Craig, et al. "Faodel: Data management for next-generation application workflows." Proceedings of the 9th Workshop on Scientific Cloud Computing. 2018.

[ioannidis1996] Ioannidis, Yannis E. "Query optimization." ACM Computing Surveys (CSUR) 28.1 (1996): 121-123.

[substrait2023] Substrait: Cross-Language Serialization for Relational Algebra. https://substrait.io/. Accessed 2 June 2023.

[datasketches] Rhodes, Lee, et al. Apache DataSketches: A Software Library of Stochastic Streaming Algorithms. https://datasketches.apache.org/. Accessed 9 Apr. 2023.

[kufeldt2018] Kufeldt, Philip, et al. "Eusocial Storage Devices-Offloading Data Management to Storage Devices that Can Act Collectively." ; login: The USENIX Magazine 43.2 (2018): 16-22.

[miano2018] Miano, Sebastiano, et al. "Creating complex network services with ebpf: Experience and lessons learned." 2018 IEEE 19th International Conference on High Performance Switching and Routing (HPSR). IEEE, 2018.

[menetrey2021] Ménétrey, Jämes, et al. "Twine: An embedded trusted runtime for webassembly." 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 2021.

[kung1989] Kung, H. T. "Network-based multicomputers: redefining high performance computing in the 1990s." (1989).

[tseng2016] Tseng, Hung-Wei, et al. "Morpheus: Creating application objects efficiently for heterogeneous computing." ACM SIGARCH Computer Architecture News 44.3 (2016): 53-65.

[boboila2012] Boboila, Simona, et al. "Active flash: Out-of-core data analytics on flash storage." 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 2012.

[chakraborty2022] Chakraborty, Jayjeet, et al. "Skyhook: Towards an arrow-native storage system." 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, 2022.

[hennessy2017] Hennessy, John L., and David A. Patterson. Computer Architecture, Sixth Edition: A Quantitative Approach. 6th ed., Morgan Kaufmann Publishers Inc., 2017.

[samuels2019] Samuels, Allen. The Consequences of Infinite Storage Bandwidth. 27 Apr. 2016, https://www.youtube.com/watch?v=-X9BuepxGko.

[wikibitrates] Wikipedia contributors. List of Interface Bit Rates --- Wikipedia, The Free Encyclopedia. 2019.

[prickett2021] Prickett, Nicole Hemsoth. “Testing the Limits of the BlueField-2 SmartNIC.” The Next Platform, 24 May 2021, https://www.nextplatform.com/2021/05/24/testing-the-limits-of-the-bluefield-2-smartnic/.

38 of 74


[liu2020] Liu, Jianshen, and Matthew Leon Curry. "Scale-out edge storage systems with embedded storage nodes to get better availability and cost-efficiency at the same time." HotEdge'20 (2020).

39 of 74

Performance of deserialization + serialization vs. memcpy on the BlueField-2


131072 rows, 9 MiB object size

  • Similar performance even with small objects (< 10 MiB)
  • Leverage Apache Arrow's (de)serialization efficiency
  • Use our serialization model for object merge time prediction

40 of 74

Evolving Trends in Computer Hardware


Hardware Trends

  • Performance growth in computer hardware is not uniform
  • Big data fuels an increasing demand for computing power in more applications

Background

* The CPU performance is depicted in relation to the VAX 11/780 using the SPEC integer benchmarks.

[hennessy2017, samuels2019, wikibitrates]

Ceased improvement in single-core performance

Expedited improvement in network and storage

  • Trade functionality for performance
    • Asymmetric Processors (E.g., P-Cores and E-Cores)
    • Domain-specific Hardware (Embedded Systems)

GPU

DPU

41 of 74

Challenges in General-purpose Computing


  • The “Power Wall”
    • Increased power consumption threatens thermal runaway
    • Clock frequencies have plateaued around 4 GHz since 2006 [nikolic2022]
  • Multicore Architecture (confronting “dark silicon” after the end of Dennard scaling)

  • Energy Consumption
    • Data centers
      • US data centers: 61B kWh to 100B kWh (2006 - 2011) [rong2016]
      • Instances: rose 550% (2010 - 2018) [masanet2020]
    • Refrigeration systems: 40% of energy consumption
    • Power bills: substantial expense [dayarathna2015]
    • ML/AI compute: doubling every 3.4 months since 2012 [openai2018]
  • Environmental Impacts
    • Carbon: Energy = Emissions
    • Manufacturing > Operations [gupta2021]
  • Ratio of the parallelizable portion (Amdahl’s law)
  • Memory-bound workloads
  • Lock contention
  • False sharing
  • Inter-core communication

Dark silicon patterns on a chip

[karkar2022]

Background

Hardware Trends

42 of 74

Trends in Hardware Design Paradigms


Trading functionality for performance

  • Asymmetric Processors
    • E.g., P-Cores and E-Cores
  • Domain-specific Hardware (Embedded Systems)
    • E.g., GPU, TPU, DPU, FPGA
    • Power efficiency
    • Performance benefits stay ahead of next-gen general-purpose processors

P-Cores

E-Cores

Intel Raptor Lake Processor Die Shot

GPU

DPU

FPGA

Background

Hardware Trends

43 of 74

Composable Data Services


Essential building blocks for applications to implement data resiliency, availability, validity, and curation

  • Provide richer views of the underlying data in web applications
  • Facilitate data storage, organization, and processing in HPC applications
  • Secure data in storage and network services

Hardware Trends ║ Data Services

Types of Data Services [kufeldt2018]

Data services require frequent data movement across system layers and components

Examples:

filesystems

key/value stores

data queries

Examples:

data distribution

data recovery

load balancing

Background

Involve compute, network, and energy overheads

44 of 74

Contributions


Dynamic Offloading

Strategies to exploit data service offloading benefits

Accelerate Serialization

Distribute Offload Planning

Distribute Data Processing

HPC-IODC ’19 HPEC ‘22 @

HotEdge ‘20 COMPSYS ‘23 *

SNL Report In-progress

@ Outstanding Student Paper Award

* Best Paper Award

Metrics for quantifying offloading benefits

Work Unit

Data Availability

Potential for offloading data services

Performance Gap Delineation

Network Processing Headroom

Data Partitioning Service

Parallel Processing Service

45 of 74

Opportunities for Key-value Offloading


Data Access Functions

Metrics

Basic Architecture Of RocksDB

Read/write amplification

RocksDB Data Access Overhead

YCSB Workload A: 50/50 put and get operations

Device throughput ⬆ 20% with 1 more thread

No significant improvement in RocksDB throughput

6x data access amplification

Offloading to embedded storage devices can return the amplified resource consumption to host applications
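The amplification figure can be expressed as a simple ratio; the byte counts below are illustrative, matching the 6x factor on the slide:

```python
def amplification(device_bytes, app_bytes):
    # Bytes the device actually moves per byte the application asked
    # for; LSM compaction and index/filter reads inflate this in
    # key-value stores like RocksDB.
    return device_bytes / app_bytes

# Illustrative: the application issues 1 GiB of puts/gets but the
# device ends up transferring 6 GiB.
amp = amplification(6 * 2**30, 1 * 2**30)  # 6.0
```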

Opportunity


46 of 74

Cost-benefit Quantification for Key-value Offloading


Software Stack

Hardware Infrastructure

  • YCSB serves as the key-value workload generator
  • RocksDB APIs are exposed through Java RMI framework
  • YCSB communicates with RocksDB via RPC
  • Host is 33x more expensive than the embedded storage devices
  • SATA SSD as the storage media

Metrics

Data Access Functions ⇒ MBWU ⇒ Evaluation

B

47 of 74

Metrics for Quantifying Benefits of Offloading

47

For optimizing an offloaded function

  • Commonly used in existing research
  • Rely solely on external characteristics (e.g., workload, cost, and energy)
  • Examples: MB/sec and Kops/sec, $/op, J/op

Type 1 Metrics

For optimizing an embedded device

  • Facilitate cost-performance optimization of a device
  • Incorporate the domain-specific resource efficiency of a device
    • Storage media efficiency of an embedded storage device
    • Network bandwidth efficiency of an embedded network interface card

Type 2 Metrics

Metrics

Data Access Functions ⇒ MBWU

B

48 of 74

MBWU: Data Access Function Efficiency

48

Different workloads and storage media lead to diverse cost-optimal placements of data access functions

Metrics

Data Access Functions ⇒ MBWU

B

Examples of data access functions:

  • get/put in key/value stores
  • read/write in filesystems
  • select/project in DBMS

If k < x, system B’s storage resource is underutilized; its cost-effectiveness could be improved by

  • Increasing the computing resources
  • Reducing the number of storage media

Media-based Work Unit (MBWU)

  • A throughput-oriented normalization metric
  • Normalizes workload performance relative to the storage media’s workload capacity
  • Excludes caching effects

49 of 74

MBWU-based Efficiency Metrics

49

Evaluate the cost incurred in enabling the performance of the storage media on a system

  • $/MBWU measures cost-efficiency
  • kWh/MBWU measures energy-efficiency
  • m³/MBWU measures space-efficiency
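As a worked example, plugging the case study’s numbers into $/MBWU (host MBWUs of 5.95 vs. 0.5 per embedded device, with the host 33x the device cost) yields the roughly 64% cost reduction; a minimal sketch in normalized cost units:

```python
# Worked $/MBWU comparison using the case study's numbers: the host
# delivers 5.95 MBWUs at 33x the cost of one embedded device, which
# delivers 0.5 MBWUs. Costs are in normalized units (device cost = 1).
def dollars_per_mbwu(cost_units, mbwus):
    return cost_units / mbwus

host = dollars_per_mbwu(33.0, 5.95)       # ~5.55 cost units per MBWU
embedded = dollars_per_mbwu(1.0, 0.5)     # 2.0 cost units per MBWU

reduction = 1 - embedded / host
print(f"reduction in $/MBWU from offloading: {reduction:.0%}")  # ~64%
```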

Metrics

Data Access Functions ⇒ MBWU

B

Device throughput ⬆ 20% with 1 more thread

No significant improvement in RocksDB throughput

6x data access amplification

Key-value Data Access Overhead

A case study

Cost-benefit Quantification for Key-value Offloading

50 of 74

Reduce Time to Insights!

50

Metrics

Data Access Functions ⇒ MBWU ⇒ Case Study

  1. Construct the MBWU for the key-value workload
  2. Evaluate the MBWUs of systems for comparison

Evaluation Complexity:

  • Pre-condition storage devices
  • Rate-control data initialization
  • Monitor system resource utilization
  • Identify peak performance under different offloading configurations

B

Hardware Infrastructure

33x more expensive

Software Stack

RocksDB via Java RMI

RPC YCSB workloads

51 of 74

Insights of Key-value Offloading Landscapes

51

57.9%

73.4%

45.9%

39.6%

70.7%

Network Tests

Integrated Tests

Disaggregated Tests

Reduction in $/MBWU

Reduction in kWh/MBWU

64%

  • Imbalance in server compute vs. storage allocation offers offloading opportunities
  • Changes in offloading configurations directly affect the offloading benefits
  • Disaggregating storage in servers amplifies the offloading benefits

MBWUs (host, embedded)

5.95, 0.5

5.2, 0.37

3.28, 0.37

Benefits

Metrics

Data Access Functions ⇒ MBWU ⇒ Case Study

B

52 of 74

Evaluation with Solid-state Drives

52

Impact of Compute Aggregation

Impact of Storage Aggregation

Background

Metrics

Data Access Functions ⇒ MBWU ⇒ Evaluation ║ Data Availability ⇒ Mathematical Model ⇒ Evaluation

⯈ server-based system has (m=) 10 servers

⯈ each server has (n=) 4 storage devices

relative benefit is 20.7

53 of 74

Data Availability

53

Challenges of small storage systems ➡ edge data centers

  • Environmental constraints (e.g., space, network and power)
  • Server maintenance costs rival redundancy provisioning expenses
  • Limited number of failure domains

Embedded storage increases data availability

  • Lower resource aggregation enhances node reliability
  • Denser deployment within the same cost/space/power limits
  • Failover mechanism spans more failure domains

Metrics

Data Access Functions ⇒ MBWU ⇒ Case Study ║ Data Availability

B

Failure Domains

Server-based Storage System

Embedded Storage System

54 of 74

Mathematically Evaluate the Data Availability Benefit

54

Develop a mathematical model to compare server storage systems to embedded storage systems

Metrics

Data Access Functions ⇒ MBWU ⇒ Case Study ║ Data Availability ⇒ Mathematical Model

B

Model Assumptions

Higher Storage Aggregation

Higher Compute Aggregation

Evaluation for Spinning Media

Evaluation for Solid-state Drives

55 of 74

Insights of Disaggregating Storage and Compute

55

Relative Benefit = Pdata-loss(server-based storage system) / Pdata-loss(embedded storage system)

Evaluate the relative probability of data loss between the server-based and embedded storage systems:

Higher Relative Benefits when:

  • Higher server storage aggregation in the server-based system
  • Higher server compute aggregation in the server-based system
  • Lower storage device failure rate in storage systems
    • Data loss risk shifts from storage to other components

Metrics

Data Access Functions ⇒ MBWU ⇒ Case Study ║ Data Availability ⇒ Mathematical Model ⇒ Insights

B

the greater the ratio, the better the embedded storage

56 of 74

Relative Benefit

56

Relative Benefit = Pdata-loss(server-based storage system) / Pdata-loss(embedded storage system)

Evaluate the relative probability of data loss between the server-based and embedded storage systems:

  • f: the failure rate of the storage component over that of the non-storage components
    • spinning media: f = 2 [vishwanath2010]
    • solid-state drive: f = 0.06 [xu2019lessons]
  • w: the number of nodes that have a replica set relationship with a node
    • w = 4 [cidon2013]
  • m: # of general-purpose servers
  • n: the ratio of storage aggregation (# of storage devices in a server)
  • c: the ratio of compute aggregation (# of embedded storage devices / # of servers)
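Under the stated assumptions (copyset replication, independent Poisson node failures, 3-way replication), the relative-benefit ratio can be illustrated with a toy calculation; all rates below are hypothetical, and this is not the dissertation’s exact model:

```python
# Toy illustration of the relative-benefit ratio; NOT the dissertation's
# exact model. With copyset replication, data is lost only if all three
# replicas of a copyset fail together; node failures are Poisson, so
# P(node down within window t) = 1 - exp(-lam * t).
import math

def p_copyset_loss(lam, t, replicas=3):
    p_node = 1 - math.exp(-lam * t)
    return p_node ** replicas

# Hypothetical failure rates (per hour): a server aggregating many
# devices fails more often than a small embedded storage node.
p_server = p_copyset_loss(lam=1e-4, t=24)
p_embedded = p_copyset_loss(lam=2e-5, t=24)

relative_benefit = p_server / p_embedded  # ~ (1e-4 / 2e-5)**3 = 125
print(f"relative benefit ~ {relative_benefit:.0f}")
```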

Metrics

Data Access Functions ⇒ MBWU ⇒ Evaluation ║ Data Availability ⇒ Mathematical Model ⇒ Evaluation

B

the greater the ratio, the better the embedded storage

57 of 74

Mathematically Evaluate the Data Availability Benefit

57

[Model equations lost in export; they express the data-loss probabilities in terms of the ratio of failure rates, the ratio of computing performance, the ratio of storage performance, and 3-way replication]

Develop a mathematical model to compare a storage system built with general-purpose servers to one built with embedded storage nodes.

Metrics

Data Access Functions ⇒ MBWU ⇒ Evaluation ║ Data Availability ⇒ Mathematical Model

  • Storage Node: nodes of the same type have the same configuration
  • External Redundancy: all nodes have the same network and power redundancy
  • Data Redundancy: 3-way replication
  • Replication Schema: copyset scheme [cidon2013] ➡ reduces the number of joint failure domains that share replicas
  • Failure Correlation: independence of node failures ➡ model hardware failures with a Poisson distribution

System Configuration Assumptions

Model Parameters Assumptions

B

58 of 74

Evaluation with Spinning Media

58

Impact of Compute Aggregation

Impact of Storage Aggregation

Higher Storage Aggregation

Higher Compute Aggregation

c = n = 4 ➡ embedded storage system has (10x4=) 40 devices

⯈ server-based system has (m=) 10 servers

⯈ each server has (n=) 4 storage devices

relative benefit is 7.1

⯈ each server has 12 storage devices

⯈ server-based system has (m=) 10 servers

⯈ embedded storage system has (17x10=) 170 devices

relative benefit is 114.3

Metrics

Data Access Functions ⇒ MBWU ⇒ Evaluation ║ Data Availability ⇒ Mathematical Model ⇒ Evaluation

B

59 of 74

59

The discovered potentials and new metrics should guide offloading by aligning service needs with those potentials.

B

Potential

M

60 of 74

Processing Headroom Evaluation

60

The Host (two AMD EPYC 7542 CPUs @ 2.9 GHz, 512 GB DRAM):

  • saturates the full 100 Gbps bandwidth with 5 vCPU cores at a packet size of 832 B
  • leaves less than 1% processing headroom on those 5 vCPUs, while 123 vCPUs remain available
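The headroom figures above follow from simple arithmetic over the link rate and core count (EPYC 7542: 32 cores / 64 threads per socket):

```python
# Arithmetic behind the headroom numbers: packet rate needed to saturate
# 100 Gbps with 832 B packets (ignoring framing overhead), and vCPUs left
# over on the dual EPYC 7542 host.
link_bps = 100e9
pkt_bytes = 832
pps = link_bps / 8 / pkt_bytes      # ~15 million packets per second

total_vcpus = 2 * 64                # two 64-thread sockets
busy_vcpus = 5                      # vCPUs driving the network processing
free_vcpus = total_vcpus - busy_vcpus
print(f"{pps / 1e6:.1f} Mpps; {free_vcpus} vCPUs remain available")
```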

Potential

Platform ║ Performance Delineation ║ Network Process Headroom

M

61 of 74

Bitar Implementation

61

[Diagram: Bitar processing flow]

  ❶ Slice the input buffer into segments (seg size < 64 KB), each mapped to source mbufs (an mbuf chain allocated in huge pages)
  ❷ Assemble the output buffer from destination mbufs (an mbuf chain) drawn from a preallocated memzone pool (from huge pages)
  ❸ Enqueue a burst of operations (each operation holds the two mbuf chains plus metadata) from a preallocated ops pool
  ❹ Dequeue the burst of completed operations; the hardware moves the data via DMA
  ❺ Recycle operations and mbufs back to their preallocated pools
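The flow can be sketched in plain Python (names hypothetical; the real implementation drives DPDK hardware queues with mbuf chains):

```python
# Sketch of Bitar's five-step flow (hypothetical names; the real
# implementation enqueues/dequeues bursts on DPDK hardware queues).
from collections import deque

SEG_SIZE = 64 * 1024  # each segment must stay under 64 KB

def slice_input(buf):                         # step 1: slice input buffer
    return [buf[i:i + SEG_SIZE] for i in range(0, len(buf), SEG_SIZE)]

class CompressEngine:
    """Stands in for a hardware compression queue."""
    def __init__(self):
        self.queue = deque()

    def enqueue_burst(self, ops):             # step 3: enqueue operations
        self.queue.extend(ops)

    def dequeue_burst(self):                  # step 4: dequeue completed ops
        done, self.queue = list(self.queue), deque()
        return done

engine = CompressEngine()
segments = slice_input(b"x" * 200_000)
ops = [(seg, bytearray(len(seg))) for seg in segments]  # step 2: output bufs
engine.enqueue_burst(ops)
completed = engine.dequeue_burst()
print(len(segments), len(completed))          # step 5: then recycle buffers
```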

62 of 74

Particle Data Flows

62

B

Potential

Platform ║ Performance Delineation ║ Network Process Headroom ║ Data Partitioning

  • HPC applications often require complex network services for particle data
  • Discrepancies exist between data producers and consumers

Airplane data from spatial to temporal representation

Partition &

Distribute

SmartNIC

Host

SmartNIC

SmartNIC

Data reorganization with SmartNICs

M

Embedding Distributed Data Reorganization

  • Use Apache Arrow [arrow2023] for particle data representation to simplify movement and processing
    • serialization, compression, partition, projection, aggregation, etc.
  • Use SmartNICs as distributed processing elements

63 of 74

Bitar: Optimizing Data Compression for Serialization

63

M

P

Strategies

B

Bitar

  • Compression plays a crucial role in data communication
    • Increase the effective data throughput of the network
    • Compression codec support is in many I/O libraries (e.g., Apache Kafka, Apache Arrow)
  • How to build a general interface for data services to accelerate compression?
    • Hide the communication complexity with the hardware
    • Explore the hardware capabilities (e.g., multiqueue and multithreading)
    • Ensure performance efficiency (e.g., minimize memory allocation)

Implementation https://github.com/skyhookdm/bitar

  • Build on top of DPDK [dpdk] and Apache Arrow
  • Automatically map input data to DPDK buffers
  • Chain operations to improve performance
  • Maintain memory pools to reduce allocation overhead

Features

  • Features zero-copy (a)synchronous API
  • Supports multi-core/multi-device processing
  • Runs on either the host or the BlueField-2 card
  • Hardware/software compatible compression

64 of 74

Serialization Performance with Multi-threaded Compression

64

M

P

Strategies

B

Bitar ⇒ Performance

Evaluate software vs. hardware compression (Bitar) performance impact on data serialization

  • Input data uses Arrow IPC “airplanes” reference dataset

8.6x

5.7x

2.8x

1.9x

Hardware compression beats the performance of 35 host software threads

Hardware compression significantly improves serialization performance

Hardware compression provides a comparable compression ratio for particle data

Bitar simplifies the use of compression hardware to improve the performance of data services offloaded to SmartNICs.

65 of 74

Embedded Processing Pipeline

65

M

P

Strategies

B

Bitar ⇒ Performance ║ Embedded Processing Pipeline

  • How can we construct an environment to host the data service on SmartNICs?
  • Requirements on communication and computation
    • Addressability: Efficient endpoint access
    • Configurability: Customizable resource mapping and computation dispatch
    • Adaptability: Optimized data processing through common data representation

Compute node A

Compute node B

Dispatch data processing workload at runtime

Resource group

Explore computation with parallelization

Workload with common data representation

  • Prototype: Distributed Particle Sifting
    • Faodel: provide globally accessible endpoints, resource group policy, and computation dispatching
    • Apache Arrow: provide a standardized, efficient tabular data format with flexible data processing APIs

66 of 74

Distributed Particle Sifting

66

  • Simulation tasks inject particle data into local SmartNICs
  • SmartNICs sift particles using a Log-Structured Merge (LSM) tree
  • Particles are split and transferred to the next SmartNIC level during compaction
  • Tested on a 100-node BlueField-2 SmartNIC cluster
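The split-on-compaction step can be sketched as follows (hypothetical structure; the real pipeline uses Faodel and Arrow on BlueField-2):

```python
# Minimal sketch of split-on-compaction sifting across SmartNIC levels
# (hypothetical structure; the real pipeline uses Faodel and Arrow).
def compact_and_split(memtable, boundary):
    """Sort the memtable; particles past the boundary are forwarded to
    the next SmartNIC level, the rest stay local."""
    keep = sorted(p for p in memtable if p < boundary)
    forward = sorted(p for p in memtable if p >= boundary)
    return keep, forward

level0 = [0.8, 0.1, 0.5, 0.9, 0.3]          # injected particle keys
keep, forward = compact_and_split(level0, boundary=0.5)
print(keep, forward)  # [0.1, 0.3] [0.5, 0.8, 0.9]
```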

For 64M particles, the overall transfer rate is 1.32 GB/s

Local Injection performance

Transfer to the local SmartNIC takes 81% of the overall injection time

Particle sifting performance

32-core host systems are roughly 4x faster than the SmartNICs

M

P

Strategies

B

Bitar ⇒ Performance ║ Embedded Processing Pipeline

67 of 74

Estimating the Cost of Query Workloads — Network Transfer

67

Network Transfer Cost

  • Decouple network transfer cost from intricate data details
    • Offloaded workloads: Particle Table ➡ Data Size ➡ Transfer Time
    • Pushed-back workloads: aggregate the data object size
  • Collect training data via round-trip time of different-sized buffer requests to SmartNIC
  • Prediction error rates for table size and transfer time are within 6% and 7%, respectively
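The round-trip-time training step can be approximated, under the simplifying assumption of a linear latency-plus-bandwidth cost, with an ordinary least-squares fit (sample points below are synthetic):

```python
# Approximate the transfer-time predictor with ordinary least squares,
# assuming time = latency + size / bandwidth. Sample points are synthetic
# stand-ins for the round-trip measurements against the SmartNIC.
def fit_linear(samples):
    """OLS fit for time = a + b * size over (size, time) pairs."""
    n = len(samples)
    sx = sum(s for s, _ in samples)
    sy = sum(t for _, t in samples)
    sxx = sum(s * s for s, _ in samples)
    sxy = sum(s * t for s, t in samples)
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

samples = [(1e6, 0.9), (4e6, 1.5), (16e6, 4.1), (64e6, 13.8)]  # (bytes, ms)
a, b = fit_linear(samples)

def predict_ms(size_bytes):
    return a + b * size_bytes

print(f"predicted transfer time for 32 MB: {predict_ms(32e6):.1f} ms")
```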

Prediction performance for table size

Prediction performance for transfer time

M

P

Strategies

B

Bitar ⇒ Performance ║ Embedded Processing Pipeline ║ Dynamic Offloading ⇒ Cost Estimation

68 of 74

Takeaway

68

Why

What

How

Metrics

MBWU ║ Data Availability ║ Takeaway

These metrics and methods provide a new toolkit for evaluating the cost-benefit of offloading data services, and they remain applicable to future embedded systems.

B

69 of 74

Takeaway

69

Why

What

How

B

M

Potential

Platform ║ Performance Delineation ║ Network Process Headroom ║ Data Processing Services ⇒ Performance ║ Takeaway

Modern SmartNICs are powerful enough to handle specific data services by aligning needs with potentials, providing desirable resource isolation and locality benefits.

70 of 74

Takeaway

70

Why

What

How

M

P

Strategies

B

Bitar ║ Embedded Processing Pipeline ║ Dynamic Offloading ⇒ Cost Estimation ⇒ Case Study ║ Takeaway

Efficient solutions for communication, computation, and scheduling complexities are essential to harness the benefits of offloading data services to embedded devices.

71 of 74

Enabling Dynamic Offloading of Data Service Workloads

71

M

P

Strategies

Bitar ║ Embedded Processing Pipeline ║ Dynamic Offloading

  • Potential for overloading resource-constrained embedded systems ➡ Need dynamic offloading
  • SmartNICs, not hosts, decide when to offload
  • Design goals of decision engine
    • Flexible Specification: Define workloads with a flexible, architecture-neutral specification (Substrait [substrait2023])
    • Job Size Estimation: Estimate based on the workload definition and referred data (Cardinality Estimator)
    • Adaptable Decision: Fine-tune offloading decisions based on resource availability (Predictive Model)
    • Efficient Decision: Make offloading decision efficiently
  • Ideal dynamic offloading use case: data query workloads (Substrait [substrait2023])

Efficiently estimate the job sizes then feed into the decision engine

Dissect workloads for offloading decision-making to maximize benefits

Refer to remote data sources

Serialize to architecture-agnostic, concise workload definition

72 of 74

Prediction Performance

72

Query execution time prediction with execution context information

Extremely efficient, with a relatively low error rate

Query types evaluated: aggregation, filtering with a reducible condition, simple filtering, projection

M

P

Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation

73 of 74

Case Study I

73

select * from particles where x >= 0.7 and y < 0.3 and z <= 0.1

query

6,177,731 rows

table

Cardinality Estimator

Estimated output rows: 55,036.7 (0.865% difference)

Actual output rows: 55,517 (selectivity 0.9%)

  • Host system: two Intel Xeon 16-core E5-2698 CPUs running at 2.30GHz and 512 GB of memory
  • Host measurements and estimates utilize 32 threads, while the BlueField-2 SmartNIC uses 6 threads
  • Offloading is 74.6% faster than pushing back due to low-percentage selectivity
  • Main overhead of pushing back stems from network transfer
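The estimate above can be sanity-checked with a simple independence assumption (coordinates independently uniform on [0, 1); an assumption of this sketch, not necessarily of the actual estimator):

```python
# Back-of-the-envelope check of the cardinality estimate. This sketch
# assumes particle coordinates are independently uniform on [0, 1);
# the real estimator works from collected statistics.
rows = 6_177_731

# select * from particles where x >= 0.7 and y < 0.3 and z <= 0.1
selectivity = (1 - 0.7) * 0.3 * 0.1   # 0.009, i.e. 0.9%
estimate = rows * selectivity
print(f"estimated rows ~ {estimate:,.0f}")  # close to the actual 55,517
```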

Transfer Table Size | Model Prediction | Error
3.79 MiB | 3.81 MiB | 0.49%
418.19 MiB | 424.19 MiB | 1.42%

M

P

Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation ⇒ Case Studies

74 of 74

Case Study II

74

select * from particles where x >= 0.5 and y < 0.55 and z <= 0.67

query

6,177,731 rows

table

Cardinality Estimator

Estimated output rows: 1,152,860 (1.41% difference)

Actual output rows: 1,136,847 (selectivity 18.4%)

Crossover point:

  • Estimation suggests pushing back
  • Offloading is only 1.38% faster than pushing back due to higher-percentage selectivity

Transfer Table Size | Model Prediction | Error
75.35 MiB | 78.06 MiB | 3.48%
418.19 MiB | 424.19 MiB | 1.42%
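A minimal sketch of the resulting decision, using case study II’s predicted transfer sizes with hypothetical bandwidth and compute times (the dissertation’s cost model is richer than this):

```python
# Hedged sketch of the offload vs. push-back decision. Transfer sizes are
# case study II's predictions; bandwidth and compute times are hypothetical
# stand-ins for the learned cost model.
def total_cost_s(transfer_mib, compute_s, bw_mib_s=2_000):
    """Transfer time plus compute time, in seconds."""
    return transfer_mib / bw_mib_s + compute_s

# Offload: SmartNIC filters (slower compute) but ships only the result.
offload = total_cost_s(78.06, compute_s=0.22)
# Push back: host filters (faster compute) but the raw table must move.
pushback = total_cost_s(424.19, compute_s=0.05)

gap = abs(offload - pushback) / max(offload, pushback)
choice = "offload" if offload < pushback else "push back"
print(f"{choice}, options only {gap:.1%} apart")  # near the crossover point
```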

M

P

Strategies

Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation ⇒ Case Studies