Extending Composable Data Services to the Realm of Embedded Systems
Jianshen Liu
University of California, Santa Cruz
Dissertation Defense, June 21, 2023
Challenges and Opportunities for Modern Data Processing
2
Data processing on general-purpose systems faces performance, cost, and energy challenges as modern applications increasingly rely on larger data volumes.
* the International Symposium on Computer Architecture
Word cloud of most common words in abstracts for ISCA* 1983 - 2023
[upasani2023]
Embedded systems present opportunities to optimize by offloading the services for data processing.
But: the cost/benefit landscape for offloading is complex.
Types of Data Service Offloading
3
Examples:
filesystems
key/value stores
data queries
Examples:
data distribution
data recovery
load balancing
Data services require frequent data movement across system layers and components [kufeldt2018]
Example benefits of offloading data services
Involve compute, network, and energy overheads
Hypotheses and Scope
4
Scope
Provide a set of new tools to estimate and realize the benefits of offloading composable data services for embedded systems.
Essential building blocks for applications to implement data availability, validity, and curation
Hypothesis #1: Quantifying the benefits of data service offloading requires multiple and specialized evaluation metrics.
Hypothesis #2: Data service offloading has to be carefully tailored based on the function, the data, and the fluctuating and highly specialized resource availability on embedded systems.
Hypothesis #3: Harnessing the benefits of data service offloading requires efficient mechanisms for communication, computation, and scheduling.
Why: Metrics for Evaluation
What: Potential for Offloading
How: Strategies to Offload
Related Publications
5
(Picked up by The Next Platform)
[prickett2021]
Contributions
6
Why
What
How
before advancement
after advancement
Contributions
7
Why
What
How
Metrics
These metrics and methods form a new toolkit for evaluating the cost-benefit of offloading data services, and they apply equally to future embedded systems.
Quantifying the Benefits of Offloading [liu2019] collaborated with Philip Kufeldt
8
Metrics
MBWU
Work Unit (WU)
As new devices emerge at increasing frequency
Automated determination of a (Media-Based) Work Unit
As long as the workload and device are fixed, we can compare throughput with and without offloading.
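The comparison the MBWU enables can be sketched as follows. Every number below is hypothetical, and `mbwus` is an illustrative helper, not the dissertation's tooling:

```python
# Hypothetical sketch of the MBWU idea: pin one workload and one storage
# device, take the peak throughput that pair sustains as one media-based
# work unit (MBWU), then express any system's throughput as MBWU counts.
def mbwus(system_throughput_ops_s, one_mbwu_ops_s):
    """How many MBWUs a system drives for the fixed workload/device pair."""
    return system_throughput_ops_s / one_mbwu_ops_s

ONE_MBWU = 20_000                        # assumed peak ops/s of one device
host_stack = mbwus(119_000, ONE_MBWU)    # full host software stack
offloaded  = mbwus(10_000, ONE_MBWU)     # a single embedded device

# With workload and device fixed, these counts are directly comparable:
print(host_stack, offloaded)             # 5.95 0.5
```

Because both counts are normalized to the same media, dividing system cost or energy by them yields the $/MBWU and kWh/MBWU metrics used later.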
Data Availability
9
Metrics
MBWU ║ Data Availability
Challenges of small storage systems ➡ edge data centers
The more independent failure domains a failover mechanism spans, the more available the data becomes.
Server-based Storage System
Embedded Storage System
Embedded storage enables more failure domains under the same cost/space/power restrictions.
Contributions
10
Why
What
How
Metrics
These metrics and methods form a new toolkit for evaluating the cost-benefit of offloading data services, and they apply equally to future embedded systems.
Contributions
11
Why
What
How
Metrics
Potential
Realizing the benefits goes beyond the use of specialized metrics: it requires exploring the potential of embedded systems along both the computation and communication dimensions.
Evaluation Platform (SmartNIC)
12
Potential
NVIDIA BlueField-2 SmartNIC
M
Platform
Performance Comparison using Microbenchmarks
13
Potential
Platform ║ Performance Delineation
Stress-ng Microbenchmark [ubuntustressng]
Operation latency-based performance metrics are not directly comparable
M
(executed on the SmartNIC vs. on the host)
Normalize the stressor results relative to the Raspberry Pi 4B’s performance
Raspberry-Pi-based Work Unit, or RPWU
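The RPWU normalization can be sketched in a few lines; the stressor names and scores below are illustrative, not measurements:

```python
# Sketch of the Raspberry-Pi-based Work Unit (RPWU): per-stressor stress-ng
# scores are divided by the Raspberry Pi 4B's score for the same stressor,
# turning incomparable per-stressor metrics into comparable ratios.
def rpwu(scores, pi4b_baseline):
    return {name: scores[name] / pi4b_baseline[name] for name in scores}

pi4b      = {"cpu": 1000.0, "memcpy": 800.0}   # bogo-ops/s on the Pi 4B
bluefield = {"cpu": 2400.0, "memcpy": 1200.0}  # same stressors elsewhere

print(rpwu(bluefield, pi4b))  # {'cpu': 2.4, 'memcpy': 1.5}
```

A ratio of 2.4 reads as "2.4 Raspberry Pi 4Bs' worth of this stressor," which is comparable across systems even when the raw stressor units are not.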
Performance Characterization
14
Compare BlueField-2 SmartNIC with a variety of host systems available at CloudLab [ricci2014]
Triangle data points denote the relative performance of the BlueField-2 SmartNIC
Potential
Platform ║ Performance Delineation
M
Offloading Potentials
BlueField-2 SmartNIC exhibits advantageous performance compared to hosts in:
Network Processing Headroom
15
Potential
Platform ║ Performance Delineation ║ Network Processing Headroom
Linux pktgen [linuxpktgen]
M
Evaluation Method
Yet another specialized metric for evaluation
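One plausible formulation of this headroom metric (the dissertation's exact definition may differ) compares the packet rate the device sustains, e.g. measured with Linux pktgen, while an offloaded service runs against the rate when the cores are otherwise idle:

```python
# Illustrative headroom metric: what fraction of peak packet-processing
# capacity is left for the data path once a service occupies some cores?
def processing_headroom(pps_with_service, pps_idle):
    return pps_with_service / pps_idle

# Hypothetical rates: 8 Mpps while loaded vs. 10 Mpps idle
print(processing_headroom(8_000_000, 10_000_000))  # 0.8
```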
Processing Headroom Evaluation
16
Potential
Platform ║ Performance Delineation ║ Network Processing Headroom
M
The BlueField-2:
Processing Headroom Evaluation
17
The BlueField-2:
Potential
Platform ║ Performance Delineation ║ Network Processing Headroom
M
Offloading Potentials
with DPDK
with kernel IP stack
Performance Leap!
Offloading Potential for Collectively Acting SmartNICs
18
What additional potential do SmartNICs have when they can act collectively?
Problem: data services on compute nodes compete for resources with applications, causing potential delays [tseng2016]
Opportunity: Offload to SmartNICs
But: Require SmartNICs to act collectively to expand processing capability
Platform ║ Performance Delineation ║ Network Processing Headroom ║ Potential for E/W Offloading
Service Placement
Background: HPC Scientific Computing Workflows
Potential
M
Collectively Acting SmartNICs
19
Software Stack
How do these data service libraries perform on the SmartNIC?
Expand in-transit data processing capability with multiple SmartNICs
Sorted by position and time
Sorted by ID and time
Offload data reorganization service and online query service
Platform ║ Performance Delineation ║ Network Processing Headroom ║ Potential for E/W Offloading
Potential
M
Particle Data Partitioning Performance
20
Overhead for partitioning without compression
Host is ~4x faster than the SmartNIC
Timing breakdown on BlueField-2 with compressions
Handling compression in software incurs significant overhead
Offloading Potential
Platform ║ Performance Delineation ║ Network Processing Headroom ║ Potential for E/W Offloading ⇒ Data Partitioning
Potential
M
Multi-threaded Service Performance
21
Bookkeeping overhead on the SmartNIC
Faodel LocalKV test uses multi-threading for put/get/delete operations on a 2D in-memory hashmap
32-core AMD EPYC 7543P
68-core
Query Arrow Data
32-core Xeon E5-2698
At 8 threads, the host is only 37.8% faster than the SmartNIC
Servers are roughly four times faster than the SmartNIC when using the same number of threads
Offloading Potentials
Platform ║ Performance Delineation ║ Network Processing Headroom ║ Potential for E/W Offloading ⇒ Data Partitioning ⇒ Parallel Processing
Potential
M
Contributions
22
Why
What
How
Metrics
Potential
Realizing the benefits goes beyond the use of specialized metrics: it requires exploring the potential of embedded systems along both the computation and communication dimensions.
Contributions
23
Why
What
How
Metrics
Potential
Strategies
Efficient solutions for communication, computation, and scheduling complexities are essential to harness the benefits of offloading data services to embedded devices.
Bitar: Optimizing Data Compression for Serialization
24
M
P
Strategies
Bitar
8.6x
5.7x
2.8x
1.9x
Bitar simplifies the use of compression hardware to improve the performance of data services offloaded to SmartNICs.
Bitar implementation
Ensure the hardware accelerator is efficiently utilized
Embedded Processing Pipeline
25
M
P
Strategies
Bitar ║ Embedded Processing Pipeline
32-core host systems are roughly 4x faster than the SmartNICs
Particle sifting performance
* 100M particles
Enabling Dynamic Offloading of Data Service Workloads
26
M
P
Strategies
Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning
Estimating the Cost of Query Workloads
27
Serialization Cost
Prediction performance for serialization time
M
P
Strategies
Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation
deserialization
serialization
Deserialization incurs a constant cost, whereas serialization does not
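The asymmetry noted above can be modeled with a minimal sketch: for Arrow-style columnar data, deserialization can be near-constant (buffers are usable in place), while serialization time grows with payload size. The coefficients below are placeholders; in practice they would be fitted from measurements:

```python
# Toy cost model for the serialization/deserialization asymmetry.
def serialize_cost_ms(table_mib, per_mib_ms=2.0, fixed_ms=0.4):
    return fixed_ms + per_mib_ms * table_mib   # linear in table size

def deserialize_cost_ms(table_mib, fixed_ms=0.4):
    return fixed_ms                            # independent of table size

print(serialize_cost_ms(100))    # 200.4
print(deserialize_cost_ms(100))  # 0.4
```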
Estimating the Cost of Query Execution
28
Query Execution Cost
Implementing a Cardinality Estimator
Estimate the # of output rows and the call count for each operation in a query
M
P
Strategies
Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation
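A toy version of the row-count estimation, using the common attribute-independence simplification (the dissertation's estimator relies on per-column statistics; this sketch only illustrates the principle):

```python
# Multiply per-predicate selectivities to estimate a conjunctive filter's
# output cardinality, assuming independent attributes.
def estimate_rows(total_rows, selectivities):
    est = float(total_rows)
    for s in selectivities:
        est *= s
    return est

# For "x >= 0.7 and y < 0.3 and z <= 0.1" with x, y, z uniform on [0, 1),
# the per-predicate selectivities are 0.3, 0.3, and 0.1:
print(estimate_rows(6_177_731, [0.3, 0.3, 0.1]))  # ~55,600 rows (0.9%)
```

For the case-study query this lands close to the slide's estimate, because the particle coordinates are roughly uniform; skewed columns are where per-column statistics earn their keep.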
Prediction Performance
29
M
P
Strategies
Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation
simple selection
project with compute function
filter with reducible condition
aggregate with compute function
Q1
Q3
Q4
Q7
Errors result from the aggregation of multiple statistics' biases
Performance of Cardinality Estimation on Test Queries
Cost = CPU time for estimation / CPU time for query execution
Query Execution Time Prediction Performance (actual vs. estimated time)
Estimation based on available threads (per-query thread counts: 5, 2, 7, 8, 7, 1, 8, 3, 4, 2, 5, 3, 1, 7, 1, 3)
Case Study I
30
select * from particles where
x >= 0.7 and y < 0.3 and z <= 0.1
query
6,177,731 rows
table
Cardinality Estimator
Estimated output rows: 55,036.7 (0.87% difference)
Actual output rows: 55,517 (selectivity 0.9%)
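A quick arithmetic check of the numbers above:

```python
# Verify the Case Study I selectivity and estimator-error figures.
total_rows  = 6_177_731
actual_rows = 55_517
estimated   = 55_036.7

selectivity = actual_rows / total_rows                        # rows kept
error_pct   = abs(estimated - actual_rows) / actual_rows * 100

print(f"selectivity = {selectivity:.2%}")      # selectivity = 0.90%
print(f"estimator error = {error_pct:.2f}%")   # estimator error = 0.87%
```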
M
P
Strategies
Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation ⇒ Case Studies
Model Prediction (Transfer Table Size):
Offloaded: 3.79 MiB actual, 3.81 MiB predicted
Pushed-back: 418.2 MiB actual, 424.2 MiB predicted
10.3% diff; 5.1% diff
Case Study II
31
select * from particles where
x >= 0.5 and y < 0.55 and z <= 0.67
query
6,177,731 rows
table
Crossover point:
M
P
Strategies
Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation ⇒ Case Studies
Model Prediction (Transfer Table Size):
Offloaded: 75.3 MiB actual, 78.1 MiB predicted
Pushed-back: 418.2 MiB actual, 424.2 MiB predicted
1.9% diff; 3.8% diff
Cardinality Estimator
Estimated output rows: 1,152,860 (1.4% difference)
Actual output rows: 1,136,847 (selectivity 18.4%)
Conclusion
32
Metrics
Potential
Strategies
Conclusion
What's Next?
33
Metrics
Potential
Strategies
Conclusion
Jianshen Liu (jliu120@ucsc.edu)
University of California, Santa Cruz�Dissertation Defense @ June 21, 2023
Thank you!
Mentors: Carlos Maltzahn, Scott Brandt, Peter Alvaro, Craig Ulmer, Matthew Curry, Philip Kufeldt, Jeff LeFevre, Paul Stamwitz, Ike Nassi, Shel Finkelstein
SRL team: Ivo Jimenez, Noah Watkins, Michael Sevilla, Aldrin Montana, Jayjeet Chakraborty, Holly Casaletto, Saheed Adepoju, Esmaeil Mirvakili, Farid Zakaria
Related Publications
35
(Picked up by The Next Platform)
[prickett2021]
36
[rong2016] Rong, Huigui, et al. "Optimizing energy consumption for data centers." Renewable and Sustainable Energy Reviews 58 (2016): 674-691.
[masanet2020] Masanet, Eric, et al. "Recalibrating global data center energy-use estimates." Science 367.6481 (2020): 984-986.
[dayarathna2015] Dayarathna, Miyuru, Yonggang Wen, and Rui Fan. "Data center energy consumption modeling: A survey." IEEE Communications Surveys & Tutorials 18.1 (2015): 732-794.
[openai2018] Amodei, Dario et al. "AI and compute.", https://openai.com/research/ai-and-compute.
[gupta2021] Gupta, Udit, et al. "Chasing carbon: The elusive environmental footprint of computing." 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2021.
[nikolic2022] Nikolić, Tatjana R., et al. "From Single CPU to Multicore Systems." 2022 57th International Scientific Conference on Information, Communication and Energy Systems and Technologies (ICEST). IEEE, 2022.
[karkar2022] Karkar, Ammar, et al. "Thermal and performance efficient on-chip surface-wave communication for many-core systems in dark silicon era." ACM Journal on Emerging Technologies in Computing Systems (JETC) 18.3 (2022): 1-18.
[cidon2013] Cidon, Asaf, et al. "Copysets: Reducing the frequency of data loss in cloud storage." 2013 USENIX Annual Technical Conference (USENIX ATC 13). 2013.
[vishwanath2010] Vishwanath, Kashi Venkatesh, and Nachiappan Nagappan. "Characterizing cloud computing hardware reliability." Proceedings of the 1st ACM symposium on Cloud computing. 2010.
[xu2019lessons] Xu, Erci, et al. "Lessons and Actions: What We Learned from 10K SSD-Related Storage System Failures." USENIX Annual Technical Conference. 2019.
[ubuntustressng] King, Colin. Stress-Ng: A Tool to Load and Stress a Computer System. https://kernel.ubuntu.com/git/cking/stress-ng.git. Accessed 29 May 2023.
[ricci2014] Ricci, Robert, Eric Eide, and the CloudLab Team. "Introducing CloudLab: Scientific infrastructure for advancing cloud architectures and applications." ;login: The USENIX Magazine 39.6 (2014): 36-38.
[linuxpktgen] “HOWTO for the Linux Packet Generator.” The Linux Kernel Documentation, https://docs.kernel.org/networking/pktgen.html. Accessed 29 May 2023.
[iperf999] Tirumala, Ajay. "Iperf: The TCP/UDP bandwidth measurement tool." http://dast.nlanr.net/Projects/Iperf/ (1999).
[nuttcp2014] Tierney, Brian. "Experiences with 40G/100G applications." Berkeley: ESnet (2014).
[upasani2023] Upasani, Gaurang, et al. "Fifty Years of ISCA: A data-driven retrospective on key trends." arXiv preprint arXiv:2306.03964 (2023).
[liu2019] Liu, Jianshen, Philip Kufeldt, and Carlos Maltzahn. "Mbwu: Benefit quantification for data access function offloading." High Performance Computing: ISC High Performance 2019 International Workshops, Frankfurt, Germany, June 16-20, 2019.
37
[netperf2012] Jones, Rick. "Netperf benchmark." http://www.netperf.org/ (2012).
[arrow2023] Richardson N, Cook I, Crane N, Dunnington D, François R, Keane J, Moldovan-Grünfeld D, Ooms J, Apache Arrow (2023). arrow: Integration to 'Apache' 'Arrow'.
[dpdk] Data Plane Development Kit (DPDK). Linux Foundation, https://www.dpdk.org/. Accessed 31 May 2023.
[ulmer2018] Ulmer, Craig, et al. "Faodel: Data management for next-generation application workflows." Proceedings of the 9th Workshop on Scientific Cloud Computing. 2018.
[ioannidis1996] Ioannidis, Yannis E. "Query optimization." ACM Computing Surveys (CSUR) 28.1 (1996): 121-123.
[substrait2023] Substrait: Cross-Language Serialization for Relational Algebra. https://substrait.io/. Accessed 2 June 2023.
[datasketches] Rhodes, Lee, et al. Apache DataSketches: A Software Library of Stochastic Streaming Algorithms. https://datasketches.apache.org/. Accessed 9 Apr. 2023.
[kufeldt2018] Kufeldt, Philip, et al. "Eusocial Storage Devices: Offloading Data Management to Storage Devices that Can Act Collectively." ;login: The USENIX Magazine 43.2 (2018): 16-22.
[miano2018] Miano, Sebastiano, et al. "Creating complex network services with ebpf: Experience and lessons learned." 2018 IEEE 19th International Conference on High Performance Switching and Routing (HPSR). IEEE, 2018.
[menetrey2021] Ménétrey, Jämes, et al. "Twine: An embedded trusted runtime for webassembly." 2021 IEEE 37th International Conference on Data Engineering (ICDE). IEEE, 2021.
[kung1989] Kung, H. T. "Network-based multicomputers: redefining high performance computing in the 1990s." (1989).
[tseng2016] Tseng, Hung-Wei, et al. "Morpheus: Creating application objects efficiently for heterogeneous computing." ACM SIGARCH Computer Architecture News 44.3 (2016): 53-65.
[boboila2012] Boboila, Simona, et al. "Active flash: Out-of-core data analytics on flash storage." 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST). IEEE, 2012.
[chakraborty2022] Chakraborty, Jayjeet, et al. "Skyhook: Towards an arrow-native storage system." 2022 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing (CCGrid). IEEE, 2022.
[hennessy2017] Hennessy, John L., and David A. Patterson. Computer Architecture, Sixth Edition: A Quantitative Approach. 6th ed., Morgan Kaufmann Publishers Inc., 2017.
[samuels2019] Samuels, Allen. The Consequences of Infinite Storage Bandwidth. 27 Apr. 2016, https://www.youtube.com/watch?v=-X9BuepxGko.
[wikibitrates] Wikipedia contributors. "List of Interface Bit Rates." Wikipedia, The Free Encyclopedia, 2019.
[prickett2021] Prickett, Nicole Hemsoth. “Testing the Limits of the BlueField-2 SmartNIC.” The Next Platform, 24 May 2021, https://www.nextplatform.com/2021/05/24/testing-the-limits-of-the-bluefield-2-smartnic/.
38
[liu2020] Liu, Jianshen, and Matthew Leon. "Scale-out edge storage systems with embedded storage nodes to get better availability and cost-efficiency at the same time." HotEdge'20 (2020).
Performance of deserialization + serialization vs. memcpy on the BlueField-2
39
131072 rows, 9 MiB object size
Evolving Trends in Computer Hardware
40
Hardware Trends
Background
* The CPU performance is depicted in relation to the VAX 11/780 using the SPEC integer benchmarks.
[hennessy2017, samuels2019, wikibitrates]
Ceased improvement in single-core performance
Expedited improvement in network and storage
GPU
DPU
Challenges in General-purpose Computing
41
Dark silicon patterns on a chip
[karkar2022]
Background
Hardware Trends
Trends in Hardware Design Paradigms
42
Trading functionality with performance
P-Cores
E-Cores
Intel Raptor Lake Processor Die Shot
GPU
DPU
FPGA
Background
Hardware Trends
Composable Data Services
43
Essential building blocks for applications to implement data resiliency, availability, validity, and curation
Hardware Trends ║ Data Services
Types of Data Services [kufeldt2018]
Data services require frequent data movement across system layers and components
Examples:
filesystems
key/value stores
data queries
Examples:
data distribution
data recovery
load balancing
Background
Involve compute, network, and energy overheads
Contributions
44
Dynamic Offloading
Strategies to exploit data service offloading benefits
Accelerate Serialization
Distribute Offload Planning
Distribute Data Processing
HPC-IODC ’19 HPEC ‘22 @
HotEdge ‘20 COMPSYS ‘23 *
SNL Report In-progress
@ Outstanding Student Paper Award
* Best Paper Award
Metrics for quantifying offloading benefits
Work Unit
Data Availability
Potential for offloading data services
Performance Gap Delineation
Network Processing Headroom
Data Partitioning Service
Parallel Processing Service
Opportunities for Key-value Offloading
45
Data Access Functions
Metrics
Basic Architecture Of RocksDB
Read/write amplification
RocksDB Data Access Overhead
YCSB Workload A: 50/50 put and get operations
Device throughput ⬆ 20% with 1 more thread
No significant improvement in RocksDB throughput
6x data access amplification
Offloading to embedded storage devices can return the amplified system-resource consumption to applications
Opportunity
B
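The amplification figure above is just a ratio of bytes moved; the GiB values below are illustrative, not the measured workload:

```python
# Data access amplification: bytes moved at the device level divided by
# the bytes the application actually requested. A 6x figure means the
# key-value stack moves six device bytes per user byte.
def amplification(device_bytes, user_bytes):
    return device_bytes / user_bytes

# e.g. 6 GiB of device traffic serving 1 GiB of user put/get traffic:
print(amplification(6 * 2**30, 2**30))  # 6.0
```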
Cost-benefit Quantification for Key-value Offloading
46
Software Stack
Hardware Infrastructure
Metrics
Data Access Functions ⇒ MBWU ⇒ Evaluation
B
Metrics for Quantifying Benefits of Offloading
47
For optimizing a function offloaded
Type 1 Metrics
For optimizing an embedded device
Type 2 Metrics
Metrics
Data Access Functions ⇒ MBWU
B
MBWU: Data Access Function Efficiency
48
Different workloads and storage media lead to diverse cost-optimal placements of data access functions
Metrics
Data Access Functions ⇒ MBWU
B
Examples of data access functions:
If k < x, system B’s storage resource is underutilized,
Its cost-effectiveness could be improved by
Media-based Work Unit (MBWU)
MBWU-based Efficiency Metrics
49
Evaluate the cost incurred in enabling the performance of the storage media on a system
measures cost-efficiency
measures energy-efficiency
measures space-efficiency
$/MBWU
kWh/MBWU
m³/MBWU
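The three ratios above can be spelled out directly; the MBWU counts echo the case-study slide, while the cost figures are placeholders:

```python
# MBWU-based efficiency metrics: divide a resource budget by the number
# of MBWUs the system sustains for the pinned workload/device pair.
def dollars_per_mbwu(system_cost_usd, mbwus):
    return system_cost_usd / mbwus          # cost-efficiency

def kwh_per_mbwu(energy_kwh, mbwus):
    return energy_kwh / mbwus               # energy-efficiency

def m3_per_mbwu(volume_m3, mbwus):
    return volume_m3 / mbwus                # space-efficiency

# A host sustaining 5.95 MBWUs vs. an embedded node sustaining 0.5 MBWUs,
# with hypothetical $5,000 and $250 system costs:
print(dollars_per_mbwu(5000, 5.95))  # ~840.3 $/MBWU
print(dollars_per_mbwu(250, 0.5))    # 500.0 $/MBWU
```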
Metrics
Data Access Functions ⇒ MBWU
B
Device throughput ⬆ 20% with 1 more thread
No significant improvement in RocksDB throughput
6x data access amplification
Key-value Data Access Overhead
A case study
Cost-benefit Quantification for Key-value Offloading
Reduce Time to Insights!
50
Metrics
Data Access Functions ⇒ MBWU ⇒ Case Study
Pre-condition storage devices
Rate control data initialization
Monitor system resources’ utilization
Identify peak performance
different offloading configurations
Evaluation Complexity:
B
Hardware Infrastructure
33x more expensive
…
Software Stack
RocksDB via Java RMI
RPC YCSB workloads
Insights of Key-value Offloading Landscapes
51
57.9%
73.4%
45.9%
39.6%
70.7%
Network Tests
Integrated Tests
Disaggregated Tests
Reduction in $/MBWU
Reduction in kWh/MBWU
64%
MBWUs (host, embedded)
5.95, 0.5
5.2, 0.37
3.28, 0.37
Benefits
Metrics
Data Access Functions ⇒ MBWU ⇒ Case Study
B
Evaluation with Solid-state Drives
52
Impact of Compute Aggregation
Impact of Storage Aggregation
Background
Metrics
Data Access Functions ⇒ MBWU ⇒ Evaluation ║ Data Availability ⇒ Mathematical Model ⇒ Evaluation
⯈ server-based system has (m=) 10 servers
⯈ each server has (n=) 4 storage devices
⯈ relative benefit is 20.7
Data Availability
53
Challenges of small storage systems ➡ edge data centers
Embedded storage increases data availability
Metrics
Data Access Functions ⇒ MBWU ⇒ Case Study ║ Data Availability
B
Failure Domains
Server-based Storage System
Embedded Storage System
Mathematically Evaluate the Data Availability Benefit
54
Develop a mathematical model to compare server storage systems to embedded storage systems
Metrics
Data Access Functions ⇒ MBWU ⇒ Case Study ║ Data Availability ⇒ Mathematical Model
B
Model Assumptions
Higher Storage Aggregation
Higher Compute Aggregation
Evaluation for Spinning Media
Evaluation for Solid-state Drives
Insights of Disaggregating Storage and Compute
55
Pdata-loss(server-based storage system)
Pdata-loss(embedded storage system)
Relative Benefit =
Evaluate the relative probability of data loss between the server-based and embedded storage systems:
Higher Relative Benefits when:
Metrics
Data Access Functions ⇒ MBWU ⇒ Case Study ║ Data Availability ⇒ Mathematical Model ⇒ Insights
B
the greater the ratio, the better the embedded storage
Relative Benefit
56
Pdata-loss(server-based storage system)
Pdata-loss(embedded storage system)
Relative Benefit =
Evaluate the relative probability of data loss between the server-based and embedded storage systems:
Metrics
Data Access Functions ⇒ MBWU ⇒ Evaluation ║ Data Availability ⇒ Mathematical Model ⇒ Evaluation
B
the greater the ratio, the better the embedded storage
Mathematically Evaluate the Data Availability Benefit
57
Develop a mathematical model to compare a storage system built with general-purpose servers to one built with embedded storage nodes.
Metrics
Data Access Functions ⇒ MBWU ⇒ Evaluation ║ Data Availability ⇒ Mathematical Model
System Configuration Assumptions
Model Parameters Assumptions
B
Evaluation with Spinning Media
58
Impact of Compute Aggregation
Impact of Storage Aggregation
Higher Storage Aggregation
Higher Compute Aggregation
⯈ c = n = 4 ➡ embedded storage system has (10x4=) 40 devices
⯈ server-based system has (m=) 10 servers
⯈ each server has (n=) 4 storage devices
⯈ relative benefit is 7.1
⯈ each server has 12 storage devices
⯈ server-based system has (m=) 10 servers
⯈ embedded storage system has (17x10=) 170 devices
⯈ relative benefit is 114.3
Metrics
Data Access Functions ⇒ MBWU ⇒ Evaluation ║ Data Availability ⇒ Mathematical Model ⇒ Evaluation
B
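As a toy illustration of why more independent failure domains raise the relative benefit (a deliberately simplified model, not the dissertation's):

```python
# With r-way replication, a replica group is lost only when every failure
# domain holding one of its replicas fails. Embedded storage provides more
# independent domains for the same budget, so replicas need not share one.
def loss_probability(p_domain, domains_per_group, r=3):
    """P(lose a replica group) when its r replicas span this many domains."""
    spread = min(r, domains_per_group)
    return p_domain ** spread        # assumes independent domain failures

p = 0.01
server_based = loss_probability(p, domains_per_group=2)  # replicas share a server
embedded     = loss_probability(p, domains_per_group=3)  # one device per domain
print(server_based / embedded)       # relative benefit ~100x
```

The real model accounts for device counts, aggregation levels, and media type, but the driving term is the same: each extra independent domain multiplies the denominator by another failure probability.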
59
The discovered potentials and the new metrics should guide offloading by aligning service needs with platform potentials.
B
Potential
M
Processing Headroom Evaluation
60
The Host (two AMD EPYC 7542 CPU @ 2.9GHz, 512GB DRAM):
Potential
Platform ║ Performance Delineation ║ Network Processing Headroom
M
Bitar Implementation
61
dequeue burst (an operation list)
…
mbuf struct
Input buffer
seg size < 64KB
…
an operation
(two mbuf chains with metadata)
enqueue burst (an operation list)
allocated in huge pages
DMA
source mbufs
(an mbuf chain)
preallocated memzone pool (from huge pages)
…
❶ slice input buf
❷ assemble output buf
❸ enqueue ops
❹ dequeue ops
❺ recycle resources
❶
❷
❸
❹
preallocated ops pool
preallocated mbuf pool
destination mbufs
(an mbuf chain)
❺
Particle Data Flows
62
B
Potential
Platform ║ Performance Delineation ║ Network Processing Headroom ║ Data Partitioning
Airplane data from spatial to temporal representation
…
Partition &
Distribute
SmartNIC
Host
SmartNIC
SmartNIC
Data reorganization with SmartNICs
M
Embedding Distributed Data Reorganization
Bitar: Optimizing Data Compression for Serialization
63
M
P
Strategies
B
Bitar
Implementation https://github.com/skyhookdm/bitar
Features
Serialization Performance with Multi-threaded Compression
64
M
P
Strategies
B
Bitar ⇒ Performance
Evaluate software vs. hardware compression (Bitar) performance impact on data serialization
8.6x
5.7x
2.8x
1.9x
Hardware compression beats software compression even with 35 host threads
Hardware compression significantly improves serialization performance
Hardware compression provides a comparable compression ratio for particle data
Bitar simplifies the use of compression hardware to improve the performance of data services offloaded to SmartNICs.
Embedded Processing Pipeline
65
M
P
Strategies
B
Bitar ⇒ Performance ║ Embedded Processing Pipeline
Compute node A
Compute node B
Dispatch data processing workload at runtime
Resource group
Explore computation with parallelization
Workload with common data representation
Distributed Particle Sifting
66
For 64M particles, the overall transfer rate is 1.32 GB/s
Local Injection performance
Transfer to local SmartNIC takes 81% of the overall injection time
Particle sifting performance
32-core host systems are roughly 4x faster than the SmartNICs
M
P
Strategies
B
Bitar ⇒ Performance ║ Embedded Processing Pipeline
Estimating the Cost of Query Workloads — Network Transfer
67
Network Transfer Cost
Prediction performance for table size
Prediction performance for transfer time
M
P
Strategies
B
Bitar ⇒ Performance ║ Embedded Processing Pipeline ║ Dynamic Offloading ⇒ Cost Estimation
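A transfer-time estimate of the kind sketched above can be a simple linear model over the predicted serialized size; the bandwidth and setup coefficients here are assumptions that would be fitted from measurements:

```python
# Toy network-transfer cost model: predicted serialized table size divided
# by link bandwidth, plus a fixed per-transfer setup term.
def transfer_time_s(table_mib, bandwidth_mib_s=2900.0, setup_s=0.002):
    return setup_s + table_mib / bandwidth_mib_s

# e.g. the predicted pushed-back transfer from the case studies (424.2 MiB):
print(round(transfer_time_s(424.2), 3))  # 0.148 s
```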
Takeaway
68
Why
What
How
Metrics
MBWU ║ Data Availability ║ Takeaway
These metrics and methods form a new toolkit for evaluating the cost-benefit of offloading data services, and they apply equally to future embedded systems.
B
Takeaway
69
Why
What
How
B
M
Potential
Platform ║ Performance Delineation ║ Network Processing Headroom ║ Data Processing Services ⇒ Performance ║ Takeaway
Modern SmartNICs are powerful enough to handle specific data services when service needs are aligned with platform potentials, while providing desirable resource-isolation and locality benefits.
Takeaway
70
Why
What
How
M
P
Strategies
B
Bitar ║ Embedded Processing Pipeline ║ Dynamic Offloading ⇒ Cost Estimation ⇒ Case Study ║ Takeaway
Efficient solutions for communication, computation, and scheduling complexities are essential to harness the benefits of offloading data services to embedded devices.
Enabling Dynamic Offloading of Data Service Workloads
71
M
P
Strategies
Bitar ║ Embedded Processing Pipeline ║ Dynamic Offloading
Efficiently estimate job sizes, then feed them into the decision engine
Dissect workloads for offloading decision-making to maximize benefits
Refer to remote data sources
Serialize to a concise, architecture-agnostic workload definition
Prediction Performance
72
Query execution time prediction with execution context information
Extremely efficient, with a relatively low error rate
aggregation
with reducible condition
simple filtering
projection
M
P
Strategies
Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation
Case Study I
73
select * from particles where
x >= 0.7 and y < 0.3 and z <= 0.1
query
6,177,731 rows
table
Cardinality Estimator
Estimated output rows: 55,036.7 (0.865% difference)
Actual output rows: 55,517 (selectivity 0.9%)
Model Prediction (Transfer Table Size):
Offloaded: 3.79 MiB actual, 3.81 MiB predicted (0.49% diff)
Pushed-back: 418.19 MiB actual, 424.19 MiB predicted (1.42% diff)
M
P
Strategies
Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation ⇒ Case Studies
Case Study II
74
select * from particles where
x >= 0.5 and y < 0.55 and z <= 0.67
query
6,177,731 rows
table
Cardinality Estimator
Estimated output rows: 1,152,860 (1.41% difference)
Actual output rows: 1,136,847 (selectivity 18.4%)
Crossover point:
Model Prediction (Transfer Table Size):
Offloaded: 75.35 MiB actual, 78.06 MiB predicted (3.48% diff)
Pushed-back: 418.19 MiB actual, 424.19 MiB predicted (1.42% diff)
M
P
Strategies
Bitar ║ Embedded Processing Pipeline ║ Distribute Offload Planning ⇒ Cost Estimation ⇒ Case Studies