Motivation
Objective: Cost under Ext. Mem. Model
References
Challenges
Experiments
Search Space: Hierarchical Index
Optimization: Greedy Stack and Balance
Supawit Chockchowwat�Department of Computer Science, College of Engineering, University of Illinois at Urbana-Champaign
Tuning Hierarchical Learned Indexes on Disk and Beyond
Baselines: PostgreSQL [2], RocksDB [3], RMI [4]
Benchmark: Search On Sorted Data (SOSD) [1, 5]
Environment: Azure VM D4s_v3 (4 vCPUs, 16 GiB RAM), data and indexes on Azure NFS
Figure 1: Initial lookup latency across SOSD datasets of different systems whose external memory is on Azure NFS.
Storages have their unique performance profiles.
Index structures gain more design choices.
Objective and constraints are more complex.
Learned indexes outperform in internal memory setting
Large data systems require external storages
Are fast random accesses necessary?
How to design or tune�a learned index on external storage?
Round-trip time violates fast-random-access assumption
Number of�Layers
Model Type and Parameters
Key Space Partition
Precision
Layer Sizes
These variables are dependent on each other.
Cost to access external memory dominates the total cost.
Represent the time to read x bytes from storage as
Index search cost under EMM
retrieve root layer
retrieve layer L-1
retrieve layer L-2
retrieve data layer
Layer L
Layer L-1
Layer 1
Data Layer
…
Hierarchical index can be identified with various variables
Layer i
Layer i-1
Layer L
Layer L-1
Layer 1
…
Incrementally try number of layers to solve
e.g., greedy packing for step functions, convex hull packing for piecewise linear function
Relevant terms in EMM
Build order
…
Ternary search to find minimum
Larger, slower
Smaller, faster
❌
❌
❌
❌
✅
e.g., error correction is slower
[1]
e.g., model, loss function, error distribution
Pareto size and accuracy
Model preference
👍
👎
accuracy
size
feasible
👍
👎
SSD
HDD
NFS
Cloud Storage
Spectrum of latency, bandwidth, IOPS, …
[1] Ryan Marcus, Andreas Kipf, Alexandervan Renen, Mihail Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, and Tim Kraska. 2020. Benchmarking Learned Indexes. Proc. VLDB Endow. 14, 1 (2020), 1–13.
[2] PostgreSQL. [n. d.]. PostgreSQL: The World’s Most Advanced Open Source Relational Database. https://www.postgresql.org. [Online; accessed November- 12-2021].[3] RocksDB
[3] Siying Dong, Andrew Kryczka, Yanqin Jin, and Michael Stumm. 2021. RocksDB: Evolution of Development Priorities in a Key-Value Store Serving Large-Scale Applications. ACM Trans. Storage 17, 4, Article 26 (Oct. 2021), 32 pages. https://doi.org/10.1145/3483840
[4] Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. Proceedings of the ACM SIGMOD International Conference on Management of Data. https://doi.org/10.1145/3183713.3196909
[5] Andreas Kipf, Ryan Marcus, Alexandervan Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. 2019. SOSD: A Benchmark for Learned Indexes. NeurIPS Workshop on Machine Learning for Systems (2019)