1 of 1

Motivation

 

Objective: Cost under Ext. Mem. Model

References

Challenges

Experiments

Search Space: Hierarchical Index

Optimization: Greedy Stack and Balance

Supawit ChockchowwatDepartment of Computer Science, College of Engineering, University of Illinois at Urbana-Champaign

Tuning Hierarchical Learned Indexes on Disk and Beyond

Baselines: PostgreSQL [2], RocksDB [3], RMI [4]

Benchmark: Search On Sorted Data (SOSD) [1, 5]

  • books: Amazon sale popularity data
  • fb: Facebook user IDs, upsampled
  • osm: OpenStreetMap locations in Google S2 Cellids
  • wiki: Wikipedia article edit timestamps

Environment: Azure VM D4s_v3 (4 vCPUs, 16 GiB RAM), data and indexes on Azure NFS

Figure 1: Initial lookup latency across SOSD datasets of different systems whose external memory is on Azure NFS.

 

 

Storages have their unique performance profiles.

Index structures gain more design choices.

Objective and constraints are more complex.

Learned indexes outperform in internal memory setting

Large data systems require external storages

Are fast random accesses necessary?

How to design or tune�a learned index on external storage?

Round-trip time violates fast-random-access assumption

Number of�Layers

Model Type and Parameters

Key Space Partition

Precision

Layer Sizes

These variables are dependent on each other.

Cost to access external memory dominates the total cost.

Represent the time to read x bytes from storage as

 

 

Index search cost under EMM

 

retrieve root layer

retrieve layer L-1

retrieve layer L-2

retrieve data layer

 

Layer L

Layer L-1

Layer 1

Data Layer

 

 

 

 

Hierarchical index can be identified with various variables

Layer i

Layer i-1

 

 

 

Layer L

Layer L-1

Layer 1

 

Incrementally try number of layers to solve

e.g., greedy packing for step functions, convex hull packing for piecewise linear function

 

 

Relevant terms in EMM

Build order

 

 

Ternary search to find minimum

 

Larger, slower

Smaller, faster

e.g., error correction is slower

[1]

e.g., model, loss function, error distribution

Pareto size and accuracy

Model preference

👍

👎

 

 

 

accuracy

size

feasible

👍

👎

SSD

HDD

NFS

Cloud Storage

Spectrum of latency, bandwidth, IOPS, …

[1] Ryan Marcus, Andreas Kipf, Alexandervan Renen, Mihail Stoian, Sanchit Misra, Alfons Kemper, Thomas Neumann, and Tim Kraska. 2020. Benchmarking Learned Indexes. Proc. VLDB Endow. 14, 1 (2020), 1–13.

[2] PostgreSQL. [n. d.]. PostgreSQL: The World’s Most Advanced Open Source Relational Database. https://www.postgresql.org. [Online; accessed November- 12-2021].[3] RocksDB

[3] Siying Dong, Andrew Kryczka, Yanqin Jin, and Michael Stumm. 2021. RocksDB: Evolution of Development Priorities in a Key-Value Store Serving Large-Scale Applications. ACM Trans. Storage 17, 4, Article 26 (Oct. 2021), 32 pages. https://doi.org/10.1145/3483840

[4] Tim Kraska, Alex Beutel, Ed H. Chi, Jeffrey Dean, and Neoklis Polyzotis. 2018. The case for learned index structures. Proceedings of the ACM SIGMOD International Conference on Management of Data. https://doi.org/10.1145/3183713.3196909

[5] Andreas Kipf, Ryan Marcus, Alexandervan Renen, Mihail Stoian, Alfons Kemper, Tim Kraska, and Thomas Neumann. 2019. SOSD: A Benchmark for Learned Indexes. NeurIPS Workshop on Machine Learning for Systems (2019)