1 of 17

Reducing query latency in DataFusion via a caching object store layer

Artjoms Iškovs, Principal Engineer
27th September 2024

© E D B 2 0 2 4 — A L L R I G H T S R E S E R V E D .

2 of 17

Before EDB: Splitgraph and Seafowl

  • Seafowl: “analytics at the edge”
  • Released in 2022, uses DataFusion
  • Kind of like PostgREST + Varnish for OLAP

3 of 17

At EDB: Postgres Lakehouse + HTAP

4 of 17

5 of 17

6 of 17

  • Multiple files
  • Small byte ranges
  • Big byte ranges
  • (partially) overlapping byte ranges
  • Byte ranges requested multiple times
    • …from different parts of the query tree
    • …from different queries
    • …concurrently?!
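
One way to cope with the mix of small, large and overlapping ranges listed above is to round every request to fixed-size chunks and cache per chunk, so that overlapping or repeated requests resolve to the same entries. A minimal Python sketch, assuming a hypothetical 1 MiB chunk size and hypothetical helper names (`chunks_for_range`, `cache_keys`); it is an illustration, not the implementation shown in the talk:

```python
from typing import List, Tuple

# Hypothetical chunk size for illustration; the talk does not state one.
CHUNK_SIZE = 1024 * 1024


def chunks_for_range(start: int, end: int, chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Return the indices of the fixed-size chunks covering the half-open range [start, end)."""
    if start < 0 or end <= start:
        raise ValueError("expected 0 <= start < end")
    first = start // chunk_size
    last = (end - 1) // chunk_size  # end is exclusive, so the last byte is end - 1
    return list(range(first, last + 1))


def cache_keys(path: str, start: int, end: int, chunk_size: int = CHUNK_SIZE) -> List[Tuple[str, int]]:
    """Build cache keys as (file path, chunk index) so overlapping requests share entries."""
    return [(path, index) for index in chunks_for_range(start, end, chunk_size)]
```

With a chunk size of 4 bytes, the ranges [0, 6) and [5, 9) of the same file both need chunk 1, so whichever request arrives second can reuse it instead of fetching those bytes again.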

7 of 17

8 of 17

Caching logic

9 of 17

Benchmarking

22 TPC-H queries, SF10 (3GiB in Delta)

  • cache off
  • cache on (256MiB)
  • Inject 30ms latency with toxiproxy
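
The latency injection step could be scripted against Toxiproxy's HTTP API roughly as follows. This is a sketch under assumptions: it relies on the API listening on its default address (`localhost:8474`) and on the `POST /proxies` and `POST /proxies/{name}/toxics` endpoints as described in the Toxiproxy README; the proxy name and addresses in the usage note are hypothetical, and the payload fields should be checked against the Toxiproxy version in use.

```python
import json
import urllib.request

# Default Toxiproxy API address (assumed; adjust if the server was started differently).
TOXIPROXY_API = "http://localhost:8474"


def proxy_payload(name: str, listen: str, upstream: str) -> dict:
    """Body for creating a proxy that forwards `listen` to `upstream`."""
    return {"name": name, "listen": listen, "upstream": upstream, "enabled": True}


def latency_toxic_payload(latency_ms: int, stream: str = "downstream") -> dict:
    """Body for a latency toxic that delays traffic by `latency_ms` milliseconds."""
    return {
        "type": "latency",
        "stream": stream,
        "toxicity": 1.0,  # apply to every connection
        "attributes": {"latency": latency_ms, "jitter": 0},
    }


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


def inject_latency(name: str, listen: str, upstream: str, latency_ms: int = 30) -> dict:
    """Create the proxy, then attach the latency toxic to it."""
    post_json(f"{TOXIPROXY_API}/proxies", proxy_payload(name, listen, upstream))
    return post_json(f"{TOXIPROXY_API}/proxies/{name}/toxics", latency_toxic_payload(latency_ms))
```

For instance, `inject_latency("object_store", "127.0.0.1:9001", "127.0.0.1:9000")` would place a 30 ms delay in front of a hypothetical local S3-compatible endpoint; the query engine is then pointed at the proxy's listen address instead of the store itself.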

10 of 17

11 of 17

12 of 17

Demo

TPC-H Q6

SELECT
    sum(l_extendedprice * l_discount) AS revenue
FROM
    lineitem
WHERE
    l_shipdate >= CAST('1994-01-01' AS date)
    AND l_shipdate < CAST('1995-01-01' AS date)
    AND l_discount BETWEEN 0.05 AND 0.07
    AND l_quantity < 24;

13 of 17

14 of 17

15 of 17

Q&A

16 of 17

Backup slides

17 of 17

Future work

  • Concurrency issues!
    • Reading a file chunk that just got evicted and deleted
    • Deleting a chunk that already got deleted
    • Two threads downloading the same chunk
      • Possibility of deadlocks!
  • Writes?
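
A common guard against two threads downloading the same chunk is a "single-flight" map: the first caller for a key performs the download and later callers wait for its result. The Python sketch below illustrates that idea only; it is not the talk's implementation, `SingleFlight` and `fetch` are hypothetical names, and eviction is deliberately left out. The lock is never held during the download itself, which avoids one obvious source of deadlock.

```python
import threading
from typing import Callable, Dict, Hashable


class SingleFlight:
    """Ensure concurrent requests for the same key trigger only one fetch."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._results: Dict[Hashable, bytes] = {}  # unbounded here; a real cache would evict
        self._in_flight: Dict[Hashable, threading.Event] = {}

    def get(self, key: Hashable, fetch: Callable[[], bytes]) -> bytes:
        while True:
            with self._lock:
                if key in self._results:
                    return self._results[key]
                event = self._in_flight.get(key)
                leader = event is None
                if leader:
                    # First caller for this key: register as the downloader.
                    event = threading.Event()
                    self._in_flight[key] = event
            if leader:
                try:
                    data = fetch()  # slow I/O happens outside the lock
                    with self._lock:
                        self._results[key] = data
                    return data
                finally:
                    # Runs on success and on failure, so waiters always wake up.
                    with self._lock:
                        del self._in_flight[key]
                    event.set()
            # Follower: wait for the leader, then loop to read the cached result
            # (or to retry as the new leader if the leader's fetch failed).
            event.wait()
```

If the leader's `fetch` raises, the exception propagates to that caller only; the waiting threads wake up, find no cached result and one of them retries the download.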
