1 of 12

BufferPool - Spill to Disk

Design Summary

2 of 12

High-level Design

Components:

  • BufferPool API
  • ReservationTracker
  • In-Memory Buffer/Page Mgmt
  • Operators & Query Lifecycle
  • Buffer-splitting allocator for hash tables
  • Spilled Page Mgmt
  • Scalable Buffer Allocator
  • Tmp File Mgmt
  • DiskIoMgr improvements
  • Planner Reservation Computation/Allocation

Status legend (per component, from the diagram): Completed / Code available / Work in progress / Not started

3 of 12

Spilled Page Management

  • Need to define states that a Page can be in + invariants
  • Policies
    • If/when to write an unpinned page to disk
  • Mechanisms
    • Data structures referencing pages in each state

4 of 12

Spilled Page Mgmt - Page States

Page states (reconstructed from the state diagram):

  • Pinned (pin_count > 0) - has a buffer
  • Unpinned (dirty) - has a buffer whose contents are not yet on disk
  • Unpinned (write in flight) - has a buffer; asynchronous write in progress
  • Unpinned (clean) - has a buffer that matches the copy on disk
  • Unpinned (evicted) - no buffer; contents live only on disk
  • Error - entered on a write error
  • Destroyed - terminal state

Transitions:

  • Create(): a new page starts Pinned
  • Unpin(): Pinned → Unpinned (dirty) once pin_count == 0; while pin_count > 0 the page stays Pinned
  • Flush: Unpinned (dirty) → Unpinned (write in flight)
  • Write complete: Unpinned (write in flight) → Unpinned (clean)
  • Write error: Unpinned (write in flight) → Error
  • Evict: Unpinned (clean) → Unpinned (evicted)
  • Pin(): any Unpinned state → Pinned; from Unpinned (evicted) this requires a blocking read
  • Close(): any state → Destroyed

Legend: Pinned and the dirty/write-in-flight/clean Unpinned states hold a buffer; Unpinned (evicted) does not.

Buffer Ownership

Each buffer is owned by either:

• a BufferPool client, e.g. ExecNode or MemPool

• a page (see above)

• internal free lists (if freed by a client, or if the owning page was evicted)

Note: there is now a pinned (read in flight) state here. Need to update.
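Illustrative C++ sketch of the states and the buffer-ownership legend (hypothetical names, not the actual BufferPool code; the pinned (read in flight) state from the note above is omitted):

  // Hypothetical enum mirroring the state diagram above.
  enum class PageState {
    PINNED,                    // pin_count > 0; buffer attached
    UNPINNED_DIRTY,            // pin_count == 0; buffer holds unwritten data
    UNPINNED_WRITE_IN_FLIGHT,  // asynchronous write to scratch in progress
    UNPINNED_CLEAN,            // buffer matches the copy on disk
    UNPINNED_EVICTED,          // buffer reclaimed; data only on disk
    WRITE_ERROR,               // a write failed
    DESTROYED,                 // Close() was called
  };

  // Per the legend: the pinned and dirty/write-in-flight/clean states hold a
  // buffer; an evicted page does not (Error/Destroyed aren't covered above).
  inline bool HasBuffer(PageState s) {
    switch (s) {
      case PageState::PINNED:
      case PageState::UNPINNED_DIRTY:
      case PageState::UNPINNED_WRITE_IN_FLIGHT:
      case PageState::UNPINNED_CLEAN:
        return true;
      default:
        return false;
    }
  }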

5 of 12

Spilled Page Mgmt - Policies

  • Goals for page eviction policy
    • Simplicity
    • Fairness - a spilling query should not force an in-memory query to block waiting for I/O
  • Policy requirements:
    • Buffers allocated + pinned pages + dirty unpinned pages + write in flight <= reservation
    • Clean unpinned pages can exceed reservation - evicting these requires no I/O
  • Proposed eviction algorithm:
    • Clients enforce the invariant before allocating each buffer by writing out dirty pages
    • Asynchronous writes are initiated when the client is approaching the limit (so compute and I/O can be overlapped).
    • The thread that called the BufferPool API blocks until the invariant becomes true (sketched below)
  • Buffer allocation algorithm:
    • First try to allocate a free buffer; if all memory is in use, evict clean unpinned pages and reuse their buffers.
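Sketch of the invariant check and the write-out loop; Client, StartAsyncWrite and WaitForWrite are hypothetical stand-ins, not the real BufferPool internals:

  #include <cstdint>
  #include <deque>

  struct Page;

  struct Client {
    int64_t buffer_bytes = 0;           // buffers allocated to the client
    int64_t pinned_bytes = 0;           // pinned pages
    int64_t dirty_unpinned_bytes = 0;   // dirty unpinned pages
    int64_t in_flight_write_bytes = 0;  // writes started, not yet complete
    int64_t reservation_bytes = 0;
    std::deque<Page*> dirty_pages;      // LIFO: most recently unpinned first

    // The invariant above; 'extra' is the allocation being attempted.
    bool InvariantHolds(int64_t extra) const {
      return buffer_bytes + pinned_bytes + dirty_unpinned_bytes +
          in_flight_write_bytes + extra <= reservation_bytes;
    }

    void StartAsyncWrite(Page* page);  // dirty -> write in flight
    void WaitForWrite();               // blocks until one write completes

    // Called before each buffer allocation of 'len' bytes.
    void PrepareToAllocate(int64_t len) {
      // Starting a write doesn't shrink the sum (dirty and in-flight bytes
      // both count), so queue writes, then block on their completion.
      while (!InvariantHolds(len) && !dirty_pages.empty()) {
        StartAsyncWrite(dirty_pages.front());
        dirty_pages.pop_front();
      }
      while (!InvariantHolds(len)) WaitForWrite();
    }
  };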

6 of 12

Spilled Page Mgmt

Data structures (reconstructed from the diagram):

  • Each client tracks three lists of pages: pinned pages, dirty pages (LIFO), and write-in-flight pages
  • Unpin() moves a page to the client's dirty list; a flush moves it to the write-in-flight list
  • On write completion, a page moves to the clean page lists (one per core)
  • Evicting a clean page returns its buffer to the memory allocator's free buffer lists (one per core)
  • Both the clean page lists and the free buffer lists are bucketed by power-of-two size, from 64KB up to MAX_MB MB
  • TBD: eviction policy for clean pages - FIFO?

Implementation notes:

These data structures will be combined to reduce the # of lock acquisitions.
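Sketch of one possible layout for these lists (hypothetical names; as noted above, the real structures may be combined to cut locking):

  #include <array>
  #include <list>

  struct Page;
  struct Buffer;
  constexpr int kNumSizeClasses = 6;  // e.g. 64KB, 128KB, ..., 2MB

  // Per-client lists: pages move pinned -> dirty -> write in flight.
  struct ClientPageLists {
    std::list<Page*> pinned_pages;
    std::list<Page*> dirty_pages;       // LIFO: Unpin() pushes at the front
    std::list<Page*> in_flight_writes;  // flushed, awaiting write completion
  };

  // Per-core lists, bucketed by power-of-two size class.
  struct CoreLists {
    std::array<std::list<Page*>, kNumSizeClasses> clean_pages;  // evict: no I/O
    std::array<std::list<Buffer*>, kNumSizeClasses> free_buffers;
  };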

7 of 12

Scalable Memory Allocator

  • Reduces contention on TCMalloc heap lock
  • Buffers allocated either from TCMalloc or mmap().
  • Set MADV_HUGEPAGE for allocations >= 2MB (see the sketch below).
  • One freelist per core per power-of-two buffer size
  • Challenges: scalability, fragmentation, free space management

Diagram: free lists per core (Core 0, Core 1, ...), one list per power-of-two size class from 64KB to 16MB.
8 of 12

Buffer-splitting Allocator for Hash Tables

  • Partitioned hash join/aggregation
    • Allocates many small hash table bucket-directories of various sizes
    • Aggregation resizes the bucket directories frequently
  • Problems:
    • TLB pressure even for small joins/aggs (random access to 16 bucket directories per operator plus 16+ buffers)
    • Many calls into global buffer allocator - scalability challenges
    • Many allocations of smaller variable-size buffers - harder to efficiently manage free space
  • Proposal:
    • “suballocator” utility that splits 2MB buffers backed by transparent huge pages
    • Agg/join can use a suballocator to allocate hash tables
      • one buffer could back multiple bucket directories for a small agg or join
    • A buddy allocator has good fragmentation behaviour (see the sketch below).
    • https://gerrit.cloudera.org/#/c/4715/
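Sketch of the buddy-splitting step only (illustrative constants; free-list management elided - the actual suballocator is in the gerrit change above):

  #include <cstdint>

  constexpr int64_t kMinAlloc = 4096;  // smallest suballocation handed out

  void AddToFreeList(char* block, int64_t len);  // per-size free lists (elided)

  // Round a request up to the next power-of-two size class.
  int64_t RoundUpAllocation(int64_t bytes) {
    int64_t len = kMinAlloc;
    while (len < bytes) len *= 2;
    return len;
  }

  // Split a free block (initially a whole 2MB buffer) in half repeatedly
  // until it matches 'target_len'. Each unused half (the "buddy") goes on a
  // free list; on free, a block can be coalesced with its buddy, which is
  // what keeps fragmentation low.
  char* SplitToSize(char* block, int64_t block_len, int64_t target_len) {
    while (block_len > target_len) {
      block_len /= 2;
      AddToFreeList(block + block_len, block_len);  // upper half becomes free
    }
    return block;
  }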

9 of 12

Managing the Split between BufferPool and non-BufferPool Memory

  • Long-term vision:
    • All memory on the data path (IO buffers, MemPools, etc) will be served from the BufferPool
    • Many control data structures still allocated via malloc()
    • Most of the process's memory can go to the buffer pool, minus a percentage for other overhead
      • Main challenges: what to do about LLVM and Thrift data structures
  • Short-term approach:
    • Much memory (IO buffers, MemPools) still allocated via malloc()
    • Limit BufferPool to 80% of process limit (similar to BufferedBlockMgr currently)
    • MemTracker consumption includes reservation:
      • reserved memory + other consumption <= process limit (sketched below)
    • If > 20% of process limit is required for non-BufferPool memory (e.g. scans), MemTracker GC can reclaim memory from BufferPool
      • That memory isn’t reserved, so it will be present in the BufferPool as free buffers or clean pages
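The accounting rule as a sketch (the 80/20 split is from above; names are illustrative):

  #include <cstdint>

  // Reservations count as MemTracker consumption, so the process-wide check:
  bool WithinProcessLimit(int64_t reserved_bytes, int64_t other_consumption,
                          int64_t process_limit) {
    return reserved_bytes + other_consumption <= process_limit;
  }

  // The BufferPool itself is capped at 80% of the process limit.
  int64_t BufferPoolLimit(int64_t process_limit) {
    return process_limit * 8 / 10;
  }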

10 of 12

Tmp File Management

  • Need to manage
    • temporary files round-robined across disks
    • I/O to those files
  • Lifetime of files
    • Per-operator
      • Con: need additional abstraction to implement scratch_limit
      • Con: need to reclaim scratch space during query execution; otherwise the footprint grows to sum(peak(operator scratch usage)) rather than the query's true peak, peak(sum(operator scratch usage)), wasting scratch space
    • Global - scratch space is allocated out of a global pool
      • Con: free space management and cleanup are non-trivial - do we garbage collect space? What if files get fragmented over time?
      • Con: scalability
    • Per-query
      • Simple, consistent with current behaviour.
      • Con: need to think about scalability with high DOP

11 of 12

Tmp File Management Design

  • Build on the FileGroup abstraction (API sketch below)
  • One FileGroup per query
    • scratch_limit is easily enforced.
    • Files can be recycled during query, then destroyed when query ends.
  • Needs extensions:
    • Add free space management
    • Manage I/O within FileGroup
    • Retry on write failure (IMPALA-2079)
  • https://gerrit.cloudera.org/#/c/5141/
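Hedged sketch of what the extended per-query FileGroup interface could look like (method names are illustrative; the actual API is in the gerrit change above):

  #include <cstdint>

  class Status;         // success/error result type, as elsewhere in Impala
  struct ScratchRange;  // identifies a (file, offset, length) scratch region

  class FileGroup {
   public:
    // Allocate 'len' bytes of scratch, reusing freed ranges in the group's
    // files before growing them; files are round-robined across disks.
    // Fails if the allocation would exceed the query's scratch_limit.
    Status AllocateSpace(int64_t len, ScratchRange* range);

    // Return a range to the group's free list so later spills can reuse it
    // (this is how files are recycled during the query).
    void ReleaseSpace(const ScratchRange& range);

    // I/O within the group; a failed write is retried on a different
    // scratch disk (IMPALA-2079).
    Status WriteAsync(const ScratchRange& range, const uint8_t* data);
    Status Read(const ScratchRange& range, uint8_t* buffer);

    // Delete all of the group's files when the query ends.
    void Close();
  };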

12 of 12

DiskIoMgr Improvements

  • Currently the DiskIoMgr read path allocates its own buffers of up to 8MB
    • BufferedBlockMgr does an extra copy when reading spilled blocks
    • Memory and CPU inefficient
  • BufferPool needs better control over memory
  • Need a mechanism to read directly into a caller-provided buffer:
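One possible shape for such an API (illustrative signature, not the actual DiskIoMgr interface):

  #include <cstdint>

  class Status;
  class ScanRange;  // describes the file region to read

  class DiskIoMgr {
   public:
    // Read the range into 'client_buffer' (owned by the caller, e.g. a
    // BufferPool buffer of at least 'buffer_len' bytes) instead of into an
    // internal buffer that must then be copied out of.
    Status Read(ScanRange* range, uint8_t* client_buffer, int64_t buffer_len);
  };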