1 of 12

BufferPool - Spill to Disk

Design Summary

2 of 12

High-level Design

Components:

  • BufferPool API
  • ReservationTracker
  • In-Memory Buffer/Page Mgmt
  • Operators & Query Lifecycle
  • Buffer-splitting allocator for hash tables
  • Spilled Page Mgmt
  • Scalable Buffer Allocator
  • Tmp File Mgmt
  • DiskIoMgr improvements
  • Planner Reservation Computation/Allocation

Status legend (per component, from the diagram): Completed / Code available / Work in progress / Not started

3 of 12

Spilled Page Management

  • Need to define states that a Page can be in + invariants
  • Policies
    • If/when to write an unpinned page to disk
  • Mechanisms
    • Data structures referencing pages in each state

4 of 12

Spilled Page Mgmt - Page States

Page states (reconstructed from the state diagram):

  • Pinned (pin_count > 0) - has a buffer
  • Unpinned (dirty) - has a buffer whose contents are not yet on disk
  • Unpinned (write in flight) - has a buffer; asynchronous write in progress
  • Unpinned (clean) - has a buffer that matches the copy on disk
  • Unpinned (evicted) - no buffer; contents live only on disk
  • Error - entered on a write error
  • Destroyed - terminal state

Transitions:

  • Create(): a new page starts Pinned
  • Unpin(): Pinned → Unpinned (dirty) once pin_count == 0; while pin_count > 0 the page stays Pinned
  • Flush: Unpinned (dirty) → Unpinned (write in flight)
  • Write complete: Unpinned (write in flight) → Unpinned (clean)
  • Write error: Unpinned (write in flight) → Error
  • Evict: Unpinned (clean) → Unpinned (evicted)
  • Pin(): any Unpinned state → Pinned; from Unpinned (evicted) this requires a blocking read
  • Close(): any state → Destroyed

Legend: Pinned and the dirty/write-in-flight/clean Unpinned states hold a buffer; Unpinned (evicted) does not.

Buffer Ownership

Each buffer is owned by either:

• a BufferPool client, e.g. ExecNode or MemPool

• a page (see above)

• internal free lists (if freed by a client, or if the owning page was evicted)

Note: there is now a pinned (read in flight) state here. Need to update.
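Illustrative C++ sketch of the states and the buffer-ownership legend (hypothetical names, not the actual BufferPool code; the pinned (read in flight) state from the note above is omitted):

  // Hypothetical enum mirroring the state diagram above.
  enum class PageState {
    PINNED,                    // pin_count > 0; buffer attached
    UNPINNED_DIRTY,            // pin_count == 0; buffer holds unwritten data
    UNPINNED_WRITE_IN_FLIGHT,  // asynchronous write to scratch in progress
    UNPINNED_CLEAN,            // buffer matches the copy on disk
    UNPINNED_EVICTED,          // buffer reclaimed; data only on disk
    WRITE_ERROR,               // a write failed
    DESTROYED,                 // Close() was called
  };

  // Per the legend: the pinned and dirty/write-in-flight/clean states hold a
  // buffer; an evicted page does not (Error/Destroyed aren't covered above).
  inline bool HasBuffer(PageState s) {
    switch (s) {
      case PageState::PINNED:
      case PageState::UNPINNED_DIRTY:
      case PageState::UNPINNED_WRITE_IN_FLIGHT:
      case PageState::UNPINNED_CLEAN:
        return true;
      default:
        return false;
    }
  }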

5 of 12

Spilled Page Mgmt - Policies

  • Goals for page eviction policy
    • Simplicity
    • Fairness - a spilling query should not force an in-memory query to block waiting for I/O
  • Policy requirements:
    • Buffers allocated + pinned pages + dirty unpinned pages + write in flight <= reservation
    • Clean unpinned pages can exceed reservation - evicting these requires no I/O
  • Proposed eviction algorithm:
    • Clients enforce the invariant before allocating each buffer by writing out dirty pages
    • Asynchronous writes are initiated when the client is approaching the limit (so compute and I/O can be overlapped).
    • The thread that called the BufferPool API blocks until the invariant becomes true (sketched below)
  • Buffer allocation algorithm:
    • First try to allocate a free buffer; if all memory is in use, evict clean unpinned pages and reuse their buffers.
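Sketch of the invariant check and the write-out loop; Client, StartAsyncWrite and WaitForWrite are hypothetical stand-ins, not the real BufferPool internals:

  #include <cstdint>
  #include <deque>

  struct Page;

  struct Client {
    int64_t buffer_bytes = 0;           // buffers allocated to the client
    int64_t pinned_bytes = 0;           // pinned pages
    int64_t dirty_unpinned_bytes = 0;   // dirty unpinned pages
    int64_t in_flight_write_bytes = 0;  // writes started, not yet complete
    int64_t reservation_bytes = 0;
    std::deque<Page*> dirty_pages;      // LIFO: most recently unpinned first

    // The invariant above; 'extra' is the allocation being attempted.
    bool InvariantHolds(int64_t extra) const {
      return buffer_bytes + pinned_bytes + dirty_unpinned_bytes +
          in_flight_write_bytes + extra <= reservation_bytes;
    }

    void StartAsyncWrite(Page* page);  // dirty -> write in flight
    void WaitForWrite();               // blocks until one write completes

    // Called before each buffer allocation of 'len' bytes.
    void PrepareToAllocate(int64_t len) {
      // Starting a write doesn't shrink the sum (dirty and in-flight bytes
      // both count), so queue writes, then block on their completion.
      while (!InvariantHolds(len) && !dirty_pages.empty()) {
        StartAsyncWrite(dirty_pages.front());
        dirty_pages.pop_front();
      }
      while (!InvariantHolds(len)) WaitForWrite();
    }
  };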

6 of 12

Spilled Page Mgmt

Data structures (reconstructed from the diagram):

  • Each client tracks three lists of pages: pinned pages, dirty pages (LIFO), and write-in-flight pages
  • Unpin() moves a page to the client's dirty list; a flush moves it to the write-in-flight list
  • On write completion, a page moves to the clean page lists (one per core)
  • Evicting a clean page returns its buffer to the memory allocator's free buffer lists (one per core)
  • Both the clean page lists and the free buffer lists are bucketed by power-of-two size, from 64KB up to MAX_MB MB
  • TBD: eviction policy for clean pages - FIFO?

Implementation notes:

These data structures will be combined to reduce the # of lock acquisitions.
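Sketch of one possible layout for these lists (hypothetical names; as noted above, the real structures may be combined to cut locking):

  #include <array>
  #include <list>

  struct Page;
  struct Buffer;
  constexpr int kNumSizeClasses = 6;  // e.g. 64KB, 128KB, ..., 2MB

  // Per-client lists: pages move pinned -> dirty -> write in flight.
  struct ClientPageLists {
    std::list<Page*> pinned_pages;
    std::list<Page*> dirty_pages;       // LIFO: Unpin() pushes at the front
    std::list<Page*> in_flight_writes;  // flushed, awaiting write completion
  };

  // Per-core lists, bucketed by power-of-two size class.
  struct CoreLists {
    std::array<std::list<Page*>, kNumSizeClasses> clean_pages;  // evict: no I/O
    std::array<std::list<Buffer*>, kNumSizeClasses> free_buffers;
  };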

7 of 12

Scalable Memory Allocator

  • Reduces contention on TCMalloc heap lock
  • Buffers allocated either from TCMalloc or mmap().
  • Set MADV_HUGEPAGE for allocations >= 2MB (see the sketch below).
  • One freelist per core per power-of-two buffer size
  • Challenges: scalability, fragmentation, free space management

Diagram: free lists per core (Core 0, Core 1, ...), one list per power-of-two size class from 64KB to 16MB.
8 of 12

Buffer-splitting Allocator for Hash Tables

  • Partitioned hash join/aggregation
    • Allocates many small hash table bucket-directories of various sizes
    • Aggregation resizes the bucket directories frequently
  • Problems:
    • TLB pressure even for small joins/aggs (random access to 16 bucket directories per operator plus 16+ buffers)
    • Many calls into global buffer allocator - scalability challenges
    • Many allocations of smaller variable-size buffers - harder to efficiently manage free space
  • Proposal:
    • “suballocator” utility that splits 2MB buffers backed by transparent huge pages
    • Agg/join can use a suballocator to allocate hash tables
      • one buffer could back multiple bucket directories for a small agg or join
    • A buddy allocator has good fragmentation behaviour (see the sketch below).
    • https://gerrit.cloudera.org/#/c/4715/
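Sketch of the buddy-splitting step only (illustrative constants; free-list management elided - the actual suballocator is in the gerrit change above):

  #include <cstdint>

  constexpr int64_t kMinAlloc = 4096;  // smallest suballocation handed out

  void AddToFreeList(char* block, int64_t len);  // per-size free lists (elided)

  // Round a request up to the next power-of-two size class.
  int64_t RoundUpAllocation(int64_t bytes) {
    int64_t len = kMinAlloc;
    while (len < bytes) len *= 2;
    return len;
  }

  // Split a free block (initially a whole 2MB buffer) in half repeatedly
  // until it matches 'target_len'. Each unused half (the "buddy") goes on a
  // free list; on free, a block can be coalesced with its buddy, which is
  // what keeps fragmentation low.
  char* SplitToSize(char* block, int64_t block_len, int64_t target_len) {
    while (block_len > target_len) {
      block_len /= 2;
      AddToFreeList(block + block_len, block_len);  // upper half becomes free
    }
    return block;
  }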

9 of 12

Managing the Split between BufferPool and non-BufferPool Memory

  • Long-term vision:
    • All memory on the data path (IO buffers, MemPools, etc) will be served from the BufferPool
    • Many control data structures still allocated via malloc()
    • Most of the process's memory can go to the buffer pool, minus a percentage for other overhead
      • Main challenges: what to do about LLVM and Thrift data structures
  • Short-term approach:
    • Much memory (IO buffers, MemPools) still allocated via malloc()
    • Limit BufferPool to 80% of process limit (similar to BufferedBlockMgr currently)
    • MemTracker consumption includes reservation:
      • reserved memory + other consumption <= process limit (sketched below)
    • If > 20% of process limit is required for non-BufferPool memory (e.g. scans), MemTracker GC can reclaim memory from BufferPool
      • That memory isn’t reserved, so it will be present in the BufferPool as free buffers or clean pages
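The accounting rule as a sketch (the 80/20 split is from above; names are illustrative):

  #include <cstdint>

  // Reservations count as MemTracker consumption, so the process-wide check:
  bool WithinProcessLimit(int64_t reserved_bytes, int64_t other_consumption,
                          int64_t process_limit) {
    return reserved_bytes + other_consumption <= process_limit;
  }

  // The BufferPool itself is capped at 80% of the process limit.
  int64_t BufferPoolLimit(int64_t process_limit) {
    return process_limit * 8 / 10;
  }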

10 of 12

Tmp File Management

  • Need to manage
    • temporary files round-robined across disks
    • I/O to those files
  • Lifetime of files
    • Per-operator
      • Con: need additional abstraction to implement scratch_limit
      • Con: need to reclaim scratch space during query execution; otherwise the footprint grows to sum(peak(operator scratch usage)) rather than the query's true peak, peak(sum(operator scratch usage)), wasting scratch space
    • Global - scratch space is allocated out of a global pool
      • Con: free space management and cleanup are non-trivial - do we garbage collect space? What if files get fragmented over time?
      • Con: scalability
    • Per-query
      • Simple, consistent with current behaviour.
      • Con: need to think about scalability with high DOP

11 of 12

Tmp File Management Design

  • Build on the FileGroup abstraction (API sketch below)
  • One FileGroup per query
    • scratch_limit is easily enforced.
    • Files can be recycled during query, then destroyed when query ends.
  • Needs extensions:
    • Add free space management
    • Manage I/O within FileGroup
    • Retry on write failure (IMPALA-2079)
  • https://gerrit.cloudera.org/#/c/5141/
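Hedged sketch of what the extended per-query FileGroup interface could look like (method names are illustrative; the actual API is in the gerrit change above):

  #include <cstdint>

  class Status;         // success/error result type, as elsewhere in Impala
  struct ScratchRange;  // identifies a (file, offset, length) scratch region

  class FileGroup {
   public:
    // Allocate 'len' bytes of scratch, reusing freed ranges in the group's
    // files before growing them; files are round-robined across disks.
    // Fails if the allocation would exceed the query's scratch_limit.
    Status AllocateSpace(int64_t len, ScratchRange* range);

    // Return a range to the group's free list so later spills can reuse it
    // (this is how files are recycled during the query).
    void ReleaseSpace(const ScratchRange& range);

    // I/O within the group; a failed write is retried on a different
    // scratch disk (IMPALA-2079).
    Status WriteAsync(const ScratchRange& range, const uint8_t* data);
    Status Read(const ScratchRange& range, uint8_t* buffer);

    // Delete all of the group's files when the query ends.
    void Close();
  };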

12 of 12

DiskIoMgr Improvements

  • Currently the DiskIoMgr read path allocates its own buffers of up to 8MB
    • BufferedBlockMgr does an extra copy when reading spilled blocks
    • Memory and CPU inefficient
  • BufferPool needs better control over memory
  • Need a mechanism to read directly into a caller-provided buffer:
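One possible shape for such an API (illustrative signature, not the actual DiskIoMgr interface):

  #include <cstdint>

  class Status;
  class ScanRange;  // describes the file region to read

  class DiskIoMgr {
   public:
    // Read the range into 'client_buffer' (owned by the caller, e.g. a
    // BufferPool buffer of at least 'buffer_len' bytes) instead of into an
    // internal buffer that must then be copied out of.
    Status Read(ScanRange* range, uint8_t* client_buffer, int64_t buffer_len);
  };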