1 of 27

AstriFlash: A Flash-Based System for Online Services

Siddharth Gupta, Yunho Oh, Lei Yan, Mark Sutherland, Abhishek Bhattacharjee, Babak Falsafi, Peter Hsu

2 of 27

Online Services Host Data in DRAM

  • Crucial for high performance and low tail latency
  • DRAM is expensive and is not scaling in density

DRAM forms a significant portion of the overall server cost


3 of 27

Can We Serve Data from Flash?

Good news!

    • Many online services have a 100 ms Service-Level Objective (SLO)

Problem:

    • Demand paging eases programmability and adaptivity to runtime dynamics
    • But paging adds ~10 μs of overhead per Flash access
    • Even when the working set fits in DRAM → 50% drop in throughput

How do we integrate Flash for maximum throughput?


Device | Cost Benefit | Raw Latency
DRAM   | 1x           | 50 ns
Flash  | 50x          | 50,000 ns

4 of 27

AstriFlash

Correctly size DRAM capacity → meet SLO

Eliminate demand paging → maximize throughput

How?

    • Memory-mapped Flash
    • Hardware-managed DRAM cache
    • User-level threads

Results validated with an analytic model and timing simulation

    • Reduces cost of memory by 20x
    • Achieves 95% of DRAM throughput under SLO


5 of 27

Outline

  • Introduction
  • Goals
    • Meet SLO
    • Minimize DRAM capacity
    • Maximize throughput
  • AstriFlash
  • Evaluation
  • Conclusions


6 of 27

Need DRAM to Meet SLO

Many services have ms-scale SLO

  • Allows absorbing μs-scale raw Flash latencies

But, cannot replace all DRAM with Flash

  • Flash accesses with queueing delays impact tail latency
  • Too many Flash accesses will violate the SLO
  • Need DRAM to filter Flash accesses


What is the correct capacity for DRAM?

7 of 27

Capture the Working Set

    • Access patterns of online services are inherently skewed
    • 3% of the dataset absorbs >95% of the accesses (a miss every ~10 μs); see the sketch after the figure


3% DRAM reduces memory cost by 20x
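The 20x figure follows from the 50x per-byte cost gap in the device table: with the whole dataset resident in Flash plus a DRAM cache for 3% of it (cost in Flash-capacity units),

\[
\frac{\text{all-DRAM cost}}{\text{3\% DRAM + Flash cost}} = \frac{50}{0.03 \times 50 + 1} = \frac{50}{2.5} = 20\times
\]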

[Figure: miss-rate vs. DRAM capacity curves with knee, upper bound, and lower bound, for CloudSuite 3.0 workloads: Data Caching, Data Serving, Media Streaming, Web Search]
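The skew claim can be sanity-checked with a simple popularity model. A minimal sketch, assuming a Zipfian popularity distribution; the dataset size and skew parameter below are illustrative assumptions, not numbers from the talk:

/* build: cc zipf.c -lm */
#include <stdio.h>
#include <math.h>

/* Sum of Zipf(s) popularity weights over item ranks [first, last]. */
static double zipf_mass(long first, long last, double s) {
    double sum = 0.0;
    for (long i = first; i <= last; i++)
        sum += pow((double)i, -s);
    return sum;
}

int main(void) {
    long n = 10000000;      /* items in the dataset (assumed)  */
    long k = n * 3 / 100;   /* hottest 3% kept in DRAM         */
    double s = 1.2;         /* skew parameter (assumed)        */
    double hot = zipf_mass(1, k, s);
    double all = hot + zipf_mass(k + 1, n, s);
    printf("top 3%% absorbs %.1f%% of accesses\n", 100.0 * hot / all);
    return 0;
}

With these assumed parameters the hottest 3% of items absorbs roughly 96% of accesses, in line with the >95% observation above.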

8 of 27

The Demand Paging Dilemma

Demand paging between DRAM and Flash eases programming model

  • OS transfers pages between DRAM and Flash using page faults

But page faults have high overheads

  • I/O request handling to locate the page
  • Paging between the Flash and DRAM address spaces
  • Context switch


Demand paging fundamentally limits throughput!

9 of 27

Demand Paging Overhead


Significant throughput loss due to demand paging

[Timeline: three application threads, demand paging vs. ideal execution; per miss: 10 μs of app work, 5 μs page fault, 50 μs Flash access, 5 μs context switch]
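These timeline numbers reproduce the 50% figure from earlier: even when other threads hide the 50 μs Flash access, every miss costs 10 μs of CPU time (page fault plus context switch) for every 10 μs of application work:

\[
\frac{t_{\text{app}}}{t_{\text{app}} + t_{\text{fault}} + t_{\text{switch}}}
= \frac{10\,\mu\text{s}}{(10 + 5 + 5)\,\mu\text{s}} = 50\%
\]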

10 of 27

Related Work

Demand paging overheads limit application throughput

  • Kernel bypassing proposed for high performance [SPDK] [Kourtis, FAST '19]
  • Moves the burden of DRAM management to the application
  • Abandons virtual memory


11 of 27

Outline

  • Introduction
  • Goals
  • AstriFlash
    1. Memory-mapped Flash
    2. Hardware-managed DRAM cache
    3. User-level threads
  • Evaluation
  • Conclusions


12 of 27

Mechanism #1: Memory-Mapped Flash

Expose Flash to the OS as memory

  • PCIe support for memory regions [Bae, ISCA ’18]
  • Use Base Address Registers (BARs)
  • PCIe controller directs requests to the correct device

Need TLBs with large reach to cover Flash

  • Multi-level TLBs, huge pages, and coalescing → 10-100 GBs of reach
  • Novel approaches (e.g., Midgard [ISCA '21]) → 100s of TBs


How to control the data transfer between Flash and DRAM?

[Diagram: Core with TLB issues memory requests; the PCIe controller steers them through the device's BAR to Flash]
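For a concrete flavor of the BAR mechanism: Linux already exposes PCIe BARs as mappable sysfs resource files. A minimal userspace sketch, assuming a hypothetical device at 0000:01:00.0 with a mappable BAR0; this illustrates the generic mechanism, not AstriFlash's actual hardware path:

#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* resource0 corresponds to the device's BAR0; path is a placeholder */
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 1UL << 20;   /* map 1 MB of the BAR (illustrative) */
    volatile uint8_t *flash = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (flash == MAP_FAILED) { perror("mmap"); return 1; }

    uint8_t b = flash[0];     /* ordinary loads/stores now reach the device */
    printf("first byte: 0x%02x\n", b);

    munmap((void *)flash, len);
    close(fd);
    return 0;
}

Once mapped, plain loads and stores reach the device, which is exactly the property memory-mapped Flash needs.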

13 of 27

Mechanism #2: Hardware DRAM Cache

Footprint cache [ISCA ’13]

Frontside controller

    • Interacts with the on-chip caches
    • Probes DRAM for data hit/miss

Backside controller

    • Interacts with the Flash
    • Inserts/evicts pages from DRAM


DRAM cache misses trigger user-level thread switches

[Diagram: Core, DRAM cache with frontside controller (F) and backside controller (B), and Flash]

1. Access DRAM: miss!
2. Signal miss to backside controller
3. Access Flash
4. Signal miss to core
5. Switch thread
6. Receive page
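A software model of the frontside probe (step 1), using the DRAM-cache geometry from the evaluation (128K sets, 4 ways, 4 KB pages); an illustrative sketch, not the actual controller design:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SETS      (128 * 1024)   /* per memory node, as evaluated */
#define WAYS      4
#define PAGE_BITS 12             /* 4 KB pages */

typedef struct { uint64_t tag; bool valid; } way_t;
static way_t tags[SETS][WAYS];

/* Frontside probe: true on a DRAM-cache hit; on a miss, the backside
 * controller would be signalled to fetch the page from Flash. */
static bool frontside_probe(uint64_t paddr) {
    uint64_t page = paddr >> PAGE_BITS;
    uint64_t set  = page % SETS;
    uint64_t tag  = page / SETS;
    for (int w = 0; w < WAYS; w++)
        if (tags[set][w].valid && tags[set][w].tag == tag)
            return true;
    return false;
}

int main(void) {
    uint64_t addr = 0x123456789000ULL;
    uint64_t page = addr >> PAGE_BITS;
    /* Backside fill: install the page, after which the probe hits. */
    tags[page % SETS][0] = (way_t){ .tag = page / SETS, .valid = true };
    printf("hit: %d\n", frontside_probe(addr));
    return 0;
}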

14 of 27

Mechanism #3: User-Level Threads

DRAM cache miss triggers thread switches

  • Handler to save context and schedule next thread
  • ISA support to register software handler
    • Handler address and Link register
  • Microarchitectural support to jump to handler
    • Flush ROB
    • Save PC in Link register
    • Jump to handler address


Fast thread switches (~100 ns) to hide Flash access

[Diagram: a missing LD 0xAddr flushes the ROB; the PC is saved in the link register and the core jumps to the address in the handler address register]
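A software analogy of this switch using POSIX ucontext, with swapcontext standing in for the proposed handler-address/link-register hardware path; the structure is illustrative (hardware support is what brings the cost down to ~100 ns):

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define NTHREADS 3
#define STACKSZ  (64 * 1024)

static ucontext_t main_ctx, ctx[NTHREADS];
static int cur = 0;

/* Stand-in for the hardware-invoked handler: on a (simulated) DRAM-cache
 * miss, save this thread's context and resume the next one instead of
 * stalling for the 50 us Flash access. */
static void miss_handler(void) {
    int prev = cur;
    cur = (cur + 1) % NTHREADS;
    swapcontext(&ctx[prev], &ctx[cur]);
}

static void worker(int id) {
    for (int i = 0; i < 2; i++) {
        printf("thread %d: work chunk %d\n", id, i);
        miss_handler();              /* pretend a load just missed */
    }
}

int main(void) {
    for (int i = 0; i < NTHREADS; i++) {
        getcontext(&ctx[i]);
        ctx[i].uc_stack.ss_sp = malloc(STACKSZ);
        ctx[i].uc_stack.ss_size = STACKSZ;
        ctx[i].uc_link = &main_ctx;  /* demo exits when thread 0 finishes */
        makecontext(&ctx[i], (void (*)(void))worker, 1, i);
    }
    swapcontext(&main_ctx, &ctx[0]);
    return 0;
}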

15 of 27

Overhead Comparison


Traditional Systems           |      | AstriFlash                 |
Flash accesses using OS       | 2 μs | Memory-mapped Flash        | -
Paging between address spaces | 3 μs | Hardware-managed cache     | 50 ns
OS-level context switches     | 5 μs | User-level thread switches | 100 ns

Hardware-software co-design to achieve DRAM-like performance
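Summing each column's per-miss overheads (assuming they simply add):

\[
t_{\text{traditional}} = (2 + 3 + 5)\,\mu\text{s} = 10\,\mu\text{s}
\qquad
t_{\text{AstriFlash}} = 50\,\text{ns} + 100\,\text{ns} = 150\,\text{ns}
\approx \frac{t_{\text{traditional}}}{67}
\]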

16 of 27

Outline

  • Introduction
  • Goals
  • AstriFlash
  • Evaluation
  • Conclusions


17 of 27

Methodology

  • User-level library for thread switches
  • Skewed request distribution for data-intensive workloads
    • Masstree/Silo + data structures, linked with the threading library
    • 256 GB dataset in Flash, 8 GB DRAM cache
  • Cycle-accurate full-system simulation with QFlex
    • 16x ARM A57 cores, 4x4 mesh architecture
    • 4 memory nodes (each with 128K sets, 4 ways, 4 KB pages)
    • 50/100 μs Flash read/write latency
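The cache geometry is consistent with the 8 GB DRAM cache stated above:

\[
4\;\text{nodes} \times 2^{17}\,\text{sets} \times 4\;\text{ways} \times 4\,\text{KB}
= 4 \times 2\,\text{GB} = 8\,\text{GB}
\]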


[Diagram: 4x4 mesh of 16 cores (C) with 4 memory nodes (Mem)]

18 of 27

Throughput Comparison

    • AstriFlash loses throughput because of tag lookups, pipeline flushes, and thread switches


AstriFlash achieves ~95% of DRAM-only throughput

19 of 27

Tail Latency Comparison

    • AstriFlash results in higher tail latency at low loads because of Flash accesses

19

AstriFlash matches tail latency with only 5% throughput degradation

20 of 27

Conclusions

Flash can serve data online at 20x lower memory cost

Correctly size DRAM capacity → meet SLO

Eliminate demand paging → maximize throughput

    • Memory-mapped Flash
    • Hardware-managed DRAM cache
    • User-level threads


AstriFlash achieves 95% of DRAM throughput under SLO

21 of 27

Backup Slides


22 of 27

Bandwidth Requirements from Flash

  • Flash access every ~10 μs of execution


DRAM cache reduces access latency and filters accesses to Flash

[Figure: Flash bandwidth demand vs. max PCIe Gen4 and Gen5 bandwidth]
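A rough demand estimate, assuming every miss transfers one 4 KB page and each of the 16 cores misses every ~10 μs:

\[
16 \times \frac{4\,\text{KB}}{10\,\mu\text{s}} \approx 6.5\,\text{GB/s}
\]

comfortably below a x16 link's ~32 GB/s (Gen4) and ~64 GB/s (Gen5).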

23 of 27

How to Access Flash?

Traditional memory accesses are synchronous

  • Cores wait for the DRAM access to complete
  • OoO cores can (partially) hide DRAM latencies
  • Synchronous Flash access will stall the cores for 50 μs

Flash requires asynchronous accesses

  • Cores are notified of DRAM misses
  • Cores can work in parallel with the Flash access


High access latency of Flash calls for an asynchronous access model

[Diagram: Core accesses DRAM synchronously and Flash asynchronously]


25 of 27

Memory Requirements in Datacenters

Online services host TB-scale datasets in DRAM

  • Crucial for high performance and low tail latency
  • DRAM is expensive and is not scaling in density

DRAM forms a significant portion of the overall server cost


Amazon EC2 instances:

vCPU | Memory | $ / hour
448  | 12 TB  | 109.20
448  |  6 TB  |  54.60
224  |  6 TB  |  46.40

26 of 27

M/M/1 Queueing Model

26

AstriFlash promises performance similar to DRAM at high loads

SLO > 40x Avg. service time
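One way to read the 40x rule, assuming the tail target is the 99th percentile of an M/M/1 system with service time S and utilization ρ (response time is exponentially distributed with mean S/(1-ρ)):

\[
T_{99} = \frac{S}{1-\rho}\,\ln 100 \;\le\; 40\,S
\quad\Longrightarrow\quad
\rho \;\le\; 1 - \frac{\ln 100}{40} \approx 0.88
\]

so the system can run close to 90% load and still meet the SLO.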

27 of 27

Heterogeneous System Architecture


[Diagram: Core with SRAM cache, hardware-managed DRAM cache backed by Flash, and OS-managed DRAM attached via CXL / RDMA]