1 of 27

AstriFlash: A Flash-Based System for Online Services

Siddharth Gupta, Yunho Oh, Lei Yan, Mark Sutherland, Abhishek Bhattacharjee, Babak Falsafi, Peter Hsu

2 of 27

Online Services Host Data in DRAM

  • Crucial for high performance and low tail latency
  • DRAM is expensive and is not scaling in density

DRAM forms a significant portion of the overall server cost


3 of 27

Can We Serve Data from Flash?

Good news!

    • Many online services have a 100 ms Service-Level Objective (SLO)

Problem:

    • Demand paging eases programmability and adaptivity to runtime dynamics
    • But paging adds ~10 μs of overhead per Flash access
    • Even when the working set fits in DRAM → 50% drop in throughput

How do we integrate Flash for maximum throughput?


Device | Cost Benefit | Raw Latency
DRAM   | 1x           | 50 ns
Flash  | 50x          | 50,000 ns

4 of 27

AstriFlash

Correctly size DRAM capacity → meet SLO

Eliminate demand paging → maximize throughput

How?

    • Memory-mapped Flash
    • Hardware-managed DRAM cache
    • User-level threads

Results validated with an analytic model and timing simulation

    • Reduces cost of memory by 20x
    • Achieves 95% of DRAM throughput under SLO


5 of 27

Outline

  • Introduction
  • Goals
    • Meet SLO
    • Minimize DRAM capacity
    • Maximize throughput
  • AstriFlash
  • Evaluation
  • Conclusions


6 of 27

Need DRAM to Meet SLO

Many services have ms-scale SLO

  • Allows absorbing μs-scale raw Flash latencies

But, cannot replace all DRAM with Flash

  • Flash accesses with queueing delays impact tail latency
  • Too many Flash accesses will violate the SLO
  • Need DRAM to filter Flash accesses


What is the correct capacity for DRAM?

7 of 27

Capture the Working Set

    • Access patterns of online services are inherently skewed
    • 3% of the dataset absorbs >95% of the accesses (a miss every ~10 μs); see the sketch after the figure


3% DRAM reduces memory cost by 20x
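The 20x figure follows from the 50x per-byte cost gap in the device table: with the whole dataset resident in Flash plus a DRAM cache for 3% of it (cost in Flash-capacity units),

\[
\frac{\text{all-DRAM cost}}{\text{3\% DRAM + Flash cost}} = \frac{50}{0.03 \times 50 + 1} = \frac{50}{2.5} = 20\times
\]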

[Figure: miss-rate vs. DRAM capacity curves with knee, upper bound, and lower bound, for CloudSuite 3.0 workloads: Data Caching, Data Serving, Media Streaming, Web Search]
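The skew claim can be sanity-checked with a simple popularity model. A minimal sketch, assuming a Zipfian popularity distribution; the dataset size and skew parameter below are illustrative assumptions, not numbers from the talk:

/* build: cc zipf.c -lm */
#include <stdio.h>
#include <math.h>

/* Sum of Zipf(s) popularity weights over item ranks [first, last]. */
static double zipf_mass(long first, long last, double s) {
    double sum = 0.0;
    for (long i = first; i <= last; i++)
        sum += pow((double)i, -s);
    return sum;
}

int main(void) {
    long n = 10000000;      /* items in the dataset (assumed)  */
    long k = n * 3 / 100;   /* hottest 3% kept in DRAM         */
    double s = 1.2;         /* skew parameter (assumed)        */
    double hot = zipf_mass(1, k, s);
    double all = hot + zipf_mass(k + 1, n, s);
    printf("top 3%% absorbs %.1f%% of accesses\n", 100.0 * hot / all);
    return 0;
}

With these assumed parameters the hottest 3% of items absorbs roughly 96% of accesses, in line with the >95% observation above.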

8 of 27

The Demand Paging Dilemma

Demand paging between DRAM and Flash eases programming model

  • OS transfers pages between DRAM and Flash using page faults

But page faults have high overheads

  • I/O request handling to locate the page
  • Paging between the Flash and DRAM address spaces
  • Context switch


Demand paging fundamentally limits throughput!

9 of 27

Demand Paging Overhead


Significant throughput loss due to demand paging

[Timeline: three application threads, demand paging vs. ideal execution; per miss: 10 μs of app work, 5 μs page fault, 50 μs Flash access, 5 μs context switch]
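These timeline numbers reproduce the 50% figure from earlier: even when other threads hide the 50 μs Flash access, every miss costs 10 μs of CPU time (page fault plus context switch) for every 10 μs of application work:

\[
\frac{t_{\text{app}}}{t_{\text{app}} + t_{\text{fault}} + t_{\text{switch}}}
= \frac{10\,\mu\text{s}}{(10 + 5 + 5)\,\mu\text{s}} = 50\%
\]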

10 of 27

Related Work

Demand paging overheads limit application throughput

  • Kernel bypassing proposed for high performance [SPDK] [Kourtis, FAST '19]
  • Moves the burden of DRAM management to the application
  • Abandons virtual memory


11 of 27

Outline

  • Introduction
  • Goals
  • AstriFlash
    1. Memory-mapped Flash
    2. Hardware-managed DRAM cache
    3. User-level threads
  • Evaluation
  • Conclusions


12 of 27

Mechanism #1: Memory-Mapped Flash

Expose Flash to the OS as memory

  • PCIe support for memory regions [Bae, ISCA ’18]
  • Use Base Address Registers (BARs)
  • PCIe controller directs requests to the correct device

Need TLBs with large reach to cover Flash

  • Multi-level TLBs, huge pages, and coalescing → 10-100 GBs of reach
  • Novel approaches (e.g., Midgard [ISCA '21]) → 100s of TBs


How to control the data transfer between Flash and DRAM?

[Diagram: Core with TLB issues memory requests; the PCIe controller steers them through the device's BAR to Flash]
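For a concrete flavor of the BAR mechanism: Linux already exposes PCIe BARs as mappable sysfs resource files. A minimal userspace sketch, assuming a hypothetical device at 0000:01:00.0 with a mappable BAR0; this illustrates the generic mechanism, not AstriFlash's actual hardware path:

#include <fcntl.h>
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* resource0 corresponds to the device's BAR0; path is a placeholder */
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    size_t len = 1UL << 20;   /* map 1 MB of the BAR (illustrative) */
    volatile uint8_t *flash = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (flash == MAP_FAILED) { perror("mmap"); return 1; }

    uint8_t b = flash[0];     /* ordinary loads/stores now reach the device */
    printf("first byte: 0x%02x\n", b);

    munmap((void *)flash, len);
    close(fd);
    return 0;
}

Once mapped, plain loads and stores reach the device, which is exactly the property memory-mapped Flash needs.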

13 of 27

Mechanism #2: Hardware DRAM Cache

Footprint cache [ISCA ’13]

Frontside controller

    • Interacts with the on-chip caches
    • Probes DRAM for data hit/miss

Backside controller

    • Interacts with the Flash
    • Inserts/evicts pages from DRAM


DRAM cache misses trigger user-level thread switches

[Diagram: Core, DRAM cache with frontside controller (F) and backside controller (B), and Flash]

1. Access DRAM: miss!
2. Signal miss to backside controller
3. Access Flash
4. Signal miss to core
5. Switch thread
6. Receive page
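A software model of the frontside probe (step 1), using the DRAM-cache geometry from the evaluation (128K sets, 4 ways, 4 KB pages); an illustrative sketch, not the actual controller design:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SETS      (128 * 1024)   /* per memory node, as evaluated */
#define WAYS      4
#define PAGE_BITS 12             /* 4 KB pages */

typedef struct { uint64_t tag; bool valid; } way_t;
static way_t tags[SETS][WAYS];

/* Frontside probe: true on a DRAM-cache hit; on a miss, the backside
 * controller would be signalled to fetch the page from Flash. */
static bool frontside_probe(uint64_t paddr) {
    uint64_t page = paddr >> PAGE_BITS;
    uint64_t set  = page % SETS;
    uint64_t tag  = page / SETS;
    for (int w = 0; w < WAYS; w++)
        if (tags[set][w].valid && tags[set][w].tag == tag)
            return true;
    return false;
}

int main(void) {
    uint64_t addr = 0x123456789000ULL;
    uint64_t page = addr >> PAGE_BITS;
    /* Backside fill: install the page, after which the probe hits. */
    tags[page % SETS][0] = (way_t){ .tag = page / SETS, .valid = true };
    printf("hit: %d\n", frontside_probe(addr));
    return 0;
}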

14 of 27

Mechanism #3: User-Level Threads

DRAM cache miss triggers thread switches

  • Handler to save context and schedule next thread
  • ISA support to register software handler
    • Handler address and Link register
  • Microarchitectural support to jump to handler
    • Flush ROB
    • Save PC in Link register
    • Jump to handler address


Fast thread switches (~100 ns) to hide Flash access

[Diagram: a missing LD 0xAddr flushes the ROB; the PC is saved in the link register and the core jumps to the address in the handler address register]
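A software analogy of this switch using POSIX ucontext, with swapcontext standing in for the proposed handler-address/link-register hardware path; the structure is illustrative (hardware support is what brings the cost down to ~100 ns):

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

#define NTHREADS 3
#define STACKSZ  (64 * 1024)

static ucontext_t main_ctx, ctx[NTHREADS];
static int cur = 0;

/* Stand-in for the hardware-invoked handler: on a (simulated) DRAM-cache
 * miss, save this thread's context and resume the next one instead of
 * stalling for the 50 us Flash access. */
static void miss_handler(void) {
    int prev = cur;
    cur = (cur + 1) % NTHREADS;
    swapcontext(&ctx[prev], &ctx[cur]);
}

static void worker(int id) {
    for (int i = 0; i < 2; i++) {
        printf("thread %d: work chunk %d\n", id, i);
        miss_handler();              /* pretend a load just missed */
    }
}

int main(void) {
    for (int i = 0; i < NTHREADS; i++) {
        getcontext(&ctx[i]);
        ctx[i].uc_stack.ss_sp = malloc(STACKSZ);
        ctx[i].uc_stack.ss_size = STACKSZ;
        ctx[i].uc_link = &main_ctx;  /* demo exits when thread 0 finishes */
        makecontext(&ctx[i], (void (*)(void))worker, 1, i);
    }
    swapcontext(&main_ctx, &ctx[0]);
    return 0;
}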

15 of 27

Overhead Comparison


Traditional Systems           |      | AstriFlash                 |
Flash accesses using OS       | 2 μs | Memory-mapped Flash        | -
Paging between address spaces | 3 μs | Hardware-managed cache     | 50 ns
OS-level context switches     | 5 μs | User-level thread switches | 100 ns

Hardware-software co-design to achieve DRAM-like performance
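Summing each column's per-miss overheads (assuming they simply add):

\[
t_{\text{traditional}} = (2 + 3 + 5)\,\mu\text{s} = 10\,\mu\text{s}
\qquad
t_{\text{AstriFlash}} = 50\,\text{ns} + 100\,\text{ns} = 150\,\text{ns}
\approx \frac{t_{\text{traditional}}}{67}
\]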

16 of 27

Outline

  • Introduction
  • Goals
  • AstriFlash
  • Evaluation
  • Conclusions


17 of 27

Methodology

  • User-level library for thread switches
  • Skewed request distribution for data-intensive workloads
    • Masstree/Silo + data structures, linked with the threading library
    • 256 GB dataset in Flash, 8 GB DRAM cache
  • Cycle-accurate full-system simulation with QFlex
    • 16x ARM A57 cores, 4x4 mesh architecture
    • 4 memory nodes (each with 128K sets, 4 ways, 4 KB pages)
    • 50/100 μs Flash read/write latency
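The cache geometry is consistent with the 8 GB DRAM cache stated above:

\[
4\;\text{nodes} \times 2^{17}\,\text{sets} \times 4\;\text{ways} \times 4\,\text{KB}
= 4 \times 2\,\text{GB} = 8\,\text{GB}
\]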


[Diagram: 4x4 mesh of 16 cores (C) with 4 memory nodes (Mem)]

18 of 27

Throughput Comparison

    • AstriFlash loses throughput because of tag lookups, pipeline flushes, and thread switches


AstriFlash achieves ~95% of DRAM-only throughput

19 of 27

Tail Latency Comparison

    • AstriFlash results in higher tail latency at low loads because of Flash accesses

19

AstriFlash matches tail latency with only 5% throughput degradation

20 of 27

Conclusions

Flash can serve data online at 20x lower memory cost

Correctly size DRAM capacity → meet SLO

Eliminate demand paging → maximize throughput

    • Memory-mapped Flash
    • Hardware-managed DRAM cache
    • User-level threads


AstriFlash achieves 95% of DRAM throughput under SLO

21 of 27

Backup Slides


22 of 27

Bandwidth Requirements from Flash

  • Flash access every ~10 μs of execution


DRAM cache reduces access latency and filters accesses to Flash

[Figure: Flash bandwidth demand vs. max PCIe Gen4 and Gen5 bandwidth]
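A rough demand estimate, assuming every miss transfers one 4 KB page and each of the 16 cores misses every ~10 μs:

\[
16 \times \frac{4\,\text{KB}}{10\,\mu\text{s}} \approx 6.5\,\text{GB/s}
\]

comfortably below a x16 link's ~32 GB/s (Gen4) and ~64 GB/s (Gen5).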

23 of 27

How to Access Flash?

Traditional memory accesses are synchronous

  • Cores wait for the DRAM access to complete
  • OoO cores can (partially) hide DRAM latencies
  • Synchronous Flash access will stall the cores for 50 μs

Flash requires asynchronous accesses

  • Cores are notified of DRAM misses
  • Cores can work in parallel with the Flash access


High access latency of Flash calls for an asynchronous access model

[Diagram: Core accesses DRAM synchronously and Flash asynchronously]


25 of 27

Memory Requirements in Datacenters

Online services host TB-scale datasets in DRAM

  • Crucial for high performance and low tail latency
  • DRAM is expensive and is not scaling in density

DRAM forms a significant portion of the overall server cost


Amazon EC2 instances:

vCPU | Memory | $ / hour
448  | 12 TB  | 109.20
448  |  6 TB  |  54.60
224  |  6 TB  |  46.40

26 of 27

M/M/1 Queueing Model

26

AstriFlash promises performance similar to DRAM at high loads

SLO > 40x Avg. service time
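One way to read the 40x rule, assuming the tail target is the 99th percentile of an M/M/1 system with service time S and utilization ρ (response time is exponentially distributed with mean S/(1-ρ)):

\[
T_{99} = \frac{S}{1-\rho}\,\ln 100 \;\le\; 40\,S
\quad\Longrightarrow\quad
\rho \;\le\; 1 - \frac{\ln 100}{40} \approx 0.88
\]

so the system can run close to 90% load and still meet the SLO.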

27 of 27

Heterogeneous System Architecture


[Diagram: Core with SRAM cache, hardware-managed DRAM cache backed by Flash, and OS-managed DRAM attached via CXL / RDMA]