AstriFlash: A Flash-Based System for Online Services
Siddharth Gupta, Yunho Oh, Lei Yan, Mark Sutherland, Abhishek Bhattacharjee, Babak Falsafi, Peter Hsu
Online Services Host Data in DRAM
DRAM forms a significant portion of the overall server cost
Can We Serve Data from Flash?
Good news: Flash offers a large cost benefit over DRAM.

Device | Cost Benefit | Raw Latency
DRAM | 1x | 50 ns
Flash | 50x | 50,000 ns

Problem: How do we integrate Flash for maximum throughput?
AstriFlash
Correctly size DRAM capacity → meet SLO
Eliminate demand paging → maximize throughput
How?
Results validated with an analytic model and timing simulation
Outline
Need DRAM to Meet SLO
Many services have ms-scale SLOs
But we cannot replace all DRAM with Flash
What is the correct capacity for DRAM?
Capture the Working Set
Caching 3% of the dataset in DRAM reduces memory cost by 20x
[Plot: working-set curves for the CloudSuite 3.0 workloads (Data Caching, Data Serving, Media Streaming, Web Search), with upper and lower bounds and the knee of each curve marked.]
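As a back-of-the-envelope check of the 20x figure, assuming the 50x per-byte cost advantage from the earlier table and the remaining 97% of the dataset held in Flash:

\[
\text{relative memory cost} \approx 0.03 \cdot 1 + 0.97 \cdot \tfrac{1}{50} \approx 0.05 \approx \tfrac{1}{20}.
\]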
The Demand Paging Dilemma
Demand paging between DRAM and Flash eases the programming model
But page faults incur high overheads
Demand paging fundamentally limits throughput!
Demand Paging Overhead
Significant throughput loss due to demand paging
[Timeline: three application threads under demand paging vs. an ideal schedule. Per-request components: application work 10 μs, page fault handling 5 μs, Flash access 50 μs, context switch 5 μs.]
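Using the timeline's numbers, a rough worked estimate (assuming every request incurs one fault and that the 50 μs Flash access itself is hidden by running another thread): the core still spends the fault-handling and context-switch time on overhead rather than application work, so

\[
t_{\text{ideal}} = 10\,\mu\text{s}, \qquad t_{\text{demand paging}} \approx 10 + 5 + 5 = 20\,\mu\text{s} \;\Rightarrow\; \text{throughput} \approx 50\%\ \text{of ideal}.
\]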
Related Work
Demand paging overheads limit application throughput
Outline
Mechanism #1: Memory-Mapped Flash
Expose Flash to the OS as memory
Need TLBs with large reach to cover Flash
How to control the data transfer between Flash and DRAM?
[Diagram: core with a TLB issuing accesses to Flash through a PCIe controller BAR.]
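A minimal user-space sketch of the memory-mapped access model, assuming the Flash device's BAR is exported through a Linux sysfs resource file; the PCI address, path, and BAR size below are hypothetical, and AstriFlash itself exposes Flash through the OS and TLBs rather than through an application-level map:

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void) {
        /* Hypothetical PCI device; BAR0 exposed by Linux as a sysfs resource file. */
        const char *bar_path = "/sys/bus/pci/devices/0000:03:00.0/resource0";
        size_t bar_size = 64UL * 1024 * 1024;          /* assumed BAR size: 64 MiB */

        int fd = open(bar_path, O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        /* Map the BAR so Flash-backed data is reachable with ordinary loads/stores. */
        volatile uint64_t *flash = mmap(NULL, bar_size, PROT_READ | PROT_WRITE,
                                        MAP_SHARED, fd, 0);
        if (flash == MAP_FAILED) { perror("mmap"); return 1; }

        uint64_t value = flash[0];                     /* a plain load reaches the device */
        printf("first word: %llu\n", (unsigned long long)value);

        munmap((void *)flash, bar_size);
        close(fd);
        return 0;
    }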
Mechanism #2: Hardware DRAM Cache
Footprint cache [ISCA ’13]
Frontside controller
Backside controller
DRAM cache misses trigger user-level thread switches
[Diagram: DRAM cache miss flow between the core, DRAM cache, and Flash. (1) Core accesses the DRAM cache: miss. (2) Frontside controller signals the miss to the backside controller. (3) Backside controller accesses Flash. (4) Miss is signaled to the core. (5) Core switches thread. (6) DRAM cache receives the page. F = frontside controller, B = backside controller.]
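A purely illustrative software rendering of the six steps (the function names are mine, not AstriFlash interfaces), just to make the control flow explicit; in hardware, step 3 proceeds concurrently with steps 4 and 5:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Toy stand-ins for the hardware components in the diagram above. */
    static bool dram_cache_lookup(uint64_t addr) { (void)addr; return false; }
    static void backside_read_flash(uint64_t addr) {
        printf("backside: reading page 0x%llx from Flash\n", (unsigned long long)addr);
    }
    static void dram_cache_fill(uint64_t addr) {
        printf("frontside: page 0x%llx installed in the DRAM cache\n",
               (unsigned long long)addr);
    }
    static void core_switch_thread(void) {
        printf("core: miss signalled, switching to another user-level thread\n");
    }

    static void cache_access(uint64_t addr) {
        if (dram_cache_lookup(addr))      /* 1. access the DRAM cache               */
            return;                       /*    hit: data returned immediately      */
        /* 2. frontside controller signals the miss to the backside controller      */
        backside_read_flash(addr);        /* 3. backside controller accesses Flash  */
        core_switch_thread();             /* 4-5. miss signalled to the core, which */
                                          /*      switches to another thread        */
        dram_cache_fill(addr);            /* 6. page arrives in the DRAM cache      */
    }

    int main(void) { cache_access(0xdeadbeefULL); return 0; }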
Mechanism #3: User-Level Threads
A DRAM cache miss triggers a user-level thread switch
Fast thread switches (~100 ns) to hide Flash access
[Diagram: microarchitectural support for user-level thread switches: the missing load (LD 0xAddr) at the head of the ROB, the PC, a handler address register, and a link register.]
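In hardware, the miss redirects the core to the handler address register and records the resumption point in the link register. A rough software analogue of such cooperative switches is POSIX ucontext; this is a sketch of the idea, not AstriFlash's runtime, and swapcontext is far slower than the ~100 ns hardware switch:

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, worker_ctx;

    /* Runs when the "miss handler" hands the core to another user-level thread. */
    static void worker(void) {
        printf("worker: doing useful work while the Flash access is in flight\n");
        swapcontext(&worker_ctx, &main_ctx);   /* yield back once the page arrives */
    }

    int main(void) {
        static char stack[64 * 1024];

        getcontext(&worker_ctx);
        worker_ctx.uc_stack.ss_sp = stack;
        worker_ctx.uc_stack.ss_size = sizeof stack;
        worker_ctx.uc_link = &main_ctx;
        makecontext(&worker_ctx, worker, 0);

        printf("thread 1: load misses in the DRAM cache\n");
        /* The "handler": switch to another thread instead of stalling for Flash. */
        swapcontext(&main_ctx, &worker_ctx);
        printf("thread 1: page received, resuming at the link (return) point\n");
        return 0;
    }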
Overhead Comparison
Traditional Systems | Overhead | AstriFlash | Overhead |
Flash accesses using OS | 2 μs | Memory-mapped Flash | - |
Paging between address spaces | 3 μs | Hardware-managed cache | 50 ns |
OS-level context switches | 5 μs | User-level thread switches | 100 ns |
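Summing the per-miss costs in the table gives a sense of the gap being closed:

\[
\text{traditional: } 2 + 3 + 5 = 10\,\mu\text{s} \qquad \text{AstriFlash: } 0.05 + 0.1 = 0.15\,\mu\text{s},
\]

i.e., roughly a 67x reduction in per-miss software overhead.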
Hardware-software co-design to achieve DRAM-like performance
Outline
Methodology
[Diagram: simulated system with 16 cores (C) and 4 memory interfaces (Mem).]
Throughput Comparison
AstriFlash achieves ~95% of DRAM-only throughput
Tail Latency Comparison
AstriFlash matches DRAM-only tail latency with only 5% throughput degradation
Conclusions
Flash can serve online data at 20x lower memory cost
Correctly size DRAM capacity → meet SLO
Eliminate demand paging → maximize throughput
AstriFlash achieves 95% of DRAM throughput under SLO
Backup Slides
Bandwidth Requirements from Flash
The DRAM cache reduces access latency and filters accesses to Flash
[Plot: per-workload Flash bandwidth demand compared against maximum PCIe Gen4 and Gen5 bandwidth.]
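For intuition on why filtering matters, with illustrative numbers (the per-core miss rate below is an assumption, not a measured value): 16 cores each missing to Flash a million times per second at 4 KiB granularity would need

\[
16 \times 10^{6}\ \tfrac{\text{misses}}{\text{s}} \times 4\,\text{KiB} \approx 65\ \text{GB/s},
\]

which already exceeds a PCIe Gen4 x16 link (~32 GB/s) and saturates Gen5 x16 (~64 GB/s), so the DRAM cache must absorb the bulk of the accesses.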
How to Access Flash?
Traditional memory accesses are synchronous
Flash requires asynchronous accesses
The high access latency of Flash calls for an asynchronous access model
[Diagram: the core accesses DRAM synchronously and Flash asynchronously.]
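A minimal user-space illustration of the asynchronous model, using POSIX AIO against an ordinary file ("flash.img" is a placeholder standing in for the Flash device); this sketches the access model only, not AstriFlash's hardware path:

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        char buf[4096];
        int fd = open("flash.img", O_RDONLY);   /* placeholder for the Flash device */
        if (fd < 0) { perror("open"); return 1; }

        /* Describe the read: one 4 KiB page at offset 0, completion polled later. */
        struct aiocb cb;
        memset(&cb, 0, sizeof cb);
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_offset = 0;

        if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }

        /* Do other useful work here instead of stalling on the slow access. */

        while (aio_error(&cb) == EINPROGRESS)
            ;                                   /* in a real system: run another thread */

        ssize_t n = aio_return(&cb);
        printf("asynchronously read %zd bytes\n", n);
        close(fd);
        return 0;
    }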
Throughput Comparison
AstriFlash achieves ~95% of DRAM-only throughput
Memory Requirements in Datacenters
Online services host TB-scale datasets in DRAM
DRAM forms a significant portion of the overall server cost
Amazon EC2
vCPU | Memory | $ / hour |
448 | 12 TB | 109.20 |
448 | 6 TB | 54.60 |
224 | 6 TB | 46.40 |
M/M/1 Queueing Model
AstriFlash promises performance similar to DRAM at high loads
SLO > 40x avg. service time
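In an M/M/1 queue with service rate μ and load ρ, the response time is exponentially distributed, so the tail has a closed form; taking the 99th percentile as the SLO metric (an assumption for this sketch) and an SLO of 40x the average service time 1/μ:

\[
P(T > t) = e^{-\mu(1-\rho)t}, \qquad T_{99} = \frac{\ln 100}{\mu(1-\rho)} \le \frac{40}{\mu} \;\Rightarrow\; \rho \le 1 - \frac{\ln 100}{40} \approx 0.88,
\]

so the SLO can be met near 90% load, and a modest increase in mean service time from occasional Flash accesses still leaves headroom.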
Heterogeneous System Architecture
[Diagram: heterogeneous system with a core, SRAM cache, hardware-managed DRAM cache in front of Flash, OS-managed DRAM, and memory attached over CXL / RDMA.]