Programmable storage
19 June 2018
Noah Watkins
Thesis defense
Three common application I/O stack architectures
2
(1) POSIX file storage: n apps over a POSIX-ish interface plus middleware (HDF5, PLFS, MPI-IO)
(2) App-specific storage: each app over its own purpose-built store
(3) Unified storage: n apps sharing one system that exposes file, object, and block interfaces
Redundancy and specialization costs
3
Paxos is like the simplest thing ever…
System stabilization is expen$$ive
“It takes 10 years before a new storage system is trusted.”
-- Gary Grider (LANL)
The man, the myth, the legend
4
-- Brent Welch, MSST 2010
Footbag World Champion, Mixed Doubles, 1997
Eyeballs act as a proxy for reliability!
Share code-hardened sub-systems!
Outline and contributions
5
Transactional data
Compute resources
Structured data
Graph processing
Durability
Data interfaces
Metadata management
Naming resources
Shared resources
Cluster-level metadata
Programmable storage design paradigm
2
Development process
Declarative storage
In-vivo storage development
1
EuroSys ‘18
HotStorage ‘17
BDMC ‘13
PDSW ‘12
In-progress
3
Will this ever work?
Outline and contributions
6
Outline and contributions
7
New way to build storage interfaces!
Outline and contributions
8
Widely applicable, but challenges!
Outline and contributions
9
Will this ever work?
Tying it all together
Programmability: avoid duplication and specialization
10
(The same three architectures: POSIX file storage with middleware, app-specific storage, and unified storage)
n
Programmable storage
11
n+k apps over a programmable storage system that shares common sub-systems: alongside file, object, and block, new interfaces A, B, and C are composed from shared internals
(Contrast: POSIX file storage with middleware, and app-specific storage)
Programmable storage
A storage system that facilitates the reuse and extension of existing storage abstractions provided by the underlying software stack, enabling the creation of new services via composition.
Programmable storage compared to…
Software-defined storage (SDS)
Active storage
12
Motivation, use cases, and lessons learned
Similar motivation, but HPC-specific, with entire systems built from scratch
Programmable storage instead exposes internal subsystems
13
file, object, block
Consensus
Persistence
Migration
Batching
Atomic operations
, data i/o, service metadata, file type, shared resources, durability
New interfaces through composition and customization
Storage system
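The composition idea can be made concrete with a small sketch. This is an illustrative Python simulation, not Ceph code: `ObjectStore`, `cas`, and `batch_write` are hypothetical stand-ins for two reusable primitives (atomic operations and batching), from which a new append interface is composed without touching the I/O path.

```python
class ObjectStore:
    """Toy stand-in for a storage object exposing reusable primitives."""
    def __init__(self):
        self.kv = {}

    def cas(self, key, expected, new):
        """Atomic-operations primitive: compare-and-swap on one key."""
        if self.kv.get(key, 0) == expected:
            self.kv[key] = new
            return True
        return False

    def batch_write(self, items):
        """Batching primitive: apply several writes as one unit."""
        self.kv.update(items)

def append(store, entry):
    """Composed interface: reserve the tail position atomically, then write."""
    while True:
        tail = store.kv.get("tail", 0)
        if store.cas("tail", tail, tail + 1):
            store.batch_write({f"entry.{tail}": entry})
            return tail

store = ObjectStore()
print(append(store, "x"), append(store, "y"))  # -> 0 1
```

The point is the shape of the design, not the code itself: the append service is built entirely out of primitives the storage system already hardens.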
Programmability survey in Ceph (data interfaces)
14
App-specific interfaces
Interface groupings
Programmability in Ceph (reusable data interfaces)
15
Method Examples
App-specific interfaces
Interface groupings
Developers willing to break layers and use non-standard APIs
Outline and contributions
16
Driving example: CORFU distributed shared-log
17
Balakrishnan et al., “CORFU: A Shared Log Design for Flash Clusters”, NSDI ’12
1 | 2 | 3 | 4 | | | | | | | | | | | | | | | | |
log striping
read
clients
Sequencer (P)
Sequencer (B)
1, 2, 3, 4, 5, ….
append
pos = seq++
send_msg(pos)
How can we implement CORFU with programmability?
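The CORFU protocol on the slide (pos = seq++; send_msg(pos); clients write directly under log striping) can be sketched as a small in-memory simulation. This is illustrative Python, assuming no failures and in-memory "flash units"; it is not the CORFU or ZLog implementation.

```python
class Sequencer:
    """Centralized counter: pos = seq++ for every append request."""
    def __init__(self):
        self.seq = 0

    def next_pos(self):
        pos = self.seq
        self.seq += 1
        return pos

class StripedLog:
    """Log positions striped round-robin across storage devices."""
    def __init__(self, num_devices):
        self.devices = [dict() for _ in range(num_devices)]  # pos -> entry

    def write(self, pos, entry):
        dev = self.devices[pos % len(self.devices)]
        assert pos not in dev, "write-once: position already filled"
        dev[pos] = entry

    def read(self, pos):
        return self.devices[pos % len(self.devices)].get(pos)

seq = Sequencer()
log = StripedLog(num_devices=4)
for entry in ["a", "b", "c", "d", "e"]:
    log.write(seq.next_pos(), entry)

print(log.read(0), log.read(4))  # -> a e
```

Note that the sequencer only hands out positions; data never flows through it, which is why the append path scales with the number of devices.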
Driving example: CORFU distributed shared-log
18
ZLog: CORFU on Ceph
Malacology
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | ... |
?
3 x {entry, metadata}
ℕ:𝕄
Design space: ZLog design on Ceph
19
metadata \ entry | omap | bytestream | xattr
omap             |      |            |
bytestream       |      |            |
xattr            |      |            |
Are these combinations worth exploring?
Append throughput for 128-byte log entries
20
Design space: ZLog design on Ceph
21
metadata \ entry | omap | bytestream | xattr
omap             |      |            | No
bytestream       |      |            |
xattr            | ???  | ???        |
Append performance for a variety of entry sizes
22
Append size (bytes)
if then else
4K exception (config!)
Design space: ZLog design on Ceph
23
metadata \ entry | omap | bytestream | xattr
omap             |      |            | No
bytestream       | No   |            |
xattr            | ???  | ???        |
What about extended attributes (xattr)?
24
metadata \ entry | omap | bytestream
xattr            | ???  | ???
What am I missing?
Are my other decisions incorrect?
Interest in ZLog from MegaCorp®
25
Acceptance of programmable storage paradigm
26
“... we are considering large scale deployments of Ceph ...“
“... Zlog seems more attractive as its on the same technology stack.”
Many use cases; tail latency is universally important
27
Tail latency in Ceph isn’t good, but interfaces matter!
28
Interfaces affect tail latency
Design space: ZLog design on Ceph
29
metadata \ entry | omap | bytestream | xattr
omap             |      |            | No
bytestream       | No   |            |
xattr            | No   | No         |
Navigating the design space is an obstacle
30
Outline and contributions
31
How to grow a database: scale-up approach
33
Database Node
CPU
RAM
Database Storage
Network / Bus
Q
https://aws.amazon.com/ec2/instance-types/
Skyhook: exploit storage resources
34
Database Node
CPU
RAM
Database storage
Network / Bus
Database storage
Network
Q
Q
Q
Q
(Q)
Skyhook project
Single-node architecture
Database Node
CPU
RAM
Q
Skyhook architecture
Programmable storage
Skyhook: aligns data with storage interfaces
35
DB-specific data interface on every Ceph OSD: each OSD in the Ceph cluster has CPU, RAM, and Storage+Index
A table (columns C1, C2, C3) is partitioned into shards, each stored as an object { object.i }
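The partitioning step above can be sketched in a few lines. This is an illustrative Python sketch of the layout idea (table rows split into fixed-size shards, one object per shard, named `object.i`); the function name and object naming are hypothetical, not the real Skyhook API.

```python
def partition(rows, shard_size):
    """Split a list of rows into named objects of at most shard_size rows."""
    return {
        f"object.{i}": rows[off:off + shard_size]
        for i, off in enumerate(range(0, len(rows), shard_size))
    }

# Ten toy lineitem-like rows, split into objects of four rows each.
rows = [{"orderkey": k, "extendedprice": 1000.0 * k} for k in range(10)]
objects = partition(rows, shard_size=4)
print(sorted(objects))           # -> ['object.0', 'object.1', 'object.2']
print(len(objects["object.2"]))  # -> 2
```

Because each shard lives in its own object, a predicate can later be evaluated wherever that object is stored, rather than at the database node.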
Programmability used in a completely different way than building a log abstraction
Database node (CPU, RAM) accesses the cluster through Foreign Data Wrappers
App-specific interface pushes down: indexing, projection, filtering, aggregation
Skyhook experiments with programmable storage
36
Database Node
CPU
RAM
Programmable storage
Network
(Database-specific data interface)
Q
Q
Q
Q
Benchmark queries evaluated
Qa: Range query with 10% selectivity:
    SELECT * FROM lineitem WHERE extendedprice > 71000.0
Qb: Point query (unique row), issued with and without an index:
    SELECT extendedprice FROM lineitem WHERE orderkey=5 AND linenumber=3
Qc: Regex query with 10% selectivity (CPU-intensive):
    SELECT * FROM lineitem WHERE comment ILIKE '%uriously%'
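The difference the experiments measure can be simulated directly: client-side processing ships every row over the network before filtering, while server-side processing evaluates the predicate where the object lives and ships only matches. This is a toy Python sketch with made-up rows, not Skyhook code; the row counts stand in for network traffic.

```python
import re

# Four toy lineitem rows (values invented for illustration).
lineitem = [
    {"extendedprice": p, "comment": c}
    for p, c in [(70000.0, "furiously final"), (72000.0, "quietly"),
                 (71500.0, "curiously bold"), (69000.0, "plain")]
]

def client_side(objects, pred):
    """Ship all rows to the client, then filter: returns (rows shipped, matches)."""
    shipped = [row for obj in objects for row in obj]
    return len(shipped), [r for r in shipped if pred(r)]

def server_side(objects, pred):
    """Filter at the storage server, ship only matches."""
    matches = [r for obj in objects for r in obj if pred(r)]
    return len(matches), matches

qa = lambda r: r["extendedprice"] > 71000.0          # Qa-style range predicate
qc = lambda r: re.search("uriously", r["comment"])   # Qc-style regex predicate

print(client_side([lineitem], qa)[0])  # -> 4 rows cross the network
print(server_side([lineitem], qa)[0])  # -> 2 rows cross the network
```

For a selective predicate, the server-side path moves a fraction of the data; for a CPU-intensive predicate like Qc, it also moves the computation onto the storage servers' CPUs.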
37
+ data loading
Range query performance (10% selectivity)
38
Improved I/O performance
Database Node
CPU
RAM
Database Storage
Network
Lower is better
Client-side processing
Server-side processing
Bulk-load and index generation performance
39
Internal data structures don’t handle bulk inserts efficiently
Per-object overheads accumulate
Table
Shards
Storage+Index
{ object.i }
Lower is better
Point query performance (find unique row)
40
Database Node
CPU
RAM
Database Storage
Network
Lower is better
Client-side processing
Server-side processing
Server-side processing with index acceleration
Outline and contributions
41
What is durability?
43
Long-term storage
500 years
Driving example: CORFU distributed shared-log
44
Balakrishnan et al., “CORFU: A Shared Log Design for Flash Clusters”, NSDI ’12
log striping
Sequencer (P)
Sequencer (B)
1, 2, 3, 4, 5, ….
read
clients
append
Programmed data interface
Ceph data interfaces
Persistent media is only part of the bottleneck
45
SSD
DRAM
Software is a bottleneck
46
client req → network → queuing and scheduling → transactional context → persistent storage media
Along the path: { client operations }, concurrency control, replication, tiering, indexing, clone, CoW, error conditions
This I/O path is taken by all requests… regardless of need!
Goal: optimize for broad spectrum of request needs
State: intertwined with correctness-sensitive handling
The CORFU sequencer is a high availability service
47
Balakrishnan et al., “CORFU: A Shared Log Design for Flash Clusters”, NSDI ’12
log striping
Sequencer (P)
Sequencer (B)
1, 2, 3, 4, 5, ….
read
clients
append
Ceph already provides availability... recovery!
Function CORFU Sequencer Recover
  max = 0
  for each storage device:
    max = MAX(max, SEAL(device))
  return max
EndFunction
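The recovery rule can be sketched as a runnable simulation. This is an illustrative Python sketch, not ZLog code: sealing a device makes it reject writes from older epochs and report the highest position it has written, and the new sequencer resumes just past the global maximum.

```python
class Device:
    """Toy storage device holding a set of written log positions."""
    def __init__(self, positions):
        self.positions = set(positions)
        self.sealed_epoch = 0

    def seal(self, epoch):
        """Reject writes from epochs older than `epoch`; report max position."""
        self.sealed_epoch = epoch
        return max(self.positions, default=-1)

def recover_sequencer(devices, new_epoch):
    """Seal every device, then restart counting past the global maximum."""
    return max(d.seal(new_epoch) for d in devices) + 1

# Positions striped across four devices, as in the CORFU figure.
devices = [Device({0, 4}), Device({1, 5}), Device({2}), Device({3})]
print(recover_sequencer(devices, new_epoch=2))  # -> 6
```

Sealing first is what makes the maximum trustworthy: no in-flight append from the old epoch can land after the scan and invalidate the recovered counter.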
Data availability applied to sequencer interface
48
Seq @ OSD.0
Seq @ OSD.1
OSD.0 Failure
Seq @ OSD.2
OSD.1 Failure
Configurable timeouts
Configurable replicas
Outline and contributions
49
Metadata management
51
App
App
App
Storage system
File
Object
mount/
POSIX file resource
Config DB
Metadata management
52
App
App
App
Storage system
File
Object
A
B
C
mount/
POSIX file resource
Config DB
Need: naming, metadata storage, etc.
POSIX namespace management of all interfaces
53
/users/
mount/
/science/
Metadata cluster
concurrency control
capabilities
security
cache management
POSIX/File
inode
inode
/log-instances/
Instances of ZLog used by applications
/log-instances/
Programmable metadata management (file types)
54
/users/
mount/
/science/
/log-instances/
/log-instances/
inode
inode
Metadata cluster
concurrency control
capabilities
security
cache management
POSIX/File
ZLog Basic
ZLog Streaming
inode
inode
inode
inode
[Karpovich, ‘94]
Programmable metadata management (file types)
55
mount/
/log-instances/
inode
Metadata cluster
concurrency control
capabilities
security
cache management
ZLog Basic
inode
ZLog Interface
FS client
ZLog metadata
Naming / discovery
Sequencer (P)
Sequencer (B)
[Karpovich, ‘94]
Programmable metadata management (coherency)
56
mount/
/log-instances/
inode
Metadata cluster
concurrency control
capabilities
security
cache management
ZLog Basic
inode
ZLog Interface
FS client
ZLog metadata
Naming / discovery
ZLog Interface
FS client
ZLog Interface
FS client
Cache invalidation protocol
Can enforce exclusive access to metadata
pos = seq++
The capability-based sequencer is round-robin
57
ZLog Interface
FS client
ZLog Interface
FS client
Compared to a centralized architecture, rotating the capability can benefit bursty workloads
Policies: best effort, delay, quota
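The round-robin idea can be sketched as a small simulation. This is illustrative Python, not the CephFS capability protocol: an exclusive capability rotates among clients, and while a client holds it, appends are local (pos = seq++). A quota-style policy is shown, where the holder must yield after a fixed number of positions; the other policies (best effort, delay) would change only the yield condition.

```python
class CapabilitySequencer:
    """Rotate an exclusive sequencer capability among clients, round-robin."""
    def __init__(self, clients, quota):
        self.clients, self.quota = clients, quota
        self.seq, self.holder, self.used = 0, 0, 0

    def append(self, client):
        if client != self.clients[self.holder]:
            return None                        # must wait for the capability
        pos = self.seq                         # local: pos = seq++
        self.seq += 1
        self.used += 1
        if self.used == self.quota:            # quota policy: yield and rotate
            self.holder = (self.holder + 1) % len(self.clients)
            self.used = 0
        return pos

s = CapabilitySequencer(["A", "B"], quota=2)
print([s.append("A") for _ in range(3)])  # -> [0, 1, None]
print(s.append("B"))                      # -> 2
```

A bursty client amortizes one capability grant over many appends; the trade-off measured on the next slide is that other clients wait longer for their turn.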
Finding trade-offs with the capability-based sequencer
58
better
Throughput per Policy
Latency per Policy
More opportunities to receive the shared resource capability
Outline and contributions
59
Programmability looks like a one-time cost
61
WRONG! Programmability is not a one-time cost
ZLog performance is a toss-up on the 2014 version of Ceph
62
Performance Comparison of 4 Designs (Appends/Sec, Ceph 2014)
Takeaway: clear ZLog implementation choice in 2016
63
Performance Comparison of 4 Designs (Appends/Sec): Ceph 2014 vs. Ceph 2016
That is the state of programmable storage
64
Ceph programmability 2010 to 2016
65
Ceph programmability usage since 2016
66
61%
Programmability is critical to the Ceph ecosystem
When asked about the fate of object classes given the redesign opportunity:
67
“cls [object classes] isn't going away... it's proven pretty important for all of RGW, RBD, and CephFS... It has proven extremely useful and it's also a clean way to incorporate logic during updates without slowing down the I/O pipeline (mostly!).”
-- Sage Weil, lead architect of Ceph
Popping up in different places
68
Swift
Storlets
Microsoft
Amazon
Outline and contributions
69
Declarative storage
71
Query optimization & plan generation
Cost model
In-vivo storage system development
72
Outline and contributions
73
Recap and conclusions
75
Future work
76
Storage system architecture
Developer assistance
Declarative storage
Publications
77
HotStorage ’17: DeclStore: Layering is for the Faint of Heart. N. Watkins, M. Sevilla, I. Jimenez, K. Dahlgren, P. Alvaro, S. Finkelstein, and C. Maltzahn
EuroSys ’17: Malacology: A Programmable Storage System. M. Sevilla, N. Watkins, I. Jimenez, P. Alvaro, S. Finkelstein, J. LeFevre, and C. Maltzahn
HotStorage ’16: ZEA, A Data Management Approach for SMR. A. Manzanares, N. Watkins, C. Guyot, D. LeMoal, C., and Z. Bandic
PDSW ’15: Automatic and Transparent I/O Optimization With Storage Integrated Application Runtime Support. N. Watkins, Z. Jia, G. Shipman, C. Maltzahn, A. Aiken, and P. McCormick
SC ’15: Mantle: A Programmable Metadata Load Balancer for the Ceph File System. M. Sevilla, N. Watkins, C. Maltzahn, I. Nassi, S. Brandt, S. Weil, G. Farnum, and S. Fineberg
BDMC ’13: In-Vivo Storage System Development. N. Watkins, C. Maltzahn, S. Brandt, I. Pye, and A. Manzanares
PDSW ’12: DataMods: Programmable File System Service. N. Watkins, C. Maltzahn, S. Brandt, and A. Manzanares
SC ’11: SciHadoop: Array-based Query Processing in Hadoop. J. Buck, N. Watkins, J. LeFevre, K. Ioannidou, C. Maltzahn, N. Polyzotis, and S. Brandt
DADC ’09: Abstract Storage: Moving File Format-specific Abstractions Into Petabyte-scale Storage Systems. J. Buck, N. Watkins, C. Maltzahn, and S. Brandt
Thank you everyone
Committee: Carlos Maltzahn, Scott Brandt, Peter Alvaro, and other amazing collaborators: Neoklis Polyzotis, Jeff LeFevre, Shel Finkelstein, Ike Nassi, Kleoni Ioannidou
Michael Sevilla, Ivo Jimenez, Joe Buck, Dimitris Skourtis, Adam Crume
Pat McCormick, Galen Shipman, John Bent, Gary Grider, Adam Manzanares, Kleoni Ioannidou, Jay Lofstead, Sage Weil, Anna Povzner, Greg Farnum
78