Scalable, Global Namespaces with Programmable Storage
Michael A. Sevilla
April 25th, 2018
Dissertation Defense
LA-UR-18-21419
What is a global namespace?
Names ↦ Data; Hierarchical Structure
e.g., DNS, network topologies, URLs, scoping in programming languages
[Figure: namespace tree; labels: /dir, subtree, file, .txt]
In this thesis, we focus on file system namespaces.
Hierarchical Semantics: inherit parent's ownership
Global Semantics: strong consistency, durability
Problem! POSIX IO file system metadata access semantics are difficult to scale.
Scalable, Global Namespaces
Dissertation Defense
Michael A. Sevilla
April 24, 2018
Introduction ←
Prototyping Platform
Subtree Load Balancing
API/framework
beyond FSs
Subtree Semantics
mechanisms
guarantees
Subtree Schemas
structure
generators
Conclusion
File system metadata access patterns:
→ Small and frequent requests
→ Target same resource
[Sevilla et al., SC '15]
[Figure legend: many metadata reads/writes vs. fewer metadata reads/writes]
[Figure: clients issue metadata IO (permissions, size, atime, etc.) and data IO to a single-node file system and to a distributed file system] [Mesnier et al., IEEE Comm.]
… as a result, metadata IO does not scale like data IO
Scalable FS Metadata Access
… a brief history
[Figure: timeline from '02 to '18 of file systems, grouped by metadata distribution technique (single node, hash, subtree partition, table)]
File systems shown: CalvinFS, Ursa Minor, IBRIX, Colossus FS, GlusterFS, GPFS, Lazy Hybrid, HBA, Giga+, PVFS2, SkyFS, {Index/Delta/Batch/Shard}FS, pNFS, MarFS, PanFS, Ceph FS, Two Tiers, Farsite, HopsFS, HDFS, GFS, Lustre, ADLS
Hierarchical Semantics: inherit parent's ownership
Global Semantics: strong consistency, durability
in summary:
'clean-slate' file systems → migrate? compare?
'dirty-slate' file systems → modify code? tunables?
Programmable Storage
(our solution throughout this talk)
[Figure: File System; Hierarchical Semantics: inherit parent's ownership; Global Semantics: strong consistency, durability]
Contributions:
(1) design policies that shape metadata management techniques
(2) API for specifying policies; policy engine for guiding mechanisms
Scalable, global namespaces: (1) Subtree Load Balancing, (2) Subtree Semantics (Consistency/Durability), (3) Subtree Schemas
Malacology [EuroSys '17]
Mantle [SC '15, CCGrid '18]
Cudele [IPDPS '18]
Tintenfisch [HotStorage '18]
Outline: Scalable, Global Namespaces
[Figure: Malacology (Prototyping Platform) underlies Scalable Global Namespaces: (1) Subtree Load Balancing: Mantle; (2) Subtree Semantics (Consistency/Durability): Cudele; (3) Subtree Schemas: Tintenfisch]
Ceph Background (Used to Build Malacology)
[Figure: Ceph architecture. Traditional storage interfaces (LIB, OBJECT, BLOCK, FILE (hierarchical namespace)) are built on RADOS. Clients send metadata IO to the MDS cluster and data IO to RADOS; the MDS cluster sends journal IO to RADOS.]
Malacology [EuroSys '17]
Malacology: A Programmable Storage System
[Figure: an application developer builds My App on the traditional storage interfaces (LIB, OBJ, BLOCK, FILE) over RADOS. Inside RADOS: ✓ atomic ops ✓ batching ✓ data access ✓ consensus ✓ migration]
[Figure: the same stack, with My App now annotated with services that also exist inside RADOS: ✓ consensus ✓ atomic operations ✓ batching ✓ migration. The traditional storage interfaces (LIB, OBJ, BLK, FILE) and the application developer are still shown.]
Outline: Scalable, Global Namespaces
Mantle
Metadata IO does not scale like data IO
→ Distribute File System Metadata Across Cluster
Current Approaches:
Solution 1: Hash File ID
Solution 2: Subtree Partitioning
Dynamic version of these approaches
[Figure labels, shown for each solution: locality, balance]
Fundamental insight: file systems have mechanisms for metadata migration, but it is the policies that determine performance
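The locality versus balance trade-off between the two placement strategies can be sketched in a few lines. This is a minimal illustration, not code from any file system named in this talk: the function names, the `subtree_table` mapping, the example paths, and the choice of SHA-256 are all assumptions made for the example.

```python
import hashlib
from typing import Dict


def hash_placement(path: str, num_servers: int) -> int:
    """Solution 1 (sketch): hash the file identifier, here its path, to pick a server."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_servers


def subtree_placement(path: str, subtree_table: Dict[str, int]) -> int:
    """Solution 2 (sketch): the longest matching subtree prefix decides the server."""
    parts = path.rstrip("/").split("/")
    while parts:
        prefix = "/".join(parts) or "/"
        if prefix in subtree_table:
            return subtree_table[prefix]
        parts.pop()
    raise KeyError("subtree_table must contain the root '/'")


# Hypothetical partition: /home/alice is on server 2, the rest of /home on server 1.
subtree_table = {"/": 0, "/home": 1, "/home/alice": 2}
files = [f"/home/alice/run{i}.txt" for i in range(100)]

by_subtree = {subtree_placement(f, subtree_table) for f in files}
by_hash = {hash_placement(f, 4) for f in files}
print("subtree partitioning uses servers:", sorted(by_subtree))
print("hashing uses servers:", sorted(by_hash))
```

In this sketch, subtree partitioning keeps every file of one directory on a single server (locality), while hashing spreads the same files over several servers (balance).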
Mantle [SC '15, CCGrid '18]
Mantle: Programmable Load Balancer
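Current Approach: CephFS's Subtree Partitioning
[Figure: in the MDS cluster, built-in policies decide when, where, and how much to migrate; steps shown: recv HB, rebalance, fragment, migrate!]
Our Approach: Mantle
[Figure: an admin uses the Mantle API to set the when, where, and how much policies, trading off locality and balance]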
Policies Expressed w/ Storage-agnostic Language
Mantle: API & Policy Engine for FS Metadata Load Balancing (our solution)
[Figure: example policies for the MDS cluster. Annotated inputs: # of inode writes, metadata load on subtrees, load on myself, load on neighbor. Annotated remarks: good for mixed workloads, good for create-heavy workload, simple implementation.]
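A policy built from the inputs named above ("load on myself", "load on neighbor") can be pictured as three small decisions: when to migrate, where to send load, and how much to send. The sketch below is illustrative only. Python is used purely for the example, and the function names, the 1.5 imbalance factor, and the "move half the difference" rule are assumptions, not Mantle's API or its shipped policies.

```python
from typing import Dict, Optional, Tuple


def when(my_load: float, neighbor_loads: Dict[str, float], factor: float = 1.5) -> bool:
    """Migrate only when this server carries more than `factor` times the average load."""
    loads = list(neighbor_loads.values()) + [my_load]
    average = sum(loads) / len(loads)
    return average > 0 and my_load > factor * average


def where(neighbor_loads: Dict[str, float]) -> str:
    """Send load to the least loaded neighbor."""
    return min(neighbor_loads, key=neighbor_loads.get)


def how_much(my_load: float, target_load: float) -> float:
    """Move half of the difference so both servers end up roughly even."""
    return max(0.0, (my_load - target_load) / 2)


def decide(my_load: float, neighbor_loads: Dict[str, float]) -> Optional[Tuple[str, float]]:
    """Combine the three decisions; return None when no migration is needed."""
    if not neighbor_loads or not when(my_load, neighbor_loads):
        return None
    target = where(neighbor_loads)
    return target, how_much(my_load, neighbor_loads[target])


print(decide(90.0, {"mds1": 10.0, "mds2": 20.0}))  # overloaded server
print(decide(30.0, {"mds1": 30.0, "mds2": 30.0}))  # balanced cluster
```

Keeping the three decisions separate means a more conservative or more aggressive balancer only needs a different `when` or `how_much`, while the migration mechanism stays untouched.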
Evaluating the Mantle API & Policy Engine
Mantle: API & Policy Engine for FS Metadata Load Balancing (our solution)
[Figure: three plots of metadata throughput (reqs/s) over time (minutes), one per policy (GIGA+, modified GIGA+, LARD), each with the share of total load (%) on MDS0, MDS1, MDS2, MDS3]
[Load-split annotations, % of total load: 100; ½ load: 50, 50; ¼ load: 25, 25; ⅛ load: 13, 13; 25, 25, 25, 25; 75, 25]
Fundamental insight: file systems have mechanisms for metadata migration, but it is the policies that determine performance
Performance Summary
MDS capacity → 9% speedup
Conservative vs. Aggressive → 6-9% speedup
Metadata Protocols → 40% slowdown
Using API & Policy Engine in Other Domains
Load Balance ZLog Sequencers (distributed shared commit log)
Performance Summary: App-specific balancer → 1.5X speedup
Cache Management for ParSplice (molecular dynamics simulation)
Memory Savings Summary: App-specific cache → 32-66% less memory, → 0% perf. degradation
[Figure annotations: ↓ = workload access pattern detected; customized for sequencer workload; customized to be less aggressive]
Thanks, Michael Leece!
Mantle Takeaway
Mantle: general data management API
overhead of POSIX IO semantics
Outline: Scalable, Global Namespaces
Cudele
Current Approaches: Global FS consistency/durability semantics
… which hurts performance OR correctness
RAMDisk semantics: decoupled, no durability
DeltaFS semantics: decoupled, durable
HDFS semantics: weak consistency
POSIX IO semantics: strong consistency
Cudele [IPDPS '18]
Dynamically Assign Semantics to Subtrees
Cudele: API/policy engine for app-specific customizations (our solution)
With Cudele, clients can dynamically assign semantics to subtrees:
RAMDisk semantics: decoupled, no durability
DeltaFS semantics: decoupled, durable
HDFS semantics: weak consistency
POSIX IO semantics: strong consistency
Composable Interfaces for Building Guarantees
… with Cudele (our solution)
[Figure: Client, Metadata Servers, and Object Store, connected by the mechanisms: RPCs, stream, append client journal, volatile apply, local persist, global persist]
Leveraged Ceph Internal Subsystem: Inode Cache, Journal, Metadata Store, Journal Tool
Legend: C = Consistency, D = Durability
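The idea of composing mechanisms into a guarantee can be sketched with a toy model. The mechanism names below come from the slide, but everything else is an assumption for illustration: the `Cluster` state, the behavior given to each mechanism, and the three example compositions are hypothetical and are not Cudele's implementation or its documented compositions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Cluster:
    client_journal: List[str] = field(default_factory=list)  # updates buffered on the client
    client_disk: List[str] = field(default_factory=list)     # client-local durable copy
    mds_memory: List[str] = field(default_factory=list)      # metadata visible to all clients
    object_store: List[str] = field(default_factory=list)    # globally durable copy


Mechanism = Callable[[Cluster, List[str]], None]


def rpcs(c: Cluster, updates: List[str]) -> None:
    c.mds_memory.extend(updates)            # each update goes to the metadata server


def stream(c: Cluster, updates: List[str]) -> None:
    c.object_store.extend(updates)          # server journals updates to the object store


def append_client_journal(c: Cluster, updates: List[str]) -> None:
    c.client_journal.extend(updates)        # updates stay on the client for now


def volatile_apply(c: Cluster, updates: List[str]) -> None:
    c.mds_memory.extend(c.client_journal)   # merge the client journal into server memory


def local_persist(c: Cluster, updates: List[str]) -> None:
    c.client_disk.extend(c.client_journal)  # save the journal on the client's disk


def global_persist(c: Cluster, updates: List[str]) -> None:
    c.object_store.extend(c.client_journal)  # save the journal in the object store


def run(mechanisms: List[Mechanism], updates: List[str]) -> Cluster:
    """Apply a composition of mechanisms, in order, to a fresh toy cluster."""
    cluster = Cluster()
    for mechanism in mechanisms:
        mechanism(cluster, updates)
    return cluster


updates = ["mkdir /ckpt", "create /ckpt/rank0"]
strong = run([rpcs, stream], updates)                               # visible and durable
decoupled_volatile = run([append_client_journal], updates)          # neither, but cheap
decoupled_durable = run([append_client_journal, global_persist], updates)
print(strong, decoupled_volatile, decoupled_durable, sep="\n")
```

Reading the resulting state shows which guarantee a composition buys: an update in `mds_memory` is visible to other clients (C), and an update in `object_store` or `client_disk` survives a failure (D).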
Composable Interfaces for Building Guarantees
… with Cudele (our solution)
[Figure: Clients and a Metadata Server under strong consistency, weak consistency, and invisible consistency]
Performance Summary: custom fit subtree semantics
→ checkpoint-restart (91.7x speedup)
→ user home directories (0.03 stddev from optimal)
→ users checking partial results (2% overhead)
Cudele Takeaway
Mantle: general data management API
overhead of POSIX IO semantics
Cudele: different semantics can co-exist
read overheads (manage, materialize, transfer)
Outline: Scalable, Global Namespaces
Tintenfisch
Transferring and Materializing Large Lists
Current Approaches:
[Figure: traditional approach; clients exchange RPCs with the metadata server]
Tintenfisch [HotStorage '18]
Generate Namespaces for Large Lists
Our Solution:
[Figure: Client and Metadata Server, shown next to the traditional approach of client RPCs]
Example: PLFS Namespace
Middleware used in HPC for checkpoint-restart
[Figure: PLFS namespace tree, annotated: pattern; repeat pattern twice; PLFS specific metadata]
Example: Namespace Generators
Our Solution:
1. Formula for PLFS
2. Code for SIRIUS
3. Pointer for ROOT
[Figure: Obj Store]
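A formula-style generator can be pictured as a function that turns a few parameters into the full list of paths, so the list never has to be transferred or stored. The sketch below is a hypothetical illustration only: the directory and file names (`host.N`, `data.N`, `index.N`) and the parameters are assumptions made for the example, not the actual PLFS layout or the Tintenfisch interface.

```python
from typing import Iterator


def generate_namespace(container: str, hosts: int, procs_per_host: int) -> Iterator[str]:
    """Yield every file path of a regularly structured namespace.

    The repeated pattern is one directory per host, holding one data file and
    one index file per process (hypothetical naming).
    """
    for host in range(hosts):
        for proc in range(procs_per_host):
            rank = host * procs_per_host + proc
            yield f"{container}/host.{host}/data.{rank}"
            yield f"{container}/host.{host}/index.{rank}"


# Three small parameters stand in for the whole list of paths.
paths = list(generate_namespace("/ckpt", hosts=2, procs_per_host=2))
for path in paths:
    print(path)
```

With this sketch, a client and a server that share the three parameters can each materialize the same paths locally. The number of paths grows with the number of processes, while the description stays the same size.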
Tintenfisch Takeaway
Mantle: general data management API
overhead of POSIX IO semantics
Cudele: different semantics can co-exist
read overheads (manage, materialize, transfer)
Tintenfisch: metadata structure → generators
Academic/Community Impact
Malacology [EuroSys '17]
Mantle [SC '15, CCGrid '18]
Cudele [IPDPS '18]
Tintenfisch [HotStorage '18]
Community
Impact
Funding:
Merged into
Presented to community
Featured in
"Reproducible" papers (Popper-compliant)
Conclusion
(1) policies that shape metadata management techniques
(2) API for specifying policies; policy engine for guiding mechanisms
→ facilitates application-specific software stacks
Scalable FS metadata techniques exist (clean-slate/dirty-slate)
Problem! POSIX IO file system metadata access semantics are difficult to scale.
File System
Future work
What's Next?
Future work for me and this project:
Future work for Scalable Global Namespaces
Publications
This Thesis
[SC '15] Mantle: A Programmable Metadata Load Balancer for the Ceph File System
    M. Sevilla, N. Watkins, C. Maltzahn, I. Nassi, S. Brandt, S. Weil, G. Farnum, S. Fineberg
[EuroSys '17] Malacology: A Programmable Storage System
    M. Sevilla, N. Watkins, I. Jimenez, P. Alvaro, S. Finkelstein, J. LeFevre, C. Maltzahn
[HotStorage '17] DeclStore: Layering is for the Faint of Heart
    N. Watkins, M. Sevilla, I. Jimenez, K. Dahlgren, P. Alvaro, S. Finkelstein, C. Maltzahn
[IPDPS '18] Cudele: An API and Framework for Programmable Consistency and Durability in a Global Namespace
    M. Sevilla, I. Jimenez, N. Watkins, J. LeFevre, P. Alvaro, S. Finkelstein, P. Donnelly, C. Maltzahn
[HotStorage '18] Tintenfisch: File System Namespace Schemas and Generators
    M. Sevilla, R. Nasirigerdeh, C. Maltzahn, J. LeFevre, N. Watkins, P. Alvaro, M. Lawson, J. Lofstead, J. Pivarski
Popper
[;login '16] Standing on the Shoulders of Giants by Managing Scientific Experiments Like Software
    I. Jimenez, M. Sevilla, N. Watkins, C. Maltzahn, J. Lofstead, K. Mohror, R. Arpaci-Dusseau, A. Arpaci-Dusseau
[IPDPSW '17] The Popper Convention: Making Reproducible Systems Evaluation Practical
    I. Jimenez, M. Sevilla, N. Watkins, C. Maltzahn, J. Lofstead, K. Mohror, A. Arpaci-Dusseau, R. Arpaci-Dusseau
[ICPE '18] quiho: Automated Performance Regression Testing Using Fine Granularity Resource Utilization Profiles
    I. Jimenez, N. Watkins, M. Sevilla, J. Lofstead, C. Maltzahn
Big Data
[DISCS '13] A Framework for an In-depth Comparison of Scale-up and Scale-out
    M. Sevilla, I. Nassi, K. Ioannidou, S. Brandt, C. Maltzahn
[LSPP '14] SupMR: Circumventing Disk and Memory Bandwidth Bottlenecks for Scale-up MapReduce
    M. Sevilla, I. Nassi, K. Ioannidou, S. Brandt, C. Maltzahn
Thanks
Mentors: Carlos Maltzahn, Scott Brandt, Jeff LeFevre, Peter Alvaro, Ike Nassi, Shel Finkelstein, and Kleoni Ioannidou
Industry Peers: Sam Fineberg, Bob Franks, Brad Settlemyer, Sage Weil, Greg Farnum, John Spray, Patrick Donnelly, Danny Perez, David Rich, Galen Shipman, Jim Pivarski, Margaret Lawson
SRL team: Noah Watkins, Ivo Jimenez, Joe Buck, Dimitris Skourtis, Adam Crume, Andrew Shewmaker, Jianshen Liu, Reza Nasirigerdeh, Ken Iizawa, and Lucho Ionkov
Lessons Learned
Checklist for success (2012)
2015
2018
Questions?