1 of 39

Scalable, Global Namespaces with Programmable Storage

Michael A. Sevilla

April 25th, 2018

Dissertation Defense

LA-UR-18-21419

2 of 39

What is a global namespace?

Names Data; Hierarchical Structure

e.g., DNS, network topologies, URLs, scoping in programming languages

subtree

file

.txt

/dir

In this thesis, we focus on file system namespaces.

Hierarchical Semantics

Global Semantics

inherit

parent's ownership

strong

consistency

durability

Problem! POSIX IO file system metadata access semantics are difficult to scale.

Scalable, Global Namespaces

Dissertation Defense

Michael A. Sevilla

April 24, 2018

Introduction ←

Prototyping Platform

Subtree Load Balancing

API/framework

beyond FSs

Subtree Semantics

mechanisms

guarantees

Subtree Schemas

structure

generators

Conclusion

3 of 39

File system metadata access patterns:

→ Small and frequent requests

→ Target same resource

[Sevilla et al., SC'15]

Many metadata reads/writes

Fewer metadata reads/writes

Single Node

File System

metadata IO

(permissions, size, atime, etc.)

data IO

client

client

Distributed

File System

[Mesnier et al., IEEE Comm.]

data IO

metadata IO

metadata IO does not scale like data IO

… as a result

4 of 39

Scalable FS Metadata Access

… a brief history

'02

'04

'06

'08

'10

'12

'14

'16

'18

Metadata Distribution

Single Node Hash

Subtree Partition Table

CalvinFS

Ursa Minor

IBRIX

Colossus FS

Hierarchical Semantics

Global Semantics

inherit

parent's ownership

strong

consistency

durability

  • lock management
  • relaxing consistency
  • caching inodes
  • journal formats
  • journal safety
  • caching paths
  • metadata distribution
  • load balancing

GlusterFS

GPFS

Lazy

Hybrid

HBA

Giga+

PVFS2

SkyFS

{Index/Delta/Batch/Shard}FS

pNFS

MarFS

'clean-slate' file systems

→ migrate? compare?

'dirty-slate' file systems

→ modify code? tunables?

File System

PanFS

Ceph FS

Two Tiers

Farsite

HopsFS

HDFS

GFS

Lustre

ADLS

in summary

5 of 39

Programmable Storage

(our solution throughout this talk)

File System

Hierarchical Semantics

Global Semantics

inherit

parent's ownership

strong

consistency

durability

  • lock management
  • relaxing consistency
  • caching inodes
  • journal formats
  • journal safety
  • caching paths
  • metadata distribution
  • load balancing

6 of 39

Contributions:

1. Design policies that shape metadata management techniques

  • Subtree Load Balancing: Mantle [SC '15, CCGrid '18]
  • Subtree Semantics (Consistency/Durability): Cudele [IPDPS '18]
  • Subtree Schemas: Tintenfisch [HotStorage '18]

2. API for specifying policies; policy engine for guiding mechanisms

Prototyping platform: Malacology [EuroSys '17]

7 of 39

Outline: Scalable, Global Namespaces

Malacology: Prototyping Platform

Scalable Global Namespaces

  1. Subtree Load Balancing: Mantle
  2. Subtree Semantics (Consistency/Durability): Cudele
  3. Subtree Schemas: Tintenfisch

8 of 39

Ceph Background (Used to Build Malacology)

[Diagram: Ceph architecture. Traditional storage interfaces: LIB, OBJECT, BLOCK, FILE (hierarchical namespace). Components: clients, MDS Cluster, RADOS. Arrow labels: metadata IO, data IO, journal IO.]

Malacology [EuroSys '17]

9 of 39

Malacology: A Programmable Storage System

[Diagram: an application developer builds "My App" on the traditional storage interfaces (LIB, OBJ, BLOCK, FILE). RADOS services: atomic ops, batching, data access, consensus, migration.]

10 of 39

Malacology: A Programmable Storage System

[Diagram: the application developer's "My App" now lists consensus, atomic operations, batching, and migration, next to the traditional storage interfaces (LIB, OBJ, BLK, FILE). RADOS services: atomic ops, batching, data access, consensus, migration.]

11 of 39

Outline: Scalable, Global Namespaces

Mantle

12 of 39

Metadata IO does not scale like data IO

Distribute File System Metadata Across Cluster

Solution 1: Hash File ID

Solution 2: Subtree Partitioning

Dynamic version of these approaches

Current Approaches:

File systems have mechanisms for metadata migration but it is the policies that determine performance

fundamental insight

locality

balance

locality

balance
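The two placement schemes named on this slide can be contrasted with a short Python sketch. Everything in it is illustrative: the four-server cluster, the partition table, and the function names are assumptions made for the example, not details of any file system listed in the talk. Hashing spreads entries across servers (balance), while subtree partitioning keeps a directory's entries on one server (locality).

```python
import hashlib
from collections import Counter

NUM_MDS = 4  # hypothetical number of metadata servers


def hash_placement(path: str, num_mds: int = NUM_MDS) -> int:
    """Solution 1: hash a file identifier (here, the full path) to a server."""
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_mds


def subtree_placement(path: str, partition: dict) -> int:
    """Solution 2: the nearest ancestor listed in the partition table owns the path."""
    node = path
    while True:
        if node in partition:
            return partition[node]
        if node == "/":
            raise KeyError("partition table must contain '/'")
        # Step up one directory; the parent of "/home" is "/".
        node = node.rsplit("/", 1)[0] or "/"


# Hypothetical workload and partition table.
paths = [f"/home/alice/file{i}" for i in range(8)]
paths += [f"/scratch/job/ckpt{i}" for i in range(8)]
partition = {"/": 0, "/home": 1, "/scratch": 2}

hash_servers = Counter(hash_placement(p) for p in paths)
subtree_servers = Counter(subtree_placement(p, partition) for p in paths)
```

With subtree partitioning, all sixteen paths land on the two servers that own `/home` and `/scratch`; with hashing, the same paths scatter according to the hash function.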

Mantle [SC '15, CCGrid '18]

13 of 39

Mantle: Programmable Load Balancer

Current Approach: CephFS's Subtree Partitioning

[Diagram: MDS Cluster. Steps: recv HB, rebalance, fragment, migrate! Policies: when, where, how much.]

Our Approach: Mantle API

[Diagram: an admin supplies the when, where, and how much policies through the Mantle API. Labels: locality, balance.]

14 of 39

Policies Expressed w/ Storage-agnostic Language

Mantle: API & Policy Engine for FS Metadata Load Balancing (our solution)

MDS

Cluster

# of inode writes

metadata load on subtrees

load on myself

load on neighbor

Good for mixed workloads

Good for create-heavy workload

Simple implementation
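A minimal Python sketch of the idea that migration mechanisms stay fixed while a pluggable policy answers when, where, and how much. The slides do not show Mantle's actual policy syntax, so the `Policy` class, the callback signatures, and the "spill half to an idle server" policy below are all hypothetical; the load metrics stand in for the ones named on the slide (load on myself, load on neighbors, load on subtrees).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Loads = Dict[int, float]  # MDS rank -> metadata load (e.g., inode writes per second)


@dataclass
class Policy:
    when: Callable[[int, Loads], bool]                   # should this server migrate at all?
    where: Callable[[int, Loads], Dict[int, float]]      # target rank -> load to send
    how_much: Callable[[List[float], float], List[int]]  # which subtrees to ship


def spill_when(me: int, loads: Loads) -> bool:
    # Migrate only if I have load and some other server is idle.
    return loads[me] > 0 and any(l == 0 for r, l in loads.items() if r != me)


def spill_where(me: int, loads: Loads) -> Dict[int, float]:
    # Send half of my load to the lowest-ranked idle server.
    idle = [r for r, l in sorted(loads.items()) if r != me and l == 0]
    return {idle[0]: loads[me] / 2}


def biggest_first(subtree_loads: List[float], target: float) -> List[int]:
    # Greedy: take the heaviest subtrees until at least `target` load is covered.
    chosen, sent = [], 0.0
    for idx in sorted(range(len(subtree_loads)), key=lambda i: -subtree_loads[i]):
        if sent >= target:
            break
        chosen.append(idx)
        sent += subtree_loads[idx]
    return chosen


def rebalance(me: int, loads: Loads, subtree_loads: List[float], policy: Policy):
    """Policy engine: returns target rank -> indices of subtrees to migrate."""
    if not policy.when(me, loads):
        return {}
    return {rank: policy.how_much(subtree_loads, amount)
            for rank, amount in policy.where(me, loads).items()}


policy = Policy(spill_when, spill_where, biggest_first)
plan = rebalance(0, {0: 100.0, 1: 0.0, 2: 0.0}, [60.0, 30.0, 10.0], policy)
```

Swapping in a different `when`, `where`, or `how_much` changes the balancing behavior without touching `rebalance`, which is the separation of policy from mechanism that the slide describes.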

15 of 39

Evaluating the Mantle API & Policy Engine

Mantle: API & Policy Engine for FS Metadata Load Balancing (our solution)

[Figure: per-policy plots of metadata throughput (reqs/s) over time (minutes) and of each server's share of total load (%) for MDS0 to MDS3. Policy labels as extracted: GIGA+, GIGA+, modified, LARD. Load-split annotations: 100; ½ load (50/50); ¼ load (25/25); ⅛ load (13/13); 25/25/25/25; 75/25.]

File System

File systems have mechanisms for metadata migration but it is the policies that determine performance

fundamental insight

Performance Summary

  • MDS capacity → 9% speedup
  • Conservative vs. Aggressive → 6-9% speedup
  • Metadata Protocols → 40% slowdown

16 of 39

Load Balance ZLog Sequencers

Cache Management for ParSplice

Using API & Policy Engine in Other Domains

↓ = workload access pattern detected

customized for sequencer workload

Customized to be Less Aggressive

distributed shared commit log

molecular dynamics simulation

Thanks, Michael Leece!

App-specific balancer → 1.5X speedup

Performance Summary

App-specific cache → 32-66% less memory

0% perf. degradation

Memory Savings Summary

17 of 39

Mantle Takeaway

Mantle: general data management API

overhead of POSIX IO semantics

18 of 39

Outline: Scalable, Global Namespaces

Mantle: general data management API

overhead of POSIX IO semantics

Cudele

19 of 39

Global FS consistency/durability semantics

… which hurts performance OR correctness

Current Approaches:

  • decoupled, no durability: RAMDisk semantics
  • decoupled, durable: DeltaFS semantics
  • weak consistency: HDFS semantics
  • strong consistency: POSIX IO semantics

Cudele [IPDPS '18]

20 of 39

Dynamically Assign Semantics to Subtrees

Cudele: API/policy engine for app-specific customizations (our solution)

With Cudele, clients can:

  • relax consistency
    • decouple subtrees
    • lock subtrees
  • adjust durability

  • decoupled, no durability: RAMDisk semantics
  • decoupled, durable: DeltaFS semantics
  • weak consistency: HDFS semantics
  • strong consistency: POSIX IO semantics
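Assigning semantics to subtrees can be pictured as a lookup in which a path takes the setting of its nearest configured ancestor. The Python sketch below is an assumption-laden illustration, not Cudele's API: the `Namespace` class, its method names, and the choice of POSIX IO semantics as the root default are all hypothetical. Only the four semantics labels come from the slide.

```python
from enum import Enum


class Semantics(Enum):
    POSIX = "strong consistency"
    HDFS = "weak consistency"
    DELTAFS = "decoupled, durable"
    RAMDISK = "decoupled, no durability"


class Namespace:
    """Hypothetical registry mapping subtree roots to semantics."""

    def __init__(self, default=Semantics.POSIX):
        # Assumption: the root starts with the strongest semantics.
        self._policies = {"/": default}

    def set_semantics(self, subtree, semantics):
        self._policies[self._normalize(subtree)] = semantics

    def semantics_for(self, path):
        # Walk up the hierarchy until a configured ancestor is found.
        node = self._normalize(path)
        while node not in self._policies:
            node = node.rsplit("/", 1)[0] or "/"
        return self._policies[node]

    @staticmethod
    def _normalize(path):
        if not path.startswith("/"):
            raise ValueError("expected an absolute path")
        return path.rstrip("/") or "/"


ns = Namespace()
ns.set_semantics("/scratch/ckpt", Semantics.RAMDISK)
ns.set_semantics("/scratch/ckpt/keep", Semantics.DELTAFS)
```

In this sketch a checkpoint directory can run with relaxed semantics while home directories under the same root keep the default, so different semantics co-exist in one global namespace.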

21 of 39

Composable Interfaces for Building Guarantees

… with Cudele (our solution)

[Diagram: Client, Metadata Servers, and Object Store, connected by six mechanisms: RPCs, stream, append client journal, volatile apply, local persist, global persist. Legend: C = Consistency, D = Durability.]

Leveraged Ceph Internal Subsystem: Inode Cache, Journal, Metadata Store, Journal Tool
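One way to read "composable" is that a subtree's guarantee is an ordered list of mechanisms. The Python sketch below uses the six mechanism names from the slide, but what each one does here is a simplified guess from its name, and the `State` model, the `run` helper, and the example compositions are hypothetical. It is not a description of how Cudele implements them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class State:
    client_journal: List[str] = field(default_factory=list)  # client memory
    client_disk: List[str] = field(default_factory=list)     # client-local storage
    mds_memory: List[str] = field(default_factory=list)      # metadata server cache
    object_store: List[str] = field(default_factory=list)    # shared durable storage


# Assumed behavior of each mechanism (simplified for illustration).
def rpcs(s, ops): s.mds_memory.extend(ops)                       # send each op to the MDS
def append_client_journal(s, ops): s.client_journal.extend(ops)  # log ops on the client
def volatile_apply(s, ops): s.mds_memory.extend(s.client_journal)  # merge journal into MDS memory
def stream(s, ops): s.object_store.extend(s.mds_memory)          # MDS writes updates to the object store
def local_persist(s, ops): s.client_disk.extend(s.client_journal)    # journal to client disk
def global_persist(s, ops): s.object_store.extend(s.client_journal)  # journal to object store


MECHANISMS: Dict[str, Callable[[State, List[str]], None]] = {
    "RPCs": rpcs,
    "append client journal": append_client_journal,
    "volatile apply": volatile_apply,
    "stream": stream,
    "local persist": local_persist,
    "global persist": global_persist,
}


def run(composition: List[str], ops: List[str]) -> State:
    """Apply the named mechanisms in order and return the resulting state."""
    state = State()
    for name in composition:
        MECHANISMS[name](state, ops)
    return state


ops = ["create a", "create b"]
strong_like = run(["RPCs", "stream"], ops)
decoupled = run(["append client journal"], ops)
decoupled_durable = run(["append client journal", "local persist"], ops)
```

Under these assumptions, the composition decides where updates are visible (consistency) and where they survive a failure (durability), which is the trade-off the C and D labels in the diagram point at.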

22 of 39

Composable Interfaces for Building Guarantees

… with Cudele (our solution)

[Diagram: Clients and a Metadata Server under three settings: strong consistency, weak consistency, invisible consistency.]

Performance Summary: custom fit subtree semantics

→ checkpoint-restart (91.7x speedup)

→ user home directories (0.03 stddev from optimal)

→ users checking partial results (2% overhead)

23 of 39

Composable Interfaces for Building Guarantees

… with Cudele (our solution)

[Diagram: File System with Clients and a Metadata Server under three settings: strong consistency, weak consistency, invisible consistency.]

Performance Summary: custom fit subtree semantics

→ checkpoint-restart (91.7x speedup)

→ user home directories (0.03 stddev from optimal)

→ users checking partial results (2% overhead)

24 of 39

Cudele Takeaway

Mantle: general data management API

overhead of POSIX IO semantics

Cudele: different semantics can co-exist

read overheads (manage, materialize, transfer)

25 of 39

Outline: Scalable, Global Namespaces

Mantle: general data management API

overhead of POSIX IO semantics

Cudele: different semantics can co-exist

read overheads (manage, materialize, transfer)

Tintenfisch

26 of 39

Transferring and Materializing Large Lists

Current Approaches:

Client

Metadata Server

Traditional

Client

RPCs

  • High Performance Computing
  • High Energy Physics
  • Fusion Simulation

Tintenfisch [HotStorage '18]

27 of 39

Generate Namespaces for Large Lists

Our Solution:

Client

Metadata Server

Traditional

Client

RPCs

  • High Performance Computing
  • High Energy Physics
  • Fusion Simulation

28 of 39

Example: PLFS Namespace

Middleware used in HPC for checkpoint-restart

pattern

Repeat Pattern twice

PLFS specific metadata

29 of 39

Example: Namespace Generators

Our Solution:

  1. Formula for PLFS (High Performance Computing)
  2. Code for SIRIUS (Fusion Simulation)
  3. Pointer for ROOT (High Energy Physics)

Obj Store

*

*

*
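The "formula" flavor of a namespace generator can be sketched in Python: when a namespace follows a repeating pattern, a few parameters are enough to regenerate the whole listing instead of transferring and materializing it. The directory and file names below (`hostdir.N`, `data.N`, `index.N`) and the function names are hypothetical placeholders loosely modeled on a PLFS-style container; they are not the real PLFS on-disk layout.

```python
from typing import Iterator


def plfs_like_namespace(container: str, hosts: int, procs_per_host: int) -> Iterator[str]:
    """Lazily yield every path in a container from three parameters."""
    yield container
    for h in range(hosts):
        hostdir = f"{container}/hostdir.{h}"
        yield hostdir
        # The same per-process pattern repeats under every host directory.
        for p in range(procs_per_host):
            yield f"{hostdir}/data.{p}"
            yield f"{hostdir}/index.{p}"


def namespace_size(hosts: int, procs_per_host: int) -> int:
    """Closed-form entry count: the container, plus per host one directory and two files per process."""
    return 1 + hosts * (1 + 2 * procs_per_host)


paths = list(plfs_like_namespace("/ckpt", hosts=2, procs_per_host=3))
```

A client holding only `("/ckpt", 2, 3)` can enumerate the same paths locally, so the compact description grows with the number of parameters rather than with the number of files.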

30 of 39

Tintenfisch Takeaway

Mantle: general data management API

overhead of POSIX IO semantics

Cudele: different semantics can co-exist

read overheads (manage, materialize, transfer)

Tintenfisch: metadata struct. → generators

31 of 39

Academic/Community Impact

Malacology [EuroSys '17]

Mantle [SC '15, CCGrid '18]

Cudele [IPDPS '18]

Tintenfisch [HotStorage '18]

Community

Impact

Funding:

Merged into

Presented to community

Featured in

"Reproducible" papers (Popper-compliant)

32 of 39

Conclusion

policies that shape metadata management techniques

  • subtree load balancing
  • subtree semantics
  • subtree schemas

1

API for specifying policies;

Policy engine for guiding mechanisms

2

→ facilitates application-specific software stacks

Scalable FS metadata techniques exist (clean-slate/dirty-slate)

Problem! POSIX IO file system metadata access semantics are difficult to scale.

File System

33 of 39

Future work

34 of 39

What's Next?

Future work for me and this project:

  • academic interest [publications, conferences]
  • community interest [co-authors, pull requests]
    • important part of Ceph project
    • other communities

Future work for Scalable Global Namespaces

  • Subtree Load Balancing: general data management policies that can be used across apps/storage systems
  • Subtree semantics: dynamically changing subtree semantics; embeddable policies (child subtrees)
  • Subtree schemas: prototype; storage system agnostic metadata generation

35 of 39

Publications

This Thesis

  • [SC '15] Mantle: A Programmable Metadata Load Balancer for the Ceph File System. M. Sevilla, N. Watkins, C. Maltzahn, I. Nassi, S. Brandt, S. Weil, G. Farnum, S. Fineberg
  • [EuroSys '17] Malacology: A Programmable Storage System. M. Sevilla, N. Watkins, I. Jimenez, P. Alvaro, S. Finkelstein, J. LeFevre, C. Maltzahn
  • [HotStorage '17] DeclStore: Layering is for the Faint of Heart. N. Watkins, M. Sevilla, I. Jimenez, K. Dahlgren, P. Alvaro, S. Finkelstein, C. Maltzahn
  • [IPDPS '18] Cudele: An API and Framework for Programmable Consistency and Durability in a Global Namespace. M. Sevilla, I. Jimenez, N. Watkins, J. LeFevre, P. Alvaro, S. Finkelstein, P. Donnelly, C. Maltzahn
  • [HotStorage '18] Tintenfisch: File System Namespace Schemas and Generators. M. Sevilla, R. Nasirigerdeh, C. Maltzahn, J. LeFevre, N. Watkins, P. Alvaro, M. Lawson, J. Lofstead, J. Pivarski

Popper

  • [;login '16] Standing on the Shoulders of Giants by Managing Scientific Experiments Like Software. I. Jimenez, M. Sevilla, N. Watkins, C. Maltzahn, J. Lofstead, K. Mohror, R. Arpaci-Dusseau, A. Arpaci-Dusseau
  • [IPDPSW '17] The Popper Convention: Making Reproducible Systems Evaluation Practical. I. Jimenez, M. Sevilla, N. Watkins, C. Maltzahn, J. Lofstead, K. Mohror, A. Arpaci-Dusseau, R. Arpaci-Dusseau
  • [ICPE '18] quiho: Automated Performance Regression Testing Using Fine Granularity Resource Utilization Profiles. I. Jimenez, N. Watkins, M. Sevilla, J. Lofstead, C. Maltzahn

Big Data

  • [DISCS '13] A Framework for an In-depth Comparison of Scale-up and Scale-out. M. Sevilla, I. Nassi, K. Ioannidou, S. Brandt, C. Maltzahn
  • [LSPP '14] SupMR: Circumventing Disk and Memory Bandwidth Bottlenecks for Scale-up MapReduce. M. Sevilla, I. Nassi, K. Ioannidou, S. Brandt, C. Maltzahn

36 of 39

Thanks

Mentors: Carlos Maltzahn, Scott Brandt, Jeff LeFevre, Peter Alvaro, Ike Nassi, Shel Finkelstein, and Kleoni Ioannidou

Industry Peers: Sam Fineberg, Bob Franks, Brad Settlemyer, Sage Weil, Greg Farnum, John Spray, Patrick Donnelly, Danny Perez, David Rich, Galen Shipman, Jim Pivarski, Margaret Lawson

SRL team: Noah Watkins, Ivo Jimenez, Joe Buck, Dimitris Skourtis, Adam Crume, Andrew Shewmaker, Jianshen Liu, Reza Nasirigerdeh, Ken Iizawa, and Lucho Ionkov

37 of 39

Lessons Learned

Checklist for success (2012)

  • High paying job
  • Certainty for future
  • Pass coding interview

2015

  • High paying job
  • Certainty for future
  • Pass coding interview

2018

  • High paying job
  • Certainty for future
  • Pass coding interview

38 of 39

39 of 39

Questions?
