1 of 78

Programmable storage

19 June 2018

Noah Watkins

Thesis defense

2 of 78

Three common application I/O stack architectures

[Figure: three common I/O stack architectures. (1) n apps over POSIX* file storage with middleware** (*POSIX-ish; **HDF5, PLFS, MPI-IO); (2) an app over its own app-specific storage; (3) n apps over unified storage exposing file, object, and block interfaces.]

3 of 78

Redundancy and specialization costs


Paxos is like the simplest thing ever…

4 of 78

System stabilization is expen$$ive

“It takes 10 years before a new storage system is trusted.”

-- Gary Grider (LANL), the man, the myth, the legend

-- Brent Welch (MSST 2010), Footbag World Champion, Mixed Doubles, 1997

Eyeballs act as a proxy for reliability!

Share code-hardened sub-systems!

5 of 78

Outline and contributions

1. Programmable storage design paradigm
    • Data interfaces: transactional data, compute resources, structured data, graph processing, durability
    • Metadata management: naming resources, shared resources, cluster-level metadata
2. Development process: “Will this ever work?”
3. Declarative storage; in-vivo storage development

Venues: EuroSys ’18, HotStorage ’17, BDMC ’13, PDSW ’12, in progress

6 of 78

Outline and contributions


7 of 78

Outline and contributions

1. A new way to build storage interfaces!

  • Don’t re-implement an entire system
  • Recombine existing sub-systems

8 of 78

Outline and contributions

2. Widely applicable, but challenges!

  • Design space and unpredictable costs
  • Problems → research opportunities

9 of 78

Outline and contributions

3. Tying it all together

  • Declaratively specifying interfaces
  • Applying DB optimization techniques to composable subsystems

10 of 78

Programmability: avoid duplication and specialization

[Figure: the same three I/O stack architectures as before: POSIX file storage with middleware (POSIX-ish; HDF5, PLFS, MPI-IO), app-specific storage, and unified storage exposing file, object, and block interfaces.]

11 of 78

Programmable storage

[Figure: n+k apps, including new services A, B, and C, over a programmable storage system that shares common sub-systems behind its file, object, and block interfaces; contrasted with the POSIX-with-middleware and app-specific stacks.]
Programmable storage

A storage system that facilitates the reuse and extension of existing storage abstractions provided by the underlying software stack, enabling the creation of new services via composition.

12 of 78

Programmable storage compared to…

Software-defined storage (SDS)

  • Buzzword for provisioning
  • Original SDS work
    • [Thereska, ‘13], IOFlow
    • [Stefanovici, ‘16], sRoute
  • Recombining services
    • [Dorier, ‘17], CoSS/Mochi
    • [Gracia-Tinedo, ‘17], Crystal

Active storage

  • Use remote hardware computation resources
  • Original work with disks
    • [Riedel, ‘98] Active storage
    • [Keeton, ‘98], Intelligent disks
  • Object-based storage
    • [Du, ‘05], Intelligent OSDs
    • [Xie, ‘11], T10 integration
  • Recent work with SSDs
    • [Seshadri, ‘14], Willow
    • [Jo, ‘16], YourSQL
    • [Do, ‘13], SmartSSDs


Motivation, use cases, and lessons learned

Similar motivation, HPC specific, entire systems from scratch

13 of 78

Programmable storage exposes internal subsystems

[Figure: a storage system exposing internal subsystems (consensus, persistence, migration, batching, atomic operations) behind its file, object, and block interfaces; new interfaces for data I/O, service metadata, file types, shared resources, and durability are built through composition and customization.]

14 of 78

Programmability survey in Ceph (data interfaces)

[Chart: survey results, grouping the app-specific interfaces found in Ceph.]

15 of 78

Programmability in Ceph (reusable data interfaces)


Method Examples

  • Locking and concurrency control
  • Logging
  • Metadata management
  • Remote compute


Developers are willing to break layers and use non-standard APIs

16 of 78

Outline and contributions


17 of 78

Driving example: CORFU distributed shared-log

Balakrishnan et al., “CORFU: A Shared Log Design for Flash Clusters”, NSDI ’12

[Figure: clients append to a log striped across flash units 1–4; a primary sequencer, with a backup, hands out positions 1, 2, 3, 4, 5, … (pos = seq++; send_msg(pos)), giving high-performance, totally ordered appends; clients read from any position.]

How can we implement CORFU with programmability?
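As a concrete reference point, here is a minimal, hypothetical sketch of the append path described above (names are illustrative; the real CORFU protocol adds epochs, sealing, replication, and hole-filling):

```python
class Sequencer:
    """Hands out strictly increasing log positions: pos = seq++."""
    def __init__(self):
        self.seq = 0

    def next_pos(self):
        pos = self.seq
        self.seq += 1
        return pos          # send_msg(pos)


class StripedLog:
    """Write-once log positions striped round-robin across flash units."""
    def __init__(self, num_units):
        self.units = [dict() for _ in range(num_units)]

    def write(self, pos, entry):
        unit = self.units[pos % len(self.units)]
        if pos in unit:
            raise ValueError("position already written")  # write-once
        unit[pos] = entry

    def read(self, pos):
        return self.units[pos % len(self.units)][pos]


def append(seq, log, entry):
    # Total ordering comes from the sequencer; the data path goes
    # directly to the striped units, keeping appends high-performance.
    pos = seq.next_pos()
    log.write(pos, entry)
    return pos
```

A client appends by asking the sequencer for a position and then writing directly to the unit owning that position; reads bypass the sequencer entirely.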

18 of 78

Driving example: CORFU distributed shared-log

  • Implementation concerns
    • Partitioning, distribution
    • Metadata management
    • I/O interfaces

ZLog: CORFU on Ceph (Malacology)

  • Design space of 9 strategies
    • 1 partitioning
    • 3 I/O interfaces for entry data: bytestream, omap (K/V), xattr (K/V)
    • 3 I/O interfaces for metadata: the same three
    • Underlying choices: BlueStore or XFS; RocksDB or LMDB; SSD, HDD, or NVMe
  • How to evaluate strategies?
    • Don’t want 9 implementations!
    • Metrics (throughput, latency)
    • Reduce the search space
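The nine-strategy design space is just a cross product; a tiny sketch (the partitioning label is illustrative):

```python
from itertools import product

# 1 partitioning strategy x 3 entry interfaces x 3 metadata interfaces = 9
partitionings = ["striped"]                    # illustrative label
interfaces = ["bytestream", "omap", "xattr"]   # Ceph data interfaces

strategies = [
    {"partitioning": p, "entry": e, "metadata": m}
    for p, e, m in product(partitionings, interfaces, interfaces)
]
```

Each strategy then has to be prototyped and measured (throughput, latency), which is exactly why pruning the space matters.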

19 of 78

Design space: ZLog design on Ceph

| metadata \ entry | omap | bytestream | xattr |
|------------------|------|------------|-------|
| omap             |      |            |       |
| bytestream       |      |            |       |
| xattr            |      |            |       |

Are these combinations worth exploring?

20 of 78

Append throughput for 128 byte log entries


21 of 78

Design space: ZLog design on Ceph

| metadata \ entry | omap                     | bytestream               | xattr |
|------------------|--------------------------|--------------------------|-------|
| omap             | IOPS; stable performance | No                       |       |
| bytestream       | No                       | IOPS; stable performance |       |
| xattr            | ???                      | ???                      |       |

22 of 78

Append performance for a variety of entry sizes

[Chart: append throughput vs. append size (bytes).]

4K exception (config!):

  • if |Entry| < 4K − X → omap
  • else bytestream + pad to 4K alignment
  • if |Entry| > 4K + X → omap
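One plausible reading of the rule above, as a sketch. The slide’s bullets are ambiguous about the > 4K + X case; this version sends large entries to bytestream, which matches the design-space matrix (omap wins below ~8KB, bytestream above). `slack` stands in for the configurable X; all names are illustrative:

```python
ALIGN = 4 * 1024  # 4K alignment unit

def choose_entry_interface(entry_len, slack):
    """Pick a data interface (and on-media size) for one log entry."""
    if entry_len < ALIGN - slack:
        return "omap", entry_len                 # small entries: K/V wins
    if entry_len <= ALIGN + slack:
        padded = -(-entry_len // ALIGN) * ALIGN  # round up to 4K boundary
        return "bytestream", padded              # near 4K: pad to alignment
    return "bytestream", entry_len               # large entries
```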

23 of 78

Design space: ZLog design on Ceph

| metadata \ entry | omap | bytestream | xattr |
|------------------|------|------------|-------|
| omap             | IOPS (< 8KB), or unaligned (see →); stable performance, or IOPS (> 8KB), or aligned | No | |
| bytestream       | No   | stable performance; or IOPS (> 8KB); or pad aligned < X | |
| xattr            | ???  | ???        |       |

24 of 78

What about extended attributes (xattr)

  • Extended attributes are fast
    • Cached in-memory
    • Efficient [small] writes


  • Created a design for small metadata
    • Compressed bitmaps!
    • Expected benefit over heavy-weight interfaces
  • The results are inconclusive
  • Ceph is gaslighting us!
    • And we are relative experts

What am I missing?

Are my other decisions incorrect?
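The compressed-bitmap idea can be sketched with simple run-length encoding (a hypothetical sketch, not the actual design): written log positions are mostly dense runs, so the encoding stays small enough for a lightweight interface like xattr.

```python
def compress(positions):
    """Sorted, unique log positions -> list of (start, length) runs."""
    runs = []
    for p in positions:
        if runs and p == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)  # extend current run
        else:
            runs.append((p, 1))                        # start a new run
    return runs

def contains(runs, pos):
    """Has this log position been written?"""
    return any(start <= pos < start + length for start, length in runs)
```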

25 of 78

Interest in ZLog from MegaCorp®


26 of 78

Acceptance of programmable storage paradigm

  • Salesforce was using Apache BookKeeper, a distributed logging solution
  • Everything is separate
    • Hardware storage cluster
    • Software ecosystem
    • External services (ZooKeeper)
    • Maintenance and expertise


... we are considering large scale deployments of Ceph ...

... Zlog seems more attractive as its on the same technology stack.

27 of 78

Many use cases; tail latency is universally important

27

28 of 78

Tail latency in Ceph isn’t good, but interfaces matter!

28

Interfaces affect tail latency

29 of 78

Design space: ZLog design on Ceph

| metadata \ entry | omap | bytestream | xattr |
|------------------|------|------------|-------|
| omap             | IOPS (< 8KB), or unaligned (see →); stable performance, or IOPS (> 8KB), or aligned; tail latency | No | |
| bytestream       | No   | stable performance; or IOPS (> 8KB); or pad aligned < X; tail latency | |
| xattr            | No   | No         |       |

30 of 78

Navigating the design space is an obstacle

  • The CORFU storage abstraction is conceptually simple
    • Far more complex abstractions exist!
  • Searching and pruning the design space are essential
  • Need to build actual prototypes
    • Implementation issues always emerge
    • What about reads, scans, etc.?
  • How do we know we aren’t missing something?
    • We have a deep understanding of Ceph, and still have doubts


31 of 78

Outline and contributions


32 of 78

Outline and contributions


33 of 78

How to grow a database: scale-up approach

[Figure: a single database node (CPU, RAM) attached to database storage over a network/bus; queries (Q) run on the node.]

https://aws.amazon.com/ec2/instance-types/

  • This isn’t a talk about database architectures
  • There are many, many approaches to scalability (MPP, hybrid MPP, etc.)

34 of 78

Skyhook: exploit storage resources

[Figure: the single-node architecture (database node with CPU, RAM, and attached database storage) beside the Skyhook architecture, where queries (Q) are pushed over the network into programmable storage.]

Skyhook project

  • Elastic database system
  • Lead: Jeff LeFevre
  • Active CROSS incubator

35 of 78

Skyhook: aligns data with storage interfaces

[Figure: a table is partitioned into shards, each stored as a set of objects { object.i } in a Ceph cluster; every Ceph OSD has its own CPU, RAM, and storage plus a local index, and serves a DB-specific data interface providing indexing, projection, filtering, and aggregation; the database node (clients C1–C3) reaches the cluster through Foreign Data Wrappers.]

Programmability used in a completely different way than to build a log abstraction.
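A hypothetical sketch of the pushdown idea in this figure (illustrative names, not the real Skyhook API): each storage server filters and projects its own object shards, so the database node only merges matching rows.

```python
def scan_object(rows, predicate, columns):
    """Runs on the storage server, next to the shard's data and index."""
    return [tuple(row[c] for c in columns) for row in rows if predicate(row)]

def run_query(object_shards, predicate, columns):
    """Database node: fan the scan out to every object, merge results."""
    results = []
    for rows in object_shards:        # in reality, one request per object
        results.extend(scan_object(rows, predicate, columns))
    return results
```

For a 10%-selectivity range query, only about 10% of the table crosses the network instead of every row.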

36 of 78

Skyhook experiments with programmable storage

  • Real-world dataset
    • TPC-H lineitem table
    • 1 billion rows
    • 140 GB
  • Storage in Ceph objects
    • Table divided into ~10,000 14 MB objects
      • Optimized for the workload (e.g. 4 MB)
    • Each object contains a dedicated index
      • Index stored in omap (RocksDB)
  • Storage hardware (thanks, CloudLab!)
    • Modern 20-core Intel CPUs
    • 128 GB DRAM, 500 GB SSD
    • 10 Gb/s Ethernet
    • 1–16 Ceph nodes

[Figure: queries (Q) pushed from the database node over the network into programmable storage via the database-specific data interface.]

37 of 78

Benchmark queries evaluated

Qa: Range query with 10% selectivity:

SELECT * FROM lineitem WHERE extendedprice > 71000.0

Qb: Point query (unique row), issued with and without an index:

SELECT extendedprice FROM lineitem WHERE orderkey=5 AND linenumber=3

Qc: Regex query with 10% selectivity (CPU intensive):

SELECT * FROM lineitem WHERE comment ILIKE '%uriously%'


+ data loading

38 of 78

Range query performance (10% selectivity)

[Chart: range query (10% selectivity) runtime for client-side vs. server-side processing; lower is better.]

Improved I/O performance

  • Local I/O bandwidth
  • Local CPU resources
  • Reduced network traffic
  • CPU parallelism

39 of 78

Bulk-load and index generation performance

[Chart: bulk-load and index-generation time; lower is better.]

  • Internal data structures don’t handle bulk inserts efficiently
  • Per-object overheads accumulate

40 of 78

Point query performance (find unique row)

[Chart: point query runtime (finding a unique row among 1 billion) for client-side, server-side, and server-side-with-index processing; lower is better.]

  • Local I/O bandwidth
  • Local CPU resources
  • Reduced network traffic
  • CPU parallelism
  • With indexes: 10,000 index lookups (one per object) instead of scanning 1 billion rows

41 of 78

Outline and contributions


42 of 78

Outline and contributions


43 of 78

What is durability?

  • Overloaded terminology
    • You get out what you put into it
    • Preserve historically important info
    • Always access the very best data now
  • We assume storage systems expose durable interfaces
    • Unless explicitly stated otherwise
  • Broad class of subsystems
    • Redundancy (replication, EC)
    • Recovery / failover / availability
    • Consistency semantics
    • Properties and behaviors of media

[Image: long-term storage (500 years).]

44 of 78

Driving example: CORFU distributed shared-log

44

Balakrishnan et al., “CORFU: A Shared Log Design for Flash Clusters”, NSDI, `12

log striping

Sequencer (P)

Sequencer (B)

1, 2, 3, 4, 5, ….

read

clients

append

Programmed data interface

  • State: integer
  • Read, ReadNext

Ceph data interfaces

  • They are all durable / persistent!
  • Test it out: map to DRAM

45 of 78

Persistent media is only part of the bottleneck

[Chart: SSD vs. DRAM.]

46 of 78

Software is a bottleneck

[Figure: the I/O path from a client request, over the network, through queuing and scheduling, transactional context, concurrency control, replication, tiering, clone/CoW, indexing, and error handling, down to persistent storage media.]

This I/O path is taken by all requests… regardless of need!

Goal: optimize for broad spectrum of request needs

State: intertwined with correctness-sensitive handling

47 of 78

The CORFU sequencer is a high availability service

Balakrishnan et al., “CORFU: A Shared Log Design for Flash Clusters”, NSDI ’12

[Figure: CORFU clients, the striped log, and the primary and backup sequencers.]

Ceph already provides availability... recovery!

  • But the state is volatile…
  • What is recovered?

function corfu_sequencer_recover():
    max_pos = 0
    for each storage device:
        max_pos = max(max_pos, seal(device))
    return max_pos
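The recovery routine can be sketched concretely (hypothetical names): sealing fences out writes from the old epoch and returns the highest position each device has seen, and the new sequencer resumes past the maximum.

```python
class Device:
    def __init__(self):
        self.sealed_epoch = 0
        self.max_pos = -1                      # highest position written here

    def write(self, epoch, pos, entry):
        if epoch < self.sealed_epoch:
            raise RuntimeError("stale epoch")  # old sequencer fenced out
        self.max_pos = max(self.max_pos, pos)

    def seal(self, epoch):
        self.sealed_epoch = epoch              # refuse writes from older epochs
        return self.max_pos


def recover_sequencer(devices, new_epoch):
    """Return the first unused log position for the new sequencer."""
    return max(d.seal(new_epoch) for d in devices) + 1
```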

48 of 78

Data availability applied to sequencer interface

[Timeline: the sequencer runs at OSD.0; after OSD.0 fails it recovers at OSD.1, and after OSD.1 fails, at OSD.2; failure-detection timeouts and the number of replicas are configurable.]

49 of 78

Outline and contributions


50 of 78

Outline and contributions


51 of 78

Metadata management

  • Metadata is everywhere in storage
    • It gives the data we store meaning
    • Plays a supportive role in everything
      • Naming resources
      • Feature implementation

[Figure: apps over a storage system with file and object interfaces; a POSIX file resource (mount/ with file1, file2) and a config DB both carry metadata.]

  • Many types of metadata
    • POSIX file abstraction
    • Cluster-level metadata
    • Sub-systems (caching, scrubbing, …)

52 of 78

Metadata management

  • Metadata is everywhere in storage
    • It gives the data we store meaning
    • Plays a supportive role in everything
      • Naming resources
      • Feature implementation
  • Many types of metadata
    • POSIX file abstraction
    • Cluster-level metadata
    • Sub-systems (caching, scrubbing, …)
  • New interfaces with programmability
    • A programmable interface needs a name
    • Instances of interfaces track metadata
    • Need similar sets of services

[Figure: the same stack, now with programmable interfaces A, B, and C alongside file and object.]

Need: naming, metadata storage, etc.

53 of 78

POSIX namespace management of all interfaces

[Figure: a metadata cluster (concurrency control, capabilities, security, cache management) manages a POSIX namespace: mount/ with /users/ (jane/, jerry/, john/) and /science/ (fake/, data/, graphs/), backed by inodes; alongside them, /log-instances/ holds log1, log2, log3 and log4/ (stream0, stream1): instances of ZLog used by applications.]

54 of 78

Programmable metadata management (file types)

[Figure: the metadata cluster gains file types: alongside POSIX/File inodes, ZLog Basic and ZLog Streaming inodes back the /log-instances/ entries.]

[Karpovich, ’94]

55 of 78

Programmable metadata management (file types)

[Figure: an FS client with a ZLog interface resolves /log-instances/log1 to a ZLog Basic inode; the inode’s ZLog metadata (cost model, sequencer IP) provides naming and discovery of the primary and backup sequencers.]

[Karpovich, ’94]

56 of 78

Programmable metadata management (coherency)

[Figure: multiple FS clients with ZLog interfaces share /log-instances/ inodes; the metadata cluster’s cache-invalidation protocol can enforce exclusive access to metadata, including the sequencer state (an integer: pos = seq++).]

57 of 78

The capability-based sequencer is round-robin

[Figure: the capability over the sequencer state rotates round-robin among FS clients; compared to a centralized architecture this can benefit bursty workloads, under best-effort, delay, and quota policies.]
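A minimal sketch of the round-robin capability scheme (hypothetical names): the metadata service grants one client at a time an exclusive capability over the sequencer state, and clients batch appends while they hold it.

```python
from collections import deque

class CapabilitySequencer:
    def __init__(self, clients):
        self.waiting = deque(clients)         # round-robin grant order
        self.holder = self.waiting.popleft()  # current capability holder
        self.seq = 0                          # the shared state: an integer

    def append(self, client, batch=1):
        if client != self.holder:
            raise PermissionError("client does not hold the capability")
        first = self.seq
        self.seq += batch                     # pos = seq++, batched locally
        return first

    def rotate(self):
        # Revoke via the cache-invalidation protocol and grant to the next
        # client; policies (best effort, delay, quota) decide when.
        self.waiting.append(self.holder)
        self.holder = self.waiting.popleft()
```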

58 of 78

Finding trade-offs with the capability-based sequencer

[Charts: throughput per policy and latency per policy (higher throughput is better); more opportunities to receive the shared-resource capability.]

59 of 78

Outline and contributions


60 of 78

Outline and contributions


61 of 78

Programmability looks like a one-time cost

  • Navigating the design space
    • Difficult and time-consuming

  • Tricks of the trade
    • Trimming down the search space
  • But if you put in the hard work…
    • …you’ll eventually arrive at a design that works

WRONG! Programmability is not a one-time cost.

62 of 78

ZLog performance toss up on 2014 version of Ceph

  • Four ZLog implementations
  • Ceph release (2014)
  • Graph takeaways
    • Clear performance losers
    • Similar top performers (few %)
  • Our claim…
    • Select simpler implementation
    • Added complexity for no benefit
  • What is complexity?
    • Lines of code
    • Conceptual

[Chart: performance comparison of 4 designs, appends/sec, on Ceph (2014).]

63 of 78

Clear ZLog implementation choice in 2016

  • Same implementations
  • Same hardware / benchmark
  • Newer version of Ceph
  • Clear performance winner

Takeaway:

  • A reasonable choice in 2014 is a bad choice in 2016
  • Worst part: you could be happy with one without ever knowing about the other

[Charts: performance comparison of 4 designs, appends/sec, on Ceph 2014 vs. Ceph 2016.]

64 of 78

That is the state of programmable storage

  • Large design space
    • High cost of searching this space
  • Costs are difficult to predict
    • A simple upgrade can change the calculus!
  • Much harder than what we have presented
    • > 500 tunables/settings in Ceph
      • Not counting dependencies
    • Runs on a wide variety of hardware
  • No hope of migrating to a new system
    • There are no standards!


65 of 78

Ceph programmability 2010 to 2016


66 of 78

Ceph programmability usage since 2016


61%

67 of 78

Programmability is critical to the Ceph ecosystem

  • Ceph is currently undergoing a major redesign
    • Address performance issues
    • Next-generation hardware (e.g. persistent memories).

When asked about the fate of object classes given the redesign opportunity:


“cls [object classes] isn't going away... it's proven pretty important for all of RGW, RBD, and CephFS... It has proven extremely useful and it's also a clean way to incorporate logic during updates without slowing down the I/O pipeline (mostly!).”

-- Sage Weil, lead architect of Ceph

68 of 78

Popping up in different places

  • OpenStack Swift Storlets
    • Swift is a cloud-scale object storage system
    • Storlets allow developers to push code into system
  • Cloud functions / lambda
    • Similar conceptual ideas to programmability
    • Priming new generation of developers
  • Coming to a storage system near you
  • What can we do to deal with the issues?

[Logos: OpenStack Swift Storlets; Microsoft, Google, and Amazon cloud functions.]

69 of 78

Outline and contributions


70 of 78

Outline and contributions


71 of 78

Declarative storage

  • Design space was a late discovery
    • And operational pitfalls (e.g. simple upgrades!)
    • Immediately understood it to be a major concern


  • Automate parts of this process
    • Searching the design space
    • Generating implementations

[Diagram: query optimization and plan generation, driven by a cost model.]

  • Express interfaces declaratively
    • Eliminate need for storage system expertise
    • High-level abstractions across services / systems
  • Prototyping with the language
    • Formal underpinning, demonstrated across domains
    • Can express all of the CORFU semantics

72 of 78

In-vivo storage system development

  • Storage systems are high-availability; they’re “always-on”
  • Once systems are made to be more adaptable…
    • ...they’ll need to adapt while running
  • How will developers interact with the system?
    • Interfaces are developed like software
    • And inextricably linked to data
  • How will the system evolve?
    • Conflicting goals and user needs
    • Maintain common requirements like SLA
  • Observation
    • Well-defined points of change
    • Hardware and software changes
  • Offline optimization techniques


73 of 78

Outline and contributions


74 of 78

Outline and contributions


75 of 78

Recap and conclusions

  • Storage systems from scratch
    • Expensive and unreliable
    • Reproduce common sub-systems
  • Programmability surfaces sub-systems for reuse
  • Applicable across a wide range of application needs
    • Transactional, computation resources
    • Structured and graph data models
  • Challenges with programmability → new research goals
    • Large design space
    • Portability challenges
  • Introducing the next stage: declarative storage
    • Applying techniques from the database world to programmable components
    • This is where a large portion of the real value comes from
    • This is a large research area, and we’ve come very close to connecting it all :)


76 of 78

Future work


Storage system architecture

  • Internal organizations that increase programmability (e.g. request handling)
  • Expanding the search of system components
    • Queueing and batching

Developer assistance

  • Automatically generating cost models
    • Automated performance sweeps
    • Identify equivalence classes
  • Survey patterns across large set of examples found in real-world use cases

Declarative storage

  • End-to-end demonstration using CORFU specification and cost models
  • Non-volatile memories
    • “The CPU is the bottleneck”
    • Generating “less” code
  • Formalizing the interaction between components
    • Allow modeling and verification tools to be applied

77 of 78

Publications


HotStorage ’17

DeclStore: Layering is for the Faint of Heart

N. Watkins, M. Sevilla, I. Jimenez, K. Dahlgren, P. Alvaro, S. Finkelstein, and C. Maltzahn

EuroSys ’17

Malacology: A Programmable Storage System

M. Sevilla, N. Watkins, I. Jimenez, P. Alvaro, S. Finkelstein, J. LeFevre, and C. Maltzahn

HotStorage ’16

ZEA, A Data Management Approach for SMR

A. Manzanares, N. Watkins, C. Guyot, D. Le Moal, C. Maltzahn, and Z. Bandic

PDSW ’15

Automatic and Transparent I/O Optimization With Storage Integrated Application Runtime Support

N. Watkins, Z. Jia, G. Shipman, C. Maltzahn, A. Aiken, and P. McCormick

SC ’15

Mantle: A Programmable Metadata Load Balancer for the Ceph File System

M. Sevilla, N. Watkins, C. Maltzahn, I. Nassi, S. Brandt, S. Weil, G. Farnum, and S. Fineberg

BDMC ’13

In-Vivo Storage System Development

N. Watkins, C. Maltzahn, S. Brandt, I. Pye, and A. Manzanares

PDSW ’12

DataMods: Programmable File System Services

N. Watkins, C. Maltzahn, S. Brandt, A. Manzanares

SC ’11

SciHadoop: Array-based Query Processing in Hadoop

J. Buck, N. Watkins, J. Lefevre, K. Ioannidou, C. Maltzahn, N. Polyzotis, S. Brandt

DADC ’09

Abstract Storage: Moving File Format-specific Abstractions Into Petabyte-scale Storage Systems

J. Buck, N. Watkins, C. Maltzahn, S. Brandt

78 of 78

Thank you everyone

Committee: Carlos Maltzahn, Scott Brandt, and Peter Alvaro. Other amazing collaborators: Neoklis Polyzotis, Jeff LeFevre, Shel Finkelstein, Ike Nassi, Kleoni Ioannidou

Michael Sevilla, Ivo Jimenez, Joe Buck, Dimitris Skourtis, Adam Crume

Pat McCormick, Galen Shipman, John Bent, Gary Grider, Adam Manzanares, Kleoni Ioannidou, Jay Lofstead, Sage Weil, Anna Povzner, Greg Farnum
