1 of 44

Storage devices

Lars Ailo Bongo (larsab@cs.uit.no)

inf-2201 Spring 2021

06.04.21

2 of 44

Big data sources

  • Human produced content
  • Machine produced content
  • Scientific instruments
  • ...

3 of 44

Data growth at EMBL-EBI

Source: Charles E. Cook et al. Nucl. Acids Res. 2016;44:D20-D26

  • DNA sequence data has doubled every 6-8 months over the last 3 years, and this trend looks set to continue for the rest of the decade

4 of 44

What is big data?

< 8 GB → < 1 TB → TBs → PBs

5 of 44

What is big data?

  • Billions of samples & few dimensions, or
  • Billions of samples & thousands of dimensions, or
  • Thousands of samples & thousands of dimensions

6 of 44

What is big data?

7 of 44

What is big data?

Computation time: < 100 ms → seconds → minutes → hours → weeks

8 of 44

Big data optimizations

  • R or Matlab implementation
  • Algorithm parameter tuning
  • C++/ Java / … implementation
  • Data structure optimization
  • Multi-threaded parallelization (single machine)
  • Distributed parallelization (multiple machines)

9 of 44

10 of 44

Overview

  • Big data
  • Magnetic disks
  • Disk arrays
  • Flash storage
  • DRAM storage
  • Storage hierarchy

11 of 44

Storage properties

  • Reliability
    • Archival
    • Reliable
    • Persistent
    • Temporal
  • Access pattern
    • Write or read intensive
    • Sequential or random access
  • Low-latency or high throughput
  • Cost
  • Power usage

12 of 44

Jiahua He, Arun Jagatheesan, Sandeep Gupta, Jeffrey Bennett, and Allan Snavely. "DASH: a Recipe for a Flash-based Data Intensive Supercomputer." Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC '10), pp. 1-11, 2010.

13 of 44

14 of 44

Punch cards

15 of 44

Magnetophone

16 of 44

Tape

17 of 44

18 of 44

Disk Sectors and Access

  • Each sector records
    • Sector ID
    • Data (512 bytes, 4096 bytes proposed)
    • Error correcting code (ECC)
      • Used to hide defects and recording errors
    • Synchronization fields and gaps
  • Access to a sector involves
    • Queuing delay if other accesses are pending
    • Seek: move the heads
    • Rotational latency
    • Data transfer
    • Controller overhead

Chapter 6 — Storage and Other I/O Topics — 18

19 of 44

A Typical Magnetic Disk Controller

  • External connection
    • Parallel ATA (aka IDE or EIDE), Serial ATA, SCSI, Serial Attached SCSI (SAS), Fibre Channel, FireWire, USB
  • Cache
    • Buffer data between disk and interface
  • Controller
    • Read/write operation
    • Cache replacement
    • Failure detection and recovery

20 of 44

Disk Caching

  • Method
    • Use DRAM to cache recently accessed blocks
      • Most disks have 32 MB
      • Some of the RAM space stores “firmware” (an embedded OS)
    • Blocks are replaced usually in an LRU order
  • Pros
    • Good for reads if accesses have locality
  • Cons
    • Cost
    • Need to deal with reliable writes
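The LRU replacement described above can be sketched in a few lines of Python (a toy model; real disks do this in controller firmware, and the tiny capacity in the usage below is just for illustration):

```python
from collections import OrderedDict

class DiskCache:
    """Toy LRU block cache, as a sketch of a disk controller's DRAM cache."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block number -> data, oldest first

    def read(self, block_no, read_from_disk):
        if block_no in self.blocks:            # cache hit
            self.blocks.move_to_end(block_no)  # mark most recently used
            return self.blocks[block_no]
        data = read_from_disk(block_no)        # cache miss: go to the platters
        self.blocks[block_no] = data
        if len(self.blocks) > self.capacity:   # evict least recently used
            self.blocks.popitem(last=False)
        return data
```

With a capacity of two blocks, reading blocks 1, 2, 1, 3 evicts block 2, so a later read of block 2 goes back to the disk.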

21 of 44

Kryder’s vs. Moore’s Law

22 of 44

50 Years Later (Mark Kryder at SNW 2006)

23 of 44

Hard drive teardown

24 of 44

2021 numbers?

25 of 44

Disk Performance (2TB disk)

  • Seek
    • Position heads over cylinder, typically 3.5-9.5 ms
  • Rotational delay
    • Wait for a sector to rotate underneath the heads
    • Typically 4–8 ms per full rotation (15,000–7,200 RPM); average delay is half a rotation
  • Transfer bytes
    • Transfer bandwidth is typically 40-138 Mbytes/sec
  • Performance of transferring 1 Kbyte
    • Seek (4 ms) + half rotational delay (2 ms) + transfer (0.016 ms)
    • Total time is ~6.02 ms, or about 166 Kbytes/sec (roughly 1/360 of 60 MB/sec)!
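The slide's arithmetic can be written out directly (parameter defaults follow the slide's example numbers; real disks vary):

```python
def disk_access_ms(size_kb, seek_ms=4.0, rotation_ms=4.0, bw_mb_s=60.0):
    """Average random-access time: seek + half a rotation + transfer."""
    transfer_ms = size_kb / 1024 / bw_mb_s * 1000  # KB -> MB, s -> ms
    return seek_ms + rotation_ms / 2 + transfer_ms

t = disk_access_ms(1)             # ~6.02 ms for a 1 KB access
throughput_kb_s = 1 / (t / 1000)  # ~166 KB/s effective
```

The effective ~166 KB/s is roughly 1/360 of the 60 MB/s transfer bandwidth, which is the point of the slide: small random accesses waste almost all of it.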

26 of 44

More on Performance

  • What transfer size can get 90% of the disk bandwidth?
    • Assume disk BW = 60 MB/sec, rotation = 2 ms, seek = 4 ms
    • BW × 90% = size / (size/BW + rotation + seek)
      ⇒ size = BW × (rotation + seek) × 0.9 / 0.1 = 60 MB/sec × 0.006 s × 9 = 3.24 MB
  • Seek and rotational times dominate the cost of small accesses
    • Disk transfer bandwidth is wasted
    • Need algorithms to reduce seek time
  • Speed depends on which sectors are accessed

Block Size      % of Disk Transfer Bandwidth
1 Kbyte         0.28%
1 Mbyte         73.99%
3.24 Mbytes     90%
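The 90% formula can be checked numerically. A small sketch, using the slide's 6 ms of combined seek + rotation overhead (the computed fractions come out close to, but not exactly, the table's rounded values):

```python
def bw_fraction(size_mb, bw_mb_s=60.0, overhead_s=0.006):
    """Fraction of raw disk bandwidth achieved by one access of size_mb MB."""
    transfer_s = size_mb / bw_mb_s
    return transfer_s / (transfer_s + overhead_s)

def size_for_fraction(frac, bw_mb_s=60.0, overhead_s=0.006):
    """Invert frac = (size/BW) / (size/BW + overhead) to get the size."""
    return bw_mb_s * overhead_s * frac / (1 - frac)
```

`size_for_fraction(0.9)` gives 3.24 MB, matching the slide's derivation; a 1 MB access gets about 74% of the bandwidth and a 1 KB access well under 1%.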

27 of 44

FCFS order

  • Method
    • First come first serve
  • Pros
    • Fairness among requests
    • In the order applications expect
  • Cons
    • Arrivals may be at random spots on the disk (long seeks)
    • Wild swings can happen

28 of 44

SSTF (Shortest Seek Time First)

  • Method
    • Pick the request closest on disk
    • Rotational delay is included in the calculation
  • Pros
    • Try to minimize seek time
  • Cons
    • Starvation
  • Question
    • Is SSTF optimal?
    • Can we avoid the starvation?

29 of 44

Elevator (SCAN)

  • Method
    • Take the closest request in the direction of travel
    • Real implementations do not go to the end (called LOOK)
  • Pros
    • Bounded time for each request
  • Cons
    • Request at the other end will take a while

30 of 44

C-SCAN (Circular SCAN)

  • Method
    • Like SCAN
    • But, wrap around
    • Real implementations do not go to the end (C-LOOK)
  • Pros
    • Uniform service time
  • Cons
    • Does nothing useful on the return sweep
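The schedulers on the preceding slides can be sketched as pure functions over cylinder numbers (a toy model: the request queue and head position in the usage below are invented for illustration, rotational delay is ignored, and `scan` implements the LOOK variant the slides mention):

```python
def fcfs(requests, head):
    """First come, first served: keep arrival order."""
    return list(requests)

def sstf(requests, head):
    """Shortest seek time first: repeatedly pick the closest cylinder."""
    pending, order = list(requests), []
    while pending:
        nxt = min(pending, key=lambda c: abs(c - head))
        pending.remove(nxt)
        order.append(nxt)
        head = nxt
    return order

def scan(requests, head, direction=1):
    """Elevator: serve requests in the direction of travel, then reverse.
    Turns around at the last request, i.e. the LOOK variant."""
    up = sorted(c for c in requests if c >= head)
    down = sorted((c for c in requests if c < head), reverse=True)
    return up + down if direction == 1 else down + up
```

For a queue [98, 183, 37, 122, 14, 124, 65, 67] with the head at cylinder 53, SSTF serves 65 first and postpones the far requests, while SCAN sweeps up to 183 before turning back for 37 and 14.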

31 of 44

Storage System

  • Network connected box with many disks
  • Entry level box has 12 disks
    • Mean time to failure?
    • How to improve reliability?
    • What if there are 1000 disks?

32 of 44

RAID (Redundant Array of Independent Disks)

  • More details in inf-2200 Memory lecture slides

33 of 44

Flash Storage

  • Nonvolatile semiconductor storage
    • 100× – 1000× faster than disk
    • Smaller, lower power, more robust
    • But more $/GB (between disk and DRAM)

34 of 44

Flash Memory

  • NOR
    • Byte addressable
    • Often used for BIOS
    • Much more expensive per bit than NAND
  • NAND
    • Dominant for consumer and enterprise devices
    • Single Level Cell (SLC) vs. Multi Level Cell (MLC):
      • SLC is more robust but expensive
      • MLC offers higher density and lower price

35 of 44

NAND Memory Organization

  • Organized into a set of erase blocks (EB)
  • Each erase block has a set of pages
  • Example configuration for a 512 MB NAND device:
    • 4096 EB’s, 64 pages per EB, 2112 bytes per page (2KB user data + 64 bytes metadata)
  • Read:
    • Random access on any page, multiple times
    • 25-60μs
  • Write
    • Data must be written sequentially to pages in an erase block
    • Entire page should be written for best reliability
    • 250-900 μs
  • Erase:
    • Entire erase block must be erased before re-writing
    • Up to 3.5ms
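A quick sanity check that the example configuration really adds up to 512 MB (counting only the 2 KB of user data per page, not the 64 metadata bytes):

```python
# Example NAND configuration from the slide: 4096 erase blocks,
# 64 pages per erase block, 2112-byte pages (2048 user + 64 metadata).
erase_blocks = 4096
pages_per_eb = 64
user_bytes_per_page = 2048

capacity_mb = erase_blocks * pages_per_eb * user_bytes_per_page // 2**20
# 4096 * 64 * 2048 bytes = 512 MB of user data
```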

36 of 44

Flash Translation Layer

  • Emulate a hard disk by exposing an array of blocks
  • Logical block mapping
    • Map from logical to physical blocks
    • Cannot do random writes
    • Granularity: block vs. page (read-modify-write per block vs. large RAM for storing the map table)
    • Hybrid approach often used: maintain a small set of log blocks for page level mapping
  • Garbage collection
    • Maintain an allocation pool of clean blocks (blocks must be erased before reuse)
    • Recycle invalidated pages
    • Merge blocks if page level mapping
  • Wear leveling
    • Most writes are to a few pages (metadata)
    • Even out writes over blocks
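The out-of-place writes behind logical block mapping and garbage collection can be sketched with a toy page-level FTL (an illustration only: no real garbage collection, wear leveling, or hybrid log blocks):

```python
class TinyFTL:
    """Toy page-level flash translation layer: every overwrite goes to a
    fresh physical page, and the stale page is marked for reclamation."""

    def __init__(self, n_blocks, pages_per_block):
        self.free_pages = list(range(n_blocks * pages_per_block))
        self.map = {}         # logical page -> physical page
        self.invalid = set()  # physical pages holding stale data

    def write(self, logical_page, data, flash):
        # NAND cannot overwrite in place: allocate a clean page and
        # invalidate the old mapping (GC would erase it later).
        if logical_page in self.map:
            self.invalid.add(self.map[logical_page])
        phys = self.free_pages.pop(0)
        flash[phys] = data
        self.map[logical_page] = phys

    def read(self, logical_page, flash):
        return flash[self.map[logical_page]]
```

Overwriting logical page 0 lands in a new physical page and marks the old one invalid; a garbage collector would later erase blocks whose pages are all invalid and return them to the clean-block pool.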

37 of 44

Hardware/Software Architecture

[Figure: hardware/software storage stacks for an HD (SATA), an SSD (SATA), and an SSD (PCI-E). HW: each device and its controller, attached over SATA or PCI-E. SW (OS): Virtual File System, File System, Generic Block Layer, I/O Scheduler, Block Device Driver, Device Driver; the PCI-E SSD stack uses a Flash Storage Layer.]

38 of 44

39 of 44

Non-volatile DRAM (NVRAM)

  • Battery backed DRAM
    • Backup power during power-out
    • Ordinary DRAM technology
  • One part of a storage system
  • Expensive
  • Targeted at specific application domains such as databases

(Netlist Nvvault)

40 of 44

Remote Direct Memory Access

41 of 44

Low latency remote memory access

42 of 44

43 of 44

10,000-Year Storage System

  • How to build a storage system with a mean time to failure of 10,000 years?

  • Fun read
  • Uses many basic storage-system principles
  • I/O performance is not great

44 of 44

Summary

  • Disk areal density has followed a Moore's-law-like curve
  • Large disk blocks are needed to achieve good throughput
  • The OS needs to perform disk scheduling
  • RAID improves reliability and throughput, at a cost
  • Careful design is needed to deal with disk failures
  • Flash memory has emerged at both the low and high end
  • DRAM-only storage systems are emerging