
BTRFS Declustered Parity RAID For Zoned Devices

Johannes Thumshirn

System Software Group, WD Research

12 May, 2022

© 2021 Western Digital Corporation or its affiliates. All rights reserved.


Outline

  • Background
    • Btrfs Overview
    • Zoned Devices
    • ZONE APPEND Write Operations
    • Btrfs On Zoned Devices
  • Problem Statement
    • Lessons Learned From RAID5/6
  • Proposed Changes
    • Distribute Data Placement
    • Journaling
    • Configurable Parity Algorithm

  • Design Background
    • Distributed Data Placement
    • RAID Stripe Tree
  • Current Status
    • Outlook
    • Screenshots


Background

Refresher Of BTRFS And Zoned Storage


Btrfs Overview

  • Copy-on-Write Filesystem
    • Based on CoW B-Trees
    • Snapshots
    • Subvolumes
  • Additional Features
    • Transparent data compression
      • lzo, zlib or zstd
    • Checksums for data and metadata
      • crc32c, xxhash64, sha256, blake2b
    • Built-in multi device support (RAID)
      • RAID 0, RAID 1, RAID 10, RAID 5, RAID 6
    • Incremental backups with send/receive
      • Send stream of changes between two subvolume snapshots

What’s btrfs?


Zoned Block Devices

  • Most commonly found today in the form of SMR hard-disks (Shingled Magnetic Recording) or ZNS SSDs
    • Defined in SCSI ZBC, ATA ZAC and NVMe ZNS
  • LBA range divided into zones
    • Conventional zones
      • Accept random writes
    • Sequential write required zones
      • Writes must be issued sequentially starting from the “write pointer”
      • Zones must be reset before rewriting
        • “rewind” write pointer to beginning of the zone
  • Users of zoned devices must be aware of the sequential write rule
    • The device fails any write command that does not start at the zone's write pointer (a zone-report/reset sketch follows the figure)

What’s ZBC, ZAC And ZNS?

[Figure: the device LBA range is divided into zones (Zone 0 … Zone X). WRITE commands advance each zone's write pointer; a ZONE RESET command rewinds the write pointer to the start of the zone.]
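To make the zone model concrete, here is a minimal user-space sketch (not part of the deck) that reports the first few zones of a device and resets one of them, using the Linux BLKREPORTZONE and BLKRESETZONE ioctls from <linux/blkzoned.h>. The device path is only an example and most error handling is trimmed.

/* Minimal sketch: report a zoned device's zones and reset one of them. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

int main(int argc, char **argv)
{
	const char *dev = argc > 1 ? argv[1] : "/dev/sdb";	/* example path */
	unsigned int nr_zones = 8;
	struct blk_zone_report *report;
	int fd;

	fd = open(dev, O_RDWR);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* Ask for the first 8 zones, starting at LBA 0. */
	report = calloc(1, sizeof(*report) + nr_zones * sizeof(struct blk_zone));
	report->sector = 0;
	report->nr_zones = nr_zones;

	if (ioctl(fd, BLKREPORTZONE, report) < 0) {
		perror("BLKREPORTZONE");
		return 1;
	}

	for (unsigned int i = 0; i < report->nr_zones; i++) {
		struct blk_zone *z = &report->zones[i];

		printf("zone %u: start %llu len %llu wp %llu type %u\n", i,
		       (unsigned long long)z->start, (unsigned long long)z->len,
		       (unsigned long long)z->wp, z->type);
	}

	/* Rewind the write pointer of the second zone so it can be rewritten.
	 * Only meaningful for sequential (non-conventional) zones. */
	if (report->nr_zones > 1 &&
	    report->zones[1].type != BLK_ZONE_TYPE_CONVENTIONAL) {
		struct blk_zone_range range = {
			.sector = report->zones[1].start,
			.nr_sectors = report->zones[1].len,
		};

		if (ioctl(fd, BLKRESETZONE, &range) < 0)
			perror("BLKRESETZONE");
	}

	free(report);
	close(fd);
	return 0;
}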


ZONE APPEND Write Operations

  • A ZONE APPEND write operation only specifies the target zone
    • The device automatically writes the data at the zone's current write pointer position
    • The LBA of the first written block is returned to the host with the command completion
  • The ZONE APPEND command is not defined in the ZBC (SCSI) and ZAC (ATA) standards
    • Emulated in the SCSI disk driver since kernel version 5.8
  • With zone append, writes to a zone can be delivered in any order without failing
    • The user must however be ready to handle out-of-order completions (a kernel-side sketch follows the figure)

Introduced with NVMe Zoned Namespace (ZNS) SSDs

[Figure: three writes A (4 KiB), B (8 KiB) and C (16 KiB). With regular writes at queue depth 1, each write lands exactly at the current write pointer (WP0, WP1, WP2). With ZONE APPEND at queue depth 3, the device places A, B and C itself, possibly out of submission order, and reports the written LBA back with each completion.]

Btrfs On Zoned Block Devices

  • Basic support merged with kernel v5.11
    • Log structured super block
      • Superblock is the only fixed location data structure in btrfs
    • Align block groups to zones
    • Zoned extent allocator
      • Append only allocation to avoid random writes
  • Fully functional since kernel v5.12
    • Use ZONE APPEND for data writes
    • Not yet completely on par with regular BTRFS features
      • No NOCOW
      • No fallocate(2)
      • No RAID yet

  • NVMe ZNS support since kernel v5.16
    • Zone capacity smaller than zone size
    • Respecting queue_max_active_zones() limits
  • Currently in stabilization phase
    • Automatic zone reclaim merged in v5.13
      • Greedy GC in v5.16
      • Only reclaim on x%-full filesystems in v5.17
    • Bug fixes for corner cases
      • max_active_zones starvation in v5.19

What we’ve done


Problem Statement

What's The Problem With RAID On (Zoned) BTRFS?


Problem Statement

  • Disconnect between the "file extent layer" and the "RAID layer"
    • Sub-stripe-length updates are done in place
      • RAID write hole
      • Not possible on zoned btrfs
    • CoW needs to know about RAID and vice versa
    • Needs to work with "nocow" files/filesystems as well

Lessons Learned From Btrfs RAID5/6


Problem Statement

  • Implicit data placement
    • Each per-disk sub-stripe has the same offset from the chunk start
  • Doesn't work on a zoned filesystem (even for RAID 1)
    • Multiple writes to different drives can race
      • No explicit write position with the zone append command: the drive decides

Lessons Learned From Btrfs RAID

[Figure: deterministic placement (sub-stripes D1 and D2 land at the same offset on both drives) vs. non-deterministic placement (with ZONE APPEND, D1 and D2 can land at different offsets on each drive).]


Problem Statement

  • RAID Rebuild Stress
    • RAID 5 can only tolerate one missing drive, RAID 6 two
    • High stress on the remaining drives during rebuild
    • Increased chance of another disk dying during the rebuild
  • Inflexible Encoding Scheme
    • XOR for RAID 5 (P stripe)
    • XOR and shift for RAID 6 (Q stripe); a stand-alone parity sketch follows below

Lessons Learned From RAID
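For reference, a stand-alone sketch of the two encodings: P is a plain XOR of the data blocks, while Q is a Horner-style evaluation over GF(2^8) as used by classic RAID 6. This is illustrative only; the in-kernel implementation lives in the shared raid6 helpers.

#include <stdint.h>
#include <stddef.h>

/* Multiply a GF(2^8) element by the generator x, reducing modulo
 * x^8 + x^4 + x^3 + x^2 + 1 (0x11d). */
static uint8_t gf2_mul2(uint8_t v)
{
	return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0x00));
}

/*
 * Compute the P stripe (plain XOR) and a RAID6-style Q stripe (Horner
 * evaluation over GF(2^8)) for ndisks data buffers of len bytes each.
 */
static void compute_pq(uint8_t **data, int ndisks, size_t len,
		       uint8_t *p, uint8_t *q)
{
	for (size_t i = 0; i < len; i++) {
		uint8_t pv = data[ndisks - 1][i];
		uint8_t qv = data[ndisks - 1][i];

		for (int d = ndisks - 2; d >= 0; d--) {
			pv ^= data[d][i];		/* P: running XOR */
			qv = gf2_mul2(qv) ^ data[d][i];	/* Q: Horner step */
		}
		p[i] = pv;
		q[i] = qv;
	}
}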


Proposed Changes

How To Fix These Problems


Proposed Changes

  • Distribute Data Placement
    • Similar to what BTRFS RAID 1 already does
    • Less pressure on single disks in recovery
  • Copy-on-Write to circumvent the write hole
    • Introduce a RAID Stripe Tree
    • Write the data first, then the metadata describing the stripe
    • Allows us to use REQ_OP_ZONE_APPEND for zoned data writes
  • Configurable Parity Algorithm
    • None (RAID 0/1)
    • XOR / P-Q stripe (RAID 5/6)
    • Erasure codes: Reed-Solomon or other MDS codes (more than two parity blocks)

How to fix these problems


Design Background


Design Background

  • Traditional RAID6 (2D+2P)
    • Dataset + parity is striped across all disks

Distributed Data Placement

[Figure: RAID 6 volume (2D+2P). A file's blocks D0–D3 plus the P and Q parity are striped across all disks of the volume; every stripe spans every disk.]
© 2021 Western Digital Corporation or its affiliates. All rights reserved.

16 of 32

Design Background

  • Traditional RAID6 (2D+2P)
    • Dataset + parity is striped across all disks

  • Declustered RAID (2D+2P)
    • Dataset + parity is distributed among a subset of disks (see the placement sketch after the figure)

Distributed Data Placement

[Figure: a traditional RAID 6 volume (2D+2P) stripes the file's blocks D0–D3 and the P/Q parity across all of its disks, while a declustered-parity volume (2D+2P over 8 disks) places each stripe on a different subset of the eight disks.]
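As a sketch of what "distributed among a subset of disks" could look like in code: each stripe picks its own members out of the full device pool, so rebuild reads are spread over many disks instead of hitting every survivor for every stripe. The rotation below is deliberately simple and purely illustrative; real declustered layouts typically use a pseudo-random but reproducible permutation.

#include <stdint.h>

/*
 * Illustrative only: choose which 'stripe_width' devices (e.g. 2 data + 2
 * parity = 4) out of 'ndevs' hold a given stripe.  A simple rotation spreads
 * stripes, and therefore rebuild I/O, across the whole pool.
 */
static void declustered_stripe_devices(uint64_t stripe_nr, int ndevs,
				       int stripe_width, int *devs_out)
{
	int start = (int)(stripe_nr % (uint64_t)ndevs);

	for (int i = 0; i < stripe_width; i++)
		devs_out[i] = (start + i) % ndevs;
}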


Design Background

  • Can be seen as an inverse of the free space tree
    • Written after the data has reached the disks
    • Records the location (disk, LBA) of each sub-stripe
  • A kind of RAID "journal"
    • Removes the write hole (CoW)
    • Can be used for "nocow" as well
  • Logical-to-physical address translation
    • A logical (start, length) tuple maps to N (disk, start) tuples; see the lookup sketch below

RAID Stripe Tree
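A sketch of the translation the tree provides. The types and the flat array standing in for the B-tree are invented for illustration; only the idea, that one logical range fans out into N per-device locations, comes from the slides.

#include <stdint.h>
#include <stddef.h>

/* Simplified, illustrative types -- not the on-disk format. */
struct stripe_extent {
	uint64_t devid;		/* device the sub-stripe lives on */
	uint64_t physical;	/* physical start address on that device */
};

struct stripe_entry {
	uint64_t logical;	/* key: logical start of the file extent */
	uint64_t length;	/* key: length of the file extent */
	int nr_extents;		/* N sub-stripes (data + parity) */
	struct stripe_extent extents[8];
};

/*
 * Translate a logical (start, length) range into its N (disk, start) tuples
 * by scanning a sorted array that stands in for the RAID stripe tree.
 */
static const struct stripe_entry *
stripe_tree_lookup(const struct stripe_entry *tree, size_t nr,
		   uint64_t logical, uint64_t length)
{
	for (size_t i = 0; i < nr; i++) {
		if (logical >= tree[i].logical &&
		    logical + length <= tree[i].logical + tree[i].length)
			return &tree[i];
	}
	return NULL;	/* no stripe covers this range */
}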


Design Background

  • Translate logical to physical addresses (3D + 2P)

RAID Stripe Tree

  Logical space → physical space for the file extent at 0–3M:

    Stripe Extent 0–1M       → Disk 0, 256M  + 1M
    Stripe Extent 1–2M       → Disk 4, 128M  + 1M
    Stripe Extent 2–3M       → Disk 3, 1024M + 1M
    Stripe Extent (P parity) → Disk 6, 512M  + 1M
    Stripe Extent (Q parity) → Disk 7, 2048M + 1M

  A second file extent at 24M–27M is translated the same way: the file's logical block group maps onto physical ranges spread over the individual devices.

Design Background

  • Keyed by (logical, length)
  • Additional per-file-extent space consumption
    • N * 16 bytes
  • Example: 3D + 2P RAID
    • 5 * 16 bytes = 80 bytes of stripe tree entries
    • ≈ 51 entries per 4 KiB sector
  • Might not yet be the final version
    • Only RAID 1 is implemented at the moment; striping might require changes

RAID Stripe Tree

struct btrfs_key {
	.objectid = file_extent_logical,
	.type = BTRFS_RAID_STRIPE_EXTENT,
	.offset = file_extent_length,
};

struct btrfs_stripe_extent {
	/* btrfs device-id this raid extent lives on */
	__le64 devid;
	/* physical start address on the device */
	__le64 physical;
} __attribute__ ((__packed__));

struct btrfs_dp_stripe {
	/* array of RAID stripe extents this stripe is comprised of */
	struct btrfs_stripe_extent extents[];
} __attribute__ ((__packed__));


Design Background

struct btrfs_file_extent_item {
	__le64 generation;
	__le64 ram_bytes;
	__u8 compression;
	__u8 encryption;
	__le16 other_encoding;
	__u8 type;
	__le64 disk_bytenr;
	__le64 disk_num_bytes;
	__le64 offset;
	__le64 num_bytes;
} __attribute__ ((__packed__));

RAID Stripe Tree

struct btrfs_key {
	.objectid = file_extent_logical,
	.type = BTRFS_RAID_STRIPE_EXTENT,
	.offset = file_extent_length,
};

struct btrfs_stripe_extent {
	/* btrfs device-id this raid extent lives on */
	__le64 devid;
	/* physical start address on the device */
	__le64 physical;
} __attribute__ ((__packed__));

struct btrfs_dp_stripe {
	/* array of RAID stripe extents this stripe is comprised of */
	struct btrfs_stripe_extent extents[];
} __attribute__ ((__packed__));
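How the two structures tie together on the read path is only sketched here. Assuming that "file_extent_logical" in the key above means the extent's logical address (disk_bytenr) and "file_extent_length" its on-disk size, the stripe tree search key for a file extent could be built roughly like this (illustrative kernel-style code, not existing btrfs API):

/* Illustrative sketch: build the RAID stripe tree search key for a file
 * extent.  Assumes the key's objectid is the extent's logical address and
 * its offset the on-disk length, as on the previous slide. */
static void stripe_tree_key_for_extent(const struct btrfs_file_extent_item *fi,
				       struct btrfs_key *key)
{
	key->objectid = le64_to_cpu(fi->disk_bytenr);
	key->type = BTRFS_RAID_STRIPE_EXTENT;
	key->offset = le64_to_cpu(fi->disk_num_bytes);
}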


Design Background

Advantages

  • Address translation
    • Scrub friendly
  • RAID Journal
    • Ordered updates
    • Similar to how checksums are handled
  • No implicit connection needed
    • REQ_OP_ZONE_APPEND compatible
  • Stronger reliability against device faults
    • With M+K erasure codes, K (the number of parity blocks) can be higher than RAID 6's two

Disadvantages

  • Additional Metadata
    • Especially if we also have to create stripe tree entries for metadata
    • Merge consecutive and sequential on-disk stripe extents?

RAID Stripe Tree


Design Background

  • Generates parity or erasure-coding information
  • Similar to how we handle compression
    • Do the math on data read/write
  • But different from how we handle compression
    • Doesn't modify the actual data, it only adds parity alongside it (see the sketch below)

Configurable Parity Algorithm
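An illustrative sketch of what a configurable parity backend could look like: a small ops table the RAID layer calls on write (encode) and on degraded read (rebuild). All names here are invented for this example; nothing in it is existing btrfs API.

#include <stdint.h>
#include <stddef.h>

struct parity_ops {
	const char *name;	/* "none", "xor", "pq", "rs(10,4)", ... */
	int nr_parity;		/* 0 for RAID 0/1, 1 for P, 2 for P+Q, k for EC */

	/* Compute nr_parity parity buffers from ndata data buffers. */
	void (*encode)(uint8_t * const *data, int ndata, size_t len,
		       uint8_t * const *parity);

	/* Reconstruct the buffers listed in 'lost' from the survivors. */
	int (*rebuild)(uint8_t * const *data, int ndata, size_t len,
		       uint8_t * const *parity, const int *lost, int nr_lost);
};

On the write path the selected encode() would run on the stripe right before submission, much like compression runs before the data is written, except that the data buffers themselves are left untouched.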


Current Status

Where are we at the moment?


Current Status

  • Data RAID1 implemented
    • Easiest to do
    • Metadata doesn’t use REQ_OP_ZONE_APPEND
      • Already working out-of-the-box
    • Data writes are recorded in raid-stripe-tree
  • Readahead still gives me trouble
    • Read batching breaks the 1:1 relation of 'struct btrfs_file_extent_item' and 'struct btrfs_dp_stripe'
    • Once that's solved I'll send an RFC to linux-btrfs@vger.kernel.org for 'design review'
  • RAID0 will be next
    • RAID10 will then be a solved problem, I hope

Where are we at the moment?


Current Status

  • Boilerplate mkfs creating an FS with an empty RAID stripe tree


Current Status

  • Tree-dump


Current Status

  • Fsck On A Non-Empty RAID Filesystem


Thanks

Questions?

