BTRFS Declustered Parity RAID For Zoned Devices
Johannes Thumshirn
System Software Group, WD Research
12 May, 2022
© 2021 Western Digital Corporation or its affiliates. All rights reserved.
Outline
Background
Refresher Of BTRFS And Zoned Storage
Btrfs Overview
What’s btrfs?
Zoned Block Devices
What’s ZBC, ZAC And ZNS?
Device LBA range divided in zones (Zone 0, Zone 1, Zone 2, Zone 3, ..., Zone X)
Each zone has a write pointer position
WRITE commands advance the write pointer
ZONE RESET command rewinds the write pointer
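The write pointer rules above can be sketched as a small in-memory model. This is only an illustration: `struct zone_model`, `zone_write()` and `zone_reset()` are hypothetical names, not a kernel or device API, and the model assumes a sequential-write-required zone where a write must start exactly at the write pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one sequential-write-required zone. */
struct zone_model {
	uint64_t start;	/* first LBA of the zone */
	uint64_t len;	/* zone size in LBAs */
	uint64_t wp;	/* write pointer, absolute LBA */
};

/* A write succeeds only if it starts at the write pointer and fits. */
static bool zone_write(struct zone_model *z, uint64_t lba, uint64_t nr)
{
	assert(z != NULL);
	if (lba != z->wp || nr > z->start + z->len - z->wp)
		return false;
	z->wp += nr;	/* WRITE advances the write pointer */
	return true;
}

/* ZONE RESET rewinds the write pointer to the start of the zone. */
static void zone_reset(struct zone_model *z)
{
	assert(z != NULL);
	z->wp = z->start;
}
```

In this model a write to any LBA other than `wp` is rejected, which is why in-place overwrites are not possible inside a zone without resetting it first.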
ZONE APPEND Write Operations
Introduced with NVMe Zoned Namespace (ZNS) SSDs
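Three writes to one zone: A: 4K (Write0), B: 8K (Write1), C: 16K (Write2)
Regular Write, Queue Depth = 1: data lands in submission order A, B, C; the write pointer moves to WP0 (after W0), WP1 (after W1), WP2 (after W2)
Zone Append, Queue Depth = 3: all three writes are in flight at once and the data may land in a different order, e.g. A, C, B; WP is shown after all writes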
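The difference to a regular write can be sketched with a small model. This is a minimal illustration, not a device or kernel API: `struct zone` and `zone_append()` are hypothetical names, and sizes are counted in 4K blocks. The caller names only the zone; the location is chosen at the write pointer and reported back.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a zone, counted in 4K blocks. */
struct zone {
	uint64_t start;	/* first block of the zone */
	uint64_t len;	/* zone capacity in blocks */
	uint64_t wp;	/* write pointer, absolute block number */
};

/*
 * Zone append: the data is placed at the current write pointer and the
 * chosen block is returned to the caller. Returns -1 if it does not fit.
 */
static int64_t zone_append(struct zone *z, uint64_t nr_blocks)
{
	uint64_t written_at;

	assert(z != NULL);
	if (nr_blocks > z->start + z->len - z->wp)
		return -1;
	written_at = z->wp;
	z->wp += nr_blocks;
	return (int64_t)written_at;
}
```

If the three writes from the slide are processed in the order A, C, B, this model places A at block 0, C at block 1 and B at block 5, and the write pointer ends at block 7. The submitter only learns these locations on completion, which is why the resulting placement is not known in advance.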
Btrfs On Zoned Block Devices
What we’ve done
Problem Statement
What’s The Problem With RAID On (Zoned) BTRFS
Problem Statement
Lessons Learned From Btrfs RAID5/6
Problem Statement
Lessons Learned From Btrfs RAID
Deterministic Placement vs. Non-deterministic Placement
[Figure: placement of the copies of D1 and D2 under each scheme]
Problem Statement
Lessons Learned From RAID
Proposed Changes
How To Fix These Problems
Proposed Changes
How to fix these problems
Design Background
Design Background
Distributed Data Placement
RAID 6 volume (2D+2P)
[Figure: a file's blocks D0, D1, D2, D3 with parity P and Q, laid out as a stripe across the volume]
Design Background
Distributed Data Placement
DP volume (2D+2P over 8 disks) vs. RAID 6 volume (2D+2P)
[Figure: files with blocks D0, D1, D2, D3 and parity P and Q; in the RAID 6 volume every stripe spans the whole volume, in the DP volume the stripes are spread over the 8 disks]
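The idea of a stripe that is narrower than the device pool can be sketched as follows. This is an illustration only: the slides do not say how devices are chosen, so the round-robin rule below is an assumption, and `pick_stripe_devices()` is a hypothetical helper, not btrfs code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Illustrative only: pick `width` distinct devices out of `ndevs` for
 * stripe number `stripe_nr` by walking the device list round-robin.
 * The chosen device indices are stored in out[0..width-1].
 */
static void pick_stripe_devices(unsigned int stripe_nr, unsigned int ndevs,
				unsigned int width, unsigned int *out)
{
	unsigned int first;
	unsigned int i;

	assert(out != NULL);
	assert(ndevs > 0 && width <= ndevs);

	first = (stripe_nr * width) % ndevs;
	for (i = 0; i < width; i++)
		out[i] = (first + i) % ndevs;
}
```

With 8 devices and a stripe width of 4 (2D+2P), stripe 0 lands on devices 0 to 3 and stripe 1 on devices 4 to 7, so consecutive stripes use different disks instead of all stripes sharing the same set.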
Design Background
RAID Stripe Tree
Design Background
RAID Stripe Tree
File: File Extent 0-3M, File Extent 24M-27M, …
Logical Space: Block Group
Stripe extents of File Extent 0-3M:
  Stripe Extent 0-1M: Disk 0, 256M + 1M
  Stripe Extent 1-2M: Disk 4, 128M + 1M
  Stripe Extent 2-3M: Disk 3, 1024M + 1M
  Stripe Extent P-Parity: Disk 6, 512M + 1M
  Stripe Extent Q-Parity: Disk 7, 2048M + 1M
Physical Space: Device, Device, Device, Device, Device
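The mapping in the diagram can be sketched as a lookup from an offset inside the file extent to a device and a physical address. The sketch reads "256M + 1M" as "starts at 256M, 1M long", which is an assumption, and `struct stripe_map` and `map_offset()` are hypothetical names used only for illustration. Only the three data stripe extents are modelled, not P and Q.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MiB (1024ULL * 1024ULL)

/* Hypothetical in-memory copy of one data stripe extent. */
struct stripe_map {
	uint64_t devid;		/* device the piece lives on */
	uint64_t physical;	/* start address on that device */
};

/* Data pieces of "File Extent 0-3M" from the diagram, 1M each. */
static const struct stripe_map extent_0_3m[] = {
	{ 0,  256 * MiB },
	{ 4,  128 * MiB },
	{ 3, 1024 * MiB },
};

/*
 * Translate an offset inside the file extent to a device and address.
 * Returns 0 on success, -1 if the offset is outside the extent.
 */
static int map_offset(uint64_t off, uint64_t *devid, uint64_t *physical)
{
	uint64_t idx = off / MiB;

	assert(devid != NULL && physical != NULL);
	if (idx >= sizeof(extent_0_3m) / sizeof(extent_0_3m[0]))
		return -1;
	*devid = extent_0_3m[idx].devid;
	*physical = extent_0_3m[idx].physical + off % MiB;
	return 0;
}
```

An offset of 1M + 4K, for example, falls into the second stripe extent and resolves to Disk 4 at 128M + 4K. Because every piece carries its own device and address, the pieces of one file extent do not need to sit at matching offsets on their devices.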
Design Background
RAID Stripe Tree
/* key of a stripe item in the RAID stripe tree */
struct btrfs_key key = {
	.objectid = file_extent_logical,
	.type = BTRFS_RAID_STRIPE_EXTENT,
	.offset = file_extent_length,
};

/* one per-device piece of a stripe */
struct btrfs_stripe_extent {
	/* btrfs device-id this raid extent lives on */
	__le64 devid;
	/* physical start address on the device */
	__le64 physical;
} __attribute__ ((__packed__));

/* item stored under the key above */
struct btrfs_dp_stripe {
	/* array of RAID stripe extents this stripe is comprised of */
	struct btrfs_stripe_extent extents[];
} __attribute__ ((__packed__));
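As a rough illustration of what this layout costs per stripe, the item payload can be sized from the number of stripe extents. This is a userspace sketch under assumptions: `uint64_t` stands in for `__le64` (endianness is ignored), `struct stripe_extent` and `dp_stripe_item_size()` are hypothetical names, and `__attribute__((__packed__))` and `_Static_assert` assume GCC or Clang in C11 mode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-in for the on-disk layout; uint64_t replaces __le64. */
struct stripe_extent {
	uint64_t devid;		/* device the piece lives on */
	uint64_t physical;	/* physical start address on the device */
} __attribute__((__packed__));

/* Two packed 64-bit fields: 16 bytes per stripe extent. */
_Static_assert(sizeof(struct stripe_extent) == 16,
	       "unexpected stripe extent size");

/* Payload size of one stripe item that references nr_extents pieces. */
static size_t dp_stripe_item_size(unsigned int nr_extents)
{
	return (size_t)nr_extents * sizeof(struct stripe_extent);
}
```

For the five stripe extents of the earlier diagram (three data, P and Q) this gives 5 * 16 = 80 bytes of payload, not counting the key or any item header.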
Design Background
RAID Stripe Tree
struct btrfs_file_extent_item {
	__le64 generation;	/* transaction id that created this extent */
	__le64 ram_bytes;	/* upper limit on the uncompressed size */
	__u8 compression;	/* compression type, 0 if none */
	__u8 encryption;	/* encryption type, 0 if none */
	__le16 other_encoding;	/* reserved for other encodings */
	__u8 type;		/* inline, regular or preallocated */
	__le64 disk_bytenr;	/* logical start address of the extent */
	__le64 disk_num_bytes;	/* size of the extent on disk */
	__le64 offset;		/* offset into the extent this item refers to */
	__le64 num_bytes;	/* number of file bytes this item covers */
} __attribute__ ((__packed__));
Design Background
RAID Stripe Tree
Advantages
Disadvantages
Design Background
Configurable Parity Algorithm
Current Status
Where are we at the moment?
Current Status
Thanks
Questions?