Shared L2ARC
Christian Schwarz
OpenZFS Developer Summit 2022
Context: Nutanix Files
Software-defined scale-out file storage.
Core functionality: NFS, SMB, Multi-Protocol shares.
= ZFS dataset(s) spread over many zpools.
Compute (ZFS, protocols) in VMs, on Nutanix HCI.
Storage for zpools in vDisks provided by Nutanix HCI. Access via iSCSI.
Each zpool is imported in one VM at a given time.
zpools move between VMs for HA & load balancing. Cheap because data does not move.
ZFS’s role: provide POSIX-compliant filesystem with enterprise features.
Project Files Extended Buffer Cache
Goal: accelerate read-heavy workloads whose working set is >> ARC size.
Plan: use the host-local NVMe devices as an L2ARC cache.
Problems:
Adding a host-local device as L2ARC would prevent moving zpools between VMs.
⇒ Want to avoid L2ARC vdev add/remove when moving zpools.
L2ARC is per-zpool, but we can’t predict which share/zpool will need acceleration.
⇒ Would need to partition the host-local device ⇒ under-utilization if only 1 pool is hot.
How Does L2ARC Work, Anyways?
ARC maps (spa_load_guid, dva, txg) to ARC headers (arc_buf_hdr_t).
Header points to storage location of cached data in L1 and/or L2.
L1: pointer to DRAM buffer.
L2: pointer to cache vdev + offset.
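For illustration, a tiny self-contained C model of that mapping; the type and field names only loosely mirror OpenZFS (arc_buf_hdr_t, dva_t, spa_load_guid), and the linear scan stands in for the real hash-table lookup in arc.c.

/* Simplified, standalone model of the ARC index described above.
 * Not the real OpenZFS layout; it only shows the key -> header -> location idea. */
#include <stdint.h>
#include <stddef.h>

typedef struct dva_model { uint64_t word[2]; } dva_model_t;  /* block address (vdev + offset) */

typedef struct arc_hdr_model {
    /* lookup key */
    uint64_t spa_load_guid;   /* which imported pool the block belongs to */
    dva_model_t dva;          /* where the block lives in that pool */
    uint64_t birth_txg;       /* birth transaction group */

    /* storage locations of the cached copy */
    void *l1_dram_buf;        /* L1: pointer to DRAM buffer, NULL once evicted */
    void *l2_cache_vdev;      /* L2: cache vdev ... */
    uint64_t l2_offset;       /* ... plus offset, valid only if l2_cache_vdev != NULL */
} arc_hdr_model_t;

/* Linear stand-in for the ARC's hash-table lookup, keyed the same way. */
static arc_hdr_model_t *
arc_lookup(arc_hdr_model_t *tab, size_t n,
    uint64_t guid, dva_model_t dva, uint64_t txg)
{
    for (size_t i = 0; i < n; i++) {
        if (tab[i].spa_load_guid == guid &&
            tab[i].dva.word[0] == dva.word[0] &&
            tab[i].dva.word[1] == dva.word[1] &&
            tab[i].birth_txg == txg)
            return (&tab[i]);
    }
    return (NULL);
}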
How Does L2ARC Work, Anyways?
L2ARC feed thread iterates over L1 buffers that are eviction candidates (= tail of MRU/MFU lists).
If an eviction candidate is from a zpool with cache devices:
1. write the L1 buffer to the cache dev
2. store the offset in the ARC header
Upon eviction from L1, the header with the offset remains in DRAM!
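A rough, self-contained C sketch of that feed step under upstream semantics; the types and the toy write function are invented for illustration, the real logic lives in the L2ARC feed code in arc.c.

/* Toy model of one feed step for an eviction candidate, as described above.
 * Upstream rule: only a cache device of the buffer's own pool is eligible. */
#include <stdint.h>
#include <stddef.h>

typedef struct cache_dev { uint64_t pool_guid; uint64_t write_hand; } cache_dev_t;

typedef struct hdr {
    uint64_t pool_guid;       /* pool the buffer belongs to */
    void *l1_buf;             /* DRAM copy (still resident, it is only a candidate) */
    cache_dev_t *l2_dev;      /* set once the buffer has an L2 copy */
    uint64_t l2_offset;
} hdr_t;

/* Pretend I/O: "write" the buffer at the device's current write hand. */
static uint64_t
cache_dev_write(cache_dev_t *dev, const void *buf, uint64_t size)
{
    (void)buf;                /* a real implementation would issue a write zio here */
    uint64_t off = dev->write_hand;
    dev->write_hand += size;
    return (off);
}

static void
feed_one(hdr_t *cand, cache_dev_t *devs, size_t ndevs, uint64_t bufsize)
{
    for (size_t i = 0; i < ndevs; i++) {
        if (devs[i].pool_guid != cand->pool_guid)
            continue;                               /* not this buffer's pool: skip */
        /* 1. write the L1 buffer to the cache dev */
        uint64_t off = cache_dev_write(&devs[i], cand->l1_buf, bufsize);
        /* 2. store the location in the ARC header; the header (not the data)
         *    stays in DRAM even after the L1 buffer is evicted */
        cand->l2_dev = &devs[i];
        cand->l2_offset = off;
        return;
    }
    /* no eligible cache device: the buffer is later evicted without an L2 copy */
}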
Upstream L2ARC With Multiple Zpools
Each zpool requires its own cache devices.
A cache device in a zpool only hosts L2 buffers for that zpool.
Can be desirable if the zpools serve different workloads and this is known upfront.
But in Nutanix Files, we don’t know upfront, so it’s better to pool and share the capacity.
Shared L2ARC
Special L2ARC zpool called NTNX-fsvm-local-l2arc.
Consists only of the host-local devices.
We change the feed thread so that it feeds buffers from any imported zpool to the L2ARC pool’s cache vdevs (sketch below).
parted /dev/nvme{0,1}n1 …
zpool create NTNX-fsvm-local-l2arc mirror /dev/nvme{0,1}n1p1 cache /dev/nvme{0,1}n1p2
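The core of the change can be sketched in a few lines of standalone C (same toy types as the feed sketch above, trimmed to the fields needed here; the function name and the explicit guid parameter are illustrative, not the actual patch):

/* Shared-L2ARC eligibility, conceptually: target the cache vdevs of the
 * dedicated L2ARC pool, regardless of which pool the candidate comes from. */
#include <stdbool.h>
#include <stdint.h>

typedef struct cache_dev { uint64_t pool_guid; } cache_dev_t;
typedef struct hdr { uint64_t pool_guid; } hdr_t;

static bool
shared_l2_eligible(const hdr_t *cand, const cache_dev_t *dev,
    uint64_t shared_l2arc_pool_guid)
{
    (void)cand;   /* any imported pool's buffers may be fed; the header still
                   * records the primary pool's guid for fallback reads */
    return (dev->pool_guid == shared_l2arc_pool_guid);
}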
Correctness
✓ No changes to the core ARC/L2ARC invariants (tagging, invalidation, …).
Fallback Reads: If an L2 read fails (evicted L2 buffer, checksum error, …), we need to read from the primary pool.
⇒ Need to guarantee that the primary pool is still there.
⇒ Hold both pools’ SCL_L2ARC (see the sketch below).
Changing lifetimes ⇒ risk: dangling pointers!
Primary-pool headers reference the L2ARC pool’s in-core structures.
When exporting the L2ARC pool, we must invalidate those headers.
Turns out: the existing code & locking are sufficient.
Disclaimer: basis for the design was ZoL 0.7.1.
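A minimal standalone C sketch of the fallback-read rule, assuming a toy locking API in place of the real SCL_L2ARC config-lock hold; pool_t, l2_read, and primary_read are invented stand-ins.

/* Both pools are held for the duration of the L2 read so that the primary
 * pool is guaranteed to still be imported if we have to fall back to it. */
#include <stdbool.h>

typedef struct pool { int l2arc_holds; } pool_t;  /* stand-in for a spa */

static void pool_l2arc_enter(pool_t *p) { p->l2arc_holds++; }  /* ~ take SCL_L2ARC as reader */
static void pool_l2arc_exit(pool_t *p)  { p->l2arc_holds--; }

/* Pretend reads: the L2 copy may be gone or corrupt, the primary copy is authoritative. */
static bool
l2_read(pool_t *l2arc_pool, void *buf)
{
    (void)l2arc_pool; (void)buf;
    return (false);        /* simulate: evicted L2 buffer / checksum error */
}

static bool
primary_read(pool_t *primary_pool, void *buf)
{
    (void)primary_pool; (void)buf;
    return (true);
}

static bool
cached_read(pool_t *l2arc_pool, pool_t *primary_pool, void *buf)
{
    pool_l2arc_enter(l2arc_pool);
    pool_l2arc_enter(primary_pool);

    bool ok = l2_read(l2arc_pool, buf);
    if (!ok)
        ok = primary_read(primary_pool, buf);   /* fallback to the data pool */

    pool_l2arc_exit(primary_pool);
    pool_l2arc_exit(l2arc_pool);
    return (ok);
}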
Future: Prototype ⇒ Production
Project is a PoC, not productized yet.
Rebased patch against current OpenZFS master: GitHub PR #14060
TODOs:
Review design wrt newer features: native encryption & persistent L2ARC.
Remove hard-coded magic name; use a property instead:
zpool create -o share_l2arc_vdevs=on my-l2arc-pool …
Coexistence with non-shared L2ARC; also via property:
zpool set use_shared_l2arc=on my-data-pool
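As a sketch of how the two proposed properties could interact (property names as proposed above; the struct and helper are hypothetical):

/* A data pool's buffers are fed into another pool's cache vdevs only if the
 * L2ARC pool shares its vdevs and the data pool has opted in. */
#include <stdbool.h>

typedef struct pool_props {
    bool share_l2arc_vdevs;   /* set on the dedicated L2ARC pool */
    bool use_shared_l2arc;    /* set on data pools that want acceleration */
} pool_props_t;

static bool
shared_feed_allowed(const pool_props_t *l2arc_pool, const pool_props_t *data_pool)
{
    return (l2arc_pool->share_l2arc_vdevs && data_pool->use_shared_l2arc);
}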
Summary & Q&A
System-wide L2ARC, shared dynamically among all imported zpools.
Impl: magic zpool name; tweaks to L2 feed thread; minor locking changes.
Working PoC on GitHub (PR #14060)
Interested in the code? Demo? ⇒ Hackathon
Questions or comments on the design? ⇒ Now!
Thanks to my team at Nutanix and the ZFS community, especially Paul Dagnelie and George Wilson!
Appendix
L2ARC read path
Alternative Designs
#1: zpool l2arc status|add|remove
⇒ Lot of refactoring work for a slightly improved UX.
#2: zpool add $pool cache <dev>: add the same device to multiple zpools (like a spare)
⇒ Quite brittle, and counterintuitive to users (spares are already not very intuitive).