
Shared L2ARC

Christian Schwarz

OpenZFS Developer Summit 2022


Context: Nutanix Files

Software-defined scale-out file storage.

Core functionality: NFS, SMB, Multi-Protocol shares.

= ZFS dataset(s) spread over many zpools.

Compute (ZFS, protocols) in VMs, on Nutanix HCI.

Storage for zpools in vDisks provided by Nutanix HCI. Access via iSCSI.

Each zpool is imported in one VM at a given time.

zpools move between VMs for HA & load balancing. Cheap because data does not move.

ZFS’s role: provide POSIX-compliant filesystem with enterprise features.


Project Files Extended Buffer Cache

Goal: accelerate read-heavy workloads whose working set is >> ARC size.

Plan:

  1. Attach local disk of VM’s current host system.
  2. Use it as L2ARC.
  3. Serve reads from local disk instead of vDisk.

Problems:

Adding the host-local device as an L2ARC vdev would prevent moving zpools between VMs.
⇒ Want to avoid L2ARC vdev add/remove when moving zpools.

L2ARC is per-zpool, but we can’t predict which share/zpool will need acceleration.
⇒ Would need to partition the host-local device ⇒ under-utilization if only one pool is hot.


How Does L2ARC Work, Anyways?

ARC maps (spa_load_guid, dva, txg) to ARC headers (arc_buf_hdr_t).

Header points to storage location of cached data in L1 and/or L2.

L1: pointer to DRAM buffer.

L2: pointer to cache vdev + offset.
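
A minimal, self-contained C sketch of that mapping, with stand-in types (the real arc_buf_hdr_t in OpenZFS has a different layout and many more fields):

  /* Simplified stand-ins for the real OpenZFS types; illustration only. */
  #include <stdint.h>
  #include <stddef.h>

  typedef struct dva { uint64_t dva_word[2]; } dva_t;  /* data virtual address */

  /* Key the ARC uses to identify a cached block. */
  typedef struct arc_key {
      uint64_t spa_load_guid;  /* which pool the block belongs to */
      dva_t    dva;            /* where the block lives in that pool */
      uint64_t birth_txg;      /* txg in which the block was written */
  } arc_key_t;

  /* Header: the key plus where the cached copy currently is (L1 and/or L2). */
  typedef struct arc_hdr {
      arc_key_t         key;
      void             *l1_buf;    /* L1: DRAM buffer, NULL if not in L1 */
      struct l2arc_dev *l2_dev;    /* L2: cache vdev, NULL if not in L2 */
      uint64_t          l2_offset; /* L2: byte offset on that cache vdev */
      uint64_t          size;
  } arc_hdr_t;

Note that the key already contains spa_load_guid, so headers from different pools coexist in one system-wide ARC index.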


How Does L2ARC Work, Anyways?

L2ARC feed thread iterates over L1 buffers that are eviction candidates (= tail of MRU/MFU lists).

If eviction candidate is from a zpool with cache devices:

  1. Write L1 buffer to cache dev.
  2. Store offset in ARC header.

Upon eviction from L1, the header with the offset remains in DRAM!
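
In rough, pseudocode-like C, the upstream feed logic looks about like this (a simplified sketch with assumed helper names, not the actual l2arc_feed_thread() code):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stddef.h>

  struct arc_hdr;    /* see the header sketch above */
  struct l2arc_dev;

  /* Assumed helpers, for illustration only. */
  extern struct l2arc_dev *next_cache_dev(void);                      /* round-robin over cache vdevs */
  extern struct arc_hdr   *next_eviction_candidate(struct arc_hdr *); /* walk MRU/MFU tails */
  extern bool     same_pool(const struct arc_hdr *, const struct l2arc_dev *);
  extern uint64_t write_to_cache_dev(struct l2arc_dev *, const struct arc_hdr *);
  extern void     record_l2_location(struct arc_hdr *, struct l2arc_dev *, uint64_t offset);

  void l2arc_feed_once(uint64_t max_buffers)
  {
      struct l2arc_dev *dev = next_cache_dev();
      if (dev == NULL)
          return;

      uint64_t fed = 0;
      for (struct arc_hdr *hdr = next_eviction_candidate(NULL);
           hdr != NULL && fed < max_buffers;
           hdr = next_eviction_candidate(hdr)) {

          /* Upstream: only feed buffers that belong to the cache vdev's own pool. */
          if (!same_pool(hdr, dev))
              continue;

          uint64_t off = write_to_cache_dev(dev, hdr);  /* 1. write L1 buffer to cache dev */
          record_l2_location(hdr, dev, off);            /* 2. store offset in ARC header   */
          fed++;
      }
      /* When the L1 buffer is later evicted, the header (with dev + offset) stays in DRAM. */
  }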


Upstream L2ARC With Multiple Zpools

Each zpool requires its own cache devices.

A cache device in a zpool only hosts L2 buffers for that zpool.

Can be desirable if the zpools serve different workloads and this is known upfront.

But in Nutanix Files, we don’t know upfront, so it’s better to pool and share the capacity.
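
The upstream restriction boils down to an eligibility check of roughly this shape (simplified; not the literal OpenZFS code):

  #include <stdbool.h>
  #include <stdint.h>

  /* Upstream rule: a buffer may only be cached on a cache vdev
   * that belongs to the buffer's own pool. */
  static bool l2_write_eligible(uint64_t buf_spa_load_guid,
                                uint64_t cache_dev_owner_spa_load_guid)
  {
      return buf_spa_load_guid == cache_dev_owner_spa_load_guid;
  }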


Shared L2ARC

Special L2ARC zpool called NTNX-fsvm-local-l2arc.

Consists only of the host-local devices.

We change the feed thread so that it feeds buffers from any zpool to the L2ARC pool’s cache vdevs.

parted /dev/nvme{0,1}n1 …
zpool create NTNX-fsvm-local-l2arc mirror /dev/nvme{0,1}n1p1 cache /dev/nvme{0,1}n1p2
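
The slides don't show the patch itself, so this is a conceptual sketch only: the per-pool eligibility rule from the previous slide is replaced by "feed any pool's buffers, but only to cache vdevs of the shared pool" (hypothetical names, not the code from PR #14060; the magic name is hard-coded in the PoC and slated to become a property, see the Future slide):

  #include <stdbool.h>
  #include <string.h>

  /* Name of the special pool whose cache vdevs are shared by all imported pools. */
  #define SHARED_L2ARC_POOL_NAME "NTNX-fsvm-local-l2arc"

  struct l2arc_dev {
      const char *owner_pool_name;  /* pool this cache vdev belongs to */
  };

  /* Shared-L2ARC variant: buffers from any pool are eligible, as long as
   * the target cache vdev belongs to the shared L2ARC pool. */
  static bool shared_l2_write_eligible(const struct l2arc_dev *dev)
  {
      return strcmp(dev->owner_pool_name, SHARED_L2ARC_POOL_NAME) == 0;
  }

The ARC header still records the cache vdev and offset, and the origin pool is already identified by spa_load_guid in the header, so the read path is unchanged in principle.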


Correctness

✓ No changes to the core ARC/L2ARC invariants (tagging, invalidation, …).

Fallback Reads: If an L2 read fails (evicted L2 buffer, checksum error, …), we need to read from the primary pool.
⇒ Need to guarantee that the primary pool is still there.
⇒ Hold both pools’ SCL_L2ARC.
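
A sketch of that invariant with stand-in lock helpers (in OpenZFS the primitive is the spa config lock, i.e. spa_config_enter()/spa_config_exit() with SCL_L2ARC; the real read path is structured differently):

  #include <stdint.h>

  struct spa;  /* a pool */

  /* Assumed stand-ins for taking/dropping SCL_L2ARC as reader; illustration only. */
  extern void hold_scl_l2arc(struct spa *);
  extern void release_scl_l2arc(struct spa *);
  extern int  read_from_cache_vdev(struct spa *l2arc_pool, uint64_t offset, void *buf, uint64_t size);
  extern int  read_from_primary_pool(struct spa *primary, const void *dva, void *buf, uint64_t size);

  /* Read a block whose L2 copy lives on the shared L2ARC pool. Holding both
   * pools' SCL_L2ARC ensures the cache vdev cannot disappear mid-I/O and
   * that the primary pool is still there for the fallback read. */
  int shared_l2arc_read(struct spa *primary, struct spa *l2arc_pool,
                        const void *dva, uint64_t l2_offset, void *buf, uint64_t size)
  {
      hold_scl_l2arc(primary);
      hold_scl_l2arc(l2arc_pool);

      int err = read_from_cache_vdev(l2arc_pool, l2_offset, buf, size);
      if (err != 0) {
          /* Evicted L2 buffer, checksum error, ...: fall back to the primary pool. */
          err = read_from_primary_pool(primary, dva, buf, size);
      }

      release_scl_l2arc(l2arc_pool);
      release_scl_l2arc(primary);
      return err;
  }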

Changing lifetimes ⇒ risk: dangling pointers!

Primary pool headers reference the L2ARC pool’s in-core structures.
When exporting the L2ARC pool, we must invalidate those headers.
Turns out: the existing code & locking are sufficient.
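
Conceptually, the invalidation on export amounts to a walk like this (simplified sketch; upstream, analogous work already happens when a cache vdev goes away, which is why the existing code suffices):

  #include <stddef.h>

  struct l2arc_dev;

  struct arc_hdr {
      struct arc_hdr   *next;   /* list of headers that have an L2 copy somewhere */
      struct l2arc_dev *l2_dev; /* cache vdev holding that copy */
  };

  /* On export of the shared L2ARC pool: clear every reference that headers of
   * still-imported primary pools hold to the exported pool's in-core structures. */
  void invalidate_l2_entries(struct arc_hdr *hdr_list, struct l2arc_dev *exported_dev)
  {
      for (struct arc_hdr *hdr = hdr_list; hdr != NULL; hdr = hdr->next) {
          if (hdr->l2_dev == exported_dev)
              hdr->l2_dev = NULL;  /* subsequent reads go to the primary pool */
      }
  }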

Disclaimer: basis for the design was ZoL 0.7.1.


Future: Prototype ⇒ Production

Project is a PoC, not productized yet.

Rebased patch against current OpenZFS master: GitHub PR #14060

TODOs:

Review design wrt newer features: native encryption & persistent L2ARC.

Remove hard-coded magic name; use a property instead:

zpool create -o share_l2arc_vdevs=on my-l2arc-pool …

Coexistence with non-shared L2ARC; also via property:

zpool set use_shared_l2arc=on my-data-pool


Summary & Q&A

System-wide L2ARC, shared dynamically among all imported zpools.

Impl: magic zpool name; tweaks to L2 feed thread; minor locking changes.

Working PoC on GitHub (PR #14060)

Interested in the code? Demo?
⇒ Hackathon

Questions or comments on design?
⇒ Now!

Thanks to my team at Nutanix and the ZFS community, especially Paul Dagnelie and George Wilson!


Appendix


L2ARC read path
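
The original slide is a diagram; as a rough textual stand-in, a simplified decision flow with assumed helper names (not the actual arc_read() code):

  #include <stdint.h>
  #include <string.h>

  struct l2arc_dev;

  struct arc_hdr {
      void             *l1_buf;    /* non-NULL: data cached in DRAM */
      struct l2arc_dev *l2_dev;    /* non-NULL: data cached on a cache vdev */
      uint64_t          l2_offset;
      uint64_t          size;
  };

  /* Assumed helpers, illustration only. */
  extern int read_cache_vdev(struct l2arc_dev *, uint64_t offset, void *buf, uint64_t size);
  extern int read_primary_pool(const struct arc_hdr *, void *buf, uint64_t size);

  /* Simplified read path: L1 hit -> copy from DRAM; L2 hit -> read the cache
   * vdev, falling back to the primary pool on error; otherwise read the
   * primary pool (plain ARC miss). */
  int arc_read_sketch(struct arc_hdr *hdr, void *buf)
  {
      if (hdr->l1_buf != NULL) {                      /* L1 hit */
          memcpy(buf, hdr->l1_buf, hdr->size);
          return 0;
      }
      if (hdr->l2_dev != NULL &&                      /* L2 hit */
          read_cache_vdev(hdr->l2_dev, hdr->l2_offset, buf, hdr->size) == 0)
          return 0;
      /* ARC miss or failed L2 read: fall back to the primary pool. */
      return read_primary_pool(hdr, buf, hdr->size);
  }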


Alternative Designs

#1: zpool l2arc status|add|remove

  • Refactor code to allow for l2arc_vdev_t that are not part of a spa_t.
  • Ensure L2ARC I/O to such spa-less l2arc vdevs works.
  • Add zpool commands to add/remove spa-less l2arc vdevs.

⇒ A lot of refactoring work for a slightly improved UX.

#2: zpool add $pool cache: add the same cache device to multiple zpools (like a spare)

  • One pool has the real l2arc_vdev_t that points to the real disk.
  • Other pools have a shim l2arc_vdev_t that just forwards to the real one.
  • On zpool export, one of the shims becomes the new owner.

⇒ Quite brittle, and counterintuitive to users (spares are already not very intuitive).
