MM Alignment - ASI

Brendan Jackman <jackmanb@google.com>

Feb 2025

Agenda

  1. Reintroduce ASI - 5 mins
  2. High-level strategy & RFCv2
  3. page_alloc integration (LSF/MM/BPF pre-discussion)
  4. Page cache problem
  5. KUnit for page_alloc? (Non-ASI topic)

ASI Refresher

ASI Micro-refresher

Longer intro: my talk at LSF/MM/BPF 2024 (LWN).

[Diagram: userspace/guest code runs against the restricted address space (secrets unmapped); an ASI page fault switches the kernel to the unrestricted address space (secrets mapped).]

From an mm perspective, it’s a bit like KPTI++:

[Diagram: userspace/kernel address-space layouts compared for KPTI and ASI.]

KPTI: unmap (almost) the whole kernel from userspace.

ASI: unmap an (almost) arbitrary subset of the kernel from an untrusted task. Now we can keep unused stuff unmapped even when in the kernel.

  • There’s a new copy of the kernel address space, with holes in it.
  • When we hit one of those holes, we get a #PF, do some stuff, and switch to the unrestricted address space (no holes).

[Diagram: a page fault moves us from the restricted to the unrestricted address space, stunning the HT sibling; a VM entry moves us back, with data and control-flow flushes on the transitions.]

Today I’m glossing over all that stuff. (But let’s discuss on the RFC thread).

Today let’s talk about how to manage the holes.

Strategy & RFCv2

Strategy & RFCv2 (link)

  • Now with protection from bare-metal attackers
    • (RFCv1 only sandboxed KVM guests)
  • Now with 70% fio degradation
  • No substantive feedback
  • My idea of what’s needed to merge first ASI series:
    • Userspace sandboxing support - done
    • Proof-of-concept fix for the 70% fio deg (more on that later)
    • Design proper page_alloc.c integration (my LSF/MM/BPF topic)
      • Includes getting rid of the page flag
    • Testing for more configs (e.g. CONFIG_PARAVIRT)
    • Reorganise a lot of the actual code
      • But most such “trivialities” can be worked out on the road from [PATCH] to [PATCH v20]
  • Question: what do you need to see?
  • We have KUnit tests (see the GitHub branch linked from the RFC).
    • Would including these, or being more explicit about them, make the series more attractive to review?
  • What else can I do to get feedback?
    • Just race to [PATCH]?

page_alloc integration (LSF/MM/BPF pre-discussion)

page_alloc integration - let’s poke some holes

  • Would like to use 2MB TLB entries for the restricted physmap
    • So we want to group nonsensitive pages together
  • Really want to avoid TLB shootdowns
    • So when allocating __GFP_SENSITIVE we should prefer pages that are already unmapped.
  • Sensitivity is a property that we want to…
    • physically group pages by
    • index free pages by
  • Stuff visible in restricted address space is called “nonsensitive memory”.
  • You have to decide if it’s sensitive when you allocate it.
  • GFP_USER now includes __GFP_SENSITIVE (see the sketch after this list).
  • Now page allocator needs to map and unmap pages.
    • Mapping might require allocating pagetables (also requires zeroing)
    • Unmapping requires TLB shootdown
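
A sketch of the gfp.h plumbing this implies (the flag names come from this deck; the bit value and the exact GFP_USER composition are illustrative, not from the series):

/* Sketch only: bit position illustrative, names from the slides. */
#define ___GFP_SENSITIVE        0x4000000u
#define __GFP_SENSITIVE         ((__force gfp_t)___GFP_SENSITIVE)

/* Anything that may hold userspace data is sensitive by default. */
#define GFP_USER        (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
                         __GFP_HARDENED_USERCOPY | __GFP_SENSITIVE)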

… sounds like a migratetype?

Simplification: Everything but UNMOVABLE is assumed to be sensitive. Might have to change that later.

 enum migratetype {
-        MIGRATE_UNMOVABLE,
+        MIGRATE_UNMOVABLE_SENSITIVE,
+        MIGRATE_UNMOVABLE_NONSENSITIVE,
         MIGRATE_MOVABLE,
         MIGRATE_RECLAIMABLE,
         MIGRATE_PCPTYPES,
         ...

How to transition sensitive -> nonsensitive (i.e. map pages into ASI)?

In the general case, this requires allocating pagetables. But not if the map/unmap happens at existing pagetable boundaries.

So we never have to allocate if:

  1. Page blocks never cross physmap pagetable boundaries
  2. ASI’s pagetables for the physmap have the same structure as the unrestricted address space
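
Condition 1 comes almost for free on x86-64, where a pageblock and a PMD entry are both 2MB; a compile-time check along these lines (my sketch, not from the series) pins the assumption down:

#include <linux/mm.h>           /* pageblock_order */
#include <linux/pgtable.h>      /* PMD_SHIFT */

/*
 * Sketch: if one pageblock is exactly one PMD entry (2MB on x86-64),
 * a block can never straddle a physmap pagetable boundary, so
 * flipping a whole block never needs a pagetable allocation
 * (assuming condition 2 also holds).
 */
static void __init asi_check_physmap_assumptions(void)
{
        BUILD_BUG_ON(pageblock_order != PMD_SHIFT - PAGE_SHIFT);
}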

With those two conditions in place, no allocation is required and life is easy.

So: let MIGRATE_UNMOVABLE_NONSENSITIVE allocations fall back to other migratetypes right in the fastpath.

When that happens, change the pageblock migratetype and map it into ASI.

(Note: must flip the whole block, can’t steal individual pages)

A hacky prototype of this easy part seems OK so far.
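
A sketch of that fastpath flip (asi_map_pageblock() is a hypothetical helper; only the migratetype name comes from this deck):

#include <linux/mm.h>
#include <linux/pageblock-flags.h>

/*
 * Sketch: a nonsensitive allocation has just stolen @page from a
 * sensitive pageblock in the fastpath. Flip the whole block (we
 * can't steal individual pages) and map it into the restricted
 * address space. Mapping only adds translations to already-present
 * pagetable levels: no allocation, no TLB shootdown, so it's
 * fastpath-safe.
 */
static void steal_block_nonsensitive(struct page *page)
{
        unsigned long block_pfn = round_down(page_to_pfn(page),
                                             pageblock_nr_pages);

        set_pageblock_migratetype(page, MIGRATE_UNMOVABLE_NONSENSITIVE);
        asi_map_pageblock(block_pfn);           /* hypothetical */
}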

How to transition nonsensitive -> sensitive (i.e. unmap pages from ASI)?

Requires a TLB flush + IPI. (debug_pagealloc skips the IPI, ASI can’t).

We can’t do this with IRQs off. It’s also expensive, so we want it batched.

So, if we are serving a sensitive allocation but only have MIGRATE_UNMOVABLE_NONSENSITIVE pages left, fail the fastpath.

I think this becomes something kinda like compaction/reclaim?
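
In sketch form, the fastpath check might be no more than this (a hypothetical helper sitting near the existing fallback logic):

/*
 * Sketch: refuse nonsensitive -> sensitive stealing in the fastpath.
 * It would mean unmapping the block from ASI, i.e. a TLB shootdown,
 * which we can't do with IRQs off; leave it for a batched slowpath.
 */
static bool fallback_allowed_in_fastpath(int start_mt, int fallback_mt)
{
        return !(start_mt == MIGRATE_UNMOVABLE_SENSITIVE &&
                 fallback_mt == MIGRATE_UNMOVABLE_NONSENSITIVE);
}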

My simplistic understanding of the levels of fallback:

  1. preferred migratetype
  2. fallback migratetype
  3. preferred zone
  4. fallback zone
  5. preferred node
  6. fallback node
  7. direct compact
  8. direct reclaim

(The last two only happen if gfp_mask allows the blocking.)

A simple place to try the ASI unmap: at the direct compact / direct reclaim level. If compact/reclaim was allowed, the IPI must be safe?

[Same fallback diagram, with a “try ASI unmap” step annotated alongside direct compact and direct reclaim.]

BUT: done there, we would prefer falling back to the wrong node just to avoid a TLB shootdown. Seems like a bad trade?

So instead, try the ASI unmap earlier, before falling back to another node, if gfp_mask allows blocking?

[Same fallback diagram, with the “try ASI unmap” step moved up ahead of the node fallback.]

Sensitive -> nonsensitive is pretty painless; nonsensitive -> sensitive is painful.

But if you’re doing lots of pages at once, this cost gets heavily amortized.

So: something like kswapd/kcompactd should try to flip free pageblocks back to sensitive to avoid inflicting this pain on future allocations.

I haven’t made a working prototype of this hard part yet. Any thoughts up front?
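
A sketch of what that might look like (everything here is hypothetical; the point is batching one shootdown across many blocks):

/*
 * Sketch: "asi-unmapd"-style worker. Unmap a batch of free
 * nonsensitive pageblocks without flushing, then pay for one TLB
 * shootdown at the end, so future sensitive allocations don't each
 * eat an IPI in their slowpath. The blocks must not be handed out
 * for sensitive data until the flush has completed.
 */
static void asi_flip_free_blocks(struct zone *zone, int batch)
{
        unsigned long pfn;
        int flipped = 0;

        for_each_free_nonsensitive_pageblock(zone, pfn) {  /* hypothetical */
                asi_unmap_pageblock_noflush(pfn);          /* hypothetical */
                set_pageblock_migratetype(pfn_to_page(pfn),
                                          MIGRATE_UNMOVABLE_SENSITIVE);
                if (++flipped >= batch)
                        break;
        }
        if (flipped)
                asi_flush_tlb_all();    /* one shootdown for the batch */
}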

Page cache problem

Page cache problem (70% FIO degradation)

Painful lesson I learned when adding bare-metal support: read() matters!

All file pages are __GFP_SENSITIVE, and ASI obviously needs to protect those. That’s no problem for mmap(), but every read() causes an ASI #PF.

  • (The #PF itself isn’t causing the 70%, it’s the other stuff I’m glossing over)

Don’t want to pay the cost of protecting pages that the current process is about to read anyway.

Older prototypes of ASI included __GFP_LOCAL_NONSENSITIVE for data that the current process is allowed to leak.

But with file pages, we don’t know if the process is allowed to read it at allocation time.

Warning: this bit is likely to be stupid nonsense, I have not researched it much.

To avoid this, we need to read file pages through a different, per-process mapping.

I see two classes of solution:

Stable mappings

Whenever a process gains logical access to a file, map all that file’s pages into its restricted address space. Unmap them when it loses access.

Ephemeral mappings

Create local mappings for file pages on-demand when they are about to be accessed. Tear them down as soon as the process might lose access.

Maintaining “stable mappings” feels like an inevitable combinatorial explosion.

Tearing down “ephemeral mappings” requires TLB shootdowns.

But what if those mappings were CPU-local? Then we only need a local flush.

This is a bit like kmap_local_page() but harder:

kmap_local_page() segregates CPUs by virtual address: mappings of another CPU’s addresses can be TLB-stale, because we will simply never access those addresses.

That doesn’t work for ASI: an attacker doesn’t care about our rules.

So… we’d need a separate PGD for each CPU. …cool? Seems fine, yeah?
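
A very rough sketch of what that ephemeral, CPU-local API could look like (all asi_* names are hypothetical; flush_tlb_one_kernel() is the existing x86 local-flush primitive):

/*
 * Sketch: map a file page into this CPU's private PGD for the
 * duration of a read(). Teardown only needs a local flush, never an
 * IPI, because no other CPU's pagetables ever contained the mapping.
 */
static void *asi_map_local_file_page(struct page *page)
{
        preempt_disable();      /* mapping is only valid on this CPU */
        return __asi_local_map_page(page);      /* hypothetical */
}

static void asi_unmap_local_file_page(void *addr)
{
        __asi_local_unmap_page(addr);                   /* hypothetical */
        flush_tlb_one_kernel((unsigned long)addr);      /* local CPU only */
        preempt_enable();
}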

KUnit for page_alloc

KUnit for page_alloc.c

KUnit lets you test internal kernel interfaces. It’s:

  • A minimal framework for organising test code, writing assertions etc.
  • A subsystem in the kernel for running those tests, either as modules or during boot.
  • A Python script for booting QEMU/UML to quickly and conveniently run those tests.

#include <kunit/test.h>
#include <linux/gfp.h>

static void test_alloc_smoke(struct kunit *test)
{
        struct page *page;

        page = alloc_pages(GFP_KERNEL, 0);
        KUNIT_ASSERT_NOT_NULL(test, page);
        __free_pages(page, 0);
}

static struct kunit_case test_cases[] = {
        KUNIT_CASE(test_alloc_smoke),
        {}
};

/* Suite boilerplate (name illustrative) so the cases actually run: */
static struct kunit_suite page_alloc_test_suite = {
        .name = "page_alloc",
        .test_cases = test_cases,
};
kunit_test_suites(&page_alloc_test_suite);

./tools/testing/kunit/kunit.py run --kunitconfig mm/.kunitconfig

Global variables make unit tests difficult.

But node_data seems to be the only important global variable in page_alloc.c?

So, when KUnit is enabled:

  • Create a “mock” node at boot: like numa=fake, but it also has no memory, no CPUs, no kswapd/kcompactd, no SLUB caches.
  • Nothing will touch the mock node unless test code explicitly causes it to.
  • When a test starts, hotplug some memory out, then hotplug it back in on the mock node.
  • Use that node to exercise page_alloc.c (see the sketch below).
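
For example, a test against the mock node might look like this (mock_nid() and the hotplug plumbing are hypothetical):

/* Sketch: exercise the allocator on the isolated mock node. */
static void test_mock_node_alloc(struct kunit *test)
{
        int nid = mock_nid();   /* hypothetical: the mock node's ID */
        struct page *page;

        /* __GFP_THISNODE: fail rather than fall back to a real node. */
        page = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);
        KUNIT_ASSERT_NOT_NULL(test, page);
        KUNIT_EXPECT_EQ(test, page_to_nid(page), nid);
        __free_pages(page, 0);
}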

This has been useful for ASI work so far, should I invest more time in sharing it?