MM Alignment - ASI

Brendan Jackman <jackmanb@google.com>

Feb 2025

Agenda

  1. Reintroduce ASI - 5 mins
  2. High-level strategy & RFCv2
  3. page_alloc integration (LSF/MM/BPF pre-discussion)
  4. Page cache problem
  5. KUnit for page_alloc? (Non-ASI topic)

ASI Refresher

ASI Micro-refresher

Longer intro: my talk at LSF/MM/BPF 2024 (LWN).

[Diagram: userspace/guest code runs against the restricted address space (secrets unmapped); an ASI page fault switches the kernel to the unrestricted address space (secrets mapped).]

From an mm perspective, it’s a bit like KPTI++:

[Diagram: userspace/kernel address-space layouts compared for KPTI and ASI.]

KPTI: unmap (almost) the whole kernel from userspace.

ASI: unmap an (almost) arbitrary subset of the kernel from an untrusted task. Now we can keep unused stuff unmapped even when in the kernel.

  • There’s a new copy of the kernel address space, with holes in it.
  • When we hit one of those holes, we get a #PF, do some stuff, and switch to the unrestricted address space (no holes).

[Diagram: a page fault moves us from the restricted to the unrestricted address space, stunning the HT sibling; a VM entry moves us back, with data and control-flow flushes on the transitions.]

Today I’m glossing over all that stuff. (But let’s discuss on the RFC thread).

Today let’s talk about how to manage the holes.

Strategy & RFCv2

Strategy & RFCv2 (link)

  • Now with protection from bare-metal attackers
    • (RFCv1 only sandboxed KVM guests)
  • Now with 70% fio degradation
  • No substantive feedback
  • My idea of what’s needed to merge first ASI series:
    • Userspace sandboxing support - done
    • Proof-of-concept fix for the 70% fio deg (more on that later)
    • Design proper page_alloc.c integration (my LSF/MM/BPF topic)
      • Includes getting rid of the page flag
    • Testing for more configs (e.g. CONFIG_PARAVIRT)
    • Reorganise a lot of the actual code
      • But most such “trivialities” can be worked out on the road from [PATCH] to [PATCH v20]
  • Question: what do you need to see?
  • We have KUnit tests (see the GitHub branch linked from the RFC).
    • Would including these, or being more explicit about them, make the series more attractive to review?
  • What else can I do to get feedback?
    • Just race to [PATCH]?

page_alloc integration (LSF/MM/BPF pre-discussion)

page_alloc integration - let’s poke some holes

  • Would like to use 2MB TLB entries for the restricted physmap
    • So we want to group nonsensitive pages together
  • Really want to avoid TLB shootdowns
    • So when allocating __GFP_SENSITIVE we should prefer pages that are already unmapped.
  • Sensitivity is a property that we want to…
    • physically group pages by
    • index free pages by
  • Stuff visible in restricted address space is called “nonsensitive memory”.
  • You have to decide if it’s sensitive when you allocate it.
  • GFP_USER now includes __GFP_SENSITIVE (see the sketch after this list).
  • Now page allocator needs to map and unmap pages.
    • Mapping might require allocating pagetables (also requires zeroing)
    • Unmapping requires TLB shootdown
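
A sketch of the gfp.h plumbing this implies (the flag names come from this deck; the bit value and the exact GFP_USER composition are illustrative, not from the series):

/* Sketch only: bit position illustrative, names from the slides. */
#define ___GFP_SENSITIVE        0x4000000u
#define __GFP_SENSITIVE         ((__force gfp_t)___GFP_SENSITIVE)

/* Anything that may hold userspace data is sensitive by default. */
#define GFP_USER        (__GFP_RECLAIM | __GFP_IO | __GFP_FS | \
                         __GFP_HARDENED_USERCOPY | __GFP_SENSITIVE)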

… sounds like a migratetype?

Simplification: Everything but UNMOVABLE is assumed to be sensitive. Might have to change that later.

 enum migratetype {
-        MIGRATE_UNMOVABLE,
+        MIGRATE_UNMOVABLE_SENSITIVE,
+        MIGRATE_UNMOVABLE_NONSENSITIVE,
         MIGRATE_MOVABLE,
         MIGRATE_RECLAIMABLE,
         MIGRATE_PCPTYPES,
         ...

How to transition sensitive -> nonsensitive (i.e. map pages into ASI)?

In the general case, this requires allocating pagetables. But not if the map/unmap happens at existing pagetable boundaries.

So we never have to allocate if:

  1. Page blocks never cross physmap pagetable boundaries
  2. ASI’s pagetables for the physmap have the same structure as the unrestricted address space
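
Condition 1 comes almost for free on x86-64, where a pageblock and a PMD entry are both 2MB; a compile-time check along these lines (my sketch, not from the series) pins the assumption down:

#include <linux/mm.h>           /* pageblock_order */
#include <linux/pgtable.h>      /* PMD_SHIFT */

/*
 * Sketch: if one pageblock is exactly one PMD entry (2MB on x86-64),
 * a block can never straddle a physmap pagetable boundary, so
 * flipping a whole block never needs a pagetable allocation
 * (assuming condition 2 also holds).
 */
static void __init asi_check_physmap_assumptions(void)
{
        BUILD_BUG_ON(pageblock_order != PMD_SHIFT - PAGE_SHIFT);
}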

With those two conditions in place, no allocation is required and life is easy.

So: let MIGRATE_UNMOVABLE_NONSENSITIVE allocations fall back to other migratetypes right in the fastpath.

When that happens, change the pageblock migratetype and map it into ASI.

(Note: must flip the whole block, can’t steal individual pages)

A hacky prototype of this easy part seems OK so far.
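
A sketch of that fastpath flip (asi_map_pageblock() is a hypothetical helper; only the migratetype name comes from this deck):

#include <linux/mm.h>
#include <linux/pageblock-flags.h>

/*
 * Sketch: a nonsensitive allocation has just stolen @page from a
 * sensitive pageblock in the fastpath. Flip the whole block (we
 * can't steal individual pages) and map it into the restricted
 * address space. Mapping only adds translations to already-present
 * pagetable levels: no allocation, no TLB shootdown, so it's
 * fastpath-safe.
 */
static void steal_block_nonsensitive(struct page *page)
{
        unsigned long block_pfn = round_down(page_to_pfn(page),
                                             pageblock_nr_pages);

        set_pageblock_migratetype(page, MIGRATE_UNMOVABLE_NONSENSITIVE);
        asi_map_pageblock(block_pfn);           /* hypothetical */
}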

How to transition nonsensitive -> sensitive (i.e. unmap pages from ASI)?

Requires a TLB flush + IPI. (debug_pagealloc skips the IPI, ASI can’t).

We can’t do this with IRQs off. It’s also expensive, so we want it batched.

So, if we are serving a sensitive allocation but only have MIGRATE_UNMOVABLE_NONSENSITIVE pages left, fail the fastpath.

I think this becomes something kinda like compaction/reclaim?
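
In sketch form, the fastpath check might be no more than this (a hypothetical helper sitting near the existing fallback logic):

/*
 * Sketch: refuse nonsensitive -> sensitive stealing in the fastpath.
 * It would mean unmapping the block from ASI, i.e. a TLB shootdown,
 * which we can't do with IRQs off; leave it for a batched slowpath.
 */
static bool fallback_allowed_in_fastpath(int start_mt, int fallback_mt)
{
        return !(start_mt == MIGRATE_UNMOVABLE_SENSITIVE &&
                 fallback_mt == MIGRATE_UNMOVABLE_NONSENSITIVE);
}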

My simplistic understanding of the levels of fallback:

  1. preferred migratetype
  2. fallback migratetype
  3. preferred zone
  4. fallback zone
  5. preferred node
  6. fallback node
  7. direct compact
  8. direct reclaim

(The last two only happen if gfp_mask allows the blocking.)

A simple place to try the ASI unmap: at the direct compact / direct reclaim level. If compact/reclaim was allowed, the IPI must be safe?

[Same fallback diagram, with a “try ASI unmap” step annotated alongside direct compact and direct reclaim.]

BUT: done there, we would prefer falling back to the wrong node just to avoid a TLB shootdown. Seems like a bad trade?

So instead, try the ASI unmap earlier, before falling back to another node, if gfp_mask allows blocking?

[Same fallback diagram, with the “try ASI unmap” step moved up ahead of the node fallback.]

Sensitive -> nonsensitive is pretty painless; nonsensitive -> sensitive is painful.

But if you’re doing lots of pages at once, this cost gets heavily amortized.

So: something like kswapd/kcompactd should try to flip free pageblocks back to sensitive to avoid inflicting this pain on future allocations.

I haven’t made a working prototype of this hard part yet. Any thoughts up front?
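
A sketch of what that might look like (everything here is hypothetical; the point is batching one shootdown across many blocks):

/*
 * Sketch: "asi-unmapd"-style worker. Unmap a batch of free
 * nonsensitive pageblocks without flushing, then pay for one TLB
 * shootdown at the end, so future sensitive allocations don't each
 * eat an IPI in their slowpath. The blocks must not be handed out
 * for sensitive data until the flush has completed.
 */
static void asi_flip_free_blocks(struct zone *zone, int batch)
{
        unsigned long pfn;
        int flipped = 0;

        for_each_free_nonsensitive_pageblock(zone, pfn) {  /* hypothetical */
                asi_unmap_pageblock_noflush(pfn);          /* hypothetical */
                set_pageblock_migratetype(pfn_to_page(pfn),
                                          MIGRATE_UNMOVABLE_SENSITIVE);
                if (++flipped >= batch)
                        break;
        }
        if (flipped)
                asi_flush_tlb_all();    /* one shootdown for the batch */
}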

Page cache problem

Page cache problem (70% FIO degradation)

Painful lesson I learned when adding bare-metal support: read() matters!

All file pages are __GFP_SENSITIVE, and ASI obviously needs to protect those. That’s no problem for mmap(), but every read() causes an ASI #PF.

  • (The #PF itself isn’t causing the 70%, it’s the other stuff I’m glossing over)

Don’t want to pay the cost of protecting pages that the current process is about to read anyway.

Older prototypes of ASI included __GFP_LOCAL_NONSENSITIVE for data that the current process is allowed to leak.

But with file pages, we don’t know if the process is allowed to read it at allocation time.

Warning: this bit is likely to be stupid nonsense, I have not researched it much.

To avoid this, we need to read file pages through a different, per-process mapping.

I see two classes of solution:

Stable mappings

Whenever a process gains logical access to a file, map all that file’s pages into its restricted address space. Unmap them when it loses access.

Ephemeral mappings

Create local mappings for file pages on-demand when they are about to be accessed. Tear them down as soon as the process might lose access.

Maintaining “stable mappings” feels like an inevitable combinatorial explosion.

Tearing down “ephemeral mappings” requires TLB shootdowns.

But what if those mappings were CPU-local? Then we only need a local flush.

This is a bit like kmap_local_page() but harder:

kmap_local_page() segregates CPUs by virtual address: mappings of another CPU’s addresses can be TLB-stale, because we will simply never access those addresses.

That doesn’t work for ASI: an attacker doesn’t care about our rules.

So… we’d need a separate PGD for each CPU. …cool? Seems fine, yeah?
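
A very rough sketch of what that ephemeral, CPU-local API could look like (all asi_* names are hypothetical; flush_tlb_one_kernel() is the existing x86 local-flush primitive):

/*
 * Sketch: map a file page into this CPU's private PGD for the
 * duration of a read(). Teardown only needs a local flush, never an
 * IPI, because no other CPU's pagetables ever contained the mapping.
 */
static void *asi_map_local_file_page(struct page *page)
{
        preempt_disable();      /* mapping is only valid on this CPU */
        return __asi_local_map_page(page);      /* hypothetical */
}

static void asi_unmap_local_file_page(void *addr)
{
        __asi_local_unmap_page(addr);                   /* hypothetical */
        flush_tlb_one_kernel((unsigned long)addr);      /* local CPU only */
        preempt_enable();
}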

KUnit for page_alloc

KUnit for page_alloc.c

KUnit lets you test internal kernel interfaces. It’s:

  • A minimal framework for organising test code, writing assertions etc.
  • A subsystem in the kernel for running those tests, either as modules or during boot.
  • A Python script for booting QEMU/UML to quickly and conveniently run those tests.

#include <kunit/test.h>
#include <linux/gfp.h>

static void test_alloc_smoke(struct kunit *test)
{
        struct page *page;

        page = alloc_pages(GFP_KERNEL, 0);
        KUNIT_ASSERT_NOT_NULL(test, page);
        __free_pages(page, 0);
}

static struct kunit_case test_cases[] = {
        KUNIT_CASE(test_alloc_smoke),
        {}
};

/* Suite boilerplate (name illustrative) so the cases actually run: */
static struct kunit_suite page_alloc_test_suite = {
        .name = "page_alloc",
        .test_cases = test_cases,
};
kunit_test_suites(&page_alloc_test_suite);

./tools/testing/kunit/kunit.py run --kunitconfig mm/.kunitconfig

Global variables make unit tests difficult.

But node_data seems to be the only important global variable in page_alloc.c?

So, when KUnit is enabled:

  • Create a “mock” node at boot: like numa=fake, but it also has no memory, no CPUs, no kswapd/kcompactd, no SLUB caches.
  • Nothing will touch the mock node unless test code explicitly causes it to.
  • When a test starts, hotplug some memory out, then hotplug it back in on the mock node.
  • Use that node to exercise page_alloc.c (see the sketch below).
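
For example, a test against the mock node might look like this (mock_nid() and the hotplug plumbing are hypothetical):

/* Sketch: exercise the allocator on the isolated mock node. */
static void test_mock_node_alloc(struct kunit *test)
{
        int nid = mock_nid();   /* hypothetical: the mock node's ID */
        struct page *page;

        /* __GFP_THISNODE: fail rather than fall back to a real node. */
        page = alloc_pages_node(nid, GFP_KERNEL | __GFP_THISNODE, 0);
        KUNIT_ASSERT_NOT_NULL(test, page);
        KUNIT_EXPECT_EQ(test, page_to_nid(page), nid);
        __free_pages(page, 0);
}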

This has been useful for ASI work so far, should I invest more time in sharing it?