MM Alignment - ASI
Brendan Jackman <jackmanb@google.com>
Feb 2025
These slides: 73.nu/mm-alignment-asi
Agenda
ASI Refresher
[Diagram: an untrusted userspace/guest task runs the kernel in the restricted address space (secrets unmapped); a page fault switches to the unrestricted address space (secrets mapped).]
ASI Micro-refresher
From an mm perspective, it’s a bit like KPTI++:
[Diagram: address-space layouts. KPTI: a kernel address space (userspace + kernel) vs. a userspace address space (kernel mostly unmapped). ASI: a userspace address space, a restricted address space, and an unrestricted address space.]
KPTI: unmap (almost) the whole kernel from userspace.
ASI: unmap (almost) arbitrary subsets of the kernel from untrusted tasks. Now we can keep unused stuff unmapped even while running in the kernel.
ASI Micro-refresher
[Diagram: state machine between the restricted and unrestricted address spaces. A page fault switches restricted → unrestricted; VM entry switches back to restricted. The transitions involve data and control-flow flushes and stunning the HT sibling.]
Today I’m glossing over all that stuff. (But let’s discuss on the RFC thread).
Today let’s talk about how to manage the holes.
Strategy & RFCv2 (link)
page_alloc integration (LSF/MM/BPF pre-discussion)
page_alloc integration - let’s poke some holes
… sounds like a migratetype?
page_alloc integration - let’s poke some holes
Simplification: Everything but UNMOVABLE is assumed to be sensitive. Might have to change that later.
enum migratetype {
- MIGRATE_UNMOVABLE,
+ MIGRATE_UNMOVABLE_SENSITIVE,
+ MIGRATE_UNMOVABLE_NONSENSITIVE,
MIGRATE_MOVABLE,
MIGRATE_RECLAIMABLE,
MIGRATE_PCPTYPES,
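As a tiny userspace toy model (not the real kernel enum; the helper and its "sensitive" flag are made up for illustration), the split could look like:

```c
#include <assert.h>

/* Toy userspace model of the proposed split (mirrors the diff above;
 * not the real kernel definition). */
enum migratetype {
	MIGRATE_UNMOVABLE_SENSITIVE,
	MIGRATE_UNMOVABLE_NONSENSITIVE,
	MIGRATE_MOVABLE,
	MIGRATE_RECLAIMABLE,
	MIGRATE_PCPTYPES,	/* number of types on the pcp lists */
};

/* Hypothetical helper: with everything but UNMOVABLE assumed
 * sensitive, only unmovable allocations choose a freelist based on
 * whether the page may hold secrets. */
static enum migratetype unmovable_migratetype(int sensitive)
{
	return sensitive ? MIGRATE_UNMOVABLE_SENSITIVE
			 : MIGRATE_UNMOVABLE_NONSENSITIVE;
}
```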
page_alloc integration - let’s poke some holes
How to transition sensitive -> nonsensitive (i.e. map pages into ASI)?
In the general case, this requires allocating pagetables. But not if map/unmap happens at existing pagetable boundaries.
So we never have to allocate if map/unmap always happens at pageblock granularity, with pageblocks aligned to pagetable boundaries.
page_alloc integration - let’s poke some holes
How to transition sensitive -> nonsensitive (i.e. map pages into ASI)?
No allocation required, life is easy.
So: let MIGRATE_UNMOVABLE_NONSENSITIVE allocations fall back to other migratetypes right in the fastpath.
When that happens, change the pageblock migratetype and map it into ASI.
(Note: must flip the whole block, can’t steal individual pages)
Hacky prototype of this easy part seems OK so far.
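The fastpath step above can be sketched as userspace pseudocode (all names hypothetical, including asi_map_pageblock(); this is a model of the idea, not the prototype):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the fastpath fallback: a NONSENSITIVE allocation that
 * lands on a sensitive pageblock flips the WHOLE block (individual
 * pages can't be stolen) and maps it into the restricted address
 * space. sensitive -> nonsensitive needs no TLB flush, so this is
 * fine in the fastpath. */
struct pageblock {
	bool nonsensitive;	/* mapped into ASI's restricted space? */
	int free_pages;
};

static bool alloc_nonsensitive_fallback(struct pageblock *pb)
{
	if (pb->free_pages == 0)
		return false;
	if (!pb->nonsensitive) {
		pb->nonsensitive = true;  /* flip pageblock migratetype */
		/* asi_map_pageblock(pb);    hypothetical: map into ASI */
	}
	pb->free_pages--;
	return true;
}
```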
page_alloc integration - let’s poke some holes
How to transition nonsensitive -> sensitive (i.e. unmap pages from ASI)?
Requires a TLB flush + IPI. (debug_pagealloc skips the IPI, ASI can’t).
Cannot do this with IRQs off. Also, expensive, want it batched.
So, if we are serving a sensitive allocation but only have MIGRATE_UNMOVABLE_NONSENSITIVE pages left, fail the fastpath.
I think this becomes something kinda like compaction/reclaim?
page_alloc integration - let’s poke some holes
My simplistic understanding of the levels of fallback:
1. preferred migratetype
2. fallback migratetype
3. preferred zone
4. fallback zone
5. preferred node
6. fallback node
7. direct compact (if gfp_mask allows the blocking)
8. direct reclaim (if gfp_mask allows the blocking)
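That ladder could be encoded as a userspace sketch (the enum and helper are illustrative only; the real ordering lives in the allocator):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of the fallback ladder above, in order. */
enum fallback_level {
	PREFERRED_MIGRATETYPE,
	FALLBACK_MIGRATETYPE,
	PREFERRED_ZONE,
	FALLBACK_ZONE,
	PREFERRED_NODE,
	FALLBACK_NODE,
	DIRECT_COMPACT,
	DIRECT_RECLAIM,
	ALLOC_FAIL,
};

/* Hypothetical helper: advance one level, but direct compact/reclaim
 * are only reachable if the gfp_mask allows blocking. */
static enum fallback_level next_level(enum fallback_level cur, bool can_block)
{
	if (cur + 1 >= DIRECT_COMPACT && !can_block)
		return ALLOC_FAIL;
	return cur + 1;
}
```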
page_alloc integration - let’s poke some holes
Simple place to try ASI unmapping: between "fallback node" and "direct compact". If compact/reclaim was allowed to run at that point, the IPI must be safe?
preferred migratetype → fallback migratetype → preferred zone → fallback zone → preferred node → fallback node → [try ASI unmap here] → direct compact → direct reclaim
page_alloc integration - let’s poke some holes
BUT
preferred migratetype → fallback migratetype → preferred zone → fallback zone → preferred node → fallback node → [ASI unmap only tried here] → direct compact → direct reclaim
That means we would prefer falling back to the wrong node just to avoid a TLB shootdown. Seems like a bad trade?
So: try the ASI unmap earlier, before the node fallbacks, if gfp_mask allows blocking?
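A minimal decision sketch of that trade-off (the function and both flags are hypothetical names, not allocator code):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical placement check: before falling back to a remote node,
 * try unmapping nonsensitive pageblocks locally. The TLB shootdown is
 * likely cheaper than serving the allocation from the wrong node, but
 * it needs an IPI + flush, so only attempt it when the gfp_mask
 * allows blocking. */
static bool should_try_asi_unmap(bool local_nonsensitive_free,
				 bool can_block)
{
	return local_nonsensitive_free && can_block;
}
```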
page_alloc integration - let’s poke some holes
sensitive -> nonsensitive is pretty painless, nonsensitive -> sensitive is painful.
But if you’re doing lots of pages at once, this cost gets heavily amortized.
So: something like kswapd/kcompactd should try to flip free pageblocks back to sensitive to avoid inflicting this pain on future allocations.
Haven’t made a working prototype for this hard part yet. Any thoughts up front?
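One way the batching could look, as a userspace toy (hypothetical throughout, since no working prototype exists; asi_unmap_pageblock() and asi_flush_tlb_all() are invented names):

```c
#include <assert.h>
#include <stdbool.h>

/* Sketch of a kswapd/kcompactd-style background pass: flip every
 * fully-free nonsensitive pageblock back to sensitive, then pay ONE
 * batched TLB shootdown for the whole pass instead of one per future
 * sensitive allocation. */
struct pageblock { bool nonsensitive; bool fully_free; };

static int flip_free_blocks_sensitive(struct pageblock *blocks, int n)
{
	int flipped = 0;

	for (int i = 0; i < n; i++) {
		if (blocks[i].fully_free && blocks[i].nonsensitive) {
			blocks[i].nonsensitive = false;
			/* asi_unmap_pageblock(&blocks[i]);  hypothetical */
			flipped++;
		}
	}
	/* if (flipped) asi_flush_tlb_all();  single amortized flush */
	return flipped;
}
```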
Page cache problem
Page cache problem (70% FIO degradation)
Painful lesson I learned when adding bare-metal support: read() matters!
All file pages are __GFP_SENSITIVE; ASI obviously needs to protect those. No problem for mmap(), but every read() causes an ASI #PF.
Don’t want to pay the cost of protecting pages that the current process is about to read anyway.
Older prototypes of ASI included __GFP_LOCAL_NONSENSITIVE for data that the current process is allowed to leak.
But with file pages, we don’t know if the process is allowed to read it at allocation time.
Page cache problem (70% FIO degradation)
Warning: this bit is likely to be stupid nonsense, I have not researched it much.
To avoid this, we need to read file pages through a different, per-process mapping.
I see two classes of solution:
Stable mappings
Whenever a process gains logical access to a file, map all that file’s pages into its restricted address space. Unmap them when it loses access.
Ephemeral mappings
Create local mappings for file pages on-demand when they are about to be accessed. Tear them down as soon as the process might lose access.
Page cache problem (70% FIO degradation)
Maintaining “stable mappings” feels like an inevitable combinatorial explosion.
Tearing down “ephemeral mappings” requires TLB shootdowns.
But what if those mappings were CPU-local? Then we only need a local flush.
This is a bit like kmap_local_page(), but harder:
kmap_local_page() segregates CPUs by virtual address. Mappings of other CPUs’ addresses can be TLB-stale; we just never access those addresses.
That doesn’t work for ASI: the attacker doesn’t care about our rules.
So… we’d need a separate PGD for each CPU. …cool? Seems fine, yeah?
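The CPU-local idea can be modeled in a few lines of userspace C (every name here is hypothetical; this just captures why teardown only needs a local flush):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a CPU-local ephemeral mapping: each CPU maps file
 * pages through its own PGD, so only the owning CPU ever has the
 * translation cached. */
struct ephemeral_map { bool mapped; int owner_cpu; };

static void ephemeral_map_page(struct ephemeral_map *m, int cpu)
{
	m->mapped = true;
	m->owner_cpu = cpu;
}

/* Returns true if a flush local to @cpu is sufficient to retire the
 * mapping; other CPUs never had it in their PGD, so no cross-CPU
 * shootdown is needed. */
static bool ephemeral_unmap_page(struct ephemeral_map *m, int cpu)
{
	m->mapped = false;
	return cpu == m->owner_cpu;
}
```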
KUnit for page_alloc
KUnit for page_alloc.c
KUnit lets you test internal kernel interfaces.
static void test_alloc_smoke(struct kunit *test)
{
	struct page *page;

	page = alloc_pages(GFP_KERNEL, 0);
	KUNIT_ASSERT_NOT_NULL(test, page);
	__free_pages(page, 0);
}

static struct kunit_case test_cases[] = {
	KUNIT_CASE(test_alloc_smoke),
	{}
};
./tools/testing/kunit/kunit.py run --kunitconfig mm/.kunitconfig
KUnit for page_alloc.c
Global variables make unit tests difficult.
But node_data seems to be the only important global variable in page_alloc.c?
So: when KUnit is enabled, let tests swap out node_data and run the allocator against isolated state.
This has been useful for ASI work so far, should I invest more time in sharing it?