Accelerating page migration with multithreading
Zi Yan
Design
Batch multiple folios and distribute the copy work across multiple threads
If #folios <= #threads, each thread copies a folio
Otherwise, each thread copies a part of each folio
A workqueue is used, and copy jobs are scheduled by the CPU scheduler
The CPU scheduler chooses the most idle core for each copy job
A high-priority workqueue is used so that copy jobs can displace existing jobs (otherwise there is no benefit)
No HIGHMEM support, since kmap_local might incur too much overhead
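The distribution rule in the list above can be modelled in a few lines. This is only a userspace sketch of the rule as stated on this slide, not the kernel code: `plan_copy_jobs`, `CopyJob`, and the even chunk split are assumptions made for illustration.

```python
from typing import List, Tuple

# (folio index, byte offset within the folio, byte length); hypothetical type
CopyJob = Tuple[int, int, int]


def plan_copy_jobs(nr_folios: int, folio_size: int, nr_threads: int) -> List[List[CopyJob]]:
    """Return one job list per thread, following the rule stated in the Design list."""
    if nr_folios <= 0 or folio_size <= 0 or nr_threads <= 0:
        raise ValueError("all arguments must be positive")
    plans: List[List[CopyJob]] = [[] for _ in range(nr_threads)]
    if nr_folios <= nr_threads:
        # Few folios: thread i copies folio i in full; extra threads stay idle.
        for i in range(nr_folios):
            plans[i].append((i, 0, folio_size))
        return plans
    # Many folios: each folio is cut into nr_threads chunks and thread t
    # copies chunk t of every folio. The last chunk absorbs any remainder.
    chunk = folio_size // nr_threads
    for i in range(nr_folios):
        for t in range(nr_threads):
            offset = t * chunk
            length = folio_size - offset if t == nr_threads - 1 else chunk
            plans[t].append((i, offset, length))
    return plans
```

For example, `plan_copy_jobs(1024, 2 * 1024 * 1024, 16)` gives each of the 16 threads 1024 jobs of 128KB, one chunk from every 2MB folio.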
Experiment setup
Experiments were run on two systems:
Two sockets of Intel E5-2650 v4 (12 cores (24 threads) in each socket, 24 cores (48 threads) in total), and
Two sockets of NVIDIA Grace arm64 (72 cores (72 threads) in each socket)
Page sizes:
On x86_64, 4KB and 2MB
On arm64, 64KB and 2MB
Copy throughput is measured from the time the move_pages() syscall takes as seen from userspace, which includes:
Walking the page tables to collect every struct page
Calling the kernel function migrate_pages()
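The throughput figure derived from this measurement is simply bytes moved divided by elapsed syscall time. A minimal sketch follows; `copy_throughput` is a hypothetical helper, and the elapsed time in the usage note is made up for illustration, not a measured result.

```python
def copy_throughput(nr_folios: int, folio_size: int, elapsed_seconds: float) -> float:
    """Bytes per second moved by one move_pages() call.

    elapsed_seconds is the wall-clock time of the whole syscall, so it
    covers the page table walk as well as the copy inside migrate_pages().
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return nr_folios * folio_size / elapsed_seconds
```

With an illustrative elapsed time of 0.5 s for 1024 folios of 2MB, `copy_throughput(1024, 2 * 1024 * 1024, 0.5)` evaluates to 4 GiB/s. Because the timed interval includes work other than copying, the value is a lower bound on the raw copy bandwidth.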
Results
x86_64 copy throughput improvement
4KB: up to ~60% when 1024 4KB folios (4MB) are copied using 8 threads
2MB: up to ~600% when 1024 2MB folios (2GB) are copied using 16 threads
arm64 copy throughput improvement
64KB: up to ~400% when 1024 64KB folios (64MB) are copied using 32 threads
2MB: up to ~700% when 1024 2MB folios (2GB) are copied using 32 threads
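To read these percentages as speedup factors, the sketch below assumes "improvement" means (multithreaded - baseline) / baseline; that definition is an assumption and is not stated on the slide. Both helper names are hypothetical.

```python
def speedup_from_improvement(improvement_percent: float) -> float:
    """Convert a relative improvement in percent to a throughput ratio."""
    return 1 + improvement_percent / 100


def batch_bytes(nr_folios: int, folio_size: int) -> int:
    """Total bytes in one batch of equally sized folios."""
    return nr_folios * folio_size
```

Under that assumption, `speedup_from_improvement(600)` is 7.0, so "~600%" would mean roughly seven times the baseline throughput, and `batch_bytes(1024, 64 * 1024)` confirms the 64MB batch size quoted for arm64.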
TODO
Choose a proper number of threads for page copying
More threads do not always mean higher copy throughput
Different architectures prefer different thread numbers
Boot time profiling might help decide the number
Charge page copy CPU cycles to the user process
CPU time spent in the workqueue is currently not attributed to the process
Add the workqueue's page copy runtime to the user process's runtime
Better arch-aware CPU scheduling
AMD CPUs with multiple CCDs see higher throughput when threads are scheduled across CCDs
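For the first TODO item, one possible shape of boot-time profiling is to time a test copy at a few candidate thread counts and keep the best one. The sketch below only models that selection step; `pick_copy_threads` and the `measure` callback are hypothetical, and the throughput table in the usage note is invented to show the behaviour, not measured data.

```python
from typing import Callable, Iterable


def pick_copy_threads(candidates: Iterable[int], measure: Callable[[int], float]) -> int:
    """Return the thread count with the highest measured copy throughput.

    measure(n) is expected to run a test copy with n threads and return
    its throughput. Ties go to the smaller thread count.
    """
    best_n = None
    best_throughput = float("-inf")
    for n in sorted(candidates):
        throughput = measure(n)
        if throughput > best_throughput:  # strict: keep the smaller n on ties
            best_n, best_throughput = n, throughput
    if best_n is None:
        raise ValueError("no candidate thread counts given")
    return best_n
```

With an invented table such as `{1: 1.0, 8: 5.0, 16: 7.0, 32: 6.5}` passed as `table.__getitem__`, the function returns 16, reflecting the observation that throughput can drop again beyond some thread count.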