Accelerating page migration with multithreading

Zi Yan

Design

Batch multiple folios to distribute across multiple threads

If #folios <= #threads, each thread copies a folio

Otherwise, each thread copies a part of each folio

A workqueue is used, and copy jobs are placed on CPUs by the CPU scheduler

The CPU scheduler chooses the most idle core for each copy job

A high-priority workqueue is used so that copy jobs can displace jobs already running on a core (otherwise there is no benefit)

No HIGHMEM support, since kmap_local might incur too much overhead
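
The distribution rule on this slide can be sketched in a few lines. This is a minimal userspace illustration, not the kernel implementation: `CopyJob` and `plan_copy` are hypothetical names, the threads are just integer indices, and the handling of remainder bytes (given to the last thread) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CopyJob:
    """One contiguous byte range of one folio, handed to one thread."""
    thread: int
    folio: int
    offset: int
    length: int


def plan_copy(folio_sizes, nr_threads):
    """Split a batch of folios into per-thread copy jobs.

    Follows the rule as stated on the slide: with no more folios than
    threads, each folio goes whole to its own thread; otherwise every
    folio is cut into nr_threads chunks, one chunk per thread.
    """
    if nr_threads < 1:
        raise ValueError("nr_threads must be at least 1")
    jobs = []
    if len(folio_sizes) <= nr_threads:
        # One whole folio per thread; surplus threads get no job.
        for folio, size in enumerate(folio_sizes):
            jobs.append(CopyJob(folio, folio, 0, size))
        return jobs
    for folio, size in enumerate(folio_sizes):
        chunk = size // nr_threads
        for thread in range(nr_threads):
            offset = thread * chunk
            # Assumption: the last thread also takes any remainder bytes.
            length = size - offset if thread == nr_threads - 1 else chunk
            jobs.append(CopyJob(thread, folio, offset, length))
    return jobs
```

For example, `plan_copy([2 * 1024 * 1024] * 3, 2)` yields six jobs, two 1MB halves for each of the three folios.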

Experiment setup

Experiments were run on two systems

Two sockets of Intel Xeon E5-2650 v4 (12 cores (24 threads) in each socket), and

Two sockets of NVIDIA Grace arm64 (72 cores (72 threads) in each socket)

Page sizes:

On x86_64, 4KB and 2MB

On arm64, 64KB and 2MB

Copy throughput is measured from the time taken by the userspace move_pages() syscall, which includes:

Walking the page table to collect every struct page

Calling the kernel function migrate_pages()
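
As a rough sketch of how such a figure can be derived, the helpers below turn a measured wall-clock time into bytes per second. `copy_throughput` and `timed` are hypothetical names, and the callable passed to `timed` stands in for whatever wrapper issues the syscall. Because the measured time also covers the page table walk, the result is a lower bound on the raw copy rate.

```python
import time


def copy_throughput(nr_folios, folio_size, elapsed_seconds):
    """Return bytes moved per second for one timed migration call."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return nr_folios * folio_size / elapsed_seconds


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start
```

With `timed` wrapped around the migration call, `copy_throughput(1024, 4096, elapsed)` gives the rate for a batch of 1024 4KB folios.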

Results

x86_64 copy throughput improvement

4KB: up to ~60% when 1024 4KB folios (4MB) are copied using 8 threads

2MB: up to ~600% when 1024 2MB folios (2GB) are copied using 16 threads

arm64 copy throughput improvement

64KB: up to ~400% when 1024 64KB folios (64MB) are copied using 32 threads

2MB: up to ~700% when 1024 2MB folios (2GB) are copied using 32 threads
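
The batch sizes quoted above follow from the folio count times the folio size, and a percentage improvement can be converted to a speedup factor. The sketch below assumes binary units and assumes that "improvement" means (new - old) / old, which the slide does not state; the helper names are hypothetical.

```python
KB = 1024
MB = 1024 * KB
GB = 1024 * MB


def total_bytes(nr_folios, folio_size):
    """Total payload of a batch of equally sized folios."""
    return nr_folios * folio_size


def speedup_from_improvement(percent):
    """Convert a relative improvement in percent to a speedup factor.

    Assumes improvement = (new - old) / old, so 100% means 2x.
    """
    return 1 + percent / 100


# Batch sizes quoted in the results.
assert total_bytes(1024, 4 * KB) == 4 * MB
assert total_bytes(1024, 64 * KB) == 64 * MB
assert total_bytes(1024, 2 * MB) == 2 * GB
```

Under that reading, the ~600% figure for 2MB folios on x86_64 corresponds to roughly 7 times the single-threaded throughput.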

TODO

Choose a proper number of threads for page copying

More threads do not mean higher copy throughput

Different architectures prefer different thread numbers

Boot-time profiling might help decide the number

Charge page copy CPU cycles to user process

CPU time spent in the workqueue is currently not attributed to the user process

Charge the user process's runtime with the page copy runtime from the workqueue

Better arch-aware CPU scheduling

AMD CPUs with multiple CCDs see higher throughput when threads are scheduled across CCDs
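
One way the thread-count item could work, sketched under assumptions: profile a few candidate thread counts, then take the smallest count whose throughput is close to the best one, since more threads do not always help. `pick_thread_count` and its `tolerance` parameter are hypothetical, and the numbers in the usage note are made up for illustration, not measurements.

```python
def pick_thread_count(throughput_by_threads, tolerance=0.05):
    """Return the smallest thread count within `tolerance` of the best.

    throughput_by_threads maps a thread count to its measured copy
    throughput (any consistent unit). Preferring the smallest count
    that is nearly as fast avoids spending cores for little gain.
    """
    if not throughput_by_threads:
        raise ValueError("no profiling samples")
    best = max(throughput_by_threads.values())
    threshold = best * (1 - tolerance)
    good = [n for n, t in throughput_by_threads.items() if t >= threshold]
    return min(good)
```

With the synthetic samples `{1: 1.0, 2: 1.9, 4: 3.5, 8: 5.0, 16: 5.1, 32: 4.2}`, the function returns 8: 16 threads is marginally faster, but 8 is within 5% of it.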