Accelerating Critical OS Services in Virtualized Systems with Flexible Micro-sliced Cores
Jeongseob Ahn*, Chang Hyun Park‡, Taekyung Heo‡, Jaehyuk Huh‡
Challenge of Server Consolidation
[Figure: distribution of server utilization; x-axis: utilization (0 to 1), y-axis: fraction of time]
Consolidation improves system utilization
However, resources are contended
So, What Can Happen?
[Figure: vCPUs 0 and 1 time-sharing pCPU 0; in virtual time each vCPU appears to run continuously, but in physical time it alternates between running and waiting across VMEXIT/VMENTER transitions]
| Kernel component | Avg. waiting time, solo (μsec) | Avg. waiting time, co-run* (μsec) |
|---|---|---|
| Page reclaim | 1.03 | 420.13 |
| Page allocator | 3.42 | 1,053.26 |
| Dentry | 2.93 | 1,298.87 |
| Runqueue | 1.22 | 256.07 |

➊ Spinlock waiting time (gmake)
| Workload | Setup | Avg. | Min. | Max. |
|---|---|---|---|---|
| dedup | solo | 28 | 5 | 1,927 |
| dedup | co-run* | 6,354 | 7 | 74,915 |
| vips | solo | 55 | 5 | 2,052 |
| vips | co-run* | 14,928 | 17 | 121,548 |

➋ TLB synchronization latency (μsec)
|  | Jitter (ms) | Throughput (Mbits/sec) |
|---|---|---|
| solo | 0.0043 | 936.3 |
| mixed co-run* | 9.2507 | 435.6 |

➌ I/O latency & throughput (iPerf)
Processing time is amplified
* Concurrently running with Swaptions of PARSEC
How about Shortening Time Slice?
[Figure: vCPUs 0, 1, and 2 time-sharing pCPU 0; with time slice T, each preempted vCPU waits while the other two vCPUs run full slices, and a shorter slice proportionally reduces the waiting time]
Shortening the time slice is simple and powerful, but the overhead of frequent context switches is significant
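The tradeoff above can be sketched with a back-of-the-envelope model (all numbers here are assumptions for illustration, not measurements from the talk):

```python
def waiting_time_ms(n_vcpus, slice_ms):
    # Worst case: a just-preempted vCPU waits while every other
    # vCPU on the same pCPU runs a full time slice.
    return (n_vcpus - 1) * slice_ms

def switch_overhead(slice_ms, switch_cost_ms):
    # Fraction of pCPU time burned on context switches.
    return switch_cost_ms / (slice_ms + switch_cost_ms)

# 3 vCPUs per pCPU, 0.01 ms per context switch (assumed).
for slice_ms in (30.0, 0.1):  # default slice vs. micro slice
    print(slice_ms,
          waiting_time_ms(3, slice_ms),
          round(switch_overhead(slice_ms, 0.01), 4))
```

With a 30 ms slice, a vCPU can wait 60 ms but switch overhead is negligible; with a 0.1 ms slice, waiting drops to 0.2 ms but roughly 9% of CPU time goes to switching, which is why applying the short slice everywhere is too costly.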
Approach: Dividing CPUs into Two Pools
[Figure: pCPU 0 with the normal time slice and pCPU 3 with a shortened time slice, each time-sharing vCPUs 0, 1, and 2]

➊ Normal pool: serving the main work of applications
➋ Micro-sliced pool: serving critical OS services to minimize the waiting time (vCPUs are scheduled quickly but briefly)
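The two-pool split can be modeled as below (a minimal sketch; the class, the placement policy, and the slice lengths are illustrative assumptions, not the paper's implementation):

```python
NORMAL_SLICE_MS = 30.0  # typical hypervisor default slice (assumed)
MICRO_SLICE_MS = 0.1    # shortened slice for the micro pool (assumed)

class CpuPools:
    def __init__(self, pcpus, n_micro):
        self.micro = pcpus[:n_micro]    # micro-sliced pool
        self.normal = pcpus[n_micro:]   # normal pool

    def slice_for(self, pcpu):
        return MICRO_SLICE_MS if pcpu in self.micro else NORMAL_SLICE_MS

    def place(self, is_critical):
        # Critical OS work is scheduled quickly but briefly on the
        # micro-sliced pool; application work keeps the full slice.
        pool = self.micro if is_critical else self.normal
        return pool[0]  # trivial placement policy

pools = CpuPools(["P0", "P1", "P2", "P3"], n_micro=1)
print(pools.place(is_critical=True))    # the micro-sliced core
print(pools.place(is_critical=False))   # a normal core
```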
Challenges in Serving Critical OS Services on Micro-sliced Cores
1. Precise detection of urgent tasks
2. Guest OS transparency
3. Dynamic adjustment of micro-sliced cores
Detecting Critical OS Services
➊ Examining the instruction pointer (a.k.a. PC, e.g., 0x8106ed62) whenever a vCPU yields its pCPU

| Workload | # of yields (solo) | # of yields (co-run*) |
|---|---|---|
| gmake | 79,440 | 295,262,662 |
| exim | 157,023 | 24,102,495 |
| dedup | 290,406 | 164,578,839 |
| vips | 644,643 | 57,650,538 |

* Concurrently running with Swaptions of PARSEC
Profiling Virtual CPU Scheduling Logs
Kernel symbol tables
| Module | Operations |
|---|---|
| sched | scheduler_ipi(), resched_curr(), … |
| mm | flush_tlb_all(), get_page_from_freelist(), … |
| irq | irq_enter(), irq_exit(), … |
| spinlock | __raw_spin_unlock(), __raw_spin_unlock_irq(), … |
The full table can be found in our paper
Critical Guest OS Components
vCPU scheduling trace (w/ instruction pointer)
Instruction pointers and kernel symbols enable us to precisely detect vCPUs preempted while executing critical OS services, without modifying the guest OS
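The lookup from an instruction pointer to a kernel symbol can be sketched as a binary search over a sorted symbol table, in the spirit of System.map / kallsyms (the table below is a hypothetical miniature, not real kernel addresses):

```python
import bisect

# (start address, symbol), sorted by address.
SYMBOLS = [
    (0x8106ed00, "__raw_spin_unlock"),
    (0x8106ee40, "flush_tlb_all"),
    (0x81070000, "get_page_from_freelist"),
]
STARTS = [addr for addr, _ in SYMBOLS]
CRITICAL = {"__raw_spin_unlock", "flush_tlb_all", "get_page_from_freelist"}

def symbol_for(ip):
    # The enclosing symbol is the last one whose start address
    # is <= the instruction pointer.
    i = bisect.bisect_right(STARTS, ip) - 1
    return SYMBOLS[i][1] if i >= 0 else None

def preempted_in_critical_code(ip):
    return symbol_for(ip) in CRITICAL

print(symbol_for(0x8106ed62))  # __raw_spin_unlock
```

An IP below the first symbol maps to no symbol, so user-level yields are never misclassified as critical kernel work.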
Accelerating Critical Sections
➊ Yield occurring
➋ Investigating the preempted vCPUs
➌ Scheduling the selected vCPU on the micro-sliced pool

[Figure: pCPUs P0 to P3; the vCPU preempted inside a critical section is moved to the micro-sliced core]
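Steps ➊ to ➌ can be sketched as a yield handler (function and variable names are illustrative, not the actual hypervisor code):

```python
def on_yield(preempted_vcpus, ip_is_critical, micro_pool_queue):
    """➊ A vCPU has yielded; ➋ scan the preempted vCPUs and
    ➌ queue any vCPU caught in critical OS code on the micro pool."""
    boosted = []
    for vcpu_id, ip in preempted_vcpus:       # ➋ investigate
        if ip_is_critical(ip):                # preempted in a critical section?
            micro_pool_queue.append(vcpu_id)  # ➌ run on the micro-sliced pool
            boosted.append(vcpu_id)
    return boosted

critical_ips = {0x8106ed62}          # e.g., inside a spin-unlock path
preempted = [(1, 0x8106ed62), (2, 0x00400123)]
queue = []
print(on_yield(preempted, lambda ip: ip in critical_ips, queue))  # [1]
```

Only vCPU 1 is boosted; vCPU 2 was preempted in user code and keeps waiting its normal turn.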
Accelerating Critical TLB Synchronizations
➊ Yield occurring
➋ Investigating the preempted vCPUs
➌ Scheduling the selected vCPU on the micro-sliced pool
➍ Dynamically adjusting micro-sliced cores based on profiling

[Figure: pCPUs P0 to P3; the vCPU blocked on TLB synchronization is moved to the micro-sliced core, and the number of micro-sliced cores is adjusted over time]
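Step ➍ can be sketched as a periodic policy (the thresholds and the cap are assumptions for illustration): grow the micro-sliced pool when profiling shows many yields catching vCPUs in critical OS code, and shrink it back when the critical work subsides.

```python
def adjust_micro_cores(n_micro, critical_ratio,
                       lo=0.05, hi=0.20, n_max=2):
    """critical_ratio: fraction of recent yields that found a vCPU
    preempted in critical OS code (from profiling)."""
    if critical_ratio > hi and n_micro < n_max:
        return n_micro + 1  # heavy kernel contention: add a micro core
    if critical_ratio < lo and n_micro > 0:
        return n_micro - 1  # little critical work: reclaim the core
    return n_micro

print(adjust_micro_cores(1, 0.30))  # grow: heavy kernel contention
print(adjust_micro_cores(1, 0.01))  # shrink: mostly user-level work
print(adjust_micro_cores(1, 0.10))  # stay put
```

The hysteresis band between `lo` and `hi` avoids flapping cores between the two pools on every profiling interval.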
Detecting Critical I/O Events
I/O handling consists of a chain of operations, potentially involving multiple vCPUs

[Figure: ➊ a physical IRQ (pIRQ) arrives at the hypervisor, ➋ a virtual IRQ (vIRQ) is injected into a guest vCPU, ➌ that vCPU sends a virtual IPI (vIPI) to another vCPU]
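Tracking that chain can be sketched as below (the event representation is an illustrative assumption): every guest vCPU that the physical interrupt fans out to, via virtual IRQ and then virtual IPI, is marked urgent so it reaches the micro-sliced pool promptly.

```python
def urgent_vcpus(events):
    """events: (kind, vcpu) pairs in delivery order; the pIRQ is
    handled by the hypervisor, so it names no guest vCPU."""
    urgent = []
    for kind, vcpu in events:
        # vIRQ and vIPI name the guest vCPUs on the I/O critical path.
        if kind in ("vIRQ", "vIPI") and vcpu not in urgent:
            urgent.append(vcpu)
    return urgent

chain = [("pIRQ", None), ("vIRQ", 0), ("vIPI", 1)]
print(urgent_vcpus(chain))  # both vCPUs on the chain are urgent
```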
Experimental Environments
[Figure: two guest VMs (OS + applications) consolidated on the Xen hypervisor]

12 virtual CPUs per VM on 12 physical threads
2-to-1 consolidation ratio
Performance of Micro-sliced Cores

[Bar charts: performance results of our schemes; an 8% gap is highlighted]
I/O Performance
|  | Workloads |
|---|---|
| VM-1 | iPerf, lookbusy |
| VM-2 | lookbusy |
Conclusions
Thank You! jsahn@ajou.ac.kr