1 of 19

Accelerating Critical OS Services in Virtualized Systems with Flexible Micro-sliced Cores

Jeongseob Ahn*, Chang Hyun Park, Taekyung Heo, Jaehyuk Huh


2 of 19

Challenge of Server Consolidation

[Figure: distribution of server utilization (0 to 1) vs. fraction of time]

Consolidation improves system utilization

However, resources are contended

4 of 19

So, What Can Happen?

  • Virtual time discontinuity

[Figure: vCPU 0 and vCPU 1 time-sharing pCPU 0, showing physical time vs. virtual time, running and waiting periods, and VMEXIT/VMENTER transitions]

5 of 19

So, What Can Happen?

  • Virtual time discontinuity

[Figure: same vCPU time-sharing diagram as the previous slide]

➊ Spinlock waiting time (gmake), avg. waiting time in μsec

  Kernel component    solo    co-run*
  Page reclaim        1.03      420.13
  Page allocator      3.42    1,053.26
  Dentry              2.93    1,298.87
  Runqueue            1.22      256.07

➋ TLB synchronization latency (μsec)

  Workload           Avg.    Min.      Max.
  dedup  solo          28      5      1,927
  dedup  co-run*    6,354      7     74,915
  vips   solo          55      5      2,052
  vips   co-run*   14,928     17    121,548

➌ I/O latency & throughput (iPerf)

                  Jitter (ms)   Throughput (Mbits/sec)
  solo               0.0043       936.3
  mixed co-run*      9.2507       435.6

Processing time is amplified

* Concurrently running with Swaptions of PARSEC

6 of 19

How about Shortening Time Slice?

 

[Figure: vCPUs 0, 1, and 2 time-sharing pCPU 0; with time slice T each vCPU waits for the others' full slices, and a shortened time slice reduces the waiting time]

Shortening the time slice is very simple and powerful, but the overhead of frequent context switches is significant
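As a rough back-of-the-envelope illustration (our own numbers, not measurements from the talk), the trade-off can be seen by comparing the worst-case waiting time with the relative cost of switching, assuming round-robin sharing of one pCPU and an assumed fixed context-switch cost:

```c
/* Back-of-the-envelope model of the time-slice trade-off.
 * Assumptions (illustrative, not measured): round-robin sharing of one
 * pCPU among nvcpus, and a fixed per-context-switch cost. */
#include <stdio.h>

int main(void) {
    const double switch_cost_us = 5.0;   /* assumed cost of one context switch */
    const int nvcpus = 3;                /* vCPUs sharing one pCPU */
    const double slices_ms[] = {30.0, 1.0, 0.1};  /* Xen default vs. shorter slices */

    for (int i = 0; i < 3; i++) {
        double slice_us = slices_ms[i] * 1000.0;
        /* Worst case: a ready vCPU waits for the other vCPUs' full slices. */
        double max_wait_us = (nvcpus - 1) * slice_us;
        /* Fraction of CPU time burned on switching between slices. */
        double overhead = switch_cost_us / (slice_us + switch_cost_us);
        printf("slice %6.1f ms: max wait %8.1f us, switch overhead %5.2f%%\n",
               slices_ms[i], max_wait_us, overhead * 100.0);
    }
    return 0;
}
```

Shrinking the slice from 30 ms to 0.1 ms cuts the worst-case wait from tens of milliseconds to hundreds of microseconds, but the switching overhead grows from a negligible fraction to several percent, which motivates applying the short slice to only a few cores.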

7 of 19

Approach: Dividing CPUs into Two Pools

[Figure: vCPUs 0-2 split across two CPU pools: pCPU 0 keeps the normal time slice, pCPU 3 uses a shortened time slice]

➊ Normal pool: serves the main work of applications with the default time slice

➋ Micro-sliced pool: serves critical OS services with a shortened time slice, scheduling vCPUs quickly but briefly to minimize their waiting time
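A minimal sketch of the two-pool idea (the structures and names below are ours, not Xen's; only the 30 ms and 0.1 ms slice lengths come from the talk):

```c
/* Minimal sketch of the two-pool split (names and values are ours, not
 * Xen's): each pool has its own scheduling quantum, and vCPUs flagged
 * as running a critical OS service go to the micro-sliced pool. */
#include <stdbool.h>
#include <stdio.h>

enum pool_id { NORMAL_POOL, MICRO_POOL };

struct cpu_pool {
    const char   *name;
    unsigned long timeslice_us;   /* scheduling quantum for this pool */
};

static const struct cpu_pool pools[] = {
    [NORMAL_POOL] = { "normal",       30000 },  /* 30 ms, the Xen default */
    [MICRO_POOL]  = { "micro-sliced",   100 },  /* 0.1 ms micro slice */
};

/* vCPUs preempted inside a critical OS service get the micro-sliced
 * pool; everything else keeps running on the normal pool. The flag is
 * produced by the instruction-pointer check described later. */
static enum pool_id place_vcpu(bool in_critical_service)
{
    return in_critical_service ? MICRO_POOL : NORMAL_POOL;
}

int main(void)
{
    enum pool_id p = place_vcpu(true);
    printf("critical vCPU -> %s pool (%lu us slice)\n",
           pools[p].name, pools[p].timeslice_us);
    return 0;
}
```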

8 of 19

Challenges in Serving Critical OS Services on Micro-sliced Cores

1. Precise detection of urgent tasks

2. Guest OS transparency

3. Dynamic adjustment of micro-sliced cores

9 of 19

Detecting Critical OS Services

Examining the instruction pointer (a.k.a. PC) whenever a vCPU yields its pCPU

[Figure: instruction pointer captured at the yield, e.g., 0x8106ed62]

  Workload     # of yields (solo)   # of yields (co-run*)
  gmake              79,440             295,262,662
  exim              157,023              24,102,495
  dedup             290,406             164,578,839
  vips              644,643              57,650,538

  * Concurrently running with Swaptions of PARSEC
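As an illustration of this step, the hypervisor can log the guest instruction pointer at every yield; the sketch below uses hypothetical stub accessors rather than the actual Xen interfaces:

```c
/* Hypothetical sketch: record the guest instruction pointer whenever a
 * vCPU gives up its pCPU, so it can later be matched against kernel
 * symbols. The accessor functions are stand-ins for the hypervisor's
 * real ones (stubbed here so the sketch compiles and runs). */
#include <stdint.h>
#include <stdio.h>

struct vcpu_sched_event {
    int      vcpu_id;
    uint64_t guest_rip;   /* where the guest was executing when preempted */
    uint64_t tsc;         /* timestamp for the scheduling trace */
};

/* Stubs standing in for real hypervisor accessors. */
static int      current_vcpu(void)     { return 0; }
static uint64_t guest_rip_of(int vcpu) { (void)vcpu; return 0x8106ed62ull; }
static uint64_t read_tsc(void)         { return 0; }
static void     trace_append(const struct vcpu_sched_event *ev)
{
    printf("vCPU %d yielded at RIP 0x%llx\n",
           ev->vcpu_id, (unsigned long long)ev->guest_rip);
}

/* Called from the scheduler on every yield/preemption (VMEXIT path). */
static void on_vcpu_yield(void)
{
    struct vcpu_sched_event ev = {
        .vcpu_id   = current_vcpu(),
        .guest_rip = guest_rip_of(current_vcpu()),
        .tsc       = read_tsc(),
    };
    trace_append(&ev);   /* consumed by the profiling step on the next slide */
}

int main(void) { on_vcpu_yield(); return 0; }
```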

10 of 19

Profiling Virtual CPU Scheduling Logs

  • Investigating frequently preempted regions

[Figure: vCPU scheduling trace (w/ instruction pointers) matched against the kernel symbol tables to identify critical guest OS components]

Kernel symbol table (excerpt)

  Module      Operations
  sched       scheduler_ipi(), resched_curr()
  mm          flush_tlb_all(), get_page_from_freelist()
  irq         irq_enter(), irq_exit()
  spinlock    __raw_spin_unlock(), __raw_spin_unlock_irq()

The detailed table is available in our paper

The instruction pointer and kernel symbols enable precise detection of vCPUs preempted while executing critical OS services, without any guest OS modification
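To make the matching step concrete, here is a sketch of resolving a captured instruction pointer to its nearest kernel symbol; the addresses and the set of "critical" symbols below are illustrative, not the paper's exact table:

```c
/* Sketch of matching a captured guest RIP against a kernel symbol table
 * (e.g., addresses parsed from the guest's System.map). The addresses
 * and the critical-symbol list below are illustrative only. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

struct ksym { uint64_t addr; const char *name; };

/* Sorted by address; a real table would be loaded from System.map. */
static const struct ksym symtab[] = {
    { 0xffffffff81060000ull, "resched_curr" },
    { 0xffffffff81061000ull, "scheduler_ipi" },
    { 0xffffffff81070000ull, "flush_tlb_all" },
    { 0xffffffff81080000ull, "__raw_spin_unlock" },
    { 0xffffffff81090000ull, "irq_exit" },
};
static const int nsyms = sizeof(symtab) / sizeof(symtab[0]);

/* Resolve rip to the symbol with the greatest address <= rip. */
static const char *resolve(uint64_t rip)
{
    int lo = 0, hi = nsyms - 1, best = -1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if (symtab[mid].addr <= rip) { best = mid; lo = mid + 1; }
        else hi = mid - 1;
    }
    return best >= 0 ? symtab[best].name : NULL;
}

/* A preemption is "critical" if the guest was inside one of the
 * profiled operations (spinlock release, TLB flush, reschedule IPI, ...). */
static bool critical_preemption(uint64_t rip)
{
    const char *sym = resolve(rip);
    return sym && (strstr(sym, "spin_unlock") || strstr(sym, "flush_tlb") ||
                   strstr(sym, "scheduler_ipi") || strstr(sym, "resched_curr") ||
                   strstr(sym, "irq_exit"));
}

int main(void)
{
    uint64_t rip = 0xffffffff81080010ull;   /* falls inside __raw_spin_unlock */
    printf("RIP %#llx -> %s (%s)\n", (unsigned long long)rip, resolve(rip),
           critical_preemption(rip) ? "critical" : "not critical");
    return 0;
}
```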

11 of 19

Accelerating Critical Sections

➊ Yield occurring

➋ Investigating the preempted vCPUs

➌ Scheduling the selected vCPU on the micro-sliced pool

[Figure: physical CPUs P0-P3; the selected vCPU moves to the micro-sliced core]
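Putting the three steps together, the scheduler-side glue might look like the sketch below (everything outside handle_yield() is a stub; the paper implements this inside the Xen scheduler):

```c
/* Hypothetical glue for the three steps on this slide: on a yield,
 * inspect the preempted vCPUs and move one that was interrupted inside
 * critical OS code to the micro-sliced pool. Everything but
 * handle_yield() is a stub so the sketch is self-contained. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct vcpu { int id; uint64_t saved_rip; bool critical; };

/* Stub "runqueue" of currently preempted vCPUs. */
static struct vcpu preempted[] = {
    { 0, 0x81212340ull, false },   /* not in a critical kernel region */
    { 1, 0x8106ed62ull, true  },   /* e.g., inside a spinlock release */
};
static const int npreempted = sizeof(preempted) / sizeof(preempted[0]);

/* Stand-in for the RIP-vs-symbol check from the profiling sketch. */
static bool preempted_in_critical_region(const struct vcpu *v) { return v->critical; }

static void migrate_to_micro_pool(struct vcpu *v)
{
    printf("vCPU %d -> micro-sliced pool\n", v->id);
}

/* Step 1: a vCPU yields its pCPU (e.g., it spins on a lock held by a
 * preempted vCPU). Step 2: investigate the preempted vCPUs. Step 3:
 * run the critical one quickly but briefly on the micro-sliced pool. */
static void handle_yield(void)
{
    for (int i = 0; i < npreempted; i++) {
        if (preempted_in_critical_region(&preempted[i])) {
            migrate_to_micro_pool(&preempted[i]);
            break;
        }
    }
}

int main(void) { handle_yield(); return 0; }
```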

12 of 19

Accelerating Critical TLB Synchronizations

➊ Yield occurring

➋ Investigating the preempted vCPUs

➌ Scheduling the selected vCPU on the micro-sliced pool

➍ Dynamically adjusting micro-sliced cores based on profiling

[Figure: physical CPUs P0-P3; micro-sliced cores are added or removed based on the profiling results]
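Step 4 could be realized with a simple feedback policy like the sketch below; the thresholds and interval handling are our own assumptions, not the paper's actual mechanism:

```c
/* Illustrative policy for step 4 (thresholds and values are made up):
 * periodically grow or shrink the micro-sliced pool based on how many
 * critical preemptions the profiler saw in the last interval. */
#include <stdio.h>

#define GROW_THRESHOLD   1000u  /* assumed: many critical preemptions */
#define SHRINK_THRESHOLD  100u  /* assumed: micro-sliced cores mostly idle */
#define MAX_MICRO_CORES     2u

static unsigned micro_cores = 1;          /* current micro-sliced pool size */

static void adjust_micro_pool(unsigned critical_events_last_interval)
{
    if (critical_events_last_interval > GROW_THRESHOLD && micro_cores < MAX_MICRO_CORES)
        micro_cores++;                    /* dedicate one more core */
    else if (critical_events_last_interval < SHRINK_THRESHOLD && micro_cores > 0)
        micro_cores--;                    /* return a core to the normal pool */
}

int main(void)
{
    unsigned samples[] = { 5000, 20, 20, 3000 };   /* fake profiling intervals */
    for (int i = 0; i < 4; i++) {
        adjust_micro_pool(samples[i]);
        printf("interval %d: %u micro-sliced core(s)\n", i, micro_cores);
    }
    return 0;
}
```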

13 of 19

Detecting Critical I/O Events

I/O handling consists of a chain of operations that potentially involves multiple vCPUs

[Figure: I/O handling chain: physical IRQ (pIRQ), virtual IRQ (vIRQ) injected into a vCPU, and virtual IPI (vIPI) to another vCPU]
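One way to act on this observation, sketched below with illustrative types, is to boost any preempted vCPU targeted by an event in the chain onto the micro-sliced pool; the paper's actual mechanism may differ in detail:

```c
/* Sketch of acting on the I/O chain (pIRQ -> vIRQ -> vIPI): if an event
 * targets a vCPU that is currently preempted, schedule it on the
 * micro-sliced pool so the chain completes quickly. Types and helpers
 * are illustrative, not the paper's implementation. */
#include <stdbool.h>
#include <stdio.h>

enum io_event { PHYS_IRQ, VIRT_IRQ, VIRT_IPI };

struct vcpu { int id; bool running; };

static void migrate_to_micro_pool(struct vcpu *v)
{
    printf("boosting vCPU %d on the micro-sliced pool\n", v->id);
}

static void on_io_event(enum io_event ev, struct vcpu *target)
{
    (void)ev;   /* every stage of the chain is treated as urgent here */
    if (!target->running)
        migrate_to_micro_pool(target);   /* keep the I/O chain moving */
}

int main(void)
{
    struct vcpu v0 = { 0, true }, v1 = { 1, false };
    on_io_event(VIRT_IRQ, &v0);   /* already running: nothing to do */
    on_io_event(VIRT_IPI, &v1);   /* preempted target: boost it */
    return 0;
}
```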

14 of 19

Experimental Environments

  • Testbed
    • 12 HW threads (Intel Xeon)
    • 2 VMs with 12 vCPUs each
    • Xen hypervisor 4.7

  • Benchmarking workloads
    • dedup and vips from PARSEC
    • exim and gmake from MOSBENCH

  • Pool configuration
    • Normal: 30ms (Xen default)
    • Micro-sliced: 0.1ms

[Figure: two VMs (OS + App), each with 12 virtual CPUs, consolidated on 12 physical threads over the Xen hypervisor, a 2-to-1 consolidation ratio]

15 of 19

Performance of Micro-sliced Cores

[Figure: performance of the workloads on micro-sliced cores, with bars for our schemes highlighted]

16 of 19

Performance of Micro-sliced Cores

[Figure: same performance comparison as the previous slide, with our schemes annotated and an 8% gap highlighted]

17 of 19

I/O Performance

Workloads

  VM-1: iPerf, lookbusy
  VM-2: lookbusy

18 of 19

Conclusions

  • We introduced a new approach to mitigate the virtual time discontinuity problem

  • Three distinct contributions
    • Precise detection of urgent tasks
    • Guest OS transparency
    • Dynamic adjustment of the micro-sliced cores

  • The overhead is very low for applications that do not frequently use OS services

19 of 19

Thank You!   jsahn@ajou.ac.kr

Jeongseob Ahn*, Chang Hyun Park, Taekyung Heo, Jaehyuk Huh

Accelerating Critical OS Services in Virtualized Systems with Flexible Micro-sliced Cores
