1 of 46

KVM Forum 2024, Brno, Czech Republic: Unleashing SR-IOV on Virtual Machines

Yui Washizu

NTT Open Source Software Center
yui.washidu@gmail.com

Akihiko Odaki

Daynix Computing Ltd.
akihiko.odaki@daynix.com

2 of 46

Introduction

Unleashing SR-IOV on Virtual Machines

2

3 of 46

Multi-tenant cloud environments

Two goals of multi-tenant cloud environments:

  • Cost-effective
    • A primary motivator for multi-tenant cloud deployments
    • Maximize resource utilization by eliminating the need for infrastructure dedicated to each tenant
  • Secure
    • Each tenant is independent
    • Each tenant can only view and configure its own resources

3

4 of 46

Single Root I/O Virtualization (SR-IOV)

SR-IOV achieves the two goals of multi-tenant cloud environments

  • A single PCIe device presents Physical and Virtual Functions
    • The Physical Function (PF) allows the host to configure VFs
    • Virtual Functions (VFs) are independently assigned to containers or VMs
  • Cost-effective: Reduce network virtualization overhead
    • Configure the offloading of packet switching with the PF
  • Secure: Network functions are allocated exclusively
  • A problem arises in some situations (next slide)

4

5 of 46

Problem with offloading container networks on VMs

Containers on VMs require their own virtual network

  • However, VMs cannot offload such a virtual network to hardware
      • The host controls the PF exclusively to construct a virtual network for VMs

5

[Figure: Deploying to a physical machine vs. a virtual machine. On the physical server (host), the admin or network construction software can access the SmartNIC's PF and control the NW for the container's VF. In a VM (container host) that only receives a VF, it can't access a PF or control the NW, since a VF cannot present VFs of its own.]

6 of 46

Our proposal

Unleashing SR-IOV on Virtual Machines

6

7 of 46

Proposal: SR-IOV emulation

Emulate SR-IOV with VMM

  • VMs gain their own PFs (=“virtual PFs”) and corresponding VFs (=“virtual VFs”)
  • VMs can configure network offloading through virtual PFs

7

[Figure: With SR-IOV emulation, the VM (container host) sees a virtual PF and a virtual VF backed by VFs of the SmartNIC on the physical server (host). The admin or network construction software in the VM can access the (virtual) PF and configure the virtual NW for the container.]
8 of 46

Advantages of SR-IOV emulation

  • Consistent with bare-metal system
    • Allows the use of existing network construction software (e.g. SR-IOV CNI plugin)
  • Scalable: Eliminates the need to assign one PF per VM
  • Secure: Users cannot control the entire NIC hardware from within a VM

8

9 of 46

Avoiding emulation overhead with vDPA

SR-IOV emulation solely governs the control path

  • The data path can still be offloaded with vDPA (virtio Data Path Acceleration)

9

[Figure: With SR-IOV emulation, QEMU presents a virtual PF and a virtual VF to the VM. The control path is handled by QEMU's emulation, while the data path of each virtio-net device goes through vdpa to a VF of the SmartNIC, whose offloaded L2SW switches the packets.]

10 of 46

Adapting SR-IOV emulation to other use cases

Replace backends for other use cases

  • E.g., use tap or user backends to test the guest OS's virtio-net driver with SR-IOV

10

[Figure: For acceleration, the virtual PF and virtual VF are backed by vdpa devices on SmartNIC VFs behind the offloaded L2SW. For testing & debugging, the same virtio-net devices are instead backed by TAP devices attached to a Linux Bridge on the host OS.]

11 of 46

Interface for virtio SR-IOV embedded switch

virtio 1.3 will expose SR-IOV embedded switch capability as device groups

  • A device group allows its owner device to control its member devices
  • Some features using device groups have already been proposed:
    • Extend the virtio spec for packet switching based on device groups

11

12 of 46

Future work: offload packet switching

Provide a comprehensive solution for NW offloading on VMs with OVS

  • QEMU processes TC configuration from within VMs
  • Host's OVS configures offloading
    • A bridge in OVS accommodates the virtual PF and VFs

12

[Figure: tc commands issued against the virtual PF inside the VM are processed by QEMU. On the host, OVS has a bridge (for the guest) that accommodates the virtual PF and VFs; ovs-ofctl add-flow configures offloading, and the flows reach the SmartNIC's offloaded L2SW through the PF driver's TC flower interface.]

13 of 46

Performance Verification

Unleashing SR-IOV on Virtual Machines

13

14 of 46

Verification target’s setup

Confirmed 2x performance improvement with vDPA in the following setup:

14

Server model: HPE ProLiant DL360 Gen10
CPU: Intel Xeon CPU 4210R @2.4GHz
NIC: Mellanox Technologies MT27710 family ConnectX-6 Dx (100G)
Host/Guest OS: Rocky Linux 9.2
QEMU version: QEMU 8.1.1 (w/ virtio SR-IOV emulation patch applied)
Kubernetes version: 1.27.6
CNI plugin: Calico v3.26.3
SR-IOV CNI plugin: 2.7.0 (*)
netperf: 2.7

(*) with slight modification to adapt to virtio's sysfs

15 of 46

Environments

Compare the following 2 environments

  • Using SR-IOV VFs as the backends and netperf as a measurement tool

15

Without offloading in VM

  • Baseline

With offloading in VM

  • SR-IOV CNI configures virtual VFs

[Figure: Without offloading in the VM, a VF of the physical NIC is passed to the VM by device assignment as a virtio-net device, and the container reaches the external server through the software Kubernetes NW. With offloading in the VM, the SR-IOV CNI configures a virtual VF backed by vDPA on a VF of the physical NIC (*). (*) The host configures L2 packet switching.]

16 of 46

Metrics and section

Verification metrics are throughput and latency

  • Metrics: Throughput and latency
  • Measurement section: Between a container on the VM and an external machine

16

[Figure: Same two setups as the previous slide; the measurement section spans from the container on the VM to the external server in both environments. (*) The host configures L2 packet switching.]

17 of 46

Throughput

  • Measuring method
    • Average throughput of UDP bulk transfer with netperf
  • Results
    • Transmission: x 2 (2874.1 Mbps → 5000.1 Mbps)
    • Reception: x 2 (1827.8 Mbps → 3289.1 Mbps)

17

18 of 46

Latency

  • Measuring method
    • 99%ile of UDP round trip time using netperf
  • Results
    • - 100 μsec (327 μsec → 236 μsec)

18

19 of 46

Development of SR-IOV Emulation in QEMU

Unleashing SR-IOV on Virtual Machines

19

20 of 46

History of SR-IOV emulation

20

21 of 46

Adding SR-IOV to virtio-net-pci

Paravirtualized, unlike igb (physical device emulation)

Challenge: flexible configuration

  • Varying number of VFs
    • igb: Fixed number of VFs
  • Network backends (netdev)
    • igb: One backend and hardware-defined packet switching

21

22 of 46

Conventional PCI multifunction

Just specify multifunction and addr:

-netdev user,id=n -netdev user,id=o
-netdev user,id=p -netdev user,id=q
-device pcie-root-port,id=b
-device virtio-net-pci,netdev=q,bus=b,addr=0x0.0x3
-device virtio-net-pci,netdev=p,bus=b,addr=0x0.0x2
-device virtio-net-pci,netdev=o,bus=b,addr=0x0.0x1
-device virtio-net-pci,netdev=n,bus=b,addr=0x0.0x0,multifunction=on

22

23 of 46

Composable SR-IOV device

Add the sriov-pf property:

-netdev user,id=n -netdev user,id=o
-netdev user,id=p -netdev user,id=q
-device pcie-root-port,id=b
-device virtio-net-pci,netdev=q,bus=b,addr=0x0.0x3,sriov-pf=f
-device virtio-net-pci,netdev=p,bus=b,addr=0x0.0x2,sriov-pf=f
-device virtio-net-pci,netdev=o,bus=b,addr=0x0.0x1,sriov-pf=f
-device virtio-net-pci,netdev=n,bus=b,addr=0x0.0x0,id=f

The implementation is a bit more complicated though.

23

24 of 46

SR-IOV as guest-controlled hotplugging

The VF lifetime is controlled by the guest

  1. The guest configures the resource allocation in the device
  2. The guest enables VFs

Similar to hotplug, but the guest expects:

  • Function numbers are already allocated
  • Underlying resources (e.g., netdev) are already allocated

…because physical devices do

24

25 of 46

Issues with hotplugging

Today: literally hotplugging VFs as the guest requests.

  • Problem 1: function numbers and resources are not reserved
  • Problem 2: ad-hoc CLI
    • The conventional CLI immediately plugs VFs instead of hotplugging

-netdev user,id=n -netdev user,id=o
-netdev user,id=p -netdev user,id=q
-device pcie-root-port,id=b
-device virtio-net-pci,bus=b,addr=0x0.0x3,netdev=q,sriov-pf=f
-device virtio-net-pci,bus=b,addr=0x0.0x2,netdev=p,sriov-pf=f
-device virtio-net-pci,bus=b,addr=0x0.0x1,netdev=o,sriov-pf=f
-device virtio-net-pci,bus=b,addr=0x0.0x0,netdev=n,id=f

25

26 of 46

Avoiding hotplugging

[PATCH v16 00/13] hw/pci: SR-IOV related fixes and improvements:

  1. Realize the VFs when the paired PF gets realized
    • Reserves function numbers and resources
    • VFs can be added with normal -device command line
  2. But leave them disabled
    • Reuses the code to power down PCI devices
  3. Enable them when the guest requests

26

27 of 46

Validation of SR-IOV device configuration

SR-IOV imposes several restrictions:

  • SR-IOV requires PCI Express.
  • A function cannot be a PF and VF at the same time.
  • A pair of PF and VFs must be on the same bus.
  • The vendor and device ID of the VFs must be consistent.
  • The memory region configurations of the VFs must be consistent.
  • The IDs of VFs must be linear (i.e., have a consistent stride).
  • VFs do not implement Expansion ROM.

To satisfy these requirements:

  • Check them when composing an SR-IOV device[1][2][3]
  • Limit SR-IOV device composition to virtio-net-pci as a precaution[4]

27

28 of 46

Summary

Unleashing SR-IOV on Virtual Machines

28

29 of 46

Summary

  • Background: virtual network offloading with SR-IOV
    • Emulating SR-IOV can give the guest the configurability needed for container networks
    • Proposals to exploit the emulated SR-IOV capability
    • The SR-IOV emulation will be immediately useful for testing
  • Benchmarks show a big win with offloaded packet switching:
    • Throughput (Tx): x 2 (2874.1 Mbps → 5000.1 Mbps)
    • Throughput (Rx): x 2 (1827.8 Mbps → 3289.1 Mbps)
    • Latency: - 100 μsec (327 μsec → 236 μsec)
  • We aim to land the SR-IOV emulation for virtio-net-pci in QEMU 9.2

29

30 of 46

Introduction

Unleashing SR-IOV on Virtual Machines

30

31 of 46

Multi-tenant cloud environments

Two goals of multi-tenant cloud environments:

  • Cost-effectiveness
    • A primary motivator for multi-tenant cloud deployments
    • Maximize resource utilization and avoid the need for infrastructure dedicated to each tenant
  • Security
    • Each tenant is independent
    • Each tenant can only view and configure their own resources

31

32 of 46

Single Root I/O Virtualization (SR-IOV)

A single PCIe device presents Physical and Virtual Functions

  • Roles of the Physical Function and Virtual Functions:
    • The Physical Function (PF) allows the host to configure VFs
    • Virtual Functions (VFs) are assigned to containers or VMs
  • Reduce network virtualization overhead
    • Configure the offloading of packet switching with the PF
  • A problem arises in a specific workload

32

33 of 46

The problem with offloading container networks on VMs

VMs cannot access the SR-IOV PF to control the network

  • Objective: Virtual network for containers
  • Container virtual networks cannot be offloaded using SR-IOV
    • The PF is only available on the physical machine

33

[Figure: Deploying to a physical machine vs. a virtual machine. On the physical server (host), the user or network construction software can access the SmartNIC's PF and control the NW. In the VM (container host), it can't access the PF or control the NW, since a VF cannot present VFs of its own.]

34 of 46

Our proposal

Unleashing SR-IOV on Virtual Machines

34

35 of 46

Proposal: SR-IOV emulation

Emulate PCIe devices with SR-IOV on QEMU

  • Allow using virtio SR-IOV in VMs
  • Able to create new VFs (=“virtual VFs”) through a PF (=“virtual PF”) in a VM
  • Also able to handle hardware control requests from a VM through the virtual PF in the future

35

[Figure: With SR-IOV emulation, the VM (container host) gets a virtual PF and a virtual VF backed by VFs of the SmartNIC on the physical server (host); the user or network construction software can access the (virtual) PF and configure the virtual NW for the container.]

36 of 46

Advantage of SR-IOV emulation

  • Consistent with bare-metal system
    • Allows the use of existing network construction software (e.g. SR-IOV CNI plug-in)
    • Facilitates the offloading of more advanced network features in the future
  • Scalable: Eliminates the need to assign one PF per VM
  • Secure: Users cannot control the entire NIC hardware from within a VM

36

37 of 46

Offload container networks with SR-IOV emulation

Combine with Virtio Data Path Acceleration (vDPA)

  • vDPA offloads only the data plane
  • Allows the host's VF to appear as a virtual PF within a VM

37

[Figure: With SR-IOV emulation, QEMU presents a virtual PF and a virtual VF to the VM; the control plane is handled by QEMU's emulation, while the data plane of each virtio-net device goes through vdpa to a SmartNIC VF behind the offloaded L2SW.]

38 of 46

Adapting SR-IOV emulation to other use cases

Replace backends for other use cases

  • E.g., use tap devices to test the guest OS with SR-IOV

38

[Figure: For acceleration, the virtual PF and virtual VF are backed by vdpa devices on SmartNIC VFs behind the offloaded L2SW. For testing & debugging, the virtio-net devices instead use vhost with tap backends attached to a bridge on the host.]

39 of 46

Implement device groups for virtio with SR-IOV

Helps implement the features related to virtio SR-IOV

39

40 of 46

Future work: offload advanced NW features

Potential solution for advanced network features on VMs with OVS

  • QEMU (+ libvirt etc.) handles TC configuration from within VMs
  • Host's OVS configures offloading
    • The bridge handles the virtual PF and VFs

40

[Figure: The host's OVS bridge (for the guest) accommodates representor (rep) ports for the virtual PF and VFs. ovs-ofctl add-flow configures offloading, QEMU relays tc commands issued against the VM's virtual PF, and the flows reach the SmartNIC's offloaded L2SW through the PF driver's tc flower interface.]

41 of 46

Performance Verification

Unleashing SR-IOV on Virtual Machines

41

42 of 46

Verification target’s setup

We confirmed performance improvement with vDPA in the following setup:

42

Server model: HPE ProLiant DL360 Gen9
CPU: Intel Xeon CPU E5-2600 @2.3GHz
NIC: Mellanox Technologies MT27710 family ConnectX-6 Dx (100G)
Host/Guest OS: Rocky Linux 9.2
QEMU version: QEMU 8.1.1 (w/ virtio SR-IOV emulation patch applied)
Kubernetes version: 1.27.6
CNI plugin: Calico v3.26.3
SR-IOV CNI plugin: 2.7.0 (*)
netperf: 2.7

(*) with slight modification to adapt to virtio's sysfs

43 of 46

Environments

Performance verification in the following 2 environments

  • Using SR-IOV VFs as the backend and netperf as the measurement tool

43

Without offloading in VM

  • Baseline

With offloading in VM

  • SR-IOV CNI configures HW offload on the VM

[Figure: Without offloading in the VM, a VF of the physical NIC is allocated to the VM as a PCI device (virtio-net) and the container uses the Kubernetes NW through software (*). With offloading in the VM, the container's virtual VF is backed by vDPA on a VF of the physical NIC, offloading with SR-IOV. (*) including firewall and NAT]

44 of 46

Metrics and section

Verification metrics are throughput and latency

  • Metrics: Throughput and latency
  • Measurement section: Between a container on the VM and an external machine

44

[Figure: Same two setups as the previous slide; the measured section spans from the container on the VM to the external server in both environments. (*) including firewall and NAT]

45 of 46

Throughput

  • Measuring method
    • Average throughput of UDP bulk transfer with netperf
  • Results
    • Transmission: x 2 (2874.1 Mbps → 5000.1 Mbps)
    • Reception: x 2 (1827.8 Mbps → 3289.1 Mbps)

45

[Chart: Throughput results in Mbps; both transmission and reception roughly double (x 2).]

46 of 46

Latency

  • Measuring method
    • 99%ile of UDP round trip time using netperf
  • Results
    • - 100 μsec (327 μsec → 236 μsec)

46

[Chart: Single-core round-trip latency results in μsec; roughly 100 μsec lower with offloading.]