This document describes the architecture of the virgil3d virtual GPU for providing 3D capabilities to qemu guest operating systems. This document is written from the point of view of running a modern Linux guest graphics stack.
The virgil3d stack initially consists of 5 logical pieces:
qemu virgl “vga” device: This device is a qemu hw device and exposes itself as a vga device inside qemu.
qemu renderer: This code has two parts. The first, inside qemu, sets up SDL/GL for non-virgl rendering so that the BIOS and VESA modes work; this is quite simple. The second is a library
that qemu links against, which provides the OpenGL rendering for the protocol.
Linux kernel KMS virgl driver: This is a standard Linux drm/kms driver that provides memory management and talks to the virgl hw device.
X.org virgl driver: This is an X.org driver that talks to the kernel driver and provides basic 2D rendering and DRI2 support.
Mesa Gallium3D based virgl driver: This is the driver that is built into the 3D virgl_dri.so along with the standard mesa stack and gallium state trackers. This provides OpenGL interfaces in the guest and creates command streams to be sent to the kernel to be transferred to the host
These are some basics that I’d like to state upfront:
1. The guest will never have direct access to the GPU resources; all resource access inside the guest must be done via DMA-like operations called transfers.
2. The current code breaches certain virtio limitations on the size of transfers.
The virgl hw exposes a single virtio queue to the guest for it to use as a primary ring for sending commands to the host. The commands sent on this ring deal with contexts, resources and transfers, along with some basic modesetting tasks. There is also a secondary, indirect command submission that is used to send rendering command streams from userspace tasks.
The hw enforces some security between contexts, and all resources must be bound to a context before being used. All rendering command streams must have an attached context. The guest OS should usually ensure context 0 is used by the kernel.
The rendering command stream consists of a protocol based on the gallium3d states, and contains a set of commands to handle state objects, setup various misc states, send drawing commands, blits and deal with queries.
On the qemu side the renderer uses a single OpenGL context and manages all the OpenGL state necessary to render the command streams from the guest.
This section will go into more depth on the protocol as currently used. This will probably have changed by the time you read this, so I encourage people to ask!
The virtio ring is provided by qemu and the kernel, and you can attach a bunch of scatter-gather pages to it. For this device we treat the first sg page as containing the command that the guest wants the host to execute. These commands are all to be used only by the guest OS kernel.
The commands also contain a flag to emit a fence after the command has executed. Fences are a graphics term for an object that signals after the GPU has finished executing certain commands. Fences are later signalled using irqs.
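As a rough sketch, the ring commands and their fence flag might be laid out as below. The structure and names (`virgl_cmd_hdr`, `VIRGL_CMD_FLAG_FENCE`) are hypothetical, not the real device ABI:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical command header layout; the real virgl ABI may differ.
 * Each command on the ring starts with a header holding the opcode
 * and flags; VIRGL_CMD_FLAG_FENCE asks the host to emit a fence once
 * the command has executed, later signalled to the guest via irq. */
#define VIRGL_CMD_FLAG_FENCE (1u << 0)

struct virgl_cmd_hdr {
    uint32_t cmd;      /* command opcode */
    uint32_t flags;    /* e.g. VIRGL_CMD_FLAG_FENCE */
    uint32_t fence_id; /* fence id to signal, if the flag is set */
};

/* Host side: does this command want a fence signalled after it runs? */
static int virgl_cmd_wants_fence(const struct virgl_cmd_hdr *hdr)
{
    return (hdr->flags & VIRGL_CMD_FLAG_FENCE) != 0;
}
```

The fence id is chosen by the guest, so the kernel driver can match the signalled fence back to the commands it submitted.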
This creates a context in the host with the handle specified by the guest. The guest can manage the context IDs. Currently there is no limit on contexts, but this may change or a limit may be specced as a capability. In the host this will allocate storage for a new context.
This just destroys a context previously created, and destroys any storage allocated for it.
This creates a resource in the host OpenGL implementation using the specified parameters. Resources are specified using a gallium-like interface. Resource formats are specified as per gallium, and the guest kernel should prevent resources of undefined types from being allocated.
Current resource parameters are:
target - the 3D rendering target type for this resource: BUFFER, 1D, 2D, 3D, CUBE, ARRAY etc.
format - Gallium pipe format specifier.
bind - binding flags as per gallium
width/height/depth - note depth is 3D depth not bit depth
array_size - texture array size for GL texture arrays
last_level - the last mipmap level in the resource
nr_samples - number of MSAA samples in the resource.
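The parameter list above could be carried by a structure along the following lines. This is an illustrative sketch; the field names follow gallium conventions but are not the actual wire format:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical encoding of the resource-create parameters listed
 * above; not the real virgl ABI. */
enum virgl_target { VIRGL_TARGET_BUFFER, VIRGL_TARGET_1D, VIRGL_TARGET_2D,
                    VIRGL_TARGET_3D, VIRGL_TARGET_CUBE, VIRGL_TARGET_2D_ARRAY };

struct virgl_resource_create {
    uint32_t handle;      /* guest-managed resource handle */
    uint32_t target;      /* enum virgl_target */
    uint32_t format;      /* gallium pipe format specifier */
    uint32_t bind;        /* gallium binding flags */
    uint32_t width, height, depth;
    uint32_t array_size;  /* GL texture array size */
    uint32_t last_level;  /* last mipmap level */
    uint32_t nr_samples;  /* MSAA sample count */
};

/* Rough texel count of level 0, useful for sizing the guest buffer
 * object that backs the resource. */
static uint64_t virgl_level0_texels(const struct virgl_resource_create *r)
{
    return (uint64_t)r->width * r->height * r->depth * r->array_size;
}
```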
The guest OS kernel keeps track of resource handles that have been used.
This just destroys a resource previously created.
This attaches a resource to a context so a command stream for that context can do rendering operations on the resource.
This detaches a resource from a context, so any subsequent rendering will generate some sort of error.
This initiates a DMA transfer from the host to the guest. The guest OS will attach a scatter-gather list of pages as the destination for the transfer. Transfer parameters are:
res_handle - the resource to transfer from
ctx_id - the context that wishes to do the transfer
box - a 3D box (x, y, z, w, h, d); not all parameters are always used, but this is the sub-box of the resource to be transferred.
level - mipmap level to transfer from
data - the sub-page offset into the scatter-gather pages at which to start the transfer
dst_stride - must be specified if transferring into an object whose stride differs from the box width. (TODO: do we need a level or array stride?)
This initiates a DMA transfer from the guest to the host. The guest OS will attach a scatter-gather list of pages as a source for the transfer. Transfer parameters are:
res_handle - the resource to transfer to
ctx_id - the context that wishes to do the transfer
box - a 3D box (x, y, z, w, h, d); not all parameters are always used, but this is the sub-box of the resource to be transferred into.
level - mipmap level to transfer to
data - the sub-page offset into the scatter-gather pages at which to start the transfer
src_stride - must be specified if transferring from a source whose stride differs from the box width.
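Both transfer directions share the same parameters, so they might be modelled with one structure. The sketch below (hypothetical names, generic `stride` standing in for src_stride/dst_stride) also shows how the box and stride determine how many bytes of the scatter-gather pages a transfer touches:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical transfer command layout mirroring the parameters above. */
struct virgl_box { uint32_t x, y, z, w, h, d; };

struct virgl_transfer {
    uint32_t res_handle;  /* resource to transfer to/from */
    uint32_t ctx_id;      /* context performing the transfer */
    struct virgl_box box; /* sub-box of the resource */
    uint32_t level;       /* mipmap level */
    uint64_t data;        /* sub-page offset into the sg pages */
    uint32_t stride;      /* bytes per row, 0 = tightly packed */
};

/* Bytes the transfer touches in the scatter-gather pages: full rows
 * are stride apart, the last row only needs the box width. */
static uint64_t virgl_transfer_bytes(const struct virgl_transfer *t,
                                     uint32_t bytes_per_texel)
{
    uint64_t row = (uint64_t)t->box.w * bytes_per_texel;
    uint64_t stride = t->stride ? t->stride : row;
    uint64_t rows = (uint64_t)t->box.h * t->box.d;
    if (rows == 0)
        return 0;
    return (rows - 1) * stride + row;
}
```

For example, a 4x2 box of 4-byte texels with a 32-byte stride touches one full 32-byte row plus a final 16-byte row.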
This submits a rendering command stream to the host. The guest OS will attach a scatter-gather list of pages containing the commands to execute, along with a size
and the context id to which the commands apply.
This flushes a subset of the current scanout resource (at least, it should be the current scanout resource) to the screen.
This sets the currently shown mode for the qemu viewing window. It contains a resource handle to be used as the scanout resource, and a box specifying the subset of the resource to scan out, which also serves as the “mode”.
TODO fill in the rendering commands and objects
Current state objects for 3D rendering are based on gallium state objects:
blend, rasterizer, depth/stencil/alpha, vertex shader, fragment shader, vertex elements,
sampler views, sampler states, surfaces and queries. These objects can be created, bound and destroyed.
Misc states consist of viewport, framebuffer binding, vertex buffer binding, per-shader sampler view bindings, index buffer binding, constant buffer binding, stencil reference, blend color, scissor state. The states can just be set and don’t require create/bind cycles.
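To make the stream format concrete, a misc state like the viewport might be encoded as a header dword (opcode plus payload length) followed by its payload. The opcode value, header packing and function names below are made up for illustration; the real virgl encoding may differ:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command-stream encoding: a 32-bit header packs an
 * opcode in the low 16 bits and the payload length in dwords in the
 * high 16 bits, followed by the payload itself. */
#define VIRGL_CCMD_SET_VIEWPORT 0x10 /* made-up opcode */

static uint32_t virgl_enc_hdr(uint32_t opcode, uint32_t len_dw)
{
    return (len_dw << 16) | (opcode & 0xffff);
}

/* Append a set-viewport command (scale + translate as floats) to the
 * stream; returns the number of dwords written. */
static uint32_t virgl_emit_viewport(uint32_t *buf, const float scale[3],
                                    const float translate[3])
{
    buf[0] = virgl_enc_hdr(VIRGL_CCMD_SET_VIEWPORT, 6);
    memcpy(&buf[1], scale, 3 * sizeof(float));
    memcpy(&buf[4], translate, 3 * sizeof(float));
    return 7; /* header + 6 payload dwords */
}
```

Encoding the length in the header lets the host renderer skip over commands it does not understand, which matters once capabilities and versioning come into play.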
Rendering commands are clear, draw_vbo, resource copy region and blit. (Do we need both of the last two?)
There are also commands dealing with queries for occlusion/time/etc queries.
As well as the virtio ring, there is an irq from the host to the guest and a set of IO registers that the guest can read from and write to. The standard virtio PCI registers are duplicated. Virgl uses the virtio features registers as normal. It also uses:
ISR: 8-bit register to denote what irq action has happened, bit 1 is the virtio virtqueue irq, bit 6 is the fence irq.
FENCE_ID: 32-bit register containing the last fence to be fully signalled.
CURSOR_ID: 32-bit register containing the resource id of the current cursor to be drawn
CURSOR_HOT_X_Y: 32-bit register containing the 16-bit x/y hotspot for cursors
CURSOR_CUR_X: current cursor X position
CURSOR_CUR_Y: current cursor Y position
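A guest irq handler would read the ISR to decide what happened. A minimal decode sketch follows; it assumes the document's "bit 1"/"bit 6" are one-based positions (so masks 0x01 and 0x20), which should be checked against the actual device ABI:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ISR bit masks, assuming one-based bit numbering in
 * the text above; verify against the real register layout. */
#define VIRGL_ISR_VIRTQUEUE (1u << 0) /* "bit 1": virtqueue irq */
#define VIRGL_ISR_FENCE     (1u << 5) /* "bit 6": fence irq */

struct virgl_irq_status {
    int virtqueue; /* the ring needs servicing */
    int fence;     /* FENCE_ID has advanced */
};

/* Decode the 8-bit ISR value read in the guest irq handler. */
static struct virgl_irq_status virgl_decode_isr(uint8_t isr)
{
    struct virgl_irq_status s = {
        .virtqueue = (isr & VIRGL_ISR_VIRTQUEUE) != 0,
        .fence = (isr & VIRGL_ISR_FENCE) != 0,
    };
    return s;
}
```

On a fence irq the guest would then read FENCE_ID and retire every fence up to and including that value.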
TODO: error handling
The renderer takes the command stream from the guest and renders it using the host’s OpenGL implementation.
Currently the renderer runs in a thread inside the qemu process context; it contains code to dequeue commands from the virtio ring and execute them in order.
Contexts in the renderer are just a big bunch of per-context state; when the renderer switches context, the new context's state is sent to OpenGL.
Resources are global in the renderer, and have to be attached to contexts to be used. Resources are created as OpenGL objects, like buffer objects or texture objects.
Transfers are used to put data in/out of the GL objects. Transfers into objects are generally done with SubData or SubImage commands. However DrawPixels could also be used in some cases in the future.
Transfers from objects are done with MapBuffer, or by binding a texture as an FBO and using ReadPixels. As there is no GetTexSubImage, there may be cases where we have to retrieve a full image with GetTexImage and process the subregions to send back.
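The GetTexImage fallback means the host has to crop the requested sub-box out of the full level before sending it back. A minimal row-by-row extraction for a 2D image might look like this; the helper name and signature are illustrative, not the renderer's actual code:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy a w x h sub-rectangle at (x, y) out of a full image that is
 * img_w texels wide, bpp bytes per texel, into a tightly packed dst.
 * This is the post-processing step after a full GetTexImage readback. */
static void extract_subregion(uint8_t *dst, const uint8_t *full,
                              uint32_t img_w, uint32_t bpp,
                              uint32_t x, uint32_t y,
                              uint32_t w, uint32_t h)
{
    for (uint32_t row = 0; row < h; row++)
        memcpy(dst + (size_t)row * w * bpp,
               full + ((size_t)(y + row) * img_w + x) * bpp,
               (size_t)w * bpp);
}
```

The obvious cost is reading back the whole level for a small box, which is one reason the FBO + ReadPixels path is preferable when the format allows it.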
Rendering Command stream:
The renderer takes the rendering command stream for a context, and executes each command in order. The specifics of converting the Gallium state to OpenGL is left to the code.
The renderer currently takes the TGSI text shader, converts it to the TGSI parser token list, and then converts that into GLSL 1.30 shaders. It currently works for a lot of simpler shaders, but fails badly on many more complex ones.
Currently the renderer creates every guest resource as an OpenGL object. However this includes resources like the frontbuffer. So when e.g. X is drawing to the frontbuffer, the renderer is rendering to an offscreen framebuffer object. The FLUSH_BUFFER interface then allows the guest to signal to the host to copy this framebuffer object to the current OpenGL backbuffer and do a SwapBuffers on it. There is the possibility to perhaps render direct to the OpenGL backbuffer for the guest frontbuffer in certain circumstances.
This area has already gone through a bit of a redesign. The initial guest implementation separated resources from guest buffer objects. However, thinking about how hibernation or migration might have to work, I realised that at some point I'd have to allocate enough space in the guest to store most of the resources I'd allocated on the host. Also, given how current Linux resource sharing (DRI2) works, having resources and buffers separate was a bit messy.
So the current design is slower than what I showed in the demo videos, but I plan on fixing that up soon. When the guest userspace app allocates a resource, it also allocates a buffer object big enough to back the resource. This buffer object is then used as the basis for all transfers to/from the host.
The main problem with tying resource information to buffer object creation is that it makes reusing buffer objects in the guest harder than the current buffer reuse code allows for. The plan is to use guest userspace caching, and possibly to allow the resource attached to an object to be overridden, so we can avoid the overheads of page allocation and mapping in the guest by reusing the buffer objects while changing the underlying GL resource attached to them.
This design doesn’t cover a few things that I would consider requirements:
1. Capabilities
OpenGL is quite a diverse interface, and host OpenGL implementations can vary greatly in quality and ability. Ideally we'd like to expose as much of OpenGL in the guest as we can; this, however, means we need to add some sort of capability interface so that the guest knows what the host can deal with. Currently I just limit things to GL 3.0.
2. GL versioning
A baseline for what GL versions and what extensions are required needs to be developed in association with capabilities.
3. Running on top of GL 3.1
OpenGL 3.1 without compatibility as implemented by many open source drivers has dropped support for a number of old GL features, like ALPHA and LUMINANCE textures etc. Having a guest expose these features on a host that doesn’t expose them will require some thought, and possibly rewriting shaders on the host.
4. Unversioned Gallium API
The current gallium interfaces aren't public or stable APIs. There is probably a need to fix on a point-in-time snapshot of these interfaces and add versioning, so that new features can be added in the future via capabilities. The TGSI shader text format isn't versioned or stable either; we may need to see if it can be versioned somehow upstream.
5. Error handling
Currently the virtual GPU has no error notification mechanism into the guest, so if the guest manages to do something illegal that the host wants to flag, there is no way to report it. The current renderer just drops a lot of rendering on the floor in this case. An irq and error delivery mechanism would be useful so the guest kernel can kill offending applications.
6. Interaction with libvirt/virt-manager/viewers.
Currently the renderer requires OpenGL/SDL from qemu and requires access to the user's X server and the /dev/ files for the GPU. However, a common way for distros to run VMs is as a qemu user process disconnected from the screen, with a vnc or spice viewer connected to the actual VM. It would probably be useful to retain this setup somehow, even if it adds a bit of overhead to rendering and disallows some optimisations. Investigating how EGL-based rendering from a qemu process would work, and using something like dma-buf sharing to pass the rendered guest frontbuffer handle to a compositing viewer process running as the user on the same machine, might be a step towards ensuring better security between guest and host user.
This solution currently ignores remoting: reading back each frame from the GPU is going to make things a lot slower, and adding video encoding even more fun. This is another area for future investigation.
7. virtio msi/mmio/pci and other architectures
We need to investigate using some other virtio and qemu features, like MSI interrupts and mmio/pci BARs, as possible optimisations, and how this might work on ARM or Power architectures.
8. GL subsets
It might be more useful on ARM to provide a renderer that uses GLES2 and exposes GLES2 in the guest if possible.
9. Direct3D and Windows guests
Getting Windows guests running would be another hurdle, though the overall architecture should support it fine.