1 of 33

Optimizing rendering

Javier Agenjo (2024)

2 of 33

Introduction

During this course we have paid special attention to the visual quality of the rendered frame.

Sadly, we couldn't spend as much time on performance, and performance is key when rendering real-time environments.

Fast rendering means you can render more objects per frame, and more frames per second.

In this presentation I will discuss some techniques that could make your renders run much faster.

3 of 33

Reducing Bottlenecks in VRAM

4 of 33

Geometry packing

GPUs can read vertices and their attributes in multiple ways, and if the data is packed efficiently, fetching will be faster.

That means using techniques like indexing vertices, interleaving attributes, and packing information into fewer bytes.

It is a manual process that reduces bottlenecks (fewer bytes on the bus), reduces shader executions per vertex (indexing) and makes better use of the GPU caches.
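
A minimal sketch of an interleaved, packed layout (C++/OpenGL, assuming a GL 3.3+ context; the exact packing shown is just one reasonable choice):

  #include <cstddef> // offsetof
  #include <cstdint>

  // One interleaved vertex: 20 bytes instead of the naive 32
  // (float3 position + float3 normal + float2 uv).
  struct PackedVertex {
      float    position[3]; // 12 bytes (could be halved with half floats too)
      uint32_t normal;      //  4 bytes, packed as GL_INT_2_10_10_10_REV
      uint16_t uv[2];       //  4 bytes, half-float texture coordinates
  };

  // Describe the interleaved layout (the VBO is assumed to be filled already).
  glBindBuffer(GL_ARRAY_BUFFER, vbo);
  glEnableVertexAttribArray(0);
  glVertexAttribPointer(0, 3, GL_FLOAT, GL_FALSE, sizeof(PackedVertex),
                        (void*)offsetof(PackedVertex, position));
  glEnableVertexAttribArray(1);
  glVertexAttribPointer(1, 4, GL_INT_2_10_10_10_REV, GL_TRUE, sizeof(PackedVertex),
                        (void*)offsetof(PackedVertex, normal));
  glEnableVertexAttribArray(2);
  glVertexAttribPointer(2, 2, GL_HALF_FLOAT, GL_FALSE, sizeof(PackedVertex),
                        (void*)offsetof(PackedVertex, uv));

  // Indexed drawing: each unique vertex is shaded once and reused via the cache.
  glDrawElements(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT, nullptr);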

5 of 33

Material Packing

Another basic trick is, instead of having lots of uniforms in our shader, to use buffers that store structs (SSBOs).

By using structs that contain all the properties of a material, we can send all the material properties to the GPU in a single call.

Keep in mind that GPUs have restrictions about the alignment and padding of struct fields.
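
A minimal sketch (C++ with the GLSL side embedded as a string; field names are illustrative, and `materials` is assumed to be a std::vector<MaterialGPU>):

  // CPU side: a fixed-size struct that matches the GLSL std430 layout.
  struct MaterialGPU {
      float baseColor[4]; // vec4 -> 16-byte alignment
      float metallic;
      float roughness;
      float emissive;
      float _pad;         // explicit padding: size stays a multiple of 16
  };

  GLuint ssbo;
  glGenBuffers(1, &ssbo);
  glBindBuffer(GL_SHADER_STORAGE_BUFFER, ssbo);
  glBufferData(GL_SHADER_STORAGE_BUFFER, materials.size() * sizeof(MaterialGPU),
               materials.data(), GL_STATIC_DRAW); // all materials, one call
  glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);

  // GLSL side:
  const char* materialsBlock = R"(
      struct Material { vec4 baseColor; float metallic; float roughness;
                        float emissive; float _pad; };
      layout(std430, binding = 0) readonly buffer Materials { Material materials[]; };
      // in main(): Material m = materials[materialIndex];
  )";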

Also, this only covers material values; textures are trickier.

6 of 33

Reducing PSOs

When rendering objects we need to use a Pipeline State Object (PSO) that specifies exactly how to render the triangles (it includes the shader, render flags, geometry packing info, etc.).

Switching between different PSOs per object has a penalty. If we reduce the number of shader permutations and enforce the same packing in all geometries, we can use fewer PSOs.

This comes with drawbacks and limitations, but it is worth considering.

7 of 33

Level of Detail (LODs)

The next trick is to keep multiple versions of every mesh with progressively less geometry, and use the low-detail versions when the object is small on screen.

By reducing the number of polygons, the GPU can render faster and users won't notice the difference.

To generate the low-resolution versions you will need an authoring tool like Blender.
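
A minimal distance-based selection sketch (C++; the thresholds and three-level setup are arbitrary, and cameraPos/objectPos are assumed glm::vec3 — selecting by projected screen size is more robust):

  // One VAO and index count per detail level, authored offline.
  struct MeshLODs {
      GLuint  vao[3];
      GLsizei indexCount[3];
  };

  int selectLOD(float distanceToCamera) {
      if (distanceToCamera < 10.0f) return 0; // full detail
      if (distanceToCamera < 40.0f) return 1; // medium
      return 2;                               // lowest detail
  }

  // At render time:
  int lod = selectLOD(glm::length(cameraPos - objectPos));
  glBindVertexArray(mesh.vao[lod]);
  glDrawElements(GL_TRIANGLES, mesh.indexCount[lod], GL_UNSIGNED_INT, nullptr);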

8 of 33

Compressed Textures

We usually store textures as JPGs or PNGs on disk.

These image formats reduce the file size a lot but are not suitable for the GPU, so to use them we first need to decompress them into RAM, a process that requires lots of CPU time and can be really slow for big images.

GPUs do support compressed textures, although in special formats (BC/S3TC on desktop, ETC2/ASTC on mobile) that don't compress as much as JPG or PNG, but that the GPU can sample directly. The data stays compressed in VRAM, which saves memory and helps the caches, so rendering runs faster. This is critical on mobile devices.

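A minimal upload sketch for a BC1/DXT1 mip chain (C++/OpenGL; `data` is assumed to already point at the compressed blobs, e.g. parsed from a DDS or KTX file, which is not shown):

  // BC1 stores each 4x4 block of pixels in 8 bytes.
  glBindTexture(GL_TEXTURE_2D, tex);
  const unsigned char* ptr = data;
  int w = width, h = height;
  for (int level = 0; level < mipCount; ++level) {
      GLsizei size = ((w + 3) / 4) * ((h + 3) / 4) * 8;
      glCompressedTexImage2D(GL_TEXTURE_2D, level, GL_COMPRESSED_RGB_S3TC_DXT1_EXT,
                             w, h, 0, size, ptr); // uploaded still compressed
      ptr += size;
      w = w > 1 ? w / 2 : 1; // next mip is half the size
      h = h > 1 ? h / 2 : 1;
  }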

9 of 33

Reducing Draw Calls

10 of 33

Draw Call

To render one object in our 3D space we need to issue a draw call from the CPU to the GPU.

Every draw call has a big cost, even if in the end nothing changes on the screen, because issuing a draw call requires lots of validation and preprocessing before any pixel is written to the buffer.

Reducing the number of draw calls can greatly improve your rendering time.

11 of 33

Instancing

GPUs have the option to render several objects in a single draw call, as long as they all share the same geometry and uniforms; per-instance data (like the model matrix) is fetched from a buffer using the instance index.

This feature is called instancing, and all APIs support it.

Now, instead of rendering every object with its own draw call, we must detect when we have more than one instance of the same mesh and render them all with a single instanced draw call.
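
A minimal sketch (C++ with the vertex-shader side embedded as a string): per-instance model matrices live in an SSBO and are indexed with gl_InstanceID, so one call draws instanceCount copies:

  glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 1, modelMatrixSSBO);
  glBindVertexArray(meshVAO);
  glDrawElementsInstanced(GL_TRIANGLES, indexCount, GL_UNSIGNED_INT,
                          nullptr, instanceCount);

  // Vertex shader side:
  const char* instancedVS = R"(
      layout(std430, binding = 1) readonly buffer Models { mat4 model[]; };
      // in main():
      // gl_Position = viewProjection * model[gl_InstanceID] * vec4(aPosition, 1.0);
  )";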

12 of 33

Indirect Rendering

When using instancing we batch per geometry, so if we have 10 different geometries repeated multiple times, we will still need 10 draw calls.

Thanks to Indirect Rendering, we can specify offsets into the geometry (where to start reading, and how many primitives to read) from a buffer instead of from CPU arguments.

This opens the possibility of rendering different geometries with a single special draw call.

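A minimal sketch (C++/OpenGL): the command layout is fixed by the API (this struct mirrors OpenGL's DrawElementsIndirectCommand), and a single glMultiDrawElementsIndirect call consumes a whole array of commands:

  // One of these per draw; the GPU reads them straight from a buffer.
  struct DrawElementsIndirectCommand {
      GLuint count;         // number of indices to read
      GLuint instanceCount; // how many instances of this mesh
      GLuint firstIndex;    // offset into the shared index buffer
      GLuint baseVertex;    // offset added to every index
      GLuint baseInstance;  // offset for per-instance data
  };

  // Fill one command per mesh, upload, then issue a single call.
  glBindBuffer(GL_DRAW_INDIRECT_BUFFER, indirectBuffer);
  glMultiDrawElementsIndirect(GL_TRIANGLES, GL_UNSIGNED_INT,
                              nullptr /*offset*/, drawCount, 0 /*tightly packed*/);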

13 of 33

Frustum Culling

The first and cheapest way to reduce the number of draw calls is to only render objects that are inside the camera frustum.

To achieve this we extract the six planes that define the frustum (from the view-projection matrix) and test whether each object's bounding volume is inside or overlaps them. If not, it is fair to assume the object is outside our view and we can skip rendering it.

Doing the culling on the CPU means you need all the scene information stored in RAM.
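
A minimal CPU-side sketch (C++ with glm) of the classic Gribb/Hartmann plane extraction plus a bounding-sphere test (signs may need adjusting for your matrix conventions):

  #include <glm/glm.hpp>

  struct Plane { glm::vec3 n; float d; };

  static Plane makePlane(glm::vec4 v) {
      float len = glm::length(glm::vec3(v));
      return { glm::vec3(v) / len, v.w / len };
  }

  // Extract the six planes from the view-projection matrix.
  void extractFrustum(const glm::mat4& m, Plane planes[6]) {
      auto row = [&](int i) { return glm::vec4(m[0][i], m[1][i], m[2][i], m[3][i]); };
      planes[0] = makePlane(row(3) + row(0)); // left
      planes[1] = makePlane(row(3) - row(0)); // right
      planes[2] = makePlane(row(3) + row(1)); // bottom
      planes[3] = makePlane(row(3) - row(1)); // top
      planes[4] = makePlane(row(3) + row(2)); // near
      planes[5] = makePlane(row(3) - row(2)); // far
  }

  // Visible unless the sphere is fully behind one of the planes.
  bool sphereVisible(const Plane planes[6], glm::vec3 center, float radius) {
      for (int i = 0; i < 6; ++i)
          if (glm::dot(planes[i].n, center) + planes[i].d < -radius)
              return false;
      return true;
  }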

14 of 33

Occlusion Culling

Frustum culling is not enough when your scene contains lots of objects occluded by other objects. How do we detect that one object is behind another?

Imagine rendering the interior of a building: most of the objects are probably behind walls where we can't see them. Can we detect beforehand which objects won't be visible?

Old engines relied on precomputing visibility for every area of the scene where the camera could be (potentially visible sets), but that doesn't work with dynamic environments.

Other engines built a software rasterizer that keeps a small depth buffer on the CPU to test occlusion against.

But can we take advantage of the GPU?

15 of 33

Occlusion Queries (1/2)

GPUs give you a mechanism to query how many pixels would be rendered if a draw call were issued.

This is handy: one could issue the draw call, and if the number of pixels is zero, we would know it wasn't visible. But that would take almost the same time as rendering it!

We could reduce the time by issuing a draw call for just the bounding box, but then we would have to wait for the GPU to return the result, and waiting is bad.

16 of 33

Occlusion Queries (2/2)

Some engines issue Occlusion Queries at the end of the current frame, then read the results at the beginning of the next frame to see which objects passed.

The idea is that if an object was occluded in the previous frame, it will probably still be occluded in the current one.

Worst case scenario: the object was occluded in the previous frame but is not in the current one (because the camera moved a lot). Then the object won't appear in this frame, but when the query is checked for the next frame, it will appear.

That produces a one-frame delay in objects appearing, which can be noticeable in some situations, but most games live with it without problems.
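
A minimal sketch of the two-frame pattern with GL_ANY_SAMPLES_PASSED (C++; the queries come from glGenQueries, and drawBoundingBox is a hypothetical helper):

  // End of frame N: draw each object's bounding box with all writes
  // disabled, asking whether any sample would pass the depth test.
  glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
  glDepthMask(GL_FALSE);
  for (Object& obj : objects) {
      glBeginQuery(GL_ANY_SAMPLES_PASSED, obj.query);
      drawBoundingBox(obj); // hypothetical helper
      glEndQuery(GL_ANY_SAMPLES_PASSED);
  }
  glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
  glDepthMask(GL_TRUE);

  // Beginning of frame N+1: last frame's result decides visibility.
  for (Object& obj : objects) {
      GLuint anySamples = 0;
      glGetQueryObjectuiv(obj.query, GL_QUERY_RESULT, &anySamples);
      obj.visible = (anySamples != 0);
  }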

17 of 33

Compute Shaders

Compute shaders allow us to execute code on the GPU, with access to GPU memory.

This opens the possibility of moving some of the algorithms in our rendering pipeline from the CPU to the GPU, where they will run much faster.

We can use compute shaders to write into buffers stored in GPU memory, and use those buffers in some of our rendering passes.
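
A minimal sketch (C++ with the compute shader embedded as a string; program compilation is not shown):

  // GLSL: one invocation per element, writing into an SSBO.
  const char* computeSrc = R"(
      #version 430
      layout(local_size_x = 64) in;
      layout(std430, binding = 0) buffer Output { float results[]; };
      void main() {
          uint i = gl_GlobalInvocationID.x;
          if (i >= uint(results.length())) return; // guard the last group
          results[i] = float(i) * 2.0;             // any per-element work
      }
  )";

  // C++: dispatch enough 64-wide groups to cover every element, then make
  // the writes visible to whatever reads the buffer next.
  glUseProgram(computeProgram); // compiled and linked from computeSrc
  glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
  glDispatchCompute((elementCount + 63) / 64, 1, 1);
  glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);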

18 of 33

GPU Driven rendering

19 of 33

From CPU to GPU

Until now, all draw calls were issued from the CPU, and every call required setting up several resources (uniforms, textures, geometry, etc.) from the CPU.

But GPUs have more processing power to detect which objects are visible (they can access the depth buffer, for instance).

If we could issue the draw calls not from the CPU but from the GPU, that would open the possibility of more fine-grained algorithms.

Let's see how we can do that.

20 of 33

Issuing draw calls from the GPU

So how can we issue draw calls from the GPU?

The short answer is that you can't; draw calls must start from the CPU. But using Indirect Rendering we can issue one draw call that expands into several more.

The workflow: first use a compute shader to fill buffers that contain all the info required for our draw calls, then issue an indirect draw call that uses those buffers as input and generates all the rendering.

As you can see, all the complex work is handled by the GPU, leaving the CPU free.

21 of 33

Draw calls from a Compute Shader

But how do we construct our draw calls from a compute shader?

Well, first we must store in VRAM all the info about which objects and materials the scene contains.

Then we execute a compute shader that transforms that data into an indirect buffer full of draw calls to execute.

Indirect rendering has some limitations though: per draw call, it only lets us specify an offset into the mesh, a number of primitives, and a number of instances.
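
A sketch of such a compute shader (embedded as a C++ string; all struct and buffer names are illustrative). Each invocation tests one object and, if it is visible, reserves a slot in the command buffer with an atomic counter:

  const char* buildDrawsSrc = R"(
      #version 430
      layout(local_size_x = 64) in;

      struct DrawCommand { uint count, instanceCount, firstIndex,
                           baseVertex, baseInstance; };
      struct ObjectData  { mat4 model; uint meshIndex, materialIndex, pad0, pad1; };
      struct MeshInfo    { uint indexCount, firstIndex, baseVertex, pad; };

      layout(std430, binding = 0) readonly  buffer Objects  { ObjectData objects[]; };
      layout(std430, binding = 1) readonly  buffer Meshes   { MeshInfo meshes[]; };
      layout(std430, binding = 2) writeonly buffer Commands { DrawCommand commands[]; };
      layout(std430, binding = 3) buffer Counter            { uint drawCount; };

      bool isVisible(ObjectData obj) { return true; } // culling tests go here

      void main() {
          uint i = gl_GlobalInvocationID.x;
          if (i >= uint(objects.length()) || !isVisible(objects[i])) return;
          uint slot = atomicAdd(drawCount, 1u);
          MeshInfo m = meshes[objects[i].meshIndex];
          // baseInstance = i lets the vertex shader recover the object index.
          commands[slot] = DrawCommand(m.indexCount, 1u, m.firstIndex, m.baseVertex, i);
      }
  )";

The CPU then issues one glMultiDrawElementsIndirect over the commands buffer (and on GL 4.6, glMultiDrawElementsIndirectCount can read drawCount straight from the counter buffer, avoiding any readback).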

22 of 33

Draw Call resources

But to render one object on the screen we need several resources per draw call:

  • Pipeline setup (shader, depth test, etc.)
  • Geometry (vertex and index buffers)
  • Globals and material properties (uniforms)
  • Textures

If we want to be able to generate draw calls directly from the GPU, we need a way to define each of these from the GPU, in a form that indirect rendering can use.

Let's go one by one.

23 of 33

Pipeline setup

We cannot change pipelines from the GPU; that action must be done from the CPU.

But most of the geometry in your scene will probably use the same pipeline, so the trick is to enforce that all objects use the same shader / render state / primitive type.

The special ones can be handled in separate CPU-issued draw calls.

For example, you can have two groups, one for static meshes and one for animated characters. That would mean just two draw calls for the whole scene.

24 of 33

Geometry

Indirect draws don't allow you to choose which vertex buffer to use; they only allow specifying an offset into the currently bound vertex buffer.

But we can abuse this by packing all meshes into a single shared vertex buffer and a single shared index buffer.

This works as long as all our geometry has the same attributes (position, normal, UV) and the same packing.

If we keep a list specifying the offset of every mesh, then just by changing that offset in the buffer used by the indirect draw we can make the GPU render a different mesh.
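
As a sketch, that list can be as simple as one small record per mesh (names are illustrative, matching the compute shader from slide 21):

  // One entry per mesh inside the shared vertex/index buffers.
  struct MeshInfo {
      uint32_t indexCount; // how many indices the mesh uses
      uint32_t firstIndex; // where its indices start in the shared index buffer
      uint32_t baseVertex; // where its vertices start in the shared vertex buffer
      uint32_t pad;        // keep a friendly 16-byte stride
  };
  // Copying these three values into an indirect command is all it takes
  // to make the same indirect draw render a different mesh.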

25 of 33

Objects and Materials

We can store all materials in a single buffer, with all the properties of each packed into a fixed-size struct.

The same can be done with all per-object properties (model matrix, mesh index and material index).

Then, by fetching from those arrays, all the info we need per object is accessible from the GPU.
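
A sketch of the per-object record on the C++ side (it must mirror the GLSL std430 struct; field names are illustrative):

  #include <glm/glm.hpp>

  struct ObjectDataGPU {
      glm::mat4 model;         // 64 bytes: the object's transform
      uint32_t  meshIndex;     // which entry of the mesh table to draw
      uint32_t  materialIndex; // which entry of the material buffer to use
      uint32_t  pad0, pad1;    // round the struct up to a multiple of 16 bytes
  };
  // Upload an array of these into an SSBO, exactly like the materials.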

26 of 33

Bindless textures

We know that textures must be bound before executing a draw call, so how can we change textures from the GPU?

There are two ways:

  • Using some sort of texture atlas (all textures stored inside a single texture)
  • Abusing the driver so shaders can sample textures that were never bound (bindless)

While the first one is easy to code (using a Texture2DArray), it has limitations, as all textures must have the same properties (size, format, etc).

The second one requires extensions and is harder to code, but it allows a shader to fetch from arbitrary textures in VRAM.
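
A sketch of the second approach using the ARB_bindless_texture extension (buffer bindings and names are illustrative):

  // CPU side: turn each texture into a 64-bit handle and make it resident.
  GLuint64 handle = glGetTextureHandleARB(tex);
  glMakeTextureHandleResidentARB(handle); // must stay resident while shaders use it
  // ...gather all handles into an array and upload it into an SSBO.

  // GLSL side: any shader can now fetch from any texture by index.
  const char* bindlessBlock = R"(
      #extension GL_ARB_bindless_texture : require
      layout(std430, binding = 4) readonly buffer Textures { sampler2D textures[]; };
      // in main():
      // vec4 albedo = texture(textures[material.albedoTexture], uv);
  )";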

27 of 33

Culling objects on the GPU

So now that we have a way to define the draw call of any object in our scene, we can create a compute shader that reads our scene description and adds draw calls depending on whether each object is visible, for instance using frustum culling.

If you also want to cull objects hidden behind other objects, you can test against the previous frame's depth buffer using a Hierarchical Z-Buffer (Hi-Z) structure. The idea is to have a fast way to check whether an object's bounding volume falls behind all the depth-buffer pixels it projects onto.

(Figures: projecting the bounding sphere to screen space, and approximating its projected bounds.)
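
A sketch of the Hi-Z test inside the culling shader (GLSL embedded as a C++ string; it assumes a depth pyramid `hizPyramid` where every mip stores the farthest depth of its footprint, conventional 0-near/1-far depth, and bounds already projected to [0,1] UVs):

  const char* hizTestSrc = R"(
      uniform sampler2D hizPyramid;
      uniform vec2 hizResolution; // size of mip 0 in pixels

      // True if even the object's nearest point is behind every occluder.
      bool occluded(vec2 uvMin, vec2 uvMax, float nearestZ) {
          vec2 sizePx = (uvMax - uvMin) * hizResolution;
          // Pick the mip where the bounds cover roughly 2x2 texels.
          float mip = ceil(log2(max(sizePx.x, sizePx.y) * 0.5));
          float d0 = textureLod(hizPyramid, uvMin, mip).r;
          float d1 = textureLod(hizPyramid, vec2(uvMax.x, uvMin.y), mip).r;
          float d2 = textureLod(hizPyramid, vec2(uvMin.x, uvMax.y), mip).r;
          float d3 = textureLod(hizPyramid, uvMax, mip).r;
          float farthestOccluder = max(max(d0, d1), max(d2, d3));
          return nearestZ > farthestOccluder;
      }
  )";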

28 of 33

Optimize render time

29 of 33

Mesh clustering

Meshlets: the mesh is split into small clusters of triangles (meshlets) that can be culled and rendered individually on the GPU.

30 of 33

Anti-aliasing


31 of 33

Dynamic Resolution

32 of 33

Upscaling

33 of 33

References