1 of 62

Mobile Graphics, Vulkan

Arman Papikyan

Lead Render Mobile Programmer at Ubisoft Ukraine

8 of 62

What is Graphics API (history)

Rendering on CPU,
First GPUs and fixed-function shaders,
Complex modern GPUs,
Standardized APIs – OpenGL

- 1992

- 2003

- 2016

16 of 62

GLSL compiler is slow
Spikes ~2000 ms
Prewarming hack

Shader precompilation
Safer and faster
Cache and store on disk
Minimal error checking
Rust / C++ -> SPIR-V
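
The "cache and store on disk" idea can be sketched without any graphics API. This is only an illustration: slow_compile is a hypothetical stand-in for an expensive shader compile, and the cache layout (one file per source hash) is an assumption, not how any particular driver or engine does it.

```python
import hashlib
import tempfile
from pathlib import Path

compile_calls = 0  # counts how often the slow path runs


def slow_compile(source: str) -> bytes:
    """Hypothetical stand-in for an expensive shader compile step."""
    global compile_calls
    compile_calls += 1
    return source.encode("utf-8")[::-1]  # fake "binary", not real SPIR-V


def get_shader_binary(source: str, cache_dir: Path) -> bytes:
    """Return a compiled binary, reusing the on-disk copy when one exists."""
    key = hashlib.sha256(source.encode("utf-8")).hexdigest()
    cached = cache_dir / f"{key}.bin"
    if cached.exists():
        return cached.read_bytes()
    binary = slow_compile(source)
    cached.write_bytes(binary)
    return binary


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    first = get_shader_binary("void main() {}", cache)
    second = get_shader_binary("void main() {}", cache)  # served from disk
```

The second request never reaches the compiler, which is the point of precompiling and caching: the spike is paid once, not in the middle of a frame.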

17 of 62

// Vulkan

// Set up the frame
vkBeginCommandBuffer(commandBuffer, &cmdBufInfo);
vkCmdBeginRenderPass(commandBuffer, &renderPassBeginInfo, VK_SUBPASS_CONTENTS_INLINE);

// Use shader
vkCmdBindPipeline(commandBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, pipeline);

// Use texture
vkCmdBindDescriptorSets(commandBuffer, VK_PIPELINE_BIND_POINT_GRAPHICS, pipelineLayout, 0, 1, &descriptorSet, 0, nullptr);

// Use vertex & index buffers
vkCmdBindVertexBuffers(commandBuffer, 0, 1, &_vertexBuffer, _offsets);
vkCmdBindIndexBuffer(commandBuffer, _indexBuffer, 0, VK_INDEX_TYPE_UINT32);

// Draw call
vkCmdDrawIndexed(commandBuffer, _index_count, 1, 0, 0, 1);

// Finish
vkCmdEndRenderPass(commandBuffer);
VkResult result = vkEndCommandBuffer(commandBuffer);

// OpenGL

// Set up the frame
glClearColor(0.f, 0.f, 0.f, 1.f);
glClear(GL_COLOR_BUFFER_BIT);

// Use shader
glUseProgram(shaderProgram);

// Use texture (texture unit 0 is selected with GL_TEXTURE0, not the integer 0)
glActiveTexture(GL_TEXTURE0);
glBindTexture(GL_TEXTURE_2D, tex);
glUniform1i(glGetUniformLocation(shaderProgram, "tex"), 0);

// Use vertex & index buffers
glBindVertexArray(VAO);

// Draw call
glDrawElements(GL_TRIANGLES, _index_count, GL_UNSIGNED_INT, 0);

> State Management

27 of 62

> Multithreading

28 of 62

> Multithreading

Diagram: CPU, GPU, Shared Cache, System Cache, DRAM

Example S20 Snapdragon

1 x 2.84 GHz 100%
3 x 2.42 GHz 85%
4 x 1.80 GHz 63%

Big Core – High Frequency: Performance
Small Core – Low Frequency: Efficiency

29 of 62

> Multithreading

Power consumption relationship with performance is not linear

Transistor energy per operation ~ Voltage squared
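
A rough arithmetic sketch of why the relationship is not linear, using the relative clock speeds from the S20 example. The voltage-squared rule comes from the slide; treating voltage as proportional to frequency is a simplifying assumption made here for illustration only, and real chips follow measured voltage/frequency curves.

```python
# Relative clock speeds of the three core clusters from the S20 example.
CLUSTERS = {"prime": 1.00, "big": 0.85, "small": 0.63}


def relative_energy_per_op(relative_voltage: float) -> float:
    """Energy per operation scales with the square of the supply voltage."""
    return relative_voltage ** 2


def relative_power(relative_frequency: float) -> float:
    """Power = energy per operation * operations per second.

    ASSUMPTION: voltage scales linearly with frequency, so power grows
    with the cube of the clock speed.
    """
    return relative_energy_per_op(relative_frequency) * relative_frequency


for name, freq in CLUSTERS.items():
    print(f"{name}: speed {freq:.2f}, "
          f"energy/op {relative_energy_per_op(freq):.2f}, "
          f"power {relative_power(freq):.2f}")
```

Under these assumptions a core running at 63% of the top clock spends well under half the energy per operation, which is why moving work to the small cores can pay off.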

30 of 62

Why is multithreading important?

Better utilization of multi-core HW,
Using smaller, power-efficient cores,
Better work distribution,

Complexity + (more work ☹)

> Multithreading
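
The work-distribution pattern can be sketched as follows: each worker records commands into its own private list, and a single thread stitches the lists together in a fixed order. This is an analogy only. The function names are hypothetical, no Vulkan calls are involved, and Python threads show the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def record_commands(chunk):
    """Record draw commands for one chunk of objects into a private list.

    Each worker owns its list, so no locking is needed while recording.
    """
    return [f"draw({obj})" for obj in chunk]


def build_frame(objects, workers=4):
    # Split the scene into one chunk per worker (round-robin slices).
    chunks = [objects[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in chunk order, so the final order is stable.
        command_lists = list(pool.map(record_commands, chunks))
    # Single-threaded step: concatenate the lists in a fixed order ("submit").
    return [cmd for cmds in command_lists for cmd in cmds]


frame = build_frame([f"mesh{i}" for i in range(8)], workers=4)
```

Keeping recording private per thread and ordering only at submission is what makes the extra complexity manageable.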

31 of 62

Nvidia GeForce GTX 1080 Ti
Heap 10.87 GB: DEVICE_LOCAL

AMD Radeon RX Vega 64
Heap 7.75 GB: DEVICE_LOCAL
Heap 256 MB: DEVICE_LOCAL | HOST_VISIBLE

Adreno / Mali / PowerVR..
Heap 0 MB (*no VRAM)

> Memory Management

32 of 62

Why do you need explicit memory management?

Choose HW that suits your needs,
Control the layout (suballocate),
Predictable,
Subpasses allow tile-friendly use,

Complexity ++ (more work ☹)
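
Suballocation, in its simplest form, means making one large allocation and handing out aligned offsets inside it. The sketch below is a minimal linear (bump) allocator; the class name and sizes are made up for illustration, and production allocators also handle freeing and fragmentation.

```python
class LinearSuballocator:
    """Hand out offsets from one big block instead of many small allocations."""

    def __init__(self, size: int):
        self.size = size      # total bytes in the backing block
        self.offset = 0       # first free byte

    def allocate(self, size: int, alignment: int) -> int:
        """Return an aligned offset into the block, or raise if it is full."""
        # Round the current offset up to the next multiple of `alignment`.
        start = (self.offset + alignment - 1) // alignment * alignment
        if start + size > self.size:
            raise MemoryError("block exhausted")
        self.offset = start + size
        return start

    def reset(self) -> None:
        """Free everything at once, e.g. at the end of a frame."""
        self.offset = 0


block = LinearSuballocator(size=1024)
a = block.allocate(size=100, alignment=256)   # starts at the block origin
b = block.allocate(size=300, alignment=256)   # rounded up past `a`
c = block.allocate(size=16, alignment=64)
```

Because the application decides where each resource lives, the layout is predictable and the cost of an allocation is a few integer operations.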

39 of 62

TILED MODE

1. Break into bins

# Pass one
for draw in renderPass:
    for primitive in draw:
        for vertex in primitive:
            execute_vertex_shader(vertex)
        if primitive not culled:
            append_tile_list(primitive)

2. Render each tile

# Pass two
for tile in renderPass:
    for primitive in tile:
        for fragment in primitive:
            execute_fragment_shader(fragment)
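
The two passes can be made concrete with a small runnable sketch. It is a simplification under stated assumptions: primitives are reduced to screen-space bounding boxes, the 16-pixel tile size is hypothetical (real sizes vary by GPU), and "rendering" a tile just counts the primitives it would shade.

```python
from collections import defaultdict

TILE_SIZE = 16  # pixels per tile edge (hypothetical; real sizes vary by GPU)


def bin_primitives(primitives, width, height):
    """Pass one: record which primitives touch which tile.

    Each primitive is a screen-space bounding box (x0, y0, x1, y1) with
    inclusive pixel coordinates; fully off-screen boxes are culled.
    """
    tiles = defaultdict(list)
    for index, (x0, y0, x1, y1) in enumerate(primitives):
        if x1 < 0 or y1 < 0 or x0 >= width or y0 >= height:
            continue  # culled: never reaches a tile list
        # Clamp to the screen, then convert the pixel range to a tile range.
        tx0, ty0 = max(x0, 0) // TILE_SIZE, max(y0, 0) // TILE_SIZE
        tx1 = min(x1, width - 1) // TILE_SIZE
        ty1 = min(y1, height - 1) // TILE_SIZE
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                tiles[(tx, ty)].append(index)
    return tiles


def render_tiles(tiles):
    """Pass two: visit each tile once and shade only its own primitives."""
    return {tile: len(prims) for tile, prims in sorted(tiles.items())}


boxes = [(0, 0, 10, 10), (12, 12, 20, 20), (100, 100, 120, 120)]
tiles = bin_primitives(boxes, width=32, height=32)
work = render_tiles(tiles)
```

Here the third box lies off-screen and is culled, the second straddles four tiles, and each tile ends up with a short list that fits the on-chip working set.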

40 of 62

Memory Bandwidth

Keeping the information in on-chip memory,
Useful for post effects, VFX, etc,
Deferred rendering,
Reduces the bandwidth (less sys mem RW),
Available via Vulkan Subpasses,
GL alternative via extension – limited.
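
A back-of-the-envelope estimate of the bandwidth at stake. All numbers are assumptions chosen for illustration (1080p target, three 4-byte G-buffer layers), and the subpass case is the ideal one where the G-buffer never leaves tile memory.

```python
def attachment_traffic_bytes(width, height, bytes_per_pixel, writes, reads):
    """System-memory traffic for one attachment over a frame."""
    return width * height * bytes_per_pixel * (writes + reads)


# ASSUMED numbers: 1080p target, 4-byte color, three 4-byte G-buffer layers.
WIDTH, HEIGHT, BPP, GBUFFER_LAYERS = 1920, 1080, 4, 3

# Separate passes: every G-buffer layer is written out, then read back.
separate = GBUFFER_LAYERS * attachment_traffic_bytes(WIDTH, HEIGHT, BPP, 1, 1)
separate += attachment_traffic_bytes(WIDTH, HEIGHT, BPP, 1, 0)  # final color

# Subpasses (ideal case): G-buffer stays on-chip, only final color is written.
merged = attachment_traffic_bytes(WIDTH, HEIGHT, BPP, 1, 0)

saved_mib_per_frame = (separate - merged) / (1024 * 1024)
```

Multiplying the per-frame saving by the frame rate gives the sustained system-memory traffic that on-chip storage avoids.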

41 of 62

Diagram: RESOURCE (Texture / Buffer / etc) with R / W accesses and a Lock

Random Frame

Render scene => Color RT
Lock
Color RT => Anti-Alias RT
Lock
Anti-Alias RT => Display

> Synchronization

42 of 62

> Synchronization

CPU

GPU

43 of 62

> Synchronization

CPU - Thread 1

CPU - Thread 2

GPU

44 of 62

> Synchronization

Fence: GPU - to - CPU

Semaphore: GPU - to - GPU (between queue submissions)

Barrier: GPU - to - GPU (within a queue)

Timeline Semaphore: All - to - All

45 of 62

> Synchronization

// Three dispatches that don’t have conflicting resource accesses
vkCmdDispatch( 1 );
vkCmdDispatch( 2 );
vkCmdDispatch( 3 );

// 4, 5, and 6 don’t share resources with 1, 2, and 3
// No reason for them to be blocked, so set an event to wait for later
vkCmdSetEvent( A, srcStageMask = COMPUTE );
vkCmdDispatch( 4 );
vkCmdDispatch( 5 );
vkCmdDispatch( 6 );

// 7 and 8 don’t use the same resources as 4, 5, and 6. So use an event
vkCmdSetEvent( B, srcStageMask = COMPUTE );

// 7 and 8 need the results of 1, 2, and 3
// So we’ll wait for them by waiting on A
vkCmdWaitEvents( A, dstStageMask = COMPUTE );
vkCmdDispatch( 7 );
vkCmdDispatch( 8 );

// 9 uses the same resources as 4, 5, and 6 so we wait.
// Also assumed is that 9 needs nothing from 7 and 8
vkCmdWaitEvents( B, dstStageMask = COMPUTE );
vkCmdDispatch( 9 );

46 of 62

> Synchronization
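
One way to see what the events buy in the nine-dispatch sequence is to compare the longest dependency chain with and without them. The sketch below models dispatches as nodes in a dependency graph. It is a toy model that assumes every dispatch costs the same and ignores hardware limits.

```python
def schedule_depth(dependencies):
    """Length of the longest dependency chain (1 = everything can overlap)."""
    depth = {}

    def visit(node):
        if node not in depth:
            depth[node] = 1 + max(
                (visit(d) for d in dependencies[node]), default=0
            )
        return depth[node]

    return max(visit(node) for node in dependencies)


# Events: each dispatch waits only for the work it really depends on.
with_events = {
    1: [], 2: [], 3: [], 4: [], 5: [], 6: [],
    7: [1, 2, 3], 8: [1, 2, 3],   # wait on event A
    9: [4, 5, 6],                 # wait on event B
}

# Full barrier after each group: every dispatch waits for the previous
# group, and so transitively for all earlier groups.
groups = [[1, 2, 3], [4, 5, 6], [7, 8], [9]]
with_barriers = {}
for i, group in enumerate(groups):
    earlier = groups[i - 1] if i > 0 else []
    for node in group:
        with_barriers[node] = list(earlier)
```

In this model the event version has a chain length of two groups while the barrier version has four, so more dispatches are free to overlap on the GPU.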

47 of 62

> Synchronization

VK_IMAGE_LAYOUT_UNDEFINED

VK_IMAGE_LAYOUT_GENERAL

VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL

VK_IMAGE_LAYOUT_DEPTH_STENCIL_ATTACHMENT_OPTIMAL

VK_IMAGE_LAYOUT_DEPTH_STENCIL_READ_ONLY_OPTIMAL

VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL

VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL

VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL

VK_IMAGE_LAYOUT_PREINITIALIZED

VK_IMAGE_LAYOUT_DEPTH_READ_ONLY_STENCIL_ATTACHMENT_OPTIMAL

VK_IMAGE_LAYOUT_DEPTH_ATTACHMENT_STENCIL_READ_ONLY_OPTIMAL

VK_ACCESS_INDIRECT_COMMAND_READ_BIT

VK_ACCESS_INDEX_READ_BIT

VK_ACCESS_VERTEX_ATTRIBUTE_READ_BIT

VK_ACCESS_UNIFORM_READ_BIT

VK_ACCESS_INPUT_ATTACHMENT_READ_BIT

VK_ACCESS_SHADER_READ_BIT

VK_ACCESS_SHADER_WRITE_BIT

VK_ACCESS_COLOR_ATTACHMENT_READ_BIT

VK_ACCESS_COLOR_ATTACHMENT_WRITE_BIT

VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_READ_BIT

VK_ACCESS_DEPTH_STENCIL_ATTACHMENT_WRITE_BIT

VK_ACCESS_TRANSFER_READ_BIT

VK_ACCESS_TRANSFER_WRITE_BIT

VK_ACCESS_HOST_READ_BIT

VK_ACCESS_HOST_WRITE_BIT

VK_ACCESS_MEMORY_READ_BIT

VK_ACCESS_MEMORY_WRITE_BIT

Stage

Layout

.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,

.dstAccessMask = VK_ACCESS_SHADER_READ_BIT,

.oldLayout = VK_IMAGE_LAYOUT_GENERAL,

.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL

48 of 62

Manual synchronization is difficult,
GL Driver does all this by default,
Application could do it better.

> Synchronization

49 of 62

> Synchronization

Dependency graph (diagram): 1, 2, 3 -> 7, 8 and 4, 5, 6 -> 9

50 of 62

> Synchronization

Diagram: dispatches 1, 2, 3, 7, 8, 4, 5, 6, 9

51 of 62

Manual synchronization is difficult,
GL Driver does all this by default,
Application could do it better,
Frame graph abstraction helps,
Frame graph knows more about your resources than the driver.

> Synchronization
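
A minimal sketch of the frame graph idea: passes declare what they read and write, and the barriers fall out of those declarations. The pass and resource names follow the three-step frame from the synchronization slides. The tuple layout and function name are assumptions for illustration, and a real frame graph also tracks image layouts, stages and resource lifetimes.

```python
def derive_barriers(passes):
    """Walk passes in order and emit a barrier for each write-then-read.

    `passes` is a list of (name, reads, writes) tuples. The result lists
    (resource, producer, consumer) triples that need synchronization.
    """
    last_writer = {}
    barriers = []
    for name, reads, writes in passes:
        for resource in reads:
            if resource in last_writer:
                barriers.append((resource, last_writer[resource], name))
        for resource in writes:
            last_writer[resource] = name
    return barriers


# Render scene => Color RT => Anti-Alias RT => Display
frame = [
    ("render_scene", [], ["color_rt"]),
    ("anti_alias", ["color_rt"], ["aa_rt"]),
    ("display", ["aa_rt"], []),
]
barriers = derive_barriers(frame)
```

Because the graph sees the whole frame up front, it can place exactly the barriers that are needed, which a driver guessing from individual calls cannot do.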

52 of 62

- VK_KHR_acceleration_structure

- VK_KHR_ray_tracing_pipeline

- SPV_KHR_ray_tracing

- GLSL_EXT_ray_tracing

Not available on GL,
Not available on 99% of mobile phones,
Supported on PC (RTX cards)

> Hardware Raytracing

53 of 62

Summary

Predictable shader compilation,
Minimal error checking,
State Management,
Multithreading,
Memory Management,
Synchronization,
Hardware RT

54 of 62

55 of 62

GLES 2.0 2012+

GLES 3.1 2016+

Metal 2014+

Vulkan 2018+

Vulkan -> [MoltenVK] -> Metal

57 of 62

GLES 2.0 2012+ (96.5%)*

GLES 3.1 2016+ (90%)*

Metal 2014+ (up to 50%)*

Vulkan 2018+ (up to 30%)*

*Estimates

Vulkan supports all 2018+ mobile devices

Limited

58 of 62

References

Vulkan-tutorial.com
Vulkan Best Practices for Mobile Developers
YouTube Lectures - Computer Graphics at TU Wien
Memory Management in Vulkan – 2018, Jordan Logan, AMD
Mobile Graphics Processing Fundamentals – 2021, ARM webinar
Developer.arm.com/documentation/102662/0100/Overview
Themaister.net/blog/2017/08/15/render-graphs-and-vulkan-a-deep-dive/

59 of 62

Thanks for your attention,

Questions?

60 of 62

Diagram: Geometry Pass, Draw, Material Pass, Lighting Pass

Diagram: Geometry Pass, Draw, Material + Lighting Pass

61 of 62

62 of 62

O(num_geometry_fragments * num_lights)

O(screen_resolution * num_lights)
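
The two bounds can be compared with a little arithmetic. Reading the first expression as forward shading and the second as deferred shading is an interpretation of the slide, and the scene numbers (resolution, overdraw, light count) are assumptions chosen only to show the effect of overdraw.

```python
def forward_cost(num_geometry_fragments: int, num_lights: int) -> int:
    """Every rasterized fragment is lit, including ones later overdrawn."""
    return num_geometry_fragments * num_lights


def deferred_cost(screen_resolution: int, num_lights: int) -> int:
    """Lighting runs once per visible pixel, after geometry is resolved."""
    return screen_resolution * num_lights


# ASSUMED scene: 1280x720 target, overdraw of 3 fragments per pixel, 16 lights.
pixels = 1280 * 720
fragments = pixels * 3
lights = 16

ratio = forward_cost(fragments, lights) / deferred_cost(pixels, lights)
```

With these numbers the lighting work differs by exactly the overdraw factor, so the gap widens as scenes get more layered.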