Fixing The GPU

This represents the opinions of the author and is not necessarily aligned (or not aligned) with any {prior, current, or future} employers! Also the author only cares about compute generated graphics (no fixed function and no pixel shaders), so those asks won’t be covered here!

The purpose is simple: record problems and suggested solutions for issues with GPU programming. Circa 2024, one of the larger problems, as game production costs skyrocket, is that top AAA titles need to find their way across platforms (either at launch, or later after a timed exclusive) to hit a large enough market to amortize cost. Thus it would be nice to find a way to make forward progress fixing these fundamental issues across as much of the industry as possible so that GPU shader devs can ultimately write better algorithms.

Note this is a work in progress, not even close to being finished, so best to just reference instead of grabbing a copy right now.

Change Log

20241121 - Starting to collect ideas

20241122 - Expanding WORM/etc

20241123 - Reordering to make this more useful (TLDR problems)

20241124 - Adding a section on hardware design improvements too

20241125 - Added PSO query to get workgroup count that fills empty GPU, etc

20241126 - Added CPU Push to GPU section, started inserting some AMD compiler bugs

20241127 - Started adding extra resource and doc links, like online shader compiler

20241127 - Added a section on efficient PC logic (fixed 32-bit MSB)

20241127 - Added section for portable TBUFFER support

20241128 - Added section on Stream layout qualifier (well thought through now)

20241129 - Added the section on SSBO array[index]'ing problems with disassembly

20241130 - Added section for AMD bugs on GL_EXT_buffer_reference

20241202 - Added buffer data in shaders section and mapped page cache section

20241202 - Added labels and goto support in shaders section

20241203 - Clean up and un-todo'ing intrinsics and inline asm section

20241204 - Added section on divergent texture vs hardware design

20241215 - Texel offset notes, additions and clean-up of buffers vs pointer stuff

20241216 - Accessing Memory Super Topic up

20241217 - Added CU Shared SGPR area

20241218 - Added the section on GPU Spin Loops or Lock-Free Retry Loops

20241220 - Added AtomicAdd bugs and workaround in the AMD Compiler Bugs section


Suggestions - Small Scope / Easy

These are mostly fleshed-out suggestions which would be easy to implement the front-ends for, even if IHV backend support might lag due to higher implementation time.

Branch Related Hinting

  • Ability to label a branch {if,while,etc} with explicit coherence or divergence, along with providing a probability of the conditional evaluation, and explicit whole-wave mode
  • This would enable a well written compiler to produce some optimal code paths (examples below) that compilers today cannot do
  • GLSL syntax suggestion(s)
  • [true=N] where N is a {0 to 1} value of the probability of true
  • [divergent] meaning the conditional will evaluate differently per lane
  • [coherent] meaning the conditional will evaluate the same for all lanes
  • [whole] meaning all lanes are active
  • Compilers can do some optimizations when they know all lanes are active before a branch, for example subgroupElect() could just use lane 0 explicitly instead of searching for the lowest active lane
  • [partial] meaning some lanes could be inactive
  • [whole] [true=0.5] [divergent] if(...
  • Means lanes of the wave will always take both paths
  • On AMD for example, the compiler could remove the branching and simply mask EXEC for better performance
  • [whole] [true=0.99] [coherent] if(...
  • Means the else path would almost never be taken
  • Compilers could make the if(true) path completely linear in the binary
  • Limiting the overhead of branching away to the uncommon path and branching back to the join point so it only happens when the uncommon path is taken
  • Author's usage
  • Using defines for these which map to comments since there is no implementation
  • But at least the code is setup for them if they ever arrive
  • #define JMP_LANE /* divergent */
  • #define JMP_WAVE /* coherent */
  • #define JMP_FULL /* full wave */
  • #define JMP_PART /* partial wave */
  • #define JMP_BOTH /* expect both paths taken */
  • #define JMP_HERE /* expect this path nearly always taken */
  • #define JMP_RARE /* expect this path almost never taken */
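
A minimal usage sketch of the defines above; 'needSlowPath' and the binding are made-up stand-ins, and the hints remain comments until an implementation exists.

-

 layout(local_size_x=64)in;
 layout(set=0,binding=0,std430)buffer ssboOut{uint outData[];};
 #define JMP_WAVE /* coherent */
 #define JMP_FULL /* full wave */
 #define JMP_RARE /* expect this path almost never taken */
 void main(){
  // Wave-coherent because every lane reads the same word.
  bool needSlowPath=(outData[0]==0xffffffffu);
  // Whole wave active, wave-coherent conditional, almost never true: a compiler could
  // keep the common path fully linear and only branch away for the rare case.
  JMP_FULL JMP_WAVE JMP_RARE if(needSlowPath){outData[gl_GlobalInvocationID.x]=0u;}}

-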

Buffer Data in Shaders

  • Ability to create global buffer data inside a shader
  • With ability to control alignment
  • With an include for binary data
  • Etc
  • Ability to create a 64-bit pointer to the data and use GL_EXT_buffer_reference
  • TODO: Would also want to be able to do buffer access
  • Probably could query something on the PSO
  • To get the GPU address to pass into some descriptor creation?
  • This enables GPU shaders to have advantages similar to CPU code
  • An initialized data segment for constant and/or mutable data
  • Bonus points
  • If the API guarantees the 32-bit MSBs of the 64-bit pointer to be constant
  • Such that only a 32-bit LSB needs to be passed around
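
A sketch of how the consuming side could look. GL_EXT_buffer_reference is real GLSL, but the shader-embedded data segment and the PSO address query are the proposal, so those parts appear only as comments (names here are hypothetical).

-

 #extension GL_EXT_buffer_reference : require
 layout(buffer_reference,std430)buffer LutRef{uint lut[];};
 // Today the 64-bit address must come from the app; with the proposal it could instead
 // point at a constant data segment declared (or #include'd as binary) inside this
 // shader, with its GPU address queried off the PSO.
 layout(push_constant)uniform Push{LutRef lutPtr;};
 uint ReadLut(uint i){return lutPtr.lut[i];}

-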

Cache Control - Streaming

  • Suggestion to make a new layout qualifier stream
  • Streaming on Read
  • Suggested implementation: Evict First on all Levels, Miss Evict as backup
  • Introduce a 'stream' memory qualifier for the layout
  • Can mix this with 'readonly'
  • Obvious case, only one wave on the GPU ever reads a given line inside one load
  • Can use streaming cache hint on all levels of the cache
  • easily portable
  • Note on AMD K$ lines are likely 64-bytes (not 128-bytes like L0)
  • SMEM supports reading 16 DWORDs in one op (full 64-byte line in one shot)
  • More complex, only one wave accesses the line, but across multiple separate loads
  • If loads are separated (not claused on hardware that could join the collection), then Miss Evict would provide poor performance
  • Forced Miss, then Evict (aka Miss Evict) vs Evict First
  • Likely HW wouldn't have a true bypass on loads
  • Because it would need to duplicate paths (expensive)
  • More likely HW forces a miss then evicts the content after load
  • If the line was read-only it shouldn't get write-back (empty byte write mask)
  • So not concerned if the line stays
  • Evict First in theory would be better than Miss Evict in this case
  • They both will provide the first line for the next miss in the set
  • But Evict First enables the line to get reuse possibly if there is a secondary load
  • Streaming on Write
  • Suggested implementation: Evict First on all Levels, Hit Evict as backup
  • Introduce a 'stream' memory qualifier for the layout
  • Can mix this with 'writeonly'
  • Note don't need to force Eviction on the incoherent cache levels
  • Because this is streaming, no reuse case for the given frame
  • No chance to read a stale line (it will never get read)
  • Obvious case, only one wave on the GPU ever writes a given line inside one store
  • No write-combining at other levels in the cache
  • No reuse until next frame or later (no practical way to capture multi-frame coherence, given limited cache sizes)
  • Less obvious case, separate stores need write-combining
  • Maybe Evict First helps here
  • Likely HW wouldn't have a true bypass on stores either
  • One point of caches is to collect traffic to coalesce operations to the open DRAM page (hopefully that term is right) to increase efficiency
  • Hit Evict vs Evict First
  • Hit Evict could take advantage of a partial line write after a full line load
  • Evict First though would have a larger window for any possible write-combining with no extra downsides
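
A hypothetical syntax sketch of the proposed 'stream' qualifier (nothing below exists today; binding numbers and names are made up).

-

 layout(set=0,binding=0,std430)stream readonly buffer ssboSrc{uint srcData[];};
 layout(set=0,binding=1,std430)stream writeonly buffer ssboDst{uint dstData[];};
 layout(local_size_x=64)in;
 void main(){
  // Each cacheline is touched exactly once per frame by exactly one wave, so there is
  // no reuse worth keeping at any cache level: evict-first on the read and the write.
  dstData[gl_GlobalInvocationID.x]=srcData[gl_GlobalInvocationID.x];}

-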

Device-Uncached Portability Path

  • To make this work and be portable one has to build an API that is the least common denominator (LCD) of the vendors
  • Suggesting requiring at least two things together
  • New layout qualifier device_uncached
  • This would be a NOP on AMD
  • But would trigger the .cv and .wt instruction bits on NVIDIA
  • Note if NVIDIA could implement uncached behavior via page table bits, then this new layout qualifier would not be required
  • New portable DEVICE_UNCACHED_BIT_KHR memory type allocation
  • Which, if it exists, would need to be used to create any allocation that is accessed using the new device_uncached layout qualifier
  • This is a portable version of the existing DEVICE_UNCACHED_BIT_AMD
  • Net result is that at least uncached load/store to/from system RAM would work
  • The usage for this is to provide an efficient way for CPU/GPU communication which does not rely on high overhead and high latency L2 writeback, or address range tracking cache flushes, or more complex in shader cache control instructions
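
A hypothetical sketch of the two halves of the ask used together (neither the KHR memory type nor the layout qualifier exists today; names are placeholders).

-

 // The allocation backing this binding would have been created with the proposed
 // DEVICE_UNCACHED_BIT_KHR memory type on the API side.
 layout(set=0,binding=0,std430)device_uncached volatile buffer ssboMailbox{
  uint cpuToGpu[32];  // small CPU -> GPU communication words
  uint gpuToCpu[32];};// small GPU -> CPU communication words
 void PostToCpu(uint slot,uint value){
  // NOP on AMD (the memory type covers it); maps to .cv/.wt style bits on NVIDIA.
  gpuToCpu[slot]=value;}

-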

Global Structure Unions / Register Aliasing Hints

  • The first part is simple: define a structure of variables in global scope
  • With explicit layout rules so the mapping to a set of 32-bit registers is obvious
  • For example, 64-bit values would get pairs of 32-bit registers, with .x in bank&1=0 and .y in bank&1=1, and so on
  • The second part is to add union support to the shader language so one could alias a collection of the above described structures in global scope
  • Note the initial ask here is simply union support in the shading language
  • Secondary ask: something like spirv-opt would have an option to not flatten and remove the unions/structures
  • Note that unions have important usage even outside the ideas proposed here
  • This collectively enables an easy high-level framework to express low-level register aliasing/allocation hints that could be passed from high level shading language all the way down through an IR to the IHV compiler
  • The IHV compiler could still take the construct as a hint and override the layout/allocation
  • The other option is that the IHV compiler uses the described aliasing as a starting point for register allocation, which could get compilers out of trouble cases, and also could make some big improvements on compile time
  • Note it would also be possible to use these global structure unions as ABI to provide a way to link common modules together without having all the problems of call stacks, but then one would need a more explicit layout instead of strictly allocation/aliasing hints (so perhaps out of the scope of this initial ask, BUT it could be a very good solution to the shader combinatorial explosion problem!)
  • One of the largest problems compilers have today is register allocation (due to the explosion of state in SSA form), and compile times, both things this construct could greatly help improve
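
A hypothetical syntax sketch (GLSL has no unions today): two phase-specific state blocks aliased in global scope as a register allocation hint.

-

 struct PhaseA{vec4 accum;uint count;};             // 5 x 32-bit registers live in phase A
 struct PhaseB{uvec2 sortKey;uint bucket;float w;}; // 4 x 32-bit registers live in phase B
 union SharedRegs{PhaseA a;PhaseB b;} gRegs;        // hint: A and B may share the same registers

-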

GPU-Filling Dispatch Query for PSO

  • Answers the question: how many workgroups does it take to fill the machine for a given PSO?
  • Suggested implementation: uint32_t vkPipelineFillSize(VkDevice d, VkPipeline* p)
  • Basic functionality for persistent workgroups
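
A hedged sketch (C) of how the proposed query could be used to launch just enough persistent workgroups to fill an otherwise empty GPU; vkPipelineFillSize() is the hypothetical entry point suggested above, the rest is standard Vulkan.

-

 #include <vulkan/vulkan.h>
 uint32_t vkPipelineFillSize(VkDevice d,VkPipeline* p); /* hypothetical, per the suggestion */
 void DispatchPersistent(VkDevice device,VkCommandBuffer cmd,VkPipeline pipeline){
  uint32_t fill=vkPipelineFillSize(device,&pipeline); /* workgroups that fill the GPU */
  vkCmdBindPipeline(cmd,VK_PIPELINE_BIND_POINT_COMPUTE,pipeline);
  /* Each persistent workgroup then loops pulling work until a shared queue is empty. */
  vkCmdDispatch(cmd,fill,1,1);}

-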

Labels and Goto Support In Shaders

  • With ability to get pointers to labels and build jump tables
  • Bonus points if one can guarantee high 32-bit MSB bits are constant
  • So one only has to store the lower 32-bit LSBs in the jump tables instead of 64-bits
  • While pointers to labels is something SPIR-V likely doesn't have
  • SPIR-V does have labels so the initial ask of {labels and gotos} without pointers should be easily possible without IHV support even perhaps
  • Probably good to split this ASK into two rollouts, with and without pointers
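
A hypothetical sketch of the first (no pointers) rollout; GLSL has neither labels nor goto today, so this is purely illustrative.

-

 void CountDown(uint start){
  uint state=start;
 top:
  if(state==0u)goto done; // conditional forward jump
  state=state-1u;
  goto top;               // backward jump, i.e. a loop without structured control flow
 done:
  return;}

-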

Preoptimized Pragma

  • Suggested shader language syntax: #pragma preoptimized
  • Or alternatively not a pragma via [preoptimized]
  • Turn off compiler helper paths for poorly written shaders that function as de-optimization paths for well written shaders
  • Examples of these helper paths
  • Checking if atomic operations are wave uniform or not and then transforming the atomic from multi-lane to single lane

Realtime Shader Clock Frequency Query

  • Implementation suggestion for Vulkan
  • Add a new structure that works in conjunction with vkGetPhysicalDeviceFeatures2()
  • struct {...; uint64_t frequency; } VkPhysicalDeviceShaderRealtimeClockFeatures;
  • GL_EXT_shader_realtime_clock provides ability to get a coherent wall-clock counter from the shader, without a defined clock frequency
  • This would provide a way to query that clock frequency
  • A workaround is to attempt to derive it at runtime using questionable methods (like measuring and rounding, or using a device lookup table, etc)
  • One usage case for this is a component of Adaptive Workload Scaling (AWS)
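
A sketch of the 'questionable' runtime workaround (GLSL side only, names hypothetical): sample the realtime clock from a trivial dispatch, have the CPU wall-clock two such samples spaced well apart, then the frequency is roughly deltaGpuTicks / deltaCpuSeconds.

-

 #extension GL_EXT_shader_realtime_clock : require
 layout(local_size_x=1)in;
 layout(set=0,binding=0,std430)buffer ssboClock{uvec2 gpuTick;};
 void main(){gpuTick=clockRealtime2x32EXT();} // {lo,hi} halves of the 64-bit counter

-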

TBUFFER Support - Portable Form

  • Generic loads from a buffer descriptor where the type is provided in the opcode
  • It is a foundational component of good compute code
  • NOTE definitely want BYTE-OFFSETs instead of INDEXES!
  • See "Index vs Offset - Analysis of Options"
  • Is there any way this could get portability?
  • The simple types YES: <int,uint,half,short,long,double,etc><1,2,3,4>
  • At a minimum the ask is for those
  • See a horrible example of how to emulate today in GLSL below
  • TODO: The complex types (image pixel types)?
  • In theory it is possible on architectures without TBUFFER ops
  • Have a collection of typed buffer descriptors that aliases the same memory
  • Just with completely different types
  • See lower section, that will also work

The author has been "emulating this" for simple types in Vulkan

  • This example is in here to point out that this is an area that GLSL does a very poor job on
  • It is done by using layout() aliasing
  • Using different structures for the different types
  • Also aliasing for access control {readonly, writeonly coherent, volatile, etc}
  • The H_() macro is used to set up a {structure and layout alias} for the different types
  • Note the types are also defined to be simple {F1,F2,F4,I1,I2,I4,etc}

-

 // Aliasing as multiple types, all of these arrays should be 128-byte aligned
 #define H_(t,s) struct Ram##t{\
  t zro[64/s]; /* Always zero */\
  t gfd[64/s]; /* Dispatch indirect and management */\
  t msc[64/s]; /* Misc constants */\
  t hxc[64/s]; /* Hex dump characters */\
  t hxd[(256*256)/s]; /* This is for debug, but always having it in here */\
  t lseGrp[(LSE_GRP_MAX*4)/s]; /* todo */\
  t lseRot[(65536*8)/s]; /* todo */\
  t lseRul[(65536*4)/s]; /* todo */\
  ...
 }
 // To
 H_(F1,1);H_(F2,2);H_(F4,4);H_(I1,1);H_(I2,2);H_(I4,4);
 H_(L1,2);H_(L2,4);H_(L4,8);H_(H2,1);H_(H4,2);
 #undef H_
 ...
 #define H_(t) layout(set=0,binding=2,std430)\
  readonly buffer ssboRamR##t {Ram##t ramR##t;}
 H_(F1);H_(F2);H_(F4);H_(I1);H_(I2);H_(I4);H_(L1);H_(L2);H_(L4);H_(H2);H_(H4);
 #undef H_
//------------------------------------------------------------------------------
 #define H_(t) layout(set=0,binding=2,std430)\
  writeonly coherent buffer ssboRamW##t {Ram##t ramW##t;}
 H_(F1);H_(F2);H_(F4);H_(I1);H_(I2);H_(I4);H_(L1);H_(L2);H_(L4);H_(H2);H_(H4);
 #undef H_
//------------------------------------------------------------------------------
 #define H_(t) layout(set=0,binding=2,std430)\
  volatile /* readonly */ buffer ssboRamV##t {Ram##t ramV##t;}
 H_(F1);H_(F2);H_(F4);H_(I1);H_(I2);H_(I4);H_(L1);H_(L2);H_(L4);H_(H2);H_(H4);
 #undef H_
//------------------------------------------------------------------------------
 #define H_(t) layout(set=0,binding=2,std430)\
  buffer ssboRamA##t {Ram##t ramA##t;}
 H_(F1);H_(F2);H_(F4);H_(I1);H_(I2);H_(I4);H_(L1);H_(L2);H_(L4);H_(H2);H_(H4);
 #undef H_

-

TBUFFER Emulation Round 2?

Storage Texel Buffer

  • Alias SSBO and Storage Texel Buffers [STBs]
  • Use the SSBOs for simple types
  • Use the STBs for the complex pixel types (like 10:10:10:2)
  • Have to special case these and compute SSBO offsets manually
  • 1D texels, provides image load/store/atomic access
  • Associated with a buffer resource via a buffer view
  • Format support?
  • VK_FORMAT_FEATURE_STORAGE_TEXEL_BUFFER_BIT
  • VK_FORMAT_FEATURE_STORAGE_TEXEL_BUFFER_ATOMIC_BIT
  • Don't need atomics on these complex types
  • STB format support - important complex formats
  • For AMD Vega and up, for NVIDIA Turing and up (Author's min-spec)
  • 9-bit shared 5-bit E - No AMD, Read-only NVIDIA
  • 10:11:11 - Yes AMD/NV
  • 10:10:10:2 - Yes AMD/NV
  • 8-bit/channel unorm/snorm - Yes AMD/NVIDIA
  • 16-bit/channel unorm/snorm - Yes AMD/NVIDIA
  • sRGB - No AMD, Yes NVIDIA
  • Size limits?
  • VkPhysicalDeviceLimits.maxTexelBufferElements
  • AMD (at least Vega and up) = 4 GiElements
  • NVIDIA (at least Maxwell and up) = 128 MiElements
  • NVIDIA is the limiter
  • 32-bit type = 512 MiB
  • 64-bit type = 1 GiB
  • 128-bit type = 2 GiB
  • Can alias buffer views that are smaller for the smaller types
  • Apply limits to data layout based on size but only for complex types
  • An acceptable trade off
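
A sketch of the aliasing described above (binding number and names are hypothetical): a storage texel buffer view over the same memory as the SSBO, used only for the complex pixel types such as 10:11:11.

-

 layout(set=0,binding=3,r11f_g11f_b10f)uniform readonly imageBuffer stbR11G11B10;
 vec3 LoadR11G11B10(uint index){
  // Format conversion (10:11:11 -> float) happens on the texel-buffer path; the same
  // bytes are reachable through the SSBO alias at byte offset index*4.
  return imageLoad(stbR11G11B10,int(index)).rgb;}

-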

Transition-Free Swapchain

  • Way to explicitly ask for a swapchain image allocation that requires no transition
  • Vulkan suggestion: a new flag for VkSwapchainCreateInfoKHR.flags
  • In VkSwapchainCreateFlagBitsKHR
  • VK_SWAPCHAIN_CREATE_TRANSITION_FREE
  • Which would avoid the requirement to do the image transition to VK_IMAGE_LAYOUT_PRESENT_SRC_KHR
  • Transition as in, say, for some meta-data, or perhaps a tiling mode change, etc
  • So this would force the driver to choose a compatible tiling mode for both compute imageStores and scanout
  • This way it is possible to avoid transition overhead and complexity, and also have a safe mechanism to do swap-tear if slightly over budget (write into the swap late after the image was already scheduled to be flipped), or leverage for front-buffer rendering
  • The current possibly unsafe workaround (that the author is currently using on AMD/NV)
  • Use VK_IMAGE_USAGE_STORAGE_BIT to try to get a more compatible tiling mode
  • Skip the layout transition in Vulkan (not legal, but still runs)
  • Hope that it just works
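
A sketch (C) of the creation side of the current possibly-unsafe workaround; the proposed VK_SWAPCHAIN_CREATE_TRANSITION_FREE flag is hypothetical and only appears as a comment.

-

 #include <vulkan/vulkan.h>
 VkSwapchainCreateInfoKHR swapInfo={
  .sType=VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR,
  /* STORAGE usage nudges the driver toward a tiling mode compatible with compute
     imageStore; the layout transition to PRESENT_SRC is then skipped (not legal). */
  .imageUsage=VK_IMAGE_USAGE_STORAGE_BIT|VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT,
  /* The proposal above would add a VK_SWAPCHAIN_CREATE_TRANSITION_FREE flag here. */
 };

-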

Wave-Uniform Qualifier

  • GLSL/etc variable storage qualifier to denote a value is uniform across a wave (aka subgroup)
  • GLSL syntax suggestion: subgroup_uniform int x=23;
  • For AMD this would hint a variable going into an SGPR and for NVIDIA since Turing it would be a URX for a variable, or UPX for a predicate
  • Compilers have performance bugs related to not optimizing using wave-uniform registers and logic
  • This change would greatly reduce the amount of bugs in practice while also possibly increasing compile speed because less has to be inferred by the compiler
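
A hypothetical syntax sketch of the qualifier in use (it does not exist today); the point is that the loop counter, compare, and branch could then live in scalar/uniform registers without the compiler having to prove uniformity itself.

-

 layout(push_constant)uniform Push{uint passCount;};
 float Accumulate(float v){
  subgroup_uniform uint n=passCount; // hint: same value for every lane of the wave
  float sum=0.0;
  for(uint i=0u;i<n;++i)sum+=v;
  return sum;}

-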


Suggestions - Medium Scope

Binary Compute PSOs

  • Provides the ability for an ISV to either run the existing portable SPIR-V based shader, or alternatively provide platform/chipset specific binary replacements for the subset of platforms whose performance they care about for a given product lifetime
  • Define some simplified, stable, and well documented ABI for compute PSOs
  • This should be simple for compute shaders
  • One could standardize on 64-bit buffer pointers instead of descriptors
  • And even reduce this to one descriptor set if required
  • This need not take on the complexity of all the Vulkan binding craziness
  • Version the ABI
  • Query chipset compatibility list and ABI version
  • Note a vendor could choose to do a binary translation if they wanted to for future portability
  • Or just not support binaries for older chipsets, either is fine
  • Have a VK entry point to load a binary shader
  • Don’t necessarily need an IHV assembler, just need docs on required ELF parts
  • Up to the dev to choose platform lock-in or to provide a SPIR-V backup shader
  • Some markets like arcade machines or rides, etc, don’t need forward compatibility
  • Would be nice to be able to do R&D at the machine level
  • Would be nice to be able to provide both an ASM shader and the SPIR-V and file perf bugs

Binary Translation Sub-Topic

  • To Angelo’s point: a binary shader is an IL (intermediate language) that one could compile from
  • High-end emulators for more modern systems are doing exactly that
  • Binary translation is a fast way to bring up modern hardware
  • Take existing binary shaders of the prior architecture and run them through binary translation during bring-up of the new design
  • GPU ISAs are not radically changing in core functionality
  • AMD since GCN (2012) is quite stable despite the opcode encoding changing
  • The wave64 64 VGPR formula still works quite well
  • Still an SGPR + SGPR descriptor based architecture
  • The base 32-bit opcodes are very similar across time
  • The big changes (SIMD16 4-clock to SIMD32 1-clock) were effectively transparent
  • Although effective ALU latency changes
  • There are add-ons over time which could be optional extensions to a base design
  • Like 16-bit and packed 16-bit support
  • Latest RDNA rolls compile scheduling hints into the opcodes
  • But those are things that could be done in a binary translation layer
  • WMMA stuff can be looked at as an opcode extension/evolution but doesn’t change the basic core design
  • Similar to how x86 has evolved over time but still runs the base arch
  • The major changes on the GPU side are mostly opcode encoding, which is easy to handle in binary translation
  • Similar story for NV: since Kepler it has just been a light evolution
  • PTX (a virtualized register allocation assembly) easily maps across chipsets
  • One could build out some nearly-same-as-HW assembly spec, for either AMD or NV, and treat the major HW additions (like ML ops) as optional extensions, then build out different wave32 and wave64 versions of shaders and then retarget to basically all modern {AMD,NV} hardware
  • This would be at worst better than starting from SPIR-V

Instruction Intrinsics

  • Vendors rely on implicit pattern matching and that pattern matching is a constant source of a large number of performance bugs, and this has been the state of the industry since the beginning
  • If it was practically possible to actually fix those kinds of bugs, we wouldn’t still be seeing them
  • Even simple examples like getting native normalized {cos,sin} on AMD are not pattern matched, and if you try to optimize it you get 2 extra MULs instead of none
  • Some instructions have no easy way to express in ways a compiler could pattern match
  • AMD for instance has saturating integer math support that is impossible to use or express
  • Author has a decade+ long ask that all vendors expose their native instructions through a set of instruction intrinsics
  • Current practice requires a long political battle for each and every intrinsic, with lots of per-extension overheads
  • Instead simply do one extension for all intrinsics on the HW family

Suggestions - Large Scope / Hard

Inline Assembly

  • Like GCC’s inline assembly (both raw assembly, and assembly that can co-exist with higher-level language)
  • This is in many respects a better option than only having instruction intrinsics, but a much harder ask, due to the lack of a way to express this through current IRs
  • With inline assembly one can build the instruction intrinsics on the dev side
  • Example from CPU x86-land below (one of the author’s engines)
  • inline uint64_t RolL1(uint64_t v,uint32_t r){asm("rolq %1,%0":"+g"(v):"cJ"(r));return v;}
  • One example of using GPU inline assembly in an open source project that got used in a huge number of titles of the time: FXAA 3.11 has a perf fix on Xbox360 leveraging the platform's inline assembly support
  • When one mixes inline assembly with pre-processor defines, it is possible to target a large amount of hardware in an optimized way and still have platform portable fallbacks all in the same source base

Discussion Topics

16-Bit Support

  • AMD - GCN5 (Vega) and up
  • Actually before Vega, Polaris had 16-bit support, just not double rate
  • But at least one could get the VGPR savings (which is huge)
  • Unfortunately they effectively disabled it in the drivers
  • NVIDIA - Turing (aka 20 series) and up
  • As of 2023, at least 30% of the hardware on the Steam HW Survey supported 16-bit
  • Which is good enough for the author to only write 16-bit supported shaders moving forward

Accessing Memory Super Topic

Writing this with a focus on AMD + Vulkan + Compute, of course, since it's the most documented and accessible platform to do analysis on ...

Review on AMD Memory Interface Throughput

  • Highlights
  • Need buffers for full 32-bit load/store throughput
  • Either buffer or image for 128-bit load/store (same throughput)
  • RDNA
  • TODO: Missing a lot of information here ...
  • Buffers on cache hits return 32-bit/lane/clock (best case)
  • So 128-bit load hit takes 4 clocks to return (best case)
  • RDNA introduced an imageLoad path that bypasses the filtering pipeline
  • TODO: Compare latency of imageLoad vs nearest sampling vs buffer load

Buffer Compression Problem

  • Today's hardware has compression for images but not for buffers
  • If an algorithm is bandwidth limited, and requires a clear each frame, compression gets important
  • If it's just {clear,write,read} that's 3 round trips to DRAM
  • If it's {fast clear,write,read} with compression it can be a >30% savings
  • A subset of problems normally solved using buffers likely needs images for peak perf
  • And the only way to manage that is to special case it

Constant Problem

  • No image/texture access for constants (K$)
  • Can alias a 1D image with a buffer, and access using a buffer
  • Note such aliasing implies no compression in the image
  • Alternative like loading via VMEM and V_READLANE_B32'ing back to SGPRs is likely not performant, but is at least possible
  • Pre-copying from compressed image memory to uncompressed buffer memory likely negates the compression gain
  • Buffers are effectively required for per-wave granularity items

Buffer Size Limit Problem

  • Both NV and AMD have maxStorageBufferRange = 4 GiB
  • So one needs to split into multiple buffers to get beyond 4 GiB of data
  • Buffers though have the descriptor divergence problem
  • If one wants to gain unified access to >4 GiB of data the only real option is Pointers

Variable Type Access Problem?

  • Is this a real problem in practice?
  • Problem
  • If one has an index to an item
  • Then wants to access different sized types for load
  • One has to shift the index to the new associated granularity
  • So for some things 'indexing' is "worse" than simply using a 'byte offset' or 'pointer'
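
A small illustration using the layout-aliased arrays from the earlier H_() example: the same bytes read through different element sizes force the index to be rescaled per type, while the hardware underneath only ever wants a single byte offset.

-

 I4 wide  =ramRI4.hxd[itemIndex];    // 128-bit elements: index used as-is
 I1 narrow=ramRI1.hxd[itemIndex<<2]; // same starting bytes as 32-bit elements: index rescaled

-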

Summary of all the Buffer Access Points

SSBO Family

  • VK API: [V#Base + immSsboElementOffset + (index<<shift)]
  • Hardware
  • SMEM: [V#Base + sgprOffset32 + imm20]
  • VMEM: [V#Base + sgprOffset32 + imm12 + {vgprOffset} + {vgprIndex<<shift}]
  • Bugs
  • The imm12 works
  • It won't leverage sgprOffset32 for when immSsboElementOffset gets too large
  • Instead it burns VALU ops and VGPRs to add to vgprOffset
  • The '<<shift' value comes from the descriptor which the compiler doesn't know
  • Even though the driver could perhaps just pick 4-byte element size
  • So the optional {vgprIndex<<shift} is never used
  • The mismatch between API being indexed and the HW being offsetted
  • Means that the compiler always introduces extra VGPRs and VALU
  • For the '+ (index<<shift)' part
  • Note this trick won't work for 'uintArray[byteOffset>>2]'
  • Instead of NOPing the '<<2' the compiler will double the number of unnecessary operations

TEXEL BUFFER Family

  • VK API: [V#Base + (index<<shift)]
  • Hardware
  • Same implementation as SSBOs (BUFFER ops)
  • Bugs
  • The (index<<shift) works
  • However the compiler is not able to factor out compile-time immediates
  • Large immediates won't get placed in the free SGPR offset
  • Small immediates won't leverage the opcode immediate

Texture Family (Using Texture Instead of Buffer)

  • VK API matches HW API
  • imageLoad(resource, coords, lod) ... etc
  • texture(resources, coords, lod, offsets, etc)
  • gather4(resources, coords, lod, offsets, etc)
  • Hardware issues
  • AMD VEGA+ supports NSA (non-sequential addressing)
  • So coords need not come from a sequential run of registers, the hardware can instead gather them
  • Sampling offsets are limited
  • AMD {-64 to 63}
  • NV {-8 to 7}
  • Gather4 offsets are better AMD/NV {-32 to 31}
  • However using offsets means no low-latency imageLoad() bypass path
  • TODO: Generally shouldn't expect anything better than 4 lanes/clock?
  • Simply due to only building in a 4 lane/clock texture addressing unit
  • Maybe at best ivec4 imageLoad throughput might be 16 lanes/clock?
  • Compared to 32-bit per 32-lanes per clock on buffers (verified)
  • Small formats = higher throughput on cache hits via buffers
  • And even large formats might be lower throughput on hits via imageLoad
  • Anyway this is all speculation (need actual reference)

Pointer Family

  • VK API: [pointer64]
  • Hardware
  • SMEM: [sgprBase64 + sgprOffset32 + imm20]
  • VMEM: [vgprPointer64 + imm12]
  • VMEM: [sgprBase64 + vgprOffset32 + imm12]
  • Bugs
  • Serious correctness bug, layout "coherent" is completely broken
  • This is effectively a hard stop (this interface is useless)
  • RDNA2- Compiler
  • Everything goes to the VMEM: [vgprPointer64] path without the imm12
  • Everything is 64-bit integer VALU ops (pairs of v_add_co*) for ADDs
  • Too expensive to use
  • RDNA3+ Compiler
  • On the right track
  • Compiler will use VMEM: [sgprBase64 + vgprOffset32 + imm12]
  • But there are rough edges
  • For example must manually alias packed 16-bit into 32-bit, etc

Conclusions (Where Today = 20241216)

  • Pointers are off the table completely
  • Due to the "coherent" layout correctness bug
  • And will remain that way until AMD fixes the RDNA2- compiler perf problem
  • Pointers are the worst way on that compiler to do anything (64-bit integer math everywhere)
  • When pointers become usable it will be a massive rewrite to leverage
  • The VMEM immediate size is tiny 12-bit (signed)
  • For accessing multiple things from different bases, will need to construct different 64-bit SGPR pairs for those, cannot rely on the compiler doing the right thing with the mix of 2 compile time immediates {SGPR offset, and local immediate}
  • This might require hiding immediates from the compiler via actual loaded constants to fix
  • Pointers are not necessarily faster either for VMEM stuff
  • They are missing a lot of the "free" addressing '(V#Base + sgprOffset32)'
  • Probably actually better to stick with 'one-big-buffer' approach
  • If doing 'one-big-buffer' put all immediate accessed constants in the lower 2 MiB
  • These get [V#Base + imm20] addressing which is speed-of-light using SSBOs
  • SSBOs are the only speed-of-light option today (pointers could be too if AMD fixed the drivers)
  • One has to alias a STORAGE_TEXEL_BUFFER over an SSBO to emulate TBUFFER of certain types like {<S,U>NORM*, 10:10:10:2, 10:11:11, ...}
  • In some respects STORAGE_TEXEL_BUFFER could be the fast path on AMD hardware for an 'indexed' based API
  • But it certainly isn't as fast as it could be today
  • Looking at one extra VALU instruction and possibly VGPR per access
  • But this is effectively the same driver bug overhead as today's SSBOs!
  • If maxStorageBufferRange is fixed it's easy to get to 16 GiB with 32-bit/item accesses
  • Has capacity to solve the desire to keep on fast 32-bit index based 'pointers'
  • But one couldn't factor that back into a 32-bit SGPR offset
  • Would have to use the 32-bit VGPR index as the pointer
  • The compiler knows the element stride from the layout, so in theory it could optimize immediate offsets into either a 32-bit SGPR offset for big values, or the smaller 12-bit immediate
  • One possible issue is that there is no way to easily express a dynamic but wave-coherent 32-bit base offset (acting like a coherent pointer) because the interface is 100% index based
  • This is why ultimately you'd need intrinsics here to get full perf
  • The indexing is already free in current driver implementations
  • HW TBUFFER ops might not even be better overall
  • They will be better in needing no new descriptors
  • If HW takes index stride from the descriptor and not the type in the TBUFFER
  • Then one is back to needing 3 extra descriptors
  • TODO: What does the HW do?
  • There are really only 2 possible downsides
  • You'd need {32-bit,64-bit,128-bit} descriptors to emulate SSBOs
  • So burning up to 12 SGPRs for the extra three buffer descriptors
  • Note it is 3, because the regular buffer descriptor is needed for K$ access
  • If one wanted different types from the same VGPR index (acting as a divergent pointer) one would need to build a shifted version
  • SSBOs are likely always a pain point for the compiler
  • Because of the HW use of byte offsets for when descriptor type stride is not known
  • This mismatch between index API and offset HW usage is bad
  • Anything outside the immediate offset range in the opcode takes
  • V_LSHLREV_B32_E32 to convert the index into a byte offset
  • S_MOV_B32 to possibly build the sgprOffset32 when the immediate gets too big
  • For this case it's the same overhead as just using the broken TEXEL BUFFER today
  • Likewise since it's all byte offsets, it can never function outside 4GiB buffers
  • Today's fast path is
  • Alias STORAGE_TEXEL_BUFFERs for VMEM over SSBO for SMEM
  • Don't think there is value in special casing the [vgprIndex+imm12] case to SSBOs
  • It's possible to save one SALU op there today
  • But that would go away with proper driver optimizations
  • SMEM
  • Use SSBO with layout aliasing with single arrays of a given type
  • SRead_<Type>(immOffset)
  • Via '[immOffset>>N]' in macro, where N is compile-time-immediate
  • N is based on type
  • It's all compile time so interface can just directly match HW
  • Driver maps to either
  • [V#Base + sgprOffset32] and an extra S_MOV_B32 for big imms
  • Extra S_MOV_B32s are not worth worrying about
  • Given the larger goals, due to dual issue
  • [V#Base + imm20] for small imms
  • SRead_<Type>(sgprOffset, immOffset)
  • Via '[(sgprOffset>>N)+(immOffset>>N)]' in macro
  • This is another direct map to HW but with some driver overhead today
  • Driver maps that today to
  • S_AND_B32 to clear the LSBs
  • This is 100% safe to remove for 32-bit because the HW does it for free, meaning a compiler engineer could NOP it
  • For 64-bit and 128-bit and beyond, IMO it's an acceptable perf strategy to NOP, someone could put it on an extension if required, but it's an easy fix regardless
  • S_ADDK_I32 to add in the '(immOffset>>N)'
  • This is an optimization bug for small immediates
  • Driver should be using the HW immediates but doesn't
  • Note '[sgprIndex+(immOffset>>N)]' gets mapped to
  • S_LSHL_B32 to convert sgprIndex into a byte offset
  • S_ADDK_I32 to add in the '(immOffset>>N)'
  • So it's the same overhead today for something less like the HW
  • And ultimately less performance fixable
  • VMEM
  • Don't need VRead_<Type>(imm) or VRead_<Type>(sgprBase + imm)
  • Because address is scalar it should go into K$
  • Unless one wants TBUFFER types that need format conversion
  • Or one is working around AMD compiler bugs with 'readonly' and SMEM polling
  • VRead_<Type>(vgprIndex) is already on the HW fast path
  • VRead_<Type>(vgprIndex,immOffset12)
  • Via '[vgprIndex+(immOffset12>>N)]' in macro
  • Driver maps that today to
  • V_ADD_NC_U32_E32 to add the index to the immediate offset
  • This is a perf bug
  • Driver knows the buffer index byte span via the layout
  • It can compute the shift and just use the HW immediate
  • VRead_<Type>(vgprIndex,sgprOffset32)
  • Via '[vgprIndex+(sgprOffset32>>N)]' in macro
  • Driver maps that today to
  • S_LSHR_B32 to do the sgprOffset32>>N
  • This might be one instruction worse than SSBOs
  • But it's SALU so not concerned
  • Although the latency to the next op is less than ideal
  • V_ADD_NC_U32_E32 to add that to the index
  • These are both perf bugs IMO
  • Driver could NOP all this and just use the HW sgprOffset32
  • It's 100% safe for 32-bit types (HW zeros 2 LSBs anyway)
  • It's not strictly 'safe' for 64/128-bit types
  • But they could put it on an extension (allSafeAlignment)
  • VRead_<Type>(vgprIndex,sgprOffset32,immOffset12)
  • Via '[vgprIndex+(sgprOffset32>>N)+(immOffset12>>N)]' in macro
  • Driver maps that today to
  • S_LSHR_B32 to do the sgprOffset32>>N
  • Same as prior
  • V_ADD3_U32 to add all the stuff together
  • Again perf bugs IMO, driver could NOP all this and use the HW

Adaptive Workload Scaling (AWS)

  • Things like Dynamic Resolution Scaling (DRS) are reactive and thus take the first miss
  • Thus they are not good long term solutions to the problems of Quality-of-Service (QoS)
  • This is an alternative component in which passes switch to lower cost paths when the frame starts to go over budget in hopes of avoiding the miss completely
  • This could be mixed with DRS, using AWS in-frame, then DRS
  • Components required to make this possible
  • The Realtime Shader Clock Frequency Query described above (needed to convert shader clock readings into time)
  • Or alternatively QueryDisplayConfig() for FPS in n/d form
  • MISSING: Ability to get the shader realtime clock domain time of v-sync or alternatively query the h-sync position active in scanout

Bind Everything

  • Intel's iGPUs (pre-Arc HW) have too-small HW binding limits
  • Author is just ignoring Intel (including Arc)
  • NV and AMD are both quite good with binding everything once per frame

CPU Push Data to GPU

Description of an important GPU fast path on today’s HW

  • Usage case is sending and updating small data to the GPU
  • {keyboard state, gamepad state, CPU time, resolution, framerate, etc}
  • Aka 'late-latch'
  • CPU updates async to the GPU, GPU gets state as close to when it is needed
  • Usage contract
  • CPU writes complete cachelines (GPU-sized cacheline aka 128 bytes)
  • Cachelines are used only for this purpose
  • These cachelines are read only once per frame on the GPU and the contents are copied
  • This makes a stable snapshot of the contents
  • Working around the non-atomic nature of the PCIe bus
  • TODO: Find the reference for only bytes are atomic when crossing the PCIe bus
  • Each cacheline uses 32-bit for a hash of the cacheline contents (not including the hash)
  • The GPU reads the cacheline completely, and can compute the hash
  • If the included hash and the GPU-computed hash are different, the packet is tossed out
  • Tossed out because it's better latency wise
  • No re-read because this is cached GPU memory
  • Want cached GPU memory, because GPU isn't necessarily going to read the full line in one transaction (if it's one lane doing the reading, etc)
  • The data is updated in a ring buffer of size 2 {1 older, 1 latest}
  • The GPU reads both entries (one or both should be good)
  • If both pass the hash, take the one with the newer timestamp
  • Memory
  • Dedicated always mapped allocation
  • Two options
  • Lower Latency : DEVICE_LOCAL | HOST_VISIBLE
  • CPU write-combine stores to the GPU memory (less latency on GPU read)
  • Middle : HOST_VISIBLE
  • GPU reads through the bus, CPU writes write-combine to CPU DRAM
  • Slower : HOST_VISIBLE | HOST_CACHED
  • GPU reads through the bus, snooping the cache
  • TODO: Per Adam's point, should perhaps re-profile the differences between with and without HOST_CACHED, could be some perf advantage without HOST_CACHED for CPU upload
  • Both don't require DEVICE_UNCACHED since data is read once per frame only!
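
A minimal sketch of the GPU-side read under the contract above; the exact line layout, hash function, and names are hypothetical (the contract only fixes the 128-byte line, the 32-bit hash, and the 2-entry ring).

-

 layout(set=0,binding=0,std430)readonly buffer ssboPush{uint ring[2][32];}; // 2 x 128-byte lines
 uint HashLine(uint e){ // placeholder hash over the 31 payload words (word 31 holds the hash)
  uint h=0x9e3779b9u;
  for(uint i=0u;i<31u;++i)h=(h^ring[e][i])*0x01000193u;
  return h;}
 uint PickPacket(){ // convention here: word 0 of the payload is a CPU timestamp
  bool g0=(HashLine(0u)==ring[0][31]);
  bool g1=(HashLine(1u)==ring[1][31]);
  if(g0&&g1)return(ring[1][0]>ring[0][0])?1u:0u; // both valid: take the newer one
  return g1?1u:0u;}                              // else whichever passed (or entry 0)

-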

CPU to GPU Workload Migration

  • In modern times SoCs have power sloshing
  • The GPU:CPU power slosh ratio ranges from 1:5 to 5:1
  • Example: PS5 Pro
  • If a workload is less power efficient on the CPU than on the GPU it should be migrated over (and include the bus transfer power in that accounting too)

Dropping Registers Mid-Shader Execution?

  • Usage case: at launch an uber shader is allocated a maximum register count, then after the initial control path decision it releases registers early
  • Usage case, as a non-ubershader winds down and gets out of peak register count, it could start releasing registers back for other shaders
  • AMD's "Dealloc VGPRs" (in some version of RDNA)
  • This is a related incremental improvement
  • See the ISA guide on S_SENDMSG for details
  • Waves wait on memory-write-acknowledge (ACK) before actually exiting
  • This allows the VGPRs to be released before the ACK so another wave can launch faster
  • Register Allocation
  • Registers are allocated in some grouping granularity (N registers)
  • Any runtime re-naming of registers requires logic on a critical path (operand fetch)
  • If registers are allocated in linear segments that wrap around
  • AMD's documented this in the ISA guide: VGPR_BASE and VGPR_SIZE
  • It is just a small adder
  • However fragmentation would probably limit what could take advantage of released registers (esp if uber shader launch needs a very large chunk)
  • If registers are renamed per group of N, one needs to put a small memory on the critical path of operand fetch; for example 256 regs / (N=8) = 32 entries * maxWaveCount of 16 = a 512 entry x 5-bit memory … and with typically 4 operands you’d need either duplication or a few read ports … yeah it gets really expensive so vendors are not likely to go for this kind of approach (it’s a big latency add too)
  • So it’s not easy to support this efficiently in general
  • SPECULATION: Apple’s more recent HW designs are keeping smaller on-deck physical register files for active waves, and relying more on a hidden load/store to its new shared regbacking/L1/LDS memory, so it likely isn’t really a dynamic register count, but rather HW managed spilling?
  • Apple’s ISA has register last-use (“discard”) and register reuse hints (“cache”)
  • These could be used for HW managed spilling
  • Could be that the register cache is made bigger
  • Could profile to try to figure this out by running ALU-bound algorithms and graphing performance vs the amount of register reuse possible given the register reuse cache sizing
  • If perf falls off a cliff on FMAs when the operands cannot get high register reuse, then it’s likely they got port limited on hardware managed offload of registers to the shared SRAM
  • It might soften worst case behavior, but might decrease perf of high state load (and register) algorithms that otherwise run well on PC without any hidden spilling (even to LDS)
  • Curious if texture fetch actually writes results into the shared regspill/L1/LDS memory (although it would be silly to have a possibility of spilling texture fetch data)
  • AMD RDNA’s Sub-Vector Execution
  • This is an alternative which provides a variable amount of registers during execution from a fixed allocation
  • SIMD32 so Wave64 runs normally as 2 back-to-back {lo,hi} sub-vectors (32-wide)
  • In Sub-Vector Execution mode, the hardware can run a long sequence of {lo} instructions, and then a long sequence of {hi} instructions
  • During this sub-vector execution a subsection of registers gets two times as large
  • See VGPR_SHARED_SIZE (RDNA2 ISA Guide)
  • So this is highly vendor specific (not generally usable, and I don’t think compilers automatically generate code for this)
  • Generic Loop Unrolling?
  • AMD’s Sub-Vector Execution is in some respects a very non-portable form of this
  • This is how CPUs manage variable workload to fixed register count problems
  • If a wave is running 1 thing, it has N registers
  • If a wave is doing 2 things, it has N/2 registers
  • And so on
  • With compute the pixel-shader like 1 pixel to one lane mapping is no longer required
  • Is a fixed maximum register count actually a performance problem on modern GPUs from AMD/NV?
  • Answer is yes for shaders that explode in register count (cannot hide their own latency)
  • But if one avoids that problem completely (which is possible with a good compiler), then what?
  • Too many waves can overload L0 caches
  • Only want just enough waves to hide latency, not more as it decreases cache utilization (less cache per wave)
  • One of the points of semi-persistent waves and persistent waves is that the bubbles introduced by wave exit and restart actually decrease perf, so relying less on the HW scheduling has performance improvements
  • Author sticks with designing around 64 registers/wave, and uses loop-unrolling style solutions, and that does seem to work quite well in practice

Efficient Program Counter (PC) Logic

  • Review of AMD HW
  • AMD's HW is fully general purpose
  • Think of the wave as a CPU thread
  • All the opcodes are available that one would need
  • PC is 64-bit
  • Branches jump to (PC of instruction after branch + offset)
  • SOPP opcodes
  • These support a signed 16-bit immediate
  • Which is scaled by 4 for branch offset
  • So +/- 128 KiB (which is way larger than the I$ size)
  • S_BRANCH ... unconditional branch
  • S_CBRANCH_* ... conditional branch
  • Based on {VCC,EXEC,SCC}{Z,NZ} bit values
  • S_CALL_B64 ... save return PC in SGPR pair, then branch
  • There are 3 other instructions that can directly interact with the PC
  • S_GETPC_B64 ... store the PC to an aligned SGPR pair
  • Code can use this to compute a physical address of a relocated (or compile-time unknown) PC if the offset to the current instruction is known
  • S_SETPC_B64 ... load the PC from an aligned SGPR pair
  • S_SWAPPC_B64 ... store then load the PC
  • Branch tables would use this
  • Requirement for Optimizations
  • These require a guarantee that all shaders exist in a consistent 4 GiB window
  • Something a driver should be able to do
  • AMD's PC driver already does this for descriptor sets/etc
  • Meaning the 32-bit MSB of the PC is always constant at least during execution of a given program
  • And only the lower 32-bit LSB of the PC ever needs to be changed
  • This enables pointers to only ever need to load/store or do logic on the lower 32-bit LSB of the PC
  • This enables efficient jump tables
  • Can pack just the lower 32-bits of all the jump targets
  • For relocation can just use ELF stuff (patch the lower 32-bit LSB)
  • This enables space-efficient return stacks in a single VGPR
  • Can get at least a fixed 32-entry (wave32) return stack
  • AMD HW reference
  • V_READLANE_B32 ... copy a VGPR to a SGPR
  • V_WRITELANE_B32 ... copy a SGPR to a VGPR
  • Note S1 argument = lane select which is an SGPR
  • Note latency applies
  • Latency of branching
  • Latency of SALU/VALU dependencies

Index vs Offset - Analysis of Options

TLDR

  • Drivers don't do a good job with SSBOs and C-style array indexing!
  • There is no good way to emulate the HW byte-offsetting without getting a perf tax
  • Seems like GL_EXT_buffer_reference is the fix, except for compiler bugs on AMD
  • See the related section in the AMD Compiler Bugs area
  • This is a prime example of why actual instruction intrinsics would be a good idea
  • Because the hardware works with byte-offsets internally

Goals

  • Would like to get the driver out of the way for 'buffer' access
  • One 'thing' covering a huge amount of memory
  • Instead of one driver managed resource per huge number of things
  • That can be accessed via {32,64,128}-bit load/store and atomics
  • That can also do some amount of format conversion (aka TBUFFER)
  • Would like to take advantage of the HW's 'free' addressing logic
  • Would like to be able to DMA a huge chunk of this one 'thing' to/from a memory mapped file
  • So save/load state is effectively free of any file IO

Using Images Instead of Buffer?

  • With DCC format/type being either in descriptors or page tables
  • Wouldn't gain anything from compression due to format aliasing
  • Would have to go linear tiling mode to get different bit-pixel aliasing to make sense
  • So no sense in even doing 2D
  • Except it's 2D providing the 'free' address logic
  • Cannot get away from needing different bit widths
  • 32-bit atomics
  • 64-bit atomics
  • 128-bit load/store (for efficient access)
  • Would need a lot of image descriptors: 32-bit, 64-bit, 128-bit access, etc
  • So this is a definite NO on using images

Using AMD's HW (RDNA2) as an example because of the clear ISA documentation

  • Buffer ops are better than emulating Buffer ops via Image ops
  • Buffer has a lot of "free" addressing logic
  • TODO: Verify and find reference for this
  • Buffers have lower latency?
  • Buffers have better peak throughput?

AMD SMEM

  • 64-bit base pointer or 128-bit descriptor
  • 32-bit SGPR providing an unsigned byte offset
  • 21-bit signed byte offset via immediate
  • This must be positive for S_BUFFER ops though
  • So one cannot use the trick of using negatives to double range from the base

AMD VMEM BUFFER

  • 128-bit descriptor
  • 32-bit SGPR providing an unsigned byte offset
  • TODO: Open question if the driver will even use this for optimization
  • Seems like it doesn't (but might not have tested that case correctly yet)
  • 12-bit unsigned byte offset via immediate
  • 32-bit VGPR for byte offset (optional)
  • Driver only uses this
  • 32-bit VGPR for index (optional)
  • Driver won't use this for optimizations even for some fixed stride

NVIDIA LDG (SM89)

  • TODO: Author doesn't know NV hardware as well any more ...
  • Global load, with a huge number of addressing modes
  • [imm24]
  • [vgpr32]
  • [vgpr32+imm24]
  • [vgpr64+imm19]{+imm10} ... imm10 is likely offset for a second load
  • [sgpr64+vgpr32+imm24]
  • [sgpr64+vgpr32+imm18]{+imm10}
  • desc[sgpr64][vgpr64]
  • desc[sgpr64][vgpr64+imm24]
  • Supports a prefetch size included in the operation {64,128,256}-byte!
  • Addresses must be naturally aligned, PTX says two options
  • HW might fault
  • HW might silently clear the associated LSBs (better, too bad one cannot count on this)
  • All addresses are byte based (there is no support for C-style indexing)
  • Indexing logic (shift add) is taken up by the ALU!

Problems with the C-like shading languages, specifically SSBOs

  • See disassembly example below
  • For context, can layout alias the bind point as arrays of all types
  • But these C-like APIs require array indexing
  • No support for byte offsets
  • And the buffer logic in HW is based on byte offsets mostly
  • Descriptors do have an index size and an optional index
  • Author thinks (TODO confirm) that TBUFFER ops use descriptor index size
  • Not the size of the type in the opcode
  • Driver builds the SSBO descriptors
  • But looks like it sets index stride = 1-byte
  • So no way to leverage indexing
  • buf[immediateIndex]
  • Anything reduced to simply immediate index (known at compile time)
  • A compiler can transparently convert that back to a HW byte offset
  • Confirmed below for pure compile time immediates that fit in the immediate offset field
  • But note mixing with complex dynamic offset can undo this
  • buf[unsignedByteOffset>>immediateLog2Size]
  • Where the 'unsignedByteOffset' is not an immediate
  • The compiler won't optimize this
  • It will either do the described shift, and then the hidden unshift
  • Or just clear the associated LSBs (as an 'optimization')
  • The ask of course is for no SHIFTs or AND
  • buf[immediate+index]
  • Compiler adds an extra shift for the index
  • But can often factor the immediate into the opcode immediate

AMD PC RDNA2 Disassembly Examples

So the AMD compiler is quite bad at optimizing this next case

  • Compiler gets screwed over by shader languages using indexing instead of byte offsetting

ramWI1.hxd[13]=0xdead+ramRI1.hxd[(27>>2)+(g1>>2)];

  • Yes this is 'bad' code (for a few reasons, but shows issues in the compiler)
  • Where
  • g1 ... global invocation index (1D dispatch)
  • ram<W,R>I1 ... SSBO aliased to the same bind point
  • .hxd ... uint32_t array in the SSBO
  • Load Behavior .hxd[(27>>2)+(g1>>2)]
  • Note the immediate offset is hardcoded to .hxd
  • The (27>>2) immediate offset doesn't get factored in
  • This should be an index of 6, and a byte offset of 24
  • Instead the driver incorporates it into the extra LSHL needed for g1>>2
  • Via V_LSHL_ADD_U32
  • The (g1>>2) part shows the compiler will put in the shift then the hidden unshift
  • Instead of just factoring both out
  • Because of the possibility of having lower bits set
  • And note this example does have lower bits set (bad behavior)
  • NOTE at least for S_LOAD_BUFFER/etc it's documented (AMD's RDNA2 ISA Guide) that the lower 2-bits are ignored anyway, so in theory NOPing the shift could perhaps become a legal optimization for 32-bit/element loads
  • There is SH_MEM_CONFIG.alignment_mode to force that
  • Not clear if 'STRICT' forces alignment for all sizes, or just causes a fault
  • Store Behavior .hxd[13]
  • The driver does factor this into the immediate offset field
  • So at least some amount of minimal pure immediate optimization happens
  • Best to keep common static offset stuff in the lower immediate addressing range!
  • SSBO layout aliasing does only use one descriptor (and not a 64-bit pointer)
  • So at least that optimization is working correctly
  • The descriptor is built on the fly in the shader
  • Created from a USER_DATA_SGPR pair (64-bits total)
  • This is done to save on USER_DATA_SGPR setup space
  • The index stride is set to 1-byte
  • The driver could at least convert this to N-byte
  • And then leverage that information for optimization for N-byte load/store
  • But it doesn't employ any of those kind of optimizations
  • Notice the AMD compiler cannot do good SGPR allocation here
  • There is an extra unnecessary S_MOV_B32
  • Driver doesn't trust the DYNAMIC descriptor created by the driver?
  • It explicitly clears out the {stride, cache swizzle, and AOS swizzle}
  • But it pushed that work into GPU runtime instead of doing it at CPU create time

-

DECODING THE BUFFER DESCRIPTOR

    11111111111111110000000000000000

    fedcba9876543210fedcba9876543210

[0] _______________s5_______________  lower 32-bits of 48-bit base address

[1] 0000000000000000_______s6_______

    ................_______s6_______  upper 16-bits of 48-bit base address  

    ..00000000000000................  stride = 0

    .0..............................  cache swizzle (disabled)

    0...............................  swizzle AOS (disable)

[2] 11111111111111111111111111111111  num_records (maximum, effectively disable bounds check)

[3] 00000000000000100100111110101100

    .............................100  dst_sel_x (R)

    ..........................101...  dst_sel_y (G)

    .......................110......  dst_sel_z (B)

    ....................111.........  dst_sel_w (A)

    .............0100100............  format

    ...........rr...................  reserved

    .........00.....................  index stride (1-byte)

    ........0.......................  add tid enable (disabled)

    .......0........................  resource level (set to 0, even though ISA docs say set to 1)

    ....rrr.........................  reserved

    ..00............................  OOB_SELECT (out of bounds select, disabled?)

    00..............................  type 0=buffer

-

  v_lshl_add_u32  v0, s8, 6, v0                         // 000000000000: D1FD0000 04010C08

  v_ashrrev_i32  v0, 2, v0                              // 000000000008: 22000082

  v_lshl_add_u32  v0, v0, 2, 24                         // 00000000000C: D1FD0000 02610500

  s_and_b32     s0, s6, 0x0000ffff                      // 000000000014: 8600FF06 0000FFFF

  s_mov_b32     s1, s0                                  // 00000000001C: BE810000

  s_movk_i32    s2, 0xffff                              // 000000000020: B002FFFF

  s_mov_b32     s3, 0x00024fac                          // 000000000024: BE8300FF 00024FAC

  s_mov_b32     s0, s5                                  // 00000000002C: BE800005

  buffer_load_dword  v0, v0, s[0:3], 0 offen offset:1024 // 000000000030: E0501400 80000000

  s_waitcnt     vmcnt(0)                                // 000000000038: BF8C0F70

  v_add_u32     v0, 0x0000dead, v0                      // 00000000003C: 680000FF 0000DEAD

  buffer_store_dword  v0, v0, s[0:3], 0 offset:1076 glc // 000000000044: E0704434 80000000

  s_endpgm                                              // 00000000004C: BF810000

-

ramWI1.hxd[13]=0xdead+ramRI1.hxd[120+g1];

  • Similar to the above case, but use pure indexing
  • Notice the compiler does now factor the (120<<2) into the compile time immediate
  • But needs to add an extra V_LSHLREV_B32 for the indexing of g1

-

  v_lshl_add_u32  v0, s8, 6, v0                         // 000000000000: D1FD0000 04010C08

  v_lshlrev_b32  v0, 2, v0                              // 000000000008: 24000082

  s_and_b32     s0, s6, 0x0000ffff                      // 00000000000C: 8600FF06 0000FFFF

  s_mov_b32     s1, s0                                  // 000000000014: BE810000

  s_movk_i32    s2, 0xffff                              // 000000000018: B002FFFF

  s_mov_b32     s3, 0x00024fac                          // 00000000001C: BE8300FF 00024FAC

  s_mov_b32     s0, s5                                  // 000000000024: BE800005

  buffer_load_dword  v0, v0, s[0:3], 0 offen offset:1504 // 000000000028: E05015E0 80000000

  s_waitcnt     vmcnt(0)                                // 000000000030: BF8C0F70

  v_add_u32     v0, 0x0000dead, v0                      // 000000000034: 680000FF 0000DEAD

  buffer_store_dword  v0, v0, s[0:3], 0 offset:1076 glc // 00000000003C: E0704434 80000000

  s_endpgm                                              // 000000000044: BF810000

-

Write-Once-Read-Many (WORM)

Description of an important GPU fast path on today’s HW

One Implicit Per-Frame Cache Flush

  • Submit would typically chain command buffers for performance reasons
  • At submit boundaries there would typically be a preamble
  • The preamble does state setup and typically a cache flush
  • This method is a way to avoid doing any extra cache flushes mid-frame
  • So in Vulkan this means an app with NO vkCmdPipelineBarrier() calls!
  • And only VkEvents used for pipelining
  • TODO: It might be better to have a command buffer creation flag or submit flag which explicitly asks for the cache flush (in case the implicit cache flush gets factored out at some point)

Rules for GPU-Side Memory Access

  • Stores are all done using resources with ‘layout(...) coherent’ or atomic ops
  • Note you can alias different memory qualifiers on the same binding (see the sketch after this list)
  • To get cached reads mixed with write-through stores
  • This guarantees no stale lines can be left in the non-coherent caches
  • On AMD this invokes L0 GLC=1 to have stores write-through to the coherent L2 cache domain (note L1 is bypassed on store)
  • Note AMD’s driver already uses GL1=1 stores even without that coherent qualifier
  • Write-through on stores and not leaving the line in the lower level caches is a good default performance optimization because it avoids cache pollution!
  • All writing jobs don’t alias cachelines unless they use atomic operations
  • So non-atomic lines are effectively write once (so no aliasing)
  • After writes are done, any number of invocations are free to do cached reads as many times as they want from any caches (K$ for constants, or vector memory cache, etc)
  • No cache flush is needed between the write to read transition!
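
A minimal GLSL sketch of the qualifier aliasing above (hypothetical names; the spin-loop examples later in this document use the same binding-aliasing pattern): the same set/binding is declared once plain readonly for normal cached reads and once writeonly coherent so stores write through to the coherent cache domain.

--

#version 460
// Same buffer aliased at set=0,binding=0 with different memory qualifiers.
// Reads use the normal (non-coherent) cached path ...
layout(set=0,binding=0,std430)readonly buffer bufR_ {uint bufR[];};
// ... stores write through to the coherent cache domain (GLC=1 on AMD).
layout(set=0,binding=0,std430)writeonly coherent buffer bufW_ {uint bufW[];};
layout(local_size_x=64)in;
void main(){
// Cached read, write-through store; under the WORM rules above no
// mid-frame cache flush is needed for later passes to read this result.
bufW[gl_GlobalInvocationID.x]=bufR[gl_GlobalInvocationID.x]+1u;}

--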

HW Requirements

  • These hold on most modern hardware, but didn't necessarily hold on older hardware
  • TODO: Vendor support chipset list
  • This is supported on all desktop GPUs with packed 16-bit support at a minimum
  • Which covers the author’s usage cases, but support should extend back more
  • HW must read indirect dispatch from the coherent cache domain
  • HW must read scanout from the coherent cache domain

GPU Memory Aliasing

  • WORM doesn’t support mid-frame memory aliasing
  • Note that aliasing, say for resizable images, is safe if the aliases are used on separate frames
  • And in modern times the majority of mid-frame stuff needs temporal feedback
  • So those couldn’t be aliased anyway
  • In order to support mid-frame memory aliasing, one would need something that invalidated the incoherent cache domain
  • To avoid the possibility of stale lines
  • Prior reads for the same lines might be in the non-coherent caches
  • Using work to effectively flush the non-coherent caches is an ‘unsafe’ but valid approach too
  • A safer approach would be to use a vkCmdPipelineBarrier()
  • TODO: Include how to invoke and only flush the incoherent caches (as reference)

GPU Spin Loops or Lock-Free Retry Loops

PC Problems

  • There is no forward progress guarantee in the APIs (fail)
  • Preemption can happen, and there is no guarantee post-preemption will restore all workgroups back on the machine (compared to what workgroups had been active prior)
  • So if a workgroup is spinning on memory waiting for another non-loaded workgroup to change, a livelock will happen (aka deadlock, likely a TDR, or at worst another preemption and just no forward progress)
  • A suggested minimum ASK for IHVs is a guarantee that during preemption restore the oldest workgroups get restored first (aka whatever has a lower global kernel workgroup ID)
  • With this it becomes safe to block on workgroups with lower workgroup IDs
  • Which provides a way to implement many algorithms
  • There is also the possibility, when launching workgroups sized to fill a full CU (in AMD speak) and its associated resources, that prior fragmentation could make an irregular number of workgroups dispatch unless the system resets the units after idling them
  • Does this happen on modern hardware in practice? Maybe not any more
  • It did happen in early CUDA times
  • What is more likely is that other things might be running on the GPU, so a dispatch won't necessarily fill the machine immediately at launch; blocking on a count of workgroups that should fill the machine is therefore a possible fail point (deadlock)

Example AMD Compiler Bugs (MEGA FAIL)

Showing below an implementation of a GPU spin loop. It's a MEGA FAIL because of the massive number of compiler bugs that prevent even the attempted workarounds from being acceptable ...

AMD Compiler Correctness Fail With Regards to "Readonly Coherent"

  • Instant deadlock (livelock)
  • The compiler should do a GLC=1 (aka "coherent") SMEM read inside the spin loop but doesn't
  • The irony is that it does emit the GLC=1 load (so it implements the "coherent" part), then ignores those very correctness rules by hoisting it out of the loop
  • Instead it incorrectly assumes the read will return the same value as before and spins on the SGPR prior value
  • Note using "volatile" won't fix the problem either!

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

layout(set=0,binding=0,std430)readonly coherent buffer ssboRC_ {I1 ssboRC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly buffer ssboR_ {I1 ssboR[1024*1024*1024/4];};

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

void main(){

// Fast check on already signaled

// API bug: no way to say to expect this if to always be false

if(ssboR[0]==0){

// Predicate to just the first lane

if(gl_LocalInvocationID.x==0){

// Not signalled, so must spin until signalled, note the coherent read

while(true){if(ssboRC[0]!=0)break;}}}

// Make the shader do something so the program doesn't get dead-coded

ssboW[256]=1;}

--

Disassembly

--

0x000000    s_getpc_b64            s[2:3]

0x000004    s_mov_b32              s0, s1

0x000008    s_mov_b32              s1, s3

0x00000C    s_load_dwordx4         s[4:7], s[0:1], null

0x000014    s_waitcnt              lgkmcnt(0)

0x000018    s_buffer_load_dword    s0, s[4:7], null

0x000020    s_waitcnt              lgkmcnt(0)

0x000024    v_or_b32_e32           v0, s0, v0

0x000028    v_cmp_ne_u32_e32       vcc_lo, 0, v0

0x00002C    s_and_b32              s0, vcc_lo, exec_lo

0x000030    s_cbranch_scc1         _L0

0x000034    s_buffer_load_dword    s0, s[4:7], null glc dlc

0x00003C    s_waitcnt              lgkmcnt(0)

0x000040    s_cmp_lg_u32           s0, 0

0x000044    s_cselect_b32          s0, -1, 0

0x000048    v_cndmask_b32_e64      v0, 0, 1, s0

0x000050    v_cmp_ne_u32_e64       s0, 1, v0

_L1:

0x000058    s_and_b32              vcc_lo, exec_lo, s0

0x00005C    s_cbranch_vccnz        _L1

_L0:

0x000060    v_mov_b32_e32          v0, 1

0x000064    buffer_store_dword     v0, off, s[4:7], 0 offset:1024

0x00006C    s_endpgm              

--

Note you cannot work around this by hiding the memory address either

  • The compiler will simply load the memory address it's given
  • And then hoist it outside the spin loop! = DEADLOCK

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

layout(set=0,binding=0,std430)readonly coherent volatile buffer ssboRC_ {I1 ssboRC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly buffer ssboR_ {I1 ssboR[1024*1024*1024/4];};

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

void main(){

// Fast check on already signaled

// API bug: no way to say to expect this if to always be false

I1 hack=ssboR[1];

if(ssboR[0]==0){

// Predicate to just the first lane

if(gl_LocalInvocationID.x==0){

// Not signalled, so must spin until signalled, note the coherent read

while(true){if(ssboRC[hack]!=0)break;}}}

// Make the shader do something so the program doesn't get dead-coded

ssboW[256]=1;}

--

Disassembly

--

0x000000    s_getpc_b64            s[2:3]

0x000004    s_mov_b32              s0, s1

0x000008    s_mov_b32              s1, s3

0x00000C    s_load_dwordx4         s[4:7], s[0:1], null

0x000014    s_waitcnt              lgkmcnt(0)

0x000018    s_buffer_load_dword    s0, s[4:7], null

0x000020    s_waitcnt              lgkmcnt(0)

0x000024    v_or_b32_e32           v0, s0, v0

0x000028    v_cmp_ne_u32_e32       vcc_lo, 0, v0

0x00002C    s_and_b32              s0, vcc_lo, exec_lo

0x000030    s_cbranch_scc1         _L0

0x000034    s_buffer_load_dword    s0, s[4:7], 0x4

0x00003C    s_waitcnt              lgkmcnt(0)

0x000040    s_lshl_b32             s0, s0, 2

0x000044    s_buffer_load_dword    s0, s[4:7], s0 glc dlc

0x00004C    s_waitcnt              lgkmcnt(0)

0x000050    s_cmp_lg_u32           s0, 0

0x000054    s_cselect_b32          s0, -1, 0

0x000058    v_cndmask_b32_e64      v0, 0, 1, s0

0x000060    v_cmp_ne_u32_e64       s0, 1, v0

_L1:

0x000068    s_and_b32              vcc_lo, exec_lo, s0

0x00006C    s_cbranch_vccnz        _L1

_L0:

0x000070    v_mov_b32_e32          v0, 1

0x000074    buffer_store_dword     v0, off, s[4:7], 0 offset:1024

0x00007C    s_endpgm              

--

Dropping the "readonly" doesn't work either even with the hack

  • The compiler again hoists the load out of the spin loop = DEADLOCK
  • Same correctness bugs!

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

layout(set=0,binding=0,std430)coherent volatile buffer ssboC_ {I1 ssboC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly coherent volatile buffer ssboRC_ {I1 ssboRC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly buffer ssboR_ {I1 ssboR[1024*1024*1024/4];};

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

void main(){

// Fast check on already signaled

I1 hack=ssboR[1];

if(ssboR[0]==0){

// Predicate to just the first lane

if(gl_LocalInvocationID.x==0){

// Not signalled, so must spin until signalled, note the coherent read

while(true){if(ssboC[hack]!=0)break;}}}

// Make the shader do something so the program doesn't get dead-coded

ssboW[256]=1;}

--

Disassembly

--

0x000000    s_getpc_b64            s[2:3]

0x000004    s_mov_b32              s0, s1

0x000008    s_mov_b32              s1, s3

0x00000C    s_load_dwordx4         s[4:7], s[0:1], null

0x000014    s_waitcnt              lgkmcnt(0)

0x000018    s_buffer_load_dword    s0, s[4:7], null

0x000020    s_waitcnt              lgkmcnt(0)

0x000024    v_or_b32_e32           v0, s0, v0

0x000028    v_cmp_ne_u32_e32       vcc_lo, 0, v0

0x00002C    s_and_b32              s0, vcc_lo, exec_lo

0x000030    s_cbranch_scc1         _L0

0x000034    s_buffer_load_dword    s0, s[4:7], 0x4

0x00003C    s_waitcnt              lgkmcnt(0)

0x000040    s_lshl_b32             s0, s0, 2

0x000044    s_buffer_load_dword    s0, s[4:7], s0 glc dlc

0x00004C    s_waitcnt              lgkmcnt(0)

0x000050    s_cmp_lg_u32           s0, 0

0x000054    s_cselect_b32          s0, -1, 0

0x000058    v_cndmask_b32_e64      v0, 0, 1, s0

0x000060    v_cmp_ne_u32_e64       s0, 1, v0

_L1:

0x000068    s_and_b32              vcc_lo, exec_lo, s0

0x00006C    s_cbranch_vccnz        _L1

_L0:

0x000070    v_mov_b32_e32          v0, 1

0x000074    buffer_store_dword     v0, off, s[4:7], 0 offset:1024

0x00007C    s_endpgm              

--

Partial workaround?

  • This won't actually work, but it gets close to a workaround
  • Since the workaround involves moving the spin loops to VMEM
  • One needs to force a reload of the K$ line used for the fast check (didn't do that yet)
  • It's loaded with perf bugs (will itemize those)
  • Needed to move to an explicit atomic operation in the spin loop
  • And used a 'hack' value to hide a zero so there is no way a compiler can dead-code the atomic
  • In this instance the compiler decided to choose the worst branch order
  • The expected path has a taken branch when it should be linear (non-taken) ideally
  • No interface to show explicit expected branch priorities
  • Note sometimes I've seen the compiler get it right
  • So AMD at least can generate a branch that conditionally branches out and leaves the fast path linear (branch free), but they don't guess right a lot of the time
  • The compiler branches into code predicated into one lane (lane=0)
  • This is done via if(gl_LocalInvocationID.x==0)
  • But then the next thing the compiler does is implement its "perf strategy" for idiot programmers
  • So now it duplicates the predication of the atomic to the first lane using the worst possible method (via masked bit count)
  • But it already is down to one lane
  • So it becomes an anti-optimization (cost increase)
  • Then it restores what it thinks is multi-lane execution (but it isn't)
  • Seriously this is exactly why we need explicit SGPRs and explicit intrinsics, etc
  • And it does a read-first-lane to get the atomic result back into an SGPR so it can do vector logic on it
  • So it can then branch by a vector compare
  • Really no reason to do the read-first-lane, it's already in one-lane execution
  • The output is unbelievably bad
  • We need some other approach here!

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

layout(set=0,binding=0,std430)coherent volatile buffer ssboC_ {I1 ssboC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly coherent volatile buffer ssboRC_ {I1 ssboRC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly buffer ssboR_ {I1 ssboR[1024*1024*1024/4];};

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(local_size_x=64)in;

void main(){

// Fast check on already signaled

I1 hack=ssboR[1]; // Make sure that loads a zero!

if(ssboR[0]==0){

// Predicate to just the first lane

if(gl_LocalInvocationID.x==0){

// Not signalled, so must spin until signalled, note the coherent read

while(true){if(atomicOr(ssboC[0],hack)!=0)break;}}}

// Make the shader do something so the program doesn't get dead-coded

ssboW[256]=1;}

--

Disassembly

--

0x000000    s_getpc_b64            s[2:3]

0x000004    s_mov_b32              s0, s1

0x000008    s_mov_b32              s1, s3

0x00000C    s_load_dwordx4         s[0:3], s[0:1], null

0x000014    s_waitcnt              lgkmcnt(0)

0x000018    s_buffer_load_dword    s4, s[0:3], null

0x000020    s_waitcnt              lgkmcnt(0)

0x000024    v_or_b32_e32           v0, s4, v0

0x000028    s_mov_b64              s[4:5], exec

0x00002C    v_cmpx_eq_u32_e32      0, v0

0x000030    s_cbranch_execz        _L0

0x000034    s_buffer_load_dword    s10, s[0:3], 0x4

0x00003C    s_mov_b64              s[6:7], 0

0x000040    s_branch               _L1

0x000044    s_nop                  0

0x000048    s_nop                  0

0x00004C    s_nop                  0

0x000050    s_nop                  0

0x000054    s_nop                  0

0x000058    s_nop                  0

0x00005C    s_nop                  0

0x000060    s_nop                  0

0x000064    s_nop                  0

0x000068    s_nop                  0

0x00006C    s_nop                  0

0x000070    s_nop                  0

0x000074    s_nop                  0

0x000078    s_nop                  0

0x00007C    s_nop                  0

_L2:

0x000080    s_or_b64               exec, exec, s[8:9]

0x000084    s_waitcnt              vmcnt(0)

0x000088    v_readfirstlane_b32    s8, v0

0x00008C    s_waitcnt              lgkmcnt(0)

0x000090    v_cndmask_b32_e64      v0, s10, 0, vcc_lo

0x000098    v_or_b32_e32           v0, s8, v0

0x00009C    v_cmp_ne_u32_e32       vcc_lo, 0, v0

0x0000A0    s_or_b64               s[6:7], vcc, s[6:7]

0x0000A4    s_andn2_b64            exec, exec, s[6:7]

0x0000A8    s_cbranch_execz        _L0

_L1:

0x0000AC    v_mbcnt_lo_u32_b32     v0, exec_lo, 0

0x0000B4    v_mbcnt_hi_u32_b32     v0, exec_hi, v0

0x0000BC    v_cmp_eq_u32_e32       vcc_lo, 0, v0

0x0000C0    s_and_saveexec_b64     s[8:9], vcc

0x0000C4    s_cbranch_execz        _L2

0x0000C8    s_waitcnt              lgkmcnt(0)

0x0000CC    v_mov_b32_e32          v0, s10

0x0000D0    buffer_atomic_or       v0, off, s[0:3], 0 glc

0x0000D8    s_branch               _L2

_L0:

0x0000DC    s_or_b64               exec, exec, s[4:5]

0x0000E0    v_mov_b32_e32          v0, 1

0x0000E4    buffer_store_dword     v0, off, s[0:3], 0 offset:1024

0x0000EC    s_endpgm    

--

Another workaround attempt that failed

  • Tried to trick the compiler into using a VMEM load instead of SMEM
  • But it still ignores the "volatile" and "coherent" and hoists the load out of the spin loop
  • It's strange that it even factored the compare out of the spin loop too!

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

layout(set=0,binding=0,std430)coherent volatile buffer ssboC_ {I1 ssboC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly coherent volatile buffer ssboRC_ {I1 ssboRC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly buffer ssboR_ {I1 ssboR[1024*1024*1024/4];};

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(local_size_x=64)in;

void main(){

// Fast check on already signaled

I1 hack=ssboR[1]; // Make sure that loads a zero!

hack=hack&gl_LocalInvocationID.x; // Trick the compiler into a vector load (to the same address)!

if(ssboR[0]==0){

// Not signalled, so must spin until signalled, note the coherent read

while(true){if(ssboC[hack]!=0)break;}}

// Make the shader do something so the program doesn't get dead-coded

ssboW[256]=1;}

--

Disassembly

--

0x000000    s_getpc_b64            s[2:3]

0x000004    s_mov_b32              s0, s1

0x000008    s_mov_b32              s1, s3

0x00000C    s_load_dwordx4         s[0:3], s[0:1], null

0x000014    s_waitcnt              lgkmcnt(0)

0x000018    s_buffer_load_dword    s4, s[0:3], null

0x000020    s_waitcnt              lgkmcnt(0)

0x000024    s_cmp_lg_u32           s4, 0

0x000028    s_cbranch_scc1         _L0

0x00002C    s_buffer_load_dword    s4, s[0:3], 0x4

0x000034    s_waitcnt              lgkmcnt(0)

0x000038    v_and_b32_e32          v0, s4, v0

0x00003C    s_mov_b64              s[4:5], 0

0x000040    v_lshlrev_b32_e32      v0, 2, v0

0x000044    buffer_load_dword      v0, v0, s[0:3], 0 offen glc dlc

0x00004C    s_waitcnt              vmcnt(0)

0x000050    v_cmp_ne_u32_e32       vcc_lo, 0, v0

_L1:

0x000054    s_and_b64              s[6:7], exec, vcc

0x000058    s_or_b64               s[4:5], s[6:7], s[4:5]

0x00005C    s_andn2_b64            exec, exec, s[4:5]

0x000060    s_cbranch_execnz       _L1

0x000064    s_or_b64               exec, exec, s[4:5]

_L0:

0x000068    v_mov_b32_e32          v0, 1

0x00006C    buffer_store_dword     v0, off, s[0:3], 0 offset:1024

0x000074    s_endpgm              

--

Finally a workaround

  • Will trick the compiler into thinking the address is changing inside the loop
  • By adding a constant that only the programmer knows is zero in the loop iteration
  • Then it bypasses the bug
  • However the compiler still chooses a poor branch path
  • The fast path isn't linear
  • And this requires some extra dummy constant loads

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

layout(set=0,binding=0,std430)coherent volatile buffer ssboC_ {I1 ssboC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly coherent volatile buffer ssboRC_ {I1 ssboRC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly buffer ssboR_ {I1 ssboR[1024*1024*1024/4];};

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(local_size_x=64)in;

void main(){

// Fast check on already signaled

if(ssboR[0]==0){

// Not signalled, so must spin until signalled, note the coherent read

I1 hack=ssboR[1]; // Make sure that loads a zero!

I1 hack2=ssboR[2]; // Make sure that loads a zero! Trick the compiler into thinking the address changes!

while(true){if(ssboRC[hack]!=0)break;hack+=hack2;}}

// Make the shader do something so the program doesn't get dead-coded

ssboW[256]=1;}

--

Disassembly

--

0x000000    s_getpc_b64              s[2:3]

0x000004    s_mov_b32                s0, s1

0x000008    s_mov_b32                s1, s3

0x00000C    s_load_dwordx4           s[0:3], s[0:1], null

0x000014    s_waitcnt                lgkmcnt(0)

0x000018    s_buffer_load_dword      s4, s[0:3], null

0x000020    s_waitcnt                lgkmcnt(0)

0x000024    s_cmp_lg_u32             s4, 0

0x000028    s_cbranch_scc1           _L0

0x00002C    s_buffer_load_dwordx2    s[4:5], s[0:3], 0x4

0x000034    s_waitcnt                lgkmcnt(0)

0x000038    s_lshl_b32               s4, s4, 2

0x00003C    s_lshl_b32               s5, s5, 2

_L1:

0x000040    s_buffer_load_dword      s6, s[0:3], s4 glc dlc

0x000048    s_add_i32                s4, s4, s5

0x00004C    s_waitcnt                lgkmcnt(0)

0x000050    s_cmp_eq_u32             s6, 0

0x000054    s_cbranch_scc1           _L1

_L0:

0x000058    v_mov_b32_e32            v0, 1

0x00005C    buffer_store_dword       v0, off, s[0:3], 0 offset:1024

0x000064    s_endpgm                

--


Knowledge Base / Resources

Compiler Testing Online

Instruction Level Cache Controls on AMD and NVIDIA

  • AMD
  • Takeaways
  • Fully bypassing L2/L3 doesn't exist as an instruction-level cache control in later HW
  • This must be done via page table (see DEVICE_UNCACHED_BIT_AMD)
  • This is in contrast to NV’s hardware which has it at the instruction level, but doesn’t have or doesn’t expose a Vulkan DEVICE_UNCACHED memory type for page table control
  • TODO
  • Takes a bit of work to understand AMD’s cache control evolution
  • Hit Evict - A hit will be used, but regardless after the load the line is evicted
  • Aka “Last Use” but won’t leave stale lines either
  • Hit No Allocate - TODO (not well described in AMD’s public docs)
  • Miss Evict - The cache is “bypassed” in behavior, but “used” in implementation
  • Any matching lines (that would hit) are forced to miss
  • And after the operation the lines are evicted
  • HW burns a line in a temporary way
  • Stream (Store) - Hit leaves the line, but doesn’t update age to most recent
  • Documented in the RDNA1 docs but the behavior on a Miss isn't fully clear
  • Stream (Load) - TODO (not well described in AMD’s public docs)
  • SMEM ops lack SLC bit (no L2 cache control), SLC=0 (cached)
  • SMEM doesn’t have stores (well at least one chipset did but it wasn’t exposed)
  • SMEM cannot bypass L2 via SLC control
  • GCN
  • LOAD
  • GLC=0 Cached L1+L2
  • GLC=1 Miss L1 Fetch L2 (Cache only in the coherent domain)
  • SLC=0 Cache in L2
  • SLC=1 Force miss in L2 (goes away in RDNA3)
  • STORE
  • GLC=0 Only if store writes full line, leave line in L1, write through to L2
  • GLC=1 Don’t leave line in L1, write-through to L2
  • SLC=0 Cache in L2
  • SLC=1 TODO (goes away in RDNA3)
  • RDNA 1&2
  • The old L1 is now L0, GLC controls L0 behavior (same as before)
  • There is a new L1 mid-level cache for reads (writes actually physically bypass)
  • SLC is now mixed with DLC bits
  • LOAD
  • SLC=0 DLC=0 - Cached L2+L1
  • SLC=0 DLC=1 - Cached L2, Miss Evict L1
  • SLC=1 DLC=0 - Stream L2, Cached L1
  • SLC=1 DLC=1 - Hit No Allocate L2, Miss Evict L1
  • STORE
  • SLC=0 DLC=0 - Cached L2
  • SLC=0 DLC=1 - Bypass L2 (goes away in RDNA3)
  • SLC=1 DLC=0 - Stream (Hit leaves line, but doesn’t update age)
  • SLC=1 DLC=1 - Hit No Allocate
  • RDNA 3
  • DLC bits repurposed for MALL (L3) controls, so SLC+DLC bit meanings change
  • MALL/LLC/L3 is what AMD documents as the "Infinity Cache"
  • L1 control based on SLC+GLC bits
  • Some cache control comes from resource descriptor bits now (llc_alloc bits)
  • This llc_alloc is an LLC (Last Level Cache, aka L3) override?
  • Overrides behavior specified on the instruction
  • Probably so drivers can change global behavior for a given resource
  • S_LOAD, llc_alloc=0 (no descriptor)
  • LOAD
  • SLC=0 - Cached L2
  • SLC=1 - Stream L2
  • DLC=0 - L3 normal (if llc_alloc is set then this gets forced to DLC=1)
  • DLC=1 - L3 non-temporal hint (no alloc)
  • SLC=0 GLC=0 - Cache L1
  • SLC=0 GLC=1 - Miss Evict L1
  • SLC=1 GLC=0 - Hit Evict L1 (don’t leave line in L1 after)
  • SLC=1 GLC=1 - Miss Evict L1
  • STORE
  • SLC=0 - Cached L2
  • SLC=1 - Stream L2
  • DLC=0 - L3 normal
  • DLC=1 - L3 non-temporal hint (except if llc_alloc overrides to DLC=0)
  • RDNA 4?
  • Looks like the Chips and Cheese analysis isn't right based on the actual source
  • It's just the same 3 bits as before just re-purposed
  • Full changelist
  • Look for "GFX12+" in comments "enum CPol" (likely means cache policy)
  • Looks like they put in WorkGroup RoundRobin scheduling!
  • Scope bits alias the lower 2-bits of TH
  • {TH_NT_RT,TH_RT_NT,TH_NT_HT} not supported for SMEM
  • Translates to SMEM has no DLC bit (just {RT,NT,HT,LU})
  • {RETURN,RT,RT_RETURN,NT,NT_RETURN,CASCADE_RT,CASCADE_NT} are the only valid options for atomics
  • So CASCADE is fire-and-forget only
  • MUBUF/MTBUF moved from U16 to S24
  • FLAT is signed instead of unsigned offset
  • 210  [MEANS SPECULATION]
    ..1  GLC / SC0 = 1 [pre-GFX12]
    .1.  SLC / SC1 / NT = 2 [pre-GFX12]
    1..  DLC = 4 [pre-GFX12]
    ===
    .ss  Scope [GFX12]
    .00  CU
    .01  SE
    .10  DEV
    .11  SYS
    ===
    ttt  Temporal Hint [GFX12]
    000  TH_RT = regular
    001  TH_NT = non-temporal
    010  TH_HT = high-temporal
    011  TH_RT_WB = regular (CU/SE), high-temporal with write-back (MALL) [STORE]
    011  TH_LU = last use [LOAD]
    011  TH_BYPASS = only used with scope = 3
    100  TH_NT_RT = non-temporal (CU/SE), regular (MALL)
    101  TH_RT_NT = regular (CU/SE), non-temporal (MALL)
    110  TH_NT_HT = non-temporal (CU/SE), high-temporal (MALL)
    111  TH_NT_WB = non-temporal (CU/SE), high-temporal with write-back (MALL) [STORE]
    111  TH_RESERVED = unused value for load instructions [LOAD]
    ===
    aaa  Atomic Temporal Hint [GFX12]
    ..r  TH_ATOMIC_RETURN = GLC (return or not)
    .n.  TH_ATOMIC_NT = SLC (non-temporal or not)
    c..  TH_ATOMIC_CASCADE = 4 (cascade or regular) [no mixing with ATOMIC_RETURN]
  • SPECULATION: Possible meaning?
  • Maybe there is no L2, instead only a MALL?
  • No L2 language in there any more in the LLVM comments
  • Regular = Evict Normal
  • Non-Temporal = Evict First or Evict Unchanged
  • High-Temporal = Evict Last
  • There is no GLC=1 on stores, which means likely NT stores are HIT EVICT in L0 (aka CU) and still bypass L1 (aka SE) else there would be no way to implement layout "coherent"
  • Maybe TH_RT_NT is for ability to get more write-combining in L0 for things that write out data across disconnected stores, but otherwise stream (no reuse in larger level caches)
  • Maybe the difference between HT and WB is that in normal HT mode the aim is to avoid WB and hope for LU in time to avoid WB, and WB mode, it's reuse but also WB (use next frame too), so can WB like normal

 

  • NVIDIA
  • LOADS
  • .ca Cache All Levels
  • .cg Cache Global (L2 but not L1, cache only in the coherent cache domain)
  • .cs Cache Streaming (evict first policy)
  • .lu Last Use
  • Maps to .cs for global addresses, only does .lu for workgroup private memory (register spill/etc)
  • .cv Cache Volatile (don’t cache, specific example system memory)
  • This is how NV supports polling CPU-side memory
  • STORES
  • .wb Write Back
  • .cg Cache Global (L2 but not L1, cache only in the coherent cache domain)
  • .cs Cache Streaming (evict first policy)
  • .wt Write Through (don’t cache, specific example system memory)
  • This is how NV supports lower latency CPU communication (don’t need to wait for job to finish and write-back L2 cache)
  • >=SM_70 (aka Volta)
  • Hardware adds cache eviction priority hints
  • .en Evict Normal
  • .ef Evict First (aka Streaming possible reuse)
  • .el Evict Last (high priority to persist in cache)
  • .eu Evict Unchanged (do not change ordering)
  • .na No Allocate (do not place data in cache, hard streaming, one use)
  • Latest hardware notes (4xxx series) by inspecting fuzzed disassembly
  • Has more cache controls (4-bits) so perhaps beyond what is exposed in PTX?
  • TEX has eviction priority but no cache control
  • SUST (surface store) has both eviction priority and cache control
  • NVIDIA Questions
  • What is the existing mapping of layout qualifiers to cache control?
  • Layout “coherent” should map to '.cg' (else it wouldn’t work) - good there
  • TODO: Open question if "volatile" maps to '.cv' for loads and '.wt' for stores
  • If yes, then it opens up a portable way to do low-latency communication
  • Use the existing layout qualifiers on NV but on AMD use the memory type to force uncached operation (mix different mechanisms, but get the same result)
  • TODO: Going to assume one can mix '.cg' with '.el' to avoid keeping Evict Last lines in L1

Mapped Page Cache?

The idea is to remove the need to do file operations

  • Designed for indie title usage (tiny install, whole game is loaded into VRAM at start)

Example of what the author is trying in Vulkan

  • CreateFileMapping() and MapViewOfFile() - map a file into app's address space
  • Background thread walks the pages to ensure they are pre-faulted if possible
  • Using VkImportMemoryHostPointerInfoEXT with VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_MAPPED_FOREIGN_MEMORY_BIT_EXT
  • And pHostPointer set to the address of the CPU mapped file
  • Using both
  • VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
  • VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
  • vkAllocateMemory(), vkBindBufferMemory(), vkUpdateDescriptorSets()
  • During launch there is a DMA copy of this buffer on the GPU to the beginning of a DEVICE_LOCAL buffer (the big GPU-side buffer)
  • During dev usage, DMA copy from big GPU-side buffer back to this CPU mapping
  • This is how the game data is 'saved' back into the Cart file

NVIDIA ISA Notes

References

  • Fixed 128-bit instructions
  • 6-bit mask for which barriers to wait on (6 HW barriers, tracking long term instruction completion)
  • 3-bit write barrier (selects which barrier bit is set when destination is written) and similar for a write-after-read hazard read barrier
  • There is an explicit yield to another warp bit (has operand cache effects)
  • This mechanism is similar to AMD’s clause but more effective as it isn’t limited to one instruction type!
  • AMD typically has problems of oldest-first camping on VALU causing memory bubbles because other waves cannot make forward progress to get VMEM ops out, NVIDIA’s yield in theory would mitigate that problem
  • Up to 4 operands each with a reuse bit (put into the operand cache)
  • Instructions are predicated (negate bit, 3-bit predicate register)

Hardware Improvement Thoughts

Random ideas on how to possibly improve GPU hardware design, a collection of thoughts across a decade of working in the industry (will update as time allows)

Bit LUT Operation With Data-Driven Truth-Table

  • Many processors still have a traditional design fail with respect to bit logic
  • Specifically having separate opcodes for each individual logic op {AND,OR,XOR,etc}
  • Would be better to do this FPGA LUT style where one data-driven operand provides the truth-table (reference semantics sketched after this list)
  • It reduces the number of CPU instructions required (better effective IPC)
  • It provides a way to have some data driven divergence without code divergence
  • 4 operand fetch architecture would be able to do any 3 operand logic operation
  • 3 operands provide the 3 inputs
  • 1 operand provides the truth-table (uses 8-bit from LSBs)
  • Note one could make the 8-bit LUT part of the opcode bits, but that then removes the data-driven capability (which might be the right unfortunate compromise for only a 3-operand architecture)
  • If the HW already has the capacity to write two 32-bit results per instruction per clock
  • Then one could build this instruction to do 2 logic ops in parallel (use a 16-bit LUT)
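
A GLSL reference for what the proposed instruction would compute (software emulation only, clearly not the single-op form being asked for); the 8 LSBs of 'lut' are the truth table indexed by the corresponding bits of the three inputs.

--

// Per bit: result = (lut >> (a_bit | b_bit<<1 | c_bit<<2)) & 1
// Example tables: 0x96 = a^b^c, 0xE8 = majority(a,b,c)
uint lut3(uint a,uint b,uint c,uint lut){
  uint r=0u;
  for(uint i=0u;i<32u;++i){
    uint idx=((a>>i)&1u)|(((b>>i)&1u)<<1)|(((c>>i)&1u)<<2);
    r|=((lut>>idx)&1u)<<i;}
  return r;}

--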

CU-Shared SGPRs

Providing perhaps a better way to share controlling data across waves, or communicate across waves

  • Reserve wave ID=0
  • Reuse its SGPR space for a CU-shared area of SGPRs
  • Put in instruction access to toggle between {shared,private} SGPRs
  • This HW implementation would likely be very low cost

Divergent Textures vs HW Design

  • Trouble spots for current hardware
  • Divergent texture descriptors
  • Divergent mip levels
  • Addressing logic gets too complex, need to factor it out
  • Using a minimum alignment to extend the 32-bit addressing capability
  • 32-bit value with 256-byte alignment
  • Don't store the lower 8-bits
  • 40-bit effective address in 32-bits = 1 TiB VA space (see the sketch after this list)
  • One could build in support for divergence given
  • Same texture descriptor (so same format and size, etc)
  • Same mip level (at runtime)
  • Factor out the base address from the descriptor and send with texture coordinates
  • One could argue that layered textures already cover this case quite well
  • But having a fixed number of layers is sometimes a deal breaker
  • Radical changes
  • If base address is completely factored out of the descriptor
  • And the HW uses descriptor indexes (instead of passing descriptors like AMD)
  • Then the index is basically a "texture compatibility index"
  • So textures of the same configuration (same index) can dynamically be evaluated together in parallel at runtime (even if their base address is divergent)
  • And the actual resource divergence is now limited to just the subset of 'compatibility indexes'
  • Just support a passed in base address (separate from that index)
  • And then introduce an adder for the final addressing in the texture unit
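
A GLSL sketch of the 256-byte-aligned base address packing described in the list above (uint64_t as provided by GL_EXT_shader_explicit_arithmetic_types, used elsewhere in this document):

--

// Drop the low 8 bits (valid for a 256-byte aligned base) so a 40-bit
// address fits in 32 bits = 1 TiB of addressable VA space
uint packBase(uint64_t base256){return uint(base256>>8);}
uint64_t unpackBase(uint packed){return uint64_t(packed)<<8;}

--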

Float Bool Fixes - Float Mode Without NaNs

The saturation modifier with FMA provides a good tool; some related IEEE rules:

  • saturate(NaN) = 0
  • saturate(-INF) = 0
  • saturate(+INF) = 1
  • mul(0,INF)=NaN … trouble case, but saturate(NaN) converts that back to 0

First a reference of what is used today for optimization (a GLSL transcription of a few of these follows the list)

  • 1.0 = True
  • 0.0 = False
  • And(x,y) = min(x,y)
  • And(x,y) = saturate(x*y)
  • And3(x,y,z) = min3(x,y,z)
  • AndOr(x,y,z) = saturate(x*y+z)
  • AndNot(x,y) = -x*y+1
  • Gt(x,y) = Gtz(x-y) … 2 ops
  • Gtz(x) = saturate(INF*x) … {NaN := 0, x GT 0 := 1, else 0}
  • Lt(x,y) = Ltz(x-y) … 2 ops
  • Ltz(x) = saturate(-INF*x) … {NaN := 0, x LT 0 := 1, else 0}
  • Ne(x,y) = Gtz(abs(x-y)) … 2 ops
  • Not(x) = 1-x
  • Or(x,y) = max(x,y)
  • Or(x,y) = saturate(x+y)
  • Or3(x,y,z) = max3(x,y,z)
  • Sel(x,y,z) = z*y+((-z)*x+x) … z==0.0?x:y … 2 ops, preserves precision
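
A GLSL transcription of a few of the identities above (hand-written here; saturate is clamp to [0,1]):

--

// 1.0 = true, 0.0 = false; per the Warning below INF may need to be a
// load-time constant (read from a buffer) instead of this literal form
const float INF=uintBitsToFloat(0x7F800000u);

float And(float x,float y){return min(x,y);}
float Or(float x,float y){return max(x,y);}
float Not(float x){return 1.0-x;}
// Relies on the hardware saturate(NaN)=0 behavior listed above
float Gtz(float x){return clamp(INF*x,0.0,1.0);} // {NaN:=0, x GT 0:=1, else 0}
float Gt(float x,float y){return Gtz(x-y);}
// z==0.0?x:y done as two FMAs, preserving precision of the selected input
float Sel(float x,float y,float z){return z*y+((-z)*x+x);}

--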

Warning

  • Some compiler engineers don’t understand the importance of mixing INF and saturate
  • So some platforms have bugs (they might factor out or transform the *INF into something else)
  • Since INF is a compile time literal
  • The workaround might be to push INF into a load-time constant instead

Optimization cases

  • Packed math (HW does not support a packed conditional)
  • Avoiding SALU to VALU dependent latency of conditionals
  • etc

The following are more expensive (extra op for the Not() which is a 1-x)

  • Le(x,y) … Not(Gt(x,y))
  • Ge(x,y) … Not(Lt(x,y))
  • Eq(x,y) … Not(Ne(x,y))

Could be faster though if there was a way to run the hardware in a modified no-NaN mode

  • Actual IEEE rules that provide the problems
  • INF-INF = NaN
  • (+/-)INF*0 = NaN
  • Would rather have no NaNs and instead this logic
  • 0*INF = 0
  • -INF+INF = 0
  • -INF-INF = -INF
  • +INF+INF = +INF
  • Then the following is possible (one less operation)
  • Eq(x,y) = saturate(-INF*abs(x-y)+INF)
  • Ge(x,y) = Gez(x-y)
  • Gez(x) = saturate(+INF*x+INF)
  • Le(x,y) = Lez(x-y)
  • Lez(x) = saturate(-INF*x+INF)

FMA Select Uber Op

  • This is a way to mix a FMA with a select in one operation
  • Applies to a 4-operand architecture
  • Can be implemented with very low cost in HW
  • It helps improve the common pattern of SIMD logic (sketched after this list)
  • x=option 0 ... computed at runtime
  • y=option 1 ... computed at runtime
  • c=logic op to control the choice between x and y
  • x=c?x:y ... aka select (or conditional mask)
  • So this applies to the last two parts (the logic op and the select) done as one operation
  • e=fmaWithModifiers(a,b,c)!=0.0?e:d
  • This doesn't require a 5th operand fetch
  • The HW selects 'e' by simply predicating off the resulting write
  • And notice the 'd' would typically be from a prior instruction return cache
  • Avoiding an extra register file read (so low power)
  • Note 'fmaWithModifiers' includes saturate()
  • Which means one can use saturate FMA based {0=false, 1=true} bool logic
  • The FMA is that float bool logic op in the suggested usage
  • Which is quite good for packed or vector-in-register to parallelize conditional logic
  • This would significantly improve packed 16-bit code generation for architectures that otherwise only have one HW flags register between CMP and SELECT
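
A GLSL sketch of the pattern the proposed op would fuse, written as the two separate steps it takes today (the hypothetical single op would instead predicate the destination write):

--

float fmaSelect(float a,float b,float c,float d,float e){
  float cond=clamp(fma(a,b,c),0.0,1.0); // the saturate-FMA float-bool logic op
  return (cond!=0.0)?e:d;               // the select the proposed op would fold in
}

--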

FMA Negate Input Modifier - 3-bit to 2-bit

  • Applies to hardware using 3-bits for negative input modifier for FMAs
  • ([+/-]a*[+/-]b)+([+/-]c)
  • The ((-a)*(-b)) has the same result as (a*b)
  • The ((-a)*b) has the same result as (a*(-b))
  • Can instead do 2-bit in the following form
  • ([+/-](a*b))+([+/-]c)
  • Saving one bit of opcode space
  • Or allowing a secondary usage (given generic 3 operand + modifiers opcode format)

IMAD Leveraging The Float Input Negate Modifier(s)

  • Starting with why implementations don't reuse neg opcode modifier bits
  • Because two's complement NEG requires a NOT and ADD 1 (carry bit)
  • In contrast to float where it's just an input MSB bit flip
  • There are a few useful forms here
  • Lessons learned from how FPGA people leverage DSP blocks
  • At a minimum the integer ADDers support one carry bit
  • Could support using a 3-bit modifier field as
  • [~](a*b) + [~]c + [carry]
  • So two not modifiers and one optional carry
  • Which is substantially more useful than the base IMAD without modifiers
  • There might be an even better way to do modifiers without carry abuse
  • a-b = a + not(b) + 1
  • a-b = not(not(a) + b) (both identities are checked in the sketch after this list)
  • Xilinx (aka AMD) FPGA DSP48E1 ALUMODE bits for reference
  • Example of how to do this well
  • Where Z is the added operand, and (X+Y) is the partial products of the MUL
  • Also for non-mul cases
  • X can be {0, P=prior result}
  • Y can be {0, C=the other operand, ~0 (aka -1)}
  • [00] Z+(X+Y+CIN) … add
  • [01] (~Z)+(X+Y+CIN) … -Z+(X+Y+CIN)-1 … if CIN=1 then get -Z+(X+Y) reverse subtract
  • [10] ~(Z+(X+Y+CIN)) … neg output - 1
  • [11] ~((~Z)+(X+Y+CIN))  … subtract … Z-(X+Y+CIN)
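
A quick GLSL check of the two subtract identities above (both hold modulo 2^32 for uint):

--

uint subViaNotAddCarry(uint a,uint b){return a+(~b)+1u;} // a-b = a + not(b) + 1
uint subViaDoubleNot(uint a,uint b){return ~((~a)+b);}   // a-b = not(not(a) + b)

--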

Integer Signed Maths - {MIN_NEGATIVE to 0} as Fixed-Point {1.0 to 0.0}

  • This is something for an older integer FPGA ALU design
  • Might have some use for integer GPU maths in shader programs
  • Two’s complement has more values on the negative side
  • Thus working with negative numbers is often better than positives
  • For fixed-point bool logic use the sign-bit (MSB)
  • 0 (positive) is false
  • Negative is true
  • Much easier than testing if equal to zero in HW (or FPGA)
  • Standardizing on negatives requires a lot more mental work though
  • For example a parabolic sqrt(x) estimation would normally be 2*x-x*x
  • But you’d need to transform that to 2*x+x*x (flip the sign of the square)

Max and Min in One Op

  • Typically max and min are separate operations
  • Sometimes with 3 operand forms min3(a,b,c) and max3(a,b,c)
  • If the hardware had ability to write (or accept) 2 results from the ALU per clock
  • Accept as in goes in a destination cache to avoid register fetch from SRAM in later instructions
  • Then getting both the min and max result would be quite useful
  • Because often both the min and max are needed in many algorithms
  • Today the AMD packed math fast path is to do max(half2(a.x,-a.x),half2(b.x,-b.x)) (see the sketch after this list)
  • Which provides {max,-min} respectively
  • And allows for AoS or SoA changes at the same time (change the .x to a .y in either operand)
  • This shows up in TAAs, image processing, block compression, etc
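
A packed fp16 GLSL sketch of that fast path (assuming the GL_EXT_shader_explicit_arithmetic_types f16 types used elsewhere in this document):

--

// One packed max returns {max(a,b), -min(a,b)} since max(-a,-b) == -min(a,b)
f16vec2 maxNegMin(float16_t a,float16_t b){return max(f16vec2(a,-a),f16vec2(b,-b));}

--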

Shift Direction - Data Driven

  • HW tends to have separate ops for shift left and shift right
  • One could instead allow the shift amount to choose the shift direction (see the sketch after this list)
  • Signed for right and unsigned for left
  • Combine that with ability to shift output to zero too
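
A GLSL emulation of the data-driven direction (the ask is for this to be a single HW op): non-negative amounts shift left, negative amounts shift right.

--

uint shiftSigned(uint x,int s){return (s>=0)?(x<<uint(s)):(x>>uint(-s));}

--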

Shifts to Zero

  • 32-bit shift implementation often takes the 5-bits LSB as the shift amount
  • This has an unfortunate side effect that it becomes impossible to do a data-driven shift value to zero (either for the left or right shift)
  • Often when using shifts for SIMD-parallel bit packing, shift-to-zero lets one denote a field that is not wanted for some divergent SIMD data, without code divergence or extra instructions (today this takes the extra select shown in the sketch after this list)
  • The ask would be that one could shift by {0 to 32} instead of {0 to 31} for 32-bit shifts
  • Probably takes just one extra level of logic to correct this in HW (easy to do)
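
Today's GLSL/HW needs an extra select to get the shift-by-32 = zero behavior the ask would make free:

--

// A shift amount of 32 (or more) produces zero instead of the modulo-32 behavior
uint shlToZero(uint x,uint s){return (s>=32u)?0u:(x<<s);}

--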

TBUFFER But With 64-bit Base Pointers Instead of Descriptors

  • Topic item for future GPU evolution ...
  • Moving from descriptors to 64-bit base pointers means no descriptors for drivers to manage
  • No longer need to put buffers into descriptor sets
  • TBUFFER as in load/store but with Type specified in the opcode
  • And by Type, ability to load/store the same types as in images
  • So advanced types like 10:10:10:2 and so on
  • Obviously would want cache control as part of the opcode
  • TODO: Along with maybe an addressing mode?
  • Something that could do standard address translations for locality

XOR Offsetting

  • This was something more for some radical integer FPGA GPU design
  • Not sure if it could be useful outside that context!
  • Idea was to use [XOR(adr,offset)] instead of [adr+offset]
  • Because the adder gets expensive
  • And for aligned and power of 2 sized things, it’s the same output
  • Possibly the XOR provides some tools for re-ordering data
  • One can use the LSBs of the base address to choose a reordering pattern (table and sketch below)

OFF   000 001 010 011 100 101 110 111        0 1 2 3 4 5 6 7

BAS   --- --- --- --- --- --- --- ---        - - - - - - - -

000 | 000 001 010 011 100 101 110 111    0 | 0 1 2 3 4 5 6 7  ---> zero BAS works like ADD

001 | 001 000 011 010 101 100 111 110    1 | 1 0 3 2 5 4 7 6  ---

010 | 010 011 000 001 110 111 100 101    2 | 2 3 0 1 6 7 4 5   ^

011 | 011 010 001 000 111 110 101 100 -> 3 | 3 2 1 0 7 6 5 4   |   the rest provide various reordering

100 | 100 101 110 111 000 001 010 011    4 | 4 5 6 7 0 1 2 3   |

101 | 101 100 111 110 001 000 011 010    5 | 5 4 7 6 1 0 3 2   v

110 | 110 111 100 101 010 011 000 001    6 | 6 7 4 5 2 3 0 1  ---

111 | 111 110 101 100 011 010 001 000    7 | 7 6 5 4 3 2 1 0  ---> ~0 BAS inverts the order of OFF
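
A trivial GLSL form of the idea (assuming power-of-2 alignment and size, so XOR visits the same set of elements as ADD, just reordered by the base's LSBs as in the table above):

--

uint xorAddr(uint base,uint off){return base^off;}

--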


Bonus Section on AMD PC Compiler Bugs

This is only getting updated very slowly when the topics show up again ...

AtomicAdd Fail on AMD PC

Yeah it's that bad, imagine the worst compiler behavior possible, that is what you get ...

Let's try to do something simple like an atomicAdd predicated to the first lane

Round 1 - First Attempt

  • Compiler does the branch to get to single lane execution
  • For the program's if(gl_LocalInvocationID.x==0)
  • Although it didn't need to branch; it could have just predicated execution instead
  • Then the compiler forgets it is already in single lane execution
  • And it uses masked bit count to find the first active lane of execution
  • Then branches a second time
  • Basically duplicating what it did before but slower this time
  • Then it computes the number of active lanes before the second branch
  • And multiplies that active lane count (which the compiler should know is one)
  • By the 1289 atomic add compile time immediate
  • Then it does the atomic
  • Then it goes back and does a redundant readfirstlane
  • And multiplies the prior masked bit count by the 1289 constant
  • To reconstruct the atomic as if it wasn't done in one lane
  • And finally it gets to the program's readlane which ignores the prior step
  • Yeah, WTF?
  • If one is going to do a "perf strategy" for idiot programmers
  • Better at least not destroy performance for "competent" programmers
  • Do no harm
  • And yet, it produces an absolute nightmare to try to workaround

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_KHR_shader_subgroup_arithmetic:require

#extension GL_KHR_shader_subgroup_ballot:require

#extension GL_KHR_shader_subgroup_quad:require

#extension GL_KHR_shader_subgroup_shuffle:require

#extension GL_KHR_shader_subgroup_vote:require

#extension GL_EXT_shader_subgroup_extended_types_float16:require

#define I1 uint32_t

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(set=0,binding=1,r32ui)coherent uniform uimageBuffer stbC_I1[64];

layout(local_size_x=64)in;

void main(){

I1 v;

if(gl_LocalInvocationID.x==0)v=imageAtomicAdd(stbC_I1[0],0,1289);

v=subgroupBroadcast(v,0);

ssboW[gl_LocalInvocationID.x]=v;}

--

Disassembly

--

0x000000    s_mov_b32              s0, s1

0x000004    s_getpc_b64            s[2:3]

0x000008    s_mov_b64              s[4:5], exec

0x00000C    v_cmpx_eq_u32_e32      0, v0

0x000010    s_cbranch_execz        _L0

0x000014    s_mov_b64              s[8:9], exec

0x000018    s_mov_b64              s[6:7], exec

0x00001C    v_mbcnt_lo_u32_b32     v1, s8, 0

0x000024    v_mbcnt_hi_u32_b32     v1, s9, v1

0x00002C    v_cmpx_eq_u32_e32      0, v1

0x000030    s_cbranch_execz        _L1

0x000034    s_mov_b32              s1, s3

0x000038    v_mov_b32_e32          v3, 0

0x00003C    s_load_dwordx4         s[12:15], s[0:1], null

0x000044    s_bcnt1_i32_b64        s1, s[8:9]

0x000048    s_mulk_i32             s1, 0x509

0x00004C    v_mov_b32_e32          v2, s1

0x000050    s_waitcnt              lgkmcnt(0)

0x000054    buffer_atomic_add      v2, v3, s[12:15], 0 idxen glc

_L1:

0x00005C    s_or_b64               exec, exec, s[6:7]

0x000060    s_waitcnt              vmcnt(0)

0x000064    v_readfirstlane_b32    s1, v2

0x000068    v_mad_u32_u24          v1, 0x509, v1, s1

_L0:

0x000074    s_or_b64               exec, exec, s[4:5]

0x000078    s_mov_b32              s1, s3

0x00007C    v_readlane_b32         s4, v1, 0

0x000084    s_load_dwordx4         s[0:3], s[0:1], 0x800

0x00008C    v_lshlrev_b32_e32      v0, 2, v0

0x000090    v_mov_b32_e32          v1, s4

0x000094    s_waitcnt              lgkmcnt(0)

0x000098    buffer_store_dword     v1, v0, s[0:3], 0 offen

0x0000A0    s_endpgm              

--

Round 2 - Use gl_SubgroupInvocationID instead?

  • Surely the compiler would know it's already predicated to one lane?
  • Nope same collection of problems
  • Except it's actually worse
  • Because gl_SubgroupInvocationID requires 2 VALU ops (masked bit count)
  • Instead of just using the gl_LocalInvocationID.x which is already in a VGPR
  • So it ends up doing the masked bit count 2 times

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_KHR_shader_subgroup_arithmetic:require

#extension GL_KHR_shader_subgroup_ballot:require

#extension GL_KHR_shader_subgroup_quad:require

#extension GL_KHR_shader_subgroup_shuffle:require

#extension GL_KHR_shader_subgroup_vote:require

#extension GL_EXT_shader_subgroup_extended_types_float16:require

#define I1 uint32_t

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(set=0,binding=1,r32ui)coherent uniform uimageBuffer stbC_I1[64];

layout(local_size_x=64)in;

void main(){

I1 v;

if(gl_SubgroupInvocationID==0)v=imageAtomicAdd(stbC_I1[0],0,1289);

v=subgroupBroadcast(v,0);

ssboW[gl_LocalInvocationID.x]=v;}

--

Disassembly

--

0x000000    v_mbcnt_lo_u32_b32     v1, -1, 0

0x000008    s_mov_b32              s0, s1

0x00000C    s_getpc_b64            s[2:3]

0x000010    v_mbcnt_hi_u32_b32     v1, -1, v1

0x000018    v_cmp_eq_u32_e32       vcc_lo, 0, v1

0x00001C    s_and_saveexec_b64     s[4:5], vcc

0x000020    s_cbranch_execz        _L0

0x000024    s_mov_b64              s[8:9], exec

0x000028    s_mov_b64              s[6:7], exec

0x00002C    v_mbcnt_lo_u32_b32     v1, s8, 0

0x000034    v_mbcnt_hi_u32_b32     v1, s9, v1

0x00003C    v_cmpx_eq_u32_e32      0, v1

0x000040    s_cbranch_execz        _L1

0x000044    s_mov_b32              s1, s3

0x000048    v_mov_b32_e32          v3, 0

0x00004C    s_load_dwordx4         s[12:15], s[0:1], null

0x000054    s_bcnt1_i32_b64        s1, s[8:9]

0x000058    s_mulk_i32             s1, 0x509

0x00005C    v_mov_b32_e32          v2, s1

0x000060    s_waitcnt              lgkmcnt(0)

0x000064    buffer_atomic_add      v2, v3, s[12:15], 0 idxen glc

_L1:

0x00006C    s_or_b64               exec, exec, s[6:7]

0x000070    s_waitcnt              vmcnt(0)

0x000074    v_readfirstlane_b32    s1, v2

0x000078    v_mad_u32_u24          v1, 0x509, v1, s1

_L0:

0x000084    s_or_b64               exec, exec, s[4:5]

0x000088    s_mov_b32              s1, s3

0x00008C    v_readlane_b32         s4, v1, 0

0x000094    s_load_dwordx4         s[0:3], s[0:1], 0x800

0x00009C    v_lshlrev_b32_e32      v0, 2, v0

0x0000A0    v_mov_b32_e32          v1, s4

0x0000A4    s_waitcnt              lgkmcnt(0)

0x0000A8    buffer_store_dword     v1, v0, s[0:3], 0 offen

0x0000B0    s_endpgm              

--

Round 3 - How about subgroupElect()?

  • Certainly the compiler has to know it's only one lane, that is what subgroupElect() is for after all
  • Nope same collection of problems
  • But it actually got worse
  • The implementation of subgroupElect() tries to figure out the first lane
  • Even though it's the start of the program, and the first lane is obviously lane 0

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_KHR_shader_subgroup_arithmetic:require

#extension GL_KHR_shader_subgroup_ballot:require

#extension GL_KHR_shader_subgroup_quad:require

#extension GL_KHR_shader_subgroup_shuffle:require

#extension GL_KHR_shader_subgroup_vote:require

#extension GL_EXT_shader_subgroup_extended_types_float16:require

#define I1 uint32_t

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(set=0,binding=1,r32ui)coherent uniform uimageBuffer stbC_I1[64];

layout(local_size_x=64)in;

void main(){

I1 v;

if(subgroupElect())v=imageAtomicAdd(stbC_I1[0],0,1289);

v=subgroupBroadcast(v,0);

ssboW[gl_LocalInvocationID.x]=v;}

--

Disassembly

--

0x000000    v_mbcnt_lo_u32_b32     v1, exec_lo, 0

0x000008    s_mov_b32              s0, s1

0x00000C    s_getpc_b64            s[2:3]

0x000010    v_mbcnt_hi_u32_b32     v1, exec_hi, v1

0x000018    v_cmp_eq_u32_e32       vcc_lo, 0, v1

0x00001C    s_and_saveexec_b64     s[4:5], vcc

0x000020    s_cbranch_execz        _L0

0x000024    s_mov_b64              s[8:9], exec

0x000028    s_mov_b64              s[6:7], exec

0x00002C    v_mbcnt_lo_u32_b32     v1, s8, 0

0x000034    v_mbcnt_hi_u32_b32     v1, s9, v1

0x00003C    v_cmpx_eq_u32_e32      0, v1

0x000040    s_cbranch_execz        _L1

0x000044    s_mov_b32              s1, s3

0x000048    v_mov_b32_e32          v3, 0

0x00004C    s_load_dwordx4         s[12:15], s[0:1], null

0x000054    s_bcnt1_i32_b64        s1, s[8:9]

0x000058    s_mulk_i32             s1, 0x509

0x00005C    v_mov_b32_e32          v2, s1

0x000060    s_waitcnt              lgkmcnt(0)

0x000064    buffer_atomic_add      v2, v3, s[12:15], 0 idxen glc

_L1:

0x00006C    s_or_b64               exec, exec, s[6:7]

0x000070    s_waitcnt              vmcnt(0)

0x000074    v_readfirstlane_b32    s1, v2

0x000078    v_mad_u32_u24          v1, 0x509, v1, s1

_L0:

0x000084    s_or_b64               exec, exec, s[4:5]

0x000088    s_mov_b32              s1, s3

0x00008C    v_readlane_b32         s4, v1, 0

0x000094    s_load_dwordx4         s[0:3], s[0:1], 0x800

0x00009C    v_lshlrev_b32_e32      v0, 2, v0

0x0000A0    v_mov_b32_e32          v1, s4

0x0000A4    s_waitcnt              lgkmcnt(0)

0x0000A8    buffer_store_dword     v1, v0, s[0:3], 0 offen

0x0000B0    s_endpgm              

--

Round 4 - Trying Another Branch Strategy

  • Thought for certain this would work, but it is also horrible
  • Made a lane dynamic value that is 1289 on lane 0 but zero on all other lanes
  • Then predicated the atomic by if the value to add was zero
  • But the output is bad, it's still doing the multiply garbage
  • So either that is a bug, or it somehow knows the value is wave coherent inside the branch
  • Quite amazing, will have to try harder

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_KHR_shader_subgroup_arithmetic:require

#extension GL_KHR_shader_subgroup_ballot:require

#extension GL_KHR_shader_subgroup_quad:require

#extension GL_KHR_shader_subgroup_shuffle:require

#extension GL_KHR_shader_subgroup_vote:require

#extension GL_EXT_shader_subgroup_extended_types_float16:require

#define I1 uint32_t

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(set=0,binding=1,r32ui)coherent uniform uimageBuffer stbC_I1[64];

layout(local_size_x=64)in;

void main(){

I1 v=mix(0,1289,gl_LocalInvocationID.x==0);

if(v!=0)v=imageAtomicAdd(stbC_I1[0],0,v);

v=subgroupBroadcast(v,0);

ssboW[gl_LocalInvocationID.x]=v;}

--

Disassembly

--

0x000000    v_mov_b32_e32          v1, 0

0x000004    s_mov_b32              s0, s1

0x000008    s_getpc_b64            s[2:3]

0x00000C    s_mov_b64              s[4:5], exec

0x000010    v_cmpx_eq_u32_e32      0, v0

0x000014    s_cbranch_execz        _L0

0x000018    s_mov_b64              s[8:9], exec

0x00001C    s_mov_b64              s[6:7], exec

0x000020    v_mbcnt_lo_u32_b32     v1, s8, 0

0x000028    v_mbcnt_hi_u32_b32     v1, s9, v1

0x000030    v_cmpx_eq_u32_e32      0, v1

0x000034    s_cbranch_execz        _L1

0x000038    s_mov_b32              s1, s3

0x00003C    v_mov_b32_e32          v3, 0

0x000040    s_load_dwordx4         s[12:15], s[0:1], null

0x000048    s_bcnt1_i32_b64        s1, s[8:9]

0x00004C    s_mulk_i32             s1, 0x509

0x000050    v_mov_b32_e32          v2, s1

0x000054    s_waitcnt              lgkmcnt(0)

0x000058    buffer_atomic_add      v2, v3, s[12:15], 0 idxen glc

_L1:

0x000060    s_or_b64               exec, exec, s[6:7]

0x000064    s_waitcnt              vmcnt(0)

0x000068    v_readfirstlane_b32    s1, v2

0x00006C    v_mad_u32_u24          v1, 0x509, v1, s1

_L0:

0x000078    s_or_b64               exec, exec, s[4:5]

0x00007C    s_mov_b32              s1, s3

0x000080    v_readlane_b32         s4, v1, 0

0x000088    s_load_dwordx4         s[0:3], s[0:1], 0x800

0x000090    v_lshlrev_b32_e32      v0, 2, v0

0x000094    v_mov_b32_e32          v1, s4

0x000098    s_waitcnt              lgkmcnt(0)

0x00009C    buffer_store_dword     v1, v0, s[0:3], 0 offen

0x0000A4    s_endpgm              

--

Round 5 - Trying Not to Branch Strategy

  • Ok this finally worked, but it requires some background to understand how it works
  • For image operations there are 2 ways to disable the store or atomic
  • First set EXEC to disable the associated lane
  • This is what we are told to do in "school"
  • This is what the compiler fails quite hard at
  • The second is to just set the store address to something out of bounds
  • Note this likely won't work for SSBOs!
  • And definitely won't work for 64-bit pointers!
  • One needs to use STORAGE_TEXEL_BUFFER for this to work!
  • Out of bounds writes are disabled after address generation
  • For stores the hardware pre-merges all the same address writes
  • So regardless of when the disable happens it should be fast
  • For atomics this depends on the hardware skipping the disabled lanes early
  • TODO: Will need to check this in a benchmark to double verify
  • So this one always does the atomic on all lanes
  • Depends on the hardware fast path that disables atomics for out-of-bounds addresses
  • By just pushing the address to an out-of-bounds value for all lanes except the first
  • This disables all the dead stupid compiler behavior
  • Probably because the address is now dynamic

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_KHR_shader_subgroup_arithmetic:require

#extension GL_KHR_shader_subgroup_ballot:require

#extension GL_KHR_shader_subgroup_quad:require

#extension GL_KHR_shader_subgroup_shuffle:require

#extension GL_KHR_shader_subgroup_vote:require

#extension GL_EXT_shader_subgroup_extended_types_float16:require

#define I1 uint32_t

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(set=0,binding=1,r32ui)coherent uniform uimageBuffer stbC_I1[64];

layout(local_size_x=64)in;

void main(){

I1 v=mix(I1(-4),0,gl_LocalInvocationID.x==0);

v=imageAtomicAdd(stbC_I1[0],int(v),1289);

v=subgroupBroadcast(v,0);

ssboW[gl_LocalInvocationID.x]=v;}

--

Disassembly

--

0x000000    s_mov_b32             s4, s1

0x000004    s_getpc_b64           s[0:1]

0x000008    v_cmp_eq_u32_e32      vcc_lo, 0, v0

0x00000C    s_mov_b32             s5, s1

0x000010    v_mov_b32_e32         v2, 0x509

0x000018    s_clause              0x1

0x00001C    s_load_dwordx4        s[0:3], s[4:5], null

0x000024    s_load_dwordx4        s[4:7], s[4:5], 0x800

0x00002C    v_cndmask_b32_e64     v1, -4, 0, vcc_lo

0x000034    v_lshlrev_b32_e32     v0, 2, v0

0x000038    s_waitcnt             lgkmcnt(0)

0x00003C    buffer_atomic_add     v2, v1, s[0:3], 0 idxen glc

0x000044    s_waitcnt             vmcnt(0)

0x000048    v_readlane_b32        s0, v2, 0

0x000050    v_mov_b32_e32         v1, s0

0x000054    buffer_store_dword    v1, v0, s[4:7], 0 offen

0x00005C    s_endpgm              

--

GL_EXT_buffer_reference

Re Gustav Sterbrant's comment: 'They support pointer arithmetic and everything'

Let's see if everything works ...

TLDR

  • Serious correctness bug: stores marked 'coherent' are missing GLC=1
  • Makes this completely unusable on AMD
  • Second problem, only the RDNA3 compile optimizes correctly, and only for the simple case
  • RDNA2 and before are unusably slow (the 64-bit address math is done with VALU ops instead of the hardware addressing modes)
  • Apparently AMD changed compilers for RDNA3 HW
  • Serious problems with basic code generation even on RDNA3 in the less simple path (packed 16-bit)
  • See the later example
  • Which is too bad, because this extension is otherwise exactly what the author was looking for

Review of AMD HW (example from RDNA2 ISA guide)

S_LOAD_*

  • Load from 1-16 dwords
  • address = base + offset + imm21
  • base : 64-bit SGPR pair
  • offset : 32-bit SGPR providing unsigned byte offset
  • imm21 : signed byte offset (but must be positive)

GLOBAL_*

  • Load or store or atomic via the FLAT instruction form
  • address = base + offset + imm12
  • base : 64-bit SGPR pair
  • offset : 32-bit VGPR providing unsigned byte offset
  • imm12 : signed byte offset
  • address = base + imm12
  • base : 32-bit | 64-bit VGPR
  • Does support {SLC,DLC,GLC} cache control bits

Using the Radeon GPU Analyzer from 2024/09/26

  • Apparently AMD deprecated pre-RDNA HW? Already?
  • So cannot check the output for my Vega based APU using this tool!

Using this program below that tests

  • SMEM getting 8 DWORD loads (can it do large accesses)
  • SMEM using 'base + offset + imm21' (can it use all components)
  • VMEM using 4 DWORD stores
  • VMEM using 'base + offset + imm12' (can it use all components)
  • VMEM using 'coherent' GLC=1 cache control bits

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_EXT_shader_explicit_arithmetic_types_int64:require

#extension GL_EXT_buffer_reference : require

#define I1 uint32_t

#define I4 u32vec4

#define L1 uint64_t

#define L4 u64vec4

layout(buffer_reference,std430,buffer_reference_align=32)readonly buffer BufR_L4{L4 v;};

layout(buffer_reference,std430,buffer_reference_align=32)writeonly coherent buffer BufW_L4{L4 v;};

layout(push_constant)uniform pc0_ {I4 pc;};

layout(local_size_x=64)in;

void main(){

I1 i=gl_LocalInvocationID.x<<5;

L1 base=pack64(pc.yz);

I1 off=pc.w;

I1 off2=pc.x;

BufR_L4 rL4=BufR_L4(base+off+0xaa0);

BufW_L4 wL4=BufW_L4(base+i+0xbb0);

wL4.v=rL4.v;}

--

RDNA1 (gfx1010) Disassembly

Bugs

  • Serious correctness bug, no GLC=1 (for coherent layout)
  • Serious perf bug, it's using [vgpr64] addressing and doing the 64-bit address math with VALU ops instead of using the SGPR base + VGPR offset form

--

0x000000    s_load_dwordx8          s[4:11], s[2:3], s4 offset:0xaa0    1,8     F40C0101 08000AA0

0x000008    v_lshlrev_b32_e32       v0, 5, v0                           1,8     34000085

0x00000C    v_add_co_u32            v0, s0, s2, v0                      1,8     D70F0000 00020002

0x000014    v_add_co_ci_u32_e64     v1, s0, s3, 0, s0                   2,8     D5280001 00010003

0x00001C    v_add_co_u32            v8, vcc_lo, 0x800, v0               3,8     D70F6A08 000200FF 00000800

0x000028    v_add_co_ci_u32_e32     v9, vcc_lo, 0, v1, vcc_lo           3,8     50120280

0x00002C    s_waitcnt               lgkmcnt(0)                          2,8     BF8CC07F

0x000030    v_mov_b32_e32           v6, s6                              3,8     7E0C0206

0x000034    v_mov_b32_e32           v7, s7                              4,8     7E0E0207

0x000038    v_mov_b32_e32           v4, s4                              5,8     7E080204

0x00003C    v_mov_b32_e32           v5, s5                              6,8     7E0A0205

0x000040    v_mov_b32_e32           v2, s10                             7,8     7E04020A

0x000044    v_mov_b32_e32           v3, s11                             8,8     7E06020B

0x000048    v_mov_b32_e32           v0, s8                              9,8     7E000208

0x00004C    v_mov_b32_e32           v1, s9                              10,8    7E020209

0x000050    global_store_dwordx4    v[8:9], v[4:7], off offset:944      10,8    DC7883B0 007D0408

0x000058    global_store_dwordx4    v[8:9], v[0:3], off offset:960      6,8     DC7883C0 007D0008

0x000060    s_endpgm                                                    0,8     BF810000

--

RDNA3 (gfx1100) Disassembly

Getting right

  • SMEM getting 8 DWORD loads (can it do large accesses)
  • SMEM using 'base + offset + imm21' (can it use all components)
  • VMEM using 4 DWORD stores
  • VMEM using 'base + offset + imm12' (can it use all components)

Bugs

  • Serious correctness bug, no GLC=1 (for coherent layout)

--

0x000000    s_load_b256          s[4:11], s[2:3], s4 offset:0xaa0    1,16    F40C0101 08000AA0

0x000008    v_lshlrev_b32_e32    v0, 5, v0                           1,16    30000085

0x00000C    s_delay_alu          instid0(VALU_DEP_1)                 1,16    BF870001

0x000010    v_and_b32_e32        v8, 0x7fe0, v0                      2,16    361000FF 00007FE0

0x000018    s_waitcnt            lgkmcnt(0)                          1,16    BF89FC07

0x00001C    v_mov_b32_e32        v6, s6                              2,16    7E0C0206

0x000020    v_mov_b32_e32        v7, s7                              3,16    7E0E0207

0x000024    v_mov_b32_e32        v4, s4                              4,16    7E080204

0x000028    v_mov_b32_e32        v5, s5                              5,16    7E0A0205

0x00002C    v_mov_b32_e32        v2, s10                             6,16    7E04020A

0x000030    v_mov_b32_e32        v3, s11                             7,16    7E06020B

0x000034    v_mov_b32_e32        v0, s8                              8,16    7E000208

0x000038    v_mov_b32_e32        v1, s9                              9,16    7E020209

0x00003C    s_clause             0x1                                 9,16    BF850001

0x000040    global_store_b128    v8, v[4:7], s[2:3] offset:2992      9,16    DC760BB0 00020408

0x000048    global_store_b128    v8, v[0:3], s[2:3] offset:3008      5,16    DC760BC0 00020008

0x000050    s_nop                0                                   0,16    BF800000

0x000054    s_sendmsg            sendmsg(MSG_DEALLOC_VGPRS)          0,16    BFB60003

0x000058    s_endpgm                                                 0,16    BFB00000

--

But With Closer Inspection Even the RDNA3 Code Gen is Quite Bad

Another simple case, but with some packed 16-bit maths

  • Compiler doesn't seem to be able to do basic register allocation right
  • Notice the extra V_LSHRREV_B32_E32
  • Compiler then does 4 global stores instead of 1 global store because it put the packed stuff in non-aligned registers

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int16:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_EXT_shader_explicit_arithmetic_types_int64:require

#extension GL_EXT_buffer_reference : require

#define W2 u16vec2

#define W4 u16vec4

#define I1 uint32_t

#define I4 u32vec4

#define L1 uint64_t

#define L4 u64vec4

layout(buffer_reference,std430,buffer_reference_align=8)readonly buffer BufR_W4{W4 v;};

layout(buffer_reference,std430,buffer_reference_align=8)writeonly coherent buffer BufW_W4{W4 v;};

layout(push_constant)uniform pc0_ {I4 pc;};

layout(local_size_x=64)in;

void main(){

I1 i=gl_LocalInvocationID.x<<5;

L1 base=pack64(pc.yz);

I1 off=pc.w;

I1 off2=pc.x;

BufR_W4 rW4=BufR_W4(base+off+0xaa0);

BufW_W4 wW4=BufW_W4(base+i+0xbb0);

W4 ww=rW4.v;

ww.xy=ww.xy*ww.zw+W2(12,24);

wW4.v=ww;}

--

0x000000    s_load_b64           s[0:1], s[2:3], s4 offset:0xaa0

0x000008    v_mov_b32_e32        v1, 0x18000c

0x000010    v_lshlrev_b32_e32    v0, 5, v0

0x000014    s_delay_alu          instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_3)

0x000018    v_and_b32_e32        v0, 0x7fe0, v0

0x000020    s_waitcnt            lgkmcnt(0)

0x000024    v_pk_mad_u16         v1, s0, s1, v1

0x00002C    s_lshr_b32           s0, s1, 16

0x000030    v_mov_b32_e32        v3, s1

0x000034    v_mov_b32_e32        v4, s0

0x000038    s_delay_alu          instid0(VALU_DEP_3)

0x00003C    v_lshrrev_b32_e32    v2, 16, v1

0x000040    s_clause             0x3

0x000044    global_store_b16     v0, v1, s[2:3] offset:2992

0x00004C    global_store_b16     v0, v2, s[2:3] offset:2994

0x000054    global_store_b16     v0, v3, s[2:3] offset:2996

0x00005C    global_store_b16     v0, v4, s[2:3] offset:2998

0x000064    s_nop                0

0x000068    s_sendmsg            sendmsg(MSG_DEALLOC_VGPRS)

0x00006C    s_endpgm            

--

Looks like in this case it is possible to work around by re-packing to a uvec2 before the store

  • Which implies the compiler just isn't putting the right constraints on things

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int16:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_EXT_shader_explicit_arithmetic_types_int64:require

#extension GL_EXT_buffer_reference : require

#define W2 u16vec2

#define W4 u16vec4

#define I1 uint32_t

#define I2 u32vec2

#define I4 u32vec4

#define L1 uint64_t

#define L4 u64vec4

#define I1_W2(a) packUint2x16(a)

I2 I2_W4(W4 a){I2 r;r.x=I1_W2(a.xy);r.y=I1_W2(a.zw);return r;}

layout(buffer_reference,std430,buffer_reference_align=8)readonly buffer BufR_W4{W4 v;};

layout(buffer_reference,std430,buffer_reference_align=8)writeonly coherent buffer BufW_W4{W4 v;};

layout(buffer_reference,std430,buffer_reference_align=8)writeonly coherent buffer BufW_I2{I2 v;};

layout(push_constant)uniform pc0_ {I4 pc;};

layout(local_size_x=64)in;

void main(){

I1 i=gl_LocalInvocationID.x<<5;

L1 base=pack64(pc.yz);

I1 off=pc.w;

I1 off2=pc.x;

BufR_W4 rW4=BufR_W4(base+off+0xaa0);

BufW_I2 wI2=BufW_I2(base+i+0xbb0);

W4 ww=rW4.v;

ww.xy=ww.xy*ww.zw+W2(12,24);

wI2.v=I2_W4(ww);}

--

0x000000    s_load_b64           s[0:1], s[2:3], s4 offset:0xaa0

0x000008    v_lshlrev_b32_e32    v0, 5, v0

0x00000C    v_mov_b32_e32        v1, 0x18000c

0x000014    s_delay_alu          instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)

0x000018    v_and_b32_e32        v2, 0x7fe0, v0

0x000020    s_waitcnt            lgkmcnt(0)

0x000024    v_pk_mad_u16         v0, s0, s1, v1

0x00002C    v_mov_b32_e32        v1, s1

0x000030    global_store_b64     v2, v[0:1], s[2:3] offset:2992

0x000038    s_nop                0

0x00003C    s_sendmsg            sendmsg(MSG_DEALLOC_VGPRS)

0x000040    s_endpgm            

--

Loading or Unpacking Descriptors Inside a Branch

  • Example below
  • If you predicate a store or atomic to one lane
  • AMD's driver will load/build the descriptor right before the operation
  • When it's on the latency critical path
  • Instead of doing the loading/building earlier, off the latency critical path

--

  s_cbranch_execz  label_0012                           // 000000000020: BF880009

  s_and_b32     s2, s6, 0x0000ffff                      // 000000000024: 8602FF06 0000FFFF

  s_mov_b32     s4, s5                                  // 00000000002C: BE840005

  s_mov_b32     s5, s2                                  // 000000000030: BE850002

  s_movk_i32    s6, 0xffff                              // 000000000034: B006FFFF

  s_mov_b32     s7, 0x00024fac                          // 000000000038: BE8700FF 00024FAC

  buffer_store_dword  v1, v0, s[4:7], 0 offen offset:1024 glc // 000000000040: E0705400 80010100

label_0012:

--
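For reference, a minimal GLSL sketch (bindings, names, and the stored value are assumptions, not the exact source behind the listing above) of a store predicated to one lane, which is the pattern that gets the descriptor built inside the branch:

--

#version 460
layout(set=0,binding=0,std430)buffer ssboW_ {uint ssboW[];};
layout(local_size_x=64)in;
void main(){
uint v=gl_LocalInvocationID.x+1;
// Predicating the store to one lane; the driver builds the buffer descriptor
// inside this branch, on the latency critical path, instead of before it
if(gl_LocalInvocationID.x==0)ssboW[256]=v;}

--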

Predicating to One Lane is Slow

Desired path

  • Explicit setting of EXEC to just lane 0
  • Using 'if(subgroupInverseBallot(uvec4(1,0,0,0)))'
  • Code produced below (highly ugly)

--

  v_mbcnt_lo_u32_b32  v1, -1, 0                         // 000000000000: D28C0001 000100C1

  v_mbcnt_hi_u32_b32  v1, -1, v1                        // 000000000008: D28D0001 000202C1

  v_lshlrev_b64  v[1:2], v1, 1                          // 000000000010: D28F0001 00010301

  v_and_b32     v1, 1, v1                               // 000000000018: 26020281

  s_mov_b64     s[0:1], exec                            // 00000000001C: BE80017E

  v_cmpx_ne_u32  s[2:3], v1, 0                          // 000000000020: D0DD0002 00010101

  ...

  s_cbranch_execz  label_0019                           // 00000000003C: BF880009

--

Workaround?

  • The workaround path is 'if(gl_LocalInvocationID.x==0)'
  • This is not really desirable as a workaround, because it would be far better to just explicitly set EXEC without doing the V_CMPX op and checking the invocation ID in a VGPR
  • Code produced below
  • Note it forces a compare, and will do a full branch instead of just masking the EXEC
  • In theory the compiler knows this is a one wave workgroup 'layout(local_size_x=64)'
  • It could in theory pattern match that to EXEC manipulation

--

  s_mov_b64     s[0:1], exec                            // 000000000000: BE80017E

  v_cmpx_eq_i32  s[2:3], v0, 0                          // 000000000004: D0D20002 00010100

  ...

  s_cbranch_execz  label_0012                           // 000000000020: BF880009

--

Also bad

  • Using 'if(subgroupElect())'
  • The driver knows this is the beginning of a wave sized workgroup
  • It could in theory have just SAVEEXECed this to just lane 0 (EXEC=1)
  • Instead it looks for the first active lane and does the masked bit count mess too

--

  s_ff1_i32_b64  s0, exec                               // 000000000000: BE80117E

  v_mbcnt_lo_u32_b32  v1, -1, 0                         // 000000000004: D28C0001 000100C1

  v_mbcnt_hi_u32_b32  v1, -1, v1                        // 00000000000C: D28D0001 000202C1

  s_mov_b64     s[2:3], exec                            // 000000000014: BE82017E

  v_cmpx_eq_i32  s[0:1], s0, v1                         // 000000000018: 7DA402F9 06868000

  ...

  s_cbranch_execz  label_0017                           // 000000000034: BF880009

--

SSBO Won't Use the Free Addressing for SGPR Offset

Notes

  • So the optimizer does pick up small offsets
  • Large offsets of elements in an SSBO get a VALU add instead of using the free SGPR offset
  • And each instance of that wastes VGPR space and VALU work

Example shader

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

#define I4 u32vec4

struct HugeSSBO{

I4 takeUpSpace[1024*1024*64];

I1 pastImm[1024];

I1 pastImm2[1024];};

layout(set=0,binding=0,std430)buffer ssbo {HugeSSBO huge;};

layout(push_constant)uniform pc0_ {I4 pc;};

layout(local_size_x=64)in;

void main(){

I1 i=gl_LocalInvocationID.x<<5;

I1 d=huge.pastImm[i];

d+=0xdead;

huge.pastImm2[i+0x1234]=d;}

--

And the disassembly on RDNA2

--

0x000000    s_getpc_b64           s[2:3]

0x000004    s_mov_b32             s0, s1

0x000008    s_mov_b32             s1, s3

0x00000C    v_lshlrev_b32_e32     v0, 7, v0

0x000010    s_load_dwordx4        s[0:3], s[0:1], null

0x000018    v_add_nc_u32_e32      v1, 2.0, v0

0x00001C    v_add_nc_u32_e32      v0, 0x40005000, v0

0x000024    s_waitcnt             lgkmcnt(0)

0x000028    buffer_load_dword     v1, v1, s[0:3], 0 offen

0x000030    s_waitcnt             vmcnt(0)

0x000034    v_add_nc_u32_e32      v1, 0xdead, v1

0x00003C    buffer_store_dword    v1, v0, s[0:3], 0 offen offset:2256

0x000044    s_endpgm              

--

Texel Buffer Won't Use the Free Addressing

Notes

  • AMD's implementation of Texel Buffer uses BUFFER instructions
  • And while the compiler knows the index stride in bytes due to the layout
  • It refuses to optimize cases where it could factor out immediates into either
  • SGPR offset
  • or IMM offset
  • Instead it uses extra expensive VALU ops

Example shader

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

#define I4 u32vec4

layout(set=0,binding=1,r32ui)uniform uimageBuffer img[2];

layout(push_constant)uniform pc0_ {I4 pc;};

layout(local_size_x=64)in;

void main(){

I1 i=gl_LocalInvocationID.x<<5;

I1 d=imageLoad(img[0],int(i)).x;

d+=0xdead;

imageStore(img[0],int(i+0xdead00),uvec4(d));

imageStore(img[0],int(i+0x78),uvec4(d));}

--

And the disassembly on RDNA2

--

0x000000    s_getpc_b64              s[2:3]

0x000004    s_mov_b32                s0, s1

0x000008    s_mov_b32                s1, s3

0x00000C    v_lshlrev_b32_e32        v0, 5, v0

0x000010    s_load_dwordx4           s[0:3], s[0:1], null

0x000018    v_add_nc_u32_e32         v2, 0xdead00, v0

0x000020    s_waitcnt                lgkmcnt(0)

0x000024    buffer_load_format_x     v1, v0, s[0:3], 0 idxen

0x00002C    v_add_nc_u32_e32         v0, 0x78, v0

0x000034    s_waitcnt                vmcnt(0)

0x000038    v_add_nc_u32_e32         v1, 0xdead, v1

0x000040    buffer_store_format_x    v1, v2, s[0:3], 0 idxen

0x000048    buffer_store_format_x    v1, v0, s[0:3], 0 idxen

0x000050    s_endpgm                

--


Work-in-Progress

Consider everything below junk until it gets moved above ...

Stable ABI For Compute

TODO

General Statement About Hints

Many of the suggestions below manifest as hints that can be placed in high-level languages like GLSL and IRs like SPIR-V with relatively low effort, even if the IHV backends don’t yet support them. IHVs could then roll in support as it fits their timelines. This would allow shaders to be authored well now so improvements could land later. Some vendors might not have HW support, and they can simply ignore many of these hints safely.

Audio From GPU

  • TODO, this is just a placeholder
  • Need to expand on a suggestion of how to implement
  • Including how one solves the problem of lower latency (GPU work scheduling,etc)
  • GPU is one of the best platforms for custom audio generation algorithms[g]
  • HDMI includes audio output
  • However there is no route on the GPU to write into an audio ring buffer
  • One has to write across the bus into CPU-side memory, then do a copy on the CPU into the audio ring buffer, then have the system copy that back to the GPU for audio out
  • Latency issues
  • GPU-to-CPU writes have no great portable forced uncached write (AMD is an exception)
  • Instead the APIs rely on last-level-cache forced write-back operations (expensive and latent)
  • So this is why today it’s hard to build low-latency GPU audio (because of the GPU->CPU->GPU expensive path)
  • For on-GPU the problem would be getting a high-priority wave or workgroup to execute when audio needs to be generated (periodically)
  • If it’s Graphics running, and no mid-triangle preemption, there would be a problem of a huge graphics job camping all the waves and blocking low-latency
  • If it’s Compute running, there is kernel preemption, so it is possible to get work on the GPUs (not that this is the best option)
  • If the game is using persistent waves and managing its own task cut-up, in theory it could respond by running audio generation tasks when needed
  • CPU audio generation has effectively the same kinds of actual latency issues in task execution, like fetching from audio samples is likely to be from a cold cache at the start of the task (assuming a fully loaded machine)
  • Side issue, a lot of modern audio is decompressed streams
  • GPUs do have scalar integer ALUs that could do decompression as well
  • There are options for vector decompression, but who needs 64 parallel decode streams
  • So one would need to parallelize across samples, not streams

Cache Control - Instruction Level / Etc

  • Getting Coherence In LLC (Last Level Cache)
  • TODO: Big topic
  • Standard Write
  • Suggested implementation
  • Hit Evict on any incoherent cache level
  • Evict Normal on any coherent cache level
  • AMD PC driver does this already with GLC=1 by default on stores (even without coherent layout)
  • Easy to force: just mix a 'coherent writeonly' layout (see the sketch after this list)
  • Don't want to leave the possibility of getting stale lines in incoherent caches
  • If the hardware doesn't do eviction after the write, then a full cache flush of the incoherent level is required later for safety
  • Last Use / Hit Evict
  • Suggestion: TODO, this requires more thought ...
  • Designed for private workgroup memory (register spill, stacks, etc)
  • NVIDIA according to PTX only allows this for private workgroup memory
  • Even if NVIDIA didn't support for non-private usage, one could still provide a hint for other platforms
  • Designed for temporary data
  • Designed to avoid unnecessary write-back of lines
  • Lines might have been written only to say NV's L1$ and not L2$
  • Last Use avoids writing any data to L2$ (line state becomes undefined)
  • The big usage case is for using large L3$ to hold non-private temporary data
  • NVIDIA doesn’t document this explicitly, but assuming it is "Evict" after access
  • Likely "Hit Evict" in AMD terms
  • Very hard to leverage "Last Use" in many non-private memory multi-reader parallel problems
  • Unknown at read time state of other parallel tasks which might read a line
  • Probably better to do a ranged invalidation of the cache level after the whole task is done in that case
  • The problem of course, that almost always involves getting the driver involved, which is definitely something one wants to avoid
  • Secondary option if the memory is DCCed (compressed) is to clear it while there is a possibility that the lines are still in the cache
  • By storing zeros to a full DCC block at the same time
  • The store would invoke a meta-data only operation (lower overhead than a full write-back of non-zero data)
  • By RDNA3.5 AMD only supports Hit Evict on L1 (mid-level read-only cache)
  • It is effectively selected implicitly by one cache-bit combination
  • SLC=1 (STREAM) with GLC=0 (cache in L0)
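A minimal GLSL sketch of the 'easy to force' note above (binding and names are assumptions): mixing 'coherent writeonly' on the SSBO should get GLC=1 on the stores from AMD's compiler.

--

#version 460
// 'coherent writeonly' together should force GLC=1 on the stores (AMD)
layout(set=0,binding=0,std430)coherent writeonly buffer outW_ {uint outW[];};
layout(local_size_x=64)in;
void main(){outW[gl_GlobalInvocationID.x]=0xdeadu;}

--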

Scanout Query

  • Either ability to query when a frame starts on the GPU realtime shader clock domain
  • Or alternatively ability to query the scanout h-sync counter
  • This is another component to providing AWS (see Realtime Shader Clock Frequency Query)

VA Duplication for Shared Physical Backing for Aliasing?

  • TODO, this is just a placeholder
  • Might scrap this section, because it might not have advantages unless lower level caches are virtually tagged (which isn’t going to be good for GPU design anyway)
  • If aliased allocations have guaranteed different virtual addresses but share the same physical pages, it is possible to reduce the cache invalidation overhead to just the physically tagged caches, and all the virtual tagged caches need not be invalidated
  • But if the virtual tagged caches are all coherent, then one wouldn’t need to invalidate them
  • And this is likely the common configuration

VA Explicit Address for Buffer

  • TODO, this is just a placeholder
  • AMD kernel driver (Linux) supports fixed repeatable manual Virtual Address layout
  • It would be nice to have the ability to lay out a part of the GPU Virtual Address space explicitly, in a portable way across multiple chipsets and multiple vendors
  • This is a great tool for pre-linking pointers and doing other related optimizations
  • A secondary part of this is to be able to mix and match different buffer allocations for different usages into a common consistent VA space
  • For example, having a specific region supporting CPU-read-back, or CPU-write-through, but accessing that via a common pointer on the GPU
  • Common pointer might be a questionable ask if some vendors place some properties into a ‘descriptor’ instead of say a page table entry
  • Third example might be the classic VM enabled ring buffer
  • Repeat the same physical mapping at different VA ranges to have the VA address translation implement a ring buffer

Bus Crossing Topics

CPU-Write-GPU-Read

  • DEVICE_LOCAL+HOST_VISIBLE case
  • CPU is using write-combining stores that cross the bus and write to GPU DRAM
  • Assume the CPU writes don’t invalidate GPU cache entries
  • non-DEVICE_LOCAL
  • CPU is writing to its own cache hierarchy, and the GPU is reading across the bus
  • The bus crossing read snoops the CPU caches (PC)
  • HOST_VISIBLE+HOST_COHERENT
  • In theory this could be faster for cases where the CPU never reads and writes full cachelines[h] (TODO: validate this statement)
  • HOST_VISIBLE+HOST_COHERENT+HOST_CACHED
  • In cases where the CPU might read/modify/write a line, or doesn’t write full lines, you don’t want the reads to go uncached on the CPU, so use this
  • Once a GPU read happens later reads can hit in the GPU cache and bypass re-reading across the bus
  • Always write full cachelines from the CPU (important for write-combining)
  • Have the GPU only read a CPU-written cacheline once per frame
  • Do not try to GPU write to any line that could be CPU-written
  • Make a copy of the data if multi-GPU-read is needed
  • This avoids the possibility of getting different data versions if the line is lost in the GPU cache
  • Note there is the possibility of getting a partial CPU-written cacheline if CPU wrote after submit
  • There is no guarantee of cacheline granularity atomic stores
  • One mitigation is to include a hash of the data in the cacheline
  • If the hash doesn’t agree with the contents, then toss out the data packet
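A minimal GLSL sketch of the hash mitigation above (packet layout, bindings, and the hash function are all assumptions): read each CPU-written cacheline once, validate the embedded hash, and keep a GPU-side copy for any later re-reads.

--

#version 460
#extension GL_EXT_shader_explicit_arithmetic_types_int32:require
#define I1 uint32_t
// One packet per 64-byte cacheline: 15 payload words plus an embedded hash
struct Packet{I1 data[15];I1 hash;};
layout(set=0,binding=0,std430)readonly buffer cpuRing_ {Packet cpuRing[];};
layout(set=0,binding=1,std430)writeonly buffer gpuCopy_ {Packet gpuCopy[];};
layout(local_size_x=64)in;
I1 HashPacket(Packet p){I1 h=0x811c9dc5u;
 for(int i=0;i<15;i++)h=(h^p.data[i])*0x01000193u; // FNV-1a style mix (an assumption)
 return h;}
void main(){I1 i=gl_GlobalInvocationID.x;
 Packet p=cpuRing[i];                   // read the CPU-written line once
 if(HashPacket(p)==p.hash)gpuCopy[i]=p; // mismatch means a partial write, toss the packet
}

--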

CPU-Write-GPU-Read - GPU-Polling?

  • AMD provides DEVICE_UNCACHED_BIT_AMD for memory type which allows reads to bypass the GPU cache domain (likely in HW this is a page table bit)
  • This support extends back to at least Vega (so anything with packed 16-bit double rate supports it)
  • NVIDIA does not provide such support in a memory type in Vulkan (2024)
  • This now allows the GPU to poll the same line multiple times per frame to see if the CPU is finished with something
  • Without this, you’d need to poll different lines each poll read
  • TODO: Would be nice if there was a way to make GPU polling portable
  • Challenges of portable HW support given all the HW design possibilities (page table, vs in-descriptor, vs cache hints on opcode) … maybe being over-complete would solve that?
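A minimal GLSL sketch of GPU-side polling (this assumes the SSBO is bound to DEVICE_UNCACHED_BIT_AMD memory, and the flag protocol is made up):

--

#version 460
// 'volatile coherent' keeps the load inside the loop instead of being hoisted
layout(set=0,binding=0,std430)volatile coherent readonly buffer cpuFlag_ {uint cpuFlag;};
layout(local_size_x=64)in;
void main(){
 uint start=cpuFlag;
 // Poll the same line until the CPU bumps the counter; without the uncached
 // memory type this could keep hitting a stale line in the GPU cache
 while(cpuFlag==start){}
 // ... consume whatever the CPU published ...
}

--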

GPU-Write-CPU-Read

  • Starting with a basic rule, don’t share cacheline usage between read and write
  • GPU writes the full cacheline (no read) and CPU only reads a full cacheline (no write)
  • TODO: Note, I’m mostly talking about dGPU cases, clearly iGPU can be different
  • This is focusing on worst case …
  • DEVICE_LOCAL?
  • This typically implies uncached reads from the CPU, so don’t use for CPU read
  • There are typically two CPU-side memory types
  • HOST_VISIBLE+HOST_COHERENT
  • This typically implies uncached reads from the CPU, so don’t use for CPU read cases
  • HOST_VISIBLE+HOST_COHERENT+HOST_CACHED
  • CPU reads are cached (fast if they hit)
  • Without some form of cache control there is no guarantee on the timing a GPU write becomes CPU visible mid-frame
  • GPU cache can soak up writes, and wait until some write-back before kicking the lines across the bus
  • Typically one would need a pipeline barrier with VK_PIPELINE_STAGE_HOST_BIT to force a write-back and that could be quite costly
  • One mitigation plan for vendors without cache control might be to just run a GPU workload that blows through the cache, but if it’s a big L3 that could be more challenging
  • TODO: How to try to force a bus-write-through without brutal cost?
  • TODO: Is there a way to pipeline HOST_BIT write-back on both vendors?
  • The typical concern: if a bus-crossing write-back is serialized, the cost is limited by bus speed
  • TODO: Can one use ‘volatile’ qualifiers on any vendor to force write-through the bus?

GPU-Write-CPU-Read - Uncached AMD

  • AMD provides DEVICE_UNCACHED_BIT_AMD for memory type
  • This provides a way to force a write to cross the bus and bypass GPU caches
  • TODO: Starting topics which need to be discussed
  • Ability to saturate and stall on bus-crossing writes?

TODO SECTION[i]

Collecting notes from others to comment on when time allows …

John Brooks List and Related Comments

Reducing over time to what isn't commented on ...

1) Direct ptrs to LDS sram

2) Lock cache ways for manual caching

3) > 64KB sram (PS3 SPU had 256KB in 2005)

4) CPU-style stack

5) Function ptrs

6) Linking & libraries

7) Ability to write native assembly

8) Ability to control wave priority & sleep/wake

GPU compilers need to evolve into ptr/function/library paradigm instead of monolithic all-in-one codegen

I also bring data into LDS sram and do loads/stores into LDS so I can NOP lanes/insns by using LDS ptrs >64KB (ie null store) to avoid branches around code blocks that contain stores.”

Pointers

Repurposing memory (buffers + pointers + different types)


More Work in Progress

RDNA4 Notes

ISA Doc: https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna4-instruction-set-architecture.pdf

Workgroup Barriers

RDNA4 Reference

  • Workgroup has 64 barriers

NVIDIA PTX Reference

WMMA Sub-Topic

If we look at RDNA 3.5, the hardware is well documented in AMD's ISA Guide and dead easy to understand: just 16x16 element matrices

https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna35_instruction_set_architecture.pdf

But it has some serious pain points for traditional shader programming

  • Each input {A,B} matrix just sits in lanes {0 to 15}
  • But the other lanes {16-63} need to duplicate the data (wave64 shader)
  • So at one point the algorithm might run wave64, but WMMAs require reducing to wave16 logic with duplication in the rest
  • And then the result/accumulation matrix gets mapped across wave64 with a mix of rows in a given VGPR

There is a limit to the practical size of the matrix based on how many VGPRs one wants to use for a column, 16-tall is 8 VGPRs of packed 16-bit data for example

If one wanted to get serious about using WMMAs

  • It would be massively better to go back to SIMD16 physical hardware!
  • And have explicit sub-vector(wave16)/full-vector(wave64) execution
  • This way you can efficiently work in wave16 traditional shader logic
  • And also get the extreme register file size/lane that you'd want
  • For instance a mixed WMMA/shader algorithm could just run in wave16!

Bandwidth Pain Point

One way to ease the bandwidth pain point: keep matrix A streaming through data while matrix B is held constant

  • But note you compromise your working space by holding matrices in VGPRs
  • In some respects: it's moving part of a program into VGPRs and using register file as I$

Hopefully it's obvious now that persistent waves were the answer all along

  • The same group that didn't or still doesn't support explicit persistent waves
  • Is now relying on them wholesale for efficient computation in the ML gold rush (irony)
  • Do the rest of us a favor and make the general persistent wave toolset available
  • Query how many workgroups fit on the GPU given a PSO
  • Bonus: all the cool hardware changes that could be done (separate topic)

Sparsity

  • Better to think about things this way
  • Sparse matrix encoding is a form of compression for the matrix that is the baked "program"
  • That reduces the number of FMAs ultimately
  • The sparse matrix encoding is effectively embedded opcode bits
  • It's just MUXing operands
  • Same as the routing network in a systolic array
  • In the program held in the VGPRs (the matrix)
  • Not really something one could leverage for the dynamic data
  • TODO: Opens up the side question
  • Would there ever be utility in general compression of VGPR data beyond simple stuff like packing smaller types?
  • And the other one, what about embedding more "control" data in VGPR data
  • For more efficient execution
  • Aka VGPR data drives an operand MUX (just like the sparse matrix stuff does)
  • Part that a systolic array gets right and a designed-for-sparse-WMMA gets wrong
  • You can make network routing (aka the MUX) conditional on the resulting sign
  • It's too complex to describe what that enables here ...

Flattening Networks

  • The --ONLY-- "way it was meant to be played"
  • Example
  • Network is say NxN sized but needs M*NxM*N context flattened
  • One way to look at this is the filter kernel spatial window
  • Thus one ends up doing M*M duplication of work to avoid the dreaded DRAM round trip
  • This is why TOPs values are "inflated" and NN "efficiency" is complex
  • The bias towards using temporal networks becomes logical for realtime
  • Moving spatial to temporal keeps the M factor down!
  • The dreaded DRAM round trip
  • When networks reduce spatial dims they tend to expand in vector size
  • So round tripping that even after a "spatial reduction" step is painful

Ultimately bandwidth scaling is a dead-end, one cannot afford mass non-local speculative computation!

Simplified View of WMMA Network Logic

  • It's an extreme form of running both sides of a branch
  • Where the matrix getting reuse is part of the "program"
  • Aka a sparse matrix is simply MUX control for operands
  • One is running all filter kernel answers simultaneously
  • As well as running filter kernel logic to test the result match
  • The non-WMMA non-linear logics act as the selection in a way
  • This is one of the big components of TOP "inflation"
  • And can only do this kind of huge speculative computation if the data is local
  • Hierarchical reduction part of the network is simply factoring of common terms in this process
  • Building a hierarchy of terms or vocabulary to describe a domain

Another way of thinking about this

  • Have a collection of vectors of input in the columns of A
  • And a collection of patterns in the rows of B
  • Resulting dot product of those vectors is the weight of the match
  • Byproduct of doing a MMA
  • Testing all groups of inputs against all patterns
  • Concrete simple example but with tiny 4x4 matrices (sketched after this list)
  • Could have 2x2 texels of luma unwrapped into a 4x1 column
  • Then 2x2 patterns to test against unwrapped into a 4x1 row
  • Like say {-1,-1,1,1} {-1,1,-1,1} {-1,1,1,-1} for {horz,vert,diagonal}
  • Looking at this at 16x16 sizing
  • A lane gets a column
  • Thus a lane likely has a 1:N mapping
  • If say a pixel has 4 attributes, you'd have N=4 to get 16 values
  • This is in sharp contrast to say pixel shading which is 1 lane to 1 pixel
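A minimal GLSL sketch of the tiny 4x4 example above (names are made up); each dot product is the per-{column,row} match weight a small MMA would produce:

--

// Unwrap a 2x2 luma block into a 4x1 column and test it against the three
// unwrapped pattern rows; each dot product is the weight of that match
vec3 MatchPatterns(float p00,float p01,float p10,float p11){
 vec4 c=vec4(p00,p01,p10,p11);          // the 2x2 texels as a column of A
 float horz=dot(c,vec4(-1,-1, 1, 1));   // bottom row minus top row
 float vert=dot(c,vec4(-1, 1,-1, 1));   // right column minus left column
 float diag=dot(c,vec4(-1, 1, 1,-1));   // one diagonal minus the other
 return vec3(horz,vert,diag);}

--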

Choosing a network design is forcing a fixed filter flow (the program structure), and training is figuring out a good set of patterns (the language of the network, the program matrix)

Problems That Don't Map Well To Networks

Simple Neural Networks are effectively a form of data compression

  • It's a lookup like a texture fetch, but computed by a network filter
  • The input vector is the coordinates of the lookup
  • The output vector is the interpolated result
  • The interpolation is multi-stage hierarchical and highly complex
  • When the network becomes stateful (recurrent) it functions like a temporal filter

Some problems can never map efficiently to even complex neural networks

Example, motion reprojection sampling

  • The input to the network would be the {entire image, and the fetch coordinate}
  • Obviously never going to fly, one would just sample that data
  • So anything that maps well to a lookup in a large data structure that cannot be compressed

How about reprojection filtering after sampling?

  • Input to the network would be {the sampling data for the filter window, the sub-pixel offset}
  • One of the better analytical solutions involves
  • Computing filter coefficients from the sub-pixel offset
  • Taking the {min,max} of the inner 2x2 pixels of the window
  • Doing the filter kernel (weighted average, maps well to dot products)
  • But then a highly non-linear clamp of the result by that {min,max}
  • This is something a neural network is CRAP at
  • Good filtering options involve highly non-differentiable functions
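A minimal 1D GLSL sketch of that analytical path (the real filter is 2D over a window, and the Catmull-Rom weights and names here are assumptions, not the author's exact kernel): the weights come from the sub-pixel offset, the weighted sum maps to dot products, and the clamp by the inner taps is the highly non-linear step a network handles poorly.

--

vec3 ReprojectTap1D(vec3 t0,vec3 t1,vec3 t2,vec3 t3,float f){
 // Catmull-Rom weights computed from the sub-pixel offset f in [0,1)
 float f2=f*f,f3=f2*f;
 float w0=0.5*(-f+2.0*f2-f3);
 float w1=0.5*( 2.0-5.0*f2+3.0*f3);
 float w2=0.5*( f+4.0*f2-3.0*f3);
 float w3=0.5*(-f2+f3);
 vec3 blended=t0*w0+t1*w1+t2*w2+t3*w3;  // the part that maps well to dot products
 // Negative lobes overshoot, so clamp by the inner taps: non-differentiable
 return clamp(blended,min(t1,t2),max(t1,t2));}

--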

Problems that map to extremely sparse networks are obviously not efficient

TODO ...

 

[a]Conditional branches consume values and we can already label them

SPIR-V has the Subgroup and especially SubgroupId annotations that can be applied to any _value_, and even allow being more specific about the scope of uniformity (subgroup, workgroup, etc)

AFAIK these are not emitted by any compiler for you, but you can use intrinsics to label values, in fact it's a requirement for non-uniform descriptor indexing.

Better HLL compilers can and maybe one day will emit it for everything (ours has the ability to track that information using local type analysis but doesn't emit it for now)

Unsure if any device compilers actually leverage these, but they totally could do so today

[b]It would be cute in a similar way to be able to hint about the coherence of memory accesses (expected cache misses). This could be something that is added after PGO.


[c]This is... an IL... like today. Realistically, it seems that the GPUs ISAs are still radically evolving every generation or two, seems unwise to cap that innovation only to allow people to play with low-level ASM. Maybe one day when GPUs are totally "boring" that would be the road - but I think we'll observe it in practice (less changes in ISAs) - not impose it as a standard for reasons. Standards are the death of innovation.


[d]And even with cpus nowadays it's still a time sink, you'd have to write x64 and arm64 versions and if you care about 32-bit x86 and arm versions of the code (maybe RISC-V in the future). Rather than just writing the code once. On GPUs on desktop this would mean you have to do it for AMD, Intel, NV, QCOM and then foreach architecture... Very impractical. On mobile likely no one would get it right either :)

EDIT: Ah, maybe like embedding SPIR-V, but even that'd not be so useful since it's only intermediate. And you'd still need a separate one for DXIL and/or MSIL.

[e]Why HOST_CACHED? HOST_CACHED memory is the equivalent of DX12 READBACK heap, while HOST_VISIBLE without HOST_CACHED is the UPLOAD heap - still system RAM but uncached and write-combined, good for CPU writes and GPU reads.

[f]For viewing AMD ISA, godbolt.org recently added support for HLSL+RGA

[g]Isn't it the case that audio requires very stringent latency requirements, whilst GPUs are great at throughput but are terrible at latency?


[h]This article may be useful: https://gpuopen.com/learn/get-the-most-out-of-smart-access-memory/ We did lots of experiments, we had tons of data. Unfortunately we had to water down the statements for this article to be so generic - no absolute statements, no numbers. You know how it is...

Anyway, the difference between this and DEVICE_LOCAL is really the question when do you want to cross the PCIe bus - when writing from the CPU (non-DEVICE_LOCAL) or when reading on the GPU (DEVICE_LOCAL).

About the perf of CPU writes, when using PCIe 4.0, writes to the VRAM are not as efficient as to the system RAM but the same order of magnitude. Maybe a few times slower. Definitely not like 100x slower.

[i]Some things that I haven't seen in the list:

- Allow raster to not generate quads / provide differentials

- Allow registers to be "dropped" at a given point in the shader execution (i.e. in the same shader evaluate some preamble, decide what path in the shader is really needed, drop the registers that the given path does not need to use)