Fixing The GPU

This represents the opinions of the author and is not necessarily aligned (or not aligned) with any {prior, current, or future} employers! Also the author only cares about compute generated graphics (no fixed function and no pixel shaders), so those asks won’t be covered here!

The purpose is simple: record problems and suggested solutions for issues with GPU programming. Circa 2024, one of the larger problems, as game production costs skyrocket, is that top AAA titles need to find their way across platforms (either at launch, or later after a timed exclusive) to hit a large enough market to amortize cost. Thus it would be nice to find a way to make forward progress fixing these fundamental issues across as much of the industry as possible so that GPU shader devs can ultimately write better algorithms.

Note this is a work in progress, not even close to being finished, so best to just reference instead of grabbing a copy right now.

Change Log

20241121 - Starting to collect ideas

20241122 - Expanding WORM/etc

20241123 - Reordering to make this more useful (TLDR problems)

20241124 - Adding a section on hardware design improvements too

20241125 - Added PSO query to get workgroup count that fills empty GPU, etc

20241126 - Added CPU Push to GPU section, started inserting some AMD compiler bugs

20241127 - Started adding extra resource and doc links, like online shader compiler

20241127 - Added a section on efficient PC logic (fixed 32-bit MSB)

20241127 - Added section for portable TBUFFER support

20241128 - Added section on Stream layout qualifier (well thought through now)

20241129 - Added the section on SSBO array[index]'ing problems with disassembly

20241130 - Added section for AMD bugs on GL_EXT_buffer_reference

20241202 - Added buffer data in shaders section and mapped page cache section

20241202 - Added labels and goto support in shaders section

20241203 - Clean up and un-todo'ing intrinsics and inline asm section

20241204 - Added section on divergent texture vs hardware design

20241215 - Texel offset notes, additions and clean-up of buffers vs pointer stuff

20241216 - Accessing Memory Super Topic up

20241217 - Added CU Shared SGPR area

20241218 - Added the section on GPU Spin Loops or Lock-Free Retry Loops

20241220 - Added AtomicAdd bugs and workaround in the AMD Compiler Bugs section


Suggestions - Small Scope / Easy

These are mostly fleshed-out suggestions which would be easy to implement the front-ends for, even if IHV backend support might lag due to higher implementation time.

Branch Related Hinting

  • Ability to label a branch {if,while,etc} with explicit coherence or divergence, along with providing a probability of the conditional evaluation, and explicit whole-wave mode
  • This would enable a well written compiler to produce some optimal code paths (examples below) that compilers today cannot do
  • GLSL syntax suggestion(s)
  • [true=N] where N is a {0 to 1} value of the probability of true
  • [divergent] meaning the conditional will evaluate differently per lane
  • [coherent] meaning the conditional will evaluate the same for all lanes
  • [whole] meaning all lanes are active
  • Compilers can do some optimizations when they know all lanes are active before a branch, for example subgroupElect() could just use lane 0 explicitly instead of searching for the lowest active lane
  • [partial] meaning some lanes could be inactive
  • [whole] [true=0.5] [divergent] if(...
  • Means lanes of the wave will always take both paths
  • On AMD for example, the compiler could remove the branching and simply mask EXEC for better performance
  • [whole] [true=0.99] [coherent] if(...
  • Means the else path would almost never be taken
  • Compilers could make the if(true) path completely linear in the binary
  • Limiting the overhead of branching away to the uncommon path and branching back to the join point so it only happens when the uncommon path is taken
  • Author's usage
  • Using defines for these which map to comments since there is no implementation
  • But at least the code is setup for them if they ever arrive
  • #define JMP_LANE /* divergent */
  • #define JMP_WAVE /* coherent */
  • #define JMP_FULL /* full wave */
  • #define JMP_PART /* partial wave */
  • #define JMP_BOTH /* expect both paths taken */
  • #define JMP_HERE /* expect this path nearly always taken */
  • #define JMP_RARE /* expect this path almost never taken */
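
A minimal usage sketch of the defines above; 'needSlowPath' and the binding are made-up stand-ins, and the hints remain comments until an implementation exists.

-

 layout(local_size_x=64)in;
 layout(set=0,binding=0,std430)buffer ssboOut{uint outData[];};
 #define JMP_WAVE /* coherent */
 #define JMP_FULL /* full wave */
 #define JMP_RARE /* expect this path almost never taken */
 void main(){
  // Wave-coherent because every lane reads the same word.
  bool needSlowPath=(outData[0]==0xffffffffu);
  // Whole wave active, wave-coherent conditional, almost never true: a compiler could
  // keep the common path fully linear and only branch away for the rare case.
  JMP_FULL JMP_WAVE JMP_RARE if(needSlowPath){outData[gl_GlobalInvocationID.x]=0u;}}

-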

Buffer Data in Shaders

  • Ability to create global buffer data inside a shader
  • With ability to control alignment
  • With an include for binary data
  • Etc
  • Ability to create a 64-bit pointer to the data and use GL_EXT_buffer_reference
  • TODO: Would also want to be able to do buffer access
  • Probably could query something on the PSO
  • To get the GPU address to pass into some descriptor creation?
  • This enables GPU shaders to have advantages similar to CPU code
  • An initialized data segment for constant and/or mutable data
  • Bonus points
  • If the API guarantees the 32-bit MSBs of the 64-bit pointer to be constant
  • Such that only a 32-bit LSB needs to be passed around
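
A sketch of how the consuming side could look. GL_EXT_buffer_reference is real GLSL, but the shader-embedded data segment and the PSO address query are the proposal, so those parts appear only as comments (names here are hypothetical).

-

 #extension GL_EXT_buffer_reference : require
 layout(buffer_reference,std430)buffer LutRef{uint lut[];};
 // Today the 64-bit address must come from the app; with the proposal it could instead
 // point at a constant data segment declared (or #include'd as binary) inside this
 // shader, with its GPU address queried off the PSO.
 layout(push_constant)uniform Push{LutRef lutPtr;};
 uint ReadLut(uint i){return lutPtr.lut[i];}

-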

Cache Control - Streaming

  • Suggestion to make a new layout qualifier stream
  • Streaming on Read
  • Suggested implementation: Evict First on all Levels, Miss Evict as backup
  • Introduce a 'stream' memory qualifier for the layout
  • Can mix this with 'readonly'
  • Obvious case, only one wave on the GPU ever reads a given line inside one load
  • Can use streaming cache hint on all levels of the cache
  • easily portable
  • Note on AMD K$ lines are likely 64-bytes (not 128-bytes like L0)
  • SMEM supports reading 16 DWORDs in one op (full 64-byte line in one shot)
  • More complex, only one wave accesses the line, but across multiple separate loads
  • If loads are separated (not claused on hardware that could join the collection), then Miss Evict would provide poor performance
  • Forced Miss, then Evict (aka Miss Evict) vs Evict First
  • Likely HW wouldn't have a true bypass on loads
  • Because it would need to duplicate paths (expensive)
  • More likely HW forces a miss then evicts the content after load
  • If the line was read-only it shouldn't get write-back (empty byte write mask)
  • So not concerned if the line stays
  • Evict First in theory would be better than Miss Evict in this case
  • They both will provide the first line for the next miss in the set
  • But Evict First enables the line to get reuse possibly if there is a secondary load
  • Streaming on Write
  • Suggested implementation: Evict First on all Levels, Hit Evict as backup
  • Introduce a 'stream' memory qualifier for the layout
  • Can mix this with 'writeonly'
  • Note don't need to force Eviction on the incoherent cache levels
  • Because this is streaming, no reuse case for the given frame
  • No chance to read a stale line (it will never get read)
  • Obvious case, only one wave on the GPU ever writes a given line inside one store
  • No write-combining at other levels in the cache
  • No reuse until next frame or later (no practical way to capture multi-frame coherence, given limited cache sizes)
  • Less obvious case, separate stores need write-combining
  • Maybe Evict First helps here
  • Likely HW wouldn't have a true bypass on stores either
  • One point of caches is to collect traffic to coalesce operations to the open DRAM page (hopefully that term is right) to increase efficiency
  • Hit Evict vs Evict First
  • Hit Evict could take advantage of a partial line write after a full line load
  • Evict First though would have a larger window for any possible write-combining with no extra downsides
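
A hypothetical syntax sketch of the proposed 'stream' qualifier (nothing below exists today; binding numbers and names are made up).

-

 layout(set=0,binding=0,std430)stream readonly buffer ssboSrc{uint srcData[];};
 layout(set=0,binding=1,std430)stream writeonly buffer ssboDst{uint dstData[];};
 layout(local_size_x=64)in;
 void main(){
  // Each cacheline is touched exactly once per frame by exactly one wave, so there is
  // no reuse worth keeping at any cache level: evict-first on the read and the write.
  dstData[gl_GlobalInvocationID.x]=srcData[gl_GlobalInvocationID.x];}

-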

Device-Uncached Portability Path

  • To make this work and be portable one has to build an API that is the least common denominator (LCD) of the vendors
  • Suggesting requiring at least two things together
  • New layout qualifier device_uncached
  • This would be a NOP on AMD
  • But would trigger the .cv and .wt instruction bits on NVIDIA
  • Note if NVIDIA could implement uncached behavior via page table bits, then this new layout qualifier would not be required
  • New portable DEVICE_UNCACHED_BIT_KHR memory type allocation
  • Which, if it exists, would need to be used to create any allocation that is accessed using the new device_uncached layout qualifier
  • This is a portable version of the existing DEVICE_UNCACHED_BIT_AMD
  • Net result is that at least uncached load/store to/from system RAM would work
  • The usage for this is to provide an efficient way for CPU/GPU communication which does not rely on high overhead and high latency L2 writeback, or address range tracking cache flushes, or more complex in shader cache control instructions
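
A hypothetical sketch of the two halves of the ask used together (neither the KHR memory type nor the layout qualifier exists today; names are placeholders).

-

 // The allocation backing this binding would have been created with the proposed
 // DEVICE_UNCACHED_BIT_KHR memory type on the API side.
 layout(set=0,binding=0,std430)device_uncached volatile buffer ssboMailbox{
  uint cpuToGpu[32];  // small CPU -> GPU communication words
  uint gpuToCpu[32];};// small GPU -> CPU communication words
 void PostToCpu(uint slot,uint value){
  // NOP on AMD (the memory type covers it); maps to .cv/.wt style bits on NVIDIA.
  gpuToCpu[slot]=value;}

-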

Global Structure Unions / Register Aliasing Hints

  • The first part is simple: define a structure of variables in global scope
  • With explicit layout rules so the mapping to a set of 32-bit registers is obvious
  • For example, 64-bit values would get pairs of 32-bit registers, with .x in bank&1=0 and .y in bank&1=1, and so on
  • The second part is to add union support to the shader language so one could alias a collection of the above described structures in global scope
  • Note the initial ask here is simply union support in the shading language
  • Secondary ask: something like spirv-opt would have an option to not flatten and remove the unions/structures
  • Note that unions have important usage even outside the ideas proposed here
  • This collectively enables an easy high-level framework to express low-level register aliasing/allocation hints that could be passed from high level shading language all the way down through an IR to the IHV compiler
  • The IHV compiler could still take the construct as a hint and override the layout/allocation
  • The other option is that the IHV compiler uses the described aliasing as a starting point for register allocation, which could get compilers out of trouble cases, and also could make some big improvements on compile time
  • Note it would also be possible to use these global structure unions as ABI to provide a way to link common modules together without having all the problems of call stacks, but then one would need a more explicit layout instead of strictly allocation/aliasing hints (so perhaps out of the scope of this initial ask, BUT it could be a very good solution to the shader combinatorial explosion problem!)
  • One of the largest problems compilers have today is register allocation (due to the explosion of state in SSA form), and compile times, both things this construct could greatly help improve
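
A hypothetical syntax sketch (GLSL has no unions today): two phase-specific state blocks aliased in global scope as a register allocation hint.

-

 struct PhaseA{vec4 accum;uint count;};             // 5 x 32-bit registers live in phase A
 struct PhaseB{uvec2 sortKey;uint bucket;float w;}; // 4 x 32-bit registers live in phase B
 union SharedRegs{PhaseA a;PhaseB b;} gRegs;        // hint: A and B may share the same registers

-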

GPU-Filling Dispatch Query for PSO

  • Answers the question: how many workgroups does it take to fill the machine for a given PSO?
  • Suggested implementation: uint32_t vkPipelineFillSize(VkDevice d, VkPipeline* p)
  • Basic functionality for persistent workgroups
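
A hedged sketch (C) of how the proposed query could be used to launch just enough persistent workgroups to fill an otherwise empty GPU; vkPipelineFillSize() is the hypothetical entry point suggested above, the rest is standard Vulkan.

-

 #include <vulkan/vulkan.h>
 uint32_t vkPipelineFillSize(VkDevice d,VkPipeline* p); /* hypothetical, per the suggestion */
 void DispatchPersistent(VkDevice device,VkCommandBuffer cmd,VkPipeline pipeline){
  uint32_t fill=vkPipelineFillSize(device,&pipeline); /* workgroups that fill the GPU */
  vkCmdBindPipeline(cmd,VK_PIPELINE_BIND_POINT_COMPUTE,pipeline);
  /* Each persistent workgroup then loops pulling work until a shared queue is empty. */
  vkCmdDispatch(cmd,fill,1,1);}

-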

Labels and Goto Support In Shaders

  • With ability to get pointers to labels and build jump tables
  • Bonus points if one can guarantee high 32-bit MSB bits are constant
  • So one only has to store the lower 32-bit LSBs in the jump tables instead of 64-bits
  • While pointers to labels is something SPIR-V likely doesn't have
  • SPIR-V does have labels so the initial ask of {labels and gotos} without pointers should be easily possible without IHV support even perhaps
  • Probably good to split this ASK into two rollouts, with and without pointers
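
A hypothetical sketch of the first (no pointers) rollout; GLSL has neither labels nor goto today, so this is purely illustrative.

-

 void CountDown(uint start){
  uint state=start;
 top:
  if(state==0u)goto done; // conditional forward jump
  state=state-1u;
  goto top;               // backward jump, i.e. a loop without structured control flow
 done:
  return;}

-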

Preoptimized Pragma

  • Suggested shader language syntax: #pragma preoptimized
  • Or alternatively not a pragma via [preoptimized]
  • Turn off compiler helper paths for poorly written shaders that function as de-optimization paths for well written shaders
  • Examples of these helper paths
  • Checking if atomic operations are wave uniform or not and then transforming the atomic from multi-lane to single lane

Realtime Shader Clock Frequency Query

  • Implementation suggestion for Vulkan
  • Add a new structure that works in conjunction with vkGetPhysicalDeviceFeatures2()
  • struct {...; uint64_t frequency; } VkPhysicalDeviceShaderRealtimeClockFeatures;
  • GL_EXT_shader_realtime_clock provides ability to get a coherent wall-clock counter from the shader, without a defined clock frequency
  • This would provide a way to query that clock frequency
  • A workaround is to attempt to derive it at runtime using questionable methods (like measuring and rounding, or using a device lookup table, etc)
  • One usage case for this is a component of Adaptive Workload Scaling (AWS)
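
A sketch of the 'questionable' runtime workaround (GLSL side only, names hypothetical): sample the realtime clock from a trivial dispatch, have the CPU wall-clock two such samples spaced well apart, then the frequency is roughly deltaGpuTicks / deltaCpuSeconds.

-

 #extension GL_EXT_shader_realtime_clock : require
 layout(local_size_x=1)in;
 layout(set=0,binding=0,std430)buffer ssboClock{uvec2 gpuTick;};
 void main(){gpuTick=clockRealtime2x32EXT();} // {lo,hi} halves of the 64-bit counter

-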

TBUFFER Support - Portable Form

  • Generic loads from a buffer descriptor where the type is provided in the opcode
  • It is a foundational component of good compute code
  • NOTE definitely want BYTE-OFFSETs instead of INDEXES!
  • See "Index vs Offset - Analysis of Options"
  • Is there any way this could get portability?
  • The simple types YES: <int,uint,half,short,long,double,etc><1,2,3,4>
  • At a minimum the ask is for those
  • See a horrible example of how to emulate today in GLSL below
  • TODO: The complex types (image pixel types)?
  • In theory it is possible on architectures without TBUFFER ops
  • Have a collection of typed buffer descriptors that aliases the same memory
  • Just with completely different types
  • See lower section, that will also work

The author has been "emulating this" for simple types in Vulkan

  • This example is in here to point out that this is an area that GLSL does a very poor job on
  • It is done by using layout() aliasing
  • Using different structures for the different types
  • Also aliasing for access control {readonly, writeonly coherent, volatile, etc}
  • The H_() macro is used to set up a {structure and layout alias} for the different types
  • Note the types are also defined to be simple {F1,F2,F4,I1,I2,I4,etc}

-

 // Aliasing as multiple types, all of these arrays should be 128-byte aligned
 #define H_(t,s) struct Ram##t{\
  t zro[64/s]; /* Always zero */\
  t gfd[64/s]; /* Dispatch indirect and management */\
  t msc[64/s]; /* Misc constants */\
  t hxc[64/s]; /* Hex dump characters */\
  t hxd[(256*256)/s]; /* This is for debug, but always having it in here */\
  t lseGrp[(LSE_GRP_MAX*4)/s]; /* todo */\
  t lseRot[(65536*8)/s]; /* todo */\
  t lseRul[(65536*4)/s]; /* todo */\
  ...
 }
 // To
 H_(F1,1);H_(F2,2);H_(F4,4);H_(I1,1);H_(I2,2);H_(I4,4);
 H_(L1,2);H_(L2,4);H_(L4,8);H_(H2,1);H_(H4,2);
 #undef H_
 ...
 #define H_(t) layout(set=0,binding=2,std430)\
  readonly buffer ssboRamR##t {Ram##t ramR##t;}
 H_(F1);H_(F2);H_(F4);H_(I1);H_(I2);H_(I4);H_(L1);H_(L2);H_(L4);H_(H2);H_(H4);
 #undef H_
//------------------------------------------------------------------------------
 #define H_(t) layout(set=0,binding=2,std430)\
  writeonly coherent buffer ssboRamW##t {Ram##t ramW##t;}
 H_(F1);H_(F2);H_(F4);H_(I1);H_(I2);H_(I4);H_(L1);H_(L2);H_(L4);H_(H2);H_(H4);
 #undef H_
//------------------------------------------------------------------------------
 #define H_(t) layout(set=0,binding=2,std430)\
  volatile /* readonly */ buffer ssboRamV##t {Ram##t ramV##t;}
 H_(F1);H_(F2);H_(F4);H_(I1);H_(I2);H_(I4);H_(L1);H_(L2);H_(L4);H_(H2);H_(H4);
 #undef H_
//------------------------------------------------------------------------------
 #define H_(t) layout(set=0,binding=2,std430)\
  buffer ssboRamA##t {Ram##t ramA##t;}
 H_(F1);H_(F2);H_(F4);H_(I1);H_(I2);H_(I4);H_(L1);H_(L2);H_(L4);H_(H2);H_(H4);
 #undef H_

-

TBUFFER Emulation Round 2?

Storage Texel Buffer

  • Alias SSBO and Storage Texel Buffers [STBs]
  • Use the SSBOs for simple types
  • Use the STBs for the complex pixel types (like 10:10:10:2)
  • Have to special case these and compute SSBO offsets manually
  • 1D texels, provides image load/store/atomic access
  • Associated with a buffer resource via a buffer view
  • Format support?
  • VK_FORMAT_FEATURE_STORAGE_TEXEL_BUFFER_BIT
  • VK_FORMAT_FEATURE_STORAGE_TEXEL_BUFFER_ATOMIC_BIT
  • Don't need atomics on these complex types
  • STB format support - important complex formats
  • For AMD Vega and up, for NVIDIA Turing and up (Author's min-spec)
  • 9-bit shared 5-bit E - No AMD, Read-only NVIDIA
  • 10:11:11 - Yes AMD/NV
  • 10:10:10:2 - Yes AMD/NV
  • 8-bit/channel unorm/snorm - Yes AMD/NVIDIA
  • 16-bit/channel unorm/snorm - Yes AMD/NVIDIA
  • sRGB - No AMD, Yes NVIDIA
  • Size limits?
  • VkPhysicalDeviceLimits.maxTexelBufferElements
  • AMD (at least Vega and up) = 4 GiElements
  • NVIDIA (at least Maxwell and up) = 128 MiElements
  • NVIDIA is the limiter
  • 32-bit type = 512 MiB
  • 64-bit type = 1 GiB
  • 128-bit type = 2 GiB
  • Can alias buffer views that are smaller for the smaller types
  • Apply limits to data layout based on size but only for complex types
  • An acceptable trade off
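
A sketch of the aliasing described above (binding number and names are hypothetical): a storage texel buffer view over the same memory as the SSBO, used only for the complex pixel types such as 10:11:11.

-

 layout(set=0,binding=3,r11f_g11f_b10f)uniform readonly imageBuffer stbR11G11B10;
 vec3 LoadR11G11B10(uint index){
  // Format conversion (10:11:11 -> float) happens on the texel-buffer path; the same
  // bytes are reachable through the SSBO alias at byte offset index*4.
  return imageLoad(stbR11G11B10,int(index)).rgb;}

-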

Transition-Free Swapchain

  • Way to explicitly ask for a swapchain image allocation that requires no transition
  • Vulkan suggestion: a new flag for VkSwapchainCreateInfoKHR.flags
  • In VkSwapchainCreateFlagBitsKHR
  • VK_SWAPCHAIN_CREATE_TRANSITION_FREE
  • Which would avoid the requirement to do the image transition to VK_IMAGE_LAYOUT_PRESENT_SRC_KHR
  • Transition as in, say, for some meta-data, or perhaps a tiling mode change, etc
  • So this would force the driver to choose a compatible tiling mode for both compute imageStores and scanout
  • This way it is possible to avoid transition overhead and complexity, and also have a safe mechanism to do swap-tear if slightly over budget (write into the swap late after the image was already scheduled to be flipped), or leverage for front-buffer rendering
  • The current possibly unsafe workaround (that the author is currently using on AMD/NV)
  • Use VK_IMAGE_USAGE_STORAGE_BIT to try to get a more compatible tiling mode
  • Skip the layout transition in Vulkan (not legal, but still runs)
  • Hope that it just works
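
A sketch (C) of the creation side of the current possibly-unsafe workaround; the proposed VK_SWAPCHAIN_CREATE_TRANSITION_FREE flag is hypothetical and only appears as a comment.

-

 #include <vulkan/vulkan.h>
 VkSwapchainCreateInfoKHR swapInfo={
  .sType=VK_STRUCTURE_TYPE_SWAPCHAIN_CREATE_INFO_KHR,
  /* STORAGE usage nudges the driver toward a tiling mode compatible with compute
     imageStore; the layout transition to PRESENT_SRC is then skipped (not legal). */
  .imageUsage=VK_IMAGE_USAGE_STORAGE_BIT|VK_IMAGE_USAGE_COLOR_ATTACHMENT_BIT,
  /* The proposal above would add a VK_SWAPCHAIN_CREATE_TRANSITION_FREE flag here. */
 };

-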

Wave-Uniform Qualifier

  • GLSL/etc variable storage qualifier to denote a value is uniform across a wave (aka subgroup)
  • GLSL syntax suggestion: subgroup_uniform int x=23;
  • For AMD this would hint a variable going into an SGPR and for NVIDIA since Turing it would be a URX for a variable, or UPX for a predicate
  • Compilers have performance bugs related to not optimizing using wave-uniform registers and logic
  • This change would greatly reduce the amount of bugs in practice while also possibly increasing compile speed because less has to be inferred by the compiler
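
A hypothetical syntax sketch of the qualifier in use (it does not exist today); the point is that the loop counter, compare, and branch could then live in scalar/uniform registers without the compiler having to prove uniformity itself.

-

 layout(push_constant)uniform Push{uint passCount;};
 float Accumulate(float v){
  subgroup_uniform uint n=passCount; // hint: same value for every lane of the wave
  float sum=0.0;
  for(uint i=0u;i<n;++i)sum+=v;
  return sum;}

-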


Suggestions - Medium Scope

Binary Compute PSOs

  • Provides the ability for an ISV to either run the existing portable SPIR-V based shader, or alternatively provide platform/chipset specific binary replacements for the subset of platforms whose performance they care about for a given product lifetime
  • Define some simplified, stable, and well documented ABI for compute PSOs
  • This should be simple for compute shaders
  • One could standardize on 64-bit buffer pointers instead of descriptors
  • And even reduce this to one descriptor set if required
  • This need not take on the complexity of all the Vulkan binding craziness
  • Version the ABI
  • Query chipset compatibility list and ABI version
  • Note a vendor could choose to do a binary translation if they wanted to for future portability
  • Or just not support binaries for older chipsets, either is fine
  • Have a VK entry point to load a binary shader
  • Don’t necessarily need an IHV assembler, just need docs on required ELF parts
  • Up to the dev to choose platform lock-in or to provide a SPIR-V backup shader
  • Some markets like arcade machines or rides, etc, don’t need forward compatibility
  • Would be nice to be able to do R&D at the machine level
  • Would be nice to be able to provide both an ASM shader and the SPIR-V and file perf bugs

Binary Translation Sub-Topic

  • To Angelo’s point: a binary shader is an IL (intermediate language) that one could compile from
  • High-end emulators for more modern systems are doing exactly that
  • Binary translation is a fast way to bring up modern hardware
  • Take existing binary shaders of the prior architecture and run them through binary translation during bring-up of the new design
  • GPU ISAs are not radically changing in core functionality
  • AMD since GCN (2012) is quite stable despite the opcode encoding changing
  • The wave64 64 VGPR formula still works quite well
  • Still an SGPR + SGPR descriptor based architecture
  • The base 32-bit opcodes are very similar across time
  • The big changes (SIMD16 4-clock to SIMD32 1-clock) were effectively transparent
  • Although effective ALU latency changes
  • There are add-ons over time which could be optional extensions to a base design
  • Like 16-bit and packed 16-bit support
  • Latest RDNA rolls compile scheduling hints into the opcodes
  • But those are things that could be done in a binary translation layer
  • WMMA stuff can be looked at as an opcode extension/evolution but doesn’t change the basic core design
  • Similar to how x86 has evolved over time but still runs the base arch
  • The major changes on the GPU side are mostly opcode encoding, which is easy to handle in binary translation
  • Similar story for NV: since Kepler it has just been a light evolution
  • PTX (a virtualized register allocation assembly) easily maps across chipsets
  • One could build out some nearly-same-as-HW assembly spec, for either AMD or NV, and treat the major HW additions (like ML ops) as optional extensions, then build out different wave32 and wave64 versions of shaders and then retarget to basically all modern {AMD,NV} hardware
  • This would be at worst better than starting from SPIR-V

Instruction Intrinsics

  • Vendors rely on implicit pattern matching and that pattern matching is a constant source of a large number of performance bugs, and this has been the state of the industry since the beginning
  • If it was practically possible to actually fix those kinds of bugs, we wouldn’t still be seeing them
  • Even simple examples like getting native normalized {cos,sin} on AMD are not pattern matched, and if you try to optimize it you get 2 extra MULs instead of none
  • Some instructions have no easy way to express in ways a compiler could pattern match
  • AMD for instance has saturating integer math support that is impossible to use or express
  • Author has a decade+ long ask that all vendors expose their native instructions through a set of instruction intrinsics
  • Current practice requires a long political battle for each and every intrinsic, with lots of per-extension overheads
  • Instead simply do one extension for all intrinsics on the HW family

Suggestions - Large Scope / Hard

Inline Assembly

  • Like GCC’s inline assembly (both raw assembly, and assembly that can co-exist with higher-level language)
  • This is in many respects a better option than only having instruction intrinsics, but a much harder ask, due to the lack of a way to express this through current IRs
  • With inline assembly one can build the instruction intrinsics on the dev side
  • Example from CPU x86-land below (one of the author’s engines)
  • inline uint64_t RolL1(uint64_t v,uint32_t r){asm("rolq %1,%0":"+g"(v):"cJ"(r));return v;}
  • One example of using GPU inline assembly in an open source project that got used in a huge number of titles of the time: FXAA 3.11 has a perf fix on Xbox360 leveraging the platform's inline assembly support
  • When one mixes inline assembly with pre-processor defines, it is possible to target a large amount of hardware in an optimized way and still have platform portable fallbacks all in the same source base

Discussion Topics

16-Bit Support

  • AMD - GCN5 (Vega) and up
  • Actually before Vega, Polaris had 16-bit support, just not double rate
  • But at least one could get the VGPR savings (which is huge)
  • Unfortunately they effectively disabled it in the drivers
  • NVIDIA - Turing (aka 20 series) and up
  • As of 2023, at least 30% of the hardware on the Steam HW Survey supported 16-bit
  • Which is good enough for the author to only write 16-bit supported shaders moving forward

Accessing Memory Super Topic

Writing this with a focus on AMD + Vulkan + Compute, of course, since it's the most documented and accessible platform to do analysis on ...

Review on AMD Memory Interface Throughput

  • Highlights
  • Need buffers for full 32-bit load/store throughput
  • Either buffer or image for 128-bit load/store (same throughput)
  • RDNA
  • TODO: Missing a lot of information here ...
  • Buffers on cache hits return 32-bit/lane/clock (best case)
  • So 128-bit load hit takes 4 clocks to return (best case)
  • RDNA introduced an imageLoad path that bypasses the filtering pipeline
  • TODO: Compare latency of imageLoad vs nearest sampling vs buffer load

Buffer Compression Problem

  • Today's hardware has compression for images but not for buffers
  • If an algorithm is bandwidth limited, and requires a clear each frame, compression gets important
  • If it's just {clear,write,read} that's 3 round trips to DRAM
  • If it's {fast clear,write,read} with compression it can be a >30% savings
  • A subset of problems normally solved using buffers likely needs images for peak perf
  • And the only way to manage that is to special case it

Constant Problem

  • No image/texture access for constants (K$)
  • Can alias a 1D image with a buffer, and access using a buffer
  • Note such aliasing implies no compression in the image
  • Alternative like loading via VMEM and V_READLANE_B32'ing back to SGPRs is likely not performant, but is at least possible
  • Pre-copying from compressed image memory to uncompressed buffer memory likely negates the compression gain
  • Buffers are effectively required for per-wave granularity items

Buffer Size Limit Problem

  • Both NV and AMD have maxStorageBufferRange = 4 GiB
  • So one needs to split into multiple buffers to get beyond 4 GiB of data
  • Buffers though have the descriptor divergence problem
  • If one wants to gain unified access to >4 GiB of data the only real option is Pointers

Variable Type Access Problem?

  • Is this a real problem in practice?
  • Problem
  • If one has an index to an item
  • Then wants to access different sized types for load
  • One has to shift the index to the new associated granularity
  • So for some things 'indexing' is "worse" than simply using a 'byte offset' or 'pointer'
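
A small illustration using the layout-aliased arrays from the earlier H_() example: the same bytes read through different element sizes force the index to be rescaled per type, while the hardware underneath only ever wants a single byte offset.

-

 I4 wide  =ramRI4.hxd[itemIndex];    // 128-bit elements: index used as-is
 I1 narrow=ramRI1.hxd[itemIndex<<2]; // same starting bytes as 32-bit elements: index rescaled

-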

Summary of all the Buffer Access Points

SSBO Family

  • VK API: [V#Base + immSsboElementOffset + (index<<shift)]
  • Hardware
  • SMEM: [V#Base + sgprOffset32 + imm20]
  • VMEM: [V#Base + sgprOffset32 + imm12 + {vgprOffset} + {vgprIndex<<shift}]
  • Bugs
  • The imm12 works
  • It won't leverage sgprOffset32 for when immSsboElementOffset gets too large
  • Instead it burns VALU ops and VGPRs to add to vgprOffset
  • The '<<shift' value comes from the descriptor which the compiler doesn't know
  • Even though the driver could perhaps just pick 4-byte element size
  • So the optional {vgprIndex<<shift} is never used
  • The mismatch between API being indexed and the HW being offsetted
  • Means that the compiler always introduces extra VGPRs and VALU
  • For the '+ (index<<shift)' part
  • Note this trick won't work for 'uintArray[byteOffset>>2]'
  • Instead of NOPing the '<<2' the compiler will double the number of unnecessary operations

TEXEL BUFFER Family

  • VK API: [V#Base + (index<<shift)]
  • Hardware
  • Same implementation as SSBOs (BUFFER ops)
  • Bugs
  • The (index<<shift) works
  • However the compiler is not able to factor out compile-time immediates
  • Large immediates won't get placed in the free SGPR offset
  • Small immediates won't leverage the opcode immediate

Texture Family (Using Texture Instead of Buffer)

  • VK API matches HW API
  • imageLoad(resource, coords, lod) ... etc
  • texture(resources, coords, lod, offsets, etc)
  • gather4(resources, coords, lod, offsets, etc)
  • Hardware issues
  • AMD VEGA+ supports NSA (non-sequential addressing)
  • So coords need not come from a sequential run of registers, the hardware can instead gather them
  • Sampling offsets are limited
  • AMD {-64 to 63}
  • NV {-8 to 7}
  • Gather4 offsets are better AMD/NV {-32 to 31}
  • However using offsets means no low-latency imageLoad() bypass path
  • TODO: Generally shouldn't expect anything better than 4 lanes/clock?
  • Simply due to only building in a 4 lane/clock texture addressing unit
  • Maybe at best ivec4 imageLoad throughput might be 16 lanes/clock?
  • Compared to 32-bit per 32-lanes per clock on buffers (verified)
  • Small formats = higher throughput on cache hits via buffers
  • And even large formats might be lower throughput on hits via imageLoad
  • Anyway this is all speculation (need actual reference)

Pointer Family

  • VK API: [pointer64]
  • Hardware
  • SMEM: [sgprBase64 + sgprOffset32 + imm20]
  • VMEM: [vgprPointer64 + imm12]
  • VMEM: [sgprBase64 + vgprOffset32 + imm12]
  • Bugs
  • Serious correctness bug, layout "coherent" is completely broken
  • This is effectively a hard stop (this interface is useless)
  • RDNA2- Compiler
  • Everything goes to the VMEM: [vgprPointer64] path without the imm12
  • Everything is 64-bit integer VALU ops (pairs of v_add_co*) for ADDs
  • Too expensive to use
  • RDNA3+ Compiler
  • On the right track
  • Compiler will use VMEM: [sgprBase64 + vgprOffset32 + imm12]
  • But there are rough edges
  • For example must manually alias packed 16-bit into 32-bit, etc

Conclusions (Where Today = 20241216)

  • Pointers are off the table completely
  • Due to the "coherent" layout correctness bug
  • And will remain that way until AMD fixes the RDNA2- compiler perf problem
  • Pointers are the worst way on that compiler to do anything (64-bit integer math everywhere)
  • When pointers become usable it will be a massive rewrite to leverage
  • The VMEM immediate size is tiny 12-bit (signed)
  • For accessing multiple things from different bases, will need to construct different 64-bit SGPR pairs for those, cannot rely on the compiler doing the right thing with the mix of 2 compile time immediates {SGPR offset, and local immediate}
  • This might require hiding immediates from the compiler via actual loaded constants to fix
  • Pointers are not necessarily faster either for VMEM stuff
  • They are missing a lot of the "free" addressing '(V#Base + sgprOffset32)'
  • Probably actually better to stick with 'one-big-buffer' approach
  • If doing 'one-big-buffer' put all immediate accessed constants in the lower 2 MiB
  • These get [V#Base + imm20] addressing which is speed-of-light using SSBOs
  • SSBOs are the only speed-of-light option today (pointers could be too if AMD fixed the drivers)
  • One has to alias a STORAGE_TEXEL_BUFFER over an SSBO to emulate TBUFFER of certain types like {<S,U>NORM*, 10:10:10:2, 10:11:11, ...}
  • In some respects STORAGE_TEXEL_BUFFER could be the fast path on AMD hardware for an 'indexed' based API
  • But it certainly isn't as fast as it could be today
  • Looking at one extra VALU instruction and possibly VGPR per access
  • But this is effectively the same driver bug overhead as today's SSBOs!
  • If maxStorageBufferRange is fixed it's easy to get to 16 GiB with 32-bit/item accesses
  • Has capacity to solve the desire to keep on fast 32-bit index based 'pointers'
  • But one couldn't factor that back into a 32-bit SGPR offset
  • Would have to use the 32-bit VGPR index as the pointer
  • The compiler knows the element stride from the layout, so in theory it could optimize immediate offsets into either a 32-bit SGPR offset for big values, or the smaller 12-bit immediate
  • One possible issue is that there is no way to easily express a dynamic but wave-coherent 32-bit base offset (acting like a coherent pointer) because the interface is 100% index based
  • This is why ultimately you'd need intrinsics here to get full perf
  • The indexing is already free in current driver implementations
  • HW TBUFFER ops might not even be better overall
  • They will be better in needing no new descriptors
  • If HW takes index stride from the descriptor and not the type in the TBUFFER
  • Then one is back to needing 3 extra descriptors
  • TODO: What does the HW do?
  • There are really only 2 possible downsides
  • You'd need {32-bit,64-bit,128-bit} descriptors to emulate SSBOs
  • So burning up to 12 SGPRs for the extra three buffer descriptors
  • Note it is 3, because the regular buffer descriptor is needed for K$ access
  • If one wanted different types from the same VGPR index (acting as a divergent pointer) one would need to build a shifted version
  • SSBOs are likely always a pain point for the compiler
  • Because of the HW use of byte offsets for when descriptor type stride is not known
  • This mismatch between index API and offset HW usage is bad
  • Anything outside the immediate offset range in the opcode takes
  • V_LSHLREV_B32_E32 to convert the index into a byte offset
  • S_MOV_B32 to possibly build the sgprOffset32 when the immediate gets too big
  • For this case it's the same overhead as just using the broken TEXEL BUFFER today
  • Likewise since it's all byte offsets, it can never function outside 4GiB buffers
  • Today's fast path is
  • Alias STORAGE_TEXEL_BUFFERs for VMEM over SSBO for SMEM
  • Don't think there is value in special casing the [vgprIndex+imm12] case to SSBOs
  • It's possible to save one SALU op there today
  • But that would go away with proper driver optimizations
  • SMEM
  • Use SSBO with layout aliasing with single arrays of a given type
  • SRead_<Type>(immOffset)
  • Via '[immOffset>>N]' in macro, where N is compile-time-immediate
  • N is based on type
  • It's all compile time so interface can just directly match HW
  • Driver maps to either
  • [V#Base + sgprOffset32] and an extra S_MOV_B32 for big imms
  • Extra S_MOV_B32s are not worth worrying about
  • Given the larger goals, due to dual issue
  • [V#Base + imm20] for small imms
  • SRead_<Type>(sgprOffset, immOffset)
  • Via '[(sgprOffset>>N)+(immOffset>>N)]' in macro
  • This is another direct map to HW but with some driver overhead today
  • Driver maps that today to
  • S_AND_B32 to clear the LSBs
  • This is 100% safe to remove for 32-bit because the HW does it for free, meaning a compiler engineer could NOP it
  • For 64-bit and 128-bit and beyond, IMO it's an acceptable perf strategy to NOP, someone could put it on an extension if required, but it's an easy fix regardless
  • S_ADDK_I32 to add in the '(immOffset>>N)'
  • This is an optimization bug for small immediates
  • Driver should be using the HW immediates but doesn't
  • Note '[sgprIndex+(immOffset>>N)]' gets mapped to
  • S_LSHL_B32 to convert sgprIndex into a byte offset
  • S_ADDK_I32 to add in the '(immOffset>>N)'
  • So it's the same overhead today for something less like the HW
  • And ultimately less performance fixable
  • VMEM
  • Don't need VRead_<Type>(imm) or VRead_<Type>(sgprBase + imm)
  • Because address is scalar it should go into K$
  • Unless one wants TBUFFER types that need format conversion
  • Or one is working around AMD compiler bugs with 'readonly' and SMEM polling
  • VRead_<Type>(vgprIndex) is already on the HW fast path
  • VRead_<Type>(vgprIndex,immOffset12)
  • Via '[vgprIndex+(immOffset12>>N)]' in macro
  • Driver maps that today to
  • V_ADD_NC_U32_E32 to add the index to the immediate offset
  • This is a perf bug
  • Driver knows the buffer index byte span via the layout
  • It can compute the shift and just use the HW immediate
  • VRead_<Type>(vgprIndex,sgprOffset32)
  • Via '[vgprIndex+(sgprOffset32>>N)]' in macro
  • Driver maps that today to
  • S_LSHR_B32 to do the sgprOffset32>>N
  • This might be one instruction worse than SSBOs
  • But it's SALU so not concerned
  • Although the latency to the next op is less than ideal
  • V_ADD_NC_U32_E32 to add that to the index
  • These are both perf bugs IMO
  • Driver could NOP all this and just use the HW sgprOffset32
  • It's 100% safe for 32-bit types (HW zeros 2 LSBs anyway)
  • It's not strictly 'safe' for 64/128-bit types
  • But they could put it on an extension (allSafeAlignment)
  • VRead_<Type>(vgprIndex,sgprOffset32,immOffset12)
  • Via '[vgprIndex+(sgprOffset32>>N)+(immOffset12>>N)]' in macro
  • Driver maps that today to
  • S_LSHR_B32 to do the sgprOffset32>>N
  • Same as prior
  • V_ADD3_U32 to add all the stuff together
  • Again perf bugs IMO, driver could NOP all this and use the HW

Adaptive Workload Scaling (AWS)

  • Things like Dynamic Resolution Scaling (DRS) are reactive and thus take the first miss
  • Thus they are not good long term solutions to the problems of Quality-of-Service (QoS)
  • This is an alternative component in which passes switch to lower cost paths when the frame starts to go over budget in hopes of avoiding the miss completely
  • This could be mixed with DRS, using AWS in-frame, then DRS
  • Components required to make this possible
  • The Realtime Shader Clock Frequency Query described above (needed to convert shader clock readings into time)
  • Or alternatively QueryDisplayConfig() for FPS in n/d form
  • MISSING: Ability to get the shader realtime clock domain time of v-sync or alternatively query the h-sync position active in scanout

Bind Everything

  • Intel's iGPUs (pre-Arc HW) have too-small HW binding limits
  • Author is just ignoring Intel (including Arc)
  • NV and AMD are both quite good with binding everything once per frame

CPU Push Data to GPU

Description of an important GPU fast path on today’s HW

  • Usage case is sending and updating small data to the GPU
  • {keyboard state, gamepad state, CPU time, resolution, framerate, etc}
  • Aka 'late-latch'
  • CPU updates async to the GPU, GPU gets state as close to when it is needed
  • Usage contract
  • CPU writes complete cachelines (GPU-sized cacheline aka 128 bytes)
  • Cachelines are used only for this purpose
  • These cachelines are read only once per frame on the GPU and the contents are copied
  • This makes a stable snapshot of the contents
  • Working around the non-atomic nature of the PCIe bus
  • TODO: Find the reference for only bytes are atomic when crossing the PCIe bus
  • Each cacheline uses 32-bit for a hash of the cacheline contents (not including the hash)
  • The GPU reads the cacheline completely, and can compute the hash
  • If the included hash and the GPU-computed hash are different, the packet is tossed out
  • Tossed out because it's better latency wise
  • No re-read because this is cached GPU memory
  • Want cached GPU memory, because GPU isn't necessarily going to read the full line in one transaction (if it's one lane doing the reading, etc)
  • The data is updated in a ring buffer of size 2 {1 older, 1 latest}
  • The GPU reads both entries (one or both should be good)
  • If both pass the hash, take the one with the newer timestamp
  • Memory
  • Dedicated always mapped allocation
  • Two options
  • Lower Latency : DEVICE_LOCAL | HOST_VISIBLE
  • CPU write-combine stores to the GPU memory (less latency on GPU read)
  • Middle : HOST_VISIBLE
  • GPU reads through the bus, CPU writes write-combine to CPU DRAM
  • Slower : HOST_VISIBLE | HOST_CACHED
  • GPU reads through the bus, snooping the cache
  • TODO: Per Adam's point, should perhaps re-profile the differences between with and without HOST_CACHED, could be some perf advantage without HOST_CACHED for CPU upload
  • Both don't require DEVICE_UNCACHED since data is read once per frame only!
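
A minimal sketch of the GPU-side read under the contract above; the exact line layout, hash function, and names are hypothetical (the contract only fixes the 128-byte line, the 32-bit hash, and the 2-entry ring).

-

 layout(set=0,binding=0,std430)readonly buffer ssboPush{uint ring[2][32];}; // 2 x 128-byte lines
 uint HashLine(uint e){ // placeholder hash over the 31 payload words (word 31 holds the hash)
  uint h=0x9e3779b9u;
  for(uint i=0u;i<31u;++i)h=(h^ring[e][i])*0x01000193u;
  return h;}
 uint PickPacket(){ // convention here: word 0 of the payload is a CPU timestamp
  bool g0=(HashLine(0u)==ring[0][31]);
  bool g1=(HashLine(1u)==ring[1][31]);
  if(g0&&g1)return(ring[1][0]>ring[0][0])?1u:0u; // both valid: take the newer one
  return g1?1u:0u;}                              // else whichever passed (or entry 0)

-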

CPU to GPU Workload Migration

  • In modern times SoCs have power sloshing
  • The GPU:CPU power slosh ratio ranges from 1:5 to 5:1
  • Example: PS5 Pro
  • If a workload is less power efficient on the CPU than on the GPU it should be migrated over (and include the bus transfer power in that accounting too)

Dropping Registers Mid-Shader Execution?

  • Usage case: at launch an uber shader is allocated a maximum register count, then after the initial control path decision it releases registers early
  • Usage case, as a non-ubershader winds down and gets out of peak register count, it could start releasing registers back for other shaders
  • AMD's "Dealloc VGPRs" (in some version of RDNA)
  • This is a related incremental improvement
  • See the ISA guide on S_SENDMSG for details
  • Waves wait on memory-write-acknowledge (ACK) before actually exiting
  • This allows the VGPRs to be released before the ACK so another wave can launch faster
  • Register Allocation
  • Registers are allocated in some grouping granularity (N registers)
  • Any runtime re-naming of registers requires logic on a critical path (operand fetch)
  • If registers are allocated in linear segments that wrap around
  • AMD's documented this in the ISA guide: VGPR_BASE and VGPR_SIZE
  • It is just a small adder
  • However fragmentation would probably limit what could take advantage of released registers (esp if uber shader launch needs a very large chunk)
  • If registers are renamed per group of N, one needs to put a small memory on the critical path of operand fetch; for example 256 regs / (N=8) = 32 entries * maxWaveCount of 16 = a 512 entry x 5-bit memory … and with typically 4 operands you’d need either duplication or a few read ports … yeah it gets really expensive so vendors are not likely to go for this kind of approach (it’s a big latency add too)
  • So it’s not easy to support this efficiently in general
  • SPECULATION: Apple’s more recent HW designs are keeping smaller on-deck physical register files for active waves, and relying more on a hidden load/store to its new shared regbacking/L1/LDS memory, so it likely isn’t really a dynamic register count, but rather HW managed spilling?
  • Apple’s ISA has register last-use (“discard”) and register reuse hints (“cache”)
  • These could be used for HW managed spilling
  • Could be that the register cache is made bigger
  • Could profile to try to figure this out by running ALU-bound algorithms and graphing performance vs the amount of register reuse possible given the register reuse cache sizing
  • If perf falls off a cliff on FMAs when the operands cannot get high register reuse, then it’s likely they got port limited on hardware managed offload of registers to the shared SRAM
  • It might soften worst case behavior, but might decrease perf of high state load (and register) algorithms that otherwise run well on PC without any hidden spilling (even to LDS)
  • Curious if texture fetch actually writes results into the shared regspill/L1/LDS memory (although it would be silly to have a possibility of spilling texture fetch data)
  • AMD RDNA’s Sub-Vector Execution
  • This is an alternative which provides a variable amount of registers during execution from a fixed allocation
  • SIMD32 so Wave64 runs normally as 2 back-to-back {lo,hi} sub-vectors (32-wide)
  • In Sub-Vector Execution mode, the hardware can run a long sequence of {lo} instructions, and then a long sequence of {hi} instructions
  • During this sub-vector execution a subsection of registers gets two times as large
  • See VGPR_SHARED_SIZE (RDNA2 ISA Guide)
  • So this is highly vendor specific (not generally usable, and I don’t think compilers automatically generate code for this)
  • Generic Loop Unrolling?
  • AMD’s Sub-Vector Execution is in some respects a very non-portable form of this
  • This is how CPUs manage variable workload to fixed register count problems
  • If a wave is running 1 thing, it has N registers
  • If a wave is doing 2 things, it has N/2 registers
  • And so on
  • With compute the pixel-shader like 1 pixel to one lane mapping is no longer required
  • Is a fixed maximum register count actually a performance problem on modern GPUs from AMD/NV?
  • Answer is yes for shaders that explode in register count (cannot hide their own latency)
  • But if one avoids that problem completely (which is possible with a good compiler), then what?
  • Too many waves can overload L0 caches
  • Only want just enough waves to hide latency, not more as it decreases cache utilization (less cache per wave)
  • One of the points of semi-persistent waves and persistent waves is that the bubbles introduced by wave exit and restart actually decrease perf, so relying less on the HW scheduling has performance improvements
  • Author sticks with designing around 64 registers/wave, and uses loop-unrolling style solutions, and that does seem to work quite well in practice

Efficient Program Counter (PC) Logic

  • Review of AMD HW
  • AMD's HW is fully general purpose
  • Think of the wave as a CPU thread
  • All the opcodes are available that one would need
  • PC is 64-bit
  • Branches jump to (PC of instruction after branch + offset)
  • SOPP opcodes
  • These support a signed 16-bit immediate
  • Which is scaled by 4 for branch offset
  • So +/- 128 KiB (which is way larger than the I$ size)
  • S_BRANCH ... unconditional branch
  • S_CBRANCH_* ... conditional branch
  • Based on {VCC,EXEC,SCC}{Z,NZ} bit values
  • S_CALL_B64 ... save return PC in SGPR pair, then branch
  • There are 3 other instructions that can directly interact with the PC
  • S_GETPC_B64 ... store the PC to an aligned SGPR pair
  • Code can use this to compute a physical address of a relocated (or compile-time unknown) PC if the offset to the current instruction is known
  • S_SETPC_B64 ... load the PC from an aligned SGPR pair
  • S_SWAPPC_B64 ... store then load the PC
  • Branch tables would use this
  • Requirement for Optimizations
  • These require a guarantee that all shaders exist in a consistent 4 GiB window
  • Something a driver should be able to do
  • AMD's PC driver already does this for descriptor sets/etc
  • Meaning the 32-bit MSB of the PC is always constant at least during execution of a given program
  • And only the lower 32-bit LSB of the PC ever needs to be changed
  • This enables pointers to only ever need to load/store or do logic on the lower 32-bit LSB of the PC
  • This enables efficient jump tables
  • Can pack just the lower 32-bits of all the jump targets
  • For relocation can just use ELF stuff (patch the lower 32-bit LSB)
  • This enables space-efficient return stacks in a single VGPR
  • Can get at least a fixed 32-entry (wave32) return stack
  • AMD HW reference
  • V_READLANE_B32 ... copy a VGPR to a SGPR
  • V_WRITELANE_B32 ... copy a SGPR to a VGPR
  • Note S1 argument = lane select which is an SGPR
  • Note latency applies
  • Latency of branching
  • Latency of SALU/VALU dependencies

Index vs Offset - Analysis of Options

TLDR

  • Drivers don't do a good job with SSBOs and C-style array indexing!
  • There is no good way to emulate the HW byte-offsetting without getting a perf tax
  • Seems like GL_EXT_buffer_reference is the fix, except for compiler bugs on AMD
  • See the related section in the AMD Compiler Bugs area
  • This is a prime example of why actual instruction intrinsics would be a good idea
  • Because the hardware works with byte-offsets internally

Goals

  • Would like to get the driver out of the way for 'buffer' access
  • One 'thing' covering a huge amount of memory
  • Instead of one driver managed resource per huge number of things
  • That can be accessed via {32,64,128}-bit load/store and atomics
  • That can also do some amount of format conversion (aka TBUFFER)
  • Would like to take advantage of the HW's 'free' addressing logic
  • Would like to be able to DMA a huge chunk of this one 'thing' to/from a memory mapped file
  • So save/load state is effectively free of any file IO

Using Images Instead of Buffer?

  • With DCC format/type being either in descriptors or page tables
  • Wouldn't gain anything from compression due to format aliasing
  • Would have to go linear tiling mode to get different bit-pixel aliasing to make sense
  • So no sense in even doing 2D
  • Except it's 2D providing the 'free' address logic
  • Cannot get away from needing different bit widths
  • 32-bit atomics
  • 64-bit atomics
  • 128-bit load/store (for efficient access)
  • Would need a lot of image descriptors: 32-bit, 64-bit, 128-bit access, etc
  • So this is a definite NO on using images

Using AMD's HW (RDNA2) as an example because of the clear ISA documentation

  • Buffer ops are better than emulating Buffer ops via Image ops
  • Buffer has a lot of "free" addressing logic
  • TODO: Verify and find reference for this
  • Buffers have lower latency?
  • Buffers have better peak throughput?

AMD SMEM

  • 64-bit base pointer or 128-bit descriptor
  • 32-bit SGPR providing an unsigned byte offset
  • 21-bit signed byte offset via immediate
  • This must be positive for S_BUFFER ops though
  • So one cannot use the trick of using negatives to double range from the base

AMD VMEM BUFFER

  • 128-bit descriptor
  • 32-bit SGPR providing an unsigned byte offset
  • TODO: Open question if the driver will even use this for optimization
  • Seems like it doesn't (but might not have tested that case correctly yet)
  • 12-bit unsigned byte offset via immediate
  • 32-bit VGPR for byte offset (optional)
  • Driver only uses this
  • 32-bit VGPR for index (optional)
  • Driver won't use this for optimizations even for some fixed stride

NVIDIA LDG (SM89)

  • TODO: Author doesn't know NV hardware as well any more ...
  • Global load, with a huge number of addressing modes
  • [imm24]
  • [vgpr32]
  • [vgpr32+imm24]
  • [vgpr64+imm19]{+imm10} ... imm10 is likely offset for a second load
  • [sgpr64+vgpr32+imm24]
  • [sgpr64+vgpr32+imm18]{+imm10}
  • desc[sgpr64][vgpr64]
  • desc[sgpr64][vgpr64+imm24]
  • Supports a prefetch size included in the operation {64,128,256}-byte!
  • Addresses must be naturally aligned, PTX says two options
  • HW might fault
  • HW might silently clear the associated LSBs (better, too bad one cannot count on this)
  • All addresses are byte based (there is no support for C-style indexing)
  • Indexing logic (shift add) is taken up by the ALU!

Problems with the C-like shading languages, specifically SSBOs

  • See disassembly example below
  • For context, can layout alias the bind point as arrays of all types
  • But these C-like APIs require array indexing
  • No support for byte offsets
  • And the buffer logic in HW is based on byte offsets mostly
  • Descriptors do have an index size and an optional index
  • Author thinks (TODO confirm) that TBUFFER ops use descriptor index size
  • Not the size of the type in the opcode
  • Driver builds the SSBO descriptors
  • But looks like it sets index stride = 1-byte
  • So no way to leverage indexing
  • buf[immediateIndex]
  • Anything reduced to simply immediate index (known at compile time)
  • A compiler can transparently convert that back to a HW byte offset
  • Confirmed below for pure compile time immediates that fit in the immediate offset field
  • But note mixing with complex dynamic offset can undo this
  • buf[unsignedByteOffset>>immediateLog2Size]
  • Where the 'unsignedByteOffset' is not an immediate
  • The compiler won't optimize this
  • It will either do the described shift, and then the hidden unshift
  • Or just clear the associated LSBs (as an 'optimization')
  • The ask of course is for no SHIFTs or AND
  • buf[immediate+index]
  • Compiler adds an extra shift for the index
  • But can often factor the immediate into the opcode immediate

AMD PC RDNA2 Disassembly Examples

So the AMD compiler is quite bad at optimizing this next case

  • Compiler gets screwed over by shader languages using indexing instead of byte offsetting

ramWI1.hxd[13]=0xdead+ramRI1.hxd[(27>>2)+(g1>>2)];

  • Yes this is 'bad' code (for a few reasons, but shows issues in the compiler)
  • Where
  • g1 ... global invocation index (1D dispatch)
  • ram<W,R>I1 ... SSBO aliased to the same bind point
  • .hxd ... uint32_t array in the SSBO
  • Load Behavior .hxd[(27>>2)+(g1>>2)]
  • Note the immediate offset is hardcoded to .hxd
  • The (27>>2) immediate offset doesn't get factored in
  • This should be an index of 6, and a byte offset of 24
  • Instead the driver incorporates it into the extra LSHL needed for g1>>2
  • Via V_LSHL_ADD_U32
  • The (g1>>2) part shows the compiler will put in the shift then the hidden unshift
  • Instead of just factoring both out
  • Because of the possibility of having lower bits set
  • And note this example does have lower bits set (bad behavior)
  • NOTE at least for S_LOAD_BUFFER/etc it's documented (AMD's RDNA2 ISA Guide) that the lower 2-bits are ignored anyway, so in theory NOPing the shift could perhaps become a legal optimization for 32-bit/element loads
  • There is SH_MEM_CONFIG.alignment_mode to force that
  • Not clear if 'STRICT' forces alignment for all sizes, or just causes a fault
  • Store Behavior .hxd[13]
  • The driver does factor this into the immediate offset field
  • So at least some amount of minimal pure immediate optimization happens
  • Best to keep common static offset stuff in the lower immediate addressing range!
  • SSBO layout aliasing does only use one descriptor (and not a 64-bit pointer)
  • So at least that optimization is working correctly
  • The descriptor is built on the fly in the shader
  • Created from a USER_DATA_SGPR pair (64-bits total)
  • This is done to save on USER_DATA_SGPR setup space
  • The index stride is set to 1-byte
  • The driver could at least convert this to N-byte
  • And then leverage that information for optimization for N-byte load/store
  • But it doesn't employ any of those kind of optimizations
  • Notice the AMD compiler cannot do good SGPR allocation here
  • There is an extra unnecessary S_MOV_B32
  • Driver doesn't trust the DYNAMIC descriptor created by the driver?
  • It explicitly clears out the {stride, cache swizzle, and AOS swizzle}
  • But it pushed that work into GPU runtime instead of doing it at CPU create time

-

DECODING THE BUFFER DESCRIPTOR

    11111111111111110000000000000000

    fedcba9876543210fedcba9876543210

[0] _______________s5_______________  lower 32-bits of 48-bit base address

[1] 0000000000000000_______s6_______

    ................_______s6_______  upper 16-bits of 48-bit base address  

    ..00000000000000................  stride = 0

    .0..............................  cache swizzle (disabled)

    0...............................  swizzle AOS (disable)

[2] 11111111111111111111111111111111  num_records (maximum, effectively disable bounds check)

[3] 00000000000000100100111110101100

    .............................100  dst_sel_x (R)

    ..........................101...  dst_sel_y (G)

    .......................110......  dst_sel_z (B)

    ....................111.........  dst_sel_w (A)

    .............0100100............  format

    ...........rr...................  reserved

    .........00.....................  index stride (1-byte)

    ........0.......................  add tid enable (disabled)

    .......0........................  resource level (set to 0, even though ISA docs say set to 1)

    ....rrr.........................  reserved

    ..00............................  OOB_SELECT (out of bounds select, disabled?)

    00..............................  type 0=buffer

-

  v_lshl_add_u32  v0, s8, 6, v0                         // 000000000000: D1FD0000 04010C08

  v_ashrrev_i32  v0, 2, v0                              // 000000000008: 22000082

  v_lshl_add_u32  v0, v0, 2, 24                         // 00000000000C: D1FD0000 02610500

  s_and_b32     s0, s6, 0x0000ffff                      // 000000000014: 8600FF06 0000FFFF

  s_mov_b32     s1, s0                                  // 00000000001C: BE810000

  s_movk_i32    s2, 0xffff                              // 000000000020: B002FFFF

  s_mov_b32     s3, 0x00024fac                          // 000000000024: BE8300FF 00024FAC

  s_mov_b32     s0, s5                                  // 00000000002C: BE800005

  buffer_load_dword  v0, v0, s[0:3], 0 offen offset:1024 // 000000000030: E0501400 80000000

  s_waitcnt     vmcnt(0)                                // 000000000038: BF8C0F70

  v_add_u32     v0, 0x0000dead, v0                      // 00000000003C: 680000FF 0000DEAD

  buffer_store_dword  v0, v0, s[0:3], 0 offset:1076 glc // 000000000044: E0704434 80000000

  s_endpgm                                              // 00000000004C: BF810000

-

ramWI1.hxd[13]=0xdead+ramRI1.hxd[120+g1];

  • Similar to the above case, but use pure indexing
  • Notice the compiler does now factor the (120<<2) into the compile time immediate
  • But needs to add an extra V_LSHLREV_B32 for the indexing of g1

-

  v_lshl_add_u32  v0, s8, 6, v0                         // 000000000000: D1FD0000 04010C08

  v_lshlrev_b32  v0, 2, v0                              // 000000000008: 24000082

  s_and_b32     s0, s6, 0x0000ffff                      // 00000000000C: 8600FF06 0000FFFF

  s_mov_b32     s1, s0                                  // 000000000014: BE810000

  s_movk_i32    s2, 0xffff                              // 000000000018: B002FFFF

  s_mov_b32     s3, 0x00024fac                          // 00000000001C: BE8300FF 00024FAC

  s_mov_b32     s0, s5                                  // 000000000024: BE800005

  buffer_load_dword  v0, v0, s[0:3], 0 offen offset:1504 // 000000000028: E05015E0 80000000

  s_waitcnt     vmcnt(0)                                // 000000000030: BF8C0F70

  v_add_u32     v0, 0x0000dead, v0                      // 000000000034: 680000FF 0000DEAD

  buffer_store_dword  v0, v0, s[0:3], 0 offset:1076 glc // 00000000003C: E0704434 80000000

  s_endpgm                                              // 000000000044: BF810000

-

Write-Once-Read-Many (WORM)

Description of an important GPU fast path on today’s HW

One Implicit Per-Frame Cache Flush

  • Submit would typically chain command buffers for performance reasons
  • At submit boundaries there would typically be a preamble
  • The preamble does state setup and typically a cache flush
  • This method is a way to avoid doing any extra cache flushes mid-frame
  • So in Vulkan this means an app with NO vkCmdPipelineBarrier() calls!
  • And only VkEvents used for pipelining
  • TODO: It might be better to have a command buffer creation flag or submit flag which explicitly asks for the cache flush (in case the implicit cache flush gets factored out at some point)

Rules for GPU-Side Memory Access

  • Stores are all done using resources with ‘layout(...) coherent’ or atomic ops
  • Note you can alias different memory qualifiers on the same binding (see the sketch after this list)
  • To get cached reads mixed with write-through stores
  • This guarantees no stale lines can be left in the non-coherent caches
  • On AMD this invokes L0 GLC=1 to have stores write-through to the coherent L2 cache domain (note L1 is bypassed on store)
  • Note AMD’s driver already uses GL1=1 stores even without that coherent qualifier
  • Write-through on stores and not leaving the line in the lower level caches is a good default performance optimization because it avoids cache pollution!
  • All writing jobs don’t alias cachelines unless they use atomic operations
  • So non-atomic lines are effectively write once (so no aliasing)
  • After writes are done, any number of invocations are free to do cached reads as many times as they want from any caches (K$ for constants, or vector memory cache, etc)
  • No cache flush is needed between the write to read transition!
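
A minimal GLSL sketch of the qualifier aliasing above (hypothetical names; the spin-loop examples later in this document use the same binding-aliasing pattern): the same set/binding is declared once plain readonly for normal cached reads and once writeonly coherent so stores write through to the coherent cache domain.

--

#version 460
// Same buffer aliased at set=0,binding=0 with different memory qualifiers.
// Reads use the normal (non-coherent) cached path ...
layout(set=0,binding=0,std430)readonly buffer bufR_ {uint bufR[];};
// ... stores write through to the coherent cache domain (GLC=1 on AMD).
layout(set=0,binding=0,std430)writeonly coherent buffer bufW_ {uint bufW[];};
layout(local_size_x=64)in;
void main(){
// Cached read, write-through store; under the WORM rules above no
// mid-frame cache flush is needed for later passes to read this result.
bufW[gl_GlobalInvocationID.x]=bufR[gl_GlobalInvocationID.x]+1u;}

--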

HW Requirements

  • These hold on most modern hardware, but didn't necessarily hold on older hardware
  • TODO: Vendor support chipset list
  • This is supported on all desktop GPUs with packed 16-bit support at a minimum
  • Which covers the author’s usage cases, but support should extend back more
  • HW must read indirect dispatch from the coherent cache domain
  • HW must read scanout from the coherent cache domain

GPU Memory Aliasing

  • WORM doesn’t support mid-frame memory aliasing
  • Note that aliasing, say for resizable images, is safe if the aliases are used on separate frames
  • And in modern times the majority of mid-frame stuff needs temporal feedback
  • So those couldn’t be aliased anyway
  • In order to support mid-frame memory aliasing, one would need something that invalidated the incoherent cache domain
  • To avoid the possibility of stale lines
  • Prior reads for the same lines might be in the non-coherent caches
  • Using work to effectively flush the non-coherent caches is an ‘unsafe’ but valid approach too
  • A safer approach would be to use a vkCmdPipelineBarrier()
  • TODO: Include how to invoke and only flush the incoherent caches (as reference)

GPU Spin Loops or Lock-Free Retry Loops

PC Problems

  • There is no forward progress guarantee in the APIs (fail)
  • Preemption can happen, and there is no guarantee post-preemption will restore all workgroups back on the machine (compared to what workgroups had been active prior)
  • So if a workgroup is spinning on memory waiting for another non-loaded workgroup to change, a livelock will happen (aka deadlock, likely a TDR, or at worst another preemption and just no forward progress)
  • A suggested minimum ASK for IHVs is a guarantee that during preemption restore the oldest workgroups get restored first (aka whatever has a lower global kernel workgroup ID)
  • With this it becomes safe to block on workgroups with lower workgroup IDs
  • Which provides a way to implement many algorithms
  • There is also the possibility, when launching workgroups sized to fill a full CU (in AMD speak) and its associated resources, that prior fragmentation could make an irregular number of workgroups dispatch unless the system resets the units after idling them
  • Does this happen on modern hardware in practice? Maybe not any more
  • It did happen in early CUDA times
  • What is more likely is that other things might be running on the GPU, so a dispatch won't necessarily fill the machine immediately at launch; blocking on a count of workgroups that should fill the machine is therefore a possible fail point (deadlock)

Example AMD Compiler Bugs (MEGA FAIL)

Showing below an implementation of a GPU spin loop. It's a MEGA FAIL because of the massive number of compiler bugs that prevent even the attempted workarounds from being acceptable ...

AMD Compiler Correctness Fail With Regards to "Readonly Coherent"

  • Instant deadlock (livelock)
  • The compiler should do a GLC=1 (aka "coherent") SMEM read inside the spin loop but doesn't
  • The irony is that it does emit the GLC=1 load (so it implements the "coherent" part), then ignores those very correctness rules by hoisting it out of the loop
  • Instead it incorrectly assumes the read will return the same value as before and spins on the SGPR prior value
  • Note using "volatile" won't fix the problem either!

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

layout(set=0,binding=0,std430)readonly coherent buffer ssboRC_ {I1 ssboRC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly buffer ssboR_ {I1 ssboR[1024*1024*1024/4];};

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

void main(){

// Fast check on already signaled

// API bug: no way to say to expect this if to always be false

if(ssboR[0]==0){

// Predicate to just the first lane

if(gl_LocalInvocationID.x==0){

// Not signalled, so must spin until signalled, note the coherent read

while(true){if(ssboRC[0]!=0)break;}}}

// Make the shader do something so the program doesn't get dead-coded

ssboW[256]=1;}

--

Disassembly

--

0x000000    s_getpc_b64            s[2:3]

0x000004    s_mov_b32              s0, s1

0x000008    s_mov_b32              s1, s3

0x00000C    s_load_dwordx4         s[4:7], s[0:1], null

0x000014    s_waitcnt              lgkmcnt(0)

0x000018    s_buffer_load_dword    s0, s[4:7], null

0x000020    s_waitcnt              lgkmcnt(0)

0x000024    v_or_b32_e32           v0, s0, v0

0x000028    v_cmp_ne_u32_e32       vcc_lo, 0, v0

0x00002C    s_and_b32              s0, vcc_lo, exec_lo

0x000030    s_cbranch_scc1         _L0

0x000034    s_buffer_load_dword    s0, s[4:7], null glc dlc

0x00003C    s_waitcnt              lgkmcnt(0)

0x000040    s_cmp_lg_u32           s0, 0

0x000044    s_cselect_b32          s0, -1, 0

0x000048    v_cndmask_b32_e64      v0, 0, 1, s0

0x000050    v_cmp_ne_u32_e64       s0, 1, v0

_L1:

0x000058    s_and_b32              vcc_lo, exec_lo, s0

0x00005C    s_cbranch_vccnz        _L1

_L0:

0x000060    v_mov_b32_e32          v0, 1

0x000064    buffer_store_dword     v0, off, s[4:7], 0 offset:1024

0x00006C    s_endpgm              

--

Note you cannot work around this by hiding the memory address either

  • The compiler will simply load the memory address it's given
  • And then hoist it outside the spin loop! = DEADLOCK

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

layout(set=0,binding=0,std430)readonly coherent volatile buffer ssboRC_ {I1 ssboRC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly buffer ssboR_ {I1 ssboR[1024*1024*1024/4];};

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

void main(){

// Fast check on already signaled

// API bug: no way to say to expect this if to always be false

I1 hack=ssboR[1];

if(ssboR[0]==0){

// Predicate to just the first lane

if(gl_LocalInvocationID.x==0){

// Not signalled, so must spin until signalled, note the coherent read

while(true){if(ssboRC[hack]!=0)break;}}}

// Make the shader do something so the program doesn't get dead-coded

ssboW[256]=1;}

--

Disassembly

--

0x000000    s_getpc_b64            s[2:3]

0x000004    s_mov_b32              s0, s1

0x000008    s_mov_b32              s1, s3

0x00000C    s_load_dwordx4         s[4:7], s[0:1], null

0x000014    s_waitcnt              lgkmcnt(0)

0x000018    s_buffer_load_dword    s0, s[4:7], null

0x000020    s_waitcnt              lgkmcnt(0)

0x000024    v_or_b32_e32           v0, s0, v0

0x000028    v_cmp_ne_u32_e32       vcc_lo, 0, v0

0x00002C    s_and_b32              s0, vcc_lo, exec_lo

0x000030    s_cbranch_scc1         _L0

0x000034    s_buffer_load_dword    s0, s[4:7], 0x4

0x00003C    s_waitcnt              lgkmcnt(0)

0x000040    s_lshl_b32             s0, s0, 2

0x000044    s_buffer_load_dword    s0, s[4:7], s0 glc dlc

0x00004C    s_waitcnt              lgkmcnt(0)

0x000050    s_cmp_lg_u32           s0, 0

0x000054    s_cselect_b32          s0, -1, 0

0x000058    v_cndmask_b32_e64      v0, 0, 1, s0

0x000060    v_cmp_ne_u32_e64       s0, 1, v0

_L1:

0x000068    s_and_b32              vcc_lo, exec_lo, s0

0x00006C    s_cbranch_vccnz        _L1

_L0:

0x000070    v_mov_b32_e32          v0, 1

0x000074    buffer_store_dword     v0, off, s[4:7], 0 offset:1024

0x00007C    s_endpgm              

--

Dropping the "readonly" doesn't work either even with the hack

  • The compiler again hoists the load out of the spin loop = DEADLOCK
  • Same correctness bugs!

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

layout(set=0,binding=0,std430)coherent volatile buffer ssboC_ {I1 ssboC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly coherent volatile buffer ssboRC_ {I1 ssboRC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly buffer ssboR_ {I1 ssboR[1024*1024*1024/4];};

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

void main(){

// Fast check on already signaled

I1 hack=ssboR[1];

if(ssboR[0]==0){

// Predicate to just the first lane

if(gl_LocalInvocationID.x==0){

// Not signalled, so must spin until signalled, note the coherent read

while(true){if(ssboC[hack]!=0)break;}}}

// Make the shader do something so the program doesn't get dead-coded

ssboW[256]=1;}

--

Disassembly

--

0x000000    s_getpc_b64            s[2:3]

0x000004    s_mov_b32              s0, s1

0x000008    s_mov_b32              s1, s3

0x00000C    s_load_dwordx4         s[4:7], s[0:1], null

0x000014    s_waitcnt              lgkmcnt(0)

0x000018    s_buffer_load_dword    s0, s[4:7], null

0x000020    s_waitcnt              lgkmcnt(0)

0x000024    v_or_b32_e32           v0, s0, v0

0x000028    v_cmp_ne_u32_e32       vcc_lo, 0, v0

0x00002C    s_and_b32              s0, vcc_lo, exec_lo

0x000030    s_cbranch_scc1         _L0

0x000034    s_buffer_load_dword    s0, s[4:7], 0x4

0x00003C    s_waitcnt              lgkmcnt(0)

0x000040    s_lshl_b32             s0, s0, 2

0x000044    s_buffer_load_dword    s0, s[4:7], s0 glc dlc

0x00004C    s_waitcnt              lgkmcnt(0)

0x000050    s_cmp_lg_u32           s0, 0

0x000054    s_cselect_b32          s0, -1, 0

0x000058    v_cndmask_b32_e64      v0, 0, 1, s0

0x000060    v_cmp_ne_u32_e64       s0, 1, v0

_L1:

0x000068    s_and_b32              vcc_lo, exec_lo, s0

0x00006C    s_cbranch_vccnz        _L1

_L0:

0x000070    v_mov_b32_e32          v0, 1

0x000074    buffer_store_dword     v0, off, s[4:7], 0 offset:1024

0x00007C    s_endpgm              

--

Partial workaround?

  • This won't actually work, but it gets close to a workaround
  • Since the workaround involves moving the spin loops to VMEM
  • One needs to force a reload of the K$ line used for the fast check (didn't do that yet)
  • It's loaded with perf bugs (will itemize those)
  • Needed to move to an explicit atomic operation in the spin loop
  • And used a 'hack' value to hide a zero so there is no way a compiler can dead-code the atomic
  • In this instance the compiler decided to choose the worst branch order
  • The expected path has a taken branch when it should be linear (non-taken) ideally
  • No interface to show explicit expected branch priorities
  • Note sometimes I've seen the compiler get it right
  • So AMD at least can generate a branch that conditionally branches out and leaves the fast path linear (branch free), but they don't guess right a lot of the time
  • The compiler branches into code predicated into one lane (lane=0)
  • This is done via if(gl_LocalInvocationID.x==0)
  • But then the next thing the compiler does is implement its "perf strategy" for idiot programmers
  • So now it duplicates the predication of the atomic to the first lane using the worst possible method (via masked bit count)
  • But it already is down to one lane
  • So it becomes an anti-optimization (cost increase)
  • Then it restores what it thinks is multi-lane execution (but it isn't)
  • Seriously this is exactly why we need explicit SGPRs and explicit intrinsics, etc
  • And it does a read-first-lane to get the atomic result back into an SGPR so it can do vector logic on it
  • So it can then branch by a vector compare
  • Really no reason to do the read-first-lane, it's already in one-lane execution
  • The output is unbelievably bad
  • We need some other approach here!

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

layout(set=0,binding=0,std430)coherent volatile buffer ssboC_ {I1 ssboC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly coherent volatile buffer ssboRC_ {I1 ssboRC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly buffer ssboR_ {I1 ssboR[1024*1024*1024/4];};

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(local_size_x=64)in;

void main(){

// Fast check on already signaled

I1 hack=ssboR[1]; // Make sure that loads a zero!

if(ssboR[0]==0){

// Predicate to just the first lane

if(gl_LocalInvocationID.x==0){

// Not signalled, so must spin until signalled, note the coherent read

while(true){if(atomicOr(ssboC[0],hack)!=0)break;}}}

// Make the shader do something so the program doesn't get dead-coded

ssboW[256]=1;}

--

Disassembly

--

0x000000    s_getpc_b64            s[2:3]

0x000004    s_mov_b32              s0, s1

0x000008    s_mov_b32              s1, s3

0x00000C    s_load_dwordx4         s[0:3], s[0:1], null

0x000014    s_waitcnt              lgkmcnt(0)

0x000018    s_buffer_load_dword    s4, s[0:3], null

0x000020    s_waitcnt              lgkmcnt(0)

0x000024    v_or_b32_e32           v0, s4, v0

0x000028    s_mov_b64              s[4:5], exec

0x00002C    v_cmpx_eq_u32_e32      0, v0

0x000030    s_cbranch_execz        _L0

0x000034    s_buffer_load_dword    s10, s[0:3], 0x4

0x00003C    s_mov_b64              s[6:7], 0

0x000040    s_branch               _L1

0x000044    s_nop                  0

0x000048    s_nop                  0

0x00004C    s_nop                  0

0x000050    s_nop                  0

0x000054    s_nop                  0

0x000058    s_nop                  0

0x00005C    s_nop                  0

0x000060    s_nop                  0

0x000064    s_nop                  0

0x000068    s_nop                  0

0x00006C    s_nop                  0

0x000070    s_nop                  0

0x000074    s_nop                  0

0x000078    s_nop                  0

0x00007C    s_nop                  0

_L2:

0x000080    s_or_b64               exec, exec, s[8:9]

0x000084    s_waitcnt              vmcnt(0)

0x000088    v_readfirstlane_b32    s8, v0

0x00008C    s_waitcnt              lgkmcnt(0)

0x000090    v_cndmask_b32_e64      v0, s10, 0, vcc_lo

0x000098    v_or_b32_e32           v0, s8, v0

0x00009C    v_cmp_ne_u32_e32       vcc_lo, 0, v0

0x0000A0    s_or_b64               s[6:7], vcc, s[6:7]

0x0000A4    s_andn2_b64            exec, exec, s[6:7]

0x0000A8    s_cbranch_execz        _L0

_L1:

0x0000AC    v_mbcnt_lo_u32_b32     v0, exec_lo, 0

0x0000B4    v_mbcnt_hi_u32_b32     v0, exec_hi, v0

0x0000BC    v_cmp_eq_u32_e32       vcc_lo, 0, v0

0x0000C0    s_and_saveexec_b64     s[8:9], vcc

0x0000C4    s_cbranch_execz        _L2

0x0000C8    s_waitcnt              lgkmcnt(0)

0x0000CC    v_mov_b32_e32          v0, s10

0x0000D0    buffer_atomic_or       v0, off, s[0:3], 0 glc

0x0000D8    s_branch               _L2

_L0:

0x0000DC    s_or_b64               exec, exec, s[4:5]

0x0000E0    v_mov_b32_e32          v0, 1

0x0000E4    buffer_store_dword     v0, off, s[0:3], 0 offset:1024

0x0000EC    s_endpgm    

--

Another workaround attempt that failed

  • Tried to trick the compiler into using a VMEM load instead of SMEM
  • But it still ignores the "volatile" and "coherent" and hoists the load out of the spin loop
  • It's strange that it even factored the compare out of the spin loop too!

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

layout(set=0,binding=0,std430)coherent volatile buffer ssboC_ {I1 ssboC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly coherent volatile buffer ssboRC_ {I1 ssboRC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly buffer ssboR_ {I1 ssboR[1024*1024*1024/4];};

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(local_size_x=64)in;

void main(){

// Fast check on already signaled

I1 hack=ssboR[1]; // Make sure that loads a zero!

hack=hack&gl_LocalInvocationID.x; // Trick the compiler into a vector load (to the same address)!

if(ssboR[0]==0){

// Not signalled, so must spin until signalled, note the coherent read

while(true){if(ssboC[hack]!=0)break;}}

// Make the shader do something so the program doesn't get dead-coded

ssboW[256]=1;}

--

Disassembly

--

0x000000    s_getpc_b64            s[2:3]

0x000004    s_mov_b32              s0, s1

0x000008    s_mov_b32              s1, s3

0x00000C    s_load_dwordx4         s[0:3], s[0:1], null

0x000014    s_waitcnt              lgkmcnt(0)

0x000018    s_buffer_load_dword    s4, s[0:3], null

0x000020    s_waitcnt              lgkmcnt(0)

0x000024    s_cmp_lg_u32           s4, 0

0x000028    s_cbranch_scc1         _L0

0x00002C    s_buffer_load_dword    s4, s[0:3], 0x4

0x000034    s_waitcnt              lgkmcnt(0)

0x000038    v_and_b32_e32          v0, s4, v0

0x00003C    s_mov_b64              s[4:5], 0

0x000040    v_lshlrev_b32_e32      v0, 2, v0

0x000044    buffer_load_dword      v0, v0, s[0:3], 0 offen glc dlc

0x00004C    s_waitcnt              vmcnt(0)

0x000050    v_cmp_ne_u32_e32       vcc_lo, 0, v0

_L1:

0x000054    s_and_b64              s[6:7], exec, vcc

0x000058    s_or_b64               s[4:5], s[6:7], s[4:5]

0x00005C    s_andn2_b64            exec, exec, s[4:5]

0x000060    s_cbranch_execnz       _L1

0x000064    s_or_b64               exec, exec, s[4:5]

_L0:

0x000068    v_mov_b32_e32          v0, 1

0x00006C    buffer_store_dword     v0, off, s[0:3], 0 offset:1024

0x000074    s_endpgm              

--

Finally a workaround

  • Will trick the compiler into thinking the address is changing inside the loop
  • By adding a constant that only the programmer knows is zero in the loop iteration
  • Then it bypasses the bug
  • However the compiler still chooses a poor branch path
  • The fast path isn't linear
  • And this requires some extra dummy constant loads

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

layout(set=0,binding=0,std430)coherent volatile buffer ssboC_ {I1 ssboC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly coherent volatile buffer ssboRC_ {I1 ssboRC[1024*1024*1024/4];};

layout(set=0,binding=0,std430)readonly buffer ssboR_ {I1 ssboR[1024*1024*1024/4];};

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(local_size_x=64)in;

void main(){

// Fast check on already signaled

if(ssboR[0]==0){

// Not signalled, so must spin until signalled, note the coherent read

I1 hack=ssboR[1]; // Make sure that loads a zero!

I1 hack2=ssboR[2]; // Make sure that loads a zero! Trick the compiler into thinking the address changes!

while(true){if(ssboRC[hack]!=0)break;hack+=hack2;}}

// Make the shader do something so the program doesn't get dead-coded

ssboW[256]=1;}

--

Disassembly

--

0x000000    s_getpc_b64              s[2:3]

0x000004    s_mov_b32                s0, s1

0x000008    s_mov_b32                s1, s3

0x00000C    s_load_dwordx4           s[0:3], s[0:1], null

0x000014    s_waitcnt                lgkmcnt(0)

0x000018    s_buffer_load_dword      s4, s[0:3], null

0x000020    s_waitcnt                lgkmcnt(0)

0x000024    s_cmp_lg_u32             s4, 0

0x000028    s_cbranch_scc1           _L0

0x00002C    s_buffer_load_dwordx2    s[4:5], s[0:3], 0x4

0x000034    s_waitcnt                lgkmcnt(0)

0x000038    s_lshl_b32               s4, s4, 2

0x00003C    s_lshl_b32               s5, s5, 2

_L1:

0x000040    s_buffer_load_dword      s6, s[0:3], s4 glc dlc

0x000048    s_add_i32                s4, s4, s5

0x00004C    s_waitcnt                lgkmcnt(0)

0x000050    s_cmp_eq_u32             s6, 0

0x000054    s_cbranch_scc1           _L1

_L0:

0x000058    v_mov_b32_e32            v0, 1

0x00005C    buffer_store_dword       v0, off, s[0:3], 0 offset:1024

0x000064    s_endpgm                

--


Knowledge Base / Resources

Compiler Testing Online

Instruction Level Cache Controls on AMD and NVIDIA

  • AMD
  • Takeaways
  • Fully bypassing L2/L3 doesn't exist as an instruction-level cache control in later HW
  • This must be done via page table (see DEVICE_UNCACHED_BIT_AMD)
  • This is in contrast to NV’s hardware which has it at the instruction level, but doesn’t have or doesn’t expose a Vulkan DEVICE_UNCACHED memory type for page table control
  • TODO
  • Takes a bit of work to understand AMD’s cache control evolution
  • Hit Evict - A hit will be used, but regardless after the load the line is evicted
  • Aka “Last Use” but won’t leave stale lines either
  • Hit No Allocate - TODO (not well described in AMD’s public docs)
  • Miss Evict - The cache is “bypassed” in behavior, but “used” in implementation
  • Any matching lines (that would hit) are forced to miss
  • And after the operation the lines are evicted
  • HW burns a line in a temporary way
  • Stream (Store) - Hit leaves the line, but doesn’t update age to most recent
  • Documented in the RDNA1 docs but the behavior on a Miss isn't fully clear
  • Stream (Load) - TODO (not well described in AMD’s public docs)
  • SMEM ops lack SLC bit (no L2 cache control), SLC=0 (cached)
  • SMEM doesn’t have stores (well at least one chipset did but it wasn’t exposed)
  • SMEM cannot bypass L2 via SLC control
  • GCN
  • LOAD
  • GLC=0 Cached L1+L2
  • GLC=1 Miss L1 Fetch L2 (Cache only in the coherent domain)
  • SLC=0 Cache in L2
  • SLC=1 Force miss in L2 (goes away in RDNA3)
  • STORE
  • GLC=0 Only if store writes full line, leave line in L1, write through to L2
  • GLC=1 Don’t leave line in L1, write-through to L2
  • SLC=0 Cache in L2
  • SLC=1 TODO (goes away in RDNA3)
  • RDNA 1&2
  • The old L1 is now L0, GLC controls L0 behavior (same as before)
  • There is a new L1 mid-level cache for reads (writes actually physically bypass)
  • SLC is now mixed with DLC bits
  • LOAD
  • SLC=0 DLC=0 - Cached L2+L1
  • SLC=0 DLC=1 - Cached L2, Miss Evict L1
  • SLC=1 DLC=0 - Stream L2, Cached L1
  • SLC=1 DLC=1 - Hit No Allocate L2, Miss Evict L1
  • STORE
  • SLC=0 DLC=0 - Cached L2
  • SLC=0 DLC=1 - Bypass L2 (goes away in RDNA3)
  • SLC=1 DLC=0 - Stream (Hit leaves line, but doesn’t update age)
  • SLC=1 DLC=1 - Hit No Allocate
  • RDNA 3
  • DLC bits repurposed for MALL (L3) controls, so SLC+DLC bit meanings change
  • MALL/LLC/L3 is what AMD documents as the "Infinity Cache"
  • L1 control based on SLC+GLC bits
  • Some cache control comes from resource descriptor bits now (llc_alloc bits)
  • This llc_alloc is an LLC (Last Level Cache, aka L3) override?
  • Overrides behavior specified on the instruction
  • Probably so drivers can change global behavior for a given resource
  • S_LOAD, llc_alloc=0 (no descriptor)
  • LOAD
  • SLC=0 - Cached L2
  • SLC=1 - Stream L2
  • DLC=0 - L3 normal (if llc_alloc is set then this gets forced to DLC=1)
  • DLC=1 - L3 non-temporal hint (no alloc)
  • SLC=0 GLC=0 - Cache L1
  • SLC=0 GLC=1 - Miss Evict L1
  • SLC=1 GLC=0 - Hit Evict L1 (don’t leave line in L1 after)
  • SLC=1 GLC=1 - Miss Evict L1
  • STORE
  • SLC=0 - Cached L2
  • SLC=1 - Stream L2
  • DLC=0 - L3 normal
  • DLC=1 - L3 non-temporal hint (except if llc_alloc overrides to DLC=0)
  • RDNA 4?
  • Looks like the Chips and Cheese analysis isn't right based on the actual source
  • It's just the same 3 bits as before just re-purposed
  • Full changelist
  • Look for "GFX12+" in comments "enum CPol" (likely means cache policy)
  • Looks like they put in WorkGroup RoundRobin scheduling!
  • Scope bits alias the lower 2-bits of TH
  • {TH_NT_RT,TH_RT_NT,TH_NT_HT} not supported for SMEM
  • Translates to SMEM has no DLC bit (just {RT,NT,HT,LU})
  • {RETURN,RT,RT_RETURN,NT,NT_RETURN,CASCADE_RT,CASCADE_NT} are the only valid options for atomics
  • So CASCADE is fire-and-forget only
  • MUBUF/MTBUF moved from U16 to S24
  • FLAT is signed instead of unsigned offset
  • 210  [MEANS SPECULATION]
    ..1  GLC / SC0 = 1 [pre-GFX12]
    .1.  SLC / SC1 / NT = 2 [pre-GFX12]
    1..  DLC = 4 [pre-GFX12]
    ===
    .ss  Scope [GFX12]
    .00  CU
    .01  SE
    .10  DEV
    .11  SYS
    ===
    ttt  Temporal Hint [GFX12]
    000  TH_RT = regular
    001  TH_NT = non-temporal
    010  TH_HT = high-temporal
    011  TH_RT_WB = regular (CU/SE), high-temporal with write-back (MALL) [STORE]
    011  TH_LU = last use [LOAD]
    011  TH_BYPASS = only used with scope = 3
    100  TH_NT_RT = non-temporal (CU/SE), regular (MALL)
    101  TH_RT_NT = regular (CU/SE), non-temporal (MALL)
    110  TH_NT_HT = non-temporal (CU/SE), high-temporal (MALL)
    111  TH_NT_WB = non-temporal (CU/SE), high-temporal with write-back (MALL) [STORE]
    111  TH_RESERVED = unused value for load instructions [LOAD]
    ===
    aaa  Atomic Temporal Hint [GFX12]
    ..r  TH_ATOMIC_RETURN = GLC (return or not)
    .n.  TH_ATOMIC_NT = SLC (non-temporal or not)
    c..  TH_ATOMIC_CASCADE = 4 (cascade or regular) [no mixing with ATOMIC_RETURN]
  • SPECULATION: Possible meaning?
  • Maybe there is no L2, instead only a MALL?
  • No L2 language in there any more in the LLVM comments
  • Regular = Evict Normal
  • Non-Temporal = Evict First or Evict Unchanged
  • High-Temporal = Evict Last
  • There is no GLC=1 on stores, which means likely NT stores are HIT EVICT in L0 (aka CU) and still bypass L1 (aka SE) else there would be no way to implement layout "coherent"
  • Maybe TH_RT_NT is for ability to get more write-combining in L0 for things that write out data across disconnected stores, but otherwise stream (no reuse in larger level caches)
  • Maybe the difference between HT and WB is that in normal HT mode the aim is to avoid WB and hope for LU in time to avoid WB, and WB mode, it's reuse but also WB (use next frame too), so can WB like normal

 

  • NVIDIA
  • LOADS
  • .ca Cache All Levels
  • .cg Cache Global (L2 but not L1, cache only in the coherent cache domain)
  • .cs Cache Streaming (evict first policy)
  • .lu Last Use
  • Maps to .cs for global addresses, only does .lu for workgroup private memory (register spill/etc)
  • .cv Cache Volatile (don’t cache, specific example system memory)
  • This is how NV supports polling CPU-side memory
  • STORES
  • .wb Write Back
  • .cg Cache Global (L2 but not L1, cache only in the coherent cache domain)
  • .cs Cache Streaming (evict first policy)
  • .wt Write Through (don’t cache, specific example system memory)
  • This is how NV supports lower latency CPU communication (don’t need to wait for job to finish and write-back L2 cache)
  • >=SM_70 (aka Volta)
  • Hardware adds cache eviction priority hints
  • .en Evict Normal
  • .ef Evict First (aka Streaming possible reuse)
  • .el Evict Last (high priority to persist in cache)
  • .eu Evict Unchanged (do not change ordering)
  • .na No Allocate (do not place data in cache, hard streaming, one use)
  • Latest hardware notes (4xxx series) by inspecting fuzzed disassembly
  • Has more cache controls (4-bits) so perhaps beyond what is exposed in PTX?
  • TEX has eviction priority but no cache control
  • SUST (surface store) has both eviction priority and cache control
  • NVIDIA Questions
  • What is the existing mapping of layout qualifiers to cache control?
  • Layout “coherent” should map to '.cg' (else it wouldn’t work) - good there
  • TODO: Open question if "volatile" maps to '.cv' for loads and '.wt' for stores
  • If yes, then it opens up a portable way to do low-latency communication
  • Use the existing layout qualifiers on NV but on AMD use the memory type to force uncached operation (mix different mechanisms, but get the same result)
  • TODO: Going to assume one can mix '.cg' with '.el' to avoid keeping Evict Last lines in L1

Mapped Page Cache?

The idea is to remove the need to do file operations

  • Designed for indie title usage (tiny install, whole game is loaded into VRAM at start)

Example of what the author is trying in Vulkan

  • CreateFileMapping() and MapViewOfFile() - map a file into app's address space
  • Background thread walks the pages to ensure they are pre-faulted if possible
  • Using VkImportMemoryHostPointerInfoEXT with VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_MAPPED_FOREIGN_MEMORY_BIT_EXT
  • And pHostPointer set to the address of the CPU mapped file
  • Using both
  • VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
  • VK_MEMORY_PROPERTY_HOST_COHERENT_BIT
  • vkAllocateMemory(), vkBindBufferMemory(), vkUpdateDescriptorSets()
  • During launch there is a DMA copy of this buffer on the GPU to the beginning of a DEVICE_LOCAL buffer (the big GPU-side buffer)
  • During dev usage, DMA copy from big GPU-side buffer back to this CPU mapping
  • This is how the game data is 'saved' back into the Cart file

NVIDIA ISA Notes

References

  • Fixed 128-bit instructions
  • 6-bit mask for which barriers to wait on (6 HW barriers, tracking long term instruction completion)
  • 3-bit write barrier (selects which barrier bit is set when destination is written) and similar for a write-after-read hazard read barrier
  • There is an explicit yield to another warp bit (has operand cache effects)
  • This mechanism is similar to AMD’s clause but more effective as it isn’t limited to one instruction type!
  • AMD typically has problems of oldest-first camping on VALU causing memory bubbles because other waves cannot make forward progress to get VMEM ops out, NVIDIA’s yield in theory would mitigate that problem
  • Up to 4 operands each with a reuse bit (put into the operand cache)
  • Instructions are predicated (negate bit, 3-bit predicate register)

Hardware Improvement Thoughts

Random ideas on how to possibly improve GPU hardware design, a collection of thoughts across a decade of working in the industry (will update as time allows)

Bit LUT Operation With Data-Driven Truth-Table

  • Many processors still have a traditional design fail with respect to bit logic
  • Specifically having separate opcodes for each individual logic op {AND,OR,XOR,etc}
  • Would be better to do this FPGA LUT style where one data-driven operand provides the truth-table (reference semantics sketched after this list)
  • It reduces the number of CPU instructions required (better effective IPC)
  • It provides a way to have some data driven divergence without code divergence
  • 4 operand fetch architecture would be able to do any 3 operand logic operation
  • 3 operands provide the 3 inputs
  • 1 operand provides the truth-table (uses 8-bit from LSBs)
  • Note one could make the 8-bit LUT part of the opcode bits, but that then removes the data-driven capability (which might be the right unfortunate compromise for only a 3-operand architecture)
  • If the HW already has the capacity to write two 32-bit results per instruction per clock
  • Then one could build this instruction to do 2 logic ops in parallel (use a 16-bit LUT)
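
A GLSL reference for what the proposed instruction would compute (software emulation only, clearly not the single-op form being asked for); the 8 LSBs of 'lut' are the truth table indexed by the corresponding bits of the three inputs.

--

// Per bit: result = (lut >> (a_bit | b_bit<<1 | c_bit<<2)) & 1
// Example tables: 0x96 = a^b^c, 0xE8 = majority(a,b,c)
uint lut3(uint a,uint b,uint c,uint lut){
  uint r=0u;
  for(uint i=0u;i<32u;++i){
    uint idx=((a>>i)&1u)|(((b>>i)&1u)<<1)|(((c>>i)&1u)<<2);
    r|=((lut>>idx)&1u)<<i;}
  return r;}

--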

CU-Shared SGPRs

Providing perhaps a better way to share controlling data across waves, or communicate across waves

  • Reserve wave ID=0
  • Reuse its SGPR space for a CU-shared area of SGPRs
  • Put in instruction access to toggle between {shared,private} SGPRs
  • This HW implementation would likely be very low cost

Divergent Textures vs HW Design

  • Trouble spots for current hardware
  • Divergent texture descriptors
  • Divergent mip levels
  • Addressing logic gets too complex, need to factor it out
  • Using a minimum alignment to extend the 32-bit addressing capability
  • 32-bit value with 256-byte alignment
  • Don't store the lower 8-bits
  • 40-bit effective address in 32-bits = 1 TiB VA space (see the sketch after this list)
  • One could build in support for divergence given
  • Same texture descriptor (so same format and size, etc)
  • Same mip level (at runtime)
  • Factor out the base address from the descriptor and send with texture coordinates
  • One could argue that layered textures already cover this case quite well
  • But having a fixed number of layers is sometimes a deal breaker
  • Radical changes
  • If base address is completely factored out of the descriptor
  • And the HW uses descriptor indexes (instead of passing descriptors like AMD)
  • Then the index is basically a "texture compatibility index"
  • So textures of the same configuration (same index) can dynamically be evaluated together in parallel at runtime (even if their base address is divergent)
  • And the actual resource divergence is now limited to just the subset of 'compatibility indexes'
  • Just support a passed in base address (separate from that index)
  • And then introduce an adder for the final addressing in the texture unit
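
A GLSL sketch of the 256-byte-aligned base address packing described in the list above (uint64_t as provided by GL_EXT_shader_explicit_arithmetic_types, used elsewhere in this document):

--

// Drop the low 8 bits (valid for a 256-byte aligned base) so a 40-bit
// address fits in 32 bits = 1 TiB of addressable VA space
uint packBase(uint64_t base256){return uint(base256>>8);}
uint64_t unpackBase(uint packed){return uint64_t(packed)<<8;}

--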

Float Bool Fixes - Float Mode Without NaNs

The saturation modifier with FMA provides a good tool; some related IEEE rules:

  • saturate(NaN) = 0
  • saturate(-INF) = 0
  • saturate(+INF) = 1
  • mul(0,INF)=NaN … trouble case, but saturate(NaN) converts that back to 0

First a reference of what is used today for optimization (a GLSL transcription of a few of these follows the list)

  • 1.0 = True
  • 0.0 = False
  • And(x,y) = min(x,y)
  • And(x,y) = saturate(x*y)
  • And3(x,y,z) = min3(x,y,z)
  • AndOr(x,y,z) = saturate(x*y+z)
  • AndNot(x,y) = -x*y+1
  • Gt(x,y) = Gtz(x-y) … 2 ops
  • Gtz(x) = saturate(INF*x) … {NaN := 0, x GT 0 := 1, else 0}
  • Lt(x,y) = Ltz(x-y) … 2 ops
  • Ltz(x) = saturate(-INF*x) … {NaN := 0, x LT 0 := 1, else 0}
  • Ne(x,y) = Gtz(abs(x-y)) … 2 ops
  • Not(x) = 1-x
  • Or(x,y) = max(x,y)
  • Or(x,y) = saturate(x+y)
  • Or3(x,y,z) = max3(x,y,z)
  • Sel(x,y,z) = z*y+((-z)*x+x) … z==0.0?x:y … 2 ops, preserves precision
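
A GLSL transcription of a few of the identities above (hand-written here; saturate is clamp to [0,1]):

--

// 1.0 = true, 0.0 = false; per the Warning below INF may need to be a
// load-time constant (read from a buffer) instead of this literal form
const float INF=uintBitsToFloat(0x7F800000u);

float And(float x,float y){return min(x,y);}
float Or(float x,float y){return max(x,y);}
float Not(float x){return 1.0-x;}
// Relies on the hardware saturate(NaN)=0 behavior listed above
float Gtz(float x){return clamp(INF*x,0.0,1.0);} // {NaN:=0, x GT 0:=1, else 0}
float Gt(float x,float y){return Gtz(x-y);}
// z==0.0?x:y done as two FMAs, preserving precision of the selected input
float Sel(float x,float y,float z){return z*y+((-z)*x+x);}

--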

Warning

  • Some compiler engineers don’t understand the importance of mixing INF and saturate
  • So some platforms have bugs (they might factor out or transform the *INF into something else)
  • Since INF is a compile time literal
  • The workaround might be to push INF into a load-time constant instead

Optimization cases

  • Packed math (HW does not support a packed conditional)
  • Avoiding SALU to VALU dependent latency of conditionals
  • etc

The following are more expensive (extra op for the Not() which is a 1-x)

  • Le(x,y) … Not(Gt(x,y))
  • Ge(x,y) … Not(Lt(x,y))
  • Eq(x,y) … Not(Ne(x,y))

Could be faster though if there was a way to run the hardware in a modified no-NaN mode

  • Actual IEEE rules that provide the problems
  • INF-INF = NaN
  • (+/-)INF*0 = NaN
  • Would rather have no NaNs and instead this logic
  • 0*INF = 0
  • -INF+INF = 0
  • -INF-INF = -INF
  • +INF+INF = +INF
  • Then the following is possible (one less operation)
  • Eq(x,y) = saturate(-INF*abs(x-y)+INF)
  • Ge(x,y) = Gez(x-y)
  • Gez(x) = saturate(+INF*x+INF)
  • Le(x,y) = Lez(x-y)
  • Lez(x) = saturate(-INF*x+INF)

FMA Select Uber Op

  • This is a way to mix a FMA with a select in one operation
  • Applies to a 4-operand architecture
  • Can be implemented with very low cost in HW
  • It helps improve the common pattern of SIMD logic (sketched after this list)
  • x=option 0 ... computed at runtime
  • y=option 1 ... computed at runtime
  • c=logic op to control the choice between x and y
  • x=c?x:y ... aka select (or conditional mask)
  • So this applies to the last two parts (the logic op and the select) done as one operation
  • e=fmaWithModifiers(a,b,c)!=0.0?e:d
  • This doesn't require a 5th operand fetch
  • The HW selects 'e' by simply predicating off the resulting write
  • And notice the 'd' would typically be from a prior instruction return cache
  • Avoiding an extra register file read (so low power)
  • Note 'fmaWithModifiers' includes saturate()
  • Which means one can use saturate FMA based {0=false, 1=true} bool logic
  • The FMA is that float bool logic op in the suggested usage
  • Which is quite good for packed or vector-in-register to parallelize conditional logic
  • This would significantly improve packed 16-bit code generation for architectures that otherwise only have one HW flags register between CMP and SELECT
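
A GLSL sketch of the pattern the proposed op would fuse, written as the two separate steps it takes today (the hypothetical single op would instead predicate the destination write):

--

float fmaSelect(float a,float b,float c,float d,float e){
  float cond=clamp(fma(a,b,c),0.0,1.0); // the saturate-FMA float-bool logic op
  return (cond!=0.0)?e:d;               // the select the proposed op would fold in
}

--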

FMA Negate Input Modifier - 3-bit to 2-bit

  • Applies to hardware using 3-bits for negative input modifier for FMAs
  • ([+/-]a*[+/-]b)+([+/-]c)
  • The ((-a)*(-b)) has the same result as (a*b)
  • The ((-a)*b) has the same result as (a*(-b))
  • Can instead do 2-bit in the following form
  • ([+/-](a*b))+([+/-]c)
  • Saving one bit of opcode space
  • Or allowing a secondary usage (given generic 3 operand + modifiers opcode format)

IMAD Leveraging The Float Input Negate Modifier(s)

  • Starting with why implementations don't reuse neg opcode modifier bits
  • Because two's complement NEG requires a NOT and ADD 1 (carry bit)
  • In contrast to float where it's just an input MSB bit flip
  • There are a few useful forms here
  • Lessons learned from how FPGA people leverage DSP blocks
  • At a minimum the integer ADDers support one carry bit
  • Could support using a 3-bit modifier field as
  • [~](a*b) + [~]c + [carry]
  • So two not modifiers and one optional carry
  • Which is substantially more useful than the base IMAD without modifiers
  • There might be an even better way to do modifiers without carry abuse
  • a-b = a + not(b) + 1
  • a-b = not(not(a) + b) (both identities are checked in the sketch after this list)
  • Xilinx (aka AMD) FPGA DSP48E1 ALUMODE bits for reference
  • Example of how to do this well
  • Where Z is the added operand, and (X+Y) is the partial products of the MUL
  • Also for non-mul cases
  • X can be {0, P=prior result}
  • Y can be {0, C=the other operand, ~0 (aka -1)}
  • [00] Z+(X+Y+CIN) … add
  • [01] (~Z)+(X+Y+CIN) … -Z+(X+Y+CIN)-1 … if CIN=1 then get -Z+(X+Y) reverse subtract
  • [10] ~(Z+(X+Y+CIN)) … neg output - 1
  • [11] ~((~Z)+(X+Y+CIN))  … subtract … Z-(X+Y+CIN)
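
A quick GLSL check of the two subtract identities above (both hold modulo 2^32 for uint):

--

uint subViaNotAddCarry(uint a,uint b){return a+(~b)+1u;} // a-b = a + not(b) + 1
uint subViaDoubleNot(uint a,uint b){return ~((~a)+b);}   // a-b = not(not(a) + b)

--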

Integer Signed Maths - {MIN_NEGATIVE to 0} as Fixed-Point {1.0 to 0.0}

  • This is something for an older integer FPGA ALU design
  • Might have some use for integer GPU maths in shader programs
  • Two’s complement has more values on the negative side
  • Thus working with negative numbers is often better than positives
  • For fixed-point bool logic use the sign-bit (MSB)
  • 0 (positive) is false
  • Negative is true
  • Much easier than testing if equal to zero in HW (or FPGA)
  • Standardizing on negatives requires a lot more mental work though
  • For example a parabolic sqrt(x) estimation would normally be 2*x-x*x
  • But you’d need to transform that to 2*x+x*x (flip the sign of the square)

Max and Min in One Op

  • Typically max and min are separate operations
  • Sometimes with 3 operand forms min3(a,b,c) and max3(a,b,c)
  • If the hardware had ability to write (or accept) 2 results from the ALU per clock
  • Accept as in goes in a destination cache to avoid register fetch from SRAM in later instructions
  • Then getting both the min and max result would be quite useful
  • Because often both the min and max are needed in many algorithms
  • Today the AMD packed math fast path is to do max(half2(a.x,-a.x),half2(b.x,-b.x)) (see the sketch after this list)
  • Which provides {max,-min} respectively
  • And allows for AoS or SoA changes at the same time (change the .x to a .y in either operand)
  • This shows up in TAAs, image processing, block compression, etc
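
A packed fp16 GLSL sketch of that fast path (assuming the GL_EXT_shader_explicit_arithmetic_types f16 types used elsewhere in this document):

--

// One packed max returns {max(a,b), -min(a,b)} since max(-a,-b) == -min(a,b)
f16vec2 maxNegMin(float16_t a,float16_t b){return max(f16vec2(a,-a),f16vec2(b,-b));}

--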

Shift Direction - Data Driven

  • HW tends to have separate ops for shift left and shift right
  • One could instead allow the shift amount to choose the shift direction (see the sketch after this list)
  • Signed for right and unsigned for left
  • Combine that with ability to shift output to zero too
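
A GLSL emulation of the data-driven direction (the ask is for this to be a single HW op): non-negative amounts shift left, negative amounts shift right.

--

uint shiftSigned(uint x,int s){return (s>=0)?(x<<uint(s)):(x>>uint(-s));}

--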

Shifts to Zero

  • 32-bit shift implementation often takes the 5-bits LSB as the shift amount
  • This has an unfortunate side effect that it becomes impossible to do a data-driven shift value to zero (either for the left or right shift)
  • Often when using shifts for SIMD-parallel bit packing, shift-to-zero lets one denote a field that is not wanted for some divergent SIMD data, without code divergence or extra instructions (today this takes the extra select shown in the sketch after this list)
  • The ask would be that one could shift by {0 to 32} instead of {0 to 31} for 32-bit shifts
  • Probably takes just one extra level of logic to correct this in HW (easy to do)
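
Today's GLSL/HW needs an extra select to get the shift-by-32 = zero behavior the ask would make free:

--

// A shift amount of 32 (or more) produces zero instead of the modulo-32 behavior
uint shlToZero(uint x,uint s){return (s>=32u)?0u:(x<<s);}

--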

TBUFFER But With 64-bit Base Pointers Instead of Descriptors

  • Topic item for future GPU evolution ...
  • Moving from descriptors to 64-bit base pointers means no descriptors for drivers to manage
  • No longer need to put buffers into descriptor sets
  • TBUFFER as in load/store but with Type specified in the opcode
  • And by Type, ability to load/store the same types as in images
  • So advanced types like 10:10:10:2 and so on
  • Obviously would want cache control as part of the opcode
  • TODO: Along with maybe an addressing mode?
  • Something that could do standard address translations for locality

XOR Offsetting

  • This was something more for some radical integer FPGA GPU design
  • Not sure if it could be useful outside that context!
  • Idea was to use [XOR(adr,offset)] instead of [adr+offset]
  • Because the adder gets expensive
  • And for aligned and power of 2 sized things, it’s the same output
  • Possibly the XOR provides some tools for re-ordering data
  • One can use the LSBs of the base address to choose a reordering pattern (table and sketch below)

OFF   000 001 010 011 100 101 110 111        0 1 2 3 4 5 6 7

BAS   --- --- --- --- --- --- --- ---        - - - - - - - -

000 | 000 001 010 011 100 101 110 111    0 | 0 1 2 3 4 5 6 7  ---> zero BAS works like ADD

001 | 001 000 011 010 101 100 111 110    1 | 1 0 3 2 5 4 7 6  ---

010 | 010 011 000 001 110 111 100 101    2 | 2 3 0 1 6 7 4 5   ^

011 | 011 010 001 000 111 110 101 100 -> 3 | 3 2 1 0 7 6 5 4   |   the rest provide various reordering

100 | 100 101 110 111 000 001 010 011    4 | 4 5 6 7 0 1 2 3   |

101 | 101 100 111 110 001 000 011 010    5 | 5 4 7 6 1 0 3 2   v

110 | 110 111 100 101 010 011 000 001    6 | 6 7 4 5 2 3 0 1  ---

111 | 111 110 101 100 011 010 001 000    7 | 7 6 5 4 3 2 1 0  ---> ~0 BAS inverts the order of OFF
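
A trivial GLSL form of the idea (assuming power-of-2 alignment and size, so XOR visits the same set of elements as ADD, just reordered by the base's LSBs as in the table above):

--

uint xorAddr(uint base,uint off){return base^off;}

--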


Bonus Section on AMD PC Compiler Bugs

This is only getting updated very slowly when the topics show up again ...

AtomicAdd Fail on AMD PC

Yeah it's that bad, imagine the worst compiler behavior possible, that is what you get ...

Let's try to do something simple like an atomicAdd predicated to the first lane

Round 1 - First Attempt

  • Compiler does the branch to get to single lane execution
  • For the program's if(gl_LocalInvocationID.x==0)
  • Although it didn't need to branch; it could have just predicated execution instead
  • Then the compiler forgets it is already in single lane execution
  • And it uses masked bit count to find the first active lane of execution
  • Then branches a second time
  • Basically duplicating what it did before but slower this time
  • Then it computes the number of active lanes before the second branch
  • And multiplies that active lane count (which the compiler should know is one)
  • By the 1289 atomic add compile time immediate
  • Then it does the atomic
  • Then it goes back and does a redundant readfirstlane
  • And multiplies the prior masked bit count by the 1289 constant
  • To reconstruct the atomic as if it wasn't done in one lane
  • And finally it gets to the program's readlane which ignores the prior step
  • Yeah, WTF?
  • If one is going to do a "perf strategy" for idiot programmers
  • Better at least not destroy performance for "competent" programmers
  • Do no harm
  • And yet, it produces an absolute nightmare to try to workaround

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_KHR_shader_subgroup_arithmetic:require

#extension GL_KHR_shader_subgroup_ballot:require

#extension GL_KHR_shader_subgroup_quad:require

#extension GL_KHR_shader_subgroup_shuffle:require

#extension GL_KHR_shader_subgroup_vote:require

#extension GL_EXT_shader_subgroup_extended_types_float16:require

#define I1 uint32_t

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(set=0,binding=1,r32ui)coherent uniform uimageBuffer stbC_I1[64];

layout(local_size_x=64)in;

void main(){

I1 v;

if(gl_LocalInvocationID.x==0)v=imageAtomicAdd(stbC_I1[0],0,1289);

v=subgroupBroadcast(v,0);

ssboW[gl_LocalInvocationID.x]=v;}

--

Disassembly

--

0x000000    s_mov_b32              s0, s1

0x000004    s_getpc_b64            s[2:3]

0x000008    s_mov_b64              s[4:5], exec

0x00000C    v_cmpx_eq_u32_e32      0, v0

0x000010    s_cbranch_execz        _L0

0x000014    s_mov_b64              s[8:9], exec

0x000018    s_mov_b64              s[6:7], exec

0x00001C    v_mbcnt_lo_u32_b32     v1, s8, 0

0x000024    v_mbcnt_hi_u32_b32     v1, s9, v1

0x00002C    v_cmpx_eq_u32_e32      0, v1

0x000030    s_cbranch_execz        _L1

0x000034    s_mov_b32              s1, s3

0x000038    v_mov_b32_e32          v3, 0

0x00003C    s_load_dwordx4         s[12:15], s[0:1], null

0x000044    s_bcnt1_i32_b64        s1, s[8:9]

0x000048    s_mulk_i32             s1, 0x509

0x00004C    v_mov_b32_e32          v2, s1

0x000050    s_waitcnt              lgkmcnt(0)

0x000054    buffer_atomic_add      v2, v3, s[12:15], 0 idxen glc

_L1:

0x00005C    s_or_b64               exec, exec, s[6:7]

0x000060    s_waitcnt              vmcnt(0)

0x000064    v_readfirstlane_b32    s1, v2

0x000068    v_mad_u32_u24          v1, 0x509, v1, s1

_L0:

0x000074    s_or_b64               exec, exec, s[4:5]

0x000078    s_mov_b32              s1, s3

0x00007C    v_readlane_b32         s4, v1, 0

0x000084    s_load_dwordx4         s[0:3], s[0:1], 0x800

0x00008C    v_lshlrev_b32_e32      v0, 2, v0

0x000090    v_mov_b32_e32          v1, s4

0x000094    s_waitcnt              lgkmcnt(0)

0x000098    buffer_store_dword     v1, v0, s[0:3], 0 offen

0x0000A0    s_endpgm              

--

Round 2 - Use gl_SubgroupInvocationID instead?

  • Surely the compiler would know it's already predicated to one lane?
  • Nope same collection of problems
  • Except it's actually worse
  • Because gl_SubgroupInvocationID requires 2 VALU ops (masked bit count)
  • Instead of just using the gl_LocalInvocationID.x which is already in a VGPR
  • So it ends up doing the masked bit count 2 times

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_KHR_shader_subgroup_arithmetic:require

#extension GL_KHR_shader_subgroup_ballot:require

#extension GL_KHR_shader_subgroup_quad:require

#extension GL_KHR_shader_subgroup_shuffle:require

#extension GL_KHR_shader_subgroup_vote:require

#extension GL_EXT_shader_subgroup_extended_types_float16:require

#define I1 uint32_t

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(set=0,binding=1,r32ui)coherent uniform uimageBuffer stbC_I1[64];

layout(local_size_x=64)in;

void main(){

I1 v;

if(gl_SubgroupInvocationID==0)v=imageAtomicAdd(stbC_I1[0],0,1289);

v=subgroupBroadcast(v,0);

ssboW[gl_LocalInvocationID.x]=v;}

--

Disassembly

--

0x000000    v_mbcnt_lo_u32_b32     v1, -1, 0

0x000008    s_mov_b32              s0, s1

0x00000C    s_getpc_b64            s[2:3]

0x000010    v_mbcnt_hi_u32_b32     v1, -1, v1

0x000018    v_cmp_eq_u32_e32       vcc_lo, 0, v1

0x00001C    s_and_saveexec_b64     s[4:5], vcc

0x000020    s_cbranch_execz        _L0

0x000024    s_mov_b64              s[8:9], exec

0x000028    s_mov_b64              s[6:7], exec

0x00002C    v_mbcnt_lo_u32_b32     v1, s8, 0

0x000034    v_mbcnt_hi_u32_b32     v1, s9, v1

0x00003C    v_cmpx_eq_u32_e32      0, v1

0x000040    s_cbranch_execz        _L1

0x000044    s_mov_b32              s1, s3

0x000048    v_mov_b32_e32          v3, 0

0x00004C    s_load_dwordx4         s[12:15], s[0:1], null

0x000054    s_bcnt1_i32_b64        s1, s[8:9]

0x000058    s_mulk_i32             s1, 0x509

0x00005C    v_mov_b32_e32          v2, s1

0x000060    s_waitcnt              lgkmcnt(0)

0x000064    buffer_atomic_add      v2, v3, s[12:15], 0 idxen glc

_L1:

0x00006C    s_or_b64               exec, exec, s[6:7]

0x000070    s_waitcnt              vmcnt(0)

0x000074    v_readfirstlane_b32    s1, v2

0x000078    v_mad_u32_u24          v1, 0x509, v1, s1

_L0:

0x000084    s_or_b64               exec, exec, s[4:5]

0x000088    s_mov_b32              s1, s3

0x00008C    v_readlane_b32         s4, v1, 0

0x000094    s_load_dwordx4         s[0:3], s[0:1], 0x800

0x00009C    v_lshlrev_b32_e32      v0, 2, v0

0x0000A0    v_mov_b32_e32          v1, s4

0x0000A4    s_waitcnt              lgkmcnt(0)

0x0000A8    buffer_store_dword     v1, v0, s[0:3], 0 offen

0x0000B0    s_endpgm              

--

Round 3 - How about subgroupElect()?

  • Certainly the compiler has to know it's only one lane, that is what subgroupElect() is for after all
  • Nope same collection of problems
  • But it actually got worse
  • The implementation of subgroupElect() tries to figure out the first lane
  • Even though it's the start of the program, and the first lane is obviously lane 0

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_KHR_shader_subgroup_arithmetic:require

#extension GL_KHR_shader_subgroup_ballot:require

#extension GL_KHR_shader_subgroup_quad:require

#extension GL_KHR_shader_subgroup_shuffle:require

#extension GL_KHR_shader_subgroup_vote:require

#extension GL_EXT_shader_subgroup_extended_types_float16:require

#define I1 uint32_t

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(set=0,binding=1,r32ui)coherent uniform uimageBuffer stbC_I1[64];

layout(local_size_x=64)in;

void main(){

I1 v;

if(subgroupElect())v=imageAtomicAdd(stbC_I1[0],0,1289);

v=subgroupBroadcast(v,0);

ssboW[gl_LocalInvocationID.x]=v;}

--

Disassembly

--

0x000000    v_mbcnt_lo_u32_b32     v1, exec_lo, 0

0x000008    s_mov_b32              s0, s1

0x00000C    s_getpc_b64            s[2:3]

0x000010    v_mbcnt_hi_u32_b32     v1, exec_hi, v1

0x000018    v_cmp_eq_u32_e32       vcc_lo, 0, v1

0x00001C    s_and_saveexec_b64     s[4:5], vcc

0x000020    s_cbranch_execz        _L0

0x000024    s_mov_b64              s[8:9], exec

0x000028    s_mov_b64              s[6:7], exec

0x00002C    v_mbcnt_lo_u32_b32     v1, s8, 0

0x000034    v_mbcnt_hi_u32_b32     v1, s9, v1

0x00003C    v_cmpx_eq_u32_e32      0, v1

0x000040    s_cbranch_execz        _L1

0x000044    s_mov_b32              s1, s3

0x000048    v_mov_b32_e32          v3, 0

0x00004C    s_load_dwordx4         s[12:15], s[0:1], null

0x000054    s_bcnt1_i32_b64        s1, s[8:9]

0x000058    s_mulk_i32             s1, 0x509

0x00005C    v_mov_b32_e32          v2, s1

0x000060    s_waitcnt              lgkmcnt(0)

0x000064    buffer_atomic_add      v2, v3, s[12:15], 0 idxen glc

_L1:

0x00006C    s_or_b64               exec, exec, s[6:7]

0x000070    s_waitcnt              vmcnt(0)

0x000074    v_readfirstlane_b32    s1, v2

0x000078    v_mad_u32_u24          v1, 0x509, v1, s1

_L0:

0x000084    s_or_b64               exec, exec, s[4:5]

0x000088    s_mov_b32              s1, s3

0x00008C    v_readlane_b32         s4, v1, 0

0x000094    s_load_dwordx4         s[0:3], s[0:1], 0x800

0x00009C    v_lshlrev_b32_e32      v0, 2, v0

0x0000A0    v_mov_b32_e32          v1, s4

0x0000A4    s_waitcnt              lgkmcnt(0)

0x0000A8    buffer_store_dword     v1, v0, s[0:3], 0 offen

0x0000B0    s_endpgm              

--

Round 4 - Trying Another Branch Strategy

  • Thought for certain this would work, but it is also horrible
  • Made a lane dynamic value that is 1289 on lane 0 but zero on all other lanes
  • Then predicated the atomic by if the value to add was zero
  • But the output is bad, it's still doing the multiply garbage
  • So either that is a bug, or it somehow knows the value is wave coherent inside the branch
  • Quite amazing, will have to try harder

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_KHR_shader_subgroup_arithmetic:require

#extension GL_KHR_shader_subgroup_ballot:require

#extension GL_KHR_shader_subgroup_quad:require

#extension GL_KHR_shader_subgroup_shuffle:require

#extension GL_KHR_shader_subgroup_vote:require

#extension GL_EXT_shader_subgroup_extended_types_float16:require

#define I1 uint32_t

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(set=0,binding=1,r32ui)coherent uniform uimageBuffer stbC_I1[64];

layout(local_size_x=64)in;

void main(){

I1 v=mix(0,1289,gl_LocalInvocationID.x==0);

if(v!=0)v=imageAtomicAdd(stbC_I1[0],0,v);

v=subgroupBroadcast(v,0);

ssboW[gl_LocalInvocationID.x]=v;}

--

Disassembly

--

0x000000    v_mov_b32_e32          v1, 0

0x000004    s_mov_b32              s0, s1

0x000008    s_getpc_b64            s[2:3]

0x00000C    s_mov_b64              s[4:5], exec

0x000010    v_cmpx_eq_u32_e32      0, v0

0x000014    s_cbranch_execz        _L0

0x000018    s_mov_b64              s[8:9], exec

0x00001C    s_mov_b64              s[6:7], exec

0x000020    v_mbcnt_lo_u32_b32     v1, s8, 0

0x000028    v_mbcnt_hi_u32_b32     v1, s9, v1

0x000030    v_cmpx_eq_u32_e32      0, v1

0x000034    s_cbranch_execz        _L1

0x000038    s_mov_b32              s1, s3

0x00003C    v_mov_b32_e32          v3, 0

0x000040    s_load_dwordx4         s[12:15], s[0:1], null

0x000048    s_bcnt1_i32_b64        s1, s[8:9]

0x00004C    s_mulk_i32             s1, 0x509

0x000050    v_mov_b32_e32          v2, s1

0x000054    s_waitcnt              lgkmcnt(0)

0x000058    buffer_atomic_add      v2, v3, s[12:15], 0 idxen glc

_L1:

0x000060    s_or_b64               exec, exec, s[6:7]

0x000064    s_waitcnt              vmcnt(0)

0x000068    v_readfirstlane_b32    s1, v2

0x00006C    v_mad_u32_u24          v1, 0x509, v1, s1

_L0:

0x000078    s_or_b64               exec, exec, s[4:5]

0x00007C    s_mov_b32              s1, s3

0x000080    v_readlane_b32         s4, v1, 0

0x000088    s_load_dwordx4         s[0:3], s[0:1], 0x800

0x000090    v_lshlrev_b32_e32      v0, 2, v0

0x000094    v_mov_b32_e32          v1, s4

0x000098    s_waitcnt              lgkmcnt(0)

0x00009C    buffer_store_dword     v1, v0, s[0:3], 0 offen

0x0000A4    s_endpgm              

--

Round 5 - Trying Not to Branch Strategy

  • Ok this finally worked, but it requires some background to understand how it works
  • For image operations there are 2 ways to disable the store or atomic
  • First set EXEC to disable the associated lane
  • This is what we are told to do in "school"
  • This is what the compiler fails quite hard at
  • The second is to just set the store address to something out of bounds
  • Note this likely won't work for SSBOs!
  • And definitely won't work for 64-bit pointers!
  • One needs to use STORAGE_TEXEL_BUFFER for this to work!
  • Out of bounds writes are disabled after address generation
  • For stores the hardware pre-merges all the same address writes
  • So regardless of when the disable happens it should be fast
  • For atomics this depends on the hardware skipping the disabled lanes early
  • TODO: Will need to check this in a benchmark to double verify
  • So this one always does the atomic on all lanes
  • Depends on the hardware fast path that disables atomics for out-of-bounds addresses
  • By just pushing the address to an out-of-bounds value for all lanes except the first
  • This disables all the dead stupid compiler behavior
  • Probably because the address is now dynamic

Source

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_KHR_shader_subgroup_arithmetic:require

#extension GL_KHR_shader_subgroup_ballot:require

#extension GL_KHR_shader_subgroup_quad:require

#extension GL_KHR_shader_subgroup_shuffle:require

#extension GL_KHR_shader_subgroup_vote:require

#extension GL_EXT_shader_subgroup_extended_types_float16:require

#define I1 uint32_t

layout(set=0,binding=0,std430)writeonly buffer ssboW_ {I1 ssboW[1024*1024*1024/4];};

layout(set=0,binding=1,r32ui)coherent uniform uimageBuffer stbC_I1[64];

layout(local_size_x=64)in;

void main(){

I1 v=mix(I1(-4),0,gl_LocalInvocationID.x==0);

v=imageAtomicAdd(stbC_I1[0],int(v),1289);

v=subgroupBroadcast(v,0);

ssboW[gl_LocalInvocationID.x]=v;}

--

Disassembly

--

0x000000    s_mov_b32             s4, s1

0x000004    s_getpc_b64           s[0:1]

0x000008    v_cmp_eq_u32_e32      vcc_lo, 0, v0

0x00000C    s_mov_b32             s5, s1

0x000010    v_mov_b32_e32         v2, 0x509

0x000018    s_clause              0x1

0x00001C    s_load_dwordx4        s[0:3], s[4:5], null

0x000024    s_load_dwordx4        s[4:7], s[4:5], 0x800

0x00002C    v_cndmask_b32_e64     v1, -4, 0, vcc_lo

0x000034    v_lshlrev_b32_e32     v0, 2, v0

0x000038    s_waitcnt             lgkmcnt(0)

0x00003C    buffer_atomic_add     v2, v1, s[0:3], 0 idxen glc

0x000044    s_waitcnt             vmcnt(0)

0x000048    v_readlane_b32        s0, v2, 0

0x000050    v_mov_b32_e32         v1, s0

0x000054    buffer_store_dword    v1, v0, s[4:7], 0 offen

0x00005C    s_endpgm              

--

GL_EXT_buffer_reference

Re Gustav Sterbrant's comment: 'They support pointer arithmetic and everything'

Let's see if everything works ...

TLDR

  • Serious correctness bug: stores marked 'coherent' are missing GLC=1
  • Makes this completely unusable on AMD
  • Second problem, only the RDNA3 compile optimizes correctly, and only for the simple case
  • RDNA2 and before are unusably slow (the 64-bit address math is done with VALU ops instead of the hardware addressing modes)
  • Apparently AMD changed compilers for RDNA3 HW
  • Serious problems with basic code generation even on RDNA3 in the less simple path (packed 16-bit)
  • See the later example
  • Which is too bad, because this extension is otherwise exactly what the author was looking for

Review of AMD HW (example from RDNA2 ISA guide)

S_LOAD_*

  • Load from 1-16 dwords
  • address = base + offset + imm21
  • base : 64-bit SGPR pair
  • offset : 32-bit SGPR providing unsigned byte offset
  • imm21 : signed byte offset (but must be positive)

GLOBAL_*

  • Load or store or atomic via the FLAT instruction form
  • address = base + offset + imm12
  • base : 64-bit SGPR pair
  • offset : 32-bit VGPR providing unsigned byte offset
  • imm12 : signed byte offset
  • address = base + imm12
  • base : 32-bit | 64-bit VGPR
  • Does support {SLC,DLC,GLC} cache control bits

Using the Radeon GPU Analyzer from 2024/09/26

  • Apparently AMD deprecated pre-RDNA HW? Already?
  • So cannot check the output for my Vega based APU using this tool!

Using this program below that tests

  • SMEM getting 8 DWORD loads (can it do large accesses)
  • SMEM using 'base + offset + imm21' (can it use all components)
  • VMEM using 4 DWORD stores
  • VMEM using 'base + offset + imm12' (can it use all components)
  • VMEM using 'coherent' GLC=1 cache control bits

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_EXT_shader_explicit_arithmetic_types_int64:require

#extension GL_EXT_buffer_reference : require

#define I1 uint32_t

#define I4 u32vec4

#define L1 uint64_t

#define L4 u64vec4

layout(buffer_reference,std430,buffer_reference_align=32)readonly buffer BufR_L4{L4 v;};

layout(buffer_reference,std430,buffer_reference_align=32)writeonly coherent buffer BufW_L4{L4 v;};

layout(push_constant)uniform pc0_ {I4 pc;};

layout(local_size_x=64)in;

void main(){

I1 i=gl_LocalInvocationID.x<<5;

L1 base=pack64(pc.yz);

I1 off=pc.w;

I1 off2=pc.x;

BufR_L4 rL4=BufR_L4(base+off+0xaa0);

BufW_L4 wL4=BufW_L4(base+i+0xbb0);

wL4.v=rL4.v;}

--

RDNA1 (gfx1010) Disassembly

Bugs

  • Serious correctness bug, no GLC=1 (for coherent layout)
  • Serious perf bug, it's using [vgpr64] addressing and doing the 64-bit address math with VALU ops instead of using the SGPR base + VGPR offset form

--

0x000000    s_load_dwordx8          s[4:11], s[2:3], s4 offset:0xaa0    1,8     F40C0101 08000AA0

0x000008    v_lshlrev_b32_e32       v0, 5, v0                           1,8     34000085

0x00000C    v_add_co_u32            v0, s0, s2, v0                      1,8     D70F0000 00020002

0x000014    v_add_co_ci_u32_e64     v1, s0, s3, 0, s0                   2,8     D5280001 00010003

0x00001C    v_add_co_u32            v8, vcc_lo, 0x800, v0               3,8     D70F6A08 000200FF 00000800

0x000028    v_add_co_ci_u32_e32     v9, vcc_lo, 0, v1, vcc_lo           3,8     50120280

0x00002C    s_waitcnt               lgkmcnt(0)                          2,8     BF8CC07F

0x000030    v_mov_b32_e32           v6, s6                              3,8     7E0C0206

0x000034    v_mov_b32_e32           v7, s7                              4,8     7E0E0207

0x000038    v_mov_b32_e32           v4, s4                              5,8     7E080204

0x00003C    v_mov_b32_e32           v5, s5                              6,8     7E0A0205

0x000040    v_mov_b32_e32           v2, s10                             7,8     7E04020A

0x000044    v_mov_b32_e32           v3, s11                             8,8     7E06020B

0x000048    v_mov_b32_e32           v0, s8                              9,8     7E000208

0x00004C    v_mov_b32_e32           v1, s9                              10,8    7E020209

0x000050    global_store_dwordx4    v[8:9], v[4:7], off offset:944      10,8    DC7883B0 007D0408

0x000058    global_store_dwordx4    v[8:9], v[0:3], off offset:960      6,8     DC7883C0 007D0008

0x000060    s_endpgm                                                    0,8     BF810000

--

RDNA3 (gfx1100) Disassembly

Getting right

  • SMEM getting 8 DWORD loads (can it do large accesses)
  • SMEM using 'base + offset + imm21' (can it use all components)
  • VMEM using 4 DWORD stores
  • VMEM using 'base + offset + imm12' (can it use all components)

Bugs

  • Serious correctness bug, no GLC=1 (for coherent layout)

--

0x000000    s_load_b256          s[4:11], s[2:3], s4 offset:0xaa0    1,16    F40C0101 08000AA0

0x000008    v_lshlrev_b32_e32    v0, 5, v0                           1,16    30000085

0x00000C    s_delay_alu          instid0(VALU_DEP_1)                 1,16    BF870001

0x000010    v_and_b32_e32        v8, 0x7fe0, v0                      2,16    361000FF 00007FE0

0x000018    s_waitcnt            lgkmcnt(0)                          1,16    BF89FC07

0x00001C    v_mov_b32_e32        v6, s6                              2,16    7E0C0206

0x000020    v_mov_b32_e32        v7, s7                              3,16    7E0E0207

0x000024    v_mov_b32_e32        v4, s4                              4,16    7E080204

0x000028    v_mov_b32_e32        v5, s5                              5,16    7E0A0205

0x00002C    v_mov_b32_e32        v2, s10                             6,16    7E04020A

0x000030    v_mov_b32_e32        v3, s11                             7,16    7E06020B

0x000034    v_mov_b32_e32        v0, s8                              8,16    7E000208

0x000038    v_mov_b32_e32        v1, s9                              9,16    7E020209

0x00003C    s_clause             0x1                                 9,16    BF850001

0x000040    global_store_b128    v8, v[4:7], s[2:3] offset:2992      9,16    DC760BB0 00020408

0x000048    global_store_b128    v8, v[0:3], s[2:3] offset:3008      5,16    DC760BC0 00020008

0x000050    s_nop                0                                   0,16    BF800000

0x000054    s_sendmsg            sendmsg(MSG_DEALLOC_VGPRS)          0,16    BFB60003

0x000058    s_endpgm                                                 0,16    BFB00000

--

But With Closer Inspection Even the RDNA3 Code Gen is Quite Bad

Another simple case, but with some packed 16-bit maths

  • Compiler doesn't seem to be able to do basic register allocation right
  • Notice the extra V_LSHRREV_B32_E32
  • Compiler then does 4 global stores instead of 1 global store because it put the packed stuff in non-aligned registers

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int16:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_EXT_shader_explicit_arithmetic_types_int64:require

#extension GL_EXT_buffer_reference : require

#define W2 u16vec2

#define W4 u16vec4

#define I1 uint32_t

#define I4 u32vec4

#define L1 uint64_t

#define L4 u64vec4

layout(buffer_reference,std430,buffer_reference_align=8)readonly buffer BufR_W4{W4 v;};

layout(buffer_reference,std430,buffer_reference_align=8)writeonly coherent buffer BufW_W4{W4 v;};

layout(push_constant)uniform pc0_ {I4 pc;};

layout(local_size_x=64)in;

void main(){

I1 i=gl_LocalInvocationID.x<<5;

L1 base=pack64(pc.yz);

I1 off=pc.w;

I1 off2=pc.x;

BufR_W4 rW4=BufR_W4(base+off+0xaa0);

BufW_W4 wW4=BufW_W4(base+i+0xbb0);

W4 ww=rW4.v;

ww.xy=ww.xy*ww.zw+W2(12,24);

wW4.v=ww;}

--

0x000000    s_load_b64           s[0:1], s[2:3], s4 offset:0xaa0

0x000008    v_mov_b32_e32        v1, 0x18000c

0x000010    v_lshlrev_b32_e32    v0, 5, v0

0x000014    s_delay_alu          instid0(VALU_DEP_1) | instskip(SKIP_1) | instid1(VALU_DEP_3)

0x000018    v_and_b32_e32        v0, 0x7fe0, v0

0x000020    s_waitcnt            lgkmcnt(0)

0x000024    v_pk_mad_u16         v1, s0, s1, v1

0x00002C    s_lshr_b32           s0, s1, 16

0x000030    v_mov_b32_e32        v3, s1

0x000034    v_mov_b32_e32        v4, s0

0x000038    s_delay_alu          instid0(VALU_DEP_3)

0x00003C    v_lshrrev_b32_e32    v2, 16, v1

0x000040    s_clause             0x3

0x000044    global_store_b16     v0, v1, s[2:3] offset:2992

0x00004C    global_store_b16     v0, v2, s[2:3] offset:2994

0x000054    global_store_b16     v0, v3, s[2:3] offset:2996

0x00005C    global_store_b16     v0, v4, s[2:3] offset:2998

0x000064    s_nop                0

0x000068    s_sendmsg            sendmsg(MSG_DEALLOC_VGPRS)

0x00006C    s_endpgm            

--

Looks like in this case it is possible to work around by re-packing to a uvec2 before the store

  • Which implies the compiler just isn't putting the right constraints on things

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int16:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#extension GL_EXT_shader_explicit_arithmetic_types_int64:require

#extension GL_EXT_buffer_reference : require

#define W2 u16vec2

#define W4 u16vec4

#define I1 uint32_t

#define I2 u32vec2

#define I4 u32vec4

#define L1 uint64_t

#define L4 u64vec4

#define I1_W2(a) packUint2x16(a)

I2 I2_W4(W4 a){I2 r;r.x=I1_W2(a.xy);r.y=I1_W2(a.zw);return r;}

layout(buffer_reference,std430,buffer_reference_align=8)readonly buffer BufR_W4{W4 v;};

layout(buffer_reference,std430,buffer_reference_align=8)writeonly coherent buffer BufW_W4{W4 v;};

layout(buffer_reference,std430,buffer_reference_align=8)writeonly coherent buffer BufW_I2{I2 v;};

layout(push_constant)uniform pc0_ {I4 pc;};

layout(local_size_x=64)in;

void main(){

I1 i=gl_LocalInvocationID.x<<5;

L1 base=pack64(pc.yz);

I1 off=pc.w;

I1 off2=pc.x;

BufR_W4 rW4=BufR_W4(base+off+0xaa0);

BufW_I2 wI2=BufW_I2(base+i+0xbb0);

W4 ww=rW4.v;

ww.xy=ww.xy*ww.zw+W2(12,24);

wI2.v=I2_W4(ww);}

--

0x000000    s_load_b64           s[0:1], s[2:3], s4 offset:0xaa0

0x000008    v_lshlrev_b32_e32    v0, 5, v0

0x00000C    v_mov_b32_e32        v1, 0x18000c

0x000014    s_delay_alu          instid0(VALU_DEP_2) | instskip(SKIP_1) | instid1(VALU_DEP_2)

0x000018    v_and_b32_e32        v2, 0x7fe0, v0

0x000020    s_waitcnt            lgkmcnt(0)

0x000024    v_pk_mad_u16         v0, s0, s1, v1

0x00002C    v_mov_b32_e32        v1, s1

0x000030    global_store_b64     v2, v[0:1], s[2:3] offset:2992

0x000038    s_nop                0

0x00003C    s_sendmsg            sendmsg(MSG_DEALLOC_VGPRS)

0x000040    s_endpgm            

--

Loading or Unpacking Descriptors Inside a Branch

  • Example below
  • If you predicate a store or atomic to one lane
  • AMD's driver will load/build the descriptor right before the operation
  • When it's on the latency critical path
  • Instead of doing the loading/building earlier, off the latency critical path

--

  s_cbranch_execz  label_0012                           // 000000000020: BF880009

  s_and_b32     s2, s6, 0x0000ffff                      // 000000000024: 8602FF06 0000FFFF

  s_mov_b32     s4, s5                                  // 00000000002C: BE840005

  s_mov_b32     s5, s2                                  // 000000000030: BE850002

  s_movk_i32    s6, 0xffff                              // 000000000034: B006FFFF

  s_mov_b32     s7, 0x00024fac                          // 000000000038: BE8700FF 00024FAC

  buffer_store_dword  v1, v0, s[4:7], 0 offen offset:1024 glc // 000000000040: E0705400 80010100

label_0012:

--
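For reference, a minimal GLSL sketch (bindings, names, and the stored value are assumptions, not the exact source behind the listing above) of a store predicated to one lane, which is the pattern that gets the descriptor built inside the branch:

--

#version 460
layout(set=0,binding=0,std430)buffer ssboW_ {uint ssboW[];};
layout(local_size_x=64)in;
void main(){
uint v=gl_LocalInvocationID.x+1;
// Predicating the store to one lane; the driver builds the buffer descriptor
// inside this branch, on the latency critical path, instead of before it
if(gl_LocalInvocationID.x==0)ssboW[256]=v;}

--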

Predicating to One Lane is Slow

Desired path

  • Explicit setting of EXEC to just lane 0
  • Using 'if(subgroupInverseBallot(uvec4(1,0,0,0)))'
  • Code produced below (highly ugly)

--

  v_mbcnt_lo_u32_b32  v1, -1, 0                         // 000000000000: D28C0001 000100C1

  v_mbcnt_hi_u32_b32  v1, -1, v1                        // 000000000008: D28D0001 000202C1

  v_lshlrev_b64  v[1:2], v1, 1                          // 000000000010: D28F0001 00010301

  v_and_b32     v1, 1, v1                               // 000000000018: 26020281

  s_mov_b64     s[0:1], exec                            // 00000000001C: BE80017E

  v_cmpx_ne_u32  s[2:3], v1, 0                          // 000000000020: D0DD0002 00010101

  ...

  s_cbranch_execz  label_0019                           // 00000000003C: BF880009

--

Workaround?

  • The workaround path is 'if(gl_LocalInvocationID.x==0)'
  • This is not really desirable as a workaround, because it would be far better to just explicitly set EXEC without doing the V_CMPX op and checking the invocation ID in a VGPR
  • Code produced below
  • Note it forces a compare, and will do a full branch instead of just masking the EXEC
  • In theory the compiler knows this is a one wave workgroup 'layout(local_size_x=64)'
  • It could in theory pattern match that to EXEC manipulation

--

  s_mov_b64     s[0:1], exec                            // 000000000000: BE80017E

  v_cmpx_eq_i32  s[2:3], v0, 0                          // 000000000004: D0D20002 00010100

  ...

  s_cbranch_execz  label_0012                           // 000000000020: BF880009

--

Also bad

  • Using 'if(subgroupElect())'
  • The driver knows this is the beginning of a wave sized workgroup
  • It could in theory have just SAVEEXECed this to just lane 0 (EXEC=1)
  • Instead it looks for the first active lane and does the masked bit count mess too

--

  s_ff1_i32_b64  s0, exec                               // 000000000000: BE80117E

  v_mbcnt_lo_u32_b32  v1, -1, 0                         // 000000000004: D28C0001 000100C1

  v_mbcnt_hi_u32_b32  v1, -1, v1                        // 00000000000C: D28D0001 000202C1

  s_mov_b64     s[2:3], exec                            // 000000000014: BE82017E

  v_cmpx_eq_i32  s[0:1], s0, v1                         // 000000000018: 7DA402F9 06868000

  ...

  s_cbranch_execz  label_0017                           // 000000000034: BF880009

--

SSBO Won't Use the Free Addressing for SGPR Offset

Notes

  • So the optimizer does pick up small offsets
  • Large offsets of elements in an SSBO get a VALU add instead of using the free SGPR offset
  • And each instance of that wastes VGPR space and VALU work

Example shader

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

#define I4 u32vec4

struct HugeSSBO{

I4 takeUpSpace[1024*1024*64];

I1 pastImm[1024];

I1 pastImm2[1024];};

layout(set=0,binding=0,std430)buffer ssbo {HugeSSBO huge;};

layout(push_constant)uniform pc0_ {I4 pc;};

layout(local_size_x=64)in;

void main(){

I1 i=gl_LocalInvocationID.x<<5;

I1 d=huge.pastImm[i];

d+=0xdead;

huge.pastImm2[i+0x1234]=d;}

--

And the disassembly on RDNA2

--

0x000000    s_getpc_b64           s[2:3]

0x000004    s_mov_b32             s0, s1

0x000008    s_mov_b32             s1, s3

0x00000C    v_lshlrev_b32_e32     v0, 7, v0

0x000010    s_load_dwordx4        s[0:3], s[0:1], null

0x000018    v_add_nc_u32_e32      v1, 2.0, v0

0x00001C    v_add_nc_u32_e32      v0, 0x40005000, v0

0x000024    s_waitcnt             lgkmcnt(0)

0x000028    buffer_load_dword     v1, v1, s[0:3], 0 offen

0x000030    s_waitcnt             vmcnt(0)

0x000034    v_add_nc_u32_e32      v1, 0xdead, v1

0x00003C    buffer_store_dword    v1, v0, s[0:3], 0 offen offset:2256

0x000044    s_endpgm              

--

Texel Buffer Won't Use the Free Addressing

Notes

  • AMD's implementation of Texel Buffer uses BUFFER instructions
  • And while the compiler knows the index stride in bytes due to the layout
  • It refuses to optimize cases where it could factor out immediates into either
  • SGPR offset
  • or IMM offset
  • Instead it uses extra expensive VALU ops

Example shader

--

#version 460

#extension GL_EXT_shader_explicit_arithmetic_types:require

#extension GL_EXT_shader_explicit_arithmetic_types_int32:require

#define I1 uint32_t

#define I4 u32vec4

layout(set=0,binding=1,r32ui)uniform uimageBuffer img[2];

layout(push_constant)uniform pc0_ {I4 pc;};

layout(local_size_x=64)in;

void main(){

I1 i=gl_LocalInvocationID.x<<5;

I1 d=imageLoad(img[0],int(i)).x;

d+=0xdead;

imageStore(img[0],int(i+0xdead00),uvec4(d));

imageStore(img[0],int(i+0x78),uvec4(d));}

--

And the disassembly on RDNA2

--

0x000000    s_getpc_b64              s[2:3]

0x000004    s_mov_b32                s0, s1

0x000008    s_mov_b32                s1, s3

0x00000C    v_lshlrev_b32_e32        v0, 5, v0

0x000010    s_load_dwordx4           s[0:3], s[0:1], null

0x000018    v_add_nc_u32_e32         v2, 0xdead00, v0

0x000020    s_waitcnt                lgkmcnt(0)

0x000024    buffer_load_format_x     v1, v0, s[0:3], 0 idxen

0x00002C    v_add_nc_u32_e32         v0, 0x78, v0

0x000034    s_waitcnt                vmcnt(0)

0x000038    v_add_nc_u32_e32         v1, 0xdead, v1

0x000040    buffer_store_format_x    v1, v2, s[0:3], 0 idxen

0x000048    buffer_store_format_x    v1, v0, s[0:3], 0 idxen

0x000050    s_endpgm                

--


Work-in-Progress

Consider everything below junk until it gets moved above ...

Stable ABI For Compute

TODO

General Statement About Hints

Many of the suggestions below manifest as hints that can be placed in high-level languages like GLSL and IRs like SPIR-V with relatively low effort, even if the IHV backends don’t yet support them. IHVs could then roll in support as it fits their timelines. This would allow shaders to be authored well now so improvements could land later. Some vendors might not have HW support, and they can simply ignore many of these hints safely.

Audio From GPU

  • TODO, this is just a placeholder
  • Need to expand on a suggestion of how to implement
  • Including how one solves the problem of lower latency (GPU work scheduling,etc)
  • GPU is one of the best platforms for custom audio generation algorithms[g]
  • HDMI includes audio output
  • However there is no route on the GPU to write into an audio ring buffer
  • One has to write across the bus into CPU-side memory, then do a copy on the CPU into the audio ring buffer, then have the system copy that back to the GPU for audio out
  • Latency issues
  • GPU-to-CPU writes have no great portable forced uncached write (AMD is an exception)
  • Instead the APIs rely on last-level-cache forced write-back operations (expensive and latent)
  • So this is why today it’s hard to build low-latency GPU audio (because of the GPU->CPU->GPU expensive path)
  • For on-GPU the problem would be getting a high-priority wave or workgroup to execute when audio needs to be generated (periodically)
  • If it’s Graphics running, and no mid-triangle preemption, there would be a problem of a huge graphics job camping all the waves and blocking low-latency
  • If it’s Compute running, there is kernel preemption, so it is possible to get work on the GPUs (not that this is the best option)
  • If the game is using persistent waves and managing its own task cut-up, in theory it could respond by running audio generation tasks when needed
  • CPU audio generation has effectively the same kinds of actual latency issues in task execution, like fetching from audio samples is likely to be from a cold cache at the start of the task (assuming a fully loaded machine)
  • Side issue, a lot of modern audio is decompressed streams
  • GPUs do have scalar integer ALUs that could do decompression as well
  • There are options for vector decompression, but who needs 64 parallel decode streams
  • So one would need to parallelize across samples, not streams

Cache Control - Instruction Level / Etc

  • Getting Coherence In LLC (Last Level Cache)
  • TODO: Big topic
  • Standard Write
  • Suggested implementation
  • Hit Evict on any incoherent cache level
  • Evict Normal on any coherent cache level
  • AMD PC driver does this already with GLC=1 by default on stores (even without coherent layout)
  • Easy to force: just mix a 'coherent writeonly' layout (see the sketch after this list)
  • Don't want to leave the possibility of getting stale lines in incoherent caches
  • If the hardware doesn't do eviction after the write, then a full cache flush of the incoherent level is required later for safety
  • Last Use / Hit Evict
  • Suggestion: TODO, this requires more thought ...
  • Designed for private workgroup memory (register spill, stacks, etc)
  • NVIDIA according to PTX only allows this for private workgroup memory
  • Even if NVIDIA didn't support for non-private usage, one could still provide a hint for other platforms
  • Designed for temporary data
  • Designed to avoid unnecessary write-back of lines
  • Lines might have been written only to say NV's L1$ and not L2$
  • Last Use avoids writing any data to L2$ (line state becomes undefined)
  • The big usage case is for using large L3$ to hold non-private temporary data
  • NVIDIA doesn’t document this explicitly, but assuming it is "Evict" after access
  • Likely "Hit Evict" in AMD terms
  • Very hard to leverage "Last Use" in many non-private memory multi-reader parallel problems
  • Unknown at read time state of other parallel tasks which might read a line
  • Probably better to do a ranged invalidation of the cache level after the whole task is done in that case
  • The problem of course, that almost always involves getting the driver involved, which is definitely something one wants to avoid
  • Secondary option if the memory is DCCed (compressed) is to clear it while there is a possibility that the lines are still in the cache
  • By storing zeros to a full DCC block at the same time
  • The store would invoke a meta-data only operation (lower overhead than a full write-back of non-zero data)
  • By RDNA3.5 AMD only supports Hit Evict on L1 (mid-level read-only cache)
  • It is effectively selected implicitly by one cache-bit combination
  • SLC=1 (STREAM) with GLC=0 (cache in L0)
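A minimal GLSL sketch of the 'easy to force' note above (binding and names are assumptions): mixing 'coherent writeonly' on the SSBO should get GLC=1 on the stores from AMD's compiler.

--

#version 460
// 'coherent writeonly' together should force GLC=1 on the stores (AMD)
layout(set=0,binding=0,std430)coherent writeonly buffer outW_ {uint outW[];};
layout(local_size_x=64)in;
void main(){outW[gl_GlobalInvocationID.x]=0xdeadu;}

--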

Scanout Query

  • Either ability to query when a frame starts on the GPU realtime shader clock domain
  • Or alternatively ability to query the scanout h-sync counter
  • This is another component to providing AWS (see Realtime Shader Clock Frequency Query)

VA Duplication for Shared Physical Backing for Aliasing?

  • TODO, this is just a placeholder
  • Might scrap this section, because it might not have advantages unless lower level caches are virtually tagged (which isn’t going to be good for GPU design anyway)
  • If aliased allocations have guaranteed different virtual addresses but share the same physical pages, it is possible to reduce the cache invalidation overhead to just the physically tagged caches, and all the virtual tagged caches need not be invalidated
  • But if the virtual tagged caches are all coherent, then one wouldn’t need to invalidate them
  • And this is likely the common configuration

VA Explicit Address for Buffer

  • TODO, this is just a placeholder
  • AMD kernel driver (Linux) supports fixed repeatable manual Virtual Address layout
  • It would be nice to have the ability to lay out a part of the GPU Virtual Address space explicitly, in a portable way across multiple chipsets and multiple vendors
  • This is a great tool for pre-linking pointers and doing other related optimizations
  • A secondary part of this is to be able to mix and match different buffer allocations for different usages into a common consistent VA space
  • For example, having a specific region supporting CPU-read-back, or CPU-write-through, but accessing that via a common pointer on the GPU
  • Common pointer might be a questionable ask if some vendors place some properties into a ‘descriptor’ instead of say a page table entry
  • Third example might be the classic VM enabled ring buffer
  • Repeat the same physical mapping at different VA ranges to have the VA address translation implement a ring buffer

Bus Crossing Topics

CPU-Write-GPU-Read

  • DEVICE_LOCAL+HOST_VISIBLE case
  • CPU is using write-combining stores that cross the bus and write to GPU DRAM
  • Assume the CPU writes don’t invalidate GPU cache entries
  • non-DEVICE_LOCAL
  • CPU is writing to its own cache hierarchy, and the GPU is reading across the bus
  • The bus crossing read snoops the CPU caches (PC)
  • HOST_VISIBLE+HOST_COHERENT
  • In theory this could be faster for cases where the CPU never reads and writes full cachelines[h] (TODO: validate this statement)
  • HOST_VISIBLE+HOST_COHERENT+HOST_CACHED
  • In cases where the CPU might read/modify/write a line, or doesn’t write full lines, you don’t want the reads to go uncached on the CPU, so use this
  • Once a GPU read happens later reads can hit in the GPU cache and bypass re-reading across the bus
  • Always write full cachelines from the CPU (important for write-combining)
  • Have the GPU only read a CPU-written cacheline once per frame
  • Do not try to GPU write to any line that could be CPU-written
  • Make a copy of the data if multi-GPU-read is needed
  • This avoids the possibility of getting different data versions if the line is lost in the GPU cache
  • Note there is the possibility of getting a partial CPU-written cacheline if CPU wrote after submit
  • There is no guarantee of cacheline granularity atomic stores
  • One mitigation is to include a hash of the data in the cacheline
  • If the hash doesn’t agree with the contents, then toss out the data packet
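A minimal GLSL sketch of the hash mitigation above (packet layout, bindings, and the hash function are all assumptions): read each CPU-written cacheline once, validate the embedded hash, and keep a GPU-side copy for any later re-reads.

--

#version 460
#extension GL_EXT_shader_explicit_arithmetic_types_int32:require
#define I1 uint32_t
// One packet per 64-byte cacheline: 15 payload words plus an embedded hash
struct Packet{I1 data[15];I1 hash;};
layout(set=0,binding=0,std430)readonly buffer cpuRing_ {Packet cpuRing[];};
layout(set=0,binding=1,std430)writeonly buffer gpuCopy_ {Packet gpuCopy[];};
layout(local_size_x=64)in;
I1 HashPacket(Packet p){I1 h=0x811c9dc5u;
 for(int i=0;i<15;i++)h=(h^p.data[i])*0x01000193u; // FNV-1a style mix (an assumption)
 return h;}
void main(){I1 i=gl_GlobalInvocationID.x;
 Packet p=cpuRing[i];                   // read the CPU-written line once
 if(HashPacket(p)==p.hash)gpuCopy[i]=p; // mismatch means a partial write, toss the packet
}

--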

CPU-Write-GPU-Read - GPU-Polling?

  • AMD provides DEVICE_UNCACHED_BIT_AMD for memory type which allows reads to bypass the GPU cache domain (likely in HW this is a page table bit)
  • This support extends back to at least Vega (so anything with packed 16-bit double rate supports it)
  • NVIDIA does not provide such support in a memory type in Vulkan (2024)
  • This now allows the GPU to poll the same line multiple times per frame to see if the CPU is finished with something
  • Without this, you’d need to poll different lines each poll read
  • TODO: Would be nice if there was a way to make GPU polling portable
  • Challenges of portable HW support given all the HW design possibilities (page table, vs in-descriptor, vs cache hints on opcode) … maybe being over-complete would solve that?
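A minimal GLSL sketch of GPU-side polling (this assumes the SSBO is bound to DEVICE_UNCACHED_BIT_AMD memory, and the flag protocol is made up):

--

#version 460
// 'volatile coherent' keeps the load inside the loop instead of being hoisted
layout(set=0,binding=0,std430)volatile coherent readonly buffer cpuFlag_ {uint cpuFlag;};
layout(local_size_x=64)in;
void main(){
 uint start=cpuFlag;
 // Poll the same line until the CPU bumps the counter; without the uncached
 // memory type this could keep hitting a stale line in the GPU cache
 while(cpuFlag==start){}
 // ... consume whatever the CPU published ...
}

--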

GPU-Write-CPU-Read

  • Starting with a basic rule, don’t share cacheline usage between read and write
  • GPU writes the full cacheline (no read) and CPU only reads a full cacheline (no write)
  • TODO: Note, I’m mostly talking about dGPU cases, clearly iGPU can be different
  • This is focusing on worst case …
  • DEVICE_LOCAL?
  • This typically implies uncached reads from the CPU, so don’t use for CPU read
  • There are typically two CPU-side memory types
  • HOST_VISIBLE+HOST_COHERENT
  • This typically implies uncached reads from the CPU, so don’t use for CPU read cases
  • HOST_VISIBLE+HOST_COHERENT+HOST_CACHED
  • CPU reads are cached (fast if they hit)
  • Without some form of cache control there is no guarantee on the timing a GPU write becomes CPU visible mid-frame
  • GPU cache can soak up writes, and wait until some write-back before kicking the lines across the bus
  • Typically one would need a pipeline barrier with VK_PIPELINE_STAGE_HOST_BIT to force a write-back and that could be quite costly
  • One mitigation plan for vendors without cache control might be to just run a GPU workload that blows through the cache, but if it’s a big L3 that could be more challenging
  • TODO: How to try to force a bus-write-through without brutal cost?
  • TODO: Is there a way to pipeline HOST_BIT write-back on both vendors?
  • The typical concern: if a bus-crossing write-back is serialized, the cost is limited by bus speed
  • TODO: Can one use ‘volatile’ qualifiers on any vendor to force write-through the bus?

GPU-Write-CPU-Read - Uncached AMD

  • AMD provides DEVICE_UNCACHED_BIT_AMD for memory type
  • This provides a way to force a write to cross the bus and bypass GPU caches
  • TODO: Starting topics which need to be discussed
  • Ability to saturate and stall on bus-crossing writes?

TODO SECTION[i]

Collecting notes from others to comment on when time allows …

John Brooks List and Related Comments

Reducing over time to what isn't commented on ...

1) Direct ptrs to LDS sram

2) Lock cache ways for manual caching

3) > 64KB sram (PS3 SPU had 256KB in 2005)

4) CPU-style stack

5) Function ptrs

6) Linking & libraries

7) Ability to write native assembly

8) Ability to control wave priority & sleep/wake

GPU compilers need to evolve into ptr/function/library paradigm instead of monolithic all-in-one codegen

I also bring data into LDS sram and do loads/stores into LDS so I can NOP lanes/insns by using LDS ptrs >64KB (ie null store) to avoid branches around code blocks that contain stores.”

Pointers

Repurposing memory (buffers + pointers + different types)


More Work in Progress

RDNA4 Notes

ISA Doc: https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna4-instruction-set-architecture.pdf

Workgroup Barriers

RDNA4 Reference

  • Workgroup has 64 barriers

NVIDIA PTX Reference

WMMA Sub-Topic

If we look at RDNA 3.5, the hardware is well documented in AMD's ISA Guide and dead easy to understand: just 16x16 element matrices

https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna35_instruction_set_architecture.pdf

But it has some serious pain points for traditional shader programming

  • Each input {A,B} matrix just sits in lanes {0 to 15}
  • But the other lanes {16-63} need to duplicate the data (wave64 shader)
  • So at one point the algorithm might run wave64, but WMMAs require reducing to wave16 logic with duplication in the rest
  • And then the result/accumulation matrix gets mapped across wave64 with a mix of rows in a given VGPR

There is a limit to the practical size of the matrix based on how many VGPRs one wants to use for a column, 16-tall is 8 VGPRs of packed 16-bit data for example

If one wanted to get serious about using WMMAs

  • It would be massively better to go back to SIMD16 physical hardware!
  • And have explicit sub-vector(wave16)/full-vector(wave64) execution
  • This way you can efficiently work in wave16 traditional shader logic
  • And also get the extreme register file size/lane that you'd want
  • For instance a mixed WMMA/shader algorithm could just run in wave16!

Bandwidth Pain Point

One way to ease the bandwidth pain point: keep matrix A streaming through data while matrix B is held constant

  • But note you compromise your working space by holding matrices in VGPRs
  • In some respects: it's moving part of a program into VGPRs and using register file as I$

Hopefully it's obvious now that persistent waves were the answer all along

  • The same group that didn't or still doesn't support explicit persistent waves
  • Is now relying on them wholesale for efficient computation in the ML gold rush (irony)
  • Do the rest of us a favor and make the general persistent wave toolset available
  • Query how many workgroups fit on the GPU given a PSO
  • Bonus: all the cool hardware changes that could be done (separate topic)

Sparsity

  • Better to think about things this way
  • Sparse matrix encoding is a form of compression for the matrix that is the baked "program"
  • That reduces the number of FMAs ultimately
  • The sparse matrix encoding is effectively embedded opcode bits
  • It's just MUXing operands
  • Same as the routing network in a systolic array
  • In the program held in the VGPRs (the matrix)
  • Not really something one could leverage for the dynamic data
  • TODO: Opens up the side question
  • Would there ever be utility in general compression of VGPR data beyond simple stuff like packing smaller types?
  • And the other one, what about embedding more "control" data in VGPR data
  • For more efficient execution
  • Aka VGPR data drives an operand MUX (just like the sparse matrix stuff does)
  • Part that a systolic array gets right and a designed-for-sparse-WMMA gets wrong
  • You can make network routing (aka the MUX) conditional on the resulting sign
  • It's too complex to describe what that enables here ...

Flattening Networks

  • The --ONLY-- "way it was meant to be played"
  • Example
  • Network is say NxN sized but needs M*NxM*N context flattened
  • One way to look at this is the filter kernel spatial window
  • Thus one ends up doing M*M duplication of work to avoid the dreaded DRAM round trip
  • This is why TOPs values are "inflated" and NN "efficiency" is complex
  • The bias towards using temporal networks becomes logical for realtime
  • Moving spatial to temporal keeps the M factor down!
  • The dreaded DRAM round trip
  • When networks reduce spatial dims they tend to expand in vector size
  • So round tripping that even after a "spatial reduction" step is painful

Ultimately bandwidth scaling is a dead-end, one cannot afford mass non-local speculative computation!

Simplified View of WMMA Network Logic

  • It's an extreme form of running both sides of a branch
  • Where the matrix getting reuse is part of the "program"
  • Aka a sparse matrix is simply MUX control for operands
  • One is running all filter kernel answers simultaneously
  • As well as running filter kernel logic to test the result match
  • The non-WMMA non-linear logics act as the selection in a way
  • This is one of the big components of TOP "inflation"
  • And can only do this kind of huge speculative computation if the data is local
  • Hierarchical reduction part of the network is simply factoring of common terms in this process
  • Building a hierarchy of terms or vocabulary to describe a domain

Another way of thinking about this

  • Have a collection of vectors of input in the columns of A
  • And a collection of patterns in the rows of B
  • Resulting dot product of those vectors is the weight of the match
  • Byproduct of doing a MMA
  • Testing all groups of inputs against all patterns
  • Concrete simple example but with tiny 4x4 matrices (sketched after this list)
  • Could have 2x2 texels of luma unwrapped into a 4x1 column
  • Then 2x2 patterns to test against unwrapped into a 4x1 row
  • Like say {-1,-1,1,1} {-1,1,-1,1} {-1,1,1,-1} for {horz,vert,diagonal}
  • Looking at this at 16x16 sizing
  • A lane gets a column
  • Thus a lane likely has a 1:N mapping
  • If say a pixel has 4 attributes, you'd have N=4 to get 16 values
  • This is in sharp contrast to say pixel shading which is 1 lane to 1 pixel
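A minimal GLSL sketch of the tiny 4x4 example above (names are made up); each dot product is the per-{column,row} match weight a small MMA would produce:

--

// Unwrap a 2x2 luma block into a 4x1 column and test it against the three
// unwrapped pattern rows; each dot product is the weight of that match
vec3 MatchPatterns(float p00,float p01,float p10,float p11){
 vec4 c=vec4(p00,p01,p10,p11);          // the 2x2 texels as a column of A
 float horz=dot(c,vec4(-1,-1, 1, 1));   // bottom row minus top row
 float vert=dot(c,vec4(-1, 1,-1, 1));   // right column minus left column
 float diag=dot(c,vec4(-1, 1, 1,-1));   // one diagonal minus the other
 return vec3(horz,vert,diag);}

--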

Choosing a network design is forcing a fixed filter flow (the program structure), and training is figuring out a good set of patterns (the language of the network, the program matrix)

Problems That Don't Map Well To Networks

Simple Neural Networks are effectively a form of data compression

  • It's a lookup like a texture fetch, but computed by a network filter
  • The input vector is the coordinates of the lookup
  • The output vector is the interpolated result
  • The interpolation is multi-stage hierarchical and highly complex
  • When the network becomes stateful (recurrent) it functions like a temporal filter

Some problems can never map efficiently to even complex neural networks

Example, motion reprojection sampling

  • The input to the network would be the {entire image, and the fetch coordinate}
  • Obviously never going to fly, one would just sample that data
  • So anything that maps well to a lookup in a large data structure that cannot be compressed

How about reprojection filtering after sampling?

  • Input to the network would be {the sampling data for the filter window, the sub-pixel offset}
  • One of the better analytical solutions involves
  • Computing filter coefficients from the sub-pixel offset
  • Taking the {min,max} of the inner 2x2 pixels of the window
  • Doing the filter kernel (weighted average, maps well to dot products)
  • But then a highly non-linear clamp of the result by that {min,max}
  • This is something a neural network is CRAP at
  • Good filtering options involve highly non-differentiable functions
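A minimal 1D GLSL sketch of that analytical path (the real filter is 2D over a window, and the Catmull-Rom weights and names here are assumptions, not the author's exact kernel): the weights come from the sub-pixel offset, the weighted sum maps to dot products, and the clamp by the inner taps is the highly non-linear step a network handles poorly.

--

vec3 ReprojectTap1D(vec3 t0,vec3 t1,vec3 t2,vec3 t3,float f){
 // Catmull-Rom weights computed from the sub-pixel offset f in [0,1)
 float f2=f*f,f3=f2*f;
 float w0=0.5*(-f+2.0*f2-f3);
 float w1=0.5*( 2.0-5.0*f2+3.0*f3);
 float w2=0.5*( f+4.0*f2-3.0*f3);
 float w3=0.5*(-f2+f3);
 vec3 blended=t0*w0+t1*w1+t2*w2+t3*w3;  // the part that maps well to dot products
 // Negative lobes overshoot, so clamp by the inner taps: non-differentiable
 return clamp(blended,min(t1,t2),max(t1,t2));}

--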

Problems that map to extremely sparse networks are obviously not efficient

TODO ...

 

[a]Conditional branches consume values and we can already label them

SPIR-V has the Subgroup and especially SubgroupId annotations that can be applied to any _value_, and even allow being more specific about the scope of uniformity (subgroup, workgroup, etc)

AFAIK these are not emitted by any compiler for you, but you can use intrinsics to label values, in fact it's a requirement for non-uniform descriptor indexing.

Better HLL compilers can and maybe one day will emit it for everything (ours has the ability to track that information using local type analysis but doesn't emit it for now)

Unsure if any device compilers actually leverage these, but they totally could do so today

[b]It would be cute in a similar way to be able to hint about the coherence of memory accesses (expected cache misses). This could be something that is added after PGO.


[c]This is... an IL... like today. Realistically, it seems that the GPUs ISAs are still radically evolving every generation or two, seems unwise to cap that innovation only to allow people to play with low-level ASM. Maybe one day when GPUs are totally "boring" that would be the road - but I think we'll observe it in practice (less changes in ISAs) - not impose it as a standard for reasons. Standards are the death of innovation.


[d]And even with cpus nowadays it's still a time sink, you'd have to write x64 and arm64 versions and if you care about 32-bit x86 and arm versions of the code (maybe RISC-V in the future). Rather than just writing the code once. On GPUs on desktop this would mean you have to do it for AMD, Intel, NV, QCOM and then foreach architecture... Very impractical. On mobile likely no one would get it right either :)

EDIT: Ah, maybe like embedding SPIR-V, but even that'd not be so useful since it's only intermediate. And you'd still need a separate one for DXIL and/or MSIL.

[e]Why HOST_CACHED? HOST_CACHED memory is the equivalent of DX12 READBACK heap, while HOST_VISIBLE without HOST_CACHED is the UPLOAD heap - still system RAM but uncached and write-combined, good for CPU writes and GPU reads.

[f]For viewing AMD ISA, godbolt.org recently added support for HLSL+RGA

[g]Isn't it the case that audio requires very stringent latency requirements, whilst GPUs are great at throughput but are terrible at latency?


[h]This article may be useful: https://gpuopen.com/learn/get-the-most-out-of-smart-access-memory/ We did lots of experiments, we had tons of data. Unfortunately we had to water down the statements for this article to be so generic - no absolute statements, no numbers. You know how it is...

Anyway, the difference between this and DEVICE_LOCAL is really the question when do you want to cross the PCIe bus - when writing from the CPU (non-DEVICE_LOCAL) or when reading on the GPU (DEVICE_LOCAL).

About the perf of CPU writes, when using PCIe 4.0, writes to the VRAM are not as efficient as to the system RAM but the same order of magnitude. Maybe a few times slower. Definitely not like 100x slower.

[i]Some things that I haven't seen in the list:

- Allow raster to not generate quads / provide differentials

- Allow registers to be "dropped" at a given point in the shader execution (i.e. in the same shader evaluate some preamble, decide what path in the shader is really needed, drop the registers that the given path does not need to use)