A | B | C | D | E | F | G | H | I | J | |
---|---|---|---|---|---|---|---|---|---|---|
1 | This is a list of D3D11 vendor/driver hacks, inspired by Aras's list of D3D9 GPU Hacks. Nothing here is natively available in D3D11 using FEATURE_LEVEL_11_0, which is the maximum feature level (FL) supported by Windows 7. | |||||||||
2 | The "support" columns indicate the minimum GPU on which you can use the listed extension(s) for that column. They do not necessarily reflect the actual capabilities of the hardware, only the functionality that's exposed through D3D11 extensions. Note that entries in these columns with a question mark are unconfirmed, and are just my best guess at the moment. Please let me know if you have information about the supported hardware, or can help confirm hardware support for a particular feature. | |||||||||
3 | This spreadsheet is maintained by Matt Pettineo (MJP) @MyNameIsMJP https://therealmjp.github.io/ | |||||||||
4 | ||||||||||
5 | ========================= Rendering Hacks ============================================== | |||||||||
6 | ||||||||||
7 | Feature | Description | NVAPI Function(s) | AGS Function(s) | IGFX Function(s) | NV Support | AMD Support | Intel Support | D3D11.x/D3D12 Support | References |
8 | UAV Overlap | Tells the driver to skip synchronization between draw or dispatch calls that use UAVs. Normally the driver will sync after each draw or dispatch that writes to a UAV, in order to prevent hazards when two threads try to access the same area of memory. Using this extension can allow multiple Draws/Dispatches to run in parallel on the GPU, and will also let you keep your QA team busy with difficult-to-repro sync bugs! | NvAPI_D3D11_BeginUAVOverlapEx NvAPI_D3D11_EndUAVOverlap | agsDriverExtensionsDX11_BeginUAVOverlap agsDriverExtensionsDX11_EndUAVOverlap | N/A | Fermi? | Southern Islands | None | Equivalent behavior can be obtained by omitting barriers in D3D12 | https://gpuopen-librariesandsdks.github.io/ags/group__dx11misc.html#ga16f7cfc4d3c436b211f299341e25c801 https://gpuopen-librariesandsdks.github.io/ags/group__dx11misc.html#gae22fcecf7799dfd5aae4bfd308e6444e http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gac3f34cbd997bdb51478ada50255a9dd7 http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gaeb78a97e256f3c6c511451dded3994e5 http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gaed6aaa526d5a4729d9524039eae4c825 |
9 | Depth Bounds Test | Rejects all pixels whose depth falls outside of a range specified by a minimum and maximum depth. Originally developed for accelerating stencil shadows, but can also be used when accumulating deferred lights or projective decals. | NvAPI_D3D11_SetDepthBoundsTest | agsDriverExtensionsDX11_SetDepthBounds | N/A | Fermi? | Southern Islands | None | Optional support as of Windows 10 version 1703 (AKA Creator's Update). NVAPI also has a D3D12 extension. | https://gpuopen-librariesandsdks.github.io/ags/group__dx11misc.html#gaf1635db8ecaaefa20b4950a9191fdcb6 http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#ga0502f9d58555b662a3b6fcc9b61b7d2a |
10 | Forced MSAA Sample Count | Forces a specified MSAA sample count regardless of the render targets and depth targets bound. Can be used to implement MSAA variants that don't require the full storage and bandwidth cost of MSAA render targets. | NvAPI_D3D11_CreateRasterizerState NvAPI_D3D11_RASTERIZER_DESC_EX::ForcedSampleCount | N/A | N/A | Fermi? | None | None | Target independent rasterization provides equivalent functionality, available in D3D11.1 and D3D12 with FL 11_1 | http://developer.download.nvidia.com/assets/events/GDC15/GEFORCE/Maxwell_Archictecture_GDC15.pdf http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gaa05c2e42fdf9f7ead12acb291f1b9444 |
11 | Programmable MSAA Sample Positions | Allows specifying the location of MSAA sample points within a pixel. Can be used to implement interleaved sampling, jittered sampling, or poor man's decoupled shading. | NvAPI_D3D11_CreateRasterizerState NvAPI_D3D11_RASTERIZER_DESC_EX::ProgrammableSamplePositionsEnable NvAPI_D3D11_RASTERIZER_DESC_EX::InterleavedSamplingEnable NvAPI_D3D11_RASTERIZER_DESC_EX::SamplePositionsX NvAPI_D3D11_RASTERIZER_DESC_EX::SamplePositionsY | N/A | N/A | Maxwell 2.0 | None | None | Optional support with multiple tiers as of Windows 10 version 1703 (AKA Creator's Update). NVAPI has a D3D12 extension. | https://mynameismjp.wordpress.com/2015/09/13/programmable-sample-points/ http://www.geforce.com/hardware/technology/mfaa/technology https://developer.nvidia.com/sites/default/files/akamai/opengl/specs/GL_NV_sample_locations.txt http://developer.download.nvidia.com/assets/events/GDC15/GEFORCE/Maxwell_Archictecture_GDC15.pdf http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gaa05c2e42fdf9f7ead12acb291f1b9444 |
12 | Conservative Rasterization | Causes a pixel to be shaded if any part of the pixel is covered by a primitive, instead of only testing at 1 or more sample points. Useful for voxelization, occlusion culling, analytical antialiasing, or tiled light binning. Note that using this will typically result in vertex attributes being extrapolated past triangle edges, since they will still be interpolated to the pixel center before shading. | NvAPI_D3D11_CreateRasterizerState NvAPI_D3D11_RASTERIZER_DESC_EX::ConservativeRasterEnable | N/A | N/A | Maxwell 2.0 | None | None | Optional support with multiple tiers in D3D12 and D3D11.3 with FL 12_1 | https://developer.nvidia.com/content/dont-be-conservative-conservative-rasterization http://developer.download.nvidia.com/assets/events/GDC15/GEFORCE/Maxwell_Archictecture_GDC15.pdf http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gaa05c2e42fdf9f7ead12acb291f1b9444 |
13 | Quad Filling | Causes all pixels within a triangle's screen-space AABB to be shaded. Can also enable a mode where the entire viewport is shaded (Nvidia only). On AMD, vertex attributes are not properly interpolated, so only SV_Position will be valid. | NvAPI_D3D11_CreateRasterizerState NvAPI_D3D11_RASTERIZER_DESC_EX::QuadFillMode | agsDriverExtensionsDX11_IASetPrimitiveTopology | N/A | Maxwell 2.0 | Southern Islands | None | None. (NVAPI has a D3D12 extension) | http://developer.download.nvidia.com/assets/events/GDC15/GEFORCE/Maxwell_Archictecture_GDC15.pdf https://www.opengl.org/registry/specs/NV/fill_rectangle.txt http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gaa05c2e42fdf9f7ead12acb291f1b9444 https://gpuopen-librariesandsdks.github.io/ags/group__dx11misc.html#gaa5367888466f032a79c6869402282c5f |
14 | Post-Z Coverage | Causes SV_Coverage to reflect the active sample points after performing the depth test. Only applicable when SV_Coverage is used as an input to the PS. | NvAPI_D3D11_CreateRasterizerState NvAPI_D3D11_RASTERIZER_DESC_EX::PostZCoverageEnable | N/A | N/A | Maxwell 2.0 | None | None | None. (NVAPI has a D3D12 extension) | http://developer.download.nvidia.com/assets/events/GDC15/GEFORCE/Maxwell_Archictecture_GDC15.pdf https://developer.nvidia.com/sites/default/files/akamai/opengl/specs/GL_EXT_post_depth_coverage.txt http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gaa05c2e42fdf9f7ead12acb291f1b9444 |
15 | Coverage to Color | Causes an SV_Coverage mask to be converted to a [0, 1] floating point value and multiplied with the PS output color. | NvAPI_D3D11_CreateRasterizerState NvAPI_D3D11_RASTERIZER_DESC_EX::CoverageToColorEnable NvAPI_D3D11_RASTERIZER_DESC_EX::CoverageToColorRTIndex | N/A | N/A | Maxwell 2.0 | None | None | None. (NVAPI has a D3D12 extension) | http://developer.download.nvidia.com/assets/events/GDC15/GEFORCE/Maxwell_Archictecture_GDC15.pdf https://developer.nvidia.com/sites/default/files/akamai/opengl/specs/GL_NV_fragment_coverage_to_color.txt http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gaa05c2e42fdf9f7ead12acb291f1b9444 |
16 | Alias MSAA texture as a non-MSAA texture | Creates an alias of an MSAA texture that can be viewed as a non-MSAA texture in shaders that read from it. The width and height of the alias are either doubled or quadrupled depending on the MSAA mode: a 2xMSAA alias will have 2x the width, a 4xMSAA alias will have 2x the width and 2x the height, and an 8xMSAA alias will have 4x the width and 2x the height. Possibly useful for using HW bilinear filtering when performing MSAA resolve. | NvAPI_D3D11_AliasMSAATexture2DAsNonMSAA | N/A | N/A | Fermi? | N/A | None | None. (NVAPI has a D3D12 extension) | http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#ga4f6364cba8cc3a6cbd45d282b413b03d |
17 | MultiDrawIndirect | Like DrawInstancedIndirect and DrawIndexedInstancedIndirect, except there's an additional parameter for the draw count. The GPU then loops over the draw count, and indexes into a buffer containing args for each draw. May cause your GPU's frontend to have a nervous breakdown. NOTE: Nvidia's version only supports passing a draw count from the CPU, while AMD supports both CPU-side and GPU-side draw counts. | NvAPI_D3D11_MultiDrawInstancedIndirect NvAPI_D3D11_MultiDrawIndexedInstancedIndirect | agsDriverExtensionsDX11_MultiDrawInstancedIndirect agsDriverExtensionsDX11_MultiDrawIndexedInstancedIndirect agsDriverExtensionsDX11_MultiDrawInstancedIndirectCountIndirect agsDriverExtensionsDX11_MultiDrawIndexedInstancedIndirectCountIndirect | N/A | Fermi? | Southern Islands | None | D3D12 natively supports ExecuteIndirect, which is a superset of MultiDrawIndirect functionality | https://gpuopen-librariesandsdks.github.io/ags/group__mdi.html#ga61b8abec809f1a11768d7fb9ae34ec1d https://gpuopen-librariesandsdks.github.io/ags/group__mdi.html#ga32a90d7d4e3b0f5a2fbbb8e2a6d49016 https://gpuopen-librariesandsdks.github.io/ags/group__mdi.html#gab94ccbaabcf176631416e73bdfca99e0 https://gpuopen-librariesandsdks.github.io/ags/group__mdi.html#gac1dbfb2ec7f0918450b5a02de4d058f8 |
18 | Quad List Primitives | Enables rendering using a list of quads instead of triangles. Pretend that you're developing for the Sega Saturn! | N/A | agsDriverExtensionsDX11_IASetPrimitiveTopology | N/A | N/A | Southern Islands | None | None | https://gpuopen-librariesandsdks.github.io/ags/group__dx11misc.html#gaf1635db8ecaaefa20b4950a9191fdcb6 |
19 | Multi-View Rendering | Allows replicating your draw calls to multiple viewports and/or render target array slices. The intended use case is stereoscopic rendering for 3D or VR, which requires drawing and rasterizing your geometry twice in the simplest case. Nvidia's version is called "single-pass stereo", and lets you specify separate post-projection X values for each eye from the vertex/geometry/domain shader. AMD's version lets you specify a viewport mask with optional clipping rectangles, and also lets you access the viewport/RT slice index in the shader. | NvAPI_D3D_SetSinglePassStereoMode NvAPI_D3D_QuerySinglePassStereoSupport | agsDriverExtensionsDX11_SetViewBroadcastMasks agsDriverExtensionsDX11_GetMaxClipRects agsDriverExtensionsDX11_SetClipRects AmdDxExtShaderIntrinsics_GetViewportIndex AmdDxExtShaderIntrinsics_GetViewportIndexPsOnly AmdDxExtShaderIntrinsics_GetRTArraySlice AmdDxExtShaderIntrinsics_GetRTArraySlicePsOnly | N/A | Pascal | Southern Islands? | None | Optional tiered support as of Windows 10 version 1703 (AKA Creator's Update). The highest tier level includes shader support for SV_ViewID, which also controls which shader stages are replicated per-view. | http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#ga874782dc7d22a7a946164fb2047b504f http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gaf55ec1713d3d5a4a9a933f1cd020ad19 https://gpuopen-librariesandsdks.github.io/ags/group__multiview.html#gaa5f9d9b7b45d88824c03ff397036664d https://developer.nvidia.com/pascal-vr-tech |
20 | Modified Post-Projection W | Lets you specify coefficients that modify the post-projection W component, with separate coefficients per viewport. The main use case is what Nvidia calls "Lens-Matched Shading", which effectively lets you taper off the rasterization/shading resolution towards the edges of a single view, which better matches the non-linear warping that's applied to images before they are displayed in a VR headset. | NvAPI_D3D_SetModifiedWMode NvAPI_D3D_QueryModifiedWSupport | N/A | N/A | Pascal | N/A | None | None. (NVAPI has a D3D12 extension) | http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gab45e2704ef90ef7b1276761862e05c73 http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gaa71a96da9ed91f3bcbeb156df601bbd3 https://developer.nvidia.com/pascal-vr-tech Also see the MultiProjection sample in Nvidia's VRWorks SDK |
21 | Late Latching | Normally constant buffers need to be updated by the CPU before issuing draw/dispatch calls, which means in practice the update happens well before the GPU actually executes the draw/dispatch. Late latching lets the CPU update the buffer just before the GPU reads from it, hence the "late" part of the name. The primary use case is VR, where you want to query the headset's pose and update the camera matrices as late as possible in order to reduce motion-to-photon latency. | NvAPI_D3D_CreateLateLatchObject NvAPI_D3D_QueryLateLatchSupport | N/A | N/A | Maxwell 2.0? | N/A | None | D3D12 by nature gives you significantly more control over resource updating and CPU/GPU synchronization. There's also the ID3D12GraphicsCommandList1::AtomicCopyBufferUINT function that's available as of Windows 10 version 1703, which lets you atomically update and copy as a single step. | http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#gaf9fde46181b681155c375584e488d17f http://docs.nvidia.com/gameworks/content/gameworkslibrary/coresdk/nvapi/group__dx.html#ga34b758d8ad67c7e870d9ab14bfd78a90 Also see the LateLatch sample in Nvidia's VRWorks SDK |
22 | Driver Shader Compiler Control | Drivers need to JIT compile DXBC bytecode into the native ISA of the GPU before it can run a shader program. Drivers will often handle this by spawning background threads to compile the shaders asynchronously when the D3D shader object is created, and will possibly have to sync on those threads when a draw/dispatch is issued that uses the shader (which can cause hitching). These background threads can compete with the game's threads, and if a game is already creating the shaders as part of an async loading thread then you may be better off telling the driver not to spawn its own tasks. | N/A | agsDriverExtensionsDX11_SetMaxAsyncCompileThreadCount agsDriverExtensionsDX11_NumPendingAsyncCompileJobs agsDriverExtensionsDX11_SetDiskShaderCacheEnabled | N/A | N/A | Southern Islands | None | In D3D12 you're guaranteed to get compiled ISA when you create the PSO, since all of the required state is known up-front. Later versions of Windows 10 (1903 and later) also allow the driver to recompile the shaders in the background after PSO creation, which they may do as a form of profile-guided optimization. | https://gpuopen-librariesandsdks.github.io/ags/group__shadercompiler.html https://devblogs.microsoft.com/directx/background-shader-optimizations/ https://docs.microsoft.com/en-us/windows/win32/api/d3d12/nf-d3d12-id3d12device6-setbackgroundprocessingmode |
23 | ||||||||||
24 | ========================= Shader Hacks ============================================== | |||||||||
25 | ||||||||||
26 | Feature | Description | NVAPI Function(s) | AGS Function(s) | IGFX Function(s) | NV Support | AMD Support | Intel Support | D3D11.x/D3D12 Support | References |
27 | PixelSync | Enforces ordered access for UAV r/w operations based on primitive submission order. Useful for OIT, volumetric shadows, programmable blending, voxelization, and solving world hunger. | N/A | N/A | IntelExt_BeginPixelShaderOrdering IntelExt_BeginPixelShaderOrderingOnUAV | None | None | Haswell | Rasterizer Ordered Views provide equivalent functionality in D3D11.3 and D3D12 with FL 12_1 | http://advances.realtimerendering.com/s2013/2013-07-23-SIGGRAPH-PixelSync.pdf https://software.intel.com/en-us/articles/programmable-blend-with-pixel-shader-ordering https://software.intel.com/en-us/blogs/2013/07/18/order-independent-transparency-approximation-with-pixel-synchronization https://software.intel.com/en-us/blogs/2013/03/27/adaptive-volumetric-shadow-maps |
28 | "Fast" Geometry Shader | Allows creating a "pass-through" GS that can output a mask indicating which viewports a triangle should be rasterized to. Useful for multi-resolution VR, cubemap rendering, and voxelization. Graphics programmers remain skeptical after 9 years of being let down by geometry shaders. | NvAPI_D3D11_CreateFastGeometryShader NvAPI_D3D11_CreateFastGeometryShaderExplicit | N/A | N/A | Maxwell 2.0 | None | None | None. (NVAPI has a D3D12 extension) | http://developer.download.nvidia.com/assets/events/GDC15/GEFORCE/Maxwell_Archictecture_GDC15.pdf https://developer.nvidia.com/virtual-reality-development https://developer.nvidia.com/sites/default/files/akamai/gameworks/vr/GameWorks_VR_2015_Final_handouts.pdf https://developer.nvidia.com/sites/default/files/akamai/opengl/specs/GL_NV_geometry_shader_passthrough.txt https://www.opengl.org/registry/specs/NV/viewport_array2.txt Also, see Nvidia's Multi-ResVR sample included in the Gameworks VR SDK |
29 | Lane Shuffles | Performs a SIMD shuffle between the lanes of a warp, or a subgroup inside of a warp. Can broadcast the value from one lane to all lanes, shuffle based on a delta, or shuffle based on an XOR of the lane ID. Useful for fast reductions that don't require shared memory. Also useful for letting the world know that you're ready to stop pretending that GPUs aren't SIMD. | NvShfl NvShflUp NvShflDown NvShflXor | AmdDxExtShaderIntrinsics_ReadfirstlaneF/U AmdDxExtShaderIntrinsics_ReadlaneF/U AmdDxExtShaderIntrinsics_SwizzleF/U | IntelExt_WaveReadLaneFirst IntelExt_WaveReadLaneAt IntelExt_QuadReadAcrossDiagonal IntelExt_QuadReadLaneAt IntelExt_QuadReadAcrossX IntelExt_QuadReadAcrossY | Kepler | Southern Islands | Haswell? | Supported in D3D12 with Shader Model 6.0 | http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/ https://www.opengl.org/registry/specs/NV/shader_thread_shuffle.txt http://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-shuffle-functions https://developer.nvidia.com/unlocking-gpu-intrinsics-hlsl https://developer.nvidia.com/reading-between-threads-shader-intrinsics http://gpuopen.com/gcn-shader-extensions-for-direct3d-and-vulkan/ https://github.com/intel/intel-graphics-compiler/blob/master/inc/IntelExtensions.hlsl |
30 | Lane Voting | Allows usage of ballot/any/all functionality in a shader. Any() will return true if a value is true on any lane of a warp. All() will return true if a value is true on all lanes of a warp. Ballot() will return a bitfield where each bit represents whether or not the specified value was true for the corresponding lane of the warp. | NvAny NvAll NvBallot | AmdDxExtShaderIntrinsics_Ballot AmdDxExtShaderIntrinsics_BallotAny AmdDxExtShaderIntrinsics_BallotAll | IntelExt_WaveActiveBallot IntelExt_WaveActiveAllTrue IntelExt_WaveActiveAllEqual IntelExt_WaveAll | Fermi | Southern Islands | Haswell? | Supported in D3D12 with Shader Model 6.0 | http://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-vote-functions https://www.opengl.org/registry/specs/NV/shader_thread_group.txt https://developer.nvidia.com/unlocking-gpu-intrinsics-hlsl https://developer.nvidia.com/reading-between-threads-shader-intrinsics http://gpuopen.com/gcn-shader-extensions-for-direct3d-and-vulkan/ https://github.com/intel/intel-graphics-compiler/blob/master/inc/IntelExtensions.hlsl |
31 | Lane ID | Returns the current thread's lane ID within a warp/wavefront. | NvGetLaneId | AmdDxExtShaderIntrinsics_LaneId | IntelExt_WaveGetLaneIndex | Kepler | Southern Islands | Haswell? | Supported in D3D12 with Shader Model 6.0 | https://developer.nvidia.com/unlocking-gpu-intrinsics-hlsl https://developer.nvidia.com/reading-between-threads-shader-intrinsics http://gpuopen.com/gcn-shader-extensions-for-direct3d-and-vulkan/ https://github.com/intel/intel-graphics-compiler/blob/master/inc/IntelExtensions.hlsl |
32 | Count Active Lanes | Counts the number of active lanes within a warp/wavefront that have an index less than the current lane. | N/A | AmdDxExtShaderIntrinsics_MBCnt | N/A | None | Southern Islands | None | Supported in D3D12 with Shader Model 6.0 | http://gpuopen.com/gcn-shader-extensions-for-direct3d-and-vulkan/ |
33 | FP32 Atomics | Perform atomic adds on fp32 values in RWByteAddressBuffers or RWTextures. | NvInterlockedAddFp32 | N/A | N/A | Kepler | None | None | None. | https://developer.nvidia.com/unlocking-gpu-intrinsics-hlsl |
34 | FP16 Atomics | Perform atomic add/min/max on groups of 2 or 4 fp16 values in RWByteAddressBuffers or RWTextures. | NvInterlockedAddFp16x2 NvInterlockedMinFp16x2 NvInterlockedMaxFp16x2 NvInterlockedAddFp16x4 NvInterlockedMinFp16x4 NvInterlockedMaxFp16x4 | N/A | N/A | Maxwell 2.0 | None | None | None. | http://developer.download.nvidia.com/assets/events/GDC15/GEFORCE/Maxwell_Archictecture_GDC15.pdf https://developer.nvidia.com/sites/default/files/akamai/opengl/specs/GL_NV_shader_atomic_fp16_vector.txt https://developer.nvidia.com/unlocking-gpu-intrinsics-hlsl |
35 | U64 Atomics | Perform atomic add/min/max/and/or/xor/exchange on a 64-bit unsigned integer in RWByteAddressBuffers or RWTextures. | N/A | AmdDxExtShaderIntrinsics_AtomicOp | N/A | None | Southern Islands | None | None. | https://github.com/GPUOpen-LibrariesAndSDKs/AGS_SDK/blob/master/ags_lib/hlsl/ags_shader_intrinsics_dx11.hlsl |
36 | UAV Typed Loads | Allows reading from Texture UAVs that have formats other than R32_UINT/R32_SINT/R32_FLOAT*, effectively bypassing The Most Annoying Restriction In The History Of Graphics APIs™. *Stock D3D11 does allow aliasing most 32-bit formats (such as R8G8B8A8_UNORM) as R32_UINT, allowing for manual packing and unpacking. | NvLoadUavTyped | N/A | N/A | Fermi? | None | None | Natively supported for at least 18 formats in D3D11.3/D3D12 with FL 12_0, optional support for the rest. | https://msdn.microsoft.com/en-us/library/windows/desktop/ff728749(v=vs.85).aspx |
37 | 3-parameter Min/Max/Med | Returns the min, max or median value from a set of 3 parameters. | N/A | AmdDxExtShaderIntrinsics_Min3F/U AmdDxExtShaderIntrinsics_Med3F/U AmdDxExtShaderIntrinsics_Max3F/U | N/A | None | Southern Islands | None | None. (AGS has D3D12 support) | http://gpuopen.com/gcn-shader-extensions-for-direct3d-and-vulkan/ |
38 | Barycentrics and Interpolation | Provides the pixel shader with access to the barycentrics used for interpolating vertex attributes, allowing for programmable interpolation. Also handy for implementing deferred rendering with visibility buffers. | N/A | AmdDxExtShaderIntrinsics_IjBarycentricCoords AmdDxExtShaderIntrinsics_PullModelBarycentricCoords AmdDxExtShaderIntrinsics_VertexParameter AmdDxExtShaderIntrinsics_VertexParameterComponent | N/A | None | Southern Islands | None | Optional support in D3D12 with Shader Model 6.1. (AGS has D3D12 support) | http://gpuopen.com/gcn-shader-extensions-for-direct3d-and-vulkan/ http://gpuopen.com/gaming-product/barycentrics12-dx12-gcnshader-ext-sample/ |
39 | Wave Reduction | Performs an operation on all active lanes of the current wavefront (such as a sum, min, or max) and returns the result. It's simpler and faster than using thread group shared memory to do the same thing! | N/A | AmdDxExtShaderIntrinsics_WaveReduce AmdDxExtShaderIntrinsics_WaveActiveSum AmdDxExtShaderIntrinsics_WaveActiveProduct AmdDxExtShaderIntrinsics_WaveActiveMin AmdDxExtShaderIntrinsics_WaveActiveMax AmdDxExtShaderIntrinsics_WaveActiveBitAnd AmdDxExtShaderIntrinsics_WaveActiveBitOr AmdDxExtShaderIntrinsics_WaveActiveBitXor | IntelExt_WaveActiveBitAnd IntelExt_WaveActiveBitOr IntelExt_WaveActiveCountBits IntelExt_WaveActiveMax IntelExt_WaveActiveMin IntelExt_WaveActiveProduct IntelExt_WaveActiveSum | None | Southern Islands | Haswell? | Native support in D3D12 with Shader Model 6.0. | https://gpuopen.com/amd-gpu-services-5-1-1/ https://github.com/intel/intel-graphics-compiler/blob/master/inc/IntelExtensions.hlsl |
40 | Wave Scan | Similar to wave reductions, except these perform the operation on all active lanes prior to your own lane. So if you ran a prefix sum on lane 4, it would give you the sum of the value from lanes 0, 1, 2, and 3. A postfix sum would do the same, but would also include your own lane (Nvidia uses the terms "exclusive" and "inclusive" to mean the same thing as prefix/postfix). | NvWaveMultiPrefixInclusiveAdd NvWaveMultiPrefixExclusiveAdd NvWaveMultiPrefixInclusiveAnd NvWaveMultiPrefixExclusiveAnd NvWaveMultiPrefixInclusiveOr NvWaveMultiPrefixExclusiveOr NvWaveMultiPrefixInclusiveXOr NvWaveMultiPrefixExclusiveXOr | AmdDxExtShaderIntrinsics_WaveScan AmdDxExtShaderIntrinsics_WavePrefixSum AmdDxExtShaderIntrinsics_WavePrefixProduct AmdDxExtShaderIntrinsics_WavePrefixMin AmdDxExtShaderIntrinsics_WavePrefixMax AmdDxExtShaderIntrinsics_WavePostfixSum AmdDxExtShaderIntrinsics_WavePostfixProduct AmdDxExtShaderIntrinsics_WavePostfixMin AmdDxExtShaderIntrinsics_WavePostfixMax | IntelExt_WavePrefixCountBits IntelExt_WavePrefixProduct IntelExt_WavePrefixSum | Kepler? | Southern Islands | Haswell? | Native support in D3D12 with Shader Model 6.0 (no postfix intrinsics, but these can be trivially implemented on your own). | https://gpuopen.com/amd-gpu-services-5-1-1/ https://github.com/intel/intel-graphics-compiler/blob/master/inc/IntelExtensions.hlsl |
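
The prefix/postfix distinction described in the Wave Scan row can be illustrated with a small CPU-side model. This is only a sketch of the semantics, not any vendor's intrinsic: `wavePrefixSum` and `wavePostfixSum` are hypothetical names, the wave is modeled as a plain array with one element per lane, and returning 0 for inactive lanes is an assumption made for the example (on real hardware an inactive lane simply does not execute).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of a wave-wide prefix ("exclusive") sum.
// values[i] is lane i's input; active[i] says whether lane i is executing.
// Inactive lanes contribute nothing and are assigned 0 in this sketch.
std::vector<int> wavePrefixSum(const std::vector<int>& values,
                               const std::vector<bool>& active) {
    assert(values.size() == active.size());
    std::vector<int> result(values.size(), 0);
    int running = 0;  // sum over active lanes with a lower index
    for (std::size_t lane = 0; lane < values.size(); ++lane) {
        if (!active[lane]) continue;
        result[lane] = running;    // excludes this lane's own value
        running += values[lane];
    }
    return result;
}

// Postfix ("inclusive") sum: the prefix sum plus the lane's own value.
std::vector<int> wavePostfixSum(const std::vector<int>& values,
                                const std::vector<bool>& active) {
    std::vector<int> result = wavePrefixSum(values, active);
    for (std::size_t lane = 0; lane < values.size(); ++lane) {
        if (active[lane]) result[lane] += values[lane];
    }
    return result;
}
```

With inputs 1, 2, 3, 4, 5 on lanes 0 through 4 and every lane active, lane 4 gets 10 from the prefix sum (lanes 0 to 3) and 15 from the postfix sum. The second function also shows why a missing postfix intrinsic is easy to work around: it is the prefix result combined with the lane's own value.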