FSR 4

FSR 4 Performance Mode with Frame Generation Shown in Call of Duty Black Ops 6 (20:41)

Yesterday, in its RX 9070 XT launch presentation, AMD briefly discussed the technical specifications of FSR 4 and the nature of its machine learning model.

Question: “Is FSR 4 a transformer model like the new DLSS 4?”

Answer:

Yes and no - it is highly likely to be a novel “transformer lite” type of model that is influenced by the design of transformer models, but retains a convolutional neural network (CNN) backbone similar to DLSS 2, XeSS, and PSSR.

Quotes from AMD’s announcement:

1.        “Our new technology leverages a proprietary hybrid model resulting from extensive research across different types and combinations of neural networks and unique training techniques.” (19:30)

  • This heavily implies that FSR 4 is a transformer-CNN model hybrid that leverages the best parts of each architecture. (Explained later with recent research papers)
  • Enables image quality close to DLSS 4, although likely not surpassing it.
  • An improvement over purely CNN-based upscaling models like PSSR, XeSS, and even DLSS 3.

2.         “... and optimized it [FSR 4] for the new FP8 ML acceleration in RDNA 4.” (19:25)

  • FSR 4 utilizes enhanced machine learning acceleration not found in RDNA 3 or earlier.
  • Lower-precision floating point data types (FP8) work better for transformer models than integer data types (INT8) do. For CNN models, INT8 is sufficient.
  • Transformers need the fine granularity in small values that floating point offers more than the uniform range of steps that INT8 provides, as the sketch below illustrates.
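
To make the FP8-versus-INT8 point concrete, here is a toy Python sketch (my own illustration, not AMD code): an FP8 E4M3-style format keeps small quantization steps for small values, while INT8 under a single per-tensor scale spreads 256 uniform steps across the entire range, so small activations get crushed when a few large outliers set the scale. The quantizer below only mimics the 3-bit mantissa rounding and ignores clamping and subnormals.

    import math

    def quantize_int8(x, scale):
        # INT8: 256 uniform steps across [-scale, scale]
        q = max(-128, min(127, round(x / scale * 127)))
        return q * scale / 127

    def quantize_fp8_like(x):
        # FP8 E4M3-style rounding: 3 mantissa bits, so the step size
        # shrinks with the magnitude of the value (no clamp/subnormals here)
        if x == 0.0:
            return 0.0
        m, e = math.frexp(abs(x))   # abs(x) = m * 2**e, with 0.5 <= m < 1
        step = 2.0 ** (e - 4)       # 2**3 = 8 mantissa steps per power of two
        return math.copysign(round(abs(x) / step) * step, x)

    # Small values survive FP8-style rounding but collapse to 0 in INT8
    # when the tensor scale is dominated by large outliers (448 here):
    for v in (0.013, 0.5, 3.7):
        print(v, "->", quantize_fp8_like(v), "vs", quantize_int8(v, scale=448.0))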

3.        “FSR 4 uses the new FP8 data type in the RDNA 4 architecture to balance quality and performance.” (19:39)

  • Confirms previous points about the need to balance quality and performance leading to a hybrid model.
  • Confirms utilization of FP8 acceleration new to RDNA 4.
  • Not directly supported by RDNA 3 (workarounds with much worse performance are possible).

Evidence:


Recent Research Papers:

“A Simple Transformer-style Network for Lightweight Image Super-resolution”

(2023)

“...recently developed methods are computationally expensive and need much more memory. To solve this issue, we propose a simple Transformer-style network (STSN) for the image super resolution (SR) task. The idea of this method is based on using convolutional modulation (Conv2Former), which is a very simple block with a linearly compared to quadratically as in Transformers.” (yes I know it sounds like it’s missing a word but it’s not)

Analysis:

  • Hybrid Machine Learning Model: The combination of a Transformer-style structure with CNN components creates a hybrid. It also includes a multi-layer perceptron (MLP), another nod to Transformer architecture, but the reliance on convolutions for feature extraction and attention-like mechanisms ties it to CNNs. The result is a model that blends Transformer concepts with CNN efficiency.

  • Traditional transformer models, while powerful, are computationally heavy due to their self-attention mechanisms, which scale quadratically with input size (O(n²)). The convolutional-modulation sketch after this list shows the linear-cost alternative.

  • Combines elements of transformer models (attention-inspired blocks and multilayer perceptron) with CNN strengths (local feature extraction via convolutions).

  • Both local feature extraction (a strength of CNNs) and global context (a strength of transformers) can be beneficial for super resolution. This hybrid approach is tailored to balance these needs while optimizing for speed and memory, which matters more on RDNA cards than on RTX cards, which have dedicated AI cores.

  • Upscaling often needs some degree of non-local reasoning, like ensuring consistency across distant parts of the image (matching textures on opposite sides of a face). A CNN model alone might miss this broader synthesis.
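
To see what “linear instead of quadratic” means in practice, here is a minimal PyTorch sketch of the convolutional-modulation idea from Conv2Former that the STSN quote references: the attention-like weights come from a large depthwise convolution and are applied with an elementwise product, so cost grows linearly with pixel count. This is a toy reconstruction of the published idea, not STSN or FSR 4 code, and all layer sizes are invented.

    import torch
    import torch.nn as nn

    class ConvModBlock(nn.Module):
        """Convolutional modulation in the spirit of Conv2Former: attention-like
        weights come from a depthwise conv (linear cost) instead of pairwise
        self-attention (quadratic cost)."""
        def __init__(self, dim: int, kernel_size: int = 11):
            super().__init__()
            self.attn = nn.Sequential(
                nn.Conv2d(dim, dim, 1),
                nn.GELU(),
                nn.Conv2d(dim, dim, kernel_size,
                          padding=kernel_size // 2, groups=dim),
            )
            self.value = nn.Conv2d(dim, dim, 1)
            self.proj = nn.Conv2d(dim, dim, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # The elementwise product plays the role of attention weighting.
            return self.proj(self.attn(x) * self.value(x)) + x

    x = torch.randn(1, 48, 64, 64)    # (batch, channels, height, width)
    print(ConvModBlock(48)(x).shape)  # torch.Size([1, 48, 64, 64])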

“Incorporating Transformer Designs into Convolutions for Lightweight Image Super-Resolution”

(2023)

“...we propose a neighborhood attention (NA) module that upgrades the standard convolution with a self-attention mechanism. The NA module efficiently extracts long-range dependencies in a sliding window pattern, thereby achieving similar performance to large convolutional kernels but with fewer parameters.”

Analysis:

  • By embedding self-attention into a convolutional framework, NA directly incorporates a Transformer design principle (attention) into convolutions.
  • Embedding attention into convolutions and pairing it with transformer-style long-range modeling keeps the CNN’s computational efficiency while avoiding the complexity of a pure transformer model (full attention over every pixel in the current and previous frames). A minimal version of this windowed attention is sketched below.
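
A minimal PyTorch sketch of the neighborhood-attention idea (my own illustration, not the paper’s code): each pixel computes genuine softmax attention, but only over a k x k sliding window of keys and values, so cost stays linear in image size. The unfold-based gather is written for clarity, not efficiency.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class NeighborhoodAttention(nn.Module):
        """Sliding-window self-attention: each pixel attends only to its
        k x k neighborhood instead of the whole frame."""
        def __init__(self, dim: int, window: int = 7):
            super().__init__()
            self.window, self.scale = window, dim ** -0.5
            self.qkv = nn.Conv2d(dim, dim * 3, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, c, h, w = x.shape
            q, k, v = self.qkv(x).chunk(3, dim=1)
            pad = self.window // 2
            # Gather every pixel's k*k neighborhood of keys and values.
            k = F.unfold(k, self.window, padding=pad).view(b, c, -1, h * w)
            v = F.unfold(v, self.window, padding=pad).view(b, c, -1, h * w)
            q = q.view(b, c, 1, h * w)
            attn = (q * k).sum(dim=1, keepdim=True) * self.scale  # (b,1,k*k,hw)
            attn = attn.softmax(dim=2)        # softmax over the neighborhood
            out = (attn * v).sum(dim=2)       # weighted sum of neighbor values
            return out.view(b, c, h, w)

    x = torch.randn(1, 32, 64, 64)
    print(NeighborhoodAttention(32)(x).shape)  # torch.Size([1, 32, 64, 64])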

“Single-image super-resolution using lightweight transformer-convolutional neural network hybrid model”

(2023)

“These CNN-based methods cannot fully use the internal and external information of the image. The authors add a lightweight Transformer structure to capture this information.”

“The lightweight transformer block (LTB) further extracts features and learns the texture details between the patches through the self-attention mechanism.”

“In the LTB, we stack EMT blocks to capture long-term similarity information in feature maps.”

        “We also compare the trade-off between the performance and the number of network parameters from our work and the existing methods. Figure 7 shows the PSNR performances of 10 methods versus the number of parameters, where the results are evaluated with the Set5 dataset for × 4 upscaling factor. We can find that our method significantly outperforms the relatively small models across this dataset and scale. Moreover, our method performs better than EDSR [28] and RDN [37] for × 4 upscaling factor, but with about 90% and 80% fewer parameters on average, respectively. Furthermore, compared with RCAN [29] on four upscaling factors, our model has fewer parameters and achieves higher PSNR. These comparisons indicate that our proposed network has a better trade-off between performance and model size.”

        “Furthermore, in terms of FLOPs, our model is more economical than CNLN [31], EDSR, and image super-resolution via deep recursive residual network (DRRN) [58], and its performance is superior to these three methods. Although MemNet [59] and VDSR use fewer FLOPs, our approach obtains better performance and executes faster.”

Analysis:

  • Combines the local efficiency of CNNs with the global context awareness (whole frames/previous frames) of transformers thanks to the LTB.

  • CNNs handle initial processing; transformer layers enhance global understanding (to refine and align the CNN’s output for upscaling).

  • Lightweight global processing lets it deliver superior image quality with significantly reduced GPU memory usage compared to full-blown transformer models. This makes it an excellent approach for applications requiring real-time or resource-efficient super-resolution, like gaming at high frame rates. A toy version of this CNN-plus-LTB pipeline follows this list.
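
Putting the pieces together, here is a toy PyTorch pipeline in the spirit of the paper’s design: CNN layers for local feature extraction, one lightweight transformer layer for long-range context, and a PixelShuffle upscale. Everything here (layer counts, channel widths, the 2x factor) is invented for illustration; it is not the paper’s EMT/LTB implementation.

    import torch
    import torch.nn as nn

    class HybridSR(nn.Module):
        """Toy hybrid super-resolution net: CNN layers handle local features,
        one lightweight transformer layer adds global context, and
        PixelShuffle performs the 2x upscale."""
        def __init__(self, dim: int = 32, scale: int = 2):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(),
                nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            )
            self.ltb = nn.TransformerEncoderLayer(
                d_model=dim, nhead=4, dim_feedforward=dim * 2, batch_first=True)
            self.up = nn.Sequential(
                nn.Conv2d(dim, 3 * scale ** 2, 3, padding=1),
                nn.PixelShuffle(scale))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            f = self.cnn(x)                          # local feature extraction
            b, c, h, w = f.shape
            tokens = f.flatten(2).transpose(1, 2)    # (b, h*w, c) for attention
            f = f + self.ltb(tokens).transpose(1, 2).view(b, c, h, w)
            return self.up(f)                        # 2x upscale

    lr = torch.randn(1, 3, 64, 64)
    print(HybridSR()(lr).shape)  # torch.Size([1, 3, 128, 128])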

Overall:

  • FSR 4 must deliver high-quality machine learning based spatiotemporal upscaling while still being lightweight enough to allow games to run in parallel on the same GPU cores at high frame rates, meaning pure transformer-based models are impractical (see the RDNA 4 compute unit). These papers show that hybrid models can achieve near-transformer quality while remaining lightweight.

  • CNNs are excellent for local feature extraction (texture details), but they struggle with long-range information (preserving details within and between frames). By integrating transformer-like self-attention mechanisms into CNN architectures, these models improve quality without massive computational overhead.
  • Could help FSR 4 maintain image sharpness without requiring full-frame self-attention like DLSS 4.
  • Integrating both a transformer-based block and a “transformer-inspired” enhancement block ensures minimal impact on performance.

FSR 4 on RDNA 3

(“FSR 4 Lite”)

Possible/probable architecture outlined in:

“Single-image super-resolution using lightweight transformer-convolutional neural network hybrid model” (cont)

  • “Single-image super-resolution using lightweight transformer-convolutional neural network hybrid model” highlights the possibility of removing the transformer part (LTB) of the hybrid model.
  • This would reduce computational load and potentially remove the need for the FP8 acceleration that empowers transformers. (speculation)
  • Enables RDNA 3 GPUs to accelerate an “FSR 4 Lite” model with their support for INT8 acceleration.
  • Retains the Detail Attention Block (DAB) of FSR 4 (I misspoke in the video).
  • Represents an evolution over models like DLSS 3 or XeSS despite retaining a CNN architecture.
  • Improves sharpness by reducing “over-smoothing” of texture detail and edges.
  • Implications for image quality:
  • May lead to oversharpening artifacts or aliasing around high-contrast edges.
  • A result of the lack of global context.

“Hence, we devise a simple channel attention mechanism to effectively capture the texture and details of high dimensional features in the HR space, thereby constructing the DAB.”

        “ …the DAB is indispensable for producing SR [super resolution] images with highly detailed visual features. This is because the LR [Low Resolution] space contains limited information, and DAB [Detail Attention Block] can compensate for the missing critical local information by extracting the corresponding features in the HR [High Resolution] space.”

  • Basically says the DAB can pick up on any complex details missed by the normal CNN layers after they’ve done most of the work of upscaling the image.
  • DAB extracts higher level concepts or details that are only noticeable when the extracted details are looked at together.
  • For example: it would be hard to know how to enhance an image of a mountain if you could only see or understand tiny fractions of the image at a time, or only the simplest observations. You might struggle to understand what the image really is; you’d only see some trees, some snow, some rocks, some clouds, without ever understanding they’re all part of the same object: the mountain.
  • This enhanced understanding helps the model align with our expectations of how things should look. A minimal channel-attention sketch in the spirit of the DAB follows this list.
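
For reference, the standard channel-attention pattern the paper describes looks roughly like this (a minimal squeeze-and-excitation-style sketch; the paper’s exact DAB layout is not reproduced here):

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Simple channel attention in the spirit of the DAB: global pooling
        summarizes each feature map, and the learned weights emphasize
        channels that carry fine texture detail."""
        def __init__(self, dim: int, reduction: int = 8):
            super().__init__()
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),                  # (b, c, 1, 1) summary
                nn.Conv2d(dim, dim // reduction, 1), nn.ReLU(),
                nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x * self.gate(x)  # reweight channels, keep spatial layout

    feats = torch.randn(1, 64, 128, 128)      # HR-space features
    print(ChannelAttention(64)(feats).shape)  # torch.Size([1, 64, 128, 128])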

Also highlighted in a different paper:

“A Simple Transformer-style Network for Lightweight Image Super-resolution” (Cont)

        “In this task, all the contents of the conv2Former block are removed, except of the 3 × 3 to indicate the impact of the attention module, as indicated in Fig. 3b (Model 2). The obtained results are indicated in Table 2, where 1st row represents the results of using the Conv2Former, and 3rd row represents the results without using the attention module. The results show that  the attention module has a big impact on performance. For instance, the PSNR dropped from 33.77 dB to 33.61 dB on the Set14 dataset. So, these results show that the attention module can greatly impacts the performance.” (yes the grammar error is in the paper)

Moving on to another new technology shown by AMD in their RX 9070 XT launch event,

Neural Supersampling and Denoising

Some background first:

In the era of the GTX 1080 Ti (A.D. 2017), before the RTX series of GPUs, Nvidia released a paper about denoising path-traced images using traditional hand-coded algorithms (not machine learning) called:

“Spatiotemporal variance-guided filtering: real-time reconstruction for path-traced global illumination”

        “We introduce a reconstruction algorithm that generates a temporally stable sequence of images from one path-per-pixel global illumination. To handle such noisy input, we use temporal accumulation to increase the effective sample count and spatiotemporal luminance variance estimates to drive a hierarchical, image-space wavelet filter. This hierarchy allows us to distinguish between noise and detail at multiple scales using local luminance variance.” - their alternative to a machine learning/AI model

  • Nvidia research project
  • Aims to bring path tracing into real-time feasibility by reconstructing images from noisy inputs.
  • Traditional analytical approach (not machine learning!)
  • Interesting because Nvidia is all about AI nowadays.
  • Released in July 2017
  • This is over a year before the launch of the RTX series of graphics cards!
  • The GTX 1080 Ti was released in March 2017, four months before the paper.
  • Achieves stable images at 30 fps or higher.
  • Efficient: 10 ms to reconstruct a 1920x1080 image from one sample per pixel, compared to 2048 samples per pixel for a reference image.
  • Algorithmic filtering method for denoising, as opposed to machine learning (AI) models.
  • Real-time performance even on less powerful hardware, though it may not match AI models in detail recovery or temporal stability.
  • No special cores or hardware acceleration needed. Doesn’t require upgrades or new hardware to be implemented.
  • Does not upscale the resolution of the image, unlike DLSS Ray Reconstruction. Renders at native resolution with 1 sample per pixel. According to the paper, traditional renders would use upwards of 2048 samples per pixel.
  • Achieves similar image quality while tracing just 0.05% of the rays per pixel by accumulating results over multiple frames. (A simplified sketch of this accumulate-then-filter idea follows this list.)
  • Leads to “boiling” artifacts, particularly in specular reflections.
  • Shown in the video attached to the paper by Nvidia (link to the paper at the bottom of this doc)
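
For intuition, here is a heavily simplified Python sketch of the accumulate-then-filter recipe (my own illustration). Real SVGF reprojects the history with motion vectors, tracks per-pixel luminance variance, filters illumination separately from albedo, and adds depth/normal edge-stopping weights; none of that is reproduced here.

    import numpy as np

    def temporal_accumulate(curr, prev, alpha=0.2):
        # Exponential moving average: blending reprojected history into the
        # 1-spp frame raises the effective sample count over time.
        return alpha * curr + (1.0 - alpha) * prev

    def atrous_pass(img, variance, step):
        # One level of an edge-aware a-trous pass (4-tap cross, simplified):
        # neighbors that differ by more than the noise level suggests get
        # small weights, so edges survive while noise is averaged away.
        out = np.copy(img)
        total = np.ones_like(img)
        for dy, dx in ((-step, 0), (step, 0), (0, -step), (0, step)):
            shifted = np.roll(img, (dy, dx), axis=(0, 1))
            w = np.exp(-((img - shifted) ** 2) / (variance + 1e-6))
            out += w * shifted
            total += w
        return out / total

    noisy = np.random.rand(1080, 1920)    # stand-in for a 1-spp frame
    history = np.random.rand(1080, 1920)  # stand-in for reprojected history
    frame = temporal_accumulate(noisy, history)
    for level in range(5):                # hierarchy: doubling filter radius
        # Scalar variance is a stand-in; SVGF estimates it per pixel.
        frame = atrous_pass(frame, variance=frame.var(), step=2 ** level)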

Even before Nvidia announced Ray Reconstruction, their machine learning based denoiser integrated into DLSS, Intel released a research paper showcasing a very similar technology. Not a lot of people seem to know or talk about this. This is that paper:

“Temporally Stable Real-Time Joint Neural Denoising and Supersampling”

(Left: Nvidia's SVGF paper shown previously, Middle: Unreal Engine 4 Default, Right: Intel Upscaling-Denoising Tech)

        “Recent advances in ray tracing hardware bring real-time path tracing into reach, and ray traced soft shadows, glossy reflections, and diffuse global illumination are now common features in games. Nonetheless, ray budgets are still limited. This results in undersampling, which manifests as aliasing and noise. Prior work addresses these issues separately. While temporal supersampling methods based on neural networks have gained a wide use in modern games due to their better robustness, neural denoising remains challenging because of its higher computational cost.”

        “SVGF generally blurs the image too strongly. Fine details in the normal or roughness textures are blurred out. Nonetheless, there is residual low-frequent noise with a splotchy appearance. SVGF also struggles with specular signal components, since temporal accumulation with standard motion vectors leads to temporal lag under camera motion. In spite of the lower-resolution input, our method produces sharper results almost everywhere.”

  • Intel GPU research project
  • Based on machine learning
  • CNN architecture
  • Released in July 2022, over a year before DLSS Ray Reconstruction was announced!
  • 1280x720 input resolution, 2560x1440 output resolution
  • Looks a lot better than SVGF, although unlike Nvidia’s SVGF it also upscales. (A toy joint denoise-and-upscale model is sketched below.)

(compares standard XeSS upscaling with the joint denoising and supersampling technique)
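
A toy PyTorch model in the spirit of Intel’s joint approach (my sketch; the channel counts, depth, and G-buffer layout are invented, and the real network is far larger and temporally aware): a single CNN consumes the noisy 720p color plus auxiliary buffers and emits a denoised 1440p frame in one pass.

    import torch
    import torch.nn as nn

    class JointDenoiseUpscale(nn.Module):
        """Toy joint denoiser-upscaler: noisy 720p color plus G-buffers
        (normals, albedo, depth) in, clean 1440p color out."""
        def __init__(self, aux_channels: int = 7, dim: int = 32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3 + aux_channels, dim, 3, padding=1), nn.ReLU(),
                nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
                nn.Conv2d(dim, 3 * 4, 3, padding=1),
                nn.PixelShuffle(2),  # denoise and 2x upscale in one network
            )

        def forward(self, noisy_rgb, gbuffers):
            return self.net(torch.cat([noisy_rgb, gbuffers], dim=1))

    rgb = torch.randn(1, 3, 720, 1280)  # noisy 1-spp color
    aux = torch.randn(1, 7, 720, 1280)  # normals + albedo + depth
    print(JointDenoiseUpscale()(rgb, aux).shape)  # (1, 3, 1440, 2560)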

Moving on to the present day,

“Neural Supersampling and Denoising for Real-time Path Tracing”

  • AMD research blog
  • Released in October 2024
  • Machine learning model that denoises path-traced scenes while upscaling the resolution of the frame at the same time.
  • Similar to DLSS Ray Reconstruction (DLSS 3.5)

“The randomness of samples in Monte Carlo integration inherently produces noise when the scattered rays do not hit the light source after multiple bounces. Hence, many samples per pixel (spp) are required to achieve high quality pixels in Monte Carlo path tracing, often taking a couple of minutes or hours to render a single image. Although the higher number of samples per pixel, the higher chance of less noise in an image, in many cases even with several thousands of samples it still falls short to converge to high quality and shows visually annoying noise.”

  • Denoising is needed no matter the sample count; Monte Carlo error falls off only with the square root of the sample count, as the toy estimate below illustrates.
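
A toy illustration of that convergence rate (a stand-in integrand, not a renderer): error shrinks roughly as 1/sqrt(spp), so every 4x increase in rays only halves the noise, which is why brute-force sampling alone can’t reach clean images in real time.

    import random

    def estimate_pixel(spp: int) -> float:
        # Toy Monte Carlo pixel: the average of random "light path" samples.
        # The true value of this stand-in integrand is 0.5.
        return sum(random.random() for _ in range(spp)) / spp

    for spp in (1, 16, 256, 4096):
        err = abs(estimate_pixel(spp) - 0.5)
        print(f"{spp:5d} spp -> error ~ {err:.4f}")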

        “Neural denoisers use a deep neural network to predict denoising filter weights in a process of training on a large dataset. They are achieving remarkable progress in denoising quality compared to hand-crafted analytical denoising filters [2]. Depending on the complexity of a neural network and how it cooperates with other optimization techniques, neural denoisers are getting more attention to be used for real-time Monte Carlo path tracing.”

*[2]: AMD specifically references Nvidia’s SVGF paper as an example of an inferior technique in the references section of their blog haha xD

(Not really a jab, because it’s a seven-year-old paper, which is a long time in computer graphics research, and especially in AI research.) I just thought it was interesting.

*They also reference Intel’s paper in the references section

  • Real-time path tracing benefits significantly from neural network based denoising techniques, according to AMD (and the research community).

Demoed in AMD’s RX 9070 XT reveal event in the “Toyshop” demo

  • Path-traced lighting
  • (shadows, reflections, AO, GI, etc.)
  • 1 Million Dynamic Lights
  • ReSTIR and Neural Radiance Caching (NRC)
  • NRC increases the effective number of ray bounces with caching/machine learning inference.
  • ReSTIR increases the effective sample count by intelligently choosing which rays to shoot and reusing samples across pixels and frames, instead of randomly sampling from scratch every frame. (A minimal reservoir-sampling sketch follows this list.)
  • Neural Supersampling Denoiser
  • Similar to DLSS 3.5 Ray Reconstruction
  • Denoises the path-traced lighting and upscales the frame at the same time.
  • Note: the mirror shown is likely NOT a ray/path-traced reflection.
  • Appears to be a planar reflection or a render-to-texture effect.
  • The sharpness of the reflection, and the absence of the robot in the specular reflection directly beneath the mirror and robot, indicate that dynamic (moving) objects are not included in the path-tracing BVH structure.
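
The core trick inside ReSTIR is weighted reservoir sampling: a pixel streams candidate light samples and keeps just one, chosen with probability proportional to its weight, in constant memory. A minimal sketch (my own, with made-up weights; real ReSTIR also merges reservoirs across neighboring pixels and previous frames, which is where the multiplied effective sample count comes from):

    import random

    class Reservoir:
        """Weighted reservoir: stream candidates, keep one sample chosen
        with probability proportional to its weight."""
        def __init__(self):
            self.sample, self.w_sum, self.count = None, 0.0, 0

        def update(self, sample, weight):
            self.w_sum += weight
            self.count += 1
            if self.w_sum > 0 and random.random() < weight / self.w_sum:
                self.sample = sample

    r = Reservoir()
    for _ in range(32):                     # a few candidates per pixel...
        light = random.randint(0, 999_999)  # ...drawn from ~1M lights
        r.update(light, weight=random.random())  # weight ~ contribution
    print(r.sample, r.count)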

Some Citations:

Yuanyuan Liu, Mengtao Yue, Han Yan, Lu Zhu: Single-image super-resolution using lightweight transformer-convolutional neural network hybrid model. (2023)

https://doi.org/10.1049/ipr2.12833

Gang Wu, Junjun Jiang, Yuanchao Bai, Xianming Liu: Incorporating Transformer Designs into Convolutions for Lightweight Image Super-Resolution. (2023)

https://doi.org/10.48550/arXiv.2303.14324

Garas Gendy, Nabil Sabor, Jingchao Hou, Guanghui He: A Simple Transformer-style Network for Lightweight Image Super-resolution. (2023)

https://doi.org/10.1109/CVPRW59228.2023.00153

Neural Supersampling and Denoising for Real-time Path Tracing. (2024)

https://gpuopen.com/learn/neural_supersampling_and_denoising_for_real-time_path_tracing/

Christoph Schied, Anton Kaplanyan, Chris Wyman, Anjul Patney, Chakravarty R. Alla Chaitanya, John Burgess, Shiqiu Liu, Carsten Dachsbacher, Aaron Lefohn, Marco Salvi: Spatiotemporal Variance-Guided Filtering: Real-Time Reconstruction for Path-Traced Global Illumination. (2017)

https://research.nvidia.com/publication/2017-07_spatiotemporal-variance-guided-filtering-real-time-reconstruction-path-traced

Manu Mathew Thomas, Gabor Liktor, Christoph Peters, SungYe Kim, Karthik Vaidyanathan, Angus G. Forbes: Temporally Stable Real-Time Joint Neural Denoising and Supersampling. (2022)

https://www.intel.com/content/www/us/en/developer/articles/technical/temporally-stable-denoising-and-supersampling.html

All You Need for Gaming – AMD RDNA™ 4 and RX 9000 Series Reveal

Toyshop Realtime Path Tracing Neural Rendering Tech Demo - YouTube

Osvaldo Pinali Doederlein’s Twitter Post

https://x.com/opinali/status/1883889129894908258