Transformers without normalization
Jiachen Zhu
Xinlei Chen
Kaiming He
Yann LeCun
Zhuang Liu
(Project Lead)
Accepted as poster
January 2022 → May 2024: 881 days
A published paper every 10 days 😱 😱 😱
Agenda
01 Preliminaries
02 Previous Work
03 Proposed Method
04 Experiments
05 Follow-up papers
06 Conclusions
Preliminary: What is normalization and why do people use it?
Preliminary: Why does ViT need normalization?
Preliminary: Batch Normalization
2015
Problem: “...We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs…”
$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$
$\mu_B$: mini-batch mean
$\sigma_B^2$: mini-batch variance
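A minimal PyTorch sketch of what BN computes at training time: statistics over the batch dimension, checked against nn.BatchNorm1d.

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64)                         # (batch, features)

# Manual batch norm: statistics are taken across the batch dimension.
mu = x.mean(dim=0)
var = x.var(dim=0, unbiased=False)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)       # affine gamma=1, beta=0 omitted

bn = nn.BatchNorm1d(64, affine=False)
bn.train()
print(torch.allclose(x_hat, bn(x), atol=1e-5))  # True
```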
Preliminary: Layer Normalization
2016
Problem: “...the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks…”
$y = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} \cdot \gamma + \beta$
$\mu$: layer mean
$\sigma^2$: layer variance
LN ❤️ Transformers
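The same check for LN; the only change versus BN is the reduction axis, which is why it works with any batch size.

```python
import torch
import torch.nn as nn

x = torch.randn(32, 64)                         # (tokens, features)

# Layer norm: statistics are taken per sample, across the feature dimension.
mu = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
x_hat = (x - mu) / torch.sqrt(var + 1e-5)

ln = nn.LayerNorm(64, elementwise_affine=False)
print(torch.allclose(x_hat, ln(x), atol=1e-5))  # True
```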
Preliminary: Instance Normalization
Problem: “...The stylized image is, in fact, obtained by iterative optimization until it matches the desired statistics. In practice, it takes several minutes to stylize an image of size 512x512…”
2017
Preliminary: Group Normalization
2018
Problem: “...However, normalizing along the batch dimension introduces problems — BN’s error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation…”
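All four variants compute the same statistics and differ only in the reduction axes. A compact PyTorch sketch over a conv feature map (shape values are illustrative):

```python
import torch

N, C, H, W, G = 8, 32, 16, 16, 4
x = torch.randn(N, C, H, W)

bn_mu = x.mean(dim=(0, 2, 3))                           # BN: per channel, across the batch
ln_mu = x.mean(dim=(1, 2, 3))                           # LN: per sample, across everything else
in_mu = x.mean(dim=(2, 3))                              # IN: per sample and per channel
gn_mu = x.view(N, G, C // G, H, W).mean(dim=(2, 3, 4))  # GN: per sample and channel group
```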
Preliminary: RMS Normalization
2019
Problem: “...However, the computational overhead introduced by LayerNorm makes these improvements expensive and significantly slows the underlying network, e.g. RNN in particular…”
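The formula, in the RMSNorm paper's notation: no mean subtraction, only re-scaling (implementations usually add a small $\epsilon$ inside the root):

$$\bar{a}_i = \frac{a_i}{\mathrm{RMS}(\mathbf{a})}\, g_i, \qquad \mathrm{RMS}(\mathbf{a}) = \sqrt{\frac{1}{n}\sum_{i=1}^{n} a_i^2}$$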
Previous Work
Is normalization indispensable for training deep neural networks?
We want to preserve the variance
The exploding/vanishing gradient problem
NeurIPS 2020
init weights with He Initialization
QUIZ
What's the formula of He Initialization?
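Answer (fan-in mode, for ReLU networks):

$$W_{ij} \sim \mathcal{N}\!\left(0,\; \frac{2}{n_{\text{in}}}\right)$$

The factor 2 compensates for ReLU zeroing half of the pre-activations in expectation, so the activation variance is preserved from layer to layer.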
Is normalization indispensable for training deep neural networks?
NeurIPS 2020
Init weights with He Initialization
Variance of a layer: $\mathrm{Var}(y_l) = \tfrac{1}{2}\, n_{\text{in}}\, \mathrm{Var}(W)\, \mathrm{Var}(y_{l-1}) = \mathrm{Var}(y_{l-1})$, so the variance is preserved across plain layers
Is normalization indispensable for training deep neural networks?
NeurIPS 2020
ResNet-style network: $x_{l+1} = x_l + f(x_l)$
Is normalization indispensable for training deep neural networks?
NeurIPS 2020
Variance is not preserved! $\mathrm{Var}(x_l)$ doubles in each block
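The step the slide relies on, assuming the skip path and the residual branch are roughly uncorrelated and $f$ preserves variance under He init:

$$\mathrm{Var}(x_{l+1}) = \mathrm{Var}(x_l) + \mathrm{Var}(f(x_l)) \approx 2\,\mathrm{Var}(x_l) \;\;\Rightarrow\;\; \mathrm{Var}(x_L) \approx 2^{L}\,\mathrm{Var}(x_0)$$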
Normal people solve the variance problem by just using normalization lol
Is normalization indispensable for training deep neural networks?
NeurIPS 2020
Transform it into: $x_{l+1} = \alpha\, x_l + \beta\, f(x_l)$
$\alpha, \beta$ are chosen hyperparameters with $\alpha^2 + \beta^2 = 1$
$\mathrm{Var}(x_{l+1}) = \alpha^2\,\mathrm{Var}(x_l) + \beta^2\,\mathrm{Var}(f(x_l)) \approx \mathrm{Var}(x_l)$: variance approximately preserved!
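A quick simulation of both recipes (a sketch, not the papers' code: the residual branch here is just a random variance-preserving projection):

```python
import torch

torch.manual_seed(0)
depth, dim = 20, 512
alpha = 0.9
beta = (1 - alpha ** 2) ** 0.5        # so that alpha^2 + beta^2 = 1
x_plain = torch.randn(4096, dim)      # unit variance at the input
x_resc = x_plain.clone()

for _ in range(depth):
    # Stand-in residual branch: a random projection that preserves variance.
    w = torch.randn(dim, dim) / dim ** 0.5
    x_plain = x_plain + x_plain @ w                 # variance roughly doubles per block
    x_resc = alpha * x_resc + beta * (x_resc @ w)   # variance stays roughly constant

print(x_plain.var().item())  # ~2^20: exploded
print(x_resc.var().item())   # ~1: preserved
```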
CHARACTERIZING SIGNAL PROPAGATION TO CLOSE THE PERFORMANCE GAP IN UNNORMALIZED RESNETS
ICLR 2021
Transform it into: $x_{l+1} = x_l + \alpha\, f(x_l / \beta_l)$, with $\beta_l = \sqrt{\mathrm{Var}(x_l)}$ predicted analytically
Mean Shift
The mean of the ReLU output is not zero
Each channel begins to have an offset, meaning that its output is no longer centered at zero
The solution: Scaled Weight Standardization (re-parameterize the conv weights to have zero mean and unit variance)
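Concretely, for a standard Gaussian pre-activation:

$$\mathbb{E}[\max(0, x)] = \frac{1}{\sqrt{2\pi}} \approx 0.399 > 0,$$

so every ReLU injects a positive offset, and these offsets accumulate with depth.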
CHARACTERIZING SIGNAL PROPAGATION TO CLOSE THE PERFORMANCE GAP IN UNNORMALIZED RESNETS
The proposed network without normalization achieves the same performance and shows the same behavior as a network with BN
Proposed Method
S-shaped input-output mappings
Observation: in trained Transformers, LayerNorm layers produce tanh-like, S-shaped input-output mappings
Dynamic tanh (DyT)
$\mathrm{DyT}(x) = \gamma \odot \tanh(\alpha x) + \beta$
$\alpha$: learnable scalar parameter
$\gamma, \beta$: learnable, per-channel vector parameters
And that’s it!
Dynamic tanh (DyT): Algorithm
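The slide's algorithm as a PyTorch sketch, following the paper's pseudocode (0.5 is the paper's default α₀ for non-LLM models):

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Drop-in replacement for LayerNorm: gamma * tanh(alpha * x) + beta."""

    def __init__(self, num_features: int, init_alpha: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * init_alpha)  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(num_features))    # per-channel scale
        self.beta = nn.Parameter(torch.zeros(num_features))    # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```

Replacing every nn.LayerNorm (or RMSNorm) in a Transformer with DyT is the whole method; no activation statistics are computed anywhere.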
The role of α: it learns to track $1/\mathrm{std}$ of the input activations, acting as a global (non-adaptive) scaling factor
Dynamic tanh (DyT): Initialization of α₀ for Non-LLM Models
Non-LLM models are relatively insensitive to α₀
Training a larger model is more prone to failure, requiring smaller α₀ values or learning rates for stable training.
Dynamic tanh (DyT): Initialization of α₀ for LLMs
Tuning α₀ enhances LLM performance
Two key findings emerge:
1. Larger models require smaller α₀ values.
2. Higher α₀ values for attention blocks improve performance.
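In code this is just a different init per call site, reusing the DyT class sketched earlier (the α₀ values below are illustrative, not the paper's tuned table):

```python
dim = 4096  # hypothetical model width

# alpha_0 values are illustrative, not the paper's exact settings
attn_norm = DyT(dim, init_alpha=0.8)  # higher alpha_0 before attention
ffn_norm = DyT(dim, init_alpha=0.2)   # lower alpha_0 before the FFN
```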
Experiments
Experiments
Supervised Training
Self-supervised Training
Experiments
Consistent gains across all scales
Experiments
Text
Audio
DNA
Follow-up papers
Stronger Normalization-Free Transformers
Jiachen Zhu
Zhuang Liu
(Project Lead)
Mingzhi Chen
Taiming Lu
Mingjie Sun
Release date: 11 Dec 2025
6 months later 🤓
Desired properties of the element-wise function:
1. The function values are distributed around 0
2. The function has a limited range
3. The function is sensitive to changes around the zero center
4. The function can increase or decrease, but it cannot change direction
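For reference, DyT's $\tanh(\alpha x)$ already satisfies all four: it is odd, so its values are centered at 0; it is bounded in $(-1, 1)$; its sensitivity around the center is exactly $\alpha$, since $\frac{d}{dx}\tanh(\alpha x)\big|_{x=0} = \alpha$; and it is strictly increasing, so it never changes direction.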
Top-1 accuracy on ViT-Base and image generation quality (FID) on DiT-B/4 and DiT-L/4
Stronger Generalization or Better Fitting?
Answer: better generalization than normalization layers, not merely better fitting
Conclusions
Transformers can be trained without normalization layers via DyT and still match or surpass the performance of their normalized counterparts
A simple tanh(αx) transformation captures much of LayerNorm’s functional effect, suggesting that explicit normalization may not be essential
DyT is especially effective in modern Transformer architectures, but it does not replace BatchNorm as successfully in classical ConvNets