1 of 53

Universal computational thermal imaging overcoming the ghosting effect

Jiachen Zhu

Xinlei Chen

Kaiming He

Yann LeCun

Zhuang Liu

(Project Lead)

Accepted as poster

Presenters:

  • Fabian Perez
  • Valentina Perez
  • Samuel Traslaviña

2 of 53

Transformers without normalization

Jiachen Zhu

Xinlei Chen

Kaiming He

Yann LeCun

Zhuang Liu

(Project Lead)

Accepted as poster

3 of 53

Transformers without normalization

Jiachen Zhu

Xinlei Chen

Kaiming He

Yann LeCun

Zhuang Liu

(Project Lead)

Accepted as poster

January 2022 → May 2024: 881 days

A published paper every 10 days 😱 😱 😱

4 of 53

Transformers without normalization

Jiachen Zhu

Xinlei Chen

Kaiming He

Yann LeCun

Zhuang Liu

(Project Lead)

Accepted as poster

5 of 53

Agenda

01 Preliminaries
02 Previous Work
03 Proposed Method
04 Experiments
05 Follow-up papers
06 Conclusions

6 of 53

Preliminary: What is normalization and why do people use it?

7 of 53

Preliminary: Why does ViT need normalization?

8 of 53

Preliminary: Batch Normalization

2015

Problem: “...We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs…”

9 of 53

Preliminary: Batch Normalization

2015

Problem: “...We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs…”

μ_B: Mini-batch mean

σ²_B: Mini-batch variance
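To make the mini-batch statistics above concrete, here is a minimal BatchNorm sketch in PyTorch (a simplification that ignores the running statistics used at inference time; shapes and variable names are illustrative):

```python
import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """Minimal BatchNorm for a (N, C) batch: normalize each feature
    using statistics computed across the mini-batch dimension."""
    mu = x.mean(dim=0)                    # mini-batch mean, shape (C,)
    var = x.var(dim=0, unbiased=False)    # mini-batch variance, shape (C,)
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta           # learnable per-feature scale/shift

x = torch.randn(32, 64)                   # batch of 32 samples, 64 features
y = batch_norm(x, torch.ones(64), torch.zeros(64))
print(y.mean(dim=0).abs().max(), y.std(dim=0).mean())  # ≈ 0 and ≈ 1
```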

10 of 53

Preliminary: Layer Normalization

Problem: “...the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks…”

2016

11 of 53

Preliminary: Layer Normalization

Problem: “...the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks…”

2016

μ: Layer mean

σ²: Layer variance
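For comparison, a minimal LayerNorm sketch: statistics are computed per token over the feature dimension, so the output does not depend on the batch size or on the other tokens (shapes are illustrative):

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    """Minimal LayerNorm for a (batch, seq_len, dim) tensor: each token
    is normalized with its own mean/variance over the feature dimension."""
    mu = x.mean(dim=-1, keepdim=True)                  # per-token mean
    var = x.var(dim=-1, unbiased=False, keepdim=True)  # per-token variance
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta                        # learnable per-feature affine

x = torch.randn(8, 16, 64)   # 8 sequences, 16 tokens, 64-dim embeddings
y = layer_norm(x, torch.ones(64), torch.zeros(64))
```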

12 of 53

Preliminary: Layer Normalization

Problem: “...the effect of batch normalization is dependent on the mini-batch size and it is not obvious how to apply it to recurrent neural networks…”

2016

μ: Layer mean

σ²: Layer variance

LN ❤️ Transformers

  1. Batch size independent
  2. Works best with variable-length sequences
  3. Maintains token independence
  4. Facilitates deep training

13 of 53

Preliminary: Layer Normalization

2016

Samuel's Manim animation????

14 of 53

Preliminary: Instance Normalization

Problem: “...The stylized image is, in fact, obtained by iterative optimization until it matches the desired statistics. In practice, it takes several minutes to stylize an image of size 512x512…”

2017

15 of 53

Preliminary: Instance Normalization

Problem: “...The stylized image is, in fact, obtained by iterative optimization until it matches the desired statistics. In practice, it takes several minutes to stylize an image of size 512x512…”

2017
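A minimal InstanceNorm sketch along the same lines (shapes are illustrative): each sample and each channel is normalized with its own statistics over the spatial dimensions, which is what makes it useful for fast stylization:

```python
import torch

def instance_norm(x, eps=1e-5):
    """Minimal InstanceNorm for a (N, C, H, W) tensor: statistics are
    computed per sample and per channel, over the spatial dimensions only."""
    mu = x.mean(dim=(2, 3), keepdim=True)
    var = x.var(dim=(2, 3), unbiased=False, keepdim=True)
    return (x - mu) / torch.sqrt(var + eps)

x = torch.randn(4, 3, 32, 32)   # 4 images, 3 channels, 32x32
y = instance_norm(x)
```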

16 of 53

Preliminary: Group Normalization

2018

Problem: “...However, normalizing along the batch dimension introduces problems — BN’s error increases rapidly when the batch size becomes smaller, caused by inaccurate batch statistics estimation…”
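A minimal GroupNorm sketch (the number of groups and shapes are illustrative): channels are split into groups and each (sample, group) pair is normalized with its own statistics, so nothing depends on the batch size:

```python
import torch

def group_norm(x, num_groups, eps=1e-5):
    """Minimal GroupNorm for a (N, C, H, W) tensor: channels are split into
    groups; statistics are computed per sample and per group."""
    n, c, h, w = x.shape
    x = x.reshape(n, num_groups, c // num_groups, h, w)
    mu = x.mean(dim=(2, 3, 4), keepdim=True)
    var = x.var(dim=(2, 3, 4), unbiased=False, keepdim=True)
    x = (x - mu) / torch.sqrt(var + eps)
    return x.reshape(n, c, h, w)

x = torch.randn(2, 32, 16, 16)   # works even with a tiny batch of 2
y = group_norm(x, num_groups=8)
```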

17 of 53

Preliminary: RMS Normalization

2019

Problem: “...However, the computational overhead introduced by LayerNorm makes these improvements expensive and significantly slows the underlying network, e.g. RNN in particular…”
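A minimal RMSNorm sketch (shapes are illustrative): it drops the mean subtraction and only rescales each token by the root mean square of its features, which is what makes it cheaper than LayerNorm:

```python
import torch

def rms_norm(x, gamma, eps=1e-6):
    """Minimal RMSNorm: no mean subtraction, just divide each token by the
    root mean square of its features and apply a learnable gain."""
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return gamma * (x / rms)

x = torch.randn(8, 16, 64)
y = rms_norm(x, torch.ones(64))
```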

18 of 53

Previous Work

19 of 53

Is normalization indispensable for training deep neural networks?

We want to preserve the variance

The exploding/vanishing gradient problem

NeurIPS 2020

init weights with He Initialization

20 of 53

Is normalization indispensable for training deep neural networks?

The exploding/vanishing gradient problem

NeurIPS 2020

init weights with He Initialization

QUIZ

What's the formula of He Initialization?

21 of 53

Is normalization indispensable for training deep neural networks?

The exploding/vanishing gradient problem

NeurIPS 2020

init weights with He Initialization

QUIZ

What's the formula of He Initialization?

22 of 53

Is normalization indispensable for training deep neural networks?

Variance of a layer

NeurIPS 2020

init weights with He Initialization
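As a small numerical illustration of this slide (and an answer sketch for the quiz), assuming the standard He initialization W ~ N(0, 2/fan_in): with ReLU activations, the activation scale of a plain deep network stays roughly constant with depth:

```python
import torch

torch.manual_seed(0)
dim, depth = 512, 20
x = torch.randn(1024, dim)                       # unit-scale input batch

for layer in range(1, depth + 1):
    # He initialization: W ~ N(0, 2 / fan_in), i.e. std = sqrt(2 / fan_in)
    w = torch.randn(dim, dim) * (2.0 / dim) ** 0.5
    x = torch.relu(x @ w)
    print(f"layer {layer:2d}: mean squared activation ≈ {x.pow(2).mean().item():.2f}")
# the activation scale stays roughly constant with depth instead of
# exploding or vanishing
```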

23 of 53

Is normalization indispensable for training deep neural networks?

NeurIPS 2020

ResNet-style network

24 of 53

Is normalization indispensable for training deep neural networks?

NeurIPS 2020

Variance is not preserved! It doubles in each block
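A toy numerical illustration of this claim, with a stand-in for the residual branch F (it has the same variance as its input and is roughly uncorrelated with the skip path, which is an idealizing assumption):

```python
import torch

torch.manual_seed(0)
x = torch.randn(100_000)                 # skip path, unit variance

for block in range(1, 6):
    # stand-in for the residual branch F: same variance as its input and
    # (roughly) uncorrelated with the skip path
    f = torch.randn_like(x) * x.std()
    x = x + f                            # plain residual addition
    print(f"block {block}: Var(x) ≈ {x.var().item():.2f}")
# prints roughly 2, 4, 8, 16, 32: Var(x + F(x)) = Var(x) + Var(F(x)) for
# uncorrelated terms, so the variance doubles in every block
```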

25 of 53

Is normalization indispensable for training deep neural networks?

NeurIPS 2020

Variance is not preserved! It doubles in each block

Normal people solve the variance problem by just using normalization lol

26 of 53

Is normalization indispensable for training deep neural networks?

NeurIPS 2020

Transform it into:

α, β are chosen hyperparameters

variance approximately preserved!
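Continuing the toy illustration above, and assuming the rescaling has the form x ← α·x + β·F(x) with α² + β² = 1 (a plausible reading of this slide, not necessarily the paper's exact scheme), the variance now stays approximately constant:

```python
import torch

torch.manual_seed(0)
x = torch.randn(100_000)                       # unit-variance input
alpha = 0.9                                    # hypothetical choice
beta = (1 - alpha ** 2) ** 0.5                 # so that alpha^2 + beta^2 = 1

for block in range(1, 6):
    f = torch.randn_like(x) * x.std()          # stand-in branch, same variance as x
    x = alpha * x + beta * f                   # rescaled residual addition
    print(f"block {block}: Var(x) ≈ {x.var().item():.2f}")
# Var(alpha*x + beta*F(x)) = alpha^2 Var(x) + beta^2 Var(F(x)) = Var(x)
# for uncorrelated terms, so the variance stays approximately constant
```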

27 of 53

Characterizing Signal Propagation to Close the Performance Gap in Unnormalized ResNets

Transform it into:

ICLR 2021

Mean Shift

The mean of the ReLU output is not zero

Each channel begins to have an offset, meaning that its output is no longer centered at zero

The solution: Scaled Weight Standardization
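The mean shift described above is easy to verify numerically: even for a zero-mean, He-initialized input, the ReLU output has a clearly positive mean and every channel picks up its own offset (a minimal sketch, widths are illustrative):

```python
import torch

torch.manual_seed(0)
x = torch.randn(4096, 512)                          # zero-mean input batch

for layer in range(1, 6):
    w = torch.randn(512, 512) * (2.0 / 512) ** 0.5  # He-initialized weights
    x = torch.relu(x @ w)
    per_channel_offset = x.mean(dim=0).abs().mean().item()
    print(f"layer {layer}: overall mean = {x.mean().item():.3f}, "
          f"mean |per-channel offset| = {per_channel_offset:.3f}")
# the ReLU output is no longer centered at zero, and every channel carries
# its own offset: the mean shift
```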

28 of 53

Characterizing Signal Propagation to Close the Performance Gap in Unnormalized ResNets

The proposed network without normalization achieves the same performance and shows the same behavior as a network with BN

29 of 53

Proposed Method

30 of 53

S-shaped input-output mappings

31 of 53

S-shaped input-output mappings

32 of 53

Dynamic tanh (DyT)

DyT(x) = γ ⊙ tanh(α·x) + β

α: learnable scalar parameter

γ, β: learnable, per-channel vector parameters

33 of 53

Dynamic tanh (DyT)

DyT(x) = γ ⊙ tanh(α·x) + β

α: learnable scalar parameter

γ, β: learnable, per-channel vector parameters

And that’s it!

34 of 53

Dynamic tanh (DyT): Algorithm
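A minimal PyTorch-style sketch of a DyT layer in the γ ⊙ tanh(α·x) + β form shown above; the α₀ = 0.5 default mirrors the paper's recommendation for non-LLM models, and the rest is an illustrative simplification:

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: a drop-in replacement for LayerNorm,
    DyT(x) = gamma * tanh(alpha * x) + beta."""
    def __init__(self, dim, alpha_init=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init)  # learnable scalar
        self.gamma = nn.Parameter(torch.ones(dim))              # learnable per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))              # learnable per-channel shift

    def forward(self, x):
        x = torch.tanh(self.alpha * x)   # squash with a learnable slope, no statistics needed
        return self.gamma * x + self.beta

x = torch.randn(8, 16, 64)               # (batch, tokens, dim)
y = DyT(64)(x)                            # same shape as input, no mean/variance computed
```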

35 of 53

The role of alpha

36 of 53

Dynamic tanh (DyT): Initialization of α for Non-LLM Models

Non-LLM models are relatively insensitive to the choice of α₀

37 of 53

Dynamic tanh (DyT): Initialization of α for Non-LLM Models

Training larger models is more prone to failure, requiring smaller α₀ values or lower learning rates for stable training.

38 of 53

Dynamic tanh (DyT): Initialization of α for LLMs

Tuning α₀ enhances LLM performance

Two key findings emerge:

  1. Larger models require smaller α₀ values
  2. Higher α₀ values for attention blocks improve performance
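As a hedged illustration of how these findings translate into configuration (the numbers below are hypothetical, not the paper's recipe), α₀ can simply be set per location when constructing the DyT layers sketched earlier:

```python
# hypothetical per-location choice of alpha_0, reusing the DyT class sketched above;
# the values are illustrative only
attn_norm = DyT(dim=4096, alpha_init=0.8)    # higher alpha_0 before attention blocks
other_norm = DyT(dim=4096, alpha_init=0.2)   # smaller alpha_0 elsewhere / for larger models
```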

39 of 53

Experiments

40 of 53

Experiments

Supervised Training

Self-supervised Training

41 of 53

Experiments

Consistent gains for all scales

42 of 53

Experiments

Text

Audio

DNA

43 of 53

Follow-up papers

44 of 53

Stronger Normalization-Free Transformers

Jiachen Zhu

Zhuang Liu

(Project Lead)

Mingzhi Chen

Taiming Lu

Mingjie Sun

Release date: 11 Dec 2025

6 months later 🤓

45 of 53

Stronger Normalization-Free Transformers

Jiachen Zhu

Zhuang Liu

(Project Lead)

Mingzhi Chen

Taiming Lu

Mingjie Sun

Release date: 11 Dec 2025

6 months later 🤓

46 of 53

The function values are distributed around 0

47 of 53

The function values are distributed around 0

The function has a limited range

48 of 53

The function values are distributed around 0

The function has a limited range

How sensitive the function is to changes around the zero center

49 of 53

The function values are distributed around 0

The function has a limited range

How sensitive the function is to changes around the zero center

The function can increase or decrease, but it cannot change direction
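These four properties can be checked directly for the tanh(α·x) used by DyT; the sketch below only verifies them for tanh and does not reproduce the follow-up paper's actual function:

```python
import torch

alpha = 1.5
x = torch.linspace(-10, 10, 10_001)
y = torch.tanh(alpha * x)

print("centered around 0:", y.mean().abs().item() < 1e-3)            # odd function, zero-centered
print("limited range:", y.min().item() >= -1 and y.max().item() <= 1) # bounded output
print("slope at 0:", alpha)                                            # sensitivity near the zero center
print("monotonic:", bool((y[1:] >= y[:-1]).all()))                     # never changes direction
```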

50 of 53

51 of 53

Top-1 accuracy on ViT-Base and image generation quality (FID) on DiT-B/4 and DiT-L/4

52 of 53

Stronger Generalization or Better Fitting?

better generalization than normalization layers

53 of 53

Conclusions

Transformers can be trained without normalization layers and still match or surpass the original performance through DyT

A simple tanh(αx) transformation captures much of LayerNorm’s functional effect, suggesting that explicit normalization may not be essential

DyT is especially effective in modern Transformer architectures, but it does not replace BatchNorm as successfully in classical ConvNets