[Figure 1 of 4: Decoder-Only Transformer. Input (Word Embeddings) → Sinusoidal Positional Embeddings → N repeated blocks of [Masked Multi-Head Attention → Add & Norm → Feed Forward Layer → Add & Norm] → Linear → Softmax.]
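
As a reference for the flow above, here is a minimal PyTorch sketch of one such decoder block, using post-norm residual connections (the Add & Norm steps) around masked multi-head attention and the feed-forward layer; the class name DecoderBlock and the default sizes d_model, n_heads, and d_ff are illustrative, not taken from the figure.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One post-norm decoder block: Masked MHA -> Add & Norm -> FFN -> Add & Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: each position may only attend to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1
        )
        attn_out, _ = self.attn(x, x, x, attn_mask=mask, need_weights=False)
        x = self.norm1(x + attn_out)      # Add & Norm after attention
        x = self.norm2(x + self.ffn(x))   # Add & Norm after the feed-forward layer
        return x
```

Stacking this block N times, after adding sinusoidal positional embeddings to the word embeddings and before the final Linear layer and Softmax, reproduces the decoder-only layout sketched in the figure.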

[Figure 2 of 4: Vanilla Transformer vs. Transformer w/ SwiGLU. Both panels share the same layout (Input (Word Embeddings) → Sinusoidal Positional Embeddings → N repeated blocks of [Masked Multi-Head Attention → Add & Norm → feed-forward sublayer → Add & Norm] → Linear → Softmax); the only difference is that the vanilla model's Feed Forward Layer is replaced by a SwiGLU Network.]
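
The feed-forward sublayer being swapped in can be sketched as follows; this is a minimal SwiGLU feed-forward network in PyTorch (the names SwiGLU, w1, w2, w3 and the default dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward network: w2(SiLU(w1(x)) * w3(x)), with no bias terms."""
    def __init__(self, d_model=512, d_ff=1536):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_ff, bias=False)  # value projection
        self.w2 = nn.Linear(d_ff, d_model, bias=False)  # output projection

    def forward(self, x):
        # SiLU (Swish) gates the value projection elementwise.
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```

Because SwiGLU uses three weight matrices instead of two, the hidden dimension is usually shrunk (commonly to roughly two thirds of the vanilla 4*d_model) so the parameter count stays comparable to the feed-forward layer it replaces.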

[Figure 3 of 4: Vanilla Transformer vs. a variant with normalized inputs. Both share the layout Input (Word Embeddings) → Sinusoidal Positional Embeddings → N repeated blocks → Linear → Softmax. The vanilla model applies Add & Norm after Masked Multi-Head Attention and after the Feed Forward Layer (post-norm); the second panel instead applies Norm to the inputs of Masked Multi-Head Attention and the Feed Forward Layer, so each sublayer operates on normalized inputs (pre-norm).]
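
The contrast the figure draws is only about where normalization sits relative to the residual connection. Here is a minimal sketch of the two orderings around a generic sublayer (the class names PostNormBlock and PreNormBlock are illustrative):

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Vanilla ordering: run the sublayer, then Add & Norm the residual sum."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """Pre-norm ordering: normalize the sublayer's input, then add the residual."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```

Normalizing the sublayer inputs (pre-norm) is generally reported to make training of deep transformers more stable, which is why most recent decoder-only models adopt it.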

[Figure 4 of 4: Decoder-Only Transformer vs. SoTA Architecture. Left (Decoder-Only Transformer): Input (Word Embeddings) → Sinusoidal Positional Embeddings → N repeated blocks of [Multi-Head Attention (MHA) → Add & Norm → Feed Forward Layer → Add & Norm] → Linear → Softmax. Right (SoTA Architecture): Input (Word Embeddings) → N repeated blocks of [RMSNorm → Grouped Query Attention (GQA) → RMSNorm → SwiGLU Network] → Linear → Softmax, with no separate positional embedding layer (see the notes below).]
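
RMSNorm, which replaces LayerNorm throughout the right panel, can be sketched as follows (a minimal implementation of the standard RMSNorm formulation; the eps default is illustrative):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescale by the RMS of the features, no mean subtraction."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```

Unlike LayerNorm, RMSNorm skips mean subtraction and the bias term, rescaling only by the root mean square of the features, which makes it slightly cheaper while performing comparably in practice.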

Components of GQA:

- Attention computed using FlashAttention-2
- RoPE embeddings included within each attention block

Across All Blocks:

- No bias in linear layers

Optimizer:

- Decoder-Only Transformer: Adam
- SoTA Architecture: AdamW
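
A minimal sketch of grouped query attention consistent with the notes above: fewer key/value heads than query heads, no bias in the linear layers, and a causal mask. RoPE and the FlashAttention-2 kernel are only noted in comments, and the class name, default sizes, and the repeat_interleave approach to sharing key/value heads are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Causal self-attention with fewer key/value heads than query heads (GQA)."""
    def __init__(self, d_model=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        # No bias in linear layers, matching the note above.
        self.wq = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads shares one key/value head.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        # RoPE would be applied to q and k here; scaled_dot_product_attention can
        # dispatch to a FlashAttention kernel when one is available.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, t, -1))
```

On the optimizer row, the comparison is Adam for the older decoder-only setup versus AdamW for the SoTA one; AdamW applies decoupled weight decay, e.g. torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1), where the learning rate and weight decay shown here are placeholder values.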