[Figure: Decoder-Only Transformer. Input (word embeddings) plus sinusoidal positional embeddings feeds N repeated blocks of masked multi-head attention and a feed-forward layer, each followed by Add & Norm, ending in a linear layer and a softmax.]
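As a rough guide, here is a minimal PyTorch sketch of one such decoder block, assuming the post-norm ("Add & Norm" after each sublayer) layout shown in the figure; the names `DecoderBlock`, `d_model`, `n_heads`, and `d_ff` are illustrative, not taken from the source.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One decoder block: masked self-attention -> Add & Norm -> FFN -> Add & Norm."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: each position may attend only to itself and earlier positions.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)    # Add & Norm after attention
        x = self.norm2(x + self.ff(x))  # Add & Norm after the feed-forward layer
        return x
```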
[Figure: Vanilla Transformer vs. Transformer w/ SwiGLU. Both take word embeddings plus sinusoidal positional embeddings through N repeated blocks of masked multi-head attention with Add & Norm, followed by a linear layer and softmax; the SwiGLU variant replaces the feed-forward layer with a SwiGLU network.]
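A minimal sketch of the SwiGLU network that replaces the feed-forward layer, assuming the common LLaMA-style formulation SwiGLU(x) = (SiLU(x·W_gate) ⊙ x·W_up)·W_down with bias-free linear layers; the hidden size here is an illustrative choice, not a value from the figure.

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated feed-forward network: down-project the elementwise product of a
    SiLU-gated branch and a linear branch."""
    def __init__(self, d_model=512, d_hidden=1408):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```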
[Figure: Vanilla Transformer (post-norm) vs. a pre-norm variant. The vanilla block applies Add & Norm after masked multi-head attention and after the feed-forward layer; the pre-norm variant instead normalizes the inputs to each sublayer (masked multi-head attention and the feed-forward layer), keeping the same embeddings, N repeated blocks, linear layer, and softmax around them.]
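A small sketch contrasting the two residual orderings, assuming `attn`, `ff`, `norm1`, and `norm2` are sublayers like those defined earlier: post-norm normalizes the output of each residual add, while pre-norm normalizes the sublayer inputs.

```python
def post_norm_block(x, attn, ff, norm1, norm2):
    x = norm1(x + attn(x))  # "Add & Norm": normalize after the residual add
    x = norm2(x + ff(x))
    return x

def pre_norm_block(x, attn, ff, norm1, norm2):
    x = x + attn(norm1(x))  # normalize the sublayer input, then add the residual
    x = x + ff(norm2(x))
    return x
```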
[Figure: Decoder-Only Transformer vs. SoTA Architecture. The standard decoder-only block uses word embeddings with sinusoidal positional embeddings, multi-head attention (MHA), a feed-forward layer, and Add & Norm in each of its N repeated blocks before the linear layer and softmax. The SoTA architecture instead builds each block from grouped query attention (GQA), a SwiGLU network, and RMSNorm.]
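A minimal sketch of RMSNorm as it is commonly defined: rescale each feature vector by its root mean square and a learned gain, with no mean subtraction or bias (unlike LayerNorm). The default `d_model` and `eps` values are assumptions.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model=512, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))  # learned per-feature gain
        self.eps = eps

    def forward(self, x):
        # Divide by the RMS of the last dimension (computed as a reciprocal sqrt).
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)
```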
Components of GQA:
- Attention is computed using FlashAttention-2.
- RoPE embeddings are applied within each attention block.

Across all blocks:
- No bias in linear layers.
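A sketch of grouped query attention under stated assumptions: `n_heads` query heads share a smaller set of `n_kv_heads` key/value heads, all projections are bias-free as noted above, and the RoPE and FlashAttention-2 details are omitted so the grouping logic stays visible (PyTorch's `scaled_dot_product_attention` stands in for the fused kernel).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, n_kv_heads=2):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.d_head = d_model // n_heads
        self.wq = nn.Linear(d_model, n_heads * self.d_head, bias=False)
        self.wk = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.wv = nn.Linear(d_model, n_kv_heads * self.d_head, bias=False)
        self.wo = nn.Linear(n_heads * self.d_head, d_model, bias=False)

    def forward(self, x):
        B, T, _ = x.shape
        q = self.wq(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.wk(x).view(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)
        v = self.wv(x).view(B, T, self.n_kv_heads, self.d_head).transpose(1, 2)
        # Each group of query heads reuses the same key/value head.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        out = out.transpose(1, 2).reshape(B, T, -1)
        return self.wo(out)
```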
Optimizer: Adam (vanilla Transformer) vs. AdamW (SoTA architecture).
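For illustration, a typical AdamW setup via `torch.optim.AdamW`; the learning rate, betas, and weight decay shown here are common LLaMA-style assumptions, not values from the source.

```python
import torch

model = torch.nn.Linear(512, 512)  # stand-in for the full transformer
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=3e-4,            # assumed peak learning rate
    betas=(0.9, 0.95),  # assumed momentum terms
    weight_decay=0.1,   # decoupled weight decay (the "W" in AdamW)
)
```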