Lecture 8: Language Model
Berkeley CS294-158 Deep Unsupervised Learning
Spring 2024
Hao Liu
Outline
1
Successes of machine and deep learning
2
Chatbot
Copilot
Sora
TPU datacenter
Language model
3
Language model
4
Autoregressive
5
Modeling likelihood
6
Learning
7
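An autoregressive language model factorizes the sequence likelihood as log p(x) = Σ_t log p(x_t | x_<t) and is trained by maximizing this next-token log-likelihood, i.e. minimizing cross-entropy against the observed next token. A minimal sketch in JAX, assuming some decoder-only model has already produced logits of shape [batch, seq, vocab] (array names here are illustrative, not from the lecture):

    # Next-token negative log-likelihood for an autoregressive model.
    import jax
    import jax.numpy as jnp

    def nll_loss(logits, tokens):
        # Positions <= t predict token t+1, so shift the targets left by one.
        log_probs = jax.nn.log_softmax(logits[:, :-1, :], axis=-1)   # [B, S-1, V]
        targets = tokens[:, 1:]                                      # [B, S-1]
        picked = jnp.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
        return -jnp.mean(picked)   # average cross-entropy, in nats per token

    # Toy usage with random "model outputs" standing in for a real network.
    key = jax.random.PRNGKey(0)
    B, S, V = 2, 8, 50
    logits = jax.random.normal(key, (B, S, V))
    tokens = jax.random.randint(key, (B, S), 0, V)
    print(nll_loss(logits, tokens))

The reported quantity is usually exactly this loss (nats or bits per token), or its exponential, the perplexity.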
Why unsupervised learning?
8
Scaling compute
9
Compute
10
Token
11
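A standard rule of thumb throughout the scaling-law literature is that training a dense transformer costs roughly C ≈ 6·N·D FLOPs for N parameters and D tokens (about 2ND for the forward pass and 4ND for the backward pass). A back-of-the-envelope sketch with illustrative numbers:

    # Rough training-compute estimate using C ~ 6 * N * D.
    def train_flops(n_params: float, n_tokens: float) -> float:
        return 6.0 * n_params * n_tokens

    flops = train_flops(7e9, 1e12)                # e.g. a 7B-parameter model on 1T tokens
    print(f"{flops:.2e} FLOPs")                   # ~4.2e22
    print(f"{flops / 1e15 / 86400:.0f} days at a sustained 1 PFLOP/s")

Dividing by the hardware's sustained (not peak) throughput gives wall-clock time, which is why utilization matters so much at scale.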
Train LSTM with more compute
12
Radford, Alec, Rafal Jozefowicz, and Ilya Sutskever. "Learning to generate reviews and discovering sentiment.” arXiv 2017.
Train LSTM with more compute
13
Radford, Alec, Rafal Jozefowicz, and Ilya Sutskever. "Learning to generate reviews and discovering sentiment.” arXiv 2017.
Train LSTM with more compute
14
Radford, Alec, Rafal Jozefowicz, and Ilya Sutskever. "Learning to generate reviews and discovering sentiment.” arXiv 2017.
Scaling LSTM is difficult
15
Scaling laws for neural language models. Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., ... & Amodei, D. (2020)
“Better usage of context”
“Better model performance”
Transformer architecture
16
Vaswani, Ashish, et al. "Attention is all you need." Advances in neural information processing systems 30 (2017).
Pretraining objective
17
Wang, Thomas, et al. "What language model architecture and pretraining objective works best for zero-shot generalization?" (2022)
Pretraining objective
18
Wang, Thomas, et al. "What language model architecture and pretraining objective works best for zero-shot generalization?" (2022)
Model architecture
19
Tay, Y., Dehghani, M., Abnar, S., Chung, H. W., Fedus, W., Rao, J., ... & Metzler, D. Scaling laws vs model architectures: How does inductive bias influence scaling? (2022).
Model architecture
20
Tay, Y., Dehghani, M., Abnar, S., Chung, H. W., Fedus, W., Rao, J., ... & Metzler, D. Scaling laws vs model architectures: How does inductive bias influence scaling? (2022).
Compute cost
21
"Scaling laws for neural language models.” (2020)
Optimal token and parameter
22
"Scaling laws for neural language models.” (2020)
Allocate compute
23
Allocate more compute to parameters (a) or tokens (b)?
Allocate compute
24
"DeepSeek LLM Scaling Open-Source Language Models with Longtermism.” (2024)
Chinchilla scaling
25
"Training Compute-Optimal Large Language Models." (2022).
Chinchilla scaling
26
"Training Compute-Optimal Large Language Models." (2022).
Determine tokens
27
"Training Compute-Optimal Large Language Models." (2022).
Inference optimal
28
Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." (2023).
Loss predicts performance
29
Touvron, Hugo, et al. "Llama: Open and efficient foundation language models." (2023).
Scaling law prediction
30
"DeepSeek LLM Scaling Open-Source Language Models with Longtermism.” (2024)
Large context
31
Genome
Agent
Codebase
Hyperlinked web
World
Blockwise parallel transformers
32
Liu, H., Abbeel, P. “Blockwise Parallel Transformer for Large Context Models”. NeurIPS 2023 Spotlight.
N layers; outer loop over query blocks, inner loop over key-value blocks
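The computation can be written compactly: attention is evaluated one query block at a time against one key-value block at a time, carrying running softmax statistics, so the full s × s score matrix is never materialized. A minimal single-head JAX sketch (no causal mask, block sizes illustrative; the fused feedforward blocking of BPT is omitted):

    import jax.numpy as jnp

    def blockwise_attention(q, k, v, q_block=128, kv_block=128):
        s, d = q.shape
        scale = 1.0 / jnp.sqrt(d)
        outputs = []
        for qs in range(0, s, q_block):                      # outer loop over query blocks
            qb = q[qs:qs + q_block] * scale
            acc = jnp.zeros((qb.shape[0], d))                # running weighted sum of V
            row_max = jnp.full((qb.shape[0],), -jnp.inf)     # running max for numerical stability
            denom = jnp.zeros((qb.shape[0],))                # running softmax denominator
            for ks in range(0, s, kv_block):                 # inner loop over KV blocks
                kb, vb = k[ks:ks + kv_block], v[ks:ks + kv_block]
                scores = qb @ kb.T                           # only a [q_block, kv_block] tile
                new_max = jnp.maximum(row_max, scores.max(axis=-1))
                corr = jnp.exp(row_max - new_max)            # rescale previous partial results
                p = jnp.exp(scores - new_max[:, None])
                acc = acc * corr[:, None] + p @ vb
                denom = denom * corr + p.sum(axis=-1)
                row_max = new_max
            outputs.append(acc / denom[:, None])
        return jnp.concatenate(outputs, axis=0)

Up to floating-point error this returns the same output as standard softmax attention.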
Analysis of memory cost
33
4x smaller peak activation memory: 8bsh vs. 2bsh, without materializing the O(s²) attention matrix
Liu, H., Abbeel, P. “Blockwise Parallel Transformer for Large Context Models”. NeurIPS 2023 Spotlight.
Evaluation
34
Liu, H., Abbeel, P. “Blockwise Parallel Transformer for Large Context Models”. NeurIPS 2023 Spotlight.
Four times longer context than FlashAttention
Generally applicable
35
Blockwise Transformers allow a 16x memory saving without overhead
Gemma: Open Models Based on Gemini Research and Technology
https://blog.google/technology/developers/gemma-open-models/
16x expanded MLP hidden dimension
Liu, H., Abbeel, P. “Blockwise Parallel Transformer for Large Context Models”. NeurIPS 2023 Spotlight.
Still cannot handle million-length sequences
36
Extension to RingAttention
37
Key-value loop overlaps communication with computation
Query loop is distributed across devices
Liu, H., Abbeel, P. “Blockwise Parallel Transformer for Large Context Models”. NeurIPS 2023 Spotlight.
Liu, H., Zaharia, M., Abbeel, P. “Ring Attention with Blockwise Transformers for Near-Infinite Context”. ICLR 2024.
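The schedule is easiest to see in a toy simulation: each of n devices permanently owns one query block and, at every step, runs the blockwise inner loop above against whichever key-value block currently resides on it, then passes that KV block one hop around the ring. After n steps every query block has attended to every KV block, and because the send/recv of the next KV block can be overlapped with the current block's compute, the communication is hidden. The sketch below only prints the schedule; a real implementation would rotate the blocks with a collective such as jax.lax.ppermute:

    # Toy illustration of the RingAttention communication schedule (not a real implementation).
    def ring_schedule(n_devices: int):
        kv_owner = list(range(n_devices))      # kv_owner[i] = KV block currently held by device i
        for step in range(n_devices):
            for dev in range(n_devices):
                print(f"step {step}: device {dev} attends q-block {dev} x kv-block {kv_owner[dev]}")
            kv_owner = kv_owner[-1:] + kv_owner[:-1]   # rotate KV blocks one hop around the ring

    ring_schedule(4)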
Analysis of arithmetic intensity
38
Liu, H., Abbeel, P. “Blockwise Parallel Transformer for Large Context Models”. NeurIPS 2023 Spotlight.
Liu, H., Zaharia, M., Abbeel, P. “Ring Attention with Blockwise Transformers for Near-Infinite Context”. ICLR 2024.
Evaluation of max sequence length
39
Liu, H., Abbeel, P. “Blockwise Parallel Transformer for Large Context Models”. NeurIPS 2023 Spotlight.
Liu, H., Zaharia, M., Abbeel, P. “Ring Attention with Blockwise Transformers for Near-Infinite Context”. ICLR 2024.
Evaluation of max sequence length
40
Liu, H., Abbeel, P. “Blockwise Parallel Transformer for Large Context Models”. NeurIPS 2023 Spotlight.
Liu, H., Zaharia, M., Abbeel, P. “Ring Attention with Blockwise Transformers for Near-Infinite Context”. ICLR 2024.
512x longer context than blockwise transformers; 2048x longer than FlashAttention
Large World Model
41
“World Model on Million-Length Video and Language with RingAttention”. (2024) largeworldmodel.github.io
1M effective context
42
LWM achieves a highly effective context of over 1M tokens; no “lost in the middle” effect observed.
“World Model on Million-Length Video and Language with RingAttention”. (2024) largeworldmodel.github.io
1M effective context
43
“World Model on Million-Length Video and Language with RingAttention”. (2024) largeworldmodel.github.io
Large World Model: Video Generation
44
“World Model on Million-Length Video and Language with RingAttention”. (2024) largeworldmodel.github.io
Large World Model: Video Generation
45
“World Model on Million-Length Video and Language with RingAttention”. (2024) largeworldmodel.github.io
Large World Model: Hour-Long Video Chat
46
“World Model on Million-Length Video and Language with RingAttention”. (2024) largeworldmodel.github.io
Scaling of long context
47
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. 2024
Understand code repo
48
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. 2024
Large context applications
49
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. 2024
Large context applications
50
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. 2024
Large context applications
51
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. 2024
Large context applications
52
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. 2024
Reduce data movement
53
"Flashattention: Fast and memory-efficient exact attention with io-awareness.” (2022)
Optimization
54
"Flashattention-2: Faster attention with better parallelism and work partitioning.” (2023)
Higher throughput
55
"Flashattention: Fast and memory-efficient exact attention with io-awareness.” (2022)
"Flashattention-2: Faster attention with better parallelism and work partitioning.” (2023)
Tool use and retrieval
56
Tool use and retrieval
57
Borgeaud, Sebastian, et al. "Improving language models by retrieving from trillions of tokens." PMLR, 2022.
Tool use and retrieval
58
Borgeaud, Sebastian, et al. "Improving language models by retrieving from trillions of tokens." PMLR, 2022.
Tool use and retrieval
59
Borgeaud, Sebastian, et al. "Improving language models by retrieving from trillions of tokens." PMLR, 2022.
Tool use and retrieval
60
Borgeaud, Sebastian, et al. "Improving language models by retrieving from trillions of tokens." PMLR, 2022.
Tool use and retrieval
61
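The retrieval step itself is just nearest-neighbor search over embedded chunks of the corpus. A toy sketch of that lookup (RETRO embeds fixed-size chunks with a frozen BERT encoder and feeds the neighbors back through chunked cross-attention; the bag-of-characters "embedding" below is only a stand-in so the example runs without a real encoder):

    import jax
    import jax.numpy as jnp

    def toy_embed(texts, dim=64, seed=0):
        # Stand-in encoder: random projection of a character-count vector.
        proj = jax.random.normal(jax.random.PRNGKey(seed), (256, dim))
        vecs = []
        for t in texts:
            counts = jnp.zeros(256).at[jnp.array([ord(c) % 256 for c in t])].add(1.0)
            vecs.append(counts @ proj / (jnp.linalg.norm(counts) + 1e-6))
        return jnp.stack(vecs)

    chunks = ["the cat sat on the mat",
              "scaling laws for neural language models",
              "ring attention for near-infinite context",
              "mixture-of-experts routing"]
    db = toy_embed(chunks)                        # [n_chunks, dim], built offline
    query_vec = toy_embed(["how do models handle very long context?"])[0]
    scores = db @ query_vec                       # inner-product similarity
    for i in jnp.argsort(-scores)[:2]:            # the two nearest chunks
        print(chunks[int(i)])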
Alternative: sliding window attention
62
Child, Rewon, et al. "Generating long sequences with sparse transformers." (2019).
Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. D. L., ... & Sayed, W. E. Mistral 7B. (2023)
Beltagy, I., Peters, M. E., & Cohan, A. Longformer: The long-document transformer. (2020)
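Sliding-window attention changes only the attention mask: each position attends to at most the previous `window` positions (plus itself), so per-token compute and KV-cache size become O(window) instead of O(seq_len). A minimal sketch of the mask (window size illustrative):

    import jax.numpy as jnp

    def sliding_window_mask(seq_len: int, window: int) -> jnp.ndarray:
        i = jnp.arange(seq_len)[:, None]
        j = jnp.arange(seq_len)[None, :]
        return (j <= i) & (j > i - window)        # True where attention is allowed

    print(sliding_window_mask(6, 3).astype(jnp.int32))

Stacking L such layers still lets information propagate roughly L × window positions, which is how Mistral 7B reaches a long effective context with a 4096-token window.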
Alternative: state space model
63
Gu et al. “Efficiently Modeling Long Sequences with Structured State Spaces”. (2022)
Alternative: state space model
64
Gu, A., & Dao, T. “Mamba: Linear-time sequence modeling with selective state spaces”. (2023)
Alternative: attention + SSM
65
“Simple linear attention language models balance the recall-throughput tradeoff” (2024)
Based
66
“Simple linear attention language models balance the recall-throughput tradeoff” (2024)
Griffin
67
"Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models." (2024).
Griffin
68
"Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models." (2024).
Griffin
69
"Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models." (2024).
KV cache
70
KV cache
71
KV cache
72
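During autoregressive decoding, the keys and values of all previous tokens are cached so each new token computes only its own K and V and attends over the cache. A minimal single-head sketch (illustrative projection matrices, no batching):

    import jax
    import jax.numpy as jnp

    def decode_step(x_new, k_cache, v_cache, Wq, Wk, Wv):
        q = x_new @ Wq                                              # [d]
        k_cache = jnp.concatenate([k_cache, (x_new @ Wk)[None, :]], axis=0)
        v_cache = jnp.concatenate([v_cache, (x_new @ Wv)[None, :]], axis=0)
        attn = jax.nn.softmax(k_cache @ q / jnp.sqrt(q.shape[-1]))  # [cache_len]
        return attn @ v_cache, k_cache, v_cache

    key = jax.random.PRNGKey(0)
    d = 8
    Wq, Wk, Wv = (jax.random.normal(k, (d, d)) for k in jax.random.split(key, 3))
    k_cache = jnp.zeros((0, d))
    v_cache = jnp.zeros((0, d))
    for t in range(5):                                              # decode 5 tokens
        x = jax.random.normal(jax.random.PRNGKey(t), (d,))
        out, k_cache, v_cache = decode_step(x, k_cache, v_cache, Wq, Wk, Wv)
    print(k_cache.shape)                                            # (5, 8): the cache grows with length

For a full model the cache stores K and V for every layer and KV head, roughly 2 × n_layers × n_kv_heads × head_dim × seq_len × bytes_per_value per sequence; at long context this, not the weights, dominates memory and bandwidth, which is what the quantization and multi-/grouped-query techniques on the following slides attack.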
Compute and memory bandwidth
73
KV cache compression
74
Hooper, Coleman, et al. "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.” (2024)
KV cache compression
75
Hooper, Coleman, et al. "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization.” (2024)
Multi-query attention
76
Group query attention
77
Trade off
78
Group query attention
79
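Multi-query attention shares a single key/value head across all query heads; grouped-query attention interpolates, using n_kv < n_q key/value heads so the KV cache shrinks by a factor of n_q / n_kv with little quality loss. A minimal sketch (shapes and sizes illustrative):

    import jax
    import jax.numpy as jnp

    def gqa(q, k, v):
        # q: [n_q, s, d]; k, v: [n_kv, s, d] with n_q a multiple of n_kv
        n_q, s, d = q.shape
        group = n_q // k.shape[0]
        k = jnp.repeat(k, group, axis=0)               # each KV head serves `group` query heads
        v = jnp.repeat(v, group, axis=0)
        scores = jnp.einsum("hqd,hkd->hqk", q, k) / jnp.sqrt(d)
        return jnp.einsum("hqk,hkd->hqd", jax.nn.softmax(scores, axis=-1), v)

    key = jax.random.PRNGKey(0)
    q = jax.random.normal(key, (8, 16, 32))            # 8 query heads
    k = jax.random.normal(key, (2, 16, 32))            # 2 KV heads -> 4x smaller KV cache
    v = jax.random.normal(key, (2, 16, 32))
    print(gqa(q, k, v).shape)                          # (8, 16, 32)

With n_kv = 1 this reduces to multi-query attention; with n_kv = n_q it is standard multi-head attention.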
DRAM stacking on GPU
80
Mixture of experts
81
Shazeer, Noam, et al. "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer." (2017).
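A sparsely gated MoE layer replaces one dense MLP with n expert MLPs plus a router that sends each token to only its top-k experts, so parameter count grows while per-token compute stays roughly constant. A minimal top-2 sketch (a dense gather over expert weights for clarity; real implementations add a load-balancing loss, capacity limits, and expert-parallel dispatch; names are illustrative):

    import jax
    import jax.numpy as jnp

    def moe_layer(x, router_w, expert_w1, expert_w2, top_k=2):
        # x: [tokens, d]; expert_w1: [n_exp, d, d_ff]; expert_w2: [n_exp, d_ff, d]
        logits = x @ router_w                                 # router scores, [tokens, n_exp]
        top_vals, top_idx = jax.lax.top_k(logits, top_k)      # choose top-k experts per token
        gates = jax.nn.softmax(top_vals, axis=-1)             # renormalize over the chosen experts
        out = jnp.zeros_like(x)
        for slot in range(top_k):
            w1 = expert_w1[top_idx[:, slot]]                  # gather each token's expert weights
            w2 = expert_w2[top_idx[:, slot]]
            h = jax.nn.gelu(jnp.einsum("td,tdf->tf", x, w1))
            out = out + gates[:, slot:slot + 1] * jnp.einsum("tf,tfd->td", h, w2)
        return out

    key = jax.random.PRNGKey(0)
    tokens, d, d_ff, n_exp = 6, 16, 32, 8
    x = jax.random.normal(key, (tokens, d))
    router_w = 0.1 * jax.random.normal(key, (d, n_exp))
    expert_w1 = 0.1 * jax.random.normal(key, (n_exp, d, d_ff))
    expert_w2 = 0.1 * jax.random.normal(key, (n_exp, d_ff, d))
    print(moe_layer(x, router_w, expert_w1, expert_w2).shape)  # (6, 16)

Mixtral 8x7B follows this pattern with 8 experts and top-2 routing, so only about 13B of its roughly 47B parameters are active for any given token.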
Mixtral 8x7B
82
Mistral AI. "Mixtral of experts." (2024).
Mixtral 8x7B
83
Mistral AI. "Mixtral of experts." (2024).
Mixtral 8x7B
84
Mistral AI. "Mixtral of experts." (2024).
Reasoning
85
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., ... & Schulman, J. Training verifiers to solve math word problems. (2021)
Finetuning requires lots of data
86
Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., ... & Schulman, J. Training verifiers to solve math word problems. (2021)
Scratchpad
87
Nye, Maxwell, et al. "Show your work: Scratchpads for intermediate computation with language models.” (2021)
Scratchpad
88
Nye, Maxwell, et al. "Show your work: Scratchpads for intermediate computation with language models.” (2021)
CoT
89
Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." (2022)
CoT
90
Wei, Jason, et al. "Chain-of-thought prompting elicits reasoning in large language models." (2022)
Zero-shot CoT
91
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. Large language models are zero-shot reasoners. (2022).
Zero-shot CoT
92
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. Large language models are zero-shot reasoners. (2022).
Zero-shot CoT
93
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. Large language models are zero-shot reasoners. (2022).
Zero-shot CoT
94
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. Large language models are zero-shot reasoners. (2022).
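The difference between the prompting styles is easiest to see side by side: few-shot chain-of-thought prepends worked examples that include the reasoning, while zero-shot CoT simply appends a trigger phrase before sampling. Illustrative prompt strings (the tennis-ball and cafeteria questions are adapted from the examples in Wei et al., 2022):

    question = "A cafeteria had 23 apples. It used 20 and bought 6 more. How many apples are there now?"

    few_shot_cot = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
        "5 + 6 = 11. The answer is 11.\n"
        f"Q: {question}\nA:"
    )

    zero_shot_cot = f"Q: {question}\nA: Let's think step by step."
    print(zero_shot_cot)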
Process feedback
95
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., ... & Cobbe, K. Let's Verify Step by Step. (2023).
Process feedback
96
Lightman, H., Kosaraju, V., Burda, Y., Edwards, H., Baker, B., Lee, T., ... & Cobbe, K. Let's Verify Step by Step. (2023).
RLHF
97
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." (2022)
RLHF
98
Ouyang, Long, et al. "Training language models to follow instructions with human feedback." (2022)
Code generation loss scaling
99
Chen, Mark, et al. "Evaluating large language models trained on code." (2021).
Code generation accuracy scaling
100
Chen, Mark, et al. "Evaluating large language models trained on code." (2021).
Large context code loss
101
The Claude 3 Model Family: Opus, Sonnet, Haiku. 2024
AlphaCode
102
“Competition-Level Code Generation with AlphaCode” (2022)
“AlphaCode 2 Technical Report” (2023)
AlphaCode
103
“Competition-Level Code Generation with AlphaCode” (2022)
“AlphaCode 2 Technical Report” (2023)
AlphaCode
104
“Competition-Level Code Generation with AlphaCode” (2022)
“AlphaCode 2 Technical Report” (2023)
AlphaCode2
105
“Competition-Level Code Generation with AlphaCode” (2022)
“AlphaCode 2 Technical Report” (2023)
Pretraining data
106
Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., ... & Leahy, C. The Pile: An 800GB dataset of diverse text for language modeling. (2020).
Filtering data
107
“LLaMA: Open and Efficient Foundation Language Models” (2023).
RedPajama
108
OpenLLaMA
109
“OpenLLaMA: An Open Reproduction of LLaMA” (2024)
TPU / GPU
110
GEMM
111
TPU, systolic array 8x128x128
GPU, many 8x4x8 ALU
Matmul sharding
112
Shoeybi M, Patwary M, Puri R, LeGresley P, Casper J, Catanzaro B. Megatron-LM: Training multi-billion parameter language models using model parallelism. (2019)
Matmul sharding
113
Matmul sharding
114
Matmul sharding
115
Matmul sharding
116
Matmul sharding
117
Matmul sharding
118
Matmul sharding
119
MLP sharding
120
MLP sharding
121
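For the transformer MLP, Megatron-LM splits the first weight matrix by columns and the second by rows: the elementwise nonlinearity then applies independently to each device's hidden shard, and a single all-reduce sums the partial outputs. A toy single-process simulation (the loop over shards stands in for parallel devices, and sum() for the all-reduce):

    import jax
    import jax.numpy as jnp

    def sharded_mlp(x, W1, W2, n_shards=4):
        shard = W1.shape[1] // n_shards
        partials = []
        for i in range(n_shards):                       # one iteration per "device"
            W1_i = W1[:, i * shard:(i + 1) * shard]     # column shard of W1
            W2_i = W2[i * shard:(i + 1) * shard, :]     # matching row shard of W2
            h_i = jax.nn.gelu(x @ W1_i)                 # local compute, no communication yet
            partials.append(h_i @ W2_i)                 # partial output on each device
        return sum(partials)                            # the single all-reduce

    key = jax.random.PRNGKey(0)
    x = jax.random.normal(key, (2, 16))
    W1 = 0.1 * jax.random.normal(key, (16, 64))
    W2 = 0.1 * jax.random.normal(key, (64, 16))
    dense = jax.nn.gelu(x @ W1) @ W2                    # unsharded reference
    print(jnp.allclose(sharded_mlp(x, W1, W2), dense, atol=1e-5))   # True

The check at the end confirms the sharded computation matches the unsharded MLP.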
Attention sharding
122
Attention sharding
123
Attention sharding
124
Data parallelism
125
Rafi W. “Sharding Techniques Single Slice Sharding For Dense LLMs”
Data parallelism
126
Rafi W. “Sharding Techniques Single Slice Sharding For Dense LLMs”
Fully sharded data parallelism
127
Rafi W. “Sharding Techniques Single Slice Sharding For Dense LLMs”
Fully sharded data parallelism
128
Rafi W. “Sharding Techniques Single Slice Sharding For Dense LLMs”
Fully sharded data parallelism
129
Rafi W. “Sharding Techniques Single Slice Sharding For Dense LLMs”
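Fully sharded data parallelism (ZeRO-3 style) keeps only a 1/n slice of every weight (and of its gradient and optimizer state) resident on each device, all-gathers the full weight just before it is used, and discards it again afterwards. A toy single-process simulation of the forward pass (lists stand in for devices and concatenation for the all-gather; gradient/optimizer-state sharding and compute/communication overlap are omitted):

    import jax
    import jax.numpy as jnp

    def fsdp_forward(x_per_device, weight_shards):
        # weight_shards: list of [d_in / n, d_out] slices, one per device
        full_W = jnp.concatenate(weight_shards, axis=0)     # the all-gather
        return [x @ full_W for x in x_per_device]           # each device runs its own data shard

    key = jax.random.PRNGKey(0)
    W = jax.random.normal(key, (16, 8))
    shards = jnp.split(W, 4, axis=0)                        # persistent storage: 1/4 of W per device
    xs = [jax.random.normal(jax.random.PRNGKey(i), (2, 16)) for i in range(4)]
    outs = fsdp_forward(xs, shards)
    print(jnp.allclose(outs[0], xs[0] @ W))                 # True

Tensor parallelism, next, instead keeps weights permanently sharded and communicates activations.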
Tensor parallelism
130
Rafi W. “Sharding Techniques Single Slice Sharding For Dense LLMs”
Tensor parallelism
131
Rafi W. “Sharding Techniques Single Slice Sharding For Dense LLMs”
Tensor parallelism
132
Rafi W. “Sharding Techniques Single Slice Sharding For Dense LLMs”
FSDP with TP
133
FSDP with TP
134
FSDP with TP
135
Practical guideline
136
Open problems
137