10-605 / 10-805
Machine Learning from Large Datasets
Outline
HYPERPARAMETER SELECTION
RECAP
Hyperparameter tuning
Ideally, use the test data only once, after everything else has been finalized
RECAP
Grid search
Try out multiple values of λ
Pick the one that is best on the validation set
Validation set should be independent of the test set
“Best” does not have to be according to the loss the learner is optimizing.
The scale need not be linear (e.g., logarithmically spaced values of λ)
RECAP
Grid search
With multiple hyperparameters, grid search tests all combinations of their values.
Grid search gets expensive fast…we’ll come back to this later in the course!
e.g., regularization, drop out, number of epochs, learning rate, momentum term, network sizes, architectural variants, …
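The combinatorial blow-up can be seen in a few lines. A minimal sketch, assuming a hypothetical `train_and_eval` function that fits a model for one configuration and returns a validation score (higher is better):

```python
from itertools import product

def grid_search(train_and_eval, grid):
    """grid: dict mapping hyperparameter name -> list of candidate values."""
    names = list(grid)
    best_score, best_config = float("-inf"), None
    # Cost is the product of the grid sizes: every combination is tried.
    for values in product(*(grid[n] for n in names)):
        config = dict(zip(names, values))
        score = train_and_eval(**config)
        if score > best_score:
            best_score, best_config = score, config
    return best_config, best_score
```

With a 3-value grid for λ and a 2-value grid for the learning rate, this already runs 6 experiments; every added hyperparameter multiplies the count.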
RECAP
you are here!
Hyperparameter Selection
Optimization task, but one where we can't use gradients or other derivative-based methods
Hyperparameter Selection
Outline
Hyperparameter Selection
Adaptive selection loop:
A simple case:
Goal: find the best of the k values quickly
Explore the space intelligently
Theory: Bandit Problems in ML
Goal: (1) discover which arm has the lowest loss, so you can exploit this information; (2) discover the best arm quickly, so you can start winning!
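One classic bandit strategy can be sketched minimally. This is UCB1, under the assumption that `pull(i)` plays arm i and returns a reward in [0, 1] (higher is better; the slides' lowest-loss framing is the mirror image):

```python
import math

def ucb1(pull, k, T):
    """Play k arms for T total pulls; return the most-pulled arm."""
    counts = [0] * k
    means = [0.0] * k
    # Pull each arm once to initialize.
    for i in range(k):
        counts[i] = 1
        means[i] = pull(i)
    for t in range(k, T):
        # Optimism in the face of uncertainty: mean + exploration bonus.
        ucb = [means[i] + math.sqrt(2 * math.log(t + 1) / counts[i])
               for i in range(k)]
        i = max(range(k), key=lambda j: ucb[j])
        r = pull(i)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # running mean update
    return max(range(k), key=lambda i: counts[i])
```

The exploration bonus shrinks as an arm is pulled more, so the algorithm balances discovering the best arm against exploiting it.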
Theory: Bandit Problems in ML
Hyperparameter Selection
Adaptive selection loop:
Outline
Gaussian Processes
Gaussian Process Regression
Mapping a hyperparameter configuration x to learner performance f(x) is a regression task
Challenges:
Gaussian Process Regression is a good match for this
2-D Gaussian Process
Sample a point from the Gaussian, plot it as a line segment
2-D Gaussian Processes
Sample a point from the Gaussian, plot it as a line segment
2-D Gaussian Processes: Condition on a Point
Sample a point from p(X2 | x1), plot it as a line segment
p(X2|x1) is also a Gaussian
5-D Gaussian Processes
Conditioned on a point
20-D / 5D Gaussian Processes
Conditioned on a point
Conditioned on 4 points
What is Σ ?
20-D / 5D Gaussian Processes
Conditioned on a point
Conditioned
Gaussian Process Regression
Define the covariance between x_i and x_j with a kernel, defined for all pairs x_i, x_j
y1 is f(x), y2 is x
A = cov(Y1)
B = cov(Y1, Y2)
C = cov(Y2)
Gaussian Process Regression
A = cov(Y1)
B = cov(Y1, Y2)
C = cov(Y2)
y2 is observations
y1 is prediction
computing C^-1 is O(n^3),
where n is the number of observations
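The conditioning above can be written out directly for a zero-mean GP with an RBF kernel, matching the block notation on the slide (A = cov(Y1), B = cov(Y1, Y2), C = cov(Y2)): the posterior mean is B C^-1 y2 and the posterior covariance is A - B C^-1 B^T. A minimal numpy sketch; `ell` and `noise` are assumed hyperparameters:

```python
import numpy as np

def rbf(xa, xb, ell=1.0):
    # Squared-exponential (RBF) kernel on 1-D inputs.
    d = xa[:, None] - xb[None, :]
    return np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior(x_obs, y_obs, x_new, ell=1.0, noise=1e-6):
    C = rbf(x_obs, x_obs, ell) + noise * np.eye(len(x_obs))  # cov(Y2)
    B = rbf(x_new, x_obs, ell)                               # cov(Y1, Y2)
    A = rbf(x_new, x_new, ell)                               # cov(Y1)
    C_inv_y = np.linalg.solve(C, y_obs)      # the O(n^3) step
    mean = B @ C_inv_y                       # E[Y1 | Y2 = y_obs]
    cov = A - B @ np.linalg.solve(C, B.T)    # cov[Y1 | Y2 = y_obs]
    return mean, cov
```

Conditioning on an observed point collapses the posterior variance there to (nearly) zero, which is exactly what the conditioned plots on the earlier slides show.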
Gaussian Process Definition
Consider a process that generates any number of data points x_1, …, x_T
A Gaussian process is one where, for any finite subset of indices t_1, …, t_n, those variables are jointly multivariate Gaussian
(X_{t_1}, …, X_{t_n}) ~ Normal(μ, Σ)
y1 is f(x), y2 is x
A = cov(Y1)
B = cov(Y1, Y2)
C = cov(Y2)
even ∞
will define with kernel
Why I don’t like this cartoon
Outline
Vanilla UCB can waste time in the “corners” of the graph
Gaussian Process Regression
Downsides:
Outline
EARLY STOPPING AND HYPERPARAMETER SELECTION
Hyperparameter Selection: Early Stopping
What if experiments have differing costs?
Example: #iterations of training is a hyperparameter
We can save computation if we cut off unpromising experiments early
Hyperparameter Selection: Early Stopping
What can we prove about this?
What do we need to assume?
In general we don’t know where a learning curve will end up…
In practice there are often big jumps in discrete eval metrics like error rate, win ratios, …
Many of the losses you train and test on are smooth
Recap: Chinchilla Scaling Laws
Chinchilla: 63B parameters, 1.4T training tokens
Constraint = budget = $$
Idea: fit a simple model to the data that relates compute, parameters, and loss, and use that to optimize loss given a total compute budget.
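The idea can be turned into arithmetic with two common approximations (both assumptions of this sketch, not exact fits): compute C ≈ 6·N·D FLOPs for N parameters and D training tokens, and the Chinchilla rule of thumb of roughly 20 tokens per parameter:

```python
def compute_optimal(C, tokens_per_param=20.0):
    """Given a FLOPs budget C, return the rough compute-optimal (N, D)."""
    # C ~= 6*N*D and D ~= 20*N  =>  N = sqrt(C / (6 * 20))
    N = (C / (6.0 * tokens_per_param)) ** 0.5  # parameters
    D = tokens_per_param * N                   # training tokens
    return N, D
```

Plugging in a budget on the order of 5.8e23 FLOPs gives roughly 70B parameters and 1.4T tokens, in the same ballpark as the slide's numbers.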
Caveat About Scaling Laws
Hyperparameter Selection: Early Stopping
If we know what value a curve is converging to, we can be confident when two curves separate
What can we prove about this?
What do we need to assume?
Hyperparameter Selection: Early Stopping
What’s a robust and useful heuristic?
What can we prove about this?
What do we need to assume?
Choosing the winner of a Science Fair
(slide credit: Kevin Jamieson)
Setup: n projects (Project 1, …, Project n) and a fixed pool of judges; each judge scores one project.

Uniform Judge Allocation
Split the judges evenly: the same number of judges per project, and rank projects by average score.
Problem: with the judges spread thin, every average has a wide confidence interval (±0.3), and the top projects, 4.8 ± 0.3 vs. 4.7 ± 0.3, are statistically indistinguishable!

Adaptive Judge Allocation: proceed in rounds
Round 1: give every project a couple of judges (scores like 2, 4 | 5, 4 | 5, 4 | 2, 1), giving averages 2.2, 3.1, 4.8, 4.7, each ±0.5. Eliminate the clearly bad projects.
Round 2: reallocate those judges to the survivors; their intervals tighten to 4.8 ± 0.3 and 4.7 ± 0.3. Eliminate again.
Round 3: concentrate the remaining judges on the finalists; 4.9 ± 0.1 beats 4.7 ± 0.1, so a winner can be declared.
Same number of total judgments as uniform allocation, but the adaptive scheme confidently resolves the winner.
Why Halving?
Each round spends the same total budget (here, 8 units), split among the surviving configurations:
configs alive:      8   4   2   1
budget per config:  1   2   4   8
Successive Halving Method
If the number of configurations n is large, we may get very noisy results early on and miss the best
If n is too small then we will waste resources on bad arms.
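The method above can be sketched in a few lines, assuming a hypothetical `evaluate(config, budget)` that trains `config` for `budget` units and returns a loss (lower is better):

```python
def successive_halving(configs, evaluate, min_budget=1):
    """Each round, train survivors for double the budget; keep the best half."""
    alive = list(configs)
    budget = min_budget
    while len(alive) > 1:
        losses = [(evaluate(c, budget), c) for c in alive]
        losses.sort(key=lambda t: t[0])
        alive = [c for _, c in losses[: max(1, len(alive) // 2)]]
        budget *= 2
    return alive[0]
```

Note the tradeoff from the bullets above is baked in: the starting number of configurations fixes how little budget each one gets in the noisy first round.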
Outline
Hyperband
Hyperband runs several rounds of SuccessiveHalving with very different values of n
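A rough sketch of that loop (simplified from the full algorithm; `sample()` draws a random configuration and `evaluate(config, budget)` returns a loss, both assumed interfaces). Each bracket tries a different tradeoff: many configurations with little budget each, down to few configurations with the full budget:

```python
import math
import random

def hyperband(sample, evaluate, max_budget=81, eta=3):
    """Run one Hyperband sweep; returns the best configuration found."""
    s_max = round(math.log(max_budget) / math.log(eta))
    best = (float("inf"), None)  # (loss, config)
    for s in range(s_max, -1, -1):          # one bracket per n-vs-budget tradeoff
        n = math.ceil((s_max + 1) * eta ** s / (s + 1))  # configs in bracket
        budget = max_budget * eta ** (-s)                # initial budget each
        alive = [sample() for _ in range(n)]
        for i in range(s + 1):              # successive halving inside a bracket
            losses = sorted((evaluate(c, budget * eta ** i), c) for c in alive)
            best = min(best, losses[0])
            alive = [c for _, c in losses[: max(1, len(alive) // eta)]]
    return best[1]
```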
Outline
TOKENIZATION FOR TRANSFORMERS
The token vocabulary that is used is important
For NLP, current approaches use techniques like Byte Pair Encoding, WordPiece, or SentencePiece, which are all quite similar
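The core of BPE-style vocabulary building can be shown with a toy sketch (not any specific library's tokenizer): repeatedly merge the most frequent adjacent symbol pair in the corpus.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """words: list of symbol tuples, e.g. [('l', 'o', 'w'), ...]."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in corpus:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
        merges.append((a, b))
        # Replace every occurrence of the pair with the merged symbol.
        for w in corpus:
            i = 0
            while i < len(w) - 1:
                if w[i] == a and w[i + 1] == b:
                    w[i : i + 2] = [a + b]
                else:
                    i += 1
    return merges, corpus
```

The learned merge list is the vocabulary: at tokenization time the same merges are replayed on new text.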
Tokenization for Vision Transformers
Transformers are used for other modalities of input as well: images, audio, …
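For images, the analogue of tokenization is patch extraction: split an H×W image into P×P patches and flatten each patch into one token vector. A minimal numpy sketch (the patch size P and channel layout are assumptions of the illustration):

```python
import numpy as np

def patchify(img, P):
    """img: (H, W, C) array -> (num_patches, P*P*C) token matrix."""
    H, W, C = img.shape
    assert H % P == 0 and W % P == 0
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (num_patches, P*P*C)
    x = img.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, P * P * C)
```

Each row is then linearly projected to the model dimension and fed to the Transformer like a word embedding.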
Tokenization for Vision Transformers
2D positional encoding is not better than 1D positional encoding
Tokenization for Vision Transformers
JFT = 300M images @ Google
Also trained on JFT
21k classes and 14M images
Vision Transformers – Pros and Cons
Multi-Modal Transformer: Example (2022)
0. Pretrained components: ViT and T5
1. …
2. Fine-tuning (FT)
One encoder is used for two purposes: retrieval and generation (given a retrieved “memory”)
Image-caption pairs: ignore the memory; train image + prompt = “generate caption:” → caption
Visual QA: memory is captions; train for image + prompt = question → answer
PAQ (question, passage, answer): memory is passages; train for question → answer
Vision Transformers – Pros and Cons
Oracle retrieval
Retrieval at test time
Outline
KEY-VALUE CACHING FOR TRANSFORMERS
Masked self-attention is used in the decoder
In masked self-attention, the query for the token at position i can only attend to the keys for tokens at positions 0, 1, …, i
This is done with a mask
Prediction for token at i is not affected by “future” tokens at i+1,i+2, …
aka ”causal LM” or “causal masking”
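The mask itself is just a lower-triangular pattern of 0 (keep) and -inf (block), added to the attention scores before the softmax. A minimal numpy sketch:

```python
import numpy as np

def causal_mask(T):
    """(T, T) additive mask: 0 where j <= i, -inf where j > i."""
    return np.where(np.tril(np.ones((T, T), dtype=bool)), 0.0, -np.inf)
```

After the softmax, the -inf entries become exactly zero attention weight, so position i never sees positions i+1, i+2, ….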
Recap: masked self-attention
GPT
Recap: masked self-attention is always used in decoder-only Transformers
GPT
Recap: generation is auto-regressive
Observation
(figure: auto-regressive generation proceeds one token at a time, steps 1–4)
Observation
In Transformers the QK^T matmul is the only place where information from one token’s representation is moved to another token
If x_i is the token representation:
q_i = x_i W_Q ;  k_i = x_i W_K ;  v_i = x_i W_V
Observation
(figure: steps 1–2 are now fixed inputs; steps 3–4 remain)
Third output depends only on the first two tokens
Key-Value Caching
(figure: steps 1–3 are now fixed inputs; step 4 remains)
Third output depends only on the first two tokens
Fourth output depends only on the first three tokens
The keys and values computed at earlier steps have not changed since step 3
The keys and values can be cached and re-used in forward pass
GPT
Important point: the FFN and LayerNorms are applied independently to each token position, so masked self-attention (MSA) is the only place where information moves from one token position to another.
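The caching idea can be sketched for a single attention head (numpy; the class and variable names are illustrative, and the sqrt(d) scaling follows standard scaled dot-product attention): at each step only the new token's key and value are computed, while all past keys and values are reused from the cache.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class KVCache:
    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.keys, self.values = [], []

    def step(self, x):
        """Attend the new token x (a (d,) vector) to itself and all past tokens."""
        q = x @ self.Wq
        # Only the new token's key/value are computed; past ones are reused.
        self.keys.append(x @ self.Wk)
        self.values.append(x @ self.Wv)
        K = np.stack(self.keys)    # (t, d)
        V = np.stack(self.values)  # (t, d)
        attn = softmax(K @ q / np.sqrt(len(q)))  # causal: only past keys exist
        return attn @ V
```

Because causal masking makes each output depend only on positions 0..i, the cached step reproduces exactly what a full masked-attention recomputation would produce, at a fraction of the cost.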