1 of 91

10-605 / 10-805

Machine Learning from Large Datasets

2 of 91

Outline

  • Hyperparameter Tuning
    • Beyond grid search
    • Experimentation with a surrogate model
    • Gaussian Process Regression models
  • Early stopping for hyperparameter selection
    • Successive halving
    • Hyperband
  • More on Transformers
    • Tokenization for Vision Transformers (ViT)
    • Key-Value Caching

3 of 91

HYPERPARAMETER SELECTION

4 of 91


RECAP

5 of 91


Hyperparameter tuning

Ideally, use the test data only once, after everything else has been finalized

RECAP

6 of 91


Grid search

Try out multiple values of λ

Pick the one that is best on the validation set

Validation set should be independent of the test set

“Best” does not have to be according to the loss the learner is optimizing.

Scale need not be linear
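The recipe above can be sketched in a few lines; `train_and_eval` is a hypothetical user-supplied function (train with regularization strength λ, return a validation score), and the toy score below only stands in for a real training run:

```python
import numpy as np

def grid_search_lambda(train_and_eval, lams):
    """Try each candidate lambda; keep the one with the best validation score.

    "Best" is whatever train_and_eval returns -- it need not be the loss
    the learner itself is optimizing.
    """
    scores = {lam: train_and_eval(lam) for lam in lams}
    best = max(scores, key=scores.get)
    return best, scores

# Scale need not be linear: log-spaced candidates are the usual choice.
lams = np.logspace(-4, 1, num=6)                   # 1e-4, 1e-3, ..., 1e1
toy_eval = lambda lam: -(np.log10(lam) + 2) ** 2   # stand-in for training
best, scores = grid_search_lambda(toy_eval, lams)
```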

RECAP

7 of 91


Grid search

With multiple hyperparameters you need to test all combinations of the parameters.

Grid search gets expensive fast…we’ll come back to this later in the course!

e.g., regularization, drop out, number of epochs, learning rate, momentum term, network sizes, architectural variants, …
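The blow-up is just a Cartesian product of the per-parameter choices; a toy count (all grid values below are made up):

```python
from itertools import product

# Hypothetical grid: even a handful of choices per hyperparameter multiplies.
grid = {
    "l2":       [0.0, 1e-4, 1e-2],
    "dropout":  [0.0, 0.3, 0.5],
    "epochs":   [10, 50],
    "lr":       [1e-3, 1e-2, 1e-1],
    "momentum": [0.0, 0.9],
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs))  # 3 * 3 * 2 * 3 * 2 = 108 full training runs
```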

RECAP

you are here!

8 of 91

Hyperparameter Selection

  • Given
    • a set of hyperparameters
      • which leads to a space of possible configurations
  • Find:
    • the configuration c that gives the best possible result f(c) for some task
      • e.g., minimize validation loss after training with c

This is an optimization task, but we can't use gradients or other standard methods
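Since f(c) gives us no gradients, the simplest improvement over a fixed grid is to sample configurations at random and keep the best; all names below are ours, and the toy f stands in for a real training run:

```python
import random

def random_search(sample_config, f, budget, seed=0):
    """Black-box minimization of f over configurations c -- no gradients.

    sample_config(rng) draws one configuration; f(c) is e.g. the
    validation loss after training with c.
    """
    rng = random.Random(seed)
    best_c, best_val = None, float("inf")
    for _ in range(budget):
        c = sample_config(rng)
        val = f(c)
        if val < best_val:
            best_c, best_val = c, val
    return best_c, best_val

best_c, best_val = random_search(
    sample_config=lambda rng: {"lr": 10 ** rng.uniform(-5, -1)},
    f=lambda c: (c["lr"] - 0.01) ** 2,   # toy stand-in for validation loss
    budget=50,
)
```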

9 of 91

Hyperparameter Selection

10 of 91

Outline

  • Hyperparameter Tuning
    • Beyond grid search
    • Experimentation with a surrogate model
    • Gaussian Process Regression models
  • Early stopping for hyperparameter selection
    • Successive halving
    • Hyperband
  • More on Transformers
    • Tokenization for Vision Transformers (ViT)
    • Key-Value Caching

11 of 91

Hyperparameter Selection

Adaptive selection loop:

  • Build a model of f from past experiments
    • “Surrogate model”
  • Pick the next experiment to do based on the surrogate model
    • Experiment = learning task
  • Repeat…

A simple case:

  • a single discrete hyperparameter with k possible values
  • but f is random (e.g., there’s a seed for the ML algorithm)

Goal: find the best of the k values quickly

Explore the space intelligently

12 of 91

Theory: Bandit Problems in ML

Goal: (1) discover which arm has the lowest loss, so you can exploit this information; (2) discover the best arm quickly, so you can start winning!

13 of 91

Theory: Bandit Problems in ML

  • Some common sampling strategies
    • ε-greedy: Pick a random arm with probability ε and the best arm so far with probability 1 − ε
    • Upper confidence bound (UCB): Compute confidence intervals on the expected loss of each arm, and pick the arm with the best UCB
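Both strategies can be sketched for the k-armed case. This version is phrased for rewards (higher is better), so flip signs to use it with losses; the names and the UCB constant are our choices:

```python
import math, random

def eps_greedy_pick(means, eps, rng):
    """With probability eps explore a random arm; otherwise exploit the arm
    with the best empirical mean so far."""
    if rng.random() < eps:
        return rng.randrange(len(means))
    return max(range(len(means)), key=lambda a: means[a])

def ucb_pick(means, counts, t, c=2.0):
    """Pick the arm with the largest upper confidence bound on its mean."""
    for a, n in enumerate(counts):
        if n == 0:                      # try every arm at least once
            return a
    return max(range(len(means)),
               key=lambda a: means[a] + math.sqrt(c * math.log(t) / counts[a]))
```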

14 of 91

Hyperparameter Selection

Adaptive selection loop:

  • Build a model of f from past experiments
    • “Surrogate model”
  • Pick the next experiment to do based on the surrogate model
  • Repeat…

15 of 91

Outline

  • Hyperparameter Tuning
    • Beyond grid search
    • Experimentation with a surrogate model
    • Gaussian Process Regression models
  • Early stopping for hyperparameter selection
    • Successive halving
    • Hyperband
  • More on Transformers
    • Tokenization for Vision Transformers (ViT)
    • Key-Value Caching

16 of 91

Gaussian Processes

17 of 91

Gaussian Process Regression

Mapping a hyperparameter configuration x to learner performance f(x) is a regression task

Challenges:

  • Labels are expensive
  • For UCB we’d like to measure uncertainty in the regression function

Gaussian Process Regression is a good match for this

18 of 91

2-D Gaussian Process

Sample a point from the Gaussian, plot it as a line segment

19 of 91

2-D Gaussian Processes

Sample a point from the Gaussian, plot it as a line segment

20 of 91

2-D Gaussian Processes: Condition on a Point

Sample a point from p(X2 | x1), plot it as a line segment

p(X2 | x1) is also Gaussian

21 of 91

5-D Gaussian Processes

Conditioned on a point

22 of 91

20-D / 5-D Gaussian Processes

Conditioned on a point

Conditioned on 4 points

What is Σ ?

23 of 91

20-D / 5-D Gaussian Processes

Conditioned on a point

Conditioned on 4 points

24 of 91

Gaussian Process Regression

Define the covariance between xi and xj with a kernel k(xi, xj), defined for all pairs xi, xj

y1 is the predictions f(x); y2 is the observations x

A = cov(Y1)

B = cov(Y1, Y2)

C = cov(Y2)

25 of 91

Gaussian Process Regression

A = cov(Y1)

B = cov(Y1, Y2)

C = cov(Y2)

y2 is observations

y1 is prediction

computing C⁻¹ is O(n³), where n is the number of observations
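In symbols, conditioning the joint Gaussian built from these blocks (zero mean assumed for brevity) gives the prediction above:

```latex
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix}
\sim \mathcal{N}\!\left(0,\;
\begin{bmatrix} A & B \\ B^\top & C \end{bmatrix}\right)
\;\;\Longrightarrow\;\;
y_1 \mid y_2 \;\sim\; \mathcal{N}\!\left(B\,C^{-1} y_2,\;\; A - B\,C^{-1} B^\top\right)
```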

26 of 91

Gaussian Process Definition

Consider a process that generates any number of data points x1, …, xT

A Gaussian process is one where, for any finite subset of indices t1, …, tn, those variables are jointly multivariate Gaussian:

(Xt1, …, Xtn) ~ Normal(μ, Σ)

y1 is the predictions f(x); y2 is the observations x

A = cov(Y1), B = cov(Y1, Y2), C = cov(Y2)

Σ will be defined with a kernel

27 of 91

Why I don’t like this cartoon

28 of 91

Why I don’t like this cartoon

  • Observations are previously tested configurations
  • Usually fairly high dimension (5-20)
  • That surface is hard to see
  • The 1-D plots of variable indices are misleading…

29 of 91

Outline

  • Hyperparameter Tuning
    • Beyond grid search
    • Experimentation with a surrogate model
    • Gaussian Process Regression models
  • Early stopping for hyperparameter selection
    • Successive halving
    • Hyperband
  • More on Transformers
    • Tokenization for Vision Transformers (ViT)
    • Key-Value Caching

30 of 91

31 of 91

32 of 91

33 of 91

34 of 91

35 of 91

Vanilla UCB can waste time in the “corners” of the graph

36 of 91

Gaussian Process Regression

Downsides:

  • Expensive for many observations
  • Gaussian processes also have hyperparameters!
  • … associated with the kernel
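Despite those caveats, the core computation is short. A minimal numpy sketch, using an RBF kernel as one common choice (the kernel and its lengthscale are exactly the extra hyperparameters noted above); A, B, C mirror the blocks defined earlier:

```python
import numpy as np

def rbf_kernel(xs, ys, lengthscale=1.0):
    """k(x, x') = exp(-|x - x'|^2 / (2 l^2)) -- one common kernel choice."""
    d = xs[:, None] - ys[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    """Posterior mean/variance of f at x_query, given observations (y2)."""
    C = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))   # cov(Y2)
    B = rbf_kernel(x_query, x_obs)                              # cov(Y1, Y2)
    A = rbf_kernel(x_query, x_query)                            # cov(Y1)
    mean = B @ np.linalg.solve(C, y_obs)        # the O(n^3) step
    cov = A - B @ np.linalg.solve(C, B.T)
    return mean, np.diag(cov)

x_obs = np.array([0.0, 1.0, 2.0])
y_obs = np.array([0.0, 1.0, 0.0])
mean, var = gp_posterior(x_obs, y_obs, np.array([1.0, 10.0]))
# At an observed point the mean tracks y and the variance collapses;
# far away, the prior takes over (mean near 0, variance near 1).
```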

37 of 91

38 of 91

Outline

  • Hyperparameter Tuning
    • Beyond grid search
    • Experimentation with a surrogate model
    • Gaussian Process Regression models
  • Early stopping for hyperparameter selection
    • Successive halving
    • Hyperband
  • More on Transformers
    • Tokenization for Vision Transformers (ViT)
    • Key-Value Caching

39 of 91

EARLY STOPPING AND HYPERPARAMETER SELECTION

40 of 91

Hyperparameter Selection: Early Stopping

What if experiments have differing costs?

Example: #iterations of training is a hyperparameter

We can save computation if we cut off unpromising experiments early

41 of 91

Hyperparameter Selection: Early Stopping

What can we prove about this?

What do we need to assume?

In general we don’t know where a learning curve will end up…

In practice there are often big jumps in discrete eval metrics like error rate, win ratios, …

Many of the losses you train and test on are smooth

42 of 91

Recap: Chinchilla Scaling Laws

[Figure: Chinchilla scaling plot; the compute-optimal model for the budget shown is 63B parameters trained on 1.4T tokens]

Constraint = Budget = $$

Idea: fit a simple model to the data that relates compute, parameters, and loss, and use that to optimize loss given a total compute budget.

43 of 91

Caveat About Scaling Laws

44 of 91

Hyperparameter Selection: Early Stopping

If we know what value a curve is converging to, we can be confident when two curves separate

What can we prove about this?

What do we need to assume?

45 of 91

Hyperparameter Selection: Early Stopping

What’s a robust and useful heuristic?

What can we prove about this?

What do we need to assume?

46 of 91

47 of 91

Choosing the winner of a Science Fair

Project 1

Project n-1

Project n

. . .

Project 2

Judges

slide credit: Kevin Jamieson

48 of 91

Choosing the winner of a Science Fair

[Figure: Uniform judge allocation: judges are divided evenly across projects 1…n; each project's scores (e.g., 3, 3, … or 5, 4, …) are averaged.]

slide credit: Kevin Jamieson

49 of 91

Choosing the winner of a Science Fair

[Figure: Uniform judge allocation: the same number of judges per project, with the average score plotted for projects 1…n.]

slide credit: Kevin Jamieson

50 of 91

Choosing the winner of a Science Fair

[Figure: Uniform judge allocation: average scores with ±0.3 confidence intervals; the intervals of the top-scoring projects overlap.]

slide credit: Kevin Jamieson

Statistically indistinguishable!

51 of 91

Choosing the winner of a Science Fair

[Figure: left, the uniform allocation just shown; right, the Adaptive Judge Allocation setup: projects 1…n and a pool of judges, to be allocated in rounds.]

slide credit: Kevin Jamieson

52 of 91

Choosing the winner of a Science Fair

[Figure: Adaptive judge allocation, Round 1: every project gets a couple of judges; per-project scores (e.g., 2, 4; 5, 4; 5, 4; 2, 1) are averaged. Inset: the uniform-allocation result for comparison.]

slide credit: Kevin Jamieson


54 of 91

Choosing the winner of a Science Fair

[Figure: Adaptive judge allocation, Round 1: average scores with ±0.5 confidence intervals for all projects. Inset: the uniform-allocation result.]

slide credit: Kevin Jamieson



57 of 91

Choosing the winner of a Science Fair

[Figure: Adaptive judge allocation, Round 1: projects whose ±0.5 confidence intervals fall clearly below the leaders are eliminated (marked X). Inset: the uniform-allocation result.]

slide credit: Kevin Jamieson

58 of 91

Choosing the winner of a Science Fair

[Figure: Adaptive judge allocation, Round 2: surviving projects receive additional judges, tightening their intervals; the leaders are now 4.8 ± 0.3 and 4.7 ± 0.3. Inset: the uniform-allocation result.]

slide credit: Kevin Jamieson


60 of 91

Choosing the winner of a Science Fair

[Figure: Adaptive judge allocation, Round 2: more low-scoring projects are eliminated (marked X). Inset: the uniform-allocation result.]

slide credit: Kevin Jamieson

61 of 91

Choosing the winner of a Science Fair

[Figure: Adaptive judge allocation, Round 3: the two remaining projects receive still more judges, separating them at 4.9 ± 0.1 vs. 4.7 ± 0.1. Inset: the uniform-allocation result.]

slide credit: Kevin Jamieson



64 of 91

Choosing the winner of a Science Fair

[Figure: Adaptive judge allocation, Round 3: the runner-up (4.7 ± 0.1) is eliminated and a winner is found. Inset: uniform allocation, where the top projects remained statistically indistinguishable.]

Same number of total judgments

slide credit: Kevin Jamieson

65 of 91

Why Halving?

[Figure: with a fixed total budget, the set of surviving configurations is halved each round (8 → 4 → 2 → 1), so each survivor's per-round budget doubles.]

66 of 91

Successive Halving Method

If the number of configurations n is too large, each arm gets little budget early on, the results are very noisy, and we may eliminate the best arm.

If n is too small, we waste resources on bad arms.
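A sketch of the method under these assumptions; `run(config, budget)` is a hypothetical callable that trains `config` for `budget` units (e.g., epochs) and returns a validation loss:

```python
import math

def successive_halving(configs, run, total_budget):
    """Keep the best half of the arms each round; survivors' budgets double."""
    arms = list(configs)
    rounds = max(1, int(math.log2(len(arms))))
    budget_per_arm = total_budget // (rounds * len(arms)) or 1
    while len(arms) > 1:
        losses = {c: run(c, budget_per_arm) for c in arms}
        arms = sorted(arms, key=losses.get)[: max(1, len(arms) // 2)]
        budget_per_arm *= 2            # half the arms, twice the budget each
    return arms[0]

# Toy run(): true quality is the config value; estimates sharpen with budget.
best = successive_halving([0.3, 0.1, 0.4, 0.2],
                          lambda c, b: c + 1.0 / b, total_budget=32)
```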

67 of 91

Outline

  • Hyperparameter Tuning
    • Beyond grid search
    • Experimentation with a surrogate model
    • Gaussian Process Regression models
  • Early stopping for hyperparameter selection
    • Successive halving ✅
    • Hyperband
  • More on Transformers
    • Tokenization for Vision Transformers (ViT)
    • Key-Value Caching

68 of 91

Hyperband

Hyperband runs several rounds of SuccessiveHalving, each starting with a very different number of configurations n
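A compact sketch of that schedule (following Li et al.'s Hyperband approximately, not exactly): each bracket s starts with a different number of arms n and per-arm budget b, then runs successive halving at rate η. `sample_config` and `run` are hypothetical callables as before:

```python
import itertools, math

def hyperband(sample_config, run, max_budget, eta=3):
    """Brackets trade off many cheap runs against few long runs."""
    s_max = int(math.log(max_budget, eta) + 1e-9)
    best, best_loss = None, float("inf")
    for s in range(s_max, -1, -1):                            # one bracket per s
        n = int(math.ceil((s_max + 1) * eta ** s / (s + 1)))  # initial arms
        b = max_budget * eta ** (-s)                          # initial budget
        arms = [sample_config() for _ in range(n)]
        for i in range(s + 1):                 # successive halving at rate eta
            losses = {id(c): run(c, b * eta ** i) for c in arms}
            arms.sort(key=lambda c: losses[id(c)])
            arms = arms[: max(1, int(n * eta ** -(i + 1)))]
        loss = run(arms[0], max_budget)
        if loss < best_loss:
            best, best_loss = arms[0], loss
    return best

# Toy stand-ins: config "quality" is its value; estimates sharpen with budget.
vals = itertools.cycle([0.5, 0.2, 0.9, 0.1, 0.4])
best = hyperband(lambda: next(vals), lambda c, b: c + 1.0 / b, max_budget=9)
```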

69 of 91

Outline

  • Hyperparameter Tuning
    • Beyond grid search
    • Experimentation with a surrogate model
    • Gaussian Process Regression models
  • Early stopping for hyperparameter selection
    • Successive halving
    • Hyperband
  • More on Transformers
    • Tokenization for Vision Transformers (ViT)
    • Key-Value Caching

70 of 91

TOKENIZATION FOR TRANSFORMERS

71 of 91

The token vocabulary that is used is important

For NLP, current approaches use techniques like Byte-Pair Encoding (BPE), WordPiece, or SentencePiece, which are all quite similar
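The core of BPE fits in a few lines: repeatedly merge the most frequent adjacent pair of symbols into a new symbol. This toy sketch ignores byte-level details, word frequencies, and end-of-word markers:

```python
from collections import Counter

def merge_pair(w, a, b):
    """Replace every adjacent (a, b) in the symbol list w with the merged symbol."""
    out, i = [], 0
    while i < len(w):
        if i + 1 < len(w) and w[i] == a and w[i + 1] == b:
            out.append(a + b); i += 2
        else:
            out.append(w[i]); i += 1
    return out

def bpe_merges(words, num_merges):
    """Learn num_merges merge rules from a list of words (toy BPE)."""
    corpus = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(p for w in corpus for p in zip(w, w[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        corpus = [merge_pair(w, a, b) for w in corpus]
    return merges, corpus

merges, corpus = bpe_merges(["low", "lower", "lowest"], num_merges=2)
```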

72 of 91

Tokenization for Vision Transformers

Transformers are used for other modalities of input as well: images, audio, …

73 of 91

Tokenization for Vision Transformers

2D positional encoding is not better than 1D positional encoding
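For images, "tokenization" is just cutting the image into fixed-size patches and flattening each one; a learned linear projection (omitted here) then maps every flattened patch to the model dimension, and, per the note above, simple 1-D learned position embeddings are added. A numpy sketch:

```python
import numpy as np

def patchify(image, patch=16):
    """ViT-style tokenization: split an H x W x C image into non-overlapping
    patch x patch squares and flatten each into one token vector."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    tokens = (image.reshape(H // patch, patch, W // patch, patch, C)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(-1, patch * patch * C))
    return tokens  # shape: (num_patches, patch * patch * C)

img = np.zeros((224, 224, 3))
tokens = patchify(img)          # 14 * 14 = 196 tokens of dimension 768
```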

74 of 91

Tokenization for Vision Transformers

JFT = 300M images @ Google

Also trained on JFT

21k classes and 14M images

75 of 91

Vision Transformers – Pros and Cons

  • ViTs have a weaker inductive bias than CNNs and usually need more data to train
    • They don’t “know” that local-area features matter
    • “For example, ViT-B/32 is slightly faster than ResNet50; it performs much worse on the 9M subset, but better on 90M+ subsets.”
  • Transformers are not the SoTA way of generating images
    • Although they often do quite well!
  • ViTs are excellent for multimodal vision/language models

76 of 91

Multi-Modal Transformer: Example

2022

77 of 91

Multi-Modal Transformer: Example

78 of 91

Multi-Modal Transformer: Example

1.

2. FT

0. ViT and T5

79 of 91

Multi-Modal Transformer: Example

1.

0. ViT and T5

One encoder is used for two purposes – retrieval and generation (given a retrieved “memory”)

Image–caption pairs: ignore memory; train image + prompt = “generate caption:” → caption

Visual QA: memory is captions; train for image + prompt = question → answer

PAQ (question, passage, answer): memory is passages; train for question → answer

80 of 91

Vision Transformers – Pros and Cons

Oracle retrieval

Retrieval at test time

81 of 91

Outline

  • Hyperparameter Tuning
    • Beyond grid search
    • Experimentation with a surrogate model
    • Gaussian Process Regression models
  • Early stopping for hyperparameter selection
    • Successive halving
    • Hyperband
  • More on Transformers
    • Tokenization for Vision Transformers (ViT)
    • Key-Value Caching

82 of 91

KEY-VALUE CACHING FOR TRANSFORMERS

83 of 91

Masked self-attention is used in the decoder

In masked self-attention, the query for the token at position i can only attend to the keys for tokens at positions 0, 1, …, i

This is done with a mask

The prediction for the token at position i is not affected by “future” tokens at positions i+1, i+2, …

aka “causal LM” or “causal masking”

Recap: masked self-attention
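A minimal single-head numpy sketch (no batching; weight names are ours). The mask sets scores for "future" keys to −inf before the softmax, so position i only mixes values from positions 0…i:

```python
import numpy as np

def masked_self_attention(X, Wq, Wk, Wv):
    """Causal self-attention: position i attends only to positions 0..i."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    T = X.shape[0]
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above diagonal
    scores[mask] = -np.inf                             # hide "future" keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                 # 3 tokens, dimension 4
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out = masked_self_attention(X, Wq, Wk, Wv)
```

Changing a later token leaves the outputs at earlier positions untouched, which is exactly the causal-LM property.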

84 of 91

GPT

Recap: masked self-attention is always used in decoder-only Transformers

85 of 91

GPT

Recap: generation is auto-regressive

86 of 91

Observation

[Figure: auto-regressive generation, steps 1 through 4]

87 of 91

Observation

In Transformers, the QKᵀ matmul is the only place where information from one token’s representation is moved to another token

if xi is the token representation:

qi = xi WQ ; ki = xi WK ; vi = xi WV

88 of 91

Observation

[Figure: attention computation at steps 3 and 4; the earlier input tokens are unchanged]

Third output depends only on the first two tokens


90 of 91

Key-Value Caching

[Figure: decoding step 4; the keys and values from earlier steps, shown in green boxes, are unchanged]

Third output depends only on the first two tokens

Fourth output depends only on the first three tokens

Stuff in green boxes has not changed since step 3

The keys and values can be cached and re-used in the forward pass
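One decoding step with a KV cache can be sketched as follows (single head, no batching; names ours). Only the new token is projected; keys and values for all earlier tokens come straight from the cache:

```python
import numpy as np

def generate_step(x_new, cache, Wq, Wk, Wv):
    """One auto-regressive step with a KV cache: project only the new token,
    append its key/value to the cache, and attend over all cached positions."""
    q = x_new @ Wq
    cache["K"].append(x_new @ Wk)       # keys/values of old tokens are reused
    cache["V"].append(x_new @ Wv)
    K, V = np.stack(cache["K"]), np.stack(cache["V"])
    scores = q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores - scores.max())
    w /= w.sum()                        # softmax over all cached positions
    return w @ V                        # attention output for the new position

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
cache = {"K": [], "V": []}
outs = [generate_step(rng.normal(size=d), cache, Wq, Wk, Wv) for _ in range(3)]
```

Each step does one row of projections instead of re-running attention over the whole prefix, which is the saving KV caching buys during generation.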

91 of 91

GPT

Important point: the FFN and LayerNorms are applied independently to each token position, so masked self-attention (MSA) is the only place where information moves from one token position to another.