1 of 64

Generative Sequence Modeling

A tale as old as backpropagation through time

Zain Shah

2 of 64

What you will learn

  • How does GPT-3 work?
  • How does it differ from what came before?
  • Which limits have we broken? And which haven’t we (yet)?

3 of 64

Why are sequences important? How are they different?

Why sequences?

ABABCABCDABCDE

4 of 64

Why are sequences important? How are they different?

Why sequences?

ABABCABCDABCDE

letter:  A  B  C  D  E
count:   4  4  3  2  1

5 of 64

Order matters

Why sequences?

ABABCABCDABCDE

letter:  A  B  C  D  E
count:   4  4  3  2  1

CDABADBCAEBABC

DCDCBCABBAABAE

BBABCDBCDAEACA

6 of 64

Structure matters

Why sequences?

7 of 64

9 white cells → 🙂

Structure matters

Why sequences?

8 of 64

9 white cells → 🙂

Structure matters

Why sequences?

position:  x  y  z  a  b
value:     1  1  1  1  0

Regression?

🙂

9 of 64

Translation does not matter

Why sequences?

10 of 64

Resolution does not matter

Why sequences?

11 of 64

Why sequences?

ABABCABCDABCDE

☀️ ⛅ ☀️ ⛅ ☁️ ☀️ ⛅ ☁️ 🌧️ ☀️ ⛅ ☁️ 🌧️ ⛈️

What comes next?

12 of 64

What else is like this?

  • sound
  • language
  • images
  • video
  • sensing
  • structured data

13 of 64

Real world data has

structure & variable size

14 of 64

How to measure & optimize?

maximize the likelihood of our data under our model

15 of 64

How to measure & optimize?

maximize the likelihood of our data under our model
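
In symbols (a standard formulation, not spelled out on the slide): an autoregressive model factorizes the sequence probability with the chain rule, and training minimizes the negative log likelihood.

```latex
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}),
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
```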

16 of 64

Recurrent Neural Networks (1986)

[Diagram: an RNN reading the characters of a sequence one at a time, carrying a hidden state forward.]

17 of 64

Recurrent Neural Networks (1986)

[Diagram: the same RNN unrolled through time, now emitting a prediction for the next character at every step.]

18 of 64

Recurrent Neural Networks (1986)

[Diagram: the unrolled RNN over a longer sequence; each step's prediction is compared against the character that actually comes next.]
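
A minimal sketch of the idea in NumPy (untrained, randomly initialized weights and a toy vocabulary; not the exact 1986 formulation): the hidden state is the network's only memory of everything seen so far.

```python
import numpy as np

# A minimal character-level RNN cell: at each step the hidden state summarizes
# everything seen so far, and the output is a distribution over the next character.
rng = np.random.default_rng(0)
vocab = ["A", "B", "C", "D", "E"]
V, H = len(vocab), 16

Wxh = rng.normal(0, 0.1, (H, V))   # input -> hidden
Whh = rng.normal(0, 0.1, (H, H))   # hidden -> hidden (the recurrence)
Why = rng.normal(0, 0.1, (V, H))   # hidden -> next-char logits

def one_hot(ch):
    v = np.zeros(V)
    v[vocab.index(ch)] = 1.0
    return v

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

h = np.zeros(H)
for ch in "ABABCABCDABCD":
    h = np.tanh(Wxh @ one_hot(ch) + Whh @ h)   # carry state forward
    p_next = softmax(Why @ h)                  # distribution over next char
print({c: round(float(p), 3) for c, p in zip(vocab, p_next)})
```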

19 of 64

Generating Sequences with Recurrent Neural Networks (Graves 2013)

“In principle a large enough RNN should be sufficient to generate sequences of arbitrary complexity.”

20 of 64

same paper (Graves 2013)

“In practice however, standard RNNs are unable to store information about past inputs for very long”

21 of 64

Forgetting

is the main problem. The majority of our time is spent on this.

How to deal with longer sequences, larger images, coherence over grander timescales?

22 of 64

LSTM

Long short-term memory

recurrent neural network

(Hochreiter & Schmidhuber 1997)
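
A sketch of one LSTM step (the standard gating equations, with random untrained weights and illustrative sizes): the gated cell state c is what lets information survive many steps.

```python
import numpy as np

# A single LSTM step: the cell state c is the "long-term" memory the plain RNN lacks.
rng = np.random.default_rng(0)
D, H = 8, 16                       # input size, hidden size (illustrative)
W = rng.normal(0, 0.1, (4 * H, D + H))
b = np.zeros(4 * H)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c):
    z = W @ np.concatenate([x, h]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input / forget / output gates
    g = np.tanh(g)                                 # candidate cell update
    c = f * c + i * g                              # forget some old, write some new
    h = o * np.tanh(c)                             # expose part of the memory
    return h, c

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):                  # five dummy inputs
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)
```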

23 of 64

Sequence to Sequence Learning

Translation

Labeling

Speech Recognition

Conditional Speech Synthesis

Text Generation

(Sutskever 2014)

24 of 64

Connectionist Temporal Classification

the alignment subproblem (Graves 2006)
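
A tiny illustration of the CTC collapse rule (merge repeated labels, then drop the blank); the frame strings below are made-up examples. Many frame-level alignments map to the same output, which is how CTC sidesteps needing a hand-made alignment.

```python
# CTC's many-to-one collapse: merge repeats, then remove the blank symbol.
BLANK = "-"

def ctc_collapse(frames):
    out = []
    prev = None
    for label in frames:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# Both made-up alignments below decode to "CAB".
print(ctc_collapse("CC--AAA-B"))   # -> CAB
print(ctc_collapse("-C-A--BB-"))   # -> CAB
```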

25 of 64

The beginning of “attention”

Jointly Learning to Align and Translate (Bahdanau 2014)

26 of 64

The beginning of “attention”

Jointly Learning to Align and Translate (Bahdanau 2014)

27 of 64

The beginning of “attention”

Jointly Learning to Align and Translate (Bahdanau 2014)
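
A sketch of additive (Bahdanau-style) attention with random, untrained weights and made-up shapes: the decoder state scores every encoder state, the softmax over scores is the soft alignment, and the context vector is the weighted sum.

```python
import numpy as np

# Additive attention: score(s_t, h_i) = v^T tanh(Wq s_t + Wk h_i)
rng = np.random.default_rng(0)
H = 16                                   # hidden size (illustrative)
Wq = rng.normal(0, 0.1, (H, H))
Wk = rng.normal(0, 0.1, (H, H))
v = rng.normal(0, 0.1, H)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def additive_attention(decoder_state, encoder_states):
    scores = np.array([v @ np.tanh(Wq @ decoder_state + Wk @ h)
                       for h in encoder_states])
    weights = softmax(scores)            # the "alignment" over source positions
    context = weights @ encoder_states   # weighted sum of encoder states
    return context, weights

encoder_states = rng.normal(size=(7, H))   # 7 source positions
decoder_state = rng.normal(size=H)
context, weights = additive_attention(decoder_state, encoder_states)
print(weights.round(3), context.shape)
```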

28 of 64

Attention

“Solved” sequence length. Enabled interpretability.

29 of 64

Attention

“Solved” sequence length. Enabled interpretability.

30 of 64

Well, sort of

because of this: sequences in the real world are very very long

so we’re still limited by this:

31 of 64

Well, sort of

sequences in the real world are very very long

so we’re still limited by this:

32 of 64

Well, sort of

sequences in the real world are very very long

so we’re still limited by this:

33 of 64

Attention

Recurrence < Attention

34 of 64

Attention

enables parallelism for training

35 of 64

Attention

enables parallelism for training

we’re no longer limited to a single sequential forward⬌backward pass through the sequence

this means we can train bigger models, on more data

the only limit is how many computers we can use
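
A sketch of (scaled) dot-product self-attention in NumPy, with illustrative shapes: every position attends to every other in a couple of matrix multiplies, so there is no step-by-step recurrence to serialize training over.

```python
import numpy as np

# Scaled dot-product self-attention over a whole sequence at once.
rng = np.random.default_rng(0)
T, d = 6, 16                                     # sequence length, model width

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (T, T): all pairs in one matmul
    if mask is not None:
        scores = np.where(mask, scores, -1e9)    # block "future" positions
    return softmax(scores) @ V

X = rng.normal(size=(T, d))                      # token representations
causal = np.tril(np.ones((T, T), dtype=bool))    # autoregressive mask
Y = attention(X, X, X, mask=causal)              # self-attention in one shot
print(Y.shape)                                   # (6, 16)
```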

36 of 64

Transformer:

Attention is All You Need

(Vaswani 2017)

37 of 64

Transformer:

Attention is All You Need

(Vaswani 2017)
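
A very rough single-head decoder block sketch (random weights, no dropout, toy sizes; not a faithful reimplementation of the paper): masked self-attention plus a position-wise feed-forward layer, each with a residual connection and LayerNorm.

```python
import numpy as np

# One decoder-style Transformer block, stripped to the essentials.
rng = np.random.default_rng(0)
T, d, d_ff = 6, 16, 64

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

Wq, Wk, Wv, Wo = (rng.normal(0, 0.1, (d, d)) for _ in range(4))
W1, W2 = rng.normal(0, 0.1, (d, d_ff)), rng.normal(0, 0.1, (d_ff, d))

def decoder_block(x):
    mask = np.tril(np.ones((T, T), dtype=bool))          # causal mask
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = np.where(mask, Q @ K.T / np.sqrt(d), -1e9)
    attn = softmax(scores) @ V @ Wo                      # masked self-attention
    x = layer_norm(x + attn)                             # residual + norm
    ff = np.maximum(0, x @ W1) @ W2                      # ReLU feed-forward
    return layer_norm(x + ff)                            # residual + norm

x = rng.normal(size=(T, d))
print(decoder_block(x).shape)                            # (6, 16)
```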

38 of 64

39 of 64

40 of 64

41 of 64

42 of 64

GPT-3

43 of 64

44 of 64

this is a log scale.

this is 15x more.

🤯

45 of 64

The Data Details

Representation: How is the data represented?

Generation: How is data generated?

Training: What data was it trained on?

46 of 64

Representation:

Byte Pair Encoding

+ Embeddings

+ Positional Encoding

47 of 64

Byte Pair Encoding:

48 of 64

Byte Pair Encoding:
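
A toy sketch of the BPE idea on a single string (real BPE is trained over a corpus of words with frequencies, and GPT-2/3 run it over bytes): repeatedly merge the most frequent adjacent pair of symbols into a new symbol, so common substrings become single tokens while rare words stay decomposable into pieces.

```python
from collections import Counter

def bpe_train(word, num_merges):
    symbols = list(word)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        # replace every occurrence of the pair with the merged symbol
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols, merges

tokens, merges = bpe_train("abababcabcd", num_merges=3)
print(tokens, merges)
```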

49 of 64

Embeddings:
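
Embeddings are a learned lookup table; here is a minimal sketch with made-up sizes (GPT-3 reportedly uses a ~50k-token BPE vocabulary and a much wider model dimension).

```python
import numpy as np

# Token embeddings: each token id indexes a row of a learned matrix,
# and that row is the vector the rest of the network operates on.
rng = np.random.default_rng(0)
vocab_size, d_model = 1000, 64         # tiny, illustrative sizes
E = rng.normal(0, 0.02, (vocab_size, d_model))

token_ids = np.array([17, 256, 3])     # made-up token ids from a tokenizer
x = E[token_ids]                       # (3, 64): one row per input token
print(x.shape)
```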

50 of 64

Positional Encoding:
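
A sketch of the fixed sinusoidal positional encodings from Vaswani 2017 (GPT-2/3 use learned position embeddings instead, but the purpose is the same: give attention a sense of order, since attention alone is permutation-invariant).

```python
import numpy as np

# Sinusoidal positional encodings: each position gets a fixed pattern of sines
# and cosines at geometrically spaced frequencies, added to the token embeddings.
def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]             # (1, d_model / 2)
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = positional_encoding(max_len=2048, d_model=64)
print(pe.shape)        # (2048, 64): one row added to each token embedding
```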

51 of 64

Generation:

Greedy Sampling

+ Nucleus Sampling

+ (Not) Beam Search

52 of 64

Greedy Sampling:
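
A sketch of greedy decoding, with a stand-in `fake_logits` function in place of a real model: always take the argmax of the next-token distribution. Deterministic, but prone to loops and bland text.

```python
import numpy as np

# Greedy decoding: at every step, pick the single most likely next token.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat", "."]

def fake_logits(context):
    return rng.normal(size=len(vocab))    # placeholder for a real model call

context = ["the"]
for _ in range(5):
    logits = fake_logits(context)
    context.append(vocab[int(np.argmax(logits))])   # argmax = greedy choice
print(" ".join(context))
```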

53 of 64

Nucleus Sampling:
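
A sketch of nucleus (top-p) sampling (Holtzman et al. 2019) on made-up logits: keep only the smallest set of tokens whose probability mass reaches p, renormalize, and sample from that set, cutting off the unreliable low-probability tail without going fully greedy.

```python
import numpy as np

def nucleus_sample(logits, p=0.9, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                           # most probable first
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, p) + 1)]   # the "nucleus"
    kept = probs[keep] / probs[keep].sum()                    # renormalize
    return int(rng.choice(keep, p=kept))

logits = np.array([2.0, 1.5, 0.2, -1.0, -3.0])
print(nucleus_sample(logits, p=0.9))       # index of the sampled token
```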

54 of 64

(not) Beam Search:

55 of 64

(not) Beam Search:

TL;DR: beam search samples don’t look human

56 of 64

Training:

Filtered CommonCrawl (most of the internet)

+ Microsoft supercomputing cluster

So big that minimizing the negative log likelihood means it needs to learn to do this:

57 of 64

Training:

Filtered CommonCrawl (most of the internet)

+ Microsoft supercomputing cluster

So big that minimizing the negative log likelihood means it needs to learn to do this:

58 of 64

Training:

Filtered CommonCrawl (most of the internet)

+ Microsoft supercomputing cluster

So big that minimizing the negative log likelihood means it needs to learn to do this:

59 of 64

Limitations?

Is this artificial general intelligence?

60 of 64

Limitations:

+ Fails at tasks that require world knowledge

+ Uniform importance / loss function

+ Unidirectional context (explains the WiC result)

+ Learning at test time?

61 of 64

World Knowledge:

Specifically GPT-3 has difficulty with questions of the type

“If I put cheese into the fridge, will it melt?”

62 of 64

Simulation + Self Play:

World knowledge is not a structural limitation,

only a limitation of the data.

63 of 64

Uniform Importance

64 of 64

Language Models are Few-Shot Learners

“Ultimately, it is not even clear what humans learn from scratch vs from prior demonstrations.”