Generative Sequence Modeling
A tale as old as backpropagation through time
Zain Shah
What you will learn
Why are sequences important? How are they different?
Why sequences?
ABABCABCDABCDE
symbol | A | B | C | D | E
count  | 4 | 4 | 3 | 2 | 1
Order matters
Why sequences?
ABABCABCDABCDE
symbol | A | B | C | D | E
count  | 4 | 4 | 3 | 2 | 1
CDABADBCAEBABC
DCDCBCABBAABAE
BBABCDBCDAEACA
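A quick sketch of the point in Python (illustrative snippet, not from the talk): all four strings above have identical character counts, so any order-blind summary cannot tell them apart.

```python
from collections import Counter

strings = [
    "ABABCABCDABCDE",  # the original sequence
    "CDABADBCAEBABC",  # same letters, different order
    "DCDCBCABBAABAE",
    "BBABCDBCDAEACA",
]

# Every string has the same bag-of-characters summary...
print([Counter(s) for s in strings])
# ...so the set of distinct count profiles has size 1,
# even though only the first string has the A, AB, ABC, ... structure.
print(len({frozenset(Counter(s).items()) for s in strings}))  # -> 1
```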
Structure matters
Why sequences?
[Figure: a grid of cells, 9 of them white]
9 white cells → 🙂
Structure matters
Why sequences?
[Figure: the grid again, with individual cells labeled x, y, z, a, b, …]
position | value
x | 1
y | 1
z | 1
a | 1
b | 0
Regression?
🙂
Translation does not matter
Why sequences?
[Figure: the same white-cell pattern, shifted to a different position in the grid]
Resolution does not matter
Why sequences?
[Figure: the same pattern on a finer, higher-resolution grid]
Why sequences?
ABABCABCDABCDE
☀️ ⛅ ☀️ ⛅ ☁️ ☀️ ⛅ ☁️ 🌧️ ☀️ ⛅ ☁️ 🌧️ ⛈️
What comes next?
What else is like this?
Real world data has structure & variable size
How to measure & optimize?
maximize the likelihood of our data under our model
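Concretely, "the likelihood of our data under our model" for a sequence x_1, …, x_T is usually written as the autoregressive factorization, and training minimizes its negative log:

```latex
p_\theta(x_1, \dots, x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \dots, x_{t-1})
\qquad
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})
```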
Recurrent Neural Networks (1986)
[Diagram: an RNN unrolled over a short symbol sequence (A, B, C, …), carrying a hidden state from step to step and emitting a prediction at each step]
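A rough sketch of that idea (Elman-style recurrence, not the original 1986 formulation; names and sizes below are illustrative): a hidden state is updated at every step and used to predict the next symbol.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 5, 16                       # e.g. symbols A..E
Wxh = rng.normal(0, 0.1, (hidden, vocab))   # input -> hidden
Whh = rng.normal(0, 0.1, (hidden, hidden))  # hidden -> hidden (the recurrence)
Why = rng.normal(0, 0.1, (vocab, hidden))   # hidden -> next-symbol logits

def rnn_step(x_onehot, h):
    """One step: mix the new input with the carried hidden state."""
    h = np.tanh(Wxh @ x_onehot + Whh @ h)
    logits = Why @ h
    probs = np.exp(logits - logits.max())
    return h, probs / probs.sum()

# Run over "A B A B C" encoded as indices 0..4, predicting the next symbol.
seq = [0, 1, 0, 1, 2]
h = np.zeros(hidden)
for idx in seq:
    h, next_probs = rnn_step(np.eye(vocab)[idx], h)
print(next_probs)  # untrained, so roughly uniform over the 5 symbols
```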
Generating Sequences with Recurrent Neural Networks (Graves 2013)
“In principle a large enough RNN should be sufficient to generate sequences
of arbitrary complexity.”
same paper (Graves 2013)
“In practice however, standard RNNs are unable to store information about past inputs for very long”
Forgetting
is the main problem; the majority of our time is spent on it.
How do we deal with longer sequences, larger images, and coherence over grander timescales?
LSTM
Long short-term memory
recurrent neural network
(Hochreiter & Schmidhuber 1997)
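For reference, the usual formulation of the LSTM cell (gates deciding what to write, keep, and expose; notation follows common convention rather than the slide):

```latex
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c)\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(long-term cell state)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(short-term hidden state)}
\end{aligned}
```

The additive update of the cell state is what lets information persist over many steps, which is exactly the forgetting problem described above.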
Sequence to Sequence Learning
Translation
Labeling
Speech Recognition
Conditional Speech Synthesis
Text Generation
(Sutskever 2014)
Connectionist Temporal Classification
the alignment subproblem (Graves 2014)
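CTC sidesteps alignment by letting the network emit one label (or a blank) per frame and then collapsing the result. A minimal sketch of that collapsing rule, assuming "-" as the blank symbol (toy example, not from the paper):

```python
from itertools import groupby

def ctc_collapse(frame_labels, blank="-"):
    """The CTC many-to-one map: merge consecutive repeats, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != blank)

# Many frame-level alignments map to the same target transcription:
print(ctc_collapse("hh-e-lll-llo"))  # -> "hello"
print(ctc_collapse("h-eel-lo-"))     # -> "hello"
```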
The beginning of “attention”
Jointly Learning to Align and Translate (Bahdanau 2014)
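The mechanism from that paper, in its usual additive form: the decoder scores every encoder state h_j against its previous state s_{i-1}, softmaxes the scores into weights, and takes the weighted sum as the context for the next output.

```latex
\begin{aligned}
e_{ij} &= v_a^\top \tanh(W_a s_{i-1} + U_a h_j) \\
\alpha_{ij} &= \frac{\exp(e_{ij})}{\sum_{k} \exp(e_{ik})} \\
c_i &= \sum_{j} \alpha_{ij}\, h_j
\end{aligned}
```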
Attention
“Solved” sequence length. Enabled interpretability.
Well, sort of
because of this: sequences in the real world are very, very long,
so we're still limited by this: the step-by-step recurrence
Attention
Recurrence < Attention
enables parallelism for training
we're no longer limited to a single forward⬌backward process
this means we can train bigger models, on more data
the only limit is how many computers we can use
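A rough sketch of why attention parallelizes where recurrence doesn't: every position's output is a weighted sum over the whole sequence, computed with a few matrix multiplies rather than a step-by-step loop. This is plain scaled dot-product attention (causal mask, multiple heads, and learned projections omitted; shapes are illustrative).

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over a whole sequence at once."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)           # (T, T): every position attends to every position
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                      # (T, d): all T outputs in one matmul

T, d = 1024, 64
X = np.random.default_rng(0).normal(size=(T, d))
out = attention(X, X, X)                    # no loop over timesteps
print(out.shape)                            # (1024, 64)
```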
Transformer:
Attention is All You Need
(Vaswani 2017)
GPT-3
[Chart of model sizes: note this is a log scale, and GPT-3 is 15x more]
🤯
The Data Details
Representation: How is the data represented?
Generation: How is data generated?
Training: What data was it trained on?
Representation:
Byte Pair Encoding
+ Embeddings
+ Positional Encoding
Byte Pair Encoding:
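Byte Pair Encoding builds a subword vocabulary by repeatedly merging the most frequent adjacent pair of symbols. A minimal sketch on a toy word list (real tokenizers such as GPT-3's operate on bytes, weight words by frequency, and learn tens of thousands of merges):

```python
from collections import Counter

def merge_pair(word, a, b):
    """Fuse every adjacent (a, b) occurrence in a tokenized word."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out

def bpe_merges(words, num_merges):
    """Learn merge rules by repeatedly fusing the most frequent adjacent pair."""
    corpus = [list(w) for w in words]          # start from characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word in corpus:
            pairs.update(zip(word, word[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append((a, b))
        corpus = [merge_pair(word, a, b) for word in corpus]
    return merges, corpus

merges, corpus = bpe_merges(["lower", "lowest", "newer", "newest"], num_merges=5)
print(merges)   # learned merge rules, most frequent pairs first
print(corpus)   # words re-tokenized into subword units
```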
Embeddings:
Positional Encoding:
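Putting the last two pieces together: each token id is looked up in a learned embedding table, and position information is added to the result. The sketch below uses the sinusoidal encoding from the original Transformer paper for illustration; GPT-style models typically learn the position vectors instead, and all sizes here are toy values.

```python
import numpy as np

vocab_size, d_model, max_len = 1000, 64, 512     # illustrative sizes only
rng = np.random.default_rng(0)
token_embedding = rng.normal(0, 0.02, (vocab_size, d_model))  # learned lookup table

def sinusoidal_positions(max_len, d_model):
    """Fixed sin/cos position vectors (Vaswani et al. 2017)."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    enc = np.zeros((max_len, d_model))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

position_encoding = sinusoidal_positions(max_len, d_model)

token_ids = np.array([5, 42, 7])                 # arbitrary example ids
x = token_embedding[token_ids] + position_encoding[: len(token_ids)]
print(x.shape)                                    # (3, 64): one vector per token
```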
Generation:
Greedy Sampling
+ Nucleus Sampling
+ (Not) Beam Search
Greedy Sampling:
Nucleus Sampling:
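Given the model's next-token distribution, greedy decoding always takes the argmax, while nucleus (top-p) sampling draws only from the smallest set of tokens whose probabilities sum to at least p (Holtzman et al. 2019). A minimal sketch of both on a toy distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy(probs):
    """Always pick the single most likely next token."""
    return int(np.argmax(probs))

def nucleus(probs, p=0.9):
    """Sample from the smallest set of tokens whose total probability >= p."""
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # keep the top-p "nucleus"
    kept = order[:cutoff]
    renormed = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=renormed))

probs = np.array([0.50, 0.25, 0.15, 0.07, 0.03])       # toy next-token distribution
print(greedy(probs))                         # always 0
print([nucleus(probs) for _ in range(10)])   # varies over 0/1/2, never the long tail
```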
(not) Beam Search:
TL;DR: Beam search samples don't look human.
Training:
Filtered CommonCrawl (most of the internet)
+ Microsoft supercomputing cluster
So big that minimizing the negative log likelihood means it needs to learn to do this:
Limitations?
Is this artificial general intelligence?
Limitations:
+ Fails at tasks that require world knowledge
+ Uniform importance / loss function
+ Unidirectional context (explains WiC)
+ Learning at test time?
World Knowledge:
Specifically, GPT-3 has difficulty with questions of the type
“If I put cheese into the fridge, will it melt?”
Simulation + Self Play:
World knowledge is not a structural limitation,
only a limitation of the data.
Uniform Importance
Language Models are Few-Shot Learners
“Ultimately, it is not even clear what humans learn from scratch vs from prior demonstrations.”