Lecture 5: Sequence-to-Sequence (seq2seq) Models
Soo Kim
Spring 2025
Machine Translation Problem
Review: Many-to-Many RNN
[Figure: a many-to-many RNN unrolled over three time steps. The same cell fW is applied at every step, updating the hidden state as ht = fW(ht-1, xt) = tanh(Whh ht-1 + Wxh xt) and emitting an output ŷt at each step.]
For binary classification: ŷt = σ(Why ht)
For regression: ŷt = Why ht
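To make the recurrence concrete, here is a minimal sketch of the update and readout above in PyTorch; the sizes and random weights are illustrative assumptions, not values from the lecture.

import torch

# Minimal sketch of the many-to-many recurrence (illustrative sizes and weights).
input_size, hidden_size, output_size = 8, 16, 1
Wxh = torch.randn(hidden_size, input_size) * 0.1
Whh = torch.randn(hidden_size, hidden_size) * 0.1
Why = torch.randn(output_size, hidden_size) * 0.1

def rnn_step(h_prev, x_t):
    h_t = torch.tanh(Whh @ h_prev + Wxh @ x_t)   # ht = fW(ht-1, xt)
    y_t = Why @ h_t                              # regression readout: ŷt = Why ht
    return h_t, y_t

h = torch.zeros(hidden_size)                     # h0
for x_t in torch.randn(3, input_size):           # three time steps: x1, x2, x3
    h, y = rnn_step(h, x_t)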
Machine Translation Problem
Do you see any problem?
[Figure: a many-to-many RNN used directly for translation, pairing each input token with one output token at the same time step. The example sentence pair is the Spanish "Vivo en un pueblo pequeño en ..." and the English "I live in a small town ...".]
Machine Translation Problem
x1
fW
h0
h1
x2
fW
h2
x3
fW
h3
ŷ3
ŷ2
ŷ1
x4
fW
h4
x5
fW
h5
x6
fW
h6
ŷ6
ŷ5
ŷ4
Vivo
en
un
pueblo
pequeño
en
I
live
in
a
small
town
5
Machine Translation Problem
[Figure: the model reads x1 = "I" and x2 = "live" and must emit ŷ1, ŷ2 as it goes; ŷ1 is shown as "???".]
The first output token should be "Vivo", but we haven't read the corresponding word "live" yet.
Generating one output token per input token may not work. Any better ideas?
Encoder-Decoder Structure
[Figure: an RNN encoder reading the Spanish input "Vivo en un pueblo pequeño en ..." one token at a time; no output is produced at each step, only the hidden state is updated.]
h0: randomly initialized, no information yet.
h1: encoded information about x1: "Vivo"
h2: encoded information about x1:2: "Vivo en"
h3: encoded information about x1:3: "Vivo en un"
h4: encoded information about x1:4: "Vivo en un pueblo"
h5: encoded information about x1:5: "Vivo en un pueblo pequeño"
Encoder-Decoder Structure
[Figure: the encoder continues through x14 = "hijos" and x15 = ".".]
At the end of the sequence, h15 encodes the entire sequence x1:15: "Vivo en un pueblo pequeño en una montaña con mi mujer y dos hijos."
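As a small sketch of this encoding step (illustrative sizes; a plain nn.RNN stands in for the generic cell fW), the final hidden state plays the role of h15:

import torch
import torch.nn as nn

# Sketch: encode a 15-token source sentence and keep only the final hidden state.
emb_dim, hidden_dim, seq_len = 32, 64, 15
encoder_rnn = nn.RNN(emb_dim, hidden_dim)
x = torch.randn(seq_len, 1, emb_dim)        # [seq_len, batch=1, emb_dim] embedded tokens
step_states, h_last = encoder_rnn(x)        # step_states[t] corresponds to h(t+1); h_last to h15
# h_last summarizes the entire input x1:15 and will initialize the decoder.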
Decoder: Auto-Regressive Generation
[Figure: the decoder RNN unrolled over three time steps, producing ŷ1, ŷ2, ŷ3, which are compared against the ground truth y1, y2, y3.]
h0: initialized as the last hidden state of the encoder.
<SOS>: a special token (<Start of Sentence>) indicating the first time step.
h1: encodes the context plus where we are (the first step).
ŷ1: the first output token (e.g., the first word of the English sentence).
Inputs at later steps are auto-regressive: each step consumes the previous output token.
The loss is computed by comparing each output token with the ground truth (the English sentence).
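A sketch of the auto-regressive loop; the token ids SOS_ID/EOS_ID, the sizes, and the embed/readout layers are illustrative assumptions.

import torch
import torch.nn as nn

# Sketch of greedy auto-regressive generation with a single RNN cell.
vocab_size, emb_dim, hidden_dim = 100, 32, 64
SOS_ID, EOS_ID, max_len = 0, 1, 20
embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.RNNCell(emb_dim, hidden_dim)
readout = nn.Linear(hidden_dim, vocab_size)

h = torch.zeros(1, hidden_dim)              # in seq2seq: the encoder's last hidden state
token = torch.tensor([SOS_ID])              # start with the <SOS> trigger token
generated = []
for _ in range(max_len):
    h = cell(embed(token), h)               # consume the previous output token
    token = readout(h).argmax(dim=1)        # greedy choice of the next token
    if token.item() == EOS_ID:
        break
    generated.append(token.item())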
Decoder: Teacher Forcing
[Figure: the same decoder during training, but each step's input is the ground-truth previous token (y1, y2, ...) rather than the model's own prediction (ŷ1, ŷ2, ...); the outputs ŷ1, ŷ2, ŷ3 are still compared against y1, y2, y3 in the loss. This is teacher forcing.]
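A sketch of a decoding loop with teacher forcing; the helper name and teacher_forcing_ratio are assumptions, but decoder follows the DecoderLSTM interface implemented later in this lecture: (x, hidden, cell) -> (predictions, hidden, cell).

import random
import torch

# Sketch: with probability teacher_forcing_ratio the ground-truth previous token
# y(t-1) is fed in; otherwise the model's own prediction is fed back.
def decode_with_teacher_forcing(decoder, hs, cs, target, teacher_forcing_ratio=0.5):
    x = target[0]                               # <SOS> tokens, shape [batch_size]
    predictions = []
    for t in range(1, target.shape[0]):
        output, hs, cs = decoder(x, hs, cs)     # output: [batch_size, vocab_size]
        predictions.append(output)
        use_teacher = random.random() < teacher_forcing_ratio
        x = target[t] if use_teacher else output.argmax(1)
    return torch.stack(predictions)             # [target_len - 1, batch_size, vocab_size]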
Overall Sequence-to-Sequence (seq2seq) Model
[Figure: the overall seq2seq model. The encoder reads the Spanish source sentence "Vivo en un pueblo pequeño en una montaña con mi mujer y dos hijos." into hidden states h1, h2, h3, ...; its final hidden state initializes the decoder state s0. Starting from <SOS>, the decoder then generates the English translation auto-regressively ("I", "live", "in", ...), and each output ŷt is compared against the ground truth yt.]
Implementation: Encoder
import torch
import torch.nn as nn

class EncoderLSTM(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, p):
        super(EncoderLSTM, self).__init__()
        self.input_size = input_size          # input language vocab size (length of one-hot input)
        self.embedding_size = embedding_size  # dimensionality of an input token (word embedding)
        self.hidden_size = hidden_size        # dimensionality of the hidden representation
        self.num_layers = num_layers          # number of layers in the LSTM
        self.dropout = nn.Dropout(p)
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        self.LSTM = nn.LSTM(self.embedding_size, hidden_size, num_layers, dropout=p)

    def forward(self, x):
        # x shape: [sequence length, batch size] of token ids
        # embedding shape: [sequence length, batch size, embedding dims]
        embedding = self.dropout(self.embedding(x))
        # outputs shape: [sequence length, batch size, hidden_size]
        # hidden_state, cell_state shape: [num_layers, batch_size, hidden_size]
        outputs, (hidden_state, cell_state) = self.LSTM(embedding)
        return hidden_state, cell_state
[Figure: a multi-layer LSTM unrolled over the input x1, x2, ...; outputs collects the top-layer hidden states at every time step, while hidden_state and cell_state hold the final states of every layer.]
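Usage sketch; the hyperparameters and the dummy batch are illustrative assumptions.

encoder = EncoderLSTM(input_size=5000, embedding_size=256, hidden_size=512,
                      num_layers=2, p=0.3)
src = torch.randint(0, 5000, (15, 32))      # [seq_len=15, batch_size=32] source token ids
hs, cs = encoder(src)                       # each: [num_layers=2, batch_size=32, hidden_size=512]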
Implementation: Decoder
class DecoderLSTM(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, p, output_size):
        super(DecoderLSTM, self).__init__()
        self.input_size = input_size          # output language vocab size (length of one-hot input to the decoder)
        self.embedding_size = embedding_size  # word embedding size
        self.hidden_size = hidden_size        # dimensionality of the hidden representation
        self.num_layers = num_layers          # number of layers in the LSTM
        self.output_size = output_size        # output language vocab size (length of one-hot output)
        self.dropout = nn.Dropout(p)
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        self.LSTM = nn.LSTM(self.embedding_size, hidden_size, num_layers, dropout=p)
        self.fc = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, x, hidden_state, cell_state):
        # x shape: [batch_size], the previous target token for each sentence
        x = x.unsqueeze(0)                            # shape: [1, batch_size]
        embedding = self.dropout(self.embedding(x))   # shape: [1, batch size, embedding dims]
        # outputs shape: [1, batch size, hidden_size]
        # hidden_state, cell_state shape: [num_layers, batch_size, hidden_size] (initialized from the encoder)
        outputs, (hidden_state, cell_state) = self.LSTM(embedding, (hidden_state, cell_state))
        predictions = self.fc(outputs)                # shape: [1, batch_size, output_size]
        predictions = predictions.squeeze(0)          # shape: [batch_size, output_size]
        return predictions, hidden_state, cell_state
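Usage sketch for a single decoding step, continuing the encoder example above (sizes are illustrative; hidden_size and num_layers must match the encoder's).

decoder = DecoderLSTM(input_size=6000, embedding_size=256, hidden_size=512,
                      num_layers=2, p=0.3, output_size=6000)
prev_token = torch.randint(0, 6000, (32,))      # previous target token per sentence
logits, hs, cs = decoder(prev_token, hs, cs)    # logits: [batch_size=32, output_size=6000]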
Implementation: Seq2seq Interface
class Seq2Seq(nn.Module):
    def __init__(self, Encoder_LSTM, Decoder_LSTM):
        super(Seq2Seq, self).__init__()
        self.Encoder_LSTM = Encoder_LSTM
        self.Decoder_LSTM = Decoder_LSTM

    def forward(self, source, target):
        batch_size = source.shape[1]   # source shape: [input language seq len, num_sentences]
        target_len = target.shape[0]   # target shape: [output language seq len, num_sentences]
        target_vocab_size = self.Decoder_LSTM.output_size  # output language vocab size
        outputs = torch.zeros(target_len, batch_size, target_vocab_size, device=source.device)
        hs, cs = self.Encoder_LSTM(source)      # encode the whole source sentence
        x = target[0]                           # trigger token <SOS>; shape: [batch_size]
        for i in range(1, target_len):
            output, hs, cs = self.Decoder_LSTM(x, hs, cs)
            outputs[i] = output
            x = output.argmax(1)                # greedy: feed the model's own prediction back in
        return outputs  # shape: [output language seq len, batch_size, target_vocab_size]
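Finally, a sketch of one training step with this model; the optimizer, loss, and dummy tensors are assumptions, not part of the slides, and it reuses the encoder and decoder instances from the usage sketches above.

import torch.optim as optim

# Training-step sketch (assumed loss/optimizer and dummy data).
model = Seq2Seq(encoder, decoder)
criterion = nn.CrossEntropyLoss()               # consider ignore_index=<PAD id> in practice
optimizer = optim.Adam(model.parameters(), lr=1e-3)

source = torch.randint(0, 5000, (15, 32))       # [src_len, batch] source token ids
target = torch.randint(0, 6000, (12, 32))       # [tgt_len, batch]; target[0] is <SOS>
logits = model(source, target)                  # [tgt_len, batch, target_vocab_size]
loss = criterion(logits[1:].reshape(-1, logits.shape[-1]),   # skip position 0 (<SOS>)
                 target[1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()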