1 of 15

Lecture 5: Sequence-to-Sequence (seq2seq) Models

Soo Kim

Spring 2025

2 of 15

Machine Translation Problem

  • Let’s consider another many-to-many NLP problem, machine translation: given a sentence in one language, the task is to generate a sentence with the same meaning in another language.


3 of 15

Review: Many-to-Many RNN

  • We covered many-to-many RNNs in the last lecture (a short PyTorch sketch of the recurrence follows the figure below):

[Figure: a many-to-many RNN unrolled for three steps. Each step updates the hidden state ht = tanh(Whh ht-1 + Wxh xt) and emits an output: ŷt = σ(Why ht) for binary classification, ŷt = Why ht for regression.]
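The recurrence in the figure can be written directly in PyTorch. A minimal sketch, with illustrative dimensions and randomly initialized weights Wxh, Whh, Why (names chosen to match the figure, not code from the lecture):

import torch

# Toy dimensions (assumptions for illustration): input_dim=8, hidden_dim=16, output_dim=1.
torch.manual_seed(0)
Wxh = torch.randn(16, 8) * 0.1    # input-to-hidden weights
Whh = torch.randn(16, 16) * 0.1   # hidden-to-hidden weights
Why = torch.randn(1, 16) * 0.1    # hidden-to-output weights

xs = [torch.randn(8) for _ in range(3)]   # a toy input sequence x1..x3
h = torch.zeros(16)                       # h0
y_hats = []
for x in xs:
    h = torch.tanh(Whh @ h + Wxh @ x)         # hidden update ht = tanh(Whh ht-1 + Wxh xt)
    y_hats.append(torch.sigmoid(Why @ h))     # ŷt = σ(Why ht); drop the sigmoid for regression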


4 of 15

Machine Translation Problem

  • Our first try: our many-to-many RNN!

Do you see any problem?

[Figure: the many-to-many RNN applied to translation, reading the Spanish input "Vivo en un pueblo pequeño en ..." one word per step (x1..x6) and emitting the English output "I live in a small town ..." one word per step (ŷ1..ŷ6).]


5 of 15

Machine Translation Problem

  • Our RNN assumes a 1:1 relationship:
    • Input length = output length.
    • Semantics of input[k] = semantics of output[k].
  • For machine translation,
    • sentence length varies by language, and
    • words may appear in a different order!

[Figure: the same diagram as the previous slide, used to highlight the length and word-order mismatch between the Spanish input and the English output.]


6 of 15

Machine Translation Problem

  • Sometimes it is even impossible to produce the correct output based on the inputs consumed so far.

This token should be “Vivo”, but we haven’t read the corresponding word “live” yet.

Generating one output token per input token may not work. Any better ideas?

[Figure: two RNN steps reading "I" and then "live"; the first output slot is marked "???" because the correct token cannot be produced from the input read so far.]


7 of 15

Encoder-Decoder Structure

  • Let’s step back to our original encoder structure, without producing an output at each step:

[Figure: encoder RNN reading the Spanish input word by word. h0 is randomly initialized and carries no information yet; h1 encodes x1 ("Vivo"), h2 encodes x1:2 ("Vivo en"), h3 encodes x1:3 ("Vivo en un"), h4 encodes x1:4 ("Vivo en un pueblo"), h5 encodes x1:5 ("Vivo en un pueblo pequeño"), and so on.]


8 of 15

Encoder-Decoder Structure

  • Let’s step back to our original encoder structure, without producing an output at each step.
  • Now we have an embedding containing the semantics of the entire input sequence (a minimal encoding sketch follows the figure below).
  • Then we build a decoder that generates the outputs one by one, starting from this embedding.
  • The loss used to train the encoder also comes all the way from the decoder outputs.

[Figure: the final encoder steps; at the end of the sequence, h15 encodes the entire input x1:15: "Vivo en un pueblo pequeño en una montaña con mi mujer y dos hijos."]
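
In PyTorch, "encode and keep only the final state" is just running an RNN and discarding the per-step outputs. A minimal sketch with illustrative sizes (vocab_size, embed_dim, hidden_dim and the random token ids are assumptions, not the lecture's data):

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 32, 64
embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim)              # an LSTM or GRU would work the same way

tokens = torch.randint(0, vocab_size, (15, 1))   # [seq_len=15, batch=1] toy token ids
outputs, h_n = rnn(embed(tokens))                # outputs: one vector per step; h_n: final state
context = h_n[-1]                                # [batch, hidden_dim]: the single vector
                                                 # summarizing the whole input sequence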


9 of 15

Decoder: Auto-Regressive Generation

  • At each step, given a hidden state (expected to carry information about the input sequence, i.e., the context) and the last output token (indicating where we are), the decoder decides the next output token.
  • Auto-regressive input: the lagged (previously generated) values of the sequence are used as inputs (a minimal decoding loop follows the figure below).

[Figure: decoder RNN. h0 is initialized with the last hidden state of the encoder. The first input is the special <SOS> (Start of Sentence) token indicating the first time step; h1 then encodes the context plus where we are (the first position), and the first token is output (e.g., the first word of the English sentence). Each later step takes the previous output as its auto-regressive input, and the loss is computed by comparing each output token with the ground truth (the English sentence).]
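
At inference time, the auto-regressive loop is easy to write down. A minimal greedy-decoding sketch, assuming a decoder with the same step interface as the DecoderLSTM defined later in this lecture, and illustrative special-token ids sos_id / eos_id:

import torch

def greedy_decode(decoder, hidden_state, cell_state, sos_id, eos_id, max_len=50):
    # Batch size 1: start from <SOS> and feed each prediction back as the next input.
    x = torch.tensor([sos_id])
    generated = []
    for _ in range(max_len):
        # One decoder step: scores over the output vocabulary, plus updated states.
        scores, hidden_state, cell_state = decoder(x, hidden_state, cell_state)
        next_token = scores.argmax(dim=1)        # greedy choice
        if next_token.item() == eos_id:          # stop at <EOS>
            break
        generated.append(next_token.item())
        x = next_token                           # auto-regressive input for the next step
    return generated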


10 of 15

Decoder: Teacher Forcing

  • At training time, we use the ground truth yt-1 as input, because the model needs to learn what to output given the correct inputs.
    • Otherwise, the model may not learn anything at the beginning!
  • At inference time, we do not have access to the ground truth yt-1, so we feed the previous prediction ŷt-1 back in auto-regressively.

[Figure: decoder at training time; the ground-truth tokens y1, y2 are fed as the decoder inputs at the following steps (teacher forcing), and each output ŷt is compared with yt for the loss.]


11 of 15

Decoder: Teacher Forcing

  • At training time, we use the ground truth yt-1 as input, because the model needs to learn what to output given the correct inputs.
    • Otherwise, the model may not learn anything at the beginning!
  • At inference time, we do not have access to the ground truth yt-1, so we feed the previous prediction ŷt-1 back in auto-regressively (a minimal sketch of this per-step choice follows the figure below).

[Figure: decoder at inference time; the predictions ŷ1, ŷ2 are fed back as the decoder inputs at the following steps (auto-regressive generation), while the ground truth y1, y2 is only used for evaluating the outputs.]
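
In code, the only difference between the two figures is which token is fed at the next step. A minimal sketch of that per-step choice; the teacher_forcing_ratio (mixing ground truth and predictions with some probability) is a common variant and an assumption here, not part of the lecture's implementation on the following slides:

import random

# prediction_scores: [batch_size, vocab_size]; ground_truth_token: [batch_size]
def next_decoder_input(prediction_scores, ground_truth_token, training, teacher_forcing_ratio=0.5):
    predicted_token = prediction_scores.argmax(dim=1)
    if training and random.random() < teacher_forcing_ratio:
        return ground_truth_token    # teacher forcing: feed the correct previous token
    return predicted_token           # inference (or no forcing): feed our own prediction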


12 of 15

Overall Sequence-to-Sequence (seq2seq) Model

  • Many-to-one RNN as the encoder, then one-to-many RNN as the decoder.
  • The input sequence is encoded into a single vector at the end of the encoder.
  • From this single vector, the decoder generates the output sequence.

[Figure: full seq2seq model. The encoder (hidden states h1-h3) reads the Spanish input "Vivo en un pueblo pequeño en una montaña con mi mujer y dos hijos."; its final hidden state initializes the decoder state s0, and the decoder (states s1-s3) starts from <SOS> and generates the English output "I live in ..." token by token, with each output ŷt compared against the ground truth yt.]


13 of 15

Implementation: Encoder

import torch
import torch.nn as nn

class EncoderLSTM(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, p):
        super(EncoderLSTM, self).__init__()
        self.input_size = input_size          # input-language vocabulary size (length of one-hot input)
        self.embedding_size = embedding_size  # dimensionality of an input token (word embedding)
        self.hidden_size = hidden_size        # dimensionality of the hidden representation
        self.num_layers = num_layers          # number of layers in the LSTM
        self.dropout = nn.Dropout(p)
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        self.LSTM = nn.LSTM(self.embedding_size, hidden_size, num_layers, dropout=p)

    def forward(self, x):
        # embedding shape: [sequence length, batch size, embedding dims]
        embedding = self.dropout(self.embedding(x))
        # outputs shape: [sequence length, batch size, hidden_size]
        # hidden_state, cell_state shape: [num_layers, batch size, hidden_size]
        outputs, (hidden_state, cell_state) = self.LSTM(embedding)
        return hidden_state, cell_state

[Figure: a stacked (multi-layer) LSTM unrolled over the input sequence; outputs collects the top-layer hidden state at every time step, while hidden_state and cell_state hold the final-step states of all layers, which are what the encoder returns.]
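
A quick shape check for the encoder above (illustrative sizes and random token ids, not the lecture's data):

import torch

encoder = EncoderLSTM(input_size=1000, embedding_size=32, hidden_size=64, num_layers=2, p=0.1)
source = torch.randint(0, 1000, (15, 4))   # [seq_len=15, batch_size=4] toy token ids
hs, cs = encoder(source)
print(hs.shape, cs.shape)                  # torch.Size([2, 4, 64]) for both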


14 of 15

Implementation: Decoder

class DecoderLSTM(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, p, output_size):
        super(DecoderLSTM, self).__init__()
        self.input_size = input_size          # decoder input vocabulary size (output-language tokens fed auto-regressively)
        self.embedding_size = embedding_size  # word embedding size
        self.hidden_size = hidden_size        # dimensionality of the hidden representation
        self.num_layers = num_layers          # number of layers in the LSTM
        self.output_size = output_size        # length of one-hot output (output-language vocabulary size)
        self.dropout = nn.Dropout(p)
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        self.LSTM = nn.LSTM(self.embedding_size, hidden_size, num_layers, dropout=p)
        self.fc = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, x, hidden_state, cell_state):
        x = x.unsqueeze(0)                            # shape of x: [1, batch_size]
        embedding = self.dropout(self.embedding(x))   # shape: [1, batch size, embedding dims]
        # outputs shape: [1, batch size, hidden_size]
        # hidden_state, cell_state shape: [num_layers, batch_size, hidden_size]
        # (on the first step these are the hidden_state, cell_state returned by the encoder)
        outputs, (hidden_state, cell_state) = self.LSTM(embedding, (hidden_state, cell_state))
        predictions = self.fc(outputs)                # shape: [1, batch_size, output_size]
        predictions = predictions.squeeze(0)          # shape: [batch_size, output_size]
        return predictions, hidden_state, cell_state
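
Continuing the shape check from the encoder slide, one decoding step with the encoder's final states (toy sizes; assuming <SOS> has token id 1 in the output-language vocabulary):

decoder = DecoderLSTM(input_size=1200, embedding_size=32, hidden_size=64,
                      num_layers=2, p=0.1, output_size=1200)
x = torch.full((4,), 1, dtype=torch.long)   # an <SOS> token for each of the 4 sentences
predictions, hs, cs = decoder(x, hs, cs)    # hs, cs are the states returned by EncoderLSTM
print(predictions.shape)                    # torch.Size([4, 1200]): scores over the output vocab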


15 of 15

Implementation: Seq2seq Interface

class Seq2Seq(nn.Module):
    def __init__(self, Encoder_LSTM, Decoder_LSTM):
        super(Seq2Seq, self).__init__()
        self.Encoder_LSTM = Encoder_LSTM
        self.Decoder_LSTM = Decoder_LSTM

    def forward(self, source, target):
        batch_size = source.shape[1]    # source shape: [input-language seq len, num_sentences]
        target_len = target.shape[0]    # target shape: [output-language seq len, num_sentences]
        target_vocab_size = self.Decoder_LSTM.output_size   # output-language vocabulary size

        # Pre-allocate the output scores for every target position.
        outputs = torch.zeros(target_len, batch_size, target_vocab_size, device=source.device)

        # Encode the whole source sentence into the final hidden/cell states.
        hs, cs = self.Encoder_LSTM(source)

        x = target[0]   # trigger token <SOS>; shape: [batch_size]
        for i in range(1, target_len):
            output, hs, cs = self.Decoder_LSTM(x, hs, cs)
            outputs[i] = output
            x = output.argmax(1)   # feed the prediction back (auto-regressive)

        return outputs   # shape: [output-language seq len, batch_size, target_vocab_size]
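
Putting the three classes together, a minimal training-step sketch (the toy vocabulary sizes and random token ids are assumptions; a real setup would use tokenized sentence pairs and ignore padding in the loss). Note that forward() above always feeds the model's own prediction back (x = output.argmax(1)); to apply teacher forcing during training, one could instead feed the ground-truth token target[i] at some steps, as sketched on the teacher-forcing slide.

import torch
import torch.nn as nn

encoder = EncoderLSTM(input_size=1000, embedding_size=32, hidden_size=64, num_layers=2, p=0.1)
decoder = DecoderLSTM(input_size=1200, embedding_size=32, hidden_size=64,
                      num_layers=2, p=0.1, output_size=1200)
model = Seq2Seq(encoder, decoder)

criterion = nn.CrossEntropyLoss()    # compares per-position token scores with ground-truth ids
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

source = torch.randint(0, 1000, (15, 4))   # [source seq len, batch] toy token ids
target = torch.randint(0, 1200, (12, 4))   # [target seq len, batch]; target[0] plays the <SOS> role

outputs = model(source, target)            # [target seq len, batch, target vocab size]
# Skip position 0 (<SOS>), flatten the rest, and compare with the ground truth.
loss = criterion(outputs[1:].reshape(-1, 1200), target[1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()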
