1 of 15

Lecture 5: Sequence-to-Sequence (seq2seq) Models

Soo Kim

Spring 2025

2 of 15

Machine Translation Problem

  • Let’s consider another many-to-many NLP problem, machine translation: given a sentence in one language, the task is to generate a sentence with the same meaning in another language.


3 of 15

Review: Many-to-Many RNN

  • We covered many-to-many RNNs in the last lecture (a short PyTorch sketch of the recurrence follows the figure below):

[Figure: a many-to-many RNN unrolled for three steps. Each step updates the hidden state ht = tanh(Whh ht-1 + Wxh xt) and emits an output: ŷt = σ(Why ht) for binary classification, ŷt = Why ht for regression.]
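The recurrence in the figure can be written directly in PyTorch. A minimal sketch, with illustrative dimensions and randomly initialized weights Wxh, Whh, Why (names chosen to match the figure, not code from the lecture):

import torch

# Toy dimensions (assumptions for illustration): input_dim=8, hidden_dim=16, output_dim=1.
torch.manual_seed(0)
Wxh = torch.randn(16, 8) * 0.1    # input-to-hidden weights
Whh = torch.randn(16, 16) * 0.1   # hidden-to-hidden weights
Why = torch.randn(1, 16) * 0.1    # hidden-to-output weights

xs = [torch.randn(8) for _ in range(3)]   # a toy input sequence x1..x3
h = torch.zeros(16)                       # h0
y_hats = []
for x in xs:
    h = torch.tanh(Whh @ h + Wxh @ x)         # hidden update ht = tanh(Whh ht-1 + Wxh xt)
    y_hats.append(torch.sigmoid(Why @ h))     # ŷt = σ(Why ht); drop the sigmoid for regression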


4 of 15

Machine Translation Problem

  • Our first try: our many-to-many RNN!

Do you see any problem?

[Figure: the many-to-many RNN applied to translation, reading the Spanish input "Vivo en un pueblo pequeño en ..." one word per step (x1..x6) and emitting the English output "I live in a small town ..." one word per step (ŷ1..ŷ6).]


5 of 15

Machine Translation Problem

  • Our RNN assumes a 1:1 relationship:
    • Input length = output length.
    • Semantics of input[k] = semantics of output[k].
  • For machine translation,
    • sentence length varies by language, and
    • words may appear in a different order!

[Figure: the same diagram as the previous slide, used to highlight the length and word-order mismatch between the Spanish input and the English output.]


6 of 15

Machine Translation Problem

  • Sometimes it is even impossible to produce the correct output based on the inputs consumed so far.

This token should be “Vivo”, but we haven’t read the corresponding word “live” yet.

Generating one output token per input token may not work. Any better ideas?

[Figure: two RNN steps reading "I" and then "live"; the first output slot is marked "???" because the correct token cannot be produced from the input read so far.]


7 of 15

Encoder-Decoder Structure

  • Let’s step back to our original encoder structure, without producing an output at each step:

[Figure: encoder RNN reading the Spanish input word by word. h0 is randomly initialized and carries no information yet; h1 encodes x1 ("Vivo"), h2 encodes x1:2 ("Vivo en"), h3 encodes x1:3 ("Vivo en un"), h4 encodes x1:4 ("Vivo en un pueblo"), h5 encodes x1:5 ("Vivo en un pueblo pequeño"), and so on.]


8 of 15

Encoder-Decoder Structure

  • Let’s step back to our original encoder structure, without producing an output at each step.
  • Now we have an embedding containing the semantics of the entire input sequence (a minimal encoding sketch follows the figure below).
  • Then we build a decoder that generates the outputs one by one, starting from this embedding.
  • The loss used to train the encoder also comes all the way from the decoder outputs.

[Figure: the final encoder steps; at the end of the sequence, h15 encodes the entire input x1:15: "Vivo en un pueblo pequeño en una montaña con mi mujer y dos hijos."]
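
In PyTorch, "encode and keep only the final state" is just running an RNN and discarding the per-step outputs. A minimal sketch with illustrative sizes (vocab_size, embed_dim, hidden_dim and the random token ids are assumptions, not the lecture's data):

import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 32, 64
embed = nn.Embedding(vocab_size, embed_dim)
rnn = nn.RNN(embed_dim, hidden_dim)              # an LSTM or GRU would work the same way

tokens = torch.randint(0, vocab_size, (15, 1))   # [seq_len=15, batch=1] toy token ids
outputs, h_n = rnn(embed(tokens))                # outputs: one vector per step; h_n: final state
context = h_n[-1]                                # [batch, hidden_dim]: the single vector
                                                 # summarizing the whole input sequence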


9 of 15

Decoder: Auto-Regressive Generation

  • At each step, given a hidden state (expected to carry information about the input sequence, i.e., the context) and the last output token (indicating where we are), the decoder decides the next output token.
  • Auto-regressive input: the lagged (previously generated) values of the sequence are used as inputs (a minimal decoding loop follows the figure below).

[Figure: decoder RNN. h0 is initialized with the last hidden state of the encoder. The first input is the special <SOS> (Start of Sentence) token indicating the first time step; h1 then encodes the context plus where we are (the first position), and the first token is output (e.g., the first word of the English sentence). Each later step takes the previous output as its auto-regressive input, and the loss is computed by comparing each output token with the ground truth (the English sentence).]
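
At inference time, the auto-regressive loop is easy to write down. A minimal greedy-decoding sketch, assuming a decoder with the same step interface as the DecoderLSTM defined later in this lecture, and illustrative special-token ids sos_id / eos_id:

import torch

def greedy_decode(decoder, hidden_state, cell_state, sos_id, eos_id, max_len=50):
    # Batch size 1: start from <SOS> and feed each prediction back as the next input.
    x = torch.tensor([sos_id])
    generated = []
    for _ in range(max_len):
        # One decoder step: scores over the output vocabulary, plus updated states.
        scores, hidden_state, cell_state = decoder(x, hidden_state, cell_state)
        next_token = scores.argmax(dim=1)        # greedy choice
        if next_token.item() == eos_id:          # stop at <EOS>
            break
        generated.append(next_token.item())
        x = next_token                           # auto-regressive input for the next step
    return generated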


10 of 15

Decoder: Teacher Forcing

  • At training time, we use the ground truth yt-1 as input, because the model needs to learn what to output given the correct inputs.
    • Otherwise, the model may not learn anything at the beginning!
  • At inference time, we do not have access to the ground truth yt-1, so we feed the previous prediction ŷt-1 back in auto-regressively.

[Figure: decoder at training time; the ground-truth tokens y1, y2 are fed as the decoder inputs at the following steps (teacher forcing), and each output ŷt is compared with yt for the loss.]


11 of 15

Decoder: Teacher Forcing

  • At training time, we use the ground truth yt-1 as input, because the model needs to learn what to output given the correct inputs.
    • Otherwise, the model may not learn anything at the beginning!
  • At inference time, we do not have access to the ground truth yt-1, so we feed the previous prediction ŷt-1 back in auto-regressively (a minimal sketch of this per-step choice follows the figure below).

[Figure: decoder at inference time; the predictions ŷ1, ŷ2 are fed back as the decoder inputs at the following steps (auto-regressive generation), while the ground truth y1, y2 is only used for evaluating the outputs.]
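
In code, the only difference between the two figures is which token is fed at the next step. A minimal sketch of that per-step choice; the teacher_forcing_ratio (mixing ground truth and predictions with some probability) is a common variant and an assumption here, not part of the lecture's implementation on the following slides:

import random

# prediction_scores: [batch_size, vocab_size]; ground_truth_token: [batch_size]
def next_decoder_input(prediction_scores, ground_truth_token, training, teacher_forcing_ratio=0.5):
    predicted_token = prediction_scores.argmax(dim=1)
    if training and random.random() < teacher_forcing_ratio:
        return ground_truth_token    # teacher forcing: feed the correct previous token
    return predicted_token           # inference (or no forcing): feed our own prediction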


12 of 15

Overall Sequence-to-Sequence (seq2seq) Model

  • Many-to-one RNN as the encoder, then one-to-many RNN as the decoder.
  • The input sequence is encoded into a single vector at the end of the encoder.
  • From this single vector, the decoder generates the output sequence.

[Figure: full seq2seq model. The encoder (hidden states h1-h3) reads the Spanish input "Vivo en un pueblo pequeño en una montaña con mi mujer y dos hijos."; its final hidden state initializes the decoder state s0, and the decoder (states s1-s3) starts from <SOS> and generates the English output "I live in ..." token by token, with each output ŷt compared against the ground truth yt.]


13 of 15

Implementation: Encoder

import torch
import torch.nn as nn

class EncoderLSTM(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, p):
        super(EncoderLSTM, self).__init__()
        self.input_size = input_size          # input-language vocabulary size (length of one-hot input)
        self.embedding_size = embedding_size  # dimensionality of an input token (word embedding)
        self.hidden_size = hidden_size        # dimensionality of the hidden representation
        self.num_layers = num_layers          # number of layers in the LSTM
        self.dropout = nn.Dropout(p)
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        self.LSTM = nn.LSTM(self.embedding_size, hidden_size, num_layers, dropout=p)

    def forward(self, x):
        # embedding shape: [sequence length, batch size, embedding dims]
        embedding = self.dropout(self.embedding(x))
        # outputs shape: [sequence length, batch size, hidden_size]
        # hidden_state, cell_state shape: [num_layers, batch size, hidden_size]
        outputs, (hidden_state, cell_state) = self.LSTM(embedding)
        return hidden_state, cell_state

[Figure: a stacked (multi-layer) LSTM unrolled over the input sequence; outputs collects the top-layer hidden state at every time step, while hidden_state and cell_state hold the final-step states of all layers, which are what the encoder returns.]
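
A quick shape check for the encoder above (illustrative sizes and random token ids, not the lecture's data):

import torch

encoder = EncoderLSTM(input_size=1000, embedding_size=32, hidden_size=64, num_layers=2, p=0.1)
source = torch.randint(0, 1000, (15, 4))   # [seq_len=15, batch_size=4] toy token ids
hs, cs = encoder(source)
print(hs.shape, cs.shape)                  # torch.Size([2, 4, 64]) for both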


14 of 15

Implementation: Decoder

class DecoderLSTM(nn.Module):
    def __init__(self, input_size, embedding_size, hidden_size, num_layers, p, output_size):
        super(DecoderLSTM, self).__init__()
        self.input_size = input_size          # decoder input vocabulary size (output-language tokens fed auto-regressively)
        self.embedding_size = embedding_size  # word embedding size
        self.hidden_size = hidden_size        # dimensionality of the hidden representation
        self.num_layers = num_layers          # number of layers in the LSTM
        self.output_size = output_size        # length of one-hot output (output-language vocabulary size)
        self.dropout = nn.Dropout(p)
        self.embedding = nn.Embedding(self.input_size, self.embedding_size)
        self.LSTM = nn.LSTM(self.embedding_size, hidden_size, num_layers, dropout=p)
        self.fc = nn.Linear(self.hidden_size, self.output_size)

    def forward(self, x, hidden_state, cell_state):
        x = x.unsqueeze(0)                            # shape of x: [1, batch_size]
        embedding = self.dropout(self.embedding(x))   # shape: [1, batch size, embedding dims]
        # outputs shape: [1, batch size, hidden_size]
        # hidden_state, cell_state shape: [num_layers, batch_size, hidden_size]
        # (on the first step these are the hidden_state, cell_state returned by the encoder)
        outputs, (hidden_state, cell_state) = self.LSTM(embedding, (hidden_state, cell_state))
        predictions = self.fc(outputs)                # shape: [1, batch_size, output_size]
        predictions = predictions.squeeze(0)          # shape: [batch_size, output_size]
        return predictions, hidden_state, cell_state
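
Continuing the shape check from the encoder slide, one decoding step with the encoder's final states (toy sizes; assuming <SOS> has token id 1 in the output-language vocabulary):

decoder = DecoderLSTM(input_size=1200, embedding_size=32, hidden_size=64,
                      num_layers=2, p=0.1, output_size=1200)
x = torch.full((4,), 1, dtype=torch.long)   # an <SOS> token for each of the 4 sentences
predictions, hs, cs = decoder(x, hs, cs)    # hs, cs are the states returned by EncoderLSTM
print(predictions.shape)                    # torch.Size([4, 1200]): scores over the output vocab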


15 of 15

Implementation: Seq2seq Interface

class Seq2Seq(nn.Module):
    def __init__(self, Encoder_LSTM, Decoder_LSTM):
        super(Seq2Seq, self).__init__()
        self.Encoder_LSTM = Encoder_LSTM
        self.Decoder_LSTM = Decoder_LSTM

    def forward(self, source, target):
        batch_size = source.shape[1]    # source shape: [input-language seq len, num_sentences]
        target_len = target.shape[0]    # target shape: [output-language seq len, num_sentences]
        target_vocab_size = self.Decoder_LSTM.output_size   # output-language vocabulary size

        # Pre-allocate the output scores for every target position.
        outputs = torch.zeros(target_len, batch_size, target_vocab_size, device=source.device)

        # Encode the whole source sentence into the final hidden/cell states.
        hs, cs = self.Encoder_LSTM(source)

        x = target[0]   # trigger token <SOS>; shape: [batch_size]
        for i in range(1, target_len):
            output, hs, cs = self.Decoder_LSTM(x, hs, cs)
            outputs[i] = output
            x = output.argmax(1)   # feed the prediction back (auto-regressive)

        return outputs   # shape: [output-language seq len, batch_size, target_vocab_size]
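
Putting the three classes together, a minimal training-step sketch (the toy vocabulary sizes and random token ids are assumptions; a real setup would use tokenized sentence pairs and ignore padding in the loss). Note that forward() above always feeds the model's own prediction back (x = output.argmax(1)); to apply teacher forcing during training, one could instead feed the ground-truth token target[i] at some steps, as sketched on the teacher-forcing slide.

import torch
import torch.nn as nn

encoder = EncoderLSTM(input_size=1000, embedding_size=32, hidden_size=64, num_layers=2, p=0.1)
decoder = DecoderLSTM(input_size=1200, embedding_size=32, hidden_size=64,
                      num_layers=2, p=0.1, output_size=1200)
model = Seq2Seq(encoder, decoder)

criterion = nn.CrossEntropyLoss()    # compares per-position token scores with ground-truth ids
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

source = torch.randint(0, 1000, (15, 4))   # [source seq len, batch] toy token ids
target = torch.randint(0, 1200, (12, 4))   # [target seq len, batch]; target[0] plays the <SOS> role

outputs = model(source, target)            # [target seq len, batch, target vocab size]
# Skip position 0 (<SOS>), flatten the rest, and compare with the ground truth.
loss = criterion(outputs[1:].reshape(-1, 1200), target[1:].reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()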
