1 of 28

Encoder Decoder / Attention / Transformers

1

10-Apr-23

2 of 28

Today Nov 4

Encoder Decoder
Attention
Transformers

3 of 28

Encoder-Decoder

RNN: input sequence is transformed into output sequence in a one-to-one fashion.

Goal: develop an architecture capable of generating contextually appropriate, arbitrary-length output sequences
Applications:

Machine translation,
Summarization,
Question answering,
Dialogue modeling.

4 of 28

Simple recurrent neural network illustrated as a feed-forward network

Most significant change: new set of weights, U

connect the hidden layer from the previous time step to the current hidden layer.
determine how the network should make use of past context in calculating the output for the current input.
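The role of U can be made concrete with a minimal NumPy sketch of a simple RNN forward pass. Only U comes from the slide; the names W (input weights) and V (output weights), the tanh and softmax activations, and the toy dimensions are assumptions made for illustration.

```python
import numpy as np

def rnn_forward(xs, W, U, V, h0=None):
    """Run a simple RNN over a sequence of input vectors."""
    h = np.zeros(U.shape[0]) if h0 is None else h0
    hs, ys = [], []
    for x in xs:
        # U connects the previous hidden state to the current one,
        # which is how past context reaches the current output.
        h = np.tanh(W @ x + U @ h)
        z = V @ h
        y = np.exp(z - z.max())
        y /= y.sum()                      # softmax over output classes
        hs.append(h)
        ys.append(y)
    return np.array(hs), np.array(ys)

# Toy dimensions and random weights (assumed, for illustration only).
rng = np.random.default_rng(0)
d_in, d_h, d_out = 4, 5, 3
W = rng.normal(scale=0.5, size=(d_h, d_in))
U = rng.normal(scale=0.5, size=(d_h, d_h))
V = rng.normal(scale=0.5, size=(d_out, d_h))
xs = rng.normal(size=(6, d_in))
hs, ys = rnn_forward(xs, W, U, V)
```

Setting U to zero removes the recurrence: each hidden state then depends only on the current input, as in an ordinary feed-forward network.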

5 of 28

Simple-RNN abstraction

Figure: simple RNN abstraction, with outputs y₁, y₂, y₃.

6 of 28

RNN Applications

Language Modeling

Sequence Classification (Sentiment, Topic)

Sequence to Sequence

7 of 28

Sentence Completion using an RNN

A trained neural language model can be used to generate novel sequences,
or to complete a given sequence (until the end-of-sentence token </s> is generated)
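The completion loop can be sketched as follows. The "model" here is a hypothetical hand-written table of next-word probabilities standing in for a trained neural language model, and the end-of-sentence token is written as `</s>`; both are assumptions for illustration.

```python
import random

# Hypothetical toy language model: next-word probabilities given the previous word.
BIGRAMS = {
    "<s>": {"there": 1.0},
    "there": {"lived": 1.0},
    "lived": {"a": 1.0},
    "a": {"hobbit": 0.5, "wizard": 0.5},
    "hobbit": {"</s>": 1.0},
    "wizard": {"</s>": 1.0},
}

def generate(prefix, model, rng, max_len=20):
    """Extend prefix one word at a time until </s> is generated."""
    words = list(prefix)
    while words[-1] != "</s>" and len(words) < max_len:
        dist = model[words[-1]]
        # Sample the next word from the model's distribution.
        next_word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(next_word)
    return words

completion = generate(["<s>", "there", "lived"], BIGRAMS, random.Random(0))
```

Starting from `["<s>"]` alone generates a novel sequence; starting from a longer prefix completes it.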

8 of 28

Extending (autoregressive) generation to Machine Translation

Build an RNN language model on the concatenation of source and target

Training data are parallel text e.g., English / French

there lived a hobbit vivait un hobbit

……..

there lived a hobbit </s> vivait un hobbit </s>

……..

The word generated at each time step is conditioned on the word from the previous step.
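A small sketch of how one training sequence could be built from a sentence pair. The function names and the use of `</s>` as separator are assumptions; the pairs only show the previous-word conditioning, while a real RNN also carries its hidden state forward.

```python
def make_training_sequence(source, target, sep="</s>"):
    """Concatenate a source sentence and its translation, each closed by a separator."""
    return source.split() + [sep] + target.split() + [sep]

def next_word_pairs(tokens):
    """(previous word, word to predict) pairs for language-model training."""
    return list(zip(tokens[:-1], tokens[1:]))

seq = make_training_sequence("there lived a hobbit", "vivait un hobbit")
pairs = next_word_pairs(seq)
```

The pair that crosses the separator, `("</s>", "vivait")`, is where the model learns to start the translation.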

9 of 28

Extending (autoregressive) generation to Machine Translation

Translation as Sentence Completion !

10 of 28

(simple) Encoder Decoder Networks

Encoder generates a contextualized representation of the input (last state).
Decoder takes that state and autoregressively generates a sequence of outputs

Limiting design choices

E and D assumed to have the same internal structure (here RNNs)
Final state of the E is the only context available to D
this context is only available to D as its initial hidden state.
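The three limiting choices can be seen in a minimal NumPy sketch: both sides are simple RNNs, the encoder returns only its final state, and the decoder receives it only as its initial hidden state. All weights, sizes, the shared embedding table and the use of token id 0 as both start and end marker are assumptions for illustration; the weights are random, so the output is meaningless as a translation.

```python
import numpy as np

rng = np.random.default_rng(0)
V_SIZE, D = 6, 8     # toy vocabulary size and hidden size (assumed)
EOS = 0              # assumed id of the separator / end-of-sequence token

E = rng.normal(scale=0.5, size=(V_SIZE, D))          # shared embedding table
enc_W, enc_U = (rng.normal(scale=0.5, size=(D, D)) for _ in range(2))
dec_W, dec_U = (rng.normal(scale=0.5, size=(D, D)) for _ in range(2))
out_V = rng.normal(scale=0.5, size=(V_SIZE, D))

def encode(src_ids):
    """Run the encoder RNN and keep only its final hidden state."""
    h = np.zeros(D)
    for i in src_ids:
        h = np.tanh(enc_W @ E[i] + enc_U @ h)
    return h

def decode(context, max_len=10):
    """Generate autoregressively; the context is used only as the initial state."""
    h, prev, out = context, EOS, []
    for _ in range(max_len):
        h = np.tanh(dec_W @ E[prev] + dec_U @ h)
        prev = int(np.argmax(out_V @ h))   # most likely next token
        out.append(prev)
        if prev == EOS:
            break
    return out

output_ids = decode(encode([3, 1, 4, 2]))
```

The output length is decided by the decoder (EOS or `max_len`), not by the input length.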

11 of 28

General Encoder Decoder Networks

Abstracting away from these choices

Encoder: accepts an input sequence, x_1:n and generates a corresponding sequence of contextualized representations, h_1:n
Context vector c: function of h_1:n and conveys the essence of the input to the decoder.
Decoder: accepts c as input and generates an arbitrary length sequence of hidden states h_1:m from which a corresponding sequence of output states y_1:m can be obtained.

Figure: Basic architecture for an abstract encoder-decoder network. The context is a function of the vector of contextualized input representations and may be used by the decoder in a variety of ways. (Diagram labels: h₁, h₂, h_n, h_m.)

Among the major limiting design choices are that the encoder and the decoder are assumed to have the same internal structure (RNNs in this case), that the final state of the encoder is the only context available to the decoder, and finally that this context is only available to the decoder as its initial hidden state. Abstracting away from these choices, we can say that encoder-decoder networks consist of three components:

1. An encoder that accepts an input sequence, x_1:n, and generates a corresponding sequence of contextualized representations, h_1:n.
2. A context vector, c, which is a function of h_1:n and conveys the essence of the input to the decoder.
3. A decoder, which accepts c as input and generates an arbitrary-length sequence of hidden states h_1:m, from which a corresponding sequence of output states y_1:m can be obtained.

12 of 28

Popular architectural choices: Encoder

Widely used encoder design: stacked Bi-LSTMs

Contextualized representations for each time step: hidden states from top layers from the forward and backward passes
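A sketch of how the forward and backward hidden states are combined per time step. To keep it short, it uses a single layer of simple tanh RNN cells instead of stacked LSTMs; that substitution, and all names and sizes, are assumptions for illustration.

```python
import numpy as np

def run_rnn(xs, W, U):
    """Hidden states of a simple RNN over xs, one row per time step."""
    h, hs = np.zeros(U.shape[0]), []
    for x in xs:
        h = np.tanh(W @ x + U @ h)
        hs.append(h)
    return np.stack(hs)

def bidirectional_states(xs, fwd, bwd):
    """Concatenate forward and backward hidden states for each time step."""
    h_fwd = run_rnn(xs, *fwd)
    # Run right-to-left, then flip back so row t again corresponds to input t.
    h_bwd = run_rnn(xs[::-1], *bwd)[::-1]
    return np.concatenate([h_fwd, h_bwd], axis=1)

rng = np.random.default_rng(0)
n, d_in, d = 5, 3, 4
xs = rng.normal(size=(n, d_in))
fwd = (rng.normal(size=(d, d_in)), rng.normal(size=(d, d)))
bwd = (rng.normal(size=(d, d_in)), rng.normal(size=(d, d)))
states = bidirectional_states(xs, fwd, bwd)   # shape (n, 2 * d)
```

Each row of `states` sees the left context through its first half and the right context through its second half.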

13 of 28

Decoder Basic Design

Produce an output sequence one element at a time

Figure: the last hidden state of the encoder is used as the first hidden state of the decoder. (Other diagram labels: z₁, z₂.)

14 of 28

Decoder Design Enhancement

Context available at each step of decoding
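One way to make the context available at every step is to feed it into each decoder update alongside the previous output and the previous hidden state. The sketch below assumes a simple tanh RNN cell and a separate weight matrix C for the context; both are illustrative choices, not something the slide specifies.

```python
import numpy as np

def decoder_step(prev_emb, h_prev, c, W, U, C):
    """One decoder update that sees the context c directly at every step."""
    return np.tanh(W @ prev_emb + U @ h_prev + C @ c)

rng = np.random.default_rng(0)
d = 4
W, U, C = (rng.normal(scale=0.5, size=(d, d)) for _ in range(3))
prev_emb, h_prev, c = (rng.normal(size=d) for _ in range(3))
h = decoder_step(prev_emb, h_prev, c, W, U, C)
```

With this design the influence of the context does not fade as the output sequence grows, unlike the basic design where it enters only through the initial state.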

15 of 28

Decoder: How output y is chosen

Sample from the softmax distribution (fine for generating novel output, not for tasks such as MT or summarization)
Pick the most likely output at each step (does not guarantee that the individual choices make sense together)

For sequence labeling we used Viterbi; here that is not possible, because each output depends on all of the previous outputs ☹
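The two options for choosing y can be sketched in a few lines of NumPy. The function names and the example logits are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def choose_greedy(logits):
    """Most likely output at this step."""
    return int(np.argmax(logits))

def choose_sample(logits, rng):
    """Draw an output at random according to the softmax distribution."""
    p = softmax(logits)
    return int(rng.choice(len(p), p=p))

logits = np.array([0.1, 2.0, -1.0])
rng = np.random.default_rng(0)
greedy = choose_greedy(logits)
samples = [choose_sample(logits, rng) for _ in range(200)]
```

Greedy choice always returns the same index for the same logits, while sampling varies from call to call, which is useful for novel text but undesirable when one faithful output is wanted.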

16 of 28

Beam search (beam width 4)

Take the 4 most likely “words” decoded from the initial state.
Feed each of those into the decoder and keep the 4 most likely sequences of two words.
Feed the most recent word into the decoder and keep the 4 most likely sequences of three words, and so on.
When EOS is generated, stop that sequence and reduce the beam width by 1.
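The procedure above can be sketched as follows. The `step_fn` interface (prefix in, next-token probabilities out) and the toy probability table are hypothetical stand-ins for a real decoder, and the sketch ranks hypotheses by raw log-probability without length normalization.

```python
import math

def beam_search(step_fn, start, eos, beam_width=4, max_len=10):
    """step_fn(prefix) -> {token: probability} for the next token."""
    beam = [(0.0, [start])]          # (log-probability, tokens so far)
    finished = []
    width = beam_width
    for _ in range(max_len):
        if width <= 0 or not beam:
            break
        # Extend every hypothesis on the beam by every possible next token.
        candidates = [
            (logp + math.log(p), seq + [tok])
            for logp, seq in beam
            for tok, p in step_fn(seq).items()
            if p > 0
        ]
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for logp, seq in candidates[:width]:
            # A hypothesis that ends in EOS is complete and leaves the beam.
            (finished if seq[-1] == eos else beam).append((logp, seq))
        width = beam_width - len(finished)   # beam shrinks by 1 per finished sequence
    return sorted(finished + beam, key=lambda c: c[0], reverse=True)

# Hypothetical toy decoder: next-token probabilities depend on the last token only.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"</s>": 0.55, "b": 0.45},
    "b": {"</s>": 0.9, "a": 0.1},
}

def toy_step(prefix):
    return TABLE[prefix[-1]]

results = beam_search(toy_step, "<s>", "</s>", beam_width=2)
```

In this toy table, choosing the most likely word at each step gives "a" then "</s>" with probability 0.6 × 0.55 = 0.33, while the beam also keeps "b" alive and finds "b </s>" with probability 0.4 × 0.9 = 0.36.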

17 of 28

Today Nov 4

Encoder Decoder
Attention
Transformers

18 of 28

Sequence to Sequence Learning

An encoder processes the input sequence and compresses the information into a context vector (also known as sentence embedding or “thought” vector) of a fixed length. This representation is expected to be a good summary of the meaning of the whole source sequence.
A decoder is initialized with the context vector to emit the transformed output. The early work only used the last state of the encoder network as the decoder initial state.

Both the encoder and decoder are recurrent neural networks, e.g. using LSTM or GRU units.

19 of 28

A critical and apparent disadvantage of this fixed-length context vector design is its inability to remember long sentences. Often the network has forgotten the first part of the input by the time it finishes processing the whole of it. The attention mechanism was born (Bahdanau et al., 2015) to resolve this problem.

20 of 28

Attention Model

The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT).

The last encoder state cannot retain all the information from the previous states, so the attention model was introduced to overcome this problem.

Rather than building a single context vector out of the encoder’s last hidden state, attention creates shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element.

Because the context vector has access to the entire input sequence, we don’t need to worry about forgetting. The alignment between the source and target is learned and controlled by the context vector.

21 of 28

Essentially the context vector consumes three pieces of information:

encoder hidden states;
decoder hidden states;
alignment between source and target.

22 of 28

Flexible context: Attention

Context vector c: function of h_1:n and conveys the essence of the input to the decoder.

Figure labels: h₁, h₂, h_n, h_m.

Flexible?

Different for each h_i
Flexibly combining the h_j

23 of 28

Attention (1): dynamically derived context

Replace static context vector with dynamic c_i
derived from the encoder hidden states at each point i during decoding

Ideas:

should be a linear combination of those states

should depend on ?

24 of 28

Attention (2): computing c_i

Compute a vector of scores that capture the relevance of each encoder hidden state to the decoder state

Just the similarity

Give network the ability to learn which aspects of similarity between the decoder and encoder states are important to the current application.
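The formulas for these two scoring options did not survive on the slide. The block below gives the standard dot-product and bilinear forms they most plausibly refer to; the notation (h^d for decoder states, h^e for encoder states, the index i-1 for the decoder state, and the learned matrix W_s) is an assumption.

```latex
% Option 1: plain similarity (dot product)
\mathrm{score}(h^{d}_{i-1}, h^{e}_{j}) = h^{d}_{i-1} \cdot h^{e}_{j}

% Option 2: learned bilinear similarity with its own weights W_s
\mathrm{score}(h^{d}_{i-1}, h^{e}_{j}) = (h^{d}_{i-1})^{\top} W_s \, h^{e}_{j}
```

The matrix W_s is trained with the rest of the network, and it also allows encoder and decoder states of different dimensionality, which the plain dot product does not.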

25 of 28

Attention (3): computing c_i: from scores to weights

Create vector of weights by normalizing scores

Goal achieved: compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states.
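The full computation of c_i (scores, then normalized weights, then weighted average) can be sketched in NumPy. Using softmax for the normalization, the function name, and the optional bilinear matrix `W_s` are assumptions for illustration.

```python
import numpy as np

def attention_context(dec_state, enc_states, W_s=None):
    """Return (context vector, attention weights) for one decoder state.

    dec_state:  shape (d_dec,)
    enc_states: shape (n, d_enc), one row per encoder hidden state
    W_s:        optional (d_dec, d_enc) matrix for a bilinear score
    """
    if W_s is None:
        scores = enc_states @ dec_state              # dot-product similarity
    else:
        scores = enc_states @ (W_s.T @ dec_state)    # dec^T W_s enc_j for every j
    # Normalize the scores into weights that sum to 1 (softmax).
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Fixed-length context: weighted average of all encoder hidden states.
    context = weights @ enc_states
    return context, weights

enc = np.array([[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]])
dec = np.array([1.0, 0.0])
context, weights = attention_context(dec, enc)
```

The context has the same size no matter how many encoder states there are, and it is recomputed for every decoder state, so each output element gets its own view of the input.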

26 of 28

Attention: Summary

Encoder

Decoder

27 of 28

Explain Y. Goldberg’s different notation

28 of 28

Intro to Encoder-Decoder and Attention (Goldberg’s notation)

Encoder

Decoder