1 of 28

Encoder-Decoder / Attention / Transformers

10-Apr-23

2 of 28

Today Nov 4

  • Encoder Decoder
  • Attention
  • Transformers


3 of 28

Encoder-Decoder

  • RNN: input sequence is transformed into output sequence in a one-to-one fashion.

  • Goal: Develop an architecture capable of generating contextually appropriate, arbitrary length, output sequences
  • Applications:
    • Machine translation,
    • Summarization,
    • Question answering,
    • Dialogue modeling.

4 of 28

Simple recurrent neural network illustrated as a feed-forward network

Most significant change: new set of weights, U

  • connect the hidden layer from the previous time step to the current hidden layer.
  • determine how the network should make use of past context in calculating the output for the current input.
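The role of U can be sketched in a few lines of plain Python. This is a minimal illustration, not the slide's exact formulation: the tanh activation, the bias b, the input weights W, and the zero initial state are all assumptions made for the example.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(x_t, h_prev, W, U, b):
    """One simple-RNN step: h_t = tanh(W x_t + U h_prev + b).

    W, b and the tanh activation are assumptions for this sketch.
    """
    wx = matvec(W, x_t)      # contribution of the current input
    uh = matvec(U, h_prev)   # contribution of past context, through U
    return [math.tanh(a + c + d) for a, c, d in zip(wx, uh, b)]

def run_rnn(xs, W, U, b):
    """Unroll the RNN over a sequence, starting from a zero hidden state."""
    h = [0.0] * len(b)
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, W, U, b)
        states.append(h)
    return states
```

Setting U to all zeros removes the recurrence, and each hidden state then depends only on the current input.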

 

 

 

5 of 28

Simple-RNN abstraction

[Figure: simple RNN abstraction with outputs y1, y2, y3]

6 of 28

RNN Applications

  • Language Modeling

  • Sequence Classification (Sentiment, Topic)

  • Sequence to Sequence

7 of 28

Sentence Completion using an RNN

  • Trained Neural Language Model can be used to generate novel sequences
  • Or to complete a given sequence (until the end-of-sentence token </s> is generated)
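Completion can be sketched as a loop that keeps appending the model's next token until the end-of-sentence token (written `</s>` here) appears. The table `TOY_LM` below is a hypothetical stand-in for a trained model: it conditions only on the previous token, whereas a real RNN conditions on its hidden state, which summarizes the whole prefix. Picking the most likely token at each step is also an assumption made for the example.

```python
# Hypothetical toy "language model": next-token probabilities
# given only the previous token (a real RNN uses its hidden state).
TOY_LM = {
    "<s>": {"there": 0.6, "a": 0.4},
    "there": {"lived": 1.0},
    "lived": {"a": 1.0},
    "a": {"hobbit": 0.9, "wizard": 0.1},
    "hobbit": {"</s>": 1.0},
    "wizard": {"</s>": 1.0},
}

def complete(prefix, lm=TOY_LM, max_len=20):
    """Extend prefix one token at a time until </s> or max_len tokens."""
    tokens = list(prefix)
    while len(tokens) < max_len:
        dist = lm[tokens[-1]]                 # distribution over next tokens
        next_token = max(dist, key=dist.get)  # take the most likely one
        tokens.append(next_token)
        if next_token == "</s>":
            break
    return tokens
```

Calling `complete(["<s>", "there"])` extends the prefix word by word and stops once `</s>` has been appended.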

 

 

8 of 28

Extending (autoregressive) generation to Machine Translation

  • Build an RNN language model on the concatenation of source and target
  • Training data are parallel text, e.g., English/French

there lived a hobbit vivait un hobbit

……..

there lived a hobbit </s> vivait un hobbit </s>

……..

The word generated at each time step is conditioned on the word from the previous step.
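The data preparation this implies can be sketched as follows. The function names and the whitespace tokenization are assumptions for illustration; the separator token is written `</s>`.

```python
def make_training_sequence(source, target, sep="</s>"):
    """Concatenate a parallel sentence pair into one training sequence."""
    return source.split() + [sep] + target.split() + [sep]

def translation_prompt(source, sep="</s>"):
    """At translation time, feed the source plus separator as a prefix
    and let the language model complete it with the target sentence."""
    return source.split() + [sep]
```

For the pair above, `make_training_sequence("there lived a hobbit", "vivait un hobbit")` yields the source tokens, a separator, the target tokens, and a final separator.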

9 of 28

Extending (autoregressive) generation to Machine Translation

  • Translation as Sentence Completion !

10 of 28

(simple) Encoder Decoder Networks

  • Encoder generates a contextualized representation of the input (last state).
  • Decoder takes that state and autoregressively generates a sequence of outputs

Limiting design choices

  • Encoder (E) and decoder (D) are assumed to have the same internal structure (here RNNs)
  • The final state of E is the only context available to D
  • This context is only available to D as its initial hidden state
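The simple design, with its limiting choices, can be sketched as two loops. The step functions `enc_step` and `dec_step` and the `readout` function are hypothetical placeholders for trained RNN cells and an output layer; only the control flow is the point here.

```python
def encode(inputs, enc_step, h0):
    """Run the encoder over the whole input and return only its final state."""
    h = h0
    for x in inputs:
        h = enc_step(x, h)
    return h

def decode(context, dec_step, readout, start, eos, max_len=20):
    """Autoregressive decoding where the context is used only as the
    decoder's initial hidden state."""
    h = context
    y = start
    outputs = []
    for _ in range(max_len):
        h = dec_step(y, h)   # new state from previous output and previous state
        y = readout(h)       # choose an output symbol from the new state
        if y == eos:
            break
        outputs.append(y)
    return outputs
```

After the first step the decoder sees the context only indirectly, through its own hidden state, which is the limitation the last bullet points at.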

11 of 28

General Encoder Decoder Networks

Abstracting away from these choices

  1. Encoder: accepts an input sequence, x1:n and generates a corresponding sequence of contextualized representations, h1:n
  2. Context vector c: function of h1:n and conveys the essence of the input to the decoder.
  3. Decoder: accepts c as input and generates an arbitrary length sequence of hidden states h1:m from which a corresponding sequence of output states y1:m can be obtained.
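The three components can be written compactly as below. The superscripts e and d (to tell encoder states from decoder states) and the output function g are notation assumed here for clarity; they are not taken from the slide.

```latex
\begin{aligned}
h^{e}_{1:n} &= \mathrm{Encoder}(x_{1:n}) \\
c &= f\!\left(h^{e}_{1:n}\right) \\
h^{d}_{1:m} &= \mathrm{Decoder}(c) \\
y_{i} &= g\!\left(h^{d}_{i}\right), \quad i = 1, \dots, m
\end{aligned}
```

In this notation the simple network is the special case in which f returns only the last encoder state, so c = h^e_n.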

[Figure: encoder hidden states h1, h2, ..., hn and decoder hidden states h1, h2, ..., hm]

12 of 28

Popular architectural choices: Encoder

Widely used encoder design: stacked Bi-LSTMs

  • Contextualized representations for each time step: hidden states from top layers from the forward and backward passes
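A minimal sketch of how the forward and backward states are combined, assuming a single layer and states represented as Python lists. `fwd_step` and `bwd_step` are hypothetical stand-ins for trained LSTM cells; stacking is omitted for brevity.

```python
def bidirectional_states(inputs, fwd_step, bwd_step, h0_fwd, h0_bwd):
    """Return one contextualized representation per time step by
    concatenating the forward and backward hidden states."""
    fwd, h = [], h0_fwd
    for x in inputs:               # left-to-right pass
        h = fwd_step(x, h)
        fwd.append(h)
    bwd, h = [], h0_bwd
    for x in reversed(inputs):     # right-to-left pass
        h = bwd_step(x, h)
        bwd.append(h)
    bwd.reverse()                  # realign so bwd[t] matches inputs[t]
    return [f + b for f, b in zip(fwd, bwd)]  # list concatenation
```

Each returned vector therefore reflects both the left context and the right context of its position.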

13 of 28

Decoder Basic Design

  • produce an output sequence an element at a time

Last hidden state of the encoder

First hidden state of the decoder


 

14 of 28

Decoder Design Enhancement

Context available at each step of decoding
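The difference from the basic design can be stated as a change in the decoder's state update. The notation is assumed for this sketch: g is the recurrent update, \hat{y}_{t-1} the previously generated output, and h^d_t the decoder state.

```latex
\begin{aligned}
\text{basic:} \quad & h^{d}_{0} = c, \qquad h^{d}_{t} = g\!\left(\hat{y}_{t-1},\, h^{d}_{t-1}\right) \\
\text{enhanced:} \quad & h^{d}_{t} = g\!\left(\hat{y}_{t-1},\, h^{d}_{t-1},\, c\right)
\end{aligned}
```

In the enhanced form the influence of c cannot fade as the output sequence grows, because c is an explicit input at every step.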


15 of 28

Decoder: How output y is chosen

  • Sample from the softmax distribution (OK for generating novel output, not OK for e.g. MT or summarization)
  • Take the most likely output at each step (does not guarantee that the individual choices make sense together)
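The two options can be contrasted in a short sketch. The vocabulary and scores are made-up inputs; the function names are chosen here for illustration.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)                              # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pick_by_sampling(vocab, probs, rng):
    """Draw one output at random, in proportion to its probability."""
    return rng.choices(vocab, weights=probs, k=1)[0]

def pick_most_likely(vocab, probs):
    """Return the single most probable output (greedy choice)."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return vocab[best]
```

Sampling can return a different word on each call, while the greedy choice is deterministic but commits to one word at a time without looking ahead.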


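For sequence labeling we used Viterbi; here that is not possible ☹, because each decoding step depends on the entire output generated so far, so dynamic programming does not apply.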

16 of 28

Beam Search (beam width 4)

  • Keep the 4 most likely “words” decoded from the initial state
  • Feed each of those into the decoder and keep the 4 most likely two-word sequences
  • Feed the most recent word of each into the decoder and keep the 4 most likely three-word sequences, and so on
  • When EOS is generated, that sequence is complete: set it aside and reduce the beam width by 1
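The procedure above can be sketched as follows. The interface `next_probs(seq)`, which returns a dictionary of next-token probabilities for a partial sequence, is a hypothetical stand-in for the decoder; scoring by summed log-probabilities without length normalization is a simplifying assumption.

```python
import math

def beam_search(next_probs, start, eos, beam_width=4, max_len=20):
    """Return hypotheses as (log-probability, sequence), best first."""
    beam = [(0.0, [start])]   # active hypotheses
    finished = []             # hypotheses that have generated EOS
    width = beam_width
    for _ in range(max_len):
        if width <= 0 or not beam:
            break
        candidates = []
        for logp, seq in beam:
            for token, p in next_probs(seq).items():
                if p > 0.0:
                    candidates.append((logp + math.log(p), seq + [token]))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = []
        for logp, seq in candidates[:width]:
            if seq[-1] == eos:
                finished.append((logp, seq))   # complete: set it aside
            else:
                beam.append((logp, seq))
        width = beam_width - len(finished)     # shrink beam per finished sequence
    results = finished if finished else beam
    return sorted(results, key=lambda c: c[0], reverse=True)
```

With `beam_width=1` this reduces to the greedy choice. Because raw log-probabilities favor shorter sequences, practical systems usually add some form of length normalization, which this sketch leaves out.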

17 of 28

Today Nov 4

  • Encoder Decoder
  • Attention
  • Transformers


18 of 28

Sequence to Sequence Learning

  • An encoder processes the input sequence and compresses the information into a context vector (also known as sentence embedding or “thought” vector) of a fixed length. This representation is expected to be a good summary of the meaning of the whole source sequence.
  • A decoder is initialized with the context vector to emit the transformed output. The early work only used the last state of the encoder network as the decoder initial state.

  • Both the encoder and decoder are recurrent neural networks, e.g., using LSTM or GRU units.

19 of 28

  • A critical and apparent disadvantage of this fixed-length context vector design is its inability to remember long sentences. Often it has forgotten the first part of the input by the time it finishes processing the whole input. The attention mechanism was born (Bahdanau et al., 2015) to resolve this problem.

20 of 28

Attention Model

  • The attention mechanism was born to help memorize long source sentences in neural machine translation (NMT). Rather than building a single context vector out of the encoder’s last hidden state, it lets the context vector draw on every encoder hidden state.

  • The last state cannot retain all the information from previous states, so the attention model was introduced to overcome this problem.

  • The key idea of attention is to create shortcuts between the context vector and the entire source input. The weights of these shortcut connections are customizable for each output element.

  • Because the context vector has access to the entire input sequence, we don’t need to worry about forgetting. The alignment between the source and target is learned and controlled by the context vector.

21 of 28

Essentially the context vector consumes three pieces of information:

  • encoder hidden states;
  • decoder hidden states;
  • alignment between source and target.

22 of 28

Flexible context: Attention

Context vector c: function of h1:n and conveys the essence of the input to the decoder.

[Figure: encoder hidden states h1, h2, ..., hn and decoder hidden states h1, h2, ..., hm]

Flexible?

  • Different for each hi
  • Flexibly combining the hj

23 of 28

Attention (1): dynamically derived context

  • Replace static context vector with dynamic ci
  • derived from the encoder hidden states at each point i during decoding

Ideas:

  • should be a linear combination of those states

  • should depend on ?

24 of 28

Attention (2): computing ci

  • Compute a vector of scores that capture the relevance of each encoder hidden state to the decoder state
  • Simplest choice: just the similarity between the two states (e.g., their dot product)
  • More flexible: parameterize the score with learned weights, giving the network the ability to learn which aspects of similarity between the decoder and encoder states are important to the current application
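Two common scoring functions matching these bullets are shown below. The notation is assumed for this sketch: h^d_{i-1} is the decoder state preceding output step i, h^e_j is the j-th encoder state, and W_s is a learned weight matrix.

```latex
\begin{aligned}
\text{dot product:} \quad & \mathrm{score}\!\left(h^{d}_{i-1},\, h^{e}_{j}\right) = h^{d}_{i-1} \cdot h^{e}_{j} \\
\text{learned (bilinear):} \quad & \mathrm{score}\!\left(h^{d}_{i-1},\, h^{e}_{j}\right) = h^{d}_{i-1}\, W_{s}\, h^{e}_{j}
\end{aligned}
```

The bilinear form also allows the encoder and decoder states to have different dimensionalities, which the plain dot product does not.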

25 of 28

Attention (3): computing ci (from scores to weights)

  • Create a vector of weights by normalizing the scores (e.g., with a softmax)
  • Goal achieved: compute a fixed-length context vector for the current decoder state by taking a weighted average over all the encoder hidden states.
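The whole computation, from scores to weights to context vector, can be sketched in plain Python. Dot-product scoring and softmax normalization are assumptions made for the example, and the function names are chosen here for illustration.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Normalize scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(decoder_state, encoder_states):
    """Return (context vector, attention weights) for one decoding step."""
    scores = [dot(decoder_state, h) for h in encoder_states]  # relevance
    weights = softmax(scores)                                 # normalize
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]                           # weighted average
    return context, weights
```

The context has the same length as one encoder state regardless of how many input positions there are, and it is recomputed for every decoder step.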

26 of 28

Attention: Summary

[Figure: attention summary, showing the encoder and the decoder]

27 of 28

Explain Y. Goldberg’s different notation

28 of 28

Intro to Encoder-Decoder and Attention (Goldberg’s notation)

[Figure: encoder and decoder in Goldberg’s notation]