1 of 22

Introduction to Transformer-Based Language Models

Presenters : Bishnu Sarker, Sayane Shome

Date: 17-18 July, 2023

2 of 22

Learning Objectives of the session

Understand the fundamental concepts behind transformer-based language models, including the ProtTrans models for protein sequences.


3 of 22

What is Language Modeling?

  • Goal: compute the probability of a sentence or sequence of words:

P(W) = P(w1, w2, w3, w4, w5, …, wn)

  • Related task: probability of an upcoming word:

P(w5|w1,w2,w3,w4)

  • A model that computes either of these:

P(W) or P(wn|w1,w2…wn-1) is called a language model.


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

4 of 22

The Chain Rule applied to compute joint probability of words in a sentence

P(“its water is so transparent”) = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)
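Written generally in LaTeX, the chain rule used above decomposes the joint probability of any word sequence into a product of conditional probabilities:

P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})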


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

5 of 22

Markov Assumption

In other words, we approximate each component in the product by conditioning on only the last few words of context rather than the entire history:
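P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \dots, w_{i-1})

Here k is the length of the retained history; k = 1 gives the bigram model on the following slides.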


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

6 of 22

Simplest case: Unigram model


fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the

Some automatically generated sentences from a Unigram model
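A minimal sketch in Python of how such output arises: a unigram model samples every word independently from the corpus word frequencies. The toy corpus below is made up for illustration and is not the one behind the slide's examples.

    # Minimal unigram LM sketch: every word is sampled independently of context.
    import random
    from collections import Counter

    corpus = "the cat sat on the mat the dog sat on the rug".split()  # toy corpus
    counts = Counter(corpus)
    total = sum(counts.values())

    words = list(counts)
    probs = [counts[w] / total for w in words]  # unigram probabilities P(w)

    def sample_unigram_sentence(length=8):
        # No conditioning on previous words, hence the word-salad output.
        return " ".join(random.choices(words, weights=probs, k=length))

    print(sample_unigram_sentence())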

Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

7 of 22

Bigram model


  • Condition on the previous word:

texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen

outside, new, car, parking, lot, of, the, agreement, reached

this, would, be, a, record, november
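A matching sketch for the bigram case, again with a made-up toy corpus: each word is sampled conditioned only on the previous word, which is why the output on this slide reads slightly more like English.

    # Minimal bigram LM sketch: P(next word | previous word) from bigram counts.
    import random
    from collections import defaultdict, Counter

    corpus = ("<s> the cat sat on the mat </s> "
              "<s> the dog sat on the rug </s>").split()  # toy corpus

    bigram_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigram_counts[prev][nxt] += 1

    def sample_bigram_sentence(max_len=12):
        word, out = "<s>", []
        for _ in range(max_len):
            followers = bigram_counts[word]  # counts of words seen after `word`
            word = random.choices(list(followers), weights=list(followers.values()))[0]
            if word == "</s>":
                break
            out.append(word)
        return " ".join(out)

    print(sample_bigram_sentence())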

Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

8 of 22

N-gram models

  • We can extend to trigrams, 4-grams, 5-grams.
  • In general this is an insufficient model of language because language has long-distance dependencies:

“The computer which I had just put into the machine room on the fifth floor crashed.”

  • But we can often get away with N-gram models.


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

9 of 22

Neural Language Models (LMs)

  • Language Modeling: Calculating the probability of the next word in a sequence given some history.
    • We've seen N-gram based LMs
    • But neural network LMs far outperform n-gram language models
  • State-of-the-art neural LMs are based on more powerful neural network technology like Transformers
  • But simple feedforward LMs can do almost as well!


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

10 of 22

Simple feedforward Neural Language Models

  • Task: predict next word wt, given prior words wt-1, wt-2, wt-3, …
  • Problem: Now we’re dealing with sequences of arbitrary length.
  • Solution: sliding windows of fixed length (a minimal sketch follows below)
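A minimal sketch, assuming PyTorch, of a fixed-window feedforward LM: the embeddings of the previous window of words are concatenated, passed through one hidden layer, and projected to a softmax over the vocabulary. All sizes are illustrative.

    # Minimal fixed-window feedforward LM sketch (all sizes are illustrative).
    import torch
    import torch.nn as nn

    class FeedForwardLM(nn.Module):
        def __init__(self, vocab_size=10000, window=3, emb_dim=64, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.hidden = nn.Linear(window * emb_dim, hidden)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, context_ids):                # (batch, window) word ids
            e = self.embed(context_ids)                # (batch, window, emb_dim)
            h = torch.tanh(self.hidden(e.flatten(1)))  # concatenated window embeddings
            return self.out(h)                         # logits over the next word

    # Score the next word given a fixed window of the 3 previous word ids.
    lm = FeedForwardLM()
    next_word_probs = torch.softmax(lm(torch.tensor([[11, 42, 7]])), dim=-1)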


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

11 of 22

Neural Language Model

Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

12 of 22

Why Neural LMs work better than N-gram LMs

  • Training data:
    • We've seen: I have to make sure that the cat gets fed.
    • Never seen: dog gets fed
  • Test data:
    • I forgot to make sure that the dog gets ___
    • N-gram LM can't predict "fed"!
    • A neural LM can use the similarity of the "cat" and "dog" embeddings to generalize and predict “fed” after “dog” (see the toy sketch below)
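A toy illustration of that intuition with made-up vectors (not real trained embeddings): because the "cat" and "dog" vectors are close, the unseen context looks like one the model has already learned.

    # Toy embedding-similarity sketch; the vectors are invented for illustration.
    import numpy as np

    emb = {
        "cat": np.array([0.90, 0.10, 0.30]),
        "dog": np.array([0.85, 0.15, 0.35]),
        "car": np.array([0.10, 0.90, 0.20]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(emb["cat"], emb["dog"]))  # high: "dog" context resembles "cat" context
    print(cosine(emb["cat"], emb["car"]))  # much lower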


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

13 of 22

Transformer-Based Large Language Models

14 of 22

Transformer-Based Large Language Models


  • The de facto sequence model architecture stacks multiple identical encoder and decoder blocks.
  • Each encoder block consists of (1) a self-attention layer and (2) a feed-forward layer.
  • Through its 8 heads, the attention layer attends to different parts of the input.
  • Each token position is passed through the feed-forward network independently.
  • The output of each encoder is passed up the stack until the top encoder feeds the decoders.
  • The output of the top encoder can also be used as contextual embeddings (a minimal sketch of one encoder layer follows below).
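A minimal sketch, assuming PyTorch, of one encoder layer in this style: multi-head self-attention followed by a position-wise feed-forward network, each wrapped with a residual connection and layer normalization. The dimensions are illustrative rather than those of the original paper.

    # Minimal Transformer encoder layer sketch (illustrative dimensions).
    import torch
    import torch.nn as nn

    class EncoderLayer(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):                 # x: (batch, seq_len, d_model)
            attn_out, _ = self.attn(x, x, x)  # every position attends to every other
            x = self.norm1(x + attn_out)      # residual connection + layer norm
            x = self.norm2(x + self.ff(x))    # position-wise feed-forward network
            return x

    # Contextual embeddings for one sequence of 10 token vectors.
    layer = EncoderLayer()
    embeddings = layer(torch.randn(1, 10, 512))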

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

15 of 22

Transformer-Based Large Language Models


Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

[Figure: Transformer encoder-decoder architecture. Encoder block: Embedding Layer + Position Encoding → Multi-Head Attention Layer → Feed Forward Layer. Decoder block: Multi-Head Attention Layer → Encoder-Decoder Attention Layer → Feed Forward Neural Network → Linear Layer → Softmax Layer.]

16 of 22

Transformer-Based Large Language Models


Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

[Figure: Transformer encoder-decoder architecture (same component labels as the previous slide).]

17 of 22

Transformer Self-Attention Layer



Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
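A minimal sketch of the scaled dot-product self-attention computed in this layer: queries, keys, and values are projections of the same input, and each output row is a weighted average of the value vectors. The projection matrices below are random and purely illustrative.

    # Minimal scaled dot-product self-attention sketch (random toy data).
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    seq_len, d_model, d_k = 5, 16, 8
    X = np.random.randn(seq_len, d_model)  # one row per input token

    # Learned projections in a real model; random here, purely illustrative.
    W_q = np.random.randn(d_model, d_k)
    W_k = np.random.randn(d_model, d_k)
    W_v = np.random.randn(d_model, d_k)

    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_k)     # how much each token attends to each other token
    weights = softmax(scores, axis=-1)  # each row sums to 1
    output = weights @ V                # each row: weighted average of value vectors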

18 of 22

Multi-Head Attention



Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
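A minimal sketch of the multi-head idea: the same attention computation runs several times in parallel on lower-dimensional projections, and the head outputs are concatenated and mixed by an output projection. Sizes and weights are again random and illustrative.

    # Minimal multi-head attention sketch: several heads run in parallel on
    # smaller projections, and their outputs are concatenated and re-mixed.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

    seq_len, d_model, n_heads = 5, 16, 8
    d_head = d_model // n_heads
    X = np.random.randn(seq_len, d_model)

    heads = []
    for _ in range(n_heads):
        # Each head has its own (here random, illustrative) projection matrices.
        W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

    # Concatenate all heads and mix them with a final output projection.
    W_o = np.random.randn(n_heads * d_head, d_model)
    multi_head_output = np.concatenate(heads, axis=-1) @ W_o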

19 of 22

Attention in Sequence Analysis

Elnaggar, Ahmed, et al. "ProtTrans: Toward understanding the language of life through self-supervised learning." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.10 (2021): 7112-7127. https://github.com/agemagician/ProtTrans

20 of 22

ProtTrans Architecture for Sequence Embedding


Elnaggar, Ahmed, et al. "ProtTrans: Toward understanding the language of life through self-supervised learning." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.10 (2021): 7112-7127.
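A minimal sketch of extracting per-residue embeddings with a ProtTrans encoder through the Hugging Face transformers library, loosely following the style of the examples in the ProtTrans repository; the checkpoint name, preprocessing, and pooling choice here are assumptions to verify against the official notebooks.

    # Sketch: per-residue embeddings from a ProtT5 encoder (assumed checkpoint
    # name and preprocessing; check the ProtTrans repository for the exact recipe).
    import re
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    model_name = "Rostlab/prot_t5_xl_uniref50"   # assumed checkpoint name
    tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(model_name).eval()

    sequence = "MKTAYIAKQR"                      # toy protein sequence
    # ProtT5 expects space-separated residues, with rare amino acids mapped to X.
    prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

    inputs = tokenizer(prepared, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, len + 1, hidden_size)

    # Drop the trailing special token to keep one vector per residue.
    residue_embeddings = hidden[0, : len(sequence)]
    protein_embedding = residue_embeddings.mean(dim=0)  # simple per-protein summary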

21 of 22

Hands-on Tutorial

Google Colab notebook


22 of 22

Break!

We will reconvene in 15 minutes. Meanwhile, we are available for Q&A.

Next in line: Hands-on case study of Protein Function Annotation
