1 of 22

Introduction to Transformer-Based Language Models

Presenters : Bishnu Sarker, Sayane Shome

Date: 17-18 July, 2023

2 of 22

Learning Objectives of the session

Understand the fundamental concepts behind transformer-based language models, including the ProtTrans models for protein sequences.


3 of 22

What is Language Modeling?

  • Goal: compute the probability of a sentence or sequence of words:

P(W) = P(w1, w2, w3, w4, w5, …, wn)

  • Related task: probability of an upcoming word:

P(w5|w1,w2,w3,w4)

  • A model that computes either of these:

P(W) or P(wn|w1,w2…wn-1) is called a language model.


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

4 of 22

The Chain Rule applied to compute joint probability of words in a sentence

P(“its water is so transparent”) = P(its) × P(water | its) × P(is | its water) × P(so | its water is) × P(transparent | its water is so)
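Written generally in LaTeX, the chain rule used above decomposes the joint probability of any word sequence into a product of conditional probabilities:

P(w_1, w_2, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})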


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

5 of 22

Markov Assumption

In other words, we approximate each component in the product by conditioning on only the last few words of context rather than the entire history:
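P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-k}, \dots, w_{i-1})

Here k is the length of the retained history; k = 1 gives the bigram model on the following slides.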


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

6 of 22

Simplest case: Unigram model


fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the

Some automatically generated sentences from a Unigram model
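A minimal sketch in Python of how such output arises: a unigram model samples every word independently from the corpus word frequencies. The toy corpus below is made up for illustration and is not the one behind the slide's examples.

    # Minimal unigram LM sketch: every word is sampled independently of context.
    import random
    from collections import Counter

    corpus = "the cat sat on the mat the dog sat on the rug".split()  # toy corpus
    counts = Counter(corpus)
    total = sum(counts.values())

    words = list(counts)
    probs = [counts[w] / total for w in words]  # unigram probabilities P(w)

    def sample_unigram_sentence(length=8):
        # No conditioning on previous words, hence the word-salad output.
        return " ".join(random.choices(words, weights=probs, k=length))

    print(sample_unigram_sentence())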

Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

7 of 22

Bigram model


  • Condition on the previous word:

texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen

outside, new, car, parking, lot, of, the, agreement, reached

this, would, be, a, record, november
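A matching sketch for the bigram case, again with a made-up toy corpus: each word is sampled conditioned only on the previous word, which is why the output on this slide reads slightly more like English.

    # Minimal bigram LM sketch: P(next word | previous word) from bigram counts.
    import random
    from collections import defaultdict, Counter

    corpus = ("<s> the cat sat on the mat </s> "
              "<s> the dog sat on the rug </s>").split()  # toy corpus

    bigram_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        bigram_counts[prev][nxt] += 1

    def sample_bigram_sentence(max_len=12):
        word, out = "<s>", []
        for _ in range(max_len):
            followers = bigram_counts[word]  # counts of words seen after `word`
            word = random.choices(list(followers), weights=list(followers.values()))[0]
            if word == "</s>":
                break
            out.append(word)
        return " ".join(out)

    print(sample_bigram_sentence())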

Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

8 of 22

N-gram models

  • We can extend to trigrams, 4-grams, 5-grams.
  • In general this is an insufficient model of language because language has long-distance dependencies:

“The computer which I had just put into the machine room on the fifth floor crashed.”

  • But we can often get away with N-gram models.


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

9 of 22

Neural Language Models (LMs)

  • Language Modeling: Calculating the probability of the next word in a sequence given some history.
    • We've seen N-gram based LMs
    • But neural network LMs far outperform n-gram language models
  • State-of-the-art neural LMs are based on more powerful neural network technology like Transformers
  • But simple feedforward LMs can do almost as well!


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

10 of 22

Simple feedforward Neural Language Models

  • Task: predict next word wt, given prior words wt-1, wt-2, wt-3, …
  • Problem: Now we’re dealing with sequences of arbitrary length.
  • Solution: sliding windows of fixed length (a minimal sketch follows below)
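A minimal sketch, assuming PyTorch, of a fixed-window feedforward LM: the embeddings of the previous window of words are concatenated, passed through one hidden layer, and projected to a softmax over the vocabulary. All sizes are illustrative.

    # Minimal fixed-window feedforward LM sketch (all sizes are illustrative).
    import torch
    import torch.nn as nn

    class FeedForwardLM(nn.Module):
        def __init__(self, vocab_size=10000, window=3, emb_dim=64, hidden=128):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            self.hidden = nn.Linear(window * emb_dim, hidden)
            self.out = nn.Linear(hidden, vocab_size)

        def forward(self, context_ids):                # (batch, window) word ids
            e = self.embed(context_ids)                # (batch, window, emb_dim)
            h = torch.tanh(self.hidden(e.flatten(1)))  # concatenated window embeddings
            return self.out(h)                         # logits over the next word

    # Score the next word given a fixed window of the 3 previous word ids.
    lm = FeedForwardLM()
    next_word_probs = torch.softmax(lm(torch.tensor([[11, 42, 7]])), dim=-1)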


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

11 of 22

Neural Language Model

Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

12 of 22

Why Neural LMs work better than N-gram LMs

  • Training data:
    • We've seen: I have to make sure that the cat gets fed.
    • Never seen: dog gets fed
  • Test data:
    • I forgot to make sure that the dog gets ___
    • N-gram LM can't predict "fed"!
    • A neural LM can use the similarity of the "cat" and "dog" embeddings to generalize and predict “fed” after “dog” (see the toy sketch below)
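A toy illustration of that intuition with made-up vectors (not real trained embeddings): because the "cat" and "dog" vectors are close, the unseen context looks like one the model has already learned.

    # Toy embedding-similarity sketch; the vectors are invented for illustration.
    import numpy as np

    emb = {
        "cat": np.array([0.90, 0.10, 0.30]),
        "dog": np.array([0.85, 0.15, 0.35]),
        "car": np.array([0.10, 0.90, 0.20]),
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(emb["cat"], emb["dog"]))  # high: "dog" context resembles "cat" context
    print(cosine(emb["cat"], emb["car"]))  # much lower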


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

13 of 22

Transformer-Based Large Language Models

14 of 22

Transformer-Based Large Language Models


  • The de facto sequence model architecture stacks multiple identical encoder and decoder blocks.
  • Each encoder block consists of (1) a self-attention layer and (2) a feed-forward layer.
  • Through its 8 heads, the attention layer attends to different parts of the input.
  • Each token position is passed through the feed-forward network independently.
  • The output of each encoder is passed up the stack until the top encoder feeds the decoders.
  • The output of the top encoder can also be used as contextual embeddings (a minimal sketch of one encoder layer follows below).
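A minimal sketch, assuming PyTorch, of one encoder layer in this style: multi-head self-attention followed by a position-wise feed-forward network, each wrapped with a residual connection and layer normalization. The dimensions are illustrative rather than those of the original paper.

    # Minimal Transformer encoder layer sketch (illustrative dimensions).
    import torch
    import torch.nn as nn

    class EncoderLayer(nn.Module):
        def __init__(self, d_model=512, n_heads=8, d_ff=2048):
            super().__init__()
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.ff = nn.Sequential(
                nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
            )
            self.norm1 = nn.LayerNorm(d_model)
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, x):                 # x: (batch, seq_len, d_model)
            attn_out, _ = self.attn(x, x, x)  # every position attends to every other
            x = self.norm1(x + attn_out)      # residual connection + layer norm
            x = self.norm2(x + self.ff(x))    # position-wise feed-forward network
            return x

    # Contextual embeddings for one sequence of 10 token vectors.
    layer = EncoderLayer()
    embeddings = layer(torch.randn(1, 10, 512))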

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

15 of 22

Transformer-Based Large Language Models


Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

[Figure: Transformer encoder-decoder architecture. Encoder block: Embedding Layer + Position Encoding → Multi-Head Attention Layer → Feed Forward Layer. Decoder block: Multi-Head Attention Layer → Encoder-Decoder Attention Layer → Feed Forward Neural Network → Linear Layer → Softmax Layer.]

16 of 22

Transformer-Based Large Language Models


Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

[Figure: Transformer encoder-decoder architecture (same component labels as the previous slide).]

17 of 22

Transformer Self-Attention Layer



Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
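A minimal sketch of the scaled dot-product self-attention computed in this layer: queries, keys, and values are projections of the same input, and each output row is a weighted average of the value vectors. The projection matrices below are random and purely illustrative.

    # Minimal scaled dot-product self-attention sketch (random toy data).
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    seq_len, d_model, d_k = 5, 16, 8
    X = np.random.randn(seq_len, d_model)  # one row per input token

    # Learned projections in a real model; random here, purely illustrative.
    W_q = np.random.randn(d_model, d_k)
    W_k = np.random.randn(d_model, d_k)
    W_v = np.random.randn(d_model, d_k)

    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(d_k)     # how much each token attends to each other token
    weights = softmax(scores, axis=-1)  # each row sums to 1
    output = weights @ V                # each row: weighted average of value vectors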

18 of 22

Multi-Head Attention



Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
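A minimal sketch of the multi-head idea: the same attention computation runs several times in parallel on lower-dimensional projections, and the head outputs are concatenated and mixed by an output projection. Sizes and weights are again random and illustrative.

    # Minimal multi-head attention sketch: several heads run in parallel on
    # smaller projections, and their outputs are concatenated and re-mixed.
    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

    seq_len, d_model, n_heads = 5, 16, 8
    d_head = d_model // n_heads
    X = np.random.randn(seq_len, d_model)

    heads = []
    for _ in range(n_heads):
        # Each head has its own (here random, illustrative) projection matrices.
        W_q, W_k, W_v = (np.random.randn(d_model, d_head) for _ in range(3))
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

    # Concatenate all heads and mix them with a final output projection.
    W_o = np.random.randn(n_heads * d_head, d_model)
    multi_head_output = np.concatenate(heads, axis=-1) @ W_o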

19 of 22

Attention in Sequence Analysis

Elnaggar, Ahmed, et al. "ProtTrans: Toward understanding the language of life through self-supervised learning." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.10 (2021): 7112-7127. https://github.com/agemagician/ProtTrans

20 of 22

ProtTrans Architecture for Sequence Embedding


Elnaggar, Ahmed, et al. "ProtTrans: Toward understanding the language of life through self-supervised learning." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.10 (2021): 7112-7127.
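A minimal sketch of extracting per-residue embeddings with a ProtTrans encoder through the Hugging Face transformers library, loosely following the style of the examples in the ProtTrans repository; the checkpoint name, preprocessing, and pooling choice here are assumptions to verify against the official notebooks.

    # Sketch: per-residue embeddings from a ProtT5 encoder (assumed checkpoint
    # name and preprocessing; check the ProtTrans repository for the exact recipe).
    import re
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    model_name = "Rostlab/prot_t5_xl_uniref50"   # assumed checkpoint name
    tokenizer = T5Tokenizer.from_pretrained(model_name, do_lower_case=False)
    model = T5EncoderModel.from_pretrained(model_name).eval()

    sequence = "MKTAYIAKQR"                      # toy protein sequence
    # ProtT5 expects space-separated residues, with rare amino acids mapped to X.
    prepared = " ".join(re.sub(r"[UZOB]", "X", sequence))

    inputs = tokenizer(prepared, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, len + 1, hidden_size)

    # Drop the trailing special token to keep one vector per residue.
    residue_embeddings = hidden[0, : len(sequence)]
    protein_embedding = residue_embeddings.mean(dim=0)  # simple per-protein summary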

21 of 22

Hands-on Tutorial

Google Colab notebook


22 of 22

Break!

We will reconvene in 15 minutes. Meanwhile, we are available for Q&A.

Next in line: Hands-on case study of Protein Function Annotation
