1 of 48

2 of 48

SPEAKER DISCLOSURE

We have not had any relationships with ACCME-defined ineligible companies within the past 24 months.

3 of 48

AMIA 2024 W18

Applying Generative Language Models for Biological Sequence Analysis

Presenters: Bishnu Sarker, Sayane Shome

Date: 9 November, 2024

4 of 48

Organizing Team and Speakers


Sayane Shome, Ph.D.

Postdoctoral Fellow

Anesthesia and Pediatrics,

Stanford University School of Medicine,

California, USA

sshome@stanford.edu

Bishnu Sarker, Ph.D.

Assistant Professor

Computer Science and Data Science,

Meharry School of Applied Computational Sciences,

Tennessee, USA

bsarker@mmc.edu

5 of 48

Learning Objectives

Educational Objectives: By the end of the tutorial, participants will:

  • Learn the basics and applications of language models in bioinformatics.
  • Implement a Python pipeline for collecting, preprocessing, and vectorizing biological sequence data for analysis.
  • Apply pre-trained transformer models and language models for specific protein sequence analysis tasks.

Link to all presentation materials/tutorials: https://sites.google.com/view/bishnusarker/invited-talks-and-tutorials/AMIA-2024-W18


6 of 48

Tutorial Agenda: Saturday, Nov 9, 2024 (1:00 – 4:30 PM)


Schedule        Topics covered

1:00-1:20 PM    Session 1: Warm-up and Refresher

1:20-1:50 PM    Session 2: General Introduction to Language Models

1:50-2:30 PM    Session 3: Unboxing Generative Language Models

2:30-3:00 PM    Short Break and Q/As

3:00-4:00 PM    Session 4: Hands-on case studies on applications of transformers in protein sequence analysis

4:00-4:30 PM    Q/A and Closing remarks

7 of 48

Session 1

Warm-up and Refresher

Presenters: Bishnu Sarker, Sayane Shome

Date: 9 Nov, 2024

8 of 48

Proteins are the building blocks of life


9 of 48

Amino acids and Protein structures


  • Depending on the composition of the side chain (R group), there are 20 different amino acids.
  • Amino acids are linked by peptide bonds into long polypeptide chains, which fold in different ways to form the tertiary structure of a protein.
  • Different combinations and arrangements of amino acids result in a vast array of proteins.

10 of 48

Protein Domains and Protein Function

  • Protein domains are distinct units within a protein that possess specific structures and functions.
  • They provide modularity, flexibility, and evolutionary versatility to proteins, allowing for the development of complex functions and adaptation to diverse biological environments.
  • They can serve as sites for protein-protein interactions, metal binding, and other functions.


11 of 48

Biological Sequences Are Strings.

And String Similarity is at the heart of sequence analysis.


AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Given two Sequences

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Finding their sequence similarity, i.e., aligning them (as sketched below).
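For instance, a global alignment of the two example sequences can be computed with Biopython's pairwise aligner. This is a minimal sketch; the scoring parameters are illustrative choices, not those used to produce the alignment shown above.

# Minimal sketch: global alignment of the two example sequences with Biopython.
# Scoring parameters are illustrative, not those behind the slide's alignment.
from Bio import Align

seq1 = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
seq2 = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"

aligner = Align.PairwiseAligner()
aligner.mode = "global"            # Needleman-Wunsch-style global alignment
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -1
aligner.extend_gap_score = -0.5

alignments = aligner.align(seq1, seq2)
best = alignments[0]
print("Score:", best.score)
print(best)                         # prints the aligned sequences with gaps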

12 of 48

Introduction to Language Models and Transformers

  • Language models are designed to understand and generate human language.
  • Transformer-based models, such as BERT and GPT, have revolutionized this field.
  • Transformers use an attention mechanism that allows them to weigh the importance of different parts of the input sequence, making them particularly powerful for tasks like text generation, translation, summarization, and more.


13 of 48

Prerequisite Python packages

1. Transformers-related packages:

For PyTorch:

pip install transformers torch

For TensorFlow:

pip install transformers tensorflow

2. For handling biological sequences:

pip install biopython
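As a quick check that Biopython is working, sequences can be read from a FASTA file with SeqIO; the filename below is a placeholder for illustration.

# Minimal sketch: reading protein sequences from a FASTA file with Biopython.
# "proteins.fasta" is a placeholder filename.
from Bio import SeqIO

for record in SeqIO.parse("proteins.fasta", "fasta"):
    print(record.id, len(record.seq))
    print(str(record.seq)[:60])    # first 60 residues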


14 of 48

Session 2

Language modelling

Presenters: Bishnu Sarker, Sayane Shome

Date: 9 Nov, 2024

15 of 48

What is Language Modeling?

  • Goal: compute the probability of a sentence or sequence of words:

P(W) = P(w1,w2,w3,w4,w5…wn)

  • Related task: probability of an upcoming word:

P(w5|w1,w2,w3,w4)

  • A model that computes either of these, P(W) or P(wn|w1,w2…wn-1), is called a language model.


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

16 of 48

The Chain Rule applied to compute joint probability of words in a sentence.

P(“its water is so transparent”) = P(its) × P(water|its) × P(is|its water)

× P(so|its water is) × P(transparent|its water is so)


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

17 of 48

Markov Assumption

In other words, we approximate each component in the product
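Following the cited chapter, the approximation (in the notation used above) is:

P(wi|w1,w2,…,wi-1) ≈ P(wi|wi-k,…,wi-1)

For example, a bigram model approximates P(wi|w1,…,wi-1) ≈ P(wi|wi-1).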


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

18 of 48

Simplest case: Unigram model


fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass

thrift, did, eighty, said, hard, 'm, july, bullish

that, or, limited, the

Some automatically generated sentences from a Unigram model

Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

19 of 48

Bigram model


  • Condition on the previous word:

texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen

outside, new, car, parking, lot, of, the, agreement, reached

this, would, be, a, record, november

Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
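A toy bigram model can be built with plain Python counting. The sketch below uses a made-up two-sentence corpus and unsmoothed maximum-likelihood estimates.

# Minimal sketch: maximum-likelihood bigram model on a tiny made-up corpus (no smoothing).
from collections import Counter, defaultdict

corpus = [
    "<s> its water is so transparent </s>",
    "<s> the water is so clear </s>",
]

bigram_counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, word in zip(tokens, tokens[1:]):
        bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """P(word | prev) estimated by relative frequency."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("water", "is"))   # 1.0 in this toy corpus
print(bigram_prob("so", "clear"))   # 0.5: "so" is followed by "transparent" and "clear"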

20 of 48

N-gram models

  • We can extend to trigrams, 4-grams, 5-grams.
  • In general this is an insufficient model of language because language has long-distance dependencies:

“The computer which I had just put into the machine room on the fifth floor crashed.”

  • But we can often get away with N-gram models.


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

21 of 48

Neural Language Models (LMs)

  • Language Modeling: Calculating the probability of the next word in a sequence given some history.
    • We've seen N-gram based LMs
    • But neural network LMs far outperform n-gram language models
  • State-of-the-art neural LMs are based on more powerful neural network technology like Transformers
  • But simple feedforward LMs can do almost as well!


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

22 of 48

Simple feedforward Neural Language Models

  • Task: predict next word wt, given prior words wt-1, wt-2, wt-3, …
  • Problem: Now we’re dealing with sequences of arbitrary length.
  • Solution: Sliding windows (of fixed length)


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
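A minimal fixed-window feedforward language model might look like the following PyTorch sketch. The vocabulary size, window length, and layer dimensions are illustrative assumptions, not the exact model from the cited chapter.

# Minimal sketch of a fixed-window feedforward language model in PyTorch.
# Vocabulary size, window size, and dimensions are illustrative.
import torch
import torch.nn as nn

class FeedforwardLM(nn.Module):
    def __init__(self, vocab_size=10000, window=3, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.hidden = nn.Linear(window * emb_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, window) integer indices of the previous `window` words
        emb = self.embed(context_ids)            # (batch, window, emb_dim)
        flat = emb.view(emb.size(0), -1)         # concatenate the window embeddings
        h = torch.relu(self.hidden(flat))
        return self.out(h)                       # logits over the next-word vocabulary

model = FeedforwardLM()
logits = model(torch.randint(0, 10000, (2, 3)))  # batch of 2 contexts of 3 words
next_word_probs = torch.softmax(logits, dim=-1)
print(next_word_probs.shape)                      # torch.Size([2, 10000])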

23 of 48

Neural Language Model

Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

24 of 48

Why Neural LMs work better than N-gram LMs

  • Training data:
    • We've seen: I have to make sure that the cat gets fed.
    • Never seen: dog gets fed
  • Test data:
    • I forgot to make sure that the dog gets ___
    • N-gram LM can't predict "fed"!
    • Neural LM can use similarity of "cat" and "dog" embeddings to generalize and predict “fed” after dog


Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/

25 of 48

Word2Vec Neural Language Model


  • word2vec learns low-dimensional numerical representations of words using a neural network.
  • It learns to predict the context (surrounding) words given a target word; the learned weights corresponding to each word (or k-mer) serve as its embedding.
  • In natural language processing, sentences are composed of words, and the spatial relations among words typically carry the meaning of the full sentence.
  • In the case of biological sequences, there is no natural notion of words.
  • K-mers, such as 3-mers, may serve as the words (see the sketch below).
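As an illustration of treating k-mers as words, the sketch below splits protein sequences into overlapping 3-mers and trains a skip-gram word2vec model with the gensim library; gensim is not among the listed prerequisites, and the sequences are made up.

# Minimal sketch: treating overlapping 3-mers of protein sequences as "words"
# and training a skip-gram word2vec model on them.
# Requires the gensim package (not in the tutorial prerequisites); sequences are made up.
from gensim.models import Word2Vec

def kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MSLNFLDFEQPIAELEAKIDSL"]
sentences = [kmers(s) for s in sequences]          # each sequence becomes a "sentence" of 3-mers

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, sg=1, epochs=20)
print(model.wv["MKT"][:5])                         # embedding vector for the 3-mer "MKT"
print(model.wv.most_similar("MKT", topn=3))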

26 of 48

Session 3

Unboxing Generative Language Models

Presenters: Bishnu Sarker, Sayane Shome

Date: 9 Nov, 2024

27 of 48

Building Large Language Models

Raschka, S. (2024). Build a Large Language Model (From Scratch). Simon and Schuster.

"Attention Is All You Need"

28 of 48

Transformer-based Large Language Model

29 of 48

Transformer-based Large Language Model


Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.

[Architecture diagram (Vaswani et al., 2017): the input passes through an Embedding Layer with Position Encoding; each encoder block contains a Multi-Head Attention Layer and a Feed Forward Layer; each decoder block contains a Multi-Head Attention Layer, an Encoder-Decoder Attention Layer, and a Feed Forward Neural Network; the output passes through a Linear Layer and a Softmax Layer.]

30 of 48

Transformers Self Attention Layer


Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
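The computation illustrated here is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V: each position's query is compared against all keys, the scores are softmax-normalized, and the resulting weights combine the value vectors. Below is a minimal PyTorch sketch on random toy tensors; the dimensions are illustrative and the projection weights are random rather than learned.

# Minimal sketch of scaled dot-product self-attention on toy tensors.
# Dimensions are illustrative; weights are random instead of learned.
import math
import torch

seq_len, d_model, d_k = 5, 16, 8
x = torch.randn(seq_len, d_model)          # one sequence of 5 token embeddings

W_q = torch.randn(d_model, d_k)            # query, key, and value projection matrices
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / math.sqrt(d_k)          # (seq_len, seq_len) pairwise attention scores
weights = torch.softmax(scores, dim=-1)    # each row sums to 1
output = weights @ V                       # weighted sum of value vectors

print(weights.shape, output.shape)         # torch.Size([5, 5]) torch.Size([5, 8])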

31 of 48

Multi-Head Attention


Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
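Multi-head attention runs several such attention computations in parallel and concatenates their outputs. As a quick illustration (not the implementation behind the figure), PyTorch's built-in module can be used directly; embed_dim and num_heads below are illustrative choices.

# Minimal sketch: multi-head self-attention with PyTorch's built-in module.
# embed_dim and num_heads are illustrative choices.
import torch
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=16, num_heads=4, batch_first=True)

x = torch.randn(2, 5, 16)                  # (batch, seq_len, embed_dim)
out, attn_weights = mha(x, x, x)           # self-attention: query = key = value = x
print(out.shape)                            # torch.Size([2, 5, 16])
print(attn_weights.shape)                   # torch.Size([2, 5, 5]), averaged over heads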

32 of 48

Attention in Sequence Analysis

Elnaggar, Ahmed, et al. "Prottrans: Toward understanding the language of life through self-supervised learning." IEEE transactions on pattern analysis and machine intelligence 44.10 (2021): 7112-7127. https://github.com/agemagician/ProtTrans

33 of 48

ProtTrans Architecture


Elnaggar, Ahmed, et al. "Prottrans: Toward understanding the language of life through self-supervised learning." IEEE transactions on pattern analysis and machine intelligence 44.10 (2021): 7112-7127.
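As an illustration of using such a pretrained protein language model, the sketch below loads a ProtTrans checkpoint from the Hugging Face hub and extracts per-residue embeddings. The checkpoint name Rostlab/prot_bert follows the ProtTrans repository linked above, the sequence is made up, and ProtBert-style tokenizers expect residues separated by spaces.

# Minimal sketch: per-residue embeddings from a ProtTrans model via Hugging Face.
# Model name follows the ProtTrans repository (Rostlab/prot_bert); the sequence is made up.
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "Rostlab/prot_bert"
tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=False)
model = AutoModel.from_pretrained(model_name)
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
spaced = " ".join(sequence)                 # ProtBert expects space-separated residues

inputs = tokenizer(spaced, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

embeddings = outputs.last_hidden_state      # (1, seq_len + special tokens, hidden_size)
print(embeddings.shape)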

34 of 48

Hands-on Tutorial

Google Colab notebook

For downloading all notebook-related materials: https://tinyurl.com/yeyyfskh

35 of 48

Session 4

Hands-on case studies on applications of transformers in protein sequence analysis

Presenters: Bishnu Sarker, Sayane Shome

Date: 9 Nov, 2024

36 of 48

Learning Objectives

To extend the concepts learned in previous sessions to practical applications such as protein function prediction.


37 of 48

Problem Definition

Given a protein sequence of length L, the objective is to assign functional terms such as Gene Ontology (GO) terms or Enzyme Commission (EC) numbers.

  • The Gene Ontology (GO) is a standardized system that assigns functional terms to genes and gene products based on their known or predicted molecular functions, biological processes, and cellular components.
  • Enzyme Commission (EC) numbers are a classification system used to categorize enzymes based on the reactions they catalyze. The EC number provides a unique identifier for each enzyme and is widely used in biochemistry and molecular biology.
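Because a single protein can be annotated with several GO terms at once, the prediction task is naturally framed as multi-label classification. As a small illustration (assuming scikit-learn; the GO IDs below are arbitrary placeholders, not curated annotations), the labels can be binarized as follows.

# Minimal sketch: encoding per-protein GO annotations as a multi-label target matrix.
# The GO IDs here are arbitrary placeholders, not curated annotations.
from sklearn.preprocessing import MultiLabelBinarizer

protein_go_terms = [
    ["GO:0000001", "GO:0000002"],   # protein 1 annotated with two GO terms
    ["GO:0000003"],                 # protein 2 annotated with one GO term
    ["GO:0000001", "GO:0000003"],
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(protein_go_terms)
print(mlb.classes_)                 # column order of the binary matrix
print(Y)                            # shape: (n_proteins, n_distinct_GO_terms)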


38 of 48

Gene Ontologies


39 of 48

Background

Manual Annotation (by curators)

40 of 48

Background

Automatic Annotation


41 of 48

Protein Function Annotation

Input Data and Data Sources


42 of 48

Protein Function Annotation

Output Data and Data Sources


43 of 48

Protein Function Annotation

Approach


  1. Obtaining the protein sequence dataset and associated GO IDs/EC IDs from UniProt
  2. Obtaining pretrained embeddings for the protein sequence dataset
  3. Using ML models for classifying the sequences with the GO IDs/EC IDs
  4. Evaluating ML model performance using metrics
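To make the flow concrete, here is a minimal sketch of the classification and evaluation steps using scikit-learn; random arrays stand in for the real embeddings and labels, and the choice of classifier and metrics is illustrative.

# Minimal sketch of the classification and evaluation steps, using random arrays
# in place of real protein embeddings and labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1024))            # stand-in for per-protein embeddings
y = rng.integers(0, 2, size=200)            # stand-in for a binary functional label

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1:", f1_score(y_test, y_pred))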

44 of 48

Protein Function Annotation

Future Challenges


  • Explainability
  • Computational Cost
  • Multi-omics Integration

45 of 48

Hands-on Tutorial

Google Colab notebook

Link: https://tinyurl.com/yeyyfskh

  1. Loading Transformer Model

1-Loading-Pre-Trainied-Transformers-From-HF.ipynb

  2. Using Embedding for Protein Function Prediction

2-Case_Study1-Transformer-Protein-Sequence-Classification.ipynb

  3. Fine-Tuned ESM for Protein Function Prediction

3-ESM2-Protein-Function-Prediction
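For the third notebook, a minimal fine-tuning skeleton might look like the sketch below; the ESM-2 checkpoint name (facebook/esm2_t6_8M_UR50D), the made-up sequences, and the two-class setup are illustrative assumptions and may differ from the notebook's actual configuration.

# Minimal sketch: attaching a classification head to an ESM-2 checkpoint for fine-tuning.
# Checkpoint name, sequences, and the 2-class setup are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "facebook/esm2_t6_8M_UR50D"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

sequences = ["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", "MSLNFLDFEQPIAELEAKIDSL"]
labels = torch.tensor([0, 1])

batch = tokenizer(sequences, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
print(outputs.loss)                 # cross-entropy loss; backpropagate inside a training loop
print(outputs.logits.shape)         # (2, num_labels)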


46 of 48

Q&A and Final Remarks!


47 of 48

Acknowledgements

  • Participants!
  • AMIA 2024 Tutorial committee chairs and reviewers
  • Meharry Medical College, Tennessee, USA
  • Stanford University, California, USA
  • National Science Foundation Grant #2302637 for supporting the travel.


48 of 48

Thank you for joining us!

For any correspondence regarding questions about the materials and related topics:

  • Bishnu Sarker (bsarker@mmc.edu)
  • Sayane Shome (sshome@stanford.edu)
