SPEAKER DISCLOSURE
We have not had any relationships with ACCME-defined ineligible companies within the past 24 months.
AMIA 2024 W18
Applying Generative Language Models for Biological Sequence Analysis
Presenters: Bishnu Sarker, Sayane Shome
Date: 9 November, 2024
Organizing Team and Speakers
Sayane Shome, Ph.D.
Postdoctoral Fellow
Anesthesia and Pediatrics,
Stanford University School of Medicine,
California, USA
Bishnu Sarker, Ph.D.
Assistant Professor
Computer Science and Data Science,
Meharry School of Applied Computational Sciences,
Tennessee, USA
Learning Objectives
Educational Objectives: By the end of the tutorial, participants will:
Link to all presentation materials/tutorials: https://sites.google.com/view/bishnusarker/invited-talks-and-tutorials/AMIA-2024-W18
Tutorial Agenda: Saturday, Nov 9, 2024 (1:00 – 4:30 PM)
Schedule | Topics covered
1:00-1:20 PM | Session 1: Warm-up and Refresher
1:20-1:50 PM | Session 2: General Introduction to Language Models
1:50-2:30 PM | Session 3: Unboxing Generative Language Models
2:30-3:00 PM | Short Break and Q&A
3:00-4:00 PM | Session 4: Hands-on case studies on applications of transformers in protein sequence analysis
4:00-4:30 PM | Q&A and Closing Remarks
Session 1
Warm-up and Refresher
Presenters: Bishnu Sarker, Sayane Shome
Date: 9 Nov, 2024
Proteins are the building blocks of life
Amino acids and Protein structures
Protein Domains and Protein Function
Biological sequences are strings,
and string similarity is at the heart of sequence analysis.
Given two sequences:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
finding their similarity means aligning them:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
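Below is a minimal sketch (not part of the slides) of how such an alignment can be computed with Biopython's PairwiseAligner; the scoring values are illustrative choices, not the settings used in the tutorial.

# Align the two example DNA sequences from the slide (assumes biopython is installed).
from Bio import Align

seq1 = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
seq2 = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"

aligner = Align.PairwiseAligner()
aligner.mode = "global"          # Needleman-Wunsch-style global alignment
aligner.match_score = 1          # reward for matching characters
aligner.mismatch_score = -1      # penalty for mismatches
aligner.open_gap_score = -1      # penalty for opening a gap
aligner.extend_gap_score = -0.5  # penalty for extending a gap

alignments = aligner.align(seq1, seq2)
best = alignments[0]
print(f"Alignment score: {best.score}")
print(best)                      # prints the aligned sequences with gaps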
Introduction to Language Models and Transformers
Prerequisite Python packages
1. Transformers-related packages:
For PyTorch: pip install transformers torch
For TensorFlow: pip install transformers tensorflow
2. For handling biological sequences:
pip install biopython
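A quick sanity check (assumed, not from the slides) that the prerequisite packages import correctly before the hands-on sessions:

# Verify the tutorial prerequisites are importable and report their versions.
import torch
import transformers
import Bio

print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("biopython:", Bio.__version__)
print("GPU available:", torch.cuda.is_available())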
Session 2
Language Modeling
Presenters: Bishnu Sarker, Sayane Shome
Date: 9 Nov, 2024
What is Language Modeling?
Goal: compute the probability of a sentence or sequence of words:
P(W) = P(w1, w2, w3, w4, w5, …, wn)
Related task: compute the probability of an upcoming word given its history:
P(w5 | w1, w2, w3, w4)
A model that computes either P(W) or P(wn | w1, w2, …, wn-1) is called a language model.
Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
The chain rule applied to compute the joint probability of words in a sentence:
P("its water is so transparent") = P(its) × P(water | its) × P(is | its water)
× P(so | its water is) × P(transparent | its water is so)
Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
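To make the decomposition concrete, here is a toy calculation of the product above; the conditional probability values are made up purely for illustration.

# Multiply illustrative conditional probabilities along the chain rule.
cond_probs = {
    "P(its)": 0.05,
    "P(water|its)": 0.10,
    "P(is|its water)": 0.40,
    "P(so|its water is)": 0.05,
    "P(transparent|its water is so)": 0.02,
}

p_sentence = 1.0
for term, p in cond_probs.items():
    p_sentence *= p
    print(f"{term} = {p}")

print(f"P('its water is so transparent') = {p_sentence:.2e}")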
Markov Assumption
We approximate each component in the product using only the last few preceding words:
P(wn | w1, w2, …, wn-1) ≈ P(wn | wn-k, …, wn-1)
Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
Simplest case: Unigram model
Some sentences automatically generated from a unigram model:
fifth, an, of, futures, the, an, incorporated, a, a, the, inflation, most, dollars, quarter, in, is, mass
thrift, did, eighty, said, hard, 'm, july, bullish
that, or, limited, the
Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
Bigram model
Some sentences automatically generated from a bigram model:
texaco, rose, one, in, this, issue, is, pursuing, growth, in, a, boiler, house, said, mr., gurria, mexico, 's, motion, control, proposal, without, permission, from, five, hundred, fifty, five, yen
outside, new, car, parking, lot, of, the, agreement, reached
this, would, be, a, record, november
Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
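As an illustration of how such sentences can be generated, here is a minimal sketch (not the slides' code): it estimates bigram counts from a small toy corpus and samples one word at a time from P(wn | wn-1).

# Build bigram counts from a toy corpus and sample a sentence from them.
import random
from collections import defaultdict, Counter

corpus = [
    "<s> this would be a record november </s>",
    "<s> this issue is pursuing growth </s>",
    "<s> texaco rose one in this issue </s>",
]

bigrams = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, curr in zip(words, words[1:]):
        bigrams[prev][curr] += 1

# Sample words one by one until the end-of-sentence token is drawn.
word, generated = "<s>", []
while word != "</s>":
    candidates = bigrams[word]
    word = random.choices(list(candidates), weights=candidates.values())[0]
    if word != "</s>":
        generated.append(word)

print(" ".join(generated))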
N-gram models
We can extend to trigrams, 4-grams, 5-grams, and so on. In general, however, N-grams are an insufficient model of language because language has long-distance dependencies:
"The computer which I had just put into the machine room on the fifth floor crashed."
Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
Neural Language Models (LMs)
Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
Simple feedforward Neural Language Models
Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
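To make the feedforward architecture concrete, here is a minimal PyTorch sketch (assumed, not the slides' code): embed a fixed window of previous words, concatenate the embeddings, apply a hidden layer, and predict the next word over the vocabulary.

# A toy feedforward neural language model over a fixed context window.
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    def __init__(self, vocab_size, context_size=3, embed_dim=32, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, context_ids):
        # context_ids: (batch, context_size) integer word indices
        emb = self.embed(context_ids)            # (batch, context, embed_dim)
        emb = emb.view(emb.size(0), -1)          # concatenate the context embeddings
        h = torch.tanh(self.hidden(emb))
        return self.out(h)                       # logits over the vocabulary

model = FeedForwardLM(vocab_size=100)
logits = model(torch.randint(0, 100, (4, 3)))    # batch of 4 contexts of 3 words each
next_word_probs = torch.softmax(logits, dim=-1)
print(next_word_probs.shape)                     # torch.Size([4, 100])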
Neural Language Model
Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
Why Neural LMs work better than N-gram LMs
Neural LMs represent words as dense embeddings, so they can generalize to contexts never seen in training, whereas N-gram LMs must fall back on smoothing and back-off for unseen word sequences.
Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
Word2Vec Neural Language Model
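A minimal sketch of training Word2Vec embeddings on a toy corpus; this uses the gensim library, which is an assumption here (it is not in the tutorial's prerequisite list).

# Train skip-gram Word2Vec embeddings on a tiny corpus and inspect them.
from gensim.models import Word2Vec

sentences = [
    ["its", "water", "is", "so", "transparent"],
    ["the", "water", "is", "clear"],
]

# sg=1 selects skip-gram; small vector size is enough for a toy corpus
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["water"][:5])            # first 5 dimensions of the embedding
print(model.wv.most_similar("water"))   # nearest neighbours in embedding space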
Session 3
Unboxing Generative Language Models
Presenters: Bishnu Sarker, Sayane Shome
Date: 9 Nov, 2024
Building Large Language Models
Raschka, S. (2024). Build a Large Language Model (From Scratch). Simon and Schuster.
"Attention Is All You Need"
Transformer-based Large Language Model
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Components of the Transformer architecture:
Encoder: Embedding Layer, Position Encoding, Multi-Head Attention Layer, Feed-Forward Layer
Decoder: Multi-Head Attention Layer, Encoder-Decoder Attention Layer, Feed-Forward Neural Network, Linear Layer, Softmax Layer
Transformers Self Attention Layer
Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
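A minimal sketch of the scaled dot-product self-attention computed in this layer, softmax(QK^T / sqrt(d_k))V, with random tensors standing in for the learned projections; dimensions are illustrative.

# Compute self-attention for a short sequence of token embeddings.
import torch
import torch.nn.functional as F

d_model, d_k, seq_len = 16, 8, 5
x = torch.randn(seq_len, d_model)            # one token embedding per row

W_q = torch.randn(d_model, d_k)              # stand-ins for learned projections
W_k = torch.randn(d_model, d_k)
W_v = torch.randn(d_model, d_k)

Q, K, V = x @ W_q, x @ W_k, x @ W_v
scores = Q @ K.T / d_k ** 0.5                # how strongly each token attends to the others
weights = F.softmax(scores, dim=-1)          # each row sums to 1
output = weights @ V                         # attention-weighted mixture of values

print(weights.shape, output.shape)           # (5, 5) and (5, 8)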
Multi-Head Attention
Alammar, J (2018). The Illustrated Transformer [Blog post]. Retrieved from https://jalammar.github.io/illustrated-transformer/
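A minimal sketch of multi-head attention using PyTorch's built-in nn.MultiheadAttention: the same attention mechanism runs with several sets of projections ("heads") in parallel and the results are concatenated. The dimensions here are illustrative.

# Run self-attention with 4 heads over a toy batch.
import torch
import torch.nn as nn

d_model, num_heads, seq_len = 16, 4, 5
mha = nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)

x = torch.randn(1, seq_len, d_model)         # (batch, sequence, embedding)
output, attn_weights = mha(x, x, x)          # self-attention: query = key = value = x

print(output.shape)                          # torch.Size([1, 5, 16])
print(attn_weights.shape)                    # (1, 5, 5), averaged over the 4 heads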
Attention in Sequence Analysis
Elnaggar, Ahmed, et al. "ProtTrans: Toward understanding the language of life through self-supervised learning." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.10 (2021): 7112-7127. https://github.com/agemagician/ProtTrans
ProtTrans Architecture
Elnaggar, Ahmed, et al. "ProtTrans: Toward understanding the language of life through self-supervised learning." IEEE Transactions on Pattern Analysis and Machine Intelligence 44.10 (2021): 7112-7127.
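As a minimal sketch of how a ProtTrans model can be used in practice, the snippet below embeds a toy protein sequence with the publicly available Rostlab/prot_bert checkpoint from the Hugging Face Hub (ProtBert expects amino acids separated by spaces). The sequence and the mean-pooling choice are illustrative, not the tutorial's exact workflow.

# Embed a protein sequence with ProtBert and mean-pool to a per-protein vector.
import re
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"        # toy protein sequence
sequence = re.sub(r"[UZOB]", "X", sequence)           # map rare amino acids to X
spaced = " ".join(sequence)                           # "M K T A Y ..."

inputs = tokenizer(spaced, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

per_residue = outputs.last_hidden_state               # (1, sequence length + special tokens, hidden size)
per_protein = per_residue.mean(dim=1)                 # one vector per protein
print(per_residue.shape, per_protein.shape)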
Hands-on Tutorial
Google Colab notebook
For downloading all notebook-related materials:
https://tinyurl.com/yeyyfskh
Session 4
Hands-on case studies on applications of transformers in protein sequence analysis
Presenters: Bishnu Sarker, Sayane Shome
Date: 9 Nov, 2024
Learning Objectives
To extend the concepts learned in the previous sessions to practical applications such as protein function prediction.
Problem Definition
Given a protein sequence of length L, the objective is to assign functional terms such as Gene Ontology (GO) terms or Enzyme Commission (EC) numbers.
Gene Ontologies
Background
Manual Annotation
Manual annotation of protein function is performed by expert curators.
Background
Automatic Annotation
Protein Function Annotation
Input Data and Data Sources
Protein Function Annotation
Output Data and Data Sources
Protein Function Annotation
Approach
1. Obtain a protein sequence dataset from UniProt together with the associated GO IDs/EC IDs.
2. Obtain pretrained embeddings for the protein sequences.
3. Use ML models to classify the sequences with the GO IDs/EC IDs.
4. Evaluate ML model performance using appropriate metrics.
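A minimal end-to-end sketch of these four steps with made-up data: mean-pooled transformer embeddings stand in as features, and a multi-label scikit-learn classifier predicts GO terms. Scikit-learn is an extra dependency (not in the prerequisite list), and all shapes and labels are illustrative only.

# Toy GO-term prediction: random "embeddings" + one-vs-rest logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

n_proteins, embed_dim, n_go_terms = 200, 1024, 5
X = np.random.randn(n_proteins, embed_dim)                  # pretrained embeddings (step 2)
y = np.random.randint(0, 2, size=(n_proteins, n_go_terms))  # GO-term indicator matrix (step 1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000))  # one binary classifier per GO term (step 3)
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)                                  # evaluate (step 4)
print("micro-F1:", f1_score(y_test, y_pred, average="micro"))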
Protein Function Annotation
Future Challenges
Explainability
Computational Cost
Multi-omics Integration
Hands-on Tutorial
Google Colab notebook
Link: https://tinyurl.com/yeyyfskh
1-Loading-Pre-Trainied-Transformers-From-HF.ipynb
2-Case_Study1-Transformer-Protein-Sequence-Classification.ipynb
3-ESM2-Protein-Function-Prediction
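As a preview of what the notebooks cover, here is a minimal sketch of loading a small public ESM-2 checkpoint (facebook/esm2_t6_8M_UR50D) from the Hugging Face Hub and embedding a toy sequence; see the notebooks above for the full workflows.

# Load a small ESM-2 model and compute per-residue embeddings for one sequence.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t6_8M_UR50D")
model = AutoModel.from_pretrained("facebook/esm2_t6_8M_UR50D")

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # toy protein sequence
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state

print(embeddings.shape)   # (1, sequence length + special tokens, hidden size)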
Q&A and Final Remarks!
Acknowledgements
Thank you for joining us!
For any correspondence regarding questions about the materials and related topics: