Introduction to biological sequence analysis using Deep Learning in Python
Presenters: Bishnu Sarker, Sayane Shome
Date: 17-18 July, 2023
Learning Objectives of the session
Understand the high-level concepts behind sequence similarity, including k-mers, and how natural language processing techniques can be applied to protein sequences.
Sequence similarity
When sequences are of equal length
Hamming Distance
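A minimal sketch of the Hamming distance for two equal-length sequences. The function name and the example strings are illustrative assumptions, not taken from the slide.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


# Illustrative strings: they differ at the 3rd and 6th positions.
print(hamming_distance("GATTACA", "GACTATA"))  # 2
```

Because the comparison is position by position, a single insertion or deletion shifts every later position, which is why variable-length sequences need an edit distance instead.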
https://www.thedatanotes.com/post/dynamic-programming-sequence-alignment/
Sequence similarity
Variable Length Strings
d: Deletion, s: Substitution, and i: Insertion
If each operation has a cost of 1, the distance between the two strings is 5.
If substitutions cost 2, the distance between them is 8.
Levenshtein Distance
Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
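The edit distance can be sketched with the standard dynamic-programming recurrence. The word pair "intention"/"execution" is the example used in the cited Jurafsky and Martin chapter; it is assumed here to be the pair shown on the slide. The function name and the `sub_cost` parameter are illustrative.

```python
def edit_distance(source: str, target: str, sub_cost: int = 1) -> int:
    """Minimum edit cost; insertions and deletions cost 1 each."""
    # prev[j] holds the cost of turning the processed prefix of source
    # into the first j characters of target.
    prev = list(range(len(target) + 1))
    for i, s in enumerate(source, start=1):
        curr = [i]  # deleting the first i characters of source
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1,                                 # deletion
                curr[j - 1] + 1,                             # insertion
                prev[j - 1] + (0 if s == t else sub_cost),   # match or substitution
            ))
        prev = curr
    return prev[-1]


print(edit_distance("intention", "execution"))              # 5
print(edit_distance("intention", "execution", sub_cost=2))  # 8
```

Only two rows of the table are kept in memory, which is enough when the distance alone is needed and no alignment has to be traced back.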
Biological Sequences Are Strings
Given two sequences:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
Find the sequence similarity, or align them:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
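One common way to align two such sequences is Needleman-Wunsch global alignment. The slide does not say which method or scoring scheme produced the alignment shown, so the sketch below is an assumption: the scores (match +1, mismatch -1, gap -1) are illustrative, and the resulting alignment need not match the one on the slide.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair_score(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the table: each cell is the best of diagonal, up, and left moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


seq_a = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
seq_b = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"
total, aligned_a, aligned_b = needleman_wunsch(seq_a, seq_b)
print(total)
print(aligned_a)
print(aligned_b)
```

Several alignments can share the same optimal score; the traceback above returns one of them, preferring diagonal moves when there is a tie.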
K-mers based similarity
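A minimal sketch of alignment-free similarity: split each sequence into overlapping k-mers and compare the resulting sets. Jaccard similarity is used here as one possible choice; the slide does not name a specific measure, and the function names and `k=3` default are assumptions.

```python
def kmers(seq: str, k: int) -> list[str]:
    """All overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def jaccard_similarity(a: str, b: str, k: int = 3) -> float:
    """Shared k-mers divided by all distinct k-mers across both sequences."""
    set_a, set_b = set(kmers(a, k)), set(kmers(b, k))
    if not set_a and not set_b:
        return 1.0  # both sequences are shorter than k
    return len(set_a & set_b) / len(set_a | set_b)


print(kmers("ACGTA", 3))  # ['ACG', 'CGT', 'GTA']
print(jaccard_similarity(
    "AGGCTATCACCTGACCTCCAGGCCGATGCCC",
    "TAGCTATCACGACCGCGGTCGATTTGCCCGAC",
))
```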
Word2Vec using K-mer words
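The data preparation behind this idea can be sketched as follows: each sequence becomes a "sentence" of overlapping k-mers, and skip-gram training pairs are formed from k-mers that fall within a context window. The function names, `k=3`, `window=2`, and the example peptide are assumptions for illustration. Training the embeddings themselves would normally be delegated to a library and is not shown.

```python
def kmer_sentence(seq: str, k: int = 3) -> list[str]:
    """Overlapping k-mers, in order, treated as the words of one sentence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def skipgram_pairs(sentence: list[str], window: int = 2) -> list[tuple[str, str]]:
    """(center, context) pairs for every word and its neighbours in the window."""
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs


sentence = kmer_sentence("MKTAYIAK", k=3)  # illustrative peptide
print(sentence)
print(skipgram_pairs(sentence, window=1)[:4])
```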
Word2Vec embeddings using Protein Domains
Problem 1: A protein sequence has no notion of a word; it is a continuous string of amino acid symbols.
Solution 1: Split the sequence into arbitrary short words (k-mers).
Problem 2: Unlike words, k-mers are not biologically significant units.
Solution 2: Decompose the sequence into evolutionarily conserved domains/motifs.
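Solution 2 can be sketched as turning domain annotations into an ordered "sentence" of domain identifiers. The `DomainHit` record, the identifiers `DOM_A`, `DOM_B`, `DOM_C`, and the coordinates are hypothetical placeholders; in practice the hits would come from a domain annotation tool or database.

```python
from typing import NamedTuple


class DomainHit(NamedTuple):
    """One annotated domain on a protein (hypothetical record layout)."""
    domain_id: str
    start: int
    end: int


def domain_sentence(hits: list[DomainHit]) -> list[str]:
    """Domain identifiers ordered by their start position on the sequence."""
    return [hit.domain_id for hit in sorted(hits, key=lambda h: h.start)]


# Placeholder annotations for a single protein, given in arbitrary order.
hits = [
    DomainHit("DOM_B", 120, 200),
    DomainHit("DOM_A", 5, 90),
    DomainHit("DOM_C", 210, 300),
]
print(domain_sentence(hits))  # ['DOM_A', 'DOM_B', 'DOM_C']
```

Each such sentence plays the role that a k-mer sentence plays in the k-mer approach, with domains as the vocabulary.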
Word2Vec embeddings using Protein Domains
Sequence to Function prediction using Word2Vec
Sarker, B., Ritchie, D. W., & Aridhi, S. (2019, September). Functional annotation of proteins using domain embedding based sequence classification. In KDIR 2019-11th International Conference on Knowledge Discovery and Information Retrieval (pp. 163-170).
Limitation: this approach does not capture ordered, long-range dependencies among k-mers and domains.
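The loss of order can be illustrated with a toy example. Assume, purely for illustration, that a sequence vector is formed by averaging its word embeddings; the slide does not state the pooling used in the cited work, and the two-dimensional vectors below are made up. Under that assumption, any reordering of the same words yields the same vector.

```python
def mean_vector(words, embeddings):
    """Average the embedding vectors of the given words."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for word in words:
        for d, value in enumerate(embeddings[word]):
            total[d] += value
    return [t / len(words) for t in total]


# Made-up 2-dimensional embeddings for three k-mers.
toy_embeddings = {"MKT": [1.0, 0.0], "KTA": [0.0, 1.0], "TAY": [1.0, 1.0]}

original = ["MKT", "KTA", "TAY"]
reordered = ["TAY", "MKT", "KTA"]

# Both orderings collapse to the same vector, so word order is lost.
print(mean_vector(original, toy_embeddings) == mean_vector(reordered, toy_embeddings))  # True
```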
Recurrent Neural Networks (RNN)
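A minimal sketch of a vanilla (Elman) RNN forward pass in plain Python, to show how the hidden state carries order information from one symbol to the next. The weights, the hidden size of 2, and the four-letter one-hot alphabet are arbitrary assumptions; a real model would learn the weights and use a deep learning framework.

```python
import math

ALPHABET = "ACGT"  # illustrative alphabet


def one_hot(symbol: str) -> list[float]:
    return [1.0 if symbol == letter else 0.0 for letter in ALPHABET]


def rnn_forward(inputs, w_xh, w_hh, b_h):
    """Apply h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h) and return the final h."""
    hidden_size = len(b_h)
    h = [0.0] * hidden_size
    for x in inputs:
        # The comprehension reads the previous h; h is rebound only afterwards.
        h = [
            math.tanh(
                sum(w_xh[i][k] * x[k] for k in range(len(x)))
                + sum(w_hh[i][k] * h[k] for k in range(hidden_size))
                + b_h[i]
            )
            for i in range(hidden_size)
        ]
    return h


# Arbitrary fixed weights, chosen only for the demonstration.
W_XH = [[0.5, -0.5, 0.1, 0.0], [0.0, 0.3, -0.2, 0.4]]
W_HH = [[0.1, 0.2], [-0.3, 0.1]]
B_H = [0.0, 0.0]

h_ac = rnn_forward([one_hot(s) for s in "AC"], W_XH, W_HH, B_H)
h_ca = rnn_forward([one_hot(s) for s in "CA"], W_XH, W_HH, B_H)
print(h_ac)
print(h_ca)  # differs from h_ac: the same symbols in another order
```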
Hands-on tutorial on RNNs
Google Colab notebook
Link: Colab-Notebook-RNN
Long Short-Term Memory (LSTM)
https://upload.wikimedia.org/wikipedia/commons/1/17/The_LSTM_Cell.svg
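A single LSTM cell step can be sketched in plain Python using the standard gate equations (input, forget, and output gates plus a candidate cell update). The `params` layout and all names are assumptions made for this illustration; frameworks pack these weights differently.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. params maps 'i', 'f', 'o', 'g' to (W_x, W_h, b)."""
    def affine(gate):
        w_x, w_h, b = params[gate]
        return [
            sum(w * xi for w, xi in zip(w_x[r], x))
            + sum(w * hi for w, hi in zip(w_h[r], h_prev))
            + b[r]
            for r in range(len(b))
        ]

    i = [sigmoid(z) for z in affine("i")]    # input gate
    f = [sigmoid(z) for z in affine("f")]    # forget gate
    o = [sigmoid(z) for z in affine("o")]    # output gate
    g = [math.tanh(z) for z in affine("g")]  # candidate cell values

    # New cell state: keep part of the old memory, add part of the candidate.
    c = [f_r * c_r + i_r * g_r for f_r, c_r, i_r, g_r in zip(f, c_prev, i, g)]
    # New hidden state: gated view of the squashed cell state.
    h = [o_r * math.tanh(c_r) for o_r, c_r in zip(o, c)]
    return h, c


# All-zero parameters (hidden size 1, input size 2), only for the demonstration.
zero = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {gate: zero for gate in "ifog"}
h, c = lstm_step([1.0, 0.0], h_prev=[0.0], c_prev=[2.0], params=params)
print(h, c)
```

With all-zero parameters every gate equals 0.5 and the candidate is 0, so the cell state is simply halved. The separate cell state is what lets an LSTM carry information across many steps.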
Hands-on tutorial on LSTMs
Google Colab notebook
Link: Colab-Notebook-LSTM
Transformers
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
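The core operation of the Transformer in the cited paper is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V. Below is a minimal plain-Python sketch for small lists of vectors; the function names and the toy inputs are assumptions. It omits multi-head projections, masking, and positional encodings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention; returns one output vector per query."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


# Toy input: both keys are identical, so the query attends to them equally
# and the output is the average of the two value vectors.
out = attention(
    queries=[[1.0, 0.0]],
    keys=[[1.0, 1.0], [1.0, 1.0]],
    values=[[0.0, 2.0], [4.0, 6.0]],
)
print(out)  # [[2.0, 4.0]]
```

Every position can attend to every other position in one step, which is how attention sidesteps the step-by-step propagation that limits recurrent models on long-range dependencies.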