1 of 14

Introduction to biological sequence analysis using Deep Learning in Python

Presenters : Bishnu Sarker, Sayane Shome

Date: 17-18 July, 2023

2 of 14

Learning Objectives of the session

Gain an understanding of high-level concepts related to sequence similarity, including k-mers, and see how natural language processing techniques can be applied to protein sequences.


3 of 14

Sequence similarity

When sequences are of equal length

Hamming Distance


  • If two strings are of equal length, the Hamming distance counts the number of mismatches.
  • It counts the letters that differ at corresponding positions of the two strings (a short Python sketch follows below).
  • It does not consider similarity from a semantic perspective.

https://www.thedatanotes.com/post/dynamic-programming-sequence-alignment/
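
A minimal Python sketch of Hamming distance; the example strings are prefixes of the two sequences shown on a later slide:

def hamming_distance(s1: str, s2: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires equal-length strings")
    return sum(a != b for a, b in zip(s1, s2))

# Two DNA fragments of the same length
print(hamming_distance("AGGCTATCAC", "TAGCTATCAC"))  # -> 2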

4 of 14

Sequence similarity

Variable Length Strings

d: Deletion, s: Substitution, and i: Insertion

If each operation has a cost of 1, the distance between the two strings is 5.

If substitutions cost 2, the distance between them is 8.


  • Edit distance computes the minimum number of edits required to transform one string into the other (a dynamic-programming sketch follows below).
  • The allowed operations are insertion, deletion, and substitution.
  • Each operation has a fixed cost; the costs add up to give the distance, or dissimilarity, between the two strings.
  • Like Hamming distance, it considers only syntactic distance.
  • It does not provide any insight into semantic similarity.

Levenshtein Distance

Jurafsky, D. and Martin, J.H. (2023) Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/
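
A dynamic-programming sketch of edit distance. The intention/execution pair is the classic example from the cited Jurafsky & Martin chapter (assumed to be the example in the figure above), and sub_cost is an illustrative parameter for charging substitutions 1 or 2:

def edit_distance(s1: str, s2: str, sub_cost: int = 1) -> int:
    """Minimum total cost of insertions, deletions (cost 1 each) and
    substitutions (cost sub_cost) needed to turn s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] = distance between s1[:i] and s2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else sub_cost
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    return dp[m][n]

print(edit_distance("intention", "execution"))              # 5 with unit costs
print(edit_distance("intention", "execution", sub_cost=2))  # 8 when substitutions cost 2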

5 of 14

Biological Sequences Are Strings


AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC

Given two sequences

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Find the sequence similarity, or align them (see the sketch below).
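
One way to align these two sequences in Python is Biopython's pairwise aligner. This is a minimal sketch, assuming Biopython is installed; the scoring values are illustrative and not taken from the slides:

from Bio import Align  # assumes Biopython is installed (pip install biopython)

seq1 = "AGGCTATCACCTGACCTCCAGGCCGATGCCC"
seq2 = "TAGCTATCACGACCGCGGTCGATTTGCCCGAC"

aligner = Align.PairwiseAligner()
aligner.mode = "global"
# Illustrative scoring scheme
aligner.match_score = 1
aligner.mismatch_score = -1
aligner.open_gap_score = -1
aligner.extend_gap_score = -0.5

alignments = aligner.align(seq1, seq2)
best = alignments[0]
print(best.score)
print(best)  # prints the aligned sequences with gaps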

6 of 14

K-mers based similarity


  • Edit distance is computationally expensive for long sequences.
  • K-mer splitting breaks the sequence into equal, K-length subsequences.
  • This gives biological sequences a sentence-like structure, with K-mers serving as words.
  • Each sequence is then a set of K-mers.
  • It can be encoded as a fixed-length vector of 1s and 0s (see the sketch below).
  • Curse of dimensionality: 20^3 = 8000 possible 3-mers. Sparse!
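
A sketch of splitting a protein sequence into 3-mers and encoding it as a sparse binary presence vector; the toy sequence and the overlapping (step = 1) windowing are illustrative choices:

from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def kmers(seq: str, k: int = 3, step: int = 1):
    """Split a sequence into K-length subsequences.
    step=1 gives overlapping k-mers; step=k gives non-overlapping ones."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]

def presence_vector(seq: str, k: int = 3):
    """Fixed-length 0/1 vector over all 20**k possible k-mers (8000 for k=3)."""
    vocab = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    present = set(kmers(seq, k))
    return [1 if w in present else 0 for w in vocab]

protein = "MKTAYIAKQR"                 # toy sequence for illustration
print(kmers(protein, 3))               # ['MKT', 'KTA', 'TAY', ...]
vec = presence_vector(protein, 3)
print(len(vec), sum(vec))              # 8000 dimensions, only a handful are 1 -> sparse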

7 of 14

Word2Vec using K-mer words


  • word2vec learns a low-rank numerical representation of words using a neural network.
  • It learns to predict the context (surrounding) words given a target word; the weights learned for each k-mer serve as its embedding.
  • In natural language processing, sentences are composed of words, and the spatial relations among words typically carry the meaning of the full sentence.
  • In the case of biological sequences, there is no notion of words.
  • K-mers, such as 3-mers, may serve as the words (see the training sketch below).
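
A minimal sketch of training word2vec on k-mer "sentences" with gensim (assumed installed; gensim >= 4 uses the vector_size argument). The sequences here are toy examples:

from gensim.models import Word2Vec  # assumes gensim >= 4 is installed

# Toy corpus: each protein sequence becomes a sentence of overlapping 3-mers
sequences = ["MKTAYIAKQR", "MKTAHIAKQR", "GAVLIPFMWST"]
corpus = [[s[i:i + 3] for i in range(len(s) - 2)] for s in sequences]

# Skip-gram (sg=1): predict surrounding k-mers from a target k-mer
model = Word2Vec(sentences=corpus, vector_size=64, window=5,
                 min_count=1, sg=1, epochs=50)

vec = model.wv["MKT"]                        # 64-dimensional embedding of the 3-mer 'MKT'
print(vec.shape)
print(model.wv.most_similar("MKT", topn=3))  # nearest k-mers in embedding space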

8 of 14

Word2Vec embeddings using Protein Domains


Problem 1: There is no notion of a word in a protein sequence; it is a continuous string of amino acid symbols.

Solution 1: Split the sequence into arbitrary short words (K-mers).

Problem 2: Unlike natural-language words, K-mers are not biologically significant units.

Solution 2: Decompose the sequence into evolutionarily conserved domains/motifs.

9 of 14

Word2Vec embeddings using Protein Domains


Sequence to Function prediction using Word2vec

Sarker, B., Ritchie, D. W., & Aridhi, S. (2019, September). Functional annotation of proteins using domain embedding based sequence classification. In KDIR 2019-11th International Conference on Knowledge Discovery and Information Retrieval (pp. 163-170).

This approach does not consider the ordered, long-range dependencies among k-mers and domains.

10 of 14

Recurrent Neural Networks (RNN)


  • Predicts the next word/residue.
  • Models spatial dependency.
  • Takes the time signal into account.
  • Suffers from the vanishing gradient problem.
  • Fails to model very long dependencies (a minimal Keras sketch follows below).
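
A minimal Keras sketch of an RNN over residue indices; the vocabulary size, layer sizes, and next-residue task are illustrative, and the hands-on notebook may use a different framework or setup:

import tensorflow as tf

VOCAB_SIZE = 21   # 20 amino acids + 1 padding token (illustrative)

# Toy model: read a window of 100 residue indices and predict the next residue
model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),                      # fixed-length window of residue indices
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=32),
    tf.keras.layers.SimpleRNN(64),                     # plain recurrent layer; prone to vanishing gradients
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()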

11 of 14

Hands on Tutorial on RNNs

Google Colab notebook

12 of 14

Long Short-Term Memory (LSTM)


  • LSTM employs a more complex flow of information to overcome the vanishing gradient problem.
  • It is one of the most popular deep learning models for sequence modelling.
  • It has no attention mechanism with which to focus on different parts of the input (see the sketch below).

https://upload.wikimedia.org/wikipedia/commons/1/17/The_LSTM_Cell.svg
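
The same sketch with the recurrent layer swapped for an LSTM (again, the sizes are illustrative and the notebook may differ):

import tensorflow as tf

VOCAB_SIZE = 21   # 20 amino acids + 1 padding token (illustrative)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(100,)),                      # fixed-length window of residue indices
    tf.keras.layers.Embedding(input_dim=VOCAB_SIZE, output_dim=32),
    tf.keras.layers.LSTM(64),                          # gated cell; mitigates the vanishing gradient problem
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()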

13 of 14

Hands on Tutorial on LSTMs

Google Colab notebook

14 of 14

Transformers


  • The de facto sequence-model architecture; it stacks multiple identical encoders and decoders.
  • Each encoder consists of (1) an attention layer and (2) a feed-forward layer.
  • Through 8 heads, the attention layer attends to different parts of the input.
  • Each token is passed independently through the same position-wise feed-forward network.
  • The output of each encoder is passed up through the stack of encoders; the output of the top encoder is fed to the decoders.
  • The output from the top encoder can be used as embeddings (a sketch of one encoder block follows below).
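
A sketch of a single encoder block (multi-head self-attention followed by a position-wise feed-forward layer, each with a residual connection and layer normalization), written with Keras layers; the dimensions follow the paper's defaults, but the code is illustrative rather than a reference implementation:

import tensorflow as tf

def encoder_block(x, d_model=512, num_heads=8, d_ff=2048):
    """One encoder block: multi-head self-attention + position-wise feed-forward,
    each followed by a residual connection and layer normalization."""
    attn = tf.keras.layers.MultiHeadAttention(
        num_heads=num_heads, key_dim=d_model // num_heads)(x, x)   # self-attention over the input
    x = tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + attn)
    ff = tf.keras.layers.Dense(d_ff, activation="relu")(x)          # applied to every token position
    ff = tf.keras.layers.Dense(d_model)(ff)
    return tf.keras.layers.LayerNormalization(epsilon=1e-6)(x + ff)

# Toy input: a batch of 2 "sequences" of 10 tokens, already embedded to d_model dimensions
tokens = tf.random.normal((2, 10, 512))
out = encoder_block(tokens)      # stack this block several times for a full encoder
print(out.shape)                 # (2, 10, 512); the top encoder's output serves as embeddings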

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.