1 of 22

Case Study 1: Protein Function Annotation

Presenters: Bishnu Sarker, Sayane Shome

Date: 17-18 July, 2023

2 of 22

Learning Objectives for the Next Two Sessions

To extend the concepts from the previous sessions to practical applications such as protein function prediction and metal-binding site prediction in proteins.

3 of 22

Problem Definition

Given a protein sequence of length L, the objective is to assign functional terms such as Gene Ontology (GO) terms or Enzyme Commission (EC) numbers.

  • The Gene Ontology (GO) is a standardized vocabulary that assigns functional terms to genes and gene products based on their known or predicted molecular functions, biological processes, and cellular components.
  • Enzyme Commission (EC) numbers are a classification system that categorizes enzymes by the reactions they catalyze. An EC number is a four-level hierarchical code identifying the catalyzed reaction, so different enzymes that catalyze the same reaction share the same EC number. EC numbers are widely used in biochemistry and molecular biology.
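
As a rough illustration of how these annotations can become machine-learning targets, the sketch below splits an EC number into its four hierarchical levels and turns per-protein GO term sets into multi-hot label vectors. The helper names (`ec_levels`, `multi_hot`) and the toy protein-to-GO assignments are hypothetical, chosen only for this example; the identifiers themselves are real (EC 1.1.1.1 is alcohol dehydrogenase; GO:0003824 is catalytic activity, GO:0005515 protein binding, GO:0005634 nucleus).

```python
def ec_levels(ec_number):
    """Return the cumulative prefixes of an EC number, from class to full code."""
    parts = ec_number.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec_number!r}")
    return [".".join(parts[: i + 1]) for i in range(4)]


def multi_hot(annotations):
    """Encode each protein's GO term set as a 0/1 vector over a sorted vocabulary."""
    vocab = sorted({term for terms in annotations.values() for term in terms})
    vectors = {
        protein: [1 if term in terms else 0 for term in vocab]
        for protein, terms in annotations.items()
    }
    return vocab, vectors


# Toy, made-up protein-to-GO assignments (the GO IDs are real terms).
toy_annotations = {
    "protein_A": {"GO:0003824", "GO:0005515"},
    "protein_B": {"GO:0005634"},
}

vocab, vectors = multi_hot(toy_annotations)
print(ec_levels("1.1.1.1"))
print(vocab)
print(vectors["protein_A"])
```

Because a protein can carry several GO terms at once, the multi-hot form frames GO annotation as a multi-label problem, whereas predicting a single EC number can be treated level by level along the hierarchy.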

4 of 22

Gene Ontologies

5 of 22

Background

Manual Annotation

Curators

6 of 22

Background

Automatic Annotation

7 of 22

Protein Function Annotation

Input Data and Data Sources

8 of 22

Protein Function Annotation

Output Data and Data Sources

9 of 22

Protein Function Annotation

Approach

  • Obtaining the protein sequence dataset from UniProt and the associated GO IDs/EC IDs
  • Obtaining pretrained embeddings for the protein sequence dataset from UniProt
  • Using ML models to classify the sequences with the GO IDs/EC IDs
  • Evaluating ML model performance using metrics
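
The workflow on this slide can be sketched end to end on toy data. This is only a minimal illustration under stated assumptions, not the tutorial notebook's code: a 20-dimensional amino-acid composition vector stands in for a pretrained embedding, 1-nearest-neighbour label transfer stands in for the ML model, and micro-averaged F1 is one possible evaluation metric. All function names, sequences, and label assignments are made up for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_embedding(sequence):
    """Stand-in for a pretrained embedding: amino-acid frequency vector."""
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1
    return [c / total for c in counts]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_labels(query, train_set):
    """1-nearest-neighbour label transfer from the closest training protein."""
    q = composition_embedding(query)
    _, labels = min(
        train_set, key=lambda item: euclidean(q, composition_embedding(item[0]))
    )
    return labels


def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over all (protein, label) pairs."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up sequences with made-up GO labels, for illustration only.
train = [
    ("MKKLLKKLLKK", {"GO:0003824"}),
    ("GGSGGSGGSGG", {"GO:0005515"}),
]
test = [
    ("MKKLLKKALKK", {"GO:0003824"}),
    ("GGSGGAGGSGG", {"GO:0005515", "GO:0005634"}),
]

predictions = [predict_labels(seq, train) for seq, _ in test]
score = micro_f1([labels for _, labels in test], predictions)
print(predictions, round(score, 3))
```

In practice the composition vector would be replaced by embeddings from a pretrained protein language model and the nearest-neighbour step by a trained classifier; the overall shape of the pipeline (features, model, metric) stays the same.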

10 of 22

Protein Function Annotation

Future Challenges

  • Explainability
  • Computational Cost
  • Multi-omics Integration

11 of 22

Hands-on Tutorial

Google Colab notebook

12 of 22

Break!

We will reconvene in 15 minutes.

Next up: Hands-on tutorial on metal-binding site prediction

13 of 22

Case Study 2: Metal-Binding Site Prediction

Presenters: Bishnu Sarker, Sayane Shome

Date: 17-18 July, 2023

14 of 22

Problem Definition

Given a protein sequence of length L and the residue positions of the metal-binding sites in the protein, the objective is to predict which metal ions are most likely to bind at those sites.

We formulate this as a machine learning problem, which is the focus of this hands-on tutorial.

15 of 22

Metal-Binding Site Prediction

Input/Output Data and Data Sources

Input Data

  • Protein sequence data
  • Residue positions of the binding sites

Output Data

  • Names of the binding metal ions and their ChEMBL IDs
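
One way to picture a single training example is as a small record holding the inputs and the target label. The sketch below is illustrative only: the class and field names are hypothetical, the sequence and positions are made up, and positions are assumed to be 1-based, as in UniProt feature annotations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BindingSiteExample:
    sequence: str         # protein sequence, one letter per residue
    positions: List[int]  # 1-based residue positions of the binding site (assumed)
    metal: str            # target label: name of the bound metal ion


def site_residues(example):
    """Return the residues found at the annotated binding positions."""
    for pos in example.positions:
        if not 1 <= pos <= len(example.sequence):
            raise ValueError(f"position {pos} is outside the sequence")
    return [example.sequence[pos - 1] for pos in example.positions]


# Made-up example: the sequence, positions, and label are illustrative only.
example = BindingSiteExample(
    sequence="MACHEDCKLH", positions=[3, 4, 7, 10], metal="Zn(2+)"
)
print(site_residues(example))
```

Checking the positions against the sequence length early helps catch off-by-one mistakes between 0-based and 1-based indexing before any features are computed.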

16 of 22

Metal-Binding Site Prediction

Approach

  • Obtaining the protein sequence dataset from UniProt and the associated pretrained embeddings
  • Obtaining positional encodings for the residue positions encompassing the binding sites
  • Using ML models to predict the metal ions binding at the sites
  • Evaluating ML model performance using metrics
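
The slides do not specify which positional encoding is used, so the following is a sketch under an assumption: the sinusoidal encoding familiar from Transformer models, applied to each binding-site position and then averaged into a single site vector. The function names, the dimension `dim=8`, and the averaging step are all choices made for this illustration.

```python
import math


def sinusoidal_encoding(position, dim=8):
    """Transformer-style encoding: sine on even indices, cosine on odd indices."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    encoding = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        encoding.extend([math.sin(angle), math.cos(angle)])
    return encoding


def encode_site(positions, dim=8):
    """Average the encodings of all binding-site positions (an assumed pooling)."""
    if not positions:
        raise ValueError("at least one position is required")
    vectors = [sinusoidal_encoding(p, dim) for p in positions]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Made-up binding-site positions, for illustration only.
site_vector = encode_site([3, 4, 7, 10])
print(len(site_vector))
```

A site vector of this kind could then be concatenated with the sequence embedding before being passed to a classifier; other pooling choices (for example, concatenating per-position encodings) are equally plausible.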

17 of 22

Metal-Binding Site Prediction

Approach

[Figure: Protein sequence → Sequence embedding, Positional encoding → Predicted metal ions]

Figure from: https://www.biorxiv.org/content/10.1101/2023.03.20.533488v1.full.pdf

18 of 22

Metal-Binding Site Prediction

Current and Future Challenges

  • Explainability
  • Computational Cost
  • Metal-binding site integration

19 of 22

Hands-on Tutorial

Google Colab notebook

20 of 22

Acknowledgements

  • ISMB/ECCB 2023 Tutorial committee chairs and reviewers
  • Meharry Medical College, Tennessee, USA
  • Stanford University, California, USA
  • Kingston University, London, UK
  • Participants!

21 of 22

Thank you for joining us!

For questions about the materials and related topics, please contact:

  • Bishnu Sarker (bsarker@mmc.edu)
  • Sayane Shome (sshome@stanford.edu)

22 of 22

ISMB/ECCB tutorial Feedback Link!

Please provide your valuable feedback and suggestions!
