1 of 120

Sequence classification using InterPro and Pfam

Sara Chuguransky, Typhaine Paysan-Lafosse

Data curator, Senior bioinformatician

www.ebi.ac.uk/interpro

@InterProDB @PfamDB

Structural bioinformatics - 3rd October 2023

2 of 120

Housekeeping and ground rules

  • Critique with care - all comments and critiques are welcome

  • There are no stupid questions

3 of 120

Timeline

Part 1: General introduction to protein classification and InterPro

Part 2: Searching the InterPro website and downloading InterPro data

Part 3: Pfam updates/building using AlphaFold structure predictions

Conclusion + questions

4 of 120

What are your expectations for this session?


5 of 120

Have you heard of or used InterPro before?


6 of 120

Learning objectives

  • Summarise the concept of protein classification

  • Recall different protein model algorithms
  • List different types of search in the InterPro website
  • List the information available on the InterPro protein and structure pages
  • Produce a summary of the results obtained when analysing a protein sequence using InterProScan
  • Summarise how predicted structures can be used to refine protein domain boundaries

7 of 120

Part 1

General introduction to protein classification

and InterPro

8 of 120

What is InterPro?

  • Functional analysis of protein sequences
  • Predicts protein families and the presence of domains and important sites
  • Combines predictive models from several member databases
  • Provides literature referenced information


12 of 120

What is InterPro?

  • Functional analysis of protein sequences
  • Predicts protein families and the presence of domains and important sites
  • Combines predictive models from several member databases
  • Provides literature referenced information

Member database signatures: PS50096, cd04493, PF09103, PF09121, SM01341, PIRSF00034, PF09169, PTHR11289

Query sequence:

NDGKAGKEEFYRALCDTPGVDPKLISRIWVYNHYRWIIWKLAAMECAFPKEFANRCLSPERVLLQLKYRYDTEIDRSRR

InterPro entries: IPR015525, IPR000048, IPR015252, IPR015187, IPR015205

13 of 120

InterPro = consortium of 13 member databases

14 of 120

InterPro - integrated classification of protein families

ADDITIONAL ANNOTATIONS

TMHMM

AntiFam

SignalP

Pfam-N

Coils

FunFams

Phobius

ELM

Genome3D

RepeatsDB

DisProt

INTERPRO

15 of 120

Can you name 1 InterPro member database and 1 additional annotation resource?


16 of 120

Different types of protein signatures

17 of 120

Different types of protein signatures

Full alignment methods

Single motif methods

Patterns

Multiple motif methods

Fingerprints

Profiles & Hidden Markov models (HMMs)

18 of 120

Patterns

Many important sequence features, such as binding sites or the active sites of enzymes, consist of only a few amino acids that are essential for protein function

Sequence alignment

Motif

Extract pattern sequences

ALVKLISG

AIVHESAT

CHVRDLSC

CPVESTIS

Build regular expression

[AC]-x-V-x(4)-{ED}

Pattern signature

PS00001
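
The pattern on this slide can be checked against the four extracted motif sequences. The sketch below is a minimal illustration and not how PROSITE itself scans sequences: `prosite_to_regex` is a hypothetical helper that handles only the syntax elements used here (`x`, `[..]`, `{..}`, `(n)`), and the pattern is written without the spaces and dashes of the slide layout.

```python
import re

PATTERN = "[AC]-x-V-x(4)-{ED}"
MOTIF_SEQUENCES = ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        element = element.strip()
        # Separate an optional repeat count such as x(4) or x(2,4) from the element.
        parsed = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = parsed.group(1), parsed.group(2)
        if core == "x":
            core_regex = "."                       # any residue
        elif core.startswith("{"):
            core_regex = "[^" + core[1:-1] + "]"   # any residue except those listed
        else:
            core_regex = core                      # a single residue or an [..] set
        parts.append(core_regex + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


REGEX = prosite_to_regex(PATTERN)
MATCHES = {seq: bool(re.fullmatch(REGEX, seq)) for seq in MOTIF_SEQUENCES}
```

A sequence ending in E or D, such as `ALVKLISE`, is rejected by the final `{ED}` element.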

19 of 120

Fingerprints: a multiple motif approach

Sequence alignment

Motif 2

Motif 3

Motif 1

Define motifs

Fingerprint signature

PR00001

Extract motif sequences

xxxxxx

xxxxxx

xxxxxx

xxxxxx

xxxxxx

xxxxxx

xxxxxx

xxxxxx

xxxxxx

xxxxxx

xxxxxx

xxxxxx

Weight matrices

20 of 120

Profiles & HMMs

Sequence alignment

Entire domain

Define coverage

Whole protein

Use entire alignment of domain or protein family

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxx

xxxxxxxxxxxxxxxxxxxxxxxxx

Build model

Profile or HMM signature

21 of 120

Profiles

Start with a multiple sequence alignment

Amino acids at each position in the alignment are scored according to the frequency with which they occur

Scores are weighted according to evolutionary distance using a BLOSUM matrix

Good at identifying homologues

Sequence alignment

Residue frequency at each position

Sequence 1: F K L L S H C L L V

Sequence 2: F K A P G Q T M F Q

Sequence 3: Y P I V G Q E L L G

Sequence 4: L E F I S E C I I Q

Scoring matrix
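
As a rough illustration of "residue frequency at each position", the sketch below builds a per-column frequency table and scores a sequence against it. It makes several assumptions: the four sequences are read column-wise from the slide's alignment, the background frequency is a uniform 1/20, and a small pseudocount stands in for the BLOSUM-based weighting mentioned above, which this sketch does not implement.

```python
import math
from collections import Counter

# Four aligned sequences, read column by column from the slide (an assumption).
ALIGNMENT = [
    "FKLLSHCLLV",
    "FKAPGQTMFQ",
    "YPIVGQELLG",
    "LEFISECIIQ",
]


def column_frequencies(alignment):
    """Return one {residue: frequency} dictionary per alignment column."""
    n_sequences = len(alignment)
    return [
        {residue: count / n_sequences for residue, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


def profile_score(sequence, frequencies, background=0.05, pseudocount=0.01):
    """Sum of log2 odds (observed frequency vs uniform background) over all positions."""
    total = 0.0
    for residue, column in zip(sequence, frequencies):
        observed = column.get(residue, 0.0) + pseudocount
        total += math.log2(observed / background)
    return total


FREQUENCIES = column_frequencies(ALIGNMENT)
MEMBER_SCORE = profile_score("FKLLSHCLLV", FREQUENCIES)
UNRELATED_SCORE = profile_score("AAAAAAAAAA", FREQUENCIES)
```

A sequence from the alignment scores well above an unrelated one, which is the property that makes profiles useful for finding homologues.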

22 of 120

Profile Hidden Markov Models – encapsulate diversity

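Input multiple alignment:

        1 2 3 - 4 5
  seq1  A C G - L D
  seq2  S C G - - E
  seq3  N C G g F D
  seq4  T C G - W Q

Consensus columns assigned, defining inserts and deletes: columns 1 to 5 form the core model; the lowercase g in seq3 is an insertion and the gap in seq2 is a deletion.

[Figure: profile HMM architecture with begin (B) and end (E) states, match states M1 to M5, insert states I0 to I5 and delete states D1 to D5; residues shown at the match states: N, T, A, S, W, F, L, Y, D, E, Q, G, C]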
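
The relationship between the alignment and the match (M), insert (I) and delete (D) states can be sketched as follows. The code assumes the alignment reads ACG-LD, SCG--E, NCGgFD, TCG-WQ (gap placement is inferred from the slide), and it uses the simplifying convention that a column containing a lowercase residue is an insert column. HMMER's own rules for assigning consensus columns are more involved.

```python
ALIGNMENT = {
    "seq1": "ACG-LD",
    "seq2": "SCG--E",
    "seq3": "NCGgFD",
    "seq4": "TCG-WQ",
}


def match_columns(alignment):
    """True for consensus (match) columns, False for insert columns."""
    columns = zip(*alignment.values())
    return [not any(residue.islower() for residue in column) for column in columns]


def state_path(aligned_sequence, is_match):
    """Walk one aligned sequence through the model and list the states it uses."""
    path = ["B"]
    position = 0  # index of the last match column passed
    for residue, match_column in zip(aligned_sequence, is_match):
        if match_column:
            position += 1
            path.append(f"M{position}" if residue != "-" else f"D{position}")
        elif residue != "-":
            path.append(f"I{position}")  # residue emitted between match columns
    path.append("E")
    return path


IS_MATCH = match_columns(ALIGNMENT)
PATHS = {name: state_path(seq, IS_MATCH) for name, seq in ALIGNMENT.items()}
```

Under these assumptions seq2 passes through a delete state and seq3 through an insert state, while seq1 and seq4 use match states only.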

23 of 120

Profile Hidden Markov Models – encapsulate diversity

[Figure: the input alignment (seq1 to seq4, consensus columns 1 to 5, with one insertion and one deletion) and the profile HMM architecture (begin and end states B and E, match states M1 to M5, insert states I0 to I5, delete states D1 to D5), shown with the residues S, C, G, Y, Q]


27 of 120

Anecdotal search example: globin superfamily

query: alignment of three vertebrate hemoglobins and one myoglobin

target db: UniProt 7.0 (207K seqs)

(contains about 1060 known globins)

at E <= 0.01:

PSI-BLAST sees: 915 globins (9 sec)

HMMER3 sees: 1002 globins (8 sec)
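
The figures above translate into approximate sensitivity (recall) values. This is a quick arithmetic sketch; because the slide gives the number of known globins as "about 1060", the percentages are approximate.

```python
KNOWN_GLOBINS = 1060  # "about 1060" on the slide, so results are approximate
GLOBINS_FOUND = {"PSI-BLAST": 915, "HMMER3": 1002}


def recall(found: int, total: int) -> float:
    """Fraction of known family members recovered by a search."""
    return found / total


RECALL = {tool: round(recall(found, KNOWN_GLOBINS), 3) for tool, found in GLOBINS_FOUND.items()}
```

This gives roughly 86% of known globins for PSI-BLAST and 95% for HMMER3 at the same E-value cutoff.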

28 of 120

HMMER search methods

  • phmmer - single protein sequence against protein sequence database

  • hmmscan - single protein sequence against profile HMM library (Pfam, CATH-Gene3D, PIRSF, Superfamily and TIGRFAMs)

  • hmmsearch - either multiple sequence alignment or profile HMM against protein sequence database

  • jackhmmer - iterative searches. Initiated with a single sequence, a profile HMM or a multiple sequence alignment against a target sequence database
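
For command-line use, the main practical difference between these programs is the argument order: `hmmscan` takes the HMM library first and the query sequence second, while the others take the query first. The sketch below only builds the command lines and does not run them. The file names are placeholders, and `hmmer_command` is a hypothetical helper, not part of HMMER.

```python
QUERY_FIRST = {"phmmer", "hmmsearch", "jackhmmer"}
TARGET_FIRST = {"hmmscan"}


def hmmer_command(program, query, target, evalue=0.01, tblout="hits.tbl"):
    """Build an argument list for one of the four HMMER search programs."""
    if program not in QUERY_FIRST | TARGET_FIRST:
        raise ValueError(f"unsupported program: {program}")
    # -E sets the reporting E-value threshold; --tblout writes a per-target table.
    command = [program, "-E", str(evalue), "--tblout", tblout]
    if program in QUERY_FIRST:
        command += [query, target]
    else:
        command += [target, query]  # hmmscan: HMM library first, then the query
    return command


# Placeholder file names, for illustration only.
SCAN = hmmer_command("hmmscan", "query.fasta", "Pfam-A.hmm")
SEARCH = hmmer_command("hmmsearch", "family.hmm", "proteins.fasta")
```

Each list can be passed to `subprocess.run` on a machine where HMMER is installed; an HMM library used with `hmmscan` must first be prepared with `hmmpress`.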

29 of 120

What are the 4 different methods used to build the models?


30 of 120

Different types of protein signatures

Full alignment methods

Single motif methods

Patterns

Multiple motif methods

Fingerprints

Profiles & Hidden Markov models (HMMs)

31 of 120

Audience Q&A Session


32 of 120

InterPro - integrated classification of protein families

ADDITIONAL ANNOTATIONS

TMHMM

AntiFam

SignalP

Pfam-N

Coils

FunFams

Phobius

ELM

Genome3D

RepeatsDB

DisProt

INTERPRO

33 of 120

Intrinsic disorder

  • Function in flexible linkers, linear motifs, coupled folding and binding
  • Little evolutionary conservation, hard to model with signatures
  • Different predictors required to capture complementary aspects

34 of 120

Intrinsic disorder

  • Function in flexible linkers, linear motifs, coupled folding and binding
  • Little evolutionary conservation, hard to model with signatures
  • Different predictors required to capture complementary aspects

Disorder predictors

[Figure: eight disorder predictors, numbered 1 to 8]


37 of 120

InterPro - integrated classification of protein families

ADDITIONAL ANNOTATIONS

TMHMM

AntiFam

SignalP

Pfam-N

Coils

FunFams

Phobius

ELM

Genome3D

RepeatsDB

DisProt

INTERPRO

38 of 120

InterPro entry types

Proteins share a common evolutionary origin, as reflected in their related functions, sequences or structure

Family

Distinct functional, structural or sequence units that may exist in a variety of biological contexts

Domain

Short sequences typically repeated within a protein

Repeats

PTM

Active site

Binding site

Conserved site

Sites

Proteins share a common evolutionary origin, as reflected by similarity in their structure

Homologous Superfamily


42 of 120

InterPro entry =

  • Description of one family or domain or conserved site etc.
  • One or more protein signatures that model the entity being described
  • Proteins matched by the InterPro entry
  • Literature-supported description of the entity
  • Other supporting links and information about the entity

43 of 120

What information is available in an InterPro entry page?

44 of 120

InterPro relationships: families

  • G protein-coupled receptors
    • rhodopsin-like GPCRs
      • opsins
        • rhodopsins
        • blue-sensitive opsins
        • green-sensitive opsins
        • red-sensitive opsins
        • etc.
      • APJ receptors
      • relaxin receptors
      • adenosine receptors
      • dopamine receptors
      • etc.
    • secretin-like GPCRs
    • cAMP receptors
    • metabotropic glutamate receptors
    • etc.

  • Membrane receptors
  • Interact with G proteins
  • 7TM structure
  • Conserved sequence
  • Light receptors
  • Bound to chromophore
  • Specific absorption spectrum of chromophore

45 of 120

InterPro relationships: domains

  • Protein kinase-like domain
    • Protein kinase catalytic domain
      • Serine/threonine kinase catalytic domain
      • Tyrosine kinase catalytic domain

46 of 120

47 of 120

48 of 120

Questions?

49 of 120

Audience Q&A Session


50 of 120

Part 2

Searching the InterPro website

and downloading InterPro data

51 of 120

Gathering information using InterPro

  • Information available for a protein

  • Analysis of a protein sequence

  • Interest in a specific PDB structure

  • Downloading InterPro data

52 of 120

53 of 120

Searching in InterPro

Text search

Sequence search

Domain architecture search

Browse

54 of 120

Using the InterPro text search

Exact match to a protein

List of InterPro entries and member database signatures matching the text searched

55 of 120

Gathering information using InterPro

  • Information available for a protein

  • Analysis of a protein sequence

  • Interest in a specific PDB structure

  • Downloading InterPro data

56 of 120

Hands-on part 1

Finding information on a protein page

57 of 120

Protein page

58 of 120

What information can be found in the protein sequence viewer?

Protein length

InterPro entry

Contributing signatures

AlphaFold pLDDT score

Matches organised by

entry type

Signatures not integrated in InterPro entries

59 of 120

What information can be found in the protein sequence viewer?

Protein length

Matches organised by

entry type

InterPro entry

Contributing signatures

AlphaFold pLDDT score

Collapse category

Signatures not integrated in InterPro entries

60 of 120

What information can be found in the protein sequence viewer?

Other Features:

  • Disordered regions from MobiDB
  • Transmembrane regions from Phobius and/or TMHMM
  • Coiled regions from COILS
  • Cytoplasmic/non-cytoplasmic domains from Phobius
  • Signal peptide regions from SignalP and/or Phobius
  • CATH-FunFams
  • Pfam-N annotations
  • Eukaryotic linear motifs from ELM

External Sources:

  • DisProt
  • RepeatsDB
  • Genome3D

61 of 120

What information can be found in the protein sequence viewer?

Residue annotations

62 of 120

Residue annotation

Protein signature

Signature matches

Filtered matches

  • Current per-residue annotation in InterProScan is from CDD, SFLD and PIRSR

63 of 120

What information can be found in the protein sequence viewer?

64 of 120

  • Grew out of the model organism community
  • Biological knowledge in a computable format
  • Arranged as a hierarchy
  • Three aspects
    • Molecular function
    • Biological process
    • Cellular component

www.ebi.ac.uk/QuickGO

Less specific concepts

More specific concepts

65 of 120

InterPro GO terms

  • GO terms are manually assigned to InterPro entries, based on scientific literature
  • Sequences searched against the database are annotated with GO terms, depending on the entry they match

InterPro
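
The second bullet describes a simple propagation step: a sequence inherits the GO terms of every InterPro entry it matches. A minimal sketch follows. The entry and GO identifiers are placeholders, not real accessions, and `go_terms_for` is a hypothetical helper.

```python
# Hypothetical curated mapping from InterPro entries to GO terms (placeholder identifiers).
ENTRY_TO_GO = {
    "ENTRY_A": ["GO:term_1", "GO:term_2"],
    "ENTRY_B": ["GO:term_2", "GO:term_3"],
    "ENTRY_C": [],  # an entry with no GO terms assigned
}


def go_terms_for(matched_entries, entry_to_go):
    """Collect the GO terms of all matched entries, without duplicates."""
    terms = set()
    for entry in matched_entries:
        terms.update(entry_to_go.get(entry, []))
    return sorted(terms)


# A sequence matching ENTRY_A and ENTRY_B inherits the union of their terms.
ANNOTATION = go_terms_for(["ENTRY_A", "ENTRY_B"], ENTRY_TO_GO)
```

The quality of the propagated annotation therefore depends entirely on the manual entry-to-GO assignment and on the specificity of the matched entry.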

66 of 120

67 of 120

Side by side view of the protein sequence viewer and 3D structure

68 of 120

Audience Q&A Session


69 of 120

Gathering information using InterPro

  • Information available for a protein

  • Analysis of a protein sequence

  • Interest in a specific PDB structure

  • Downloading InterPro data

70 of 120

InterProScan and InterPro curation

InterProScan

The underlying software that allows sequences (protein and nucleic acid) to be scanned against InterPro's signatures.

  • Pre-calculated for protein pages
  • InterPro sequence search: single or batch sequences
  • Local install of InterProScan
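
With a local install, a run along the lines of `interproscan.sh -i proteins.fasta -f tsv -o proteins.tsv` (file names are placeholders) writes one tab-separated line per match. The sketch below groups such lines by protein. It assumes the 13 leading columns of the InterProScan 5 TSV format with `-` marking signatures not integrated in InterPro, and the example rows are made up for illustration.

```python
# Assumed order of the first 13 columns in InterProScan 5 TSV output.
COLUMNS = [
    "protein", "md5", "length", "analysis", "signature_acc", "signature_desc",
    "start", "stop", "score", "status", "date", "interpro_acc", "interpro_desc",
]

# Made-up rows with placeholder accessions, for illustration only.
EXAMPLE_ROWS = [
    ["protein_1", "md5", "300", "DB_A", "SIG0002", "placeholder", "120", "180",
     "1e-5", "T", "01-01-2023", "-", "-"],
    ["protein_1", "md5", "300", "DB_B", "SIG0001", "placeholder", "5", "90",
     "2e-9", "T", "01-01-2023", "ENTRY0001", "placeholder"],
]
EXAMPLE_TSV = "\n".join("\t".join(row) for row in EXAMPLE_ROWS)


def summarise(tsv_text):
    """Group matches by protein and sort them by start position."""
    summary = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        fields = dict(zip(COLUMNS, line.split("\t")))
        match = {
            "start": int(fields["start"]),
            "stop": int(fields["stop"]),
            "database": fields["analysis"],
            "signature": fields["signature_acc"],
            # "-" means the signature is not integrated in an InterPro entry.
            "interpro": None if fields["interpro_acc"] == "-" else fields["interpro_acc"],
        }
        summary.setdefault(fields["protein"], []).append(match)
    for matches in summary.values():
        matches.sort(key=lambda m: m["start"])
    return summary


SUMMARY = summarise(EXAMPLE_TSV)
```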

71 of 120

Sequence search in InterPro

>unknown_protein

MWIQVRTMDGRQTHTVDSLSRLTKVEELRRKIQELFHVEPGLQRLFYRGKQMEDGHTLFD

YEVRLNDTIQLLVRQSLVLPHSTKERDSELSDTDSGCCLGQSESDKSSTHGEAAAETDSR

PADEDMWDETELGLYKVNEYVDARDTNMGAWFEAQVVRVTRKAPSRDEPCSSTSRPALEE

DVIYHVKYDDYPENGVVQMNSRDVRARARTIIKWQDLEVGQVVMLNYNPDNPKERGFWYD

AEISRKRETRTARELYANVVLGDDSLNDCRIIFVDEVFKIERPGEGSPMVDNPMRRKSGP

SCKHCKDDVNRLCRVCACHLCGGRQDPDKQLMCDECDMAFHIYCLDPPLSSVPSEDEWYC

PECRNDASEVVLAGERLRESKKKAKMASATSSSQRDWGKGMACVGRTKECTIVPSNHYGP

IPGIPVGTMWRFRVQVSESGVHRPHVAGIHGRSNDGAYSLVLAGGYEDDVDHGNFFTYTG

SGGRDLSGNKRTAEQSCDQKLTNTNRALALNCFAPINDQEGAEAKDWRSGKPVRVVRNVK

GGKNSKYAPAEGNRYDGIYKVVKYWPEKGKSGFLVWRYLLRRDDDEPGPWTKEGKDRIKK

LGLTMQYPEGYLEALANREREKENSKREEEEQQEGGFASPRTGKGKWKRKSAGGGPSRAG

SPRRTSKKTKVEPYSLTAQQSSLIREDKSNAKLWNEVLASLKDRPASGSPFQLFLSKVEE

TFQCICCQELVFRPITTVCQHNVCKDCLDRSFRAQVFSCPACRYDLGRSYAMQVNQPLQT

VLNQLFPGYGNGR

http://www.ebi.ac.uk/interpro/search/sequence/

72 of 120

Performing a protein sequence search

73 of 120

Performing a protein sequence search

74 of 120

What information can be found on the result page?

3 tabs: Overview, Entries and Sequence

75 of 120

What information can be found on the result page?

Action items

Protein family membership

76 of 120

What information can be found on the result page?

Protein length

Matches organised by

entry type

InterPro entry

Contributing signatures

77 of 120

What information can be found on the result page?

Signatures not integrated in InterPro entries

Residue annotations

Additional annotations

GO term annotations

78 of 120

Audience Q&A Session


79 of 120

Hands-on part 2

Performing a sequence search and analysing the results

80 of 120

Gathering information using InterPro

  • Information available for a protein

  • Analysis of a protein sequence

  • Interest in a specific PDB structure

  • Downloading InterPro data

81 of 120

How to access PDB structure pages

Text search with PDB accession

e.g. 6aa2

Structure tab in InterPro entry and signature pages

Structure tab in signature pages

Structure tab in protein pages

82 of 120

What information is available on PDB structure pages?

General information about the structure

e.g. release date, experiment type, resolution

Link to other structure resources


85 of 120

What information is available on PDB structure pages?

3D structure viewer

Download and viewing options

InterPro annotations for each chain

General information about the structure

e.g. release date, experiment type, resolution

Link to other structure resources

List of InterPro entries and proteins

86 of 120

Audience Q&A Session


87 of 120

Hands-on part 3

Searching a PDB structure

88 of 120

Gathering information using InterPro

  • Information available for a protein

  • Analysis of a protein sequence

  • Interest in a specific PDB structure

  • Downloading InterPro data

89 of 120

InterPro API

  • InterPro API online access

http://www.ebi.ac.uk/interpro/api/

  • InterPro API documentation

https://github.com/ProteinsWebTeam/interpro7-api/tree/master/docs

  • Help page to build an API URL and code snippet

Accessible from Results/Your Downloads > Select and Download InterPro data

https://www.ebi.ac.uk/interpro/result/download/#/entry/InterPro/|accession

90 of 120

Example: getting reviewed proteins matching an InterPro entry

URL building

Main end point: data type

Data type: protein

Data filter: reviewed

Second end point

Data type: entry

Data filter: InterPro

Accession: IPR001128
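
Putting the URL components above together gives `/protein/reviewed/entry/InterPro/IPR001128/` under the API root. The sketch below builds that URL and shows one way to page through the results. It assumes the JSON response carries a `results` list and a `next` link for pagination, and it omits the retry and timeout handling a real script would need. `build_url` and `iter_results` are hypothetical helpers.

```python
import json
from urllib.request import Request, urlopen

API_ROOT = "https://www.ebi.ac.uk/interpro/api"


def build_url(main_type, main_filter, second_type, second_filter, accession):
    """Assemble an API URL from the main and second end points."""
    parts = [API_ROOT, main_type, main_filter, second_type, second_filter, accession]
    return "/".join(parts) + "/"


def iter_results(url):
    """Yield result records page by page (assumes 'results' and 'next' keys)."""
    while url:
        request = Request(url, headers={"Accept": "application/json"})
        with urlopen(request) as response:
            payload = json.load(response)
        yield from payload["results"]
        url = payload.get("next")


# Reviewed proteins matching InterPro entry IPR001128, as in the slide's example.
URL = build_url("protein", "reviewed", "entry", "InterPro", "IPR001128")
# Calling list(iter_results(URL)) would perform the actual download.
```

The code snippet generated by the InterPro download page is the more reliable starting point for real use, since it reflects the current API behaviour.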

91 of 120

Downloading InterPro data

92 of 120

Ensuring accuracy in InterPro

  • 13 member DBs

member

database 1

member

database 3

member

database 2

member

database 5

member

database 4

member

database 7

member

database 6

member

database 8

InterPro

  • Member DB update
  • UniProt update
  • Comprehensive checking and fixing at each stage

93 of 120

UniProt, InterPro and InterProScan

InterProScan

94 of 120

95 of 120

Audience Q&A Session


96 of 120

Part 3

Introduction to Pfam and how AlphaFold structure predictions can be used to improve its content

97 of 120

Refine model boundaries using protein structure predictions

98 of 120

Structure predictions

  • Source
    • AlphaFold (UniProt)

  • Available for
    • Protein pages
    • InterPro entry pages
    • Signatures pages

  • Use cases
    • Finding a function for domains of unknown function
    • Refining domain boundaries of member database signatures

99 of 120

Using AlphaFold to refine domain boundaries in Pfam

PF05444 - Protein of unknown function (DUF753)

Example: A0A034WDA7

100 of 120

Using AlphaFold to refine domain boundaries in Pfam

PF05444 - Protein of unknown function (DUF753)

Example: A0A034WDA7

2 identical domains

101 of 120

Using AlphaFold to refine domain boundaries in Pfam

PF05444 - Protein of unknown function (DUF753)

Example: A0A034WDA7

PF05444

25-178

102 of 120

Using AlphaFold to refine domain boundaries in Pfam

PF05444 - Protein of unknown function (DUF753)

ID DUF753

AC PF05444

DE Domain of unknown function (DUF753)

AU Bateman A;0000-0002-6982-4660

AU Moxon SJ;0000-0003-4644-1816

SE Pfam-B_1957 (release 8.0)

GA 24.70 24.70;

TC 24.70 24.70;

NC 24.60 24.60;

BM hmmbuild -o /dev/null HMM SEED

SM hmmsearch -Z 61295632 --cpu 4 -E 1000 HMM pfamseq

TP Domain

WK Domain_of_unknown_function

CL CL0117

CC This domain contains is found in many copies in a variety of

CC uncharacterised fly proteins. This domain has a

CC Ly6/uPAR/alpha-neurotoxin-like domain structure.

** Prediction of similarity from Holm paper.

ID DUF753

AC PF05444

DE Protein of unknown function (DUF753)

AU Moxon SJ;0000-0003-4644-1816

SE Pfam-B_1957 (release 8.0)

GA 23.90 23.90;

TC 24.30 23.90;

NC 23.70 23.70;

BM hmmbuild -o /dev/null HMM SEED

SM hmmsearch -Z 61295632 -E 1000 --cpu 4 HMM pfamseq

TP Family

WK Domain_of_unknown_function

CC This family contains sequences with are repeated in several

CC uncharacterised proteins from Drosophila melanogaster.
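
The two records on this slide differ in several fields (DE, GA, TP, CL). A small parser makes the comparison explicit. This is a sketch that reads only the two-letter tagged lines and copies a subset of the lines shown above; `parse_desc` and `gathering_threshold` are hypothetical helpers, not Pfam tooling.

```python
FIRST_RECORD = """\
ID DUF753
AC PF05444
DE Domain of unknown function (DUF753)
GA 24.70 24.70;
TP Domain
CL CL0117
** Prediction of similarity from Holm paper.
"""

SECOND_RECORD = """\
ID DUF753
AC PF05444
DE Protein of unknown function (DUF753)
GA 23.90 23.90;
TP Family
"""


def parse_desc(text):
    """Map each two-letter tag to the list of values found for it."""
    fields = {}
    for line in text.splitlines():
        tag, _, value = line.partition(" ")
        # Keep only upper-case two-letter tags; this skips '**' comment lines.
        if len(tag) == 2 and tag.isupper() and value:
            fields.setdefault(tag, []).append(value.strip())
    return fields


def gathering_threshold(fields):
    """Return the (sequence, domain) gathering thresholds from the GA line."""
    sequence_ga, domain_ga = fields["GA"][0].rstrip(";").split()
    return float(sequence_ga), float(domain_ga)


FIRST, SECOND = parse_desc(FIRST_RECORD), parse_desc(SECOND_RECORD)
CHANGED_TAGS = sorted(
    tag for tag in set(FIRST) | set(SECOND) if FIRST.get(tag) != SECOND.get(tag)
)
```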

103 of 120

Using AlphaFold to build missing domains in Pfam

Example: Zinc finger SWIM-type domain-containing protein 3 (ZSWIM3) (Q96MP5)

104 of 120

Using AlphaFold to build missing domains in Pfam

Example: ZSWIM3 (Q96MP5) current annotations:

PF04434

SWIM-Zf

(531-572)

PF19286

DUF5909

(579-660)

105 of 120

Using AlphaFold to build missing domains in Pfam

Example: ZSWIM3 (Q96MP5) new annotations

PF21599, N-terminal domain (CL0274)

PF04434

SWIM-Zf

(531-572)

PF19286

DUF5909

(579-660)

PF21559

N-terminal

(1-104)

ZSWIM3_N

106 of 120

Using AlphaFold to build missing domains in Pfam

Example: ZSWIM3 (Q96MP5) new annotations

PF21056, RNaseH-like domain

(CL0219)

PF21056

RNaseH-like

(179-304)

PF04434

SWIM-Zf

(531-572)

PF19286

DUF5909

(579-660)

PF21559

N-terminal

(1-104)

ZSWIM3_N

ZSWIM1-3_RNaseH-like

107 of 120

Using AlphaFold to build missing domains in Pfam

Example: ZSWIM3 (Q96MP5) new annotations

PF21056

RNaseH-like

(179-304)

PF04434

SWIM-Zf

(531-572)

PF19286

DUF5909

(579-660)

PF21559

N-terminal

(1-104)

PF21600, helical domain

PF21600

Helical domain

(312-437)

ZSWIM3_N

ZSWIM1-3_RNaseH-like

ZSWIM1-3_helical

108 of 120

Using AlphaFold to adjust domains in Pfam

Example: ZSWIM3 (Q96MP5)

PF04434

SWIM-Zf

(531-572)

PF21559

N-terminal

(1-104)

PF21056

RNaseH-like

(179-304)

PF21600

Helical domain

(312-437)

PF19286, DUF5909 (C-terminal domain)

Boundaries improved now

PF19286

SWIM1-3_C

(613-653)
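
The coordinates on these slides can be checked for overlaps and for regions still lacking a Pfam domain. The sketch below uses the accessions and boundaries exactly as shown in the slide's domain boxes; the helper functions are hypothetical, and overlaps are only checked between neighbouring domains.

```python
# (accession, name, start, end) as shown in the slide's domain boxes.
DOMAINS = [
    ("PF21559", "N-terminal", 1, 104),
    ("PF21056", "RNaseH-like", 179, 304),
    ("PF21600", "Helical domain", 312, 437),
    ("PF04434", "SWIM-Zf", 531, 572),
    ("PF19286", "SWIM1-3_C", 613, 653),
]
PF19286_BEFORE = (579, 660)  # DUF5909 boundaries before the adjustment


def domain_length(start, end):
    """Number of residues covered by an inclusive start-end range."""
    return end - start + 1


def neighbour_overlaps(domains):
    """Pairs of adjacent domains (sorted by start) whose ranges overlap."""
    ordered = sorted(domains, key=lambda d: d[2])
    return [(a[0], b[0]) for a, b in zip(ordered, ordered[1:]) if b[2] <= a[3]]


def uncovered_gaps(domains):
    """Inclusive residue ranges between consecutive domains with no annotation."""
    ordered = sorted(domains, key=lambda d: d[2])
    return [(a[3] + 1, b[2] - 1) for a, b in zip(ordered, ordered[1:]) if b[2] > a[3] + 1]


OVERLAPS = neighbour_overlaps(DOMAINS)
GAPS = uncovered_gaps(DOMAINS)
LENGTH_BEFORE = domain_length(*PF19286_BEFORE)
LENGTH_AFTER = domain_length(613, 653)
```

With these boundaries the C-terminal domain shrinks from 82 to 41 residues, and the longest unannotated stretch lies between the helical domain and SWIM-Zf.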

109 of 120

ProtENN

  • A Google Research team developed this deep-learning method to predict functional annotations for unaligned sequences, using Pfam as the training set.

  • A useful tool for expanding the coverage of Pfam families and detecting distant homologues not yet covered

  • Displayed in “Other features” in InterPro

Bileschi,M.L., Belanger,D., Bryant,D.H., Sanderson,T., Carter,B., Sculley,D., Bateman,A., DePristo,M.A. and Colwell,L.J. (2022) Using deep learning to annotate the protein universe. Nat. Biotechnol., 40, 932–937.

ProtENN finds 21K sequences

Increase in our coverage

PF00746 (LPXTG cell wall anchor motif) includes ~3.8K sequences

This method works really well with short motifs

110 of 120

ProtENN

Bileschi,M.L., Belanger,D., Bryant,D.H., Sanderson,T., Carter,B., Sculley,D., Bateman,A., DePristo,M.A. and Colwell,L.J. (2022) Using deep learning to annotate the protein universe. Nat. Biotechnol., 40, 932–937.

Example: Pullulanase A from Streptococcus pneumoniae (A0A0H2UNG0)

111 of 120

ProtENN: Example of new annotation: General odorant-binding protein 68 (A0NAZ8)

Previous matches

New matches

112 of 120

ProtENN: Example of new annotation: General odorant-binding protein 68 (A0NAZ8)

With ProtENN, Pfam annotations can be propagated to related sequences that the Pfam models (HMMs) do not identify

A0NAZ8 matches PF01395 in Pfam-N. However, it is not included in the list of proteins on the Pfam signature page

113 of 120

Audience Q&A Session


114 of 120

Hands-on part 4

Using AlphaFold to update Pfam

115 of 120

Audience Q&A Session


116 of 120

Learning objectives

  • Summarise the concept of protein classification

  • Recall different protein model algorithms
  • List different types of search in the InterPro website
  • List the information available on the InterPro protein and structure pages
  • Produce a summary of the results obtained when analysing a protein sequence using InterProScan
  • Summarise how predicted structures can be used to refine protein domain boundaries

117 of 120

Acknowledgements

Group Head

Sara Chuguransky

Beatriz Lazaro Pinto

Gustavo Salazar-Orejuela

Matthias Blum

Tiago Grego

Alex Bateman

Curators

Web developer

Production

Funding

Typhaine Paysan-Lafosse

Antonina Entcheva Andreeva

Google research

Lucy Colwell

Max Bileschi

Luis Sanchez Pulido

Laise Cavalcanti Florentino

118 of 120

Useful links

Paysan-Lafosse T, et al. InterPro in 2022. Nucleic Acids Research, Nov 2022 (doi: 10.1093/nar/gkac993)

119 of 120

120 of 120

What is your takeaway from this session?