Sequence classification using InterPro and Pfam
Sara Chuguransky, Typhaine Paysan-Lafosse
Data curator, Senior bioinformatician
Structural bioinformatics - 3rd October 2023
Housekeeping and ground rules
Timeline
Part 1: General introduction to protein classification and InterPro
Part 2: Searching the InterPro website and downloading InterPro data
Part 3: Pfam updates/building using AlphaFold structure predictions
Conclusion + questions
What are your expectations for this session?
Have you heard of or used InterPro before?
Learning objectives
Part 1
General introduction to protein classification and InterPro
What is InterPro?
PS50096
Cd04493
PF09103
PF09121
SM01341
PIRSF00034
PF09169
PTHR11289
NDGKAGKEEFYRALCDTPGVDPKLISRIWVYNHYRWIIWKLAAMECAFPKEFANRCLSPERVLLQLKYRYDTEIDRSRR
IPR015525
IPR000048
IPR015252
IPR015187
IPR015205
InterPro = consortium of 13 member databases
InterPro - integrated classification of protein families
ADDITIONAL ANNOTATIONS
TMHMM
AntiFam
SignalP
Pfam-N
Coils
FunFams
Phobius
ELM
Genome3D
RepeatsDB
DisProt
INTERPRO
Can you name 1 InterPro member database and 1 additional annotation resource?
Different types of protein signatures
Single motif methods: Patterns
Multiple motif methods: Fingerprints
Full alignment methods: Profiles & Hidden Markov models (HMMs)
Patterns
Many important sequence features, such as binding sites or the active sites of enzymes, consist of only a few amino acids that are essential for protein function
Sequence alignment
Motif
Extract pattern sequences
ALVKLISG
AIVHESAT
CHVRDLSC
CPVESTIS
Build regular expression: [AC]-x-V-x(4)-{ED}
Pattern signature: PS00001
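The "build regular expression" step can be sketched in Python. This is a minimal illustration, assuming the usual PROSITE conventions (x is any residue, [..] lists allowed residues, {..} lists excluded residues, (n) is a repeat count). The function name prosite_to_regex is hypothetical and is not part of any InterPro or PROSITE tool.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    tokens = []
    # Normalise the dashes and spaces used on the slide, then split into elements.
    for element in pattern.replace("–", "-").replace(" ", "").split("-"):
        if not element:
            continue
        # Separate an optional repeat count such as x(4) or x(2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            token = "."                      # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"  # any residue except those listed
        else:
            token = core                     # single residue or [..] set
        if low:
            token += "{" + low + ("," + high if high else "") + "}"
        tokens.append(token)
    return "".join(tokens)

# Pattern and motif sequences as shown on the slide.
REGEX = prosite_to_regex("[AC] – x -V- x(4) - {ED}")
MOTIFS = ["ALVKLISG", "AIVHESAT", "CHVRDLSC", "CPVESTIS"]

for motif in MOTIFS:
    print(motif, bool(re.fullmatch(REGEX, motif)))
```

All four extracted sequences satisfy the pattern, while a sequence ending in E or D at the last position would be rejected.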
Fingerprints: a multiple motif approach
Sequence alignment
Define motifs: Motif 1, Motif 2, Motif 3
Extract motif sequences
[Figure: motif sequences extracted from the alignment for each of the three motifs]
Weight matrices
Fingerprint signature: PR00001
Profiles & HMMs
Sequence alignment
Define coverage: entire domain or whole protein
Use entire alignment of domain or protein family
Build model
Profile or HMM signature
Profiles
Start with a multiple sequence alignment
Amino acids at each position in the alignment are scored according to the frequency with which they occur
Scores are weighted according to evolutionary distance using a BLOSUM matrix
Good at identifying homologues
Sequence alignment
Sequence 1: F K L L S H C L L V
Sequence 2: F K A P G Q T M F Q
Sequence 3: Y P I V G Q E L L G
Sequence 4: L E F I S E C I I Q
Residue frequency at each position
Scoring matrix
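The frequency-counting step behind a profile can be sketched as follows. This is a minimal illustration only: the four sequences are the alignment rows as read column by column from the slide, the function name column_frequencies is hypothetical, and the BLOSUM weighting mentioned above is deliberately left out.

```python
from collections import Counter

# Alignment rows as read from the slide (an assumption about the layout).
SEQUENCES = ["FKLLSHCLLV", "FKAPGQTMFQ", "YPIVGQELLG", "LEFISECIIQ"]

def column_frequencies(sequences):
    """Return, for each alignment column, the fraction of each residue."""
    frequencies = []
    for column in zip(*sequences):
        counts = Counter(column)
        frequencies.append(
            {residue: n / len(column) for residue, n in counts.items()}
        )
    return frequencies

for position, freqs in enumerate(column_frequencies(SEQUENCES), start=1):
    print(position, freqs)
```

A real profile would turn these raw frequencies into position-specific scores, for example by combining them with a substitution matrix.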
Profile Hidden Markov Models – encapsulate diversity
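Input multiple alignment:
      1 2 3 – 4 5
seq1  A C G – L D
seq2  S C G – – E
seq3  N C G g F D
seq4  T C G – W Q
Consensus columns assigned, defining inserts and deletes: the lower-case g in seq3 is an insertion and the gap in seq2 at column 4 is a deletion
[Figure: core model with begin (B) and end (E) states, match states M1 to M5, insert states I0 to I5 and delete states D1 to D5; residues from the alignment are shown at the match states]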
[Figure: the core model (B, M1 to M5, I0 to I5, D1 to D5, E) shown again with the example sequence S C G Y Q]
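How consensus columns, inserts and deletes follow from the input alignment can be sketched in Python. This is an illustration under stated assumptions: a column counts as a consensus (match) column when at most half of its rows are gaps, which is a common heuristic and not necessarily the rule used to draw the slide. The names assign_columns and state_path are hypothetical.

```python
# The alignment from the slide, with ASCII "-" for gaps.
ALIGNMENT = {
    "seq1": "ACG-LD",
    "seq2": "SCG--E",
    "seq3": "NCGgFD",
    "seq4": "TCG-WQ",
}

def assign_columns(rows, max_gap_fraction=0.5):
    """Label each column as a match ("M") or insert ("I") column."""
    labels = []
    for column in zip(*rows):
        gap_fraction = sum(ch == "-" for ch in column) / len(column)
        labels.append("M" if gap_fraction <= max_gap_fraction else "I")
    return labels

def state_path(row, labels):
    """List the model states one aligned sequence passes through."""
    path, match_index = [], 0
    for ch, label in zip(row, labels):
        if label == "M":
            match_index += 1
            # A gap in a consensus column is a deletion, otherwise a match.
            path.append(f"D{match_index}" if ch == "-" else f"M{match_index}")
        elif ch != "-":
            # A residue in a non-consensus column is an insertion.
            path.append(f"I{match_index}")
    return path

LABELS = assign_columns(ALIGNMENT.values())
for name, row in ALIGNMENT.items():
    print(name, state_path(row, LABELS))
```

With this rule, seq3 uses an insert state after the third match state and seq2 uses a delete state at the fourth, in line with the insertion and deletion marked on the slide.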
Anecdotal search example: globin superfamily
query: alignment of three vertebrate hemoglobins and one myoglobin
target db: UniProt 7.0 (207K seqs)
(contains about 1060 known globins)
at E <= 0.01:
PSI-BLAST sees: 915 globins (9 sec)
HMMER3 sees: 1002 globins (8 sec)
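The counts above can be turned into an approximate sensitivity for each tool. This is a small worked calculation, assuming the "about 1060" known globins is used as the denominator; the variable names are illustrative.

```python
KNOWN_GLOBINS = 1060  # approximate figure quoted on the slide
FOUND = {"PSI-BLAST": 915, "HMMER3": 1002}

# Fraction of the known globins recovered at E <= 0.01.
sensitivity = {tool: n / KNOWN_GLOBINS for tool, n in FOUND.items()}

for tool, value in sensitivity.items():
    print(f"{tool}: {value:.1%}")
# PSI-BLAST: 86.3%
# HMMER3: 94.5%
```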
HMMER search methods
What are the 4 different methods used to build the models?
Different types of protein signatures
Single motif methods: Patterns
Multiple motif methods: Fingerprints
Full alignment methods: Profiles & Hidden Markov models (HMMs)
Audience Q&A Session
Intrinsic disorder
Disorder predictors
[Figure: disorder predictors, with items numbered 1 to 8]
InterPro entry types
Family: proteins that share a common evolutionary origin, as reflected in their related functions, sequences or structure
Domain: distinct functional, structural or sequence units that may exist in a variety of biological contexts
Repeat: short sequences typically repeated within a protein
Site: PTM, active site, binding site, conserved site
Homologous Superfamily: proteins that share a common evolutionary origin, as reflected by similarity in their structure
InterPro entry =
Description of one family or domain or conserved site etc.
One or more protein signatures that model the entity being described
Proteins matched by the InterPro entry
Literature-supported description of the entity
Other supporting links and information about the entity
What information is available in an InterPro entry page?
InterPro relationships: families
G protein-coupled receptors
  rhodopsin-like GPCRs
    opsins
      rhodopsins
      blue-sensitive opsins
      green-sensitive opsins
      red-sensitive opsins
      etc
    APJ receptors
    relaxin receptors
    adenosine receptors
    dopamine receptors
    etc
  secretin-like GPCRs
  cAMP receptors
  metabotropic glutamate receptors
  etc
InterPro relationships: domains
Protein kinase-like domain
  Protein kinase catalytic domain
    Serine/threonine kinase catalytic domain
    Tyrosine kinase catalytic domain
Questions?
Audience Q&A Session
Part 2
Searching the InterPro website and downloading InterPro data
Gathering information using InterPro
Searching in InterPro
Text search
Sequence search
Domain architecture search
Browse
Using the InterPro text search
Exact match to a protein
List of InterPro entries and member database signatures matching the text searched
Hands-on part 1
Finding information on a protein page
Protein page
What information can be found in the protein sequence viewer?
Protein length
Matches organised by entry type
InterPro entry
Contributing signatures
AlphaFold pLDDT score
Collapse category
Signatures not integrated in InterPro entries
What information can be found in the protein sequence viewer?
Other Features:
External Sources:
What information can be found in the protein sequence viewer?
Residue annotations
Protein signature
Signature matches
Filtered matches
✓
✓
✓
✗
What information can be found in the protein sequence viewer?
www.ebi.ac.uk/QuickGO
Less specific concepts
More specific concepts
InterPro GO terms
InterPro
Side by side view of the protein sequence viewer and 3D structure
Audience Q&A Session
InterProScan and InterPro curation
InterProScan
The underlying software that allows sequences (protein and nucleic acid) to be scanned against InterPro's signatures.
Sequence search in InterPro
>unknown_protein
MWIQVRTMDGRQTHTVDSLSRLTKVEELRRKIQELFHVEPGLQRLFYRGKQMEDGHTLFD
YEVRLNDTIQLLVRQSLVLPHSTKERDSELSDTDSGCCLGQSESDKSSTHGEAAAETDSR
PADEDMWDETELGLYKVNEYVDARDTNMGAWFEAQVVRVTRKAPSRDEPCSSTSRPALEE
DVIYHVKYDDYPENGVVQMNSRDVRARARTIIKWQDLEVGQVVMLNYNPDNPKERGFWYD
AEISRKRETRTARELYANVVLGDDSLNDCRIIFVDEVFKIERPGEGSPMVDNPMRRKSGP
SCKHCKDDVNRLCRVCACHLCGGRQDPDKQLMCDECDMAFHIYCLDPPLSSVPSEDEWYC
PECRNDASEVVLAGERLRESKKKAKMASATSSSQRDWGKGMACVGRTKECTIVPSNHYGP
IPGIPVGTMWRFRVQVSESGVHRPHVAGIHGRSNDGAYSLVLAGGYEDDVDHGNFFTYTG
SGGRDLSGNKRTAEQSCDQKLTNTNRALALNCFAPINDQEGAEAKDWRSGKPVRVVRNVK
GGKNSKYAPAEGNRYDGIYKVVKYWPEKGKSGFLVWRYLLRRDDDEPGPWTKEGKDRIKK
LGLTMQYPEGYLEALANREREKENSKREEEEQQEGGFASPRTGKGKWKRKSAGGGPSRAG
SPRRTSKKTKVEPYSLTAQQSSLIREDKSNAKLWNEVLASLKDRPASGSPFQLFLSKVEE
TFQCICCQELVFRPITTVCQHNVCKDCLDRSFRAQVFSCPACRYDLGRSYAMQVNQPLQT
VLNQLFPGYGNGR
http://www.ebi.ac.uk/interpro/search/sequence/
Performing a protein sequence search
What information can be found on the result page?
3 tabs: Overview, Entries and Sequence
What information can be found on the result page?
Action items
Protein family membership
What information can be found on the result page?
Protein length
Matches organised by
entry type
InterPro entry
Contributing signatures
What information can be found on the result page?
Signatures not integrated in InterPro entries
Residue annotations
Additional annotations
GO term annotations
Audience Q&A Session
Hands-on part 2
Performing a sequence search and analysing the results
How to access PDB structure pages?
Text search with PDB accession
e.g. 6aa2
Structure tab in InterPro entry and signature pages
Structure tab in signature pages
Structure tab in protein pages
What information is available in PDB structure pages?
General information about the structure, e.g. release date, experiment type, resolution
Link to other structure resources
3D structure viewer
Download and viewing options
List of InterPro entries and proteins
InterPro annotations for each chain
Audience Q&A Session
Hands-on part 3
Searching a PDB structure
InterPro API
http://www.ebi.ac.uk/interpro/api/
https://github.com/ProteinsWebTeam/interpro7-api/tree/master/docs
Accessible from Results/Your Downloads > Select and Download InterPro data
https://www.ebi.ac.uk/interpro/result/download/#/entry/InterPro/|accession
Example: getting reviewed proteins matching an InterPro entry
URL building
Main end point: data type
Data type: protein
Data filter: reviewed
Second end point
Data type: entry
Data filter: InterPro
Accession: IPR001128
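The URL-building scheme above can be sketched in Python. This is a minimal illustration that only assembles the URL and makes no network request. The helper name build_api_url is hypothetical, and using https for the base address shown earlier is an assumption.

```python
# Base address of the InterPro API (https is assumed here).
BASE_URL = "https://www.ebi.ac.uk/interpro/api"

def build_api_url(main_type, main_filter, second_type, second_filter, accession):
    """Join the main end point, the second end point and the accession."""
    parts = [main_type, main_filter, second_type, second_filter, accession]
    return BASE_URL + "/" + "/".join(parts) + "/"

# Reviewed proteins matching the InterPro entry IPR001128.
url = build_api_url(
    main_type="protein",       # main end point: data type
    main_filter="reviewed",    # data filter
    second_type="entry",       # second end point: data type
    second_filter="InterPro",  # data filter
    accession="IPR001128",
)
print(url)
# https://www.ebi.ac.uk/interpro/api/protein/reviewed/entry/InterPro/IPR001128/
```

The resulting address can be opened in a browser or passed to any HTTP client; the API documentation linked above describes the response format and paging.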
Downloading InterPro data
Ensuring accuracy in InterPro
[Figure: member databases 1 to 8 contributing signatures to InterPro]
UniProt, InterPro and InterProScan
InterProScan
Audience Q&A Session
Part 3
Introduction to Pfam and how AlphaFold structure predictions can be used to improve its content
Refine model boundaries using protein structure predictions
Structure predictions
Using AlphaFold to refine domain boundaries in Pfam
PF05444 - Protein of unknown function (DUF753)
Example: A0A034WDA7
2 identical domains
Using AlphaFold to refine domain boundaries in Pfam
PF0544
25-178
Using AlphaFold to refine domain boundaries in Pfam
ID DUF753
AC PF05444
DE Domain of unknown function (DUF753)
AU Bateman A;0000-0002-6982-4660
AU Moxon SJ;0000-0003-4644-1816
SE Pfam-B_1957 (release 8.0)
GA 24.70 24.70;
TC 24.70 24.70;
NC 24.60 24.60;
BM hmmbuild -o /dev/null HMM SEED
SM hmmsearch -Z 61295632 --cpu 4 -E 1000 HMM pfamseq
TP Domain
WK Domain_of_unknown_function
CL CL0117
CC This domain contains is found in many copies in a variety of
CC uncharacterised fly proteins. This domain has a
CC Ly6/uPAR/alpha-neurotoxin-like domain structure.
** Prediction of similarity from Holm paper.
ID DUF753
AC PF05444
DE Protein of unknown function (DUF753)
AU Moxon SJ;0000-0003-4644-1816
SE Pfam-B_1957 (release 8.0)
GA 23.90 23.90;
TC 24.30 23.90;
NC 23.70 23.70;
BM hmmbuild -o /dev/null HMM SEED
SM hmmsearch -Z 61295632 -E 1000 --cpu 4 HMM pfamseq
TP Family
WK Domain_of_unknown_function
CC This family contains sequences with are repeated in several
CC uncharacterised proteins from Drosophila melanogaster.
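The two DESC records above can be compared field by field. The sketch below is an illustration under stated assumptions: it treats each line as a two-letter tag followed by a value, keeps only a few of the lines shown on the slide, and uses hypothetical function names (parse_desc, changed_fields). It is not the Pfam curation tooling.

```python
def parse_desc(text):
    """Collect the values of each two-letter tag in a DESC-style record."""
    fields = {}
    for line in text.splitlines():
        tag, _, value = line.partition(" ")
        # Skip anything that is not a two-letter upper-case tag (e.g. "**").
        if len(tag) == 2 and tag.isalpha() and tag.isupper():
            fields.setdefault(tag, []).append(value.strip())
    return fields

def changed_fields(record_a, record_b):
    """Return the tags whose values differ between two parsed records."""
    tags = sorted(set(record_a) | set(record_b))
    return {
        tag: (record_a.get(tag), record_b.get(tag))
        for tag in tags
        if record_a.get(tag) != record_b.get(tag)
    }

# Excerpts of the first and second records shown on the slide.
FIRST = """ID DUF753
AC PF05444
DE Domain of unknown function (DUF753)
GA 24.70 24.70;
TP Domain
CL CL0117"""

SECOND = """ID DUF753
AC PF05444
DE Protein of unknown function (DUF753)
GA 23.90 23.90;
TP Family"""

for tag, (a, b) in changed_fields(parse_desc(SECOND), parse_desc(FIRST)).items():
    print(tag, b, "<-", a)
```

On these excerpts the differences are the description (DE), the gathering threshold (GA), the type (TP) and the clan (CL), while the identifier and accession are unchanged.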
Using AlphaFold to build missing domains in Pfam
Example: Zinc finger SWIM-type domain-containing protein 3 (ZSWIM3) (Q96MP5)
Example: ZSWIM3 (Q96MP5) current annotations:
PF04434, SWIM-Zf (531-572)
PF19286, DUF5909 (579-660)
Example: ZSWIM3 (Q96MP5) new annotations
PF21599, N-terminal domain (CL0274): ZSWIM3_N (1-104)
PF21056, RNaseH-like domain (CL0219): ZSWIM1-3_RNaseH-like (179-304)
PF21600, helical domain: ZSWIM1-3_helical (312-437)
PF04434, SWIM-Zf (531-572)
PF19286, DUF5909 (579-660)
Using AlphaFold to adjust domains in Pfam
Example: ZSWIM3 (Q96MP5)
PF21599, N-terminal (1-104)
PF21056, RNaseH-like (179-304)
PF21600, Helical domain (312-437)
PF04434, SWIM-Zf (531-572)
PF19286, DUF5909 (C-terminal domain): SWIM1-3_C (613-653), boundaries improved now
ProtENN
Bileschi,M.L., Belanger,D., Bryant,D.H., Sanderson,T., Carter,B., Sculley,D., Bateman,A., DePristo,M.A. and Colwell,L.J. (2022) Using deep learning to annotate the protein universe. Nat. Biotechnol., 40, 932–937.
ProtENN finds 21K sequences
Increase in our coverage
PF00746 (LPXTG cell wall anchor motif) includes ~3.8K sequences
This method works really well with short motifs
Example: Pullulanase A from Streptococcus pneumoniae (A0A0H2UNG0)
ProtENN: Example of new annotation: General odorant-binding protein 68 (A0NAZ8)
Previous matches
New matches
Following the ProtENN work, Pfam annotations can be propagated to related sequences that are not identified by the models (HMMs)
A0NAZ8 matches PF01395 in Pfam-N. However, it is not included in the list of proteins on the Pfam signature page
Audience Q&A Session
Hands-on part 4
Using AlphaFold to update Pfam
Audience Q&A Session
Learning objectives
Acknowledgements
Group Head
Sara Chuguransky
Beatriz Lazaro Pinto
Gustavo Salazar-Orejuela
Matthias Blum
Tiago Grego
Alex Bateman
Curators
Web developer
Production
Funding
Typhaine Paysan-Lafosse
Antonina Entcheva Andreeva
Google research
Lucy Colwell
Max Bileschi
Luis Sanchez Pulido
Laise Cavalcanti Florentino
Useful links
http://interpro-documentation.readthedocs.io/en/latest/
https://github.com/ProteinsWebTeam/interpro7-api/tree/master/docs
https://pfam-docs.readthedocs.io/en/latest/
Paysan-Lafosse T, et al. InterPro in 2022. Nucleic Acids Research, Nov 2022 (doi: 10.1093/nar/gkac993)
What is your takeaway from this session?