Applied Bioinformatics 2025
Week 1 Session 1
Intro to Proteomics
Natalie Turner, PhD
Postdoctoral Fellow – Yates Lab
Department of Molecular Medicine
naturner@scripps.edu
Slide credit:
John R Yates III
Patrick Garrett (Grad student, Yates lab)
Proteins are the workhorses of the cell
DNA → mRNA → Proteins → Metabolites
A key step in understanding a protein's function is discovering its biological role in context.
Model Organisms
yeast
P. falciparum
fly
worm
Human Genome Project
Genome Analysis has Revolutionized Biology
An Unexpected Game Changer for Protein Biochemistry
What is Proteomics?
https://www.ebi.ac.uk/training/online/courses/proteomics-an-introduction/what-is-proteomics/
Proteomics strategies
Proteomics Applications
Mass Spectrometers Made Rapid Identification of Proteins Easy
The three basic elements of an MS
Ionization: conversion of gases, liquids, or solids to the gas phase as ions
Mass analysis, e.g. ion separation: separation of ions based on mass-to-charge ratio, m/z
Ion detection: conversion of the separated ions to an electrical signal
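The mass-to-charge relationship can be sketched numerically. This is a minimal illustration, assuming positive-mode ions that carry one proton per charge; the neutral mass of 1409.62 Da is a hypothetical value chosen only for the example.

```python
PROTON_MASS = 1.00728  # mass of a proton in Da


def mz(neutral_mass, charge):
    """Return m/z for a molecule that gains `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON_MASS) / charge


# The same molecule appears at a different m/z for each charge state.
for z in (1, 2, 3):
    print(f"z = {z}: m/z = {mz(1409.62, z):.2f}")
```

A mass analyzer separates ions by this ratio, not by mass alone, so higher charge states of the same peptide appear at lower m/z.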
Bottom-up proteomics – enzymatic digest
Peptide Fragmentation
[Figure: peptide backbone (H2N ... OH) with side chains R1 to R4, showing fragment ion nomenclature. N-terminal ions: A1 to A3, B1 to B3, C1 to C3. C-terminal ions: X1 to X3, Y1 to Y3, Z1 to Z3.]
S-P-A-F-D-S-I-M-A-E-T-L-K (protonated mass 1410.6)
| mass+ | b-ions | y-ions | mass+ |
| --- | --- | --- | --- |
| 88.1 | S | PAFDSIMAETLK | 1323.6 |
| 185.2 | SP | AFDSIMAETLK | 1226.4 |
| 256.3 | SPA | FDSIMAETLK | 1155.4 |
| 403.5 | SPAF | DSIMAETLK | 1008.2 |
| 518.5 | SPAFD | SIMAETLK | 893.1 |
| 605.6 | SPAFDS | IMAETLK | 806.0 |
| 718.8 | SPAFDSI | MAETLK | 692.8 |
| 850.0 | SPAFDSIM | AETLK | 561.7 |
| 921.1 | SPAFDSIMA | ETLK | 490.6 |
| 1050.2 | SPAFDSIMAE | TLK | 361.5 |
| 1151.3 | SPAFDSIMAET | LK | 260.4 |
| 1264.4 | SPAFDSIMAETL | K | 147.2 |
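A b/y ion ladder of this kind can be computed from residue masses. The sketch below assumes singly charged ions and approximate average (not monoisotopic) residue masses; the mass table and the function names are illustrative assumptions, not taken from the slides.

```python
# Approximate average residue masses in Da (assumed standard values).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153   # average mass of H2O in Da
PROTON = 1.0073   # mass of a proton in Da


def protonated_mass(peptide):
    """Singly protonated mass of the intact peptide."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def fragment_ions(peptide):
    """Return (b_mass, b_seq, y_seq, y_mass) for every backbone cleavage site."""
    masses = [AVERAGE_RESIDUE_MASS[aa] for aa in peptide]
    rows = []
    for i in range(1, len(peptide)):
        b_mass = sum(masses[:i]) + PROTON          # N-terminal fragment
        y_mass = sum(masses[i:]) + WATER + PROTON  # C-terminal fragment
        rows.append((b_mass, peptide[:i], peptide[i:], y_mass))
    return rows


print(f"protonated mass: {protonated_mass('SPAFDSIMAETLK'):.1f}")
for b_mass, b_seq, y_seq, y_mass in fragment_ions("SPAFDSIMAETLK"):
    print(f"{b_mass:8.1f}  {b_seq:<13}{y_seq:<13}{y_mass:8.1f}")
```

Consecutive b ions (or y ions) differ by exactly one residue mass, which is what lets a fragment spectrum be read as a sequence.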
Bottom-up database search
AELTVDPQGALAIRQLASVILKQYVETHWCAQSEKFRPPETTERAKIVIRELLPNGLRESISKVRSSVAYAVSAIAHWDWPEAWPQLFNLLMEMLVSGDLNAVHGAMRVLTEFTREVTDT
QYVETHWCAQSEKFR
FRPPETTER
PPETTER
PPETTERAK
IVIRELLPNGLR
ELLPNGLR
ELLPNGLRESISK
...
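The candidate peptides listed above can be reproduced with a simple in silico digest. This is a minimal sketch, assuming cleavage after every K or R (no proline exception) and at most one missed cleavage, which matches the peptides shown; real search engines apply their own configurable rules, and the function names here are hypothetical.

```python
PROTEIN = (
    "AELTVDPQGALAIRQLASVILKQYVETHWCAQSEKFRPPETTERAKIVIRELLPNGLRESISKVRSSVAYAV"
    "SAIAHWDWPEAWPQLFNLLMEMLVSGDLNAVHGAMRVLTEFTREVTDT"
)


def cleave(sequence):
    """Split a protein after every K or R (simplified trypsin rule)."""
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal piece without K/R
    return pieces


def digest(sequence, missed_cleavages=1):
    """Return all peptides with up to `missed_cleavages` skipped cut sites."""
    pieces = cleave(sequence)
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


for peptide in digest(PROTEIN):
    print(peptide)
```

Each candidate peptide is then turned into a theoretical fragment spectrum and scored against the observed MS/MS spectra.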
[Figure: database search workflow. Panel labels: Sample, Protein, Peptide, Spectra, Search, Search Result. Example search result: Xcorr 3.5, matched intensity 17.77 %.]
Protein Identification Using Data Dependent or Independent MS/MS
Data-dependent acquisition
| Rep. | # MS/MS spectra | # peptide identifications | # protein identifications (P2) |
| --- | --- | --- | --- |
| 1 | 86,243 | 676 | 193 |
| 2 | 88,619 | 745 | 201 |
| 3 | 86,573 | 677 | 185 |
| Avg | 87,145 | 699 | 193 |

Data-independent acquisition
| Rep. | # MS/MS spectra | # peptide identifications | # protein identifications (P2) |
| --- | --- | --- | --- |
| 1 | 112,612 | 784 | 246 |
| 2 | 98,264 | 739 | 213 |
| 3 | 108,873 | 738 | 232 |
| Avg | 106,583 | 753 | 230 (16.1%) |
Venable et al Nature Methods 1, 39-45, 2004.
Genome Sequences Allow Fast and Accurate Lookup of Information
[Figure: MS/MS spectrum (relative abundance, 0 to 100, versus m/z, 100 to 1200) matched against a sequence database. The fragment pattern serves as an amino acid sequence “bar code”.]
Schmidt, A., Forne, I. & Imhof, A. Bioinformatic analysis of proteomics data. BMC Syst Biol 8 (Suppl 2), S3 (2014). https://doi.org/10.1186/1752-0509-8-S2-S3
Module overview
| | Week 1: Data pre-processing (1) & qualitative analysis (1) | Week 2: Data visualization, data formatting (1), qualitative analysis (2) | Week 3: Data pre-processing (2), data formatting (2), quantitative analysis |
| --- | --- | --- | --- |
| Session 1 | Data import, dataframe inspection, annotation files; camprotR: contaminant identification and removal | mixOmics: intro to mixOmics; SRBCT case study: data visualization, PCA, PLS-DA, heatmap | MSstats: proteomics (quantitative analysis); data pre-processing revisited; data formatting for MSstats |
| Session 2 | dplyr: data cleaning, transformation, outlier removal, data manipulation; enrichR: enrichment analysis | Formatting data for mixOmics; practice dataset; Diann: data quantitation; data normalization; boxplots | Differential protein abundance; volcano plots; capstone task: you will apply the skills learned in each of these sessions to create (and add to) the results of a scientific publication |
Getting the notebooks
Questions?
## Week 1: Data Import and Preprocessing
### Session 1: Introduction to Data Import
Dataframe columns - Descriptive
Run = Run name associated with the sample (usually an abbreviated form of ‘File.name’).
Protein.Group = Inferred proteins. Primary (citable) UniProt accession for all assigned proteins.
Protein.Ids = All proteins matched to the precursor in the library or, in case of library-free search, in the sequence database.
Protein.Name = Names of the proteins in the Protein.Group.
Naming convention for Protein.Name = X_Y, where X is a mnemonic protein identification code of at most 5 alphanumeric characters, ‘_’ is a separator, and Y is a mnemonic species identification code of at most 5 alphanumeric characters.
Genes = Gene names corresponding to the proteins in Protein.Group.
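The X_Y naming convention can be checked with a short pattern match. This is a sketch under the rule as stated above (both codes at most 5 uppercase alphanumeric characters); the function name and the example names are illustrative.

```python
import re

# X_Y: protein code and species code, each 1 to 5 alphanumeric characters.
ENTRY_NAME = re.compile(r"([A-Z0-9]{1,5})_([A-Z0-9]{1,5})")


def parse_entry_name(name):
    """Return (protein_code, species_code), or None if the name does not fit."""
    match = ENTRY_NAME.fullmatch(name)
    return match.groups() if match else None


print(parse_entry_name("ALBU_HUMAN"))
print(parse_entry_name("TOOLONGNAME_HUMAN"))
```

Splitting the name this way makes it easy to group or filter proteins by species code when a search database mixes several organisms.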
Dataframe columns - Quantitative
PG.MaxLFQ - *MaxLFQ normalised quantity for the protein group, channel-specific
PG.Q.Value - Run-specific q-value for the protein group, channel-specific
Lib.PG.Q.Value - protein group q-value for the respective library entry, 'global' if the library was created by DIA-NN. In case of MBR, this applies to the library created after the first MBR pass
*MaxLFQ: an intensity determination and normalization procedure for label-free quantification. Protein abundance profiles are assembled using the maximum possible information from MS signals.
Cox et al 2014 https://doi.org/10.1074/mcp.M113.031591
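The q-value columns are typically used to filter protein groups before looking at quantities. The sketch below uses only the Python standard library on a small made-up report; the rows, the `filter_by_q` helper, and the 0.01 cutoff are assumptions for illustration, not values from the course dataset.

```python
import csv
import io

# Hypothetical tab-separated report using the column names described above.
TOY_REPORT = (
    "Run\tProtein.Group\tPG.MaxLFQ\tPG.Q.Value\tLib.PG.Q.Value\n"
    "S1\tP02769\t152000.0\t0.0004\t0.0002\n"
    "S1\tP04264\t98000.0\t0.0300\t0.0010\n"
    "S2\tP02769\t149500.0\t0.0009\t0.0002\n"
    "S2\tP00761\t12000.0\t0.0020\t0.0500\n"
)


def filter_by_q(rows, max_q=0.01):
    """Keep rows whose run-specific and library q-values are both <= max_q."""
    return [
        row for row in rows
        if float(row["PG.Q.Value"]) <= max_q
        and float(row["Lib.PG.Q.Value"]) <= max_q
    ]


rows = list(csv.DictReader(io.StringIO(TOY_REPORT), delimiter="\t"))
kept = filter_by_q(rows)
for row in kept:
    print(row["Run"], row["Protein.Group"], row["PG.MaxLFQ"])
```

In the R notebooks the same filter would normally be written with dplyr; the logic is identical.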
Definitions
False Discovery Rate (FDR):
The expected ratio of false positive classifications to the total number of positive classifications.
Q value:
The minimum FDR at which a given test (here, an identification) would be called significant. Q-values are used to balance the number of false positives and true positives when many hypotheses are tested at once, for example in genome-wide studies.
Adjusted P Value (P adj value):
A P value that has been corrected for multiple comparisons in hypothesis testing. It is the smallest significance level at which that comparison would still be declared statistically significant once the correction is applied across the whole set of tests.
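One common way to obtain adjusted P values is the Benjamini-Hochberg procedure, which controls the FDR. The slides do not say which correction is used later in the course, so the choice of method here is an assumption, and the input P values are made up.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.005]  # hypothetical raw P values
for raw, adj in zip(p_values, bh_adjust(p_values)):
    print(f"raw = {raw:.3f}  adjusted = {adj:.3f}")
```

Each raw P value is scaled by the number of tests divided by its rank, so the smallest P values are penalized most, and the running minimum keeps the adjusted values in the same order as the raw ones.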
Week 1 Session 1 (continued)
### camprotR
Database files (FASTA files)
The common Repository of Adventitious Proteins, cRAP (pronounced "cee-RAP")
Adventitious proteins (unintended contaminants in proteomic samples) can:
camprotR: Cambridge Centre for Proteomics
Refer to the camprotR exercise in your R notebooks and follow along with the tutorial contained therein.
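The idea behind contaminant removal can be sketched independently of the package. The example below is not the camprotR API: it is a plain-Python illustration that assumes UniProt-style FASTA headers (`>sp|ACCESSION|NAME`), a tiny hand-made contaminant file with truncated placeholder sequences, and hypothetical result rows.

```python
# Toy contaminant FASTA; sequences are truncated placeholders.
TOY_CRAP_FASTA = """>sp|P00761|TRYP_PIG Trypsin
FPTDDDDKIVGG
>sp|P04264|K2C1_HUMAN Keratin, type II cytoskeletal 1
MSRQFSSRSGYR
>sp|P02769|ALBU_BOVIN Albumin
MKWVTFISLLLL
"""


def read_accessions(fasta_text):
    """Collect the accession (second '|' field) from each FASTA header."""
    accessions = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            fields = line[1:].split("|")
            if len(fields) >= 3:
                accessions.add(fields[1])
    return accessions


def remove_contaminants(rows, contaminants):
    """Drop rows where any accession in Protein.Group is a contaminant."""
    return [
        row for row in rows
        if not set(row["Protein.Group"].split(";")) & contaminants
    ]


contaminants = read_accessions(TOY_CRAP_FASTA)
results = [  # hypothetical protein groups
    {"Protein.Group": "P68871", "Genes": "HBB"},
    {"Protein.Group": "P04264", "Genes": "KRT1"},
    {"Protein.Group": "P02768;P02769", "Genes": "ALB"},
]
for row in remove_contaminants(results, contaminants):
    print(row["Protein.Group"], row["Genes"])
```

Dropping a group as soon as any member matches is the stricter of two possible choices; keeping groups that also contain a non-contaminant accession is a reasonable alternative, depending on the experiment.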