1 of 27

Applied Bioinformatics 2025 – Week 1 Session 1 – Intro to Proteomics

Natalie Turner, PhD

Postdoctoral Fellow – Yates Lab

Department of Molecular Medicine

naturner@scripps.edu

Slide credit:

John R Yates III

Patrick Garrett (Grad student, Yates lab)

2 of 27

Proteins are the workhorses of the cell

DNA → mRNA → Proteins → Metabolites

  • Heteropolymers of amino acids
    • “Linear” sequence of amino acids folds into a 3D structure
  • Form the structural and functional elements of cells and tissues
  • Enzymes – catalyze reactions, transmit signals, regulate processes
  • Covalent modifications regulate activities and functions
  • Protein sequences are derived from genes
    • genomes are sequenced

A key step in understanding protein function is discovering a protein's biological role in context.

3 of 27

Model Organisms

yeast

P. falciparum

fly

worm

Human Genome Project

Genome Analysis has Revolutionized Biology

An Unexpected Game Changer for Protein Biochemistry

4 of 27

5 of 27

What is Proteomics?

  • Mass spectrometry is used to identify and quantify proteins in a sample
  • Traditionally qualitative – now quantitative & qualitative
  • Functional vs structural
    • Bottom-up
    • Top-down


https://www.ebi.ac.uk/training/online/courses/proteomics-an-introduction/what-is-proteomics/

6 of 27

Proteomics strategies

  • Shotgun Proteomics:
    • Identifies and quantifies proteins in complex mixtures using LC-MS/MS.
  • Targeted Proteomics:
    • Quantifies specific proteins/peptides using SRM/MRM.
  • Quantitative Proteomics:
    • Measures protein abundance using label-free, SILAC, or iTRAQ methods.
  • Interaction Proteomics:
    • Examines protein-protein interactions using affinity purification (AP)-MS and cross-linking (XL)-MS.
  • Functional Proteomics:
    • Analyzes protein activities and PTMs.

  • Structural Proteomics:
    • Determines the three-dimensional structures of proteins and protein complexes to understand their functions and interactions.
  • Top-Down Proteomics:
    • Analyzes intact proteins and PTMs.
  • Bottom-Up Proteomics:
    • Digests proteins into peptides before analysis.
  • Clinical Proteomics:
    • Identifies biomarkers and disease-associated proteins.
  • Metaproteomics:
    • Studies protein composition in microbial communities.

7 of 27

Proteomics Applications

  • Disease Biomarker Discovery: Identifying proteins associated with diseases for early diagnosis and prognosis.
  • Drug Target Identification: Finding proteins that can be targeted by new drugs.
  • Understanding Disease Mechanisms: Investigating how diseases affect protein expression and function.
  • Personalized Medicine: Tailoring treatments based on individual protein profiles.
  • Therapeutic Protein Production: Optimizing the production of therapeutic proteins, such as insulin and monoclonal antibodies.
  • Vaccine Development: Identifying protein antigens for vaccine design.
  • Environmental Monitoring: Analyzing protein changes in organisms exposed to pollutants.
  • Agricultural Improvements: Studying plant and animal proteomes to enhance crop and livestock productivity.

8 of 27

Mass Spectrometers Made Rapid Identification of Proteins Easy

The three basic elements of an MS

  • Ionization: conversion of gases, liquids, or solids to the gas phase as ions
  • Mass analysis (e.g. ion separation): separation of ions based on mass to charge, m/z
  • Ion detection: conversion of the separated ions to an electrical signal
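Because the mass analyzer separates ions by m/z rather than by mass alone, the same peptide appears at different positions depending on its charge state. The following is a minimal Python sketch of that relationship (the course notebooks themselves use R). It assumes positive-mode ionization in which each charge comes from one added proton; the function name `mz` and the neutral mass used below are illustrative choices, not values taken from an instrument.

```python
PROTON = 1.00728  # mass of a proton in Da (approximate)


def mz(neutral_mass, charge):
    """Return m/z for a molecule carrying `charge` added protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (neutral_mass + charge * PROTON) / charge


# Illustrative neutral peptide mass in Da (an assumed example value).
neutral = 1409.6
for z in (1, 2, 3):
    print(f"z = {z}: m/z = {mz(neutral, z):.2f}")
```

Higher charge states move the same peptide to lower m/z, which is why charge assignment matters before any mass can be inferred from a peak.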

9 of 27

Bottom-up proteomics – enzymatic digest

10 of 27

Peptide Fragmentation

[Figure: peptide backbone from the N-terminus (H2N) to the C-terminus (OH) with side chains R1 to R4, annotated with fragment-ion cleavage sites. N-terminal ions: A1 to A3, B1 to B3, C1 to C3. C-terminal ions: X1 to X3, Y1 to Y3, Z1 to Z3.]

11 of 27

S-P-A-F-D-S-I-M-A-E-T-L-K

(protonated mass 1410.6)

| mass (+1) | b-ions | y-ions | mass (+1) |
|---|---|---|---|
| 88.1 | S | PAFDSIMAETLK | 1323.6 |
| 185.2 | SP | AFDSIMAETLK | 1226.4 |
| 256.3 | SPA | FDSIMAETLK | 1155.4 |
| 403.5 | SPAF | DSIMAETLK | 1008.2 |
| 518.5 | SPAFD | SIMAETLK | 893.1 |
| 605.6 | SPAFDS | IMAETLK | 806.0 |
| 718.8 | SPAFDSI | MAETLK | 692.8 |
| 850.0 | SPAFDSIM | AETLK | 561.7 |
| 921.1 | SPAFDSIMA | ETLK | 490.6 |
| 1050.2 | SPAFDSIMAE | TLK | 361.5 |
| 1151.3 | SPAFDSIMAET | LK | 260.4 |
| 1264.4 | SPAFDSIMAETL | K | 147.2 |
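The b- and y-ion ladder for this peptide can be computed directly from residue masses. The sketch below is a minimal Python illustration (the course notebooks use R), assuming singly protonated fragments and approximate average residue masses. The mass table, constants, and the function name `fragment_ions` are assumptions made for this example, so the last decimal may differ slightly from values computed with other mass tables.

```python
# Approximate average residue masses in Da (assumed standard values).
RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153   # average mass of H2O in Da
PROTON = 1.00728  # mass of a proton in Da


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as singly protonated average masses.

    Entry i of each list corresponds to cleavage after residue i + 1, so
    b_ions[i] and y_ions[i] are the two halves of the same backbone break.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal piece
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal piece
    return b_ions, y_ions


peptide = "SPAFDSIMAETLK"
b_ions, y_ions = fragment_ions(peptide)
precursor = sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON

print(f"protonated mass: {precursor:.1f}")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"{b:8.1f}  {peptide[:i]:<13}{peptide[i:]:<13}{y:8.1f}")
```

Each row pairs a b-ion with its complementary y-ion, so the two masses in a row always sum to the protonated peptide mass plus one proton.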

12 of 27

Bottom-up database search

AELTVDPQGALAIRQLASVILKQYVETHWCAQSEKFRPPETTERAKIVIRELLPNGLRESISKVRSSVAYAVSAIAHWDWPEAWPQLFNLLMEMLVSGDLNAVHGAMRVLTEFTREVTDT

QYVETHWCAQSEKFR

FRPPETTER

PPETTER

PPETTERAK

IVIRELLPNGLR

ELLPNGLR

ELLPNGLRESISK

...
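The candidate peptides listed above are what an in silico digest of the protein sequence produces. The sketch below is a minimal Python illustration (the course notebooks use R) under two stated assumptions: cleavage occurs after every K or R with no proline exception, and up to one missed cleavage is allowed. The function name `tryptic_digest` and its parameters are hypothetical, and real search engines apply more detailed rules.

```python
PROTEIN = (
    "AELTVDPQGALAIRQLASVILKQYVETHWCAQSEKFRPPETTERAKIVIRELLPNGLRESISKVRSSVAY"
    "AVSAIAHWDWPEAWPQLFNLLMEMLVSGDLNAVHGAMRVLTEFTREVTDT"
)


def tryptic_digest(protein, missed_cleavages=1):
    """Return peptides from cleaving after K/R, allowing missed cleavages."""
    # Step 1: cut the sequence into fully cleaved pieces.
    pieces, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])  # C-terminal piece without K/R

    # Step 2: join up to `missed_cleavages` neighbouring pieces.
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


peptides = tryptic_digest(PROTEIN, missed_cleavages=1)
for peptide in peptides:
    print(peptide)
```

A search engine then predicts fragment masses for each candidate peptide and scores them against the observed spectra.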

[Figure: database search workflow. Labels: Sample, Spectra, Protein, Peptide, Search, Search Result (Xcorr: 3.5; Matched Intensity: 17.77 %)]

13 of 27

Protein Identification Using Data Dependent or Independent MS/MS

Data-dependent acquisition

| Rep. | # MS/MS spectra | # peptide identifications | # protein identifications (P2) |
|---|---|---|---|
| 1 | 86,243 | 676 | 193 |
| 2 | 88,619 | 745 | 201 |
| 3 | 86,573 | 677 | 185 |
| Avg | 87,145 | 699 | 193 |

Data-independent acquisition

| Rep. | # MS/MS spectra | # peptide identifications | # protein identifications (P2) |
|---|---|---|---|
| 1 | 112,612 | 784 | 246 |
| 2 | 98,264 | 739 | 213 |
| 3 | 108,873 | 738 | 232 |
| Avg | 106,583 | 753 | 230 (16.1%) |

Venable et al. Nature Methods 1, 39-45, 2004.

14 of 27

Genome Sequences Allow Fast and Accurate Lookup of Information

  • Product/Price
  • Inventory Adjustment
  • Order more product

Database

[Figure: tandem mass spectrum. x-axis: m/z (100 to 1200); y-axis: Relative Abundance (0 to 100)]

  • Amino Acid Sequence
  • Protein (Gene)
  • Experimental Context
  • Functional Context
  • High Throughput

Database

Amino Acid Sequence “Bar Code”

15 of 27

Schmidt, A., Forne, I. & Imhof, A. Bioinformatic analysis of proteomics data. BMC Syst Biol 8 (Suppl 2), S3 (2014). https://doi.org/10.1186/1752-0509-8-S2-S3

16 of 27

Module overview

| | Week 1 | Week 2 | Week 3 |
|---|---|---|---|
| Focus | Data pre-processing (1) & qualitative analysis (1) | Data visualization, data formatting (1), qualitative analysis (2) | Data pre-processing (2), data formatting (2), quantitative analysis |
| Session 1 | Data import, dataframe inspection, annotation files; camprotR: contaminant identification and removal | mixOmics: intro to mixOmics; SRBCT case study: data visualization, PCA, PLS-DA, heatmap | MSstats: proteomics (quantitative analysis); data pre-processing revisited; data formatting for MSstats |
| Session 2 | dplyr: data cleaning, transformation, outlier removal, data manipulation; enrichR: enrichment analysis | Formatting data for mixOmics; practice dataset | diann: data quantitation, data normalization, boxplots, differential protein abundance, volcano plots |

Capstone task:

You will apply the skills learned in each of these Sessions to create (and add to) the results of a scientific publication.

17 of 27

Getting the notebooks

  • In GitHub, click "New Repository", then "Import a repository"

18 of 27

Getting the notebooks

19 of 27

Getting the notebooks

  • From RStudio, create a new project from that repo
    • File -> "New Project" -> "Version Control" -> "Git"

20 of 27

Questions?

  • Open your R Notebooks

## Week 1: Data Import and Preprocessing

### Session 1: Introduction to Data Import

21 of 27

Dataframe columns - Descriptive

Run = Run name associated with the sample (usually an abbreviated form of ‘File.name’).

Protein.Group = Inferred proteins. Primary (citable) UniProt accession for all assigned proteins.

Protein.Ids = All proteins matched to the precursor in the library or, in case of library-free search, in the sequence database.

Protein.Name = Names of the proteins in the Protein.Group.

Naming convention = X_Y, where X is a mnemonic protein identification code of at most 5 alphanumeric characters, ‘_’ is a separator, Y is a mnemonic species identification code of at most 5 alphanumeric characters.

Genes = Gene names corresponding to the proteins in Protein.Group.
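The X_Y naming convention makes it easy to pull the species code out of each Protein.Name entry, for example to spot proteins from an unexpected organism. The sketch below is a minimal Python illustration (the course notebooks use R). It assumes that multiple names within one protein group are separated by semicolons. The function name `parse_protein_names` is hypothetical, and the example value is chosen for illustration rather than taken from the course dataset.

```python
def parse_protein_names(protein_name_field, sep=";"):
    """Split a Protein.Name value into (protein_code, species_code) pairs."""
    parsed = []
    for entry in protein_name_field.split(sep):
        entry = entry.strip()
        if not entry:
            continue
        # X_Y: everything before the first '_' is the protein code,
        # everything after it is the species code.
        protein_code, _, species_code = entry.partition("_")
        parsed.append((protein_code, species_code))
    return parsed


# Illustrative value: a protein group containing two entry names.
example = "ALBU_HUMAN;TRYP_PIG"
for protein_code, species_code in parse_protein_names(example):
    print(protein_code, species_code)
```

Entries whose species code differs from the organism under study are often worth a second look, since they may be contaminants.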

22 of 27

Dataframe columns - Quantitative

PG.MaxLFQ - *MaxLFQ normalised quantity for the protein group, channel-specific

PG.Q.Value - Run-specific q-value for the protein group, channel-specific

Lib.PG.Q.Value - protein group q-value for the respective library entry, 'global' if the library was created by DIA-NN. In case of MBR, this applies to the library created after the first MBR pass

*Intensity determination and normalization procedure. Protein abundance profiles are assembled using the maximum possible information from MS signals.

Cox et al 2014 https://doi.org/10.1074/mcp.M113.031591
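The q-value columns are typically used to decide which protein groups to keep before looking at PG.MaxLFQ. The sketch below is a minimal Python illustration (the course notebooks use R) that keeps rows where both PG.Q.Value and Lib.PG.Q.Value are at or below a cutoff. The 0.01 threshold is an assumed, commonly used choice rather than one specified in these slides. The function name `filter_protein_groups` is hypothetical, and the small table is invented data that only mimics the column layout.

```python
import csv
import io

# Invented example rows that mimic the report's column names.
ROWS = [
    ["Run", "Protein.Group", "Genes", "PG.MaxLFQ", "PG.Q.Value", "Lib.PG.Q.Value"],
    ["sample_A", "P02768", "ALB", "1520000.0", "0.0004", "0.0002"],
    ["sample_A", "P68871", "HBB", "830000.0", "0.03", "0.001"],
    ["sample_B", "P02768", "ALB", "1490000.0", "0.0007", "0.0002"],
    ["sample_B", "P69905", "HBA1", "410000.0", "0.002", "0.05"],
]
REPORT_TSV = "\n".join("\t".join(row) for row in ROWS)


def filter_protein_groups(rows, max_q=0.01):
    """Keep rows whose run-specific and library q-values are both <= max_q."""
    kept = []
    for row in rows:
        run_q = float(row["PG.Q.Value"])
        lib_q = float(row["Lib.PG.Q.Value"])
        if run_q <= max_q and lib_q <= max_q:
            kept.append(row)
    return kept


reader = csv.DictReader(io.StringIO(REPORT_TSV), delimiter="\t")
kept = filter_protein_groups(list(reader), max_q=0.01)
for row in kept:
    print(row["Run"], row["Protein.Group"], row["PG.MaxLFQ"])
```

Filtering on both columns is stricter than filtering on the run-specific value alone; which combination is appropriate depends on the experiment.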

23 of 27

Definitions

False Discovery Rate (FDR):

The expected proportion of false positives among all positive classifications (discoveries).

Q value:

The FDR analogue of the p-value: the minimum FDR at which a given test result (for example, a peptide or protein identification) is considered significant. Q-values are used to control the proportion of false positives among accepted results in large-scale studies, such as genome-wide or proteome-wide analyses.

Adjusted P Value (P adj value):

A p-value that has been corrected for multiple comparisons in hypothesis testing. It is the smallest significance level at which a comparison is still considered statistically significant after that correction.
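One widely used way to obtain adjusted p-values that control the FDR is the Benjamini-Hochberg procedure. The sketch below is a minimal Python illustration (the course notebooks use R). It is offered as an example of the idea, not as the method used by any specific package in this course, and the input p-values are invented.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by m / rank and
    # enforcing that adjusted values never increase as the rank decreases.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented p-values for four hypothetical proteins.
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = benjamini_hochberg(p_values)
for p, q in zip(p_values, adjusted):
    print(f"p = {p:.3f}  adjusted = {q:.3f}")
```

Calling a result significant when its adjusted value is at or below 0.05 means that about 5% of the results accepted at that cutoff are expected to be false positives.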

24 of 27

Week 1 Session 1 (continued)

### camprotR

25 of 27

Database files (FASTA files)

  • Proteomics data is searched against a background database (proteome) to obtain the protein sequences used for protein identification
  • Proteomes can be downloaded from UniProt (uniprot.org) by species (or other criteria) in .fasta format
  • Additional protein sequences can be appended to these downloaded proteomes to create a custom database, which can then be used to identify potentially problematic/unwanted proteins
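Appending sequences to a proteome amounts to concatenating FASTA records while keeping the added entries recognizable. The sketch below is a minimal Python illustration (the course notebooks use R). The function names, the `cRAP_` header prefix, and the shortened placeholder sequences are all assumptions made for this example, so they are not real full-length proteins or an established naming standard.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def build_custom_database(proteome, contaminants, tag="cRAP_"):
    """Append contaminant records to a proteome, tagging their headers."""
    tagged = [(tag + header, sequence) for header, sequence in contaminants]
    return list(proteome) + tagged


def write_fasta(records, width=60):
    """Format (header, sequence) tuples as FASTA text."""
    lines = []
    for header, sequence in records:
        lines.append(">" + header)
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# Placeholder records: the sequences are short invented stand-ins.
proteome_text = ">sp|P02768|ALBU_HUMAN Albumin\nMKWVTFISLLFLFSSAYS\n"
contaminant_text = ">sp|P00761|TRYP_PIG Trypsin\nFPTDDDDKIVGGYTCAAN\n"

database = build_custom_database(
    read_fasta(proteome_text), read_fasta(contaminant_text)
)
print(write_fasta(database))
```

Tagging the appended headers makes contaminant hits easy to recognize in the search results later.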

26 of 27

The common Repository of Adventitious Proteins, cRAP (pronounced "cee-RAP")

Adventitious proteins (unintended contaminants in proteomic samples) can:

  • Significantly impact the accuracy and reliability of proteomic analyses
  • Lead to false positives, misidentification of proteins, and skewed quantitative results, ultimately affecting the interpretation of experimental data and the conclusions drawn from it.

27 of 27

CamprotR: Cambridge Centre for Proteomics

  • camprotR is an R package that enables you to download an extensive cRAP database and customise it according to your project needs.
  • The sample dataset provided to you has already been searched against a database file containing cRAP. However, it’s important to understand the principle and basis for identifying and excluding problematic proteins from proteomics search results.

Refer to the camprotR exercise in your R notebooks and follow along with the tutorial contained therein.
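The principle behind contaminant removal can be shown without any particular package: compare the accessions in each protein group against a list of known contaminant accessions and set aside the groups that match. The sketch below is a minimal Python illustration of that idea and does not reproduce camprotR's functions. The function name `remove_contaminants`, the semicolon separator, the small accession list, and the rule of discarding any group that contains a contaminant are all assumptions made for this example.

```python
def remove_contaminants(rows, contaminant_accessions, sep=";"):
    """Split rows into (kept, removed) based on contaminant accessions.

    A row is removed when any accession in its Protein.Ids field is in
    `contaminant_accessions` (an assumed, deliberately strict policy).
    """
    kept, removed = [], []
    for row in rows:
        accessions = {acc.strip() for acc in row["Protein.Ids"].split(sep)}
        if accessions & contaminant_accessions:
            removed.append(row)
        else:
            kept.append(row)
    return kept, removed


# Illustrative contaminant list: porcine trypsin and bovine serum albumin.
contaminants = {"P00761", "P02769"}

# Invented search-result rows that mimic the report's column names.
rows = [
    {"Protein.Ids": "P02768", "Genes": "ALB"},
    {"Protein.Ids": "P00761", "Genes": ""},
    {"Protein.Ids": "P69905;P02769", "Genes": "HBA1"},
]

kept, removed = remove_contaminants(rows, contaminants)
print("kept:", [row["Protein.Ids"] for row in kept])
print("removed:", [row["Protein.Ids"] for row in removed])
```

Keeping the removed rows in a separate list, rather than deleting them outright, makes it possible to report how much of the signal came from contaminants.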