1 of 26

Proteogenomics 2: Database Searching

Galaxy Training Network Smörgåsbord

February 18th, 2021

2 of 26

‘Omics Technologies In The Era Of Systems Biology

3 of 26

Mass Spectrometry-based Proteomics

4 of 26

5 of 26

Looking Beyond The Known Proteome

Mass spectrum

Reference Protein Database from genomic annotation

Cancer / disease-related databases such as COSMIC, IARC p53, OMIM…

Deep genome sequencing data from ICGC, TCGA and CPTAC

RNASeq data

(Customized OR Combined)

6-frame DNA sequences.

3-frame cDNA sequences.

Identification of peptides corresponding to novel proteoforms.
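
To illustrate how 6-frame DNA sequences can become searchable protein sequences, the sketch below translates a DNA string in all six reading frames with the standard genetic code. The function names are hypothetical, the input is assumed to contain only A/C/G/T, and this is not the tool used in the Galaxy workflow.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; trailing bases that do not fill a codon are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frame_translation(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translated sequences."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement strand
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # stop codons are written as '*'
```

A 3-frame translation of cDNA would use only the three forward frames of the same function.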

6 of 26

Proteogenomics Workflows in Galaxy

Database Searching

Using MS/MS data

7 of 26

Proteogenomics Database Search Workflow

8 of 26

Database Search Workflow: Input files

9 of 26

Input files for Database Searching

Created with Biorender.com

10 of 26

Input files for Database Searching

11 of 26

Input files for Database Searching

12 of 26

Input files for Database Searching

13 of 26

Database Search Workflow: SearchGUI

14 of 26

SearchGUI matches MS/MS spectra to peptide sequences

  • Several peptide matching algorithms have been developed over the years
  • SearchGUI allows for multiple search engines to run simultaneously

15 of 26

SearchGUI matches MS/MS spectra to peptide sequences

  • Specific digestion conditions can be selected
  • Mass spectrometer parameters can be selected to maximize the efficacy of spectral matches
  • Post-Translational Modifications (PTMs) can be added to the search parameters
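
To make "digestion conditions" concrete, here is a minimal sketch of an in-silico tryptic digest using the commonly cited rule (cleave after K or R unless the next residue is P), with a configurable number of missed cleavages. The function name and parameters are hypothetical. In SearchGUI these settings are chosen in the tool form, so this only illustrates the idea and does not show SearchGUI's implementation.

```python
import re


def tryptic_digest(protein, missed_cleavages=0, min_length=1):
    """Return peptides from a simplified trypsin rule: cut after K/R, not before P."""
    # Zero-width split after K or R when the next residue is not P (Python 3.7+).
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for start in range(len(pieces)):
        # Join up to `missed_cleavages` additional neighbouring pieces.
        last = min(start + missed_cleavages, len(pieces) - 1)
        for end in range(start, last + 1):
            peptide = "".join(pieces[start:end + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


print(tryptic_digest("MKRPAKLLR"))
print(tryptic_digest("MKRPAKLLR", missed_cleavages=1))
```

Allowing more missed cleavages enlarges the search space, which is one reason the digestion settings should match how the sample was actually prepared.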

16 of 26

Database Search Workflow: PeptideShaker

17 of 26

PeptideShaker filters the results of SearchGUI

  • SearchGUI results are filtered by false discovery rate (FDR) to yield the most confident peptide-spectrum matches (PSMs)
  • PeptideShaker outputs mzIdentML files for downstream analysis
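
The idea of FDR filtering can be sketched with a simple target-decoy estimate: rank PSMs by score, estimate the FDR at each rank as decoys divided by targets, and keep everything above the deepest rank that still meets the threshold. This is a simplified illustration with hypothetical names and a made-up input layout; PeptideShaker's own scoring and validation are more elaborate.

```python
def filter_by_fdr(psms, max_fdr=0.01):
    """Keep target PSMs above the deepest score cutoff with estimated FDR <= max_fdr.

    psms: list of (score, is_decoy) tuples, where a higher score is a better match.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked PSMs to keep
    for rank, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR among the PSMs accepted so far.
        if targets and decoys / targets <= max_fdr:
            cutoff = rank
    # Report only target hits; decoys exist solely to estimate the error rate.
    return [psm for psm in ranked[:cutoff] if not psm[1]]


example = [(10, False), (9, False), (8, False), (7, True), (6, False), (5, True)]
print(filter_by_fdr(example, max_fdr=0.25))
```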

Vaudel et al. Nat Biotechnol 33, 22–24 (2015).

18 of 26

Database Search Workflow: Data Filtration

19 of 26

Data Filtration

  • SearchGUI + PeptideShaker will identify PSMs for all peptides in a sample
  • In proteogenomic analysis we’re primarily interested in novel peptides that are not found in the conventional (reference) proteome
  • These filtration steps remove PSMs that match normal (reference) peptides, contaminant peptides, etc.
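
The filtering logic can be sketched as a set-membership test: a PSM is kept as "novel" only if none of the proteins it maps to is a reference or contaminant entry. The record layout (`sequence`, `proteins`) and the accession values below are invented for illustration and do not reflect the actual column names of the PeptideShaker PSM report.

```python
def select_novel_psms(psms, reference_accessions, contaminant_accessions):
    """Keep PSMs whose peptide maps to no reference or contaminant protein."""
    known = set(reference_accessions) | set(contaminant_accessions)
    return [
        psm for psm in psms
        if not any(accession in known for accession in psm["proteins"])
    ]


def distinct_peptides(psms):
    """Collapse PSMs to unique peptide sequences, preserving first-seen order."""
    return list(dict.fromkeys(psm["sequence"] for psm in psms))


# Hypothetical example data.
psms = [
    {"sequence": "PEPTIDEK", "proteins": ["REF_0001"]},
    {"sequence": "NOVELSEQK", "proteins": ["custom_variant_7"]},
    {"sequence": "NOVELSEQK", "proteins": ["custom_variant_7"]},
    {"sequence": "SHAREDK", "proteins": ["custom_variant_9", "REF_0002"]},
    {"sequence": "KERATINK", "proteins": ["CONTAM_0001"]},
]
novel = select_novel_psms(psms, {"REF_0001", "REF_0002"}, {"CONTAM_0001"})
print(distinct_peptides(novel))
```

Note that a peptide shared between a custom entry and a reference protein is discarded, since it cannot be attributed to a novel sequence.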

[Figure: all PSMs identified in MS data are filtered to novel PSMs and then to novel peptides; contaminants and normal peptides are removed]

20 of 26

Database Search Workflow: Mz to SQLite

21 of 26

MZ to SQLite

  • PeptideShaker generates mzIdentML files to store peptide/protein identification data
  • MZ-to-SQLite converts mzIdentML files into the mzsqlite format for analysis in the Multi-omics Visualization Platform (MVP)
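
One practical benefit of a SQLite representation is that identification data can be queried with plain SQL. The sketch below builds a tiny in-memory database and counts confident PSMs per peptide. The table name, column names, and values are invented for illustration and are not the actual mzsqlite schema.

```python
import sqlite3

# Hypothetical, simplified table; the real mzsqlite schema is different.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE psm (spectrum_title TEXT, sequence TEXT, confidence REAL)"
)
conn.executemany(
    "INSERT INTO psm VALUES (?, ?, ?)",
    [
        ("scan=1", "PEPTIDEK", 99.0),
        ("scan=2", "PEPTIDEK", 97.5),
        ("scan=3", "ELVISLIVESK", 80.0),
        ("scan=4", "SAMPLER", 100.0),
    ],
)

# Count PSMs per peptide, keeping only matches at or above a confidence cutoff.
rows = conn.execute(
    "SELECT sequence, COUNT(*) AS n_psms FROM psm "
    "WHERE confidence >= ? GROUP BY sequence ORDER BY sequence",
    (95.0,),
).fetchall()
print(rows)
conn.close()
```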

22 of 26

Database Search Workflow: Tabular to FASTA

23 of 26

Tabular to FASTA

  • FASTA files are required for subsequent BLAST-P analyses in the third proteogenomics workflow
  • Ensure that the Title column is set to column 1 of the previous output and the Sequence column to column 2
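
The conversion itself is simple, as this minimal sketch shows: each tab-separated row becomes a FASTA record, with one column used as the header line and another as the sequence. Column numbers are 1-based to mirror the tool form. The function name and example rows are hypothetical, and this is not the Galaxy tool's code.

```python
def tabular_to_fasta(lines, title_col=1, seq_col=2):
    """Convert tab-separated rows into FASTA text (1-based column numbers)."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank rows
        fields = line.split("\t")
        records.append(f">{fields[title_col - 1]}\n{fields[seq_col - 1]}")
    return "\n".join(records) + ("\n" if records else "")


rows = ["pep1\tPEPTIDEK\n", "pep2\tSAMPLER\n"]
print(tabular_to_fasta(rows), end="")
```

If the title and sequence columns are swapped, the output is still syntactically valid FASTA but with peptide sequences as headers, which is why the column assignment deserves a check.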

24 of 26

Now let’s go through how to set up this workflow in Galaxy…

25 of 26

Other Galaxy-P Tutorials in the GTN Smörgåsbord

Custom Database Creation

James Johnson

Novel Peptide Analysis

Subina Mehta

Metaproteomics

Pratik Jagtap

Introduction to Proteogenomics

Tim Griffin

26 of 26

Supplementary tutorials for proteogenomics can be found at the Galaxy Training Network: https://training.galaxyproject.org/training-material/topics/proteomics/tutorials/proteogenomics-dbsearch/tutorial.html