Applied Bioinformatics 2025
Week 1 Session 2
Data manipulation
Natalie Turner, PhD
Postdoctoral Fellow – Yates Lab
Department of Molecular Medicine
naturner@scripps.edu
Manipulating data
Data cleaning
Remove outliers & missing values
Filter according to q-value (< 0.01)
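The two cleaning steps that have a fixed rule (dropping missing values and filtering at q < 0.01) can be sketched as follows. This is a minimal illustration in plain Python rather than the course's R Notebook; the rows, column names, and the `clean` helper are hypothetical, and outlier removal is left out because its criteria depend on the dataset.

```python
import math

# Hypothetical example rows; None and NaN both stand for a missing value.
rows = [
    {"protein": "P1", "intensity": 1200.0, "q_value": 0.001},
    {"protein": "P2", "intensity": None, "q_value": 0.002},
    {"protein": "P3", "intensity": 850.0, "q_value": 0.05},
    {"protein": "P4", "intensity": float("nan"), "q_value": 0.004},
    {"protein": "P5", "intensity": 430.0, "q_value": 0.009},
]

Q_CUTOFF = 0.01  # threshold taken from the slide


def is_missing(value):
    """Return True for None or NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def clean(rows, q_cutoff=Q_CUTOFF):
    """Keep rows with no missing values and a q-value below the cutoff."""
    return [
        r for r in rows
        if not is_missing(r["intensity"])
        and not is_missing(r["q_value"])
        and r["q_value"] < q_cutoff
    ]


cleaned = clean(rows)
print([r["protein"] for r in cleaned])
```

Here P2 and P4 are dropped for missing intensities and P3 for failing the q-value cutoff.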
Data transformation
Log transformation
(usually log2 but can be log10)
Brings the data closer to a normal distribution (important for the assumptions of many statistical tests)
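A log2 transformation can be sketched as below. The intensity values and the `log2_transform` helper are hypothetical, and treating zero or negative intensities as missing is an assumption made here, since they have no logarithm.

```python
import math

# Hypothetical raw intensities for one sample; 0.0 stands for "not detected".
intensities = [1024.0, 2048.0, 512.0, 0.0, 4096.0]


def log2_transform(values):
    """Return log2 of each value; non-positive or missing values become None."""
    return [
        math.log2(v) if v is not None and v > 0 else None
        for v in values
    ]


log2_values = log2_transform(intensities)
print(log2_values)
```

On the log2 scale a difference of 1 corresponds to a two-fold change in intensity, which is one reason log2 is usually preferred over log10.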
Data normalization*
Reduces variability and accounts for systematic bias introduced during sample preparation, storage, treatment, and MS analysis
*Covered later in the module
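As a rough preview only, one simple normalization is median centering of log2 intensities within each sample. This method is an assumption chosen for illustration, not necessarily the one used later in the module; the sample names and values are hypothetical.

```python
import statistics

# Hypothetical log2 intensities; sample_B is shifted up by 1 (a systematic bias).
samples = {
    "sample_A": [10.0, 11.0, 12.0],
    "sample_B": [11.0, 12.0, 13.0],
}


def median_center(values):
    """Subtract the sample median so every sample is centered on zero."""
    med = statistics.median(values)
    return [v - med for v in values]


normalized = {name: median_center(vals) for name, vals in samples.items()}
print(normalized)
```

After centering, both samples have the same values, so the constant offset between them has been removed.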
Formatting
Go to your R Notebook
Annotations file
Many R packages designed for proteomics analysis (and other open-source software packages) expect four (4) essential columns in an annotations file:
Base R functions
dplyr functions
Other functions
Go to your R Notebook
EnrichR
General principle of Gene Ontology (GO)
‘Concept of associating a collection of genes with a functional biological term in a systematic way.’
Searches against selected gene libraries
Enrichment analysis will find which GO terms are over-represented (or under-represented) using annotations for a gene set
Output can be generated for different GO aspects (e.g., Biological Process, Molecular Function, Cellular Component)
EnrichR reports a p-value for each term, which is used to separate enriched from non-enriched terms
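Over-representation of a term is commonly tested with a one-sided Fisher's exact test, which is equivalent to the upper tail of a hypergeometric distribution. The sketch below illustrates that general idea in plain Python; it is not EnrichR's own code, the function name and all counts are hypothetical, and the exact statistics EnrichR reports should be checked against its documentation.

```python
from math import comb


def enrichment_p_value(overlap, gene_set_size, term_size, background_size):
    """Probability of seeing at least `overlap` term genes in a random gene set.

    overlap         : genes in the input set annotated with the term
    gene_set_size   : number of genes in the input set
    term_size       : number of background genes annotated with the term
    background_size : total number of genes in the background
    """
    max_overlap = min(gene_set_size, term_size)
    total = comb(background_size, gene_set_size)
    # Sum the hypergeometric probabilities for overlap, overlap + 1, ...
    tail = sum(
        comb(term_size, k) * comb(background_size - term_size, gene_set_size - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Hypothetical counts: 10 of 100 input genes carry a term that annotates
# 200 of 20,000 background genes (about 1 would be expected by chance).
p = enrichment_p_value(overlap=10, gene_set_size=100,
                       term_size=200, background_size=20000)
print(p)
```

A small p-value means the overlap is larger than expected by chance. Because many terms are tested at once, the p-values are normally adjusted for multiple testing before terms are called enriched.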
Gene = protein = gene
Protein identification is by inference, because proteins are reconstructed from the peptide sequences that were identified
Protein sequences are derived from their genetic counterparts i.e., gene sequences
Usually does not account for proteoforms (= same protein, but modified), which may have functional differences (this is hard to detect with standard bottom-up proteomics)
GO/Enrichment analysis is a popular and powerful tool to understand the qualitative properties of a dataset
Go to your R Notebook