1 of 13

Applied Bioinformatics 2025

Week 1 Session 2

Data manipulation

Natalie Turner, PhD

Postdoctoral Fellow – Yates Lab

Department of Molecular Medicine

naturner@scripps.edu

2 of 13

Manipulating data

Data cleaning

Remove outliers & missing values

Filter according to q-value (< 0.01)
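
The two cleaning steps above can be sketched in base R. The table and its column names (Protein, Intensity, Q.Value) are made up for illustration and will differ in your own search-engine output.

```r
# Hypothetical results table; column names and values are assumptions.
df <- data.frame(
  Protein   = c("P1", "P2", "P3", "P4", "P5"),
  Intensity = c(15000, NA, 820, 43000, 2700),
  Q.Value   = c(0.001, 0.004, 0.020, 0.008, 0.050),
  stringsAsFactors = FALSE
)

# Step 1: drop rows with a missing intensity.
no_missing <- df[!is.na(df$Intensity), ]

# Step 2: keep only identifications with q-value < 0.01.
cleaned <- no_missing[no_missing$Q.Value < 0.01, ]

print(cleaned)  # P1 and P4 remain
```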

Data transformation

Log transformation

(usually log₂ but can be log₁₀)

Brings right-skewed intensity data closer to a normal distribution (important for the assumptions of many statistical tests)
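
A minimal sketch of the effect of a log₂ transformation, using simulated right-skewed values rather than real MS intensities (the simulation parameters are arbitrary assumptions):

```r
set.seed(1)

# Simulated right-skewed "intensities" (log-normal); not real MS data.
intensity <- rlnorm(1000, meanlog = 10, sdlog = 1.5)

# Log2 transformation
log2_intensity <- log2(intensity)

# Raw scale: the mean sits well above the median (right skew).
c(raw_mean = mean(intensity), raw_median = median(intensity))

# Log2 scale: mean and median are close, as expected for roughly normal data.
c(log2_mean = mean(log2_intensity), log2_median = median(log2_intensity))

# Optional visual check:
# hist(intensity); hist(log2_intensity)
```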

Data normalization*

Reduces variability and accounts for systematic bias introduced during sample preparation, storage, treatment, and MS analysis

*Covered later in the module

3 of 13

Formatting

Data formatting can be your best friend (or your worst enemy!)

It’s essential to know how to manipulate a dataframe:

Correct data (rows/columns) and row/column headers
How to clean/log-transform/normalize data
Creating lists/assigning variable names is sometimes necessary (coming up next session)

Get comfortable with the data: know your data, know the format required for the downstream analytical pipeline

4 of 13

Go to your R Notebook

Run the first code chunk
Open the annotations file in Excel

5 of 13

Annotations file

With many R packages designed for proteomics analysis (and other open-source software packages), there are four (4) essential columns in annotations files:

Run 🡪 The name of the MS data file corresponding to a particular sample run
BioReplicate 🡪 Usually reserved for biological replicate number, but can be assigned as a technical replicate number in certain situations.
Condition 🡪 The group that a sample belongs to, e.g., control or treatment
Outlier 🡪 Whether the run should be removed from the analysis. The coding is user-defined, but usually logical (TRUE/FALSE or YES/NO).
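
As an illustration, an annotations table with these four columns might look like the sketch below. The run names and values are hypothetical; the course annotations file will differ.

```r
# Hypothetical annotations table with the four essential columns.
annotations <- data.frame(
  Run          = c("ctrl_1.raw", "ctrl_2.raw", "ctrl_3.raw",
                   "trt_1.raw", "trt_2.raw", "trt_3.raw"),
  BioReplicate = 1:6,
  Condition    = c("control", "control", "control",
                   "treatment", "treatment", "treatment"),
  Outlier      = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Keep only the runs that are not flagged as outliers.
keep <- annotations[!annotations$Outlier, ]

# Number of runs left per condition
table(keep$Condition)
```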

6 of 13

Base R functions

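log(x): Computes the logarithm of x. The default is the natural log (base e); use log2(x) or log10(x), or log(x, base = 2), for other bases.

%>% (magrittr package, loaded with dplyr; not base R): Pipe operator that takes the output of the expression on its left and passes it as the first argument to the function on its right. The base R equivalent (R 4.1 and later) is |>.

%in%: Returns a logical vector indicating, for each element of its left operand, whether it has a match in its right operand.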
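
A short base R sketch of log() and %in%. The run names are hypothetical.

```r
# Logarithms: natural log by default, other bases on request.
natural <- log(exp(1))
base2   <- log2(8)            # 3
base10  <- log10(1000)
base2b  <- log(8, base = 2)   # same result as log2(8)

# %in%: one TRUE/FALSE per element of the LEFT operand.
runs         <- c("ctrl_1.raw", "ctrl_2.raw", "trt_1.raw")
outlier_runs <- c("ctrl_2.raw")

is_outlier <- runs %in% outlier_runs   # FALSE TRUE FALSE
kept_runs  <- runs[!is_outlier]

print(kept_runs)
```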

7 of 13

dplyr functions

filter(): Keep rows that match a condition

mutate(): Create, modify, and delete columns

select(): Keep or drop columns using their names and types
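
The three verbs are often chained with the pipe. A minimal sketch, assuming dplyr is installed and using a made-up table (column names are assumptions):

```r
library(dplyr)

# Hypothetical results table
df <- data.frame(
  Run       = c("ctrl_1.raw", "ctrl_1.raw", "trt_1.raw"),
  Protein   = c("P1", "P2", "P1"),
  Intensity = c(1024, 2048, 4096),
  Q.Value   = c(0.001, 0.030, 0.005),
  stringsAsFactors = FALSE
)

result <- df %>%
  filter(Q.Value < 0.01) %>%                   # keep confident identifications
  mutate(Log2Intensity = log2(Intensity)) %>%  # add a log2 column
  select(Run, Protein, Log2Intensity)          # keep only the columns needed

print(result)
```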

8 of 13

Other functions

%like% (data.table): Logical vector, TRUE for items that match pattern.

na.omit (R stats): Handles missing values in objects (for a data frame, removes every row that contains at least one NA)
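
A minimal sketch of both functions, assuming data.table is installed. The run names and values are hypothetical.

```r
library(data.table)  # provides %like%

# %like%: TRUE where the pattern is found in the element.
runs    <- c("ctrl_1.raw", "ctrl_2.raw", "trt_1.raw")
is_ctrl <- runs %like% "ctrl"   # TRUE TRUE FALSE

# na.omit: drops every row that has an NA in any column.
df <- data.frame(
  Protein = c("P1", "P2", "P3"),
  S1      = c(10.1, NA, 12.3),
  S2      = c(9.8, 11.0, NA),
  stringsAsFactors = FALSE
)
complete <- na.omit(df)

print(complete)  # only P1 has values in both samples
```

Because na.omit removes the whole row, a protein missing in a single sample is lost entirely, so check how many rows remain after using it.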

9 of 13

Go to your R Notebook

Explore the annotations file
Take note of the samples annotated as outliers
Examine and run the next code chunk

10 of 13

EnrichR

11 of 13

General principle of Gene Ontology (GO)

‘Concept of associating a collection of genes with a functional biological term in a systematic way.’

Searches against selected gene libraries

Enrichment analysis will find which GO terms are over-represented (or under-represented) using annotations for a gene set

Output can be generated for different GO aspects (e.g., Biological Process, Molecular Function, Cellular Component)

EnrichR generates a p-value for each term, which separates enriched from non-enriched terms
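
To illustrate the idea of over-representation, here is a sketch of a one-sided Fisher's exact test for a single term in base R. All counts are made up, and this is not the EnrichR package code. It only shows the kind of test that produces an enrichment p-value.

```r
# Hypothetical counts for one GO term
background_size <- 10000  # annotated genes in the background
term_size       <- 200    # background genes annotated with the term
list_size       <- 100    # genes in the input list
overlap         <- 10     # input genes annotated with the term

# 2 x 2 table: rows = in term or not, columns = in input list or not
contingency <- matrix(
  c(overlap,             list_size - overlap,
    term_size - overlap, background_size - term_size - (list_size - overlap)),
  nrow = 2,
  dimnames = list(c("in_term", "not_in_term"), c("in_list", "not_in_list"))
)

# One-sided test: is the term seen MORE often than expected by chance?
p_fisher <- fisher.test(contingency, alternative = "greater")$p.value

# Same probability written as a hypergeometric tail: P(overlap >= 10)
p_hyper <- phyper(overlap - 1, term_size, background_size - term_size,
                  list_size, lower.tail = FALSE)

expected_overlap <- list_size * term_size / background_size  # 2 genes by chance
print(c(expected = expected_overlap, observed = overlap, p_value = p_fisher))
```

In practice many terms are tested at once, so an adjusted p-value should be used when deciding which terms to report.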

12 of 13

Gene = protein = gene

Protein ID is by inference (due to the reconstruction of peptide sequences into their protein sequence)

Protein sequences are derived from their genetic counterparts i.e., gene sequences

Usually does not account for proteoforms (= same protein, but modified), which may have functional differences (this is hard to detect with standard bottom-up proteomics)

GO/Enrichment analysis is a popular and powerful tool to understand the qualitative properties of a dataset

13 of 13

Go to your R Notebook

Work through the EnrichR example in small groups of 3-4
Refer to the EnrichR vignette for instructions (see Notebook for details)