Applied Bioinformatics 2025

Week 1 Session 2

Data manipulation

Natalie Turner, PhD

Postdoctoral Fellow – Yates Lab

Department of Molecular Medicine

naturner@scripps.edu

Manipulating data

Data cleaning

Remove outliers & missing values

Filter according to q-value (< 0.01)
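
A minimal sketch of these cleaning steps in base R, assuming a hypothetical data frame `psm` with columns `protein`, `intensity`, and `q_value` (the names and values are illustrative and not taken from the course notebook). Outlier removal is not shown here.

```r
# Hypothetical example data; column names are assumptions for illustration
psm <- data.frame(
  protein   = c("P1", "P2", "P3", "P4", "P5"),
  intensity = c(1200, NA, 560, 89000, 430),
  q_value   = c(0.001, 0.004, 0.020, 0.009, 0.500)
)

# Drop rows that contain a missing value in any column
psm_complete <- na.omit(psm)

# Keep only identifications that pass the q-value threshold (< 0.01)
psm_clean <- psm_complete[psm_complete$q_value < 0.01, ]

print(psm_clean)
```

The order matters little here, but removing missing values first avoids `NA` results in the logical comparison used for filtering.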

Data transformation

Log transformation

(usually log2 but can be log10)

Makes the intensity distribution closer to normal (important for the assumptions of many statistical tests)
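
To illustrate the effect of a log transformation, here is a small sketch using made-up intensity values (the vector `intensity` is hypothetical). It compares the mean and median before and after transformation as a rough indicator of skew.

```r
# Hypothetical right-skewed intensities spanning several orders of magnitude
intensity <- c(250, 1000, 4000, 16000, 64000, 1024000)

log2_intensity  <- log2(intensity)   # base 2: one unit = a two-fold change
log10_intensity <- log10(intensity)  # base 10: one unit = a ten-fold change

# On the raw scale the mean is pulled far above the median by the largest value;
# on the log2 scale the two are much closer together
summary(intensity)
summary(log2_intensity)
```

Base 2 is often preferred in proteomics because a difference of 1 on the log2 scale corresponds to a doubling of intensity.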

Data normalization*

Reduces variability and accounts for systematic bias introduced during sample preparation, storage, treatment, and MS analysis

*Covered later in the module

Formatting

  • Data formatting can be your best friend (or your worst enemy!)

  • It’s essential to know how to manipulate a data frame:
    • Check that the data (rows/columns) and the row/column headers are correct
    • Clean, log-transform, and normalize the data
    • Create lists and assign variable names where necessary (coming up next session)

  • Get comfortable with the data: know your data, know the format required for the downstream analytical pipeline
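
A minimal sketch of how one might inspect a data frame and tidy its headers in base R. The table `df` and its column names are hypothetical, chosen to mimic an export with awkward headers.

```r
# Hypothetical table with headers containing spaces
df <- data.frame(
  "Protein ID" = c("P1", "P2"),
  "Sample 1"   = c(10.5, 11.2),
  check.names  = FALSE  # keep the headers exactly as written
)

dim(df)       # number of rows and columns
colnames(df)  # current column headers
str(df)       # type of each column and a preview of its values

# Replace the headers with names that are easier to use in code
colnames(df) <- c("protein", "sample_1")
head(df)
```

Checking `dim()`, `colnames()`, and `str()` before any analysis is a quick way to confirm that the data match the format a downstream package expects.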

Go to your R Notebook

  • Run the first code chunk
  • Open the annotations file in Excel

Annotations file

With many R packages designed for proteomics analysis (and other open-source software packages), there are four (4) essential columns in annotations files:

    • Run → The name of the MS data file corresponding to a particular sample run
    • BioReplicate → Usually reserved for the biological replicate number, but can be assigned as a technical replicate number in certain situations
    • Condition → The group that a sample belongs to, e.g., control or treatment
    • Outlier → Whether the run should be removed from the analysis or not. The value is user-defined, but usually logical (TRUE/FALSE, or YES/NO)
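
As a sketch of what such a file might look like once loaded into R, the example below builds a small annotations table by hand and removes the runs flagged as outliers. The run names, groups, and the choice of a logical `Outlier` column are assumptions for illustration and do not come from the course annotations file.

```r
# Hypothetical annotations table; file names and groups are made up
annotations <- data.frame(
  Run          = c("ctrl_1.raw", "ctrl_2.raw", "ctrl_3.raw",
                   "treat_1.raw", "treat_2.raw", "treat_3.raw"),
  BioReplicate = c(1, 2, 3, 1, 2, 3),
  Condition    = c("control", "control", "control",
                   "treatment", "treatment", "treatment"),
  Outlier      = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
)

# Runs flagged for removal
outlier_runs <- annotations$Run[annotations$Outlier]

# Annotations for the runs that stay in the analysis
annotations_kept <- annotations[!annotations$Outlier, ]

# Number of remaining runs per condition
table(annotations_kept$Condition)
```

If the `Outlier` column were coded as YES/NO instead, the filter would compare against the text value, e.g. `annotations$Outlier == "NO"`.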

Base R functions

  • log(x): Computes the logarithm of x. The default is the natural logarithm; use log2(x), log10(x), or log(x, base = ...) for other bases

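  • %>% (magrittr package, also loaded with dplyr; not part of base R): Operator that takes the output of the expression on its left and passes it as the first argument to the function on its right. The base R equivalent is |> (R 4.1 or later).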

  • %in%: Returns a logical vector indicating, for each element of its left operand, whether it has a match in the right operand.
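
A short sketch of these three tools in action. The vectors `x`, `runs`, and `outliers` are hypothetical. Note that `%>%` is provided by the magrittr package (and becomes available when dplyr is loaded), so a package has to be attached before using it.

```r
library(magrittr)  # provides %>%; also available after library(dplyr)

x <- c(1, 8, 100)

log(x)            # natural logarithm (default)
log2(x)           # base 2
log10(x)          # base 10
log(x, base = 2)  # same result as log2(x)

# Pipe: x %>% log2() is another way of writing log2(x)
piped <- x %>% log2() %>% round(2)

# %in%: one TRUE/FALSE per element of the left-hand vector
runs       <- c("ctrl_1", "ctrl_2", "treat_1")  # hypothetical run names
outliers   <- c("ctrl_2")
is_outlier <- runs %in% outliers
runs[!is_outlier]
```

Pipes read left to right, which makes a chain of steps easier to follow than nested calls such as `round(log2(x), 2)`.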

dplyr functions

  • filter(): Keep rows that match a condition

  • mutate(): Create, modify, and delete columns

  • select(): Keep or drop columns using their names and types
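
The three verbs are often chained together. Below is a minimal sketch, assuming the dplyr package is installed and using a hypothetical table `proteins` whose column names are invented for this example.

```r
library(dplyr)

# Hypothetical protein table; column names are assumptions
proteins <- data.frame(
  protein   = c("P1", "P2", "P3", "P4"),
  intensity = c(1024, 2048, 512, 4096),
  q_value   = c(0.001, 0.050, 0.002, 0.008)
)

result <- proteins %>%
  filter(q_value < 0.01) %>%                    # keep rows passing the threshold
  mutate(log2_intensity = log2(intensity)) %>%  # add a new column
  select(protein, log2_intensity)               # keep only these columns

print(result)
```

Each verb takes a data frame as its first argument and returns a data frame, which is what allows the steps to be piped one after another.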

Other functions

  • %like% (data.table): Returns a logical vector that is TRUE for each item matching a pattern (regular expression).

  • na.omit (R stats): Handles missing values in objects; for a data frame, it removes every row that contains at least one ‘NA’
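
A brief sketch of both functions, assuming the data.table package is installed. The run names and the table `intensities` are hypothetical.

```r
library(data.table)  # provides %like%

# Hypothetical run names
runs <- c("ctrl_1.raw", "ctrl_2.raw", "treat_1.raw")

# TRUE where the name matches the pattern (a regular expression)
is_control <- runs %like% "^ctrl"
runs[is_control]

# Hypothetical intensity table with one missing value
intensities <- data.frame(
  protein = c("P1", "P2", "P3"),
  ctrl_1  = c(10.2, NA, 11.5),
  treat_1 = c(12.1, 9.8, 11.9)
)

# na.omit() drops the whole P2 row because one of its values is NA
complete_rows <- na.omit(intensities)
print(complete_rows)
```

Because `na.omit()` discards an entire row for a single missing value, it can remove a large share of a proteomics table; it is worth checking `nrow()` before and after.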

Go to your R Notebook

  • Explore the annotations file
  • Take note of the samples annotated as outliers
  • Examine and run the next code chunk

EnrichR

General principle of Gene Ontology (GO)

‘Concept of associating a collection of genes with a functional biological term in a systematic way.’

Searches against selected gene libraries

Enrichment analysis will find which GO terms are over-represented (or under-represented) using annotations for a gene set

Output can be generated for different GO aspects (e.g., Biological Process, Molecular Function, Cellular Component)

EnrichR generates a p-value for each term, which is used to separate enriched from not-enriched terms
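
The idea of over-representation can be illustrated with a one-sided Fisher's exact test on a 2 x 2 table. This is a sketch of the general principle using base R and made-up counts, not a reproduction of EnrichR's own implementation; all variable names and numbers are hypothetical.

```r
# All counts below are made up for illustration
background_size <- 20000  # genes in the background
term_size       <- 200    # background genes annotated with the GO term
list_size       <- 100    # genes in the submitted list
overlap         <- 10     # submitted genes annotated with the GO term

# 2 x 2 contingency table: in list / not in list vs in term / not in term
counts <- matrix(
  c(overlap,             list_size - overlap,
    term_size - overlap, background_size - term_size - list_size + overlap),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("in_list", "not_in_list"), c("in_term", "not_in_term"))
)

# One-sided test: is the term seen in the list more often than expected by chance?
test <- fisher.test(counts, alternative = "greater")

# Overlap expected by chance if the list were a random draw from the background
expected_overlap <- list_size * term_size / background_size

expected_overlap
test$p.value
```

Here the observed overlap is well above the overlap expected by chance, so the test returns a small p-value. When many terms are tested at once, the p-values should be adjusted for multiple testing.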

Gene = protein = gene

Protein ID is by inference (due to the reconstruction of peptide sequences into their protein sequence)

Protein sequences are derived from their genetic counterparts i.e., gene sequences

Usually does not account for proteoforms (= same protein, but modified), which may have functional differences (this is hard to detect with standard bottom-up proteomics)

GO/Enrichment analysis is a popular and powerful tool to understand the qualitative properties of a dataset

Go to your R Notebook

  • Work through the EnrichR example in small groups of 3-4
  • Refer to the EnrichR vignette for instructions (see Notebook for details)