1 of 24

Tidyomics - a tool for multiple omics analysis?

Min Hyung Ryu

October 18, 2023

CDNM Multiple Omics Meeting

2 of 24

Tidyverse and tidyomics

https://github.com/tidyomics

https://r4ds.hadley.nz/

5 of 24

Consider the dataset below:

Person    Treatment A   Treatment B
Mike      -             2
John      85            45
Mary      10            5

There are three variables:

  • Person
  • Treatment
  • Test result

6 of 24

An Example: tidy version

Person    Treatment   Test result
Mike      A           -
John      A           85
Mary      A           10
Mike      B           2
John      B           45
Mary      B           5
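The reshaping from the wide table to the tidy version can be sketched with tidyr. This is a minimal illustration, not the code from the original slide: the column names `person`, `A`, `B`, `treatment`, and `test_result` are assumptions, and Mike's missing Treatment A result ("-") is encoded as `NA`.

```r
library(tibble)
library(tidyr)

# Wide layout: one column per treatment (column names are assumed).
# Mike's missing Treatment A result ("-") is stored as NA.
wide <- tribble(
  ~person, ~A, ~B,
  "Mike",  NA, 2,
  "John",  85, 45,
  "Mary",  10, 5
)

# Tidy layout: one row per observation (person x treatment),
# one column per variable.
tidy <- wide |>
  pivot_longer(
    cols = c(A, B),
    names_to = "treatment",
    values_to = "test_result"
  )

print(tidy)
```

Each of the three variables (person, treatment, test result) now has its own column, which is the layout the dplyr and ggplot2 functions expect.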

7 of 24

Tibble data structure

8 of 24

Advantages of Tidyverse

1. Consistency:

  • promotes a consistent and coherent approach to data analysis.
  • follows a unified grammar and design philosophy.

2. Readability:

  • code is more readable and expressive.
  • follows conventions resembling natural language.

3. Piping:

  • uses the `%>%` or `|>` (pipe) operator for chaining data manipulation operations.
  • enhances code clarity and flow.

4. Data Manipulation:

  • packages like dplyr and tidyr simplify data wrangling.
  • includes functions for filtering, grouping, summarizing, and reshaping data.
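To make the piping point concrete, here is a small sketch comparing nested calls with piped calls. The `results` table and its column names are made up for illustration; the native `|>` pipe assumes R 4.1 or later.

```r
library(dplyr)

# Hypothetical tidy table of Treatment B results.
results <- tibble(
  person = c("Mike", "John", "Mary"),
  treatment = "B",
  test_result = c(2, 45, 5)
)

# Nested calls read inside-out.
nested <- arrange(filter(results, test_result > 3), desc(test_result))

# The same steps with the native pipe read top to bottom.
piped <- results |>
  filter(test_result > 3) |>
  arrange(desc(test_result))

# The magrittr pipe (re-exported by dplyr) expresses the same chain.
piped_magrittr <- results %>%
  filter(test_result > 3) %>%
  arrange(desc(test_result))
```

All three objects hold the same rows; only the way the steps are written differs.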

9 of 24

Verb-based operations

10 of 24

Summarize after grouping
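A minimal sketch of summarizing after grouping, using the tidy treatment table from the earlier example. The column names and the choice of summaries (count of non-missing results and mean) are assumptions for illustration, not the code shown on the original slide.

```r
library(dplyr)

# Tidy treatment data; Mike's missing Treatment A result is NA.
results <- tibble(
  person = rep(c("Mike", "John", "Mary"), times = 2),
  treatment = rep(c("A", "B"), each = 3),
  test_result = c(NA, 85, 10, 2, 45, 5)
)

# group_by() defines the groups; summarise() returns one row per group.
by_treatment <- results |>
  group_by(treatment) |>
  summarise(
    n_results = sum(!is.na(test_result)),
    mean_result = mean(test_result, na.rm = TRUE)
  )

print(by_treatment)
```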

11 of 24

Summarized output

12 of 24

Advantages of Tidyverse

5. Visualization:

  • ggplot2 creates customizable, publication-quality visualizations.
  • adheres to a "grammar of graphics" for complex plots.

6. Data Import:

  • readr and readxl simplify importing data from various formats.
  • supports CSV, Excel, and more.

7. Data Exploration:

  • dplyr and tidyr facilitate data exploration and pattern identification.

8. Package Ecosystem:

  • integrates well with other R packages and libraries.
  • extends the power and capabilities of R.

13 of 24

Tibble data structure

17 of 24

Example: genomic and transcriptomic data integration

# Gene ranges (a GRanges object) from `edb`, an EnsDb annotation database loaded beforehand
g <- ensembldb::genes(edb)

18 of 24

Only include genes that were found in the scRNA-seq data

Load all the gene_names from your scRNA-seq data

19 of 24

Clean the ChIP-seq data

Chromatin accessibility

Let’s look at genes near peaks of an active chromatin mark (H3K4me3, measured with ChIP-seq)

20 of 24

Derive the distance from each gene (scRNA-seq) to the nearest H3K4me3 peak (ChIP-seq)
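The filtering and distance steps could look roughly like the sketch below, written with plyranges on toy data. Everything here is an assumption for illustration: the `genes`, `peaks`, and `detected` objects are invented stand-ins for the annotation, the cleaned ChIP-seq peaks, and the scRNA-seq gene names, and the sketch assumes a plyranges version in which `join_nearest()` accepts `distance = TRUE`.

```r
library(GenomicRanges)
library(plyranges)

# Toy gene ranges standing in for the ensembldb::genes() output.
genes <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(
    start = c(100, 1000, 5000, 9000),
    end = c(200, 1100, 5100, 9100)
  ),
  gene_name = c("geneA", "geneB", "geneC", "geneD")
)

# Toy H3K4me3 peaks standing in for the cleaned ChIP-seq calls.
peaks <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = c(150, 1300), end = c(160, 1350)),
  peak_id = c("peak1", "peak2")
)

# Hypothetical gene names detected in the scRNA-seq data.
detected <- c("geneA", "geneB", "geneC")

gene_to_peak <- genes |>
  # keep only genes that were found in the scRNA-seq data
  filter(gene_name %in% detected) |>
  # attach the nearest peak and a `distance` column (0 when overlapping)
  join_nearest(peaks, distance = TRUE)

print(gene_to_peak)
```

The result is still a GRanges object, so it can be passed on to further plyranges verbs or converted to a tibble for joining with expression data.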

21 of 24

Make the gene-to-peak distance into a categorical variable
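One way to bin the distances is `cut()` inside `mutate()`. The sketch below uses invented distances, and the break points and labels are arbitrary choices for illustration, not the thresholds from the original analysis.

```r
library(dplyr)

# Hypothetical gene-to-peak distances in base pairs.
gene_to_peak <- tibble(
  gene_name = c("geneA", "geneB", "geneC"),
  distance = c(0, 199, 3649)
)

binned <- gene_to_peak |>
  mutate(
    dist_bin = cut(
      distance,
      breaks = c(0, 1, 1000, Inf),
      labels = c("overlapping", "within 1 kb", "beyond 1 kb"),
      right = FALSE  # intervals are [0, 1), [1, 1000), [1000, Inf)
    )
  )

print(binned)
```

`cut()` returns a factor whose levels follow the order of `labels`, which keeps the categories in a sensible order when plotting.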

22 of 24

Summarize nested data for each cell type

Application:

You can nest by cell type, disease, donor, etc.
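Nesting and per-group summaries can be sketched with tidyr and purrr. The table below is invented for illustration; `cell_type`, `dist_bin`, and `logcounts` are assumed column names, and the summaries are examples rather than the ones computed in the talk.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical per-gene expression with a distance category, by cell type.
expr <- tibble(
  cell_type = c("B cell", "B cell", "T cell", "T cell", "T cell"),
  gene_name = c("geneA", "geneC", "geneA", "geneB", "geneC"),
  dist_bin = c("near", "far", "near", "near", "far"),
  logcounts = c(2.0, 0.5, 3.0, 2.0, 0.0)
)

per_cell_type <- expr |>
  # one row per cell type; the remaining columns go into a list-column `data`
  nest(data = -cell_type) |>
  mutate(
    n_genes = map_int(data, nrow),
    mean_near = map_dbl(data, \(d) mean(d$logcounts[d$dist_bin == "near"]))
  )

print(per_cell_type)
```

Swapping `cell_type` for a disease or donor column would nest the same data by that variable instead.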

23 of 24

Visualize the results
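A plot of such a summary might be built as in the sketch below. The values in `summary_df` are invented placeholders, not results from the talk, and the grouped bar chart is one of several reasonable choices.

```r
library(ggplot2)

# Hypothetical summary: mean expression by distance category and cell type.
summary_df <- data.frame(
  cell_type = rep(c("B cell", "T cell"), each = 2),
  dist_bin = factor(rep(c("near", "far"), times = 2), levels = c("near", "far")),
  mean_logcounts = c(2.0, 0.5, 2.5, 0.3)
)

# Grouped bars: distance category on the x-axis, one bar per cell type.
p <- ggplot(summary_df, aes(x = dist_bin, y = mean_logcounts, fill = cell_type)) +
  geom_col(position = "dodge") +
  labs(
    x = "Distance from gene to nearest H3K4me3 peak",
    y = "Mean expression (log counts)",
    fill = "Cell type"
  ) +
  theme_minimal()

print(p)
```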

24 of 24

Other applications?

  • RNA-seq and methylation integration
  • CITE-seq (RNA-seq and surface protein markers)
  • Single-cell multiple omics integration
  • Metabolomics and proteomics

Summary

  • The tidy data paradigm allows clean and scalable data analysis
    • Powerful visualization for exploration and publications
  • Integration of genomics and transcriptomics
  • Tidyomics is an expanding tool designed to apply tidy data principles to omics research