1 of 53

Single Cell Multiomics Data Integration and Visualisation

Dr Shila Ghazanfar
Royal Society - Newton International Fellow
John Marioni Lab
Cancer Research UK Cambridge Institute
University of Cambridge
@shazanfar

Introduction to multiomics data integration and visualisation
EMBL-EBI Training
21-25 March 2022

2 of 53

Single Cell Multiomics Data Integration and Visualisation

  • Why single cells? Single cell RNA-seq
  • Multimodal single cell data
  • Single cell data integration: horizontal, vertical and mosaic
  • Working with integrated single cell data
  • Static and dynamic visualization of multiomics single cell data

3 of 53

Single Cell Multiomics Data Integration and Visualisation

  • Why single cells? Single cell RNA-seq
  • Multimodal single cell data
  • Single cell data integration: horizontal, vertical and mosaic
  • Working with integrated single cell data
  • Static and dynamic visualization of multiomics single cell data

4 of 53

Why single cell genomics?

  • It allows us to characterise heterogeneity in the gene expression profile of a population.
  • The cell is the basic unit of life. At the cell level we can:
    - Define cell identities (e.g. cell types or subtypes).
    - Observe cell states and behaviour (e.g. cell cycle, metabolism, stress).
    - Study dynamic processes (e.g. differentiation, activation).
    - Study noise in transcriptional regulation.

5 of 53

Single cell transcriptomics

  • RNA-seq allows quantification of the whole transcriptome.
  • FISH: small number of transcripts.
  • seqFISH/MERFISH/ISS: hundreds to ~1,000 genes.
  • seqFISH+: 10,000 genes.
  • FACS: small number of proteins.
  • Mass cytometry: ~40 proteins.

6 of 53

A typical single cell experiment

  • Dissociation can be easy or hard (blood vs muscle).
  • Many separation methods (plate-based, droplets).
  • Different protocols for RT and cDNA generation (full-length, 5’/3’-biased).

7 of 53

Throughput of scRNA-seq technologies

  • scRNA-seq protocols have increased hugely in throughput.
  • Cell separation using FACS or microfluidic devices.
  • Automation of RT and cDNA generation.

8 of 53

Single cell capture protocols

  • Plate-based
    - Hundreds to a few thousand cells.
    - High capture efficiency.
    - High number of genes detected.
    - Full-length transcripts.
    - UMIs optional.
    - Compatible with spike-ins.
  • Droplet-based
    - Tens to hundreds of thousands of cells.
    - Variable capture efficiency.
    - Lower number of genes detected.
    - 3’/5’-biased.
    - UMIs.
    - No spike-ins.

10.1038/s41576-019-0150-2

9 of 53

scRNA-seq data

  • In its rawest form, FASTQ files after Illumina sequencing.
  • 1. Align reads to the reference genome.
    - Many good and fast aligners (e.g. subread, STAR).
  • 2. Count the number of reads mapped to each gene (e.g. HTSeq, featureCounts).
    - This produces a count matrix with one count per gene per cell.
    - If UMIs are used, reads with the same UMI are collapsed to a single count.
  • Data generated with the 10X platform can be processed with CellRanger.
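The counting step above can be sketched in a few lines. This is a minimal illustration, not what HTSeq, featureCounts or CellRanger actually do internally: it assumes each aligned read has already been reduced to a (cell barcode, gene, UMI) triple, and the function name `count_umis` and the toy reads are hypothetical.

```python
from collections import defaultdict


def count_umis(aligned_reads):
    """Collapse reads sharing (cell, gene, UMI), then count unique UMIs.

    aligned_reads: iterable of (cell_barcode, gene, umi) triples.
    Returns a dict mapping (cell_barcode, gene) -> UMI count.
    """
    umis_seen = defaultdict(set)
    for cell, gene, umi in aligned_reads:
        # A set keeps each UMI once, so PCR duplicates collapse to one count.
        umis_seen[(cell, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


# Hypothetical toy input: one triple per aligned read.
reads = [
    ("cellA", "Gata1", "AACG"),
    ("cellA", "Gata1", "AACG"),  # same UMI: counted once
    ("cellA", "Gata1", "TTGC"),
    ("cellB", "Gata1", "AACG"),  # same UMI in another cell: separate count
    ("cellB", "Sox2", "GGAT"),
]
counts = count_umis(reads)
```

The resulting dictionary is a sparse form of the count matrix, with one entry per gene per cell that has at least one read.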

10 of 53

scRNA-seq data analysis

  • Aim: to extract real biology from data with technical noise.
  • 1. Quality control.
  • 2. Normalisation of cell-specific biases.
  • 3. Batch correction.
  • 4. Modelling technical noise.
  • 5. Dimensionality reduction and visualisation.
  • 6. Clustering.
  • ... followed by higher-level analyses and interpretation.

11 of 53

Quality control

  • We use several metrics to identify low-quality samples:
    - Total number of reads per cell (low).
    - Total number of genes detected (low).
    - Percentage of reads mapped to mitochondrial genes (high).
    - Percentage of reads mapped to spike-in transcripts (high).
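As a rough sketch of how the first three metrics could be computed from a genes x cells count matrix with NumPy: the thresholds, the `mt-` gene-name prefix and the function names are illustrative assumptions, not recommended defaults, and the spike-in percentage (computed the same way as the mitochondrial one) is left out for brevity.

```python
import numpy as np


def qc_metrics(counts, gene_names, mito_prefix="mt-"):
    """Per-cell QC metrics from a genes x cells count matrix."""
    counts = np.asarray(counts)
    total = counts.sum(axis=0)            # total counts per cell
    detected = (counts > 0).sum(axis=0)   # genes with at least one count
    is_mito = np.array([g.lower().startswith(mito_prefix) for g in gene_names])
    # Guard against division by zero for completely empty cells.
    mito_pct = 100.0 * counts[is_mito].sum(axis=0) / np.maximum(total, 1)
    return total, detected, mito_pct


def flag_low_quality(total, detected, mito_pct,
                     min_total=500, min_genes=3, max_mito_pct=10.0):
    """Flag cells failing any (hypothetical) fixed threshold."""
    return (total < min_total) | (detected < min_genes) | (mito_pct > max_mito_pct)


# Toy data: 3 genes x 3 cells.
genes = ["Actb", "Gapdh", "mt-Co1"]
toy_counts = np.array([
    [400, 5, 300],
    [300, 0, 100],
    [50,  2, 400],
])
total, detected, mito_pct = qc_metrics(toy_counts, genes)
flagged = flag_low_quality(total, detected, mito_pct)
```

In this toy example the second cell fails on total counts and genes detected, and the third on mitochondrial percentage. In practice thresholds are usually chosen per dataset rather than fixed in advance.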

12 of 53

Normalisation by sharing information across cells

  • Cells are pooled to increase counts and avoid problems with zeroes.
  • The size factor per pool is estimated robustly (median), to protect against differential expression (DE).
  • Clustering the data before pooling further protects against DE.
  • Solve a linear system to obtain a size factor per cell.
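The pooling idea above can be illustrated with a simplified NumPy sketch. It assumes a genes x cells matrix, uses sliding pools around a ring of cells, and solves the resulting system by least squares; the function name, pool sizes and toy data are assumptions, and the pre-clustering step and other refinements of real implementations are omitted.

```python
import numpy as np


def pooled_size_factors(counts, pool_sizes=(3, 4)):
    """Simplified pooling-based size factors for a genes x cells matrix."""
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[1]
    if n_cells <= max(pool_sizes):
        raise ValueError("need more cells than the largest pool size")
    reference = counts.mean(axis=1)   # average pseudo-cell
    keep = reference > 0              # ignore genes absent from the reference
    rows, pool_factors = [], []
    for size in pool_sizes:
        for start in range(n_cells):
            members = [(start + i) % n_cells for i in range(size)]
            pooled = counts[:, members].sum(axis=1)
            # The median ratio is robust to a minority of DE genes.
            pool_factors.append(np.median(pooled[keep] / reference[keep]))
            row = np.zeros(n_cells)
            row[members] = 1.0        # pool factor = sum of member cell factors
            rows.append(row)
    # Each pool gives one linear equation in the unknown per-cell factors.
    solution, *_ = np.linalg.lstsq(np.array(rows), np.array(pool_factors), rcond=None)
    return solution / solution.mean()


# Noise-free toy data: counts = gene mean x cell size factor.
gene_means = np.arange(1, 21, dtype=float)
true_factors = np.array([0.5, 1.0, 2.0, 1.5, 0.8, 1.2, 3.0])
toy_counts = np.outer(gene_means, true_factors)
estimated = pooled_size_factors(toy_counts)
```

On this noise-free example the estimates recover the true factors up to scaling (mean 1). With real sparse counts the estimates are only approximate, which is where larger pools help.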

13 of 53

Batch correction

  • Data generated by different labs, or at different times, suffer from batch effects.
  • Large datasets inevitably need to be processed in multiple batches.
  • Such effects need to be removed to be able to compare them.

Data from Nestorowa et al., Blood (2016); Paul et al., Cell (2015).

14 of 53

Batch correction – Mutual nearest neighbours

  • Methods developed for bulk RNA-seq data fail due to composition biases. They assume that:
    - Cell composition is identical across batches.
    - Systematic differences in mean expression are technical.
  • Instead, use mutual nearest neighbours to identify equivalent populations.
  • Use these to compute the batch effect.
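A minimal NumPy sketch of the mutual nearest neighbours idea follows. It is deliberately simplified: it estimates a single global batch vector from the MNN pairs, whereas the published method computes smoothed, cell-specific correction vectors. Function names, `k` and the toy populations are assumptions for illustration.

```python
import numpy as np


def mutual_nearest_pairs(batch_a, batch_b, k=3):
    """Return (i, j) pairs where a_i and b_j are in each other's k nearest."""
    dists = np.linalg.norm(batch_a[:, None, :] - batch_b[None, :, :], axis=2)
    nn_of_a = np.argsort(dists, axis=1)[:, :k]     # k nearest b cells per a cell
    nn_of_b = np.argsort(dists, axis=0)[:k, :].T   # k nearest a cells per b cell
    return [(i, int(j))
            for i in range(batch_a.shape[0])
            for j in nn_of_a[i]
            if i in nn_of_b[j]]


def correct_batch(batch_a, batch_b, k=3):
    """Shift batch_b onto batch_a using the mean difference over MNN pairs."""
    pairs = mutual_nearest_pairs(batch_a, batch_b, k)
    if not pairs:
        raise ValueError("no mutual nearest neighbours found")
    idx_a = [i for i, _ in pairs]
    idx_b = [j for _, j in pairs]
    batch_vector = (batch_b[idx_b] - batch_a[idx_a]).mean(axis=0)
    return batch_b - batch_vector, batch_vector, pairs


# Toy data: the batches share only one population (rows 0-3 of each).
offsets = np.array([[0.0, 0.0], [0.2, 0.0], [0.0, 0.2], [0.2, 0.2]])
batch_a = np.vstack([offsets, offsets + [10.0, 0.0]])   # shared + A-only
shift = np.array([2.0, 2.0])                            # simulated batch effect
batch_b = np.vstack([offsets, offsets + [0.0, 10.0]]) + shift  # shared + B-only
corrected_b, batch_vector, pairs = correct_batch(batch_a, batch_b)
```

Because only cells from the shared population form mutual pairs, the estimated batch vector is close to the simulated shift even though the two batches differ in composition.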

Data from Nestorowa et al., Blood (2016); Paul et al., Cell (2015).

15 of 53

Batch correction - scMerge

  • Identify genes that tend to be stably expressed within multiple datasets across multiple biological contexts.
  • Use these genes to compute the batch effect and identify pseudoreplicates to be merged closer between batches.

Lin et al (2019) 10.1073/pnas.1820006116

16 of 53

Dimensionality reduction with Principal Component Analysis

  • Identifies axes of maximal variance in high-dimensional data.
  • Each successive principal component (PC) explains less variance than the one before.

  • The first few (5-100) PCs can be used as a “summary” of the data.
  • Speed up downstream analyses by reducing dimensionality.
  • Focus on biology, remove random noise in later PCs.
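The points above can be checked on simulated data with a short NumPy sketch of PCA via singular value decomposition. It assumes a cells x genes matrix of log-expression values; the function name and the simulated data are illustrative.

```python
import numpy as np


def pca(log_expr, n_components=2):
    """PCA of a cells x genes matrix via SVD of the centred data."""
    centred = log_expr - log_expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centred, full_matrices=False)
    variance = s ** 2 / (log_expr.shape[0] - 1)
    explained = variance / variance.sum()   # fraction of variance per PC
    scores = u[:, :n_components] * s[:n_components]   # cell coordinates
    return scores, explained


# Simulated data: 100 cells x 20 genes, two groups differing in 5 genes.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 20))
data[:50, :5] += 3.0
scores, explained = pca(data, n_components=5)
```

Here `scores` is the low-dimensional summary passed to downstream steps, and `explained` is non-increasing, so later PCs capture progressively less variance.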

17 of 53

Visualisation in low dimensional space - PCA

  • The first 2-3 PCs can be used for visualisation.
  • Simple and efficient, but limited resolution of complex structure.

18 of 53

Visualisation in low dimensional space - t-SNE & UMAP

  • Preserve distances to neighbouring cells.
  • Non-linear: not limited to straight axes.

  • Powerful, but can be sensitive to the choice of random seed and parameters.

19 of 53

Single Cell Multiomics Data Integration and Visualisation

  • Why single cells? Single cell RNA-seq
  • Multimodal single cell data
  • Single cell data integration: horizontal, vertical and mosaic
  • Working with integrated single cell data
  • Static and dynamic visualization of multiomics single cell data

20 of 53

Single cell multiomics data

Stuart & Satija (2019)

21 of 53

Types of single cell multiomics

CITE-seq

10X Multiome

SeqFISH

22 of 53

Challenges in analysing single cell multiomics data

  • Quality control
    - Differences in quality between omics layers
  • Normalisation
  • Interpretation of each layer
  • Learning relationships between omics layers

23 of 53

Analysing single cell multiomics data

  • Readouts of each omics ‘layer’ may be affected by different experimental factors, or require different data normalisation and interpretation.

Clark et al (2018) 10.1038/s41467-018-03149-4

scNMT-seq

24 of 53

Analysing single cell multiomics data - MOFA

Argelaguet et al (2020) 10.1186/s13059-020-02015-1

25 of 53

Analysing single cell multiomics data - WNN

Hao et al (2021) 10.1016/j.cell.2021.04.048

26 of 53

Single Cell Multiomics Data Integration and Visualisation

  • Why single cells? Single cell RNA-seq
  • Multimodal single cell data
  • Single cell data integration: horizontal, vertical and mosaic
  • Working with integrated single cell data
  • Static and dynamic visualization of multiomics single cell data

27 of 53

Horizontal and vertical single cell data integration

Argelaguet, Cuomo et al (2021)

e.g. batch effects

e.g. single cell multiomics

28 of 53

Mosaic single cell data integration

29 of 53

seqFISH & scRNA-seq as Mosaic integration

Lohoff*, Ghazanfar* et al (2021) 10.1038/s41587-021-01006-2

naive approach: restrict to just intersecting features*
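The naive approach amounts to a set intersection on feature names. A minimal sketch, assuming each dataset is held as a dictionary from gene name to per-cell values; the function name, gene lists and values are hypothetical.

```python
def restrict_to_shared_features(expr_a, expr_b):
    """Keep only the features measured in both datasets."""
    shared = sorted(set(expr_a) & set(expr_b))
    restricted_a = {gene: expr_a[gene] for gene in shared}
    restricted_b = {gene: expr_b[gene] for gene in shared}
    return shared, restricted_a, restricted_b


# Hypothetical toy data: a targeted panel versus a wider scRNA-seq profile.
seqfish = {"Sox2": [3, 0], "T": [1, 4], "Gata1": [0, 2]}
scrna = {"Sox2": [5, 1, 0], "T": [0, 2, 2], "Gata1": [1, 0, 0],
         "Actb": [9, 8, 7], "Hba-x": [0, 0, 6]}
shared, seqfish_shared, scrna_shared = restrict_to_shared_features(seqfish, scrna)
dropped = sorted(set(scrna) - set(shared))
```

Everything in `dropped` is discarded before integration, which is the information loss that mosaic integration methods aim to avoid.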

31 of 53

Single cell mosaic data integration - StabMap

[Figure: observed data matrices (cells × features) and the resulting StabMap embedding (cells × dimensions).]

Ghazanfar et al (2022) bioRxiv 10.1101/2022.02.24.481823

32 of 53

Single cell mosaic data integration - StabMap

[Figure: panels "Observed data matrices" (cells × features; some datasets with no shared features), "Mosaic Data Topology", and "StabMap embedding" (cells × dimensions).]

Ghazanfar et al (2022) bioRxiv 10.1101/2022.02.24.481823
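A minimal sketch of a mosaic data topology, assuming it is a graph with one node per dataset and an edge wherever two datasets share at least one feature. This is not the StabMap implementation; it only shows how two datasets with no shared features can still be linked through a third. Dataset names, feature names and function names are hypothetical.

```python
from collections import deque
from itertools import combinations


def mosaic_data_topology(features_by_dataset):
    """Adjacency sets: datasets are linked if they share any feature."""
    edges = {name: set() for name in features_by_dataset}
    for a, b in combinations(features_by_dataset, 2):
        if set(features_by_dataset[a]) & set(features_by_dataset[b]):
            edges[a].add(b)
            edges[b].add(a)
    return edges


def shortest_path(edges, start, end):
    """Breadth-first search; returns a list of datasets, or None."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == end:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in sorted(edges[node]):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None


def is_connected(edges):
    """True if every dataset can be reached from the first one."""
    first = next(iter(edges))
    return all(shortest_path(edges, first, other) is not None for other in edges)


# Hypothetical example: RNA and ATAC share nothing, Multiome bridges them.
features = {
    "RNA": {"gene1", "gene2", "gene3"},
    "Multiome": {"gene2", "gene3", "peak1", "peak2"},
    "ATAC": {"peak1", "peak2", "peak3"},
}
topology = mosaic_data_topology(features)
path = shortest_path(topology, "RNA", "ATAC")
```

Here the path from RNA to ATAC passes through the Multiome dataset. A disconnected topology would indicate that some datasets cannot be related through any chain of shared features.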

37 of 53

Single cell mosaic data integration – Bridge integration using dictionary learning

Hao et al (2022) bioRxiv 10.1101/2022.02.24.481684

38 of 53

Single cell mosaic data integration – UINMF

Kriebel et al (2022) 10.1038/s41467-022-28431-4

39 of 53

Single cell mosaic data integration – MultiMAP

Jain et al (2021) 10.1186/s13059-021-02565-y

Output:

  • Graph
  • 2D UMAP

40 of 53

Single Cell Multiomics Data Integration and Visualisation

  • Why single cells? Single cell RNA-seq
  • Multimodal single cell data
  • Single cell data integration: horizontal, vertical and mosaic
  • Working with integrated single cell data
  • Static and dynamic visualization of multiomics single cell data

41 of 53

Working with integrated single cell data

  • Supervised learning and reference-based mapping

  • Joint clustering to discover new biology

http://bioconductor.org/books/3.14/OSCA

42 of 53

Working with integrated single cell data

  • Graph based inference, e.g. Milo to test for changes in abundance of cells across experimental conditions

  • Estimating differentiation trajectories

Dann et al (2021) 10.1038/s41587-021-01033-z

43 of 53

Working with integrated single cell data

  • Imputation of missing modalities

  • Bespoke methods specific to certain modalities, e.g. RNA velocity

Lohoff*, Ghazanfar* et al (2021) 10.1038/s41587-021-01006-2

44 of 53

Single cell and multiomics data containers

  • R & Bioconductor:
    - SingleCellExperiment
    - MultiAssayExperiment
    - SpatialExperiment
    - Seurat
  • Python:
    - scanpy (AnnData)
    - squidpy
    - muon

  • zellkonverter is a Bioconductor package that converts between Python AnnData objects and SingleCellExperiment objects.

muon

Bredikhin et al (2022) 10.1186/s13059-021-02577-8

45 of 53

Single Cell Multiomics Data Integration and Visualisation

  • Why single cells? Single cell RNA-seq
  • Multimodal single cell data
  • Single cell data integration: horizontal, vertical and mosaic
  • Working with integrated single cell data
  • Static and dynamic visualization of multiomics single cell data

46 of 53

Common issues visualising single cell multiomics data

  • Overplotting
    - Density-based plots, e.g. the schex package

  • Unwieldy vector graphics
    - Rasterise the points layer using scattermore or ggrastr

  • Too high dimensional
    - Non-linear 2D embeddings like t-SNE/UMAP
    - Generate 3D plots (static or dynamic)

  • Want to display multiple views
    - Animations using gganimate
    - Repeated plots with various colour scales

Freytag et al (2020) 10.1093/bioinformatics/btz907
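To illustrate the density-based remedy for overplotting, here is a minimal NumPy sketch that bins a 2D embedding and summarises a per-cell value within each bin. It uses square bins rather than hexagons, so it shows the principle rather than the cited package's method; the function name and simulated data are assumptions.

```python
import numpy as np


def bin_embedding(coords, values, n_bins=30):
    """Summarise cells on a 2D grid: cell count and mean value per bin."""
    x, y = coords[:, 0], coords[:, 1]
    counts, x_edges, y_edges = np.histogram2d(x, y, bins=n_bins)
    sums, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges], weights=values)
    with np.errstate(invalid="ignore", divide="ignore"):
        # Empty bins have no mean; mark them as NaN.
        mean_value = np.where(counts > 0, sums / counts, np.nan)
    return counts, mean_value


# Simulated embedding of 5,000 cells with one expression value per cell.
rng = np.random.default_rng(1)
embedding = rng.normal(size=(5000, 2))
expression = rng.poisson(2.0, size=5000).astype(float)
bin_counts, bin_means = bin_embedding(embedding, expression)
```

Plotting `bin_means` as an image shows one colour per bin, so dense regions are summarised rather than hidden under overlapping points.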

https://marionilab.cruk.cam.ac.uk/MouseGastrulation2018/

scRNA-seq

seqFISH

47 of 53

Visualising single cell multiomics data jointly

48 of 53

Interactive single cell visualisation platforms

49 of 53

Interactive single cell visualisation platforms - Vitessce

50 of 53

Interactive single cell visualisation platforms - Shiny

https://crukci.shinyapps.io/SpatialMouseAtlas/

51 of 53

Single Cell Multiomics Data Integration and Visualisation

  • Why single cells? Single cell RNA-seq
  • Multimodal single cell data
  • Single cell data integration: horizontal, vertical and mosaic
  • Working with integrated single cell data
  • Static and dynamic visualization of multiomics single cell data

52 of 53

Additional resources

  • What was not covered
    - Deep learning techniques for single cell data integration, e.g. scVI https://github.com/YosefLab/scVI/
    - Interpretation of factor models such as MOFA (see Day 2)
    - Comparative single cell studies, e.g. Shafer et al (2019) 10.3389/fcell.2019.00175
    - Visualising single cell networks (ugly or not ☺)

53 of 53

Thank you! Questions welcome

  • Single cell RNA-seq course materials: Aaron Lun, Ximena Ibarra-Soria, John Marioni
  • Past and present members of the John Marioni lab for sharing their wisdom
  • Scientific collaborators: Long Cai, Wolf Reik, Jenny Nichols, Bertie Gottgens, Ben Simons, Shankar Srinivas, Dana Pe’er, Kat Hadjantonakis, James Briscoe, Carolina Guibentif, Richard Tyser, Nico Pierson, Noushin Koulena, Tim Lohoff

@shazanfar
Shila.Ghazanfar@cruk.cam.ac.uk
Shila.Ghazanfar@sydney.edu.au

From May 2022: please get in touch!
  - PhD Student
  - Collaborations

The Ghazanfar lab will focus on developing statistical approaches for spatial genomics at single cell resolution.

  • Single cell mosaic data integration (StabMap)
  • Spatial reconstruction of scRNA-seq (SageNet, E. Heidari)
  • Data analysis strategies (scHOT)
  • Novel feature extraction and transfer learning