1 of 18

Quality control and exploration of scRNA-seq datasets

Yesid Cuesta Astroz

Associate Professor

Universidad de Antioquia

School of Microbiology

yesid.cuesta@udea.edu.co

2 of 18

Quality Control

Motivation:

'Cells' featuring one or more of:

⚠️ Low total counts

⚠️ Few expressed genes

⚠️ High proportion of reads coming from mitochondria

3 of 18

Quality Control (consequences)

  • Distinct cluster(s) complicating interpretation of results
  • Distortion of population heterogeneity
  • Artificial 'upregulation' of certain genes

4 of 18

5 of 18

Quality Control

  • Quality control is performed to ensure that the data quality is sufficient for downstream analysis. As “sufficient data quality” cannot be determined a priori, it is judged based on downstream analysis performance (e.g., cluster annotation).

  • Thus, it may be necessary to revisit quality control decisions multiple times when analysing the data. Often it is beneficial to start with permissive QC thresholds and investigate the effects of these thresholds before going back to perform more stringent QC.

  • This approach is particularly relevant for datasets containing heterogeneous cell populations where cell types or states may be misinterpreted as low-quality outlier cells.

Luecken et al., 2019
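
The "permissive first, stringent later" strategy can be sketched as follows. This is a minimal illustration, not a recommended setting: the cell values, the threshold numbers, and the helper `passing_cells` are all hypothetical and were chosen only to show how the set of retained cells changes between two rounds of QC.

```python
# Toy per-cell QC metrics (hypothetical values, for illustration only).
cells = [
    {"id": "c1", "n_counts": 5200, "n_genes": 1800, "pct_mito": 3.1},
    {"id": "c2", "n_counts": 900, "n_genes": 450, "pct_mito": 8.0},
    {"id": "c3", "n_counts": 300, "n_genes": 150, "pct_mito": 35.0},
    {"id": "c4", "n_counts": 7000, "n_genes": 2500, "pct_mito": 12.0},
    {"id": "c5", "n_counts": 650, "n_genes": 320, "pct_mito": 4.5},
]


def passing_cells(cells, min_counts, min_genes, max_pct_mito):
    """Return the ids of cells that satisfy all three thresholds."""
    return [
        c["id"]
        for c in cells
        if c["n_counts"] >= min_counts
        and c["n_genes"] >= min_genes
        and c["pct_mito"] <= max_pct_mito
    ]


# Round 1: permissive thresholds (assumed values).
permissive = passing_cells(cells, min_counts=500, min_genes=200, max_pct_mito=20.0)

# Round 2: more stringent thresholds (assumed values).
stringent = passing_cells(cells, min_counts=1000, min_genes=500, max_pct_mito=10.0)

# Cells kept only under the permissive settings: inspect these downstream
# (e.g. do they form a real cell type or a low-quality cluster?) before
# deciding to drop them.
only_permissive = [cid for cid in permissive if cid not in stringent]

print("permissive:", permissive)
print("stringent:", stringent)
print("to inspect:", only_permissive)
```

The cells listed under "to inspect" are the ones whose fate depends on the threshold choice, so they are the ones worth checking in the downstream analysis before tightening QC.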

6 of 18

Quality Control

Kiselev et al., 2019

7 of 18

Quality Control

  • Metrics
    • RNA count (or count depth)
    • Feature count (or gene count)
    • Mitochondria content
  • Recommendations
    • Identify and discard outliers
    • Different samples may require different cutoffs.
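
One common way to "identify and discard outliers" with a separate cutoff per sample is a median absolute deviation (MAD) rule. The sketch below is one possible approach, not necessarily the one used in this course: the sample names, the mitochondrial percentages, the choice of 3 MADs, and the helper `mad_bounds` are assumptions, and the MAD is used unscaled for simplicity.

```python
from statistics import median


def mad_bounds(values, n_mads=3.0):
    """Return (lower, upper) bounds at n_mads median absolute deviations."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    return med - n_mads * mad, med + n_mads * mad


# Hypothetical mitochondrial percentages per cell, grouped by sample.
pct_mito_by_sample = {
    "sample_A": [2.0, 3.0, 2.5, 3.5, 4.0, 30.0],
    "sample_B": [8.0, 9.0, 10.0, 11.0, 12.0, 40.0],
}

upper_cutoff = {}
flagged = {}
for sample, values in pct_mito_by_sample.items():
    # Bounds are computed within each sample, so each gets its own cutoff.
    _, upper = mad_bounds(values)
    upper_cutoff[sample] = upper
    flagged[sample] = [v for v in values if v > upper]

print(upper_cutoff)
print(flagged)
```

In this toy data the two samples end up with different upper cutoffs, so a cell with 12% mitochondrial content is kept in sample_B but would have been flagged under sample_A's cutoff.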

8 of 18

Quality Control

  • Number of unique genes detected in each cell.
  • Low-quality cells or empty droplets will often have very few genes.
  • Cell doublets or multiplets may exhibit an aberrantly high gene count.
  • Technical terms:
    • Feature count = number of genes
    • RNA count = number of UMIs
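
The two terms above can be made concrete with a toy UMI count matrix. The gene and cell names and all counts below are made up for illustration.

```python
# Toy UMI count matrix: one dictionary of gene -> UMI count per cell.
umi_counts = {
    "cell_1": {"GeneA": 5, "GeneB": 0, "GeneC": 2, "GeneD": 1},
    "cell_2": {"GeneA": 0, "GeneB": 0, "GeneC": 1, "GeneD": 0},
    "cell_3": {"GeneA": 12, "GeneB": 7, "GeneC": 9, "GeneD": 4},
}

# RNA count (count depth): total number of UMIs in the cell.
rna_count = {cell: sum(genes.values()) for cell, genes in umi_counts.items()}

# Feature count: number of genes with at least one UMI in the cell.
feature_count = {
    cell: sum(1 for n in genes.values() if n > 0)
    for cell, genes in umi_counts.items()
}

print(rna_count)      # {'cell_1': 8, 'cell_2': 1, 'cell_3': 32}
print(feature_count)  # {'cell_1': 3, 'cell_2': 1, 'cell_3': 4}
```

Here cell_2 has the profile expected of a low-quality cell or empty droplet (very few genes and UMIs), while an unusually high value in both metrics, as in cell_3, is the pattern that may point to a doublet.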

Doublet = two cells lysed and sequenced within the same droplet.

DePasquale et al., 2019

9 of 18

Quality Control

  • The doublet rate increases approximately linearly with the number of cells.
  • Sensitivity: roughly 50-100 cells with a unique transcriptome are needed to identify a population cluster.
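
The linear relationship between doublet rate and cell number can be sketched numerically. The coefficient used here (about 0.8% doublets per 1,000 cells) is an assumed rule of thumb for droplet-based platforms and is not taken from these slides; the real value depends on the platform and protocol. The function name `expected_doublets` is hypothetical.

```python
def expected_doublets(n_cells, rate_per_1000_cells=0.008):
    """Return (doublet_rate, expected_number_of_doublets) for n_cells.

    rate_per_1000_cells is an assumed coefficient, not a measured value.
    """
    doublet_rate = rate_per_1000_cells * n_cells / 1000
    return doublet_rate, doublet_rate * n_cells


for n in (1000, 5000, 10000):
    rate, n_doublets = expected_doublets(n)
    print(f"{n:>6} cells: ~{rate:.1%} doublet rate, ~{n_doublets:.0f} doublets")
```

Because the rate itself grows with the number of cells, the absolute number of doublets grows faster than linearly, which is one reason to weigh cell number against doublet burden when planning an experiment.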

10 of 18

Quality Control

  • Mitochondrial RNA

    • Elevated levels can result from very harsh conditions in the tissue dissociation step.

    • Dying cells release their cytoplasmic contents.

  • Metric: the percentage of reads that map to the mitochondrial genome.
  • Low-quality or dying cells often exhibit extensive mitochondrial contamination.
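
The mitochondrial percentage described above can be computed per cell as follows. This is a minimal sketch: the counts are made up, the helper `percent_mito` is hypothetical, and it assumes human gene symbols, where mitochondrial genes carry the `MT-` prefix (other species use different naming, e.g. `mt-` in mouse).

```python
def percent_mito(gene_counts, mito_prefix="MT-"):
    """Percentage of a cell's counts that come from mitochondrial genes."""
    total = sum(gene_counts.values())
    if total == 0:
        return 0.0  # avoid division by zero for empty barcodes
    mito = sum(n for gene, n in gene_counts.items() if gene.startswith(mito_prefix))
    return 100.0 * mito / total


# Hypothetical counts for two cells.
healthy_cell = {"MT-CO1": 5, "MT-ND1": 3, "ACTB": 60, "GAPDH": 32}
dying_cell = {"MT-CO1": 30, "MT-ND1": 20, "ACTB": 40, "GAPDH": 10}

print(percent_mito(healthy_cell))  # 8.0
print(percent_mito(dying_cell))    # 50.0
```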

11 of 18

Quality control software options

12 of 18

Workflow

13 of 18

SingleCellExperiment Object (Bioconductor)

  • Primary data
    • Count matrix
    • Transformed data
  • Metadata
    • Cell
    • Feature
    • Experiment
  • Dimensionality reduction
    • PCA, t-SNE, etc.

Alternative experiments
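
To make the layout of the container concrete, the components listed above can be mirrored in a small Python data structure. This is only an analogy: the class `SingleCellContainer` and its field names are hypothetical and are not the Bioconductor API. In R, the corresponding parts of a SingleCellExperiment are reached with accessors such as `assays()`, `colData()`, `rowData()`, `metadata()`, `reducedDims()` and `altExps()`.

```python
from dataclasses import dataclass, field


@dataclass
class SingleCellContainer:
    """Hypothetical Python analogy of the container layout (not the real API)."""

    assays: dict = field(default_factory=dict)               # primary and transformed data
    cell_metadata: dict = field(default_factory=dict)        # one entry per cell
    feature_metadata: dict = field(default_factory=dict)     # one entry per gene
    experiment_metadata: dict = field(default_factory=dict)  # experiment-level notes
    reduced_dims: dict = field(default_factory=dict)         # PCA, t-SNE, etc.
    alt_experiments: dict = field(default_factory=dict)      # other feature sets


# Toy object: 2 genes x 3 cells, all values made up.
sce = SingleCellContainer(
    assays={"counts": [[5, 0, 12], [2, 1, 9]]},
    cell_metadata={"cell_1": {"sample": "A"}, "cell_2": {"sample": "A"}, "cell_3": {"sample": "B"}},
    feature_metadata={"GeneA": {"chromosome": "1"}, "GeneC": {"chromosome": "MT"}},
    experiment_metadata={"protocol": "droplet-based"},
)

# Every assay shares the same genes x cells shape, so metadata and
# dimensionality reductions stay aligned with the count matrix.
n_genes = len(sce.assays["counts"])
n_cells = len(sce.assays["counts"][0])
print(n_genes, n_cells)
```

Keeping counts, metadata and reduced dimensions in one object is what allows a QC filter on cells to be applied consistently to every component at once.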

14 of 18

Seurat object

15 of 18

Seurat object

16 of 18

Data tables

17 of 18

Data tables

18 of 18

THANKS