Doublet identification and characterization in single-cell data with scDblFinder
Pierre-Luc Germain (1,2), Mark D. Robinson (1)
1 Lab of Statistical Bioinformatics, IMLS, University of Zürich
2 Institute for Neuroscience, D-HEST, ETH Zürich
(Adapted from DePasquale et al. 2019)
A doublet (or multiplet) is defined as two (or more) cells captured in a single reaction volume (i.e. well or droplet)
Experimental approaches and their limitations
(Adapted from Kang et al. 2019)
Samples of diverse genetic backgrounds
Limitation: doublets between cells of the same sample go undetected
The more cells we try to capture (in a single sample/capture), the higher the proportion of doublets.
Genetically-derived doublets can be used as (imperfect) truth on which to evaluate doublet detection methods
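The two points above can be put into numbers with a small base R sketch. It assumes a doublet rate of roughly 1% per 1,000 captured cells (a common rule of thumb for droplet platforms, not a figure from these slides) and random pairing of cells; the function names are hypothetical.

```r
# Assumption: ~1% doublets per 1,000 cells captured (rule of thumb only).
expected_doublet_rate <- function(n_cells, rate_per_1000 = 0.01) {
  rate_per_1000 * n_cells / 1000
}

# Fraction of doublets whose two cells come from different samples, given the
# share of cells contributed by each sample and assuming random pairing.
# Only these doublets are visible to genotype-based demultiplexing.
detectable_fraction <- function(sample_props) {
  sample_props <- sample_props / sum(sample_props)
  1 - sum(sample_props^2)
}

n_cells <- c(1000, 5000, 10000)
rates <- expected_doublet_rate(n_cells)
data.frame(n_cells, doublet_rate = rates, n_doublets = rates * n_cells)

# Four equally sized samples pooled in one capture
detectable_fraction(rep(1, 4))
```

Under these assumptions, pooling more samples raises the share of doublets that genetics can flag, but it never reaches 100%, which is why the genetic truth is imperfect.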
Big families of doublet detection methods
Based on gene co-expression
Rationale:
identify genes that tend not to be expressed together, and flag cells that co-express them
Many informal approaches rely on cluster-based marker identification and co-expression of those markers; others use general binary co-expression across cells (e.g. scds::cxds)
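A toy illustration of the co-expression rationale, in base R. This is a simplified sketch and not the scds::cxds implementation: the data, the log-ratio weighting, and all variable names are assumptions made for the example.

```r
# Toy binarized expression: genes 1-2 mark type A, genes 3-4 mark type B
n_a <- 50; n_b <- 50
expr <- cbind(
  matrix(rep(c(1, 1, 0, 0), n_a), nrow = 4),
  matrix(rep(c(0, 0, 1, 1), n_b), nrow = 4),
  matrix(1, nrow = 4, ncol = 3)   # three "doublets" expressing all markers
)
rownames(expr) <- paste0("gene", 1:4)

# For each gene pair: observed co-expression vs. expectation under independence
n <- ncol(expr)
observed <- tcrossprod(expr)          # genes x genes co-expression counts
freq <- rowMeans(expr)
expected <- outer(freq, freq) * n

# Weight only the pairs that are co-expressed less often than expected
weight <- pmax(log(expected / pmax(observed, 0.5)), 0)
diag(weight) <- 0

# Cell score: sum of weights over the gene pairs that the cell co-expresses
score <- vapply(seq_len(n), function(j) {
  g <- expr[, j] > 0
  sum(weight[g, g]) / 2
}, numeric(1))

# Singlets (first 100 cells) vs. the three simulated doublets
range(score[1:100])
score[101:103]
```

Cells that express markers of both types accumulate weight from the rarely co-expressed pairs, while cells expressing a single marker set score zero.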
Based on artificial doublets
Rationale:
generate doublets by combining real cells in silico, and flag real cells that are very similar to the artificial ones.
Using machine learning (scds::bcds) or the count of artificial doublets in the neighborhood (DoubletFinder, Scrublet, etc.)
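The artificial-doublet rationale can be sketched in a few lines of base R. This is a minimal illustration on simulated Poisson counts, not the procedure of any of the packages named above; the square-root transform, k = 10, and the brute-force distance matrix are assumptions chosen for simplicity.

```r
set.seed(42)
n_genes <- 20
# Two toy cell types with distinct mean expression profiles
mu_a <- c(rep(10, 10), rep(1, 10))
mu_b <- c(rep(1, 10), rep(10, 10))
sim <- function(n, mu) matrix(rpois(n * n_genes, mu), nrow = n_genes)
singlets <- cbind(sim(100, mu_a), sim(100, mu_b))
# Five real heterotypic doublets: one A cell plus one B cell
real_dbl <- sim(5, mu_a) + sim(5, mu_b)
real <- cbind(singlets, real_dbl)

# Artificial doublets: sum the counts of randomly paired real cells
n_art <- 100
i <- sample(ncol(real), n_art, replace = TRUE)
j <- sample(ncol(real), n_art, replace = TRUE)
artificial <- real[, i] + real[, j]

# Score each cell by the fraction of artificial doublets among its k nearest
# neighbours (brute-force Euclidean distances on sqrt-transformed counts)
combined <- sqrt(t(cbind(real, artificial)))   # cells x genes
is_art <- rep(c(FALSE, TRUE), c(ncol(real), n_art))
d <- as.matrix(dist(combined))
diag(d) <- Inf
k <- 10
score <- apply(d, 1, function(x) mean(is_art[order(x)[1:k]]))
real_score <- score[!is_art]

# Mean score of real singlets vs. real doublets
c(singlets = mean(real_score[1:200]), doublets = mean(real_score[201:205]))
```

The neighbourhood count is the simplest form of the idea; the machine-learning variant would instead train a classifier to separate real from artificial cells.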
One stop for doublet detection!
Doublet methods formerly in scran:
library(scDblFinder); library(BiocParallel)  # BiocParallel provides MulticoreParam
sce <- scDblFinder(sce, samples=sce$sample_id, BPPARAM=MulticoreParam(3))
scDblFinder thresholding
Minimizing: misclassification of artificial doublets + proportion deviation from expected # of doublets
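A two-term cost of this kind can be sketched in base R. The exact cost function used by scDblFinder is not given on the slide, so the form below (miss rate among artificial doublets plus relative deviation from the expected doublet proportion), the simulated scores, and the 5% expected rate are all assumptions for illustration.

```r
set.seed(7)
# Toy classifier scores in [0, 1]: real cells mostly low, artificial mostly high
real_scores <- c(rbeta(950, 1, 8), rbeta(50, 8, 1))
art_scores  <- rbeta(500, 8, 2)
expected_rate <- 0.05   # assumed expected doublet proportion among real cells

cost <- function(t) {
  miss <- mean(art_scores < t)        # artificial doublets called singlets
  called <- mean(real_scores >= t)    # real cells called doublets
  deviation <- abs(called - expected_rate) / expected_rate
  miss + deviation
}

# Evaluate the cost on a grid of candidate thresholds and keep the minimum
grid <- seq(0.01, 0.99, by = 0.01)
costs <- vapply(grid, cost, numeric(1))
threshold <- grid[which.min(costs)]
threshold
```

The second term keeps the threshold from drifting to an extreme when the classifier separates artificial doublets poorly.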
Multiple samples
scDblFinder handles multiple samples in a batch-aware and efficient manner:
SCE -> split by sample -> artificial doublets + kNN stats -> train classifier -> thresholding -> SCE
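Why splitting by sample matters can be shown with a small base R sketch: each sample is a separate capture, so the expected number of doublets depends on the size of that capture rather than on the whole dataset. The table, the stand-in scores, the 1%-per-1,000-cells rate, and the helper name are hypothetical, and this is not how scDblFinder itself is implemented.

```r
set.seed(3)
# Hypothetical per-cell table: sample label plus a stand-in doublet score
cells <- data.frame(
  cell = paste0("c", 1:9000),
  sample_id = rep(c("s1", "s2", "s3"), c(2000, 3000, 4000)),
  score = runif(9000)
)

rate_per_1000 <- 0.01   # assumed rule of thumb, not taken from the slides

call_one_sample <- function(df) {
  # Expected doublets are computed from the size of this capture only
  n_expected <- round(rate_per_1000 * nrow(df) / 1000 * nrow(df))
  df$class <- "singlet"
  top <- order(df$score, decreasing = TRUE)[seq_len(n_expected)]
  df$class[top] <- "doublet"
  df
}

# Split by sample, process each capture, then restore the original cell order
by_sample <- lapply(split(cells, cells$sample_id), call_one_sample)
result <- unsplit(by_sample, cells$sample_id)
table(result$sample_id, result$class)
```

Because the per-sample steps are independent, the lapply call is also the natural place to parallelize.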
Okay, but does it work?
Current top method: DoubletFinder
scDblFinder vs DoubletFinder
better accuracy, approximately 10x faster
Doublet enrichment analysis
Two types of putative enrichments:
A specific combination of clusters tends to form more doublets than expected:
A specific cluster tends to form more doublets than expected (cluster stickiness):
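Both kinds of enrichment can be illustrated against a random-pairing null in base R. The cluster sizes and observed counts below are made up, and the Poisson test and the crude stickiness ratio are assumptions for the example rather than the models used by scDblFinder.

```r
# Hypothetical cluster sizes and total number of doublets
cluster_n <- c(A = 500, B = 300, C = 200)
p <- cluster_n / sum(cluster_n)
n_doublets <- 100

# Under random pairing, a doublet combines clusters i and j (i != j)
# with probability 2 * p_i * p_j
pairs <- t(combn(names(p), 2))
expected <- n_doublets * 2 * p[pairs[, 1]] * p[pairs[, 2]]
names(expected) <- paste(pairs[, 1], pairs[, 2], sep = "+")
observed <- c(25, 35, 12)   # made-up counts for A+B, A+C, B+C
names(observed) <- names(expected)

# Combination enrichment: upper-tail Poisson probability of the observed count
pval <- ppois(observed - 1, lambda = expected, lower.tail = FALSE)

# Cluster stickiness (crude): observed / expected doublets involving a cluster
involves <- sapply(names(p), function(cl) pairs[, 1] == cl | pairs[, 2] == cl)
stickiness <- colSums(observed * involves) / colSums(expected * involves)

data.frame(expected, observed, pval)
stickiness
```

In this toy example the A+C combination stands out, and cluster C has the highest observed-to-expected ratio.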
Take home messages
Also on Bioconductor
Under constant improvement, so use a recent version!
Thanks!
Will Macnair
Aaron Lun
Mark & the Robinson group
The folks at ETH’s D-HEST Institute for Neuroscience