1 of 16

Doublet identification and characterization in single-cell data with scDblFinder

Pierre-Luc Germain¹,², Mark D. Robinson¹

1 Lab of Statistical Bioinformatics, IMLS, University of Zürich

2 Institute for Neuroscience, D-HEST, ETH Zürich

2 of 16

(Adapted from DePasquale et al. 2019)

A doublet (or multiplet) is defined as two (or more) cells captured in a single reaction volume (i.e. well or droplet)

Homotypic doublets form between two cells of the same type.
Heterotypic or neotypic doublets form between cells of different types.

Neotypic doublets can form their own clusters and appear as spurious ‘new cell types’

Doublets can disrupt trajectory analysis and coexpression analysis

3 of 16

Experimental approaches and their limitations

(Adapted from Kang et al. 2019)

Samples of diverse genetic backgrounds

4 of 16

Doublets between cells of the same sample

5 of 16

(Adapted from Kang et al. 2019)

The more cells we try to capture (in a single sample/capture), the higher the proportion of doublets.
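As a rough illustration of this relationship, the sketch below assumes the commonly quoted rule of thumb for droplet platforms of about 1% doublets per 1000 cells captured. Both that rate and the helper name `expected_doublet_rate` are assumptions made for illustration; they do not come from the slides, and the true rate depends on the platform and protocol.

```r
# Hypothetical helper: expected doublet proportion for a single capture,
# ASSUMING ~1% doublets per 1000 cells captured (rule of thumb, not a measured value).
expected_doublet_rate <- function(n_cells, rate_per_1000 = 0.01) {
  rate_per_1000 * n_cells / 1000
}

n_cells <- c(1000, 5000, 10000)
rates <- expected_doublet_rate(n_cells)

# The proportion grows linearly with the number of cells captured,
# so the absolute number of doublets grows quadratically.
data.frame(cells = n_cells,
           doublet_rate = rates,
           expected_doublets = round(rates * n_cells))
```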

Genetically-derived doublets can be used as (imperfect) truth on which to evaluate doublet detection methods

6 of 16

Big families of doublet detection methods

Based on gene co-expression

Rationale:

identify genes that tend not to be expressed together, and flag cells that co-express them

Many informal approaches rely on identifying cluster-based markers and checking for their co-expression; others use general binary co-expression across cells (e.g. scds::cxds)
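The co-expression rationale can be sketched in a few lines of base R. This is a toy illustration loosely inspired by the binary co-expression idea, not the scds::cxds implementation: the gene names, the simulated data, and the binomial scoring are all assumptions.

```r
# Toy binarised expression (genes x cells): A1/A2 mark type A, B1/B2 mark type B.
expr <- cbind(
  matrix(c(1, 1, 0, 0), nrow = 4, ncol = 40),  # 40 cells of type A
  matrix(c(0, 0, 1, 1), nrow = 4, ncol = 40),  # 40 cells of type B
  c(1, 1, 1, 1)                                # one cell expressing everything
)
rownames(expr) <- c("A1", "A2", "B1", "B2")

n   <- ncol(expr)
p   <- rowMeans(expr)        # detection rate of each gene
obs <- expr %*% t(expr)      # number of cells co-expressing each gene pair

# Score each gene pair by how surprisingly RARE its co-expression is
# if the two genes were expressed independently (binomial lower tail).
pair_score <- matrix(-pbinom(obs, size = n, prob = outer(p, p), log.p = TRUE),
                     nrow = nrow(expr))
diag(pair_score) <- 0

# Cell score: sum of pair scores over the gene pairs that the cell co-expresses.
cell_score <- colSums(expr * (pair_score %*% expr)) / 2
which.max(cell_score)        # index of the most doublet-like cell
```

In this toy data the last cell co-expresses markers that are otherwise mutually exclusive, so it receives the highest score.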

Based on artificial doublets

Rationale:

generate doublets by combining real cells in silico, and flag real cells that are very similar to the artificial ones.

Using machine learning (scds::bcds) or the count of artificial doublets in the neighborhood (DoubletFinder, Scrublet, etc.)
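The artificial-doublet rationale can be sketched as follows, using base R only. This is a simplified illustration under stated assumptions (simulated Poisson counts, random pairing of cells, a plain PCA, and a brute-force neighbor search); it is not the procedure used by any of the packages named above.

```r
set.seed(42)
# Simulated counts (genes x cells): two cell types with distinct programs.
mu1 <- c(rep(10, 25), rep(1, 25))
mu2 <- rev(mu1)
sim <- function(mu, n) matrix(rpois(length(mu) * n, mu), nrow = length(mu))
counts <- cbind(sim(mu1, 100), sim(mu2, 100),  # 200 singlets
                sim(mu1 + mu2, 10))            # 10 hidden heterotypic doublets

# 1. Artificial doublets: sum the counts of random pairs of real cells.
n_art <- 200
pairs <- replicate(n_art, sample(ncol(counts), 2))
artificial <- counts[, pairs[1, ]] + counts[, pairs[2, ]]

# 2. Embed real and artificial cells together (log-normalise, then PCA).
all_cells <- cbind(counts, artificial)
logn <- log1p(t(t(all_cells) / colSums(all_cells)) * 1e3)
pcs <- prcomp(t(logn), rank. = 5)$x
is_art <- rep(c(FALSE, TRUE), c(ncol(counts), n_art))

# 3. Score each real cell by the fraction of artificial doublets
#    among its k nearest neighbours.
k <- 15
d <- as.matrix(dist(pcs))
score <- sapply(which(!is_art), function(i) {
  nn <- order(d[i, ])[2:(k + 1)]  # position 1 is the cell itself
  mean(is_art[nn])
})

tapply(score, rep(c("singlet", "doublet"), c(200, 10)), mean)
```

Random pairing also produces homotypic artificial doublets, which resemble singlets after normalisation and inflate singlet scores; this is one reason real methods are more careful about how they generate and use artificial doublets.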

7 of 16

One stop for doublet detection!

New comprehensive method:

scDblFinder

Doublet methods formerly in scran:

computeDoubletDensity (formerly scran::doubletCells)
recoverDoublets
findDoubletClusters (formerly scran::doubletCluster)

https://github.com/plger/scDblFinder

8 of 16

library(scDblFinder); library(BiocParallel)  # BiocParallel provides MulticoreParam
sce <- scDblFinder(sce, samples=sce$sample_id, BPPARAM=MulticoreParam(3))

9 of 16

scDblFinder thresholding

Minimizing:

Misclassification of artificial doublets + proportion deviation from expected # of doublets
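A minimal sketch of such a thresholding rule is given below. The simulated scores, the equal weighting of the two terms, and the grid search are assumptions for illustration; the exact cost function used by scDblFinder may differ.

```r
set.seed(7)
# Simulated classifier scores (assumed, for illustration only).
real_score <- c(rbeta(950, 1, 8), rbeta(50, 8, 1))  # real cells, mostly singlets
art_score  <- rbeta(300, 8, 1)                      # artificial doublets
expected   <- 50                                    # expected number of real doublets

# Cost of a threshold = misclassified artificial doublets
#                     + proportional deviation from the expected doublet count.
threshold_cost <- function(th) {
  missed    <- mean(art_score < th)                 # artificial doublets called singlets
  n_called  <- sum(real_score >= th)                # real cells called doublets
  deviation <- abs(n_called - expected) / expected
  missed + deviation
}

grid  <- seq(0.01, 0.99, by = 0.01)
costs <- vapply(grid, threshold_cost, numeric(1))
best  <- grid[which.min(costs)]                     # threshold with the lowest cost
n_doublets <- sum(real_score >= best)               # real cells called doublets at that threshold
```

The second term keeps the threshold from drifting to a value that separates artificial doublets well but calls far more, or far fewer, real doublets than the capture size would suggest.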

10 of 16

Multiple samples

scDblFinder handles multiple samples in a batch-aware and efficient manner:

[Workflow diagram: SCE, split by sample, artificial doublets + kNN stats, train classifier, thresholding]

11 of 16

Okay, but does it work?

12 of 16

Benchmarking computational doublet-detection methods for single-cell RNA

Xi and Li, preprint

Current top method:

DoubletFinder

13 of 16

scDblFinder vs DoubletFinder

better accuracy, approximately 10x faster

14 of 16

Doublet enrichment analysis

Two types of putative enrichments:

A specific combination of clusters tends to form more doublets than expected:

A specific cluster tends to form more doublets than expected (cluster stickiness):
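The first type of enrichment can be illustrated with a simple expectation under random pairing. The cluster sizes, the observed counts, and the Poisson test below are hypothetical and chosen for illustration only; this is not the test implemented in the scDblFinder package.

```r
# Hypothetical cluster sizes and observed heterotypic doublets by inferred origin.
cluster_sizes <- c(A = 500, B = 300, C = 200)
observed <- c("A+B" = 30, "A+C" = 20, "B+C" = 50)

# Under random pairing, a doublet between clusters i and j (i != j)
# occurs with probability proportional to 2 * p_i * p_j.
prop   <- cluster_sizes / sum(cluster_sizes)
combos <- combn(names(prop), 2)
expected_prop <- apply(combos, 2, function(x) unname(2 * prop[x[1]] * prop[x[2]]))
names(expected_prop) <- apply(combos, 2, paste, collapse = "+")
expected_prop <- expected_prop / sum(expected_prop)  # condition on being heterotypic
expected <- expected_prop * sum(observed)

# One-sided Poisson test: P(count >= observed) for each combination.
p_enrich <- setNames(ppois(observed - 1, lambda = expected, lower.tail = FALSE),
                     names(observed))

data.frame(observed, expected = round(expected, 1), p_enrich = signif(p_enrich, 2))
```

With these made-up numbers the B+C combination forms far more doublets than its cluster sizes predict, which is the kind of signal the first enrichment type is meant to capture.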

15 of 16

Take home messages

We can (and should!) identify (heterotypic) doublets pretty accurately
The scDblFinder package assembles a set of doublet detection methods
The scDblFinder method is very easy to use, outperforms alternative methods, runs considerably faster, scales better, and offers more functionalities

https://github.com/plger/scDblFinder

Also on Bioconductor.

Under constant improvement, so use a recent version!

16 of 16

Thanks!

Will Macnair

Aaron Lun

Mark & the Robinson group

The folks at ETH’s D-HEST Institute for Neuroscience