1 of 16

Doublet identification and characterization in single-cell data with scDblFinder

Pierre-Luc Germain¹,², Mark D. Robinson¹

¹ Lab of Statistical Bioinformatics, IMLS, University of Zürich

² Institute for Neuroscience, D-HEST, ETH Zürich

2 of 16

(Adapted from DePasquale et al. 2019)

A doublet (or multiplet) is defined as two (or more) cells captured in a single reaction volume (i.e. a well or droplet).

  • Homotypic doublets are between two cells of the same type,
  • Heterotypic or neotypic doublets are between cells of different types.

  • Neotypic doublets can form their own clusters and appear as spurious ‘new cell types’

  • Doublets can disrupt trajectory analysis and coexpression analysis

3 of 16

Experimental approaches and their limitations

(Adapted from Kang et al. 2019)

Samples of diverse genetic backgrounds

4 of 16

Experimental approaches and their limitations

(Adapted from Kang et al. 2019)

Limitation: doublets between cells of the same sample cannot be identified genetically

Samples of diverse genetic backgrounds

5 of 16

(Adapted from Kang et al. 2019)

The more cells we try to capture (in a single sample/capture), the higher the proportion of doublets.

Genetically-derived doublets can be used as (imperfect) truth on which to evaluate doublet detection methods
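The link between capture size and doublet proportion can be made concrete with a small sketch. It assumes a rule of thumb of roughly 1% doublets per 1,000 cells captured, which is often quoted for droplet platforms; the function name and the default rate are illustrative assumptions, not values taken from these slides.

```r
# Assumption: about 1% doublets per 1,000 cells captured
# (a rule of thumb; check the figures for your own protocol).
expected_doublet_rate <- function(n_cells, rate_per_1k = 0.01) {
  rate_per_1k * n_cells / 1000
}

# Expected doublet proportion and count for a few capture sizes.
captures <- c(1000, 4000, 10000)
data.frame(
  cells = captures,
  doublet_rate = expected_doublet_rate(captures),
  expected_doublets = captures * expected_doublet_rate(captures)
)
```

Because the rate itself grows with the number of cells, the expected number of doublets grows quadratically with capture size under this assumption.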

6 of 16

Big families of doublet detection methods

Based on gene co-expression

Rationale:

identify genes that tend not to be expressed together, and flag cells that coexpress them

Many informal approaches rely on identifying cluster markers and checking for their co-expression; others use general binary co-expression across cells (e.g. scds::cxds).
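A minimal base-R sketch of the co-expression rationale, under simplifying assumptions: genes are binarized, gene pairs that are co-expressed less often than expected under independence receive a positive weight, and each cell is scored by the summed weights of the pairs it co-expresses. This is not the scds::cxds implementation (which weights pairs differently); the function name and toy data are hypothetical.

```r
coexpression_score <- function(counts) {
  B <- (counts > 0) * 1                  # binarize: genes x cells
  n <- ncol(B)
  observed <- tcrossprod(B)              # cells co-expressing each gene pair
  freq <- rowSums(B)
  expected <- outer(freq, freq) / n      # expectation under independence
  W <- pmax(expected - observed, 0)      # weight pairs that avoid each other
  diag(W) <- 0
  # Per cell: sum of weights over all gene pairs the cell co-expresses.
  colSums(B * (W %*% B)) / 2
}

# Toy data (hypothetical): genes 1-2 mark type A, genes 3-4 mark type B.
counts <- matrix(0, nrow = 4, ncol = 11,
                 dimnames = list(paste0("gene", 1:4), paste0("cell", 1:11)))
counts[1:2, 1:5] <- 1    # type A cells
counts[3:4, 6:10] <- 1   # type B cells
counts[, 11] <- 1        # cell 11 expresses both marker sets
scores <- coexpression_score(counts)
```

In this toy example only the cell expressing both marker sets receives a non-zero score.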

Based on artificial doublets

Rationale:

generate doublets by combining real cells in silico, and flag real cells that are very similar to the artificial ones.

Flagging relies either on machine learning (scds::bcds) or on the count of artificial doublets in each cell's neighborhood (DoubletFinder, Scrublet, etc.).
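The artificial-doublet rationale can be sketched in base R as follows. The sketch assumes doublets are simulated by summing the counts of two randomly chosen cells, and that similarity is measured as the proportion of artificial doublets among each real cell's k nearest neighbors in log-count space. The function name, toy data and choice of k are illustrative assumptions; real tools differ in how they simulate doublets, normalize and reduce dimensions.

```r
set.seed(1)
# Toy count matrix: 5 genes x 6 cells (hypothetical data).
counts <- matrix(rpois(30, lambda = 3), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("cell", 1:6)))

make_artificial_doublets <- function(counts, n) {
  # Draw n random pairs of distinct cells (one pair per column).
  pairs <- replicate(n, sample(ncol(counts), 2))
  # Sum the counts of the two parent cells of each pair.
  doublets <- counts[, pairs[1, ], drop = FALSE] +
    counts[, pairs[2, ], drop = FALSE]
  colnames(doublets) <- paste0("artificial", seq_len(n))
  list(counts = doublets, parents = t(pairs))
}

art <- make_artificial_doublets(counts, n = 4)

# Proportion of artificial doublets among the k nearest neighbors
# of each real cell (Euclidean distance on log1p counts).
combined <- cbind(counts, art$counts)
is_art <- rep(c(FALSE, TRUE), c(ncol(counts), ncol(art$counts)))
d <- as.matrix(dist(t(log1p(combined))))
diag(d) <- Inf                           # a cell is not its own neighbor
k <- 3
knn_ratio <- apply(d[!is_art, , drop = FALSE], 1,
                   function(x) mean(is_art[order(x)[seq_len(k)]]))
```

Real cells with a high `knn_ratio` sit among simulated doublets and are therefore doublet candidates.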

7 of 16

One stop for doublet detection!

New comprehensive method:

  • scDblFinder

Doublet methods formerly in scran:

8 of 16

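library(scDblFinder); library(BiocParallel)  # BiocParallel provides MulticoreParam
sce <- scDblFinder(sce, samples=sce$sample_id, BPPARAM=MulticoreParam(3))  # per-sample processing on 3 cores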

9 of 16

scDblFinder thresholding

Minimizing:

misclassification of artificial doublets + proportion deviation from expected # of doublets
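A simplified sketch of this kind of thresholding, in base R. It assumes the cost of a candidate threshold is the fraction of artificial doublets scored below it, plus the relative deviation between the number of real cells called doublets and the expected number. This is an illustration of the idea only, not the package's actual objective; the function name and toy scores are hypothetical.

```r
pick_threshold <- function(score, is_artificial, expected_rate) {
  real <- score[!is_artificial]
  art <- score[is_artificial]
  expected_n <- expected_rate * length(real)   # expected doublets among real cells
  candidates <- sort(unique(score))
  cost <- vapply(candidates, function(th) {
    miss <- mean(art < th)                     # artificial doublets missed
    deviation <- abs(sum(real >= th) - expected_n) / expected_n
    miss + deviation
  }, numeric(1))
  candidates[which.min(cost)]
}

# Toy scores (hypothetical): 90 low-scoring and 10 high-scoring real cells,
# plus 50 artificial doublets; 10% of real cells are expected to be doublets.
score <- c(rep(0.1, 90), rep(0.9, 10), rep(0.95, 50))
is_artificial <- rep(c(FALSE, TRUE), c(100, 50))
th <- pick_threshold(score, is_artificial, expected_rate = 0.1)
```

The two terms pull in opposite directions: a low threshold catches every artificial doublet but calls too many real cells, while a high one does the reverse.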

10 of 16

Multiple samples

scDblFinder handles multiple samples in a batch-aware and efficient manner:

(Workflow diagram: SCE → split by sample → artificial doublets + kNN stats → train classifier → thresholding → SCE)

11 of 16

Okay, but does it work?

12 of 16

Benchmarking computational doublet-detection methods for single-cell RNA sequencing data

Xi and Li, preprint

Current top method:

DoubletFinder

13 of 16

scDblFinder vs DoubletFinder

better accuracy, approximately 10x faster

14 of 16

Doublet enrichment analysis

Two types of putative enrichments:

  • A specific combination of clusters tends to form more doublets than expected.
  • A specific cluster tends to form more doublets than expected (cluster stickiness).
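What "more doublets than expected" could mean for a combination of clusters can be sketched in base R. The sketch assumes random pairing of cells, so the expected share of heterotypic doublets between clusters i and j is proportional to 2·p_i·p_j, where p is the cluster proportion. The function name and toy numbers are hypothetical, and the package's own enrichment tests are more elaborate than this ratio.

```r
# cluster_sizes: named vector of cells per cluster.
# observed: heterotypic doublet counts per cluster pair, in the
#           order produced by combn() (A+B, A+C, B+C, ...).
pair_enrichment <- function(cluster_sizes, observed) {
  p <- cluster_sizes / sum(cluster_sizes)
  pairs <- t(combn(names(cluster_sizes), 2))
  raw <- 2 * p[pairs[, 1]] * p[pairs[, 2]]     # random-pairing expectation
  expected <- sum(observed) * raw / sum(raw)   # rescale to observed total
  data.frame(
    cluster1 = pairs[, 1],
    cluster2 = pairs[, 2],
    observed = observed,
    expected = as.numeric(expected),
    ratio = as.numeric(observed / expected),
    row.names = NULL
  )
}

# Toy example (hypothetical numbers): the B+C combination is over-represented.
sizes <- c(A = 500, B = 300, C = 200)
res <- pair_enrichment(sizes, observed = c(30, 20, 24))
```

A ratio well above 1 points to a combination that forms doublets more often than cluster abundances alone would predict.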

15 of 16

Take home messages

  • We can (and should!) identify (heterotypic) doublets fairly accurately
  • The scDblFinder package assembles a set of doublet detection methods
  • The scDblFinder method is very easy to use, outperforms alternative methods, runs considerably faster, scales better, and offers more functionalities

Also on Bioconductor.

Under constant improvement, so use a recent version!

16 of 16

Thanks!

Will Macnair

Aaron Lun

Mark & the Robinson group

The folks at ETH’s D-HEST Institute for Neuroscience