1 of 49

Efficient and scalable analysis of single-cell RNA-seq data using Bioconductor

Davide Risso� � @drisso1893�

2 of 49

Efficient and scalable analysis of single-cell RNA-seq data using Bioconductor

Davide Risso� � @drisso1893�

clustering

3 of 49

Efficient and scalable analysis of single-cell RNA-seq data using Bioconductor

clustering

Stephanie Hicks

Elizabeth Purdom

Yuwei Ni

Ruoxi Liu

4 of 49

“I have some awesome single-cell �RNA-seq data and want to analyze it!”

5 of 49

“...but it’s so big that it doesn’t �even load into memory”

6 of 49

“What should I do?”

7 of 49

Scalable method to cluster millions of single cells

Fast: we want to be able to quickly cluster (multiple times) thousands to millions of cells in PCA space (data may fit in memory)

On-disk: we may need to quickly cluster full data matrices (millions of cells by thousands of genes) which do not fit in memory.

In some cases (e.g., normalization) speed is more �important than accuracy

8 of 49

How much of a problem is it?

2.5 Millions cells

2.2 Millions cells

2 Millions cells

1 Million cells

Source: www.nxn.se/single-cell-studies/gui

9 of 49

k-means clustering

Given a set of n data points (x) and a number k, k-means partitions the data in k clusters.

More formally, k-means clustering aims at minimizing the within-cluster sum of squares:

In practice, we use an iterative algorithm based on two steps:

Assignment: given a set of centroids, assign each observation to the closest centroid.
Update: compute new centroids for each cluster.

10 of 49

Mini-batch k-means clustering (Sculley, 2010)

At each iteration, use small random subsets of the data (“mini-batches”)

No need to store the whole dataset in memory.
At each iteration, only the distances between a mini-batch and the k centroids need to be computed.
At each iteration, one only needs to have a subset of the data (mini-batch) and the k centroids in memory.
This makes it a natural candidate for clustering on-disk data.

https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf

11 of 49

Existing implementations

ClusterR CRAN package

The input data must be in base numeric matrix type

MiniBatchKMeans()

The input data can be stored in a HDF5 file� Reticulate�

12 of 49

Our implementation: the mbkmeans package

Builds upon ClusterR, Rcpp, beachmat, DelayedArray, HDF5Array, rhdf5

13 of 49

Why (mini-batch) k-means?

Hector Roux de Bezieux

14 of 49

Why (mini-batch) k-means?

One of the most popular clustering methods.
Building block of two Bioconductor packages for the clustering of single-cell RNA-seq, clusterExperiment and SC3.
We envision scalable versions of these packages that leverage our implementation.

15 of 49

What is HDF5?

HDF5 is a unique technology suite that makes possible the management of extremely large and complex data collections.

A versatile data model
A completely portable file format
A software library: high-level APIs with interfaces in C, C++, python, R, …

http://portal.hdfgroup.org/display/support

16 of 49

Why HDF5?

De facto standard for single-cell RNA-seq data

10X Genomics Cell Ranger software stores pre-processed data as a HDF5 file
Scanpy’s data format, anndata, is based on HDF5
The loompy data format is based on HDF5

17 of 49

What is a DelayedArray?

A convenient way to deal with HDF5 files in R/Bioconductor is via the DelayedArray framework.
Data are stored on-disk in a HDF5 file.
Operations are delayed and only performed on the subset of the data for which it is needed.

18 of 49

How do I use them in practice?

19 of 49

Simulations

Bivariate gaussian data (true 2D biological space)
Three cell types
Project the data into high dimensional space and add noise

20 of 49

Accuracy: Within-Cluster Sum of Squares

21 of 49

Accuracy: Adjusted Rand Index

22 of 49

Scalability: time

23 of 49

Scalability: memory usage

24 of 49

Subsampling analysis

1.3M Brain cells from 10X Genomics
5,000 most variable genes
Subsampled to: 75k, 150k, 300k, 500k, 750k, 1M cells
iMac w/SSD and 64GB RAM

25 of 49

Scalability: time

26 of 49

Scalability: memory

27 of 49

Real datasets

68k PBMCs from 10X Genomics
378k HCA Bone Marrow cells
384k HCA Cord Blood cells
1.3M Brain cells from 10X Genomics

28 of 49

Comparing hardware

29 of 49

Scalability: memory

30 of 49

Scalability: time

31 of 49

How you read from HDF5 has a huge impact

Using get_rows() instead of get_row() in beachmat is 40x faster!

(Thanks Hervé!)

32 of 49

A complete workflow: HCA preview data

Census of Immune Cells: Bone Marrow
333,253 cells
Downloaded using the HCAData package by Federico Marini

33 of 49

Analysis of Human Cell Atlas Preview Data

Remove low-quality cells with scater
Keep only genes with at least one UMI in 5% of the cells (4,475 genes kept)
scran normalization*
First 30 Principal Components (using 1,000 most variable genes) using BiocSingular’s random PCA
Mini-batch k-means clustering with mbkmeans (batch size of 100, k = 10, 10 random starts).

* mbkmeans was used as a preliminary step for scran normalization

34 of 49

Compute time

4:30 hours for complete workflow (starting from UMI counts).
13 mins for the clustering of the full 333,253 x 4,475 matrix (HDF5; useful for normalization).
42 mins for normalization (parallel; 6 cores).
2:30 hours for random PCA (parallel; 6 cores).
18 secs for the clustering of the 333,253 x 30 matrix of the top 30 Principal Components (in memory).
27 mins for visualization (t-sne) or 3 mins (umap).

(Note that some of these steps may be further optimized)

35 of 49

k=13 clusters correspond to biological cell types

Cluster 1: B-cells
Cluster 2: Megakaryocytes
Cluster 3: CD14+ Monocytes
Cluster 4: Dendritic cells
Cluster 5: CD4 T-cells
Cluster 9: CD8 T-cells
Cluster 13: NK cells

36 of 49

Dimensionality reduction is the bottleneck

4:30 hours for complete workflow (starting from UMI counts).
13 mins for the clustering of the full 333,253 x 4,475 matrix (HDF5; useful for normalization).
42 mins for normalization (parallel; 6 cores).
2:30 hours for random PCA (parallel; 6 cores).
18 secs for the clustering of the 333,253 x 30 matrix of the top 30 Principal Components (in memory).
27 mins for visualization (t-sne) or 3 mins (umap).

(Note that some of these steps may be further optimized)

37 of 49

Bioconductor workflow for analyzing single-cell data

https://o sca.bioconductor.org

Stephanie Hicks

Robert Amezquita

Aaron Lun

Raphael Gottardo

38 of 49

Dimensionality reduction is the bottleneck

Federico Agostinis

39 of 49

Scalable, accurate alternatives to PCA are needed

Federico Agostinis

40 of 49

Clustering is difficult!

Methods require to decide the number of clusters (directly or indirectly).
Many methods to choose from.
Each method has its parameters to tune.
How do we know we are finding a good partition?
Obtain different partitions and find the resolution for which they agree.

41 of 49

The Dune algorithm

Dune is an iterative algorithm.
Given R partitions, it seeks to improve the overall agreement of the partitions.
At each step it merges the clusters within each partition that maximize the Adjusted Rand Index between the partitions.
Stops when no beneficial merge can be identified.