Efficient and scalable analysis of single-cell RNA-seq data using Bioconductor
Davide Risso (@drisso1893)
Efficient and scalable clustering of single-cell RNA-seq data using Bioconductor
Stephanie Hicks
Elizabeth Purdom
Yuwei Ni
Ruoxi Liu
“I have some awesome single-cell RNA-seq data and want to analyze it!”
“...but it’s so big that it doesn’t even load into memory”
“What should I do?”
Scalable method to cluster millions of single cells
Fast: we want to quickly cluster (multiple times) thousands to millions of cells in PCA space (data may fit in memory).
On-disk: we may need to quickly cluster full data matrices (millions of cells by thousands of genes) that do not fit in memory.
In some cases (e.g., normalization) speed is more important than accuracy
How much of a problem is it?
2.5 million cells
2.2 million cells
2 million cells
1 million cells
Source: www.nxn.se/single-cell-studies/gui
k-means clustering
Given a set of n data points x and a number k, k-means partitions the data into k clusters.
More formally, k-means clustering aims to minimize the within-cluster sum of squares:
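One standard way to write this objective is given below. The notation is an assumption, since the slide's own formula is not preserved in this text: S_1, ..., S_k denote the k clusters and mu_j the mean (centroid) of the points assigned to S_j.

```latex
\underset{S_1,\dots,S_k}{\arg\min} \; \sum_{j=1}^{k} \sum_{x_i \in S_j} \lVert x_i - \mu_j \rVert^2,
\qquad \mu_j = \frac{1}{|S_j|} \sum_{x_i \in S_j} x_i
```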
In practice, we use an iterative algorithm based on two steps:
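The two steps are not preserved in this text. Assuming they are the usual Lloyd iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points), a minimal plain-Python sketch could look like the following. The function names (`kmeans`, `assign`, `update`, `wcss`) are illustrative and do not come from any package mentioned in the talk.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Step 1: assign each point to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def update(points, labels, centroids):
    """Step 2: move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            [sum(p[d] for p in members) / len(members) for d in range(len(old))]
        )
    return new_centroids


def kmeans(points, k, max_iter=100, seed=0):
    """Alternate the two steps until the assignments stop changing."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


def wcss(points, centroids, labels):
    """Within-cluster sum of squares for a given clustering."""
    return sum(squared_distance(p, centroids[lab]) for p, lab in zip(points, labels))
```

Every iteration touches all n points, which is what makes the full algorithm expensive when the data are large or stored on disk.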
Mini-batch k-means clustering (Sculley, 2010)
At each iteration, update the centroids using a small random subset of the data (a “mini-batch”) instead of the full dataset
https://www.eecs.tufts.edu/~dsculley/papers/fastkmeans.pdf
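As a rough illustration of the idea, the sketch below follows the update rule described in Sculley (2010): each sampled point pulls its nearest centroid toward itself with a per-centroid learning rate of 1 / (number of points that centroid has seen so far). It is written in plain Python for readability and is not the mbkmeans or ClusterR implementation; the function name and the parameters `batch_size`, `n_iter`, and `seed` are assumptions made for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points given as lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)), key=lambda j: squared_distance(point, centroids[j]))


def mini_batch_kmeans(points, k, batch_size=10, n_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]  # random initial centroids
    counts = [0] * k  # number of points each centroid has absorbed so far

    for _ in range(n_iter):
        # Draw a small random subset of the data (the mini-batch).
        batch = rng.sample(points, min(batch_size, len(points)))
        # Cache the nearest centroid for every point in the batch first.
        assigned = [nearest(p, centroids) for p in batch]
        # Then take one gradient-like step per point.
        for p, j in zip(batch, assigned):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-centroid learning rate
            centroids[j] = [(1.0 - eta) * c + eta * x for c, x in zip(centroids[j], p)]

    # Final pass: label every point with its nearest centroid.
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels
```

Only `batch_size` points are needed per iteration, so the full matrix never has to be held in memory during the updates; the final labeling pass can likewise be done in blocks.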
Existing implementations
ClusterR CRAN package
The input data must be a base R numeric matrix
MiniBatchKMeans()
The input data can be stored in an HDF5 file
Reticulate
Our implementation: the mbkmeans package
Why (mini-batch) k-means?
Hector Roux de Bézieux
What is HDF5?
HDF5 is a unique technology suite that makes possible the management of extremely large and complex data collections.
http://portal.hdfgroup.org/display/support
Why HDF5?
De facto standard for single-cell RNA-seq data
What is a DelayedArray?
How do I use them in practice?
Simulations
Accuracy: Within-Cluster Sum of Squares
Accuracy: Adjusted Rand Index
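For reference, the Adjusted Rand Index compares two labelings of the same cells through their contingency table and corrects the raw pair agreement for chance. The function below is a plain-Python sketch of the standard formula, included only to make the metric concrete; it is not the code used to produce the benchmark results, and the function name is chosen for this example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must have the same length")

    n = len(labels_a)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: treat as perfect agreement

    # Pairs of items that share a cluster in both labelings.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items that share a cluster in each labeling separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / total_pairs  # agreement expected by chance
    max_index = (sum_a + sum_b) / 2         # largest attainable agreement
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings use a single cluster
    return (sum_cells - expected) / (max_index - expected)
```

A value of 1 means the two partitions are identical up to a renaming of the cluster labels, while values near 0 indicate agreement no better than chance.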
Scalability: time
Scalability: memory usage
Subsampling analysis
Scalability: time
Scalability: memory
Real datasets
Comparing hardware
Scalability: memory
Scalability: time
How you read from HDF5 has a huge impact
Using get_rows() instead of get_row() in beachmat is 40x faster!
(Thanks Hervé!)
A complete workflow: HCA preview data
Analysis of Human Cell Atlas Preview Data
* mbkmeans was used as a preliminary step for scran normalization
Compute time
(Note that some of these steps may be further optimized)
k=13 clusters correspond to biological cell types
Dimensionality reduction is the bottleneck
Bioconductor workflow for analyzing single-cell data
Stephanie Hicks
Robert Amezquita
Aaron Lun
Raphael Gottardo
Dimensionality reduction is the bottleneck
Federico Agostinis
Scalable, accurate alternatives to PCA are needed
Federico Agostinis
Clustering is difficult!
The Dune algorithm
Hector Roux de Bézieux, Kelly Street, Sandrine Dudoit
Dune merging
Dune outperforms other strategies
Dune applied to Mouse Motor Cortex data
Lessons learned and ongoing work
Thank you!
Questions/comments? risso.davide@gmail.com
mbkmeans
Stephanie Hicks
Elizabeth Purdom
Yuwei Ni
Ruoxi Liu
Dune
Hector Roux de Bézieux
Kelly Street
Sandrine Dudoit
John Ngai
@drisso1893
Aaron Lun
Hervé Pagès
Mike Smith
Peter Hickey
Memory usage for increasing k
Elapsed time for increasing k