1 of 47

Single-Cell RNA-seq Technology meets Biology

California Institute of Technology

Lecture 2

Caltech Bi/BE/CS183

Spring 2024

These slides are distributed under the CC BY 4.0 license

2 of 47

Single-Cell RNA-seq technology meets Biology


3 of 47

The ideal (dissociated) single-cell transcriptomics method**

Universal in terms of cell size, type and state.
In situ measurements.
No minimum input of number of cells to be assayed.
Every cell is assayed, i.e. 100% cell capture rate.
Every transcript in every cell is detected, i.e. 100% molecular capture = sensitivity.
Every transcript is identified by its full-length sequence (isoform specificity).
Transcripts are readily associated to single cells, e.g. no cell doublets.
Additional measurements of other cell attributes (including so-called “multiomics”).
Cost effective per cell.
Easy to use.
Open source.

** The field is not here yet; trade-offs still required

4 of 47

A decade of scRNA-seq technology and adoption

What is it teaching us?

Svensson et al. 2020

5 of 47

Consider the full range of single cells

What question do you want to ask?

How is that supported or limited by technology type?

This entire alga is one cell with many nuclei

6 of 47

Technology development ⇔ methods development

Produced on January 8, 2023 with a Google Colab notebook

7 of 47

Google Colab notebook for this lecture

Svensson et al., 2020

Zappia et al., 2018

8 of 47

Spatial single-cell RNA-seq

Moses et al. 2021

9 of 47

Popular single-cell RNA-seq protocols

Chen et al. 2018

10 of 47

Physical separation of cells in wells

Papalexi and Satija, 2017

11 of 47

Example: library preparation for SMART-Seq2

Picelli et al., 2014

12 of 47

Example: SMART-Seq2 performance

Universal in terms of cell size, type and state.
In situ measurements.
No minimum input of number of cells to be assayed.
Every cell is assayed, i.e. 100% capture rate.
Every transcript in every cell is detected, i.e. 100% sensitivity.
Every transcript is identified by its full-length sequence.
Transcripts are assigned correctly to cells, e.g. in theory no doublets.
Additional multimodal measurements.
Cost effective per cell.
Easy to use.
Open source.

13 of 47

Cost is complicated to measure

Ziegenhain et al. 2017

14 of 47

Sequencing costs are also complicated

UC Davis Genome Center facility, January 2019. https://genomecenter.ucdavis.edu/

15 of 47

Microfluidic methods

Drop-seq

inDrops

16 of 47

Foundation: monodispersed emulsions

Booeshaghi et al. 2019

17 of 47

Example: the inDrops approach

Zilionis et al. 2016

cell capture

cell lysis

barcoding

18 of 47

Example: the inDrops protocol

Library preparation

Zilionis et al. 2016

reverse transcription

amplification creates cDNA library

19 of 47

Unique Molecular Identifiers

UMIs are barcodes used to distinguish molecules (rather than to associate transcripts with cells).
UMI collapsing refers to the process of using UMIs to avoid double-counting molecules after sequencing.

Naïve UMI collapsing consists of counting all reads that have the same UMI and cell barcode as a single event.
UMI collapsing can include collision detection by checking whether reads also originate from the same molecule.

Zilionis et al. 2016
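
The two flavors of UMI collapsing described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the function name `count_molecules`, the `(cell_barcode, umi, gene)` read representation, and the use of the gene as a stand-in for "same molecule" are all assumptions made for the example.

```python
from collections import defaultdict


def count_molecules(reads, detect_collisions=False):
    """Count molecules per cell from (cell_barcode, umi, gene) read tuples.

    Naive collapsing: all reads sharing a cell barcode and UMI are one event.
    With detect_collisions=True, reads must also share the gene (used here as
    a proxy for "same molecule") to be collapsed together.
    """
    seen = set()
    for cell, umi, gene in reads:
        key = (cell, umi, gene) if detect_collisions else (cell, umi)
        seen.add(key)

    counts = defaultdict(int)
    for key in seen:
        counts[key[0]] += 1  # key[0] is the cell barcode
    return dict(counts)


# Hypothetical toy reads: the UMI "U1" in cell "AAA" appears on two genes.
reads = [
    ("AAA", "U1", "g1"),
    ("AAA", "U1", "g1"),  # duplicate read of the same molecule
    ("AAA", "U1", "g2"),  # same UMI, different gene: a UMI collision
    ("AAA", "U2", "g1"),
    ("CCC", "U1", "g1"),
]
naive = count_molecules(reads)
with_detection = count_molecules(reads, detect_collisions=True)
```

In the toy data, naive collapsing undercounts cell "AAA" because the UMI "U1" was attached to molecules of two different genes; collision detection recovers the extra molecule.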

20 of 47

Sequencing

The cDNA library is sequenced. There are three types of current sequencing technology that are commonly used:

Single-end reads
Paired-end reads
“Long” reads

Sequencing errors can lead to:

incorrectly labeled cells (from cell barcodes).
erroneous molecule counts (from UMIs).

Error correction can be used to address these problems:

Cell barcode error correction can sometimes be performed using a list of the known cell barcodes in the experiment (technology dependent).
Cell barcode and UMI error correction can be performed by first identifying sequences that likely represent true barcodes (based on frequency).

Sequencing errors are (sequencing) technology dependent. More on this in Lecture 8.
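
Correction against a list of known cell barcodes can be sketched as follows. The rule used here (accept an exact match, otherwise accept a unique known barcode at Hamming distance 1) is an assumption chosen for illustration; the names `hamming` and `correct_barcode` are hypothetical and not taken from any specific tool.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def correct_barcode(observed, known_barcodes):
    """Map an observed barcode to a known barcode, or None if it cannot be rescued.

    Assumed rule: exact matches are kept; otherwise the barcode is corrected
    only if exactly one known barcode lies at Hamming distance 1.
    """
    if observed in known_barcodes:
        return observed
    candidates = [
        known
        for known in known_barcodes
        if len(known) == len(observed) and hamming(known, observed) == 1
    ]
    # Ambiguous corrections (more than one candidate) are discarded.
    return candidates[0] if len(candidates) == 1 else None


known = {"AAAA", "CCCC", "AACC"}
```

With this toy list, "AAAT" is corrected to "AAAA", while "AAAC" is discarded because it is one mismatch away from both "AAAA" and "AACC".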

21 of 47

What it means to be “3’ technology”

Ntranos et al. 2019.

22 of 47

Beads, Cells and Droplets

Split

Doublet

No capture

Goal

Good

Bad

Irrelevant

Collision

23 of 47

Barcode diversity

Barcode collisions occur when beads with identical barcode sequences are present in droplets with two different cells.
The number of available barcode sequences depends on the sequence length L. Sequences of length L can yield up to 4^L barcodes.
The number of distinct barcodes needed is a function of the number of cells that are to be barcoded.
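
The 4^L bound is easy to tabulate. A small sketch (the function names are hypothetical) that computes the barcode space for a given length, and the shortest length whose barcode space reaches a target size:

```python
def n_barcodes(length):
    """Upper bound on distinct barcodes of a given length over {A, C, G, T}."""
    return 4 ** length


def min_length(n_needed):
    """Smallest barcode length L such that 4**L >= n_needed."""
    length = 0
    while 4 ** length < n_needed:
        length += 1
    return length
```

For example, a barcode space of at least one million sequences requires L = 10, since 4^9 = 262,144 and 4^10 = 1,048,576. As the following slides show, avoiding collisions requires many more barcodes than cells, so this is only a lower bound on a practical L.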

Collision

24 of 47

Estimating the number of cells that will be uniquely barcoded

Assuming that each of N cells gets one barcode at random from a set of M barcodes, the expected number of cells with a unique barcode is given by

$$N\left(1-\frac{1}{M}\right)^{N-1}.$$

Expected value is a generalization of weighted average, and is a number associated with a random variable.

25 of 47

Random variables

A random variable is neither random nor a variable…

A random variable is a function from the sample space of a probability space to the real numbers (or, more generally, to a measurable space).

Example: the sample space is the set { present, absent } (for instance the presence or absence of a specific barcode in a droplet), and the random variable assigns the real number 1 to “present” and the real number 0 to “absent”. This is an example of an indicator random variable.

26 of 47

Estimating the number of cells that will be uniquely barcoded

Assuming that each of N cells gets one barcode at random from a set of M barcodes, the expected number of cells with a unique barcode is given by

$$N\left(1-\frac{1}{M}\right)^{N-1}.$$

Proof: If we denote the probability that any specific barcode associates with some cell by p, then p = 1/M. The probability that a given barcode is used for some specific set of k cells (and for no other cell) is therefore

$$p^k (1-p)^{N-k} = \left(\frac{1}{M}\right)^{k}\left(1-\frac{1}{M}\right)^{N-k}.$$

27 of 47

Estimating the number of cells that will be uniquely barcoded

28 of 47

Expected value of a random variable

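The expectation of a discrete random variable X with outcomes x₁, x₂, x₃, … x_n, each occurring with probability p₁, p₂, p₃, … p_n respectively, is

$$E[X] = \sum_{i=1}^{n} x_i p_i.$$

If X_j is an indicator random variable corresponding to barcode j, i.e. X_j is 1 if barcode j is used for only one cell, and 0 otherwise, then

$$E[X_j] = 1 \cdot P(X_j = 1) + 0 \cdot P(X_j = 0) = \binom{N}{1}\frac{1}{M}\left(1-\frac{1}{M}\right)^{N-1} = \frac{N}{M}\left(1-\frac{1}{M}\right)^{N-1}.$$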

29 of 47

The seemingly magic linearity of expectation

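If X and Y are two random variables, then

$$E[X+Y] = E[X] + E[Y].$$

This is true regardless of whether X and Y are independent.
Thus, summing the indicator variables X_j over all M barcodes,

$$E\left[\sum_{j=1}^{M} X_j\right] = \sum_{j=1}^{M} E[X_j] = M \cdot \frac{N}{M}\left(1-\frac{1}{M}\right)^{N-1} = N\left(1-\frac{1}{M}\right)^{N-1}.$$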

30 of 47

Barcode collisions

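For N assayed cells and M barcodes, the barcode collision rate can be estimated as one minus the expected fraction of cells with a unique barcode:

$$1 - \left(1-\frac{1}{M}\right)^{N-1}.$$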

Barcode collisions lead to synthetic doublets. Avoiding synthetic doublets requires high relative barcode diversity, i.e., a high ratio of M/N.
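
A quick simulation can be used to sanity-check these quantities. The sketch below assumes the model stated earlier (each of N cells draws one of M barcodes uniformly at random), takes the expected number of uniquely barcoded cells to be N(1 - 1/M)^(N-1), and takes the collision rate to be one minus the corresponding fraction. The function names are hypothetical.

```python
import random
from collections import Counter


def expected_unique(n_cells, n_barcodes):
    """Expected number of cells whose barcode is shared with no other cell."""
    return n_cells * (1 - 1 / n_barcodes) ** (n_cells - 1)


def collision_rate(n_cells, n_barcodes):
    """Estimated fraction of cells that share a barcode with another cell."""
    return 1 - (1 - 1 / n_barcodes) ** (n_cells - 1)


def simulate_unique(n_cells, n_barcodes, trials=200, seed=0):
    """Average number of uniquely barcoded cells over repeated random draws."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(n_barcodes) for _ in range(n_cells))
        # A barcode seen exactly once corresponds to one uniquely barcoded cell.
        total += sum(1 for c in counts.values() if c == 1)
    return total / trials
```

For N = 1,000 cells and M = 10,000 barcodes, `expected_unique(1000, 10000)` is about 905, i.e. roughly 10% of cells are expected to be involved in a collision even with ten times more barcodes than cells, and `simulate_unique(1000, 10000)` should land close to the same value.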

31 of 47

Droplet tuning concepts

The capture rate is 1 minus the fraction of cells that are in droplets without any beads.
The split rate is the fraction of droplets with exactly one cell that have more than one bead.
The doublet rate is the fraction of droplets with one bead that have more than one cell.

Doublet

Split

No capture

32 of 47

Binomial distributions for beads and cells

Consider n droplets, each of which has a probability p of containing a single cell. Then the probability that k cells will be captured is

$$\binom{n}{k} p^k (1-p)^{n-k}.$$

This suggests modeling the number of cells captured with a random variable that follows a Binomial distribution. That is, X ~ B(n,p).

33 of 47

The law of rare events

For large n and small p, the Binomial distribution B(n,p) is approximated well by the Poisson distribution Pois(λ = np), i.e.

$$\binom{n}{k} p^k (1-p)^{n-k} \approx \frac{e^{-\lambda}\lambda^k}{k!}.$$

This is convenient for many reasons: the expression on the right is easier to evaluate and the parameter λ is readily interpretable as the expected value of a Poisson random variable.
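
The quality of the approximation can be checked numerically. A minimal sketch using only the Python standard library; the values n = 10,000 and p = 0.0003 are arbitrary choices for illustration.

```python
import math


def binom_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Pois(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)


n, p = 10_000, 0.0003  # large n, small p; lambda = n * p = 3
lam = n * p
max_abs_diff = max(
    abs(binom_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(10)
)
```

Here `max_abs_diff` is well below 0.001, so for droplet-scale values of n and p the two distributions are practically interchangeable.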

34 of 47

Expected value of a Poisson random variable

If X is a Poisson random variable, i.e. X ~ Pois(λ), then the expected value of X is given by

$$E[X] = \sum_{k=0}^{\infty} k \, \frac{e^{-\lambda}\lambda^k}{k!},$$

which is equal to λ.
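
The intermediate steps are short enough to write out. A sketch of the standard derivation, using only the power series for the exponential function:

```latex
\begin{aligned}
E[X] &= \sum_{k=0}^{\infty} k \, \frac{e^{-\lambda}\lambda^{k}}{k!}
      = \sum_{k=1}^{\infty} \frac{e^{-\lambda}\lambda^{k}}{(k-1)!}
      && \text{(the $k=0$ term vanishes)} \\
     &= \lambda e^{-\lambda} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!}
      = \lambda e^{-\lambda} \sum_{m=0}^{\infty} \frac{\lambda^{m}}{m!}
      && \text{(substituting $m = k-1$)} \\
     &= \lambda e^{-\lambda} e^{\lambda} = \lambda .
\end{aligned}
```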

35 of 47

A Poisson approximation for beads and cells

Load cells into droplets at Poisson rate λ.
Load beads into droplets at Poisson rate μ.

36 of 47

Cell capture and duplication rates

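The Poisson approximation yields a simple formula for the capture rate: $1 - e^{-\mu}$, the probability that a droplet contains at least one bead, where μ is the Poisson rate of bead loading.
The split rate estimate is $1 - e^{-\mu} - \mu e^{-\mu}$, the probability that a droplet contains more than one bead.
This provides a quantitative assessment of the tradeoff between the capture rate and the split rate.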
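
The tradeoff can be tabulated directly. The sketch below assumes beads are loaded at Poisson rate μ independently of cells, takes the capture rate to be the probability of at least one bead, 1 - e^(-μ), and takes the split rate to be the probability of more than one bead, 1 - e^(-μ) - μe^(-μ), following the definitions of capture and split given earlier. The function names and the grid of μ values are choices made for illustration.

```python
import math


def capture_rate(mu):
    """P(at least one bead in a droplet) for Poisson bead loading at rate mu."""
    return 1 - math.exp(-mu)


def split_rate(mu):
    """P(more than one bead in a droplet) for Poisson bead loading at rate mu."""
    return 1 - math.exp(-mu) - mu * math.exp(-mu)


# Tabulate the tradeoff for a few bead loading rates.
for mu in (0.1, 0.5, 1.0, 2.0):
    print(f"mu={mu:<4} capture={capture_rate(mu):.3f} split={split_rate(mu):.3f}")
```

Both rates increase with μ: raising the bead loading rate captures more cells, but at the price of more cells being split across several barcodes.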

Capture rate

Split rate

37 of 47

Reducing the number of beadless droplets

Feature	Drop-seq	inDrops	10x Genomics
Bead material	Polystyrene	Hydrogel	Hydrogel
Loading dynamics	Poisson	Sub-Poisson	Sub-Poisson
Dissolvable	No	No	Yes
Barcode release	No	UV release	Chemical release
Customizable	Demonstrated	Not shown	Feasible
Licensing	Open source	Open source	Proprietary
Availability	Beads are sold	Commercial	Commercial

38 of 47

Sub-Poisson (sometimes called super-Poisson) loading

39 of 47

Technical doublets

Technical doublets arise when two or more cells are captured in a droplet with a single bead. The technical doublet rate is therefore the probability of capturing two or more cells in a droplet given that at least one cell has been captured in a droplet. With cells loaded into droplets at Poisson rate λ, this is

$$\frac{1 - e^{-\lambda} - \lambda e^{-\lambda}}{1 - e^{-\lambda}} = 1 - \frac{\lambda e^{-\lambda}}{1 - e^{-\lambda}}.$$

Note that “overloading” a microfluidics single-cell experiment by loading more cells while keeping flow rates constant will increase the number of technical doublets due to an effective increase in λ.
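
The effect of overloading can be made concrete with a short calculation. The sketch below assumes cells are loaded at Poisson rate λ and computes the conditional probability of two or more cells given at least one cell; the function name is hypothetical.

```python
import math


def technical_doublet_rate(lam):
    """P(two or more cells | at least one cell) for Poisson cell loading at rate lam."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    p_at_least_one = -math.expm1(-lam)  # 1 - exp(-lam), accurate for small lam
    p_exactly_one = lam * math.exp(-lam)
    return 1 - p_exactly_one / p_at_least_one
```

For small λ the rate is approximately λ/2, so doubling the effective cell loading rate roughly doubles the fraction of cell-containing droplets that are technical doublets.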

40 of 47

Doublet detection: the barnyard plot

Croset et al. 2018.

41 of 47

Bloom’s correction

Total number of droplets N, barnyard axes measured at N₁ and N₂, and observed doublets N_{1,2}.

Bloom, 2018.

42 of 47

Biological doublets

Biological doublets arise when two cells form a discrete unit that does not break apart during disruption to form a suspension.
Biological doublets will not be detected via barnyard plots.
One approach to avoiding biological doublets is to perform nuclear single-cell RNA-seq: Habib et al. 2017.
However, biological doublets are not necessarily just a technical nuisance to be avoided. The paper Halpern et al. 2018 utilizes biological doublets of hepatocytes and liver endothelial cells to assign tissue coordinates to liver endothelial cells via imputation from their hepatocyte partners.

43 of 47

Summary

Split

Doublet

No capture

Goal

Good

Bad

Irrelevant

Collision

Technical

Biological

high λ

low μ

small L

high μ

low λ

44 of 47

Summary of droplet single-cell RNA-seq methods and features

Zhang et al. 2019.

45 of 47

What most single-cell RNA-seq is not, circa 2023

It is NOT single-cell: While measurements are made from RNA molecules in individual cells, in standard practice (e.g. 10x Genomics, Drop-seq) very few of the RNA molecules are captured (order of magnitude: 1%). What is measured is therefore a sample of the RNA inside any given cell, not a complete picture of it. Since measurements from cells are incomplete, claims are for the most part restricted to groups of cells rather than individual cells.
It is NOT RNA: Most RNA-seq consists of sampling cDNA molecules from a cDNA library, which serves as a proxy for the (captured) RNA content in cells. (Direct RNA sequencing is in its infancy.)

46 of 47

Extensions of single-cell RNA-seq

Multimodal assays.

Patch-seq: Cadwell et al. 2015.
sci-CAR: Cao et al. 2018.

Multiplexing of samples.

Cell hashing: Stoeckius et al. 2018.
Click tags: Gehring et al. 2018.

Spatial single-cell technologies.

seqFISH: Lubeck et al. 2014.
MERFISH: Chen et al. 2015.

Packer and Trapnell 2018.

47 of 47

Additional References

Calculation of the probability that there is life elsewhere in the universe (using the Poisson distribution): M. Steel, My friend and I catch a bus…
Example use case of linearity of expectation in population genetics: Bhaskar et al., Distortion of genealogical properties when the sample is very large, 2014.
Method of the Year 2013: Single-cell RNA- and DNA- seq.
Method of the Year 2019: Single-cell multimodal omics.