1 of 47

Single-Cell RNA-seq Technology meets Biology

California Institute of Technology

1

Lecture 2

Caltech Bi/BE/CS183

Spring 2024

These slides are distributed under the CC BY 4.0 license

2 of 47

Single-Cell RNA-seq technology meets Biology

2


3 of 47

The ideal (dissociated) single-cell transcriptomics method**

  • Universal in terms of cell size, type and state.
  • In situ measurements.
  • No minimum on the number of cells to be assayed.
  • Every cell is assayed, i.e., 100% cell capture rate.
  • Every transcript in every cell is detected, i.e., 100% molecular capture (sensitivity).
  • Every transcript is identified by its full-length sequence (isoform specificity).
  • Transcripts are readily associated with single cells, e.g., no cell doublets.
  • Additional measurements of other cell attributes (including so-called “multiomics”).
  • Cost effective per cell.
  • Easy to use.
  • Open source.

3

** The field is not there yet; trade-offs are still required.

4 of 47

A decade of scRNA-seq technology and adoption

What is it teaching us?

4

5 of 47

5

Consider the full range of single cells

What question do you want to ask?

How is that supported or limited by the type of technology?

This entire alga is one cell with many nuclei

6 of 47

Technology development ⇔ methods development

6

Produced on January 8, 2023 with a Google Colab notebook

7 of 47

7

8 of 47

Spatial single-cell RNA-seq

8

9 of 47

Popular single-cell RNA-seq protocols

9

10 of 47

Physical separation of cells in wells

10

11 of 47

Example: library preparation for SMART-Seq2

11

12 of 47

Example: SMART-Seq2 performance

  • Universal in terms of cell size, type and state.
  • In situ measurements.
  • No minimum input of number of cells to be assayed.
  • Every cell is assayed, i.e., 100% capture rate.
  • Every transcript in every cell is detected, i.e., 100% sensitivity.
  • Every transcript is identified by its full-length sequence.
  • Transcripts are assigned correctly to cells, e.g. in theory no doublets.
  • Additional multimodal measurements.
  • Cost effective per cell.
  • Easy to use.
  • Open source.

12

13 of 47

Cost is complicated to measure

13

14 of 47

Sequencing costs are also complicated

14

15 of 47

Microfluidic methods

15

Drop-seq

inDrops

16 of 47

Foundation: monodispersed emulsions

16

17 of 47

Example: the inDrops approach

17

cell capture

cell lysis

barcoding

18 of 47

Example: the inDrops protocol

18

Library preparation

reverse transcription

amplification creates cDNA library

19 of 47

Unique Molecular Identifiers

  • Unique Molecular Identifiers (UMIs) are barcodes used to distinguish molecules (rather than to associate transcripts with cells).
  • UMI collapsing refers to the process of using UMIs to avoid double-counting molecules after sequencing.
    • Naïve UMI collapsing consists of counting all reads that have the same UMI and cell barcode as a single event.
    • UMI collapsing can include collision detection by checking whether reads with the same UMI and cell barcode also originate from the same molecule.
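The two flavors of UMI collapsing can be sketched in a few lines of Python. This is a minimal illustration, not a production pipeline: reads are assumed to be already parsed into hypothetical (cell barcode, UMI, gene) tuples, and "same molecule" is approximated by "same gene", which is an assumption made only for this example.

```python
from collections import defaultdict

def collapse_naive(reads):
    """Count one molecule per distinct (cell barcode, UMI) pair."""
    molecules = {(cb, umi) for cb, umi, _gene in reads}
    counts = defaultdict(int)
    for cb, _umi in molecules:
        counts[cb] += 1
    return dict(counts)

def collapse_with_gene(reads):
    """Count one molecule per distinct (cell barcode, UMI, gene) triple.

    Reads that share a cell barcode and UMI but map to different genes are
    treated as a UMI collision, i.e. as two distinct molecules.
    """
    molecules = set(reads)
    counts = defaultdict(int)
    for cb, _umi, gene in molecules:
        counts[(cb, gene)] += 1
    return dict(counts)

# Hypothetical toy reads: (cell barcode, UMI, gene)
reads = [
    ("AAAC", "GGT", "geneA"),
    ("AAAC", "GGT", "geneA"),  # duplicate read of the same molecule
    ("AAAC", "GGT", "geneB"),  # same UMI, different gene: a collision
    ("AAAC", "TTC", "geneA"),
    ("CCCG", "GGT", "geneA"),
]

print(collapse_naive(reads))      # molecules per cell
print(collapse_with_gene(reads))  # molecules per (cell, gene)
```

In the toy data the naïve count merges the geneA and geneB reads that share a UMI, while the gene-aware count keeps them apart.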

19

20 of 47

Sequencing

  • The cDNA library is sequenced. There are three types of current sequencing technology that are commonly used:
    • Single-end reads
    • Paired-end reads
    • “Long” reads
  • Sequencing errors can lead to:
    • incorrectly labeled cells (from cell barcodes).
    • erroneous molecule counts (from UMIs).
  • Error correction can be used to address these problems:
    • Cell barcode error correction can sometimes be performed using a list of the known cell barcodes in the experiment (technology dependent).
    • Cell barcode and UMI error correction can be performed by first identifying sequences that likely represent true barcodes (based on frequency).
  • Sequencing errors are (sequencing) technology dependent. More on this in Lecture 8.
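A minimal sketch of list-based cell barcode correction, assuming a known barcode list and a tolerance of one mismatch (both are illustrative choices, not a description of any particular tool). An observed barcode is corrected only when exactly one known barcode lies within the tolerance; ambiguous or distant barcodes are discarded.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def correct_barcode(observed, known_barcodes, max_dist=1):
    """Return the unique known barcode within max_dist of observed, else None."""
    if observed in known_barcodes:
        return observed
    hits = [
        bc for bc in known_barcodes
        if len(bc) == len(observed) and hamming(bc, observed) <= max_dist
    ]
    # Correct only when the match is unambiguous.
    return hits[0] if len(hits) == 1 else None

# Hypothetical known barcode list
known = {"AAAA", "CCCC", "AATT"}

print(correct_barcode("CCCG", known))  # one mismatch from CCCC
print(correct_barcode("AAAT", known))  # one mismatch from both AAAA and AATT
print(correct_barcode("GGGG", known))  # far from every known barcode
```

The second call shows why barcode lists are usually designed with large pairwise distances: two known barcodes that sit close together make single-error reads uncorrectable.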

20

21 of 47

What it means to be “3’ technology”

21

22 of 47

Beads, Cells and Droplets

22

[Figure: bead, cell and droplet configurations, labeled Split, Doublet, No capture, Goal, Good, Bad, Irrelevant, Collision]

23 of 47

Barcode diversity

  • Barcode collisions occur when beads with identical barcode sequences are present in droplets with two different cells.
  • The number of available barcode sequences depends on the sequence length L. Sequences of length L can yield up to $4^L$ barcodes.
  • The number of distinct barcodes needed is a function of the number of cells that are to be barcoded.
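To get a feel for the numbers, the sketch below computes the size of the barcode space for a given length and the smallest length that reaches a target barcode-to-cell ratio. The ratio of 100 is a hypothetical target chosen for illustration only.

```python
def barcode_space(length):
    """Number of distinct DNA sequences of the given length (4 bases per position)."""
    return 4 ** length

def min_length_for(n_cells, ratio=100):
    """Smallest L such that 4**L >= ratio * n_cells (ratio is an assumed target)."""
    length = 1
    while barcode_space(length) < ratio * n_cells:
        length += 1
    return length

for length in (6, 8, 12, 16):
    print(length, barcode_space(length))

print(min_length_for(10_000))
```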

23

Collision

24 of 47

Estimating the number of cells that will be uniquely barcoded

  • Assuming that each of N cells gets one barcode at random from a set of M barcodes, the expected number of cells with a unique barcode is given by

    $$N \left(1 - \frac{1}{M}\right)^{N-1}$$

24

Expected value is a generalization of weighted average, and is a number associated to a random variable.
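A quick numerical check, assuming the standard closed form N(1 - 1/M)^(N-1) for the expected number of uniquely barcoded cells: the sketch compares that expression with a Monte Carlo simulation in which every cell draws a barcode uniformly at random. The values of N and M are hypothetical.

```python
import random
from collections import Counter

def expected_unique(n_cells, n_barcodes):
    """Closed form: N * (1 - 1/M) ** (N - 1)."""
    return n_cells * (1 - 1 / n_barcodes) ** (n_cells - 1)

def simulate_unique(n_cells, n_barcodes, trials=2000, seed=0):
    """Average number of cells whose barcode is shared with no other cell."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(n_barcodes) for _ in range(n_cells))
        # A cell is uniquely barcoded exactly when its barcode was drawn once.
        total += sum(1 for c in counts.values() if c == 1)
    return total / trials

n_cells, n_barcodes = 100, 1000  # hypothetical experiment size
print(expected_unique(n_cells, n_barcodes))
print(simulate_unique(n_cells, n_barcodes))
```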

25 of 47

Random variables

  • A random variable is neither random nor a variable…
    • A random variable is a function from the sample space of a probability space to the real numbers (or more generally to a measurable space).
  • Example: the sample space is the set { present, absent } (for instance the presence or absence of a specific barcode in a droplet), and the random variable assigns the real number 1 to “present” and the real number 0 to “absent”. This is an example of an indicator random variable.

25

26 of 47

Estimating the number of cells that will be uniquely barcoded

  • Assuming that each of N cells gets one barcode at random from a set of M barcodes, the expected number of cells with a unique barcode is given by

    $$N \left(1 - \frac{1}{M}\right)^{N-1}$$

Proof: If we denote the probability that any specific barcode associates with some cell by p, then p = 1/M. The probability that a given barcode is used for some specific set of k cells is therefore

    $$p^k (1 - p)^{N-k}$$

26

27 of 47

Estimating the number of cells that will be uniquely barcoded

27

28 of 47

Expected value of a random variable

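  • The expectation of a discrete random variable X with outcomes $x_1, x_2, x_3, \ldots, x_n$, each occurring with probability $p_1, p_2, p_3, \ldots, p_n$ respectively, is

    $$\mathbb{E}[X] = \sum_{i=1}^{n} x_i p_i$$

  • If $X_j$ is an indicator random variable corresponding to barcode j, i.e. $X_j$ is 1 if barcode j is used for only one cell, and 0 otherwise, then

    $$\mathbb{E}[X_j] = P(X_j = 1) = N \cdot \frac{1}{M} \left(1 - \frac{1}{M}\right)^{N-1}$$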

28

29 of 47

The seemingly magic linearity of expectation

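  • If X and Y are two random variables, then

    $$\mathbb{E}[X + Y] = \mathbb{E}[X] + \mathbb{E}[Y]$$

    This is true regardless of whether X and Y are independent.
  • Thus, with $X_j$ the indicator that barcode j is used for exactly one of the N cells,

    $$\mathbb{E}\left[\sum_{j=1}^{M} X_j\right] = \sum_{j=1}^{M} \mathbb{E}[X_j] = M \cdot N \cdot \frac{1}{M} \left(1 - \frac{1}{M}\right)^{N-1} = N \left(1 - \frac{1}{M}\right)^{N-1}$$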

29
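Linearity of expectation holds even for strongly dependent random variables. As a small illustration (the fair die is a hypothetical example, not from the lecture), take X to be a die roll and Y = X², so that Y is completely determined by X. Exact arithmetic with fractions shows that E[X + Y] still equals E[X] + E[Y].

```python
from fractions import Fraction

outcomes = range(1, 7)   # faces of a fair die
p = Fraction(1, 6)       # probability of each face

def expectation(f):
    """Expected value of f(X) for a fair die roll X."""
    return sum(p * f(x) for x in outcomes)

e_x = expectation(lambda x: x)            # E[X]
e_y = expectation(lambda x: x * x)        # E[Y] with Y = X**2
e_sum = expectation(lambda x: x + x * x)  # E[X + Y]

print(e_x, e_y, e_sum)
print(e_sum == e_x + e_y)
```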

30 of 47

Barcode collisions

  • For N assayed cells and M barcodes, the barcode collision rate can be estimated as

    $$1 - \left(1 - \frac{1}{M}\right)^{N-1} \approx 1 - e^{-N/M}$$

  • Barcode collisions lead to synthetic doublets. Avoiding synthetic doublets requires high relative barcode diversity, i.e., a high ratio of M/N.
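The dependence on the ratio M/N can be explored numerically. The sketch below assumes the per-cell collision probability 1 - (1 - 1/M)^(N-1), i.e. the chance that a given cell shares its barcode with at least one of the other N - 1 cells, together with its exponential approximation; the cell numbers and barcode length are hypothetical.

```python
import math

def collision_rate(n_cells, n_barcodes):
    """Probability that a given cell shares its barcode with another cell."""
    return 1 - (1 - 1 / n_barcodes) ** (n_cells - 1)

def collision_rate_approx(n_cells, n_barcodes):
    """Exponential approximation, accurate when M is large."""
    return 1 - math.exp(-n_cells / n_barcodes)

n_barcodes = 4 ** 12  # barcodes of length 12
for n_cells in (1_000, 10_000, 100_000):
    print(
        n_cells,
        n_barcodes / n_cells,
        collision_rate(n_cells, n_barcodes),
        collision_rate_approx(n_cells, n_barcodes),
    )
```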

30


31 of 47

Droplet tuning concepts

  • The capture rate is 1 minus the fraction of cells that are in droplets without any beads.
  • The split rate is the fraction of droplets with exactly one cell that have more than one bead.
  • The doublet rate is the fraction of droplets with exactly one bead that have more than one cell.
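The three definitions can be checked against a simulation. This is a sketch under explicit assumptions: cells and beads are loaded independently, each with a Poisson-distributed count per droplet, and the rates of 0.1 are hypothetical. Poisson sampling uses Knuth's multiplication method so that only the standard library is needed.

```python
import math
import random

def sample_poisson(rng, rate):
    """Draw a Poisson count by Knuth's multiplication method (fine for small rates)."""
    threshold = math.exp(-rate)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def droplet_rates(cell_rate, bead_rate, n_droplets=200_000, seed=0):
    """Empirical capture, split and doublet rates, following the definitions above."""
    rng = random.Random(seed)
    cells_total = cells_beadless = 0
    one_cell = one_cell_multi_bead = 0
    one_bead = one_bead_multi_cell = 0
    for _ in range(n_droplets):
        cells = sample_poisson(rng, cell_rate)
        beads = sample_poisson(rng, bead_rate)
        cells_total += cells
        if beads == 0:
            cells_beadless += cells
        if cells == 1:
            one_cell += 1
            one_cell_multi_bead += beads > 1
        if beads == 1:
            one_bead += 1
            one_bead_multi_cell += cells > 1
    return {
        "capture": 1 - cells_beadless / cells_total,
        "split": one_cell_multi_bead / one_cell,
        "doublet": one_bead_multi_cell / one_bead,
    }

rates = droplet_rates(cell_rate=0.1, bead_rate=0.1)  # hypothetical loading rates
print(rates)
```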

31

Doublet

Split

No capture

32 of 47

Binomial distributions for beads and cells

  • Consider n droplets, each of which has a probability p of containing a single cell. Then the probability that k cells will be captured is

    $$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$$
  • This suggests modeling the number of cells captured with a random variable that follows a Binomial distribution. That is, X ~ B(n,p).
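A short sketch of the Binomial probability mass function, with hypothetical values of n and p, confirming that the probabilities sum to 1 and that the mean is np.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.1  # hypothetical: 20 droplets, each holding a cell with probability 0.1
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]

print(sum(pmf))                               # total probability
print(sum(k * q for k, q in enumerate(pmf)))  # mean, compare with n * p
```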

32

33 of 47

The law of rare events

  • For large n and small p, the Binomial distribution B(n,p) is approximated well by the Poisson distribution Pois(λ = np), i.e.

    $$\binom{n}{k} p^k (1 - p)^{n-k} \approx \frac{e^{-\lambda} \lambda^k}{k!}$$
  • This is convenient for many reasons: the expression on the right is easier to evaluate and the parameter λ is readily interpretable as the expected value of a Poisson random variable.
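How good the approximation is can be seen by comparing the two probability mass functions directly. The values of n and p below are hypothetical, chosen so that λ = np = 1.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Pois(lam)."""
    return exp(-lam) * lam**k / factorial(k)

n, p = 100_000, 1e-5  # large n, small p
lam = n * p

for k in range(5):
    print(k, binomial_pmf(k, n, p), poisson_pmf(k, lam))

# Largest pointwise difference over the first few values of k
max_gap = max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(10))
print(max_gap)
```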

33

34 of 47

Expected value of a Poisson random variable

If X is a Poisson random variable, i.e. X ~ Pois(λ), then the expected value of X is given by

    $$\mathbb{E}[X] = \sum_{k=0}^{\infty} k \, \frac{e^{-\lambda} \lambda^k}{k!}$$

which is equal to λ.
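One way to see why the sum collapses to λ is the short derivation below. It is the standard textbook argument, offered as a sketch rather than as the derivation used in lecture: the k = 0 term vanishes, one factor of k cancels against k!, and the remaining series is the Taylor series of the exponential.

```latex
\begin{aligned}
\mathbb{E}[X]
  &= \sum_{k=0}^{\infty} k \, \frac{e^{-\lambda}\lambda^{k}}{k!}
   = \sum_{k=1}^{\infty} \frac{e^{-\lambda}\lambda^{k}}{(k-1)!} \\
  &= \lambda e^{-\lambda} \sum_{j=0}^{\infty} \frac{\lambda^{j}}{j!}
   = \lambda e^{-\lambda} e^{\lambda}
   = \lambda .
\end{aligned}
```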

34

35 of 47

A Poisson approximation for beads and cells

  • Load cells into droplets at Poisson rate λ.
  • Load beads into droplets at Poisson rate μ.

35

36 of 47

Cell capture and duplication rates

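  • The Poisson approximation yields a simple formula for the capture rate when beads are loaded at Poisson rate μ: $1 - e^{-\mu}$.
  • The split rate estimate is $1 - e^{-\mu} - \mu e^{-\mu}$, the probability that a droplet contains more than one bead.
  • This provides a quantitative assessment of the tradeoff between the capture rate and the split rate.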
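The tradeoff can be tabulated directly. The sketch assumes beads are loaded independently of cells at a Poisson rate, here called `bead_rate`, and takes the capture rate as the probability of at least one bead and the split rate as the probability of more than one bead. Both expressions are assumptions derived from the definitions of the two rates, and the listed rates are hypothetical.

```python
import math

def capture_rate(bead_rate):
    """Probability that a droplet holds at least one bead: 1 - exp(-rate)."""
    return 1 - math.exp(-bead_rate)

def split_rate(bead_rate):
    """Probability that a droplet holds more than one bead."""
    p0 = math.exp(-bead_rate)
    p1 = bead_rate * p0
    return 1 - p0 - p1

print("bead rate  capture  split")
for bead_rate in (0.05, 0.1, 0.5, 1.0, 2.0):
    print(f"{bead_rate:9.2f}  {capture_rate(bead_rate):7.3f}  {split_rate(bead_rate):5.3f}")
```

Raising the bead loading rate improves capture, but the split rate grows along with it, which is the tension the bullet points describe.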

36

Capture rate

Split rate

37 of 47

Reducing the number of beadless droplets

37

                    Drop-seq         inDrops        10x Genomics
Bead material       Polystyrene      Hydrogel       Hydrogel
Loading dynamics    Poisson          Sub-Poisson    Sub-Poisson
Dissolvable         No               No             Yes
Barcode release     No               UV release     Chemical release
Customizable        Demonstrated     Not shown      Feasible
Licensing           Open source      Open source    Proprietary
Availability        Beads are sold   Commercial     Commercial

38 of 47

Sub-Poisson (sometimes called super-Poisson) loading

38

39 of 47

Technical doublets

  • Technical doublets arise when two or more cells are captured in a droplet with a single bead. The technical doublet rate is therefore the probability of capturing two or more cells in a droplet given that at least one cell has been captured in a droplet:

    $$P(X \ge 2 \mid X \ge 1) = \frac{1 - e^{-\lambda} - \lambda e^{-\lambda}}{1 - e^{-\lambda}}$$

    where X ~ Pois(λ) is the number of cells in a droplet and λ is the cell loading rate.

  • Note that “overloading” a microfluidics single-cell experiment by loading more cells while keeping flow rates constant will increase the number of technical doublets due to an effective increase in λ.
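The effect of overloading can be sketched numerically. The function below assumes the number of cells per droplet is Poisson with rate `cell_rate` and evaluates P(two or more cells | at least one cell); the rates listed are hypothetical.

```python
import math

def technical_doublet_rate(cell_rate):
    """P(X >= 2 | X >= 1) for X ~ Pois(cell_rate)."""
    p0 = math.exp(-cell_rate)  # P(no cell)
    p1 = cell_rate * p0        # P(exactly one cell)
    return (1 - p0 - p1) / (1 - p0)

for cell_rate in (0.01, 0.05, 0.1, 0.2, 0.5):
    print(cell_rate, round(technical_doublet_rate(cell_rate), 4))
```

For small rates the result is close to half the cell loading rate, so doubling the cell load roughly doubles the technical doublet rate.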

39

40 of 47

Doublet detection: the barnyard plot

40

41 of 47

Bloom’s correction

  • Total number of droplets $N$, barnyard axes measured at $N_1$ and $N_2$, and observed doublets $N_{1,2}$.

41

42 of 47

Biological doublets

  • Biological doublets arise when two cells form a discrete unit that does not break apart during disruption to form a suspension.
  • Biological doublets will not be detected via barnyard plots.
  • One approach to avoiding biological doublets is to perform nuclear single-cell RNA-seq: Habib et al. 2017.
  • However, biological doublets are not necessarily just a technical nuisance to be avoided. The paper Halpern et al. 2018 utilizes biological doublets of hepatocytes and liver endothelial cells to assign tissue coordinates to liver endothelial cells via imputation from their hepatocyte partners.

42

43 of 47

Summary

43

[Figure: droplet outcomes, labeled Split, Doublet, No capture, Goal, Good, Bad, Irrelevant, Collision, Technical, Biological; parameter annotations: high λ, low μ, small L, high μ, low λ]

44 of 47

Summary of droplet single-cell RNA-seq methods and features

44

45 of 47

What most single-cell RNA-seq is not, circa 2023

  • It is NOT single-cell: While measurements are made from RNA molecules in individual cells, in standard practice (e.g., 10x Genomics, Drop-seq) very few RNA molecules are captured (on the order of 1%). What is measured therefore samples, but does not provide a complete picture of, the RNA inside any given cell. Since measurements from individual cells are incomplete, claims are for the most part restricted to groups of cells rather than individual cells.
  • It is NOT RNA: Most RNA-seq consists largely of sampling cDNA molecules from a cDNA library, which serves as a proxy for the (captured) RNA content of cells. (Direct RNA sequencing is in its infancy.)

45

46 of 47

Extensions of single-cell RNA-seq

  • Multimodal assays.
  • Multiplexing of samples.
  • Spatial single-cell technologies.
  • Packer and Trapnell 2018.

46

47 of 47

Additional References

47