1 of 73

Single-cell sequencing (single-cell RNA-seq, scRNA-seq)

Slide credits:

Ivan Valentinovich Antonov

Anatoly Zubritsky

Simon Andrews (simon.andrews@babraham.ac.uk)

Åsa Björklund (asa.bjorklund@scilifelab.se)


2 of 73

Classical RNA-seq (bulk RNA-seq)

3 of 73

Why are adapters needed in sequencing?

4 of 73

Adapters carry sample barcodes

What if each sample contains only a single cell?

Then the sample barcodes effectively act as cellular barcodes

5 of 73

Bulk vs. Single Cell RNA-Seq

6 of 73

Sample barcode

7 of 73

Tissues are heterogeneous!

Even when cells are morphologically identical, they can differ in the expression levels of some genes.

This makes tissues and cell populations heterogeneous at the transcriptome level and, sometimes, at the genome level (e.g. immune cells).

8 of 73

RNA expression in a tissue = "the average temperature across the hospital"

The RNA level measured in a fragment of intestine: which source does it actually come from?

  • Stem cells
  • Epithelium
  • Goblet cells
  • Blood vessels
  • Lymphatic vessels
  • Connective tissue
  • Muscle
  • Neurons
  • Symbiotic bacteria

https://medicine.nus.edu.sg/pathweb/normal-histology/colon/

9 of 73

10 of 73

Background

(Svensson et al.)

11 of 73

12 of 73

10x Genomics

13 of 73

Which technologies underlie the 10x instrument (Gel Bead-in-Emulsion, GEMs)?

  • Gel beads:
    • Unique molecular identifiers (UMIs)
    • Cellular barcodes

  • Microfluidics:
    • Partitioning cells into individual droplets

14 of 73

How 10X RNA-Seq Works

Cells

Barcoded Beads

Oil

RT Reagents

Gel Beads in Emulsion (GEMs)

15 of 73

Gel beads in emulsion (Gel Bead-in-Emulsion, GEMs)

https://theseuslab.by/p100390910-stantsiya-dlya-raboty.html

16 of 73

How 10X RNA-Seq Works

Oligo dT

Cell barcode (same within GEM)

UMI (all different)

Priming site

17 of 73

How 10X RNA-Seq Works

Oligo dT

Cell barcode (same within GEM)

UMI (all different)

Priming site

AAAAAGATTCGTAGTGCTGATGCT...

Reverse Transcription

Mix RNAs and Cells

Illumina Library Prep

18 of 73

How 10X RNA-Seq Works

Illumina Adapter

Illumina Adapter

UMI

Cell Barcode

3’ RNA Insert

Sample Barcode

Read 1

Read 2

Read 3

Sample level barcode – same for all cells and RNAs in a library

Cell level barcode (16bp) – same for all RNAs in a cell

UMI (10bp) – unique for one RNA in one cell
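
The barcode lengths above can be turned into a small parsing sketch. It assumes (this layout is not stated on the slide) that the cell barcode comes first and the UMI immediately follows it in the same read; `split_read1` and the example sequence are hypothetical names and data used only for illustration.

```python
from typing import NamedTuple

CELL_BARCODE_LEN = 16  # from the slide: cell-level barcode is 16 bp
UMI_LEN = 10           # from the slide: UMI is 10 bp


class Read1Tags(NamedTuple):
    cell_barcode: str
    umi: str


def split_read1(sequence: str) -> Read1Tags:
    """Split a barcode read into cell barcode and UMI.

    Assumption: barcode first, UMI directly after it.
    """
    needed = CELL_BARCODE_LEN + UMI_LEN
    if len(sequence) < needed:
        raise ValueError(f"read is {len(sequence)} bp, need at least {needed}")
    return Read1Tags(
        cell_barcode=sequence[:CELL_BARCODE_LEN],
        umi=sequence[CELL_BARCODE_LEN:needed],
    )


# Made-up read: 16 bp barcode + 10 bp UMI + trailing bases that are ignored
tags = split_read1("AAACCTGAGAAACCAT" + "GTCATTGCAA" + "TTTT")
print(tags.cell_barcode, tags.umi)
```

All reads sharing a cell barcode are then assigned to one cell, and reads sharing barcode and UMI are treated as copies of one original RNA molecule.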

19 of 73

https://youtu.be/9YXRoaQyixQ

20 of 73

21 of 73

3’-end Sequencing w/ UMIs* (10X Genomics)


*unique molecular identifiers

22 of 73

23 of 73

24 of 73

25 of 73

10X Produces Barcode Counts

[Figure: Sample WT contains cells WT A, WT B and WT C; Sample KO contains cells KO A, KO B and KO C; each cell carries its own set of UMIs]

UMIs are finally related to genes to get per-gene counts
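
A minimal sketch of how UMIs become per-gene counts, assuming each aligned read has already been reduced to a (cell barcode, UMI, gene) triple. The function name `count_umis` and the toy reads are hypothetical; real pipelines also correct sequencing errors in barcodes and UMIs, which is omitted here.

```python
from collections import defaultdict


def count_umis(records):
    """Count distinct UMIs per (cell, gene).

    records: iterable of (cell_barcode, umi, gene), one tuple per aligned read.
    Reads with the same cell, gene and UMI are PCR copies and count once.
    """
    seen = defaultdict(set)  # (cell, gene) -> set of distinct UMIs
    for cell, umi, gene in records:
        seen[(cell, gene)].add(umi)

    counts = defaultdict(dict)
    for (cell, gene), umis in seen.items():
        counts[cell][gene] = len(umis)
    return dict(counts)


reads = [
    ("WT_A", "AACG", "Actb"),
    ("WT_A", "AACG", "Actb"),   # PCR duplicate of the read above
    ("WT_A", "TTGC", "Actb"),
    ("WT_A", "AACG", "Gapdh"),  # same UMI, different gene: a separate molecule
    ("WT_B", "CCGA", "Actb"),
]
umi_counts = count_umis(reads)
print(umi_counts)
```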

26 of 73

Sample Level Barcodes

  • Only present if multiple libraries mixed in a lane

  • Get standard barcode split report, but with 4 barcodes used per sample

  • Even coverage within and between libraries

27 of 73

28 of 73

29 of 73

30 of 73

https://ars.els-cdn.com/content/image/1-s2.0-S0304416510001169-gr4_lrg.jpg

31 of 73

https://www.nature.com/articles/srep33883/figures/2

32 of 73

33 of 73

https://www.science.org/doi/10.1126/science.aar3131

34 of 73

Single-cell sequencing: epigenetics


35 of 73

Molecule barcode (unique molecular identifiers, UMIs)

https://www.pnas.org/doi/full/10.1073/pnas.1118018109

36 of 73

Bulk vs. Single Cell RNA-Seq

37 of 73

Visualising scRNA-seq: PCA, t-SNE …

38 of 73

Visualising scRNA-seq: PCA, t-SNE …

39 of 73

Projects to sequence all cell types with scRNA-seq

GTEx

Human Cell Atlas

40 of 73

Cell reprogramming: we need models

41 of 73

Why single-cell sequencing is needed in RNA-seq

42 of 73

Why single-cell sequencing is needed in epigenetics

43 of 73

Single-cell sequencing methods for epigenetics

44 of 73

Single-cell sequencing methods for epigenetics

45 of 73

Single-cell sequencing methods for epigenetics

46 of 73

Single-cell sequencing methods for epigenetics

47 of 73

Single-cell sequencing methods for epigenetics

https://en.wikipedia.org/wiki/Single_cell_epigenomics

48 of 73

single-cell multi-omic data integration

Slides by Valentina Lorenzi

Systems Biology Course (EMBL-EBI) - 25th October 2023

49 of 73

why do we want single-cell and multi-omic measurements?

cell intrinsic

cell extrinsic

The Human Cell Atlas White Paper

50 of 73

High throughput (employs massively parallel sequencing)

droplet-based scRNA-seq

Lower sensitivity because only one end of the transcript is sequenced

BUT definition of cell identity is highly dependent on the number of cells sequenced (more than sequencing depth)

from Lorenzi & Vento-Tormo, 2022

51 of 73

scRNA-seq data analysis workflow

Quality control: dying cells, ambient RNA, doublets

Normalisation, feature selection

Dimensionality reduction, clustering

Visualisation

Cell type annotation

[ Differential gene expression ]

[ Differential cell abundance ]

[ Pseudotime ]

basic workflow

dimensionality reduction

Curse of dimensionality = not all features are important to understand the underlying dynamics of the dataset and there is an inherent redundancy

linear vs non-linear

dimensionality reduction

neighbourhood graph

clustering

Leiden, Louvain

visualisation

t-SNE, UMAP

PCA

autoencoders

cell x gene matrix

52 of 73

droplet-based snATAC-seq

10x multiomic single nucleus RNA/ATAC data

from 10x Genomics website

Minnoye et al. 2021 Chromatin accessibility profiling methods. Nat Rev Methods Primer

53 of 73

snATAC-seq data analysis workflow

Quality control

Fragment size distribution, transcription start site (TSS) enrichment

Count fragments over a common set of genomic regions (peak calling, bins, known enhancers, ...) to obtain tabular data

TF-IDF normalisation

Dimensionality reduction

Visualisation

Cell type annotation

[ TF motif enrichment ] [ Pseudotime ]

[ Integration with scRNA-seq ]

snATAC-seq data is quasi binary (most values are 0 or 1).

basic workflow

LSI dimensionality reduction

Term Frequency - Inverse Document Frequency (TF-IDF) normalisation

fragments file

Singular Value Decomposition (PCA)

methods adapted from text processing for topic extraction

Latent Semantic Indexing: regions = words, cells = documents
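
To make the text-processing analogy concrete, here is a sketch of one common TF-IDF variant applied to a cells x regions matrix. The exact formula (log of 1 + TF x IDF x scale factor) and the name `tf_idf` are assumptions chosen for illustration; several TF-IDF variants are in use, and the slide does not say which one is meant.

```python
import math


def tf_idf(matrix, scale=1e4):
    """TF-IDF for a cells x regions count matrix (cells = documents, regions = words).

    TF  = count / total counts in the cell
    IDF = number of cells / number of cells in which the region is open
    """
    n_cells = len(matrix)
    n_regions = len(matrix[0])
    cells_with_region = [
        sum(1 for cell in matrix if cell[j] > 0) for j in range(n_regions)
    ]
    idf = [n_cells / c if c > 0 else 0.0 for c in cells_with_region]

    weighted = []
    for cell in matrix:
        total = sum(cell)
        row = []
        for j, value in enumerate(cell):
            tf = value / total if total > 0 else 0.0
            row.append(math.log1p(tf * idf[j] * scale))
        weighted.append(row)
    return weighted


# Quasi-binary toy data: region 0 is open in every cell, the others in one cell each
cells = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 1],
]
weighted = tf_idf(cells)
```

Regions open in every cell are down-weighted relative to regions open in few cells; a singular value decomposition of the weighted matrix then gives the LSI embedding.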

54 of 73

beyond scRNA and ATAC-seq

Zhu, Preissl & Ren Single-cell multimodal omics: the power of many, Nat Methods (2020)

55 of 73

goals of multi-omic data analysis

embedding in a meaningful latent space

identify statistical relationships between features

56 of 73

defining the integration axis

Argelaguet, Cuomo, Stegle and Marioni (2021) Computational principles and challenges in single-cell data integration, Nat Biotechnol

57 of 73

defining the integration axis

Batch effect correction, mapping to reference atlas

Multi-omics analysis

58 of 73

Vertical integration of matched multi-omics data

  1. Construct a kNN graph on each modality's low-dimensional embedding
    a. scRNA-seq --> PCA
    b. scATAC-seq --> LSI
  2. For each cell i, identify its k nearest neighbours in each modality (RNA neighbours and ATAC neighbours) and average the low-dimensional profile of each neighbour set; each average is a prediction of the molecular contents of cell i based on its local neighbourhood
    a. within-modality prediction
    b. cross-modality prediction
  3. Compute the similarity between the predicted values (within- and cross-modality) and the actual low-dimensional profile of cell i
  4. Calculate the ratio between the two similarities (affinities) to obtain cell-specific modality weights
  5. Compute a new similarity metric between any two cells that reflects a weighted combination of RNA and ATAC affinities
  6. Construct a kNN graph using this weighted similarity metric (WNN)
  7. Downstream analysis (e.g. visualization, clustering) of the WNN graph
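
The modality-weight steps above can be sketched in a few lines. This is a deliberately simplified illustration, not the published implementation: the affinity kernel exp(-distance), the softmax used to turn the two ratios into weights, and all function names are assumptions made here.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn(embedding, i, k):
    """Indices of the k nearest neighbours of cell i, excluding i itself."""
    others = [j for j in range(len(embedding)) if j != i]
    others.sort(key=lambda j: euclidean(embedding[i], embedding[j]))
    return others[:k]


def mean_profile(embedding, indices):
    """Average low-dimensional profile of a neighbour set (the 'prediction')."""
    dims = len(embedding[0])
    return [sum(embedding[j][d] for j in indices) / len(indices) for d in range(dims)]


def affinity_ratio(embedding, i, own_nn, other_nn, eps):
    """Within-modality affinity divided by cross-modality affinity for cell i."""
    within = math.exp(-euclidean(embedding[i], mean_profile(embedding, own_nn)))
    cross = math.exp(-euclidean(embedding[i], mean_profile(embedding, other_nn)))
    return within / (cross + eps)


def modality_weights(rna, atac, k=1, eps=1e-4):
    """Return one (w_rna, w_atac) pair per cell; each pair sums to 1."""
    weights = []
    for i in range(len(rna)):
        nn_rna, nn_atac = knn(rna, i, k), knn(atac, i, k)
        s_rna = affinity_ratio(rna, i, nn_rna, nn_atac, eps)
        s_atac = affinity_ratio(atac, i, nn_atac, nn_rna, eps)
        top = max(s_rna, s_atac)  # subtract the maximum to avoid overflow
        e_rna, e_atac = math.exp(s_rna - top), math.exp(s_atac - top)
        weights.append((e_rna / (e_rna + e_atac), e_atac / (e_rna + e_atac)))
    return weights


# Toy embeddings: RNA separates cells {0, 1} from {2, 3}; ATAC groups them differently
rna = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
atac = [[0.0], [1.0], [0.1], [1.1]]
weights = modality_weights(rna, atac)
```

In this toy case the RNA neighbours predict each cell's RNA profile far better than the ATAC neighbours do, so RNA receives the larger weight for every cell.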

Is the average really the best view? Jointly analyzing datasets is supposed to increase the resolution, but we might just be smoothing out true differences between modalities

Hao, Hao et al. Cell 2021

weighted nearest neighbours

59 of 73


60 of 73

Vertical integration of matched multi-omics data

Multi-Omics Factor Analysis v2 (MOFA+)

Scalability limits

Affected by imbalances in number of features in each modality

Z = contains low dimensional representation of the cells

W = contains an association score for each feature with each latent factor

Structure of the data is specified in the prior distributions of the Bayesian model

Uses sparsity priors, which enable automatic relevance determination of the factors and encourage solutions where factors are associated with a small number of features / active in few groups of cells
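
For reference, the factor model behind Z and W can be written as below. The symbols Y (data matrix of modality m), ε (residual noise), N (cells), D_m (features in modality m) and K (factors) are notation assumed here; they do not appear on the slide.

```latex
\mathbf{Y}^{m} = \mathbf{Z}\,\left(\mathbf{W}^{m}\right)^{\top} + \boldsymbol{\varepsilon}^{m},
\qquad m = 1, \dots, M,
\qquad
\mathbf{Y}^{m} \in \mathbb{R}^{N \times D_m},\;
\mathbf{Z} \in \mathbb{R}^{N \times K},\;
\mathbf{W}^{m} \in \mathbb{R}^{D_m \times K}
```

Z is shared across modalities, while each modality has its own weight matrix, which is why one set of factors can be interpreted in every data type.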

Argelaguet, Velten et al. Mol Sys Biol 2018

Argelaguet, Arnol, Bredikhin et al. Genome Biology 2020

61 of 73

Diagonal integration of unmatched multi-omics data

adapted methods from horizontal integration

  1. Transform data to gene-level features (e.g. count ATAC fragments over gene bodies)
  2. Apply horizontal integration methods used for batch correction (e.g. Seurat’s CCA)

gene activity scores

Horizontal integration!

Assumption that gene accessibility is linearly correlated with gene expression
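
A minimal sketch of step 1, counting ATAC fragments over gene bodies to get gene activity scores. It assumes half-open coordinates and a plain per-gene interval; `gene_activity`, the gene names and the coordinates are hypothetical, and the nested loop is only suitable for toy data.

```python
def gene_activity(fragments, genes):
    """Count, per cell and gene, the fragments overlapping the gene body.

    fragments: iterable of (chrom, start, end, cell_barcode), half-open intervals
    genes: dict gene_name -> (chrom, start, end)
    """
    counts = {}
    for chrom, start, end, cell in fragments:
        for gene, (g_chrom, g_start, g_end) in genes.items():
            # Two half-open intervals overlap if each starts before the other ends
            if chrom == g_chrom and start < g_end and end > g_start:
                cell_counts = counts.setdefault(cell, {})
                cell_counts[gene] = cell_counts.get(gene, 0) + 1
    return counts


genes = {"GeneA": ("chr1", 1000, 2000), "GeneB": ("chr1", 5000, 6000)}
fragments = [
    ("chr1", 900, 1100, "cell1"),   # overlaps the start of GeneA
    ("chr1", 1500, 1600, "cell1"),  # inside GeneA
    ("chr1", 5500, 5700, "cell1"),  # inside GeneB
    ("chr1", 2000, 2100, "cell2"),  # starts exactly at GeneA's end: no overlap
    ("chr2", 1200, 1300, "cell2"),  # wrong chromosome
    ("chr1", 5990, 6100, "cell2"),  # overlaps the end of GeneB
]
activity = gene_activity(fragments, genes)
print(activity)
```

The resulting cell x gene table has the same feature space as scRNA-seq, which is what lets horizontal integration methods be applied to it.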

Stuart, Butler et al., Cell 2019

62 of 73

Diagonal integration of unmatched multi-omics data

autoencoder neural network architecture

matching graph topology

methods working on unpaired features

The embedding of each dataset is performed using an autoencoder, whose architectures can be customized to the specific data modality

Combining the encoder and decoder modules of different autoencoders enables translation between different data modalities at the single-cell level

  1. recovers geodesic distances on a single latent manifold on which all data lie
  2. constructs a neighborhood graph (MultiGraph) on the manifold
  3. projects the data into a single low-dimensional embedding

Assumption that cells lie on the same latent manifold

Yang, Belyaeva et al., Nature Communications 2021

Jain, Polanski et al., Genome Biology 2021

63 of 73

goals of multi-omic data analysis

embedding in a meaningful latent space

identify statistical relationships between features

64 of 73

identifying statistical relationships between features

Network representations of molecular interactions between transcriptional regulators and target genes

With single-cell multi-omics we measure different molecular features / layers of gene regulation (either from the same cells or from different cells that can be computationally matched)

How do we identify statistical relationships between the different molecular features?

In scRNA/ATAC-seq multi-omic analysis this is often referred to as Gene Regulatory Network inference

65 of 73

preprocessing for feature-wise analysis

Persad, Choo et al., Nature Biotechnology 2023

potential approaches

problem

What’s the right resolution to consider when matching cells from different modalities in diagonal integration?

Figure symbol: a generic averaged profile (GEX or accessibility) over a group of cells, e.g. clusters, kNN graph neighbourhoods or metacells

Impute expression for scATAC cells or vice versa (e.g. as the average of the k nearest neighbours)

66 of 73

preprocessing for feature-wise analysis

potential approaches

feature selection

Which features (genes or peaks) should we choose to identify statistical relationships?

Which genes?

Which accessibility features?

Which feature pairs?

highly variable genes

cell type marker genes

dynamic genes in pseudotime

aggregate peaks by genomic locus?

aggregate peaks by TF motif?

67 of 73

identifying statistical relationships between features

correlation-based

machine learning-based

Gradient Boosting Machine Regression

Identify regions enriched in TF motifs

Infer region-to-gene relationships (define search space around the gene)

Infer TF-to-gene relationships

Bravo González-Blas, De Winter et al., Nature Methods 2023

For each TF, generate TF–region–gene triplets by taking all regions that are enriched for a motif annotated to the TF and all genes linked to these regions

Run Gene Set Enrichment Analysis (GSEA) by ranking all genes based on their TF-to-gene importance score and calculate enrichment of the set of genes within the TF–region–gene triplet
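
As a contrast to the machine-learning route, the correlation-based route for region-to-gene links can be sketched as follows. This is not the gradient-boosting approach cited above; the search window, the correlation threshold, the function names and the toy profiles (one value per group of cells, e.g. per metacell) are all assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long lists; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def link_regions_to_gene(gene_tss, gene_expr, regions, window=150_000, min_r=0.5):
    """Keep regions inside the search space whose accessibility tracks expression.

    regions: dict region_name -> (genomic position, accessibility per cell group)
    """
    links = {}
    for name, (position, accessibility) in regions.items():
        if abs(position - gene_tss) > window:
            continue  # outside the search space around the gene
        r = pearson(accessibility, gene_expr)
        if abs(r) >= min_r:
            links[name] = r
    return links


gene_expr = [1.0, 2.0, 3.0, 4.0]  # expression of one gene in four cell groups
regions = {
    "enhancer_near": (10_000, [0.1, 0.2, 0.3, 0.4]),  # tracks expression
    "flat_near": (20_000, [0.5, 0.1, 0.5, 0.1]),      # weakly related
    "far_away": (900_000, [1.0, 2.0, 3.0, 4.0]),      # correlated but too distant
}
links = link_regions_to_gene(gene_tss=0, gene_expr=gene_expr, regions=regions)
```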

68 of 73

anndata: for unimodal data (e.g. scRNA-seq, scATAC-seq)

data structures in Python for single-cell data

mudata: extension of anndata to multimodal data (e.g. multiome scRNA/ATAC-seq)

scverse/AnnData scverse/MuData

anndata

mudata

69 of 73

spatially-resolved transcriptomics

70 of 73

spot-based spatial transcriptomics (10x Visium)

transcriptome-wide throughput

from Lorenzi & Vento-Tormo, 2022

55 µm resolution (diameter of a spot)

71 of 73

imaging-based spatial transcriptomics (Cartana, now 10x)

single-cell (even subcellular) resolution

throughput is limited to ~300 genes

from Lorenzi & Vento-Tormo, 2022

72 of 73

single-cell and spatial transcriptomics integration

deconvolution-based methods

collection of cell states and their gene expression signatures (from scRNA-seq)

UMAP1

UMAP2

spot-based spatial transcriptomics (10x Visium)

Idea: deconvolve signal from each spot as contribution of each cell type in the single-cell reference data

Readout: cell type abundance per spot
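
The deconvolution idea can be illustrated with plain non-negative least squares, solved by projected gradient descent. This is a swapped-in, much simpler technique than the Bayesian model in the cited paper; `deconvolve_spot`, the signature values and the cell type names are hypothetical.

```python
def deconvolve_spot(spot, signatures, n_iter=500):
    """Estimate non-negative cell type abundances w so that
    spot[g] ~= sum over types t of w[t] * signatures[t][g].

    spot: expression per gene in one spot
    signatures: dict cell_type -> expression signature per gene (from scRNA-seq)
    """
    names = list(signatures)
    n_genes = len(spot)
    # Safe step size: 1 / squared Frobenius norm of the signature matrix
    step = 1.0 / sum(v * v for sig in signatures.values() for v in sig)
    w = [0.0] * len(names)
    for _ in range(n_iter):
        predicted = [
            sum(w[t] * signatures[names[t]][g] for t in range(len(names)))
            for g in range(n_genes)
        ]
        residual = [predicted[g] - spot[g] for g in range(n_genes)]
        for t, name in enumerate(names):
            grad = sum(signatures[name][g] * residual[g] for g in range(n_genes))
            w[t] = max(0.0, w[t] - step * grad)  # project back onto w >= 0
    return dict(zip(names, w))


signatures = {
    "T cell": [10.0, 0.0, 1.0],
    "epithelial": [0.0, 8.0, 1.0],
}
# A spot built from 2 T cells and 1 epithelial cell
spot = [20.0, 8.0, 3.0]
abundance = deconvolve_spot(spot, signatures)
```

The readout matches the slide: one abundance value per cell type per spot.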

Kleshchevnikov, Shmatko et al., Nature Biotechnology 2022

73 of 73

single-cell and spatial transcriptomics integration

imaging-based spatial transcriptomics

nearest neighbours based methods

collection of cell states and their gene expression signatures (from scRNA-seq)

UMAP1

UMAP2

Idea: assign cell type identity to segmented cells based on k nearest neighbours in single-cell reference data

  1. subset both the scRNA-seq and the ISS dataset to the genes measured in ISS
  2. create a kNN graph of all cells
  3. find the scRNA-seq kNNs of each ISS cell and assign a cell type label by majority voting
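
A sketch of the label transfer described above, assuming plain Euclidean distance on the shared genes and a direct query-to-reference search instead of a joint kNN graph. `transfer_labels`, the gene panel and the expression values are hypothetical.

```python
import math
from collections import Counter


def transfer_labels(reference, labels, query, shared_genes, k=3):
    """Label each query (ISS) cell by majority vote among its k nearest scRNA-seq cells.

    reference, query: lists of dicts gene -> expression
    labels: cell type label for each reference cell
    shared_genes: genes measured in ISS; both datasets are subset to these
    """
    def as_vector(cell):
        return [cell[g] for g in shared_genes]

    ref_vectors = [as_vector(cell) for cell in reference]
    assigned = []
    for cell in query:
        q = as_vector(cell)
        order = sorted(range(len(ref_vectors)), key=lambda j: math.dist(q, ref_vectors[j]))
        votes = Counter(labels[j] for j in order[:k])
        assigned.append(votes.most_common(1)[0][0])
    return assigned


shared_genes = ["CD3E", "EPCAM"]  # ACTB is in the reference but not in the ISS panel
reference = [
    {"CD3E": 5, "EPCAM": 0, "ACTB": 9},
    {"CD3E": 4, "EPCAM": 0, "ACTB": 2},
    {"CD3E": 6, "EPCAM": 1, "ACTB": 5},
    {"CD3E": 0, "EPCAM": 5, "ACTB": 8},
    {"CD3E": 0, "EPCAM": 6, "ACTB": 1},
    {"CD3E": 1, "EPCAM": 4, "ACTB": 3},
]
labels = ["T cell", "T cell", "T cell", "epithelial", "epithelial", "epithelial"]
iss_cells = [{"CD3E": 5, "EPCAM": 0}, {"CD3E": 0, "EPCAM": 5}]
iss_labels = transfer_labels(reference, labels, iss_cells, shared_genes)
print(iss_labels)
```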

Readout: segmented cells with cell type label