Single-cell sequencing (single-cell RNA-seq, scRNA-seq)
Slide credits:
Антонов Иван Валентинович
Зубрицкий Анатолий
Simon Andrews (simon.andrews@babraham.ac.uk)
Åsa Björklund (asa.bjorklund@scilifelab.se)
Classic RNA-seq (bulk RNA-seq)
Why does sequencing need adapters?
Adapters carry sample barcodes
What if each sample contains only one cell?
Then we are effectively using cellular barcodes
Bulk vs. Single Cell RNA-Seq
Sample barcode
Tissues are heterogeneous!
Even when cells are morphologically identical, they can differ in the expression levels of some genes.
This makes tissues and cell populations heterogeneous at the transcriptome level and, sometimes, at the genome level (immune cells).
RNA expression in a tissue = “the average temperature across the hospital”
The RNA level measured in a fragment of intestine: which source does it actually come from?
https://medicine.nus.edu.sg/pathweb/normal-histology/colon/
Background
(Svensson et al.)
10x Genomics
What technology underlies the 10X instrument (Gel Bead-in-Emulsion, GEMs)?
How 10X RNA-Seq Works
Cells
Barcoded Beads
Oil
RT Reagents
Gel Beads in Emulsion (GEMs)
Gel Bead-in-Emulsion (GEMs)
https://theseuslab.by/p100390910-stantsiya-dlya-raboty.html
How 10X RNA-Seq Works
Oligo dT
Cell barcode (same within GEM)
UMI (all different)
Priming site
AAAAAGATTCGTAGTGCTGATGCT...
Reverse Transcription
Mix RNAs and Cells
Illumina Library Prep
How 10X RNA-Seq Works
[Figure: structure of the sequenced library. Labels: Illumina adapter, UMI, cell barcode, 3’ RNA insert, sample barcode, Illumina adapter; Read 1, Read 2, Read 3]
Sample level barcode – same for all cells and RNAs in a library
Cell level barcode (16bp) – same for all RNAs in a cell
UMI (10bp) – unique for one RNA in one cell
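The barcode lengths above are enough to sketch how a read is split into its parts. This is a minimal illustration, assuming the cell barcode comes first and the UMI follows it directly in one read; that ordering, the function name `split_read1` and the example sequence are assumptions, not taken from the slides.

```python
CELL_BARCODE_LEN = 16  # from the slide: cell-level barcode is 16 bp
UMI_LEN = 10           # from the slide: UMI is 10 bp


def split_read1(read1):
    """Split a barcode read into (cell_barcode, umi).

    Assumption: the cell barcode comes first and the UMI follows directly.
    """
    needed = CELL_BARCODE_LEN + UMI_LEN
    if len(read1) < needed:
        raise ValueError(f"read is shorter than {needed} bp")
    cell_barcode = read1[:CELL_BARCODE_LEN]
    umi = read1[CELL_BARCODE_LEN:needed]
    return cell_barcode, umi


# Toy read, not real data: 16 bp barcode followed by a 10 bp UMI.
example_read1 = "ACGTACGTACGTACGT" + "TTGCAATGCC"
barcode, umi = split_read1(example_read1)
```

All reads sharing `barcode` come from one cell; reads that also share `umi` and map to the same gene come from one original RNA molecule.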
https://youtu.be/9YXRoaQyixQ
3’-end Sequencing w/ UMIs* (10X Genomics)
*unique molecular identifiers
10X Produces Barcode Counts
[Figure: Sample WT contains Cell WT A, Cell WT B and Cell WT C; Sample KO contains Cell KO A, Cell KO B and Cell KO C; each cell holds many UMIs]
UMIs are finally related to genes to get per-gene counts
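How UMIs turn into per-gene counts can be sketched in a few lines: reads with the same cell barcode, UMI and gene are collapsed into one molecule. The read tuples and names below are hypothetical toy data, and real pipelines also correct sequencing errors in barcodes, which this sketch skips.

```python
from collections import defaultdict

# Hypothetical aligned reads: (cell barcode, UMI, gene the read maps to).
reads = [
    ("CELL_A", "AAAC", "Gapdh"),
    ("CELL_A", "AAAC", "Gapdh"),  # PCR duplicate: same cell, UMI and gene
    ("CELL_A", "GGTC", "Gapdh"),
    ("CELL_A", "TTAG", "Actb"),
    ("CELL_B", "AAAC", "Gapdh"),  # same UMI, but a different cell
]


def count_umis(reads):
    """Return {cell: {gene: number of distinct UMIs}}."""
    seen = defaultdict(set)  # (cell, gene) -> set of UMIs observed
    for cell, umi, gene in reads:
        seen[(cell, gene)].add(umi)
    counts = defaultdict(dict)
    for (cell, gene), umis in seen.items():
        counts[cell][gene] = len(umis)
    return dict(counts)


umi_counts = count_umis(reads)
```

The duplicated read in `CELL_A` is counted once, so amplification bias does not inflate the Gapdh count.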
Sample Level Barcodes
https://ars.els-cdn.com/content/image/1-s2.0-S0304416510001169-gr4_lrg.jpg
https://www.nature.com/articles/srep33883/figures/2
https://www.science.org/doi/10.1126/science.aar3131
Single-cell sequencing: epigenetics
Molecular barcode (unique molecular identifier, UMI)
https://www.pnas.org/doi/full/10.1073/pnas.1118018109
Bulk vs. Single Cell RNA-Seq
Visualizing scRNA-seq: PCA, t-SNE …
Projects that sequence all cell types with scRNA-seq
GTEx
Human Cell Atlas
Cell reprogramming: we need models
What single-cell sequencing is for in RNA-seq
What single-cell sequencing is for in epigenetics
Single-cell sequencing methods for epigenetics
https://en.wikipedia.org/wiki/Single_cell_epigenomics
single-cell multi-omic data integration
Slides by Valentina Lorenzi
Systems Biology Course (EMBL-EBI) - 25th October 2023
why do we want single-cell and multi-omic measurements?
cell intrinsic
cell extrinsic
The Human Cell Atlas White Paper
High throughput (employs massive parallel sequencing)
droplet-based scRNA-seq
Lower sensitivity because only one end of the transcript is sequenced
BUT definition of cell identity is highly dependent on the number of cells sequenced (more than sequencing depth)
from Lorenzi & Vento-Tormo, 2022
scRNA-seq data analysis workflow
Quality control: dying cells, ambient RNA, doublets
Normalisation, feature selection
Dimensionality reduction, clustering
Visualisation
Cell type annotation
[ Differential gene expression ]
[ Differential cell abundance ]
[ Pseudotime ]
basic workflow
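The first two workflow steps, quality control and normalisation, can be sketched on a toy count matrix. Everything numeric here is an assumption for illustration: the counts, the `mt-` prefix used to spot mitochondrial genes, the 0.5 mitochondrial-fraction cutoff, and the common convention of scaling each cell to 10,000 counts before a log transform.

```python
import math

# Toy cell x gene UMI counts (hypothetical values).
genes = ["Gapdh", "Actb", "mt-Co1"]
counts = {
    "cell_1": [50, 30, 20],
    "cell_2": [5, 5, 90],   # mostly mitochondrial: likely a dying cell
    "cell_3": [100, 80, 20],
}


def mito_fraction(row):
    """Fraction of a cell's counts that come from mitochondrial genes."""
    total = sum(row)
    mito = sum(c for g, c in zip(genes, row) if g.startswith("mt-"))
    return mito / total if total else 0.0


def normalise(row, target_sum=10_000):
    """Scale a cell to target_sum total counts, then apply log(1 + x)."""
    total = sum(row)
    return [math.log1p(c / total * target_sum) for c in row]


MAX_MITO = 0.5  # assumed cutoff, to be tuned per dataset
kept = {c: r for c, r in counts.items() if mito_fraction(r) <= MAX_MITO}
normalised = {c: normalise(r) for c, r in kept.items()}
```

After this step `cell_1` and `cell_3` are comparable despite a twofold difference in total counts, and `cell_2` is removed.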
dimensionality reduction
Curse of dimensionality = not all features are important to understand the underlying dynamics of the dataset and there is an inherent redundancy
linear vs non-linear
cell x gene matrix
dimensionality reduction: PCA, autoencoders
neighbourhood graph
clustering: Leiden, Louvain
visualisation: t-SNE, UMAP
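The neighbourhood graph that clustering and visualisation build on can be sketched as a brute-force k-nearest-neighbour search in the reduced space. The 2-D coordinates below are hypothetical stand-ins for a low-dimensional embedding such as the first principal components; real datasets use approximate search instead of comparing all pairs.

```python
import math

# Hypothetical 2-D embedding (e.g. the first two principal components).
embedding = {
    "c1": (0.0, 0.0),
    "c2": (0.1, 0.0),
    "c3": (0.0, 0.2),
    "c4": (5.0, 5.0),
    "c5": (5.1, 5.0),
    "c6": (5.0, 5.3),
}


def knn_graph(points, k):
    """Return {cell: [its k nearest other cells, closest first]}."""
    graph = {}
    for cell, xy in points.items():
        others = [(math.dist(xy, other_xy), other)
                  for other, other_xy in points.items() if other != cell]
        graph[cell] = [name for _, name in sorted(others)[:k]]
    return graph


neighbours = knn_graph(embedding, k=2)
```

Here the graph splits into two groups of three cells with no edges between them, which is the kind of structure that community detection methods such as Leiden or Louvain pick up as clusters.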
droplet-based snATAC-seq
10x multiomic single nucleus RNA/ATAC data
from 10x Genomics website
Minnoye et al. 2021 Chromatin accessibility profiling methods. Nat Rev Methods Primer
snATAC-seq data analysis workflow
Quality control: fragment size distribution, Transcription Start Site (TSS) enrichment
Count fragments over a common set of genomic regions (peak calling, bins, known enhancers, ...) to obtain tabular data
TF-IDF normalisation
Dimensionality reduction
Visualisation
Cell type annotation
[ TF motif enrichment ] [ Pseudotime ]
[ Integration with scRNA-seq ]
snATAC-seq data is quasi binary (most values are 0 or 1).
basic workflow
LSI dimensionality reduction
Term Frequency-Inverse Document Frequency (TF-IDF) normalisation
fragments file
Singular Value Decomposition (PCA)
Latent Semantic Indexing: methods adapted from text processing for topic extraction
regions = words
cells = documents
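The TF-IDF step can be sketched on a tiny binary cell x region matrix, treating cells as documents and regions as words. TF-IDF has several variants; the one below (term frequency = value divided by the cell total, inverse document frequency = log(1 + number of cells / number of cells with the region)) is an assumption chosen for illustration, and the matrix is toy data.

```python
import math

# Toy binary cell x region matrix (rows = cells, columns = regions).
regions = ["peak_1", "peak_2", "peak_3"]
matrix = [
    [1, 1, 0],  # cell_1
    [1, 0, 0],  # cell_2
    [1, 0, 1],  # cell_3
    [1, 1, 0],  # cell_4
]


def tf_idf(matrix):
    """Weight each entry by how rare its region is across cells."""
    n_cells = len(matrix)
    n_regions = len(matrix[0])
    # Document frequency: number of cells in which each region is accessible.
    df = [sum(row[j] for row in matrix) for j in range(n_regions)]
    idf = [math.log(1 + n_cells / d) if d else 0.0 for d in df]
    out = []
    for row in matrix:
        total = sum(row)
        tf = [v / total if total else 0.0 for v in row]
        out.append([t * w for t, w in zip(tf, idf)])
    return out


weighted = tf_idf(matrix)
```

For `cell_3`, the rare `peak_3` ends up with a larger weight than the ubiquitous `peak_1`, so the subsequent SVD is driven by regions that distinguish cells rather than regions open everywhere.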
beyond scRNA and ATAC-seq
Zhu, Preissl & Ren Single-cell multimodal omics: the power of many, Nat Methods (2020)
goals of multi-omic data analysis
embedding in a meaningful latent space
identify statistical relationships between features
defining the integration axis
Argelaguet, Cuomo, Stegle and Marioni (2021) Computational principles and challenges in single-cell data integration, Nat Biotechnology
defining the integration axis
Batch effect correction, mapping to reference atlas
Multi-omics analysis
Vertical integration of matched multi-omics data
1. Construct kNN graph on each modality’s low dimensional embedding
a. scRNA-seq --> PCA
b. scATAC-seq --> LSI
2. For each cell i, identify its k nearest neighbours in each modality (RNA neighbours and ATAC neighbours) and average the low-dimensional profile of each neighbour set, which represents a prediction for the molecular contents for cell i based on local neighbourhood
a. within-modality prediction
b. cross-modality prediction
Is the average really the best view? Jointly analyzing datasets is supposed to increase the resolution, but we might just be smoothing out true differences between modalities
Hao, Hao et al. Cell 2021
weighted nearest neighbours
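Step 2 above, the within-modality and cross-modality predictions, can be sketched for a single cell. The embeddings and helper names are hypothetical, and the sketch stops at the two prediction errors; how the method of Hao et al. turns such errors into per-cell modality weights is not shown.

```python
import math

# Hypothetical low-dimensional embeddings of the same four cells.
rna_pca = {"c1": (0.0, 0.0), "c2": (0.2, 0.0), "c3": (4.0, 4.0), "c4": (4.2, 4.0)}
atac_lsi = {"c1": (1.0, 1.0), "c2": (3.0, 3.1), "c3": (1.1, 1.0), "c4": (3.0, 3.0)}


def nearest(embedding, cell, k):
    """Names of the k cells closest to `cell` in the given embedding."""
    others = sorted((math.dist(embedding[cell], xy), name)
                    for name, xy in embedding.items() if name != cell)
    return [name for _, name in others[:k]]


def mean_profile(embedding, cells):
    """Average the low-dimensional profiles of a set of cells."""
    dims = zip(*(embedding[c] for c in cells))
    return tuple(sum(d) / len(cells) for d in dims)


def rna_prediction_errors(cell, k=1):
    """Error of predicting the cell's RNA profile from each neighbour set."""
    within = mean_profile(rna_pca, nearest(rna_pca, cell, k))   # RNA neighbours
    cross = mean_profile(rna_pca, nearest(atac_lsi, cell, k))   # ATAC neighbours
    truth = rna_pca[cell]
    return math.dist(truth, within), math.dist(truth, cross)


within_err, cross_err = rna_prediction_errors("c1")
```

In this toy case the RNA neighbours of `c1` predict its RNA profile far better than its ATAC neighbours do, which is the situation in which RNA would be judged the more informative modality for that cell.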
Vertical integration of matched multi-omics data
Multi-Omics Factor Analysis v2 (MOFA+)
Scalability limits
Affected by imbalances in number of features in each modality
Z = contains low dimensional representation of the cells
W = contains an association score for each feature with each latent factor
Structure of the data is specified in the prior distributions of the Bayesian model
Uses sparsity priors, which enable automatic relevance determination of the factors
Sparsity priors encourage solutions where factors are associated with a small number of features / active in few groups of cells
Argelaguet, Velten et al. Mol Sys Biol 2018
Argelaguet, Arnol, Bredikhin et al. Genome Biology 2020
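The roles of Z and W can be written as a matrix factorisation. A sketch in the notation commonly used for this model, where the symbols Y, m and ε are not on the slide and are assumed here: m indexes the modality, Y^m is the cells x features data matrix of modality m, and ε^m is residual noise. The grouping of cells used by MOFA+ is omitted.

```latex
\mathbf{Y}^{m} = \mathbf{Z}\,\mathbf{W}^{m\top} + \boldsymbol{\varepsilon}^{m},
\qquad m = 1, \dots, M
```

Z is shared across modalities, while each modality has its own weight matrix W^m, which is what lets a single factor be interpreted through the features it loads on in each data type.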
Diagonal integration of unmatched multi-omics data
adapted methods from horizontal integration
gene activity scores
Horizontal integration!
Assumption that gene accessibility is linearly correlated with gene expression
Stuart, Butler et al., Cell 2019
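A gene activity score can be sketched as the number of ATAC fragments overlapping a gene body plus a promoter window. The coordinates, fragments and the 2 kb upstream extension below are assumptions for illustration, and the helper names are hypothetical.

```python
# Hypothetical gene coordinates on one chromosome: (start, end, strand).
genes = {"GeneA": (10_000, 15_000, "+"), "GeneB": (30_000, 32_000, "-")}

# Hypothetical ATAC fragments of one cell: (start, end).
fragments = [(8_500, 8_700), (12_000, 12_200), (33_000, 33_150), (50_000, 50_200)]

UPSTREAM = 2_000  # assumed promoter window in bp


def gene_window(start, end, strand):
    """Gene body extended upstream of the TSS, respecting the strand."""
    if strand == "+":
        return start - UPSTREAM, end
    return start, end + UPSTREAM


def gene_activity(genes, fragments):
    """Count fragments overlapping each gene's extended window."""
    scores = {}
    for name, (start, end, strand) in genes.items():
        w_start, w_end = gene_window(start, end, strand)
        scores[name] = sum(1 for f_start, f_end in fragments
                           if f_start < w_end and f_end > w_start)
    return scores


activity = gene_activity(genes, fragments)
```

The resulting cell x gene matrix has the same features as scRNA-seq, which is what allows horizontal integration methods to be reused, under the stated assumption that accessibility tracks expression.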
Diagonal integration of unmatched multi-omics data
autoencoder neural network architecture
matching graph topology
methods working on unpaired features
The embedding of each dataset is performed using an autoencoder, whose architectures can be customized to the specific data modality
Combining the encoder and decoder modules of different autoencoders enables translation between different data modalities at the single-cell level
Assumption that cells lie on the same latent manifold
Yang, Belyaeva et al., Nature Communications 2021
Jain, Polanski et al., Genome Biology 2021
goals of multi-omic data analysis
embedding in a meaningful latent space
identify statistical relationships between features
identifying statistical relationships between features
Network representations of molecular interactions between transcriptional regulators and target genes
With single-cell multi-omics we measure different molecular features / layers of gene regulation (either from the same cells or from different cells that can be computationally matched)
How do we identify statistical relationships between the different molecular features?
For scRNA/ATAC-seq multi-omic data, this analysis is often referred to as Gene Regulatory Network inference
preprocessing for feature-wise analysis
Persad, Choo et al., Nature Biotechnology 2023
problem
What’s the right resolution to consider when matching cells from different modalities in diagonal integration?
potential approaches
Use a generic averaged profile (GEX or accessibility) over a group of cells: could be clusters, kNN graph neighbourhoods or metacells
Impute expression for scATAC cells or vice versa (e.g. average of k-nearest neighbours)
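The averaged profile over a group of cells can be sketched as a per-group mean. The expression values and group labels are hypothetical; the group could equally be a cluster, a kNN neighbourhood or a metacell.

```python
from collections import defaultdict

# Hypothetical expression of two genes in five cells, plus a group label
# per cell (a cluster, a kNN neighbourhood or a metacell).
expression = {
    "c1": [2.0, 0.0], "c2": [4.0, 0.0], "c3": [0.0, 3.0],
    "c4": [0.0, 5.0], "c5": [0.0, 4.0],
}
group_of = {"c1": "g1", "c2": "g1", "c3": "g2", "c4": "g2", "c5": "g2"}


def averaged_profiles(expression, group_of):
    """Return {group: mean expression vector over its member cells}."""
    members = defaultdict(list)
    for cell, profile in expression.items():
        members[group_of[cell]].append(profile)
    return {group: [sum(col) / len(profiles) for col in zip(*profiles)]
            for group, profiles in members.items()}


profiles = averaged_profiles(expression, group_of)
```

Averaging trades resolution for robustness: group profiles are far less sparse than single cells, but differences between cells within a group are lost.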
preprocessing for feature-wise analysis
potential approaches
feature selection
Which features (genes or peaks) should we choose to identify statistical relationships?
Which genes? Highly variable genes, cell type marker genes, dynamic genes in pseudotime
Which accessibility features? Aggregate peaks by genomic locus? Aggregate peaks by TF motif?
Which feature pairs?
identifying statistical relationships between features
Approaches: correlation-based; machine learning-based (Gradient Boosting Machine Regression)
Identify regions enriched in TF motifs
Infer region-to-gene relationships (define search space around the gene)
Infer TF-to-gene relationships
Bravo González-Blas, De Winter et al., Nature Methods 2023
For each TF, generate TF–region–gene triplets by taking all regions that are enriched for a motif annotated to the TF and all genes linked to these regions
Run Gene Set Enrichment Analysis (GSEA) by ranking all genes based on their TF-to-gene importance score and calculate enrichment of the set of genes within the TF–region–gene triplet
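The correlation-based way of inferring region-to-gene relationships within a search space around the gene can be sketched as follows. This covers only the correlation variant, not gradient boosting; the profiles, peak positions and the 150 kb window are hypothetical values chosen for illustration.

```python
import math

# Hypothetical averaged profiles across five groups of cells.
gene_expression = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_tss = 100_000
peaks = {
    "peak_near": {"pos": 90_000, "acc": [0.1, 0.2, 0.3, 0.4, 0.5]},
    "peak_anti": {"pos": 120_000, "acc": [0.5, 0.4, 0.3, 0.2, 0.1]},
    "peak_far": {"pos": 900_000, "acc": [0.1, 0.2, 0.3, 0.4, 0.5]},
}
SEARCH_SPACE = 150_000  # assumed window (bp) on each side of the TSS


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Only peaks inside the search space are tested against the gene.
region_to_gene = {
    name: pearson(peak["acc"], gene_expression)
    for name, peak in peaks.items()
    if abs(peak["pos"] - gene_tss) <= SEARCH_SPACE
}
```

`peak_far` correlates perfectly with the gene but is never tested, which shows how the choice of search space limits the candidate links before any statistics are computed.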
Data structures in Python for single-cell data
anndata: for unimodal data (e.g. scRNA-seq, scATAC-seq)
mudata: extension of anndata to multimodal data (e.g. multiome scRNA/ATAC-seq)
scverse/AnnData scverse/MuData
spatially-resolved transcriptomics
spot-based spatial transcriptomics (10x visium)
transcriptome-wide throughput
from Lorenzi & Vento-Tormo, 2022
55 µm resolution (diameter of a spot)
imaging-based spatial transcriptomics (CARTANA, now 10x)
single-cell (even subcellular) resolution
throughput is limited to ~300 genes
from Lorenzi & Vento-Tormo, 2022
single-cell and spatial transcriptomics integration
deconvolution-based methods
collection of cell states and their gene expression signatures (from scRNA-seq)
[Figure: UMAP of the scRNA-seq reference; axes UMAP1, UMAP2]
spot-based spatial transcriptomics (10x Visium)
Idea: deconvolve signal from each spot as contribution of each cell type in the single-cell reference data
Readout: cell type abundance per spot
Kleshchevnikov, Shmatko et al., Nature Biotechnology 2022
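The deconvolution idea can be illustrated with a much simpler stand-in than the Bayesian model of the cited paper: non-negative least squares solved by projected gradient descent. The signatures, the spot, the step size and the number of steps are all hypothetical, and the spot is built without noise so that the true answer is known.

```python
# Hypothetical reference signatures: mean expression of 3 genes per cell type.
signatures = {
    "epithelial": [10.0, 1.0, 0.0],
    "immune": [0.0, 2.0, 8.0],
}
# Toy spot: 2 epithelial cells + 1 immune cell, no noise.
spot = [20.0, 4.0, 8.0]


def deconvolve(spot, signatures, steps=500, lr=0.005):
    """Non-negative least squares by projected gradient descent."""
    types = list(signatures)
    weights = {t: 1.0 for t in types}
    for _ in range(steps):
        predicted = [sum(weights[t] * signatures[t][g] for t in types)
                     for g in range(len(spot))]
        residual = [p - s for p, s in zip(predicted, spot)]
        for t in types:
            grad = sum(r * x for r, x in zip(residual, signatures[t]))
            # Gradient step, then clip at zero to keep abundances non-negative.
            weights[t] = max(0.0, weights[t] - lr * grad)
    return weights


abundance = deconvolve(spot, signatures)
```

The recovered weights approach 2 for the epithelial signature and 1 for the immune one, matching how the toy spot was constructed; the fixed step size only works because the toy signatures are small and would need adjusting for real data.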
single-cell and spatial transcriptomics integration
imaging-based spatial transcriptomics
nearest neighbours based methods
collection of cell states and their gene expression signatures (from scRNA-seq)
[Figure: UMAP of the scRNA-seq reference; axes UMAP1, UMAP2]
Idea: assign cell type identity to segmented cells based on k nearest neighbours in single-cell reference data
1. subset both the scRNA-seq and the in situ sequencing (ISS) dataset to the genes measured in ISS
2. create kNN graph of all cells
3. find scRNA-seq kNNs of each ISS cell and assign cell type label based on majority voting
Readout: segmented cells with cell type label
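The majority-voting step can be sketched as follows. The expression vectors, cell type names and `k` are hypothetical, both datasets are assumed to be already restricted to the shared genes, and the sketch searches the reference directly instead of building a joint kNN graph of all cells.

```python
import math
from collections import Counter

# Hypothetical expression over the genes measured in ISS (shared gene set).
reference = [  # scRNA-seq cells: (expression vector, annotated cell type)
    ((5.0, 0.0), "T cell"), ((4.5, 0.5), "T cell"), ((5.5, 0.2), "T cell"),
    ((0.0, 5.0), "B cell"), ((0.3, 4.6), "B cell"), ((0.1, 5.4), "B cell"),
]
iss_cells = {"seg_1": (4.8, 0.1), "seg_2": (0.2, 5.1)}


def transfer_labels(iss_cells, reference, k=3):
    """Label each segmented cell by majority vote of its k reference neighbours."""
    labels = {}
    for cell, vector in iss_cells.items():
        nearest = sorted(reference, key=lambda ref: math.dist(vector, ref[0]))[:k]
        votes = Counter(label for _, label in nearest)
        labels[cell] = votes.most_common(1)[0][0]
    return labels


labels = transfer_labels(iss_cells, reference)
```

Using an odd `k` avoids ties when only two cell types compete; with more cell types, ties are still possible and need an explicit rule.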