Bioc2021
PCAworkshop
Aedin Culhane, Lauren Hsu
@AedinCulhane
To run this workshop
For help on workshops, see Kevin Rue-Albrecht’s Introduction to Workshops presentation
BiocManager::install('aedin/PCAworkshop')
devtools::install_github('aedin/PCAworkshop')
Choose a workshop
Wait for your own workshop instance to launch
Ignore the Google Chrome warning 🙈
IMPORTANT
Navigating to a URL used by someone else will kick them out of their session.
This workshop includes 4 Vignettes
Matrix factorization or dimension reduction methods, including PCA, are commonly used and are well suited to finding known and unknown (latent) patterns in large data
Reduce the data matrix to a small number of linear vectors that explain most of the variance in the data
New Axis 1
New Axis 2
Many methods use Singular Value Decomposition (SVD):

$$X_{n \times p} = U_{n \times k}\, D_{k \times k}\, V^{\top}_{k \times p}, \qquad D = \operatorname{diag}(d_{11}, d_{22}, \dots, d_{kk})$$
Examines a simple dataset:
data(bordeaux)
bordeaux
##                    excellent good mediocre boring
## Cru_Bourgeois             45  126       24      5
## Grand_Cru_classe          87   93       19      1
## Vin_de_table               0    0       52    148
## Bordeaux_d_origine        36   68       74     22
## Vin_de_marque              0   30      111     59
and shows the relationship between Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
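The PCA and SVD relationship can be sketched in base R using the bordeaux table printed above. This is a minimal illustration, not the vignette's own code: the matrix is typed in by hand from that output, and the PCA is covariance-based (column-centered, unscaled), which is an assumption made for the example.

```r
# Bordeaux wine ratings, typed in from the printed table
bordeaux <- matrix(
  c(45, 126,  24,   5,
    87,  93,  19,   1,
     0,   0,  52, 148,
    36,  68,  74,  22,
     0,  30, 111,  59),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Cru_Bourgeois", "Grand_Cru_classe", "Vin_de_table",
      "Bordeaux_d_origine", "Vin_de_marque"),
    c("excellent", "good", "mediocre", "boring")))

# SVD of the column-centered matrix
Xc <- scale(bordeaux, center = TRUE, scale = FALSE)
s <- svd(Xc)

# Covariance-based PCA of the same data
pca <- prcomp(bordeaux, center = TRUE, scale. = FALSE)

# PCA standard deviations are the singular values divided by sqrt(n - 1)
sdev_from_svd <- s$d / sqrt(nrow(Xc) - 1)

# PCA scores are U %*% D and PCA loadings are V,
# each defined only up to a sign flip per component
scores_from_svd <- s$u %*% diag(s$d)

print(cbind(prcomp = pca$sdev, svd = sdev_from_svd))
```

Comparing `abs(pca$rotation)` with `abs(s$v)`, and `abs(pca$x)` with `abs(scores_from_svd)`, shows the same agreement for loadings and scores.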
U,V matrices are orthogonal, uncorrelated
$$U^{\top}_{k \times n}\, U_{n \times k} = I_{k \times k} = V^{\top}_{k \times p}\, V_{p \times k}$$
And the inner product (a generalization of the dot product) between any pair of distinct columns in U or V equals 0
The columns are orthogonal unit vectors -> the squared elements of each column of U and V sum to 1
Confirm this: let's go to RStudio
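A quick way to confirm the orthogonality claims is to run `svd()` on any matrix and inspect the cross-products. The sketch below uses a small random matrix as a stand-in (an assumption; any numeric matrix would do).

```r
set.seed(42)
X <- matrix(rnorm(6 * 4), nrow = 6)  # hypothetical 6 x 4 data matrix
s <- svd(X)                          # thin SVD: U is 6 x 4, V is 4 x 4

UtU <- crossprod(s$u)  # t(U) %*% U
VtV <- crossprod(s$v)  # t(V) %*% V

# Both should print as identity matrices: off-diagonal entries
# (inner products between distinct columns) are 0, diagonals are 1
print(round(UtU, 10))
print(round(VtV, 10))

# Squared elements of each column sum to 1
print(colSums(s$u^2))
print(colSums(s$v^2))

# Multiplying the three factors back together recovers X
X_rebuilt <- s$u %*% diag(s$d) %*% t(s$v)
max_error <- max(abs(X - X_rebuilt))
print(max_error)
```

Note that `tcrossprod(s$u)` (U times its transpose) is not an identity matrix here, because U has more rows than columns; the unit-length property holds for columns.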
Terminology SVD -> PCA
Scree plot
The "Kaiser rule" criterion is shown in red.
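The Kaiser rule is commonly stated as: keep components whose eigenvalue exceeds 1 when PCA is run on the correlation matrix. A minimal sketch of the numbers behind a scree plot follows; the built-in `USArrests` data is a stand-in chosen for the example, not a dataset from the workshop.

```r
# Correlation-based PCA on a built-in dataset (stand-in example data)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components
eigenvalues <- pca$sdev^2

# Proportion of variance explained by each component (the scree plot heights)
prop_var <- eigenvalues / sum(eigenvalues)

# Kaiser rule: retain components with eigenvalue greater than 1
kaiser_keep <- which(eigenvalues > 1)

print(data.frame(component = seq_along(eigenvalues),
                 eigenvalue = eigenvalues,
                 prop_var = prop_var))
print(kaiser_keep)
```

With scaled data the eigenvalues sum to the number of variables, so an eigenvalue of 1 corresponds to the variance of a single original variable.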
Vignette 2. Principal Component Analysis in R
Function (result stored in res) | Loadings | Scores | Plot |
prcomp(P, center=TRUE, scale.=TRUE) | res$rotation | res$x | biplot(res) |
princomp(P, cor=TRUE) | res$loadings | res$scores | biplot(res) |
FactoMineR::PCA(P) | res$svd$V | res$ind$coord | plot(res) |
ade4::dudi.pca(P, center=TRUE, scale=TRUE) | res$c1 | res$li | scatter(res) |
Compares different functions in R that compute PCA
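As a small illustration of such a comparison, the sketch below runs the two base R functions on the built-in `USArrests` data (a stand-in, not a workshop dataset). It relies on how the two functions are implemented in the stats package: `prcomp` uses the divisor n - 1 and `princomp` uses n, so scores differ by a constant factor while loadings agree up to sign.

```r
P <- USArrests
n <- nrow(P)

res_prcomp   <- prcomp(P, center = TRUE, scale. = TRUE)
res_princomp <- princomp(P, cor = TRUE)

# Loadings: same vectors, possibly with flipped signs
load_prcomp   <- unclass(res_prcomp$rotation)
load_princomp <- unclass(res_princomp$loadings)
max_loading_diff <- max(abs(abs(load_prcomp) - abs(load_princomp)))

# Scores: princomp standardizes with divisor n, prcomp with n - 1,
# so princomp scores are larger by a factor of sqrt(n / (n - 1))
score_ratio <- sqrt(n / (n - 1))
max_score_diff <- max(abs(abs(res_princomp$scores) -
                          abs(res_prcomp$x) * score_ratio))

print(max_loading_diff)
print(max_score_diff)
```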
Dimension reduction is indispensable in large-scale data analysis
Ordination
Dimension Reduction
Matrix Factorization
Latent variable analysis
Factor Analysis
Principal Component Analysis
Wavelet Decomposition
Spectral analysis
Eigen analysis
Two forms of PCA: covariance-based and correlation-based
Covariance-based PCA
Correlation-based PCA
SVD
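The two forms differ only in preprocessing before the SVD: covariance-based PCA centers the columns, while correlation-based PCA centers and scales them to unit variance. A hedged sketch on the built-in `USArrests` data (a stand-in dataset chosen for the example):

```r
X <- as.matrix(USArrests)
n <- nrow(X)

# Covariance-based PCA: SVD of the column-centered data
s_cov <- svd(scale(X, center = TRUE, scale = FALSE))
eig_cov <- s_cov$d^2 / (n - 1)

# Correlation-based PCA: SVD of the centered and unit-variance-scaled data
s_cor <- svd(scale(X, center = TRUE, scale = TRUE))
eig_cor <- s_cor$d^2 / (n - 1)

# These match the eigenvalues of the covariance and correlation matrices
print(rbind(svd = eig_cov, eigen = eigen(cov(X))$values))
print(rbind(svd = eig_cor, eigen = eigen(cor(X))$values))

# Share of variance on the first component under each form
print(c(covariance = eig_cov[1] / sum(eig_cov),
        correlation = eig_cor[1] / sum(eig_cor)))
```

In covariance-based PCA, variables measured on larger scales contribute more variance and therefore weigh more heavily on the leading components.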
Covariance-based PCA used by Sun et al., 2019
https://github.com/xzhoulab/DRComparison/blob/master/algorithms/call_PCA.R
Thanks so much for providing reproducible code on GitHub
PCA
Covariance-based PCA
Assessment of the accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis
Sun, S., Zhu, J., Ma, Y. et al. Genome Biol 20, 269 (2019)
Comparison of 18 Methods
Processing steps impact results
Human lung adenocarcinoma cell lines: HCC827, H1975, H2228
Hsu & Culhane, 2020
Preprocessing Impacts on PC1
Arch effect: points on PC1 lie on 1 side of the origin
Centering is important: orthogonal vectors are uncorrelated only when at least one of them has mean 0.
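The effect of skipping the centering step can be seen on toy data. The sketch below uses simulated, strictly positive values (an assumption standing in for count-like data) and compares SVD scores with and without column centering.

```r
set.seed(1)
X <- matrix(rexp(40 * 5), nrow = 40)  # strictly positive toy data

# SVD without centering
s_raw <- svd(X)
scores_raw <- s_raw$u %*% diag(s_raw$d)

# SVD after column centering (covariance-based PCA)
s_ctr <- svd(scale(X, center = TRUE, scale = FALSE))
scores_ctr <- s_ctr$u %*% diag(s_ctr$d)

# Without centering, all points fall on one side of the origin on axis 1
one_sided <- all(scores_raw[, 1] > 0) || all(scores_raw[, 1] < 0)

# Axes 1 and 2 are orthogonal in both cases (inner product near 0) ...
inner_raw <- sum(scores_raw[, 1] * scores_raw[, 2])

# ... but only the centered scores are also uncorrelated
cor_raw <- cor(scores_raw[, 1], scores_raw[, 2])
cor_ctr <- cor(scores_ctr[, 1], scores_ctr[, 2])

print(one_sided)
print(c(inner_raw = inner_raw, cor_raw = cor_raw, cor_ctr = cor_ctr))
```

The uncentered scores have non-zero means, so a zero inner product does not translate into a zero correlation.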
Arches, Horseshoes, Guttman effect
Curvilinear relationships between successive PCs
Arch
Can arise as a consequence of distance metrics that saturate.
10 nonzero values.
Vignette 3. Correspondence Analysis
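For orientation, classical correspondence analysis can be written as an SVD of the standardized residuals of a contingency table. The base R sketch below applies that textbook formulation to the bordeaux table shown earlier (typed in by hand); it is an illustration only and not the implementation used by the corral package.

```r
# Bordeaux contingency table, typed in from the printed output
N <- matrix(
  c(45, 126,  24,   5,
    87,  93,  19,   1,
     0,   0,  52, 148,
    36,  68,  74,  22,
     0,  30, 111,  59),
  nrow = 5, byrow = TRUE)

P  <- N / sum(N)   # correspondence matrix (proportions)
r  <- rowSums(P)   # row masses
cm <- colSums(P)   # column masses

# Standardized residuals: (observed - expected) / sqrt(expected)
S <- diag(1 / sqrt(r)) %*% (P - outer(r, cm)) %*% diag(1 / sqrt(cm))
s <- svd(S)

# Total inertia equals the chi-squared statistic divided by the table total
total_inertia <- sum(s$d^2)

# Principal coordinates of the rows
row_coords <- diag(1 / sqrt(r)) %*% s$u %*% diag(s$d)

print(total_inertia)
print(round(row_coords[, 1:2], 3))
```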
Correspondence Analysis of scMix Data
corralm
Unpublished
PCA
corral package
Designed for use on single cell data
BiocManager::install('corral')
devtools::install_github('laurenhsu/corral')
Lauren Hsu
Let's test some scRNA-seq data
Compare PCA, corral on Zhengmix (DuoClustering2018)
Pre-sorted cells, including:
Zhengmix4eq: 4 cell types, in approx. equal proportions
Zhengmix4uneq: 4 cell types, in unequal proportions
Zhengmix8eq: 8 cell types, in approx. equal proportions
10X sequencing
mixed
Joint Matrix Factorization of >2 datasets
Meng & Zeleznik et al. (2016) Dimension reduction techniques for the integrative analysis of multi-omics data. Briefings in Bioinformatics, 17(4), 628–641
Towards Integration of multiple datasets
Meng et al., BMC Bioinformatics 2014, 15:162
Meng et al., Brief Bioinform. 2016 Jul; 17(4): 628–641.
Culhane AC et al., BMC Bioinformatics 2003, 4:59-74
Package:mogsa
Find Correlated Structure Across 5 transcriptomics datasets using MCIA tensor Integration
Multiple Factor Analysis, Multiple Coinertia Analysis, Consensus PCA
BiocManager::install("mogsa")
moa(lapply(se, exprs), proc.row = "center_ssq1", w.data = "inertia", statis = TRUE) #MCIA
For MFA, set statis = FALSE (the default setting).
Parallel (permutation) based selection of components
library(paran)
Determine Number of Components (by permutation) representing concordant structure between datasets
# ana: the object returned by moa()
bootMoa(
  moa = ana,
  proc.row = "center_ssq1",
  w.data = "inertia",
  statis = TRUE,
  B = 20,
  plot = TRUE)
Select N components with greater variance than expected (permuted)
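The idea of comparing observed variance against a permuted baseline can be sketched for a single dataset in base R. This is an illustration of the principle only, on simulated data with two planted factors; it is not the procedure implemented by `paran` or `bootMoa`, and the 95% threshold is an assumption made for the example.

```r
set.seed(1)
n <- 100
f1 <- rnorm(n)
f2 <- rnorm(n)
# Eight variables: four driven by factor 1, four by factor 2, plus noise
X <- cbind(replicate(4, f1 + 0.5 * rnorm(n)),
           replicate(4, f2 + 0.5 * rnorm(n)))

# Observed eigenvalues of the correlation matrix
obs <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values

# Null distribution: permute each column independently to break
# correlations between variables while keeping their marginals
B <- 200
perm <- replicate(B, {
  Xp <- apply(X, 2, sample)
  eigen(cor(Xp), symmetric = TRUE, only.values = TRUE)$values
})

# Per-component 95th percentile of the permuted eigenvalues
thresh <- apply(perm, 1, quantile, probs = 0.95)

# Keep leading components until one fails to exceed its threshold
exceeds <- obs > thresh
n_keep <- if (all(exceeds)) length(obs) else which(!exceeds)[1] - 1

print(round(rbind(observed = obs, threshold = thresh), 2))
print(n_keep)
```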
Meng et al., BMC Bioinformatics 2014, 15:162
Package:mogsa
How to annotate gene sets in tensor integration
Project GO terms (vectors of genes) onto each space to get a gene set "score" in each space
A simple approach to generating gene set scores
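One simple reading of "projecting" a gene set is to multiply a binary gene-by-gene-set annotation matrix with a low-rank reconstruction of the expression data. The sketch below uses simulated data; all object names (`G`, `X_k`, `setA`, `setB`) are hypothetical, and this is not the mogsa implementation.

```r
set.seed(1)
n_genes <- 50
n_samples <- 6
X <- matrix(rnorm(n_genes * n_samples), n_genes, n_samples,
            dimnames = list(paste0("gene", 1:n_genes),
                            paste0("s", 1:n_samples)))

# Raise a hypothetical gene set in the first three samples
set_genes <- paste0("gene", 1:10)
X[set_genes, 1:3] <- X[set_genes, 1:3] + 4

# Binary annotation matrix: genes x gene sets
G <- matrix(0, n_genes, 2,
            dimnames = list(rownames(X), c("setA", "setB")))
G[set_genes, "setA"] <- 1
G[paste0("gene", 31:40), "setB"] <- 1

# Rank-k reconstruction of the gene-centered data
Xc <- X - rowMeans(X)
s <- svd(Xc)
k <- 2
X_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# Gene set scores: sum of reconstructed values over genes in each set
scores <- crossprod(G, X_k)  # gene sets x samples
colnames(scores) <- colnames(X)

print(round(scores, 2))
```

Because `X_k` is built from the first k components, the same scores can be read as the gene set vector projected onto the gene loadings and then mapped to samples.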
Fagan et al., (2007) A Multivariate Analysis approach to the Integration of Proteomic and Gene Expression Data. Proteomics. 7(13):2162-71.
Jeffery et al., (2007) Integrating transcription factor binding site information with gene expression datasets. Bioinformatics 23(3):298-305.
Matrix decomposition of gene expression and proteomics onto same scale
Multiple ‘Omics Gene Set Analysis
MOGSA Meng C, et al., 2019
MCP DOI: 10.1074/mcp.TIR118.001251
Fast & Performant
Single Sample/Cell Gene Set Scores
Useful Reference Books
Open source (free) at