1 of 38

Bioc2021

PCAworkshop

Aedin Culhane, Lauren Hsu

@AedinCulhane

2 of 38

To run this workshop

Cloud RStudio http://app.orchestra.cancerdatasci.org/

Follow the R Markdown on the website (pkgdown) https://aedin.github.io/PCAworkshop/

For help on workshops, see Kevin Rue-Albrecht’s Introduction to Workshops presentation

Install this workshop package locally

BiocManager::install("aedin/PCAworkshop")

devtools::install_github("aedin/PCAworkshop")

3 of 38

Choose a workshop

4 of 38

Wait for your own workshop instance to launch

5 of 38

Ignore the Google Chrome warning 🙈

IMPORTANT

Do not share the URL of your session with others.

Do not copy the URL from someone who is sharing their screen.

Navigating to a URL used by someone else will kick them out of their session.

6 of 38

This workshop includes 4 vignettes

7 of 38

Matrix factorization (dimension reduction) methods, including PCA, are commonly used and are well suited to finding known and unknown (latent) patterns in large data

Reduce the data matrix to a small number of linear vectors that explain most of the variance in the data

New Axis 1

New Axis 2

8 of 38

Many methods use

Singular Value Decomposition: X = U D V^t

X is n × p, U is n × k, D is k × k (diagonal), V^t is k × p

D (shown for k = 4):

d₁₁	0	0	0
0	d₂₂	0	0
0	0	d₃₃	0
0	0	0	d₄₄
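The decomposition above can be tried in base R. This is a minimal sketch on simulated data (the matrix X below is an assumption for illustration, not workshop data); `svd()` is the base R function.

```r
# Simulated data (an assumption for illustration): n = 6 rows, p = 4 columns
set.seed(1)
X <- matrix(rnorm(24), nrow = 6, ncol = 4)

s <- svd(X)       # base R SVD: returns d (singular values), u and v
U <- s$u          # n x k left singular vectors
D <- diag(s$d)    # k x k diagonal matrix of singular values
V <- s$v          # p x k right singular vectors

dim(U); dim(D); dim(V)

# Reconstruct X from its factors: X = U D V^t
X_hat <- U %*% D %*% t(V)
all.equal(X, X_hat)

# Keeping only the first 2 components gives a rank-2 approximation of X
X_2 <- U[, 1:2] %*% D[1:2, 1:2] %*% t(V[, 1:2])
```

Truncating to the leading components is what makes SVD a dimension reduction method: X_2 has the same shape as X but rank 2.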

9 of 38

Short PCA Vignette

Examines a simple dataset

data(bordeaux)
bordeaux
##                    excellent good mediocre boring
## Cru_Bourgeois             45  126       24      5
## Grand_Cru_classe          87   93       19      1
## Vin_de_table               0    0       52    148
## Bordeaux_d_origine        36   68       74     22
## Vin_de_marque              0   30      111     59

and shows the relationship between Principal Component Analysis (PCA) and Singular Value Decomposition (SVD)
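A minimal sketch of that relationship, using base R only. The bordeaux table is typed in by hand from the printout above so the block is self-contained; the vignette itself may proceed differently. Singular vectors are only defined up to sign, so absolute values are compared.

```r
# bordeaux counts, entered by hand from the slide
bordeaux <- matrix(
  c(45, 126,  24,   5,
    87,  93,  19,   1,
     0,   0,  52, 148,
    36,  68,  74,  22,
     0,  30, 111,  59),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Cru_Bourgeois", "Grand_Cru_classe", "Vin_de_table",
      "Bordeaux_d_origine", "Vin_de_marque"),
    c("excellent", "good", "mediocre", "boring")))

# SVD of the column-centered matrix
Xc <- scale(bordeaux, center = TRUE, scale = FALSE)
s <- svd(Xc)

# Covariance-based PCA of the same data
pca <- prcomp(bordeaux, center = TRUE, scale. = FALSE)

scores_svd <- s$u %*% diag(s$d)        # PCA scores = U D
sdev_svd <- s$d / sqrt(nrow(Xc) - 1)   # PCA sdev = d / sqrt(n - 1)

all.equal(abs(unname(pca$rotation)), abs(s$v))   # loadings = V (up to sign)
all.equal(abs(unname(pca$x)), abs(scores_svd))   # scores = U D (up to sign)
all.equal(pca$sdev, sdev_svd)
```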

10 of 38

U, V matrices are orthogonal, uncorrelated

U^t U = I and V^t V = I, where I is the k × k identity matrix (U is n × k, V is p × k)

I (shown for k = 4):

1	0	0	0
0	1	0	0
0	0	1	0
0	0	0	1

The inner product (a generalization of the dot product) between any pair of distinct columns in U, or in V, equals 0

The columns of U and V (the rows of V^t) are orthogonal unit vectors: the squared elements of each column of U and V sum to 1

11 of 38

Confirm this. Let's go to RStudio
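One way to check the orthogonality properties in base R, sketched on simulated data (the matrix X is an assumption for illustration):

```r
# Simulated data (an assumption for illustration)
set.seed(1)
X <- matrix(rnorm(40), nrow = 10, ncol = 4)
s <- svd(X)
k <- length(s$d)

# U^t U and V^t V should both be the k x k identity matrix
UtU <- crossprod(s$u)   # same as t(s$u) %*% s$u
VtV <- crossprod(s$v)
all.equal(UtU, diag(k))
all.equal(VtV, diag(k))

# Squared elements of each column sum to 1 (unit vectors)
colSums(s$u^2)
colSums(s$v^2)

# Inner product between two different columns is 0 (up to rounding error)
sum(s$u[, 1] * s$u[, 2])
```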

12 of 38

Terminology SVD -> PCA

Left (U), Right (V) singular vectors, singular values (D)

PCA can be computed by linear regression, eigen analysis, SVD, or latent factor analysis

The vectors are called principal components, principal axes, latent vectors, or eigenvectors, and capture variance (information) in the data

The number of eigenvalues (components) to retain is selected by scree plot or permutation.

The "Kaiser rule" criterion is shown in red.

Scree
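A minimal sketch of a scree plot with the Kaiser rule (keep components with eigenvalue greater than 1 in a correlation-based PCA). The built-in USArrests dataset is an assumption chosen for illustration; it is not the data behind the plot on this slide.

```r
# Correlation-based PCA on a built-in dataset (illustrative choice)
pca <- prcomp(USArrests, center = TRUE, scale. = TRUE)

eig <- pca$sdev^2             # eigenvalues of the correlation matrix
prop_var <- eig / sum(eig)    # proportion of variance per component

# Kaiser rule: retain components whose eigenvalue exceeds 1
n_kaiser <- sum(eig > 1)

# Scree plot with the Kaiser threshold drawn in red
if (interactive()) {
  plot(eig, type = "b", xlab = "Component", ylab = "Eigenvalue", main = "Scree")
  abline(h = 1, col = "red")
}
```

With scaled data the eigenvalues sum to the number of variables, so a threshold of 1 corresponds to "explains more than a single original variable".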

13 of 38

Vignette 2. Principal Component Analysis in R

Function	Loadings	Scores	Plot
res <- prcomp(P, center=TRUE, scale.=TRUE)	res$rotation	res$x	biplot(res)
res <- princomp(P, cor=TRUE)	res$loadings	res$scores	biplot(res)
res <- FactoMineR::PCA(P)	res$svd$V	res$ind$coord	plot(res)
res <- ade4::dudi.pca(P, center=TRUE, scale=TRUE)	res$c1	res$li	scatter(res)

Compares different functions in R that compute PCA
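A sketch comparing the two base R functions from the table. The iris measurements are an assumption chosen for illustration. The functions agree on loadings and standard deviations (up to sign), but princomp standardizes with divisor n rather than n - 1, so its scores differ by a factor of sqrt(n / (n - 1)).

```r
# Illustrative data: the four numeric columns of iris
P <- iris[, 1:4]
n <- nrow(P)

res1 <- prcomp(P, center = TRUE, scale. = TRUE)
res2 <- princomp(P, cor = TRUE)

# Standard deviations of the components agree
same_sdev <- all.equal(res1$sdev, unname(res2$sdev))

# Loadings agree up to sign
same_loadings <- all.equal(abs(unname(res1$rotation)),
                           abs(unname(unclass(res2$loadings))))

# Scores agree up to sign and a constant factor sqrt(n / (n - 1))
same_scores <- all.equal(abs(unname(res1$x)) * sqrt(n / (n - 1)),
                         abs(unname(res2$scores)))

c(sdev = isTRUE(same_sdev),
  loadings = isTRUE(same_loadings),
  scores = isTRUE(same_scores))
```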

14 of 38

Dimension reduction is indispensable in large scale data analysis

Ordination

Dimension Reduction

Matrix Factorization

Latent variable analysis

Factor Analysis

Principal Component Analysis

Wavelet Decomposition

Spectral analysis

Eigen analysis

15 of 38

Two forms of PCA: covariance and correlation

Covariance-based PCA

Correlation-based PCA

SVD
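The two forms can be contrasted in base R. A minimal sketch, assuming the built-in USArrests data for illustration: covariance-based PCA is the eigen analysis of cov(X), and correlation-based PCA is the eigen analysis of cor(X).

```r
X <- as.matrix(USArrests)   # illustrative built-in dataset

# Covariance-based PCA: center only
pca_cov <- prcomp(X, center = TRUE, scale. = FALSE)
eig_cov <- eigen(cov(X), symmetric = TRUE)$values

# Correlation-based PCA: center and scale to unit variance
pca_cor <- prcomp(X, center = TRUE, scale. = TRUE)
eig_cor <- eigen(cor(X), symmetric = TRUE)$values

all.equal(pca_cov$sdev^2, eig_cov)
all.equal(pca_cor$sdev^2, eig_cor)

# Without scaling, the variable with the largest variance dominates PC1
apply(X, 2, var)
round(pca_cov$rotation[, 1], 3)
round(pca_cor$rotation[, 1], 3)
```

In this dataset Assault has by far the largest variance, so it drives PC1 of the covariance-based PCA; scaling puts the variables on an equal footing.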

16 of 38

Covariance-based PCA used by Sun et al., 2019

https://github.com/xzhoulab/DRComparison/blob/master/algorithms/call_PCA.R

Thanks so much for providing reproducible code on GitHub

PCA

Covariance-based PCA

17 of 38

Assessment of the Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis

Sun, S., Zhu, J., Ma, Y. et al. Genome Biol 20, 269 (2019)

Comparison of 18 Methods

18 of 38

19 of 38

Processing steps impact results

Human lung adenocarcinoma cell lines

HCC827

H1975

H2228

Hsu & Culhane, 2020

20 of 38

Preprocessing Impacts on PC1

Arch effect: points on PC1 lie on 1 side of the origin

Centering is important: orthogonal vectors are uncorrelated only when at least one of them has mean 0.
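A small sketch of both points on simulated, strictly positive count-like data (the simulation is an assumption for illustration, not the scMix data). Without centering, all scores on the first component fall on one side of the origin, and the first two score vectors are orthogonal but not necessarily uncorrelated.

```r
# Simulated strictly positive count-like data (an assumption for illustration)
set.seed(1)
X <- matrix(rpois(50 * 20, lambda = 5) + 1, nrow = 50, ncol = 20)

# SVD without centering
s_raw <- svd(X)
scores_raw <- s_raw$u %*% diag(s_raw$d)

# SVD after column centering (covariance-based PCA)
s_cen <- svd(scale(X, center = TRUE, scale = FALSE))
scores_cen <- s_cen$u %*% diag(s_cen$d)

range(scores_raw[, 1])   # all values share one sign
range(scores_cen[, 1])   # values straddle 0 (mean is 0)

# Uncentered: orthogonal (inner product ~ 0) but correlation need not be 0
sum(scores_raw[, 1] * scores_raw[, 2])
cor(scores_raw[, 1], scores_raw[, 2])

# Centered: orthogonal and uncorrelated
cor(scores_cen[, 1], scores_cen[, 2])
```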

21 of 38

Arches, Horseshoes, Guttman effect

Curvilinear relationships between successive PCs

22 of 38

Arch

Can arise as a consequence of distance metrics that saturate.

10 nonzero values.

23 of 38

Vignette 3. Correspondence Analysis

From our Briefings in Bioinformatics review. Meng et al., 2016 https://academic.oup.com/bib/article/17/4/628/2240645
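Classical correspondence analysis can be written as an SVD of the standardized (chi-squared) residuals of a count table. A minimal base R sketch, assuming the bordeaux table from the short PCA vignette (typed in by hand); this is the textbook formulation, not necessarily the code used in the vignette or in corral.

```r
# bordeaux counts, entered by hand
N <- matrix(
  c(45, 126,  24,   5,
    87,  93,  19,   1,
     0,   0,  52, 148,
    36,  68,  74,  22,
     0,  30, 111,  59),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("Cru_Bourgeois", "Grand_Cru_classe", "Vin_de_table",
      "Bordeaux_d_origine", "Vin_de_marque"),
    c("excellent", "good", "mediocre", "boring")))

n <- sum(N)
P <- N / n              # correspondence matrix (proportions)
rm_ <- rowSums(P)       # row masses
cm_ <- colSums(P)       # column masses

# Standardized residuals: (observed - expected) / sqrt(expected), as proportions
S <- diag(1 / sqrt(rm_)) %*% (P - outer(rm_, cm_)) %*% diag(1 / sqrt(cm_))
s <- svd(S)

# Principal coordinates of rows and columns
row_coord <- diag(1 / sqrt(rm_)) %*% s$u %*% diag(s$d)
col_coord <- diag(1 / sqrt(cm_)) %*% s$v %*% diag(s$d)

# Total inertia equals the chi-squared statistic divided by n
inertia <- sum(s$d^2)
chi2 <- unname(chisq.test(N)$statistic)
all.equal(inertia, chi2 / n)
```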

24 of 38

Correspondence Analysis of scMix Data

corralm

Unpublished

PCA

25 of 38

corral package

Designed for use on single cell data

Uses sparse matrices (Matrix)
Applies fast SVD approximation (irlba)
Interacts directly with Bioconductor objects

BiocManager::install("corral")

devtools::install_github("laurenhsu/corral")

Lauren Hsu

26 of 38

Let's test some scRNA-seq data

27 of 38

Compare PCA, corral on Zhengmix (DuoClustering2018)

Pre-sorted cells, including:

B-cells
CD14 monocytes
CD4 T-helper cells
CD56 NK cells
memory T-cells
naive cytotoxic T-cells
naive T-cells
regulatory T-cells

10X sequencing

mixed

Zhengmix4eq: 4 cell types, in approx. equal proportions
Zhengmix4uneq: 4 cell types, in unequal proportions
Zhengmix8eq: 8 cell types, in approx. equal proportions

28 of 38

29 of 38

Joint Matrix Factorization of >2 datasets

Meng & Zeleznik et al. (2016) Dimension reduction techniques for the integrative analysis of multi-omics data. Briefings in Bioinformatics 17(4):628–641

30 of 38

Towards Integration of multiple datasets

Meng et al., BMC Bioinformatics 2014, 15:162

Meng et al., Brief Bioinform. 2016 Jul; 17(4): 628–641.

Culhane AC et al., BMC Bioinformatics 2003, 4:59-74

31 of 38

Meng et al., BMC Bioinformatics 2014, 15:162

Package:mogsa

Find Correlated Structure Across 5 transcriptomics datasets using MCIA tensor Integration

32 of 38

Multiple Factor Analysis, Multiple Coinertia Analysis, Consensus PCA

BiocManager::install("mogsa")

library(mogsa)

ana <- moa(lapply(se, exprs), proc.row = "center_ssq1", w.data = "inertia", statis = TRUE) # MCIA

MFA: statis = FALSE (the default setting)

33 of 38

Parallel (permutation) based selection of components

Horn's Parallel Analysis for factor retention

https://www.r-bloggers.com/determining-the-number-of-factors-with-parallel-analysis-in-r/

library(paran)

Edgar Dobriban

https://github.com/dobriban/DPA
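The idea behind permutation-based selection can be sketched in base R. This is a simplified illustration, not the paran or DPA implementation: the function name parallel_analysis, its arguments and the simulated data are all hypothetical. Each column is shuffled independently to break the correlation structure, and a component is kept while its observed eigenvalue exceeds a chosen quantile of the permuted eigenvalues.

```r
# Hypothetical helper: permutation-based parallel analysis (simplified sketch)
parallel_analysis <- function(X, B = 100, quant = 0.95) {
  # Observed eigenvalues of the correlation matrix
  obs <- eigen(cor(X), symmetric = TRUE, only.values = TRUE)$values
  # Eigenvalues after shuffling each column independently, B times (p x B matrix)
  perm <- replicate(B, {
    Xp <- apply(X, 2, sample)
    eigen(cor(Xp), symmetric = TRUE, only.values = TRUE)$values
  })
  # Per-component threshold from the permutation distribution
  thr <- unname(apply(perm, 1, quantile, probs = quant))
  above <- obs > thr
  # Count leading components that exceed their threshold
  n_keep <- if (all(above)) length(obs) else which(!above)[1] - 1
  list(observed = obs, threshold = thr, n_keep = n_keep)
}

# Simulated data with one shared latent factor (an assumption for illustration)
set.seed(1)
n <- 200
f <- rnorm(n)
X <- sapply(1:6, function(j) f + rnorm(n))

pa <- parallel_analysis(X, B = 100)
pa$n_keep
```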

34 of 38

Determine Number of Components (by permutation) representing concordant structure between datasets

bootMoa(
  moa = ana,
  proc.row = "center_ssq1",
  w.data = "inertia",
  statis = TRUE,
  B = 20,
  plot = TRUE)

Select the N components with greater variance than expected (permuted)

35 of 38

Meng et al., BMC Bioinformatics 2014, 15:162

Package:mogsa

How to annotate gene sets in tensor integration

36 of 38

Project GO terms (vectors of genes) onto each space to get a gene set “score” in each space

Simple approach to generate gene set scores

Fagan et al., (2007) A Multivariate Analysis approach to the Integration of Proteomic and Gene Expression Data. Proteomics. 7(13):2162-71.

Jeffery et al., (2007) Integrating transcription factor binding site information with gene expression datasets. Bioinformatics 23(3):298-305.

Matrix decomposition of gene expression and proteomics onto same scale
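A simplified sketch of the projection idea in base R, for a single dataset. The data, the gene sets (setA, setB) and the choice of k are hypothetical, and this is not the mogsa implementation: a binary gene-by-set membership matrix is multiplied with the expression matrix reconstructed from the top k components, giving one score per gene set per sample.

```r
# Simulated genes x samples matrix (an assumption for illustration)
set.seed(1)
n_genes <- 30
n_samples <- 8
X <- matrix(rnorm(n_genes * n_samples), n_genes, n_samples,
            dimnames = list(paste0("gene", 1:n_genes),
                            paste0("sample", 1:n_samples)))
X <- X - rowMeans(X)   # center each gene

# Hypothetical gene sets as a binary genes x sets membership matrix
G <- matrix(0, n_genes, 2, dimnames = list(rownames(X), c("setA", "setB")))
G[1:10, "setA"] <- 1
G[8:20, "setB"] <- 1

# Keep the top k components of the SVD
s <- svd(X)
k <- 3
X_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# Gene set scores: project the membership vectors onto the reduced data
scores <- t(G) %*% X_k          # sets x samples
colnames(scores) <- colnames(X)
scores
```

Using the reduced matrix X_k rather than X means the scores reflect only the structure captured by the retained components.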

37 of 38

Multiple ‘Omics Gene Set Analysis

MOGSA Meng C, et al., 2019

MCP DOI: 10.1074/mcp.TIR118.001251

Fast & Performant

Single Sample/Cell Gene Set Scores

38 of 38

Useful Reference Books

Open source (free) at

http://web.stanford.edu/class/bios221/book/