1 of 45

Correlation and Causation

Lior Pachter

California Institute of Technology


Lecture 4

Caltech Bi/BE/CS183

Spring 2023

These slides are distributed under the CC BY 4.0 license

2 of 45

Stochastic gene expression in a single cell

  • In E. coli, two reporter genes controlled by identical promoters but with two distinguishable alleles of green fluorescent protein (cyan and yellow) were used to study the extent of intrinsic noise and extrinsic noise in gene expression in single cells.
  • In panel A, the absence of intrinsic noise is evident in the correlated fluctuations of the two fluorescent proteins. In panel B, the absence of extrinsic noise is evident in the uncorrelated fluctuations of the two fluorescent proteins.


3 of 45

Stochastic gene expression in a single cell

  • The fluctuations we see are colloquially thought of as “variance”. The idea of variance has a formal definition in statistics: the variance of a random variable X is $\mathrm{Var}[X] = E[(X - E[X])^2] = E[X^2] - E[X]^2$.


4 of 45

How to think formally about intrinsic and extrinsic noise

  • Suppose $n$ cells are assayed, producing gene expression measurements $c_1, c_2, \dots, c_n$ and $y_1, y_2, \dots, y_n$ for two copies of a single gene.
  • Let $C_1, C_2, \dots, C_n$ and $Y_1, Y_2, \dots, Y_n$ be random variables associated with the $n$ cells, i.e. the measurements are realizations of these random variables.
  • The fact that the same gene is labeled twice with different fluorescent reporters can be modeled by positing that each cell has a state modeled by a random variable $Z_i$, and that $C_i \mid Z_i \sim Y_i \mid Z_i$ (the two copies have the same conditional distribution given the cell state).
  • The law of total variance for $C_i$ (and equivalently for $Y_i$) is
$$\mathrm{Var}[C_i] \;=\; \underbrace{E\bigl[\mathrm{Var}[C_i \mid Z_i]\bigr]}_{\text{intrinsic noise}} \;+\; \underbrace{\mathrm{Var}\bigl[E[C_i \mid Z_i]\bigr]}_{\text{extrinsic noise}}.$$

5 of 45

Sample estimates of intrinsic and extrinsic noise

  • The law of total variance for $C_i$ (and equivalently for $Y_i$) is
$$\mathrm{Var}[C_i] \;=\; \underbrace{E\bigl[\mathrm{Var}[C_i \mid Z_i]\bigr]}_{\text{intrinsic noise}} \;+\; \underbrace{\mathrm{Var}\bigl[E[C_i \mid Z_i]\bigr]}_{\text{extrinsic noise}}.$$
  • Assuming $n$ is large and $\bar{c} = \bar{y}$, the intrinsic noise can be estimated from the sample by
$$\frac{1}{n}\sum_{i=1}^{n}\frac{(c_i - y_i)^2}{2},$$
and the extrinsic noise by the sample covariance estimate
$$\frac{1}{n}\sum_{i=1}^{n} c_i y_i \;-\; \bar{c}\,\bar{y}$$
(these estimates are often normalized by $\bar{c}\,\bar{y}$ to make them dimensionless).
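A minimal simulation sketch of these estimators (Python/NumPy; the Poisson-given-cell-state model and all parameter values are illustrative assumptions, not taken from the lecture or from Elowitz et al.):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # number of cells (arbitrary)

# Assumed toy model: each cell has a state Z_i shared by both reporters (extrinsic factor),
# and each reporter fluctuates independently given Z_i (intrinsic noise).
z = rng.gamma(shape=20.0, scale=5.0, size=n)   # cell states; E[C_i | Z_i] = Z_i
c = rng.poisson(z)                             # "cyan" reporter
y = rng.poisson(z)                             # "yellow" reporter, same conditional law as c

c_bar, y_bar = c.mean(), y.mean()

intrinsic = np.mean((c - y) ** 2) / 2          # estimates E[ Var[C_i | Z_i] ]
extrinsic = np.mean(c * y) - c_bar * y_bar     # sample covariance; estimates Var[ E[C_i | Z_i] ]

# Law of total variance: the two estimates should add up to (approximately) the total variance.
print(intrinsic, extrinsic, intrinsic + extrinsic, np.var(c))
```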

6 of 45

Covariance of random variables

  • The covariance of two random variables X and Y is $\mathrm{cov}[X,Y] = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$.
  • $\mathrm{cov}[X,X] = \mathrm{Var}[X]$.
  • If X and Y are independent random variables then the covariance is zero. Proof: independence implies $E[XY] = E[X]E[Y]$, so $\mathrm{cov}[X,Y] = 0$. The converse is not true: for example, if X is symmetric about 0 and $Y = X^2$, then $\mathrm{cov}[X,Y] = E[X^3] = 0$ even though X and Y are not independent.


7 of 45

Sample covariance

  • The sample covariance formula follows from $\mathrm{cov}[X,Y] = E[XY] - E[X]E[Y]$:
$$\frac{1}{n}\sum_{i=1}^{n} x_i y_i - \bar{x}\,\bar{y} \;=\; \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$$
  • This formula yields a biased estimate of the covariance. The unbiased estimator is
$$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$$
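As a quick sanity check, a sketch (Python/NumPy; the simulated data are purely illustrative) comparing the two formulas with NumPy's built-in estimators:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=10)
y = 2 * x + rng.normal(size=10)
n = len(x)

biased   = np.sum((x - x.mean()) * (y - y.mean())) / n        # 1/n formula
unbiased = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)  # 1/(n-1) formula

# np.cov returns the 2x2 covariance matrix; bias=True uses 1/n, the default uses 1/(n-1).
print(np.isclose(biased,   np.cov(x, y, bias=True)[0, 1]))   # True
print(np.isclose(unbiased, np.cov(x, y)[0, 1]))              # True
```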


8 of 45

Geometric interpretation of the sample covariance

  • The (unbiased) sample covariance can be rewritten in the following way:
$$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \;=\; \frac{1}{\binom{n}{2}}\sum_{i<j}\frac{(x_i - x_j)(y_i - y_j)}{2}.$$
  • The right-hand side is the average (signed) area of triangles formed by pairs of points.

(The pairwise form is less efficient for computing, but more useful for understanding.)
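A short numerical check of the pairwise form (Python; the data are random and purely illustrative):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
x, y = rng.normal(size=25), rng.normal(size=25)
n = len(x)

# Standard (Bessel-corrected) sample covariance
q = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# Average signed area of the right triangles spanned by pairs of points
areas = [(x[i] - x[j]) * (y[i] - y[j]) / 2 for i, j in combinations(range(n), 2)]
print(np.isclose(q, np.mean(areas)))  # True
```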

9 of 45

Geometric interpretation of intrinsic and extrinsic noise


10 of 45

Single-cell data from Elowitz et al., 2002


11 of 45

On the matter of bias in estimators

  • The sample covariance formula follows from $\mathrm{cov}[X,Y] = E[XY] - E[X]E[Y]$:
$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$$
  • Recall: this formula yields a biased estimate of the covariance. The unbiased estimator is
$$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}).$$


12 of 45

Bessel’s correction

  • The unbiased estimator for the variance of $n$ numbers $x_1, x_2, \dots, x_n$ is
$$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2.$$
  • The “strange” $n-1$ can be understood by identifying the source of bias in the naïve estimator, which divides by $n$. For details, a derivation, and a discussion see (Pachter, 2014).
  • The $n-1$ in the denominator of the sample covariance estimator arises in the same way.
  • The removal of bias from the naïve estimator is known as Bessel’s correction. However, it comes at the cost of increased mean squared error…
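A small simulation sketch of the bias removed by Bessel's correction (Python/NumPy; the sample size, variance, and number of replicates are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
true_var = 4.0
n, reps = 5, 200_000

samples = rng.normal(scale=np.sqrt(true_var), size=(reps, n))
naive  = samples.var(axis=1, ddof=0)  # divides by n
bessel = samples.var(axis=1, ddof=1)  # divides by n-1 (Bessel's correction)

# The naive estimator underestimates the variance by a factor of (n-1)/n on average.
print(naive.mean(), true_var * (n - 1) / n)   # both close to 3.2
print(bessel.mean(), true_var)                # both close to 4.0
```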


13 of 45

The bias-variance tradeoff


  • Assume data of the form $y = f(x) + \varepsilon$ where $\varepsilon$ is noise with mean 0 and variance $\sigma^2$.
  • The goal is to find a function $g_D(x)$ that approximates $f(x)$ well using sampled data $D$.
  • The bias-variance tradeoff is the decomposition
$$E_D\bigl[(y - g_D(x))^2\bigr] = \bigl(\mathrm{Bias}_D[g_D(x)]\bigr)^2 + \mathrm{Var}_D[g_D(x)] + \sigma^2,$$
where
$$\mathrm{Bias}_D[g_D(x)] = E_D[g_D(x)] - f(x), \qquad \mathrm{Var}_D[g_D(x)] = E_D\Bigl[\bigl(E_D[g_D(x)] - g_D(x)\bigr)^2\Bigr].$$
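A simulation sketch of the decomposition at a single point (Python/NumPy; the true function, noise level, and straight-line estimator are assumptions made purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(4)
f = np.sin                        # assumed true function
sigma, n, reps, x0 = 0.3, 30, 20_000, 2.0

preds, errs = np.empty(reps), np.empty(reps)
for r in range(reps):
    x = rng.uniform(0, 2 * np.pi, size=n)      # a dataset D
    y = f(x) + rng.normal(0, sigma, size=n)
    coef = np.polyfit(x, y, deg=1)             # g_D: a straight-line fit
    preds[r] = np.polyval(coef, x0)
    y_new = f(x0) + rng.normal(0, sigma)       # a fresh observation at x0
    errs[r] = (y_new - preds[r]) ** 2

bias2 = (preds.mean() - f(x0)) ** 2
var = preds.var()
print(errs.mean(), bias2 + var + sigma ** 2)   # the two numbers should be close
```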

14 of 45

On the matter of bias in estimators

  • The sample covariance formula follows from $\mathrm{cov}[X,Y] = E[XY] - E[X]E[Y]$:
$$\frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \qquad \text{(more bias, lower mean squared error)}$$
  • Bessel’s correction:
$$\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) \qquad \text{(no bias, more mean squared error)}$$
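A sketch of the tradeoff for the two covariance estimators (Python/NumPy; bivariate Gaussian data and a small sample size are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)
n, reps = 5, 200_000
cov_true = 0.8
Sigma = np.array([[1.0, cov_true], [cov_true, 1.0]])

data = rng.multivariate_normal(mean=[0, 0], cov=Sigma, size=(reps, n))
x, y = data[..., 0], data[..., 1]
cross = ((x - x.mean(axis=1, keepdims=True)) *
         (y - y.mean(axis=1, keepdims=True))).sum(axis=1)

mse_biased   = np.mean((cross / n       - cov_true) ** 2)   # 1/n estimator
mse_unbiased = np.mean((cross / (n - 1) - cov_true) ** 2)   # Bessel-corrected estimator
print(mse_biased < mse_unbiased)  # True: the biased estimator has smaller mean squared error here
```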

15 of 45

Correlation

  • The covariance of two random variables X and Y is in units that are a product of those of X and Y. To obtain a dimensionless number, the covariance can be divided by the product of the standard deviation of X and the standard deviation of Y. This is called the correlation coefficient:
$$\rho_{X,Y} = \frac{\mathrm{cov}[X,Y]}{\sigma_X\,\sigma_Y}.$$
Other names include Pearson’s product-moment correlation coefficient, Pearson’s coefficient, or Pearson’s correlation.


16 of 45

Sample correlation coefficient

  • The sample correlation coefficient, denoted by $r$, is an estimate of the (population) Pearson correlation. There are several equivalent expressions; the analogue of the sample covariance formula we used is
$$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}.$$
  • Note that while Pearson’s correlation coefficient lies between -1 and 1 (inclusive), sampling error means that $r$ will fluctuate around the population value, and noise in the measurements tends to shrink it toward zero.
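A minimal check that the formula matches NumPy's built-in correlation (Python; simulated data):

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.normal(size=50)
y = 0.5 * x + rng.normal(size=50)

# Sample correlation from the formula above
r = np.sum((x - x.mean()) * (y - y.mean())) / np.sqrt(
    np.sum((x - x.mean()) ** 2) * np.sum((y - y.mean()) ** 2))

# np.corrcoef returns the 2x2 correlation matrix
print(np.isclose(r, np.corrcoef(x, y)[0, 1]))  # True
```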


17 of 45

The (Anscombe, 1973) quartet


18 of 45

An update of Anscombe’s quartet


19 of 45

Zero correlation does not mean zero structure


20 of 45

Exploratory Data Analysis (EDA)

  • It is not surprising that summarizing 2n numbers with a single number (the correlation) can result in substantial loss of information. This highlights the need for exploratory data analysis, the process of investigating data prior to analysis in order to detect anomalies, identify batch effects, detect patterns, check assumptions, etc. This can be facilitated with visualizations.
  • Example: checking the distributional assumptions of data. For instance, if genes have Poisson distributed expression, then plotting each gene’s mean expression against its variance (both estimated across cells) should reveal points on a straight line, since the variance of a Poisson distribution equals its mean.
  • Always begin analysis with EDA. 👈🏼
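A sketch of such a mean-variance diagnostic plot (Python/NumPy/matplotlib; the Poisson counts are simulated, so this only illustrates the idea, not real data):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
n_genes, n_cells = 500, 2000

# Assumed toy data: each gene's counts across cells are Poisson with its own rate
rates = rng.gamma(shape=2.0, scale=3.0, size=n_genes)
counts = rng.poisson(rates[:, None], size=(n_genes, n_cells))

gene_means = counts.mean(axis=1)
gene_vars = counts.var(axis=1, ddof=1)

plt.scatter(gene_means, gene_vars, s=5)
plt.plot([0, gene_means.max()], [0, gene_means.max()], color="red")  # the line variance = mean
plt.xlabel("mean expression")
plt.ylabel("variance across cells")
plt.show()
```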


21 of 45

Recall (from Lecture 3)

[Figure from Lecture 3: data with fitted linear regression lines, annotated “no noise?”]

22 of 45

Spearman rank correlation coefficient

  • Spearman’s rank correlation coefficient is the Pearson correlation between the rank variables derived from two random variables:
$$\rho_S = \rho_{R(X),R(Y)} = \frac{\mathrm{cov}[R(X),R(Y)]}{\sigma_{R(X)}\,\sigma_{R(Y)}}.$$
  • When all ranks are distinct, the sample estimate is given by
$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2-1)},$$
where $n$ is the number of observations and $d_i = R(X_i) - R(Y_i)$ is the difference between the ranks of each observation.
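A sketch verifying that the two expressions agree, and that they match SciPy (Python; the monotone toy data are an assumption for illustration):

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

rng = np.random.default_rng(8)
x = rng.normal(size=40)
y = x ** 3 + rng.normal(scale=0.1, size=40)   # a monotone relationship

rx, ry = rankdata(x), rankdata(y)
d = rx - ry
n = len(x)

rs_formula = 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))  # valid when all ranks are distinct
rs_pearson_of_ranks = np.corrcoef(rx, ry)[0, 1]           # the definition

print(np.isclose(rs_formula, rs_pearson_of_ranks))        # True
print(np.isclose(rs_formula, spearmanr(x, y)[0]))         # True
```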


23 of 45

Spearman correlation is less sensitive than Pearson correlation to outliers

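A sketch illustrating this robustness (Python/SciPy; the data and the single extreme outlier are invented for illustration):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(9)
x = rng.normal(size=30)
y = x + rng.normal(scale=0.2, size=30)

print(pearsonr(x, y)[0], spearmanr(x, y)[0])   # both close to 1

# Add a single extreme outlier
x_out = np.append(x, 100.0)
y_out = np.append(y, -100.0)
print(pearsonr(x_out, y_out)[0], spearmanr(x_out, y_out)[0])
# Pearson is destroyed by the outlier (it becomes strongly negative); Spearman remains high
```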

24 of 45

The coefficient of determination (R²)

  • In the case of least squares with a single regressor, the square of the sample correlation coefficient can be interpreted as a measure of the goodness of fit of the linear regression model. Specifically, it measures the proportion of the variation in the dependent variable that is predictable from the independent variable.
  • Formally, if the fitted values at the points $(x_i, y_i)$ are denoted $f_i$, and the residuals are $e_i = y_i - f_i$, then
$$R^2 = 1 - \frac{SS_{\mathrm{res}}}{SS_{\mathrm{tot}}},$$
where $SS_{\mathrm{res}} = \sum_i e_i^2$ and $SS_{\mathrm{tot}} = \sum_i (y_i - \bar{y})^2$.
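A sketch computing R² for a simple least squares fit and checking that it equals the squared sample correlation (Python/NumPy; the simulated data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(10)
x = rng.normal(size=60)
y = 3 * x + 1 + rng.normal(size=60)

slope, intercept = np.polyfit(x, y, deg=1)   # least squares fit with a single regressor
fitted = slope * x + intercept
residuals = y - fitted

ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# For simple linear regression, R^2 equals the squared sample correlation coefficient
print(np.isclose(r_squared, np.corrcoef(x, y)[0, 1] ** 2))  # True
```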

25 of 45

Geometric interpretation of R²


26 of 45

Using correlations for network science

  • Correlation is frequently used to identify groups of genes that interact with each other, and to construct gene regulatory networks. �
  • (Eisen et al., 1998) used correlations of gene expression measurements as a measure of interaction and applied this to constructing (hierarchical) clusters of genes in a time-course of serum stimulation of primary human fibroblasts:�
    • A: cholesterol biosynthesis
    • B: the cell cycle
    • C: the immediate-early response
    • D: signaling and angiogenesis
    • E: wound healing and tissue remodeling


27 of 45

Single-cell network inference

  • A popular method for inferring networks from gene correlations is Weighted Gene Correlation Network Analysis, or WGCNA (Langfelder and Horvath, 2008).
  • In WGCNA, correlation is used as a measure of coexpression (as in Eisen et al., 1998), a “soft-thresholding” of the correlation matrix is applied by raising the entries to a power (e.g. 12), and the resultant matrix is processed to try to identify biologically relevant structure in the data.


28 of 45

Single-cell RNA-seq example (Xue et al., 2013): Genetic programs in human and mouse early embryos revealed by single-cell RNA sequencing


29 of 45

A problem with correlation based networks

  • Consider a thought experiment with three genes: A, B, and C. Suppose that the expression of genes B and C is derived from the expression of gene A by adding a small amount of random noise, independently, to each. You might say that B and C “interact” with A. Consequently, the expression of A and B, and separately A and C, will be highly correlated.
    • What about B and C? They will be highly correlated by virtue of interaction with A, not with each other. B and C are conditionally independent given A: $B \perp\!\!\!\perp C \mid A$.

[Diagram: A → B and A → C]

30 of 45

Partial correlation

  • In order to better assess the “direct” correlation between B and C, instead of computing the correlation between them, the correlation between the residuals obtained after regressing B against A and C against A is computed. This is known as partial correlation.

[Diagram: regress B against A and C against A, then calculate the correlation between the B residuals and the C residuals]

31 of 45

Computing partial correlation

  • The partial correlation between B and C given A, denoted $\rho_{BC\cdot A}$, can be shown to simplify to
$$\rho_{BC\cdot A} = \frac{\rho_{BC} - \rho_{AB}\,\rho_{AC}}{\sqrt{1-\rho_{AB}^2}\,\sqrt{1-\rho_{AC}^2}}.$$
  • Partial correlation between two random variables can be extended to controlling for several variables: the residuals are computed by regressing against not one, but multiple other random variables, and are then correlated with each other.
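A sketch of the three-gene thought experiment (Python/NumPy; the noise level 0.3 is an arbitrary assumption), computing the partial correlation both from residuals and from the closed-form expression:

```python
import numpy as np

rng = np.random.default_rng(11)
n = 5000
a = rng.normal(size=n)
b = a + 0.3 * rng.normal(size=n)   # B derived from A plus independent noise
c = a + 0.3 * rng.normal(size=n)   # C derived from A plus independent noise

print(np.corrcoef(b, c)[0, 1])     # high, even though B and C only "interact" through A

# Partial correlation via residuals: regress B on A and C on A, then correlate the residuals
beta_b = np.cov(a, b)[0, 1] / np.var(a, ddof=1)
beta_c = np.cov(a, c)[0, 1] / np.var(a, ddof=1)
res_b = b - beta_b * a
res_c = c - beta_c * a
print(np.corrcoef(res_b, res_c)[0, 1])   # close to 0

# Same quantity from the closed-form expression
r_bc, r_ab, r_ac = np.corrcoef(b, c)[0, 1], np.corrcoef(a, b)[0, 1], np.corrcoef(a, c)[0, 1]
print((r_bc - r_ab * r_ac) / np.sqrt((1 - r_ab ** 2) * (1 - r_ac ** 2)))  # also close to 0
```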


32 of 45

Partial correlation of pairs of variables controlling for the rest

  • If $X_i$ and $X_j$ are random variables from a set $X$ of $n$ random variables, the partial correlation between them, controlling for the remainder $X \setminus \{X_i, X_j\}$, can be described succinctly in terms of the covariance matrix $\Omega$ (assumed positive definite, hence invertible):
$$\rho_{X_i X_j \cdot X \setminus \{X_i,X_j\}} = -\frac{p_{ij}}{\sqrt{p_{ii}\,p_{jj}}},$$
where $P = (p_{ij}) = \Omega^{-1}$ is the precision matrix.
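A sketch of the precision-matrix route to partial correlations (Python/NumPy), reusing the same assumed three-gene simulation:

```python
import numpy as np

rng = np.random.default_rng(12)
n = 5000
a = rng.normal(size=n)
b = a + 0.3 * rng.normal(size=n)
c = a + 0.3 * rng.normal(size=n)
X = np.vstack([a, b, c])            # rows are the variables A, B, C

P = np.linalg.inv(np.cov(X))        # precision matrix (covariance matrix assumed invertible)

def partial_corr(P, i, j):
    """Partial correlation of variables i and j controlling for all the others."""
    return -P[i, j] / np.sqrt(P[i, i] * P[j, j])

print(partial_corr(P, 1, 2))        # B,C given A: close to 0
print(partial_corr(P, 0, 1))        # A,B given C: remains large
```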


33 of 45

Remarks about partial correlation

  • While partial correlations have a two-step definition (first perform regressions to compute residuals, then correlate the residuals), the resulting formulas can be stated succinctly, making computation tractable.
  • While correlations are useful to examine, it is worthwhile to invert the covariance matrix and work with partial correlations, because the procedure removes “indirect effects”.
  • If all the random variables involved are multivariate Gaussian, then the partial correlation between a pair of random variables controlling for the rest is zero if, and only if, the pair of random variables are conditionally independent given the rest.


34 of 45

A note about causality

  • The diagram in the three-gene example (A → B and A → C) is a form of causal diagram: a directed graph in which a variable is connected to the other variables upon which it has causal influence. The arrows indicate the direction of causality.
  • Partial correlations can rule out certain causal relationships by establishing independence conditions (for Gaussian variables). But two causal diagrams may be impossible to distinguish. The following causal diagram yields the same independence conditions as the one above:

[Diagrams: the causal diagram from the three-gene example, and an alternative causal diagram over A, B, and C with the same independence conditions]

35 of 45

Correlation does not imply causation

  • An in-depth discussion about causality is beyond the scope of this class. Even a cursory introduction would require significant time to cover the most basic philosophical and statistical issues.
  • Note that correlation does not imply causation!


36 of 45

Regularized partial correlation

  • In single-cell RNA-seq data (or other genomics datasets), there is a lot of sparsity and the number of genes often exceeds the number of cells, so the sample covariance matrix may not be invertible. However, partial correlations can still be computed by using a shrinkage covariance estimator.
  • There are several approaches to this, for example (Schäfer and Strimmer, 2005).
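A sketch of one possible shrinkage approach using scikit-learn's Ledoit-Wolf estimator (an illustrative substitute, not the Schäfer and Strimmer estimator cited above; the simulated counts are arbitrary):

```python
import numpy as np
from sklearn.covariance import LedoitWolf

rng = np.random.default_rng(13)
n_cells, n_genes = 50, 200          # more genes than cells: the sample covariance is singular

X = rng.poisson(1.0, size=(n_cells, n_genes)).astype(float)

lw = LedoitWolf().fit(X)            # shrinkage covariance estimate (invertible)
P = lw.precision_                   # its inverse, the (shrunken) precision matrix

d = np.sqrt(np.diag(P))
partial_corr = -P / np.outer(d, d)  # matrix of partial correlations, controlling for the rest
np.fill_diagonal(partial_corr, 1.0)
print(partial_corr.shape)           # (200, 200)
```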


37 of 45

Network deconvolution

  • There are lots of approaches to constructing networks from gene expression data. Many of them involve heuristics motivated by intuition but not grounded in theory. An example is “network deconvolution” (Feizi et al., 2013).
  • A simulation study (Bray and Pachter, 2013), based on data generated from Gaussian graphical models (Lecture 15), shows that:
    • partial correlation is better at identifying edges in the network than correlation.
    • partial correlation outperforms network deconvolution.


38 of 45

Correlation can result from causation

  • However, sometimes (frequently?) correlation does result from causation!

[Figure: an example of autocorrelation]

39 of 45

Spatial autocorrelation and Moran’s I

[Figure: the slope of the fitted line is Moran’s I]

40 of 45

Some remarks on measurement of performance

  • The network algorithms benchmarked in the simulation study (Bray and Pachter, 2013) each output a ranked list of edges (e.g. ranked by correlation, partial correlation, etc.). To assess the quality of each list, it is compared to a set of “ground truth” edges (from which the data was simulated). A standard way to assess accuracy is by computing the area under the receiver operator curve (AUROC).
  • This is part of a collection of jargon for performance assessment known as the confusion matrix. It’s confusing!


41 of 45

The confusion matrix

  • Consider 11 items labeled blue (positive) or red (negative): A,B,C,D,E,F,G,H,I,J, and K.
  • An algorithm returns as output a ranked list of the items, with the ones most likely to be blue in front: C,E,A,B,F,K,G,D,I,H, and J
  • The decision of where to “draw the line” is arbitrary. We can decide that only the top prediction is likely to be correct. Or we can threshold after six items (see table on right).
  • Such a table is called a confusion matrix.


42 of 45

The Receiver Operator Curve (ROC)


  • Ground truth: A,B,C,D,E,F,G,H,I,J, and K
  • Algorithm output: C,E,A,B,F,K,G,D,I,H, and J

43 of 45

Area Under the Receiver Operator Curve

  • The AUROC is the probability that a (uniformly at random) selected positive item is ranked higher than a (uniformly at random) selected negative item.
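A sketch verifying this probability interpretation against scikit-learn's AUROC (Python; the positive/negative labels here are a hypothetical assignment, since the colors of items A-K are only shown graphically on the slide):

```python
import numpy as np
from itertools import product
from sklearn.metrics import roc_auc_score

# Hypothetical labels for 11 ranked items: 1 = positive, 0 = negative (invented example)
labels = np.array([1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0])
scores = -np.arange(len(labels), dtype=float)   # earlier in the ranked list = higher score

auroc = roc_auc_score(labels, scores)

# Probability interpretation: pick a positive and a negative uniformly at random;
# AUROC is the probability the positive is ranked higher (ties would count one half).
pos = scores[labels == 1]
neg = scores[labels == 0]
pairs = [(p > q) + 0.5 * (p == q) for p, q in product(pos, neg)]
print(auroc, np.mean(pairs))   # the two numbers agree
```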

44 of 45

Summary

Covariance

Pearson’s correlation

Coefficient of determination

Partial correlation


45 of 45

Additional References
