1 of 45

Multiple testing (issues)

Lior Pachter

California Institute of Technology


Lecture 14

Caltech Bi/BE/CS183

Spring 2023

These slides are distributed under the CC BY 4.0 license

2 of 45

Finding marker genes (Lecture 13)

  • Consider two pre-defined groups of cells:
    • For each gene, there are two distributions (one in each group).
      • Question: are the means in each of the groups the same?
    • This question of differential expression requires
      • Specification of what exactly is meant by “gene”.
      • Specification of exactly which distributions are being compared.
      • Determination of the magnitude of effect.
      • Determination of statistical significance.
    • Testing many genes requires accounting for the chance of finding a gene that appears significantly different between the groups merely because many tests were performed.


3 of 45

The problem with pre-specified clusters (Lecture 13)

  • Once clusters are pre-specified, there will by definition be genes that differ between them. Such genes are not interesting; they are merely artifacts of the clustering procedure that was performed.


4 of 45

Selective inference


Lucy Gao

Jacob Bien

Daniela Witten

  • A seemingly valid solution (sample splitting) fails: even if the cells are split into two sets, with clusters estimated on one set and differential expression tested on the other, the held-out cells are assigned to clusters using the same genes that are then tested, so the resulting p-values are still invalid.

5 of 45

QQ plot

  • A visualization that is useful for comparing two distributions.
  • A point (x, y) on the plot shows, for the same quantile, its value for one distribution (x coordinate) versus the other (y coordinate).
  • If the two distributions are the same, the QQ plot will be a line at 45 degrees, as in the sketch below.
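A minimal sketch in Python; the "observed" values here are simulated uniforms standing in for p-values from real tests:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)
    pvals = rng.uniform(size=1000)  # stand-in for p-values from real tests

    # Compare matching quantiles of the two distributions
    q = np.linspace(0, 1, 101)
    x = np.quantile(rng.uniform(size=100000), q)  # reference distribution
    y = np.quantile(pvals, q)                     # observed distribution

    plt.scatter(x, y, s=10)
    plt.plot([0, 1], [0, 1], "r--")  # 45-degree line: distributions agree
    plt.xlabel("Uniform quantiles")
    plt.ylabel("Observed p-value quantiles")
    plt.show()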


6 of 45

Selective inference


Lucy Gao

Jacob Bien

Daniela Witten

  • A selective inference procedure can estimate valid p-values:

7 of 45

Selective inference


Lucy Gao

Jacob Bien

Daniela Witten

  • A selective inference procedure can estimate valid p-values.
  • The p-value that needs to be computed is:
    • From among the datasets that result in two clusters, say C1 and C2, what is the probability, assuming that there is no difference in the means (null hypothesis), of obtaining the observed difference (or more) between the sample means of C1 and C2?
  • This p-value is difficult to compute, but Gao et al. find a tractable computation by conditioning on “extra things”.

8 of 45

“Extra things”


Lucy Gao

Jacob Bien

Daniela Witten

9 of 45

Recall: the familywise error rate (Lecture 13)

  • Suppose that multiple hypotheses H1, ..., Hm are being tested (one for each gene), and the p-values obtained are p1, ..., pm.
  • The total number of tested hypotheses is m. Suppose that among these there are m0 true null hypotheses (the number m0 is unknown).
  • The familywise error rate (FWER) is the probability of rejecting at least one of the true null hypotheses, that is, of having at least one false positive.
  • “Control” of the FWER means choosing a threshold for rejecting each null hypothesis so that in aggregate the FWER is less than some threshold α.


10 of 45

Recall: The Bonferroni correction (Lecture 13)

  • Control of the FWER at level α can be achieved by rejecting all null hypotheses for which pi ≤ α/m.
  • Proof: by the union bound, the probability that at least one of the m0 true null hypotheses has pi ≤ α/m is at most m0 · (α/m) ≤ α.
  • This is known as the Bonferroni correction.


11 of 45

The false discovery rate

  • In a hypothesis testing procedure, the number of false discoveries is the number of type I errors (false positives) among the rejections of the null hypothesis.
  • The false discovery rate (FDR) is the expected proportion of false discoveries among the rejections.
    • Formally, FDR = E[V/R], with the convention that V/R = 0 when R = 0. Here V = number of type I errors, R = number of rejected null hypotheses.


12 of 45

The Benjamini-Hochberg multiple testing correction

  • Given the p-values for m hypothesis tests, sort the p-values: p(1) ≤ p(2) ≤ … ≤ p(m).
  • For a given threshold α, find the largest value k such that p(k) ≤ (k/m)·α.
  • Reject all the null hypotheses H(i) for i = 1, ..., k (a from-scratch sketch follows below).
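A from-scratch sketch of the procedure in Python; the function name is ours for illustration, and the p-values are those of the worked example a few slides below:

    import numpy as np

    def benjamini_hochberg(pvals, alpha=0.05):
        # Returns a boolean mask of rejected hypotheses.
        p = np.asarray(pvals)
        m = len(p)
        order = np.argsort(p)                    # indices that sort p ascending
        below = p[order] <= alpha * np.arange(1, m + 1) / m
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.max(np.nonzero(below)[0])     # largest k with p_(k) <= (k/m)*alpha
            reject[order[: k + 1]] = True        # reject H_(1), ..., H_(k)
        return reject

    pvals = [0.02, 0.03, 0.035, 0.006, 0.055, 0.047, 0.01, 0.04, 0.015, 0.025]
    print(benjamini_hochberg(pvals))  # rejects everything except p = 0.047 and p = 0.055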


13 of 45

q-values

  • Given the p-values for m hypothesis tests, sort the p-values: p(1) ≤ p(2) ≤ … ≤ p(m).
  • For each p-value, set the q-value to be q(i) = min over j ≥ i of m·p(j)/j (the smallest FDR threshold at which H(i) would be rejected by Benjamini-Hochberg).
  • In other words, rejecting the null hypotheses with q-values less than α ensures the expected false discovery rate is at most α (a sketch computing q-values follows below).
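A minimal sketch computing BH-adjusted p-values (q-values) in Python, on the same example; for these p-values the eight smallest all receive q = 0.05:

    import numpy as np

    def bh_qvalues(pvals):
        # q_(i) = min over j >= i of m * p_(j) / j
        p = np.asarray(pvals)
        m = len(p)
        order = np.argsort(p)
        ranked = p[order] * m / np.arange(1, m + 1)
        # running minimum from the right enforces monotonicity in i
        q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
        q = np.empty(m)
        q[order] = np.minimum(q_sorted, 1.0)
        return q

    pvals = [0.02, 0.03, 0.035, 0.006, 0.055, 0.047, 0.01, 0.04, 0.015, 0.025]
    print(bh_qvalues(pvals))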


14 of 45

Example

  • Consider the p-values generated for the following ten hypotheses:

      Hypothesis:  1      2      3      4      5      6      7      8      9      10
      p-value:     0.02   0.03   0.035  0.006  0.055  0.047  0.01   0.04   0.015  0.025

  • Using Benjamini-Hochberg, which hypotheses would one reject to control the FDR at a threshold of 0.05?


15 of 45

Example

  • Consider the p-values generated for the following ten hypotheses:

      Hypothesis:  1      2      3      4      5      6      7      8      9      10
      p-value:     0.02   0.03   0.035  0.006  0.055  0.047  0.01   0.04   0.015  0.025

  • Using Benjamini-Hochberg, which hypotheses would one reject to control the FDR at a threshold of 0.05?

      Rank:        (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)    (10)
      p(i):        0.006  0.01   0.015  0.02   0.025  0.03   0.035  0.04   0.047  0.055
      (i/m)·α:     0.005  0.01   0.015  0.02   0.025  0.03   0.035  0.04   0.045  0.05

  • The largest k with p(k) ≤ (k/m)·α is k = 8, so the hypotheses with the eight smallest p-values are rejected (all except hypotheses 5 and 6).


16 of 45

The Benjamini-Yekutieli multiple testing correction

  • Given the p-values for m hypothesis tests, sort the p-values: p(1) ≤ p(2) ≤ … ≤ p(m).
  • For a given threshold α, find the largest value k such that p(k) ≤ (k/(m·c(m)))·α.
  • Reject all the null hypotheses H(i) for i = 1, ..., k.


17 of 45

The Benjamini-Yekutieli multiple testing correction

  • Given the p-values for m hypothesis tests, sort the p-values: p(1) ≤ p(2) ≤ … ≤ p(m).
  • For a given threshold α, find the largest value k such that p(k) ≤ (k/(m·c(m)))·α.
    • c(m) = 1 gives the Benjamini-Hochberg procedure (valid when the tests are independent or positively correlated).
    • Under arbitrary dependence, c(m) is the harmonic number c(m) = 1 + 1/2 + 1/3 + … + 1/m.


18 of 45

Performing multiple testing correction in R
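A minimal sketch using p.adjust from base R, on the example p-values from the earlier slides:

    pvals <- c(0.02, 0.03, 0.035, 0.006, 0.055, 0.047, 0.01, 0.04, 0.015, 0.025)

    p.adjust(pvals, method = "bonferroni")  # FWER control (Bonferroni)
    p.adjust(pvals, method = "BH")          # FDR control (Benjamini-Hochberg)
    p.adjust(pvals, method = "BY")          # FDR under arbitrary dependence (Benjamini-Yekutieli)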


19 of 45

Performing multiple testing correction in Python
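A minimal sketch using multipletests from the statsmodels package, on the same example p-values:

    from statsmodels.stats.multitest import multipletests

    pvals = [0.02, 0.03, 0.035, 0.006, 0.055, 0.047, 0.01, 0.04, 0.015, 0.025]

    # other method options include 'bonferroni' and 'fdr_by' (Benjamini-Yekutieli)
    reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
    print(reject)  # boolean mask of rejected hypotheses
    print(qvals)   # BH-adjusted p-values (q-values)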


20 of 45

Power

  • We have discussed Type I error, but Type II error is also important in hypothesis testing.
  • The power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true.
  • Power is commonly denoted by 1 − β, where β is the probability of making a Type II error by incorrectly failing to reject the null hypothesis. As β increases, the power of a test decreases (a worked sketch follows below).
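A worked sketch in Python, assuming a two-sided one-sample z-test with a normal approximation; the effect size, sample size, and α below are hypothetical:

    from scipy.stats import norm

    effect, n, alpha = 0.5, 30, 0.05  # effect size (in standard deviations), sample size, level
    z = norm.ppf(1 - alpha / 2)       # critical value of the two-sided test
    shift = effect * n ** 0.5         # mean of the test statistic under H1
    power = norm.sf(z - shift) + norm.cdf(-z - shift)
    print(round(power, 3))            # probability of correctly rejecting H0, about 0.78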


21 of 45

Aggregation instead of multiple testing

  • Larger sample sizes increase power.
  • Multiple testing decreases power.
    • Testing individual isoforms for differential expression substantially increases the number of tests, which results in a loss of power.
  • Instead of resolving tests to individual isoforms, it is possible to aggregate p-values from tests of isoforms at the gene level to determine whether some isoforms may have shifted (without identifying which ones).


22 of 45

Motivating examples

  • Cancellation


23 of 45

Motivating examples

  • Domination


24 of 45

Motivating examples

  • Collapsing


25 of 45

Aggregation then analysis instead of analysis then aggregation

  • Instead of first quantifying gene abundance and then testing for differences…
  • First test for (transcript) differences and then aggregate p-values (meta-analysis).


26 of 45

Šidák aggregation

  • Compute p-values for the transcripts comprising each gene.
    • This step is performed with a differential analysis method suitable for isoforms, e.g. (Pimentel et al. 2017).
  • Let mg denote the minimum of these p-values for gene g.
    • This determines whether some isoform changed significantly.
  • Adjust mg to ug = 1 − (1 − mg)^|g|, where |g| is the number of transcripts in gene g.
    • These adjusted p-values will be uniformly distributed under the null hypothesis (when the transcript p-values are independent).
  • Correct the ug for multiple hypothesis testing, e.g. using the Benjamini-Hochberg multiple testing procedure for controlling FDR (see the sketch below).
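A minimal sketch of the procedure in Python; the gene names and transcript p-values below are hypothetical:

    import numpy as np
    from statsmodels.stats.multitest import multipletests

    # transcript-level p-values grouped by gene (hypothetical values)
    transcript_pvals = {
        "geneA": [0.001, 0.20, 0.76],
        "geneB": [0.04, 0.07],
        "geneC": [0.51, 0.33, 0.90, 0.62],
    }

    genes, u = list(transcript_pvals), []
    for g in genes:
        p = np.asarray(transcript_pvals[g])
        m_g = p.min()                      # minimum transcript p-value for gene g
        u.append(1 - (1 - m_g) ** len(p))  # Sidak adjustment: u_g = 1 - (1 - m_g)^|g|

    # correct the adjusted gene-level p-values for multiple testing (BH)
    reject, qvals, _, _ = multipletests(u, alpha=0.05, method="fdr_bh")
    print(dict(zip(genes, qvals)))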


27 of 45

Šidák aggregation can fail in the collapsing scenario


28 of 45

Fisher and Lancaster’s method

  • Fisher’s method: given p-values p1, ..., pk for k independent tests, the statistic X = −2 Σ log(pi) is chi-squared distributed with 2k degrees of freedom under the null hypothesis.
  • E.g. for k = 2, X = −2 log(p1·p2) is chi-squared distributed with 4 degrees of freedom.
  • The way to think about this is that each p-value is being transformed (by −2 log). Under the null hypothesis each transformed p-value is chi-squared distributed with 2 degrees of freedom, so their sum is chi-squared distributed with 2k degrees of freedom (a sketch of Fisher’s method follows below).
  • Lancaster’s method “weights” the p-values by adjusting the inverse transform so that uniform p-values are mapped to a chi-squared distribution whose degrees of freedom are chosen according to the weights.
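A minimal sketch of Fisher’s method in Python using scipy; the p-values are hypothetical, and Lancaster’s method is not in scipy (it would need to be implemented separately):

    from scipy import stats

    pvals = [0.01, 0.20, 0.45]  # hypothetical p-values for one gene's transcripts

    # Fisher's method: compares -2 * sum(log p_i) to chi-squared with 2k df
    stat, p_combined = stats.combine_pvalues(pvals, method="fisher")
    print(stat, p_combined)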


29 of 45

Chi-squared distribution


30 of 45

Chi-squared goodness of fit test

  • The Pearson chi-squared test is used to test goodness of fit to a distribution. Suppose counts O1, ..., On have been obtained in n categories, and the goal is to test whether these observations match a null hypothesis of expected counts E1, ..., En. Then the test statistic

      X² = Σi (Oi − Ei)² / Ei

    will be approximately chi-squared distributed with n − 1 degrees of freedom (a sketch follows below).
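A minimal sketch in Python using scipy; the observed and expected counts are hypothetical:

    from scipy.stats import chisquare

    observed = [18, 25, 30, 27]   # counts in n = 4 categories
    expected = [25, 25, 25, 25]   # expected counts under the null hypothesis

    stat, pvalue = chisquare(f_obs=observed, f_exp=expected)
    print(stat, pvalue)  # statistic is compared to chi-squared with n - 1 = 3 df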


31 of 45

Lancaster aggregation works well to aggregate isoform results to the gene-level


32 of 45

Summary

Selective inference: p-values must be computed with care when the hypotheses being tested were themselves chosen using the data; valid p-values must account for (e.g. condition on) that selection.

Multiple testing: when considering corrections for multiple testing a choice must be made for what is being optimized (e.g. FWER vs. FDR). This choice dictates the correction procedure.

Type I and Type II error: controlling for Type I error is important but the power of a test, i.e. the extent of Type II error, must also be taken into consideration.

Aggregation: Experimental design must consider sample size, the number of tests to be performed, and the resolution of results desired. The aggregation of p-values, as done in meta-analysis, can be a useful way to control resolution.


33 of 45

Choosing a model

  • One issue we have not discussed yet is model selection. This refers to the problem of choosing the type of model to use for a problem, and even within a class of models, the number of parameters to use.
  • There are many approaches to model selection. A common one is to compute the Akaike Information Criterion (AIC):

      AIC = 2k − 2 ln(L),

    where k is the number of parameters in the model and L is the value of the likelihood function. Models with lower AIC are preferred (a sketch follows below).
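A minimal sketch in Python, using hypothetical simulated count data to compare two models by AIC:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = rng.poisson(4.0, size=200)  # hypothetical count data

    # Poisson model: k = 1 parameter (the mean), fit by maximum likelihood
    lam = data.mean()
    aic_poisson = 2 * 1 - 2 * stats.poisson.logpmf(data, lam).sum()

    # Normal model: k = 2 parameters (mean and standard deviation)
    mu, sigma = data.mean(), data.std()
    aic_normal = 2 * 2 - 2 * stats.norm.logpdf(data, mu, sigma).sum()

    print(aic_poisson, aic_normal)  # the model with the lower AIC is preferred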


34 of 45

Twenty questions XIX

  • No time to get to question XX… Just go with what you have….


35 of 45

T̶w̶e̶n̶t̶y̶ ̶q̶u̶e̶s̶t̶i̶o̶n̶s̶ XX

  • You’ve found marker genes for your cell types. Did you scale your cells for read depth prior to this step?
    • Did you scale with respect to read depth prior to stabilizing variance (e.g. computing log1p)?
    • Did you scale with respect to read depth after stabilizing variance (e.g. computing log1p)?
    • Did you forget to stabilize variance prior to computing PCA?
      • Did you do a PCA step?


36 of 45

Forty two questions XXI

  • You ran six methods to find marker genes.
    • Do you make a Venn diagram of the genes that were significant and use genes identified by all methods?
    • Do you make a Venn diagram of the genes that were significant and use genes identified by any method?


37 of 45

Forty two questions XXII

  • You settled on a list of marker genes after correcting p-values using the Benjamini-Hochberg method.
    • Can you consider any gene on your list to be a marker gene for sure?
    • Are genes at the top of the list more likely to be real marker genes?


38 of 45

Forty two questions XXIII

  • You compare your list of marker genes for the cell types you are annotating to a list from a prior paper.
    • Do you expect the two lists to be the same?
    • How similar should the lists be for you to believe you have corroborated / repeated their analysis?


39 of 45

Forty two questions XXIV

  • You have performed a p-value correction with the Benjamini-Hochberg method, but a colleague tells you genes are not “independent”.
    • Should you redo your analysis with the Benjamini-Yekutieli correction?
    • Leave your analysis as is?
    • Does it matter?


40 of 45

Forty two questions XXV

  • You examine your top marker gene and the p-value is p = 10⁻²¹³. You check on Google and find that there are at most 10⁸² atoms in the universe. You realize that the p-value for your gene is much smaller than the probability of pulling a single specific atom in the universe out of a hat.
    • Should this worry you?
    • No problem.


41 of 45

Forty two questions XXVI

  • You have a lot of p-values on your list that are right under the 0.05 threshold.
    • Should you include all those genes?
    • Worry about your admittedly arbitrary 0.05 threshold?


42 of 45

Forty two questions XXVII

  • You are finally “happy” with your marker gene list but a colleague now wants a list of isoform-specific markers. You have SMART-seq2 data but your first analysis was of the 10x Genomics 3’-end data. What do you do?
    • Rerun the analysis using the SMART-seq2 data, just replacing gene quantifications with isoform quantifications?
    • Tell your colleague isoforms are irrelevant to biology.


43 of 45

Forty two questions XXVIII

  • You decide to perform an isoform analysis with the SMART-seq2 data but you realize you need to annotate the cells with respect to the cell types you found in your 10x Genomics data, or not.
    • Do you cluster the SMART-seq2 data separately, identify marker genes, and then try to associate SMART-seq2 based clusters with the 10x clusters?
    • Try to assign SMART-seq2 cells cluster labels according to the 10x Genomics profiles?


44 of 45

Forty two questions XXIX

  • You cluster the SMART-seq2 data separately, and run a differential expression analysis to find marker genes. When performing the multiple testing correction, do you…
    • Use the number of genes?
    • Use twice the number of genes because you already did an analysis with the 10x Genomics data?
      • And do you go back and double the number of tests performed for the 10x Genomics data as well?
        • You start wondering whether maybe you should adjust by all the tests you have ever performed in your life…
          • Should you?
            • You start thinking you publish just once…
              • Should you?


45 of 45

Additional References

  • Shaffer, 1995: Multiple hypothesis testing.
  • Leek and Storey, 2008: A general framework for multiple testing dependence.
  • Squair et al., 2021: Confronting false discoveries in single-cell differential expression.
  • The mathematical foundations for model selection (and information theory in statistics).
