1 of 45

Multiple testing (issues)

Lior Pachter

California Institute of Technology


Lecture 14

Caltech Bi/BE/CS183

Spring 2023

These slides are distributed under the CC BY 4.0 license

2 of 45

Finding marker genes (Lecture 13)

  • Consider two pre-defined groups of cells:
    • For each gene, there are two distributions (one in each group).
      • Question: are the means in each of the groups the same?
    • This question of differential expression requires
      • Specification of what exactly is meant by “gene”.
      • Specification of exactly which distributions are being compared.
      • Determination of the magnitude of effect.
      • Determination of statistical significance.
    • Testing many genes requires accounting for the chance of finding a gene that appears significantly different between the groups merely because many tests were performed.


3 of 45

The problem with pre-specified clusters (Lecture 13)

  • Once clusters are pre-specified, there will by definition be genes that differ between them. Such genes are not interesting; they are merely artifacts of the clustering procedure that was performed.


4 of 45

Selective inference


Lucy Gao

Jacob Bien

Daniela Witten

  • A seemingly valid solution (sample splitting) fails: even if the cells are split into two sets, with clusters estimated on one set and differential expression tested on the other, the held-out cells are assigned to clusters using the same genes that are then tested, so the resulting p-values are still invalid.

5 of 45

QQ plot

  • A visualization that is useful for comparing two distributions.
  • A point (x, y) on the plot shows, for the same quantile, its value for one distribution (x coordinate) versus the other (y coordinate).
  • If the two distributions are the same, the QQ plot will be a line at 45 degrees, as in the sketch below.
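A minimal sketch in Python; the "observed" values here are simulated uniforms standing in for p-values from real tests:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)
    pvals = rng.uniform(size=1000)  # stand-in for p-values from real tests

    # Compare matching quantiles of the two distributions
    q = np.linspace(0, 1, 101)
    x = np.quantile(rng.uniform(size=100000), q)  # reference distribution
    y = np.quantile(pvals, q)                     # observed distribution

    plt.scatter(x, y, s=10)
    plt.plot([0, 1], [0, 1], "r--")  # 45-degree line: distributions agree
    plt.xlabel("Uniform quantiles")
    plt.ylabel("Observed p-value quantiles")
    plt.show()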


6 of 45

Selective inference


Lucy Gao

Jacob Bien

Daniela Witten

  • A selective inference procedure can estimate valid p-values:

7 of 45

Selective inference


Lucy Gao

Jacob Bien

Daniela Witten

  • A selective inference procedure can estimate valid p-values.
  • The p-value that needs to be computed is:
    • From among the datasets that result in two clusters, say C1 and C2, what is the probability, assuming that there is no difference in the means (null hypothesis), of obtaining the observed difference (or more) between the sample means of C1 and C2?
  • This p-value is difficult to compute, but Gao et al. find a tractable computation by conditioning on “extra things”.

8 of 45

“Extra things”


Lucy Gao

Jacob Bien

Daniela Witten

9 of 45

Recall: the familywise error rate (Lecture 13)

  • Suppose that multiple hypotheses H1, ..., Hm are being tested (one for each gene), and the p-values obtained are p1, ..., pm.
  • The total number of tested hypotheses is m. Suppose that among these there are m0 true null hypotheses (the number m0 is unknown).
  • The familywise error rate (FWER) is the probability of rejecting at least one of the true null hypotheses, that is, of having at least one false positive.
  • “Control” of the FWER means choosing a threshold for rejecting each null hypothesis so that in aggregate the FWER is less than some threshold α.


10 of 45

Recall: The Bonferroni correction (Lecture 13)

  • Control of the FWER at level α can be achieved by rejecting all null hypotheses for which pi ≤ α/m.
  • Proof: by the union bound, the probability that at least one of the m0 true null hypotheses has pi ≤ α/m is at most m0 · (α/m) ≤ α.
  • This is known as the Bonferroni correction.


11 of 45

The false discovery rate

  • In a hypothesis testing procedure, the number of false discoveries is the number of type I errors (false positives) among the rejections of the null hypothesis.
  • The false discovery rate (FDR) is the expected proportion of false discoveries among the rejections.
    • Formally, FDR = E[V/R], with the convention that V/R = 0 when R = 0. Here V = number of type I errors, R = number of rejected null hypotheses.


12 of 45

The Benjamini-Hochberg multiple testing correction

  • Given the p-values for m hypothesis tests, sort the p-values: p(1) ≤ p(2) ≤ … ≤ p(m).
  • For a given threshold α, find the largest value k such that p(k) ≤ (k/m)·α.
  • Reject all the null hypotheses H(i) for i = 1, ..., k (a from-scratch sketch follows below).
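A from-scratch sketch of the procedure in Python; the function name is ours for illustration, and the p-values are those of the worked example a few slides below:

    import numpy as np

    def benjamini_hochberg(pvals, alpha=0.05):
        # Returns a boolean mask of rejected hypotheses.
        p = np.asarray(pvals)
        m = len(p)
        order = np.argsort(p)                    # indices that sort p ascending
        below = p[order] <= alpha * np.arange(1, m + 1) / m
        reject = np.zeros(m, dtype=bool)
        if below.any():
            k = np.max(np.nonzero(below)[0])     # largest k with p_(k) <= (k/m)*alpha
            reject[order[: k + 1]] = True        # reject H_(1), ..., H_(k)
        return reject

    pvals = [0.02, 0.03, 0.035, 0.006, 0.055, 0.047, 0.01, 0.04, 0.015, 0.025]
    print(benjamini_hochberg(pvals))  # rejects everything except p = 0.047 and p = 0.055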


13 of 45

q-values

  • Given the p-values for m hypothesis tests, sort the p-values: p(1) ≤ p(2) ≤ … ≤ p(m).
  • For each p-value, set the q-value to be q(i) = min over j ≥ i of m·p(j)/j (the smallest FDR threshold at which H(i) would be rejected by Benjamini-Hochberg).
  • In other words, rejecting the null hypotheses with q-values less than α ensures the expected false discovery rate is at most α (a sketch computing q-values follows below).
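A minimal sketch computing BH-adjusted p-values (q-values) in Python, on the same example; for these p-values the eight smallest all receive q = 0.05:

    import numpy as np

    def bh_qvalues(pvals):
        # q_(i) = min over j >= i of m * p_(j) / j
        p = np.asarray(pvals)
        m = len(p)
        order = np.argsort(p)
        ranked = p[order] * m / np.arange(1, m + 1)
        # running minimum from the right enforces monotonicity in i
        q_sorted = np.minimum.accumulate(ranked[::-1])[::-1]
        q = np.empty(m)
        q[order] = np.minimum(q_sorted, 1.0)
        return q

    pvals = [0.02, 0.03, 0.035, 0.006, 0.055, 0.047, 0.01, 0.04, 0.015, 0.025]
    print(bh_qvalues(pvals))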


14 of 45

Example

  • Consider the p-values generated for the following ten hypotheses:

      Hypothesis:  1      2      3      4      5      6      7      8      9      10
      p-value:     0.02   0.03   0.035  0.006  0.055  0.047  0.01   0.04   0.015  0.025

  • Using Benjamini-Hochberg, which hypotheses would one reject to control the FDR at a threshold of 0.05?


15 of 45

Example

  • Consider the p-values generated for the following ten hypotheses:

      Hypothesis:  1      2      3      4      5      6      7      8      9      10
      p-value:     0.02   0.03   0.035  0.006  0.055  0.047  0.01   0.04   0.015  0.025

  • Using Benjamini-Hochberg, which hypotheses would one reject to control the FDR at a threshold of 0.05?

      Rank:        (1)    (2)    (3)    (4)    (5)    (6)    (7)    (8)    (9)    (10)
      p(i):        0.006  0.01   0.015  0.02   0.025  0.03   0.035  0.04   0.047  0.055
      (i/m)·α:     0.005  0.01   0.015  0.02   0.025  0.03   0.035  0.04   0.045  0.05

  • The largest k with p(k) ≤ (k/m)·α is k = 8, so the hypotheses with the eight smallest p-values are rejected (all except hypotheses 5 and 6).


16 of 45

The Benjamini-Yekutieli multiple testing correction

  • Given the p-values for m hypothesis tests, sort the p-values: p(1) ≤ p(2) ≤ … ≤ p(m).
  • For a given threshold α, find the largest value k such that p(k) ≤ (k/(m·c(m)))·α.
  • Reject all the null hypotheses H(i) for i = 1, ..., k.


17 of 45

The Benjamini-Yekutieli multiple testing correction

  • Given the p-values for m hypothesis tests, sort the p-values: p(1) ≤ p(2) ≤ … ≤ p(m).
  • For a given threshold α, find the largest value k such that p(k) ≤ (k/(m·c(m)))·α.
    • c(m) = 1 gives the Benjamini-Hochberg procedure (valid when the tests are independent or positively correlated).
    • Under arbitrary dependence, c(m) is the harmonic number c(m) = 1 + 1/2 + 1/3 + … + 1/m.


18 of 45

Performing multiple testing correction in R
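A minimal sketch using p.adjust from base R, on the example p-values from the earlier slides:

    pvals <- c(0.02, 0.03, 0.035, 0.006, 0.055, 0.047, 0.01, 0.04, 0.015, 0.025)

    p.adjust(pvals, method = "bonferroni")  # FWER control (Bonferroni)
    p.adjust(pvals, method = "BH")          # FDR control (Benjamini-Hochberg)
    p.adjust(pvals, method = "BY")          # FDR under arbitrary dependence (Benjamini-Yekutieli)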


19 of 45

Performing multiple testing correction in Python
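A minimal sketch using multipletests from the statsmodels package, on the same example p-values:

    from statsmodels.stats.multitest import multipletests

    pvals = [0.02, 0.03, 0.035, 0.006, 0.055, 0.047, 0.01, 0.04, 0.015, 0.025]

    # other method options include 'bonferroni' and 'fdr_by' (Benjamini-Yekutieli)
    reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method='fdr_bh')
    print(reject)  # boolean mask of rejected hypotheses
    print(qvals)   # BH-adjusted p-values (q-values)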


20 of 45

Power

  • We have discussed Type I error, but Type II error is also important in hypothesis testing.
  • The power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true.
  • Power is commonly denoted by 1 − β, where β is the probability of making a Type II error by incorrectly failing to reject the null hypothesis. As β increases, the power of a test decreases (a worked sketch follows below).
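A worked sketch in Python, assuming a two-sided one-sample z-test with a normal approximation; the effect size, sample size, and α below are hypothetical:

    from scipy.stats import norm

    effect, n, alpha = 0.5, 30, 0.05  # effect size (in standard deviations), sample size, level
    z = norm.ppf(1 - alpha / 2)       # critical value of the two-sided test
    shift = effect * n ** 0.5         # mean of the test statistic under H1
    power = norm.sf(z - shift) + norm.cdf(-z - shift)
    print(round(power, 3))            # probability of correctly rejecting H0, about 0.78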


21 of 45

Aggregation instead of multiple testing

  • Larger sample sizes increase power.
  • Multiple testing decreases power.
    • Testing individual isoforms for differential expression substantially increases the number of tests, which results in a loss of power.
  • Instead of resolving tests to individual isoforms, it is possible to aggregate p-values from tests of isoforms at the gene level to determine whether some isoforms may have shifted (without identifying which ones).


22 of 45

Motivating examples

  • Cancellation


23 of 45

Motivating examples

  • Domination


24 of 45

Motivating examples

  • Collapsing


25 of 45

Aggregation then analysis instead of analysis then aggregation

  • Instead of first quantifying gene abundance and then testing for differences…
  • First test for (transcript) differences and then aggregate p-values (meta-analysis).


26 of 45

Šidák aggregation

  • Compute p-values for the transcripts comprising each gene.
    • This step is performed with a differential analysis method suitable for isoforms, e.g. (Pimentel et al. 2017).
  • Let mg denote the minimum of these p-values for gene g.
    • This determines whether some isoform changed significantly.
  • Adjust mg to ug = 1 − (1 − mg)^|g|, where |g| is the number of transcripts in gene g.
    • These adjusted p-values will be uniformly distributed under the null hypothesis (when the transcript p-values are independent).
  • Correct the ug for multiple hypothesis testing, e.g. using the Benjamini-Hochberg multiple testing procedure for controlling FDR (see the sketch below).
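A minimal sketch of the procedure in Python; the gene names and transcript p-values below are hypothetical:

    import numpy as np
    from statsmodels.stats.multitest import multipletests

    # transcript-level p-values grouped by gene (hypothetical values)
    transcript_pvals = {
        "geneA": [0.001, 0.20, 0.76],
        "geneB": [0.04, 0.07],
        "geneC": [0.51, 0.33, 0.90, 0.62],
    }

    genes, u = list(transcript_pvals), []
    for g in genes:
        p = np.asarray(transcript_pvals[g])
        m_g = p.min()                      # minimum transcript p-value for gene g
        u.append(1 - (1 - m_g) ** len(p))  # Sidak adjustment: u_g = 1 - (1 - m_g)^|g|

    # correct the adjusted gene-level p-values for multiple testing (BH)
    reject, qvals, _, _ = multipletests(u, alpha=0.05, method="fdr_bh")
    print(dict(zip(genes, qvals)))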


27 of 45

Šidák aggregation can fail in the collapsing scenario


28 of 45

Fisher and Lancaster’s method

  • Fisher’s method: given p-values p1, ..., pk for k independent tests, the statistic X = −2 Σ log(pi) is chi-squared distributed with 2k degrees of freedom under the null hypothesis.
  • E.g. for k = 2, X = −2 log(p1·p2) is chi-squared distributed with 4 degrees of freedom.
  • The way to think about this is that each p-value is being transformed (by −2 log). Under the null hypothesis each transformed p-value is chi-squared distributed with 2 degrees of freedom, so their sum is chi-squared distributed with 2k degrees of freedom (a sketch of Fisher’s method follows below).
  • Lancaster’s method “weights” the p-values by adjusting the inverse transform so that uniform p-values are mapped to a chi-squared distribution whose degrees of freedom are chosen according to the weights.
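A minimal sketch of Fisher’s method in Python using scipy; the p-values are hypothetical, and Lancaster’s method is not in scipy (it would need to be implemented separately):

    from scipy import stats

    pvals = [0.01, 0.20, 0.45]  # hypothetical p-values for one gene's transcripts

    # Fisher's method: compares -2 * sum(log p_i) to chi-squared with 2k df
    stat, p_combined = stats.combine_pvalues(pvals, method="fisher")
    print(stat, p_combined)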


29 of 45

Chi-squared distribution


30 of 45

Chi-squared goodness of fit test

  • The Pearson chi-squared test is used to test goodness of fit to a distribution. Suppose counts O1, ..., On have been obtained in n categories, and the goal is to test whether these observations match a null hypothesis of expected counts E1, ..., En. Then the test statistic

      X² = Σi (Oi − Ei)² / Ei

    will be approximately chi-squared distributed with n − 1 degrees of freedom (a sketch follows below).
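A minimal sketch in Python using scipy; the observed and expected counts are hypothetical:

    from scipy.stats import chisquare

    observed = [18, 25, 30, 27]   # counts in n = 4 categories
    expected = [25, 25, 25, 25]   # expected counts under the null hypothesis

    stat, pvalue = chisquare(f_obs=observed, f_exp=expected)
    print(stat, pvalue)  # statistic is compared to chi-squared with n - 1 = 3 df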


31 of 45

Lancaster aggregation works well to aggregate isoform results to the gene-level


32 of 45

Summary

Selective inference: p-values must be computed with care when the hypotheses being tested were themselves chosen using the data; valid p-values must account for (e.g. condition on) that selection.

Multiple testing: when considering corrections for multiple testing a choice must be made for what is being optimized (e.g. FWER vs. FDR). This choice dictates the correction procedure.

Type I and Type II error: controlling for Type I error is important but the power of a test, i.e. the extent of Type II error, must also be taken into consideration.

Aggregation: Experimental design must consider sample size, the number of tests to be performed, and the resolution of results desired. The aggregation of p-values, as done in meta-analysis, can be a useful way to control resolution.


33 of 45

Choosing a model

  • One issue we have not discussed yet is model selection. This refers to the problem of choosing the type of model to use for a problem, and even within a class of models, the number of parameters to use.
  • There are many approaches to model selection. A common one is to compute the Akaike Information Criterion (AIC):

      AIC = 2k − 2 ln(L),

    where k is the number of parameters in the model and L is the value of the likelihood function. Models with lower AIC are preferred (a sketch follows below).
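A minimal sketch in Python, using hypothetical simulated count data to compare two models by AIC:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = rng.poisson(4.0, size=200)  # hypothetical count data

    # Poisson model: k = 1 parameter (the mean), fit by maximum likelihood
    lam = data.mean()
    aic_poisson = 2 * 1 - 2 * stats.poisson.logpmf(data, lam).sum()

    # Normal model: k = 2 parameters (mean and standard deviation)
    mu, sigma = data.mean(), data.std()
    aic_normal = 2 * 2 - 2 * stats.norm.logpdf(data, mu, sigma).sum()

    print(aic_poisson, aic_normal)  # the model with the lower AIC is preferred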


34 of 45

Twenty questions XIX

  • No time to get to question XX… Just go with what you have….


35 of 45

T̶w̶e̶n̶t̶y̶ ̶q̶u̶e̶s̶t̶i̶o̶n̶s̶ XX

  • You’ve found marker genes for your cell types. Did you scale your cells for read depth prior to this step?
    • Did you scale with respect to read depth prior to stabilizing variance (e.g. computing log1p)?
    • Did you scale with respect to read depth after stabilizing variance (e.g. computing log1p)?
    • Did you forget to stabilize variance prior to computing PCA?
      • Did you do a PCA step?


36 of 45

Forty two questions XXI

  • You ran six methods to find marker genes.
    • Do you make a Venn diagram of the genes that were significant and use genes identified by all methods?
    • Do you make a Venn diagram of the genes that were significant and use genes identified by any method?


37 of 45

Forty two questions XXII

  • You settled on a list of marker genes after correcting p-values using the Benjamini-Hochberg method.
    • Can you consider any gene on your list to be a marker gene for sure?
    • Are genes at the top of the list more likely to be real marker genes?


38 of 45

Forty two questions XXIII

  • You compare your list of marker genes for the cell types you are annotating to a list from a prior paper.
    • Do you expect the two lists to be the same?
    • How similar should the lists be for you to believe you have corroborated / repeated their analysis?


39 of 45

Forty two questions XXIV

  • You have performed a p-value correction with the Benjamini-Hochberg method, but a colleague tells you genes are not “independent”.
    • Should you redo your analysis with the Benjamini-Yekutieli correction?
    • Leave your analysis as is?
    • Does it matter?


40 of 45

Forty two questions XXV

  • You examine your top marker gene and the p-value is p = 10⁻²¹³. You check on Google and find that there are at most 10⁸² atoms in the universe. You realize that the p-value for your gene is much smaller than the probability of pulling a single specific atom in the universe out of a hat.
    • Should this worry you?
    • No problem.


41 of 45

Forty two questions XXVI

  • You have a lot of p-values on your list that are right under the 0.05 threshold.
    • Should you include all those genes?
    • Worry about your admittedly arbitrary 0.05 threshold?


42 of 45

Forty two questions XXVII

  • You are finally “happy” with your marker gene list but a colleague now wants a list of isoform-specific markers. You have SMART-seq2 data but your first analysis was of the 10x Genomics 3’-end data. What do you do?
    • Rerun the analysis using the SMART-seq2 data, just replacing gene quantifications with isoform quantifications?
    • Tell your colleague isoforms are irrelevant to biology.


43 of 45

Forty two questions XXVIII

  • You decide to perform an isoform analysis with the SMART-seq2 data but you realize you need to annotate the cells with respect to the cell types you found in your 10x Genomics data, or not.
    • Do you cluster the SMART-seq2 data separately, identify marker genes, and then try to associate SMART-seq2 based clusters with the 10x clusters?
    • Try to assign SMART-seq2 cells cluster labels according to the 10x Genomics profiles?


44 of 45

Forty two questions XXIX

  • You cluster the SMART-seq2 data separately, and run a differential expression analysis to find marker genes. When performing the multiple testing correction, do you…
    • Use the number of genes?
    • Use twice the number of genes because you already did an analysis with the 10x Genomics data?
      • And do you go back and double the number of tests performed for the 10x Genomics data as well?
        • You start wondering whether maybe you should adjust by all the tests you have ever performed in your life…
          • Should you?
            • You start thinking you publish just once…
              • Should you?


45 of 45

Additional References

  • Shaffer, 1995: Multiple hypothesis testing.
  • Leek and Storey, 2008: A general framework for multiple testing dependence.
  • Squair et al., 2021: Confronting false discoveries in single-cell differential expression.
  • The mathematical foundations for model selection (and information theory in statistics).
