1 of 88

Navigating the garden of forking paths -- multiple testing and p-hacking and pre-registration (oh my!)

2 of 88

3 of 88

jtleek.com/talks

@jtleek

@simplystats

4 of 88

The optimality paradox

5 of 88

...Rates of convergence of the proposed penalized likelihood estimators are established. Furthermore, with proper choice of regularization parameters, we show that the proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known.

Source: http://orfe.princeton.edu/~jqfan/papers/01/penlike.pdf

6 of 88

...Rates of convergence of the proposed penalized likelihood estimators are established. Furthermore, with proper choice of regularization parameters, we show that the proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known.

Source: http://orfe.princeton.edu/~jqfan/papers/01/penlike.pdf

7 of 88

https://www.theguardian.com/technology/2017/dec/07/alphazero-google-deepmind-ai-beats-champion-program-teaching-itself-to-play-four-hours

8 of 88

https://www.healthcaredive.com/news/google-rolls-out-ai-tool-for-understanding-genomes/

9 of 88

10 of 88

https://www.forbes.com/sites/stevensalzberg/2017/12/11/no-googles-new-ai-cant-build-your-genome-sequence/

11 of 88

12 of 88

13 of 88

14 of 88

Untangling the paradox

15 of 88

16 of 88

“In the first stage, single lag and distributed lag overdispersed Poisson regression models were used for estimating county-specific RRs of hospital admissions associated with ambient levels of PM2.5.”

Source: Dominici, Peng et al. 2006 JAMA

17 of 88

log(E[Hospital Admissions | X]) = PM2.5 + log(# at risk) + factor(day of week) + ns(day of year, 8) + ns(temperature, 6) + ns(dew point temperature, 3) + ns(lagged temperature, 6) + ns(lagged dew point temperature, 3)
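A minimal R sketch of a single-lag model of this form, fit as a quasi-Poisson regression to allow overdispersion and with log(# at risk) as an offset. The data frame and column names (county_data, pm25, dow, doy, temp, dewpt, temp_lag, dewpt_lag, n_at_risk) are placeholders for illustration, not the study's actual variable names:

library(splines)
# log(E[admissions]) = PM2.5 term + offset + day-of-week + smooth terms for season and weather
fit <- glm(
  admissions ~ pm25 + factor(dow) +
    ns(doy, 8) + ns(temp, 6) + ns(dewpt, 3) +
    ns(temp_lag, 6) + ns(dewpt_lag, 3) +
    offset(log(n_at_risk)),
  family = quasipoisson(link = "log"),
  data = county_data
)
coef(summary(fit))["pm25", ]  # county-specific log relative risk per unit PM2.5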

18 of 88

19 of 88

Eight outcomes were considered based on the ICD-9 codes for 5 cardiovascular outcomes (heart failure [428], heart rhythm disturbances [426-427], cerebrovascular events [430-438], ischemic heart disease [410-414, 429], peripheral vascular disease [440-448]), 2 respiratory outcomes (chronic obstructive pulmonary disease [COPD; 490-492], respiratory tract infections [464-466, 480-487]), and hospitalizations caused by injuries and other external causes (800-849). The county-wide daily hospitalization rates for each outcome for 1999-2002 appear in Table 1.

The study population includes 11.5 million Medicare enrollees residing an average of 5.9 miles from a PM2.5 monitor. The analysis was restricted to the 204 US counties with populations larger than 200 000. Of these 204 counties, 90 had daily PM2.5 data across the study period and the remaining counties had PM2.5 data collected once every 3 days for at least 1 full year. The locations of the 204 counties appear in Figure 1. The counties were clustered into 7 geographic regions by applying the K-means clustering algorithm to longitude and latitude for the counties.10,11

20 of 88

Eight outcomes were considered based on the ICD-9 codes for 5 cardiovascular outcomes (heart failure [428], heart rhythm disturbances [426-427], cerebrovascular events [430-438], ischemic heart disease [410-414, 429], peripheral vascular disease [440-448]), 2 respiratory outcomes (chronic obstructive pulmonary disease [COPD; 490-492], respiratory tract infections [464-466, 480-487]), and hospitalizations caused by injuries and other external causes (800-849). The county-wide daily hospitalization rates for each outcome for 1999-2002 appear in Table 1.

The study population includes 11.5 million Medicare enrollees residing an average of 5.9 miles from a PM2.5 monitor. The analysis was restricted to the 204 US counties with populations larger than 200 000. Of these 204 counties, 90 had daily PM2.5 data across the study period and the remaining counties had PM2.5 data collected once every 3 days for at least 1 full year. The locations of the 204 counties appear in Figure 1. The counties were clustered into 7 geographic regions by applying the K-means clustering algorithm to longitude and latitude for the counties.10,11
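The regional clustering step described above is easy to sketch in R. A minimal example, assuming a hypothetical data frame counties with longitude and latitude columns (names are illustrative only, not the study's data):

set.seed(1)
# Cluster the counties into 7 geographic regions using K-means on longitude and latitude
region_fit <- kmeans(counties[, c("longitude", "latitude")], centers = 7)
counties$region <- factor(region_fit$cluster)
table(counties$region)  # number of counties per region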

21 of 88

[Diagram adapted from Hadley Wickham: the data analysis workflow, from design and collect through import, tidy, transform, visualize, and model, to communicate and publish.]

22 of 88

[The same workflow diagram (adapted from Hadley Wickham), annotated with the tools used at each stage: amazon s3, googlesheets, dplyr, Base R, gams, twitter, google slides.]

23 of 88

[The same workflow diagram (adapted from Hadley Wickham), with a second set of annotations on how those choices actually get made: spreadsheets :(, david robinson told me, bad life choices, my advisor likes them, gifs, chris volinsky told me, chromebook.]

24 of 88

https://www.nature.com/articles/d41586-017-07522-z

“We need to appreciate that data analysis is not purely computational and algorithmic — it is a human behaviour.”

25 of 88

26 of 88

27 of 88

28 of 88

29 of 88

“...Coming up with nothing is unacceptable…”

30 of 88

Source: http://bmjopen.bmj.com/content/7/11/e018491

31 of 88

32 of 88

33 of 88

34 of 88

35 of 88

36 of 88

37 of 88

38 of 88

39 of 88

Hi Rafa

I read your Jelly bean study and thought it was really interesting! We want to build on your result and finally take down the evil Jelly Bean lobby for good. I had my student collect some more green jelly bean and acne data to really drive the point home.

But we saw something kind of weird, can you take a look?

Jeff

40 of 88

41 of 88

Could be just random

42 of 88

Could be acne makes you like green

43 of 88

44 of 88

45 of 88

46 of 88

  • Measure 20 kinds of jelly beans

  • Calculate 20 p-values

  • Call jelly beans “significant” if p-value < 0.05

  • Expected Number of False Positives:

????

Multiple testing

47 of 88

48 of 88

By chance alone, 5% of p-values are less than 0.05

49 of 88

By chance alone, 20% of p-values are less than 0.2

50 of 88

  • Measure 20 kinds of jelly beans

  • Calculate 20 p-values

  • Call jelly beans “significant” if p-value < 0.05

  • Expected Number of False Positives:

20 × 0.05 = 1

Multiple testing
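A quick simulation of that arithmetic, as a sanity check. All 20 kinds are assumed truly null here, so their p-values are uniform; the numbers are illustrative only:

set.seed(2018)
n_sims <- 10000
# In each simulated experiment, count how many of 20 null p-values fall below 0.05
false_positives <- replicate(n_sims, sum(runif(20) < 0.05))
mean(false_positives)       # close to 20 * 0.05 = 1 expected false positive
mean(false_positives >= 1)  # chance of at least one false positive, about 1 - 0.95^20 = 0.64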

51 of 88

Family-wise error rate (FWER): Pr(# false positives ≥ 1)

False discovery rate (FDR): E[# false positives / # total discoveries]

Expected false positives (EFP, e-values): E[# false positives]

@leekgroup

52 of 88

20 out of 1000 jelly bean colors are significant at the “0.05 level”

P-value < 0.05: expect 0.05 × 1000 = 50 false positives

False discovery rate < 0.05: expect 0.05 × 20 = 1 false positive

Family-wise error rate < 0.05: the probability of at least 1 false positive is < 0.05

@leekgroup
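To make the three criteria concrete, here is a hedged simulation of a scenario like the one above. The split into 980 truly null colors and 20 colors with a real effect is an assumption chosen for illustration, not a claim about any real study:

set.seed(538)
m <- 1000
truth <- rep(c(FALSE, TRUE), c(980, 20))    # which colors truly have an effect (assumed)
z <- rnorm(m, mean = ifelse(truth, 4, 0))   # test statistics
p <- 2 * pnorm(-abs(z))                     # two-sided p-values
raw  <- p < 0.05
bh   <- p.adjust(p, method = "BH") < 0.05
bonf <- p.adjust(p, method = "bonferroni") < 0.05
# discoveries and false positives under each criterion
sapply(list(raw = raw, BH = bh, Bonferroni = bonf),
       function(sig) c(discoveries = sum(sig), false_positives = sum(sig & !truth)))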

53 of 88

Bonferroni Correction

P-values less than α/m are significant

Benjamini-Hochberg Correction

Order the p-values: p(1) ≤ … ≤ p(m)

Find the largest i with p(i) ≤ α × i/m; then p(1), …, p(i) are significant
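The Benjamini-Hochberg rule is short enough to write out directly. A sketch with made-up p-values, checked against base R's p.adjust():

alpha <- 0.05
p <- c(0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216)  # illustrative only
m <- length(p)
p_sorted <- sort(p)
below <- which(p_sorted <= alpha * seq_len(m) / m)
bh_cutoff <- if (length(below) > 0) p_sorted[max(below)] else 0  # largest p(i) meeting the rule
p <= bh_cutoff                       # significant after BH
p.adjust(p, method = "BH") <= alpha  # agrees with the hand-rolled rule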

54 of 88

[Figure: results called significant under No Correction, BH (FDR), and Bonferroni (FWER).]

55 of 88

56 of 88

https://simplystatistics.org/2015/08/20/if-you-ask-different-quetions-you-get-different-asnwers-one-more-way-science-isnt-broken-it-is-just-really-hard/

57 of 88

Terms

Multiple testing

Considering multiple hypotheses. Often refers to a fixed set of hypotheses.

Garden of forking paths

Considering many analysis decisions, often without quantifying how many

P-hacking

Considering many analysis decisions, with a metric in mind (p-value)

P-fraud

Considering many analysis decisions, intentionally minimizing a metric (p-value)

58 of 88

59 of 88

Eight outcomes were considered based on the ICD-9 codes for 5 cardiovascular outcomes (heart failure [428], heart rhythm disturbances [426-427], cerebrovascular events [430-438], ischemic heart disease [410-414, 429], peripheral vascular disease [440-448]), 2 respiratory outcomes (chronic obstructive pulmonary disease [COPD; 490-492], respiratory tract infections [464-466, 480-487]), and hospitalizations caused by injuries and other external causes (800-849). The county-wide daily hospitalization rates for each outcome for 1999-2002 appear in Table 1.

The study population includes 11.5 million Medicare enrollees residing an average of 5.9 miles from a PM2.5 monitor. The analysis was restricted to the 204 US counties with populations larger than 200 000. Of these 204 counties, 90 had daily PM2.5 data across the study period and the remaining counties had PM2.5 data collected once every 3 days for at least 1 full year. The locations of the 204 counties appear in Figure 1. The counties were clustered into 7 geographic regions by applying the K-means clustering algorithm to longitude and latitude for the counties.10,11

60 of 88

  • Measure 20 colors of jelly beans, 30 types of students, 500 types of acne medicine….

  • Calculate ?? p-values

  • Call a comparison “significant” if its p-value < 0.05

  • Expected Number of False Positives:

Garden of forking paths

61 of 88

  • Measure 20 colors of jelly beans, 30 types of students, 500 types of acne medicine….

  • Calculate ?? p-values

  • Call a comparison “significant” if its p-value < 0.05

  • Expected Number of False Positives:

?? × 0.05 = ???!?!

Garden of forking paths
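Taking the slide's hypothetical combinations literally shows how fast the implicit number of tests grows. A back-of-the-envelope sketch (the counts are the slide's illustrative numbers, not a real study):

n_tests <- 20 * 30 * 500   # colors x student types x acne medicines
n_tests                    # 300,000 implicit comparisons
n_tests * 0.05             # roughly 15,000 expected false positives at p < 0.05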

62 of 88

63 of 88

https://explorablemultiverse.github.io/examples/frequentist/

64 of 88

Terms

Multiple testing

Considering multiple hypotheses. Often refers to a fixed set of hypotheses.

Garden of forking paths

Considering many analysis decisions, often without quantifying how many

P-hacking

Considering many analysis decisions, with a metric in mind (p-value)

P-fraud

Considering many analysis decisions, intentionally minimizing a metric (p-value)

65 of 88

66 of 88

http://fivethirtyeight.com/features/science-isnt-broken/

67 of 88

68 of 88

When a measure becomes a target, it ceases to be a good measure.

“Goodhart’s Law”

69 of 88

70 of 88

http://phackathon.netlify.com/

D’Agostino McGowan et al. (in prep)

71 of 88

# Setup
str(college)
library(ggplot2)
library(dplyr)

# Exploratory data analysis
g1 <- ggplot(college, aes(x = major_category, y = median, fill = major_category))
g1 <- g1 + geom_boxplot()
g1 <- g1 + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")
g1
tapply(college$sample_size, college$major_category, sum)
majormean <- tapply(college$sample_size, college$major_category, mean)
max(majormean)

# Munging
college$major_category <- as.factor(college$major_category)
college2 <- college
college2$major_category <- relevel(college$major_category, ref = "Business")

# Modeling
fit1 <- lm(median ~ major_category, data = college2)
fit2 <- update(fit1, median ~ major_category + perc_men)
fit3 <- update(fit1, median ~ major_category + perc_men + perc_employed)
fit4 <- update(fit1, median ~ major_category + perc_men + perc_employed + perc_employed_fulltime_yearround)
fit5 <- update(fit1, median ~ major_category + perc_men + perc_employed + perc_employed_fulltime_yearround + perc_college_jobs)

# Evaluation
anova(fit1, fit2, fit3, fit4)
summary(fit5)
tail(hatvalues(fit5)[order(hatvalues(fit5))], 5)
tail(cooks.distance(fit5)[order(cooks.distance(fit5))], 5)

D’Agostino McGowan et al. (in prep)

72 of 88

(Same analysis script as slide 71, with one stage highlighted:)

Setup

73 of 88

(Same analysis script as slide 71, with one stage highlighted:)

EDA

74 of 88

(Same analysis script as slide 71, with one stage highlighted:)

Munging

75 of 88

(Same analysis script as slide 71, with one stage highlighted:)

Modeling

76 of 88

(Same analysis script as slide 71, with one stage highlighted:)

Evaluation

77 of 88

(Same analysis script as slide 71, annotated by stage:)

Setup

Exploratory

Munging

Modeling

Evaluation

78 of 88

D’Agostino McGowan et al. (in prep)

79 of 88

Terms

Multiple testing

Considering multiple hypotheses. Often refers to a fixed set of hypotheses.

Garden of forking paths

Considering many analysis decisions, often without quantifying how many

P-hacking

Considering many analysis decisions, with a metric in mind (p-value)

P-fraud

Considering many analysis decisions, intentionally minimizing a metric (p-value)

80 of 88

81 of 88

Terms

Multiple testing

Considering multiple hypotheses. Often refers to a fixed set of hypotheses.

Garden of forking paths

Considering many analysis decisions, often without quantifying how many

P-hacking

Considering many analysis decisions, with a metric in mind (p-value)

P-mining

Considering many analysis decisions, intentionally minimizing a metric (p-value)

82 of 88

83 of 88

Transparency

https://aspredicted.org/

84 of 88

Pre-registration

https://aspredicted.org/

85 of 88

Always compare to the baseline

https://simplystatistics.org/2017/05/24/toward-tidy-analysis/

86 of 88

Understand there might be nothing

“The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.”

John Tukey

87 of 88

Respect your data analytic colleagues

Source: http://bmjopen.bmj.com/content/7/11/e018491

88 of 88

jtleek.com/talks

@jtleek

@simplystats