Navigating the garden of forking paths: multiple testing, p-hacking, and pre-registration (oh my!)
jtleek.com/talks
@jtleek
@simplystats
The optimality paradox
...Rates of convergence of the proposed penalized likelihood estimators are established. Furthermore, with proper choice of regularization parameters, we show that the proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known.
Source: http://orfe.princeton.edu/~jqfan/papers/01/penlike.pdf
https://www.theguardian.com/technology/2017/dec/07/alphazero-google-deepmind-ai-beats-champion-program-teaching-itself-to-play-four-hours
https://www.healthcaredive.com/news/google-rolls-out-ai-tool-for-understanding-genomes/
https://www.forbes.com/sites/stevensalzberg/2017/12/11/no-googles-new-ai-cant-build-your-genome-sequence/
Untangling the paradox
“In the first stage, single lag and distributed lag overdispersed Poisson regression models were used for estimating county-specific RRs of hospital admissions associated with ambient levels of PM2.5”
Source: Dominici, Peng et al. 2006 JAMA
log(E[Hospital Admissions | X]) = β · PM2.5
  + log(# at risk) + factor(day of week) + ns(day of year, 8)
  + ns(temperature, 6) + ns(dew point temperature, 3)
  + ns(lagged temperature, 6) + ns(lagged dew point temperature, 3)
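The paper's actual analysis is a two-stage, multi-county, overdispersed model; as a back-of-the-envelope illustration of just the core piece, here is a toy plain-Poisson log-linear fit with one covariate (PM2.5) and a log(# at risk) offset, estimated by Newton's method on simulated data. Every number and name below is invented for illustration; this is not the authors' model.

```python
import math
import random

# Simulate fake daily data (all numbers invented)
random.seed(42)
n = 5000
b0_true, b1_true = -7.0, 0.03

def rpois(mu):
    """Knuth's Poisson sampler (fine for small mu)."""
    L, k, p = math.exp(-mu), 0, 1.0
    while p > L:
        k += 1
        p *= random.random()
    return k - 1

pm25 = [random.gauss(10.0, 3.0) for _ in range(n)]            # ambient PM2.5
offset = [math.log(random.randint(9000, 11000)) for _ in range(n)]  # log(# at risk)
y = [rpois(math.exp(b0_true + b1_true * x + o)) for x, o in zip(pm25, offset)]

# Fit log(E[admissions]) = b0 + b1 * PM2.5 + log(# at risk) by Newton's method
b0 = math.log(sum(y) / n) - sum(offset) / n  # moment-matched starting value
b1 = 0.0
for _ in range(25):
    mu = [math.exp(b0 + b1 * x + o) for x, o in zip(pm25, offset)]
    g0 = sum(yi - mi for yi, mi in zip(y, mu))                      # score, intercept
    g1 = sum((yi - mi) * x for yi, mi, x in zip(y, mu, pm25))       # score, slope
    h00 = sum(mu)                                                   # Fisher information
    h01 = sum(m * x for m, x in zip(mu, pm25))
    h11 = sum(m * x * x for m, x in zip(mu, pm25))
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det                               # 2x2 Newton step
    b1 += (h00 * g1 - h01 * g0) / det

print(b0, b1)  # should land near the true (-7.0, 0.03)
```

The splines, lag structure, and overdispersion in the real model add many analyst degrees of freedom on top of this skeleton, which is exactly the point of what follows.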
Eight outcomes were considered based on the ICD-9 codes for 5 cardiovascular outcomes (heart failure [428], heart rhythm disturbances [426-427], cerebrovascular events [430-438], ischemic heart disease [410-414, 429], peripheral vascular disease [440-448]), 2 respiratory outcomes (chronic obstructive pulmonary disease [COPD; 490-492], respiratory tract infections [464-466, 480-487]), and hospitalizations caused by injuries and other external causes (800-849). The county-wide daily hospitalization rates for each outcome for 1999-2002 appear in Table 1.

The study population includes 11.5 million Medicare enrollees residing an average of 5.9 miles from a PM2.5 monitor. The analysis was restricted to the 204 US counties with populations larger than 200 000. Of these 204 counties, 90 had daily PM2.5 data across the study period and the remaining counties had PM2.5 data collected once every 3 days for at least 1 full year. The locations of the 204 counties appear in Figure 1. The counties were clustered into 7 geographic regions by applying the K-means clustering algorithm to longitude and latitude for the counties.10,11
Import
Tidy
Transform
Visualize
Model
Publish
Adapted from Hadley Wickham
Communicate
Design and collect
googlesheets
dplyr
Base R
gams
google slides
amazon s3
spreadsheets :(
david robinson told me
bad life choices
my advisor likes them
gifs
chris volinsky told me
chromebook
https://www.nature.com/articles/d41586-017-07522-z
“We need to appreciate that data analysis is not purely computational and algorithmic — it is a human behaviour.”
“...Coming up with nothing is unacceptable…”
Source: http://bmjopen.bmj.com/content/7/11/e018491
Hi Rafa
I read your Jelly bean study and thought it was really interesting! We want to build on your result and finally take down the evil Jelly Bean lobby for good. I had my student collect some more green jelly bean and acne data to really drive the point home.
But we saw something kind of weird, can you take a look?
Jeff
Could be just random
Could be acne makes you like green
????
Multiple testing
Under the null hypothesis, 5% of p-values fall below 0.05 and 20% fall below 0.2.
So with 20 independent tests at the 0.05 level: 20 × 0.05 = 1 expected false positive
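A quick simulation of the slide's arithmetic: run 20 tests on pure noise (where each p-value is Uniform(0, 1)) many times, and count how many "significant" results show up on average.

```python
import random

random.seed(1)
n_tests, alpha, n_reps = 20, 0.05, 10_000

false_pos_counts = []
for _ in range(n_reps):
    # under the null, each p-value is Uniform(0, 1)
    pvals = [random.random() for _ in range(n_tests)]
    false_pos_counts.append(sum(p < alpha for p in pvals))

avg = sum(false_pos_counts) / n_reps
print(avg)  # close to 20 * 0.05 = 1 false positive per experiment
```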
Multiple testing
Family wise error rate:
Pr(# False Positives ≥ 1)
False discovery rate:
E[# False Positives / # Total Discoveries]
@leekgroup
20 out of 1000 jelly bean colors are significant at the “0.05 level”
P-value < 0.05
Expect 0.05*1000 = 50 false positives
False Discovery Rate < 0.05
Expect 0.05*20 = 1 false positive
Family Wise Error Rate < 0.05
The probability of at least 1 false positive < 0.05
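The slide's arithmetic, spelled out with the slide's own numbers:

```python
m = 1000          # jelly bean colors tested
discoveries = 20  # tests with p < 0.05

# No correction: every null test has a 5% chance of a false positive,
# so we expect more false positives than we even have discoveries
expected_fp_uncorrected = 0.05 * m          # 50

# FDR < 0.05: at most 5% of the 20 discoveries are expected to be false
expected_fp_fdr = 0.05 * discoveries        # 1

# FWER < 0.05: Pr(any false positive at all among the 1000 tests) < 0.05

print(expected_fp_uncorrected, expected_fp_fdr)
```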
@leekgroup
Bonferroni Correction
P-values less than α/m are significant
Benjamini-Hochberg Correction
Order the p-values: p(1) ≤ … ≤ p(m)
Find the largest i with p(i) ≤ α × i/m; then p(1), …, p(i) are all significant
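A minimal implementation of both rules (the p-values below are invented for illustration):

```python
def bonferroni(pvals, alpha=0.05):
    """Reject p-values at or below alpha / m (controls the FWER)."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def benjamini_hochberg(pvals, alpha=0.05):
    """Step-up rule: find the largest rank i with p(i) <= alpha * i / m,
    then reject the i smallest p-values (controls the FDR)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            k = rank
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

pvals = [0.001, 0.01, 0.02, 0.03, 0.20, 0.50, 0.70, 0.90]
print(sum(p <= 0.05 for p in pvals))   # 4 "discoveries" with no correction
print(sum(benjamini_hochberg(pvals)))  # 2 rejections under BH
print(sum(bonferroni(pvals)))          # 1 rejection under Bonferroni
```

As the slide's plot suggests, Bonferroni is the strictest and no correction the loosest, with BH in between. (R users get both for free via `p.adjust(p, method = "bonferroni")` and `p.adjust(p, method = "BH")`.)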
[Figure: No Correction vs. BH (FDR) vs. Bonferroni (FWER)]
https://simplystatistics.org/2015/08/20/if-you-ask-different-quetions-you-get-different-asnwers-one-more-way-science-isnt-broken-it-is-just-really-hard/
Terms
Multiple testing
Considering multiple hypotheses. Often refers to a fixed set of hypotheses.
Garden of forking paths
Considering many analysis decisions, often without quantifying how many
P-hacking
Considering many analysis decisions, with a metric in mind (p-value)
P-fraud
Considering many analysis decisions, intentionally minimizing a metric (p-value)
Garden of forking paths
?? x 0.05 = ???!?!
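If each fork in the analysis amounts to an independent shot at p &lt; 0.05, the chance of at least one "significant" result is 1 − (1 − α)^d, and in a garden of forking paths nobody knows what d is. An idealized back-of-the-envelope, assuming independent forks:

```python
def prob_any_false_positive(d, alpha=0.05):
    """Chance of at least one p < alpha across d independent null analyses."""
    return 1 - (1 - alpha) ** d

# the false positive rate climbs fast with the number of forks
for d in (1, 5, 10, 20):
    print(d, round(prob_any_false_positive(d), 3))
```

With only 10 forks the chance of a spurious "discovery" already exceeds 40%; real analysis decisions are correlated, but the qualitative picture is the same.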
Garden of forking paths
https://explorablemultiverse.github.io/examples/frequentist/
http://fivethirtyeight.com/features/science-isnt-broken/
When a measure becomes a target, it ceases to be a good measure.
“Goodhart’s Law”
http://phackathon.netlify.com/
D’Agostino McGowan et al. (in prep)
## Setup
str(college)
library(ggplot2)
library(dplyr)
## EDA
g1 <- ggplot(college, aes(x = major_category, y = median, fill = major_category))
g1 <- g1 + geom_boxplot()
g1 <- g1 + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")
g1
tapply(college$sample_size, college$major_category, sum)
majormean <- tapply(college$sample_size, college$major_category, mean)
max(majormean)
## Munging
college$major_category <- as.factor(college$major_category)
college2 <- college
college2$major_category <- relevel(college$major_category, ref = "Business")
## Modeling
fit1 <- lm(median ~ major_category, data = college2)
fit2 <- update(fit1, median ~ major_category + perc_men)
fit3 <- update(fit1, median ~ major_category + perc_men + perc_employed)
fit4 <- update(fit1, median ~ major_category + perc_men + perc_employed + perc_employed_fulltime_yearround)
fit5 <- update(fit1, median ~ major_category + perc_men + perc_employed + perc_employed_fulltime_yearround + perc_college_jobs)
## Evaluation
anova(fit1, fit2, fit3, fit4, fit5)
summary(fit5)
tail(sort(hatvalues(fit5)), 5)       # 5 highest-leverage observations
tail(sort(cooks.distance(fit5)), 5)  # 5 most influential observations
Setup
Exploratory
Munging
Modeling
Evaluation
Transparency
Pre-registration
https://aspredicted.org/
Always compare to the baseline
https://simplystatistics.org/2017/05/24/toward-tidy-analysis/
Understand there might be nothing
John Tukey
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
Respect your data analytic colleagues
Source: http://bmjopen.bmj.com/content/7/11/e018491
jtleek.com/talks
@jtleek
@simplystats