Human-data interaction
(things we don’t know)
@jtleek
jtleek.com/talks
The optimality paradox
...Rates of convergence of the proposed penalized likelihood estimators are established. Furthermore, with proper choice of regularization parameters, we show that the proposed estimators perform as well as the oracle procedure in variable selection; namely, they work as well as if the correct submodel were known.
Source: http://orfe.princeton.edu/~jqfan/papers/01/penlike.pdf
https://www.theguardian.com/technology/2017/dec/07/alphazero-google-deepmind-ai-beats-champion-program-teaching-itself-to-play-four-hours
https://www.healthcaredive.com/news/google-rolls-out-ai-tool-for-understanding-genomes/
https://www.forbes.com/sites/stevensalzberg/2017/12/11/no-googles-new-ai-cant-build-your-genome-sequence/
A hypothesis
Let’s make an example of someone at random
“In the first stage, single lag and distributed lag overdispersed Poisson regression models were used for estimating county-specific RRs of hospital admissions associated with ambient levels of PM2.5”
Source: Dominici, Peng et al. 2006 JAMA
log(E[Hospital Admissions | X]) = PM2.5
  + log(# at risk) + factor(day of week) + ns(day of year, 8)
  + ns(temperature, 6) + ns(dew point temperature, 3)
  + ns(lagged temperature, 6) + ns(lagged dew point temperature, 3)
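A minimal R sketch of this first-stage model, assuming hypothetical column names (admissions, pm25, n_at_risk, dow, doy, temp, dew_point and their lags) and a quasi-Poisson family for the overdispersion:

# Sketch only: column names are hypothetical, not from the JAMA paper
library(splines)
fit <- glm(
  admissions ~ pm25 + factor(dow) + ns(doy, 8) +
    ns(temp, 6) + ns(dew_point, 3) +
    ns(temp_lag, 6) + ns(dew_point_lag, 3) +
    offset(log(n_at_risk)),
  family = quasipoisson(link = "log"),  # overdispersed Poisson
  data = county_data
)
exp(coef(fit)["pm25"])  # county-specific relative risk per unit PM2.5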
Eight outcomes were considered based on the ICD-9 codes for 5 cardiovascular outcomes (heart failure [428], heart rhythm disturbances [426-427], cerebrovascular events [430-438], ischemic heart disease [410-414, 429], peripheral vascular disease [440-448]), 2 respiratory outcomes (chronic obstructive pulmonary disease [COPD; 490-492], respiratory tract infections [464-466, 480-487]), and hospitalizations caused by injuries and other external causes (800-849). The county-wide daily hospitalization rates for each outcome for 1999-2002 appear in Table 1.

The study population includes 11.5 million Medicare enrollees residing an average of 5.9 miles from a PM2.5 monitor. The analysis was restricted to the 204 US counties with populations larger than 200 000. Of these 204 counties, 90 had daily PM2.5 data across the study period and the remaining counties had PM2.5 data collected once every 3 days for at least 1 full year. The locations of the 204 counties appear in Figure 1. The counties were clustered into 7 geographic regions by applying the K-means clustering algorithm to longitude and latitude for the counties.10,11
Import
Tidy
Transform
Visualize
Model
Publish
Adapted from Hadley Wickham
Communicate
Design and collect
googlesheets
dplyr
Base R
gams
google slides
amazon s3
Import
Tidy
Transform
Visualize
Model
Publish
Adapted from Hadley Wickham
Communicate
Design and collect
spreadsheets :(
David Robinson told me
bad life choices
my advisor likes them
gifs
Chris Volinsky told me
chromebook
Source: Tversky and Kahneman 1984
Suppose we have a sample with 30 lawyers and 70 engineers.
What is the probability Dick is a lawyer?
Source: Tversky and Kahneman 1984
Dick is a 30-year-old man. He is married with no children. A man of high ability and high motivation, he promises to be quite successful in his field. He is well liked by his colleagues.
What is the probability Dick is a lawyer?
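Made explicit with Bayes' rule: if the description is equally consistent with both professions, the answer should stay at the 30% base rate. A sketch, with the equal likelihoods assumed for illustration:

# Uninformative description: equally likely for lawyers and engineers
prior_lawyer <- 0.30
lik_lawyer <- 0.5
lik_engineer <- 0.5  # assumed: description fits both groups equally
posterior <- (prior_lawyer * lik_lawyer) /
  (prior_lawyer * lik_lawyer + (1 - prior_lawyer) * lik_engineer)
posterior  # 0.30: the description should not move the base rate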
Source: Cleveland and McGill JASA 1984
Human-data interaction
(observational epidemiology)
Import
Tidy
Transform
Visualize
Model
Publish
Adapted from Hadley Wickham
Communicate
Design and collect
Source: https://fivethirtyeight.com/features/science-isnt-broken/
Human-data interaction
(our lab)
Data Science: 10 Courses, 4.4M+ Enrollments
Executive Data Science: 5 Courses, 150K+ Enrollments
Genomic Data Science: 9 Courses, 230K+ Enrollments
MSD in R: 6 Courses, 35K+ Enrollments
Adapted from Aboozar Hadavand
Import
Tidy
Transform
Visualize
Model
Publish
Adapted from Hadley Wickham
Communicate
Design and collect
Adapted from Sara Wang
Thanks Hadley!
# read code as text
library(stringr)
code_txt <- readLines('codefile.r')
# look for library calls; fixed() because parentheses are regex metacharacters
str_detect(code_txt, fixed('library(dplyr)'))
Adapted from Sara Wang
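The same static scan extends to a whole directory of scripts; a sketch, with the directory name and regex assumed for illustration:

# Tally library() calls across all R files in a (hypothetical) directory
library(stringr)
files <- list.files("analysis", pattern = "\\.[rR]$", full.names = TRUE)
pkgs <- unlist(lapply(files, function(f) {
  txt <- readLines(f, warn = FALSE)
  str_match(txt, "library\\(([A-Za-z0-9.]+)\\)")[, 2]
}))
sort(table(pkgs[!is.na(pkgs)]), decreasing = TRUE)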
Source: https://github.com/jhudsl/matahari
library(matahari)
# Start logging your commands
dance_start(value = TRUE)
dat = data.frame(x=rnorm(100),y=rnorm(100))
library(dplyr)
glimpse(dat)
plot(dat)
# Pause logging
dance_stop()
# Get data
code_df = dance_tbl()
code_df %>% select(expr,value,dt) %>% View()
# Plain text
dance_report()
"WAoAAAACAAMFAAACAwAAAAMTAAAABgAAABMAAAAIAAAABgAAAAEABAAJAAAAC3Nlc3Npb25J\nbmZvAAAA/gAAAAYAAAABAAQACQAAAAtkYW5jZV9zdGFydAAABAIAAAABAAQACQAAAAV2YWx1\nZQAAAAoAAAABAAAAAQAAAP4AAAAGAAAAAQAEAAkAAAABPQAAAAIAAAABAAQACQAAAANkYXQA\nAAACAAAABgAAAAEABAAJAAAACmRhdGEuZnJhbWUAAAQCAAAAAQAEAAkAAAABeAAAAAYAAAAB\nAAQACQAAAAVybm9ybQAAAAIAAAAOAAAAAUBZAAAAAAAAAAAA/gAABAIAAAABAAQACQAAAAF5\nAAAABgAACP8AAAACAAAADgAAAAFAWQAAAAAAAAAAAP4AAAD+AAAA/gAAAAYAAAABAAQACQAA\nAAdsaWJyYXJ5AAAAAgAAAAEABAAJAAAABWRwbHlyAAAA/gAAAAYAAAABAAQACQAAAAdnbGlt\ncHNlAAAAAgAABf8AAAD+AAAABgAAAAEABAAJAAAABHBsb3QAAAACAAAF/wAAAP4AAAAGAAAB\n/wAAAP4AAAAGAAAB/wAAAP4AAAATAAAACAAAAxMAAAAKAAACEwAAAA4AAAAQAAAAAQAEAAkA\nAAATeDg2XzY0LXBjLWxpbnV4LWdudQAAABAAAAABAAQACQAAAAZ4ODZfNjQAAAAQAAAAAQAE\nAAkAAAAJbGludXgtZ251AAAAEAAAAAEABAAJAAAAEXg4Nl82NCwgbGludXgtZ251AAAAEAAA\nAAEABAAJAAAAAAAAABAAAAABAAQACQAAAAEzAAAAEAAAAAEABAAJAAAAAzUuMAAAABAAAAAB\nAAQACQAAAAQyMDE4AAAAEAAAAAEABAAJAAAAAjA0AAAAEAAAAAEABAAJAAAAAjIzAAAAEAAA\nAAEABAAJAAAABTc0NjI2AAAAEAAAAAEABAAJAAAAAVIAAAAQ....”
# Decode
rprt <- dance_report()
rprt %>% base64_to_df()
"WAoAAAACAAMFAAACAwAAAAMTAAAABgAAABMAAAAIAAAABgAAAAEABAAJAAAAC3Nlc3Npb25J\nbmZvAAAA/gAAAAYAAAABAAQACQAAAAtkYW5jZV9zdGFydAAABAIAAAABAAQACQAAAAV2YWx1\nZQAAAAoAAAABAAAAAQAAAP4AAAAGAAAAAQAEAAkAAAABPQAAAAIAAAABAAQACQAAAANkYXQA\nAAACAAAABgAAAAEABAAJAAAACmRhdGEuZnJhbWUAAAQCAAAAAQAEAAkAAAABeAAAAAYAAAAB\nAAQACQAAAAVybm9ybQAAAAIAAAAOAAAAAUBZAAAAAAAAAAAA/gAABAIAAAABAAQACQAAAAF5\nAAAABgAACP8AAAACAAAADgAAAAFAWQAAAAAAAAAAAP4AAAD+AAAA/gAAAAYAAAABAAQACQAA\nAAdsaWJyYXJ5AAAAAgAAAAEABAAJAAAABWRwbHlyAAAA/gAAAAYAAAABAAQACQAAAAdnbGlt\ncHNlAAAAAgAABf8AAAD+AAAABgAAAAEABAAJAAAABHBsb3QAAAACAAAF/wAAAP4AAAAGAAAB\n/wAAAP4AAAAGAAAB/wAAAP4AAAATAAAACAAAAxMAAAAKAAACEwAAAA4AAAAQAAAAAQAEAAkA\nAAATeDg2XzY0LXBjLWxpbnV4LWdudQAAABAAAAABAAQACQAAAAZ4ODZfNjQAAAAQAAAAAQAE\nAAkAAAAJbGludXgtZ251AAAAEAAAAAEABAAJAAAAEXg4Nl82NCwgbGludXgtZ251AAAAEAAA\nAAEABAAJAAAAAAAAABAAAAABAAQACQAAAAEzAAAAEAAAAAEABAAJAAAAAzUuMAAAABAAAAAB\nAAQACQAAAAQyMDE4AAAAEAAAAAEABAAJAAAAAjA0AAAAEAAAAAEABAAJAAAAAjIzAAAAEAAA\nAAEABAAJAAAABTc0NjI2AAAAEAAAAAEABAAJAAAAAVIAAAAQ....”
Human-data interaction
(randomized trials)
Import
Tidy
Transform
Visualize
Model
Publish
Adapted from Hadley Wickham
Communicate
Design and collect
“...Coming up with nothing is unacceptable…”
Source: http://bmjopen.bmj.com/content/7/11/e018491
[It has been widely noted that there is a relationship between x1 and Y.] We are interested in studying the magnitude of this relationship and reporting results to the president of the company. Using the data available at [data1/data2], use a linear regression model to investigate this relationship.
Would you be confident telling the company president that there is a meaningful relationship between x1 and Y?
A. Yes
B. No
data1: x2 confounds the relationship between Y and x1
data2: x2 is not a confounder but is predictive of Y
Source: Myint et al (in prep)
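A sketch of how two such datasets could be simulated; the coefficients and sample size are illustrative, not the study's:

set.seed(42)
n <- 500
# data1: x2 confounds the x1-Y relationship
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n)
Y  <- 2 * x2 + rnorm(n)  # Y depends on x1 only through x2
data1 <- data.frame(Y, x1, x2)
# data2: x2 predicts Y but is independent of x1 (not a confounder)
x1 <- rnorm(n)
x2 <- rnorm(n)
Y  <- 1 * x1 + 2 * x2 + rnorm(n)
data2 <- data.frame(Y, x1, x2)
# In data1 the naive x1 effect shrinks toward zero once x2 is adjusted for
coef(lm(Y ~ x1, data = data1))
coef(lm(Y ~ x1 + x2, data = data1))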
Confounded
Primed
68% found signal
(n = 263)
Not Primed
57% found signal
(n = 252)
Source: Myint et al (in prep)
Not confounded
Primed
78% found signal
(n = 263)
Not Primed
87% found signal
(n = 258)
Source: Myint et al (in prep)
[Four-panel figure: Confounded + Not Primed, Confounded + Primed, Not Confounded + Not Primed, Not Confounded + Primed]
Frequency of fitting y ~ x1 + x2 + x3
|            | Confounded | Not Confounded |
| Primed     | 73%        | 70%            |
| Not Primed | 60%        | 62%            |
Human-data interaction
(interventional randomized trials)
Import
Tidy
Transform
Visualize
Model
Publish
Adapted from Hadley Wickham
Communicate
Design and collect
Source: http://journals.sagepub.com/doi/pdf/10.1177/0956797611417632
Your assignment is to study how income varies across different categories of college majors. You will be using data from a study of recent college graduates. Make sure to use good practices that you have learned so far in this course and previous courses in the specialization. [In particular, it is good practice to specify an analysis plan early in the process to avoid the “p-hacking” behavior of trying many analyses to find one that has desired results. If you want to learn more about “p-hacking”, you can visit https://projects.fivethirtyeight.com/p-hacking/.]
To proceed with the analysis, click “Yes”; otherwise, click “No”.
Source: Myint et al (in prep)
|            | Significant | Not significant |
| Warned     | 66          | 90              |
| Not Warned | 59          | 88              |
Source: Myint et al (in prep)
head(college)
fit <- lm(rank ~ major_category - 1, college)
summary(fit)
# Setup
str(college)
library(ggplot2)
library(dplyr)

# EDA
g1 <- ggplot(college, aes(x = major_category, y = median, fill = major_category))
g1 <- g1 + geom_boxplot()
g1 <- g1 + theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none")
g1
tapply(college$sample_size, college$major_category, sum)
majormean <- tapply(college$sample_size, college$major_category, mean)
max(majormean)

# Munging
college$major_category <- as.factor(college$major_category)
college2 <- college
college2$major_category <- relevel(college$major_category, ref = "Business")

# Modeling
fit1 <- lm(median ~ major_category, data = college2)
fit2 <- update(fit1, median ~ major_category + perc_men)
fit3 <- update(fit1, median ~ major_category + perc_men + perc_employed)
fit4 <- update(fit1, median ~ major_category + perc_men + perc_employed + perc_employed_fulltime_yearround)
fit5 <- update(fit1, median ~ major_category + perc_men + perc_employed + perc_employed_fulltime_yearround + perc_college_jobs)
anova(fit1, fit2, fit3, fit4)
summary(fit5)

# Evaluation
tail(hatvalues(fit5)[order(hatvalues(fit5, decreasing = T))], 5)
tail(cooks.distance(fit5)[order(cooks.distance(fit5, decreasing = T))], 5)
Setup
Exploratory
Munging
Modeling
Evaluation
Source: Myint et al (in prep)
Y = X + e
Source: https://simplystatistics.org/2013/07/31/the-researcher-degrees-of-freedom-recipe-tradeoff-in-data-analysis/
Y = X + e_analyst + e_sampling
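A toy simulation of this decomposition, where both the sample and the analyst's modeling choice vary (all numbers illustrative):

set.seed(1)
true_effect <- 2
estimates <- replicate(1000, {
  n <- 100
  x2 <- rnorm(n)
  x1 <- 0.5 * x2 + rnorm(n)
  y  <- true_effect * x1 + 3 * x2 + rnorm(n)   # e_sampling
  adjust <- runif(1) < 0.5                     # e_analyst: adjust for x2 or not
  fit <- if (adjust) lm(y ~ x1 + x2) else lm(y ~ x1)
  coef(fit)["x1"]
})
var(estimates)  # spread reflects sampling AND analyst variability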
Data Science: 10 Courses, 4.4M+ Enrollments
Executive Data Science: 5 Courses, 150K+ Enrollments
Genomic Data Science: 9 Courses, 230K+ Enrollments
MSD in R: 6 Courses, 35K+ Enrollments
Source: https://simplystatistics.org/2015/04/29/data-analysis-subcultures/
Y = X + e_population + e_analyst + e_sampling
The optimality paradox is due to unknowns about human-data interaction
We can now study human-data interaction as its own science
Leek group
Collaborators
Alumni