Let's get meta: analyzing your R code with tidycode
Lucy D’Agostino McGowan
Wake Forest University
lucymcgowan.com/talk
Follow Along!
Joint work with
Jeff Leek
Sean Kross
Thank you
Leek Lab
Johns Hopkins Bloomberg School of Public Health
We want to analyze how analysts are coding
Why?
facilitate data science pedagogy
help with reproducibility / replicability
explore how current software / tools are being used
How?
matahari
tidycode
Static R code
Dynamic R code
matahari
tidy data
tidycode
text::tidytext
code::tidycode
instead of analyzing tokens of text
we are analyzing tokens of code
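To make the analogy concrete, here is a minimal base R sketch of pulling function-call tokens out of code without running it. It is an illustration only, not how tidycode is implemented; the helper `call_names()` and the `tokens` table are names made up for this example.

```r
# Three lines of analysis code, stored as text (never evaluated here)
code <- c(
  "library(tidyverse)",
  "data %>% filter(mpg < 20)",
  "lm(mpg ~ cyl, data = data)"
)

# Parse each line into an unevaluated expression
exprs <- lapply(code, function(line) parse(text = line)[[1]])

# Hypothetical helper: recursively collect the name of every function called
call_names <- function(x) {
  if (!is.call(x)) {
    return(character(0))
  }
  fn_head <- x[[1]]
  own <- if (is.name(fn_head)) as.character(fn_head) else character(0)
  c(own, unlist(lapply(as.list(x), call_names), use.names = FALSE))
}

funcs <- lapply(exprs, call_names)

# One row per function call: the "tokens of code"
tokens <- data.frame(
  line = rep(seq_along(funcs), lengths(funcs)),
  func = unlist(funcs, use.names = FALSE),
  stringsAsFactors = FALSE
)
print(tokens)
```

Each row plays the role that a word plays in a tidy text table: one token per row, with the line it came from.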
stopwords::tidytext
stopfuncs::tidycode
instead of removing stop words
we remove stop functions
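A small base R sketch of the idea, assuming a token table with one function call per row. The `stop_funcs` vector is a hypothetical list chosen for illustration; it is not the list that tidycode ships.

```r
# Function-call tokens from a small script, one call per row
tokens <- data.frame(
  line = c(1L, 2L, 2L, 2L, 3L, 3L),
  func = c("library", "%>%", "filter", "<", "lm", "~"),
  stringsAsFactors = FALSE
)

# Hypothetical stop-function list: calls that say little about the analysis
stop_funcs <- c("library", "%>%", "<", "~", "<-", "c")

# Drop every row whose function is on the stop list (an anti-join by func)
kept <- tokens[!tokens$func %in% stop_funcs, , drop = FALSE]
print(kept)
```

What remains are the calls that carry information about the analysis itself, here `filter` and `lm`.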
sentiment::tidytext
analysis tasks::tidycode
instead of classifying text by sentiment
we are classifying code by analysis tasks
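The classification step can be sketched in base R as a join against a lexicon, in the same way a sentiment lexicon is joined to words. The `lexicon` table below is hypothetical and hand-written for this example; it is not one of the lexicons provided by tidycode.

```r
# Function-call tokens left after removing stop functions
tokens <- data.frame(
  func = c("filter", "lm", "ggplot"),
  stringsAsFactors = FALSE
)

# Hypothetical lexicon mapping functions to analysis tasks
lexicon <- data.frame(
  func = c("filter", "lm", "ggplot", "read_csv"),
  classification = c("data cleaning", "modeling", "visualization", "import"),
  stringsAsFactors = FALSE
)

# Keep only tokens found in the lexicon and attach their task label
classified <- merge(tokens, lexicon, by = "func")

# Count how many calls fall under each analysis task
task_counts <- table(classified$classification)
print(classified)
```

Functions missing from the lexicon drop out of the result, just as words without a sentiment score drop out of an inner join in a text analysis.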
analysis tasks
lexicons
matahari
tidycode
df
expr
library(tidyverse)
data %>% filter(mpg < 20)
lm(mpg ~ cyl, data = data)
tidycode
df
tbl <- df %>%
  unnest_calls(expr)
tidycode
df
classification_tbl <- tbl %>%
  anti_join(get_stopfuncs()) %>%
  inner_join(
    get_classifications("crowdsource", include_duplicates = FALSE)
  )
tidycode
df
classification_tbl %>%
  # count calls per participant and classification
  group_by(id, analysis_job, classification) %>%
  summarise(n = n()) %>%
  # share of each classification within a participant
  mutate(pct = n / sum(n)) %>%
  # average those shares within each analysis_job group
  group_by(analysis_job, classification) %>%
  summarise(avg_pct = mean(pct)) %>%
  ggplot(aes(x = analysis_job, y = avg_pct, fill = classification)) +
  geom_bar(stat = "identity") +
  scale_y_continuous("Average percent", labels = scales::percent) +
  scale_x_discrete("Participant conducts analyses as part of their job")
D’Agostino McGowan, L, Kross, S, & Leek, JT. "Tools for Analyzing R Code the Tidy Way." The R Journal. 12.1 (2020): 226-242. https://doi.org/10.32614/RJ-2020-011
Questions?
Lucy D’Agostino McGowan
Wake Forest University
@LucyStats