1 of 58

Let's get meta: analyzing your R code with tidycode

Lucy D’Agostino McGowan

Wake Forest University

2 of 58

Follow Along!

lucymcgowan.com/talk

3 of 58

Joint work with

Jeff Leek

Sean Kross

4 of 58

Thank you

Leek Lab

Johns Hopkins Bloomberg School of Public Health

5 of 58

We want to analyze how analysts are coding

6 of 58

Why?

We want to analyze how analysts are coding



10 of 58

Why?

  • facilitate data science pedagogy
  • help with reproducibility / replicability
  • explore how current software / tools are being used


12 of 58

How?

matahari

tidycode


14 of 58

Static R code

15 of 58

Static R code

Dynamic R code
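Both kinds of code can be captured with the matahari package; a minimal sketch of the two modes, using function names from matahari (the script path is a placeholder):

```r
library(matahari)

# Dynamic: log each expression run in a live R session
dance_start()
fit <- lm(mpg ~ cyl, data = mtcars)
dance_stop()
dance_tbl()   # a tidy tibble, one row per recorded expression

# Static: parse an already-written script into the same shape
# ("analysis.R" is a placeholder path)
df <- dance_recall("analysis.R")
```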


17 of 58

matahari

18 of 58

matahari

tidy data


20 of 58

matahari

tidy data

  • Each variable has its own column
  • Each observation has its own row
  • Each value has its own cell

21 of 58

matahari

tidycode


24 of 58

text::tidytext

code::tidycode

instead of analyzing tokens of text

we are analyzing tokens of code


27 of 58

stopwords::tidytext

stopfuncs::tidycode

instead of removing stop words

we remove stop functions
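In tidycode, the stop-function list ships with the package and is removed with a dplyr anti-join, mirroring how tidytext removes stop words. A sketch, assuming `tbl` already has one row per function call with a `func` column (as produced by `unnest_calls()`):

```r
library(tidycode)
library(dplyr)

# Functions too generic to signal an analysis task
# (assignment, pipes, library calls, etc.)
stop_funcs <- get_stopfuncs()

# Drop them, joining on the shared `func` column
tbl %>% anti_join(stop_funcs)
```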


30 of 58

sentiment::tidytext

analysis tasks::tidycode

instead of classifying text by sentiment

we are classifying code by analysis tasks
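The classification lexicons also ship with tidycode and are attached by joining on function name. A sketch, again assuming `tbl` has a `func` column:

```r
library(tidycode)
library(dplyr)

# One row per function/classification pair from the crowdsourced lexicon
classifications <- get_classifications("crowdsource",
                                       include_duplicates = FALSE)

# Keep only functions the lexicon knows, attaching their classification
tbl %>% inner_join(classifications)
```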

31 of 58

analysis tasks

  • setup
  • exploratory
  • data cleaning
  • modeling
  • evaluation
  • visualization
  • communication
  • import
  • export

32 of 58

lexicons

  • leeklab
  • crowdsource


37 of 58

matahari


39 of 58

tidycode

df

expr


43 of 58

df, with one call per row in the expr column:

  library(tidyverse)
  data %>% filter(mpg < 20)
  lm(mpg ~ cyl, data = data)
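A data frame of this shape can come from a matahari log or be built directly from scripts with tidycode's `read_rfiles()`, which reads R files into one row per top-level call (the path here is a placeholder):

```r
library(tidycode)

# One row per top-level expression; the unevaluated call is
# stored in the `expr` list-column
df <- read_rfiles("analysis.R")   # placeholder path
```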


47 of 58

tbl <- df %>%
  unnest_calls(expr)
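`unnest_calls()` breaks each expression into its component function calls, adding a `func` column (the function name) and an `args` list-column. A sketch of the pipeline up to this point, with a placeholder script path:

```r
library(tidycode)
library(dplyr)

# Parse scripts, then break each call into function-level tokens
tbl <- read_rfiles("analysis.R") %>%   # placeholder path
  unnest_calls(expr)

# tbl now has one row per function call: `func` and `args`
```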

48 of 58

classification_tbl <- tbl %>%
  anti_join(get_stopfuncs()) %>%
  inner_join(
    get_classifications("crowdsource", include_duplicates = FALSE)
  )


53 of 58

classification_tbl %>%
  group_by(id, analysis_job, classification) %>%
  summarise(n = n()) %>%
  mutate(pct = n / sum(n)) %>%
  group_by(analysis_job, classification) %>%
  summarise(avg_pct = mean(pct)) %>%
  ggplot(aes(x = analysis_job, y = avg_pct, fill = classification)) +
  geom_bar(stat = "identity") +
  scale_y_continuous("Average percent", labels = scales::percent) +
  scale_x_discrete("Participant conducts analyses as part of their job")



57 of 58

D’Agostino McGowan, L, Kross, S, & Leek, JT. "Tools for Analyzing R Code the Tidy Way." The R Journal. 12.1 (2020): 226-242. https://doi.org/10.32614/RJ-2020-011

58 of 58

Questions?

Lucy D’Agostino McGowan

Wake Forest University

@LucyStats