Creating reproducible research reports using RMarkdown
Michael C. Frank
with Anjie Cao and Alvin Tan (TAs)
These slides
Questions!
Independent verification: A core principle of the scientific endeavour
Components of verifiability
Reproducibility is not possible when there are analytic errors, when data are not available, or when analyses are not fully specified
How do we ensure the integrity of this chain?
Raw data
Data in a completely unaltered form (e.g., handwritten survey responses).
Quantitative values
Values appearing in text, tables, or figures.
Primary data
Data in its most basic digital form.
Processed data
Data after initial processing but prior to substantive analysis.
Reports
Including interim analyses, slides, posters, preprints, & papers.
Digitization
(extraction, conversion)
Processing
(filtering, formatting, munging, anonymization)
Analysis
(descriptive or inferential)
Dissemination
Analysis pipeline
Inter-related reproducibility issues
One solution – reproducible writing!
Data + RMarkdown Writeup = APA Paper
Share these!
Additional (selfish) motivation
A reproducible workflow eliminates many of these tedious tasks
Learning goals
Errors in data analysis are more common than we would like… and they are hard to find!
Workflow
Wrong trial numbers!
Not just me…
gen recall1=.
replace recall1=0 if Q21==1
replace recall1=1 if Q21==3 | Q21==5 | Q21==6
replace recall1=2 if Q21==2 | Q21==4 | Q21==7 | Q21==8
replace recall1=0 if Q69==1
replace recall1=1 if Q69==3 | Q69==5 | Q69==6
replace recall1=2 if Q69==2 | Q69==4 | Q69==7 | Q69==8
ta recall1
These examples are the best case – we have access to code and data and can detect and correct issues
Major barrier:
Most data is not available
Low data and code availability renders most papers non-reproducible
Data & code availability (prevalence estimates)
Data not available on request
Field (years) | Data | Code |
Psychology (2014-2017)¹ | 2% [1-4%]* | 1% [0-1%] |
Social Sciences (2014-2017)² | 7% [2-13%] | 1% [0-3%] |
Biomedicine (2015-2017)³ | 18.3% [11.6-27.3%] | 0% |
*[95% confidence intervals]

Sample | Data shared |
141 articles published in four major APA journals (2004)⁴ | 27% |
516 ecology articles published (1991-2011)⁵ | 20% |
111 most highly-cited articles published in psychology & psychiatry (2006-2016)⁶ | 14% |
A clever workaround for error detection
Bakker & Wicherts (2011)
Redundant information in APA statistical reporting provides an opportunity…
Automated error detection using statcheck technique on parsed PDFs
Lower bound on total errors
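To illustrate the redundancy idea, here is a minimal sketch in R. It is not the statcheck implementation, only the underlying logic: an APA-style result reports a test statistic, degrees of freedom, and a p-value, so the p-value can be recomputed and compared. The function name `check_t_report` and the reported values are hypothetical, and the check is simplified (it ignores rounding of the test statistic and one-sided tests).

```r
# Recompute a two-sided p-value from a reported t statistic and its degrees
# of freedom, then compare it with the reported p at the reported precision.
check_t_report <- function(t_value, df, reported_p, digits = 2) {
  recomputed_p <- 2 * pt(abs(t_value), df = df, lower.tail = FALSE)
  list(
    recomputed_p = recomputed_p,
    consistent = isTRUE(all.equal(round(recomputed_p, digits),
                                  round(reported_p, digits)))
  )
}

# Hypothetical reported result: t(28) = 2.15, p = .04
check_t_report(t_value = 2.15, df = 28, reported_p = 0.04)

# The same statistic reported with p = .01 would be flagged as inconsistent
check_t_report(t_value = 2.15, df = 28, reported_p = 0.01)
```

A check like this can only catch internally inconsistent reports, which is why the resulting error counts are a lower bound.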
Opportunity: Mandatory open data policy at the journal Cognition
First systematic case study of analytic reproducibility in psychology
A subset of target outcomes from 35 articles with open, reusable data.
11/35 fully reproducible
11/35 fully reproducible with assistance
13/35 not fully reproducible
1324 values checked. 64 “major numerical errors” (5%)
Empirical assessment of reproducibility
Important caveat:
No cases where reproducibility issues appeared to seriously undermine substantive conclusions (3 unclear cases)
Person hours for each:
2 - 4 (no assistance) or 5 - 25 (assistance)
Like putting together flat-pack furniture with a vague instruction manual - very high failure rate!
Replicating the reproducibility study (a meta-analysis)
Two components of reproducibility
Transparency policies must require code AND data
YOU should post your code and data!
Revisions to workflow – adopt reproducible reports!
Questions?
Learning goals
R and RStudio
This webinar: using cloud-based services
Markdown
RMarkdown = R + Markdown
An RMarkdown document
YAML header
code chunk
markdown text
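One way to see the three parts side by side is to assemble a small document from R. This is only a sketch: the title, section heading, chunk label, and file name are made up for illustration.

```r
# Build a minimal RMarkdown document piece by piece so each part is visible.
yaml_header <- c(
  "---",
  "title: \"Example report\"",  # hypothetical title
  "output: html_document",
  "---"
)

markdown_text <- c(
  "",
  "# Results",
  "",
  "Cars in the built-in `mtcars` data average `r round(mean(mtcars$mpg), 1)` miles per gallon.",
  ""
)

code_chunk <- c(
  "```{r mpg-summary}",
  "summary(mtcars$mpg)",
  "```"
)

# Combine the parts and write them to a temporary file (hypothetical name)
rmd_lines <- c(yaml_header, markdown_text, code_chunk)
rmd_path <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_path)

# Print the document as it would appear in the editor
cat(readLines(rmd_path), sep = "\n")
```

The inline `r ...` expression in the markdown text is what lets a number in a sentence be computed from the data rather than typed by hand.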
Headers
Markdown Text
Code Chunks
Output formats
Let’s see it in action!
Follow along with instructions on the slides:
[starting on slide 41]
Rendering a simple markdown
1. Create new workspace (RStudio)
2. File > New > RMarkdown
3. Install relevant packages
4. Fill out template
5. Knit (requires saving the file)
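The Knit button can also be reproduced from the console, which helps when scripting a whole analysis. The sketch below assumes the rmarkdown package and pandoc are installed (RStudio bundles pandoc); the file name and report contents are hypothetical.

```r
# Write a one-chunk report to a temporary file, then knit it programmatically.
rmd_path <- file.path(tempdir(), "simple-report.Rmd")  # hypothetical file name
writeLines(c(
  "---",
  "title: \"A simple report\"",
  "output: html_document",
  "---",
  "",
  "```{r}",
  "summary(cars)",
  "```"
), rmd_path)

if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  # render() runs the chunks, converts the result, and returns the output path
  out_file <- rmarkdown::render(rmd_path, quiet = TRUE)
  message("Rendered: ", out_file)
} else {
  message("Install rmarkdown (and pandoc) to render ", rmd_path)
}
```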
Questions?
Adding display elements
1. Clear template
2. Install and load tidyverse package
3. Add a table showing the mtcars built-in dataset using knitr::kable(mtcars)
4. Add a plot of mtcars using ggplot
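A sketch of what the chunk for steps 2 to 4 might contain. The choice of plotted variables is an assumption for illustration, and only the first rows of the table are shown to keep the output short.

```r
library(knitr)    # provides kable()
library(ggplot2)  # one of the core tidyverse packages

# Formatted table of the first rows of the built-in mtcars data
kable(head(mtcars))

# Scatterplot of car weight against fuel economy (variables chosen for illustration)
mpg_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
mpg_plot
```

When the document is knitted, both the table and the plot are regenerated from the data, so they cannot drift out of sync with the analysis.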
Adding reproducible statistical tests
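As a rough example of a reproducible test, the sketch below runs a t-test on the built-in mtcars data and builds the reported string from the fitted object instead of copying numbers by hand. The comparison (fuel economy by transmission type) and the variable names `fit` and `report` are chosen only for illustration.

```r
# Welch t-test on built-in data: miles per gallon by transmission type
fit <- t.test(mpg ~ am, data = mtcars)

# Assemble the reported values from the fitted object rather than typing them
report <- sprintf(
  "t(%.2f) = %.2f, p = %.3f",
  fit$parameter, fit$statistic, fit$p.value
)
report
```

In an RMarkdown document the string could then be placed in a sentence with inline code such as `r report`, so the text updates whenever the data or analysis changes.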
Returning to Markdown and report styling
final Rmd: https://gist.github.com/mcfrank/6066d428dba57109706cc23011687cf5
Part 3: Writing reproducible papers
The workflow
Papaja
Let’s see it in action!
Follow along with instructions on the slides:
https://tinyurl.com/mb-repro-writing
(instructions for this example on slide 49)
Rendering an APA format markdown to PDF
1. install.packages("papaja")
2. File > New > RMarkdown
3. Choose “From template” from left tab panel
4. Choose “APA style manuscript”
5. Knit (requires saving the file)
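The same steps can be scripted instead of clicked through. This sketch assumes papaja is already installed and uses a hypothetical file name; the render call is left commented out because producing the PDF also needs a LaTeX installation (for example TinyTeX).

```r
# Create a manuscript from papaja's APA template, ready to knit.
if (requireNamespace("papaja", quietly = TRUE)) {
  manuscript <- file.path(tempdir(), "manuscript.Rmd")  # hypothetical file name
  rmarkdown::draft(
    manuscript,
    template = "apa6",   # papaja's APA manuscript template
    package = "papaja",
    create_dir = FALSE,
    edit = FALSE
  )
  # Knit to PDF once LaTeX is available:
  # rmarkdown::render(manuscript)
} else {
  message("Run install.packages(\"papaja\") first")
}
```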
Bibliographic management
Let’s see it in action!
Follow along with instructions on the slides:
Adding references
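One small, hedged example of generating references from R: `knitr::write_bib()` writes BibTeX entries for R packages, which can then be cited from the manuscript. The file name and the packages cited are assumptions for illustration; references to papers would come from your own .bib file.

```r
# Write BibTeX entries for the R packages used in a report to a .bib file
bib_path <- file.path(tempdir(), "packages.bib")  # hypothetical file name
knitr::write_bib(c("base", "knitr"), file = bib_path)

# Entries get keys such as R-base and R-knitr. In the manuscript, the YAML
# header would list the file (bibliography: packages.bib) and the text would
# cite an entry with [@R-knitr].
head(readLines(bib_path))
```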
Bibliographic management (advanced)
Additional tools for bibliographies:
citr:
Collaboration – many models
Why share your reproducible manuscript?
Sharing, practicalities
Summary
Thank you!
Data + RMarkdown Writeup = APA Paper
Conclusions & Questions
These slides:
https://tinyurl.com/mb-repro-writing
Webinar feedback: https://tinyurl.com/mb-repro-feedback
Much more reproducibility content: http://experimentology.io