1 of 61

Creating reproducible research reports using RMarkdown

Michael C. Frank

with

Anjie Cao and Alvin Tan (TAs)

2 of 61

These slides

3 of 61

Questions!

4 of 61

5 of 61

Independent verification: A core principle of the scientific endeavour

6 of 61

Components of verifiability

Reproducibility is not possible when there are analytic errors, when data are not available, or when analyses are not fully specified

7 of 61

How do we ensure the integrity of this chain?

Analysis pipeline

  1. Raw data: Data in a completely unaltered form (e.g., handwritten survey responses).
    • Next step: Digitization (extraction, conversion)
  2. Primary data: Data in its most basic digital form.
    • Next step: Processing (filtering, formatting, munging, anonymization)
  3. Processed data: Data after initial processing but prior to substantive analysis.
    • Next step: Analysis (descriptive or inferential)
  4. Quantitative values: Values appearing in text, tables, or figures.
    • Next step: Dissemination
  5. Reports: Including interim analyses, slides, posters, preprints, & papers.

8 of 61

Inter-related reproducibility issues

  1. Internal reproducibility: maintain the integrity of the analytic chain to ensure the provenance of results
  2. External reproducibility: others need to be able to access the details of the analytic chain and verify individual steps

Analysis pipeline: Digitization (extraction, conversion) → Processing (filtering, formatting, munging, anonymization) → Analysis (descriptive or inferential) → Dissemination

9 of 61

One solution – reproducible writing!

Data + RMarkdown Writeup = APA Paper

Share these!

10 of 61

Additional (selfish) motivation

  • Have you ever hand-tweaked a figure?
  • Have you ever copied and pasted stats?
    • And rounded by hand?
    • And done it again when your analysis changed?
  • Have you ever reformatted your references to change journals?

A reproducible workflow eliminates many of these tedious tasks

11 of 61

Learning goals

  1. Describe the challenges of analytic reproducibility (15 min + 5 min questions)
  2. Use RMarkdown to create reproducible documents to address these challenges (30 min practicum)
  3. Explore RMarkdown for APA formatting and bibliographic management (20 min practicum)
  4. Open Q&A (15 min)

12 of 61


13 of 61

Errors in data analysis are more common than we would like… and they are hard to find!

14 of 61

15 of 61

Workflow

  1. Get Excel data
  2. Write Matlab code
  3. Export figure as .EPS
  4. Tweak in Illustrator
  5. Drag into Word

Wrong trial numbers!

16 of 61

Not just me…

gen recall1=.
replace recall1=0 if Q21==1
replace recall1=1 if Q21==3 | Q21==5 | Q21==6
replace recall1=2 if Q21==2 | Q21==4 | Q21==7 | Q21==8
replace recall1=0 if Q69==1
replace recall1=1 if Q69==3 | Q69==5 | Q69==6
replace recall1=2 if Q69==2 | Q69==4 | Q69==7 | Q69==8
ta recall1

17 of 61

18 of 61

19 of 61

These examples are the best case – we have access to code and data and can detect and correct issues

20 of 61

Major barrier:

Most data is not available

21 of 61

Low data and code availability renders most papers non-reproducible

Data & code availability (prevalence estimates)

| Field | Data | Code |
|---|---|---|
| Psychology (2014-2017)¹ | 2% [1-4%]* | 1% [0-1%] |
| Social Sciences (2014-2017)² | 7% [2-13%] | 1% [0-3%] |
| Biomedicine (2015-2017)³ | 18.3% [11.6-27.3%] | 0% |

*[95% confidence intervals]

Data not available on request

| Sample | Data shared |
|---|---|
| 141 articles published in four major APA journals (2004)⁴ | 27% |
| 516 ecology articles published (1991-2011)⁵ | 20% |
| 111 most highly-cited articles published in psychology & psychiatry (2006-2016)⁶ | 14% |

22 of 61

A clever workaround for error detection

Bakker & Wicherts (2011)

Redundant information in APA statistical reporting provides an opportunity…
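
As a rough illustration of that opportunity: an APA-style result such as t(28) = 2.15, p = .04 reports the test statistic, the degrees of freedom, and the p-value, so the p-value can be recomputed from the other two and compared with what was printed. The sketch below uses base R only. The reported numbers and the helper name `recompute_p` are made up for illustration, and the rounding tolerance is a simplifying assumption, not statcheck's actual decision rule.

```r
# Recompute a two-sided p-value from a t statistic and its degrees of freedom.
recompute_p <- function(t_value, df) {
  2 * pt(abs(t_value), df, lower.tail = FALSE)
}

# Hypothetical reported result: t(28) = 2.15, p = .04
reported_t <- 2.15
reported_df <- 28
reported_p <- 0.04

recomputed_p <- recompute_p(reported_t, reported_df)

# Treat the report as consistent if the recomputed p-value lies within
# half a unit of the last reported decimal place (an assumed tolerance).
is_consistent <- abs(recomputed_p - reported_p) < 0.005

print(round(recomputed_p, 3))
print(is_consistent)
```

A check like this can only flag results where the reported pieces disagree with each other; it cannot tell which piece is wrong.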

23 of 61

Automated error detection using statcheck technique on parsed PDFs

Lower bound on total errors

24 of 61

Opportunity: Mandatory open data policy at the journal Cognition

25 of 61

First systematic case study of analytic reproducibility in psychology

26 of 61

A subset of target outcomes from 35 articles with open, reusable data.

11/35 fully reproducible

11/35 fully reproducible with assistance

13/35 not fully reproducible

1324 values checked. 64 “major numerical errors” (5%)

27 of 61

Empirical assessment of reproducibility

Important caveat:

No cases where reproducibility issues appeared to seriously undermine substantive conclusions (3 unclear cases)

Person hours for each: 2-4 (no assistance) or 5-25 (with assistance)

Like putting together flat pack furniture with a vague instruction manual - very high failure rate!

28 of 61

Replicating the reproducibility study (a meta-analysis)

29 of 61

Two components of reproducibility

  1. Need code and data for others to verify (and build on) your work
  2. Need a reproducible workflow so you can check your own work

Transparency policies must require code AND data

YOU should post your code and data!

Revisions to workflow – adopt reproducible reports!

Questions?

30 of 61

Learning goals

  • Describe the challenges of analytic reproducibility (15 min + 5 min questions)
  • Use RMarkdown to create reproducible documents to address these challenges (30 min practicum)
  • Explore RMarkdown for APA formatting and bibliographic management (20 min practicum)
  • Open Q&A (15 min)

31 of 61

R and RStudio

  • R is an open source statistical programming language
  • RStudio is an integrated development environment (IDE)
    • Software for writing R code and seeing outputs
  • Both are free and can be downloaded to your computer

32 of 61

This webinar: using cloud-based services

  • R relies on an ecosystem of packages that allow you to do different things
    • statistical models
    • “knitting” reproducible reports
  • Downloading and installing packages is complex and computer-dependent
  • Posit Cloud is a “freemium” cloud service
    • Removes challenges of installing software on individual machines

33 of 61

Markdown

  • Markdown is text with lightweight annotations
    • Italics: *this*
    • Bold: **this**
  • Simple headings:
    • # first level
    • ## second level
    • ### third level

34 of 61

RMarkdown = R + Markdown

  • Method for blending R code with text
  • Allows you to create documents with data-generated elements
  • Part of a broader trend towards “literate programming”

35 of 61

An RMarkdown document

YAML header

code chunk

markdown text

36 of 61

Headers

  • YAML text
  • Tells “knitr” the general properties of the document
  • Generally don’t fool with it!

37 of 61

Markdown Text

  • This is where your paper goes!
  • Can write in markdown syntax
    • Headings
    • Formatting, bullets, and lists
    • Hyperlinks
  • Can include R code chunks inline
    • `r 2+2` will return 4

38 of 61

Code Chunks

  • Delimited by:
    • Start: ```{r …}
    • End: ```
  • R executes the code inside the chunk
  • Effects get printed to the document
    • Figures, numbers, tables, etc.

39 of 61

Output formats

  • Default output format is HTML
    • Easy to share with collaborators and post on the web
  • Word output is natively supported
  • PDF rendering also possible
    • for local installs, requires installing LaTeX

40 of 61

Let’s see it in action!

Follow along with instructions on the slides:

https://tinyurl.com/mb-repro-writing

[starting on slide 41]

41 of 61

Rendering a simple markdown

1. Create new workspace (RStudio)

2. File > New File > R Markdown

3. Install relevant packages

4. Fill out template

5. Knit (requires saving the file)

Questions?

42 of 61

Adding display elements

1. Clear template

2. Install and load tidyverse package

3. Add a table showing the mtcars built-in dataset using knitr::kable(mtcars)

4. Add a plot of mtcars using ggplot
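
A minimal sketch of what the chunk for steps 3 and 4 might contain, assuming the knitr and ggplot2 packages are installed (ggplot2 comes with the tidyverse). The choice of variables, the caption, and the axis labels are illustrative only; `head()` is used here just to keep the table short.

```r
library(knitr)    # provides kable()
library(ggplot2)  # plotting package from the tidyverse

# Table: first rows of the built-in mtcars data set, rounded to one decimal
car_table <- kable(head(mtcars), digits = 1, caption = "First six rows of mtcars")
car_table

# Plot: car weight against fuel economy
car_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")
car_plot
```

When the document is knit, both the table and the plot appear in the output at the position of the chunk.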

43 of 61

Adding reproducible statistical tests

  1. Perform a statistical test and save the outputs
  2. Write the numbers out into text using `r p.value`-type syntax

44 of 61

Returning to Markdown and report styling

  1. Add chunk options
  2. Add headings
  3. Add styling
  4. Add table of contents
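
As a sketch of step 1, chunk options can be set once for the whole document with `knitr::opts_chunk$set()`, typically in a setup chunk near the top of the Rmd. The specific option values below are illustrative choices, and the call is skipped if knitr is not installed.

```r
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::opts_chunk$set(
    echo = FALSE,     # hide the R code in the rendered report
    message = FALSE,  # hide package start-up messages
    warning = FALSE,  # hide warnings
    fig.width = 6,    # default figure width in inches
    fig.height = 4    # default figure height in inches
  )
}

# The same options can be set for a single chunk in its header,
# for example {r, echo=FALSE}.
# A table of contents is switched on in the YAML header, for example
# with `toc: true` nested under `html_document:`.
```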

final Rmd: https://gist.github.com/mcfrank/6066d428dba57109706cc23011687cf5

45 of 61

Part 3: Writing reproducible papers

46 of 61

The workflow

  • You write your paper in Markdown
    • Adding references as you go
  • You write your analyses in R so that they generate the figures, tables, and stats you want
  • You “knit” into the final manuscript
    • Formatting handled by the template
    • Bibliography automatically generated
  • To make changes, change the code and reknit

47 of 61

Papaja

  • Managing APA format is a pain – let the software do the work!
  • Enter the papaja package
    • R package including an R Markdown template
    • Knits to Word or PDF documents
  • Includes functions for formatting statistics and tables properly
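
A hedged sketch of the statistics-formatting functions: `papaja::apa_print()` takes a fitted test object and returns APA-formatted strings that can be dropped into the text. The t-test on mtcars is an illustrative choice, and the papaja call is skipped if the package is not installed.

```r
# Welch t-test comparing mpg of automatic and manual cars in built-in mtcars
mpg_test <- t.test(mpg ~ am, data = mtcars)

if (requireNamespace("papaja", quietly = TRUE)) {
  apa_result <- papaja::apa_print(mpg_test)

  # full_result is a ready-formatted string (estimate, CI, test statistic,
  # and p-value) intended for inline use: `r apa_result$full_result`
  print(apa_result$full_result)
}
```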

48 of 61

Let’s see it in action!

Follow along with instructions on the slides:

https://tinyurl.com/mb-repro-writing

(instructions for this example on slide 49)

49 of 61

Rendering an APA format markdown to PDF

1. install.packages("papaja")

2. File > New File > R Markdown

3. Choose “From template” from left tab panel

4. Choose “APA style manuscript”

5. Knit (requires saving the file)

50 of 61

Bibliographic management

  • Do you manage your bibliography for each paper by hand?
  • Alternative: use a software library and insert citations into your text with bibliography generated automatically at the end

51 of 61

Bibliographic management

  • Build your personal bibliography using
    • Zotero
    • Mendeley
    • BibDesk
  • Create/export a BibTeX (.bib) file that has reference info
  • As you write, type citation codes like @hardwicke2018
  • These are converted into APA references + appropriate bibliographic entries when you knit

52 of 61

Let’s see it in action!

Follow along with instructions on the slides:

https://tinyurl.com/mb-repro-writing

53 of 61

Adding references

  1. Upload .bib file to posit.cloud
  2. Add .bib file to YAML in papaja document
  3. Add @citation to document
  4. Knit and see reference in rendered doc
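
For step 2, the relevant line in the YAML header might look like the fragment below. The file name `references.bib` is a hypothetical placeholder for whatever .bib file was uploaded; the rest of the header stays as the template generated it.

```yaml
# In the YAML header of the papaja document
bibliography: references.bib
```

In the text, a key from that file written as `@hardwicke2018` produces a narrative citation, and `[@hardwicke2018]` produces a parenthetical one. The key must match an entry in the .bib file.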

54 of 61

Bibliographic management (advanced)

Additional tools for bibliographies:

Better BibTeX:

  • auto-generated citation keys
  • better formatting for special characters

citr:

  • RStudio addin for adding citations
  • with Better BibTeX, you can access your Zotero library from within RStudio + auto-update your bib file

55 of 61

Collaboration – many models

  1. First author owns Rmd, other authors comment on PDF
  2. First author drafts, renders to Word/Google Docs, all collaborate, first author ports text back to Rmd
  3. All collaborate on Rmd using Git + GitHub
  4. redoc – bring Word track changes back to Rmd automatically

56 of 61

Why share your reproducible manuscript?

  • Data sharing is a great first step but analytic reproducibility can still be complicated
  • Sharing an RMarkdown script alongside data is a strong step towards letting others reproduce your work
    • For secondary analyses
    • Power analysis for new studies
    • Meta-analysis

57 of 61

Sharing, practicalities

  • Where to share?
    • Open Science Framework (http://osf.io)
    • Git + GitHub also allow version control
  • Tell us more!
    • Sometimes software changes…
    • Document the versions for key packages using devtools::session_info()
  • Consider going fully reproducible
    • Use renv package to capture package versions
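
A small sketch of recording versions, using base R so that it runs without extra packages. The devtools and renv calls appear only as comments because they depend on those packages being installed and on a project set up for renv.

```r
# Record the R version and the attached packages for this session
info <- sessionInfo()
print(info$R.version$version.string)

# Version of a single package (stats ships with R, so this always works)
print(packageVersion("stats"))

# More detailed listing, if devtools is installed:
#   devtools::session_info()

# With renv, a typical sequence is:
#   renv::init()      # start a project-local package library
#   renv::snapshot()  # write the package versions in use to renv.lock
#   renv::restore()   # reinstall those versions on another machine
```

Putting a call like `sessionInfo()` in the last chunk of a report means the versions are saved with every knitted copy.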

58 of 61

59 of 61

Summary

  • We’re only human, so errors are inevitable
  • Doing lots of things by hand is a great way to waste time and make more errors
  • Writing reproducible scientific papers using RMarkdown can:
    • Help you avoid errors
    • Save time and energy
    • Make spiffy looking documents easily

60 of 61

Thank you!

Data + RMarkdown Writeup = APA Paper

61 of 61

Conclusions & Questions

These slides: https://tinyurl.com/mb-repro-writing

Webinar feedback: https://tinyurl.com/mb-repro-feedback

Much more reproducibility content: http://experimentology.io