1 of 12

Before we begin

  • Make sure you have installed:
    • R AND RStudio
    • The “tidyverse” suite of R packages
    • install.packages("tidyverse")

  • Have RStudio OPEN

  • We will set up an R project in your data directory

2 of 12

Data Carpentry: Day 2

UW Madison - Data Science Hub

June 12, 2019

3 of 12

Same as yesterday

Logistics

  • Similar schedule -- 3 breaks
  • Out to right, then left
  • If you need a kitchen, lactation room, lockers, let us know.

Workshop

  • Hands-on!
  • Work with your neighbors
  • Use red/green tents to indicate if you need help
  • Helpers are nearby
  • Abide by the Code of Conduct

4 of 12

Goal: Productivity

  • Perform accurate data analysis
  • Leverage appropriate tools
  • Prepare for the future

Your closest collaborator is you six months ago, but you don't reply to email.

– (paraphrasing) Mark Holder

5 of 12

Recap from yesterday

  • Early stages of data analysis
    • Format
    • Clean
    • Basic selection, combination, summarization
  • Focus on reproducibility
    • Track steps of data change
    • Organize using a folder structure
    • Use tools that you can repeat

6 of 12

Organization

  • Create a directory for each project
  • Separate things (data, scripts, reports)
  • File names: meaningful, sortable, consistent

File organization and naming are powerful weapons against chaos.

– Jenny Bryan
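
A minimal sketch of the layout above, in base R. The project name and file label are hypothetical, and the project is created in a temporary directory so nothing is written to your real data directory.

```r
# Hypothetical project root inside a temporary directory (an assumption for
# illustration; in the workshop you would use your own data directory)
project <- file.path(tempdir(), "portal-analysis")

# Separate things: one folder each for data, scripts, and reports
for (sub in c("data", "scripts", "reports")) {
  dir.create(file.path(project, sub), recursive = TRUE, showWarnings = FALSE)
}

# A meaningful, sortable, consistent file name: ISO 8601 date first,
# then a short descriptive label
report_name <- sprintf("%s_weight-summary.csv",
                       format(as.Date("2019-06-12"), "%Y-%m-%d"))
report_path <- file.path(project, "reports", report_name)
```

Putting the date first in year-month-day order means an alphabetical sort of the file names is also a chronological sort.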

7 of 12

Dates

https://xkcd.com/1179/

8 of 12

Format (spreadsheets)

  • Goal: rectangle of information
    • rows = observations, columns = variables
    • one thing per cell
    • headers for the columns
    • don't use font color or highlighting as data
  • Watch out for dates (3 separate columns?)
  • Never edit raw data
  • Plain text formats
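
One way to picture the "rectangle" and the three-column date advice, as a sketch in base R. The table and its column names are made up for illustration.

```r
# Hypothetical rectangle: one row per observation, one column per variable,
# one value per cell, with the date stored as three separate integer columns
surveys <- data.frame(
  record_id = 1:3,
  year      = c(2019L, 2019L, 2019L),
  month     = c(6L, 6L, 12L),
  day       = c(11L, 12L, 1L)
)

# Rebuild an unambiguous ISO 8601 date (YYYY-MM-DD) from the three parts
surveys$date <- as.Date(
  sprintf("%04d-%02d-%02d", surveys$year, surveys$month, surveys$day)
)

# Print the rectangle with the reconstructed date column
print(surveys)
```

Keeping year, month, and day apart in the raw file avoids spreadsheet programs silently reinterpreting the value; the full date can always be rebuilt later, as shown.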

9 of 12

Clean (spreadsheets/OpenRefine)

  • cleaning & exploration
  • faceting / filtering
  • splitting a column
  • remove leading/trailing text
  • cluster categories (to find typos)
  • identify outliers
  • all actions are reproducible

10 of 12

Subset and Summarize (OpenRefine + SQL)

  • Access rows (usually by a filter) and columns of data
  • Summarize data in total or in groups
  • Combine related data
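
The same three operations can be sketched in base R, the language used today. The two small tables below are invented for illustration and are not the workshop data set.

```r
# Hypothetical observations and a related lookup table
surveys <- data.frame(
  record_id  = 1:6,
  species_id = c("DM", "DM", "PF", "PF", "DM", "NL"),
  weight     = c(40, 44, 8, 7, 42, 180)
)
species <- data.frame(
  species_id = c("DM", "PF", "NL"),
  genus      = c("Dipodomys", "Perognathus", "Neotoma")
)

# Access rows (by a filter) and columns
heavy <- subset(surveys, weight > 20, select = c(record_id, species_id, weight))

# Summarize in total and in groups
overall_mean    <- mean(surveys$weight)
mean_by_species <- aggregate(weight ~ species_id, data = surveys, FUN = mean)

# Combine related data through the shared key column
combined <- merge(mean_by_species, species, by = "species_id")
print(combined)
```

These correspond roughly to `WHERE`, `GROUP BY`, and `JOIN` in SQL.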

11 of 12

TODAY

  • Pick up where we left off:
    • Select and summarize data
    • Visualize data
    • Reports, final presentations
  • Using R
  • Still with a focus on self-documentation, organization, reproducibility

12 of 12

R Lessons Online