1 of 32

Social Science R Onboarding

For Data Carpentry Instructors

October-November 2018

2 of 32

Who is this onboarding for?

  • Certified Carpentries Instructors
  • Trainees working toward certification
  • Applicants waiting for training

Intended for anyone in these groups who is interested in teaching the R version of the Social Sciences curriculum.

Please note: This onboarding is not a substitute for Instructor Training or any stage of the Instructor checkout process. If you are not a certified Instructor, you will need to complete Instructor Training and checkout before teaching at a centrally-organized Carpentry workshop.

3 of 32

What if I’m not an Instructor?

  • Stay and learn! You’re welcome here.

  • Non-certified Instructors are welcome to use and adapt these lessons for other teaching contexts. See our license for usage conditions.

  • Non-certified Instructors can teach self-organized Carpentry-branded workshops alongside a certified Instructor. See the requirements for running a Carpentry-branded workshop.

4 of 32

Why should I complete onboarding?

  • Help prepare you to teach a Data Carpentry Social Science workshop
    • Gain familiarity with the curriculum and data
    • Learn where to find information about common problems

  • Get priority placement for Social Science teaching opportunities
    • Onboarding isn’t a requirement, but onboarded Instructors will have priority
    • This applies to certified Instructors only

5 of 32

This onboarding will not cover:

These materials are in alpha development and are not currently part of Data Carpentry core curricular offerings.

6 of 32

Questions / Discussion

7 of 32

Origins of Curriculum

  • Adapted directly from the Data Carpentry Ecology curriculum
  • Dataset and exercises changed for new audience

8 of 32

Introduction to the dataset

Data description: https://datacarpentry.org/socialsci-workshop/data/

Tabular dataset from SAFI (Studying African Farmer-Led Irrigation) project. Surveys conducted by smartphone application with questions about households and agricultural practices.

Subset of larger study - 131 responses included.

9 of 32

Workshop structure

Lesson                                    Length
Data Organization in Spreadsheets*        2 to 3 hours
Data Cleaning with OpenRefine*            2 to 3 hours
Data Analysis and Visualisation with R    8+ hours

*Spreadsheets + OpenRefine lessons take longer than in Ecology versions

10 of 32

Dataset versions

Different versions of data file for the different applications:

  • Listed on setup page for spreadsheet and OpenRefine lessons.
  • Downloaded using download.file() in R lesson.

Each version illustrates different problems with data organization and cleaning. It is essential to use the correct version for each exercise.

11 of 32

Audience

  • Intended for researchers working with tabular data.
  • Focus on survey / interview data.
  • About half have prior experience with GUI statistical analysis software (e.g. SPSS, SAS).
  • Some have experience with R.
  • Many work with Qualtrics data.
  • Detail-oriented - plan for questions.
  • Diversity of operating systems and software versions.

12 of 32

Workshop overview

  • Data lifecycle from collection to analysis.

  • By end of workshop, learners should be able to:
    • organize tabular data in spreadsheets
    • handle date formatting
    • carry out quality control and quality assurance
    • export data to use with downstream applications
    • explore, summarize, and clean tabular data reproducibly
    • import data into R
    • calculate summary statistics
    • create publication-quality graphics

13 of 32

Questions / Discussion

14 of 32

Data Organization in Spreadsheets

  • Covers:
    • Good data entry practices - formatting data tables in spreadsheets
    • How to avoid common formatting mistakes
    • Approaches for handling dates in spreadsheets
    • Basic quality control in spreadsheets
    • Exporting data from spreadsheets

15 of 32

Data Organization in Spreadsheets

  • Motivate! Things spreadsheets are good at vs. not good at.
  • Important to engage right away
  • Start with “what are the problems?” exercise
  • The Formatting Problems episode is the solution to first exercise - use as discussion guide, not script.

16 of 32

Dataset formatting

Several columns include lists within a cell - separated by semicolon (;).

Dealt with explicitly in OpenRefine lesson.

Can cause problems on data import to LibreOffice.
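
For instructors who want to preview how these multi-value cells behave once the data reaches R, the following is a minimal base R sketch. The column name `items_owned` and its values are illustrative assumptions, not copied from the dataset.

```r
# Illustrative values only: each element mimics one cell holding a
# semicolon-separated list of responses.
items_owned <- c("bicycle;television;solar_panel", "radio", "radio;table")

# Split each cell on the literal ";" to get one character vector per row.
items_list <- strsplit(items_owned, ";", fixed = TRUE)

# Number of items listed in each cell.
n_items <- lengths(items_list)

# TRUE where "radio" appears in the cell's list.
has_radio <- vapply(items_list, function(x) "radio" %in% x, logical(1))
```

Splitting into a list keeps one entry per household, which makes it easy to show learners why a single cell holding several values is awkward to filter or count directly.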

17 of 32

Things to watch out for

Detailed in the Instructor Notes.

Don’t rush! Take the time to troubleshoot. Use helpers for one-on-one support.

18 of 32

Questions / Discussion

19 of 32

Data Cleaning with OpenRefine

  • Covers:
    • Faceting to get an overview of data within a column.
    • Clustering to identify data entry errors.
    • Data transformations using GREL statements.
    • Filtering and sorting data.
    • Exporting scripts and using them for reproducible data cleaning.

20 of 32

Data Cleaning with OpenRefine

  • If OpenRefine doesn’t open, point your browser at http://127.0.0.1:3333/ (more troubleshooting in the setup guide).
  • Exercise solutions depend on learners following all steps.
  • Transforming data with GREL statements is quite technical.
    • Be explicit about syntax
    • Take time to troubleshoot
    • Demonstrate using history
  • Can skip the Examining Numbers and Other Resources episodes if short on time.

21 of 32

Questions / Discussion

22 of 32

Data Analysis and Visualisation with R

Uses the following R packages:

  • dplyr
  • tidyr
  • readr
  • ggplot2

All installed through tidyverse.

23 of 32

Data Analysis and Visualisation with R

  • Covers:
    • Installing packages.
    • Seeking help.
    • Basic data structures (focusing on vectors and dataframes).
    • Data import and export.
    • Subsetting - using both base R syntax and dplyr.
    • Data type conversions.
    • Handling missing data.
    • Date formatting.
    • Data reshaping with gather() and spread().
    • Data visualisation.
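
As a quick refresher on three of the topics listed above (type conversion, missing data, and date formatting), here is a minimal sketch. It uses base R only and made-up values; the lesson itself may use different functions or packages for the same steps.

```r
# Illustrative values only: household sizes read in as text, one missing.
members_chr <- c("3", "7", NA, "10")

# Data type conversion: character to numeric (NA stays NA).
members <- as.numeric(members_chr)

# Handling missing data: count the NAs, then exclude them from a summary.
n_missing <- sum(is.na(members))
mean_members <- mean(members, na.rm = TRUE)

# Date formatting: parse an ISO 8601 string, then extract year and month.
interview_date <- as.Date("2016-11-17")
interview_year <- as.integer(format(interview_date, "%Y"))
interview_month <- as.integer(format(interview_date, "%m"))
```

Without `na.rm = TRUE`, `mean()` returns `NA` for this vector, which is a common source of learner questions.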

24 of 32

Data Analysis and Visualisation with R

  • Uses project setup.
  • Starts with variable assignment and builds up to larger data structures.
  • Slow getting to real data - 2 to 3 hours into lesson.
  • Can’t skip the gather() and spread() section - it is essential for creating the data structure used for plotting.
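
To make the reshaping step concrete, here is a minimal sketch of a round trip with tidyr's `gather()` and `spread()`. The data frame and its column names (`household`, `y2016`, `y2017`, `members`) are hypothetical and are not taken from the lesson.

```r
library(tidyr)

# Hypothetical wide table: one row per household, one column per survey year.
wide <- data.frame(
  household = c("h1", "h2"),
  y2016 = c(3, 5),
  y2017 = c(4, 6)
)

# gather(): wide to long. The year columns collapse into a key column
# ("year") and a value column ("members").
long <- gather(wide, key = "year", value = "members", y2016, y2017)

# spread(): long to wide. Each distinct value of `year` becomes a column
# again, filled from `members`.
wide_again <- spread(long, key = year, value = members)
```

Printing `wide`, `long`, and `wide_again` side by side shows learners what changes: the same values, arranged with a different number of rows and columns.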

25 of 32

Data Analysis and Visualisation with R

  • Visualisation lesson includes:
    • geom_point(), geom_jitter(), geom_boxplot(), geom_bar()
    • facet_wrap()
    • themes and customization
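
The pieces listed above combine roughly as in the following sketch: one geom, one facet, and a theme. The data frame `households` and its columns are invented for illustration; the lesson's own plots use the workshop dataset.

```r
library(ggplot2)

# Hypothetical data: eight households across two villages.
households <- data.frame(
  village = rep(c("A", "B"), each = 4),
  members = c(3, 7, 4, 6, 5, 2, 8, 4),
  rooms   = c(1, 3, 2, 2, 2, 1, 4, 2)
)

# Jittered scatterplot of rooms against household size, one panel per
# village, with axis labels and a built-in theme applied.
p <- ggplot(households, aes(x = members, y = rooms)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.6) +
  facet_wrap(~ village) +
  labs(x = "Household members", y = "Rooms") +
  theme_bw()
```

Calling `print(p)` draws the plot. Swapping `geom_jitter()` for `geom_point()` is a quick way to show learners why jitter helps when points overlap.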

27 of 32

Things to watch out for

  • Dependencies between episodes - the visualisation episode needs the data produced in the reshaping section
  • Demotivation and burnout in first two episodes - before real data
  • Rushing!!!!

28 of 32

Questions / Discussion

29 of 32

Preparing to Teach

  • Read the Instructor Notes for the lessons you’re teaching (1, 2, 3).

  • Read through the materials and test out each line of code.

  • Join an Instructor Discussion session for general questions about teaching Carpentry workshops.

  • Other strategies?

30 of 32

What next?

  • You now have priority when signing up to teach Social Science workshops! Make sure you’re on the Instructor mailing list to hear about teaching opportunities.

  • Please leave feedback for the lesson developers and other Instructors via issues and PRs to the lesson repositories. If you’re not sure how, check out these instructions.

31 of 32

How to get help?

  • Check the Instructor Notes for each lesson to see if others have had the same problem (1, 2, 3).

  • Contact the dc-socialsci Slack channel with questions about the lessons.

  • For broader discussion of the lessons, you can email the social science curriculum list.

  • For questions about workshops and teaching, contact team@carpentries.org.

32 of 32

Questions?