1 of 47

Data Wrangling in R

Instructor: Heather Shimon

Content: Tobin Magle, PhD

Based on

http://www.datacarpentry.org/R-ecology-lesson/03-dplyr.html

2 of 47

Outline

  • Introduce the tidyverse package dplyr
  • Importing and exporting data with read_csv() and write_csv()
  • 7 dplyr verbs for data manipulation
    • select(), filter(), mutate(), group_by(), summarize(), arrange(), count()
  • Combining verbs using pipes %>%

3 of 47

Prerequisites

  • Install R and R studio

http://www.datacarpentry.org/R-ecology-lesson/index.html#setup_instructions

  • See the R basics lesson if you’re unfamiliar with R or R studio

https://researchguides.library.wisc.edu/R/basics

4 of 47

Open R Project from R Basics workshop

  • Open RStudio
  • File > Open Project
  • Browse for uw-r-workshops folder > Open
  • Select uw-r-workshops R Project > Open

5 of 47

Or create R Project and folders

  • Open RStudio
  • File > New Project > New Directory > New Project
  • For the “Directory name” enter the project name: uw-r-workshops
  • Leave the subdirectory as your desktop for today.
    • If your desktop is not the default use “Browse” to find your desktop (you might need to find your desktop in OneDrive) > Select “Open”
  • Select “Create Project”

6 of 47

Create folders in your R Project

  • In the Files pane, click the Plus Folder icon to create a new folder
  • Make 5 new folders:
    • data_raw
    • data_processed
    • scripts
    • plots
    • reports

7 of 47

File structure in Files pane

  • uw-r-workshops.Rproj
    • data_raw
    • data_processed
    • scripts
    • plots
    • reports

8 of 47

Create a new R Script

  • File > New File > R Script
  • Use the Save icon to save it in your scripts folder as: 02-wrangling

9 of 47

Installing packages

  • Function: install.packages()
  • Arguments: package name
  • Downloads packages from CRAN
  • Install once per machine
  • Example: install.packages(“tidyverse”)

10 of 47

Installing packages

Code: install.packages("tidyverse")

Output in Console:

> install.packages("tidyverse")

trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.3/tidyverse_2.0.0.zip'

Content type 'application/zip' length 430874 bytes (420 KB)

downloaded 420 KB

package ‘tidyverse’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in

C:\Users\shimon\AppData\Local\Temp\RtmpgZXpZ5\downloaded_packages

11 of 47

Loading packages

  • Function: library()
  • Arguments: package name
  • Gives R access to functions in the package
  • Load packages every time you restart R
  • Example: library(tidyverse)

12 of 47

Loading packages

Code: library("tidyverse")

Output in Console:

13 of 47

What is the tidyverse?

  • “umbrella-package” that includes packages tidyr, dplyr, ggplot2, tibble, etc.

  • powerful tools for manipulating and visualizing data

  • Built for tabular data

  • Combine verbs using pipes

14 of 47

Download the dataset

  • Same dataset as used in R Basics, DO NOT download it again if you already have it
  • Go to your desktop and open the uw-r-workshops folder
  • Put the dataset named raw_surveys in your data_raw folder

15 of 47

Import dataset in tidyverse

  • Function: read_csv() – loads contents of a .csv file
  • Arguments: a file path
  • Output: a “tibble”
  • Format: read_csv(“filePath/fileName.csv”)
  • Code: read_csv("data_raw/raw_surveys.csv")

16 of 47

Import dataset and save it to an object

surveys <- read_csv("data_raw/raw_surveys.csv")

Output:

17 of 47

Data set: survey of small animals

  • Stored in a .csv file
  • Rows: observations of individual animals
  • Columns: variables that describe the animals
    • Species, date, sex, etc.

18 of 47

dplyr functions/verbs

  • First argument is always the dataset
    • Tibble or data frame
  • Output is a new table
    • Doesn’t change the original data
  • Must save the output using <-
    • Example: new_df <- verb(old_df, value)

19 of 47

select()

  • Selects columns from a data frame
  • Arguments: data frame and columns to be kept
  • Output: data with only the specified columns
  • Example: select() the columns weight, sex, record_id from the surveys data frame
  • Code: select(surveys, weight, sex, record_id)

20 of 47

Assign the new table to an object

Code:

surveys_select <- select(surveys, weight, sex, record_id)

To see output:

view(surveys_select)

21 of 47

filter()

  • Chooses rows based on values in a column
  • Arguments: data frame and column with logical expression >, <, >=, <=, ==, !=
  • Output: table with rows that match the expression
  • Example: filter() the surveys data frame to observations from the year 1995
  • Code: filter(surveys, year == 1995)

22 of 47

Assign the new table to an object

Code:

surveys_1995 <- filter(surveys, year == 1995)

To see output:

view(surveys_1995)

23 of 47

filter() on multiple variables

  • Each condition is separated by a comma
  • Example: filter() the surveys data frame to observations from the year 1995 and the female sex
  • Code: filter(surveys, year == 1995, sex == 'F')
  • Use != for not equals to
  • Code: (surveys, year != 1995)

24 of 47

Pipe operator %>%

  • Shortcut: Ctrl + Shift + m (Windows) or Cmd + Shift + m (Mac)
  • Takes the output of one function and sends it directly to the next function (do this %>% do this)
  • Format: data frame %>% function1() %>% function2()
  • These two lines function the same:

filter(surveys, weight < 5) # Same as

surveys %>% filter(weight < 5)

25 of 47

Combine functions with %>%

  • Subset the surveys data frame to include individuals with a weight less than 5 grams and retain the columns species_id, sex, weight

  • Code: surveys %>%

filter(weight < 5) %>%

select(species_id, sex, weight)

26 of 47

Exercise: practice pipes

  • Using pipes, subset the surveys data frame to include:
    • individuals collected before the year 1995 and
    • retain only the columns yearsex, and weight.

27 of 47

mutate()

  • Creates a new column
  • Arguments: data frame, definition of a new column including NewColumName = <value>
  • Output: table with a new column
  • Example: With the surveys data frame, use mutate() to create a new column named weight_kg with the value of weight in kilograms instead weight in grams (g/1000)
  • Code: mutate(surveys, weight_kg=weight/1000)

28 of 47

mutate() with pipe

  • Code: surveys %>%

mutate(weight_kg = weight / 1000)

  • Output is a new column named weight_kg

29 of 47

Exercise: data frame challenge

Create a new data frame from the surveys data frame that meets the following criteria:

    • Only include records from the year 1990 and after
    • Contains only the species_id column and a new column called hindfoot_half
    • hindfood_half contains values that are half the hindfoot_length values
    • Save it to an object named surveys_hf_half

Hint: think about how the commands should be ordered to produce this data frame!

30 of 47

group_by() and summarize()

  • Used together to create a summary table of a group
  • Arguments for group_by(): data frame, column names that contain the values to be broken up into categories
  • Arguments for summarize(): data frame, NewColumnName = SummaryFunction(variable) - e.g., mean(), min(), max(), median(), sum()
  • Output: a table with the calculated summary statistic by categorical variable (remember to remove NAs)

31 of 47

Creating a grouped summary table

  • Example: Compute the mean weight of each sex in the surveys data frame
  • group_by() variable is sex; summarize() variable is weight
  • Call new summary column of mean weight: mean_wt
  • Remove NAs with argument , na.rm = TRUE
  • Code: surveys %>%

group_by(sex) %>%

summarize(mean_wt = mean(weight, na.rm = TRUE))

32 of 47

Grouped summary table - output

  • Code: surveys %>%

group_by(sex) %>%

summarize(mean_wt = mean(weight, na.rm = TRUE))

  • Output in Console: # A tibble: 3 × 2

sex mean_wt

<chr> <dbl>

1 F 42.2

2 M 43.0

3 NA 64.7

33 of 47

Grouped summary table, multiple variables

  • Example: Compute the mean weight of each sex AND species_id in the surveys data frame

  • Code:

surveys %>%

group_by(sex, species_id) %>%

summarize(mean_wt = mean(weight, na.rm = TRUE))

34 of 47

Grouped summary table, multiple variables

  • Code: surveys %>%

group_by(sex, species_id) %>%

summarize(mean_wt = mean(weight, na.rm = TRUE))

  • Output in Console: # A tibble: 92 × 3

sex species_id mean_wt

<chr> <chr> <dbl>

1 F BA 9.16

2 F DM 41.6

3 F DO 48.5

35 of 47

Remove missing values: filter(!is.na())

  • Removes the missing values before summarize()
  • Arguments for is.na(): data frame and a column name
  • Output of is.na(): logical vector (missing=TRUE, not missing=FALSE)
  • Add ! to is.na() to remove the NAs
  • Example:

surveys %>%

filter(!is.na(weight))

36 of 47

Remove missing values: filter(!is.na())

  • Example with a grouped summary table
  • Code:

surveys %>%

filter(!is.na(weight)) %>%

group_by(sex, species_id) %>%

summarize(mean_wt = mean(weight))

37 of 47

Add other summary statistics

  • Add min_wt to the summary table with , min_wt = min(weight) and save it to an object called surveys_summary
  • Code:

surveys_summary <- surveys %>%

filter(!is.na(weight)) %>%

group_by(sex, species_id) %>%

summarize(mean_wt = mean(weight),

min_wt = min(weight))

38 of 47

arrange() and count()

  • Sort the values in the summary table with arrange()
    • Arguments: a data frame and a column name
    • Default ascending order, use the desc() function to arrange in descending order

  • Count the # observations per variable using count()
    • Arguments: data frame and categorical variable
    • Output: table with a row for each categorical variable and a column called n with counts

39 of 47

arrange()

  • Example: use arrange() to sort the summary table by the min_wt column in ascending order
  • Code:

surveys %>%

filter(!is.na(weight), !is.na(sex)) %>%

group_by(sex, species_id) %>%

summarize(mean_wt = mean(weight), min_wt = min(weight)) %>%

arrange(min_wt)

40 of 47

arrange() - output

  • Example: sort the summary table by the min_wt column in ascending order

sex species_id mean_wt min_wt

<chr> <chr> <dbl> <dbl>

1 F PF 7.97 4

2 F RM 11.1 4

3 M PF 7.89 4

4 M PP 17.2 4

5 M RM 10.1 4

6 F OT 24.8 5

41 of 47

count()

  • Example: count() the number of observations for each sex in the surveys data frame
  • Code: surveys %>% count(sex)
  • Output:

sex n

<chr> <int>

1 F 15690

2 M 17348

3 NA 1748

42 of 47

count() and arrange() together

  • Example: use count() with arrange() and desc() to sort the table in descending order by species

  • Code:

surveys %>%

count(sex, species) %>%

arrange(species, desc(n))

43 of 47

count() and arrange() together - output

  • Code: surveys %>% count(sex, species) %>%

arrange(species, desc(n))

  • Output in Console: # A tibble: 81 × 3

sex species n

<chr> <chr> <int>

1 F albigula 675

2 M albigula 502

3 NA albigula 75

4 NA audubonii 75

5 F baileyi 1646

44 of 47

write_csv()

  • Arguments: a tibble, a file path
  • Output: a file at the specified file path
  • Format: write_csv(dataFrame, “filepath/fileName.csv”)
  • Example: export the surveys_summary data frame as a .csv file named summary.csv to the data_processed folder
  • Code:

write_csv(surveys_summary, "data_processed/summary.csv")

45 of 47

Need help?

  • Data Science Hub
    • Coding Meetup – Tuesdays & Thursdays, 2:30–4:30 pm
    • More information at: https://datascience.wisc.edu/hub/

  • UW Research Data Services: http://researchdata.wisc.edu/

46 of 47

47 of 47

Take the survey