1 of 47

Data Wrangling in R

Instructor: Heather Shimon

Content: Tobin Magle, PhD

Based on

http://www.datacarpentry.org/R-ecology-lesson/03-dplyr.html

2 of 47

Outline

Introduce the tidyverse package dplyr
Importing and exporting data with read_csv() and write_csv()
7 dplyr verbs for data manipulation

select(), filter(), mutate(), group_by(), summarize(), arrange(), count()

Combining verbs using pipes %>%

3 of 47

Prerequisites

Install R and R studio

http://www.datacarpentry.org/R-ecology-lesson/index.html#setup_instructions

See the R basics lesson if you’re unfamiliar with R or R studio

https://researchguides.library.wisc.edu/R/basics

4 of 47

Open R Project from R Basics workshop

Open RStudio
File > Open Project
Browse for uw-r-workshops folder > Open
Select uw-r-workshops R Project > Open

5 of 47

Or create R Project and folders

Open RStudio
File > New Project > New Directory > New Project
For the “Directory name” enter the project name: uw-r-workshops
Leave the subdirectory as your desktop for today.

If your desktop is not the default use “Browse” to find your desktop (you might need to find your desktop in OneDrive) > Select “Open”

Select “Create Project”

6 of 47

Create folders in your R Project

In the Files pane, click the Plus Folder icon to create a new folder
Make 5 new folders:

data_raw
data_processed
scripts
plots
reports

7 of 47

File structure in Files pane

uw-r-workshops.Rproj

data_raw
data_processed
scripts
plots
reports

8 of 47

Create a new R Script

File > New File > R Script
Use the Save icon to save it in your scripts folder as: 02-wrangling

9 of 47

Installing packages

Function: install.packages()
Arguments: package name
Downloads packages from CRAN
Install once per machine
Example: install.packages(“tidyverse”)

10 of 47

Installing packages

Code: install.packages("tidyverse")

Output in Console:

> install.packages("tidyverse")

trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.3/tidyverse_2.0.0.zip'

Content type 'application/zip' length 430874 bytes (420 KB)

downloaded 420 KB

package ‘tidyverse’ successfully unpacked and MD5 sums checked

The downloaded binary packages are in

C:\Users\shimon\AppData\Local\Temp\RtmpgZXpZ5\downloaded_packages

11 of 47

Loading packages

Function: library()
Arguments: package name
Gives R access to functions in the package
Load packages every time you restart R
Example: library(tidyverse)

12 of 47

Loading packages

Code: library("tidyverse")

Output in Console:

13 of 47

What is the tidyverse?

“umbrella-package” that includes packages tidyr, dplyr, ggplot2, tibble, etc.

powerful tools for manipulating and visualizing data

Built for tabular data

Combine verbs using pipes

In this lesson, we are going to use tidyverse, which is actually a collection of packages, including tidyr, dplyr, ggplot2, tibble, etc

And each package includes many useful functions. These are very powerful but easy tools for manipulating and visualizing data.

You can do all things that tidyverse does in base R but these new functions can make your life easier.

Today we will focus on data manipulation. If you sign up for the workshop next week, we’ll introduce ggplot2 package for visualization.

All the tidyverse packages are built around tidy data tables (tabular data refers to data that is organized in a table with rows and columns. Within the table, the rows represent observations, and the columns represent attributes for those observations).

These functions are designed to work with the pipe operator from the magrittr package, which allows you to make the output of one verb the input of another verb. This is a special operation only in tidyverse.

14 of 47

Download the dataset

Same dataset as used in R Basics, DO NOT download it again if you already have it
Go to your desktop and open the uw-r-workshops folder
Put the dataset named raw_surveys in your data_raw folder

15 of 47

Import dataset in tidyverse

Function: read_csv() – loads contents of a .csv file
Arguments: a file path
Output: a “tibble”
Format: read_csv(“filePath/fileName.csv”)
Code: read_csv("data_raw/raw_surveys.csv")

16 of 47

Import dataset and save it to an object

surveys <- read_csv("data_raw/raw_surveys.csv")

Output:

17 of 47

Data set: survey of small animals

Stored in a .csv file
Rows: observations of individual animals
Columns: variables that describe the animals

Species, date, sex, etc.

18 of 47

dplyr functions/verbs

First argument is always the dataset

Tibble or data frame

Output is a new table

Doesn’t change the original data

Must save the output using <-

Example: new_df <- verb(old_df, value)

19 of 47

select()

Selects columns from a data frame
Arguments: data frame and columns to be kept
Output: data with only the specified columns
Example: select() the columns weight, sex, record_id from the surveys data frame
Code: select(surveys, weight, sex, record_id)

20 of 47

Assign the new table to an object

Code:

surveys_select <- select(surveys, weight, sex, record_id)

To see output:

view(surveys_select)

21 of 47

filter()

Chooses rows based on values in a column
Arguments: data frame and column with logical expression >, <, >=, <=, ==, !=
Output: table with rows that match the expression
Example: filter() the surveys data frame to observations from the year 1995
Code: filter(surveys, year == 1995)

We’ve seen how to choose particular columns using select, now let’s look at how to pick rows using filter.

You use filter when you want to look at a subset of your observations, based on certain condition.

· Filter chooses rows based on specified criteria

· It takes a tibble and a logical expression as arguments (== is a logical equal, it compares two values year and 1995, if the answer is yes, it would return TRUE, then this observation would be included in our subset. If the answer is no, this observation would be excluded)

· And it outputs a tibble with only the rows that meet the relational expression

· For example, you can specify that you only want rows where the year is equal to 1995

· Let’s see select in action

Demo: filter

# Let’s say we only want to look at records taken in 1995

filter(surveys, year == 1995)

#same as

surveys[surveys$year == 1995, ]

22 of 47

Assign the new table to an object

Code:

surveys_1995 <- filter(surveys, year == 1995)

To see output:

view(surveys_1995)

23 of 47

filter() on multiple variables

Each condition is separated by a comma
Example: filter() the surveys data frame to observations from the year 1995 and the female sex
Code: filter(surveys, year == 1995, sex == 'F')
Use != for not equals to
Code: (surveys, year != 1995)

24 of 47

Pipe operator %>%

Shortcut: Ctrl + Shift + m (Windows) or Cmd + Shift + m (Mac)
Takes the output of one function and sends it directly to the next function (do this %>% do this)
Format: data frame %>% function1() %>% function2()
These two lines function the same:

filter(surveys, weight < 5) # Same as

surveys %>% filter(weight < 5)

25 of 47

Combine functions with %>%

Subset the surveys data frame to include individuals with a weight less than 5 grams and retain the columns species_id, sex, weight

Code: surveys %>%

filter(weight < 5) %>%

select(species_id, sex, weight)

26 of 47

Exercise: practice pipes

Using pipes, subset the surveys data frame to include:

individuals collected before the year 1995 and
retain only the columns year, sex, and weight.�

27 of 47

mutate()

Creates a new column
Arguments: data frame, definition of a new column including NewColumName = <value>
Output: table with a new column
Example: With the surveys data frame, use mutate() to create a new column named weight_kg with the value of weight in kilograms instead weight in grams (g/1000)
Code: mutate(surveys, weight_kg=weight/1000)

28 of 47

mutate() with pipe

Code: surveys %>%

mutate(weight_kg = weight / 1000)

Output is a new column named weight_kg

29 of 47

Exercise: data frame challenge

Create a new data frame from the surveys data frame that meets the following criteria:

Only include records from the year 1990 and after
Contains only the species_id column and a new column called hindfoot_half
hindfood_half contains values that are half the hindfoot_length values
Save it to an object named surveys_hf_half

Hint: think about how the commands should be ordered to produce this data frame!�

Let’s do another exercise combining the 3 verbs with pipes

Exercise:

Create a new data frame from the survey data that meets the following criteria:

Only include records from the year 1990 and after
Contains only the species_id column and a new column called hindfoot_half
hindfood_half contains values that are half the hindfoot_length values
Save it to an object named surveys_hf_half

Hint: think about how the commands should be ordered to produce this data frame!

Solution:

· Start with the surveys data frame

· First, let’s filter the rows: filter(year >= 1990)

· Then create hindfoot half: mutate(hindfoot_half = hindfoot_length/2)

· Finally, select the columns:

· Why did I do this in this order?

o Have to do select last because the other 2 depend on year and hindfoot length, which aren’t included in the output of the select statement

o Mutate and filter could be switched, but reducing the number of rows first makes the computation a bit more efficient.

30 of 47

group_by() and summarize()

Used together to create a summary table of a group
Arguments for group_by(): data frame, column names that contain the values to be broken up into categories
Arguments for summarize(): data frame, NewColumnName = SummaryFunction(variable) - e.g., mean(), min(), max(), median(), sum()
Output: a table with the calculated summary statistic by categorical variable (remember to remove NAs)

31 of 47

Creating a grouped summary table

Example: Compute the mean weight of each sex in the surveys data frame
group_by() variable is sex; summarize() variable is weight
Call new summary column of mean weight: mean_wt
Remove NAs with argument , na.rm = TRUE
Code: surveys %>%

group_by(sex) %>%

summarize(mean_wt = mean(weight, na.rm = TRUE))

32 of 47

Grouped summary table - output

Code: surveys %>%

group_by(sex) %>%

summarize(mean_wt = mean(weight, na.rm = TRUE))

Output in Console: # A tibble: 3 × 2

sex mean_wt

1 F 42.2

2 M 43.0

3 NA 64.7

33 of 47

Grouped summary table, multiple variables

Example: Compute the mean weight of each sex AND species_id in the surveys data frame

Code:

surveys %>%

group_by(sex, species_id) %>%

summarize(mean_wt = mean(weight, na.rm = TRUE))

34 of 47

Grouped summary table, multiple variables

Code: surveys %>%

group_by(sex, species_id) %>%

summarize(mean_wt = mean(weight, na.rm = TRUE))

Output in Console: # A tibble: 92 × 3

sex species_id mean_wt

1 F BA 9.16

2 F DM 41.6

3 F DO 48.5

35 of 47

Remove missing values: filter(!is.na())

Removes the missing values before summarize()
Arguments for is.na(): data frame and a column name
Output of is.na(): logical vector (missing=TRUE, not missing=FALSE)
Add ! to is.na() to remove the NAs
Example:

surveys %>%

filter(!is.na(weight))

Another useful function for data cleaning in combination with filter is the is.na() function

This function takes a column as input
And returns a T/F vector of the same length
This vector value true where there is a missing value and false where the input vector has a value
This T/F vector combined with the not operator (!) can be used to remove records with missing values from a dataset.

Exclamation mark will make FALSE become TRUE, TRUE become FALSE

Demo:

Show the T/F vector

!is.na(weight)

Use it as input to filter so that all weight values are NA

surveys %>%

filter(is.na(weight))

Add the not operator so that no weight values are NA

surveys %>%

filter(!is.na(weight))

Then you can add other dplyr verbs, and because the NAs have been eliminated,

surveys %>%

filter(!is.na(weight)) %>%

group_by(sex, species_id) %>%

summarize(mean_wt = mean(weight))

36 of 47

Remove missing values: filter(!is.na())

Example with a grouped summary table
Code:

surveys %>%

filter(!is.na(weight)) %>%

group_by(sex, species_id) %>%

summarize(mean_wt = mean(weight))

Another useful function for data cleaning in combination with filter is the is.na() function

This function takes a column as input
And returns a T/F vector of the same length
This vector value true where there is a missing value and false where the input vector has a value
This T/F vector combined with the not operator (!) can be used to remove records with missing values from a dataset.

Exclamation mark will make FALSE become TRUE, TRUE become FALSE

Demo:

Show the T/F vector

!is.na(weight)

Use it as input to filter so that all weight values are NA

surveys %>%

filter(is.na(weight))

Add the not operator so that no weight values are NA

surveys %>%

filter(!is.na(weight))

Then you can add other dplyr verbs, and because the NAs have been eliminated,

surveys %>%

filter(!is.na(weight)) %>%

group_by(sex, species_id) %>%

summarize(mean_wt = mean(weight))

37 of 47

Add other summary statistics

Add min_wt to the summary table with , min_wt = min(weight) and save it to an object called surveys_summary
Code:

surveys_summary <- surveys %>%

filter(!is.na(weight)) %>%

group_by(sex, species_id) %>%

summarize(mean_wt = mean(weight),

min_wt = min(weight))

38 of 47

arrange() and count()

Sort the values in the summary table with arrange()

Arguments: a data frame and a column name
Default ascending order, use the desc() function to arrange in descending order

Count the # observations per variable using count()

Arguments: data frame and categorical variable
Output: table with a row for each categorical variable and a column called n with counts

When making these summary tables, the data don’t always appear in the order that we would like.

The arrange() function orders the results in ascending order
It takes a variable as input
If you want to arrange them in descending order, wrap the variable name in the desc() function
The output is a table with the same data, but ordered in the way you specified.

And a common task in data wrangling is counting the number of observations in a category. In this case, use the count() function

Count takes the data and a categorical variable as input
And outputs a table with rows for each unique value of a categorical variable, and two columns: one labeled with the variable name containing the unique values, and another with the counts for each category
You can think of this as a special case of the group_by/summarize combination because it’s so frequently used

39 of 47

arrange()

Example: use arrange() to sort the summary table by the min_wt column in ascending order
Code:

surveys %>%

filter(!is.na(weight), !is.na(sex)) %>%

group_by(sex, species_id) %>%

summarize(mean_wt = mean(weight), min_wt = min(weight)) %>%

arrange(min_wt)

40 of 47

arrange() - output

Example: sort the summary table by the min_wt column in ascending order

sex species_id mean_wt min_wt

1 F PF 7.97 4

2 F RM 11.1 4

3 M PF 7.89 4

4 M PP 17.2 4

5 M RM 10.1 4

6 F OT 24.8 5

41 of 47

count()

Example: count() the number of observations for each sex in the surveys data frame
Code: surveys %>% count(sex)
Output:

sex n

1 F 15690

2 M 17348

3 NA 1748

42 of 47

count() and arrange() together

Example: use count() with arrange() and desc() to sort the table in descending order by species

Code:

surveys %>%

count(sex, species) %>%

arrange(species, desc(n))

43 of 47

count() and arrange() together - output

Code: surveys %>% count(sex, species) %>%

arrange(species, desc(n))

Output in Console: # A tibble: 81 × 3

sex species n

1 F albigula 675

2 M albigula 502

3 NA albigula 75

4 NA audubonii 75

5 F baileyi 1646

44 of 47

write_csv()

Arguments: a tibble, a file path
Output: a file at the specified file path
Format: write_csv(dataFrame, “filepath/fileName.csv”)
Example: export the surveys_summary data frame as a .csv file named summary.csv to the data_processed folder
Code:

write_csv(surveys_summary, "data_processed/summary.csv")

45 of 47

Need help?

Data Science Hub

Coding Meetup – Tuesdays & Thursdays, 2:30–4:30 pm
More information at: https://datascience.wisc.edu/hub/

UW Research Data Services: http://researchdata.wisc.edu/

Data Carpentry lesson: http://www.datacarpentry.org/R-ecology-lesson/03-dplyr.html

Data wrangling cheat sheet: http://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

46 of 47

47 of 47

Take the survey

https://forms.gle/ZmW1n1SvktGgHHxPA