Data Wrangling in R
Instructor: Heather Shimon
Content: Tobin Magle, PhD
http://www.datacarpentry.org/R-ecology-lesson/03-dplyr.html
Outline
Prerequisites
http://www.datacarpentry.org/R-ecology-lesson/index.html#setup_instructions
https://researchguides.library.wisc.edu/R/basics
Open R Project from R Basics workshop
Or create R Project and folders
Create folders in your R Project
File structure in Files pane
Create a new R Script
Installing packages
Code: install.packages("tidyverse")
Output in Console:
> install.packages("tidyverse")
trying URL 'https://cran.rstudio.com/bin/windows/contrib/4.3/tidyverse_2.0.0.zip'
Content type 'application/zip' length 430874 bytes (420 KB)
downloaded 420 KB
package ‘tidyverse’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\shimon\AppData\Local\Temp\RtmpgZXpZ5\downloaded_packages
Loading packages
Code: library("tidyverse")
What is the tidyverse?
Download the dataset
Import dataset in tidyverse
Import dataset and save it to an object
surveys <- read_csv("data_raw/raw_surveys.csv")
Data set: survey of small animals
dplyr functions/verbs
select()
Assign the new table to an object
Code:
surveys_select <- select(surveys, weight, sex, record_id)
To see output:
view(surveys_select)
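As a minimal sketch of how select() keeps or drops columns, the example below uses a small made-up table (`toy` is hypothetical and only stands in for `surveys`). Listing column names keeps them in the order given; a leading minus sign drops a column instead.

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- tibble(
  record_id = 1:3,
  sex       = c("F", "M", "F"),
  weight    = c(40, 8, 12),
  year      = c(1995, 1995, 1996)
)

# Keep only the named columns, in the order listed
kept <- select(toy, weight, sex, record_id)

# Drop columns by prefixing them with a minus sign
dropped <- select(toy, -record_id, -year)

names(kept)
names(dropped)
```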
filter()
Assign the new table to an object
Code:
surveys_1995 <- filter(surveys, year == 1995)
To see output:
view(surveys_1995)
filter() on multiple variables
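A short sketch of filtering on more than one variable, using a made-up table (`toy` is hypothetical, not the workshop data). Conditions separated by commas must all be true; `|` keeps rows where either condition is true. Rows where a condition evaluates to NA are dropped.

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- tibble(
  record_id = 1:5,
  year      = c(1995, 1995, 1996, 1995, 1997),
  sex       = c("F", "M", "F", "F", NA),
  weight    = c(40, 4, 12, NA, 30)
)

# Comma-separated conditions are combined with AND
females_1995 <- filter(toy, year == 1995, sex == "F")

# Use | for OR: light animals, or anything recorded in 1997
light_or_1997 <- filter(toy, weight < 5 | year == 1997)

females_1995
light_or_1997
```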
Pipe operator %>%
filter(surveys, weight < 5) # Same as
surveys %>% filter(weight < 5)
Combine functions with %>%
surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)
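To illustrate what the pipe does, the sketch below writes the same two steps as a nested call and as a piped chain, on a made-up table (`toy` is hypothetical). The pipe passes the result on its left as the first argument of the function on its right, so both versions should produce the same table.

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- tibble(
  species_id = c("PF", "DM", "RM"),
  sex        = c("F", "M", "M"),
  weight     = c(4, 40, 3),
  year       = c(1995, 1996, 1997)
)

# Nested calls: read from the inside out
nested <- select(filter(toy, weight < 5), species_id, sex, weight)

# Piped: read top to bottom, one step per line
piped <- toy %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

all.equal(nested, piped)
```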
Exercise: practice pipes
mutate()
mutate() with pipe
surveys %>%
  mutate(weight_kg = weight / 1000)
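A minimal sketch of mutate() on a made-up table (`toy` and the `weight_lb` column are hypothetical, added for illustration). A column created inside mutate() can be used by a later expression in the same call, and missing values stay missing.

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- tibble(
  species_id = c("DM", "PF", "DO"),
  weight     = c(40, 8, NA)   # grams
)

toy_kg <- toy %>%
  mutate(weight_kg = weight / 1000,     # grams to kilograms
         weight_lb = weight_kg * 2.2)   # uses the column created just above

toy_kg
```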
Exercise: data frame challenge
Create a new data frame from the surveys data frame that meets the following criteria:
Hint: think about how the commands should be ordered to produce this data frame!
group_by() and summarize()
Creating a grouped summary table
surveys %>%
  group_by(sex) %>%
  summarize(mean_wt = mean(weight, na.rm = TRUE))
Grouped summary table - output
surveys %>%
  group_by(sex) %>%
  summarize(mean_wt = mean(weight, na.rm = TRUE))
  sex   mean_wt
  <chr>   <dbl>
1 F        42.2
2 M        43.0
3 NA       64.7
Grouped summary table, multiple variables
surveys %>%
group_by(sex, species_id) %>%
summarize(mean_wt = mean(weight, na.rm = TRUE))
Grouped summary table, multiple variables - output
surveys %>%
  group_by(sex, species_id) %>%
  summarize(mean_wt = mean(weight, na.rm = TRUE))
  sex   species_id mean_wt
  <chr> <chr>        <dbl>
1 F     BA            9.16
2 F     DM           41.6
3 F     DO           48.5
Remove missing values: filter(!is.na())
surveys %>%
filter(!is.na(weight))
Remove missing values: filter(!is.na())
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_wt = mean(weight))
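The sketch below compares the two ways of handling missing weights, on a made-up table (`toy` is hypothetical). With `na.rm = TRUE`, a group whose weights are all missing still appears, with a mean of NaN; filtering with `!is.na(weight)` first removes that group from the summary. The `.groups = "drop"` argument is included here only to return an ungrouped result.

```r
library(dplyr)

# Hypothetical toy data: every weight for M / PF is missing
toy <- tibble(
  sex        = c("F", "F", "M", "M"),
  species_id = c("DM", "DM", "PF", "PF"),
  weight     = c(40, 44, NA, NA)
)

# Option 1: ignore NAs inside mean(); the all-NA group remains, with NaN
with_na_rm <- toy %>%
  group_by(sex, species_id) %>%
  summarize(mean_wt = mean(weight, na.rm = TRUE), .groups = "drop")

# Option 2: remove rows with missing weight first; the all-NA group disappears
filtered <- toy %>%
  filter(!is.na(weight)) %>%
  group_by(sex, species_id) %>%
  summarize(mean_wt = mean(weight), .groups = "drop")

with_na_rm
filtered
```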
Add other summary statistics
surveys_summary <- surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_wt = mean(weight),
min_wt = min(weight))
arrange() and count()
arrange()
surveys %>%
filter(!is.na(weight), !is.na(sex)) %>%
group_by(sex, species_id) %>%
summarize(mean_wt = mean(weight), min_wt = min(weight)) %>%
arrange(min_wt)
arrange() - output
  sex   species_id mean_wt min_wt
  <chr> <chr>        <dbl>  <dbl>
1 F     PF            7.97      4
2 F     RM           11.1       4
3 M     PF            7.89      4
4 M     PP           17.2       4
5 M     RM           10.1       4
6 F     OT           24.8       5
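arrange() sorts in ascending order by default. As a brief sketch on a made-up table (`toy` is hypothetical), wrapping a column in desc() reverses the order, and additional columns act as tie-breakers.

```r
library(dplyr)

# Hypothetical toy summary table
toy <- tibble(
  species_id = c("PF", "RM", "OT", "DM"),
  min_wt     = c(4, 4, 5, 10)
)

# Largest minimum weight first; ties are broken alphabetically by species_id
descending <- toy %>%
  arrange(desc(min_wt), species_id)

descending
```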
count()
surveys %>%
  count(sex)
  sex       n
  <chr> <int>
1 F     15690
2 M     17348
3 NA     1748
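count() can be read as shorthand for grouping and then counting rows. The sketch below, on a made-up table (`toy` is hypothetical), shows the two forms side by side; the `sort = TRUE` argument of count() orders the result from most to least frequent.

```r
library(dplyr)

# Hypothetical toy data standing in for the surveys table
toy <- tibble(sex = c("F", "M", "F", NA, "M", "F"))

# Short form
by_count <- toy %>%
  count(sex)

# Long form: group, then count the rows in each group with n()
by_summary <- toy %>%
  group_by(sex) %>%
  summarize(n = n())

# Most frequent value first
by_count_sorted <- toy %>%
  count(sex, sort = TRUE)

by_count
by_summary
```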
count() and arrange() together
surveys %>%
count(sex, species) %>%
arrange(species, desc(n))
count() and arrange() together - output
surveys %>%
  count(sex, species) %>%
  arrange(species, desc(n))
  sex   species       n
  <chr> <chr>     <int>
1 F     albigula    675
2 M     albigula    502
3 NA    albigula     75
4 NA    audubonii    75
5 F     baileyi    1646
write_csv()
write_csv(surveys_summary, "data_processed/summary.csv")
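write_csv() does not create missing folders, so the output folder has to exist before the call. The sketch below is a round trip under stated assumptions: it writes a made-up table (`toy_summary` is hypothetical) to a temporary directory rather than the project's data_processed folder, then reads it back to check the result.

```r
library(readr)
library(tibble)

# Hypothetical summary table standing in for surveys_summary
toy_summary <- tibble(
  sex     = c("F", "M"),
  mean_wt = c(42.2, 43.0)
)

# Use a temporary folder here; in the workshop project this would be data_processed/
out_dir <- file.path(tempdir(), "data_processed")
dir.create(out_dir, showWarnings = FALSE)   # make sure the folder exists

out_file <- file.path(out_dir, "summary.csv")
write_csv(toy_summary, out_file)

# Read the file back to confirm what was written
check <- read_csv(out_file, show_col_types = FALSE)
check
```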
Need help?
Take the survey