1 of 30

DATA TRANSFORMATION�R & dplyr

2 of 30

CONTENT

  • Steps in Exploratory Data Analysis
  • First Steps with R and RStudio
  • Our Tool Set
  • Data loading with {readr} / data management with {tibble}
  • Data transformation with {dplyr}
    • Select columns
    • Filter rows
    • Sort rows
    • Add or change columns
    • Aggregate rows
  • Exercise

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube

3 of 30

RECOMMENDED LITERATURE

Wickham, Hadley, and Garrett Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd edition, O’Reilly, 2023. Online verfügbar: https://r4ds.hadley.nz/

Chapter 4 in the online version

Sauer, Sebastian. Moderne Datenanalyse mit R. Springer Gabler, 2019.

�→ Chapter 7

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

4 of 30

STEPS IN EXPLORATORY DATA ANALYSIS

5 of 30

STEPS IN EXPLORATORY DATA ANALYSIS

Source: Wickham, Hadley, and Garrett Grolemund. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. First edition, O’Reilly, 2016. URL: https://r4ds.hadley.nz/diagrams/data-science/base.png

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

6 of 30

FIRST STEPS WITH �R & RStudio

7 of 30

FIRST STEPS WITH R AND RSTUDIO

DESKTOP OR CLOUD

Download, Installation R and RStudio

Walkthrough RStudio

  • Console and script editor
  • Installing packages
  • Projects
  • Environment
  • Previews
  • Getting Help

alternatively

Registration and Login RStudio Cloud

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

8 of 30

FIRST STEPS WITH R AND RSTUDIO

CREATE A NEW PROJECT

All code examples for this course are hosted publicly on GitHub

  • File → New Project → New Directory
  • Choose a location on your computer and enter the name for the new directory
  • Check “Use renv with this project”

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

9 of 30

FIRST STEPS WITH R AND RSTUDIO

CHECKOUT GITHUB REPO

All code examples for this course are hosted publicly on GitHub

  • File → New Project → Version Control → Git
  • Paste the repository’s URL and choose a location on your computer:�https://github.com/winf-hsos/<name_of_repo>.git

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

10 of 30

OUR TOOLSET

11 of 30

OUR TOOLSET

  • Data loading, e.g., with {readr} or {readxl}
  • Data management with {tibble}
  • Data transformation with {dplyr}
    • select()
    • filter()
    • arrange()
    • mutate() / transmute()
    • summarise() / group_by()�
  • Data visualization with {ggplot2}
  • Working Environment(s)
    • R & Python
    • RStudio
    • Databricks (for Big Data)

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

12 of 30

mutate

transmute

group_by

summarize

filter

select

arrange

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

13 of 30

SELECT

REDUCING COLUMNS

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

14 of 30

FILTER

REDUCING ROWS

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

15 of 30

ARRANGE

SORTING ROWS

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

16 of 30

MUTATE

ADD NEW COLUMNS

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

17 of 30

SUMMARIZE

SUMMARIZE ROWS

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

18 of 30

GROUP & SUMMARIZE

GROUP BY VARIABLE AND SUMMARIZE

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

19 of 30

DATA LOADING

{readr}

20 of 30

DATA LOADING

EXERCISE DATA

  • Data loading with {readr} and {readxl} (Excel, CSV), {jsonlite} (JSON), or readRDS (R-format)
  • {janitor} and clean_names for better column names
  • Introductory data sets:

Politician’s Tweets (JSON / RDS)

Campusbier Sales Orders (CSV)

REWE Online Products (CSV)

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

21 of 30

DATA MANAGEMENT

{tibble}

22 of 30

DATA MANAGEMENT

HANDLE THE DATA

  • Manage data with data frames and {tibble}
  • Tibbles as modern data frames
    • Better printing
    • No string conversion into factors
    • No rownames
    • Original column names are kept when loading a tibble
    • Lazy processing

Tibbles or data frames? Both are like tables in a spreadsheet… just in R

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

23 of 30

DATA TRANSFORMATION

{dplyr}

24 of 30

DATA TRANSFORMATION

SELECT COLUMNS

  • Select specific columns with {dplyr}
    • select
      • By name
      • By name pattern (starts_with, ends_with, contains)
      • By position or index (last_col)
      • By set (all_of, any_of)
      • By data type (where(is.numeric))
      • White vs. blacklist (!)

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

25 of 30

DATA TRANSFORMATION

FILTER ROWS

  • Reduce rows with {dplyr}
    • filter
      • Simple filter conditions (==, !=, <, >)
      • Multiple conditions (&, |, !, xor)
      • Set operators (%in%)
      • Missing values (NA, is.na)
      • Simple text searches (str_detect)

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

26 of 30

DATA TRANSFORMATION

ZEILEN SORTIEREN

  • Sort results with {dplyr}
    • arrange
      • Ascending order by one or more columns
      • Descending order (desc or -)

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

27 of 30

DATA TRANSFORMATION

ADD OR CHANGE COLUMNS

  • Add new calculated columns with {dplyr}
    • mutate
      • Add new calculated columns (+, -, /, *, %%, ^, paste0)
      • Vectorized calculations (mean, sum, max, min, lag, lead)
      • Keep only used columns (.keep = "used")
      • Determine position of new columns with .before and .after
    • transmute
      • Add new columns and remove all others (sometimes what we want)

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

28 of 30

DATA TRANSFORMATION

SUMMARIZE ROWS

  • Summarize data with {dplyr}
    • count, tally, distinct for quick aggregations
    • summarize
      • Aggregate data using functions (mean, median, quantile, sd, IQR, mad, sum, max, min, n, n_distinct, first, nth, last)
    • group_by
      • Create groups by which to aggregate
    • The janitor package with tabyl for quick percentages and cross-tables

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn

29 of 30

EXERCISE

30 of 30

CAMPUSBIER SALES ORDERS

AD-HOC EXERCISE

You are new as a managing director in the Campusbier project and are supposed to get a first impression of the business. All you have are two datasets: orders.csv and line_items.csv.

  • How do you approach this unknown dataset?
  • With a partner, come up with at least 3 questions you want to ask the data! Look at the available columns for this!
  • Create R commands to answer the questions (without visualization yet)!

Prof. Dr. Nicolas Meseth | Twitter | Instagram | YouTube | LinkedIn