Data Manipulation

Lucy D’Agostino McGowan

Wake Forest University

Dr. Lucy D’Agostino McGowan ● STA 112

Dataframe

How many rows?

How many columns?

Here is one observation

Looking at the column green, how many sides does the observation have?
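The questions above can be sketched in R. The data frame name `shapes` and every value in it are hypothetical stand-ins for the pictured data (each value is a number of sides); only the color column names come from the slides.

```r
# Hypothetical data: each row is one observation, each column is a color,
# and each value is the number of sides of the shape in that cell.
shapes <- data.frame(
  red    = c(3, 4, 5),
  yellow = c(4, 3, 6),
  green  = c(5, 6, 3),
  blue   = c(4, 3, 5)
)

n_rows      <- nrow(shapes)     # how many rows (observations)?
n_cols      <- ncol(shapes)     # how many columns (variables)?
first_obs   <- shapes[1, ]      # one observation: the first row
green_sides <- shapes$green[1]  # number of sides in the green column for that row
```

With these made-up values, `n_rows` is 3, `n_cols` is 4, and `green_sides` is 5.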

Logical Operators

Test         Meaning
x < y        x is less than y
x > y        x is greater than y
x <= y       x is less than or equal to y
x >= y       x is greater than or equal to y
x == y       x is exactly equal to y
x != y       x is not equal to y
x %in% y     x is contained in y
is.na(x)     x is missing
!is.na(x)    x is not missing
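A short sketch of how a few of these tests behave on a vector. The values of `x` and `y` are hypothetical; note that comparisons with a missing value return `NA`, while `%in%` and `is.na()` always return `TRUE` or `FALSE`.

```r
x <- c(2, 4, NA, 6)  # hypothetical values; NA marks a missing value
y <- 4

less_than   <- x < y           # TRUE FALSE NA FALSE
equal_to    <- x == y          # FALSE TRUE NA FALSE
in_set      <- x %in% c(4, 6)  # FALSE TRUE FALSE TRUE
is_missing  <- is.na(x)        # FALSE FALSE TRUE FALSE
not_missing <- !is.na(x)       # TRUE TRUE FALSE TRUE
```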

Boolean Operators

Operator     Meaning
a & b        and
a | b        or
!a           not
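As a quick illustration, these operators work element by element on logical vectors. The vectors `a` and `b` below are hypothetical.

```r
a <- c(TRUE, TRUE, FALSE)   # hypothetical logical vectors
b <- c(TRUE, FALSE, FALSE)

both   <- a & b  # TRUE only where both are TRUE:     TRUE FALSE FALSE
either <- a | b  # TRUE where at least one is TRUE:   TRUE TRUE FALSE
not_a  <- !a     # flips each value:                  FALSE FALSE TRUE
```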

filter

include rows based on one or more logical statements

take your data frame and then filter it: only include rows where the shape in the red column has three sides (a triangle) OR the shape in the green column has more than 4 sides (a pentagon or hexagon)

data |>
  filter(red == 3 | green > 4)

data           take your data frame
|>             and then
filter(...)    filter it (only include rows where)
red == 3       the shape in the red column has three sides (a triangle)
|              OR
green > 4      the shape in the green column has more than 4 sides (a pentagon or hexagon)
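A runnable sketch of this filter, assuming the verbs come from the dplyr package (the slides do not name it). The data frame `shapes` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical data: each value is a number of sides.
shapes <- data.frame(
  red    = c(3, 4, 5),
  yellow = c(4, 3, 6),
  green  = c(5, 6, 3),
  blue   = c(4, 3, 5)
)

# Keep rows where red is a triangle OR green has more than 4 sides.
kept <- shapes |>
  filter(red == 3 | green > 4)
```

With these made-up values, the first row passes because `red == 3`, the second passes because `green > 4`, and the third row is dropped because it meets neither condition.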

data |>
  filter(yellow == 4 | blue > 3)

data |>
  filter(yellow > 3 & blue > 3)

data |>
  filter(yellow > 3, blue > 3)

conditions separated by commas inside filter() are combined with AND, so this is the same as using &
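The equivalence of the comma and `&` can be checked directly. This is a sketch assuming dplyr is installed; `shapes` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical data: each value is a number of sides.
shapes <- data.frame(
  red    = c(3, 4, 5),
  yellow = c(4, 3, 6),
  green  = c(5, 6, 3),
  blue   = c(4, 3, 5)
)

with_amp   <- shapes |> filter(yellow > 3 & blue > 3)  # explicit AND
with_comma <- shapes |> filter(yellow > 3, blue > 3)   # comma: also AND

same_rows <- isTRUE(all.equal(with_amp, with_comma))   # do both give the same result?
```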

select

include columns by name (or exclude them with a minus sign)

data |>
  select(red, yellow, green)

data |>
  select(-green)
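A small sketch of both forms, assuming dplyr is installed. The data frame `shapes` and its values are hypothetical.

```r
library(dplyr)

# Hypothetical data: each value is a number of sides.
shapes <- data.frame(
  red    = c(3, 4, 5),
  yellow = c(4, 3, 6),
  green  = c(5, 6, 3),
  blue   = c(4, 3, 5)
)

kept    <- shapes |> select(red, yellow, green)  # keep only these columns
dropped <- shapes |> select(-green)              # keep everything except green

names(kept)     # "red" "yellow" "green"
names(dropped)  # "red" "yellow" "blue"
```

Both calls keep every row; only the set of columns changes.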

mutate

create new columns

data |>
  mutate(purple = c(4, 4, 5))

data |>
  mutate(purple = c(3, 5, 4))
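A sketch of adding a column from a vector, assuming dplyr is installed. `shapes` and its values are hypothetical. The new vector needs one value per row (or a single value that is recycled), which is why the example data has three rows.

```r
library(dplyr)

# Hypothetical data with three rows.
shapes <- data.frame(
  red  = c(3, 4, 5),
  blue = c(4, 3, 5)
)

# Add a purple column: one value for each of the three rows.
with_purple <- shapes |>
  mutate(purple = c(4, 4, 5))
```

`mutate()` returns a new data frame; `shapes` itself is unchanged unless the result is assigned back to it.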

ifelse

ifelse(logical_test, value_if_true, value_if_false)
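A minimal sketch of `ifelse()` on its own, outside a data frame. The vector `sides` and the labels are hypothetical; the test is evaluated once per element.

```r
sides <- c(3, 4, 5)  # hypothetical numbers of sides

# For each element: if it is greater than 3 use the first label, otherwise the second.
label <- ifelse(sides > 3, "more than three", "three or fewer")
# "three or fewer" "more than three" "more than three"
```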

data |>
  mutate(blue = ifelse(red > 3, 4, 5))

data |>
  mutate(blue = ifelse(purple == 4, 3, 6))

data |>
  mutate(orange = ifelse(blue == 6, 4, 3))

data |>
  mutate(orange = ifelse(blue == 6, 4, 3),
         green = orange + 1)
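The last example relies on two behaviors worth seeing in action: a column created in a `mutate()` call can be used later in the same call, and assigning to an existing column name overwrites it. A sketch assuming dplyr is installed, with hypothetical data:

```r
library(dplyr)

# Hypothetical data; green already exists and will be overwritten.
shapes <- data.frame(
  blue  = c(6, 3, 6),
  green = c(5, 6, 3)
)

result <- shapes |>
  mutate(orange = ifelse(blue == 6, 4, 3),  # 4 where blue is 6, otherwise 3
         green  = orange + 1)               # uses the orange column made just above
```

With these made-up values, `orange` becomes 4, 3, 4 and `green` is replaced by 5, 4, 5.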

summarize

compute a table of summaries

Summary statistics

Function                     Meaning
min                          minimum
max                          maximum
mean                         average
sd                           standard deviation
sum                          sum
quantile(x, probs = 0.25)    quantile; set probs between 0 and 1
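These functions can be tried on a plain vector before using them inside `summarize()`. The vector `x` below is hypothetical.

```r
x <- c(3, 4, 5, 6)  # hypothetical values

smallest <- min(x)   # 3
largest  <- max(x)   # 6
average  <- mean(x)  # 4.5
spread   <- sd(x)    # sample standard deviation
total    <- sum(x)   # 18
first_q  <- quantile(x, probs = 0.25)  # 25th percentile (R's default method)
```

If `x` contains missing values, most of these return `NA` unless `na.rm = TRUE` is supplied, for example `mean(x, na.rm = TRUE)`.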

data |>
  summarize(max(purple))

data |>
  summarize(min(red),
            min(green),
            min(blue),
            min(orange))

data |>
  summarize(max(red),
            max(blue),
            min(orange))
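A sketch of `summarize()` collapsing a data frame to a single row, assuming dplyr is installed. The data and the output column names `max_red` and `min_blue` are hypothetical; naming each summary is optional but makes the result easier to use.

```r
library(dplyr)

# Hypothetical data: each value is a number of sides.
shapes <- data.frame(
  red  = c(3, 4, 5),
  blue = c(4, 3, 5)
)

summary_row <- shapes |>
  summarize(max_red  = max(red),   # largest value in red
            min_blue = min(blue))  # smallest value in blue
```

With these made-up values the result has one row, with `max_red` equal to 5 and `min_blue` equal to 3.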

group_by

put rows into groups based on values in column(s)

data |>
  group_by(blue) |>
  summarize(max(red))

data |>
  group_by(blue) |>
  summarize(min(red))

data |>
  group_by(blue) |>
  summarize(min(red),
            max(green))

data |>
  group_by(orange) |>
  summarize(mean(blue))

data |>
  group_by(orange, purple) |>
  summarize(min(blue))
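A sketch of a grouped summary, assuming dplyr is installed. The data and the output name `max_red` are hypothetical; `blue` repeats values so that there is more than one row per group.

```r
library(dplyr)

# Hypothetical data: two rows where blue is 3 and two where blue is 4.
shapes <- data.frame(
  red  = c(3, 4, 5, 6),
  blue = c(3, 3, 4, 4)
)

by_blue <- shapes |>
  group_by(blue) |>              # one group per distinct value of blue
  summarize(max_red = max(red))  # one summary row per group
```

With these made-up values the result has two rows: the largest `red` is 4 in the `blue == 3` group and 6 in the `blue == 4` group.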

combine steps

data |>
  filter(blue > 3) |>
  select(red, yellow, blue) |>
  mutate(green = blue - 1)

data |>
  filter(blue > 4) |>
  summarize(max(blue))
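Both pipelines can be run end to end on a small example. This is a sketch assuming dplyr is installed; `shapes`, its values, and the output name `max_blue` are hypothetical.

```r
library(dplyr)

# Hypothetical data: each value is a number of sides.
shapes <- data.frame(
  red    = c(3, 4, 5),
  yellow = c(4, 3, 6),
  green  = c(5, 6, 3),
  blue   = c(4, 3, 5)
)

# Each step receives the result of the step before it.
piped <- shapes |>
  filter(blue > 3) |>            # keep rows 1 and 3
  select(red, yellow, blue) |>   # drop the original green column
  mutate(green = blue - 1)       # add a new green column from blue

biggest <- shapes |>
  filter(blue > 4) |>
  summarize(max_blue = max(blue))
```

Order matters here: `select()` removes the original `green` before `mutate()` creates a new one, so the final `green` holds `blue - 1` (3 and 4 with these values).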
