Data Manipulation
Lucy D’Agostino McGowan
Wake Forest University
Dr. Lucy D’Agostino McGowan ● STA 112
Dataframe
How many rows?
How many columns?
Here is one observation
Looking at the column green, how many sides does the observation have?
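The slide figure is not reproduced here, so the questions above can be tried on a small stand-in. This is a minimal sketch: the data frame name `shapes` and all of its values are hypothetical, with each column a color and each value a number of sides.

```r
# Hypothetical data: one column per color, one value per shape (number of sides)
shapes <- data.frame(
  red    = c(3, 4, 5),
  yellow = c(4, 3, 5),
  green  = c(5, 6, 3),
  blue   = c(4, 6, 3)
)

nrow(shapes)      # how many rows (observations)?
ncol(shapes)      # how many columns (variables)?
shapes[2, ]       # one observation: the second row
shapes$green[2]   # the green value for that observation
```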
Logical Operators
Test | Meaning |
x < y | x is less than y |
x > y | x is greater than y |
x <= y | x is less than or equal to y |
x >= y | x is greater than or equal to y |
x == y | x is exactly equal to y |
x != y | x is not equal to y |
x %in% y | x is contained in y |
is.na(x) | x is missing |
!is.na(x) | x is not missing |
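Each test in the table returns TRUE or FALSE for every element it is given. A minimal sketch on a hypothetical vector `x` (the values are made up for illustration):

```r
# Hypothetical side counts; NA marks a missing value
x <- c(3, 4, NA, 6)

less_than_4 <- x < 4            # TRUE FALSE NA FALSE
in_set      <- x %in% c(3, 6)   # TRUE FALSE FALSE TRUE
missing     <- is.na(x)         # FALSE FALSE TRUE FALSE
not_missing <- !is.na(x)        # TRUE TRUE FALSE TRUE

x == 4   # comparing with a missing value gives NA, not FALSE
x != 4
```

Note that `x == NA` does not find missing values; `is.na(x)` is the test for that.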
Dr. Lucy D’Agostino McGowan ● STA 112
Boolean Operators
Operator | Meaning |
a & b | and |
a | b | or |
!a | not |
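The three operators can be checked on every combination of TRUE and FALSE. A small sketch, where the vectors `a` and `b` are hypothetical:

```r
# All four combinations of two logical values
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

both   <- a & b   # TRUE FALSE FALSE FALSE
either <- a | b   # TRUE TRUE TRUE FALSE
not_a  <- !a      # FALSE FALSE TRUE TRUE
```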
filter
include rows based on one or more logical statements
Take your data frame and then filter it (only include rows where) the value in the red column is 3 (triangles) OR the value in the green column is greater than 4 (pentagons, hexagons).
data |>
  filter(red == 3 | green > 4)
data: take your data frame
|>: and then
filter(): filter it (only include rows where)
red == 3: the value in the red column is 3 (triangles)
|: OR
green > 4: the value in the green column is greater than 4 (pentagons, hexagons)
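To see which rows survive the filter, here is a minimal sketch on hypothetical data. The data frame `shapes` and its values are assumptions, not the data from the slides.

```r
library(dplyr)

# Hypothetical data: number of sides for the red and green shapes in each row
shapes <- data.frame(
  red   = c(3, 4, 5),
  green = c(4, 6, 3)
)

kept <- shapes |>
  filter(red == 3 | green > 4)

kept
# Row 1 is kept because red == 3.
# Row 2 is kept because green > 4.
# Row 3 is dropped because neither condition is TRUE.
```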
data |>
  filter(yellow == 4 | blue > 3)
data |>
  filter(yellow > 3 & blue > 3)
data |>
  filter(yellow > 3, blue > 3)
filter() combines conditions separated by commas with "and", so this is the same as using &
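The comma and `&` forms can be compared directly. A short sketch, assuming a hypothetical data frame `shapes` with made-up values:

```r
library(dplyr)

# Hypothetical data
shapes <- data.frame(
  yellow = c(4, 3, 5),
  blue   = c(4, 6, 3)
)

with_and   <- shapes |> filter(yellow > 3 & blue > 3)
with_comma <- shapes |> filter(yellow > 3, blue > 3)

# Both versions keep only the first row, where both conditions are TRUE
with_and
with_comma
identical(with_and, with_comma)
```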
select
include columns by name; a minus sign in front of a name excludes that column
data |>
  select(red, yellow, green)
data |>
  select(-green)
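A quick way to see what select() does is to look at the column names before and after. This sketch uses a hypothetical data frame `shapes`; the values are made up.

```r
library(dplyr)

# Hypothetical data with four color columns
shapes <- data.frame(
  red    = c(3, 4, 5),
  yellow = c(4, 3, 5),
  green  = c(5, 6, 3),
  blue   = c(4, 6, 3)
)

kept_cols    <- shapes |> select(red, yellow, green) |> names()
dropped_cols <- shapes |> select(-green) |> names()

kept_cols      # "red" "yellow" "green"
dropped_cols   # "red" "yellow" "blue"
```

The number of rows is unchanged in both cases; only the columns differ.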
mutate
create new columns
data |>
  mutate(purple = c(4, 4, 5))
data |>
  mutate(purple = c(3, 5, 4))
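A minimal sketch of adding a column, assuming a hypothetical three-row data frame `shapes`. The new vector needs one value per row (or a single value, which is repeated for every row).

```r
library(dplyr)

# Hypothetical data with three rows
shapes <- data.frame(red = c(3, 4, 5))

with_purple <- shapes |>
  mutate(purple = c(4, 4, 5))   # one value for each of the three rows

with_purple   # now has two columns: red and purple
```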
ifelse
ifelse(logical_test,
value_if_true,
value_if_false)
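ifelse() checks the test for each element and returns one value per element. A small sketch on a hypothetical vector `sides` (the labels are made up for illustration):

```r
# Hypothetical side counts
sides <- c(3, 4, 5)

labels <- ifelse(sides > 3, "more than three", "three or fewer")

labels
# "three or fewer" "more than three" "more than three"
```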
data |>
  mutate(blue =
           ifelse(red > 3, 4, 5))
data |>
  mutate(blue =
           ifelse(purple == 4, 3, 6))
data |>
  mutate(orange =
           ifelse(blue == 6, 4, 3))
data |>
  mutate(orange =
           ifelse(blue == 6, 4, 3),
         green = orange + 1)
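The last call creates two columns at once, and the second one uses the first. A sketch on hypothetical data (the data frame `shapes` and its values are assumptions):

```r
library(dplyr)

# Hypothetical data
shapes <- data.frame(blue = c(4, 6, 3))

result <- shapes |>
  mutate(orange = ifelse(blue == 6, 4, 3),   # 4 where blue is 6, otherwise 3
         green  = orange + 1)                # uses the orange column created just above

result$orange   # 3 4 3
result$green    # 4 5 4
```

Columns inside one mutate() are created in order, so `green` can refer to `orange`.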
summarize
compute a table of summaries
Summary statistics
Function | Meaning |
min | minimum |
max | maximum |
mean | average |
sd | standard deviation |
sum | sum (total) of the values |
quantile(x, probs = 0.25) | quantile; set probs to a value between 0 and 1 |
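Each of these functions takes a vector and returns a summary of it. A minimal sketch on a hypothetical vector `sides`:

```r
# Hypothetical side counts
sides <- c(3, 4, 5, 6)

min(sides)                      # 3
max(sides)                      # 6
mean(sides)                     # 4.5
sd(sides)                       # sample standard deviation
sum(sides)                      # 18
quantile(sides, probs = 0.25)   # first quartile, using R's default method
```

If the vector contains missing values, these functions return NA unless `na.rm = TRUE` is added, for example `mean(sides, na.rm = TRUE)`.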
data |>
  summarize(max(purple))
data |>
  summarize(min(red),
            min(green),
            min(blue),
            min(orange))
data |>
  summarize(max(red),
            max(blue),
            min(orange))
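summarize() collapses the data frame to a single row of summaries. The sketch below uses hypothetical data; naming each summary (`max_red` and so on) is a choice made here to give the result readable column names, not something the slides do.

```r
library(dplyr)

# Hypothetical data
shapes <- data.frame(
  red    = c(3, 4, 5),
  blue   = c(4, 6, 3),
  orange = c(3, 4, 3)
)

summary_table <- shapes |>
  summarize(max_red    = max(red),
            max_blue   = max(blue),
            min_orange = min(orange))

summary_table   # one row: max_red = 5, max_blue = 6, min_orange = 3
```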
group_by
put rows into groups based on values in column(s)
data |>
  group_by(blue) |>
  summarize(max(red))
data |>
  group_by(blue) |>
  summarize(min(red))
data |>
  group_by(blue) |>
  summarize(min(red),
            max(green))
data |>
  group_by(orange) |>
  summarize(mean(blue))
data |>
  group_by(orange, purple) |>
  summarize(min(blue))
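After group_by(), summarize() returns one row per group instead of one row overall. A sketch on hypothetical data (the data frame `shapes`, its values, and the column name `max_red` are assumptions):

```r
library(dplyr)

# Hypothetical data: two rows where blue is 4 and two where blue is 6
shapes <- data.frame(
  blue = c(4, 4, 6, 6),
  red  = c(3, 5, 4, 6)
)

by_blue <- shapes |>
  group_by(blue) |>
  summarize(max_red = max(red))

by_blue
# blue = 4: max_red = 5
# blue = 6: max_red = 6
```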
combine steps
data |>
  filter(blue > 3) |>
  select(red, yellow, blue) |>
  mutate(green = blue - 1)
data |>
  filter(blue > 4) |>
  summarize(max(blue))
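Each step receives the result of the step before it. The first pipeline above can be traced on hypothetical data (the data frame `shapes` and its values are assumptions):

```r
library(dplyr)

# Hypothetical data
shapes <- data.frame(
  red    = c(3, 4, 5),
  yellow = c(4, 3, 5),
  green  = c(5, 6, 3),
  blue   = c(4, 6, 3)
)

result <- shapes |>
  filter(blue > 3) |>             # keeps rows 1 and 2 (blue is 4 and 6)
  select(red, yellow, blue) |>    # drops the original green column
  mutate(green = blue - 1)        # adds a new green column: 3 and 5

result
```

Because select() removed the original `green` column first, the `green` in the result is the newly computed one.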