Missing Data
Data Science Thing of the Day
https://x.com/Corey_Yanofsky/status/1841583741300289645
“The best thing to do with missing values is not to have any”
The best way to prevent missing data is to help design a study, review the data as the study is ongoing, and create automated checks to report and describe missing data.
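One way an automated check could look, as a minimal base R sketch. The helper name report_missing and the 5% threshold are assumptions made for illustration, not something from the slides; airquality stands in for study data.

```r
# Hypothetical helper: summarise missingness per column and flag
# columns whose proportion missing exceeds a chosen threshold
report_missing <- function(df, max_prop = 0.05) {
  n_miss <- colSums(is.na(df))
  out <- data.frame(
    variable  = names(df),
    n_miss    = as.integer(n_miss),
    prop_miss = as.numeric(n_miss) / nrow(df),
    row.names = NULL
  )
  out$flag <- out$prop_miss > max_prop
  # Worst offenders first
  out[order(-out$prop_miss), ]
}

# airquality ships with base R (datasets package)
report <- report_missing(datasets::airquality)
print(report)
```

Running something like this on a schedule while data collection is ongoing makes a jump in missingness visible early, while it may still be fixable.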
Why does missing data happen?
Why is this an issue?
Ways of Handling missing data
Why does R give NA by default for most methods, such as mean(x)?
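A small sketch of the default behaviour, using a made-up vector x. The mean of a vector with an unknown element is itself unknown, so R returns NA unless you explicitly opt in to dropping missing values with na.rm = TRUE.

```r
# Toy vector with one missing value (illustrative, not real data)
x <- c(2, 4, NA, 6)

# Default: the missing value propagates, so the result is NA
m_default <- mean(x)

# Opt in to dropping NAs; the mean is taken over the 3 observed values
m_observed <- mean(x, na.rm = TRUE)

# Count how many values were dropped so it can be reported with the estimate
n_missing <- sum(is.na(x))

print(c(default = m_default, observed = m_observed, n_missing = n_missing))
```

The default forces a decision: you have to acknowledge the missing values before R will give you a number.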
Easy ways to see missing data
rowSums(is.na(df))
colSums(is.na(df))
rowSums(!sapply(df, is.finite)) - what does this do? (is.finite() has no data frame method, so apply it per column)
colSums(!sapply(df, is.finite))
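The counts above can be tried on the airquality data from the datasets package. This is a base R sketch; the sapply() step is one way to apply is.finite() column by column, since is.finite() does not accept a data frame directly.

```r
# airquality ships with base R (datasets package)
df <- datasets::airquality

# Missing values per column
na_per_col <- colSums(is.na(df))

# Missing values per row, then the number of rows with at least one NA
na_per_row <- rowSums(is.na(df))
rows_with_na <- sum(na_per_row > 0)

# is.finite() is applied per column here; unlike is.na(), it also
# flags Inf and -Inf (both flag NaN)
nonfinite_per_col <- colSums(!sapply(df, is.finite))

print(na_per_col)
print(rows_with_na)
print(nonfinite_per_col)
```

In airquality the two per-column counts agree, because the only non-finite values are NAs.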
2 great packages to help: naniar/visdat
Both by Nick Tierney: https://github.com/njtierney
"naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data"
"vis_dat helps you visualise a dataframe and “get a look at the data” by displaying the variable classes in a dataframe as a plot with vis_dat, and getting a brief look into missing data patterns using vis_miss"
Naniar: why that name?
Let's explore airquality dataset (datasets)
visdat::vis_dat(airquality)
miss_var_summary(airquality)
Breakdown by each variable
miss_case_summary(airquality)
Case is record/row number
gg_miss_var(airquality)
Working with NA vars: replace_na
tidyr::replace_na - replace NA with a value
airquality %>% tidyr::replace_na(list(Ozone = 0, Solar.R = 5))
Replace known missing with NA
For example 999 or 99 for variables (toy example)
naniar::replace_with_na - replace value with NA
airquality %>% naniar::replace_with_na(list(Ozone = 41, Solar.R = 190))
This can be useful with survey data, depending on how it's coded
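A base R sketch of the same idea for sentinel codes such as 99 and 999. The survey data frame and its values are made up for illustration.

```r
# Toy survey data where 99 and 999 are codes for "missing" (made-up values)
survey <- data.frame(
  age    = c(34, 999, 51, 28),
  income = c(99, 42000, 58000, 999)
)
missing_codes <- c(99, 999)

# Recode every sentinel value to NA, column by column
survey_clean <- survey
survey_clean[] <- lapply(survey, function(col) {
  col[col %in% missing_codes] <- NA
  col
})

print(survey_clean)
```

Applying the same codes to every column is only safe when none of them can be a legitimate value (an age of 99, for instance), so check the codebook per variable.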
Normal plotting in ggplot2 with missing
library(ggplot2)
ggplot(data = airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_point()
But what happened to missing?!
(aka READ THE WARNINGS)
Warning message:
Removed 42 rows containing missing values or values outside the scale range (`geom_point()`).
Naniar: geom_miss_point() can be helpful
Values set away from original data
library(naniar)
ggplot(data = airquality,
aes(x = Ozone,
y = Solar.R)) +
geom_miss_point()
See any pattern?
(No Ozone > 100 for those missing Solar.R)
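The pattern can also be checked numerically. A base R sketch that splits Ozone by whether Solar.R was recorded; the variable names are chosen here for illustration.

```r
aq <- datasets::airquality

# Ozone readings on days where Solar.R was not recorded
ozone_when_solar_missing <- aq$Ozone[is.na(aq$Solar.R)]

# Ozone readings on days where Solar.R was recorded
ozone_when_solar_present <- aq$Ozone[!is.na(aq$Solar.R)]

# Largest observed Ozone value in each group (ignoring missing Ozone)
max_missing <- max(ozone_when_solar_missing, na.rm = TRUE)
max_present <- max(ozone_when_solar_present, na.rm = TRUE)

print(c(solar_missing = max_missing, solar_present = max_present))
```

With only a handful of days missing Solar.R, treat this as a prompt to ask how the data were collected rather than as evidence of a mechanism.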
airquality %>% add_any_miss()
airquality %>% add_label_missings()
Functions to impute data
Use Tidymodels - learn this framework if you're working in R
Recipes package
Imputation: Filling in Your Data
Imputation: Things to Remember
simputation: Simple Imputation
This package offers a number of commonly used single imputation methods, each with a similar and hopefully simple interface. At the moment the following imputation methodology is supported.
Function: impute_<model>(data, formula, [model-specific options])
Formula: IMPUTED ~ MODEL_SPECIFICATION [ | GROUPING ]
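library(simputation)
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA  # iris has no NAs, so add some
imputed <- dat |>
impute_lm(Sepal.Length ~ Sepal.Width + Species) |>  # regression where predictors are observed
impute_median(Sepal.Length ~ Species) |>  # fallback for rows the regression could not fill
impute_cart(Species ~ .)  # classification tree for the missing Species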
mice package: multiple imputations
https://cran.r-project.org/web/packages/mice/index.html
Do more than the default of 5 imputations (m = 5)!
Use pool() to combine the estimates across imputations into a final result (and its variance)
Quick example of mice
library(mice)
aq = mice(airquality, m = 10, method = "pmm")
res = with(aq, lm(Ozone ~ Solar.R + Wind))
pool(res)
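What the pooling step does can be sketched by hand with Rubin's rules: average the point estimates, average the within-imputation variances, and add the between-imputation variance inflated by (1 + 1/m). The helper name pool_rubin and the numbers are made up for illustration; mice's pool() additionally computes degrees of freedom and related diagnostics.

```r
# Hypothetical helper: pool m point estimates and their variances
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  q_bar <- mean(estimates)          # pooled point estimate
  u_bar <- mean(variances)          # within-imputation variance
  b     <- var(estimates)           # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b  # total variance
  c(estimate = q_bar, within = u_bar, between = b,
    total = total, se = sqrt(total))
}

# Made-up estimates and squared standard errors from m = 5 imputed datasets
est <- c(1.0, 1.2, 0.9, 1.1, 0.8)
se2 <- c(0.04, 0.05, 0.04, 0.05, 0.04)
pooled <- pool_rubin(est, se2)
print(pooled)
```

The between-imputation term is what single imputation leaves out, which is why standard errors from a single filled-in dataset tend to be too small.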
What mice method?
The general recommendation is to try a number of them and see how sensitive the results are to the choice of method
Create a "Table 1" but the columns are inclusion/missingness - is this data generalizable?
library(dplyr)
library(gt)
library(gtsummary)
df = airquality %>%
mutate(is_included = !is.na(Ozone))
df %>%
select(-Ozone, -Day) %>%
tbl_summary(
by = is_included,
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} / {N} ({p}%)"
),
digits = all_continuous() ~ 2,
label = Temp ~ "Temperature",
missing_text = "(Missing)"
) %>%
add_p()
Short Tutorials
Some Math on Imputation
Jump to slides
https://www.lshtm.ac.uk/media/38306
Introduction to Multiple Imputation
James Carpenter & Mike Kenward
Department of Medical Statistics
London School of Hygiene & Tropical Medicine
Rubin estimate of variance
Accessed from JHU Library Online:
APA: Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.
Chapter 3: https://drive.google.com/file/d/1Ade09cuS19HUAAheRqqggAK83VdkUCzz/view?usp=drive_link