5 of 34

Why does missing data happen?

It wasn't collected for a number of reasons

Someone forgot
Patient was tired/ornery and other measures took precedence
Someone forgot to put the info in
It wasn't applicable

People refuse to answer questions (always ask why this would be)

Sensitive information
Response could get them in "trouble" (e.g. with their employer/benefits)
They don't like that information out there

It is systematic/not applicable

If you answer X to question 1, skip to question 5 - know your instruments!

It was read in incorrectly - this is YOUR responsibility
It was coded incorrectly/coerced to NA - this is YOUR responsibility

6 of 34

Why is this an issue?

Some functions will not run with missing data (e.g. randomForest)
Some functions will implicitly/almost silently drop records (e.g. lm)
Bias in results (everyone > some cutoff didn't report and is NA)
Generalizability of results (missing are systematically different, but present in the population)

Think statistically, E[Y | X] vs E[Y | X, X is not missing]
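A small simulation can make the "everyone above a cutoff is NA" case concrete. This is only a sketch on made-up data: the data-generating model, the cutoff of 3, and all variable names are assumptions for illustration, not from a real study.

```r
set.seed(42)

# Simulated (not real) data: y depends linearly on x
n <- 10000
x <- rnorm(n)
y <- 2 + x + rnorm(n)

# Assumed missingness rule: everyone above the cutoff does not report
y_obs <- ifelse(y > 3, NA, y)

# Mean in the full population versus the mean among those who reported
mean_all <- mean(y)
mean_reported <- mean(y_obs, na.rm = TRUE)

# lm() drops the NA rows, so this slope is estimated on reporters only
slope_all <- unname(coef(lm(y ~ x))["x"])
slope_reported <- unname(coef(lm(y_obs ~ x))["x"])

c(mean_all = mean_all, mean_reported = mean_reported)
c(slope_all = slope_all, slope_reported = slope_reported)
```

Both the mean and the slope among reporters come out lower than in the full simulated population, which is the gap between the quantity you want and the quantity you can estimate from the observed rows alone.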

7 of 34

Ways of Handling missing data

Delete any rows with missing data (complete.cases function)

What are the downsides?
Always look at the % of records dropped (rule of thumb: <5% is OK)
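As a quick check of how much listwise deletion costs, here is a minimal base R sketch on the built-in airquality data. The model formula is an arbitrary choice for illustration.

```r
# Count rows with no missing values in any column
n_total <- nrow(airquality)
n_complete <- sum(complete.cases(airquality))
pct_dropped <- 100 * (1 - n_complete / n_total)

# Listwise deletion keeps only the complete rows
aq_complete <- airquality[complete.cases(airquality), ]

# lm() does the same deletion silently; nobs() shows how many rows it used
fit <- lm(Ozone ~ Solar.R + Wind, data = airquality)

c(total = n_total, complete = n_complete, used_by_lm = nobs(fit))
round(pct_dropped, 1)
```

Here 42 of 153 rows are dropped (about 27%), well above the 5% rule of thumb, and lm() uses the same 111 rows without saying so unless you look.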

Impute the data

Use a model
Use the mean (this is really a linear model with just the intercept)
Use the median (this is really a quantile regression with just the intercept)
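The "mean is an intercept-only model" remark can be checked directly. A minimal sketch in base R, using Solar.R from airquality as the variable to impute (that choice is just for illustration):

```r
solar <- airquality$Solar.R

# Intercept-only linear model: the fitted intercept is the sample mean
fit_mean <- lm(Solar.R ~ 1, data = airquality)
mean_from_model <- unname(coef(fit_mean)[1])
mean_direct <- mean(solar, na.rm = TRUE)
all.equal(mean_from_model, mean_direct)

# Single imputation with the mean and with the median
solar_mean_imp <- ifelse(is.na(solar), mean_direct, solar)
solar_median_imp <- ifelse(is.na(solar), median(solar, na.rm = TRUE), solar)

# Mean imputation shrinks the variance relative to the observed values
c(observed = var(solar, na.rm = TRUE), mean_imputed = var(solar_mean_imp))
```

The shrunken variance is one reason single mean imputation tends to make results look more precise than they are.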

Impute the data multiple times (multiple imputation)

How do you aggregate the results?

Put indicators for missing data

Different models can handle this

Separate models with indicators and then another model for actual data
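One common version of this is the missing-indicator approach: add a 0/1 flag for missingness and fill the original variable with a placeholder so the row stays in the model. The sketch below assumes a placeholder of 0 and the column names Solar.R_missing and Solar.R_filled. These are illustrative choices, and the approach itself can give biased estimates in some settings.

```r
aq <- airquality

# Flag which rows were missing Solar.R
aq$Solar.R_missing <- as.integer(is.na(aq$Solar.R))

# Fill with an arbitrary placeholder; the indicator absorbs the offset
aq$Solar.R_filled <- ifelse(is.na(aq$Solar.R), 0, aq$Solar.R)

# Rows missing Solar.R are now kept; rows missing the outcome are still dropped
fit <- lm(Ozone ~ Solar.R_filled + Solar.R_missing + Wind, data = aq)

sum(aq$Solar.R_missing)
nobs(fit)
```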

8 of 34

Why does R give NA by default for most methods, such as mean(x)?
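One way to think about it: NA means "unknown", so any summary that depends on an unknown value is itself unknown, and R makes you opt in to dropping it. A tiny sketch with a made-up vector:

```r
x <- c(1, 2, NA, 4)

mean(x)                 # NA: the answer depends on a value R does not know
mean(x, na.rm = TRUE)   # mean of 1, 2, 4 after explicitly dropping the NA
NA > 1                  # NA: comparing with an unknown is also unknown
sum(is.na(x))           # 1: how many values would be dropped
```

Setting na.rm = TRUE is a decision about the analysis, so it is worth checking how many values it removes each time you use it.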

9 of 34

Easy ways to see missing data

rowSums(is.na(df))

colSums(is.na(df))

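rowSums(!is.finite(as.matrix(df))) - what does this do? (is.finite has no data frame method, so convert numeric columns to a matrix first)

colSums(!is.finite(as.matrix(df)))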
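A short sketch of these counts on airquality, plus a made-up toy data frame (toy and its columns are hypothetical) to show how is.na and is.finite differ:

```r
# Missing values per column and per row in airquality
na_by_col <- colSums(is.na(airquality))
na_by_row <- rowSums(is.na(airquality))
na_by_col
table(na_by_row)

# Toy data: NA, NaN, and Inf behave differently
toy <- data.frame(a = c(1, NA, Inf), b = c(NaN, 2, 3))

colSums(is.na(toy))                   # NA and NaN count; Inf does not
colSums(!is.finite(as.matrix(toy)))   # NA, NaN, and Inf all count
```

In airquality only Ozone (37) and Solar.R (7) have missing values, and two rows are missing both.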

10 of 34

2 great packages to help: naniar/visdat

Both by Nick Tierney: https://github.com/njtierney

"naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data"

"vis_dat helps you visualise a dataframe and “get a look at the data” by displaying the variable classes in a dataframe as a plot with vis_dat, and getting a brief look into missing data patterns using vis_miss"

11 of 34

Naniar: why that name?

12 of 34

Let's explore airquality dataset (datasets)

visdat::vis_dat(airquality)

13 of 34

miss_var_summary(airquality)

Breakdown by each variable

14 of 34

miss_case_summary(airquality)

Case is record/row number

15 of 34

gg_miss_var(airquality)

16 of 34

Working with NA vars: replace_na

tidyr::replace_na - replace NA with a value

airquality %>% tidyr::replace_na(list(Ozone = 0, Solar.R = 5))

17 of 34

Replace known missing with NA

For example 999 or 99 for variables (toy example)

naniar::replace_with_na - replace value with NA

airquality %>% naniar::replace_with_na(list(Ozone = 41, Solar.R = 190))

This can be useful with survey data, depending on how it's coded
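The same recoding can be done in base R, either when the file is read or afterwards. The sketch below uses a made-up inline CSV. The column names and the use of 99 and 999 as "missing" codes are assumptions for illustration.

```r
# Hypothetical survey extract where 999 and 99 mean "not reported"
csv_text <- "id,age,income
1,34,52000
2,999,61000
3,45,99
4,29,48000"

# Option 1: declare the sentinel codes at read time.
# Caution: na.strings applies to every column, including id.
survey <- read.csv(text = csv_text, na.strings = c("NA", "99", "999"))

# Option 2: recode after reading, one variable at a time
raw <- read.csv(text = csv_text)
raw$age[raw$age == 999] <- NA
raw$income[raw$income == 99] <- NA

colSums(is.na(survey))
colSums(is.na(raw))
```

Recoding per variable is safer when a code such as 99 could be a legitimate value in another column.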

18 of 34

Normal plotting in ggplot2 with missing

library(ggplot2)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_point()

19 of 34

But what happened to missing?!

(aka READ THE WARNINGS)

Warning message:
Removed 42 rows containing missing values or values outside the scale range (`geom_point()`).

20 of 34

Naniar: geom_miss_point() can be helpful

Values set away from original data

library(naniar)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_miss_point()

See any pattern?

(No Ozone > 100 for those missing Solar.R)

21 of 34

airquality %>% add_any_miss()

22 of 34

airquality %>% add_label_missings()

23 of 34

Functions to impute data

Use tidymodels - learn this framework if you're working in R

recipes package

24 of 34

Imputation: Filling in Your Data

25 of 34

Imputation: Things to Remember

Do NOT impute your OUTCOME in most cases
Can impute predictors

You can use the outcome to impute predictors, but this can be contentious in some scenarios

The imputed data isn't real!

Thus you should do sensitivity analyses on the whole data

Keep track of what values were imputed!
Cross-validation in imputation is likely less biased, but not common
Make sure you save the model - especially for prediction/cross-validation

Tidymodels helps with this
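To illustrate "save the model" and "keep track of what was imputed", here is a base R sketch of the idea: estimate the imputation value on training data only, store it, and apply it to new data along with a flag. The helper impute_solar, the 100-row split, and the Solar.R_imputed column are hypothetical names for this example. The prep()/bake() split in recipes plays the same role.

```r
set.seed(1)

# Hypothetical train/test split of airquality
train_rows <- sample(nrow(airquality), 100)
train <- airquality[train_rows, ]
test <- airquality[-train_rows, ]

# "Fit" the imputation on the training data only, and keep the value
solar_median <- median(train$Solar.R, na.rm = TRUE)

# Apply a stored imputation value and flag which rows were filled in
impute_solar <- function(df, value) {
  df$Solar.R_imputed <- is.na(df$Solar.R)
  df$Solar.R[df$Solar.R_imputed] <- value
  df
}

train_imp <- impute_solar(train, solar_median)
test_imp <- impute_solar(test, solar_median)

# The flag column records how many values were imputed in each set
c(train = sum(train_imp$Solar.R_imputed), test = sum(test_imp$Solar.R_imputed))
```

Reusing the stored training median on the test set avoids leaking information from the data you are predicting on.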

26 of 34

simputation: Simple Imputation

This package offers a number of commonly used single imputation methods, each with a similar and hopefully simple interface. At the moment the following imputation methodology is supported.

Function: impute_<model>(data, formula, [model-specific options])

Formula: IMPUTED ~ MODEL_SPECIFICATION [ | GROUPING ]

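library(simputation)

# iris has no missing values, so blank out a few cells first
dat <- iris
dat[1:3, 1] <- dat[3:7, 2] <- dat[8:10, 5] <- NA

# Chain methods: later steps fill what earlier steps could not
data <- dat |>
  impute_lm(Sepal.Length ~ Sepal.Width + Species) |>
  impute_median(Sepal.Length ~ Species) |>
  impute_cart(Species ~ .)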

27 of 34

mice package: multiple imputations

https://cran.r-project.org/web/packages/mice/index.html

Do more than the default of 5 imputations (m = 5)!

Use the pool command to pool estimates together for final result (and variance workup)

28 of 34

Quick example of mice

library(mice)

aq = mice(airquality, m = 10, method = "pmm")

res = with(aq, lm(Ozone ~ Solar.R + Wind))

pool(res)

29 of 34

What mice method?

General recommendation is to try a number of them and see how sensitive the results are to the choice of method

30 of 34

Create a "Table 1" but the columns are inclusion/missingness - is this data generalizable?

library(dplyr)

library(gt)

library(gtsummary)

df = airquality %>%
  mutate(is_included = !is.na(Ozone))

df %>%
  select(-Ozone, -Day) %>%
  tbl_summary(
    by = is_included,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    ),
    digits = all_continuous() ~ 2,
    label = Temp ~ "Temperature",
    missing_text = "(Missing)"
  ) %>%
  add_p()

31 of 34

Short Tutorials

https://www.appsilon.com/post/imputation-in-r

https://libguides.princeton.edu/R-Missingdata

https://www.analyticsvidhya.com/blog/2016/03/tutorial-powerful-packages-imputing-missing-values/

Longer (course): https://www.datacamp.com/courses/handling-missing-data-with-imputations-in-r

32 of 34

Some Math on Imputation

33 of 34

Jump to slides

https://www.lshtm.ac.uk/media/38306

Introduction to Multiple Imputation

James Carpenter & Mike Kenward

Department of Medical Statistics

London School of Hygiene & Tropical Medicine

34 of 34

Rubin estimate of variance

Accessed from JHU Library Online:

APA: Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.

Chapter 3: https://drive.google.com/file/d/1Ade09cuS19HUAAheRqqggAK83VdkUCzz/view?usp=drive_link
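Rubin's rules for pooling can be sketched in a few lines of base R. This is a toy illustration, not what mice does internally: the imputation step here is a crude random draw from the observed Solar.R values (an assumption made to keep the example self-contained), and m = 20 and the model formula are arbitrary choices.

```r
set.seed(2024)

m <- 20
miss <- is.na(airquality$Solar.R)
obs_solar <- airquality$Solar.R[!miss]

est <- numeric(m)      # per-imputation estimate of the Solar.R coefficient
within <- numeric(m)   # per-imputation variance of that estimate

for (i in seq_len(m)) {
  aq_i <- airquality
  # Crude imputation: sample observed values with replacement
  aq_i$Solar.R[miss] <- sample(obs_solar, sum(miss), replace = TRUE)
  fit <- lm(Ozone ~ Solar.R + Wind, data = aq_i)
  est[i] <- coef(fit)["Solar.R"]
  within[i] <- vcov(fit)["Solar.R", "Solar.R"]
}

# Rubin's rules
q_bar <- mean(est)                     # pooled point estimate
u_bar <- mean(within)                  # average within-imputation variance
b <- var(est)                          # between-imputation variance
total_var <- u_bar + (1 + 1 / m) * b   # total variance

c(estimate = q_bar, se = sqrt(total_var))
```

The between-imputation term is what a single imputation leaves out: it carries the extra uncertainty from not knowing the missing values.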