1 of 34

Missing Data

2 of 34

Data Science Thing of the Day

https://x.com/Corey_Yanofsky/status/1841583741300289645

3 of 34

“The best thing to do with missing values is not to have any”

  • Gertrude Mary Cox

4 of 34

The best way to prevent missing data is to help design a study, review the data as the study is ongoing, and create automated checks to report and describe missing data.

  • John Muschelli

5 of 34

Why does missing data happen?

  1. It wasn't collected for a number of reasons
    1. Someone forgot
    2. Patient was tired/ornery and another measure took precedence
    3. Someone forgot to put the info in
    4. It wasn't applicable
  2. People refuse to answer questions (always ask why this would be)
    • Sensitive information
    • Response could get them in "trouble" (e.g. with their employer/benefits)
    • They don't like that information out there
  3. It is systematic/not applicable
    • If you answer X to question 1, skip to question 5 - know your instruments!
  4. It was read in incorrectly - this is YOUR responsibility
  5. It was coded incorrectly/coerced to NA - this is YOUR responsibility

6 of 34

Why is this an issue?

  1. Some functions will not run with missing data (e.g. randomForest)
  2. Some functions will implicitly/almost silently drop records (e.g. lm)
  3. Bias in results (everyone > some cutoff didn't report and is NA)
  4. Generalizability of results (missing are systematically different, but present in the population)
    1. Think statistically, E[Y | X] vs E[Y | X, X is not missing]
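A minimal sketch of points 2 and 3, using base R and the built-in airquality data. The simulated variable y and the cutoff of 60 are made up for illustration only.

```r
# lm() drops incomplete rows without an error (the default na.action is na.omit)
fit <- lm(Ozone ~ Solar.R + Wind, data = airquality)
n_total <- nrow(airquality)  # rows in the data
n_used <- nobs(fit)          # rows actually used in the fit

# Bias: if everyone above a cutoff is missing, the observed mean is too low
set.seed(1)
y <- rnorm(1000, mean = 50, sd = 10)
y_obs <- ifelse(y > 60, NA, y)  # values above the cutoff are never reported
true_mean <- mean(y)
observed_mean <- mean(y_obs, na.rm = TRUE)
```

Comparing n_used with n_total shows how many records lm dropped; comparing observed_mean with true_mean shows the direction of the bias.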

7 of 34

Ways of Handling missing data

  1. Delete any rows with missing data (complete.cases function)
    1. What are the downsides?
    2. Always look at the % of records dropped (<5% OK rule of thumb)
  2. Impute the data
    • Use a model
    • Use the mean (this is really a model with just the intercept)
    • Use the median (this is really a quantile regression with just the intercept)
  3. Impute the data multiple times (multiple imputation)
    • How do you aggregate the results?
  4. Put indicators for missing data
    • Different models can handle this
  5. Separate models with indicators and then another model for actual data

8 of 34

Why does R give NA by default for most methods, such as mean(x)?
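One way to think about the answer: NA means "unknown", so a summary that depends on an unknown value is itself unknown until you say how to handle it. A small sketch with a made-up vector:

```r
x <- c(1, 2, NA)

mean(x)                # NA: the mean depends on a value R does not know
mean(x, na.rm = TRUE)  # 1.5: you explicitly chose to ignore the missing value
```

Having to type na.rm = TRUE makes dropping data a deliberate decision rather than a silent one.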

9 of 34

Easy ways to see missing data

rowSums(is.na(df))

colSums(is.na(df))

rowSums(!is.finite(as.matrix(df))) - what does this do? (numeric columns only; is.finite() does not accept a data frame directly)

colSums(!is.finite(as.matrix(df)))
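A small sketch of the difference between the two checks. The data frame toy is made up for illustration.

```r
toy <- data.frame(a = c(1, NA, Inf),
                  b = c(NaN, 2, 3))

# is.na() flags NA and NaN, but not Inf
na_per_col <- colSums(is.na(toy))

# is.finite() is FALSE for NA, NaN, Inf and -Inf;
# it needs a numeric vector or matrix, hence as.matrix()
nonfinite_per_col <- colSums(!is.finite(as.matrix(toy)))
```

Column a has one NA but two non-finite values, so the second check also catches infinities, for example from a division by zero.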

10 of 34

2 great packages to help: naniar/visdat

Both by Nick Tierney: https://github.com/njtierney

"naniar provides principled, tidy ways to summarise, visualise, and manipulate missing data with minimal deviations from the workflows in ggplot2 and tidy data"

"vis_dat helps you visualise a dataframe and “get a look at the data” by displaying the variable classes in a dataframe as a plot with vis_dat, and getting a brief look into missing data patterns using vis_miss"

11 of 34

Naniar: why that name?

12 of 34

Let's explore airquality dataset (datasets)

visdat::vis_dat(airquality)

13 of 34

miss_var_summary(airquality)

Breakdown by each variable

14 of 34

miss_case_summary(airquality)

Case is record/row number

15 of 34

gg_miss_var(airquality)

16 of 34

Working with NA vars: replace_na

tidyr::replace_na - replace NA with a value

airquality %>% tidyr::replace_na(list(Ozone = 0, Solar.R = 5))

17 of 34

Replace known missing with NA

For example 999 or 99 for variables (toy example)

naniar::replace_with_na - replace value with NA

airquality %>% naniar::replace_with_na(list(Ozone = 41, Solar.R = 190))

This can be useful with survey data, depending on how it's coded
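A base R sketch of the same idea for sentinel codes. The survey data frame and its coding (999 for missing age, 99 for missing income) are hypothetical.

```r
survey <- data.frame(
  id     = 1:4,
  age    = c(34, 999, 41, 99),           # 999 = not recorded; 99 is a real age
  income = c(52000, 48000, 99, 61000)    # 99 = refused to answer
)

# Recode per variable: the same number can be a code in one column
# and a legitimate value in another
survey$age[survey$age %in% 999] <- NA
survey$income[survey$income %in% 99] <- NA
```

Recoding column by column, as the replace_with_na list does, avoids turning the real age of 99 into a missing value.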

18 of 34

Normal plotting in ggplot2 with missing

library(ggplot2)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_point()

19 of 34

But what happened to missing?!

(aka READ THE WARNINGS)

Warning message:

Removed 42 rows containing missing values or values outside the scale range (`geom_point()`).

20 of 34

Naniar: geom_miss_point() can be helpful

Values set away from original data

library(naniar)

ggplot(data = airquality,
       aes(x = Ozone,
           y = Solar.R)) +
  geom_miss_point()

See any pattern?

(No Ozone > 100 for those missing Solar.R)
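The pattern seen in the plot can also be checked numerically. A minimal base R sketch:

```r
# Rows where Solar.R is missing
solar_missing <- is.na(airquality$Solar.R)
n_solar_missing <- sum(solar_missing)

# Ozone values observed in those rows, and their maximum
ozone_when_solar_missing <- airquality$Ozone[solar_missing]
max_ozone <- max(ozone_when_solar_missing, na.rm = TRUE)
```

With so few rows missing Solar.R, the pattern may well be chance, so treat it as a prompt to ask how the data were collected rather than as evidence.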

21 of 34

airquality %>% add_any_miss()

22 of 34

airquality %>% add_label_missings()

23 of 34

Functions to impute data

Use Tidymodels - learn this framework if you're working in R

Recipes package

24 of 34

Imputation: Filling in Your Data

25 of 34

Imputation: Things to Remember

  1. Do NOT impute your OUTCOME in most cases
  2. Can impute predictors
    1. You can use the outcome to impute predictors, but can be contentious in some scenarios
  3. The imputed data isn't real!
    • Thus you should do sensitivity analyses on the whole data
  4. Keep track of what values were imputed!
  5. Cross-validation in imputation is likely less biased, but not common
  6. Make sure you save the model - especially for prediction/cross-validation
    • Tidymodels helps with this
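Points 4 and 6 can be sketched in base R. The train/test split, the column name Solar.R_imputed, and the helper impute_solar are illustrative choices, and mean imputation stands in for whatever model you would actually use.

```r
set.seed(1)
train_idx <- sample(nrow(airquality), 100)
train <- airquality[train_idx, ]
test <- airquality[-train_idx, ]

# "Save the model": here the model is just the training mean
solar_mean <- mean(train$Solar.R, na.rm = TRUE)

# Apply the saved value to any data set and flag what was imputed
impute_solar <- function(d, value) {
  d$Solar.R_imputed <- is.na(d$Solar.R)
  d$Solar.R[d$Solar.R_imputed] <- value
  d
}

train_imp <- impute_solar(train, solar_mean)
test_imp <- impute_solar(test, solar_mean)  # no information from test is used
```

The flag column lets you rerun the analysis without the imputed rows as a sensitivity check. In the recipes package, the prep and bake steps play the role of estimating and then applying the saved imputation.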

26 of 34

simputation: Simple Imputation

This package offers a number of commonly used single imputation methods, each with a similar and hopefully simple interface. At the moment the following imputation methodology is supported.

Function: impute_<model>(data, formula, [model-specific options])

Formula: IMPUTED ~ MODEL_SPECIFICATION [ | GROUPING ]

library(simputation)

# Each step fills values still missing after the previous one
data <- iris |>
  impute_lm(Sepal.Length ~ Sepal.Width + Species) |>
  impute_median(Sepal.Length ~ Species) |>
  impute_cart(Species ~ .)

27 of 34

mice package: multiple imputations

https://cran.r-project.org/web/packages/mice/index.html

Do more than the default of 5 imputations (m = 5)!

Use the pool command to pool estimates together for final result (and variance workup)

28 of 34

Quick example of mice

library(mice)

aq = mice(airquality, m = 10, method = "pmm")

res = with(aq, lm(Ozone ~ Solar.R + Wind))

pool(res)
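To see roughly what pooling does to a single coefficient, here is a hand-rolled sketch of Rubin's rules in base R. The function pool_rubin and the input numbers are made up for illustration; mice::pool also reports degrees of freedom and related diagnostics that this sketch omits.

```r
# estimates: one coefficient estimate per imputed data set
# variances: the squared standard error of that estimate in each data set
pool_rubin <- function(estimates, variances) {
  m <- length(estimates)
  q_bar <- mean(estimates)          # pooled point estimate
  u_bar <- mean(variances)          # within-imputation variance
  b <- var(estimates)               # between-imputation variance
  total <- u_bar + (1 + 1 / m) * b  # total variance
  c(estimate = q_bar, within = u_bar, between = b,
    total = total, se = sqrt(total))
}

pooled <- pool_rubin(estimates = c(1, 2, 3),
                     variances = c(0.5, 0.5, 0.5))
```

The between-imputation term is why the pooled standard error is larger than the one from any single imputed data set.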

29 of 34

What mice method?

The general recommendation is to try a number of them and see how sensitive the results are to the choice of method.

30 of 34

Create a "Table 1" but the columns are inclusion/missingness - is this data generalizable?

library(dplyr)

library(gt)

library(gtsummary)

df = airquality %>%
  mutate(is_included = !is.na(Ozone))

df %>%
  select(-Ozone, -Day) %>%
  tbl_summary(
    by = is_included,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    ),
    digits = all_continuous() ~ 2,
    label = Temp ~ "Temperature",
    missing_text = "(Missing)"
  ) %>%
  add_p()
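A quick base R version of the same comparison, useful as a sanity check on the table. This is a sketch; the choice of a t-test for Temp is an assumption for illustration and is not necessarily the test add_p() selects.

```r
# Included = Ozone was observed
is_included <- !is.na(airquality$Ozone)
group_sizes <- table(is_included)

# Compare other variables between included and excluded records
temp_by_group <- tapply(airquality$Temp, is_included, mean)
solar_by_group <- tapply(airquality$Solar.R, is_included, mean, na.rm = TRUE)

# A formal comparison for one variable
temp_test <- t.test(airquality$Temp ~ is_included)
```

If the two groups differ noticeably on the other variables, results from the included records may not generalize to the full population.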

31 of 34

Short Tutorials

32 of 34

Some Math on Imputation

33 of 34

Jump to slides

https://www.lshtm.ac.uk/media/38306

Introduction to Multiple Imputation

James Carpenter & Mike Kenward

Department of Medical Statistics

London School of Hygiene & Tropical Medicine

34 of 34

Rubin estimate of variance

Accessed from JHU Library Online:

APA: Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons.

Chapter 3: https://drive.google.com/file/d/1Ade09cuS19HUAAheRqqggAK83VdkUCzz/view?usp=drive_link
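For reference, Rubin's rules for a scalar estimate can be written as follows. The notation is an assumption chosen for this sketch: $\hat{Q}_i$ is the estimate from the $i$-th of $m$ imputed data sets and $U_i$ is its estimated variance.

```latex
\bar{Q} = \frac{1}{m} \sum_{i=1}^{m} \hat{Q}_i
\qquad
\bar{U} = \frac{1}{m} \sum_{i=1}^{m} U_i
\qquad
B = \frac{1}{m-1} \sum_{i=1}^{m} \left( \hat{Q}_i - \bar{Q} \right)^2
\qquad
T = \bar{U} + \left( 1 + \frac{1}{m} \right) B
```

$\bar{U}$ is the within-imputation variance, $B$ the between-imputation variance, and $T$ the total variance of the pooled estimate $\bar{Q}$.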