Allison Horst
How to become a missing explorer:
Including missingness in exploratory data analysis
PART 1 (15 min): Why we care about missingness
PART 2 (10 min): Missing data mechanisms
PART 3 (15 min): Exploring missing data in naniar!
Part 1: Why we care about missingness
Imagine: You are a marine biologist
Photo: www.nps.gov
CURRENT RESEARCH PROJECT:
Collect data on dead bottlenose dolphins
to model growth
This work is critical for
marine mammal conservation:
Over- or underestimating growth will affect management decisions for the species and for marine areas more broadly.
Variables you need for dolphin growth model: length (m), age (yr), and mass (kg)
To measure body mass, you need to transport the dolphins to the lab, which is harder & sometimes impossible with larger dolphins.
We expect: a greater tendency for mass values to be missing for larger dolphins
Data might look like this:
length_m | age_yr | mass_kg |
1.21 | 1 | 48.2 |
3.37 | 15 | 162.3 |
NA | 2 | 53.7 |
2.64 | 3 | 107.4 |
3.25 | NA | NA |
2.01 | 11 | 82.4 |
1.84 | 3 | 71.2 |
3.15 | 18 | NA |
3.71 | 14 | NA |
NA = “not available” in R
NaN = “not a number” in R & Python
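The difference between the two markers can be checked directly in R. This is a minimal sketch; the vector `x` is made up for illustration.

```r
# NA marks a value that is not available; NaN marks an undefined numeric result.
x <- c(48.2, NA, 0 / 0)   # 0 / 0 evaluates to NaN

is.na(x)    # FALSE TRUE TRUE: is.na() is TRUE for both NA and NaN
is.nan(x)   # FALSE FALSE TRUE: is.nan() is TRUE only for NaN

# A summary of a vector with missing values is itself missing...
is.na(mean(x))          # TRUE

# ...unless the missing values are removed explicitly
mean(x, na.rm = TRUE)   # 48.2
```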
Human error
Forgot pliers, couldn’t get tooth
Occasionally a large dolphin gets transported and massed, but mass values are more often missing for larger dolphins.
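One way to check this pattern in the example table is to compare the lengths of dolphins with and without a recorded mass. A minimal base R sketch, assuming the nine example rows are stored in a data frame called `dolphins` (a name chosen here for illustration):

```r
# The example dolphin table, typed in by hand
dolphins <- data.frame(
  length_m = c(1.21, 3.37, NA, 2.64, 3.25, 2.01, 1.84, 3.15, 3.71),
  age_yr   = c(1, 15, 2, 3, NA, 11, 3, 18, 14),
  mass_kg  = c(48.2, 162.3, 53.7, 107.4, NA, 82.4, 71.2, NA, NA)
)

# Mean length, split by whether mass is missing (TRUE) or recorded (FALSE).
# na.rm = TRUE is passed on to mean() because one length is also missing.
length_by_mass_missing <- tapply(
  dolphins$length_m,
  is.na(dolphins$mass_kg),
  mean,
  na.rm = TRUE
)
length_by_mass_missing
```

In this toy table, the dolphins with missing mass are longer on average than the dolphins with recorded mass.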
Missing values: Everybody’s got ’em.
“Even the best networks of environmental
monitors do not operate flawlessly…”
Lawrence CL and M Shah (2016) J Air Waste Manag Assoc 66 (1), 38-52.
“Gear and vessels fail, storms curtail sampling…”
Rago, PJ (2004) Fishery independent sampling: survey techniques and data analyses. National Marine Fisheries Service.
“Missing data plagues almost all surveys, and quite a number of experiments...”
Lo Presti, R et al (2010) Environ Monit Assess 160 (1-4), 1-22. doi:10.1007/s10661-008-0653-3
“...in real-world data sets, missing data are the norm rather than the exception.”
Nakagawa, S and RP Freckleton (2008) Trends in Ecology and Evolution 23 (11), 592-596.
Missing values: everybody’s got ’em...
...AND everyone should deliberately deal with missing values in their data analyses.
How are missing values most commonly “dealt with”?
LISTWISE DELETION:
Any observation (row) containing NA for a variable included in analysis is dropped
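Applied to the example dolphin table, listwise deletion looks like this. A minimal base R sketch; the object names `dolphins` and `complete_rows` are chosen here for illustration.

```r
dolphins <- data.frame(
  length_m = c(1.21, 3.37, NA, 2.64, 3.25, 2.01, 1.84, 3.15, 3.71),
  age_yr   = c(1, 15, 2, 3, NA, 11, 3, 18, 14),
  mass_kg  = c(48.2, 162.3, 53.7, 107.4, NA, 82.4, 71.2, NA, NA)
)

# Listwise deletion: keep only rows with no NA in any column
complete_rows <- na.omit(dolphins)

nrow(dolphins)        # 9 rows collected
nrow(complete_rows)   # 5 rows left after deletion

# The rows that were dropped (any row with at least one NA)
dolphins[!complete.cases(dolphins), ]
```

Four of the nine rows disappear, including the three dolphins longer than 3 m whose mass was not measured.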
Default mindset:
KEEP CALM AND DISAPPEAR NAs
Default code:
na.action = na.omit
na.rm = TRUE
Listwise deletion is the default in most software:
ggplot(): rows with missing values are removed (with a warning)
lm() & glm(): rows with NA in any model variable are dropped
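How quietly rows vanish is easy to see in R. A minimal sketch using the example dolphin table, assuming a default R session where `getOption("na.action")` is `"na.omit"`; the linear model is only for illustration and is not a real growth model.

```r
dolphins <- data.frame(
  length_m = c(1.21, 3.37, NA, 2.64, 3.25, 2.01, 1.84, 3.15, 3.71),
  age_yr   = c(1, 15, 2, 3, NA, 11, 3, 18, 14),
  mass_kg  = c(48.2, 162.3, 53.7, 107.4, NA, 82.4, 71.2, NA, NA)
)

# lm() drops incomplete rows without a warning
fit <- lm(mass_kg ~ length_m, data = dolphins)
nobs(fit)   # 5 of the 9 rows were actually used

# Asking lm() to fail instead makes the missingness visible
strict_fit <- tryCatch(
  lm(mass_kg ~ length_m, data = dolphins, na.action = na.fail),
  error = function(e) e
)
inherits(strict_fit, "error")   # TRUE

# Summary functions return NA until na.rm = TRUE is set
mean(dolphins$mass_kg)                 # NA
mean(dolphins$mass_kg, na.rm = TRUE)   # mean of the 6 recorded masses
```

Setting `na.action = na.fail` is one way to force a deliberate decision about missing values before a model is fit.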
Listwise deletion…
WAIT, WHAT?
My estimates of dolphin growth rates might be biased if I use listwise deletion, the default in most software?!
Yes... but it depends on the missingness mechanism.
Part 2: Mechanisms of missingness
Missing completely at random (MCAR):
Missingness doesn’t depend on observed or unobserved variables
Examples: Missingness in dolphin length depends on...
Missing at random (MAR):
Missingness depends on observed variables, but not on the missing values themselves
Example: Missingness in dolphin mass depends on dolphin length.
Missing not at random (MNAR):
Missingness depends on the value of the missing data
Example: Missingness in dolphin mass depends on dolphin mass.
So we might not know if missing dolphin mass is missing at random (depends on length) or missing not at random (depends on mass), but it’s not missing completely at random.
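A small simulation can make the three mechanisms concrete. This is a sketch under made-up assumptions, not real dolphin data: the linear length-mass relationship, the noise level, and the missingness probabilities are all invented for illustration, and `drop_with_prob()` is a hypothetical helper.

```r
set.seed(2024)
n <- 5000

# Simulated dolphins: mass increases with length (invented relationship)
length_m <- runif(n, min = 1, max = 4)
mass_kg  <- 40 * length_m + rnorm(n, sd = 10)

# Hypothetical helper: set each value to NA with probability p
drop_with_prob <- function(x, p) {
  x[runif(length(x)) < p] <- NA
  x
}

# MCAR: every mass has the same 30% chance of being missing
mass_mcar <- drop_with_prob(mass_kg, 0.3)
# MAR: the chance of missing mass increases with observed length
mass_mar  <- drop_with_prob(mass_kg, plogis(2 * (length_m - 3)))
# MNAR: the chance of missing mass increases with mass itself
mass_mnar <- drop_with_prob(mass_kg, plogis(0.05 * (mass_kg - 120)))

# Mean mass from the full data vs. from the values left in each scenario
mean_mass <- c(
  full = mean(mass_kg),
  mcar = mean(mass_mcar, na.rm = TRUE),
  mar  = mean(mass_mar,  na.rm = TRUE),
  mnar = mean(mass_mnar, na.rm = TRUE)
)
round(mean_mass, 1)
```

Under these assumptions, the MCAR mean stays close to the full-data mean, while the MAR and MNAR means come out too low because heavier dolphins are removed more often.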
Recall:
Listwise deletion can introduce bias in parameter estimates
Yes... but it depends on the missingness mechanism.
Newman, DA (2014). Missing data: five practical guidelines. Organizational Research Methods 17 (4): 372-411. DOI: 10.1177/1094428114548590
[Figure: comparison of missing completely at random, missing at random, and missing not at random]
Turns out...
Mary E. Shotwell, Wayne E. McFee, and Elizabeth H. Slate (2016). A Bayesian mixture model for missing data in marine mammal growth analysis. Environmental and Ecological Statistics 23 (4): 585-603.
“One goal of the Marine Mammal Health and Stranding Response Program is to model bottlenose dolphin growth for the southeastern U.S. coastal population.”
“...it is more feasible to transport a smaller animal than a larger adult animal for laboratory measurements.”
“...causing an underestimate of animal mass when a growth curve is fit to only measured animals.”
data missing at random / missing not at random → listwise deletion → biased estimate
Shotwell et al. “develop[ed] a statistical model for growth in bottlenose dolphins...to compensate for the omission of larger animals in the complete case analysis.”
They handled missing data by:
A visit from Dr. Mary Shotwell!
First question: “What is missing?”
Part 3: Start exploring missingness in naniar
naniar, an R package by Dr. Nick Tierney, “provides principled, tidy ways to summarise, visualise, and manipulate missing data”
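As a first taste, here is a minimal sketch that asks "What is missing?" of the example dolphin table from Part 1. The data frame name `dolphins` is chosen here for illustration, and the naniar calls are wrapped in a check so that the base R count still runs if naniar is not installed.

```r
dolphins <- data.frame(
  length_m = c(1.21, 3.37, NA, 2.64, 3.25, 2.01, 1.84, 3.15, 3.71),
  age_yr   = c(1, 15, 2, 3, NA, 11, 3, 18, 14),
  mass_kg  = c(48.2, 162.3, 53.7, 107.4, NA, 82.4, 71.2, NA, NA)
)

# Base R: number of missing values in each variable
missing_per_var <- colSums(is.na(dolphins))
missing_per_var

# naniar: tidy summaries of the same information
if (requireNamespace("naniar", quietly = TRUE)) {
  print(naniar::n_miss(dolphins))             # total number of missing values
  print(naniar::miss_var_summary(dolphins))   # missing values per variable
  print(naniar::miss_case_summary(dolphins))  # missing values per row
  # naniar::vis_miss(dolphins) draws an overview plot of where the NAs sit
}
```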
Konza Prairie LTER
Konza Prairie Biological Station
Traditional lands of Kaw (Kansa) People
Data: Konza Prairie rodents
Citation: Hope A. 2019. CSM08 Small mammal host-parasite sampling data for 16 linear trapping transects located in 8 LTER burn treatment watersheds at Konza Prairie. Environmental Data Initiative. https://doi.org/10.6073/pasta/69109c56fcf21a30a8d37369cb47f8de.
Data structure: (stored as object kp_rodents)
Total observations (rows): 971
Let’s go to naniar to start exploring missing data:
Today we’ll explore:
https://allisonhorst.shinyapps.io/missingexplorer/
Let’s go coding.
Tutorial made with the learnr package in R.
Learn how to make your own:
https://rstudio.github.io/learnr/
Today’s takeaways:
Before lab this week:
Coming up:
QUESTIONS?
Become a missing explorer!