1 of 56

Allison Horst

How to become a missing explorer:

Including missingness in exploratory data analysis

2 of 56

PART 1 (15 min): Why we care about missingness

PART 2 (10 min): Missing data mechanisms

PART 3 (15 min): Exploring missing data in naniar!

3 of 56

Part 1: Why we care about missingness

4 of 56

Imagine: You are a marine biologist

Photo: www.nps.gov

5 of 56

CURRENT RESEARCH PROJECT:

Collect data on dead bottlenose dolphins to model growth

6 of 56

This work is critical for marine mammal conservation:

  • Model species for marine mammal health
  • Study long-term human & environmental impacts
  • Inform conservation management & policy

Over or underestimating growth will impact management decisions for the species and marine areas more broadly.

10 of 56

Variables you need for dolphin growth model:

  • Length (measuring tape)
  • Age (tooth sections)
  • Body mass

To measure body mass, you need to transport the dolphins to the lab, which is harder & sometimes impossible with larger dolphins.

11 of 56

We expect: a greater tendency for mass values to be missing for larger dolphins

12 of 56

Data might look like this:

length_m  age_yr  mass_kg
1.21      1       48.2
3.37      15      162.3
NA        2       53.7
2.64      3       107.4
3.25      NA      NA
2.01      11      82.4
1.84      3       71.2
3.15      18      NA
3.71      14      NA

NA = “not available” in R

NaN = “not a number” in R & Python
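A minimal sketch of the example table as an R data frame, plus how R distinguishes NA from NaN. The object names (dolphins, na_per_column, not_a_number, na_checks) are chosen here for illustration only.

```r
# Dolphin measurements transcribed from the example table; NA marks a missing value.
dolphins <- data.frame(
  length_m = c(1.21, 3.37, NA, 2.64, 3.25, 2.01, 1.84, 3.15, 3.71),
  age_yr   = c(1, 15, 2, 3, NA, 11, 3, 18, 14),
  mass_kg  = c(48.2, 162.3, 53.7, 107.4, NA, 82.4, 71.2, NA, NA)
)

# Count the missing values in each column.
na_per_column <- colSums(is.na(dolphins))
na_per_column

# NaN ("not a number") comes from undefined arithmetic such as 0 / 0.
# is.na() is TRUE for both NA and NaN; is.nan() is TRUE only for NaN.
not_a_number <- 0 / 0
na_checks <- c(
  is.na(NA),
  is.na(not_a_number),
  is.nan(NA),
  is.nan(not_a_number)
)
na_checks
```

In this table, mass_kg has the most missing values (3 of 9 rows), and length_m and age_yr have one each.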

13 of 56

Data might look like this:

length_m  age_yr  mass_kg
1.21      1       48.2
3.37      15      162.3
NA        2       53.7
2.64      3       107.4
3.25      NA      NA
2.01      11      82.4
1.84      3       71.2
3.15      18      NA
3.71      14      NA

Human error

14 of 56

Data might look like this:

length_m  age_yr  mass_kg
1.21      1       48.2
3.37      15      162.3
NA        2       53.7
2.64      3       107.4
3.25      NA      NA
2.01      11      82.4
1.84      3       71.2
3.15      18      NA
3.71      14      NA

Forgot pliers, couldn’t get tooth

15 of 56

Data might look like this:

length_m  age_yr  mass_kg
1.21      1       48.2
3.37      15      162.3
NA        2       53.7
2.64      3       107.4
3.25      NA      NA
2.01      11      82.4
1.84      3       71.2
3.15      18      NA
3.71      14      NA

Occasionally a large dolphin gets transported and massed... but mass values are more often missing for larger dolphins.

16 of 56

Missing values: Everybody’s got ’em.

“Even the best networks of environmental monitors do not operate flawlessly…”

Lawrence CL and M Shah (2016) J Air Waste Manag Assoc 66 (1), 38-52.

“Gear and vessels fail, storms curtail sampling…”

Rago, PJ (2004) Fishery independent sampling: survey techniques and data analyses. National Marine Fisheries Service.

“Missing data plagues almost all surveys, and quite a number of experiments...”

Lo Presti, R et al (2010) Environ Monit Assess 160 (1-4), 1-22. doi:10.1007/s10661-008-0653-3

“...in real-world data sets, missing data are the norm rather than the exception.”

Nakagawa, S and RP Freckleton (2008) Trends in Ecology and Evolution 23 (11), 592-596.

17 of 56

Missing values: everybody’s got ’em...

...AND everyone should deliberately deal with missings in their data analyses.

18 of 56

length_m  age_yr  mass_kg
1.21      1       48.2
3.37      15      162.3
NA        2       53.7
2.64      3       107.4
3.25      NA      NA
2.01      11      82.4
1.84      3       71.2
3.15      18      NA
3.71      14      NA

How are missing values most commonly “dealt with”?

LISTWISE DELETION:

Any observation (row) containing NA for a variable included in analysis is dropped
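A minimal sketch of listwise deletion in base R, with the example table typed in by hand. na.omit() and complete.cases() are base R functions; the object names are chosen here for illustration only.

```r
# Dolphin measurements transcribed from the example table.
dolphins <- data.frame(
  length_m = c(1.21, 3.37, NA, 2.64, 3.25, 2.01, 1.84, 3.15, 3.71),
  age_yr   = c(1, 15, 2, 3, NA, 11, 3, 18, 14),
  mass_kg  = c(48.2, 162.3, 53.7, 107.4, NA, 82.4, 71.2, NA, NA)
)

# Listwise deletion: drop every row that has an NA in any column.
complete_rows <- na.omit(dolphins)

# The rows that listwise deletion throws away.
dropped_rows <- dolphins[!complete.cases(dolphins), ]

n_kept    <- nrow(complete_rows)
n_dropped <- nrow(dropped_rows)

# Compare observed lengths in the kept rows and in the dropped rows.
mean_length_kept    <- mean(complete_rows$length_m)
mean_length_dropped <- mean(dropped_rows$length_m, na.rm = TRUE)
```

Here 4 of the 9 rows are dropped, and the dropped rows still hold observed lengths and ages. The dropped dolphins with a recorded length are longer on average than the kept ones, which is the pattern to worry about.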

19 of 56

Default mindset:

KEEP CALM

AND

DISAPPEAR NAs

20 of 56

Listwise deletion is the default in most software

Default code:

na.action = na.omit

na.rm = TRUE

22 of 56

ggplot():

lm() & glm():
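A minimal sketch of how quietly rows disappear in base R, assuming a default R session where getOption("na.action") is "na.omit". ggplot2 behaves similarly for plots: geoms such as geom_point() drop rows with missing values and only issue a warning. The object names are chosen here for illustration only.

```r
# Dolphin measurements transcribed from the example table.
dolphins <- data.frame(
  length_m = c(1.21, 3.37, NA, 2.64, 3.25, 2.01, 1.84, 3.15, 3.71),
  age_yr   = c(1, 15, 2, 3, NA, 11, 3, 18, 14),
  mass_kg  = c(48.2, 162.3, 53.7, 107.4, NA, 82.4, 71.2, NA, NA)
)

# Summary functions return NA until you ask them to drop missing values.
mean_with_na <- mean(dolphins$mass_kg)
mean_na_rm   <- mean(dolphins$mass_kg, na.rm = TRUE)

# lm() drops incomplete rows without a message (listwise deletion
# on the variables in the formula).
fit <- lm(mass_kg ~ length_m, data = dolphins)

n_used    <- nobs(fit)              # rows actually used in the fit
n_removed <- length(fit$na.action)  # rows silently removed
```

The model is fit to 5 of the 9 dolphins, and nothing in the default printed output of fit says so. summary(fit) does report the number of observations deleted due to missingness, but you have to look for it.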

23 of 56

Listwise deletion…

  • Omits valuable existing data
  • Reduces statistical power (lower n)
  • Can introduce bias in parameter estimates

24 of 56

length_m  age_yr  mass_kg
1.21      1       48.2
3.37      15      162.3
NA        2       53.7
2.64      3       107.4
3.25      NA      NA
2.01      11      82.4
1.84      3       71.2
3.15      18      NA
3.71      14      NA

WAIT WHAT?

My estimates for dolphin growth rates might be biased if I use listwise deletion, the default in most software!?

25 of 56

Listwise deletion…

  • Omits valuable existing data
  • Reduces statistical power (lower n)
  • Can introduce bias in parameter estimates

Yes...but depends on the missingness mechanism.

26 of 56

Part 2: Mechanisms of missingness

27 of 56

  • Missing completely at random (MCAR)
  • Missing at random (MAR)
  • Missing not at random (MNAR)

33 of 56

Missing completely at random (MCAR):

Missingness doesn’t depend on observed or unobserved variables

Examples: Missingness in dolphin length depends on...

Missing at random (MAR):

Missingness depends only on observed variables, not on the missing values themselves

Example: Missingness in dolphin mass depends on dolphin length.

Missing not at random (MNAR):

Missingness depends on the value of the missing data

Example: Missingness in dolphin mass depends on dolphin mass.

34 of 56

So we might not know whether dolphin mass is missing at random (missingness depends on length) or missing not at random (missingness depends on mass itself), but we do know it’s not missing completely at random.
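A small simulation sketch of the three mechanisms in base R. Everything here is made up for illustration: the sample size, the length range, the mass-length relationship, and the missingness probabilities are assumptions, not values from any dolphin study.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical population: mass increases with length, plus noise.
n        <- 10000
length_m <- runif(n, min = 1, max = 4)
mass_kg  <- 50 * length_m + rnorm(n, sd = 10)

# MCAR: every mass value has the same 30% chance of being missing.
miss_mcar <- runif(n) < 0.3

# MAR: longer dolphins are more likely to have missing mass
# (missingness depends on the observed length).
miss_mar <- runif(n) < plogis(2 * (length_m - 3))

# MNAR: heavier dolphins are more likely to have missing mass
# (missingness depends on the unobserved mass itself).
miss_mnar <- runif(n) < plogis((mass_kg - 150) / 20)

# Mean mass from the full data vs. from only the observed values.
mean_mass <- c(
  true = mean(mass_kg),
  mcar = mean(mass_kg[!miss_mcar]),
  mar  = mean(mass_kg[!miss_mar]),
  mnar = mean(mass_kg[!miss_mnar])
)
round(mean_mass, 1)
```

With these made-up settings, the MCAR mean should land close to the true mean, while the MAR and MNAR means should come out clearly lower, because the larger dolphins are the ones being dropped.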

35 of 56

Recall:

Listwise deletion can introduce bias in parameter estimates

Yes...but depends on the missingness mechanism.

36 of 56

Newman, DA (2014). Missing data: five practical guidelines. Organizational research methods 17 (4): 372 - 411. DOI: 10.1177/1094428114548590

Missing completely at random

Missing at random

Missing not at random

38 of 56

Turns out...

Mary E. Shotwell, Wayne E. McFee, and Elizabeth H. Slate (2016). A Bayesian mixture model for missing data in marine mammal growth analysis. Environmental and Ecological Statistics 23 (4): 585-603.

“One goal of the Marine Mammal Health and Stranding Response Program is to model bottlenose dolphin growth for the southeastern U.S. coastal population.”

42 of 56

“...it is more feasible to transport a smaller animal than a larger adult animal for laboratory measurements.”

“...causing an underestimate of animal mass when a growth curve is fit to only measured animals.”

data missing at random / missing not at random

listwise deletion

biased estimate

44 of 56

Shotwell et al. “Develop(ed) a statistical model for growth in bottlenose dolphins...to compensate for the omission of larger animals in the complete case analysis.”

They handled missing data by:

  • Thinking carefully about & exploring missingness
  • Determining that listwise deletion could bias mass estimates
  • Developing an appropriate model to deal with missingness
  • Comparing it to other methods

45 of 56

A visit from Dr. Mary Shotwell!

46 of 56

First question: “What is missing?”

47 of 56

Part 3: Start exploring missingness in naniar

naniar, an R package by Dr. Nick Tierney, “provides principled, tidy ways to summarise, visualise, and manipulate missing data”
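A first-look sketch of naniar's numeric summaries, assuming the naniar package is installed. It uses R's built-in airquality data as a stand-in, since the Konza Prairie data is not loaded here; the object names are chosen for illustration only.

```r
library(naniar)

# How many values are missing in the whole data frame?
total_missing <- n_miss(airquality)

# Missingness by variable: one row per column, most-missing first.
by_variable <- miss_var_summary(airquality)
by_variable

# Missingness by case: one row per observation.
by_case <- miss_case_summary(airquality)
head(by_case)
```

These summaries answer the question “What is missing?” before any plotting: which variables have gaps, and which rows.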

48 of 56

Konza Prairie LTER

Konza Prairie Biological Station

Traditional lands of Kaw (Kansa) People

49 of 56

Data: Konza Prairie rodents

Citation: Hope A. 2019. CSM08 Small mammal host-parasite sampling data for 16 linear trapping transects located in 8 LTER burn treatment watersheds at Konza Prairie. Environmental Data Initiative. https://doi.org/10.6073/pasta/69109c56fcf21a30a8d37369cb47f8de.

50 of 56

Data structure: (stored as object kp_rodents)

Total observations (rows): 971

51 of 56

Let’s go to naniar to start exploring missing data:

Today we’ll explore:

  • Heatmap of missingness
  • Missing co-occurrences
  • Missing relationships
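A rough sketch of one naniar function for each of the three explorations, assuming naniar and ggplot2 are installed. The built-in airquality data stands in for kp_rodents, and the choice of Ozone and Solar.R as the plotted variables is only for illustration.

```r
library(naniar)
library(ggplot2)

# Heatmap of missingness: one cell per value, shaded by missing / present.
missing_heatmap <- vis_miss(airquality)
missing_heatmap

# Missing co-occurrences: an upset plot of which variables are missing together.
gg_miss_upset(airquality)

# Missing relationships: a scatterplot that keeps rows with missing values
# by placing them below the observed range instead of dropping them.
missing_scatter <- ggplot(airquality, aes(x = Ozone, y = Solar.R)) +
  geom_miss_point()
missing_scatter
```

Swapping in kp_rodents and its column names should give the same three views for the Konza Prairie data.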

52 of 56

Let’s go coding.

53 of 56

Tutorial made with the learnr package in R.

Learn how to make your own:

https://rstudio.github.io/learnr/

Brand new RStudio Education post:

URL HERE (copy to chat window)

54 of 56

Today’s takeaways:

  1. Default mindset:

  • Include missingness in exploratory data analysis
  • There are tools to help you be a missing explorer

55 of 56

Before lab this week:

  • Complete naniar tutorial
  • Read Ch. 1.1 - 1.4 in Stef van Buuren’s Flexible Imputation of Missing Data

Coming up:

  • Diagnosing mechanisms of missingness
  • Different ways to handle missingness
    • Listwise deletion
    • Pairwise deletion
    • Imputation

56 of 56

QUESTIONS?

Become a missing explorer!