1 of 45

Better Simulations for Validating Causal Discovery with the DAG-Adaptation of the Onion Method

Erich Kummerfeld

Research Assistant Professor

Institute for Health Informatics, University of Minnesota


2 of 45

Hi!

Some background about me:

  • PhD at CMU
    • Mixture of causal discovery and philosophy of science
    • Trained under Peter Spirtes, Clark Glymour, David Danks, etc.
  • Postdoc at UPitt, Center for Causal Discovery
    • Transition to health informatics, under Greg Cooper
  • Faculty at UMN, Institute for Health Informatics
    • Institute led by Constantin Aliferis


3 of 45

Causal discovery experience

  • Developing new causal methods
  • Data analysis best practices
  • Applying data analysis, including causal discovery, to specific domain problems. Mostly:
    • Addiction and alcohol
    • Aging
    • Neuroimaging
    • Psychiatry
    • Other areas


4 of 45

Talk structure

  1. Describe some experiences with applied work

  2. Summarize my perspective on how causal discovery fits into domain science right now

  3. Describe a methods project that targets specific weaknesses of causal discovery in modern science

  4. Then the open discussion phase


5 of 45

Some examples of applications


How do nursing homes benefit from an on-site APRN?

6 of 45


What are the causes and effects of PTSD symptoms in populations with PTSD diagnosis?

7 of 45


What is the causal explanation for comorbid INTD and AUD?

8 of 45


What are the mechanisms that relate brain and behavior variables, and ultimately AUD?

9 of 45


How do individuals differ in terms of what causes them to drink?

10 of 45


How are brain networks causally connected during rest?

11 of 45


How is brain connectivity different during psychosis?

12 of 45


How do conspiracy theory beliefs relate to vaccine intentions and attitudes?

13 of 45


How can we improve treatment for psychosis?

14 of 45


What brain connectivity changes does neuromodulation cause, and what brain connectivities cause relapse?

15 of 45

Some position papers led by domain scientists promoting Causal Discovery


16 of 45

A podcast?!


17 of 45

Summary of work: everything is ad hoc

  • Many different data types, sizes, shapes
  • Many different project goals
  • Many different algorithms used
  • Many different roles for CD in the approach

  • Important lesson for anyone wanting to do applications: the graph is usually not the Primary Research Product


18 of 45

In most projects

The graph is merely one of multiple stepping stones to the primary finding.


19 of 45

What do domain scientists think of CD?

My impressions of applied scientists and clinicians

  • They want to answer a human understandable question about their topic
  • They don’t really care what methods are used
  • They are more worried about whether things are being measured correctly
  • To them, causal discovery is new and interesting, but it’s unclear what to do with it or if they should trust it


20 of 45

Trust in AI?

  • Scientist attitudes towards AI, statistics, etc. can vary wildly.
  • Most scientists have a rudimentary understanding of statistics. For many, it’s just an annoying hurdle to get over for publications and grants.

  • Scientists can spin stories rapidly and support them with literature. The story is most important. For some of them AI appears to be a muse.


21 of 45

Passing the Statistics Gate

  • While many research scientists are fast and loose, they are still beholden to their statistics experts. (Otherwise they may not do statistics at all.)


22 of 45

Validating Causal Discovery

  • “Our journal only accepts observational studies using new methods if they are externally validated.”
  • Major problem for CD: validation is lacking!
  • Supervised learning
    • holdout samples, cross-validation, etc.
    • models tested in separate populations
  • Experiments and regressions
    • confidence intervals, p-values, etc.
  • What does causal discovery have??
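For contrast, the kind of finite-sample validation supervised learning enjoys can be sketched in a few lines. This is a generic illustration (not from the talk); `cross_val_mse`, `ols_fit`, and `ols_predict` are hypothetical stand-in names.

```python
import numpy as np

def cross_val_mse(X, y, fit, predict, k=5, seed=0):
    """Estimate out-of-sample error by k-fold cross-validation: a
    direct, finite-sample accuracy check of the sort supervised
    learning has and causal discovery currently lacks."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errs.append(np.mean((predict(model, X[test]) - y[test]) ** 2))
    return float(np.mean(errs))

# Ordinary least squares as an illustrative stand-in model.
def ols_fit(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def ols_predict(beta, X):
    return X @ beta
```

The point is not the model but the yardstick: the held-out error is an empirical estimate of real-world performance, which no comparable statistic exists for a discovered graph.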


23 of 45

Decision makers need confidence

  • Clinical doctor: how can I make a medical decision based on this model if I don’t know how accurate it is? This patient’s life is on the line, and I’m liable for any mistakes.


24 of 45

The current reality of CD validation

  • How are causal discovery methods validated?
  • Proofs of correctness. But these are pointwise limit theorems, and the existing proofs about finite samples make completely unrealistic assumptions.
  • Simulations. But the simulations are completely ad hoc, unrealistic, inconsistent, and cherry-picked. There is no reason to expect current simulations to indicate how real world applications will perform.


25 of 45

Quick shoutout to other methods

  • There are some other approaches gaining traction
    • resampling stability
    • model fit statistics (what SEM-based fields use)
    • performance on graph-informed predictive modeling
    • etc.

  • But none of these are very mature, and their limitations are unknown
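The first of these, resampling stability, is easy to sketch. Below is a minimal bootstrap edge-stability loop; `edge_stability` and `corr_skeleton` are hypothetical names, and the threshold-on-correlation "discovery method" is a toy stand-in for a real causal discovery call.

```python
import numpy as np

def edge_stability(data, fit_skeleton, n_boot=100, seed=0):
    """Resampling stability: refit a discovery method on bootstrap
    resamples and record how often each edge is selected. Edges that
    appear in most resamples are the more trustworthy ones."""
    rng = np.random.default_rng(seed)
    n, p = data.shape
    freq = np.zeros((p, p))
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        freq += fit_skeleton(data[idx])
    return freq / n_boot

def corr_skeleton(data, thresh=0.3):
    """Toy stand-in for a causal discovery call: connect variables
    whose absolute marginal correlation exceeds a threshold."""
    A = (np.abs(np.corrcoef(data, rowvar=False)) > thresh).astype(int)
    np.fill_diagonal(A, 0)
    return A
```

Note the limitation flagged on the slide: a stable edge is a reproducible edge, not necessarily a correct one, so stability alone cannot certify accuracy.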


26 of 45

One direction: improve simulations

  • Our contribution: make simulations better.
  • Simulations should
    • Generate data from a well-characterized and reasonable distribution of data distributions.
    • Include all possible real-world scenarios
    • NOT include simulation artifacts that don’t exist in real world data
    • NOT permit cherry-picking to make algorithms look better or worse than they are


27 of 45

How do existing simulation methods do on these criteria?

    • Generate data from a well-characterized and reasonable distribution of data distributions.
    • Include all possible real-world scenarios
    • NOT include simulation artifacts that don’t exist in real world data
    • NOT permit cherry-picking to make algorithms look better or worse than they are


28 of 45

How do existing simulation methods do on these criteria?

Existing simulations are insufficient for even comparing relative performance of methods.

They are nowhere near providing evidence for expecting good performance on real data!


29 of 45

Our idea: sample uniformly

All of those points (and more!) can be achieved by sampling uniformly from the space of correlation matrices

  1. A DAG is used as input.
  2. Randomly sample from correlation matrices that are consistent with that DAG. This simultaneously and uniquely assigns values to all free parameters of the DAG, including edge weights and variance terms.
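The unconstrained building block here is short enough to sketch. The onion construction below draws a correlation matrix uniformly from the space of all d×d correlation matrices (the LKJ distribution with η = 1); the DAG adaptation in the paper constrains this so the sample is consistent with the input DAG, which is not shown here. `sample_uniform_correlation` is an illustrative name, not the paper's API.

```python
import numpy as np

def sample_uniform_correlation(d, seed=None):
    """Sample a d x d correlation matrix ~uniformly via the onion
    method: grow the matrix one variable at a time, drawing each new
    row's correlation vector with a Beta-distributed squared radius
    and a uniform direction so the joint density stays uniform."""
    rng = np.random.default_rng(seed)
    R = np.eye(d)
    for k in range(1, d):
        # Squared radius: Beta parameters chosen for the uniform
        # (eta = 1) case of the LKJ distribution.
        r2 = rng.beta(k / 2.0, (d - k + 1) / 2.0)
        # Direction: uniform on the unit sphere in k dimensions.
        u = rng.standard_normal(k)
        u /= np.linalg.norm(u)
        # Map through the Cholesky factor of the existing block so the
        # result stays a valid (positive-definite) correlation matrix.
        L = np.linalg.cholesky(R[:k, :k])
        w = np.sqrt(r2) * (L @ u)
        R[k, :k] = w
        R[:k, k] = w
    return R
```

Because every free parameter is implied by the sampled matrix, there are no edge-weight or noise-variance knobs left for an experimenter to tune, which is what blocks cherry-picking.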


30 of 45

Shoutout to Bryan

Bryan Andrews basically did everything.


31 of 45

What does sampling uniformly look like?

  • [show examples from 3D plot]
  • [see right for reference graphs]
  • Proof is in the paper (preprint on arXiv, manuscript currently under peer review)
    • https://arxiv.org/abs/2405.13100


[Reference graphs: emitter, chain, collider]

32 of 45

Simulated model parameters


ZARX: NOTEARS papers. Tetrad: BOSS paper.

33 of 45

Parameter distributions (edges and errors)



34 of 45

R² sortability?



35 of 45

Definition of evaluation statistics
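The slide's table of definitions is not reproduced here. As an illustration, the most common adjacency-level statistics can be computed as follows; `adjacency_stats` is a hypothetical helper, restricted to the undirected skeleton, and the paper's exact (e.g. orientation-aware) definitions may differ.

```python
import numpy as np

def adjacency_stats(true_adj, est_adj):
    """Adjacency precision, recall, and structural Hamming distance
    (SHD) over the undirected skeletons of two adjacency matrices."""
    # Symmetrize to compare skeletons, then look only at the upper
    # triangle so each variable pair is counted once.
    true_skel = ((true_adj + true_adj.T) > 0).astype(int)
    est_skel = ((est_adj + est_adj.T) > 0).astype(int)
    iu = np.triu_indices_from(true_skel, k=1)
    t, e = true_skel[iu], est_skel[iu]
    tp = int(np.sum((t == 1) & (e == 1)))  # correctly recovered edges
    fp = int(np.sum((t == 0) & (e == 1)))  # spurious edges
    fn = int(np.sum((t == 1) & (e == 0)))  # missed edges
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    shd = fp + fn  # skeleton-level SHD: additions plus deletions
    return precision, recall, shd
```

For example, against a true chain 0 → 1 → 2, an estimate with edges 0 → 1 and 0 → 2 scores precision 0.5, recall 0.5, and SHD 2.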


36 of 45

CD methods on DaO data


dLiNGAM used the same models but with non-Gaussian (exponential) errors.

With standard evaluations, DaO is a difficult test for most algorithms.

37 of 45

And non-DaO simulations…

37


Some algorithms that do very poorly on DaO suddenly do extremely well on non-DaO simulations

38 of 45

Going forward (1)

  • DaO should be a standard that any global causal discovery algorithm based on the covariance matrix must be evaluated on
  • This serves as a foundation to start empirically evaluating finite-sample performance in a way that extends to real world data


39 of 45

Going forward (2)

  • Other methods of evaluation are likely very important
    • Because different types of science questions require different types of method evaluation
    • e.g. does the model estimate total effects well?

  • There are many opportunities to extend DaO to better reflect more types of distributions and real world scenarios, such as latent confounding, time-series data, cyclic models, etc.


40 of 45

Some limitations

  • DaO currently makes no attempt to simulate specific real world situations
    • Growing list of simulation methods for specific domains, such as fMRI data, gene expression data, survey data…
  • Current evaluations on DaO depend heavily on performance on small effect sizes
    • Most real-world effect sizes that scientists care about are moderate or large.
  • More limitations in paper. For time, let’s move on!


41 of 45

Discussion Questions

  1. How should we a priori validate causal discovery algorithms to ensure that they are ready for use in real world applications where real lives are at stake?
  2. How should we post hoc quantify the uncertainty of the results of causal discovery after it has been applied to data?
  3. What types of scientific questions are causal discovery methods best suited to answer compared to other methods, and can we quantify the performance of causal discovery in answering those questions, either a priori or post hoc?
  4. When causal discovery is used as part of a larger analysis pipeline, how can we quantify the uncertainty or variability of results for the entire pipeline?



45 of 45
