1 of 49

Causal graph structure learning

Sep 25th 2025

These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Sushmita Roy and Anthony Gitter

BMI/CS 775 Computational Network Biology, Fall 2025

Sushmita Roy

https://bmi775.sites.wisc.edu

2 of 49

Plan for this section

  • Background in molecular biology and graph theory (Sep 9th)
  • Representing molecular networks as probabilistic graphical models (Sep 9th)
  • Learning Bayesian networks (Sep 9th,11th)
  • Learning Dependency networks (Sep 16th, Sep 18th)
  • Dynamics in graph learning (Sep 18th, 23rd)
  • Causality in graph learning (Sep 25th)


3 of 49

Readings

  • Glymour, C., Zhang, K., Spirtes, P., 2019. Review of Causal Discovery Methods Based on Graphical Models. Front. Genet. 10, 524. https://doi.org/10.3389/fgene.2019.00524

4 of 49

Plan for today

  • Representing causal structures
  • Some definitions
    • d-separation and conditional independence
    • Causal Markov assumption
    • Faithfulness
  • Classes of algorithms for causal discovery
    • Constraint-based algorithms
    • Combinatorial search
    • Continuous optimization
  • Benchmarking

5 of 49

What do we mean by causality?


Uhler and Squires 2022; Luo et al., 2020

6 of 49

Different terminologies for causal graphs

  • Directed Graphical Causal Model (DGCM)
    • Glymour, Zhang and Spirtes, 2019
  • Structural Causal Models (SCM)
    • Squires & Uhler, 2022
  • They capture similar notions of causality:
    • a directed edge corresponds to a cause-effect relationship
    • missing edges encode conditional independence relations

7 of 49

Interventional vs Observational data

  • Interventional data
    • We explicitly perturb the level of a node and measure the impact on other nodes
  • Observational data
    • No perturbations
    • Most data we have seen are “observational”
  • Interventional data are typically needed for learning causal relationships
  • However, there are a number of methods to infer causal relationships using only observational data.

(Figure: observational vs interventional data; Chevalley et al., 2023)

8 of 49

Directed Graphical Causal Models

  • A directed acyclic graph (DAG) whose nodes are random variables and whose directed edges denote direct causal influence, together with a joint distribution over the variables that satisfies the causal Markov condition

Glymour, C., Zhang, K., Spirtes, P., 2019.

(Example: X → Y)

9 of 49

Interpretation of a directed edge

  • An edge X → Y asserts that X is a direct cause of Y: manipulating X changes the distribution of Y even when all other variables are held fixed

Glymour, C., Zhang, K., Spirtes, P., 2019.

10 of 49

Functional Causal Models

  • Each variable is a function of its direct causes and an independent noise term: Xᵢ = fᵢ(pa(Xᵢ), εᵢ)

Glymour, C., Zhang, K., Spirtes, P., 2019.

11 of 49

d-separation or d-connecting

  • This concept is important to understand what conditional independencies the graph structure is asserting
  • d-separation:
    • A graphical condition that implies conditional independence between pairs of random variables given a third set of variables
    • For any two variables A and B, given observations/evidence on a third set of variables E, A and B are conditionally independent given E if they are d-separated by E
  • d-connecting:
    • The complement of d-separation
    • A and B are dependent given E if there is a d-connecting path between them given E

Glymour, C., Zhang, K., Spirtes, P., 2019.; Charniak 1999

12 of 49

d-separation notation

  • We write A ⊥_G B | E to denote that A and B are d-separated by E in graph G, and A ⊥_P B | E to denote that A and B are conditionally independent given E in the distribution P

13 of 49

d-separation constructs

A → B → C    linear path / chain

A → B ← C    converging / collider

A ← B → C    diverging / fork

14 of 49

d-separation

  • A path between A and B is blocked by E if it contains:
    • a linear (chain) or diverging node that is in E, or
    • a collider that is not in E and has no descendant in E
  • A and B are d-separated by E if every path between them is blocked by E

15 of 49

d-separation example

Consider the following graph (figure: DAG over N, W, H, O, P); suppose we observe H.

Consider W and P:

    • W and P are d-separated given H
    • W is independent of P given H

N: Night sleep, W: Weather, H: Happiness, O: Outfit, P: Productivity

16 of 49

d-separation example

Consider the following graph (figure: DAG over N, W, H, O, P); suppose we observe H.

Consider N and W:

    • H is a collider between N and W
    • Given H, N and W are not independent
    • If we observe the common effect of two variables, the causes are no longer independent

N: Night sleep, W: Weather, H: Happiness, O: Outfit, P: Productivity
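The reasoning in the two examples above can be checked mechanically with the ancestral moral graph criterion: restrict the graph to the ancestors of the query variables, marry co-parents, drop edge directions, delete the conditioning set, and test connectivity. A minimal sketch; the edge set N → H ← W, H → O, H → P is an assumption reconstructed from the slide legend:

```python
def d_separated(parents, xs, ys, zs):
    """Check d-separation via the ancestral moral graph method.
    `parents` maps each node to the set of its parents in the DAG."""
    relevant = set(xs) | set(ys) | set(zs)
    frontier = list(relevant)
    while frontier:                          # collect all ancestors
        node = frontier.pop()
        for p in parents.get(node, ()):
            if p not in relevant:
                relevant.add(p)
                frontier.append(p)
    # build the moralized, undirected adjacency over the relevant nodes
    adj = {v: set() for v in relevant}
    for child in relevant:
        ps = [p for p in parents.get(child, ()) if p in relevant]
        for p in ps:                         # parent-child edges
            adj[p].add(child); adj[child].add(p)
        for i in range(len(ps)):             # marry co-parents
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # delete the conditioning set and test reachability from X to Y
    blocked = set(zs)
    stack = [x for x in xs if x not in blocked]
    seen = set(stack)
    while stack:
        v = stack.pop()
        if v in ys:
            return False                     # d-connected
        for u in adj[v] - blocked - seen:
            seen.add(u); stack.append(u)
    return True

# Assumed edges for the slide example: N -> H <- W, H -> O, H -> P
parents = {"H": {"N", "W"}, "O": {"H"}, "P": {"H"}}
print(d_separated(parents, {"W"}, {"P"}, {"H"}))   # True:  W indep. P given H
print(d_separated(parents, {"N"}, {"W"}, set()))   # True:  marginally independent
print(d_separated(parents, {"N"}, {"W"}, {"H"}))   # False: collider H opened
```

The last call reproduces the collider behavior from the second example: conditioning on the common effect H makes N and W dependent.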

17 of 49

Causal Markov Assumption

  • Each variable is independent of its non-descendants (non-effects) given its direct causes (parents)
  • Together with the graph, this implies the factorization P(X₁, …, X_d) = ∏ᵢ P(Xᵢ | pa(Xᵢ))

18 of 49

Markov Equivalence Class

  • Graphs that satisfy the same set of d-separation properties specify the same set of conditional independencies
  • Such graphs are called Markov Equivalent
  • A collection of all DAGs that are Markov Equivalent form a Markov Equivalence Class

19 of 49

Local Markov condition example

(Figure: DAG over A, B, C, D, E; Friedman 2000)

  • I(E; A)

  • I(B; D | E, A)

  • I(C; E, A, D | B)

  • I(D; B, C, E | A)

Here E and A are marginally independent; every other variable is conditionally independent of a set of non-descendants given its parents.

20 of 49

Faithfulness

  • When no independence relations hold beyond those implied by d-separation, the distribution is said to be faithful with respect to the graph

  • Under the causal Markov assumption, d-separation is sufficient but not by itself necessary for conditional independence; faithfulness rules out “extra” independencies (e.g., from exactly canceling pathways), so that d-separation and conditional independence coincide
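Faithfulness can fail when causal pathways cancel exactly. A hypothetical linear-Gaussian illustration (the graph X → Y → Z with a direct edge X → Z is my own example, not from the slides): the direct effect is tuned so that X and Z are uncorrelated even though they are d-connected.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Graph: X -> Y -> Z plus a direct edge X -> Z.
# X and Z are d-connected given the empty set, so a faithful
# distribution would make them dependent -- but we choose the
# direct effect to exactly cancel the indirect path.
a, b = 0.7, 0.5          # X -> Y and Y -> Z weights
c = -a * b               # X -> Z weight, cancels the a*b path

X = rng.standard_normal(n)
Y = a * X + rng.standard_normal(n)
Z = b * Y + c * X + rng.standard_normal(n)

# cov(X, Z) = a*b + c = 0, so X and Z look independent:
print(abs(np.corrcoef(X, Z)[0, 1]) < 0.05)   # True (unfaithful!)
print(abs(np.corrcoef(X, Y)[0, 1]) > 0.3)    # True: X, Y clearly dependent
```

A constraint-based learner given this data would wrongly drop every edge into Z's neighborhood of X, which is why faithfulness has to be assumed.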

21 of 49

Top Hat Question

22 of 49

Plan for today

  • Some definitions
    • Representing causal structure
    • d-separation
    • Causal Markov assumption
  • Classes of algorithms for causal discovery
    • Constraint-based algorithms
    • Combinatorial search
    • Continuous optimization
  • Benchmarking

23 of 49

Classes of methods for causal graph structure discovery

  • Constraint-based algorithms
    • PC algorithm
  • Score-based combinatorial search
  • Hybrid methods
    • Greedy sparsest permutation
  • Score-based continuous optimization

24 of 49

The PC algorithm

  • One of the first algorithms for causal structure discovery
  • Assumes there are no hidden variables (confounders) and faithfulness
  • Relies on a set of independence and conditional independence tests
  • Has three main steps
    • Create a completely connected graph
    • Find the skeleton
    • Orient edges
  • At the end, we may still end up with a partially directed graph
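The skeleton-finding step can be sketched as follows, assuming linear-Gaussian data so that a Fisher-z test of partial correlation serves as the conditional-independence test. The threshold is deliberately loose for this demo; in practice it is derived from a significance level α.

```python
import numpy as np
from itertools import combinations

def gauss_ci(data, i, j, S, thresh=4.0):
    """Fisher-z test of the partial correlation rho(i, j | S).
    Returns True when i and j look conditionally independent."""
    n = data.shape[0]
    C = np.corrcoef(data[:, [i, j] + list(S)], rowvar=False)
    P = np.linalg.inv(C)                       # precision of the submatrix
    r = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])  # partial correlation
    r = np.clip(r, -0.9999, 0.9999)
    z = 0.5 * np.log((1 + r) / (1 - r))
    return np.sqrt(n - len(S) - 3) * abs(z) < thresh

def pc_skeleton(data):
    """Skeleton phase of PC: start from the complete graph and delete an
    edge whenever a CI test passes, conditioning on subsets (of growing
    size) of the current neighbors; remember the separating sets."""
    d = data.shape[1]
    adj = {i: set(range(d)) - {i} for i in range(d)}
    sepset = {}
    for level in range(d - 1):                 # size of conditioning set
        for i in range(d):
            for j in sorted(adj[i]):
                others = sorted(adj[i] - {j})
                for S in combinations(others, level):
                    if gauss_ci(data, i, j, list(S)):
                        adj[i].discard(j); adj[j].discard(i)
                        sepset[(i, j)] = sepset[(j, i)] = set(S)
                        break
    return adj, sepset

# Simulated chain X0 -> X1 -> X2 (linear-Gaussian)
rng = np.random.default_rng(1)
n = 5000
X0 = rng.standard_normal(n)
X1 = 0.8 * X0 + rng.standard_normal(n)
X2 = 0.8 * X1 + rng.standard_normal(n)
data = np.column_stack([X0, X1, X2])

adj, sepset = pc_skeleton(data)
print(adj)              # 0-1 and 1-2 kept; 0-2 removed
print(sepset[(0, 2)])   # {1}: X0 indep. X2 given X1
```

The separating sets recorded here are exactly what the edge-orientation step consults when deciding whether to form a v-structure.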

25 of 49

Steps of the PC algorithm

  1. Create a completely connected, undirected graph over all the variables
  2. Find the skeleton: for conditioning sets of increasing size, remove the edge between X and Y whenever X and Y are (conditionally) independent given some subset of their current neighbors, and record that subset as their separating set

26 of 49

Steps of the PC algorithm contd

  3. Orient edges
    • Introduce v-structures: if A and B are adjacent and C and B are adjacent, but A and C are not, orient the edges as the v-structure A → B ← C, provided B was not in the “conditioning” set when A and C became independent
    • Propagate orientation: direct the remaining edges wherever one orientation would create a new v-structure or a cycle

27 of 49

Remarks for the PC algorithm

  • The PC algorithm is feasible and scales to thousands of variables if the graph is sparse
  • The output of the PC algorithm is more useful than a Markov network (completely undirected) because Markov networks:
    • don’t capture the direction of causality
    • are denser, because colliders force their parents to be connected (moralization)

28 of 49

Limitations of constraint-based approaches

  • Need sufficient samples to carry out the independence tests reliably
  • Need to know (or assume) the functional form of the dependencies in order to choose an appropriate independence test

29 of 49

Classes of methods for causal graph structure discovery

  • Constraint-based algorithms
    • PC algorithm
  • Score-based methods/Combinatorial search
  • Hybrid methods
    • Greedy sparsest permutation
  • Continuous optimization

30 of 49

Combinatorial algorithms for causal discovery

  • These algorithms search in the space of graphs by defining a score and optimizing the score to find a graph
  • We have seen examples of such algorithms for DAGs e.g., sparse candidate
    • Another is Greedy Equivalence Search (GES) Chickering et al., 2003

31 of 49

Classes of methods for causal graph structure discovery

  • Constraint-based algorithms
    • PC algorithm
  • Score-based methods/Combinatorial search
  • Hybrid methods
    • Greedy sparsest permutation
  • Continuous optimization

32 of 49

Continuous optimization methods

  • Recast the combinatorial search over DAGs as a smooth, continuous optimization over weighted adjacency matrices, with an algebraic constraint enforcing acyclicity

Uhler and Squires 2022

33 of 49

Continuous optimization: DAG NOTEARS

  • Zheng X, Aragam B, Ravikumar P, Xing EP. DAGs with NO TEARS: Continuous Optimization for Structure Learning. arXiv:1803.01422 [cs, stat]. 2018. Available from: http://arxiv.org/abs/1803.01422
  • NOTEARS: Non-combinatorial Optimization via Trace Exponential and Augmented lagRangian for Structure learning
  • The first continuous optimization algorithm for learning DAG structure
  • Initial formulation was based on linear “Structural Equation Model”

Vowels et al., ACM Comput Surv

34 of 49

Notation

  • X ∈ ℝ^{n×d}: data matrix with n samples of d variables
  • W ∈ ℝ^{d×d}: weighted adjacency matrix; W_jk ≠ 0 means there is an edge j → k
  • G(W): the graph induced by the nonzero entries of W

35 of 49

Structural Equation Model (SEM)

  • Each variable is a linear combination of the states of its parent variables:

    X_j = Σ_{k ∈ pa(j)} W_kj X_k + ε_j

  • The set of regressions can be compactly written as:

    X = X W + E

where X ∈ ℝ^{n×d} is the data matrix (n: # of samples), W is the weighted adjacency matrix, and E collects the noise terms ε.
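A quick way to see the matrix form in action: for an acyclic W, X = XW + E can be solved as X = E(I − W)⁻¹ and used to simulate data. A small sketch with an assumed toy chain X0 → X1 → X2 (the weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 3

# Weighted adjacency matrix of an assumed toy DAG X0 -> X1 -> X2:
# W[k, j] is the weight on edge k -> j (strictly upper triangular => acyclic).
W = np.array([[0.0, 0.8,  0.0],
              [0.0, 0.0, -0.6],
              [0.0, 0.0,  0.0]])

# X = X W + E  =>  X = E (I - W)^{-1}
E = rng.standard_normal((n, d))
X = E @ np.linalg.inv(np.eye(d) - W)

# Regressing X1 on its parent X0 recovers the edge weight:
w01 = np.linalg.lstsq(X[:, [0]], X[:, 1], rcond=None)[0][0]
print(round(w01, 1))   # ~0.8
```

This is the generative model that score-based methods such as NOTEARS invert: they search for the W that best regenerates the observed X.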

36 of 49

Continuous vs discrete optimization framework

Discrete optimization:

    min_{G ∈ DAGs} F(G)

Continuous optimization:

    min_{W ∈ ℝ^{d×d}} F(W)   subject to   h(W) = 0

The continuous formulation lets standard smooth optimizers solve the DAG learning problem efficiently.

W is the weighted adjacency matrix; F(W) is the score/loss function; G(W) is the graph induced from W; h(W) = 0 checks for “DAGness” (acyclicity).

37 of 49

NOTEARS objective

  • NOTEARS makes use of an additional ℓ₁ sparsity penalty on W

  • Overall objective:

    min_W (1/2n) ‖X − X W‖_F² + λ ‖W‖₁   subject to   h(W) = 0
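The unconstrained part of the objective is only a few lines of numpy; the acyclicity constraint h(W) = 0 is handled separately (NOTEARS uses an augmented Lagrangian). A minimal sketch:

```python
import numpy as np

def notears_loss(W, X, lam=0.1):
    """Least-squares score with an l1 penalty: the unconstrained part of
    the NOTEARS objective, (1/2n)||X - XW||_F^2 + lam*||W||_1.
    The acyclicity constraint h(W) = 0 is enforced elsewhere."""
    n = X.shape[0]
    residual = X - X @ W
    return (0.5 / n) * np.sum(residual ** 2) + lam * np.abs(W).sum()

# Tiny sanity check: with W = 0 the score is just the scaled data norm.
X = np.eye(2)                                      # n = 2 samples, d = 2
print(notears_loss(np.zeros((2, 2)), X, lam=0.1))  # 0.5
```

Because the score is smooth in W (up to the ℓ₁ term), its gradient can be handed to a generic continuous optimizer, which is the whole point of the reformulation.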

38 of 49

 

  • The acyclicity constraint is h(W) = tr(e^{W∘W}) − d, where ∘ is the Hadamard (elementwise) product, e^{·} is the matrix exponential, and tr is the trace
  • h(W) = 0, if and only if W is acyclic
  • The value of h quantifies “DAGness”
  • h is smooth
  • h’s derivatives are easy to compute: ∇h(W) = (e^{W∘W})ᵀ ∘ 2W
39 of 49

 

  • The equality-constrained program is solved as a sequence of unconstrained smooth problems via an augmented Lagrangian method

40 of 49

Incorporating non-linearities in the continuous optimization framework

  • DAG-GNN
    • Yu et al., 2019
  • GraN-DAG
    • Lachapelle et al., 2020
  • Non-parametric SEM
    • Zheng et al., 2020

41 of 49

Deep learning and causal networks

Luo et al., 2019

42 of 49

Plan for today

  • Some definitions
    • Representing causal structure
    • d-separation
    • Causal Markov assumption
  • Classes of algorithms for causal discovery
    • Constraint-based algorithms
    • Combinatorial search
    • Continuous optimization
  • Benchmarking

43 of 49

Benchmarking studies of causal structure discovery

  • Hill, Steven M., Laura M. Heiser, Thomas Cokelaer, Michael Unger, Nicole K. Nesser, Daniel E. Carlin, Yang Zhang, et al. “Inferring Causal Molecular Networks: Empirical Assessment through a Community-Based Effort.” Nature Methods 13, no. 4 (April 2016): 310–18. https://doi.org/10.1038/nmeth.3773.
    • Focus on cancer signaling data
  • Lachapelle S, Brouillard P, Deleu T, Lacoste-Julien S. Gradient-Based Neural DAG Learning. arXiv:1906.02226 [cs, stat]. 2020. Available from: http://arxiv.org/abs/1906.02226
    • Largely based on simulations and small networks
  • Chevalley, M., Roohani, Y., Mehrjou, A., Leskovec, J., Schwab, P., 2023. CausalBench: A Large-scale Benchmark for Network Inference from Single-cell Perturbation Data.
    • Focus on nearly genome-scale networks
    • Leveraged real single cell RNA-seq datasets

44 of 49

Evaluation of causal graph structure

  • Precision/Recall
  • Structural Hamming Distance (SHD)
    • Count the number of falsely detected, missing and reversed edges
  • Structural Interventional Distance (SID)
    • Number of variable pairs (X, Y) where X’s distribution is miscalculated when intervening on Y
  • Change in distribution (Wasserstein distance) of X between the observational setting and the interventional setting in which a predicted parent Y of X is intervened on
  • False Omission Rate (FOR): the fraction of predicted non-edges that are in fact true edges
    • FOR = FN / (FN + TN)
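SHD, for example, reduces to comparing the two adjacency matrices pair by pair. A minimal sketch (conventions vary: here a reversed edge costs 1, while some implementations count it as 2):

```python
import numpy as np

def shd(A_true, A_pred):
    """Structural Hamming Distance between two DAG adjacency matrices
    (A[i, j] = 1 means edge i -> j): counts node pairs whose edge status
    differs, so missing, extra, and reversed edges each cost 1."""
    diff = 0
    d = A_true.shape[0]
    for i in range(d):
        for j in range(i + 1, d):
            true_pair = (A_true[i, j], A_true[j, i])
            pred_pair = (A_pred[i, j], A_pred[j, i])
            if true_pair != pred_pair:
                diff += 1
    return diff

# Truth 0 -> 1 -> 2 vs prediction 0 -> 1, 2 -> 1 (edge 1 -> 2 reversed)
A_true = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
A_pred = np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]])
print(shd(A_true, A_pred))   # 1
```

Unlike precision/recall, SHD penalizes a wrong orientation even when the skeleton is correct, which is why it is reported alongside SID in the benchmarks below.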

45 of 49

Empirical comparisons of different algorithms

ER, SF: different network topologies. ER: Erdős–Rényi; SF: scale-free

SHD: Structural Hamming Distance

SID: Structural Intervention Distance

Lachapelle et al 2020, GraN-DAG suite

46 of 49

CausalBench

Chevalley, M., Roohani, Y., Mehrjou, A., Leskovec, J., Schwab, P., 2023. CausalBench: A Large-scale Benchmark for Network Inference from Single-cell Perturbation Data.

47 of 49

How do methods perform?

Green: methods use observational data; Pink: methods use interventional data

Methods explicitly modeling interventional data don’t outperform others

48 of 49

CausalBench Ranking

49 of 49

Take away points

  • A causal graph: a graph that captures cause-effect relations between random variables
  • Learning these graphs from data requires us to make some assumptions about conditional independence:
    • causal Markov assumption/d-separation
    • faithfulness
  • Classes of algorithms for learning a causal graph: constraint-based (e.g., PC), score-based combinatorial search, hybrid, and continuous optimization (e.g., NOTEARS)
  • Benchmarking algorithms for causal structure discovery