1 of 49

Causal graph structure learning

Sep 25th 2025

These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Sushmita Roy and Anthony Gitter

BMI/CS 775 Computational Network Biology, Fall 2025

Sushmita Roy

https://bmi775.sites.wisc.edu

2 of 49

Plan for this section

  • Background in molecular biology and graph theory (Sep 9th)
  • Representing molecular networks as probabilistic graphical models (Sep 9th)
  • Learning Bayesian networks (Sep 9th,11th)
  • Learning Dependency networks (Sep 16th, Sep 18th)
  • Dynamics in graph learning (Sep 18th, 23rd)
  • Causality in graph learning (Sep 25th)


3 of 49

Readings

  • Glymour, C., Zhang, K., Spirtes, P., 2019. Review of Causal Discovery Methods Based on Graphical Models. Front. Genet. 10, 524. https://doi.org/10.3389/fgene.2019.00524

4 of 49

Plan for today

  • Representing causal structures
  • Some definitions
    • d-separation and conditional independence
    • Causal Markov assumption
    • Faithfulness
  • Classes of algorithms for causal discovery
    • Constraint-based algorithms
    • Combinatorial search
    • Continuous optimization
  • Benchmarking

5 of 49

What do we mean by causality?


Uhler and Squires 2022; Luo et al., 2020

6 of 49

Different terminologies for causal graphs

  • Directed Graphical Causal Model (DGCM)
    • Glymour, Zhang and Spirtes, 2019
  • Structural Causal Models (SCM)
    • Squires & Uhler, 2022
  • They capture similar notions of causality:
    • a directed edge corresponds to a cause-effect relationship
    • missing edges encode conditional independence relations

7 of 49

Interventional vs Observational data

  • Interventional data
    • We explicitly perturb the level of a node and measure the impact on other nodes
  • Observational data
    • No perturbations
    • Most data we have seen are “observational”
  • Interventional data are typically needed for learning causal relationships
  • However, there are a number of methods to infer causal relationships using only observational data.

(Figure: observational vs interventional data; Chevalley et al., 2023)

8 of 49

Directed Graphical Causal Models

  • A directed acyclic graph (DAG) whose nodes are random variables and whose directed edges denote direct causal influence, together with a joint distribution over the variables that satisfies the causal Markov condition

Glymour, C., Zhang, K., Spirtes, P., 2019.

(Example: X → Y)

9 of 49

Interpretation of a directed edge

  • An edge X → Y asserts that X is a direct cause of Y: manipulating X changes the distribution of Y even when all other variables are held fixed

Glymour, C., Zhang, K., Spirtes, P., 2019.

10 of 49

Functional Causal Models

  • Each variable is a function of its direct causes and an independent noise term: Xᵢ = fᵢ(pa(Xᵢ), εᵢ)

Glymour, C., Zhang, K., Spirtes, P., 2019.

11 of 49

d-separation or d-connecting

  • This concept is important to understand what conditional independencies the graph structure is asserting
  • d-separation:
    • A graphical condition that implies conditional independence between pairs of random variables given a third set of variables
    • For any two variables A and B, given observations/evidence on a third set of variables E, A and B are conditionally independent given E if they are d-separated by E
  • d-connecting:
    • The complement of d-separation
    • A and B are dependent given E if there is a d-connecting path between them given E

Glymour, C., Zhang, K., Spirtes, P., 2019.; Charniak 1999

12 of 49

d-separation notation

  • We write A ⊥_G B | E to denote that A and B are d-separated by E in graph G, and A ⊥_P B | E to denote that A and B are conditionally independent given E in the distribution P

13 of 49

d-separation constructs

A → B → C    linear path / chain

A → B ← C    converging / collider

A ← B → C    diverging / fork

14 of 49

d-separation

  • A path between A and B is blocked by E if it contains:
    • a linear (chain) or diverging node that is in E, or
    • a collider that is not in E and has no descendant in E
  • A and B are d-separated by E if every path between them is blocked by E

15 of 49

d-separation example

Consider the following graph (figure: DAG over N, W, H, O, P); suppose we observe H.

Consider W and P:

    • W and P are d-separated given H
    • W is independent of P given H

N: Night sleep, W: Weather, H: Happiness, O: Outfit, P: Productivity

16 of 49

d-separation example

Consider the following graph (figure: DAG over N, W, H, O, P); suppose we observe H.

Consider N and W:

    • H is a collider between N and W
    • Given H, N and W are not independent
    • If we observe the common effect of two variables, the causes are no longer independent

N: Night sleep, W: Weather, H: Happiness, O: Outfit, P: Productivity
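The reasoning in the two examples above can be checked mechanically with the ancestral moral graph criterion: restrict the graph to the ancestors of the query variables, marry co-parents, drop edge directions, delete the conditioning set, and test connectivity. A minimal sketch; the edge set N → H ← W, H → O, H → P is an assumption reconstructed from the slide legend:

```python
def d_separated(parents, xs, ys, zs):
    """Check d-separation via the ancestral moral graph method.
    `parents` maps each node to the set of its parents in the DAG."""
    relevant = set(xs) | set(ys) | set(zs)
    frontier = list(relevant)
    while frontier:                          # collect all ancestors
        node = frontier.pop()
        for p in parents.get(node, ()):
            if p not in relevant:
                relevant.add(p)
                frontier.append(p)
    # build the moralized, undirected adjacency over the relevant nodes
    adj = {v: set() for v in relevant}
    for child in relevant:
        ps = [p for p in parents.get(child, ()) if p in relevant]
        for p in ps:                         # parent-child edges
            adj[p].add(child); adj[child].add(p)
        for i in range(len(ps)):             # marry co-parents
            for j in range(i + 1, len(ps)):
                adj[ps[i]].add(ps[j]); adj[ps[j]].add(ps[i])
    # delete the conditioning set and test reachability from X to Y
    blocked = set(zs)
    stack = [x for x in xs if x not in blocked]
    seen = set(stack)
    while stack:
        v = stack.pop()
        if v in ys:
            return False                     # d-connected
        for u in adj[v] - blocked - seen:
            seen.add(u); stack.append(u)
    return True

# Assumed edges for the slide example: N -> H <- W, H -> O, H -> P
parents = {"H": {"N", "W"}, "O": {"H"}, "P": {"H"}}
print(d_separated(parents, {"W"}, {"P"}, {"H"}))   # True:  W indep. P given H
print(d_separated(parents, {"N"}, {"W"}, set()))   # True:  marginally independent
print(d_separated(parents, {"N"}, {"W"}, {"H"}))   # False: collider H opened
```

The last call reproduces the collider behavior from the second example: conditioning on the common effect H makes N and W dependent.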

17 of 49

Causal Markov Assumption

  • Each variable is independent of its non-descendants (non-effects) given its direct causes (parents)
  • Together with the graph, this implies the factorization P(X₁, …, X_d) = ∏ᵢ P(Xᵢ | pa(Xᵢ))

18 of 49

Markov Equivalence Class

  • Graphs that satisfy the same set of d-separation properties specify the same set of conditional independencies
  • Such graphs are called Markov Equivalent
  • A collection of all DAGs that are Markov Equivalent form a Markov Equivalence Class

19 of 49

Local Markov condition example

(Figure: DAG over A, B, C, D, E; Friedman 2000)

  • I(E; A)

  • I(B; D | E, A)

  • I(C; E, A, D | B)

  • I(D; B, C, E | A)

Here E and A are marginally independent; every other variable is conditionally independent of a set of non-descendants given its parents.

20 of 49

Faithfulness

  • When no independence relations hold beyond those implied by d-separation, the distribution is said to be faithful with respect to the graph

  • Under the causal Markov assumption, d-separation is sufficient but not by itself necessary for conditional independence; faithfulness rules out “extra” independencies (e.g., from exactly canceling pathways), so that d-separation and conditional independence coincide
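Faithfulness can fail when causal pathways cancel exactly. A hypothetical linear-Gaussian illustration (the graph X → Y → Z with a direct edge X → Z is my own example, not from the slides): the direct effect is tuned so that X and Z are uncorrelated even though they are d-connected.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Graph: X -> Y -> Z plus a direct edge X -> Z.
# X and Z are d-connected given the empty set, so a faithful
# distribution would make them dependent -- but we choose the
# direct effect to exactly cancel the indirect path.
a, b = 0.7, 0.5          # X -> Y and Y -> Z weights
c = -a * b               # X -> Z weight, cancels the a*b path

X = rng.standard_normal(n)
Y = a * X + rng.standard_normal(n)
Z = b * Y + c * X + rng.standard_normal(n)

# cov(X, Z) = a*b + c = 0, so X and Z look independent:
print(abs(np.corrcoef(X, Z)[0, 1]) < 0.05)   # True (unfaithful!)
print(abs(np.corrcoef(X, Y)[0, 1]) > 0.3)    # True: X, Y clearly dependent
```

A constraint-based learner given this data would wrongly drop every edge into Z's neighborhood of X, which is why faithfulness has to be assumed.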

21 of 49

Top Hat Question

22 of 49

Plan for today

  • Some definitions
    • Representing causal structure
    • d-separation
    • Causal Markov assumption
  • Classes of algorithms for causal discovery
    • Constraint-based algorithms
    • Combinatorial search
    • Continuous optimization
  • Benchmarking

23 of 49

Classes of methods for causal graph structure discovery

  • Constraint-based algorithms
    • PC algorithm
  • Score-based combinatorial search
  • Hybrid methods
    • Greedy sparsest permutation
  • Score-based continuous optimization

24 of 49

The PC algorithm

  • One of the first algorithms for causal structure discovery
  • Assumes there are no hidden variables (confounders) and faithfulness
  • Relies on a set of independence and conditional independence tests
  • Has three main steps
    • Create a completely connected graph
    • Find the skeleton
    • Orient edges
  • At the end, we may still end up with a partially directed graph
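The skeleton-finding step can be sketched as follows, assuming linear-Gaussian data so that a Fisher-z test of partial correlation serves as the conditional-independence test. The threshold is deliberately loose for this demo; in practice it is derived from a significance level α.

```python
import numpy as np
from itertools import combinations

def gauss_ci(data, i, j, S, thresh=4.0):
    """Fisher-z test of the partial correlation rho(i, j | S).
    Returns True when i and j look conditionally independent."""
    n = data.shape[0]
    C = np.corrcoef(data[:, [i, j] + list(S)], rowvar=False)
    P = np.linalg.inv(C)                       # precision of the submatrix
    r = -P[0, 1] / np.sqrt(P[0, 0] * P[1, 1])  # partial correlation
    r = np.clip(r, -0.9999, 0.9999)
    z = 0.5 * np.log((1 + r) / (1 - r))
    return np.sqrt(n - len(S) - 3) * abs(z) < thresh

def pc_skeleton(data):
    """Skeleton phase of PC: start from the complete graph and delete an
    edge whenever a CI test passes, conditioning on subsets (of growing
    size) of the current neighbors; remember the separating sets."""
    d = data.shape[1]
    adj = {i: set(range(d)) - {i} for i in range(d)}
    sepset = {}
    for level in range(d - 1):                 # size of conditioning set
        for i in range(d):
            for j in sorted(adj[i]):
                others = sorted(adj[i] - {j})
                for S in combinations(others, level):
                    if gauss_ci(data, i, j, list(S)):
                        adj[i].discard(j); adj[j].discard(i)
                        sepset[(i, j)] = sepset[(j, i)] = set(S)
                        break
    return adj, sepset

# Simulated chain X0 -> X1 -> X2 (linear-Gaussian)
rng = np.random.default_rng(1)
n = 5000
X0 = rng.standard_normal(n)
X1 = 0.8 * X0 + rng.standard_normal(n)
X2 = 0.8 * X1 + rng.standard_normal(n)
data = np.column_stack([X0, X1, X2])

adj, sepset = pc_skeleton(data)
print(adj)              # 0-1 and 1-2 kept; 0-2 removed
print(sepset[(0, 2)])   # {1}: X0 indep. X2 given X1
```

The separating sets recorded here are exactly what the edge-orientation step consults when deciding whether to form a v-structure.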

25 of 49

Steps of the PC algorithm

  1. Create a completely connected, undirected graph over all the variables
  2. Find the skeleton: for conditioning sets of increasing size, remove the edge between X and Y whenever X and Y are (conditionally) independent given some subset of their current neighbors, and record that subset as their separating set

26 of 49

Steps of the PC algorithm contd

  3. Orient edges
    • Introduce v-structures: if A and B are adjacent and C and B are adjacent, but A and C are not, orient the edges as the v-structure A → B ← C, provided B was not in the “conditioning” set when A and C became independent
    • Propagate orientation: direct the remaining edges wherever one orientation would create a new v-structure or a cycle

27 of 49

Remarks for the PC algorithm

  • The PC algorithm is feasible and scales to thousands of variables if the graph is sparse
  • The output of the PC algorithm is more useful than a Markov network (completely undirected) because Markov networks:
    • don’t capture the direction of causality
    • are denser, because colliders force their parents to be connected (moralization)

28 of 49

Limitations of constraint-based approaches

  • Need sufficient samples to carry out the independence tests reliably
  • Need to know (or assume) the functional form of the dependencies in order to choose an appropriate independence test

29 of 49

Classes of methods for causal graph structure discovery

  • Constraint-based algorithms
    • PC algorithm
  • Score-based methods/Combinatorial search
  • Hybrid methods
    • Greedy sparsest permutation
  • Continuous optimization

30 of 49

Combinatorial algorithms for causal discovery

  • These algorithms search in the space of graphs by defining a score and optimizing the score to find a graph
  • We have seen examples of such algorithms for DAGs e.g., sparse candidate
    • Another is Greedy Equivalence Search (GES) Chickering et al., 2003

31 of 49

Classes of methods for causal graph structure discovery

  • Constraint-based algorithms
    • PC algorithm
  • Score-based methods/Combinatorial search
  • Hybrid methods
    • Greedy sparsest permutation
  • Continuous optimization

32 of 49

Continuous optimization methods

  • Recast the combinatorial search over DAGs as a smooth, continuous optimization over weighted adjacency matrices, with an algebraic constraint enforcing acyclicity

Uhler and Squires 2022

33 of 49

Continuous optimization: DAG NOTEARS

  • Zheng X, Aragam B, Ravikumar P, Xing EP. DAGs with NO TEARS: Continuous Optimization for Structure Learning. arXiv:1803.01422 [cs, stat]. 2018. Available from: http://arxiv.org/abs/1803.01422
  • NOTEARS: Non-combinatorial Optimization via Trace Exponential and Augmented lagRangian for Structure learning
  • The first continuous optimization algorithm for learning DAG structure
  • Initial formulation was based on linear “Structural Equation Model”

Vowels et al., ACM Comput Surv

34 of 49

Notation

  • X ∈ ℝ^{n×d}: data matrix with n samples of d variables
  • W ∈ ℝ^{d×d}: weighted adjacency matrix; W_jk ≠ 0 means there is an edge j → k
  • G(W): the graph induced by the nonzero entries of W

35 of 49

Structural Equation Model (SEM)

  • Each variable is a linear combination of the states of its parent variables:

    X_j = Σ_{k ∈ pa(j)} W_kj X_k + ε_j

  • The set of regressions can be compactly written as:

    X = X W + E

where X ∈ ℝ^{n×d} is the data matrix (n: # of samples), W is the weighted adjacency matrix, and E collects the noise terms ε.
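A quick way to see the matrix form in action: for an acyclic W, X = XW + E can be solved as X = E(I − W)⁻¹ and used to simulate data. A small sketch with an assumed toy chain X0 → X1 → X2 (the weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 10_000, 3

# Weighted adjacency matrix of an assumed toy DAG X0 -> X1 -> X2:
# W[k, j] is the weight on edge k -> j (strictly upper triangular => acyclic).
W = np.array([[0.0, 0.8,  0.0],
              [0.0, 0.0, -0.6],
              [0.0, 0.0,  0.0]])

# X = X W + E  =>  X = E (I - W)^{-1}
E = rng.standard_normal((n, d))
X = E @ np.linalg.inv(np.eye(d) - W)

# Regressing X1 on its parent X0 recovers the edge weight:
w01 = np.linalg.lstsq(X[:, [0]], X[:, 1], rcond=None)[0][0]
print(round(w01, 1))   # ~0.8
```

This is the generative model that score-based methods such as NOTEARS invert: they search for the W that best regenerates the observed X.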

36 of 49

Continuous vs discrete optimization framework

Discrete optimization:

    min_{G ∈ DAGs} F(G)

Continuous optimization:

    min_{W ∈ ℝ^{d×d}} F(W)   subject to   h(W) = 0

The continuous formulation lets standard smooth optimizers solve the DAG learning problem efficiently.

W is the weighted adjacency matrix; F(W) is the score/loss function; G(W) is the graph induced from W; h(W) = 0 checks for “DAGness” (acyclicity).

37 of 49

NOTEARS objective

  • NOTEARS makes use of an additional ℓ₁ sparsity penalty on W

  • Overall objective:

    min_W (1/2n) ‖X − X W‖_F² + λ ‖W‖₁   subject to   h(W) = 0
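The unconstrained part of the objective is only a few lines of numpy; the acyclicity constraint h(W) = 0 is handled separately (NOTEARS uses an augmented Lagrangian). A minimal sketch:

```python
import numpy as np

def notears_loss(W, X, lam=0.1):
    """Least-squares score with an l1 penalty: the unconstrained part of
    the NOTEARS objective, (1/2n)||X - XW||_F^2 + lam*||W||_1.
    The acyclicity constraint h(W) = 0 is enforced elsewhere."""
    n = X.shape[0]
    residual = X - X @ W
    return (0.5 / n) * np.sum(residual ** 2) + lam * np.abs(W).sum()

# Tiny sanity check: with W = 0 the score is just the scaled data norm.
X = np.eye(2)                                      # n = 2 samples, d = 2
print(notears_loss(np.zeros((2, 2)), X, lam=0.1))  # 0.5
```

Because the score is smooth in W (up to the ℓ₁ term), its gradient can be handed to a generic continuous optimizer, which is the whole point of the reformulation.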

38 of 49

 

  • The acyclicity constraint is h(W) = tr(e^{W∘W}) − d, where ∘ is the Hadamard (elementwise) product, e^{·} is the matrix exponential, and tr is the trace
  • h(W) = 0, if and only if W is acyclic
  • The value of h quantifies “DAGness”
  • h is smooth
  • h’s derivatives are easy to compute: ∇h(W) = (e^{W∘W})ᵀ ∘ 2W
39 of 49

 

  • The equality-constrained program is solved as a sequence of unconstrained smooth problems via an augmented Lagrangian method

40 of 49

Incorporating non-linearities in the continuous optimization framework

  • DAG-GNN
    • Yu et al., 2019
  • GraN-DAG
    • Lachapelle et al., 2020
  • Non-parametric SEM
    • Zheng et al., 2020

41 of 49

Deep learning and causal networks

Luo et al., 2019

42 of 49

Plan for today

  • Some definitions
    • Representing causal structure
    • d-separation
    • Causal Markov assumption
  • Classes of algorithms for causal discovery
    • Constraint-based algorithms
    • Combinatorial search
    • Continuous optimization
  • Benchmarking

43 of 49

Benchmarking studies of causal structure discovery

  • Hill, Steven M., Laura M. Heiser, Thomas Cokelaer, Michael Unger, Nicole K. Nesser, Daniel E. Carlin, Yang Zhang, et al. “Inferring Causal Molecular Networks: Empirical Assessment through a Community-Based Effort.” Nature Methods 13, no. 4 (April 2016): 310–18. https://doi.org/10.1038/nmeth.3773.
    • Focus on cancer signaling data
  • Lachapelle S, Brouillard P, Deleu T, Lacoste-Julien S. Gradient-Based Neural DAG Learning. arXiv:1906.02226 [cs, stat]. 2020. Available from: http://arxiv.org/abs/1906.02226
    • Largely based on simulations and small networks
  • Chevalley, M., Roohani, Y., Mehrjou, A., Leskovec, J., Schwab, P., 2023. CausalBench: A Large-scale Benchmark for Network Inference from Single-cell Perturbation Data.
    • Focus on nearly genome-scale networks
    • Leveraged real single cell RNA-seq datasets

44 of 49

Evaluation of causal graph structure

  • Precision/Recall
  • Structural Hamming Distance (SHD)
    • Count the number of falsely detected, missing and reversed edges
  • Structural Interventional Distance (SID)
    • Number of variable pairs (X, Y) where X’s distribution is miscalculated when intervening on Y
  • Change in distribution (Wasserstein distance) of X between the observational setting and the interventional setting in which a predicted parent Y of X is intervened on
  • False Omission Rate (FOR): the fraction of predicted non-edges that are in fact true edges
    • FOR = FN / (FN + TN)
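SHD, for example, reduces to comparing the two adjacency matrices pair by pair. A minimal sketch (conventions vary: here a reversed edge costs 1, while some implementations count it as 2):

```python
import numpy as np

def shd(A_true, A_pred):
    """Structural Hamming Distance between two DAG adjacency matrices
    (A[i, j] = 1 means edge i -> j): counts node pairs whose edge status
    differs, so missing, extra, and reversed edges each cost 1."""
    diff = 0
    d = A_true.shape[0]
    for i in range(d):
        for j in range(i + 1, d):
            true_pair = (A_true[i, j], A_true[j, i])
            pred_pair = (A_pred[i, j], A_pred[j, i])
            if true_pair != pred_pair:
                diff += 1
    return diff

# Truth 0 -> 1 -> 2 vs prediction 0 -> 1, 2 -> 1 (edge 1 -> 2 reversed)
A_true = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 0]])
A_pred = np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]])
print(shd(A_true, A_pred))   # 1
```

Unlike precision/recall, SHD penalizes a wrong orientation even when the skeleton is correct, which is why it is reported alongside SID in the benchmarks below.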

45 of 49

Empirical comparisons of different algorithms

ER, SF: different network topologies. ER: Erdős–Rényi; SF: scale-free

SHD: Structural Hamming Distance

SID: Structural Intervention Distance

Lachapelle et al 2020, GraN-DAG suite

46 of 49

CausalBench

Chevalley, M., Roohani, Y., Mehrjou, A., Leskovec, J., Schwab, P., 2023. CausalBench: A Large-scale Benchmark for Network Inference from Single-cell Perturbation Data.

47 of 49

How do methods perform?

Green: methods use observational data; Pink: methods use interventional data

Methods explicitly modeling interventional data don’t outperform others

48 of 49

CausalBench Ranking

49 of 49

Take away points

  • A causal graph: a graph that captures cause-effect relations between random variables
  • Learning these graphs from data requires us to make some assumptions about conditional independence:
    • causal Markov assumption/d-separation
    • faithfulness
  • Classes of algorithms for learning a causal graph: constraint-based (e.g., PC), score-based combinatorial search, hybrid, and continuous optimization (e.g., NOTEARS)
  • Benchmarking algorithms for causal structure discovery