1 of 23

Generating Synthetic Molecular Mediator Time Series Data for ML and AI applications: considerations

Gary An, MD, FACS

Department of Surgery

University of Vermont

August 24, 2023

Viral Pandemics Working Group Webinar Seminar Series

2 of 23

What this Talk is about

  • Generating synthetic data to augment multiplexed molecular-level time series data for AI-aided forecasting or evaluative tasks that presume a mechanistic, hierarchical causal relationship between the lower scale features (cellular/molecular) and the higher order, system-level phenotype (clinical disease manifestation)
  • Generating synthetic data that can allow for the identification of individual trajectories (e.g., personalization/Digital Twins).

3 of 23

Synthetic Data: The key to modern ML/AI

4 of 23

“Classical” Approaches Part 1: Statistical Approaches

Statistical synthetic data can be generated if there is enough existing real-world data such that either:

  1. The statistical distribution of system features can be reliably approximated, OR
  2. An ANN can be trained to a sufficiently robust generative function (generative adversarial networks, or GANs)

However, for a GAN to work => there needs to be sufficient training data such that it can distinguish between applied noise and the “true” data/invariant component (Ground Truth).

Consequently, this approach has found its greatest success in image analysis, where vast libraries of annotated images can be used both to train initial ANNs and to serve as reference points for GAN-driven synthetic image generation.

5 of 23

“Classical” Approaches Part 2: Physics-based Approaches

Simulation-generated synthetic data from mechanism-based simulation models

  • Considered “real enough” if the simulations are firmly grounded in natural laws (“physics-based”) or within the context of known rules (e.g., games).

In these cases, there is a high degree of confidence in the rules and mechanisms of the simulations, and thus high trust in the fidelity between the synthetically generated data and the “real world” in which the trained systems must operate (acknowledging that in the case of a game, the game itself represents the “real world” for the player).

6 of 23

Synthetic Data in Healthcare/Biomedical Modeling: What works

Statistical/GAN

  • Biomedical image processing (for either radiology or pathology) readily falls into the category of general image processing
  • Population-level data suitable to represent the control population in a potential clinical trial or based on data from electronic health records

Physics-Based

  • Biomedical systems that can be represented as physical systems, such as fluid dynamics for anatomic representation of blood flow, electrical circuits for cardiac conduction, or the mechanical properties of joints

7 of 23

Synthetic Molecular Time Series Data: why it is wanted

  • Relates to experimental cellular and molecular biology
  • Central premise => more granular mechanistic knowledge can lead to improved human health, i.e. through the development and use of various -omics-based and multiplexed molecular assays.
  • Drug development => premise that more detailed molecular knowledge of biological processes is the means to identifying more effective and precise new therapeutic agents.
  • However, this experimental paradigm has two consequences that challenge the application of traditional methods for generating synthetic data

8 of 23

Why Statistical Methods won’t work

  • Perpetual Data Sparsity => the Curse of Dimensionality
  • The inapplicability of the Central Limit Theorem
    • Often assumption of a normal distribution => Central Limit Theorem
      • BUT CLT does not hold for multiplexed molecular time series
      • NOT Independent variables => these molecules/mediators/genes are almost always connected by shared pathways, so the value of one entity will affect the value of another
      • The source of the samples is NOT random => sampling occurs in a population that is preselected based on its manifestation of a disease process
  • The requirements for applying the Central Limit Theorem, which is the usual justification for assuming a normal/Gaussian distribution of values, are not met, so one cannot assume a normal distribution for this type of data
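
The non-independence point can be illustrated with a minimal sketch. It assumes two hypothetical mediators, `a` and `b`, coupled through a shared pathway; the coupling strength and noise levels are made-up numbers, not values from any data set. Resampling each mediator independently keeps its marginal distribution but loses the coupling between them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coupled mediators: b is driven by a through a shared pathway.
n = 5000
a = rng.lognormal(mean=0.0, sigma=0.5, size=n)
b = 2.0 * a + rng.normal(0.0, 0.2, size=n)

# "Statistical" synthetic data that treats the mediators as independent:
# shuffling each one separately preserves its marginal distribution exactly.
a_syn = rng.permutation(a)
b_syn = rng.permutation(b)

corr_real = np.corrcoef(a, b)[0, 1]
corr_syn = np.corrcoef(a_syn, b_syn)[0, 1]
print(f"real correlation:      {corr_real:.2f}")
print(f"synthetic correlation: {corr_syn:.2f}")
```

The synthetic pairs have the right per-mediator statistics but describe combinations of values that the coupled system would rarely produce.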

9 of 23

Why Physics-based Methods won’t work

  • Multiscale cellular/molecular biology => no corresponding fundamental laws that constrain the dynamics and output of cellular/molecular biology.
  • Perpetual epistemic uncertainty and incompleteness
  • Establishing generative hierarchically causal/multi-scale mechanisms for such simulation models involves
    • identifying a level of abstraction that is “sufficiently complex”
    • minimizing bias, and therefore having the greatest explanatory expansiveness => Maximum Entropy Principle

10 of 23

What features must SMT have? Understanding limits of AI

  • Mitigating the failure modes for ANN AI systems:
    • ANNs fail to generalize => data drift
      • Therefore, SMT should be as expansive (i.e., least biased) as possible, covering as much of the breadth of possible configurations of the target/real-world system as can be represented
    • Inability to discriminate
      • Classification task? => supposition that there is a detectable difference between groups via a distinguishable phenotype/outcome.
      • Major issue with much existing time series data is that the range of values within each group is nearly always larger than the difference between some statistically determined characteristic of each group (be it mean or median)
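
The discrimination problem can be made concrete with a small sketch. The two groups below are hypothetical (arbitrary units, made-up means and spread), chosen only so that the within-group range exceeds the difference in group means.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical mediator values for two outcome groups (arbitrary units).
group_a = rng.normal(loc=10.0, scale=4.0, size=100)
group_b = rng.normal(loc=12.0, scale=4.0, size=100)

# Difference between the group summary statistics.
mean_gap = abs(group_a.mean() - group_b.mean())

# Spread of values inside each group.
range_a = group_a.max() - group_a.min()
range_b = group_b.max() - group_b.min()

# Fraction of group A values that fall inside group B's observed range.
overlap = np.mean((group_a >= group_b.min()) & (group_a <= group_b.max()))

print(f"difference in means: {mean_gap:.2f}")
print(f"within-group ranges: {range_a:.2f}, {range_b:.2f}")
print(f"fraction of A inside B's range: {overlap:.2f}")
```

When most individual values from one group sit inside the other group's range, a difference in means says little about which group a single trajectory belongs to.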

11 of 23

12 of 23

What must be dealt with when generating SMT?

  • Choosing a “sufficiently complex” abstraction level for the multiscale mechanism-based simulation model.
    • Rethinking “parameters” => variation of functional responsiveness
  • Maximizing the expansiveness of the generated SMT to minimize data drift for the ANN. => Maximal Entropy Principle
  • Translating expansiveness of representation (as per the Maximal Entropy Principle) into an alternative view of “calibration.”
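
As background, one standard textbook statement of the principle (not the talk's own formalism) is the following. The symbols $a$ and $b$ are assumptions introduced here for the plausible lower and upper bounds of a single parameter. If nothing is known about the parameter beyond those bounds, the least biased distribution is the one that maximizes entropy, and that distribution is uniform:

```latex
\[
  p^{*} = \arg\max_{p} \; H(p), \qquad
  H(p) = -\int_{a}^{b} p(x)\,\ln p(x)\,dx
  \quad \text{subject to} \quad \int_{a}^{b} p(x)\,dx = 1 ,
\]
\[
  p^{*}(x) = \frac{1}{b-a} \ \text{ for } x \in [a,b], \qquad H(p^{*}) = \ln(b-a).
\]
```

On this reading, "expansive" calibration means retaining the full range of parameter values that the data cannot rule out, rather than a single best fit.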

13 of 23

A proposed method for generating SMT

  1. The use of a mechanism-based multiscale simulation model grounded in existing biological knowledge
  2. The use of a simulation model provides a means of partially addressing the Curse of Dimensionality, where the constraints on behavior enforced by the incorporated mechanistic rules constrain the multi-dimensional configurations possible.
  3. The multiscale nature of the simulation model overcomes the limits imposed by the Causal Hierarchy Theorem in terms of representing hierarchical generative causal relationships in a testable fashion.
  4. The incorporation of epistemic uncertainty into the simulation model while following the Maximal Entropy Principle through the use of a mathematical object, the Model Rule Matrix, described in detail below.
  5. Utilizing the concept of non-falsifiability in the generation of an unbiased, expansive synthetic data set, also as per the Maximal Entropy Principle, to overcome the inherent limitations of ML/ANNs in terms of their brittleness and failure to generalize.
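
Steps 4 and 5 can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the method used with the actual simulation model: `toy_model`, its two parameters, the parameter bounds, and the observed envelope are all hypothetical. Candidate parameterizations are sampled uniformly within bounds, and a candidate is kept only if its trajectory is not falsified, here simplified to "stays within the observed minimum and maximum at every time point".

```python
import numpy as np

rng = np.random.default_rng(2)
t = np.array([0.0, 1.0, 3.0, 7.0, 14.0])  # sampling days (illustrative)

def toy_model(gain, decay):
    """Hypothetical stand-in for a mechanistic simulation:
    a mediator level that rises and then decays."""
    return gain * t * np.exp(-decay * t)

# Hypothetical observed envelope: min and max across patients per time point.
obs_min = np.array([0.0, 0.5, 0.5, 0.1, 0.0])
obs_max = np.array([0.0, 3.0, 4.0, 3.0, 1.5])

# Sample candidates uniformly within assumed bounds (least biased prior):
# column 0 is gain in [0, 5], column 1 is decay in [0.05, 2].
candidates = rng.uniform(low=[0.0, 0.05], high=[5.0, 2.0], size=(2000, 2))
trajectories = np.array([toy_model(g, d) for g, d in candidates])

# Keep every candidate whose trajectory the data cannot falsify.
not_falsified = np.all(
    (trajectories >= obs_min) & (trajectories <= obs_max), axis=1
)
ensemble = candidates[not_falsified]
synthetic = trajectories[not_falsified]

print(f"kept {len(ensemble)} of {len(candidates)} candidates")
```

The retained set is an ensemble rather than a single fitted parameter vector, and the spread of `synthetic` is what would serve as the synthetic trajectory space in this toy setting.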

14 of 23

Capturing Clinical Heterogeneity = Parameter Landscape => Model Rule Matrix (MRM)

15 of 23

Model Rule Matrix (MRM)

  • Mathematical object that relates represented entities (variables) versus implemented rules that describe interactions between the entities
  • Values for each matrix element = existence/strength/direction of element in rule

List of Entities (Molecules) in Model:

  • Guided by purpose of model (e.g., control)

List of Rules:

  • “Assume” the list of rules captures all “essential functions” in the biology (safer for ABMs)

[Matrix schematic: entities (Entity 1 ... Entity n) versus rules (Rule 1 ... Rule m)]

  • Assertion: MRM can “nearly” represent every configuration of a given model rule set and represented entities
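
A minimal sketch of the MRM as a data structure, with rules as rows and entities as columns. The rule labels, matrix values, and mediator levels are made up for illustration, and the weighted-sum reading of a rule is an assumption rather than how any particular simulation model applies its rules.

```python
import numpy as np

entities = ["TNFa", "IL-10", "IL-6"]  # illustrative subset of mediators
rules = ["rule_1", "rule_2"]          # hypothetical rule labels

# Rows are rules, columns are entities.
# Sign = direction, magnitude = strength, zero = entity absent from the rule.
mrm = np.array([
    [1.0, -0.5, 0.0],
    [0.0,  0.8, 0.3],
])

# Which entities take part in which rules (the "existence" part of each element).
participates = mrm != 0

# Hypothetical current mediator levels, in the same order as `entities`.
state = np.array([2.0, 1.0, 4.0])

# One simple reading (an assumption): each rule's net drive is the
# weighted sum of the entities that appear in it.
rule_drive = mrm @ state

for name, drive in zip(rules, rule_drive):
    print(f"{name}: {drive:.2f}")
```

Varying the nonzero values changes strength and direction within a fixed rule structure, and switching zeros on or off changes the structure itself, which is how a single matrix can span many configurations of the same rule set.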

16 of 23

Digital Twin to generate Synthetic Multiplexed Molecular Time Series

  • Data:
    • Collected at t=0,1,3,7,14 days post-injury
    • Blood-serum cytokine profiles time series: IL-1b, IL-1ra, IL-6, IL-4, IL-8, IL-10, G-CSF, IFNg, and TNFa
    • Organ Failure Scores
  • Simulation Model => Innate Immune Response ABM (CCM 2004)

  • Data Source: USU/WRNMMC via SC2i TDAP protocol
    • 199 trauma patients: 92 developed ARDS; 107 controls without ARDS

17 of 23

Results: Sample MRMs from IIRABM

“Base” IIRABM MRM

“An” Evolved IIRABM MRM

IIRABM-MRM = 17 Mediators (Columns) x 25 Rules (Rows)

 

Cockrell C and An G: Frontiers in Physiology: Computational Physiology and Medicine. 2021

18 of 23

Results: Range of Ensemble MRMs

2d Heatmap of Value Ranges

3d Depiction of Value Ranges

 

Cockrell C and An G: Frontiers in Physiology: Computational Physiology and Medicine. 2021

19 of 23

Results: Mechanistically Generated Synthetic Trajectory Spaces

TNFa: Real and Synthetic Data

20 of 23

Results: Mechanistically Generated Synthetic Trajectory Spaces => Link to Organ Scale

GCSF, IL-1 and Lung SOFA Scores

21 of 23

Benefits of using this MSM approach to generate SMT

  1. Simulation-generated SMT at a scale/density impossible in the real world => Dense trajectory spaces allow for n-order derivatives to be found
  2. MRM/MEP/Stochasticity in MSM “obscures” the underlying generative model
  3. MRM/MEP encompass biological/population/clinical heterogeneity
  4. Mechanistic basis grounds generative model => Provides a means of “useful failure” (as opposed to pure GAN SD)
  5. Mechanistic model allows for novel interventions => hypothesis testing (new drugs/combinations/repurposing)
  6. Can overcome biases/imbalances in existing data sets
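
The first benefit, derivatives from dense trajectories, can be sketched as below. The trajectory is a toy closed-form curve rather than simulation output, and the dense grid spacing is an arbitrary choice. The same curve sampled only at sparse clinical-style time points gives a much poorer derivative estimate.

```python
import numpy as np

def toy_trajectory(t):
    """Hypothetical mediator curve: rises, then decays."""
    return t * np.exp(-0.4 * t)

def exact_derivative(t):
    """Analytic first derivative of the toy curve, used only for comparison."""
    return (1.0 - 0.4 * t) * np.exp(-0.4 * t)

# Dense, simulation-style sampling (0.01-day steps over 14 days).
t_dense = np.linspace(0.0, 14.0, 1401)
x_dense = toy_trajectory(t_dense)
dx_dense = np.gradient(x_dense, t_dense)    # first derivative
d2x_dense = np.gradient(dx_dense, t_dense)  # second derivative

# Sparse, clinical-style sampling of the same curve.
t_sparse = np.array([0.0, 1.0, 3.0, 7.0, 14.0])
x_sparse = toy_trajectory(t_sparse)
dx_sparse = np.gradient(x_sparse, t_sparse)

err_dense = np.max(np.abs(dx_dense - exact_derivative(t_dense)))
err_sparse = np.max(np.abs(dx_sparse - exact_derivative(t_sparse)))
print(f"max first-derivative error, dense:  {err_dense:.4f}")
print(f"max first-derivative error, sparse: {err_sparse:.4f}")
```

Higher-order derivatives are obtained by repeating the finite-difference step, which is only meaningful when the sampling is dense enough that each step adds little error.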

22 of 23

Next Steps

  • Low-hanging fruit: Does SMT-augmented ML/AI perform better (primary emphasis on generalizability)?
    • DoD CDMRP Project: MB220047 - “Improving the Robustness and Generalizability of Post-Burn Sepsis Prediction with the Post-Burn Sepsis Digital Twin” (PIs: Schobel-McHugh and Cockrell)
  • Additional Use Cases => Outreach to Bridge-To-AI group?
  • Other groups interested in adapting MRM to their models?
  • Adjunctive but integral component of Medical Digital Twin program (case for mechanism-based multiscale models)

23 of 23

Finis