1 of 34

Modeling Complex Systems of Chemical Reactions

William H. Green

MIT Dept. of Chem. Eng.

“Modeling Talks” Series

Jan. 23, 2024

MIT SuperCloud

2 of 34

Motivation

We want to Understand and Design chemical systems
Quantitative Understanding and System Design require Predictive Models

If our model predictions are inconsistent with experiment, we don’t Understand
To Design one has to predict the performance of proposed designs.

Often we want to design a new molecule that (in the right mixture) would be an effective drug, or the active element in a photovoltaic or color display, or a low-greenhouse jet fuel
Need to be able to Predict the key performance property of the new molecule…
…also predict many other properties (many performance requirements!)
Also predict how to make (“synthesize”) this new molecule.
Also predict fate of this molecule as it degrades in use or in storage
Rely in part on predictions of reaction products and reaction rates

3 of 34

If you can predict Reactions…

…If you can reliably predict Products, you can predict sequences of reactions that cumulatively will convert molecules you can buy into the new molecule

“Computer-Aided Synthesis Planning”, e.g. the ASKCOS software package. See Coley, Green, and Jensen, Accounts of Chemical Research

…If you can reliably predict both Products and Rates, you can build Kinetic Models, and then quantitatively simulate the time-evolution of a chemical system.

Software: the Reaction Mechanism Generator (RMG). M. Liu et al. JCIM
Several successful examples in literature, where simulations correctly predicted results of experiments performed independently or after the simulations.

Can be extremely helpful for designing new chemical processes!

Challenging to ensure simulation includes all of the important reactions
Sometimes numerical problems solving the simulation equations.
Often simulation accuracy limited primarily by errors in the rate predictions.

4 of 34

Historically, most chemical engineers have used a purely empirical approach to reaction kinetics

Do experiments, scale up, do more experiments, fit to a simple kinetic model

Experiments in large pilot plants are slow and expensive:

Barrier to innovation!

The Engineering Motivation of this Work

World needs to decarbonize fast – requires replacing most chemical processes with some decarbonized alternative… need much faster chemical innovation!

5 of 34

Can we avoid Barrier to innovation by Predicting the Kinetics on the computer?

If possible, has many advantages, not just lower cost.

But how can we accurately Predict Kinetics?

The Engineering Motivation of this Work

6 of 34

Although sometimes Predictive Kinetics works, there are Challenges

Challenging to ensure simulation includes all of the important reactions
Can be numerical problems solving the simulation equations

e.g. if too many state variables in simulation

Often simulation accuracy limited primarily by errors in the rate predictions.

7 of 34

Chemistry is a Big Problem – so many molecules & reactions, each a bit different from the others…

The number of possible molecules and reactions is VAST.

Combinatorics: >10²⁰ stable simple light organic molecules, with >10⁴⁰ reactions.

For bigger molecules the numbers are much much larger…

Many important processes involve >10⁵ significant reactions
Only a tiny fraction of all the molecules and reactions that exist have ever been considered at all

though it is predicted >10⁹ could be made commercially: https://doi.org/10.1021/acs.jcim.2c01253

>10⁸ molecules listed in the PubChem database

but most of these have never actually been made (just predicted or thought about)

>10⁷ reactions recorded in Pistachio database (extracted from patent literature)

Many Properties of Interest for each Molecule and Reaction:
binding constants to enzymes, spectra, reaction products & byproducts, reaction rates, reaction equilibria, solubilities of solids in different solvents, vapor-liquid equilibria, solvent effects on rates & reaction equilibria, interactions with catalysts and sorbents, ….
Best if we can predict the properties before investing the effort/time/money to create, purify, and characterize a proposed molecule, and then try to measure its reactions and properties.

8 of 34

Data on Molecules & Reactions is Sparse

Experimental data exists on some molecules (though may be tedious or impossible to access the data).

Only have data on molecules that someone already created, purified, and characterized.

Only a few properties of those special molecules were ever measured.
Molecules which are hard to make or have short lifetimes are unlikely to be measured.

Only ~10⁴ molecules or reactions with quantitative experimental numbers in public datasets of particular properties (NIST k’s, IUPAC pKa’s, MIT solvation energies)

Datasets for many other properties are quite small (<10³ entries)

Sparse and mal-distributed “clumpy” dataset – sufficient to predict closely related molecules…

…But unreliable for predicting the interesting truly new molecules.

Data on Reactions much sparser than data on molecules!

Each molecule can react with many other molecules, often in multiple ways, so many more reactions than molecules. And reactions are harder to measure…

9 of 34

Historical Approach to Estimating Molecular Properties:

find correlation with substructures in molecule

Historically Data was Scarce: Use Simple Models
Human experts define functional groups or reaction classifiers.
Look for simple correlations to molecular or reaction properties.

Works pretty well for many properties

But correlations usually have narrow scope, only molecules or reactions of specific “type”.

Molecule → Count functional groups → Molecule’s “Fingerprint” (e.g. list of number of each functional group present in molecule) → Linear Correlation or Tree Classifier → Molecular Property (or Log(Property))
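The count-then-correlate idea can be sketched in a few lines. This is a minimal illustration only: the group names and contribution values are hypothetical placeholders, not fitted parameters from any database.

```python
from collections import Counter

# Hypothetical group contributions (illustrative numbers only, not fitted values).
GROUP_CONTRIBUTION = {"CH3": -10.0, "CH2": -5.0, "OH": -40.0}

def fingerprint(groups):
    """Fingerprint = how many times each functional group occurs in the molecule."""
    return Counter(groups)

def predict_property(groups, contributions=GROUP_CONTRIBUTION):
    """Linear correlation: property = sum over groups of (count * contribution)."""
    fp = fingerprint(groups)
    return sum(count * contributions[g] for g, count in fp.items())

# 1-propanol written as a bag of groups: CH3-CH2-CH2-OH
propanol = ["CH3", "CH2", "CH2", "OH"]
estimate = predict_property(propanol)
```

A molecule containing a group that is missing from the table raises a KeyError, which mirrors the narrow scope of such correlations.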

10 of 34

“Learned Fingerprint” + Neural Net approach

If you have lots of data, and are willing to deal with non-linear models…
Instead of pre-defining the Functional Groups & Descriptors (and so Fingerprint), one can have the computer Learn a Fingerprint that works well to predict the specific property.

Many different ways to do this that often work, though some better than others…

Can allow non-linear relationship between Fingerprint & Property by using a Feed-Forward Neural Net (FFN) in second step

Molecule → Learned Fingerprint → Molecular Property

11 of 34

Original: Kevin Yang et al. J. Chem. Inf. Model. (2019) 59, 3370.

Latest: Esther Heid et al., J. Chem. Inf. Model. (2024) and her earlier JCIM papers

Software at github.com/chemprop/chemprop

How Chemprop works:

Directed Message Passing Neural Network (D-MPNN) followed by a Feed-Forward Neural Network (FFN)

Atom features: atomic number, molar mass, aromaticity, formal charge, hybridization, …

Bond features: bond type, conjugation, stereo, …

ML model based on Chemprop: Yang, K., et al. J. Chem. Inf. Model. 2019, 59, 3370–3388.

“Learned Fingerprint”
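A minimal sketch of directed message passing, assuming the update equations of Yang et al. (2019) and using random, untrained weights. It is not the Chemprop code: the function name, feature vectors, and sizes are hypothetical, and in a real model the weight matrices are learned end-to-end together with the FFN.

```python
import random

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def dmpnn_fingerprint(atom_feats, bonds, hidden=8, depth=3, seed=0):
    """atom_feats: list of per-atom feature vectors.
    bonds: dict {(i, j): bond feature vector}, one entry per undirected bond."""
    rng = random.Random(seed)
    n_atom, n_bond = len(atom_feats[0]), len(next(iter(bonds.values())))
    W_i = random_matrix(rng, hidden, n_atom + n_bond)   # input layer
    W_m = random_matrix(rng, hidden, hidden)            # message layer
    W_a = random_matrix(rng, hidden, n_atom + hidden)   # atom readout layer

    # One hidden state per *directed* bond v -> w.
    directed = {}
    for (i, j), feat in bonds.items():
        directed[(i, j)] = feat
        directed[(j, i)] = feat
    h0 = {(v, w): relu(matvec(W_i, atom_feats[v] + f))
          for (v, w), f in directed.items()}
    h = dict(h0)
    for _ in range(depth):
        new_h = {}
        for (v, w) in directed:
            # Sum states of bonds k -> v, excluding the reverse bond w -> v.
            msg = [0.0] * hidden
            for (k, u) in directed:
                if u == v and k != w:
                    msg = add(msg, h[(k, v)])
            new_h[(v, w)] = relu(add(h0[(v, w)], matvec(W_m, msg)))
        h = new_h

    # Atom states from all incoming bonds, then sum over atoms.
    mol = [0.0] * hidden
    for v, x in enumerate(atom_feats):
        incoming = [0.0] * hidden
        for (k, u) in directed:
            if u == v:
                incoming = add(incoming, h[(k, v)])
        mol = add(mol, relu(matvec(W_a, x + incoming)))
    return mol
```

Because every step is a sum over neighbors, the resulting vector does not depend on how the atoms are numbered.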

12 of 34

Easy to Extend to Pairs of Molecules as Input

Reactions are similar but a little trickier, since a pair of reactants X + Y can make multiple products, and there could be more than one mechanism for unimolecular X → Y.
For Reactions, “atom mapping” between reactants-TS-products with the condensed graph of reaction (CGR) helps. See E. Heid & W.H. Green JCIM https://doi.org/10.1021/acs.jcim.1c00975

Combined Fingerprint of the Pair

13 of 34

How to get enough data to train a Chemprop model?

High throughput experimentation (i.e. use robots to automate both synthesis of new molecules and the property or reaction measurements)?

It is possible, but it takes very good experimentalists taking care of a lot of equipment:

For a nice example from our group, see Koscher et al. Science (Dec. 2023) 382, eadi1407

To build and shake-down the apparatus in that paper, and then collect the data (several properties on each of several hundred new molecules) took an MIT team several years of full-time hard work. Is there an easier way?

14 of 34

But how can we make accurate predictions on a broad scope of molecules/reactions if we don’t have enough experimental data?

We know fundamental physical laws, e.g. Laws of Thermodynamics. Gives relations between different quantities.
We also know how to compute some properties of molecules and reactions directly from first-principles,

e.g. by solving the Schroedinger Equation and using Statistical Mechanics & Rate Theory.
We can compute molecules that do not exist yet, no need to purchase or synthesize the molecule. No problem if the molecule has short lifetime.
Cannot solve equations exactly, need to make approximations: degrades accuracy
Very Bad Scaling: Number of possible isomers/conformers and reactions, and CPU time & RAM required for each one, all grow extremely quickly with number of atoms in molecule.

First-principles computation only practical for subset of properties/molecules/reactions

How to combine physical laws, computations, and experiment to make better Predictions?
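As one concrete instance of Rate Theory, the sketch below uses the standard Eyring (transition state theory) expression for a unimolecular rate constant, assuming a transmission coefficient of 1. The function names are hypothetical. It also shows why approximations in the computed barrier matter so much: the rate depends exponentially on the activation free energy.

```python
import math

K_B = 1.380649e-23    # Boltzmann constant, J/K
H = 6.62607015e-34    # Planck constant, J s
R = 8.314462618       # gas constant, J/(mol K)

def eyring_rate(dG_act_kJ_per_mol, T):
    """Unimolecular k(T) in 1/s from the activation free energy (kJ/mol)."""
    return (K_B * T / H) * math.exp(-dG_act_kJ_per_mol * 1000.0 / (R * T))

def rate_error_factor(barrier_error_kJ_per_mol, T):
    """Multiplicative error in k caused by a given error in the barrier."""
    return math.exp(abs(barrier_error_kJ_per_mol) * 1000.0 / (R * T))

# An error of 1 kcal/mol (4.184 kJ/mol) in the barrier at room temperature:
factor_298 = rate_error_factor(4.184, 298.15)
```

At 298 K a barrier error of 1 kcal/mol changes the predicted rate by roughly a factor of five; the same error matters less at high temperature.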

15 of 34

Combining Quantum & Expt by Transfer Learning

Assumption: quantum calculations are cheaper but less accurate than experiments

1) Do huge number of quantum calculations: big training data set

2) Build a model to match quantum calculations

3) Freeze D-MPNN fingerprint (i.e. assume quantum model defined the relevant functional groups)

4) Tweak parameters in FFN to calibrate quantum model to experiment.

Solvation: Vermeire & Green, ChemEngJ (2021) 418, 129307

Thermochemistry: Grambow et al. JPhysChemA (2019) 123, 5826
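The four steps can be sketched with a deliberately tiny model. Everything here is an assumption for illustration: a linear map stands in for the D-MPNN fingerprint, a two-parameter line stands in for the FFN, and the synthetic "quantum" and "experimental" data (including the assumed scale-and-offset error of the quantum values) are made up.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pretrain_fingerprint(X, y, lr=0.5, epochs=300):
    """Steps 1-2: fit the fingerprint weights theta to the large quantum data set."""
    theta = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(theta)
        for x, target in zip(X, y):
            err = dot(theta, x) - target
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / len(X)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def calibrate_head(z, y):
    """Step 4: least-squares fit of y ~ a*z + c; only a and c are trainable."""
    zbar, ybar = sum(z) / len(z), sum(y) / len(y)
    a = (sum((zi - zbar) * (yi - ybar) for zi, yi in zip(z, y))
         / sum((zi - zbar) ** 2 for zi in z))
    return a, ybar - a * zbar

def rmse(pred, y):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, y)) / len(y))

def run_demo(seed=0):
    rng = random.Random(seed)
    w_true = [1.0, -2.0, 0.5]                        # hypothetical "true" weights
    draw = lambda n: [[rng.random() for _ in w_true] for _ in range(n)]
    X_qm = draw(500)                                 # many cheap quantum results
    y_qm = [dot(w_true, x) for x in X_qm]
    X_exp = draw(20)                                 # few experiments
    # Assumed systematic quantum error: experiment = 1.1 * quantum + 0.3 + noise.
    y_exp = [1.1 * dot(w_true, x) + 0.3 + rng.gauss(0.0, 0.02) for x in X_exp]

    theta = pretrain_fingerprint(X_qm, y_qm)
    frozen = list(theta)                             # step 3: freeze the fingerprint
    z_exp = [dot(theta, x) for x in X_exp]
    before = rmse(z_exp, y_exp)                      # uncalibrated quantum-level model
    a, c = calibrate_head(z_exp, y_exp)
    after = rmse([a * z + c for z in z_exp], y_exp)  # calibrated to experiment
    return theta == frozen, before, after
```

Only two parameters are fitted to the 20 experimental points, which is the reason the approach needs so few high-accuracy data.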

16 of 34

Transfer Learning greatly reduces number of high-accuracy data needed

Transfer Learning model achieves similar accuracy with 200 experimental data points as a purely experimental model achieves using measurements on 3,000 solute-solvent pairs for training.
In this case, the initial (quantum) model is from COSMO-RS calculations on a million solute-solvent pairs.
From Vermeire & Green, Chem Eng J (2021) 418, 129307

[Figure: Number of Experimental Data used for Training + Validation, comparing the Transfer Learning Model (QM then expt) with the Model built solely from Experimental Data; annotation: Same quality of fit]

17 of 34

One of several possible limiting cases: The Good Model, Noisy Data limit…

Shaded regions indicate range of error in predictions from different initial guesses at model parameters. This range is sometimes used to estimate epistemic error (but is usually an underestimate).

Why doesn’t Model vs. Test Data RMS deviation improve as we add experiments on 2,000 more solute-solvent pairs? In this case, repeating the calculation with less noisy experimental test data gives much smaller RMSE. RMSE set by the Noise in the Test Data! Model predictions are significantly more accurate than this RMSE indicates.

(with quantum)
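A hedged way to see why the RMSE plateaus: assuming the noise in the test measurements has zero mean and is independent of the model's true error, the observed mean-square deviation splits into two parts. The symbols below are introduced here for illustration and do not appear on the original slide.

```latex
\mathrm{RMSE}_{\mathrm{obs}}^{2} \;\approx\; \mathrm{RMSE}_{\mathrm{model}}^{2} + \sigma_{\mathrm{noise}}^{2}
```

Once the model term falls well below the noise term, adding more training data can no longer lower the observed RMSE, even though the model itself keeps improving.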

18 of 34

Combining ΔG_solv,298 from ML model with fundamental thermo equations predicts many condensed phase properties

Left: Quantum (COSMO-RS) vs. experimental solvation energies at 298 K (kcal/mole) for many solutes & solvents. ML model even more accurate. Vermeire & Green, Chem Eng J (2021)
Middle: Predicted & Exptl solvation energy (kJ/mole) of 2-propanol in water versus T. Chung, Gillis, & Green, AIChE J. (2021).
Right: Predicted vs. Experimental log₁₀(Solubility) of solid benzoin in 40 solvents over 50 K T range, & how to combine ML, Quantum, Thermo: Vermeire et al. JACS (2022) 144, 10785
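One simple example of combining solvation energies with a thermodynamic relation: for the same solid dissolved in two solvents, and assuming dilute solutions with a molar-concentration standard state, the ratio of solubilities follows from the difference in solvation free energies. The function name and the input values in the usage line are hypothetical; the published workflow includes further corrections not shown here.

```python
import math

R_KCAL = 1.98720425864083e-3   # gas constant, kcal/(mol K)

def log10_solubility_ratio(dG_solv_a, dG_solv_b, temperature):
    """log10(S_a / S_b) for one solute in solvents a and b.

    dG_solv_* are solvation free energies in kcal/mol at the given temperature (K).
    Assumes dilute solutions and the same solid phase in both cases.
    """
    ddG = dG_solv_a - dG_solv_b
    return -ddG / (R_KCAL * temperature * math.log(10.0))

# Hypothetical inputs: solvent a solvates the solute 1.3642 kcal/mol more strongly.
ratio_log10 = log10_solubility_ratio(-6.3642, -5.0, 298.15)
```

At room temperature, each 1.36 kcal/mol of extra solvation free energy corresponds to about one order of magnitude in solubility.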

19 of 34

My group uses high-throughput Quantum Chemistry to augment the limited experimental data, to train models to estimate many different quantities. (Other research groups are also using similar approaches).

Thermochemistry of gas-phase neutral molecules

Grambow et al. J.Phys.Chem.A (2019) 123, 5826

Solvation energies of neutral molecules (and so solvent effects on K_eq, solubility, vapor pressure, etc.)

Chung et al. J. Chem. Inform. Model. (2022) 62, 433
Vermeire et al. Chem. Eng. J. (2021) 418, 129307; J. Am. Chem. Soc. (2022) 144, 10785
Pattanaik et al. J.Phys.Chem.B (2023) 127, 10151 [relative solvation of conformers in different solvents]

Solvation energies of ions and pKa

Jonathan Zheng and WHG, J.Phys.Chem.A (2023) 127, 10268

Solvent effects on chemical reactions

Chung and Green, Chemical Science (2024, in press); more manuscripts coming.

Reaction barriers of gas-phase neutrals

Grambow et al. Sci. Data (2020) found DFT transition states for ~12,000 distinct reactions
Spiekermann et al. Sci. Data (2022) computed ~12,000 reaction barriers at CCSD(T)-F12
Reaction barrier datasets for free radical reactions coming soon from my group.

20 of 34

Computer can initiate its own Quantum Chemistry calculations, to automatically improve/update Estimators (e.g. by Machine Learning)

Li Yi-Pei and Han Kehang demonstrated this concept for automatic continuous improvement of enthalpy estimates, by calculating a bigger and bigger set of molecules with Quantum Chem, and continuously updating the machine learning model.

Yi-Pei Li et al., J. Phys. Chem. A 123, 2142-2152 (2019).
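The loop of "compute more molecules, update the estimator" can be sketched as follows. This is a toy under stated assumptions: a nearest-neighbor lookup stands in for the machine learning model, distance to the nearest training point stands in for its uncertainty, and a simple analytic function stands in for a quantum chemistry job. The selection rule (pick the most uncertain candidate) is an assumption for illustration, not a description of the published method.

```python
import math

def expensive_quantum_calc(x):
    """Stand-in for a quantum chemistry job (hypothetical toy function)."""
    return math.sin(3.0 * x) + 0.5 * x

def nearest(train, x):
    """Return (distance, value) of the closest training point."""
    return min((abs(x - xt), yt) for xt, yt in train)

def estimate(train, x):
    """Toy estimator: value of the nearest training point."""
    return nearest(train, x)[1]

def uncertainty(train, x):
    """Toy uncertainty: distance to the nearest training point."""
    return nearest(train, x)[0]

def active_learning(candidates, train, rounds):
    """Repeatedly compute the candidate the estimator is least sure about."""
    train = list(train)
    for _ in range(rounds):
        x_new = max(candidates, key=lambda x: uncertainty(train, x))
        train.append((x_new, expensive_quantum_calc(x_new)))
    return train

candidates = [i / 50 for i in range(101)]
start = [(0.0, expensive_quantum_calc(0.0))]
grown = active_learning(candidates, start, rounds=10)
```

Each round adds exactly one new calculation to the training set, so the estimator improves where it was weakest without any human choosing the next molecule.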

21 of 34

Estimators accelerate several steps in computation of k’s, making it easier to construct big TS datasets

Bootstrapping:

Use the TS geometries we have to improve computer’s ability to guess TS geometries for new reactions. Rinse and Repeat.

For ML geometry predictors, see Pattanaik’s papers.

22 of 34

Computing Barriers to ~10⁵ different reactions in hundreds of solvents using quantum mechanics is tricky…

Have to find special atom arrangements where all forces vanish, then compute energies at those geometries very accurately. Each step in the many calculations has different speed-up and RAM requirements.

Bootstrapping Helps: Re-fit ML models to provide improved initial guess geometries, using completed calculations to grow training set. Significantly speeds convergence!

Workflow, starting from ~500K initial TS guesses from ML model:

1) AM1/PM7/GFN2-XTB: Opt recipe

Each job using 16 cores, 64 GB, 10 sec to 2 min
Failed jobs take ~2 min
Provides fast screening before expensive DFT jobs
~250K converged at semiempirical

2) wB97XD: Opt & Freq

Wide range of job time
Each job using 16 cores, 64 GB, 10 min to 2 hr
Failed jobs take ~2 hr
~200K converged at DFT

3) DLPNO-CCSD(T)-F12D: Energy at Stationary Point

Memory intensive
Wide range of job time
Each job using 24 cores, 192 GB, 1 min to 2 hr

4) COSMO-RS: Solvation Energy at Stationary Point

Large number of output files (~200K TS + ~360K reactants/products) x 300 solvents
Tar output files on-the-fly
Each job using 1 core, 4 GB, ~1 sec

We optimize job scheduling and resource allocation for each type of calculation
Different job types have different technical challenges

DFT jobs need large scale parallelization with dynamic queuing
DLPNO jobs need adaptive memory allocation with dynamic queuing
COSMO-RS jobs need intensive file IO and file management on-the-fly

We leverage dynamic queuing system to maximize node utilization

Automatic restart features
Automatic job pausing and requeuing depending on overall load of entire cluster (a.k.a. Spot Queue)
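A quick back-of-the-envelope check of the numbers on this slide, using only the approximate counts quoted above. It assumes one COSMO-RS job per structure per solvent at about 1 core-second each; the variable and function names are hypothetical.

```python
N_INITIAL_GUESSES = 500_000       # ~500K initial TS guesses
N_SEMIEMPIRICAL_OK = 250_000      # ~250K converged at semiempirical level
N_DFT_OK = 200_000                # ~200K converged at DFT
N_REACTANTS_PRODUCTS = 360_000    # ~360K reactants/products
N_SOLVENTS = 300
COSMO_SECONDS_PER_JOB = 1.0       # ~1 sec on 1 core

def funnel_summary():
    """Survival fraction at each stage and the size of the COSMO-RS workload."""
    cosmo_jobs = (N_DFT_OK + N_REACTANTS_PRODUCTS) * N_SOLVENTS
    return {
        "semiempirical_yield": N_SEMIEMPIRICAL_OK / N_INITIAL_GUESSES,
        "dft_yield": N_DFT_OK / N_SEMIEMPIRICAL_OK,
        "cosmo_jobs": cosmo_jobs,
        "cosmo_core_hours": cosmo_jobs * COSMO_SECONDS_PER_JOB / 3600.0,
    }

summary = funnel_summary()
```

About half of the initial guesses survive the semiempirical screen and about 80% of those converge at DFT. The COSMO-RS stage is cheap per job but produces on the order of 10⁸ outputs, which is why file handling rather than CPU time is the bottleneck there.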

23 of 34

Optimize Each Step in Calculation! Example: Searching for Stationary Arrangements of the Atoms

Use cheaper approximate quantum methods initially, then switch to CPU-intensive DFT method when near convergence. Reduces total CPU-time by orders of magnitude.

Unless one is careful, almost all CPU time is consumed by calculations that never converge, since the ‘well-behaved’ calcs from good initial guesses don’t take many iterations. So cut off calculations when probability of success (as judged by earlier calcs) is becoming small.

Number of Atom Arrangements tried before Converging on Stationary Point geometry.

Some calculations never find stationary point.

Need initial guess arrangement of atoms, which we get from machine-learned model trained from other reactions.

Good initial guess is key to rapid convergence.
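The cutoff rule can be sketched from a record of earlier calculations. This is one plausible reading of "probability of success as judged by earlier calcs", not the group's actual criterion; the function names, the iteration counts, and the 20% threshold are hypothetical.

```python
def success_probability(converged_iters, n_never_converged, n):
    """P(job eventually converges | it has not converged after n iterations)."""
    still_to_converge = sum(1 for it in converged_iters if it > n)
    still_running = still_to_converge + n_never_converged
    return still_to_converge / still_running if still_running else 0.0

def choose_cutoff(converged_iters, n_never_converged, min_probability):
    """Smallest iteration count at which continuing is no longer worthwhile."""
    for n in range(max(converged_iters) + 1):
        p = success_probability(converged_iters, n_never_converged, n)
        if p < min_probability:
            return n
    return max(converged_iters)

# Hypothetical history: iterations needed by 12 converged jobs, plus 6 that never converged.
history = [5, 6, 6, 7, 8, 8, 9, 10, 12, 15, 20, 40]
cutoff = choose_cutoff(history, n_never_converged=6, min_probability=0.2)
```

With this history, a job still running after 20 iterations has only a 1-in-7 chance of ever converging, so it is stopped there instead of consuming the full iteration budget.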

24 of 34

Spiekermann et al, J. Phys. Chem. A (2022) 126, 3976

Grambow/Spiekermann Reaction Barrier Dataset (12,000 reactions)

First big accurate TS dataset. [Grambow et al. Sci. Data (2020); Spiekermann et al. Sci. Data (2022)]

Small molecule unimolecular reactions on singlet PES

Used to train ML model to guess TS geometries [Pattanaik et al. PCCP (2020) 22, 23618]
Used to train ML models to predict barrier heights [Grambow JPCL (2020), Heid & Green JCIM (2022), Spiekermann et al. JPCA (2022)]
ML estimates more accurate than DFT, but not as accurate as CCSD(T)-F12

ML estimator vs. CCSD(T)-F12 barriers for ~2000 rxns

25 of 34

Y. Chung & W.H. Green, Chem. Sci. (2024) in press https://doi.org/10.1039/D3SC05353A

Can use ML trained on Quantum to Predict solvent effects on reaction rates

Yunsie Chung computed many TS’s with COSMO-RS method

Assumed TS geometry in solvent same as gas-phase.

Used her data to train ML model; model used to predict 165 reaction-solvent pairs
Errors in ML-estimated kinetic solvent effects a bit worse than direct COSMO-RS calcs, comparable to error in computed gas-phase k’s…

ML Estimator’s Prediction

26 of 34

We have algorithms for proposing new reactions, using reaction templates, and so building Reaction Networks tailored to Reaction Conditions
We have ways to Estimate or Accurately Calculate all the Numbers…
We have reactor simulators that take the chemistry sub-model as input.

e.g. the Reaction Mechanism Simulator (RMS) written in Julia

So let’s make the computer build the Kinetic Model for us, and Predict Kinetics!
The software package that does this is called Reaction Mechanism Generator (RMG); it includes many databases and Estimators.

Mengjie Liu et al., J. Chem. Inf. Model. 61, 2686-2696 (2021).
M.S. Johnson et al., J. Chem. Inf. Model. 62, 4906-4915 (2022).
See https://rmg.mit.edu

Pulling it all together…
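To make "reactor simulator that takes the chemistry sub-model as input" concrete, here is a minimal sketch: a list of reactions with rate constants goes in, and species concentrations versus time come out. It is not RMG or RMS code. The mechanism (A → B → C), the rate constants, and the function names are hypothetical, and the explicit Runge-Kutta integrator is chosen for brevity; large, stiff mechanisms need an implicit stiff solver.

```python
import math

# Hypothetical toy mechanism: A -> B -> C, with made-up rate constants (1/s).
# Each reaction: (reactant stoichiometry, product stoichiometry, rate constant).
REACTIONS = [
    ({"A": 1}, {"B": 1}, 2.0),
    ({"B": 1}, {"C": 1}, 0.5),
]

def net_rates(conc):
    """Mass-action kinetics: d[conc]/dt for every species."""
    dcdt = {s: 0.0 for s in conc}
    for reactants, products, k in REACTIONS:
        rate = k
        for s, nu in reactants.items():
            rate *= conc[s] ** nu
        for s, nu in reactants.items():
            dcdt[s] -= nu * rate
        for s, nu in products.items():
            dcdt[s] += nu * rate
    return dcdt

def rk4_step(conc, dt):
    """One classical fourth-order Runge-Kutta step."""
    def shifted(base, slope, h):
        return {s: base[s] + h * slope[s] for s in base}
    k1 = net_rates(conc)
    k2 = net_rates(shifted(conc, k1, dt / 2))
    k3 = net_rates(shifted(conc, k2, dt / 2))
    k4 = net_rates(shifted(conc, k3, dt))
    return {s: conc[s] + dt / 6 * (k1[s] + 2 * k2[s] + 2 * k3[s] + k4[s])
            for s in conc}

def simulate(conc0, t_end, dt=1e-3):
    """Integrate the isothermal, well-mixed batch reactor from t = 0 to t_end."""
    conc = dict(conc0)
    for _ in range(round(t_end / dt)):
        conc = rk4_step(conc, dt)
    return conc

final = simulate({"A": 1.0, "B": 0.0, "C": 0.0}, t_end=1.0)
```

For this toy case the result can be checked against the analytic solution [A](t) = exp(-2t), and the total concentration is conserved. The chemistry lives entirely in the REACTIONS list, so swapping in a different mechanism does not touch the integrator.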

27 of 34

Compare with Experiments… or use to design new experiments

If the model is satisfactory, use the model for design of new process, policy, etc.

Open source, rmg.mit.edu

Free Training available

Steps inside dashed line have been available since ~1980

Green, AIChE J. (2020)

28 of 34

Test #1: Application to a real Pilot Plant (HTP process)

High-Temperature Pyrolysis (HTP) process to convert natural gas plus waste gases into ethylene + acetylene (C2’s).
RMG built a kinetic model to predict 71 pilot plant experiments on 12 different feed compositions.

664 chemical species
8,121 reactions
Pure predictions, except we used experimental cooling water T rise to infer the heat loss to the reactor walls.

see Gudiyella et al. IECR (2018) 57, 7404; Green AIChE J (2020) 66, e17059

29 of 34

Test #1: How Accurate? Comparing Predictions of C2 yield with HTP process Pilot Plant Data

RMG Predicted Yield

Pilot Plant Measured Yield

30 of 34

Symbols: experimental measurements of products of pyrolysis of a mixture of 1-phenyldodecane, undecane, and toluene at 400 °C and 300 bar.
Curves: RMG-built model predictions using RMG’s estimates of functional group k(T)’s, derived from quantum chemistry calculations.

Pushing towards heavier feeds, big models: C18 feed mixture

Above: alkylbenzenes

Upper Right: C₁₈H₃₀ isomers

Lower Right: alkanes

A priori predictions, no fitting to data. Model contains 3,029 molecules.

A.M. Payne et al. Energy & Fuels (2022)

31 of 34

Now working on predictions for multi-phase reaction systems including macromolecules and vapor-liquid equilibria: growth of polymeric fouling films on the trays of distillation columns

H.-W. Pang et al., I&ECR (2023) 62, 14266; and I&ECR (2024) accepted

32 of 34

Estimation:

Quantum Chem has poor scaling.
Fewer experimental data on bigger molecules.
We don’t always catch 3-d effects in big molecules (e.g. folding)

Huge number of reactions involving many reaction intermediates.

Number of isomers grows exponentially with size of molecule. K. Han I&ECR (2018) 57: 14022

Even if we can build the kinetic model, can we solve it?

Today’s stiff ODE solvers don’t work well if >10,000 species

If we can solve a huge model, can we compare its predictions with experiments?

Difficult to access data from all the relevant experiments: need Community Cooperation on data
Many experiments give hundreds of ‘major’ species, not trivial to identify and quantify each one.
Need automated methods to compare experimental measurements with complicated models…

The approach described above works well for well-mixed small gas phase neutral molecules, but as molecules increase in size there are challenges….

33 of 34

Predictive Chemical Kinetics often feasible, Accurate for some systems.

Kinetic models can be very complex; automation can be very helpful.

Current Model Construction Algorithms (e.g. in RMG) work OK for some cases

But some cases fail – better algorithms needed!
Sometimes too slow, need more efficient software implementations
Much more work is needed on building chemical kinetic models that fit well into computational fluid dynamics or other inhomogeneous reactor models

Some systems are just too complicated, need simplifications

Might be able to solve ODEs with >10,000 species, but is this a wise direction?
Fragment Method & Structure-Oriented Lumping concepts need improvement

Biggest problems today: reliable affordable Quantum Chemistry & Estimates (ideally with accuracy guarantees). Better Data Sharing would help…

Conclusions: Predictive Chemical Kinetics off to a good start

34 of 34

Acknowledgements

Funding: DARPA, Machine Learning for Pharmaceutical Discovery & Synthesis Consortium, BASF, DOE, ENI
This work would not have been possible without access to the MIT SuperCloud.
We thank the SuperCloud administrators for very helpful interactions!
Some parts of this project were done on NERSC. We thank the NERSC staff for helping us to get the software installed.
Green group students and postdocs who contributed to this line of research:

Haoyang Wu Kevin Greenman Yunsie Chung Shih-Cheng Li Yi-Pei Li Hao-Wei Pang

Michael Forsuelo Chas McGill

Florence Vermeire Angiras Menon

Esther Heid Kevin Spiekermann

Colin Grambow Lagnajit Pattanaik

Jackson Burns