Modeling Complex Systems of Chemical Reactions
William H. Green
MIT Dept. of Chem. Eng.
“Modeling Talks” Series
Jan. 23, 2024
Motivation
If you can predict Reactions…
Historically, most chemical engineers have used a purely empirical approach to reaction kinetics
Do experiments, scale up, do more experiments, fit to a simple kinetic model
Experiments in large pilot plants are slow and expensive:
Barrier to innovation!
The Engineering Motivation of this Work
The world needs to decarbonize fast. That requires replacing most chemical processes with some decarbonized alternative… we need much faster chemical innovation!
Can we avoid this barrier to innovation by predicting the kinetics on the computer?
If possible, has many advantages, not just lower cost.
But how can we accurately Predict Kinetics?
Although sometimes Predictive Kinetics works, there are Challenges
Chemistry is a Big Problem – so many molecules & reactions, each a bit different from the others…
Data on Molecules & Reactions is Sparse
Historical Approach to Estimating Molecular Properties:
find correlation with substructures in molecule
Molecule
→ count functional groups →
Molecule’s “Fingerprint” (e.g. list of the number of each functional group present in the molecule)
→ linear correlation or tree classifier →
Molecular Property (or log(Property))
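The count-then-correlate idea above can be sketched in a few lines. This is a minimal illustration, not the method of any particular group-additivity code: the group vocabulary, the molecules, and the contribution values are all made-up placeholders, and the "property" data are generated from those placeholders so that the fit has a known answer.

```python
# Toy group-contribution fit. All molecules, group names, and property
# values below are synthetic placeholders, not measured data.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        known = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - known) / aug[i][i]
    return x

def fit_group_contributions(fingerprints, values):
    """Least-squares fit of values ~ sum_g count_g * contribution_g."""
    n_groups = len(fingerprints[0])
    # Normal equations: (X^T X) c = X^T y
    xtx = [[sum(fp[i] * fp[j] for fp in fingerprints) for j in range(n_groups)]
           for i in range(n_groups)]
    xty = [sum(fp[i] * y for fp, y in zip(fingerprints, values))
           for i in range(n_groups)]
    return solve_linear_system(xtx, xty)

def predict(fingerprint, contributions):
    """Linear correlation: property = sum over groups of count * contribution."""
    return sum(n * c for n, c in zip(fingerprint, contributions))

GROUPS = ["CH3", "CH2", "OH"]  # hypothetical functional-group vocabulary

# Fingerprint = number of each group in the molecule, in GROUPS order.
training_fingerprints = [
    [2, 0, 0],  # ethane-like
    [2, 1, 0],  # propane-like
    [2, 2, 0],  # butane-like
    [1, 0, 1],  # methanol-like
    [1, 1, 1],  # ethanol-like
    [1, 2, 1],  # propanol-like
]
true_contributions = [-10.0, -5.0, -40.0]  # made-up values used to generate data
training_values = [predict(fp, true_contributions) for fp in training_fingerprints]

contributions = fit_group_contributions(training_fingerprints, training_values)
new_molecule = [2, 3, 0]  # pentane-like, not in the training set
estimate = predict(new_molecule, contributions)
print(dict(zip(GROUPS, contributions)), estimate)
```

The limitation is visible in the data structure: the fingerprint only knows about groups someone chose to list in advance.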
“Learned Fingerprint” + Neural Net approach
Molecule → Learned Fingerprint → Molecular Property
Original: Kevin Yang et al. J. Chem. Inf. Model. (2019) 59, 3370.
Latest: Esther Heid et al., J. Chem. Inf. Model. (2024) and her earlier JCIM papers
Software at github.com/chemprop/chemprop
How Chemprop works:
Directed Message Passing Neural Network (D-MPNN) followed by Neural Network
Atom features: atomic number, molar mass, aromaticity, formal charge, hybridization, …
Bond features: bond type, conjugation, stereo, …
ML model based on Chemprop: Yang, K., et al. J. Chem. Inf. Model. 2019, 59, 3370–3388.
“Learned Fingerprint”
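To make the directed message passing step concrete, here is a heavily simplified sketch in plain Python. It is not the Chemprop implementation: the weight matrices are fixed rather than trained, atoms carry a single feature (atomic number), bonds carry a single feature (bond order), and the readout simply sums the final directed-bond states. What it does keep is the defining D-MPNN feature: hidden states live on directed bonds, and the message into bond v→w sums the bonds entering v except the reverse bond w→v.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def dmpnn_fingerprint(atom_features, bonds, w_in, w_msg, depth=3):
    """Simplified directed message passing.

    atom_features: list of per-atom feature lists.
    bonds: {(i, j): bond_feature_list}, each undirected bond listed once.
    w_in, w_msg: fixed weight matrices (illustrative, not trained).
    """
    # Every undirected bond becomes two directed bonds.
    directed = {}
    for (i, j), feat in bonds.items():
        directed[(i, j)] = feat
        directed[(j, i)] = feat

    # Initial state of bond v->w: source-atom features concatenated with bond features.
    h0 = {(v, w): relu(matvec(w_in, atom_features[v] + feat))
          for (v, w), feat in directed.items()}
    h = dict(h0)
    hidden = len(w_msg)

    for _ in range(depth):
        new_h = {}
        for (v, w) in directed:
            # Sum states of bonds k->v entering v, skipping the reverse bond w->v.
            message = [0.0] * hidden
            for (k, u) in directed:
                if u == v and k != w:
                    message = add(message, h[(k, u)])
            new_h[(v, w)] = relu(add(h0[(v, w)], matvec(w_msg, message)))
        h = new_h

    # Readout (simplified): sum all directed-bond states into one vector.
    fingerprint = [0.0] * hidden
    for key in directed:
        fingerprint = add(fingerprint, h[key])
    return fingerprint

# Illustrative fixed weights: hidden size 2, input size 1 atom + 1 bond feature.
W_IN = [[0.1, 0.5], [-0.05, 0.3]]
W_MSG = [[0.2, -0.1], [0.1, 0.3]]

# Ethanol heavy-atom skeleton C-C-O, written with two different atom orderings.
fp_cco = dmpnn_fingerprint([[6.0], [6.0], [8.0]], {(0, 1): [1.0], (1, 2): [1.0]}, W_IN, W_MSG)
fp_occ = dmpnn_fingerprint([[8.0], [6.0], [6.0]], {(0, 1): [1.0], (1, 2): [1.0]}, W_IN, W_MSG)
print(fp_cco, fp_occ)
```

Because every step is a sum over neighbors, the fingerprint does not depend on how the atoms are numbered, which is the property checked by the two orderings above. In a trained model the weights are fitted together with the downstream neural network, which is what makes the fingerprint "learned".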
Easy to Extend to Pairs of Molecules as Input
Combined Fingerprint of the Pair
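One simple way to build a combined fingerprint for a pair (for example solute and solvent) is to encode each molecule separately and concatenate the two vectors before the final network. The sketch below assumes concatenation; the slide does not say which combination rule is used, and the fingerprint values, weights, and function names are placeholders.

```python
def combine_pair_fingerprints(fp_first, fp_second):
    """Concatenate two learned fingerprints into one input vector."""
    return list(fp_first) + list(fp_second)

def linear_head(x, weights, bias):
    """Single linear layer standing in for the feed-forward network."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

solute_fp = [0.2, 1.5, 0.0]   # placeholder values, not from a trained model
solvent_fp = [1.1, 0.0, 0.7]  # placeholder values, not from a trained model

pair_fp = combine_pair_fingerprints(solute_fp, solvent_fp)
prediction = linear_head(pair_fp, weights=[0.5, -1.0, 0.3, 0.2, 0.1, -0.4], bias=0.0)
print(pair_fp, prediction)
```

Concatenation keeps the roles distinct: swapping solute and solvent gives a different input vector, as it should for a property such as a solvation free energy.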
How to get enough data to train a Chemprop model?
High throughput experimentation (i.e. use robots to automate both synthesis of new molecules and the property or reaction measurements)?
It is possible, but it takes very good experimentalists taking care of a lot of equipment:
For a nice example from our group, see Koscher et al. Science (Dec. 2023) 382, eadi1407
Building and shaking down the apparatus in that paper, and then collecting the data (several properties on each of several hundred new molecules), took an MIT team several years of full-time hard work. Is there an easier way?
But how can we make accurate predictions on a broad scope of molecules/reactions if we don’t have enough experimental data?
Combining Quantum & Expt by Transfer Learning
Assumption: quantum calculations are cheaper but less accurate than experiments
1) Do huge number of quantum calculations: big training data set
2) Build a model to match quantum calculations
3) Freeze D-MPNN fingerprint (i.e. assume quantum model defined the relevant functional groups)
4) Tweak parameters in the feed-forward network (FFN) that follows the fingerprint, to calibrate quantum model to experiment.
Solvation: Vermeire & Green, ChemEngJ (2021) 418, 129307
Thermochemistry: Grambow et al. JPhysChemA (2019) 123, 5826
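The four steps above can be mimicked with a toy model. This is only a stand-in under loud assumptions: the "fingerprint encoder" is a two-parameter group-contribution model instead of a D-MPNN, the "FFN" is a scale-and-offset correction, and both the quantum and the experimental values are synthetic numbers invented for the example.

```python
def fit_two_group_model(counts, values):
    """Least-squares fit of values ~ a*n1 + b*n2 via the 2x2 normal equations."""
    s11 = sum(n1 * n1 for n1, _ in counts)
    s12 = sum(n1 * n2 for n1, n2 in counts)
    s22 = sum(n2 * n2 for _, n2 in counts)
    t1 = sum(n1 * v for (n1, _), v in zip(counts, values))
    t2 = sum(n2 * v for (_, n2), v in zip(counts, values))
    det = s11 * s22 - s12 * s12
    return ((t1 * s22 - t2 * s12) / det, (s11 * t2 - s12 * t1) / det)

def encode(count_pair, group_weights):
    """Toy 'fingerprint': the quantum-trained estimate for this molecule."""
    return count_pair[0] * group_weights[0] + count_pair[1] * group_weights[1]

def fit_head(features, targets):
    """Closed-form simple linear regression: targets ~ scale*feature + offset."""
    mean_x = sum(features) / len(features)
    mean_y = sum(targets) / len(targets)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(features, targets))
    sxx = sum((x - mean_x) ** 2 for x in features)
    scale = sxy / sxx
    return scale, mean_y - scale * mean_x

# 1) Large synthetic "quantum" data set (20 molecules described by two group counts).
qm_counts = [(n1, n2) for n1 in range(1, 6) for n2 in range(0, 4)]
qm_values = [2.0 * n1 + 5.0 * n2 for n1, n2 in qm_counts]

# 2) Build a model that matches the quantum calculations.
group_weights = fit_two_group_model(qm_counts, qm_values)

# 3) Freeze the encoder: it is not touched again.
frozen_weights = tuple(group_weights)

# 4) Calibrate only the small head on a handful of synthetic "experiments",
#    which here differ from the quantum values by a scale and an offset.
expt_counts = [(1, 0), (2, 1), (4, 3)]
expt_values = [0.9 * (2.0 * n1 + 5.0 * n2) + 1.0 for n1, n2 in expt_counts]
scale, offset = fit_head([encode(c, frozen_weights) for c in expt_counts], expt_values)

def predict_experiment(count_pair):
    return scale * encode(count_pair, frozen_weights) + offset

print(frozen_weights, scale, offset, predict_experiment((3, 2)))
```

Three experimental points are enough here because only two head parameters are refit; the many encoder parameters were already fixed by the cheap data. That is the mechanism by which transfer learning reduces the number of high-accuracy data needed.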
Transfer Learning greatly reduces number of high-accuracy data needed
Figure: x-axis is the number of experimental data used for training + validation. Curves: transfer learning model (QM then expt) and model built solely from experimental data. Annotation: “Same quality of fit”.
One of several possible limiting cases: The Good Model, Noisy Data limit…
Shaded regions indicate range of error in predictions from different initial guesses at model parameters. This range is sometimes used to estimate epistemic error (but is usually an underestimate).
Why doesn’t Model vs. Test Data RMS deviation improve as we add experiments on 2,000 more solute-solvent pairs? In this case, repeating the calculation with less noisy experimental test data gives much smaller RMSE. RMSE set by the Noise in the Test Data! Model predictions are significantly more accurate than this RMSE indicates.
(with quantum)
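The noise-floor argument can be checked with a small simulation. The sketch assumes that model error and measurement noise are independent and Gaussian, in which case the RMSE against noisy test data is approximately sqrt(model_error^2 + noise^2). The error magnitudes are arbitrary illustrative numbers, not values from the solvation study.

```python
import math
import random

def rmse(predictions, observations):
    n = len(predictions)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predictions, observations)) / n)

def simulate(n_test, model_sigma, noise_sigma, seed=0):
    """Return (RMSE vs noisy test data, RMSE vs noise-free truth)."""
    rng = random.Random(seed)
    truth = [rng.uniform(-10.0, 0.0) for _ in range(n_test)]
    predictions = [t + rng.gauss(0.0, model_sigma) for t in truth]
    noisy_measurements = [t + rng.gauss(0.0, noise_sigma) for t in truth]
    return rmse(predictions, noisy_measurements), rmse(predictions, truth)

NOISE_SIGMA = 0.6  # assumed noise in the experimental test data (arbitrary units)

# Improve the model step by step and watch how little the apparent RMSE moves.
for model_sigma in (0.4, 0.2, 0.1):
    vs_noisy, vs_truth = simulate(20000, model_sigma, NOISE_SIGMA)
    floor = math.sqrt(model_sigma ** 2 + NOISE_SIGMA ** 2)
    print(model_sigma, round(vs_noisy, 3), round(vs_truth, 3), round(floor, 3))
```

In this setting, cutting the true model error by a factor of four changes the apparent RMSE only from about 0.72 to about 0.61, because the test-data noise dominates the sum in quadrature.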
Combining ΔGsolv,298 from ML model with fundamental thermo equations predicts many condensed phase properties
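As one example of such a thermodynamic relation (the slide does not list which properties are covered, so this choice is an assumption): for a dilute solute, the equilibrium partition coefficient between two solvents A and B follows from the difference of the two solvation free energies, provided both use the same molar-concentration standard state. The function name and the kJ/mol unit choice below are illustrative.

```python
import math

R = 8.314462618e-3  # gas constant in kJ/(mol K)

def log10_partition_coefficient(dg_solv_a, dg_solv_b, temperature=298.15):
    """log10 of (solute concentration in A / solute concentration in B).

    dg_solv_a, dg_solv_b: solvation free energies in kJ/mol of the same
    solute in solvents A and B, on the same concentration-based standard state.
    """
    return -(dg_solv_a - dg_solv_b) / (R * temperature * math.log(10))

# Illustrative inputs only: a solute solvated 5.7 kJ/mol more favorably in A.
print(log10_partition_coefficient(-25.7, -20.0))
```

At 298 K, RT ln(10) is about 5.7 kJ/mol, so each 5.7 kJ/mol of extra stabilization in one solvent shifts the partition coefficient by roughly a factor of ten.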
My group uses high-throughput Quantum Chemistry to augment the limited experimental data, to train models to estimate many different quantities. (Other research groups are also using similar approaches.)
Computer can initiate its own Quantum Chemistry calculations, to automatically improve/update Estimators (e.g. by Machine Learning)
Li Yi-Pei and Han Kehang demonstrated this concept for automatic continuous improvement of enthalpy estimates, by calculating a bigger and bigger set of molecules with Quantum Chem, and continuously updating the machine learning model.
Yi-Pei Li et al., J. Phys. Chem. A 123, 2142-2152 (2019).
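The self-improving loop can be sketched as follows. This is a toy, not the published workflow: "molecules" are integers, the quantum calculation is a stand-in function, the estimator is a nearest-neighbor lookup, and the rule "compute next whatever the estimator is least sure about" is an assumption made for the example.

```python
def run_quantum_calculation(x):
    """Stand-in for an expensive quantum chemistry job (hypothetical)."""
    return float(x * x)

def nearest_known(x, known):
    return min(known, key=lambda k: abs(k - x))

def estimate(x, known):
    """Estimator: reuse the value of the most similar molecule already computed."""
    return known[nearest_known(x, known)]

def uncertainty(x, known):
    """Crude uncertainty proxy: distance to the most similar computed molecule."""
    return abs(nearest_known(x, known) - x)

def max_error(candidates, known):
    return max(abs(estimate(x, known) - run_quantum_calculation(x)) for x in candidates)

candidates = list(range(11))  # toy "chemical space"
known = {0: run_quantum_calculation(0), 10: run_quantum_calculation(10)}
error_before = max_error(candidates, known)

for _ in range(4):
    unlabeled = [x for x in candidates if x not in known]
    # The computer decides for itself which calculation to launch next.
    pick = max(unlabeled, key=lambda x: uncertainty(x, known))
    known[pick] = run_quantum_calculation(pick)  # updating the table is the "refit"

error_after = max_error(candidates, known)
print(sorted(known), error_before, error_after)
```

In a real setting the lookup table would be replaced by retraining the machine learning model, and the error check would use held-out data rather than recomputing every candidate.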
Estimators accelerate several steps in computation of k’s, making it easier to construct big TS datasets
Bootstrapping:
Use the TS geometries we have to improve computer’s ability to guess TS geometries for new reactions. Rinse and Repeat.
For ML geometry predictors, see Pattanaik’s papers.
Computing Barriers to ~10^5 different reactions in hundreds of solvents using quantum mechanics is tricky…
Have to find special atom arrangements where all forces vanish, then compute energies at those geometries very accurately. Each step in the many calculations has different speed-up and RAM requirements.
Bootstrapping Helps: Re-fit ML models to provide improved initial guess geometries, using completed calculations to grow training set. Significantly speeds convergence!
Workflow:
1) ~500K initial TS guesses from ML model
2) AM1/PM7/GFN2-xTB Opt recipe: ~250K converged at semiempirical
3) wB97XD Opt & Freq: ~200K converged at DFT
4) DLPNO-CCSD(T)-F12D Energy at Stationary Point
5) COSMO-RS Solvation Energy at Stationary Point
Optimize Each Step in Calculation! Example: Searching for Stationary Arrangements of the Atoms
Use cheaper approximate quantum methods initially, then switch to CPU-intensive DFT method when near convergence. Reduces total CPU-time by orders of magnitude.
Unless one is careful, almost all CPU time is consumed by calculations that never converge, since the ‘well-behaved’ calcs from good initial guesses don’t take many iterations. So cut off calculations when probability of success (as judged by earlier calcs) is becoming small.
Figure x-axis: number of atom arrangements tried before converging on stationary point geometry. Some calculations never find a stationary point.
Need initial guess arrangement of atoms, which we get from machine-learned model trained from other reactions. Good initial guess is key to rapid convergence.
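The cut-off rule mentioned above (stop a calculation once its probability of success, judged from earlier calculations, becomes small) can be sketched with a simple empirical estimate. The job history, the threshold, and the function names are hypothetical; failed jobs are assumed to be recorded with the iteration count at which they were stopped.

```python
def success_probability(history, n):
    """Estimate P(job eventually converges | still unconverged after n iterations).

    history: list of (iterations, converged) for earlier jobs.
    """
    later_successes = sum(1 for iters, ok in history if ok and iters > n)
    failures = sum(1 for iters, ok in history if not ok and iters >= n)
    total = later_successes + failures
    return later_successes / total if total else 0.0

def choose_cutoff(history, min_probability, max_iterations):
    """Smallest iteration count at which continuing is no longer worthwhile."""
    for n in range(1, max_iterations + 1):
        if success_probability(history, n) < min_probability:
            return n
    return max_iterations

# Hypothetical history: well-behaved jobs converge quickly, failures burn 200 steps.
history = [(5, True), (8, True), (10, True), (12, True), (40, True)] + [(200, False)] * 5

cutoff = choose_cutoff(history, min_probability=0.2, max_iterations=200)
cost_without_cutoff = sum(iters for iters, _ in history)
cost_with_cutoff = sum(min(iters, cutoff) for iters, _ in history)
print(cutoff, cost_without_cutoff, cost_with_cutoff)
```

The trade-off is explicit: a cut-off chosen this way abandons the rare slow job that would have converged eventually, in exchange for not spending most of the CPU time on jobs that never converge.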
Grambow/Spiekermann Reaction Barrier Dataset (12,000 reactions)
ML estimator vs. CCSD(T)-F12 barriers for ~2000 rxns
Can use ML trained on Quantum to Predict solvent effects on reaction rates
ML Estimator’s Prediction
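For orientation, the usual way a solvation free energy estimate turns into a solvent effect on a rate constant is through transition state theory. The relation below is a standard textbook form, stated under the assumptions that the transmission coefficient is unchanged by the solvent and that all solvation free energies share the same standard state; the slide does not give the exact expression used, so treat the notation as illustrative.

```latex
\frac{k_{\mathrm{soln}}}{k_{\mathrm{gas}}}
  \approx \exp\!\left(-\frac{\Delta\Delta G^{\ddagger}_{\mathrm{solv}}}{RT}\right),
\qquad
\Delta\Delta G^{\ddagger}_{\mathrm{solv}}
  = \Delta G_{\mathrm{solv}}(\mathrm{TS})
  - \sum_{i \,\in\, \mathrm{reactants}} \Delta G_{\mathrm{solv}}(i)
```

At 298 K, RT ln(10) is about 5.7 kJ/mol, so a solvent that stabilizes the transition state by 5.7 kJ/mol more than it stabilizes the reactants speeds the reaction up by roughly a factor of ten.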
Pulling it all together…
Compare with experiments… or use to design new experiments.
If the model is satisfactory, use the model for design of new process, policy, etc.
Open source, rmg.mit.edu
Free Training available
Steps inside dashed line have been available since ~1980
Green, AIChE J. (2020)
Test #1: Application to a real Pilot Plant (HTP process)
see Gudiyella et al. IECR (2018) 57, 7404; Green AIChE J (2020) 66, e17059
Test #1: How Accurate? Comparing Predictions of C2 yield with HTP process Pilot Plant Data
RMG Predicted Yield
Pilot Plant Measured Yield
Pushing towards heavier feeds, big models: C18 feed mixture
Above: alkylbenzenes
Upper Right: C18H30 isomers
Lower Right: alkanes
A priori predictions, no fitting to data. Model contains 3,029 molecules.
Now working on predictions for multi-phase reaction systems including macromolecules and vapor-liquid equilibria: growth of polymeric fouling films on the trays of distillation columns
H.-W. Pang et al., I&ECR (2023) 62, 14266; and I&ECR (2024) accepted
The approach described above works well for well-mixed systems of small, neutral, gas-phase molecules, but as molecules increase in size there are challenges…
Conclusions: Predictive Chemical Kinetics off to a good start
Acknowledgements
Haoyang Wu, Kevin Greenman, Yunsie Chung, Shih-Cheng Li, Yi-Pei Li, Hao-Wei Pang,
Michael Forsuelo, Chas McGill,
Florence Vermeire, Angiras Menon,
Esther Heid, Kevin Spiekermann,
Colin Grambow, Lagnajit Pattanaik,
Jackson Burns