1 of 39

The Open Molecules 2025 (OMol25) Dataset, Evaluations, and Models

Samuel M. Blau

Lawrence Berkeley National Lab

Berkeley Lab

2 of 39

Atomistic simulations & the promise of ML interatomic potentials (MLIPs) for chemistry and materials

Predict energy + atomic forces

How do we calculate energy and forces?

  • Classical force field (FF): fast, but cannot describe chemical reactivity
  • Density functional theory (DFT): accurate, but very slow
  • MLIPs, trained on DFT data: near DFT accuracy at near FF speed

MLIP

3D atomic positions

Update positions, repeat iteratively

Molecular dynamics

Elucidate complex reactivity

Geometry optimization + free energy

  • MLIPs can provide accurate atomistic insight at otherwise intractable length and time scales

  • The chemical domains in which MLIPs are reliable depend on the available DFT training data
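
The predict / update / repeat loop on this slide can be sketched as a velocity Verlet integrator that takes any energy-and-forces callable. This is a minimal illustration only: `toy_energy_and_forces` is a hypothetical stand-in (a single harmonic bond in arbitrary units), not an MLIP, and the time step and masses are made up.

```python
import numpy as np

def toy_energy_and_forces(positions, k=1.0, r0=1.0):
    """Hypothetical stand-in for an MLIP: one harmonic bond between atoms 0 and 1."""
    d = positions[1] - positions[0]
    r = np.linalg.norm(d)
    energy = 0.5 * k * (r - r0) ** 2
    # Force on atom 1 is -dE/dr along the bond; atom 0 feels the opposite force.
    f1 = -k * (r - r0) * d / r
    return energy, np.stack([-f1, f1])

def velocity_verlet(positions, velocities, masses, model, dt, n_steps):
    """3D atomic positions -> energy + forces -> update positions, repeated."""
    positions = positions.copy()
    velocities = velocities.copy()
    _, forces = model(positions)
    trajectory = [positions.copy()]
    for _ in range(n_steps):
        velocities += 0.5 * dt * forces / masses[:, None]  # half kick
        positions += dt * velocities                       # drift
        _, forces = model(positions)                       # new model call
        velocities += 0.5 * dt * forces / masses[:, None]  # half kick
        trajectory.append(positions.copy())
    return np.array(trajectory), velocities

# Two atoms, bond stretched to 1.2 (equilibrium length 1.0), starting at rest.
positions = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0]])
velocities = np.zeros_like(positions)
masses = np.ones(2)
traj, vel = velocity_verlet(positions, velocities, masses,
                            toy_energy_and_forces, dt=0.01, n_steps=1000)
```

Swapping the toy function for a trained model changes only the `model` argument. The cost per step is one model call, which is why near-FF inference speed matters for long trajectories.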

3 of 39

Lots of general MLIP dev has focused on materials

https://matbench-discovery.materialsproject.org/

2022: MPtrj dataset, 2023: Matbench Discovery leaderboard

4 of 39

>300 citations in <3 years

5 of 39

Architectural improvements can only get you so far…

6 of 39

Barroso-Luque, Zitnick, Ulissi et al. “Open Materials 2024 (OMat24) Inorganic Materials Dataset and Models” arXiv 2024

>60x larger than MPtrj

Much higher forces

More diverse sampling

7 of 39

Models trained on more/better data are better!

Now, atomistic simulations of well-behaved crystalline materials should always start with a pre-trained MLIP

8 of 39

DFT datasets (timeline: 2021–2025)

  • Catalysis (OC20, OC22): 230M
  • MOFs (ODAC23): 29M
  • Materials (OMat24): 100M

~400M core hrs each

Industrial resources enable massive data generation

All fully open-source!

“Hey Sam…”

What’s missing? Molecular chemistry!

9 of 39

DFT datasets (timeline: 2021–2025)

  • Catalysis (OC20, OC22): 230M
  • MOFs (ODAC23): 30M
  • Materials (OMat24): 100M
  • Molecules (OMol25): 100M

“Hey Sam…”

10 of 39

Open Molecules 2025: Unlocking MLIPs for chemistry

Industry

  • Meta
  • Genentech

Government

  • LBNL
  • LANL

Academia

  • UC Berkeley
  • CMU
  • Stanford
  • NYU
  • Princeton
  • Cambridge

ωB97M-V / def2-TZVPD / 99-590 grid

>6 billion core hours

11 of 39

OMol coverage & complexity

  • AIMNet2: 14 elements
  • SPICE2: 17 elements
  • QCML: 79 elements
  • OMol25: 83 elements

12 of 39

OMol coverage & complexity

SPICE2

AIMNet2

QCML

13 of 39

OMol coverage & complexity

QCML: <26 atoms

only one molecule

AIMNet2: <80 atoms

SPICE2: <110 atoms

“The richness (and challenge) of chemistry is mostly in the intermolecular interactions”

14 of 39

OMol coverage & complexity

OMol25: 5.9B atoms

AIMNet2: ~600M atoms

QCML: ~600M atoms

Better functional

Better basis set

Tighter grid

15 of 39

OMol construction / structural sampling

Boltzmann rattling + optimization

Geodesic interpolation R -> TS -> P

10% also run as triplets

10% add electron, 10% remove electron

Up to 3 optimization steps

  • Generate 3D endpoint structures
  • Applied force induced reactivity (AFIR) with MACE-MP-0
  • Sample dissimilar/high energy/high force structures along overall AFIR trajectory
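
One way to picture the last bullet is a greedy filter over trajectory frames that prefers high-force frames and skips near-duplicates. The sketch below is an assumption for illustration, not the selection procedure used for OMol25: `select_frames`, the `min_rmsd` threshold, and the unaligned RMSD measure are all hypothetical choices.

```python
import numpy as np

def select_frames(positions, forces, n_select, min_rmsd=0.1):
    """Greedily pick frames by descending max atomic force, skipping near-duplicates.

    positions, forces: arrays of shape (n_frames, n_atoms, 3).
    Dissimilarity is a plain unaligned RMSD, which is only a crude proxy.
    """
    max_force = np.linalg.norm(forces, axis=2).max(axis=1)  # (n_frames,)
    order = np.argsort(-max_force)                          # highest force first
    chosen = []
    for i in order:
        is_new = all(
            np.sqrt(((positions[i] - positions[j]) ** 2).sum(axis=1).mean()) >= min_rmsd
            for j in chosen
        )
        if is_new:
            chosen.append(int(i))
        if len(chosen) == n_select:
            break
    return chosen

# Synthetic 4-frame, 3-atom trajectory: frames 0 and 1 are nearly identical.
positions = np.zeros((4, 3, 3))
positions[1] += 0.01
positions[2] += 1.0
positions[3] += 2.0
forces = np.zeros((4, 3, 3))
forces[:, :, 0] = np.array([2.0, 3.0, 1.0, 0.5])[:, None]

chosen = select_frames(positions, forces, n_select=2)
```

Here frame 1 is taken first because it has the largest force, frame 0 is rejected as a near-duplicate of frame 1, and frame 2 fills the second slot. A real pipeline would likely use an aligned or descriptor-based distance and also weigh energies.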

16 of 39

OMol construction / structural sampling

Snapshots pulled from 300K, 400K MD

Multiple protonation/tautomer states

17 of 39

OMol construction / structural sampling

18 of 39

OMol construction / structural sampling

Template reactions from MOBH35, MOR41, ROST61

19 of 39

OMol construction / structural sampling

20 of 39

21 of 39

Quick aside – FAIR Chemistry’s UMA model(s)

Universal Model for Atoms (UMA)

Inputs: structure, composition, total charge & spin multiplicity, task

  • OMol task: total charge & spin multiplicity supplied
  • Not OMol task: null charge & spin multiplicity
  • Task embedding: OMol, OMC, OMat, ODAC, OC20

Router -> experts -> merged Mixture of Linear Experts UMA model

Outputs: energy, forces, stress

Wood et al. “UMA: A Family of Universal Models for Atoms” arXiv 2025
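
The "merged" part of a Mixture of Linear Experts can be illustrated with a small NumPy sketch: if the router's coefficients depend only on system-level inputs, the experts' weight matrices can be combined once into a single matrix before any per-atom work. The shapes, the softmax normalization, and the names `merge_experts` and `logits` are assumptions for illustration; this is not the UMA implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def merge_experts(expert_weights, router_logits):
    """Collapse K linear experts into one matrix: W = sum_k alpha_k * W_k."""
    alpha = softmax(router_logits)                       # (K,)
    return np.tensordot(alpha, expert_weights, axes=1)   # (out_dim, in_dim)

rng = np.random.default_rng(0)
experts = rng.normal(size=(4, 8, 16))  # K=4 hypothetical experts, each 8x16
logits = rng.normal(size=4)            # hypothetical router output for one system
x = rng.normal(size=16)                # one input feature vector

W = merge_experts(experts, logits)
y_merged = W @ x
# Reference: evaluate every expert separately, then mix the outputs.
y_mixture = sum(a * (Wk @ x) for a, Wk in zip(softmax(logits), experts))
```

Because the layers are linear, `y_merged` and `y_mixture` agree, so after merging the model costs the same as a single expert at inference time.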

22 of 39

Dataset splitting and test sets

23 of 39

Baseline results: test set energy and forces

24 of 39

Baseline results: test set energy and forces

25 of 39

Novel model evaluation tasks + metrics

26 of 39

Baseline results

27 of 39

Baseline results

Lots of room for MLIP architecture development to improve treatment of charge, spin, long-range interactions

28 of 39

Baseline results

https://benchmarks.rowansci.com/

Martinez group Slack:

GMTKN55 w/ metals, charged + open-shell

29 of 39

Models capture chemistry changing with charge, spin

  • Cu1+: tetrahedral
  • Cu2+: square planar

30 of 39

Models capture chemistry changing with charge, spin

  • Neutral ethylene carbonate (EC): stable
  • Radical anion EC: ring bond breaks

31 of 39

Training on OMol continues to improve model performance

32 of 39

Enthusiastic community reception

“OMol25-trained models give much better energies than the DFT level of theory I can afford” and “allow for computations on huge systems that I previously never even attempted to compute.”

Another Rowan user called this “an AlphaFold moment for computational chemistry”.

Rowan: Models correctly predicted relative barrier heights for C–F reductive elimination, C–O reductive elimination, and different aryl groups.

Grimme: “Benchmark results for OS reaction energies and barrier heights as well as for TM geometries confirm UMA’s strong transferability across broad regions of chemical space.”

g-xTB: WTMAD-2 of 9.3 kcal mol−1

UMA-s-1: WTMAD-2 of 6.1 kcal mol−1
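
For reference, WTMAD-2 is the weighted total mean absolute deviation introduced with the GMTKN55 benchmark. Assuming the two values above follow that standard definition (the slide does not say which variant or subset was used), the metric is:

```latex
\mathrm{WTMAD\text{-}2}
  = \frac{1}{\sum_{i=1}^{55} N_i}
    \sum_{i=1}^{55} N_i \,
    \frac{56.84\ \mathrm{kcal\,mol^{-1}}}{\overline{|\Delta E|}_i} \,
    \mathrm{MAD}_i
```

Here $N_i$ is the number of entries in benchmark subset $i$, $\overline{|\Delta E|}_i$ is the mean absolute reference energy of that subset, and $\mathrm{MAD}_i$ is the method's mean absolute deviation on it. The scaling keeps subsets with large reference energies from dominating, and lower values are better.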

30 citations in 10 weeks!

33 of 39

First paper using OMol/UMA: 25 days after release!

Avg RMSD = 0.24 Å

MAE = 1.65 kcal/mol

34 of 39

More OMol/UMA transition state opt from 3 weeks ago

ColabReaction: Accelerating Transition State Searches with Machine Learning Potentials on Google Colaboratory

Masayuki Karasawa, Chee Siang Leow, Hideaki Yajima, Shuta Arai, Hiromitsu Nishizaki, Tohru Terada, Hajime Sato

35 of 39

Additional OMol results: protein-ligand binding (Rowan)

https://rowansci.com/blog/benchmarking-protein-ligand-interaction-energy

36 of 39

Additional OMol results: bond dissociation energies (Rowan)

https://rowansci.com/publications/expbde54

Note: MLIPs much faster on GPU

37 of 39

Last week:

38 of 39

OMol isn’t quite done yet…

  • Extending training set:
      - d-block intermediate spins
      - More diverse heavy main group, noble gases
      - Chunks of molecular crystals
      - Polymers!
  • Lots of additional information:
      - Dipoles / quadrupoles
      - Fock matrices, NBO bonding/orbital info, reduced orbital populations
      - Planned GBW postprocessing: better partial charges/spins, QTAIM, etc.
  • Six petabytes of wavefunctions – working on dissemination with Argonne
  • Additional test set: solvated lanthanides
  • Additional evaluation tasks: transition state optimization, non-minima Hessians
  • Public leaderboard to encourage community engagement in MLIP model dev

39 of 39

Acknowledgements

Muhammed Shuaibi

Daniel Levine

Brandon Wood

Santiago Vargas (LBL)

Andrew Rosen (Princeton)

Evan Spotte-Smith (CMU)

Michael Taylor (LANL)

Muhammad Hasyim (NYU)

Ilyes Batatia (Cambridge)

Gabor Csanyi (Cambridge)

Peter Eastman (Stanford)

Nathan Frey (Prescient / Genentech)

Aditi Krishnapriyan (Berkeley)

Joshua Rackers (Prescient / Genentech)

Sanjeev Raja (Berkeley)

Larry Zitnick

Zack Ulissi

Kyle Michel

Misko Dzamba

Vahe Gharakhanyan

Ammar Rizvi

Xiang Fu
