2 of 86

Contents

Section 1: Introduction to the concept of machine-learning potentials (MLPs)

Molecular dynamics & DFT
Basic concept of MLIP
Models
Training set generation
Universal pretrained MLIPs

Section 2: Practical use of MLIPs

How to choose model and code for your problem?
How to sample training set?
How to set hyperparameters?
How to know whether the simulation is going wrong or not?

Section 3: Practice with codes (Colab)

Schedule: Section 1 from 14:00 to 15:25, 10 min break, Section 2 from 15:35 to 16:25, 5 min break, Section 3 from 16:30

3 of 86

2024 Nobel prize in physics

Machine-learning potential paper

An 18-year-old field!

4 of 86

Success stories

Hydrogen phase transition

Amorphous Si

NH₃ decomposition catalysis

Phase change memory

Cheng et al. Nature 585, 217 (2020)

Deringer et al. Nature 589, 59 (2021)

Yang, Parrinello et al. Nat. Catal. 6, 829 (2023)

Zhou, Zhang, Deringer et al. Nat. Electron. 6, 746 (2023)

5 of 86

Introduction to the concept of machine-learning potentials

Molecular dynamics & DFT
Basic concept of MLP
Model 1: Descriptor-based models
Model 2: E(3)-equivariant graph models
Training set generation
Universal pretrained MLPs

6 of 86

Molecular dynamics

What is molecular dynamics (MD)?

Change in atomic positions over time:

Velocity, v:

Acceleration, a:

Quantum mechanical calculations: density functional theory (DFT)

HΨ(r₁, r₂, …, r_N) = EΨ(r₁, r₂, …, r_N)

Accurate & general
Low speed (<500 atoms)

Classical interatomic potentials (ionic bonding, covalent bonding, noble gases)

High speed (millions of atoms)
Limited to specific systems

J. Manuf. Sci. Eng. Apr 2014, 136(2): 021015

V: Potential energy
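The quantities on this slide are linked by v = dr/dt and a = dv/dt = F/m = −(1/m) ∂V/∂r. As a minimal sketch of how an MD code advances positions in time, the example below integrates a single particle in a 1D harmonic potential with the velocity Verlet scheme. The toy potential, the time step, and all function names are assumptions chosen for illustration; production MD codes work in 3D with many atoms and forces from DFT, a classical potential, or an MLP.

```python
import math


def harmonic_force(x, k=1.0):
    """Force F = -dV/dx for the toy potential V(x) = 0.5 * k * x**2."""
    return -k * x


def velocity_verlet(x, v, force, mass=1.0, dt=0.01, n_steps=1000):
    """Propagate one particle in 1D with the velocity Verlet integrator."""
    a = force(x) / mass
    trajectory = [(x, v)]
    for _ in range(n_steps):
        x = x + v * dt + 0.5 * a * dt * dt   # update position
        a_new = force(x) / mass              # acceleration at the new position
        v = v + 0.5 * (a + a_new) * dt       # update velocity
        a = a_new
        trajectory.append((x, v))
    return trajectory


def total_energy(x, v, k=1.0, mass=1.0):
    """Kinetic plus potential energy of the toy oscillator."""
    return 0.5 * mass * v * v + 0.5 * k * x * x


traj = velocity_verlet(x=1.0, v=0.0, force=harmonic_force)
e_start = total_energy(*traj[0])
e_end = total_energy(*traj[-1])
print(f"energy drift over {len(traj) - 1} steps: {e_end - e_start:.2e}")
```

The total energy should stay nearly constant in such an NVE run, which is a common sanity check for both the integrator and the force model.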

7 of 86

Density-functional theory (DFT)

Structure

Input:

Energy, wavefunction

Output:

Periodic boundary condition (PBC)

Quantum mechanics

HΨ(r₁, r₂, …, r_N) = EΨ(r₁, r₂, …, r_N)

0.02 L = 6.02×10²³ atoms

How can we simulate such a large number of atoms?

Typically, 100–200 atoms are used to simulate liquid structures in DFT calculations.

8 of 86

Scales of materials simulations

Commun. Mater. 4, 66 (2023)

All chemical reactions

Specific chemical reactions

Limited chemical reactions

9 of 86

Machine-learning potentials (MLPs)

Training set from DFT calculations (small structures) → machine learning model (MLP) → target simulation (big structures)

[Figure: Computation time for Si. Axes: # of atoms (10² to 10⁴) vs. time (s) (10⁻⁶ to 10⁶). Scaling: DFT ~ O(N³), MLP ~ O(N), classical MD ~ O(N).]

Energy = f(structure)

10 of 86

Example: modeling HF etching process with MLP

Diverse sampling techniques / Time scale: ps scale

Time scale: ns scale

a-Si₃N₄

a-Si₃N₄ + HF

Training set generation (DFT)

Simulation (MLP)

Simulation target

HF etching of amorphous Si₃N₄ for semiconductor process

C. Hong et al. ACS Appl. Mater. Interfaces 16, 48457 (2024)

11 of 86

Introduction to the concept of machine-learning potentials

Molecular dynamics & DFT
Basic concept of MLP
Model 1: Descriptor-based models
Model 2: E(3)-equivariant graph models
Training set generation
Universal pretrained MLPs

12 of 86

Challenges in representing material structures

A straightforward try:

Input: (x₁,y₁,z₁,x₂, … x₆,y₆,z₆)

x₁

y₁

z₆

…

E_DFT

Challenges you may encounter:

(1) Unable to account for translational, rotational, and permutational invariance.

Input: (x₁, y₁, z₁, x₂, … x₆, y₆, z₆)

Input: (x₁+Δ, y₁, z₁, x₂+Δ, … x₆+Δ, y₆, z₆)

Constant

shift by Δ

Get different outputs

(2) Not transferable to systems with larger or smaller cells.

→The input length varies, making it incompatible with the trained model.

Energy = f(structure)

13 of 86

Symmetries that MLP should satisfy

(x₁+Δ, y₁, z₁, x₂+Δ, … x₆+Δ, y₆, z₆)

(1) Translational symmetry (+periodic boundary condition)

(2) Rotational symmetry

Constant

shift by Δ

(x₁,y₁,z₁,x₂, … x₆,y₆,z₆)

(y₁,x₁,z₁,y₂, … y₆,x₆,z₆)

(3) Permutational symmetry

(x₁,y₁,z₁,x₂, … x₆,y₆,z₆)

(x₄,y₄,z₄,x₅, … x₃,y₃,z₃)

→ The output (energy) should remain invariant (does not change) under these transformations.
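A small numerical check can make these three symmetries concrete. The sketch below compares the raw coordinate list with a simple invariant fingerprint (the sorted list of interatomic distances) under a translation, a rotation, and a permutation. The fingerprint and the helper names are assumptions for illustration only; they are not the descriptors used by any particular MLP, and periodic boundary conditions are ignored.

```python
import math
import random


def pairwise_distances(coords):
    """Sorted list of all interatomic distances (a simple invariant fingerprint)."""
    d = []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d.append(math.dist(coords[i], coords[j]))
    return sorted(d)


def translate(coords, shift):
    """Shift every atom by the same vector."""
    return [tuple(c + s for c, s in zip(p, shift)) for p in coords]


def rotate_z(coords, angle):
    """Rotate every atom about the z axis."""
    ca, sa = math.cos(angle), math.sin(angle)
    return [(ca * x - sa * y, sa * x + ca * y, z) for x, y, z in coords]


def close(a, b, tol=1e-9):
    """Element-wise comparison of two equally long lists."""
    return all(abs(x - y) < tol for x, y in zip(a, b))


random.seed(0)
atoms = [tuple(random.uniform(0, 5) for _ in range(3)) for _ in range(6)]
shifted = translate(atoms, (0.7, 0.0, 0.0))   # constant shift by Δ along x
rotated = rotate_z(atoms, 1.0)
permuted = atoms[3:] + atoms[:3]              # relabel the atoms

raw_changed = atoms != shifted                # raw input vector is different
reference = pairwise_distances(atoms)
print(raw_changed, close(reference, pairwise_distances(shifted)))
```

A model fed with the raw coordinates sees three different inputs, whereas a model fed with the fingerprint sees the same input in all three cases.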

14 of 86

Atomic energy mapping

Atom 1

Relative coordinates

…

E_atom,1

E_atom,2

E_atom,3

E_atom,N

E_tot

Relative coordinates

…

Atom 2

Atomic energies

Total energy

(DFT)

DFT total energies are assumed to be decomposable into transferable atomic energies.
However, training is performed to predict total energy values.
See PRM 3, 093802 (2019) for details.

E_total = ∑ E_atom

Atomic energy mapping

Not given

(Estimated during training)

Data given for training

Relative coordinates: translational invariance
Atomic energies: permutational invariance

15 of 86

Force calculation

Atom 1

Relative coordinates

…

E_atom,1

E_atom,2

E_atom,3

E_atom,N

E_tot

Relative coordinates

…

Atom 2

Atomic energies

Total energy

(DFT)

Force: F_(i,α) = −∂E_tot/∂r_(i,α)

i: atomic index

α: directional index (x, y, or z)

Loss function

Total energy error

Atomic force error

Stress error
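As a rough sketch, the three error terms can be combined into one weighted loss. The normalization below (energy error per atom, mean squared error over force and stress components) and the default coefficients are assumptions for illustration; each MLP code defines its own normalization, so the input file documentation of the code in use should be consulted.

```python
def mse(pred, ref):
    """Mean squared error between two equally long sequences."""
    return sum((p - r) ** 2 for p, r in zip(pred, ref)) / len(ref)


def mlp_loss(e_pred, e_ref, f_pred, f_ref, s_pred, s_ref, n_atoms,
             energy_coeff=1.0, force_coeff=0.1, stress_coeff=1.0e-6):
    """Weighted sum of energy, force, and stress errors for one structure."""
    terms = {
        "energy": ((e_pred - e_ref) / n_atoms) ** 2,  # per-atom energy error
        "force": mse(f_pred, f_ref),                  # over all 3N components
        "stress": mse(s_pred, s_ref),                 # over 6 components
    }
    loss = (energy_coeff * terms["energy"]
            + force_coeff * terms["force"]
            + stress_coeff * terms["stress"])
    return loss, terms


# Hypothetical numbers for a two-atom structure.
loss, terms = mlp_loss(
    e_pred=-10.2, e_ref=-10.0,
    f_pred=[0.3, 0.0, 0.0, 0.0, 0.0, 0.0], f_ref=[0.0] * 6,
    s_pred=[10.0, 0.0, 0.0, 0.0, 0.0, 0.0], s_ref=[0.0] * 6,
    n_atoms=2,
)
print(loss, terms)
```

Returning the individual terms alongside the total makes it easy to monitor energy, force, and stress errors separately during training.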

16 of 86

Types of MLP models

(1) Descriptor-based models

(2) Graph models

Descriptor function

E_atom,1

E_atom,2

E_tot

Descriptor function

Atom 1

Atom 2

Total energy

(DFT)

NN of element 1

NN of element 2

…

Figure: PRL 120, 145301 (2018) (note: this paper is not about MLPs)

Graph construction by connectivity

Graph convolution neural network

Atomic

energies

Atomic

energies

E₁

E₂

E₃

17 of 86

Types of MLP models

(1) Descriptor-based models

Descriptor function

E_atom,1

E_atom,2

E_tot

Descriptor function

Atom 1

Atom 2

Total energy

(DFT)

NN of element 1

NN of element 2

…

Atomic

energies

18 of 86

Descriptor model 1: Behler-Parrinello neural network (BPNN) potential

Atom 1

Relative coordinates

…

E_atom,1

E_atom,2

E_atom,N

E_tot

Relative coordinates

…

Atom 2

Atomic energies

Total energy

(DFT)

Behler and Parrinello, PRL, 98, 146401 (2007)

Descriptor

Descriptor: symmetry function

G_i = [G_i^(radial,η₁), G_i^(radial,η₂), G_i^(radial,η₃), …, G_i^(angular,ζ₁), G_i^(angular,ζ₂), G_i^(angular,ζ₃), …]

R_ij

R_ik

θ_ijk

R_c

R(Å)

θ(rad)

f_c: cutoff function

→ Used as input vectors for neural networks predicting atomic energies.

Rotationally

invariant

2-body

3-body
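To illustrate the 2-body part, here is a minimal sketch of a radial symmetry function in the commonly quoted Behler-Parrinello form, G_i = Σ_j exp(−η (R_ij − R_s)²) f_c(R_ij), with the cosine cutoff f_c(R) = 0.5 [cos(πR/R_c) + 1] for R < R_c and 0 otherwise. The slide does not spell out these expressions, so the functional form, the parameter values, and the function names below are assumptions for illustration; periodic images and the angular (3-body) functions are omitted.

```python
import math


def cutoff(r, r_c):
    """Cosine cutoff function: decays smoothly to zero at r = r_c."""
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0) if r < r_c else 0.0


def radial_symmetry_function(i, coords, eta, r_c, r_s=0.0):
    """Radial (2-body) symmetry function of atom i for one eta value."""
    g = 0.0
    for j, pos in enumerate(coords):
        if j == i:
            continue
        r_ij = math.dist(coords[i], pos)
        g += math.exp(-eta * (r_ij - r_s) ** 2) * cutoff(r_ij, r_c)
    return g


def descriptor(i, coords, etas, r_c):
    """Descriptor vector of atom i: one radial function per eta."""
    return [radial_symmetry_function(i, coords, eta, r_c) for eta in etas]


coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 7.0)]
rotated = [(-y, x, z) for x, y, z in coords]   # 90 degree rotation about z
etas = [0.5, 2.0]
r_c = 6.0

g = descriptor(0, coords, etas, r_c)
g_rot = descriptor(0, rotated, etas, r_c)
print(g, g_rot)
```

Because only distances enter, the vector is unchanged by rotation, and the atom at 7 Å contributes nothing since it lies beyond R_c.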

19 of 86

Descriptor model 2: DeePMD-kit

Zhang, Wang, E et al. PRL 120, 143001 (2018)

Descriptor

D_i = {D_ij | j ∈ neighbors of i}

How can rotational and permutational invariance be ensured?

Relative coordinates are directly used as an input of NN

(1) Rotational invariance: Adjust relative axes based on first- and second-nearest neighbors.

(2) Permutational invariance: sort D_ij by R_ij

Rotational matrix:

R_ia: first nearest neighbor

R_ib: second nearest neighbor

* Disadvantage: discontinuity

20 of 86

DeePMD-kit ver. 2: DeepPot-SE

Continuously differentiable version of DeePMD-kit

Zhang, E, et al. NeurIPS (2018); arxiv:1805.09003

21 of 86

Descriptor model 3: Gaussian approximation potential (GAP)

Descriptor: Smooth Overlap of Atomic Positions (SOAP)

…

Training point 1:

See Gabor Csányi, https://www.youtube.com/watch?v=wpJbSjq6QDw

Training point 2:

Training point 3:

Training point N:

New point (NP)

k(1,NP)

k(2,NP)

k(3,NP):

k(N,NP):

k(i,j): kernel, the similarity between i and j
Mathematically equivalent to a one-layer neural network
Easy estimation of uncertainty

Gaussian process

Spherical harmonics

Bartók, Csányi, et al. PRL 104, 136403 (2010)

22 of 86

Descriptor model 3: Gaussian approximation potential (GAP)

…

Training point 1:

Training point 2:

Training point 3:

Training point N:

New point (NP)

k(1,NP)

k(2,NP)

k(3,NP):

k(N,NP):

k(i,j): kernel, the similarity between i and j
Mathematically equivalent to a one-layer neural network
Easy estimation of uncertainty

Gaussian process

Bartók, Csányi, et al. PRL 104, 136403 (2010)

Uncertainty estimation of the Gaussian process

High uncertainty region (lack of training data)

Uncertainty is directly derived from the machine-learning model without using an ensemble.
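The following sketch shows how a Gaussian process yields both a prediction and an uncertainty from the kernels k(i, NP). It uses a scalar input and a squared-exponential kernel purely for illustration; GAP itself uses the SOAP kernel on atomic environments, and the length scale, noise value, and function names here are assumptions.

```python
import math


def kernel(a, b, length=1.0):
    """Squared-exponential kernel: similarity between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def gp_predict(x_train, y_train, x_new, noise=1e-6):
    """Return the GP predictive mean and variance at x_new."""
    n = len(x_train)
    k_mat = [[kernel(x_train[i], x_train[j]) + (noise if i == j else 0.0)
              for j in range(n)] for i in range(n)]
    k_star = [kernel(x, x_new) for x in x_train]      # k(i, NP)
    alpha = solve(k_mat, y_train)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(k_mat, k_star)
    variance = kernel(x_new, x_new) - sum(k * w for k, w in zip(k_star, v))
    return mean, max(variance, 0.0)


x_train = [0.0, 1.0, 2.0]
y_train = [math.sin(x) for x in x_train]
mean_near, var_near = gp_predict(x_train, y_train, 1.0)   # inside the data
mean_far, var_far = gp_predict(x_train, y_train, 6.0)     # far from the data
print(var_near, var_far)
```

The variance is small at a training point and approaches the prior variance far from all training points, which corresponds to the high-uncertainty region on the slide. The loop over all training points also shows why inference time grows with training set size.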

23 of 86

Descriptor models summary: BP-NNP vs DeePMD-kit vs GAP

Neural network models: Behler-Parrinello NNP, DeePMD-kit

Pros: handles large training sets (shorter inference time)
Cons: long training time

Behler-Parrinello NNP
Complicated descriptor (symmetry functions of R_ij, R_ik, θ_ijk within R_c)
Shallow NN model
Generally reliable
Low GPU acceleration

DeePMD-kit
Simple descriptor
Deep NN model
Requires careful attention
High GPU acceleration

Gaussian process model: GAP

Kernels k(1,NP), k(2,NP), k(3,NP), …, k(N,NP) between training points 1 to N and a new point (NP)
Pros: short training time + uncertainty
Cons: inference time ~ training set size

All models: E_atom,1 + E_atom,2 + … + E_atom,N = E_tot, fitted to the DFT total energy

24 of 86

Limitations of descriptor models

Atom 1

Relative coordinates

…

E_atom,1

E_atom,2

E_atom,N

E_tot

Relative coordinates

…

Atomic energies

Total energy

(DFT)

Descriptor

Element A

Element B

NN of element A

2-body: A-B

2-body: A-A

3-body: A-A-B

3-body: A-A-A

3-body: B-A-B

…

2-body: A-A

3-body: B-A-B

…

2-body: A-A

3-body: B-A-B

…

Hyperparameter set 1

Hyperparameter set 2

Hyperparameter set N

…

Atomic

energy

Limitation 1: Poor scaling with the number of elements.
Number of NNs = number of elements
Size of each input vector ~ (number of elements)²
→ Total parameters ~ (number of elements)³

Limitation 2: Knowledge from one element is not transferred to others, as a distinct network is used for each element.

Input

Hidden

25 of 86

Types of MLP models

(1) Descriptor-based models

(2) Graph models

Descriptor function

E_atom,1

E_atom,2

E_tot

Descriptor function

Atom 1

Atom 2

Total energy

(DFT)

NN of element 1

NN of element 2

…

Figure: PRL 120, 145301 (2018) (note: this paper is not about MLPs)

Graph construction by connectivity

Graph convolution neural network

Atomic

energies

Atomic

energies

E₁

E₂

E₃

26 of 86

Types of MLP models

(2) Graph models

Figure: PRL 120, 145301 (2018) (note: this paper is not about MLPs)

Graph construction by connectivity

Graph convolution neural network

Atomic

energies

E₁

E₂

E₃

27 of 86

E(3)-equivariant graph machine-learning potentials

1^st convolution

2^nd convolution

3^rd convolution

Message passing

E(3)-equivariant graph model

Node

Features

(scalars)

Scalar

(l=0)

Vector

(l=1)

Rank 2 tensor

(l=2)

Feature vectors consist of tensors, in addition to scalars.

Rotational transformation

Equivariant networks are more data-efficient and accurate compared to descriptor-based models.
Errors can be reduced by a factor of two to three with equivariant networks, though they are computationally heavier than descriptor-based models.

28 of 86

Equivariant graph neural network

x₂₂ = σ(w₁₁₂x₁₁ + w₁₂₂x₁₂+ b₁)

…

x₂₁

x₁₁

w₁₁₁

w₁₁₄

w₁₁₂

w₁₁₃

x₂₁ = σ(w₁₁₁x₁₁ + w₁₁₂x₁₂ + …)

Neural network

Graph NN (message-passing NN)

Equivariant GNN

Message from 1 to 2 = w₁₁₂ ⨂ x₁₂

Edge tensor: w_(112),lm = R(r₁₂) Y_lm(r₁₂)

w₁₂₂

w₁₁₂

x₁₁

x₁₂

x₂₂

Edge

Node

x₁₁

x₁₂

x₁₃

x₁₄

w₁₁₁

w₁₁₂

w₁₁₃

w₁₁₄

x₂₁

x₁₁

x₁₂

w₁₁₂

r₁₂

Radial term

(include trainable weights)

Spherical harmonics

Graph

Input

Hidden

Output

Input

Hidden

Output

Tensor

For instance, when l = 1

Y_(1,−1)(θ, φ) = C sinθ sinφ → y

Y_(1,0)(θ, φ) = C cosθ → z

Y_(1,1)(θ, φ) = C sinθ cosφ → x

29 of 86

What is a tensor?

E(3) group = 3D Euclidean group, which comprises translations, rotations, and reflections (parity).

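l = 0: scalar (0e, even parity, p = 1); pseudoscalar (0o, odd parity, p = −1)

l = 1: pseudovector (1e, even parity, p = 1); vector (1o, odd parity, p = −1)

l = 2

Parity (from mirror symmetry)

Order (l) = angular momentum quantum number

Projection index (m) = magnetic quantum number

m ∈ {−l, −(l−1), …, (l−1), l}

[Figure: real spherical harmonics arranged by l and m. l = 1 (m = −1, 0, 1): p_y, p_z, p_x. l = 2 (m = −2, …, 2): d_xy, d_yz, d_z², d_xz, d_x²−y².]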

30 of 86

Structure of equivariant network (NequIP structure for example)

One-hot embedding

First-layer node

Second-layer node

Scalars (0/1)

Scalars

Conventional NN

E(3) NN

Scalar

(l=0)

Vector

(l=1)

Tensor

(l=2)

Edge (=filter, f)

Convolution

Eigenfunction

of rotational

operator

CG coefficient:

Radial neural network

Bessel function

Nat. Commun. 13, 2453 (2022)

Clebsch-Gordan coeff.

Radial

part

Spherical

harmonics

Node

feature

[Edge tensor ⊗ Node tensor]_lf,pf

Energy

(scalar)

Message from node b to a

31 of 86

Atomic cluster expansion

Many-body messages

2-body

3-body

4-body

5-body

Atomic cluster expansion (ACE)

Multi-ACE

Cf: Graph ACE (GRACE)

PRB 99,014104 (2019)

=L in Previous slides

32 of 86

Invariance vs equivariance

Descriptor (input)

Descriptor model = Invariant model

f(x)

f(Rx)

Hidden layers

g(f(x))

g(f(Rx))

Output

Energy

Equivariant model

f(x)

f(Rx) = Rf(x)

Convolution

layers

≠

Output

Energy
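The relation f(Rx) = Rf(x) can be checked numerically on a toy l = 1 feature. In the sketch below, each atom carries a vector feature built as a radially weighted sum of unit vectors toward its neighbors; rotating the structure rotates the feature (equivariance), while its norm stays fixed (invariance). The feature definition, cutoff form, and names are assumptions for illustration and are far simpler than the tensor products used in NequIP or MACE.

```python
import math


def rotate_z(p, angle):
    """Rotate a 3D point or vector about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x - s * y, s * x + c * y, z)


def vector_feature(center, neighbors, r_c=5.0):
    """Toy l = 1 feature: radially weighted sum of unit vectors to neighbors."""
    feat = [0.0, 0.0, 0.0]
    for n in neighbors:
        d = [n[k] - center[k] for k in range(3)]      # relative coordinates
        r = math.sqrt(sum(x * x for x in d))
        if r == 0.0 or r >= r_c:
            continue
        w = 0.5 * (math.cos(math.pi * r / r_c) + 1.0)  # smooth radial weight
        for k in range(3):
            feat[k] += w * d[k] / r
    return tuple(feat)


def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


center = (0.5, 0.2, -0.1)
neighbors = [(1.5, 0.2, 0.3), (0.1, 2.0, 0.0), (-1.0, -0.5, 1.2)]
angle = 0.8

f_x = vector_feature(center, neighbors)
f_rx = vector_feature(rotate_z(center, angle),
                      [rotate_z(n, angle) for n in neighbors])
r_f_x = rotate_z(f_x, angle)
print(f_rx, r_f_x, norm(f_x), norm(f_rx))
```

An invariant descriptor keeps only quantities like the norm, whereas an equivariant network carries the full vector through its layers and contracts to a scalar energy only at the output.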

33 of 86

Role of equivariance

Descriptor (input)

f(x)

Hidden layers

g(f(x))

Output

Energy

f(x)

Convolution

layers

Output

Energy

Structural

representation

Energy regression

Structural representation + energy regression

at the same time

→ The MLIP also learns an effective structural representation.

Descriptor model = Invariant model

Equivariant model

34 of 86

Why are E(3)-equivariant graph NNs powerful?

(1) Increase of the cutoff through message passing

(2) All elements share the same network, differing only in their initial embedding vectors.

→ Computational cost does not increase with the number of elements.

One-hot embedding

Scalars (0/1)

(3) The network consists of high-rank tensors, enhancing representability in geometric spaces.

Nat. Mach. Intell. 7, 56 (2025)

Scalar

(l=0)

Vector

(l=1)

Rank 2 tensor

(l=2)

35 of 86

Parallelization issue

Improved parallelization algorithm of SevenNet

Parallelization performance

Park, Han et al. J. Chem. Theory Comput. 20, 4857 (2024)

Problem: Graph neural network potentials exhibit poor parallelization performance due to constant communication between nodes.

SevenNet addresses the parallelization issue by integrating a communication block within the convolution layers.

36 of 86

Descriptor model vs graph model

R_ij

R_ik

θ_ijk

R_c

E_atom,1

E_atom,2

E_atom,N

E_tot

Total energy

(DFT)

Descriptor models: BP-NNP, DeePMD-kit, GAP, …
Fast, and can run on CPUs
Higher errors (0.1 ~ 0.5 eV/Å)
Practical for fewer than five elements (below quinary compositions)
Poor data efficiency

Graph models: NequIP, MACE, …
Node features: scalars (l=0), vectors (l=1), tensors (l=2)
Slow, and may not be practical on CPUs
Lower errors
Up to 100-element compositions
Good data efficiency

37 of 86

Tensor-based, but not message-passing models

Moment tensor potential (MTP)

Allegro

Linear function
Descriptors: tensors in cartesian coordinates

Cf) NequIP & MACE: tensors in spherical coordinates

Shapeev, arxiv:1512.06054 (2015)

Review: Mach. Learn. Sci. Technol. 2 025002 (2021)

Two-body messages as a descriptor for MLP
Not message-passing model; local model

Musaelian, Kozinsky et al. Nat. Commun. 14, 579 (2023)

38 of 86

Summary of MLP models

Neural network

Gaussian process

Behler-Parrinello NNP

DeePMD-kit

GAP

R_ij

R_ik

θ_ijk

R_c

Descriptor-based models

E(3)-equivariant graph models

Scalars (0/1)

Scalars

Scalar

(l=0)

Vector

(l=1)

Tensor

(l=2)

NequIP, MACE, …

39 of 86

Long-range interaction

Cutoff

Long range: Mostly Coulomb interaction

→ Cannot be fully described by conventional MLPs

Charge equilibration (Qeq) scheme + MLP

Predicted by ML

Qeq scheme

Electrostatic energy calculation with Ewald summation

Cons: high computational cost

Ko, Behler et al. Nat. Commun. 12, 398 (2021)

40 of 86

Issues arising from neglecting electrostatic interactions

Bulk diffusion barrier without defects

Defect formation energy

GAP and electrostatics-considered GAP (ES-GAP) yield similar results.
Error cancellation occurs due to isotropy.

Defect formation energy deviates by 0.3 eV when ES is not considered.
The error arises due to anisotropy around defects.

41 of 86

Introduction to the concept of machine-learning potentials

Molecular dynamics & DFT
Basic concept of MLP
Model 1: Descriptor-based models
Model 2: E(3)-equivariant graph models
Training set generation
Universal pretrained MLPs

42 of 86

Example: HF etching (1)

Target simulation:

a-Si₃N₄

a-Si₃N₄ + HF

Training set generation:

Non-reactive data:

Crystal, amorphous, molecules, …

molecular dynamics

Target events:

guided MD

Unexpected events:

4,500 – 10,000 K

To increase accuracy in unexpected structures.

Hong, Oh, Han et al. ACS Appl. Mater. Interfaces 16, 48457 (2024)

43 of 86

Example: HF etching (2)

Guided MD accelerates rare reactions by gradually applying constraints on a chosen reaction coordinate

Guided MD

Let R_N-H + R_Si-F decrease at a constant rate (0.02 Å/fs)

Let R_N-H + R_Si-F remain constant

Results

Hong, Oh, Han et al. ACS Appl. Mater. Interfaces 16, 48457 (2024)

44 of 86

Atomic energy mapping

Atom 1

Relative coordinates

…

E_atom,1

E_atom,2

E_atom,3

E_atom,N

E_tot

Relative coordinates

…

Atom 2

Atomic energies

Total energy

(DFT)

DFT total energies are assumed to be decomposable into transferable atomic energies.
However, training is performed to predict total energy values.
See PRM 3, 093802 (2019) for details.

E_total = ∑ E_atom

Atomic energy mapping

Not given

(Estimated during training)

Data given for training

Relative coordinates: translational invariance
Atomic energies: permutational invariance

45 of 86

Sampling training set 1 – using intuition

InP core

ZnSe shell

Bulk

Surface

Interface

Edge and vertex

Simulation target

Kang et al. ACS Mater. Au (2022)

46 of 86

Sampling training set 2 – active learning / iterative learning

Active learning framework

Simulation

with MLP

Configuration not included

in the training set

DFT calculations

MLP update

Simulation with the updated MLP

Uncertainty estimation with ensemble

Uncertainty

= standard deviation

Untrained structure

Trained structure

High variation

W. Jung and S. Han et al. J. Phys. Chem. Lett. (2020)

The first method is on-the-fly active learning.

This method refines the machine-learned potential while the simulation is running.

Specifically, a simulation driven by an MLP may encounter a configuration that is not included in the training set and may therefore have a large error.

In an active learning scheme, DFT calculations are performed on that configuration, and the result is added to the training set to update the MLP.

Therefore, in active learning, it is important to determine whether a given configuration has already been sampled.

The quantity used for this is called uncertainty in the machine learning field.

We used the replica ensemble method to define the uncertainty of the system.

A replica ensemble is a set of machine-learned potentials trained on the atomic energies.

The standard deviation of the ensemble predictions is defined as the uncertainty.

For example, an untrained structure, which has a large error, shows high variation, whereas a trained structure yields nearly uniform outputs across the models.

Active learning is a very powerful approach because, in principle, it can be applied to any problem.

However, in our experience, starting active learning from scratch is highly inefficient because too many structures have to be sampled.

Therefore, even with an active learning scheme, a reasonably good MLP is needed to start the simulation.
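The selection step of this loop can be sketched in a few lines: compute the standard deviation of the ensemble predictions for each configuration and send those above a threshold to DFT. The energies, the threshold, and the function names below are hypothetical values for illustration; the actual criterion and threshold depend on the code and the system.

```python
import statistics


def ensemble_uncertainty(predictions):
    """Standard deviation across ensemble members for one configuration."""
    return statistics.pstdev(predictions)


def select_for_dft(ensemble_predictions, threshold):
    """Indices of configurations whose uncertainty exceeds the threshold."""
    return [i for i, preds in enumerate(ensemble_predictions)
            if ensemble_uncertainty(preds) > threshold]


# Hypothetical energies (eV/atom) from 5 MLPs for three configurations.
ensemble_predictions = [
    [-5.01, -5.00, -5.02, -5.01, -5.00],   # trained-like: uniform outputs
    [-4.80, -5.30, -4.60, -5.10, -4.95],   # untrained-like: high variation
    [-5.20, -5.21, -5.19, -5.20, -5.20],   # trained-like: uniform outputs
]

selected = select_for_dft(ensemble_predictions, threshold=0.05)
print(selected)
```

Only the second configuration would be recomputed with DFT and added to the training set before the MLP is updated.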

47 of 86

Uncertainty estimation based on energy deviations within an ensemble

Atomic energy mapping is not unique!

How can uncertainty be estimated using energy values?

Atomic energy training procedure

Train one MLP
Calculate atomic energies
Train an ensemble of 4–6 MLPs on atomic energies rather than total energies.

Deviations can arise from both uncertainty and variations in atomic mapping across models.

W. Jung and S. Han et al. J. Phys. Chem. Lett. (2020)

→ Implemented in the SIMPLE-NN code

48 of 86

Uncertainty estimation based on force deviations within an ensemble

Nat. Catal. 6, 829 (2023)

When using force-based uncertainty, zero uncertainty may occur even in configurations with high errors.
In untrained configurations, forces can be zero if the structure is 'symmetric' due to the symmetry of the MLP architecture.

49 of 86

Other uncertainty prediction methods

50 of 86

Open-source active learning codes

https://github.com/mir-group/flare

GAP+ACE

DeePMD-kit

https://github.com/deepmodeling/dpgen

51 of 86

Sampling training set 3 – advanced sampling methods

Cannot sample

Only sample near equilibrium

Apply bias potential to avoid

already sampled configurations

Metadynamics

Molecular dynamics

How to define “sampled” configurations

Bias potential, U_b:

Bias force:

G: collective variable

Example: G=N-N distance, for N₂ dissociation

Bias

T. Ludwig and J. K. Nørskov et al. J. Phys. Chem. C (2020)
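The expressions for the bias potential and bias force are not reproduced in this text, so the sketch below assumes the standard metadynamics form: a sum of Gaussians deposited at previously visited values of the collective variable, U_b(G) = Σ_k w exp(−(G − G_k)²/(2σ²)), with the force along G given by −dU_b/dG. The Gaussian height, width, visited values, and function names are hypothetical.

```python
import math


def bias_potential(g, centers, height=0.1, width=0.2):
    """Sum of Gaussians deposited at previously visited CV values."""
    return sum(height * math.exp(-(g - c) ** 2 / (2.0 * width ** 2))
               for c in centers)


def bias_force(g, centers, height=0.1, width=0.2):
    """Force along the collective variable, -dU_b/dG."""
    return sum(height * (g - c) / width ** 2
               * math.exp(-(g - c) ** 2 / (2.0 * width ** 2))
               for c in centers)


# Hypothetical visited N-N distances (in Å) near the equilibrium bond length.
centers = [1.10, 1.12, 1.15]

force_stretched = bias_force(1.30, centers)    # pushes toward larger G
force_compressed = bias_force(1.00, centers)   # pushes toward smaller G
print(bias_potential(1.12, centers), force_stretched, force_compressed)
```

In a real simulation the force on each atom follows from the chain rule, −(dU_b/dG)(∂G/∂r), and new Gaussians are added as the trajectory proceeds, so the system is progressively driven away from configurations it has already visited.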

52 of 86

General collective variables for sampling training set

Descriptor function

E_atom,1

E_atom,2

E_tot

Descriptor function

Atom 1

Atom 2

Total energy

(DFT)

NN of element 1

NN of element 2

Using descriptor function itself as a collective variable would allow general sampling!

D. Yoo, S. Han et al. npj Comput. Mater. (2021); https://github.com/MDIL-SNU/G-metaD

Results

Metadynamics trajectory

Amorphous

Clusters

53 of 86

Atomic energy mapping

Every known MLP follows this structure.

Q: Can we establish a partitioning method for atomic energies that is applicable across all chemical environments?

A: Yes. A mathematical proof is given in this paper:

Q: Then, is the method for partitioning atomic energies unique?

A: No. There can be multiple ways to define atomic energies for the same training set.

Within the typical error range (~10 meV/atom), the atomic energies may differ by a few eV/atom across models, even when training is successful for each model.

Example: E(SiC) = -10 eV

Potential 1) E(Si) = -6 eV, E(C) = -4 eV

Potential 2) E(Si) = -3 eV, E(C) = -7 eV
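The arithmetic of this example can be written out directly. The sketch below treats the atomic energies as constants per element, which is a strong simplification made only for illustration (in a real MLP they depend on the local environment), and the Si₂C composition is a hypothetical test case.

```python
potential_1 = {"Si": -6.0, "C": -4.0}   # atomic energies in eV
potential_2 = {"Si": -3.0, "C": -7.0}


def total_energy(atomic_energies, composition):
    """E_total = sum of atomic energies over all atoms in the structure."""
    return sum(atomic_energies[el] * n for el, n in composition.items())


sic = {"Si": 1, "C": 1}     # 1:1 composition seen in training
si2c = {"Si": 2, "C": 1}    # hypothetical composition not seen in training

e_sic = (total_energy(potential_1, sic), total_energy(potential_2, sic))
e_si2c = (total_energy(potential_1, si2c), total_energy(potential_2, si2c))
print(e_sic, e_si2c)
```

Both potentials reproduce E(SiC) = -10 eV, yet they disagree by 3 eV on the other composition, so agreement on total energies in the training set does not pin down the atomic energies.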

54 of 86

Ad hoc mapping

Model 1: 100 K MD trajectory

Model 2: 1000 K MD trajectory

Model 1 → ad hoc mapping

Model 2

While the total energies remain consistent, the atomic energies differ between the two models.

55 of 86

Other examples for ad hoc mapping

Case 1: insufficient training epochs

Case 2: insufficient composition sampling

RMSEs for total energies and forces remain consistent after 100 epochs, but the RMSE for atomic energies converges only after 600 epochs.

Trained on 1:1 composition

Trained on diverse composition

While the total energies in a 1:1 composition are identical, errors exist in the atomic energies.

→ Fails in other compositions

Unphysical

MD trajectory

56 of 86

Introduction to the concept of machine-learning potentials

Molecular dynamics & DFT
Basic concept of MLP
Model 1: Descriptor-based models
Model 2: E(3)-equivariant graph models
Training set generation
Universal pretrained MLPs

57 of 86

Universal interatomic potential

Conventional approach: MLPs for individual systems

Recent approach: universal MLP

Training set

Simulation

Kang et al. ACS Mater. Au (2022)

Kang* et al. ACS Catal. (2023)

Kang* et al. Nano Lett. (2024)

Kang et al. PRB (2020)

Kang et al. npj Comput. Mater. (2022)

Kang* et al. JACS (2023)

Training set

(big data):

Simulation

(universal):

Batatia, Benner, Chiang, Elena, Kovács, Riebesell, Csányi* et al. arXiv:2401.00096 (2023)

Universal

model

58 of 86

Extrapolation behavior of universal MLIP

Training set

Materials Project DB

SevenNet-0 & MACE-MP-0 results

Water & ice

Disordered structure

Organic liquid

Etching simulation

BCC

FCC

Crystalline material: an ordered solid composed of atoms arranged in a periodic lattice.

Example:

Materials Project is a computational database containing 200,000 inorganic crystal structures.

Not inorganic

Not crystal

arXiv:2401.00096 (2023), JCTC (2024), arXiv:2501.05211 (2025)

59 of 86

Benchmark test of foundation models

Matbench Discovery benchmark test

Energy error: evaluated on compositions not listed in the Materials Project, generated through element substitution
Thermal conductivity error

META (Facebook)

SNU (Prof. Seungwu Han)

Cambridge

Microsoft

DP technology

(China)

Google DeepMind

Orbital Materials (start-up)

Ruhr-Universität Bochum

60 of 86

Multi-fidelity learning

Purpose: train on mutually inconsistent datasets simultaneously (for instance, PBE and SCAN data)

Add a fidelity-dependent term to the input of the ML model.

For instance, PBE = (1,0), SCAN = (0,1)
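A minimal sketch of the one-hot fidelity label follows. It simply appends the label to a feature vector; where and how the fidelity-dependent term actually enters a given model (for example SevenNet) is not specified here, so the concatenation, the feature values, and the function names are assumptions for illustration.

```python
FIDELITIES = ("PBE", "SCAN")


def fidelity_one_hot(label):
    """One-hot vector marking which functional produced the data point."""
    return [1.0 if label == name else 0.0 for name in FIDELITIES]


def model_input(features, label):
    """Append the fidelity label to the (hypothetical) structural features."""
    return list(features) + fidelity_one_hot(label)


features = [0.3, 1.2]   # hypothetical structural features
x_pbe = model_input(features, "PBE")
x_scan = model_input(features, "SCAN")
print(x_pbe, x_scan)
```

The same structure thus maps to two different inputs, which lets one set of shared weights fit both datasets without forcing their energies to agree.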

J. Am. Chem. Soc. 2025, 147, 1042

Multi-fidelity learning is the key factor behind SevenNet's high performance in Matbench Discovery.

61 of 86

Cf) Problems of direct force predictions

Non-conservative force model: forces predicted directly; fast (3~4 times)

Conservative force model: F = −∂E_tot/∂r; accurate

arXiv:2405.04967

Example problem of non-conservative force models

NVE simulation

Non-conservative force models can be unstable during MD simulations.

62 of 86

Practical use of MLPs

How to choose model and code for your problem?
How to sample training set?
How to set hyperparameters?
How to know whether the simulation is going wrong or not?

63 of 86

Descriptor model vs graph model

R_ij

R_ik

θ_ijk

R_c

E_atom,1

E_atom,2

E_atom,N

E_tot

Total energy

(DFT)

Descriptor models: BP-NNP, DeePMD-kit, GAP, …
Fast, and can run on CPUs
Higher errors (0.1 ~ 0.5 eV/Å)
Practical for fewer than five elements (below quinary compositions)
Poor data efficiency

Graph models: NequIP, MACE, …
Node features: scalars (l=0), vectors (l=1), tensors (l=2)
Slow, and may not be practical on CPUs
Lower errors
Up to 100-element compositions
Good data efficiency

64 of 86

Descriptor models

Neural network models

Pros: handles large training sets (shorter inference time)
Cons: long training time

Behler-Parrinello NNP
Complicated descriptor (symmetry functions of R_ij, R_ik, θ_ijk within R_c)
Shallow NN model
Generally reliable
Low GPU acceleration
Code: SIMPLE-NN, …

DeePMD-kit
Simple descriptor
Deep NN model
Requires careful attention
High GPU acceleration
Code: DeePMD-kit

Gaussian process model

GAP
Kernels k(1,NP), k(2,NP), k(3,NP), …, k(N,NP) between training points 1 to N and a new point (NP)
Pros: short training time + uncertainty
Cons: inference time ~ training set size
Code: QUIP, VASP

All models: E_atom,1 + E_atom,2 + … + E_atom,N = E_tot, fitted to the DFT total energy

65 of 86

Speed and accuracy: MTP vs GAP vs BP-NNP

J. Phys. Chem. A 2020, 124, 731−745

Computational cost: MTP ≳ (BP) NNP > GAP
Note that an energy error of about 10 meV/atom is sufficient.
Using SIMPLE-NN, NNP ~ MTP (much more accelerated by other NNP codes)

66 of 86

SIMPLE-NN code

Optimized GPU usage

Optimized CPU usage

Kyuhyun Lee, Thesis (2019)

https://simple-nn-v2.readthedocs.io/

Various features

PCA whitening, and scaling scheme for efficient training
Uncertainty
GDF weight

67 of 86

Accuracy of E(3)-equivariant models

G. Kim, B. Na, Y. Kim, et al. NeurIPS (2023)

In the above paper, a comprehensive benchmark of MLPs is conducted, including tests on their extrapolation capability.
The Python framework for these models, along with the data, is available on GitHub.

Blue: in distribution (ID)

Red: out-of-distribution (OOD)

OOD: melt-quench trajectory & random structures

68 of 86

Speed vs accuracy of equivariant models

arXiv:2505.02503

Equivariant, but not message-passing

Equivariant graph (much longer cutoff)

69 of 86

NequIP vs MACE

Model

Parallelization performance

SevenNet-0 (NequIP base) vs MACE-MP-0

Kang, J. Chem. Phys. 161, 244102 (2024)

Park, Han et al. J. Chem. Theory Comput. 20, 4857 (2024)

Larger number of layers: might be advantageous when modeling electrostatic interactions
At least 3 layers
Larger inference & training time

Smaller number of layers
Typically two layers
Smaller inference & training time

SevenNet demonstrates better parallelization performance compared to MACE.
The D3 dispersion correction was recently added to the SevenNet code.

70 of 86

Practical use of MLPs

How to choose model and code for your problem?
How to sample training set?
How to set hyperparameters?
How to know whether the simulation is going wrong or not?

71 of 86

Constructing training set

Even with active learning, constructing a well-structured primary dataset is essential for efficiency.
Baseline: Pristine structures, typically obtained from MD simulations.
Reaction-specific: Structures of interest, requiring MD, NEB simulations, or advanced sampling methods.
General-purpose: Prevents simulations from diverging into untrained regions. High-temperature MD and advanced sampling methods (e.g., metadynamics) are used. However, excessive data inclusion may increase training errors.

72 of 86

Issue with Gaussian process models

…

Training point 1:

Training point 2:

Training point 3:

Training point N:

k(1,NP)

k(2,NP)

k(3,NP):

k(N,NP):

Inference time ~ number of training points
Dataset sparsification is crucial for Gaussian process-based models.
Example: CUR decomposition

n datapoints → k datapoints

73 of 86

Considering atomic energy mapping

Trained on 1:1 composition

Trained on diverse composition

Composition

Volume

Temperature

Unphysical

MD trajectory

Training set simulations should also be conducted under non-target conditions to enhance robustness.

74 of 86

Practical use of MLPs

How to choose model and code for your problem?
How to sample training set?
How to set hyperparameters?
How to know whether the simulation is going wrong or not?

75 of 86

Machine learning 101

Andrew Ng

76 of 86

The most important hyperparameter in machine learning

"If you can adjust only one hyperparameter, tune the learning rate."

Carefully observe the difference between validation and training errors.
Monitor energy, force, and stress errors separately—do not rely solely on the total loss.

Bible:

Small learning rate (lr)

Big lr

Right lr

Epoch

Loss

Learning rate adjustment!

Do it manually, or use scheduler
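Two common ways to automate the adjustment are a fixed decay schedule and a reduce-on-plateau rule. The sketch below implements both in plain Python so the logic is visible; the decay rate, patience, reduction factor, and validation losses are hypothetical, and deep learning frameworks ship their own scheduler classes that would normally be used instead.

```python
def exponential_decay(initial_lr, decay_rate, epoch):
    """Learning rate after a given number of epochs of exponential decay."""
    return initial_lr * decay_rate ** epoch


class ReduceOnPlateau:
    """Lower the learning rate when the validation loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=3, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss):
        """Update the learning rate from the latest validation loss."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr


scheduler = ReduceOnPlateau(lr=1e-3)
val_losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.7]   # hypothetical validation losses
lr_history = [scheduler.step(loss) for loss in val_losses]
print(lr_history, exponential_decay(1e-3, 0.9, 2))
```

Here the loss stalls for three epochs after the second one, so the learning rate is halved at the fifth epoch.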

77 of 86

Important hyperparameters

Read prior studies. Refer to commonly used hyperparameters.

Example: SIMPLE-NN input file

# Network
nodes: '30-30'                # usually 30-30 ~ 60-60
acti_func: 'sigmoid'
double_precision: True
weight_initializer:
    type: 'xavier normal'
dropout: 0.0
use_scale: True
use_pca: True
use_atomic_weights: False
weight_modifier:
    type: null

# Optimization
optimizer:
    method: 'Adam'
batch_size: 8
full_batch: False
total_epoch: 1000
learning_rate: 0.0001         # usually 10^-3 ~ 10^-5
decay_rate: null
l2_regularization: 1.0e-6

# Loss function
energy_coeff: 1.
force_coeff: 0.1
stress_coeff: 1.0e-6

Usually, the stress coefficient is the smallest, and energy loss / force loss = 10 ~ 0.1.

78 of 86

Sampling bias in materials science problem

In-configuration bias

Out-of-configuration bias

Defect atom : Bulk atom = 1:100

Example: SIMPLE-NN structure_list file

[ structure_type_1 : 1.0 ]
/location/of/calculation/data/oneshot_output_file :
/location/of/calculation/data/MDtrajectory_output_file 100:2000:20

[ structure_type_2 : 3.0 ]
/location/of/calculation/data/same_folder_format{1..10}/oneshot_output_file :

The number after each structure type (1.0, 3.0) is the training weight.

Relaxation and NEB trajectories consist of small simulation steps, whereas MD simulations involve larger steps, potentially biasing the training.
Proper weight assignment is crucial.

79 of 86

Addressing in-configuration bias: gaussian density function (GDF) weight

Descriptor values of training set

Density of data points

Jeong, Han et al. J. Phys. Chem. C 122, 22790 (2018)
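The idea behind the GDF weight can be sketched as follows: estimate the density of training points around each descriptor vector with a sum of Gaussians, then weight each point by the inverse density so that rare environments (for example defect atoms) are not drowned out by abundant ones (bulk atoms). The density expression, the Gaussian width, the plain inverse weighting, and the toy descriptors below are simplifying assumptions; the modifier function used in the paper and in SIMPLE-NN may differ.

```python
import math


def gaussian_density(g, points, sigma):
    """Gaussian density estimate at descriptor vector g."""
    dim = len(g)
    total = 0.0
    for p in points:
        sq = sum((a - b) ** 2 for a, b in zip(g, p))
        total += math.exp(-sq / (2.0 * sigma ** 2 * dim))
    return total / len(points)


def gdf_weights(points, sigma=0.2):
    """Inverse-density weights, normalized to an average of 1."""
    inv = [1.0 / gaussian_density(p, points, sigma) for p in points]
    mean_inv = sum(inv) / len(inv)
    return [w / mean_inv for w in inv]


# Toy 2D descriptors: ten nearly identical bulk-like atoms, one defect-like atom.
bulk_like = [(0.01 * i, 0.0) for i in range(10)]
defect_like = [(1.0, 1.0)]
weights = gdf_weights(bulk_like + defect_like)
print(weights[0], weights[-1])
```

The isolated defect-like point receives a much larger weight than any bulk-like point, which counteracts the in-configuration bias.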

80 of 86

Practical use of MLPs

How to choose model and code for your problem?
How to sample training set?
How to set hyperparameters?
How to know whether the simulation is going wrong or not?

81 of 86

Rule no. 1

In machine learning, data is the most important factor.

Training: If training does not converge properly, first verify the quality of the training set. Ensure there are no errors in the DFT calculations.

Simulation: If a simulation collapses, check whether the collapsed structures were included in the training set.

82 of 86

The most important thing: do it and test!

Carefully construct a test set that accurately represents the properties of your target simulation.

Example: nanoparticle

Training set

Test set

Simulation

Test

Refine

What configurations should be included in the test set?

Targeted events: NEB trajectory, defect formation energy, etc.
Fundamental PES properties: Equation of state (EOS), phonon calculations, etc.
Extrapolatability in scale: Larger system sizes not included in the training set, but still feasible for DFT calculations (avoiding excessively large structures).

Kang et al. ACS Mater. Au, 2, 103 (2022)

83 of 86

How to know whether a given structure is covered by the training set

PCA or t-SNE analysis

PRB 102, 224104 (2020)

Uncertainty

ACS Catal. 13, 16078 (2023)

84 of 86

Example: J. Phys. Chem. Lett. 2020, 11, 6090 (1)

Target simulation

Training set

Crystal

Liquid

Simulation

New phase is found at the interface!

85 of 86

Example: J. Phys. Chem. Lett. 2020, 11, 6090 (2)

Uncertainty

The newly discovered structure is absent from the training set.
Its phase is energetically unstable.
The energy is inaccurately described due to its omission from the training set.

Simulation with the re-trained MLP

No high-uncertainty configuration

86 of 86

Practice with codes: SIMPLE-NN tutorial

1. Set up the Python environment

2. Install SIMPLE-NN

3. SIMPLE-NN tutorial

3.1. Preprocess

3.2. Training

3.3. Preprocess + training at once (will not be run in this session)

3.4. Continue training with different hyperparameters

3.5. Continual learning (adding new data to the training set and continuing to train the existing potential)

4. PCA analysis