Machine Learning for QCD Studies
1
Vinicius M. Mikuni
vmikuni@lbl.gov
vinicius-mikuni
What I’m not talking about
2
Flavour Tagging: see Reinhard’s talk
What I’m not talking about: ML@QCD@LHC25
3
CERN
4
https://www.calcmaps.com/map-radius/
The Large Hadron Collider (LHC) is an accelerator facility with a radius of about 3 miles that accelerates particles to near the speed of light
Dinner Location
The Challenge
5
More Likely to Happen
The Challenge
6
1 in 10 billion collisions at the LHC produces a Higgs boson
Comparison:
Source: https://stacker.com/art-culture/odds-50-random-events-happening-you
The Challenge
7
Proton bunches cross every 25 ns, resulting in hundreds of millions of collisions per second
The Challenge
8
Source: CMS-NOTE-2022-008
Future
Present
Future upgrades of the LHC aim to increase the collision rate, which will push computing needs beyond the current budget
Generative models
Generative models are a class of algorithms trained to transform easy-to-sample noise into data
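A minimal sketch of the idea, not any model shown in this talk: the simplest possible "generative model" is an affine map x = mu + sigma * z applied to Gaussian noise z, fitted to data by maximum likelihood. The toy data set and the function names below are assumptions made for illustration only.

```python
import random
import statistics

def fit_affine_generator(data):
    """Fit x = mu + sigma * z to data; for a Gaussian the maximum-likelihood
    solution is the sample mean and the (population) standard deviation."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data, mu)
    return mu, sigma

def sample(mu, sigma, n, rng):
    # Draw easy-to-sample noise z ~ N(0, 1) and push it through the learned map.
    return [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]

rng = random.Random(0)
# Hypothetical "data": a toy observable with mean 5 and width 2.
data = [rng.gauss(5.0, 2.0) for _ in range(20000)]
mu, sigma = fit_affine_generator(data)
generated = sample(mu, sigma, 20000, rng)
print(f"learned mu={mu:.2f}, sigma={sigma:.2f}")
```

Realistic generative models replace the affine map with a neural network, but the structure (sample noise, then transform it) is the same.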
9
Diffusion Generative Models
10
See also:
1: E. Dreyer, E. Gross, D. Kobylianskii, V. Mikuni, B. Nachman: e-Print: 2503.19981
2: M. Omana Kuttan, K. Zhou, J. Steinheimer, H. Stöcker: e-Print: 2502.16330
3: Erik B., C. Ewen, D. A. Faroughy, et al.: e-Print: 2310.00049
4: A. Butter, N. Huetsch, S. P. Schweitzer, T. Plehn, P. Sorrenson et al.: SciPost Phys.Core 8 (2025), 026
5: V. Mikuni, B. Nachman, M. Pettee: Phys.Rev.D 108 (2023) 3, 036025
6: M. Leigh, D. Sengupta, G. Quétant, J. A. Raine, K. Zoch et al.: SciPost Phys. 16 (2024) 1, 018
Diffusion Generative Models
11
“Scientists at QCD@LHC working on Science and Machine Learning”
EIC Events
12
We generate SM events for the EIC using Pythia8
EIC Events
13
Encode the electron separately from hadrons:
p(e,h) = p(h|e)p(e)
*V. Mikuni, B. Nachman, Phys. Rev. D 111, L051504
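The factorization p(e,h) = p(h|e)p(e) corresponds to ancestral sampling: draw the electron first, then draw the hadrons conditioned on it. The sketch below uses toy stand-in samplers (a uniform electron energy, a random energy split for the hadrons, and a hypothetical fixed total energy of 50 GeV); in the actual work each step would be a trained generative model.

```python
import random

def sample_electron(rng):
    # Step 1: draw the electron from p(e). Toy stand-in: uniform energy in GeV.
    return rng.uniform(5.0, 25.0)

def sample_hadrons(electron_energy, rng, total_energy=50.0):
    # Step 2: draw the hadrons from p(h | e). Toy stand-in: split the energy
    # left over by the electron among a random number of hadrons.
    remaining = total_energy - electron_energy
    n_hadrons = rng.randint(2, 6)
    cuts = sorted(rng.random() for _ in range(n_hadrons - 1))
    edges = [0.0] + cuts + [1.0]
    return [remaining * (b - a) for a, b in zip(edges, edges[1:])]

def sample_event(rng):
    electron = sample_electron(rng)
    hadrons = sample_hadrons(electron, rng)
    return electron, hadrons

rng = random.Random(42)
events = [sample_event(rng) for _ in range(1000)]
```

Because the hadrons are generated after the electron, any constraint linking the two (here, energy conservation) can be imposed through the conditioning.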
EIC Events
14
2-step generation
Generate multiple particle species from Pythia
Araz, J. Y., Mikuni, V., Ringer, F., Sato, N., Acosta, F. T., Whitehill, R., Phys.Lett.B 868 (2025) 139694
EIC Events
16
Generate multiple particle species from Pythia
Ratio between Pythia and Diffusion model
Araz, J. Y., Mikuni, V., Ringer, F., Sato, N., Acosta, F. T., Whitehill, R., Phys.Lett.B 868 (2025) 139694
2-step generation
Future
17
Theory parameters
𝛳
Physics Prediction zp
zp~p(zp|𝛳)
Generative Models are also differentiable by design:
Given observed data zd, maximize the likelihood L(zd|𝛳) with respect to 𝛳
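A minimal sketch of likelihood maximization by gradient ascent, assuming a toy model in which p(z|𝛳) is a unit-width Gaussian centred at 𝛳. The gradient is written by hand here; with a differentiable generative model it would come from automatic differentiation instead. All names are illustrative.

```python
import random

def log_likelihood_grad(theta, observed):
    # d/dtheta of sum_i log N(z_i | theta, 1) = sum_i (z_i - theta)
    return sum(z - theta for z in observed)

def fit_theta(observed, theta0=0.0, learning_rate=0.5, steps=200):
    """Gradient ascent on the log-likelihood, averaged over events."""
    theta = theta0
    n = len(observed)
    for _ in range(steps):
        theta += learning_rate * log_likelihood_grad(theta, observed) / n
    return theta

rng = random.Random(7)
# Hypothetical observed data generated with a "true" parameter of 1.3.
observed = [rng.gauss(1.3, 1.0) for _ in range(5000)]
theta_hat = fit_theta(observed)
print(f"fitted theta = {theta_hat:.3f}")
```

For this toy model the maximum-likelihood estimate is the sample mean, which gives a simple check of the optimizer.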
Event Unfolding
18
Unfolding
19
What we measure
What we want
Unfolding
20
A. Badea, A. Baty, H. Bossi, et al. arXiv:2507.14349
Unfolding
21
How to define the optimal binning?
Unfolding
22
How to include multiple distributions?
How to define the optimal binning?
Unfolding
23
How to unfold distributions that are not defined for each event?
How to include multiple distributions?
How to define the optimal binning?
ML Based Unfolding
24
2-step iterative process
Source: Andreassen et al. PRL 124, 182001 (2020)
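A minimal sketch of the two-step iteration on a 1D toy. Important assumption: the likelihood ratios are estimated with histograms rather than with neural-network classifiers, so this is a binned stand-in for the unbinned method; the toy distributions and function names are hypothetical.

```python
import random

def bin_of(x, lo, hi, nbins):
    k = int((x - lo) / (hi - lo) * nbins)
    return min(nbins - 1, max(0, k))

def density_ratio(num_x, num_w, den_x, den_w, eval_x, lo=-5.0, hi=5.0, nbins=40):
    """Histogram estimate of p_num(x) / p_den(x), evaluated at eval_x."""
    num = [0.0] * nbins
    den = [0.0] * nbins
    for x, w in zip(num_x, num_w):
        num[bin_of(x, lo, hi, nbins)] += w
    for x, w in zip(den_x, den_w):
        den[bin_of(x, lo, hi, nbins)] += w
    num_total, den_total = sum(num), sum(den)
    ratios = []
    for x in eval_x:
        k = bin_of(x, lo, hi, nbins)
        ratios.append((num[k] / num_total) / (den[k] / den_total) if den[k] > 0 else 1.0)
    return ratios

def omnifold(gen, reco, data, iterations=5):
    nu = [1.0] * len(gen)          # generator-level weights
    ones = [1.0] * len(data)
    for _ in range(iterations):
        # Step 1: reweight the simulation to the data at detector level.
        r = density_ratio(data, ones, reco, nu, reco)
        omega = [n * x for n, x in zip(nu, r)]
        # Step 2: convert those weights into a function of generator-level quantities.
        r = density_ratio(gen, omega, gen, nu, gen)
        nu = [n * x for n, x in zip(nu, r)]
    return nu

rng = random.Random(1)
n_events = 20000
gen = [rng.gauss(0.0, 1.0) for _ in range(n_events)]       # simulation, particle level
reco = [t + rng.gauss(0.0, 0.5) for t in gen]              # simulation, detector level
truth = [rng.gauss(0.5, 1.0) for _ in range(n_events)]     # "nature" (never used in the fit)
data = [t + rng.gauss(0.0, 0.5) for t in truth]            # what we measure
weights = omnifold(gen, reco, data)
unfolded_mean = sum(w * t for w, t in zip(weights, gen)) / sum(weights)
print(f"unfolded mean = {unfolded_mean:.2f}")
```

The output is a set of per-event weights for the simulated particle-level events, so any observable can be computed from the reweighted sample after the fact.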
ML Based Unfolding
25
See also
1: M. Backes, A. Butter, M. Dunford, B. Malaescu: SciPost Phys.Core 7 (2024) 1, 007
2: A. Shmakov, K. T. Greif, M. J. Fenton, A. Ghosh, P. Baldi et al.: SciPost Phys. 18 (2025) 4, 117
3: N. Huetsch, J. M. Villadamigo, A. Shmakov, S. Diefenbacher, V. Mikuni et al.: SciPost Phys. 18 (2025) 2, 070
4: A. Butter, S. Diefenbacher, N. Huetsch, V. Mikuni, B. Nachman et al.: SciPost Phys. 18 (2025) 6, 200
5: M. Bellagente, A. Butter, G. Kasieczka, T. Plehn, A. Rousselot et al.: SciPost Phys. 9 (2020), 074
6: C. Pazos, S. Aeron, P.-H. Beauchemin, V. Croft, Z. Huan et al.: e-Print: 2406.01507
7: S. Diefenbacher, G.-H. Liu, V. Mikuni, B. Nachman, W. Nie: Phys.Rev.D 109 (2024) 7, 076011
Source: Andreassen et al. PRL 124, 182001 (2020)
The H1 Detector
26
One of the two multipurpose detectors at the HERA accelerator facility
ML Based Unfolding
27
3 papers on ML-based unfolding using H1 data
Azimuthal Asymmetries
28
Study of correlations between the scattered lepton and jet
Phys.Rev.Lett. 128 (2022) 13, 132002
Azimuthal Asymmetries
Final state lepton and jet are mostly back-to-back
29
Measure: cos(ɸ), cos(2ɸ), cos(3ɸ)
Require q⟂ / P⟂ > 0.3
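A minimal sketch of how the harmonics could be computed from an unfolded sample, assuming each event carries an angle ɸ and a weight (for example from an unbinned unfolding). The function name and inputs are illustrative, not the analysis code.

```python
import math

def cos_moments(phis, weights, harmonics=(1, 2, 3)):
    """Weighted averages <cos(n * phi)> for the requested harmonics."""
    total = sum(weights)
    return {
        n: sum(w * math.cos(n * phi) for phi, w in zip(phis, weights)) / total
        for n in harmonics
    }

# Hypothetical two-event example: phi = 0 and phi = pi with equal weights.
moments = cos_moments([0.0, math.pi], [1.0, 1.0])
print(moments)
```

In this example the odd harmonics cancel between the two events, while the even harmonic does not.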
Azimuthal Asymmetries
Reuse previous results from Phys. Rev. Lett. 128, 132002
30
Results
31
Dedicated DIS generators do a good job everywhere, especially Rapgap
Pythia predictions not tuned to this data
GBW includes gluon saturation effects, while CT18A uses NLO TMD calculations with collinear PDFs; both are currently available only at low q⟂
arXiv:2412.14092, submitted to PLB
What if we unfold everything?
32
Experimental setup
Using 228 pb⁻¹ of data collected by the H1 experiment during 2006 and 2007 at a center-of-mass energy of 318 GeV
33
Hadrons are reconstructed from combined detector information using an energy flow algorithm
27.5 GeV e± (k)
920 GeV p (P)
Q² = −q²
y = (P·q) / (P·k)
P: incoming proton 4-vector
k: incoming electron 4-vector
q = k − k′: 4-momentum transfer
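A minimal sketch of the kinematic definitions above, reading the inelasticity as y = (P·q)/(P·k) with four-vector products. Assumptions: massless beams along the z axis with the proton along +z, the (+,−,−,−) metric, and a hypothetical scattered electron k′ chosen only to give round numbers.

```python
import math

def minkowski_dot(a, b):
    # Four-vectors are (E, px, py, pz) with metric (+, -, -, -).
    return a[0] * b[0] - a[1] * b[1] - a[2] * b[2] - a[3] * b[3]

def dis_kinematics(P, k, k_prime):
    """Return (Q2, y) from the proton, incoming and scattered electron."""
    q = tuple(a - b for a, b in zip(k, k_prime))   # q = k - k'
    Q2 = -minkowski_dot(q, q)
    y = minkowski_dot(P, q) / minkowski_dot(P, k)
    return Q2, y

P = (920.0, 0.0, 0.0, 920.0)          # 920 GeV proton along +z
k = (27.5, 0.0, 0.0, -27.5)           # 27.5 GeV electron along -z
k_prime = (20.0, 12.0, 0.0, -16.0)    # hypothetical scattered electron (massless)

Q2, y = dis_kinematics(P, k, k_prime)
sqrt_s = math.sqrt(2.0 * minkowski_dot(P, k))   # massless approximation
print(f"Q2 = {Q2:.1f} GeV^2, y = {y:.3f}, sqrt(s) = {sqrt_s:.1f} GeV")
```

With these beam energies the massless approximation gives a center-of-mass energy of about 318 GeV, consistent with the value quoted on the slide.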
Goal: Include the information of all reconstructed particles + scattered lepton in the collision
OmniLearn
34
We use the OmniLearn model to train the classifiers for the unfolding task:
More details at: V. Mikuni, B. Nachman, Phys. Rev. D 111, L051504
Results
Cluster unfolded jets using kT algorithm with radius of 1.0
We are able to re-derive past results
35
Phys. Rev. Lett. 128, 132002
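A minimal sketch of inclusive kT clustering with radius 1.0, assuming the standard longitudinally invariant distances d_ij = min(pT_i², pT_j²) ΔR²/R² and d_iB = pT_i² with four-vector (E-scheme) recombination. This is an illustrative O(N³) toy, not the analysis code; production analyses typically rely on a dedicated library such as FastJet. The three input particles are hypothetical.

```python
import math

def pt2(p):
    # p = (E, px, py, pz)
    return p[1] ** 2 + p[2] ** 2

def rapidity(p):
    return 0.5 * math.log((p[0] + p[3]) / (p[0] - p[3]))

def delta_r2(a, b):
    dphi = abs(math.atan2(a[2], a[1]) - math.atan2(b[2], b[1]))
    if dphi > math.pi:
        dphi = 2.0 * math.pi - dphi
    dy = rapidity(a) - rapidity(b)
    return dy * dy + dphi * dphi

def kt_cluster(particles, radius=1.0):
    pseudojets = [tuple(p) for p in particles]
    jets = []
    while pseudojets:
        # Smallest beam distance d_iB = pt_i^2.
        best = min(range(len(pseudojets)), key=lambda i: pt2(pseudojets[i]))
        d_min, pair = pt2(pseudojets[best]), (best, None)
        # Smallest pairwise distance d_ij.
        for i in range(len(pseudojets)):
            for j in range(i + 1, len(pseudojets)):
                a, b = pseudojets[i], pseudojets[j]
                d = min(pt2(a), pt2(b)) * delta_r2(a, b) / radius ** 2
                if d < d_min:
                    d_min, pair = d, (i, j)
        i, j = pair
        if j is None:
            jets.append(pseudojets.pop(i))      # promote to a final jet
        else:
            merged = tuple(x + y for x, y in zip(pseudojets[i], pseudojets[j]))
            pseudojets.pop(j)                   # j > i, so remove j first
            pseudojets.pop(i)
            pseudojets.append(merged)
    return sorted(jets, key=pt2, reverse=True)

def massless(pt, y, phi):
    return (pt * math.cosh(y), pt * math.cos(phi), pt * math.sin(phi), pt * math.sinh(y))

particles = [massless(10.0, 0.0, 0.0), massless(5.0, 0.1, 0.1), massless(8.0, 0.0, 3.0)]
jets = kt_cluster(particles)
print([round(math.sqrt(pt2(j)), 2) for j in jets])
```

The two nearby particles merge into one jet, while the particle on the opposite side in azimuth becomes its own jet.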
Results
Cluster unfolded jets using kT algorithm with radius of 1.0
We are able to re-derive past results
36
Phys.Lett.B 844 (2023) 138101
Results
The Breit frame is a natural frame for studying ep collisions: the struck quark forms a jet opposite to the proton beam, which is useful for jet and TMD studies
37
Results
Cluster jets using kT algorithm with radius of 1.0
We can study observables in different frames!
38
Lab Frame
Breit Frame
Results
Unfold observables that are hard to unfold without machine learning: energy-energy correlators
39
Sensitive to transverse momentum dependent parton distribution functions and fragmentation functions
Eq. from Phys.Rev.D 103 (2021) 9, 094005
OmniLearn
Combine tasks: Train one model to classify and generate particles
40
More details at: V. Mikuni, B. Nachman, Phys. Rev. D 111, L051504
Strategy
41
Model starts with random weights
Ask the model to classify and generate particle collisions
Fine-tune the model on new datasets and tasks
Jet Tagging
Unfolding
Anomaly Detection
Jet Tagging
42
Source: ATL-PHYS-SLIDE-2023-048
Pushing classification performance requires lots of simulated data!
30M Jets
192M Jets
FastSim to FullSim
43
OmniLearn is trained on fast simulations, then fine-tuned on the ATLAS Top Tagging Open Data Set
Full simulation + Reconstruction
Improving Unfolding
44
Improved precision for unfolding
Conclusions
45
THANKS!
Any questions?
46
Backup
47
EIC Events
48
We also compare with previous diffusion model based on images
P. Devlin, J.-W. Qiu, F. Ringer, N. Sato, Phys. Rev. D 110 (2024) 1, 016030
Araz, J. Y., Mikuni, V., Ringer, F., Sato, N., Acosta, F. T., Whitehill, R., Phys.Lett.B 868 (2025) 139694
Systematic uncertainties
49
Unfolding uncertainties
MC Generators
Lund string hadronization model and CTEQ6L PDF set
Pythia 8.3: default NNPDF3.1 PDF
Herwig 7.2: Cluster hadronization and CT14 PDF set
50
Phi dependence
51
Experimental setup
Fiducial Phase space definition:
Particle selection:
Reco Phase space definition:
Particle selection:
52
Q² > 100 GeV²
Closure test
53
Pretraining
54
We would like to unfold up to 130 × 3 = 390 features simultaneously, which requires a lot of data
Generative models
55
Model | Training Stability | Scalability | Fast inference | Fidelity | Expressivity |
Diffusion | Yes | Yes | No | Yes | Yes |
GAN | No | Yes | Yes | Maybe | Yes |
VAE | Maybe | Yes | Yes | Maybe | Kinda |
NF | Yes | Maybe | Maybe | Yes | Kinda |
Score matching/denoising/diffusion
Denoising diffusion models are state-of-the-art generative models for image generation.
Pros:
Cons:
56
Score-matching
57
Likelihood estimation?
58
SDE
ODE
Introduction
59
Anomaly Detection
60
Bump-hunting using ML:
Anomaly Detection
61
Bump-hunting using ML:
Anomaly Detection
62
OmniLearn needs only one quarter of the data to identify anomalies!