
OmniLearn: Facilitating All Jet Physics Tasks


Vinicius M. Mikuni

vmikuni@lbl.gov

vinicius-mikuni


Foundational Models


Option 1: Human language is the communication medium between the user and the machine

  • Possibility of exchanging ideas and interpolating human knowledge
  • Requires strong reasoning capabilities to understand the connection between disciplines
  • Large models required for generality


Option 2: Data is the communication medium between the user and the machine

  • Data is ingested as is, allowing the model to learn complicated correlations
  • Requires domain knowledge: what to represent? How?
  • Smaller models (O(1M) parameters) can be used instead, given domain knowledge


Jets


Jets are the most common signatures at the LHC

  • Complicated signature: O(10-100) objects are clustered in each jet
  • Choice of data: Particle Flow objects associated with jets
  • Choice of data representation: Point Clouds

[Figure: jets appear in measurements, searches, and tagging tasks.]

How to teach AI about jets?

Encoding jet information

Create a neural network model that aims to accomplish two tasks:

  • Classify jets: learns the difference in radiation between jet types
  • Generate jets: implicitly learn the likelihood of jets for different partons


Diffusion 101

Diffusion models are the go-to choice for data generation:

  • Simple training: take data x, perturb it with a Gaussian of mean μ and std σ: x' = μ·x + σ·ε, with ε ~ N(0, 1)
  • Ask the network D to predict the injected noise: L = ||D(x') − ε||² (sketched below)
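A minimal PyTorch sketch of one such training step (the cosine schedule tying μ and σ together is an illustrative assumption, not necessarily the paper's choice):

```python
# Illustrative diffusion training step. The (mu, sigma) schedule is an
# assumption, chosen so that mu^2 + sigma^2 = 1.
import math
import torch

def diffusion_loss(model, x):
    # x: (batch, n_features). Sample a perturbation strength per example.
    t = torch.rand(x.shape[0], 1)
    mu = torch.cos(0.5 * math.pi * t)      # signal scale
    sigma = torch.sin(0.5 * math.pi * t)   # noise scale
    eps = torch.randn_like(x)              # eps ~ N(0, 1)
    x_pert = mu * x + sigma * eps          # x' = mu*x + sigma*eps
    # The network D predicts the injected noise: L = ||D(x') - eps||^2
    return ((model(x_pert, t) - eps) ** 2).mean()
```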


Encoding jet information

Combining the two objectives, classification and generation, gives the training loss:

  • Loss = CE(x) + ||D(x') − ε||² + μ²·CE(x')
  • Classifying the perturbed inputs gives data augmentation for free (see the sketch below)!
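A sketch of the combined objective, assuming a hypothetical model that returns class logits together with the noise prediction (this is not the actual OmniLearn interface):

```python
# Joint classification + diffusion loss, following the formula above.
import math
import torch
import torch.nn.functional as F

def omnilearn_style_loss(model, x, labels):
    # Perturb the inputs as in the diffusion step.
    t = torch.rand(x.shape[0], 1)
    mu = torch.cos(0.5 * math.pi * t)
    sigma = torch.sin(0.5 * math.pi * t)
    eps = torch.randn_like(x)
    x_pert = mu * x + sigma * eps

    logits_clean, _ = model(x, torch.zeros_like(t))   # classify clean jets
    logits_pert, eps_pred = model(x_pert, t)          # classify + denoise

    ce_clean = F.cross_entropy(logits_clean, labels)           # CE(x)
    diffusion = ((eps_pred - eps) ** 2).mean()                 # ||D(x') - eps||^2
    ce_pert = (mu.squeeze(1) ** 2 *                            # mu^2 * CE(x')
               F.cross_entropy(logits_pert, labels, reduction="none")).mean()
    return ce_clean + diffusion + ce_pert
```

Note that the μ² weight makes the classification term count less for heavily perturbed inputs, where the labels are hardest to recover.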


Encoding jet information

Point-Edge Transformer (PET)

  • Combine local information with graphs
  • Learn global information with Transformers: 3M parameters in total (rough sketch below)
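A rough PyTorch sketch of the idea (k-nearest-neighbor edge features feeding a transformer); the real PET architecture has more structure, so treat this purely as a schematic:

```python
# Schematic Point-Edge Transformer: local kNN edge features + global attention.
import torch
import torch.nn as nn

class PETSketch(nn.Module):
    def __init__(self, n_feat=4, dim=64, k=8, n_classes=10):
        super().__init__()
        self.k = k
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * n_feat, dim), nn.GELU(), nn.Linear(dim, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                  # x: (batch, particles, features)
        b, n, f = x.shape
        idx = torch.cdist(x, x).topk(self.k, largest=False).indices   # kNN
        nbrs = torch.gather(
            x.unsqueeze(1).expand(b, n, n, f), 2,
            idx.unsqueeze(-1).expand(b, n, self.k, f))
        # Edge features: concatenate each particle with its neighbors,
        # embed, and average over the neighborhood.
        local = self.edge_mlp(
            torch.cat([x.unsqueeze(2).expand_as(nbrs), nbrs], dim=-1)).mean(2)
        h = self.transformer(local)        # global attention across particles
        return self.head(h.mean(1))        # jet-level class logits
```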


Input Dropout

Not all datasets contain the same information:

  • Let the model learn with and without some features
  • Feature Dropout: With fixed probability, set some of the input features to 0

[Diagram: input features f1, ..., f8; with probability p = 0.1 the first four features are zeroed, giving (0, 0, 0, 0, f5, f6, f7, f8); with p = 0.9 all eight features are kept.]
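A minimal sketch of this feature dropout (the per-jet granularity and the choice of droppable features are illustrative assumptions):

```python
# Zero a fixed subset of input features with probability p.
import torch

def feature_dropout(x, drop_idx=(0, 1, 2, 3), p=0.1):
    # x: (batch, particles, features); drop the same features for a whole jet.
    drop = torch.rand(x.shape[0], 1, 1) < p
    x_dropped = x.clone()
    x_dropped[..., list(drop_idx)] = 0.0
    return torch.where(drop, x_dropped, x)
```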


Comparison Between Models

Language inspired models

  • Data are tokenized
  • Unsupervised and general pre-training
  • Big models often required

OmniLearn

  • Data are continuous
  • HEP has one of the best simulators across all sciences: supervised pre-training
  • Medium-sized models that fit on standard GPUs are still useful


Training

JetClass dataset used for training

  • 100M jets
  • 10 different jet categories; AK8 jets simulated in pp collisions with MadGraph + Pythia8 and the CMS Delphes detector simulation

Use the pre-trained model as the starting point and fine-tune using different datasets


Huilin Qu, Congqiao Li, Sitian Qian, arXiv:2202.03772


Evaluation

2 different jet categories; AK8 jets simulated in pp collisions with MadGraph + Pythia8 and the ATLAS Delphes detector simulation


Better than all non-fine-tuned models and similar to ParT performance

Evaluation datasets: 1


Evaluation

2 different jet categories; AK4 jets simulated in pp collisions with MadGraph + Pythia8 and the CMS Delphes detector simulation


Better than all non-fine-tuned models and similar to ParT performance

Evaluation datasets: 2


Evaluation


Faster training and better convergence


Evaluation

2 different jet categories; AK5 jets simulated in pp collisions with Pythia6, with Geant4 simulation + CMS particle-flow reconstruction

21

Evaluation datasets: 3


Evaluation

2 different jet categories; AK10 jets simulated in ep collisions with Rapgap, with Geant3 simulation + H1 particle-flow reconstruction


Evaluation datasets: 4


Jet Generation


Evaluation datasets: 6

Great generation quality across multiple metrics


Application Highlight


FastSim to FullSim


Evaluation datasets: 7

OmniLearn is trained on cheap Delphes simulations. Can we fine-tune to Run 2 ATLAS Full simulation + Reconstruction?

  • Matches SOTA with 10% of the data
  • Improves on SOTA if all events are used


Unfolding

[Diagram: detector-level data (what we measure) vs. particle-level distributions (what we want).]


OmniFold


Source: Andreassen et al. PRL 124, 182001 (2020)

Two-step iterative process (sketched below):

  • Step 1: Reweight simulations to look like data
  • Step 2: Convert the learned weights into functions of particle-level objects
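Schematically, the iteration looks like the pseudocode below, where reweight stands for a hypothetical helper that trains a classifier between two weighted samples and returns the resulting likelihood-ratio weights:

```python
# Schematic OmniFold loop; `reweight` is a placeholder, not a real API.
import numpy as np

def omnifold(sim_reco, sim_gen, data_reco, n_iter=4):
    nu = np.ones(len(sim_gen))   # particle-level weights: the result
    for _ in range(n_iter):
        # Step 1: reweight the reco-level simulation (weighted by nu)
        # to look like data.
        omega = nu * reweight(sim_reco, weights=nu, target=data_reco)
        # Step 2: express the learned weights as a function of
        # particle-level objects: reweight (sim_gen, nu) to (sim_gen, omega).
        nu = nu * reweight(sim_gen, weights=nu,
                           target=sim_gen, target_weights=omega)
    return nu
```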


ATLAS OmniFold analysis

28

OmniFold dataset consisting of Z(νν) + jets events. Unfold the particles directly and then build the jet observables.

Evaluation datasets: 8


Unfolding


Unbinned unfolding using the OmniFold workflow: more precise than traditional unfolding and more efficient than previous ML models.


Anomaly Detection

30

Evaluation datasets: 9

Bump-hunting using ML:

  • Use the background in the sideband to estimate the background in the signal region
  • Compare the estimated background with the data (toy sketch below)
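As a toy illustration of that logic (the binning, polynomial background model, and signal-region window are all assumptions):

```python
# Toy bump hunt: fit the sidebands, interpolate the background into the
# signal region (SR), and compare with the observed counts.
import numpy as np

def toy_bump_hunt(masses, sr=(3.3, 3.7), bins=50, deg=3):
    counts, edges = np.histogram(masses, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    in_sr = (centers > sr[0]) & (centers < sr[1])
    coeffs = np.polyfit(centers[~in_sr], counts[~in_sr], deg)  # sideband fit
    bkg = np.polyval(coeffs, centers[in_sr]).sum()
    obs = counts[in_sr].sum()
    return (obs - bkg) / np.sqrt(bkg)   # naive Poisson significance
```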

The two ML ingredients:

  • Generative model: estimates the background in the signal region
  • Classifier: compares the estimated background with the data


LHCO dataset


LHCO R&D dataset

  • Resonant dijet final state: A → B(qq) C(qq), with mA, mB, mC = 3.5, 0.5, 0.1 TeV


Anomaly Detection


  • Generate the full dijet system: 2 × 279 × 3 = 1674 numbers to generate (two jets, up to 279 particles each, three features per particle)
  • Classify data from background

SIC = Significance Improvement Curve (TPR/√FPR vs. TPR): "By how much can I improve the significance of a particular signal, given an initial significance?" (see the snippet below)
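For reference, a minimal way to compute a SIC curve from classifier scores, using scikit-learn's ROC utilities:

```python
# SIC: significance improvement TPR / sqrt(FPR), as a function of TPR.
import numpy as np
from sklearn.metrics import roc_curve

def sic_curve(labels, scores):
    fpr, tpr, _ = roc_curve(labels, scores)
    keep = fpr > 0                       # avoid division by zero
    return tpr[keep], tpr[keep] / np.sqrt(fpr[keep])
```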

Previous results were limited by the amount of data in the signal region: only sensitive to the new physics when S/B > 3% (~4σ)

OmniLearn finds the new physics with S/B = 0.7% (~2σ)


Conclusion


  • OmniLearn: learn a general representation of jets
  • Evaluate OmniLearn across 9 different downstream datasets
  • Evaluate the performance on jet tagging, jet generation, unfolding, and anomaly detection
  • OmniLearn improves upon SOTA and/or converges more quickly than models trained from scratch
  • Magnify the statistical power of the data: Not only Big Data benefits from AI
  • Try it out yourself: https://github.com/ViniciusMikuni/OmniLearn/ and check out the paper: arXiv:2404.16091


THANKS!

Any questions?


Backup


ATLAS Loss Curves


OmniLearn for reweighting


OmniLearn for Unfolding


PET

Train one model that learns to classify and generate jets

  • Combine both local and global information using local edges and a transformer: Point-Edge Transformer


Diffusion Generative Models


Loss function

Straightforward loss function:

  • Cross entropy for each class
  • Perturbed data prediction from the diffusion loss
  • Classification over perturbed inputs: data augmentation!
