Symmetries in ML
Use case: Deep Sets for flavor tagging
Nicole Hartman
SLAC ATLAS Group
US ATLAS Machine Learning Training Event
Day 2: Applications of ML, July 28th, 2022
Symmetries in particle physics
Standard Model
SU(3) × SU(2) × U(1) group
Group theory is the language of particle physics
-> 19 free parameters
LHC physics: Search for where these 19 free parameters fail to describe the data
…And how do we do this???
What’s exciting about data analysis now?
Increasingly large datasets!!
AI for science: How to gain the most out of our physics datasets
Rob Schreiber (Cerebras) SLAC Colloquium
Peter Wennik (ASML) CSMBC interview
LCLS: Tb/s
LSST: 20 TB/night
SKA: TB/night
LHC: 90 PB/yr
“We’re changing into a sensing world”
What is a NN?
A function approximator
What’s new: many more ways to do this function approximation, and cleverer ways to build the input features.
θ: trainable parameters, O(billions)
[Diagram: inputs x → NN f_θ(x) → outputs y]
30M jets for FTAG training!
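To make "a function approximator" concrete, here is a minimal PyTorch sketch; the 15 input features and 3 output classes are illustrative placeholders (echoing the track inputs and {b, c, light} labels later in the talk), not any particular production model.

```python
import torch
import torch.nn as nn

# f_theta(x): a small MLP whose trainable parameters theta are its weights
# and biases. The 15 inputs / 3 outputs are illustrative placeholders.
f_theta = nn.Sequential(
    nn.Linear(15, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),
)

x = torch.randn(100, 15)                    # a batch of toy inputs
y = f_theta(x)                              # outputs: shape (100, 3)
n_params = sum(p.numel() for p in f_theta.parameters())
print(y.shape, n_params)                    # O(thousands) here; O(billions) for large models
```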
Symmetries in our ML models
Today:
About me
b-tagging
Sets and sequence models for algorithm training
HH -> 4b
Interesting because
Outline
1. Symmetries in architectures
2. Symmetries in inputs
3. Other examples
Not a complete overview; using examples to give a sense of the types of applications.
Part 1 talk: symmetries in our ML models (the outline above)
Coffee break
Part 2 tutorial: Deep Sets for flavor tagging
✋
Would love feedback + discussions!!
Outline
1. Symmetries in architectures
2. Symmetries in inputs
3. Other examples
Deep Impact Parameter Sets (DIPS)
CNNs
Translational invariance
Strong inductive bias: domain knowledge we inject so that training converges to a solution efficiently.
Graphic adapted from Stanford CS 231n Lecture 5: CNNs
Example: Deep image prior
The only regularization is the CNN architecture.
Task: super resolution. Given a corrupted image x0, reconstruct the high-res image x.
Use SGD to find θ by minimizing || f_θ(x0) − x0 ||² on a single image.
[Diagram: corrupted image x0 → CNN → f_θ(x0) = high-res image x]
x = f_θ*(x0), where θ* = argmin_θ || f_θ(x0) − x0 ||²
1711.10925
Shows the power of the imposed symmetry even in the absence of multiple examples for learning.
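A heavily simplified sketch of the deep image prior idea (1711.10925): the paper optimizes the CNN from a fixed random input with a task-specific data term, while this toy version just fits x0 directly, so treat the shapes and loss here as assumptions.

```python
import torch
import torch.nn as nn

# Fit a small CNN to a single corrupted image x0 with SGD; the CNN
# architecture is the only regularizer. (The paper optimizes from a fixed
# random input with a task-specific data term; this toy fits x0 directly.)
cnn = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)

x0 = torch.rand(1, 1, 64, 64)               # one corrupted image
opt = torch.optim.Adam(cnn.parameters(), lr=1e-3)
for step in range(200):                     # stop early: train too long and
    opt.zero_grad()                         # the CNN fits the corruption too
    loss = ((cnn(x0) - x0) ** 2).mean()     # || f_theta(x0) - x0 ||^2
    loss.backward()
    opt.step()
x_restored = cnn(x0).detach()               # x = f_theta*(x0)
```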
References
1806.01261
Encoding symmetries into your network can help solve the problem faster
References
1806.01261; 1703.06114
NN designed to operate on sets
Set: Collection of objects without any specified order
Example: # of colored balls in a bag
Where does order matter?
Natural language
(1) Mary likes John
(2) John likes Mary
Same words… the order changes the meaning.
References
1806.01261; 1703.06114; 1810.05165
Permutation invariance over the tracks
Input representation
Eur. Phys. J. C 73 (2013) 2304
Collection of tracks: Xi, i = 1, …, n
Each track has features: Xi ∈ ℝᵐ
Jet has label Y
Want: p(Y | X1, …, Xn)
What is a NN?
A function approximator
[Diagram: inputs x → NN f_θ(x) → outputs y]
1810.05165
What is a Deep Set?
A special type of function approximator
[Diagram: inputs X1, X2, …, Xn → NN f_θ → output y]
1810.05165
What is a Deep Set?
A special type of function approximator
[Diagram: inputs X1, X2, …, Xn → per-track NN Φ (track feature extractor) → Φ(X1), Φ(X2), …, Φ(Xn) → ? → output y]
1810.05165
What is a Deep Set?
A special type of function approximator
[Diagram: inputs X1, X2, …, Xn → per-track NN Φ (track feature extractor) → Φ(X1), …, Φ(Xn) → ∑ (sum: a permutation-invariant operation) → ? → output y]
1810.05165
What is a Deep Set?
A special type of function approximator
[Diagram: inputs X1, X2, …, Xn → per-track NN Φ (track feature extractor) → Φ(X1), …, Φ(Xn) → ∑ (sum: permutation-invariant operation) → jet NN F, which models correlations between the tracks → output y]
y = F( ∑i Φ(Xi) )
1810.05165
Deep Sets: for b-tagging
[Diagram: track inputs X1, X2, …, Xn → per-track NN Φ (track feature extractor) → ∑ (permutation-invariant sum) → jet NN F (models correlations between the tracks) → output y ∈ {b, c, l}]
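A minimal PyTorch sketch of this architecture, assuming padded track arrays with a validity mask; the layer sizes and the 15-feature / 3-class shapes are illustrative choices, not the ATLAS configuration.

```python
import torch
import torch.nn as nn

class DeepSet(nn.Module):
    """Sketch of the Deep Sets architecture (1703.06114, 1810.05165):
    y = F( sum_i Phi(X_i) ). Layer sizes are illustrative."""
    def __init__(self, n_feat=15, latent=64, n_classes=3):
        super().__init__()
        self.phi = nn.Sequential(               # per-track feature extractor
            nn.Linear(n_feat, latent), nn.ReLU(),
            nn.Linear(latent, latent), nn.ReLU(),
        )
        self.F = nn.Sequential(                 # jet-level network
            nn.Linear(latent, latent), nn.ReLU(),
            nn.Linear(latent, n_classes),       # {b, c, light} scores
        )

    def forward(self, x, mask):
        # x: (n_jets, max_tracks, n_feat); mask: (n_jets, max_tracks), 1 = real track
        h = self.phi(x) * mask.unsqueeze(-1)    # zero out padded tracks
        return self.F(h.sum(dim=1))             # permutation-invariant sum, then F

model = DeepSet()
x = torch.randn(8, 40, 15)                      # 8 jets, up to 40 tracks each
mask = (torch.rand(8, 40) > 0.3).float()
perm = torch.randperm(40)                       # shuffle the track order
same = torch.allclose(model(x, mask), model(x[:, perm], mask[:, perm]), atol=1e-5)
print(model(x, mask).shape, same)               # (8, 3) True
```

The final check shuffles the track order and confirms the output is unchanged: the permutation invariance is built into the architecture, not learned.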
Input features
Tracks are high dimensional: each track has Xi ∈ ℝ¹⁵, grouped into impact parameters, track kinematics, and track quality features.
ATL-PHYS-PUB-2020-014
b-tagging
Key variable: Impact Parameter
DIPS
In the energy flow formalism
ATL-PHYS-PUB-2020-014
What do Deep Sets replace? Recurrent neural networks.
RNNs also account for correlations between tracks.
Challenge: they require choosing an ordering for the tracks.
😃
😃
Why are Deep Sets a nice representation? They learn faster.
How we evaluate the performance
b-jet efficiency = (# b-jets above threshold) / (# b-jets)
Background rejection = 1 / (background efficiency)
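A small sketch of these two metrics, assuming per-jet discriminant scores for truth b-jets and background jets; the score distributions here are toy Gaussians.

```python
import numpy as np

def efficiency_and_rejection(scores_b, scores_bkg, threshold):
    b_eff = np.mean(scores_b > threshold)      # (# b-jets above threshold) / (# b-jets)
    bkg_eff = np.mean(scores_bkg > threshold)  # background efficiency
    bkg_rej = 1.0 / bkg_eff if bkg_eff > 0 else np.inf
    return b_eff, bkg_rej

# Toy scores: signal peaked high, background peaked low.
rng = np.random.default_rng(0)
scores_b = rng.normal(2.0, 1.0, 10_000)
scores_bkg = rng.normal(0.0, 1.0, 10_000)
print(efficiency_and_rejection(scores_b, scores_bkg, threshold=1.0))
```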
Performance comparison
Lets us have a faster turnaround time for PHYSICS
Outline
1. Symmetries in architectures
2. Symmetries in inputs
3. Other examples
Input processing: ΔR
Ex: b-tagging input features
[Figure: the Xi feature vector, highlighting tracks failing IP cuts]
Training with ΔR(trk, jet) encodes a symmetry in the input representation, and makes the features easier to learn.
Input processing: logging variables
Ex: b-tagging input features
[Plots: pTfrac ≡ pT(trk) / pT(jet) and ΔR(trk, jet): raw, log, and scaled-log distributions]
These power-law distributions become bell-shaped after the log.
Normalizing then encourages the inputs to sit in the sensitive range of the activation functions.
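A sketch of that transformation for a generic positive power-law variable; in practice the mean and standard deviation would be computed once on the training set and reused, rather than per batch as in this toy.

```python
import numpy as np

def log_and_scale(x, eps=1e-8):
    """Power-law variable -> roughly bell-shaped -> zero mean, unit variance."""
    logged = np.log(x + eps)
    return (logged - logged.mean()) / logged.std()

# Toy power-law sample standing in for pTfrac or dR.
x = np.random.default_rng(0).pareto(3.0, 100_000)
x_scaled = log_and_scale(x)
print(round(x_scaled.mean(), 3), round(x_scaled.std(), 3))   # ~0.0, ~1.0
```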
Input processing: logging variables
Ex: b-tagging input features
Train pTfrac and ΔR
Train log pTfrac and log ΔR
How does this help? 20% speed up in training time!
Input processing: HH -> 4b background modeling
Problem setup
p(mhh, mh1, mh2) = p(mhh | mh1, mh2) p(mh1, mh2)
Model the Higgs candidate kinematics to reconstruct mhh
Input processing: HH -> 4b background modeling
Azimuthal symmetry baked into the model
Transform: ΔΦHH → log(π − ΔΦHH)
Solutions with constant ΔΦHH give the same mHH.
Modeling concern: the network has no way to know that −π wraps around to +π.
Conditioning on mh1, mh2 → 6 variables left to model; the azimuthal symmetry reduces this to 5.
Input processing: HH -> 4b background modeling
The right transformations can help learn complicated distributions.
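A sketch of the ΔΦHH transformation above, assuming ΔΦHH ∈ (0, π] and peaking sharply at π for back-to-back Higgs candidates; log(π − ΔΦHH) maps the hard edge at π to an unbounded, smoother variable, and the mapping is invertible.

```python
import numpy as np

def transform_dphi(dphi, eps=1e-8):
    # dPhi_HH peaks sharply at pi (back-to-back Higgs candidates);
    # log(pi - dPhi_HH) turns the hard edge into an unbounded variable.
    return np.log(np.pi - dphi + eps)

def inverse_transform_dphi(t):
    return np.pi - np.exp(t)

# Toy dPhi values just below pi.
dphi = np.pi - np.abs(np.random.default_rng(0).normal(0.0, 0.2, 5))
t = transform_dphi(dphi)
print(np.allclose(inverse_transform_dphi(t), dphi, atol=1e-6))   # round trip: True
```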
Trick: batch norm
What about the features in the hidden layers?
“You want zero mean and unit variance operations? Just make them so!”
1502.03167
Stanford's CS 231n Lecture 6
At training time, normalize over the activations of the minibatch
Usually inserted after Fully Connected layers, and before nonlinearity.
Nice regularization technique to have in your toolkit!
Additional learnable parameters for the scale and shift: γ and β.
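The usual placement in code, as a minimal PyTorch sketch (BatchNorm1d's weight and bias attributes carry the learnable γ and β); the layer sizes are illustrative.

```python
import torch
import torch.nn as nn

# BatchNorm placed after the fully connected layer and before the
# nonlinearity (1502.03167); BatchNorm1d's weight/bias are the learnable
# scale gamma and shift beta.
block = nn.Sequential(
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
)
x = torch.randn(32, 128)        # training mode: normalizes over the minibatch
print(block(x).shape)           # (32, 64); at test time, running averages are used
```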
Outline
1. Symmetries in architectures
2. Symmetries in inputs
3. Other examples
DeXTer: Deep Set Xbb Tagger
Low pT X->bb tagging
ATL-COM-PHYS-2022-550
Currently in ATLAS circulation
[Figure: signal vs background topologies]
Two Deep Sets: one over the tracks, one over the vertices!
Low ma leads to merged b-jets; the pseudo-scalar a's pT is too low for the boosted Higgs calibration.
Deep Sets: ttH -> bb event classification
Slide from the Higgs report at the ML in analysis meeting
[Plot: performance comparison; arrow indicates better]
Transformers
“Any time you have a deep set you can substitute in a transformer as it’s always more expressive.”
~ Peter Battaglia (or what I remembered of the quote)
Sequence: [Mary, likes, John]
Set: [ {0, Mary} , {1, likes} , {2, John} ]
Transformer architecture
Replace the Deep Sets Sum operation with a weighted sum (can still be permutation invariant)
1706.03762
[Diagram: attention computes a weight for each input, then takes a weighted sum over the inputs]
Standard architecture for Natural Language Processing (replacing RNNs): faster training -> better optimization.
And then… stack them up!
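A minimal sketch of that weighted sum (1706.03762): a single attention head, with illustrative shapes. The outputs are permutation-equivariant (permuting the inputs permutes the outputs the same way), so pooling over them, e.g. with a sum as in Deep Sets, still gives a permutation-invariant result.

```python
import torch
import torch.nn.functional as F

def attention(x, Wq, Wk, Wv):
    # x: (n_inputs, d); one attention head with projections Wq, Wk, Wv: (d, d_head)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    w = F.softmax(q @ k.T / k.shape[-1] ** 0.5, dim=-1)   # computes the weights
    return w @ v                                          # weighted sum over inputs

d, d_head = 8, 4
x = torch.randn(5, d)
Wq, Wk, Wv = (torch.randn(d, d_head) for _ in range(3))
out = attention(x, Wq, Wk, Wv)
perm = torch.randperm(5)
# Permuting the inputs just permutes the outputs; summing over them
# (as in Deep Sets) would give a fully permutation-invariant output.
print(torch.allclose(attention(x[perm], Wq, Wk, Wv), out[perm], atol=1e-5))  # True
```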
Transformers: HH->4b combinatorial optimization
Goal: Reconstruct jets into Higgs Candidates
Interpret the transformer attention weights between the jets as the probability of coming from the same Higgs.
Another permutation-invariant operation, in a different application.
In ∑-mary
BACKUP
Table of Contents:
Questions and answers
RNNs vs Deep Sets
Example: Deep image prior
Task: super resolution
Given a corrupted image x0, reconstruct the higher-resolution truth image.
The inductive bias of the CNN architecture alone is sufficient for good super-resolution performance, using SGD to optimize the parameters on a single image.
1711.10925
Inputs
ATL-PHYS-PUB-2020-014
IP3D limitations
ATL-PHYS-PUB-2017-003
Recurrent neural networks
Graphic adapted from Stanford CS 231n Lecture 10: RNNs
RNNIP
ATL-PHYS-PUB-2017-003
Where are the improvements coming from?
ATL-PHYS-PUB-2017-003
Saliency maps
What has DIPS learned about b-jets?
1312.6034
Saliency definition: the gradient of the class score with respect to the inputs.
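A sketch of that definition in code; the model here is a stand-in toy classifier over (tracks × features) inputs, not the actual DIPS network.

```python
import torch
import torch.nn as nn

# Toy stand-in for the jet classifier: (1 jet, 40 tracks, 15 features) -> 3 scores.
model = nn.Sequential(nn.Flatten(), nn.Linear(40 * 15, 3))

def saliency(model, x, class_idx):
    x = x.clone().requires_grad_(True)
    model(x)[0, class_idx].backward()   # class score for one jet
    return x.grad[0]                    # d(score)/d(inputs): (n_tracks, n_feat)

x = torch.randn(1, 40, 15)
print(saliency(model, x, class_idx=0).shape)   # torch.Size([40, 15])
```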
What has DIPS learned about b-jets?
ATL-PHYS-PUB-2020-014
[Plot: saliency maps for b-jets with 8 tracks failing the 77% b-tagging working point, tracks ordered from largest to smallest sd0. ATLAS Simulation Preliminary, √s = 13 TeV, tt̄]
Saliency maps in mathematics
A way to analyze the importance of the inputs
Key idea: Use saliency maps to postulate new conjectures which could then become new math theorems!
Impact of more data
Alex Foch's QT summary
Recomm. RNNIP: the Run 2 recommendation
Reference DIPS: the DIPS model from the PUB note, with the optimized track selection
DIPS Default: Alex F.'s DIPS retraining with many more jets
DIPS Loose: Alex F.'s DIPS retraining with many more jets + the optimized track selection
Retraining the high level tagger
Alex Foch’s QT summary
Transformers: HH->4b combinatorial optimization
Q: What extra information is the transformer using?
A: Momentum conservation to help with jet selection!
Transformers: HH->4b combinatorial optimization
Training
4b: extra figures
Input processing: HH -> 4b background modeling
Variable transformations
Invariance vs Equivariance
Invariance: a function f is invariant under a symmetry group G if f(Tg(x)) = f(x) for every g ∈ G.
Equivariance: a function f is equivariant under a group G if for every g ∈ G there exist operators Tg and Sg representing g such that Tg(f(x)) = f(Sg(x)).
Example: convolutions are translation equivariant.
2105.09016
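A quick numerical check of the convolution example, taking Tg = Sg = a cyclic shift of a 1D signal; the circular convolution is a toy stand-in for a CNN layer.

```python
import numpy as np

def circ_conv(x, k):
    # Circular 1D convolution (correlation) of signal x with kernel k.
    n = len(x)
    return np.array([sum(k[j] * x[(i + j) % n] for j in range(len(k)))
                     for i in range(n)])

rng = np.random.default_rng(0)
x, kernel, shift = rng.normal(size=32), rng.normal(size=5), 7
lhs = circ_conv(np.roll(x, shift), kernel)    # f(S_g(x)): convolve the shifted input
rhs = np.roll(circ_conv(x, kernel), shift)    # T_g(f(x)): shift the convolved output
print(np.allclose(lhs, rhs))                  # True: T_g(f(x)) = f(S_g(x))
```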
CNNs in HEP / ATLAS
Jet images (1511.05190)
Top tagging
ATL-COM-PHYS-2022-275