1 of 65

Symmetries in ML

Use case: Deep Sets for flavor tagging

Nicole Hartman

SLAC ATLAS Group

US ATLAS Machine Learning Training Event

Day 2: Applications of ML, July 28th, 2022

2 of 65

Symmetries in particle physics

Standard Model

SU(3) x SU(2) x U(1) group

Group theory is the language of particle physics

-> 19 free parameters


LHC physics: Search for where these 19 free parameters fail to describe the data

…And how do we do this?

3 of 65

What’s exciting about data analysis now?

Increasingly large datasets!!


AI for science: How to gain the most out of our physics datasets

Rob Schreiber (Cerebras), SLAC Colloquium

Peter Wennink (ASML), CNBC interview

LCLS: Tb/s

LSST: 20 TB/night

SKA: TB/night

LHC: 90 PB/yr

“We’re changing into a sensing world”

4 of 65

What is a NN?

A function approximator

What’s new: many more ways to do this function approximation, and cleverer ways to represent the input features.

Ө: Trainable parameters O(billions)

  • Optimize with a large # of training examples

[Diagram: inputs x → NN → outputs y = fӨ(x)]

30M jets for FTAG training!
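As a minimal sketch of this idea (a hypothetical toy problem, not the FTAG setup), Ө enters as the weights of a parameterized function fӨ, tuned by gradient descent over many training examples:

```python
import torch
from torch import nn

# f_theta: a small fully connected network approximating x -> y
f_theta = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

x = torch.randn(1000, 10)              # toy training inputs
y = x.pow(2).sum(dim=1, keepdim=True)  # toy target function to approximate

opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)
for _ in range(200):                   # optimize theta over many examples
    loss = ((f_theta(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```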

Today: Symmetries in our ML models

5 of 65

About me


b-tagging

Sets and sequence models for algorithm training

HH -> 4b

Interesting because

  • Data driven background estimation
  • Jet -> parton assignment

6 of 65

Outline

1. Symmetries in architectures

2. Symmetries in inputs

3. Other examples

Not an exhaustive overview; the examples are chosen to give a sense of the types of applications…

Part 1 talk → Coffee break → Part 2 tutorial: Deep Sets for flavor tagging

Would love feedback + discussions!!

7 of 65

Outline

1. Symmetries in architectures

2. Symmetries in inputs

3. Other examples


  • Convolutional Neural Networks
    • Translation invariance

  • Particle Flow Networks ≡ Deep Set
    • Permutation invariance

DIPS: Deep Impact Parameter Sets

8 of 65

CNNs

Translational invariance


Strong inductive bias: domain knowledge we inject so that training converges efficiently to a solution.

Graphic adapted from Stanford CS 231n Lecture 5: CNNs
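A quick numerical check of this inductive bias (a sketch with a random, untrained layer; circular padding is assumed so the symmetry is exact): convolving commutes with translating the input.

```python
import torch
from torch import nn

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1,
                 padding_mode='circular', bias=False)
x = torch.randn(1, 1, 32, 32)

# Translate then convolve vs convolve then translate: same answer
a = conv(torch.roll(x, shifts=2, dims=-1))
b = torch.roll(conv(x), shifts=2, dims=-1)
print(torch.allclose(a, b, atol=1e-5))  # True: translation equivariance
```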

9 of 65

Example: Deep image prior

The only regularization is the CNN architecture.

Use SGD to find Ө by minimizing ‖fӨ(x0) − x0‖² on a single image.

Task: super resolution

[Diagram: corrupted image x0 → CNN → high-res image x = fӨ(x0)]

Ө* = argminӨ ‖fӨ(x0) − x0‖², x = fӨ*(x0)

1711.10925

Shows the power of the imposed symmetry even in the absence of multiple examples for learning.

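A hedged sketch of the single-image fit (schematic only: the paper uses an encoder-decoder network with a fixed noise input and task-specific losses; here a tiny CNN and a plain reconstruction loss stand in):

```python
import torch
from torch import nn

# Stand-in for the paper's encoder-decoder CNN
f_theta = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)

x0 = torch.rand(1, 3, 64, 64)   # the single corrupted image
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-2)

for _ in range(500):  # fit ONE image; the CNN architecture is the only prior
    loss = ((f_theta(x0) - x0) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

x = f_theta(x0)       # stop early to read out the restored image
```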

10 of 65

References


1806.01261

Encoding symmetries into your network can help solve the problem faster

11 of 65

References


1806.01261; 1703.06114

NN designed to operate on sets

Set: Collection of objects without any specified order

Example: # of colored balls in a bag

12 of 65

Where does order matter?

Natural language


(1) Mary likes John

(2) John likes Mary

Same words… the order changes the meaning.


13 of 65

References


1806.01261; 1703.06114; 1810.05165

Permutation invariance over the tracks

14 of 65

Input representation


Eur. Phys. J. C 73 (2013) 2304

Collection of tracks:

Xi, i ∈ {1, …, n}

Each track has features:

Xi ∈ ℝm

Jet has labels Y

  • Quark / gluon tagging: Y ∈ {q, g}
  • Higgs tagging: Y ∈ {H, top, QCD}
  • Top tagging: Y ∈ {top, QCD}
  • b-tagging: Y ∈ {b, c, l}

Want: p(Y | X1, … , Xn)
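Since n varies from jet to jet, a common way to store such a set (an assumed convention for illustration, not ATLAS production code) is a zero-padded array plus a validity mask:

```python
import numpy as np

M = 15            # features per track
MAX_TRACKS = 40   # pad/truncate every jet to this length

def pad_jet(tracks):
    """tracks: (n, M) array for one jet -> padded (MAX_TRACKS, M) + mask."""
    n = min(len(tracks), MAX_TRACKS)
    X = np.zeros((MAX_TRACKS, M), dtype=np.float32)
    X[:n] = tracks[:n]
    mask = np.zeros(MAX_TRACKS, dtype=bool)
    mask[:n] = True   # True for real tracks, False for padding
    return X, mask

X, mask = pad_jet(np.random.randn(7, M))
```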

15 of 65

What is a NN?

A function approximator

[Diagram: inputs x → NN → outputs y = fӨ(x)]

1810.05165

16 of 65

What is a Deep Set?

A special type of function approximator

1810.05165

[Diagram: inputs {X1, X2, …, Xn} → NN → output y = fӨ(X)]


18 of 65

What is a Deep Set?

A special type of function approximator

NN: track feature extractor, applied to each track

1810.05165

[Diagram: X1, X2, …, Xn → Φ(X1), Φ(X2), …, Φ(Xn) → ? → y]

19 of 65

What is a Deep Set?

A special type of function approximator

NN track feature extractor, then a sum: a permutation-invariant operation

1810.05165

[Diagram: X1, X2, …, Xn → Φ(X1) + Φ(X2) + … + Φ(Xn) → ? → y]

20 of 65

What is a Deep Set?

A special type of function approximator

NN track feature extractor, a permutation-invariant sum, then an NN jet feature extractor that models correlations between the tracks

1810.05165

[Diagram: y = F( Φ(X1) + Φ(X2) + … + Φ(Xn) )]
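In code the whole architecture is compact. A minimal sketch of the Deep Sets idea (not the DIPS production model): Φ applied per track, a masked sum, then F on the pooled features.

```python
import torch
from torch import nn

class DeepSet(nn.Module):
    def __init__(self, n_feat=15, n_hidden=64, n_classes=3):
        super().__init__()
        # Phi: the track feature extractor, shared across all tracks
        self.phi = nn.Sequential(
            nn.Linear(n_feat, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        # F: the jet feature extractor, models track correlations
        self.F = nn.Sequential(
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_classes),
        )

    def forward(self, X, mask):
        # X: (batch, n_tracks, n_feat), mask: (batch, n_tracks) bool
        h = self.phi(X) * mask.unsqueeze(-1)  # zero out padded tracks
        return self.F(h.sum(dim=1))           # permutation-invariant sum

model = DeepSet()
logits = model(torch.randn(8, 40, 15), torch.ones(8, 40, dtype=torch.bool))
```

Because the sum ignores ordering, shuffling the tracks of a jet leaves the output unchanged by construction.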

21 of 65

Deep Sets: for b-tagging

Same architecture, now with b-tagging outputs:

[Diagram: y = F( Φ(X1) + Φ(X2) + … + Φ(Xn) ) ∈ {b, c, l}]

22 of 65

Input features

Tracks

High dimensional

ATL-PHYS-PUB-2020-014

Xi = (impact parameters, track kinematics, track quality) ∈ ℝ15

23 of 65

b-tagging

Key variable: Impact Parameter


24 of 65

25 of 65

DIPS

In the energy flow formalism


ATL-PHYS-PUB-2020-014

26 of 65

What do Deep Sets replace?

Recurrent neural network

Account for correlations between tracks

Challenge: ordering


27 of 65

Why are Deep Sets a nice representation?

Faster learning


28 of 65

How we evaluate the performance


b-jet efficiency = (# b-jets passing the threshold) / (# b-jets)

Background rejection = 1 / (background efficiency)
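Concretely (a sketch with hypothetical score arrays), both metrics are counting ratios once the score threshold is fixed:

```python
import numpy as np

def eff_and_rejection(scores_b, scores_bkg, b_eff_target=0.77):
    """Cut on the tagger score at a fixed b-jet efficiency working point."""
    # threshold chosen so the requested fraction of b-jets passes
    threshold = np.quantile(scores_b, 1.0 - b_eff_target)
    b_eff = np.mean(scores_b > threshold)      # ~ b_eff_target
    bkg_eff = np.mean(scores_bkg > threshold)  # guard against 0 in practice
    return b_eff, 1.0 / bkg_eff                # rejection = 1 / efficiency

b_eff, rejection = eff_and_rejection(np.random.beta(5, 2, 10_000),
                                     np.random.beta(2, 5, 10_000))
```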

29 of 65

Performance comparison


30 of 65

Lets us have a faster turnaround time for PHYSICS


31 of 65

Outline

1. Symmetries in architectures

2. Symmetries in inputs

3. Other examples

32 of 65

Input processing: ΔR

Ex: b-tagging input features

[Figure: the input feature vector Xi, highlighting tracks failing IP cuts]

Training with ΔR(trk, jet) encodes a symmetry in the input representation, making the features easier to learn.
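For reference, a small helper computing this relative coordinate (a sketch; the azimuthal difference needs wrapping into [−π, π)):

```python
import numpy as np

def delta_r(eta_trk, phi_trk, eta_jet, phi_jet):
    """DeltaR between a track and the jet axis."""
    d_eta = eta_trk - eta_jet
    # wrap the azimuthal difference into [-pi, pi)
    d_phi = (phi_trk - phi_jet + np.pi) % (2 * np.pi) - np.pi
    return np.hypot(d_eta, d_phi)
```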

33 of 65

Input processing: logging variables

Ex: b-tagging input features


pTfrac ≡ pTtrk / pTjet

[Plots: pTfrac and ΔR(trk, jet), their logs, and the scaled logs]

Power-law distributions become bell-shaped after taking the log.

Normalizing keeps the inputs in the sensitive range of the activation functions.
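The preprocessing itself is two lines per variable (a sketch; in a real training the mean and standard deviation would be computed once on the training set and reused):

```python
import numpy as np

def log_and_scale(x, eps=1e-8):
    """Power-law variable -> roughly bell-shaped -> zero mean, unit variance."""
    z = np.log(x + eps)
    return (z - z.mean()) / z.std()

pt_frac_scaled = log_and_scale(np.random.pareto(3.0, 10_000))  # toy power law
```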

34 of 65

Input processing: logging variables

Ex: b-tagging input features


Train pTfrac and ΔR

Train log pTfrac and log ΔR

How does this help? 20% speed up in training time!

35 of 65

Input processing: HH -> 4b background modeling

Problem setup


p(mhh, mh1, mh2) = p(mhh | mh1, mh2) p(mh1, mh2)

Model the Higgs candidate kinematics to reconstruct mhh

36 of 65

Input processing: HH -> 4b background modeling


Azimuthal symmetry baked into the model: transform ΔΦHH → log(π − ΔΦHH)

Solutions with constant ΔΦHH give the same mHH.

Concerned about the modeling because the network has no way to know that −π and π are the same point.

Conditioning on mh1, mh2 → 6 variables left to model (5 after exploiting the azimuthal symmetry)

37 of 65

Input processing: HH -> 4b background modeling


The right transformations can help learn complicated distributions.

38 of 65

Trick: batch norm

What about the features in the hidden layers?


“You want zero-mean, unit-variance activations? Just make them so!”

1502.03167

Stanford's CS 231n Lecture 6

At training time, normalize over the activations of the minibatch

Usually inserted after Fully Connected layers, and before nonlinearity.

Nice regularization technique to have in your toolkit!

Additional learnable parameters for the scale and shift: γ and β.
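Written out by hand, the training-time forward pass is short (a sketch; in practice nn.BatchNorm1d also tracks running statistics for inference):

```python
import torch

def batch_norm(h, gamma, beta, eps=1e-5):
    """Normalize each feature over the minibatch, then rescale."""
    mu = h.mean(dim=0)                  # per-feature minibatch mean
    var = h.var(dim=0, unbiased=False)  # per-feature minibatch variance
    h_hat = (h - mu) / torch.sqrt(var + eps)
    return gamma * h_hat + beta         # learnable scale gamma, shift beta

h = torch.randn(128, 64)     # activations after a fully connected layer
out = batch_norm(h, torch.ones(64), torch.zeros(64))  # then the nonlinearity
```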

39 of 65

Outline

1. Symmetries in architectures

2. Symmetries in inputs

3. Other examples

40 of 65

DeXTer: Deep Set Xbb Tagger

Low pT X->bb tagging

ATL-COM-PHYS-2022-550

Currently in ATLAS circulation


[Diagrams: signal and background topologies]

Two Deep Sets!! A tracks Deep Set and a vertices Deep Set.

Low ma leads to merged b-jets; the pseudo-scalar a’s pT is too low for the boosted Higgs calibration.

41 of 65

Deep Sets: ttH -> bb event classification


Slide from the Higgs report at the ML in analysis meeting

42 of 65

Transformers


“Any time you have a deep set you can substitute in a transformer as it’s always more expressive.”

~ Peter Battaglia (or what I remembered of the quote)

Sequence: [Mary, likes, John]

Set: [ {0, Mary} , {1, likes} , {2, John} ]

Transformer architecture

Replace the Deep Sets Sum operation with a weighted sum (can still be permutation invariant)

1706.03762

[Diagram: attention computes a weight for each input, then takes a weighted sum over the inputs]

Standard architecture for Natural Language Processing (replacing RNNs): faster training -> better optimization.

And then… stack them up!
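The swap from a plain sum to a weighted sum can be sketched in a few lines (a minimal attention-pooling layer, not the full multi-head self-attention of 1706.03762):

```python
import torch
from torch import nn

class AttentionPool(nn.Module):
    """Permutation-invariant weighted sum; the weights depend on the inputs."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # computes a weight per element

    def forward(self, h, mask):
        # h: (batch, n, dim), mask: (batch, n) bool
        w = self.score(h).squeeze(-1)
        w = w.masked_fill(~mask, float('-inf'))  # ignore padded slots
        a = torch.softmax(w, dim=1)              # weights sum to 1 per set
        return (a.unsqueeze(-1) * h).sum(dim=1)  # weighted sum over inputs

pool = AttentionPool(64)
out = pool(torch.randn(8, 40, 64), torch.ones(8, 40, dtype=torch.bool))
```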

43 of 65

Transformers: HH->4b combinatorial optimization

Goal: Reconstruct jets into Higgs Candidates


Interpret the transformer attention weights between the jets as the probability of coming from the same Higgs.

Another permutation-invariant operation, a different application.

44 of 65

In summary

  • Deep Sets: permutation invariance

  • It can be just as important to “bake” a symmetry into a model

  • Highlighted other use cases in HEP-ex.


  • Use case: FTAG
  • 4x speed-up in training time
  • Faster R&D

45 of 65

BACKUP

Table of Contents:

Questions and answers

RNNs vs Deep Sets

46 of 65

Example: Deep image prior

Task: super resolution

Given a corrupted image x0 reconstruct a higher resolution truth image

Just the inductive bias of the CNN architecture is sufficient to obtain good super-resolution performance, using SGD to optimize the parameters on a single image.


1711.10925

47 of 65

Inputs

ATL-PHYS-PUB-2020-014


48 of 65

IP3D limitations


ATL-PHYS-PUB-2017-003

49 of 65

Recurrent neural networks


Graphic adapted from Stanford CS 231n Lecture 10: RNNs

50 of 65

RNNIP


ATL-PHYS-PUB-2017-003

51 of 65

Where are the improvements coming from?


ATL-PHYS-PUB-2017-003

52 of 65

RNNIP


ATL-PHYS-PUB-2017-003

53 of 65

Where are the improvements coming from?


ATL-PHYS-PUB-2017-003

54 of 65

Saliency maps

What has DIPS learned about b-jets?


1312.6034

55 of 65

Saliency definition

What has DIPS learned about b-jets?


ATL-PHYS-PUB-2020-014

[Saliency maps: b-jets with 8 tracks failing the 77% b-tagging working point, tracks ordered from largest to smallest sd0. ATLAS Simulation Preliminary, √s = 13 TeV, tt̄]

  • Consider b-jets failing the 77% WP

  • Average over jets with 8 tracks
  • Sort the tracks by sd0 for the average

56 of 65


57 of 65

Saliency maps in mathematics applications

A way to analyze the importance of the inputs


Key idea: Use saliency maps to postulate new conjectures which could then become new math theorems!

58 of 65

Impact of more data


Alex Foch’s QT summary

Recomm. RNNIP: Run 2 recommendation

Reference DIPS: DIPS model from the PUB note, with the optimized track selection

DIPS Default: Alex F.’s retraining of DIPS with many more jets

DIPS Loose: Alex F.’s retraining of DIPS with many more jets + the optimized track selection

59 of 65

Retraining the high level tagger


Alex Foch’s QT summary

60 of 65

Transformers: HH->4b combinatorial optimization


Q: What extra information is the transformer using?

A: Momentum conservation to help with jet selection!

61 of 65

Transformers: HH->4b combinatorial optimization


Training

  • Enumerate over the valid pairings
  • Maximize the probability of the correct pairing
    • cross entropy loss (sketched below)
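A sketch of this objective (hypothetical shapes; for 4 jets there are 3 valid ways to split them into two Higgs candidates): score each pairing from the attention weights and take a cross entropy over the options.

```python
import torch
import torch.nn.functional as F

# The 3 valid ways to group 4 jets into two Higgs candidates
PAIRINGS = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]

def pairing_loss(attn, target):
    """attn: (4, 4) jet-jet attention weights; target: correct pairing index."""
    scores = torch.stack([attn[a, b] + attn[c, d]
                          for (a, b), (c, d) in PAIRINGS])
    # maximize the probability of the correct pairing
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([target]))

loss = pairing_loss(torch.rand(4, 4), target=0)
```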

62 of 65

4b: extra figures


63 of 65

Input processing: HH -> 4b background modeling

Variable transformations


64 of 65

Invariance vs Equivariance

Equivariance: a function f is equivariant under a group G if for every g ∈ G there exist operators Tg and Sg representing g such that:

Tg(f(x)) = f(Sg(x))

Example: convolutions

Invariance: a function f is invariant under a symmetry group G if:

f(Tg(x)) = f(x)

2105.09016

65 of 65

CNNs in HEP / ATLAS


Jet images (1511.05190)

PU mitigation with event images (ATL-PHYS-PUB-2019-028)

Top tagging (ATL-COM-PHYS-2022-275)