1 of 65

Symmetries in ML

Use case: Deep Sets for flavor tagging

Nicole Hartman

SLAC ATLAS Group

US ATLAS Machine Learning Training Event

Day 2: Applications of ML, July 28th, 2022

2 of 65

Symmetries in particle physics

Standard Model

SU(3) x SU(2) x U(1) group

Group theory is the language of particle physics

-> 19 free parameters


LHC physics: Search for where these 19 free parameters fail to describe the data

…And how do we do this?

3 of 65

What’s exciting about data analysis now?

Increasingly large datasets!!


AI for science: How to gain the most out of our physics datasets

Rob Schreiber (Cerebras), SLAC Colloquium

Peter Wennink (ASML), CNBC interview

LCLS: Tb/s

LSST: 20 TB/night

SKA: TB/night

LHC: 90 PB/yr

“We’re changing into a sensing world”

4 of 65

What is a NN?

A function approximator

What’s new: many more ways to do this function approximation, and cleverer ways to represent the input features.

Ө: Trainable parameters O(billions)

  • Optimize with a large # of training examples

[Diagram: inputs x → NN → outputs y = fӨ(x)]

30M jets for FTAG training!
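As a minimal sketch of this idea (a hypothetical toy problem, not the FTAG setup), Ө enters as the weights of a parameterized function fӨ, tuned by gradient descent over many training examples:

```python
import torch
from torch import nn

# f_theta: a small fully connected network approximating x -> y
f_theta = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 1),
)

x = torch.randn(1000, 10)              # toy training inputs
y = x.pow(2).sum(dim=1, keepdim=True)  # toy target function to approximate

opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)
for _ in range(200):                   # optimize theta over many examples
    loss = ((f_theta(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```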

Today: Symmetries in our ML models

5 of 65

About me


b-tagging

Sets and sequence models for algorithm training

HH -> 4b

Interesting because

  • Data driven background estimation
  • Jet -> parton assignment

6 of 65

Outline

1. Symmetries in architectures

2. Symmetries in inputs

3. Other examples

Not an exhaustive overview; the examples are chosen to give a sense of the types of applications…

Part 1 talk → Coffee break → Part 2 tutorial: Deep Sets for flavor tagging

Would love feedback + discussions!!

7 of 65

Outline

1. Symmetries in architectures

2. Symmetries in inputs

3. Other examples


  • Convolutional Neural Networks
    • Translation invariance

  • Particle Flow Networks ≡ Deep Set
    • Permutation invariance

DIPS: Deep Impact Parameter Sets

8 of 65

CNNs

Translational invariance


Strong inductive bias: domain knowledge we inject so that training converges efficiently to a solution.

Graphic adapted from Stanford CS 231n Lecture 5: CNNs
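A quick numerical check of this inductive bias (a sketch with a random, untrained layer; circular padding is assumed so the symmetry is exact): convolving commutes with translating the input.

```python
import torch
from torch import nn

conv = nn.Conv2d(1, 8, kernel_size=3, padding=1,
                 padding_mode='circular', bias=False)
x = torch.randn(1, 1, 32, 32)

# Translate then convolve vs convolve then translate: same answer
a = conv(torch.roll(x, shifts=2, dims=-1))
b = torch.roll(conv(x), shifts=2, dims=-1)
print(torch.allclose(a, b, atol=1e-5))  # True: translation equivariance
```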

9 of 65

Example: Deep image prior

The only regularization is the CNN architecture.

Use SGD to find Ө by minimizing ‖fӨ(x0) − x0‖² on a single image.

Task: super resolution

[Diagram: corrupted image x0 → CNN → high-res image x = fӨ(x0)]

Ө* = argminӨ ‖fӨ(x0) − x0‖², x = fӨ*(x0)

1711.10925

Shows the power of the imposed symmetry even in the absence of multiple examples for learning.

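A hedged sketch of the single-image fit (schematic only: the paper uses an encoder-decoder network with a fixed noise input and task-specific losses; here a tiny CNN and a plain reconstruction loss stand in):

```python
import torch
from torch import nn

# Stand-in for the paper's encoder-decoder CNN
f_theta = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)

x0 = torch.rand(1, 3, 64, 64)   # the single corrupted image
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-2)

for _ in range(500):  # fit ONE image; the CNN architecture is the only prior
    loss = ((f_theta(x0) - x0) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

x = f_theta(x0)       # stop early to read out the restored image
```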

10 of 65

References


1806.01261

Encoding symmetries into your network can help solve the problem faster

11 of 65

References


1806.01261; 1703.06114

NN designed to operate on sets

Set: Collection of objects without any specified order

Example: # of colored balls in a bag

12 of 65

Where does order matter?

Natural language


(1) Mary likes John

(2) John likes Mary

Same words… the order changes the meaning.


13 of 65

References


1806.01261; 1703.06114; 1810.05165

Permutation invariance over the tracks

14 of 65

Input representation


Eur. Phys. J. C 73 (2013) 2304

Collection of tracks:

Xi, i ∈ {1, …, n}

Each track has features:

Xi ∈ ℝm

Jet has labels Y

  • Quark / gluon tagging: Y ∈ {q, g}
  • Higgs tagging: Y ∈ {H, top, QCD}
  • Top tagging: Y ∈ {top, QCD}
  • b-tagging: Y ∈ {b, c, l}

Want: p(Y | X1, … , Xn)
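Since n varies from jet to jet, a common way to store such a set (an assumed convention for illustration, not ATLAS production code) is a zero-padded array plus a validity mask:

```python
import numpy as np

M = 15            # features per track
MAX_TRACKS = 40   # pad/truncate every jet to this length

def pad_jet(tracks):
    """tracks: (n, M) array for one jet -> padded (MAX_TRACKS, M) + mask."""
    n = min(len(tracks), MAX_TRACKS)
    X = np.zeros((MAX_TRACKS, M), dtype=np.float32)
    X[:n] = tracks[:n]
    mask = np.zeros(MAX_TRACKS, dtype=bool)
    mask[:n] = True   # True for real tracks, False for padding
    return X, mask

X, mask = pad_jet(np.random.randn(7, M))
```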

15 of 65

What is a NN?

A function approximator

[Diagram: inputs x → NN → outputs y = fӨ(x)]

1810.05165

16 of 65

What is a Deep Set?

A special type of function approximator

1810.05165

[Diagram: inputs {X1, X2, …, Xn} → NN → output y = fӨ(X)]


18 of 65

What is a Deep Set?

A special type of function approximator

NN: track feature extractor, applied to each track

1810.05165

[Diagram: X1, X2, …, Xn → Φ(X1), Φ(X2), …, Φ(Xn) → ? → y]

19 of 65

What is a Deep Set?

A special type of function approximator

NN track feature extractor, then a sum: a permutation-invariant operation

1810.05165

[Diagram: X1, X2, …, Xn → Φ(X1) + Φ(X2) + … + Φ(Xn) → ? → y]

20 of 65

What is a Deep Set?

A special type of function approximator

NN track feature extractor, a permutation-invariant sum, then an NN jet feature extractor that models correlations between the tracks

1810.05165

[Diagram: y = F( Φ(X1) + Φ(X2) + … + Φ(Xn) )]
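In code the whole architecture is compact. A minimal sketch of the Deep Sets idea (not the DIPS production model): Φ applied per track, a masked sum, then F on the pooled features.

```python
import torch
from torch import nn

class DeepSet(nn.Module):
    def __init__(self, n_feat=15, n_hidden=64, n_classes=3):
        super().__init__()
        # Phi: the track feature extractor, shared across all tracks
        self.phi = nn.Sequential(
            nn.Linear(n_feat, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        # F: the jet feature extractor, models track correlations
        self.F = nn.Sequential(
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_classes),
        )

    def forward(self, X, mask):
        # X: (batch, n_tracks, n_feat), mask: (batch, n_tracks) bool
        h = self.phi(X) * mask.unsqueeze(-1)  # zero out padded tracks
        return self.F(h.sum(dim=1))           # permutation-invariant sum

model = DeepSet()
logits = model(torch.randn(8, 40, 15), torch.ones(8, 40, dtype=torch.bool))
```

Because the sum ignores ordering, shuffling the tracks of a jet leaves the output unchanged by construction.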

21 of 65

Deep Sets: for b-tagging

Same architecture, now with b-tagging outputs:

[Diagram: y = F( Φ(X1) + Φ(X2) + … + Φ(Xn) ) ∈ {b, c, l}]

22 of 65

Input features

Tracks

High dimensional

ATL-PHYS-PUB-2020-014

Xi = (impact parameters, track kinematics, track quality) ∈ ℝ15

23 of 65

b-tagging

Key variable: Impact Parameter


24 of 65

25 of 65

DIPS

In the energy flow formalism


ATL-PHYS-PUB-2020-014

26 of 65

What do Deep Sets replace?

Recurrent neural network

Account for correlations between tracks

Challenge: ordering


27 of 65

Why are Deep Sets a nice representation?

Faster learning


28 of 65

How we evaluate the performance


b-jet efficiency = (# b-jets passing the threshold) / (# b-jets)

Background rejection = 1 / (background efficiency)
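Concretely (a sketch with hypothetical score arrays), both metrics are counting ratios once the score threshold is fixed:

```python
import numpy as np

def eff_and_rejection(scores_b, scores_bkg, b_eff_target=0.77):
    """Cut on the tagger score at a fixed b-jet efficiency working point."""
    # threshold chosen so the requested fraction of b-jets passes
    threshold = np.quantile(scores_b, 1.0 - b_eff_target)
    b_eff = np.mean(scores_b > threshold)      # ~ b_eff_target
    bkg_eff = np.mean(scores_bkg > threshold)  # guard against 0 in practice
    return b_eff, 1.0 / bkg_eff                # rejection = 1 / efficiency

b_eff, rejection = eff_and_rejection(np.random.beta(5, 2, 10_000),
                                     np.random.beta(2, 5, 10_000))
```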

29 of 65

Performance comparison


30 of 65

Lets us have a faster turnaround time for PHYSICS


31 of 65

Outline

1. Symmetries in architectures

2. Symmetries in inputs

3. Other examples

32 of 65

Input processing: ΔR

Ex: b-tagging input features

[Figure: the input feature vector Xi, highlighting tracks failing IP cuts]

Training with ΔR(trk, jet) encodes a symmetry in the input representation, making the features easier to learn.
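For reference, a small helper computing this relative coordinate (a sketch; the azimuthal difference needs wrapping into [−π, π)):

```python
import numpy as np

def delta_r(eta_trk, phi_trk, eta_jet, phi_jet):
    """DeltaR between a track and the jet axis."""
    d_eta = eta_trk - eta_jet
    # wrap the azimuthal difference into [-pi, pi)
    d_phi = (phi_trk - phi_jet + np.pi) % (2 * np.pi) - np.pi
    return np.hypot(d_eta, d_phi)
```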

33 of 65

Input processing: logging variables

Ex: b-tagging input features


pTfrac ≡ pTtrk / pTjet

[Plots: pTfrac and ΔR(trk, jet), their logs, and the scaled logs]

Power-law distributions become bell-shaped after taking the log.

Normalizing keeps the inputs in the sensitive range of the activation functions.
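The preprocessing itself is two lines per variable (a sketch; in a real training the mean and standard deviation would be computed once on the training set and reused):

```python
import numpy as np

def log_and_scale(x, eps=1e-8):
    """Power-law variable -> roughly bell-shaped -> zero mean, unit variance."""
    z = np.log(x + eps)
    return (z - z.mean()) / z.std()

pt_frac_scaled = log_and_scale(np.random.pareto(3.0, 10_000))  # toy power law
```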

34 of 65

Input processing: logging variables

Ex: b-tagging input features


Train pTfrac and ΔR

Train log pTfrac and log ΔR

How does this help? 20% speed up in training time!

35 of 65

Input processing: HH -> 4b background modeling

Problem setup


p(mhh, mh1, mh2) = p(mhh | mh1, mh2) p(mh1, mh2)

Model the Higgs candidate kinematics to reconstruct mhh

36 of 65

Input processing: HH -> 4b background modeling


Azimuthal symmetry baked into the model: transform ΔΦHH → log(π − ΔΦHH)

Solutions with constant ΔΦHH give the same mHH.

Concerned about the modeling because the network has no way to know that −π and π are the same point.

Conditioning on mh1, mh2 → 6 variables left to model (5 after exploiting the azimuthal symmetry)

37 of 65

Input processing: HH -> 4b background modeling


The right transformations can help learn complicated distributions.

38 of 65

Trick: batch norm

What about the features in the hidden layers?


“You want zero-mean, unit-variance activations? Just make them so!”

1502.03167

Stanford's CS 231n Lecture 6

At training time, normalize over the activations of the minibatch

Usually inserted after Fully Connected layers, and before nonlinearity.

Nice regularization technique to have in your toolkit!

Additional learnable parameters for the scale and shift: γ and β.
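Written out by hand, the training-time forward pass is short (a sketch; in practice nn.BatchNorm1d also tracks running statistics for inference):

```python
import torch

def batch_norm(h, gamma, beta, eps=1e-5):
    """Normalize each feature over the minibatch, then rescale."""
    mu = h.mean(dim=0)                  # per-feature minibatch mean
    var = h.var(dim=0, unbiased=False)  # per-feature minibatch variance
    h_hat = (h - mu) / torch.sqrt(var + eps)
    return gamma * h_hat + beta         # learnable scale gamma, shift beta

h = torch.randn(128, 64)     # activations after a fully connected layer
out = batch_norm(h, torch.ones(64), torch.zeros(64))  # then the nonlinearity
```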

39 of 65

Outline

1. Symmetries in architectures

2. Symmetries in inputs

3. Other examples

40 of 65

DeXTer: Deep Set Xbb Tagger

Low pT X->bb tagging

ATL-COM-PHYS-2022-550

Currently in ATLAS circulation


[Diagrams: signal and background topologies]

Two Deep Sets!! A tracks Deep Set and a vertices Deep Set.

Low ma leads to merged b-jets; the pseudo-scalar a’s pT is too low for the boosted Higgs calibration.

41 of 65

Deep Sets: ttH -> bb event classification


Slide from the Higgs report at the ML in analysis meeting

42 of 65

Transformers


“Any time you have a deep set you can substitute in a transformer as it’s always more expressive.”

~ Peter Battaglia (or what I remembered of the quote)

Sequence: [Mary, likes, John]

Set: [ {0, Mary} , {1, likes} , {2, John} ]

Transformer architecture

Replace the Deep Sets Sum operation with a weighted sum (can still be permutation invariant)

1706.03762

[Diagram: attention computes a weight for each input, then takes a weighted sum over the inputs]

Standard architecture for Natural Language Processing (replacing RNNs): faster training -> better optimization.

And then… stack them up!
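The swap from a plain sum to a weighted sum can be sketched in a few lines (a minimal attention-pooling layer, not the full multi-head self-attention of 1706.03762):

```python
import torch
from torch import nn

class AttentionPool(nn.Module):
    """Permutation-invariant weighted sum; the weights depend on the inputs."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # computes a weight per element

    def forward(self, h, mask):
        # h: (batch, n, dim), mask: (batch, n) bool
        w = self.score(h).squeeze(-1)
        w = w.masked_fill(~mask, float('-inf'))  # ignore padded slots
        a = torch.softmax(w, dim=1)              # weights sum to 1 per set
        return (a.unsqueeze(-1) * h).sum(dim=1)  # weighted sum over inputs

pool = AttentionPool(64)
out = pool(torch.randn(8, 40, 64), torch.ones(8, 40, dtype=torch.bool))
```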

43 of 65

Transformers: HH->4b combinatorial optimization

Goal: Reconstruct jets into Higgs Candidates


Interpret the transformer attention weights between the jets as the probability of coming from the same Higgs.

Another permutation-invariant operation, a different application.

44 of 65

In summary

  • Deep Sets: permutation invariance

  • It can be just as important to “bake” a symmetry into a model

  • Highlighted other use cases in HEP-ex.


  • Use case: FTAG
  • 4x speed-up in training time
  • Faster R&D

45 of 65

BACKUP

Table of Contents:

Questions and answers

RNNs vs Deep Sets

46 of 65

Example: Deep image prior

Task: super resolution

Given a corrupted image x0 reconstruct a higher resolution truth image

Just the inductive bias of the CNN architecture is sufficient to obtain good super-resolution performance, using SGD to optimize the parameters on a single image.


1711.10925

47 of 65

Inputs

ATL-PHYS-PUB-2020-014


48 of 65

IP3D limitations


ATL-PHYS-PUB-2017-003

49 of 65

Recurrent neural networks


Graphic adapted from Stanford CS 231n Lecture 10: RNNs

50 of 65

RNNIP


ATL-PHYS-PUB-2017-003

51 of 65

Where are the improvements coming from?


ATL-PHYS-PUB-2017-003

52 of 65

RNNIP


ATL-PHYS-PUB-2017-003

53 of 65

Where are the improvements coming from?


ATL-PHYS-PUB-2017-003

54 of 65

Saliency maps

What has DIPS learned about b-jets?


1312.6034

55 of 65

Saliency definition

What has DIPS learned about b-jets?


ATL-PHYS-PUB-2020-014

[Saliency maps: b-jets with 8 tracks failing the 77% b-tagging working point, tracks ordered from largest to smallest sd0. ATLAS Simulation Preliminary, √s = 13 TeV, tt̄]

  • Consider b-jets failing the 77% WP

  • Average over jets with 8 tracks
  • Sort the tracks by sd0 for the average

56 of 65


57 of 65

Saliency maps in mathematics applications

A way to analyze the importance of the inputs


Key idea: Use saliency maps to postulate new conjectures which could then become new math theorems!

58 of 65

Impact of more data


Alex Foch’s QT summary

Recomm. RNNIP: Run 2 recommendation

Reference DIPS: DIPS model from the PUB note, with the optimized track selection

DIPS Default: Alex F.’s retraining of DIPS with many more jets

DIPS Loose: Alex F.’s retraining of DIPS with many more jets + the optimized track selection

59 of 65

Retraining the high level tagger


Alex Foch’s QT summary

60 of 65

Transformers: HH->4b combinatorial optimization


Q: What extra information is the transformer using?

A: Momentum conservation to help with jet selection!

61 of 65

Transformers: HH->4b combinatorial optimization


Training

  • Enumerate over the valid pairings
  • Maximize the probability of the correct pairing
    • cross entropy loss (sketched below)
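A sketch of this objective (hypothetical shapes; for 4 jets there are 3 valid ways to split them into two Higgs candidates): score each pairing from the attention weights and take a cross entropy over the options.

```python
import torch
import torch.nn.functional as F

# The 3 valid ways to group 4 jets into two Higgs candidates
PAIRINGS = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]

def pairing_loss(attn, target):
    """attn: (4, 4) jet-jet attention weights; target: correct pairing index."""
    scores = torch.stack([attn[a, b] + attn[c, d]
                          for (a, b), (c, d) in PAIRINGS])
    # maximize the probability of the correct pairing
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([target]))

loss = pairing_loss(torch.rand(4, 4), target=0)
```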

62 of 65

4b: extra figures


63 of 65

Input processing: HH -> 4b background modeling

Variable transformations


64 of 65

Invariance vs Equivariance

Equivariance: a function f is equivariant under a group G if for every g ∈ G there exist operators Tg and Sg representing g such that:

Tg(f(x)) = f(Sg(x))

Example: convolutions

Invariance: a function f is invariant under a symmetry group G if:

f(Tg(x)) = f(x)

2105.09016

65 of 65

CNNs in HEP / ATLAS


Jet images (1511.05190)

PU mitigation with event images (ATL-PHYS-PUB-2019-028)

Top tagging (ATL-COM-PHYS-2022-275)