1 of 49

A Tutorial on Differentiable Analysis & end-to-end learning

Nathan Simpson

PyHEP, 15/09/22

2 of 49

Two software libraries, built with JAX:

A suite of differentiable operations designed to target typical HEP use cases: https://github.com/gradhep/relaxed

A method for optimizing observables in an end-to-end way, incorporating systematics: https://github.com/gradhep/neos

Also: a software package that implements helper functions for this use case

3 of 49


4 of 49

Tangent:

how do neural networks learn at all?


5 of 49

[diagram: a single neural-network unit]

data → activation(weight * data + bias)

"parameters" (φ): the learnable free parameters (here, weight and bias)

"architecture": how you combine the parameters with the data

6 of 49

[diagram: data and parameters φi feed into a workflow, producing a result. But where does the feedback come from?]

7 of 49

[diagram: result = workflow(data, φi), scored by objective(result), a scalar representing how good our result is. We want to minimise it.]

8 of 49

[diagram: same setup. Minimising the objective requires an update rule for φ. What is it?]

9 of 49

Update rule: gradient descent

result = workflow(data, φi)

φi+1 = φi - lr * d(workflow(data, φi))/dφi

lr: the step size (the "learning rate")

d(workflow(data, φi))/dφi: the gradient of the workflow w.r.t. the current parameters

A positive gradient gives a negative step: we're trying to roll downhill in parameter space!
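As a concrete sketch, the update rule above can be run by hand on a toy objective. All names here are illustrative, and the gradient is written out analytically rather than obtained by autodiff:

```python
def workflow(data, phi):
    # toy end-to-end "workflow": here just a mean-squared-error objective
    return sum((x - phi) ** 2 for x in data) / len(data)

def gradient(data, phi):
    # analytic derivative of the toy workflow w.r.t. phi
    return sum(-2.0 * (x - phi) for x in data) / len(data)

def gradient_descent(data, phi, lr=0.1, steps=100):
    for _ in range(steps):
        # phi_{i+1} = phi_i - lr * d(workflow)/d(phi_i): roll downhill
        phi = phi - lr * gradient(data, phi)
    return phi

data = [1.0, 2.0, 3.0]
phi_star = gradient_descent(data, phi=0.0)
# phi_star converges toward the minimiser of this objective: the data mean, 2.0
```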

10 of 49

We don’t need neural networks to do this!


(but they are often quite useful, so you’ll see some more later)

11 of 49

Same thing with a straight line: y = mx + c

φi = {m, c}

result = workflow(data, φi)

φi+1 = φi - lr * d(workflow(data, φi))/dφi

It still works, as long as we can calculate this gradient!

e.g. for 2D data: data on the left of the line = signal, on the right = background

Hard to say where "model" ends and "objective" begins.
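A minimal sketch of such a straight-line classifier, assuming a sigmoid "which side of the line" score and a numerical gradient standing in for autodiff (all names and data are illustrative):

```python
import math

def predict(point, m, c):
    # soft "which side of the line y = m*x + c" score:
    # sigmoid of the signed vertical distance; > 0.5 means below the line
    x, y = point
    return 1.0 / (1.0 + math.exp(-(m * x + c - y)))

def objective(data, labels, m, c):
    # binary cross-entropy between the soft scores and the labels
    eps = 1e-9
    total = 0.0
    for point, label in zip(data, labels):
        p = predict(point, m, c)
        total -= label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total / len(data)

def grad(f, params, h=1e-6):
    # central-difference numerical gradient; a stand-in for autodiff
    out = []
    for i in range(len(params)):
        up = list(params); up[i] += h
        dn = list(params); dn[i] -= h
        out.append((f(*up) - f(*dn)) / (2 * h))
    return out

# toy 2D data: label 1 ("signal") below the separating line, 0 ("background") above
data = [(0.0, 1.0), (1.0, 2.5), (2.0, 4.5), (0.0, -1.0), (1.0, -0.5), (2.0, 0.0)]
labels = [0, 0, 0, 1, 1, 1]

m, c = 0.0, 0.0
for _ in range(500):
    gm, gc = grad(lambda m_, c_: objective(data, labels, m_, c_), [m, c])
    m, c = m - 0.5 * gm, c - 0.5 * gc  # gradient-descent update
```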

12 of 49

Idea:

Using gradient descent, we can optimise any workflow parameters with respect to any goal… *if* the full workflow is differentiable.


13 of 49

A typical HEP analysis workflow

More abstractly: step with free parameters (e.g. event selection)

21 of 49

In equation form (chain rule):

d(objective)/dφ = d(objective)/d(result) * d(result)/dφ

But wait, this is all code, right? How do we differentiate a computer program?

22 of 49

How might we get those gradients?

Automatic differentiation!

Quick explanation:

  • Any program can be broken down into a series of primitive operations (+, -, /, *, log, exp…)
  • These have known derivatives!
  • Can then compose these derivatives via the chain rule to get the gradient of the whole program!

exact, efficient gradients


Thanks to deep learning’s prominence, we have many great software libraries [JAX, PyTorch, TensorFlow] that take advantage of hardware acceleration (GPUs, TPUs)

Img source: AskPython.com
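To make the "compose known derivatives" idea concrete, here is a minimal forward-mode autodiff sketch using dual numbers. Real libraries like JAX are far more general (and add reverse mode); all names here are illustrative:

```python
import math

class Dual:
    """Dual number (value, derivative): the basis of forward-mode autodiff."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule for the derivative part
        return Dual(self.val * other.val,
                    self.der * other.val + self.val * other.der)
    __rmul__ = __mul__

def exp(x):
    # chain rule: d/dx exp(f) = exp(f) * f'
    return Dual(math.exp(x.val), math.exp(x.val) * x.der)

def f(x):
    return exp(x * x) + 3 * x  # some composite "program"

x = Dual(1.0, 1.0)  # seed the derivative dx/dx = 1
y = f(x)
# y.val = e + 3, and y.der = 2e + 3, since d/dx exp(x^2) = 2x * exp(x^2)
```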

23 of 49

So is that it?

We code up our analysis in PyTorch and fit the model?

…not quite :)

Not all operations can be broken into differentiable primitives, because not all operations are differentiable!

Need to figure out a way to “relax” some operations to allow us to take their gradients.


Pictured: One very discrete boi.

24 of 49

Every step of the workflow needs to be differentiable!


already differentiable

not necessarily differentiable a priori

=> Let’s change that!


27 of 49

In one slide: Making analysis differentiable

Example: histograms [very discrete!]

We developed a histogram alternative using kernel density estimates (KDEs). [already used in HEP!]*

Integrating the KDE over a set of intervals gives the notion of "bins". => Binned KDE (bKDE)

Also have:

  • differentiable cuts (sigmoid relaxation)
  • differentiable likelihood-building through pyhf
  • differentiable fitting by exploiting the implicit function theorem

*See Kyle Cranmer’s (heavily cited) paper on this: arxiv.org/abs/hep-ex/0011057
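A rough, self-contained sketch of the two relaxations above, an illustration rather than relaxed's actual API: a bKDE built from Gaussian kernel CDFs, and a sigmoid-relaxed cut (all function names here are made up):

```python
import math

def norm_cdf(x, mu, sigma):
    # CDF of a Gaussian kernel centred on an event value
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def bkde(events, edges, bandwidth=0.2):
    # "binned KDE": each event spreads its kernel's probability mass across
    # the bins, instead of an all-or-nothing count. The bin contents are then
    # smooth in the event values, hence differentiable.
    counts = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        counts.append(sum(norm_cdf(hi, x, bandwidth) - norm_cdf(lo, x, bandwidth)
                          for x in events))
    return counts

def soft_cut(events, threshold, slope=10.0):
    # sigmoid relaxation of the step function "x > threshold":
    # returns per-event weights in (0, 1) instead of a hard 0/1 decision
    return [1.0 / (1.0 + math.exp(-slope * (x - threshold))) for x in events]

events = [0.1, 0.4, 0.45, 0.8]
hist = bkde(events, edges=[0.0, 0.5, 1.0])
# most of the kernel mass sits in the first bin, a little leaks outside [0, 1]
```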

28 of 49

Now time for some code!


29 of 49

What makes a good observable?

Searches for new physics endeavour to maximally discriminate simulated signal data from background processes.

But is this really what we want?

[plot: signal vs. background distributions of a 1-D observable, e.g. a neural network with 1-D output, trained to minimize binary cross-entropy]

30 of 49


e.g. what happens when we include systematic variations of the signal/background?

  • Not guaranteed to produce a sensitive observable for all templates!
  • Observable knows nothing about how we model + profile over the uncertainty!


31 of 49

“(...) sensitivity to high-level physics questions must account for systematic uncertainties, which involve a nonlinear trade-off between the typical machine learning performance metrics and the systematic uncertainty estimates.”

Deep Learning and Its Application to LHC Physics, section 3.1,

D. Guest, K. Cranmer, D. Whiteson, 2018

arxiv.org/abs/1806.11484


(emphasis not in original text)

32 of 49

Can we learn to incorporate systematics?


33 of 49

Idea 2:

We can directly optimise the discovery significance/CLs of our analysis this way!

-> Systematics-aware [via profiling]


34 of 49

Oh baby it’s code time!


35 of 49

That’s it!

and thanks for listening!

If you want to:

> discuss more about this in any way

> have an interesting use case

> talk about future opportunities

> send me pet images

please reach out! email: n.s@cern.ch

I’d love to hear from you :)


one of my cats, enjoying the homely comfort of the washing machine

This work was partially supported by the Insights ITN, funded by the European Union’s Horizon 2020 research and innovation programme, call H2020-MSCA-ITN-2017, under Grant Agreement n. 765710.

Work now supported by the Swedish Science Council and Lund University directly.

(until November)

36 of 49

Seeing it in practice


37 of 49

Toy example: 1-bin counting experiment

s = 15 + φ

b = 45 − 2φ

σb = 1 + (φ/5)²

[plot: increasing φ increases the s/b ratio, but also increases the uncertainty on the background]
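One way to see the trade-off numerically. The figure of merit below, Z = s / sqrt(b + σb²), is an assumption for illustration, a common rough approximation that folds the background uncertainty into the denominator; the talk's actual curve may use a different significance definition:

```python
import math

def expected_yields(phi):
    # the toy model from this slide
    s = 15.0 + phi
    b = 45.0 - 2.0 * phi
    sigma_b = 1.0 + (phi / 5.0) ** 2
    return s, b, sigma_b

def significance(phi):
    # hypothetical figure of merit: Z = s / sqrt(b + sigma_b^2),
    # a rough approximation folding the background uncertainty
    # into the denominator
    s, b, sigma_b = expected_yields(phi)
    return s / math.sqrt(b + sigma_b ** 2)

# brute-force scan over phi (a gradient-based optimiser would find the same
# optimum); keep phi in [0, 20] so the background yield stays positive
phis = [i * 0.01 for i in range(0, 2001)]
best_phi = max(phis, key=significance)
# the optimum trades off the rising s/b against the growing uncertainty
```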

38 of 49

Learning to discover: 1-bin example


We’re able to recover the optimal significance in our toy problem!

Intuitively, we’re trading off uncertainty and s/b ratio in order to give the best result.

for pdf viewers: optimisation with respect to significance is able to find the optimal significance accounting for uncertainty (minimum of blue curve)

39 of 49

Optimising a neural network observable (neos)


for pdf viewers: neural network contours wrap around the signal blob, but also balance the background variations to minimise uncertainty.

40 of 49

Optimising a neural network observable (neos)


neos gets better CLs than all the other methods we tried!

additional plots that show:

> cross-section uncertainty is also optimised for free

> no over/underconstraint of nuisance parameter

41 of 49

Optimising a neural network observable (neos)


More fun details and context in our preprint! :)

In collaboration with Lukas Heinrich:

arxiv.org/abs/2203.05570

42 of 49

You can optimize anything!


https://github.com/gradhep/relaxed

binning!

cuts!

43 of 49

Backup


44 of 49

You want to know how it scales!

Me too!

IRIS-HEP is very interested in this, and plans to support it for the “Analysis grand challenge” on open data, but may need more personpower.

Very much open to collaboration on any use case!


example concerns:

> the batch size may need to be large enough to represent the full analysis (so could require much more VRAM than the usual approach)

> every minibatch update = one run of the analysis, so may need lots more compute (but GPUs + autodiff are very powerful!)

45 of 49

Discovery significance (it still works!)


46 of 49

Differences between discovery p-value and CLs

[plots: training to optimize CLs vs. training to optimize the discovery p-value]


49 of 49

Which bandwidth to pick?

49