1 of 40

Introduction to Machine Learning

XIX Seminar on Software for Nuclear, Subnuclear and Applied Physics

June 6-10 2022

Andrea.Rizzi@unipi.it


2 of 40

Topics

  1. Introduction to machine learning (Wed 11:00 - 12:00)
    1. Basic concepts: loss, overfit, underfit
    2. Examples of linear regression, boosted decision trees
    3. Exercise with colab, numpy, scikit
  2. Deep Neural Networks (Thu 9:00-10:30)
    • Basic FeedForward networks and backpropagation
    • Importance of depth, gradient descent, optimizers
  3. Introduction to tools and first exercises (Thu 11:00-12:00)
  4. Convolutional and Recurrent networks (Fri 9:00-10:30)
    • Reduction of complexity with invariance: RNN and CNN
    • CNN exercise


3 of 40

Virtual Machine setup (you can launch it and let it run)

# run once

su

yum install rh-python38

exit

# run each time you want to start the python 3.8 environment

scl enable rh-python38 bash

#run once in the python 3.8 environment to install the ML stuff

pip install --user keras tensorflow numpy matplotlib opencv-python scikit-learn jupyter seaborn ipympl

#run a notebook

jupyter notebook

#then use firefox to open the link

#in case of error with matplotlib backend

pip uninstall matplotlib

pip install matplotlib --user

4 of 40

Quick Poll

  • How many of you…
    • ...know what an artificial neural network is?
    • ...used a multivariate / machine learning / neural network / BDT analysis technique?
    • ...know what a linear regression is? and what PCA is?


5 of 40

Introduction on these ML lectures

These lectures are based on the ML lectures of the Unipi course “Computing Methods in Experimental Physics and Data Analysis”. The slides are updated every year, as this is an active research field and new ideas emerge every year; if interested, you should learn to stay tuned (read, study, update your knowledge).

Suggested books (both having free online versions)

  • “The Hundred-Page Machine Learning Book”, Andriy Burkov
    • Can be read almost from A to Z
    • http://themlbook.com/wiki/doku.php
  • “The Deep Learning Book”, Goodfellow, Bengio, Courville (MIT Press)
    • More like a reference book, read individual chapters
    • https://www.deeplearningbook.org/


6 of 40

Machine learning

A possible definition (from wikipedia):

  • Replace “programmers” with computer programs

  • Learn from examples (“training” phase)

Multiple applications, for example:

  • Image and speech processing
  • Agents able to play chess or go, or to drive cars
  • Detect anomalies (e.g. in credit card usage)
  • Applications for research in various scientific fields
    • E.g. physics!


7 of 40

In experimental and applied physics

examples are everywhere..

  • Particle identification and kinematic measurement
  • Signal to background discrimination (BDT and DNN are very popular in HEP experiments)
  • Computer assisted processing of medical exams (ECG, CT, etc...)
  • Processing of astrophysics data

… and your ideas! This is a growing field ...


(Figure: PRL paper on the observation of Higgs → bb; 4 different ML algorithms are used for different tasks in this analysis)

8 of 40

Computing in High Energy Physics Conference

  • Machine learning for QCD theory and data analysis
  • BESIII drift chamber tracking with machine learning
  • FPGA-accelerated machine learning inference as a service for particle physics computing
  • Constraining effective field theories with machine learning
  • Fast simulation methods in ATLAS: from classical to generative models
  • Using ML to Speed Up New and Upgrade Detector Studies
  • The Tracking Machine Learning Challenge
  • Particle Reconstruction with Graph Networks for irregular detector geometries
  • ...42 contributions with “Machine Learning” in the title/abstract


9 of 40

Real time alerts and automatic telescope pointing


10 of 40

Machine Learning basics

(or the “dictionary” for next lectures)


11 of 40

Types of typical ML problems

  • Classification: which category a given input belongs to.
  • Regression: value of a real variable given the input.
  • Clustering: group similar samples
  • Anomaly detection: identify inputs that are different from others
  • Generation/synthesis of samples: produce new samples, similar to the original data, starting from noise/random numbers
  • Denoising: remove noise from an input dataset
  • Transcription: describe the input data in some language
  • Translations: translate between languages
  • Encoding and decoding: transform input data to a different representation
  • ...many more...


12 of 40

Function approximation

  • The goal of a ML algorithm is to approximate an unknown function (often related to some Probability Density Function of the data) given some example data
  • The function is often f(x): R^n -> R^m (often m=1)
  • In classification we try to approximate the probability that an example, with inputs represented as a vector x, belongs to a given category y (e.g. the probability to be a LHC Higgs signal event vs a Standard Model background one)
  • In regression we approximate the function that, given the inputs (x), returns the value of the variable to predict (y) (e.g. given the data read from some particle detectors, estimate the particle energy)


(Figures: a classification example in the (x1, x2) feature plane, and a regression example y = f(x) as a function of x1)

We usually call “x” the inputs or features

We usually call “y” the output or target

13 of 40

Model and Hyper-parameters

  • A model for the functions that can be used to approximate the “f(x)” must be specified. The model can be something simple (e.g. sum of polynomials up to degree N) or more complex (e.g. all the functions that could be coded in M lines of C++)
  • Different ML techniques are based on different “models”
    • Each technique (“class of model”) further allows one to specify the exact model
    • The parameters describing the exact model are called “hyper-parameters” (e.g. the degree N of the polynomial, or the maximum number M of C++ lines, can be considered hyper-parameters)
  • We will see examples of techniques with different models and complexity:
    • Linear regression
    • Decision trees
    • Principal Component Analysis
    • Nearest Neighbour
    • Artificial Neural Networks


14 of 40

Parameters

  • A specific model typically has parameters (e.g. the coefficients of the polynomials or the actual characters of the 10 lines of C++).
  • Parameters are learned in the “training phase”.
  • Different models, or the same model with different hyper-parameter settings, have a different number of degrees of freedom (n.d.o.f.) in the parameter phase space

y(x) = ax + bx^2 + cx^3 + d (a,b,c,d are the parameters)
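
As a minimal illustration (hypothetical data; numpy is part of the software stack used in the exercises), the parameters a, b, c, d of the cubic model above can be learned from examples with a least-squares fit:

```python
import numpy as np

# Hypothetical training examples: a noisy cubic relation y = f(x)
rng = np.random.default_rng(42)
x = np.linspace(-2, 2, 50)
y = 0.7 + 0.5 * x + 1.2 * x**2 - 0.3 * x**3 + rng.normal(0, 0.2, size=x.shape)

# "Training": learn the parameters by least squares
# np.polyfit returns the coefficients from the highest degree down
c3, c2, c1, c0 = np.polyfit(x, y, deg=3)
print(f"learned parameters: a={c1:.2f}, b={c2:.2f}, c={c3:.2f}, d={c0:.2f}")
```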


15 of 40

Objective function

  • A goal for what is “a good approximation” has to be defined
  • This is called the objective function (or loss function, or error function, …)
  • It is a function that returns a higher (or lower) value depending on how good or bad the approximation is
    • Loss functions have to be minimized
  • Examples of loss functions
    • Classification problems: binary cross entropy
    • Regression problems: Mean Squared Error (i.e. the chi2 with sigma=1; I hope you are not surprised by this choice!)

The process is not very different from a typical phys-lab1 chi2 fit… but the number of parameters can be several orders of magnitude larger (10^3 to 10^6)
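
As a concrete toy example, the MSE loss for a regression problem takes a few lines of numpy (the values below are made up):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: a chi2 with sigma=1, averaged over the examples."""
    return np.mean((y_true - y_pred) ** 2)

# Made-up targets and predictions
print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))
```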


16 of 40

Objective function: binary cross entropy

  • In classification problems the function to approximate is typically R^n -> [0,1]
    • Where, for example, 0 means background and 1 means signal
  • The binary cross entropy is defined as follows (ŷ is the output of the classifier):

    L(y, ŷ) = -[ y·log(ŷ) + (1-y)·log(1-ŷ) ]

  • The above function has a large value when an example with y=1 is classified as ŷ ≈ 0, and no loss when ŷ ≈ 1
    • Vice versa if y=0 …
  • By minimizing the binary cross-entropy we maximize the likelihood of a process with 0/1 outcome (where the output of the classifier is interpreted as a probability)
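
A minimal numpy sketch of the binary cross entropy above (the predictions are clipped to avoid log(0); the numbers are made up):

```python
import numpy as np

def binary_cross_entropy(y_true, y_hat, eps=1e-7):
    """Average of -[ y*log(y_hat) + (1-y)*log(1-y_hat) ] over the examples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

# y=1 classified as ~1 gives almost no loss, y=1 classified as ~0 a large one
print(binary_cross_entropy(np.array([1.0]), np.array([0.99])))  # small
print(binary_cross_entropy(np.array([1.0]), np.array([0.01])))  # large
```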


17 of 40

Learning / Training

  • For a given model, and given set of hyper-parameters, how do we infer the parameters that minimize the objective function?
  • The idea of ML is to get the parameters from “data” in a so called “training” step
  • Each ML technique has a different approach to training
  • Different types of training
    • Supervised: i.e. for each example we know the correct answer
    • Unsupervised: we do not know “what is what”, we ask the ML algorithm to learn the probability density function of the examples in the features (i.e. the inputs!) space
    • Reinforcement learning: have agents playing a punishment/reward game


18 of 40

Supervised learning

  • We want to teach something we (the supervisors) already know (at least on the training samples)
  • For each example we need to have the “right answer” / “truth” , for example:
    • Labels telling if a given example is signal or background
    • Labels classifying the content of an image (multiple labels are possible)
    • Correct values of some quantity, e.g. generated energy of a particle in a detector simulation
    • More complex multi-label or multi-class classification
  • Samples can be labelled in various ways:
    • Humans labelling existing data
    • Data being “generated” from known functions (e.g. simulations)
  • Learn the probability of the label y, given the input x, i.e. P(y|x)


19 of 40

Unsupervised learning

  • Often we do not have labels (or we have labels only for a few data points)
  • Unsupervised learning techniques allow training models that can perform tasks similar to the supervised ones, e.g.
    • Classification of “common” patterns (clustering)
    • Dimensionality reduction, compression
    • Prediction of missing inputs
    • Anomaly detection

  • In practice learn the Probability Density Function of the data, independently of any “label” variable, i.e. P(x)


20 of 40

Supervised vs unsupervised

Supervised and unsupervised are not as different as one would imagine, in fact

  • P(x) can be seen as n supervised problems, one for each feature:

    P(x) = Π_i P(x_i | x_1, ..., x_{i-1})

  • P(y | x) can also be computed: if we treat y as an extra “x” in unsupervised learning and learn P(x, y), we can derive it as

    P(y | x) = P(x, y) / Σ_{y'} P(x, y')


21 of 40

Reinforcement learning

Applies to “agents” acting in an “environment” that updates their state

  • It is similar to supervised learning in that a “reward” has to be calculated
  • The supervisor, however, does not necessarily know the best action to perform in a given state to interact with the environment; it just computes the final reward
  • Learn to make the best decision in a given situation
    • The right move in a chess or go match
    • Driving a car in traffic
    • Etc..


22 of 40

Capacity and representational power

  • Different models (i.e. techniques / hyper-parameter values) allow representing different types of functions
  • Models with more free parameters typically can approximate a larger number of functions (or can better approximate a given function) => higher capacity
  • Remember: we do not know the actual function to approximate, we just want to learn from examples
  • With limited samples we have a tradeoff to handle:
    • accuracy in representation vs generalization of the results


23 of 40

Capacity and representational power

  • Underfitting: the sample is badly represented
  • Overfitting / Appropriate capacity are less obvious to define
    • Lack of “generalization” -> overfitting


24 of 40

Capacity and representational power

  • Underfitting: the sample is badly represented
  • Overfitting / Appropriate capacity are less obvious to define
    • Lack of “generalization” -> overfitting
    • The typical method is to check on an independent sample
      • Or just split your sample in two and use only half for training
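
A minimal scikit-learn sketch (toy data, names are illustrative) of checking generalization on an independent split: a very deep decision tree scores much better on the half used for training than on the held-out half, which is the signature of overfitting:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset: two overlapping Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (500, 2)), rng.normal(1.5, 1.0, (500, 2))])
y = np.array([0] * 500 + [1] * 500)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Fully grown tree: high capacity, prone to overfit
model = DecisionTreeClassifier(max_depth=None).fit(X_train, y_train)
print("training accuracy  :", model.score(X_train, y_train))  # close to 1
print("validation accuracy:", model.score(X_val, y_val))       # noticeably lower
```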


25 of 40

Generalization

  • We can compare the accuracy between the “training” sample and the “generalization/validation” sample

  • Bias/variance trade-off
    • y: function (with random noise)
    • h(x): approximated function
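
For squared-error loss this trade-off is captured by the standard textbook decomposition (assuming y = f(x) + noise with variance sigma^2):

E[ (y - h(x))^2 ] = ( f(x) - E[h(x)] )^2 + E[ (h(x) - E[h(x)])^2 ] + sigma^2 = bias^2 + variance + irreducible noise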


26 of 40

Regularization

In order to control the “generalization gap”

  • the objective function can be modified adding a regularization term
    • Introduce a “cost” in increasing the capacity of the model or in accessing some parts of the model-parameters space
  • the examples in the training dataset can be increased with augmentation techniques
    • Adding stochastic noise to existing examples
    • Transforming the existing examples with transformations under which the solution we look for is known to be invariant

https://xgboost.readthedocs.io/en/latest/tutorials/model.html
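
A minimal sketch (toy data, scikit-learn) of the first option: ridge regression adds an L2 penalty, alpha * Σ w_i^2, to the squared-error objective, making the large-weight regions of parameter space costly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Toy data: few noisy examples, many features (easy to overfit)
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 10))
y = X[:, 0] + 0.1 * rng.normal(size=20)

plain = LinearRegression().fit(X, y)
reg = Ridge(alpha=1.0).fit(X, y)  # minimizes squared error + alpha * sum(w_i^2)

print("unregularized sum|w|:", np.abs(plain.coef_).sum())
print("regularized   sum|w|:", np.abs(reg.coef_).sum())  # typically smaller
```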


27 of 40

Hyper-parameter (model) optimization

  • It is normal to have to test a few, if not several, configurations in the model hyper-parameter space
    • Scans of hyper-parameters are often performed
    • Different techniques used
  • Effectively a “second” minimization is done
    • First minimization is on the parameters => minimize on the “training dataset”
    • Second minimization is on the hyper-parameters => minimize on the “validation dataset”
  • A third dataset (“test dataset”) is then also needed
    • To assess the performance of the algorithm in an unbiased way
    • To make an unbiased prediction of the algorithm output
  • The original dataset is typically split into uneven parts to be used as training, validation and test
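
A minimal sketch (toy data) of the uneven three-way split; the 60/20/20 fractions are just an example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset standing in for the real one
X, y = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# parameters           -> fitted on (X_train, y_train)
# hyper-parameters     -> chosen from the score on (X_val, y_val)
# unbiased performance -> evaluated once on (X_test, y_test)
```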


28 of 40

K-folding cross validation

  • If the sample is statistically limited, splitting it into 3 chunks means losing examples
  • With K-folding, “K” independent trainings are performed, each using a different chunk of data for “testing” and the rest for “training” (and another one for validation if a hyper-parameter scan is performed)
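
A minimal scikit-learn sketch of K-fold cross validation with K=5 on toy data: five independent trainings, each evaluated on a different held-out fifth of the sample:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# 5 trainings, each scored on the chunk it did not see
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
print("per-fold accuracy:", scores, "mean:", scores.mean())
```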


29 of 40

Inference

  • A ML model that has been trained can then be used to act on some new data (or on the test dataset if a prediction has to be made)
  • The evaluation of the algorithm output on the “unseen” data is called inference
  • From a computing point of view inference is usually faster than training


30 of 40


(Figure: ROC curve and confusion matrix, with the definitions of accuracy, precision, sensitivity and specificity)
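
A minimal scikit-learn sketch of these evaluation tools, on made-up labels and classifier scores:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, roc_auc_score, roc_curve)

# Made-up true labels and classifier output scores
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])
y_pred = (y_score > 0.5).astype(int)  # threshold the score

print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class
print("accuracy   :", accuracy_score(y_true, y_pred))
print("precision  :", precision_score(y_true, y_pred))
print("sensitivity:", recall_score(y_true, y_pred))               # recall of the positive class
print("specificity:", recall_score(y_true, y_pred, pos_label=0))  # recall of the negative class
fpr, tpr, thresholds = roc_curve(y_true, y_score)                 # points of the ROC curve
print("AUC:", roc_auc_score(y_true, y_score))
```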

31 of 40

Examples of ML techniques


32 of 40

Linear regression

  • Solve a regression problem, i.e. predict the value of y when x is given
    • Approximate an unknown “y=f(x)” given some examples of (x,y)
  • Model: y = Σ_i w_i x_i , i.e. the function is a linear combination of the input features
  • Parameters: the w_i
  • Let’s suppose we have m examples in the form of pairs (x,y)_j
  • The objective function can be the mean squared error, MSE = (1/m) Σ_j | y_j - Σ_i w_i x_ij |^2
  • Learning: find the w_i that minimize the MSE on the given dataset
  • We could increase the capacity of the model using polynomials instead of linear functions
    • The number of parameters would increase, as we would now have the second order coefficients too


Supervised
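
A minimal sketch of linear regression on toy data (scikit-learn; names and numbers are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# m toy examples: y is a linear combination of the inputs plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, size=200)

# Learning: find the weights that minimize the MSE on the given dataset
model = LinearRegression().fit(X, y)
print("learned w:", model.coef_, "intercept:", model.intercept_)
print("MSE:", np.mean((model.predict(X) - y) ** 2))
```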

33 of 40

Principal Component Analysis (aka PCA)

  • Orthogonal transformation of the input phase space such that
    • The first transformed coordinate has maximum variance
    • The 2nd transformed coordinate has the 2nd largest variance
    • …etc...
  • Can be computed as the eigenvalue decomposition of the covariance matrix

  • Useful to transform the data into a normalized form (scaling by the variance of each component)
  • Reduce dimensionality (by taking only first N components)

More complex dimensionality reduction Manifold Learning: https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.10-Manifold-Learning.ipynb


Unsupervised
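
A minimal scikit-learn PCA sketch on toy correlated data, keeping only the first component for dimensionality reduction:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: two strongly correlated features
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = np.column_stack([x1, 0.8 * x1 + 0.2 * rng.normal(size=500)])

pca = PCA(n_components=2).fit(X)
print("explained variance ratio:", pca.explained_variance_ratio_)  # first component dominates

X_reduced = PCA(n_components=1).fit_transform(X)  # keep only the first principal component
print("reduced shape:", X_reduced.shape)          # (500, 1)
```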

34 of 40

Nearest neighbors

  • A very powerful way to do classification or regression is to look at the points in the training dataset that are close to the sample to evaluate
  • Multiple neighbors can be used for a more stable evaluation
  • On large datasets it can be a problem to keep all training points available for the evaluation phase
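
A minimal k-nearest-neighbors classification sketch with scikit-learn (toy data, k=5 neighbors):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Use 5 neighbors for a more stable evaluation; "training" essentially stores the points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```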


35 of 40

Decision trees

  • The functions used in the “model” are decision trees, each node has a pass/fail condition on some input variable
  • Classification and regression trees (CART)
    • Examples are categorized based on individual “cuts” on a single input feature
    • A score is given in each leaf
  • Trees can have different depths (depth is a hyper-parameter)
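
A minimal CART sketch with scikit-learn on toy data; max_depth is the depth hyper-parameter mentioned above, and the printout shows the per-feature cuts and the leaf scores:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Each node applies a pass/fail cut on a single input feature
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=[f"x{i}" for i in range(4)]))
```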


36 of 40

Ensembles of trees

  • A single tree typically does not perform very well
  • Combine multiple trees (the number of trees is a hyper-parameter)
    • Random forest (bagging)
    • Gradient boosting
    • Adaptive boosting
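
A minimal sketch comparing a bagged ensemble (random forest) with gradient boosting in scikit-learn; the data are toy and n_estimators is the number-of-trees hyper-parameter:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)     # bagging
boost = GradientBoostingClassifier(n_estimators=100, random_state=0)  # gradient boosting
print("random forest    :", cross_val_score(forest, X, y, cv=5).mean())
print("gradient boosting:", cross_val_score(boost, X, y, cv=5).mean())
```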



37 of 40

Limitations of decision trees

  • Cuts are axis aligned
  • Classification of x1 > x2 is a hard problem for a decision tree
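
An illustrative toy sketch of this limitation: with axis-aligned cuts only, a shallow tree approximates the diagonal boundary x1 > x2 poorly, while a linear model separates it almost exactly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] > X[:, 1]).astype(int)  # label: is x1 > x2 ?

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)  # only axis-aligned cuts
linear = LogisticRegression().fit(X, y)               # can learn the x1 - x2 > 0 boundary
print("shallow tree accuracy:", tree.score(X, y))     # clearly below 1
print("linear model accuracy:", linear.score(X, y))   # essentially 1
```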


38 of 40

Many more ML techniques!


39 of 40

Today's hands-on session

In the next lectures we will use Google Colab to run Python notebooks

The first exercise is taken from the Python Data Science Handbook by Jake VanderPlas, with some minor edits (the content is available on GitHub. The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!)

Click here and “make a copy” to be able to edit: https://colab.research.google.com/drive/1Sqn5fuiB5-2EP6UKUmwqjQd_b3uUNu2r?usp=sharing

Or directly download with

wget "https://drive.google.com/uc?export=download&id=1Sqn5fuiB5-2EP6UKUmwqjQd_b3uUNu2r" -O sk.ipynb


40 of 40
