1 of 40

Introduction to Machine Learning

XIX Seminar on Software for Nuclear, Subnuclear and Applied Physics

June 6-10 2022

Andrea.Rizzi@unipi.it


2 of 40

Topics

  1. Introduction to machine learning (Wed 11:00 - 12:00)
    1. Basic concepts: loss, overfit, underfit
    2. Examples of linear regression, boosted decision trees
    3. Exercise with colab, numpy, scikit
  2. Deep Neural Networks (Thu 9:00-10:30)
    • Basic FeedForward networks and backpropagation
    • Importance of depth, gradient descent, optimizers
  3. Introduction to tools and first exercises (Thu 11:00-12:00)
  4. Convolutional and Recurrent networks (Fri 9:00-10:30)
    • Reduction of complexity with invariance: RNN and CNN
    • CNN exercise


3 of 40

Virtual Machine setup (you can launch it and let it run)

# run once

su

yum install rh-python38

exit

# run each time you want to start the python 3.8 environment

scl enable rh-python38 bash

#run once in the python 3.8 environment to install the ML stuff

pip install --user keras tensorflow numpy matplotlib opencv-python scikit-learn jupyter seaborn ipympl

#run a notebook

jupyter notebook

#then use firefox to open the link

#in case of error with matplotlib backend

pip uninstall matplotlib

pip install matplotlib --user

4 of 40

Quick Poll

  • How many of you…
    • ...know what an artificial neural network is?
    • ...used a multivariate / machine learning / neural network / BDT analysis technique?
    • ...know what a linear regression is? and what PCA is?


5 of 40

Introduction on these ML lectures

These lectures are based on the ML lectures of the Unipi course “Computing Methods in Experimental Physics and Data Analysis”. The slides are updated every year, as this is an active research field and new ideas emerge every year; if interested, you should learn to stay tuned (read, study, update your knowledge).

Suggested books (both having free online versions)

  • “The Hundred-Page Machine Learning Book”, Andriy Burkov
    • Can be read almost from A to Z
    • http://themlbook.com/wiki/doku.php
  • “The Deep Learning Book”, Goodfellow, Bengio, Courville (MIT Press)
    • More like a reference book, read individual chapters
    • https://www.deeplearningbook.org/


6 of 40

Machine learning

A possible definition (from wikipedia):

  • Replace “programmers” with computer programs

  • Learn from examples (“training” phase)

Multiple applications, for example:

  • Image and speech processing
  • Agents able to play chess or go, or to drive cars
  • Detect anomalies (e.g. in credit card usage)
  • Applications for research in various scientific fields
    • E.g. physics!


7 of 40

In experimental and applied physics

examples are everywhere..

  • Particle identification and kinematic measurement
  • Signal to background discrimination (BDT and DNN are very popular in HEP experiments)
  • Computer assisted processing of medical exams (ECG, CT, etc...)
  • Processing of astrophysics data

… and your ideas! This is a growing field ...


(Figure: PRL paper on the observation of Higgs → bb; 4 different ML algorithms are used for different tasks in this analysis)

8 of 40

Computing in High Energy Physics Conference

  • Machine learning for QCD theory and data analysis
  • BESIII drift chamber tracking with machine learning
  • FPGA-accelerated machine learning inference as a service for particle physics computing
  • Constraining effective field theories with machine learning
  • Fast simulation methods in ATLAS: from classical to generative models
  • Using ML to Speed Up New and Upgrade Detector Studies
  • The Tracking Machine Learning Challenge
  • Particle Reconstruction with Graph Networks for irregular detector geometries
  • ...42 contributions with “Machine Learning” in the title/abstract


9 of 40

Real time alerts and automatic telescope pointing


10 of 40

Machine Learning basics

(or the “dictionary” for next lectures)


11 of 40

Types of typical ML problems

  • Classification: which category a given input belongs to.
  • Regression: value of a real variable given the input.
  • Clustering: group similar samples
  • Anomaly detection: identify inputs that are different from others
  • Generation/synthesis of samples: produce new samples, similar to the original data, starting from noise/random numbers
  • Denoising: remove noise from an input dataset
  • Transcription: describe the input data in some language
  • Translations: translate between languages
  • Encoding and decoding: transform input data to a different representation
  • ...many more...


12 of 40

Function approximation

  • The goal of a ML algorithm is to approximate an unknown function (often related to some Probability Density Function of the data) given some example data
  • The function is often f(x): R^n -> R^m (often m=1)
  • In classification we try to approximate the probability that an example, with inputs represented as a vector x, belongs to a given category y (e.g. the probability to be a LHC Higgs signal event vs a Standard Model background one)
  • In regression we approximate the function that, given the inputs (x), returns the value of the variable to predict (y) (e.g. given the data read from some particle detectors, estimate the particle energy)


(Figures: a classification example in the (x1, x2) feature plane, and a regression example y = f(x) as a function of x1)

We usually call “x” the inputs or features

We usually call “y” the output or target

13 of 40

Model and Hyper-parameters

  • A model for the functions that can be used to approximate the “f(x)” must be specified. The model can be something simple (e.g. sum of polynomials up to degree N) or more complex (e.g. all the functions that could be coded in M lines of C++)
  • Different ML techniques are based on different “models”
    • Each technique (“class of model”) further allows one to specify the exact model
    • The parameters describing the exact model are called “hyper-parameters” (e.g. the degree N of the polynomial, or the maximum number M of C++ lines, can be considered hyper-parameters)
  • We will see examples of techniques with different models and complexity:
    • Linear regression
    • Decision trees
    • Principal Component Analysis
    • Nearest Neighbour
    • Artificial Neural Networks


14 of 40

Parameters

  • A specific model typically has parameters (e.g. the coefficients of the polynomials or the actual characters of the 10 lines of C++).
  • Parameters are learned in the “training phase”.
  • Different models, or the same model with different hyper-parameter settings, have a different number of degrees of freedom (n.d.o.f.) in the parameter phase space

y(x) = ax + bx^2 + cx^3 + d (a,b,c,d are the parameters)
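
As a minimal illustration (hypothetical data; numpy is part of the software stack used in the exercises), the parameters a, b, c, d of the cubic model above can be learned from examples with a least-squares fit:

```python
import numpy as np

# Hypothetical training examples: a noisy cubic relation y = f(x)
rng = np.random.default_rng(42)
x = np.linspace(-2, 2, 50)
y = 0.7 + 0.5 * x + 1.2 * x**2 - 0.3 * x**3 + rng.normal(0, 0.2, size=x.shape)

# "Training": learn the parameters by least squares
# np.polyfit returns the coefficients from the highest degree down
c3, c2, c1, c0 = np.polyfit(x, y, deg=3)
print(f"learned parameters: a={c1:.2f}, b={c2:.2f}, c={c3:.2f}, d={c0:.2f}")
```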


15 of 40

Objective function

  • A goal for what is “a good approximation” has to be defined
  • This is called the objective function (or loss function, or error function, …)
  • It is a function that returns a higher (or lower) value depending on how good or bad the approximation is
    • Loss functions have to be minimized
  • Examples of loss functions
    • Classification problems: binary cross entropy
    • Regression problems: Mean Squared Error (i.e. the chi2 with sigma=1; I hope you are not surprised by this choice!)

The process is not very different from a typical phys-lab1 chi2 fit… but the number of parameters can be several orders of magnitude larger (10^3 to 10^6)
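
As a concrete toy example, the MSE loss for a regression problem takes a few lines of numpy (the values below are made up):

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: a chi2 with sigma=1, averaged over the examples."""
    return np.mean((y_true - y_pred) ** 2)

# Made-up targets and predictions
print(mse(np.array([1.0, 2.0, 3.0]), np.array([1.1, 1.9, 3.2])))
```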


16 of 40

Objective function: binary cross entropy

  • In classification problems the function to approximate is typically R^n -> [0,1]
    • Where, for example, 0 means background and 1 means signal
  • The binary cross entropy is defined as follows (ŷ is the output of the classifier):

    L(y, ŷ) = -[ y·log(ŷ) + (1-y)·log(1-ŷ) ]

  • The above function has a large value when an example with y=1 is classified as ŷ ≈ 0, and no loss when ŷ ≈ 1
    • Vice versa if y=0 …
  • By minimizing the binary cross-entropy we maximize the likelihood of a process with 0/1 outcome (where the output of the classifier is interpreted as a probability)
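
A minimal numpy sketch of the binary cross entropy above (the predictions are clipped to avoid log(0); the numbers are made up):

```python
import numpy as np

def binary_cross_entropy(y_true, y_hat, eps=1e-7):
    """Average of -[ y*log(y_hat) + (1-y)*log(1-y_hat) ] over the examples."""
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(y_hat) + (1 - y_true) * np.log(1 - y_hat))

# y=1 classified as ~1 gives almost no loss, y=1 classified as ~0 a large one
print(binary_cross_entropy(np.array([1.0]), np.array([0.99])))  # small
print(binary_cross_entropy(np.array([1.0]), np.array([0.01])))  # large
```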


17 of 40

Learning / Training

  • For a given model, and given set of hyper-parameters, how do we infer the parameters that minimize the objective function?
  • The idea of ML is to get the parameters from “data” in a so called “training” step
  • Each ML technique has a different approach to training
  • Different types of training
    • Supervised: i.e. for each example we know the correct answer
    • Unsupervised: we do not know “what is what”, we ask the ML algorithm to learn the probability density function of the examples in the features (i.e. the inputs!) space
    • Reinforcement learning: have agents playing a punishment/reward game


18 of 40

Supervised learning

  • We want to teach something we (the supervisors) already know (at least on the training samples)
  • For each example we need to have the “right answer” / “truth” , for example:
    • Labels telling if a given example is signal or background
    • Labels classifying the content of an image (multiple labels are possible)
    • Correct values of some quantity, e.g. generated energy of a particle in a detector simulation
    • More complex multi-label or multi-class classification
  • Samples can be labelled in various ways:
    • Humans labelling existing data
    • Data being “generated” from known functions (e.g. simulations)
  • Learn the probability of the label y, given the input x, i.e. P(y|x)


19 of 40

Unsupervised learning

  • Often we do not have labels (or we have labels only for a few data points)
  • Unsupervised learning techniques allow training models that can perform tasks similar to the supervised ones, e.g.
    • Classification of “common” patterns (clustering)
    • Dimensionality reduction, compression
    • Prediction of missing inputs
    • Anomaly detection

  • In practice learn the Probability Density Function of the data, independently of any “label” variable, i.e. P(x)


20 of 40

Supervised vs unsupervised

Supervised and unsupervised are not as different as one would imagine, in fact

  • P(x) can be seen as n supervised problems, one for each feature:

    P(x) = Π_i P(x_i | x_1, ..., x_{i-1})

  • P(y | x) can also be computed: if we treat y as an extra “x” in unsupervised learning and learn P(x, y), we can derive it as

    P(y | x) = P(x, y) / Σ_{y'} P(x, y')


21 of 40

Reinforcement learning

Applies to “agents” acting in an “environment” that updates their state

  • It is similar to supervised learning in that a “reward” has to be calculated
  • The supervisor, however, does not necessarily know the best action to perform in a given state to interact with the environment; it just computes the final reward
  • Learn to make the best decision in a given situation
    • The right move in a chess or go match
    • Driving a car in traffic
    • Etc..


22 of 40

Capacity and representational power

  • Different models (i.e. techniques / hyper-parameter values) allow representing different types of functions
  • Models with more free parameters typically can approximate a larger number of functions (or can better approximate a given function) => higher capacity
  • Remember: we do not know the actual function to approximate, we just want to learn from examples
  • With limited samples we have a tradeoff to handle:
    • accuracy in representation vs generalization of the results


23 of 40

Capacity and representational power

  • Underfitting: the sample is badly represented
  • Overfitting / Appropriate capacity are less obvious to define
    • Lack of “generalization” -> overfitting


24 of 40

Capacity and representational power

  • Underfitting: the sample is badly represented
  • Overfitting / Appropriate capacity are less obvious to define
    • Lack of “generalization” -> overfitting
    • The typical method is to check on an independent sample
      • Or just split your sample in two and use only half for training
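
A minimal scikit-learn sketch (toy data, names are illustrative) of checking generalization on an independent split: a very deep decision tree scores much better on the half used for training than on the held-out half, which is the signature of overfitting:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset: two overlapping Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (500, 2)), rng.normal(1.5, 1.0, (500, 2))])
y = np.array([0] * 500 + [1] * 500)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

# Fully grown tree: high capacity, prone to overfit
model = DecisionTreeClassifier(max_depth=None).fit(X_train, y_train)
print("training accuracy  :", model.score(X_train, y_train))  # close to 1
print("validation accuracy:", model.score(X_val, y_val))       # noticeably lower
```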


25 of 40

Generalization

  • We can compare the accuracy between the “training” sample and the “generalization/validation” sample

  • Bias/variance trade-off
    • y: function (with random noise)
    • h(x): approximated function
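
For squared-error loss this trade-off is captured by the standard textbook decomposition (assuming y = f(x) + noise with variance sigma^2):

E[ (y - h(x))^2 ] = ( f(x) - E[h(x)] )^2 + E[ (h(x) - E[h(x)])^2 ] + sigma^2 = bias^2 + variance + irreducible noise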


26 of 40

Regularization

In order to control the “generalization gap”

  • the objective function can be modified adding a regularization term
    • Introduce a “cost” in increasing the capacity of the model or in accessing some parts of the model-parameters space
  • the examples in the training dataset can be increased with augmentation techniques
    • Adding stochastic noise to existing examples
    • Transforming the existing examples with transformations under which the solution we look for is known to be invariant

https://xgboost.readthedocs.io/en/latest/tutorials/model.html
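
A minimal sketch (toy data, scikit-learn) of the first option: ridge regression adds an L2 penalty, alpha * Σ w_i^2, to the squared-error objective, making the large-weight regions of parameter space costly:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Toy data: few noisy examples, many features (easy to overfit)
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 10))
y = X[:, 0] + 0.1 * rng.normal(size=20)

plain = LinearRegression().fit(X, y)
reg = Ridge(alpha=1.0).fit(X, y)  # minimizes squared error + alpha * sum(w_i^2)

print("unregularized sum|w|:", np.abs(plain.coef_).sum())
print("regularized   sum|w|:", np.abs(reg.coef_).sum())  # typically smaller
```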


27 of 40

Hyper-parameter (model) optimization

  • It is normal to have to test a few, if not several, configurations in the model hyper-parameter space
    • Scans of hyper-parameters are often performed
    • Different techniques used
  • Effectively a “second” minimization is done
    • First minimization is on the parameters => minimize on the “training dataset”
    • Second minimization is on the hyper-parameters => minimize on the “validation dataset”
  • A third dataset (“test dataset”) is then also needed
    • To assess the performance of the algorithm in an unbiased way
    • To make an unbiased prediction of the algorithm output
  • The original dataset is typically split into uneven parts to be used as training, validation and test
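
A minimal sketch (toy data) of the uneven three-way split; the 60/20/20 fractions are just an example:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset standing in for the real one
X, y = np.random.rand(1000, 5), np.random.randint(0, 2, 1000)

X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# parameters           -> fitted on (X_train, y_train)
# hyper-parameters     -> chosen from the score on (X_val, y_val)
# unbiased performance -> evaluated once on (X_test, y_test)
```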


28 of 40

K-folding cross validation

  • If the sample is statistically limited, splitting it into 3 chunks means losing examples
  • With K-folding, “K” independent trainings are performed, each using a different chunk of data for “testing” and the rest for “training” (and another one for validation if a hyper-parameter scan is performed)
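
A minimal scikit-learn sketch of K-fold cross validation with K=5 on toy data: five independent trainings, each evaluated on a different held-out fifth of the sample:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# 5 trainings, each scored on the chunk it did not see
scores = cross_val_score(DecisionTreeClassifier(max_depth=3), X, y, cv=5)
print("per-fold accuracy:", scores, "mean:", scores.mean())
```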


29 of 40

Inference

  • A ML model that has been trained can then be used to act on some new data (or on the test dataset if a prediction has to be made)
  • The evaluation of the algorithm output on the “unseen” data is called inference
  • From a computing point of view inference is usually faster than training


30 of 40


(Figure: ROC curve and confusion matrix, with the definitions of accuracy, precision, sensitivity and specificity)
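
A minimal scikit-learn sketch of these evaluation tools, on made-up labels and classifier scores:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, roc_auc_score, roc_curve)

# Made-up true labels and classifier output scores
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.9, 0.6])
y_pred = (y_score > 0.5).astype(int)  # threshold the score

print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted class
print("accuracy   :", accuracy_score(y_true, y_pred))
print("precision  :", precision_score(y_true, y_pred))
print("sensitivity:", recall_score(y_true, y_pred))               # recall of the positive class
print("specificity:", recall_score(y_true, y_pred, pos_label=0))  # recall of the negative class
fpr, tpr, thresholds = roc_curve(y_true, y_score)                 # points of the ROC curve
print("AUC:", roc_auc_score(y_true, y_score))
```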

31 of 40

Examples of ML techniques


32 of 40

Linear regression

  • Solve a regression problem, i.e. predict the value of y when x is given
    • Approximate an unknown “y=f(x)” given some examples of (x,y)
  • Model: y = Σ_i w_i x_i , i.e. the function is a linear combination of the input features
  • Parameters: the w_i
  • Let’s suppose we have m examples in the form of pairs (x,y)_j
  • The objective function can be the mean squared error, MSE = (1/m) Σ_j | y_j - Σ_i w_i x_ij |^2
  • Learning: find the w_i that minimize the MSE on the given dataset
  • We could increase the capacity of the model using polynomials instead of linear functions
    • The number of parameters would increase, as we would now have the second order coefficients too


Supervised
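
A minimal sketch of linear regression on toy data (scikit-learn; names and numbers are made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# m toy examples: y is a linear combination of the inputs plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w + rng.normal(0, 0.1, size=200)

# Learning: find the weights that minimize the MSE on the given dataset
model = LinearRegression().fit(X, y)
print("learned w:", model.coef_, "intercept:", model.intercept_)
print("MSE:", np.mean((model.predict(X) - y) ** 2))
```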

33 of 40

Principal Component Analysis (aka PCA)

  • Orthogonal transformation of the input phase space such that
    • The first transformed coordinate has maximum variance
    • The 2nd transformed coordinate has the 2nd largest variance
    • …etc...
  • Can be computed as the eigenvalue decomposition of the covariance matrix

  • Useful to transform the data into a normalized form (scaling by the variance of each component)
  • Reduce dimensionality (by taking only first N components)

More complex dimensionality reduction Manifold Learning: https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.10-Manifold-Learning.ipynb


Unsupervised
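
A minimal scikit-learn PCA sketch on toy correlated data, keeping only the first component for dimensionality reduction:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy data: two strongly correlated features
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
X = np.column_stack([x1, 0.8 * x1 + 0.2 * rng.normal(size=500)])

pca = PCA(n_components=2).fit(X)
print("explained variance ratio:", pca.explained_variance_ratio_)  # first component dominates

X_reduced = PCA(n_components=1).fit_transform(X)  # keep only the first principal component
print("reduced shape:", X_reduced.shape)          # (500, 1)
```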

34 of 40

Nearest neighbors

  • A very powerful way to do classification or regression is to look at the points in the training dataset that are close to the sample to evaluate
  • Multiple neighbors can be used for a more stable evaluation
  • On large datasets it can be a problem to keep all training points available for the evaluation phase
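
A minimal k-nearest-neighbors classification sketch with scikit-learn (toy data, k=5 neighbors):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Use 5 neighbors for a more stable evaluation; "training" essentially stores the points
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))
```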


35 of 40

Decision trees

  • The functions used in the “model” are decision trees, each node has a pass/fail condition on some input variable
  • Classification and regression trees (CART)
    • Examples are categorized based on individual “cuts” on a single input feature
    • A score is given in each leaf
  • Trees can have different depths (depth is a hyper-parameter)
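
A minimal CART sketch with scikit-learn on toy data; max_depth is the depth hyper-parameter mentioned above, and the printout shows the per-feature cuts and the leaf scores:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Each node applies a pass/fail cut on a single input feature
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=[f"x{i}" for i in range(4)]))
```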


36 of 40

Ensembles of trees

  • A single tree typically does not perform very well
  • Combine multiple trees (the number of trees is a hyper-parameter)
    • Random forest (bagging)
    • Gradient boosting
    • Adaptive boosting
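
A minimal sketch comparing a bagged ensemble (random forest) with gradient boosting in scikit-learn; the data are toy and n_estimators is the number-of-trees hyper-parameter:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)     # bagging
boost = GradientBoostingClassifier(n_estimators=100, random_state=0)  # gradient boosting
print("random forest    :", cross_val_score(forest, X, y, cv=5).mean())
print("gradient boosting:", cross_val_score(boost, X, y, cv=5).mean())
```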



37 of 40

Limitations of decision trees

  • Cuts are axis aligned
  • Classification of x1 > x2 is a hard problem for a decision tree
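
An illustrative toy sketch of this limitation: with axis-aligned cuts only, a shallow tree approximates the diagonal boundary x1 > x2 poorly, while a linear model separates it almost exactly:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
y = (X[:, 0] > X[:, 1]).astype(int)  # label: is x1 > x2 ?

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)  # only axis-aligned cuts
linear = LogisticRegression().fit(X, y)               # can learn the x1 - x2 > 0 boundary
print("shallow tree accuracy:", tree.score(X, y))     # clearly below 1
print("linear model accuracy:", linear.score(X, y))   # essentially 1
```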


38 of 40

Many more ML techniques!


39 of 40

Today's hands-on session

In the next lectures we will use Google Colab to run Python notebooks

The first exercise is taken from the Python Data Science Handbook by Jake VanderPlas, with some minor edits (the content is available on GitHub. The text is released under the CC-BY-NC-ND license, and code is released under the MIT license. If you find this content useful, please consider supporting the work by buying the book!)

Click here and “make a copy” to be able to edit: https://colab.research.google.com/drive/1Sqn5fuiB5-2EP6UKUmwqjQd_b3uUNu2r?usp=sharing

Or directly download with

wget "https://drive.google.com/uc?export=download&id=1Sqn5fuiB5-2EP6UKUmwqjQd_b3uUNu2r" -O sk.ipynb


40 of 40
