1 of 48

Introduction to Artificial Neural Networks

XXI Seminar on Software for Nuclear, Subnuclear and Applied Physics

Alghero, 9-14 June 2024

Andrea.Rizzi@unipi.it

1

2 of 48

This lecture

  • Deep Neural Networks
    1. Basic FeedForward networks and backpropagation
    2. Importance of depth, gradient descent, optimizers
    3. Introduction to tools
  • Tomorrow: first exercises

2

3 of 48

Recap of this morning lecture

  • ML techniques have common elements:
    • The function “f” to approximate
    • The model used to approximate “f” (e.g. polynomial functions, a decision tree, or a NN)
    • The parameters of the model (e.g. the coefficients of the poly)
    • The hyper-parameters of the model (e.g. the degree of the polynomial, N=1 for linear)
    • The objective function (i.e. the loss such as MSE or binary cross entropy)
    • The variance-bias tradeoff (aka training vs generalization)
    • The regularization techniques
  • Examples of ML algorithms
    • Linear regression
    • PCA
    • Nearest Neighbours
    • Decision trees
      • Bagging vs boosting

3

4 of 48

Artificial Neural Networks

4

5 of 48

(Artificial) neural networks: the “Model”

  • Computation is achieved with a network of elementary computing units (neurons)
  • Each basic unit, a neuron, has:
    • Weighted input connections to other neurons
    • A non-linear activation function
    • An output value to pass to other neurons
  • Biologically inspired by the structure of the brain as a network of neurons
    • But the goal of artificial NNs is not to “simulate” a brain!
  • Modern NNs actually go well beyond the brain-inspired models

5

6 of 48

Brief history, highs and lows

  • The first work dates back to ~1940-1960 (“cybernetics”)
    • Linear models
  • Then called “connectionism” in the ’80s-’90s
    • Development of neural networks, backpropagation, non-linear activations (mostly sigmoid)
  • High expectations, low achievements in the ’90s
    • A decade of stagnation
  • New name, “Deep Learning”, from 2006
    • Deep architectures (see next slides)
    • Very active field in the past decade
    • The availability of GP-GPUs was a game changer for the typical “size”
    • Processing raw, low-level features
      • It doesn’t mean you must use “raw features”, but rather that you can!

6

7 of 48

Complexity growth

  • Datasets become larger and larger (“big data”)
    • Not just in “industry”: experimental scientific research is now producing multi-PetaByte datasets
    • Digital era => everything can be “data”
  • Increasing hardware performance

=> increasing complexity of the networks (number of neurons and connections)

  • 2020, largest ANN: OpenAI GPT-3, 175 billion parameters (10^11)
  • 2021: “Switch Transformer” and “Wu Dao 2.0” => trillion-parameter models (10^12)
  • (for comparison) the human brain has 10^13 - 10^15 synapses (~ parameters)

7

8 of 48

Performance on classic problems

  • Image classification and speech recognition are the typical problems where ML (and neural networks) failed in the ’90s
  • Now they beat humans ...

8

[Figures: “Speech recognition” and “Image classification” performance plots]

9 of 48

My favorite performance examples

  • Learning how to translate without seeing a single translation example, just having two independent monolingual corpora (https://arxiv.org/abs/1711.00043)
  • AlphaFold: in a contest to predict protein folding, AlphaFold ranked first with 25 correct predictions out of 43 tests; the second-ranked entry got 3 out of 43
    • 2021: AlphaFold2 !!
    • 2024: AlphaFold3
  • AlphaGo => AlphaZero: AlphaGo beat humans at “Go”, learning from human matches and know-how. Then AlphaZero learned from scratch and beat AlphaGo 100-0
  • AlphaZero learned chess too, and beat the best existing chess program
  • AI recently proved math theorems, 1200 of them
  • Microsoft and Alibaba AIs beat humans in a text-understanding test (SQuAD)
  • DeepFakes: never ever believe what you see on a screen, even in videos
  • Image super-resolution (arXiv:2104.07636)

9

10 of 48

OpenAI GPT3

Generative Pre-trained Transformer

A $12M autocomplete (that does not really understand what it is talking about, but can still write better than most of us)

10

2020

11 of 48

Generate images from descriptions

11

2021

12 of 48

June 2022: solving simple math problems

12

2022

13 of 48

2023: GPT-3.5 / GPT-4.0

13

14 of 48

Neural Nets Basic elements

14

15 of 48

A neural network node: the artificial neuron

  • The elementary processing unit, a neuron, can be seen as a node in a directed graph
  • Inputs are summed, with weights, and an activation function is evaluated on that sum (see the sketch below)
  • Nodes are typically also connected (with weight b) to an input “bias node” that has a fixed output value of 1
  • Different activation functions can be used; common ones are: sigmoid, atan, ReLU (rectified linear unit)
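
As a rough illustration, a single neuron with a sigmoid activation could be sketched in numpy as follows (the input values, weights and bias below are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # weighted sum of the inputs plus the bias term, passed through the activation
    return sigmoid(np.dot(w, x) + b)

# example: 3 inputs with arbitrary weights and bias
print(neuron(np.array([0.5, -1.0, 2.0]), w=np.array([0.1, 0.4, -0.2]), b=0.3))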

15

16 of 48

The MLP model

  • The most common NN in the ’90s was the Multi Layer Perceptron (MLP)
  • Graph structure organized in “layers”
    • Input layer (nodes filled with the input values)
    • Hidden layer
    • Output layer (node(s) where the output is read out)
  • Nodes are connected only from one layer to the next and all possible connections are present (known as a “dense” or “fully connected” layer)
    • No intra-layer connections
    • No direct connections from input to output
  • The sizes of the input and output layers are fixed by the problem
  • The hyperparameters are
    • The size of the hidden layer
    • The type of activation function
  • The parameters to learn are the weights of the connections (a forward-pass sketch follows below)
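
A minimal numpy sketch of the forward pass through such a network (layer sizes and random weights are arbitrary here): each layer is a weight matrix plus a bias vector, followed by an activation.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# 4 input features, one hidden layer with 8 nodes, 1 output node
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(1, 8)), np.zeros(1)

def mlp(x):
    hidden = sigmoid(W1 @ x + b1)      # input layer -> hidden layer (fully connected)
    return sigmoid(W2 @ hidden + b2)   # hidden layer -> output layer

print(mlp(rng.normal(size=4)))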

16

17 of 48

Universal approximation theorem

“One hidden layer is enough to represent (not learn) an approximation of any function to an arbitrary degree of accuracy” (I. Goodfellow et al. 2016)

  • You can approximate any function with arbitrary precision, given enough hidden nodes and the right weights
  • How do you get the right weights? You need a “training” for your network
    • The theorem does not say that one hidden layer (+ some training algorithm) is enough to find the optimal weights, just that they exist!
  • Achieving even a modest (by some metric) level of accuracy may require an unmanageably large hidden layer
    • And may need an unreasonable number of “examples” to learn from

17

18 of 48

Example (1-D input)

Approximate this function

With a weighted sum of functions like this one

18

19 of 48

Example

  • The Universal Approximation Theorem says that by increasing the number of hidden nodes I can increase the accuracy as much as I want
  • More hidden nodes, higher “capacity” => more accuracy

19

20 of 48

Training of an MLP

  • How do I get the weights?
  • Remember: we do not know the function we want to approximate, we only have some “samples”

20

21 of 48

Training a NN

  • The goal of training is to minimize the objective function
    • I.e. we want to minimize the loss as a function of the model parameters (i.e. the weights)
  • For an MLP the basic idea is the following (sketched in code below)
    1. Start with random weights
    2. Compute the prediction y_pred for a given input x and compare it with the target y_true by computing the loss (repeat for a few examples, aka one “batch”)
    3. Estimate an update of the weights that reduces the loss
    4. Iterate from step 2, repeating for all batches
    5. When the sample has been used completely (end of an epoch), iterate from step 2 again on all samples
    6. Repeat for multiple epochs
  • The important point is how to implement step 3 => (stochastic) gradient descent
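
A minimal, runnable version of this recipe for a single artificial neuron on a toy dataset (the dataset, the MSE loss and all the numbers are arbitrary choices; the gradient formula used in step 3 is discussed in the next slides):

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))        # toy dataset: 1000 samples, 2 features
y = (X[:, 0] > X[:, 1]).astype(float)         # toy binary target

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b = rng.normal(size=2), 0.0                # step 1: random weights
learning_rate, batch_size = 0.5, 32

for epoch in range(20):                                        # step 6: several epochs
    for start in range(0, len(X), batch_size):                 # steps 4-5: loop over batches
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        y_pred = sigmoid(xb @ w + b)                           # step 2: prediction
        loss = np.mean((y_pred - yb) ** 2)                     # step 2: loss (MSE here)
        grad_z = 2 * (y_pred - yb) * y_pred * (1 - y_pred)     # step 3: gradient via chain rule
        w -= learning_rate * np.mean(grad_z[:, None] * xb, axis=0)
        b -= learning_rate * np.mean(grad_z)
print(loss)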

21

[Figure: the loss as a function of the position in weights space]

22 of 48

How to find a minimum?

Gradient Descent

  • We know the value of the loss function at a point in the weights phase space (e.g. the initial set of random weights, or the weights at iteration N-1), computed numerically as the mean or the sum of the losses over our training examples
  • We can compute the gradient of the loss function at that point. We expect the minimum to lie “on the way down”, hence we adjust our set of weights by taking a step against the gradient, with a step size proportional to the length of the gradient (a one-parameter toy example is sketched below)

Stochastic Gradient Descent (SGD):

  • Compute the gradient on “batches” of events rather than on the full sample
  • The “noise” may help avoid local minima
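
As a toy illustration with a single parameter (a quadratic loss L(w) = (w - 3)^2 whose gradient is known analytically, nothing to do with a real NN loss):

w = 0.0                 # initial random guess
learning_rate = 0.1
for step in range(50):
    grad = 2 * (w - 3)             # dL/dw for L(w) = (w - 3)**2
    w = w - learning_rate * grad   # step "downhill", proportional to the gradient
print(w)                # approaches 3, the minimum of the loss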

22

23 of 48

Not as simple as you would imagine

  • A parameter named learning rate controls how big the step in the direction of the gradient is
    • Too large a step may make you bounce back and forth between the walls of your “valley”
    • Too small a step makes the descent last forever
  • Several variants of SGD exist
    • Include “momentum” from previous gradient calculations (may help overcome local obstacles; sketched below)
    • Reduce the step size over time
    • Adadelta, Adagrad, Adam, and many more
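
A sketch of the plain momentum variant on the same one-parameter toy loss (the momentum factor 0.9 and learning rate 0.1 are typical but arbitrary choices):

w, velocity = 0.0, 0.0
learning_rate, momentum = 0.1, 0.9
for step in range(200):
    grad = 2 * (w - 3)                                    # same toy loss as before
    velocity = momentum * velocity - learning_rate * grad
    w = w + velocity                                      # the running "velocity" can carry the weight past small obstacles
print(w)                # converges towards the minimum at 3, with some damped oscillation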

23

24 of 48

Learning rate, epochs and batches

  • The gradient update (in SGD) is repeated for each “batch” of events
  • A full pass over the whole dataset (i.e. all batches) is called an epoch
  • A typical training iterates over multiple epochs
  • The size of the update step can be controlled with a multiplicative factor called the “learning rate”
    • The learning rate can be adapted over time

24

25 of 48

In reality

25

26 of 48

Training and overfitting

  • As discussed previously, if the capacity is large enough the network can “overfit” the training dataset
  • Keep a separate, statistically independent validation/generalization sample (see the sketch below)
  • Evaluate performance (with the “loss” or with other metrics) on the validation sample
  • Training results depend on many choices
    • Size of batches (amount of “noise”)
    • Learning rate (how much you move along the gradient at each iteration)
    • Gradient descent algorithm
    • Capacity of the network
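
A minimal sketch of keeping an independent validation sample (dummy data; scikit-learn's train_test_split does the splitting):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.normal(size=(1000, 5))     # dummy inputs: 1000 events, 5 features
y = (X[:, 0] > X[:, 1]).astype(int)      # dummy binary target

# hold out 30% of the events as a statistically independent validation sample
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# during training, the loss (or other metrics) is then monitored on (X_val, y_val),
# e.g. in Keras: model.fit(X_train, y_train, validation_data=(X_val, y_val), ...)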

26

27 of 48

Neural Networks, computers and mathematics

  • ANNs use a fairly limited and simple set of operations
    • Many operations are simply represented with linear algebra
    • Non-linear functions are typically applied, repeatedly, to multiple inputs (hence the computation can be “vectorized”)
    • Gradient descent works by knowing the derivatives of the functions involved in the NN calculations (weights, activations) and in the loss
  • Datasets are represented as multidimensional tensors
    • The number of indices and the length of each index are encoded in the “shape”, a tuple with the dimension of each index
    • The first index runs over the samples in the dataset, and is sometimes omitted when describing a neural network
  • Classification with multiple categories is often converted into the “categorical” representation (see the example below)
    • I.e. rather than labelling with a scalar “y” (with 0=horse, 1=dog, 2=cat, 3=bird, …), a vector y is used with as many components as there are categories (with [1,0,0,0]=horse, [0,1,0,0]=dog, etc.)
  • Tools exist to describe the network structure mathematically, optimized for fast computation on CPU/GPU/TPU
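
For example, the categorical (one-hot) representation and the “shape” of a small label tensor, in plain numpy (in Keras the same conversion is provided by keras.utils.to_categorical):

import numpy as np

y = np.array([0, 1, 2, 3, 2])        # scalar labels: 0=horse, 1=dog, 2=cat, 3=bird
y_cat = np.eye(4, dtype=int)[y]      # one row of the 4x4 identity matrix per label
print(y_cat)                         # [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1], [0,0,1,0]]
print(y_cat.shape)                   # (5, 4): the first index runs over the 5 samples, the second over the 4 categories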

27

28 of 48

Back-propagation

Calculating the gradient in complex networks could be computationally expensive:

  • Some expressions appear repeatedly, hence we should avoid recomputing them
  • The back-propagation method allows us to efficiently compute the derivatives w.r.t. each of the weights (a tiny worked example follows)
    • Start from the last node and apply the derivative chain rule going backward
    • At each step the computation depends only on the already-computed derivatives and the values of the node outputs (already computed in the NN evaluation, aka the forward pass)
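
A tiny worked example of the chain rule for a single sigmoid neuron with a squared-error loss (all input values are arbitrary):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, y_true = 1.5, 1.0      # one input and its target
w, b = 0.2, -0.1          # current weight and bias

# forward pass: keep the intermediate node outputs
z = w * x + b
a = sigmoid(z)            # the neuron output
loss = (a - y_true) ** 2

# backward pass: chain rule, reusing the stored values a and x
dL_da = 2 * (a - y_true)   # derivative of the loss w.r.t. the output
da_dz = a * (1 - a)        # derivative of the sigmoid
dL_dw = dL_da * da_dz * x  # derivative of the loss w.r.t. the weight
dL_db = dL_da * da_dz      # derivative of the loss w.r.t. the bias
print(dL_dw, dL_db)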

28

29 of 48

Deep networks

29

30 of 48

Deep Feed Forward networks

The simplest extension of the MLP is to just add more hidden layers

Other names for this network architecture

  • Deep Feed-Forward network
  • Deep Dense network, i.e. made (only) of Dense(ly) connected layers

30


31 of 48

Why going deeper?

Hold on… wasn’t there a theorem saying that an MLP is good enough? Yes, but…

  • The number of nodes needed to represent complex functions can be too high
  • Learning the weights from finite samples could be too difficult

Advantages of deep architectures

  • The hierarchical structure can allow easier “abstraction” by the network, with early layers computing low-level features and deeper layers representing more abstract properties
  • The number of neurons and connections needed to represent the same function is greatly reduced in many realistic cases

31

32 of 48

Activation functions

  • Linear:
    • Cannot be used in hidden layers as its derivative is constant (it does not depend on the inputs, so gradient descent cannot be performed)
    • Useful for output nodes in regression problems
  • Sigmoid:
    • Used in the past in hidden layers (leads to the vanishing-gradient problem)
    • Useful in classification problems with a single output or in multi-label classifiers
  • Softmax:
    • Useful in classification problems with one-hot encoded multi-class targets
  • Rectified Linear Unit (ReLU)
    • The workhorse for hidden-layer activations
    • Variants: Leaky-ReLU, Swish (numpy sketches of these activations follow below)
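
For reference, numpy sketches of the activations listed above:

import numpy as np

def linear(z):                    # for regression outputs
    return z

def sigmoid(z):                   # for single-output / multi-label classification outputs
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):                   # for one-hot multi-class outputs: positive values summing to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

def relu(z):                      # the workhorse for hidden layers
    return np.maximum(z, 0.0)

def leaky_relu(z, alpha=0.01):    # variant: a small slope instead of zero for negative inputs
    return np.where(z > 0, z, alpha * z)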

32

33 of 48

Deep architectures

33

34 of 48

Dropout and regularization methods

  • NN training is a numerical process
  • Often the number of samples is limited, hence the accuracy of the gradient is not great
  • Several regularization methods exist to avoid being dominated by stochastic effects
    • Caps on the weights (so that individual nodes cannot be worth more than some amount)
    • Dropout techniques: during training a fraction of the nodes is discarded, randomly, at each iteration (sketched below)
      • Makes the NN more robust to noise
      • Effectively “augments” the input dataset
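
A numpy sketch of (“inverted”) dropout applied to the outputs of one hidden layer during training (the drop fraction 0.2 is an arbitrary choice):

import numpy as np

rng = np.random.default_rng(0)
hidden = rng.normal(size=10)      # outputs of some hidden layer (dummy values)

drop_fraction = 0.2
keep_mask = rng.random(10) > drop_fraction                    # randomly discard ~20% of the nodes
hidden_dropped = hidden * keep_mask / (1.0 - drop_fraction)   # rescale the surviving nodes
print(hidden_dropped)

# at evaluation time dropout is switched off and the full layer output is used;
# in Keras this is simply an extra Dropout(0.2) layer between two Dense layers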

34

35 of 48

(Batch) normalization / standard scaling

  • Input features typically have different ranges, means and variances
  • It is generally useful to “normalize” the input distributions
    • Mean zero
    • Variance 1
  • Scikit-learn’s “StandardScaler” can be used to normalize the inputs (see the sketch below)
  • Normalization can also be done between the layers of the network
    • A so-called “Batch Normalization” layer
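
A minimal StandardScaler sketch (dummy data; the scaler is fitted on the training sample only and then reused on the validation sample):

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.random.normal(loc=91.0, scale=5.0, size=(1000, 1))  # e.g. a dilepton mass in GeV (dummy values)
X_val = np.random.normal(loc=91.0, scale=5.0, size=(200, 1))

scaler = StandardScaler().fit(X_train)   # learns the mean and variance from the training sample
X_train_n = scaler.transform(X_train)    # now mean ~0, variance ~1
X_val_n = scaler.transform(X_val)        # the same transformation applied to the validation sample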

35

[Figures: a typical observable, e.g. the invariant mass of a pair of leptons, and its normalized version]

36 of 48

DNN Tools

36

37 of 48

Keras

  • Keras is a Python library that allows building, training and evaluating NNs with many modern technologies
  • Keras supports multiple backends for the actual calculations
  • Two different syntaxes can be used to build the network architecture
    • Sequential: a simple linear “stack” of layers
    • Model (functional API): to create more complex topologies
  • Multiple types of “Layers” are supported
    • Dense: the classic fully connected layer of a FF network
    • Convolutional layers
    • Recurrent layers
  • Multiple types of activation functions
  • Various optimizers and gradient-descent techniques

PS: another popular DNN toolset is PyTorch, not covered here

37

38 of 48

Other common tools

Common alternatives to Keras

  • PyTorch (trending up!)
  • Sonnet
  • Direct usage of TensorFlow (or other backends, e.g. Theano)
    • You need to write some of the basics of NN training yourself
    • Especially useful to develop new ideas (e.g. a new descent technique, a new type of basic unit/layer)

38

39 of 48

Keras Sequential example
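
A minimal Sequential sketch, assuming the same 32-input binary classifier used in the functional-API examples on the following slides:

from keras.models import Sequential
from keras.layers import Input, Dense

model = Sequential([
    Input(shape=(32,)),              # 32 input features
    Dense(32, activation="relu"),    # hidden layer
    Dense(1, activation="sigmoid"),  # output node for binary classification
])
model.summary()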

39

40 of 48

Keras “Model” Functional API

A NN can be seen as the composition of multiple functions (one per layer), e.g.

  • A simple stack of layers is: y=f5(f4(f3(f2(f1(x)))))
  • A more complex structure could be something like y = f4( f2(f1(x)), f3(x) ), shown as a graph below

  • The functional API allows expressing the idea that each layer is evaluated on the output of one or more previous layers, i.e.

x = Input()
layer1 = FirstLayerType(parameters)(x)
layer2 = SecondLayerType(parameters)(layer1)
layer3 = ThirdLayerType(parameters)(x)
layer4 = FourthLayerType(parameters)([layer2, layer3])

40

[Figure: the graph of this model; the input “x” feeds layer1 and layer3, layer2 follows layer1, and layer4 combines layer2 and layer3]

41 of 48

A (modernized) MLP in keras

from keras.models import Model
from keras.layers import Input, Dense

x = Input(shape=(32,))
hid = Dense(32, activation="relu")(x)
out = Dense(1, activation="sigmoid")(hid)
model = Model(inputs=x, outputs=out)
model.summary()

from keras.utils import plot_model
plot_model(model, to_file='model.png')

41

42 of 48

From ~1995 to ~2010

# The typical ~1995 MLP: one hidden layer, sigmoid activations
from keras.models import Model
from keras.layers import Input, Dense

x = Input(shape=(32,))
hid = Dense(32, activation="sigmoid")(x)
out = Dense(1, activation="sigmoid")(hid)
model = Model(inputs=x, outputs=out)

# The ~2010 deep version: several hidden layers, ReLU activations
from keras.models import Model
from keras.layers import Input, Dense

x = Input(shape=(32,))
b = Dense(32, activation="relu")(x)
c = Dense(32, activation="relu")(b)
d = Dense(32, activation="relu")(c)
e = Dense(1, activation="sigmoid")(d)
model = Model(inputs=x, outputs=e)

42

43 of 48

Training a model with Keras

from keras.layers import Input, Dense
from keras.models import Model

# This returns a tensor
inputs = Input(shape=(784,))

# a layer instance is callable on a tensor, and returns a tensor
x = Dense(64, activation='relu')(inputs)
x = Dense(64, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)

# This creates a model that includes
# the Input layer and three Dense layers
model = Model(inputs=inputs, outputs=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(data, labels)  # starts training

43

(data and labels above are numpy arrays holding your dataset)

44 of 48

Keras layers

44

45 of 48

Keras basic layers

  • Basic layers
    • Input
    • Dense
    • Activation
    • Dropout
  • Convolutional layers (for next week)
    • Conv1D/2D/3D
    • ConvTranspose or “Deconvolution”
    • UpSampling and ZeroPadding
    • MaxPooling, AveragePooling
    • Flatten
  • More stuff
    • Recurrent layers
    • ...check the keras docs...

45

46 of 48

Callbacks

from keras.callbacks import EarlyStopping, ReduceLROnPlateau

# train
history = model.fit(X_train, y_train, epochs=n_epochs, batch_size=batch_size, verbose=2,
                    validation_data=(X_test, y_test),
                    callbacks=[
                        EarlyStopping(monitor='val_loss', patience=10, verbose=1),
                        ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=2, verbose=1)
                    ])

  • During training some “callbacks” can be passed to the fit function
    • E.g. to monitor the progress of the training
    • Or to adapt the training
      • Stop if there was no improvement in the last N epochs
      • Reduce the learning rate if there was no improvement in the last M epochs
    • Some callbacks are predefined in Keras, others can be user-implemented

46

47 of 48

Assignment 1

  • Partition a 2D region with a simple function of x1, x2 that returns true vs false
    • E.g. x1>x2 or x1*2>x2 or x1 > 1/x2 or … whatever...
  • Generate 1000 samples (a generation sketch follows below)
  • Create a classifier, with an MLP or a DNN with a similar number of parameters, that approximates the function above
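
A possible starting point for the dataset generation (the labelling function x1 > x2 is just one of the suggested examples):

import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(-1.0, 1.0, size=(1000, 2))   # 1000 samples of (x1, x2)
y = (X[:, 0] > X[:, 1]).astype(int)          # true/false label from the chosen function

# X and y can then be passed to the classifier, e.g. model.fit(X, y, ...)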

Start from this notebook: Exercise 1

Solution

47

48 of 48

Assignment 2

  • Let’s try to implement a regression with a DNN in Keras
  • Invent a function of x1, x2, x3, x4, x5 (a generation sketch follows below)
  • Generate some data
  • Create a feed-forward model (with 1 or more hidden layers)
  • What should we change compared to the classification problem?
    • What is the loss function to use?
    • Which activation function in the last layer?
  • Try to make a histogram of the residuals on the validation sample
    • residuals = (y_predicted - y_truth)
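
A possible data-generation sketch (the target function below is just an invented example):

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(2000, 5))                        # x1 ... x5
y = X[:, 0] * X[:, 1] + np.sin(X[:, 2]) + X[:, 3] ** 2 - X[:, 4]  # an invented target function

# split into training and validation samples, train the model, then histogram
# the residuals (y_predicted - y_truth) on the validation sample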

48

Start from this notebook: Exercise 2