
Convolutional and recurrent networks

XX Seminar on Software for Nuclear, Subnuclear and Applied Physics

Alghero, 4-9 June 2023

Andrea.Rizzi@unipi.it


Topics

  • Introduction to machine learning
    • Basic concepts: loss, overfitting, underfitting
    • Examples of linear regression, boosted decision trees
    • Exercises with Colab, NumPy, scikit-learn
  • Deep Neural Networks
    • Basic feed-forward networks and backpropagation
    • Importance of depth, gradient descent, optimizers
    • Introduction to tools and first exercises
  • Convolutional and Recurrent networks
    • Reduction of complexity with invariance: RNN and CNN
    • CNN exercise
  • Graph Neural Networks
    • PointCloud exercise
  • Autoencoders and Generative Adversarial Networks (Lucio Anderlini)
    • GAN exercises


Classification of images

  • Images are data structures with 2 or 3 indices: X,Y or X,Y,channel (= R,G,B)
    • Shape of the input dataset: (Nsamples, Width, Height, Nchannels)
    • Nchannels is typically 1 (B&W), 3 (RGB) or 4 (RGB plus transparency)
  • We can use FF networks to classify images (see the sketch below)
    • Reshape the input tensor with the “Flatten” keras layer
    • Use multiple Dense layers, with a final one for the one-hot encoded output
  • Limitations of this approach:
    • If the image is translated, even by a single pixel in x or y, the network may not recognize it as “similar” to the untranslated image
    • Nearby pixels in “Y” (or even the same pixel in a different channel) are not treated any differently from far-away pixels
  • We know that our problem has some invariance
  • We know that the input data has some locality information
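
A minimal sketch of this flatten-plus-dense approach in Keras (assuming MNIST-like 28x28 grayscale images and 10 classes; all shapes and sizes here are illustrative):

    from tensorflow import keras
    from tensorflow.keras import layers

    # Flatten the (28, 28, 1) image into a 784-component vector, then
    # classify it with Dense layers and a final softmax over the 10
    # one-hot encoded classes.
    model = keras.Sequential([
        keras.Input(shape=(28, 28, 1)),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])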


Exploit invariance and locality

  • Suppose you want to count windows in an 800x600 picture of houses
    • With an MLP or DFF you have 800x600x3 (RGB) ≈ 1.4M inputs
    • Each node independently processes some part of the image
    • The initial “Dense” connection should converge to something with a lot of zero weights, because far-away pixels have no reason to be considered together when detecting local features
    • => the problem cannot be managed this way
  • But the problem is translation invariant!
    • “Windows” are local features: you can just analyze a patch of the image (locality)
    • A window is a window no matter whether it is at the top left or the bottom right of your image (invariance)
    • And actually windows are made of even more local features (some borders/frame, some uniform area, a square shape)


Can we exploit problem invariance?

  • Convolutional neural networks (CNN) attempt to exploit invariance against spatial translations
    • Smaller networks (locality!) acting on a single patch of the image
    • Stack multiple such convolutional layers one after the other
    • Use “subsampling” layers to scale from local to global
  • Hierarchical approach (see the sketch below)
    • Early layers learn local features
    • Subsampling reduces the information extracted from a given “patch”
    • A final Flatten plus one or more Dense layers is used to reach the final target
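
A minimal sketch of such a hierarchical CNN in Keras (the input shape, filter counts and layer sizes are illustrative, not prescriptive):

    from tensorflow import keras
    from tensorflow.keras import layers

    # Conv2D layers learn local features, MaxPooling subsamples from
    # local to global, and Flatten + Dense reach the final target.
    model = keras.Sequential([
        keras.Input(shape=(32, 32, 3)),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])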


Limitations

  • The linear algebra formalism we use handles images nicely, and hence lets us implement CNNs (translation invariance along x and y) nicely
  • There are more invariances out there!
    • Rotation
    • Scale
    • Luminosity
    • … you name it…
  • So currently the networks have to learn them all
    • We can do tricks to increase the number of samples in our datasets with augmentation techniques (e.g. apply random transformations of scale, rotation, etc.), as sketched below
    • “Built-in” invariance (such as the x-y one) has the advantage of reducing the number of weights to learn by orders of magnitude
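
A sketch of augmentation with the Keras preprocessing layers (available as regular layers in recent TensorFlow/Keras versions; the transformation parameters are illustrative):

    from tensorflow import keras
    from tensorflow.keras import layers

    # Random transformations applied to the training images teach the
    # network invariances (rotation, scale, ...) that are not built
    # into the architecture.
    augment = keras.Sequential([
        layers.RandomRotation(0.1),       # rotate by up to +-10% of a turn
        layers.RandomZoom(0.2),           # random scale changes
        layers.RandomFlip("horizontal"),  # random left-right mirroring
    ])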


Understanding the dimensions of the convolution

  • Convolutions can be 1D, 2D, 3D
  • Kernel size: typically square (MxM) with M odd (but it can be any shape)
  • Padding: how do we handle borders? We can process only “valid” windows (no padding) or treat the borders as if there were zeros (or other values) outside
  • Each “point” in the 1D, 2D, 3D matrix can have multiple features (e.g. R,G,B)
  • Each convolutional layer has multiple outputs (filters) for every “patch” it scans (one optimized to detect if the patch is uniformly filled, one looking for vertical lines, etc.)


[Figure: 2D convolution on a 6x6x3 image, with a 3x3 kernel, no padding, 5 filters]
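
The configuration in the figure can be written in a few lines of Keras (a sketch using the functional API):

    from tensorflow import keras
    from tensorflow.keras import layers

    # A 6x6 RGB image, a 3x3 kernel, no padding ("valid") and 5 filters.
    inputs = keras.Input(shape=(6, 6, 3))
    x = layers.Conv2D(filters=5, kernel_size=(3, 3), padding="valid")(inputs)
    print(x.shape)  # (None, 4, 4, 5): 4 = 6 - 3 + 1 valid positions per axis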


Pooling (subsampling)

  • Pooling layers simply find maxima or compute averages in patches of a convolutional layer’s output
  • Pooling is used to reduce the spatial dimensionality after a convolutional layer (see the sketch below)
    • The Conv “filters” look for features (e.g. a filter may look for cat eyes)
    • The Pooling layer checks if, in a given region, some filter fired (there was a cat eye somewhere in this broader region)
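
A minimal sketch of the two pooling flavours in Keras:

    from tensorflow.keras import layers

    # MaxPooling2D keeps, for each 2x2 patch and each filter, only the
    # strongest activation: "did this feature fire anywhere in the patch?"
    max_pool = layers.MaxPooling2D(pool_size=(2, 2))
    # AveragePooling2D instead computes the mean activation per patch.
    avg_pool = layers.AveragePooling2D(pool_size=(2, 2))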


Typical CNN architecture

[Figure: a typical CNN architecture, alternating convolutional and pooling layers followed by Flatten and Dense layers]

More on convolution

  • Convolution is a way to correlate local input information and to reduce the NN size by sharing the weights of the nodes across all repeated patches
  • What if I have multiple objects with no local correlation, but with multiple features (like R,G,B channels), and I want to process them all in the same way?
    • 1x1 convolution!
    • Conv1D is usually enough (as the x-y coordinates have no meaning here)
    • The symmetry here is that all “objects” are the same
  • Example (see the sketch below):
    • Particles in a detector, with information about the 4-vector, tracking hits, calorimeter deposits, p-ID etc…, that we want to preprocess one by one before using them for some higher-level task
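
A sketch of this per-object preprocessing (the number of particles and of features per particle are illustrative):

    from tensorflow import keras
    from tensorflow.keras import layers

    # Up to 30 particles, each with 8 features (4-vector, hits,
    # calorimeter deposits, p-ID, ...). A kernel_size=1 Conv1D applies
    # the same small network to every particle independently, sharing
    # the weights across all objects.
    inputs = keras.Input(shape=(30, 8))
    x = layers.Conv1D(filters=16, kernel_size=1, activation="relu")(inputs)
    print(x.shape)  # (None, 30, 16): one 16-feature embedding per particle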


Bounding Box

In order to predict “where” an object is, a “bounding box” is defined

  • Coordinates of two opposite corners
  • Essentially a “regression” problem (see the sketch below)

It is not simple to extend this to multiple objects in a single image; the YOLO (You Only Look Once) algorithm is an option: https://pjreddie.com/darknet/yolo/

  • Divide the image into cells; in each cell predict up to N bounding boxes (with corners relative to the cell position)
  • Keep only cells with a high score (and cluster multiple predictions of the same bounding box)
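
A sketch of the single-object regression variant (shapes and sizes are illustrative): the network body is a normal CNN, but the head is a linear 4-component output trained with a regression loss.

    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        keras.Input(shape=(64, 64, 1)),
        layers.Conv2D(16, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(4),  # (x1, y1, x2, y2): two opposite corners
    ])
    model.compile(optimizer="adam", loss="mse")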


Transfer learning

  • If we learn to process images of a given size, can we apply that to a different size?
    • If the “scale” is the same, the convolutional part can work unchanged
    • The dense layers (when present) need to be adapted/retrained anyhow
  • Transfer learning is a technique to reuse a network trained for one task to perform another task with reduced retraining (see the sketch below)
  • E.g. a Conv2D network meant for image processing has initial layers processing “local features”... which are not very domain specific (if you trained on flower images it may work on animals too)
  • Very useful when the available sample in the proper domain is small
    • E.g. annotated medical images are harder to get than labelled real-world pictures
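
A transfer-learning sketch in Keras, here using VGG16 pretrained on ImageNet as the frozen convolutional base (the choice of base network and of the new head is illustrative):

    from tensorflow import keras
    from tensorflow.keras import layers

    # Reuse the pretrained convolutional part ("local features") and
    # retrain only a new dense head on the small domain-specific sample.
    base = keras.applications.VGG16(include_top=False, weights="imagenet",
                                    input_shape=(224, 224, 3))
    base.trainable = False  # freeze the pretrained layers

    model = keras.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(2, activation="softmax"),  # e.g. a 2-class medical task
    ])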


Variable length, sequences and causality

  • What if the input has a variable length? For example
    • Text translation
    • Identification of “jets” of particles in High Energy Physics
  • In many cases sequences still have a concept of locality and translation invariance
    • “A cat” and “the cat” are two sentences, both containing “cat” but in different positions
  • Sequences often also have an implied ordering
    • “The cat eats a mouse” and “The mouse eats a cat” have different meanings


Exploiting time invariance

  • Some problems are “time invariant”
    • E.g. recognizing words in a sentence (written or spoken)
  • Order matters, and some causality is implied in the sequence
  • The length of the inputs or of the outputs may not be fixed

Recurrent Networks (RNN)

  • Iterative networks whose output is passed back as input (see the sketch below)
    • This allows some “memory” of the previous inputs and/or some internal “state” of what the network has understood so far in the sequence
  • The most commonly used RNNs are LSTM (Long Short Term Memory) and GRU (Gated Recurrent Unit)
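
A minimal recurrent sketch in Keras (sequence length and feature count are illustrative; GRU has the same interface):

    from tensorflow import keras
    from tensorflow.keras import layers

    # Sequences of 20 steps with 4 features each; the LSTM state carries
    # the "memory" along the sequence and, by default, only the output
    # of the last step is returned.
    model = keras.Sequential([
        keras.Input(shape=(20, 4)),
        layers.LSTM(32),
        layers.Dense(1, activation="sigmoid"),
    ])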


LSTM and GRU

  • LSTM and GRU are RNN units with additional features to control their “memory”
  • “Gates” allow controlling (keeping or dropping) the input, the output and the internal state
  • The advantage of gated units is that they can forget, so that when processing a sequence they focus on the relevant part (e.g. when processing a text we may know that each time we encounter a space the word is over)

[Figure: internal structure of the LSTM and GRU cells]


Different ways of processing time series

  • Recurrent networks can be used to implement networks with a variable number of inputs and outputs
    • Encoding, Decoding, Sequence2Sequence


Keras basic layers

  • Convolutional layers
    • Flatten
    • Conv1D/2D/3D
    • ConvTranspose or “Deconvolution”
    • UpSampling and ZeroPadding
    • MaxPooling, AveragePooling
  • Recurrent layers
    • LSTM
    • GRU
    • SimpleRNN
    • TimeDistributed
    • ConvLSTM2D


channels_first vs channels_last

Clarifies which indices are part of the convolution and which index is the “channels” one (see the sketch below):

(#samples, X, Y, channels) <- default (“channels_last”)

vs

(#samples, channels, X, Y) (“channels_first”)
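
A sketch of how this is selected in Keras, via the data_format argument of the convolutional layers:

    from tensorflow.keras import layers

    # "channels_last" (the default) expects (samples, X, Y, channels).
    conv_last = layers.Conv2D(5, (3, 3), data_format="channels_last")
    # "channels_first" expects (samples, channels, X, Y) instead.
    conv_first = layers.Conv2D(5, (3, 3), data_format="channels_first")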


More on LSTM

  • LSTM layers in keras can return (see the sketch and example output below)
    • Just the output of the last iteration
    • The whole sequence of outputs
    • The gated output of the memory
    • The cell state
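
A sketch producing an output like the one below (the exact numbers depend on the randomly initialized weights): return_sequences=True gives one output per step, and return_state=True appends the last hidden state and the cell state.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers

    inputs = keras.Input(shape=(5, 1))
    outputs = layers.LSTM(1, return_sequences=True,
                          return_state=True)(inputs)
    model = keras.Model(inputs, outputs)
    print(model.predict(np.linspace(0.1, 0.5, 5).reshape(1, 5, 1)))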


    [array([[[0.02106816],
             [0.05576485],
             [0.09626514],
             [0.13520567],
             [0.16713278]]], dtype=float32),
     array([[0.16713278]], dtype=float32),
     array([[0.41894686]], dtype=float32)]


Using LSTM

  • Many to one configuration:
    • Just use an LSTM layer with the default config
      • No need to know the full sequence
      • Optionally request also the cell state
  • Many to many (synchronous)
    • Set return_sequences=True to get exactly one output for each input
  • Many to many (async, different lengths)
    • Needs two LSTMs: an encoder + a decoder
    • Sequence2Sequence or Encoder-Decoder architecture
    • The cell state of the encoder can be used as the initial state of the decoder
    • Need to define a STOP character that tells us when the decoded sequence is over
  • Inputs with variable length should be “padded” (see the sketch below)
    • Masking layers exist in keras to avoid “learning from the padding”
    • Reversing the sentence order (so that the padding is at the beginning) also helps
    • With LSTMs it is often useful to provide the most important information at the end
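
A padding/masking sketch (sequence values and lengths are illustrative): pad_sequences pads at the beginning by default, and the Masking layer tells the LSTM to skip the padded steps.

    import numpy as np
    from tensorflow import keras
    from tensorflow.keras import layers
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    # Two sequences of different lengths, padded with zeros to length 4.
    batch = pad_sequences([[1.0, 2.0], [3.0, 4.0, 5.0, 6.0]],
                          maxlen=4, dtype="float32")[..., np.newaxis]
    model = keras.Sequential([
        keras.Input(shape=(4, 1)),
        layers.Masking(mask_value=0.0),  # zeros from padding are ignored
        layers.LSTM(8),
        layers.Dense(1),
    ])
    print(model.predict(batch).shape)  # (2, 1): one output per sequence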


Assignment 3

Create a CNN that recognizes squares and circles in an image. Let’s try three variations:

  • Classify: does the image contain a rectangle or a circle?
  • Count circles and rectangles when there is more than one in the dataset
  • Find the position (bounding box) of the circle or rectangle

https://colab.research.google.com/drive/1kRP1NfbL3hj9xIHAnfMEx9ug76ozGeqR

Solution


Assignment 4

Try building from scratch an LSTM that finds the vector of maximum length and its position in a sequence of two-dimensional vectors.

  • Generate some data
  • Build a network with one LSTM layer followed by a Dense one

Solution
