An introduction to modern deep learning
Presenter: Simone Scardapane
INFN 2023
Why do we need neural networks?
“Table”
(classification)
Dense prediction
“A desk with some chairs”
(image captioning)
A warning before we start
It all works, provided we have sufficient data (and compute).
Introduction
What is a neural network?
The classical definition
No biology, please.
« computing systems [vaguely] inspired by the biological neural networks that constitute animal brains »
— Wikipedia
A modern definition
A “neural network”
Any efficient, differentiable, parametric software code
A modern definition: Differentiability
Automatically computed
Both operations are efficient on hardware and scalable
A modern definition: Composability
A modern definition: Parametric
Defines the behaviour of the network
Potentially in the order of billions
NNs as differentiable code
def my_network(x: image,    # input type
               w: tensor    # parameters
               ) -> image:  # output type
    ...  # sequence of
    ...  # differentiable
    ...  # primitives
    return y  # output
Automatic differentiation
def f(x):
    y = my_network(x)
    y = another_network(y)
    y = yet_another_network(y)
    return y

Easily composable
Automatic differentiation
Numerical optimization
The loss (e.g., the average squared error) goes into the optimizer; automatic differentiation supplies its gradient, and a "better" function comes out.
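As a minimal sketch of this loop (plain NumPy; the data and names are hypothetical), here is gradient descent on an average squared error for a single parameter:

```python
import numpy as np

# Toy data for a hypothetical one-parameter model ypred = w * x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

w = 0.0    # initial parameter
lr = 0.05  # learning rate

for _ in range(200):
    ypred = w * x
    # Gradient of the average squared error w.r.t. w, written by hand here;
    # in a real framework, autodiff computes it for us.
    grad = np.mean(2 * (ypred - y) * x)
    w = w - lr * grad  # the optimizer step: a "better" parameter comes out
```

Each iteration consumes the loss (through its gradient) and returns improved parameters, which is all the optimizer box in the diagram does.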
Deep learning stack
Layer 0: tensor primitives + their VJPs (matrix multiplication, etc.). One primitive may be implemented with multiple kernels depending on the supported hardware (CPU, GPU, TPU, IPU, …).
Layer 1: autodiff module.
Layer 2: high-level constructs (layers, optimizers, losses, metrics, …).
Ecosystem: hubs, libraries, extensions, …
Code example (TensorFlow)
# W is a tf.Variable holding the weights, defined outside f.
def f(X: tf.Tensor, y: tf.Tensor):
    H = tf.linalg.matmul(X, W)            # matrix operations
    ypred = tf.keras.layers.Dense(1)(H)   # pre-implemented components
    return tf.reduce_sum((ypred - y)**2)  # scalar output

with tf.GradientTape() as tape:  # stores all intermediate outputs
    l = f(X, y)
g = tape.gradient(l, [W])  # efficient autodiff engine
Physics-inspired neural networks
Let’s get serious
A primer on automatic differentiation
A primer on autodiff (1/3)
Any layer or network f(x, w) takes an input x and (trainable) parameters w. In the general case, its derivatives with respect to x and w are matrices (Jacobians), i.e., linear maps.
A primer on autodiff (2/3)
Given a generic composition of two neural layers, y = g(h(x)), the chain rule gives the gradient of the composition as a matrix multiplication of their Jacobians:

∂y/∂x = (∂g/∂h) (∂h/∂x)
A primer on autodiff (3/3)
If g has a scalar output, ∂g/∂h is a vector v, and v (∂h/∂x) is a vector-Jacobian product (VJP): the basic primitive of reverse-mode autodiff.
An example in code
Reverse-mode autodiff (backpropagation)
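A minimal sketch of reverse-mode autodiff in plain NumPy (all names here are hypothetical): every primitive returns its output together with its VJP, and the backward pass applies the recorded VJPs in reverse order, starting from the scalar loss.

```python
import numpy as np

# Each primitive returns (output, vjp), where vjp maps the upstream
# gradient to the gradient w.r.t. the primitive's input.
def linear(x, W):
    def vjp(g):
        return g @ W.T            # gradient w.r.t. x (W is kept fixed here)
    return x @ W, vjp

def relu(x):
    def vjp(g):
        return g * (x > 0)
    return np.maximum(x, 0), vjp

def sum_all(x):
    def vjp(g):                   # g is a scalar
        return g * np.ones_like(x)
    return x.sum(), vjp

# Forward pass: compute outputs and record the VJPs in order.
rng = np.random.default_rng(0)
x, W = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
h, vjp1 = linear(x, W)
a, vjp2 = relu(h)
loss, vjp3 = sum_all(a)

# Backward pass: chain the VJPs in reverse, seeded with dloss/dloss = 1.
grad_x = vjp1(vjp2(vjp3(1.0)))
```

Frameworks such as TensorFlow's GradientTape automate exactly this recording and reverse traversal.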
Neural layers
Building invariances
A basic neural network
Linear projections & elementwise nonlinearities are universal approximators, but they do not scale to more structured types of data.
Deep learning is about leveraging structure
Set classification
Consider a neural network that must manipulate sets of objects:
The output should not change for a simple re-ordering (permutation invariance).
Set-based neural networks
We can embed this property directly into the design of permutation-invariant networks:
Generic layer of the appropriate type
Invariant aggregation
Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. and Smola, A., 2017. Deep sets. arXiv preprint arXiv:1703.06114.
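Following the Deep Sets recipe, a minimal NumPy sketch (weights and sizes hypothetical): a shared map φ is applied to every element, a sum aggregates the results invariantly, and ρ maps the aggregate to the output.

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(3, 8))   # shared per-element weights (phi)
W_rho = rng.normal(size=(8, 1))   # weights applied after aggregation (rho)

def deep_set(X):
    """X: (n_elements, 3) array, one row per set element; returns a scalar."""
    H = np.maximum(X @ W_phi, 0)  # phi, applied to every element with shared weights
    h = H.sum(axis=0)             # permutation-invariant aggregation
    return (h @ W_rho).item()     # rho

X = rng.normal(size=(5, 3))
perm = rng.permutation(5)
# Re-ordering the set leaves the output unchanged (permutation invariance).
assert np.isclose(deep_set(X), deep_set(X[perm]))
```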
Image convolutions
Local pixel operation
Neighbour pixel operation
Aggregation
Graph convolutions!
Local vertex operation
We use the adjacency matrix (or similar)
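A minimal sketch of one such layer in NumPy (adjacency matrix and sizes hypothetical): neighbour features are aggregated through the adjacency matrix, then passed through a shared linear map and a nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],       # adjacency matrix of a 4-vertex graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))       # one feature vector per vertex
W = rng.normal(size=(3, 2))       # weights, shared across vertices

A_hat = A + np.eye(4)             # self-loops: each vertex also keeps its own features
H = np.maximum(A_hat @ X @ W, 0)  # aggregation + local vertex operation + ReLU
```

Because the same W is used at every vertex, relabelling the vertices simply permutes the rows of H in the same way (permutation equivariance).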
Neural layers
A Transformer is all we need
Data ingestion in deep learning
Images → CNNs, …
Audio → WaveNet, …
Texts → Word embeddings, RNNs, …
Graphs → MPNNs, …
The Transformers revolution
Images → Vision Transformers (2020-2021)
Audio → Audio Transformers (2020-2021)
Texts → NLP Transformers (2017)
Graphs → Graph Transformers (2022)
Power Laws for Scaling
Transformers at a glance
Tokenization turns the input into a (#tokens × |embedding|) matrix of token embeddings (input-dependent); positional embeddings (input-independent) are added before the Transformer model.
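A minimal NumPy sketch of this input pipeline (all sizes and tables hypothetical): tokenization yields token ids, each id selects a learned embedding (input-dependent), and a learned positional embedding (input-independent, the same for every input) is added.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_tokens, emb_dim = 100, 6, 16

token_table = rng.normal(size=(vocab_size, emb_dim))  # learned token embeddings
pos_emb = rng.normal(size=(n_tokens, emb_dim))        # learned positional embeddings

token_ids = np.array([5, 17, 3, 42, 8, 99])  # hypothetical tokenizer output

X = token_table[token_ids]  # (#tokens, |embedding|): input-dependent
X = X + pos_emb             # input-independent positions added elementwise
# X is the (#tokens, |embedding|) matrix fed to the Transformer blocks.
```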
Conclusions
Research directions and “humans-in-the-loop”
Research directions
Self-supervised learning: pre-training with no labels
Multi-modality through data-dependent tokenization
Efficient sparsity and modularity
Interpretability?
Human intuition and AI
Post-hoc explainability: the input goes through the Transformer to produce a prediction ("Lion"); a separate explainer then assigns relevance scores to the input.
"Intrinsic" interpretability: the Transformer itself selects which tokens to use while producing the prediction ("Lion"). Note that this is a discrete selection!
A practical example
Thanks! Questions?
Simone Scardapane, Tenure-track Assistant Professor