An introduction to modern deep learning
Presenter: Simone Scardapane
INFN 2023
Why do we need neural networks?
“Table”
(classification)
Dense prediction
“A desk with some chairs”
(image captioning)
A warning before we start
It all works, provided we have sufficient data (and compute).
Introduction
What is a neural network?
The classical definition
No biology, please.
« computing systems [vaguely] inspired by the biological neural networks that constitute animal brains »
— Wikipedia
A modern definition
A “neural network”
Any efficient, differentiable, parametric software code
A modern definition: Differentiability
Automatically computed
Both operations are efficient on hardware and scalable
A modern definition: Composability
A modern definition: Parametric
Defines the behaviour of the network
Potentially in the order of billions
NNs as differentiable code
def my_network(x: image,    # input type
               w: tensor    # parameters
               ) -> image:  # output type
    ...  # sequence of
    ...  # differentiable
    ...  # primitives
    return y  # output
Automatic differentiation
def f(x):
    y = my_network(x)
    y = another_network(y)
    y = yet_another_network(y)
    return y

Easily composable
Automatic differentiation
Numerical optimization
The loss (e.g., the average squared error) goes into the optimizer; automatic differentiation supplies its gradient, and a "better" function comes out.
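As a minimal sketch of this loop (plain NumPy; the data and names are hypothetical), here is gradient descent on an average squared error for a single parameter:

```python
import numpy as np

# Toy data for a hypothetical one-parameter model ypred = w * x.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

w = 0.0    # initial parameter
lr = 0.05  # learning rate

for _ in range(200):
    ypred = w * x
    # Gradient of the average squared error w.r.t. w, written by hand here;
    # in a real framework, autodiff computes it for us.
    grad = np.mean(2 * (ypred - y) * x)
    w = w - lr * grad  # the optimizer step: a "better" parameter comes out
```

Each iteration consumes the loss (through its gradient) and returns improved parameters, which is all the optimizer box in the diagram does.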
Deep learning stack
Layer 0: tensor primitives + their VJPs (matrix multiplication, etc.). One primitive may be implemented with multiple kernels depending on the supported hardware (CPU, GPU, TPU, IPU, …).
Layer 1: autodiff module.
Layer 2: high-level constructs (layers, optimizers, losses, metrics, …).
Ecosystem: hubs, libraries, extensions, …
Code example (TensorFlow)
# W is a tf.Variable holding the weights, defined outside f.
def f(X: tf.Tensor, y: tf.Tensor):
    H = tf.linalg.matmul(X, W)            # matrix operations
    ypred = tf.keras.layers.Dense(1)(H)   # pre-implemented components
    return tf.reduce_sum((ypred - y)**2)  # scalar output

with tf.GradientTape() as tape:  # stores all intermediate outputs
    l = f(X, y)
g = tape.gradient(l, [W])  # efficient autodiff engine
Physics-inspired neural networks
Let’s get serious
A primer on automatic differentiation
A primer on autodiff (1/3)
Any layer or network f(x, w) takes an input x and (trainable) parameters w. In the general case, its derivatives with respect to x and w are matrices (Jacobians), i.e., linear maps.
A primer on autodiff (2/3)
Given a generic composition of two neural layers, y = g(h(x)), the chain rule gives the gradient of the composition as a matrix multiplication of their Jacobians:

∂y/∂x = (∂g/∂h) (∂h/∂x)
A primer on autodiff (3/3)
If g has a scalar output, ∂g/∂h is a vector v, and v (∂h/∂x) is a vector-Jacobian product (VJP): the basic primitive of reverse-mode autodiff.
An example in code
Reverse-mode autodiff (backpropagation)
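A minimal sketch of reverse-mode autodiff in plain NumPy (all names here are hypothetical): every primitive returns its output together with its VJP, and the backward pass applies the recorded VJPs in reverse order, starting from the scalar loss.

```python
import numpy as np

# Each primitive returns (output, vjp), where vjp maps the upstream
# gradient to the gradient w.r.t. the primitive's input.
def linear(x, W):
    def vjp(g):
        return g @ W.T            # gradient w.r.t. x (W is kept fixed here)
    return x @ W, vjp

def relu(x):
    def vjp(g):
        return g * (x > 0)
    return np.maximum(x, 0), vjp

def sum_all(x):
    def vjp(g):                   # g is a scalar
        return g * np.ones_like(x)
    return x.sum(), vjp

# Forward pass: compute outputs and record the VJPs in order.
rng = np.random.default_rng(0)
x, W = rng.normal(size=(4, 3)), rng.normal(size=(3, 2))
h, vjp1 = linear(x, W)
a, vjp2 = relu(h)
loss, vjp3 = sum_all(a)

# Backward pass: chain the VJPs in reverse, seeded with dloss/dloss = 1.
grad_x = vjp1(vjp2(vjp3(1.0)))
```

Frameworks such as TensorFlow's GradientTape automate exactly this recording and reverse traversal.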
Neural layers
Building invariances
A basic neural network
Linear projections & elementwise nonlinearities are universal approximators, but they do not scale to more structured types of data.
Deep learning is about leveraging structure
Set classification
Consider a neural network that must manipulate sets of objects:
The output should not change for a simple re-ordering (permutation invariance).
Set-based neural networks
We can embed this property directly into the design of permutation-invariant networks:
Generic layer of the appropriate type
Invariant aggregation
Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. and Smola, A., 2017. Deep sets. arXiv preprint arXiv:1703.06114.
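Following the Deep Sets recipe, a minimal NumPy sketch (weights and sizes hypothetical): a shared map φ is applied to every element, a sum aggregates the results invariantly, and ρ maps the aggregate to the output.

```python
import numpy as np

rng = np.random.default_rng(0)
W_phi = rng.normal(size=(3, 8))   # shared per-element weights (phi)
W_rho = rng.normal(size=(8, 1))   # weights applied after aggregation (rho)

def deep_set(X):
    """X: (n_elements, 3) array, one row per set element; returns a scalar."""
    H = np.maximum(X @ W_phi, 0)  # phi, applied to every element with shared weights
    h = H.sum(axis=0)             # permutation-invariant aggregation
    return (h @ W_rho).item()     # rho

X = rng.normal(size=(5, 3))
perm = rng.permutation(5)
# Re-ordering the set leaves the output unchanged (permutation invariance).
assert np.isclose(deep_set(X), deep_set(X[perm]))
```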
Image convolutions
Local pixel operation
Neighbour pixel operation
Aggregation
Graph convolutions!
Local vertex operation
We use the adjacency matrix (or similar)
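A minimal sketch of one such layer in NumPy (adjacency matrix and sizes hypothetical): neighbour features are aggregated through the adjacency matrix, then passed through a shared linear map and a nonlinearity.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],       # adjacency matrix of a 4-vertex graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = rng.normal(size=(4, 3))       # one feature vector per vertex
W = rng.normal(size=(3, 2))       # weights, shared across vertices

A_hat = A + np.eye(4)             # self-loops: each vertex also keeps its own features
H = np.maximum(A_hat @ X @ W, 0)  # aggregation + local vertex operation + ReLU
```

Because the same W is used at every vertex, relabelling the vertices simply permutes the rows of H in the same way (permutation equivariance).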
Neural layers
A Transformer is all we need
Data ingestion in deep learning
Images → CNNs, …
Audio → WaveNet, …
Texts → Word embeddings, RNNs, …
Graphs → MPNNs, …
The Transformers revolution
Images → Vision Transformers (2020-2021)
Audio → Audio Transformers (2020-2021)
Texts → NLP Transformers (2017)
Graphs → Graph Transformers (2022)
Power Laws for Scaling
Transformers at a glance
Tokenization turns the input into a (#tokens × |embedding|) matrix of token embeddings (input-dependent); positional embeddings (input-independent) are added before the Transformer model.
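A minimal NumPy sketch of this input pipeline (all sizes and tables hypothetical): tokenization yields token ids, each id selects a learned embedding (input-dependent), and a learned positional embedding (input-independent, the same for every input) is added.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_tokens, emb_dim = 100, 6, 16

token_table = rng.normal(size=(vocab_size, emb_dim))  # learned token embeddings
pos_emb = rng.normal(size=(n_tokens, emb_dim))        # learned positional embeddings

token_ids = np.array([5, 17, 3, 42, 8, 99])  # hypothetical tokenizer output

X = token_table[token_ids]  # (#tokens, |embedding|): input-dependent
X = X + pos_emb             # input-independent positions added elementwise
# X is the (#tokens, |embedding|) matrix fed to the Transformer blocks.
```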
Conclusions
Research directions and “humans-in-the-loop”
Research directions
Self-supervised learning: pre-training with no labels
Multi-modality through data-dependent tokenization
Efficient sparsity and modularity
Interpretability?
Human intuition and AI
Post-hoc explainability: the input goes through the Transformer to produce a prediction ("Lion"); a separate explainer then assigns relevance scores to the input.
"Intrinsic" interpretability: the Transformer itself selects which tokens to use while producing the prediction ("Lion"). Note that this is a discrete selection!
A practical example
Thanks! Questions?
Simone Scardapane, Tenure-track Assistant Professor