
Deep Learning (DEEP-0001)

4 – Deep Neural Networks


Deep neural networks

  • Networks with more than one hidden layer
  • Intuition becomes more difficult


Deep neural networks

  • Composing two networks
  • Combining the two networks into one
  • Hyperparameters
  • Notation change and general case
  • Shallow vs. deep networks


Composing two networks

Two shallow networks are applied in sequence: the output of network 1 becomes the input of network 2.
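The equations on this slide were rendered as images. In the notation introduced later in this lecture, and assuming three hidden units per network as in the running example, the two networks are:

    Network 1 (maps input x to output y):
        h_i = a[\theta_{i0} + \theta_{i1} x],    i = 1, 2, 3
        y = \phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3

    Network 2 (maps y to output y', primed parameters):
        h'_i = a[\theta'_{i0} + \theta'_{i1} y],    i = 1, 2, 3
        y' = \phi'_0 + \phi'_1 h'_1 + \phi'_2 h'_2 + \phi'_3 h'_3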



Comparing to shallow with six hidden units

  • Two-layer (composed) network: 20 parameters, 9 linear regions
  • Shallow network with six hidden units: 19 parameters, at most 7 linear regions
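These counts follow from the scalar input and output of the running example. A shallow network with D hidden units has 3D + 1 parameters and creates at most D + 1 linear regions, while each three-unit network in the composition contributes 10 parameters:

    Shallow, D = 6:    3D + 1 = 19 parameters,    D + 1 = 7 regions at most
    Composed, two 3-unit networks:    10 + 10 = 20 parameters,    9 regions in this example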


Composing networks in 2D


Combine two networks into one

Start from network 1 and network 2 as defined above.

Hidden units of second network in terms of first:
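Substituting the output of network 1 into the hidden units of network 2 (using the three-hidden-unit form assumed above):

    h'_i = a[\theta'_{i0} + \theta'_{i1} y]
         = a[\theta'_{i0} + \theta'_{i1}\phi_0 + \theta'_{i1}\phi_1 h_1 + \theta'_{i1}\phi_2 h_2 + \theta'_{i1}\phi_3 h_3]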


Create new variables
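One consistent choice of new variables (matching the substitution above) is

    \psi_{i0} = \theta'_{i0} + \theta'_{i1}\phi_0,        \psi_{ij} = \theta'_{i1}\phi_j    for j = 1, 2, 3,

which gives

    h'_i = a[\psi_{i0} + \psi_{i1} h_1 + \psi_{i2} h_2 + \psi_{i3} h_3].

The hidden units of the second network are now just another layer applied to the hidden units of the first.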


Two-layer network



Two-layer network as one equation
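Written out in full under the same assumptions, the composed two-layer network is a single equation:

    y' = \phi'_0 + \sum_{i=1}^{3} \phi'_i \, a\!\left[\psi_{i0} + \sum_{j=1}^{3} \psi_{ij} \, a[\theta_{j0} + \theta_{j1} x]\right]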


Hyperparameters

  • Number of hidden layers (depth), K
  • Number of hidden units in each layer (width), D_1, ..., D_K
  • Chosen before training; together they control the capacity of the network


Notation change #1


Notation change #2


Notation change #3


The biases at each layer are collected into a bias vector β_k, and the weights into a weight matrix Ω_k.


General equations for deep network
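With bias vectors \beta_k and weight matrices \Omega_k, a network with K hidden layers computes (the slide rendered these as an image; this is the standard form):

    h_1     = a[\beta_0 + \Omega_0 x]
    h_{k+1} = a[\beta_k + \Omega_k h_k]    for k = 1, ..., K - 1
    y       = \beta_K + \Omega_K h_K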


Example
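The worked example on this slide was an image. As an illustration of the general equations, here is a minimal NumPy sketch; the layer sizes, random parameters, and ReLU activation are assumptions, not values from the slide:

    import numpy as np

    def relu(z):
        # Activation function a[.]; ReLU assumed
        return np.maximum(0.0, z)

    def deep_network(x, betas, omegas):
        # h_1 = a[beta_0 + Omega_0 x]; h_{k+1} = a[beta_k + Omega_k h_k]; y = beta_K + Omega_K h_K
        h = x
        for beta, omega in zip(betas[:-1], omegas[:-1]):
            h = relu(beta + omega @ h)       # hidden layers apply the activation
        return betas[-1] + omegas[-1] @ h    # output layer is linear

    # Hypothetical network: 2 inputs, hidden layers of 4 and 3 units, 1 output
    rng = np.random.default_rng(0)
    sizes = [2, 4, 3, 1]
    omegas = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    betas = [rng.standard_normal(m) for m in sizes[1:]]
    print(deep_network(np.array([1.0, -0.5]), betas, omegas))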


Shallow vs. deep networks

The best results are created by deep networks with many layers.

    • 50-1000 layers for most applications
    • Best results in
      • Computer vision
      • Natural language processing
      • Graph neural networks
      • Generative models
      • Reinforcement learning

All use deep networks. But why?


Shallow vs. deep networks

  1. Ability to approximate different functions?

Both shallow and deep networks obey the universal approximation theorem.

Argument: one hidden layer is already enough, so approximation power alone cannot explain why depth helps.


Shallow vs. deep networks

2. Number of linear regions per parameter


Number of linear regions per parameter

  • 5 layers
  • 10 hidden units per layer
  • 471 parameters
  • 161,501 linear regions
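The parameter count can be checked directly (assuming scalar input and output, as in the running example):

    (10 \cdot 1 + 10) + 4 \, (10 \cdot 10 + 10) + (1 \cdot 10 + 1) = 20 + 440 + 11 = 471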



Shallow vs. deep networks

2. Number of linear regions per parameter

  • Deep networks create many more regions per parameter
  • But these regions contain complex dependencies and symmetries


Shallow vs. deep networks

3. Depth efficiency

  • Some functions can be represented by a deep network but require a shallow network with exponentially more hidden units to achieve an equivalent approximation

  • This is known as the depth efficiency of deep networks

  • But do the real-world functions we want to approximate have this property? Unknown.


Shallow vs. deep networks

4. Large structured networks

  • Think about images as input – might be 1M pixels
  • Fully connected networks are not practical at this scale
  • The answer is to use weights that operate locally and are shared across the image (see the sketch below)
  • This leads to convolutional networks
  • Information from across the image is integrated gradually – this needs multiple layers
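To make "local and shared" concrete, a 1-D convolution applies the same small weight vector at every position of the input. This sketch is illustrative only; the filter size and values are assumptions:

    import numpy as np

    def conv1d(x, w, b):
        # The same weights w and bias b are reused at every position,
        # instead of one weight per input pixel
        k = len(w)
        return np.array([b + w @ x[i:i + k] for i in range(len(x) - k + 1)])

    x = np.arange(8.0)                 # a tiny 1-D "image" (8 pixels, assumed)
    w = np.array([1.0, 0.0, -1.0])     # one shared 3-tap filter (assumed values)
    print(conv1d(x, w, 0.0))           # 3 weights cover the whole input, not 8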


Shallow vs. deep networks

5. Fitting and generalization

  • Fitting deep models seems to be easier up to about 20 layers
  • Beyond that, various tricks are needed to train deeper networks, so (in vanilla form) fitting becomes harder
  • Fitting deep models is also faster
  • Generalization is good in deep networks


Where are we going?

  • We have defined families of very flexible networks that map multiple inputs to multiple outputs
  • Now we need to train them
    • How to choose loss functions
    • How to find minima of the loss function
    • How to do this in particular for deep networks
  • Then we need to test them