
Deep Learning (DEEP-0001)

4 – Deep Neural Networks


Deep neural networks

  • Networks with more than one hidden layer
  • Intuition becomes more difficult


Deep neural networks

  • Composing two networks
  • Combining the two networks into one
  • Hyperparameters
  • Notation change and general case
  • Shallow vs. deep networks


Composing two networks

Two shallow networks are applied in sequence: the output of network 1 becomes the input of network 2.
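The equations on this slide were rendered as images. In the notation introduced later in this lecture, and assuming three hidden units per network as in the running example, the two networks are:

    Network 1 (maps input x to output y):
        h_i = a[\theta_{i0} + \theta_{i1} x],    i = 1, 2, 3
        y = \phi_0 + \phi_1 h_1 + \phi_2 h_2 + \phi_3 h_3

    Network 2 (maps y to output y', primed parameters):
        h'_i = a[\theta'_{i0} + \theta'_{i1} y],    i = 1, 2, 3
        y' = \phi'_0 + \phi'_1 h'_1 + \phi'_2 h'_2 + \phi'_3 h'_3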



Comparing to shallow with six hidden units

  • Two-layer (composed) network: 20 parameters, 9 linear regions
  • Shallow network with six hidden units: 19 parameters, at most 7 linear regions
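These counts follow from the scalar input and output of the running example. A shallow network with D hidden units has 3D + 1 parameters and creates at most D + 1 linear regions, while each three-unit network in the composition contributes 10 parameters:

    Shallow, D = 6:    3D + 1 = 19 parameters,    D + 1 = 7 regions at most
    Composed, two 3-unit networks:    10 + 10 = 20 parameters,    9 regions in this example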


Composing networks in 2D


Combine two networks into one

Start from network 1 and network 2 as defined above.

Hidden units of second network in terms of first:
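Substituting the output of network 1 into the hidden units of network 2 (using the three-hidden-unit form assumed above):

    h'_i = a[\theta'_{i0} + \theta'_{i1} y]
         = a[\theta'_{i0} + \theta'_{i1}\phi_0 + \theta'_{i1}\phi_1 h_1 + \theta'_{i1}\phi_2 h_2 + \theta'_{i1}\phi_3 h_3]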


Create new variables
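One consistent choice of new variables (matching the substitution above) is

    \psi_{i0} = \theta'_{i0} + \theta'_{i1}\phi_0,        \psi_{ij} = \theta'_{i1}\phi_j    for j = 1, 2, 3,

which gives

    h'_i = a[\psi_{i0} + \psi_{i1} h_1 + \psi_{i2} h_2 + \psi_{i3} h_3].

The hidden units of the second network are now just another layer applied to the hidden units of the first.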


Two-layer network



Two-layer network as one equation
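Written out in full under the same assumptions, the composed two-layer network is a single equation:

    y' = \phi'_0 + \sum_{i=1}^{3} \phi'_i \, a\!\left[\psi_{i0} + \sum_{j=1}^{3} \psi_{ij} \, a[\theta_{j0} + \theta_{j1} x]\right]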


Hyperparameters

  • Number of hidden layers (depth), K
  • Number of hidden units in each layer (width), D_1, ..., D_K
  • Chosen before training; together they control the capacity of the network


Notation change #1


Notation change #2


Notation change #3


The biases at each layer are collected into a bias vector β_k, and the weights into a weight matrix Ω_k.


General equations for deep network
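With bias vectors \beta_k and weight matrices \Omega_k, a network with K hidden layers computes (the slide rendered these as an image; this is the standard form):

    h_1     = a[\beta_0 + \Omega_0 x]
    h_{k+1} = a[\beta_k + \Omega_k h_k]    for k = 1, ..., K - 1
    y       = \beta_K + \Omega_K h_K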


Example
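The worked example on this slide was an image. As an illustration of the general equations, here is a minimal NumPy sketch; the layer sizes, random parameters, and ReLU activation are assumptions, not values from the slide:

    import numpy as np

    def relu(z):
        # Activation function a[.]; ReLU assumed
        return np.maximum(0.0, z)

    def deep_network(x, betas, omegas):
        # h_1 = a[beta_0 + Omega_0 x]; h_{k+1} = a[beta_k + Omega_k h_k]; y = beta_K + Omega_K h_K
        h = x
        for beta, omega in zip(betas[:-1], omegas[:-1]):
            h = relu(beta + omega @ h)       # hidden layers apply the activation
        return betas[-1] + omegas[-1] @ h    # output layer is linear

    # Hypothetical network: 2 inputs, hidden layers of 4 and 3 units, 1 output
    rng = np.random.default_rng(0)
    sizes = [2, 4, 3, 1]
    omegas = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    betas = [rng.standard_normal(m) for m in sizes[1:]]
    print(deep_network(np.array([1.0, -0.5]), betas, omegas))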


Shallow vs. deep networks

The best results are created by deep networks with many layers.

    • 50-1000 layers for most applications
    • Best results in
      • Computer vision
      • Natural language processing
      • Graph neural networks
      • Generative models
      • Reinforcement learning

All use deep networks. But why?


Shallow vs. deep networks

  1. Ability to approximate different functions?

Both shallow and deep networks obey the universal approximation theorem.

Argument: one hidden layer is already enough, so approximation power alone cannot explain why depth helps.


Shallow vs. deep networks

2. Number of linear regions per parameter


Number of linear regions per parameter

  • 5 layers
  • 10 hidden units per layer
  • 471 parameters
  • 161,501 linear regions
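The parameter count can be checked directly (assuming scalar input and output, as in the running example):

    (10 \cdot 1 + 10) + 4 \, (10 \cdot 10 + 10) + (1 \cdot 10 + 1) = 20 + 440 + 11 = 471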



Shallow vs. deep networks

2. Number of linear regions per parameter

  • Deep networks create many more regions per parameter
  • But these regions contain complex dependencies and symmetries


Shallow vs. deep networks

3. Depth efficiency

  • Some functions can be represented by a deep network but require a shallow network with exponentially more hidden units to achieve an equivalent approximation

  • This is known as the depth efficiency of deep networks

  • But do the real-world functions we want to approximate have this property? Unknown.


Shallow vs. deep networks

4. Large structured networks

  • Think about images as input – might be 1M pixels
  • Fully connected networks are not practical at this scale
  • The answer is to use weights that operate locally and are shared across the image (see the sketch below)
  • This leads to convolutional networks
  • Information from across the image is integrated gradually – this needs multiple layers
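To make "local and shared" concrete, a 1-D convolution applies the same small weight vector at every position of the input. This sketch is illustrative only; the filter size and values are assumptions:

    import numpy as np

    def conv1d(x, w, b):
        # The same weights w and bias b are reused at every position,
        # instead of one weight per input pixel
        k = len(w)
        return np.array([b + w @ x[i:i + k] for i in range(len(x) - k + 1)])

    x = np.arange(8.0)                 # a tiny 1-D "image" (8 pixels, assumed)
    w = np.array([1.0, 0.0, -1.0])     # one shared 3-tap filter (assumed values)
    print(conv1d(x, w, 0.0))           # 3 weights cover the whole input, not 8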


Shallow vs. deep networks

5. Fitting and generalization

  • Fitting deep models seems to be easier up to about 20 layers
  • Beyond that, various tricks are needed to train deeper networks, so (in vanilla form) fitting becomes harder
  • Fitting deep models is also faster
  • Generalization is good in deep networks


Where are we going?

  • We have defined families of very flexible networks that map multiple inputs to multiple outputs
  • Now we need to train them
    • How to choose loss functions
    • How to find minima of the loss function
    • How to do this in particular for deep networks
  • Then we need to test them