1 of 152

Machine Learning

Prof. Seungtaek Choi

2 of 152

Last Time

  • Logistic Regression
  • Classification (SVM)

3 of 152

Today

  • Neural Networks
  • Backpropagation
  • Announcement: 2nd Assignment!
    • Practice with PyTorch

4 of 152

Neural Networks

5 of 152

Neural Networks

  • Origins: Algorithms that try to mimic the brain.
  • Widely used in the 80s and early 90s; popularity diminished in the late 90s.
  • Recent resurgence: State-of-the-art technique for many applications.

6 of 152

The brain adapts its function to the input it receives.

7 of 152

8 of 152

The brain flexibly adapts to incoming sensory channels and can even learn entirely new senses. – A general-purpose, multimodal learning machine.

9 of 152

Neuron = input integration → threshold → output.

10 of 152

Neurons form networks: from sensory input to motor output.

11 of 152

12 of 152

13 of 152

14 of 152

15 of 152

16 of 152

17 of 152

18 of 152

19 of 152

20 of 152

Perceptron: Binary Linear Classifier

  • Given weights


21 of 152

Perceptron: Geometric Interpretation

  • Given weights


22 of 152

Perceptron for New Data

  • Forward propagation with new data
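As a minimal sketch of this forward propagation (the weights and bias below are illustrative assumptions, not values from the slides), a perceptron is just a weighted sum followed by a threshold:

```python
def perceptron(x, w, b):
    """Binary linear classifier: output 1 if w.x + b > 0, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b  # input integration
    return 1 if z > 0 else 0                      # threshold

# A 2D example: the line x1 + x2 - 1.5 = 0 splits the plane.
w, b = [1.0, 1.0], -1.5
print(perceptron([1, 1], w, b))  # → 1 (above the line)
print(perceptron([0, 0], w, b))  # → 0 (below the line)
```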


23 of 152

Binary Linear Classifier in 2D


24 of 152

25 of 152

26 of 152

27 of 152

Idea: Nonlinear Curve Approximated by Multiple Lines

  • Nonlinear regression
  • Nonlinear classification


28 of 152

AND Problem

  • A single sigmoid neuron with weights (w0, w1, w2) = (-30, 20, 20) computes AND:
    h(x1, x2) = g(-30 + 20*x1 + 20*x2)

    x1   x2 | z = -30 + 20*x1 + 20*x2 | g(z) ~ | x1 AND x2
     0    0 |          -30            |   0    |    0
     0    1 |          -10            |   0    |    0
     1    0 |          -10            |   0    |    0
     1    1 |           10            |   1    |    1
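The AND neuron above can be checked directly with the weights from the slide:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# AND as a single sigmoid neuron with the slide's weights (-30, 20, 20):
# h(x1, x2) = g(-30 + 20*x1 + 20*x2)
def and_neuron(x1, x2):
    return sigmoid(-30 + 20 * x1 + 20 * x2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(and_neuron(x1, x2)))  # outputs 0, 0, 0, 1
```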

29 of 152

OR Problem

  • A single sigmoid neuron with weights (w0, w1, w2) = (-10, 20, 20) computes OR:
    h(x1, x2) = g(-10 + 20*x1 + 20*x2)

    x1   x2 | z = -10 + 20*x1 + 20*x2 | g(z) ~ | x1 OR x2
     0    0 |          -10            |   0    |    0
     0    1 |           10            |   1    |    1
     1    0 |           10            |   1    |    1
     1    1 |           30            |   1    |    1

30 of 152

How about the XOR Problem?

  • Minsky–Papert controversy on XOR
  • XOR is not linearly separable

  • Single neuron = one linear classification boundary
    • A perceptron cannot solve XOR due to its linear nature

    x1   x2 | x1 XOR x2
     0    0 |     0
     0    1 |     1
     1    0 |     1
     1    1 |     0

31 of 152

XOR Problem

  • At least two lines are required


32 of 152

XOR Problem

  • At least two lines are required
  • If two perceptrons are stacked, it represents two hyperplanes.


33 of 152

XOR Problem

  • At least two lines are required
  • If two perceptrons are stacked, it represents two hyperplanes.


34 of 152

Multiple Perceptrons

  • Multiple neurons = multiple linear classification boundaries

35 of 152

Multiple Perceptrons

  • Multiple neurons = multiple linear classification boundaries

36 of 152

Multiple Perceptrons

  • Sigmoid as the nonlinear activation function
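As a sketch of such a layer (the weights below are illustrative assumptions), each neuron computes a weighted sum, i.e. one linear boundary, and the sigmoid is applied elementwise:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(x, W, b):
    """One layer of perceptrons: each row of W and entry of b defines
    one neuron (one linear boundary), passed through the sigmoid."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]

# Two neurons = two linear boundaries (OR-like and AND-like biases):
W = [[20.0, 20.0], [20.0, 20.0]]
b = [-10.0, -30.0]
print([round(v) for v in layer([1.0, 0.0], W, b)])  # → [1, 0]
```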

37 of 152

Multiple Perceptrons

  • In a compact representation


38 of 152

Multiple Perceptrons

  • In a compact representation


39 of 152

Multiple Perceptrons

  • In a compact representation


First layer

with neurons

40 of 152

Multiple Perceptrons

  • With one more layer…


First layer

Second layer

41 of 152

42 of 152

Another Interpretation

  • XOR can be represented with only AND, OR, NOT.
  • A XOR B = (A OR B) AND NOT (A AND B)
    • Combination of simple operations

43 of 152

Another Interpretation

  • A XOR B = (A OR B) AND NOT (A AND B)
    • z1 = g(-10 + 20*A + 20*B)   (A OR B)
    • z2 = g(-30 + 20*A + 20*B)   (A AND B)

44 of 152

Another Interpretation

  • A XOR B = (A OR B) AND NOT (A AND B) = z1 AND NOT z2
    • z1 = g(-10 + 20*A + 20*B)    (A OR B)
    • z2 = g(-30 + 20*A + 20*B)    (A AND B)
    • y  = g(-10 + 20*z1 - 20*z2)  (z1 AND NOT z2)

    A    B | z1 | z2 | y = A XOR B
    0    0 |  0 |  0 |      0
    0    1 |  1 |  0 |      1
    1    0 |  1 |  0 |      1
    1    1 |  1 |  1 |      0
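The two-layer XOR construction can be verified in code. The OR and AND weights are the slide's values; the output neuron's weights (-10, 20, -20) implement z1 AND NOT z2 consistently with the truth table:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(a, b):
    z1 = sigmoid(-10 + 20 * a + 20 * b)      # hidden neuron: a OR b
    z2 = sigmoid(-30 + 20 * a + 20 * b)      # hidden neuron: a AND b
    return sigmoid(-10 + 20 * z1 - 20 * z2)  # output: z1 AND NOT z2

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(xor_net(a, b)))  # outputs 0, 1, 1, 0
```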

45 of 152

46 of 152

47 of 152

48 of 152

49 of 152

50 of 152

51 of 152

Looks a lot like logistic regression

The only difference is that, instead of the raw input feature vector, the features are values computed by the hidden layer

52 of 152

Feature Learning

Looks a lot like logistic regression

The only difference is that, instead of the raw input feature vector, the features are values computed by the hidden layer

53 of 152

Another Perspective: Hidden Layers as Kernel Learning

54 of 152

Nonlinear Classification


https://www.youtube.com/watch?v=3liCbRZPrZA

55 of 152

Neuron

  • We can represent this “neuron” as follows:


56 of 152

Second Way of Looking at Multiple Perceptrons

  • Can represent nonlinear relationships between inputs and outputs due to the nonlinear activation function

57 of 152

Common Activation Functions


Source: 6.S191 Intro. to Deep Learning at MIT

Discuss later
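The common activation functions named on this slide can be written directly (a quick reference sketch):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))  # squashes to (0, 1)

def tanh(z):
    return math.tanh(z)                # squashes to (-1, 1)

def relu(z):
    return max(0.0, z)                 # zero for negative inputs

print(sigmoid(0.0), tanh(0.0), relu(-3.0), relu(3.0))  # 0.5 0.0 0.0 3.0
```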

58 of 152

XOR Problem in Perceptron

  • The main weakness of linear predictors is their lack of capacity.
  • For classification, the populations have to be linearly separable.


59 of 152

Nonlinear Mapping

  • The XOR example can be solved by pre-processing the data to make the two populations linearly separable.


Source: Dr. Francois Fleuret at EPFL
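One concrete pre-processing (an illustrative choice, not necessarily the mapping on the slide) adds the product feature x1*x2; in that feature space XOR becomes a linear function:

```python
# Assumed feature map for illustration: phi(x1, x2) = (x1, x2, x1*x2).
# In this space XOR is linear: XOR = x1 + x2 - 2 * x1 * x2.
def phi(x1, x2):
    return (x1, x2, x1 * x2)

def linear_xor(features):
    x1, x2, x1x2 = features
    return x1 + x2 - 2 * x1x2  # a linear function of the mapped features

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, linear_xor(phi(a, b)))  # outputs 0, 1, 1, 0
```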

60 of 152

Nonlinear Mapping


Source: Dr. Francois Fleuret at EPFL

61 of 152

Nonlinear Mapping


Source: Dr. Francois Fleuret at EPFL

62 of 152

Neuron

  • Suppose the data are not linearly separable

63 of 152

Kernel + Neuron

  • Nonlinear mapping + neuron

  • User-defined Kernel


64 of 152

Neuron + Neuron

  • Nonlinear mapping can be represented by another layer (or neurons)

  • Learnable Kernel
    • Nonlinear activation functions


65 of 152

Multi Layer Perceptron (MLP)

  • Nonlinear mapping can be represented by another layer (or neurons)
  • We can generalize an MLP


66 of 152

Multi Layer Perceptron (MLP) = Artificial Neural Networks

  • Why do we need multiple layers?

Nonlinear mapping

67 of 152

Multi Layer Perceptron (MLP) = Artificial Neural Networks

  • Why do we need multiple layers?

Nonlinear mapping

68 of 152

Multi Layer Perceptron (MLP) = Artificial Neural Networks

  • Why do we need multiple layers?

Nonlinear mappings

Linearly separable

69 of 152

Multi Layer Perceptron (MLP) = Artificial Neural Networks

  • Why do we need multiple layers?

Nonlinear mappings

Multiple Linear classifiers

Linearly separable

70 of 152

Multi Layer Perceptron (MLP) = Artificial Neural Networks

  • Why do we need multiple layers?

Linear classification

Feature Learning

Nonlinear mappings

Linearly separable

71 of 152

Two Ways of Looking at Artificial Neural Networks

  • Still represent lines

  • Can represent nonlinear relationships between inputs and outputs due to the nonlinear activation function

72 of 152

Two Ways of Looking at Artificial Neural Networks


(1)

(2)

  • Still represent lines

  • Can represent nonlinear relationships between inputs and outputs due to the nonlinear activation function

73 of 152

74 of 152

75 of 152

76 of 152

Backpropagation

77 of 152

Training Neural Networks: Optimization


78 of 152

Training Neural Networks: Loss Function

  • Measures error between target values and predictions

  • Example
    • Squared loss (for regression):

    • Cross entropy (for classification):
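The two losses above, in code (following the forms named on the slide, with the conventional 1/2 factor on the squared loss):

```python
import math

def squared_loss(y, y_hat):
    """Squared loss for regression: (1/2)(y - y_hat)^2."""
    return 0.5 * (y - y_hat) ** 2

def cross_entropy(y, y_hat):
    """Binary cross entropy: y in {0, 1}, y_hat in (0, 1) a probability."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(round(squared_loss(1.0, 0.8), 3))   # 0.02
print(round(cross_entropy(1, 0.8), 3))    # 0.223
```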

79 of 152

Training Neural Networks: Gradient Descent

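Gradient descent in its simplest form: repeatedly step against the gradient. A one-parameter sketch on an assumed toy loss f(w) = (w - 3)^2:

```python
def gradient_descent(w0, lr=0.1, steps=100):
    """Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)."""
    w = w0
    for _ in range(steps):
        grad = 2 * (w - 3)   # dL/dw
        w = w - lr * grad    # update rule: w <- w - lr * dL/dw
    return w

print(round(gradient_descent(0.0), 4))  # → 3.0, the minimizer
```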

80 of 152

81 of 152

Gradients in ANN


82 of 152

Gradients in ANN


83 of 152

Training Neural Networks: Backpropagation Learning

  • Forward propagation
    • the initial information propagates up to the hidden units at each layer and finally produces output

  • Backpropagation
    • allows the information from the cost to flow backwards through the network in order to compute the gradients

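The two passes can be sketched on a tiny one-hidden-unit network (the shapes and weight values are illustrative assumptions). The backward pass applies the chain rule from the cost back to each weight, and the analytic gradient is checked against a numerical difference:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, w1, w2, y):
    h = sigmoid(w1 * x)             # hidden activation
    y_hat = sigmoid(w2 * h)         # output
    loss = 0.5 * (y_hat - y) ** 2   # squared loss
    return h, y_hat, loss

def backward(x, w1, w2, y):
    h, y_hat, _ = forward(x, w1, w2, y)
    # Chain rule, flowing from the cost backwards through the network:
    d_yhat = y_hat - y                      # dL/dy_hat
    d_z2 = d_yhat * y_hat * (1 - y_hat)     # through the output sigmoid
    d_w2 = d_z2 * h                         # dL/dw2
    d_h = d_z2 * w2
    d_z1 = d_h * h * (1 - h)                # through the hidden sigmoid
    d_w1 = d_z1 * x                         # dL/dw1
    return d_w1, d_w2

# Numerical check of the analytic gradient for w1:
x, w1, w2, y, eps = 1.0, 0.5, -0.3, 1.0, 1e-6
d_w1, d_w2 = backward(x, w1, w2, y)
num = (forward(x, w1 + eps, w2, y)[2] - forward(x, w1 - eps, w2, y)[2]) / (2 * eps)
print(abs(d_w1 - num) < 1e-6)  # True
```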

84 of 152

Backpropagation


85 of 152

Backpropagation


These are what we need for GD

86 of 152

Backpropagation


These are what we need for GD

87 of 152

Backpropagation


These are what we need for GD

88 of 152

Backpropagation


These are what we need for GD

89 of 152

Backpropagation


These are what we need for GD

90 of 152

Backpropagation


These are what we need for GD

91 of 152

92 of 152

93 of 152

94 of 152

95 of 152

96 of 152

97 of 152

98 of 152

99 of 152

100 of 152

101 of 152

102 of 152

103 of 152

104 of 152

105 of 152

106 of 152

107 of 152

108 of 152

109 of 152

110 of 152

111 of 152

112 of 152

113 of 152

114 of 152

115 of 152

116 of 152

117 of 152

118 of 152

119 of 152

120 of 152

121 of 152

122 of 152

123 of 152

124 of 152

125 of 152

126 of 152


127 of 152

Training Neural Networks with PyTorch


128 of 152

Activation Function

129 of 152

130 of 152

131 of 152

132 of 152

133 of 152

134 of 152

135 of 152

136 of 152

137 of 152

138 of 152

139 of 152

140 of 152

Artificial Neural Networks with PyTorch

141 of 152

MNIST database

  • Modified National Institute of Standards and Technology database
  • Handwritten digit database
  • 28 x 28 grayscale images
  • Each image matrix is flattened into a vector of 28 x 28 = 784 values

142 of 152

Our Neural Network (Model)

  Input image (28 x 28) → Flattened → Input layer (784) → Hidden layer (100) → Output layer (10) → Digit prediction in one-hot encoding

143 of 152

Iterative Optimization

  • We will use
    • Mini-batch gradient descent
    • Adam optimizer
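A sketch of this setup in PyTorch, with the 784-100-10 model described above trained by mini-batch gradient descent with the Adam optimizer. The random batch stands in for MNIST data, and the sigmoid hidden activation is an assumption:

```python
import torch
import torch.nn as nn

# 784-100-10 MLP (hidden activation assumed sigmoid for this sketch)
model = nn.Sequential(
    nn.Linear(784, 100),  # input layer -> hidden layer
    nn.Sigmoid(),
    nn.Linear(100, 10),   # hidden layer -> 10 digit scores
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

x = torch.randn(32, 784)         # one mini-batch of 32 stand-in "images"
y = torch.randint(0, 10, (32,))  # integer class labels

optimizer.zero_grad()            # clear old gradients
loss = criterion(model(x), y)    # forward pass + loss
loss.backward()                  # backpropagation
optimizer.step()                 # Adam parameter update
print(loss.item() > 0)           # True
```

In a real run this block sits inside a loop over mini-batches drawn from the MNIST training set.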

144 of 152

Implementation in Python

145 of 152

Implementation in Python

146 of 152

Implementation in Python

147 of 152

Implementation in Python

148 of 152

Implementation in Python

149 of 152

Implementation in Python

150 of 152

Next

  • Model Ensemble
  • Regularization (dropout)
  • Deep Neural Network w/ CNN & RNN

151 of 152

152 of 152