1 of 48

Artificial Neural Networks

2 of 48

Overview

  1. Introduction
  2. Biological Motivation
  3. ANN representations
  4. Appropriate problems for Neural Network Learning
  5. Perceptron
  6. Gradient Descent and Delta Rule
  7. Multilayer networks and Backpropagation algorithm
  8. Remarks on the backpropagation algorithm
  9. An illustrative example: face recognition
  10. Advanced topics in artificial neural networks

3 of 48

Introduction

Artificial neural networks (ANNs) provide a general, practical method for learning

  • real-valued
  • discrete-valued
  • vector-valued functions

For certain types of problems, such as learning to interpret complex real-world sensor data, artificial neural networks are among the most effective learning methods currently known.

For example, the BACKPROPAGATION algorithm has proven surprisingly successful in many practical problems, such as learning to recognize handwritten characters, spoken words, and faces.

4 of 48

Biological Motivation

- Human brain : densely interconnected network of ~10^11 neurons, each connected to ~10^4 others

(neuron switching time : approx. 10^-3 sec.)

- Properties of artificial neural nets (ANNs):

  • Many neuron-like threshold switching units
  • Many weighted interconnections among units
  • Highly parallel, distributed processing

5 of 48

Appropriate problems for neural network learning

  • Input is high-dimensional discrete or real-valued

(e.g. raw sensor input)

  • Output is discrete or real valued
  • Output is a vector of values
  • Possibly noisy data
  • Long training times accepted
  • Fast evaluation of the learned function required.
  • Not important for humans to understand the weights

Examples:

  • Speech recognition
  • Image classification
  • Financial prediction

6 of 48

  • ALVINN drives 70 mph on highways
    • The ALVINN system uses the backpropagation algorithm to learn to steer an autonomous vehicle driving at speeds up to 70 miles per hour

Appropriate problems for neural network learning

7 of 48

ANN learning is appropriate for problems with the following characteristics

  • Instances are represented by many attribute-value pairs.
  • The target function output may be discrete-valued, real-valued, or a vector

of several real- or discrete-valued attributes.

  • The training examples may contain errors.
  • Long training times are acceptable.
  • Fast evaluation of the learned target function may be required
  • The ability of humans to understand the learned target function is not important

8 of 48

Perceptron

  • Input values → Linear weighted sum → Threshold

Perceptrons can represent all of the primitive boolean functions AND, OR, NAND and NOR; some boolean functions, however, such as XOR, cannot be represented by a single perceptron.
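As a quick sketch (the weights are hand-picked for illustration, not learned), a two-input perceptron, i.e. a linear weighted sum followed by a threshold, can implement AND:

```python
def perceptron(x, w, w0):
    """Threshold unit: output 1 if w0 + w . x > 0, else -1."""
    s = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else -1

# Hand-picked weights implementing AND: output 1 only when both inputs are 1.
w, w0 = [0.5, 0.5], -0.8
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w, w0))
```

Swapping the bias to w0 = -0.3 would implement OR with the same unit, which is the sense in which a single threshold unit covers all the primitive boolean functions.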

9 of 48

  • Representational power of perceptrons

- Linearly separable case like (a) :

possible to separate with a hyperplane

- Linearly inseparable case like (b) :

impossible to separate with a single hyperplane

Decision surface of a perceptron

10 of 48

  • One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example.

  • This process is repeated, iterating through the training examples as many times as needed until the perceptron classifies all training examples correctly.

  • Weights are modified at each step according to the perceptron training rule

Perceptron training rule

11 of 48

Perceptron training rule

w_i ← w_i + Δw_i

where Δw_i = η (t − o) x_i

Where:

  • t is the target value
  • o is the perceptron output

η is a small positive constant (e.g., 0.1) called the learning rate

The rule can be proven to converge

  • If the training data is linearly separable and η is sufficiently small
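A minimal sketch of the training procedure in Python (the function name and the boolean-OR training set are illustrative choices):

```python
def train_perceptron(examples, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    Each input vector x carries a leading 1 so that w[0] acts as the bias."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, t in examples:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Learn boolean OR (linearly separable, so the rule converges).
data = [((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = train_perceptron(data)
print([1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1 for x, _ in data])
```

On a linearly inseparable set such as XOR, the same loop would cycle forever without settling, which is the failure mode the delta rule addresses next.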

12 of 48

Gradient descent

  • The perceptron rule finds a successful weight vector when the training examples are linearly separable, but it can fail to converge if the examples are not linearly separable.

  • A second training rule, called the delta rule, is designed to overcome this difficulty. If the training examples are not linearly separable, the delta rule converges toward a best-fit approximation to the target concept.

13 of 48

  • Gradient descent

- Error (summed over all training examples):

  E(w) ≡ (1/2) Σ_{d∈D} (t_d − o_d)^2

- The gradient of E (the vector of partial derivatives):

  ∇E(w) ≡ [ ∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n ]

- Its direction is the direction of steepest increase in E.

- Thus, the training rule is:

  Δw = −η ∇E(w), i.e. Δw_i = −η ∂E/∂w_i

(The negative sign gives the direction that decreases E)

Derivation of gradient descent

14 of 48

Derivation of gradient descent

  ∂E/∂w_i = Σ_{d∈D} (t_d − o_d)(−x_id)

where x_id denotes the single input component x_i for training example d

- The weight update rule for gradient descent:

  Δw_i = η Σ_{d∈D} (t_d − o_d) x_id
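A compact sketch of batch gradient descent for a linear unit o = w · x (the training data, noisy samples of t = 1 + 2x, is invented for illustration):

```python
def gradient_descent(examples, eta=0.05, iterations=500):
    """Batch gradient descent for a linear unit o = w . x.
    Delta rule update: w_i <- w_i + eta * sum_d (t_d - o_d) * x_id."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(iterations):
        delta = [0.0] * n
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i, xi in enumerate(x):
                delta[i] += eta * (t - o) * xi   # accumulate over all examples
        w = [wi + di for wi, di in zip(w, delta)]  # one update per iteration
    return w

# Noisy samples of t = 1 + 2x; the leading 1 in each input is the bias term.
data = [((1, 0.0), 1.0), ((1, 1.0), 3.0), ((1, 2.0), 5.1), ((1, 3.0), 6.9)]
w = gradient_descent(data)
print(w)  # converges to the least-squares fit, close to [1, 2]
```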

15 of 48

Gradient descent and delta rule

  • Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, provided a sufficiently small learning rate η is used

16 of 48

- Error of different hypotheses

- For a linear unit with two weights, the hypothesis space H is the w_0, w_1 plane.

- This error surface must be parabolic with a single global minimum (we desire a hypothesis with minimum error).

Hypothesis Space

17 of 48

- Stochastic gradient descent (i.e. incremental mode) can sometimes avoid falling into local minima because it follows the gradient of the per-example error E_d rather than the gradient of the overall error E.

Stochastic approximation to gradient descent
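A sketch of the incremental (stochastic) version for a linear unit, on invented noisy samples of t = 1 + 2x; the only change from the batch version is that the weights are updated inside the inner loop, once per example:

```python
def stochastic_gradient_descent(examples, eta=0.02, epochs=300):
    """Incremental (stochastic) delta rule: update after every example,
    following the per-example gradient of E_d rather than that of E."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Noisy samples of t = 1 + 2x; the leading 1 in each input is the bias term.
data = [((1, 0.0), 1.0), ((1, 1.0), 3.0), ((1, 2.0), 5.1), ((1, 3.0), 6.9)]
print(stochastic_gradient_descent(data))  # close to the batch solution
```

With a sufficiently small η the sequence of per-example steps approximates true gradient descent arbitrarily closely, while each individual step is much cheaper.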

18 of 48

  • Speech recognition example of multilayer networks learned by the backpropagation algorithm
  • Highly nonlinear decision surfaces

Multilayer networks and the backpropagation algorithm

19 of 48

Sigmoid Threshold Unit

The sigmoid unit first computes a linear combination of its inputs, then applies the sigmoid function to the result: o = σ(w · x), where σ(y) = 1/(1 + e^−y). Unlike the perceptron's step threshold, the sigmoid output is a continuous, differentiable function of its input, which is what makes gradient-based training possible.
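A direct translation of the unit into code (function names are illustrative):

```python
import math

def sigmoid(y):
    """Squashing function: maps any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit(x, w):
    """Linear combination of inputs followed by the sigmoid: o = sigma(w . x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

# A convenient property for learning: sigma'(y) = sigma(y) * (1 - sigma(y)),
# so the derivative is available from the output alone.
print(sigmoid(0.0))                        # 0.5
print(sigmoid_unit((1, 2), (0.5, -0.25)))  # w . x = 0, so also 0.5
```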

20 of 48

The Backpropagation algorithm
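A minimal sketch of the stochastic version for one hidden layer of sigmoid units (the network shape, seed, and single training example are illustrative choices). Error terms are δ_k = o_k(1 − o_k)(t_k − o_k) for output units and δ_h = o_h(1 − o_h) Σ_k w_kh δ_k for hidden units; every weight is then updated by Δw_ji = η δ_j x_ji:

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

class Net:
    """Minimal feedforward network with one hidden layer of sigmoid units,
    trained by the stochastic version of backpropagation."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rnd = random.Random(seed)
        # Initialize all weights to small random values; index 0 is the bias.
        self.w_h = [[rnd.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
                    for _ in range(n_hidden)]
        self.w_o = [[rnd.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)]
                    for _ in range(n_out)]

    def forward(self, x):
        xb = [1.0] + list(x)                       # inputs with bias term
        h = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in self.w_h]
        hb = [1.0] + h                             # hidden outputs with bias
        o = [sigmoid(sum(w * v for w, v in zip(ws, hb))) for ws in self.w_o]
        return xb, hb, o

    def train_step(self, x, t, eta=0.3):
        xb, hb, o = self.forward(x)
        # Output units: delta_k = o_k (1 - o_k) (t_k - o_k)
        d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
        # Hidden units: delta_h = o_h (1 - o_h) * sum_k w_kh delta_k
        d_h = [hb[j + 1] * (1 - hb[j + 1]) *
               sum(self.w_o[k][j + 1] * d_o[k] for k in range(len(d_o)))
               for j in range(len(self.w_h))]
        # Update every weight: w_ji <- w_ji + eta * delta_j * x_ji
        for k, dk in enumerate(d_o):
            self.w_o[k] = [w + eta * dk * v for w, v in zip(self.w_o[k], hb)]
        for j, dj in enumerate(d_h):
            self.w_h[j] = [w + eta * dj * v for w, v in zip(self.w_h[j], xb)]

def squared_error(net, x, t):
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, net.forward(x)[2]))

net = Net(n_in=2, n_hidden=3, n_out=1)
before = squared_error(net, (1.0, 0.0), (0.9,))
for _ in range(100):
    net.train_step((1.0, 0.0), (0.9,))
print(before, squared_error(net, (1.0, 0.0), (0.9,)))  # error decreases
```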

21 of 48

  • One major difference in the case of multilayer networks is that the error surface can have multiple local minima, in contrast to the single-minimum parabolic error surface.

  • Unfortunately, this means that gradient descent is guaranteed only to converge toward some local minimum, and not necessarily the global minimum error.

  • Despite this obstacle, in practice BACKPROPAGATION has been found to produce excellent results in many real-world applications.

22 of 48

  • Often include weight momentum α

- nth iteration update depends partially on the (n−1)th iteration:

  Δw_ji(n) = η δ_j x_ji + α Δw_ji(n−1)

- α : constant between 0 and 1 (momentum)

  • Roles of momentum term
    • The effect of keeping the ball rolling through small local minima in the error surface
    • The effect of gradually increasing the step size of the search in regions where the gradient is unchanging (greatly improves the speed of learning)

Adding Momentum

The most common approach is to alter the weight-update rule by making the weight update on the nth iteration depend partially on the update that occurred during the (n−1)th iteration
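The momentum bookkeeping can be sketched in isolation, for a single scalar weight and a made-up gradient sequence:

```python
def momentum_updates(gradients, eta=0.1, alpha=0.9):
    """Weight-update sequence with momentum:
    delta_w(n) = eta * grad(n) + alpha * delta_w(n-1)."""
    delta_w, out = 0.0, []
    for g in gradients:
        delta_w = eta * g + alpha * delta_w
        out.append(delta_w)
    return out

# On a long run of identical gradients the step size grows toward
# eta * g / (1 - alpha): ten times the plain gradient step here.
print(momentum_updates([1.0] * 50)[-1])
```

This is the "rolling ball" effect: in regions where the gradient keeps pointing the same way the steps accumulate, and small dips in the error surface are not enough to reverse the accumulated update.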

23 of 48

22/04/18

Remarks on the Backpropagation algorithm

24 of 48

Remarks on the Backpropagation algorithm

Convergence and Local Minima

  • Gradient descent to some local minimum
  • Perhaps not global minimum...
  • Add momentum
  • Stochastic gradient descent

Representational Power of Feedforward Networks

  • Boolean functions : every boolean function can be represented by some network with a single hidden layer (of sufficiently many units)
  • Continuous functions : every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer
  • Arbitrary functions : any function can be approximated to arbitrary accuracy by a network with two hidden layers

Hypothesis Space Search and Inductive Bias

  • The hypothesis space is the continuous space of network weights
  • E is differentiable with respect to the weights, so the search can follow the error gradient

25 of 48

Hidden layer representations

- This 8x3x8 network was trained to learn the identity function.

- 8 training examples are used.

- After 5000 training iterations, the three hidden unit values encode the eight distinct inputs using the encoding shown on the right.


26 of 48

Learning the 8x3x8 network

- Most of the interesting weight changes occurred during the first 2500 iterations.

27 of 48

Generalization, Overfitting, and Stopping Criterion

  • Termination condition
    • Until the error E falls below some predetermined threshold
  • Techniques to address the overfitting problem
    • Weight decay : Decrease each weight by some small factor during each iteration.
    • Cross-validation (k-fold cross-validation)

28 of 48

Illustrative example: Face Recognition

  • Training images : 20 different persons, with 32 images per person.
  • After training on 260 images, the network achieves an accuracy of 90% over the test set.
  • Algorithm parameters : η=0.3, α=0.3

29 of 48

Task

  • Involves classifying camera images of faces of various people in various poses. Images of 20 different people were collected, including approximately 32 images per person, varying the person's expression (happy, sad, angry, neutral), the direction in which they were looking (left, right, straight ahead, up), and whether or not they were wearing sunglasses.
  • There is also variation in the background behind the person, the clothing worn by the person, and the position of the person's face within the image. In total, 624 greyscale images were collected, each with a resolution of 120 x 128, with each image pixel described by a greyscale intensity value between 0 (black) and 255 (white).

  • A variety of target functions can be learned from this image data. For example, given an image as input we could train an ANN to output the identity of the person, the direction in which the person is facing, the gender of the person, whether or not they are wearing sunglasses, etc. All of these target functions can be learned to high accuracy from this image data.

30 of 48

Design Choice

In applying BACKPROPAGATION to any given task, a number of design choices must be made

After training on a set of 260 images, classification accuracy over a separate test set is 90%.

  • Input Encoding
  • Output Encoding
  • Network graph structure
  • Other learning algorithm parameters

31 of 48

Advanced topics in artificial neural networks

  • Alternative error functions
  • Alternative error minimization procedures
  • Recurrent networks
  • Dynamically modifying network structure

32 of 48

  • Penalize large weights (weight decay) : add a term γ Σ_{j,i} w_ji^2 to E, reducing the risk of overfitting

  • Train on target slopes as well as values:

  • Minimizing the cross entropy : Learning a probabilistic output function (chapter 6)

Alternative Error Functions
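The weight-penalty term translates into a simple change to each update (the gradient values below are placeholders): differentiating γ Σ w^2 adds 2γw_i to every weight's gradient, which shrinks each weight a little on every iteration.

```python
def weight_decay_step(w, grad, eta=0.1, gamma=0.01):
    """One gradient step on the penalized error E + gamma * sum(w_i^2):
    the extra 2*gamma*w_i term pulls every weight toward zero."""
    return [wi - eta * (gi + 2 * gamma * wi) for wi, gi in zip(w, grad)]

# With zero gradient the weights simply decay by the factor (1 - 2*eta*gamma).
print(weight_decay_step([1.0, -2.0], [0.0, 0.0]))
```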

33 of 48

Recurrent Networks

  (a) Feedforward network
  (b) Recurrent network
  (c) Recurrent network unfolded in time

34 of 48

Dynamically Modifying Network Structure

  • To improve generalization accuracy and training efficiency
  • Cascade-Correlation algorithm (Fahlman and Lebiere 1990)
    • Start with the simplest possible network (no hidden units) and add complexity
  • LeCun et al. 1990
    • Start with a complex network and prune it as we find that certain connections are inessential.

35 of 48


36 of 48

Evaluating Hypotheses

One reason to evaluate a learned hypothesis is simply to understand whether to use it. For instance, when learning from a limited-size database indicating the effectiveness of different medical treatments, it is important to understand as precisely as possible the accuracy of the learned hypotheses.

Estimating the accuracy of a hypothesis is relatively straightforward when data is plentiful. However, when we must learn a hypothesis and estimate its future accuracy given only a limited set of data, two key difficulties arise:

  • Bias in the estimate
  • Variance in the estimate

Even with an unbiased estimate, the measured accuracy can still vary from the true accuracy, depending on the makeup of the particular set of test examples.

The smaller the set of test examples, the greater the expected variance.

37 of 48

ESTIMATING HYPOTHESIS ACCURACY

  1. Given a hypothesis h and a data sample containing n examples drawn at random according to the distribution D, what is the best estimate of the accuracy of h over future instances drawn from the same distribution?
  2. What is the probable error in this accuracy estimate?

38 of 48

Sample Error and True Error

Two measures of error are distinguished. One is the error rate of the hypothesis over the sample of data that is available. The other is the error rate of the hypothesis over the entire unknown distribution D of examples.

The sample error of a hypothesis h with respect to some sample S of n instances drawn from X is the fraction of S that it misclassifies:

  error_S(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x))

where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
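The definition translates directly into code (the threshold hypothesis and the labeled sample are invented for illustration):

```python
def sample_error(h, sample):
    """Fraction of the sample S that hypothesis h misclassifies:
    error_S(h) = (1/n) * |{(x, f(x)) in S : h(x) != f(x)}|."""
    return sum(1 for x, fx in sample if h(x) != fx) / len(sample)

# A hypothetical threshold hypothesis on labeled points (x, f(x)).
h = lambda x: 1 if x > 0.5 else 0
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.7, 0), (0.9, 1)]
print(sample_error(h, S))  # misclassifies only x = 0.7, so 0.2
```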

39 of 48

The true error of a hypothesis h is the probability that it will misclassify a single randomly drawn instance from the distribution D:

  error_D(h) ≡ Pr_{x∈D}[ f(x) ≠ h(x) ]

where Pr_{x∈D} denotes that the probability is taken over the instance distribution D.

40 of 48

Confidence Intervals for Discrete-Valued Hypotheses

41 of 48
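For a discrete-valued hypothesis and a sample of n ≥ 30 test examples, the standard N% confidence interval for the true error is error_S(h) ± z_N · sqrt(error_S(h)(1 − error_S(h))/n), where z_N = 1.96 for a 95% interval. A sketch (the 12-errors-in-40-examples figures are illustrative):

```python
import math

def confidence_interval(error_s, n, z=1.96):
    """N% confidence interval for the true error of a discrete-valued
    hypothesis, given its sample error over n >= 30 test examples:
    error_S(h) +/- z_N * sqrt(error_S(h) * (1 - error_S(h)) / n).
    z = 1.96 gives a 95% interval."""
    half = z * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half, error_s + half

# e.g. 12 misclassifications over 40 test examples: error_S(h) = 0.30
lo, hi = confidence_interval(0.30, 40)
print(round(lo, 3), round(hi, 3))
```

Note how the interval width shrinks with sqrt(n): quadrupling the test set only halves the uncertainty.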

Central Limit Theorem

One essential fact that simplifies attempts to derive confidence intervals is the Central Limit Theorem. Consider again our general setting, in which we observe the values of n independently drawn random variables Y_1 … Y_n that obey the same unknown underlying probability distribution (e.g., n tosses of the same coin). The theorem states that as n → ∞, the distribution governing the sample mean of these variables approaches a Normal distribution, regardless of the distribution governing the individual Y_i.

47 of 48

48 of 48