1 of 48

Artificial Neural Networks

2 of 48

Overview

  1. Introduction
  2. Biological Motivation
  3. ANN representations
  4. Appropriate problems for Neural Network Learning
  5. Perceptron
  6. Gradient Descent and Delta Rule
  7. Multilayer networks and Backpropagation algorithm
  8. Remarks on the backpropagation algorithm
  9. An illustrative example: face recognition
  10. Advanced topics in artificial neural networks

3 of 48

Introduction

Artificial neural networks (ANNs) provide a general, practical method for learning

  • real-valued
  • discrete-valued
  • vector-valued functions

For certain types of problems, such as learning to interpret complex real-world sensor data, artificial neural networks are among the most effective learning methods currently known.

For example, the BACKPROPAGATION algorithm has proven surprisingly successful in many practical problems, such as learning to recognize handwritten characters, spoken words, and faces.

4 of 48

Biological Motivation

- Human brain : densely interconnected network of ~10^11 neurons, each connected to ~10^4 others

(neuron switching time : approx. 10^-3 sec.)

- Properties of artificial neural nets (ANNs):

  • Many neuron-like threshold switching units
  • Many weighted interconnections among units
  • Highly parallel, distributed processing

5 of 48

Appropriate problems for neural network learning

  • Input is high-dimensional discrete or real-valued

(e.g. raw sensor input)

  • Output is discrete or real valued
  • Output is a vector of values
  • Possibly noisy data
  • Long training times accepted
  • Fast evaluation of the learned function required.
  • Not important for humans to understand the weights

Examples:

  • Speech recognition
  • Image classification
  • Financial prediction

6 of 48

  • ALVINN drives 70 mph on highways
    • The ALVINN system uses the backpropagation algorithm to learn to steer an autonomous vehicle driving at speeds up to 70 miles per hour

Appropriate problems for neural network learning

7 of 48

ANN learning is appropriate for problems with the following characteristics

  • Instances are represented by many attribute-value pairs.
  • The target function output may be discrete-valued, real-valued, or a vector

of several real- or discrete-valued attributes.

  • The training examples may contain errors.
  • Long training times are acceptable.
  • Fast evaluation of the learned target function may be required
  • The ability of humans to understand the learned target function is not important

8 of 48

Perceptron

  • Input values → Linear weighted sum → Threshold

Perceptrons can represent all of the primitive boolean functions AND, OR, NAND and NOR; some boolean functions, however, such as XOR, cannot be represented by a single perceptron.
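As a quick sketch (the weights are hand-picked for illustration, not learned), a two-input perceptron, i.e. a linear weighted sum followed by a threshold, can implement AND:

```python
def perceptron(x, w, w0):
    """Threshold unit: output 1 if w0 + w . x > 0, else -1."""
    s = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if s > 0 else -1

# Hand-picked weights implementing AND: output 1 only when both inputs are 1.
w, w0 = [0.5, 0.5], -0.8
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w, w0))
```

Swapping the bias to w0 = -0.3 would implement OR with the same unit, which is the sense in which a single threshold unit covers all the primitive boolean functions.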

9 of 48

  • Representational power of perceptrons

- Linearly separable case like (a) :

possible to separate with a hyperplane

- Linearly inseparable case like (b) :

impossible to separate with a single hyperplane

Decision surface of a perceptron

10 of 48

  • One way to learn an acceptable weight vector is to begin with random weights, then iteratively apply the perceptron to each training example, modifying the perceptron weights whenever it misclassifies an example.

  • This process is repeated, iterating through the training examples as many times as needed until the perceptron classifies all training examples correctly.

  • Weights are modified at each step according to the perceptron training rule

Perceptron training rule

11 of 48

Perceptron training rule

w_i ← w_i + Δw_i

where Δw_i = η (t − o) x_i

Where:

  • t is the target value
  • o is the perceptron output

η is a small positive constant (e.g., 0.1) called the learning rate

The rule can be proven to converge

  • If the training data is linearly separable and η is sufficiently small
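A minimal sketch of the training procedure in Python (the function name and the boolean-OR training set are illustrative choices):

```python
def train_perceptron(examples, eta=0.1, epochs=50):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i.
    Each input vector x carries a leading 1 so that w[0] acts as the bias."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, t in examples:
            o = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Learn boolean OR (linearly separable, so the rule converges).
data = [((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 1)]
w = train_perceptron(data)
print([1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1 for x, _ in data])
```

On a linearly inseparable set such as XOR, the same loop would cycle forever without settling, which is the failure mode the delta rule addresses next.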

12 of 48

Gradient descent

  • The perceptron rule finds a successful weight vector when the training examples are linearly separable, but it can fail to converge if the examples are not linearly separable.

  • A second training rule, called the delta rule, is designed to overcome this difficulty. If the training examples are not linearly separable, the delta rule converges toward a best-fit approximation to the target concept.

13 of 48

  • Gradient descent

- Error (summed over all training examples):

  E(w) ≡ (1/2) Σ_{d∈D} (t_d − o_d)^2

- The gradient of E (the vector of partial derivatives):

  ∇E(w) ≡ [ ∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n ]

- Its direction is the direction of steepest increase in E.

- Thus, the training rule is:

  Δw = −η ∇E(w), i.e. Δw_i = −η ∂E/∂w_i

(The negative sign gives the direction that decreases E)

Derivation of gradient descent

14 of 48

Derivation of gradient descent

  ∂E/∂w_i = Σ_{d∈D} (t_d − o_d)(−x_id)

where x_id denotes the single input component x_i for training example d

- The weight update rule for gradient descent:

  Δw_i = η Σ_{d∈D} (t_d − o_d) x_id
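A compact sketch of batch gradient descent for a linear unit o = w · x (the training data, noisy samples of t = 1 + 2x, is invented for illustration):

```python
def gradient_descent(examples, eta=0.05, iterations=500):
    """Batch gradient descent for a linear unit o = w . x.
    Delta rule update: w_i <- w_i + eta * sum_d (t_d - o_d) * x_id."""
    n = len(examples[0][0])
    w = [0.0] * n
    for _ in range(iterations):
        delta = [0.0] * n
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            for i, xi in enumerate(x):
                delta[i] += eta * (t - o) * xi   # accumulate over all examples
        w = [wi + di for wi, di in zip(w, delta)]  # one update per iteration
    return w

# Noisy samples of t = 1 + 2x; the leading 1 in each input is the bias term.
data = [((1, 0.0), 1.0), ((1, 1.0), 3.0), ((1, 2.0), 5.1), ((1, 3.0), 6.9)]
w = gradient_descent(data)
print(w)  # converges to the least-squares fit, close to [1, 2]
```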

15 of 48

Gradient descent and delta rule

  • Because the error surface contains only a single global minimum, this algorithm will converge to a weight vector with minimum error, provided a sufficiently small learning rate η is used

16 of 48

- Error of different hypotheses

- For a linear unit with two weights, the hypothesis space H is the w_0, w_1 plane.

- This error surface must be parabolic with a single global minimum (we desire a hypothesis with minimum error).

Hypothesis Space

17 of 48

- Stochastic gradient descent (i.e. incremental mode) can sometimes avoid falling into local minima because it follows the gradient of the per-example error E_d rather than the gradient of the overall error E.

Stochastic approximation to gradient descent
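A sketch of the incremental (stochastic) version for a linear unit, on invented noisy samples of t = 1 + 2x; the only change from the batch version is that the weights are updated inside the inner loop, once per example:

```python
def stochastic_gradient_descent(examples, eta=0.02, epochs=300):
    """Incremental (stochastic) delta rule: update after every example,
    following the per-example gradient of E_d rather than that of E."""
    w = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        for x, t in examples:
            o = sum(wi * xi for wi, xi in zip(w, x))
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Noisy samples of t = 1 + 2x; the leading 1 in each input is the bias term.
data = [((1, 0.0), 1.0), ((1, 1.0), 3.0), ((1, 2.0), 5.1), ((1, 3.0), 6.9)]
print(stochastic_gradient_descent(data))  # close to the batch solution
```

With a sufficiently small η the sequence of per-example steps approximates true gradient descent arbitrarily closely, while each individual step is much cheaper.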

18 of 48

  • Speech recognition example of multilayer networks learned by the backpropagation algorithm
  • Highly nonlinear decision surfaces

Multilayer networks and the backpropagation algorithm

19 of 48

Sigmoid Threshold Unit

The sigmoid unit first computes a linear combination of its inputs, then applies the sigmoid function to the result: o = σ(w · x), where σ(y) = 1/(1 + e^−y). Unlike the perceptron's step threshold, the sigmoid output is a continuous, differentiable function of its input, which is what makes gradient-based training possible.
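A direct translation of the unit into code (function names are illustrative):

```python
import math

def sigmoid(y):
    """Squashing function: maps any real input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit(x, w):
    """Linear combination of inputs followed by the sigmoid: o = sigma(w . x)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))

# A convenient property for learning: sigma'(y) = sigma(y) * (1 - sigma(y)),
# so the derivative is available from the output alone.
print(sigmoid(0.0))                        # 0.5
print(sigmoid_unit((1, 2), (0.5, -0.25)))  # w . x = 0, so also 0.5
```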

20 of 48

The Backpropagation algorithm
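A minimal sketch of the stochastic version for one hidden layer of sigmoid units (the network shape, seed, and single training example are illustrative choices). Error terms are δ_k = o_k(1 − o_k)(t_k − o_k) for output units and δ_h = o_h(1 − o_h) Σ_k w_kh δ_k for hidden units; every weight is then updated by Δw_ji = η δ_j x_ji:

```python
import math
import random

def sigmoid(y):
    return 1.0 / (1.0 + math.exp(-y))

class Net:
    """Minimal feedforward network with one hidden layer of sigmoid units,
    trained by the stochastic version of backpropagation."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rnd = random.Random(seed)
        # Initialize all weights to small random values; index 0 is the bias.
        self.w_h = [[rnd.uniform(-0.05, 0.05) for _ in range(n_in + 1)]
                    for _ in range(n_hidden)]
        self.w_o = [[rnd.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)]
                    for _ in range(n_out)]

    def forward(self, x):
        xb = [1.0] + list(x)                       # inputs with bias term
        h = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in self.w_h]
        hb = [1.0] + h                             # hidden outputs with bias
        o = [sigmoid(sum(w * v for w, v in zip(ws, hb))) for ws in self.w_o]
        return xb, hb, o

    def train_step(self, x, t, eta=0.3):
        xb, hb, o = self.forward(x)
        # Output units: delta_k = o_k (1 - o_k) (t_k - o_k)
        d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
        # Hidden units: delta_h = o_h (1 - o_h) * sum_k w_kh delta_k
        d_h = [hb[j + 1] * (1 - hb[j + 1]) *
               sum(self.w_o[k][j + 1] * d_o[k] for k in range(len(d_o)))
               for j in range(len(self.w_h))]
        # Update every weight: w_ji <- w_ji + eta * delta_j * x_ji
        for k, dk in enumerate(d_o):
            self.w_o[k] = [w + eta * dk * v for w, v in zip(self.w_o[k], hb)]
        for j, dj in enumerate(d_h):
            self.w_h[j] = [w + eta * dj * v for w, v in zip(self.w_h[j], xb)]

def squared_error(net, x, t):
    return 0.5 * sum((tk - ok) ** 2 for tk, ok in zip(t, net.forward(x)[2]))

net = Net(n_in=2, n_hidden=3, n_out=1)
before = squared_error(net, (1.0, 0.0), (0.9,))
for _ in range(100):
    net.train_step((1.0, 0.0), (0.9,))
print(before, squared_error(net, (1.0, 0.0), (0.9,)))  # error decreases
```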

21 of 48

  • One major difference in the case of multilayer networks is that the error surface can have multiple local minima, in contrast to the single-minimum parabolic error surface.

  • Unfortunately, this means that gradient descent is guaranteed only to converge toward some local minimum, and not necessarily the global minimum error.

  • Despite this obstacle, in practice BACKPROPAGATION has been found to produce excellent results in many real-world applications.

22 of 48

  • Often include weight momentum α

- nth iteration update depends partially on the (n−1)th iteration:

  Δw_ji(n) = η δ_j x_ji + α Δw_ji(n−1)

- α : constant between 0 and 1 (momentum)

  • Roles of momentum term
    • The effect of keeping the ball rolling through small local minima in the error surface
    • The effect of gradually increasing the step size of the search in regions where the gradient is unchanging (greatly improves the speed of learning)

Adding Momentum

The most common approach is to alter the weight-update rule by making the weight update on the nth iteration depend partially on the update that occurred during the (n−1)th iteration
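The momentum bookkeeping can be sketched in isolation, for a single scalar weight and a made-up gradient sequence:

```python
def momentum_updates(gradients, eta=0.1, alpha=0.9):
    """Weight-update sequence with momentum:
    delta_w(n) = eta * grad(n) + alpha * delta_w(n-1)."""
    delta_w, out = 0.0, []
    for g in gradients:
        delta_w = eta * g + alpha * delta_w
        out.append(delta_w)
    return out

# On a long run of identical gradients the step size grows toward
# eta * g / (1 - alpha): ten times the plain gradient step here.
print(momentum_updates([1.0] * 50)[-1])
```

This is the "rolling ball" effect: in regions where the gradient keeps pointing the same way the steps accumulate, and small dips in the error surface are not enough to reverse the accumulated update.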

23 of 48

22/04/18

Remarks on the Backpropagation algorithm

24 of 48

Remarks on the Backpropagation algorithm

Convergence and Local Minima

  • Gradient descent to some local minimum
  • Perhaps not global minimum...
  • Add momentum
  • Stochastic gradient descent

Representational Power of Feedforward Networks

  • Boolean functions : every boolean function can be represented by some network with a single hidden layer (of sufficiently many units)
  • Continuous functions : every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer
  • Arbitrary functions : any function can be approximated to arbitrary accuracy by a network with two hidden layers

Hypothesis Space Search and Inductive Bias

  • The hypothesis space is the continuous space of network weights
  • E is differentiable with respect to the weights, so the search can follow the error gradient

25 of 48

Hidden layer representations

- This 8x3x8 network was trained to learn the identity function.

- 8 training examples are used.

- After 5000 training iterations, the three hidden unit values encode the eight distinct inputs using the encoding shown on the right.


26 of 48

Learning the 8x3x8 network

- Most of the interesting weight changes occurred during the first 2500 iterations.

27 of 48

Generalization, Overfitting, and Stopping Criterion

  • Termination condition
    • Until the error E falls below some predetermined threshold
  • Techniques to address the overfitting problem
    • Weight decay : Decrease each weight by some small factor during each iteration.
    • Cross-validation (k-fold cross-validation)

28 of 48

Illustrative example: Face Recognition

  • Training images : 20 different persons, with 32 images per person.
  • After training on 260 images, the network achieves an accuracy of 90% over the test set.
  • Algorithm parameters : η=0.3, α=0.3

29 of 48

Task

  • Involves classifying camera images of faces of various people in various poses. Images of 20 different people were collected, including approximately 32 images per person, varying the person's expression (happy, sad, angry, neutral), the direction in which they were looking (left, right, straight ahead, up), and whether or not they were wearing sunglasses.
  • There is also variation in the background behind the person, the clothing worn by the person, and the position of the person's face within the image. In total, 624 greyscale images were collected, each with a resolution of 120 x 128, with each image pixel described by a greyscale intensity value between 0 (black) and 255 (white).

  • A variety of target functions can be learned from this image data. For example, given an image as input we could train an ANN to output the identity of the person, the direction in which the person is facing, the gender of the person, whether or not they are wearing sunglasses, etc. All of these target functions can be learned to high accuracy from this image data.

30 of 48

Design Choice

In applying BACKPROPAGATION to any given task, a number of design choices must be made

After training on a set of 260 images, classification accuracy over a separate test set is 90%.

  • Input Encoding
  • Output Encoding
  • Network graph structure
  • Other learning algorithm parameters

31 of 48

Advanced topics in artificial neural networks

  • Alternative error functions
  • Alternative error minimization procedures
  • Recurrent networks
  • Dynamically modifying network structure

32 of 48

  • Penalize large weights (weight decay) : add a term γ Σ_{j,i} w_ji^2 to E, reducing the risk of overfitting

  • Train on target slopes as well as values:

  • Minimizing the cross entropy : Learning a probabilistic output function (chapter 6)

Alternative Error Functions
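The weight-penalty term translates into a simple change to each update (the gradient values below are placeholders): differentiating γ Σ w^2 adds 2γw_i to every weight's gradient, which shrinks each weight a little on every iteration.

```python
def weight_decay_step(w, grad, eta=0.1, gamma=0.01):
    """One gradient step on the penalized error E + gamma * sum(w_i^2):
    the extra 2*gamma*w_i term pulls every weight toward zero."""
    return [wi - eta * (gi + 2 * gamma * wi) for wi, gi in zip(w, grad)]

# With zero gradient the weights simply decay by the factor (1 - 2*eta*gamma).
print(weight_decay_step([1.0, -2.0], [0.0, 0.0]))
```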

33 of 48

Recurrent Networks

  (a) Feedforward network
  (b) Recurrent network
  (c) Recurrent network unfolded in time

34 of 48

Dynamically Modifying Network Structure

  • To improve generalization accuracy and training efficiency
  • Cascade-Correlation algorithm (Fahlman and Lebiere 1990)
    • Start with the simplest possible network (no hidden units) and add complexity
  • LeCun et al. 1990
    • Start with a complex network and prune it as we find that certain connections are inessential.

35 of 48


36 of 48

Evaluating Hypotheses

One reason to evaluate a learned hypothesis is simply to understand whether to use it. For instance, when learning from a limited-size database indicating the effectiveness of different medical treatments, it is important to understand as precisely as possible the accuracy of the learned hypotheses.

Estimating the accuracy of a hypothesis is relatively straightforward when data is plentiful. However, when we must learn a hypothesis and estimate its future accuracy given only a limited set of data, two key difficulties arise:

  • Bias in the estimate
  • Variance in the estimate

Even with an unbiased estimate, the measured accuracy can still vary from the true accuracy, depending on the makeup of the particular set of test examples.

The smaller the set of test examples, the greater the expected variance.

37 of 48

ESTIMATING HYPOTHESIS ACCURACY

  1. Given a hypothesis h and a data sample containing n examples drawn at random according to the distribution D, what is the best estimate of the accuracy of h over future instances drawn from the same distribution?
  2. What is the probable error in this accuracy estimate?

38 of 48

Sample Error and True Error

Two measures of error are distinguished. One is the error rate of the hypothesis over the sample of data that is available. The other is the error rate of the hypothesis over the entire unknown distribution D of examples.

The sample error of a hypothesis h with respect to some sample S of n instances drawn from X is the fraction of S that it misclassifies:

  error_S(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x))

where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
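The definition translates directly into code (the threshold hypothesis and the labeled sample are invented for illustration):

```python
def sample_error(h, sample):
    """Fraction of the sample S that hypothesis h misclassifies:
    error_S(h) = (1/n) * |{(x, f(x)) in S : h(x) != f(x)}|."""
    return sum(1 for x, fx in sample if h(x) != fx) / len(sample)

# A hypothetical threshold hypothesis on labeled points (x, f(x)).
h = lambda x: 1 if x > 0.5 else 0
S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.7, 0), (0.9, 1)]
print(sample_error(h, S))  # misclassifies only x = 0.7, so 0.2
```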

39 of 48

The true error of a hypothesis h is the probability that it will misclassify a single randomly drawn instance from the distribution D:

  error_D(h) ≡ Pr_{x∈D}[ f(x) ≠ h(x) ]

where Pr_{x∈D} denotes that the probability is taken over the instance distribution D.

40 of 48

Confidence Intervals for Discrete-Valued Hypotheses

41 of 48
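For a discrete-valued hypothesis and a sample of n ≥ 30 test examples, the standard N% confidence interval for the true error is error_S(h) ± z_N · sqrt(error_S(h)(1 − error_S(h))/n), where z_N = 1.96 for a 95% interval. A sketch (the 12-errors-in-40-examples figures are illustrative):

```python
import math

def confidence_interval(error_s, n, z=1.96):
    """N% confidence interval for the true error of a discrete-valued
    hypothesis, given its sample error over n >= 30 test examples:
    error_S(h) +/- z_N * sqrt(error_S(h) * (1 - error_S(h)) / n).
    z = 1.96 gives a 95% interval."""
    half = z * math.sqrt(error_s * (1 - error_s) / n)
    return error_s - half, error_s + half

# e.g. 12 misclassifications over 40 test examples: error_S(h) = 0.30
lo, hi = confidence_interval(0.30, 40)
print(round(lo, 3), round(hi, 3))
```

Note how the interval width shrinks with sqrt(n): quadrupling the test set only halves the uncertainty.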

Central Limit Theorem

One essential fact that simplifies attempts to derive confidence intervals is the Central Limit Theorem. Consider again our general setting, in which we observe the values of n independently drawn random variables Y_1 … Y_n that obey the same unknown underlying probability distribution (e.g., n tosses of the same coin). The theorem states that as n → ∞, the distribution governing the sample mean of these variables approaches a Normal distribution, regardless of the distribution governing the individual Y_i.

47 of 48

48 of 48