Artificial Neural Networks
Overview
Introduction
Artificial neural networks (ANNs) provide a general, practical method for learning real-valued, discrete-valued, and vector-valued target functions from examples.
For certain types of problems, such as learning to interpret complex real-world sensor data, artificial neural networks are among the most effective learning methods currently known.
For example, the BACKPROPAGATION algorithm has proven surprisingly successful in many practical problems, such as learning to recognize handwritten characters, spoken words, and faces.
Biological Motivation
- Human brain: densely interconnected network of ~10^11 neurons, each connected to ~10^4 others
(neuron switching time: approx. 10^-3 sec.)
- Properties of artificial neural nets (ANNs):
  - many neuron-like threshold switching units
  - many weighted interconnections among units
  - a highly parallel, distributed process
  - an emphasis on tuning weights automatically
Appropriate problems for neural network learning
ANN learning is well suited to problems in which the training data is noisy, complex sensor data (e.g. raw sensor input from cameras and microphones).
ANN is appropriate for problems with the following characteristics:
- The target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes.
Perceptron
Perceptrons can represent all of the primitive boolean functions AND, OR, NAND, and NOR; however, some boolean functions, such as XOR, cannot be represented by a single perceptron.
- Linearly separable case like (a):
possible to classify by a hyperplane
- Linearly inseparable case like (b):
impossible to classify with a single hyperplane
Decision surface of a perceptron
Perceptron training rule
wi ← wi + Δwi
where Δwi = η (t − o) xi
where:
- t is the target output and o is the perceptron output
- η is a small positive constant (e.g., 0.1) called the learning rate
Can prove the rule converges, provided the training examples are linearly separable and η is sufficiently small.
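As a concrete illustration, the rule above can be sketched in Python. The function names, the learning rate η = 0.1, and the AND training set are illustrative choices, not part of the original text:

```python
# Sketch of the perceptron training rule wi <- wi + eta*(t - o)*xi,
# trained on the boolean AND function with targets in {-1, +1}.
# All names and constants here are illustrative.

def perceptron_output(w, x):
    """Threshold unit: +1 if w0 + w1*x1 + ... + wn*xn > 0, else -1."""
    s = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return 1 if s > 0 else -1

def train_perceptron(examples, eta=0.1, epochs=50):
    """Repeatedly apply wi <- wi + eta*(t - o)*xi over the examples."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)            # w[0] is the threshold weight (x0 = 1)
    for _ in range(epochs):
        for x, t in examples:
            o = perceptron_output(w, x)
            w[0] += eta * (t - o)  # bias input x0 = 1
            for i, xi in enumerate(x, start=1):
                w[i] += eta * (t - o) * xi
    return w

# AND is linearly separable, so the rule converges to a separating line.
and_examples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(and_examples)
```

Because (t − o) is zero whenever an example is classified correctly, the weights stop changing once a separating hyperplane is found.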
Gradient descent
- Error (summed over all training examples):
  E(w) ≡ ½ Σ_{d∈D} (t_d − o_d)²
- The gradient of E (the vector of partial derivatives):
  ∇E(w) ≡ [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
- Direction: steepest increase in E.
- Thus, the training rule is as follows:
  Δwi = −η ∂E/∂wi
  (The negative sign: the direction that decreases E)
Derivation of gradient descent
∂E/∂wi = ∂/∂wi ½ Σ_d (t_d − o_d)² = Σ_d (t_d − o_d)(−x_id)
where x_id denotes the single input component xi for training example d.
- The weight update rule for gradient descent:
∴ Δwi = η Σ_d (t_d − o_d) x_id
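The batch update rule Δwi = η Σ_d (t_d − o_d) x_id can be sketched for a linear unit as follows; the function names, η = 0.05, and the toy target t = 2·x1 − 1 are illustrative assumptions:

```python
# Sketch of batch gradient descent for a linear unit o = w . x:
# accumulate Delta wi = eta * sum_d (t_d - o_d) * x_id over ALL examples,
# then update the weights once per epoch. Names and data are illustrative.

def linear_output(w, x):
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def gradient_descent(examples, eta=0.05, epochs=200):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)              # w[0] is the bias weight (x0 = 1)
    for _ in range(epochs):
        delta = [0.0] * (n + 1)      # accumulate over the whole batch
        for x, t in examples:
            err = t - linear_output(w, x)
            delta[0] += eta * err
            for i, xi in enumerate(x, start=1):
                delta[i] += eta * err * xi
        for i in range(n + 1):       # one weight update per epoch
            w[i] += delta[i]
    return w

# Illustrative linear target t = 2*x1 - 1; weights approach (-1, 2).
data = [((x,), 2 * x - 1) for x in (-1.0, 0.0, 1.0, 2.0)]
w = gradient_descent(data)
```

Because the error surface for a linear unit is parabolic with a single global minimum, gradient descent with a sufficiently small η converges to the minimum-error weights.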
Gradient descent and delta rule
- Error of different hypotheses
- For a linear unit with two weights, the hypothesis space H is the w0, w1 plane.
- For a linear unit, this error surface is parabolic with a single global minimum (we seek the hypothesis, i.e. weight vector, with minimum error).
Hypothesis Space
- Stochastic gradient descent (i.e. incremental mode) can sometimes avoid falling into local minima because it follows the gradient of the per-example error E_d rather than the gradient of the overall error E.
Stochastic approximation to gradient descent
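The incremental (delta-rule) variant can be sketched as below: the weights are updated after each example rather than after the whole batch. The function name, η = 0.05, and the data are the same illustrative assumptions as before:

```python
# Sketch of the stochastic approximation to gradient descent: apply
# Delta wi = eta*(t - o)*xi after EACH training example, instead of
# summing over all examples first. Names and data are illustrative.

def sgd_linear_unit(examples, eta=0.05, epochs=200):
    n = len(examples[0][0])
    w = [0.0] * (n + 1)              # w[0] is the bias weight (x0 = 1)
    for _ in range(epochs):
        for x, t in examples:        # one small step per example
            o = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
            w[0] += eta * (t - o)
            for i, xi in enumerate(x, start=1):
                w[i] += eta * (t - o) * xi
    return w

# Same illustrative target t = 2*x1 - 1; the weights approach (-1, 2).
data = [((x,), 2 * x - 1) for x in (-1.0, 0.0, 1.0, 2.0)]
w = sgd_linear_unit(data)
```

For small η the sequence of per-example steps approximates the true (batch) gradient step, while the randomness of individual examples lets the search escape some shallow local minima.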
Multilayer networks and the backpropagation algorithm
Sigmoid Threshold Unit
The sigmoid unit first computes a linear combination of its inputs, then applies the sigmoid function σ(net) = 1 / (1 + e^(−net)) to the result. Unlike the perceptron's step threshold, the sigmoid output is a continuous, differentiable function of its input.
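A minimal sketch of the sigmoid unit; the function names are illustrative:

```python
import math

def sigmoid(y):
    """The logistic function sigma(y) = 1 / (1 + e^(-y))."""
    return 1.0 / (1.0 + math.exp(-y))

def sigmoid_unit(w, x):
    """Linear combination of inputs followed by the sigmoid squash."""
    net = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
    return sigmoid(net)

# A property backpropagation relies on: d sigma/dy = sigma(y)*(1 - sigma(y)),
# so the derivative is available from the unit's own output.
```

The derivative identity σ'(y) = σ(y)(1 − σ(y)) is what makes the sigmoid convenient for gradient-based training.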
The Backpropagation algorithm
Adding Momentum
The most common variation is to alter the weight-update rule so that the weight update on the nth iteration depends partially on the update that occurred during the (n−1)th iteration:
Δw_ji(n) = η δ_j x_ji + α Δw_ji(n−1)
where:
- the nth iteration update depends on the (n−1)th iteration update (δ_j is the error term for unit j)
- α: a constant between 0 and 1, called the momentum
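The momentum-augmented update can be sketched as a single step function; the name and the sample values η = 0.1, α = 0.9 are illustrative assumptions:

```python
# Sketch of a momentum weight update:
#   Delta w(n) = eta * grad_term + alpha * Delta w(n-1)
# grad_term stands for the per-weight gradient contribution (e.g. the
# delta_j * x_ji terms in backpropagation). Names are illustrative.

def momentum_update(w, grad_term, prev_delta, eta=0.1, alpha=0.9):
    """Return (new_weights, new_delta) for one momentum step."""
    delta = [eta * g + alpha * d for g, d in zip(grad_term, prev_delta)]
    new_w = [wi + di for wi, di in zip(w, delta)]
    return new_w, delta

# Repeated steps in a constant gradient direction accelerate, like a ball
# rolling down the error surface gathering speed.
w, d = momentum_update([0.0, 0.0], [1.0, 1.0], [0.0, 0.0])
w, d = momentum_update(w, [1.0, 1.0], d)   # second step is larger: 0.1 + 0.9*0.1
```

Momentum keeps the search moving through small local minima and speeds progress along flat regions of the error surface.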
22/04/18
Remarks on the Backpropagation algorithm
Convergence and Local Minima
Representational Power of Feed forward Networks
Hypothesis Space Search and Inductive Bias
Hidden layer representations
- This 8x3x8 network was trained to learn the identity function.
- 8 training examples are used.
- After 5000 training iterations, the three hidden unit values encode the eight distinct inputs using the encoding shown on the right.
Hidden layer representations
Learning the 8x3x8 network
- Most of the interesting weight changes occurred during the first 2500 iterations.
Generalization, Overfitting, and Stopping Criterion
Illustrative example: Face Recognition
Task
Design Choice
In applying BACKPROPAGATION to any given task, a number of design choices must be made
After training on a set of 260 images, classification accuracy over a separate test set is 90%.
Advanced topics in artificial neural networks
Alternative Error Functions
Recurrent Networks
[Figure: (a) a feedforward network; (b) a recurrent network; (c) the recurrent network unfolded in time]
Dynamically Modifying Network Structure
Evaluating Hypothesis
One reason is simply to understand whether to use the hypothesis. For instance, when learning from a limited-size database indicating the effectiveness of different medical treatments, it is important to understand as precisely as possible the accuracy of the learned hypotheses.
Estimating the accuracy of a hypothesis is relatively straightforward when data is plentiful. However, when we must learn a hypothesis and estimate its future accuracy given only a limited set of data, two key difficulties arise:
- Bias in the estimate: testing a hypothesis on its own training examples gives an optimistically biased estimate, so the hypothesis should be tested on an independent set of test examples.
- Variance in the estimate: even over an independent test set, the measured accuracy can still vary from the true accuracy, depending on the makeup of the particular set of test examples. The smaller the set of test examples, the greater the expected variance.
ESTIMATING HYPOTHESIS ACCURACY
1. Given a hypothesis h and a data sample containing n examples drawn at random according to the distribution D, what is the best estimate of the accuracy of h over future instances drawn from the same distribution?
2. What is the probable error in this accuracy estimate?
Sample Error and True Error
One is the error rate of the hypothesis over the sample of data that is available. The other is the error rate of the hypothesis over the entire unknown distribution D of examples.
The sample error of a hypothesis h with respect to some sample S of n instances drawn from X is the fraction of S that it misclassifies:
error_S(h) ≡ (1/n) Σ_{x∈S} δ(f(x) ≠ h(x))
where δ(f(x) ≠ h(x)) is 1 if f(x) ≠ h(x), and 0 otherwise.
The true error of a hypothesis h is the probability that it will misclassify a single randomly drawn instance from the distribution D:
error_D(h) ≡ Pr_{x∈D}[f(x) ≠ h(x)]
where Pr_{x∈D} denotes that the probability is taken over the instance distribution D.
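The sample-error definition translates directly into code; the hypothesis h and the labeled sample below are made-up illustrations, not from the text:

```python
# Direct transcription of the sample-error definition: the fraction of
# the sample S that hypothesis h misclassifies. The hypothesis and the
# sample here are hypothetical examples.

def sample_error(h, sample):
    """error_S(h) = (1/n) * sum over (x, f(x)) in S of [h(x) != f(x)]."""
    return sum(1 for x, fx in sample if h(x) != fx) / len(sample)

# Hypothetical target f(x) = (x >= 0); h misclassifies one of four points
# (h(0) = False, but f(0) = True), so the sample error is 1/4.
h = lambda x: x > 0
S = [(-2, False), (-1, False), (0, True), (3, True)]
```

The true error, by contrast, cannot be computed this way: it would require the full distribution D, which is exactly what is unknown.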
Confidence Intervals for Discrete-Valued Hypotheses
Central Limit Theorem
One essential fact that simplifies attempts to derive confidence intervals is the Central Limit Theorem. Consider again our general setting, in which we observe the values of n independently drawn random variables Y1, …, Yn that obey the same unknown underlying probability distribution (e.g., n tosses of the same coin).
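Under this normal approximation, the N% confidence interval for the true error of a discrete-valued hypothesis is error_S(h) ± z_N · sqrt(error_S(h)(1 − error_S(h)) / n). A sketch, where the function name and the example numbers (error_S = 0.30 over n = 40 test examples) are illustrative:

```python
import math

def confidence_interval(error_s, n, z=1.96):
    """Approximate confidence interval for the true error, given the
    sample error over n independently drawn test examples.
    z = 1.96 is the standard-normal value for a 95% interval."""
    margin = z * math.sqrt(error_s * (1 - error_s) / n)
    return (error_s - margin, error_s + margin)

# Illustrative case: a hypothesis that misclassifies 12 of 40 test
# examples (error_S = 0.30) has a 95% interval of roughly 0.30 +/- 0.14.
lo, hi = confidence_interval(0.30, 40)
```

The approximation works well when n · error_S(h) · (1 − error_S(h)) is reasonably large (a common rule of thumb is at least 5), which is where the Central Limit Theorem justifies treating the sample error as normally distributed.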