1 of 62

Artificial Neural Network

Dinesh K. Vishwakarma, Ph.D.

PROFESSOR, DEPARTMENT OF INFORMATION TECHNOLOGY

DELHI TECHNOLOGICAL UNIVERSITY, DELHI.

Webpage: http://www.dtu.ac.in/Web/Departments/InformationTechnology/faculty/dkvishwakarma.php

2 of 62

Introduction

  • Artificial neural networks (ANNs) provide a practical method for learning
      • real-valued functions
      • discrete-valued functions
      • vector-valued functions
  • Robust to errors in training data
  • Successfully applied to such problems as
      • interpreting visual scenes
      • speech recognition
      • learning robot control strategies


3 of 62

Introduction…

  • ANN learning is well suited to problems in which the training data is noisy and complex (e.g., inputs from cameras or microphones)
  • Can also be used for problems with symbolic representations
  • Most appropriate for problems where
    • Instances have many attribute-value pairs
    • Target function output may be discrete-valued, real-valued, or a vector of several real- or discrete-valued attributes
    • Training examples may contain errors
    • Long training times are acceptable
    • Fast evaluation of the learned target function may be required
    • The ability for humans to understand the learned target function is not important.


4 of 62

Human Brain Processing


[Diagram: biological neuron, input to output. Dendrites: input; Cell body: processor; Synapse: link; Axon: output]

The human brain is made up of billions of simple processing units – neurons.

5 of 62

Neuron


[Diagram: artificial neuron. Inputs x1 … xn are multiplied by weights w1 … wn, summed together with a bias b, and passed through an activation function f, giving output y = f(Σi wi xi + b)]

6 of 62

Neuron…


  • Artificial neurons are based on biological neurons.
  • Each neuron in the network receives one or more inputs.
  • An activation function is applied to the weighted sum of the inputs; the result determines the output of the neuron – the activation level.

[Figures: common activation functions, and how an activation function works]
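A minimal sketch of a single neuron in Python; the sigmoid activation and all numeric values here are illustrative, not taken from the slides:

    import numpy as np

    def sigmoid(z):
        # Logistic activation: squashes any real z into (0, 1).
        return 1.0 / (1.0 + np.exp(-z))

    def neuron(x, w, b):
        # One artificial neuron: activation applied to the weighted sum plus bias.
        return sigmoid(np.dot(w, x) + b)

    x = np.array([0.5, -1.0, 2.0])   # hypothetical inputs
    w = np.array([0.8, 0.2, -0.4])   # hypothetical weights
    print(neuron(x, w, b=0.1))       # a value in (0, 1): the activation level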

7 of 62

Neural Network


How do we train?

[Diagram: a fully connected network with 3 inputs, one hidden layer of 4 neurons, and 2 outputs; the edges carry the weights and each neuron applies an activation function]

4 + 2 = 6 neurons (not counting inputs)

[3 x 4] + [4 x 2] = 20 weights

4 + 2 = 6 biases

26 learnable parameters
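The same count works for any fully connected layout; a quick sketch:

    def count_params(layer_sizes):
        # Weights connect consecutive layers; one bias per non-input neuron.
        weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
        biases = sum(layer_sizes[1:])
        return weights + biases

    print(count_params([3, 4, 2]))   # 20 weights + 6 biases = 26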


8 of 62

Training Perceptron

  • Learning involves choosing values for the weights.
  • The perceptron is trained as follows:
    • First, inputs are given random weights (usually between –0.5 and 0.5).
    • An item of training data is presented. If the perceptron misclassifies it, each weight is modified by the update rule w_i ← w_i + a(t − o)x_i
      • where t is the target output for the training example, o is the output generated by the perceptron, and a is the learning rate, between 0 and 1 (usually small, such as 0.1).
  • Cycle through the training examples until all are classified correctly; a sketch of this loop follows below.
    • Each cycle is known as an epoch.
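A minimal sketch of that loop; the AND dataset and the trainable bias term are illustrative additions:

    import numpy as np

    rng = np.random.default_rng(0)

    def train_perceptron(X, t, a=0.1, max_epochs=100):
        # Perceptron rule: w_i <- w_i + a * (t - o) * x_i, cycling until all correct.
        w = rng.uniform(-0.5, 0.5, X.shape[1])   # random initial weights
        b = rng.uniform(-0.5, 0.5)               # bias, updated the same way
        for _ in range(max_epochs):              # each pass is one epoch
            mistakes = 0
            for x, target in zip(X, t):
                o = 1 if np.dot(w, x) + b > 0 else 0   # threshold output
                if o != target:
                    w += a * (target - o) * x
                    b += a * (target - o)
                    mistakes += 1
            if mistakes == 0:                    # every example classified correctly
                break
        return w, b

    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # toy data: logical AND
    t = np.array([0, 0, 0, 1])
    print(train_perceptron(X, t))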


9 of 62

Backpropagation

  • Multilayer neural networks learn in the same way as perceptrons.
  • However, there are many more weights, and it is important to assign credit (or blame) correctly when changing weights.
  • The training error E sums the squared errors over all of the network output units k and all training examples d: E(w) = ½ Σ_d Σ_k (t_kd − o_kd)²


10 of 62

Backpropagation Algorithm

  • Create a feed-forward network with n_in inputs, n_hidden hidden units, and n_out output units.
  • Initialize all network weights to small random numbers.
  • Until the termination condition is met, Do
    • For each <x,t> in the training examples, Do
      • Propagate the input forward through the network: input the instance x and compute the output o_u of every unit u in the network.
      • Propagate the errors backward through the network:
        • For each network output unit k, calculate its error term δ_k ← o_k(1 − o_k)(t_k − o_k)
        • For each hidden unit h, calculate its error term δ_h ← o_h(1 − o_h) Σ_{k ∈ outputs} w_kh δ_k
        • Update each network weight: w_ji ← w_ji + η δ_j x_ji

(These error terms assume sigmoid units; a compact implementation sketch follows.)
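A compact numpy sketch of the algorithm for one hidden layer of sigmoid units; the layer sizes, the single training pair, and the learning rate η = 0.5 are illustrative:

    import numpy as np

    rng = np.random.default_rng(1)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def backprop_step(x, t, W_h, W_o, eta=0.5):
        # Forward pass: compute the output o_u of every unit.
        o_h = sigmoid(W_h @ x)
        o_k = sigmoid(W_o @ o_h)
        # Backward pass: error terms for output units, then hidden units.
        delta_k = o_k * (1 - o_k) * (t - o_k)
        delta_h = o_h * (1 - o_h) * (W_o.T @ delta_k)
        # Weight updates: w_ji <- w_ji + eta * delta_j * x_ji.
        W_o += eta * np.outer(delta_k, o_h)
        W_h += eta * np.outer(delta_h, x)
        return W_h, W_o

    W_h = rng.uniform(-0.05, 0.05, (4, 3))    # 3 inputs, 4 hidden units
    W_o = rng.uniform(-0.05, 0.05, (2, 4))    # 2 output units
    x, t = np.array([1.0, 0.0, 1.0]), np.array([1.0, 0.0])
    for _ in range(1000):                     # "until termination condition is met"
        W_h, W_o = backprop_step(x, t, W_h, W_o)
    print(sigmoid(W_o @ sigmoid(W_h @ x)))    # approaches t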


11 of 62

Hidden Layer representation

Can this be learned?

Target Function: f(x) = x, the identity over the eight one-hot inputs listed on the next slide; an 8 × 3 × 8 network must re-create each 8-bit input at its output through only three hidden units.

12 of 62

Yes


Input       Hidden Values      Output
10000000    .89  .04  .08      10000000
01000000    .15  .99  .99      01000000
00100000    .01  .97  .27      00100000
00010000    .99  .97  .71      00010000
00001000    .03  .05  .02      00001000
00000100    .01  .11  .88      00000100
00000010    .80  .01  .98      00000010
00000001    .60  .94  .01      00000001
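Rounding each hidden value to 0 or 1, an observation easily checked in Python, shows that the three hidden units have learned a distinct 3-bit code for each of the eight inputs:

    hidden = [
        (.89, .04, .08), (.15, .99, .99), (.01, .97, .27), (.99, .97, .71),
        (.03, .05, .02), (.01, .11, .88), (.80, .01, .98), (.60, .94, .01),
    ]
    codes = ["".join(str(round(h)) for h in row) for row in hidden]
    print(codes)                  # ['100', '011', '010', '111', '000', '001', '101', '110']
    print(len(set(codes)) == 8)   # True: all eight codes are distinct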

13 of 62

Example 1 of NN


[Diagram: a single neuron computing f(x) from three inputs via weights w1, w2, w3 with values 1.4, −2.5, and −0.06]

14 of 62

Example 1 of NN…


[Diagram: the same neuron with inputs 2.7, −8.6, and 0.002 applied]

The weighted sum is x = (−0.06 × 2.7) + (−2.5 × −8.6) + (1.4 × 0.002) = 21.34, which then passes through the activation function f.
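A one-line check of that weighted sum:

    import numpy as np

    w = np.array([-0.06, -2.5, 1.4])   # weights, ordered as in the sum above
    x = np.array([2.7, -8.6, 0.002])   # inputs
    print(np.dot(w, x))                # 21.3408, i.e. 21.34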

15 of 62

Example 1 of NN…

A dataset

Fields           Class
1.4  2.7  1.9    0
3.8  3.4  3.2    0
6.4  2.8  1.7    1
4.1  0.1  0.2    0
etc …

16 of 62

Example 1 of NN…

Training the neural network: each row's three fields are presented to a network (three inputs, one output), and the class is the target output.

17 of 62

Example 1 of NN…

Initialise with random weights.

18 of 62

Example 1 of NN…

Present a training pattern: inputs (1.4, 2.7, 1.9).

19 of 62

Example 1 of NN…

Feed it through to get the output: 0.8.

20 of 62

Example 1 of NN…

Compare with the target output: the target is 0 but the output is 0.8, so the error is 0.8.

21 of 62

Example 1 of NN…

Adjust the weights based on this error.

22 of 62

Example 1 of NN…

Present another training pattern: inputs (6.4, 2.8, 1.7).

23 of 62

Example 1 of NN…

Feed it through to get the output: 0.9.

24 of 62

Example 1 of NN…

Compare with the target output: the target is 1 but the output is 0.9, so the error is −0.1.

25 of 62

Example 1 of NN…

Adjust the weights based on this error.

26 of 62

Example 1 of NN…

And so on….

Repeat this thousands, maybe millions of times, each time taking a random training instance and making slight weight adjustments.

Algorithms for weight adjustment are designed to make changes that will reduce the error.

27 of 62

Example of Digit Recognition

[Diagram: a handwritten "2" is scanned as a 16 × 16 = 256-pixel image (ink → 1, no ink → 0); the 256 values feed a network with ten outputs y1, …, y10, where yi is the confidence that the image is the digit i (y10 standing for "0"). Here y1 = 0.1, y2 = 0.7, y10 = 0.2, so the machine reads the image as "2".]

28 of 62

Example of Neural Network

The neurons use the sigmoid function σ(z) = 1 / (1 + e^(−z)).

[Diagram] With input (1, −1): the first neuron has weights (1, −2) and bias 1, so σ(1·1 + (−1)(−2) + 1) = σ(4) = 0.98; the second has weights (−1, 1) and bias 0, so σ(1·(−1) + (−1)·1 + 0) = σ(−2) = 0.12.

29 of 62

Example of Neural Network

[Diagram] Continuing through the remaining layers with input (1, −1):

Layer 1: 0.98 and 0.12, as above.
Layer 2: σ(0.98·2 + 0.12·(−1) + 0) = 0.86 and σ(0.98·(−2) + 0.12·(−1) + 0) = 0.11.
Layer 3 (output): σ(0.86·3 + 0.11·(−1) − 2) = 0.62 and σ(0.86·(−1) + 0.11·4 + 2) = 0.83.

30 of 62

Example of Neural Network

[Diagram] The same network on input (0, 0):

Layer 1: σ(0·1 + 0·(−2) + 1) = σ(1) = 0.73 and σ(0·(−1) + 0·1 + 0) = σ(0) = 0.5.
Layer 2: σ(0.73·2 + 0.5·(−1) + 0) = 0.72 and σ(0.73·(−2) + 0.5·(−1) + 0) = 0.12.
Layer 3 (output): σ(0.72·3 + 0.12·(−1) − 2) = 0.51 and σ(0.72·(−1) + 0.12·4 + 2) = 0.85.

A network with a given set of parameters defines a function; different parameters define a different function. Here f(1, −1) = (0.62, 0.83) and f(0, 0) = (0.51, 0.85).
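A small numpy check of both forward passes, with the weights and biases read off the diagrams above:

    import numpy as np

    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    # One (weights, biases) pair per layer; each row is one neuron.
    params = [
        (np.array([[1, -2], [-1, 1]]), np.array([1, 0])),
        (np.array([[2, -1], [-2, -1]]), np.array([0, 0])),
        (np.array([[3, -1], [-1, 4]]), np.array([-2, 2])),
    ]

    def f(x):
        for W, b in params:
            x = sigmoid(W @ x + b)
        return x

    print(f(np.array([1, -1])).round(2))   # [0.62 0.83]
    print(f(np.array([0, 0])).round(2))    # [0.51 0.85]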

31 of 62

Example of Neural Network

In matrix form, the first layer's computation is

σ( W1 x + b1 ) = σ( [1 −2; −1 1] [1; −1] + [1; 0] ) = σ( [4; −2] ) = [0.98; 0.12]

32 of 62

Example of Neural Network

With weight matrices W1, …, WL and bias vectors b1, …, bL, the layer outputs are

a1 = σ(W1 x + b1)
a2 = σ(W2 a1 + b2)
…
y = σ(WL aL−1 + bL)

where y = (y1, y2, …, yM) is the network output.

33 of 62

Neural Network

Composing the layers, the whole network computes one nested function:

y = f(x) = σ( WL · … σ( W2 · σ( W1 x + b1 ) + b2 ) … + bL )

Every step is a matrix operation, so parallel computing techniques (notably GPUs) can be used to speed it up.
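Since every layer is one matrix product, a whole batch of inputs can be pushed through together, which is exactly the kind of work GPUs accelerate; a sketch with illustrative sizes:

    import numpy as np

    rng = np.random.default_rng(2)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def forward(X, params):
        # y = sigma(W_L ... sigma(W_2 sigma(W_1 x + b_1) + b_2) ... + b_L), batched.
        A = X                            # each row of X is one input vector
        for W, b in params:
            A = sigmoid(A @ W.T + b)     # one matrix product per layer
        return A

    sizes = [256, 100, 10]               # illustrative 256 -> 100 -> 10 network
    params = [(rng.normal(0, 0.1, (m, n)), np.zeros(m))
              for n, m in zip(sizes, sizes[1:])]
    print(forward(rng.random((32, 256)), params).shape)   # (32, 10): 32 inputs at once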

34 of 62

Softmax

  • Softmax layer as the output layer

Ordinary layer: in general, the outputs of the network can be any values, which may not be easy to interpret.

35 of 62

Softmax

  • Softmax layer as the output layer

A softmax layer exponentiates each input and normalises, so the outputs are positive and sum to 1:

y_i = e^{z_i} / Σ_j e^{z_j}

For example, inputs (3, 1, −3) give e^3 = 20, e^1 = 2.7, e^−3 ≈ 0.05, which normalise to 0.88, 0.12, and approximately 0.
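The same computation as a small Python function; subtracting the maximum before exponentiating is a standard numerical-stability trick, not something shown on the slide:

    import numpy as np

    def softmax(z):
        # Exponentiate and normalise: outputs are positive and sum to 1.
        e = np.exp(z - z.max())
        return e / e.sum()

    print(softmax(np.array([3.0, 1.0, -3.0])).round(2))   # [0.88 0.12 0.  ]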

36 of 62

Network Parameters

With a 16 × 16 = 256-pixel input (ink → 1, no ink → 0) and a softmax output layer over y1 … y10:

Set the network parameters such that
  • when the input is an image of "1", y1 has the maximum value;
  • when the input is an image of "2", y2 has the maximum value; and so on.

(For the example image of a "2", the outputs were y1 = 0.1, y2 = 0.7, …, y10 = 0.2.)

37 of 62

Visual Information Processing

  • Visual information processing in our brain is multi-layered.


38 of 62

Enabling Factors of DL

  • Training of deep networks was made computationally feasible by:
      • Faster CPUs
      • The move to parallel CPU architectures
      • The advent of GPU computing
  • Neural networks are often represented as matrices of weight vectors.
  • GPUs are optimized for very fast matrix multiplication.
  • 2008 – Nvidia’s CUDA library for GPU computing is released.


39 of 62

Hierarchical Learning


[Diagram: trainable pipeline from low-level features → mid-level features → high-level features → trainable classifier → output]

Inspired by visual information processing, a hierarchical way of learning representations was developed, also known as “Deep Learning”.

The term was first used in 1986 by Rina Dechter; the field has undergone a revolution since 2012.

40 of 62

Deep Neural Network

[Diagram: input layer x → hidden Layer 1 → Layer 2 → … → Layer L → output layer producing y1, y2, …, yM; each node is a neuron]

“Deep” means many hidden layers.

41 of 62

Why Deep Network?

Deep (layers × size)   WER (%)      Shallow (layers × size)   WER (%)
1 × 2k                 24.2
2 × 2k                 20.4
3 × 2k                 18.4
4 × 2k                 17.8
5 × 2k                 17.2         1 × 3772                  22.5
7 × 2k                 17.1         1 × 4634                  22.6
                                    1 × 16k                   22.1

Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech, 2011.

Not surprising: more parameters give better performance.

42 of 62

Why Deep Network?

  • Universality Theorem


Any continuous function f can be realized by a network with one hidden layer (given enough hidden neurons).

So why a “deep” neural network rather than a “fat” one?
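Before answering, a quick numerical illustration of the universality claim; the target function sin(2x) and the random-feature fitting scheme are my assumptions, not the slides’:

    import numpy as np

    rng = np.random.default_rng(3)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    x = np.linspace(-3, 3, 200)[:, None]
    f = np.sin(2 * x).ravel()                # a continuous target function

    for n_hidden in [2, 10, 100]:
        # One hidden layer: random sigmoid features, output weights by least squares.
        W, b = rng.normal(0, 2, (1, n_hidden)), rng.normal(0, 2, n_hidden)
        H = sigmoid(x @ W + b)
        v, *_ = np.linalg.lstsq(H, f, rcond=None)
        print(n_hidden, np.abs(H @ v - f).max())   # error typically shrinks with width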

43 of 62


Fat + Short vs. Thin + Tall

[Diagram: a shallow, wide network and a deep, narrow network with the same number of parameters]

Which one is better?

44 of 62

Fat + Short vs. Thin + Tall


Seide, Frank, Gang Li, and Dong Yu. "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks." Interspeech. 2011.

The table above answers the question: at comparable parameter budgets, the thin, tall networks achieve lower word error rates than the fat, short ones. For example, 5 × 2k reaches 17.2% WER while the shallow 1 × 3772 remains at 22.5%.

45 of 62

Training multi-layer NNs (DNN)


46 of 62

Training multi-layer NNs

Train this layer first

47 of 62

Training multi-layer NNs


Train this layer first

then this layer

48 of 62

Training multi-layer NNs


Train this layer first

then this layer

then this layer

49 of 62

Training multi-layer NNs


Train this layer first

then this layer

then this layer

then this layer

50 of 62

Training multi-layer NNs


Train this layer first

then this layer

then this layer

then this layer

finally this layer
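A minimal sketch of this greedy layer-wise scheme; the slides do not name the per-layer objective, so this assumes the common stacked-autoencoder variant with illustrative sizes:

    import numpy as np

    rng = np.random.default_rng(4)
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

    def train_autoencoder(X, n_hidden, lr=0.1, epochs=200):
        # Train one layer to reconstruct its own input; keep only the encoder.
        n = X.shape[1]
        W_enc = rng.normal(0, 0.1, (n, n_hidden))
        W_dec = rng.normal(0, 0.1, (n_hidden, n))
        for _ in range(epochs):
            H = sigmoid(X @ W_enc)               # encode
            err = H @ W_dec - X                  # linear reconstruction error
            dW_dec = H.T @ err / len(X)          # gradients of mean squared error
            dH = err @ W_dec.T * H * (1 - H)
            dW_enc = X.T @ dH / len(X)
            W_dec -= lr * dW_dec
            W_enc -= lr * dW_enc
        return W_enc

    X = rng.random((100, 8))                     # illustrative unlabeled data
    weights, H = [], X
    for n_hidden in [6, 4, 3]:                   # train this layer first, then the next...
        W = train_autoencoder(H, n_hidden)
        weights.append(W)
        H = sigmoid(H @ W)                       # frozen codes feed the next layer
    # A final supervised layer plus backpropagation fine-tuning would follow.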

51 of 62

When to use Deep Learning?

  • Data size is large
  • High-end infrastructure is available
  • Domain understanding is lacking, so features are hard to hand-design
  • The problem is complex, e.g. image classification or speech recognition


“The fuel of deep learning is big data.” – Andrew Ng

[Plot: performance vs. amount of data; deep learning keeps improving with more data, while traditional machine learning levels off]

52 of 62

Limitations of Deep Learning

  • Very slow to train
  • Models are very complex, with a lot of parameters to optimize:
    • Initialization of weights
    • Layer-wise training algorithm
    • Neural architecture
      • Number of layers
      • Size of layers
      • Type – regular, pooling, max pooling, softmax
    • Fine-tuning of weights using backpropagation


53 of 62

Thank you! dinesh@dtu.ac.in


54 of 62

Problems on Neural Networks


55 of 62

Problem 1

  • Consider an artificial neuron that has three input nodes x = (x1, x2, x3), each receiving only binary signals (either 0 or 1). How many different input patterns can this node receive? What if the node had four inputs? Five? Can you give a formula that computes the number of binary input patterns for a given number of inputs?


56 of 62

Solutions

  • A node with n binary inputs can receive 2^n different input patterns: 2^3 = 8 for three inputs, 2^4 = 16 for four, and 2^5 = 32 for five.


57 of 62

Problem 2

  • Consider an artificial neuron with three inputs whose corresponding weights are (2, −4, 1) and whose activation function is the unit step. Determine the output for the following input values.


58 of 62

Solutions

  • For each input vector, compute the weighted sum 2x1 − 4x2 + x3 and apply the unit step (assuming a threshold of 0): the output is 1 if the sum is ≥ 0, and 0 otherwise.


59 of 62

Problem 3

  • Consider a feed-forward network with two inputs, two hidden nodes (3 and 4), and two output nodes (5 and 6), with given weights and input values. Calculate the outputs of the network.


60 of 62

Solutions

  • To find the output of the network, it is first necessary to calculate the weighted sums of hidden nodes 3 and 4.

  • Then find the outputs of the hidden nodes by applying the activation function.
  • Use the outputs of the hidden nodes, y3 and y4, as the input values to the output layer (nodes 5 and 6), and find the weighted sums of output nodes 5 and 6.

  • Finally, compute the outputs of nodes 5 and 6 using the same activation function.

