1 of 28

Machine Learning I:

Artificial Neural Networks

Patrick Hall

Visiting Faculty, Department of Decision Sciences

George Washington University

2 of 28

Lecture 5 Agenda

  • Brief Introduction
  • Artificial Neural Networks
    • Algorithm Overview
    • Supervised: MLP
    • Unsupervised: Autoencoder
  • Python Code Example: Sonar Case
  • Readings

3 of 28

Where are we in the modeling lifecycle?

[Modeling lifecycle diagram: Data Collection & ETL, Feature Selection & Engineering, Supervised Learning, Unsupervised Learning, Assessment & Validation, Deployment; the stages range from cost intensive to revenue generating.]

4 of 28

Brief Introduction

Overview

5 of 28

Historical Development

The name “neural network” comes from the technology’s original conception as a model of neurotransmission, where each unit represents a neuron and each connection represents a synapse.

6 of 28

Brief History

Sources: Introduction to Statistical Learning

  • Neural networks were invented in the late 1950s: Rosenblatt’s Perceptron.
  • Neural networks rose to fame in the late 1980s with backpropagation and the multilayer perceptron (MLP).
  • Then came SVMs, boosting, and random forests that outperformed neural networks on structured data tasks. Neural networks fell from favor.
  • Neural networks resurfaced in the mid-00’s with the new name deep learning, and with new architectures and additional features.
    • Domain knowledge and mechanisms in networks (CNNs)
    • World-class success in image and video classification, and speech recognition.
    • Enabled by GPU computing and large labeled datasets.
  • A great deal of hype surrounds neural networks, making them seem magical and mysterious. While some breakthroughs have been impressive, for the most part they remain nonlinear statistical models (sophisticated regression models).

7 of 28

Neural Network Structure

Sources: Introduction to Statistical Learning

A simple feed-forward neural network using 4 predictors and a single hidden layer (5 hidden units) for modeling a numeric response:

  • X1 - X4 Input Layer: Arrows indicate that each of the input units feeds into each of the hidden units
  • A1 - A5 Hidden Layer: Computes activations - nonlinear transformations of linear combinations of the input features
  • Activation Function: tanh, sigmoid, ReLU (not shown in the figure)
  • Y Output Layer: Similar to a linear model that uses the activations in the hidden layer as inputs, producing the output predictions (see the code sketch below)
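A minimal NumPy sketch of the structure described above, assuming a 4-5-1 network with tanh hidden units and random, untrained placeholder weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder (untrained) weights for a 4-input, 5-hidden-unit, 1-output network
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)   # hidden layer -> output layer

def forward(X):
    """One feed-forward pass: nonlinear hidden activations, then a linear output."""
    A = np.tanh(X @ W1 + b1)    # A1-A5: nonlinear transforms of linear combinations of X1-X4
    return A @ W2 + b2          # Y: a linear model on the hidden activations

X = rng.normal(size=(3, 4))     # 3 rows with 4 predictors each (X1-X4)
print(forward(X))               # 3 numeric predictions
```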


8 of 28

  • Artificial neural networks are a wide and diverse technology.

  • Network architectures like MLP, autoencoder, LSTM, convolutional neural network, and several others are in wide use today.
  • Match the structure of your network to the structure of your data for best results.

Image Source: https://www.asimovinstitute.org/neural-network-zoo/

9 of 28

Network Structure & Data Structure

Sources: Demystifying Deep Learning, SAS Institute; Explainable Neural Networks based on Additive Index Models

Convolutional neural networks for images:

Specialized, GAM-like architectures for structured data:

10 of 28

Neural Networks

Neural Network Algorithm Overview

Supervised: MLP

Unsupervised: Autoencoder

11 of 28

Sources: Introduction to Statistical Learning & Elements of Statistical Learning

Neural Network Algorithm Overview

  • Neural Network Structure: Hidden Layers and Activation Function
  • Training Neural Networks:
    • Gradient Descent & Backpropagation
    • Stochastic Gradient Descent
    • Regularization & Dropout
    • Hyperparameter Tuning
  • Supervised: MLP
  • Unsupervised: Autoencoder
  • Issues

12 of 28

Sources: Introduction to Data Mining

Hidden Layers: Activations

  • Activations in the hidden layer(s) are generated with pre-specified activation functions, e.g., tanh, sigmoid, and ReLU
  • Hidden units are analogous to neurons in the brain - units with large outputs are firing, while those close to zero are silent
  • Number of layers is associated with “depth”
    • Deeper networks can express a complex hierarchy of feature extraction
    • Every hidden layer ideally represents a level of abstraction where complex features are compositions of simpler features
  • Every node in a hidden layer operates on activations from the preceding layer and transmits activations forward to the nodes of the next layer

13 of 28

Sources: Introduction to Statistical Learning

Image Source: https://www.spiedigitallibrary.org/

Activation Function

  • Not an exhaustive list, but the most commonly used activation functions are:
    • Identity (no transformation)
    • Rectified Linear Activation (ReLU)
    • Sigmoid (logistic)
    • Hyperbolic Tangent (tanh)
  • Activation function:
    • Computes nonlinear transformations of linear combination of inputs
    • Prevents extremely large magnitude values from harming the training process
    • Allows the model to capture (complex) nonlinearities
    • Sigmoid and tanh “saturate,” ReLU does not
    • Not typically observed, “hidden”
    • Output-layer activations for classification are typically softmax

The piecewise-linear ReLU function is popular for its computational efficiency; the graph above has been scaled for ease of comparison.
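As a sketch, the activations listed above (plus the softmax used for classification outputs) can be written in a few lines of NumPy:

```python
import numpy as np

identity = lambda z: z                          # no transformation
relu     = lambda z: np.maximum(0.0, z)         # piecewise linear; does not saturate
sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1); saturates
tanh     = np.tanh                              # squashes to (-1, 1); saturates

def softmax(z):
    """Typical output activation for classification: probabilities that sum to 1."""
    e = np.exp(z - z.max())                     # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 2.0])
print(relu(z), sigmoid(z), tanh(z), softmax(z))
```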

14 of 28

Sources: Introduction to Statistical Learning & Demystifying Deep Learning, SAS Institute

Training Neural Networks

[Figure: a small feed-forward network with three inputs (x1, x2, x3), one hidden layer with two hidden units (h1, h2) using the hyperbolic tangent activation function, and a single output unit y. This neural net has one output unit for an interval target – but neural nets can have an arbitrary number of targets.]

Multilayer Perceptron

  • Feed Forward: Estimate the prediction and error, using nonlinear activation functions to transform linear combinations of the input features into derived features
  • Back Propagation: Compute the loss function and its gradient to measure the error across the training data, and update the weights accordingly
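A minimal sketch of one feed-forward and backpropagation step for the small tanh network in the figure (3 inputs, 2 hidden units, 1 interval output, squared-error loss; the random data and the learning rate are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))                 # x1, x2, x3
y = rng.normal(size=(10, 1))                 # interval target
W1, b1 = rng.normal(size=(3, 2)) * 0.1, np.zeros(2)
W2, b2 = rng.normal(size=(2, 1)) * 0.1, np.zeros(1)
lr = 0.1                                     # learning rate (illustrative)

# Feed forward: derived features h1, h2, then the prediction and error
H = np.tanh(X @ W1 + b1)
y_hat = H @ W2 + b2
loss = np.mean((y - y_hat) ** 2)

# Backpropagation: gradient of the loss w.r.t. each weight, then a gradient step
d_yhat = 2 * (y_hat - y) / len(X)
dW2, db2 = H.T @ d_yhat, d_yhat.sum(axis=0)
d_H = (d_yhat @ W2.T) * (1 - H ** 2)         # tanh'(z) = 1 - tanh(z)^2
dW1, db1 = X.T @ d_H, d_H.sum(axis=0)
W1, b1, W2, b2 = W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2
print(loss)
```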

15 of 28

Sources: Introduction to Statistical Learning

Training Neural Networks

  • Similar to other statistical learning methods
    • Minimize the error function - squared-error, cross-entropy, log-loss, etc.
  • However, minimizing the loss function is not straightforward:
    • Nonconvex optimization
    • Rashomon effect: vast numbers of solutions are possible
  • Two strategies to overcome nonconvexity and overfitting:
    • Iterative gradient descent approaches
    • Regularization: penalties are imposed on the weights (Lasso or Ridge)
  • The gradient is used to train the network and to minimize training errors
    • Theoretically, training stops when the gradient reaches zero - some minimum of the loss function
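For concreteness, the error functions named above look like this (a sketch; the cross-entropy / log-loss version assumes a binary 0/1 target and predicted probabilities):

```python
import numpy as np

def squared_error(y, y_hat):
    """Typical error function for interval (numeric) targets."""
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, p_hat, eps=1e-12):
    """Typical error function (log-loss) for binary targets; p_hat are predicted probabilities."""
    p_hat = np.clip(p_hat, eps, 1 - eps)       # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(squared_error(np.array([1.0, 2.0]), np.array([1.1, 1.8])))
print(cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))
```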

16 of 28

Sources: Introduction to Statistical Learning

Gradient Descent & Back Propagation

  • Highly dependent on initialization, but we generally start from a random guess for weight values
  • Propagate data through network to generate ŷ for each row
  • Back Propagation:
    • Use current weights to define a position on the surface of the loss function
    • Find the gradient of the loss function w.r.t. weights
    • Update the weights in the direction of “maximum descent” (negative gradient) to improve the weights.
  • In the next iteration, use the updated weights to recompute the loss function, repeating until the gradient does not change, validation error increases, or some other stopping condition is met
  • The goal is to end up at a (good) local minimum, which may require retraining from many initializations!

Illustration of gradient descent:

The objective function is not convex - it has two minima. Start with some value (typically chosen randomly), then take steps against the gradient until the procedure cannot go down any further. Here, gradient descent reaches the global minimum in 7 steps.
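A sketch of the same idea on a one-dimensional nonconvex function (the function, start value, and step size are illustrative; from this particular start the loop settles into a local rather than the global minimum, which is exactly why retraining from several initializations can help):

```python
# Gradient descent on an illustrative nonconvex function with two minima
f      = lambda w: w**4 - 3 * w**2 + w       # the "loss"
grad_f = lambda w: 4 * w**3 - 6 * w + 1      # its gradient

w, lr = 2.0, 0.05                            # starting guess and learning rate
for step in range(200):
    g = grad_f(w)
    if abs(g) < 1e-6:                        # stop when the gradient is (nearly) zero
        break
    w -= lr * g                              # move in the direction of maximum descent
print(w, f(w))                               # ends near the local minimum around w = 1.17
```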

17 of 28

Sources: Introduction to Statistical Learning

Image Source: https://en.wikipedia.org/wiki/Stochastic_gradient_descent

Stochastic Gradient Descent (SGD)

  • SGD accelerates gradient descent to reduce the computational burden of gradient calculations, achieving faster iteration
  • Instead of summing the error and gradient over all N training rows, we can sample a small fraction - a minibatch - of training rows to compute the gradient
  • Note the fluctuations in the total objective function w.r.t. mini-batches, sometimes enabling optimization procedures to bounce out of local minima!
  • Terminology (see the minibatch sketch below):
    • Iteration: one feed-forward and backpropagation pass on one mini-batch
    • Epoch: the iterations that together pass over the entire training data once

[Plot: error function vs. iterations for SGD.]
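A minimal minibatch SGD sketch for a simple squared-error model (the linear model, batch size, and learning rate are illustrative stand-ins for a neural network's training loop):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w, lr, batch_size = np.zeros(5), 0.01, 32

for epoch in range(10):                            # one epoch = a full pass over the data
    order = rng.permutation(len(X))                # reshuffle the rows each epoch
    for start in range(0, len(X), batch_size):     # each minibatch update is one iteration
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)  # gradient on the minibatch only
        w -= lr * grad
print(w)
```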

18 of 28

Sources: Introduction to Statistical Learning; Wikipedia

Regularization: Ridge (L2)

  • MNIST Neural Network
    • 60,000 training rows
    • 784 input units; 2 hidden layers (256 and 128 hidden units in each); 10 output units
    • Total of 235,146 weights - four times the training rows
    • Prone to overfitting
  • Ridge regularization - augment the objective function with a ridge penalty term on the squared sum of weights
    • Lasso is also used as an alternative to ridge
  • Training MNIST data with regularization: 60,000 rows
    • Validation set 12,000 (20%)
    • Training set 48,000
    • Minibatch of 128 rows per gradient update
    • 48,000/128 ~ 375 minibatch gradient updates per epoch
    • Validation error starts to increase by 30 epochs - early stopping can also be used as an additional form of regularization

Sample from MNIST data:

Iteration plots with regularization:
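A sketch of how a ridge (L2) penalty on the squared sum of weights changes the objective and the gradient update (the data, lambda, and learning rate are illustrative; the extra term simply shrinks every weight toward zero at each step):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)
w, lr, lam = np.zeros(10), 0.01, 0.001             # lam controls the ridge penalty strength

def objective(w):
    """Squared error augmented with a ridge penalty on the squared sum of weights."""
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

for step in range(500):
    data_grad = 2 * X.T @ (X @ w - y) / len(X)
    w -= lr * (data_grad + 2 * lam * w)            # the penalty term shrinks the weights
print(objective(w))
```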

19 of 28

Regularization: Dropout

Dropout learning - a relatively new and efficient form of regularization:

  • Nodes or weights are selected at random and temporarily ignored during training (their outputs are set to 0)
  • Done separately for each mini-batch
  • Prevents the nodes from overfitting and can be seen as a form of regularization (see the sketch below)

Fully connected mini-batch

Mini-batch with dropout
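A sketch of dropout applied to one layer's activations (the dropout rate is illustrative; this uses the common "inverted dropout" rescaling so that nothing needs to change at prediction time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Randomly zero out hidden units during training; leave them alone when predicting."""
    if not training:
        return activations                         # no dropout at prediction time
    mask = rng.random(activations.shape) >= rate   # a fresh random mask for each mini-batch
    return activations * mask / (1.0 - rate)       # rescale so the expected output is unchanged

H = rng.normal(size=(4, 5))                        # hidden activations for one mini-batch
print(dropout(H, rate=0.5))
```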

20 of 28

Sources: Introduction to Statistical Learning & Random Search for Hyper-Parameter Optimization

Hyperparameter Tuning

  • Network Tuning
    • Number of hidden layers and units per layer
    • Regularization parameters - specifying dropout rates and lasso or ridge regularization parameters
    • Stochastic gradient parameters

  • Random grid search: recent research highlights the efficiency of randomly selecting hyperparameter values during a search over possible settings
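A sketch of a random search over settings like those listed above (the search space, ranges, and the evaluate() scoring function are hypothetical placeholders for a real training-and-validation run):

```python
import random

random.seed(0)

def evaluate(params):
    """Placeholder: train a network with these settings and return its validation error."""
    return random.random()                          # stand-in for a real training run

search_space = {
    "hidden_layers":   [1, 2, 3],
    "units_per_layer": [32, 64, 128, 256],
    "dropout_rate":    [0.0, 0.25, 0.5],
    "learning_rate":   [0.1, 0.01, 0.001],
}

best = None
for trial in range(20):                             # 20 random draws instead of a full grid
    params = {k: random.choice(v) for k, v in search_space.items()}
    score = evaluate(params)
    if best is None or score < best[0]:
        best = (score, params)
print(best)
```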

21 of 28

Sources: Demystifying Deep Learning, SAS Institute

Unsupervised: Autoencoder

Example rows (y is the target vector; x1 - x3 are the input vectors):

  y    x1     x2     x3
  1    2.54   1.65   0.02
  0    1.14   0.70   0.82
  1    0.99   0.51   2.11
Supervised training - predict target y from input vectors X

Unsupervised training - predict X from X
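A minimal sketch of the "predict X from X" idea, using scikit-learn's MLPRegressor as a stand-in autoencoder on the three input vectors from the table above (the 2-unit hidden layer and other settings are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Input vectors from the table above; the target column y is not used
X = np.array([[2.54, 1.65, 0.02],
              [1.14, 0.70, 0.82],
              [0.99, 0.51, 2.11]])

# Unsupervised training: the network is asked to reproduce its own inputs
auto = MLPRegressor(hidden_layer_sizes=(2,), activation="tanh",
                    max_iter=5000, random_state=0)
auto.fit(X, X)                          # inputs and "targets" are the same matrix
print(auto.predict(X))                  # reconstructions of the three rows
```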

22 of 28

Sources: Demystifying Deep Learning, SAS Institute

Unsupervised Training

Is it trivial to learn X from X?

  • Sometimes it is, and the neural network simply learns to duplicate the training data instead of learning generalizable concepts from the training data
  • To avoid this problem, you can use regularization
  • A denoising autoencoder is a single-hidden-layer unsupervised neural network with regularization

Why is unsupervised training of neural networks useful?

  • Anomaly detection!
  • Feature extraction
  • Improved initialization of layers in deep networks

23 of 28

MNIST Data with an Autoencoder

  • Train a denoising autoencoder
  • Extract and display output of 2-unit middle hidden layer as a scatter plot – notice clustering and overlap between clusters

  • Calculate the reconstruction error (MSE) for each image�
  • Display the “1” and “7” images with the highest reconstruction error (MSE) to locate data points that are unlike the others – anomalies! (see the sketch below)

[Scatter plot of the 2-unit middle hidden layer: Hidden Unit 1 vs. Hidden Unit 2.]
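A sketch of the reconstruction-error step on a stand-in dataset (random data rather than MNIST; the narrow 2-unit middle layer and the rest of the architecture are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                     # stand-in for flattened images

# Autoencoder with a narrow 2-unit middle hidden layer
auto = MLPRegressor(hidden_layer_sizes=(10, 2, 10), activation="tanh",
                    max_iter=2000, random_state=0)
auto.fit(X, X)                                     # unsupervised: reconstruct X from X

recon = auto.predict(X)
mse = np.mean((X - recon) ** 2, axis=1)            # reconstruction error (MSE) per row
anomalies = np.argsort(mse)[-10:]                  # the rows reconstructed worst: candidate anomalies
print(anomalies)
```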

24 of 28

Issues of Neural Network Overview

  • Pros:
    • Few assumptions
    • Accurately model high-degree interactions and extremely nonlinear phenomena
    • Tend to excel at image, sound, and other pattern recognition tasks
  • Cons:
    • Prone to overfit
    • Difficult to interpret

Requires grid search:

  • Number of nodes in input layer?
  • Number of hidden layers and nodes per layer?
  • Initial weights?
  • Learning rate, max. number of epochs, mini-batch size, etc.?

25 of 28

Sources: Demystifying Deep Learning, SAS Institute

Demystifying Deep Learning

VERY SIMPLY PUT - a NEURAL NETWORK with more than one hidden layer for a supervised or unsupervised learning task

26 of 28

Sonar Case Study

Using Python

27 of 28

Source: Machine Learning Algorithms from Scratch

Stochastic Gradient Descent & k-fold Cross Validation Approach

  • Make a Prediction: predict()
    • Predicts an output value for a row given a set of weights
  • Training Neural Network Weights: train_weight()
    • Estimate the weight values for the training data using stochastic gradient descent with the following parameter values:
      • Learning rate - limits the amount each weight is corrected each time it is updated
      • Epochs - the number of times to run through the training data while updating the weights
    • Three loops: loop over each epoch; loop over each row in the training data within an epoch; and loop over each weight, updating it for a row in an iteration
  • Upload the dataset and prepare data
  • Run perceptron()
  • Evaluate model accuracy metrics across k-folds
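A minimal sketch of the two routines named above, in the spirit of the from-scratch approach (the exact code in Machine Learning Algorithms from Scratch differs, and the perceptron() wrapper and k-fold evaluation are not shown; this version assumes each row ends with a binary 0/1 target):

```python
def predict(row, weights):
    """Predict 0 or 1 for one row given the current weights (weights[0] is the bias)."""
    activation = weights[0] + sum(w * x for w, x in zip(weights[1:], row[:-1]))
    return 1.0 if activation >= 0.0 else 0.0

def train_weight(train, learning_rate, n_epochs):
    """Estimate weights with stochastic gradient descent: three nested loops."""
    weights = [0.0] * len(train[0])
    for epoch in range(n_epochs):                  # loop over each epoch
        for row in train:                          # loop over each row in the training data
            error = row[-1] - predict(row, weights)
            weights[0] += learning_rate * error    # bias update
            for i in range(len(row) - 1):          # loop over each weight for this row
                weights[i + 1] += learning_rate * error * row[i]
    return weights

# Tiny illustrative dataset: two inputs plus a 0/1 target in the last column
data = [[2.7, 2.5, 0], [1.4, 2.3, 0], [3.3, 4.4, 0],
        [7.6, 2.7, 1], [8.6, -0.2, 1], [7.7, 3.5, 1]]
w = train_weight(data, learning_rate=0.1, n_epochs=10)
print([predict(row, w) for row in data])           # should recover the 0/1 labels
```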

28 of 28

Reading

  • Elements of Statistical Learning
    • Sections 11.3 - 11.7
  • Introduction to Data Mining
    • Section 4.7