1 of 28

Machine Learning I:

Artificial Neural Networks

Patrick Hall

Visiting Faculty, Department of Decision Sciences

George Washington University

2 of 28

Lecture 5 Agenda

  • Brief Introduction
  • Artificial Neural Networks
    • Algorithm Overview
    • Supervised: MLP
    • Unsupervised: Autoencoder
  • Python Code Example: Sonar Case
  • Readings

3 of 28

Where are we in the modeling lifecycle?

[Modeling lifecycle diagram: Data Collection & ETL, Feature Selection & Engineering, Supervised Learning, Unsupervised Learning, Assessment & Validation, Deployment; the stages range from cost intensive to revenue generating.]

4 of 28

Brief Introduction

Overview

5 of 28

Historical Development

The name “neural network” comes from the technology’s original conception as a model of neurotransmission, where each unit represents a neuron and each connection represents a synapse.

6 of 28

Brief History

Sources: Introduction to Statistical Learning

  • Neural networks were invented in the late 1950s: Rosenblatt’s Perceptron.
  • Neural networks rose to fame in the late 1980s with backpropagation and the multilayer perceptron (MLP).
  • Then came SVMs, boosting, and random forests that outperformed neural networks on structured data tasks. Neural networks fell from favor.
  • Neural networks resurfaced in the mid-00’s with the new name deep learning, and with new architectures and additional features.
    • Domain knowledge and mechanisms in networks (CNNs)
    • World-class success in image and video classification, and speech recognition.
    • Enabled by GPU computing and large labeled datasets.
  • A great deal of hype surrounds neural networks, making them seem magical and mysterious. While some breakthroughs have been impressive, for the most part they remain nonlinear statistical models (sophisticated regression models).

7 of 28

Neural Network Structure

Sources: Introduction to Statistical Learning

A simple feed-forward neural network using 4 predictors and a single hidden layer (5 hidden units) for modeling a numeric response:

  • X1 - X4 Input Layer: Arrows indicate that each of the input units feeds into each of the hidden units
  • A1 - A5 Hidden Layer: Computes activations - nonlinear transformations of linear combinations of the input features
  • Activation Function: tanh, sigmoid, ReLU (not shown in the figure)
  • Y Output Layer: Similar to a linear model that uses the activations in the hidden layer as inputs, producing the output predictions (see the code sketch below)
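A minimal NumPy sketch of the structure described above, assuming a 4-5-1 network with tanh hidden units and random, untrained placeholder weights:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder (untrained) weights for a 4-input, 5-hidden-unit, 1-output network
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)   # hidden layer -> output layer

def forward(X):
    """One feed-forward pass: nonlinear hidden activations, then a linear output."""
    A = np.tanh(X @ W1 + b1)    # A1-A5: nonlinear transforms of linear combinations of X1-X4
    return A @ W2 + b2          # Y: a linear model on the hidden activations

X = rng.normal(size=(3, 4))     # 3 rows with 4 predictors each (X1-X4)
print(forward(X))               # 3 numeric predictions
```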


8 of 28

  • Artificial neural networks are a wide and diverse technology.

  • Network architectures like MLP, autoencoder, LSTM, convolutional neural network, and several others are in wide use today.
  • Match the structure of your network to the structure of your data for best results.

Image Source: https://www.asimovinstitute.org/neural-network-zoo/

9 of 28

Network Structure & Data Structure

Sources: Demystifying Deep Learning, SAS Institute; Explainable Neural Networks based on Additive Index Models

Convolutional neural networks for images:

Specialized, GAM-like architectures for structured data:

10 of 28

Neural Networks

Neural Network Algorithm Overview

Supervised: MLP

Unsupervised: Autoencoder

11 of 28

Sources: Introduction to Statistical Learning & Elements of Statistical Learning

Neural Network Algorithm Overview

  • Neural Network Structure: Hidden Layers and Activation Function
  • Training Neural Networks:
    • Gradient Descent & Backpropagation
    • Stochastic Gradient Descent
    • Regularization & Dropout
    • Hyperparameter Tuning
  • Supervised: MLP
  • Unsupervised: Autoencoder
  • Issues

12 of 28

Sources: Introduction to Data Mining

Hidden Layers: Activations

  • Activations in the hidden layer(s) are generated with pre-specified activation functions, e.g., tanh, sigmoid, and ReLU
  • Hidden units are analogous to neurons in the brain - units with large outputs are firing, while those close to zero are silent
  • Number of layers is associated with “depth”
    • Deeper networks can express a complex hierarchy of feature extraction
    • Every hidden layer ideally represents a level of abstraction where complex features are compositions of simpler features
  • Every node in a hidden layer operates on activations from the preceding layer and transmits activations forward to the nodes of the next layer

13 of 28

Sources: Introduction to Statistical Learning

Image Source: https://www.spiedigitallibrary.org/

Activation Function

  • Not an exhaustive list, but the most commonly used activation functions are:
    • Identity (no transformation)
    • Rectified Linear Activation (ReLU)
    • Sigmoid (logistic)
    • Hyperbolic Tangent (tanh)
  • Activation function:
    • Computes nonlinear transformations of linear combination of inputs
    • Prevents extremely large magnitude values from harming the training process
    • Allows the model to capture (complex) nonlinearities
    • Sigmoid and tanh “saturate,” ReLU does not
    • Not typically observed, “hidden”
    • Output-layer activations for classification are typically softmax

The piecewise-linear ReLU function is popular for its computational efficiency; the graph above has been scaled for ease of comparison.
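As a sketch, the activations listed above (plus the softmax used for classification outputs) can be written in a few lines of NumPy:

```python
import numpy as np

identity = lambda z: z                          # no transformation
relu     = lambda z: np.maximum(0.0, z)         # piecewise linear; does not saturate
sigmoid  = lambda z: 1.0 / (1.0 + np.exp(-z))   # squashes to (0, 1); saturates
tanh     = np.tanh                              # squashes to (-1, 1); saturates

def softmax(z):
    """Typical output activation for classification: probabilities that sum to 1."""
    e = np.exp(z - z.max())                     # subtract the max for numerical stability
    return e / e.sum()

z = np.array([-2.0, 0.0, 2.0])
print(relu(z), sigmoid(z), tanh(z), softmax(z))
```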

14 of 28

Sources: Introduction to Statistical Learning & Demystifying Deep Learning, SAS Institute

Training Neural Networks

[Figure: a small feed-forward network with three inputs (x1, x2, x3), one hidden layer with two hidden units (h1, h2) using the hyperbolic tangent activation function, and a single output unit y. This neural net has one output unit for an interval target – but neural nets can have an arbitrary number of targets.]

Multilayer Perceptron

  • Feed Forward: Estimate the prediction and error, using nonlinear activation functions to transform linear combinations of the input features into derived features
  • Back Propagation: Compute the loss function and its gradient to measure the error across the training data, and update the weights accordingly
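A minimal sketch of one feed-forward and backpropagation step for the small tanh network in the figure (3 inputs, 2 hidden units, 1 interval output, squared-error loss; the random data and the learning rate are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 3))                 # x1, x2, x3
y = rng.normal(size=(10, 1))                 # interval target
W1, b1 = rng.normal(size=(3, 2)) * 0.1, np.zeros(2)
W2, b2 = rng.normal(size=(2, 1)) * 0.1, np.zeros(1)
lr = 0.1                                     # learning rate (illustrative)

# Feed forward: derived features h1, h2, then the prediction and error
H = np.tanh(X @ W1 + b1)
y_hat = H @ W2 + b2
loss = np.mean((y - y_hat) ** 2)

# Backpropagation: gradient of the loss w.r.t. each weight, then a gradient step
d_yhat = 2 * (y_hat - y) / len(X)
dW2, db2 = H.T @ d_yhat, d_yhat.sum(axis=0)
d_H = (d_yhat @ W2.T) * (1 - H ** 2)         # tanh'(z) = 1 - tanh(z)^2
dW1, db1 = X.T @ d_H, d_H.sum(axis=0)
W1, b1, W2, b2 = W1 - lr * dW1, b1 - lr * db1, W2 - lr * dW2, b2 - lr * db2
print(loss)
```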

15 of 28

Sources: Introduction to Statistical Learning

Training Neural Networks

  • Similar to other statistical learning methods
    • Minimize the error function - squared-error, cross-entropy, log-loss, etc.
  • However, minimizing the loss function is not straightforward:
    • Nonconvex optimization
    • Rashomon effect: vast numbers of solutions are possible
  • Two strategies to overcome nonconvexity and overfitting:
    • Iterative gradient descent approaches
    • Regularization: penalties are imposed on the weights (Lasso or Ridge)
  • The gradient is used to train the network and to minimize training errors
    • Theoretically, training stops when the gradient reaches zero - some minimum of the loss function
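For concreteness, the error functions named above look like this (a sketch; the cross-entropy / log-loss version assumes a binary 0/1 target and predicted probabilities):

```python
import numpy as np

def squared_error(y, y_hat):
    """Typical error function for interval (numeric) targets."""
    return np.mean((y - y_hat) ** 2)

def cross_entropy(y, p_hat, eps=1e-12):
    """Typical error function (log-loss) for binary targets; p_hat are predicted probabilities."""
    p_hat = np.clip(p_hat, eps, 1 - eps)       # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

print(squared_error(np.array([1.0, 2.0]), np.array([1.1, 1.8])))
print(cross_entropy(np.array([1, 0, 1]), np.array([0.9, 0.2, 0.6])))
```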

16 of 28

Sources: Introduction to Statistical Learning

Gradient Descent & Back Propagation

  • Highly dependent on initialization, but we generally start from a random guess for weight values
  • Propagate data through network to generate ŷ for each row
  • Back Propagation:
    • Use current weights to define a position on the surface of the loss function
    • Find the gradient of the loss function w.r.t. weights
    • Update the weights in the direction of “maximum descent” (negative gradient) to improve the weights.
  • In the next iteration, use the updated weights to recompute the loss function, repeating until the gradient does not change, validation error increases, or some other stopping condition is met
  • The goal is to end up at a (good) local minimum, which may require retraining from many initializations!

Illustration of gradient descent:

The objective function is not convex - it has two minima. Start with some value (typically chosen randomly), then take steps against the gradient until the procedure cannot go down any further. Here, gradient descent reaches the global minimum in 7 steps.
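A sketch of the same idea on a one-dimensional nonconvex function (the function, start value, and step size are illustrative; from this particular start the loop settles into a local rather than the global minimum, which is exactly why retraining from several initializations can help):

```python
# Gradient descent on an illustrative nonconvex function with two minima
f      = lambda w: w**4 - 3 * w**2 + w       # the "loss"
grad_f = lambda w: 4 * w**3 - 6 * w + 1      # its gradient

w, lr = 2.0, 0.05                            # starting guess and learning rate
for step in range(200):
    g = grad_f(w)
    if abs(g) < 1e-6:                        # stop when the gradient is (nearly) zero
        break
    w -= lr * g                              # move in the direction of maximum descent
print(w, f(w))                               # ends near the local minimum around w = 1.17
```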

17 of 28

Sources: Introduction to Statistical Learning

Image Source: https://en.wikipedia.org/wiki/Stochastic_gradient_descent

Stochastic Gradient Descent (SGD)

  • SGD accelerates gradient descent to reduce the computational burden of gradient calculations, achieving faster iteration
  • Instead of summing the error and gradient over all N training rows, we can sample a small fraction - a minibatch - of training rows to compute the gradient
  • Note the fluctuations in the total objective function w.r.t. mini-batches, sometimes enabling optimization procedures to bounce out of local minima!
  • Terminology (see the minibatch sketch below):
    • Iteration: one feed-forward and backpropagation pass on one mini-batch
    • Epoch: the iterations that together pass over the entire training data once

[Plot: error function vs. iterations for SGD.]
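A minimal minibatch SGD sketch for a simple squared-error model (the linear model, batch size, and learning rate are illustrative stand-ins for a neural network's training loop):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.normal(size=1000)
w, lr, batch_size = np.zeros(5), 0.01, 32

for epoch in range(10):                            # one epoch = a full pass over the data
    order = rng.permutation(len(X))                # reshuffle the rows each epoch
    for start in range(0, len(X), batch_size):     # each minibatch update is one iteration
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)  # gradient on the minibatch only
        w -= lr * grad
print(w)
```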

18 of 28

Sources: Introduction to Statistical Learning; Wikipedia

Regularization: Ridge (L2)

  • MNIST Neural Network
    • 60,000 training rows
    • 784 input units; 2 hidden layers (256 and 128 hidden units in each); 10 output units
    • Total of 235,146 weights - four times the training rows
    • Prone to overfitting
  • Ridge regularization - augment the objective function with a ridge penalty term on the squared sum of weights
    • Lasso is also used as an alternative to ridge
  • Training MNIST data with regularization: 60,000 rows
    • Validation set 12,000 (20%)
    • Training set 48,000
    • Minibatch of 128 rows per gradient update
    • 48,000/128 ~ 375 minibatch gradient updates per epoch
    • Validation error starts to increase by 30 epochs - early stopping can also be used as an additional form of regularization

Sample from MNIST data:

Iteration plots with regularization:
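A sketch of how a ridge (L2) penalty on the squared sum of weights changes the objective and the gradient update (the data, lambda, and learning rate are illustrative; the extra term simply shrinks every weight toward zero at each step):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)
w, lr, lam = np.zeros(10), 0.01, 0.001             # lam controls the ridge penalty strength

def objective(w):
    """Squared error augmented with a ridge penalty on the squared sum of weights."""
    return np.mean((X @ w - y) ** 2) + lam * np.sum(w ** 2)

for step in range(500):
    data_grad = 2 * X.T @ (X @ w - y) / len(X)
    w -= lr * (data_grad + 2 * lam * w)            # the penalty term shrinks the weights
print(objective(w))
```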

19 of 28

Regularization: Dropout

Dropout learning - a relatively new and efficient form of regularization:

  • Nodes or weights are selected at random and temporarily ignored during training (their outputs are set to 0)
  • Done separately for each mini-batch
  • Prevents the nodes from overfitting and can be seen as a form of regularization (see the sketch below)

Fully connected mini-batch

Mini-batch with dropout
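A sketch of dropout applied to one layer's activations (the dropout rate is illustrative; this uses the common "inverted dropout" rescaling so that nothing needs to change at prediction time):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, rate=0.5, training=True):
    """Randomly zero out hidden units during training; leave them alone when predicting."""
    if not training:
        return activations                         # no dropout at prediction time
    mask = rng.random(activations.shape) >= rate   # a fresh random mask for each mini-batch
    return activations * mask / (1.0 - rate)       # rescale so the expected output is unchanged

H = rng.normal(size=(4, 5))                        # hidden activations for one mini-batch
print(dropout(H, rate=0.5))
```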

20 of 28

Sources: Introduction to Statistical Learning & Random Search for Hyper-Parameter Optimization

Hyperparameter Tuning

  • Network Tuning
    • Number of hidden layers and units per layer
    • Regularization parameters - specifying dropout rates and lasso or ridge regularization parameters
    • Stochastic gradient parameters

  • Random grid search: recent research highlights the efficiency of randomly selecting hyperparameter values during a search over possible settings
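A sketch of a random search over settings like those listed above (the search space, ranges, and the evaluate() scoring function are hypothetical placeholders for a real training-and-validation run):

```python
import random

random.seed(0)

def evaluate(params):
    """Placeholder: train a network with these settings and return its validation error."""
    return random.random()                          # stand-in for a real training run

search_space = {
    "hidden_layers":   [1, 2, 3],
    "units_per_layer": [32, 64, 128, 256],
    "dropout_rate":    [0.0, 0.25, 0.5],
    "learning_rate":   [0.1, 0.01, 0.001],
}

best = None
for trial in range(20):                             # 20 random draws instead of a full grid
    params = {k: random.choice(v) for k, v in search_space.items()}
    score = evaluate(params)
    if best is None or score < best[0]:
        best = (score, params)
print(best)
```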

21 of 28

Sources: Demystifying Deep Learning, SAS Institute

Unsupervised: Autoencoder

Example rows (y is the target vector; x1 - x3 are the input vectors):

  y    x1     x2     x3
  1    2.54   1.65   0.02
  0    1.14   0.70   0.82
  1    0.99   0.51   2.11
Supervised training - predict target y from input vectors X

Unsupervised training - predict X from X
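A minimal sketch of the "predict X from X" idea, using scikit-learn's MLPRegressor as a stand-in autoencoder on the three input vectors from the table above (the 2-unit hidden layer and other settings are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Input vectors from the table above; the target column y is not used
X = np.array([[2.54, 1.65, 0.02],
              [1.14, 0.70, 0.82],
              [0.99, 0.51, 2.11]])

# Unsupervised training: the network is asked to reproduce its own inputs
auto = MLPRegressor(hidden_layer_sizes=(2,), activation="tanh",
                    max_iter=5000, random_state=0)
auto.fit(X, X)                          # inputs and "targets" are the same matrix
print(auto.predict(X))                  # reconstructions of the three rows
```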

22 of 28

Sources: Demystifying Deep Learning, SAS Institute

Unsupervised Training

Is it trivial to learn X from X?

  • Sometimes it is, and the neural network simply learns to duplicate the training data instead of learning generalizable concepts from the training data
  • To avoid this problem, you can use regularization
  • A denoising autoencoder is a single-hidden-layer unsupervised neural network with regularization

Why is unsupervised training of neural networks useful?

  • Anomaly detection!
  • Feature extraction
  • Improved initialization of layers in deep networks

23 of 28

MNIST Data with an Autoencoder

  • Train a denoising autoencoder
  • Extract and display output of 2-unit middle hidden layer as a scatter plot – notice clustering and overlap between clusters

  • Calculate the reconstruction error (MSE) for each image�
  • Display the “1” and “7” images with the highest reconstruction error (MSE) to locate data points that are unlike the others – anomalies! (see the sketch below)

[Scatter plot of the 2-unit middle hidden layer: Hidden Unit 1 vs. Hidden Unit 2.]
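A sketch of the reconstruction-error step on a stand-in dataset (random data rather than MNIST; the narrow 2-unit middle layer and the rest of the architecture are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                     # stand-in for flattened images

# Autoencoder with a narrow 2-unit middle hidden layer
auto = MLPRegressor(hidden_layer_sizes=(10, 2, 10), activation="tanh",
                    max_iter=2000, random_state=0)
auto.fit(X, X)                                     # unsupervised: reconstruct X from X

recon = auto.predict(X)
mse = np.mean((X - recon) ** 2, axis=1)            # reconstruction error (MSE) per row
anomalies = np.argsort(mse)[-10:]                  # the rows reconstructed worst: candidate anomalies
print(anomalies)
```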

24 of 28

Issues of Neural Network Overview

  • Pros:
    • Few assumptions
    • Accurately model high-degree interactions and extremely nonlinear phenomena
    • Tend to excel at image, sound, and other pattern recognition tasks
  • Cons:
    • Prone to overfit
    • Difficult to interpret

Requires grid search:

  • Number of nodes in input layer?
  • Number of hidden layers and nodes per layer?
  • Initial weights?
  • Learning rate, max. number of epochs, mini-batch size, etc.?

25 of 28

Sources: Demystifying Deep Learning, SAS Institute

Demystifying Deep Learning

VERY SIMPLY PUT - a NEURAL NETWORK with more than one hidden layer for a supervised or unsupervised learning task

26 of 28

Sonar Case Study

Using Python

27 of 28

Source: Machine Learning Algorithms from Scratch

Stochastic Gradient Descent & k-fold Cross Validation Approach

  • Make a Prediction: predict()
    • Predicts an output value for a row given a set of weights
  • Training Neural Network Weights: train_weight()
    • Estimate the weight values for the training data using stochastic gradient descent with the following parameter values:
      • Learning rate - limits the amount each weight is corrected each time it is updated
      • Epochs - the number of times to run through the training data while updating the weights
    • Three loops: loop over each epoch; loop over each row in the training data within an epoch; and loop over each weight, updating it for a row in an iteration
  • Upload the dataset and prepare data
  • Run perceptron()
  • Evaluate model accuracy metrics across k-folds
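A minimal sketch of the two routines named above, in the spirit of the from-scratch approach (the exact code in Machine Learning Algorithms from Scratch differs, and the perceptron() wrapper and k-fold evaluation are not shown; this version assumes each row ends with a binary 0/1 target):

```python
def predict(row, weights):
    """Predict 0 or 1 for one row given the current weights (weights[0] is the bias)."""
    activation = weights[0] + sum(w * x for w, x in zip(weights[1:], row[:-1]))
    return 1.0 if activation >= 0.0 else 0.0

def train_weight(train, learning_rate, n_epochs):
    """Estimate weights with stochastic gradient descent: three nested loops."""
    weights = [0.0] * len(train[0])
    for epoch in range(n_epochs):                  # loop over each epoch
        for row in train:                          # loop over each row in the training data
            error = row[-1] - predict(row, weights)
            weights[0] += learning_rate * error    # bias update
            for i in range(len(row) - 1):          # loop over each weight for this row
                weights[i + 1] += learning_rate * error * row[i]
    return weights

# Tiny illustrative dataset: two inputs plus a 0/1 target in the last column
data = [[2.7, 2.5, 0], [1.4, 2.3, 0], [3.3, 4.4, 0],
        [7.6, 2.7, 1], [8.6, -0.2, 1], [7.7, 3.5, 1]]
w = train_weight(data, learning_rate=0.1, n_epochs=10)
print([predict(row, w) for row in data])           # should recover the 0/1 labels
```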

28 of 28

Reading

  • Elements of Statistical Learning
    • Sections 11.3 - 11.7
  • Introduction to Data Mining
    • Section 4.7