1 of 24

DLBasic: Components of Deep Learning


Introduction followed by a review of important Deep Learning topics, including Activation Functions (ReLU, Sigmoid, Tanh, Softmax), Loss Function, Optimizer, Stochastic Gradient Descent, Back Propagation, One-hot Vector, Vanishing Gradient, Hyperparameter, Label Smoothing

Digital Science Center


2 of 24

DLBasic: Examples, Components and Types

  • Our operational discussion of Deep Learning has 3 parts, of which this is the second

1) Examples of use

2) Discussion of smaller building blocks or network components

3) Types of Network such as fully connected, convolutional or recurrent. This is just a short summary


3 of 24

Deep Learning Terms often Used


4 of 24

Activation Functions


ANN artificial Neuron (Node)

Nature’s Neuron


5 of 24


6 of 24

What is an Activation Function?

  • From https://deepai.org/machine-learning-glossary-and-terms/activation-function
  • An activation function determines the output behavior of each node, or “neuron” in an artificial neural network.
  • Activation functions are crucial basic components of artificial neural networks (ANN), since they introduce non-linear characteristics to the network. This allows the ANN to learn complicated, non-linear mappings between inputs and response variables.
  • Without these nonlinear activation functions, nodal activation would be limited to a linear combination of the inputs, which would exponentially increase the processing power and time needed to solve problems and severely limit the types of relationships between data set features that could be discovered.
  • How do Activation Functions Work?
  • While there are countless variations of these functions, most network frameworks begin by computing the weighted sum of the inputs. Each node in the layer can have its own unique weighting. However, the activation function is the same across all nodes in the layer. While the weights are learned parameters, the activation functions are typically of a fixed form.
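The weighted-sum-then-activation step described above can be sketched as follows. This is a minimal illustration, not a framework implementation: the function names, the bias term, and the numeric values are assumptions made for the example.

```python
import math

def neuron_output(inputs, weights, bias, activation):
    """Weighted sum of the inputs followed by a fixed activation function."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def layer_output(inputs, weight_rows, biases, activation):
    """Each node has its own weights; the activation is shared by the layer."""
    return [neuron_output(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]

# Hypothetical values chosen only for illustration.
x = [1.0, 2.0]
W = [[0.5, -1.0], [1.0, 1.0]]   # one row of weights per node
b = [0.0, -1.0]                 # bias per node (an assumption of this sketch)
out = layer_output(x, W, b, math.tanh)
```

Here the weights and biases are what training would adjust, while `math.tanh` stays fixed for the whole layer.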


7 of 24

ReLU Rectified Linear Unit

  • The Rectified Linear Unit has become very popular recently (as of 2016). It computes the function:
  • f(x)=max(0,x)
  • Derivative discontinuous at the origin
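A minimal sketch of ReLU and its derivative follows. The value returned for the derivative exactly at x = 0 is a convention assumed here, since the derivative is not defined at the origin.

```python
def relu(x):
    """f(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    # Derivative is 0 for x < 0 and 1 for x > 0. At x = 0 it is undefined;
    # returning 0 there is a convention assumed for this sketch.
    return 1.0 if x > 0 else 0.0
```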


8 of 24

The Purpose of ReLU

  • From https://deepai.org/machine-learning-glossary-and-terms/relu
  • Traditionally, prevalent non-linear activation functions such as the sigmoid (or logistic) function and the hyperbolic tangent have been used in neural networks to get the activation values corresponding to each neuron. Recently, the ReLU function has been used instead to calculate the activation values in traditional neural network or deep neural network paradigms. The reasons for replacing sigmoid and hyperbolic tangent with ReLU are:
  • Computation saving: the ReLU function can accelerate the training of deep neural networks compared to traditional activation functions, since the derivative of ReLU is 1 for a positive input. Because this derivative is a constant, deep neural networks do not need to spend additional time computing error terms during the training phase.
  • Solving the vanishing gradient problem: the ReLU function does not trigger the vanishing gradient problem when the number of layers grows. This is because the function has no asymptotic upper bound, so its derivative stays at 1 for positive inputs. Thus, the earliest layer (the first hidden layer) is able to receive the errors coming from the last layers to adjust all weights between layers. By contrast, a traditional activation function like sigmoid is restricted between 0 and 1 and its derivative is at most 1/4, so the errors become small for the first hidden layer. This scenario will lead to a poorly trained neural network.


9 of 24

Sigmoid Activation Function

  • Also called Logistic function
  • The sigmoid nonlinearity has the mathematical form below, where x is simply the input value:
  • σ(x) = 1/(1 + exp(-x))
  • Note that the value and derivative are very smooth, but the derivative is very small at very large positive or negative x values


  • The name Sigmoidal comes from the Greek letter Sigma; when graphed, the function appears as a sloping “S” across the Y-axis.
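The sigmoid and its derivative can be sketched as below. The branch on the sign of x is an implementation choice assumed here to avoid overflow in exp(); the derivative uses the standard identity σ'(x) = σ(x)(1 − σ(x)).

```python
import math

def sigmoid(x):
    # Numerically stable form: never call exp() on a large positive number.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); largest (0.25) at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)
```

Evaluating `sigmoid_grad` at a large |x| (for example 10) gives a value far below its maximum of 0.25, which is the small-derivative behaviour noted on the slide.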


10 of 24

Tanh Activation Function

  • The Tanh nonlinearity has the mathematical form below, where x is simply the input value:
  • tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
  • Note that, as with sigmoid, the value and derivative are very smooth, but the derivative is very small at large positive or negative x values
  • In fact tanh(x)=2σ(2x)−1
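The relation between tanh and the sigmoid can be checked numerically. The sketch below compares 2σ(2x) − 1 against the standard library `math.tanh` at a few sample points chosen arbitrarily for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # tanh(x) = 2 * sigma(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    # tanh'(x) = 1 - tanh(x)^2, which tends to 0 for large |x|
    return 1.0 - math.tanh(x) ** 2

sample_points = [-3.0, -0.5, 0.0, 0.5, 3.0]
max_err = max(abs(tanh_via_sigmoid(x) - math.tanh(x)) for x in sample_points)
```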


11 of 24

Softmax Activation Layer

  • Multiple (N) output cells labeled by i, each with its own f_i
  • f_i(x) are functions of the vector x with components x_i:
  • $f_i(x) = \exp(x_i) / \sum_{j=1}^{N} \exp(x_j)$
  • Softmax can generate probabilities of getting one of N values i
  • Note $\sum_{i=1}^{N} f_i(x) = 1$
  • See https://deepai.org/machine-learning-glossary-and-terms/softmax-layer
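A minimal softmax sketch follows. It uses the standard convention exp(x_i) in the numerator, and subtracting the maximum before exponentiating is a common numerical-stability step assumed here; it does not change the result.

```python
import math

def softmax(x):
    m = max(x)                            # shift for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]      # components sum to 1

probs = softmax([1.0, 2.0, 3.0])          # hypothetical scores for N = 3 outputs
```

The largest input receives the largest probability, and the outputs can be read as probabilities over the N values.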


12 of 24

Scaled Exponential Linear Unit. SELU

  • Activation Functions Explained - GELU, SELU, ELU, ReLU and more
  • This activation function is one of the newer ones, and it comes with a particularly long appendix (90 pages) of theorems, proofs, etc. When using this activation function in practice, one must use lecun_normal for weight initialization, and if dropout is to be applied, one should use AlphaDropout.
  • The authors have calculated two values for the equation, an alpha α and a lambda λ:
  • α ≈ 1.6732632423543772848170429916717
  • λ ≈ 1.0507009873554804934193349852946
  • They have that many digits after the decimal point for absolute precision, and they are predetermined, which means we do not have to worry about picking the right alpha value for this activation function.
  • To be honest, the equation just looks like the other equations, which it more or less is. All the newer activation functions just look like a combination of the other existing activation functions.
  • The equation for SELU looks like this:
  • SELU(x) = λ x if x > 0 or
  • SELU(x) = λ α (exp(x) - 1) if x <= 0, approaching -λ α as x goes to -infinity
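The piecewise definition above translates directly into code. In this sketch the constants are truncated to double precision, and the names `ALPHA` and `LAMBDA` are chosen for the example.

```python
import math

ALPHA = 1.6732632423543772    # alpha, truncated to double precision
LAMBDA = 1.0507009873554805   # lambda, truncated to double precision

def selu(x):
    if x > 0:
        return LAMBDA * x
    # For x <= 0 the output saturates towards -LAMBDA * ALPHA.
    return LAMBDA * ALPHA * (math.exp(x) - 1.0)
```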


13 of 24

Loss Functions I

  • See https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/
  • Everything we discuss solves an optimization problem: the DNN proposes a set of quantities that specify the system state, and the loss function is minimized as a function of this state.
  • “Stochastic Gradient Descent” is the most common method to find the state that minimizes the loss function
  • Local minima (false minima) and overfitting are dangers to be avoided using “regularization methods”.
  • Maximum likelihood estimation, or MLE, is a framework for inference for finding the best statistical estimates of parameters from historical training data: exactly what we are trying to do with the neural network.
    • Typically one uses the negative of the log of the likelihood (negative to make it a minimization, not a maximization)


14 of 24

Loss Functions II

  • See https://deepai.org/machine-learning-glossary-and-terms/loss-function
  • Common Loss Functions: There are multiple ways to determine loss. Two of the most popular loss functions in machine learning are the 0-1 loss function and the quadratic loss function.
    • The 0-1 loss function is an indicator function that returns 1 when the target and output are not equal and zero otherwise:
    • IndicatorLoss(y,ŷ) = 0 if y=ŷ or 1 if y≠ŷ
    • The quadratic loss is a commonly used symmetric loss function. Its symmetry comes from its output being identical for outputs that differ from the target by some value x in either direction (i.e. if the output overshoots by 1, that is the same as undershooting by 1). The quadratic loss (least squares) is of the following form:
    • QuadraticLoss(y,ŷ) = C (y - ŷ)^2
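The two losses can be written in a few lines. This is a minimal sketch; the default value of the constant `c` is an assumption for illustration.

```python
def zero_one_loss(y, y_hat):
    """Indicator loss: 0 if the output equals the target, 1 otherwise."""
    return 0 if y == y_hat else 1

def quadratic_loss(y, y_hat, c=1.0):
    """C * (y - y_hat)^2; overshooting and undershooting cost the same."""
    return c * (y - y_hat) ** 2
```

For example, `quadratic_loss(3.0, 4.0)` and `quadratic_loss(3.0, 2.0)` give the same value, which is the symmetry described above.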


15 of 24

Loss Functions III

  • In most cases, our parametric model defines a distribution and we use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model’s predictions as the cost function.
  • Mean Squared Error (MSE) is also common: as the likelihood is often a Gaussian (the exponential of a negative squared error), the negative log of the likelihood is often identical to MSE up to constant factors
  • The complexity of CNN/RNN leads to complex formulae for the loss function and its derivatives with respect to the parameters (weights, biases), leading to ideas like back propagation and steps to avoid pathological (small) derivatives
  • Typically the loss function is $(1/D) \sum_{j=1}^{D} \mathrm{Loss}(j)$, where Loss(j) is the contribution of the j'th point in the training dataset
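The link between a Gaussian likelihood and MSE can be worked out in a few lines. The derivation below is a sketch under the assumption that each target y_j equals the model prediction \hat{y}_j plus independent Gaussian noise of fixed variance v; the symbols \hat{y}_j and v are introduced here for illustration.

```latex
% Assumption: y_j = \hat{y}_j + Gaussian noise with fixed variance v.
\begin{align*}
p(y_j \mid \hat{y}_j) &= \frac{1}{\sqrt{2\pi v}}
    \exp\!\left(-\frac{(y_j - \hat{y}_j)^2}{2v}\right) \\
-\log \prod_{j=1}^{D} p(y_j \mid \hat{y}_j)
  &= \frac{1}{2v} \sum_{j=1}^{D} (y_j - \hat{y}_j)^2 + \frac{D}{2}\log(2\pi v) \\
  &= \frac{D}{2v}\,\mathrm{MSE} + \text{constant},
  \qquad \mathrm{MSE} = \frac{1}{D}\sum_{j=1}^{D} (y_j - \hat{y}_j)^2
\end{align*}
```

Since D and v do not depend on the network parameters, minimizing the negative log-likelihood under this assumption gives the same parameters as minimizing MSE.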


16 of 24

Optimization and Stochastic Gradient Descent SGD I

  • SGD has become, through the Deep Learning community, a major approach to optimization. It combines two ideas
  • The first idea, steepest descent, is very old. If I have a loss function L to be minimized as a function of parameters $\alpha_1, \alpha_2, \ldots, \alpha_P$, then L will decrease for a small step size η in the direction $-\partial L/\partial\alpha_p$ for p = 1...P.
    • This is the direction with the largest decrease for a given value of η and so this is the direction of steepest descent.
  • The second idea is to note that $L = (1/D) \sum_{j=1}^{D} \mathrm{Loss}(j)$
  • Basic steepest descent is $\alpha_p \to \alpha_p - \eta\, \partial L/\partial\alpha_p$
  • This now gets modified in several subtle ways
  • The stochastic in SGD is to replace L by $(1/B) \sum_{k=1}^{B} \mathrm{Loss}(j_k)$ for a random choice of B values $j_k$ (a stochastic batch)
  • One runs over all points j eventually (called an epoch) but with many approximate shifts, not one big shift
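The batch update described above can be sketched on a toy problem. This is an illustration under stated assumptions, not a production optimizer: the model (a straight line y ≈ a·x + b), the squared-error Loss(j), the data, and the values of η, B and the number of epochs are all hypothetical.

```python
import random

def sgd_linear_fit(xs, ys, eta=0.05, batch_size=4, epochs=500, seed=0):
    """Fit y ~ a*x + b by minibatch SGD on the mean squared loss."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):                 # one epoch = one pass over all D points
        rng.shuffle(indices)                # new random batches each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of (1/B) * sum_k (a*x + b - y)^2 over the batch only.
            grad_a = sum(2 * (a * xs[j] + b - ys[j]) * xs[j] for j in batch) / len(batch)
            grad_b = sum(2 * (a * xs[j] + b - ys[j]) for j in batch) / len(batch)
            a -= eta * grad_a               # parameter -> parameter - eta * gradient
            b -= eta * grad_b
    return a, b

xs = [i / 10 for i in range(20)]            # hypothetical inputs 0.0 .. 1.9
ys = [2.0 * x + 1.0 for x in xs]            # noiseless targets from a = 2, b = 1
a, b = sgd_linear_fit(xs, ys)
```

Each pass makes D/B small approximate shifts rather than one shift computed from the full loss.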


17 of 24

Optimization and Stochastic Gradient Descent SGD II

  • An epoch is a run over all D values of j; typically 10-100 epochs needed to converge iterative optimization.
  • Batch sizes B could be 128 to 1024
  • Note $(1/B) \sum_{k=1}^{B} \mathrm{Loss}(j_k)$ is, by the central limit theorem, an estimate of the full L
  • The learning rate η needs to be chosen and experimented with and will typically change as optimization progresses
    • In AlexNet 2012, the learning rate is 0.01, and is divided by 10 once the accuracy plateaus. The learning rate is decreased 3 times during the training process.
  • Momentum m is another important idea
  • We proposed $\alpha_p \to \alpha_p + \Delta\alpha_p$ with $\Delta\alpha_p = -\eta\, \partial L/\partial\alpha_p$
  • We change the definition of $\Delta\alpha_p$ to $\Delta\alpha_p = m\, \Delta\alpha_p(\text{previous value}) - \eta\, \partial L/\partial\alpha_p$ (factor 1-m absorbed in η)
  • This keeps most of each shift in the direction that was successful previously
  • AlexNet used momentum m = 0.9.
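The momentum update can be sketched for a single parameter. The quadratic loss, the starting point, the learning rate and the step count are hypothetical choices for illustration; m = 0.9 follows the AlexNet value quoted above.

```python
def momentum_descent(grad, alpha0, eta=0.1, m=0.9, steps=300):
    """Gradient descent where each shift reuses a fraction m of the previous shift."""
    alpha = alpha0
    delta = 0.0                                  # previous shift, initially zero
    for _ in range(steps):
        delta = m * delta - eta * grad(alpha)    # new shift from old shift and gradient
        alpha = alpha + delta
    return alpha

# Hypothetical one-parameter loss L = (alpha - 3)^2 with gradient 2 * (alpha - 3).
alpha_min = momentum_descent(lambda a: 2.0 * (a - 3.0), alpha0=0.0)
```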


18 of 24

Optimization and Stochastic Gradient Descent SGD III

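  • AlexNet uses weight decay. Modify $\Delta\alpha_p = -\eta\, \partial L/\partial\alpha_p$
  • to $\Delta\alpha_p = -\mathrm{wd}\, \eta\, \alpha_p - \eta\, \partial L/\partial\alpha_p$ where weight decay wd = 0.0005
  • There is a set of methods that vary the shifts on a per-parameter basis
  • They aim to reduce shifts in parameters that are changing rapidly and increase those that are changing slowly
  • These methods are AdaGrad, RMSProp and Adam
  • AdaGrad (adaptive gradient algorithm) illustrates this by replacing $\Delta\alpha_p = -\eta\, \partial L/\partial\alpha_p$ by
  • $\Delta\alpha_p = -\eta\, (1/\mathrm{Norm}_p)\, \partial L/\partial\alpha_p$
  • where $\mathrm{Norm}_p$ is calculated as $\sqrt{\sum_{t=1}^{T} (\partial L/\partial\alpha_p)^2(\text{step } t)}$, the root of the squared gradients accumulated over steps t = 1...T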
  • See https://en.wikipedia.org/wiki/Stochastic_gradient_descent
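A per-parameter AdaGrad update can be sketched as follows, using the accumulated squared gradients as the normalization. The small `eps` added to the norm, the toy loss, and the step settings are assumptions made for this example.

```python
import math

def adagrad(grad, alpha0, eta=0.5, steps=100, eps=1e-8):
    """AdaGrad: divide each parameter's shift by the root of its summed squared gradients."""
    alpha = list(alpha0)
    sum_sq = [0.0] * len(alpha)                  # one accumulator per parameter
    for _ in range(steps):
        g = grad(alpha)
        for p in range(len(alpha)):
            sum_sq[p] += g[p] ** 2
            norm_p = math.sqrt(sum_sq[p]) + eps  # eps avoids division by zero (assumed)
            alpha[p] -= eta * g[p] / norm_p
    return alpha

# Hypothetical loss L = 100 * a0^2 + a1^2: very different gradient scales per parameter.
def toy_grad(a):
    return [200.0 * a[0], 2.0 * a[1]]

result = adagrad(toy_grad, [1.0, 1.0])
```

Because each shift is divided by that parameter's own gradient history, the steep direction a0 and the shallow direction a1 move at comparable rates despite the factor of 100 in their gradients.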


19 of 24

Back Propagation I

  • In SGD we saw the need to differentiate the loss function with respect to the unknown weights
  • This can be quite confusing in a multilayer neural net
  • Chain Rule: $\partial_x f(g(x)) = \partial_g f(g)\, \partial_x g(x)$, where x runs over weights, biases, etc.
  • We apply this to layered networks


Output = f(signals from hidden layer 3)

Signals in hidden layer 3 = g(signals from hidden layer 2)


20 of 24

Back Propagation II

  • We have functions f1,f2,f3,f4 corresponding to activation functions at hidden layer 1, 2, 3 and output respectively
  • Output = f4(f3(f2(f1(input)))) where it is the functional forms that depend on weights, biases, etc.
  • So we let g = f3(f2(f1(input))) and f() = f4() and apply the chain rule recursively


Output = f(signals from hidden layer 3)

Signals in hidden layer 3 = g(signals from hidden layer 2)

So recursion starts at the output (back) and proceeds one layer at a time to the input
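The recursion can be sketched on a tiny network with one hidden signal and one output. Everything specific here is an assumption for illustration: scalar weights `w1` and `w2`, tanh at both layers, and a squared-error loss against a target.

```python
import math

def forward(x, w1, w2):
    h = math.tanh(w1 * x)      # hidden signal g(input)
    y = math.tanh(w2 * h)      # output f(g(input))
    return h, y

def loss(x, target, w1, w2):
    return (forward(x, w1, w2)[1] - target) ** 2

def backward(x, target, w1, w2):
    """Gradients of the loss with respect to w1 and w2 via the chain rule."""
    h, y = forward(x, w1, w2)
    dL_dy = 2.0 * (y - target)           # start at the output
    dy_dz2 = 1.0 - y * y                 # tanh'(z) = 1 - tanh(z)^2
    dL_dw2 = dL_dy * dy_dz2 * h
    dL_dh = dL_dy * dy_dz2 * w2          # pass the error back one layer
    dh_dz1 = 1.0 - h * h
    dL_dw1 = dL_dh * dh_dz1 * x
    return dL_dw1, dL_dw2

grad_w1, grad_w2 = backward(0.7, 0.3, 0.5, -1.2)   # hypothetical values
```

The quantity `dL_dh` computed for the output layer is reused for the layer before it, which is the backwards, layer-by-layer recursion described above. A finite-difference estimate of the same derivatives is a simple way to check such code.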


21 of 24

One-Hot Vector

  • Suppose we have a classification problem with four possible outcomes horse, cat, bus, frog
  • We could represent these as 1 2 3 or 4 but often more powerful to view as separate dimensions
    • horse [1, 0, 0, 0]
    • cat [0, 1, 0, 0]
    • bus [0, 0, 1, 0]
    • frog [0, 0, 0, 1]
  • Binary representations where only one digit is 1
  • This is a one-hot representation
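A one-hot encoding of the four classes above can be sketched as follows; the helper name `one_hot` is chosen for this example.

```python
def one_hot(label, classes):
    """Return a vector with a 1 in the position of `label` and 0 elsewhere."""
    return [1 if c == label else 0 for c in classes]

classes = ["horse", "cat", "bus", "frog"]
bus_vector = one_hot("bus", classes)   # [0, 0, 1, 0]
```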


22 of 24

Vanishing Gradient Problem

  • https://deepai.org/machine-learning-glossary-and-terms/vanishing-gradient-problem
  • What is the vanishing gradient problem? The vanishing gradient problem is an issue that sometimes arises when training machine learning algorithms through gradient descent. This most often occurs in neural networks that have several neuronal layers, such as in a deep learning system, but also occurs in recurrent neural networks. The key point is that the calculated partial derivatives used to compute the gradient become smaller and smaller as one goes deeper into the network. Since the gradients control how much the network learns during training, if the gradients are very small or zero, then little to no training can take place, leading to poor predictive performance.
  • Proposed solutions
    • Multi-level hierarchy: This technique pretrains one layer at a time, and then performs backpropagation for fine tuning.
    • Residual networks: The technique introduces bypass connections that feed a given layer from layers further back than the immediately preceding one. This allows gradients to propagate faster to deep layers before they can be attenuated to small or zero values.
    • Rectified linear units (ReLUs): When using rectified linear units, the typical sigmoidal activation functions used for node output are replaced with a new function: f(x) = max(0, x). This activation only saturates in one direction and thus is more resilient to the vanishing of gradients.
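The attenuation can be illustrated by multiplying the activation derivatives along one path through many layers. This sketch makes simplifying assumptions: all weights are taken as 1, and every layer sees the same hypothetical positive pre-activation value.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(z):
    return sigmoid(z) * (1.0 - sigmoid(z))   # never larger than 0.25

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def chained_gradient(activation_grad, pre_activations):
    """Product of per-layer derivative factors along one path (weights taken as 1)."""
    product = 1.0
    for z in pre_activations:
        product *= activation_grad(z)
    return product

zs = [0.5] * 20                              # hypothetical pre-activations, 20 layers
sigmoid_path = chained_gradient(sigmoid_grad, zs)
relu_path = chained_gradient(relu_grad, zs)
```

With sigmoid the product is bounded by 0.25 to the power of the number of layers, so it shrinks rapidly with depth, while the ReLU factors stay at 1 for positive inputs.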


23 of 24

DeepAI on Hyperparameters

  • A hyperparameter is a parameter that is set before the learning process begins. These parameters are tunable and can directly affect how well a model trains.
  • Some examples of hyperparameters in machine learning:
    • Batch Size
    • Number of Hidden layers
    • Learning Rate
    • Number of Epochs
    • Momentum
    • Regularization method
    • Number of branches in a decision tree
    • Number of clusters in a clustering algorithm (like k-means)
  • Genetic algorithms are quite popular for finding the best hyperparameters
    • This implies you run many copies of the neural network to explore the parameter space


24 of 24

Label Smoothing


From Paper: LS is Label Smoothing

ResNet training for classifying 3 image categories: “beaver, dolphin and otter”…note the tremendous difference in cluster tightness.
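For orientation, a common formulation of label smoothing replaces a one-hot target by a mixture of that target and a uniform distribution over the K classes. The sketch below assumes this formulation and a hypothetical smoothing value epsilon = 0.1; the paper behind the figure may use different settings.

```python
def smooth_labels(one_hot, epsilon=0.1):
    """Mix a one-hot target with the uniform distribution over its K classes."""
    k = len(one_hot)
    return [(1.0 - epsilon) * v + epsilon / k for v in one_hot]

# Three classes, as in the beaver / dolphin / otter example; the true class is the second.
smoothed = smooth_labels([0, 1, 0])
```

The smoothed target still sums to 1, but the true class no longer carries the full weight.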
