1 of 24

DLBasic: Components of Deep Learning

Introduction followed by a review of important Deep Learning topics, including Activation (ReLU, Sigmoid, Tanh, Softmax), Loss Function, Optimizer, Stochastic Gradient Descent, Back Propagation, One-hot Vector, Vanishing Gradient, Hyperparameter and Label Smoothing

Digital Science Center

2 of 24

DLBasic: Examples, Components and Types

Our operational discussion of Deep Learning has 3 parts, of which this is the second

1) Examples of use

2) Discussion of smaller building blocks or network components

3) Types of Network such as fully connected, convolutional or recurrent. This is just a short summary


3 of 24

Deep Learning Terms often Used

Activation

Linear (no activation layer)
ReLU (ReLu Definition)
Sigmoid (Sigmoid Function Definition)
Tanh (Activation function)
Softmax (Softmax Function Definition)

Loss Function (Loss Function Definition | DeepAI)
Optimizer (Adam Definition, RMSProp Definition, Adaptive Subgradient Methods (AdaGrad) Definition, Moment Definition )
Stochastic Gradient Descent (Stochastic Gradient Descent Definition | DeepAI)
Back Propagation (Backpropagation Definition, Error Backpropagation Learning Algorithm Definition, Exploding Gradient Problem Definition | DeepAI)
One-hot Vector (One Hot Encoding Definition)
Vanishing Gradient (Vanishing Gradient Problem Definition)
Hyperparameter (Hyperparameter Definition | DeepAI)
Label Smoothing (Label Smoothing & Deep Learning: Google Brain explains why it works and when to use (SOTA tips))
Dropout and Max Pooling could be described here but are covered in the following presentation on “Network Types”


4 of 24

Activation Functions

https://ethen8181.github.io/machine-learning/deep_learning/nn_tensorflow.html
https://en.wikipedia.org/wiki/Activation_function
https://www.analyticsvidhya.com/blog/2017/10/fundamentals-deep-learning-activation-functions-when-to-use-them/
https://towardsdatascience.com/comparison-of-activation-functions-for-deep-neural-networks-706ac4284c8a
Each activation choice corresponds to a different choice of f
We learn the values of w_i and b by training the network
Derivatives are important because training calculates the derivative of the loss function; zero or very small derivatives can be a problem as values are then slow to change
The simplest activation function is the linear activation f(x) = ax
or the identity activation f(x) = x
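
As a minimal sketch of the artificial neuron described above, the following assumes the usual form output = f(∑ w_i x_i + b). The function names and the example numbers are illustrative assumptions, not taken from the slides.

```python
import math

def neuron(inputs, weights, bias, activation):
    """Return f(sum_i w_i * x_i + b) for a single artificial neuron."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

def identity(z):
    # Identity activation f(x) = x
    return z

def linear(z, a=2.0):
    # Linear activation f(x) = a x (the value of a is an arbitrary example)
    return a * z

# Illustrative values only
x = [1.0, 2.0, 3.0]
w = [0.5, -1.0, 0.25]
b = 0.1

out_identity = neuron(x, w, b, identity)   # the weighted sum itself
out_linear = neuron(x, w, b, linear)       # weighted sum scaled by a
out_tanh = neuron(x, w, b, math.tanh)      # a non-linear choice of f
```

Swapping the `activation` argument is the only change needed to move between activation choices; the weights and bias are what training would adjust.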

ANN artificial Neuron (Node)

Nature’s Neuron


5 of 24

Activation Functions Explained - GELU, SELU, ELU, ReLU and more


6 of 24

What is an Activation Function?

From https://deepai.org/machine-learning-glossary-and-terms/activation-function
An activation function determines the output behavior of each node, or “neuron” in an artificial neural network.
Activation functions are crucial basic components of artificial neural networks (ANN), since they introduce non-linear characteristics to the network. This allows the ANN to learn complicated, non-linear mappings between inputs and response variables.
Without these nonlinear activation functions, then nodal activation would be limited to a linear combination of the inputs, which would exponentially increase the processing power and time needed to solve problems and severely limit the types of relationships between data set features that could be discovered.
How do Activation Functions Work?
While there are countless variations of these functions, most network frameworks begin by computing the weighted sum of the inputs. Each node in the layer can have its own unique weighting. However, the activation function is the same across all nodes in the layer. While weights are learned parameters, the activation functions are typically of a fixed form.


7 of 24

ReLU Rectified Linear Unit

The Rectified Linear Unit has become very popular recently (as of 2016). It computes the function:
f(x)=max(0,x)
Derivative discontinuous at the origin
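
A small sketch of ReLU and its derivative follows. The derivative is 0 for negative inputs and 1 for positive inputs; the value returned at exactly x = 0 is a convention chosen here for illustration (an assumption), since the derivative is not defined at the origin.

```python
def relu(x):
    """ReLU activation f(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for x > 0, 0 for x < 0.

    At x == 0 the derivative jumps; returning 0 there is an assumed convention.
    """
    return 1.0 if x > 0 else 0.0

values = [relu(v) for v in (-2.0, 0.0, 3.0)]
slopes = [relu_grad(v) for v in (-2.0, 0.0, 3.0)]
```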


8 of 24

The Purpose of ReLU

From https://deepai.org/machine-learning-glossary-and-terms/relu
Traditionally, some prevalent non-linear activation functions, like sigmoid functions (or logistic) and hyperbolic tangent, are used in neural networks to get activation values corresponding to each neuron. Recently, the ReLu function has been used instead to calculate the activation values in traditional neural network or deep neural network paradigms. The reasons of replacing sigmoid and hyperbolic tangent with ReLu consist of:
Computation saving - the ReLu function is able to accelerate the training speed of deep neural networks compared to traditional activation functions since the derivative of ReLu is 1 for a positive input. Due to a constant, deep neural networks do not need to take additional time for computing error terms during training phase.
Solving the vanishing gradient problem: the ReLu function does not trigger the vanishing gradient problem when the number of layers grows. This is because this function does not have an asymptotic upper and lower bound. Thus, the earliest layer (the first hidden layer) is able to receive the errors coming from the last layers to adjust all weights between layers. By contrast, a traditional activation function like sigmoid is restricted between 0 and 1, so the errors become small for the first hidden layer. This scenario will lead to a poorly trained neural network.


9 of 24

Sigmoid Activation Function

Also called Logistic function
The sigmoid nonlinearity has the mathematical form, where x is simply the input value:
σ(x)=1/(1+ exp(-x))
Note the value and derivative are very smooth, but the derivative is very small at very large positive or negative x values

The name Sigmoidal comes from the Greek letter Sigma, and the function, when graphed, appears as a sloping “S” across the Y-axis.
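
The sketch below evaluates the sigmoid and its derivative σ'(x) = σ(x)(1 - σ(x)), to illustrate how small the derivative becomes at large |x|. The branch on the sign of x is a numerical-stability choice made for this example and is not part of the definition.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + exp(-x)), written to avoid overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)          # safe for large negative x
    return e / (1.0 + e)

def sigmoid_grad(x):
    """Derivative of the sigmoid: sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

grad_at_zero = sigmoid_grad(0.0)     # largest slope, at the origin
grad_far_out = sigmoid_grad(10.0)    # nearly flat for large x
```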


10 of 24

Tanh Activation Function

The Tanh nonlinearity has the mathematical form, where x is simply the input value:
tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
Note that, as with sigmoid, the value and derivative are very smooth, but the derivative is very small at large positive or negative x values
In fact tanh(x)=2σ(2x)−1
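
The relation tanh(x) = 2σ(2x) − 1 can be checked numerically with a short sketch. The helper names are illustrative, and the derivative formula 1 − tanh²(x) is the standard one.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    """tanh expressed through the sigmoid: 2 * sigma(2x) - 1."""
    return 2.0 * sigmoid(2.0 * x) - 1.0

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2

# Largest difference between the two forms over a few sample points
max_diff = max(abs(tanh_via_sigmoid(x) - math.tanh(x))
               for x in (-3.0, -0.5, 0.0, 0.5, 3.0))
```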


11 of 24

Softmax Activation Layer

Multiple (N) output cells labelled by i, each with its own f_i
f_i(x) are functions of the vector x with components x_i
f_i(x) = exp(x_i) / ∑_{j=1}^{N} exp(x_j)
Softmax can generate probabilities of getting one of N values i
Note ∑_{i=1}^{N} f_i(x) = 1
See https://deepai.org/machine-learning-glossary-and-terms/softmax-layer
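
A minimal softmax sketch follows, using the standard convention f_i(x) = exp(x_i) / ∑_j exp(x_j). Subtracting the maximum before exponentiating is a common numerical-stability step assumed here; it does not change the result.

```python
import math

def softmax(x):
    """Map a list of N scores to N probabilities that sum to 1."""
    m = max(x)                                # shift for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])              # illustrative scores
```

The largest score receives the largest probability, and the shift by the maximum keeps very large scores from overflowing.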


12 of 24

Scaled Exponential Linear Unit. SELU

Activation Functions Explained - GELU, SELU, ELU, ReLU and more
This activation function is one of the newer ones, and its paper comes with a particularly long appendix (90 pages) of theorems, proofs etc. When using this activation function in practice, one must use lecun_normal for weight initialization, and if dropout is to be applied, one should use AlphaDropout.
The authors have calculated two values for the equation, an alpha α and a lambda λ:
α≈1.6732632423543772848170429916717
λ≈1.0507009873554804934193349852946
They have that many digits after the decimal point for absolute precision, and they are predetermined, which means we do not have to worry about picking the right alpha value for this activation function.
To be honest, the equation just looks like the other equations, which it more or less is. All the newer activation functions just look like a combination of the other existing activation functions.
The equation for SELU looks like this:
SELU(x) = λ x if x > 0 or
SELU(x) = λ α (e^x - 1) if x <= 0, tending to -λ α as x goes to -infinity
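
The SELU definition above can be sketched directly. The constants are the α and λ values quoted on this slide, truncated to double precision; the function name is illustrative.

```python
import math

ALPHA = 1.6732632423543772    # alpha from the slide, truncated to a float
LAMBDA = 1.0507009873554805   # lambda from the slide, truncated to a float

def selu(x):
    """SELU(x) = lambda * x for x > 0, lambda * alpha * (e^x - 1) otherwise."""
    if x > 0:
        return LAMBDA * x
    return LAMBDA * ALPHA * math.expm1(x)   # expm1(x) computes e^x - 1

limit_value = selu(-50.0)   # approaches -lambda * alpha for very negative x
```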


13 of 24

Loss Functions I

See https://machinelearningmastery.com/loss-and-loss-functions-for-training-deep-learning-neural-networks/
Everything we discuss is solving an optimization problem: the DNN proposes a quantity that specifies the system state, and the loss function is minimized as a function of this state.
“Stochastic Gradient Descent” is the most common method to find the state that minimizes the loss function
Local (false) minima and overfitting are dangers, to be avoided using “regularization methods”.
Maximum likelihood estimation, or MLE, is a framework for inference for finding the best statistical estimates of parameters from historical training data: exactly what we are trying to do with the neural network.

Typically one uses the negative of the log of the likelihood (negative to make it a minimization, not a maximization)


14 of 24

Loss Functions II

See https://deepai.org/machine-learning-glossary-and-terms/loss-function
Common Loss Functions: There are multiple ways to determine loss. Two of the most popular loss functions in machine learning are the 0-1 loss function and the quadratic loss function.

The 0-1 loss function is an indicator function that returns 1 when the target and output are not equal and zero otherwise:
IndicatorLoss(y,ŷ) = 0 if y=ŷ or 1 if y≠ŷ
The quadratic loss is a commonly used symmetric loss function. Its symmetry comes from the loss being identical for outputs that differ from the target by some value x in either direction (i.e. overshooting by 1 gives the same loss as undershooting by 1). The quadratic loss (least squares) is of the following form:
QuadraticLoss(y,ŷ) = C(y- ŷ)²
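
The two loss functions above are simple enough to sketch directly. The default C = 1 is an illustrative assumption; the slide leaves C unspecified.

```python
def zero_one_loss(y, y_hat):
    """Indicator loss: 0 when target and output agree, 1 otherwise."""
    return 0 if y == y_hat else 1

def quadratic_loss(y, y_hat, c=1.0):
    """Quadratic loss C * (y - y_hat)^2; c=1.0 is an assumed default."""
    return c * (y - y_hat) ** 2

overshoot = quadratic_loss(3.0, 4.0)    # output too high by 1
undershoot = quadratic_loss(3.0, 2.0)   # output too low by 1
```

The overshoot and undershoot cases give the same value, which is the symmetry described above.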


15 of 24

Loss Functions III

In most cases, our parametric model defines a distribution and we use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model’s predictions as the cost function.
Mean Squared Error (MSE) is also common; as the likelihood is often the exponential of a Gaussian, the -log of the likelihood is then identical to MSE up to constant factors and offsets
The complexity of CNN/RNN leads to complex formulae for the loss function and its derivatives with respect to parameters (weights, biases), leading to ideas like back propagation and steps to avoid pathological (small) derivatives
Typically the loss function is (1/D) ∑_{j=1}^{D} Loss(j) where Loss(j) is the contribution of the j’th point in the training dataset
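
The link between a Gaussian likelihood and MSE can be written out as a short derivation. It assumes independent training points and a fixed noise variance σ² (both assumptions made here for illustration), with y_j the target and ŷ_j the network output for point j.

```latex
% Assumed model: independent Gaussian noise with fixed variance \sigma^2
p(y_j \mid \hat{y}_j) = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left(-\frac{(y_j - \hat{y}_j)^2}{2\sigma^2}\right)

% Negative log-likelihood over the D training points
-\log \prod_{j=1}^{D} p(y_j \mid \hat{y}_j)
    = \frac{1}{2\sigma^2} \sum_{j=1}^{D} (y_j - \hat{y}_j)^2
      + \frac{D}{2}\log\!\left(2\pi\sigma^2\right)

% The second term does not depend on the weights, so minimizing the
% negative log-likelihood is equivalent to minimizing
\mathrm{MSE} = \frac{1}{D} \sum_{j=1}^{D} (y_j - \hat{y}_j)^2
```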


16 of 24

Optimization and Stochastic Gradient Descent SGD I

SGD has become a major approach to optimization in the Deep Learning community. It combines two ideas
The first idea, steepest descent, is very old. If I have a loss function L to be minimized as a function of parameters α₁, α₂, ..., α_P, then L will decrease for a small step size η in the direction -∂L/∂α_p for p = 1...P.

This is the direction with the largest decrease for a given value of η, and so it is the direction of steepest descent.

The second idea is to note that L = (1/D) ∑_{j=1}^{D} Loss(j)
Basic steepest descent is α_p → α_p - η ∂L/∂α_p
This now gets modified in several subtle ways
The stochastic in SGD is to replace L by (1/B) ∑_{k=1}^{B} Loss(j_k) for a random choice of B values j_k (a stochastic batch)
One eventually runs over all points j (called an epoch) but with many approximate shifts, not one big shift
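
The mini-batch idea can be sketched on a toy problem: fitting y = a x + b with a squared-error Loss(j). The toy data, step size, batch size and epoch count are all illustrative assumptions, not values from the slides.

```python
import random

def sgd_linear_fit(xs, ys, eta=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y = a*x + b by mini-batch SGD on the squared error."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):                   # one epoch visits every point once
        rng.shuffle(indices)                  # random choice of batches
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_a = grad_b = 0.0
            for j in batch:                   # gradient of (1/B) * sum of Loss(j_k)
                err = a * xs[j] + b - ys[j]
                grad_a += 2.0 * err * xs[j]
                grad_b += 2.0 * err
            a -= eta * grad_a / len(batch)    # alpha_p -> alpha_p - eta * dL/dalpha_p
            b -= eta * grad_b / len(batch)
    return a, b

# Noise-free toy data generated from y = 2x + 1
xs = [i / 10.0 for i in range(-10, 11)]
ys = [2.0 * x + 1.0 for x in xs]
a, b = sgd_linear_fit(xs, ys)
```

Each parameter update uses only one batch, so an epoch consists of several small approximate shifts rather than one shift computed from the full dataset.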


17 of 24

Optimization and Stochastic Gradient Descent SGD II

An epoch is a run over all D values of j; typically 10-100 epochs are needed for the iterative optimization to converge.
Batch sizes B could be 128 to 1024
Note (1/B) ∑_{k=1}^{B} Loss(j_k) is, by the central limit theorem, an estimate of the full L
The learning rate η needs to be chosen and experimented with, and will typically change as optimization progresses

In AlexNet 2012, the learning rate is 0.01, and is divided by 10 once the accuracy plateaus. The learning rate is decreased 3 times during the training process.

Momentum m is another important idea
We proposed α_p → α_p + Δα_p with Δα_p = - η ∂L/∂α_p
We change the definition of Δα_p to
Δα_p = m Δα_p(previous value) - η ∂L/∂α_p (factor 1-m absorbed in η)
This keeps most of the shift to be that which was successful previously
AlexNet used momentum m = 0.9.
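
A sketch of the momentum update follows, applied to the toy loss L(α) = α² (an assumption chosen so the gradient 2α is easy to write down). The step size and starting point are illustrative; m = 0.9 follows the AlexNet value quoted above.

```python
def momentum_step(param, delta_prev, grad, eta=0.01, m=0.9):
    """One update: delta = m * delta_prev - eta * grad, then param += delta."""
    delta = m * delta_prev - eta * grad
    return param + delta, delta

# Minimize the toy loss L(alpha) = alpha^2, whose gradient is 2 * alpha
alpha, delta = 5.0, 0.0
for _ in range(500):
    alpha, delta = momentum_step(alpha, delta, 2.0 * alpha)
```

The previous shift is carried forward with weight m, so successive updates in a consistent direction build up while oscillating ones partly cancel.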


18 of 24

Optimization and Stochastic Gradient Descent SGD III

AlexNet uses weight decay. Modify Δα_p = - η ∂L/∂α_p
to Δα_p = - wd η α_p - η ∂L/∂α_p where weight decay wd = 0.0005
There is a set of methods that vary the shifts on a per-parameter basis
They aim to reduce shifts in parameters that are changing rapidly and increase those that are changing slowly
These methods are AdaGrad, RMSProp and Adam
AdaGrad (adaptive gradient algorithm) illustrates this by replacing
Δα_p = - η ∂L/∂α_p by
Δα_p = - η (1/Norm_p) ∂L/∂α_p
where Norm_p is calculated as √(∑_{t=1}^{T} (∂L/∂α_p)²(step t)), the accumulated squared gradient for parameter p over the T steps so far
See https://en.wikipedia.org/wiki/Stochastic_gradient_descent
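
The per-parameter scaling in AdaGrad can be sketched as follows, accumulating squared gradients for each parameter as in the standard formulation. The small constant eps, which avoids division by zero, and the example numbers are assumptions added for this illustration.

```python
import math

def adagrad_step(params, grads, sum_sq, eta=0.5, eps=1e-8):
    """One AdaGrad update for a list of parameters.

    sum_sq holds the running sum of squared gradients per parameter;
    its square root plays the role of Norm_p.
    """
    new_params, new_sum_sq = [], []
    for a, g, s in zip(params, grads, sum_sq):
        s = s + g * g                               # accumulate squared gradient
        new_sum_sq.append(s)
        new_params.append(a - eta * g / (math.sqrt(s) + eps))
    return new_params, new_sum_sq

# Two parameters with very different gradient magnitudes
params, sum_sq = adagrad_step([1.0, 1.0], [100.0, 0.01], [0.0, 0.0])
```

On this first step both parameters move by roughly η despite gradients that differ by a factor of 10,000, which is the per-parameter normalization at work.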


19 of 24

Back Propagation I

In SGD we saw the need to differentiate the loss function with respect to the unknown weights
This can be quite confusing in a multilayer neural net
Chain Rule: ∂_x f(g(x)) = ∂_g f(g) · ∂_x g(x), where x runs over weights, biases etc.
We apply this to layered networks

Output = f(signals from hidden layer 3)

Signals in hidden layer 3 = g(signals from hidden layer 2)


20 of 24

Back Propagation II

We have functions f₁, f₂, f₃, f₄ corresponding to activation functions at hidden layers 1, 2, 3 and the output respectively
Output = f₄(f₃(f₂(f₁(input)))) where it is the functional forms that depend on weights, biases etc.
So we let g = f₃(f₂(f₁(input))) and f() = f₄() and apply the chain rule recursively

Output = f(signals from hidden layer 3)

Signals in hidden layer 3 = g(signals from hidden layer 2)

So the recursion starts at the output (back) and proceeds one layer at a time to the input
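
The backward recursion can be sketched on a very small network. For brevity this example assumes two tanh hidden layers with one unit each, a linear output, and a squared-error loss, rather than the three hidden layers on the slide. A finite-difference check is included to compare against the chain-rule gradients; all names and numbers are illustrative.

```python
import math

def forward(params, x):
    """Scalar network: x -> tanh -> tanh -> linear output."""
    w1, b1, w2, b2, w3, b3 = params
    h1 = math.tanh(w1 * x + b1)       # hidden layer 1
    h2 = math.tanh(w2 * h1 + b2)      # hidden layer 2
    out = w3 * h2 + b3                # output layer
    return h1, h2, out

def loss(params, x, y):
    return (forward(params, x)[2] - y) ** 2

def backprop(params, x, y):
    """Gradients of the loss, computed from the output back to the input."""
    w1, b1, w2, b2, w3, b3 = params
    h1, h2, out = forward(params, x)
    d_out = 2.0 * (out - y)                   # dL/d(out)
    d_w3, d_b3 = d_out * h2, d_out            # output layer
    d_z2 = d_out * w3 * (1.0 - h2 * h2)       # back through tanh of layer 2
    d_w2, d_b2 = d_z2 * h1, d_z2
    d_z1 = d_z2 * w2 * (1.0 - h1 * h1)        # back through tanh of layer 1
    d_w1, d_b1 = d_z1 * x, d_z1
    return [d_w1, d_b1, d_w2, d_b2, d_w3, d_b3]

def numerical_grad(params, x, y, h=1e-6):
    """Central finite differences, used only as a check."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += h
        down[i] -= h
        grads.append((loss(up, x, y) - loss(down, x, y)) / (2.0 * h))
    return grads

params = [0.5, -0.2, 1.5, 0.1, -0.7, 0.3]
analytic = backprop(params, 0.8, 1.0)
numeric = numerical_grad(params, 0.8, 1.0)
```

Each layer reuses the derivative already computed for the layer after it, which is why the calculation runs from the output backwards.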


21 of 24

One-Hot Vector

Suppose we have a classification problem with four possible outcomes: horse, cat, bus, frog
We could represent these as 1, 2, 3 or 4, but it is often more powerful to view them as separate dimensions
horse [1, 0, 0, 0]
cat [0, 1, 0, 0]
bus [0, 0, 1, 0]
frog [0, 0, 0, 1]
Binary representations where only one digit is 1
This is a one-hot representation
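
A one-hot encoder for the four classes above takes only a few lines; the function name is an illustrative choice.

```python
def one_hot(label, classes):
    """Return a list with 1 at the position of label and 0 elsewhere."""
    if label not in classes:
        raise ValueError(f"unknown label: {label!r}")
    return [1 if c == label else 0 for c in classes]

classes = ["horse", "cat", "bus", "frog"]
encoded = {c: one_hot(c, classes) for c in classes}
```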


22 of 24

Vanishing Gradient Problem

https://deepai.org/machine-learning-glossary-and-terms/vanishing-gradient-problem
What is the vanishing gradient problem? The vanishing gradient problem is an issue that sometimes arises when training machine learning algorithms through gradient descent. This most often occurs in neural networks that have several neuronal layers, such as in a deep learning system, but also occurs in recurrent neural networks. The key point is that the calculated partial derivatives used to compute the gradient become smaller as one goes deeper into the network. Since the gradients control how much the network learns during training, if the gradients are very small or zero, then little to no training can take place, leading to poor predictive performance.
Proposed solutions

Multi-level hierarchy: This technique pretrains one layer at a time, and then performs backpropagation for fine tuning.
Residual networks: This technique introduces bypass connections that connect layers further behind the preceding layer to a given layer. This allows gradients to propagate faster to deep layers before they can be attenuated to small or zero values.
Rectified linear units (ReLUs): When using rectified linear units, the typical sigmoidal activation function used for node output is replaced with a new function: f(x) = max(0, x). This activation only saturates in one direction and thus is more resilient to the vanishing of gradients.


23 of 24

DeepAI on Hyperparameters

A hyperparameter is a parameter that is set before the learning process begins. These parameters are tunable and can directly affect how well a model trains.
Some examples of hyperparameters in machine learning:

Batch Size
Number of Hidden layers
Learning Rate
Number of Epochs
Momentum
Regularization method
Number of branches in a decision tree
Number of clusters in a clustering algorithm (like k-means)

Genetic algorithms are quite popular for finding the best hyperparameters

This implies you run many copies of the neural network to explore parameter space


24 of 24

Label Smoothing

Google paper “When Does Label Smoothing Help?” https://arxiv.org/abs/1906.02629v2
Label Smoothing & Deep Learning: Google Brain explains why it works and when to use https://medium.com/@lessw/label-smoothing-deep-learning-google-brain-explains-why-it-works-and-when-to-use-sota-tips-977733ef020
The basic idea is to train to fuzzy values, i.e. the trained value is not exactly 1 (this is a cat) but rather “this is a cat 90% of the time” (with the 90% set by the user)
This implies a change to loss function
Label smoothing improves accuracy in image classification, translation, and even speech recognition
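
The change to the targets, and hence to the loss, can be sketched as follows. This assumes the common uniform-smoothing form y_k(1 − α) + α/K for K classes with a user-chosen α; the function names and the example prediction are illustrative.

```python
import math

def smooth_labels(one_hot, alpha=0.1):
    """Replace hard 0/1 targets by y * (1 - alpha) + alpha / K."""
    k = len(one_hot)
    return [y * (1.0 - alpha) + alpha / k for y in one_hot]

def cross_entropy(targets, probs):
    """Cross-entropy between target and predicted distributions."""
    return -sum(t * math.log(p) for t, p in zip(targets, probs))

hard = [0, 1, 0, 0]                        # "this is a cat"
soft = smooth_labels(hard, alpha=0.1)      # cat gets 0.925, the others 0.025 each
predicted = [0.05, 0.85, 0.05, 0.05]       # illustrative network output
loss_hard = cross_entropy(hard, predicted)
loss_soft = cross_entropy(soft, predicted)
```

With smoothed targets every class contributes to the loss, so the network is no longer rewarded for pushing the correct-class probability all the way to 1.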

From Paper: LS is Label Smoothing

ResNet training for classifying 3 image categories: “beaver, dolphin and otter”…note the tremendous difference in cluster tightness.
