1 of 95

CNNs

Aaron Snowberger

Week 9: CNNs

p. 313-425

2 of 95

13

Neural Networks


p. 313-349

3 of 95

13 Neural Networks

Neural Networks

Deep learning is built from networks of artificial neurons, simplified computational units inspired by neurons in the human brain. These artificial neurons are grouped into layers to form deep neural networks. Although the term "neuron" comes from biology, the artificial version is only a symbolic abstraction rather than a realistic model of brain function.

4 of 95

13 Neural Networks

Real Neurons

Biological neurons transmit information through electrical and chemical signals. When neurotransmitters bind to receptor sites, electrical impulses travel through the neuron. If the total signal exceeds a threshold, a new signal is transmitted to other neurons. These neurons form densely connected networks called connectomes, which are unique to each individual.

5 of 95

13 Neural Networks

Artificial Neurons

Artificial neurons used in machine learning are extremely simplified. Despite their simplicity, they form the foundation of deep learning. The media often exaggerate their capability by calling them "electronic brains," but in reality they are simple mathematical functions, and for that reason they are often called units rather than neurons.

6 of 95

13 Neural Networks

The Perceptron

The perceptron, introduced in 1957, was the first mathematical model of an artificial neuron. It takes multiple numeric inputs, multiplies each by a weight, sums them, and compares the result to a threshold to produce an output of either +1 or −1. Early perceptrons were even implemented in hardware.
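
A minimal sketch of this rule (NumPy; illustrative, not the book's code):

  import numpy as np

  def perceptron(inputs, weights, threshold=0.0):
      total = np.dot(inputs, weights)        # multiply each input by its weight and sum
      return 1 if total > threshold else -1  # compare the sum to the threshold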

7 of 95

13 Neural Networks

The Perceptron

Perceptrons initially generated excitement but faced a theoretical limitation: they could not solve problems that are not linearly separable. This contributed to the first AI winter. However, later research showed that combining multiple perceptrons into layered structures with training algorithms could overcome these limits, leading to modern neural networks.

8 of 95

13 Neural Networks

Modern Artificial Neurons

Modern neurons extend the perceptron model with two enhancements:

  1. Bias Term – An additional input that shifts the activation threshold.
  2. Activation Function – Instead of outputting just +1 or −1, the sum of weighted inputs passes through a nonlinear function, producing more flexible outputs.

These improvements made neural networks far more powerful and are the basis of today's deep learning systems.
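
A modern neuron in one line of NumPy (a sketch; tanh here is just one possible activation function, and more appear later in this chapter):

  import numpy as np

  def neuron(x, w, b, activation=np.tanh):
      # bias shifts the weighted sum, then a nonlinear function shapes the output
      return activation(np.dot(x, w) + b)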

9 of 95

13 Neural Networks

Neuron Diagrams

Simplifying Neuron Diagrams

When drawing artificial neurons, diagrams typically do not explicitly show the weights or the multiplication steps to avoid clutter. These operations are implied, and sometimes only the weights' names are written on the connecting lines.

10 of 95

13 Neural Networks

Neuron Diagrams

Bias Trick

To simplify both diagrams and math, the bias is often treated as just another input with a fixed value of 1, and only its weight is adjusted. This is known as the bias trick. In most diagrams, even this is omitted, and both weights and biases are simply assumed to exist.
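
A tiny NumPy check of the trick (illustrative values):

  import numpy as np

  x = np.array([0.5, -1.2, 3.0])   # real inputs
  w = np.array([0.4, 0.1, -0.7])   # their weights
  b = 0.2                          # bias

  # bias trick: append a constant input of 1 whose weight is the bias
  x_aug = np.append(x, 1.0)
  w_aug = np.append(w, b)
  assert np.isclose(x @ w + b, x_aug @ w_aug)  # identical weighted sums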

11 of 95

13 Neural Networks

Neuron Diagrams

Networks and Connections

Neurons are connected by "wires" where each wire implies a weight that multiplies the signal before it reaches the next neuron. Even if the weights are not drawn, they are always present conceptually. To name a weight, a common convention uses two letters: the sending (input) neuron first and the receiving (output) neuron second (e.g., AD is the weight on the connection from A to D).

12 of 95

13 Neural Networks

Neuron Diagrams

Feed-Forward Layers

Neural networks are usually organized into layers where data flows from input layer → hidden layers → output layer without looping back.

This is called a feed-forward network, and it processes information hierarchically, similar to workers on different floors of a building passing work upward.

13 of 95

13 Neural Networks

Neuron Diagrams

Neural Networks as Graphs

Neural networks can be viewed as directed graphs, where nodes are neurons and edges are weighted connections. These graphs are directed acyclic graphs (DAGs), meaning information flows only forward, with no cycles. Each edge implicitly carries a weight multiplication, even if not shown.

14 of 95

13 Neural Networks

Initializing the Weights

Training a neural network starts by assigning initial random values to the weights. The way these initial weights are chosen can significantly impact how fast the network learns. Researchers have designed various initialization strategies, such as LeCun Uniform, Glorot/Xavier Uniform, and He Uniform, which draw values from a uniform distribution. Their Normal counterparts draw from a normal distribution.

Modern deep learning libraries implement these methods automatically, and in most cases, the default initialization works well without needing manual adjustment.
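
As a rough sketch of what such libraries do internally (NumPy, using the commonly published formulas; not code from the book):

  import numpy as np

  def glorot_uniform(fan_in, fan_out, rng=np.random.default_rng()):
      # Glorot/Xavier: range shrinks as fan-in + fan-out grows
      limit = np.sqrt(6.0 / (fan_in + fan_out))
      return rng.uniform(-limit, limit, size=(fan_in, fan_out))

  def he_uniform(fan_in, fan_out, rng=np.random.default_rng()):
      # He: range depends on fan-in only (a common choice for ReLU layers)
      limit = np.sqrt(6.0 / fan_in)
      return rng.uniform(-limit, limit, size=(fan_in, fan_out))

  W = he_uniform(4, 3)  # initial weights for a 4-neuron layer feeding 3 neurons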

15 of 95

13 Neural Networks

Deep Networks

A powerful structure in neural networks is stacking neurons in multiple layers. Neurons within a layer don’t connect to each other; instead, they take input from the previous layer and send output to the next.

This layered structure allows hierarchical processing of data: early layers detect simple features like pixel intensity, and deeper layers detect complex patterns like shapes or objects. Layers between input and output are called hidden layers.

The depth of the network refers to how many neuron-containing layers it has.

16 of 95

13 Neural Networks

Fully Connected Layers

A fully connected (dense) layer is one where every neuron receives an input from every neuron in the previous layer. If the previous layer has 4 neurons and the current one has 3, then there are 12 total connections, each with its own weight. Networks made entirely of these dense layers are often called fully connected networks or multilayer perceptrons (MLPs).
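
A minimal NumPy sketch of that 4-to-3 dense layer (illustrative values; the weight matrix holds the 12 connection weights):

  import numpy as np

  x = np.array([0.5, -1.2, 3.0, 0.7])               # previous layer: 4 outputs
  W = np.random.default_rng(0).normal(size=(4, 3))  # 4 × 3 = 12 weights
  b = np.zeros(3)                                   # one bias per neuron

  y = x @ W + b   # weighted sum for each of the 3 neurons; shape (3,)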

17 of 95

13 Neural Networks

Tensors

Although each neuron outputs a single number, we often refer to the output of a whole layer as a collection, described by its shape. A single number is a 0D array; a list of numbers is a 1D array (vector); an image can be a 2D array (matrix), or a 3D one if color channels are included. Instead of many different terms, deep learning uses the word tensor to refer to any multi-dimensional block of numbers, defined by its dimensions and shape.
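
In NumPy terms (illustrative shapes):

  import numpy as np

  scalar = np.array(7.0)                # 0D tensor, shape ()
  vector = np.zeros(10)                 # 1D tensor, shape (10,)
  gray_image = np.zeros((28, 28))       # 2D tensor, shape (28, 28)
  color_image = np.zeros((28, 28, 3))   # 3D tensor: height, width, channels
  print(color_image.ndim, color_image.shape)   # 3 (28, 28, 3)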

18 of 95

13 Neural Networks

Preventing Network Collapse

Without activation functions, a neural network collapses into the equivalent of a single neuron, regardless of how many layers it has. This happens because the operations remain purely linear, and linear functions can always be merged into one. Activation functions introduce nonlinearity, which prevents this collapse and allows networks to learn complex patterns. Different activation functions exist to address various training difficulties, but only a few are commonly used in practical systems.
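
A quick NumPy check of this claim (a toy sketch, not from the book): two weight matrices applied in sequence behave exactly like their single merged product.

  import numpy as np

  rng = np.random.default_rng(1)
  x = rng.normal(size=4)
  W1 = rng.normal(size=(4, 3))
  W2 = rng.normal(size=(3, 2))

  two_linear_layers = (x @ W1) @ W2    # two layers, no activation between them
  one_merged_layer = x @ (W1 @ W2)     # a single linear layer
  print(np.allclose(two_linear_layers, one_merged_layer))   # True: collapse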

19 of 95

13 Neural Networks

Activation Functions

Activation functions, also called nonlinearities, take a numeric input and produce a transformed numeric output. They are essential in preventing network collapse, where a network without nonlinear activation would behave like a single linear neuron. Although we could assign different activation functions to every neuron, in practice we usually assign the same one per layer.

20 of 95

13 Neural Networks

Activation Functions

Straight-Line Functions

Linear activation functions (like the identity function) simply pass or scale the input. They do not prevent collapse, so they are used only in output layers or intermediate processing steps.

21 of 95

13 Neural Networks

Activation Functions

Step Functions

These produce abrupt output jumps. The most basic is the step function used in early perceptrons. Variants include:

  • Unit step: output jumps from 0 to 1 at the threshold
  • Heaviside step: a unit step with its threshold fixed at 0
  • Sign function: output jumps from −1 to +1 (or sometimes −1, 0, +1)

22 of 95

13 Neural Networks

Activation Functions

Step Functions

23 of 95

13 Neural Networks

Activation Functions

Piecewise Linear Functions (Nonlinear)

  • ReLU (Rectified Linear Unit): outputs 0 for negative inputs and x for positive inputs. Simple and efficient, but can lead to dead neurons.
  • Leaky ReLU / Parametric ReLU: allows small negative output values to avoid dead neurons (both are sketched after this list).
  • Shifted ReLU, Maxout, Noisy ReLU: variations to improve performance and stability.
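
A minimal sketch of the first two variants (NumPy; the 0.01 slope is a typical choice, not a requirement):

  import numpy as np

  def relu(x):
      return np.maximum(0.0, x)             # 0 for negatives, x for positives

  def leaky_relu(x, alpha=0.01):
      return np.where(x > 0, x, alpha * x)  # small negative slope: no dead neurons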

24 of 95

13 Neural Networks

Activation Functions

Piecewise Linear Functions (Nonlinear)

25 of 95

13 Neural Networks

Activation Functions

Smooth Activation Functions

These functions are differentiable everywhere, which is good for learning:

  • Softplus: smoothed ReLU.
  • ELU: smoothed shifted ReLU.
  • Swish: smooth bump near 0.
  • Sigmoid: squashes values to [0, 1], smooth version of Heaviside step.
  • Tanh: squashes values to [-1, 1], smooth and centered at 0.
  • Sine activation: outputs in [−1, 1], but unlike sigmoid and tanh it does not saturate far from zero.

26 of 95

13 Neural Networks

Activation Functions

Choosing Activation Functions (Practical Tips)

Situation → Recommended:

  • Hidden layers (especially dense layers) → ReLU or Leaky ReLU (faster learning, avoids dead neurons)
  • Regression output layer → no activation, or linear
  • Binary classification output → Sigmoid
  • Multiclass classification output → a different function, covered next (usually Softmax)

27 of 95

13 Neural Networks

Softmax

Softmax is an operation typically used only on the output layer of a classification neural network when there are two or more output neurons. It is not exactly an activation function in the traditional sense because it takes all output scores at once and transforms them collectively into a set of probabilities.

28 of 95

13 Neural Networks

Softmax

Normally, the output layer neurons do not use any activation function (or use a linear one), and then the raw output scores are passed into softmax. Each raw score represents how strongly the network believes the input belongs to that class, but these scores cannot be directly interpreted as probabilities.

29 of 95

13 Neural Networks

Softmax

Softmax resolves this by converting them into values that:

  • Are between 0 and 1, and
  • Sum to 1, forming a proper probability distribution (see the sketch below).
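
A minimal NumPy sketch (the max-subtraction is a standard numerical-stability trick, not part of the definition):

  import numpy as np

  def softmax(scores):
      e = np.exp(scores - np.max(scores))   # shift for stability; result unchanged
      return e / e.sum()                    # values in (0, 1) that sum to 1

  print(softmax(np.array([0.3, 0.2, 0.1])))   # small scores: fairly balanced
  print(softmax(np.array([3.0, 0.2, 0.1])))   # one big score: it dominates

The second call previews the "exaggeration" effect described on the next slide.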

30 of 95

13 Neural Networks

Softmax

Softmax ensures that relative ranking (ordering) of the outputs remains the same, but it scales and normalizes them based on the entire set of outputs. When input scores are small (e.g., <1), the distribution is more balanced. When one score is much larger than the rest (especially >1), softmax amplifies that difference so the dominant class stands out clearly — this is called exaggeration, and the others are often described as being crushed.

31 of 95

13 Neural Networks

Softmax

This transformation is extremely useful because it allows us to say things like “Class A is twice as likely as Class B”, which we cannot do with raw scores.

32 of 95

14

Backpropagation


p. 351-386

33 of 95

14 Backpropagation

High-Level Idea

A neural network consists of many neurons connected together. To make it produce correct outputs, we must train it efficiently. The key algorithm that makes this possible is backpropagation (backprop). Without backprop, modern deep learning would be too slow to be practical.

34 of 95

14 Backpropagation

Error and Learning

Training starts by defining a loss (error) function, which gives a single number representing how wrong the prediction is. A prediction with zero loss means the network got it right. We aim to reduce this loss across the entire dataset by adjusting network weights.

We don't directly teach the network correct answers. Instead, we penalize wrong answers, and the only way for the network to reduce penalties is to adjust weights to avoid errors. Multiple penalty terms can be combined, such as correctness and keeping weights small (regularization).

35 of 95

14 Backpropagation

Naive Learning Approach

A simple but extremely slow method is to change one weight at a time randomly, evaluate if the error gets better, and keep or discard the change. This process is conceptually understandable but completely impractical for millions of weights.

36 of 95

14 Backpropagation

Gradient Descent

To improve efficiency, we use gradient descent, where we determine whether each weight should increase or decrease to reduce error by calculating the gradient (slope) of the loss with respect to each weight.

  • Positive gradient → Decrease weight
  • Negative gradient → Increase weight

Unlike the naive method, gradient descent updates all weights at once, making training much faster. However, because weight changes influence each other, updates are made in small steps to avoid instability.
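
In code form, one weight following this rule looks like the toy sketch below (illustrative, not the book's example):

  # toy loss L(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
  w, eta = 0.0, 0.1   # eta is the small step size (the learning rate)
  for _ in range(50):
      grad = 2.0 * (w - 3.0)
      w -= eta * grad    # positive gradient: decrease w; negative: increase w
  print(w)               # close to 3.0, the bottom of the curve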

37 of 95

14 Backpropagation

Getting Started with Backpropagation

Overview of the Two-Step Learning Process

Training reduces error by:

  1. Backpropagation step – Each neuron computes a delta (δ) value representing how its output affects the overall error.
  2. Update/Optimization step – These δ values are used to adjust the weights. (This part is covered later, in Chapter 15.)

In this chapter, we focus only on finding δ values through backpropagation, ignoring activation functions temporarily to make the core logic clearer.

38 of 95

14 Backpropagation

Getting Started with Backpropagation

Key Concept

Without activation functions, neuron outputs are just weighted sums. So:

  • Changing a neuron's input weight → changes its output
  • Changing output → proportionally changes final error
  • The proportionality factor is called δ (delta) for that neuron.

Every neuron has a delta value that tells us:

"If my output changes by m, the total error will change by m × δ."

39 of 95

14 Backpropagation

Getting Started with Backpropagation

Gradient Intuition

  • Think of δ as the gradient that tells whether increasing a neuron’s output will increase or decrease the error.
  • Positive δ → increasing the neuron output increases error.
  • Negative δ → increasing the neuron output reduces error.

Once we know δ for each neuron, we know which direction to push each incoming weight to reduce error.

40 of 95

14 Backpropagation

Getting Started with Backpropagation

41 of 95

14 Backpropagation

Getting Started with Backpropagation

Finding Dδ (Delta for Neuron D)

After calculating the delta for neuron C (Cδ), we do the same procedure for P2 (corresponding to neuron D). Looking at the error curve:

  • A small positive change in P1 decreased error, giving Cδ ≈ -3.
  • A small negative change in P2 decreased error, giving Dδ ≈ 2.5.
  • This positive sign means increasing P2 increases error, so we should move it in the negative direction.

42 of 95

14 Backpropagation

Getting Started with Backpropagation

Finding Dδ (Delta for Neuron D)

An important observation is that we cannot reduce the error to zero by changing one weight at a time because P1 and P2 influence the error jointly. Even if one weight reaches its optimal value, the other prevents error from reaching zero. In practice, we update all weights slightly and repeatedly, gradually reducing the error.

43 of 95

14 Backpropagation

Getting Started with Backpropagation

Finding Dδ (Delta for Neuron D)

Also, we don't necessarily want zero training error – reducing error too much may lead to overfitting, where performance on new data worsens.

44 of 95

14 Backpropagation

Getting Started with Backpropagation

Measuring Error – Mean Squared Error (MSE)

For simplicity, the quadratic cost function (MSE) is used. With this,

  • Delta for an output neuron = (neuron output − label value), so a positive δ again means that increasing the output increases the error.
  • Example: If the label is 1 and the output is 2, then δ = 2 − 1 = +1, meaning the output is too high and should be pushed in the negative direction.

This completes Step 1 of backpropagation: computing deltas for the output layer.

45 of 95

14 Backpropagation

Getting Started with Backpropagation

Using Deltas to Update Weights

  • A weight like AC affects error proportional to Ao × Cδ.
  • To reduce error, we update the weight using AC ← AC - (Ao × Cδ).
  • This is applied to all weights leading into output neurons.

By doing so, we make the first training step — updating weights AC, BC, AD, BD.

46 of 95

14 Backpropagation

Getting Started with Backpropagation

Finding Deltas for Hidden Layer Neurons

To update earlier layers, we need deltas for hidden neurons like A and B.

  • A neuron like A affects multiple output neurons (C and D).
  • Therefore, its delta is Aδ = (AC × Cδ) + (AD × Dδ).
  • This logic extends to any network depth, layer by layer backward — the essence of backpropagation (a worked sketch follows).
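
A worked sketch of the last two slides on the tiny A, B → C, D network (illustrative numbers; weight names follow the deck's from-to convention, and output deltas use δ = output − label):

  # forward pass (no activation functions in this chapter)
  Ao, Bo = 0.6, -0.2                     # hidden-layer outputs
  AC, AD, BC, BD = 0.5, -0.3, 0.8, 0.1   # weights, named source-to-destination
  Co = Ao * AC + Bo * BC                 # output of neuron C
  Do = Ao * AD + Bo * BD                 # output of neuron D

  # backward pass: output deltas first, then hidden deltas
  Cd = Co - 1.0                          # label for C is 1
  Dd = Do - 0.0                          # label for D is 0
  Ad = AC * Cd + AD * Dd                 # hidden delta for A
  Bd = BC * Cd + BD * Dd                 # hidden delta for B

  # update step (the learning rate eta is introduced shortly)
  eta = 0.1
  AC -= eta * Ao * Cd
  AD -= eta * Ao * Dd
  BC -= eta * Bo * Cd
  BD -= eta * Bo * Dd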

47 of 95

14 Backpropagation

Getting Started with Backpropagation

48 of 95

14 Backpropagation

Getting Started with Backpropagation

Feedforward vs Backprop Flow

  • Forward pass (prediction): outputs flow rightward, multiplied by weights.
  • Backward pass (backpropagation): deltas flow leftward, multiplied by the same weights. This symmetry makes computing deltas efficient, similar in cost to computing outputs.

49 of 95

14 Backpropagation

Backprop on a Larger Network

Backpropagation lets us calculate the delta values for every neuron in a network. We begin at the output layer, calculate deltas using the loss function, and then move backward through each layer, one at a time, computing deltas based only on:

  • The weights leading to the next layer
  • The deltas of those next-layer neurons

50 of 95

14 Backpropagation

Backprop on a Larger Network

We don’t need any other layers once a layer’s deltas are computed. After going layer by layer all the way back to the first hidden layer, we will have a delta for every neuron in the network.

Once all deltas are known, we update all weights using these deltas and the values of the neurons that feed into them. This completes one backpropagation cycle, which is used repeatedly to train the network.

51 of 95

14 Backpropagation

Backprop on a Larger Network

Backpropagation is efficient because it uses local information (just the next layer’s weights and deltas) and can be parallelized on a GPU, allowing entire layers to be processed simultaneously.

52 of 95

14 Backpropagation

Learning Rate (η) and Its Importance

Updating weights too aggressively (η too large) causes instability, making weights jump wildly and fail to converge.

Updating too slowly (η too small) makes training painfully slow, leading to long plateaus with little improvement.

53 of 95

14 Backpropagation

Learning Rate (η) and Its Importance

Therefore, the learning rate η (between 0 and 1) controls how big each weight update is:

  • η = 0 → no learning
  • η = 1 → unstable, overshooting
  • Best practice → small positive value, often starting around 0.001 or 0.01

Choosing the right η is critical and usually determined by experimenting. Modern optimizers can even adjust the learning rate automatically during training.

54 of 95

14 Backpropagation

Binary Classifier Example

A network with:

  • 2 inputs
  • Two hidden layers with 4 neurons each
  • 1 output neuron (sigmoid for binary classification)
  • 37 weights total (including biases; verified below)
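
A quick count confirms the 37 parameters, since each neuron adds one bias to its incoming weights:

  params = (2 + 1) * 4 + (4 + 1) * 4 + (4 + 1) * 1   # 12 + 20 + 5 = 37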

55 of 95

14 Backpropagation

Binary Classifier Example

Training results for different learning rates:

  • η = 0.5 → Completely fails to learn, predictions collapse to one class
  • η = 0.05 → Learns perfectly in ~16 epochs, fast and accurate
  • η = 0.01 → Still learns but takes ~170 epochs, showing long plateaus


This shows that η = 0.05 was "just right" for this setup, while too large or too small values were inefficient or unstable.

58 of 95

14 Backpropagation

Summary

Backpropagation allows us to:

  1. Compute delta values layer by layer, backward through the network
  2. Use those deltas to update weights efficiently
  3. Control training behavior using the learning rate η

Backprop propagates gradients, not the error itself. With GPU acceleration and proper η tuning, backprop makes deep learning feasible and efficient. The next chapter explores how much to update weights using different strategies for adjusting the learning rate.

59 of 95

15

Optimizers


p. 387-425

60 of 95

15 Optimizers

Training neural networks can be slow, so optimizers are algorithms designed to speed up learning and improve gradient descent. Their goals are to:

  • Make gradient descent faster and more efficient
  • Prevent it from getting stuck in poor minima
  • Automatically adjust the learning rate (η) during training

Different optimizers have different strengths, and choosing the right one helps improve performance.

61 of 95

15 Optimizers

Error as a 2D Curve

We can visualize error as a curve, showing how it changes with model parameters. In a simple example with two classes on a line, we can slide a dividing boundary and count misclassified samples.

Plotting error against the boundary gives us a 2D error curve — and our goal is to find the lowest point on that curve.

This same principle extends to all neural network weights: minimizing the total error across all parameters.

62 of 95

15 Optimizers

Adjusting the Learning Rate

In gradient descent, the learning rate η (usually between 0.01 and 0.0001) controls how large each step is when updating weights.

  • Large η → fast learning but unstable (can overshoot minima)
  • Small η → stable but very slow (may get stuck in shallow valleys)

To improve this, we can change η over time — large at first, small later — just like a metal detector user takes big steps first, then smaller ones near the signal.

63 of 95

15 Optimizers

Constant-Sized Updates

Using a constant learning rate means η stays fixed throughout training. When applied to a simple “bowl-shaped” error curve:

  • The updates oscillate around the bottom — moving left and right repeatedly without settling.
  • Smaller η reduces oscillation but slows learning.
  • Larger η makes oscillation worse or causes the model to jump into the wrong valley.

Thus, a fixed η rarely performs well — we need a way to adapt it dynamically.

64 of 95

15 Optimizers

Error as a 2D Curve


67 of 95

15 Optimizers

Changing the Learning Rate Over Time

A simple approach is exponential decay — reducing η slightly after each update:

η_new = η_old × decay_factor

where the decay factor (e.g., 0.99) is close to 1. This makes η shrink gradually, taking large steps early and tiny steps near minima, preventing overshooting.

This approach works much better — the model smoothly converges into the valley bottom instead of bouncing around.
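
As a sketch (illustrative values):

  eta, decay_factor = 0.05, 0.99
  for epoch in range(200):
      # ... train for one epoch using the current eta ...
      eta *= decay_factor   # eta_new = eta_old * decay_factor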

68 of 95

15 Optimizers

Changing the Learning Rate Over Time

69 of 95

15 Optimizers

Decay Schedules

Several methods exist for controlling how η decreases; two of them are sketched after this list:

  1. Exponential Decay: Apply decay continuously each epoch.
  2. Delayed Decay: Start with a fixed η, then begin decay after a few epochs.
  3. Interval (Fixed-Step) Decay: Reduce η every fixed number of epochs (e.g., every 5).
  4. Error-Based Decay: Decrease η only when training error stops improving.
  5. Bold Driver: If loss decreases → slightly increase η; if loss spikes → sharply reduce η.
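
Two of these schedules as small sketches (names and parameter values are illustrative):

  def interval_decay(eta, epoch, every=5, factor=0.5):
      # fixed-step decay: shrink eta every `every` epochs
      return eta * factor if epoch > 0 and epoch % every == 0 else eta

  def bold_driver(eta, loss, prev_loss, up=1.05, down=0.5):
      # loss improved: nudge eta up; loss got worse: cut eta sharply
      return eta * up if loss < prev_loss else eta * down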

70 of 95

15 Optimizers

Decay Schedules

These strategies are known as decay schedules. They help η adapt naturally during training.

However, all require hyperparameters (like decay rate or interval), which must be tuned manually or searched automatically.

71 of 95

15 Optimizers

Updating Strategies

We compare three optimization strategies for training neural networks:

  1. Batch Gradient Descent,
  2. Stochastic Gradient Descent (SGD), and
  3. Mini-Batch Gradient Descent.

72 of 95

15 Optimizers

Updating Strategies

The test case is a two-class classification problem using the well-known two moons dataset with 300 samples. The neural network has three hidden layers (12, 13, 13 neurons) with ReLU activations, and a softmax output layer with 2 nodes. A fixed learning rate of η = 0.01 is used for strategies that require it.

73 of 95

15 옵티마이저

Batch Gradient Descent

In batch gradient descent, weights are updated only once per epoch, using the error accumulated from all training samples. The error curve is very smooth, showing a steep drop initially and then a long, shallow descent until it reaches zero after about 20,000 epochs. Although stable, this approach is slow and memory-intensive, requiring access to the entire dataset at once, making it an offline method.

74 of 95

15 Optimizers

Stochastic Gradient Descent (SGD)

SGD updates weights after every individual sample. With 300 samples, this means 300 updates per epoch, and the error curve becomes very noisy and unpredictable. There are dramatic spikes, including one around epoch 225, where the error temporarily jumps from near zero to nearly one. However, SGD converges to zero error in just about 400 epochs, which is far faster in terms of epochs, but actually performs 120,000 updates, far more than batch gradient descent. SGD is considered an online method because it doesn't require storing all data, but its noise makes convergence unstable.

75 of 95

15 Optimizers

Mini-Batch Gradient Descent

Mini-batch gradient descent offers a balance between the two extremes. Here, weights are updated after processing small chunks (mini-batches) of data, typically between 32 and 256 samples. With a mini-batch size of 32, this results in 10 updates per epoch, leading to about 5,000 epochs to reach zero error, for a total of 50,000 updates. The curve is smoother than SGD but faster than batch gradient descent, making it practical, efficient, and GPU-optimized. For this reason, mini-batch SGD is the most commonly used method in modern deep learning, and the term “SGD” in literature often implicitly means mini-batch SGD.
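
A sketch of how one epoch of mini-batch updates is typically organized (NumPy; update_weights is a hypothetical stand-in for backprop plus the weight step):

  import numpy as np

  def minibatches(X, y, batch_size=32, rng=np.random.default_rng(0)):
      idx = rng.permutation(len(X))           # reshuffle each epoch
      for start in range(0, len(X), batch_size):
          take = idx[start:start + batch_size]
          yield X[take], y[take]              # one small chunk at a time

  # one epoch: with 300 samples and batch_size=32, about 10 updates
  # for X_batch, y_batch in minibatches(X, y):
  #     update_weights(model, X_batch, y_batch, eta=0.01)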

76 of 95

15 Optimizers

Gradient Descent Variations

Mini-batch gradient descent (commonly referred to as SGD) is widely used in practice, but it still has limitations. One major challenge is selecting the right learning rate η. If η is too small, learning becomes slow and may get stuck in shallow minima. If it's too large, the algorithm may overshoot minima and oscillate. Even when using learning rate decay, we still need to decide the initial value and decay schedule. Choosing the mini-batch size is less of an issue, as it typically depends on hardware capabilities.

77 of 95

15 Optimizers

Gradient Descent Variations

To improve SGD, we can assign individual adaptive learning rates per weight rather than applying a single rate to all weights. Another challenge arises from saddle points and plateaus on the error surface, which cause gradients to vanish and slow training. Since saddle points are common in deep learning, it's beneficial to use algorithms that help escape these flat regions.

78 of 95

15 Optimizers

Gradient Descent Variations

Momentum

Momentum treats weight updates like a rolling ball on a landscape, carrying inertia from previous updates. Even if the gradient becomes shallow (near a plateau), the accumulated past direction helps continue movement. Momentum adds a fraction γ of the previous update to the current gradient step. If γ is too high, the model overshoots; if too low, it loses the benefit. While it speeds up convergence significantly, it introduces a new hyperparameter γ that must be tuned.
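
A sketch of momentum on the earlier toy loss (w − 3)²; γ plays the inertia role described above:

  w, v = 0.0, 0.0
  eta, gamma = 0.05, 0.9
  for _ in range(100):
      grad = 2.0 * (w - 3.0)
      v = gamma * v + eta * grad   # velocity: past updates plus current slope
      w -= v                       # inertia carries w across shallow regions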

79 of 95

15 Optimizers

Gradient Descent Variations

Momentum


82 of 95

15 Optimizers

Gradient Descent Variations

Nesterov Momentum

Nesterov momentum improves upon standard momentum by looking ahead. Instead of computing the gradient at the current position, it estimates where the weight would move next based on momentum, then computes the gradient at that future position. This "peek into the future" helps prevent excessive overshooting and leads to faster, more stable convergence. It introduces no new hyperparameters beyond γ, yet performs better than standard momentum, reaching low error in fewer epochs and with less noise.
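
The same sketch with the Nesterov "look ahead"; the only change is where the gradient is evaluated:

  w, v = 0.0, 0.0
  eta, gamma = 0.05, 0.9
  for _ in range(100):
      lookahead = w - gamma * v         # where momentum would carry w next
      grad = 2.0 * (lookahead - 3.0)    # gradient of (w - 3)^2 at that point
      v = gamma * v + eta * grad
      w -= v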

83 of 95

15 Optimizers

Gradient Descent Variations

Nesterov Momentum

84 of 95

15 Optimizers

Gradient Descent Variations

Adagrad

Adagrad introduces the idea of assigning a separate adaptive learning rate to each weight. Instead of using a single η for all parameters, Adagrad keeps a running sum of squared gradients for each weight. Each time a weight is updated, the gradient is squared and added to this sum. Then, the gradient is divided by a value derived from this sum, effectively reducing the learning rate over time for weights that have accumulated many updates.

85 of 95

15 Optimizers

Gradient Descent Variations

Adagrad

This acts like learning rate decay applied individually per parameter, making the algorithm less sensitive to the initial learning rate choice, often allowing us to simply set η = 0.01. However, because the running sum only grows and never decreases, the updates eventually become extremely small, causing very slow convergence in later epochs, as seen in Adagrad’s error curve, which takes around 8,000 epochs to reach near-zero error.
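
A sketch of the Adagrad rule on the toy loss (ε is a tiny constant that avoids division by zero):

  import numpy as np

  w, sum_sq = 0.0, 0.0
  eta, eps = 0.01, 1e-8
  for _ in range(1000):
      grad = 2.0 * (w - 3.0)
      sum_sq += grad ** 2                         # only ever grows ...
      w -= eta * grad / (np.sqrt(sum_sq) + eps)   # ... so steps keep shrinking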

86 of 95

15 Optimizers

Gradient Descent Variations

Adadelta and RMSprop

Adadelta and RMSprop were proposed to solve Adagrad’s issue of ever-shrinking updates. Instead of accumulating all squared gradients from the beginning, they maintain a decaying moving average of recent squared gradients. This means recent gradients matter more than older ones, allowing the effective learning rate to adapt up or down depending on current behavior, not just shrink forever.

87 of 95

15 Optimizers

Gradient Descent Variations

Adadelta and RMSprop

Adadelta introduces a decay parameter γ (around 0.9) to control how long past gradients stay influential. It also uses a small ε (epsilon) for numerical stability. RMSprop works almost identically but uses a slightly different formulation with root-mean-square scaling, which is where its name comes from. Both algorithms converge faster than Adagrad, with Adadelta reaching zero error in around 2,500 epochs instead of 8,000.
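
A sketch of the RMSprop version: the growing sum is replaced by a decaying average, so the step size can recover.

  import numpy as np

  w, avg_sq = 0.0, 0.0
  eta, gamma, eps = 0.01, 0.9, 1e-8
  for _ in range(1000):
      grad = 2.0 * (w - 3.0)
      avg_sq = gamma * avg_sq + (1 - gamma) * grad ** 2   # recent grads dominate
      w -= eta * grad / (np.sqrt(avg_sq) + eps)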

88 of 95

15 Optimizers

Gradient Descent Variations

Adam

Adam combines the advantages of momentum and adaptive learning rates. It keeps two separate moving averages per weight:

  • One for the gradients themselves (to keep sign and direction information)
  • One for the squared gradients (to scale learning rates like Adadelta/RMSprop)

89 of 95

15 Optimizers

Gradient Descent Variations

Adam

Using both, Adam computes a bias-corrected adaptive update, leading to fast and stable convergence. In tests, Adam reaches near-zero error in around 900 epochs, faster than all previous methods. It uses two hyperparameters, β1 (0.9) controlling momentum of gradients and β2 (0.999) controlling momentum of squared gradients, which usually work well with default values.
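
A sketch of the Adam update with those defaults; the two moving averages plus the bias correction are the whole story:

  import numpy as np

  w, m, v = 0.0, 0.0, 0.0
  eta, b1, b2, eps = 0.001, 0.9, 0.999, 1e-8
  for t in range(1, 5001):
      grad = 2.0 * (w - 3.0)
      m = b1 * m + (1 - b1) * grad         # moving average of gradients
      v = b2 * v + (1 - b2) * grad ** 2    # moving average of squared gradients
      m_hat = m / (1 - b1 ** t)            # bias corrections for the zero start
      v_hat = v / (1 - b2 ** t)
      w -= eta * m_hat / (np.sqrt(v_hat) + eps)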

90 of 95

15 Optimizers

Choosing an Optimizer

There are many optimizers available in deep learning, and new ones continue to appear. Each has its own strengths and weaknesses, and there is no universally best optimizer.

In the two-moon test case, mini-batch SGD with Nesterov momentum performed the best, with Adam as a close second.

However, in more complex datasets, adaptive optimizers like Adadelta, RMSprop, and Adam often show better performance, with Adam being a common first choice thanks to its stable results and reliable defaults.

91 of 95

15 Optimizers

Choosing an Optimizer

The No Free Lunch Theorem explains why there can never be a single best optimizer for all problems. Each optimizer can perform better or worse depending on the dataset and model.

Therefore, it’s common practice to start with Adam using its default settings, and only change if necessary. Automated optimizer search tools in modern libraries can also test multiple optimizers and parameter configurations to find the best fit for a specific case.

92 of 95

15 Optimizers

Regularization

Even with a good optimizer, networks can still overfit when trained too long. Regularization techniques help delay overfitting so the model can train for more epochs while still learning useful patterns.

Dropout

Dropout is a training-only layer that temporarily disables a random subset of neurons in each batch. By forcing other neurons to compensate, it prevents any single neuron from becoming overly specialized. This distributes learning more evenly and delays overfitting.
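
One common formulation ("inverted" dropout, sketched in NumPy; the rescaling keeps the expected activation unchanged, so nothing special is needed at inference time):

  import numpy as np

  def dropout(acts, rate=0.5, training=True, rng=np.random.default_rng()):
      if not training:
          return acts                        # dropout is a training-only layer
      mask = rng.random(acts.shape) >= rate  # keep each neuron with prob 1 - rate
      return acts * mask / (1.0 - rate)      # rescale the survivors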

93 of 95

15 Optimizers

Regularization

Batch Normalization (Batchnorm)

Batchnorm normalizes and scales the outputs of a layer so that they remain in a stable range near zero, where activation functions behave efficiently. This prevents neurons from producing extreme values, stabilizes learning, and also helps reduce overfitting.
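
A training-time sketch (per-feature normalization over the batch; the learned scale and shift are shown as fixed values, and the running statistics used at inference time are omitted):

  import numpy as np

  def batchnorm(x, gamma=1.0, beta=0.0, eps=1e-5):
      mean = x.mean(axis=0)                    # per-feature mean over the batch
      var = x.var(axis=0)                      # per-feature variance
      x_hat = (x - mean) / np.sqrt(var + eps)  # centered near zero, unit scale
      return gamma * x_hat + beta              # learned rescale and shift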

94 of 95

15 Optimizers

Summary

Optimization updates weights using gradients, and the learning rate is the most critical hyperparameter. We explored gradient descent variations, momentum, adaptive optimizers like Adam, and regularization methods such as dropout and batchnorm to prevent overfitting. These techniques significantly enhance training efficiency and model generalization.

95 of 95

Thanks!

Please keep this slide for attribution.


CREDITS: This presentation template was created by Slidesgo, including icons by Flaticon, infographics & images by Freepik and illustrations by Storyset