1 of 29

Autoencoder

Dr. Dinesh K. Vishwakarma

PROFESSOR, DEPARTMENT OF INFORMATION TECHNOLOGY

DELHI TECHNOLOGICAL UNIVERSITY, DELHI.

Webpage: http://www.dtu.ac.in/Web/Departments/InformationTechnology/faculty/dkvishwakarma.php

Email: dinesh@dtu.ac.in

2 of 29

What is Autoencoder?

An Autoencoder is an unsupervised ANN that learns to compress and encode data efficiently, and then learns to reconstruct the data from the reduced encoded representation so that the result is as close to the original input as possible.
Autoencoders are regarded as unsupervised because they need no labels. Technically, they are trained with supervised learning methods in which the input itself serves as the target, which is why they are referred to as self-supervised.
The aim of an Autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”.

4/29/26

3 of 29

Applications of Autoencoder

Dimensionality Reduction
Image Compression
Image Denoising
Feature Extraction
Image generation
Sequence-to-sequence prediction
Recommendation system

4 of 29

Introduction

Autoencoders are a specific type of ANN in which the target output is the same as the input. They compress the input into a lower-dimensional code and then reconstruct the output from this representation. The code is a compact “summary” or “compression” of the input, also called the latent-space representation.
An Autoencoder consists of 3 components: encoder, code and decoder. The encoder compresses the input and produces the code; the decoder then reconstructs the input using only this code.

5 of 29

Introduction…

To build an Autoencoder, we need 3 things: an encoding method, decoding method, and a loss function to compare the output with the target.
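
The three ingredients named above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names, the tanh activation, the layer sizes and the random weights are all assumptions made for the example, and no training is performed.

```python
import numpy as np

def encode(x, W_enc, b_enc):
    """Encoding method: map the input to a lower-dimensional code."""
    return np.tanh(x @ W_enc + b_enc)

def decode(code, W_dec, b_dec):
    """Decoding method: reconstruct the input from the code alone."""
    return code @ W_dec + b_dec

def reconstruction_loss(x, x_hat):
    """Loss function: mean squared error between the target (the input) and the output."""
    return float(np.mean((x - x_hat) ** 2))

rng = np.random.default_rng(0)
input_dim, code_size = 8, 3            # illustrative sizes (assumed)
x = rng.normal(size=(5, input_dim))    # a batch of 5 samples

# Untrained, randomly initialised weights (assumed for the sketch)
W_enc = rng.normal(scale=0.1, size=(input_dim, code_size))
b_enc = np.zeros(code_size)
W_dec = rng.normal(scale=0.1, size=(code_size, input_dim))
b_dec = np.zeros(input_dim)

code = encode(x, W_enc, b_enc)         # shape (5, 3)
x_hat = decode(code, W_dec, b_dec)     # shape (5, 8), same as the input
loss = reconstruction_loss(x, x_hat)   # the quantity that training would minimise
```

Training would adjust the four weight arrays to reduce `loss`; the target is the input itself, so no labels are needed.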

Autoencoder block diagram

6 of 29

Introduction…

Autoencoders are mainly a dimensionality reduction (or compression) algorithm with the following important properties:

Data-specific: Autoencoders are only able to meaningfully compress data similar to what they have been trained on. Since they learn features specific to the given training data, they differ from a standard data compression algorithm like gzip. So we can’t expect an Autoencoder trained on handwritten digits to compress landscape photos.
Lossy: The output of the Autoencoder will not be exactly the same as the input; it will be a close but degraded representation. Lossless reconstruction is not possible.
Unsupervised: To train an Autoencoder we don’t need to do anything fancy, just throw the raw input data at it. Autoencoders are considered an unsupervised learning technique since they don’t need explicit labels to train on. To be more precise, they are self-supervised because they generate their own labels from the training data.

7 of 29

Architecture

Both the encoder and decoder are fully-connected feedforward neural networks
Code is a single layer of an ANN with the dimensionality of our choice. The number of nodes in the code layer (code size) is a hyperparameter that we set before training the Autoencoder.

The goal is to get an output identical to the input.

The decoder architecture is typically the mirror image of the encoder. This is not a requirement; the only requirement is that the input and output have the same dimension.
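
As a small illustration of the mirrored layout, the helper below builds the layer sizes of a symmetric Autoencoder. The function name and the example sizes (784, 128, 64, 32) are hypothetical and chosen only for the sketch.

```python
def mirrored_layer_sizes(input_dim, encoder_hidden, code_size):
    """Return (encoder_sizes, decoder_sizes) for a symmetric Autoencoder.

    The decoder is simply the encoder reversed, so the output dimension
    always equals the input dimension.
    """
    encoder = [input_dim, *encoder_hidden, code_size]
    decoder = encoder[::-1]
    return encoder, decoder

encoder_sizes, decoder_sizes = mirrored_layer_sizes(784, [128, 64], 32)
# encoder_sizes: [784, 128, 64, 32]
# decoder_sizes: [32, 64, 128, 784]
```

A non-symmetric decoder is equally valid as long as its last layer has `input_dim` nodes.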

8 of 29

Parameters for Training

Code size: number of nodes in the middle layer. Smaller size results in more compression.
Number of layers: the Autoencoder can be as deep as we like. In the figure above we have 2 layers in both the encoder and decoder, without considering the input and output.
Number of nodes per layer: the Autoencoder architecture we’re working on is called a stacked Autoencoder since the layers are stacked one after another. Usually stacked Autoencoders look like a “sandwich”. The number of nodes per layer decreases with each subsequent layer of the encoder, and increases back in the decoder. Also, the decoder is symmetric to the encoder in terms of layer structure. As noted above, this is not necessary and we have total control over these parameters.
Loss function: we use either mean squared error (MSE) or binary crossentropy. If the input values are in the range [0, 1] then we typically use crossentropy, otherwise we use the mean squared error.
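
The two loss functions mentioned above can be written as follows. This is a minimal NumPy sketch; the function names, the clipping constant `eps` and the sample values are assumptions made for illustration.

```python
import numpy as np

def mse(x, x_hat):
    """Mean squared error: suitable for inputs in any real-valued range."""
    return float(np.mean((x - x_hat) ** 2))

def binary_crossentropy(x, x_hat, eps=1e-7):
    """Binary crossentropy: assumes inputs and outputs lie in [0, 1]."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat)))

x = np.array([0.0, 1.0, 1.0, 0.0])           # target (the input itself)
x_hat = np.array([0.1, 0.9, 0.8, 0.2])       # reconstruction

mse_value = mse(x, x_hat)
bce_value = binary_crossentropy(x, x_hat)
```

Both values shrink toward zero as the reconstruction approaches the input.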

9 of 29

Unsupervised Data

10 of 29

Restricted Boltzmann Machine (RBM)

Model the joint probability of hidden state and observation.

Objective: maximize likelihood of the data
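
For reference, the standard binary RBM can be written as below. The symbols are assumptions not given on the slide: v is the observation (visible units), h the hidden state, W the weight matrix, a and b the bias vectors, and v^{(n)} the n-th training example.

```latex
E(v, h) = -a^{\top} v - b^{\top} h - v^{\top} W h

p(v, h) = \frac{e^{-E(v, h)}}{Z}, \qquad Z = \sum_{v, h} e^{-E(v, h)}

\max_{W, a, b} \; \sum_{n} \log p\big(v^{(n)}\big), \qquad p(v) = \sum_{h} p(v, h)
```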

11 of 29

Autoencoder… ANN

Bottleneck

Compactness of representation, measured as the compressibility.
It retains some behaviorally relevant variables from the input.

12 of 29

Loss Function used for Autoencoder

13 of 29

Deep Autoencoder

Deep Autoencoders always looked like a really nice way to do non-linear dimensionality reduction:

But it is very difficult to optimize deep Autoencoders using backpropagation.

We now have a much better way to optimize them:

First train a stack of 4 RBMs.
Then “unroll” them.
Then fine-tune with backpropagation.

14 of 29

Example: Autoencoder

Pathak, D., Krahenbuhl, P., Donahue, J., Darrell, T., & Efros, A. A. (2016). Context encoders: Feature learning by inpainting. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2536-2544).

15 of 29

Autoencoder for CNN

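[Figure: convolutional Autoencoder. The encoder applies Convolution and Pooling stages to produce the code; the decoder applies Deconvolution and Unpooling stages. The output is trained to be as close as possible to the input.]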
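
The shrinking and growing of the feature map in such a network can be illustrated with a 2x2 max pooling step and a matching unpooling step. This is a simplified sketch: nearest-neighbour repetition is used here as an assumed stand-in for unpooling, the function names are hypothetical, and the convolution and deconvolution layers are omitted.

```python
import numpy as np

def max_pool_2x2(img):
    """Keep the maximum of each non-overlapping 2x2 block (height and width must be even)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def unpool_2x2(pooled):
    """Grow the map back by repeating every value over a 2x2 block."""
    return pooled.repeat(2, axis=0).repeat(2, axis=1)

img = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(img)      # [[5, 7], [13, 15]], shape (2, 2)
restored = unpool_2x2(pooled)   # shape (4, 4), same as the input
```

The restored map has the input's shape but not its detail, which is why the decoder also needs learned deconvolution filters.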

16 of 29

De-noising

17 of 29

Types of Autoencoder

Denoising Autoencoder
Sparse Autoencoder
Deep Autoencoder
Contractive Autoencoder
Convolutional Autoencoder
Variational Autoencoder

18 of 29

De-noising Autoencoder

Denoising Autoencoders create a corrupted copy of the input by introducing some noise. This prevents the Autoencoder from simply copying the input to the output without learning features of the data. During training, these Autoencoders take a partially corrupted input and learn to recover the original undistorted input. The model learns a vector field that maps the input data towards a lower-dimensional manifold describing the natural data, which cancels out the added noise.

19 of 29

De-noising Autoencoder…

Advantage:

Provides a good representation: one that can be obtained robustly from a corrupted input and that is useful for recovering the corresponding clean input.
Corruption of the input can be done randomly by setting some of the input values to zero. The remaining values are copied unchanged into the noisy input.
Minimizes the loss function between the output and the original, uncorrupted input.
Setting up a single-thread de-noising Autoencoder is easy.
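
The random zeroing of inputs described above can be sketched as follows. The function names, the drop probability of 0.3 and the random data are assumptions for illustration; note that the loss compares the reconstruction with the clean input, not with the corrupted one.

```python
import numpy as np

def mask_corrupt(x, drop_prob, rng):
    """Set each input value to zero with probability drop_prob; copy the rest unchanged."""
    keep = rng.random(x.shape) >= drop_prob
    return x * keep

def denoising_loss(x_clean, x_reconstructed):
    """Mean squared error measured against the clean input."""
    return float(np.mean((x_clean - x_reconstructed) ** 2))

rng = np.random.default_rng(0)
x_clean = rng.random((4, 6)) + 0.1     # strictly positive, so zeros mark corrupted entries
x_noisy = mask_corrupt(x_clean, drop_prob=0.3, rng=rng)

# A trained network would map x_noisy to a reconstruction; here we only show
# that feeding back the clean input gives zero loss.
perfect_loss = denoising_loss(x_clean, x_clean)
```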

Disadvantages

To train an Autoencoder to de-noise data, it is necessary to perform a preliminary stochastic mapping that corrupts the data, and to use the corrupted data as input.
This model isn't able to develop a mapping which memorizes the training data because our input and target output are no longer the same.

20 of 29

Sparse Autoencoder

Sparse Autoencoders can have more hidden nodes than input nodes and still discover important features from the data.
In the usual visualization of a generic sparse Autoencoder, the opacity of a node corresponds to its level of activation.
A sparsity constraint is introduced on the hidden layer. This prevents the output layer from simply copying the input data.
Sparsity may be obtained either by adding terms to the loss function during training that compare the probability distribution of the hidden unit activations with some low desired value, or by manually zeroing all but the strongest hidden unit activations.
Some of the most powerful AIs in the 2010s involved sparse Autoencoders stacked inside deep neural networks (DNNs).
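
Both routes to sparsity can be sketched in NumPy. The first function is a KL-divergence penalty that compares the mean activation of each hidden unit with a low target value `rho`; the second keeps only the `k` strongest activations per sample. The function names, `rho = 0.05` and the sample activations are assumptions for illustration, and the KL form assumes activations in (0, 1).

```python
import numpy as np

def kl_sparsity_penalty(hidden, rho=0.05, eps=1e-8):
    """Sum over hidden units of KL(rho || mean activation of that unit)."""
    rho_hat = np.clip(hidden.mean(axis=0), eps, 1.0 - eps)
    kl = rho * np.log(rho / rho_hat) + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    return float(kl.sum())

def keep_top_k(hidden, k):
    """Zero all but the k largest activations in each row."""
    idx = np.argpartition(hidden, -k, axis=1)[:, -k:]   # positions of the k largest per row
    mask = np.zeros_like(hidden, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return np.where(mask, hidden, 0.0)

hidden = np.array([[0.1, 0.9, 0.3, 0.7],
                   [0.8, 0.2, 0.6, 0.4]])
penalty = kl_sparsity_penalty(hidden)   # added to the reconstruction loss during training
sparse_hidden = keep_top_k(hidden, k=2)
# sparse_hidden: [[0.0, 0.9, 0.0, 0.7],
#                 [0.8, 0.0, 0.6, 0.0]]
```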

21 of 29

Sparse Autoencoder…

Advantages

Sparse Autoencoders have a sparsity penalty, a value close to zero but not exactly zero. The sparsity penalty is applied on the hidden layer in addition to the reconstruction error. This prevents overfitting.
They take the highest activation values in the hidden layer and zero out the rest of the hidden nodes. This prevents the Autoencoder from using all of the hidden nodes at once and forces it to use only a reduced number of hidden nodes.

Disadvantage

For it to work, it is essential that the nodes which activate in a trained model are data dependent, and that different inputs result in activations of different nodes through the network.

22 of 29

Deep Autoencoder

Deep Autoencoders consist of two identical deep belief networks, one network for encoding and another for decoding.
Typically, deep Autoencoders have 4 to 5 layers for encoding and the next 4 to 5 layers for decoding.
We use unsupervised layer-by-layer pre-training for this model. The layers are Restricted Boltzmann Machines, which are the building blocks of deep belief networks. When processing the benchmark dataset MNIST, a deep Autoencoder would use binary transformations after each RBM.
Deep Autoencoders are useful in topic modeling, or statistically modeling abstract topics that are distributed across a collection of documents. They are also capable of compressing images into 30-number vectors.

23 of 29

Deep Autoencoder…

Advantages

Deep Autoencoders can be used for other types of datasets with real-valued data, on which you would use Gaussian rectified transformations for the RBMs instead.
Final encoding layer is compact and fast.

Disadvantage

Overfitting is likely, since there are more parameters than input data.
Training can be delicate: at the stage of the decoder’s backpropagation, the learning rate should be lowered or made slower depending on whether binary or continuous data is being handled.

24 of 29

Contractive Autoencoder

The objective of a contractive Autoencoder is to learn a robust representation that is less sensitive to small variations in the data.
Robustness of the representation is achieved by adding a penalty term to the loss function. The contractive Autoencoder is thus another regularization technique, just like sparse and denoising Autoencoders.
Here, the regularizer is the squared Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input, that is, the sum of the squares of all elements of that Jacobian.
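
For a single sigmoid encoder layer, this penalty has a simple closed form. The sketch below assumes an encoder h = sigmoid(Wx + b); the function names, the layer sizes and the random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of the Jacobian dh/dx for h = sigmoid(W @ x + b).

    The Jacobian is J[j, i] = h_j * (1 - h_j) * W[j, i], so the sum of its
    squared elements factorises per hidden unit.
    """
    h = sigmoid(W @ x + b)
    return float(np.sum((h * (1.0 - h)) ** 2 * np.sum(W ** 2, axis=1)))

rng = np.random.default_rng(0)
x = rng.normal(size=5)           # one input vector
W = rng.normal(size=(3, 5))      # 3 hidden units
b = rng.normal(size=3)

penalty = contractive_penalty(x, W, b)

# Cross-check against the explicitly built Jacobian
h = sigmoid(W @ x + b)
jacobian = (h * (1.0 - h))[:, None] * W
explicit = float(np.sum(jacobian ** 2))
```

During training this penalty, scaled by a weighting coefficient, would be added to the reconstruction loss.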

This model learns an encoding in which similar inputs have similar encodings. Hence, we're forcing the model to learn how to contract a neighborhood of inputs into a smaller neighborhood of outputs.

Advantage

A contractive Autoencoder can be a better choice than a denoising Autoencoder for learning useful features.

25 of 29

Convolutional Autoencoder

Autoencoders in their traditional formulation do not take into account the fact that a signal can be seen as a sum of other signals. Convolutional Autoencoders use the convolution operator to exploit this observation.
They learn to encode the input as a set of simple signals and then try to reconstruct the input from them, or to modify the geometry or the reflectance of the image.
They are state-of-the-art tools for unsupervised learning of convolutional filters. Once these filters have been learned, they can be applied to any input in order to extract features. These features can then be used for any task that requires a compact representation of the input, such as classification.

26 of 29

Convolutional Autoencoder…

Advantage

Due to their convolutional nature, they scale well to realistic-sized high dimensional images.
They can remove noise from a picture or reconstruct missing parts.

Disadvantage

The reconstruction of the input image is often blurry and of lower quality due to compression during which information is lost.

27 of 29

Variational Autoencoder

Variational Autoencoder models make strong assumptions concerning the distribution of latent variables.
They use a variational approach for latent representation learning, which results in an additional loss component and a specific estimator for the training algorithm called the Stochastic Gradient Variational Bayes estimator.
It assumes that the data is generated by a directed graphical model p_θ(x|z) and that the encoder is learning an approximation q_φ(z|x) to the posterior distribution p_θ(z|x), where φ and θ denote the parameters of the encoder (recognition model) and decoder (generative model) respectively.
The probability distribution of the latent vector of a variational Autoencoder typically matches that of the training data much more closely than that of a standard Autoencoder.
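
The additional loss component can be made explicit. In the usual formulation, the loss for one data point x is the negative evidence lower bound: a reconstruction term plus a KL divergence that pulls the encoder's distribution towards a prior p(z). The choice of prior (commonly a standard normal) is an assumption not stated on the slide.

```latex
\mathcal{L}(\theta, \phi; x)
  = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  + D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```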

28 of 29

Variational Autoencoder…

Advantage

It gives significant control over how we want to model our latent distribution unlike the other models.
After training, you can sample from the latent distribution and decode the sample to generate new data.

Disadvantage

Training relies on backpropagation, which computes the gradient of the final output loss with respect to each parameter in the network. Because drawing the latent vector is a random operation that gradients cannot pass through directly, the sampling process requires some extra attention (for example, the reparameterization trick).
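
One common way to handle the sampling step is the reparameterization trick: the randomness is moved into an auxiliary noise variable so that the sample becomes a deterministic function of the encoder outputs. The sketch below assumes a diagonal Gaussian encoder that outputs a mean `mu` and a log-variance `log_var`, and a standard normal prior; the function names and values are illustrative.

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so z is differentiable in mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL divergence between N(mu, diag(exp(log_var))) and N(0, I)."""
    return float(-0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var)))

rng = np.random.default_rng(0)
mu = np.zeros((2, 3))        # encoder mean for 2 samples, 3 latent dimensions
log_var = np.zeros((2, 3))   # log-variance 0 means unit variance

z = sample_latent(mu, log_var, rng)          # shape (2, 3)
kl = kl_to_standard_normal(mu, log_var)      # zero when the encoder matches the prior
```

After training, new data is generated by sampling z from the prior and passing it through the decoder.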
