1 of 18

Intro to Machine Learning

Girish Varma

IIIT Hyderabad

http://bit.ly/2tzcXHu

2 of 18

A Machine Learning Problem

Given an image of a handwritten digit, find the digit.

There is no well-defined function from the input to the output.

3 of 18

Programming vs Machine Learning

Machine Learning:

Find the handwritten digit in an image.

  • Collect (image, digit) pairs (dataset).
  • Train a machine learning model to fit the dataset.
  • Given a new image, apply the model to get the digit (testing or inference).

Programming:

Find the shortest path in an input graph G.

  • Implement Dijkstra's algorithm for shortest path in a programming language.

4 of 18

Dataset

  • Consists of (x, y) pairs, where x is the input and y is called the label.
  • Examples
    • MNIST: x is a 28x28 grayscale image of a handwritten digit, y is a digit from 0 to 9.
    • CIFAR10: x is a 32x32 color image, y is a label in {aeroplane, automobile, bird, cat, ...}. y is given as a number from 0 to 9, and there is a mapping between the numbers and the class names.
  • Divided into train, validation and test splits (a loading sketch follows below).
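
As a concrete illustration, here is a minimal sketch of loading the MNIST dataset and inspecting one (x, y) pair. It assumes the PyTorch and torchvision libraries are available; the paths and batch size are illustrative choices, not from the slides.

```python
# Minimal sketch: loading MNIST with torchvision (assumed dependency).
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader

# Train and test splits; a validation split can be carved out of the train set.
train_set = torchvision.datasets.MNIST(root="./data", train=True, download=True,
                                       transform=transforms.ToTensor())
test_set = torchvision.datasets.MNIST(root="./data", train=False, download=True,
                                      transform=transforms.ToTensor())

x, y = train_set[0]          # one (image, label) pair
print(x.shape, y)            # torch.Size([1, 28, 28]) and a digit in 0..9

train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
```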

5 of 18

Tensors

All data, intermediate outputs and learnable parameters are represented as tensors.

A machine learning model transforms an input tensor to an output tensor.

Tensors have a shape.

  • A tensor T with shape [10,10] is equivalent to a 10x10 matrix. It can be indexed by 2 numbers; T[i,j] is a real number.
  • Tensors can be 3D. T with shape [5, 10, 15] can be indexed by 3 numbers i, j, k (i <= 5, j <= 10, k <= 15).
  • Tensors can have arbitrary shapes. T with shape [100, 32, 32, 3] can represent 100 color images, each 32x32 in size (see the sketch below).
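
A small sketch of tensor shapes, using PyTorch as an assumed library (NumPy arrays behave the same way):

```python
# Minimal sketch of tensors and their shapes (PyTorch assumed).
import torch

T2 = torch.randn(10, 10)                # shape [10, 10], like a 10x10 matrix
print(T2.shape, T2[3, 7])               # T2[i, j] is a single real number

T3 = torch.randn(5, 10, 15)             # 3D tensor, indexed by i, j, k
images = torch.randn(100, 32, 32, 3)    # 100 color images of size 32x32
print(images.shape)                     # torch.Size([100, 32, 32, 3])
```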

6 of 18

Model

The function that maps the input to the output: y = f𝛉(x)

A model has learnable parameters, 𝛉.

  • Fit a line to a set of points.
    • Slope and offset are learnable parameters.
  • Fit a degree 4 polynomial.
    • Coefficients are learnable parameters.
  • Fit a Multilayered perceptron.
    • Weights and biases are learnable parameters (see the sketch below).
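
For instance, fitting a line or a degree-4 polynomial takes only a few lines; the sketch below uses NumPy's polyfit on made-up data (polyfit is not mentioned in the slides, just one convenient way to do the fit):

```python
# Minimal sketch: learnable parameters of simple models (NumPy assumed).
import numpy as np

x = np.linspace(0, 1, 50)
y = 2.0 * x + 0.5 + 0.05 * np.random.randn(50)   # noisy points along a line

slope, offset = np.polyfit(x, y, 1)              # line: 2 learnable parameters
coeffs = np.polyfit(x, y, 4)                     # degree-4 polynomial: 5 coefficients
print(slope, offset, coeffs)
```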

7 of 18

The Neural Network Model

  • Neuron or Perceptron
    • Input X is n dimensional, Y is 1 dimensional.
    • Has learnable parameters W = (W₁, W₂, ..., Wₙ) (weights) and b (bias).
    • Y = 𝞂(∑ᵢ WᵢXᵢ + b)
    • 𝞂 is a non-linear activation function.
  • Fully Connected or Linear
    • Y is also multidimensional (dimension m).
    • Has learnable parameters W = (Wᵢⱼ) and b = (bⱼ), where i <= n, j <= m.
    • Y = 𝞂(WX + b) (see the sketch below).
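
A minimal NumPy sketch of a single neuron and of a fully connected layer, using sigmoid as one possible choice of the activation 𝞂 (sizes and data are illustrative):

```python
# Minimal sketch of a neuron and a fully connected layer (NumPy assumed).
import numpy as np

def sigma(z):                       # an example non-linear activation (sigmoid)
    return 1.0 / (1.0 + np.exp(-z))

n, m = 4, 3
x = np.random.randn(n)

# Single neuron: weights W (n values) and bias b (a scalar).
W = np.random.randn(n); b = 0.1
y = sigma(np.dot(W, x) + b)         # Y = sigma(sum_i W_i X_i + b)

# Fully connected layer: W is an m x n matrix, b has m entries.
W_fc = np.random.randn(m, n); b_fc = np.zeros(m)
y_fc = sigma(W_fc @ x + b_fc)       # Y = sigma(WX + b), m-dimensional output
```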

8 of 18

MNIST Classification

Input : x is a [28,28] shaped tensor, giving the pixel values of the image.

Output : y is a [10] shaped tensor, giving the probabilities of the digit being 0 to 9.

If the dataset gives y as a digit, convert it to a probability vector by one-hot encoding.

Use the softmax function to convert real-valued outputs to probabilities (a sketch follows below).
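
A small sketch of one-hot encoding a digit label and applying softmax, assuming PyTorch:

```python
# Minimal sketch: one-hot encoding and softmax (PyTorch assumed).
import torch
import torch.nn.functional as F

y = torch.tensor(3)                          # the label given as a digit
y_onehot = F.one_hot(y, num_classes=10)      # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]

logits = torch.randn(10)                     # real-valued model outputs
probs = torch.softmax(logits, dim=0)         # probabilities, summing to 1
print(y_onehot, probs.sum())
```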

9 of 18

Multilayered Network

Complex data can only be fit by more complex models.

Obtain complex models by stacking multiple linear layers, with non-linear activations in between.

Multilayered Perceptron (MLP)

  • Multiple Linear layers, one following the other.
  • Y = 𝞂(V𝞂(WX + b) + c)
  • Intermediate outputs are called hidden units (see the sketch below).
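
The formula Y = 𝞂(V𝞂(WX + b) + c) translates directly into code; a NumPy sketch with made-up layer sizes:

```python
# Minimal sketch of a two-layer MLP, Y = sigma(V sigma(WX + b) + c) (NumPy assumed).
import numpy as np

sigma = lambda z: 1.0 / (1.0 + np.exp(-z))   # example activation

x = np.random.randn(8)                       # input
W, b = np.random.randn(16, 8), np.zeros(16)  # first linear layer
V, c = np.random.randn(4, 16), np.zeros(4)   # second linear layer

h = sigma(W @ x + b)                         # hidden units (intermediate outputs)
y = sigma(V @ h + c)                         # final output
```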

10 of 18

A MLP model for MNIST

Pipeline: Reshape → Fully Connected Layer → Fully Connected Layer → Softmax → predicted probabilities p(0), p(1), ..., p(9) for the different digits.
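
A sketch of this pipeline as a PyTorch model; the hidden size of 128 is an arbitrary choice, not from the slides:

```python
# Minimal sketch of the MNIST MLP: reshape -> fully connected -> fully connected -> softmax.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),               # reshape [28, 28] images to 784-dimensional vectors
    nn.Linear(784, 128),        # first fully connected layer
    nn.ReLU(),
    nn.Linear(128, 10),         # second fully connected layer
    nn.Softmax(dim=1),          # probabilities p(0), ..., p(9)
)

x = torch.randn(1, 1, 28, 28)   # a batch with one (dummy) image
print(model(x))                 # 10 predicted probabilities
```

In practice the final softmax is often dropped from the model and folded into the cross-entropy loss during training, as in the training-loop sketch later.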

11 of 18

Training a Model

The process of finding the right parameters for the model.

12 of 18

Loss Function

  • Loss Function : A function that computes the difference between the predicted output and the correct output.
    • Eg: Mean Squared Error (f(x) − y_correct)². y_correct is also called the ground truth.
    • Eg: Cross Entropy Loss −∑ᵢ y_correct(i) log y_pred(i) (a sketch follows below).
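
A sketch of both losses on a single example, assuming PyTorch (the numbers are made up):

```python
# Minimal sketch of mean squared error and cross entropy loss (PyTorch assumed).
import torch

# Mean Squared Error: (f(x) - y_correct)^2, averaged over the outputs.
pred, target = torch.tensor([2.5]), torch.tensor([3.0])
mse = torch.mean((pred - target) ** 2)

# Cross Entropy: -sum_i y_correct(i) * log y_pred(i), with y_correct one-hot.
y_pred = torch.tensor([0.1, 0.7, 0.2])       # predicted probabilities
y_correct = torch.tensor([0.0, 1.0, 0.0])    # one-hot ground truth
ce = -torch.sum(y_correct * torch.log(y_pred))
print(mse.item(), ce.item())
```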

13 of 18

Gradient Descent

Gradient Descent : Change the parameters 𝛉 slightly in the direction that decreases the loss function: 𝛉 ← 𝛉 − η · ∂L/∂𝛉, where η is the learning rate. Gradients are the partial derivatives of the loss function w.r.t. the parameters.
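
A tiny sketch of this update rule on a one-parameter loss L(𝛉) = (𝛉 − 3)², with an illustrative learning rate:

```python
# Minimal sketch of gradient descent on L(theta) = (theta - 3)^2 (pure Python).
theta = 0.0           # initial parameter
eta = 0.1             # learning rate

for step in range(50):
    grad = 2 * (theta - 3.0)     # dL/dtheta, the partial derivative of the loss
    theta = theta - eta * grad   # move slightly so that the loss decreases

print(theta)          # approaches 3, the minimizer of the loss
```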

14 of 18

Backpropagation

Backpropagation : The process of finding the gradients of parameters in a multilayered network.
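
A minimal sketch of backpropagation through a one-hidden-layer network with sigmoid activations and squared error loss, using NumPy; all sizes and the random data are illustrative:

```python
# Minimal sketch of backpropagation: chain rule applied layer by layer, last to first.
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=(4,))                      # input (n = 4)
y = np.array([1.0])                            # target (1-dimensional output)
W, b = rng.normal(size=(3, 4)), np.zeros(3)    # first layer
V, c = rng.normal(size=(1, 3)), np.zeros(1)    # second layer

# Forward pass
h = sigmoid(W @ x + b)                         # hidden units
y_pred = sigmoid(V @ h + c)
loss = np.mean((y_pred - y) ** 2)

# Backward pass: gradients of the loss w.r.t. each parameter
d_ypred = 2 * (y_pred - y) / y.size
d_z2 = d_ypred * y_pred * (1 - y_pred)         # through the output sigmoid
dV, dc = np.outer(d_z2, h), d_z2               # gradients of the last layer
d_h = V.T @ d_z2                               # propagate back to the hidden units
d_z1 = d_h * h * (1 - h)                       # through the hidden sigmoid
dW, db = np.outer(d_z1, x), d_z1               # gradients of the first layer
```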

15 of 18

Training Algorithm

  • Initialize model with random parameters.
  • Repeat
    • Take a small random subset of the dataset that will fit in memory (minibatch).
    • Forward Pass : pass the minibatch through the model and obtain predictions.
    • Compute the mean loss over the minibatch.
    • Backward Pass : compute the gradients of the parameters, from the last layer to the first.
    • Update the parameters using the gradients and the learning rate (see the sketch after this list).
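
Putting the steps together, a minimal PyTorch sketch of the training loop; the dummy data, layer sizes and learning rate are illustrative, not from the slides:

```python
# Minimal sketch of the training algorithm (PyTorch assumed).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for MNIST: 512 random "images" with random digit labels.
images = torch.randn(512, 1, 28, 28)
labels = torch.randint(0, 10, (512,))
loader = DataLoader(TensorDataset(images, labels), batch_size=64, shuffle=True)

# Random initial parameters; CrossEntropyLoss applies the softmax internally,
# so the model outputs raw scores here.
model = nn.Sequential(nn.Flatten(), nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(3):                  # repeat
    for x, y in loader:                 # a minibatch that fits in memory
        preds = model(x)                # forward pass: obtain predictions
        loss = loss_fn(preds, y)        # mean loss over the minibatch
        optimizer.zero_grad()
        loss.backward()                 # backward pass: gradients, last layer to first
        optimizer.step()                # update the parameters using the learning rate
```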

16 of 18

Overfitting

17 of 18

Testing or Inference

18 of 18

Some References