1 of 98

UNIT 3: CLASSIFICATION

PREPARED BY : NEHAL SHAH

2 of 98

CLASSIFICATION

  • Logistic Regression
    • Cost Function
    • Problem of Overfitting
    • Regularization
  • Support Vector Machine
    • Support Vector
    • Kernel
  • K-Nearest Neighbor (KNN)

3 of 98

Classification:

  • Classification algorithms are used when the output variable is categorical, for example two classes such as Yes/No, Male/Female, or True/False.
  • Example application: Spam Filtering
  • Common algorithms:
    • Random Forest
    • Decision Trees
    • Logistic Regression
    • Support Vector Machines

4 of 98

The following are the steps involved in building a classification model:

  • Initialize the classifier to be used.
  • Train the classifier: All classifiers in scikit-learn use a fit(X, y) method to fit (train) the model on the given training data X and training labels y.
  • Predict the target: Given an unlabeled observation X, predict(X) returns the predicted label y.
  • Evaluate the classifier model
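As a sketch, the four steps above might look like the following in scikit-learn; the Iris dataset and the k-NN classifier are illustrative assumptions, not the only choices.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5)   # 1. initialize the classifier
clf.fit(X_train, y_train)                   # 2. train: fit(X, y)
y_pred = clf.predict(X_test)                # 3. predict: predict(X)
print(accuracy_score(y_test, y_pred))       # 4. evaluate the model
```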

5 of 98

Classification Terminologies In Machine Learning:

  • Classifier – An algorithm that is used to map the input data to a specific category.
  • Classification Model – A model that draws a conclusion from the input data given for training; it predicts the class or category for new data.
  • Feature – A feature is an individual measurable property of the phenomenon being observed.
  • Binary Classification – A type of classification with two outcomes, e.g. either true or false.
  • Multi-Class Classification – Classification with more than two classes; in multi-class classification each sample is assigned to one and only one label or target.
  • Multi-label Classification – A type of classification where each sample is assigned to a set of labels or targets.

6 of 98

Classification Algorithms

7 of 98

Algorithm Selection:

8 of 98

What is the Classification Algorithm?

  • The Classification algorithm is a Supervised Learning technique that is used to identify the category of new observations on the basis of training data.
  • In Regression algorithms we predict the output for continuous values, but to predict categorical values we need Classification algorithms.
  • In Classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups.
  • Examples: Yes or No, 0 or 1, Spam or Not Spam, cat or dog, etc. Classes can be called targets/labels or categories.
  • Unlike regression, the output variable of Classification is a category, not a value, such as "Green or Blue", "fruit or animal", etc.

9 of 98

  • Since the Classification algorithm is a Supervised learning technique, hence it takes labeled input data, which means it contains input with the corresponding output.
  • In a classification algorithm, a discrete output function (y) is mapped to the input variable (x):
  • y = f(x), where y is the categorical output
  • The best example of an ML classification algorithm is Email Spam Detector.
  • The main goal of the Classification algorithm is to identify the category of a given dataset, and these algorithms are mainly used to predict the output for the categorical data.

10 of 98

  • Classification algorithms can be better understood using the below diagram. 
  • In the below diagram, there are two classes, class A and Class B. 
  • These classes have features that are similar to each other and dissimilar to other classes.

11 of 98

  • The algorithm which implements the classification on a dataset is known as a classifier. 
  • There are two types of Classifications:
  • 1) Binary Classifier: If the classification problem has only two possible outcomes, then it is called a Binary Classifier. Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
  • 2) Multi-class Classifier: If a classification problem has more than two outcomes, then it is called a Multi-class Classifier. Examples: classification of types of crops, classification of types of music.

12 of 98

  • Learners in Classification Problems:
  • In the classification problems, there are two types of learners:
  • Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner's case, classification is done on the basis of the most related data stored in the training dataset. It takes less time in training but more time for predictions. Example: K-NN algorithm, case-based reasoning.
  • Eager Learners: Eager learners develop a classification model based on a training dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning and less time in prediction. Example: Decision Trees, Naïve Bayes, ANN.

13 of 98

Types of ML Classification Algorithms:�

  • Classification algorithms can be mainly divided into two categories:
  • Linear Models
    • Logistic Regression
    • Support Vector Machines
  • Non-linear Models
    • K-Nearest Neighbors
    • Kernel SVM
    • Naïve Bayes
    • Decision Tree Classification

14 of 98

LOGISTIC REGRESSION

15 of 98

  • Logistic regression is one of the most popular Machine Learning algorithms, which comes under the Supervised Learning technique. It is used for predicting the categorical dependent variable using a given set of independent variables.
  • Linear Regression is used for solving Regression problems, whereas Logistic regression is used for solving the classification problems.
  • In Logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, whose output is bounded by the two limiting values 0 and 1.
  • The curve from the logistic function indicates the likelihood of something such as whether the cells are cancerous or not, a mouse is obese or not based on its weight, etc.

16 of 98

  • Logistic Regression is a significant machine learning algorithm because it has the ability to provide probabilities and classify new data using continuous and discrete datasets.
  • Logistic Regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification. The image below shows the logistic function:
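The logistic function referenced above is the sigmoid, 1 / (1 + e^(−z)); a minimal sketch of how it maps any real number into (0, 1):

```python
import math

def sigmoid(z):
    """Logistic function: squeezes any real input into (0, 1),
    so its output can be read as a probability P(Y=1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5: the decision boundary
print(sigmoid(4))    # close to 1
print(sigmoid(-4))   # close to 0
```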

17 of 98

  • Logistic regression uses the same predictive-modeling concept as regression, which is why it is called logistic regression; but because it is used to classify samples, it falls under classification algorithms.
  • Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest ML algorithms that can be used for various classification problems such as spam detection, Diabetes prediction, cancer detection etc.

18 of 98

Types of Logistic Regression�

  • Generally, logistic regression means binary logistic regression with a binary target variable, but there are two more categories of target variable that it can predict. Based on the number of categories, logistic regression can be divided into the following types −
  • Binary or Binomial
  • In this kind of classification, the dependent variable has only two possible types, either 1 or 0. For example, these variables may represent success or failure, yes or no, win or loss, etc.
  • Multinomial
  • In this kind of classification, the dependent variable can have 3 or more possible unordered types, or types having no quantitative significance. For example, these variables may represent "Type A", "Type B", or "Type C".
  • Ordinal
  • In this kind of classification, the dependent variable can have 3 or more possible ordered types, or types having a quantitative significance. For example, these variables may represent "poor", "good", "very good", or "excellent", and each category can have a score like 0, 1, 2, 3.

19 of 98

| Linear Regression | Logistic Regression |
| --- | --- |
| Used to predict a continuous dependent variable from a given set of independent variables. | Used to predict a categorical dependent variable from a given set of independent variables. |
| Used for solving Regression problems. | Used for solving Classification problems. |
| We predict the value of continuous variables. | We predict the values of categorical variables. |
| We find the best-fit line, by which we can easily predict the output. | We find the S-curve, by which we can classify the samples. |
| The least-squares method is used for parameter estimation. | The maximum likelihood method is used for parameter estimation. |
| The output must be a continuous value, such as price, age, etc. | The output must be a categorical value such as 0 or 1, Yes or No, etc. |
| The relationship between the dependent and independent variables must be linear. | A linear relationship between the dependent and independent variables is not required. |
| There may be collinearity between the independent variables. | There should not be collinearity between the independent variables. |

20 of 98

Cost Function

21 of 98

Overfitting

22 of 98

Regularization

23 of 98

KNN

24 of 98

  • The K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used for both classification and regression predictive problems.
  • Classification − a category or class as output.
  • Regression − a real number (a number with a decimal point) as output.
  • The KNN algorithm assumes that similar things exist in close proximity.
  • In other words, similar things are near to each other.
  • K-NN is a non-parametric algorithm, which means it does not make any assumption about the underlying data.
  • It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead it stores the dataset, and at the time of classification it performs an action on the dataset.

25 of 98

 

[Figure: KNN example]

26 of 98

  • What is ‘K’ in KNN…?
  • The number of neighbours considered when predicting a new data point.
  • A kNN classifier determines the class of a data point by the majority voting principle.
  • If k is set to 5, the classes of the 5 closest points are checked, and the prediction follows the majority class.
  • Similarly, kNN regression takes the mean value of the 5 closest points.
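The voting and averaging described above can be sketched in a few lines of plain Python; the neighbour labels and values are made-up assumptions for illustration:

```python
from collections import Counter

# Classification: majority vote over the k = 5 nearest neighbours' classes
neighbour_classes = ["cat", "dog", "cat", "cat", "dog"]
prediction = Counter(neighbour_classes).most_common(1)[0][0]
print(prediction)  # "cat" wins, 3 votes to 2

# Regression: mean value of the k = 5 nearest neighbours' targets
neighbour_values = [4.0, 5.0, 5.5, 6.0, 4.5]
print(sum(neighbour_values) / len(neighbour_values))  # 5.0
```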

27 of 98

How to find the closest point? Euclidean Distance

  • d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)
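A minimal implementation of the Euclidean distance; the example points are illustrative assumptions:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0 (the classic 3-4-5 triangle)
```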

28 of 98

  • Algorithm:
  • Step 1 − For implementing any algorithm, we need a dataset. So during the first step of KNN, we must load the training as well as test data.
  • Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points. K can be any integer.
  • Step 3 − For each point in the test data, do the following −
  • 3.1 − Calculate the distance between the test point and each row of training data using one of the distance measures, namely Euclidean, Manhattan, or Hamming distance. The most commonly used is Euclidean.

29 of 98

  • 3.2 − Now, based on the distance values, sort them in ascending order.
  • 3.3 − Next, choose the top K rows from the sorted array.
  • 3.4 − Now, assign a class to the test point based on the most frequent class of these rows.
  • Step 4 − End
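Putting Steps 1–4 together, a from-scratch sketch might look like this; the toy 2-D dataset is an illustrative assumption:

```python
import math
from collections import Counter

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_X, train_y, test_point, k=3):
    # 3.1-3.2: compute the distance to every training row, sorted ascending
    distances = sorted(
        (euclidean(row, test_point), label)
        for row, label in zip(train_X, train_y)
    )
    # 3.3: take the top k rows
    top_k = [label for _, label in distances[:k]]
    # 3.4: the most frequent class wins
    return Counter(top_k).most_common(1)[0][0]

train_X = [(1, 1), (1, 2), (2, 2), (6, 6), (7, 7), (6, 7)]
train_y = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(train_X, train_y, (2, 1), k=3))  # "A": nearest the first cluster
```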

30 of 98

Decision Tree

31 of 98

  • Used for both Classification and Regression.
  • As the name suggests, it uses a tree-like model of decisions.
  • Decision trees can be used either to drive informal discussion or to map out an algorithm that predicts the best choice mathematically.
  • The goal of using a Decision Tree is to create a training model that can be used to predict the class or value of the target variable by learning simple decision rules inferred from prior data (training data).

32 of 98

  • Important Terminology related to Decision Trees:
  • Root Node: It represents the entire population or sample and this further gets divided into two or more homogeneous sets.
  • Splitting: It is a process of dividing a node into two or more sub-nodes.
  • Decision Node: When a sub-node splits into further sub-nodes, then it is called the decision node.
  • Leaf / Terminal Node: Nodes that do not split are called Leaf or Terminal nodes.

33 of 98

  • Pruning: When we remove sub-nodes of a decision node, the process is called pruning. It can be seen as the opposite of splitting.
  • Branch / Sub-Tree: A subsection of the entire tree is called a branch or sub-tree.
  • Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.

34 of 98


35 of 98

  • A decision tree for the concept "buys computer", indicating whether an All Electronics customer is likely to purchase a computer.
  • Each internal (non-leaf) node represents a test on an attribute.
  • Each leaf node represents a class (either buys_computer = yes or buys_computer = no).

36 of 98

37 of 98

  • Attribute Selection Measures:
  • If the dataset consists of N attributes then deciding which attribute to place at the root or at different levels of the tree as internal nodes is a complicated step.
  • Just randomly selecting any node to be the root can't solve the issue.
  • If we follow a random approach, it may give us bad results with low accuracy.
  • For solving this attribute selection problem, researchers worked and devised some solutions.

38 of 98

  • They suggested using some criteria like :
  • Entropy
  • Information gain
  • Gini index
  • Gain Ratio
  • Reduction in Variance
  • Chi-Square

39 of 98

  • Steps for making a decision tree:
  • Get the list of rows (dataset) which are taken into consideration for making the decision tree (recursively at each node).
  • Calculate the uncertainty of our dataset, i.e. its Gini impurity − how mixed up the data is.
  • Generate a list of all questions which need to be asked at that node.
  • Partition the rows into True rows and False rows based on each question asked.
  • Calculate the information gain based on the Gini impurity and the partition of data from the previous step.
  • Update the highest information gain based on each question asked.

40 of 98

  • Update the best question based on information gain (higher information gain).
  • Divide the node on the best question. Repeat from step 1 until we get pure nodes (leaf nodes).
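The Gini impurity and information gain used in the steps above can be sketched as follows; the tiny label lists are illustrative assumptions:

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def info_gain(parent, left, right):
    """Weighted drop in impurity after partitioning into True/False rows."""
    w = len(left) / len(parent)
    return gini(parent) - (w * gini(left) + (1 - w) * gini(right))

parent = ["yes", "yes", "no", "no", "yes", "no"]
left   = ["yes", "yes", "yes"]   # rows answering the question True
right  = ["no", "no", "no"]      # rows answering the question False
print(gini(parent))                    # 0.5: maximally mixed two-class node
print(info_gain(parent, left, right))  # 0.5: a perfect split removes all impurity
```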
  • Advantages of Decision Trees:
  • Easy to use and understand.
  • Can handle both categorical and numerical data.
  • Resistant to outliers, hence require little data preprocessing.

41 of 98

  • Disadvantages of Decision Trees:
  • Prone to overfitting.
  • Require some kind of measurement of how well they are doing.
  • Need to be careful with parameter tuning.
  • Can create biased trees if some classes dominate.

42 of 98

Naïve Bayes Algorithm

43 of 98

  • The Naïve Bayes algorithm is a supervised learning algorithm, based on Bayes' theorem and used for solving classification problems.
  • It is mainly used in text classification, which involves high-dimensional training datasets.
  • The Naïve Bayes classifier is one of the simplest and most effective classification algorithms, and it helps in building fast machine learning models that can make quick predictions.
  • It is a probabilistic classifier, which means it predicts on the basis of the probability of an object.
  • Some popular applications of the Naïve Bayes algorithm are spam filtering, sentiment analysis, and classifying articles.

44 of 98

  • Why is it called Naïve Bayes?
  • The name Naïve Bayes is composed of two words, Naïve and Bayes, which can be described as:
  • 1) Naïve: It is called naïve because it assumes that the occurrence of a certain feature is independent of the occurrence of other features.
  • For example, if a fruit is identified on the basis of color, shape, and taste, then a red, spherical, and sweet fruit is recognized as an apple.
  • Hence each feature individually contributes to identifying it as an apple, without depending on the others.
  • 2) Bayes: It is called Bayes because it depends on the principle of Bayes' Theorem.

45 of 98

  • Bayes' Theorem:
  • Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is used to determine the probability of a hypothesis with prior knowledge.
  •  It depends on the conditional probability.
  • The formula for Bayes' theorem is: P(A|B) = P(B|A) × P(A) / P(B)

46 of 98

47 of 98

  • Where,
  • P(A|B) is Posterior probability: Probability of hypothesis A on the observed event B.
  • P(B|A) is the Likelihood: the probability of the evidence B given that hypothesis A is true.
  • P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
  • P(B) is Marginal Probability: Probability of Evidence.
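The theorem is a one-line computation; the weather numbers below are illustrative assumptions, not from the slides:

```python
def bayes(p_b_given_a, p_a, p_b):
    """Posterior P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# Assumed numbers: P(rain) = 0.2, P(clouds) = 0.4, P(clouds|rain) = 0.9
print(bayes(0.9, 0.2, 0.4))  # P(rain|clouds) = 0.45
```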

48 of 98

  • Working of Naïve Bayes' Classifier:
  • Working of Naïve Bayes' Classifier can be understood with the help of the below example:
  • Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether we should play on a particular day according to the weather conditions. To solve this problem, we need to follow the below steps:
  • Convert the given dataset into frequency tables.
  • Generate Likelihood table by finding the probabilities of given features.
  • Now, use Bayes theorem to calculate the posterior probability.

49 of 98


50 of 98

Problem: If the weather is sunny, should the player play or not? Solution: To solve this, first consider the below dataset:

51 of 98

  • Applying Bayes' theorem:
  • P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
  • P(Sunny|Yes) = 3/10 = 0.3
  • P(Sunny) = 0.35
  • P(Yes) = 0.71
  • So P(Yes|Sunny) = 0.3 * 0.71 / 0.35 = 0.60
  • P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
  • P(Sunny|No) = 2/4 = 0.5
  • P(No) = 0.29
  • P(Sunny) = 0.35
  • So P(No|Sunny) = 0.5 * 0.29 / 0.35 = 0.41
  • As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny)
  • Hence on a sunny day, the player can play the game.
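The arithmetic above can be checked directly; the probabilities are taken from the slide (0.3 × 0.71 / 0.35 is ≈ 0.61, which the slide rounds down to 0.60):

```python
# Probabilities read off the frequency/likelihood tables on the slide
p_sunny_yes, p_yes = 3 / 10, 0.71
p_sunny_no, p_no = 2 / 4, 0.29
p_sunny = 0.35

p_yes_sunny = p_sunny_yes * p_yes / p_sunny   # ≈ 0.61
p_no_sunny = p_sunny_no * p_no / p_sunny      # ≈ 0.41
print(p_yes_sunny > p_no_sunny)               # True: the player can play
```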

52 of 98

  • Advantages of Naïve Bayes Classifier:
  • Naïve Bayes is one of the fastest and easiest ML algorithms for predicting the class of a dataset.
  • It can be used for Binary as well as Multi-class Classifications.
  • It performs well in Multi-class predictions as compared to the other Algorithms.
  • It is the most popular choice for text classification problems.

53 of 98

  • Disadvantages of Naïve Bayes Classifier:
  • Naive Bayes assumes that all features are independent or unrelated, so it cannot learn the relationship between features.

54 of 98

  • Applications of Naïve Bayes Classifier:
  • It is used for Credit Scoring.
  • It is used in medical data classification.
  • It can be used in real-time predictions because Naïve Bayes Classifier is an eager learner.
  • It is used in Text classification such as Spam filtering and Sentiment analysis.

55 of 98

  • Python Implementation of the Naïve Bayes algorithm:
  • Now we will implement the Naïve Bayes algorithm using Python. For this, we will use the "user_data" dataset, which we have used in our other classification models, so we can easily compare the Naïve Bayes model with the other models.
  • Steps to implement:
  • Data Pre-processing step
  • Fitting Naive Bayes to the Training set
  • Predicting the test result
  • Test accuracy of the result(Creation of Confusion matrix)
  • Visualizing the test set result.
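A minimal sketch of the listed steps with scikit-learn's GaussianNB; the Iris dataset stands in for the "user_data" dataset as an assumption, and the visualization step is omitted:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import confusion_matrix, accuracy_score

# Data pre-processing: load and split the dataset
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB().fit(X_train, y_train)   # fit Naive Bayes to the training set
y_pred = model.predict(X_test)               # predict the test result

print(confusion_matrix(y_test, y_pred))      # test accuracy via confusion matrix
print(accuracy_score(y_test, y_pred))
```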

56 of 98

SVM

57 of 98

  • What is the Support Vector Machine?
  • A "Support Vector Machine" (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges.
  • However, it is mostly used in classification problems.
  • In the SVM algorithm, we plot each data item as a point in n-dimensional space (where n is the number of features you have), with the value of each feature being the value of a particular coordinate.
  • Then, we perform classification by finding the hyper-plane that differentiates the two classes very well (look at the below snapshot).

58 of 98


59 of 98

  • Support Vectors are simply the coordinates of individual observations.
  • The SVM classifier is a frontier that best segregates the two classes (hyper-plane / line).

60 of 98

  • The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.
  • SVM chooses the extreme points/vectors that help in creating the hyperplane.
  • These extreme cases are called support vectors, and hence the algorithm is termed Support Vector Machine.

61 of 98

Consider the below diagram in which there are two different categories that are classified using a decision boundary or hyperplane:

62 of 98

  • Example: 
  • SVM can be understood with the example that we have used in the KNN classifier. 
  • Suppose we see a strange cat that also has some features of dogs; if we want a model that can accurately identify whether it is a cat or a dog, such a model can be created using the SVM algorithm.
  • We will first train our model with lots of images of cats and dogs so that it can learn the different features of cats and dogs, and then we test it with this strange creature.
  • The SVM creates a decision boundary between these two classes (cat and dog), choosing the extreme cases (support vectors) of cats and dogs.

63 of 98

  • On the basis of the support vectors, it will classify it as a cat. Consider the below diagram:
  • The SVM algorithm can be used for face detection, image classification, text categorization, etc.

64 of 98

65 of 98

  • Types of SVM:
  • SVM can be of two types:
  • Linear SVM: Linear SVM is used for linearly separable data. If a dataset can be classified into two classes using a single straight line, then such data is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
  • Non-linear SVM: Non-linear SVM is used for non-linearly separable data. If a dataset cannot be classified using a straight line, then such data is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.

66 of 98

  • Hyperplane and Support Vectors in the SVM algorithm:
  • Hyperplane: 
  • There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find out the best decision boundary that helps to classify the data points. 
  • This best boundary is known as the hyperplane of SVM.
  • The dimensions of the hyperplane depend on the number of features in the dataset: if there are 2 features (as shown in the image), the hyperplane is a straight line; if there are 3 features, the hyperplane is a 2-dimensional plane.
  • We always create the hyperplane that has the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of either class.

67 of 98

  • Support Vectors:
  • The data points or vectors that are closest to the hyperplane and which affect its position are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.

68 of 98

  • Pros and Cons of SVM Classifiers:
  • Pros of SVM classifiers
  • SVM classifiers offer great accuracy and work well in high-dimensional spaces.
  • SVM classifiers use only a subset of the training points (the support vectors), so they use very little memory.
  • Cons of SVM classifiers
  • They have long training times, so in practice they are not suitable for large datasets.
  • Another disadvantage is that SVM classifiers do not work well with overlapping classes.

69 of 98

  • How does it work?
  • Above, we got accustomed to the process of segregating the two classes with a hyper-plane.
  • Now the burning question is 
  •         “How can we identify the right hyper-plane?”

70 of 98

  • Identify the right hyper-plane (Scenario-1): 
  • Here, we have three hyper-planes (A, B, and C).
  • Now, identify the right hyper-plane to classify stars and circles.
  • A rule of thumb for identifying the right hyper-plane: "Select the hyper-plane which segregates the two classes better." In this scenario, hyper-plane "B" performs this job excellently.

71 of 98

72 of 98

  • Identify the right hyper-plane (Scenario-2): 
  • Here, we have three hyper-planes (A, B, and C) and all are segregating the classes well. 
  • Now, how can we identify the right hyper-plane?

73 of 98

74 of 98

  • Here, maximizing the distance between the nearest data point (of either class) and the hyper-plane will help us decide the right hyper-plane. This distance is called the Margin. Let's look at the below snapshot:

75 of 98

76 of 98

  • Above, you can see that the margin for hyper-plane C is high compared to both A and B.
  • Hence, we name C the right hyper-plane.
  • Another compelling reason for selecting the hyper-plane with the higher margin is robustness.
  • If we select a hyper-plane with a low margin, there is a high chance of misclassification.

77 of 98

  • Identify the right hyper-plane (Scenario-3):
  • Hint: Use the rules discussed in the previous section to identify the right hyper-plane.

78 of 98

79 of 98

  • Some of you may have selected hyper-plane B, as it has a higher margin compared to A.
  • But here is the catch: SVM selects the hyper-plane which classifies the classes accurately before maximizing the margin.
  • Here, hyper-plane B has a classification error, while A has classified everything correctly.
  • Therefore, the right hyper-plane is A.

80 of 98

  • Can we classify two classes (Scenario-4)?: 
  • Below, I am unable to segregate the two classes using a straight line, as one of the stars lies in the territory of the other (circle) class as an outlier.

81 of 98

82 of 98

  • As I have already mentioned, the star at the other end is like an outlier for the star class.
  • The SVM algorithm has a feature to ignore outliers and find the hyper-plane that has the maximum margin.
  • Hence, we can say that SVM classification is robust to outliers.

83 of 98

84 of 98

  • Find the hyper-plane to segregate two classes (Scenario-5):
  • In the scenario below, we can't have a linear hyper-plane between the two classes, so how does SVM classify these two classes? Until now, we have only looked at linear hyper-planes.

85 of 98

86 of 98

  • SVM can solve this problem easily, by introducing an additional feature. Here, we add a new feature z = x² + y². Now, let's plot the data points on the x and z axes:
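A quick numeric check of this idea: with z = x² + y², points near the origin get small z and points far away get large z, so a single threshold on z separates them. The toy points are illustrative assumptions:

```python
# Red circles near the origin vs. stars far from it (assumed toy points)
inner = [(0.5, 0.0), (0.0, -0.6), (-0.4, 0.3)]
outer = [(2.0, 1.5), (-2.2, 0.5), (1.0, -2.5)]

z_inner = [x ** 2 + y ** 2 for x, y in inner]
z_outer = [x ** 2 + y ** 2 for x, y in outer]

# Every inner z is below every outer z, so a threshold on z separates them
print(max(z_inner), min(z_outer))
```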

87 of 98

88 of 98

  • In the above plot, the points to consider are:
  • All values of z are always positive because z is the sum of the squares of x and y.
  • In the original plot, the red circles appear close to the origin of the x and y axes, leading to lower values of z, while the stars are relatively far from the origin, resulting in higher values of z.
  • With this feature, it is easy for the SVM classifier to place a linear hyper-plane between the two classes.

89 of 98

  • Another burning question that arises is: do we need to add this feature manually to obtain a hyper-plane?
  • No. The SVM algorithm has a technique called the kernel trick.
  • An SVM kernel is a function that takes a low-dimensional input space and transforms it into a higher-dimensional space, i.e. it converts a non-separable problem into a separable problem.
  • It is mostly useful in non-linear separation problems.
  • Simply put, it does some extremely complex data transformations, then finds the procedure to separate the data based on the labels or outputs you've defined.
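A sketch of the kernel trick in scikit-learn, assuming a synthetic "circles" dataset: an RBF-kernel SVC separates the circular pattern that a linear SVC cannot, with no manual z feature:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric classes: not separable by any straight line
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)   # kernel trick: no manual features needed

print(linear.score(X, y))  # poor: no straight line separates the circles
print(rbf.score(X, y))     # near-perfect
```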

90 of 98

When we look at the hyper-plane in the original input space, it looks like a circle:

91 of 98
