UNIT – III - CHAPTER – II
SUPERVISED LEARNING : CLASSIFICATION
In supervised learning, the labelled training data provide the basis for learning. According to the definition of machine learning, this labelled training data is the experience or prior knowledge.
Some more examples of supervised learning are as follows:
1. Prediction of results of a game based on the past analysis of results
2. Predicting whether a tumour is malignant or benign on the basis of the analysis of data
3. Price prediction in domains such as real estate, stocks, etc.
Training data is the past information with known value of class field or ‘label’. Hence, we say that the ‘training data is labelled’ in the case of supervised learning (refer Fig.). Contrary to this, there is no labelled training data for unsupervised learning. Semi-supervised learning, as depicted in Figure, uses a small amount of labelled data along with unlabelled data for training.
CLASSIFICATION MODEL
Supervised machine learning is as good as the data used to train it. If the training data is poor in quality, the prediction will also be far from being precise.
The following figure depicts the typical process of classification, where a classification model is obtained from the labelled training data by a classifier algorithm. On the basis of the model, a class label (e.g. 'Intel' as in the case of the test data referred to in the figure) is assigned to the test data.
FIG. Classification model
CLASSIFICATION LEARNING STEPS
1. Problem Identification
2. Identification of Required Data
3. Data Pre-processing
4. Definition of Training Data Set
5. Algorithm Selection
6. Training
7. Evaluation with the Test Data Set
COMMON CLASSIFICATION ALGORITHMS
The following are the most common classification algorithms:
1. k-Nearest Neighbour (k-NN)
2. Decision tree
3. Random forest
4. Support Vector Machine (SVM)
5. Naïve Bayes classifier
k-Nearest Neighbour (k-NN)
Input: Training data set, test data set (or data points), value of ‘k’ (i.e. number of nearest neighbours to be considered)
Steps:
Do for all test data points
    Calculate the distance (usually Euclidean distance) of the test data point from the different training data points.
    Find the closest 'k' training data points, i.e. the training data points whose distances are least from the test data point.
    If k = 1
        Then assign the class label of the closest training data point to the test data point
    Else
        Whichever class label is predominantly present in the 'k' training data points, assign that class label to the test data point
End do
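The steps above can be sketched as a small k-NN classifier in Python. The feature values and class labels below are illustrative toy data (communication and aptitude scores), not taken from the text.

```python
# Minimal k-NN classifier sketch in pure Python, following the pseudocode above.
import math
from collections import Counter

def euclidean(p, q):
    # Euclidean distance between two feature vectors p and q
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_X, train_y, test_point, k):
    # Sort training-point indices by distance from the test point
    order = sorted(range(len(train_X)), key=lambda i: euclidean(train_X[i], test_point))
    nearest = [train_y[i] for i in order[:k]]
    if k == 1:
        # Single closest neighbour decides the label
        return nearest[0]
    # Otherwise, majority vote among the k nearest neighbours
    return Counter(nearest).most_common(1)[0][0]

# Toy data: (communication, aptitude) scores with class labels
train_X = [(8, 9), (9, 8), (9, 4), (8, 3), (3, 9), (2, 8)]
train_y = ["Leader", "Leader", "Speaker", "Speaker", "Intel", "Intel"]
print(knn_predict(train_X, train_y, (4, 8), 3))  # "Intel"
```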
1. Students having good communication skills as well as a good level of aptitude have been classified as 'Leader'
2. Students having good communication skills but not so good level of aptitude have been classified as 'Speaker'
3. Students having not so good communication skills but a good level of aptitude have been classified as 'Intel'
In the kNN algorithm, the class label of the test data elements is decided by the class label of the training data elements which are neighbouring, i.e. similar in nature. But there are two challenges:
1. What is the basis of this similarity or when can we say that two data elements are similar?
2. How many similar elements should be considered for deciding the class label of each test data element?
To answer the first question, though there are many measures of similarity, the most common approach adopted by kNN to measure similarity between two data elements is Euclidean distance.
Considering a very simple data set having two features (say f1 and f2), the Euclidean distance between two data elements d1 = (f11, f12) and d2 = (f21, f22) can be measured by

distance(d1, d2) = sqrt((f11 − f21)^2 + (f12 − f22)^2)

where f11 and f12 are the values of features f1 and f2 for data element d1, and f21 and f22 are the corresponding values for d2.
It is difficult to decide the value of k. Some common strategies for choosing it are as follows:
One common practice is to set k equal to the square root of the number of training records.
An alternative approach is to test several k values on a variety of test data sets and choose the one that delivers the best performance.
Another interesting approach is to choose a larger value of k, but apply a weighted voting process in which the vote of close neighbors is considered more influential than the vote of distant neighbors.
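The weighted voting idea can be sketched as follows. Inverse-distance weighting is one common choice of weighting scheme, assumed here for illustration rather than prescribed by the text.

```python
# Sketch of weighted voting for k-NN: closer neighbours get larger weights.
def weighted_vote(neighbours):
    # neighbours: list of (distance, label) pairs for the k nearest points
    weights = {}
    for dist, label in neighbours:
        # Inverse-distance weight; the small epsilon avoids division by zero
        weights[label] = weights.get(label, 0.0) + 1.0 / (dist + 1e-9)
    return max(weights, key=weights.get)

# One very close "Leader" vote outweighs two distant "Speaker" votes
print(weighted_vote([(0.5, "Leader"), (3.0, "Speaker"), (4.0, "Speaker")]))  # "Leader"
```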
Why the k-NN algorithm is called a lazy learner?
The k-NN algorithm simply stores the training data and directly applies the philosophy of nearest-neighbour finding to arrive at the classification when a test instance arrives. So, for k-NN, there is no learning happening in the real sense. Therefore, k-NN falls under the category of lazy learners.
Strengths of the k-NN algorithm
Weaknesses of the k-NN algorithm
Applications of the k-NN algorithm : Recommender system, searching documents/ contents similar to a given document/content. This is an area under information retrieval and is known as concept search.
Decision tree
Thus, a decision tree consists of three types of nodes: the root node, branch (internal) nodes, and leaf (terminal) nodes.
Building a decision tree
A decision tree is built through recursive partitioning of the training data, which stops when one of the following conditions is met:
1. All or most of the data instances in a partition belong to the same class
2. All features have been used up in the partitioning
3. The tree has grown to a pre-defined threshold limit
According to the table, job offer condition (i.e. the outcome) is FALSE for all the cases where Aptitude = Low, irrespective of other conditions.
So, the feature Aptitude can be taken up as the first node of the decision tree. For Aptitude = High, job offer condition is TRUE for all the cases where Communication = Good.
For cases where Communication = Bad, job offer condition is TRUE for all the cases where CGPA = High.
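The splitting rules described above can be sketched as nested conditions in Python. The function name is illustrative; the outcomes follow the conditions stated in the text.

```python
# Sketch of the job-offer decision tree described above as nested conditions.
def job_offer(aptitude, communication, cgpa):
    if aptitude == "Low":
        # Job offer is FALSE for all cases where Aptitude = Low
        return False
    if communication == "Good":
        # Aptitude = High and Communication = Good -> TRUE
        return True
    # Communication = Bad: TRUE only for the cases where CGPA = High
    return cgpa == "High"

print(job_offer("High", "Bad", "High"))  # True
```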
Searching a decision tree
Using the decision tree depicted in the figure, we need to predict whether Chandra might get a job offer for the given parameter values: CGPA = High, Communication = Bad, Aptitude = High, Programming skills = Bad.
There are multiple ways to search through the trained decision tree for a solution to the given prediction problem.
Exhaustive search
1. Place the item in the first group (class). Recursively examine solutions with the item in the first group (class).
2. Place the item in the second group (class). Recursively examine solutions with the item in the second group (class).
3. Repeat the above steps until the solution is reached.
Exhaustive search travels through the decision tree exhaustively, but it takes much time when the decision tree is big, with multiple leaves and multiple attribute values.
To overcome the problem in Exhaustive search, the following technique can be adopted:
Branch and bound search
Decision tree based on the training data (depicting a sample path)
Fig. depicts a sample path (thick line) for the conditions CGPA = High, Communication = Bad, Aptitude = High and Programming skills = Bad. According to the above decision tree, the prediction can be made as Chandra will get the job offer.
There are many implementations of the decision tree, the most prominent ones being ID3, C4.5/C5.0, and CART (Classification and Regression Tree).
The biggest challenge of a decision tree algorithm is to find out which feature to split upon.
The main driver for identifying the feature is that the data should be split in such a way that the partitions created by the split should contain examples belonging to a single class. If that happens, the partitions are considered to be pure.
Entropy of a decision tree
The entropy of a data set S is calculated as

Entropy(S) = − Σ (i = 1 to c) p_i log2(p_i)

where c is the number of different class labels and p_i refers to the proportion of values falling into the i-th class label.
Entropy(S) = −0.44 log2(0.44) − 0.56 log2(0.56) = 0.99.
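The calculation above can be checked with a short Python sketch of the entropy formula:

```python
# Sketch: entropy of a class-label distribution, as in the example above.
import math

def entropy(proportions):
    # proportions: class-label proportions p_i that sum to 1;
    # terms with p = 0 contribute nothing (by convention 0 log 0 = 0)
    return -sum(p * math.log2(p) for p in proportions if p > 0)

print(round(entropy([0.44, 0.56]), 2))  # 0.99
```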
Information gain of a decision tree
Information gain for a particular feature A is calculated as the difference between the entropy before the split (Sbs) and the entropy after the split (Sas), where the entropy after the split is the weighted average of the entropies of the partitions created by the split:

Information Gain (S, A) = Entropy (Sbs) − Entropy (Sas)
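A small Python sketch of this calculation, assuming the entropy definition above; the size-weighted average over partitions is the standard detail that makes Entropy(Sas) comparable to Entropy(Sbs):

```python
# Sketch of information gain for a split of a labelled data set.
import math
from collections import Counter

def label_entropy(labels):
    # Entropy of a list of class labels, using proportions p_i = count / n
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, partitions):
    # labels: class labels before the split; partitions: label lists after it.
    # Entropy(Sas) is the weighted average of the partition entropies.
    n = len(labels)
    entropy_after = sum(len(p) / n * label_entropy(p) for p in partitions)
    return label_entropy(labels) - entropy_after

# A perfect split of a 50/50 set recovers the full 1.0 bit of entropy
print(information_gain(["T", "T", "F", "F"], [["T", "T"], ["F", "F"]]))  # 1.0
```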
Random Forest Algorithm
As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset."
Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, predicts the final output.
A greater number of trees in the forest leads to higher accuracy and prevents the problem of overfitting.
For example, consider the fruit basket as the data, as shown in the figure below. Now n samples are taken from the fruit basket, and an individual decision tree is constructed for each sample. Each decision tree generates an output, as shown in the figure. The final output is decided by majority voting: in the figure, the majority of the decision trees output 'apple' rather than 'banana', so the final output is taken as apple.
The following steps explain the working of the Random Forest algorithm:
Step 1: Select random samples from the given data or training set.
Step 2: Construct a decision tree for every sample of training data.
Step 3: Each decision tree makes a prediction, and voting is performed over these predictions (for regression, the predictions are averaged).
Step 4: Finally, select the most voted prediction result as the final prediction result.
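The four steps can be sketched as follows. The "tree" here is a trivial one-rule stand-in so the sketch stays short; in a real random forest each learner would be a full decision tree, and the fruit data below is purely illustrative.

```python
# Sketch of the random-forest procedure: bootstrap samples, one (stub)
# tree per sample, then majority voting over the trees' predictions.
import random
from collections import Counter

def train_stub_tree(sample):
    # Illustrative stand-in for a decision tree: it simply predicts the
    # majority class of its own bootstrap sample
    labels = [label for _, label in sample]
    majority = Counter(labels).most_common(1)[0][0]
    return lambda x: majority

def random_forest_predict(data, x, n_trees=5, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        # Step 1: draw a bootstrap sample (random sampling with replacement)
        sample = [rng.choice(data) for _ in range(len(data))]
        # Step 2: build a tree for the sample; Step 3: collect its vote
        tree = train_stub_tree(sample)
        votes.append(tree(x))
    # Step 4: the majority vote decides the final prediction
    return Counter(votes).most_common(1)[0][0]

data = [((1,), "apple")] * 6 + [((0,), "banana")] * 2
print(random_forest_predict(data, (1,)))
```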
This combination of multiple models is called Ensemble.
Ensemble uses two methods:
Bagging
Boosting
Why use Random Forest?
It takes less training time as compared to other algorithms.
Applications of Random Forest
There are mainly four sectors where Random Forest is mostly used:
Advantages of Random Forest
Disadvantages of Random Forest
Support Vector Machines
2 - Dimensional Feature Space
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find out the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.
The dimension of the hyperplane depends on the number of features present in the dataset: if there are 2 features (as shown in the image), the hyperplane will be a straight line, and if there are 3 features, the hyperplane will be a 2-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of either class.
Support Vectors: The data points or vectors that are closest to the hyperplane and which affect its position are termed Support Vectors. Since these vectors support the hyperplane, they are called support vectors.
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:
Since it is a 2-D space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that can separate these classes. Consider the below image:
Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes; these points are called support vectors. The distance between the vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.
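The margin idea can be made concrete with a short sketch. The blue/green points and the candidate separating line below are illustrative; the margin of a candidate hyperplane is the distance from the line to the closest point (a support vector).

```python
# Sketch: margin of a candidate separating line for a two-class 2-D dataset.
import math

def distance_to_line(w, b, point):
    # Perpendicular distance from a point to the line w[0]*x1 + w[1]*x2 + b = 0
    return abs(w[0] * point[0] + w[1] * point[1] + b) / math.hypot(w[0], w[1])

blue = [(1, 2), (2, 3), (2, 1)]
green = [(6, 5), (7, 7), (8, 6)]

# Candidate hyperplane (a straight line in 2-D): x1 + x2 - 8 = 0
w, b = (1, 1), -8

# The margin is set by the closest points, i.e. the support vectors;
# SVM training would search for the w, b that maximize this quantity
margin = min(distance_to_line(w, b, p) for p in blue + green)
print(round(margin, 3))  # 2.121
```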