UNIT – V
Unsupervised Learning
SYLLABUS
Unsupervised learning
S.No | Area | Supervised Learning | Unsupervised Learning |
1 | Goals | To predict outcomes for new data; we know the type of results to expect. | To gain insights from large volumes of new data; the expected results are not known in advance. |
2 | Data | Both input and output (label) variables are given. | Only input variables are given; there are no output labels. |
3 | Applications | Spam detection, sentiment analysis, weather forecasting, pricing predictions, etc. | Anomaly detection, recommendation engines, customer personas, medical imaging, etc. |
4 | Complexity | Solutions can be built with standard tools such as R or Python. | Needs powerful tools for working with large amounts of unclassified data. |
5 | Techniques Used | Regression and Classification | Clustering and Association |
6 | Drawbacks | Models can be time-consuming to train, and labelling the input and output variables requires expertise. | Results can be wildly inaccurate without human intervention to validate the output. |
Key Difference Between Supervised and Unsupervised Learning
APPLICATIONS OF UNSUPERVISED LEARNING
There are many domains where unsupervised learning finds application. A few examples of such applications are as follows:
Unsupervised learning has two major aspects: Clustering, which segments a set of objects into groups of similar objects, and Association Analysis, which identifies relationships among objects in a data set.
CLUSTERING
There are many different fields where cluster analysis is used effectively, such as
Clustering as a machine learning task
Different types of clustering techniques
The major clustering techniques are as follows:
Partitioning Methods
Hierarchical Methods
Density-based Methods
Partitioning Methods
Two important algorithms for partitioning-based clustering are k-means and k-medoids.
In the k-means algorithm, the prototype of each cluster is its centroid, which is normally the mean of the group of points.
The k-medoids algorithm instead identifies the medoid, which is the most representative actual point of a group of points.
K-Means - A centroid-based technique
Simple Algorithm of K-means
Step 1: Select K points in the data space and mark them as initial centroids
loop
Step 2: Assign each point in the data space to the nearest centroid to form K clusters
Step 3: Measure the distance of each point in the cluster from the centroid
Step 4: Calculate the Sum of Squared Error (SSE) to measure the quality of the clusters
Step 5: Identify the new centroid of each cluster on the basis of the points assigned to it
Step 6: Repeat Steps 2 to 5 to refine until centroids do not change
end loop
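The steps above can be sketched directly in Python. This is a minimal sketch for 2-D points; initializing the centroids with the first K points is my own simplification (real implementations usually pick them at random, as in Step 1):

```python
def kmeans(points, k, max_iter=100):
    """K-means on 2-D points, following Steps 1-6 above.
    Initial centroids are simply the first k points (a simplifying
    assumption; implementations usually choose them at random)."""
    centroids = list(points[:k])                       # Step 1
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):                          # loop
        clusters = [[] for _ in range(k)]
        for p in points:                               # Step 2: nearest centroid
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        new_centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c                               # keep centroid of an empty cluster
            for cl, c in zip(clusters, centroids)
        ]                                              # Step 5: recompute centroids
        if new_centroids == centroids:                 # Step 6: stop when unchanged
            break
        centroids = new_centroids
    sse = sum((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2  # Step 4: SSE quality measure
              for cl, c in zip(clusters, centroids) for p in cl)
    return centroids, clusters, sse
```

On two well-separated groups of points, the loop converges in a few iterations and the returned SSE measures how tight the final clusters are.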
Clustering concept – before and after clustering
Student Assignment
Problem solving on K-means
K-means clustering – solved example
https://www.youtube.com/watch?v=KzJORp8bgqs&list=PPSV
By Mahesh Haddur
Elbow method
Fig: Elbow point to determine the appropriate number of clusters
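The elbow idea can be sketched numerically: run K-means for increasing K and watch where the SSE curve bends. Below is a self-contained toy sketch; the 1-D data and the deterministic spread-out initialization are my own assumptions, made so the result is reproducible:

```python
def sse_for_k(xs, k, iters=50):
    """Tiny 1-D K-means; returns the final SSE for a given K."""
    s = sorted(xs)
    cents = [s[i * len(s) // k] for i in range(k)]    # spread-out deterministic init
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:                                  # assign to nearest centroid
            groups[min(range(k), key=lambda i: (x - cents[i]) ** 2)].append(x)
        cents = [sum(g) / len(g) if g else c for g, c in zip(groups, cents)]
    return sum((x - c) ** 2 for g, c in zip(groups, cents) for x in g)

# Three well-separated groups: the SSE curve should bend sharply at K = 3
data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
sse_curve = {k: sse_for_k(data, k) for k in range(1, 6)}
```

Plotting `sse_curve` against K reproduces the elbow plot in the figure: SSE falls steeply until K reaches the true number of groups, then flattens.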
Strengths and Weaknesses of K-means Algorithm
S.No | Strengths | Weaknesses |
1 | Easy to use: it returns clusters that can be easily interpreted and even visualized. This simplicity makes it useful when you need a quick overview of the data segments. | A human must set the number of clusters (K) for it to function optimally. |
2 | High performance: it can handle large datasets conveniently. | It only creates spherical clusters. |
3 | If your data has no labels (class values or targets) or even column headers, it will still successfully cluster your data. | It has no concept of outliers: every point is thrown into some cluster, which is sometimes acceptable and sometimes not. |
K-Medoids: a representative object-based technique
Medoid: A Medoid is a point in the cluster from which the sum of distances to other data points is minimal. The distance can be measured by using the Euclidean distance, Manhattan distance, or any other suitable distance function.
There are three types of algorithms for K-Medoids Clustering: PAM (Partitioning Around Medoids), CLARA, and CLARANS.
Working of the PAM Algorithm
The steps taken by the K-medoids (PAM) algorithm for clustering are as follows:
Finally, we will have k medoid points with their clusters.
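The core PAM idea (assign every point to its nearest medoid, then greedily swap a medoid for a non-medoid whenever the swap lowers the total cost) can be sketched in a few lines. This is a naive sketch, not an efficient implementation; the Manhattan distance and the first-k initialization are my own choices:

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Naive PAM sketch: start with the first k points as medoids, then
    repeatedly swap a medoid for a non-medoid if that lowers total cost."""
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for p in points:
                if p in medoids:
                    continue
                candidate = [p if x == m else x for x in medoids]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids, improved = candidate, True
    clusters = [[] for _ in medoids]        # final assignment to nearest medoid
    for p in points:
        d = [manhattan(p, m) for m in medoids]
        clusters[d.index(min(d))].append(p)
    return medoids, clusters
```

Because medoids are always actual data points, the result is less sensitive to outliers than the mean-based centroids of K-means.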
Student Assignment
Problem solving on K-medoids
https://www.youtube.com/watch?v=ChBxx4aR-bY&list=PPSV
By Mahesh Haddur
Hierarchical clustering
Hierarchical clustering has two variants: agglomerative clustering and divisive clustering.
A dendrogram is a commonly used tree-structure representation of the step-by-step creation of hierarchical clustering.
It shows how the clusters are merged iteratively (in the case of agglomerative clustering) or split iteratively (in the case of divisive clustering).
The following figure represents a dendrogram.
agglomerative clustering and divisive clustering
The core measure of proximity between clusters is the distance between them. There are four standard methods to measure the distance between clusters: minimum distance (Dmin), maximum distance (Dmax), mean distance (distance between centroids), and average distance (average of all pairwise distances).
The distance measure is used to decide when to terminate the clustering algorithm.
When an algorithm uses the minimum distance Dmin to measure the distance between clusters, it is referred to as a nearest-neighbour clustering algorithm, and if the decision to stop is based on a user-defined limit on Dmin, it is called a single-linkage algorithm.
On the other hand, when an algorithm uses the maximum distance Dmax to measure the distance between clusters, it is referred to as a furthest-neighbour clustering algorithm, and if the decision to stop is based on a user-defined limit on Dmax, it is called a complete-linkage algorithm.
As the minimum and maximum measures represent two extreme options for measuring the distance between clusters, they are sensitive to outliers and noisy data. Using the mean or average distance instead avoids this problem and provides more consistent results.
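The four distance measures just described can be written out directly. A small self-contained sketch (Euclidean distance between points is my assumed metric):

```python
from itertools import product

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroid(cluster):
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def linkage_distances(c1, c2):
    """The four standard cluster-to-cluster distance measures."""
    pairwise = [euclid(p, q) for p, q in product(c1, c2)]
    return {
        "Dmin (single linkage)":     min(pairwise),
        "Dmax (complete linkage)":   max(pairwise),
        "Davg (average distance)":   sum(pairwise) / len(pairwise),
        "Dmean (between centroids)": euclid(centroid(c1), centroid(c2)),
    }
```

Note that Dmin and Dmax each depend on a single extreme pair of points, which is exactly why they are sensitive to outliers, while Davg and Dmean pool information from the whole clusters.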
Density-based methods - DBSCAN
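DBSCAN groups points that lie in dense regions: a point with at least min_pts neighbours within radius eps is a core point, clusters grow outward from core points, and unreachable points are labelled noise. A rough pure-Python sketch (the toy layout in the test and the exact variable names are my own; eps and min_pts are the algorithm's standard parameters):

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1              # noise (may later become a border point)
            continue
        labels[i] = cid                 # start a new cluster from this core point
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid         # border point: joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cid
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_nbrs)
        cid += 1
    return labels
```

Unlike K-means, this needs no cluster count up front, can find non-spherical clusters, and explicitly marks outliers as noise.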
Student Assignment : Density-Based Methods
DBSCAN - Problem Solving
By Mahesh Haddur
https://www.youtube.com/watch?v=-p354tQsKrs&list=PPSV
Finding Pattern Using Association Rule
{Bread, Milk} → {Egg}
Few common terminologies used in association analysis
Itemset: A collection of zero or more items is called an itemset.
For example, in the table below, {Bread, Milk, Egg} can be grouped together to form an itemset, as those are frequently bought together.
A null itemset is one that does not contain any item.
In association analysis, an itemset is called a k-itemset if it contains k items.
For example, {Bread, Milk, Egg} is a 3-itemset.
Transaction Number | Purchased Items |
1 | { Bread, Milk, Egg, Butter, Salt, Apple } |
2 | { Bread, Milk, Egg, Apple } |
3 | { Bread, Milk, Butter, Apple } |
4 | { Milk, Egg, Butter, Apple } |
5 | { Bread, Egg, Salt } |
6 | { Bread, Milk, Egg, Apple } |
Support count: denotes the number of transactions in which a particular itemset is present.
In the above table, the itemset {Bread, Milk, Egg} occurs together three times and thus has a support count of 3.
Association rule: The result of market basket analysis is expressed as a set of association rules that specify patterns of relationships among items. A typical rule might be expressed as {Bread, Milk} → {Egg}, which denotes that if Bread and Milk are purchased, then Egg is also likely to be purchased.
The above rule was identified from the itemset {Bread, Milk, Egg}.
Support and confidence are the two concepts used for measuring the strength of an association rule.
Support refers to the popularity of an itemset: the number of transactions containing the itemset divided by the total number of transactions. For example, if bread, butter, and bread + butter are bought in 200, 150, and 100 transactions, respectively, on a day when 1000 transactions happened, then the supports for these associations will be:
Support(Bread) = 200 / 1000 = 0.2
Support(Butter) = 150 / 1000 = 0.15
Support(Bread, Butter) = 100 / 1000 = 0.1
Confidence refers to the probability that item B will be bought given that the user has purchased item A. For example, if A and B together are bought in 100 transactions and A is bought in 200 transactions, then:
Confidence(A → B) = Transactions containing both A and B / Transactions containing A
Confidence(Bread → Butter) = Transactions containing both Bread and Butter / Transactions containing Bread = 100 / 200 = 0.5
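These two measures can be computed directly from the transaction table shown earlier. A small sketch:

```python
# The six transactions from the table above, as sets of items
transactions = [
    {"Bread", "Milk", "Egg", "Butter", "Salt", "Apple"},
    {"Bread", "Milk", "Egg", "Apple"},
    {"Bread", "Milk", "Butter", "Apple"},
    {"Milk", "Egg", "Butter", "Apple"},
    {"Bread", "Egg", "Salt"},
    {"Bread", "Milk", "Egg", "Apple"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs in basket | lhs in basket) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)
```

For instance, {Bread, Milk, Egg} appears in 3 of the 6 transactions, so its support is 0.5, and the rule {Bread, Milk} → {Egg} has confidence 3/4, since Bread and Milk appear together in 4 transactions and Egg joins them in 3.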
Minimum Support: the support threshold below which an association rule is discarded.
Minimum Confidence: the confidence threshold used to filter out rules that are not strong enough.
Apriori Algorithm – Solved Example
Consider the following dataset and we will find frequent item sets and generate association rules for them.
Minimum support count is 2; minimum confidence is 60%.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).
(II) Compare each candidate item's support count with the minimum support count (here min_support = 2); if an item's support_count is less than min_support, remove it. This gives us itemset L1.
Step-2: K=2
(Example: the subsets of {I1, I2} are {I1} and {I2}, and both are frequent. Check this for each candidate itemset.)
(II) Compare each candidate (C2) support count with the minimum support count (here min_support = 2); if an itemset's support_count is less than min_support, remove it. This gives us itemset L2.
Step-3:
(Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly, check every itemset.)
(II) Compare each candidate (C3) support count with the minimum support count (here min_support = 2); if an itemset's support_count is less than min_support, remove it. This gives us itemset L3.
Step-4:
(Here the itemset formed by joining L3 with itself is {I1, I2, I3, I5}, whose subset {I1, I3, I5} is not frequent.) So there is no itemset in C4, and the algorithm stops.
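The join-and-prune structure of Steps 1-4 can be sketched generically. Since the slide's I1-I5 transaction table itself is not reproduced in this text, the sketch below runs against the market-basket table from earlier in the unit; the code is a minimal illustration, not an optimized implementation:

```python
from itertools import combinations

# The six market-basket transactions from the earlier table
transactions = [
    {"Bread", "Milk", "Egg", "Butter", "Salt", "Apple"},
    {"Bread", "Milk", "Egg", "Apple"},
    {"Bread", "Milk", "Butter", "Apple"},
    {"Milk", "Egg", "Butter", "Apple"},
    {"Bread", "Egg", "Salt"},
    {"Bread", "Milk", "Egg", "Apple"},
]

def apriori(transactions, min_support=2):
    """Return {frozenset: support_count} for all frequent itemsets."""
    def count(cands):
        return {c: sum(c <= t for t in transactions) for c in cands}

    items = {i for t in transactions for i in t}
    # Step 1: frequent 1-itemsets (C1 filtered by min_support gives L1)
    level = {c: n for c, n in count({frozenset([i]) for i in items}).items()
             if n >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        # Join L(k-1) with itself, keeping only size-k unions (candidate Ck)
        cands = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Filter Ck by min_support to obtain Lk
        level = {c: n for c, n in count(cands).items() if n >= min_support}
        frequent.update(level)
        k += 1
    return frequent
```

With min_support = 2, the itemset {Bread, Milk, Egg} comes out with a support count of 3, matching the support-count example earlier; association rules are then generated from each frequent itemset by checking the minimum-confidence threshold.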
Advantages of the Apriori Algorithm are as follows:
Disadvantages of Apriori Algorithm are as follows:
Thank You All.....