UNIT – V
Unsupervised Learning
SYLLABUS
Unsupervised learning
S.No | Area | Supervised Learning | Unsupervised Learning |
1 | Goals | To predict outcomes for new data; we know the type of results to expect. | To gain insights from large volumes of new data; the expected results are not known in advance. |
2 | Data | Both input and output (label) variables are given. | Only input variables are given; there are no output labels. |
3 | Applications | Spam detection, sentiment analysis, weather forecasting, pricing predictions, etc. | Anomaly detection, recommendation engines, customer personas, medical imaging, etc. |
4 | Complexity | Solutions can be built with standard tools such as R or Python. | Needs powerful tools for working with large amounts of unclassified data. |
5 | Techniques Used | Regression and Classification | Clustering and Association |
6 | Drawbacks | Models can be time-consuming to train, and labelling the input and output variables requires expertise. | Results can be wildly inaccurate without human intervention to validate the output. |
Key Difference Between Supervised and Unsupervised Learning
APPLICATIONS OF UNSUPERVISED LEARNING
There are many domains where unsupervised learning finds application. A few examples of such applications are as follows:
Unsupervised learning has two major aspects: Clustering, which segments a set of objects into groups of similar objects, and Association Analysis, which identifies relationships among objects in a data set.
CLUSTERING
There are many different fields where cluster analysis is used effectively, such as
Clustering as a machine learning task
Different types of clustering techniques
The major clustering techniques are as follows:
Partitioning Methods
Hierarchical Methods
Density-based Methods
Partitioning Methods
Two important algorithms for partitioning-based clustering are k-means and k-medoids.
In the k-means algorithm, the prototype of each cluster is its centroid, which is normally the mean of the group of points.
The k-medoids algorithm instead identifies the medoid, which is the most representative actual point of a group of points.
K-Means - A centroid-based technique
Simple Algorithm of K-means
Step 1: Select K points in the data space and mark them as initial centroids
loop
Step 2: Assign each point in the data space to the nearest centroid to form K clusters
Step 3: Measure the distance of each point in the cluster from the centroid
Step 4: Calculate the Sum of Squared Error (SSE) to measure the quality of the clusters
Step 5: Identify the new centroid of each cluster on the basis of the points assigned to it
Step 6: Repeat Steps 2 to 5 to refine until centroids do not change
end loop
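The steps above can be sketched directly in Python. This is a minimal sketch for 2-D points; initializing the centroids with the first K points is my own simplification (real implementations usually pick them at random, as in Step 1):

```python
def kmeans(points, k, max_iter=100):
    """K-means on 2-D points, following Steps 1-6 above.
    Initial centroids are simply the first k points (a simplifying
    assumption; implementations usually choose them at random)."""
    centroids = list(points[:k])                       # Step 1
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):                          # loop
        clusters = [[] for _ in range(k)]
        for p in points:                               # Step 2: nearest centroid
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        new_centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else c                               # keep centroid of an empty cluster
            for cl, c in zip(clusters, centroids)
        ]                                              # Step 5: recompute centroids
        if new_centroids == centroids:                 # Step 6: stop when unchanged
            break
        centroids = new_centroids
    sse = sum((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2  # Step 4: SSE quality measure
              for cl, c in zip(clusters, centroids) for p in cl)
    return centroids, clusters, sse
```

On two well-separated groups of points, the loop converges in a few iterations and the returned SSE measures how tight the final clusters are.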
Clustering concept – before and after clustering
Student Assignment
Problem solving on K-means
K-means clustering – solved example
https://www.youtube.com/watch?v=KzJORp8bgqs&list=PPSV
By Mahesh Haddur
Elbow method
Fig: Elbow point to determine the appropriate number of clusters
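The elbow idea can be sketched numerically: run K-means for increasing K and watch where the SSE curve bends. Below is a self-contained toy sketch; the 1-D data and the deterministic spread-out initialization are my own assumptions, made so the result is reproducible:

```python
def sse_for_k(xs, k, iters=50):
    """Tiny 1-D K-means; returns the final SSE for a given K."""
    s = sorted(xs)
    cents = [s[i * len(s) // k] for i in range(k)]    # spread-out deterministic init
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:                                  # assign to nearest centroid
            groups[min(range(k), key=lambda i: (x - cents[i]) ** 2)].append(x)
        cents = [sum(g) / len(g) if g else c for g, c in zip(groups, cents)]
    return sum((x - c) ** 2 for g, c in zip(groups, cents) for x in g)

# Three well-separated groups: the SSE curve should bend sharply at K = 3
data = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9, 9.0, 9.2, 8.8]
sse_curve = {k: sse_for_k(data, k) for k in range(1, 6)}
```

Plotting `sse_curve` against K reproduces the elbow plot in the figure: SSE falls steeply until K reaches the true number of groups, then flattens.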
Strengths and Weaknesses of K-means Algorithm
S.No | Strengths | Weaknesses |
1 | Easy to use: it returns clusters that can be easily interpreted and even visualized. This simplicity makes it useful when you need a quick overview of the data segments. | A human must set the number of clusters (K) for it to function optimally. |
2 | High performance: it can handle large datasets conveniently. | It only creates spherical clusters. |
3 | If your data has no labels (class values or targets) or even column headers, it will still successfully cluster your data. | It has no concept of outliers: every point is thrown into some cluster, which is sometimes acceptable and sometimes not. |
K-Medoids: a representative object-based technique
Medoid: A Medoid is a point in the cluster from which the sum of distances to other data points is minimal. The distance can be measured by using the Euclidean distance, Manhattan distance, or any other suitable distance function.
There are three types of algorithms for K-Medoids Clustering: PAM (Partitioning Around Medoids), CLARA, and CLARANS.
Working of the PAM Algorithm
The steps taken by the K-medoids (PAM) algorithm for clustering are as follows:
Finally, we will have k medoid points with their clusters.
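The core PAM idea (assign every point to its nearest medoid, then greedily swap a medoid for a non-medoid whenever the swap lowers the total cost) can be sketched in a few lines. This is a naive sketch, not an efficient implementation; the Manhattan distance and the first-k initialization are my own choices:

```python
def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(points, medoids):
    """Sum over all points of the distance to the nearest medoid."""
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k):
    """Naive PAM sketch: start with the first k points as medoids, then
    repeatedly swap a medoid for a non-medoid if that lowers total cost."""
    medoids = list(points[:k])
    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for p in points:
                if p in medoids:
                    continue
                candidate = [p if x == m else x for x in medoids]
                if total_cost(points, candidate) < total_cost(points, medoids):
                    medoids, improved = candidate, True
    clusters = [[] for _ in medoids]        # final assignment to nearest medoid
    for p in points:
        d = [manhattan(p, m) for m in medoids]
        clusters[d.index(min(d))].append(p)
    return medoids, clusters
```

Because medoids are always actual data points, the result is less sensitive to outliers than the mean-based centroids of K-means.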
Student Assignment
Problem solving on K-medoids
https://www.youtube.com/watch?v=ChBxx4aR-bY&list=PPSV
By Mahesh Haddur
Hierarchical clustering
Hierarchical clustering has two variants: agglomerative clustering and divisive clustering.
A dendrogram is a commonly used tree-structure representation of the step-by-step creation of hierarchical clustering.
It shows how the clusters are merged iteratively (in the case of agglomerative clustering) or split iteratively (in the case of divisive clustering).
The following figure represents a dendrogram.
agglomerative clustering and divisive clustering
The core measure of proximity between clusters is the distance between them. There are four standard methods to measure the distance between clusters: minimum distance (Dmin), maximum distance (Dmax), mean distance (distance between centroids), and average distance (average of all pairwise distances).
The distance measure is used to decide when to terminate the clustering algorithm.
When an algorithm uses the minimum distance Dmin to measure the distance between clusters, it is referred to as a nearest-neighbour clustering algorithm, and if the decision to stop is based on a user-defined limit on Dmin, it is called a single-linkage algorithm.
On the other hand, when an algorithm uses the maximum distance Dmax to measure the distance between clusters, it is referred to as a furthest-neighbour clustering algorithm, and if the decision to stop is based on a user-defined limit on Dmax, it is called a complete-linkage algorithm.
As the minimum and maximum measures represent two extreme options for measuring the distance between clusters, they are sensitive to outliers and noisy data. Using the mean or average distance instead avoids this problem and provides more consistent results.
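The four distance measures just described can be written out directly. A small self-contained sketch (Euclidean distance between points is my assumed metric):

```python
from itertools import product

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def centroid(cluster):
    return tuple(sum(xs) / len(cluster) for xs in zip(*cluster))

def linkage_distances(c1, c2):
    """The four standard cluster-to-cluster distance measures."""
    pairwise = [euclid(p, q) for p, q in product(c1, c2)]
    return {
        "Dmin (single linkage)":     min(pairwise),
        "Dmax (complete linkage)":   max(pairwise),
        "Davg (average distance)":   sum(pairwise) / len(pairwise),
        "Dmean (between centroids)": euclid(centroid(c1), centroid(c2)),
    }
```

Note that Dmin and Dmax each depend on a single extreme pair of points, which is exactly why they are sensitive to outliers, while Davg and Dmean pool information from the whole clusters.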
Density-based methods - DBSCAN
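DBSCAN groups points that lie in dense regions: a point with at least min_pts neighbours within radius eps is a core point, clusters grow outward from core points, and unreachable points are labelled noise. A rough pure-Python sketch (the toy layout in the test and the exact variable names are my own; eps and min_pts are the algorithm's standard parameters):

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id; -1 marks noise."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1              # noise (may later become a border point)
            continue
        labels[i] = cid                 # start a new cluster from this core point
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid         # border point: joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cid
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:  # j is also a core point: keep expanding
                queue.extend(j_nbrs)
        cid += 1
    return labels
```

Unlike K-means, this needs no cluster count up front, can find non-spherical clusters, and explicitly marks outliers as noise.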
Student Assignment : Density-Based Methods
DBSCAN - Problem Solving
By Mahesh Haddur
https://www.youtube.com/watch?v=-p354tQsKrs&list=PPSV
Finding Pattern Using Association Rule
{Bread, Milk} → {Egg}
Few common terminologies used in association analysis
Itemset: A collection of zero or more items is called an itemset.
For example, in the table below, {Bread, Milk, Egg} can be grouped together to form an itemset, as those are frequently bought together.
A null itemset is one that does not contain any item.
In association analysis, an itemset is called a k-itemset if it contains k items.
For example, {Bread, Milk, Egg} is a 3-itemset.
Transaction Number | Purchased Items |
1 | { Bread, Milk, Egg, Butter, Salt, Apple } |
2 | { Bread, Milk, Egg, Apple } |
3 | { Bread, Milk, Butter, Apple } |
4 | { Milk, Egg, Butter, Apple } |
5 | { Bread, Egg, Salt } |
6 | { Bread, Milk, Egg, Apple } |
Support count: denotes the number of transactions in which a particular itemset is present.
In the above table, the itemset {Bread, Milk, Egg} occurs together three times and thus has a support count of 3.
Association rule: The result of market basket analysis is expressed as a set of association rules that specify patterns of relationships among items. A typical rule might be expressed as {Bread, Milk} → {Egg}, which denotes that if Bread and Milk are purchased, then Egg is also likely to be purchased.
The above rule was identified from the itemset {Bread, Milk, Egg}.
Support and confidence are the two concepts used for measuring the strength of an association rule.
Support refers to the popularity of an itemset: the number of transactions containing the itemset divided by the total number of transactions. For example, if bread, butter, and bread + butter are bought in 200, 150, and 100 transactions, respectively, on a day when 1000 transactions happened, then the supports for these associations will be:
Support(Bread) = 200 / 1000 = 0.2
Support(Butter) = 150 / 1000 = 0.15
Support(Bread, Butter) = 100 / 1000 = 0.1
Confidence refers to the probability that item B will be bought given that the user has purchased item A. For example, if A and B together are bought in 100 transactions and A is bought in 200 transactions, then:
Confidence(A → B) = Transactions containing both A and B / Transactions containing A
Confidence(Bread → Butter) = Transactions containing both Bread and Butter / Transactions containing Bread = 100 / 200 = 0.5
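These two measures can be computed directly from the transaction table shown earlier. A small sketch:

```python
# The six transactions from the table above, as sets of items
transactions = [
    {"Bread", "Milk", "Egg", "Butter", "Salt", "Apple"},
    {"Bread", "Milk", "Egg", "Apple"},
    {"Bread", "Milk", "Butter", "Apple"},
    {"Milk", "Egg", "Butter", "Apple"},
    {"Bread", "Egg", "Salt"},
    {"Bread", "Milk", "Egg", "Apple"},
]

def support(itemset):
    """Fraction of transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs in basket | lhs in basket) = support(lhs ∪ rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)
```

For instance, {Bread, Milk, Egg} appears in 3 of the 6 transactions, so its support is 0.5, and the rule {Bread, Milk} → {Egg} has confidence 3/4, since Bread and Milk appear together in 4 transactions and Egg joins them in 3.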
Minimum Support: the support threshold below which an association rule is discarded.
Minimum Confidence: the confidence threshold used to filter out rules that are not strong enough.
Apriori Algorithm – Solved Example
Consider the following dataset and we will find frequent item sets and generate association rules for them.
Minimum support count is 2; minimum confidence is 60%.
Step-1: K=1
(I) Create a table containing the support count of each item present in the dataset, called C1 (the candidate set).
(II) Compare each candidate item's support count with the minimum support count (here min_support = 2); if an item's support_count is less than min_support, remove it. This gives us itemset L1.
Step-2: K=2
(Example: the subsets of {I1, I2} are {I1} and {I2}, and both are frequent. Check this for each candidate itemset.)
(II) Compare each candidate (C2) support count with the minimum support count (here min_support = 2); if an itemset's support_count is less than min_support, remove it. This gives us itemset L2.
Step-3:
(Here the subsets of {I1, I2, I3} are {I1, I2}, {I2, I3}, and {I1, I3}, which are all frequent. For {I2, I3, I4}, the subset {I3, I4} is not frequent, so remove it. Similarly, check every itemset.)
(II) Compare each candidate (C3) support count with the minimum support count (here min_support = 2); if an itemset's support_count is less than min_support, remove it. This gives us itemset L3.
Step-4:
(Here the itemset formed by joining L3 with itself is {I1, I2, I3, I5}, whose subset {I1, I3, I5} is not frequent.) So there is no itemset in C4, and the algorithm stops.
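The join-and-prune structure of Steps 1-4 can be sketched generically. Since the slide's I1-I5 transaction table itself is not reproduced in this text, the sketch below runs against the market-basket table from earlier in the unit; the code is a minimal illustration, not an optimized implementation:

```python
from itertools import combinations

# The six market-basket transactions from the earlier table
transactions = [
    {"Bread", "Milk", "Egg", "Butter", "Salt", "Apple"},
    {"Bread", "Milk", "Egg", "Apple"},
    {"Bread", "Milk", "Butter", "Apple"},
    {"Milk", "Egg", "Butter", "Apple"},
    {"Bread", "Egg", "Salt"},
    {"Bread", "Milk", "Egg", "Apple"},
]

def apriori(transactions, min_support=2):
    """Return {frozenset: support_count} for all frequent itemsets."""
    def count(cands):
        return {c: sum(c <= t for t in transactions) for c in cands}

    items = {i for t in transactions for i in t}
    # Step 1: frequent 1-itemsets (C1 filtered by min_support gives L1)
    level = {c: n for c, n in count({frozenset([i]) for i in items}).items()
             if n >= min_support}
    frequent = dict(level)
    k = 2
    while level:
        # Join L(k-1) with itself, keeping only size-k unions (candidate Ck)
        cands = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Filter Ck by min_support to obtain Lk
        level = {c: n for c, n in count(cands).items() if n >= min_support}
        frequent.update(level)
        k += 1
    return frequent
```

With min_support = 2, the itemset {Bread, Milk, Egg} comes out with a support count of 3, matching the support-count example earlier; association rules are then generated from each frequent itemset by checking the minimum-confidence threshold.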
Advantages of the Apriori Algorithm are as follows:
Disadvantages of Apriori Algorithm are as follows:
Thank You All.....