1 of 50

Clustering for

Image Analysis

Kanika Chopra,

Nicholas Vadivelu

2 of 50

Introduction

KANIKA CHOPRA

(she/her)

4A Math. Finance & Stats

kanikadatt@gmail.com

NICHOLAS VADIVELU

(he/him)

4A Computer Science & Stats

nicholas.vadivelu@gmail.com

3 of 50

How can you get involved?

This slide deck: bit.ly/uwdsc_wistem_w21

Facebook Page: facebook.com/uwdsc
Email: waterloodatascience@gmail.com
Discord: bit.ly/uwdsc-discord

Other Links: YouTube, Website, Twitter, Linktr.ee

4 of 50

Contents

What is Data Science?
Python & NumPy Basics
K-Means Clustering
Implementing K-Means

5 of 50

What is Data Science?

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data. Data science is related to data mining, machine learning and big data.

Wikipedia

6 of 50

Examples of Data Science

Visualizing voter patterns across the country
A/B testing a new product feature with users
Netflix’s video recommendation engine
Identifying fraudulent credit card transactions
Matching and routing drivers for Uber
Determining pricing rates for insurance
Identifying and predicting the motion of objects on the road for self-driving cars
Facial recognition software for security
Generating fake images for malicious use
...and basically everything you use in your daily life!

7 of 50

What is Machine Learning?

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

expertsystem.com

8 of 50

Examples of Machine Learning

Netflix’s video recommendation engine
Identifying fraudulent credit card transactions
Matching and routing drivers for Uber
Determining pricing rates for insurance
Identifying and predicting the motion of objects on the road for self-driving cars
Facial recognition software for security
Generating fake images for malicious use

9 of 50

Artificial Intelligence

Data Science

Machine Learning

10 of 50

What are career options in this field?

Data Analyst
Data Scientist
Machine Learning / Research Engineer
Data Engineer/Architect/Consultant
Research Scientist

Disclaimer: These are our categorizations! Often these titles are mixed, and definitely vary by organization.

11 of 50

Python “Crash Course”

12 of 50

What is Python?

A general purpose programming language that is commonly used in data science
What is a program?

A sequence of commands executed by a computer

13 of 50

Your First Python Program

Program:

print('Hello, World!')

Output:

Hello, World!

14 of 50

Your First Python Program

Program:

print('Hello')
print('World')

Output:

Hello
World

15 of 50

Variables

Program:

x = 'Hello, World!' # assignment
print(x)

y = x # assignment
print(y)

Output:

Hello, World!
Hello, World!

16 of 50

Variables 2

Program:

x = 2
print(x)

x = x + 2
print(x)

x *= 3 # same as x = x * 3
print(x)

Output:

2
4
12

17 of 50

Conditionals

Program:

string = 'great job!'
if string == 'great job!':
    print(':)')
else:
    print(':(')

Output:

:)

18 of 50

Conditionals 2

Program:

x = 4.0
if x < 3:
    print(x, 'is less than 3.')
elif x >= 7:
    print(x, 'is >= 7.')
else:
    print(x, 'is between 3 and 7.')

Output:

4.0 is between 3 and 7.

19 of 50

While Loops

Program:

# compute floor(log_2(x))
x = 13
value = -1

while x > 0:
    x = x // 2 # integer division
    value += 1
    print('x =', x)
    print('value =', value)
    print()

print('floor(log_2(x)) =', value)

Output:

x = 6
value = 0

x = 3
value = 1

x = 1
value = 2

x = 0
value = 3

floor(log_2(x)) = 3

20 of 50

Lists

Program:

var = [1, 2, 3, 4]
print(var[0]) # indexing
print(var[2])

var.append(5)
print(var)

var.append('nice')
print(var)

Output:

1
3
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5, 'nice']

21 of 50

Lists: Advanced Indexing

Program:

var = [10, 11, 12, 13, 14, 15]
print(var[-1])
print(var[-2])
print(var[0:4]) # slice
print(var[0:4:2])

Output:

15
14
[10, 11, 12, 13]
[10, 12]

22 of 50

For Loops

Program:

var = [10, 11, 12, 13, 14, 15]

for elem in var:
    print(elem + 10)

Output:

20
21
22
23
24
25

23 of 50

For Loops

Program:

print(list(range(10)))
print(list(range(4, 10)))
print(list(range(4, 10, 3)))

for i in range(3):
    print(i)

Output:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[4, 5, 6, 7, 8, 9]
[4, 7]
0
1
2

24 of 50

Functions

Program:

def func(x, y):
    print('Called func')
    return x + y

z = func(1, 2)
print(z)

# could do func(x=1, y=2)
# or func(1, y=2)
# or func(y=2, x=1)

Output:

Called func
3

25 of 50

NumPy

Program:

import numpy as np # load library

x = np.array([1, 2, 3])
print(x)
print(x + 1)

y = np.array([4, 5, 6])
print(x + y)

Output:

[1 2 3]
[2 3 4]
[5 7 9]

26 of 50

NumPy

Program:

import numpy as np # load library

x = np.array([[1, 2, 3],
              [4, 5, 6]])

print(x)
print()

print(np.sum(x))
print()

print(np.sum(x, axis=0))
print()

print(np.sum(x, axis=1))
print()

Output:

[[1 2 3]
 [4 5 6]]

21

[5 7 9]

[ 6 15]

27 of 50

NumPy Attributes & Methods

Program:

import numpy as np # load library

x = np.array([[1, 2, 3],
              [4, 5, 6]])

print(x.shape) # shape is an attribute

y = x.reshape(2, 3) # reshape is a method
print(y)
print(y.shape)

z = y.reshape(6)
print(z)
print(z.shape)

Output:

(2, 3)
[[1 2 3]
 [4 5 6]]
(2, 3)
[1 2 3 4 5 6]
(6,)

28 of 50

Exercises

The equation for a line is y = mx + b. Write a function that takes a list x, a coefficient m, and a bias b, and returns a list y containing m*x + b for each element of x.

e.g. x = [2, 4, 6, 8], m = 3, b = 1.
f(x, m, b) -> [7, 13, 19, 25]

Write a function that finds the maximum value in list a, and returns the corresponding element at the same index in list b.

e.g. a = [3, 2, 5, 1], b = ['a', 'b', 'c', 'd']
f(a, b) -> 'c'
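
One possible way to solve both exercises is sketched below, using only lists, indexing, and functions from the crash course. The function names line and value_at_max are made up for this sketch (the slide just calls both functions f), and other solutions, such as an explicit for loop, are just as valid.

```python
def line(x, m, b):
    """Return a list with m * xi + b for every element xi of x."""
    return [m * xi + b for xi in x]


def value_at_max(a, b):
    """Return the element of b at the index where a is largest."""
    max_index = a.index(max(a))  # index of the first occurrence of the largest value
    return b[max_index]


print(line([2, 4, 6, 8], m=3, b=1))                       # [7, 13, 19, 25]
print(value_at_max([3, 2, 5, 1], ['a', 'b', 'c', 'd']))   # c
```

If the largest value appears more than once in a, this sketch uses its first position.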

29 of 50

Any questions so far?

30 of 50

What is Clustering?

A form of pattern recognition
Goal is to group a set of data based on their common features

Data within a cluster are similar
Data in different clusters are different

Most common method is K-Means Clustering (we will learn this soon!)

31 of 50

Clustering: Example

Let’s say you own a t-shirt business and only offer one size
You want to introduce three sizes for your shirts based on your customers' weights and heights
You have the following data and create three sizes

32 of 50

Applications of Clustering

Marketing and Sales
Document Classification
Image Analysis

33 of 50

Applications of Clustering: Marketing and Sales

Goal: Identify distinct customer groups based on their characteristics
Types of Data:

Demographic (gender, age, income, etc.)
Geographic (city, province, country)
Psychographics (social class, lifestyle, personality traits)
Behavioural (spending habits, product/service purchases)

Use Cases:

Determining the right price
Customizing and personalizing marketing campaigns
Choosing new product features

Example:

You work at a clothing store and your sales are down. Your marketing team wants to find the best marketing plan for targeting different groups of customers. They ask you to help segment the customer base.

34 of 50

Applications of Clustering: Document Classification

Goal: Classify the type of document
Types of Data:

Tags (title, subtitle, author, caption)
Topics (sports, politics, entertainment)
Content of document

Use Cases:

Finding a similar document
Categorizing documents

Example:

Your teacher asks each student to write a research paper on a topic of their choosing. You want to figure out what common categories these papers fall into. You use clustering to organize the papers into various clusters based on the content.

35 of 50

Applications of Clustering: Image Analysis

Goal: Identify distinct parts of an image
Types of Data:

Pixels

Use Cases:

Identifying cancerous parts of a scan
Determining a number in a set of images
Segmenting an image, e.g. separating the water from the land

Example:

You are analysing a patient's scans and want to determine whether the patient has cancer. You perform a cluster analysis to identify the cancerous and non-cancerous cells based on their traits (e.g. color, shape, location) in the scan.
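
To make "pixels as data" concrete, here is a small sketch of how an image might be turned into a list of data points for clustering. The 2 x 3 image and its colour values are made up for illustration; a real scan would be loaded from a file and would have many more pixels.

```python
# A made-up 2 x 3 "image": each pixel is an (R, G, B) tuple with values 0-255
image = [
    [(20, 90, 200), (25, 95, 210), (30, 160, 60)],   # water, water, land
    [(22, 92, 205), (35, 150, 55), (40, 155, 50)],   # water, land, land
]

# Flatten the rows into one list of data points (one point per pixel, 3 features each)
pixels = [pixel for row in image for pixel in row]

# Remember where each pixel came from so cluster labels can be mapped back to the image
positions = [(r, c) for r, row in enumerate(image) for c, _ in enumerate(row)]

print(len(pixels))    # 6
print(pixels[0])      # (20, 90, 200)
print(positions[4])   # (1, 1)
```

Each pixel is now a point with three features, so a clustering algorithm can group pixels with similar colours without knowing anything about the image layout.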

36 of 50

Applications of Clustering: Image Analysis

Detecting cancer in scans

Object detection

Image segmentation

Pic Credit: omicsonline.org

37 of 50

K-Means Clustering: Introduction

Most common method for clustering
Unsupervised learning technique

Infers patterns in the data without knowing the final answer
Algorithm determines what features should distinguish the groups

Definitions:

Cluster: A group of data
Centroid: Central point of each cluster

Goal: Segment the data into K non-overlapping clusters so that points within a cluster are as close together as possible and points in different clusters are as far apart as possible

38 of 50

K-Means Clustering: Calculations

Euclidean Distance: The metric to calculate the distance between two points

Centroid: The arithmetic mean position of a set of points
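
As a rough sketch of the two calculations: the Euclidean distance between points p and q is the square root of the sum of (p_i - q_i)^2 over every coordinate i, and the centroid is the coordinate-wise average of the points. The function names below are chosen for this illustration only.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def centroid(points):
    """Coordinate-wise arithmetic mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


print(euclidean_distance((0, 0), (3, 4)))            # 5.0
print(centroid([(0, 0), (2, 0), (2, 2), (0, 2)]))    # (1.0, 1.0)
```

Both functions work for any number of coordinates, so the same code applies to 2-D points and to 3-feature RGB pixels.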

39 of 50

K-Means Clustering: Process

Input: Data and the # of clusters

NOTE: We do not know the group labels

Process:

Determine the number of clusters, K
Choose K initial points randomly; these are the centroids for the first round of clusters
Assign each data point to the group with the closest centroid
Compute the new centroid of each group as the mean of its points
Reassign all the data to the new closest centroid using Euclidean distance
Repeat the previous two steps and STOP iterating when the groups no longer change
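
The process above can be sketched in plain Python as follows. This is a minimal illustration, not the workshop's implementation: the function names, the fixed seed, the max_iters safety limit, the choice to pick the initial centroids from the data points, and the made-up data are all assumptions of this sketch.

```python
import math
import random


def closest(point, centroids):
    """Index of the centroid nearest to point (Euclidean distance)."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def k_means(data, k, seed=0, max_iters=100):
    """Return (labels, centroids) for a list of equal-length numeric tuples."""
    rng = random.Random(seed)          # fixed seed so reruns give the same result
    centroids = rng.sample(data, k)    # k distinct data points as initial centroids
    labels = None
    for _ in range(max_iters):
        # assign every point to its closest centroid
        new_labels = [closest(p, centroids) for p in data]
        if new_labels == labels:       # stop when the groups no longer change
            break
        labels = new_labels
        # move each centroid to the mean of the points assigned to it
        for j in range(k):
            members = [p for p, label in zip(data, labels) if label == j]
            if members:                # an empty group keeps its old centroid
                centroids[j] = mean_point(members)
    return labels, centroids


# Made-up data: two well-separated groups of three points
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
        (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

The first three points should share one label and the last three the other; which group is called 0 and which is called 1 depends on the random starting centroids.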

40 of 50

K-Means Clustering: Demo

K = 3, Iteration = 0

Randomly assigned centroids
Data assigned to closest centroids

K = 3, Iteration = 1

Centroids of initial clusters calculated

41 of 50

K-Means Clustering: Demo

K = 3, Iteration = 2

Points are re-assigned based on centroids from iteration 1
New centroids calculated

K = 3, Iteration = 3

Points are re-assigned based on centroids from iteration 2
New centroids calculated

42 of 50

K-Means Clustering: Demo

K = 3, Iteration = 4

Points are re-assigned based on centroids from iteration 3
New centroids calculated

K = 3, Iteration = 5

Iteration 5 is the same as Iteration 4
STOP the algorithm
Final clusters created

43 of 50

K-Means Clustering: Demo

This link shows a visualization of the different iterations: https://user.ceng.metu.edu.tr/~akifakkus/courses/ceng574/k-means/

44 of 50

K-Means Clustering: Image Data

http://bit.ly/kmeans-lesson

45 of 50

K-Means Clustering: FAQ

What happens if I have too many clusters?

Data that should belong in the same group may be divided into sub-groups and create less meaningful groups

What happens if I have too few clusters?

Real differences in the data may not be captured; data with different features may end up in the same cluster

How to choose the number of clusters?

Common methods include the “Elbow” method, Silhouette method, and the Sum of Squares method

46 of 50

K-Means Clustering: The “Elbow Method”

Most common method; however, inexact
Choose a range of values for K, e.g. run k-means with 1 to 20 clusters
Calculate the within-cluster sum of squares (the squared distance from each point to its centroid, added up) for each number of clusters, K
Look for the change in slope from steep to shallow (“elbow”) to determine K
The bend indicates that adding more clusters does not add much value to the model

K = 3 optimal!
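
A rough sketch of the calculation behind an elbow plot is shown below. The helper k_means_wcss, the made-up data, and the choice to keep the best of 10 random starts for each K are assumptions of this sketch; in practice a library implementation and a plot would normally be used.

```python
import math
import random


def k_means_wcss(data, k, seed=0, max_iters=100):
    """Run a basic k-means and return the within-cluster sum of squared distances."""
    centroids = random.Random(seed).sample(data, k)   # k data points as starting centroids
    labels = None
    for _ in range(max_iters):
        # label each point with the index of its nearest centroid
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in data]
        if new_labels == labels:                      # groups stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, label in zip(data, labels) if label == j]
            if members:                               # empty groups keep their centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(data, labels))


# Made-up data: three loose groups of 2-D points
data = [(1, 1), (1, 2), (2, 1),
        (8, 8), (8, 9), (9, 8),
        (1, 8), (1, 9), (2, 9)]

for k in range(1, 7):
    # keep the lowest sum of squares found over a few random starts
    best = min(k_means_wcss(data, k, seed=s) for s in range(10))
    print(k, round(best, 2))
```

Plotting the printed values against K gives the curve in which to look for the elbow. The sum of squares falls as K grows and reaches zero once every point has its own cluster, which is why the bend matters rather than the minimum.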

47 of 50

K-Means Clustering: Pros and Cons

Pros

Simple to understand
Easy to visualize the results
Dynamic → your clusters can change as data evolves over time
Great for data exploration
Efficient and easily scalable

Cons

Need to determine the number of clusters
Not deterministic → different runs of the algorithm can return different results
Clusters must be interpreted and labelled afterwards (not always easy)
Dependent on the initial centroid values
Sensitive to outliers

48 of 50

Resources

If you want to learn more in depth or about the other methods to choose the optimal number of clusters for K-Means, check out: https://towardsdatascience.com/10-tips-for-choosing-the-optimal-number-of-clusters-277e93d72d92

49 of 50

Recap

Clustering is used to group data based on common features into clusters
Applications include: Marketing & Sales, Document Classification and Image Analysis
Most common method is K-Means Clustering

Goal is to maximize inter-cluster distances and minimize intra-cluster distances
Use the Elbow method to determine the optimal number of clusters
Not effective on every dataset, but efficient when it fits

50 of 50

Any questions?