1 of 50

Clustering for

Image Analysis

Kanika Chopra,

Nicholas Vadivelu

2 of 50

Introduction

KANIKA CHOPRA

(she/her)

4A Math. Finance & Stats

kanikadatt@gmail.com

NICHOLAS VADIVELU (he/him)

4A Computer Science & Stats

nicholas.vadivelu@gmail.com

3 of 50

How can you get involved?

4 of 50

Contents

  1. What is Data Science?
  2. Python & NumPy Basics
  3. K-Means Clustering
  4. Implementing K-Means

5 of 50

What is Data Science?

Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from many structural and unstructured data. Data science is related to data mining, machine learning and big data.

  • Wikipedia

6 of 50

Examples of Data Science

  • Visualizing voter patterns across the country
  • A/B testing a new product feature with users
  • Netflix’s video recommendation engine
  • Identifying fraudulent credit card transactions
  • Matching and routing drivers for Uber
  • Determining pricing rates for insurance
  • Identifying and predicting the motion of objects on the road for self-driving cars
  • Facial recognition software for security
  • Generating fake images for malicious use
  • …and basically everything you use in your daily life!

7 of 50

What is Machine Learning?

Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

  • expertsystem.com

8 of 50

Examples of Machine Learning

  • Netflix’s video recommendation engine
  • Identifying fraudulent credit card transactions
  • Matching and routing drivers for Uber
  • Determining pricing rates for insurance
  • Identifying and predicting the motion of objects on the road for self-driving cars
  • Facial recognition software for security
  • Generating fake images for malicious use

9 of 50

Artificial Intelligence

[Figure: diagram relating Artificial Intelligence, Data Science, and Machine Learning]

10 of 50

What are career options in this field?

  • Data Analyst
  • Data Scientist
  • Machine Learning / Research Engineer
  • Data Engineer/Architect/Consultant
  • Research Scientist

Disclaimer: These are our categorizations! Often these titles are mixed, and definitely vary by organization.

11 of 50

Python “Crash Course”

12 of 50

What is Python?

  • A general purpose programming language that is commonly used in data science
  • What is a program?
    • A sequence of commands executed by a computer

13 of 50

Your First Python Program

Program:

print('Hello, World!')

Output:

Hello, World!

14 of 50

Your First Python Program

Program:

print('Hello')
print('World')

Output:

Hello
World

15 of 50

Variables

Program:

x = 'Hello, World!'  # assignment
print(x)

y = x  # assignment
print(y)

Output:

Hello, World!
Hello, World!

16 of 50

Variables 2

Program:

x = 2
print(x)

x = x + 2
print(x)

x *= 3  # same as x = x * 3
print(x)

Output:

2
4
12

17 of 50

Conditionals

Program:

string = 'great job!'
if string == 'great job!':
    print(':)')
else:
    print(':(')

Output:

:)

18 of 50

Conditionals 2

Program:

x = 4.0
if x < 3:
    print(x, 'is less than 3.')
elif x >= 7:
    print(x, 'is greater than or equal to 7.')
else:
    print(x, 'is between 3 and 7.')

Output:

4.0 is between 3 and 7.

19 of 50

While Loops

Program:

# compute floor(log_2(x))
x = 13
value = -1

while x > 0:
    x = x // 2  # integer division
    value += 1
    print('x =', x)
    print('value =', value)
    print()

print('floor(log_2(x)) =', value)

Output:

x = 6
value = 0

x = 3
value = 1

x = 1
value = 2

x = 0
value = 3

floor(log_2(x)) = 3

20 of 50

Lists

Program:

var = [1, 2, 3, 4]
print(var[0])  # indexing
print(var[2])

var.append(5)
print(var)

var.append('nice')
print(var)

Output:

1
3
[1, 2, 3, 4, 5]
[1, 2, 3, 4, 5, 'nice']

21 of 50

Lists: Advanced Indexing

Program:

var = [10, 11, 12, 13, 14, 15]
print(var[-1])
print(var[-2])
print(var[0:4])  # slice
print(var[0:4:2])

Output:

15
14
[10, 11, 12, 13]
[10, 12]

22 of 50

For Loops

Program:

var = [10, 11, 12, 13, 14, 15]

for elem in var:
    print(elem + 10)

Output:

20
21
22
23
24
25

23 of 50

For Loops

Program:

print(list(range(10)))
print(list(range(4, 10)))
print(list(range(4, 10, 3)))

for i in range(3):
    print(i)

Output:

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
[4, 5, 6, 7, 8, 9]
[4, 7]
0
1
2

24 of 50

Functions

Program:

def func(x, y):
    print('Called func')
    return x + y

z = func(1, 2)
print(z)

# could do func(x=1, y=2)
# or func(1, y=2)
# or func(y=2, x=1)

Output:

Called func
3

25 of 50

NumPy

Program:

import numpy as np  # load library

x = np.array([1, 2, 3])
print(x)
print(x + 1)

y = np.array([4, 5, 6])
print(x + y)

Output:

[1 2 3]
[2 3 4]
[5 7 9]

26 of 50

NumPy

Program:

import numpy as np  # load library

x = np.array([[1, 2, 3],
              [4, 5, 6]])

print(x)
print()

print(np.sum(x))
print()

print(np.sum(x, axis=0))
print()

print(np.sum(x, axis=1))
print()

Output:

[[1 2 3]
 [4 5 6]]

21

[5 7 9]

[ 6 15]

27 of 50

NumPy Attributes & Methods

Program:

import numpy as np  # load library

x = np.array([[1, 2, 3],
              [4, 5, 6]])

print(x.shape)  # shape is an attribute

y = x.reshape(2, 3)  # reshape is a method
print(y)
print(y.shape)

z = y.reshape(6)
print(z)
print(z.shape)

Output:

(2, 3)
[[1 2 3]
 [4 5 6]]
(2, 3)
[1 2 3 4 5 6]
(6,)

28 of 50

Exercises

  1. The equation for a line is y = mx + b. Write a function that takes a list x, a coefficient m, and a bias b, and returns a list y containing m * x_i + b for each element x_i of x.
    • e.g. x = [2, 4, 6, 8], m = 3, b = 1
      f(x, m, b) -> [7, 13, 19, 25]
  2. Write a function that finds the maximum value in list a, and returns the element at the same index in list b.
    • e.g. a = [3, 2, 5, 1], b = ['a', 'b', 'c', 'd']
      f(a, b) -> 'c'
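
One possible way to approach the two exercises, using only loops and lists from the earlier slides. This is a sketch rather than the official solution; the function names `line` and `value_at_max` are our own choice (the exercises just call the function f).

```python
def line(x, m, b):
    """Return a list holding m * xi + b for each xi in x."""
    y = []
    for xi in x:
        y.append(m * xi + b)
    return y


def value_at_max(a, b):
    """Return the element of b at the index where a is largest."""
    best = 0  # index of the largest value of a seen so far
    for i in range(len(a)):
        if a[i] > a[best]:
            best = i
    return b[best]


print(line([2, 4, 6, 8], 3, 1))                           # [7, 13, 19, 25]
print(value_at_max([3, 2, 5, 1], ['a', 'b', 'c', 'd']))  # c
```

If the maximum appears more than once in a, this version returns the element for its first occurrence.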

29 of 50

Any questions so far?

30 of 50

What is Clustering?

  • A form of pattern recognition
  • Goal is to group a set of data based on their common features
    • Data within a cluster are similar
    • Data in different clusters are different
  • Most common method is K-Means Clustering (we will learn this soon!)

31 of 50

Clustering: Example

  • Let’s say you own a t-shirt business and only offer one size
  • You want to introduce three sizes for your shirts based on your customers' weights and heights
  • You group your customer data into three clusters and create one size for each

32 of 50

Applications of Clustering

  • Marketing and Sales
  • Document Classification
  • Image Analysis

33 of 50

Applications of Clustering: Marketing and Sales

  • Goal: Identify distinct customer groups based on their characteristics
  • Types of Data:
    • Demographic (gender, age, income, etc.)
    • Geographic (city, province, country)
    • Psychographics (social class, lifestyle, personality traits)
    • Behavioural (spending habits, product/service purchases)
  • Use Cases:
    • Determining the right price
    • Customizing and personalizing marketing campaigns
    • Choosing new product features
  • Example:
    • You work at a clothing store and your sales are down. Your marketing team wants a marketing plan that targets different groups of customers, and asks you to help segment the customer base.

34 of 50

Applications of Clustering: Document Classification

  • Goal: Classify the type of document
  • Types of Data:
    • Tags (title, subtitle, author, caption)
    • Topics (sports, politics, entertainment)
    • Content of document
  • Use Cases:
    • Finding a similar document
    • Categorizing documents
  • Example:
    • Your teacher asks each student to write a research paper on a topic of their choosing. You want to figure out what common categories these papers fall into. You use clustering to organize the papers into various clusters based on the content.

35 of 50

Applications of Clustering: Image Analysis

  • Goal: Identify distinct parts of an image
  • Types of Data:
    • Pixels
  • Use Cases:
    • Identifying cancerous parts of a scan
    • Determining a number in a set of images
    • Segmenting an image, e.g. separating the water from the land
  • Example:
    • You are analysing a patient's scans and want to determine whether the patient has cancer. You perform a cluster analysis to identify the cancerous and non-cancerous cells based on their traits (e.g. color, shape, location) in the scan.
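
To make "pixels as data" concrete, here is a minimal sketch of how an image could be turned into a table of data points with NumPy. The tiny 2 x 3 image and its color values are made up for illustration; a real image would be loaded from a file.

```python
import numpy as np

# A made-up 2 x 3 RGB image: each pixel holds (red, green, blue) values.
image = np.array([
    [[0, 0, 255], [0, 0, 250], [30, 160, 40]],
    [[5, 5, 245], [25, 150, 35], [35, 170, 45]],
])
height, width, channels = image.shape
print(image.shape)   # (2, 3, 3)

# Flatten the grid so that each row is one pixel, i.e. one data point
# with three features (its red, green and blue values).
pixels = image.reshape(height * width, channels)
print(pixels.shape)  # (6, 3)
```

Each row of `pixels` can then be treated like any other data point, so bluish "water" pixels and greenish "land" pixels could end up in different clusters.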

36 of 50

Applications of Clustering: Image Analysis

Detecting cancer in scans

Object detection

Image segmentation

Pic Credit: omicsonline.org

37 of 50

K-Means Clustering: Introduction

  • Most common method for clustering
  • Unsupervised learning technique
    • Infers patterns in the data without knowing the final answer
    • Algorithm determines what features should distinguish the groups

Definitions:

  1. Cluster: A group of data
  2. Centroid: Central point of each cluster

Goal: Segment the data into K non-overlapping clusters by minimizing the distance between data points within each cluster and maximizing the distance between points in distinct clusters

38 of 50

K-Means Clustering: Calculations

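  1. Euclidean Distance: The metric to calculate the distance between two points. For points p = (p_1, ..., p_n) and q = (q_1, ..., q_n):
     d(p, q) = sqrt((p_1 - q_1)^2 + ... + (p_n - q_n)^2)

  2. Centroid: The arithmetic mean position of a set of points. For points x_1, ..., x_m:
     c = (x_1 + ... + x_m) / m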
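
Both calculations can be sketched in plain Python with the loops and lists covered earlier. The function names and the sample points below are our own, chosen only for illustration.

```python
import math


def euclidean_distance(p, q):
    """Straight-line distance between two points with the same number of coordinates."""
    total = 0
    for i in range(len(p)):
        total += (p[i] - q[i]) ** 2
    return math.sqrt(total)


def centroid(points):
    """Arithmetic mean position of a list of points."""
    center = []
    for d in range(len(points[0])):  # one coordinate at a time
        coordinate_sum = 0
        for point in points:
            coordinate_sum += point[d]
        center.append(coordinate_sum / len(points))
    return center


print(euclidean_distance([0, 0], [3, 4]))           # 5.0
print(centroid([[0, 0], [2, 0], [2, 4], [0, 4]]))   # [1.0, 2.0]
```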

39 of 50

K-Means Clustering: Process

Input: Data and the number of clusters

NOTE: We do not know the group labels

Process:

  1. Determine the number of clusters, K
  2. Choose K initial points randomly; these are the centroids for the first round of clusters
  3. Assign each data point to the group of its closest centroid
  4. Compute the centroid (average position) of each group
  5. Reassign all the data to the new closest centroid, using Euclidean distance
  6. Repeat steps 4 and 5; STOP when the groups no longer change
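
The process above can be sketched in NumPy roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `k_means`, the `seed` and `max_iters` parameters, and the sample data are assumptions of ours, and we assume the initial centroids are picked from the data points themselves.

```python
import numpy as np


def k_means(data, k, seed=0, max_iters=100):
    """Cluster the rows of `data` into k groups; return (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids (our assumption).
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iters):
        # Steps 3 and 5: Euclidean distance from every point to every centroid,
        # then assign each point to its closest centroid.
        diffs = data[:, None, :] - centroids[None, :, :]
        distances = np.sqrt((diffs ** 2).sum(axis=2))
        new_labels = distances.argmin(axis=1)
        # Step 6: stop once the groups no longer change.
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its group.
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # keep the old centroid if a group is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Made-up 2-D data: two tight groups of three points each.
data = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                 [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
labels, centroids = k_means(data, k=2)
print(labels)
print(centroids)
```

On this data the first three points should share one label and the last three the other. Which group is numbered 0 depends on the random starting centroids.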

40 of 50

K-Means Clustering: Demo

  • Randomly assigned centroids
  • Data assigned to closest centroids
  • Centroids of initial clusters calculated

K = 3, Iteration = 0

K = 3, Iteration = 1

41 of 50

K-Means Clustering: Demo

  • Points are re-assigned based on centroids from iteration 1
  • New centroids calculated
  • Points are re-assigned based on centroids from iteration 2
  • New centroids calculated

K = 3, Iteration = 2

K = 3, Iteration = 3

42 of 50

K-Means Clustering: Demo

  • Points are re-assigned based on centroids from iteration 3
  • New centroids calculated
  • Iteration 5 is the same as Iteration 4
  • STOP the algorithm
  • Final clusters created

K = 3, Iteration = 4

K = 3, Iteration = 5

43 of 50

K-Means Clustering: Demo

This link shows a visualization of the different iterations: https://user.ceng.metu.edu.tr/~akifakkus/courses/ceng574/k-means/

44 of 50

K-Means Clustering: Image Data

45 of 50

K-Means Clustering: FAQ

  • What happens if I have too many clusters?

Data that should belong in the same group may be split into sub-groups, producing less meaningful clusters

  • What happens if I have too few clusters?

Real differences in the data may not be captured by the clusters; data with different features may end up in the same cluster

  • How to choose the number of clusters?

Common methods include the “Elbow” method, Silhouette method, and the Sum of Squares method

46 of 50

K-Means Clustering: The “Elbow Method”

  • Most common method; however, inexact
  • Choose a range for the number of clusters, e.g. try k-means for K = 1 to 20
  • Calculate the sum of squared distances from each point to its centroid for each number of clusters, K
  • Look for the change in slope from steep to shallow (“elbow”) to determine K
  • The bend indicates that adding more clusters does not add much value to the model
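
A minimal sketch of how the numbers behind an elbow plot could be computed. It assumes a compact NumPy k-means with initial centroids drawn from the data; the helper names `k_means` and `sum_of_squares`, the seeds, and the three made-up blobs of points are all our own choices for illustration.

```python
import numpy as np


def k_means(data, k, seed=0, max_iters=100):
    """Minimal k-means; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # Distance from every point to every centroid, then closest centroid.
        distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            if np.any(labels == j):
                new_centroids[j] = data[labels == j].mean(axis=0)
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids


def sum_of_squares(data, labels, centroids):
    """Total squared distance from each point to its assigned centroid."""
    return float(((data - centroids[labels]) ** 2).sum())


# Made-up data: three blobs of 30 points around (0, 0), (5, 5) and (0, 5).
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
data = np.concatenate([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = {}
for k in range(1, 7):
    labels, centroids = k_means(data, k)
    scores[k] = sum_of_squares(data, labels, centroids)
    print(k, round(scores[k], 1))
```

Plotting `scores` against K gives the elbow curve. On data like this the drop would be expected to flatten around K = 3, although a poor random start can blur the bend, which is one reason the method is inexact.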

K = 3 optimal!

47 of 50

K-Means Clustering: Pros and Cons

Pros

  • Simple to understand
  • Easy to visualize the results
  • Dynamic → your clusters can change as data evolves over time
  • Great for data exploration
  • Efficient and easily scalable

Cons

  • Need to determine the number of clusters
  • Not deterministic → different runs of the algorithm can return different results
  • Clusters must be interpreted and labelled afterwards (not always easy)
  • Dependent on the initial centroid values
  • Sensitive to outliers

48 of 50

Resources

49 of 50

Recap

  • Clustering is used to group data based on common features into clusters
  • Applications include: Marketing & Sales, Document Classification and Image Analysis
  • Most common method is K-Means Clustering
    • Goal is to maximize inter-cluster distances and minimize intra-cluster distances
    • Use the Elbow method to determine the optimal number of clusters
    • Not effective on every dataset, but efficient to run

50 of 50

Any questions?