
Machine Learning Lab

Department of Information Technology, Dhanekula Institute of Engineering and Technology


Central Tendency

Describes the center point of a dataset.

Dispersion

Describes the spread or variability.

Relevance

Statistical properties summarize how the values in a dataset are distributed.

Crucial for selecting the right algorithms and understanding model behavior.

Mean: Arithmetic average.

Median: Middle value.

Mode: Most frequent value.

Variance (σ²): Average squared difference from the mean.

Std Dev (σ): Square root of the variance.

Statistical Foundations
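For reference, the standard sample formulas (matching the ddof=1 used in the Exp 1 program later):

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2, \qquad \sigma = \sqrt{\sigma^2}$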


Cleaning & Selection

Transformation

Missing Values: Handle via deletion or imputation (Mean/Median).

Attribute Selection: Removing irrelevant features to improve accuracy.

Discretization: Converting continuous data into discrete bins (e.g., Age Groups).

Outliers: Detecting and removing anomalies (Z-Score, IQR).

Data Preprocessing Techniques
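A minimal sketch of the two outlier rules named above (Z-Score and IQR); the marks list and the 2.5 z-score threshold are illustrative assumptions, not part of the lab programs:

import pandas as pd

marks = pd.Series([45, 52, 58, 60, 61, 63, 65, 70, 72, 190])  # hypothetical data with one extreme value

# Z-Score rule: flag points that lie far from the mean in standard-deviation units
z = (marks - marks.mean()) / marks.std()
print("Z-score outliers:", marks[abs(z) > 2.5].tolist())  # 2.5 chosen for this tiny sample; 3 is also common

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = marks.quantile(0.25), marks.quantile(0.75)
iqr = q3 - q1
mask = (marks < q1 - 1.5 * iqr) | (marks > q3 + 1.5 * iqr)
print("IQR outliers:", marks[mask].tolist())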


Lazy Learning Algorithm

KNN is non-parametric and makes no assumptions about the underlying data distribution.

Mechanism

Calculates distance between new point and training points (e.g., Euclidean).

Identifies 'K' closest neighbors.

Classification: Majority vote determines class.

Regression: Average of neighbors' values.

K-Nearest Neighbors (KNN)
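A minimal from-scratch sketch of this mechanism (Exp 3 later uses scikit-learn's KNeighborsRegressor and KNeighborsClassifier instead); the toy 2-D points below are made up purely for illustration:

import numpy as np
from collections import Counter

# Hypothetical training data: two features, classes 0 and 1
X_train = np.array([[1, 1], [1, 2], [2, 1], [6, 5], [7, 7], [6, 6]])
y_train = np.array([0, 0, 0, 1, 1, 1])

def knn_predict(x_new, k=3):
    dists = np.linalg.norm(X_train - x_new, axis=1)   # Euclidean distance to every training point
    nearest = np.argsort(dists)[:k]                   # Indices of the K closest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]  # Majority vote among neighbor labels

print(knn_predict(np.array([2, 2])))    # expected: class 0
print(knn_predict(np.array([6.5, 6])))  # expected: class 1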


Decision Trees

Tree-structured classifier where internal nodes represent features and branches represent rules.

Splitting: Uses Information Gain or Gini Index.

Pruning: Removes sub-nodes to prevent overfitting.

Leaf Nodes: Represent the final outcome.

Random Forest

An ensemble learning method that constructs a multitude of decision trees.

Bagging: Trains trees on random subsets of data.

Voting: Classification output is the class selected by most trees.

Robustness: Mitigates overfitting seen in single trees.

Trees & Ensembles
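For reference, the two splitting criteria mentioned above, written for a node $S$ with class proportions $p_k$ and a candidate attribute $A$:

$\text{Gini}(S) = 1 - \sum_{k} p_k^{2}, \qquad \text{Gain}(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|} H(S_v), \quad H(S) = -\sum_{k} p_k \log_2 p_k$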


Naïve Bayes

Probabilistic Classifier

Based on Bayes' Theorem with a strong (naïve) independence assumption between features.

Key Characteristics

Assumes feature independence.

Highly effective for large datasets.

Commonly used in text classification (e.g., Spam Filtering).
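The underlying rule, with the naïve independence assumption applied to features $x_1, \ldots, x_n$:

$P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$

The predicted class is the $C$ that maximizes this product.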


Optimal Hyperplane

Finds a hyperplane in N-dimensional space that distinctly classifies data points.

Key Concepts

Margin: The distance between the hyperplane and nearest data points (Support Vectors). Maximizing this is the goal.

Kernel Trick: Transforms data into higher dimensions to handle non-linearly separable data.

Support Vector Machines (SVM)
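For reference, the hard-margin formulation behind the idea above: the hyperplane is $w \cdot x + b = 0$, the margin is $2 / \lVert w \rVert$, and training solves

$\min_{w,\, b} \ \tfrac{1}{2}\lVert w \rVert^{2} \quad \text{subject to} \quad y_i \left( w \cdot x_i + b \right) \ge 1 \ \text{for all } i.$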


Simple Linear Regression

Models the relationship between a dependent variable (y) and an independent variable (x).

y = mx + c

Minimizes the sum of squared errors to fit a straight line.

Logistic Regression

Despite the name, it is a Classification algorithm for binary outcomes (0 or 1).

Uses the Sigmoid Function to map predictions to probabilities between 0 and 1.

Decision boundary separates the classes.

Regression Algorithms
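For reference: simple linear regression chooses $m$ and $c$ to minimize $\sum_i \left(y_i - (m x_i + c)\right)^2$, while logistic regression passes a linear score through the sigmoid

$\sigma(z) = \frac{1}{1 + e^{-z}},$

predicting class 1 when $\sigma(z) \ge 0.5$ (the default decision boundary).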


Multi-layer Perceptron

Feedforward Neural Network

A class of Artificial Neural Networks (ANN) consisting of at least three layers of nodes.

Structure & Training

Layers: Input, Hidden (non-linear), and Output.

Backpropagation: Training technique where error is propagated backward to update weights.

Activation: Uses functions like ReLU or Sigmoid.
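In backpropagation, each weight is nudged against the gradient of the loss $L$:

$w \leftarrow w - \eta \, \frac{\partial L}{\partial w}$

where $\eta$ is the learning rate (learning_rate_init in the Exp 11 program later).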


Unsupervised Learning

Partitions dataset into K distinct, non-overlapping clusters.

Iterative Process

1. Initialize: Select K random centroids.

2. Assign: Assign points to nearest centroid (Euclidean distance).

3. Update: Recalculate centroids based on mean of assigned points.

K-Means Clustering
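The iteration above minimizes the within-cluster sum of squared distances

$J = \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert^{2}$

where $\mu_{c_i}$ is the centroid assigned to point $x_i$; Exp 12 later reports the closely related sum of (unsquared) Euclidean distances as its performance measure.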


Fuzzy C-Means (FCM)

Soft clustering where data points can belong to multiple clusters.

Membership: Each point has a degree of membership (probability) for each cluster.

Useful when boundaries are ambiguous.

Expectation Maximization (EM)

Used for Gaussian Mixture Models (GMM).

E-Step: Estimate probability of belonging to a cluster.

M-Step: Update parameters to maximize likelihood.

Accounts for cluster variance and shape (elliptical).

Advanced Clustering
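For reference, the FCM membership and centre updates (applied in vectorized form in Exp 13), with fuzziness $m > 1$:

$u_{ik} = \frac{1}{\sum_{j=1}^{K} \left( \dfrac{\lVert x_i - c_k \rVert}{\lVert x_i - c_j \rVert} \right)^{2/(m-1)}}, \qquad c_k = \frac{\sum_i u_{ik}^{m} \, x_i}{\sum_i u_{ik}^{m}}$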


Accuracy

The ratio of correctly predicted observations to total observations.

(TP + TN) / Total

Confusion Matrix

Table visualizing performance.

True Positives, False Positives, True Negatives, False Negatives

MSE

Mean Squared Error.

Average squared difference between actual and predicted values (for Regression).

Model Evaluation Metrics
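Written out, the two formulas above are

$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{MSE} = \frac{1}{n}\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^{2}$

where $\hat{y}_i$ is the predicted value for example $i$.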


EXP1: To compute and understand the basic measures of Central Tendency (Mean, Median, Mode) and Dispersion (Variance, Standard Deviation) using Python

Program:

import numpy as np # Library for numerical computing

from scipy import stats # Library for statistical analysis

data = [85, 92, 78, 85, 95, 88, 72, 85, 90, 80] # Dataset

data_array = np.array(data) # Converting lists to NumPy array

mean_value = np.mean(data_array) # Average

median_value = np.median(data_array) # Middle value of the sorted data

mode_result = stats.mode(data_array, keepdims=True) # Most frequent value; keepdims=True keeps array output across SciPy versions

print(f"Mean: {mean_value:.2f}") # :.2f formats to two decimal places

print(f"Median: {median_value:.2f}")

print(f"Mode: {mode_result.mode[0]}")

variance_value = np.var(data_array, ddof=1) # Sample variance

std_dev_value = np.std(data_array, ddof=1) # Sample standard deviation

print(f"Variance: {variance_value:.2f}")

print(f"Standard Deviation: {std_dev_value:.2f}")


EXP 2: To apply fundamental data preprocessing techniques (Attribute Selection, Handling Missing Values, Discretization, Elimination of Outliers) using Python libraries.

Pre-Processing Techniques

Before analyzing a dataset, we must clean and prepare it. This process is called data pre-processing.

a) Attribute Selection

Attribute selection means removing unwanted or unnecessary columns from the dataset, such as ID numbers or names, which do not affect the analysis.

b) Handling Missing Values

Missing values are empty fields in the data. We fill them using statistical methods like mean, median, or mode so that no data remains empty.

c) Discretization

Discretization is the process of converting continuous numeric data into groups or categories (called bins).

For example, converting marks into grade ranges.

d) Elimination of Outliers

Outliers are incorrect or extreme values that affect analysis.

These values are removed or corrected to make the data more accurate.


import pandas as pd

from sklearn.preprocessing import KBinsDiscretizer

df = pd.read_csv(r"C:\Users\pvste\OneDrive\Desktop\exp2.csv") # Load Dataset (Give the path from desktop)

df = df.drop(["Roll No", "Name of the Student"], axis=1) # Remove Unwanted Columns

df["Age"] = df["Age"].fillna(df["Age"].median())

df["Marks"] = df["Marks"].fillna(df["Marks"].median())

df["Grade"] = df["Grade"].fillna(df["Grade"].mode()[0])

df["Status"] = df["Status"].fillna(df["Status"].mode()[0])

# Cap Outliers (>100 Marks)

df.loc[df["Marks"] > 100, "Marks"] = 100

# Discretize (Bin) Marks into 5 groups

kb = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')

df["Marks_Bin"] = kb.fit_transform(df[["Marks"]]).astype(int)

# Final Output

print(df.head())

Age Marks Grade Status Marks_Bin

0 19.0 77 B Pass 3

1 19.0 67 C Pass 2

2 19.0 4 F Fail 0

3 19.0 60 C Pass 1

4 20.0 91 S Pass 4


import pandas as pd #used to work with tables (DataFrames)

import numpy as np #used for numerical operations (like np.nan)

from sklearn.preprocessing import KBinsDiscretizer #converts continuous values into groups (bins)

data = { #Creating the Dataset

    'Age': [25, 30, 35, 40, 22, 60],

    'Salary': [50000, 60000, np.nan, 120000, 30000, 1000000],

    'Experience': [2, 5, 8, 15, 1, 40],

    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT']

}

df = pd.DataFrame(data) #Creating a DataFrame

print("Original Dataset:\n", df)

df_selected = df[['Age', 'Salary', 'Experience']].copy() #.copy() avoids changing the original DataFrame

df_selected['Salary'] = df_selected['Salary'].fillna(df_selected['Salary'].mean())

discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform') #Divides Age into 3 equal ranges (bins)

df_selected['Age_Bin'] = discretizer.fit_transform(df_selected[['Age']]) #Converts ages into bin codes: 0 → young, 1 → middle-aged, 2 → older

Q1 = df_selected['Salary'].quantile(0.25)

Q3 = df_selected['Salary'].quantile(0.75)

IQR = Q3 - Q1

lower_bound = Q1 - 1.5 * IQR

upper_bound = Q3 + 1.5 * IQR

df_no_outliers = df_selected[ # Removing Outliers

    (df_selected['Salary'] >= lower_bound) &

    (df_selected['Salary'] <= upper_bound)

]

print("\nAfter Eliminating Outliers:\n", df_no_outliers)


EXP 3: To apply the K-Nearest Neighbors (KNN) algorithm for both classification and regression tasks using Python.

import numpy as np # For numerical operations

import matplotlib.pyplot as plt # For creating the scatter plot graph.

from sklearn.datasets import make_regression # For generating a synthetic regression dataset

from sklearn.model_selection import train_test_split # For splitting data into training and testing sets

from sklearn.neighbors import KNeighborsRegressor # The KNN regression model

from sklearn.metrics import mean_squared_error, r2_score # Error metrics for regression

X, y = make_regression(n_samples=200, n_features=1, noise=0.1, random_state=42) # Generate synthetic dataset

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Split the dataset into training and testing sets

knn_regressor = KNeighborsRegressor(n_neighbors=5) # Create and train the KNN regressor

knn_regressor.fit(X_train, y_train)

y_pred = knn_regressor.predict(X_test) # Make predictions on the test data

mse = mean_squared_error(y_test, y_pred) # Evaluate the model

r2 = r2_score(y_test, y_pred)

print(f'Mean Squared Error: {mse}')

print(f'R-squared: {r2}')

plt.scatter(X_test, y_test, color='blue', label='Actual') # Visualize the results

plt.scatter(X_test, y_pred, color='red', label='Predicted')

plt.title('KNN Regression')

plt.xlabel('Feature')

plt.ylabel('Target')

plt.legend()

plt.show()

Mean Squared Error: 133.62045142000457

R-squared: 0.9817384115764595


import numpy as np

import matplotlib.pyplot as plt

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Step 1: Load the dataset

iris = load_iris()

X = iris.data # Features

y = iris.target # Labels

# Step 2: Split the dataset

X_train, X_test, y_train, y_test = train_test_split(

X, y, test_size=0.2, random_state=42

)

# Step 3: Create and train the KNN classifier

knn_classifier = KNeighborsClassifier(n_neighbors=5)

knn_classifier.fit(X_train, y_train)

# Step 4: Make predictions

y_pred = knn_classifier.predict(X_test)

# Step 5: Evaluate the model

accuracy = accuracy_score(y_test, y_pred)

print(f'Accuracy: {accuracy:.4f}')

print('\nClassification Report:')

print(classification_report(y_test, y_pred, target_names=iris.target_names))

print('\nConfusion Matrix:')

print(confusion_matrix(y_test, y_pred))

# Step 6: Predict a new input sample

input_sample = np.array([[5.1, 3.5, 1.4, 0.2]])

predicted_class = knn_classifier.predict(input_sample)

predicted_prob = knn_classifier.predict_proba(input_sample)

print("\nInput Sample:", input_sample)

print("Predicted Flower Name:", iris.target_names[predicted_class[0]])

print("Prediction Probabilities:", predicted_prob)


Exp 4: Demonstrate decision tree algorithm for a classification problem and perform parameter tuning for better results

import pandas as pd

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split, GridSearchCV

from sklearn.tree import DecisionTreeClassifier

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

data = load_iris() # Load dataset

X = pd.DataFrame(data.data, columns=data.feature_names)

y = pd.Series(data.target)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # Split data 

clf = DecisionTreeClassifier() # Default Decision Tree

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

print("Default Accuracy:", accuracy_score(y_test, y_pred))

params = {   # Parameter Tuning using Grid Search

'criterion': ['gini', 'entropy'],

'max_depth': [2, 3, 4, 5, None],

'min_samples_split': [2, 3, 4],

'min_samples_leaf': [1, 2, 3]

}

grid = GridSearchCV(DecisionTreeClassifier(), params, cv=5, scoring='accuracy')

grid.fit(X_train, y_train)

best_model = grid.best_estimator_   # Best Model

y_best = best_model.predict(X_test)

print("Best Params:", grid.best_params_)

print("Tuned Accuracy:", accuracy_score(y_test, y_best))

print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_best))

print("\nClassification Report:\n", classification_report(y_test, y_best))

Exp 5: Demonstrate decision tree algorithm for a regression problem

from sklearn.datasets import load_diabetes

dataset=load_diabetes()

print(dataset['DESCR'])

import pandas as pd

df_diabetes=pd.DataFrame(dataset['data'],columns=dataset['feature_names'])

x=df_diabetes

y=dataset['target']

from sklearn.model_selection import train_test_split

x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.2,random_state=42)

from sklearn.tree import DecisionTreeRegressor

regressor=DecisionTreeRegressor(max_depth=3)

regressor.fit(x_train,y_train)

y_pred=regressor.predict(x_test)

from sklearn.metrics import r2_score,mean_squared_error,mean_absolute_error

print(r2_score(y_test,y_pred))

print(mean_squared_error(y_test,y_pred))

print(mean_absolute_error(y_test,y_pred))

import matplotlib.pyplot as plt

from sklearn import tree

plt.figure(figsize=(15,10))

tree.plot_tree(regressor,fontsize=10)

plt.show()


Exp 5: Demonstrate decision tree algorithm for a regression problem

import numpy as np

from sklearn.tree import DecisionTreeRegressor, export_text

import matplotlib.pyplot as plt

from sklearn.tree import plot_tree

# Dataset

X = np.array([[1400, 20], [1600, 15], [1700, 30], [1875, 10], [1100, 25],

              [1550, 18], [2350, 5], [2450, 8], [1425, 22], [1700, 12]])

y = np.array([105000, 120000, 95000, 145000, 80000, 115000, 180000, 190000, 108000, 130000])

# Train Decision Tree Regressor (max_depth=3 for interpretability)

regressor = DecisionTreeRegressor(max_depth=3, random_state=42)

regressor.fit(X, y)

# Text representation of the tree

tree_text = export_text(regressor, feature_names=['Size', 'Age'])

print("Tree Structure:\n")

print(tree_text)

# Predictions on training data

y_pred = regressor.predict(X)

print("\nPredictions vs Actual:")

for i in range(len(y)):

    print(f"Size: {X[i][0]}, Age: {X[i][1]}, Actual: {y[i]}, Predicted: {y_pred[i]:.2f}")

# Feature importances

print("\nFeature Importances:")

print(f"Size: {regressor.feature_importances_[0]:.4f}")

print(f"Age: {regressor.feature_importances_[1]:.4f}")

# Visualize the tree (requires matplotlib)

plt.figure(figsize=(20, 10))

plot_tree(regressor, feature_names=['Size', 'Age'], filled=True, rounded=True, precision=0)

plt.title("Decision Tree for House Price Regression")

plt.show()


Output


Exp 6a: Apply Random Forest algorithm for classification

# Import required libraries

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load dataset

iris = load_iris()

X = iris.data

y = iris.target

# Split data into training and testing sets

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create Random Forest Classifier model

rf_classifier = RandomForestClassifier(

    n_estimators=100,

    criterion="gini",

    random_state=42

)

# Train the model

rf_classifier.fit(X_train, y_train)

# Make predictions

y_pred = rf_classifier.predict(X_test)

# Evaluate the model

print("Accuracy:", accuracy_score(y_test, y_pred))

print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

print("\nClassification Report:\n", classification_report(y_test, y_pred))


Output


Exp 6b: Apply Random Forest algorithm for regression

# Import libraries

import numpy as np

import matplotlib.pyplot as plt

from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error, r2_score

# Generate synthetic regression data

np.random.seed(42)

X = np.linspace(0, 10, 200).reshape(-1, 1)

y = 3 * X.squeeze()**2 + 5 * X.squeeze() + 10 + np.random.randn(200) * 10

# Split the dataset

X_train, X_test, y_train, y_test = train_test_split(

    X, y, test_size=0.2, random_state=42

)

# Create Random Forest Regressor

rf_model = RandomForestRegressor(

    n_estimators=200,

    max_depth=10,

    min_samples_split=5,

    random_state=42

)

# Train the model

rf_model.fit(X_train, y_train)

# Predict on test data

y_pred = rf_model.predict(X_test)

# Model evaluation

print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

print("R2 Score:", r2_score(y_test, y_pred))


Output

# Sort values for smooth plotting

sorted_idx = X_test.squeeze().argsort()

# Plot actual vs predicted values

plt.figure()

plt.scatter(X_test, y_test)

plt.plot(X_test[sorted_idx], y_pred[sorted_idx])

plt.xlabel("Input Feature (X)")

plt.ylabel("Target Value (y)")

plt.title("Random Forest Regression")

plt.show()

Exp 7: Demonstrate Naïve Bayes Classification algorithm.

# Import necessary libraries

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Iris dataset

data = load_iris()

X = data.data  # Features

y = data.target  # Target labels

# Split the data into training and testing sets (70% train, 30% test)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the Gaussian Naïve Bayes classifier

nb_classifier = GaussianNB()

# Train the model

nb_classifier.fit(X_train, y_train)

# Make predictions on the test set

y_pred = nb_classifier.predict(X_test)

# Evaluate the model

accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")

# Classification Report

print("\nClassification Report:")

print(classification_report(y_test, y_pred, target_names=data.target_names))

# Confusion Matrix

print("Confusion Matrix:")

print(confusion_matrix(y_test, y_pred))


Output


Exp 8: Apply Support Vector algorithm for classification.

# Import necessary libraries

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.svm import SVC

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Iris dataset

data = load_iris()

X = data.data  # Features

y = data.target  # Target labels

# Split the data into training and testing sets (70% train, 30% test)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize the SVM classifier

# Using 'linear' kernel for simplicity (you can try 'rbf', 'poly', 'sigmoid')

svm_classifier = SVC(kernel='linear', random_state=42)

# Train the model

svm_classifier.fit(X_train, y_train)

# Make predictions on the test set

y_pred = svm_classifier.predict(X_test)

# Evaluate the model

accuracy = accuracy_score(y_test, y_pred)

print(f"Accuracy: {accuracy:.2f}")

# Classification Report

print("\nClassification Report:")

print(classification_report(y_test, y_pred, target_names=data.target_names))

# Confusion Matrix

print("Confusion Matrix:")

print(confusion_matrix(y_test, y_pred))


Output


Exp 9: Demonstrate simple linear regression algorithm for a regression problem.

from sklearn.linear_model import LinearRegression

import numpy as np

import matplotlib.pyplot as plt

x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])

lreg=LinearRegression()

lreg.fit(x.reshape(-1, 1), y)

plt.scatter(x, y)

plt.plot(x, lreg.predict(x.reshape(-1, 1)), color='red')

plt.title("Linear Regression")

plt.xlabel("x")

plt.ylabel("y")

plt.show()

Output

Exp 10: Apply Logistic regression algorithm for a classification problem.

import numpy

X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69,

5.88]).reshape(-1,1)

print(X)

y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])

from sklearn import linear_model

logr = linear_model.LogisticRegression()

logr.fit(X,y)

predicted = logr.predict(numpy.array([3.46]).reshape(-1,1))

print(predicted)

log_odds = logr.coef_  # Coefficient of the fitted model (change in log-odds per unit of X)

odds = numpy.exp(log_odds)

print(odds)

Output

Exp 11: Demonstrate Multi-layer Perceptron algorithm for a classification problem.

Program

# Import necessary libraries

from sklearn.datasets import load_iris

from sklearn.model_selection import train_test_split

from sklearn.neural_network import MLPClassifier

from sklearn.preprocessing import StandardScaler

from sklearn.metrics import (accuracy_score, classification_report,  

confusion_matrix, ConfusionMatrixDisplay)

import matplotlib.pyplot as plt

# Load the Iris dataset

data = load_iris()

X = data.data  # Features

y = data.target  # Target labels

feature_names = data.feature_names

class_names = data.target_names

# Split the data into training and testing sets (80% train, 20% test)

X_train, X_test, y_train, y_test = train_test_split(

X, y, test_size=0.2, random_state=42, stratify=y

)

# Standardize features by removing the mean and scaling to unit variance

scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)


# Initialize the MLP classifier

mlp = MLPClassifier(

    hidden_layer_sizes=(100, 50),  # Two hidden layers with 100 and 50 neurons

    activation='relu',             # Rectified Linear Unit activation

    solver='adam',                 # Optimization algorithm

    alpha=0.0001,                  # L2 penalty (regularization term) parameter

    batch_size='auto',             # Size of minibatches

    learning_rate='constant',      # Learning rate schedule

    learning_rate_init=0.001,      # Initial learning rate

    max_iter=500,                  # Maximum number of iterations

    random_state=42,               # Random seed

    early_stopping=True,           # Use early stopping to terminate training when validation score stops improving

    validation_fraction=0.1        # Fraction of training data to set aside as validation set

)

# Train the model

mlp.fit(X_train, y_train)

# Make predictions

y_pred = mlp.predict(X_test)

y_pred_prob = mlp.predict_proba(X_test)

# Evaluate the model

print(f"Training set score: {mlp.score(X_train, y_train):.3f}")

print(f"Test set score: {mlp.score(X_test, y_test):.3f}\n")

print("Classification Report:")

print(classification_report(y_test, y_pred, target_names=class_names))

# Plot confusion matrix
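# (The original slide is cut off at this step; the lines below are one plausible completion,
#  using the confusion_matrix and ConfusionMatrixDisplay imports already made above.)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot(cmap=plt.cm.Blues)
plt.title("MLP Confusion Matrix")
plt.show()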


Output


Exp 12: Implement the K-means algorithm and apply it to the data you selected. Evaluate performance by measuring the sum of the Euclidean distance of each example from its class center. Test the performance of the algorithm as a function of the parameter K.

import numpy as np

from sklearn.datasets import load_iris

X = load_iris().data  # selected data

def kmeans(X, K, i=10):

    c = X[np.random.choice(len(X), K, 0)]                      # Initialize: pick K random points as centroids

    for _ in range(i):

        l = ((X[:, None] - c) ** 2).sum(2).argmin(1)           # Assign: index of the nearest centroid for each point

        c = np.array([X[l == k].mean(0) for k in range(K)])    # Update: recompute centroids as cluster means

    return l, c

def sse(X, l, c):

    return sum(np.linalg.norm(x - c[k]) for x, k in zip(X, l))  # Sum of Euclidean distances to assigned centers

for K in range(1,6):

    l,c = kmeans(X,K)

    print(K, sse(X,l,c))

Output

Exp 13: Demonstrate the use of Fuzzy C-Means Clustering.

import numpy as np

from sklearn.datasets import load_iris

data = load_iris()

X = data.data.T                                 # 1. Load data (4 features) and transpose, so columns are samples

K, m, e, i = 3, 2, 1e-5, 10                     # 2. clusters, fuzziness, error, iterations

c = X[:, np.random.choice(X.shape[1], K, 0)]    # 3. Randomly init cluster centers

for _ in range(i):                              # 4. Repeat update steps

    d = np.linalg.norm(X[:, :, None]-c[:, None], axis=0) + 1e-9  # 5. Distance matrix

    u = 1 / d**(2/(m-1))                         # 6. Compute membership (unnormalized)

    u = u / u.sum(1, keepdims=1)                 # 7. Normalize memberships

    c = (X @ (u**m)) / (u**m).sum(0)             # 8. Update cluster centers

labels = u.argmax(1)                             # 9. Final cluster labels

print(labels, sum(d[n, labels[n]] for n in range(len(labels))))  #10. Labels + performance

Output


Exp 14: Demonstrate the use of Expectation Maximization based clustering algorithm.

import numpy as np

from sklearn.datasets import load_iris

X = load_iris().data        # Load dataset (150 flowers, 4 features)

K = 3                       # We want 3 clusters (groups)

n, d = X.shape              # n=samples, d=features

p = np.ones((n, K))/K       # Start: each point has equal chance in each cluster

m = X[np.random.choice(n, K, 0)]  # Pick random points as initial cluster centers

s = np.array([np.eye(d)]*K) # Start: clusters are assumed circular (identity matrix)

for _ in range(10):         # Repeat 10 times to improve clusters

    for k in range(K):      # ----- E STEP (Expectation) -----

        diff = X - m[k]     # Distance of all points from cluster k mean

        p[:,k] = np.exp(-.5*np.sum(diff@np.linalg.inv(s[k])*diff,1))/np.sqrt(np.linalg.det(s[k]))

    p /= p.sum(1, keepdims=True)  # Convert to real probability (sum of each row = 1)

    for k in range(K):      # ----- M STEP (Maximization) -----

        w = p[:,k]          # Weight = how much each point belongs to cluster k

        m[k] = (w@X)/w.sum()      # Update cluster center (mean)

        s[k] = ((X-m[k]).T*(w/w.sum()))@(X-m[k])  # Update cluster shape/spread

print("Cluster labels:", p.argmax(1))  # Final group = cluster with highest probability

Output


Thank You…