Machine Learning Lab
Department of Information Technology, Dhanekula Institute of Engineering and Technology
Statistical Foundations
Central Tendency
Describes the center point of a dataset.
Dispersion
Describes the spread or variability.
Relevance
A dataset's statistical properties describe how its values are distributed.
Crucial for selecting the right algorithms and understanding model behavior.
• Mean: Arithmetic average.
• Median: Middle value.
• Mode: Most frequent value.
• Variance (σ²): Average squared difference from the mean.
• Std Dev (σ): Square root of the variance.
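A quick worked example: for the data {2, 4, 6}, the mean is 4, the variance is ((2-4)² + (4-4)² + (6-4)²) / 3 = 8/3 ≈ 2.67, and the standard deviation is √(8/3) ≈ 1.63 (population formulas; the sample versions divide by n - 1 instead of n).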
Data Preprocessing Techniques
Cleaning & Selection
Transformation
• Missing Values: Handle via deletion or imputation (Mean/Median).
• Attribute Selection: Removing irrelevant features to improve accuracy.
• Discretization: Converting continuous data into discrete bins (e.g., Age Groups).
• Outliers: Detecting and removing anomalies (Z-Score, IQR).
K-Nearest Neighbors (KNN)
Lazy Learning Algorithm
KNN is non-parametric and makes no assumptions about the underlying data distribution.
Mechanism
• Calculates the distance between the new point and the training points (e.g., Euclidean).
• Identifies the 'K' closest neighbors.
• Classification: Majority vote determines the class.
• Regression: Average of the neighbors' values.
Trees & Ensembles
Decision Trees
Tree-structured classifier where internal nodes represent features and branches represent rules.
Random Forest
An ensemble learning method that constructs a multitude of decision trees.
• Splitting: Uses Information Gain or Gini Index.
• Pruning: Removes sub-nodes to prevent overfitting.
• Leaf Nodes: Represent the final outcome.
• Bagging: Trains trees on random subsets of the data.
• Voting: Classification output is the class selected by most trees.
• Robustness: Mitigates the overfitting seen in single trees.
Naïve Bayes
Probabilistic Classifier
Based on Bayes' Theorem with a strong (naïve) independence assumption between features.
Key Characteristics
• Assumes feature independence.
• Highly effective for large datasets.
• Commonly used in text classification (e.g., Spam Filtering).
Support Vector Machines (SVM)
Optimal Hyperplane
Finds a hyperplane in N-dimensional space that distinctly classifies data points.
Key Concepts
• Margin: The distance between the hyperplane and the nearest data points (Support Vectors). Maximizing this margin is the goal.
• Kernel Trick: Transforms data into higher dimensions to handle non-linearly separable data.
Regression Algorithms
Simple Linear Regression
Models the relationship between a dependent variable (y) and an independent variable (x).
y = mx + c
Minimizes the sum of squared errors to fit a straight line.
Logistic Regression
Despite the name, it is a Classification algorithm for binary outcomes (0 or 1).
• Uses the Sigmoid Function to map predictions to probabilities between 0 and 1.
• A decision boundary separates the two classes.
Multi-layer Perceptron
Feedforward Neural Network
A class of Artificial Neural Networks (ANN) consisting of at least three layers of nodes.
Structure & Training
• Layers: Input, Hidden (non-linear), and Output.
• Backpropagation: Training technique where the error is propagated backward to update weights.
• Activation: Uses functions like ReLU or Sigmoid.
K-Means Clustering
Unsupervised Learning
Partitions dataset into K distinct, non-overlapping clusters.
Iterative Process
1. Initialize: Select K random centroids.
2. Assign: Assign each point to its nearest centroid (Euclidean distance).
3. Update: Recalculate each centroid as the mean of its assigned points.
Advanced Clustering
Fuzzy C-Means (FCM)
Soft clustering where data points can belong to multiple clusters.
Expectation Maximization (EM)
Used for Gaussian Mixture Models (GMM).
• Membership: Each point has a degree of membership (probability) for each cluster.
• Useful when cluster boundaries are ambiguous.
• E-Step: Estimate the probability of each point belonging to each cluster.
• M-Step: Update parameters to maximize the likelihood.
• Accounts for cluster variance and shape (elliptical).
Model Evaluation Metrics
Accuracy
The ratio of correctly predicted observations to total observations.
(TP + TN) / Total
Confusion Matrix
Table visualizing performance.
True Positives, False Positives, True Negatives, False Negatives
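A small worked example: if a binary classifier's confusion matrix gives TP = 50, TN = 40, FP = 6, and FN = 4, then Accuracy = (50 + 40) / (50 + 40 + 6 + 4) = 0.90.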
MSE
Mean Squared Error.
Average squared difference between actual and predicted values (for Regression).
EXP 1: To compute and understand the basic measures of Central Tendency (Mean, Median, Mode) and Dispersion (Variance, Standard Deviation) using Python.
Program:
import numpy as np # Library for numerical computing
from scipy import stats # Library for statistical analysis
data = [85, 92, 78, 85, 95, 88, 72, 85, 90, 80] # Dataset
data_array = np.array(data) # Converting lists to NumPy array
mean_value = np.mean(data_array) # Average
median_value = np.median(data_array) # Middle value of the sorted data
mode_result = stats.mode(data_array, keepdims=True) # Most frequent value (keepdims=True keeps an array result across SciPy versions)
print(f"Mean: {mean_value:.2f}") # :.2f formats to two decimal places
print(f"Median: {median_value:.2f}")
print(f"Mode: {mode_result.mode[0]}")
variance_value = np.var(data_array, ddof=1) # Sample variance
std_dev_value = np.std(data_array, ddof=1) # Sample standard deviation
print(f"Variance: {variance_value:.2f}")
print(f"Standard Deviation: {std_dev_value:.2f}")
EXP 2: To apply fundamental data preprocessing techniques (Attribute Selection, Handling Missing Values, Discretization, Elimination of Outliers) using Python libraries.
Pre-Processing Techniques
Before analyzing a dataset, we must clean and prepare it. This process is called data pre-processing.
a) Attribute Selection
Attribute selection means removing unwanted or unnecessary columns from the dataset, such as ID numbers or names, which do not affect the analysis.
b) Handling Missing Values
Missing values are empty fields in the data. We fill them using statistical methods like mean, median, or mode so that no data remains empty.
c) Discretization
Discretization is the process of converting continuous numeric data into groups or categories (called bins).
For example, converting marks into grade ranges.
d) Elimination of Outliers
Outliers are extreme or erroneous values that distort the analysis.
These values are removed or corrected to make the data more reliable.
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer
df = pd.read_csv(r"C:\Users\pvste\OneDrive\Desktop\exp2.csv") # Load Dataset (Give the path from desktop)
df = df.drop(["Roll No", "Name of the Student"], axis=1) # Remove Unwanted Columns
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Marks"] = df["Marks"].fillna(df["Marks"].median())
df["Grade"] = df["Grade"].fillna(df["Grade"].mode()[0])
df["Status"] = df["Status"].fillna(df["Status"].mode()[0])
# Cap Outliers (>100 Marks)
df.loc[df["Marks"] > 100, "Marks"] = 100
# Discretize (Bin) Marks into 5 groups
kb = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
df["Marks_Bin"] = kb.fit_transform(df[["Marks"]]).astype(int)
# Final Output
print(df.head())
Age Marks Grade Status Marks_Bin
0 19.0 77 B Pass 3
1 19.0 67 C Pass 2
2 19.0 4 F Fail 0
3 19.0 60 C Pass 1
4 20.0 91 S Pass 4
import pandas as pd #used to work with tables (DataFrames)
import numpy as np #used for numerical operations (like np.nan)
from sklearn.preprocessing import KBinsDiscretizer #converts continuous values into groups (bins)
data = {  # Creating the dataset
    'Age': [25, 30, 35, 40, 22, 60],
    'Salary': [50000, 60000, np.nan, 120000, 30000, 1000000],
    'Experience': [2, 5, 8, 15, 1, 40],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'HR', 'IT']
}
df = pd.DataFrame(data) #Creating a DataFrame
print("Original Dataset:\n", df)
df_selected = df[['Age', 'Salary', 'Experience']].copy() #.copy() avoids changing the original DataFrame
df_selected['Salary'] = df_selected['Salary'].fillna(df_selected['Salary'].mean())
discretizer = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform') # Divides Age into 3 equal-width ranges (bins)
df_selected['Age_Bin'] = discretizer.fit_transform(df_selected[['Age']]) # Converts ages into bin numbers: 0 = young, 1 = middle-aged, 2 = older
Q1 = df_selected['Salary'].quantile(0.25)
Q3 = df_selected['Salary'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
df_no_outliers = df_selected[  # Removing outliers outside the IQR fences
    (df_selected['Salary'] >= lower_bound) &
    (df_selected['Salary'] <= upper_bound)
]
print("\nAfter Eliminating Outliers:\n", df_no_outliers)
EXP 3: To apply the K-Nearest Neighbors (KNN) algorithm for both classification and regression tasks using Python.
import numpy as np # For numerical operations
import matplotlib.pyplot as plt # For creating the scatter plot graph.
from sklearn.datasets import make_regression # For generating synthetic data
from sklearn.model_selection import train_test_split # For splitting the dataset
from sklearn.neighbors import KNeighborsRegressor # The KNN regression model
from sklearn.metrics import mean_squared_error, r2_score # Error metrics
X, y = make_regression(n_samples=200, n_features=1, noise=0.1, random_state=42) # Generate synthetic dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) # Split the dataset into training and testing sets
knn_regressor = KNeighborsRegressor(n_neighbors=5) # Create and train the KNN regressor
knn_regressor.fit(X_train, y_train)
y_pred = knn_regressor.predict(X_test) # Make predictions on the test data
mse = mean_squared_error(y_test, y_pred) # Evaluate the model
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared: {r2}')
plt.scatter(X_test, y_test, color='blue', label='Actual') # Visualize the results
plt.scatter(X_test, y_pred, color='red', label='Predicted')
plt.title('KNN Regression')
plt.xlabel('Feature')
plt.ylabel('Target')
plt.legend()
plt.show()
Mean Squared Error: 133.62045142000457
R-squared: 0.9817384115764595
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Step 1: Load the dataset
iris = load_iris()
X = iris.data # Features
y = iris.target # Labels
# Step 2: Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Step 3: Create and train the KNN classifier
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)
# Step 4: Make predictions
y_pred = knn_classifier.predict(X_test)
# Step 5: Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.4f}')
print('\nClassification Report:')
print(classification_report(y_test, y_pred, target_names=iris.target_names))
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, y_pred))
# Step 6: Predict a new input sample
input_sample = np.array([[5.1, 3.5, 1.4, 0.2]])
predicted_class = knn_classifier.predict(input_sample)
predicted_prob = knn_classifier.predict_proba(input_sample)
print("\nInput Sample:", input_sample)
print("Predicted Flower Name:", iris.target_names[predicted_class[0]])
print("Prediction Probabilities:", predicted_prob)
Exp 4: Demonstrate decision tree algorithm for a classification problem and perform parameter tuning for better results
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
data = load_iris() # Load dataset
X = pd.DataFrame(data.data, columns=data.feature_names)
y = pd.Series(data.target)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42) # Split data
clf = DecisionTreeClassifier() # Default Decision Tree
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Default Accuracy:", accuracy_score(y_test, y_pred))
params = {  # Parameter tuning using Grid Search
    'criterion': ['gini', 'entropy'],
    'max_depth': [2, 3, 4, 5, None],
    'min_samples_split': [2, 3, 4],
    'min_samples_leaf': [1, 2, 3]
}
grid = GridSearchCV(DecisionTreeClassifier(), params, cv=5, scoring='accuracy')
grid.fit(X_train, y_train)
best_model = grid.best_estimator_ # Best Model
y_best = best_model.predict(X_test)
print("Best Params:", grid.best_params_)
print("Tuned Accuracy:", accuracy_score(y_test, y_best))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_best))
print("\nClassification Report:\n", classification_report(y_test, y_best))
Exp 5: Demonstrate decision tree algorithm for a regression problem
from sklearn.datasets import load_diabetes
dataset = load_diabetes()
print(dataset['DESCR'])
import pandas as pd
df_diabetes = pd.DataFrame(dataset['data'], columns=dataset['feature_names'])
x = df_diabetes
y = dataset['target']
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(max_depth=3)
regressor.fit(x_train, y_train)
y_pred = regressor.predict(x_test)
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
print(r2_score(y_test, y_pred))
print(mean_squared_error(y_test, y_pred))
print(mean_absolute_error(y_test, y_pred))
import matplotlib.pyplot as plt
from sklearn import tree
plt.figure(figsize=(15, 10))
tree.plot_tree(regressor, fontsize=10)
plt.show()
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
# Dataset: [Size, Age] for each house
X = np.array([[1400, 20], [1600, 15], [1700, 30], [1875, 10], [1100, 25], [1550, 18], [2350, 5], [2450, 8], [1425, 22], [1700, 12]])
y = np.array([105000, 120000, 95000, 145000, 80000, 115000, 180000, 190000, 108000, 130000])
# Train Decision Tree Regressor (max_depth=3 for interpretability)
regressor = DecisionTreeRegressor(max_depth=3, random_state=42)
regressor.fit(X, y)
# Text representation of the tree
tree_text = export_text(regressor, feature_names=['Size', 'Age'])
print("Tree Structure:\n")
print(tree_text)
# Predictions on training data
y_pred = regressor.predict(X)
print("\nPredictions vs Actual:")
for i in range(len(y)):
    print(f"Size: {X[i][0]}, Age: {X[i][1]}, Actual: {y[i]}, Predicted: {y_pred[i]:.2f}")
# Feature importances
print("\nFeature Importances:")
print(f"Size: {regressor.feature_importances_[0]:.4f}")
print(f"Age: {regressor.feature_importances_[1]:.4f}")
# Visualize the tree (requires matplotlib)
plt.figure(figsize=(20, 10))
plot_tree(regressor, feature_names=['Size', 'Age'], filled=True, rounded=True, precision=0)
plt.title("Decision Tree for House Price Regression")
plt.show()
Output
Exp 6a: Apply Random Forest algorithm for classification
# Import required libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create Random Forest Classifier model
rf_classifier = RandomForestClassifier(
n_estimators=100,
criterion="gini",
random_state=42
)
# Train the model
rf_classifier.fit(X_train, y_train)
# Make predictions
y_pred = rf_classifier.predict(X_test)
# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
Output
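Because each split in every tree scores the features, a Random Forest also exposes per-feature importances; a short optional sketch using the rf_classifier and iris objects above:
for name, score in zip(iris.feature_names, rf_classifier.feature_importances_):
    print(f"{name}: {score:.4f}") # Fraction of total impurity reduction attributed to this feature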
Exp 6b: Apply Random Forest algorithm for regression
# Import libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Generate synthetic regression data
np.random.seed(42)
X = np.linspace(0, 10, 200).reshape(-1, 1)
y = 3 * X.squeeze()**2 + 5 * X.squeeze() + 10 + np.random.randn(200) * 10
# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Create Random Forest Regressor
rf_model = RandomForestRegressor(
n_estimators=200,
max_depth=10,
min_samples_split=5,
random_state=42
)
# Train the model
rf_model.fit(X_train, y_train)
# Predict on test data
y_pred = rf_model.predict(X_test)
# Model evaluation
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))
print("R2 Score:", r2_score(y_test, y_pred))
Output
# Sort values for smooth plotting
sorted_idx = X_test.squeeze().argsort()
# Plot actual vs predicted values
plt.figure()
plt.scatter(X_test, y_test)
plt.plot(X_test[sorted_idx], y_pred[sorted_idx])
plt.xlabel("Input Feature (X)")
plt.ylabel("Target Value (y)")
plt.title("Random Forest Regression")
plt.show()
Exp 7: Demonstrate Naïve Bayes Classification algorithm.
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load the Iris dataset
data = load_iris()
X = data.data # Features
y = data.target # Target labels
# Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the Gaussian Naïve Bayes classifier
nb_classifier = GaussianNB()
# Train the model
nb_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred = nb_classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
# Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Output
Exp 8: Apply Support Vector algorithm for classification.
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
# Load the Iris dataset
data = load_iris()
X = data.data # Features
y = data.target # Target labels
# Split the data into training and testing sets (70% train, 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Initialize the SVM classifier
# Using 'linear' kernel for simplicity (you can try 'rbf', 'poly', 'sigmoid')
svm_classifier = SVC(kernel='linear', random_state=42)
# Train the model
svm_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred = svm_classifier.predict(X_test)
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=data.target_names))
# Confusion Matrix
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
Output
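The comment in the program above suggests trying other kernels; a small optional sketch comparing them on the same split:
for kernel in ['linear', 'rbf', 'poly', 'sigmoid']:
    model = SVC(kernel=kernel, random_state=42)
    model.fit(X_train, y_train)
    print(f"{kernel}: {model.score(X_test, y_test):.4f}") # Test accuracy per kernel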
Exp 9: Demonstrate simple linear regression algorithm for a regression problem.
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
lreg = LinearRegression()
lreg.fit(x.reshape(-1, 1), y)
plt.scatter(x, y)
plt.plot(x, lreg.predict(x.reshape(-1, 1)), color='red')
plt.title("Linear Regression")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
Output
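To connect the fitted model back to the y = mx + c form from the concept section, the slope and intercept can be read off the estimator (an optional addition to the program above):
print("Slope (m):", lreg.coef_[0])
print("Intercept (c):", lreg.intercept_)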
Exp 10: Apply Logistic regression algorithm for a classification problem.
Program
import numpy
X = numpy.array([3.78, 2.44, 2.09, 0.14, 1.72, 1.65, 4.92, 4.37, 4.96, 4.52, 3.69, 5.88]).reshape(-1, 1)
print(X)
y = numpy.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1])
from sklearn import linear_model
logr = linear_model.LogisticRegression()
logr.fit(X,y)
predicted = logr.predict(numpy.array([3.46]).reshape(-1,1))
print(predicted)
log_odds = logr.coef_ # Coefficient = change in log-odds per unit of X
odds = numpy.exp(log_odds)
print(odds)
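The concept section notes that logistic regression maps predictions through the Sigmoid Function; an optional sketch showing the resulting class probabilities for the same test value:
probabilities = logr.predict_proba(numpy.array([3.46]).reshape(-1, 1)) # [P(class 0), P(class 1)] via the sigmoid
print(probabilities)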
Output
Exp 11: Demonstrate Multi-layer Perceptron algorithm for a classification problem.
Program
# Import necessary libraries
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (accuracy_score, classification_report,
confusion_matrix, ConfusionMatrixDisplay)
import matplotlib.pyplot as plt
# Load the Iris dataset
data = load_iris()
X = data.data # Features
y = data.target # Target labels
feature_names = data.feature_names
class_names = data.target_names
# Split the data into training and testing sets (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Standardize features by removing the mean and scaling to unit variance
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Initialize the MLP classifier
mlp = MLPClassifier(
hidden_layer_sizes=(100, 50), # Two hidden layers with 100 and 50 neurons
activation='relu', # Rectified Linear Unit activation
solver='adam', # Optimization algorithm
alpha=0.0001, # L2 penalty (regularization term) parameter
batch_size='auto', # Size of minibatches
learning_rate='constant', # Learning rate schedule
learning_rate_init=0.001, # Initial learning rate
max_iter=500, # Maximum number of iterations
random_state=42, # Random seed
early_stopping=True, # Use early stopping to terminate training when validation score stops improving
validation_fraction=0.1 # Fraction of training data to set aside as validation set
)
# Train the model
mlp.fit(X_train, y_train)
# Make predictions
y_pred = mlp.predict(X_test)
y_pred_prob = mlp.predict_proba(X_test)
# Evaluate the model
print(f"Training set score: {mlp.score(X_train, y_train):.3f}")
print(f"Test set score: {mlp.score(X_test, y_test):.3f}\n")
print("Classification Report:")
print(classification_report(y_test, y_pred, target_names=class_names))
# Plot confusion matrix
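# (Reconstructed continuation, a minimal sketch: the imports above already bring in
#  confusion_matrix and ConfusionMatrixDisplay for exactly this step)
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=class_names)
disp.plot()
plt.title("MLP Confusion Matrix")
plt.show()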
Output
Exp 12: Implement the K-means algorithm and apply it to the data you selected. Evaluate performance by measuring the sum of the Euclidean distances of each example from its class center. Test the performance of the algorithm as a function of the parameter K.
import numpy as np
from sklearn.datasets import load_iris
X = load_iris().data  # Selected data

def kmeans(X, K, iters=10):
    c = X[np.random.choice(len(X), K, replace=False)]  # Initialize: K random points as centroids
    for _ in range(iters):
        l = ((X[:, None] - c)**2).sum(2).argmin(1)  # Assign: index of the nearest centroid
        c = np.array([X[l == k].mean(0) for k in range(K)])  # Update: mean of the assigned points
    return l, c

def sse(X, l, c):
    return sum(np.linalg.norm(x - c[k]) for x, k in zip(X, l))  # Sum of distances to class centers

for K in range(1, 6):
    l, c = kmeans(X, K)
    print(K, sse(X, l, c))
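As a sanity check, scikit-learn's KMeans reports inertia_, the sum of squared distances to the closest center (squared, so on a different scale than the sse above); a brief optional comparison:
from sklearn.cluster import KMeans
for K in range(1, 6):
    km = KMeans(n_clusters=K, n_init=10, random_state=42).fit(X)
    print(K, km.inertia_) # An elbow in this curve suggests a good K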
Output
Exp 13: Demonstrate the use of Fuzzy C-Means Clustering.
import numpy as np
from sklearn.datasets import load_iris
data = load_iris()
X = data.data.T  # 1. Load data (4 features x 150 samples); note the transpose
K, m, e, i = 3, 2, 1e-5, 10  # 2. Clusters, fuzziness, error tolerance, iterations
c = X[:, np.random.choice(X.shape[1], K, replace=False)]  # 3. Randomly pick K samples as centers
for _ in range(i):  # 4. Repeat update steps
    d = np.linalg.norm(X[:, :, None] - c[:, None], axis=0) + 1e-9  # 5. Distance matrix (samples x clusters)
    u = 1 / d**(2/(m-1))  # 6. Compute memberships (unnormalized)
    u = u / u.sum(1, keepdims=True)  # 7. Normalize memberships
    c = (X @ (u**m)) / (u**m).sum(0)  # 8. Update cluster centers
labels = u.argmax(1)  # 9. Final cluster labels
print(labels, sum(d[n, labels[n]] for n in range(len(labels))))  # 10. Labels + performance
Output
Exp 14: Demonstrate the use of Expectation Maximization based clustering algorithm.
import numpy as np
from sklearn.datasets import load_iris
X = load_iris().data  # Load dataset (150 flowers, 4 features)
K = 3  # We want 3 clusters (groups)
n, d = X.shape  # n = samples, d = features
p = np.ones((n, K)) / K  # Start: each point has an equal chance of being in each cluster
m = X[np.random.choice(n, K, replace=False)]  # Pick random points as initial cluster centers
s = np.array([np.eye(d)] * K)  # Start: clusters are assumed spherical (identity covariance)

for _ in range(10):  # Repeat 10 times to improve the clusters
    for k in range(K):  # ----- E STEP (Expectation) -----
        diff = X - m[k]  # Offset of all points from cluster k's mean
        p[:, k] = np.exp(-.5 * np.sum(diff @ np.linalg.inv(s[k]) * diff, 1)) / np.sqrt(np.linalg.det(s[k]))
    p /= p.sum(1, keepdims=True)  # Convert to real probabilities (each row sums to 1)
    for k in range(K):  # ----- M STEP (Maximization) -----
        w = p[:, k]  # Weight = how much each point belongs to cluster k
        m[k] = (w @ X) / w.sum()  # Update cluster center (mean)
        s[k] = ((X - m[k]).T * (w / w.sum())) @ (X - m[k])  # Update cluster shape/spread (covariance)

print("Cluster labels:", p.argmax(1))  # Final group = cluster with the highest probability
Output