SEMESTER-V
COURSE 14 B: PREDICTIVE AND ADVANCED ANALYTICS USING R
Unit-I
Introduction to Data Mining: Introduction, What is Data Mining?, Concepts of Data Mining, Technologies Used, Data Mining Process, KDD Process Model, CRISP-DM, Mining on Various Kinds of Data, Applications of Data Mining, Challenges of Data Mining.
Q) Explain the stages of the data mining process / KDD process.
Ans: The knowledge discovery process is shown in Figure as an iterative sequence of the following steps:
1. Data cleaning: remove noise and inconsistent data.
2. Data integration: combine data from multiple sources.
3. Data selection: retrieve the data relevant to the analysis task from the database.
4. Data transformation: transform and consolidate data into forms appropriate for mining.
5. Data mining: apply intelligent methods to extract data patterns.
6. Pattern evaluation: identify the truly interesting patterns.
7. Knowledge presentation: use visualization and knowledge representation techniques to present the mined knowledge to users.
Steps 1 through 4 are different forms of data pre-processing, where data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base.
Q) What is data mining?
Data mining is the process of studying large sets of data to find useful information. It helps turn raw, messy data into clear knowledge that people can understand and use.
🌟 Key Points:
- Just like mining for gold, data mining means digging through a huge amount of data to find small, valuable pieces of information.
- The main aim is to discover patterns or trends that can help in making better decisions.
- It can be done with numbers (like sales records), text (like social media posts), or even images.
- Without using proper tools, big data can be ignored or misused—people might guess or go with their gut instead of using real facts.
So, data mining turns confusing data into something smart and useful for companies, hospitals, banks, and more.
Q) Explain data mining techniques. (Or: What data mining technologies/functionalities are used?)
Ans: Data mining functionalities are used to specify the kinds of patterns to be mined.
Data mining tasks can be classified into two categories: descriptive and predictive.
Descriptive tasks summarize the data and describe its general properties.
Predictive tasks analyse the data and construct a model that can predict the behaviour of new data.
There are several data mining functionalities. These include
- characterization and discrimination
- Mining of frequent patterns, associations, and correlations
- classification and regression
- clustering analysis
- outlier analysis
1. Characterization and discrimination: Data characterization is a summarization of the general characteristics or features of a target class of data. Example: summarize the characteristics of customers who spend more than Rs. 5000 a year at Modern Super Market.
Data discrimination is a comparison of the general features of the target class against the general features of one or more contrasting classes. For example: compare two groups of customers—those who buy computer products regularly and those who rarely buy such products.
2. Mining of frequent patterns, associations and correlations
Frequent patterns, as the name suggests, are patterns that occur frequently in data. Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
3. Classification and regression: Classification is a supervised learning technique. The class labels of the training data are given, and a model is trained using that data. The trained model (or classifier) is then used to predict the category or class label of new data. If the model is used to predict a number (a continuous value), it is called regression.
Example: Given details of a customer such as age, credit rating, and whether or not they are a student, a decision tree is constructed from the training data. This tree is then used to predict whether a customer will buy a computer (yes) or will not buy one (no). The classification model can be represented as a decision tree.
Classification model can be represented using IF-THEN rules as follows:
age(X, “youth”) AND student(X, “yes”) ⇒ buys(X, “yes”)
age(X, “middle_aged”) ⇒ buys(X, “yes”)
age(X, “senior”) AND credit_rating(X, “excellent”) ⇒ buys(X, “yes”)
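R Code Example (a minimal sketch of training such a classifier with the rpart package; the package choice and the small data frame below are assumptions made for illustration, not from the text):
library(rpart)   # install.packages("rpart") if needed
# Made-up training data: age group, student status, credit rating, and the class label
customers <- data.frame(
  age           = c("youth", "youth", "middle_aged", "senior", "senior", "middle_aged"),
  student       = c("no", "yes", "no", "yes", "no", "yes"),
  credit_rating = c("fair", "fair", "excellent", "fair", "excellent", "excellent"),
  buys          = c("no", "yes", "yes", "yes", "no", "yes"),
  stringsAsFactors = TRUE
)
# Train a classification tree (method = "class" because the label is categorical)
model <- rpart(buys ~ age + student + credit_rating, data = customers,
               method = "class", control = rpart.control(minsplit = 2, cp = 0))
# Predict class labels (here on the same toy data; normally you would use new data)
predict(model, customers, type = "class")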
4. Cluster Analysis:
A group of objects that belong to the same class is known as a cluster. In data mining, cluster analysis is a way to discover groups of similar items.
Cluster analysis can be performed on Modern Super Market customer data to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing.

5. Outlier Analysis: An outlier is a data point that is very different from other points.
It is unusual, and may be an error or a rare event.
An outlier is a data object that deviates significantly from the rest of the objects. The analysis of outlier data is referred to as outlier analysis or anomaly mining. Outlier analysis tries to find unusual patterns in any dataset. Outlier detection is important in many applications in addition to fraud detection such as medical care, public safety and security, industry damage detection, image processing, sensor/video network surveillance, and intrusion detection.
Q) MINING ON VARIOUS KINDS OF DATA
The data is categorized into fundamental and more complex types:
- Database Data (Relational Data):
- This is data typically stored in tables, like those you might see in a spreadsheet, where information is organised into rows (representing objects like customers or items) and columns (representing characteristics or attributes like age or price).
- Example: An electronics store's customer database, with details on customer IDs, names, addresses, and purchase history.
- Data Warehouses:
- These are large, centralised repositories of information collected from various sources within an organisation, specifically organised for analysis and decision-making. They often contain historical and summarised data.
- Example: AllElectronics, a company with branches worldwide, might consolidate sales data from all branches into a single data warehouse for comprehensive analysis. This data is often modelled as a data cube, which allows data to be viewed in multiple dimensions (e.g., sales by time, item, branch, and location).
- Transactional Data: This type of data records individual transactions, such as purchases or financial transactions, providing insights into customer behaviour and business operations.

Time Series Data: It refers to data collected over a period of time, such as stock prices or weather conditions, allowing for analysis of patterns and trends.

Biological Data: It contains information related to living organisms, such as genetic sequences or physiological measurements. This data aids research in areas like genetics, medicine, and ecology.
Spatial Data: Spatial data contains information about physical locations and geographic features. This data helps in the analysis and visualisation of geographic patterns, such as maps or satellite imagery.

Social Network Data: It involves data about individuals and their relationships in a social network, offering insights into social interactions, influence, and community structures.

- Other Kinds of Data (Complex Data Types): These present greater challenges and often require specialised mining methodologies.
- Symbolic Sequence Data: Ordered lists of events or elements, with or without a precise time notion.
- Example: Customer shopping sequences (e.g., buying a PC, then a digital camera, then a memory card) or web click streams.
- Spatiotemporal Data and Moving Objects: Data that change over both space and time, often related to moving entities.
- Example: Tracking vehicles in a city or monitoring weather patterns.
- Cyber-Physical System Data: Data from interconnected computing and physical components, such as sensor networks.
- Multimedia Data: Includes images, video, and audio.
- Example: Mining images to classify objects or video data to detect specific events.
- Text Data: Unstructured information stored as text.
- Example: News articles, technical papers, customer reviews, or emails. Text mining can extract high-quality information, identify trends, or perform sentiment analysis.
- Graph and Networked Data: Data representing relationships between interconnected objects.
- Example: Social networks (friends linked to friends) or information networks (web pages linked to each other). Mining can discover hidden communities, hubs, or outliers.
- Data Streams: Data that continuously flows into a system in vast volumes, changing dynamically, and potentially infinite.
- Example: Real-time video surveillance feeds or sensor data. This requires efficient single-pass or few-pass algorithms due to their continuous nature.
It's important to note that multiple types of data often coexist in real-world applications (e.g., web mining involves text, multimedia, and graph data). Mining these multiple sources can lead to richer findings but also presents challenges in data cleaning and integration.
Q)📊 Applications of Data Mining
Data mining is used in many areas to find useful patterns and help make better decisions. It turns huge amounts of data into meaningful information.
| 🔢 Area | 🏷️ Use Case | 📌 Explanation |
| --- | --- | --- |
| 🏪 1. Business & Commerce | Retail & Marketing | Find which products are bought together; plan ads and store layout. |
|  | Banking & Finance | Detect fraud, check loan risk, group customers. |
|  | Telecom | Find customers likely to leave; analyze call usage. |
| 🔬 2. Science & Engineering | Medicine & Bioinformatics | Predict diseases; study DNA and discover drugs. |
|  | Computer Engineering | Detect bugs or attacks; improve system performance. |
|  | Environmental Science | Predict climate; link geography and poverty. |
| 🌐 3. Web & Internet | Search Engines | Improve search; show related ads or trending searches. |
|  | Recommender Systems | Suggest products or shows (like Netflix, Amazon). |
|  | Text Mining | Analyze opinions from reviews; group articles. |
| 🛡️ 4. Security & Society | Crime & Fraud Prevention | Spot fraud in banking and insurance; help crime detection. |
|  | Hidden Data Mining (Social Impact) | Happens silently while shopping or browsing. |
Q) Challenges in Data Mining
Data mining helps us discover useful information from large datasets. But it also faces several difficulties. These problems come from the nature of the data, the technical methods used, user interaction, and social concerns. The challenges are listed below
| 🧩 Problem Area | 📌 Simple Explanation |
| --- | --- |
| 🧹 Data Quality Problems | Data may have mistakes or missing parts; merging data from many sources causes confusion; real patterns are hard to find due to noise or outliers. |
| ⚙️ Scalability & Efficiency | Too much data to process; too many features (columns); live (streaming) data is hard to handle; some algorithms are slow and heavy. |
| 🧬 Variety in Data Types | Data comes in many forms (text, image, video); some data, like networks, is complex to analyze. |
| 🧪 Mining & Evaluation Issues | Too many patterns are found and only a few are useful; "interesting" patterns are hard to define; some methods give only partial results. |
| 👩‍💻 User Involvement | Users want to change or explore during mining; complex models are hard to understand; good charts or visuals are needed. |
| 🔐 Privacy & Social Issues | Mining may break personal privacy; it can be misused; it is often done silently, without users knowing. |
Q) CRISP-DM (Cross Industry Standard Process for Data Mining)
CRISP-DM is a common method used to do data mining projects in a step-by-step way. It helps people work with data clearly and effectively, in any industry or with any tool.
✅ Main Features of CRISP-DM
- It uses a step-by-step process.
- It works in any field (business, health, etc.).
- Steps can be repeated if needed.
🔢 Six Steps of CRISP-DM
- Business Understanding – Know the goal of the project from a business point of view.
- Data Understanding – Collect data and check its quality.
- Data Preparation – Make the data ready for analysis (cleaning, formatting).
- Modeling – Use data mining techniques to build a model.
- Evaluation – Check if the model gives good results.
- Deployment – Use the model to help in real-world decision making.
Q) Differentiate between CRISP-DM and data mining (DM)
DM is what you do. CRISP-DM is how you do it.
🔍 CRISP-DM vs. DM: Key Differences
| Feature | CRISP-DM | Data Mining (DM) |
| --- | --- | --- |
| Definition | A structured methodology for DM projects | The general process of extracting insights from data |
| Purpose | Guide teams through DM projects step by step | Discover patterns and knowledge from data |
| Phases | 6 defined stages (Business Understanding to Deployment) | No fixed structure; depends on the approach |
| Flexibility | Highly adaptable across industries | Flexible, but can be chaotic without a method |
| Tool Dependence | Tool-agnostic (works with any tool) | May depend on specific tools or algorithms |
| Project Management | Includes planning, evaluation, and real-world use | Often focused only on modeling and analysis |
🧭 Why CRISP-DM Matters
CRISP-DM brings clarity, repeatability, and structure to data mining. It is like having a GPS for your data journey, which is especially useful when working in teams or across industries.
Unit II: Data Understanding and Preparation - Introduction, Reading data from various sources, Data visualization, Distributions and summary statistics, Relationships among variables, Extent of missing data, Segmentation, Outlier detection
Q) Data Understanding and Preparation – Introduction
In Data Science, it is very important to first understand your data and then prepare it. Data understanding helps us know what the data is about. Data preparation makes the data ready for analysis. In R, we use different tools and functions for these steps.
Q1. Data Understanding (Getting to Know Your Data)
Before you work with data, you need to understand it well. This means knowing what kind of data you have, what values it contains, and how these values are spread out.
a. Types of Data (Attributes): Data is made of "attributes" (also called variables or features). We can classify attributes into different types:
- Nominal Attributes: These are names or labels, without any order.
- Example: Gender (Male, Female), Colour (Red, Blue, Green).
- Binary Attributes: These have only two possible values.
- Example: Yes/No, True/False, 0/1.
- Ordinal Attributes: These have an order, but the differences between values are not fixed.
- Example: Ratings (Low, Medium, High), Shirt Size (Small, Medium, Large).
- Numeric Attributes: These are numbers.
- Discrete: Values are whole numbers or can be counted.
- Example: Number of children, ZIP Code.
- Continuous: Values can be any number within a range (like decimal numbers).
- Example: Height, Weight, Temperature.
b. Basic Statistical Measures: To understand the data's values, we often look at statistical measures.
- Measures of Central Tendency: These show the "middle" or "center" of the data.
- Mean: The average value. You add all values and divide by how many there are.
- R Code Example: mean_value <- mean(data).
- Median: The middle value when data is sorted. If there are two middle values, it's their average.
- R Code Example: median_value <- median(data).
- Mode: The value that appears most often. R does not have a built-in mode() function for this, so you usually write a small helper function (a sketch is given after the code example below) or look at frequency tables.
- Measures of Variability (Dispersion): These show how "spread out" the data is.
- Range: The difference between the highest and lowest values.
- InterQuartile Range (IQR): The range of the middle 50% of the data. It helps to see spread without outliers.
- Variance: Measures how far values are from the mean, on average, squared.
- Standard Deviation: The square root of the variance, showing typical distance from the mean.
R Code Example:
data_sample <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)
range_val <- max(data_sample) - min(data_sample) # Calculate range
iqr_val <- IQR(data_sample) # Calculate IQR
variance_val <- var(data_sample) # Calculate variance
sd_val <- sd(data_sample) # Calculate standard deviation
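R Code Example (a small helper for the mode mentioned above; this is a common idiom rather than a base R function):
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]   # most frequently occurring value
}
get_mode(c(2, 3, 3, 5, 7, 3, 2))   # returns 3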
c. Data Visualization: Pictures (graphs) help us see patterns and problems in data more easily.
- Histograms: Show the distribution of one numeric variable, how often values fall into certain ranges.
- Boxplots: Show the distribution, median, quartiles, and possible outliers for one or more variables.
- Scatter Plots: Show the relationship between two numeric variables, as points on a graph.
R Code Example (using iris dataset, which is preloaded in R):
# Histogram for Sepal.Length
hist(iris$Sepal.Length, main="Histogram of Sepal Length", xlab="Sepal Length")
# Boxplot for Sepal.Length by Species
boxplot(Sepal.Length ~ Species, data=iris, ylab="Sepal Length")
# Scatter plot of Sepal.Length vs Sepal.Width
plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species) # Color points by species
Q2. Data Preparation (Preprocessing)
Data preparation is the most important step in data mining. It transforms raw data into a clean and useful format. This step takes a lot of time in a data science project.
The main steps in data preparation are:
a. Data Cleaning: Real-world data often has problems like missing information, errors, or inconsistencies. Data cleaning fixes these problems.
- Handling Missing Values (NA): Missing values are shown as NA in R.
- Ignore/Remove: Remove rows or columns with missing values.
- Impute: Fill in missing values using methods like the mean, median, or a more advanced model.
- Handling Noise and Outliers: Noise means random errors. Outliers are values that are very different from most other data.
- Tools like boxplots can help identify outliers visually.
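R Code Example (a minimal sketch of handling missing values and checking for outliers; the small vectors and data frame are made up for illustration):
x <- c(12, 7, NA, 4, 18)            # toy data with a missing value
sum(is.na(x))                       # count the missing values
x_removed <- x[!is.na(x)]           # option 1: remove the missing values
x_imputed <- x
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)   # option 2: impute with the mean

df <- data.frame(a = c(1, 2, NA), b = c(4, NA, 6))
na.omit(df)                         # drop rows containing any NA

boxplot(c(12, 7, 3, 4, 18, 2, 54, -21, 8, -5))   # visual check for outliers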
b. Data Integration: This step combines data from many different sources into one dataset. It is important to handle different ways data is named or stored (e.g., different units like meters vs. feet).
c. Data Reduction: Large datasets can be hard to work with. Data reduction makes the dataset smaller, but still keeps important information.
- Sampling: Take a smaller part of the data that still represents the whole.
- Feature Selection: Choose only the most important attributes (columns) that are useful for your analysis.
- Dimensionality Reduction: Techniques like Principal Components Analysis (PCA) combine attributes to create fewer, new attributes.
d. Data Transformation: This changes data into a suitable format for mining.
- Normalization: Scaling data values to a specific range (e.g., 0 to 1). This is useful when attributes have very different ranges.
- Discretization: Changing numeric data into categorical "bins" or groups (e.g., age into "young", "medium", "old").
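R Code Example (a minimal sketch of min-max normalization and discretization with cut(); the age values and bin boundaries are made up):
ages <- c(5, 17, 23, 35, 42, 58, 67, 80)

# Min-max normalization: rescale values to the range 0 to 1
ages_norm <- (ages - min(ages)) / (max(ages) - min(ages))

# Discretization: turn numeric ages into categorical bins
age_group <- cut(ages, breaks = c(0, 30, 60, 100),
                 labels = c("young", "medium", "old"))
age_group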
Q) Reading Data from Various Sources
In data science, we must collect data before we can analyze it.
Data can come from different places (sources) and different formats.
We must know how to read and load this data into our tools (like R, Python, Excel).
🔹 Types of Data Sources
| 🔢 Type | 📌 Simple Meaning | 📁 Example | 🧪 R Syntax to Read |
| --- | --- | --- | --- |
| Flat Files | Text files with rows and columns | .csv, .txt, .tsv | read.csv("file.csv"), read.table("file.txt") |
| Excel Files | Spreadsheet format | .xls, .xlsx | readxl::read_excel("file.xlsx") |
| Databases | Organized storage with SQL access | MySQL, Oracle | DBI::dbReadTable(con, "table_name") (after a DB connection) |
| Web Data | Data from websites or APIs | HTML tables, weather APIs | httr / jsonlite for APIs (see note below) |
| Cloud Storage | Online data storage services | Google Drive, AWS S3 | googledrive, aws.s3, or download locally and use read.csv() |
| Sensor/Live Data | Streaming data from devices or logs | GPS, logs, IoT streams | readLines("sensor_log.txt"), scan() for simple text reading |
- For Excel: you must install the readxl package:
install.packages("readxl")
- For databases: Use DBI + a driver like RMySQL or RSQLite
- For APIs or JSON: Use jsonlite or httr packages
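R Code Example (a minimal sketch; the file names below are placeholders, and the readxl and jsonlite packages must be installed):
sales <- read.csv("sales.csv")                 # flat file (CSV)
logs  <- read.table("log.txt", header = TRUE)  # flat file (space/tab separated)

library(readxl)
budget <- read_excel("budget.xlsx", sheet = 1) # Excel file

library(jsonlite)
settings <- fromJSON("settings.json")          # JSON data (e.g., from an API)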
Q) Data Visualization
Data Visualization means showing data in pictures (charts/graphs), so that we can understand it easily.
🔹 Why is it useful?
- Shows patterns and trends in the data
- Helps to find mistakes or outliers
- Makes data easy to explain to others
🔹 Common Types of Charts
| 📊 Chart Type | 📌 Use |
| --- | --- |
| Bar Chart | Compare categories (e.g., number of students by class) |
| Pie Chart | Show parts of a whole (e.g., percentage of sales) |
| Histogram | Show the distribution of numbers (e.g., ages of people) |
| Line Chart | See trends over time (e.g., monthly sales) |
| Box Plot | Show the spread of values and outliers |
| Scatter Plot | See the relationship between two variables |
The plot() function in R is used to create the line graph.
Syntax
The basic syntax to create a line chart in R is –
plot(v, type, col, xlab, ylab, main)
Following is the description of the parameters used −
- v is a vector containing the numeric values.
- type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points and lines.
- xlab is the label for x axis.
- ylab is the label for y axis.
- main is the Title of the chart.
- col is used to give colors to both the points and lines.
Program:
marks=c(15,22,35,55,45,65)
plot(marks, type="l", col="Blue")
Output:

Boxplots are created in R by using the boxplot() function.
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
Example:
data<- c(1,2,3,4,5)
boxplot(data)

In R the pie chart is created using the pie() function
Syntax
The basic syntax for creating a pie-chart using the R is −
pie(x, labels, radius, main, col, clockwise)
Example:
x <- c(50,40,10)
labels <- c("mpcs","mscs","dscs")
# Plot the chart.
pie(x,labels,col = rainbow(length(x)))

The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
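Example (a small sketch using made-up values):
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
hist(v, main = "Histogram of v", xlab = "Value",
     col = "lightblue", breaks = 5)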

Q) Distributions: Key Concepts and Applications
Distributions are fundamental for understanding how data values are spread out. A probability distribution describes the "law governing a random variable" from which observed data originated, indicating the likelihood of different values being observed.
Core Concepts:
For continuous variables, distributions are described by a probability density function.
For discrete variables, they are described by a probability function.
Key Types of Distributions:
You want to understand the height of your students. You measure them and plot the number of students with each height on a graph. You notice most of them are around 165–170 cm, with fewer students being very short or very tall. The graph looks like a hill or bell.
That’s a normal distribution — one of the most famous distributions.
You toss a coin 10 times and count how many times you get heads. You repeat this many times and record the results. Most of the time you’ll get around 5 heads, sometimes 4 or 6, and rarely 0 or 10. That’s a binomial distribution.
👉 Example Problem (Binomial Distribution): A coin is tossed 10 times. What is the probability of getting exactly 6 heads?
✅ Solution: Use binomial formula: P(X=6) = C(10,6) * (0.5)^6 * (0.5)^4 = 210 * 0.015625 * 0.0625 = approx. 0.205
If you toss the coin just once, it’s either head or tail — 1 or 0. That’s a Bernoulli distribution — the simplest one.
Imagine customers arriving at a bank. You count how many come in each hour. The number varies, but usually it’s 3 to 5 per hour. This kind of count data is often modeled using the Poisson distribution.
👉 Example Problem (Poisson Distribution): On average, 4 customers visit a bank per hour. What is the probability that exactly 2 customers arrive in the next hour?
✅ Solution: Use Poisson formula:
P(X=2) = (e^-4 * 4^2) / 2! = (0.0183 * 16) / 2 = 0.146
Let’s say you conduct a survey expecting 20 students to prefer chocolate over vanilla ice cream, but only 14 do. You want to know: is this difference by chance, or is it statistically significant? You use the Chi-squared distribution to test this.
👉 Example Problem (Chi-squared Distribution): Expected: 20 students like chocolate, 10 like vanilla. Observed: 14 like chocolate, 16 like vanilla.
✅ Solution: Use Chi-squared formula: χ² = Σ((Observed - Expected)² / Expected)
χ² = (14-20)²/20 + (16-10)²/10 = (36/20) + (36/10) = 1.8 + 3.6 = 5.4
The critical value comes from the Chi-squared distribution table, which tells you the threshold beyond which your result is statistically significant. df = 1 and Significance level (α) = 0.05 You look up the value in the Chi-squared table, and you’ll find: Critical value ≈ 3.841 The difference is big enough (χ² = 5.4 > 3.841) that it’s unlikely to be due to chance.
So, you conclude: student preferences are different than expected—maybe chocolate isn’t as popular as you thought!
There are two types of Chi-squared tests, and df is calculated differently depending on which one you're using:
1. Goodness-of-Fit Test (comparing observed vs expected in one categorical variable)
- df = number of categories − 1
- ✅ This is the test you're using in your example (chocolate vs vanilla).
2. Test of Independence (e.g., contingency tables)
df = (number of rows − 1) × (number of columns − 1)
The Inverse Gaussian distribution is used when modeling the time until an event happens, but it assumes that the event rate changes over time (it is often used in advanced reliability and survival analysis).
👉 Example Problem (Inverse Gaussian Distribution): Suppose a machine has an average lifetime of 1000 hours. What is the chance it will last more than 1200 hours?
✅ Solution (conceptual): Use the inverse Gaussian cumulative distribution function or software like R: 1 - pinvgauss(1200, mean = 1000, dispersion = 500)
Now imagine you are tracking how long it takes between customer arrivals at a bank. You find that the waiting time is not always regular. You can model this kind of waiting time using the Gamma distribution.
👉 Example Problem (Gamma Distribution): If the average time between arrivals is 3 minutes, and you want the probability that a customer arrives within 5 minutes, use the gamma probability function with shape and rate parameters.
✅ Solution (conceptual): You plug values into the gamma formula or use R’s pgamma() function: pgamma(5, shape, rate) to get the probability.
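R Code Example (a sketch for checking the worked examples above with R's built-in distribution functions; the gamma parameters shape = 1 and rate = 1/3 are an assumption chosen only to give a 3-minute mean):
dbinom(6, size = 10, prob = 0.5)         # binomial: P(X = 6) ≈ 0.205
dpois(2, lambda = 4)                     # Poisson: P(X = 2) ≈ 0.147
pchisq(5.4, df = 1, lower.tail = FALSE)  # chi-squared: p-value for statistic 5.4 with df = 1
pgamma(5, shape = 1, rate = 1/3)         # gamma: P(arrival within 5 minutes), mean = 3 minutes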
⚙️ Where Are Distributions Used?
In Regression Models: Errors (difference between actual and predicted values) are often assumed to be normally distributed.
In Hypothesis Testing: We compare means, proportions, etc., using tests based on t-distribution or chi-squared distribution.
In Classification: We estimate probabilities of different classes using distributions.
In Clustering: We assume that data comes from a mix of several distributions (like multiple bell curves).

In Outlier Detection: Outliers are the values that lie far from the common range — in the “tails” of the distribution.
Visualising Data Distributions: Visual tools are crucial for inspecting and understanding data distributions:
Histograms: Used to show the frequency distribution of quantitative variables, providing insight into their shape and concentration.
R code example: hist(iris$Sepal.Length)
Boxplots: Effectively summarise the five-number summary (minimum, first quartile, median, third quartile, maximum) of a distribution and are useful for identifying outliers and comparing distributions across different groups. The whiskers also give a sense of how far the data extend beyond the quartiles.
R code example: boxplot(Petal.Length ~ Species, data = iris)
Outliers are data points that fall far outside the typical range. Specifically:
📌 Rule of Thumb:
- Lower Bound = Q1 − 1.5 × IQR
- Upper Bound = Q3 + 1.5 × IQR
Any data point:
- < Lower Bound or
- > Upper Bound
is considered an outlier.
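R Code Example (a minimal sketch of the 1.5 × IQR rule of thumb on a made-up vector):
values <- c(45, 50, 55, 60, 150)
q1  <- quantile(values, 0.25)
q3  <- quantile(values, 0.75)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
values[values < lower | values > upper]   # flags 150 as an outlier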
✅ 1. Quantile Plot – Detecting Unusual Values (Outliers)
📘 Example Scenario:
You are analyzing monthly expenses of 20 students in a college. Most spend between ₹5,000 to ₹7,000. But a few spend above ₹10,000.
You draw a quantile plot:
Most dots form a smooth curve
Suddenly, the line jumps at the end
🧠 What You Learn:
Some students are spending much more than others.
These are outliers — maybe from rich families or incorrect entries.
🎯 Use in Analytics:
These outliers can be removed or treated before building a prediction model (like predicting spending based on background).
# Monthly expenses of 20 students
expenses <- c(5200, 5300, 5400, 5500, 5600, 5800, 5900, 6000,
6100, 6200, 6400, 6500, 6600, 6800, 7000,
10500, 11000, 11500, 12000, 13000)
# Quantile plot: fraction of the data on the x axis, sorted values on the y axis
plot(ppoints(length(expenses)), sort(expenses),
     xlab = "Fraction of data", ylab = "Monthly expenses (Rs.)")

✅ 2. Q–Q Plot – Checking if Data is Normal
📘 Example Scenario:
You want to use linear regression to predict students' marks based on their study hours. This method assumes the data is normal.
You draw a Q–Q plot for marks:
If the dots lie on a straight line ➝ data is normal ✅
If the dots curve ➝ data is skewed ❌
🧠 What You Learn:
If it's not normal, you may need to transform the data (like taking log values) before applying regression.
🎯 Use in Analytics:
This helps you choose the right model or prepare data better for accurate predictions.
# Marks of 20 students
marks <- c(45, 50, 55, 60, 65, 70, 72, 73, 74, 75,
76, 77, 78, 80, 82, 85, 88, 90, 92, 95)
# Q-Q plot against normal distribution
qqnorm(marks, main = "Q–Q Plot of Student Marks")
qqline(marks, col = "red") # Add reference line

On the x axis, the theoretical quantiles correspond to percentiles of the normal distribution:
- -2 (≈ 2.5th percentile)
- -1 (≈ 16th percentile)
- 0 (50th percentile, the median)
- +1 (≈ 84th percentile)
- +2 (≈ 97.5th percentile)
These values are standard scores (z-scores) showing where a value would lie on a normal distribution curve.
✅ 3. Density Plot – Comparing Two Groups
📘 Example Scenario:
You want to compare the exam scores of two classes — Class A and Class B.
You draw density plots for both classes:
Class A has a peak near 75
Class B has a flatter curve, with scores spread from 50 to 90
🧠 What You Learn:
Class A students are more consistent
Class B has more variation in marks
🎯 Use in Analytics:
This analysis helps teachers know which class needs attention or personalized coaching.
# Scores of Class A and Class B
classA <- c(70, 72, 74, 75, 76, 77, 78, 78, 79, 80)
classB <- c(50, 55, 60, 65, 70, 75, 80, 85, 90, 95)
# Plot density curves, with the y limit covering both curves
dA <- density(classA)
dB <- density(classB)
plot(dA, col = "blue", lwd = 2, ylim = range(0, dA$y, dB$y), main = "Exam Score Densities")
lines(dB, col = "green", lwd = 2)

Class A has a narrow peak — consistent scores.
Class B is spread out — more variation.
Q) Summary Statistics: Describing Your Data
Summary statistics are numerical measures that provide concise descriptions of data features, particularly distributions. They are a fundamental part of Exploratory Data Analysis (EDA), helping to generate questions about data and identify properties like noise or outliers. Statistical data descriptions are useful to grasp data trends and identify anomalies.
Key Categories of Summary Statistics:
Measures of Central Tendency: These indicate the middle or center of a data distribution.
Mean: The average of all values in a dataset.
Median: The middle value in a sorted dataset. It is the 50th percentile and effectively divides the data into two equal halves. It can be computed for ordered (ordinal) as well as numeric attributes.
Mode: The value that occurs most frequently in a dataset. A dataset can have one (unimodal), two (bimodal), or more (multimodal) modes.
Measures of Dispersion (Spread): These indicate how spread out the data values are.
Range: The difference between the largest and smallest values in a dataset. This is used in techniques like Min-Max normalization.
Quartiles (Q1 and Q3):
Q1 (First Quartile): The 25th percentile, cutting off the lowest 25% of the data.
Q3 (Third Quartile): The 75th percentile, cutting off the lowest 75% (or highest 25%) of the data.
Together with the median, they indicate a distribution's center, spread, and shape.
Interquartile Range (IQR): The difference between Q3 and Q1 (IQR = Q3 - Q1). It defines the middle 50% of the data. A common rule for identifying suspected outliers is to single out values falling at least 1.5 × IQR above Q3 or below Q1.
Variance (σ2) and Standard Deviation (σ):
Measures of how spread out a data distribution is, relative to the mean.
A low standard deviation indicates data observations are close to the mean, while a high standard deviation means data are spread over a large range.
These measures are useful for identifying outliers. Data is often standardized using mean and standard deviation to ensure similar scaling and weighting for all attributes.
The Five-Number Summary:
A fuller summary of a distribution's shape, especially for skewed distributions.
Consists of: Minimum, Q1, Median (Q2), Q3, Maximum.
Other Statistical Measures (for relationships):
Correlation Coefficient and Covariance: Used for numeric attributes to measure how strongly one attribute implies another. They assess how one attribute's values vary from those of another.
Chi-squared (χ2) Measure: Used for nominal data to detect correlations.
Using R for Summary Statistics: R is a powerful tool for calculating these measures:
The summary() command provides a quick statistical summary for each variable in a dataset, including min, max, mean, median, and quartiles.
Specific functions are available for individual measures:
max() and min() for range.
var() for variance.
sd() for standard deviation.
To calculate the mean without considering NA (missing) values, you can use mean(examsquiz$per, na.rm=TRUE).
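R Code Example (a quick sketch of these functions on the built-in iris data):
data(iris)
summary(iris$Sepal.Length)   # min, quartiles, median and mean in one call
mean(iris$Sepal.Length)
median(iris$Sepal.Length)
var(iris$Sepal.Length)
sd(iris$Sepal.Length)
IQR(iris$Sepal.Length)
range(iris$Sepal.Length)     # smallest and largest value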
These summary statistics are crucial for initial data inspection and understanding the overall behaviour and properties of your data.
Q) short note on Distributions and Summary Statistics
🔹 What is a Distribution?
A distribution tells us how the values in a dataset are spread out.
- It shows which values are common and which are rare.
- Often shown using a histogram or curve.
🔹 Summary Statistics
These are numbers that tell us about the data in a short and simple way.
| 📏 Measure | 📌 What it tells us |
| --- | --- |
| Mean (Average) | Total divided by the number of items |
| Median | Middle value (when sorted) |
| Mode | Most repeated value |
| Range | Difference between highest and lowest |
| Standard Deviation | How spread out the data is |
| Variance | Square of the standard deviation |
| Min/Max | Smallest and biggest values |
| Quartiles | Divide the data into 4 equal parts |
🔹 Example:
Marks: 40, 50, 50, 60, 70
- Mean = (40+50+50+60+70)/5 = 54
- Median = 50
- Mode = 50
- Range = 70 – 40 = 30
Distributions show how data values are spread.
Summary statistics give short numerical info like average, median, etc.
Q) Relationships Among Variables
Understanding relationships among variables is a core task in data analysis, allowing us to find significant connections and patterns within data. This helps in extracting knowledge from data to solve business problems.
Variables can be broadly categorised into quantitative (numeric) and qualitative (categorical) types, and the methods for studying relationships vary depending on these types.
I. Measuring Relationships Between Quantitative Variables
For quantitative variables, which are numerical and can be measured on a scale (e.g., age, salary), we primarily look at:
- Correlation Coefficient (Pearson)
- The Pearson correlation coefficient measures how strongly two numeric variables are linearly related.
- It ranges from -1 to +1.
- Interpretation:
- Values close to +1 indicate a strong positive linear correlation (as one variable increases, the other tends to increase).
- Values close to -1 indicate a strong negative linear correlation (as one variable increases, the other tends to decrease).
- Values close to 0 suggest a weak or no linear correlation.
- Limitations: Correlation only captures linear relationships and might not detect strong nonlinear relationships.
R Code Snippet:
# View correlation between Sepal.Length and Petal.Length
print(cor(iris$Sepal.Length, iris$Petal.Length))
✅ You’ll get a value close to +0.87 — meaning a strong positive relation.
Covariance: This is a related measure that indicates how two variables change together, but it is not standardised like correlation.
- Visualising Quantitative Relationships: Scatter Plots
- What they are: Scatter plots are graphical representations where each data point is plotted as a dot based on its values for two variables. They are essential for understanding the relationship's direction and spread.
- Interpretation:
- If points generally slope from lower-left to upper-right, it suggests a positive correlation.
- If points generally slope from upper-left to lower-right, it suggests a negative correlation.
- Scattered points without a clear pattern indicate weak or no linear correlation.
R Code Snippets:
data(iris)
plot(iris$Sepal.Length, iris$Petal.Length)
3. Scatterplot Matrix: A scatterplot matrix shows scatter plots between every pair of numeric variables in a dataset — all in one grid. It helps you quickly see which variables are related.
🧪 Example: mtcars is a built-in R dataset describing 32 car models, with variables such as:
- mpg: miles per gallon (fuel efficiency)
- disp: displacement (engine size)
- hp: horsepower (engine power; higher means a faster car)
- wt: weight (in 1000 lbs; heavier cars have higher numbers)
📊 R Code:
data(mtcars)
pairs(mtcars[, c("mpg", "disp", "hp", "wt")])

🧠 Interpretation:
mpg vs wt: The dots slope downward → as car weight increases, mpg decreases → negative relationship
hp vs disp: Dots go up → bigger engine → more horsepower → positive relationship
🎯 Use in Predictive Analytics:
Helps you decide which variables to include in regression or machine learning models.
Used in multivariate analysis to detect clusters or relationships.
4. Local Regression (LOESS / LOWESS)
Local regression fits a smooth curve to your data — not a straight line.
It’s great when the relationship between variables is not linear (i.e., not a straight line).
🧪 Example:
Let’s say you want to study the relationship between Petal.Length and Petal.Width in the iris dataset, but it’s not perfectly linear.
📊 R Code:
plot(iris$Petal.Length, iris$Petal.Width, pch = 19, col = "blue")
# Add LOESS smooth curve
lines(lowess(iris$Petal.Length, iris$Petal.Width), col = "red", lwd = 2)
🧠 Interpretation:
The red curve shows the actual pattern in the data.
If the curve bends, the relationship is nonlinear.
🎯 Use in Predictive Analytics:
Captures nonlinear trends missed by linear regression.
Used for smoothing noisy data before model training.
Helps visualize and understand complex patterns.
Useful in time series forecasting, customer behavior modeling, etc.
| Technique | Purpose | R Function | Used For |
| --- | --- | --- | --- |
| Scatterplot Matrix | Visualize relationships among variables | pairs() | Feature selection, EDA |
| Local Regression | Fit smooth (nonlinear) curves | lowess() or loess() | Smoothing, nonlinear modeling |
II. Measuring Relationships Between Qualitative/Categorical Variables
For qualitative or categorical variables, which represent categories or labels (e.g., gender, marital status), different methods are used:
- Chi-squared (χ²) Test
The Chi-squared test helps us find out if two categorical (qualitative) variables are related.
You run a bookstore and want to check if gender affects book preference:
- Do males prefer fiction more than females?
- Or is book preference independent of gender?
📊 Example Data:
| Gender | Book Type | Count |
| --- | --- | --- |
| Male | Fiction | 30 |
| Male | Nonfiction | 20 |
| Female | Fiction | 10 |
| Female | Nonfiction | 40 |
book_data <- matrix(c(30, 20, 10, 40), nrow = 2, byrow = TRUE)
colnames(book_data) <- c("Fiction", "Nonfiction")
rownames(book_data) <- c("Male", "Female")
# Perform Chi-squared Test
chisq.test(book_data)
✅ Output Insight: X-squared = 15.042, df = 1, p-value = 0.0001052
- If the p-value < 0.05, there is a relationship between gender and book type.
- If the p-value > 0.05, they are independent. Since the p-value 0.0001052 < 0.05, there is a relationship between gender and book type.

- Association Rules
- What they are: These are "IF-THEN" statements that describe relationships between items in a dataset, commonly used in "market basket analysis" to find frequently co-occurring products.
- Key measures:
- Support: How often the items in the rule appear together in the dataset.
- Confidence: How often the "THEN" part of the rule is true when the "IF" part is true.
- Example: "IF a customer buys bread AND butter THEN they also buy milk".
- Graphical Models
Graphical models show connections between multiple variables using nodes and edges:
- Nodes = variables (like age, income)
- Edges = relationships (strong or weak)
- No edge = the variables are conditionally independent
Q) Extent of Missing Data
🔹 What is Missing Data?
Sometimes, some values in a dataset are empty or not available.
This is called missing data.
🔹 Why Does Data Go Missing?
| Reason | Example |
| --- | --- |
| ❌ Not recorded properly | Student forgot to write age on a form |
| 📂 Data lost during transfer | File corrupted while saving |
| 👨‍💼 Person refused to answer | Patient did not share income details |
| 🔍 System error or bug | App failed to collect GPS location |
🔹 Extent of Missing Data
- Extent means how much data is missing in the dataset.
- We calculate the percentage or number of missing values.
🧮 Example:
If a column has 100 values and 10 are missing → 10% missing
🔹 Why is Missing Data a Problem?
| Problem | Meaning |
| --- | --- |
| 📉 Reduces accuracy | Wrong results in analysis or models |
| 🚫 Some methods cannot work | Some tools need complete data |
| 💡 May hide important patterns | We may miss useful relationships |
🔹 What to Do with Missing Data?
| Method | Meaning |
| --- | --- |
| ❌ Delete rows | Remove rows that have missing values |
| 📥 Fill with average/median | Use the average value to replace the missing one |
| 🔁 Predict missing values | Use machine learning to guess the value |
| 🚫 Leave as is (carefully) | Sometimes okay if only a few values are missing |
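R Code Example (a minimal sketch of measuring the extent of missing data and handling it; the data frame is made up):
df <- data.frame(age    = c(21, NA, 25, 30, NA),
                 income = c(30000, 42000, NA, 51000, 38000))

colSums(is.na(df))                     # number of missing values per column
round(colMeans(is.na(df)) * 100, 1)    # percentage missing per column

df_complete <- na.omit(df)             # option 1: delete rows with missing values
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)   # option 2: fill with the mean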
Q) Cluster Analysis (Segmentation)
Cluster analysis is a powerful technique used in data mining to group data items that are similar to each other. Imagine you have a large collection of items, but you don't have any pre-defined categories or labels for them. Cluster analysis helps you to discover natural groupings or hidden patterns within this data.
The main idea is to make sure that:
- Items within the same group (called a cluster) are very much alike.
- Items in different groups are very different from each other.
This process is often called data segmentation because it effectively divides a large set of data into smaller, more manageable parts or segments.
How Do We Find Clusters? (Common Methods)
There are several ways to perform cluster analysis, each with its own approach:
- Partitioning Methods (e.g., k-Means)
- Idea: These methods divide the data into a specific number of groups, which you decide beforehand (let's call this number 'k').
- How k-Means works (simplified):
- You tell the computer how many groups ('k') you want to find.
- The algorithm picks 'k' starting points, which act like "centre points" for your groups.
- Every data item is then put into the group whose "centre point" is closest to it.
- Once all items are assigned, the computer recalculates the actual "centre point" for each group based on where all the items in that group are located.
- Steps 3 and 4 are repeated until the groups no longer change much.
- Good for: Finding clusters that are generally round or spherical.
- Important Note: A challenge is deciding the best number of 'k' groups before you start. Finding the exact best solution can be very difficult computationally, so simpler, step-by-step (greedy) methods like k-means are commonly used.
- Hierarchical Methods
- Idea: These methods build a tree-like structure of clusters. You don't need to specify 'k' upfront.
- Two main types:
- Agglomerative (Bottom-up): This approach starts by treating every single data item as its own tiny cluster. Then, it repeatedly merges the two closest clusters together. This continues until all items are eventually merged into one large cluster, or until a certain stopping point is reached.
- Divisive (Top-down): This is the opposite. It starts with all data items in one big cluster and then repeatedly divides the clusters into smaller and smaller ones.
- Result: The clustering can be shown as a tree diagram, called a dendrogram, which helps visualise how clusters are related at different levels.
- Important Note: For very large datasets, these methods can require a lot of computing power and memory.
- Density-Based Methods (e.g., DBSCAN)
- Idea: These methods find clusters by looking for areas where data points are densely packed together. Sparse areas between dense regions are considered boundaries, and isolated points are often seen as "noise" or outliers.
- Good for: Discovering clusters that have irregular or complex shapes (not just circles or spheres), and for finding noisy data points that don't belong to any cluster.
Example: Iris Dataset
Let's imagine the Iris dataset, which is a famous collection of measurements (like sepal length, sepal width, petal length, and petal width) for 150 iris flowers. These flowers actually belong to three known species (setosa, versicolor, and virginica).
- How segmentation is applied: If you didn't know the species beforehand, you could use cluster analysis on these measurements.
- What it does: A k-means algorithm, for example, if told to find 3 clusters, would group the 150 flowers into three distinct groups based purely on their measurements. You could then check how well these automatically formed groups match the actual known species of the flowers. This helps to see if the measurements alone are good enough to tell the species apart.
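R Code Example (a minimal sketch of the k-means segmentation described above on the iris measurements; the cluster numbering is arbitrary):
data(iris)
set.seed(42)                               # make the clustering reproducible
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)   # 3 clusters from the 4 measurements
table(Cluster = km$cluster, Species = iris$Species)   # compare clusters with the true species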
Why is Cluster Analysis Important? (Applications)
Cluster analysis (segmentation) is used in many different areas:
- Customer Segmentation: Businesses group customers based on their buying habits or preferences. This helps them create more effective marketing campaigns for specific customer groups.
- Document Organisation: Grouping news articles or documents that talk about similar topics.
- Biology: Identifying groups of genes or proteins that show similar behaviour.
- City Planning: Segmenting areas in a city based on housing types or population characteristics for urban development.
Q) short note on Outlier Detection
- An outlier is a data point that is very different from other points.
- It is unusual, and may be an error or a rare event.
🔹 Why Detect Outliers?
| Reason | Example |
| --- | --- |
| 🚨 To find errors | A person with age = 200 (not possible) |
| 🕵️‍♂️ To catch fraud | A credit card used in 3 countries in 1 hour |
| 📊 To improve accuracy | Wrong data affects model results |
🔹 How to Detect Outliers?
| Method | Simple Meaning |
| --- | --- |
| Box Plot | Shows outliers as dots outside the box |
| Z-Score | If a value is far from the average (mean), it is an outlier |
| IQR Method | If a value falls far outside the Q1–Q3 range, it is an outlier |
| Scatter Plot | Points that lie far away from the others |
🔹 Example:
Marks: 45, 50, 55, 60, 150
→ 150 is an outlier, because it is too high compared to others.
Outlier = A value that is far away or unusual.
Detecting outliers helps improve data quality and detect fraud.
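R Code Example (a minimal sketch using the marks example above; boxplot.stats() applies the box plot / IQR rule):
marks <- c(45, 50, 55, 60, 150)
boxplot(marks)                 # 150 appears as a dot outside the whiskers
boxplot.stats(marks)$out       # returns 150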
Q) Outlier Detection
Outlier detection is about finding data points that are significantly different from the majority of other data in a dataset. These unusual points are also known as anomalies. An outlier represents something that does not conform to the expected pattern of the data. For example, a credit card transaction that is much larger than a customer's usual spending might be flagged as an outlier, potentially indicating fraud.
It is important to differentiate outliers from 'noise'. While noise refers to random errors or irrelevant data that one usually aims to clean from a dataset, outliers often carry valuable information about unusual events or behaviours.
What are the Different Kinds of Outliers?
Outliers can be categorised into three main types:
- Global Outliers:
📌 What it means:
A data point is very different from all other data, no matter the situation or context.
✅ Simple Example:
Most people finish a test in 30 to 60 minutes, but one person takes 5 hours.
That’s clearly unusual — it’s a global outlier.
- Contextual Outliers:
A data point seems normal in general, but becomes unusual in a specific context or condition.
✅ Simple Example:
- A speed of 80 km/h is normal on a highway 🚗 — not an outlier.
- But 80 km/h in a school zone? That’s very unusual — it's a contextual outlier.
It only becomes an outlier when you look at the context (school zone).
Collective Outliers
📌 What it means:
A group of values together is unusual, even if each value alone doesn’t seem odd.
✅ Simple Example:
One student getting a low grade on a quiz isn’t strange.
But if an entire class of students suddenly scores very low on the same quiz — that’s a collective outlier.
It may suggest a problem with the quiz or something else that affected everyone.
How are Outliers Found?
There are several primary approaches to detecting outliers:
- Statistical Methods:
These methods look at the overall pattern of the data. Most of the time, we assume the data follows a common shape, like the bell curve (also called a normal distribution).
- Outliers are identified as data points that have a very low probability of occurring under this assumed distribution.
- Boxplots are a simple visual tool that can help in identifying potential outliers.
- Proximity-Based Methods: If something is far away from its neighbours, it is probably an outlier. The Local Outlier Factor (LOF) checks how close a data point is to its neighbours; if it lies in a low-density area, it is treated as an outlier.
- Clustering-Based Methods: Normal points form big groups (clusters). If a point does not belong to a group, or belongs to a tiny group, it is an outlier. Clustering algorithms (like k-Means or DBSCAN) are used to find the groups, and points that do not fit into any group are flagged as outliers.
- Classification-Based Methods: If you already know what “normal” and “abnormal” look like, you can train a model to detect new outliers.
📌 Real Method: Train a model on normal data only (e.g., only healthy patients). When it sees something that does not fit the pattern (e.g., an unusual health reading), it marks it as an outlier.
Unit III: Model development & techniques Data Partitioning, Model selection, Model Development Techniques, Neural networks, Decision trees, Logistic regression, Discriminant analysis, Support vector machine, Bayesian Networks, Linear Regression, Cox Regression, Association rules.
Model development is like creating a "smart program" or a "mathematical model" that learns from existing data to make predictions or find patterns in new, unseen data. This process helps translate a real-world problem into something a computer can solve, and then turn the computer's answers back into useful solutions.
Q) Data Partitioning (Splitting Data)
When developing a model, it is crucial to test its performance on data it has not seen before. This is like a student studying for an exam: you want to see if they can answer new questions, not just the ones they memorized. This is why data is split into different parts.
- Training Set: This is the largest part of your data, used to build and "teach" the model. The model learns patterns and relationships from this data.
- Test Set: This part of the data is kept separate and is used only once, at the very end, to measure how well the final model performs on new, unseen data.
- Validation Set (Optional but Recommended): After training, the model is evaluated multiple times on the validation set to:
1) Tune the model's hyperparameters (e.g., learning rate, number of layers, max iterations)
2) Decide which version of the model performs better
A typical way to split data is 50% for training, 25% for validation, and 25% for testing.
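R Code Example (a minimal sketch of a 50/25/25 split; the proportions come from the text, while the use of the built-in iris data is an assumption for illustration):
data(iris)
set.seed(1)                          # reproducible split
n   <- nrow(iris)
idx <- sample(n)                     # shuffle the row indices

train <- iris[idx[1:round(0.50 * n)], ]
valid <- iris[idx[(round(0.50 * n) + 1):round(0.75 * n)], ]
test  <- iris[idx[(round(0.75 * n) + 1):n], ]

c(train = nrow(train), validation = nrow(valid), test = nrow(test))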
Q) Model Selection (Choosing the Best Model)
Once you have built different models, you need to decide which one is the "best" for your specific problem.
1. Evaluation on Test Set: The primary way to select a model is to evaluate its performance (e.g., accuracy for classification) on the test set. A model that performs well on the test set is expected to perform well on real-world, new data.
2. Cross-Validation
- Idea: Instead of depending on just one test/train split, we repeatedly split the data and test each time.
- How it works (k-fold CV):
- Split data into k equal parts (folds).
- In each round, use one fold as test set and the rest as training set.
- Repeat for all k folds.
- Average the performance across all rounds.
- Why used in model selection?
- Reduces risk of choosing a model that performs well just by chance on one test set.
- Gives a more reliable estimate of model’s ability to generalize.
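R Code Example (a minimal sketch of 5-fold cross-validation for a simple linear model; the mtcars data and the mpg ~ wt + hp formula are assumptions chosen only for illustration):
data(mtcars)
set.seed(1)
k      <- 5
folds  <- sample(rep(1:k, length.out = nrow(mtcars)))   # assign each row to a fold
errors <- numeric(k)

for (i in 1:k) {
  train <- mtcars[folds != i, ]              # k-1 folds for training
  test  <- mtcars[folds == i, ]              # held-out fold for testing
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  errors[i] <- mean((test$mpg - pred)^2)     # mean squared error on this fold
}
mean(errors)                                 # average performance across the folds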
3. Bootstrap
- Idea: Create multiple “new datasets” by sampling the original training data with replacement.
- For each bootstrap sample:
- Train a model on sampled data.
- Test it on the data points that were not sampled (called out-of-bag data).
- Repeat this process many times and average the results.
- Why used in model selection?
- Helps when dataset is small.
- Provides an estimate of stability and reliability of model performance.
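R Code Example (a rough sketch of bootstrap evaluation with out-of-bag rows, again using mtcars as an assumed example):
data(mtcars)
set.seed(1)
B       <- 100
oob_mse <- numeric(B)

for (b in 1:B) {
  boot_idx <- sample(nrow(mtcars), replace = TRUE)    # sample rows with replacement
  oob_idx  <- setdiff(1:nrow(mtcars), boot_idx)       # rows never sampled = out-of-bag
  fit  <- lm(mpg ~ wt + hp, data = mtcars[boot_idx, ])
  pred <- predict(fit, newdata = mtcars[oob_idx, ])
  oob_mse[b] <- mean((mtcars$mpg[oob_idx] - pred)^2)
}
mean(oob_mse)                                         # average out-of-bag error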
4. Information Criteria (AIC, BIC)
Sometimes, two models perform almost equally well on test data. Should we always pick the more complex one?
→ Not necessarily! Complex models may overfit.
- AIC (Akaike Information Criterion)
- Balances fit vs complexity.
- Formula (conceptually):
AIC = Error (badness of fit) + Penalty (number of parameters)
- Lower AIC → Better model.
- BIC (Bayesian Information Criterion)
- Similar to AIC, but penalizes complexity more strongly.
- Works well when we want very simple and interpretable models.
🔑 In simple English:
- If two models predict almost equally well, prefer the one with fewer parameters.
- AIC and BIC give us a number to compare models: the smaller value = better choice.


In formula form, with L the maximised likelihood of the model, k the number of estimated parameters, and n the number of observations:
AIC = 2k - 2 ln(L)
BIC = k ln(n) - 2 ln(L)
Here k counts the fitted parameters; for example, k = 5 if your model has 5 coefficients, weights, or nodes.
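R Code Example (a minimal sketch comparing a simpler and a more complex model with base R's AIC() and BIC(); the mtcars data and the formulas are assumptions for illustration):
data(mtcars)
simple  <- lm(mpg ~ wt, data = mtcars)
complex <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

AIC(simple, complex)   # lower AIC is preferred
BIC(simple, complex)   # BIC penalises the extra parameters more strongly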
Q) Association Rules
Association Rules are like "if-then" statements that describe relationships between different items or events in a large collection of data. They tell us that if one thing happens (the "if" part or condition), then another thing is likely to happen (the "then" part or consequence).
You can think of it like this: Condition ⇒ Consequence.
For example, in a computer store, an association rule might be: If a customer buys Computer, then they also buy Anti-Virus.
To find and understand these rules, we use some key ideas:
- Support: This tells how common the rule is across all transactions.
For example, consider the rule: Computer ⇒ Anti-virus.
Support = (number of transactions containing both Computer and Anti-virus) / (total number of transactions)
Itemsets with high support are called frequent itemsets.
- Confidence: This tells how often anti-virus is bought when a computer is bought; in other words, “given that someone bought a computer, how often did they also buy anti-virus?”
Confidence = (number of transactions containing both Computer and Anti-virus) / (number of transactions containing Computer)
It shows the "strength" of the rule.
- Lift: This measures how much more likely the "consequence" is to happen when the "condition" is met, compared to when the "condition" is not met.
- If Lift = 1, the items are independent (no real relationship).
- If Lift > 1, there is a positive relationship (buying Computer "lifts" the chance of buying Anti-Virus).
- If Lift < 1, there is a negative relationship (buying Computer decreases the chance of buying Anti-Virus). This is important because a rule can have high support and confidence but still be misleading if the items are negatively correlated.
How are Association Rules Found?
Finding association rules is usually a two-step process:
- Find Frequent Itemsets: First, identify all groups of items that appear together very often, based on a minimum "support" level.
- Generate Rules: Then, use these frequent itemsets to create rules that meet a minimum "confidence" level.
Common algorithms for finding these rules include:
- Apriori Algorithm: This is one of the oldest and most famous methods. It works by first finding small groups of frequently bought items, then using these small groups to build larger ones, step by step.
- ECLAT (Equivalence Class Transformation): This method works by changing how the data is stored to make finding frequent itemsets more efficient, especially for vertical data formats. You can implement both Apriori and ECLAT using R programs.
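R Code Example (a minimal sketch using the arules package and its built-in Groceries transactions dataset; the package choice and the support/confidence thresholds are assumptions made for illustration):
library(arules)                 # install.packages("arules") if needed
data(Groceries)                 # built-in market-basket (transactions) dataset

# Apriori: rules with minimum support 1% and minimum confidence 50%
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 5))       # top 5 rules by lift

# ECLAT: frequent itemsets only (rules can be derived from them afterwards)
itemsets <- eclat(Groceries, parameter = list(supp = 0.01))
inspect(head(sort(itemsets, by = "support"), 5))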
Why are Association Rules Useful? (Applications)
Association rules are used in many real-world situations, such as:
- Market Basket Analysis: This is the most common use. Supermarkets use it to understand what customers buy together. For example, if customers often buy "beer" and "diapers" together, the store might place them closer on the shelves or offer special deals.
- Target Marketing: Companies use rules to identify specific groups of customers who are likely to buy certain products. For instance, if people aged 85-95 often buy a certain brand of checkers, then the company can target advertisements for checkers to this age group.
Q) Cox Regression
Cox Regression is a type of statistical model used when you want to predict the time until an event happens. Think of it like trying to predict how long something will last before a specific event occurs.
- What it's for: It is often used in situations where the "event" is something like patient survival (e.g., how long a patient lives after treatment) or the time until a machine fails. Because some observations might not have experienced the event yet (e.g., a patient is still alive at the end of the study, or a machine is still working), this type of data is called "censored survival data".
- How it helps: It helps to understand how different factors (called “predictors” or "covariates" or "features") influence the time until that event. For example, in medical research, it can help doctors understand if a new treatment or a patient's age affects how long they live.
- Example: It can be used in studies involving gene expression data to predict survival time in patients, such as those with lymphoma. This model helps to find which genes might be important in predicting how long these patients survive.
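A minimal R sketch using the survival package; the built-in lung dataset is used only as an illustration (not the lymphoma study mentioned above):
library(survival)
# Time-to-event outcome: survival time and censoring status of lung-cancer patients
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)   # hazard ratios show how each covariate influences survival time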
Q) Explain in detail about linear regression and multiple linear regression.
Regression analysis is a very widely used statistical tool.
It is used to establish a relationship model between two variables.
One variable is called a dependent or response variable whose value must be predicted.
Other variable is called an independent or predictor variable whose value is known.
In Linear Regression these two variables are related through an equation.
Mathematically a linear relationship represents a straight line.
A non-linear relationship creates a curve.
The general mathematical equation for a linear regression is −
y = ax + b
Following is the description of the parameters used −
y is the dependent variable.
x is the independent variable.
a and b are constants which are called the coefficients.
Steps to Establish a Regression
A simple example of regression is to predict the weight of a person when his height is known. To do this we need to have the relationship between height and weight of a person.
Here y is weight and x is height.
The steps to create the relationship are −
- Gather the height and weight of a few people.
- Create a relationship model using the lm() function in R.
- Find the coefficients from the model.
- Know the average error in prediction (the residuals).
- Use the predict() function to predict the weight of new persons.
For example:
heightx <- c(1,2,3)
weighty <- c(1,3,4)
relation <- lm(weighty~heightx) # Apply the lm() function.
print(relation)
When we execute the above code, it gives the values of a and b as coefficients:
b = -0.3333, a = 1.5000
Hence the fitted line equation is y = 1.5x - 0.33.
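Continuing the same toy example, step 5 (prediction with predict()) might look like this:
# Predict the weight for a new person with height 2.5 using the fitted model
newheight <- data.frame(heightx = 2.5)
predict(relation, newheight)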
Multiple regression is an extension of linear regression.
It finds relationships between more than two variables. In simple linear relation we have one independent and one dependent variable, but in multiple regression we have more than one independent variable and one dependent variable.
The general mathematical equation for multiple regression is −
y = a1x1+a2x2+...+b
Following is the description of the parameters used −
y is the response variable.
b, a1, a2...an are the coefficients.
x1, x2, ...xn are the predictor variables.
We create the regression model using the lm() function in R. The model determines the value of the coefficients using the input data. Next we can predict the value of the response variable for a given set of predictor variables using these coefficients.
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in multiple regression is −
lm(weighty ~ heightx+agex)
heightx <- c(1, 2, 3)
weighty <- c(1, 3, 4)
agex <- c(0, 3, 4)
relation <- lm(weighty ~ heightx + agex)   # fit the multiple regression model
print(relation)
newdata <- data.frame(heightx = 2.5, agex = 3)
predict(relation, newdata)                 # predict weight for the new data point
Q) Explain Feed forward neural networks
A feedforward neural network is a type of artificial neural network where the information flows only in one direction, from input to output, without any feedback or loops.
In a feedforward neural network, the input layer receives the input data and passes it to the first hidden layer. Each neuron in the hidden layer applies a mathematical function to the input and passes the output to the next layer. This process is repeated for all the hidden layers until the output layer is reached, which produces the final output of the network.

The output of each neuron is calculated by applying a weighted sum of the inputs and passing the result through an activation function. The weights are learned during the training process, where the network adjusts the weights to minimize the error between the predicted output and the actual output.
Feedforward neural networks are commonly used for a variety of tasks, including classification, regression, and pattern recognition. They are also used as building blocks for more complex neural network architectures, such as convolutional neural networks and recurrent neural networks.
Q) Explain backpropagation?
The backpropagation algorithm works by propagating the error backwards from the output layer to the input layer, adjusting the weights of the neurons in each layer along the way.
During training, the input data is fed into the neural network, and the output of the network is compared to the actual output. The difference between the predicted output and the actual output is called the error, and this error is used to adjust the weights of the neurons in the network.
The backpropagation algorithm starts by computing the error at the output layer, and then propagating this error backwards through the network, layer by layer. The amount of error that each neuron contributes to the output is computed by taking the partial derivative of the error with respect to the output of the neuron. The weights of the neurons are then adjusted based on the amount of error they contributed to the output.
The backpropagation algorithm is typically used in conjunction with gradient descent optimization, which is used to minimize the error in the network by adjusting the weights of the neurons in the direction of the steepest descent of the error surface.
Backpropagation is an important technique for training neural networks and is used in many popular neural network architectures, including feedforward neural networks, convolutional neural networks, and recurrent neural networks.
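As a rough illustration of the idea (not the full algorithm), one gradient-descent weight update for a single sigmoid neuron could look like this in R; all numbers are made up:
# One training example (made-up numbers)
x <- c(1, 0.5)        # inputs
t <- 1                # target output
w <- c(0.2, -0.1)     # current weights
b <- 0                # bias
eta <- 0.5            # learning rate
sigmoid <- function(z) 1 / (1 + exp(-z))
# Forward pass
y <- sigmoid(sum(w * x) + b)
# Backward pass: gradient of the squared error E = 0.5 * (y - t)^2
delta <- (y - t) * y * (1 - y)   # dE/dz via the chain rule
w <- w - eta * delta * x         # adjust weights against the gradient
b <- b - eta * delta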
Q) Linear Discriminant Analysis (LDA)
When the class label (response variable) has more than 2 classes, we can use LDA.
The objective of LDA is to perform dimensionality reduction. It can also be used for classification. In LDA, we create a new axis and project the data onto it.

As shown in the figure above, the two-dimensional data is projected onto one dimension, so LDA has reduced the data by one dimension. But this particular axis does not separate the two classes (red and blue).
LDA instead chooses an axis that separates the two classes, as shown below. The axis is chosen to maximize the distance between the means of the 2 classes (red and blue) while minimizing the scatter within each class.

https://www.youtube.com/watch?v=azXCzI57Yfc
https://www.youtube.com/watch?v=DVqpwsRxjKQ
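A minimal R sketch of LDA using the MASS package; the built-in iris data (3 classes) is only an illustrative choice:
library(MASS)
# Fit LDA: project the 4 measurements onto new discriminant axes
fit <- lda(Species ~ ., data = iris)
fit$scaling                     # coefficients defining the new axes
# Project the data and classify
proj <- predict(fit, iris)
table(Predicted = proj$class, Actual = iris$Species)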
Q) Decision trees
Classification by Decision Tree Induction (or: Discuss building a decision tree and how a decision tree works).
Decision tree induction is the learning of decision trees from class-labeled training tuples.
- A decision tree is a flowchart-like tree structure, where
- Each internal node denotes a test on an attribute.
- Each branch represents an outcome of the test.
- Each leaf node holds a class label.
- The topmost node in a tree is the root node.

Advantages of Decision trees
- A significant advantage of a decision tree is that it forces the consideration of all possible outcomes of a decision and traces each path to a conclusion.
- The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery.
- Decision trees can handle high dimensional data.
- Their representation of acquired knowledge in tree form is easy to understand.
- They are robust to noisy data.
- The learning and classification steps of decision tree induction are simple and fast.
- In general, decision tree classifiers have good accuracy.
- Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
Disadvantages
- Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
- Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.



In the example below:
- outlook is the root node
- humidity and wind are attributes/columns
- yes or no are the class labels
This example shows whether a child can play outside or not.
In a decision tree diagram, rectangles represent attributes/columns and ellipses represent class labels.
The main purpose of decision tree is to extract the rules for classification.
Example: if outlook = sunny and humidity = normal then play = yes
if outlook = overcast then play = yes
if outlook = rainy and wind = low then play = yes
Types of Decision Trees
1. Unweighted decision tree: there is no weight on any node of the tree, i.e., there are no biases in the decision tree.
2. Weighted decision tree: nodes (or branches) of the tree carry weights/biases.
3. Binary decision tree: every test is two-way, i.e., each internal node has at most two branches (for example, yes/no tests or two class labels).
4. Random forest: a combination of n decision trees.
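A minimal R sketch of decision tree induction with the rpart package; the built-in iris data is used here instead of the play/weather example:
library(rpart)
# Grow a classification tree
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)                     # shows the extracted IF-THEN style splits
# Classify a new flower
new_flower <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.5,
                         Petal.Length = 1.4, Petal.Width = 0.2)
predict(tree, new_flower, type = "class")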
Q) What is a Neural Network?
A Neural Network is a machine learning model inspired by how the brain works.
It tries to learn patterns from data and make predictions — even if the relationship is complex or non-linear.
🧠 Basic Structure
| Layer | Meaning |
| --- | --- |
| Input Layer | Takes the input variables (like age, income, etc.) |
| Hidden Layers | Perform the calculations; this is where learning happens |
| Output Layer | Gives the prediction (Yes/No, number, etc.) |
Each layer is made of neurons (also called nodes or units), and each neuron does a weighted sum + activation.
🔧 Key Concepts
| Concept | Meaning |
| --- | --- |
| Weights | Strength of connection between neurons |
| Activation Function | Decides how much signal to pass on (like sigmoid, ReLU) |
| Learning | Adjusts weights using training data to reduce error |
| Backpropagation | Technique to update weights by checking error at the output |
| Epoch | One complete pass over the training data |
📘 R Code (Using nnet Package)
# Load package
library(nnet)
# Build neural network model (train_data is a placeholder training data frame)
model <- nnet(Buy ~ Age + Income, data = train_data, size = 3)
# Predict on new data (test_data is a placeholder data frame of new cases)
predict(model, newdata = test_data, type = "class")
📋 Example Dataset
| Age | Income | Buy |
| --- | --- | --- |
| 25 | 40000 | No |
| 45 | 85000 | Yes |
| 35 | 60000 | Yes |
| 30 | 50000 | No |
# Sample data
data <- data.frame(
Age = c(25, 45, 35, 30),
Income = c(40000, 85000, 60000, 50000),
Buy = c("No", "Yes", "Yes", "No")
)
# Convert target to factor
data$Buy <- as.factor(data$Buy)
# Load library
library(nnet)
# Train the neural network with 3 hidden nodes
model <- nnet(Buy ~ Age + Income, data = data, size = 3)
# Predict for a new customer
new_customer <- data.frame(Age = 40, Income = 70000)
predict(model, new_customer, type = "class")
Q) Logistic Regression
Logistic regression is used to predict a “Yes/No” outcome, based on one or more input variables.
- Example: Will a customer buy the product or not?
- Output is 0 or 1, not a number like in linear regression.
Steps in logistic regression are as follows:
🔹 STEP 1: Linear Combination (Just like Linear Regression)
We calculate:
z = b0 + b1*x1 + b2*x2 + ...
This is like saying:
z = intercept + age coefficient × age + income coefficient × income
🔹 STEP 2: Sigmoid Function Converts z to a Probability
p = 1 / (1 + e^(-z))
This squashes any number (even negative or very large) to a range between 0 and 1, which is perfect for probabilities.
🔹 STEP 3: Decision Rule (Classify)
If the probability > 0.5 → Predict Yes
If the probability ≤ 0.5 → Predict No
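A minimal R sketch of these three steps using glm(); the tiny Buy/Age/Income data is made up, mirroring the neural-network example above:
# Made-up training data
data <- data.frame(
  Age    = c(25, 45, 35, 30),
  Income = c(40000, 85000, 60000, 50000),
  Buy    = factor(c("No", "Yes", "Yes", "No"))
)
# Steps 1 and 2: glm() fits b0, b1, b2 and applies the sigmoid internally
# (with such a tiny, perfectly separable dataset R may warn about fitted probabilities of 0 or 1)
model <- glm(Buy ~ Age + Income, data = data, family = binomial)
# Predicted probability for a new customer
p <- predict(model, data.frame(Age = 40, Income = 70000), type = "response")
# Step 3: decision rule
ifelse(p > 0.5, "Yes", "No")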
Example: Predicting whether a mouse is obese or not

Q) Bayesian networks
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
- Posterior Probability [P(H|X)]
- Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H|X) = P(X|H) P(H) / P(X)
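A quick numeric illustration in R; the probabilities are made-up values, only to show how the formula is applied:
# Made-up values: H = "patient has the disease", X = "test is positive"
p_H      <- 0.01   # prior P(H)
p_X_H    <- 0.95   # likelihood P(X|H)
p_X_notH <- 0.10   # P(X | not H)
# Total probability of the data: P(X)
p_X <- p_X_H * p_H + p_X_notH * (1 - p_H)
# Bayes' theorem: posterior P(H|X) = P(X|H) * P(H) / P(X)
p_H_X <- p_X_H * p_H / p_X
p_H_X   # about 0.088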
Bayesian Belief Network
Bayesian Belief Networks specify joint conditional probability distributions. They are also known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
- A Belief Network allows class conditional independencies to be defined between subsets of variables.
- It provides a graphical model of causal relationship on which learning can be performed.
- We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −
- Directed acyclic graph
- A set of conditional probability tables
Directed Acyclic Graph
- Each node in a directed acyclic graph represents a random variable.
- These variables may be discrete or continuous-valued.
- These variables may correspond to the actual attribute given in the data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean variables.

The arc in the diagram allows representation of causal knowledge. For example, lung cancer is influenced by a person's family history of lung cancer, as well as whether or not the person is a smoker. It is worth noting that the variable PositiveXray is independent of whether the patient has a family history of lung cancer or that the patient is a smoker, given that we know the patient has lung cancer.
Conditional Probability Table
The conditional probability table for the values of the variable LungCancer (LC) showing each possible combination of the values of its parent nodes, FamilyHistory (FH), and Smoker (S) is as follows −

Short answer on SVM
SVM doesn't just draw any line that separates the two classes — it draws the line that:
- Separates the classes correctly
- Maximizes the distance (called margin) between the line and the closest points from each class.
Those closest points are called Support Vectors. They're the key players.
The margin is the space between the line and the closest points from each class.
The support vectors are the data points that lie closest to the line — they’re the ones that "support" the optimal line.
Q) Support vector machines (SVM)
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane:
Types of SVM
SVM can be of two types:
- Linear SVM: used for linearly separable data. If a dataset can be classified into two classes using a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier.
- Non-linear SVM: used for non-linearly separable data. If a dataset cannot be classified using a straight line, the data is termed non-linear, and the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find out the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.
The dimension of the hyperplane depends on the number of features in the dataset: with 2 features (as shown in the image), the hyperplane is a straight line; with 3 features, it is a two-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points (vectors) closest to the hyperplane, which affect its position, are termed support vectors. Since these vectors "support" the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:

Since this is a 2-D space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that separate these classes. Consider the image below:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between these vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.

Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below image: 
Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert it back into 2-D space with z = 1, it becomes:

Hence we get a circle of radius 1 in the case of non-linear data.
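A minimal R sketch of an SVM classifier using the e1071 package; restricting iris to two classes is only an illustrative choice:
library(e1071)
# Two-class subset of iris so a single hyperplane separates the data
two_class <- droplevels(subset(iris, Species != "virginica"))
# Linear SVM; for non-linearly separable data use kernel = "radial"
fit <- svm(Species ~ Petal.Length + Petal.Width, data = two_class,
           kernel = "linear")
fit$SV          # the (scaled) support vectors that define the hyperplane
table(Predicted = predict(fit, two_class), Actual = two_class$Species)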
Unit IV: Automated Data Preparation, Combining data files, Aggregate Data, Duplicate Removal, Sampling Data, Data Caching, Partitioning data, Missing Values. Model Evaluation and Deployment Introduction, Model Validation, Rule Induction Using CHAID.
Here are notes on these data preparation and model evaluation concepts:
Automated Data Preparation
- Refers to Data Preprocessing, a crucial multi-stage process in data mining.
- Transforms raw data into a useful and efficient format for analysis.
- Involves steps like data cleaning, feature extraction, data integration, data reduction, and data transformation.
- It improves data quality, accuracy, and efficiency of subsequent mining algorithms.
- Illustrated as "Forms of data preprocessing" in a figure.
Combining Data Files
- This refers to Data Integration.
- Combines data from multiple, heterogeneous sources (e.g., databases, data cubes, flat files) into a coherent data store.
- Aims to resolve schema inconsistencies, attribute naming variations, and data value conflicts.
- Helps to avoid redundancies.
Aggregate Data
- Often achieved through Data Cube Aggregation.
- Data cubes provide a multidimensional view of data by storing precomputed measures, such as count() or sum(sales).
- Example: a data cube for sales shown in Figure 3.11 and Figure 4.6.
- Facilitates OLAP operations like roll-up (generalizing data) and drill-down (specializing data).
Duplicate Removal
- A key task within Data Cleaning.
- Involves deleting redundant or irrelevant values from a dataset.
- Addresses tuple duplication as a problem during data integration.
Sampling Data
- A Data Reduction technique used to represent a large dataset by a smaller random data sample or subset.
- Common methods include (see the R sketch after this list):
- Simple Random Sampling Without Replacement (SRSWOR).
- Simple Random Sampling With Replacement (SRSWR).
- Cluster sampling.
- Stratified sampling, where data are divided into disjoint strata, and SRS is obtained from each.
- Figure 3.9 illustrates sampling methods.
- Reservoir sampling is a method for maintaining a dynamic sample from data streams.
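A base-R sketch of the simple and stratified sampling methods listed above; the population of student IDs is made up:
set.seed(1)
students <- 1:1000                               # made-up population of student IDs
srswor <- sample(students, 100)                  # SRS without replacement (SRSWOR)
srswr  <- sample(students, 100, replace = TRUE)  # SRS with replacement (SRSWR)
# Stratified sampling: a 10% SRS from each stratum (e.g., grade level)
grade  <- sample(c("A", "B", "C"), 1000, replace = TRUE)
strata <- split(students, grade)
stratified <- unlist(lapply(strata, function(s) sample(s, length(s) %/% 10)))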
Data Caching (In-Memory Data Management for Efficiency)
- While "data caching" is not explicitly named, the goal of keeping data in memory for efficiency is achieved through techniques like:
- Data cube materialization: precomputing all or parts of cuboids so they are readily available for fast query processing.
- Vertical data format (TID lists): used in algorithms like Eclat, where lists of transaction IDs are stored. While memory-intensive, it can be managed by partitioning.
- FP-Tree: compresses the database representation of frequent items, retaining association information to reduce search space and speed up mining.
- Projected databases: often constructed and scanned during recursive calls, it's crucial to maintain them in main memory to avoid disk-access costs.
- Partitioned ensembles: divide the transaction database into main-memory resident segments to reduce memory requirements and disk-access costs.
- AVC-sets: aggregate information regarding training data, stored in main memory to efficiently evaluate split criteria in decision trees.
Partitioning Data
- This term is used in several contexts:
- Clustering methods: construct k partitions (clusters) of the data, where each object typically belongs to exactly one group. Examples: k-means, k-medoids.
- Frequent Itemset Mining: a technique to improve efficiency where the transaction database is divided into n partitions.
- Discretization: a method to transform numeric variables into categorical ones by dividing the range of a numeric attribute into intervals (bins). Examples include equal-frequency (equal-depth) and equal-width partitioning (see the sketch after this list).
- Stratified Sampling: divides the data into mutually disjoint parts called strata.
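A base-R sketch of equal-width and equal-frequency partitioning (discretization); the marks vector is made up:
set.seed(2)
marks <- round(runif(20, 0, 100))            # made-up numeric attribute
# Equal-width partitioning: 4 bins of equal range
equal_width <- cut(marks, breaks = 4)
# Equal-frequency (equal-depth) partitioning: 4 bins with roughly equal counts
equal_freq <- cut(marks, breaks = quantile(marks, probs = seq(0, 1, 0.25)),
                  include.lowest = TRUE)
table(equal_width)
table(equal_freq)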
Missing Values
- A common issue in real-world data, handled as part of Data Cleaning.
- Occur due to imperfect data collection or non-applicability of information.
- Techniques for handling them include (a small R sketch of mean imputation follows this list):
- Ignoring the tuple.
- Filling in the missing value manually.
- Using a global constant (e.g., "unknown").
- Using the attribute mean or median.
- Using the attribute mean or median for all samples belonging to the same class.
- Using the most probable value (e.g., by regression or Bayes inference).
- For dependency-oriented data (like time series or spatial data), values from contextually nearby records can be used for imputation.
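A base-R sketch of mean imputation, one of the techniques listed above; the income vector is made up:
income <- c(40000, NA, 60000, 50000, NA, 85000)   # made-up attribute with missing values
# Fill missing entries with the attribute mean (ignoring NAs when computing it)
income[is.na(income)] <- mean(income, na.rm = TRUE)
income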
Model Evaluation and Deployment Introduction
- Model Evaluation: The process of assessing the generalization performance (prediction capability on unseen data) of a learning method or model.
- It involves comparing the classifier's prediction with the actual class label on a test set.
- Model Deployment: In the data science process, after analytical processing and model building, the model is put into use for decision-making.
Model Validation
- A critical part of model assessment and selection.
- The labeled data should ideally be divided into three parts: training set, validation set, and test set.
- The validation set is used for parameter tuning or model selection.
- Cross-validation is a technique where the dataset is partitioned multiple times to estimate prediction error and is used for model selection.
- Figure 3.10 illustrates information criteria for model selection.
Rule Induction Using CHAID
- CHAID (Chi-squared Automatic Interaction Detection) is a decision tree method.
- It is used for segmentation modeling; it chooses splits (which may be multiway) by testing the association between each predictor and the target with the chi-square statistic.
- Rule induction generally refers to the process of extracting IF-THEN rules from a decision tree. These rules can provide deeper insight into data contents and offer a compressed data representation.
What is Rule Induction?
Rule induction is a method used in machine learning and data mining to find easy-to-understand “if–then” rules from a dataset. These rules help explain patterns or predict outcomes.
- A rule looks like:
IF [some conditions] THEN [result or class]
For example:
IF Age = Youth AND Student = Yes THEN BuysComputer = Yes
- Two important measures for a rule:
- Accuracy: How often is the rule correct?
- Coverage: How many cases in the data does this rule apply to?
Q)🧠 What is CHAID?
CHAID stands for:
Chi-squared Automatic Interaction Detection
It’s a decision tree algorithm used in statistics and data mining to make if–then rules from a dataset. It helps you answer questions like:
“Which features (like age, income, gender) are best at predicting if someone buys a product?”
✅ Where is CHAID Used?
- Marketing (Who will buy?)
- Medical research (Who has risk of a disease?)
- Banking (Who will repay a loan?)
- Education (Which students might fail?)
🪜 How CHAID Works (Simple Idea)
- You start with a dataset that has:
- One output column (target)
- Multiple input columns (predictors)
- CHAID checks which predictor is most related to the output using a statistical test called Chi-square.
- It splits the data into groups (like a decision tree) based on the strongest relationship.
- It continues splitting until no significant improvement is found.
🔍 A Simple Example (Pen-and-Paper)
🧾 Dataset – Predicting if Someone Buys a Laptop
| Person | Age Group | Student | Buys Laptop |
| --- | --- | --- | --- |
| 1 | Young | Yes | Yes |
| 2 | Young | No | No |
| 3 | Middle | Yes | Yes |
| 4 | Middle | No | No |
| 5 | Old | No | No |
| 6 | Old | Yes | Yes |
We want to predict whether a person will buy a laptop.
🧠 Step-by-Step with CHAID
CHAID asks:
"Which column (Age Group or Student) gives the most useful split?"
To do this, it uses Chi-square test to check which predictor is more related to the target.
For now, let's assume you don’t need to calculate Chi-square — just understand the logic behind the splits.
🔹 First Check: Is Age Group related to Buys Laptop?
Let’s group the data:
| Age Group | Buys Yes | Buys No |
| --- | --- | --- |
| Young | 1 | 1 |
| Middle | 1 | 1 |
| Old | 1 | 1 |
Not very helpful! Equal numbers for Yes and No — no strong pattern.
🔹 Next Check: Is Student related to Buys Laptop?
| Student | Buys Yes | Buys No |
| --- | --- | --- |
| Yes | 3 | 0 |
| No | 0 | 3 |
Whoa! That’s a perfect split! All students bought laptops, all non-students didn’t.
CHAID selects Student as the best splitting column.
🧩 So the first rule is:
IF Student = Yes THEN Buys Laptop = Yes
IF Student = No THEN Buys Laptop = No
This is the final decision tree. No further splits needed because the groups are already perfectly separated.
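For completeness, the chi-square checks above can be reproduced with base R's chisq.test(); the data are re-typed from the table (with only 6 rows, R will warn that the approximation may be inaccurate):
buys    <- c("Yes", "No", "Yes", "No", "No", "Yes")
student <- c("Yes", "No", "Yes", "No", "No", "Yes")
age     <- c("Young", "Young", "Middle", "Middle", "Old", "Old")
chisq.test(table(age, buys))       # weak association: the p-value is large
chisq.test(table(student, buys))   # perfect split: much stronger association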
✏️ Why Use CHAID?
✅ Easy to understand
✅ Works well with categorical data (like Yes/No, Age groups, etc.)
✅ Can be used manually with small data
✅ Produces decision trees that are easy to explain
Unit V: Automating Models for Categorical and Continuous targets, Comparing and Combining Models, Evaluation Charts for Model Comparison, Deploying Model, Assessing Model Performance, Updating a Model.
We want to build a model to predict student performance.
- Categorical target: Predict whether a student will Pass or Fail.
- Continuous target: Predict the exact marks a student will score.
Categorical Targets (Classification)
- Meaning: Predicting a label that is discrete (separate categories) and unordered.
- Example: Pass / Fail, Grade A / Grade B / Grade C.
- Models:
- Decision Trees: Rules like “If study hours > 5 and attendance > 80%, then Pass.”
- IF–THEN Rules: Simple logical rules for classification.
- Mathematical Formulae: Logistic regression equations.
- Neural Networks: Learn complex patterns from data.
- Naïve Bayes: Probabilistic model based on Bayes’ theorem.
- Clustering (k-means, k-medoids): Group students into clusters (e.g., high, medium, low performers) — sometimes used before classification.
Continuous Targets (Regression)
- Meaning: Predicting a numeric value.
- Example: Predicting a student’s marks out of 100.
- Models:
- Linear Regression: Straight-line relationship between study hours and marks.
- MARS (Multivariate Adaptive Regression Splines): Handles nonlinear patterns — e.g., marks improve quickly up to 6 hours of study, then level off.
- GAM (Generalized Additive Models): Flexible models that add together effects of different factors.
Automating Models
- Meaning: Letting the computer automatically choose the best model or best set of variables.
- How:
- Try many alternative models and compare performance.
- Often used when models are of the same type but differ in which explanatory variables they use.
- All models are decision trees, but one uses study hours + attendance, another uses study hours + homework score, etc.
- The system automatically picks the one with the best accuracy.
💡 Memory Hook:
- Classification → “Which category?” (Pass/Fail)
- Regression → “What number?” (Marks)
- Automation → “Let the system pick the best model or variables.”
Q) Explain how models can be compared?
We’re building models to predict whether a student will pass or fail an exam based on study hours, attendance, and homework completion.
We have several models (e.g., Decision Tree, Logistic Regression, Neural Network) and we want to compare them or combine them.
Comparing Models
Goal: Pick the best model.
- Model Selection:
Train different models and see which predicts best.
Example: Decision Tree predicts 85% correctly, Logistic Regression 88%, Neural Network 90% → Neural Network wins.
- Statistical Tests of Significance:
Check if the difference in performance is real or just due to chance.
Example: Test if the Neural Network's 90% accuracy is significantly better than Logistic Regression's 88%.
- ROC Curves & Lift Curves:
Compare how well models separate pass vs fail.
Example: The ROC curve for the Neural Network is higher than the others → better separation.
Q) Discuss ensemble methods for combining models
Goal: Improve predictions by using multiple models together.
- Bagging (Bootstrap Aggregating):
Train the same type of model on different random samples of the data, then take a majority vote.
Example: 10 Decision Trees trained on different bootstrapped student data vote on pass/fail → reduces overfitting.
- Boosting:
Train models one after another, each focusing on mistakes made by the previous one, then combine them.
Example: First model misses students with low attendance, next model focuses on them → final "committee" is stronger.
- Random Forests:
Many Decision Trees, each trained on a random subset of data and a random subset of features.
Example: One tree uses study hours & attendance, another uses homework & attendance → reduces correlation between trees.
- Stacking:
First level: several different models make predictions.
Second level: another model learns how to best combine those predictions.
Example: Decision Tree, Logistic Regression, and Neural Network each predict pass/fail; a Meta-Model learns how to blend their outputs for the final decision.
Two Types of Ensembles
- Model-centered: Different algorithms on the same data.
Example: Decision Tree + Neural Network + Logistic Regression.
- Data-centered: Same algorithm on different subsets of data.
Example: Many Decision Trees on different bootstrapped samples.
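A minimal R sketch of one of these ensembles, a random forest, using the randomForest package; the built-in iris data stands in for the student dataset:
library(randomForest)
set.seed(3)
# 100 trees, each grown on a bootstrap sample and a random subset of features
rf <- randomForest(Species ~ ., data = iris, ntree = 100)
print(rf)                          # shows the out-of-bag error estimate
predict(rf, iris[1:3, ])           # majority vote of the 100 trees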
💡 Memory Hook:
- Comparing = “Which single player is best?”
- Combining = “Make a dream team.”
Q) Evaluation Charts for Model Comparison
1. ROC Curve shows how well a model can separate two classes (e.g., “pass” vs “fail”).
Example: Model’s ability to separate pass vs fail vs random guessing. Model catches most pass students and rarely mistakes fail students for pass.
- A curve closer to the top-left corner means better performance. If the curve is close to the diagonal line, the model is no better than random guessing.
- The AUC (Area Under Curve) is a single-number summary; higher AUC means better class separation.
The ROC curve shows us what happens at every single possible threshold! It plots two key numbers at each threshold:
- The True Positive Rate (TPR): This is the percentage of actual "pass" students that our model correctly identified as "pass." We want this to be high!
- The False Positive Rate (FPR): This is the percentage of actual "fail" students that our model incorrectly identified as "pass." We want this to be low!
So, the ROC curve is basically a graph of the TPR vs. the FPR.
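A minimal R sketch using the pROC package; the actual labels and predicted probabilities are made-up values:
library(pROC)
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)                    # 1 = pass, 0 = fail (made up)
prob   <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.55, 0.65)  # model's predicted P(pass)
roc_obj <- roc(actual, prob)   # computes TPR/FPR at every threshold
plot(roc_obj)                  # the ROC curve
auc(roc_obj)                   # single-number summary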
ROC curves and Area Under the Curve explained (video)
2. Lift Curve : Shows how much better the model is at identifying positives compared to random selection.
- Compares: Model’s success rate in finding fail students vs picking students randomly.
- Example: Lift = 3 → the model is three times as good as random at finding failures.
https://howtolearnmachinelearning.com/articles/the-lift-curve-in-machine-learning/
A Lift chart tells you how much better your model is at finding the “failures” compared to random guessing.
- If Lift = 1 → your model is no better than random.
- If Lift > 1 → your model is better than random (the higher, the better).
- If Lift < 1 → your model is worse than random.
It’s especially useful when you rank predictions and only want to act on the top portion of them (e.g., top 10% most likely cases).
Example: You have 100 students, and 10 of them are actually at risk of failing. Your model gives a score to every student.
- Rank your data: sort the students from most likely to fail (score near 1) to least likely to fail (score near 0).
- Divide into groups Split into equal‑sized buckets (often 10 groups = deciles).
- Calculate Lift for each group
- Plot it
- X‑axis: % of the population targeted (top 10%, top 20%, etc.).
- Y‑axis: Lift value.
In given example: 10% of all students in your dataset are “at risk” of failing. If you take the top 10% of students ranked by your model and find that 30% of them are actually “at risk”: Lift=30/10 = 3
Lift = (True Positives found by the model) ÷ (True Positives expected by random selection).
This means your model is 3× better than random selection at finding “at risk” students in that top slice.
3. Calibration Plot: Checks if the probabilities predicted by your model are realistic.
- Plots predicted probability vs actual probability.
- Example: If model says “80% chance of passing” for 10 students, about 8 should actually pass.
A Guide to Calibration Plots in Python – Chang Hsin Lee
4. Confusion Matrix
- Compares: Predicted labels (pass/fail) vs actual labels.
- Example:
- TP: Predicted pass & actually passed.
- TN: Predicted fail & actually failed.
- FP: Predicted pass but failed.
- FN: Predicted fail but passed.
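In R, a confusion matrix can be tabulated directly; the predicted and actual labels below are made up:
actual    <- c("pass", "pass", "fail", "fail", "pass", "fail")
predicted <- c("pass", "fail", "fail", "pass", "pass", "fail")
table(Predicted = predicted, Actual = actual)   # counts of TP, TN, FP, FN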

5. Histogram: Shows how data is spread out.
- Bars represent how many data points fall into each range. Example 1: Heights of students in a class — bars for 150–155 cm, 156–160 cm, etc.
Example 2: Most students have predicted pass probability between 0.6–0.9.
6. Boxplot: Shows spread, median, and outliers in data.
- Compares: Spread of predicted probabilities for passing vs failing students.
- Example: The box for passing students is higher than the box for failing students.
7. Scatter Plot
- Compares: Study hours vs predicted pass probability.
- Example: More hours → higher predicted probability.
8. Information Criteria Plot (AIC/BIC)
- Compares: Different models’ complexity and fit.
- Example: Model with 3 features has lowest AIC → best choice.
AIC & BIC for Selecting Regression Models: Formula, Examples
9. Fitted Curve Plot
- Compares: Model’s predicted curve vs actual pass rates.
- Example: Curve matches the real trend — more study hours = higher pass rate.

💡 Simple memory line:
All these charts are just different ways of comparing model predictions to actual truth or to a random baseline.
Q) What are the key steps and techniques used to evaluate a model's performance before deployment? (OR) How is model performance assessed?
Deploying a Model
- Meaning: After you’ve built and tested your model, you put it into the real world so it can make decisions automatically.
- Example: You’ve trained a model to predict whether a student will pass or fail. Deployment means connecting it to the school’s system so teachers can instantly see predictions for new students.
Assessing Model Performance
This is about checking how good the model really is before and after deployment.
1. Classifier Evaluation
- Measures how accurate the model is on a dataset.
- Used for comparing models, picking the best one, and tuning settings.
2. Methodological Issues — How you split your data matters:
- Training set: Used to teach the model.
- Validation set: Used to fine‑tune the model or choose between models.
- Test set: Used only at the end to check how well the final model works on unseen data.
- Example: Train on Jan–Mar data, validate on April data, test on May data.
3. Quantification Issues — Ways to measure quality:
- Accuracy: % of correct predictions.
- Cost‑sensitive accuracy: Accuracy that also considers the cost of mistakes.
- ROC curve: Shows how well the model separates classes.
- Precision: Of the students predicted to pass, how many actually passed.
- Recall (Sensitivity): Of all students who passed, how many the model caught.
- F‑measure: Balance between precision and recall.
4. Cross‑Validation
- Split the data into parts, train on some, test on the rest, and repeat several times.
- Example: In 5‑fold cross‑validation, the data is split into 5 parts; each part gets a turn as the test set.
5. Bootstrap
- A resampling method: randomly pick data points (with replacement) to create many training sets.
- Helps estimate how the model might perform on new data.
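A base-R sketch of these two resampling ideas, 5-fold cross-validation folds and one bootstrap sample; the data frame is made up:
set.seed(4)
data <- data.frame(x = rnorm(100), y = rnorm(100))   # made-up dataset
# 5-fold cross-validation: assign each row to one of 5 folds
folds <- sample(rep(1:5, length.out = nrow(data)))
for (k in 1:5) {
  train <- data[folds != k, ]   # train the model here
  test  <- data[folds == k, ]   # evaluate it here
}
# Bootstrap: resample rows with replacement to form one training set
boot_idx   <- sample(nrow(data), replace = TRUE)
boot_train <- data[boot_idx, ]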
💡 Big Picture:
You deploy the model to use it in real life, but you assess it carefully using smart data splits and performance measures to make sure it’s reliable.
Q) What are the different methods used to update a predictive model when new data becomes available?
1️⃣ Data Warehouse (DWH) Updates
- What it is: A big storage place where all your cleaned, organised data lives — like a giant library of past and current information.
- Why update: If you keep predicting based on old data, your model will be out of touch.
- Example: Every month, new exam scores, attendance records, and homework grades come in from different schools. The DWH must be refreshed so the model sees the latest patterns.
2️⃣ Prediction Model Validation with New Time Periods
- What it means: Don’t just test your model on the same time period it was trained on — test it on data from a different month or season.
- Why: To check if the model still works when conditions change.
- Example: Train the model on January–March student data, then test it on April data. If it still predicts well, it’s robust.
3️⃣ Incremental Decision Tree Induction
- Normal way: When new data arrives, you throw away the old decision tree and rebuild it from scratch.
- Incremental way: You update the existing tree with the new data, adjusting only the parts that need change.
- Example: If the model’s rule says “If study hours > 5, predict pass” but new data shows 4.5 hours is enough, the tree updates that branch without re‑learning everything.
4️⃣ Data Streams & Reservoir Sampling
- Data stream: Data that keeps coming in continuously, like a live video feed — you can’t store it all.
- Reservoir sampling: A smart way to keep a small, random, but representative sample of the stream so you can still train/update your model.
- Example: If exam results from all over the country arrive every second, you can’t save them all. Reservoir sampling keeps, say, 1,000 random results at any time, replacing old ones as new ones arrive — so your sample always reflects the latest situation.
💡 Big Picture:
Updating a model means:
- Keep your data fresh (update the warehouse).
- Test on new situations (different months).
- Update smartly (incremental learning instead of starting over).
- Handle endless data (sampling from streams).