SEMESTER-V
COURSE 14 B: PREDICTIVE AND ADVANCED ANALYTICS USING R

      Unit-I

Introduction to Data Mining: Introduction, What is Data Mining?, Concepts of Data Mining, Technologies Used, Data Mining Process, KDD Process Model, CRISP-DM, Mining on Various Kinds of Data, Applications of Data Mining, Challenges of Data Mining.

Q) Explain the stages of the data mining process / KDD process?

Ans: The knowledge discovery (KDD) process is an iterative sequence of the following steps:

1. Data cleaning: to remove noise and inconsistent data.

2. Data integration: where multiple data sources may be combined.

3. Data selection: where data relevant to the analysis task are taken from the database.

4. Data transformation: where data are transformed and consolidated into forms appropriate for mining.

5. Data mining: the process where intelligent methods are applied to extract data patterns.

6. Pattern evaluation: to identify the truly interesting patterns.

7. Knowledge presentation: where visualization and knowledge representation techniques are used to present mined knowledge to users.

 Steps 1 through 4 are different forms of data pre-processing, where data are prepared for mining.

The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base.

Q) What is data mining?

Data mining is the process of studying large sets of data to find useful information. It helps turn raw, messy data into clear knowledge that people can understand and use.

🌟 Key Points:

So, data mining turns confusing data into something smart and useful for companies, hospitals, banks, and more.

Q) Explain data mining techniques (functionalities). Or: What data mining technologies are used?

Ans: Data mining functionalities are used to specify the kinds of patterns to be mined.

Data mining tasks can be classified into two categories: descriptive and predictive.

Descriptive tasks summarize the data and describe its general properties.

Predictive tasks analyse the data and construct models that can predict the behaviour of new data.

There are several data mining functionalities. These include

  1. Characterization and discrimination
  2. Mining of frequent patterns, associations, and correlations
  3. Classification and regression
  4. Cluster analysis
  5. Outlier analysis

1. Characterization and discrimination: Data characterization is a summarization of the general characteristics or features of a target class of data. Example: summarize the characteristics of customers who spend more than Rs. 5000 a year at Modern Super Market.

Data discrimination is a comparison of the general features of the target class against the general features of one or more contrasting classes. For example: compare two groups of customers—those who buy computer products regularly and those who rarely buy such products.

2. Mining of frequent patterns, associations and correlations

Frequent patterns, as the name suggests, are patterns that occur frequently in data. Mining frequent patterns leads to the discovery of interesting associations and correlations within data.

3. Classification and regression: Classification is a supervised learning technique. The class labels of the training data are given, and a model is trained using this training data. The trained model (classifier) is then used to predict the category or class label of new data. If the model is used to predict a number (a continuous value), it is called regression.

Example: Given customer details such as age, credit rating, and whether the customer is a student, a decision tree is constructed based on the training data. This tree is then used to predict whether a customer will buy a computer (yes) or not (no). The classification model is represented below using a decision tree.

Classification model can be represented using IF-THEN rules as follows:

age(X, “youth”) AND student(X, “high”)  🡺 buys(X, “yes”)

age(X, “middle_aged”)  🡺 buys(X, “yes”)

age(X, “senior”) AND credit_rating(X, “excellent”)  🡺 buys(X, “yes”)

4. Cluster Analysis:

 

Any group of objects that belongs to the same class is known as a cluster. In data mining, cluster analysis is a way to discover similar item groups.

Cluster analysis can be performed on Modern Super Market customer data to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing


5. Outlier Analysis: An outlier is a data point that is very different from other points. It is unusual and may be an error or a rare event.

An outlier is a data object that deviates significantly from the rest of the objects. The analysis of outlier data is referred to as outlier analysis or anomaly mining. Outlier analysis tries to find unusual patterns in any dataset. Outlier detection is important in many applications in addition to fraud detection such as medical care, public safety and security, industry damage detection, image processing, sensor/video network surveillance, and intrusion detection.

Q) MINING ON VARIOUS KINDS OF DATA

The data is categorized into fundamental and more complex types:

Time Series Data: It refers to data collected over a period of time, such as stock prices or weather conditions, allowing for analysis of patterns and trends.

Biological Data: It contains information related to living organisms, such as genetic sequences or physiological measurements. This data aids research in areas like genetics, medicine, and ecology.

Spatial Data: Spatial data contains information about physical locations. It helps in the analysis and visualisation of geographic patterns, such as maps or satellite imagery.

Social Network Data: It involves data about individuals and their relationships in a social network, offering insights into social interactions, influence, and community structures.

It's important to note that multiple types of data often coexist in real-world applications (e.g., web mining involves text, multimedia, and graph data). Mining these multiple sources can lead to richer findings but also presents challenges in data cleaning and integration.

Q)📊 Applications of Data Mining
Data mining is used in many areas to find useful patterns and help make better decisions. It turns huge amounts of data into meaningful information.

🏪 1. Business & Commerce
- Retail & Marketing: find what products are bought together; plan ads and store layout.
- Banking & Finance: detect fraud, check loan risk, group customers.
- Telecom: find customers likely to leave; analyze call usage.

🔬 2. Science & Engineering
- Medicine & Bioinformatics: predict diseases; study DNA and discover drugs.
- Computer Engineering: detect bugs or attacks; improve system performance.
- Environmental Science: predict climate; link geography and poverty.

🌐 3. Web & Internet
- Search Engines: improve search; show related ads or trending searches.
- Recommender Systems: suggest products or shows (like Netflix, Amazon).
- Text Mining: analyze opinions from reviews; group articles.

🛡️ 4. Security & Society
- Crime & Fraud Prevention: spot fraud in banking and insurance; help crime detection.
- Hidden Data Mining (Social Impact): happens silently while shopping or browsing.

Q) Challenges in Data Mining

Data mining helps us discover useful information from large datasets. But it also faces several difficulties. These problems come from the nature of the data, the technical methods used, user interaction, and social concerns. The challenges are listed below

🧹 Data Quality Problems: data may have mistakes or missing parts; merging data from many sources causes confusion; it is hard to find real patterns due to noise or outliers.

⚙️ Scalability & Efficiency: too much data to process; too many features (columns); live (streaming) data is hard to handle; some algorithms are slow and heavy.

🧬 Variety in Data Types: data comes in many forms (text, image, video); some data, like networks, is complex to analyze.

🧪 Mining & Evaluation Issues: too many patterns are found and only a few are useful; "interesting" patterns are hard to define; some methods give only partial results.

👩‍💻 User Involvement: users want to change or explore during mining; complex models are hard to understand; good charts or visuals are needed.

🔐 Privacy & Social Issues: mining may break personal privacy; it can be used in bad ways; it is often done secretly, without users knowing.

Q)  CRISP-DM (Cross Industry Standard Process for Data Mining)

CRISP-DM is a common method used to do data mining projects in a step-by-step way. It helps people work with data clearly and effectively, in any industry or with any tool.

✅ Main Features of CRISP-DM

🔢 Six Steps of CRISP-DM

  1. Business Understanding – Know the goal of the project from a business point of view.
  2. Data Understanding – Collect data and check its quality.
  3. Data Preparation – Make the data ready for analysis (cleaning, formatting).
  4. Modeling – Use data mining techniques to build a model.
  5. Evaluation – Check if the model gives good results.
  6. Deployment – Use the model to help in real-world decision making.

Q) Differentiate between CRISP-DM and data mining

DM is what you do. CRISP-DM is how you do it.

🔍 CRISP-DM vs. DM: Key Differences

- Definition: CRISP-DM is a structured methodology for DM projects; DM is the general process of extracting insights from data.
- Purpose: CRISP-DM guides teams through DM projects step by step; DM discovers patterns and knowledge from data.
- Phases: CRISP-DM has 6 defined stages (Business Understanding to Deployment); DM has no fixed structure, it depends on the approach.
- Flexibility: CRISP-DM is highly adaptable across industries; DM is flexible but can be chaotic without a method.
- Tool Dependence: CRISP-DM is tool-agnostic (works with any tool); DM may depend on specific tools or algorithms.
- Project Management: CRISP-DM includes planning, evaluation, and real-world use; DM is often focused only on modeling and analysis.

🧭 Why CRISP-DM Matters

CRISP-DM brings clarity, repeatability, and structure to data mining. It’s like having a GPS for your data journey—especially useful when working in teams or across industries


Unit II: Data Understanding and Preparation Introduction, Reading data from various sources, Data visualization, Distributions and summary statistics, Relationships among variables, Extent of Missing Data. Segmentation, Outlier detection

Q)  Data Understanding and Preparation – Introduction

In Data Science, it is very important to first understand your data and then prepare it. Data understanding helps us know what the data is about. Data preparation makes the data ready for analysis. In R, we use different tools and functions for these steps.

Q1. Data Understanding (Getting to Know Your Data)

Before you work with data, you need to understand it well. This means knowing what kind of data you have, what values it contains, and how these values are spread out.

a. Types of Data (Attributes): Data is made of "attributes" (also called variables or features). We can classify attributes into different types:

b. Basic Statistical Measures: To understand the data's values, we often look at statistical measures.

R Code Example:
 data_sample <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

range_val <- max(data_sample) - min(data_sample) # Calculate range

iqr_val <- IQR(data_sample)                       # Calculate IQR

variance_val <- var(data_sample)                  # Calculate variance

sd_val <- sd(data_sample)                         # Calculate standard deviation

c. Data Visualization: Pictures (graphs) help us see patterns and problems in data more easily.

R Code Example (using iris dataset, which is preloaded in R):
 # Histogram for Sepal.Length

hist(iris$Sepal.Length, main="Histogram of Sepal Length", xlab="Sepal Length")

# Boxplot for Sepal.Length by Species

boxplot(Sepal.Length ~ Species, data=iris, ylab="Sepal Length")

# Scatter plot of Sepal.Length vs Sepal.Width

plot(iris$Sepal.Length, iris$Sepal.Width,  col=iris$Species) # Color points by species

Q2. Data Preparation (Preprocessing)

Data preparation is the most important step in data mining. It transforms raw data into a clean and useful format. This step takes a lot of time in a data science project.

The main steps in data preparation are:

a. Data Cleaning: Real-world data often has problems like missing information, errors, or inconsistencies. Data cleaning fixes these problems.

b. Data Integration: This step combines data from many different sources into one dataset. It is important to handle different ways data is named or stored (e.g., different units like meters vs. feet).

c. Data Reduction: Large datasets can be hard to work with. Data reduction makes the dataset smaller, but still keeps important information.

d. Data Transformation: This changes data into a suitable format for mining.

Q) Reading Data from Various Sources

In data science, we must collect data before we can analyze it. Data can come from different places (sources) and in different formats. We must know how to read and load this data into our tools (like R, Python, Excel).

🔹 Types of Data Sources

- Flat Files: text files with rows and columns (.csv, .txt, .tsv). R: read.csv("file.csv"), read.table("file.txt")
- Excel Files: spreadsheet format (.xls, .xlsx). R: readxl::read_excel("file.xlsx")
- Databases: organized storage with SQL access (MySQL, Oracle). R: DBI::dbReadTable(con, "table_name") after a DB connection
- Web Data: data from websites or APIs (HTML tables, weather APIs). R: readLines("https://example.com"), jsonlite::fromJSON("url")
- Cloud Storage: online data storage services (Google Drive, AWS S3). R: use googledrive or aws.s3, or download to local and use read.csv()
- Sensor/Live Data: streaming data from devices or logs (GPS, logs, IoT streams). R: readLines("sensor_log.txt"), scan() for simple text reading

Q) Data Visualization

Data Visualization means showing data in pictures (charts/graphs), so that we can understand it easily.

🔹 Why is it useful?

🔹 Common Types of Charts

- Bar Chart: to compare categories (e.g., number of students by class)
- Pie Chart: to show parts of a whole (e.g., percentage of sales)
- Histogram: to show the distribution of numbers (e.g., ages of people)
- Line Chart: to see trends over time (e.g., monthly sales)
- Box Plot: to show the spread of values and outliers
- Scatter Plot: to see relationships between two variables

The plot() function in R is used to create a line graph.

Syntax

The basic syntax to create a line chart in R is:

plot(v, type, col, xlab, ylab)

where v is the vector of values to plot, type = "l" draws a line, col sets the colour, and xlab and ylab label the axes.

Program:

marks=c(15,22,35,55,45,65)

plot(marks, type="l", col="Blue")

Output: a blue line chart of the marks values.

Boxplots are created in R by using the boxplot() function.

The basic syntax to create a boxplot in R is −

boxplot(x, data, notch, varwidth, names, main)

Example:

data<- c(1,2,3,4,5)

boxplot(data)

In R the pie chart is created using the pie() function

Syntax

The basic syntax for creating a pie-chart using the R is −

pie(x, labels, radius, main, col, clockwise)

Example:

x <- c(50,40,10)

labels <- c("mpcs","mscs","dscs")

# Plot the chart.

pie(x,labels,col = rainbow(length(x)))

The basic syntax for creating a histogram using R is −

hist(v,main,xlab,xlim,ylim,breaks,col,border)

Q) Distributions: Key Concepts and Applications

Distributions are fundamental for understanding how data values are spread out. A probability distribution describes the "law governing a random variable" from which observed data originated, indicating the likelihood of different values being observed.

Core Concepts:

Key Types of Distributions:

  1.  You want to understand the height of your students. You measure them and plot the number of students with each height on a graph. You notice most of them are around 165–170 cm, with fewer students being very short or very tall. The graph looks like a hill or bell.

That’s a normal distribution — one of the most famous distributions.

  2. You toss a coin 10 times and count how many times you get heads. You repeat this many times and record the results. Most of the time you’ll get around 5 heads, sometimes 4 or 6, and rarely 0 or 10. That’s a binomial distribution.

👉 Example Problem (Binomial Distribution): A coin is tossed 10 times. What is the probability of getting exactly 6 heads?

Solution: Use binomial formula: P(X=6) = C(10,6) * (0.5)^6 * (0.5)^4 = 210 * 0.015625 * 0.0625 = approx. 0.205
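This calculation can be checked directly in R with the built-in binomial functions (a quick sketch; dbinom() and pbinom() are base R):

# P(exactly 6 heads in 10 fair coin tosses)
dbinom(6, size = 10, prob = 0.5)          # ≈ 0.205, matching the formula above
# P(6 or more heads), using the cumulative distribution
1 - pbinom(5, size = 10, prob = 0.5)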

  3. If you toss the coin just once, it’s either head or tail (1 or 0). That’s a Bernoulli distribution, the simplest one.

  4. Imagine customers arriving at a bank. You count how many come in each hour. The number varies, but usually it’s 3 to 5 per hour. This kind of count data is often modeled using the Poisson distribution.

👉 Example Problem (Poisson Distribution):On average, 4 customers visit a bank per hour. What is the probability that exactly 2 customers arrive in the next hour?
✅ Solution: Use Poisson formula:
P(X=2) = (e^-4 * 4^2) / 2! = (0.0183 * 16) / 2 = 0.146
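The same Poisson probability can be verified in R as a quick check (base R function):

# P(exactly 2 arrivals when the average rate is 4 per hour)
dpois(2, lambda = 4)                      # ≈ 0.146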

  5. Let’s say you conduct a survey expecting 20 students to prefer chocolate over vanilla ice cream, but only 14 do. You want to know: is this difference by chance, or is it statistically significant? You use the Chi-squared distribution to test this.

👉 Example Problem (Chi-squared Distribution): Expected: 20 students like chocolate, 10 like vanilla. Observed: 14 like chocolate, 16 like vanilla.

Solution: Use Chi-squared formula: χ² = Σ((Observed - Expected)² / Expected)
χ² = (14-20)²/20 + (16-10)²/10 = (36/20) + (36/10) = 1.8 + 3.6 = 5.4

The critical value comes from the Chi-squared distribution table, which tells you the threshold beyond which your result is statistically significant. df = 1 and Significance level (α) = 0.05 You look up the value in the Chi-squared table, and you’ll find: Critical value ≈ 3.841 The difference is big enough (χ² = 5.4 > 3.841) that it’s unlikely to be due to chance.

So, you conclude: student preferences are different than expected—maybe chocolate isn’t as popular as you thought!

There are two types of Chi-squared tests, and df is calculated differently depending on which one you're using:

1. Goodness-of-Fit Test (comparing observed vs expected counts in one categorical variable): df = number of categories - 1

2. Test of Independence (e.g., contingency tables)

df = (number of rows − 1) × (number of columns − 1)
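As a quick check of the worked chocolate/vanilla example above, the statistic, critical value, and p-value can all be computed in base R (a minimal sketch):

observed <- c(14, 16)        # observed counts: chocolate, vanilla
expected <- c(20, 10)        # expected counts
chi_sq <- sum((observed - expected)^2 / expected)
chi_sq                                        # 5.4
qchisq(0.95, df = 1)                          # critical value ≈ 3.841
pchisq(chi_sq, df = 1, lower.tail = FALSE)    # p-value ≈ 0.02, so the difference is significant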

  6. The Inverse Gaussian distribution is used when modeling time until an event happens but assumes that the event rate changes over time (often used in advanced reliability and survival analysis).

👉 Example Problem (Inverse Gaussian Distribution): Suppose a machine has an average lifetime of 1000 hours. What is the chance it will last more than 1200 hours?

✅ Solution (conceptual): Use the inverse Gaussian cumulative distribution function or software like R (for example, pinvgauss() from the statmod package): 1 - pinvgauss(1200, mean = 1000, dispersion = 500)

  7. Now imagine you are tracking how long it takes between customer arrivals at a bank. You find that the waiting time is not always regular. You can model this kind of waiting time using the Gamma distribution.

👉 Example Problem (Gamma Distribution): If the average time between arrivals is 3 minutes, and you want the probability that a customer arrives within 5 minutes, use the gamma probability function with shape and rate parameters.

✅ Solution (conceptual): You plug values into the gamma formula or use R’s pgamma() function: pgamma(5, shape, rate) to get the probability.

⚙️ Where Are Distributions Used?

Visualising Data Distributions: Visual tools are crucial for inspecting and understanding data distributions:

Outliers are data points that fall far outside the typical range. Specifically:

📌 Rule of Thumb:

Any data point below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is considered an outlier.

✅ 1. Quantile Plot – Detecting Unusual Values (Outliers)

📘 Example Scenario:

You are analyzing monthly expenses of 20 students in a college. Most spend between ₹5,000 to ₹7,000. But a few spend above ₹10,000.

You draw a quantile plot:

🧠 What You Learn:

🎯 Use in Analytics:

These outliers can be removed or treated before building a prediction model (like predicting spending based on background).

# Monthly expenses of 20 students

expenses <- c(5200, 5300, 5400, 5500, 5600, 5800, 5900, 6000,

              6100, 6200, 6400, 6500, 6600, 6800, 7000,

              10500, 11000, 11500, 12000, 13000)

# Quantile plot

plot(sort(expenses), ppoints(length(expenses)), xlab = "Monthly expense (Rs.)", ylab = "Percentile")   # quantile plot

✅ 2. Q–Q Plot – Checking if Data is Normal

📘 Example Scenario:

You want to use linear regression to predict students' marks based on their study hours. This method assumes the data is normal.

You draw a Q–Q plot for marks:

🧠 What You Learn:

🎯 Use in Analytics:

This helps you choose the right model or prepare data better for accurate predictions.

# Marks of 20 students

marks <- c(45, 50, 55, 60, 65, 70, 72, 73, 74, 75,

           76, 77, 78, 80, 82, 85, 88, 90, 92, 95)

# Q-Q plot against normal distribution

qqnorm(marks, main = "Q–Q Plot of Student Marks")

qqline(marks, col = "red")  # Add reference line

The theoretical quantiles on the x-axis of a Q–Q plot are standard scores (z-scores) showing where a value would lie on a normal distribution curve: -2 (≈ 2.5th percentile), -1 (≈ 16th percentile), 0 (50th percentile or median), +1 (≈ 84th percentile), +2 (≈ 97.5th percentile).

✅ 3. Density Plot – Comparing Two Groups

📘 Example Scenario:

You want to compare the exam scores of two classes — Class A and Class B.

You draw density plots for both classes:

🧠 What You Learn:

🎯 Use in Analytics:

This analysis helps teachers know which class needs attention or personalized coaching.

# Scores of Class A and Class B

classA <- c(70, 72, 74, 75, 76, 77, 78, 78, 79, 80)

classB <- c(50, 55, 60, 65, 70, 75, 80, 85, 90, 95)

# Plot density curves

plot(density(classA), col = "blue", lwd = 2, ylim = c(0, 0.15))   # ylim chosen so both curves fit

lines(density(classB), col = "green", lwd = 2)

Class A has a narrow peak — consistent scores.

Class B is spread out — more variation.


Q) Summary Statistics: Describing Your Data

Summary statistics are numerical measures that provide concise descriptions of data features, particularly distributions. They are a fundamental part of Exploratory Data Analysis (EDA), helping to generate questions about data and identify properties like noise or outliers. Statistical data descriptions are useful to grasp data trends and identify anomalies.

Key Categories of Summary Statistics:

  1. Measures of Central Tendency: These indicate the middle or center of a data distribution.

  2. Measures of Dispersion (Spread): These indicate how spread out the data values are.

The Five-Number Summary:

Other Statistical Measures (for relationships):

Using R for Summary Statistics: R is a powerful tool for calculating these measures:
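For example, a minimal sketch on the built-in iris data (all functions shown are base R):

data(iris)
summary(iris$Sepal.Length)     # min, Q1, median, mean, Q3, max in one call
mean(iris$Sepal.Length)        # central tendency
median(iris$Sepal.Length)
quantile(iris$Sepal.Length)    # quartiles (five-number summary)
var(iris$Sepal.Length)         # dispersion
sd(iris$Sepal.Length)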

These summary statistics are crucial for initial data inspection and understanding the overall behaviour and properties of your data.

Q) short note on Distributions and Summary Statistics

🔹 What is a Distribution?

A distribution tells us how the values in a dataset are spread out.

🔹 Summary Statistics

These are numbers that tell us about the data in a short and simple way.

- Mean (Average): total divided by the number of items
- Median: middle value (when sorted)
- Mode: most repeated value
- Range: difference between the highest and lowest values
- Standard Deviation: how spread out the data is
- Variance: square of the standard deviation
- Min/Max: smallest and biggest values
- Quartiles: divide the data into 4 equal parts

🔹 Example:

Marks: 40, 50, 50, 60, 70 → mean = 54, median = 50, mode = 50, range = 30

Distributions show how data values are spread.
Summary statistics give short numerical info like average, median, etc.

Q) Relationships Among Variables

Understanding relationships among variables is a core task in data analysis, allowing us to find significant connections and patterns within data. This helps in extracting knowledge from data to solve business problems.

Variables can be broadly categorised into quantitative (numeric) and qualitative (categorical) types, and the methods for studying relationships vary depending on these types.

I. Measuring Relationships Between Quantitative Variables

For quantitative variables, which are numerical and can be measured on a scale (e.g., age, salary), we primarily look at:

  1. Correlation Coefficient (Pearson)

R Code Snippet:

# View correlation between Sepal.Length and Petal.Length

print(cor(iris$Sepal.Length, iris$Petal.Length))

✅ You’ll get a value close to +0.87 — meaning a strong positive relation.

Covariance: This is a related measure that indicates how two variables change together, but it is not standardised like correlation.

  2. Visualising Quantitative Relationships: Scatter Plots

R Code Snippets:
data(iris)

plot(iris$Sepal.Length, iris$Petal.Length)

3. Scatterplot Matrix: A scatterplot matrix shows scatter plots between every pair of numeric variables in a dataset — all in one grid. It helps you quickly see which variables are related.

🧪 Example: mtcars is a built-in R dataset about 32 car models.
mpg: miles per gallon (fuel efficiency)
disp: displacement (engine size)
hp: horsepower (engine power; higher means a faster car)
wt: weight (in 1000 lbs; heavier cars have higher numbers)

📊 R Code:
data(mtcars)
pairs(mtcars[, c("mpg", "disp", "hp", "wt")])

🧠 Interpretation:

🎯 Use in Predictive Analytics:

 4. Local Regression (LOESS / LOWESS)

Local regression fits a smooth curve to your data — not a straight line.
It’s great when the relationship between variables is not linear (i.e., not a straight line).

🧪 Example:

Let’s say you want to study the relationship between Petal.Length and Petal.Width in the iris dataset, but it’s not perfectly linear.

📊 R Code:
plot(iris$Petal.Length, iris$Petal.Width,   pch = 19, col = "blue")

# Add LOESS smooth curve

lines(lowess(iris$Petal.Length, iris$Petal.Width), col = "red", lwd = 2)

🧠 Interpretation:

🎯 Use in Predictive Analytics:

- Scatterplot Matrix: visualize relationships among variables; R function pairs(); used for feature selection and EDA.
- Local Regression: fit smooth (nonlinear) curves; R functions lowess() or loess(); used for smoothing and nonlinear modeling.

II. Measuring Relationships Between Qualitative/Categorical Variables

For qualitative or categorical variables, which represent categories or labels (e.g., gender, marital status), different methods are used:

  1. Chi-squared (χ²) Test: The Chi-squared test helps us find out if two categorical (qualitative) variables are related. Example: you run a bookstore and want to check if gender affects book preference:

Gender    Book Type     Count
Male      Fiction       30
Male      Nonfiction    20
Female    Fiction       10
Female    Nonfiction    40

book_data <- matrix(c(30, 20, 10, 40), nrow = 2, byrow = TRUE)

colnames(book_data) <- c("Fiction", "Nonfiction")

rownames(book_data) <- c("Male", "Female")

# Perform Chi-squared Test

chisq.test(book_data)

✅ Output Insight: X-squared = 15.042, df = 1, p-value = 0.0001052. Since the p-value is far below 0.05, gender and book preference are related.

  2. Association Rules
    - What they are: "IF-THEN" statements that describe relationships between items in a dataset, commonly used in "market basket analysis" to find frequently co-occurring products.
    - Key measures:
      - Support: how often the items in the rule appear together in the dataset.
      - Confidence: how often the "THEN" part of the rule is true when the "IF" part is true.
    - Example: "IF a customer buys bread AND butter THEN they also buy milk".

  3. Graphical Models
    Graphical models show connections between multiple variables using nodes and edges:

✅ Extent of Missing Data – Exam Notes (Simple English)

🔹 What is Missing Data?

Sometimes, some values in a dataset are empty or not available. This is called missing data.

🔹 Why Does Data Go Missing?

- ❌ Not recorded properly: a student forgot to write their age on a form
- 📂 Data lost during transfer: a file was corrupted while saving
- 👨‍💼 Person refused to answer: a patient did not share income details
- 🔍 System error or bug: an app failed to collect the GPS location

🔹 Extent of Missing Data

🧮 Example: if a column has 100 values and 10 are missing, then 10% of the data is missing (the extent of missing data is 10%).

🔹 Why is Missing Data a Problem?

- 📉 Reduces accuracy: wrong results in analysis or models
- 🚫 Some methods cannot work: some tools need complete data
- 💡 May hide important patterns: we may miss useful relationships

🔹 What to Do with Missing Data?

- ❌ Delete rows: remove rows that have missing values
- 📥 Fill with average/median: replace missing values with the average (or median) value
- 🔁 Predict missing values: use machine learning to guess the value
- 🚫 Leave as is (carefully): sometimes okay if only a few values are missing
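A minimal R sketch of these options on a small made-up vector containing missing values (NA):

x <- c(23, NA, 31, 27, NA, 40)            # toy data with two missing values
sum(is.na(x))                             # how many values are missing
mean(is.na(x)) * 100                      # extent of missing data (percentage)
x_drop <- na.omit(x)                      # option 1: delete the missing entries
x_fill <- x
x_fill[is.na(x_fill)] <- mean(x, na.rm = TRUE)   # option 2: fill with the average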


Q) Cluster Analysis (Segmentation)

Cluster analysis is a powerful technique used in data mining to group data items that are similar to each other. Imagine you have a large collection of items, but you don't have any pre-defined categories or labels for them. Cluster analysis helps you to discover natural groupings or hidden patterns within this data.

The main idea is to make sure that objects in the same cluster are very similar to each other, while objects in different clusters are very different from each other.

This process is often called data segmentation because it effectively divides a large set of data into smaller, more manageable parts or segments.

How Do We Find Clusters? (Common Methods)

There are several ways to perform cluster analysis, each with its own approach:

  1. Partitioning Methods (e.g., k-Means)
  2. Hierarchical Methods
  3. Density-Based Methods (e.g., DBSCAN)

Example: Iris Dataset

Let's imagine the Iris dataset, which is a famous collection of measurements (like sepal length, sepal width, petal length, and petal width) for 150 iris flowers. These flowers actually belong to three known species (setosa, versicolor, and virginica).
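A minimal k-means sketch on the iris measurements (kmeans() is base R; choosing 3 clusters simply mirrors the three known species):

data(iris)
set.seed(42)                                    # for a reproducible result
km <- kmeans(iris[, 1:4], centers = 3)          # cluster the 4 measurements into 3 groups
table(km$cluster, iris$Species)                 # compare found clusters with the true species
plot(iris$Petal.Length, iris$Petal.Width, col = km$cluster)   # visualise the segments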

Why is Cluster Analysis Important? (Applications)

Cluster analysis (segmentation) is used in many different areas:

Q) short note on Outlier Detection

🔹 Why Detect Outliers?

- 🚨 To find errors: a person with age = 200 (not possible)
- 🕵️‍♂️ To catch fraud: a credit card used in 3 countries in 1 hour
- 📊 To improve accuracy: wrong data affects model results

🔹 How to Detect Outliers?

- Box Plot: shows outliers as dots outside the box
- Z-Score: if a value is far from the average (mean), it is an outlier
- IQR Method: if a value is far outside the range Q1 to Q3, it is an outlier
- Scatter Plot: points that are far away from the others

🔹 Example:

Marks: 45, 50, 55, 60, 150
 → 150 is an outlier, because it is too high compared to others.

Outlier = A value that is far away or unusual.
Detecting outliers helps improve data quality and detect fraud.
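A minimal R sketch of the IQR rule applied to the marks example above (base R only):

marks <- c(45, 50, 55, 60, 150)
q1 <- quantile(marks, 0.25)
q3 <- quantile(marks, 0.75)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
marks[marks < lower | marks > upper]     # flags 150 as an outlier
boxplot(marks)                           # the outlier appears as a dot outside the box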


Q) Outlier Detection

Outlier detection is about finding data points that are significantly different from the majority of other data in a dataset. These unusual points are also known as anomalies. An outlier represents something that does not conform to the expected pattern of the data. For example, a credit card transaction that is much larger than a customer's usual spending might be flagged as an outlier, potentially indicating fraud.

It is important to differentiate outliers from 'noise'. While noise refers to random errors or irrelevant data that one usually aims to clean from a dataset, outliers often carry valuable information about unusual events or behaviours.

What are the Different Kinds of Outliers?

Outliers can be categorised into three main types:

  1. Global Outliers:

📌 What it means:
 A data point is very different from all other data, no matter the situation or context.

Simple Example:
 Most people finish a test in 30 to 60 minutes, but one person takes 5 hours. That is clearly unusual: it is a global outlier.

  2. Contextual Outliers:

📌 What it means:
 A data point seems normal in general, but becomes unusual in a specific context or condition.

Simple Example:
 A temperature of 30°C is normal in summer, but the same reading in the middle of winter would be a contextual outlier.
  3. Collective Outliers

📌 What it means:
 A group of values together is unusual, even if each value alone doesn’t seem odd.

Simple Example:
 One student getting a low grade on a quiz isn’t strange. But if an entire class of students suddenly scores very low on the same quiz, that’s a collective outlier. It may suggest a problem with the quiz or something else that affected everyone.

How are Outliers Found?

There are several primary approaches to detecting outliers:

  1. Statistical Methods: These methods look at the overall pattern of the data. Most of the time, we assume the data follows a common shape, like the bell curve (normal distribution), and flag points that fall far outside it.
  2. Proximity-Based Methods: If something is far away from its neighbors, it is probably an outlier. The Local Outlier Factor (LOF) checks how close a data point is to its neighbors; if it lies in a low-density area, it is an outlier.
  3. Clustering-Based Methods: Normal points form big groups (clusters). Use clustering algorithms (like k-Means or DBSCAN) to find the groups; a point that does not belong to any group, or belongs to a tiny group, is flagged as an outlier.
  4. Classification-Based Methods: If you already know what "normal" and "abnormal" look like, you can train a model to detect new outliers. 📌 Real method: train a model on normal data only (e.g., only healthy patients); when it sees something that does not fit the pattern (e.g., an unusual health reading), it marks it as an outlier.

Unit III: Model development & techniques Data Partitioning, Model selection, Model Development Techniques, Neural networks, Decision trees, Logistic regression, Discriminant analysis, Support vector machine, Bayesian Networks, Linear Regression, Cox Regression, Association rules.

Model development is like creating a "smart program" or a "mathematical model" that learns from existing data to make predictions or find patterns in new, unseen data. This process helps translate a real-world problem into something a computer can solve, and then turn the computer's answers back into useful solutions.

Q) Data Partitioning (Splitting Data)

When developing a model, it is crucial to test its performance on data it has not seen before. This is like a student studying for an exam: you want to see if they can answer new questions, not just the ones they memorized. This is why data is split into different parts.

A typical way to split data is 50% for training, 25% for validation, and 25% for testing.
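A minimal sketch of such a 50/25/25 split in base R, using the built-in iris data purely for illustration:

data(iris)
set.seed(123)                                          # reproducible split
idx <- sample(c("train", "valid", "test"), size = nrow(iris),
              replace = TRUE, prob = c(0.50, 0.25, 0.25))
train_set <- iris[idx == "train", ]                    # used to fit the model
valid_set <- iris[idx == "valid", ]                    # used to tune and compare models
test_set  <- iris[idx == "test", ]                     # used only for the final check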

Q) Model Selection (Choosing the Best Model)

Once you have built different models, you need to decide which one is the "best" for your specific problem.

  1. Evaluation on Test Set: The primary way to select a model is to evaluate its performance (e.g., accuracy for classification) on the test set. A model that performs well on the test set is expected to perform well on real-world, new data.

2. Cross-Validation

3. Bootstrap

4. Information Criteria (AIC, BIC)

Sometimes, two models perform almost equally well on test data. Should we always pick the more complex one?
Not necessarily! Complex models may overfit.

🔑 In simple English: AIC = 2k - 2 ln(L), where k is the number of estimated parameters (coefficients, weights, or nodes) and L is the model's likelihood; a lower AIC (or BIC) is better. Here k = 5 means, for example, that your model has 5 coefficients, weights, or nodes.
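A minimal sketch of comparing two regression models by AIC and BIC in base R (mtcars is used only for illustration):

data(mtcars)
m1 <- lm(mpg ~ wt, data = mtcars)               # simpler model (fewer parameters)
m2 <- lm(mpg ~ wt + hp + disp, data = mtcars)   # more complex model
AIC(m1, m2)                                     # lower AIC is preferred
BIC(m1, m2)                                     # BIC penalizes extra parameters more heavily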

Q) Association Rules

Association Rules are like "if-then" statements that describe relationships between different items or events in a large collection of data. They tell us that if one thing happens (the "if" part or condition), then another thing is likely to happen (the "then" part or consequence).

You can think of it like this: Condition ⇒ Consequence.

For example, in a computer store, an association rule might be: If a customer buys Computer, then they also buy Anti-Virus.

To find and understand these rules, we use some key ideas:

  1. Support: This tells how common the rule is across all transactions. For example, for the rule Computer ⇒ Anti-virus, the support is the fraction of all transactions that contain both Computer and Anti-virus. Itemsets with high support are called frequent itemsets.

  2. Confidence: This tells how often anti-virus is bought when a computer is bought. In other words, "given that someone bought a computer, how often did they also buy anti-virus?" It shows the "strength" of the rule.

  3. Lift: This measures how much more likely the "consequence" is when the "condition" is met, compared to how often the consequence occurs overall (lift = confidence of the rule ÷ support of the consequence). A lift greater than 1 indicates a positive association.

How are Association Rules Found?

Finding association rules is usually a two-step process:

  1. Find Frequent Itemsets: First, identify all groups of items that appear together very often, based on a minimum "support" level.
  2. Generate Rules: Then, use these frequent itemsets to create rules that meet a minimum "confidence" level.
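Both steps can be run with the apriori() function from the arules package (assuming the package is installed); the tiny basket list below is made up, and the support/confidence thresholds are arbitrary:

# install.packages("arules")   # if the package is not installed
library(arules)
baskets <- list(
  c("computer", "anti-virus"),
  c("computer", "anti-virus", "mouse"),
  c("computer", "mouse"),
  c("printer", "paper")
)
trans <- as(baskets, "transactions")                  # convert baskets to transaction format
rules <- apriori(trans, parameter = list(supp = 0.4, conf = 0.6))
inspect(rules)                                        # each rule with its support, confidence, lift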

Common algorithms for finding these rules include Apriori (used in the sketch above) and FP-Growth.

Why are Association Rules Useful? (Applications)

Association rules are used in many real-world situations, such as market basket analysis, product recommendations and cross-selling, store layout and catalogue design, and web usage analysis.

Q) Cox Regression

Cox Regression is a type of statistical model used when you want to predict the time until an event happens. Think of it like trying to predict how long something will last before a specific event occurs.
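For example, how long a patient survives after treatment, or how long a machine runs before it fails. A minimal sketch using the coxph() function from the survival package and its built-in lung dataset (survival times of lung cancer patients), assuming the package is installed:

library(survival)
# lung: built-in dataset; time = survival time, status = event indicator
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)      # hazard ratios show how age and sex change the risk of the event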

Q) Explain in detail about linear regression and multiple linear regression.

Regression analysis is a very widely used statistical tool.

It is used to establish a relationship model between two variables.

One variable is called a dependent or response variable whose value must be predicted.

The other variable is called the independent or predictor variable, whose value is known.

In Linear Regression these two variables are related through an equation.

Mathematically a linear relationship represents a straight line.

A non-linear relationship creates a curve.

The general mathematical equation for a linear regression is −

y = ax + b

Following is the description of the parameters used −

y is the dependent variable.

x is the independent variable.

a and b are constants which are called the coefficients.

Steps to Establish a Regression

A simple example of regression is to predict the weight of a person when his height is known. To do this we need to have the relationship between height and weight of a person.

Here y is weight and x is height.

The steps to create the relationship are:

  1. Gather the height and weight of a few people.
  2. Create a relationship model using the lm() functions in R.
  3. Find the coefficients from the model 
  4. Know the average error in prediction. Also called residuals.
  5. Use the predict() function to predict the weight of new persons.

For example:

heightx <- c(1,2,3)

weighty <- c(1,3,4)

relation <- lm(weighty~heightx)    # Apply the lm() function.

print(relation)

When we execute the above code, it gives the values of a and b as coefficients:

b = -0.3333      a = 1.5000

Hence the line equation is y = 1.5x - 0.33.

Multiple regression is an extension of linear regression.

It finds relationships between more than two variables. In simple linear relation we have one independent and one dependent variable, but in multiple regression we have more than one independent variable and one dependent variable.

The general mathematical equation for multiple regression is −

y = a1x1+a2x2+...+b

Following is the description of the parameters used −

y is the response variable.

b, a1, a2...an are the coefficients.

x1, x2, ...xn are the predictor variables.

We create the regression model using the lm() function in R. The model determines the value of the coefficients using the input data. Next we can predict the value of the response variable for a given set of predictor variables using these coefficients.

lm() Function

This function creates the relationship model between the predictor and the response variable.

Syntax

The basic syntax for lm() function in multiple regression is −

lm(weighty ~ heightx+agex)

heightx = c(1,2,3)

weighty=c(1,3,4)

agex=c(0,3,4)

relation=lm(weighty~heightx+agex)

print(relation)

newdata = data.frame(heightx=2.5,agex=3)

predict(relation,newdata)

Q) Explain Feed forward neural networks

A feedforward neural network is a type of artificial neural network where the information flows only in one direction, from input to output, without any feedback or loops.

In a feedforward neural network, the input layer receives the input data and passes it to the first hidden layer. Each neuron in the hidden layer applies a mathematical function to the input and passes the output to the next layer. This process is repeated for all the hidden layers until the output layer is reached, which produces the final output of the network.

The output of each neuron is calculated by applying a weighted sum of the inputs and passing the result through an activation function. The weights are learned during the training process, where the network adjusts the weights to minimize the error between the predicted output and the actual output.

Feedforward neural networks are commonly used for a variety of tasks, including classification, regression, and pattern recognition. They are also used as building blocks for more complex neural network architectures, such as convolutional neural networks and recurrent neural networks.

Q) explain back propagation?

The backpropagation algorithm works by propagating the error backwards from the output layer to the input layer, adjusting the weights of the neurons in each layer along the way.

During training, the input data is fed into the neural network, and the output of the network is compared to the actual output. The difference between the predicted output and the actual output is called the error, and this error is used to adjust the weights of the neurons in the network.

The backpropagation algorithm starts by computing the error at the output layer, and then propagating this error backwards through the network, layer by layer. The amount of error that each neuron contributes to the output is computed by taking the partial derivative of the error with respect to the output of the neuron. The weights of the neurons are then adjusted based on the amount of error they contributed to the output.

The backpropagation algorithm is typically used in conjunction with gradient descent optimization, which is used to minimize the error in the network by adjusting the weights of the neurons in the direction of the steepest descent of the error surface.

Backpropagation is an important technique for training neural networks and is used in many popular neural network architectures, including feedforward neural networks, convolutional neural networks, and recurrent neural networks.

Q) Linear Discriminant Analysis (LDA)

When the class label (response variable) has more than 2 classes, we commonly use LDA.

The objective of LDA is to perform dimensionality reduction, and it can also be used for classification. In LDA, we create a new axis and project the data onto that axis.

For example, two-dimensional data can be projected onto a single axis, so LDA reduces the data by one dimension. However, an arbitrarily chosen axis may not separate the two classes (red and blue).

LDA therefore chooses an axis that separates the two classes: the axis is chosen to maximize the distance between the means of the 2 classes (red and blue) while minimizing the scatter within each class.

https://www.youtube.com/watch?v=azXCzI57Yfc

https://www.youtube.com/watch?v=DVqpwsRxjKQ
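A minimal LDA sketch in R using the lda() function from the MASS package on the iris data (three classes, four numeric predictors):

library(MASS)
data(iris)
fit <- lda(Species ~ ., data = iris)      # find the discriminant axes
pred <- predict(fit, iris)
table(pred$class, iris$Species)           # confusion table of the LDA predictions
# pred$x contains the data projected onto the new (discriminant) axes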

Q) Decision trees

Classification by Decision Tree Induction (or: discuss building a decision tree and how a decision tree works).

 Decision tree induction is the learning of decision trees from class-labeled training tuples.

  1. A decision tree is a flowchart-like tree structure, where:
  2. Each internal node denotes a test on an attribute.  
  3. Each branch represents an outcome of the test.  
  4. Each leaf node holds a class label.
  5. The topmost node in a tree is the root node.

Advantages of Decision trees

  1. A significant advantage of a decision tree is that it forces the consideration of all possible outcomes of a decision and traces each path to a conclusion.
  2. The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery.
  3. Decision trees can handle high dimensional data.
  4. Their representation of acquired knowledge in tree form is easy to understand.
  5. They are robust to noisy data.
  6.  The learning and classification steps of decision tree induction are simple and fast.
  7.  In general, decision tree classifiers have good accuracy.
  8. Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.

Disadvantages

  1. Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
  2. Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.

In the example below:
- outlook is the root node
- humidity and wind are attributes/columns
- yes or no are the class labels
This example shows whether a child can play outside or not. In a decision tree diagram, a rectangle represents an attribute/column and an ellipse represents a class label.

The main purpose of decision tree is to extract the rules for classification.

Example: if outlook = sunny and humidity = normal then play = yes

if outlook = overcast then play = yes

if outlook = rainy and wind = low then play = yes

Types of Decision Trees
1. Unweighted decision tree: there is no weight on any node of the tree, i.e., there are no biases in the decision tree.

2. Weighted decision tree: weights (biases) are attached to nodes or branches of the tree.

3. Binary decision tree: every test has only two outcomes, so each node splits into exactly two branches.

4. Random forest: a combination of n decision trees.
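A minimal sketch of building such a tree in R with the rpart package; the tiny weather table below is made up to mirror the play example above, and minsplit is lowered only so this toy data can split:

library(rpart)
weather <- data.frame(
  outlook  = c("sunny", "sunny", "overcast", "rainy", "rainy", "overcast"),
  humidity = c("high", "normal", "high", "normal", "high", "normal"),
  play     = c("no", "yes", "yes", "yes", "no", "yes"),
  stringsAsFactors = TRUE
)
fit <- rpart(play ~ outlook + humidity, data = weather,
             method = "class", control = rpart.control(minsplit = 2))
print(fit)     # the printed splits correspond to IF-THEN classification rules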

Q) What is a Neural Network?

A Neural Network is a machine learning model inspired by how the brain works. It tries to learn patterns from data and make predictions, even if the relationship is complex or non-linear.

🧠 Basic Structure

- Input Layer: takes the input variables (like age, income, etc.)
- Hidden Layers: perform the calculations; this is where learning happens
- Output Layer: gives the prediction (Yes/No, a number, etc.)

Each layer is made of neurons (also called nodes or units), and each neuron does a weighted sum + activation.

🔧 Key Concepts

- Weights: strength of the connection between neurons
- Activation Function: decides how much signal to pass on (like sigmoid, ReLU)
- Learning: adjusts weights using training data to reduce error
- Backpropagation: technique to update weights by checking the error at the output
- Epoch: one complete pass over the training data

📘 R Code (Using nnet Package)

# Load package

library(nnet)

# Build neural network model

model <- nnet(Buy ~ Age + Income, data = train_data, size = 3)

# Predict on new data

predict(model, newdata = test_data, type = "class")

📋 Example Dataset

Age    Income    Buy
25     40000     No
45     85000     Yes
35     60000     Yes
30     50000     No

# Sample data

data <- data.frame(

  Age = c(25, 45, 35, 30),

  Income = c(40000, 85000, 60000, 50000),

  Buy = c("No", "Yes", "Yes", "No")

)

# Convert target to factor

data$Buy <- as.factor(data$Buy)

# Load library

library(nnet)

# Train the neural network with 3 hidden nodes

model <- nnet(Buy ~ Age + Income, data = data, size = 3)

# Predict for a new customer

new_customer <- data.frame(Age = 40, Income = 70000)

predict(model, new_customer, type = "class")

Q) Logistic Regression

Logistic regression is used to predict a “Yes/No” outcome, based on one or more input variables.

Steps in logistic regression are as follows:

🔹 STEP 1: Linear Combination (Just like Linear Regression)

We calculate:

z=b0+b1x1+b2x2+.......

This is like saying:

z = intercept + age coefficient × age + income coefficient × income

🔹 STEP 2: Sigmoid Function Converts z to a Probability

p = 1 / (1 + e^(-z))

This squashes any number (even negative or very large) into the range 0 to 1, which is perfect for probabilities.

🔹 STEP 3: Decision Rule (Classify)

If the probability > 0.5 → Predict Yes
 If the probability ≤ 0.5 → Predict No

Example: Predicting a mouse is obese or not
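A minimal sketch of this mouse example with glm() in R; the weights and obese labels below are made-up toy data:

weight <- c(18, 20, 22, 25, 26, 28, 30, 32, 35)   # mouse weight in grams (hypothetical)
obese  <- c(0, 0, 0, 0, 1, 0, 1, 1, 1)            # 1 = obese, 0 = not obese (hypothetical)
fit <- glm(obese ~ weight, family = binomial)     # logistic regression
summary(fit)                                      # b0 (intercept) and b1 (weight coefficient)
# predicted probability of obesity for a new mouse weighing 27 grams
predict(fit, data.frame(weight = 27), type = "response")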

Q) Bayesian networks

Bayes' Theorem

Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities: the prior probability P(H) and the posterior probability P(H|X), where X is a data tuple and H is some hypothesis.

According to Bayes' Theorem,

P(H|X) = P(X|H) P(H) / P(X)
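A tiny numeric sketch of the formula in R; all three input probabilities below are assumed values used only for illustration:

p_h   <- 0.01    # prior P(H): e.g., 1% of transactions are fraudulent (assumed)
p_x_h <- 0.90    # likelihood P(X|H): chance of seeing the evidence given fraud (assumed)
p_x   <- 0.05    # evidence P(X): overall chance of seeing the evidence (assumed)
p_h_x <- p_x_h * p_h / p_x
p_h_x            # posterior P(H|X) = 0.18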

Bayesian Belief Network

Bayesian Belief Networks specify joint conditional probability distributions. They are also known as Belief Networks, Bayesian Networks, or Probabilistic Networks.

There are two components that define a Bayesian Belief Network: a directed acyclic graph (DAG) and a set of conditional probability tables (CPTs).

Directed Acyclic Graph Representation

The following diagram shows a directed acyclic graph for six Boolean variables.

The arc in the diagram allows representation of causal knowledge. For example, lung cancer is influenced by a person's family history of lung cancer, as well as whether or not the person is a smoker. It is worth noting that the variable PositiveXray is independent of whether the patient has a family history of lung cancer or that the patient is a smoker, given that we know the patient has lung cancer.

Conditional Probability Table

The conditional probability table for the values of the variable LungCancer (LC) showing each possible combination of the values of its parent nodes, FamilyHistory (FH), and Smoker (S) is as follows −

Short answer on SVM

SVM doesn't just draw any line that separates the two classes; it draws the line that stays as far as possible from the closest points of both classes (the maximum-margin line).

Those closest points are called Support Vectors. They're the key players.

The margin is the space between the line and the closest points from each class.

The support vectors are the data points that lie closest to the line — they’re the ones that "support" the optimal line.

Q) Support vector machines (SVM)

The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.

SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called as support vectors, and hence algorithm is termed as Support Vector Machine. Consider the below diagram in which there are two different categories that are classified using a decision boundary or hyperplane:

Types of SVM

SVM can be of two types:

Hyperplane and Support Vectors in the SVM algorithm:

Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find out the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.

The dimensions of the hyperplane depend on the features present in the dataset, which means if there are 2 features (as shown in image), then hyperplane will be a straight line. And if there are 3 features, then hyperplane will be a 2-dimension plane.

We always create a hyperplane that has a maximum margin, which means the maximum distance between the data points.

Support Vectors:

The data points or vectors that are the closest to the hyperplane and which affect the position of the hyperplane are termed as Support Vector. Since these vectors support the hyperplane, hence called a Support vector.

How does SVM work?

Linear SVM:

The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:


So as it is 2-d space so by just using a straight line, we can easily separate these two classes. But there can be multiple lines that can separate these classes. Consider the below image:


Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called as a hyperplane. SVM algorithm finds the closest point of the lines from both the classes. These points are called support vectors. The distance between the vectors and the hyperplane is called as margin. And the goal of SVM is to maximize this margin. The hyperplane with maximum margin is called the optimal hyperplane.


Non-Linear SVM:

If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:


So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:

z = x² + y²

By adding the third dimension, the sample space will become as below image:


So now, SVM will divide the datasets into classes in the following way.

Since we are in 3-d space, the separating boundary looks like a plane parallel to the x-axis. If we convert it back to 2-d space with z = 1, the boundary becomes a circle of radius 1. Hence, for this non-linear data, the decision boundary is a circle (a circumference of radius 1).
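A minimal SVM sketch in R with the e1071 package (assuming it is installed), using iris only for illustration; kernel = "radial" would handle the non-linear case described above:

library(e1071)
data(iris)
fit <- svm(Species ~ ., data = iris, kernel = "linear")   # linear SVM classifier
pred <- predict(fit, iris)
table(pred, iris$Species)      # confusion table on the training data
# fit$SV contains the support vectors that define the hyperplane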

Unit IV: Automated Data Preparation, Combining Data Files, Aggregate Data, Duplicate Removal, Sampling Data, Data Caching, Partitioning Data, Missing Values. Model Evaluation and Deployment: Introduction, Model Validation, Rule Induction Using CHAID

Here are notes on the requested data mining concepts:

Automated Data Preparation

Combining Data Files

Aggregate Data

Duplicate Removal

Sampling Data

Data Caching (In-Memory Data Management for Efficiency)

Partitioning Data

Missing Values

Model Evaluation and Deployment Introduction

Model Validation

Rule Induction Using CHAID

What is Rule Induction?

Rule induction is a method used in machine learning and data mining to find easy-to-understand “if–then” rules from a dataset. These rules help explain patterns or predict outcomes.

Q)🧠 What is CHAID?

CHAID stands for:

Chi-squared Automatic Interaction Detection

It’s a decision tree algorithm used in statistics and data mining to make if–then rules from a dataset. It helps you answer questions like:

“Which features (like age, income, gender) are best at predicting if someone buys a product?”

✅ Where is CHAID Used?

🪜 How CHAID Works (Simple Idea)

  1. You start with a dataset that has a target (output) column and one or more predictor (input) columns.
  2. CHAID checks which predictor is most related to the output using a statistical test called Chi-square.
  3. It splits the data into groups (like a decision tree) based on the strongest relationship.
  4. It continues splitting until no significant improvement is found.

🔍 A Simple Example (Pen-and-Paper)

🧾 Dataset – Predicting if Someone Buys a Laptop

Person    Age Group    Student    Buys Laptop
1         Young        Yes        Yes
2         Young        No         No
3         Middle       Yes        Yes
4         Middle       No         No
5         Old          No         No
6         Old          Yes        Yes

We want to predict whether a person will buy a laptop.


🧠 Step-by-Step with CHAID

CHAID asks:

"Which column (Age Group or Student) gives the most useful split?"

To do this, it uses Chi-square test to check which predictor is more related to the target.

For now, let's assume you don’t need to calculate Chi-square — just understand the logic behind the splits.


🔹 First Check: Is Age Group related to Buys Laptop?

Let’s group the data:

Age Group    Buys Yes    Buys No
Young        1           1
Middle       1           1
Old          1           1

Not very helpful! Equal numbers for Yes and No — no strong pattern.


🔹 Next Check: Is Student related to Buys Laptop?

Student    Buys Yes    Buys No
Yes        3           0
No         0           3

Whoa! That’s a perfect split! All students bought laptops, all non-students didn’t.

CHAID selects Student as the best splitting column.


🧩 So the first rule is:

IF Student = Yes THEN Buys Laptop = Yes
 IF Student = No THEN Buys Laptop = No

This is the final decision tree. No further splits needed because the groups are already perfectly separated.
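For completeness, the Chi-square test that CHAID relies on can be run on the Student table in R (a sketch; with such tiny counts R warns that the approximation is rough, and Fisher's exact test is an alternative):

tab <- matrix(c(3, 0, 0, 3), nrow = 2, byrow = TRUE,
              dimnames = list(Student = c("Yes", "No"),
                              Buys    = c("Yes", "No")))
chisq.test(tab)     # small expected counts trigger a warning here
fisher.test(tab)    # exact test, better suited to a 6-row toy dataset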


✏️ Why Use CHAID?

✅ Easy to understand
✅ Works well with categorical data (like Yes/No, Age groups, etc.)
✅ Can be used manually with small data
✅ Produces decision trees that are easy to explain

Unit V: Automating Models for Categorical and Continuous Targets, Comparing and Combining Models, Evaluation Charts for Model Comparison, Deploying Model, Assessing Model Performance, Updating a Model.

We want to build a model to predict student performance.


Categorical Targets (Classification): the target is a category, e.g., whether the student passes or fails.


Continuous Targets (Regression): the target is a number, e.g., the student's exam score.


Automating Models


💡 Memory Hook:

Q) Explain how models can be compared?

We're building models to predict whether a student will pass or fail an exam based on study hours, attendance, and homework completion. We have several models (e.g., Decision Tree, Logistic Regression, Neural Network) and we want to compare them or combine them.

Comparing Models

Goal: Pick the best model.

Q) Discuss ensemble methods for combining models

Goal: Improve predictions by using multiple models together.

Two Types of Ensembles: Bagging (train models on different random samples of the data and combine them by voting or averaging, as in random forests) and Boosting (train models one after another, each focusing on the errors of the previous one).

💡 Memory Hook:

Q) Evaluation Charts for Model Comparison

1. ROC Curve shows how well a model can separate two classes (e.g., “pass” vs “fail”).

Example: Model’s ability to separate pass vs fail vs random guessing. Model catches most pass students and rarely mistakes fail students for pass.

The ROC curve shows us what happens at every single possible threshold! It plots two key numbers at each threshold:

So, the ROC curve is basically a graph of the TPR vs. the FPR.


 ROC curves and Area Under the Curve explained (video)
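A minimal ROC/AUC sketch with the pROC package (assuming it is installed); the true labels and predicted probabilities below are made up for illustration:

library(pROC)
actual <- c(0, 0, 1, 1, 0, 1, 1, 0, 1, 0)                         # 1 = pass, 0 = fail
scores <- c(0.2, 0.3, 0.8, 0.7, 0.4, 0.9, 0.6, 0.1, 0.75, 0.35)   # model probabilities
roc_obj <- roc(actual, scores)     # computes TPR and FPR over all thresholds
plot(roc_obj)                      # the curve; the diagonal corresponds to random guessing
auc(roc_obj)                       # area under the curve (1 = perfect, 0.5 = random)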

2. Lift Curve :  Shows how much better the model is at identifying positives compared to random selection.

https://howtolearnmachinelearning.com/articles/the-lift-curve-in-machine-learning/

A Lift chart tells you how much better your model is at finding the “failures” compared to random guessing.

It’s especially useful when you rank predictions and only want to act on the top portion of them (e.g., top 10% most likely cases).

Example: You have 100 students, and 10 of them are actually at risk of failing. Your model gives a score to every student..

  1. Rank your data You sort them from the student most likely to fail (score of 1) to the student least likely to fail (score of 0)
  2. Divide into groups Split into equal‑sized buckets (often 10 groups = deciles).
  3. Calculate Lift for each group
  4. Plot it
  1. X‑axis: % of the population targeted (top 10%, top 20%, etc.).
  2. Y‑axis: Lift value.

In the given example: 10% of all students in your dataset are "at risk" of failing. If you take the top 10% of students ranked by your model and find that 30% of them are actually "at risk", then Lift = 30 / 10 = 3. (Lift = rate of true positives found by the model ÷ rate expected by random selection.) This means your model is 3× better than random selection at finding "at risk" students in that top slice.
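A minimal base-R sketch of this lift calculation; the scores are simulated, so the exact lift value will vary, but any value above 1 means the model beats random selection:

set.seed(1)
at_risk <- c(rep(1, 10), rep(0, 90))             # 10 of 100 students truly at risk
score <- numeric(100)
score[at_risk == 1] <- runif(10, 0.5, 1.0)       # model tends to score at-risk students higher
score[at_risk == 0] <- runif(90, 0.0, 0.8)
ranked <- at_risk[order(score, decreasing = TRUE)]
top10 <- ranked[1:10]                            # the top 10% targeted by the model
mean(top10) / mean(at_risk)                      # lift in the top decile vs the overall rate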

3. Calibration Plot: Checks if the probabilities predicted by your model are realistic.

A Guide to Calibration Plots in Python – Chang Hsin Lee – Committing my thoughts to words.

4. Confusion Matrix

5. Histogram : Show how data is spread out.

6. Boxplot : Show spread, median, and outliers in data.

7. Scatter Plot

8. Information Criteria Plot (AIC/BIC)

AIC & BIC for Selecting Regression Models: Formula, Examples

9. Fitted Curve Plot

💡 Simple memory line:
 All these charts are just different ways of comparing model predictions to actual truth or to a random baseline.

Q) What are the key steps and techniques used to evaluate a model’s performance before deployment (OR) How model performance is assessed.

Deploying a Model


Assessing Model Performance

This is about checking how good the model really is before and after deployment.

1. Classifier Evaluation


2. Methodological Issues — How you split your data matters:


3. Quantification Issues — Ways to measure quality:


4. Cross‑Validation


5. Bootstrap


💡 Big Picture:
 You deploy the model to use it in real life, but you assess it carefully using smart data splits and performance measures to make sure it’s reliable.

Q) What are the different methods used to update a predictive model when new data becomes available?

1️⃣ Data Warehouse (DWH) Updates


2️⃣ Prediction Model Validation with New Time Periods


3️⃣ Incremental Decision Tree Induction


4️⃣ Data Streams & Reservoir Sampling


💡 Big Picture:
 Updating a model means:

  1. Keep your data fresh (update the warehouse).
  2. Test on new situations (different months).
  3. Update smartly (incremental learning instead of starting over).
  4. Handle endless data (sampling from streams).