SEMESTER-V
COURSE 14 B: PREDICTIVE AND ADVANCED ANALYTICS USING R
Unit-I
Introduction to Data Mining: Introduction, What is Data Mining?, Concepts of Data Mining, Technologies Used, Data Mining Process, KDD Process Model, CRISP-DM, Mining on Various Kinds of Data, Applications of Data Mining, Challenges of Data Mining.
Q) Explain the stages of the data mining process / KDD process.
Ans: The knowledge discovery process is shown in Figure as an iterative sequence of the following steps:
1. Data cleaning: remove noise and inconsistent data.
2. Data integration: combine data from multiple sources.
3. Data selection: retrieve the data relevant to the analysis task from the database.
4. Data transformation: transform and consolidate data into forms appropriate for mining.
5. Data mining: apply intelligent methods to extract data patterns.
6. Pattern evaluation: identify the truly interesting patterns.
7. Knowledge presentation: use visualization and knowledge representation techniques to present the mined knowledge to users.
Steps 1 through 4 are different forms of data pre-processing, where data are prepared for mining.
The data mining step may interact with the user or a knowledge base. The interesting patterns are presented to the user and may be stored as new knowledge in the knowledge base.
Q) What is data mining?
Data mining is the process of studying large sets of data to find useful information. It helps turn raw, messy data into clear knowledge that people can understand and use.
🌟 Key Points:
- Just like mining for gold, data mining means digging through a huge amount of data to find small, valuable pieces of information.
- The main aim is to discover patterns or trends that can help in making better decisions.
- It can be done with numbers (like sales records), text (like social media posts), or even images.
- Without using proper tools, big data can be ignored or misused—people might guess or go with their gut instead of using real facts.
So, data mining turns confusing data into something smart and useful for companies, hospitals, banks, and more.
Q) Explain data mining techniques. (Or: What data mining technologies/functionalities are used?)
Ans: Data mining functionalities are used to specify the kinds of patterns to be mined.
Data mining tasks can be classified into two categories: descriptive and predictive.
Descriptive tasks summarize the data and describe its general properties.
Predictive tasks analyse the data and construct a model that can predict the behaviour of new data.
There are several data mining functionalities. These include
- characterization and discrimination
- Mining of frequent patterns, associations, and correlations
- classification and regression
- clustering analysis
- outlier analysis
1. Characterization and discrimination: Data characterization is a summarization of the general characteristics or features of a target class of data. Example: summarize the characteristics of customers who spend more than Rs. 5000 a year at Modern Super Market.
Data discrimination is a comparison of the general features of the target class against the general features of one or more contrasting classes. For example: compare two groups of customers—those who buy computer products regularly and those who rarely buy such products.
2. Mining of frequent patterns, associations and correlations
Frequent patterns, as the name suggests, are patterns that occur frequently in data. Mining frequent patterns leads to the discovery of interesting associations and correlations within data.
3. Classification and regression: Classification is a supervised learning technique. The class labels of the training data are given, and a model is trained using that data. The trained model (or classifier) is then used to predict the category or class label of new data. If the model is used to predict a number (a continuous value), it is called regression.
Example: Given details of a customer such as age, credit rating, and whether or not they are a student, a decision tree is constructed from the training data. This tree is then used to predict whether a customer will buy a computer (yes) or will not buy one (no). The classification model can be represented as a decision tree.
Classification model can be represented using IF-THEN rules as follows:
age(X, “youth”) AND student(X, “yes”) ⇒ buys(X, “yes”)
age(X, “middle_aged”) ⇒ buys(X, “yes”)
age(X, “senior”) AND credit_rating(X, “excellent”) ⇒ buys(X, “yes”)
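R Code Example (a minimal sketch of training such a classifier with the rpart package; the package choice and the small data frame below are assumptions made for illustration, not from the text):
library(rpart)   # install.packages("rpart") if needed
# Made-up training data: age group, student status, credit rating, and the class label
customers <- data.frame(
  age           = c("youth", "youth", "middle_aged", "senior", "senior", "middle_aged"),
  student       = c("no", "yes", "no", "yes", "no", "yes"),
  credit_rating = c("fair", "fair", "excellent", "fair", "excellent", "excellent"),
  buys          = c("no", "yes", "yes", "yes", "no", "yes"),
  stringsAsFactors = TRUE
)
# Train a classification tree (method = "class" because the label is categorical)
model <- rpart(buys ~ age + student + credit_rating, data = customers,
               method = "class", control = rpart.control(minsplit = 2, cp = 0))
# Predict class labels (here on the same toy data; normally you would use new data)
predict(model, customers, type = "class")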
4. Cluster Analysis:
A group of objects that belong to the same class is known as a cluster. In data mining, cluster analysis is a way to discover groups of similar items.
Cluster analysis can be performed on Modern Super Market customer data to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing.

5. Outlier Analysis: An outlier is a data point that is very different from other points.
It is unusual, and may be an error or a rare event.
An outlier is a data object that deviates significantly from the rest of the objects. The analysis of outlier data is referred to as outlier analysis or anomaly mining. Outlier analysis tries to find unusual patterns in any dataset. Outlier detection is important in many applications in addition to fraud detection such as medical care, public safety and security, industry damage detection, image processing, sensor/video network surveillance, and intrusion detection.
Q) MINING ON VARIOUS KINDS OF DATA
The data is categorized into fundamental and more complex types:
- Database Data (Relational Data):
- This is data typically stored in tables, like those you might see in a spreadsheet, where information is organised into rows (representing objects like customers or items) and columns (representing characteristics or attributes like age or price).
- Example: An electronics store's customer database, with details on customer IDs, names, addresses, and purchase history.
- Data Warehouses:
- These are large, centralised repositories of information collected from various sources within an organisation, specifically organised for analysis and decision-making. They often contain historical and summarised data.
- Example: AllElectronics, a company with branches worldwide, might consolidate sales data from all branches into a single data warehouse for comprehensive analysis. This data is often modelled as a data cube, which allows data to be viewed in multiple dimensions (e.g., sales by time, item, branch, and location).
- Transactional Data: This type of data records individual transactions, such as purchases or financial transactions, providing insights into customer behaviour and business operations.

Time Series Data: It refers to data collected over a period of time, such as stock prices or weather conditions, allowing for analysis of patterns and trends.

Biological Data: It contains information related to living organisms, such as genetic sequences or physiological measurements. This data aids research in areas like genetics, medicine, and ecology.
Spatial Data: Spatial data contains information about physical locations and geographic features. This data helps in the analysis and visualisation of geographic patterns, such as maps or satellite imagery.

Social Network Data: It involves data about individuals and their relationships in a social network, offering insights into social interactions, influence, and community structures.

- Other Kinds of Data (Complex Data Types): These present greater challenges and often require specialised mining methodologies.
- Symbolic Sequence Data: Ordered lists of events or elements, with or without a precise time notion.
- Example: Customer shopping sequences (e.g., buying a PC, then a digital camera, then a memory card) or web click streams.
- Spatiotemporal Data and Moving Objects: Data that change over both space and time, often related to moving entities.
- Example: Tracking vehicles in a city or monitoring weather patterns.
- Cyber-Physical System Data: Data from interconnected computing and physical components, such as sensor networks.
- Multimedia Data: Includes images, video, and audio.
- Example: Mining images to classify objects or video data to detect specific events.
- Text Data: Unstructured information stored as text.
- Example: News articles, technical papers, customer reviews, or emails. Text mining can extract high-quality information, identify trends, or perform sentiment analysis.
- Graph and Networked Data: Data representing relationships between interconnected objects.
- Example: Social networks (friends linked to friends) or information networks (web pages linked to each other). Mining can discover hidden communities, hubs, or outliers.
- Data Streams: Data that continuously flows into a system in vast volumes, changing dynamically, and potentially infinite.
- Example: Real-time video surveillance feeds or sensor data. This requires efficient single-pass or few-pass algorithms due to their continuous nature.
It's important to note that multiple types of data often coexist in real-world applications (e.g., web mining involves text, multimedia, and graph data). Mining these multiple sources can lead to richer findings but also presents challenges in data cleaning and integration.
Q)📊 Applications of Data Mining
Data mining is used in many areas to find useful patterns and help make better decisions. It turns huge amounts of data into meaningful information.
| 🔢 Area | 🏷️ Use Case | 📌 Explanation |
| --- | --- | --- |
| 🏪 1. Business & Commerce | Retail & Marketing | Find which products are bought together; plan ads and store layout. |
|  | Banking & Finance | Detect fraud, check loan risk, group customers. |
|  | Telecom | Find customers likely to leave; analyze call usage. |
| 🔬 2. Science & Engineering | Medicine & Bioinformatics | Predict diseases; study DNA and discover drugs. |
|  | Computer Engineering | Detect bugs or attacks; improve system performance. |
|  | Environmental Science | Predict climate; link geography and poverty. |
| 🌐 3. Web & Internet | Search Engines | Improve search; show related ads or trending searches. |
|  | Recommender Systems | Suggest products or shows (like Netflix, Amazon). |
|  | Text Mining | Analyze opinions from reviews; group articles. |
| 🛡️ 4. Security & Society | Crime & Fraud Prevention | Spot fraud in banking and insurance; help crime detection. |
|  | Hidden Data Mining (Social Impact) | Happens silently while shopping or browsing. |
Q) Challenges in Data Mining
Data mining helps us discover useful information from large datasets. But it also faces several difficulties. These problems come from the nature of the data, the technical methods used, user interaction, and social concerns. The challenges are listed below
| 🧩 Problem Area | 📌 Simple Explanation |
| --- | --- |
| 🧹 Data Quality Problems | Data may have mistakes or missing parts; merging data from many sources causes confusion; real patterns are hard to find due to noise or outliers. |
| ⚙️ Scalability & Efficiency | Too much data to process; too many features (columns); live (streaming) data is hard to handle; some algorithms are slow and heavy. |
| 🧬 Variety in Data Types | Data comes in many forms (text, image, video); some data, like networks, is complex to analyze. |
| 🧪 Mining & Evaluation Issues | Too many patterns are found and only a few are useful; "interesting" patterns are hard to define; some methods give only partial results. |
| 👩‍💻 User Involvement | Users want to change or explore during mining; complex models are hard to understand; good charts or visuals are needed. |
| 🔐 Privacy & Social Issues | Mining may break personal privacy; it can be misused; it is often done silently, without users knowing. |
Q) CRISP-DM (Cross Industry Standard Process for Data Mining)
CRISP-DM is a common method used to do data mining projects in a step-by-step way. It helps people work with data clearly and effectively, in any industry or with any tool.
✅ Main Features of CRISP-DM
- It uses a step-by-step process.
- It works in any field (business, health, etc.).
- Steps can be repeated if needed.
🔢 Six Steps of CRISP-DM
- Business Understanding – Know the goal of the project from a business point of view.
- Data Understanding – Collect data and check its quality.
- Data Preparation – Make the data ready for analysis (cleaning, formatting).
- Modeling – Use data mining techniques to build a model.
- Evaluation – Check if the model gives good results.
- Deployment – Use the model to help in real-world decision making.
Q) Differentiate between CRISP-DM and data mining (DM)
DM is what you do. CRISP-DM is how you do it.
🔍 CRISP-DM vs. DM: Key Differences
| Feature | CRISP-DM | Data Mining (DM) |
| --- | --- | --- |
| Definition | A structured methodology for DM projects | The general process of extracting insights from data |
| Purpose | Guide teams through DM projects step by step | Discover patterns and knowledge from data |
| Phases | 6 defined stages (Business Understanding to Deployment) | No fixed structure; depends on the approach |
| Flexibility | Highly adaptable across industries | Flexible, but can be chaotic without a method |
| Tool Dependence | Tool-agnostic (works with any tool) | May depend on specific tools or algorithms |
| Project Management | Includes planning, evaluation, and real-world use | Often focused only on modeling and analysis |
🧭 Why CRISP-DM Matters
CRISP-DM brings clarity, repeatability, and structure to data mining. It is like having a GPS for your data journey, which is especially useful when working in teams or across industries.
Unit II: Data Understanding and Preparation - Introduction, Reading data from various sources, Data visualization, Distributions and summary statistics, Relationships among variables, Extent of missing data, Segmentation, Outlier detection
Q) Data Understanding and Preparation – Introduction
In Data Science, it is very important to first understand your data and then prepare it. Data understanding helps us know what the data is about. Data preparation makes the data ready for analysis. In R, we use different tools and functions for these steps.
Q1. Data Understanding (Getting to Know Your Data)
Before you work with data, you need to understand it well. This means knowing what kind of data you have, what values it contains, and how these values are spread out.
a. Types of Data (Attributes): Data is made of "attributes" (also called variables or features). We can classify attributes into different types:
- Nominal Attributes: These are names or labels, without any order.
- Example: Gender (Male, Female), Colour (Red, Blue, Green).
- Binary Attributes: These have only two possible values.
- Example: Yes/No, True/False, 0/1.
- Ordinal Attributes: These have an order, but the differences between values are not fixed.
- Example: Ratings (Low, Medium, High), Shirt Size (Small, Medium, Large).
- Numeric Attributes: These are numbers.
- Discrete: Values are whole numbers or can be counted.
- Example: Number of children, ZIP Code.
- Continuous: Values can be any number within a range (like decimal numbers).
- Example: Height, Weight, Temperature.
b. Basic Statistical Measures: To understand the data's values, we often look at statistical measures.
- Measures of Central Tendency: These show the "middle" or "center" of the data.
- Mean: The average value. You add all values and divide by how many there are.
- R Code Example: mean_value <- mean(data).
- Median: The middle value when data is sorted. If there are two middle values, it's their average.
- R Code Example: median_value <- median(data).
- Mode: The value that appears most often. R does not have a built-in mode() function for this, so you usually write a small helper function (a sketch is given after the code example below) or look at frequency tables.
- Measures of Variability (Dispersion): These show how "spread out" the data is.
- Range: The difference between the highest and lowest values.
- InterQuartile Range (IQR): The range of the middle 50% of the data. It helps to see spread without outliers.
- Variance: Measures how far values are from the mean, on average, squared.
- Standard Deviation: The square root of the variance, showing typical distance from the mean.
R Code Example:
data_sample <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)
range_val <- max(data_sample) - min(data_sample) # Calculate range
iqr_val <- IQR(data_sample) # Calculate IQR
variance_val <- var(data_sample) # Calculate variance
sd_val <- sd(data_sample) # Calculate standard deviation
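R Code Example (a small helper for the mode mentioned above; this is a common idiom rather than a base R function):
get_mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]   # most frequently occurring value
}
get_mode(c(2, 3, 3, 5, 7, 3, 2))   # returns 3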
c. Data Visualization: Pictures (graphs) help us see patterns and problems in data more easily.
- Histograms: Show the distribution of one numeric variable, how often values fall into certain ranges.
- Boxplots: Show the distribution, median, quartiles, and possible outliers for one or more variables.
- Scatter Plots: Show the relationship between two numeric variables, as points on a graph.
R Code Example (using iris dataset, which is preloaded in R):
# Histogram for Sepal.Length
hist(iris$Sepal.Length, main="Histogram of Sepal Length", xlab="Sepal Length")
# Boxplot for Sepal.Length by Species
boxplot(Sepal.Length ~ Species, data=iris, ylab="Sepal Length")
# Scatter plot of Sepal.Length vs Sepal.Width
plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species) # Color points by species
Q2. Data Preparation (Preprocessing)
Data preparation is the most important step in data mining. It transforms raw data into a clean and useful format. This step takes a lot of time in a data science project.
The main steps in data preparation are:
a. Data Cleaning: Real-world data often has problems like missing information, errors, or inconsistencies. Data cleaning fixes these problems.
- Handling Missing Values (NA): Missing values are shown as NA in R.
- Ignore/Remove: Remove rows or columns with missing values.
- Impute: Fill in missing values using methods like the mean, median, or a more advanced model.
- Handling Noise and Outliers: Noise means random errors. Outliers are values that are very different from most other data.
- Tools like boxplots can help identify outliers visually.
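R Code Example (a minimal sketch of handling missing values and checking for outliers; the small vectors and data frame are made up for illustration):
x <- c(12, 7, NA, 4, 18)            # toy data with a missing value
sum(is.na(x))                       # count the missing values
x_removed <- x[!is.na(x)]           # option 1: remove the missing values
x_imputed <- x
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)   # option 2: impute with the mean

df <- data.frame(a = c(1, 2, NA), b = c(4, NA, 6))
na.omit(df)                         # drop rows containing any NA

boxplot(c(12, 7, 3, 4, 18, 2, 54, -21, 8, -5))   # visual check for outliers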
b. Data Integration: This step combines data from many different sources into one dataset. It is important to handle different ways data is named or stored (e.g., different units like meters vs. feet).
c. Data Reduction: Large datasets can be hard to work with. Data reduction makes the dataset smaller, but still keeps important information.
- Sampling: Take a smaller part of the data that still represents the whole.
- Feature Selection: Choose only the most important attributes (columns) that are useful for your analysis.
- Dimensionality Reduction: Techniques like Principal Components Analysis (PCA) combine attributes to create fewer, new attributes.
d. Data Transformation: This changes data into a suitable format for mining.
- Normalization: Scaling data values to a specific range (e.g., 0 to 1). This is useful when attributes have very different ranges.
- Discretization: Changing numeric data into categorical "bins" or groups (e.g., age into "young", "medium", "old").
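R Code Example (a minimal sketch of min-max normalization and discretization with cut(); the age values and bin boundaries are made up):
ages <- c(5, 17, 23, 35, 42, 58, 67, 80)

# Min-max normalization: rescale values to the range 0 to 1
ages_norm <- (ages - min(ages)) / (max(ages) - min(ages))

# Discretization: turn numeric ages into categorical bins
age_group <- cut(ages, breaks = c(0, 30, 60, 100),
                 labels = c("young", "medium", "old"))
age_group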
Q) Reading Data from Various Sources
In data science, we must collect data before we can analyze it.
Data can come from different places (sources) and different formats.
We must know how to read and load this data into our tools (like R, Python, Excel).
🔹 Types of Data Sources
| 🔢 Type | 📌 Simple Meaning | 📁 Example | 🧪 R Syntax to Read |
| --- | --- | --- | --- |
| Flat Files | Text files with rows and columns | .csv, .txt, .tsv | read.csv("file.csv"), read.table("file.txt") |
| Excel Files | Spreadsheet format | .xls, .xlsx | readxl::read_excel("file.xlsx") |
| Databases | Organized storage with SQL access | MySQL, Oracle | DBI::dbReadTable(con, "table_name") (after a DB connection) |
| Web Data | Data from websites or APIs | HTML tables, weather APIs | httr / jsonlite for APIs (see note below) |
| Cloud Storage | Online data storage services | Google Drive, AWS S3 | googledrive, aws.s3, or download locally and use read.csv() |
| Sensor/Live Data | Streaming data from devices or logs | GPS, logs, IoT streams | readLines("sensor_log.txt"), scan() for simple text reading |
- For Excel: you must install the readxl package:
install.packages("readxl")
- For databases: Use DBI + a driver like RMySQL or RSQLite
- For APIs or JSON: Use jsonlite or httr packages
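R Code Example (a minimal sketch; the file names below are placeholders, and the readxl and jsonlite packages must be installed):
sales <- read.csv("sales.csv")                 # flat file (CSV)
logs  <- read.table("log.txt", header = TRUE)  # flat file (space/tab separated)

library(readxl)
budget <- read_excel("budget.xlsx", sheet = 1) # Excel file

library(jsonlite)
settings <- fromJSON("settings.json")          # JSON data (e.g., from an API)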
Q) Data Visualization
Data Visualization means showing data in pictures (charts/graphs), so that we can understand it easily.
🔹 Why is it useful?
- Shows patterns and trends in the data
- Helps to find mistakes or outliers
- Makes data easy to explain to others
🔹 Common Types of Charts
| 📊 Chart Type | 📌 Use |
| --- | --- |
| Bar Chart | Compare categories (e.g., number of students by class) |
| Pie Chart | Show parts of a whole (e.g., percentage of sales) |
| Histogram | Show the distribution of numbers (e.g., ages of people) |
| Line Chart | See trends over time (e.g., monthly sales) |
| Box Plot | Show the spread of values and outliers |
| Scatter Plot | See the relationship between two variables |
The plot() function in R is used to create the line graph.
Syntax
The basic syntax to create a line chart in R is –
plot(v, type, col, xlab, ylab, main)
Following is the description of the parameters used −
- v is a vector containing the numeric values.
- type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points and lines.
- xlab is the label for x axis.
- ylab is the label for y axis.
- main is the Title of the chart.
- col is used to give colors to both the points and lines.
Program:
marks=c(15,22,35,55,45,65)
plot(marks, type="l", col="Blue")
Output:

Boxplots are created in R by using the boxplot() function.
The basic syntax to create a boxplot in R is −
boxplot(x, data, notch, varwidth, names, main)
Example:
data<- c(1,2,3,4,5)
boxplot(data)

In R the pie chart is created using the pie() function
Syntax
The basic syntax for creating a pie-chart using the R is −
pie(x, labels, radius, main, col, clockwise)
Example:
x <- c(50,40,10)
labels <- c("mpcs","mscs","dscs")
# Plot the chart.
pie(x,labels,col = rainbow(length(x)))

The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
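Example (a small sketch using made-up values):
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)
hist(v, main = "Histogram of v", xlab = "Value",
     col = "lightblue", breaks = 5)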

Q) Distributions: Key Concepts and Applications
Distributions are fundamental for understanding how data values are spread out. A probability distribution describes the "law governing a random variable" from which observed data originated, indicating the likelihood of different values being observed.
Core Concepts:
For continuous variables, distributions are described by a probability density function.
For discrete variables, they are described by a probability function.
Key Types of Distributions:
You want to understand the height of your students. You measure them and plot the number of students with each height on a graph. You notice most of them are around 165–170 cm, with fewer students being very short or very tall. The graph looks like a hill or bell.
That’s a normal distribution — one of the most famous distributions.
You toss a coin 10 times and count how many times you get heads. You repeat this many times and record the results. Most of the time you’ll get around 5 heads, sometimes 4 or 6, and rarely 0 or 10. That’s a binomial distribution.
👉 Example Problem (Binomial Distribution): A coin is tossed 10 times. What is the probability of getting exactly 6 heads?
✅ Solution: Use binomial formula: P(X=6) = C(10,6) * (0.5)^6 * (0.5)^4 = 210 * 0.015625 * 0.0625 = approx. 0.205
If you toss the coin just once, it’s either head or tail — 1 or 0. That’s a Bernoulli distribution — the simplest one.
Imagine customers arriving at a bank. You count how many come in each hour. The number varies, but usually it’s 3 to 5 per hour. This kind of count data is often modeled using the Poisson distribution.
👉 Example Problem (Poisson Distribution): On average, 4 customers visit a bank per hour. What is the probability that exactly 2 customers arrive in the next hour?
✅ Solution: Use Poisson formula:
P(X=2) = (e^-4 * 4^2) / 2! = (0.0183 * 16) / 2 = 0.146
Let’s say you conduct a survey expecting 20 students to prefer chocolate over vanilla ice cream, but only 14 do. You want to know: is this difference by chance, or is it statistically significant? You use the Chi-squared distribution to test this.
👉 Example Problem (Chi-squared Distribution): Expected: 20 students like chocolate, 10 like vanilla. Observed: 14 like chocolate, 16 like vanilla.
✅ Solution: Use Chi-squared formula: χ² = Σ((Observed - Expected)² / Expected)
χ² = (14-20)²/20 + (16-10)²/10 = (36/20) + (36/10) = 1.8 + 3.6 = 5.4
The critical value comes from the Chi-squared distribution table, which tells you the threshold beyond which your result is statistically significant. df = 1 and Significance level (α) = 0.05 You look up the value in the Chi-squared table, and you’ll find: Critical value ≈ 3.841 The difference is big enough (χ² = 5.4 > 3.841) that it’s unlikely to be due to chance.
So, you conclude: student preferences are different than expected—maybe chocolate isn’t as popular as you thought!
There are two types of Chi-squared tests, and df is calculated differently depending on which one you're using:
1. Goodness-of-Fit Test (comparing observed vs expected in one categorical variable)
- df = number of categories − 1
- ✅ This is the test you're using in your example (chocolate vs vanilla).
2. Test of Independence (e.g., contingency tables)
df = (number of rows − 1) × (number of columns − 1)
The Inverse Gaussian distribution is used when modeling the time until an event happens, but it assumes that the event rate changes over time (it is often used in advanced reliability and survival analysis).
👉 Example Problem (Inverse Gaussian Distribution): Suppose a machine has an average lifetime of 1000 hours. What is the chance it will last more than 1200 hours?
✅ Solution (conceptual): Use the inverse Gaussian cumulative distribution function or software like R: 1 - pinvgauss(1200, mean = 1000, dispersion = 500)
Now imagine you are tracking how long it takes between customer arrivals at a bank. You find that the waiting time is not always regular. You can model this kind of waiting time using the Gamma distribution.
👉 Example Problem (Gamma Distribution): If the average time between arrivals is 3 minutes, and you want the probability that a customer arrives within 5 minutes, use the gamma probability function with shape and rate parameters.
✅ Solution (conceptual): You plug values into the gamma formula or use R’s pgamma() function: pgamma(5, shape, rate) to get the probability.
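R Code Example (a sketch for checking the worked examples above with R's built-in distribution functions; the gamma parameters shape = 1 and rate = 1/3 are an assumption chosen only to give a 3-minute mean):
dbinom(6, size = 10, prob = 0.5)         # binomial: P(X = 6) ≈ 0.205
dpois(2, lambda = 4)                     # Poisson: P(X = 2) ≈ 0.147
pchisq(5.4, df = 1, lower.tail = FALSE)  # chi-squared: p-value for statistic 5.4 with df = 1
pgamma(5, shape = 1, rate = 1/3)         # gamma: P(arrival within 5 minutes), mean = 3 minutes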
⚙️ Where Are Distributions Used?
In Regression Models: Errors (difference between actual and predicted values) are often assumed to be normally distributed.
In Hypothesis Testing: We compare means, proportions, etc., using tests based on t-distribution or chi-squared distribution.
In Classification: We estimate probabilities of different classes using distributions.
In Clustering: We assume that data comes from a mix of several distributions (like multiple bell curves).

In Outlier Detection: Outliers are the values that lie far from the common range — in the “tails” of the distribution.
Visualising Data Distributions: Visual tools are crucial for inspecting and understanding data distributions:
Histograms: Used to show the frequency distribution of quantitative variables, providing insight into their shape and concentration.
R code example: hist(iris$Sepal.Length)
Boxplots: Effectively summarise the five-number summary (minimum, first quartile, median, third quartile, maximum) of a distribution and are useful for identifying outliers and comparing distributions across different groups. The whiskers also give a sense of how far the data extend beyond the quartiles.
R code example: boxplot(Petal.Length ~ Species, data = iris)
Outliers are data points that fall far outside the typical range. Specifically:
📌 Rule of Thumb:
- Lower Bound = Q1 − 1.5 × IQR
- Upper Bound = Q3 + 1.5 × IQR
Any data point:
- < Lower Bound or
- > Upper Bound
is considered an outlier.
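R Code Example (a minimal sketch of the 1.5 × IQR rule of thumb on a made-up vector):
values <- c(45, 50, 55, 60, 150)
q1  <- quantile(values, 0.25)
q3  <- quantile(values, 0.75)
iqr <- q3 - q1
lower <- q1 - 1.5 * iqr
upper <- q3 + 1.5 * iqr
values[values < lower | values > upper]   # flags 150 as an outlier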
✅ 1. Quantile Plot – Detecting Unusual Values (Outliers)
📘 Example Scenario:
You are analyzing monthly expenses of 20 students in a college. Most spend between ₹5,000 to ₹7,000. But a few spend above ₹10,000.
You draw a quantile plot:
Most dots form a smooth curve
Suddenly, the line jumps at the end
🧠 What You Learn:
Some students are spending much more than others.
These are outliers — maybe from rich families or incorrect entries.
🎯 Use in Analytics:
These outliers can be removed or treated before building a prediction model (like predicting spending based on background).
# Monthly expenses of 20 students
expenses <- c(5200, 5300, 5400, 5500, 5600, 5800, 5900, 6000,
6100, 6200, 6400, 6500, 6600, 6800, 7000,
10500, 11000, 11500, 12000, 13000)
# Quantile plot: fraction of the data on the x axis, sorted values on the y axis
plot(ppoints(length(expenses)), sort(expenses),
     xlab = "Fraction of data", ylab = "Monthly expenses (Rs.)")

✅ 2. Q–Q Plot – Checking if Data is Normal
📘 Example Scenario:
You want to use linear regression to predict students' marks based on their study hours. This method assumes the data is normal.
You draw a Q–Q plot for marks:
If the dots lie on a straight line ➝ data is normal ✅
If the dots curve ➝ data is skewed ❌
🧠 What You Learn:
If it's not normal, you may need to transform the data (like taking log values) before applying regression.
🎯 Use in Analytics:
This helps you choose the right model or prepare data better for accurate predictions.
# Marks of 20 students
marks <- c(45, 50, 55, 60, 65, 70, 72, 73, 74, 75,
76, 77, 78, 80, 82, 85, 88, 90, 92, 95)
# Q-Q plot against normal distribution
qqnorm(marks, main = "Q–Q Plot of Student Marks")
qqline(marks, col = "red") # Add reference line

On the x axis, the theoretical quantiles correspond to percentiles of the normal distribution:
- -2 (≈ 2.5th percentile)
- -1 (≈ 16th percentile)
- 0 (50th percentile, the median)
- +1 (≈ 84th percentile)
- +2 (≈ 97.5th percentile)
These values are standard scores (z-scores) showing where a value would lie on a normal distribution curve.
✅ 3. Density Plot – Comparing Two Groups
📘 Example Scenario:
You want to compare the exam scores of two classes — Class A and Class B.
You draw density plots for both classes:
Class A has a peak near 75
Class B has a flatter curve, with scores spread from 50 to 90
🧠 What You Learn:
Class A students are more consistent
Class B has more variation in marks
🎯 Use in Analytics:
This analysis helps teachers know which class needs attention or personalized coaching.
# Scores of Class A and Class B
classA <- c(70, 72, 74, 75, 76, 77, 78, 78, 79, 80)
classB <- c(50, 55, 60, 65, 70, 75, 80, 85, 90, 95)
# Plot density curves, with the y limit covering both curves
dA <- density(classA)
dB <- density(classB)
plot(dA, col = "blue", lwd = 2, ylim = range(0, dA$y, dB$y), main = "Exam Score Densities")
lines(dB, col = "green", lwd = 2)

Class A has a narrow peak — consistent scores.
Class B is spread out — more variation.
Q) Summary Statistics: Describing Your Data
Summary statistics are numerical measures that provide concise descriptions of data features, particularly distributions. They are a fundamental part of Exploratory Data Analysis (EDA), helping to generate questions about data and identify properties like noise or outliers. Statistical data descriptions are useful to grasp data trends and identify anomalies.
Key Categories of Summary Statistics:
Measures of Central Tendency: These indicate the middle or center of a data distribution.
Mean: The average of all values in a dataset.
Median: The middle value in a sorted dataset. It is the 50th percentile and effectively divides the data into two equal halves. It can be computed for ordered (ordinal) as well as numeric attributes.
Mode: The value that occurs most frequently in a dataset. A dataset can have one (unimodal), two (bimodal), or more (multimodal) modes.
Measures of Dispersion (Spread): These indicate how spread out the data values are.
Range: The difference between the largest and smallest values in a dataset. This is used in techniques like Min-Max normalization.
Quartiles (Q1 and Q3):
Q1 (First Quartile): The 25th percentile, cutting off the lowest 25% of the data.
Q3 (Third Quartile): The 75th percentile, cutting off the lowest 75% (or highest 25%) of the data.
Together with the median, they indicate a distribution's center, spread, and shape.
Interquartile Range (IQR): The difference between Q3 and Q1 (IQR = Q3 - Q1). It defines the middle 50% of the data. A common rule for identifying suspected outliers is to single out values falling at least 1.5 × IQR above Q3 or below Q1.
Variance (σ2) and Standard Deviation (σ):
Measures of how spread out a data distribution is, relative to the mean.
A low standard deviation indicates data observations are close to the mean, while a high standard deviation means data are spread over a large range.
These measures are useful for identifying outliers. Data is often standardized using mean and standard deviation to ensure similar scaling and weighting for all attributes.
The Five-Number Summary:
A fuller summary of a distribution's shape, especially for skewed distributions.
Consists of: Minimum, Q1, Median (Q2), Q3, Maximum.
Other Statistical Measures (for relationships):
Correlation Coefficient and Covariance: Used for numeric attributes to measure how strongly one attribute implies another. They assess how one attribute's values vary from those of another.
Chi-squared (χ2) Measure: Used for nominal data to detect correlations.
Using R for Summary Statistics: R is a powerful tool for calculating these measures:
The summary() command provides a quick statistical summary for each variable in a dataset, including min, max, mean, median, and quartiles.
Specific functions are available for individual measures:
max() and min() for range.
var() for variance.
sd() for standard deviation.
To calculate the mean without considering NA (missing) values, you can use mean(examsquiz$per, na.rm=TRUE).
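R Code Example (a quick sketch of these functions on the built-in iris data):
data(iris)
summary(iris$Sepal.Length)   # min, quartiles, median and mean in one call
mean(iris$Sepal.Length)
median(iris$Sepal.Length)
var(iris$Sepal.Length)
sd(iris$Sepal.Length)
IQR(iris$Sepal.Length)
range(iris$Sepal.Length)     # smallest and largest value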
These summary statistics are crucial for initial data inspection and understanding the overall behaviour and properties of your data.
Q) short note on Distributions and Summary Statistics
🔹 What is a Distribution?
A distribution tells us how the values in a dataset are spread out.
- It shows which values are common and which are rare.
- Often shown using a histogram or curve.
🔹 Summary Statistics
These are numbers that tell us about the data in a short and simple way.
| 📏 Measure | 📌 What it tells us |
| --- | --- |
| Mean (Average) | Total divided by the number of items |
| Median | Middle value (when sorted) |
| Mode | Most repeated value |
| Range | Difference between highest and lowest |
| Standard Deviation | How spread out the data is |
| Variance | Square of the standard deviation |
| Min/Max | Smallest and biggest values |
| Quartiles | Divide the data into 4 equal parts |
🔹 Example:
Marks: 40, 50, 50, 60, 70
- Mean = (40+50+50+60+70)/5 = 54
- Median = 50
- Mode = 50
- Range = 70 – 40 = 30
Distributions show how data values are spread.
Summary statistics give short numerical info like average, median, etc.
Q) Relationships Among Variables
Understanding relationships among variables is a core task in data analysis, allowing us to find significant connections and patterns within data. This helps in extracting knowledge from data to solve business problems.
Variables can be broadly categorised into quantitative (numeric) and qualitative (categorical) types, and the methods for studying relationships vary depending on these types.
I. Measuring Relationships Between Quantitative Variables
For quantitative variables, which are numerical and can be measured on a scale (e.g., age, salary), we primarily look at:
- Correlation Coefficient (Pearson)
- The Pearson correlation coefficient measures how strongly two numeric variables are linearly related.
- It ranges from -1 to +1.
- Interpretation:
- Values close to +1 indicate a strong positive linear correlation (as one variable increases, the other tends to increase).
- Values close to -1 indicate a strong negative linear correlation (as one variable increases, the other tends to decrease).
- Values close to 0 suggest a weak or no linear correlation.
- Limitations: Correlation only captures linear relationships and might not detect strong nonlinear relationships.
R Code Snippet:
# View correlation between Sepal.Length and Petal.Length
print(cor(iris$Sepal.Length, iris$Petal.Length))
✅ You’ll get a value close to +0.87 — meaning a strong positive relation.
Covariance: This is a related measure that indicates how two variables change together, but it is not standardised like correlation.
- Visualising Quantitative Relationships: Scatter Plots
- What they are: Scatter plots are graphical representations where each data point is plotted as a dot based on its values for two variables. They are essential for understanding the relationship's direction and spread.
- Interpretation:
- If points generally slope from lower-left to upper-right, it suggests a positive correlation.
- If points generally slope from upper-left to lower-right, it suggests a negative correlation.
- Scattered points without a clear pattern indicate weak or no linear correlation.
R Code Snippets:
data(iris)
plot(iris$Sepal.Length, iris$Petal.Length)
3. Scatterplot Matrix: A scatterplot matrix shows scatter plots between every pair of numeric variables in a dataset — all in one grid. It helps you quickly see which variables are related.
🧪 Example: mtcars is a built-in R dataset describing 32 car models, with variables such as:
- mpg: miles per gallon (fuel efficiency)
- disp: displacement (engine size)
- hp: horsepower (engine power; higher means a faster car)
- wt: weight (in 1000 lbs; heavier cars have higher numbers)
📊 R Code:
data(mtcars)
pairs(mtcars[, c("mpg", "disp", "hp", "wt")])

🧠 Interpretation:
mpg vs wt: The dots slope downward → as car weight increases, mpg decreases → negative relationship
hp vs disp: Dots go up → bigger engine → more horsepower → positive relationship
🎯 Use in Predictive Analytics:
Helps you decide which variables to include in regression or machine learning models.
Used in multivariate analysis to detect clusters or relationships.
4. Local Regression (LOESS / LOWESS)
Local regression fits a smooth curve to your data — not a straight line.
It’s great when the relationship between variables is not linear (i.e., not a straight line).
🧪 Example:
Let’s say you want to study the relationship between Petal.Length and Petal.Width in the iris dataset, but it’s not perfectly linear.
📊 R Code:
plot(iris$Petal.Length, iris$Petal.Width, pch = 19, col = "blue")
# Add LOESS smooth curve
lines(lowess(iris$Petal.Length, iris$Petal.Width), col = "red", lwd = 2)
🧠 Interpretation:
The red curve shows the actual pattern in the data.
If the curve bends, the relationship is nonlinear.
🎯 Use in Predictive Analytics:
Captures nonlinear trends missed by linear regression.
Used for smoothing noisy data before model training.
Helps visualize and understand complex patterns.
Useful in time series forecasting, customer behavior modeling, etc.
| Technique | Purpose | R Function | Used For |
| --- | --- | --- | --- |
| Scatterplot Matrix | Visualize relationships among variables | pairs() | Feature selection, EDA |
| Local Regression | Fit smooth (nonlinear) curves | lowess() or loess() | Smoothing, nonlinear modeling |
II. Measuring Relationships Between Qualitative/Categorical Variables
For qualitative or categorical variables, which represent categories or labels (e.g., gender, marital status), different methods are used:
- Chi-squared (χ²) Test
The Chi-squared test helps us find out if two categorical (qualitative) variables are related.
You run a bookstore and want to check if gender affects book preference:
- Do males prefer fiction more than females?
- Or is book preference independent of gender?
📊 Example Data:
| Gender | Book Type | Count |
| --- | --- | --- |
| Male | Fiction | 30 |
| Male | Nonfiction | 20 |
| Female | Fiction | 10 |
| Female | Nonfiction | 40 |
book_data <- matrix(c(30, 20, 10, 40), nrow = 2, byrow = TRUE)
colnames(book_data) <- c("Fiction", "Nonfiction")
rownames(book_data) <- c("Male", "Female")
# Perform Chi-squared Test
chisq.test(book_data)
✅ Output Insight: X-squared = 15.042, df = 1, p-value = 0.0001052
- If the p-value < 0.05, there is a relationship between gender and book type.
- If the p-value > 0.05, they are independent. Since the p-value 0.0001052 < 0.05, there is a relationship between gender and book type.

- Association Rules
- What they are: These are "IF-THEN" statements that describe relationships between items in a dataset, commonly used in "market basket analysis" to find frequently co-occurring products.
- Key measures:
- Support: How often the items in the rule appear together in the dataset.
- Confidence: How often the "THEN" part of the rule is true when the "IF" part is true.
- Example: "IF a customer buys bread AND butter THEN they also buy milk".
- Graphical Models
Graphical models show connections between multiple variables using nodes and edges:
- Nodes = variables (like age, income)
- Edges = relationships (strong or weak)
- No edge = the variables are conditionally independent
Q) Extent of Missing Data
🔹 What is Missing Data?
Sometimes, some values in a dataset are empty or not available.
This is called missing data.
🔹 Why Does Data Go Missing?
| Reason | Example |
| --- | --- |
| ❌ Not recorded properly | Student forgot to write age on a form |
| 📂 Data lost during transfer | File corrupted while saving |
| 👨‍💼 Person refused to answer | Patient did not share income details |
| 🔍 System error or bug | App failed to collect GPS location |
🔹 Extent of Missing Data
- Extent means how much data is missing in the dataset.
- We calculate the percentage or number of missing values.
🧮 Example:
If a column has 100 values and 10 are missing → 10% missing
🔹 Why is Missing Data a Problem?
| Problem | Meaning |
| --- | --- |
| 📉 Reduces accuracy | Wrong results in analysis or models |
| 🚫 Some methods cannot work | Some tools need complete data |
| 💡 May hide important patterns | We may miss useful relationships |
🔹 What to Do with Missing Data?
| Method | Meaning |
| --- | --- |
| ❌ Delete rows | Remove rows that have missing values |
| 📥 Fill with average/median | Use the average value to replace the missing one |
| 🔁 Predict missing values | Use machine learning to guess the value |
| 🚫 Leave as is (carefully) | Sometimes okay if only a few values are missing |
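R Code Example (a minimal sketch of measuring the extent of missing data and handling it; the data frame is made up):
df <- data.frame(age    = c(21, NA, 25, 30, NA),
                 income = c(30000, 42000, NA, 51000, 38000))

colSums(is.na(df))                     # number of missing values per column
round(colMeans(is.na(df)) * 100, 1)    # percentage missing per column

df_complete <- na.omit(df)             # option 1: delete rows with missing values
df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)   # option 2: fill with the mean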
Q) Cluster Analysis (Segmentation)
Cluster analysis is a powerful technique used in data mining to group data items that are similar to each other. Imagine you have a large collection of items, but you don't have any pre-defined categories or labels for them. Cluster analysis helps you to discover natural groupings or hidden patterns within this data.
The main idea is to make sure that:
- Items within the same group (called a cluster) are very much alike.
- Items in different groups are very different from each other.
This process is often called data segmentation because it effectively divides a large set of data into smaller, more manageable parts or segments.
How Do We Find Clusters? (Common Methods)
There are several ways to perform cluster analysis, each with its own approach:
- Partitioning Methods (e.g., k-Means)
- Idea: These methods divide the data into a specific number of groups, which you decide beforehand (let's call this number 'k').
- How k-Means works (simplified):
- You tell the computer how many groups ('k') you want to find.
- The algorithm picks 'k' starting points, which act like "centre points" for your groups.
- Every data item is then put into the group whose "centre point" is closest to it.
- Once all items are assigned, the computer recalculates the actual "centre point" for each group based on where all the items in that group are located.
- Steps 3 and 4 are repeated until the groups no longer change much.
- Good for: Finding clusters that are generally round or spherical.
- Important Note: A challenge is deciding the best number of 'k' groups before you start. Finding the exact best solution can be very difficult computationally, so simpler, step-by-step (greedy) methods like k-means are commonly used.
- Hierarchical Methods
- Idea: These methods build a tree-like structure of clusters. You don't need to specify 'k' upfront.
- Two main types:
- Agglomerative (Bottom-up): This approach starts by treating every single data item as its own tiny cluster. Then, it repeatedly merges the two closest clusters together. This continues until all items are eventually merged into one large cluster, or until a certain stopping point is reached.
- Divisive (Top-down): This is the opposite. It starts with all data items in one big cluster and then repeatedly divides the clusters into smaller and smaller ones.
- Result: The clustering can be shown as a tree diagram, called a dendrogram, which helps visualise how clusters are related at different levels.
- Important Note: For very large datasets, these methods can require a lot of computing power and memory.
- Density-Based Methods (e.g., DBSCAN)
- Idea: These methods find clusters by looking for areas where data points are densely packed together. Sparse areas between dense regions are considered boundaries, and isolated points are often seen as "noise" or outliers.
- Good for: Discovering clusters that have irregular or complex shapes (not just circles or spheres), and for finding noisy data points that don't belong to any cluster.
Example: Iris Dataset
Let's imagine the Iris dataset, which is a famous collection of measurements (like sepal length, sepal width, petal length, and petal width) for 150 iris flowers. These flowers actually belong to three known species (setosa, versicolor, and virginica).
- How segmentation is applied: If you didn't know the species beforehand, you could use cluster analysis on these measurements.
- What it does: A k-means algorithm, for example, if told to find 3 clusters, would group the 150 flowers into three distinct groups based purely on their measurements. You could then check how well these automatically formed groups match the actual known species of the flowers. This helps to see if the measurements alone are good enough to tell the species apart.
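R Code Example (a minimal sketch of the k-means segmentation described above on the iris measurements; the cluster numbering is arbitrary):
data(iris)
set.seed(42)                               # make the clustering reproducible
km <- kmeans(iris[, 1:4], centers = 3, nstart = 20)   # 3 clusters from the 4 measurements
table(Cluster = km$cluster, Species = iris$Species)   # compare clusters with the true species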
Why is Cluster Analysis Important? (Applications)
Cluster analysis (segmentation) is used in many different areas:
- Customer Segmentation: Businesses group customers based on their buying habits or preferences. This helps them create more effective marketing campaigns for specific customer groups.
- Document Organisation: Grouping news articles or documents that talk about similar topics.
- Biology: Identifying groups of genes or proteins that show similar behaviour.
- City Planning: Segmenting areas in a city based on housing types or population characteristics for urban development.
Q) short note on Outlier Detection
- An outlier is a data point that is very different from other points.
- It is unusual, and may be an error or a rare event.
🔹 Why Detect Outliers?
| Reason | Example |
| --- | --- |
| 🚨 To find errors | A person with age = 200 (not possible) |
| 🕵️‍♂️ To catch fraud | A credit card used in 3 countries in 1 hour |
| 📊 To improve accuracy | Wrong data affects model results |
🔹 How to Detect Outliers?
| Method | Simple Meaning |
| --- | --- |
| Box Plot | Shows outliers as dots outside the box |
| Z-Score | If a value is far from the average (mean), it is an outlier |
| IQR Method | If a value falls far outside the Q1–Q3 range, it is an outlier |
| Scatter Plot | Points that lie far away from the others |
🔹 Example:
Marks: 45, 50, 55, 60, 150
→ 150 is an outlier, because it is too high compared to others.
Outlier = A value that is far away or unusual.
Detecting outliers helps improve data quality and detect fraud.
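R Code Example (a minimal sketch using the marks example above; boxplot.stats() applies the box plot / IQR rule):
marks <- c(45, 50, 55, 60, 150)
boxplot(marks)                 # 150 appears as a dot outside the whiskers
boxplot.stats(marks)$out       # returns 150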
Q) Outlier Detection
Outlier detection is about finding data points that are significantly different from the majority of other data in a dataset. These unusual points are also known as anomalies. An outlier represents something that does not conform to the expected pattern of the data. For example, a credit card transaction that is much larger than a customer's usual spending might be flagged as an outlier, potentially indicating fraud.
It is important to differentiate outliers from 'noise'. While noise refers to random errors or irrelevant data that one usually aims to clean from a dataset, outliers often carry valuable information about unusual events or behaviours.
What are the Different Kinds of Outliers?
Outliers can be categorised into three main types:
- Global Outliers:
📌 What it means:
A data point is very different from all other data, no matter the situation or context.
✅ Simple Example:
Most people finish a test in 30 to 60 minutes, but one person takes 5 hours.
That’s clearly unusual — it’s a global outlier.
- Contextual Outliers:
A data point seems normal in general, but becomes unusual in a specific context or condition.
✅ Simple Example:
- A speed of 80 km/h is normal on a highway 🚗 — not an outlier.
- But 80 km/h in a school zone? That’s very unusual — it's a contextual outlier.
It only becomes an outlier when you look at the context (school zone).
Collective Outliers
📌 What it means:
A group of values together is unusual, even if each value alone doesn’t seem odd.
✅ Simple Example:
One student getting a low grade on a quiz isn’t strange.
But if an entire class of students suddenly scores very low on the same quiz — that’s a collective outlier.
It may suggest a problem with the quiz or something else that affected everyone.
How are Outliers Found?
There are several primary approaches to detecting outliers:
- Statistical Methods:
These methods look at the overall pattern of the data. Most of the time, we assume the data follows a common shape, like the bell curve (also called a normal distribution).
- Outliers are identified as data points that have a very low probability of occurring under this assumed distribution.
- Boxplots are a simple visual tool that can help in identifying potential outliers.
- Proximity-Based Methods: If something is far away from its neighbours, it is probably an outlier. The Local Outlier Factor (LOF) checks how close a data point is to its neighbours; if it lies in a low-density area, it is treated as an outlier.
- Clustering-Based Methods: Normal points form big groups (clusters). If a point does not belong to a group, or belongs to a tiny group, it is an outlier. Clustering algorithms (like k-Means or DBSCAN) are used to find the groups, and points that do not fit into any group are flagged as outliers.
- Classification-Based Methods: If you already know what “normal” and “abnormal” look like, you can train a model to detect new outliers.
📌 Real Method: Train a model on normal data only (e.g., only healthy patients). When it sees something that does not fit the pattern (e.g., an unusual health reading), it marks it as an outlier.
Unit III: Model development & techniques Data Partitioning, Model selection, Model Development Techniques, Neural networks, Decision trees, Logistic regression, Discriminant analysis, Support vector machine, Bayesian Networks, Linear Regression, Cox Regression, Association rules.
Model development is like creating a "smart program" or a "mathematical model" that learns from existing data to make predictions or find patterns in new, unseen data. This process helps translate a real-world problem into something a computer can solve, and then turn the computer's answers back into useful solutions.
Q) Data Partitioning (Splitting Data)
When developing a model, it is crucial to test its performance on data it has not seen before. This is like a student studying for an exam: you want to see if they can answer new questions, not just the ones they memorized. This is why data is split into different parts.
- Training Set: This is the largest part of your data, used to build and "teach" the model. The model learns patterns and relationships from this data.
- Test Set: This part of the data is kept separate and is used only once, at the very end, to measure how well the final model performs on new, unseen data.
- Validation Set (Optional but Recommended): After training, the model is evaluated multiple times on the validation set to:
1) Tune the model's hyperparameters (e.g., learning rate, number of layers, max iterations)
2) Decide which version of the model performs better
A typical way to split data is 50% for training, 25% for validation, and 25% for testing.
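R Code Example (a minimal sketch of a 50/25/25 split; the proportions come from the text, while the use of the built-in iris data is an assumption for illustration):
data(iris)
set.seed(1)                          # reproducible split
n   <- nrow(iris)
idx <- sample(n)                     # shuffle the row indices

train <- iris[idx[1:round(0.50 * n)], ]
valid <- iris[idx[(round(0.50 * n) + 1):round(0.75 * n)], ]
test  <- iris[idx[(round(0.75 * n) + 1):n], ]

c(train = nrow(train), validation = nrow(valid), test = nrow(test))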
Q) Model Selection (Choosing the Best Model)
Once you have built different models, you need to decide which one is the "best" for your specific problem.
1. Evaluation on Test Set: The primary way to select a model is to evaluate its performance (e.g., accuracy for classification) on the test set. A model that performs well on the test set is expected to perform well on real-world, new data.
2. Cross-Validation
- Idea: Instead of depending on just one test/train split, we repeatedly split the data and test each time.
- How it works (k-fold CV):
- Split data into k equal parts (folds).
- In each round, use one fold as test set and the rest as training set.
- Repeat for all k folds.
- Average the performance across all rounds.
- Why used in model selection?
- Reduces risk of choosing a model that performs well just by chance on one test set.
- Gives a more reliable estimate of model’s ability to generalize.
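R Code Example (a minimal sketch of 5-fold cross-validation for a simple linear model; the mtcars data and the mpg ~ wt + hp formula are assumptions chosen only for illustration):
data(mtcars)
set.seed(1)
k      <- 5
folds  <- sample(rep(1:k, length.out = nrow(mtcars)))   # assign each row to a fold
errors <- numeric(k)

for (i in 1:k) {
  train <- mtcars[folds != i, ]              # k-1 folds for training
  test  <- mtcars[folds == i, ]              # held-out fold for testing
  fit   <- lm(mpg ~ wt + hp, data = train)
  pred  <- predict(fit, newdata = test)
  errors[i] <- mean((test$mpg - pred)^2)     # mean squared error on this fold
}
mean(errors)                                 # average performance across the folds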
3. Bootstrap
- Idea: Create multiple “new datasets” by sampling the original training data with replacement.
- For each bootstrap sample:
- Train a model on sampled data.
- Test it on the data points that were not sampled (called out-of-bag data).
- Repeat this process many times and average the results.
- Why used in model selection?
- Helps when dataset is small.
- Provides an estimate of stability and reliability of model performance.
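R Code Example (a rough sketch of bootstrap evaluation with out-of-bag rows, again using mtcars as an assumed example):
data(mtcars)
set.seed(1)
B       <- 100
oob_mse <- numeric(B)

for (b in 1:B) {
  boot_idx <- sample(nrow(mtcars), replace = TRUE)    # sample rows with replacement
  oob_idx  <- setdiff(1:nrow(mtcars), boot_idx)       # rows never sampled = out-of-bag
  fit  <- lm(mpg ~ wt + hp, data = mtcars[boot_idx, ])
  pred <- predict(fit, newdata = mtcars[oob_idx, ])
  oob_mse[b] <- mean((mtcars$mpg[oob_idx] - pred)^2)
}
mean(oob_mse)                                         # average out-of-bag error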
4. Information Criteria (AIC, BIC)
Sometimes, two models perform almost equally well on test data. Should we always pick the more complex one?
→ Not necessarily! Complex models may overfit.
- AIC (Akaike Information Criterion)
- Balances fit vs complexity.
- Formula (conceptually):
AIC = Error (badness of fit) + Penalty (number of parameters)
- Lower AIC → Better model.
- BIC (Bayesian Information Criterion)
- Similar to AIC, but penalizes complexity more strongly.
- Works well when we want very simple and interpretable models.
🔑 In simple English:
- If two models predict almost equally well, prefer the one with fewer parameters.
- AIC and BIC give us a number to compare models: the smaller value = better choice.


In formula form, with L the maximised likelihood of the model, k the number of estimated parameters, and n the number of observations:
AIC = 2k - 2 ln(L)
BIC = k ln(n) - 2 ln(L)
Here k counts the fitted parameters; for example, k = 5 if your model has 5 coefficients, weights, or nodes.
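R Code Example (a minimal sketch comparing a simpler and a more complex model with base R's AIC() and BIC(); the mtcars data and the formulas are assumptions for illustration):
data(mtcars)
simple  <- lm(mpg ~ wt, data = mtcars)
complex <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

AIC(simple, complex)   # lower AIC is preferred
BIC(simple, complex)   # BIC penalises the extra parameters more strongly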
Q) Association Rules
Association Rules are like "if-then" statements that describe relationships between different items or events in a large collection of data. They tell us that if one thing happens (the "if" part or condition), then another thing is likely to happen (the "then" part or consequence).
You can think of it like this: Condition ⇒ Consequence.
For example, in a computer store, an association rule might be: If a customer buys Computer, then they also buy Anti-Virus.
To find and understand these rules, we use some key ideas:
- Support: This tells how common the rule is across all transactions.
For example, consider the rule: Computer ⇒ Anti-virus.
Support = (number of transactions containing both Computer and Anti-virus) / (total number of transactions)
Itemsets with high support are called frequent itemsets.
- Confidence: This tells how often anti-virus is bought when a computer is bought; in other words, “given that someone bought a computer, how often did they also buy anti-virus?”
Confidence = (number of transactions containing both Computer and Anti-virus) / (number of transactions containing Computer)
It shows the "strength" of the rule.
- Lift: This measures how much more likely the "consequence" is to happen when the "condition" is met, compared to when the "condition" is not met.
- If Lift = 1, the items are independent (no real relationship).
- If Lift > 1, there is a positive relationship (buying Computer "lifts" the chance of buying Anti-Virus).
- If Lift < 1, there is a negative relationship (buying Computer decreases the chance of buying Anti-Virus). This is important because a rule can have high support and confidence but still be misleading if the items are negatively correlated.
How are Association Rules Found?
Finding association rules is usually a two-step process:
- Find Frequent Itemsets: First, identify all groups of items that appear together very often, based on a minimum "support" level.
- Generate Rules: Then, use these frequent itemsets to create rules that meet a minimum "confidence" level.
Common algorithms for finding these rules include:
- Apriori Algorithm: This is one of the oldest and most famous methods. It works by first finding small groups of frequently bought items, then using these small groups to build larger ones, step by step.
- ECLAT (Equivalence Class Transformation): This method works by changing how the data is stored to make finding frequent itemsets more efficient, especially for vertical data formats. You can implement both Apriori and ECLAT using R programs.
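R Code Example (a minimal sketch using the arules package and its built-in Groceries transactions dataset; the package choice and the support/confidence thresholds are assumptions made for illustration):
library(arules)                 # install.packages("arules") if needed
data(Groceries)                 # built-in market-basket (transactions) dataset

# Apriori: rules with minimum support 1% and minimum confidence 50%
rules <- apriori(Groceries, parameter = list(supp = 0.01, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 5))       # top 5 rules by lift

# ECLAT: frequent itemsets only (rules can be derived from them afterwards)
itemsets <- eclat(Groceries, parameter = list(supp = 0.01))
inspect(head(sort(itemsets, by = "support"), 5))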
Why are Association Rules Useful? (Applications)
Association rules are used in many real-world situations, such as:
- Market Basket Analysis: This is the most common use. Supermarkets use it to understand what customers buy together. For example, if customers often buy "beer" and "diapers" together, the store might place them closer on the shelves or offer special deals.
- Target Marketing: Companies use rules to identify specific groups of customers who are likely to buy certain products. For instance, if people aged 85-95 often buy a certain brand of checkers, then the company can target advertisements for checkers to this age group.
Q) Cox Regression
Cox Regression is a type of statistical model used when you want to predict the time until an event happens. Think of it like trying to predict how long something will last before a specific event occurs.
- What it's for: It is often used in situations where the "event" is something like patient survival (e.g., how long a patient lives after treatment) or the time until a machine fails. Because some observations might not have experienced the event yet (e.g., a patient is still alive at the end of the study, or a machine is still working), this type of data is called "censored survival data".
- How it helps: It helps to understand how different factors (called “predictors” or "covariates" or "features") influence the time until that event. For example, in medical research, it can help doctors understand if a new treatment or a patient's age affects how long they live.
- Example: It can be used in studies involving gene expression data to predict survival time in patients, such as those with lymphoma. This model helps to find which genes might be important in predicting how long these patients survive.
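A minimal R sketch using the survival package; the built-in lung dataset is used only as an illustration (not the lymphoma study mentioned above):
library(survival)
# Time-to-event outcome: survival time and censoring status of lung-cancer patients
fit <- coxph(Surv(time, status) ~ age + sex, data = lung)
summary(fit)   # hazard ratios show how each covariate influences survival time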
Q) Explain in detail about linear regression and multiple linear regression.
Regression analysis is a very widely used statistical tool.
It is used to establish a relationship model between two variables.
One variable is called a dependent or response variable whose value must be predicted.
Other variable is called an independent or predictor variable whose value is known.
In Linear Regression these two variables are related through an equation.
Mathematically a linear relationship represents a straight line.
A non-linear relationship creates a curve.
The general mathematical equation for a linear regression is −
y = ax + b
Following is the description of the parameters used −
y is the dependent variable.
x is the independent variable.
a and b are constants which are called the coefficients.
Steps to Establish a Regression
A simple example of regression is to predict the weight of a person when his height is known. To do this we need to have the relationship between height and weight of a person.
Here y is weight and x is height.
The steps to create the relationship are −
- Gather the height and weight of a few people.
- Create a relationship model using the lm() function in R.
- Find the coefficients from the model.
- Know the average error in prediction (the residuals).
- Use the predict() function to predict the weight of new persons.
For example:
heightx <- c(1,2,3)
weighty <- c(1,3,4)
relation <- lm(weighty~heightx) # Apply the lm() function.
print(relation)
When we execute the above code, it gives the values of a and b as coefficients:
b = -0.3333, a = 1.5000
Hence the fitted line equation is y = 1.5x - 0.33.
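Continuing the same toy example, step 5 (prediction with predict()) might look like this:
# Predict the weight for a new person with height 2.5 using the fitted model
newheight <- data.frame(heightx = 2.5)
predict(relation, newheight)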
Multiple regression is an extension of linear regression.
It finds relationships between more than two variables. In simple linear relation we have one independent and one dependent variable, but in multiple regression we have more than one independent variable and one dependent variable.
The general mathematical equation for multiple regression is −
y = a1x1+a2x2+...+b
Following is the description of the parameters used −
y is the response variable.
b, a1, a2...an are the coefficients.
x1, x2, ...xn are the predictor variables.
We create the regression model using the lm() function in R. The model determines the value of the coefficients using the input data. Next we can predict the value of the response variable for a given set of predictor variables using these coefficients.
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in multiple regression is −
lm(weighty ~ heightx+agex)
heightx <- c(1, 2, 3)
weighty <- c(1, 3, 4)
agex <- c(0, 3, 4)
relation <- lm(weighty ~ heightx + agex)   # fit the multiple regression model
print(relation)
newdata <- data.frame(heightx = 2.5, agex = 3)
predict(relation, newdata)                 # predict weight for the new data point
Q) Explain Feed forward neural networks
A feedforward neural network is a type of artificial neural network where the information flows only in one direction, from input to output, without any feedback or loops.
In a feedforward neural network, the input layer receives the input data and passes it to the first hidden layer. Each neuron in the hidden layer applies a mathematical function to the input and passes the output to the next layer. This process is repeated for all the hidden layers until the output layer is reached, which produces the final output of the network.

The output of each neuron is calculated by applying a weighted sum of the inputs and passing the result through an activation function. The weights are learned during the training process, where the network adjusts the weights to minimize the error between the predicted output and the actual output.
Feedforward neural networks are commonly used for a variety of tasks, including classification, regression, and pattern recognition. They are also used as building blocks for more complex neural network architectures, such as convolutional neural networks and recurrent neural networks.
Q) Explain backpropagation?
The backpropagation algorithm works by propagating the error backwards from the output layer to the input layer, adjusting the weights of the neurons in each layer along the way.
During training, the input data is fed into the neural network, and the output of the network is compared to the actual output. The difference between the predicted output and the actual output is called the error, and this error is used to adjust the weights of the neurons in the network.
The backpropagation algorithm starts by computing the error at the output layer, and then propagating this error backwards through the network, layer by layer. The amount of error that each neuron contributes to the output is computed by taking the partial derivative of the error with respect to the output of the neuron. The weights of the neurons are then adjusted based on the amount of error they contributed to the output.
The backpropagation algorithm is typically used in conjunction with gradient descent optimization, which is used to minimize the error in the network by adjusting the weights of the neurons in the direction of the steepest descent of the error surface.
Backpropagation is an important technique for training neural networks and is used in many popular neural network architectures, including feedforward neural networks, convolutional neural networks, and recurrent neural networks.
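As a rough illustration of the idea (not the full algorithm), one gradient-descent weight update for a single sigmoid neuron could look like this in R; all numbers are made up:
# One training example (made-up numbers)
x <- c(1, 0.5)        # inputs
t <- 1                # target output
w <- c(0.2, -0.1)     # current weights
b <- 0                # bias
eta <- 0.5            # learning rate
sigmoid <- function(z) 1 / (1 + exp(-z))
# Forward pass
y <- sigmoid(sum(w * x) + b)
# Backward pass: gradient of the squared error E = 0.5 * (y - t)^2
delta <- (y - t) * y * (1 - y)   # dE/dz via the chain rule
w <- w - eta * delta * x         # adjust weights against the gradient
b <- b - eta * delta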
Q) Linear Discriminant Analysis (LDA)
When the class label (response variable) has more than 2 classes, we can use LDA.
The objective of LDA is to perform dimensionality reduction. It can also be used for classification. In LDA, we create a new axis and project the data onto it.

As shown in the figure above, the two-dimensional data is projected onto one dimension, so LDA has reduced the data by one dimension. But this particular axis does not separate the two classes (red and blue).
LDA instead chooses an axis that separates the two classes, as shown below. The axis is chosen to maximize the distance between the means of the 2 classes (red and blue) while minimizing the scatter within each class.

https://www.youtube.com/watch?v=azXCzI57Yfc
https://www.youtube.com/watch?v=DVqpwsRxjKQ
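A minimal R sketch of LDA using the MASS package; the built-in iris data (3 classes) is only an illustrative choice:
library(MASS)
# Fit LDA: project the 4 measurements onto new discriminant axes
fit <- lda(Species ~ ., data = iris)
fit$scaling                     # coefficients defining the new axes
# Project the data and classify
proj <- predict(fit, iris)
table(Predicted = proj$class, Actual = iris$Species)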
Q) Decision trees
Classification by Decision Tree Induction (or: Discuss building a decision tree and how a decision tree works).
Decision tree induction is the learning of decision trees from class-labeled training tuples.
- A decision tree is a flowchart-like tree structure, where
- Each internal node denotes a test on an attribute.
- Each branch represents an outcome of the test.
- Each leaf node holds a class label.
- The topmost node in a tree is the root node.

Advantages of Decision trees
- A significant advantage of a decision tree is that it forces the consideration of all possible outcomes of a decision and traces each path to a conclusion.
- The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and therefore is appropriate for exploratory knowledge discovery.
- Decision trees can handle high dimensional data.
- Their representation of acquired knowledge in tree form is easy to understand.
- They are robust to noisy data.
- The learning and classification steps of decision tree induction are simple and fast.
- In general, decision tree classifiers have good accuracy.
- Decision tree induction algorithms have been used for classification in many application areas, such as medicine, manufacturing and production, financial analysis, astronomy, and molecular biology.
Disadvantages
- Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
- Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.



In the example below:
- outlook is the root node
- humidity and wind are attributes/columns
- yes or no are the class labels
This example shows whether a child can play outside or not.
In a decision tree diagram, rectangles represent attributes/columns and ellipses represent class labels.
The main purpose of decision tree is to extract the rules for classification.
Example: if outlook = sunny and humidity = normal then play = yes
if outlook = overcast then play = yes
if outlook = rainy and wind = low then play = yes
Types of Decision Trees
1. Unweighted decision tree: there is no weight on any node of the tree, i.e., there are no biases in the decision tree.
2. Weighted decision tree: nodes (or branches) of the tree carry weights/biases.
3. Binary decision tree: every test is two-way, i.e., each internal node has at most two branches (for example, yes/no tests or two class labels).
4. Random forest: a combination of n decision trees.
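A minimal R sketch of decision tree induction with the rpart package; the built-in iris data is used here instead of the play/weather example:
library(rpart)
# Grow a classification tree
tree <- rpart(Species ~ ., data = iris, method = "class")
print(tree)                     # shows the extracted IF-THEN style splits
# Classify a new flower
new_flower <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.5,
                         Petal.Length = 1.4, Petal.Width = 0.2)
predict(tree, new_flower, type = "class")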
Q) What is a Neural Network?
A Neural Network is a machine learning model inspired by how the brain works.
It tries to learn patterns from data and make predictions — even if the relationship is complex or non-linear.
🧠 Basic Structure
| Layer | Meaning |
| --- | --- |
| Input Layer | Takes the input variables (like age, income, etc.) |
| Hidden Layers | Perform the calculations; this is where learning happens |
| Output Layer | Gives the prediction (Yes/No, number, etc.) |
Each layer is made of neurons (also called nodes or units), and each neuron does a weighted sum + activation.
🔧 Key Concepts
| Concept | Meaning |
| --- | --- |
| Weights | Strength of connection between neurons |
| Activation Function | Decides how much signal to pass on (like sigmoid, ReLU) |
| Learning | Adjusts weights using training data to reduce error |
| Backpropagation | Technique to update weights by checking error at the output |
| Epoch | One complete pass over the training data |
📘 R Code (Using nnet Package)
# Load package
library(nnet)
# Build neural network model (train_data is a placeholder training data frame)
model <- nnet(Buy ~ Age + Income, data = train_data, size = 3)
# Predict on new data (test_data is a placeholder data frame of new cases)
predict(model, newdata = test_data, type = "class")
📋 Example Dataset
| Age | Income | Buy |
| --- | --- | --- |
| 25 | 40000 | No |
| 45 | 85000 | Yes |
| 35 | 60000 | Yes |
| 30 | 50000 | No |
# Sample data
data <- data.frame(
Age = c(25, 45, 35, 30),
Income = c(40000, 85000, 60000, 50000),
Buy = c("No", "Yes", "Yes", "No")
)
# Convert target to factor
data$Buy <- as.factor(data$Buy)
# Load library
library(nnet)
# Train the neural network with 3 hidden nodes
model <- nnet(Buy ~ Age + Income, data = data, size = 3)
# Predict for a new customer
new_customer <- data.frame(Age = 40, Income = 70000)
predict(model, new_customer, type = "class")
Q) Logistic Regression
Logistic regression is used to predict a “Yes/No” outcome, based on one or more input variables.
- Example: Will a customer buy the product or not?
- Output is 0 or 1, not a number like in linear regression.
Steps in logistic regression are as follows:
🔹 STEP 1: Linear Combination (Just like Linear Regression)
We calculate:
z = b0 + b1*x1 + b2*x2 + ...
This is like saying:
z = intercept + age coefficient × age + income coefficient × income
🔹 STEP 2: Sigmoid Function Converts z to a Probability
p = 1 / (1 + e^(-z))
This squashes any number (even negative or very large) to a range between 0 and 1, which is perfect for probabilities.
🔹 STEP 3: Decision Rule (Classify)
If the probability > 0.5 → Predict Yes
If the probability ≤ 0.5 → Predict No
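A minimal R sketch of these three steps using glm(); the tiny Buy/Age/Income data is made up, mirroring the neural-network example above:
# Made-up training data
data <- data.frame(
  Age    = c(25, 45, 35, 30),
  Income = c(40000, 85000, 60000, 50000),
  Buy    = factor(c("No", "Yes", "Yes", "No"))
)
# Steps 1 and 2: glm() fits b0, b1, b2 and applies the sigmoid internally
# (with such a tiny, perfectly separable dataset R may warn about fitted probabilities of 0 or 1)
model <- glm(Buy ~ Age + Income, data = data, family = binomial)
# Predicted probability for a new customer
p <- predict(model, data.frame(Age = 40, Income = 70000), type = "response")
# Step 3: decision rule
ifelse(p > 0.5, "Yes", "No")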
Example: Predicting whether a mouse is obese or not

Q) Bayesian networks
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −
- Posterior Probability [P(H|X)]
- Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H|X) = P(X|H) P(H) / P(X)
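A quick numeric illustration in R; the probabilities are made-up values, only to show how the formula is applied:
# Made-up values: H = "patient has the disease", X = "test is positive"
p_H      <- 0.01   # prior P(H)
p_X_H    <- 0.95   # likelihood P(X|H)
p_X_notH <- 0.10   # P(X | not H)
# Total probability of the data: P(X)
p_X <- p_X_H * p_H + p_X_notH * (1 - p_H)
# Bayes' theorem: posterior P(H|X) = P(X|H) * P(H) / P(X)
p_H_X <- p_X_H * p_H / p_X
p_H_X   # about 0.088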
Bayesian Belief Network
Bayesian Belief Networks specify joint conditional probability distributions. They are also known as Belief Networks, Bayesian Networks, or Probabilistic Networks.
- A Belief Network allows class conditional independencies to be defined between subsets of variables.
- It provides a graphical model of causal relationship on which learning can be performed.
- We can use a trained Bayesian Network for classification.
There are two components that define a Bayesian Belief Network −
- Directed acyclic graph
- A set of conditional probability tables
Directed Acyclic Graph
- Each node in a directed acyclic graph represents a random variable.
- These variables may be discrete or continuous-valued.
- These variables may correspond to the actual attribute given in the data.
Directed Acyclic Graph Representation
The following diagram shows a directed acyclic graph for six Boolean variables.

The arc in the diagram allows representation of causal knowledge. For example, lung cancer is influenced by a person's family history of lung cancer, as well as whether or not the person is a smoker. It is worth noting that the variable PositiveXray is independent of whether the patient has a family history of lung cancer or that the patient is a smoker, given that we know the patient has lung cancer.
Conditional Probability Table
The conditional probability table for the values of the variable LungCancer (LC) showing each possible combination of the values of its parent nodes, FamilyHistory (FH), and Smoker (S) is as follows −

Short answer on SVM
SVM doesn't just draw any line that separates the two classes — it draws the line that:
- Separates the classes correctly
- Maximizes the distance (called margin) between the line and the closest points from each class.
Those closest points are called Support Vectors. They're the key players.
The margin is the space between the line and the closest points from each class.
The support vectors are the data points that lie closest to the line — they’re the ones that "support" the optimal line.
Q) Support vector machines (SVM)
The goal of the SVM algorithm is to create the best line or decision boundary that can segregate n-dimensional space into classes so that we can easily put the new data point in the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These extreme cases are called support vectors, and hence the algorithm is termed a Support Vector Machine. Consider the diagram below, in which two different categories are classified using a decision boundary or hyperplane:
Types of SVM
SVM can be of two types:
- Linear SVM: used for linearly separable data. If a dataset can be classified into two classes using a single straight line, the data is termed linearly separable, and the classifier used is called a Linear SVM classifier.
- Non-linear SVM: used for non-linearly separable data. If a dataset cannot be classified using a straight line, the data is termed non-linear, and the classifier used is called a Non-linear SVM classifier.
Hyperplane and Support Vectors in the SVM algorithm:
Hyperplane: There can be multiple lines/decision boundaries to segregate the classes in n-dimensional space, but we need to find out the best decision boundary that helps to classify the data points. This best boundary is known as the hyperplane of SVM.
The dimension of the hyperplane depends on the number of features in the dataset: with 2 features (as shown in the image), the hyperplane is a straight line; with 3 features, it is a two-dimensional plane.
We always create the hyperplane that has the maximum margin, i.e., the maximum distance between the hyperplane and the nearest data points of each class.
Support Vectors:
The data points (vectors) closest to the hyperplane, which affect its position, are termed support vectors. Since these vectors "support" the hyperplane, they are called support vectors.
How does SVM work?
Linear SVM:
The working of the SVM algorithm can be understood by using an example. Suppose we have a dataset that has two tags (green and blue), and the dataset has two features x1 and x2. We want a classifier that can classify the pair(x1, x2) of coordinates in either green or blue. Consider the below image:

Since this is a 2-D space, we can easily separate these two classes just by using a straight line. But there can be multiple lines that separate these classes. Consider the image below:

Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary or region is called a hyperplane. The SVM algorithm finds the closest points of the lines from both classes. These points are called support vectors. The distance between these vectors and the hyperplane is called the margin, and the goal of SVM is to maximize this margin. The hyperplane with the maximum margin is called the optimal hyperplane.

Non-Linear SVM:
If data is linearly arranged, then we can separate it by using a straight line, but for non-linear data, we cannot draw a single straight line. Consider the below image:

So to separate these data points, we need to add one more dimension. For linear data, we have used two dimensions x and y, so for non-linear data, we will add a third dimension z. It can be calculated as:
z = x² + y²
By adding the third dimension, the sample space will become as below image:

So now, SVM will divide the datasets into classes in the following way. Consider the below image: 
Since we are in 3-D space, the boundary looks like a plane parallel to the x-axis. If we convert it back into 2-D space with z = 1, it becomes:

Hence we get a circle of radius 1 in the case of non-linear data.
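A minimal R sketch of an SVM classifier using the e1071 package; restricting iris to two classes is only an illustrative choice:
library(e1071)
# Two-class subset of iris so a single hyperplane separates the data
two_class <- droplevels(subset(iris, Species != "virginica"))
# Linear SVM; for non-linearly separable data use kernel = "radial"
fit <- svm(Species ~ Petal.Length + Petal.Width, data = two_class,
           kernel = "linear")
fit$SV          # the (scaled) support vectors that define the hyperplane
table(Predicted = predict(fit, two_class), Actual = two_class$Species)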
Unit IV: Automated Data Preparation, Combining data files, Aggregate Data, Duplicate Removal, Sampling Data, Data Caching, Partitioning data, Missing Values. Model Evaluation and Deployment Introduction, Model Validation, Rule Induction Using CHAID.
Here are notes on these data preparation and model evaluation concepts:
Automated Data Preparation
- Refers to Data Preprocessing, a crucial multi-stage process in data mining.
- Transforms raw data into a useful and efficient format for analysis.
- Involves steps like data cleaning, feature extraction, data integration, data reduction, and data transformation.
- It improves data quality, accuracy, and efficiency of subsequent mining algorithms.
- Illustrated as "Forms of data preprocessing" in a figure.
Combining Data Files
- This refers to Data Integration.
- Combines data from multiple, heterogeneous sources (e.g., databases, data cubes, flat files) into a coherent data store.
- Aims to resolve schema inconsistencies, attribute naming variations, and data value conflicts.
- Helps to avoid redundancies.
Aggregate Data
- Often achieved through Data Cube Aggregation.
- Data cubes provide a multidimensional view of data by storing precomputed measures, such as count() or sum(sales).
- Example: a data cube for sales shown in Figure 3.11 and Figure 4.6.
- Facilitates OLAP operations like roll-up (generalizing data) and drill-down (specializing data).
Duplicate Removal
- A key task within Data Cleaning.
- Involves deleting redundant or irrelevant values from a dataset.
- Addresses tuple duplication as a problem during data integration.
Sampling Data
- A Data Reduction technique used to represent a large dataset by a smaller random data sample or subset.
- Common methods include (see the R sketch after this list):
- Simple Random Sampling Without Replacement (SRSWOR).
- Simple Random Sampling With Replacement (SRSWR).
- Cluster sampling.
- Stratified sampling, where data are divided into disjoint strata, and SRS is obtained from each.
- Figure 3.9 illustrates sampling methods.
- Reservoir sampling is a method for maintaining a dynamic sample from data streams.
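A base-R sketch of the simple and stratified sampling methods listed above; the population of student IDs is made up:
set.seed(1)
students <- 1:1000                               # made-up population of student IDs
srswor <- sample(students, 100)                  # SRS without replacement (SRSWOR)
srswr  <- sample(students, 100, replace = TRUE)  # SRS with replacement (SRSWR)
# Stratified sampling: a 10% SRS from each stratum (e.g., grade level)
grade  <- sample(c("A", "B", "C"), 1000, replace = TRUE)
strata <- split(students, grade)
stratified <- unlist(lapply(strata, function(s) sample(s, length(s) %/% 10)))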
Data Caching (In-Memory Data Management for Efficiency)
- While "data caching" is not explicitly named, the goal of keeping data in memory for efficiency is achieved through techniques like:
- Data cube materialization: precomputing all or parts of cuboids so they are readily available for fast query processing.
- Vertical data format (TID lists): used in algorithms like Eclat, where lists of transaction IDs are stored. While memory-intensive, it can be managed by partitioning.
- FP-Tree: compresses the database representation of frequent items, retaining association information to reduce search space and speed up mining.
- Projected databases: often constructed and scanned during recursive calls, it's crucial to maintain them in main memory to avoid disk-access costs.
- Partitioned ensembles: divide the transaction database into main-memory resident segments to reduce memory requirements and disk-access costs.
- AVC-sets: aggregate information regarding training data, stored in main memory to efficiently evaluate split criteria in decision trees.
Partitioning Data
- This term is used in several contexts:
- Clustering methods: construct k partitions (clusters) of the data, where each object typically belongs to exactly one group. Examples: k-means, k-medoids.
- Frequent Itemset Mining: a technique to improve efficiency where the transaction database is divided into n partitions.
- Discretization: a method to transform numeric variables into categorical ones by dividing the range of a numeric attribute into intervals (bins). Examples include equal-frequency (equal-depth) and equal-width partitioning (see the sketch after this list).
- Stratified Sampling: divides the data into mutually disjoint parts called strata.
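A base-R sketch of equal-width and equal-frequency partitioning (discretization); the marks vector is made up:
set.seed(2)
marks <- round(runif(20, 0, 100))            # made-up numeric attribute
# Equal-width partitioning: 4 bins of equal range
equal_width <- cut(marks, breaks = 4)
# Equal-frequency (equal-depth) partitioning: 4 bins with roughly equal counts
equal_freq <- cut(marks, breaks = quantile(marks, probs = seq(0, 1, 0.25)),
                  include.lowest = TRUE)
table(equal_width)
table(equal_freq)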
Missing Values
- A common issue in real-world data, handled as part of Data Cleaning.
- Occur due to imperfect data collection or non-applicability of information.
- Techniques for handling them include (a small R sketch of mean imputation follows this list):
- Ignoring the tuple.
- Filling in the missing value manually.
- Using a global constant (e.g., "unknown").
- Using the attribute mean or median.
- Using the attribute mean or median for all samples belonging to the same class.
- Using the most probable value (e.g., by regression or Bayes inference).
- For dependency-oriented data (like time series or spatial data), values from contextually nearby records can be used for imputation.
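A base-R sketch of mean imputation, one of the techniques listed above; the income vector is made up:
income <- c(40000, NA, 60000, 50000, NA, 85000)   # made-up attribute with missing values
# Fill missing entries with the attribute mean (ignoring NAs when computing it)
income[is.na(income)] <- mean(income, na.rm = TRUE)
income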
Model Evaluation and Deployment Introduction
- Model Evaluation: The process of assessing the generalization performance (prediction capability on unseen data) of a learning method or model.
- It involves comparing the classifier's prediction with the actual class label on a test set.
- Model Deployment: In the data science process, after analytical processing and model building, the model is put into use for decision-making.
Model Validation
- A critical part of model assessment and selection.
- The labeled data should ideally be divided into three parts: training set, validation set, and test set.
- The validation set is used for parameter tuning or model selection.
- Cross-validation is a technique where the dataset is partitioned multiple times to estimate prediction error and is used for model selection.
- Figure 3.10 illustrates information criteria for model selection.
Rule Induction Using CHAID
- CHAID (Chi-squared Automatic Interaction Detection) is a decision tree method.
- It is used for segmentation modeling; it chooses splits (which may be multiway) by testing the association between each predictor and the target with the chi-square statistic.
- Rule induction generally refers to the process of extracting IF-THEN rules from a decision tree. These rules can provide deeper insight into data contents and offer a compressed data representation.
What is Rule Induction?
Rule induction is a method used in machine learning and data mining to find easy-to-understand “if–then” rules from a dataset. These rules help explain patterns or predict outcomes.
- A rule looks like:
IF [some conditions] THEN [result or class]
For example:
IF Age = Youth AND Student = Yes THEN BuysComputer = Yes
- Two important measures for a rule:
- Accuracy: How often is the rule correct?
- Coverage: How many cases in the data does this rule apply to?
Q)🧠 What is CHAID?
CHAID stands for:
Chi-squared Automatic Interaction Detection
It’s a decision tree algorithm used in statistics and data mining to make if–then rules from a dataset. It helps you answer questions like:
“Which features (like age, income, gender) are best at predicting if someone buys a product?”
✅ Where is CHAID Used?
- Marketing (Who will buy?)
- Medical research (Who has risk of a disease?)
- Banking (Who will repay a loan?)
- Education (Which students might fail?)
🪜 How CHAID Works (Simple Idea)
- You start with a dataset that has:
- One output column (target)
- Multiple input columns (predictors)
- CHAID checks which predictor is most related to the output using a statistical test called Chi-square.
- It splits the data into groups (like a decision tree) based on the strongest relationship.
- It continues splitting until no significant improvement is found.
🔍 A Simple Example (Pen-and-Paper)
🧾 Dataset – Predicting if Someone Buys a Laptop
| Person | Age Group | Student | Buys Laptop |
| --- | --- | --- | --- |
| 1 | Young | Yes | Yes |
| 2 | Young | No | No |
| 3 | Middle | Yes | Yes |
| 4 | Middle | No | No |
| 5 | Old | No | No |
| 6 | Old | Yes | Yes |
We want to predict whether a person will buy a laptop.
🧠 Step-by-Step with CHAID
CHAID asks:
"Which column (Age Group or Student) gives the most useful split?"
To do this, it uses Chi-square test to check which predictor is more related to the target.
For now, let's assume you don’t need to calculate Chi-square — just understand the logic behind the splits.
🔹 First Check: Is Age Group related to Buys Laptop?
Let’s group the data:
| Age Group | Buys Yes | Buys No |
| --- | --- | --- |
| Young | 1 | 1 |
| Middle | 1 | 1 |
| Old | 1 | 1 |
Not very helpful! Equal numbers for Yes and No — no strong pattern.
🔹 Next Check: Is Student related to Buys Laptop?
| Student | Buys Yes | Buys No |
| --- | --- | --- |
| Yes | 3 | 0 |
| No | 0 | 3 |
Whoa! That’s a perfect split! All students bought laptops, all non-students didn’t.
CHAID selects Student as the best splitting column.
🧩 So the first rule is:
IF Student = Yes THEN Buys Laptop = Yes
IF Student = No THEN Buys Laptop = No
This is the final decision tree. No further splits needed because the groups are already perfectly separated.
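For completeness, the chi-square checks above can be reproduced with base R's chisq.test(); the data are re-typed from the table (with only 6 rows, R will warn that the approximation may be inaccurate):
buys    <- c("Yes", "No", "Yes", "No", "No", "Yes")
student <- c("Yes", "No", "Yes", "No", "No", "Yes")
age     <- c("Young", "Young", "Middle", "Middle", "Old", "Old")
chisq.test(table(age, buys))       # weak association: the p-value is large
chisq.test(table(student, buys))   # perfect split: much stronger association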
✏️ Why Use CHAID?
✅ Easy to understand
✅ Works well with categorical data (like Yes/No, Age groups, etc.)
✅ Can be used manually with small data
✅ Produces decision trees that are easy to explain
Unit V: Automating Models for Categorical and Continuous targets, Comparing and Combining Models, Evaluation Charts for Model Comparison, Deploying Model, Assessing Model Performance, Updating a Model.
We want to build a model to predict student performance.
- Categorical target: Predict whether a student will Pass or Fail.
- Continuous target: Predict the exact marks a student will score.
Categorical Targets (Classification)
- Meaning: Predicting a label that is discrete (separate categories) and unordered.
- Example: Pass / Fail, Grade A / Grade B / Grade C.
- Models:
- Decision Trees: Rules like “If study hours > 5 and attendance > 80%, then Pass.”
- IF–THEN Rules: Simple logical rules for classification.
- Mathematical Formulae: Logistic regression equations.
- Neural Networks: Learn complex patterns from data.
- Naïve Bayes: Probabilistic model based on Bayes’ theorem.
- Clustering (k-means, k-medoids): Group students into clusters (e.g., high, medium, low performers) — sometimes used before classification.
Continuous Targets (Regression)
- Meaning: Predicting a numeric value.
- Example: Predicting a student’s marks out of 100.
- Models:
- Linear Regression: Straight-line relationship between study hours and marks.
- MARS (Multivariate Adaptive Regression Splines): Handles nonlinear patterns — e.g., marks improve quickly up to 6 hours of study, then level off.
- GAM (Generalized Additive Models): Flexible models that add together effects of different factors.
Automating Models
- Meaning: Letting the computer automatically choose the best model or best set of variables.
- How:
- Try many alternative models and compare performance.
- Often used when models are of the same type but differ in which explanatory variables they use.
- All models are decision trees, but one uses study hours + attendance, another uses study hours + homework score, etc.
- The system automatically picks the one with the best accuracy.
💡 Memory Hook:
- Classification → “Which category?” (Pass/Fail)
- Regression → “What number?” (Marks)
- Automation → “Let the system pick the best model or variables.”
Q) Explain how models can be compared?
We’re building models to predict whether a student will pass or fail an exam based on study hours, attendance, and homework completion.
We have several models (e.g., Decision Tree, Logistic Regression, Neural Network) and we want to compare them or combine them.
Comparing Models
Goal: Pick the best model.
- Model Selection:
Train different models and see which predicts best.
Example: Decision Tree predicts 85% correctly, Logistic Regression 88%, Neural Network 90% → Neural Network wins.
- Statistical Tests of Significance:
Check if the difference in performance is real or just due to chance.
Example: Test if the Neural Network's 90% accuracy is significantly better than Logistic Regression's 88%.
- ROC Curves & Lift Curves:
Compare how well models separate pass vs fail.
Example: The ROC curve for the Neural Network is higher than the others → better separation.
Q) Discuss ensemble methods for combining models
Goal: Improve predictions by using multiple models together.
- Bagging (Bootstrap Aggregating):
Train the same type of model on different random samples of the data, then take a majority vote.
Example: 10 Decision Trees trained on different bootstrapped student data vote on pass/fail → reduces overfitting.
- Boosting:
Train models one after another, each focusing on mistakes made by the previous one, then combine them.
Example: First model misses students with low attendance, next model focuses on them → final "committee" is stronger.
- Random Forests:
Many Decision Trees, each trained on a random subset of data and a random subset of features.
Example: One tree uses study hours & attendance, another uses homework & attendance → reduces correlation between trees.
- Stacking:
First level: several different models make predictions.
Second level: another model learns how to best combine those predictions.
Example: Decision Tree, Logistic Regression, and Neural Network each predict pass/fail; a Meta-Model learns how to blend their outputs for the final decision.
Two Types of Ensembles
- Model-centered: Different algorithms on the same data.
Example: Decision Tree + Neural Network + Logistic Regression.
- Data-centered: Same algorithm on different subsets of data.
Example: Many Decision Trees on different bootstrapped samples.
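A minimal R sketch of one of these ensembles, a random forest, using the randomForest package; the built-in iris data stands in for the student dataset:
library(randomForest)
set.seed(3)
# 100 trees, each grown on a bootstrap sample and a random subset of features
rf <- randomForest(Species ~ ., data = iris, ntree = 100)
print(rf)                          # shows the out-of-bag error estimate
predict(rf, iris[1:3, ])           # majority vote of the 100 trees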
💡 Memory Hook:
- Comparing = “Which single player is best?”
- Combining = “Make a dream team.”
Q) Evaluation Charts for Model Comparison
1. ROC Curve shows how well a model can separate two classes (e.g., “pass” vs “fail”).
Example: Model’s ability to separate pass vs fail vs random guessing. Model catches most pass students and rarely mistakes fail students for pass.
- A curve closer to the top-left corner means better performance. If the curve is close to the diagonal line, the model is no better than random guessing.
- The AUC (Area Under Curve) is a single-number summary; higher AUC means better class separation.
The ROC curve shows us what happens at every single possible threshold! It plots two key numbers at each threshold:
- The True Positive Rate (TPR): This is the percentage of actual "pass" students that our model correctly identified as "pass." We want this to be high!
- The False Positive Rate (FPR): This is the percentage of actual "fail" students that our model incorrectly identified as "pass." We want this to be low!
So, the ROC curve is basically a graph of the TPR vs. the FPR.
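A minimal R sketch using the pROC package; the actual labels and predicted probabilities are made-up values:
library(pROC)
actual <- c(1, 1, 1, 0, 0, 0, 1, 0)                    # 1 = pass, 0 = fail (made up)
prob   <- c(0.9, 0.8, 0.6, 0.4, 0.3, 0.2, 0.55, 0.65)  # model's predicted P(pass)
roc_obj <- roc(actual, prob)   # computes TPR/FPR at every threshold
plot(roc_obj)                  # the ROC curve
auc(roc_obj)                   # single-number summary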
ROC curves and Area Under the Curve explained (video)
2. Lift Curve : Shows how much better the model is at identifying positives compared to random selection.
- Compares: Model’s success rate in finding fail students vs picking students randomly.
- Example: Lift = 3 → the model is three times as good as random at finding failures.
https://howtolearnmachinelearning.com/articles/the-lift-curve-in-machine-learning/
A Lift chart tells you how much better your model is at finding the “failures” compared to random guessing.
- If Lift = 1 → your model is no better than random.
- If Lift > 1 → your model is better than random (the higher, the better).
- If Lift < 1 → your model is worse than random.
It’s especially useful when you rank predictions and only want to act on the top portion of them (e.g., top 10% most likely cases).
Example: You have 100 students, and 10 of them are actually at risk of failing. Your model gives a score to every student.
- Rank your data: sort the students from most likely to fail (score near 1) to least likely to fail (score near 0).
- Divide into groups Split into equal‑sized buckets (often 10 groups = deciles).
- Calculate Lift for each group
- Plot it
- X‑axis: % of the population targeted (top 10%, top 20%, etc.).
- Y‑axis: Lift value.
In given example: 10% of all students in your dataset are “at risk” of failing. If you take the top 10% of students ranked by your model and find that 30% of them are actually “at risk”: Lift=30/10 = 3
Lift = (True Positives found by the model) ÷ (True Positives expected by random selection).
This means your model is 3× better than random selection at finding “at risk” students in that top slice.
3. Calibration Plot: Checks if the probabilities predicted by your model are realistic.
- Plots predicted probability vs actual probability.
- Example: If model says “80% chance of passing” for 10 students, about 8 should actually pass.
A Guide to Calibration Plots in Python – Chang Hsin Lee
4. Confusion Matrix
- Compares: Predicted labels (pass/fail) vs actual labels.
- Example:
- TP: Predicted pass & actually passed.
- TN: Predicted fail & actually failed.
- FP: Predicted pass but failed.
- FN: Predicted fail but passed.
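In R, a confusion matrix can be tabulated directly; the predicted and actual labels below are made up:
actual    <- c("pass", "pass", "fail", "fail", "pass", "fail")
predicted <- c("pass", "fail", "fail", "pass", "pass", "fail")
table(Predicted = predicted, Actual = actual)   # counts of TP, TN, FP, FN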

5. Histogram: Shows how data is spread out.
- Bars represent how many data points fall into each range. Example 1: Heights of students in a class — bars for 150–155 cm, 156–160 cm, etc.
Example 2: Most students have predicted pass probability between 0.6–0.9.
6. Boxplot: Shows spread, median, and outliers in data.
- Compares: Spread of predicted probabilities for passing vs failing students.
- Example: The box for passing students is higher than the box for failing students.
7. Scatter Plot
- Compares: Study hours vs predicted pass probability.
- Example: More hours → higher predicted probability.
8. Information Criteria Plot (AIC/BIC)
- Compares: Different models’ complexity and fit.
- Example: Model with 3 features has lowest AIC → best choice.
AIC & BIC for Selecting Regression Models: Formula, Examples
9. Fitted Curve Plot
- Compares: Model’s predicted curve vs actual pass rates.
- Example: Curve matches the real trend — more study hours = higher pass rate.

💡 Simple memory line:
All these charts are just different ways of comparing model predictions to actual truth or to a random baseline.
Q) What are the key steps and techniques used to evaluate a model's performance before deployment? (OR) How is model performance assessed?
Deploying a Model
- Meaning: After you’ve built and tested your model, you put it into the real world so it can make decisions automatically.
- Example: You’ve trained a model to predict whether a student will pass or fail. Deployment means connecting it to the school’s system so teachers can instantly see predictions for new students.
Assessing Model Performance
This is about checking how good the model really is before and after deployment.
1. Classifier Evaluation
- Measures how accurate the model is on a dataset.
- Used for comparing models, picking the best one, and tuning settings.
2. Methodological Issues — How you split your data matters:
- Training set: Used to teach the model.
- Validation set: Used to fine‑tune the model or choose between models.
- Test set: Used only at the end to check how well the final model works on unseen data.
- Example: Train on Jan–Mar data, validate on April data, test on May data.
3. Quantification Issues — Ways to measure quality:
- Accuracy: % of correct predictions.
- Cost‑sensitive accuracy: Accuracy that also considers the cost of mistakes.
- ROC curve: Shows how well the model separates classes.
- Precision: Of the students predicted to pass, how many actually passed.
- Recall (Sensitivity): Of all students who passed, how many the model caught.
- F‑measure: Balance between precision and recall.
4. Cross‑Validation
- Split the data into parts, train on some, test on the rest, and repeat several times.
- Example: In 5‑fold cross‑validation, the data is split into 5 parts; each part gets a turn as the test set.
5. Bootstrap
- A resampling method: randomly pick data points (with replacement) to create many training sets.
- Helps estimate how the model might perform on new data.
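A base-R sketch of these two resampling ideas, 5-fold cross-validation folds and one bootstrap sample; the data frame is made up:
set.seed(4)
data <- data.frame(x = rnorm(100), y = rnorm(100))   # made-up dataset
# 5-fold cross-validation: assign each row to one of 5 folds
folds <- sample(rep(1:5, length.out = nrow(data)))
for (k in 1:5) {
  train <- data[folds != k, ]   # train the model here
  test  <- data[folds == k, ]   # evaluate it here
}
# Bootstrap: resample rows with replacement to form one training set
boot_idx   <- sample(nrow(data), replace = TRUE)
boot_train <- data[boot_idx, ]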
💡 Big Picture:
You deploy the model to use it in real life, but you assess it carefully using smart data splits and performance measures to make sure it’s reliable.
Q) What are the different methods used to update a predictive model when new data becomes available?
1️⃣ Data Warehouse (DWH) Updates
- What it is: A big storage place where all your cleaned, organised data lives — like a giant library of past and current information.
- Why update: If you keep predicting based on old data, your model will be out of touch.
- Example: Every month, new exam scores, attendance records, and homework grades come in from different schools. The DWH must be refreshed so the model sees the latest patterns.
2️⃣ Prediction Model Validation with New Time Periods
- What it means: Don’t just test your model on the same time period it was trained on — test it on data from a different month or season.
- Why: To check if the model still works when conditions change.
- Example: Train the model on January–March student data, then test it on April data. If it still predicts well, it’s robust.
3️⃣ Incremental Decision Tree Induction
- Normal way: When new data arrives, you throw away the old decision tree and rebuild it from scratch.
- Incremental way: You update the existing tree with the new data, adjusting only the parts that need change.
- Example: If the model’s rule says “If study hours > 5, predict pass” but new data shows 4.5 hours is enough, the tree updates that branch without re‑learning everything.
4️⃣ Data Streams & Reservoir Sampling
- Data stream: Data that keeps coming in continuously, like a live video feed — you can’t store it all.
- Reservoir sampling: A smart way to keep a small, random, but representative sample of the stream so you can still train/update your model.
- Example: If exam results from all over the country arrive every second, you can’t save them all. Reservoir sampling keeps, say, 1,000 random results at any time, replacing old ones as new ones arrive — so your sample always reflects the latest situation.
💡 Big Picture:
Updating a model means:
- Keep your data fresh (update the warehouse).
- Test on new situations (different months).
- Update smartly (incremental learning instead of starting over).
- Handle endless data (sampling from streams).