Environmental Science and Engineering Department (ESED),
Indian Institute of Technology Bombay
ES225 Presentation
on
Data Analysis
Group:
Name | Roll No. |
Aditya Gavali | 23B4213 |
Mukund Raj | 23B4245 |
Shaikh Adnan Matiyoddin | 23B4207 |
Achintya Jha | 23B4216 |
Name: Aditya Gavali
Roll No: 23B4213
Station Location & Data Statistics
Trend Analysis
Seasonal Analysis
Model Result
Model Result
MODEL | R-SQUARED | MSE | RMSE |
LINEAR REGRESSION | 0.907 | 37.849 | 6.152 |
LASSO REGRESSION | 0.907 | 37.862 | 6.153 |
DECISION TREE | 0.902 | 39.959 | 6.321 |
RANDOM FOREST | 0.918 | 33.406 | 5.779 |
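As a quick consistency check on the table above, RMSE should equal the square root of MSE. The sketch below recomputes RMSE from the reported MSE values. The dictionary name and the tolerance are illustrative choices, not part of the original analysis.

```python
import math

# (MSE, RMSE) pairs copied from the table above
reported = {
    "Linear Regression": (37.849, 6.152),
    "Lasso Regression": (37.862, 6.153),
    "Decision Tree": (39.959, 6.321),
    "Random Forest": (33.406, 5.779),
}


def rmse_from_mse(mse):
    """RMSE is defined as the square root of MSE."""
    return math.sqrt(mse)


for model, (mse, rmse) in reported.items():
    recomputed = rmse_from_mse(mse)
    # Allow a small tolerance because the table values are rounded
    consistent = abs(recomputed - rmse) < 0.005
    print(f"{model}: reported {rmse}, recomputed {recomputed:.3f}, consistent={consistent}")
```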
Name: Mukund Raj
Roll No: 23B4245
Station Location & Data Statistics
My station location: DM Office, Bhagalpur, Bihar
Trend Analysis
Trend Analysis
Seasonal Analysis
Seasonal Analysis
Box Plot of PM2.5 by Season
Box Plot of PM10 by Season
Box Plot of RH by Season
Model Result
After evaluating the Linear Regression, Lasso Regression, and tree-based models (Decision Tree and Random Forest), Random Forest performed the best among all models. This conclusion is based on its highest R-squared (R²) value and lowest Root Mean Squared Error (RMSE), indicating that it explains the most variance in PM2.5 levels and has the smallest prediction errors. The Random Forest model outperformed others because of its ability to capture complex, non-linear relationships between PM2.5 and other factors like PM10, temperature, and humidity. Linear and Lasso Regression models, while straightforward, are limited in capturing non-linear dependencies, making them less effective for this dataset. Decision Tree provided decent performance but was less consistent than Random Forest due to its susceptibility to overfitting, which the ensemble approach of Random Forest mitigates by averaging multiple trees.
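The comparison described above can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the code used for the presentation: the data is synthetic, and the feature relationship, the Lasso `alpha`, and the 80/20 split are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): the real station data is not reproduced here
rng = np.random.default_rng(0)
n = 500
pm10 = rng.uniform(20, 300, n)
temp = rng.uniform(10, 40, n)
rh = rng.uniform(20, 95, n)
# Hypothetical PM2.5 relationship with a non-linear humidity term plus noise
pm25 = 0.5 * pm10 - 0.01 * (rh - 50) ** 2 + 0.3 * temp + rng.normal(0, 5, n)

X = np.column_stack([pm10, temp, rh])
X_train, X_test, y_train, y_test = train_test_split(
    X, pm25, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(alpha=0.1),  # alpha is an illustrative choice
    "Decision Tree": DecisionTreeRegressor(random_state=0),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
}


def evaluate(models, X_train, y_train, X_test, y_test):
    """Fit each model and return R-squared, MSE, and RMSE on the test split."""
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        mse = mean_squared_error(y_test, pred)
        results[name] = {
            "r2": r2_score(y_test, pred),
            "mse": mse,
            "rmse": float(np.sqrt(mse)),
        }
    return results


results = evaluate(models, X_train, y_train, X_test, y_test)
for name, m in results.items():
    print(f"{name}: R2={m['r2']:.3f}, MSE={m['mse']:.3f}, RMSE={m['rmse']:.3f}")
```

Which model ranks first depends on the data. On a real station dataset the ranking should be read from the printed metrics rather than assumed.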
Model Result (Actual vs predicted values)
Model Result (Residual plots)
Random Forest outperformed the other models when comparing actual vs. predicted values and analyzing residual plots. In the actual vs. predicted scatter plots, Random Forest predictions align more closely along the 45-degree line, showing a higher accuracy in predicting PM2.5 levels. The residual plots further confirm this, as Random Forest residuals are more evenly distributed around zero with fewer large deviations, indicating a better fit.
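The residual check described above can be sketched in a few lines. The values below are made up for illustration and are not taken from any of the station datasets. `model_a` and `model_b` are hypothetical prediction lists.

```python
from statistics import mean


def residuals(actual, predicted):
    """Residual = actual value minus predicted value, per observation."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [a - p for a, p in zip(actual, predicted)]


def summarize(res):
    """Mean residual (bias) and largest absolute deviation."""
    return {"mean": mean(res), "max_abs": max(abs(r) for r in res)}


# Hypothetical PM2.5 values and two hypothetical sets of predictions
actual = [42.0, 55.0, 61.0, 38.0, 70.0]
model_a = [41.0, 56.5, 60.0, 39.0, 69.0]  # stays close to the 45-degree line
model_b = [35.0, 60.0, 50.0, 45.0, 80.0]  # larger deviations

summary_a = summarize(residuals(actual, model_a))
summary_b = summarize(residuals(actual, model_b))
print("model_a:", summary_a)
print("model_b:", summary_b)
```

A better-fitting model shows a mean residual near zero and a smaller maximum deviation, which is what the residual plots convey visually.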
Name: Shaikh Adnan Matiyoddin
Roll No: 23B4207
Station Location & Data Statistics
My station location: Gurdeo Nagar, Aurangabad – BSPCB, Maharashtra
Station Location & Data Statistics
Trend Analysis
Data trends plots:-
Trend Analysis
Seasonal Analysis
Seasonal Analysis
Model Result
MODEL | R-SQUARED | MSE | RMSE |
LINEAR REGRESSION | 0.7767 | 503.3740 | 22.4360 |
LASSO REGRESSION | 0.7767 | 503.3732 | 22.4360 |
DECISION TREE | 0.8491 | 295.2482 | 17.1828 |
RANDOM FOREST | 0.9805 | 38.1468 | 6.1763 |
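To make the gap between models in the table above easier to read, the RMSE values can be expressed as a percentage reduction relative to Linear Regression. This is an illustrative calculation on the reported numbers. The helper name `pct_reduction` is hypothetical.

```python
# RMSE values copied from the table above
rmse = {
    "Linear Regression": 22.4360,
    "Lasso Regression": 22.4360,
    "Decision Tree": 17.1828,
    "Random Forest": 6.1763,
}


def pct_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


baseline = rmse["Linear Regression"]
for model, value in rmse.items():
    print(f"{model}: {pct_reduction(baseline, value):.1f}% lower RMSE than Linear Regression")
```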
Model Result
Name: Achintya Jha
Roll No: 23B4216
Station Location & Data Statistics
Data is given in the link:- https://1drv.ms/x/c/33aaa8d3c018a552/EcQ3k04Vl_pFhmibOmIDBl0BYNg2EH5bZq1KarUzMgG3xw?e=X0BuoT
Trend Analysis
Trend Analysis
Interpretation
The black line shows that PM10 levels fluctuate significantly, with several high spikes throughout the time period. PM10 values are generally higher than PM2.5 values and often reach extreme peaks, suggesting episodes of high particulate pollution.
The red line shows PM2.5 levels, which follow a pattern similar to PM10 but are generally lower in magnitude. PM2.5 correlates with PM10 because both measure particulate matter, with PM2.5 being the subset of smaller particles.
Higher spikes in temperature often coincide with high PM10 and PM2.5 levels. This could indicate that warmer conditions (often linked to stagnant air) allow particulates to accumulate, leading to pollution spikes.
High relative humidity (RH) values tend to appear when PM10 and PM2.5 levels are low. This inverse relationship could be because high humidity helps settle particulates or indicates rainy/wet conditions, which clean the air by washing out particulates.
The pink line for wind speed (WS) remains low with only minor fluctuations. Higher wind speeds might help disperse particulate matter, potentially affecting PM10 and PM2.5, though this effect isn't strongly apparent in the graph.
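The relationships read off the trend plot can be quantified with a Pearson correlation coefficient. The sketch below uses made-up values purely for illustration. With the real station data, the columns for PM10, PM2.5, and RH would take their place.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical daily values (not from the presentation's dataset)
pm10 = [120, 180, 90, 60, 150, 200]
pm25 = [60, 95, 40, 25, 70, 110]
rh = [55, 40, 70, 85, 50, 35]

print(f"PM10 vs PM2.5: {pearson(pm10, pm25):.2f}")  # expected to be positive
print(f"PM2.5 vs RH:   {pearson(pm25, rh):.2f}")    # expected to be negative
```

A coefficient near +1 would support the PM10/PM2.5 co-movement described above, and a negative value would support the inverse relationship with humidity.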
Seasonal Analysis
Season-wise analysis through box plots of PM2.5, PM10, and relative humidity for the different seasons
Seasonal Analysis
Interpretation through box-plots
1. Winter
increased emissions from heating and industry.
2. Pre-Monsoon
3. Monsoon
4. Post-Monsoon
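A seasonal box plot needs each observation assigned to one of the four seasons listed above. The sketch below shows one way to do that and to compute the five numbers a box plot draws. The month ranges are an assumption (a commonly used India convention), since the presentation does not state which months were assigned to each season, and the sample records are made up.

```python
from statistics import median, quantiles

# Assumed month-to-season mapping (not stated in the presentation)
SEASON_BY_MONTH = {}
for m in (1, 2):
    SEASON_BY_MONTH[m] = "Winter"
for m in (3, 4, 5):
    SEASON_BY_MONTH[m] = "Pre-Monsoon"
for m in (6, 7, 8, 9):
    SEASON_BY_MONTH[m] = "Monsoon"
for m in (10, 11, 12):
    SEASON_BY_MONTH[m] = "Post-Monsoon"


def group_by_season(records):
    """Group (month, value) pairs into lists keyed by season name."""
    grouped = {}
    for month, value in records:
        grouped.setdefault(SEASON_BY_MONTH[month], []).append(value)
    return grouped


def box_stats(values):
    """Five-number summary used to draw a box plot (needs >= 2 values)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median(values),
            "q3": q3, "max": max(values)}


# Hypothetical (month, PM2.5) records for illustration only
records = [(1, 150), (2, 130), (4, 80), (5, 70), (7, 30), (8, 25),
           (9, 35), (10, 90), (11, 140), (12, 160)]

for season, values in group_by_season(records).items():
    print(season, box_stats(values))
```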
Model Result
Linear regression model Lasso regression model
Decision tree model Random Forest Regressor Model
Interpretation:- According to the R², MSE, and RMSE values for my data, the Random Forest regressor was the best model because it has the highest R² and the lowest MSE and RMSE among the four. The decision tree regressor performed worst, with the lowest R² and the highest MSE and RMSE. The Lasso regression model had a slightly higher R² than the linear regression model, but the difference was not significant.
Model Result
Linear regression model Lasso Regression model
Decision tree model Random forest regression model
Model Result
Interpretation:- The scatter plots above show that the Random Forest regression model gives the best results: its actual vs. predicted points lie nearest to the line y = x. Its residuals are also the most randomly scattered around zero.
Results Comparison
Results Comparison
METRIC | Aditya Gavali | Mukund Raj | Shaikh Adnan | Achintya Jha |
R-SQUARED | 0.918 | 0.92 | 0.9805 | 0.711 |
MSE | 33.406 | 317.50 | 38.1468 | 218.397 |
RMSE | 5.779 | 17.82 | 6.1763 | 14.778 |
Since the Random Forest model outperformed other models across all four datasets, a comparative evaluation of its performance for each student can provide valuable insights.
Results Comparison
The evaluation of the Random Forest model's performance for predicting PM2.5 levels across four students' datasets reveals significant variations in predictive accuracy and reliability.
The Random Forest model achieves its highest R² (0.9805) on Shaikh Adnan's dataset, while Aditya Gavali's dataset has the lowest MSE and RMSE (33.406 and 5.779). Mukund Raj's dataset also shows a strong R² (0.92), but with much higher prediction errors (RMSE 17.82). Achintya Jha's dataset shows the weakest fit (R² 0.711), likely indicating data characteristics that reduce predictive accuracy. Because MSE and RMSE are expressed in the units of PM2.5, error values are not directly comparable between stations with different pollution levels. These differences highlight the model's varying effectiveness across datasets and suggest that further preprocessing or tuning could help for datasets like Achintya's.
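The comparison can also be read directly from the table values. The short sketch below ranks the four datasets by R² and by RMSE using the reported Random Forest numbers. The variable names are illustrative.

```python
# Random Forest metrics per student, copied from the comparison table
results = {
    "Aditya Gavali": {"r2": 0.918, "mse": 33.406, "rmse": 5.779},
    "Mukund Raj": {"r2": 0.92, "mse": 317.50, "rmse": 17.82},
    "Shaikh Adnan": {"r2": 0.9805, "mse": 38.1468, "rmse": 6.1763},
    "Achintya Jha": {"r2": 0.711, "mse": 218.397, "rmse": 14.778},
}

# Higher R-squared is better; lower RMSE is better
by_r2 = sorted(results, key=lambda name: results[name]["r2"], reverse=True)
by_rmse = sorted(results, key=lambda name: results[name]["rmse"])

best_r2 = by_r2[0]
lowest_rmse = by_rmse[0]

print("Ranked by R2 (best first):  ", by_r2)
print("Ranked by RMSE (best first):", by_rmse)
```

The two rankings need not agree, because R² is relative to each dataset's own variance while RMSE is in absolute PM2.5 units.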
References
Sources for data
References
1. NumPy
2. Pandas
3. Matplotlib
4. Seaborn
References
Role: Offers advanced scientific and statistical functions.
Use in Modeling: Used for statistical testing, optimization, and handling specialized computations needed for certain models.
Role: Core machine learning library.
Use in Modeling: Contains algorithms for model training, tools for evaluation and cross-validation, preprocessing functions, and utilities for building ML pipelines.
Thank You