1 of 41

Environmental Science and Engineering Department (ESED),

Indian Institute of Technology Bombay

ES225 Presentation

on

Data Analysis

Group:

Name	Roll No.
Aditya Gavali	23B4213
Mukund Raj	23B4245
Shaikh Adnan Matiyoddin	23B4207
Achintya Jha	23B4216

2 of 41

Name: Aditya Gavali

Roll No: 23B4213

3 of 41

Station Location & Data Statistics

Chhatrapati Shivaji Maharaj International Airport, Mumbai

4 of 41

Trend Analysis

5 of 41

Seasonal Analysis

6 of 41

Model Result

7 of 41

Model Result

MODEL	R-SQUARED	MSE	RMSE
LINEAR REGRESSION	0.907	37.849	6.152
LASSO REGRESSION	0.907	37.862	6.153
DECISION TREE	0.902	39.959	6.321
RANDOM FOREST	0.918	33.406	5.779

Linear Regression and Lasso Regression perform almost identically (R-squared 0.907, MSE about 37.85), indicating a good fit, although linear models may miss non-linear patterns.
Decision Tree has a slightly lower R-squared (0.902) and a higher MSE (39.96), which may reflect the tendency of a single tree to overfit the training data.
Random Forest outperforms the other models with the lowest MSE (33.41) and highest R-squared (0.918), suggesting that it captures the data's complexity better.
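
The three metrics in the table are closely related: RMSE is the square root of MSE, and R-squared compares the squared error with the variance of the actual values. A minimal sketch of how they can be computed, assuming arrays of actual and predicted PM2.5 values (the numbers and the function name `regression_metrics` are hypothetical, not taken from the presentation):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return (R-squared, MSE, RMSE) for actual vs. predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))          # mean squared error
    rmse = float(np.sqrt(mse))                    # same units as PM2.5
    ss_res = float(np.sum(residuals ** 2))        # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return r2, mse, rmse

# Hypothetical values, for illustration only
actual = [40.0, 55.0, 70.0, 85.0, 100.0]
predicted = [42.0, 53.0, 71.0, 80.0, 104.0]
r2, mse, rmse = regression_metrics(actual, predicted)
print(f"R2={r2:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}")
```

Because RMSE is in the same units as the target (µg/m³ here), it is usually the easiest of the three to interpret.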

8 of 41

Name: Mukund Raj

Roll No: 23B4245

9 of 41

Station Location & Data Statistics

My station location: DM office Bhagalpur, Bihar

Before pre-processing, the dataset showed variability across variables due to missing values.
Rows with missing data were removed to improve consistency in the dataset.
After cleaning, minor statistical shifts were observed:
- PM2.5 mean increased to 80.87 µg/m³.
- PM10 mean increased to 150.40 µg/m³, indicating slight changes in air quality indicators.
- Relative humidity’s mean remained stable at 68.41%.
- Temperature's mean held steady at 30.23°C.
- Wind speed showed minimal change.
These minor shifts suggest that missing values were fairly random and their removal did not significantly impact the dataset’s overall trends or insights.
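
The cleaning step described above (dropping rows with missing values and comparing the statistics before and after) can be sketched with pandas. This is only an illustration: the column names and readings below are hypothetical, not the station's data.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings; column names and values are assumptions
raw = pd.DataFrame({
    "PM2.5": [82.0, np.nan, 75.5, 90.1, 64.3],
    "PM10": [150.2, 160.4, np.nan, 171.0, 120.8],
    "RH": [68.0, 70.5, 66.2, np.nan, 69.1],
})

# Remove every row that has at least one missing value
clean = raw.dropna()

# Compare counts and means before and after cleaning
summary = pd.DataFrame({
    "count_before": raw.count(),
    "count_after": clean.count(),
    "mean_before": raw.mean(),
    "mean_after": clean.mean(),
})
summary["mean_shift"] = summary["mean_after"] - summary["mean_before"]
print(summary.round(2))
```

A small `mean_shift` relative to the mean is what one would expect if the missing values are spread fairly randomly through the record.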

10 of 41

Trend Analysis

11 of 41

Trend Analysis

PM2.5 and PM10 levels show high variability, peaking early in the year, decreasing mid-year, and rising again towards year-end.
The initial peak indicates high pollution, likely due to winter heating.
The mid-year decline corresponds with summer, when dispersion improves and particulate matter decreases.
The rise at year-end suggests poorer air quality due to cooler temperatures and increased heating emissions.
PM2.5 levels inversely correlate with temperature, being higher in colder months and lower in warmer months.
Lower temperatures contribute to higher PM2.5 levels due to increased heating emissions and stagnant air.
PM10 levels fluctuate significantly, and low wind speeds may allow pollutants to accumulate.
However, the data show no strong correlation between PM10 and wind speed, suggesting that wind contributed little to dispersion at this station.
Overall, higher PM2.5 and PM10 levels occur in colder months due to emissions and poor dispersion.
Relative humidity may influence particulate matter behavior, affecting PM levels under different conditions.
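
The seasonal pattern and the inverse PM2.5 and temperature relationship described above can be checked numerically with monthly means and a Pearson correlation. The sketch below uses synthetic data constructed to be colder at the start and end of the year; the column names, coefficients, and noise level are assumptions, not the station's data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
days = pd.date_range("2023-01-01", "2023-12-31", freq="D")
day_of_year = np.arange(len(days))

# Synthetic temperature: coolest around the turn of the year, warmest mid-year
temp = 25.0 + 8.0 * np.sin(2 * np.pi * (day_of_year - 90) / 365.0)
# Synthetic PM2.5: falls as temperature rises, plus random noise
pm25 = 140.0 - 3.0 * temp + rng.normal(0.0, 5.0, len(days))

df = pd.DataFrame({"Temp": temp, "PM2.5": pm25}, index=days)

# Monthly means expose the early-year peak, mid-year dip, and year-end rise
monthly = df.resample("MS").mean()
print(monthly.round(1))

# Pearson correlation between daily PM2.5 and temperature
r = df["PM2.5"].corr(df["Temp"])
print("PM2.5 vs Temp correlation:", round(r, 2))
```

A strongly negative coefficient supports the inverse relationship, although correlation alone does not show that temperature is the cause.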

12 of 41

Seasonal Analysis

13 of 41

Seasonal Analysis

Box Plot of PM2.5 by Season

PM2.5 levels are highest during winter due to increased heating and stagnant weather trapping pollutants.
There is a noticeable decrease in PM2.5 concentrations during pre-monsoon as atmospheric mixing improves and rainfall begins.
PM2.5 levels remain relatively low during the monsoon season due to rain washing away particulates.
PM2.5 levels rise again in post-monsoon, indicating a transition back to stable weather patterns that promote pollutant accumulation.

Box Plot of PM10 by Season

PM10 levels peak during winter, linked to increased heating emissions and stagnant atmospheric conditions.
A decrease in PM10 levels is observed during pre-monsoon due to better air quality conditions and higher precipitation.
PM10 levels are significantly lower in the monsoon season as increased rainfall contributes to washing away particulate matter.
PM10 concentrations rise again in post-monsoon, suggesting a return to dry weather patterns and potential increases in dust.

Box Plot of RH by Season

RH levels are higher during winter due to cooler temperatures, contributing to fog formation and trapping pollutants.
In pre-monsoon, RH levels decrease as higher temperatures lead to lower humidity, correlating with better dispersion of pollutants.
During the monsoon, RH levels are at their highest due to persistent rainfall and humidity affecting air quality.
After the monsoon, RH levels drop, indicating a transition to drier conditions that influence pollutant behavior.
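
Seasonal box plots like these need each observation labelled with a season first. A minimal sketch, assuming a date-indexed table and a month-to-season mapping; the mapping and the readings below are assumptions for illustration and should be replaced by the convention actually used in the analysis:

```python
import pandas as pd

# Assumed month-to-season mapping; adjust to the convention used in the analysis
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Pre-Monsoon", 4: "Pre-Monsoon", 5: "Pre-Monsoon",
    6: "Monsoon", 7: "Monsoon", 8: "Monsoon", 9: "Monsoon",
    10: "Post-Monsoon", 11: "Post-Monsoon",
}

# Hypothetical PM2.5 readings, two per season
df = pd.DataFrame(
    {"PM2.5": [120.0, 110.0, 70.0, 65.0, 30.0, 35.0, 80.0, 90.0]},
    index=pd.to_datetime([
        "2023-01-10", "2023-02-05", "2023-04-12", "2023-05-20",
        "2023-07-15", "2023-08-03", "2023-10-18", "2023-11-25",
    ]),
)

# Label each row with its season
df["Season"] = [SEASON_BY_MONTH[m] for m in df.index.month]

# Five-number summary per season: the values a box plot draws
seasonal = df.groupby("Season")["PM2.5"].describe()[
    ["min", "25%", "50%", "75%", "max"]
]
print(seasonal)
```

With Matplotlib installed, `df.boxplot(column="PM2.5", by="Season")` draws the corresponding plot from the same table.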

14 of 41

Model Result

After evaluating the Linear Regression, Lasso Regression, and tree-based models (Decision Tree and Random Forest), Random Forest performed the best among all models. This conclusion is based on its highest R-squared (R²) value and lowest Root Mean Squared Error (RMSE), indicating that it explains the most variance in PM2.5 levels and has the smallest prediction errors. The Random Forest model outperformed others because of its ability to capture complex, non-linear relationships between PM2.5 and other factors like PM10, temperature, and humidity. Linear and Lasso Regression models, while straightforward, are limited in capturing non-linear dependencies, making them less effective for this dataset. Decision Tree provided decent performance but was less consistent than Random Forest due to its susceptibility to overfitting, which the ensemble approach of Random Forest mitigates by averaging multiple trees.
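
A comparison of this kind can be sketched with scikit-learn as follows. The data are synthetic, and the feature set, the `alpha` value, the train/test split, and the random seeds are all assumptions; this is not the configuration used for the reported results.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic predictors (PM10, temperature, relative humidity)
rng = np.random.default_rng(42)
n = 600
pm10 = rng.uniform(50, 300, n)
temp = rng.uniform(10, 40, n)
rh = rng.uniform(30, 95, n)
# Synthetic PM2.5 with a non-linear temperature term and random noise
pm25 = 0.4 * pm10 + 0.05 * (40 - temp) ** 2 - 0.2 * rh + rng.normal(0, 5, n)

X = np.column_stack([pm10, temp, rh])
X_train, X_test, y_train, y_test = train_test_split(
    X, pm25, test_size=0.2, random_state=42
)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(alpha=0.1),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

# Fit each model on the training split and score it on the held-out split
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mse = mean_squared_error(y_test, pred)
    results[name] = {
        "R2": r2_score(y_test, pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),
    }

for name, m in results.items():
    print(f"{name:18s} R2={m['R2']:.3f}  MSE={m['MSE']:.2f}  RMSE={m['RMSE']:.2f}")
```

Scoring on a held-out test split rather than on the training data matters here, because a fully grown decision tree can fit its training data almost perfectly.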

15 of 41

Model Result (Actual vs predicted values)

16 of 41

Model Result (Residual plots)

Random Forest outperformed the other models when comparing actual vs. predicted values and analyzing residual plots. In the actual vs. predicted scatter plots, Random Forest predictions align more closely along the 45-degree line, showing a higher accuracy in predicting PM2.5 levels. The residual plots further confirm this, as Random Forest residuals are more evenly distributed around zero with fewer large deviations, indicating a better fit.
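
The visual reading of a residual plot can be backed by a few numbers: residuals centred on zero with small spread and no large deviations indicate a better fit. A minimal sketch, with hypothetical actual and predicted values for two unnamed models:

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Summarise residuals (actual minus predicted) for one model."""
    residuals = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return {
        "mean_residual": float(residuals.mean()),           # near 0 means no bias
        "residual_std": float(residuals.std(ddof=1)),       # spread around the mean
        "max_abs_residual": float(np.abs(residuals).max()), # largest single miss
    }

# Hypothetical values, for illustration only
actual = [50.0, 60.0, 70.0, 80.0]
pred_a = [52.0, 58.0, 71.0, 79.0]   # predictions close to the 45-degree line
pred_b = [60.0, 50.0, 85.0, 70.0]   # predictions further from it

summary_a = residual_summary(actual, pred_a)
summary_b = residual_summary(actual, pred_b)
print("model A:", summary_a)
print("model B:", summary_b)
```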

17 of 41

Name: Shaikh Adnan Matiyoddin

Roll No: 23B4207

18 of 41

Station Location & Data Statistics

My station location: Gurdeo Nagar, Aurangabad – BSPCB, Maharashtra

Before preprocessing, the dataset exhibited variability across variables due to missing values.
Rows with missing data were removed to improve consistency in the dataset.
The descriptive statistics before and after pre-processing:
PM2.5 and PM10: The mean values of PM2.5 (65.4 → 62.4) and PM10 (143.4 → 145.2) shifted slightly, but their maximum values dropped considerably (PM2.5 from 651 to 227.8, PM10 from 865 to 480), indicating that high outliers were removed in the processing.
Temperature (Temp): The average temperature remains similar (29.92 → 29.85), but the maximum dropped (43.9 → 37.7), suggesting extreme temperature values were also filtered out.

19 of 41

Station Location & Data Statistics

Relative Humidity (RH): The mean relative humidity decreased (61.9 → 58.1) along with a similar reduction in its range, which indicates that some high RH values may have been adjusted or removed.
Wind Speed (WS): The average wind speed slightly dropped (1.28 → 1.13), and its maximum was reduced (5.9 → 3.47), pointing to a reduction in unusually high wind speeds.
Wind Direction (WD): While the mean of WD changed slightly (192.2 → 196.4), the range remains nearly the same, suggesting little adjustment in this parameter's distribution.
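
The drop in maximum values suggests that an outlier filter was applied, although the slides do not say which one. One common choice is the interquartile-range (IQR) rule; the sketch below assumes that rule with the usual factor of 1.5. The helper name `drop_iqr_outliers` is hypothetical, and apart from the reported pre-processing maximum of 651 the readings are made up.

```python
import pandas as pd

def drop_iqr_outliers(series, k=1.5):
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR] (assumed filter rule)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    mask = series.between(q1 - k * iqr, q3 + k * iqr)
    return series[mask]

# Hypothetical PM2.5 readings with one extreme value
pm25 = pd.Series([58.0, 60.0, 62.0, 65.0, 70.0, 651.0], name="PM2.5")
filtered = drop_iqr_outliers(pm25)

print("max before:", pm25.max(), "| max after:", filtered.max())
print("mean before:", round(pm25.mean(), 1), "| mean after:", round(filtered.mean(), 1))
```

Comparing `describe()` output before and after such a filter gives the kind of before/after table discussed above.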

20 of 41

Trend Analysis

Data trend plots:

21 of 41

Trend Analysis

PM2.5 and PM10 Correlation: Strong positive correlation; both pollutants rise and fall together due to similar sources, though PM10 shows larger spikes. This suggests that PM10 is more sensitive to events that generate larger particles, such as dust storms or construction.
Humidity and Particulates: Inversely correlated with PM levels; high humidity helps settle PM2.5 and PM10, reducing air pollution, especially during rainy seasons. This settling effect is particularly noticeable in humid months when both PM levels drop.
Temperature Effect: Indirect influence; colder months show higher PM levels, likely due to heating activities and atmospheric inversions that trap pollutants. Warmer months see better dispersion, improving air quality.
Wind Speed Impact: Low wind speed has minimal effect on dispersing PM10, indicating wind isn’t a major factor in PM variability in this dataset. Higher wind speeds, if present, could potentially help reduce PM concentrations by aiding dispersion.
Seasonal Patterns: Higher PM levels in colder months, with increased variability for PM10 due to factors like dust or construction during dry periods. Seasonal and environmental conditions seem to be the primary drivers of PM fluctuations.

22 of 41

Seasonal Analysis

23 of 41

Seasonal Analysis

1. Winter (Nov-Dec) has the highest PM2.5 and PM10 levels due to stagnant air, heating, and industrial activities.
2. Monsoon (May-Jul) brings the cleanest air, with significant pollution reduction due to frequent rainfall.
3. Pre-Monsoon (Feb-Apr) and Post-Monsoon (Aug-Oct) have intermediate pollution levels, influenced by wind speed and humidity.
4. Relative Humidity (RH) impacts air quality, peaking during monsoon and dipping in pre-monsoon.
5. Rainfall significantly reduces pollution, highlighting its importance in air quality improvement.
6. Seasonal factors (wind speed, humidity, and human activities) contribute to varying pollution levels across seasons.

24 of 41

Model Result

MODEL	R-SQUARED	MSE	RMSE
LINEAR REGRESSION	0.7767	503.3740	22.4360
LASSO REGRESSION	0.7767	503.3732	22.4360
DECISION TREE	0.8491	295.2482	17.1828
RANDOM FOREST	0.9805	38.1468	6.1763

25 of 41

Model Result

Model Performance Overview: Four models were tested to predict PM2.5 levels, each evaluated based on R-Squared, MSE, and RMSE. The models ranged from simple linear regression techniques to more complex ensemble methods like Random Forest.
Random Forest Superiority: Among the models, Random Forest showed the best predictive ability, with an R-Squared of 0.9805, meaning it explained over 98% of the variance in PM2.5 concentrations. This indicates its strength in capturing complex data patterns and delivering accurate predictions.
Linear and Lasso Regression Limitations: Linear and Lasso Regression both achieved an R-Squared of about 0.7767, capturing roughly 78% of the variance. Their performance was nearly identical, suggesting that the predictors have little multicollinearity, so Lasso's regularization offered little benefit.
Decision Tree Model: The Decision Tree model achieved a higher R-Squared (0.8491) than the linear models, capturing about 85% of the variance. However, a single tree is prone to overfitting because it lacks the averaging over many trees that Random Forest uses, making it less stable on unseen data.
Feature Importance: Key features influencing PM2.5 levels included PM10 (most strongly correlated), Relative Humidity (RH), which affects particle suspension, and Temperature, providing secondary predictive information.
Conclusion and Optimal Model Selection: Random Forest emerged as the optimal model, providing the best balance between accuracy and model complexity. Its ensemble nature allowed it to capture the data’s non-linear patterns effectively, outperforming simpler and single-tree models.
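
One simple way to see which predictors are most strongly related to PM2.5 is to rank them by absolute Pearson correlation. The sketch below uses synthetic data in which PM10 is built to dominate, so the coefficients and column names are assumptions rather than the station's values. Correlation is also not the same as a Random Forest's importance scores, which a fitted scikit-learn model exposes as `feature_importances_`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 365
pm10 = rng.uniform(60, 300, n)
rh = rng.uniform(30, 90, n)
temp = rng.uniform(15, 40, n)
# Synthetic PM2.5 built mostly from PM10, with weaker RH and Temp effects
pm25 = 0.4 * pm10 - 0.3 * rh - 0.5 * temp + rng.normal(0, 5, n)

df = pd.DataFrame({"PM2.5": pm25, "PM10": pm10, "RH": rh, "Temp": temp})

# Rank predictors by the strength (absolute value) of their correlation with PM2.5
ranking = (
    df.corr()["PM2.5"]
    .drop("PM2.5")
    .abs()
    .sort_values(ascending=False)
)
print(ranking.round(3))
```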

26 of 41

Name: Achintya Jha

Roll No: 23B4216

27 of 41

Station Location & Data Statistics

The CPCB station near my home town is Police Commissionerate, Jaipur – RSPCB

Data is given in the link:- https://1drv.ms/x/c/33aaa8d3c018a552/EcQ3k04Vl_pFhmibOmIDBl0BYNg2EH5bZq1KarUzMgG3xw?e=X0BuoT

The descriptive statistics before and after pre-processing:
Count Reduction:

After pre-processing, the counts for all variables decreased slightly (e.g., from 366 to 356 for PM2.5), indicating that missing or incomplete data rows were removed

Mean Changes:

Minor changes in the mean values across variables, such as PM10 shifting from 136.85 to 137.13, occurred due to the removal of some rows, which likely had different values from the overall mean

Median (50%) Adjustments:

Median values also shifted slightly (e.g., Temp median from 27.22 to 27.30) after dropping missing data, reflecting a recalibration of central values due to the cleaner dataset

Standard Deviation (std) Reduction:

Standard deviations for variables like Temp and RH became at most slightly smaller (e.g., Temp is 5.60 both before and after when rounded to two decimals), suggesting that removing incomplete rows had little effect on data variability

Min/Max Values:

Minimum and maximum values remain largely unchanged, indicating that extreme values were likely not removed during pre-processing, preserving the overall range

28 of 41

Trend Analysis

Data trend plots:

29 of 41

Trend Analysis

Interpretation

PM10:

The black line shows that PM10 levels fluctuate significantly, with several high spikes throughout the time period.

PM10 values are generally higher compared to PM2.5 and often indicate extreme peaks, suggesting episodes of high particulate pollution.

PM2.5:

The red line indicates PM2.5 levels, which tend to follow a similar pattern to PM10 but are generally lower in magnitude. PM2.5 correlates with PM10, as both represent particulate matter, with PM2.5 being a subset of smaller particles.

Temperature and PM Levels:

Higher spikes in temperature often coincide with high PM10 and PM2.5 levels. This could indicate that warmer conditions (often linked to stagnant air) allow particulates to accumulate, leading to pollution spikes.

Humidity and PM Levels:

High relative humidity (RH) values tend to appear when PM10 and PM2.5 levels are low. This inverse relationship could be because high humidity helps settle particulates or indicates rainy/wet conditions, which clean the air by washing out particulates.

Wind Speed (WS):

The pink line for WS (Wind Speed) remains low with only minor fluctuations. Higher wind speeds might help disperse particulate matter, potentially affecting PM10 and PM2.5, though this effect isn’t strongly apparent in the graph.

30 of 41

Seasonal Analysis

Season-wise analysis through box plots of PM2.5, PM10, and Relative Humidity

31 of 41

Seasonal Analysis

Interpretation through box-plots

1. Winter

PM2.5 and PM10: Highest levels due to temperature inversions that trap pollutants near the surface, along with increased emissions from heating and industry.

Relative Humidity: Moderate, but stable atmospheric conditions prevent pollutant dispersion.
Summary: Winter has the poorest air quality, with significant pollutant buildup.

2. Pre-Monsoon

PM2.5 and PM10: Moderate to high due to dry, windy conditions that stir up dust, especially in arid regions.
Relative Humidity: Generally low, allowing dust to remain airborne.
Summary: Pre-monsoon has elevated pollution from dust and dryness.

3. Monsoon

PM2.5 and PM10: Lowest levels as frequent rains wash out pollutants from the air.
Relative Humidity: Very high, helping to settle airborne particles.
Summary: Monsoon provides the cleanest air, with natural purification from rainfall.

4. Post-Monsoon

PM2.5 and PM10: Gradual increase as rains decrease, allowing pollutants to accumulate.
Relative Humidity: Initially high, then gradually lowers, reducing particle settling.
Summary: Post-monsoon shows a rising trend in pollution, leading into winter levels.

32 of 41

Model Result

Model metrics for all the four models

Linear regression model Lasso regression model

Decision tree model Random Forest Regressor Model

Interpretation: According to the R², MSE, and RMSE values for my data, the Random Forest Regressor was the best model because it has the highest R² and the lowest MSE and RMSE of the four. The worst-performing model was the Decision Tree Regressor, which has the lowest R² and the highest MSE and RMSE. The Lasso Regression model had a slightly higher R² than the Linear Regression model, but the difference was not significant.

33 of 41

Model Result

Actual vs predicted plots

Linear regression model Lasso Regression model

Decision tree model Random forest regression model

34 of 41

Model Result

Residual plots and interpretation

Interpretation: The scatter plots above show that the Random Forest regression model gives the best results, because its actual vs. predicted points lie closest to the line y = x. Its residuals are also the most randomly scattered around zero.

35 of 41

Results Comparison

36 of 41

Results Comparison

Metric	Aditya Gavali	Mukund Raj	Shaikh Adnan	Achintya Jha
R²	0.918	0.92	0.9805	0.711
MSE	33.406	317.50	38.1468	218.397
RMSE	5.779	17.82	6.1763	14.778

Since the Random Forest model outperformed other models across all four datasets, a comparative evaluation of its performance for each student can provide valuable insights.

37 of 41

Results Comparison

The evaluation of the Random Forest model's performance for predicting PM2.5 levels across four students' datasets reveals significant variations in predictive accuracy and reliability.

The Random Forest model achieves its highest R² (0.9805) on Shaikh Adnan's dataset, while Aditya Gavali's dataset has the lowest error metrics (MSE 33.406, RMSE 5.779). Mukund Raj's dataset also shows a high R² (0.92), though with the highest prediction errors (RMSE 17.82). Achintya Jha's dataset shows the weakest fit (R² 0.711), likely indicating unique data characteristics that reduce predictive accuracy. Because MSE and RMSE depend on the scale of each station's PM2.5 values, R² is the more comparable measure across datasets. These differences highlight the model's varying effectiveness across datasets, suggesting potential for further data preprocessing or tuning for datasets like Achintya’s.
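
The figures in the comparison table can be cross-checked and ranked with a few lines of Python. The sketch below copies the reported Random Forest metrics, confirms that each RMSE matches the square root of its MSE to within rounding (the 0.01 tolerance is an assumption), and orders the datasets by R² and by RMSE.

```python
import math

# Random Forest metrics as reported in the comparison table
results = {
    "Aditya Gavali": {"R2": 0.918, "MSE": 33.406, "RMSE": 5.779},
    "Mukund Raj": {"R2": 0.92, "MSE": 317.50, "RMSE": 17.82},
    "Shaikh Adnan": {"R2": 0.9805, "MSE": 38.1468, "RMSE": 6.1763},
    "Achintya Jha": {"R2": 0.711, "MSE": 218.397, "RMSE": 14.778},
}

# Consistency check: RMSE should equal sqrt(MSE) up to rounding
consistent = {
    name: math.isclose(math.sqrt(m["MSE"]), m["RMSE"], abs_tol=0.01)
    for name, m in results.items()
}

# Rank datasets by explained variance (higher is better) and by error (lower is better)
by_r2 = sorted(results, key=lambda name: results[name]["R2"], reverse=True)
by_rmse = sorted(results, key=lambda name: results[name]["RMSE"])

print("RMSE consistent with MSE:", consistent)
print("Ranked by R2:", by_r2)
print("Ranked by RMSE:", by_rmse)
```

The two rankings differ because RMSE is expressed in each dataset's own units and depends on how variable that station's PM2.5 is, whereas R² is scale-free.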

38 of 41

References

Mukund Raj (23B4245) data link: Data.xlsx
Achintya Jha (23B4216) data link: Recent data.xlsx
Aditya Gavali (23B4213) data link: newdata.xlsx
Shaikh Adnan (23B4207) data link: data-cpcb

Sources for data

Taken from the CPCB website: https://airquality.cpcb.gov.in/ccr/#/caaqm-dashboard-all/caaqm-landing
The Central Pollution Control Board (CPCB) website in India is a valuable resource for accessing real-time and historical data on environmental quality, specifically focusing on air and water quality.

39 of 41

References

Libraries used:

1. NumPy

Role: Provides efficient operations for numerical computations and array manipulation.
Use in Modeling: Useful for data preparation, performing matrix operations, and supporting the core calculations needed for many machine learning algorithms.

2. Pandas

Role: Handles data manipulation and preprocessing.
Use in Modeling: Essential for cleaning, transforming, and exploring data, making it ready for analysis and model input.

3. Matplotlib

Role: Basic data visualization library.
Use in Modeling: Helps visualize data distributions, model diagnostics, and performance metrics, aiding in EDA and interpretation.

4. Seaborn

Role: Statistical data visualization based on Matplotlib.
Use in Modeling: Provides visually appealing plots for correlation analysis, feature distribution, and complex statistical visualizations for EDA.

40 of 41

References

5. SciPy

Role: Offers advanced scientific and statistical functions.

Use in Modeling: Used for statistical testing, optimization, and handling specialized computations needed for certain models.

6. Scikit-Learn

Role: Core machine learning library.

Use in Modeling: Contains algorithms for model training, tools for evaluation and cross-validation, preprocessing functions, and utilities for building ML pipelines.

41 of 41

Thank You
