1 of 17

Noise Pollution EDA (Exploratory Data Analysis)

Presented by: Andy K., Sidharth M., Aidan V., Wayland L.

2 of 17

OUR TEAM!

Andy Kuang

Data Science + Economics

Sid M.

Data Science + Intended Haas + Urban Science

Wayland La

Applied Math + Intended Data Science

Aidan Van Dalen

Cognitive Science + Intended Data Science

3 of 17

Why this Project?

Noise pollution is often overlooked compared to other pollutants, such as air or water pollution. We therefore decided to explore it by analyzing a US cities population dataset from Kaggle and a noise pollution dataset from OpenStreetMap (OSM), looking for patterns in the relationship between population density and noise pollution.

Our ultimate project goal was to create a site where tourists could identify more tranquil areas with less noise near the cities they plan to visit, thus giving users a better vacation experience.

4 of 17

Data Set Overview

We analyzed two datasets: one on population and one on noise. Combined, the data contained 75 rows and 34 distinct columns. Important columns include city, density, and N_median (median noise).

We looked for datasets that combined population and noise pollution data, but all the ones we could find were excessively large, some more than 12 GB.

5 of 17

Data Cleaning

We removed duplicate city rows, deleted unnecessary columns, and combined the two datasets on the city column.

After combining the datasets, we removed the outlier in the average noise pollution column.
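The cleaning steps above can be sketched in plain Python. This is only an illustration under assumptions: the rows are made up, the column name N_mean is a hypothetical stand-in for the average noise column, and the 1.5 × IQR rule for spotting the outlier is an assumed choice, since the slides do not say how the outlier was identified.

```python
from statistics import quantiles


def dedupe_by_city(rows):
    """Keep the first row seen for each city."""
    first_seen = {}
    for row in rows:
        first_seen.setdefault(row["city"], row)
    return list(first_seen.values())


def merge_on_city(population_rows, noise_rows):
    """Inner join: keep only cities that appear in both datasets."""
    noise_by_city = {row["city"]: row for row in noise_rows}
    return [
        {**pop, **noise_by_city[pop["city"]]}
        for pop in population_rows
        if pop["city"] in noise_by_city
    ]


def drop_outliers(rows, column, k=1.5):
    """Drop rows outside the IQR fences (assumed rule, not the team's stated one)."""
    q1, _, q3 = quantiles([row[column] for row in rows], n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [row for row in rows if low <= row[column] <= high]


# Made-up example rows; "N_mean" is a hypothetical column name.
population = [
    {"city": "A", "density": 1000.0},
    {"city": "A", "density": 1000.0},  # duplicate city
    {"city": "B", "density": 2500.0},
    {"city": "C", "density": 4000.0},
    {"city": "D", "density": 5200.0},
    {"city": "E", "density": 6100.0},
    {"city": "F", "density": 7000.0},
]
noise = [
    {"city": "A", "N_mean": 50.0},
    {"city": "B", "N_mean": 52.0},
    {"city": "C", "N_mean": 54.0},
    {"city": "D", "N_mean": 55.0},
    {"city": "E", "N_mean": 56.0},
    {"city": "F", "N_mean": 120.0},  # implausibly loud: the outlier
    {"city": "G", "N_mean": 53.0},   # no population row, dropped by the join
]

merged = merge_on_city(dedupe_by_city(population), dedupe_by_city(noise))
cleaned = drop_outliers(merged, "N_mean")
print([row["city"] for row in cleaned])
```

An inner join is used here so that only cities with both population and noise values survive, which matches the idea of combining on the city column.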

6 of 17

Heatmap

We developed a correlation matrix heatmap to understand associations between certain variables. We noticed a negative association between population density and pollution, which prompted us to delve deeper into density vs. noise pollution.

The heatmap displays correlation coefficients between the numeric features. In it, we can see that the MAX and RANGE of noise pollution correlate with population and population density.
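A correlation matrix like the one behind the heatmap can be sketched as follows. The values are made up, and the column names N_max and N_range are hypothetical labels for the MAX and RANGE noise columns; only density and N_median are named in the slides. The sketch computes Pearson coefficients by hand and prints them as a text table rather than drawing a heatmap.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Return {name_a: {name_b: r}} for every pair of numeric columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Made-up values; "N_max" and "N_range" are hypothetical column names.
columns = {
    "density":  [1200.0, 3400.0, 5600.0, 7100.0, 2500.0, 4300.0],
    "N_median": [55.0, 54.0, 57.0, 53.0, 56.0, 55.5],
    "N_max":    [70.0, 78.0, 85.0, 92.0, 74.0, 80.0],
    "N_range":  [20.0, 27.0, 33.0, 41.0, 22.0, 29.0],
}

matrix = correlation_matrix(columns)

# Print the matrix as a simple text table.
print(" " * 10 + "".join(f"{name:>10}" for name in matrix))
for row_name, row in matrix.items():
    print(f"{row_name:>10}" + "".join(f"{r:>10.2f}" for r in row.values()))
```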

7 of 17

Hypothesis Testing

Question: Is there a correlation between Population density and excess noise pollution?

Null hypothesis: There is no correlation between population density and excess noise, and any observed association is due to minor factors or chance.

Alternative hypothesis: Excess noise and population density are correlated across cities.

8 of 17

The Plan: Linear Regression + Bootstrapping

Use Linear Regression to track how much y changes when x changes

Bootstrap by sampling with replacement to see if the slope is likely to be zero

Observed Value (Slope of Linear Regression): 0.0115644
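The plan above can be sketched as follows. This is a minimal illustration on synthetic data, not the team's actual code or results: the data are generated with a made-up true slope of 0.01, x is assumed to be population density and y noise, and the 95% percentile interval is an assumed way of judging whether zero is a plausible slope.

```python
import random


def slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def bootstrap_slopes(xs, ys, repetitions=2000, seed=0):
    """Resample (x, y) pairs with replacement and refit the slope each time."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    for _ in range(repetitions):
        picks = [rng.randrange(n) for _ in range(n)]
        sample_x = [xs[i] for i in picks]
        if len(set(sample_x)) < 2:
            continue  # slope is undefined when every x is identical
        slopes.append(slope(sample_x, [ys[i] for i in picks]))
    return slopes


def percentile_interval(values, level=0.95):
    """Middle `level` share of the sorted bootstrap values."""
    ordered = sorted(values)
    lower = ordered[int((1 - level) / 2 * len(ordered))]
    upper = ordered[int((1 + level) / 2 * len(ordered)) - 1]
    return lower, upper


# Synthetic data (made up): noise rises by about 0.01 per unit of density.
data_rng = random.Random(42)
density = [data_rng.uniform(500, 8000) for _ in range(75)]
noise = [50 + 0.01 * d + data_rng.gauss(0, 5) for d in density]

observed = slope(density, noise)
slopes = bootstrap_slopes(density, noise)
low, high = percentile_interval(slopes)

print(f"observed slope: {observed:.5f}")
print(f"95% bootstrap interval: [{low:.5f}, {high:.5f}]")
print("zero inside interval:", low <= 0 <= high)
```

If zero falls outside the interval, the data are hard to reconcile with a slope of zero, which is the reasoning behind rejecting the null hypothesis.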

9 of 17

Testing Results

Tracking Slopes

Histogram of Slopes

Conclusion: We can reject our null hypothesis and say that there is a correlation between the two variables, because zero does not appear among the slopes in the histogram.

10 of 17

Density Map_box

Through density mapping, we can see which cities in the U.S. are highly congested.

This is one of our most useful graphs as we believe density could be directly correlated to noise pollution.

E.g. code

11 of 17

Map of Noise Pollution: The high density areas and the areas with higher concentrations of noise pollution are similar.

Since San Francisco, Boston, NYC, and LA are very dense in comparison to the other cities, the density map is realistic.

Population Density

Noise

12 of 17

Choropleth Mapping

Choropleth Mapping allows for visualizing geospatial patterns.

For this project, it helps us understand how total population distribution correlates with noise pollution.

E.g. code

13 of 17

Mean Noise Pollution by City/State in the U.S. The lack of data visibly affects the mean decibel values: California, for example, somehow has a lower mean than Kansas.

Population Density by City/State in the U.S. You can see possible similarities between population density and decibel levels (dB), but also outliers like Illinois.

Population Density

Mean Noise

14 of 17

Random Forest Regression

Create different tree instances
Train each tree separately
Make predictions with each individual tree
Take the average of all the predictions
Return the averaged outputs
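The five steps above can be sketched in plain Python. This is a simplified illustration under assumptions, not the team's implementation: the data are synthetic, each tree is trained on a bootstrap sample, each split looks at a random subset of features, and the tree depth and forest size are arbitrary choices.

```python
import random


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def sse(values):
    """Sum of squared errors around the mean (lower means a purer group)."""
    center = mean(values)
    return sum((v - center) ** 2 for v in values)


def build_tree(rows, targets, rng, depth=0, max_depth=3, min_size=2):
    """Grow one regression tree; a leaf is a float, a split is a dict."""
    if depth >= max_depth or len(rows) < 2 * min_size or len(set(targets)) == 1:
        return mean(targets)
    n_features = len(rows[0])
    # Random feature subset per split (assumed size: half the features).
    candidates = rng.sample(range(n_features), max(1, n_features // 2))
    best = None
    for feature in candidates:
        for threshold in sorted({row[feature] for row in rows}):
            left = [i for i, row in enumerate(rows) if row[feature] <= threshold]
            right = [i for i, row in enumerate(rows) if row[feature] > threshold]
            if len(left) < min_size or len(right) < min_size:
                continue
            score = sse(targets[i] for i in left) + sse(targets[i] for i in right)
            if best is None or score < best[0]:
                best = (score, feature, threshold, left, right)
    if best is None:
        return mean(targets)
    _, feature, threshold, left, right = best
    return {
        "feature": feature,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [targets[i] for i in left],
                           rng, depth + 1, max_depth, min_size),
        "right": build_tree([rows[i] for i in right], [targets[i] for i in right],
                            rng, depth + 1, max_depth, min_size),
    }


def predict_tree(tree, row):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, dict):
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree


def fit_forest(rows, targets, n_trees=25, seed=0):
    """Steps 1-2: create the trees and train each on its own bootstrap sample."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(len(rows)) for _ in range(len(rows))]
        forest.append(build_tree([rows[i] for i in picks],
                                 [targets[i] for i in picks], rng))
    return forest


def predict_forest(forest, row):
    """Steps 3-5: predict with every tree, average, and return the result."""
    return mean(predict_tree(tree, row) for tree in forest)


# Synthetic data (made up): features are (density, population).
data_rng = random.Random(1)
rows = [(data_rng.uniform(500, 8000), data_rng.uniform(1e4, 1e6)) for _ in range(60)]
targets = [50 + 0.005 * density + data_rng.gauss(0, 1) for density, _ in rows]

forest = fit_forest(rows, targets)
quiet = predict_forest(forest, (1000.0, 200000.0))
loud = predict_forest(forest, (7000.0, 200000.0))
print(f"predicted noise at low density:  {quiet:.1f}")
print(f"predicted noise at high density: {loud:.1f}")
```

Averaging many trees trained on different resamples is what smooths out the quirks of any single tree.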

15 of 17

Random Forest Regression

We used Random Forest Regression to fill in a predicted_noise_pollution value for every city in the Populations dataframe. The lack of sufficient data caused the predicted values to be too similar. With more cities, the predicted values would likely have been more accurate.

Mean squared error (MSE): the average squared difference between the actual and predicted values
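The average squared difference between actual and predicted values, commonly called mean squared error (MSE), can be computed as in this small sketch. The numbers are made up for illustration and are not the project's results.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up decibel values.
actual = [60.0, 55.0, 70.0]
predicted = [62.0, 55.0, 67.0]

# Squared differences are 4, 0 and 9, so the MSE is 13 / 3.
mse = mean_squared_error(actual, predicted)
print(f"MSE: {mse:.3f}")
```

A lower value means the predictions sit closer to the actual values; because the differences are squared, large misses count much more than small ones.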

16 of 17

Conclusions

We got in touch with Lukas Martinelli to ask whether we could use his API on Global Noise Pollution. Regretfully, he informed us that since we weren't working with a company, he wouldn't be able to provide the key.

Noise data:

~300 cities in the US

Population data:

~30,000 cities in the US

This noise data sample was not large enough to make generalizations about noise in all the cities in our population data.

We plan to implement a search bar with a dropdown menu that will let users choose any US city and see its noise pollution information.

Noise Dataset Difficulty

Noise Data

Dropdown Menu

17 of 17

THANKS!

Especially Madhuri & Alan!