Motor Vehicle Accidents: Hudson County, New Jersey
By: Azucena Amez, Nick Andreula, Rich Brownlee, & Francis Pino
The primary goal of our project was to identify the most important factors associated with vehicle accidents that resulted in serious injury or death. Over the past 10 years, Hudson County, New Jersey has recorded a total of 198,620 accidents. In these incidents, over 50,000 vehicle occupants or pedestrians were injured and 223 were killed. The topic is also personally important to us, as the mother of one of our group members was recently hit by a car while walking. To conduct the project, we obtained 10 years of crash data from the NJDOT and compiled it into one file. The dataset was then uploaded to Microsoft Azure, where multiple models were run to provide insight into the most significant factors in classifying car accidents.
The ultimate goal is for county officials to use this insight to make changes to relatively controllable factors, such as redesigning roads and intersections that have been historically dangerous. Not only will this project have great economic utility, but it will also contribute greatly to the overall safety and well-being of county residents and visitors.
II. Business Case
It is estimated that a motor vehicle accident occurs every 14 seconds in the United States. As a result of these accidents, individuals suffer debilitating injuries, significant financial burdens, and in some cases death. Not only do vehicle occupants suffer physically, but accidents also impose large costs on the drivers involved, the government, and ultimately the taxpayer. AAA estimated that accidents cost Americans nearly $164.2 billion per year. Given these negative implications, our goal for the project was to identify the most significant factors and key problem areas related to these incidents. To provide insight into the problem, we used a dataset from the New Jersey Department of Transportation.
We chose Hudson County, NJ for the analysis because it is one of the most densely populated counties in the state and is well known to most of the group members. Initial analysis of the data revealed staggering statistics. Within the last 10 years, Hudson County as a whole had 198,620 car accidents. As a result of these accidents, over 50,000 civilians were injured and 223 died.
With those statistics in mind, it became obvious that identifying the most relevant factors and key problem areas would be invaluable for both the citizens and the county itself. It will allow influential decision makers within the county to better understand the issue, the most effective methods to mitigate accidents, and the negative effects that result from them. Although accidents occur for a variety of reasons, we believe that our modeling and analysis will allow us to identify many of the controllable aspects, such as posted speed, yellow light duration, and roadway design, which officials can later address and correct.
III. Data Acquisition & Analysis
The data utilized for this project was obtained from the New Jersey Department of Transportation (NJDOT), which provided all of the motor vehicle accident records in Hudson County from 2005-2014. As previously stated, Hudson County was chosen because it is one of the most densely populated counties in New Jersey and, as a result, has many accidents that are severe and in some cases fatal. Data was downloaded for each individual year and then compiled into one master file containing the 10 years of data. The final dataset contained 198,620 car accidents with 48 features describing each accident and the circumstances surrounding it. Some examples of these features were crash year, total killed, total injured, time of crash, light conditions, and crash location. The most noteworthy data transformation was the creation of the target variable, which was recoded as a categorical variable. Any accident that resulted in injury or death was coded as 1, whereas any accident in which there were no injuries or deaths was coded as 0. With the target variable recoded, the 198,620 total accidents split into 36,887 that resulted in injury or death and 161,733 that were minor in nature, with nobody injured or killed.
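The recoding rule described above can be sketched in a few lines of Python. This is only an illustration: the field names `total_killed` and `total_injured` and the three sample rows are assumptions made for the example, not the actual NJDOT column names or records. The class counts at the end are the ones reported in the text.

```python
def recode_severity(total_killed: int, total_injured: int) -> int:
    """Return 1 if the crash injured or killed anyone, otherwise 0."""
    return 1 if (total_killed + total_injured) > 0 else 0

# Hypothetical rows; the field names are assumptions, not the NJDOT schema.
crashes = [
    {"total_killed": 0, "total_injured": 0},
    {"total_killed": 0, "total_injured": 2},
    {"total_killed": 1, "total_injured": 0},
]
labels = [recode_severity(c["total_killed"], c["total_injured"]) for c in crashes]

# Class balance using the counts reported in the text.
severe, minor = 36_887, 161_733
total = severe + minor
severe_share = severe / total
print(f"{total:,} accidents, {severe_share:.1%} coded as 1")
```

Roughly 19% of accidents fall in the positive class, so the two classes are noticeably imbalanced.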
(Graph 1)
Microsoft Azure was used for the project modeling and analysis. Azure provides a wide variety of algorithms for developing predictive models. We ran several models in Azure in an attempt to classify the most important features and key problem areas associated with serious accidents in the county. The first model was Two-Class Logistic Regression, a statistical technique that estimates the probability of a binary outcome. The second was Two-Class Decision Forest, an ensemble of decision trees whose individual predictions are combined to predict a target that has two values. The third was Two-Class Boosted Decision Tree, in which each tree corrects the errors of the trees before it and the prediction is made by the entire ensemble together. Finally, Two-Class Neural Network was used to predict a target that has only two values. Four of the most common metrics for comparing overall model performance are accuracy, precision, specificity, and sensitivity.
In regard to our business case, the metric used to identify the best performing model was sensitivity: the share of accidents resulting in injury or death (the true positives) that a model correctly identifies. A model with high sensitivity is the one that most effectively classifies serious or fatal accidents, and it is therefore the most useful for highlighting key problem areas and relevant factors. As seen in the table below, the best performing model by sensitivity was the Two-Class Boosted Decision Tree.
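One reason to favor sensitivity over accuracy here is the class imbalance in the recoded target (36,887 serious accidents versus 161,733 minor ones). The sketch below computes the four metrics from confusion-matrix counts and applies them to a hypothetical baseline that labels every accident as minor. The function name and the baseline are illustrative assumptions, not part of the Azure experiment.

```python
from typing import Dict, Optional

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, Optional[float]]:
    """Compute common binary-classification metrics from confusion-matrix counts."""
    def ratio(num: int, den: int) -> Optional[float]:
        # A metric is undefined (None) when its denominator is zero.
        return num / den if den else None

    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "precision": ratio(tp, tp + fp),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
    }

# Hypothetical baseline that predicts "no injury or death" for every accident,
# evaluated against the class counts reported in the text.
baseline = classification_metrics(tp=0, fp=0, tn=161_733, fn=36_887)
for name, value in baseline.items():
    print(name, "undefined" if value is None else f"{value:.3f}")
```

This baseline reaches about 81% accuracy while detecting none of the serious accidents, which is why accuracy alone would be a poor selection criterion for this business case.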
With the best performing model identified, the permutation feature importance node was used to help identify the most important factors. As seen below, some of the most important features identified were crash type code, total vehicles involved, cross street, accident date, accident time, light condition, and alcohol involved. Although these variables received relatively low importance scores, they appear to be highly relevant factors when classifying accidents that resulted in injury or death.
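The general idea behind permutation feature importance can be sketched as follows: shuffle the values of one feature, re-score the model, and record how much the chosen metric drops. The code below is a simplified illustration under stated assumptions, not Azure's implementation. The toy model, the feature names `vehicles` and `noise`, the data, and the use of accuracy as the metric are all hypothetical.

```python
import random
from typing import Callable, Dict, List

Row = Dict[str, float]

def accuracy(model: Callable[[Row], int], rows: List[Row], labels: List[int]) -> float:
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)

def permutation_importance(model: Callable[[Row], int], rows: List[Row],
                           labels: List[int], feature: str,
                           n_repeats: int = 20, seed: int = 0) -> float:
    """Average drop in accuracy after shuffling one feature's values."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the shuffled value for this feature only.
        permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
        drops.append(baseline - accuracy(model, permuted, labels))
    return sum(drops) / n_repeats

# Hypothetical toy model: flags a crash as serious when two or more vehicles are involved.
def toy_model(row: Row) -> int:
    return 1 if row["vehicles"] >= 2 else 0

rows = [{"vehicles": v, "noise": n}
        for v, n in [(1, 5), (2, 3), (3, 9), (1, 1), (2, 7), (1, 4)]]
labels = [toy_model(r) for r in rows]

imp_vehicles = permutation_importance(toy_model, rows, labels, "vehicles")
imp_noise = permutation_importance(toy_model, rows, labels, "noise")
print("vehicles:", imp_vehicles, "noise:", imp_noise)
```

A feature the model ignores scores exactly zero, while a feature the model relies on scores above zero. Scores are relative, so a small absolute score can still mark a feature as more informative than the others.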
The first feature identified by our model was crash type code, which identifies the manner in which the collision occurred. As seen below, a chart displays the crash types that result in the highest number of serious or fatal accidents, along with a diagram depicting the crash codes. This is extremely important, as it identifies the most commonly occurring serious accident types within the county. Same direction (rear end) accidents may indicate that yellow light duration needs to be increased, while right angle accidents may tell us that vehicle operators are either running stop signs and red lights or encountering busy intersections with potential blind spots.
The next informative feature identified by our model was the total count of vehicles involved in an accident. This is important because it shows the most common number of vehicles involved in serious accidents while validating two of our most important crash type codes: same direction (rear end) and right angle. It also validated the pedestrian crash type code by indicating that car vs. pedestrian was a very common accident type within Hudson County.
The next most informative feature was Cross Street Name, which gave us the streets with the highest number of deaths and injuries. The worst street was Broadway, which surprised us when we looked at a satellite image because the street is very short. It spans only around 5-6 blocks, but it intersects with a highway, which is likely a significant contributor to the problem. The second worst street was Bergen Avenue, followed by Kennedy Boulevard. These results make it clear where the problem areas are.
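A ranking of this kind can be produced by counting serious accidents per cross street. The sketch below uses Python's standard `collections.Counter`. The records, the field names `cross_street` and `severe`, and the placeholder street names are hypothetical and do not reflect the actual counts in the dataset.

```python
from collections import Counter

# Hypothetical records; "severe" is the recoded target (1 = injury or death).
crashes = [
    {"cross_street": "Street A", "severe": 1},
    {"cross_street": "Street B", "severe": 1},
    {"cross_street": "Street A", "severe": 1},
    {"cross_street": "Street C", "severe": 0},
    {"cross_street": "Street A", "severe": 1},
    {"cross_street": "Street B", "severe": 1},
    {"cross_street": "Street B", "severe": 0},
    {"cross_street": "Street C", "severe": 1},
]

# Count only the serious accidents, grouped by cross street.
severe_by_street = Counter(c["cross_street"] for c in crashes if c["severe"] == 1)

# Streets ordered from most to fewest serious accidents.
ranking = severe_by_street.most_common()
for street, count in ranking:
    print(street, count)
```

Raw counts favor long or busy streets, so a short street that still ranks near the top is a strong signal of a localized design problem.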
Lighting condition was the next most informative feature. Results showed that most accidents occur during the day, which we believe is driven by morning and evening rush hour traffic, glare during sunrise and sunset, and distractions such as texting or talking on the phone while driving, all of which make drivers more prone to accidents. Daytime crashes accounted for 66%, nighttime crashes accounted for 27%, and all other conditions made up the remaining 7%. The "all others" category includes inclement weather such as snow, ice, rain, and fog.
The last factor analyzed was alcohol involved. This feature made it apparent that the worst area for drunk driving related incidents was Jersey City. The numbers show that alcohol does make a difference, indicating that police presence at night in key problem areas may need to be introduced or increased. The police could set up roadblocks to check for drunk drivers in known areas.
IV. Summary of Key Insights & Conclusion
In summary, the acquired crash data can be extremely useful in identifying the most important factors and classifying key problem areas. Although many accidents are caused by drivers, this research has identified and confirmed many of the key problems and recurring crash locations that account for many serious accidents.
Because rear-end crashes are very common, this study suggests that more specific studies should be done on increasing yellow light duration and reducing the speed limit on specified roads. Some companies have already begun addressing this issue by implementing front-end accident prevention systems, which automatically stop the car regardless of what the driver is doing. Another finding is that accidents commonly occur at the same roads or intersections. This study therefore highlights key problem areas that can be redesigned to decrease the chance of an accident.
Hudson County is one of the most densely populated counties in the state, and pedestrian accidents are significant. Our recommendation for pedestrian and non-motorized safety is to further develop and improve crosswalks, bike lanes, and walk signals. In areas with high pedestrian traffic, such as schools, it is critical to have crossing guards or speed signals to make sure people cross only when they are supposed to.
Lastly, it is crucial for municipalities to increase police presence during the day, when the majority of serious accidents occur, or to decrease speed limits in heavily populated areas such as Broadway in Jersey City. An increased police presence during the day on the streets classified as the most dangerous should help provide officials with critical insight into the locations that require improvement. We hope that our findings will be useful in bringing valid solutions to the communities of Hudson County.
Resources