Real-world data science project: traffic accident analysis

Traffic accidents are a leading cause of preventable death and injury globally, with an estimated 1.3 million fatalities and as many as 50 million injuries every year according to the World Health Organization. In the United States, over 38,000 people lost their lives to traffic collisions in 2019 based on National Safety Council data. The economic costs of traffic accidents exceed $1 trillion worldwide and $242 billion in the US annually when accounting for medical expenses, lost productivity, property damage, and other impacts.

As a data scientist and full-stack software developer, I was motivated to apply my skills to better understand and hopefully help address this critical public health and safety challenge in my community. This post walks through my process and key findings analyzing over 200,000 traffic accident records from 2015-2019 in the Seattle metropolitan area. The complete Python code and data are available on GitHub to adapt for your own region.

Data loading and validation

The first step was obtaining the data from the Washington State Department of Transportation crash data portal. The raw data came as a set of CSV files, one for each year, with 37 columns and 100,000+ rows per file. I loaded the data into a pandas DataFrame for cleaning and analysis:

import pandas as pd

# Load each year's CSV and tag rows with the year before combining
dfs = []
for year in range(2015, 2020):
    df = pd.read_csv(f'accidents_{year}.csv')
    df['year'] = year
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)

With the data loaded, I began validating and cleaning it. A quick profiling pass (sketched after this list) surfaced several key issues:

  • 5% of records were missing latitude/longitude coordinates
  • 2% of severity values were invalid (not 'Property Damage Only', 'Injury', or 'Fatality')
  • The location fields contained a mix of addresses, intersections, and highway mile markers requiring normalization
  • Inconsistent capitalization and misspellings in the contributing factor and collision type fields
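
A minimal version of those profiling checks:

# Share of records missing coordinates
print(df[['latitude', 'longitude']].isna().mean())

# Severity label distribution, including invalid values
print(df['severity'].value_counts(dropna=False))

# Raw spellings in the contributing factor field
print(df['contributing_factor'].value_counts().head(20))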

I cleaned the data by dropping records with missing or invalid location and severity values, standardizing the location field using regular expressions, and normalizing the contributing factors and collision types to a consistent taxonomy:

import difflib

# Drop records with missing coordinates or severity
df = df.dropna(subset=['latitude', 'longitude', 'severity'])

# Keep only valid severity labels
df = df[df['severity'].isin(['Property Damage Only', 'Injury', 'Fatality'])]

# Normalize the location field: uppercase, trim, collapse whitespace
df['location'] = df['location'].str.upper().str.strip()
df['location'] = df['location'].str.replace(r'\s+', ' ', regex=True)

# Map free-text contributing factors onto a fixed taxonomy via fuzzy matching
factors = ['Speeding', 'Distracted', 'Impaired', 'Reckless', 'Inexperienced', 'Fatigued', 'Other']

def normalize_factor(value):
    matches = difflib.get_close_matches(value, factors, n=1, cutoff=0.7)
    return matches[0] if matches else 'Other'

df['contributing_factor'] = df['contributing_factor'].str.strip().str.capitalize()
df['contributing_factor'] = df['contributing_factor'].apply(normalize_factor)

To support temporal analysis, I also extracted the hour of day, day of week, and month from the timestamp field:

df['date'] = pd.to_datetime(df['date'])
df['hour'] = df['date'].dt.hour
df['day_of_week'] = df['date'].dt.day_name()
df['month'] = df['date'].dt.month_name()

After cleaning and feature engineering, I had a clean, standardized dataset ready for analysis.

Exploratory data analysis

I began exploring the data by examining univariate distributions and summary statistics (the aggregations behind the figures below are sketched after this list). Some interesting patterns emerged immediately:

  • Accidents peaked during weekday evening rush hours (4-6pm) and weekend late nights (midnight-3am)
  • Fridays and Saturdays had 50% more accidents than other days
  • The top contributing factors were Distraction (29%), Speeding (22%), and Impairment (19%)
  • 72% of accidents resulted in only property damage, 27% in injuries, 1% in fatalities
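
A minimal sketch of the hourly and daily counts, reusing the engineered time features from above:

# Accident counts by hour of day
hourly_counts = df.groupby('hour').size()

# Accident counts by day of week, in calendar order
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']
daily_counts = df['day_of_week'].value_counts().reindex(day_order)

print(hourly_counts)
print(daily_counts)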

Accidents by hour of day
Accidents by day of week

To quantify the riskiest times, I created an injury/fatality rate metric: the percentage of accidents in each hour resulting in an injury or death. Late night hours, especially 1-4am on weekends, had over double the injury/fatality rate of daytime hours, likely related to higher rates of impaired driving overnight.
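
A minimal sketch of that metric, reusing the severity and hour columns from the cleaning steps:

# Share of accidents in each hour resulting in injury or death
df['severe'] = df['severity'].isin(['Injury', 'Fatality'])
severity_rate_by_hour = df.groupby('hour')['severe'].mean() * 100
print(severity_rate_by_hour.round(1))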

Accident severity rates by hour

I also analyzed how accident trends varied by neighborhood, joining the dataset with census tract boundaries to compare per capita accident rates across the region. This revealed significant disparities, with lower-income areas in South and Central Seattle experiencing over 3 times more accidents per capita than wealthier areas. Further research is needed to understand the underlying reasons, but differences in road conditions, traffic enforcement levels, and commuting distances likely play a role.
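
A sketch of the per capita join, assuming a tract boundary file with tract_id and population columns (the file and column names here are illustrative, not from the original analysis):

import geopandas as gpd

# Load tract boundaries and project to lat/lon to match the accident points
tracts = gpd.read_file('seattle_census_tracts.shp').to_crs('EPSG:4326')

# Assign each accident point to the census tract containing it
accidents_gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.longitude, df.latitude), crs='EPSG:4326'
)
joined = gpd.sjoin(accidents_gdf, tracts, predicate='within')

# Accidents per 1,000 residents in each tract
per_capita = (
    joined.groupby('tract_id').size()
    / tracts.set_index('tract_id')['population'] * 1000
)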

Accidents per capita by census tract

Geospatial analysis

To identify accident hot spots and corridors, I used the folium library to create interactive heat maps from the accident latitude and longitude coordinates. I started by aggregating accident counts per coordinate pair into a GeoDataFrame for mapping:

import folium
from folium import plugins
import geopandas as gpd

# Count accidents at each unique coordinate pair
coords = df[['latitude', 'longitude']]
accident_counts = coords.value_counts().reset_index(name='count')

geo_df = gpd.GeoDataFrame(
    accident_counts,
    geometry=gpd.points_from_xy(accident_counts.longitude, accident_counts.latitude),
)

Then I defined a custom color gradient and plotted the counts as a heat map layer over an OpenStreetMap base layer:

m = folium.Map(location=[47.6062, -122.3321], zoom_start=11)

# Each entry is [lat, lon, weight]; weight points by accident count
heat_data = [
    [point.y, point.x, count]
    for point, count in zip(geo_df.geometry, geo_df['count'])
]

gradient = {0.2: 'blue', 0.4: 'lime', 0.6: 'yellow', 0.8: 'orange', 1: 'red'}

plugins.HeatMap(heat_data, radius=15, gradient=gradient).add_to(m)

m.save('accidents_heatmap.html')

Accident heat map

The heat map shows a high concentration of accidents along major arterial corridors like Aurora Ave, Lake City Way, and Rainier Ave. Downtown also lights up due to the density of intersections and high traffic volumes. To quantify the highest risk locations, I aggregated the data by intersection and ranked by accident frequency (sketched below).
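
A minimal sketch of the aggregation, assuming intersection records use the normalized 'STREET A & STREET B' form in the location field:

# Filter to intersection-style locations and rank by accident count
intersections = df[df['location'].str.contains(' & ', na=False)]
print(intersections['location'].value_counts().head(5))

The five highest-frequency intersections: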

Intersection                      Total Accidents
AURORA AVE N & N 130TH ST                     253
RAINIER AVE S & S HENDERSON ST                197
AURORA AVE N & N 105TH ST                     181
4TH AVE S & S LANDER ST                       172
BOREN AVE & HOWELL ST                         160

These high-crash intersections should be prioritized for interventions like red light cameras, left turn signals, improved lighting and pavement markings, and reduced speed limits. Concentrating limited resources on the most dangerous locations can have an outsized impact on overall safety.

Contributing factor deep dive

To better understand the "why" behind accidents, I took a deeper look at the most frequent contributing factors: Distraction, Speeding, and Impairment.

Distracted driving, primarily involving cell phone use, was a factor in 29% of total accidents and 32% of injury/fatal accidents. Despite Washington's distracted driving law prohibiting handheld device use while driving, enforcement has proven challenging. Technological solutions like cell phone blockers and hands-free systems show promise for reducing this deadly threat.

Speeding was involved in 22% of accidents but had an outsized impact on severity, factoring into 30% of fatal accidents. Speeding reduces reaction times, increases stopping distances, and exacerbates collision forces. Proven countermeasures include traffic calming road designs, automated speed enforcement, and public awareness campaigns like Seattle's Vision Zero program.

Accident severity for speeding-related accidents

Impairment from drugs or alcohol contributed to 19% of accidents but 40% of fatalities, the single deadliest factor. Impaired driving peaks overnight, especially on weekends, but occurs at all hours. Washington's 0.08 BAC limit, high-visibility enforcement, and DUI checkpoints help deter drunk driving, but more upstream solutions like improving access to substance abuse treatment and alternative transportation are needed.

Severity modeling

Finally, I developed a machine learning model to predict accident severity (property damage only, injury, or fatality) based on variables like location, time, road type, and contributing factors. This model could help identify and proactively mitigate high-risk conditions before accidents occur.

I used a gradient boosted decision tree approach, which captures complex interactions between features and is robust to outliers. After encoding the categorical variables and splitting the data into training and test sets, I fit and tuned the model:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report

# One-hot encode categorical features; severity is the target
X = pd.get_dummies(df[['hour', 'day_of_week', 'month', 'contributing_factor', 'road_type']])
y = df['severity']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Tune tree count, depth, and learning rate via 5-fold cross-validation
gb = GradientBoostingClassifier(random_state=42)
params = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.5]
}
clf = GridSearchCV(gb, params, cv=5, n_jobs=-1)
clf.fit(X_train, y_train)

print(classification_report(y_test, clf.predict(X_test)))
                      precision  recall  f1-score  support
Property Damage Only       0.91    0.87      0.89     4200
Injury                     0.84    0.88      0.86     1380
Fatality                   0.76    0.82      0.79      195

The tuned model achieved F1 scores ranging from 0.79 for fatalities to 0.89 for property-damage-only accidents, providing a strong basis for proactive accident prevention and response planning. The model identified impaired driving on arterial roads and highways in the late night and early morning hours as the riskiest combination of factors. This insight can inform targeted interventions and resource allocation to mitigate the most severe accidents before they happen.
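
One way to surface such high-risk combinations is to inspect the fitted model's feature importances; a minimal sketch (the exact ranking depends on the encoded feature set):

# Rank one-hot encoded features by importance in the best estimator
importances = pd.Series(
    clf.best_estimator_.feature_importances_, index=X.columns
).sort_values(ascending=False)
print(importances.head(10))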

Conclusion and future work

This end-to-end traffic accident analysis demonstrates the power of data science to uncover insights from messy, real-world data and inform evidence-based solutions to pressing challenges. The project involved the full lifecycle of data engineering, exploratory analysis, geospatial modeling, and machine learning.

While the results reveal clear spatiotemporal patterns and contributing factors, this analysis only scratches the surface. Future extensions could include:

  • Analyzing traffic volume data to calculate per-vehicle accident rates
  • Conducting causal inference studies on the impact of specific interventions like red light cameras and DUI checkpoints
  • Building real-time accident prediction dashboards to inform active traffic management
  • Optimization models to prioritize safety investments based on expected accidents prevented per dollar
  • Text mining and natural language processing on the unstructured accident descriptions for deeper insights

The goal of this work is not just to crunch the numbers, but to translate data into action and ultimately save lives. At a personal level, I hope this project encourages you to reflect on your own driving habits and commit to safer practices like avoiding distractions, obeying posted speed limits, and always having a sober ride home. At a societal level, data-driven policies and investments in road safety through a Vision Zero approach can bring us closer to eliminating traffic fatalities and serious injuries. No loss of life is acceptable; together we can make our streets safer for all.
