Predicting Severity Code From Collision Report

袁晗 | Luo, Yuan Han
Published in The Startup · Oct 21, 2020


Photo: University of Washington, college of built environment, http://be.uw.edu/2017-cbe-research-open-labs/satelliteseattle/

Methodology:

  1. Introduction to the problem
  2. Data
  3. EDA (exploratory data analysis)
  4. Data preparation
  5. Feature Selection
  6. Model Building
  7. Model Evaluation
  8. Conclusion

I. Introduction

For a long time, car collisions have not only been a major source of commuter stress, they have also caused great amounts of damage to public infrastructure (i.e., road signs, traffic lights) and resources (i.e., emergency calls that dispatch police, firefighters, and ambulances). If city planners can clearly identify which conditions lead to infrastructure damage as opposed to personal injury, perhaps they can better shape the cities of tomorrow.

II. Data

Fortunately, we do not have to find or scrape our own data. It is provided by the SPD (Seattle Police Department) and recorded by Traffic Records. The data has no duplicated rows and consists of 194,673 collisions and 37 attributes/columns (not counting the index), with one target/dependent variable, ‘SEVERITYCODE’, and 36 independent variables/features. The plan is to fill in missing values or delete columns with a large number of missing values. After the data is properly prepared, we will encode the categorical features and normalize them along with the numerical features before testing their relevance to the target variable ‘SEVERITYCODE’. The final step will be to select a handful of relevant features and feed them to the models. But without taking a closer look at the data, it is hard to say exactly what we can do, so let’s start by exploring it.
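Below is a minimal sketch of this first pass, assuming the SPD extract is a CSV named Data-Collisions.csv (the file name is not specified in the post):

```python
import pandas as pd

# Hypothetical file name for the SPD collision extract.
df = pd.read_csv("Data-Collisions.csv")

print(df.shape)                         # roughly (194673, 38) including the target
print(df.duplicated().sum())            # no duplicated rows in this extract
print(df["SEVERITYCODE"].isna().sum())  # the target itself has no missing values
print(df.isna().sum().sort_values(ascending=False).head(10))  # worst columns for missing data
```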

III. EDA (Exploratory Data Analysis)

After dicing our data into pieces, we notice a few things.

Although the metadata says there are 5 possible values (0–3, 2b) that our target variable can take on, the column actually contains only 2: 1 (136,486 cases) and 2 (58,188 cases). This is a classification problem: each independent variable will be used to classify a collision as ‘SEVERITYCODE’ 1 or 2. Great, right off the bat EDA clarifies the goal for us; let’s see what follows.

As an old Chinese saying goes, “To shoot your enemy, shoot the horse first; to capture bandits, capture their king first.” So let’s look at the target variable that we are trying to predict.

Severity code 1 (teal): property damage / 2 (light orange): injury

This simple bar graph tells us that there are more property-damage cases than injury cases within the recorded data. Great, now that we have captured the king, let’s further dissect the data we are working with, starting with what’s missing.
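A bar chart like this can be reproduced with a few lines of pandas and matplotlib; the exact colors are my guess at the original palette:

```python
import matplotlib.pyplot as plt

counts = df["SEVERITYCODE"].value_counts()   # 1: 136486, 2: 58188
counts.plot(kind="bar", color=["teal", "sandybrown"])
plt.xlabel("SEVERITYCODE")
plt.ylabel("number of collisions")
plt.title("Property damage (1) vs. injury (2)")
plt.show()
```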

Missing values

Features with more than 75,000 missing values will be deleted, along with the following columns (a sketch of the drop step follows this list).

  • I dropped ‘SEVERITYDESC’, ‘ST_COLDESC’, and ‘SDOT_COLDESC’ because they are just text descriptions of ‘SEVERITYCODE’, ‘ST_COLCODE’, and ‘SDOT_COLCODE’ respectively.
  • I dropped ‘INTKEY’ because it has over 7,000 unique values.
  • Since a pandas data frame comes with its own index, we do not need the unique keys ‘OBJECTID’, ‘COLDETKEY’, and ‘INCKEY’ that come with the data.
  • I also dropped some redundant columns such as ‘SEVERITYCODE.1’, ‘INCDATE’, and ‘LOCATION’.
  • The metadata has no documentation on ‘REPORTNO’, so I dropped it.
  • Over 90% of ‘INTKEY’, ‘CROSSWALKKEY’, and ‘SEGLANEKEY’ take on a single value of 0 or a value less than one, so I dropped all three.
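Here is a rough sketch of that drop step, assuming the data frame is named df; the exact order of operations in the original notebook may differ:

```python
# Columns dropped for the reasons listed above: descriptions of coded columns,
# redundant unique keys, undocumented fields, and near-constant columns.
to_drop = [
    "SEVERITYDESC", "ST_COLDESC", "SDOT_COLDESC",
    "OBJECTID", "COLDETKEY", "INCKEY",
    "SEVERITYCODE.1", "INCDATE", "LOCATION",
    "REPORTNO",
    "INTKEY", "CROSSWALKKEY", "SEGLANEKEY",
]
df = df.drop(columns=to_drop, errors="ignore")

# Also drop any remaining feature with more than 75,000 missing values.
df = df.loc[:, df.isna().sum() <= 75000]
```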

Now that we have the missing values cleared (pun intended) out of the way, let’s move on to data preparation.

IV. Data Preparation

I worked on this project one feature at a time, but of course I will only highlight a few key attributes that later proved to be important.

Conventional wisdom tells me that there is some relationship between the location of an accident and what kind of accident it is.

X and Y coordinates, Blue line = property damage, Orange line = injury

There is a difference in the ratio of severity codes, but it is not nearly as large as I was hoping for. So let’s move on; maybe the machines can pick up some subtle differences.
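The comparison above can be sketched with seaborn density plots; the original post used distplot, and kdeplot with a hue argument (seaborn 0.11+) is an assumed equivalent on my part:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Kernel-density estimates of longitude (X) and latitude (Y), split by severity code.
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
sns.kdeplot(data=df, x="X", hue="SEVERITYCODE", ax=axes[0])
sns.kdeplot(data=df, x="Y", hue="SEVERITYCODE", ax=axes[1])
plt.show()
```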

The features ‘JUNCTIONTYPE’ and ‘COLLISIONTYPE’ are both good representatives of features with a strong connection to ‘SEVERITYCODE’.

Keep in mind that our goal is to have features that give us some confidence: given an input for ‘JUNCTIONTYPE’ (e.g., Mid-Block), we want to say with some confidence whether the severity code is 1 or 2. In this case, “Mid-Block” is clearly associated with property damage, whereas “At Intersection” is associated with injuries.

Another way of looking at this is that we are categorizing each case as belonging to either severity code 1 or 2 based on the given feature. For example, if the input (a feature of a case) is “Mid-Block”, there is overwhelming evidence to classify the case as severity code 1 (i.e., property damage). On the flip side, if the input is “At Intersection”, the probability that the case caused injury is much higher than the probability of property damage. In other words, the raw count of each possible input (“At Intersection”, “Mid-Block”, etc.) is not important for our purposes. What matters is the ratio of severity codes (1 vs. 2) within each individual input value. Okay, I went on a rant; let’s go to the next exemplar feature.
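A quick way to see these ratios, rather than raw counts, is a normalized value_counts per junction type; the exact category labels come from the SPD data and may be worded slightly differently in your extract:

```python
# Share of each severity code within each junction type.
ratios = (
    df.groupby("JUNCTIONTYPE")["SEVERITYCODE"]
      .value_counts(normalize=True)
      .unstack()
)
print(ratios)
# Expect the mid-block rows to lean heavily toward code 1 (property damage)
# and the at-intersection rows to lean toward code 2 (injury).
```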

The final feature I will discuss briefly is ‘INCDTTM’. It is quite a compact feature that combines date and time, so I break it down into year, month, date, hour, and minute. I further polish up month, date, and hour while discarding the rest.
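A rough sketch of that split is below; the derived column names (MONTH, DATE, HOUR) are my own, although ‘DATE’ does reappear in the feature selection section:

```python
# Parse the combined date/time string and keep only the parts that were retained.
dt = pd.to_datetime(df["INCDTTM"], errors="coerce")
df["MONTH"] = dt.dt.month
df["DATE"] = dt.dt.day
df["HOUR"] = dt.dt.hour
df = df.drop(columns=["INCDTTM"])
```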

If you think I wasted my time on this, you are not wrong, because I did. After an extensive amount of work, these features don’t seem to lead to anything. But the reason I point this out is to show you that among the 36 features, about a third are similar to this one, so we do not have to go through the others in the same detail.

Since many of our features are nominal categorical data, we have to one-hot encode them before we can fit them to our models.

After we one-hot encode all the nominal categorical features, we end up with 88 columns.
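As a sketch, the encoding and normalization step might look like this, assuming df is the cleaned data frame; using min-max scaling for the normalization mentioned earlier is an assumption on my part:

```python
from sklearn.preprocessing import MinMaxScaler

# One-hot encode the nominal categorical features, then scale everything to [0, 1].
categorical_cols = df.select_dtypes(include="object").columns
encoded = pd.get_dummies(df, columns=list(categorical_cols))

X = encoded.drop(columns=["SEVERITYCODE"])
y = encoded["SEVERITYCODE"]
X_scaled = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)
print(X_scaled.shape)  # the post reports 88 columns at this point
```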

V. Feature Selection

I employed 2 methods to select features: Univariate Selection and Feature Importance. Each gave me a different set of features to work with, so I decided to try both sets and select the better one based on the results of model building.
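A minimal sketch of the two methods, continuing from the X_scaled and y above; using scikit-learn’s SelectKBest with a chi-squared test for univariate selection and a tree ensemble’s feature_importances_ for feature importance is an assumption about the exact estimators:

```python
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import ExtraTreesClassifier

# Univariate selection: score each feature against the target with a chi-squared test.
selector = SelectKBest(score_func=chi2, k=10).fit(X_scaled, y)
univariate_scores = pd.Series(selector.scores_, index=X_scaled.columns).nlargest(10)

# Feature importance: impurity-based importances from a tree ensemble.
trees = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X_scaled, y)
importances = pd.Series(trees.feature_importances_, index=X_scaled.columns).nlargest(10)

print(univariate_scores)
print(importances)
```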

There is a discrepancy: the graph in the data preparation phase showed us that ‘DATE’ is not a good indicator of severity code, but the univariate selection method says it is the most important feature. We will put it to the test and see who is right.

Another interesting result: although ‘DATE’ is replaced by ‘Y’ as the most important feature, it still occupies the second spot. And even though ‘Y’ appears to be a moderate indicator of severity code, I would not call it the most important feature for predicting severity code. The two methods are challenging my observations; let’s see what the models have to say.

VI. Model Building

Since we are dealing with classification predictive modeling, I selected KNN, decision tree, linear SVM, and logistic regression. Since the procedural steps and results are more or less similar, I will only show the findings from KNN, because it also required an extra block of code to find the best n (the number of neighbors in KNN).

KNN best n
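The best-n search can be sketched as a simple loop over candidate k values; the split parameters and the range of k below are assumptions, and KNN on the full data set can be slow:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=4)

# Try a range of k values and keep the one with the best hold-out accuracy.
ks = range(1, 15)
scores = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    scores.append(accuracy_score(y_test, knn.predict(X_test)))

best_k = ks[int(np.argmax(scores))]
print(best_k, max(scores))
```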

To settle the battle of the features, the results showed that the features recommended by the heat map and the graphs give a slightly higher score than those from the other 2 feature selection methods. I tried to outsmart the system by combining the best features from the different methods, but the results were worse. Hence, I stuck with the recommendation from the heat map and moved on to evaluating each model.

VII. Model Evaluation

I used Jaccard, F1-score, and LogLoss to evaluate the models and the results are as follows.
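A sketch of the scoring step with scikit-learn’s metrics; the averaging choices are assumptions, and LogLoss only applies to models that expose predicted probabilities:

```python
from sklearn.metrics import jaccard_score, f1_score, log_loss

def evaluate(model, X_test, y_test):
    """Report the three scores used above for an already fitted classifier."""
    y_pred = model.predict(X_test)
    scores = {
        "jaccard": jaccard_score(y_test, y_pred, pos_label=2),
        "f1": f1_score(y_test, y_pred, average="weighted"),
    }
    # LogLoss needs predicted probabilities, which e.g. a linear SVM only
    # exposes if it was trained with probability estimates enabled.
    if hasattr(model, "predict_proba"):
        scores["log_loss"] = log_loss(y_test, model.predict_proba(X_test))
    return scores
```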

Of course, all the models did better on cross-validation data than on test data, though not by much: scores range from .70–.75 down to .58–.71. Decision tree seems to be the most promising model here, so I would not be surprised if tweaking a random forest model could increase the score further.

VIII. Conclusion

At the beginning of this journey, we dove deep into the land of data provided by the SPD (Seattle Police Department) and found ourselves swimming in an ocean of features. Some seemed promising, others not so much. Initially, features such as weather, X (longitude), Y (latitude), UNDERINFL (under the influence), ‘HITPARKEDCAR’, ‘ROADCOND’, and ‘LIGHTCOND’ seemed promising, while seemingly arbitrary features such as ST_COLCODE, SDOT_COLCODE, and STATUS appeared unrelated or weakly connected at best. It turns out my initial intuition was neither entirely correct nor entirely wrong. After all, I think this is what makes for good findings in the realm of statistics: surprising but not outlandish.

Despite what the “Feature Importance” selection technique was telling me, I spent a great amount of time only to realize that longitude and latitude actually don’t matter that much. The first hammer landed on my head when I used the folium library to display car accidents via marker clusters (a technique that combines massive numbers of markers in dense areas into one circle). After filling all the missing/NA values and nearly crashing my laptop, only to find little pattern between location and severity code, I pressed on to feature selection and even to training models. I did get some promising results from my initial intuition, with “distplot” showing that features such as ‘JUNCTIONTYPE’ and ‘ROADCOND’ have some good correlations, just not as strong as I would like.

Let’s go back to the drawing board and see what we are missing. Aha, “distplot” once again saved the day by telling us that ‘ST_COLCODE’ and ‘SDOT_COLCODE’ show a huge difference between severity codes 1 and 2, even though I had deleted them for the greater half of my project because I didn’t know what to do with them. But that doesn’t matter, because now we realize our mistake, and once we add this magic to our model we will get 99% accuracy, pop champagne, celebrate, right? No. Both feature selection techniques (univariate selection and feature importance) show ‘ST_COLCODE’ and ‘SDOT_COLCODE’ with only average correlation, and the heat map shows them as even below-average importance. Again my perseverance was tested. So I went back to feature engineering and encoding, hoping this time the 2 features would rise to the top like a phoenix. They didn’t; they actually dropped lower than before and remained practically unchanged in the heat map (face palm). At this point, even a 10% increase in model accuracy would make it worth it. I got mixed results: while ‘SDOT_COLCODE’ provided volatile results, ‘ST_COLCODE’ was consistently recommended by all 3 feature selection techniques. So, not a complete waste of time.

So if it’s neither the super obvious features nor the super arcane ones, then what? It turns out the answer was right in front of me all along (just like in the movies). ‘ST_COLCODE’, ‘PEDCOUNT’, ‘ROADCOND’, ‘COLLISIONTYPE’, ‘JUNCTIONTYPE’, and a few others proved to be indispensable features for the training models. They are neither super obvious nor arcane. So common sense did prevail, kind of, in the form of ‘PEDCOUNT’ and the like, but it did little to justify features such as ‘ST_COLCODE’ or ‘HITPARKEDCAR’. But wait a minute, doesn’t it make sense that a ‘COLLISIONTYPE’ of “parked car” also means more property damage, because it literally says hit-parked-car? Or that cases with an ‘ST_COLCODE’ that literally spells out the nature of the accident (i.e., hitting a cyclist) are followed by a high injury rate? Well, yes it does; it just doesn’t sound too exciting. However, common sense did win, and I would argue it should always win. Sometimes information is hidden in a form where we need technology to clear away the clusters of noise so that our common sense can see things more clearly. In this project, ‘ST_COLCODE’ is that kind of information. Location, weather, and time gave us little indication of severity code. What actually happened in the record, such as ‘ST_COLCODE’ and ‘JUNCTIONTYPE’, gave us the biggest clue. Thank you for reading.
