Wednesday, August 29, 2018

Taking a Statistical Approach to Analyze School Shootings


Introduction


School shootings are ubiquitous in America, and so are people's ideas about why they occur so often and what society can do about them. I had the idea of using statistics to examine this issue. It would be cool if it were possible to use geography and time to predict when and/or where the next mass school shooting will occur. Obviously, this could have enormously positive implications. Unfortunately, it's probably an impossible task given the data available. Record keeping on school shootings is poor--there is an overabundance of missing values, many interesting variables are simply not collected, and so on. But I did find one dataset that was interesting: a repository on GitHub maintained by the Washington Post. I examined this data for many months, and what follows are my thoughts on what it reveals. If you also want to see a machine learning technique called "cluster analysis" applied to this data for a more in-depth look, see my other post, here.


Exploring the Data

As of August 27, 2018, the dataset contained information on 221 school shootings occurring since (and including) Columbine in 1999. A handful of observations were missing values for specific variables, but for the most part, there was enough data to draw meaningful conclusions in the areas I was interested in. A cursory exploration of the data revealed some insights that were not surprising, such as:
  • 89% of the shootings were known to be committed by a male (or multiple males)
  • 61% of the shootings were known to be committed by minors (usually current or former students)
  • 35% of the shootings occurred on a campus where resource officers were present
  • 72% of the shootings occurred on campuses where at least 25% of the student body was eligible for a reduced-price or free lunch program, suggesting a correlation between poverty and shootings



A look at the final dataset (after I made my manipulations) is available here.

To examine some of the less obvious results, I needed a more advanced approach. I determined that the most interesting route would be to model and predict the number of casualties in a given school shooting using metrics available in the dataset. Note that this is not to say I would be able to predict anything about when a subsequent school shooting would occur. Rather, I could attempt to answer the question: given that a shooting has occurred, what factors lead it to be more or less deadly?


Statistical Modeling

There are two general ways to approach modeling data like this: classical statistical modeling (otherwise known as econometrics) and machine learning. Machine learning usually produces very good predictions, but its output is hard to interpret, and it requires a lot of data to return reliable predictions (see overfitting). Because I did not have a lot of data at my disposal and I wanted to be able to interpret the model inputs (to be able to say that increasing variable x leads to a certain increase or decrease in variable y, etc.), I pursued modeling the data with econometrics.

Econometric modeling requires assuming a distribution--a determination of the general patterns the variables in the model follow. I tried three distributions: normal, Poisson, and negative binomial. The normal distribution is the simplest (think traditional bell curves) but is not always reasonable for real-world data, which are usually more complex. For instance, in the dataset at hand, casualties can obviously never be negative, and they can never be decimal amounts either, yet under an assumption of normality, no real number is excluded from the distribution. For this reason, the normal distribution was most likely not going to be ideal. But that doesn't mean more complex assumptions would necessarily be better: when normality works as well as anything else, it is the best assumption to make because it is so simple.

A Poisson distribution can work well when the variable being predicted can never be negative or non-integer, as is the case with casualties in this dataset. The Poisson distribution does make one simplifying assumption that can undermine its effectiveness: that the mean of the predicted variable equals its variance. With casualties in this dataset, the variance was much greater than the mean, which is why I also decided to model with a negative binomial distribution, which corrects for such "overdispersion." Again, this is not to say the negative binomial distribution will always beat the Poisson in these cases--the negative binomial assumption is much more complicated, and sometimes the simplest assumptions are the best ones.
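To make the overdispersion check concrete, here is a minimal sketch in R with simulated data (the variable names and values are illustrative, not the actual dataset's columns):

```r
library(MASS)  # provides glm.nb for the negative binomial model

set.seed(42)
n <- 200
officer <- rbinom(n, 1, 0.35)  # hypothetical binary input
# Simulate overdispersed counts (negative binomial: variance > mean)
casualties <- rnbinom(n, size = 1.2, mu = exp(0.5 + 0.3 * officer))

# Poisson assumes mean == variance; a much larger variance signals overdispersion
c(mean = mean(casualties), variance = var(casualties))

pois_fit <- glm(casualties ~ officer, family = poisson)  # Poisson GLM
nb_fit   <- glm.nb(casualties ~ officer)  # adds a dispersion parameter
```

With real data, the same mean-versus-variance comparison on the casualty counts is what motivates reaching for the negative binomial in the first place.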



Running the Models

I modeled casualties using three different models (one under each distribution assumption), each time with the same 31 inputs, including the gender and age of the shooter, how each weapon was obtained, the weapon type, poverty levels in the school, whether a resource officer was on site, and other such general information. The advantage of using more inputs is that each input's interpretation is in the context of "holding all other factors constant" (or at least as many as are in the model), which helps the modeler tease out actual causality rather than mere correlation. There is a limit, though: the more variables used, the wider the confidence intervals on each input's effect become, making it harder to show that any input is statistically significant--you lose degrees of freedom and run into problems with collinearity. I settled on the inputs I did because, theoretically, each one seemed like it could add something of value to the model and would have an interesting interpretation. See the data's summary statistics for information about each input.

The negative binomial model worked best. I determined this by creating the following function in R to derive each model's log likelihood, AIC, and BIC (collectively, the information criteria):
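The original code isn't reproduced here, but a function along these lines collects all three metrics for any fitted model, using base R's generic extractors:

```r
# Sketch: collect log likelihood, AIC, and BIC for a fitted model.
# Works for lm, glm, and glm.nb objects via base R's generic extractors.
info_criteria <- function(model) {
  c(logLik = as.numeric(logLik(model)),
    AIC    = AIC(model),
    BIC    = BIC(model))
}
```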



Running these lines of code, I extracted each model’s information criteria:
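The comparison step looked something like this (again a sketch using simulated data, since the original code isn't shown; the real models used the dataset's 31 inputs):

```r
set.seed(42)
y <- rnbinom(200, size = 1.2, mu = 2)  # overdispersed count outcome
x <- rnorm(200)                        # stand-in predictor

# Fit one model per distribution assumption
models <- list(
  normal  = lm(y ~ x),
  poisson = glm(y ~ x, family = poisson),
  nb      = MASS::glm.nb(y ~ x)
)

# One column of logLik / AIC / BIC per model, for side-by-side comparison
sapply(models, function(m) c(logLik = as.numeric(logLik(m)),
                             AIC = AIC(m), BIC = BIC(m)))
```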



The information criteria give a general indication of residual dispersion in the fitted models--how well the model predictions fit the data. The higher the log likelihood, the better the model performs; the lower the AIC and BIC, the same can be said. In effect, these are three different ways of capturing the same general information, and while they do not always agree with one another, in this case they did. It was easy to determine that the negative binomial (NB) model fit the data best.


Meaningful Insights

Resource Officer – Using the superior NB model as the standard for deriving results, most of what the model returned was fairly intuitive to interpret (for example, when a shooting is indiscriminate, as in the cases of Sandy Hook, Columbine, and Stoneman Douglas, more casualties can be expected than from an accidental shooting). But there were some surprises as well. For one thing, the data strongly suggested that the greater the percentage of students eligible for the reduced-price lunch program, the fewer casualties occur per school shooting, holding all else constant. The opposite can be said of having a resource officer on campus: when a resource officer is on campus during a shooting, more casualties occur, all else constant (on average, 3.5 more). Not only that, but the resource officer and school lunch variables were among the most significant inputs in the model--we can be very confident that these effects were not observed by chance. Also, because there were many other inputs in the model, we can say that other potentially confounding factors were held constant, further suggesting causal relationships.
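One note on reading NB output: because the model uses a log link, the raw coefficients act multiplicatively on expected casualties, so exponentiating them gives interpretable effect sizes. A sketch with a hypothetical resource-officer indicator (simulated data, not the real model):

```r
set.seed(7)
officer <- rbinom(300, 1, 0.35)  # hypothetical on-campus officer indicator
# Simulate counts where the officer indicator raises the expected count
casualties <- rnbinom(300, size = 1.5, mu = exp(0.4 + 0.5 * officer))

fit <- MASS::glm.nb(casualties ~ officer)
exp(coef(fit))  # multiplicative change in expected casualties per input
```

An additive figure like "3.5 more casualties on average" comes from translating that multiplicative effect into a marginal effect at the observed values of the other inputs.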



We often assume that the more poverty in a school, the more violence. We might also think that having a resource officer on campus will shut down a shooter quickly, before many casualties have occurred. But the data suggest otherwise. This finding surprised me. When results are unexpected like this, I would like to see them replicated in another study. But it may simply be that our intuitions about the effects of resource officers and poverty in schools need to change. Also worth noting: every model I ran suggested this same general relationship and statistical significance.

Weapon Type – In most cases, the model suggested that the weapon used in the shooting makes a significant difference in the number of casualties. In particular, the most statistically significant input pertaining to weapon type was "rifle," which denotes any kind of rifle (assault or otherwise) used in the shooting. According to the model, rifles cause, on average, 6 more casualties when used in a school shooting compared to other weapons. This may be due to one-off events in which a handful of particularly dangerous shooters just happened to use a rifle, but again, the variable was very statistically significant, so some compelling reason would have to be offered to claim that rifles cause no more casualties in a school shooting than any other weapon.





Illegally Obtained Weapons – As soon as I saw that the data had an indicator of whether the weapon used in a shooting was obtained illegally, I wanted to know what it had to say about casualties. The model seemed to show that fewer casualties occur when the weapon was obtained illegally. It should be noted that this input was statistically insignificant (meaning we would need new data or a different study to draw any meaningful inferences about how the illegality of a weapon affects casualties). Nevertheless, the relationship was suggestive of fewer casualties (or at least no more) when an illegally obtained weapon is used versus a legally obtained one.

 

Wrapping Up

For a complete look at the code I used (all in R) to produce this analysis, see here. For a complete look at the final negative binomial model output and its interpretations, see here. Although the data used in this analysis proved to have limitations, it still revealed insights that were not obvious and were intriguing. I would like to obtain other data in different formats (ideally without any missing values) to perform further analyses to validate or invalidate these findings. For now, the insight that having a resource officer on campus leads to more casualties per school shooting was interesting. A generally more impoverished campus (as indicated by the highest percentages of students eligible for reduced-price lunch) saw fewer casualties per school shooting, even when other factors were held constant. I did not expect either of these findings and found such discoveries fascinating.
