School shootings are ubiquitous in America. Just as ubiquitous are people's ideas about why they occur so often and what society can do about them. I had the idea of using statistics to examine this issue. It would be valuable to use geography and time to predict when and/or where the next mass school shooting will occur. Obviously, this could have enormously positive implications. Unfortunately, it's probably an impossible task given the data available. Record keeping on school shootings is poor: there are many missing values, interesting variables go uncollected, and so on. But I did find one interesting dataset, a repository on GitHub maintained by the Washington Post. I examined this data for many months, and what follows are my thoughts on what these data reveal. If you also want to see a machine learning technique called "cluster analysis" applied to this data for a more in-depth look, see my other post, here.
Exploring the Data
As of August 27, 2018, the dataset contained information on 221 school shootings
that have occurred since (and including) Columbine in 1999. A handful of values
were missing for specific variables, but for the most part there was enough to
draw meaningful conclusions in a few areas that interested me. A preliminary
exploration of the data revealed some insights that were not surprising, such as:
- 89% of the shootings were known to be committed by a male (or multiple males)
- 61% of the shootings were known to be committed by minors (usually current or former students)
- 35% of the shootings occurred on a campus where resource officers were present
- 72% of the shootings occurred on campuses where at least 25% of the student body was eligible for a free or reduced-price lunch program, suggesting an association between poverty and shootings
A look at the final dataset (after I made my manipulations) is
available here.
To examine some of the less obvious results, I needed a more advanced
approach. I determined that the most interesting way to do this would be to
model and predict the number of casualties in a given school shooting using
metrics available in the dataset. Note that this is not to say I would be able
to predict anything about when a subsequent school shooting would occur.
Rather, I could attempt to answer this question: given that a shooting has
occurred, what factors lead to more or fewer casualties?
Statistical Modeling
There are two ways to approach modeling data like this: classical statistical
modeling (otherwise known as econometrics) and machine learning. Machine
learning usually produces very good predictions, but the model output is hard
to interpret, and it requires a lot of data to return reliable predictions
(see overfitting). Because I did not have a lot of data at my disposal and I
wanted to be able to interpret the model inputs (to be able to say that
increasing variable x leads to a certain increase or decrease in variable y,
etc.), I modeled the data with econometrics.
Econometric modeling requires assuming a distribution, a determination of the general pattern the modeled variable follows. I tried three distributions: normal,
Poisson, and negative binomial. The normal distribution is the simplest (think
traditional bell curve) but is not always reasonable for real-world data, which
are usually more complex. For instance, in the dataset at hand, casualties can
never be negative, and they can never be fractional. Under normality, no real
number is excluded from the distribution. For this reason, the normal
distribution was unlikely to be ideal. But that doesn't mean more complex
assumptions would necessarily be better. When normality works as well as
anything else, it is the best assumption to make because it is so simple.
A Poisson distribution can work well when the values being predicted can never
be negative or non-integer, as is the case with casualties in this dataset. The
Poisson distribution does make one simplifying assumption that can undermine
its effectiveness: that the mean of the predicted variable equals its variance.
With casualties in this dataset, the variance was much greater than the mean,
which is why I also decided to model with a negative binomial distribution,
which corrects for such “overdispersion.” Again, this is not to say the
negative binomial distribution will always be better than the Poisson in these
cases. The negative binomial assumption is much more complicated, and sometimes
the simplest assumptions are the best assumptions.
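To make the mean-equals-variance assumption concrete, here is a minimal sketch in R. It uses simulated counts, not the Washington Post data; the helper name dispersion_ratio and every parameter value are assumptions chosen only for illustration.

```r
# Simulated count data (hypothetical, not the real dataset).
set.seed(42)
n <- 221

# Poisson counts: the variance should be close to the mean.
poisson_counts <- rpois(n, lambda = 2)

# Negative binomial counts with the same mean; a small `size` means heavy overdispersion.
nb_counts <- rnbinom(n, mu = 2, size = 0.5)

# Hypothetical helper: ratio of sample variance to sample mean.
# Roughly 1 under a Poisson assumption, well above 1 when the data are overdispersed.
dispersion_ratio <- function(x) var(x) / mean(x)

poisson_ratio <- dispersion_ratio(poisson_counts)
nb_ratio <- dispersion_ratio(nb_counts)
print(c(poisson = poisson_ratio, negative_binomial = nb_ratio))
```

A ratio far above 1 on the real casualty counts is the kind of evidence that would motivate trying a negative binomial model alongside the Poisson.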
Running the Models
I modeled casualties using three different models (one under each distribution
assumption), each time with the same 31 inputs, including the gender and age of
the shooter, how each weapon was obtained, the weapon type, poverty levels in
the school, whether a resource officer was on site, and other such general
information. The advantage of using more inputs when modeling is that each
input is interpreted in a context of “holding all other factors constant” (or
at least as many other factors as are in the model), which helps the modeler
get closer to actual causality and not just correlation. There is a limit,
though. The more variables used, the wider the confidence intervals become on
each input’s effect, and the harder it is to show that any of the inputs are
statistically significant: you lose degrees of freedom and run into problems
with collinearity. I settled on the inputs I did because, theoretically, each
one seemed like it could add something of value to the model and would have an
interesting interpretation. See the data’s summary statistics for information
about each input.
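The cost of adding overlapping inputs can be seen in a small simulation. This is only a sketch with made-up variables (x1, x2, y), and it uses ordinary least squares rather than the count models in this analysis, because the effect on standard errors is easiest to see there.

```r
# Simulated data (hypothetical): x2 is nearly a copy of x1, so the two are highly collinear.
set.seed(7)
n <- 221
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
y  <- 1 + 2 * x1 + rnorm(n)

# Fit once with x1 alone, then again with the collinear input added.
fit_alone <- lm(y ~ x1)
fit_both  <- lm(y ~ x1 + x2)

# Standard error of the x1 coefficient in each model.
se_alone <- summary(fit_alone)$coefficients["x1", "Std. Error"]
se_both  <- summary(fit_both)$coefficients["x1", "Std. Error"]
print(c(se_alone = se_alone, se_both = se_both))

# The 95% confidence interval for x1 is correspondingly wider once x2 is included.
print(confint(fit_alone)["x1", ])
print(confint(fit_both)["x1", ])
```

The true effect of x1 is the same in both fits; only the precision with which it can be estimated changes.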
The negative binomial model worked best. I determined this
by creating the following function in R to derive each model's log likelihood,
AIC, and BIC (I refer to these three metrics together as the information
criteria):
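A minimal sketch of what such a helper might look like is below. It is not the function or data used in this analysis: the name info_criteria, the two inputs officer and lunch, and all simulated values are assumptions for illustration. It relies on glm.nb() from the MASS package, which ships with standard R installations.

```r
library(MASS)  # provides glm.nb() for negative binomial regression

# Hypothetical helper: collect log likelihood, AIC, and BIC for one fitted model.
info_criteria <- function(model) {
  c(logLik = as.numeric(logLik(model)), AIC = AIC(model), BIC = BIC(model))
}

# Simulated overdispersed counts with two made-up inputs (not the real data).
set.seed(1)
n <- 221
sim <- data.frame(officer = rbinom(n, 1, 0.35), lunch = runif(n))
sim$casualties <- rnbinom(
  n,
  mu = exp(0.5 + 0.6 * sim$officer - 0.8 * sim$lunch),
  size = 0.7
)

# One model per distributional assumption, each with the same inputs.
fit_normal  <- lm(casualties ~ officer + lunch, data = sim)
fit_poisson <- glm(casualties ~ officer + lunch, family = poisson, data = sim)
fit_nb      <- glm.nb(casualties ~ officer + lunch, data = sim)

# One row per model, one column per metric.
criteria <- rbind(
  normal            = info_criteria(fit_normal),
  poisson           = info_criteria(fit_poisson),
  negative_binomial = info_criteria(fit_nb)
)
print(criteria)
```

Because the simulated counts are overdispersed by construction, the negative binomial fit should show a lower AIC than the Poisson fit here; on real data the ranking has to be checked rather than assumed.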
Running these lines of code, I extracted each model’s
information criteria:
The information criteria give a general indication of how well each model's
predictions fit the data. The higher the log likelihood, the better the model
fits. The lower the AIC and BIC, which also penalize extra parameters, the
better. In effect, these are three different ways to capture the same general
information, and while they do not always agree with one another, in this case
they did. It was easy to determine that the negative binomial model (NB) fit
the given data best.
Meaningful Insights
Resource Officer – Using the superior NB model as the
standard for deriving results, most of what the model returned was fairly
intuitive (for example, when a shooting is indiscriminate, as in the cases of
Sandy Hook, Columbine, and Stoneman Douglas, more casualties can be expected
than from an accidental shooting). But there were some surprises as well. For
one thing, the data strongly suggested that the greater the percentage of
students eligible for the reduced-price lunch program, the fewer casualties
occur per school shooting, holding all else constant. The opposite can be said
of a resource officer being on campus: when a resource officer is on campus
during a shooting, more casualties occur, all else constant (on average, 3.5
more). Not only that, but the resource officer and school lunch variables were
among the most significant inputs in the model, so these effects are very
unlikely to be due to chance. Also, because there were many other inputs in the
model, other potentially correlated factors were held constant, further
suggesting (though not proving) causal relationships.
We often assume that the more poverty in a school, the more violence. We may
also think that having a resource officer on campus will shut down a shooter
quickly, before many casualties have occurred. But the data suggest otherwise.
This finding surprised me. When results are unexpected like this, I would like
to see them replicated in another study. But it may simply be that our
intuition needs to change when thinking about the effects of a resource officer
and poverty in schools. It is also worth noting that every model I ran
suggested this same general relationship and statistical significance.
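For readers wondering how a statement like "on average, 3.5 more casualties" can come out of a negative binomial model, here is one common way to compute such a number, sketched on simulated data. The variable names, the simulated effect sizes, and the averaging approach are all assumptions for illustration; this is not necessarily how the figures in this post were derived.

```r
library(MASS)  # provides glm.nb()

# Simulated data with made-up inputs (not the real dataset).
set.seed(3)
n <- 221
sim <- data.frame(officer = rbinom(n, 1, 0.35), lunch = runif(n))
sim$casualties <- rnbinom(
  n,
  mu = exp(0.5 + 0.6 * sim$officer - 0.8 * sim$lunch),
  size = 0.7
)

fit_nb <- glm.nb(casualties ~ officer + lunch, data = sim)

# The model uses a log link, so exp(coefficient) is a multiplicative effect
# on the expected count, holding the other inputs constant.
rate_ratio <- unname(exp(coef(fit_nb)["officer"]))

# To express the effect as a number of casualties, predict every row twice,
# once with officer = 1 and once with officer = 0, and average the difference.
with_officer    <- predict(fit_nb, newdata = transform(sim, officer = 1), type = "response")
without_officer <- predict(fit_nb, newdata = transform(sim, officer = 0), type = "response")
avg_difference  <- mean(with_officer - without_officer)

print(c(rate_ratio = rate_ratio, avg_difference = avg_difference))

# Coefficient table with standard errors and p-values, for judging significance.
print(summary(fit_nb)$coefficients)
```

The averaged difference depends on the values of the other inputs in the data, which is why it is reported as an average rather than as a single fixed effect.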
Weapon Type – In most cases, the model suggested that
the weapon used in the shooting makes a significant difference in the number of
casualties. In particular, the most statistically significant input pertaining
to weapon type was “rifle,” which denotes any kind of rifle (assault or
otherwise) used in the shooting. According to the model, a shooting involving
a rifle produces on average 6 more casualties than one in which some other
weapon is used. This may be due to one-off events where a handful of
particularly dangerous shooters just happened to use a rifle, but again, the
variable was very statistically significant. To claim that rifles do not cause
any more casualties in a school shooting than other weapons, some compelling
reason would have to be offered.
Illegally Obtained Weapons – As soon as I saw that
the data had an indicator of whether or not the weapon used in the shooting was
obtained illegally, I wanted to know what it had to say about casualties. When
modeling this, the data seemed to show that fewer casualties were caused when a
weapon was obtained illegally. It should be noted that this input was not
statistically significant (meaning we need either new data or a different study
to draw any meaningful inferences about how the illegality of a weapon affects
casualties). Nevertheless, the estimate pointed toward fewer casualties (or at
least the same number of casualties) when an illegally obtained weapon is used
vs. a legally obtained weapon.
Wrapping Up
For a complete look at the code I used (all in R) to produce
this analysis, see here. For
a complete look at the final negative binomial model output and its
interpretations, see here.
Although the data used in this analysis proved to have limitations, it still
revealed insights that were intriguing and not obvious. I would like to obtain
other data in different formats (ideally without any missing values) and
perform other analyses to further validate or invalidate these findings. For
now, the insight that having a resource officer on campus leads to more
casualties per school shooting was interesting. Campuses that were generally
more impoverished (as indicated by the highest percentages of students eligible
for reduced-price lunch) saw fewer casualties per school shooting, even when
other factors were held constant. I did not expect either of these findings and
found such discoveries fascinating.