Friday, August 31, 2018

Unsupervised Learning to Classify and Group School Shootings

Introduction

In a previous analysis, I took a statistical approach to modeling school shootings using data maintained by the Washington Post. Specifically, I compared several models and distribution assumptions to predict how factors such as poverty, the presence of resource officers on campus, and the type of weapon used affect casualties in a school shooting, given that a shooting has occurred. But what if I weren't interested in modeling and predicting casualties? What if I didn't know exactly what I was interested in, but knew I wanted to break down the data in some way beyond just looking at it or deriving obvious summaries? In that case, I might consider using a machine learning technique called unsupervised learning.

What unsupervised learning allows me to do is group the observations in my data so that observations in one grouping are more similar to each other than to observations in other groupings. I could also dig deeper and analyze why that is the case using the model output. If done successfully, I would be able to separate the observations into a set of "clusters" and give a general description of each one. This is useful because, obviously, not all school shootings are the same. It's a point I hear expressed often: should an accidental shooting where no one was hurt really be called the same thing as an indiscriminate event such as Sandy Hook? Correct labeling here matters and has important implications.


PCA

When confronted with a problem like this, the first place I go is principal component analysis (PCA). This is a technique used to reduce the number of variables in a dataset by calculating linear combinations of the original variables, thereby capturing more of the data's variance with fewer variables. Imagine a dataset which contains the values 12, 32, 28, and 8. We can call this set A. Each of the values in set A (not actually, but for simplicity's sake) makes up 25% of the set's total variance. You want to choose one value (from inside or outside the set) that will capture as much of the total variance in the data as possible. So, you select 4, the greatest common divisor of the values. Four makes up a third of the first value, an eighth of the second, a seventh of the third, and half of the fourth. All together, with just one value, you have captured about 27.5% of the variance in the original set, more than you could have with any single value from within the set. That is the power of PCA: to simplify a large dataset with many variables into a smaller subset. In this case, we would call the value 4 principal component 1 (and every subsequent factor, which would respectively make up less and less of the total variance, would become principal component 2, 3, etc.).
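As a quick sanity check of that toy arithmetic, here is a small R snippet that reproduces the numbers above (this only verifies the illustrative fractions; it is not an actual PCA):

```r
# Toy "set A" from the example above
a <- c(12, 32, 28, 8)

# Share of each value "captured" by the common divisor 4
shares <- 4 / a      # 1/3, 1/8, 1/7, 1/2

# Average share across the set
mean(shares)         # ~0.275, i.e. about 27.5%
```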

Why would you want to do this? There are several reasons, including creating a reduced set of variables, which is important when you have a lot of variables and not a lot of observations (which was my case). Beyond that, it allows you to tease out which variables are most important for explaining the variation in the dataset and how much more important those variables are than the others. For instance, in set A, we know that 8 is very important because so much of its variance (50%) can be decomposed into a single number which is also a common component of the other values.

For PCA to work well, however, there needs to be some correlation between variables; the more correlation, the better it works. So, returning to my dataset, I examined the correlation between all the variables and found that there wasn't much, suggesting that extracting a large amount of the variance in the data with only a few principal components would be unlikely.
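As a rough sketch of that check in R (assuming the cleaned data sit in a data frame called shootings; the name and column handling here are placeholders, not the exact code I ran):

```r
# Keep only the numeric columns and compute Pearson correlations,
# using pairwise-complete observations to tolerate missing values
num_vars <- shootings[sapply(shootings, is.numeric)]
cor_mat  <- cor(num_vars, use = "pairwise.complete.obs", method = "pearson")

# Inspect how much correlation there is to exploit
round(cor_mat, 2)
```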


[Figure: Pearson correlation matrix. Darker boxes (ignoring the main diagonal) indicate stronger correlation between variables.]

In spite of this, I ran the PCA algorithm. Examining the first two principal components, we can more or less begin to envision natural groupings of the observations and see which variables matter most. A sample is given below.

Variable                 Rotation (PC1)   Rotation (PC2)
casualties                      0.204           -0.209
minority proportion            -0.262            0.291
enrollment                      0.094            0.191
resource officer                0.116            0.150
reason indiscriminate           0.282           -0.248
lunch program 75%              -0.137            0.106

Rotation is a weighting pattern given to each variable within each principal component (PC). PC1, for instance, says that the variable casualties is fairly important in the component's formation, as it is given a weight of about 0.20 (as opposed to about 0.09 for enrollment, for example). PC2 also shows casualties being of significant import, but in the opposite direction. If we imagine a scatter plot with PC1 on the horizontal axis and PC2 on the vertical, we would expect to generally see the shootings with the greatest casualties toward the bottom right of such a plot, high in PC1 and low in PC2. Of course, that will not always be true because of the weighting of the other variables. For instance, minority proportion (the proportion of students identifying as nonwhite) shows exactly the opposite relationship to casualties, so that (imagining the same scatter plot) if there were a high-casualty shooting which occurred in a minority-heavy school, we might see such an observation right in the middle of the graph. For a complete look at each variable's rotation through principal components 1 and 2, see here.
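For reference, a minimal sketch of how these rotations can be produced in R with prcomp (again using the hypothetical num_vars data frame from above; prcomp needs complete cases, and scaling matters because the variables are on very different units):

```r
# PCA on standardized variables
pca <- prcomp(na.omit(num_vars), center = TRUE, scale. = TRUE)

# Rotation (loading) of each variable on PC1 and PC2,
# analogous to the table above
pca$rotation[, 1:2]

# Proportion of total variance captured by each component
summary(pca)$importance["Proportion of Variance", 1:2]
```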


Clustering Algorithm

Using the first two principal components, I extracted about 20.6% of the variance in the dataset. I then used several clustering algorithms to see if I could extract a grouping of the observations that made intuitive sense. This was not as easy as it may sound. Most of the algorithms I ran tended to lump nearly all of the observations into a single cluster, with very few assigned to the other groups. But I did find one solution that seemed to work fairly well: a k-means clustering algorithm with three centers (meaning it formed three distinct groups). We can view the results in the following scatter plot.
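A minimal sketch of that clustering step, assuming the pca object from the earlier sketch (the seed, nstart, and plotting details are my own choices, not necessarily those of the original script):

```r
# Scores of each shooting on the first two principal components
scores <- pca$x[, 1:2]

# k-means with three centers; multiple random starts guard
# against a poor initialization
set.seed(42)
km <- kmeans(scores, centers = 3, nstart = 25)

# Visualize the three clusters in PC1-PC2 space
plot(scores, col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "k-means clusters of school shootings")
```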




There is no exact way to label these resultant groups, but I decided to identify the clusters by their relative number of casualties and level of poverty. We'll explore why that works later. As we can see, and as could have been predicted, the school shootings with the highest number of casualties (which I have highlighted by the date, name of the school, and number of casualties) fell together in the bottom right of the scatter plot, in the blue group, which generally contained shootings occurring at schools with lower poverty. This cluster also contained all of the school shootings which garnered the most national attention.

So, what other information can be extracted from this analysis? For one thing, as is displayed in the next plot, the most impoverished schools made up a large share of the observations and generally sat farther away from the higher-casualty school shootings. The implication seems to be a tradeoff by type of school: if a school is more impoverished, it may see more shootings, all else constant (or the likelihood of a shooting occurring there will be higher), but it will generally see fewer casualties. That makes sense. The schools that seem to receive the most media attention from a shooting are generally less impoverished and might be more of a magnet for people who want to cause as much destruction as possible. On the other hand, poverty is usually correlated with more violence generally.



Poverty in this dataset is measured by the percentage of students in the school eligible for a reduced-price lunch program. The points I have highlighted with text (date, city/state, and number of casualties) are the schools in the dataset that had 99% or more of the student body eligible for this program. As we see, the most impoverished schools generally did not see more than 2 casualties per school shooting. The one exception I found was Salvador B. Castro Middle School in February of 2018, which had 90% of its student body eligible for this program and suffered 5 casualties in a school shooting.

Lastly, I wanted to show how having a resource officer on campus factors into this analysis. The next plot highlights with text (date and number of casualties) the schools that had resource officers on campus while being less impoverished (we can think of these as typical middle-class-neighborhood schools).



Generally, we see that the schools I highlighted with a text box tended to cluster in the bottom right of the graphic and to be higher-casualty. The campuses which were more impoverished and had a resource officer generally saw fewer casualties, but it is the low-poverty, low-casualty group (bottom left) which had almost no resource officers on campus and suffered very few casualties. In my other analysis, the conclusion was drawn that having a resource officer on campus actually causes more casualties when controlling for the level of poverty, all else constant, and this breakdown of the data seems to support that.


Conclusion

With an unsupervised breakdown of the data, we are able to confirm some of our intuitions about school shootings, as well as some of the less obvious conclusions derived in my previous analysis. Schools with higher poverty seem to see fewer casualties per school shooting, all else constant, and resource officers present on campus don't seem to reduce casualties at all. I was also able to show how certain kinds of school shooting events tend to cluster together using a k-means algorithm. There are many more ways this data can be broken down using unsupervised learning, but I have included only what I found most interesting. For the full R script used for this analysis, see here.

Wednesday, August 29, 2018

Taking a Statistical Approach to Analyze School Shootings


Introduction


School shootings are ubiquitous in America. What's just as ubiquitous are people's ideas about why they occur so often and what society can do about it. I had the idea of using statistics to examine this issue. It would be cool if it were possible to use geography and time to predict when and/or where the next mass school shooting will occur. Obviously, this could have enormously positive implications. Unfortunately, it's probably an impossible task, given what data are available to do it. There's not a lot of good record keeping when it comes to school shootings: an overabundance of missing values, non-collection of interesting variables, etc. But I did find one dataset that was interesting: a repository on GitHub maintained by the Washington Post. I examined this data for many months, and what follows are my thoughts on what these data reveal. If you also want to see a machine learning technique called "cluster analysis" used on this data for a more in-depth look, see my other post, here.


Exploring the Data

As of August 27, 2018, the dataset contained information on 221 school shootings occurring since (and including) Columbine in 1999. A handful of observations were missing values for specific variables, but for the most part, there was enough to draw meaningful conclusions in the areas I was interested in. A preliminary exploration of the data revealed some insights that were not surprising, such as:
  • 89% of the shootings were known to be committed by a male (or multiple males)
  • 61% of the shootings were known to be committed by minors (usually current or former students)
  • 35% of the shootings occurred on a campus where resource officers were present
  • 72% of the shootings occurred on campuses where at least 25% of the student body was eligible for a reduced-price or free lunch program, indicating a high correlation between poverty and shootings



A look at the final dataset (after I made my manipulations) is available here.
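For a sense of how those summary figures can be derived, here is a rough sketch in R; the column names (shooter_gender, resource_officer, lunch_pct) are hypothetical stand-ins for the actual fields in the dataset:

```r
# Share of shootings committed by a male shooter
mean(shootings$shooter_gender == "male", na.rm = TRUE)

# Share of shootings on campuses with a resource officer present
mean(shootings$resource_officer, na.rm = TRUE)

# Share of shootings at schools with at least 25% lunch-program eligibility
mean(shootings$lunch_pct >= 25, na.rm = TRUE)
```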

To examine some of the less obvious results, I needed to decide on a more advanced approach. I determined that the most interesting way to do this would be to model and predict the number of casualties in a given school shooting using metrics available in the dataset. Note, this is not to say I would be able to predict anything about when a subsequent school shooting would occur. Rather, I could attempt to answer the question: given that a shooting has occurred, what factors lead it to be more or less deadly?


Statistical Modeling

There are two broad ways to approach modeling data like this: classical statistical modeling (otherwise known as econometrics) and machine learning. Machine learning usually produces very good predictions, but it is hard to interpret the model output, and it requires a lot of data to return reliable predictions (see overfitting). Because I did not have a lot of data at my disposal and I wanted to be able to interpret the model inputs (to be able to say that increasing variable x leads to a certain amount of increase or decrease in variable y, etc.), I pursued modeling the data with econometrics.

Econometrics requires assuming a distribution, a determination of the general pattern the variables in the models follow. There were three distributions I tried assuming: normal, Poisson, and negative binomial. Normal distributions are the simplest (think traditional bell curves) but are not always reasonable with real-world data, which are usually more complex. For instance, in the dataset at hand, for obvious reasons, casualties can never be negative. They can also never be decimal amounts. When assuming normality, no real number can be excluded from the distribution. For this reason, the normal distribution was most likely not going to be ideal. But that doesn't mean more complex assumptions would be any better. When normality works as well as anything else, it is the best assumption to make because it is so simple.

A Poisson distribution can work well at times when the values of the data being predicted can never be negative or non-integer, as is the case with casualties in this dataset. The Poisson distribution does make one simplifying assumption that can undermine its effectiveness: that the mean of the predicted variable equals its variance. With casualties in this dataset, the variance was much greater than its mean which is why I also decided to model with a negative binomial distribution, correcting for such “overdispersion.” Again, this is not to say the negative binomial distribution will always be better than the Poisson in these cases—the negative binomial assumption is much more complicated, and sometimes the simplest assumptions are the best assumptions.
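A sketch of what fitting a model under each of the three assumptions looks like in R; the formula and variable names are placeholders (the real models used 31 inputs), and the negative binomial fit comes from glm.nb in the MASS package:

```r
library(MASS)  # for glm.nb

# Placeholder formula; the actual models included many more inputs
f <- casualties ~ resource_officer + lunch_pct + weapon_rifle + age_shooter

m_normal  <- glm(f, data = shootings, family = gaussian())  # normal assumption
m_poisson <- glm(f, data = shootings, family = poisson())   # Poisson assumption
m_negbin  <- glm.nb(f, data = shootings)                    # negative binomial (handles overdispersion)

# Quick check of overdispersion in the outcome itself
c(mean = mean(shootings$casualties), variance = var(shootings$casualties))
```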



Running the Models

I modeled casualties using three different models (one under each distribution assumption), each time with the same 31 inputs, including the gender and age of the shooter, how each weapon was obtained, the weapon type, poverty levels in the school, whether a resource officer was on site, and other such general information. The advantage of using more inputs when modeling is that interpretations of each input are in a context of "holding all other factors constant" (or at least as many other factors as are in the model), which helps the modeler tease out actual causality and not just correlation. There is a limit, though: the more variables used, the wider the confidence intervals on each input's effect become, and the harder it is to show that any of the inputs are statistically significant; you lose degrees of freedom and run into problems with collinearity. I settled on the inputs I did because, theoretically, each one seemed like it could add something of value to the model and would have an interesting interpretation. See the data's summary statistics for information about each input.

The negative binomial model worked best. I determined this by writing a function in R to derive each model's log likelihood, AIC, and BIC (I refer to these three metrics together as the information criteria).
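A minimal sketch of such a helper, written against the hypothetical model objects from the fitting sketch above (not necessarily the original function):

```r
# Collect log likelihood, AIC, and BIC for a named list of fitted models
info_criteria <- function(models) {
  data.frame(
    model  = names(models),
    logLik = sapply(models, function(m) as.numeric(logLik(m))),
    AIC    = sapply(models, AIC),
    BIC    = sapply(models, BIC)
  )
}
```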



Running these lines of code, I extracted each model’s information criteria:
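In code, that amounts to something like the following (again a sketch using the hypothetical model objects from earlier):

```r
info_criteria(list(normal = m_normal, poisson = m_poisson, negbin = m_negbin))
```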



Information criteria give a general indication of residual dispersion in the fitted models, that is, how well the model predictions fit the data. The higher the value of the log likelihood, the better the model performs; the lower the values of the AIC and BIC, the same can be said. In effect, these are three different methods of capturing the same general information, and while they do not always agree with one another, in this case they did. It was easy to determine that the negative binomial (NB) model fit the given data best.


Meaningful Insights

Resource Officer – Using the superior NB model as the standard for deriving results, interpreting most of what the model returned was fairly intuitive (for example, when a shooting is indiscriminate, as in the cases of Sandy Hook, Columbine, and Stoneman Douglas, more casualties can be expected than from an accidental shooting). But there were some surprises as well. For one thing, the data strongly suggested that the greater the percentage of students eligible for the reduced-price lunch program, the fewer casualties occur per school shooting, holding all else constant. The opposite can be said about whether a resource officer is on campus: when a resource officer is on campus during a shooting, more casualties occur, all else constant (on average, 3.5 more). Not only that, but the resource officer and school lunch variables were among the most significant inputs in the model, so we can be very confident that these effects were not measured by chance. Also, because there were many other inputs in the model, we can say that other potentially confounding factors were held constant, further suggesting causal relationships.
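One way a figure like "3.5 more casualties on average" can be derived from a count model is by comparing average predictions with the resource officer flag turned off and on. A sketch, using the hypothetical m_negbin fit and column names from earlier (not necessarily how the original analysis computed it):

```r
# Average predicted casualties with and without a resource officer,
# holding every other input at its observed values
no_officer   <- transform(shootings, resource_officer = FALSE)
with_officer <- transform(shootings, resource_officer = TRUE)

pred_no   <- predict(m_negbin, newdata = no_officer,   type = "response")
pred_with <- predict(m_negbin, newdata = with_officer, type = "response")

mean(pred_with - pred_no)  # average marginal effect on casualties
```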



Many times, we assume that the more poverty in a school, the more violence. We may also think that having a resource officer on campus will shut down a shooter more quickly, before many casualties have occurred. But the data suggest otherwise. This finding surprised me. In cases where results are unexpected like this, I would like to see them replicated in another study. But it may simply be that our intuition needs to change when thinking about the effects of a resource officer and poverty in schools. It is also worth noting that every model I ran suggested the same general relationship and statistical significance.

Weapon Type – In most cases, the model suggested that the weapon used in the shooting makes a significant difference in the number of casualties. In particular, the most statistically significant input pertaining to weapon type was "rifle," which denotes any kind of rifle (assault or otherwise) used in the shooting. According to the model, rifles cause on average 6 more casualties when used in a school shooting compared to when some other weapon is used. This may be due to one-off school shooting events where a handful of particularly dangerous shooters just happened to use a rifle, but again, the variable was very statistically significant, so to claim that rifles do not cause any more casualties in a school shooting than any other weapon, some compelling reason would have to be offered.





Illegally Obtained Weapons – As soon as I saw that the data had an indicator of whether or not the weapon used in the shooting was obtained illegally, I wanted to know what it had to say about casualties. When modeling this, the data seemed to show that fewer casualties were caused when a weapon was obtained illegally. It should be noted that this input was statistically insignificant (meaning we either need new data or a different study to draw any meaningful inferences about how the illegality of a weapon affects casualties). Nevertheless, the relationship suggested fewer casualties (or at least the same number of casualties) when an illegally obtained weapon is used vs. a legally obtained one.

 

Wrapping Up

For a complete look at the code I used (all in R) to produce this analysis, see here. For a complete look at the final negative binomial model output and its interpretations, see here. Although the data used in this analysis proved to have limitations, it still revealed insights that were not obvious and were intriguing. I would like to obtain other data in different formats (ideally without any missing values) to perform other analyses to further validate or invalidate these findings. For now, the insight that having a resource officer on campus leads to more casualties per school shooting was interesting. A campus which is generally more impoverished (as indicated by the campuses with the highest percentages of students eligible for reduced-price lunch) saw fewer casualties per school shooting, even when other factors were held constant. I did not expect either of these findings and found such discoveries fascinating.