Friday, August 31, 2018

Unsupervised Learning to Classify and Group School Shootings

Introduction

In a previous analysis, I took a statistical approach to modeling school shootings using data maintained by the Washington Post. Specifically, I compared several models and distributional assumptions to predict how factors such as poverty, the presence of resource officers on campus, the type of weapon used, and other variables affect casualties in a school shooting, given that a shooting has occurred. But what if I weren't interested in modeling and predicting casualties? What if I didn't know exactly what I was interested in, but knew I wanted to break down the data in some way beyond just looking at it or deriving obvious summaries? In that case, I might consider using a machine learning technique called unsupervised learning.

What unsupervised learning allows me to do is group observations in my data and say that observations within one group are more similar to each other than to observations in other groups. I could also dig deeper into the model output and analyze why that is the case. If done successfully, I would be able to separate the observations into a small number of "clusters" and give a general description of each one. This is useful because, obviously, not all school shootings are the same. It's a point I hear expressed often: should an accidental shooting in which no one was hurt really be called the same thing as an indiscriminate attack such as Sandy Hook? Correct labeling here matters and has important implications.


PCA

When confronted with a problem like this, the first place I go is principal component analysis (PCA). This is a technique used to reduce the number of variables in a dataset by computing linear combinations of the original variables, thereby capturing more of the data's variance with fewer dimensions. Imagine a dataset containing the values 12, 32, 28, and 8; call it set A. Suppose (not actually true, but for simplicity's sake) that each value in set A makes up 25% of the set's total variance. You want to choose one value (from inside or outside the set) that captures as much of that total variance as possible. So you select 4, the greatest common divisor of the values. Four makes up a third of the first value, an eighth of the second, a seventh of the third, and half of the fourth. Altogether, with just one value, you have captured about 27.5% of the variance in the original set, more than you could have with any single value from within the set. That is the power of PCA: simplifying a large dataset with many variables into a smaller set of components. In this toy case, we would call the value 4 principal component 1 (and each subsequent component, accounting for progressively less of the total variance, would become principal component 2, 3, and so on).
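As a quick check on that back-of-the-envelope arithmetic (this is only the toy illustration, not an actual PCA), a couple of lines of R reproduce the 27.5% figure:

    A <- c(12, 32, 28, 8)   # the toy set A
    mean(4 / A)             # (1/3 + 1/8 + 1/7 + 1/2) / 4, about 0.275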

Why would you want to do this? There are several reasons, including producing a reduced set of variables, which is important when you have many variables and relatively few observations (which was my case). Beyond that, it lets you flesh out which variables matter most for explaining variation in the dataset, and how much more important those variables are than the others. For instance, in set A we know that 8 is very important because so much of its variance (50%) can be decomposed into a single number that is also a common component of the other values.

For PCA to work well, however, there needs to be some correlation between variables; the more correlation, the better it works. So, returning to my dataset, I examined the correlations between all of the variables and found that there wasn't much, suggesting that extracting a large share of the variance with only a few principal components would be unlikely.


Figure: Pearson correlation matrix. Darker cells (ignoring the main diagonal) indicate stronger correlation between variables.
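For concreteness, here is a minimal sketch of that correlation check in R, assuming the cleaned numeric variables live in a data frame called shootings_num (a hypothetical name, not the one in my script):

    # Pairwise Pearson correlations between the numeric variables
    cor_mat <- cor(shootings_num, use = "pairwise.complete.obs")
    round(cor_mat, 2)                   # inspect the matrix directly
    heatmap(abs(cor_mat), symm = TRUE)  # quick visual check of the correlation structure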

In spite of this, I ran the PCA algorithm. Examining the first two principal components, we can more or less begin to see natural groupings of the observations and which variables matter most. A sample of the rotations is given below.

    Variable                  Rotation (PC1)   Rotation (PC2)
    casualties                     0.204           -0.209
    minority proportion           -0.262            0.291
    enrollment                     0.094            0.191
    resource officer               0.116            0.150
    reason indiscriminate          0.282           -0.248
    lunch program 75%             -0.137            0.106

Rotation is the weighting (or loading) each variable receives in each principal component (PC). PC1, for instance, says that casualties is fairly important to the component's formation, with a loading of about 0.20 (compared with about 0.09 for enrollment). PC2 also gives casualties substantial weight, but in the opposite direction. If we imagine a scatter plot with PC1 on the horizontal axis and PC2 on the vertical, we would expect to see the shootings with the greatest casualties generally toward the bottom right of that plot: high in PC1 and low in PC2. Of course, that will not always hold because of the weights on the other variables. For instance, minority proportion (the proportion of students identifying as nonwhite) shows exactly the opposite relationship to casualties, so if a high-casualty shooting occurred at a minority-heavy school, we might see that observation land near the middle of the graph. For a complete look at each variable's rotation through principal components 1 and 2, see here.
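For reference, here is a hedged sketch of how these rotations come out of base R's prcomp, again assuming a numeric data frame called shootings_num (the object names are illustrative, not the exact ones in my script):

    pca <- prcomp(shootings_num, center = TRUE, scale. = TRUE)
    pca$rotation[, 1:2]     # loadings ("rotation") on PC1 and PC2
    summary(pca)            # proportion of variance explained by each component
    scores <- pca$x[, 1:2]  # each shooting's coordinates on PC1 and PC2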


Clustering Algorithm

Using the first two principal components, I extracted about 20.6% of the variance in the dataset. I then tried several clustering algorithms to see if I could find a grouping of the observations that made intuitive sense. This was not as easy as it may sound: most of the algorithms I ran tended to lump nearly all of the observations into a single cluster, assigning very few to the remaining groups. But I did find one solution that seemed to work fairly well: a k-means clustering algorithm with three centers (meaning it formed three distinct groups). We can view the results in the scatterplot below.
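Before looking at the plot, here is a minimal sketch of that clustering step, assuming scores holds the first two principal-component scores from the PCA sketch above:

    set.seed(1)                                # k-means depends on random starting centers
    km <- kmeans(scores, centers = 3, nstart = 25)
    table(km$cluster)                          # number of shootings assigned to each cluster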




There is no exact way to label the resulting groups, but I decided to identify the clusters by their relative number of casualties and level of poverty; we'll explore why that works later. As we can see, and as could have been predicted, the school shootings with the highest numbers of casualties (which I have highlighted with the date, name of the school, and number of casualties) fell together in the bottom right of the scatter plot, in the blue group, which generally consisted of shootings at schools with lower poverty. This cluster also contained all of the school shootings that garnered the most national attention.
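A scatterplot along these lines can be reproduced with a few lines of base R (a sketch using the scores and km objects from the snippets above; the text highlighting of individual schools is omitted):

    plot(scores, col = km$cluster, pch = 19,
         xlab = "PC1", ylab = "PC2",
         main = "School shootings grouped by k-means cluster")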

So, what other information can be extracted from this analysis? For one thing, as the next plot shows, the most impoverished schools made up a large share of the shootings, and they generally sat farther away from the higher-casualty events. The implication seems to be a tradeoff by type of school: if a school is more impoverished, it may see more shootings, all else constant (or the likelihood of a shooting occurring there will be higher), but it will generally see fewer casualties. That makes sense. The schools that receive the most media attention after a shooting are generally less impoverished, and they may be more of a magnet for people who want to cause as much destruction as possible. On the other hand, poverty is usually correlated with more violence generally.



Poverty in this dataset is measured by the percentage of students in the school eligible for a reduced-price lunch program. The points I have highlighted with text (date, city/state, and number of casualties) are the schools in the dataset where 99% or more of the student body was eligible for this program. As we see, the most impoverished schools generally did not see more than 2 casualties per school shooting. The one exception I found was Salvador B. Castro Middle School in February 2018, which had 90% of its student body eligible for the program and suffered 5 casualties in a school shooting.
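That check is easy to reproduce; here is a sketch assuming the original data frame is called shootings_df, with hypothetical columns lunch_pct (share of students eligible for the reduced-price lunch program) and casualties:

    very_poor <- shootings_df[shootings_df$lunch_pct >= 0.99, ]
    max(very_poor$casualties)   # largest casualty count among the most impoverished schools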

Lastly, I wanted to show how having a resource officer on campus factors into this analysis. The next plot highlights with text (date and number of casualties) the schools that had resource officers on campus while being less impoverished (we can think of these as typical middle-class-neighborhood schools).



Generally, we see that the schools I highlighted with a text box tended to cluster toward the bottom right of the graphic and to be higher-casualty. The campuses that were more impoverished and had a resource officer generally saw fewer casualties, but it is the low-poverty, low-casualty group (bottom left) that had almost no resource officers on campus and suffered very few casualties. In my other analysis, the conclusion was drawn that having a resource officer on campus actually leads to more casualties when controlling for the level of poverty, all else constant, and this breakdown of the data seems to support that.


Conclusion

With an unsupervised breakdown of the data, we are able to confirm some of our intuitions about school shootings, as well as some of the less obvious conclusions derived in my previous analysis. Schools with higher poverty seem to see fewer casualties per school shooting, all else constant, and resource officers on campus do not seem to reduce casualties at all. I was also able to show how certain kinds of school shooting events tend to cluster together using a k-means algorithm. There are many more ways this data could be broken down with unsupervised learning, but I have included only what I found most interesting. For the full R script used for this analysis, see here.
