Introduction
In a previous analysis,
I took a statistical approach to modeling school shootings using data
maintained by the Washington
Post. Specifically, I looked at several models and several
distribution assumptions to predict how factors such as poverty, the presence
of resource officers on campus, and the type of weapon used affect casualties in a school shooting, given that a shooting has occurred. But what if I
weren’t interested in modeling and predicting casualties? What if I didn’t know what I was
interested in, but knew I wanted to break down the data in some
way other than just looking at it or deriving obvious summaries? In that case, I might consider
using a machine learning technique called unsupervised
learning.
Unsupervised learning allows me to group
observations in my data so that observations in one group are more similar
to each other than to observations in other groups. I can also dig deeper and use the model output to analyze why that is the case. If done
successfully, I would be able to assign each observation to one of a
small set of “clusters” and give a general description of each one. This is useful
because not all school shootings are the same. It’s a point I
hear expressed often: should an accidental shooting where no one was hurt
really be called the same thing as an indiscriminate event such as Sandy
Hook? Correct labeling here matters and has important implications.
PCA
When confronted with a problem like this, the first place I
go is principal
component analysis (PCA). This is a technique used to reduce the number
of variables in a given dataset by calculating linear combinations of the original
variables, thereby capturing more of the data’s variance with fewer
variables. Imagine a dataset which contains the values 12, 32, 28, and 8. We
can call this set A. Each of the values in set A (not actually, but for
simplicity’s sake) makes up 25% of the set’s total variance. You want to choose
one value (from inside or outside the set) that will capture as much of the
total variance in the data as possible. So, you select 4, the greatest common
divisor of the values. Four makes up a third of the variance in the first
value, an eighth of the second, a seventh of the third, and half of the fourth.
Altogether, with just one value, you have captured about 27.5% of the variance
in the original set (the average of those four shares), more than you could have with any single value from within
the set. That is the power of PCA: it simplifies a dataset with many variables
into a smaller set of components. In this case, we would call the value 4 principal
component 1, and each subsequent factor, capturing less
and less of the total variance, would become principal component 2, 3, etc.
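The arithmetic in the set A example can be checked directly. This is only a minimal sketch of the analogy's arithmetic, not an actual PCA, and the variable names are my own for illustration.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

set_a = [12, 32, 28, 8]

# The single value shared by every member of the set: their greatest common divisor.
shared = reduce(gcd, set_a)

# Share of each value that the shared component accounts for.
shares = [Fraction(shared, value) for value in set_a]

# Each value is treated as an equal quarter of the set, so average the four shares.
captured = sum(shares) / len(set_a)

print(shared)                     # 4
print([str(s) for s in shares])   # ['1/3', '1/8', '1/7', '1/2']
print(f"{float(captured):.1%}")   # 27.5%
```

The average works out to 185/672, which is where the "about 27.5%" figure comes from.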
Why would you want to do this? There are several reasons.
One is creating a reduced set of variables, which is
important when you have many variables and few observations (which was my case). Another is that it allows
you to tease out which variables matter most in explaining the variation in
the dataset and how much more important those variables are than the others.
For instance, in set A, we know that 8 is very important because so much of its
variance (50%) can be decomposed into a single number which is also a common
component of the other values.
For PCA to work, however, there needs to be some correlation
between variables. The more correlation, the better it works. So, returning to
my dataset, I examined the correlation between all the variables and found that
there wasn’t much of it, suggesting that extracting a large amount
of the variance in the data with only a few principal components would be
unlikely.
Figure: Pearson correlation matrix. Darker boxes (ignoring the main diagonal) suggest more correlation between variables.
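To see why correlation matters so much for PCA, here is a rough sketch in Python with NumPy. It is not the R script used for this analysis, and the data are synthetic, generated only for illustration. The helper `explained_variance_ratio` is a name I made up; it standardizes the columns and reads the component variances off the singular values.

```python
import numpy as np

def explained_variance_ratio(data):
    """Share of total variance captured by each principal component."""
    # Standardize each column so PCA operates on the correlation structure.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    # Squared singular values of the standardized data are proportional
    # to the variance along each component (largest first).
    singular_values = np.linalg.svd(z, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

rng = np.random.default_rng(0)
n = 500

# Synthetic data (NOT the shootings dataset): four columns driven by one shared factor.
factor = rng.normal(size=(n, 1))
correlated = factor + 0.3 * rng.normal(size=(n, 4))

# Four independent columns with no shared structure.
uncorrelated = rng.normal(size=(n, 4))

ratio_correlated = explained_variance_ratio(correlated)
ratio_uncorrelated = explained_variance_ratio(uncorrelated)

print(np.round(np.corrcoef(correlated, rowvar=False), 2))
print(np.round(ratio_correlated, 3))
print(np.round(ratio_uncorrelated, 3))
```

With strongly correlated columns, the first component should absorb most of the variance. With independent columns, each component ends up with roughly an equal share, which is closer to the situation I faced.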
In spite of this, I ran the PCA algorithm, and examining the
first two principal components, we can more or less begin to envision natural
groupings of each observation and see which variables matter most. A sample is
given below.
| Variable | Rotation PC1 | Rotation PC2 |
|---|---|---|
| casualties | 0.204 | -0.209 |
| minority proportion | -0.262 | 0.291 |
| enrollment | 0.094 | 0.191 |
| resource officer | 0.116 | 0.150 |
| reason indiscriminate | 0.282 | -0.248 |
| lunch program 75% | -0.137 | 0.106 |
Rotation is the weight (loading) given to each variable
in each principal component (PC). PC1, for instance, says that the
variable casualties is fairly important in the component’s formation, as it’s
given a loading of about 0.20 (as opposed to 0.09 for enrollment, for example). PC2 also shows
casualties being of significant import, but in the opposite direction. If we
imagine a scatter plot with PC1 on the horizontal axis and PC2 on the
vertical, we would expect to generally see the shootings with the greatest casualties
toward the bottom right of such a plot, high on PC1 and low on PC2. Of course, that will
not always be true because of the weighting of the other variables. For
instance, minority proportion (the proportion of students identifying as
nonwhite) shows the opposite pattern from casualties, so that
(imagining the same scatter plot) a high-casualty shooting that
occurred in a minority-heavy school might land right in
the middle of the graph. For a complete look at each variable’s rotation
through principal components 1 and 2, see here.
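The way loadings place an observation on the plot can be sketched as a weighted sum. The loadings below are the six sample values from the rotation table. The observations are hypothetical standardized values I invented for illustration, and because only six variables are included, these are partial scores rather than the real ones. `partial_scores` is my own helper name.

```python
# Loadings copied from the sample rotation table: variable -> (PC1, PC2).
ROTATION = {
    "casualties": (0.204, -0.209),
    "minority proportion": (-0.262, 0.291),
    "enrollment": (0.094, 0.191),
    "resource officer": (0.116, 0.150),
    "reason indiscriminate": (0.282, -0.248),
    "lunch program 75%": (-0.137, 0.106),
}

def partial_scores(standardized):
    """Weighted sums of standardized values; variables not supplied count as 0 (average)."""
    pc1 = sum(ROTATION[name][0] * value for name, value in standardized.items())
    pc2 = sum(ROTATION[name][1] * value for name, value in standardized.items())
    return round(pc1, 3), round(pc2, 3)

# Hypothetical standardized values (in standard deviations), NOT real observations.
high_casualty = {"casualties": 2.0, "reason indiscriminate": 1.0}
high_casualty_minority = {**high_casualty, "minority proportion": 2.0}

score_a = partial_scores(high_casualty)
score_b = partial_scores(high_casualty_minority)
print(score_a)
print(score_b)
```

The first hypothetical shooting lands at roughly (0.69, -0.67), toward the bottom right. Adding a high minority proportion pulls it to roughly (0.17, -0.08), near the middle of the plot.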
Clustering Algorithm
Together, the first two principal components capture about
20.6% of the variance in the dataset. I then used several clustering algorithms
to see if I could find a grouping of the
observations that made intuitive sense. This was not as easy as it may
sound. Most of the algorithms I ran tended to group nearly all the observations together
into one cluster, with very few assigned to the other groups. But I did find
one solution that seemed to work fairly well: a k-means clustering algorithm
with three centers (meaning it formed three distinct groups). We can view the
results in the following scatterplot.
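For readers unfamiliar with k-means, the idea can be sketched in a few lines of Python with NumPy. This is not the R code behind the plot. The points are synthetic stand-ins for (PC1, PC2) scores, and I use a deterministic farthest-first initialization (my choice for reproducibility, not necessarily what any library does by default).

```python
import numpy as np

def farthest_first(points, k):
    """Pick k starting centers: the first point, then repeatedly the point
    farthest from all centers chosen so far."""
    centers = [points[0]]
    for _ in range(k - 1):
        nearest = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[nearest.argmax()])
    return np.array(centers)

def kmeans(points, k, n_iter=100):
    """Plain Lloyd's algorithm: returns (labels, centers)."""
    centers = farthest_first(points, k)
    for _ in range(n_iter):
        # Assign every point to its nearest center (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each center to the mean of its points; keep it in place if empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic stand-in for (PC1, PC2) scores: three well-separated blobs of 50 points.
rng = np.random.default_rng(1)
blob_centers = np.array([[-3.0, 0.0], [0.0, 3.0], [3.0, -3.0]])
scores = np.vstack([c + 0.4 * rng.normal(size=(50, 2)) for c in blob_centers])

labels, centers = kmeans(scores, k=3)
print(np.bincount(labels, minlength=3))
print(np.round(centers, 1))
```

On clean, well-separated blobs like these, the three groups are recovered easily. Real data with weak structure, like mine, is much less cooperative, which is why so many attempts collapsed into a single large cluster.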
There is no exact way to label these resultant groups, but I
decided to do so by identifying the clusters by their relative number of
casualties and level of poverty. We’ll explore why that works
later. As we can see, and as could have been predicted, the school shootings
with the highest number of casualties (which I have highlighted by the date,
name of the school, and number of casualties) fell together in the bottom right
of the scatter plot, in the blue group, which generally contained shootings
occurring at schools with lower poverty. This cluster also contained all of the
school shootings which garnered the most national attention.
So, what other information can be extracted from this
analysis? For one thing, as is displayed in the next plot, the most
impoverished schools seemed to be large in number, and generally farther away
from the higher-casualty school shootings. The implication here seems to be a
tradeoff in type of school—if a school is more impoverished, it may see more
shootings, all else constant (or the likelihood of a shooting occurring there
will be higher), but it will generally see fewer casualties. That makes sense.
The schools that seem to receive the most media attention from a shooting are
generally less impoverished and might be more of a magnet for people who want
to cause as much destruction as possible. On the other hand, poverty is usually correlated with more violence generally.
Poverty in this dataset is measured by the percentage of
students in the school eligible for a reduced-price lunch program. The points I
have highlighted with text (date, city/state, and number of casualties) are the
schools in the dataset that had 99% or more of the student body eligible for
this program. As we see, the most impoverished schools generally did not see
more than 2 casualties per school shooting. The one exception I found was
Salvador B. Castro Middle School in February of 2018, which had 90% of its
student body eligible for this program and suffered 5 casualties in a school
shooting.
Lastly, I wanted to show how having a resource officer on
campus factors into this analysis. The next plot highlights schools with text
(date and number of casualties) that had resource officers on campus while
being less impoverished (we can think of these as typical
middle-class-neighborhood schools).
Generally, we see that the schools I highlighted with a text box tended
to cluster in the bottom right of the graphic and be higher-casualty. The
campuses which were more impoverished and had a resource officer seemed to see
fewer casualties generally, but it’s the low-poverty, low-casualty group
(bottom left) which had almost no resource officers on campus and
suffered very few casualties. In my other analysis, the conclusion was
that having a resource officer on campus is associated with more casualties when
controlling for the level of poverty, all else constant, and this breakdown of
the data seems to confirm that.
Conclusion
With an unsupervised breakdown of the data, we are able to
confirm some of our intuitions about school shootings as well as some of the
less obvious conclusions that were derived in my previous analysis. Schools
with higher poverty seem to see fewer casualties per school shooting, all else
constant, and resource officers present on campuses don’t seem to actually
reduce casualties at all. I was also able to show how certain kinds of school
shooting events generally seem to cluster together using a k-means
algorithm. There are many more ways this data can be broken down using
unsupervised learning, but I have included only that which I found most
interesting. For the full R script used for this analysis, see here.