Modeling #Squadgoals: Finding the Squads
Exec Summary: #Squadgoals is an index of fandoms often ignored by the popular press. This computational approach mines the use of that hashtag, with all the possibilities and pitfalls inherent in the method.
If you are not familiar with youth culture, this post won’t be particularly meaningful for you. Even some folks I know who are tuned in don’t know about squad goals. Such as…

Things you need to know: fan cultures are organized around references to a central identity or fandom, with other lesser elements organized around that. In this case, one could be a part of the Cheetah Girls fandom, and thus identify with the imaginary community of the squad. When the entire operation is well-oiled, you are, in fact, on point.
But what are the most important squads today? Clearly the Cheetahs were important a decade ago, but their continued squad-ness depends on nostalgia, not a new fandom. My students (Oregon State, Survey of Social Media) really wanted to know more about the squads, especially which ones were dominant on Twitter.
Method
Our method? We scraped Twitter for all uses of the hashtag #squadgoals, which is frankly more interesting than #relationshipgoals or even just #goals. We then used Mallet to do topic analysis of the scraped tweets.
There was one big problem: Twitter isn’t exactly reliable. There were over five thousand squadgoals tweets over the past thirty-six hours. When we asked for the last 200,000, the API returned just under 36,000 tweets, and those only went back a week. Issues with Twitter are known: if one wants real longitudinal Twitter information, one needs to observe over time. There is no last-second or retroactive research solution.
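For anyone who wants to try something similar, here is a minimal sketch of getting scraped tweets into a shape Mallet can import. Mallet's file importer reads one document per line, by default as a name, a label, and then the text. The function name, the "squadgoals" label, and the sample tweets are all made up for illustration; this is not the exact script we ran.

```python
import re

def to_mallet_lines(tweets, label="squadgoals"):
    """Format (tweet_id, text) pairs as 'name<TAB>label<TAB>text' lines.

    The label is a placeholder; topic modeling does not use it.
    """
    lines = []
    for tweet_id, text in tweets:
        # Collapse newlines and tabs so each tweet stays on one line.
        flat = re.sub(r"\s+", " ", text).strip()
        if flat:  # skip tweets with no status text at all
            lines.append(f"{tweet_id}\t{label}\t{flat}")
    return lines

# Hypothetical sample data, not taken from the real scrape.
sample = [
    ("1", "My #squadgoals\nare on point"),
    ("2", "   "),
    ("3", "#squadgoals\t#teamturk"),
]
mallet_lines = to_mallet_lines(sample)
for line in mallet_lines:
    print(line)
```

Written out to a text file, lines like these can be handed to Mallet's import-file command and then to train-topics. Note that the tweet with no status text is dropped at this stage, which matters for image-only posts.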
Also, here is a bigger problem: any sort of token analysis like this is vulnerable to noise. A robust stopwords file can filter the results for better analysis, but the choice of stopwords is fraught with danger. As the stopwords file improves, the assigned topics and the model will appear to fit the data, or at least the researcher’s sensibility of the data, more and more closely. There is a real risk that a researcher using a frequency table and a stopwords file could sculpt a computer reading of a document set that fits their needs. Of course, all research can fall prey to the problem of the availability heuristic. Some researchers go hunting for a significant p value; there are many possible sins to be committed here. For the purposes of this project, I used my main stopwords file, stopwords2.txt, which I can provide if you would like it. As an aside: I do think that the development of stopwords files is an important topic for critical cultural studies, especially as practitioners deploy computer listening, reading, and vision.
Unfortunately, I am not going to be renewing my Tableau license until October, and Microsoft Excel is about as useful as a hammer for polishing a glass menagerie in situations like this, so R-driven graphics will be coming your way.
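To make the stopwords worry concrete, here is a small sketch of a frequency table computed under two different stopword lists. The texts and both lists are invented for illustration (they are not from stopwords2.txt), and the tokenizer is a deliberately crude assumption.

```python
import re
from collections import Counter

def token_counts(texts, stopwords):
    """Count lowercase tokens (keeping # and @ prefixes), minus stopwords."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[#@]?\w+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts

# Made-up tweets, for illustration only.
texts = [
    "Squad goals with the squad #squadgoals",
    "The squad is on point #squadgoals",
]

small_list = {"the", "with", "is", "on"}
# The "improved" list also removes the terms the researcher finds boring.
large_list = small_list | {"#squadgoals", "squad"}

before = token_counts(texts, small_list)
after = token_counts(texts, large_list)

print(before.most_common())
print(after.most_common())
```

With the small list, "squad" dominates the table. With the larger list it vanishes entirely and only "goals" and "point" remain. Neither table is wrong, but each one tells a different story, and the researcher chose which.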
Results
Results expressed as a dendrogram:

As the chart cascades down, smaller topics break off from the larger topics; or, more precisely, the topics that are lowest in the tree are the ones first collapsed into a larger topic. These are the dominant squads of the third week of August 2015.
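If the "lowest in the tree were collapsed first" reading is unfamiliar, this sketch of bottom-up (agglomerative) clustering may help. It assumes single-linkage merging on Euclidean distance, which is only one of several options and not necessarily what produced the chart above. The topic labels and coordinates are hypothetical.

```python
from itertools import combinations

def agglomerate(points):
    """Single-linkage agglomerative clustering.

    points maps a label to a coordinate tuple. Returns the merges in the
    order they happen, i.e. from the bottom of the dendrogram upward.
    """
    clusters = {name: [name] for name in points}  # cluster label -> members
    merges = []

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

    while len(clusters) > 1:
        best = None
        for c1, c2 in combinations(sorted(clusters), 2):
            # Single linkage: distance between the closest pair of members.
            d = min(dist(a, b) for a in clusters[c1] for b in clusters[c2])
            if best is None or d < best[0]:
                best = (d, c1, c2)
        d, c1, c2 = best
        clusters[f"({c1}+{c2})"] = clusters.pop(c1) + clusters.pop(c2)
        merges.append((c1, c2, round(d, 3)))
    return merges

# Hypothetical topic vectors, not taken from the real model.
topics = {
    "A": (0.0, 0.0),
    "B": (0.1, 0.0),
    "C": (1.0, 0.0),
    "D": (1.0, 0.3),
}
merges = agglomerate(topics)
for step in merges:
    print(step)
```

The closest pair (A and B) merges first and so sits lowest in the tree, then C and D, and the two resulting groups join last at the top.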
Squad Ranks
Results as a list with pictures.
1. One Direction (trying to figure out the relationship to Timberlake)

2. Walking Dead
3. Sports/Yankees
4. 5 Seconds of Summer
5. Hottopic
6. Hunger Games
7. Fifth Harmony (Taylor Swift just officially added Fifth to her squad, btw.)
8. Grey’s Anatomy

9. Summerslam

10. (tie) Outfithaven/Clothes Hack, feministajones
11. NFL Kicker Pat McAfee
12. Little Mix from X-Factor season 8 (England)
HOLD THE TRUCK
What if we look at the squad goals that were the most popular on an individual basis? Then our key squad is:

#teamturk, Scrubs 4 Ever. Etc.
This method of sorting reveals another problem: this particular Braff tweet was listed 17 times in the data, just as Braff. That suggests that there are other problems, and the lack of status text drops this powerful image behind the text-rich posts related to One Direction.
By this ranking method, our squads are:
Scrubs, Star Wars (Vader), Napoleon Dynamite, Guardians of the Galaxy, Blue Mountain State (a television program on Spike TV, the network for men), Taylor Swift, some sort of poorly edited image, Eid, and then One Direction.
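Ranking by individual tweet is essentially counting repeated rows, which is also how a single tweet listed many times shows up. A minimal sketch, assuming each scraped row is a (user, text) pair; the rows below are placeholders, not real data.

```python
from collections import Counter

def rank_repeats(rows):
    """Count identical (user, text) rows, most repeated first."""
    return Counter(rows).most_common()

def dedupe(rows):
    """Keep the first occurrence of each row, preserving order."""
    return list(dict.fromkeys(rows))

# Placeholder rows for illustration only.
rows = [
    ("user_a", "#squadgoals"),
    ("user_b", "my squad is on point #squadgoals"),
    ("user_a", "#squadgoals"),
    ("user_a", "#squadgoals"),
]

ranking = rank_repeats(rows)
unique_rows = dedupe(rows)

print(ranking)
print(len(rows), "rows,", len(unique_rows), "unique")
```

Whether to deduplicate before modeling is a judgment call. Keeping the repeats lets popularity count, while dropping them keeps one viral image from swamping everything else.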
What did we learn?
Mapping the squads is difficult. Automated means allowed us to see past Braff’s raw numerical superiority. In a world where a third of the dataset could possibly have been tied to Braff’s tweet, it seems possible, if not likely, that large swaths of the data could have been lost in the process of building this model. I believe that the problem here is not with our approach or tools, but with Twitter itself. It was not our software, but the API, that was exhausted. Perhaps this is the truth of the squad: it exists on the level of aspiration and imagination, not on the level of the database.