I generated a 50-topic model (see below for more detail) from the 1400 speeches and press releases from senators in 2008. I then embedded each document into two dimensions using t-distributed stochastic neighbor embedding (see below for more detail) so that we can better visualize the clusters.
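For the curious, here is a minimal sketch of that pipeline, assuming gensim for the topic model and scikit-learn for the embedding (library choices are mine for illustration, not necessarily what the accompanying repo uses):

```python
# Sketch: fit a 50-topic model on tokenized documents, then embed the
# per-document topic distributions into 2D with t-SNE.
from gensim import corpora, models
from sklearn.manifold import TSNE
import numpy as np

def embed_documents(tokenized_docs, num_topics=50):
    """tokenized_docs: one list of tokens per speech/press release."""
    dictionary = corpora.Dictionary(tokenized_docs)
    bows = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = models.LdaModel(bows, id2word=dictionary, num_topics=num_topics)

    # Build the documents-by-topics matrix of topic proportions.
    doc_topics = np.zeros((len(bows), num_topics))
    for i, bow in enumerate(bows):
        for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
            doc_topics[i, topic_id] = prob

    # Embed the 50-dimensional topic distributions into 2 dimensions.
    xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(doc_topics)
    return lda, dictionary, doc_topics, xy
```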
Some notes about the following visualization:
[Interactive visualization omitted; it included a table with Speaker, Topic, and Summary columns.]
One of the more interesting things that I think can be done with computers is analyzing text. It’s one of those things that lets you suspend disbelief and imagine your computer can understand this non-context-free mishmash of syntax and semantics that we call natural language. Because of this, I’ve always been fascinated by the discipline of natural language processing.
In particular, analyzing political text, some of the most contextual, sentiment-filled text in existence, seems like a great goal. That very difficulty is what keeps analytical insights from coming easily. I think, however, that throwing heavy machinery at such a hard nut will never bear much fruit. Rather, the heavy machinery is better suited to doing what computers are adept at: organizing and visualizing the data in a way that better allows a human to ask questions of it.
So, to that end, let’s take senatorial speeches and press releases, learn natural groupings based on their content, and see whether some topics are inherently discussed more by members of one party than the other.
I ran across a great little project by John Myles White on computing Ideal Point Analysis. In the course of that project, he gathered around 1400 speeches and press releases from senators in 2008.
I’ve been playing with this visualization for a bit and here are a few interesting points that I see:
There are plenty more, but my overall impressions of this organization/visualization technique were favorable:
I began to think, “What interesting questions can I ask of this data?” One of my favorite bits of unsupervised learning in natural language processing is topic modeling. This, in short, means generating from the data the set of topics that the data covers. Further, it gives you a distribution over topics for each document and the ability to infer this topic distribution for arbitrary (unseen) documents. I like to think of it as: given a set of newspapers, determine the sections (e.g. sports, business, comics, etc.). For our purposes, each topic will be represented by the set of keywords which best characterize it.
From David Blei:
Topic models are a suite of algorithms that uncover the hidden thematic structure in document collections. These algorithms help us develop new ways to search, browse and summarize large archives of texts.
There has been much written on topic modeling and, in particular, the favored approach to doing it (latent Dirichlet allocation). Rather than give yet another description, I’ll link to my favorite with accompanying video.
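Continuing the hypothetical gensim sketch from earlier, the two pieces described above (keywords summarizing each topic, and a topic distribution for an unseen document) look roughly like this:

```python
def describe_topics(lda, topn=10):
    # Summarize each learned topic by its highest-weighted keywords.
    for topic_id in range(lda.num_topics):
        words = [word for word, _weight in lda.show_topic(topic_id, topn=topn)]
        print(topic_id, " ".join(words))

def topic_distribution(lda, dictionary, unseen_tokens):
    # Infer the topic mixture of a document the model has never seen.
    bow = dictionary.doc2bow(unseen_tokens)
    return lda.get_document_topics(bow, minimum_probability=0.0)
```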
So, given this ability to generate topics and get a distribution for each of the speeches or press releases, what can I do with it? Well, one interesting question is whether there are certain topics that are inherently partisan in nature. By which I mean: are there clusters of documents dominated mainly by senators of a certain party?
But, before we can investigate that, we need to visualize these documents by their topics. One way to do this is to project each document down to its dominant topic, which is to say, for each document look only at the topic with the largest weight. This way we can group the documents by topic. It has the benefit of being easy to represent, but it loses the granularity of a document being a mixture of topics.
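In terms of the earlier sketch, that dominant-topic projection is just a row-wise argmax over the documents-by-topics matrix:

```python
import numpy as np

def group_by_dominant_topic(doc_topics):
    """doc_topics: the documents-by-topics matrix from the earlier sketch."""
    dominant = np.argmax(doc_topics, axis=1)  # strongest topic per document
    return {t: np.flatnonzero(dominant == t) for t in np.unique(dominant)}
```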
Instead, what we’d like to do is look at how the documents cluster by their topics. The issue with this is that we could have 50, 100, or even 300 topics, which means visualizing a 300-dimensional space. That is clearly out of the question. What we’d need is some way to transform those high-dimensional points (representing the full distribution of topics) down into a space that we can visualize in such a way that the clustering is preserved. Or, as a mathematician would say, we need to embed a high-dimensional surface into $R^2$ or $R^3$ in a way which (largely) preserves a distance metric.
There has been quite a lot of discussion, research, and high-powered mathematical thinking given to ways of embedding high-dimensional spaces into lower-dimensional ones. One of the most interesting recent results in this area is t-Distributed Stochastic Neighbor Embedding (t-SNE) by Laurens van der Maaten and the prolific Geoffrey Hinton of Deep Learning fame. In fact, in lieu of a detailed explanation of t-SNE, I’m going to defer to Dr. van der Maaten’s fantastic tech talk at Google about his approach.
I will, however, give you a taste of the high-level attributes of this embedding:
When I say “local structure”, I mean that points that are nearby in the high-dimensional space are likely to remain nearby in the low-dimensional space, whereas the relative distances between individual clusters may not be as faithfully preserved when projected into the lower-dimensional space. This is ideal for preserving the notion of clusters, which is all we care about.
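For reference, the quantities t-SNE works with (from van der Maaten and Hinton’s paper) make this local-structure bias concrete: pairwise similarities in the high-dimensional space use Gaussian kernels, the low-dimensional ones use a heavy-tailed Student-t kernel, and the embedding minimizes the KL divergence between the two.

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)}, \qquad p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}$$

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}, \qquad C = \mathrm{KL}(P \parallel Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$

Because the Student-t kernel has heavy tails, moderately dissimilar points can be placed far apart in the embedding at little cost, which is why nearby points stay nearby while distances between clusters are less faithfully preserved.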
There were some caveats, though, that I’d like to mention:
On the whole, it was an entertaining and fascinating exercise. I’m definitely going to be trying this out on other sets of documents. All of the code and data used to create this is hosted on GitHub and I urge you all to pull it down and try it out!
You can find all of the code and data at the following GitHub repo.
Some notes about the code:
I’d be very interested to see whether the t-SNE algorithm can be implemented and parallelized via Spark, which would open it up to much larger datasets. As it stands, the naive algorithm is $O(n^2)$, but there is a variant called Barnes-Hut-SNE (L.J.P. van der Maaten, “Barnes-Hut-SNE,” Proceedings of the International Conference on Learning Representations, 2013) which is $O(n\log(n))$.
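For what it’s worth, an off-the-shelf Barnes-Hut implementation already exists outside of Spark; for instance, scikit-learn’s TSNE exposes it (this is my illustration, not part of the code in the repo):

```python
from sklearn.manifold import TSNE

# Barnes-Hut approximation: roughly O(n log n) instead of the naive O(n^2).
# 'angle' trades accuracy for speed in the tree approximation; doc_topics is
# the documents-by-topics matrix from the earlier sketch.
xy = TSNE(n_components=2, method="barnes_hut", angle=0.5).fit_transform(doc_topics)
```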
Also, I’m interested in visualizing other texts; I think this is a fascinating way to explore a corpus of data. I’ll leave the rest of the ideas for future blog posts. You can expect one describing the challenges of a Spark implementation soon.
Casey Stella 07 May 2014 Cleveland, OH