I’m a pretty heavy Unix user and I tend to prefer doing things the Unix Way™, which is to say, composing many small command line oriented utilities. With composability comes power and with specialization comes simplicity.
Sometimes, though, when two utilities are used together all the time, it makes sense for either:
For example, one thing that I find myself doing a lot of is searching a directory recursively for files that contain an expression:
Despite the fact that you can do this with the standard tools, specialized utilities such as ack have sprung up to simplify this style of querying. It turns out there’s also power in not having to consult the man pages all the damned time.
Another example is the interaction between uniq and sort. uniq presumes sorted data. Of course, you need not sort your data using the Unix utility sort, but often you find yourself with a flow such as this:
This is so common that a -u flag was added to sort to support this flow, like so:
Now, obviously, uniq has uses beyond simply providing distinct output from a stream, such as providing counts for each distinct occurrence. Even so, for the situations where you don’t need the full power of uniq, it’s nice to have its minimal functionality be a part of sort.
These simple motivating examples got me thinking:
This sounds like a great challenge and an even greater opportunity to try out a new (to me) analytics platform, Apache Spark. So, I’m going to take you through a little journey doing some simple analysis and illustrate the general steps. We’re going to cover
We’ll close with my impressions of using Spark as an analytics platform. Hope you enjoy!
Of course, to be data driven, a minimal requirement is to have some data. The good folks at CommandLineFu have a great service and an even better REST-ful API to use. For those of you who do not know, CommandLineFu is a repository of crowd-managed one-liners for common tasks. For instance, if you forget how to share ssh keys and you’re on a Mac that doesn’t have ssh-copy-id (like me), then you can find the appropriate one-liner.
Using their REST-ful API, we can extract a nice sample of useful sets of commands to accomplish real tasks.
Now, this data, unfortunately, leaves much to be desired for drawing robust conclusions:
Even so, perhaps we can see some high-level patterns emerge. At the very least, we’ll be able to try out a new analytics platform.
Word collocations (AKA Statistically Significant Phrases) are pairs (or tuples) of words that appear together more often than random chance would imply. These could be things like proper names or multi-word colloquial phrases. Natural Language Processing has a number of techniques to generate and rank these collocations by “importance”:
Much has been written about this over the years, from technical blogs to research papers and textbooks.
I’m going to follow the approach suggested by Ted Dunning in “Accurate Methods for the Statistics of Surprise and Coincidence”, namely using the $G$ statistical test.
Statistical significance tests tell us whether an observed pattern is likely to be more than just random chance. Traditionally, when you are looking at these sorts of things, you encounter tests like the $\chi^2$ test, which can give you a significance measure. In fact, $G$-tests, and the broader class of likelihood-ratio tests to which they belong, are a generalization of the $\chi^2$ test.
In particular, the problems with the traditional $\chi^2$ test in the domain of natural language processing are well known and widely criticized. Consider the following quote from “Foundations of Statistical Natural Language Processing”:
> Just as the application of the $t$ test is problematic because of the underlying normality assumption, so is the application of $\chi^2$ in cases where numbers in the $2$-by-$2$ table are small. Snedecor and Cochran (1989: 127) advise against using $\chi^2$ if the total sample size is smaller than 20 or if it is between 20 and 40 and the expected value in any of the cells is 5 or less.
As you can see, there are some issues with bigrams (in our case, pairs of commands) that are rare. Thankfully there is another approach that deals with the low-support situation.
The $G$ statistical test is a likelihood-ratio significance test. In fact, it turns out that the $\chi^2$ test statistic is an approximation of the log-likelihood ratio used in the $G$ test. You can refer to Ted Dunning’s “Accurate Methods for the Statistics of Surprise and Coincidence” for a spirited argument in favor of log-likelihood statistical tests as the basis of bigram collocation analysis.
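For reference, the textbook form of the statistic (not code from the project) is, for observed counts $O_i$ and their expected counts $E_i$ under the independence hypothesis,

$$ G = 2 \sum_{i} O_i \ln\left(\frac{O_i}{E_i}\right) $$

For the $2$-by-$2$ contingency table we will build below, the $O_i$ are the four cell counts and the $E_i$ come from the products of the corresponding row and column totals divided by the grand total.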
As part of my career, I have specialized in data science on the Hadoop stack. The AMPLab at Berkeley has created some very interesting pieces of infrastructure with favorable performance characteristics and a functional flavor.
Apache Spark is their offering at the computational layer. It’s a cache-aggressive data processing engine with first-class Scala, Java, and Python API bindings. With Hadoop 2.x now supporting processing models other than MapReduce, Spark can run as a YARN application; so if you have a modern Hadoop cluster, you can try this out yourself. Hell, even if you don’t own a Hadoop cluster, you can pull down a VM that runs the full Hadoop stack here. (Disclosure: I work for Hortonworks; this is a VM with our distribution installed.)
As a fan of Scala and data science, I would be lying if I said I hadn’t been looking forward to finding a project to try it out on, but I wanted something a bit more interesting than the canonical examples of word count or $\pi$ estimation. I think this just fits the bill!
As with any self-respecting data science project, a good portion of the challenge is in data engineering. The data engineering challenge here boils down to parsing the command line examples and extracting the pairs of commands that collocate.
The general approach is:
If you’re interested, you can find the Scala code to do this here.
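The real parser handles actual shell syntax; as a rough sketch of the idea (every name here is mine, not the project’s), extracting command bigrams from a one-liner might look something like this:

```scala
// Naive bigram extraction: split a one-liner on common separators, keep the
// first token of each segment as the command name, bookend with "END", and
// emit adjacent pairs. The real parser linked above is far more robust.
def commandBigrams(line: String): List[(String, String)] = {
  val commands: List[String] = line
    .split("""\|\||&&|\||;""")        // split on ||, &&, | and ;
    .toList
    .map(_.trim)
    .filter(_.nonEmpty)
    .map(_.split("""\s+""").head)     // first token of each segment is the command
  ("END" :: commands ::: List("END")) // bookend the command sequence
    .sliding(2)
    .collect { case List(a, b) => (a, b) }
    .toList
}
```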
The general methodology is:
Once we have these two rankings, we can eyeball them to see how they differ and whether anything pops out.
I want to briefly discuss a bit of technology that we will be using heavily in the ensuing analysis, the Resilient Distributed Dataset (aka RDD).
Resilient Distributed Datasets:
Generally when you want to get something done in Spark, you’re operating on an RDD.
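As a toy illustration (not part of the analysis; the file name and master are placeholders), the typical rhythm is to build an RDD, chain lazy transformations, and trigger work with an action:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("rdd-example").setMaster("local[*]")
val sc   = new SparkContext(conf)

val lines    = sc.textFile("commandlinefu.txt")    // RDD[String]: one one-liner per line
val tokens   = lines.flatMap(_.split("""\s+"""))   // transformation: nothing runs yet
val grepUses = tokens.filter(_ == "grep").count()  // action: triggers the computation
```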
Raw frequency is really simple, so we’ll start with that. It fits our assumptions about how to rank collocations, so we’ll use it as the baseline to evaluate against.
Let’s start with some useful typedefs:
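The exact aliases live in the linked code; something along these lines is all we need (the names are my guesses):

```scala
// Hypothetical type aliases; the names in the linked code may differ.
type Command      = String
type Bigram       = (Command, Command)
type ScoredBigram = (Bigram, Double)
```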
Now, let’s define our function.
Using the parser that we created during the Data Engineering portion of the project, we can extract the Bigrams (excluding the automatic bookend command called “END”) into an RDD called bigramsWithoutEnds.
Now we can compute the collocation counts by bigram by doing something very similar to word count using map and reduceByKey.
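A sketch of that step, assuming the bigramsWithoutEnds RDD from above (the other names are mine):

```scala
import org.apache.spark.SparkContext._   // pair-RDD operations like reduceByKey (needed on older Spark)

val bigramCounts = bigramsWithoutEnds
  .map(bigram => (bigram, 1L))   // pair every bigram with a count of one
  .reduceByKey(_ + _)            // sum the counts per distinct bigram
```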
With those counts, we need only the total number of bigrams; dividing each bigram count by that total gives the raw frequency. We’ll then order by the score.
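Continuing the sketch:

```scala
// Total number of bigrams observed; the denominator for the raw frequency.
val totalBigrams = bigramCounts.map(_._2.toDouble).sum()

// Divide each bigram's count by the total and sort descending by the score.
val bigramsByRawFrequency = bigramCounts
  .map { case (bigram, count) => (bigram, count / totalBigrams) }
  .sortBy(_._2, ascending = false)
```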
Et voilà, we now have an RDD of Bigrams sorted by raw frequency.
Now for the meat of the implementation. Let’s take the theory discussed earlier and proposed in “Accurate Methods for the Statistics of Surprise and Coincidence” and implement it.
The technique boils down to computing a contingency table of counts for each pair of words $w_1$ and $w_2$ and using LogLikelihood.logLikelihoodRatio from Mahout:
Contingency table cell | Meaning |
---|---|
k11 | The number of times $w_1$ and $w_2$ appear next to each other |
k12 | The number of times $w_2$ appears without $w_1$ |
k21 | The number of times $w_1$ appears without $w_2$ |
k22 | The number of times neither $w_1$ nor $w_2$ appears |
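The Mahout call itself is a single static method over those four cells (the counts below are made up purely for illustration):

```scala
import org.apache.mahout.math.stats.LogLikelihood

val k11 = 120L    // w1 and w2 observed together
val k12 = 300L    // w2 without w1
val k21 = 250L    // w1 without w2
val k22 = 50000L  // bigrams containing neither w1 nor w2

val g: Double = LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22)
```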
Just as before with raw frequency, let’s start with some useful typedefs
Again, just as before, let’s define our function:
For the purposes of this analysis, it’s convenient to have an inner function for computing the $G$ significance from the counts in the contingency table above.
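A sketch of what such a helper might look like (the name and arguments are mine; the linked code may differ): fill in the contingency table from the pair count, the per-command totals, and the total number of bigrams, then delegate to Mahout.

```scala
import org.apache.mahout.math.stats.LogLikelihood

//   pairCount    - number of times w1 is immediately followed by w2
//   leftCount    - total occurrences of w1
//   rightCount   - total occurrences of w2
//   totalBigrams - total number of bigrams observed
def gScore(pairCount: Long, leftCount: Long, rightCount: Long, totalBigrams: Long): Double = {
  val k11 = pairCount
  val k12 = rightCount - pairCount                             // w2 without w1
  val k21 = leftCount - pairCount                              // w1 without w2
  val k22 = totalBigrams - leftCount - rightCount + pairCount  // neither w1 nor w2
  LogLikelihood.logLikelihoodRatio(k11, k12, k21, k22)
}
```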
So, again, the prelude is similar to the raw frequency case. Given a line, let’s compute the bigrams of commands.
First, let’s get some statistics about individual commands. To do this, we need to count the individual commands (which are not the bookend “END” command that our CLIParserDriver puts in for us).
With Spark, we can pull this RDD down into memory in the driver application. Since its size scales only with the number of unique commands, it should be sensible to pull this data locally.
We’re going to use these counts to create a map with which we can look up counts for a given command. Spark is going to ship that map to the cluster so that we can use it locally within the transformations applied to the RDDs. This is a super useful bit of candy, and unexpected for those of us coming from the traditional Hadoop stack (i.e. MapReduce and Pig development).
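A sketch of those two steps, assuming a hypothetical individualCommands RDD of command names with the “END” bookends already filtered out (commandCountsMap matches the name used below):

```scala
import org.apache.spark.SparkContext._   // pair-RDD operations like reduceByKey (needed on older Spark)

val commandCounts = individualCommands
  .map(command => (command, 1L))
  .reduceByKey(_ + _)

// Collect the (small) result to the driver as a plain Scala Map; Spark ships it
// to the executors inside any closure that references it.
val commandCountsMap: Map[String, Long] = commandCounts.collect().toMap
```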
Now that we’ve gotten some counts for individual commands, let’s do the same for pairs of collocated commands.
Now that we have an RDD with the bigrams and their counts, bigramCounts, and the map with the counts by individual command, commandCountsMap, we can compute our $G$ statistical significance test for each pair of collocated commands.
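Putting the pieces together, a sketch using the gScore helper from above and the names from the post:

```scala
val totalBigrams = bigramCounts.map(_._2).reduce(_ + _)

val bigramsByGScore = bigramCounts
  .map { case ((left, right), pairCount) =>
    ((left, right), gScore(pairCount, commandCountsMap(left), commandCountsMap(right), totalBigrams))
  }
  .sortBy(_._2, ascending = false)
```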
Now we have an RDD with the bigrams ranked by the $G$ statistical significance test.
The result of this analysis is two rankings: one based on raw frequency and one based on $G$ score. I’ve taken the top 10 results from each ranking and placed them in the table below, along with their scores and relative rankings. Note that while the table is sorted by $G$ score, the columns are sortable, so go explore. Also, for both scores, higher means a more important pair of commands.
Command 1 | Command 2 | Frequency | Frequency Rank | $G$ score | $G$ Rank | Absolute Difference in Rank |
---|---|---|---|---|---|---|
find | xargs | 0.022 | 2 | 336.60 | 1 | 1 |
sort | uniq | 0.019 | 3 | 262.10 | 2 | 1 |
du | sort | 0.011 | 8 | 233.09 | 3 | 5 |
sort | grep | 0.00037 | 500 | 163.51 | 4 | 496 |
awk | grep | 0.003 | 41 | 156.78 | 5 | 36 |
grep | sort | 0.002 | 66 | 108.48 | 6 | 60 |
uniq | sort | 0.011 | 7 | 94.37 | 7 | 0 |
ps | grep | 0.009 | 10 | 86.39 | 8 | 1 |
netstat | grep | 0.006 | 15 | 83.88 | 9 | 4 |
cut | grep | 0.00092 | 192 | 82.88 | 10 | 182 |
sort | head | 0.009 | 9 | 44.21 | 34 | 25 |
grep | awk | 0.031 | 1 | 24.37 | 77 | 76 |
grep | cut | 0.014 | 4 | 10.27 | 369 | 365 |
awk | xargs | 0.013 | 5 | 4.37 | 905 | 900 |
awk | sort | 0.013 | 6 | 4.37 | 1632 | 1626 |
Takeaways from the analysis:
The secondary purpose for this project is to discuss Spark as a platform for Data science. I’ll break it down into pros and cons regarding Spark.
Spark, contrary to traditional big data analytics platforms like Hadoop MapReduce, does not assume that caching is of no use. That said, despite the aggressive caching, they have worked hard to degrade gracefully when you cannot fit a working set into memory. This assumption holds particularly well in data science, where you are refining a large dataset into a (possibly) much smaller dataset to operate on.
The dominant patterns in modern big data analytics systems have always borrowed heavily from functional languages, and Spark substantially reinforces this. Not only did it choose a functional language, Scala, as a first-class programming language, but it also borrowed most of your favorite higher-order functions for collections and streams (e.g. foldl, foldr, flatMap, reduce).
Because Spark exists within the JVM (for the Scala and Java API bindings) and the Python ecosystem (for the Python bindings), it comes with the ability to use libraries written for two of the most popular programming environments available today in a dead-simple way. Everything from Stanford CoreNLP to scikit-learn is available without any of the integration challenges that you see in the big data scripting languages (i.e. wrapping the calls in user-defined functions).
Spark did not try to bite off more than it could chew. The AMPLab folks realized that resource allocation in distributed systems is hard, and they chose, quite correctly, to separate the analytics framework from the resource allocation framework. Initially, and continuing to this day, they have had strong support for Apache Mesos.
With the advent of Hadoop 2.0, resource management and scheduling have been abstracted into YARN. This means that Spark can now run on Hadoop, which comes with the distinct advantage that you can run Spark alongside the rest of the Hadoop ecosystem applications, such as Apache Hive, Apache Pig, etc.
With Hadoop’s increasing adoption in the data center as a batch analytics system for ETL, having Spark able to run on the same hardware substantially decreases its barrier to entry.
An open question, however, is just how good Spark is at multitenancy. If anyone has any examples of large Hadoop MapReduce-based workloads being run alongside Spark (as opposed to running in separate clusters), I’d love to hear impressions and challenges.
Almost all of the previous positive points mark me as the “comfortable with coding” type of data scientist. I have attempted to keep a foot in both worlds, and that biases my perspective. Many people doing good analytics work are daunted, to say the least, by a (most likely new to them) language like Scala.
That being said, I think that two points alleviate this criticism:
These are attempts by the Spark community to reach out to the comfort zone of existing data scientists by supporting some of their favorite tooling. I do not think this will make someone who has spent a career writing SAS comfortable in the system, unfortunately, but it does help.
This is one of those things that gets fixed with maturity, but it’s a lot easier to run a Pig script, say, than a Spark app. For instance, running the canonical example from Spark, the $\pi$ estimator (from the docs):
A combination of sensible defaults, programmatic specifications that can be overridden from the command line, and generally less magic would be very welcome here. Furthermore, after you do succeed in running this, you have to dig through the logs to find what the driver printed to stdout. Logs (again, straight out of the docs) like:
On the whole, I’m super interested in Spark. I like much about it, and I’m especially interested in its ecosystem projects like MLlib. Chalk me up as a fan.
All of the code used to do this analysis is located here.
I am not shipping the data, unfortunately, because I doubt that the CommandLineFu guys would like me providing their data on my GitHub repo. However, you can easily re-pull the data via their APIs.