<h1 id="ethereum-blockchain-analysis">A Blockchain Story Told Through The Eyes of Two Users</h1>
<p>Casey Stella, Structure &amp; Process, 2018-01-22</p>
<h2 id="blockchains-are-big-data">Blockchains are Big Data</h2>
<p>I saw a commercial for Enterprise blockchains by Oracle during a football game this weekend. I’ll just pause to let that sink in. It is undeniable that this slightly esoteric corner of distributed computing is fully riding the hype train right now. There’s no doubt that the run-up in price of the core cryptocurrencies, combined with pointed skepticism from mainline economists and financial analysts, is driving interest in the technology. It’s the perfect mixture of nerdiness, drama and money to pique the interest of even the most bloodless in the tech industry.</p>
<p>I’m a <a href="https://www.linkedin.com/in/casey-stella-84b9a11">data scientist</a> working in a very specific niche: dealing with “Big Data” (shout out to <a href="http://metron.apache.org">Apache Metron</a>). When blockchains came to my notice, their sheer transparency was exceptionally exciting and liberating. Traditionally, things like currencies operate like a black box: one looks at the inputs and outputs and tries to develop sensible guesses as to what is going on inside. With blockchains, because they are essentially immutable ledgers of transactions, one can crack open the nut and get at the juicy transaction details kept inside.</p>
<p>Blockchains as they stand right now operate at relatively anemic transaction rates as compared to other financial transaction systems that one uses day-to-day (e.g. Visa). Also, they’ve been around for a somewhat limited amount of time. These two aspects together put into question whether this truly is a “big data” problem or just a regular data problem. I contend, and hopefully will show in a bit here, that nontrivial analysis of blockchains puts us in a “small-to-medium data, big compute” territory. As such, this fits well within my preferred data analysis tools of <a href="http://spark.apache.org">Apache Spark</a>, Python and Jupyter.</p>
<h2 id="ethereum-a-virtual-machine-on-a-chain">Ethereum: A Virtual Machine on a Chain</h2>
<p>The most attractive blockchain to analyze, in my opinion, is Ethereum. From <a href="https://en.wikipedia.org/wiki/Ethereum">Wikipedia</a>:</p>
<blockquote>
<p>Ethereum is an open-source, public, blockchain-based distributed computing platform featuring smart contract (scripting)
functionality. Ether is a cryptocurrency whose blockchain is maintained by the Ethereum platform, which provides a
distributed ledger for transactions. Ether can be transferred between accounts and used to compensate participant
nodes for computations performed. Ethereum provides a decentralized Turing-complete virtual machine, the Ethereum
Virtual Machine (EVM), which can execute scripts using an international network of public nodes. “Gas”, an internal
transaction pricing mechanism, is used to mitigate spam and allocate resources on the network.</p>
</blockquote>
<p>I like several aspects of this project:</p>
<ul>
<li>It is well used every day and growing in popularity</li>
<li>It seems to have a broad vision; the blockchain as a platform for smart contracts is enticing</li>
<li>It’s moving away from a proof of work model, which results in huge energy consumption</li>
<li>Gathering transaction data from <a href="https://github.com/ethereum/go-ethereum/wiki/geth">geth</a>, the Ethereum node, is doable via the JSON-RPC interface it provides.</li>
</ul>
<p>The thing that I like the most, however, is that it seems to be a multi-use chain. You see a lot happening on this blockchain:</p>
<ul>
<li><a href="https://www.prnewswire.com/news-releases/cryptokitties-the-worlds-first-ethereum-game-launches-today-660494083.html">Cat breeding games</a></li>
<li>A proper cryptocurrency (named <a href="https://en.wikipedia.org/wiki/Ethereum">Ether</a>)</li>
<li>Other cryptocurrencies (e.g. <a href="https://en.wikipedia.org/wiki/ERC20">ERC-20</a>) and <a href="https://en.wikipedia.org/wiki/Initial_coin_offering">initial coin offerings</a></li>
</ul>
<p>For these reasons, Ethereum seems like the blockchain most ripe for analysis. Specifically, it would be interesting to find some analytics that might yield insights into how this chain works on a day-to-day basis. While not necessarily tied to predicting price, it would be of particular interest to investors to find something which connects, even indirectly, to future price.</p>
<h2 id="the-tale-of-two-users">The Tale of Two Users</h2>
<p>It’s easy to say one should be looking at advanced analytics using the full data from the blockchain. It’s quite a different story to actually suggest <em>what</em> to look at here. I will proceed from a couple of observations:</p>
<ul>
<li>Transaction data forms a graph, so it is possible to borrow machinery from Graph Theory if necessary</li>
<li>There are at least two interesting actors in this scenario: the new user and the established player</li>
</ul>
<p>The “new user” is a user who is using the blockchain for the first time, whereas the “established player” is a hash which is important and somewhat central to the blockchain (e.g. involved in both sending and receiving transactions with many people). I maintain that these are two interesting actors insofar as observing the blockchain transactions from the vantage point of these users will yield insights as to the general health, well-being and state of the blockchain. If either of these actors changes their behavior appreciably, it’s worth knowing and will probably have some impact on the fundamental usage patterns of the blockchain in question. It may even, if we’re very lucky, give us a hint as to how the price may change.</p>
<p>We now face a couple of challenges:</p>
<ul>
<li>Formally defining these two actors in such a way that one can distinguish between them could be computationally daunting</li>
<li>What precisely should one measure through the lens of these actors?</li>
</ul>
<p>Starting from the bottom, I think a sensible starting point here is to measure the daily percentage of transactions being done by each of these actors. Plotting this opposite the price, one may see the effect that each of these actors may have on the price.</p>
<h2 id="the-new-user-impact">The New User Impact</h2>
<p>Let’s call the daily percentage of transactions involving a hash never before seen the “new user impact.” Just the act of picking out hashes that have never been seen before can be rather daunting given that there have been over 20 million distinct hashes between Ethereum’s inception and January 18, 2018. This sort of analysis is beyond a simple SQL query, but it is well within Spark’s sweet spot of enabling lower-level operations and distributed computing primitives. Judicious use of bloom filters in Spark opens us up to performing these kinds of computations in a scalable way.</p>
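<p>The bookkeeping behind this metric can be sketched in plain Python. This is just a single-machine sketch with hypothetical toy transactions; the real run distributes the work with Spark, where a bloom filter stands in for the “seen” set:</p>

```python
from collections import defaultdict

def daily_new_user_impact(transactions):
    """Percentage of each day's transactions involving a never-before-seen hash.

    transactions: list of (day, from_hash, to_hash) tuples, sorted by day.
    Returns a dict mapping day -> percentage of that day's transactions
    touching at least one previously unseen hash.
    """
    seen = set()  # in the Spark version, a bloom filter plays this role
    totals = defaultdict(int)
    new = defaultdict(int)
    for day, src, dst in transactions:
        totals[day] += 1
        if src not in seen or dst not in seen:
            new[day] += 1
        seen.add(src)
        seen.add(dst)
    return {day: 100.0 * new[day] / totals[day] for day in totals}

# Hypothetical toy ledger: hashes a and b are new on day 1; c is new on day 2.
txns = [(1, "a", "b"), (2, "a", "c"), (2, "a", "b")]
print(daily_new_user_impact(txns))  # {1: 100.0, 2: 50.0}
```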
<p><img src="files/ref_data/ethereum_analysis/new_hashes.png" style="width:100%" /></p>
<p>Observe the above plot, covering the time range from January 1, 2017 until January 18, 2018, with the closing price per day in blue plotted opposite the percentage of daily transactions involving a hash never before seen (the daily new user impact) in red.</p>
<p>Note the discordant nature of the new user impact and how little correlation to price is happening prior to mid-November. The behavior prior to mid-November is in stark contrast to the run-up in price and strong connection to the new user impact that happens from mid-November until early January. The fascinating thing here is that the new user impact seems to dip prior to the price dip in early January. It’s unclear whether this is a reliable early indicator (especially given its chaos earlier in the year), but it’s certainly worth investigating. It is somewhat unfortunate how volatile the new user impact becomes from mid-December onward.</p>
<h2 id="the-established-player-impact">The Established Player Impact</h2>
<p>In contrast to the “new user” as an actor, whose definition is easy to pin down in a technical way, the established player is tougher to specify rigorously. Given that the transactions on a blockchain form a graph, one can borrow some tooling from graph analytics to help us out. Specifically, define an “established player” for a specific day to be a hash whose <a href="https://en.wikipedia.org/wiki/PageRank#PageRank_of_an_undirected_graph">undirected pagerank</a> is in the top 10% of pageranks given the transaction graph of the previous 14 days. The intuition here is that this will define a set of “important” hashes in the network. Tracking how much of the network operates from these important hashes daily will give us some idea of the impact of the big players, such as exchanges and market makers, in the network.</p>
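<p>As a sketch of that definition, here is a tiny power-iteration pagerank over an undirected graph in plain Python. The star-shaped “hub” graph below is hypothetical, and a real 14-day window would be computed with Spark rather than a single-machine loop:</p>

```python
def undirected_pagerank(edges, damping=0.85, iters=50):
    """Power-iteration pagerank over an undirected graph given as (u, v) edges."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    nodes = list(neighbors)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        # each node receives a share of its neighbors' rank, split by degree
        rank = {node: (1 - damping) / n
                + damping * sum(rank[nb] / len(neighbors[nb]) for nb in neighbors[node])
                for node in nodes}
    return rank

def established_players(edges, top_fraction=0.10):
    """Hashes whose pagerank lands in the top fraction for the window's graph."""
    rank = undirected_pagerank(edges)
    cutoff = sorted(rank.values(), reverse=True)[max(0, int(len(rank) * top_fraction) - 1)]
    return {node for node, r in rank.items() if r >= cutoff}

# Hypothetical 14-day window: one central "hub" hash transacts with everyone else.
edges = [("hub", "u%d" % i) for i in range(10)]
print(established_players(edges))  # {'hub'}
```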
<p><img src="files/ref_data/ethereum_analysis/pagerank_plot.png" style="width:100%" /></p>
<p>Observe the above plot, covering an abbreviated time range of July 2017 until January 18, 2018, with the closing price per day in blue plotted opposite the percentage of daily transactions involving a hash from an established player in red. This time range is abbreviated because it’s fairly costly to compute the pagerank of even two weeks’ worth of transaction data (a more serious analysis would imply more serious compute and thus might adjust these parameters).</p>
<p>The thing I immediately notice here is that, like the new user impact, the established player impact seems to couple with the price starting in mid-November. Also, similar to the new user impact, it deviates prior to the actual price drop, but it is decidedly less chaotic immediately prior to the mid-January dip and thus possibly more reliable.</p>
<h2 id="in-conclusion">In Conclusion</h2>
<p>The core impulse behind this exercise is to find some essential analytics to summarize the behavior of the network from a particular vantage point (or set of vantage points). One must be careful about drawing conclusions from this exercise regarding predictive leading indicators of price. Rather, stepping back, these are the beginnings of a set of analytics that one can monitor over time to better understand how Ethereum, as a blockchain, moves, lives and breathes on a day-to-day basis. Inflection points in these analytics tie to usage shifts, and when they occur, assumptions in the technical analysis of this blockchain should be reevaluated or else risk becoming stale or less effective. For instance, if one sees a precipitous drop in the new user impact over a week, then either users are not using the chain (which you can see in early 2017 in the “New User Impact” plot) or Ethereum has reached saturation (i.e. no new users, but still much usage). For a young blockchain, new user usage is imperative for robust growth, and thus it will be a turning point when the chain is saturated.</p>
<p>Thinking beyond this analysis, I plan to go on and look at some of the other graph theoretic analytics that can be tracked over time in both Ethereum as well as other established blockchains, most obviously Bitcoin:</p>
<ul>
<li>The number of <a href="https://www.geeksforgeeks.org/number-of-triangles-in-a-undirected-graph/">transaction triangles</a> per day to get an indication of the transaction movement in the chain</li>
<li>The number of “communities” in the transaction graph by applying a <a href="https://en.wikipedia.org/wiki/Label_Propagation_Algorithm">label propagation algorithm</a> to the transaction graph daily.</li>
</ul>
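<p>Both of these are standard graph routines; as a flavor of the first, a minimal triangle count in plain Python looks like the sketch below (a production run on a full day of transactions would use Spark’s graph tooling, and the edges here are hypothetical):</p>

```python
from itertools import combinations

def count_triangles(edges):
    """Count distinct triangles in an undirected graph given as (u, v) edges."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    triangles = 0
    for node, nbrs in neighbors.items():
        # a triangle exists when two neighbors of a node are themselves adjacent
        for a, b in combinations(nbrs, 2):
            if b in neighbors[a]:
                triangles += 1
    return triangles // 3  # each triangle is counted once per vertex

# Hypothetical daily transaction graph: one triangle plus a dangling edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(count_triangles(edges))  # 1
```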
<p>Also, I’ll look closer at analytics involving the amount of ether transacted per day:</p>
<ul>
<li>50th, 75th, 90th and 95th percentile of the amount of ether transacted by new users</li>
<li>50th, 75th, 90th and 95th percentile of the amount of ether transacted by established players</li>
</ul>
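<p>Those percentile summaries need nothing fancy; a nearest-rank percentile sketch in plain Python, with hypothetical ether amounts, looks like this:</p>

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100)
    return ordered[rank - 1]

# Hypothetical daily ether amounts transacted by new users.
amounts = [0.1, 0.5, 1.2, 2.0, 3.3, 5.0, 8.8, 13.0, 21.0, 40.0]
print({p: percentile(amounts, p) for p in (50, 75, 90, 95)})
# {50: 3.3, 75: 13.0, 90: 21.0, 95: 40.0}
```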
<h1 id="nlp-for-non-text">Word2Vec with Non-Textual Data</h1>
<p>2015-12-04</p>
<p>At least half of the battle of data analysis and data science is understanding your data.</p>
<p>That sounds obvious, but I’ve seen whole data science projects fail
because not nearly enough time was spent on the exercise of understanding
your data. There are only two real ways to go about doing this:</p>
<ul>
<li>Ask an expert</li>
<li>Ask the data</li>
</ul>
<p>To have a shot at doing this you really have to do both.</p>
<p>In the course of this blog post, I’m going to describe some of the
challenges with understanding data and I’ll go into some technical
detail of how to borrow some scalable unsupervised learning from natural language
processing coupled with a very nice data visualization to facilitate
understanding the natural organization and arrangement of data.</p>
<h1 id="subject-matter-experts">Subject Matter Experts</h1>
<p>I spend a lot of time with healthcare
data and the obvious subject matter experts are nurses and doctors.
These people are very gracious, very knowledgeable and extremely pressed
for time. The problem with expert knowledge is that it’s surprisingly hard
to communicate sufficient nuance effectively enough to help the working data
scientist accomplish their goals. Furthermore, it’s extremely time consuming.
This is made doubly hard when the expert is entirely unclear about the goal.</p>
<p>The second, perhaps less obvious, challenge is that subject matter
experts’ knowledge is biased toward that which is already known. Often
data scientists and analysts are trying to understand the data not as an
end, but rather as a means to gaining insight. If you only take into
account received knowledge, then making unexpected insights can be
challenging. That being said, spending time with subject matter experts is
a necessary yet insufficient part of data analysis.</p>
<h1 id="unsupervised-data-understanding">Unsupervised Data Understanding</h1>
<p>To complete the task of understanding your data, I have found that it is
necessary to spend time looking at the data. One can think of the
entire field of statistics as an exercise in building a mechanism to ask
data pointed questions and get answers that we can trust, often with
caveats. The goal is generally to get a sense of how the data is organized
or arranged. With the unbelievable complexity of
most real data, we are forced to simplify our representations. The
question is just precisely how to simplify that representation to find the
proper balance between simplicity and complexity. More than that, some
representations of the data offer useful views of the data for certain
purposes and not for others.</p>
<p>Common simplified representations of data are things like distributions, histograms, and plots. Of course there are other even more complex ways to represent your data. Whole <a href="http://www.ayasdi.com/">companies</a> have been formed around providing a way to gain insight through more complex organizations of the data, taking some of the burden of interpretation from our brain and encoding it in an organization scheme.</p>
<p>Today, I’d like to talk about another approach to data simplification
for event data, one which provides not just an interesting representation,
but also a way to ask certain kinds of useful questions of your data.</p>
<h2 id="word2vec">Word2Vec</h2>
<p>One common way to impose order on data that is used by engineers and
mathematicians everywhere is to embed your data in a <a href="https://en.wikipedia.org/wiki/Vector_space">vector space</a> with a
<a href="https://en.wikipedia.org/wiki/Metric_space">metric</a> on it.
This gives us a couple of things:</p>
<ul>
<li>Data now has a distance which can be interpreted as the degree of
“difference” between the data</li>
<li>Data can be combined via addition and subtraction operations which can
be interpreted as combination and separation operations</li>
</ul>
<p>The issue now is how you impose this structure by embedding your data,
which may not even be numeric, into a vector space. Thankfully, the
nice people at <a href="http://www.google.com">Google</a> developed a nice way of
doing this in the domain of natural language text called
<a href="http://arxiv.org/abs/1310.4546">Word2Vec</a>.</p>
<p>I won’t go into extravagant detail on the implementation, as Radim
Řehůřek did a great job <a href="http://rare-technologies.com/making-sense-of-word2vec/">here</a>.
The major takeaway, however, is that by using the inherent structure of natural
language, Word2Vec is able to construct a vector space such that:</p>
<ul>
<li>Word similarity can be interpreted as a distance calculation</li>
<li>The notion of analogies can be interpreted using the addition and
subtraction operators (e.g. the vector representation of king - male + female is near the vector representation of queen).</li>
</ul>
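<p>The analogy arithmetic is nothing more than vector addition and cosine similarity. A toy sketch in plain Python, using hand-made two-dimensional vectors rather than a trained model, shows the mechanics:</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made embeddings (not learned): axis 0 is "gender", axis 1 is "royalty".
vectors = {
    "king":   [1.0, 1.0],
    "queen":  [-1.0, 1.0],
    "male":   [1.0, 0.0],
    "female": [-1.0, 0.0],
}

# king - male + female, then find the nearest word by cosine similarity.
target = [k - m + f for k, m, f in
          zip(vectors["king"], vectors["male"], vectors["female"])]
best = max((w for w in vectors if w != "king"),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```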
<p>This is a surprisingly rich organization of data and one that has proven
very effective in enhancing the accuracy of machine learning models that
deal with natural language. Perhaps the most surprising part of this is
that the vectorization model does not utilize any of the grammatical
structure of the natural language directly. It simply analyzes the
words within the sentences and through usage it fits the proper
embedding. This led me to consider whether other, non-textual data
which has some inherent structure can also be organized this way with
the same algorithm.</p>
<h2 id="medical-data">Medical Data</h2>
<p>Whenever we go to the doctor, a set of events happen:</p>
<ul>
<li>Measurements are made (e.g. blood pressure, pulse, height, weight)</li>
<li>Labs are drawn and ordered (e.g. blood tests)</li>
<li>Procedures are performed (e.g. an x-ray)</li>
<li>Diagnoses are made</li>
<li>Drugs are prescribed</li>
</ul>
<p>These events happen in a certain <em>overall</em> order but the order varies based on the
patient situation and according to the medical staff’s best judgement.
We will call this set of events a <strong>medical encounter</strong> and they happen every day all over the world.</p>
<p>This sequence of events has a similar tone to what we’re familiar with
in natural language. The encounter can be thought of as a sort of
medical sentence. Each medical event within the encounter can be
thought of as a medical word. The type of event (lab, procedure,
diagnoses, etc.) can be considered as a sort of part-of-speech.</p>
<p>It remains to determine if this structure can be teased out and encoded
into a vector space model like natural language can be. If so, then we
can ask questions like:</p>
<ul>
<li>How similar are two diseases based on how they are treated and
<a href="https://en.wikipedia.org/wiki/Comorbidity">comorbidities</a> found in
the same encounter?</li>
<li>Can we compose diseases and make them similar to other diseases? For
instance, is the vector representation of type 2 diabetes - obesity
close to type 1 diabetes?</li>
</ul>
<p>When considering trying this technique out, the problem, of course, is getting
access to medical data. This data is extremely sensitive and is covered
by
<a href="https://en.wikipedia.org/wiki/Health_Insurance_Portability_and_Accountability_Act">HIPAA</a> here in the United States.
What we need is a good, depersonalized set of medical encounter data.</p>
<p>Thankfully, back in 2012 an electronic medical records system, <a href="http://www.practicefusion.com">Practice
Fusion</a>, released a set of 10,000 depersonalized medical records as
part of a kaggle <a href="https://www.kaggle.com/c/pf2012-diabetes">competition</a>. This opened up the possibility of actually doing this analysis, albeit on a small subset of the population.</p>
<h2 id="implementation">Implementation</h2>
<p>Since I’ve been doing a lot with Spark lately at work, I wanted to see
if I could use the Word2Vec implementation built into SparkML to
accomplish this. Also, frankly, having worked with medical data at some
big hospitals and insurance companies, I am aware that there is a real
scale problem when doing something this complex for millions of medical
encounters and I wanted to ensure that anything I did could scale.</p>
<p>The implementation boiled down into a few steps, which are common to
most projects that I’ve seen run on Hadoop. I have created a small
github repo to capture the code collateral used to process the data
<a href="https://github.com/cestella/presentations/tree/master/NLP_on_non_textual_data">here</a>.</p>
<ul>
<li>Ingest the Practice Fusion database dumps into Hadoop.
<ul>
<li>Shell script <a href="https://github.com/cestella/presentations/blob/master/NLP_on_non_textual_data/src/main/bash/ingest_practice_fusion.sh">here</a></li>
</ul>
</li>
<li>Pin up Hive tables for each of the tables, roughly corresponding to a table per medical event.
<ul>
<li>The set of DDL’s are <a href="https://github.com/cestella/presentations/tree/master/NLP_on_non_textual_data/src/main/ddl/practicefusion">here</a></li>
</ul>
</li>
<li>Transform this tabular data into a corpus of medical event sentences.
<ul>
<li>The ETL pig scripts are <a href="https://github.com/cestella/presentations/tree/master/NLP_on_non_textual_data/src/main/pig/practicefusion">here</a></li>
<li>The shell script executing the pig scripts are <a href="https://github.com/cestella/presentations/blob/master/NLP_on_non_textual_data/src/main/bash/etl.sh">here</a></li>
</ul>
</li>
<li>Build the word2vec model with Spark.</li>
</ul>
<p>You can see from the Jupyter notebook detailing the model building
portion and results <a href="https://github.com/cestella/presentations/blob/master/NLP_on_non_textual_data/src/main/ipython/clinical2vec.ipynb">here</a> that model building is only a scant few lines:</p>
<pre>
<code>
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

# sc is the SparkContext provided by the notebook environment.
# Each line of the corpus is one medical encounter rendered as a
# whitespace-delimited "sentence" of medical event tokens.
sentences = sc.textFile("practice_fusion/sentences_nlp").map(lambda row: row.split(" "))

# Fit a 100-dimensional Word2Vec model; the seed is fixed for reproducibility.
word2vec = Word2Vec()
word2vec.setSeed(0)
word2vec.setVectorSize(100)
model = word2vec.fit(sentences)
</code>
</pre>
<h1 id="results">Results</h1>
<p>One of the problems with unsupervised models is evaluating how well our
model is describing reality. For the purpose of this entirely
unscientific analysis, we’ll restrict ourselves to just diagnoses and
ask a couple of questions of the model:</p>
<ul>
<li>Does the model correctly recover what we currently know based on
medical research?</li>
<li>Does the model show us anything that is novel and likely, but unknown
at present?</li>
</ul>
<p>One thing to note before we get started. This model uses <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> as the score. This measure of similarity ranges from -1 to 1, with 1 being most similar.</p>
<h2 id="atherosclerosis">Atherosclerosis</h2>
<p>Also known as heart disease or hardening of the arteries. This disease
is the number one killer of Americans. Our model found the following
similar diseases:</p>
<table id="atherosclerosisTable" class="tablesorter">
<thead>
<tr>
<th>ICD9 Code</th><th>Description</th><th>Score</th>
</tr>
</thead>
<tbody>
<tr><td>v12.71</td><td>Personal history of peptic ulcer disease</td><td>0.930</td></tr>
<tr><td>533.40</td><td>Chronic or unspecified peptic ulcer of unspecified site with hemorrhage, without mention of obstruction</td><td>0.926</td></tr>
<tr><td>153.6</td><td>Malignant neoplasm of ascending colon</td><td>0.910</td></tr>
<tr><td>238.75</td><td>Myelodysplastic syndrome, unspecified</td><td>0.910</td></tr>
<tr><td>389.10</td><td>Sensorineural hearing loss, unspecified</td><td>0.907</td></tr>
<tr><td>428.30</td><td>Diastolic heart failure, unspecified</td><td>0.904</td></tr>
<tr><td>v43.65</td><td>Knee joint replacement</td><td>0.902</td></tr>
</tbody>
</table>
<p><br /></p>
<p><strong>Peptic Ulcers</strong></p>
<p>There have been long-standing connections noticed between ulcers and
atherosclerosis, partially due to smokers having a higher than average
incidence of both peptic ulcers and atherosclerosis. You can see an <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1611891/">editorial</a>
in the British Medical Journal from all the way back in the 1970s discussing
this.</p>
<p><strong>Hearing Loss</strong></p>
<p>From an <a href="http://www.ncbi.nlm.nih.gov/pubmed/23102449">article</a> from the Journal of Atherosclerosis in 2012:</p>
<blockquote>
<p>Sensorineural hearing loss seemed to be associated with vascular
endothelial dysfunction and an increased cardiovascular risk</p>
</blockquote>
<p><strong>Knee Joint Replacements</strong></p>
<p>These procedures are common among those with osteoarthritis and there
has been a solid correlation between osteoarthritis and atherosclerosis
in <a href="http://www.ncbi.nlm.nih.gov/pubmed/22563029">the</a> <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3196360/">literature</a>.</p>
<h2 id="crohns-disease">Crohn’s Disease</h2>
<p>Crohn’s disease is a type of inflammatory bowel disease that is caused
by a combination of environmental, immune and bacterial factors. Let’s
see if we can recover some of these connections from the data.</p>
<table id="crohnsTable" class="tablesorter">
<thead>
<tr>
<th>ICD9 Code</th><th>Description</th><th>Score</th>
</tr>
</thead>
<tbody>
<tr><td>274.03</td><td>Chronic gouty arthropathy with tophus (tophi)</td><td>0.870</td></tr>
<tr><td>522.5</td><td>Periapical abscess without sinus</td><td>0.869</td></tr>
<tr><td>579.3</td><td>Other and unspecified postsurgical nonabsorption</td><td>0.863</td></tr>
<tr><td>135</td><td>Sarcoidosis</td><td>0.859</td></tr>
<tr><td>112.3</td><td>Candidiasis of skin and nails</td><td>0.855</td></tr>
<tr><td>v16.42</td><td>Family history of malignant neoplasm of prostate</td><td>0.853</td></tr>
</tbody>
</table>
<p><br /></p>
<p><strong>Arthritis</strong></p>
<p>From the <a href="http://www.ccfa.org/resources/arthritis.html">Crohn’s and Colitis Foundation of
America</a>:</p>
<blockquote>
<p>Arthritis, or inflammation of the joints, is the most common extraintestinal complication of IBD. It may affect as many as 25% of people with Crohn’s disease or ulcerative colitis. Although arthritis is typically associated with advancing age, in IBD it often strikes the youngest patients.</p>
</blockquote>
<p><strong>Dental Abscesses</strong></p>
<p>While not much medical literature exists with a specific link between dental
abscesses and Crohn’s (there are general oral issues noticed
<a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1410927/">here</a>), you
do see lengthy discussions on the Crohn’s <a href="http://www.crohnsforum.com/showthread.php?t=37075">forums</a> about abscesses being a
common occurrence with Crohn’s.</p>
<p><strong>Yeast Infections</strong></p>
<p>Candidiasis of skin and nails is a form of yeast infection on the skin. From the journal “Critical Reviews in Microbiology” <a href="http://www.ncbi.nlm.nih.gov/pubmed/23855357">here</a>:</p>
<blockquote>
<p>It is widely accepted that Crohn’s disease could result from an inappropriate
inflammatory response to intestinal microorganisms in a genetically
susceptible host. Most studies to date have concerned the involvement of
bacteria in disease progression. In addition to bacteria, there appears
to be a possible link between the commensal yeast Candida albicans and
disease development.</p>
</blockquote>
<h2 id="visualization">Visualization</h2>
<p>For further investigation, I have used <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-distributed stochastic neighbor embedding</a>
to embed the 100-dimensional vector space into 2 dimensions. This
embedding should retain the general connections within the data, so you
can look at similar diagnoses, drugs and allergies.</p>
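<p>The embedding step can be sketched with scikit-learn’s TSNE (assuming scikit-learn is available; the 100-dimensional vectors below are random stand-ins for the trained word vectors):</p>

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-ins for 20 learned 100-dimensional word vectors.
rng = np.random.RandomState(0)
vectors = rng.randn(20, 100)

# Perplexity must be below the number of samples for tiny datasets.
embedding = TSNE(n_components=2, perplexity=5, init="random",
                 learning_rate=200.0, random_state=0).fit_transform(vectors)
print(embedding.shape)  # (20, 2)
```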
<ul>
<li>You can choose to look at all types, just diagnoses or just drugs.</li>
<li>Highlight in the canvas below and drag around. The points that you’ve
selected will show up in the table below along with a description in
plain text.</li>
</ul>
<p>Please play around with this data and let me know what you find!</p>
<style>
.tooltip {
position: absolute;
width: 300px;
height: auto;
pointer-events: none;
border: 1px solid #000;
background-color: #FFF;
border-radius: 5px;
padding:10px;
}
.brush {
fill: teal;
stroke: teal;
fill-opacity: 0.2;
stroke-opacity: 0.8;
}
</style>
<link rel="stylesheet" href="files/css/theme.cstella.css" />
<script type="text/javascript" src="files/ref_data/word_vectors_rx.js"></script>
<script type="text/javascript" src="files/ref_data/word_vectors_dx.js"></script>
<script type="text/javascript" src="files/ref_data/word_vectors_all.js"></script>
<script src="files/js/d3.min.js"></script>
<script src="files/js/crossfilter.min.js"></script>
<script src="files/js/jquery.min.js"></script>
<script src="files/js/jquery.tablesorter.js"></script>
<script src="files/js/jquery.tablesorter.widgets.js"></script>
<script src="files/js/widget-scroller.js"></script>
<script>
$(document).ready(function()
{
$("#resultTable").tablesorter( {
theme: "cstella"
, widthFixed: true
, showProcessing: true
, widgets: ['zebra','uitheme', 'scroller']
, widgetOptions : {
scroller_height : 200,
scroller_barWidth : 17,
scroller_jumpToHeader: true,
scroller_idPrefix : 's_'
}
}
);
$("#atherosclerosisTable").tablesorter( {
theme: "cstella"
, widthFixed: true
, showProcessing: true
, widgets: ['zebra','uitheme', 'scroller']
, widgetOptions : {
scroller_height : 300,
scroller_barWidth : 17,
scroller_jumpToHeader: true,
scroller_idPrefix : 's_'
}
}
);
$("#crohnsTable").tablesorter( {
theme: "cstella"
, widthFixed: true
, showProcessing: true
, widgets: ['zebra','uitheme', 'scroller']
, widgetOptions : {
scroller_height : 300,
scroller_barWidth : 17,
scroller_jumpToHeader: true,
scroller_idPrefix : 's_'
}
}
);
}
);
</script>
<div id="chart">
</div>
<table style="border-spacing: 5px">
<tr>
<td><a href="#" onclick="display_plot(data_all);updateDots();return false;">All</a></td>
<td style="border:solid 2px black" width="50px"></td>
</tr>
<tr>
<td>Provider Specialty</td>
<td style="border:solid 2px black" width="50px" bgcolor="black"></td>
</tr>
<tr>
<td><a href="#" onclick="display_plot(data_dx);updateDots();return false;">Diagnoses</a></td>
<td style="border:solid 2px black" width="50px" bgcolor="red"></td>
</tr>
<tr>
<td><a href="#" onclick="display_plot(data_rx);updateDots();return false;">Drugs</a></td>
<td style="border:solid 2px black" width="50px" bgcolor="blue"></td>
</tr>
<tr>
<td>Allergies</td>
<td style="border:solid 2px black" width="50px" bgcolor="orange"></td>
</tr>
</table>
<div id="table">
<table id="resultTable" class="tablesorter">
<thead>
<tr>
<th>Type</th>
<th>Name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr><td colspan="3"><center><b>Highlight some points above for this summary to be filled in.</b></center></td></tr>
</tbody>
</table>
</div>
<script>
var width = 900,
height = 900,
margin = 40;
function update_table(points) {
var tableDiv = document.getElementById('table');
while(tableDiv.firstChild) {
tableDiv.removeChild(tableDiv.firstChild);
}
// create elements <table> and a <tbody>
var tbl = document.createElement("table");
tbl.id = 'resultTable';
tbl.className = 'tablesorter';
var thead = document.createElement("thead");
var tr = document.createElement("tr");
{
var cell = document.createElement("th");
var cellText = document.createTextNode('Type');
cell.appendChild(cellText);
tr.appendChild(cell);
}
{
var cell = document.createElement("th");
var cellText = document.createTextNode('Name');
cell.appendChild(cellText);
tr.appendChild(cell);
}
{
var cell = document.createElement("th");
var cellText = document.createTextNode('Description');
cell.appendChild(cellText);
tr.appendChild(cell);
}
thead.appendChild(tr);
tbl.appendChild(thead);
var tblBody = document.createElement("tbody");
for(var i = 0;i < points.length;++i)
{
// table row creation
var row = document.createElement("tr");
var d = points[i];
{
var cell = document.createElement("td");
var txt = d['type'];
if(txt == 'dx') {
txt = 'Diagnosis';
}
else if(txt == 'rx') {
txt = 'Drugs';
}
else if(txt == 'provider specialty') {
txt = 'Provider Specialty';
}
cell.innerHTML = txt;
row.appendChild(cell);
}
{
var cell = document.createElement("td");
var txt = d['name'];
cell.innerHTML = txt;
row.appendChild(cell);
}
{
var cell = document.createElement("td");
var txt = d['description'];
cell.innerHTML = txt;
row.appendChild(cell);
}
tblBody.appendChild(row);
}
tbl.appendChild(tblBody);
tableDiv.appendChild(tbl);
$("#resultTable").tablesorter( {
theme: "cstella"
, widthFixed: true
, showProcessing: true
, widgets: ['zebra','uitheme', 'scroller']
, widgetOptions : {
scroller_height : 500,
scroller_barWidth : 17,
scroller_jumpToHeader: true,
scroller_idPrefix : 's_'
}
}
);
}
var xScale = d3.scale.linear()
.range([0, width])
, xValue = function(d) { return d["vec"][0];}
, xMap = function(d) { return xScale(xValue(d));};
var yScale = d3.scale.linear()
.range([height, 0])
, yValue = function(d) { return d["vec"][1];}
, yMap = function(d) { return yScale(yValue(d));};
// setup fill color
var cValue = function(d) {
if(d["type"] == 'dx') {
return 'red';
}
else if(d["type"] == 'rx') {
return 'blue';
}
else if(d["type"] == 'provider specialty') {
return 'black';
}
else
{
return 'orange';
}
};
var xf;
var xDim;
var yDim;
var svg;
var tooltip;
function display_plot(data) {
// don't want dots overlapping axis, so add in buffer to data domain
xScale.domain([d3.min(data, xValue)-1, d3.max(data, xValue)+1]);
yScale.domain([d3.min(data, yValue)-1, d3.max(data, yValue)+1]);
var xAxis = d3.svg.axis()
.scale(xScale)
.orient('bottom');
var yAxis = d3.svg.axis()
.scale(yScale)
.orient('left');
var brush = d3.svg.brush()
.x(xScale)
.y(yScale);
// add the tooltip area to the webpage
d3.select('#chart_svg').remove();
d3.select('#chart_tooltip').remove();
tooltip = d3.select("#chart").append("div")
.attr("class", "tooltip")
.attr("id", "chart_tooltip")
.style("opacity", 0);
svg = d3.select('#chart')
.append('svg')
.attr("id", "chart_svg")
.attr('width', width+2*margin)
.attr('height', height+2*margin)
.append('g')
.attr('transform', 'translate('+margin+','+margin+')');
svg.append('g')
.attr('class', 'x axis')
.attr('transform', 'translate(0,'+height+')')
.call(xAxis);
svg.append('g')
.attr('class', 'y axis')
.call(yAxis);
svg.append('g')
.attr('class', 'brush')
.call(brush);
xf = crossfilter(data);
xDim = xf.dimension(xValue);
yDim = xf.dimension(yValue);
brush.on('brush', function() {
var extent = brush.extent(),
xExtent = [extent[0][0], extent[1][0]],
yExtent = [extent[0][1], extent[1][1]];
xDim.filterRange(xExtent);
yDim.filterRange(yExtent);
update_table(xDim.top(Infinity));
});
}
function updateDots() {
var dots = svg.selectAll('.dot')
.data(xDim.top(Infinity));
dots.enter().append('circle')
.attr('class', 'dot')
.attr('r', 3)
.attr('fill', cValue)
.on("mouseover", function(d) {
tooltip.transition()
.duration(200)
.style("opacity", .9);
tooltip.html("" + d['name'] + "<br>@ (" + d['vec'][0].toFixed(2) + ',' + d['vec'][1].toFixed(2) + ')')
.style("left", (d3.event.pageX + 10) + "px")
.style("top", (d3.event.pageY - 10) + "px");
})
.on("mouseout", function() {
tooltip.transition()
.duration(500)
.style("opacity", 0);
});
dots
.attr('cx', xMap)
.attr('cy', yMap);
dots.exit().remove();
}
display_plot(data_all);
updateDots();
</script>
Data Science and Hadoop: Impressions and Example (2014-10-24, id:/pyspark-openpayments-analysis)
<p>A somewhat regular part of my job lately is discussing with people how
exactly one might go about doing data science on Hadoop. It’s really a
very interesting subject and one about which almost everyone even cursorily
associated with “Big Data” has an opinion. Remarks are
made, emails written, PowerPoint decks created; it’s a busy day, for
sure.</p>
<p>People cannot be blamed for being concerned since
according to <a href="http://www.informationweek.com/big-data/big-data-analytics/3-roadblocks-to-big-data-roi/d/d-id/1111593?">Jeff Kelly</a>
, a Wikibon analyst, the ROI of these big data projects does not match
expectations:</p>
<blockquote>
<p>In the long term, they expect 3 to 4 dollar return on investment for every
dollar. But based on our analysis, the average company right now is
getting a return of about 55 cents on the dollar.</p>
</blockquote>
<p>That’s pretty concerning for those of us hoping for Hadoop to <a href="http://en.wikipedia.org/wiki/Crossing_the_Chasm">cross the chasm</a> soon. As one might imagine, there’s been quite a bit of hand-wringing about the problem. I don’t take such a dim view of it, though. It’s a matter of maturity, and I’ll give some of my impressions shortly on why it may be hard to fulfill the data science portion of the ROI currently.</p>
<h1 id="outline">Outline</h1>
<ul>
<li><a href="#data_science_challenge">Data Science Challenges</a>
<ul>
<li><a href="#data_has_inertia">Data has Inertia</a></li>
<li><a href="#hadoop_maturing">Hadoop is Still Maturing as a Platform</a></li>
<li><a href="#analysis_paralysis">Analysis Paralysis</a></li>
</ul>
</li>
<li><a href="#example_analysis">Example Analysis with PySpark and Hadoop</a>
<ul>
<li><a href="pyspark-openpayments-analysis-part-2.html">Data Overview and Preprocessing</a></li>
<li><a href="pyspark-openpayments-analysis-part-3.html">Basic Structural Analysis</a></li>
<li><a href="pyspark-openpayments-analysis-part-4.html">Outlier Analysis</a></li>
<li><a href="pyspark-openpayments-analysis-part-5.html">Benford’s Law Analysis</a></li>
</ul>
</li>
<li><a href="#conclusions">Conclusions</a>
<ul>
<li><a href="#dull_blade">This is a Dull Blade Exercise</a></li>
<li><a href="#pyspark">PySpark + Hadoop as a Platform</a></li>
</ul>
</li>
</ul>
<div id="data_science_challenge"></div>
<h1 id="data-science-challenges">Data Science Challenges</h1>
<p>One benefit from my vantage point within the consulting wing of a Hadoop
<a href="http://www.hortonworks.com">distribution</a> is that I get to see quite a
few Hadoop projects. Being that I’m part of the Data Science team, I
get to have a decidedly Data Science oriented view of the Hadoop world.
Furthermore, I get to see them in both startups as well as big
enterprises. Even better, living in and working with
organizations from a <a href="http://en.wikipedia.org/wiki/Flyover_country">fly-over state</a>, I have a decidedly non-Silicon Valley perspective.</p>
<p>From this position, it’s not hard to see that making the leap from
owning a cluster to gaining insight from your data can be a daunting task. I’ll
just list a few challenges that I’ve noticed in my travels:</p>
<ul>
<li>Data has inertia</li>
<li>Hadoop is still maturing as a platform</li>
<li>Choices can be paralyzing</li>
</ul>
<p>The first is an organizational challenge, the second a
technical/product challenge, and the third a challenge of human
nature.</p>
<div id="data_has_inertia"></div>
<h2 id="data-has-inertia">Data has Inertia</h2>
<p>One of the competitive advantages of Hadoop is that inexpensive, commodity
hardware and a powerful distributed computing environment makes Hadoop a
pretty nice, cozy place for your data. This all looks great on paper
and in architecture slides. The challenge, however, is actually getting
the data to the platform.</p>
<p>Turns out moving data can be a tricky prospect. Much ink and
bits have been spilled discussing the technical approaches and
challenges to collecting
your data into a data lake. I won’t make you suffer through yet another
discussion of the finer points between <a href="http://sqoop.apache.org">sqoop</a>, <a href="http://flume.apache.org">flume</a>, etc. The technical challenges are almost never the long poles in the tent.</p>
<p>Rather, what I have witnessed is that getting that data to start moving
can be arduous and require political capital. I have noticed that there
is a tendency to treat those who come to you asking for data with a fair
amount of skepticism.</p>
<p>However, once data channels open up, data has a tendency to flow
more and more smoothly. This is why most of the successful projects
that I’ve been involved in have the following attributes:</p>
<ul>
<li>A sponsor with sufficient political power and the willingness to use it to get the data required to succeed</li>
<li>An iterative attitude so that the time to value is minimal</li>
</ul>
<p>These attributes are not specific to data science projects. Rather, the
same principle applies to all projects that require an abundance of
data. No data-oriented project can survive if starved of data and
almost all Hadoop projects are data-oriented.</p>
<div id="hadoop_maturing"></div>
<h2 id="hadoop-is-still-maturing-as-a-platform">Hadoop is Still Maturing as a Platform</h2>
<p>When I was young, I liked to climb trees. Growing up in rural Louisiana, I had plenty of opportunities on this front. As such, I got fairly good at picking out a good climbing tree. There is a non-zero intersection of trees which are good for climbing and trees which are pretty to look at or have some satisfying structural characteristics.</p>
<p>Often, however, the properties did not coexist in the same tree. Climbing trees were best if there were relatively low, thick branches with good spacing. Trees which were nice to look at were much more manicured with delicate branches and a certain symmetry.</p>
<p>Platforms have the same characteristics, I think. You have platforms that are very finely manicured with a focus on internal consistency and contained borders. This yields a good experience for those who use the system as the originators intended. These systems are pretty to look at, to be sure.</p>
<p>Ironically enough, I’ve always liked the sprawling systems with an emphasis on many integration points. They give the feeling that they are reaching out to you. That act of reaching out is the act of engaging. Hadoop is transitioning quickly from a finely manicured topiary sculpture to a fantastic climbing tree.</p>
<p>It started out very self-contained and internally consistent. If you used Java, you were going to have a good time (sometimes ;-). While you <em>could</em> use pipes and streaming to hook up your C++ code or Perl scripts, you weren’t going to have nearly as good of a time. Equivalently, on the algorithm front, if you could express what you wanted to do in MapReduce then the world was straightforward.</p>
<p><img src="files/ref_data/open_payments_files/Topiaryelephant.jpg" style="width:650px" />
<sub><a href="http://en.wikipedia.org/wiki/Topiary#mediaviewer/File:Topiaryelephant.jpg">Topiary Elephant</a> in Bang Pa In Palace, Thailand. CC BY-SA 3.0</sub></p>
<p>Now, as Hadoop matures, we see branches growing to other platforms
and to other distributed computing paradigms. On the
technical side, we can now write pure non-JVM UDFs in Pig, Spark has
proper first-class bindings for Python, and you can even write YARN apps in
languages other than those which run on the JVM. Much of this is thanks
to the new architecture in Hadoop 2.0, but more than just a technical direction,
it’s the realization by the community that we need more choices.</p>
<p>That being said, it’s early days and we’re not that far down the path to
the new way of thinking. This will be solved with time and maturity.</p>
<div id="analysis_paralysis"></div>
<h2 id="analysis-paralysis">Analysis Paralysis</h2>
<p>Data science isn’t a new thing. I know, this is a brave statement and a
deep conclusion. Forgiving its obviousness and pith, I actually mean
that most organizations are already doing and have been doing for years one of the core <em>things</em> people talk about as data science: developing insights from their data.</p>
<p>I walk into organizations and I talk with the data analysts and I ask
them about how they do their job on a day-to-day basis. Most of them
talk to me about things somewhere between logistic regression in SAS and doing very complex SQL in a data warehouse. I ask them what their pains are and almost to a person, they always say something like the following:</p>
<ul>
<li>Copies of the data are expensive with my limited quota</li>
<li>Getting the data from one system to another takes 24 hours at least.</li>
</ul>
<p>The data scientists aren’t clamoring for the things that you see so
often touted as the benefits of “Big Data”:</p>
<ul>
<li>Unstructured data</li>
<li>Running your models on a petabyte of data</li>
<li>Running sexy new algorithms at massive scale</li>
</ul>
<p>Does this mean that those things aren’t really needed? If so, our job
is easy: all we have to do is recreate SQL on Hadoop and convince
organizations to put their data there. That solves big portions of the
top complaints above.</p>
<p>The answer is obviously that the current gripes do not remove
the need for more data, differently structured data, or the other techniques in
the data science toolbag. So, why aren’t the data analysts that I talk
to chomping at the bit for them?</p>
<p>One reason, I think, is that with increasingly complex data come
increasing complexities in processing that data. Furthermore,
with structured data, the act of extracting/transforming/loading the
data was not a data scientist activity. It’s possible that, given more
complicated data, just extracting features from it might require more
arduous programming than analysts are used to. A good example of this
is within the realm of natural language processing projects.</p>
<p>Also, “Big Data” data science isn’t as convenient as small-data data science. Contrast the ease of using <a href="http://mahout.apache.org">Mahout</a> or Spark’s <a href="http://spark.apache.org/docs/1.1.0/mllib-guide.html">MLLib</a> with Python’s <a href="http://scikit-learn.org/stable/">scikit-learn</a>, <a href="http://www.r-project.org/">R</a> or <a href="http://www.sas.com/en_us/home.html">SAS</a>. It’s not a contest; it’s easier and quicker to deal with a few megabytes of data. Since there is value in dealing with much more data, we have to eat the elephant, but it can be daunting without guidance, and examples are few and far between.</p>
<p>Ultimately, I think we focus so heavily on new and novel techniques and game-changing paradigm shifts (with our tongues placed firmly in our cheeks sometimes) without discussing the journey to getting there. If we constantly look across the chasm without looking at the bridge beneath our feet, we run the risk of falling into the drink.</p>
<div id="example_analysis"></div>
<h1 id="example-analysis-with-pyspark-and-hadoop">Example Analysis with PySpark and Hadoop</h1>
<p>This brings me to why I wanted to create this post. I intend to show a
worked example of the day-to-day work I’ve seen data analysts do, along with some natural extension points that show how to use Hadoop to do possibly more interesting analysis. Namely:</p>
<ul>
<li>Understand some fundamental characteristics of the data to be analyzed</li>
<li>Generate reporting/images to communicate those characteristics to other people</li>
<li>Mine the data for likely incorrect or interesting data points that break with the characteristics found above.</li>
</ul>
<p>Over the course of the next few blog posts, I will take some recently opened data from the Centers for Medicare and Medicaid Services detailing the financial relationships between physicians, hospitals, etc. and medical manufacturers and use Spark’s Python bindings to look at the data, its shape, its outliers and look for data that may be amiss.</p>
<p>The individual phases have been split into 4 parts:</p>
<ul>
<li><a href="pyspark-openpayments-analysis-part-2.html">Data Overview and Preprocessing</a></li>
<li><a href="pyspark-openpayments-analysis-part-3.html">Basic Structural Analysis</a></li>
<li><a href="pyspark-openpayments-analysis-part-4.html">Outlier Analysis</a></li>
<li><a href="pyspark-openpayments-analysis-part-5.html">Benford’s Law Analysis</a></li>
</ul>
<div id="conclusions"></div>
<h1 id="conclusions">Conclusions</h1>
<div id="dull_blade"></div>
<h2 id="this-is-a-dull-blade-exercise">This is a Dull Blade Exercise</h2>
<p>I have been very careful not to draw conclusions or explicitly look for
fraud. This is intended to be a demonstration of technique, and I cannot
verify that this dataset isn’t rife with bad or misclassified data. As
such, I intend to demonstrate some of the basic and slightly more
advanced analysis techniques that are open to you using the Hadoop
platform.</p>
<div id="pyspark"></div>
<h2 id="pyspark--hadoop-as-a-platform">PySpark + Hadoop as a Platform</h2>
<p>If you have the interest/ability to be comfortable in a Python
environment, I think that for data investigation and ad hoc reporting,
interacting with Hadoop via IPython Notebook and the Spark Python
bindings is a fantastic experience.</p>
<p>Moving between SQL and more complex, fundamental analysis is
seamless. It all communicates in terms of RDDs for maximum ease of
composition. I could have used any of the rest of the Spark stack, such
as MLLib or GraphX.</p>
<p>Having all of this running on Hadoop, allowing me to do ETL and work in
the <em>other</em> parts of the ecosystem such as Pig, Hive, etc. is an
extremely compelling aspect as well. Ultimately, we’re approaching a
very cost effective and well thought out system for analyzing data.</p>
<p>It’s not all roses, however. When something goes wrong, it can be challenging to trace back the problem from the mix of Java/Scala and Python stack trace that is returned to you.</p>
<p>There can be some IT challenges as well. If you use a Python package in an RDD operation, you must have the package installed on the cluster. This may pose some challenges, as many different people are going to need differing versions of dependencies. Traditionally this is handled through things like virtualenv, but executing a function within the context of a virtualenv isn’t supported and, even if it were, managing a virtualenv across a set of data nodes can be a challenge in itself.</p>
<p>If you would prefer to see the raw IPython Notebook, you can find it
hosted on <a href="http://nbviewer.ipython.org/url/blog.caseystella.com/files/ref_data/open_payments_files/open_payments.ipynb">nbviewer.ipython.org</a>.</p>
Data Science and Hadoop: Part 5, Benford's Law Analysis (2014-10-24, id:/pyspark-openpayments-analysis-part-5)
<h2 id="context">Context</h2>
<p>This is the last part of a 5 part <a href="pyspark-openpayments-analysis.html">series</a> on analyzing data with PySpark:</p>
<ul>
<li><a href="pyspark-openpayments-analysis.html">Data Science and Hadoop : Impressions</a></li>
<li><a href="pyspark-openpayments-analysis-part-2.html">Data Overview and Preprocessing</a></li>
<li><a href="pyspark-openpayments-analysis-part-3.html">Basic Structural Analysis</a></li>
<li><a href="pyspark-openpayments-analysis-part-4.html">Outlier Analysis</a></li>
<li>Benford’s Law Analysis</li>
</ul>
<h1 id="benfords-law">Benford’s Law</h1>
<p><a href="http://en.wikipedia.org/wiki/Benford's_law">Benford’s Law</a> is an
interesting observation made by physicist Frank Benford in the 1930s
about the distribution of the first digits of many naturally occurring
datasets. In particular, the distribution of digits in each position
follows an expected distribution.</p>
<p>I will leave the explanation of why Benford’s law exists to better
<a href="http://www.dspguide.com/ch34.htm">sources</a>, but this observation has
become a staple of forensic accounting analysis. The reason for this
interest is that financial data often fits the law and humans, when
making up numbers, have a terrible time picking numbers that fit the
law. In particular, we’ll often intuit that digits should be more
uniformly distributed than they naturally occur.</p>
<p>Even so, violation of Benford’s law alone is insufficient to cry foul.
It’s only an indicator to be used with other evidence to point toward
fraud. It can be a misleading indicator because not all datasets abide
by the law, and it can be very sensitive to the number of observations.</p>
<h2 id="ranking-by-goodness-of-fit">Ranking by Goodness of Fit</h2>
<p>It’s of interest to us to figure out whether this payment data fits
the law and, if so, whether there are any payers who do not fit the law in a
strange way. That leaves us with the technical challenge of determining how close two distributions are so that we can rank goodness of fit. There are a few approaches to this problem:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Chi-squared_test">$\chi^2$ Statistical Test</a></li>
<li><a href="http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler Divergence</a></li>
</ul>
<p>They are two approaches, one statistical and one information-theoretic,
that accomplish similar goals: telling how close two distributions are
to one another.</p>
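<p>To make the first option concrete, here is a minimal sketch of the Pearson $\chi^2$ statistic computed against Benford’s first-digit distribution; the observed digit counts below are hypothetical, purely for illustration:</p>

```python
import math

# Benford's expected first-digit probabilities: P(d) = log10(1 + 1/d)
benford = [math.log10(1 + 1.0 / d) for d in range(1, 10)]

def chi_square_stat(observed_counts, expected_probs):
    # Pearson's chi-square statistic: sum((O - E)^2 / E), where the
    # expected counts E scale the expected probabilities up to the
    # total number of observations.
    total = sum(observed_counts)
    return sum((obs - p * total) ** 2 / (p * total)
               for obs, p in zip(observed_counts, expected_probs))

# Hypothetical first-digit counts for 1,000 payments from one payer;
# a small statistic indicates a good fit to Benford's law.
observed = [301, 176, 125, 97, 79, 67, 58, 51, 46]
stat = chi_square_stat(observed, benford)
```

<p>In the analysis below, <code>scipy.stats.chisquare</code> performs the same computation and also returns a p-value for the test.</p>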
<p>I chose to rank based on KL divergence, but I compute the $\chi^2$ test as well.
I’ll <a href="http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">quote</a> briefly about Kullback-Leibler divergence to give a sense of what the ranking means:</p>
<blockquote>
<p>The Kullback–Leibler divergence of Q from P, denoted $D_{KL}(P || Q)$
, is a measure of the information lost
when Q is used to
approximate P. The KL divergence measures the expected number of
extra bits required to code samples from P when using a code based on
Q, rather than using a code based on P. Typically P represents the
“true” distribution of data, observations, or a precisely calculated
theoretical distribution. The measure Q typically represents a theory,
model, description, or approximation of P.</p>
</blockquote>
<p>In our case, P is the expected distribution based on Benford’s law and Q
is the observed distribution. We’d like to see just how much more
“complex” in some sense Q is versus P.</p>
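<p>A minimal sketch of that ranking metric, assuming only the standard library: here P is Benford’s first-digit distribution and Q is a hypothetical observed distribution (the uniform digit distribution a human making up numbers might intuit):</p>

```python
import math

# P: expected first-digit distribution under Benford's law
P = [math.log10(1 + 1.0 / d) for d in range(1, 10)]

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_d p_d * log2(p_d / q_d), in bits: the expected
    # extra bits needed to encode samples from p with a code built for q.
    return sum(pd * math.log(pd / qd, 2) for pd, qd in zip(p, q))

# A payer whose digits exactly matched Benford would rank at 0;
# a uniform digit distribution diverges noticeably.
Q_uniform = [1.0 / 9] * 9
```

<p>For these two distributions the divergence works out to roughly 0.3 bits, which is the sort of gap between “fits the law” and “suspiciously uniform” that the ranking surfaces.</p>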
<h2 id="the-benford-digit-distribution">The Benford Digit Distribution</h2>
<p>For the first digit, Benford’s law states that the probability of digit
$d$ occurring is <script type="math/tex">P(d) = \log_{10}(1 + \frac{1}{d})</script>.</p>
<p>There is a generalization beyond the first digit which can be used as
well. The probability of digit $d$ occurring in the second digit is <script type="math/tex">P(d) = \sum_{k=1}^{9} \log_{10}(1 + \frac{1}{10k + d})</script>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#Compute benford's distribution for first and second digit respectively</span><br data-jekyll-commonmark-ghpages="" /><span class="n">benford_1</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="n">math</span><span class="o">.</span><span class="n">log10</span><span class="p">(</span><span class="mi">1</span><span class="o">+</span><span class="mf">1.0</span><span class="o">/</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">10</span><span class="p">)])</span><br data-jekyll-commonmark-ghpages="" /><span class="n">benford_2</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span> <span class="nb">sum</span><span class="p">(</span> <span class="p">[</span> <span class="n">math</span><span class="o">.</span><span class="n">log10</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="mf">1.0</span><span class="o">/</span><span class="p">(</span><span class="n">j</span><span class="o">*</span><span class="mi">10</span> <span class="o">+</span> <span class="n">i</span><span class="p">))</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> 
<span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span></code></pre></figure>
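<p>As a quick sanity check on the first-digit distribution above: the probabilities telescope to exactly 1, since $\sum_d \log_{10}(\frac{d+1}{d}) = \log_{10}(10)$, and digit 1 leads about 30.1% of the time while digit 9 leads only about 4.6%:</p>

```python
import math

# Benford first-digit probabilities (without the leading 0 padding used above)
first_digit = [math.log10(1 + 1.0 / d) for d in range(1, 10)]
# first_digit[0] is about 0.301 (digit 1); first_digit[8] is about 0.046 (digit 9)
```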
<h2 id="implementation-of-benfords-law-ranking">Implementation of Benford’s Law Ranking</h2>
<p>The way we’ll approach this problem is to determine the first and second
digit distributions of payments by payer/reason and rank by goodness of fit for the first digit. Then we’ll look at the top fitting and bottom fitting payers for a few different reasons and see if we can see any patterns. One caveat is that we’ll be throwing out data under the following scenarios:</p>
<ul>
<li>The payer/reason pair does not have a minimum number of payments (350 by default)</li>
<li>The amount is less than $10 (therefore not having a second digit)</li>
</ul>
<p>Obviously the second one may skew things, as it throws out some data
within a partition but not the whole partition. In a real analysis, I’d
find a way to include those data points.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#Return a numpy array of zeros of specified $length except at</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#position $index, which has value $value</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">array_with_value</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="n">value</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">length</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">arr</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="n">value</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">arr</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Perform chi-square test between an expected probability</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#distribution and a list of empirical frequencies.</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#Returns the chi-square statistic and the p-value for the test.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">goodness_of_fit</span><span class="p">(</span><span class="n">emp_counts</span><span class="p">,</span> <span class="n">expected_probabilities</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#convert from probabilities to counts</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">exp_distr</span> <span class="o">=</span> <span 
class="n">expected_probabilities</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">emp_counts</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">stats</span><span class="o">.</span><span class="n">chisquare</span><span class="p">(</span><span class="n">emp_counts</span><span class="p">,</span> <span class="n">f_exp</span><span class="o">=</span><span class="n">exp_distr</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /><span class="c">#For each (reason, payer) pair compute the first and second digit distribution</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#for all payments. Return a RDD with a ranked list based on likely goodness </span><br data-jekyll-commonmark-ghpages="" /><span class="c">#of fit to the distribution of first digits predicted by Benford's "Law".</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">benfords_law</span><span class="p">(</span><span class="n">min_payments</span><span class="o">=</span><span class="mi">350</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="s">"""<br data-jekyll-commonmark-ghpages="" /> Benford's "law" is a rough observation that the distribution of numbers <br data-jekyll-commonmark-ghpages="" /> for each digit position of certain data fits a specific distribution. 
<br data-jekyll-commonmark-ghpages="" /> It holds for quite a bit real-world data and, thus, has become of <br data-jekyll-commonmark-ghpages="" /> interest to forensic accountants.<br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> This function computes the distribution of first and second digits for <br data-jekyll-commonmark-ghpages="" /> each (reason, payer) pair and ranks them by goodness of fit to <br data-jekyll-commonmark-ghpages="" /> Benford's Law based on the first digit distribution.<br data-jekyll-commonmark-ghpages="" /> In particular, the goodness of fit metric that it is ranked by is <br data-jekyll-commonmark-ghpages="" /> kullback-liebler divergence, but chi-squared goodness of fit test <br data-jekyll-commonmark-ghpages="" /> is also computed and the results are cached.<br data-jekyll-commonmark-ghpages="" /> """</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#We use this one quite a bit in reducers, so it's nice to have it handy here</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sum_values</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">:</span><span class="n">x</span><span class="o">+</span><span class="n">y</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /> <span class="c">#Project out the reason, payer, amount, and amount_str, throwing </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#away amounts < 10 since they don't have 2nd digits. 
This probably </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#skews the results, so in real-life, I'd not throw out entries so</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#cavalierly, but for the purpose of simplicity, I've done it here.</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#Also, we're pulling out the first and second digits here</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_amount_info</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"""select reason<br data-jekyll-commonmark-ghpages="" /> , payer<br data-jekyll-commonmark-ghpages="" /> , amount<br data-jekyll-commonmark-ghpages="" /> , amount_str <br data-jekyll-commonmark-ghpages="" /> from payments<br data-jekyll-commonmark-ghpages="" /> """</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">></span> <span class="mi">3</span> <br data-jekyll-commonmark-ghpages="" /> <span class="ow">and</span> <span class="n">t</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">></span> <span class="mf">9.99</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span 
class="p">],</span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span><span class="nb">dict</span><span class="p">(</span> <span class="n">payer</span><span class="o">=</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">reason</span><span class="o">=</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">first_digit</span><span class="o">=</span><span class="n">t</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">second_digit</span><span class="o">=</span><span class="n">t</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_amount_info</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#filter out the reason / payer combos that have fewer payments than </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#the minimum number of payments </span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_count</span> <span class="o">=</span> <span class="n">reason_payer_amount_info</span><span 
class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="mi">1</span><span class="p">))</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="n">sum_values</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">></span> <span class="n">min_payments</span><br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#inner join with the reason/payer's that fit the count requirement </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#and annotate value with the num payments</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_digits</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_amount_info</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">reason_payer_count</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span 
class="n">annotate_dict</span><span class="p">(</span><span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="s">'num_payments'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#compute the first digit distribution.</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#First we count each of the 9 possible first digits, then we translate </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#that count into a vector of dimension 10 with count for digit i </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#in position i. 
We then sum those vectors, thereby getting the </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#full frequency per digit.</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_digit_distribution</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_digits</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'first_digit'</span><span class="p">]</span> <span class="p">)</span> <span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="n">sum_values</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">array_with_value</span><span class="p">(</span><span class="mi">10</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">t</span><span 
class="p">[</span><span class="mi">1</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="n">sum_values</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#same thing with the 2nd digit</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_digit_distribution</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_digits</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'second_digit'</span><span class="p">])</span> <span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="n">sum_values</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">array_with_value</span><span class="p">(</span><span 
class="mi">10</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">:</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /> <span class="c">#We join the two, compute the goodness of fit based on chi-square test</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#and the distance from benford's distribution based on kl divergence.</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#Finally we sort by kl-divergence ascending (good fits come first).</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">first_digit_distribution</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">second_digit_distribution</span><span class="p">)</span> \<br 
data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span> <span class="p">:</span> <span class="p">(</span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">dict</span><span class="p">(</span> <span class="n">payer</span><span class="o">=</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">reason</span><span class="o">=</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">first_digit_distr</span><span class="o">=</span><span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">second_digit_distr</span><span class="o">=</span><span class="n">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">first_digit_fit</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">goodness_of_fit</span><span class="p">(</span><span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)[</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">benford_1</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span 
class="mi">10</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">second_digit_fit</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">goodness_of_fit</span><span class="p">(</span><span class="n">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">),</span> <span class="n">benford_2</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">kl_divergence</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">stats</span><span class="o">.</span><span class="n">entropy</span><span class="p">(</span> <span class="n">benford_1</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)[</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span> <span class="p">:</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'kl_divergence'</span><span class="p">],</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="p">)</span>\<br 
data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">sortByKey</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span> <span class="p">:</span> <span class="p">(</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'payer'</span><span class="p">],</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'reason'</span><span class="p">]),</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="n">benford_data</span> <span class="o">=</span> <span class="n">benfords_law</span><span class="p">(</span><span class="mi">400</span><span class="p">)</span> </code></pre></figure>
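<p>To make the goodness-of-fit ranking above concrete, here is a minimal plain-Python sketch of the core calculation. The helper names and the toy digit counts are illustrative assumptions, not values from the analysis; the pipeline itself uses <code>scipy.stats.entropy</code> for the KL divergence, which computes the same quantity.</p>

```python
import math

# Benford's Law: expected probability that the leading digit is d (for d in 1..9)
def benford_first_digit(d):
    return math.log10(1.0 + 1.0 / d)

# KL divergence sum(p * log(p/q)) between two discrete distributions.
# This is what scipy.stats.entropy(p, q) computes (natural log).
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0 and qi > 0)

# Hypothetical first-digit counts for one (reason, payer) pair
observed_counts = [310, 170, 120, 95, 80, 65, 58, 55, 47]
total = float(sum(observed_counts))
observed_distr = [c / total for c in observed_counts]

# Benford's expected first-digit distribution over digits 1..9
benford = [benford_first_digit(d) for d in range(1, 10)]

# Small divergence means the payments look Benford-like; the Spark job
# sorts (reason, payer) pairs ascending on exactly this value.
fit = kl_divergence(benford, observed_distr)
```

<p>Sorting ascending on this divergence, as <code>sortByKey(True)</code> does above, surfaces the best-fitting (reason, payer) pairs first; an unusually large divergence is the kind of anomaly a forensic accountant would follow up on.</p>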
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#Plot the distribution of first and second digit side-by-side for a set of payers.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">plot_figure</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="n">entries</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">num_rows</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">entries</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">nrows</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">entries</span><span class="p">),</span> <span class="n">ncols</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">12</span><span class="p">))</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">fig</span><span class="o">.</span><span class="n">tight_layout</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">plt</span><span class="o">.</span><span class="n">subplots_adjust</span><span class="p">(</span><span class="n">top</span><span class="o">=</span><span class="mf">0.91</span><span class="p">,</span> <span class="n">hspace</span><span class="o">=</span><span class="mf">0.55</span><span class="p">,</span> <span class="n">wspace</span><span class="o">=</span><span class="mf">0.3</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span 
class="n">fig</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="n">title</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="n">bar_width</span> <span class="o">=</span> <span class="o">.</span><span class="mi">4</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">entry</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">entries</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_digit_distr</span> <span class="o">=</span> <span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'first_digit_distr'</span><span class="p">][</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sample_label_1</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="s">"""$</span><span class="err">\</span><span class="s">chi$={}, $p$={}, kl={}, n={}"""</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">format</span><span class="p">(</span> <span class="n">locale</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">'</span><span class="si">%0.2</span><span class="s">f'</span>\<br data-jekyll-commonmark-ghpages="" /> <span 
class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'first_digit_fit'</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">locale</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">'</span><span class="si">%0.2</span><span class="s">f'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'first_digit_fit'</span><span class="p">][</span><span class="mi">1</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">locale</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">'</span><span class="si">%0.2</span><span class="s">f'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'kl_divergence'</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">first_digit_distr</span><span class="p">))</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" 
/><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_digit_distr</span> <span class="o">=</span> <span class="n">first_digit_distr</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">first_digit_distr</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span> <span class="p">,</span><span class="n">first_digit_distr</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.35</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="n">bar_width</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Sample"</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span><span class="o">+</span><span class="n">bar_width</span> <span class="p">,</span><span class="n">benford_1</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span 
class="n">alpha</span><span class="o">=</span><span class="mf">0.35</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="n">bar_width</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Benford'</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">grid</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Probability'</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"{} First Digit</span><span class="se">\n</span><span class="s">{}"</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span 
class="p">,</span> <span class="s">'ignore'</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">sample_label_1</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_digit_distr</span> <span class="o">=</span> <span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'second_digit_distr'</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sample_label_2</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="s">'$</span><span class="err">\</span><span class="s">chi$={}, $p$={}, n={}'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">locale</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">'</span><span class="si">%0.2</span><span class="s">f'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'second_digit_fit'</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">locale</span><span class="o">.</span><span class="n">format</span><span 
class="p">(</span><span class="s">'</span><span class="si">%0.2</span><span class="s">f'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'second_digit_fit'</span><span class="p">][</span><span class="mi">1</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">second_digit_distr</span><span class="p">))</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_digit_distr</span> <span class="o">=</span> <span class="n">second_digit_distr</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">second_digit_distr</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span> <span class="p">,</span><span class="n">second_digit_distr</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.35</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span 
class="n">bar_width</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Sample"</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span> <span class="o">+</span> <span class="n">bar_width</span><span class="p">,</span><span class="n">benford_2</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.35</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="n">bar_width</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Benford'</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">grid</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Probability'</span><span class="p">)</span><br 
data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"{} Second Digit</span><span class="se">\n</span><span class="s">{}"</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span> <span class="s">'ignore'</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">sample_label_2</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /><span class="c">#Take n-worst or best (depending on t) entries for reason based on goodness</span><br data-jekyll-commonmark-ghpages="" /><span class="c"># of fit for benford's law and plot the first/second digit distributions </span><br data-jekyll-commonmark-ghpages="" /><span class="c">#versus benford's distribution side-by-side as well as the distribution </span><br data-jekyll-commonmark-ghpages="" /><span class="c">#of kl-divergences.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">benford_summary</span><span class="p">(</span><span class="n">reason</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">benford_data</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="s">'best'</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" 
/> <span class="n">raw_data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="n">reason</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">s</span> <span class="o">=</span> <span class="p">[]</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">if</span> <span class="n">t</span> <span class="o">==</span> <span class="s">'best'</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">s</span><span class="o">=</span><span class="n">raw_data</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="n">n</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">plot_figure</span><span class="p">(</span><span class="s">"Top 5 Best Fitting Benford Analysis for {}"</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">reason</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">s</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">else</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">s</span><span class="o">=</span><span class="n">raw_data</span><span class="p">[</span><span class="o">-</span><span class="n">n</span><span class="p">:][::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> 
<span class="n">plot_figure</span><span class="p">(</span><span class="s">"Top 5 Worst Fitting Benford Analysis for {}"</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">reason</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">s</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">plot_outliers</span><span class="p">(</span> <span class="p">[</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'kl_divergence'</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">raw_data</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="p">[</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'kl_divergence'</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">s</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">reason</span> <span class="o">+</span> <span class="s">" KL Divergence"</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span></code></pre></figure>
<h2 id="best-and-worst-fitting-payers-for-gifts">Best and Worst Fitting Payers for Gifts</h2>
<p>Pretty much across the board the p-values for the $\chi^2$ test were quite weak, but the first 4 are a pretty good fit. The last one is interesting: the spike at the digit 6 is exactly the kind of thing that is of interest to forensic accountants. I repeat, however, that this is <strong>not</strong> an indicator which can safely be used alone to level a charge of fraud.</p>
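<p>As a point of reference, the goodness-of-fit machinery here is compact enough to sketch directly. In the snippet below the Benford probabilities $P(d)=\log_{10}(1+1/d)$ come from the analysis itself, but the observed digit counts are invented purely for illustration:</p>

```python
import numpy as np
from scipy.stats import chisquare

# Benford's expected first-digit probabilities: P(d) = log10(1 + 1/d), d = 1..9
benford = np.log10(1.0 + 1.0 / np.arange(1, 10))

# Hypothetical first-digit counts for a single payer (illustrative only)
observed = np.array([110, 62, 41, 35, 28, 24, 20, 18, 15])

# Chi-squared goodness-of-fit against the Benford-expected counts.
# A small p-value rejects the hypothesis that the digits follow Benford's law.
stat, p_value = chisquare(observed, f_exp=observed.sum() * benford)
```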
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Gift Benford Analysis (Best)</span><br data-jekyll-commonmark-ghpages="" /><span class="n">benford_summary</span><span class="p">(</span><span class="s">'Gift'</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="s">'best'</span><span class="p">)</span></code></pre></figure>
<p><img src="files/ref_data/open_payments_files/open_payments_21_0.png" style="width:650px" /></p>
<p>Below is the density plot for the Kullback-Leibler divergences for the top best fits. You can see there’s a clump at 0 and a clump a bit farther out, but no real outliers.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_21_1.png" alt="png" /></p>
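<p>The KL divergences behind these density plots are cheap to compute. Here is a minimal sketch; the smoothing constant <code>eps</code> is my own addition to avoid $\log(0)$, not something taken from the pipeline above:</p>

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how far distribution p diverges from reference q (in nats)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

benford = np.log10(1.0 + 1.0 / np.arange(1, 10))
uniform = np.full(9, 1.0 / 9)

kl_same = kl_divergence(benford, benford)  # ~0: identical distributions
kl_far = kl_divergence(uniform, benford)   # positive: uniform digits fit Benford poorly
```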
<p>Now we look at the worst fitting gift payers. The best and worst lists overlap, as you
can see, because there just aren’t that many organizations that pay out more
than 350 gifts over the course of the year.</p>
<p>Two things that are interesting:</p>
<ul>
<li>Mentor Worldwide does not fit the decreasing probability distribution that we would expect. Many payers diverge from Benford’s law, but it’s interesting when they break the basic form of decreasing probabilities as digits progress from 1 through 9.</li>
<li>Benco Dental Supply has a huge number of payments starting with 1. This is likely an indication that they have a standard gift that they give out.</li>
</ul>
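<p>The “basic form of decreasing probabilities” mentioned above is easy to verify numerically: under Benford’s law the first-digit probabilities $P(d)=\log_{10}(1+1/d)$ fall monotonically, from about 30.1% for a leading 1 down to about 4.6% for a leading 9.</p>

```python
import numpy as np

# First-digit probabilities under Benford's law: P(d) = log10(1 + 1/d)
p = np.log10(1.0 + 1.0 / np.arange(1, 10))

print(np.round(p, 3))  # [0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046]
assert np.all(np.diff(p) < 0)  # strictly decreasing from digit 1 through 9
```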
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">benford_summary</span><span class="p">(</span><span class="s">'Gift'</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="s">'worst'</span><span class="p">)</span></code></pre></figure>
<p><img src="files/ref_data/open_payments_files/open_payments_22_0.png" style="width:650px" /></p>
<p><img src="files/ref_data/open_payments_files/open_payments_22_1.png" alt="png" /></p>
<h2 id="best-and-worst-fitting-payers-for-travel-and-lodging">Best and Worst Fitting Payers for Travel and Lodging</h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Travel and Lodging Benford Analysis</span><br data-jekyll-commonmark-ghpages="" /><span class="n">benford_summary</span><span class="p">(</span><span class="s">"Travel and Lodging"</span><span class="p">)</span></code></pre></figure>
<p>The top fitting payers for travel and lodging fit fairly well, so it’s
certainly possible to fit the distribution well.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_23_0.png" style="width:650px" /></p>
<p><img src="files/ref_data/open_payments_files/open_payments_23_1.png" alt="png" /></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">benford_summary</span><span class="p">(</span><span class="s">"Travel and Lodging"</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="s">'worst'</span><span class="p">)</span></code></pre></figure>
<p>You can see a few payers diverging from the general form of Benford’s
distribution here. LDR Holding is the outlier in terms of goodness of
fit, as you can see from the density plot below as well.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_24_0.png" style="width:650px" /></p>
<p><img src="files/ref_data/open_payments_files/open_payments_24_1.png" alt="png" /></p>
<h2 id="best-and-worst-fitting-payers-for-consulting-fees">Best and Worst Fitting Payers for Consulting Fees</h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">benford_summary</span><span class="p">(</span><span class="s">"Consulting Fee"</span><span class="p">)</span></code></pre></figure>
<p>The best fitting payers here fit quite well.
<img src="files/ref_data/open_payments_files/open_payments_25_0.png" style="width:650px" /></p>
<p><img src="files/ref_data/open_payments_files/open_payments_25_1.png" alt="png" /></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">benford_summary</span><span class="p">(</span><span class="s">"Consulting Fee"</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="s">'worst'</span><span class="p">)</span></code></pre></figure>
<p>For both UCB and Merck, we see a huge proportion of payments starting with 7. This indicates a standardized payment of some sort, I’d wager. The interesting thing about Merck is that the 1’s distribution is pretty spot on, but the rest of the density gets pushed into 7.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_26_0.png" style="width:650px" /></p>
<p><img src="files/ref_data/open_payments_files/open_payments_26_1.png" alt="png" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>This concludes the basic example of doing analytics with the Spark
platform. General conclusions and impressions from this whole exercise can be found <a href="pyspark-openpayments-analysis.html">here</a>.</p>
Data Science and Hadoop: Part 4, Outlier Analysis2014-10-24T00:00:00+00:00id:/pyspark-openpayments-analysis-part-4<h2 id="context">Context</h2>
<p>This is the fourth part of a 5 part <a href="pyspark-openpayments-analysis.html">series</a> on analyzing data with PySpark:</p>
<ul>
<li><a href="pyspark-openpayments-analysis.html">Data Science and Hadoop : Impressions</a></li>
<li><a href="pyspark-openpayments-analysis-part-2.html">Data Overview and Preprocessing</a></li>
<li><a href="pyspark-openpayments-analysis-part-3.html">Basic Structural Analysis</a></li>
<li>Outlier Analysis</li>
<li><a href="pyspark-openpayments-analysis-part-5.html">Benford’s Law Analysis</a></li>
</ul>
<h1 id="outlier-analysis">Outlier Analysis</h1>
<p>Generating summary statistics is very helpful for</p>
<ul>
<li>Understanding the overall shape of the data</li>
<li>Looking at trends at the extremes (answering most, least, etc. style questions)</li>
</ul>
<link rel="stylesheet" href="files/css/theme.cstella.css" />
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
<script src="//cdn.jsdelivr.net/tablesorter/2.15.13/js/jquery.tablesorter.min.js"></script>
<script>
$(document).ready(function()
{
for (i = 1;i <= 11;++i) {
$("#resultTable"+i).tablesorter(
{ theme: "cstella" }
);
}
}
);
</script>
<p>One thing we’re doing when looking for fishy data is trying to zoom in
quickly on points which fall outside of the norm.
Furthermore, the aim is to tag these automatically as data is ingested
so they can be acted on. This action can be raising an alert, logging a
warning or generating a report, but ultimately we want a technique that
finds these outliers quickly and without human intervention.</p>
<h2 id="median-absolute-deviation">Median Absolute Deviation</h2>
<p>We’re looking to create a mechanism to rank data points by their likelihood of being an outlier along with a threshold to differentiate them from inliers.</p>
<p>The area of <a href="http://en.wikipedia.org/wiki/Anomaly_detection">outlier analysis</a> is a vibrant one and there are
quite a few techniques to choose from, ranging from the exotic, like
<a href="http://scikit-learn.org/stable/modules/outlier_detection.html#id1">fitting an elliptic
envelope</a>,
to the straightforward, like setting a threshold based on standard
deviations away from the mean. For our purposes, we’ll choose a middle
path, but be aware that there are
<a href="http://www.itl.nist.gov/div898/handbook/eda/section4/eda43.htm#Barnett">book-length</a> treatments of the subject of outlier analysis.</p>
<p><a href="http://en.wikipedia.org/wiki/Median_absolute_deviation">Median Absolute
Deviation</a> is a
robust statistic used, like the standard deviation, as a measure of
variability in a univariate dataset. Its definition is
straightforward:</p>
<blockquote>
<p>Given univariate data $X$ with $\tilde{x}=$median($X$),
MAD($X$)=median($\{\,|x_i - \tilde{x}| : x_i \in X\,\}$).</p>
</blockquote>
<p>As compared to the standard deviation, it’s a bit more resilient to outliers because it doesn’t square the deviations, which would weight large values very heavily. Quoting from the <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm">Engineering Statistics
Handbook</a>:</p>
<blockquote>
<p>The standard deviation is an example of an estimator that is the best
we can do if the underlying distribution is normal. However, it lacks
robustness of validity. That is, confidence intervals based on the
standard deviation tend to lack precision if the underlying
distribution is in fact not normal.</p>
<p>The median absolute deviation and the interquartile range are estimates
of scale that have robustness of validity. However, they are not
particularly strong for robustness of efficiency.</p>
<p>If histograms and probability plots indicate that your data are in fact
reasonably approximated by a normal distribution, then it makes sense to
use the standard deviation as the estimate of scale. However, if your
data are not normal, and in particular if there are long tails, then
using an alternative measure such as the median absolute deviation,
average absolute deviation, or interquartile range makes sense.</p>
</blockquote>
<p>In the implementation we’ll be taking guidance from the <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm">Engineering Statistics
Handbook</a>
and <a href="http://www.itl.nist.gov/div898/handbook/eda/section4/eda43.htm#Iglewicz">Iglewicz and Hoaglin</a>. As such, we define an outlier like so:</p>
<blockquote>
<p>For a set of univariate data $X$ with $\tilde{x} =$ median($X$), an outlier is an element $x_i \in X$
such that $M_i = \frac{0.6745(x_i - \tilde{x})}{MAD(X)} > 3.5$, where
$M_i$ is denoted the <em>modified Z-score</em>.</p>
</blockquote>
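<p>In plain numpy, on a toy univariate sample, that definition works out as below. The payment values are invented for illustration; only the 0.6745 constant and the 3.5 threshold come from Iglewicz and Hoaglin:</p>

```python
import numpy as np

def modified_z_scores(x):
    """M_i = 0.6745 * (x_i - median(X)) / MAD(X), per Iglewicz and Hoaglin."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        return np.zeros_like(x)  # degenerate sample; nothing can be flagged
    return 0.6745 * (x - med) / mad

payments = [10.0, 12.0, 11.0, 13.0, 9.0, 10.0, 500.0]
outliers = [p for p, m in zip(payments, modified_z_scores(payments)) if m > 3.5]
print(outliers)  # [500.0]: only the anomalously large payment is flagged
```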
<p>Before we jump into the actual algorithm, we create some helper
functions to make the code a bit more readable and allow us to display
the outliers.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#Some useful functions for more advanced analytics</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Joins in spark take RDD[K,V] x RDD[K,U] => RDD[K, [U,V] ]</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#This function returns U</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Joins in spark take RDD[K,V] x RDD[K,U] => RDD[K, [U,V] ]</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#This function returns V</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Add a key/value to a dictionary and return the dictionary</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">annotate_dict</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">d</span><span class="p">[</span><span class="n">k</span><span 
class="p">]</span> <span class="o">=</span> <span class="n">v</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">d</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Plots a density plot of a set of points representing inliers and outliers</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#A rugplot is used to indicate the points and the outliers are marked in red.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">plot_outliers</span><span class="p">(</span><span class="n">inliers</span><span class="p">,</span> <span class="n">outliers</span><span class="p">,</span> <span class="n">reason</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">nrows</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sns</span><span class="o">.</span><span class="n">distplot</span><span class="p">(</span><span class="n">inliers</span> <span class="o">+</span> <span class="n">outliers</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">rug</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">hist</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">ax</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">outliers</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span 
class="n">outliers</span><span class="p">),</span> <span class="s">'ro'</span><span class="p">,</span> <span class="n">clip_on</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">fig</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">'Distribution for {} Values'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">reason</span><span class="p">),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span></code></pre></figure>
<p>Now, onto the implementation of the outlier analysis. Before we
start, I’d like to make a couple of notes about the implementation and
possible scalability challenges going forward.</p>
<p>We are partitioning the data by (payment reason, physician specialty). I do not want to analyze outliers based on a cohort of data across a whole reason, but rather I want to know if a point is an outlier for a given specialty <em>and</em> reason.</p>
<p>If a coarser partitioning strategy is taken, or the amount of data per
partition becomes very large, the median implementation may become the
limiting factor for scalability. There are a few options here,
including a tighter median implementation (numpy’s could
be tighter as of this writing) or a streaming estimate. Needless to say, this is something that bears some thought going forward.</p>
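<p>For a sense of what a cheaper median might look like, the classic two-heap running median computes an exact median in $O(\log n)$ time per element without sorting the whole partition at once; it still holds all points in memory, so a true constant-memory streaming estimate (e.g. the P² algorithm) would be the next step. This helper is my own sketch, not part of the pipeline below:</p>

```python
import heapq

class RunningMedian:
    """Exact running median via two heaps: lo holds the smaller half
    (as a negated max-heap), hi holds the larger half (a min-heap)."""
    def __init__(self):
        self.lo = []  # negated values: -lo[0] is the max of the lower half
        self.hi = []  # hi[0] is the min of the upper half

    def add(self, x):
        # Route x through hi so the smallest of hi + {x} lands in lo...
        heapq.heappush(self.lo, -heapq.heappushpop(self.hi, x))
        # ...then rebalance so hi is never smaller than lo.
        if len(self.lo) > len(self.hi):
            heapq.heappush(self.hi, -heapq.heappop(self.lo))

    def median(self):
        if len(self.hi) > len(self.lo):
            return float(self.hi[0])
        return (self.hi[0] - self.lo[0]) / 2.0

rm = RunningMedian()
for amount in [5.0, 1.0, 3.0, 2.0, 4.0]:
    rm.add(amount)
print(rm.median())  # 3.0
```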
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#Outlier analysis using Median Absolute Deviation</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Using reservoir sampling, uniformly sample N points</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#requires O(N) memory</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">sample_points</span><span class="p">(</span><span class="n">points</span><span class="p">,</span> <span class="n">N</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sample</span> <span class="o">=</span> <span class="p">[];</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">point</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">points</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">if</span> <span class="n">i</span> <span class="o"><</span> <span class="n">N</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sample</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">point</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">elif</span> <span class="n">i</span> <span class="o">>=</span> <span class="n">N</span> <span class="ow">and</span> <span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">()</span> <span class="o"><</span> <span class="n">N</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">replace</span> <span class="o">=</span> <span 
class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sample</span><span class="p">[</span><span class="n">replace</span><span class="p">]</span> <span class="o">=</span> <span class="n">point</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">sample</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Returns a function which will extract the median at location 'key'</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#a list of dictionaries.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">median_func</span><span class="p">(</span><span class="n">key</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#Right now it uses numpy's median, but probably a quickselect</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#implementation is called for as I expect this doesn't scale</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="k">lambda</span> <span class="n">partition_value</span> <span class="p">:</span> <span class="p">(</span> <span class="n">partition_value</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">median</span><span class="p">(</span> <br data-jekyll-commonmark-ghpages="" /> <span class="p">[</span> <span class="n">d</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span 
class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">partition_value</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Compute the modified z-score for use by as per Iglewicz and Hoaglin:</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#Boris Iglewicz and David Hoaglin (1993),</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#"Volume 16: How to Detect and Handle Outliers",</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#The ASQC Basic References in Quality Control: Statistical Techniques</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#, Edward F. Mykytka, Ph.D., Editor.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">get_z_score</span><span class="p">(</span><span class="n">reason_to_diff</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">med</span> <span class="o">=</span> <span class="n">join_rhs</span><span class="p">(</span><span class="n">reason_to_diff</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">if</span> <span class="n">med</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="mf">0.6745</span> <span class="o">*</span> <span class="n">join_lhs</span><span class="p">(</span><span class="n">reason_to_diff</span><span class="p">)[</span><span class="s">'diff'</span><span class="p">]</span> <span class="o">/</span> <span class="n">med</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">else</span><span 
class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="mi">0</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">is_outlier</span><span class="p">(</span><span class="n">thresh</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="k">lambda</span> <span class="n">reason_to_diff</span> <span class="p">:</span> <span class="n">get_z_score</span><span class="p">(</span><span class="n">reason_to_diff</span><span class="p">)</span> <span class="o">></span> <span class="n">thresh</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Return an RDD of a uniform random sample of a specified size per key</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">get_inliers</span><span class="p">(</span><span class="n">reason_amount_pairs</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">2000</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">group_by_reason</span> <span class="o">=</span> <span class="n">reason_amount_pairs</span><span class="o">.</span><span class="n">groupByKey</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">group_by_reason</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span> <span class="p">:</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sample_points</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">size</span><span class="p">)))</span><br
data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Return the outliers based on Median Absolute Deviation</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#See http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm </span><br data-jekyll-commonmark-ghpages="" /><span class="c">#for more info.</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#The input key structure is reason_specialty => dict(amount</span><br data-jekyll-commonmark-ghpages="" /><span class="c"># , physician</span><br data-jekyll-commonmark-ghpages="" /><span class="c"># , payer</span><br data-jekyll-commonmark-ghpages="" /><span class="c"># , specialty</span><br data-jekyll-commonmark-ghpages="" /><span class="c"># )</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">get_outliers</span><span class="p">(</span><span class="n">reason_amount_pairs</span><span class="p">,</span> <span class="n">thresh</span><span class="o">=</span><span class="mf">3.5</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="s">"""<br data-jekyll-commonmark-ghpages="" /> This uses the median absolute deviation (MAD) statistic to find<br data-jekyll-commonmark-ghpages="" /> outliers for each reason x specialty partitions.<br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> Outliers are computed as follows: <br data-jekyll-commonmark-ghpages="" /> * Let X be all the payments for a given specialty, reason pair<br data-jekyll-commonmark-ghpages="" /> * Let x_i be a payment in X<br data-jekyll-commonmark-ghpages="" /> * Let MAD be the median absolute deviation, defined as<br data-jekyll-commonmark-ghpages="" /> MAD = median( for all x in X, | x - median(X)| )<br data-jekyll-commonmark-ghpages="" /> * Let M_i be the modified z-score for payment x_i, defined as<br data-jekyll-commonmark-ghpages="" /> 0.6745*(x_i − 
median(X) )/MAD<br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> As per the recommendations by Iglewicz and Hoaglin, a payment is<br data-jekyll-commonmark-ghpages="" /> considered an outlier if the modified z-score, M_i > thresh, which<br data-jekyll-commonmark-ghpages="" /> is 3.5 by default.<br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> REFERENCE:<br data-jekyll-commonmark-ghpages="" /> Boris Iglewicz and David Hoaglin (1993),<br data-jekyll-commonmark-ghpages="" /> "Volume 16: How to Detect and Handle Outliers",<br data-jekyll-commonmark-ghpages="" /> The ASQC Basic References in Quality Control: Statistical Techniques,<br data-jekyll-commonmark-ghpages="" /> Edward F. Mykytka, Ph.D., Editor.<br data-jekyll-commonmark-ghpages="" /> """</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="n">group_by_reason</span> <span class="o">=</span> <span class="n">reason_amount_pairs</span><span class="o">.</span><span class="n">groupByKey</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#Filter by only reason/specialty's with more than 1k entries</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#and compute the median of the amounts across the partition.</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#NOTE: There may be some scalability challenges around median,</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#so some care should be taken to reimplement this if partitioning</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#by (reason, specialty) does not yield small enough numbers to </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#handle in an individual map function.</span><br data-jekyll-commonmark-ghpages="" /> <span 
class="n">reason_to_median</span> <span class="o">=</span> <span class="n">group_by_reason</span><span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">></span> <span class="mi">1000</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">median_func</span><span class="p">(</span><span class="s">'amount'</span><span class="p">))</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#Join the base, non-grouped data, with the median per key,</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#consider just the payments more than the median</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#since we're looking for large money outliers and annotate </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#the dictionary for each entry x_i with the following:</span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># * diff = |x_i - median(X)| in the parlance of the comment above.</span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># NOTE: Strictly speaking I can drop the absolute value since </span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># x_i > median(X), but I choose not to.</span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># * median = median(X)</span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># </span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_abs_dist_from_median</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_amount_pairs</span><span class="o">.</span><span class="n">join</span><span 
class="p">(</span><span class="n">reason_to_median</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span> <span class="p">:</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)[</span><span class="s">'amount'</span><span class="p">]</span> <span class="o">></span> <span class="n">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span> <span class="p">:</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">(</span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">dict</span><span class="p">(</span> <span class="n">diff</span><span class="o">=</span><span class="nb">abs</span><span class="p">(</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)[</span><span class="s">'amount'</span><span class="p">]</span> <span class="o">-</span> <span class="n">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">row</span><span class="o">=</span><span class="n">annotate_dict</span><span class="p">(</span><span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span 
class="p">,</span> <span class="s">'median'</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c"># Given diff cached per element, we need only compute the median </span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># of the diffs to compute the MAD. </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#Remember, MAD = median( for all x in X, | x - median(X)| )</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_to_MAD</span> <span class="o">=</span> <span class="n">reason_abs_dist_from_median</span><span class="o">.</span><span class="n">groupByKey</span><span class="p">()</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">median_func</span><span class="p">(</span><span class="s">'diff'</span><span class="p">))</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_to_MAD</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c"># Joining the grouped data to get both | x_i - median(X) | </span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># and MAD in the same place, we can compute the modified z-score</span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># , 0.6745*| x_i - median(X)| / MAD, and filter by the ones which </span><br
data-jekyll-commonmark-ghpages="" /> <span class="c"># are more than threshold we can then do some pivoting of keys and </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#sort by that threshold to give us the ranked list of outliers.</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">reason_abs_dist_from_median</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">reason_to_MAD</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="n">is_outlier</span><span class="p">(</span><span class="n">thresh</span><span class="p">))</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span><span class="n">get_z_score</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">annotate_dict</span><span class="p">(</span><span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)[</span><span class="s">'row'</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="s">'key'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">sortByKey</span><span class="p">(</span><span 
class="bp">False</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'key'</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">annotate_dict</span><span class="p">(</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="s">'mad'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Filter the outliers by reason and return a RDD with just the outliers </span><br data-jekyll-commonmark-ghpages="" /><span class="c">#of a specified reason.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">get_by_reason</span><span class="p">(</span><span class="n">outliers</span><span class="p">,</span> <span class="n">reason</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">outliers</span><span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="nb">str</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="n">t</span><span 
class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="s">'ignore'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span><span class="n">reason</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Grab data using Spark-SQL and filter with spark core RDD operations </span><br data-jekyll-commonmark-ghpages="" /><span class="c">#to only yield the data we want, ones with physicians, payers and reasons</span><br data-jekyll-commonmark-ghpages="" /><span class="n">reason_amount_pairs</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"""select reason<br data-jekyll-commonmark-ghpages="" /> , physician_specialty<br data-jekyll-commonmark-ghpages="" /> , amount<br data-jekyll-commonmark-ghpages="" /> , physician_id<br data-jekyll-commonmark-ghpages="" /> , payer <br data-jekyll-commonmark-ghpages="" /> from payments"""</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">reason</span><span class="p">)</span> <span class="o">></span> <span class="mi">3</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="ow">and</span> <span 
class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">physician_id</span><span class="p">)</span> <span class="o">></span> <span class="mi">3</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">payer</span><span class="p">)</span> <span class="o">></span> <span class="mi">3</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="p">(</span> <span class="s">"{}_{}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span> <span class="n">row</span><span class="o">.</span><span class="n">reason</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">row</span><span class="o">.</span><span class="n">physician_specialty</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">dict</span><span class="p">(</span><span class="n">amount</span><span class="o">=</span><span class="n">row</span><span class="o">.</span><span class="n">amount</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span><span class="n">physician_id</span><span class="o">=</span><span class="n">row</span><span class="o">.</span><span class="n">physician_id</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span><span class="n">payer</span><span class="o">=</span><span class="n">row</span><span class="o">.</span><span class="n">payer</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span><span class="n">specialty</span><span class="o">=</span><span class="n">row</span><span 
class="o">.</span><span class="n">physician_specialty</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Get the outliers based on a modified z-score threshold of 3.5</span><br data-jekyll-commonmark-ghpages="" /><span class="n">outliers</span> <span class="o">=</span> <span class="n">get_outliers</span><span class="p">(</span><span class="n">reason_amount_pairs</span><span class="p">,</span> <span class="mf">3.5</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#Get a sample per specialty/reason partition</span><br data-jekyll-commonmark-ghpages="" /><span class="n">inliers</span> <span class="o">=</span> <span class="n">get_inliers</span><span class="p">(</span><span class="n">reason_amount_pairs</span><span class="p">)</span></code></pre></figure>
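<p>To make the statistic above concrete, here is a small, self-contained distillation of the MAD-based outlier test outside of Spark. This is a sketch of the technique rather than the post’s actual code; the function name and the zero-MAD guard are my own.</p>

```python
import numpy as np

def mad_outliers(amounts, thresh=3.5):
    """Return the large-payment outliers in amounts, i.e. the values
    whose modified z-score M_i = 0.6745*(x_i - median(X))/MAD exceeds
    thresh (3.5 by default, per Iglewicz and Hoaglin)."""
    x = np.asarray(amounts, dtype=float)
    med = np.median(x)
    # MAD = median( for all x in X, |x - median(X)| )
    mad = np.median(np.abs(x - med))
    if mad == 0.0:
        return x[:0]  # degenerate partition: no spread, no outliers
    m = 0.6745 * (x - med) / mad
    # one-sided, as in the post: we only care about large payments
    return x[m > thresh]
```

<p>For example, <code>mad_outliers([10, 12, 11, 13, 9, 11, 500])</code> flags only the 500. A classical 3-sigma rule on the same data actually misses it, because the 500 itself inflates the mean and standard deviation; that robustness is the whole reason for using the median here.</p>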
<p>Now that we have found the outliers per specialty/reason partition, and
a sample of inliers, let’s display them so that we can get a sense of
how sensitive the outlier detection is.</p>
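<p>The <code>get_inliers</code> function called above is not shown in this excerpt. Conceptually it just draws a modest sample of payments from each (reason, specialty) partition so we have a background population to compare against. A plain-Python sketch of that idea (the name, sample size, and seed are my own assumptions, and the real version runs as a Spark transformation):</p>

```python
import random
from collections import defaultdict

def sample_inliers(reason_amount_pairs, n=100, seed=1234):
    """Group (key, payment_dict) pairs by their reason_specialty key
    and draw up to n payments per partition. Mirrors the
    (key, list-of-dicts) shape that display_info expects for inliers."""
    groups = defaultdict(list)
    for key, payment in reason_amount_pairs:
        groups[key].append(payment)
    rng = random.Random(seed)  # fixed seed for reproducible samples
    return [(key, rng.sample(payments, min(n, len(payments))))
            for key, payments in groups.items()]
```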
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#display the top k outliers in a table and a distribution plot</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#of an inlier sample along with the outliers rug-plotted in red</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">display_info</span><span class="p">(</span><span class="n">inliers_raw</span><span class="p">,</span> <span class="n">outliers_raw_tmp</span><span class="p">,</span> <span class="n">reason</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">outliers_raw</span> <span class="o">=</span> <span class="p">[]</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">if</span> <span class="n">k</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">outliers_raw</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">outliers_raw_tmp</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">d</span><span class="p">:</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'amount'</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">else</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">outliers_raw</span> <span class="o">=</span> <span class="nb">sorted</span><span 
class="p">(</span><span class="n">outliers_raw_tmp</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">d</span><span class="p">:</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'amount'</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)[</span><span class="mi">0</span><span class="p">:</span><span class="n">k</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">inlier_pts</span> <span class="o">=</span> <span class="p">[]</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="p">[</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">inliers_raw</span><span class="p">]:</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">i</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">inlier_pts</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">j</span><span class="p">[</span><span class="s">'amount'</span><span class="p">])</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">outlier_pts</span><span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'amount'</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span 
class="ow">in</span> <span class="n">outliers_raw</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">plot_outliers</span><span class="p">(</span><span class="n">inlier_pts</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">1500</span><span class="p">],</span> <span class="n">outlier_pts</span><span class="p">,</span> <span class="n">reason</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /> <span class="n">print_table</span><span class="p">([</span><span class="s">'Physician'</span><span class="p">,</span><span class="s">'Specialty'</span><span class="p">,</span> <span class="s">'Payer'</span><span class="p">,</span> <span class="s">'Amount'</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="p">[</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'physician_id'</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">outliers_raw</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="p">[</span> <span class="p">[</span> <span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'specialty'</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'payer'</span><span class="p">]</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span> <span class="s">'ignore'</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="s">'$'</span> <span 
class="o">+</span> <span class="n">locale</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">'</span><span class="si">%0.2</span><span class="s">f'</span><span class="p">,</span> <span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'amount'</span><span class="p">],</span> <span class="n">grouping</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">outliers_raw</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span></code></pre></figure>
<h2 id="food-and-beverage-purchase-outliers">Food and Beverage Purchase Outliers</h2>
<p>Let’s look at the top 4 outliers for Food and Beverage payments. I
could have shown all of the outliers, but I found that the first few give
the biggest bang for the buck in terms of interesting findings.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#outliers for food and beverage purchases</span><br data-jekyll-commonmark-ghpages="" /><span class="n">food_outliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">outliers</span><span class="p">,</span> <span class="s">'Food and Beverage'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">food_inliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">inliers</span><span class="p">,</span> <span class="s">'Food and Beverage'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">display_info</span><span class="p">(</span><span class="n">food_inliers</span><span class="p">,</span> <span class="n">food_outliers</span><span class="p">,</span> <span class="s">'Food and Beverage'</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span></code></pre></figure>
<p>As we can see, the misclassified data from Teleflex is rearing its head
again with a huge single payment for food. Looking further down the
list, however, Biolase is paying quite a bit to a single dentist for food.</p>
<table id="resultTable7" class="tablesorter">
<thead>
<tr style="text-align: center;">
<th>Physician</th>
<th>Specialty</th>
<th>Payer</th>
<th>Amount</th>
</tr>
</thead>
<tbody>
<tr>
<td> 200720</td>
<td> Allopathic & Osteopathic Physicians/ Surgery</td>
<td> Teleflex Medical Incorporated</td>
<td> $68,750.00</td>
</tr>
<tr>
<td> 28946</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> BIOLASE, INC.</td>
<td> $13,297.15</td>
</tr>
<tr>
<td> 28946</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> BIOLASE, INC.</td>
<td> $8,111.82</td>
</tr>
<tr>
<td> 28946</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> BIOLASE, INC.</td>
<td> $8,111.82</td>
</tr>
</tbody>
</table>
<p>Below is a density plot of a sample of inliers, with the outliers
rug-plotted in red. You can see how far out along the tail of the
density plot these outliers sit. Most food and beverage payment data
hovers much closer to $0.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_13_1.png" alt="png" /></p>
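<p>The <code>plot_outliers</code> helper used throughout is also not defined in this excerpt. A minimal matplotlib sketch of what the plot above might look like (this is my reconstruction, not the original; the real version presumably uses a kernel density estimate rather than a histogram):</p>

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this in a notebook
import matplotlib.pyplot as plt

def plot_outliers(inlier_pts, outlier_pts, reason):
    """Density histogram of an inlier sample with the outliers
    rug-plotted in red along the x-axis."""
    fig, ax = plt.subplots()
    ax.hist(inlier_pts, bins=50, density=True, alpha=0.6)
    # the rug: one red tick per outlier, sitting on the x-axis
    ax.plot(outlier_pts, [0.0] * len(outlier_pts), "|",
            color="red", markersize=20)
    ax.set_title("Payments: {}".format(reason))
    ax.set_xlabel("Amount ($)")
    ax.set_ylabel("Density")
    return fig
```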
<h2 id="travel-and-lodging-outliers">Travel and Lodging Outliers</h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">travel_outliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">outliers</span><span class="p">,</span> <span class="s">'Travel and Lodging'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">travel_inliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">inliers</span><span class="p">,</span> <span class="s">'Travel and Lodging'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">display_info</span><span class="p">(</span><span class="n">travel_inliers</span><span class="p">,</span> <span class="n">travel_outliers</span><span class="p">,</span> <span class="s">'Travel and Lodging'</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span></code></pre></figure>
<p>All I can say is that Physician 106320 must travel far more than I do
to rack up $155k in travel payments in 2013. I hope that triple platinum
status on Delta is worth it. :)</p>
<table id="resultTable8" class="tablesorter">
<thead>
<tr style="text-align: left;">
<th>Physician</th>
<th>Specialty</th>
<th>Payer</th>
<th>Amount</th>
</tr>
</thead>
<tbody>
<tr>
<td> 106320</td>
<td> Allopathic & Osteopathic Physicians/ Psychiatry & Neurology/ Neurology</td>
<td> Boehringer Ingelheim Pharma GmbH & Co.KG</td>
<td> $155,772.00</td>
</tr>
<tr>
<td> 472722</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Nephrology</td>
<td> Merck Sharp & Dohme Corporation</td>
<td> $75,000.00</td>
</tr>
<tr>
<td> 371379</td>
<td> Allopathic & Osteopathic Physicians/ Orthopaedic Surgery</td>
<td> Exactech, Inc.</td>
<td> $65,798.00</td>
</tr>
<tr>
<td> 198801</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Cardiovascular Disease</td>
<td> Medtronic Vascular, Inc.</td>
<td> $41,232.80</td>
</tr>
<tr>
<td> 382697</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Nephrology</td>
<td> Genentech, Inc.</td>
<td> $39,978.80</td>
</tr>
<tr>
<td> 169095</td>
<td> Allopathic & Osteopathic Physicians/ Surgery</td>
<td> Medtronic Vascular, Inc.</td>
<td> $37,683.00</td>
</tr>
<tr>
<td> 80052</td>
<td> Allopathic & Osteopathic Physicians/ Family Medicine</td>
<td> Boehringer Ingelheim Pharma GmbH & Co.KG</td>
<td> $24,911.25</td>
</tr>
<tr>
<td> 202461</td>
<td> Allopathic & Osteopathic Physicians/ Thoracic Surgery (Cardiothoracic Vascular Surgery)</td>
<td> Covidien LP</td>
<td> $21,594.51</td>
</tr>
<tr>
<td> 378722</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine</td>
<td> GlaxoSmithKline, LLC.</td>
<td> $20,112.40</td>
</tr>
<tr>
<td> 243205</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Interventional Cardiology</td>
<td> Medtronic Vascular, Inc.</td>
<td> $19,273.90</td>
</tr>
</tbody>
</table>
<p>You can see on the density plot that the next-nearest outlier is pretty far away, and that we have clumps of outliers around $20k and $40k. Interesting things to look into.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_14_1.png" alt="png" /></p>
<h2 id="consulting-fee-outliers">Consulting Fee Outliers</h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">consulting_outliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">outliers</span><span class="p">,</span> <span class="s">'Consulting Fee'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">consulting_inliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">inliers</span><span class="p">,</span> <span class="s">'Consulting Fee'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">display_info</span><span class="p">(</span><span class="n">consulting_inliers</span><span class="p">,</span> <span class="n">consulting_outliers</span><span class="p">,</span> <span class="s">'Consulting Fee'</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span></code></pre></figure>
<p>Looking at consulting fee outliers, you can see some clumping, but most
of the data falls below $50k. That makes the $200k outlier from Teva
all the more interesting. Of course, none of this is any
indication of wrongdoing, just interesting spikes in the data.</p>
<table id="resultTable9" class="tablesorter">
<thead>
<tr style="text-align: center;">
<th>Physician</th>
<th>Specialty</th>
<th>Payer</th>
<th>Amount</th>
</tr>
</thead>
<tbody>
<tr>
<td> 104930</td>
<td> Allopathic & Osteopathic Physicians/ Psychiatry & Neurology/ Neurology</td>
<td> Teva Pharmaceuticals USA, Inc.</td>
<td> $207,500.00</td>
</tr>
<tr>
<td> 151515</td>
<td> Other Service Providers/ Specialist</td>
<td> Alcon Research Ltd</td>
<td> $150,000.00</td>
</tr>
<tr>
<td> 309376</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine</td>
<td> Teva Pharmaceuticals USA, Inc.</td>
<td> $137,559.67</td>
</tr>
<tr>
<td> 231913</td>
<td> Allopathic & Osteopathic Physicians/ Orthopaedic Surgery</td>
<td> Exactech, Inc.</td>
<td> $108,125.00</td>
</tr>
<tr>
<td> 465481</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Rheumatology</td>
<td> Vision Quest Industries Inc.</td>
<td> $102,196.09</td>
</tr>
<tr>
<td> 409799</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Endocrinology, Diabetes & Metabolism</td>
<td> Pfizer Inc.</td>
<td> $100,000.00</td>
</tr>
<tr>
<td> 206227</td>
<td> Allopathic & Osteopathic Physicians/ Orthopaedic Surgery</td>
<td> DePuy Synthes Sales Inc.</td>
<td> $93,750.00</td>
</tr>
<tr>
<td> 436192</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine</td>
<td> Pfizer Inc.</td>
<td> $90,000.00</td>
</tr>
<tr>
<td> 306965</td>
<td> Allopathic & Osteopathic Physicians/ Psychiatry & Neurology/ Neurology</td>
<td> Teva Pharmaceuticals USA, Inc.</td>
<td> $64,125.00</td>
</tr>
<tr>
<td> 163888</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Cardiovascular Disease</td>
<td> Boehringer Ingelheim Pharmaceuticals, Inc.</td>
<td> $61,025.00</td>
</tr>
</tbody>
</table>
<p>You can see that most of the density sits below $20k, which makes that
$200k outlier all the more interesting.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_15_1.png" alt="png" /></p>
<h2 id="gift-outliers">Gift Outliers</h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">gift_outliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">outliers</span><span class="p">,</span> <span class="s">'Gift'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">gift_inliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">inliers</span><span class="p">,</span> <span class="s">'Gift'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">display_info</span><span class="p">(</span><span class="n">gift_inliers</span><span class="p">,</span> <span class="n">gift_outliers</span><span class="p">,</span> <span class="s">'Gift'</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span></code></pre></figure>
<p>Gifts are, I think, the most interesting payment reason in this whole
dataset. I am intrigued by why a physician might receive a gift rather than
an outright fee. Going in, I imagined that gifts would be low-value
items, but the table clearly shows that dentists are receiving substantial
gifts. Most interesting to me is that all of the top 10 outliers are
dentists.</p>
<table id="resultTable10" class="tablesorter">
<thead>
<tr style="text-align: center;">
<th>Physician</th>
<th>Specialty</th>
<th>Payer</th>
<th>Amount</th>
</tr>
</thead>
<tbody>
<tr>
<td> 225073</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> Dentalez Alabama, Inc.</td>
<td> $56,422.00</td>
</tr>
<tr>
<td> 167931</td>
<td> Dental Providers/ Dentist</td>
<td> DENTSPLY IH Inc.</td>
<td> $8,672.50</td>
</tr>
<tr>
<td> 380517</td>
<td> Dental Providers/ Dentist</td>
<td> DENTSPLY IH Inc.</td>
<td> $8,672.50</td>
</tr>
<tr>
<td> 380073</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> Benco Dental Supply Co.</td>
<td> $7,570.00</td>
</tr>
<tr>
<td> 403926</td>
<td> Dental Providers/ Dentist</td>
<td> A-dec, Inc.</td>
<td> $5,430.00</td>
</tr>
<tr>
<td> 429612</td>
<td> Dental Providers/ Dentist</td>
<td> PureLife, LLC</td>
<td> $5,058.72</td>
</tr>
<tr>
<td> 404935</td>
<td> Dental Providers/ Dentist</td>
<td> A-dec, Inc.</td>
<td> $5,040.00</td>
</tr>
<tr>
<td> 8601</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> DentalEZ, Inc.</td>
<td> $3,876.35</td>
</tr>
<tr>
<td> 385314</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> Henry Schein, Inc.</td>
<td> $3,789.99</td>
</tr>
<tr>
<td> 389592</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> Henry Schein, Inc.</td>
<td> $3,789.99</td>
</tr>
</tbody>
</table>
<p><img src="files/ref_data/open_payments_files/open_payments_16_1.png" alt="png" /></p>
<h2 id="up-next">Up Next</h2>
<p><a href="pyspark-openpayments-analysis-part-5.html">Next</a>, we look for anomalies
in our payment data by using Benford’s Law. This is part of a
broader <a href="pyspark-openpayments-analysis.html">series</a> of posts about Data
Science and Hadoop.</p>
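<p>Benford's Law predicts that in many naturally occurring datasets the leading digit <em>d</em> appears with frequency log<sub>10</sub>(1 + 1/<em>d</em>), so roughly 30.1% of amounts should start with a 1. As a small preview of the idea, here is a minimal, self-contained sketch of the expected and observed first-digit distributions (the helper names are illustrative, not the ones used in the next post):</p>

```python
import math
from collections import Counter

def benford_expected(digit):
    """Expected frequency of `digit` (1-9) as a leading digit under Benford's Law."""
    return math.log10(1 + 1.0 / digit)

def leading_digit(amount):
    """First nonzero digit of a positive payment amount."""
    s = "{:.10f}".format(abs(amount)).replace(".", "").lstrip("0")
    return int(s[0])

def first_digit_distribution(amounts):
    """Observed frequency of each leading digit 1-9 in a list of amounts."""
    digits = [leading_digit(a) for a in amounts if a > 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / float(total) for d in range(1, 10)}
```

Large deviations between the observed distribution and the Benford expectation are a classic red flag for fabricated or manipulated figures, which is what makes it a natural anomaly-detection tool for payment data.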