<h1 id="ethereum-blockchain-analysis">A Blockchain Story Told Through The Eyes of Two Users</h1>
<p>Casey Stella, Structure &amp; Process, 2018-01-22</p>
<h2 id="blockchains-are-big-data">Blockchains are Big Data</h2>
<p>I saw a commercial for Enterprise blockchains by Oracle during a football game this weekend. I’ll just pause to let that sink in. It is undeniable that this slightly esoteric corner of distributed computing is fully riding the hype train right now. There’s no doubt that the run-up in price of the core cryptocurrencies, combined with pointed skepticism from mainline economists and financial analysts, is driving interest in the technology. It’s the perfect mixture of nerdiness, drama and money to pique the interest of even the most bloodless in the tech industry.</p>
<p>I’m a <a href="https://www.linkedin.com/in/casey-stella-84b9a11">data scientist</a> working in a very specific niche: dealing with “Big Data” (shout out to <a href="http://metron.apache.org">Apache Metron</a>). When blockchains came to my notice, their sheer transparency was exceptionally exciting and liberating. Traditionally, things like currencies operate like a black box: one looks at the inputs and outputs and tries to develop sensible guesses as to what is going on inside. With blockchains, because they are essentially immutable ledgers of transactions, one can crack open the nut and get at the juicy transaction details kept inside.</p>
<p>Blockchains as they stand right now operate at relatively anemic transaction rates as compared to other financial transaction systems that one uses day-to-day (e.g. Visa). Also, they’ve been around for a somewhat limited amount of time. These two aspects together put into question whether this truly is a “big data” problem or just a regular data problem. I contend, and hopefully will show in a bit here, that nontrivial analysis of blockchains puts us in a “small-to-medium data, big compute” territory. As such, this fits well within my preferred data analysis tools of <a href="http://spark.apache.org">Apache Spark</a>, Python and Jupyter.</p>
<h2 id="ethereum-a-virtual-machine-on-a-chain">Ethereum: A Virtual Machine on a Chain</h2>
<p>The most attractive blockchain to analyze, in my opinion, is Ethereum. From <a href="https://en.wikipedia.org/wiki/Ethereum">Wikipedia</a>:</p>
<blockquote>
<p>Ethereum is an open-source, public, blockchain-based distributed computing platform featuring smart contract (scripting)
functionality. Ether is a cryptocurrency whose blockchain is maintained by the Ethereum platform, which provides a
distributed ledger for transactions. Ether can be transferred between accounts and used to compensate participant
nodes for computations performed. Ethereum provides a decentralized Turing-complete virtual machine, the Ethereum
Virtual Machine (EVM), which can execute scripts using an international network of public nodes. “Gas”, an internal
transaction pricing mechanism, is used to mitigate spam and allocate resources on the network.</p>
</blockquote>
<p>I like several aspects of this project:</p>
<ul>
<li>It is well used every day and growing in popularity</li>
<li>It seems to have a broad vision; the blockchain as a platform for smart contracts is enticing</li>
<li>It’s moving away from a proof of work model, which results in huge energy consumption</li>
<li>Gathering transaction data from <a href="https://github.com/ethereum/go-ethereum/wiki/geth">geth</a>, the Ethereum node, is doable via the JSON-RPC interface it provides.</li>
</ul>
<p>The thing that I like the most, however, is that it seems to be a multi-use chain. You see a lot happening on this blockchain:</p>
<ul>
<li><a href="https://www.prnewswire.com/news-releases/cryptokitties-the-worlds-first-ethereum-game-launches-today-660494083.html">Cat breeding games</a></li>
<li>A proper cryptocurrency (named <a href="https://en.wikipedia.org/wiki/Ethereum">Ether</a>)</li>
<li>Other cryptocurrencies (e.g. <a href="https://en.wikipedia.org/wiki/ERC20">ERC-20</a>) and <a href="https://en.wikipedia.org/wiki/Initial_coin_offering">initial coin offerings</a></li>
</ul>
<p>For these reasons, Ethereum seems like the blockchain most ripe for analysis. Specifically, it would be interesting to find some analytics that might yield insights into how this chain works on a day-to-day basis. While not necessarily tied to predicting price, it would be of particular interest to investors to find something which connects, even indirectly, to future price.</p>
<h2 id="the-tale-of-two-users">The Tale of Two Users</h2>
<p>It’s easy to say one should be looking at advanced analytics using the full data from the blockchain. It’s quite a different story to actually suggest <em>what</em> to look at here. I will proceed from a couple of observations:</p>
<ul>
<li>Transaction data forms a graph, so it is possible to borrow machinery from Graph Theory if necessary</li>
<li>There are at least two interesting actors in this scenario: the new user and the established player</li>
</ul>
<p>The “new user” is a user who is using the blockchain for the first time, whereas the “established player” is a hash which is important and somewhat central to the blockchain (e.g. involved in both sending and receiving transactions with many people). I maintain that these are two interesting actors insofar as observing the blockchain transactions from the vantage point of these users will yield insights as to the general health, well-being and state of the blockchain. If either of these actors changes their behavior appreciably, it’s worth knowing and will probably have some impact on the fundamental usage patterns of the blockchain in question. It may even, if we’re very lucky, give us a hint as to how the price may change.</p>
<p>We now face a couple of challenges:</p>
<ul>
<li>Formally defining these two actors in such a way that one can distinguish between them could be computationally daunting</li>
<li>What precisely should one measure through the lens of these actors?</li>
</ul>
<p>Starting from the bottom, I think a sensible starting point here is to measure the daily percentage of transactions being done by each of these actors. Plotting this opposite the price, one may see the effect that each of these actors may have on the price.</p>
<h2 id="the-new-user-impact">The New User Impact</h2>
<p>Let’s call the daily percentage of transactions involving a hash never before seen the “new user impact.” Just the act of picking out hashes that have never been seen before can be rather daunting given that there have been over 20 million distinct hashes between Ethereum’s inception and January 18, 2018. This sort of analysis is beyond a simple SQL query, but it is well within Spark’s sweet spot of enabling lower-level operations and distributed computing primitives. Judicious use of bloom filters in Spark opens us up to performing these kinds of computations in a scalable way.</p>
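<p>The bookkeeping behind this metric can be sketched in plain Python. This is just a single-machine sketch with hypothetical toy transactions; the real run distributes the work with Spark, where a bloom filter stands in for the “seen” set:</p>

```python
from collections import defaultdict

def daily_new_user_impact(transactions):
    """Percentage of each day's transactions involving a never-before-seen hash.

    transactions: list of (day, from_hash, to_hash) tuples, sorted by day.
    Returns a dict mapping day -> percentage of that day's transactions
    touching at least one previously unseen hash.
    """
    seen = set()  # in the Spark version, a bloom filter plays this role
    totals = defaultdict(int)
    new = defaultdict(int)
    for day, src, dst in transactions:
        totals[day] += 1
        if src not in seen or dst not in seen:
            new[day] += 1
        seen.add(src)
        seen.add(dst)
    return {day: 100.0 * new[day] / totals[day] for day in totals}

# Hypothetical toy ledger: hashes a and b are new on day 1; c is new on day 2.
txns = [(1, "a", "b"), (2, "a", "c"), (2, "a", "b")]
print(daily_new_user_impact(txns))  # {1: 100.0, 2: 50.0}
```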
<p><img src="files/ref_data/ethereum_analysis/new_hashes.png" style="width:100%" /></p>
<p>Observe the above plot, covering the time range from January 1, 2017 until January 18, 2018, with the closing price per day in blue plotted opposite the percentage of daily transactions involving a hash never before seen (the daily new user impact) in red.</p>
<p>Note the discordant nature of the new user impact and how little correlation to price is happening prior to mid-November. The behavior prior to mid-November is in stark contrast to the run-up in price and strong connection to the new user impact that happens from mid-November until early January. The fascinating thing here is that the new user impact seems to dip prior to the price dip in early January. It’s unclear whether this is a reliable early indicator (especially given its chaos earlier in the year), but it’s certainly worth investigating. It is somewhat unfortunate how volatile the new user impact becomes from mid-December onward.</p>
<h2 id="the-established-player-impact">The Established Player Impact</h2>
<p>In contrast to the “new user” as an actor, whose definition is easy to pin down in a technical way, the established player is tougher to specify rigorously. Given that the transactions on a blockchain form a graph, one can borrow some tooling from graph analytics to help us out. Specifically, define an “established player” for a specific day to be a hash whose <a href="https://en.wikipedia.org/wiki/PageRank#PageRank_of_an_undirected_graph">undirected pagerank</a> is in the top 10% of pageranks given the transaction graph of the previous 14 days. The intuition here is that this will define a set of “important” hashes in the network. Tracking how much of the network operates from these important hashes daily will give us some idea of the impact of the big players, such as exchanges and market makers, in the network.</p>
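<p>As a sketch of that definition, here is a tiny power-iteration pagerank over an undirected graph in plain Python. The star-shaped “hub” graph below is hypothetical, and a real 14-day window would be computed with Spark rather than a single-machine loop:</p>

```python
def undirected_pagerank(edges, damping=0.85, iters=50):
    """Power-iteration pagerank over an undirected graph given as (u, v) edges."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    nodes = list(neighbors)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iters):
        # each node receives a share of its neighbors' rank, split by degree
        rank = {node: (1 - damping) / n
                + damping * sum(rank[nb] / len(neighbors[nb]) for nb in neighbors[node])
                for node in nodes}
    return rank

def established_players(edges, top_fraction=0.10):
    """Hashes whose pagerank lands in the top fraction for the window's graph."""
    rank = undirected_pagerank(edges)
    cutoff = sorted(rank.values(), reverse=True)[max(0, int(len(rank) * top_fraction) - 1)]
    return {node for node, r in rank.items() if r >= cutoff}

# Hypothetical 14-day window: one central "hub" hash transacts with everyone else.
edges = [("hub", "u%d" % i) for i in range(10)]
print(established_players(edges))  # {'hub'}
```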
<p><img src="files/ref_data/ethereum_analysis/pagerank_plot.png" style="width:100%" /></p>
<p>Observe the above plot, covering an abbreviated time range of July 2017 until January 18, 2018, with the closing price per day in blue plotted opposite the percentage of daily transactions involving a hash from an established player in red. This time range is abbreviated because it’s fairly costly to compute the pagerank of even two weeks’ worth of transaction data (a more serious analysis would imply more serious compute and thus might adjust these parameters).</p>
<p>The thing I immediately notice here is that, like the new user impact, the established player impact seems to couple with the price starting in mid-November. Also, similar to the new user impact, it deviates prior to the actual price drop, but it is decidedly less chaotic immediately prior to the mid-January dip and thus possibly more reliable.</p>
<h2 id="in-conclusion">In Conclusion</h2>
<p>The core impulse behind this exercise is to find some essential analytics to summarize the behavior of the network from a particular vantage point (or set of vantage points). One must be careful about drawing conclusions from this exercise regarding predictive leading indicators of price. Rather, stepping back, these are the beginnings of a set of analytics that one can monitor over time to better understand how Ethereum, as a blockchain, moves, lives and breathes on a day-to-day basis. Inflection points in these analytics tie to usage shifts, and when they occur, assumptions in the technical analysis of this blockchain should be reevaluated or else risk becoming stale or less effective. For instance, if one sees a precipitous drop in the new user impact over a week, then either users are not using the chain (which you can see in early 2017 in the “New User Impact” plot) or Ethereum has reached saturation (i.e. no new users, but still much usage). For a young blockchain, new user usage is imperative for robust growth, and thus it will be a turning point when the chain is saturated.</p>
<p>Thinking beyond this analysis, I plan to go on and look at some of the other graph theoretic analytics that can be tracked over time in both Ethereum as well as other established blockchains, most obviously Bitcoin:</p>
<ul>
<li>The number of <a href="https://www.geeksforgeeks.org/number-of-triangles-in-a-undirected-graph/">transaction triangles</a> per day to get an indication of the transaction movement in the chain</li>
<li>The number of “communities” in the transaction graph by applying a <a href="https://en.wikipedia.org/wiki/Label_Propagation_Algorithm">label propagation algorithm</a> to the transaction graph daily.</li>
</ul>
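<p>Both of these are standard graph routines; as a flavor of the first, a minimal triangle count in plain Python looks like the sketch below (a production run on a full day of transactions would use Spark’s graph tooling, and the edges here are hypothetical):</p>

```python
from itertools import combinations

def count_triangles(edges):
    """Count distinct triangles in an undirected graph given as (u, v) edges."""
    neighbors = {}
    for u, v in edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    triangles = 0
    for node, nbrs in neighbors.items():
        # a triangle exists when two neighbors of a node are themselves adjacent
        for a, b in combinations(nbrs, 2):
            if b in neighbors[a]:
                triangles += 1
    return triangles // 3  # each triangle is counted once per vertex

# Hypothetical daily transaction graph: one triangle plus a dangling edge.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
print(count_triangles(edges))  # 1
```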
<p>Also, I’ll look closer at analytics involving the amount of ether transacted per day:</p>
<ul>
<li>50th, 75th, 90th and 95th percentile of the amount of ether transacted by new users</li>
<li>50th, 75th, 90th and 95th percentile of the amount of ether transacted by established players</li>
</ul>
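<p>Those percentile summaries need nothing fancy; a nearest-rank percentile sketch in plain Python, with hypothetical ether amounts, looks like this:</p>

```python
def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p * n / 100)
    return ordered[rank - 1]

# Hypothetical daily ether amounts transacted by new users.
amounts = [0.1, 0.5, 1.2, 2.0, 3.3, 5.0, 8.8, 13.0, 21.0, 40.0]
print({p: percentile(amounts, p) for p in (50, 75, 90, 95)})
# {50: 3.3, 75: 13.0, 90: 21.0, 95: 40.0}
```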
<h1 id="nlp-for-non-text">Word2Vec with Non-Textual Data</h1>
<p>2015-12-04</p>
<p>At least half of the battle of data analysis and data science is understanding your data.</p>
<p>That sounds obvious, but I’ve seen whole data science projects fail
because not nearly enough time was spent on the exercise of understanding
your data. There are only two real ways to go about doing this:</p>
<ul>
<li>Ask an expert</li>
<li>Ask the data</li>
</ul>
<p>To have a shot at doing this you really have to do both.</p>
<p>In the course of this blog post, I’m going to describe some of the
challenges with understanding data and I’ll go into some technical
detail of how to borrow some scalable unsupervised learning from natural language
processing coupled with a very nice data visualization to facilitate
understanding the natural organization and arrangement of data.</p>
<h1 id="subject-matter-experts">Subject Matter Experts</h1>
<p>I spend a lot of time with healthcare
data and the obvious subject matter experts are nurses and doctors.
These people are very gracious, very knowledgeable and extremely pressed
for time. The problem with expert knowledge is that it’s surprisingly hard
to communicate sufficient nuance effectively enough to help the working data
scientist accomplish their goals. Furthermore, it’s extremely time consuming.
This is made doubly hard when the expert is entirely unclear about the goal.</p>
<p>The second, perhaps less obvious, challenge is that subject matter
experts’ knowledge is biased toward that which is already known. Often
data scientists and analysts are trying to understand the data not as an
end, but rather as a means to gaining insight. If you only take into
account received knowledge, then making unexpected insights can be
challenging. That being said, spending time with subject matter experts is
a necessary yet insufficient part of data analysis.</p>
<h1 id="unsupervised-data-understanding">Unsupervised Data Understanding</h1>
<p>To complete the task of understanding your data, I have found that it is
necessary to spend time looking at the data. One can think of the
entire field of statistics as an exercise in building a mechanism to ask
data pointed questions and get answers that we can trust, often with
caveats. The goal is generally to get a sense of how the data is organized
or arranged. With the unbelievable complexity of
most real data, we are forced to simplify our representations. The
question is just precisely how to simplify that representation to find the
proper balance between simplicity and complexity. More than that, some
representations of the data offer useful views of the data for certain
purposes and not for others.</p>
<p>Common simplified representations of data are things like distributions, histograms, and plots. Of course there are other even more complex ways to represent your data. Whole <a href="http://www.ayasdi.com/">companies</a> have been formed around providing a way to gain insight through more complex organizations of the data, taking some of the burden of interpretation from our brain and encoding it in an organization scheme.</p>
<p>Today, I’d like to talk about another approach to data simplification
for event data, one which provides not just an interesting representation,
but also a way to ask certain kinds of useful questions of your data.</p>
<h2 id="word2vec">Word2Vec</h2>
<p>One common way to impose order on data that is used by engineers and
mathematicians everywhere is to embed your data in a <a href="https://en.wikipedia.org/wiki/Vector_space">vector space</a> with a
<a href="https://en.wikipedia.org/wiki/Metric_space">metric</a> on it.
This gives us a couple of things:</p>
<ul>
<li>Data now has a distance which can be interpreted as the degree of
“difference” between the data</li>
<li>Data can be combined via addition and subtraction operations which can
be interpreted as combination and separation operations</li>
</ul>
<p>The issue now is how you impose this structure by embedding your data,
which may not even be numeric, into a vector space. Thankfully, the
nice people at <a href="http://www.google.com">Google</a> developed a nice way of
doing this in the domain of natural language text called
<a href="http://arxiv.org/abs/1310.4546">Word2Vec</a>.</p>
<p>I won’t go into extravagant detail on the implementation, as Radim
Řehůřek did a great job <a href="http://rare-technologies.com/making-sense-of-word2vec/">here</a>.
The major takeaway, however, is that by using the inherent structure of natural
language, Word2Vec is able to construct a vector space such that:</p>
<ul>
<li>Word similarity can be interpreted as a distance calculation</li>
<li>The notion of analogies can be interpreted using the addition and
subtraction operators (e.g. the vector representation of king - male + female is near the vector representation of queen).</li>
</ul>
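<p>The analogy arithmetic is nothing more than vector addition and cosine similarity. A toy sketch in plain Python, using hand-made two-dimensional vectors rather than a trained model, shows the mechanics:</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hand-made embeddings (not learned): axis 0 is "gender", axis 1 is "royalty".
vectors = {
    "king":   [1.0, 1.0],
    "queen":  [-1.0, 1.0],
    "male":   [1.0, 0.0],
    "female": [-1.0, 0.0],
}

# king - male + female, then find the nearest word by cosine similarity.
target = [k - m + f for k, m, f in
          zip(vectors["king"], vectors["male"], vectors["female"])]
best = max((w for w in vectors if w != "king"),
           key=lambda w: cosine(target, vectors[w]))
print(best)  # queen
```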
<p>This is a surprisingly rich organization of data and one that has proven
very effective in enhancing the accuracy of machine learning models that
deal with natural language. Perhaps the most surprising part of this is
that the vectorization model does not utilize any of the grammatical
structure of the natural language directly. It simply analyzes the
words within the sentences and through usage it fits the proper
embedding. This led me to consider whether other, non-textual data
which has some inherent structure can also be organized this way with
the same algorithm.</p>
<h2 id="medical-data">Medical Data</h2>
<p>Whenever we go to the doctor, a set of events happen:</p>
<ul>
<li>Measurements are made (e.g. blood pressure, pulse, height, weight)</li>
<li>Labs are drawn and ordered (e.g. blood tests)</li>
<li>Procedures are performed (e.g. an x-ray)</li>
<li>Diagnoses are made</li>
<li>Drugs are prescribed</li>
</ul>
<p>These events happen in a certain <em>overall</em> order but the order varies based on the
patient situation and according to the medical staff’s best judgement.
We will call this set of events a <strong>medical encounter</strong> and they happen every day all over the world.</p>
<p>This sequence of events has a similar tone to what we’re familiar with
in natural language. The encounter can be thought of as a sort of
medical sentence. Each medical event within the encounter can be
thought of as a medical word. The type of event (lab, procedure,
diagnoses, etc.) can be considered as a sort of part-of-speech.</p>
<p>It remains to determine if this structure can be teased out and encoded
into a vector space model like natural language can be. If so, then we
can ask questions like:</p>
<ul>
<li>How similar are two diseases based on how they are treated and
<a href="https://en.wikipedia.org/wiki/Comorbidity">comorbidities</a> found in
the same encounter?</li>
<li>Can we compose diseases and make them similar to other diseases? For
instance, is the vector representation of type 2 diabetes - obesity
close to type 1 diabetes?</li>
</ul>
<p>When considering trying this technique out, the problem, of course, is getting
access to medical data. This data is extremely sensitive and is covered
by
<a href="https://en.wikipedia.org/wiki/Health_Insurance_Portability_and_Accountability_Act">HIPAA</a> here in the United States.
What we need is a good, depersonalized set of medical encounter data.</p>
<p>Thankfully, back in 2012 an electronic medical records system, <a href="http://www.practicefusion.com">Practice
Fusion</a>, released a set of 10,000 depersonalized medical records as
part of a kaggle <a href="https://www.kaggle.com/c/pf2012-diabetes">competition</a>. This opened up the possibility of actually doing this analysis, albeit on a small subset of the population.</p>
<h2 id="implementation">Implementation</h2>
<p>Since I’ve been doing a lot with Spark lately at work, I wanted to see
if I could use the Word2Vec implementation built into SparkML to
accomplish this. Also, frankly, having worked with medical data at some
big hospitals and insurance companies, I am aware that there is a real
scale problem when doing something this complex for millions of medical
encounters and I wanted to ensure that anything I did could scale.</p>
<p>The implementation boiled down into a few steps, which are common to
most projects that I’ve seen run on Hadoop. I have created a small
github repo to capture the code collateral used to process the data
<a href="https://github.com/cestella/presentations/tree/master/NLP_on_non_textual_data">here</a>.</p>
<ul>
<li>Ingest the Practice Fusion database dumps into Hadoop.
<ul>
<li>Shell script <a href="https://github.com/cestella/presentations/blob/master/NLP_on_non_textual_data/src/main/bash/ingest_practice_fusion.sh">here</a></li>
</ul>
</li>
<li>Pin up Hive tables for each of the tables, roughly corresponding to a table per medical event.
<ul>
<li>The set of DDL’s are <a href="https://github.com/cestella/presentations/tree/master/NLP_on_non_textual_data/src/main/ddl/practicefusion">here</a></li>
</ul>
</li>
<li>Transform this tabular data into a corpus of medical event sentences.
<ul>
<li>The ETL pig scripts are <a href="https://github.com/cestella/presentations/tree/master/NLP_on_non_textual_data/src/main/pig/practicefusion">here</a></li>
<li>The shell script executing the pig scripts are <a href="https://github.com/cestella/presentations/blob/master/NLP_on_non_textual_data/src/main/bash/etl.sh">here</a></li>
</ul>
</li>
<li>Build the word2vec model with Spark.</li>
</ul>
<p>You can see from the Jupyter notebook detailing the model building
portion and results <a href="https://github.com/cestella/presentations/blob/master/NLP_on_non_textual_data/src/main/ipython/clinical2vec.ipynb">here</a> that model building is only a scant few lines:</p>
<pre>
<code>
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

# sc is the SparkContext provided by the notebook environment.
# Each line of the corpus is one medical encounter rendered as a
# whitespace-delimited "sentence" of medical event tokens.
sentences = sc.textFile("practice_fusion/sentences_nlp").map(lambda row: row.split(" "))

# Fit a 100-dimensional Word2Vec model; the seed is fixed for reproducibility.
word2vec = Word2Vec()
word2vec.setSeed(0)
word2vec.setVectorSize(100)
model = word2vec.fit(sentences)
</code>
</pre>
<h1 id="results">Results</h1>
<p>One of the problems with unsupervised models is evaluating how well our
model is describing reality. For the purpose of this entirely
unscientific analysis, we’ll restrict ourselves to just diagnoses and
ask a couple of questions of the model:</p>
<ul>
<li>Does the model correctly recover what we currently know based on
medical research?</li>
<li>Does the model show us anything that is novel and likely, but unknown
at present?</li>
</ul>
<p>One thing to note before we get started. This model uses <a href="https://en.wikipedia.org/wiki/Cosine_similarity">cosine similarity</a> as the score. This measure of similarity ranges from -1 to 1, with 1 being most similar.</p>
<h2 id="atherosclerosis">Atherosclerosis</h2>
<p>Also known as heart disease or hardening of the arteries. This disease
is the number one killer of Americans. Our model found the following
similar diseases:</p>
<table id="atherosclerosisTable" class="tablesorter">
<thead>
<tr>
<th>ICD9 Code</th><th>Description</th><th>Score</th>
</tr>
</thead>
<tbody>
<tr><td>v12.71</td><td>Personal history of peptic ulcer disease</td><td>0.930</td></tr>
<tr><td>533.40</td><td>Chronic or unspecified peptic ulcer of unspecified site with hemorrhage, without mention of obstruction</td><td>0.926</td></tr>
<tr><td>153.6</td><td>Malignant neoplasm of ascending colon</td><td>0.910</td></tr>
<tr><td>238.75</td><td>Myelodysplastic syndrome, unspecified</td><td>0.910</td></tr>
<tr><td>389.10</td><td>Sensorineural hearing loss, unspecified</td><td>0.907</td></tr>
<tr><td>428.30</td><td>Diastolic heart failure, unspecified</td><td>0.904</td></tr>
<tr><td>v43.65</td><td>Knee joint replacement</td><td>0.902</td></tr>
</tbody>
</table>
<p><br /></p>
<p><strong>Peptic Ulcers</strong></p>
<p>There have been long-standing connections noticed between ulcers and
atherosclerosis, partially due to smokers having a higher than average
incidence of both peptic ulcers and atherosclerosis. You can see an <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1611891/">editorial</a>
in the British Medical Journal from all the way back in the 1970s discussing
this.</p>
<p><strong>Hearing Loss</strong></p>
<p>From an <a href="http://www.ncbi.nlm.nih.gov/pubmed/23102449">article</a> from the Journal of Atherosclerosis in 2012:</p>
<blockquote>
<p>Sensorineural hearing loss seemed to be associated with vascular
endothelial dysfunction and an increased cardiovascular risk</p>
</blockquote>
<p><strong>Knee Joint Replacements</strong></p>
<p>These procedures are common among those with osteoarthritis and there
has been a solid correlation between osteoarthritis and atherosclerosis
in <a href="http://www.ncbi.nlm.nih.gov/pubmed/22563029">the</a> <a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3196360/">literature</a>.</p>
<h2 id="crohns-disease">Crohn’s Disease</h2>
<p>Crohn’s disease is a type of inflammatory bowel disease that is caused
by a combination of environmental, immune and bacterial factors. Let’s
see if we can recover some of these connections from the data.</p>
<table id="crohnsTable" class="tablesorter">
<thead>
<tr>
<th>ICD9 Code</th><th>Description</th><th>Score</th>
</tr>
</thead>
<tbody>
<tr><td>274.03</td><td>Chronic gouty arthropathy with tophus (tophi)</td><td>0.870</td></tr>
<tr><td>522.5</td><td>Periapical abscess without sinus</td><td>0.869</td></tr>
<tr><td>579.3</td><td>Other and unspecified postsurgical nonabsorption</td><td>0.863</td></tr>
<tr><td>135</td><td>Sarcoidosis</td><td>0.859</td></tr>
<tr><td>112.3</td><td>Candidiasis of skin and nails</td><td>0.855</td></tr>
<tr><td>v16.42</td><td>Family history of malignant neoplasm of prostate</td><td>0.853</td></tr>
</tbody>
</table>
<p><br /></p>
<p><strong>Arthritis</strong></p>
<p>From the <a href="http://www.ccfa.org/resources/arthritis.html">Crohn’s and Colitis Foundation of
America</a>:</p>
<blockquote>
<p>Arthritis, or inflammation of the joints, is the most common extraintestinal complication of IBD. It may affect as many as 25% of people with Crohn’s disease or ulcerative colitis. Although arthritis is typically associated with advancing age, in IBD it often strikes the youngest patients.</p>
</blockquote>
<p><strong>Dental Abscesses</strong></p>
<p>While not much medical literature exists with a specific link between dental
abscesses and Crohn’s (there are general oral issues noticed
<a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1410927/">here</a>), you
do see lengthy discussions on the Crohn’s <a href="http://www.crohnsforum.com/showthread.php?t=37075">forums</a> about abscesses being a
common occurrence with Crohn’s.</p>
<p><strong>Yeast Infections</strong></p>
<p>Candidiasis of skin and nails is a form of yeast infection on the skin. From the journal “Critical Reviews in Microbiology” <a href="http://www.ncbi.nlm.nih.gov/pubmed/23855357">here</a>:</p>
<blockquote>
<p>It is widely accepted that Crohn’s disease could result from an inappropriate
inflammatory response to intestinal microorganisms in a genetically
susceptible host. Most studies to date have concerned the involvement of
bacteria in disease progression. In addition to bacteria, there appears
to be a possible link between the commensal yeast Candida albicans and
disease development.</p>
</blockquote>
<h2 id="visualization">Visualization</h2>
<p>For further investigation, I have used <a href="https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding">t-distributed stochastic neighbor embedding</a>
to embed the 100-dimensional vector space into 2 dimensions. This
embedding should retain the general connections within the data, so you
can look at similar diagnoses, drugs and allergies.</p>
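<p>The embedding step can be sketched with scikit-learn’s TSNE (assuming scikit-learn is available; the 100-dimensional vectors below are random stand-ins for the trained word vectors):</p>

```python
import numpy as np
from sklearn.manifold import TSNE

# Random stand-ins for 20 learned 100-dimensional word vectors.
rng = np.random.RandomState(0)
vectors = rng.randn(20, 100)

# Perplexity must be below the number of samples for tiny datasets.
embedding = TSNE(n_components=2, perplexity=5, init="random",
                 learning_rate=200.0, random_state=0).fit_transform(vectors)
print(embedding.shape)  # (20, 2)
```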
<ul>
<li>You can choose to look at all types, just diagnoses or just drugs.</li>
<li>Highlight in the canvas below and drag around. The points that you’ve
selected will show up in the table below along with a description in
plain text.</li>
</ul>
<p>Please play around with this data and let me know what you find!</p>
<style>
.tooltip {
position: absolute;
width: 300px;
height: auto;
pointer-events: none;
border: 1px solid #000;
background-color: #FFF;
border-radius: 5px;
padding:10px;
}
.brush {
fill: teal;
stroke: teal;
fill-opacity: 0.2;
stroke-opacity: 0.8;
}
</style>
<link rel="stylesheet" href="files/css/theme.cstella.css" />
<script type="text/javascript" src="files/ref_data/word_vectors_rx.js"></script>
<script type="text/javascript" src="files/ref_data/word_vectors_dx.js"></script>
<script type="text/javascript" src="files/ref_data/word_vectors_all.js"></script>
<script src="files/js/d3.min.js"></script>
<script src="files/js/crossfilter.min.js"></script>
<script src="files/js/jquery.min.js"></script>
<script src="files/js/jquery.tablesorter.js"></script>
<script src="files/js/jquery.tablesorter.widgets.js"></script>
<script src="files/js/widget-scroller.js"></script>
<script>
$(document).ready(function()
{
$("#resultTable").tablesorter( {
theme: "cstella"
, widthFixed: true
, showProcessing: true
, widgets: ['zebra','uitheme', 'scroller']
, widgetOptions : {
scroller_height : 200,
scroller_barWidth : 17,
scroller_jumpToHeader: true,
scroller_idPrefix : 's_'
}
}
);
$("#atherosclerosisTable").tablesorter( {
theme: "cstella"
, widthFixed: true
, showProcessing: true
, widgets: ['zebra','uitheme', 'scroller']
, widgetOptions : {
scroller_height : 300,
scroller_barWidth : 17,
scroller_jumpToHeader: true,
scroller_idPrefix : 's_'
}
}
);
$("#crohnsTable").tablesorter( {
theme: "cstella"
, widthFixed: true
, showProcessing: true
, widgets: ['zebra','uitheme', 'scroller']
, widgetOptions : {
scroller_height : 300,
scroller_barWidth : 17,
scroller_jumpToHeader: true,
scroller_idPrefix : 's_'
}
}
);
}
);
</script>
<div id="chart">
</div>
<table style="border-spacing: 5px">
<tr>
<td><a href="#" onclick="display_plot(data_all);updateDots();return false;">All</a></td>
<td style="border:solid 2px black" width="50px"></td>
</tr>
<tr>
<td>Provider Specialty</td>
<td style="border:solid 2px black" width="50px" bgcolor="black"></td>
</tr>
<tr>
<td><a href="#" onclick="display_plot(data_dx);updateDots();return false;">Diagnoses</a></td>
<td style="border:solid 2px black" width="50px" bgcolor="red"></td>
</tr>
<tr>
<td><a href="#" onclick="display_plot(data_rx);updateDots();return false;">Drugs</a></td>
<td style="border:solid 2px black" width="50px" bgcolor="blue"></td>
</tr>
<tr>
<td>Allergies</td>
<td style="border:solid 2px black" width="50px" bgcolor="orange"></td>
</tr>
</table>
<div id="table">
<table id="resultTable" class="tablesorter">
<thead>
<tr>
<th>Type</th>
<th>Name</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr><td colspan="3"><center><b>Highlight some points above for this summary to be filled in.</b></center></td></tr>
</tbody>
</table>
</div>
<script>
var width = 900,
height = 900,
margin = 40;
function update_table(points) {
var tableDiv = document.getElementById('table');
while(tableDiv.firstChild) {
tableDiv.removeChild(tableDiv.firstChild);
}
// create elements <table> and a <tbody>
var tbl = document.createElement("table");
tbl.id = 'resultTable';
tbl.className = 'tablesorter';
var thead = document.createElement("thead");
var tr = document.createElement("tr");
{
var cell = document.createElement("th");
var cellText = document.createTextNode('Type');
cell.appendChild(cellText);
tr.appendChild(cell);
}
{
var cell = document.createElement("th");
var cellText = document.createTextNode('Name');
cell.appendChild(cellText);
tr.appendChild(cell);
}
{
var cell = document.createElement("th");
var cellText = document.createTextNode('Description');
cell.appendChild(cellText);
tr.appendChild(cell);
}
thead.appendChild(tr);
tbl.appendChild(thead);
var tblBody = document.createElement("tbody");
for(var i = 0;i < points.length;++i)
{
// table row creation
var row = document.createElement("tr");
var d = points[i];
{
var cell = document.createElement("td");
var txt = d['type'];
if(txt == 'dx') {
txt = 'Diagnosis';
}
else if(txt == 'rx') {
txt = 'Drugs';
}
else if(txt == 'provider specialty') {
txt = 'Provider Specialty';
}
cell.innerHTML = txt;
row.appendChild(cell);
}
{
var cell = document.createElement("td");
var txt = d['name'];
cell.innerHTML = txt;
row.appendChild(cell);
}
{
var cell = document.createElement("td");
var txt = d['description'];
cell.innerHTML = txt;
row.appendChild(cell);
}
tblBody.appendChild(row);
}
tbl.appendChild(tblBody);
tableDiv.appendChild(tbl);
$("#resultTable").tablesorter( {
theme: "cstella"
, widthFixed: true
, showProcessing: true
, widgets: ['zebra','uitheme', 'scroller']
, widgetOptions : {
scroller_height : 500,
scroller_barWidth : 17,
scroller_jumpToHeader: true,
scroller_idPrefix : 's_'
}
}
);
}
var xScale = d3.scale.linear()
.range([0, width])
, xValue = function(d) { return d["vec"][0];}
, xMap = function(d) { return xScale(xValue(d));};
var yScale = d3.scale.linear()
.range([height, 0])
, yValue = function(d) { return d["vec"][1];}
, yMap = function(d) { return yScale(yValue(d));};
// setup fill color
var cValue = function(d) {
if(d["type"] == 'dx') {
return 'red';
}
else if(d["type"] == 'rx') {
return 'blue';
}
else if(d["type"] == 'provider specialty') {
return 'black';
}
else
{
return 'orange';
}
};
var xf;
var xDim;
var yDim;
var svg;
var tooltip;
function display_plot(data) {
// don't want dots overlapping axis, so add in buffer to data domain
xScale.domain([d3.min(data, xValue)-1, d3.max(data, xValue)+1]);
yScale.domain([d3.min(data, yValue)-1, d3.max(data, yValue)+1]);
var xAxis = d3.svg.axis()
.scale(xScale)
.orient('bottom');
var yAxis = d3.svg.axis()
.scale(yScale)
.orient('left');
var brush = d3.svg.brush()
.x(xScale)
.y(yScale);
// add the tooltip area to the webpage
d3.select('#chart_svg').remove();
d3.select('#chart_tooltip').remove();
tooltip = d3.select("#chart").append("div")
.attr("class", "tooltip")
.attr("id", "chart_tooltip")
.style("opacity", 0);
svg = d3.select('#chart')
.append('svg')
.attr("id", "chart_svg")
.attr('width', width+2*margin)
.attr('height', height+2*margin)
.append('g')
.attr('transform', 'translate('+margin+','+margin+')');
svg.append('g')
.attr('class', 'x axis')
.attr('transform', 'translate(0,'+height+')')
.call(xAxis);
svg.append('g')
.attr('class', 'y axis')
.call(yAxis);
svg.append('g')
.attr('class', 'brush')
.call(brush);
xf = crossfilter(data);
xDim = xf.dimension(xValue);
yDim = xf.dimension(yValue);
brush.on('brush', function() {
var extent = brush.extent(),
xExtent = [extent[0][0], extent[1][0]],
yExtent = [extent[0][1], extent[1][1]];
xDim.filterRange(xExtent);
yDim.filterRange(yExtent);
update_table(xDim.top(Infinity));
});
}
function updateDots() {
var dots = svg.selectAll('.dot')
.data(xDim.top(Infinity));
dots.enter().append('circle')
.attr('class', 'dot')
.attr('r', 3)
.attr('fill', cValue)
.on("mouseover", function(d) {
tooltip.transition()
.duration(200)
.style("opacity", .9);
tooltip.html("" + d['name'] + "<br>@ (" + d['vec'][0].toFixed(2) + ',' + d['vec'][1].toFixed(2) + ')')
.style("left", (d3.event.pageX + 10) + "px")
.style("top", (d3.event.pageY - 10) + "px");
})
.on("mouseout", function() {
tooltip.transition()
.duration(500)
.style("opacity", 0);
});
dots
.attr('cx', xMap)
.attr('cy', yMap);
dots.exit().remove();
}
display_plot(data_all);
updateDots();
</script>
Data Science and Hadoop: Impressions and Example (2014-10-24, id:/pyspark-openpayments-analysis)
<p>A somewhat regular part of my job lately is discussing with people how
exactly one might go about doing data science on Hadoop. It’s really a
very interesting subject and one about which almost everyone even cursorily
associated with “Big Data” has an opinion. Remarks are
made, emails written, PowerPoint decks created; it’s a busy day, for
sure.</p>
<p>People cannot be blamed for being concerned since
according to <a href="http://www.informationweek.com/big-data/big-data-analytics/3-roadblocks-to-big-data-roi/d/d-id/1111593?">Jeff Kelly</a>
, a Wikibon analyst, the ROI of these big data projects does not match
expectations:</p>
<blockquote>
<p>In the long term, they expect 3 to 4 dollar return on investment for every
dollar. But based on our analysis, the average company right now is
getting a return of about 55 cents on the dollar.</p>
</blockquote>
<p>That’s pretty concerning for those of us hoping for Hadoop to <a href="http://en.wikipedia.org/wiki/Crossing_the_Chasm">cross the chasm</a> soon. As one might imagine, there’s been quite a bit of hand-wringing about the problem. I don’t take such a dim view of it, though. It’s a matter of maturity, and I’ll give some of my impressions shortly on why it may be hard to fulfill the data science portion of the ROI currently.</p>
<h1 id="outline">Outline</h1>
<ul>
<li><a href="#data_science_challenge">Data Science Challenges</a>
<ul>
<li><a href="#data_has_inertia">Data has Inertia</a></li>
<li><a href="#hadoop_maturing">Hadoop is Still Maturing as a Platform</a></li>
<li><a href="#analysis_paralysis">Analysis Paralysis</a></li>
</ul>
</li>
<li><a href="#example_analysis">Example Analysis with PySpark and Hadoop</a>
<ul>
<li><a href="pyspark-openpayments-analysis-part-2.html">Data Overview and Preprocessing</a></li>
<li><a href="pyspark-openpayments-analysis-part-3.html">Basic Structural Analysis</a></li>
<li><a href="pyspark-openpayments-analysis-part-4.html">Outlier Analysis</a></li>
<li><a href="pyspark-openpayments-analysis-part-5.html">Benford’s Law Analysis</a></li>
</ul>
</li>
<li><a href="#conclusions">Conclusions</a>
<ul>
<li><a href="#dull_blade">This is a Dull Blade Exercise</a></li>
<li><a href="#pyspark">PySpark + Hadoop as a Platform</a></li>
</ul>
</li>
</ul>
<div id="data_science_challenge"></div>
<h1 id="data-science-challenges">Data Science Challenges</h1>
<p>One benefit from my vantage point within the consulting wing of a Hadoop
<a href="http://www.hortonworks.com">distribution</a> is that I get to see quite a
few Hadoop projects. Being that I’m part of the Data Science team, I
get to have a decidedly Data Science oriented view of the Hadoop world.
Furthermore, I get to see them in both startups as well as big
enterprises. Even better, living in and working with
organizations from a <a href="http://en.wikipedia.org/wiki/Flyover_country">fly-over state</a>, I have a decidedly non-Silicon Valley perspective.</p>
<p>From this position, it’s not hard to see that making the leap from
owning a cluster to gaining insight from your data can be a daunting task. I’ll
just list a few challenges that I’ve noticed in my travels:</p>
<ul>
<li>Data has inertia</li>
<li>Hadoop is still maturing as a platform</li>
<li>Choices can be paralyzing</li>
</ul>
<p>The first is an organizational challenge, the second a
technical/product challenge, and the third a challenge of human
nature.</p>
<div id="data_has_inertia"></div>
<h2 id="data-has-inertia">Data has Inertia</h2>
<p>One of the competitive advantages of Hadoop is that inexpensive, commodity
hardware and a powerful distributed computing environment makes Hadoop a
pretty nice, cozy place for your data. This all looks great on paper
and in architecture slides. The challenge, however, is actually getting
the data to the platform.</p>
<p>Turns out moving data can be a tricky prospect. Much ink and
bits have been spilled discussing the technical approaches and
challenges to collecting
your data into a data lake. I won’t make you suffer through yet another
discussion of the finer points between <a href="http://sqoop.apache.org">sqoop</a>, <a href="http://flume.apache.org">flume</a>, etc. The technical challenges are almost never the long poles in the tent.</p>
<p>Rather, what I have witnessed is that getting that data to start moving
can be arduous and require political capital. I have noticed that there
is a tendency to treat those who come to you asking for data with a fair
amount of skepticism.</p>
<p>However, once data channels open up, data has a tendency to flow
more and more smoothly. This is why most of the successful projects
that I’ve been involved in have the following attributes:</p>
<ul>
<li>A sponsor with sufficient political power and the willingness to use it to get the data required to succeed</li>
<li>An iterative attitude so that the time to value is minimal</li>
</ul>
<p>These attributes are not specific to data science projects. Rather, the
same principle applies to all projects that require an abundance of
data. No data-oriented project can survive if starved of data and
almost all Hadoop projects are data-oriented.</p>
<div id="hadoop_maturing"></div>
<h2 id="hadoop-is-still-maturing-as-a-platform">Hadoop is Still Maturing as a Platform</h2>
<p>When I was young, I liked to climb trees. Growing up in rural Louisiana, I had plenty of opportunities on this front. As such, I got fairly good at picking out a good climbing tree. There is a non-zero intersection of trees which are good for climbing and trees which are pretty to look at or have some satisfying structural characteristics.</p>
<p>Often, however, the properties did not coexist in the same tree. Climbing trees were best if there were relatively low, thick branches with good spacing. Trees which were nice to look at were much more manicured with delicate branches and a certain symmetry.</p>
<p>Platforms have the same characteristics, I think. You have platforms that are very finely manicured with a focus on internal consistency and contained borders. This yields a good experience for those who use the system as the originators intended. These systems are pretty to look at, to be sure.</p>
<p>Ironically enough, I’ve always liked the sprawling systems with an emphasis on many integration points. They give the feeling that they are reaching out to you. That act of reaching out is the act of engaging. Hadoop is transitioning quickly from a finely manicured topiary sculpture to a fantastic climbing tree.</p>
<p>It started out very self-contained and internally consistent. If you used Java, you were going to have a good time (sometimes ;-). While you <em>could</em> use pipes and streaming to hook up your C++ code or Perl scripts, you weren’t going to have nearly as good of a time. Equivalently, on the algorithm front, if you could express what you wanted to do in MapReduce then the world was straightforward.</p>
<p><img src="files/ref_data/open_payments_files/Topiaryelephant.jpg" style="width:650px" />
<sub><a href="http://en.wikipedia.org/wiki/Topiary#mediaviewer/File:Topiaryelephant.jpg">Topiary Elephant</a> in Bang Pa In Palace, Thailand. CC BY-SA 3.0</sub></p>
<p>Now, as Hadoop matures, we see branches growing to other platforms
and to other distributed computing paradigms. On the
technical side, we can now write pure non-JVM UDFs in Pig, Spark has
proper first-class bindings for Python, and you can even write YARN apps in
languages other than those which run on the JVM. Much of this is thanks
to the new architecture in Hadoop 2.0, but more than just a technical direction,
it’s the realization by the community that we need more choices.</p>
<p>That being said, it’s early days and we’re not that far down the path to
the new way of thinking. This will be solved with time and maturity.</p>
<div id="analysis_paralysis"></div>
<h2 id="analysis-paralysis">Analysis Paralysis</h2>
<p>Data science isn’t a new thing. I know, this is a brave statement and a
deep conclusion. Forgiving its obviousness and pith, I actually mean
that most organizations are already doing and have been doing for years one of the core <em>things</em> people talk about as data science: developing insights from their data.</p>
<p>I walk into organizations and I talk with the data analysts and I ask
them about how they do their job on a day-to-day basis. Most of them
talk to me about things somewhere between logistic regression in SAS and doing very complex SQL in a data warehouse. I ask them what their pains are and almost to a person, they always say something like the following:</p>
<ul>
<li>Copies of the data are expensive with my limited quota</li>
<li>Getting the data from one system to another takes 24 hours at least.</li>
</ul>
<p>The data scientists aren’t clamoring for the things that you see so
often touted as the benefits of “Big Data”:</p>
<ul>
<li>Unstructured data</li>
<li>Running your models on a petabyte of data</li>
<li>Running sexy new algorithms at massive scale</li>
</ul>
<p>Does this mean that those things aren’t really needed? If so, our job
is easy: all we have to do is recreate SQL on Hadoop and convince
organizations to put their data there. That solves big portions of the
top complaints above.</p>
<p>The answer is obviously that the current gripes do not remove
the need for more data, differently structured data, or the other techniques in
the data science toolbag. So, why aren’t the data analysts that I talk
to chomping at the bit for them?</p>
<p>One reason, I think, is that with increasingly complex data come
increasing complexities in processing that data. Furthermore,
with structured data, the act of extracting/transforming/loading the
data was not a data scientist activity. It’s possible that, given more
complicated data, just extracting features from it might require more
arduous programming than analysts are used to. A good example of this
is within the realm of natural language processing projects.</p>
<p>Also, “Big Data” data science isn’t as convenient as small-data data science. Contrast the ease of using <a href="http://mahout.apache.org">Mahout</a> or Spark’s <a href="http://spark.apache.org/docs/1.1.0/mllib-guide.html">MLLib</a> with Python’s <a href="http://scikit-learn.org/stable/">scikit-learn</a>, <a href="http://www.r-project.org/">R</a> or <a href="http://www.sas.com/en_us/home.html">SAS</a>. It’s not a contest; it’s easier and quicker to deal with a few megabytes of data. Since there is value in dealing with much more data, we have to eat the elephant, but it can be daunting without guidance, and examples are few and far between.</p>
<p>Ultimately, I think we focus so heavily on new and novel techniques and game-changing paradigm shifts (with our tongues placed firmly in our cheeks sometimes) without discussing the journey to getting there. If we constantly look across the chasm without looking at the bridge beneath our feet, we run the risk of falling into the drink.</p>
<div id="example_analysis"></div>
<h1 id="example-analysis-with-pyspark-and-hadoop">Example Analysis with PySpark and Hadoop</h1>
<p>This brings me to why I wanted to create this post. I intend to show a
worked example of the day-to-day work I’ve seen data analysts do, along with some natural extension points that show how to use Hadoop to do possibly more interesting analysis. Namely:</p>
<ul>
<li>Understand some fundamental characteristics of the data to be analyzed</li>
<li>Generate reporting/images to communicate those characteristics to other people</li>
<li>Mine the data for likely incorrect or interesting data points that break with the characteristics found above.</li>
</ul>
<p>Over the course of the next few blog posts, I will take some recently opened data from the Centers for Medicare and Medicaid Services detailing the financial relationships between physicians, hospitals, etc. and medical manufacturers and use Spark’s Python bindings to look at the data, its shape, its outliers and look for data that may be amiss.</p>
<p>The individual phases have been split into 4 parts:</p>
<ul>
<li><a href="pyspark-openpayments-analysis-part-2.html">Data Overview and Preprocessing</a></li>
<li><a href="pyspark-openpayments-analysis-part-3.html">Basic Structural Analysis</a></li>
<li><a href="pyspark-openpayments-analysis-part-4.html">Outlier Analysis</a></li>
<li><a href="pyspark-openpayments-analysis-part-5.html">Benford’s Law Analysis</a></li>
</ul>
<div id="conclusions"></div>
<h1 id="conclusions">Conclusions</h1>
<div id="dull_blade"></div>
<h2 id="this-is-a-dull-blade-exercise">This is a Dull Blade Exercise</h2>
<p>I have been very careful not to draw conclusions or explicitly look for
fraud. This is intended to be a demonstration of technique, and I cannot
verify that this dataset isn’t rife with bad or misclassified data. As
such, I intend to demonstrate some of the basic and slightly more
advanced analysis techniques that are open to you using the Hadoop
platform.</p>
<div id="pyspark"></div>
<h2 id="pyspark--hadoop-as-a-platform">PySpark + Hadoop as a Platform</h2>
<p>If you have the interest/ability to be comfortable in a Python
environment, I think that for data investigation and ad hoc reporting,
interacting with Hadoop via IPython Notebook and the Spark Python
bindings is a fantastic experience.</p>
<p>Moving between SQL and more complex, fundamental analysis is
seamless. It all communicates in terms of RDDs for maximum ease of
composition. I could have used any of the rest of the Spark stack, such
as MLLib or GraphX.</p>
<p>Having all of this running on Hadoop, allowing me to do ETL and work in
the <em>other</em> parts of the ecosystem such as Pig, Hive, etc. is an
extremely compelling aspect as well. Ultimately, we’re approaching a
very cost effective and well thought out system for analyzing data.</p>
<p>It’s not all roses, however. When something goes wrong, it can be challenging to trace back the problem from the mix of Java/Scala and Python stack trace that is returned to you.</p>
<p>There can be some IT challenges as well. If you use a Python package in an RDD operation, you must have the package installed on the cluster. This may pose some challenges, as many different people are going to need differing versions of dependencies. Traditionally this is handled through things like virtualenv, but executing a function within the context of a virtualenv isn’t supported and, even if it were, managing a virtualenv across a set of data nodes can be a challenge in itself.</p>
<p>If you would prefer to see the raw IPython Notebook, you can find it
hosted on <a href="http://nbviewer.ipython.org/url/blog.caseystella.com/files/ref_data/open_payments_files/open_payments.ipynb">nbviewer.ipython.org</a>.</p>
Data Science and Hadoop: Part 5, Benford's Law Analysis (2014-10-24, id:/pyspark-openpayments-analysis-part-5)
<h2 id="context">Context</h2>
<p>This is the last part of a 5 part <a href="pyspark-openpayments-analysis.html">series</a> on analyzing data with PySpark:</p>
<ul>
<li><a href="pyspark-openpayments-analysis.html">Data Science and Hadoop : Impressions</a></li>
<li><a href="pyspark-openpayments-analysis-part-2.html">Data Overview and Preprocessing</a></li>
<li><a href="pyspark-openpayments-analysis-part-3.html">Basic Structural Analysis</a></li>
<li><a href="pyspark-openpayments-analysis-part-4.html">Outlier Analysis</a></li>
<li>Benford’s Law Analysis</li>
</ul>
<h1 id="benfords-law">Benford’s Law</h1>
<p><a href="http://en.wikipedia.org/wiki/Benford's_law">Benford’s Law</a> is an
interesting observation made by physicist Frank Benford in the 1930s
about the distribution of the first digits of many naturally occurring
datasets. In particular, the distribution of digits in each position
follows an expected distribution.</p>
<p>I will leave the explanation of why Benford’s law exists to better
<a href="http://www.dspguide.com/ch34.htm">sources</a>, but this observation has
become a staple of forensic accounting analysis. The reason for this
interest is that financial data often fits the law and humans, when
making up numbers, have a terrible time picking numbers that fit the
law. In particular, we’ll often intuit that digits should be more
uniformly distributed than they naturally occur.</p>
<p>Even so, violation of Benford’s law alone is insufficient to cry foul.
It’s only an indicator to be used with other evidence to point toward
fraud. It can be a misleading indicator because not all datasets abide
by the law, and it can be very sensitive to the number of observations.</p>
<h2 id="ranking-by-goodness-of-fit">Ranking by Goodness of Fit</h2>
<p>It’s of interest to us to figure out whether this payment data fits
the law and, if so, whether there are any payers who do not fit the law in a
strange way. That leaves us with the technical challenge of determining how close two distributions are so that we can rank goodness of fit. There are a few approaches to this problem:</p>
<ul>
<li><a href="http://en.wikipedia.org/wiki/Chi-squared_test">$\chi^2$ Statistical Test</a></li>
<li><a href="http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback-Leibler Divergence</a></li>
</ul>
<p>They are two approaches, one statistical and one information-theoretic,
that accomplish similar goals: telling how close two distributions are
to one another.</p>
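<p>To make the first option concrete, here is a minimal sketch of the Pearson $\chi^2$ statistic computed against Benford’s first-digit distribution; the observed digit counts below are hypothetical, purely for illustration:</p>

```python
import math

# Benford's expected first-digit probabilities: P(d) = log10(1 + 1/d)
benford = [math.log10(1 + 1.0 / d) for d in range(1, 10)]

def chi_square_stat(observed_counts, expected_probs):
    # Pearson's chi-square statistic: sum((O - E)^2 / E), where the
    # expected counts E scale the expected probabilities up to the
    # total number of observations.
    total = sum(observed_counts)
    return sum((obs - p * total) ** 2 / (p * total)
               for obs, p in zip(observed_counts, expected_probs))

# Hypothetical first-digit counts for 1,000 payments from one payer;
# a small statistic indicates a good fit to Benford's law.
observed = [301, 176, 125, 97, 79, 67, 58, 51, 46]
stat = chi_square_stat(observed, benford)
```

<p>In the analysis below, <code>scipy.stats.chisquare</code> performs the same computation and also returns a p-value for the test.</p>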
<p>I chose to rank based on KL divergence, but I compute the $\chi^2$ test as well.
I’ll <a href="http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">quote</a> briefly about Kullback-Leibler divergence to give a sense of what the ranking means:</p>
<blockquote>
<p>The Kullback–Leibler divergence of Q from P, denoted $D_{KL}(P || Q)$
, is a measure of the information lost
when Q is used to
approximate P. The KL divergence measures the expected number of
extra bits required to code samples from P when using a code based on
Q, rather than using a code based on P. Typically P represents the
“true” distribution of data, observations, or a precisely calculated
theoretical distribution. The measure Q typically represents a theory,
model, description, or approximation of P.</p>
</blockquote>
<p>In our case, P is the expected distribution based on Benford’s law and Q
is the observed distribution. We’d like to see just how much more
“complex” in some sense Q is versus P.</p>
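<p>A minimal sketch of that ranking metric, assuming only the standard library: here P is Benford’s first-digit distribution and Q is a hypothetical observed distribution (the uniform digit distribution a human making up numbers might intuit):</p>

```python
import math

# P: expected first-digit distribution under Benford's law
P = [math.log10(1 + 1.0 / d) for d in range(1, 10)]

def kl_divergence(p, q):
    # D_KL(P || Q) = sum_d p_d * log2(p_d / q_d), in bits: the expected
    # extra bits needed to encode samples from p with a code built for q.
    return sum(pd * math.log(pd / qd, 2) for pd, qd in zip(p, q))

# A payer whose digits exactly matched Benford would rank at 0;
# a uniform digit distribution diverges noticeably.
Q_uniform = [1.0 / 9] * 9
```

<p>For these two distributions the divergence works out to roughly 0.3 bits, which is the sort of gap between “fits the law” and “suspiciously uniform” that the ranking surfaces.</p>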
<h2 id="the-benford-digit-distribution">The Benford Digit Distribution</h2>
<p>For the first digit, Benford’s law states that the probability of digit
$d$ occurring is <script type="math/tex">P(d) = \log_{10}(1 + \frac{1}{d})</script>.</p>
<p>There is a generalization beyond the first digit which can be used as
well. The probability of digit $d$ occurring in the second digit is <script type="math/tex">P(d) = \sum_{k=1}^{9} \log_{10}(1 + \frac{1}{10k + d})</script>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#Compute benford's distribution for first and second digit respectively</span><br data-jekyll-commonmark-ghpages="" /><span class="n">benford_1</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span><span class="mi">0</span><span class="p">]</span> <span class="o">+</span> <span class="p">[</span><span class="n">math</span><span class="o">.</span><span class="n">log10</span><span class="p">(</span><span class="mi">1</span><span class="o">+</span><span class="mf">1.0</span><span class="o">/</span><span class="n">i</span><span class="p">)</span> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">10</span><span class="p">)])</span><br data-jekyll-commonmark-ghpages="" /><span class="n">benford_2</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">([</span> <span class="nb">sum</span><span class="p">(</span> <span class="p">[</span> <span class="n">math</span><span class="o">.</span><span class="n">log10</span><span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="mf">1.0</span><span class="o">/</span><span class="p">(</span><span class="n">j</span><span class="o">*</span><span class="mi">10</span> <span class="o">+</span> <span class="n">i</span><span class="p">))</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> 
<span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span></code></pre></figure>
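<p>As a quick sanity check on the first-digit distribution above: the probabilities telescope to exactly 1, since $\sum_d \log_{10}(\frac{d+1}{d}) = \log_{10}(10)$, and digit 1 leads about 30.1% of the time while digit 9 leads only about 4.6%:</p>

```python
import math

# Benford first-digit probabilities (without the leading 0 padding used above)
first_digit = [math.log10(1 + 1.0 / d) for d in range(1, 10)]
# first_digit[0] is about 0.301 (digit 1); first_digit[8] is about 0.046 (digit 9)
```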
<h2 id="implementation-of-benfords-law-ranking">Implementation of Benford’s Law Ranking</h2>
<p>The way we’ll approach this problem is to determine the first and second
digit distributions of payments by payer/reason and rank by goodness of fit for the first digit. Then we’ll look at the top fitting and bottom fitting payers for a few different reasons and see if we can see any patterns. One caveat is that we’ll be throwing out data under the following scenarios:</p>
<ul>
<li>The payer/reason pair does not have a minimum number of payments (350 by default)</li>
<li>The amount is less than $10 (therefore not having a second digit)</li>
</ul>
<p>Obviously the second one may skew things, as it throws out some data
within a partition but not the whole partition. In a real analysis, I’d
find a way to include those data points.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#Return a numpy array of zeros of specified $length except at</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#position $index, which has value $value</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">array_with_value</span><span class="p">(</span><span class="n">length</span><span class="p">,</span> <span class="n">index</span><span class="p">,</span> <span class="n">value</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">arr</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">(</span><span class="n">length</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">arr</span><span class="p">[</span><span class="n">index</span><span class="p">]</span> <span class="o">=</span> <span class="n">value</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">arr</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Perform chi-square test between an expected probability</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#distribution and a list of empirical frequencies.</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#Returns the chi-square statistic and the p-value for the test.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">goodness_of_fit</span><span class="p">(</span><span class="n">emp_counts</span><span class="p">,</span> <span class="n">expected_probabilities</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#convert from probabilities to counts</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">exp_distr</span> <span class="o">=</span> <span 
class="n">expected_probabilities</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">emp_counts</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">stats</span><span class="o">.</span><span class="n">chisquare</span><span class="p">(</span><span class="n">emp_counts</span><span class="p">,</span> <span class="n">f_exp</span><span class="o">=</span><span class="n">exp_distr</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /><span class="c">#For each (reason, payer) pair compute the first and second digit distribution</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#for all payments. Return a RDD with a ranked list based on likely goodness </span><br data-jekyll-commonmark-ghpages="" /><span class="c">#of fit to the distribution of first digits predicted by Benford's "Law".</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">benfords_law</span><span class="p">(</span><span class="n">min_payments</span><span class="o">=</span><span class="mi">350</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="s">"""<br data-jekyll-commonmark-ghpages="" /> Benford's "law" is a rough observation that the distribution of numbers <br data-jekyll-commonmark-ghpages="" /> for each digit position of certain data fits a specific distribution. 
<br data-jekyll-commonmark-ghpages="" /> It holds for quite a bit real-world data and, thus, has become of <br data-jekyll-commonmark-ghpages="" /> interest to forensic accountants.<br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> This function computes the distribution of first and second digits for <br data-jekyll-commonmark-ghpages="" /> each (reason, payer) pair and ranks them by goodness of fit to <br data-jekyll-commonmark-ghpages="" /> Benford's Law based on the first digit distribution.<br data-jekyll-commonmark-ghpages="" /> In particular, the goodness of fit metric that it is ranked by is <br data-jekyll-commonmark-ghpages="" /> kullback-liebler divergence, but chi-squared goodness of fit test <br data-jekyll-commonmark-ghpages="" /> is also computed and the results are cached.<br data-jekyll-commonmark-ghpages="" /> """</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#We use this one quite a bit in reducers, so it's nice to have it handy here</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sum_values</span> <span class="o">=</span> <span class="k">lambda</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">:</span><span class="n">x</span><span class="o">+</span><span class="n">y</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /> <span class="c">#Project out the reason, payer, amount, and amount_str, throwing </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#away amounts < 10 since they don't have 2nd digits. 
This probably </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#skews the results, so in real-life, I'd not throw out entries so</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#cavalierly, but for the purpose of simplicity, I've done it here.</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#Also, we're pulling out the first and second digits here</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_amount_info</span> <span class="o">=</span> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"""select reason<br data-jekyll-commonmark-ghpages="" /> , payer<br data-jekyll-commonmark-ghpages="" /> , amount<br data-jekyll-commonmark-ghpages="" /> , amount_str <br data-jekyll-commonmark-ghpages="" /> from payments<br data-jekyll-commonmark-ghpages="" /> """</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> <span class="o">></span> <span class="mi">3</span> <br data-jekyll-commonmark-ghpages="" /> <span class="ow">and</span> <span class="n">t</span><span class="p">[</span><span class="mi">2</span><span class="p">]</span> <span class="o">></span> <span class="mf">9.99</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span 
class="p">],</span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">])</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span><span class="nb">dict</span><span class="p">(</span> <span class="n">payer</span><span class="o">=</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">reason</span><span class="o">=</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">first_digit</span><span class="o">=</span><span class="n">t</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">second_digit</span><span class="o">=</span><span class="n">t</span><span class="p">[</span><span class="mi">3</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_amount_info</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#filter out the reason / payer combos that have fewer payments than </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#the minimum number of payments </span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_count</span> <span class="o">=</span> <span class="n">reason_payer_amount_info</span><span 
class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span><span class="mi">1</span><span class="p">))</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="n">sum_values</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="o">></span> <span class="n">min_payments</span><br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#inner join with the reason/payer's that fit the count requirement </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#and annotate value with the num payments</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_digits</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_amount_info</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">reason_payer_count</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span 
class="n">annotate_dict</span><span class="p">(</span><span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="s">'num_payments'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#compute the first digit distribution.</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#First we count each of the 9 possible first digits, then we translate </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#that count into a vector of dimension 10 with count for digit i </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#in position i. 
We then sum those vectors, thereby getting the </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#full frequency per digit.</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_digit_distribution</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_digits</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'first_digit'</span><span class="p">]</span> <span class="p">)</span> <span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="n">sum_values</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">array_with_value</span><span class="p">(</span><span class="mi">10</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">t</span><span 
class="p">[</span><span class="mi">1</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="n">sum_values</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#same thing with the 2nd digit</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_digit_distribution</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_payer_digits</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'second_digit'</span><span class="p">])</span> <span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="n">sum_values</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">array_with_value</span><span class="p">(</span><span 
class="mi">10</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">reduceByKey</span><span class="p">(</span><span class="k">lambda</span> <span class="n">x</span><span class="p">,</span><span class="n">y</span><span class="p">:</span><span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">x</span><span class="p">)</span> <span class="o">+</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">y</span><span class="p">))</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /> <span class="c">#We join the two, compute the goodness of fit based on chi-square test</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#and the distance from benford's distribution based on kl divergence.</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#Finally we sort by kl-divergence ascending (good fits come first).</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">first_digit_distribution</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">second_digit_distribution</span><span class="p">)</span> \<br 
data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span> <span class="p">:</span> <span class="p">(</span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">dict</span><span class="p">(</span> <span class="n">payer</span><span class="o">=</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">reason</span><span class="o">=</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">first_digit_distr</span><span class="o">=</span><span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">second_digit_distr</span><span class="o">=</span><span class="n">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">first_digit_fit</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">goodness_of_fit</span><span class="p">(</span><span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)[</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">benford_1</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span 
class="mi">10</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">second_digit_fit</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">goodness_of_fit</span><span class="p">(</span><span class="n">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">),</span> <span class="n">benford_2</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">kl_divergence</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">stats</span><span class="o">.</span><span class="n">entropy</span><span class="p">(</span> <span class="n">benford_1</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)[</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span> <span class="p">:</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'kl_divergence'</span><span class="p">],</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="p">)</span>\<br 
data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">sortByKey</span><span class="p">(</span><span class="bp">True</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span> <span class="p">:</span> <span class="p">(</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'payer'</span><span class="p">],</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'reason'</span><span class="p">]),</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="n">benford_data</span> <span class="o">=</span> <span class="n">benfords_law</span><span class="p">(</span><span class="mi">400</span><span class="p">)</span> </code></pre></figure>
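<p>To make the goodness-of-fit ranking above concrete, here is a minimal plain-Python sketch of the core calculation. The helper names and the toy digit counts are illustrative assumptions, not values from the analysis; the pipeline itself uses <code>scipy.stats.entropy</code> for the KL divergence, which computes the same quantity.</p>

```python
import math

# Benford's Law: expected probability that the leading digit is d (for d in 1..9)
def benford_first_digit(d):
    return math.log10(1.0 + 1.0 / d)

# KL divergence sum(p * log(p/q)) between two discrete distributions.
# This is what scipy.stats.entropy(p, q) computes (natural log).
def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0 and qi > 0)

# Hypothetical first-digit counts for one (reason, payer) pair
observed_counts = [310, 170, 120, 95, 80, 65, 58, 55, 47]
total = float(sum(observed_counts))
observed_distr = [c / total for c in observed_counts]

# Benford's expected first-digit distribution over digits 1..9
benford = [benford_first_digit(d) for d in range(1, 10)]

# Small divergence means the payments look Benford-like; the Spark job
# sorts (reason, payer) pairs ascending on exactly this value.
fit = kl_divergence(benford, observed_distr)
```

<p>Sorting ascending on this divergence, as <code>sortByKey(True)</code> does above, surfaces the best-fitting (reason, payer) pairs first; an unusually large divergence is the kind of anomaly a forensic accountant would follow up on.</p>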
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#Plot the distribution of first and second digit side-by-side for a set of payers.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">plot_figure</span><span class="p">(</span><span class="n">title</span><span class="p">,</span><span class="n">entries</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">num_rows</span> <span class="o">=</span> <span class="nb">len</span><span class="p">(</span><span class="n">entries</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">fig</span><span class="p">,</span> <span class="n">axes</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">nrows</span><span class="o">=</span><span class="nb">len</span><span class="p">(</span><span class="n">entries</span><span class="p">),</span> <span class="n">ncols</span><span class="o">=</span><span class="mi">2</span><span class="p">,</span> <span class="n">figsize</span><span class="o">=</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span><span class="mi">12</span><span class="p">))</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">fig</span><span class="o">.</span><span class="n">tight_layout</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">plt</span><span class="o">.</span><span class="n">subplots_adjust</span><span class="p">(</span><span class="n">top</span><span class="o">=</span><span class="mf">0.91</span><span class="p">,</span> <span class="n">hspace</span><span class="o">=</span><span class="mf">0.55</span><span class="p">,</span> <span class="n">wspace</span><span class="o">=</span><span class="mf">0.3</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span 
class="n">fig</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="n">title</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="n">bar_width</span> <span class="o">=</span> <span class="o">.</span><span class="mi">4</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">entry</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">entries</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_digit_distr</span> <span class="o">=</span> <span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'first_digit_distr'</span><span class="p">][</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sample_label_1</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="s">"""$</span><span class="err">\</span><span class="s">chi$={}, $p$={}, kl={}, n={}"""</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">format</span><span class="p">(</span> <span class="n">locale</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">'</span><span class="si">%0.2</span><span class="s">f'</span>\<br data-jekyll-commonmark-ghpages="" /> <span 
class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'first_digit_fit'</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">locale</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">'</span><span class="si">%0.2</span><span class="s">f'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'first_digit_fit'</span><span class="p">][</span><span class="mi">1</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">locale</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">'</span><span class="si">%0.2</span><span class="s">f'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'kl_divergence'</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">first_digit_distr</span><span class="p">))</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" 
/><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_digit_distr</span> <span class="o">=</span> <span class="n">first_digit_distr</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">first_digit_distr</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span> <span class="p">,</span><span class="n">first_digit_distr</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.35</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="n">bar_width</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Sample"</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span><span class="o">+</span><span class="n">bar_width</span> <span class="p">,</span><span class="n">benford_1</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="mi">10</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span 
class="n">alpha</span><span class="o">=</span><span class="mf">0.35</span><span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="n">bar_width</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Benford'</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">grid</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Probability'</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">first_ax</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"{} First Digit</span><span class="se">\n</span><span class="s">{}"</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span 
class="p">,</span> <span class="s">'ignore'</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">sample_label_1</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span> <span class="o">=</span> <span class="n">axes</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_digit_distr</span> <span class="o">=</span> <span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'second_digit_distr'</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sample_label_2</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="s">'$</span><span class="err">\</span><span class="s">chi$={}, $p$={}, n={}'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">locale</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">'</span><span class="si">%0.2</span><span class="s">f'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'second_digit_fit'</span><span class="p">][</span><span class="mi">0</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">locale</span><span class="o">.</span><span class="n">format</span><span 
class="p">(</span><span class="s">'</span><span class="si">%0.2</span><span class="s">f'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">float</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'second_digit_fit'</span><span class="p">][</span><span class="mi">1</span><span class="p">])</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">int</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">second_digit_distr</span><span class="p">))</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_digit_distr</span> <span class="o">=</span> <span class="n">second_digit_distr</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">second_digit_distr</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span> <span class="p">,</span><span class="n">second_digit_distr</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.35</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'blue'</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span 
class="n">bar_width</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">"Sample"</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">bar</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">)</span> <span class="o">+</span> <span class="n">bar_width</span><span class="p">,</span><span class="n">benford_2</span><span class="p">,</span> <span class="n">alpha</span><span class="o">=</span><span class="mf">0.35</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">color</span><span class="o">=</span><span class="s">'red'</span><span class="p">,</span> <span class="n">width</span><span class="o">=</span><span class="n">bar_width</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'Benford'</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">set_xticks</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="mi">10</span><span class="p">))</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">legend</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">grid</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">set_ylabel</span><span class="p">(</span><span class="s">'Probability'</span><span class="p">)</span><br 
data-jekyll-commonmark-ghpages="" /> <span class="n">second_ax</span><span class="o">.</span><span class="n">set_title</span><span class="p">(</span><span class="s">"{} Second Digit</span><span class="se">\n</span><span class="s">{}"</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">entry</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span> <span class="s">'ignore'</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">sample_label_2</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /><span class="c">#Take n-worst or best (depending on t) entries for reason based on goodness</span><br data-jekyll-commonmark-ghpages="" /><span class="c"># of fit for benford's law and plot the first/second digit distributions </span><br data-jekyll-commonmark-ghpages="" /><span class="c">#versus benford's distribution side-by-side as well as the distribution </span><br data-jekyll-commonmark-ghpages="" /><span class="c">#of kl-divergences.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">benford_summary</span><span class="p">(</span><span class="n">reason</span><span class="p">,</span> <span class="n">data</span> <span class="o">=</span> <span class="n">benford_data</span><span class="p">,</span> <span class="n">n</span><span class="o">=</span><span class="mi">5</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="s">'best'</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" 
/> <span class="n">raw_data</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span> <span class="o">==</span> <span class="n">reason</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">s</span> <span class="o">=</span> <span class="p">[]</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">if</span> <span class="n">t</span> <span class="o">==</span> <span class="s">'best'</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">s</span><span class="o">=</span><span class="n">raw_data</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="n">n</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">plot_figure</span><span class="p">(</span><span class="s">"Top 5 Best Fitting Benford Analysis for {}"</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">reason</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">s</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">else</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">s</span><span class="o">=</span><span class="n">raw_data</span><span class="p">[</span><span class="o">-</span><span class="n">n</span><span class="p">:][::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> 
<span class="n">plot_figure</span><span class="p">(</span><span class="s">"Top 5 Worst Fitting Benford Analysis for {}"</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">reason</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">s</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">plot_outliers</span><span class="p">(</span> <span class="p">[</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'kl_divergence'</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">raw_data</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="p">[</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'kl_divergence'</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">s</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">reason</span> <span class="o">+</span> <span class="s">" KL Divergence"</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span></code></pre></figure>
<h2 id="best-and-worst-fitting-payers-for-gifts">Best and Worst Fitting Payers for Gifts</h2>
<p>Pretty much across the board the p-values for the $\chi^2$ test were quite weak, but the first 4 are a pretty good fit. The last one is interesting: the spike at the digit 6 is exactly the kind of thing that is of interest to forensic accountants. I repeat, however, that this is <strong>not</strong> an indicator which can safely be used alone to level a charge of fraud.</p>
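<p>As a point of reference, the goodness-of-fit machinery here is compact enough to sketch directly. In the snippet below the Benford probabilities $P(d)=\log_{10}(1+1/d)$ come from the analysis itself, but the observed digit counts are invented purely for illustration:</p>

```python
import numpy as np
from scipy.stats import chisquare

# Benford's expected first-digit probabilities: P(d) = log10(1 + 1/d), d = 1..9
benford = np.log10(1.0 + 1.0 / np.arange(1, 10))

# Hypothetical first-digit counts for a single payer (illustrative only)
observed = np.array([110, 62, 41, 35, 28, 24, 20, 18, 15])

# Chi-squared goodness-of-fit against the Benford-expected counts.
# A small p-value rejects the hypothesis that the digits follow Benford's law.
stat, p_value = chisquare(observed, f_exp=observed.sum() * benford)
```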
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Gift Benford Analysis (Best)</span><br data-jekyll-commonmark-ghpages="" /><span class="n">benford_summary</span><span class="p">(</span><span class="s">'Gift'</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="s">'best'</span><span class="p">)</span></code></pre></figure>
<p><img src="files/ref_data/open_payments_files/open_payments_21_0.png" style="width:650px" /></p>
<p>Below is the density plot for the Kullback-Leibler divergences for the top best fits. You can see there’s a clump at 0 and a clump a bit farther out, but no real outliers.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_21_1.png" alt="png" /></p>
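<p>The KL divergences behind these density plots are cheap to compute. Here is a minimal sketch; the smoothing constant <code>eps</code> is my own addition to avoid $\log(0)$, not something taken from the pipeline above:</p>

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q): how far distribution p diverges from reference q (in nats)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

benford = np.log10(1.0 + 1.0 / np.arange(1, 10))
uniform = np.full(9, 1.0 / 9)

kl_same = kl_divergence(benford, benford)  # ~0: identical distributions
kl_far = kl_divergence(uniform, benford)   # positive: uniform digits fit Benford poorly
```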
<p>Now we look at the worst fitting gift payers. The best and worst lists overlap, as you
can see, because there just aren’t that many organizations that pay out more
than 350 gifts over the course of the year.</p>
<p>Two things that are interesting:</p>
<ul>
<li>Mentor Worldwide does not fit the decreasing probability distribution that we would expect. Many payers diverge from Benford’s law, but it’s interesting when they break the basic form of decreasing probabilities as digits progress from 1 through 9.</li>
<li>Benco Dental Supply has a huge number of payments starting with 1. This is likely an indication that they have a standard gift that they give out.</li>
</ul>
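<p>The “basic form of decreasing probabilities” mentioned above is easy to verify numerically: under Benford’s law the first-digit probabilities $P(d)=\log_{10}(1+1/d)$ fall monotonically, from about 30.1% for a leading 1 down to about 4.6% for a leading 9.</p>

```python
import numpy as np

# First-digit probabilities under Benford's law: P(d) = log10(1 + 1/d)
p = np.log10(1.0 + 1.0 / np.arange(1, 10))

print(np.round(p, 3))  # [0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046]
assert np.all(np.diff(p) < 0)  # strictly decreasing from digit 1 through 9
```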
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">benford_summary</span><span class="p">(</span><span class="s">'Gift'</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="s">'worst'</span><span class="p">)</span></code></pre></figure>
<p><img src="files/ref_data/open_payments_files/open_payments_22_0.png" style="width:650px" /></p>
<p><img src="files/ref_data/open_payments_files/open_payments_22_1.png" alt="png" /></p>
<h2 id="best-and-worst-fitting-payers-for-travel-and-lodging">Best and Worst Fitting Payers for Travel and Lodging</h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c"># Travel and Lodging Benford Analysis</span><br data-jekyll-commonmark-ghpages="" /><span class="n">benford_summary</span><span class="p">(</span><span class="s">"Travel and Lodging"</span><span class="p">)</span></code></pre></figure>
<p>The top fitting payers for travel and lodging fit fairly well, so it’s
certainly possible to fit the distribution well.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_23_0.png" style="width:650px" /></p>
<p><img src="files/ref_data/open_payments_files/open_payments_23_1.png" alt="png" /></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">benford_summary</span><span class="p">(</span><span class="s">"Travel and Lodging"</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="s">'worst'</span><span class="p">)</span></code></pre></figure>
<p>You can see a few payers diverging from the general form of Benford’s
distribution here. LDR Holding is the outlier in terms of goodness of
fit, as you can see from the density plot below as well.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_24_0.png" style="width:650px" /></p>
<p><img src="files/ref_data/open_payments_files/open_payments_24_1.png" alt="png" /></p>
<h2 id="best-and-worst-fitting-payers-for-consulting-fees">Best and Worst Fitting Payers for Consulting Fees</h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">benford_summary</span><span class="p">(</span><span class="s">"Consulting Fee"</span><span class="p">)</span></code></pre></figure>
<p>The best fitting payers here fit quite well.
<img src="files/ref_data/open_payments_files/open_payments_25_0.png" style="width:650px" /></p>
<p><img src="files/ref_data/open_payments_files/open_payments_25_1.png" alt="png" /></p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">benford_summary</span><span class="p">(</span><span class="s">"Consulting Fee"</span><span class="p">,</span> <span class="n">t</span><span class="o">=</span><span class="s">'worst'</span><span class="p">)</span></code></pre></figure>
<p>For both UCB and Merck, we see a huge proportion of payments starting with 7. This indicates a standardized payment of some sort, I’d wager. The interesting thing about Merck is that the 1’s distribution is pretty spot on, but the rest of the density gets pushed into 7.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_26_0.png" style="width:650px" /></p>
<p><img src="files/ref_data/open_payments_files/open_payments_26_1.png" alt="png" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>This concludes the basic example of doing analytics with the Spark
platform. General conclusions and impressions from this whole exercise can be found <a href="pyspark-openpayments-analysis.html">here</a>.</p>
Data Science and Hadoop: Part 4, Outlier Analysis2014-10-24T00:00:00+00:00id:/pyspark-openpayments-analysis-part-4<h2 id="context">Context</h2>
<p>This is the fourth part of a 5 part <a href="pyspark-openpayments-analysis.html">series</a> on analyzing data with PySpark:</p>
<ul>
<li><a href="pyspark-openpayments-analysis.html">Data Science and Hadoop : Impressions</a></li>
<li><a href="pyspark-openpayments-analysis-part-2.html">Data Overview and Preprocessing</a></li>
<li><a href="pyspark-openpayments-analysis-part-3.html">Basic Structural Analysis</a></li>
<li>Outlier Analysis</li>
<li><a href="pyspark-openpayments-analysis-part-5.html">Benford’s Law Analysis</a></li>
</ul>
<h1 id="outlier-analysis">Outlier Analysis</h1>
<p>Generating summary statistics is very helpful for</p>
<ul>
<li>Understanding the overall shape of the data</li>
<li>Looking at trends at the extremes (answering most, least, etc. style questions)</li>
</ul>
<link rel="stylesheet" href="files/css/theme.cstella.css" />
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js"></script>
<script src="//cdn.jsdelivr.net/tablesorter/2.15.13/js/jquery.tablesorter.min.js"></script>
<script>
$(document).ready(function()
{
for (i = 1;i <= 11;++i) {
$("#resultTable"+i).tablesorter(
{ theme: "cstella" }
);
}
}
);
</script>
<p>One thing we’re doing when looking for fishy data is trying to zoom in
quickly on points which fall outside of the norm.
Furthermore, the aim is to tag these automatically as data is ingested
so they can be acted on. This action can be raising an alert, logging a
warning or generating a report, but ultimately we want a technique that
finds these outliers quickly and without human intervention.</p>
<h2 id="median-absolute-deviation">Median Absolute Deviation</h2>
<p>We’re looking to create a mechanism to rank data points by their likelihood of being an outlier along with a threshold to differentiate them from inliers.</p>
<p>The area of <a href="http://en.wikipedia.org/wiki/Anomaly_detection">outlier analysis</a> is a vibrant one and there are
quite a few techniques to choose from, ranging from the exotic, like
<a href="http://scikit-learn.org/stable/modules/outlier_detection.html#id1">fitting an elliptic
envelope</a>,
to the straightforward, like setting a threshold based on standard
deviations away from the mean. For our purposes, we’ll choose a middle
path, but be aware that there are
<a href="http://www.itl.nist.gov/div898/handbook/eda/section4/eda43.htm#Barnett">book-length</a> treatments of the subject of outlier analysis.</p>
<p><a href="http://en.wikipedia.org/wiki/Median_absolute_deviation">Median Absolute
Deviation</a> is a
robust statistic used, like the standard deviation, as a measure of
variability in a univariate dataset. Its definition is
straightforward:</p>
<blockquote>
<p>Given univariate data $X$ with $\tilde{x}=$median($X$),
MAD($X$)=median($\{\,|x_i - \tilde{x}| : x_i \in X\,\}$).</p>
</blockquote>
<p>As compared to the standard deviation, it’s a bit more resilient to outliers because it doesn’t square the deviations, which would weight large values very heavily. Quoting from the <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm">Engineering Statistics
Handbook</a>:</p>
<blockquote>
<p>The standard deviation is an example of an estimator that is the best
we can do if the underlying distribution is normal. However, it lacks
robustness of validity. That is, confidence intervals based on the
standard deviation tend to lack precision if the underlying
distribution is in fact not normal.</p>
<p>The median absolute deviation and the interquartile range are estimates
of scale that have robustness of validity. However, they are not
particularly strong for robustness of efficiency.</p>
<p>If histograms and probability plots indicate that your data are in fact
reasonably approximated by a normal distribution, then it makes sense to
use the standard deviation as the estimate of scale. However, if your
data are not normal, and in particular if there are long tails, then
using an alternative measure such as the median absolute deviation,
average absolute deviation, or interquartile range makes sense.</p>
</blockquote>
<p>In the implementation we’ll be taking guidance from the <a href="http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm">Engineering Statistics
Handbook</a>
and <a href="http://www.itl.nist.gov/div898/handbook/eda/section4/eda43.htm#Iglewicz">Iglewicz and Hoaglin</a>. As such, we define an outlier like so:</p>
<blockquote>
<p>For a set of univariate data $X$ with $\tilde{x} =$ median($X$), an outlier is an element $x_i \in X$
such that $M_i = \frac{0.6745(x_i - \tilde{x})}{MAD(X)} > 3.5$, where
$M_i$ is denoted the <em>modified Z-score</em>.</p>
</blockquote>
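<p>In plain numpy, on a toy univariate sample, that definition works out as below. The payment values are invented for illustration; only the 0.6745 constant and the 3.5 threshold come from Iglewicz and Hoaglin:</p>

```python
import numpy as np

def modified_z_scores(x):
    """M_i = 0.6745 * (x_i - median(X)) / MAD(X), per Iglewicz and Hoaglin."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    if mad == 0:
        return np.zeros_like(x)  # degenerate sample; nothing can be flagged
    return 0.6745 * (x - med) / mad

payments = [10.0, 12.0, 11.0, 13.0, 9.0, 10.0, 500.0]
outliers = [p for p, m in zip(payments, modified_z_scores(payments)) if m > 3.5]
print(outliers)  # [500.0]: only the anomalously large payment is flagged
```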
<p>Before we jump into the actual algorithm, we create some helper
functions to make the code a bit more readable and allow us to display
the outliers.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#Some useful functions for more advanced analytics</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Joins in spark take RDD[K,V] x RDD[K,U] => RDD[K, [U,V] ]</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#This function returns U</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">0</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Joins in spark take RDD[K,V] x RDD[K,U] => RDD[K, [U,V] ]</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#This function returns V</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="mi">1</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Add a key/value to a dictionary and return the dictionary</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">annotate_dict</span><span class="p">(</span><span class="n">d</span><span class="p">,</span> <span class="n">k</span><span class="p">,</span> <span class="n">v</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">d</span><span class="p">[</span><span class="n">k</span><span 
class="p">]</span> <span class="o">=</span> <span class="n">v</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">d</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Plots a density plot of a set of points representing inliers and outliers</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#A rugplot is used to indicate the points and the outliers are marked in red.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">plot_outliers</span><span class="p">(</span><span class="n">inliers</span><span class="p">,</span> <span class="n">outliers</span><span class="p">,</span> <span class="n">reason</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">fig</span><span class="p">,</span> <span class="n">ax</span> <span class="o">=</span> <span class="n">plt</span><span class="o">.</span><span class="n">subplots</span><span class="p">(</span><span class="n">nrows</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sns</span><span class="o">.</span><span class="n">distplot</span><span class="p">(</span><span class="n">inliers</span> <span class="o">+</span> <span class="n">outliers</span><span class="p">,</span> <span class="n">ax</span><span class="o">=</span><span class="n">ax</span><span class="p">,</span> <span class="n">rug</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span> <span class="n">hist</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">ax</span><span class="o">.</span><span class="n">plot</span><span class="p">(</span><span class="n">outliers</span><span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros_like</span><span class="p">(</span><span 
class="n">outliers</span><span class="p">),</span> <span class="s">'ro'</span><span class="p">,</span> <span class="n">clip_on</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">fig</span><span class="o">.</span><span class="n">suptitle</span><span class="p">(</span><span class="s">'Distribution for {} Values'</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">reason</span><span class="p">),</span> <span class="n">size</span><span class="o">=</span><span class="mi">14</span><span class="p">)</span></code></pre></figure>
<p>Now, onto the implementation of the outlier analysis. Before we
start, I’d like to make a couple of notes about the implementation and
possible scalability challenges going forward.</p>
<p>We are partitioning the data by (payment reason, physician specialty). I do not want to analyze outliers based on a cohort of data across a whole reason, but rather I want to know if a point is an outlier for a given specialty <em>and</em> reason.</p>
<p>If a coarser partitioning strategy is taken, or the amount of data per
partition becomes very large, the median implementation may become the
limiting factor for scalability. There are a few options here,
including a tighter median implementation (numpy’s could
be tighter as of this writing) or a streaming estimate. Needless to say, this is something that bears some thought going forward.</p>
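<p>For a sense of what a cheaper median might look like, the classic two-heap running median computes an exact median in $O(\log n)$ time per element without sorting the whole partition at once; it still holds all points in memory, so a true constant-memory streaming estimate (e.g. the P² algorithm) would be the next step. This helper is my own sketch, not part of the pipeline below:</p>

```python
import heapq

class RunningMedian:
    """Exact running median via two heaps: lo holds the smaller half
    (as a negated max-heap), hi holds the larger half (a min-heap)."""
    def __init__(self):
        self.lo = []  # negated values: -lo[0] is the max of the lower half
        self.hi = []  # hi[0] is the min of the upper half

    def add(self, x):
        # Route x through hi so the smallest of hi + {x} lands in lo...
        heapq.heappush(self.lo, -heapq.heappushpop(self.hi, x))
        # ...then rebalance so hi is never smaller than lo.
        if len(self.lo) > len(self.hi):
            heapq.heappush(self.hi, -heapq.heappop(self.lo))

    def median(self):
        if len(self.hi) > len(self.lo):
            return float(self.hi[0])
        return (self.hi[0] - self.lo[0]) / 2.0

rm = RunningMedian()
for amount in [5.0, 1.0, 3.0, 2.0, 4.0]:
    rm.add(amount)
print(rm.median())  # 3.0
```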
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#Outlier analysis using Median Absolute Deviation</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Using reservoir sampling, uniformly sample N points</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#requires O(N) memory</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">sample_points</span><span class="p">(</span><span class="n">points</span><span class="p">,</span> <span class="n">N</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sample</span> <span class="o">=</span> <span class="p">[];</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">i</span><span class="p">,</span><span class="n">point</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">points</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">if</span> <span class="n">i</span> <span class="o"><</span> <span class="n">N</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sample</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">point</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">elif</span> <span class="n">i</span> <span class="o">>=</span> <span class="n">N</span> <span class="ow">and</span> <span class="n">random</span><span class="o">.</span><span class="n">random</span><span class="p">()</span> <span class="o"><</span> <span class="n">N</span><span class="o">/</span><span class="nb">float</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">replace</span> <span class="o">=</span> <span 
class="n">random</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span><span class="nb">len</span><span class="p">(</span><span class="n">sample</span><span class="p">)</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">sample</span><span class="p">[</span><span class="n">replace</span><span class="p">]</span> <span class="o">=</span> <span class="n">point</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">sample</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Returns a function which will extract the median at location 'key'</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#a list of dictionaries.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">median_func</span><span class="p">(</span><span class="n">key</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#Right now it uses numpy's median, but probably a quickselect</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#implementation is called for as I expect this doesn't scale</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="k">lambda</span> <span class="n">partition_value</span> <span class="p">:</span> <span class="p">(</span> <span class="n">partition_value</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">np</span><span class="o">.</span><span class="n">median</span><span class="p">(</span> <br data-jekyll-commonmark-ghpages="" /> <span class="p">[</span> <span class="n">d</span><span class="p">[</span><span class="n">key</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span 
class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">partition_value</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Compute the modified z-score for use by as per Iglewicz and Hoaglin:</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#Boris Iglewicz and David Hoaglin (1993),</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#"Volume 16: How to Detect and Handle Outliers",</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#The ASQC Basic References in Quality Control: Statistical Techniques</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#, Edward F. Mykytka, Ph.D., Editor.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">get_z_score</span><span class="p">(</span><span class="n">reason_to_diff</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">med</span> <span class="o">=</span> <span class="n">join_rhs</span><span class="p">(</span><span class="n">reason_to_diff</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">if</span> <span class="n">med</span> <span class="o">></span> <span class="mi">0</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="mf">0.6745</span> <span class="o">*</span> <span class="n">join_lhs</span><span class="p">(</span><span class="n">reason_to_diff</span><span class="p">)[</span><span class="s">'diff'</span><span class="p">]</span> <span class="o">/</span> <span class="n">med</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">else</span><span 
class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="mi">0</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">is_outlier</span><span class="p">(</span><span class="n">thresh</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="k">lambda</span> <span class="n">reason_to_diff</span> <span class="p">:</span> <span class="n">get_z_score</span><span class="p">(</span><span class="n">reason_to_diff</span><span class="p">)</span> <span class="o">></span> <span class="n">thresh</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Return an RDD of a uniform random sample of a specified size per key</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">get_inliers</span><span class="p">(</span><span class="n">reason_amount_pairs</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="mi">2000</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">group_by_reason</span> <span class="o">=</span> <span class="n">reason_amount_pairs</span><span class="o">.</span><span class="n">groupByKey</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">group_by_reason</span><span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span> <span class="p">:</span> <span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">sample_points</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">size</span><span class="p">)))</span><br
data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Return the outliers based on Median Absolute Deviation</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#See http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm </span><br data-jekyll-commonmark-ghpages="" /><span class="c">#for more info.</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#The input key structure is reason_specialty => dict(amount</span><br data-jekyll-commonmark-ghpages="" /><span class="c"># , physician</span><br data-jekyll-commonmark-ghpages="" /><span class="c"># , payer</span><br data-jekyll-commonmark-ghpages="" /><span class="c"># , specialty</span><br data-jekyll-commonmark-ghpages="" /><span class="c"># )</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">get_outliers</span><span class="p">(</span><span class="n">reason_amount_pairs</span><span class="p">,</span> <span class="n">thresh</span><span class="o">=</span><span class="mf">3.5</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="s">"""<br data-jekyll-commonmark-ghpages="" /> This uses the median absolute deviation (MAD) statistic to find<br data-jekyll-commonmark-ghpages="" /> outliers for each reason x specialty partitions.<br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> Outliers are computed as follows: <br data-jekyll-commonmark-ghpages="" /> * Let X be all the payments for a given specialty, reason pair<br data-jekyll-commonmark-ghpages="" /> * Let x_i be a payment in X<br data-jekyll-commonmark-ghpages="" /> * Let MAD be the median absolute deviation, defined as<br data-jekyll-commonmark-ghpages="" /> MAD = median( for all x in X, | x - median(X)| )<br data-jekyll-commonmark-ghpages="" /> * Let M_i be the modified z-score for payment x_i, defined as<br data-jekyll-commonmark-ghpages="" /> 0.6745*(x_i − 
median(X) )/MAD<br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> As per the recommendations by Iglewicz and Hoaglin, a payment is<br data-jekyll-commonmark-ghpages="" /> considered an outlier if the modified z-score, M_i > thresh, which<br data-jekyll-commonmark-ghpages="" /> is 3.5 by default.<br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> REFERENCE:<br data-jekyll-commonmark-ghpages="" /> Boris Iglewicz and David Hoaglin (1993),<br data-jekyll-commonmark-ghpages="" /> "Volume 16: How to Detect and Handle Outliers",<br data-jekyll-commonmark-ghpages="" /> The ASQC Basic References in Quality Control: Statistical Techniques,<br data-jekyll-commonmark-ghpages="" /> Edward F. Mykytka, Ph.D., Editor.<br data-jekyll-commonmark-ghpages="" /> """</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="n">group_by_reason</span> <span class="o">=</span> <span class="n">reason_amount_pairs</span><span class="o">.</span><span class="n">groupByKey</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#Filter by only reason/specialty's with more than 1k entries</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#and compute the median of the amounts across the partition.</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#NOTE: There may be some scalability challenges around median,</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#so some care should be taken to reimplement this if partitioning</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#by (reason, specialty) does not yield small enough numbers to </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#handle in an individual map function.</span><br data-jekyll-commonmark-ghpages="" /> <span 
class="n">reason_to_median</span> <span class="o">=</span> <span class="n">group_by_reason</span><span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="nb">len</span><span class="p">(</span><span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">])</span> <span class="o">></span> <span class="mi">1000</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">median_func</span><span class="p">(</span><span class="s">'amount'</span><span class="p">))</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c">#Join the base, non-grouped data, with the median per key,</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#consider just the payments more than the median</span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#since we're looking for large money outliers and annotate </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#the dictionary for each entry x_i with the following:</span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># * diff = |x_i - median(X)| in the parlance of the comment above.</span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># NOTE: Strictly speaking I can drop the absolute value since </span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># x_i > median(X), but I choose not to.</span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># * median = median(X)</span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># </span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_abs_dist_from_median</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_amount_pairs</span><span class="o">.</span><span class="n">join</span><span 
class="p">(</span><span class="n">reason_to_median</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span> <span class="p">:</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)[</span><span class="s">'amount'</span><span class="p">]</span> <span class="o">></span> <span class="n">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span> <span class="p">:</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">(</span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">dict</span><span class="p">(</span> <span class="n">diff</span><span class="o">=</span><span class="nb">abs</span><span class="p">(</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)[</span><span class="s">'amount'</span><span class="p">]</span> <span class="o">-</span> <span class="n">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">row</span><span class="o">=</span><span class="n">annotate_dict</span><span class="p">(</span><span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span 
class="p">,</span> <span class="s">'median'</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">join_rhs</span><span class="p">(</span><span class="n">t</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c"># Given diff cached per element, we need only compute the median </span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># of the diffs to compute the MAD. </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#Remember, MAD = median( for all x in X, | x - median(X)| )</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_to_MAD</span> <span class="o">=</span> <span class="n">reason_abs_dist_from_median</span><span class="o">.</span><span class="n">groupByKey</span><span class="p">()</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="n">median_func</span><span class="p">(</span><span class="s">'diff'</span><span class="p">))</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">reason_to_MAD</span><span class="o">.</span><span class="n">take</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <br data-jekyll-commonmark-ghpages="" /> <span class="c"># Joining the grouped data to get both | x_i - median(X) | </span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># and MAD in the same place, we can compute the modified z-score</span><br data-jekyll-commonmark-ghpages="" /> <span class="c"># , 0.6745*| x_i - median(X)| / MAD, and filter by the ones which </span><br
data-jekyll-commonmark-ghpages="" /> <span class="c"># are more than threshold we can then do some pivoting of keys and </span><br data-jekyll-commonmark-ghpages="" /> <span class="c">#sort by that threshold to give us the ranked list of outliers.</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">reason_abs_dist_from_median</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">reason_to_MAD</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="n">is_outlier</span><span class="p">(</span><span class="n">thresh</span><span class="p">))</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span><span class="n">get_z_score</span><span class="p">(</span><span class="n">t</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">annotate_dict</span><span class="p">(</span><span class="n">join_lhs</span><span class="p">(</span><span class="n">t</span><span class="p">)[</span><span class="s">'row'</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="s">'key'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="n">sortByKey</span><span class="p">(</span><span 
class="bp">False</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="p">(</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'key'</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">annotate_dict</span><span class="p">(</span> <span class="n">t</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="s">'mad'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">t</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Filter the outliers by reason and return a RDD with just the outliers </span><br data-jekyll-commonmark-ghpages="" /><span class="c">#of a specified reason.</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">get_by_reason</span><span class="p">(</span><span class="n">outliers</span><span class="p">,</span> <span class="n">reason</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">return</span> <span class="n">outliers</span><span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">t</span><span class="p">:</span> <span class="nb">str</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="n">t</span><span 
class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="s">'ignore'</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span><span class="n">reason</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Grab data using Spark-SQL and filter with spark core RDD operations </span><br data-jekyll-commonmark-ghpages="" /><span class="c">#to only yield the data we want, ones with physicians, payers and reasons</span><br data-jekyll-commonmark-ghpages="" /><span class="n">reason_amount_pairs</span> <span class="o">=</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="n">sqlContext</span><span class="o">.</span><span class="n">sql</span><span class="p">(</span><span class="s">"""select reason<br data-jekyll-commonmark-ghpages="" /> , physician_specialty<br data-jekyll-commonmark-ghpages="" /> , amount<br data-jekyll-commonmark-ghpages="" /> , physician_id<br data-jekyll-commonmark-ghpages="" /> , payer <br data-jekyll-commonmark-ghpages="" /> from payments"""</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">filter</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span><span class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">reason</span><span class="p">)</span> <span class="o">></span> <span class="mi">3</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="ow">and</span> <span 
class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">physician_id</span><span class="p">)</span> <span class="o">></span> <span class="mi">3</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="ow">and</span> <span class="nb">len</span><span class="p">(</span><span class="n">row</span><span class="o">.</span><span class="n">payer</span><span class="p">)</span> <span class="o">></span> <span class="mi">3</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="o">.</span><span class="nb">map</span><span class="p">(</span><span class="k">lambda</span> <span class="n">row</span><span class="p">:</span> <span class="p">(</span> <span class="s">"{}_{}"</span><span class="o">.</span><span class="n">format</span><span class="p">(</span> <span class="n">row</span><span class="o">.</span><span class="n">reason</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">row</span><span class="o">.</span><span class="n">physician_specialty</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="nb">dict</span><span class="p">(</span><span class="n">amount</span><span class="o">=</span><span class="n">row</span><span class="o">.</span><span class="n">amount</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span><span class="n">physician_id</span><span class="o">=</span><span class="n">row</span><span class="o">.</span><span class="n">physician_id</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span><span class="n">payer</span><span class="o">=</span><span class="n">row</span><span class="o">.</span><span class="n">payer</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span><span class="n">specialty</span><span class="o">=</span><span class="n">row</span><span 
class="o">.</span><span class="n">physician_specialty</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /><span class="c">#Get the outliers based on a modified z-score threshold of 3.5</span><br data-jekyll-commonmark-ghpages="" /><span class="n">outliers</span> <span class="o">=</span> <span class="n">get_outliers</span><span class="p">(</span><span class="n">reason_amount_pairs</span><span class="p">,</span> <span class="mf">3.5</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#Get a sample per specialty/reason partition</span><br data-jekyll-commonmark-ghpages="" /><span class="n">inliers</span> <span class="o">=</span> <span class="n">get_inliers</span><span class="p">(</span><span class="n">reason_amount_pairs</span><span class="p">)</span></code></pre></figure>
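<p>To make the statistic above concrete, here is a small, self-contained distillation of the MAD-based outlier test outside of Spark. This is a sketch of the technique rather than the post’s actual code; the function name and the zero-MAD guard are my own.</p>

```python
import numpy as np

def mad_outliers(amounts, thresh=3.5):
    """Return the large-payment outliers in amounts, i.e. the values
    whose modified z-score M_i = 0.6745*(x_i - median(X))/MAD exceeds
    thresh (3.5 by default, per Iglewicz and Hoaglin)."""
    x = np.asarray(amounts, dtype=float)
    med = np.median(x)
    # MAD = median( for all x in X, |x - median(X)| )
    mad = np.median(np.abs(x - med))
    if mad == 0.0:
        return x[:0]  # degenerate partition: no spread, no outliers
    m = 0.6745 * (x - med) / mad
    # one-sided, as in the post: we only care about large payments
    return x[m > thresh]
```

<p>For example, <code>mad_outliers([10, 12, 11, 13, 9, 11, 500])</code> flags only the 500. A classical 3-sigma rule on the same data actually misses it, because the 500 itself inflates the mean and standard deviation; that robustness is the whole reason for using the median here.</p>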
<p>Now that we have found the outliers per specialty/reason partition, and
a sample of inliers, let’s display them so that we can get a sense of
how sensitive the outlier detection is.</p>
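<p>The <code>get_inliers</code> function called above is not shown in this excerpt. Conceptually it just draws a modest sample of payments from each (reason, specialty) partition so we have a background population to compare against. A plain-Python sketch of that idea (the name, sample size, and seed are my own assumptions, and the real version runs as a Spark transformation):</p>

```python
import random
from collections import defaultdict

def sample_inliers(reason_amount_pairs, n=100, seed=1234):
    """Group (key, payment_dict) pairs by their reason_specialty key
    and draw up to n payments per partition. Mirrors the
    (key, list-of-dicts) shape that display_info expects for inliers."""
    groups = defaultdict(list)
    for key, payment in reason_amount_pairs:
        groups[key].append(payment)
    rng = random.Random(seed)  # fixed seed for reproducible samples
    return [(key, rng.sample(payments, min(n, len(payments))))
            for key, payments in groups.items()]
```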
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#display the top k outliers in a table and a distribution plot</span><br data-jekyll-commonmark-ghpages="" /><span class="c">#of an inlier sample along with the outliers rug-plotted in red</span><br data-jekyll-commonmark-ghpages="" /><span class="k">def</span> <span class="nf">display_info</span><span class="p">(</span><span class="n">inliers_raw</span><span class="p">,</span> <span class="n">outliers_raw_tmp</span><span class="p">,</span> <span class="n">reason</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">outliers_raw</span> <span class="o">=</span> <span class="p">[]</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">if</span> <span class="n">k</span> <span class="ow">is</span> <span class="bp">None</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">outliers_raw</span> <span class="o">=</span> <span class="nb">sorted</span><span class="p">(</span><span class="n">outliers_raw_tmp</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">d</span><span class="p">:</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'amount'</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">else</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">outliers_raw</span> <span class="o">=</span> <span class="nb">sorted</span><span 
class="p">(</span><span class="n">outliers_raw_tmp</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">key</span><span class="o">=</span><span class="k">lambda</span> <span class="n">d</span><span class="p">:</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'amount'</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">reverse</span><span class="o">=</span><span class="bp">True</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)[</span><span class="mi">0</span><span class="p">:</span><span class="n">k</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">inlier_pts</span> <span class="o">=</span> <span class="p">[]</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="p">[</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">inliers_raw</span><span class="p">]:</span><br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="n">i</span><span class="p">:</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">inlier_pts</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">j</span><span class="p">[</span><span class="s">'amount'</span><span class="p">])</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">outlier_pts</span><span class="o">=</span> <span class="p">[</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'amount'</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span 
class="ow">in</span> <span class="n">outliers_raw</span><span class="p">]</span><br data-jekyll-commonmark-ghpages="" /> <span class="n">plot_outliers</span><span class="p">(</span><span class="n">inlier_pts</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">1500</span><span class="p">],</span> <span class="n">outlier_pts</span><span class="p">,</span> <span class="n">reason</span><span class="p">)</span><br data-jekyll-commonmark-ghpages="" /><br data-jekyll-commonmark-ghpages="" /> <span class="n">print_table</span><span class="p">([</span><span class="s">'Physician'</span><span class="p">,</span><span class="s">'Specialty'</span><span class="p">,</span> <span class="s">'Payer'</span><span class="p">,</span> <span class="s">'Amount'</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="p">[</span><span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'physician_id'</span><span class="p">]</span> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">outliers_raw</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="p">[</span> <span class="p">[</span> <span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'specialty'</span><span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'payer'</span><span class="p">]</span><span class="o">.</span><span class="n">encode</span><span class="p">(</span><span class="s">'ascii'</span><span class="p">,</span> <span class="s">'ignore'</span><span class="p">)</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="p">,</span> <span class="s">'$'</span> <span 
class="o">+</span> <span class="n">locale</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="s">'</span><span class="si">%0.2</span><span class="s">f'</span><span class="p">,</span> <span class="n">d</span><span class="p">[</span><span class="mi">1</span><span class="p">][</span><span class="s">'amount'</span><span class="p">],</span> <span class="n">grouping</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">]</span> \<br data-jekyll-commonmark-ghpages="" /> <span class="k">for</span> <span class="n">d</span> <span class="ow">in</span> <span class="n">outliers_raw</span><span class="p">]</span>\<br data-jekyll-commonmark-ghpages="" /> <span class="p">)</span></code></pre></figure>
<h2 id="food-and-beverage-purchase-outliers">Food and Beverage Purchase Outliers</h2>
<p>Let’s look at the top 4 outliers for Food and Beverage payments. I
could have shown all of the outliers, but I found that the first few give
the biggest bang for the buck in terms of interesting findings.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="c">#outliers for food and beverage purchases</span><br data-jekyll-commonmark-ghpages="" /><span class="n">food_outliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">outliers</span><span class="p">,</span> <span class="s">'Food and Beverage'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">food_inliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">inliers</span><span class="p">,</span> <span class="s">'Food and Beverage'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">display_info</span><span class="p">(</span><span class="n">food_inliers</span><span class="p">,</span> <span class="n">food_outliers</span><span class="p">,</span> <span class="s">'Food and Beverage'</span><span class="p">,</span> <span class="mi">4</span><span class="p">)</span></code></pre></figure>
<p>As we can see, the misclassified data from Teleflex is rearing its head
again with a huge single payment for food. Looking further down the
list, however, Biolase is paying quite a bit to a single dentist for food.</p>
<table id="resultTable7" class="tablesorter">
<thead>
<tr style="text-align: center;">
<th>Physician</th>
<th>Specialty</th>
<th>Payer</th>
<th>Amount</th>
</tr>
</thead>
<tbody>
<tr>
<td> 200720</td>
<td> Allopathic & Osteopathic Physicians/ Surgery</td>
<td> Teleflex Medical Incorporated</td>
<td> $68,750.00</td>
</tr>
<tr>
<td> 28946</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> BIOLASE, INC.</td>
<td> $13,297.15</td>
</tr>
<tr>
<td> 28946</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> BIOLASE, INC.</td>
<td> $8,111.82</td>
</tr>
<tr>
<td> 28946</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> BIOLASE, INC.</td>
<td> $8,111.82</td>
</tr>
</tbody>
</table>
<p>Below is a density plot of a sample of inliers, with the outliers
rug-plotted in red. You can see how far out along the tail of the
density plot these outliers sit. Most food and beverage payment data
hovers much closer to $0.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_13_1.png" alt="png" /></p>
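<p>The <code>plot_outliers</code> helper used throughout is also not defined in this excerpt. A minimal matplotlib sketch of what the plot above might look like (this is my reconstruction, not the original; the real version presumably uses a kernel density estimate rather than a histogram):</p>

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this in a notebook
import matplotlib.pyplot as plt

def plot_outliers(inlier_pts, outlier_pts, reason):
    """Density histogram of an inlier sample with the outliers
    rug-plotted in red along the x-axis."""
    fig, ax = plt.subplots()
    ax.hist(inlier_pts, bins=50, density=True, alpha=0.6)
    # the rug: one red tick per outlier, sitting on the x-axis
    ax.plot(outlier_pts, [0.0] * len(outlier_pts), "|",
            color="red", markersize=20)
    ax.set_title("Payments: {}".format(reason))
    ax.set_xlabel("Amount ($)")
    ax.set_ylabel("Density")
    return fig
```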
<h2 id="travel-and-lodging-outliers">Travel and Lodging Outliers</h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">travel_outliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">outliers</span><span class="p">,</span> <span class="s">'Travel and Lodging'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">travel_inliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">inliers</span><span class="p">,</span> <span class="s">'Travel and Lodging'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">display_info</span><span class="p">(</span><span class="n">travel_inliers</span><span class="p">,</span> <span class="n">travel_outliers</span><span class="p">,</span> <span class="s">'Travel and Lodging'</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span></code></pre></figure>
<p>All I can say is that Physician 106320 must travel far more than I do
to rack up $155k in travel payments in 2013. I hope that triple platinum
status on Delta is worth it. :)</p>
<table id="resultTable8" class="tablesorter">
<thead>
<tr style="text-align: left;">
<th>Physician</th>
<th>Specialty</th>
<th>Payer</th>
<th>Amount</th>
</tr>
</thead>
<tbody>
<tr>
<td> 106320</td>
<td> Allopathic & Osteopathic Physicians/ Psychiatry & Neurology/ Neurology</td>
<td> Boehringer Ingelheim Pharma GmbH & Co.KG</td>
<td> $155,772.00</td>
</tr>
<tr>
<td> 472722</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Nephrology</td>
<td> Merck Sharp & Dohme Corporation</td>
<td> $75,000.00</td>
</tr>
<tr>
<td> 371379</td>
<td> Allopathic & Osteopathic Physicians/ Orthopaedic Surgery</td>
<td> Exactech, Inc.</td>
<td> $65,798.00</td>
</tr>
<tr>
<td> 198801</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Cardiovascular Disease</td>
<td> Medtronic Vascular, Inc.</td>
<td> $41,232.80</td>
</tr>
<tr>
<td> 382697</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Nephrology</td>
<td> Genentech, Inc.</td>
<td> $39,978.80</td>
</tr>
<tr>
<td> 169095</td>
<td> Allopathic & Osteopathic Physicians/ Surgery</td>
<td> Medtronic Vascular, Inc.</td>
<td> $37,683.00</td>
</tr>
<tr>
<td> 80052</td>
<td> Allopathic & Osteopathic Physicians/ Family Medicine</td>
<td> Boehringer Ingelheim Pharma GmbH & Co.KG</td>
<td> $24,911.25</td>
</tr>
<tr>
<td> 202461</td>
<td> Allopathic & Osteopathic Physicians/ Thoracic Surgery (Cardiothoracic Vascular Surgery)</td>
<td> Covidien LP</td>
<td> $21,594.51</td>
</tr>
<tr>
<td> 378722</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine</td>
<td> GlaxoSmithKline, LLC.</td>
<td> $20,112.40</td>
</tr>
<tr>
<td> 243205</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Interventional Cardiology</td>
<td> Medtronic Vascular, Inc.</td>
<td> $19,273.90</td>
</tr>
</tbody>
</table>
<p>You can see on the density plot that the next-nearest outlier is pretty far away, and that we have clumps of outliers around $20k and $40k. Interesting things to look into.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_14_1.png" alt="png" /></p>
<h2 id="consulting-fee-outliers">Consulting Fee Outliers</h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">consulting_outliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">outliers</span><span class="p">,</span> <span class="s">'Consulting Fee'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">consulting_inliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">inliers</span><span class="p">,</span> <span class="s">'Consulting Fee'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">display_info</span><span class="p">(</span><span class="n">consulting_inliers</span><span class="p">,</span> <span class="n">consulting_outliers</span><span class="p">,</span> <span class="s">'Consulting Fee'</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span></code></pre></figure>
<p>Looking at consulting fee outliers, you can see some clumping, but most
of the data falls below $50k. That makes the $200k outlier from Teva
all the more interesting. Of course, none of this is any
indication of wrongdoing, just interesting spikes in the data.</p>
<table id="resultTable9" class="tablesorter">
<thead>
<tr style="text-align: center;">
<th>Physician</th>
<th>Specialty</th>
<th>Payer</th>
<th>Amount</th>
</tr>
</thead>
<tbody>
<tr>
<td> 104930</td>
<td> Allopathic & Osteopathic Physicians/ Psychiatry & Neurology/ Neurology</td>
<td> Teva Pharmaceuticals USA, Inc.</td>
<td> $207,500.00</td>
</tr>
<tr>
<td> 151515</td>
<td> Other Service Providers/ Specialist</td>
<td> Alcon Research Ltd</td>
<td> $150,000.00</td>
</tr>
<tr>
<td> 309376</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine</td>
<td> Teva Pharmaceuticals USA, Inc.</td>
<td> $137,559.67</td>
</tr>
<tr>
<td> 231913</td>
<td> Allopathic & Osteopathic Physicians/ Orthopaedic Surgery</td>
<td> Exactech, Inc.</td>
<td> $108,125.00</td>
</tr>
<tr>
<td> 465481</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Rheumatology</td>
<td> Vision Quest Industries Inc.</td>
<td> $102,196.09</td>
</tr>
<tr>
<td> 409799</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Endocrinology, Diabetes & Metabolism</td>
<td> Pfizer Inc.</td>
<td> $100,000.00</td>
</tr>
<tr>
<td> 206227</td>
<td> Allopathic & Osteopathic Physicians/ Orthopaedic Surgery</td>
<td> DePuy Synthes Sales Inc.</td>
<td> $93,750.00</td>
</tr>
<tr>
<td> 436192</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine</td>
<td> Pfizer Inc.</td>
<td> $90,000.00</td>
</tr>
<tr>
<td> 306965</td>
<td> Allopathic & Osteopathic Physicians/ Psychiatry & Neurology/ Neurology</td>
<td> Teva Pharmaceuticals USA, Inc.</td>
<td> $64,125.00</td>
</tr>
<tr>
<td> 163888</td>
<td> Allopathic & Osteopathic Physicians/ Internal Medicine/ Cardiovascular Disease</td>
<td> Boehringer Ingelheim Pharmaceuticals, Inc.</td>
<td> $61,025.00</td>
</tr>
</tbody>
</table>
<p>You can see that most of the density sits below $20k, which makes that
$200k outlier all the more interesting.</p>
<p><img src="files/ref_data/open_payments_files/open_payments_15_1.png" alt="png" /></p>
<h2 id="gift-outliers">Gift Outliers</h2>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><span class="n">gift_outliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">outliers</span><span class="p">,</span> <span class="s">'Gift'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">gift_inliers</span> <span class="o">=</span> <span class="n">get_by_reason</span><span class="p">(</span><span class="n">inliers</span><span class="p">,</span> <span class="s">'Gift'</span><span class="p">)</span><span class="o">.</span><span class="n">collect</span><span class="p">()</span><br data-jekyll-commonmark-ghpages="" /><span class="n">display_info</span><span class="p">(</span><span class="n">gift_inliers</span><span class="p">,</span> <span class="n">gift_outliers</span><span class="p">,</span> <span class="s">'Gift'</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span></code></pre></figure>
<p>Gifts are, I think, the most interesting payment reason in this whole
dataset. I am intrigued by why a physician might receive a gift rather than
an outright fee. Going in, I imagined that gifts would be low-value
items, but the table clearly shows that dentists are receiving substantial
gifts. Most interesting to me is that all of the top 10 outliers are
dentists.</p>
<table id="resultTable10" class="tablesorter">
<thead>
<tr style="text-align: center;">
<th>Physician</th>
<th>Specialty</th>
<th>Payer</th>
<th>Amount</th>
</tr>
</thead>
<tbody>
<tr>
<td> 225073</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> Dentalez Alabama, Inc.</td>
<td> $56,422.00</td>
</tr>
<tr>
<td> 167931</td>
<td> Dental Providers/ Dentist</td>
<td> DENTSPLY IH Inc.</td>
<td> $8,672.50</td>
</tr>
<tr>
<td> 380517</td>
<td> Dental Providers/ Dentist</td>
<td> DENTSPLY IH Inc.</td>
<td> $8,672.50</td>
</tr>
<tr>
<td> 380073</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> Benco Dental Supply Co.</td>
<td> $7,570.00</td>
</tr>
<tr>
<td> 403926</td>
<td> Dental Providers/ Dentist</td>
<td> A-dec, Inc.</td>
<td> $5,430.00</td>
</tr>
<tr>
<td> 429612</td>
<td> Dental Providers/ Dentist</td>
<td> PureLife, LLC</td>
<td> $5,058.72</td>
</tr>
<tr>
<td> 404935</td>
<td> Dental Providers/ Dentist</td>
<td> A-dec, Inc.</td>
<td> $5,040.00</td>
</tr>
<tr>
<td> 8601</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> DentalEZ, Inc.</td>
<td> $3,876.35</td>
</tr>
<tr>
<td> 385314</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> Henry Schein, Inc.</td>
<td> $3,789.99</td>
</tr>
<tr>
<td> 389592</td>
<td> Dental Providers/ Dentist/ General Practice</td>
<td> Henry Schein, Inc.</td>
<td> $3,789.99</td>
</tr>
</tbody>
</table>
<p><img src="files/ref_data/open_payments_files/open_payments_16_1.png" alt="png" /></p>
<h2 id="up-next">Up Next</h2>
<p><a href="pyspark-openpayments-analysis-part-5.html">Next</a>, we look for anomalies
in our payment data by using Benford’s Law. This is part of a
broader <a href="pyspark-openpayments-analysis.html">series</a> of posts about Data
Science and Hadoop.</p>
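<p>Benford's Law predicts that in many naturally occurring datasets the leading digit <em>d</em> appears with frequency log<sub>10</sub>(1 + 1/<em>d</em>), so roughly 30.1% of amounts should start with a 1. As a small preview of the idea, here is a minimal, self-contained sketch of the expected and observed first-digit distributions (the helper names are illustrative, not the ones used in the next post):</p>

```python
import math
from collections import Counter

def benford_expected(digit):
    """Expected frequency of `digit` (1-9) as a leading digit under Benford's Law."""
    return math.log10(1 + 1.0 / digit)

def leading_digit(amount):
    """First nonzero digit of a positive payment amount."""
    s = "{:.10f}".format(abs(amount)).replace(".", "").lstrip("0")
    return int(s[0])

def first_digit_distribution(amounts):
    """Observed frequency of each leading digit 1-9 in a list of amounts."""
    digits = [leading_digit(a) for a in amounts if a > 0]
    counts = Counter(digits)
    total = len(digits)
    return {d: counts.get(d, 0) / float(total) for d in range(1, 10)}
```

Large deviations between the observed distribution and the Benford expectation are a classic red flag for fabricated or manipulated figures, which is what makes it a natural anomaly-detection tool for payment data.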