Benford’s Law is an
interesting observation made by physicist Frank Benford in the 1930s
about the distribution of the first digits of many naturally occurring
datasets. In particular, the digits in each position follow a
predictable, non-uniform distribution.
I will leave the explanation of why Benford’s law exists to better
sources, but this observation has
become a staple of forensic accounting analysis. The reason for this
interest is that financial data often fits the law and humans, when
making up numbers, have a terrible time picking numbers that fit the
law. In particular, we’ll often intuit that digits should be more
uniformly distributed than they naturally are.
Even so, violation of Benford’s law alone is insufficient to cry foul.
It’s only an indicator to be used with other evidence to point toward
fraud. It can be a misleading indicator because not all datasets abide
by the law and it can be very sensitive to the number of observations.
Ranking by Goodness of Fit
We’d like to figure out whether this payment data fits
the law and, if so, whether any payers deviate from it in a
strange way. That leaves us with the technical challenge of determining how close two distributions are so that we can rank goodness of fit. There are a couple of approaches to this problem:
Pearson’s $\chi^2$ goodness-of-fit test
Kullback-Leibler (KL) divergence
These are two approaches, one statistical and one information-theoretic,
that accomplish similar goals: telling how close two distributions are
to each other.
I chose to rank based on KL divergence, but I compute the $\chi^2$ test as well.
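As a rough illustration of the $\chi^2$ side of this, here is a minimal Python sketch of Pearson's statistic for first-digit counts. It is not the code used for this analysis. The function name, the sample counts, and the choice of a 5% significance level are all assumptions made for the example.

```python
import math

# Benford's expected first-digit probabilities for digits 1..9.
BENFORD_FIRST = [math.log10(1 + 1 / d) for d in range(1, 10)]

def chi_squared_statistic(observed_counts, expected_probs):
    """Pearson's chi-squared statistic for observed counts vs. expected probabilities."""
    n = sum(observed_counts)
    return sum(
        (obs - n * p) ** 2 / (n * p)
        for obs, p in zip(observed_counts, expected_probs)
    )

# Critical value of the chi-squared distribution with 8 degrees of freedom
# (9 digit categories minus 1) at the 5% significance level.
CRITICAL_8_DOF_5PCT = 15.507

# Hypothetical counts (not from the payment data): 450 payments spread
# uniformly, 50 per leading digit, the way made-up numbers might look.
uniform_counts = [50] * 9
stat = chi_squared_statistic(uniform_counts, BENFORD_FIRST)
rejects_benford = stat > CRITICAL_8_DOF_5PCT
```

For fixed observed proportions the statistic grows linearly with the number of observations, which is one way to see why the test is so sensitive to sample size: doubling every count doubles the statistic even though the shape of the distribution is unchanged.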
I’ll quote a brief description of Kullback-Leibler divergence to give a sense of what the ranking means:
The Kullback–Leibler divergence of Q from P, denoted $D_{KL}(P || Q)$,
is a measure of the information lost when Q is used to
approximate P. The KL divergence measures the expected number of
extra bits required to code samples from P when using a code based on
Q, rather than using a code based on P. Typically P represents the
“true” distribution of data, observations, or a precisely calculated
theoretical distribution. The measure Q typically represents a theory,
model, description, or approximation of P.
In our case, P is the expected distribution based on Benford’s law and Q
is the observed distribution. We’d like to see just how much more
“complex” in some sense Q is versus P.
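To make the definition concrete, here is a minimal Python sketch of the discrete KL divergence in bits, with P as the Benford first-digit distribution and Q as an observed distribution. The function name and the uniform example are assumptions for illustration, not the code behind the rankings.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) in bits for two discrete distributions of equal length.

    Terms with p_i == 0 contribute nothing; if q_i == 0 where p_i > 0,
    the divergence is infinite.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue
        if q_i == 0:
            return math.inf
        total += p_i * math.log2(p_i / q_i)
    return total

# P: Benford's expected first-digit distribution for digits 1..9.
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]

# Q: a hypothetical observed distribution, here perfectly uniform digits.
uniform = [1 / 9] * 9

divergence = kl_divergence(benford, uniform)
```

One consequence of putting Benford in the P position is that any payer with a first digit that never occurs gets an infinite divergence. A real pipeline would probably need smoothing or a minimum sample size to avoid that.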
The Benford Digit Distribution
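For the first digit, Benford’s law states that the probability of digit
$d \in \{1, \dots, 9\}$ occurring is $P(d) = \log_{10}\left(1 + \frac{1}{d}\right)$.
There is a generalization beyond the first digit which can be used as
well. The probability of digit $d \in \{0, \dots, 9\}$ occurring in the second position is $P(d) = \sum_{k=1}^{9} \log_{10}\left(1 + \frac{1}{10k + d}\right)$.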
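Both distributions are easy to tabulate. The sketch below uses the standard closed forms, $\log_{10}(1 + 1/d)$ for the first digit and the sum over leading digits $k = 1, \dots, 9$ of $\log_{10}(1 + 1/(10k + d))$ for the second. The function names are my own.

```python
import math

def benford_first_digit(d):
    """Probability that the first significant digit is d, for d in 1..9."""
    return math.log10(1 + 1 / d)

def benford_second_digit(d):
    """Probability that the second significant digit is d, for d in 0..9.

    Sums over every possible first digit k, since the two-digit prefix
    10*k + d has probability log10(1 + 1 / (10*k + d)).
    """
    return sum(math.log10(1 + 1 / (10 * k + d)) for k in range(1, 10))

first = {d: benford_first_digit(d) for d in range(1, 10)}
second = {d: benford_second_digit(d) for d in range(0, 10)}

for d in range(1, 10):
    print(f"first digit {d}: {first[d]:.4f}")
for d in range(0, 10):
    print(f"second digit {d}: {second[d]:.4f}")
```

The second-digit distribution is much flatter than the first: it runs from roughly 12% for 0 down to about 8.5% for 9, whereas the first digit runs from about 30% for 1 down to under 5% for 9.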
Implementation of Benford’s Law Ranking
The way we’ll approach this problem is to determine the first and second
digit distributions of payments by payer/reason and rank by goodness of fit for the first digit. Then we’ll look at the top fitting and bottom fitting payers for a few different reasons and see if we can see any patterns. One caveat is that we’ll be throwing out data under the following scenarios:
The payer/specialty does not have at least 400 payments
The amount is less than $10 (therefore not having a second digit)
Obviously the second one may skew things as it throws out some data
within a partition but not the whole partition. In a real analysis, I’d
find a way to include those data points.
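The filtering and ranking steps above can be sketched in plain Python as follows. This is not the Spark job itself. The `(payer, reason, amount)` record shape and all function names are hypothetical, and I am assuming the 400-payment threshold is applied after amounts under $10 are dropped, which the description leaves open.

```python
import math
from collections import Counter, defaultdict

MIN_PAYMENTS = 400  # partition size threshold from the text
MIN_AMOUNT = 10.0   # smaller amounts have no second digit before the decimal point

BENFORD_FIRST = [math.log10(1 + 1 / d) for d in range(1, 10)]

def leading_digits(amount):
    """Return (first, second) digits of the integer part of an amount >= 10."""
    whole = str(int(amount))
    return int(whole[0]), int(whole[1])

def kl_from_benford(first_digit_counts):
    """D_KL(Benford || observed) in bits; infinite if some digit never occurs."""
    n = sum(first_digit_counts.values())
    total = 0.0
    for d, p in zip(range(1, 10), BENFORD_FIRST):
        q = first_digit_counts.get(d, 0) / n
        if q == 0:
            return math.inf
        total += p * math.log2(p / q)
    return total

def rank_partitions(payments, min_payments=MIN_PAYMENTS):
    """Rank (payer, reason) partitions from best to worst first-digit fit.

    payments: iterable of (payer, reason, amount) tuples (assumed shape).
    """
    counts = defaultdict(Counter)
    for payer, reason, amount in payments:
        if amount < MIN_AMOUNT:
            continue  # assumption: dropped before the size threshold is checked
        first, _second = leading_digits(amount)
        counts[(payer, reason)][first] += 1
    scored = [
        (key, kl_from_benford(c))
        for key, c in counts.items()
        if sum(c.values()) >= min_payments
    ]
    return sorted(scored, key=lambda item: item[1])
```

The sketch only scores the first digit, since that is what the ranking uses. The second digit is extracted by `leading_digits` so that a second-digit distribution could be tallied the same way.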
Best and Worst Fitting Payers for Gifts
Pretty much across the board, the p-values for the $\chi^2$ test were
very weak, but the first 4 are a pretty good fit. The last one is
interesting: that spike at the digit 6 is the kind of thing that is of interest to forensic accountants. I repeat, however, that this is not an indicator which can be used safely alone to level a charge of fraud.
Below is the density plot for the Kullback-Leibler divergences for the best fits. You can see there’s a clump at 0 and a clump a bit farther out, but no real outliers.
Now we look at the worst fitting gift payers. The lists overlap, as you
can see, because there just aren’t that many organizations that pay more
than 350 gifts out over the course of the year.
Two things that are interesting:
Mentor Worldwide does not fit the decreasing probability distribution that we would expect. Many payers diverge from Benford’s law, but it’s interesting when they break the basic form of decreasing probabilities as digits progress from 1 through 9.
Benco Dental Supply has a huge number of payments starting with 1. This is likely an indication that they have a standard gift that they give out.
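Both kinds of pattern can be checked mechanically rather than by eye. The sketch below is an illustration only: the function names are my own and the example shares are made up, not taken from either payer.

```python
import math

BENFORD_FIRST = [math.log10(1 + 1 / d) for d in range(1, 10)]

def is_monotone_decreasing(freqs):
    """True if the shares never increase as the digit goes from 1 to 9."""
    return all(a >= b for a, b in zip(freqs, freqs[1:]))

def largest_excess(freqs):
    """Return (digit, excess) for the digit whose share most exceeds Benford's."""
    excess = [q - p for q, p in zip(freqs, BENFORD_FIRST)]
    i = max(range(9), key=lambda k: excess[k])
    return i + 1, excess[i]

# Hypothetical first-digit shares with a spike at 6 (not from the data).
spiked = [0.28, 0.16, 0.11, 0.08, 0.06, 0.17, 0.05, 0.05, 0.04]

breaks_shape = not is_monotone_decreasing(spiked)
digit, excess = largest_excess(spiked)
```

A payer can score poorly on KL divergence while still keeping the decreasing shape, so the monotonicity check and the excess per digit give a slightly different view than the ranking alone.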
Best and Worst Fitting Payers for Travel and Lodging
The top fitting payers for travel and lodging fit fairly well, so it’s
certainly possible to fit the distribution well.
You can see a few payers diverging from the general form of Benford’s
distribution here. LDR Holding is the outlier in terms of goodness of
fit as you can see from the density plot below as well.
Best and Worst Fitting Payers for Consulting Fees
The good fits are pretty good.
For both UCB and Merck, we see a payer with a huge share of payments starting with 7. This indicates a standardized payment of some sort, I’d wager. The interesting thing about Merck is that the share of 1’s is pretty spot on; the rest of the density gets pushed into 7.
Conclusion
This concludes the basic example of doing analytics with the Spark
platform. General conclusions and impressions from this whole exercise can be found here.