This is a place where I attempt to form coherent thoughts about current technology, computer science, math and the general things happening on the Internet.
A feed of the most recent posts is available.
I'm going to describe some of the challenges of understanding data, then go into technical detail on how to borrow scalable unsupervised learning techniques from natural language processing and pair them with a very nice data visualization technique to reveal the natural organization and arrangement of data.
Benford's Law analysis of healthcare payment data with Spark using Median Absolute Divergence.
Outlier analysis of healthcare payment data with Spark using Median Absolute Deviation.
Basic structural analysis of healthcare payment data using Spark SQL and Python.
Part 1 of a series of analyses of healthcare financial data with PySpark. Data overview and preprocessing for the Centers for Medicare & Medicaid Services Open Payments data.
A discussion of the challenges of doing Data Science projects with Hadoop.
Clustering senatorial speeches from 2008 by topic using t-distributed stochastic neighbor embedding (t-SNE) and latent Dirichlet allocation (LDA).
An analysis of which Unix commands appear together more than random chance would suggest.
I recently gave a talk on an NLP project that I worked on for Kent's ACM