This is a place where I attempt to form coherent thoughts about current technology, computer science, math and the general things happening on the Internet.
A feed of the most recent posts is available.
A Blockchain Story Told Through The Eyes of Two Users
Cryptocurrencies are generating a lot of hype right now. Blockchain-based systems expose a huge amount of transparent detail about how a currency is actually used. Despite this, most analysis still follows traditional security analysis, which ignores much of the available data in favor of simpler metrics that treat the system as a black box. Here, I look at two deeper analytics that tell a story of how a cryptocurrency is actually used, which may be of interest to blockchain developers and investors alike.
Word2Vec with Non-Textual Data
I describe some of the challenges of understanding data, then go into technical detail on how to borrow a scalable unsupervised learning technique from natural language processing and pair it with a very nice data visualization technique to reveal the natural organization and arrangement of the data.
Data Science and Hadoop: Impressions and Example
A discussion of the challenges of doing Data Science projects with Hadoop.
Data Science and Hadoop: Part 5, Benford's Law Analysis
Benford's Law analysis of healthcare payment data with Spark using Median Absolute Divergence.
Data Science and Hadoop: Part 4, Outlier Analysis
Outlier analysis of healthcare payment data with Spark using Median Absolute Deviation.
Data Science and Hadoop: Part 3, Basic Structural Analysis
Basic structural analysis of healthcare payment data using Spark SQL and Python.
Data Science and Hadoop: Part 2, Data Overview and Preprocessing
The first of a series of analyses with PySpark on healthcare financial data: data overview and preprocessing for the Centers for Medicare & Medicaid Services Open Payments data.
Making Sense of Political Texts with NLP
Clustering senatorial speeches from 2008 by topic using t-distributed stochastic neighbor embedding (t-SNE) and latent Dirichlet allocation (LDA).
Spark for Data Science: A Case Study
An analysis of which Unix commands appear together more often than random chance would suggest.
Better News through Computational Political Science
I recently gave a talk on an NLP project that I worked on for Kent's ACM
Hadoop Best Practices
I recently gave a talk at the Cleveland Hadoop User Group on Hadoop Best Practices.