Word2Vec with Non-Textual Data

At least half of the battle of data analysis and data science is understanding your data.

That sounds obvious, but I’ve seen whole data science projects fail because not nearly enough time was spent on understanding the data. There are only two real ways to go about doing this: ask subject matter experts, and look at the data yourself.

To have a shot at doing this you really have to do both.

In this blog post, I’m going to describe some of the challenges of understanding data. I’ll then go into some technical detail on how to borrow a scalable unsupervised learning technique from natural language processing and couple it with a data visualization that helps reveal the natural organization and arrangement of the data.

Subject Matter Experts

I spend a lot of time with healthcare data and the obvious subject matter experts are nurses and doctors. These people are very gracious, very knowledgeable and extremely pressed for time. The problem with expert knowledge is that it’s surprisingly hard to communicate enough nuance to help the working data scientist accomplish their goals. Furthermore, it’s extremely time consuming.
This is made doubly hard when the expert is entirely unclear about the goal.

The second, perhaps less obvious, challenge is that subject matter experts’ knowledge is biased toward that which is already known. Often data scientists and analysts are trying to understand the data not as an end, but rather as a means to gaining insight. If you only take into account received knowledge, then arriving at unexpected insights can be challenging. That being said, spending time with subject matter experts is a necessary yet insufficient part of data analysis.

Unsupervised Data Understanding

To complete the task of understanding your data, I have found that it is necessary to spend time looking at the data. One can think of the entire field of statistics as an exercise in building a mechanism to ask data pointed questions and get answers that we can trust, often with caveats.
The goal is generally to get a sense of how the data is organized or arranged.
With the unbelievable complexity of most real data, we are forced to simplify our representations. The question is precisely how to simplify that representation to find the proper balance between simplicity and complexity. More than that, some representations offer useful views of the data for certain purposes and not for others.

Common simplified representations of data are things like distributions, histograms, and plots. Of course there are other even more complex ways to represent your data. Whole companies have been formed around providing a way to gain insight through more complex organizations of the data, taking some of the burden of interpretation from our brain and encoding it in an organization scheme.

Today, I’d like to talk about another approach to data simplification for event data, one that provides not just an interesting representation but also a way to ask certain kinds of useful questions of your data.


One common way to impose order on data that is used by engineers and mathematicians everywhere is to embed your data in a vector space with a metric on it.
This gives us a couple of things:

The issue now is how to impose this structure by embedding your data, which may not even be numeric, into a vector space. Thankfully, the nice people at Google developed a way of doing this in the domain of natural language text called Word2Vec.

I won’t go into extravagant detail about the implementation, as Radim Řehůřek did a great job here.
The major takeaway, however, is that using the inherent structure of natural language, Word2Vec is able to construct a vector space such that a

This is a surprisingly rich organization of data and one that has proven very effective in enhancing the accuracy of machine learning models that deal with natural language. Perhaps the most surprising part of this is that the vectorization model does not utilize any of the grammatical structure of the natural language directly. It simply analyzes the words within the sentences and, through usage, fits the proper embedding. This led me to consider whether other, non-textual data that has some inherent structure can also be organized this way with the same algorithm.
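To make "through usage" a little more concrete, here is a minimal sketch of how a skip-gram style model sees a sentence: as pairs of a word and its nearby neighbours, with no grammar involved. The function name `skipgram_pairs`, the window size and the example sentence are my own illustrative assumptions, not code from Word2Vec or Spark.

```python
def skipgram_pairs(sentence, window=2):
    """Return (center, context) pairs for every token in a tokenized sentence.

    Illustrative only: real Word2Vec implementations also subsample,
    randomize the window and train a neural model on these pairs.
    """
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)                  # left edge of the context window
        hi = min(len(sentence), i + window + 1)  # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                           # a word is not its own context
                pairs.append((center, sentence[j]))
    return pairs


# Hypothetical example sentence, window of one word on each side.
pairs = skipgram_pairs(["the", "patient", "has", "a", "fever"], window=1)
```

Words that keep showing up with the same neighbours end up with similar vectors, which is why the approach only needs co-occurrence and not parts of speech.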

Medical Data

Whenever we go to the doctor, a set of events happens:

These events happen in a certain overall order but the order varies based on the patient situation and according to the medical staff’s best judgement. We will call this set of events a medical encounter and they happen every day all over the world.

This sequence of events has a similar tone to what we’re familiar with in natural language. The encounter can be thought of as a sort of medical sentence. Each medical event within the encounter can be thought of as a medical word. The type of event (lab, procedure, diagnosis, etc.) can be considered a sort of part-of-speech.
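As a rough sketch of the analogy, the snippet below groups event rows by encounter and joins them, in order, into one space-separated "medical sentence" per encounter. The row layout, the `type:code` token format and the sample rows are hypothetical choices made for illustration; they are not the actual Practice Fusion schema or the preprocessing in my repo.

```python
from collections import defaultdict

# Hypothetical event rows: (encounter_id, order_within_encounter, event_type, code).
events = [
    ("enc1", 2, "medication", "lisinopril"),
    ("enc1", 1, "diagnosis", "401.9"),
    ("enc2", 1, "diagnosis", "250.00"),
    ("enc1", 3, "lab", "lipid_panel"),
]


def to_sentences(rows):
    """Build one space-separated sentence of event tokens per encounter."""
    by_encounter = defaultdict(list)
    for encounter_id, order, event_type, code in rows:
        # Each token is a "medical word"; it must not contain spaces,
        # because sentences are later split on single spaces.
        by_encounter[encounter_id].append((order, f"{event_type}:{code}"))
    return {
        enc: " ".join(token for _, token in sorted(items))
        for enc, items in by_encounter.items()
    }


sentences = to_sentences(events)
```

Prefixing each token with its event type is one simple way to keep a diagnosis code and, say, a lab code from colliding in the vocabulary.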

It remains to determine if this structure can be teased out and encoded into a vector space model like natural language can be. If so, then we can ask questions like:

The problem with trying this technique out, of course, is getting access to medical data. This data is extremely sensitive and is covered by HIPAA here in the United States. What we need is a good, depersonalized set of medical encounter data.

Thankfully, back in 2012 an electronic medical records system, Practice Fusion, released a set of 10,000 depersonalized medical records as part of a Kaggle competition. This opened up the possibility of actually doing this analysis, albeit on a small subset of the population.


Since I’ve been doing a lot with Spark lately at work, I wanted to see if I could use the Word2Vec implementation built into SparkML to accomplish this. Also, frankly, having worked with medical data at some big hospitals and insurance companies, I am aware that there is a real scale problem when doing something this complex for millions of medical encounters and I wanted to ensure that anything I did could scale.

The implementation boiled down to a few steps, which are common to most projects that I’ve seen run on Hadoop. I have created a small GitHub repo to capture the code collateral used to process the data here.

You can see from the Jupyter notebook detailing the model building portion and results here that model building is only a scant few lines:

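 from pyspark import SparkContext
 from pyspark.mllib.feature import Word2Vec

 # Reuse the notebook's SparkContext if one exists, otherwise create one.
 sc = SparkContext.getOrCreate()
 # Each input line is one encounter; split it into its space-separated event tokens.
 sentences = sc.textFile("practice_fusion/sentences_nlp").map(lambda row: row.split(" "))
 # Fit Word2Vec with its default settings.
 word2vec = Word2Vec()
 model = word2vec.fit(sentences)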


One of the problems with unsupervised models is evaluating how well our model is describing reality. For the purpose of this entirely unscientific analysis, we’ll restrict ourselves to just diagnoses and ask a couple of questions of the model:

One thing to note before we get started: this model uses cosine similarity as the score. In general, cosine similarity ranges from -1 to 1, with 1 being most similar and values near 0 or below indicating little or no similarity.
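As a rough illustration of how a cosine score and a ranked list of neighbours can be computed, here is a minimal pure-Python sketch. It is not the Spark implementation; the two-dimensional vectors and the names `dx_a` through `dx_d` are hypothetical stand-ins for the learned diagnosis vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(query, vectors, top_n=2):
    """Rank every other item by cosine similarity to the query item."""
    scores = [
        (name, cosine_similarity(vectors[query], vec))
        for name, vec in vectors.items()
        if name != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical toy vectors, not output from the model.
vectors = {
    "dx_a": [1.0, 0.0],
    "dx_b": [2.0, 0.0],  # same direction as dx_a, different length
    "dx_c": [1.0, 1.0],  # 45 degrees away from dx_a
    "dx_d": [0.0, 3.0],  # orthogonal to dx_a
}
ranked = most_similar("dx_a", vectors, top_n=3)
```

Because the score depends only on direction, `dx_b` ranks as a perfect match for `dx_a` even though it is twice as long.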


Atherosclerosis

Also known as heart disease or hardening of the arteries, this disease is the number one killer of Americans. Our model found the following similar diseases:

| ICD9 Code | Description | Score |
|---|---|---|
| v12.71 | Personal history of peptic ulcer disease | 0.930 |
| 533.40 | Chronic or unspecified peptic ulcer of unspecified site with hemorrhage, without mention of obstruction | 0.926 |
| 153.6 | Malignant neoplasm of ascending colon | 0.910 |
| 238.75 | Myelodysplastic syndrome, unspecified | 0.910 |
| 389.10 | Sensorineural hearing loss, unspecified | 0.907 |
| 428.30 | Diastolic heart failure, unspecified | 0.904 |
| v43.65 | Knee joint replacement | 0.902 |

Peptic Ulcers

Connections between ulcers and atherosclerosis have long been noticed, partially because smokers have a higher than average incidence of both peptic ulcers and atherosclerosis. You can see an editorial in the British Medical Journal from all the way back in the 1970s discussing this.

Hearing Loss

From an article from the Journal of Atherosclerosis in 2012:

Sensorineural hearing loss seemed to be associated with vascular endothelial dysfunction and an increased cardiovascular risk

Knee Joint Replacements

These procedures are common among those with osteoarthritis, and the literature reports a solid correlation between osteoarthritis and atherosclerosis.

Crohn’s Disease

Crohn’s disease is a type of inflammatory bowel disease that is caused by a combination of environmental, immune and bacterial factors. Let’s see if we can recover some of these connections from the data.

| ICD9 Code | Description | Score |
|---|---|---|
| 274.03 | Chronic gouty arthropathy with tophus (tophi) | 0.870 |
| 522.5 | Periapical abscess without sinus | 0.869 |
| 579.3 | Other and unspecified postsurgical nonabsorption | 0.863 |
| 112.3 | Candidiasis of skin and nails | 0.855 |
| v16.42 | Family history of malignant neoplasm of prostate | 0.853 |


Arthritis

From the Crohn’s and Colitis Foundation of America:

Arthritis, or inflammation of the joints, is the most common extraintestinal complication of IBD. It may affect as many as 25% of people with Crohn’s disease or ulcerative colitis. Although arthritis is typically associated with advancing age, in IBD it often strikes the youngest patients.

Dental Abscesses

While not much medical literature exists on a specific link between dental abscesses and Crohn’s (there are general oral issues noticed here), you do see lengthy discussions on the Crohn’s forums about abscesses being a common occurrence with Crohn’s.

Yeast Infections

Candidiasis of skin and nails is a form of yeast infection on the skin. From the journal “Critical Reviews in Microbiology” here:

It is widely accepted that Crohn’s disease could result from an inappropriate inflammatory response to intestinal microorganisms in a genetically susceptible host. Most studies to date have concerned the involvement of bacteria in disease progression. In addition to bacteria, there appears to be a possible link between the commensal yeast Candida albicans and disease development.


For further investigation, I have used t-distributed stochastic neighbor embedding to embed the 100-dimensional vector space into 2 dimensions. This embedding should retain the general connections within the data, so you can look at similar diagnoses, drugs and allergies.

Please play around with this data and let me know what you find!

Casey Stella 04 December 2015 Cleveland, OH