# Word2Vec with Non-Textual Data

At least half of the battle of data analysis and data science is understanding your data.

That sounds obvious, but I’ve seen whole data science projects fail because not nearly enough time was spent on understanding the data. There are only two real ways to go about doing this:

• Spend time with subject matter experts who know the domain
• Spend time looking at the data itself

To have a shot at understanding your data, you really have to do both.

In the course of this blog post, I’m going to describe some of the challenges with understanding data and I’ll go into some technical detail of how to borrow some scalable unsupervised learning from natural language processing coupled with a very nice data visualization to facilitate understanding the natural organization and arrangement of data.

# Subject Matter Experts

I spend a lot of time with healthcare data, and the obvious subject matter experts are nurses and doctors. These people are very gracious, very knowledgeable, and extremely pressed for time. The problem with expert knowledge is that it’s surprisingly hard to communicate sufficient nuance effectively enough to help the working data scientist accomplish their goals. Furthermore, it’s extremely time consuming. This is made doubly hard when the expert is unclear about the goal.

The second, perhaps less obvious, challenge is that subject matter experts’ knowledge is biased toward that which is already known. Often data scientists and analysts are trying to understand the data not as an end in itself, but rather as a means to gaining insight. If you only take into account received knowledge, then arriving at unexpected insights can be challenging. That being said, spending time with subject matter experts is a necessary yet insufficient part of data analysis.

# Unsupervised Data Understanding

To complete the task of understanding your data, I have found that it is necessary to spend time looking at the data. One can think of the entire field of statistics as an exercise in building a mechanism to ask data pointed questions and get answers that we can trust, often with caveats.
The goal is generally to get a sense of how the data is organized or arranged.
With the unbelievable complexity of most real data, we are forced to simplify our representations. The question is precisely how to simplify that representation to find the proper balance between simplicity and complexity. More than that, some representations of the data offer useful views for certain purposes and not for others.

Common simplified representations of data are things like distributions, histograms, and plots. Of course there are other even more complex ways to represent your data. Whole companies have been formed around providing a way to gain insight through more complex organizations of the data, taking some of the burden of interpretation from our brain and encoding it in an organization scheme.

Today, I’d like to talk about another approach to data simplification for event data, one which provides not just an interesting representation, but also a way to ask certain kinds of useful questions of your data.

## Word2Vec

One common way to impose order on data that is used by engineers and mathematicians everywhere is to embed your data in a vector space with a metric on it.
This gives us a couple of things:

• Data now has a distance which can be interpreted as the degree of “difference” between the data
• Data can be combined via addition and subtraction operations which can be interpreted as combination and separation operations
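To make those two properties concrete, here is a minimal sketch in plain Python with hand-picked toy vectors (not learned embeddings): a cosine-based similarity score between points, and component-wise addition and subtraction.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction, -1.0 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy embeddings: a and b point in nearly the same direction, c does not
a = [1.0, 2.0, 0.5]
b = [0.9, 2.1, 0.4]
c = [-1.0, 0.1, 3.0]

print(cosine_similarity(a, b))  # close to 1: "similar" data points
print(cosine_similarity(a, c))  # much lower: "different" data points

# Combination and separation via component-wise arithmetic
combined = [x + y for x, y in zip(a, b)]
separated = [x - y for x, y in zip(a, b)]
```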

The issue now is how you impose this structure by embedding your data, which may not even be numeric, into a vector space. Thankfully, researchers at Google developed a nice way of doing this in the domain of natural language text called Word2Vec.

I won’t go into extravagant detail on the implementation, as Radim Řehůřek did a great job here.
The major takeaway, however, is that by using the inherent structure of natural language, Word2Vec is able to construct a vector space such that:

• Word similarity can be interpreted as a distance calculation
• The notion of analogies can be interpreted using the addition and subtraction operators (e.g. the vector representation of king - male + female is near the vector representation of queen).
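The analogy arithmetic can be illustrated with hand-picked toy 2-D vectors (not real learned embeddings): imagine the first axis encodes gender and the second "royalty".

```python
# Hand-picked toy 2-D vectors, chosen so the arithmetic works out exactly;
# real Word2Vec embeddings only land *near* the target vector.
male   = [ 1.0, 0.0]
female = [-1.0, 0.0]
king   = [ 1.0, 1.0]
queen  = [-1.0, 1.0]

# king - male + female, computed component-wise
analogy = [k - m + f for k, m, f in zip(king, male, female)]
print(analogy)  # [-1.0, 1.0] — the toy vector for queen
```

In the real model, you would take the nearest neighbour of the resulting vector rather than expect an exact match.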

This is a surprisingly rich organization of data and one that has proven very effective in enhancing the accuracy of machine learning models that deal with natural language. Perhaps the most surprising part is that the vectorization model does not use any of the grammatical structure of the natural language directly. It simply analyzes the words within the sentences and, through usage, fits the proper embedding. This led me to consider whether other, non-textual data that has some inherent structure could also be organized this way with the same algorithm.

## Medical Data

Whenever we go to the doctor, a set of events happen:

• Measurements are made (e.g. blood pressure, pulse, height, weight)
• Labs are drawn and ordered (e.g. blood tests)
• Procedures are performed (e.g. an x-ray)
• Drugs are prescribed

These events happen in a certain overall order but the order varies based on the patient situation and according to the medical staff’s best judgement. We will call this set of events a medical encounter and they happen every day all over the world.

This sequence of events has a structure similar to what we’re familiar with in natural language. The encounter can be thought of as a sort of medical sentence. Each medical event within the encounter can be thought of as a medical word. The type of event (lab, procedure, diagnosis, etc.) can be considered a sort of part-of-speech.
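To make that analogy concrete, here is a small sketch (with hypothetical event codes and a hypothetical token scheme, not the actual ETL) of how one encounter could be flattened into a space-delimited medical "sentence", with a type prefix playing the part-of-speech role:

```python
# One hypothetical encounter: (event_type, code) pairs in the order they occurred
encounter = [
    ("lab",  "a1c"),
    ("diag", "250.00"),    # ICD-9 code for type 2 diabetes
    ("rx",   "metformin"),
]

def to_sentence(events):
    """Flatten an encounter into a space-delimited medical 'sentence'.

    Each token is prefixed with its event type, which acts as a part of speech.
    """
    return " ".join(f"{etype}_{code}" for etype, code in events)

print(to_sentence(encounter))  # "lab_a1c diag_250.00 rx_metformin"
```

A corpus of these sentences, one per encounter, is then in exactly the shape Word2Vec expects.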

It remains to determine if this structure can be teased out and encoded into a vector space model like natural language can be. If so, then we can ask questions like:

• How similar are two diseases based on how they are treated and comorbidities found in the same encounter?
• Can we compose diseases and make them similar to other diseases? For instance, is the vector representation of type 2 diabetes - obesity close to type 1 diabetes?

When considering trying this technique out, the problem, of course, is getting access to medical data. This data is extremely sensitive and is covered by HIPAA here in the United States. What we need is a good, depersonalized set of medical encounter data.

Thankfully, back in 2012 the electronic medical records vendor Practice Fusion released a set of 10,000 depersonalized medical records as part of a Kaggle competition. This opened up the possibility of actually doing this analysis, albeit on a small subset of the population.

## Implementation

Since I’ve been doing a lot with Spark lately at work, I wanted to see if I could use the Word2Vec implementation built into SparkML to accomplish this. Also, frankly, having worked with medical data at some big hospitals and insurance companies, I am aware that there is a real scale problem when doing something this complex for millions of medical encounters and I wanted to ensure that anything I did could scale.

The implementation boiled down into a few steps, which are common to most projects that I’ve seen run on Hadoop. I have created a small github repo to capture the code collateral used to process the data here.

• Ingest the Practice Fusion database dumps into Hadoop.
• Spin up Hive tables for each of the tables, roughly corresponding to a table per medical event.
• The set of DDLs is here
• Transform this tabular data into a corpus of medical event sentences.
• The ETL Pig scripts are here
• The shell script executing the Pig scripts is here
• Build the Word2Vec model with Spark.

You can see from the Jupyter notebook detailing the model building portion and results here that model building takes only a scant few lines:


```python
from pyspark import SparkContext
from pyspark.mllib.feature import Word2Vec

# `sc` is the SparkContext provided by the notebook environment.
# Each line of the corpus is one encounter: a space-delimited
# "sentence" of medical event tokens.
sentences = sc.textFile("practice_fusion/sentences_nlp") \
              .map(lambda row: row.split(" "))

word2vec = Word2Vec()
word2vec.setSeed(0)          # fixed seed for reproducibility
word2vec.setVectorSize(100)  # embed each medical event in 100 dimensions
model = word2vec.fit(sentences)
```
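Once fit, the model can be queried for neighbours with Spark's `model.findSynonyms(word, num)`. Conceptually, that call is just a cosine-similarity nearest-neighbour search over the learned vectors; here is a plain-Python sketch of that search using made-up tokens and hand-picked vectors (the real vectors would come from the fitted model):

```python
import math

# Made-up embedding table; the tokens and vectors are illustrative only
vectors = {
    "diag_414.00":  [0.9, 0.1, 0.2],
    "diag_533.40":  [0.8, 0.2, 0.3],
    "diag_389.10":  [0.7, 0.3, 0.1],
    "rx_metformin": [-0.5, 0.9, 0.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def find_synonyms(word, num):
    """Return the `num` tokens nearest to `word` by cosine similarity."""
    target = vectors[word]
    scored = [(w, cosine(target, v)) for w, v in vectors.items() if w != word]
    return sorted(scored, key=lambda x: -x[1])[:num]

print(find_synonyms("diag_414.00", 2))  # nearest two tokens with their scores
```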



# Results

One of the problems with unsupervised models is evaluating how well our model is describing reality. For the purpose of this entirely unscientific analysis, we’ll restrict ourselves to just diagnoses and ask a couple of questions of the model:

• Does the model correctly recover what we currently know based on medical research?
• Does the model show us anything that is novel and likely, but unknown at present?

One thing to note before we get started: this model uses cosine similarity as the score. This measure of similarity ranges from -1 to 1, with 1 being most similar and -1 being least similar.

## Atherosclerosis

Atherosclerosis is also known as heart disease or hardening of the arteries, and it is the number one killer of Americans. Our model found the following similar diseases:

| ICD9 Code | Description | Score |
|-----------|-------------|-------|
| v12.71 | Personal history of peptic ulcer disease | 0.930 |
| 533.40 | Chronic or unspecified peptic ulcer of unspecified site with hemorrhage, without mention of obstruction | 0.926 |
| 153.6 | Malignant neoplasm of ascending colon | 0.910 |
| 238.75 | Myelodysplastic syndrome, unspecified | 0.910 |
| 389.10 | Sensorineural hearing loss, unspecified | 0.907 |
| 428.30 | Diastolic heart failure, unspecified | 0.904 |
| v43.65 | Knee joint replacement | 0.902 |

### Peptic Ulcers

There have been long-standing connections noticed between ulcers and atherosclerosis, partially due to smokers having a higher than average incidence of both peptic ulcers and atherosclerosis. You can see an editorial in the British Medical Journal from as far back as the 1970s discussing this.

### Hearing Loss

From an article in the Journal of Atherosclerosis in 2012:

> Sensorineural hearing loss seemed to be associated with vascular endothelial dysfunction and an increased cardiovascular risk

### Knee Joint Replacements

These procedures are common among those with osteoarthritis, and a solid correlation between osteoarthritis and atherosclerosis has been documented in the literature.

## Crohn’s Disease

Crohn’s disease is a type of inflammatory bowel disease that is caused by a combination of environmental, immune and bacterial factors. Let’s see if we can recover some of these connections from the data.

| ICD9 Code | Description | Score |
|-----------|-------------|-------|
| 274.03 | Chronic gouty arthropathy with tophus (tophi) | 0.870 |
| 522.5 | Periapical abscess without sinus | 0.869 |
| 579.3 | Other and unspecified postsurgical nonabsorption | 0.863 |
| 135 | Sarcoidosis | 0.859 |
| 112.3 | Candidiasis of skin and nails | 0.855 |
| v16.42 | Family history of malignant neoplasm of prostate | 0.853 |

### Arthritis

Arthritis, or inflammation of the joints, is the most common extraintestinal complication of IBD. It may affect as many as 25% of people with Crohn’s disease or ulcerative colitis. Although arthritis is typically associated with advancing age, in IBD it often strikes the youngest patients.

### Dental Abscesses

While not much medical literature exists on a specific link between dental abscesses and Crohn’s (there are general oral issues noted here), you do see lengthy discussions on the Crohn’s forums about abscesses being a common occurrence with Crohn’s.

### Yeast Infections

Candidiasis of skin and nails is a form of yeast infection on the skin. From the journal “Critical Review of Microbiology” here:

> It is widely accepted that Candidia could result from an inappropriate inflammatory response to intestinal microorganisms in a genetically susceptible host. Most studies to date have concerned the involvement of bacteria in disease progression. In addition to bacteria, there appears to be a possible link between the commensal yeast Candida albicans and disease development.

## Visualization

For further investigation, I have used t-distributed stochastic neighbor embedding to embed the 100-dimensional vector space into 2 dimensions. This embedding should retain the general connections within the data, so you can look at similar diagnoses, drugs and allergies.
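The projection step can be sketched with scikit-learn's t-SNE implementation. This is a minimal sketch using random stand-in vectors; in the real pipeline the input would be the learned 100-dimensional event embeddings, and the note about perplexity reflects t-SNE's requirement that it be smaller than the number of points.

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the learned embeddings: 20 random 100-dimensional vectors
rng = np.random.RandomState(0)
embeddings = rng.randn(20, 100)

# Project to 2-D for plotting; perplexity must be < number of points
tsne = TSNE(n_components=2, perplexity=5, random_state=0)
coords = tsne.fit_transform(embeddings)
print(coords.shape)  # one (x, y) pair per input vector
```

Each row of `coords` then becomes a point in the interactive scatter plot below.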

• You can choose to look at all types, just diagnoses or just drugs.
• Highlight in the canvas below and drag around. The points that you’ve selected will show up in the table below along with a description in plain text.

Please play around with this data and let me know what you find!

04 December 2015 Cleveland, OH