1 of 22

DoctorAI and Medical Embeddings: Developing a Medical Sense

Connor Favreau

Data Science Intern at Providence Health and Services

2 of 22

The Takeaways

  1. What a recurrent neural network is
  2. How to generate word embeddings from the word2vec skip-gram algorithm
  3. That word embeddings can provide a good “time-averaged” sense of an input in a time series and, as such, can be combined with RNNs to yield greater predictive accuracy
  4. That RNNs and embeddings can be applied to time series data beyond words
  5. How RNNs and embeddings can be applied to the medical field

3 of 22

Neural Networks

  • Input (scalar, vector, or matrix)
  • Hidden layer (nonlinear activation function, weights, biases)
  • Calculated output (scalar, vector, or matrix)
  • Actual output (scalar, vector, or matrix)
  • Cost function (measures how far the calculated output is from the actual output)
  • Backpropagation (gradients of the cost with respect to the weights and biases)
  • Training (iteratively updating weights and biases to reduce the cost)
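The pipeline above can be sketched end to end. A minimal sketch with made-up sizes, seed, and learning rate, fitting the XOR function with one hidden layer:

```python
import numpy as np

# Minimal one-hidden-layer network trained by backpropagation on a toy
# task (XOR). Sizes, seed, and learning rate are illustrative choices.
rng = np.random.default_rng(0)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # inputs
Y = np.array([[0], [1], [1], [0]], dtype=float)              # actual outputs

W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)   # hidden weights and biases
W2, b2 = rng.normal(0, 1, (8, 1)), np.zeros(1)   # output weights and biases

lr = 0.1
for step in range(3000):
    # Forward pass: input -> hidden (nonlinear activation) -> calculated output
    H = np.tanh(X @ W1 + b1)
    Y_hat = H @ W2 + b2

    # Cost function: mean squared error, calculated vs. actual output
    cost = np.mean((Y_hat - Y) ** 2)

    # Backpropagation: chain rule from the cost back to every parameter
    dY = 2 * (Y_hat - Y) / len(X)
    dW2, db2 = H.T @ dY, dY.sum(0)
    dH = (dY @ W2.T) * (1 - H ** 2)      # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ dH, dH.sum(0)

    # Training: gradient-descent updates
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

After training, the calculated outputs sit close to the actual 0/1 targets, which is the whole loop the slide summarizes.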

4 of 22

Recurrent Neural Networks

[Figure: a recurrent neural network unrolled through time]

Britz D., 2015

5 of 22

Recurrent Neural Network Spin-offs

  • RNNs suffer from vanishing and exploding gradients.
    • tanh and sigmoid derivatives approach zero for large-magnitude inputs. Backpropagation multiplies one such factor per time step, so gradients shrink toward zero for inputs further back in time.

LSTMs (Long Short Term Memory)

  • First proposed in 1997.
  • The most robust of the three, and the best at capturing non-adjacent sequence dependencies.
  • More parameters to train.

GRUs (Gated Recurrent Units)

  • First proposed in 2014.
  • Fewer parameters to train than an LSTM.
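The vanishing-gradient effect motivating these variants can be seen numerically. A toy sketch (invented sizes, seed, and weight scale): the Jacobian of one tanh RNN step is diag(1 − h²)·W, and multiplying 30 of them drives the gradient toward zero.

```python
import numpy as np

# Toy sketch of vanishing gradients in a plain tanh RNN. The Jacobian of
# one step is diag(1 - h^2) @ W; chaining 30 of them shrinks the gradient
# with respect to the earliest hidden state toward zero.
rng = np.random.default_rng(1)
n = 16
W = rng.normal(0, 0.05, (n, n))   # small recurrent weights (spectral norm < 1)
x = rng.normal(0, 1, n)           # constant input, for simplicity
h = np.zeros(n)

grad = np.eye(n)                  # accumulates d h_t / d h_0
norms = []
for t in range(30):
    h = np.tanh(W @ h + x)
    grad = np.diag(1 - h ** 2) @ W @ grad   # chain rule through one step
    norms.append(np.linalg.norm(grad))
```

The recorded norms decay roughly geometrically with distance in time; LSTM/GRU gating adds additive paths through the state that keep gradients from collapsing this fast.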

6 of 22

Adding Embeddings (word2vec/skipgram) Into RNNs

  • Traditionally, one-hot vectors could be used for inputs and outputs.
  • Instead, word2vec embeddings can be learned first, based on the co-occurrence of items.
  • Ex.: Predicting words in a sentence.

The cat chased a bird flying through the window.
(context: the surrounding words, e.g. “chased a” and “flying through”)

The cat chased a bird __?__.
(target: the missing word to predict)
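The one-hot versus embedding contrast can be made concrete. A sketch with an invented five-word vocabulary and a random (untrained) embedding matrix:

```python
import numpy as np

# Two ways to represent a word as a network input. The vocabulary and
# embedding values below are made up for illustration.
vocab = ["the", "cat", "chased", "a", "bird"]
word_to_id = {w: i for i, w in enumerate(vocab)}

# 1) One-hot: a sparse vector as long as the vocabulary.
def one_hot(word):
    v = np.zeros(len(vocab))
    v[word_to_id[word]] = 1.0
    return v

# 2) Embedding: a dense matrix of learned vectors; looking a word up
#    just selects its row.
rng = np.random.default_rng(0)
E = rng.normal(0, 1, (len(vocab), 3))   # 5 words -> 3-dimensional vectors

def embed(word):
    return E[word_to_id[word]]
```

Selecting row i of E equals multiplying the one-hot vector into E (`one_hot(w) @ E`), which is why a learned embedding layer can replace one-hot inputs directly.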

7 of 22

The Word2Vec Skip-Gram Algorithm

  • A text-learning neural net algorithm that maps words to vectors.
  • Each vector roughly represents the conditional probability of a word appearing near the other words in the vocabulary.
  • First introduced in Mikolov et al. 2013 paper “Efficient Estimation of Word Representations in Vector Space”
  • Skip-gram task: Predict the context words from the target (center) word.

The cat chased a bird flying through the window.
(target: “bird”; context: surrounding words such as “chased”, “a”, “flying”, “through”)
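The labeled sentence generates (target, context) training pairs for the skip-gram task. A short sketch of that pair extraction (the window size of 2 is an arbitrary choice):

```python
# Generate skip-gram training pairs: each pair asks the model to predict
# a context word given the target (center) word.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:                       # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

sentence = "the cat chased a bird flying through the window".split()
pairs = skipgram_pairs(sentence, window=2)
```

With target “bird”, the pairs produced are exactly (“bird”, “chased”), (“bird”, “a”), (“bird”, “flying”), and (“bird”, “through”).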

8 of 22

What are Embeddings?

[Figures: visualizations of learned word-vector relationships; Mikolov et al., 2013b; Mikolov et al., 2013c]
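The cited figures show that embedding arithmetic captures relations such as king − man + woman ≈ queen. A toy sketch of that vector-offset idea, using hand-made (untrained) vectors rather than real embeddings:

```python
import numpy as np

# Hand-made 3-d vectors illustrating the vector-offset property of
# embeddings (these are invented values, not trained word2vec output).
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
}

def cosine(u, v):
    # Cosine similarity: 1.0 means same direction.
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Analogy query: king - man + woman should land nearest to queen.
query = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], query))
```

With real trained embeddings the same nearest-neighbor search over the offset vector recovers the analogies shown in the Mikolov figures.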

9 of 22

Applications

  • Natural Language Processing
    • Sentences as time sequences
    • Can train an RNN to predict the next word in a sentence and to generate text.
      • Shakespeare
  • Speech Recognition
  • Medical predictions

10 of 22

Recurrent Neural Networks and Embeddings in the Medical Field

11 of 22

Medical Data Usable for RNNs

  • ICD-9 and ICD-10 codes
    • “International Statistical Classification of Diseases and Related Health Problems”
    • Alphanumeric codes for patient diagnoses.
    • ICD-10 replaced ICD-9 in October 2015, and has over 69,000 unique codes.
    • Ex.: “F25.0” corresponds to “cyclic schizophrenia”
  • Current Procedural Terminology (CPT) codes
    • Thousands of five-character codes describing medical events.
    • Typically used in outpatient processing and billing.
    • Ex.: “2014F” corresponds to “Mental status assessed”
  • General Product Identifier (GPI) codes
    • 14-character codes to identify drugs.
    • Ex.: “6410001000” corresponds to “Aspirin”
  • Unstructured notes
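Before any of these codes reach an embedding layer or RNN, they are mapped to integer ids. A minimal sketch with invented patient visits (the codes are the examples from the bullets above):

```python
# Turn heterogeneous medical codes (ICD-10, CPT, GPI) into integer ids,
# the form an embedding layer or RNN consumes. Visits are invented.
visits = [
    ["F25.0", "2014F"],        # visit 1: diagnosis code + procedure code
    ["F25.0", "6410001000"],   # visit 2: diagnosis code + drug (GPI) code
]

# Build a vocabulary over every code seen in the data.
code_to_id = {}
for visit in visits:
    for code in visit:
        code_to_id.setdefault(code, len(code_to_id))

# Each visit becomes a list of ids, one time step of the sequence.
sequence = [[code_to_id[c] for c in visit] for visit in visits]
```

The shared diagnosis code maps to the same id in both visits, which is what lets the model learn one embedding per code across a patient's history.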

12 of 22

DoctorAI: Medical Predictions from a GRU Framework

Choi E. et al., 2015

13 of 22

GRU Networks

  • Update gate: controls how much the hidden state is updated at each step.
  • Reset gate: controls how much value to assign new inputs versus the previous state.

Schraudolph N., 2014

Hsu C., 2017
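A single GRU step with the two gates above can be sketched directly; the dimensions and random weights below are toy values, not a trained model.

```python
import numpy as np

# One GRU step: update gate z, reset gate r, candidate state h_tilde.
rng = np.random.default_rng(0)
n_in, n_h = 4, 8

def init(shape):
    return rng.normal(0, 0.1, shape)

Wz, Uz, bz = init((n_h, n_in)), init((n_h, n_h)), np.zeros(n_h)  # update gate
Wr, Ur, br = init((n_h, n_in)), init((n_h, n_h)), np.zeros(n_h)  # reset gate
Wh, Uh, bh = init((n_h, n_in)), init((n_h, n_h)), np.zeros(n_h)  # candidate

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev):
    z = sigmoid(Wz @ x + Uz @ h_prev + bz)   # update gate: how much to update
    r = sigmoid(Wr @ x + Ur @ h_prev + br)   # reset gate: how much old state to mix in
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h_prev) + bh)  # candidate from new input
    return (1 - z) * h_prev + z * h_tilde    # blend old state and candidate

h = np.zeros(n_h)
for x in rng.normal(0, 1, (5, n_in)):        # run five time steps
    h = gru_step(x, h)
```

Because the new state is a convex blend of the old state and a tanh candidate, the gates let gradients flow through the `(1 - z) * h_prev` path, which is what mitigates the vanishing-gradient problem.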

14 of 22

From GRU to Predictions

y_t = softmax(W_code · h_t + b_code)

d_t = ReLU(w_time · h_t + b_time)

  • y is the prediction of the patient's codes at the next time step (also the next input)
    • Use softmax to produce a probability distribution reflecting co-occurring codes
  • d is the prediction of the time duration until the next visit (also the next input)
    • Use ReLU/max(0, ·) to reflect the “unbounded” non-negative nature of time

Guo B., 2016
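The two output heads can be sketched on top of a GRU hidden state. The parameter names (W_code, w_time) follow the common Doctor AI description, but the dimensions and values below are invented:

```python
import numpy as np

# Two Doctor AI-style output heads on a hidden state h: softmax over
# codes for the next-visit prediction y, ReLU for the non-negative,
# unbounded duration d. Toy dimensions and random weights.
rng = np.random.default_rng(0)
n_h, n_codes = 8, 5
h = rng.normal(0, 1, n_h)                 # hidden state from the GRU

W_code, b_code = rng.normal(0, 0.1, (n_codes, n_h)), np.zeros(n_codes)
w_time, b_time = rng.normal(0, 0.1, n_h), 0.0

def softmax(z):
    e = np.exp(z - z.max())               # shift for numerical stability
    return e / e.sum()

y = softmax(W_code @ h + b_code)          # probabilities over next-visit codes
d = max(0.0, float(w_time @ h + b_time))  # ReLU keeps predicted duration >= 0
```

softmax guarantees y is a valid probability distribution over codes, while the ReLU head can output any non-negative duration, matching the bullets above.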

15 of 22

For Training

Cost: cross-entropy between the predicted and actual code distributions, plus the squared difference between the predicted and actual visit durations, summed over all time steps.

Goal: Minimize this cost function
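A sketch of that cost, with invented predictions and targets (the 1e-12 term only guards the logarithm):

```python
import numpy as np

# Training cost: cross-entropy over predicted code probabilities plus
# squared error on visit duration, summed over time steps.
def cost(y_hat_seq, y_true_seq, d_hat_seq, d_true_seq):
    total = 0.0
    for y_hat, y_true, d_hat, d_true in zip(y_hat_seq, y_true_seq,
                                            d_hat_seq, d_true_seq):
        total += -np.sum(y_true * np.log(y_hat + 1e-12))  # cross-entropy term
        total += (d_hat - d_true) ** 2                    # squared-duration term
    return total

# Two invented time steps: predicted code distributions, one-hot actuals,
# and predicted vs. actual durations in days.
y_hat_seq = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
y_true_seq = [np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0])]
d_hat_seq, d_true_seq = [30.0, 60.0], [28.0, 65.0]
c = cost(y_hat_seq, y_true_seq, d_hat_seq, d_true_seq)
```

Gradient descent through the GRU minimizes this total, trading off code accuracy against duration accuracy.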

16 of 22

Medical Embeddings… Not Just Words in Sentences

  • Can perform “word2vec” skip-gram training based on the co-occurrence of codes and/or text within specific time “bins” (periods of time).

Collection of a patient’s codes over a given period of time

Choi Y. et al., 2016
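The binning step can be sketched as grouping timestamped codes into fixed windows, each window acting as one co-occurrence “sentence”; the records and bin size below are invented:

```python
from collections import defaultdict

# Bin a patient's timestamped codes into fixed windows; each bin plays
# the role of a "sentence" whose codes co-occur for skip-gram training.
records = [
    (3, "F25.0"), (5, "2014F"),          # (days since first visit, code)
    (40, "F25.0"), (41, "6410001000"),
]

def bin_codes(records, bin_days=30):
    bins = defaultdict(list)
    for day, code in records:
        bins[day // bin_days].append(code)   # integer-divide day into a bin
    return [codes for _, codes in sorted(bins.items())]

sentences = bin_codes(records, bin_days=30)
```

Here days 3 and 5 fall in the first 30-day bin and days 40 and 41 in the second, so each bin's codes are treated as co-occurring.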

17 of 22

Med2Vec

  • Converts codes (ICD, CPT, medication) and demographic information into lower-dimensional embeddings that can be fed into DoctorAI for predictions.
  • Based on word2vec skip-gram model.
  • Written in Python/Theano.

Choi E. et al., 2016

18 of 22

Medical Embeddings from Text

  • Annotate text to produce a list of medical concepts, noting whether each is negated.
  • Perform skip-gram/co-occurrence training (word2vec).
  • Previous works have binned notes per patient over time frames ranging from 1-day to 1-year bins.

Finlayson et al., 2014
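The negation-aware annotation step can be sketched as token construction, so that negated and affirmed mentions receive distinct embeddings; the annotations below are invented examples:

```python
# Build tokens for co-occurrence training from annotated note concepts,
# marking negated mentions so "no chest pain" and "chest pain" end up
# with different embeddings. The annotations are invented.
annotations = [
    {"concept": "chest_pain", "negated": True},   # e.g. "denies chest pain"
    {"concept": "aspirin", "negated": False},     # e.g. "taking aspirin"
]

tokens = [("NEG_" if a["negated"] else "") + a["concept"]
          for a in annotations]
```

These tokens then feed the same skip-gram/co-occurrence training used for codes, with bins standing in for sentences.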

19 of 22

Medical Embeddings from Text

Choi Y. et al., 2016

20 of 22

DoctorAI Results

Choi E. et al., 2016

21 of 22

The Takeaways

  1. What a recurrent neural network is
  2. How to generate word embeddings from the word2vec skip-gram algorithm
  3. That word embeddings can provide a good “time-averaged” sense of an input in a time series and, as such, can be combined with RNNs to yield greater predictive accuracy
  4. That RNNs and embeddings can be applied to time series data beyond words
  5. How RNNs and embeddings can be applied to the medical field

22 of 22

References/Good Links

  1. Britz D., 2015. Recurrent Neural Networks Tutorial [Blog Post]; WildML – Artificial Intelligence, Deep Learning, and NLP. http://www.wildml.com/
  2. Choi E., Bahadori M.T., Sun J. Doctor AI: Predicting Clinical Events via Recurrent Neural Networks. CoRR. 2015; abs/1511.05942. https://arxiv.org/pdf/1511.05942.pdf
  3. Choi E., Bahadori M.T., Searles E., Coffey C., Sun J. Multi-layer Representation Learning for Medical Concepts. In KDD, 2016. http://www.kdd.org/kdd2016/papers/files/rpp0303-choiA.pdf
  4. Choi Y., Chiu C.Y., Sontag D. Learning Low-Dimensional Representations of Medical Concepts. AMIA Jt Summits Transl Sci Proc., 2016: 41-50. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5001761/
  5. De Vine L., Zuccon G., Koopman B., Sitbon L., Bruza P. Medical Semantic Similarity with a Neural Language Model; Proceedings of CIKM ’14; New York, NY, USA: ACM; 2014. pp. 1819–1822.
  6. Finlayson S., LePendu P., Shah N. Data from: Building the graph of medicine from millions of clinical narratives. Dryad Digital Repository. 2014.
  7. Guo B., 2016. ReLu compared against Sigmoid, Softmax, Tanh [Blog Post]; Quora. https://algorithmsdatascience.quora.com/ReLu-compared-against-Sigmoid-Softmax-Tanh
  8. Hsu C., 2017. Logistic Regression: Sigmoid Function Explained in Plain English [Blog Post]; LinkedIn. https://www.linkedin.com/pulse/logistic-regression-sigmoid-function-explained-plain-english-hsu
  9. Mikolov T., Chen K., Corrado G., Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. https://arxiv.org/pdf/1301.3781.pdf
  10. Mikolov T., Sutskever I., Chen K., Corrado G., Dean, J., 2013. Distributed Representations of Words and Phrases and their Compositionality. arXiv preprint arXiv:1310.4546v1. https://arxiv.org/pdf/1310.4546.pdf
  11. Mikolov T., Yih W., Zweig G., 2013. Linguistic Regularities in Continuous Space Word Representations. http://www.aclweb.org/anthology/N13-1090
  12. Rong X., 2016. word2vec Parameter Learning Explained. arXiv preprint arXiv:1411.2738v4. https://arxiv.org/pdf/1411.2738.pdf
  13. Schraudolph N., 2014. Multi-layer networks. https://nic.schraudolph.org/teach/NNcourse/multilayer.html
  14. SyTrue, 2015. Why Structured Data Holds the Key to Intelligent Healthcare Systems. http://hitconsultant.net/2015/03/31/tapping-unstructured-data-healthcares-biggest-hurdle-realized/