Natural language processing for clinical notes
Andrew Bazemore, MD, MPH
Nathaniel Hendrix, PharmD, PhD
Center for Professionalism and Value in Health Care
American Board of Family Medicine
Learning objectives
“Computers made analysis cheap, so we do far more analysis. Desktop computers made design cheap, so there is far more design. [Large language models] make images and text cheap, so...” - Benedict Evans
The Opportunity
PRIMARY CARE is
“the provision of integrated, accessible health care services by clinicians who are accountable for addressing a large majority of personal health care needs, developing a sustained partnership with patients, and practicing in the context of family and community”
*Martin S, Phillips RL, Petterson S, Levin Z, Bazemore AW. Primary Care Spending in the United States, 2002-2016. JAMA Intern Med. 2020;180(7):1019–1020.
Engaging primary care researchers & their data in AI/ML research & implementation
https://professionalismandvalue.org/setting-a-research-agenda-for-the-use-of-artificial-intelligence-machine-learning-in-primary-care/
$3 million ABFM Foundation Investment in advancing PC in AI/ML
Natural Language Processing & Primary Care EHR Data
“How can we learn in real time from the millions of clinical decisions made every day?”
Challenges of clinical data
Not all relevant data is recorded in a structured format
Hendrix, et al. (2023) doi: 10.3122/jabfm.2022.220296R1
Documentation in structured data is subject to format changes
Heslin, et al. (2017) doi: 10.1097/MLR.0000000000000805
Structured data is prone to entry errors
COVID-19 in American Family Cohort
| | | Overall | No ICD-10 / lab | ICD-10 / lab | P-Value |
|---|---|---|---|---|---|
| n | | 16812 | 8692 | 8120 | |
| SDI, mean (SD) | | 13.5 (9.0) | 13.8 (9.1) | 13.3 (8.8) | 0.001 |
| Age, n (%) | < 18 | 93 (0.6) | 31 (0.4) | 62 (0.8) | <0.001 |
| | 18 to 64 | 8676 (51.6) | 4134 (47.6) | 4542 (55.9) | |
| | 65 to 74 | 4651 (27.7) | 2521 (29.0) | 2130 (26.2) | |
| | 75 and above | 3392 (20.2) | 2006 (23.1) | 1386 (17.1) | |
| Race, n (%) | API | 393 (2.3) | 199 (2.3) | 194 (2.4) | 0.006 |
| | BLACK | 804 (4.8) | 437 (5.0) | 367 (4.5) | |
| | HISPANIC | 2075 (12.3) | 997 (11.5) | 1078 (13.3) | |
| | WHITE | 13530 (80.5) | 7054 (81.2) | 6476 (79.8) | |
| | OTHER | 10 (0.1) | 5 (0.1) | 5 (0.1) | |
Indications of COVID-19 among patients prescribed Paxlovid
Bag of words model
Bag of words representations
Image from O’Reilly Media
Lemmatization
Converts words into a basic form, typically by removing word endings: e.g., “walking” and “walked” both become “walk”
This gives a stronger signal from the text by ignoring grammatical features such as tense and number.
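A minimal sketch of the idea, assuming a hand-written suffix list. This toy function is closer to stemming than to dictionary-based lemmatization, and both the function name `strip_suffix` and the example note are hypothetical. In practice, libraries such as NLTK or spaCy provide real lemmatizers.

```python
# Toy suffix stripper: an illustrative assumption, not a real lemmatizer.
SUFFIXES = ("ing", "ed", "es", "s")  # checked in this order


def strip_suffix(word: str) -> str:
    """Lowercase a word and remove the first matching suffix,
    as long as at least three letters remain."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical clinical note fragment.
note = "Patient coughed overnight and reports coughing and coughs today"
normalized = [strip_suffix(w) for w in note.split()]
print(normalized)
```

After normalization, "coughed", "coughing", and "coughs" all map to the same column in the bag of words, so their counts add up instead of being split across three features.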
Regularization
Often called TF-IDF (term frequency-inverse document frequency)
Shows how common a term is within a text compared to its frequency across the corpus
This allows for a greater focus on unique terms
Can also use this to filter out words that appear in most documents
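The weighting described above can be sketched in a few lines. This uses the plain textbook formula (term frequency times the log of inverse document frequency); libraries such as scikit-learn apply smoothed variants, so their numbers will differ. The three example notes and the 0.8 cutoff are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical mini-corpus of clinical note fragments.
docs = [
    "patient reports cough and fever",
    "patient reports knee pain",
    "patient denies fever",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))


def tf_idf(tokens):
    """Weight each term by its share of the document times log(N / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / df[term])
        for term, count in counts.items()
    }


weights = [tf_idf(tokens) for tokens in tokenized]

# Terms that appear in more than 80% of documents (an assumed cutoff)
# carry little information and can be filtered out.
too_common = {term for term, count in df.items() if count / n_docs > 0.8}

print(weights[0])
print(too_common)
```

Here "patient" appears in every note, so its weight is zero and it lands in `too_common`, while "cough" (one note) is weighted above "fever" (two notes).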
N-grams
In addition to single-word columns, the bag of words can count sequences of words that are side by side in the text
Example 2-grams: “It is a puppy” = “It is”, “is a”, “a puppy”
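As a small sketch, n-grams can be built by sliding a window of n words across the text. The helper name `ngrams` is an assumption for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n adjacent tokens, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "It is a puppy".split()
bigrams = ngrams(tokens, 2)
print(bigrams)  # ['It is', 'is a', 'a puppy']

# A bag of words with both single words and 2-grams as columns.
bag = Counter(tokens + bigrams)
print(bag)
```

The combined bag has seven columns for this sentence: four single words and three 2-grams.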
Applications: Classification
Assignment of a text into groups defined by the user (supervised).
E.g., sentiment:
“Dr. Smith is very caring and kind” = 👍
“Dr. Jones was late to my appointment and rushed me” = 👎
Or estimating likelihood of future events, such as hospitalization in the next year
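One way to make this concrete is a tiny naive Bayes classifier over bag-of-words counts. This is only a sketch: the slides do not say which classifier the walkthrough uses, and the four training texts and the labels "pos" and "neg" are made up, loosely following the examples above.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training texts (supervised: labels are given).
train = [
    ("very caring and kind", "pos"),
    ("kind and listened carefully", "pos"),
    ("late to my appointment", "neg"),
    ("rushed me and was late", "neg"),
]

word_counts = defaultdict(Counter)  # label -> word counts
label_counts = Counter()            # label -> number of texts
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {word for counts in word_counts.values() for word in counts}


def predict(text):
    """Return the label with the highest log-probability,
    using add-one smoothing and ignoring words never seen in training."""
    scores = {}
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = math.log(label_counts[label] / len(train))
        for word in text.split():
            if word in vocab:
                score += math.log(
                    (word_counts[label][word] + 1) / (total + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


print(predict("caring and kind"))
print(predict("was late and rushed"))
```

The same pattern applies to predicting future events: the label becomes, for example, "hospitalized within a year" versus "not hospitalized", and the texts are clinical notes.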
Applications: Topic modeling
Identification of words that tend to occur together – “topics”
This is an unsupervised method – the researcher doesn’t define the topics in advance, but the model finds them in the data
[Figure: illustration with labels Topic A, Topic B, Topic C, and Topic D]
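A rough sketch of unsupervised topic discovery, using non-negative matrix factorization (NMF) with multiplicative updates. NMF is one of several topic modeling methods (latent Dirichlet allocation is another common choice), and the slides do not say which one the walkthrough uses. The four example documents, the number of topics, and the helper names are assumptions.

```python
import random

# Hypothetical documents; no topic labels are supplied.
docs = [
    "glucose insulin glucose diabetes",
    "insulin diabetes glucose",
    "inhaler wheeze asthma inhaler",
    "asthma wheeze inhaler",
]
vocab = sorted({w for d in docs for w in d.split()})
V = [[d.split().count(w) for w in vocab] for d in docs]  # document-term counts


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def error(V, W, H):
    """Squared reconstruction error between V and W x H."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh))


def nmf(V, k, steps=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into W (docs x k) and H (k x terms)."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in V]
    H = [[rng.random() + 0.1 for _ in V[0]] for _ in range(k)]
    for _ in range(steps):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H


W, H = nmf(V, k=2)
for topic, row in enumerate(H):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print(f"Topic {topic}:", [word for _, word in top])
```

Each row of `H` is a topic (a weighting over words), and each row of `W` says how much of each topic a document contains. The researcher still has to read the top words and decide what each topic means.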
Code Walkthrough: Classification and Topic Modeling
https://shorturl.at/mvET5
Word Embeddings
Based on the Distributional Hypothesis:
“You shall know a word by the company it keeps” - J.R. Firth
Models of texts: Embeddings
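Embeddings are commonly learned by predicting words from their context. The “skip-gram” model predicts the words around a given word; related models predict a missing or next word in a sentence: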
e.g., “Alice lay still on the sofa, feeling _____.”
“tired”
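To show what the training data for such a model can look like, here is a sketch that builds (center word, context word) pairs in the word2vec skip-gram style. The function name `skipgram_pairs` and the window size are assumptions; the actual model then learns vectors that make these pairs likely.

```python
def skipgram_pairs(tokens, window=2):
    """Pair each word with every neighbor within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((center, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


tokens = "alice lay still on the sofa feeling tired".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs[:4])
```

Words that show up with similar neighbors across many sentences end up with similar vectors, which is the distributional hypothesis in practice.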
Image from Kastreti, et al. DOI:10.1016/j.dib.2019.105090
The spatial arrangement of words means that you can analyze them using basic arithmetic
Image from http://jalammar.github.io/illustrated-word2vec/
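The arithmetic can be illustrated with hand-made vectors. The two-dimensional values below are invented for the example (roughly "royalty" and "gender"); real embeddings are learned from text and have hundreds of dimensions, and the helper `analogy` is hypothetical.

```python
import math

# Made-up 2-D vectors for illustration only: (royalty, gender).
vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c),
    excluding the three input words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


print(analogy("king", "man", "woman"))
```

With these toy values, "king" minus "man" plus "woman" lands exactly on the vector for "queen".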
Applications: Text generation
What ChatGPT does!
Large language model -> Fine-tune on your data -> Generate text
Can be used to answer questions, summarize texts, or fill in the blank
Zhang, et al. (2020) doi: 10.1145/3368555.3384448
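Next-word prediction can be sketched with a bigram count model. This is a far simpler stand-in for a large language model (no neural network and no fine-tuning), shown only to make the "predict the next word, then repeat" loop concrete. The four training sentences and the function names are made up.

```python
from collections import Counter, defaultdict

# Hypothetical training text.
corpus = [
    "patient reports chest pain",
    "patient reports chest tightness",
    "patient reports chest pain on exertion",
    "patient denies fever",
]

# For each word, count which words follow it.
next_words = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for current, following in zip(tokens, tokens[1:]):
        next_words[current][following] += 1


def predict_next(word):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = next_words.get(word)
    return counts.most_common(1)[0][0] if counts else None


def generate(start, max_words=4):
    """Generate text by repeatedly appending the predicted next word."""
    words = [start]
    while len(words) < max_words:
        nxt = predict_next(words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)


print(generate("patient"))
```

A large language model replaces the count table with a neural network that conditions on the whole preceding text, but generation still proceeds one predicted token at a time.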
Applications: Similarity estimates
Uses cosine similarity to identify the examples that are closest to the input
Can be used with…
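A minimal sketch of similarity search with cosine similarity. For simplicity it compares word-count vectors; in practice the vectors would usually be embeddings, and the three example notes and the function names are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical collection of note fragments to search.
notes = [
    "cough and fever for three days",
    "knee pain after running",
    "fever and sore throat",
]
note_vectors = [Counter(n.split()) for n in notes]


def most_similar(query):
    """Return the stored note closest to the query, with its score."""
    q = Counter(query.split())
    scores = [cosine_similarity(q, v) for v in note_vectors]
    best = max(range(len(notes)), key=scores.__getitem__)
    return notes[best], scores[best]


print(most_similar("fever and cough"))
```

Cosine similarity depends on the direction of the vectors rather than their length, so a short query can still match a much longer note.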
Code Walkthrough: Next word prediction and text similarity
https://shorturl.at/mvET5
Questions?
abazemore@theabfm.org
nhendrix@theabfm.org