1 of 30

Natural language processing for clinical notes

Andrew Bazemore, MD, MPH

Nathaniel Hendrix, PharmD, PhD

Center for Professionalism and Value in Health Care

American Board of Family Medicine

4 of 30

Learning objectives

Understand how text is modeled in quantitative analyses
Match appropriate NLP methods with policy-relevant research questions
Raise your level of ambition about measurement and analysis using NLP

5 of 30

“Computers made analysis cheap, so we do far more analysis.
Desktop computers made design cheap, so there is far more design.
[Large language models] make images and text cheap, so...”
- Benedict Evans

6 of 30

The Opportunity

PRIMARY CARE is

The largest, most widely distributed platform of U.S. health care delivery; >50% of health care visits
Built on the four foundational functions

The 4Cs (1st Contact, Coordination, Continuity, and Comprehensiveness)
All potentially enhanced by the power of AI/ML

Underfunded & Under-represented in AI/ML algorithm development

5-7% of U.S. Healthcare spending*
99% of data being used in AI/ML algorithm development coming from elsewhere

Real risk of bias and harm if applied to clinical decision support (CDS) and primary care decision-making

*Martin S, Phillips RL, Petterson S, Levin Z, Bazemore AW. Primary Care Spending in the United States, 2002-2016. JAMA Intern Med. 2020;180(7):1019–1020.

“the provision of integrated, accessible health care services by clinicians who are accountable for addressing a large majority of personal health care needs, developing a sustained partnership with patients, and practicing in the context of family and community”

Primary care is the largest care delivery platform in the U.S., with nearly 230,000 primary care physicians (PCPs) providing more than 500 million visits per year and addressing preventive, acute, chronic, and multimorbid issues. Primary care serves as the first point of contact for continuous, comprehensive, and coordinated care and is central to improving community health.

Primary care in the U.S. lacks the investment and the prominent role that it is granted in other developed economies. From 2005 to 2015, overfull patient panels, insufficient reimbursement, and mounting clerical responsibilities led to further physician burnout and workforce shortages. Primary care is also severely under-examined and under-funded. In 2018, among the 86,000 projects funded by the National Institutes of Health, 8% were considered health services research, while less than 1% were focused on primary care.

With the digitization of everything from videos to voices and documents, artificial intelligence and machine learning (AI/ML) have revolutionized industries, including medicine, but have yet to transform primary care. A review of primary care AI/ML concluded that the field remains in “early stages of maturity,” despite a history spanning nearly 35 years.¹ Only 1 in 7 of the papers reviewed included a primary care author, so one barrier to greater impact is limited engagement from the primary care community.

7 of 30

Engaging primary care researchers & their data in AI/ML research & implementation

https://professionalismandvalue.org/setting-a-research-agenda-for-the-use-of-artificial-intelligence-machine-learning-in-primary-care/

8 of 30

$3 million ABFM Foundation Investment in advancing PC in AI/ML

9 of 30

Natural Language Processing & Primary Care EHR Data

Primary Care EHR data… big opportunity… and a big mess

rich, voluminous, heterogeneous, inconsistent, incomplete, inaccessible

The NLP opportunity (in Nathaniel’s words):

“How can we learn in real time from the millions of clinical decisions made every day?”

10 of 30

Challenges of clinical data

Not all relevant data is recorded in a structured format

Hendrix, et al. (2023) doi: 10.3122/jabfm.2022.220296R1

11 of 30

Challenges of clinical data

Documentation in structured data is subject to format changes

Heslin, et al. (2017) doi: 10.1097/MLR.0000000000000805

12 of 30

Challenges of clinical data

Structured data prone to entry errors

13 of 30

COVID-19 in American Family Cohort

Characteristic  Category      Overall        No ICD-10 / lab  ICD-10 / lab   P-Value
n                             16812          8692             8120
SDI, mean (SD)                13.5 (9.0)     13.8 (9.1)       13.3 (8.8)     0.001
Age, n (%)      < 18          93 (0.6)       31 (0.4)         62 (0.8)       <0.001
                18 to 64      8676 (51.6)    4134 (47.6)      4542 (55.9)
                65 to 74      4651 (27.7)    2521 (29.0)      2130 (26.2)
                75 and above  3392 (20.2)    2006 (23.1)      1386 (17.1)
Race, n (%)     API           393 (2.3)      199 (2.3)        194 (2.4)      0.006
                BLACK         804 (4.8)      437 (5.0)        367 (4.5)
                HISPANIC      2075 (12.3)    997 (11.5)       1078 (13.3)
                WHITE         13530 (80.5)   7054 (81.2)      6476 (79.8)
                OTHER         10 (0.1)       5 (0.1)          5 (0.1)

Indications of COVID-19 among patients prescribed Paxlovid

14 of 30

COVID-19 in American Family Cohort

15 of 30

Bag of words model

Converts texts into tabular data: one row per text, one column per word, each cell holding that word's count in that text
Fast, highly interpretable, and allows for use of familiar methods like logistic regression
However, because it ignores word order, it is often less accurate than more advanced methods
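
A minimal sketch of the bag of words idea, using only the Python standard library. The two example notes are invented for illustration, and the function names are assumptions made here rather than part of any particular library.

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(texts):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in texts[i]."""
    counts = [Counter(tokenize(t)) for t in texts]
    vocabulary = sorted(set().union(*counts))
    matrix = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, matrix

# Hypothetical example notes, invented for illustration only.
notes = [
    "Patient reports cough and fever",
    "No fever today, cough improving",
]
vocabulary, matrix = bag_of_words(notes)

for word, column in zip(vocabulary, zip(*matrix)):
    print(word, column)
```

Each row of `matrix` is one note and each column is one word, so the result can be passed to familiar tabular methods. Word order is discarded, which is the trade-off noted above.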

16 of 30

Bag of words representations

Image from O’Reilly Media

17 of 30

Bag of words representations

Image from O’Reilly Media

Lemmatization

Converts inflected or derived word forms into a basic form (the lemma):

Improve, improving, improved, improvements -> “improve”
Good, better, best -> “good”

This allows for a stronger signal from the text without consideration of grammatical features.
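
As a rough illustration, lemmatization can be thought of as a lookup from each word form to its basic form. The table below is a toy, hand-written assumption covering only the examples on this slide; in practice a library such as spaCy or NLTK would supply the lexicon.

```python
# Toy lemma table, hand-written for this illustration (not a real lexicon).
LEMMAS = {
    "improving": "improve",
    "improved": "improve",
    "improvements": "improve",
    "better": "good",
    "best": "good",
}

def lemmatize(tokens):
    """Replace each token with its lemma when the table has one; otherwise keep it."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

print(lemmatize(["Improving", "better", "cough"]))
```

After this step, "improving" and "improved" land in the same bag of words column, which is what strengthens the signal.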

18 of 30

Bag of words representations

Image from O’Reilly Media

Regularization

Often called TF-IDF (term frequency-inverse document frequency)

Shows how common a term is within a text compared to its frequency across the corpus

This allows for a greater focus on unique terms

Can also use this to filter out words that appear in most documents
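
The weighting can be sketched in a few lines. This uses one common unsmoothed variant of TF-IDF (term share within the document times the log of the inverse document frequency); libraries often apply smoothing, so their numbers will differ. The documents are invented for illustration.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document.

    weight = (count / document length) * log(N / documents containing the term)
    """
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical tokenized notes, invented for illustration only.
docs = [
    ["patient", "reports", "cough"],
    ["patient", "reports", "fever"],
    ["patient", "denies", "fever"],
]
weights = tf_idf(docs)
print(weights[0])
```

Here "patient" appears in every document, so its weight is zero, while "cough" appears in only one document and receives the largest weight in the first note.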

19 of 30

Bag of words representations

Image from O’Reilly Media

N-grams

In addition to single word columns, the bag of words will count multiple words that are side by side in the text

Example 2-grams for “It is a puppy”:

“It is”
“Is a”
“A puppy”
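
The same 2-grams can be produced with a short helper. The function name is an assumption made for this sketch.

```python
def ngrams(tokens, n):
    """Return every run of n adjacent tokens, in order."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "It is a puppy".lower().split()
bigrams = ngrams(tokens, 2)
print(bigrams)
```

Each n-gram then becomes its own column in the bag of words table, alongside the single-word columns.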

20 of 30

Applications: Classification

Assignment of a text into groups defined by the user (supervised).

E.g., sentiment:

“Dr. Smith is very caring and kind” = 👍

“Dr. Jones was late to my appointment and rushed me” = 👎

Or estimating likelihood of future events, such as hospitalization in the next year
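
A minimal sketch of supervised classification on bag of words features: a logistic regression fitted by stochastic gradient descent, written with the standard library only. The four labeled comments are invented for illustration, and the epochs and learning rate are arbitrary assumptions; a real analysis would use far more data and an established library.

```python
import math
from collections import Counter

def featurize(text, vocabulary):
    """Bag of words: count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

def predict_proba(weights, bias, features):
    """Logistic regression: probability that the text is positive (label 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def train(texts, labels, epochs=200, learning_rate=0.5):
    """Fit weights with per-example gradient steps on the log loss."""
    vocabulary = sorted({w for t in texts for w in t.lower().split()})
    rows = [featurize(t, vocabulary) for t in texts]
    weights, bias = [0.0] * len(vocabulary), 0.0
    for _ in range(epochs):
        for features, label in zip(rows, labels):
            error = predict_proba(weights, bias, features) - label
            bias -= learning_rate * error
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
    return vocabulary, weights, bias

# Hypothetical labeled comments (1 = positive, 0 = negative), invented here.
texts = [
    "dr smith is very caring and kind",
    "the staff were kind and helpful",
    "dr jones was late and rushed me",
    "the visit felt rushed and late",
]
labels = [1, 1, 0, 0]
vocabulary, weights, bias = train(texts, labels)

def score(text):
    return predict_proba(weights, bias, featurize(text, vocabulary))

print(score("caring and kind"), score("late and rushed"))
```

Because each weight belongs to one word, the fitted model stays easy to inspect: words seen only in positive comments end up with positive weights, and words seen only in negative comments end up with negative weights.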

21 of 30

Applications: Topic modeling

Identification of words that tend to occur together – “topics”

This is an unsupervised method – the researcher doesn’t define the topics in advance, but the model finds them in the data

[Figure: four example topics, labeled Topic A through Topic D]

22 of 30

Code Walkthrough: Classification and Topic Modeling

https://shorturl.at/mvET5

23 of 30

Word Embeddings

Based on the Distributional Hypothesis:

“You shall know a word by the company it keeps” (J.R. Firth, 1957)

24 of 30

Models of texts: Embeddings

Embeddings are commonly learned through prediction tasks, such as “skip-grams” (predicting the words that surround a given word) or predicting the next word in a sentence:

e.g., “Alice lay still on the sofa, feeling _____.”

“tired”
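
In the word2vec formulation of skip-grams, training examples are (target, context) pairs drawn from a sliding window over the text. The sketch below only builds those pairs; it does not train an embedding model, and the function name and window size are assumptions made for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (target, context) pairs for words within `window` positions of each target."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

tokens = "alice lay still on the sofa".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
```

A model trained to predict the context word from the target word ends up giving similar vectors to words that appear in similar contexts, which is the distributional hypothesis in practice.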

25 of 30

Models of texts: Embeddings

Conceptualizes texts as a multi-dimensional space where similar words are near each other

Image from Kastreti, et al. DOI:10.1016/j.dib.2019.105090

26 of 30

Models of texts: Embeddings

The spatial arrangement of words means that you can analyze them using basic arithmetic

Image from http://jalammar.github.io/illustrated-word2vec/
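
A toy version of embedding arithmetic. The 3-dimensional vectors are hand-picked assumptions chosen so the classic "king - man + woman" example works; real embeddings have hundreds of dimensions and are learned from data. Nearest neighbors are found by Euclidean distance here for simplicity.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-picked for illustration only.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "sofa":  [0.0, 0.3, 0.3],
}

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def subtract(u, v):
    return [a - b for a, b in zip(u, v)]

def nearest(vector, exclude=()):
    """Return the word whose embedding is closest to `vector`, skipping excluded words."""
    candidates = [w for w in EMBEDDINGS if w not in exclude]
    return min(candidates, key=lambda w: math.dist(vector, EMBEDDINGS[w]))

# king - man + woman
result = add(subtract(EMBEDDINGS["king"], EMBEDDINGS["man"]), EMBEDDINGS["woman"])
answer = nearest(result, exclude=("king", "man", "woman"))
print(answer)
```

The input words are excluded from the search, as is conventional for this kind of analogy query.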

27 of 30

Applications: Text generation

What ChatGPT does!

Large language model -> Fine-tune on your data -> Generate text

Can be used to answer questions, summarize texts, or fill in the blank

Zhang, et al. (2020) doi: 10.1145/3368555.3384448
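
Next word prediction can be illustrated with a toy bigram model that simply counts which word most often follows another. This is far simpler than a large language model and is offered only as a sketch of the prediction task; the training sentences are invented.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for each word, which words follow it in the training sentences."""
    following = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for current, nxt in zip(tokens, tokens[1:]):
            following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if the word was never seen."""
    followers = model.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

# Hypothetical training sentences, invented for illustration only.
corpus = [
    "patient reports feeling tired",
    "patient reports feeling tired and weak",
    "patient denies feeling dizzy",
]
model = train_bigram_model(corpus)
print(predict_next(model, "feeling"))
```

A large language model performs the same kind of prediction, but conditions on the whole preceding text rather than on a single word.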

28 of 30

Applications: Similarity estimates

Uses cosine similarity to identify the examples that are closest to the input

Can be used with…

Patients
Medications
Symptoms
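
A minimal sketch of a similarity search with cosine similarity. The medication vectors are invented placeholders, not learned embeddings, and the function names are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(query_name, embeddings, top_k=2):
    """Rank all other items by cosine similarity to the query item."""
    query = embeddings[query_name]
    others = [name for name in embeddings if name != query_name]
    ranked = sorted(others,
                    key=lambda name: cosine_similarity(query, embeddings[name]),
                    reverse=True)
    return ranked[:top_k]

# Hypothetical medication embeddings, invented for illustration only.
medications = {
    "lisinopril": [0.9, 0.1, 0.0],
    "losartan":   [0.8, 0.2, 0.1],
    "metformin":  [0.1, 0.9, 0.2],
    "albuterol":  [0.0, 0.2, 0.9],
}
closest = most_similar("lisinopril", medications)
print(closest)
```

The same ranking works for patients or symptoms once each is represented as a vector.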

29 of 30

Code Walkthrough: Next word prediction and text similarity

https://shorturl.at/mvET5

30 of 30

Questions?

abazemore@theabfm.org

nhendrix@theabfm.org