1 of 23

Characterizing Information Seeking Events in Health-Related Social Discourse

Omar Sharif, Madhusudan Basak, Tanzia Parvin, Ava Scharfstein, Alphonso Bradham, Jacob T. Borodovsky, Sarah E. Lord, Sarah M. Preum

Dartmouth College

Code and data available at: https://github.com/omar-sharif03/AAAI-2024

2 of 23

Motivation

  • Social media has become a popular platform for individuals to seek and share health information.
  • Analyzing health-related information through the lens of events can offer insights into different facets of the treatment journey.

  • This post mentions multiple key events, e.g., taking medication (Gabapentin) and experiencing psychophysical effects (pain or anxiety).
  • To showcase the significance of such an event-driven approach, we analyze social discourse regarding recovery from opioid use disorder (OUD).


I just had knee surgery about a week ago, and my pain meds (Gabapentin) are not cutting it! I am having sleep troubles and getting pretty anxious about recovery. I am considering Kratom. Any suggestions about how much I can take per day and how often?

3 of 23

Challenges

1. Difficulty identifying relevant events for OUD treatment.

2. Hard to annotate complex, domain-specific data.

3. Lack of prior work in computational health.

4 of 23

Contributions


  • Focus on a highly vulnerable population (individuals considering or undergoing OUD recovery) that has received little attention.

  • Investigate the performance of state-of-the-art models, including ChatGPT, on this knowledge-intensive task.

  • Propose a treatment information-seeking event schema based on guidance from domain experts.
  • Leverage the schema to develop a novel multilabel dataset.

Resource

Social Impact

Benchmarking

5 of 23

TREAT-ISE: Treatment Information-Seeking Event Dataset


6 of 23

Information-Seeking Events

  • Based on the guidance of domain experts, we identify 5 coarse categories of treatment information-seeking events for MOUD (medications for opioid use disorder).

Event Type                                  Definition
Accessing MOUD (AM)                         Events related to accessing MOUD (e.g., insurance, pharmacy).
Taking MOUD (TM)                            Events related to the timing, dosage, and frequency of taking MOUD.
Experiencing Psychophysical Effects (EP)    Events related to concerns about potential physical and/or psychological effects during recovery.
Relapse (RL)                                Events about relapsing during recovery.
Tapering MOUD (TP)                          Events asking about reducing or quitting MOUD.
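Since a single post can mention several of the events above, each sample maps to a multilabel target. A minimal sketch of one possible encoding (the helper name and vector layout are illustrative, not the paper's exact implementation):

```python
# Minimal sketch: encode a post's event labels as a multilabel binary vector.
# The abbreviations follow the schema above; the helper name is hypothetical.
EVENT_TYPES = ["AM", "TM", "EP", "RL", "TP"]

def encode_events(labels):
    """Map a collection of event abbreviations to a 0/1 vector over EVENT_TYPES."""
    labels = set(labels)
    return [1 if ev in labels else 0 for ev in EVENT_TYPES]

# A post labeled Taking MOUD and Tapering:
vec = encode_events({"TM", "TP"})  # [0, 1, 0, 0, 1]
```

With this representation, standard multilabel classifiers and metrics apply directly.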

7 of 23

TREAT-ISE Development Steps

Figure: Information-seeking event dataset development steps.


8 of 23

TREAT-ISE Statistics

Table: Sample data excerpt with titles, posts, and labels (shortened and paraphrased as per IRB guidelines).

Table: Summary of different classes in TREAT-ISE.

  • 5083 multilabel samples in total.
  • The dataset is imbalanced, with EP having the highest number of samples.
  • Long average sample length (ranging from 122 to 151 words).

Title: Looking for suboxone guidance?

Post: I take 1-2 mg subs per day which is a decrease from the original dose of 8mg. Just looking for a plan of action in which to stick with to eventually get off completely.

Events: Taking MOUD (TM), Tapering (TP)

9 of 23

Benchmarking: Methods & Results


10 of 23

Methods

  • We perform benchmark evaluations with non-transformer, transformer, and LLM models.
  • For LLM experiments, the optimal prompt is selected through an iterative process.
  • The temperature is set to 0.0 to ensure deterministic behavior and reproducibility.
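To illustrate the LLM setup (the exact prompts are in the paper and repository; this template and the parsing helper are hypothetical sketches, not the study's actual prompt):

```python
# Illustrative sketch: build a zero-shot prompt for event detection and parse a
# comma-separated model answer back into valid event labels. With an actual API
# call (e.g., gpt-3.5-turbo-0613), temperature=0.0 would be passed so that
# decoding is deterministic across runs.
EVENT_TYPES = ["AM", "TM", "EP", "RL", "TP"]

PROMPT_TEMPLATE = (
    "Identify all treatment information-seeking events in the post below. "
    "Answer with a comma-separated subset of {labels}, or 'None'.\n\nPost: {post}"
)

def build_prompt(post):
    return PROMPT_TEMPLATE.format(labels=", ".join(EVENT_TYPES), post=post)

def parse_answer(answer):
    """Keep only valid event abbreviations from the model's free-text answer."""
    tokens = [t.strip().upper() for t in answer.split(",")]
    return [ev for ev in EVENT_TYPES if ev in tokens]
```

Constraining the answer format and filtering against the known label set keeps free-text LLM output comparable to the classifiers' multilabel predictions.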


11 of 23

Experimental and Evaluation Setup

  • Transformer models are sourced from the HuggingFace library.
    • Models are fine-tuned for 10 epochs with a learning rate of 2e-5 and a batch size of 16.
  • ChatGPT experiments are performed via the gpt-3.5-turbo-0613 API.
  • TREAT-ISE is partitioned into three mutually exclusive sets:
    • Train (80%), Validation (10%), Test (10%)
  • The validation set is used to tune hyperparameters across experiments.
  • The F1-score is used to compare models.
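The slides do not state the averaging variant here; assuming micro averaging for illustration, multilabel F1 over the binary event vectors can be sketched as:

```python
# Sketch (assumption: micro averaging): F1 over multilabel 0/1 vectors.
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for lists of multilabel binary vectors."""
    tp = fp = fn = 0
    for true_vec, pred_vec in zip(y_true, y_pred):
        for t, p in zip(true_vec, pred_vec):
            tp += 1 if (t and p) else 0          # label present and predicted
            fp += 1 if (p and not t) else 0      # predicted but absent
            fn += 1 if (t and not p) else 0      # present but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return (2 * precision * recall / (precision + recall)
            if (precision + recall) else 0.0)
```

Micro averaging pools label decisions across classes, so frequent classes such as EP weigh more heavily than rare ones.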


12 of 23

Results

Table: Performance comparison of non-transformer and transformer models on TREAT-ISE.

  • BiGRU achieved the highest F1-score among the non-transformer models.
  • XLNet achieved the best overall F1-score (0.774), outperforming all other models.

Method                      Classifier   Precision   Recall   F1-Score
Non-transformer Baselines   LR           0.653       0.597    0.593
                            NBSVM        0.592       0.662    0.602
                            FastText     0.715       0.690    0.624
                            BiGRU        0.628       0.693    0.702
Transformer Baselines       BERT         0.809       0.679    0.733
                            RoBERTa      0.755       0.768    0.757
                            ELECTRA      0.779       0.731    0.748
                            XLNet        0.775       0.780    0.774
                            MPNet        0.768       0.740    0.751

13 of 23

Results

Table: Performance of ChatGPT (GPT-3.5) on TREAT-ISE. The shorthand indicates ZS-S, ZS-L: Zero-shot (Short, Long), FS-S, FS-L: Few-shot (Short, Long), and CoT: Chain-of-Thought prompting.

  • ChatGPT with the CoT approach achieved its best F1-score of 0.631.
  • Surprisingly, all ChatGPT variants scored lower than XLNet.
  • ChatGPT variants (except ZS-S) showed much higher recall than precision.

Method              Classifier       Precision   Recall   F1-Score
Best model          XLNet            0.775       0.780    0.774
ChatGPT Baselines   ChatGPT (ZS-S)   0.668       0.407    0.433
                    ChatGPT (ZS-L)   0.687       0.550    0.581
                    ChatGPT (FS-S)   0.497       0.824    0.609
                    ChatGPT (FS-L)   0.511       0.818    0.620
                    ChatGPT (CoT)    0.559       0.764    0.631

14 of 23

Results

  • All models encountered challenges in identifying samples from the taking MOUD (TM) and experiencing psychophysical effects (EP) events.
  • ChatGPT's higher recall in individual classes indicates that it tends to overpredict.

Table: Classwise performance for treatment information-seeking event detection.


15 of 23

Takeaways from Results

  • The transformer model (XLNet) achieved a higher score than ChatGPT (GPT-3.5) on treatment information-seeking event detection.

  • Identifying information-seeking events is difficult compared to similar multilabel classification problems.

  • The task requires domain knowledge and leaves significant room for improvement.


16 of 23

Ablation Studies with ChatGPT and XLNet


17 of 23

Ablation Studies

  • Key findings from a side-by-side analysis of the best-performing transformer (XLNet) and the ChatGPT (Chain-of-Thought) approach:

1. ChatGPT struggles more on long samples.

2. ChatGPT confuses events more often.

3. ChatGPT tends to overpredict more.

18 of 23

ChatGPT Overpredicts

  • ChatGPT's average overprediction ratio is 45%.
  • It exhibits the highest error in the TM and EP classes, with 166 (out of 323) and 135 (out of 267) mispredictions, respectively.
  • XLNet exhibits a low overprediction ratio across categories, except for the EP class.

Figure: Classwise overprediction ratio (#false positive / #predicted positives) of ChatGPT with CoT prompts and the XLNet model.

Method              AM              TM                TP                EP                RL               Other
ChatGPT (GPT-3.5)   36/96 (0.375)   166/323 (0.513)   103/227 (0.453)   135/267 (0.505)   32/122 (0.262)   44/74 (0.594)
XLNet               12/75 (0.160)   35/165 (0.212)    21/139 (0.151)    92/227 (0.405)    19/155 (0.122)   9/33 (0.27)
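The figure caption defines the overprediction ratio as #false positives / #predicted positives; a one-line sketch that reproduces the table entries:

```python
# Sketch of the overprediction ratio from the figure caption:
# (#false positives) / (#predicted positives) per class.
def overprediction_ratio(false_positives, predicted_positives):
    """Fraction of a class's predicted positives that are false positives."""
    return false_positives / predicted_positives if predicted_positives else 0.0

# ChatGPT's AM entry above: 36 false positives out of 96 predicted positives.
ratio = overprediction_ratio(36, 96)  # 0.375
```

A higher ratio means the model assigns the class to many samples that do not actually contain that event.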

19 of 23

ChatGPT Struggles on Long Samples

  • As sample length (measured in words) increases, the frequency of accurate predictions decreases.
  • The average length of samples where ChatGPT made errors is 128.04 words, whereas for XLNet it is 140.21.
  • Both models encounter difficulties in understanding information-seeking events with long-range context.

Figure: Correlation between sample length and frequency of correct/wrong predictions
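The length statistics above can be computed with a simple helper; this is an illustrative sketch (the helper name and toy data are hypothetical), with length measured in words as in the analysis:

```python
# Sketch: average word count of the samples a model got wrong.
def avg_error_length(samples, correct_flags):
    """Mean word count over samples whose prediction was incorrect."""
    lengths = [len(text.split())
               for text, ok in zip(samples, correct_flags) if not ok]
    return sum(lengths) / len(lengths) if lengths else 0.0

samples = ["short post", "a much longer post about tapering and dosage"]
# Suppose only the longer post was misclassified:
avg = avg_error_length(samples, [True, False])  # 8.0
```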


20 of 23

ChatGPT Confuses between Events

  • ChatGPT made the highest number of errors (92) on the RL event class and most often confused it with the TM or EP event class.
  • XLNet most often misclassified TM as the EP class (10 instances).

Table: Confusion mapping of ChatGPT (CG) with the CoT approach and the XLNet (XL) model.


21 of 23

Key Takeaways and Future Work

  • Developed TREAT-ISE, a novel multilabel dataset to characterize OUD treatment information-seeking events.
  • Error analysis reveals the models' poor understanding of domain-specific nuances.
  • There is room for further investigation on this dataset.
  • Exploring minimal supervision to augment the dataset size is a potential next step.
  • Investigating how other large language models perform on this task can provide valuable insights.


22 of 23

Acknowledgements

💰 This research is partially supported by a P30 Center of Excellence grant from the National Institute on Drug Abuse (NIDA), P30DA029926.

🏠 We thank our Center for Technology and Behavioral Health (CTBH) colleagues for their guidance and insightful suggestions.


23 of 23

Thank You!
