
Multimodal Analysis and Assessment of Therapist Empathy in Motivational Interviews

Trang Tran¹, Yufeng Yin¹, Leili Tavabi¹, Joannalyn Delacruz², Brian Borsari², Joshua Woolley², Stefan Scherer¹, Mohammad Soleymani¹

¹Institute for Creative Technologies, University of Southern California

²San Francisco Veterans Affairs Medical Center, Department of Psychiatry & Behavioral Sciences, UCSF

Approach

Discussion & Conclusions

Best performing models:

multimodal > unimodal; larger margin in the therapist-independent setting vs. therapist-dependent
Q2 best in therapist-independent settings; Q4 and Full session better in therapist-dependent settings
therapist+client speech > therapist-only in most cases

Powerful acoustic representations (HuBERT) seem to alleviate language modality dominance in prediction, but differences are small
Modality weights: audio is assigned larger weight when using therapist-only turns vs. both speakers, likely because audio becomes a more important signal in limited data scenarios (i.e., fewer turns, less context, shorter segments)
Error analysis:

more false negatives than false positives in most cases
most misclassifications: MI-inconsistent utterances (25% misclassified by all models)

Applications: understanding key aspects of empathy; facilitating therapist training

Code & models: https://github.com/ihp-lab/mm_analysis_empathy

Acknowledgements

This work was supported by NIAAA grants R01 AA027225, R01 AA017427 and R01 AA12518. The content is solely the responsibility of the authors and does not represent the official views of the NIAAA, NIH, Dept. of Veterans Affairs, or the US Government.

References

Kate Carey, Lori Scott-Sheldon, Lorra Garey, Jennifer Elliott, and Michael Carey. Alcohol interventions for mandated college students: A meta-analytic review. Journal of Consulting & Clinical Psychology 84, 7 (2016)
Linda A. Dimeff, John S. Baer, Daniel R. Kivlahan, and G. Alan Marlatt. Brief alcohol screening and intervention for college students (BASICS): A harm reduction approach. Guilford Press, New York, NY (2002)
Molly Magill, Tim Janssen, Nadine Mastroleo, Ariel Hoadley, Justin Walthers, Nancy Barnett, and Suzanne Colby. Motivational interviewing technical process and moderated relational process with underage young adult heavy drinkers. Psychology of Addictive Behaviors 33, 2 (2019)
James G. Murphy, Kathryn S. Gex, Ashley A. Dennhardt, Alex P. Miller, Susan E. O’Neill, and Brian Borsari. Beyond BASICS: A scoping review of novel intervention content to enhance the efficacy of brief alcohol interventions for emerging adults. Psychology of Addictive Behaviors 36, 6 (2022)

Background

Empathy:

“accurate understanding of the client’s awareness of their own state”
essential in reaching desired therapeutic outcomes

Motivational Interviewing: a therapy approach to evoke clients’ intrinsic reasons for behavior change (no “directives” or “judgments”)

Therapist Utterances: Questions (open ended: QUO, closed ended: QUC), Reflections (simple: RES, complex: REC), Giving Information (GI), Facilitate (FA), MI-consistent (MICO), MI-inconsistent (MIIN), Other
Client Utterances: Change Talk (CT), Sustain Talk (ST), Follow Neutral (FN)

Session stages (quartiles):

Q1 Introduction, session information
Q2 Discussion of drinking behavior, role of alcohol, rapport building
Q3 Personalized feedback, statistics and assessment of behavior
Q4 Plan for action in a collaborative conversation

Q2 and Q4: best opportunities for expressions of empathy; supported by previous studies (Dimeff et al., 2002; Carey et al., 2016; Magill et al., 2019; Murphy et al., 2022)
Task: binary empathy estimation + comprehensive analyses

Using unimodal and/or bimodal features (transcripts & speech)
Using only therapist speech vs. therapist+client
Using subsessions (Q2, Q4) vs. full session

Datasets

Dataset1:

college students with risky drinking; manual transcriptions;
MISC coding
219 sessions; avg. 50 minutes, 420 turns

Dataset2:

community underage risky drinkers; Google ASR transcripts;
MITI coding
82 sessions; avg. 55 minutes, 600 turns

Binarization:

label = 1 for score > 0.5; label = 0 otherwise

Data setup:

sample X_i: a window of W turns
sample X_{i+1}: a window of W turns, overlapping with X_i with hop P
each sample gets the label y_i of the session
extracted for {Q2, Q4, full session} × {both speakers, therapist-only} × {therapist-independent, therapist-dependent}
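
The binarization and windowing steps above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function names, the example values of W and P, and the handling of sessions shorter than W turns are all assumptions.

```python
from typing import List, Sequence, Tuple


def binarize(score: float, threshold: float = 0.5) -> int:
    """Label is 1 when the session score exceeds the threshold, else 0."""
    return 1 if score > threshold else 0


def make_windows(turns: Sequence[str], window: int, hop: int) -> List[List[str]]:
    """Slice a session into windows of `window` turns, advancing `hop` turns each time."""
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    if len(turns) <= window:
        # Assumption: a session shorter than W turns yields a single window.
        return [list(turns)]
    last_start = len(turns) - window
    return [list(turns[s:s + window]) for s in range(0, last_start + 1, hop)]


def make_samples(
    turns: Sequence[str], score: float, window: int, hop: int
) -> List[Tuple[List[str], int]]:
    """Every window X_i inherits the label y_i of its session."""
    label = binarize(score)
    return [(w, label) for w in make_windows(turns, window, hop)]


# Hypothetical session of 10 turns with W = 4 and P = 2.
samples = make_samples([f"turn{i}" for i in range(10)], score=0.7, window=4, hop=2)
```

With a hop smaller than the window, consecutive samples share W - P turns, which is the overlap described in the data setup.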

Bimodal features:

Text: Distil-RoBERTa-Emotion
Audio: HuBERT, pretrained on MSP-Podcast

Speaker encoding: learn turn change
Sequence representation:

GRU + mean/max pooling
MLP classifier
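
To make the pooling step concrete, the sketch below collapses a sequence of per-turn hidden states (such as a GRU would produce) into one fixed-size vector that a classifier can consume. The poster does not say whether mean and max pooling are alternatives or are combined, so the sketch treats them as alternatives; the function name `pool` and the toy vectors are hypothetical.

```python
from typing import List


def pool(hidden_states: List[List[float]], mode: str = "mean") -> List[float]:
    """Reduce a sequence of hidden-state vectors to one vector, per dimension."""
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    columns = list(zip(*hidden_states))  # one tuple per hidden dimension
    if mode == "mean":
        return [sum(col) / len(col) for col in columns]
    if mode == "max":
        return [max(col) for col in columns]
    raise ValueError(f"unknown pooling mode: {mode}")


# Three turns, hidden size 2 (toy values).
states = [[0.0, 2.0], [1.0, -1.0], [2.0, 5.0]]
mean_vec = pool(states, "mean")
max_vec = pool(states, "max")
```

Either pooled vector has the same size regardless of how many turns the window contains, which is what lets a fixed-size MLP sit on top.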

Fusion

Early: z = concat(text, audio) → ŷ = MLP(z)

Late: z_a = MLP(audio), z_t = MLP(text) → ŷ = (1-λ)*z_t + λ*z_a

λ: learnable & used as indicator of modality contribution
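
The late-fusion rule can be illustrated with a small numeric sketch. How λ is constrained is not stated on the poster; passing an unconstrained parameter through a sigmoid so that λ stays in (0, 1) is an assumption made here, and `raw_lambda` is a hypothetical name.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def late_fusion(z_t: float, z_a: float, raw_lambda: float) -> tuple:
    """Combine text and audio scores as (1 - lam) * z_t + lam * z_a.

    Assumption: lam = sigmoid(raw_lambda), so the weight on audio lies in (0, 1).
    """
    lam = sigmoid(raw_lambda)
    y_hat = (1.0 - lam) * z_t + lam * z_a
    return y_hat, lam


# raw_lambda = 0 gives lam = 0.5, i.e. equal weight on both modalities.
y_hat, lam = late_fusion(z_t=0.8, z_a=0.4, raw_lambda=0.0)
```

Read this way, a learned λ above 0.5 means the audio branch contributes more to the prediction than the text branch, which is how the weight serves as an indicator of modality contribution.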

Training:

BCE loss; AdamW optimizer
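
For reference, binary cross-entropy over a batch of window-level predictions can be computed as in the sketch below. It assumes the model outputs probabilities in (0, 1); the clipping constant `eps` is an illustrative choice, and the AdamW update itself is not shown.

```python
import math
from typing import Sequence


def bce_loss(probs: Sequence[float], labels: Sequence[int], eps: float = 1e-7) -> float:
    """Mean binary cross-entropy: -[y*log(p) + (1-y)*log(1-p)] averaged over samples."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and the same length")
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


# Two windows: one empathic (label 1) and one not (label 0).
loss = bce_loss([0.9, 0.2], [1, 0])
```

The loss grows sharply for confident wrong predictions, so a window predicted at 0.1 for a positive session is penalized far more than one predicted at 0.9.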

Voted F1 scores (%↑)
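
One plausible reading of "voted" F1 is that window-level predictions are aggregated to a single session-level prediction by majority vote before F1 is computed. The poster does not spell out the procedure, so the sketch below is an assumption throughout, including the tie-breaking rule (ties go to the positive class) and the use of positive-class F1.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def vote_by_session(window_preds: List[Tuple[str, int]]) -> Dict[str, int]:
    """Majority vote over the window predictions of each session."""
    votes = defaultdict(list)
    for session_id, pred in window_preds:
        votes[session_id].append(pred)
    # Assumption: a tie counts as a positive prediction.
    return {sid: int(2 * sum(p) >= len(p)) for sid, p in votes.items()}


def f1_positive(gold: Dict[str, int], pred: Dict[str, int]) -> float:
    """F1 for the positive class, computed over sessions."""
    tp = sum(1 for s in gold if gold[s] == 1 and pred[s] == 1)
    fp = sum(1 for s in gold if gold[s] == 0 and pred[s] == 1)
    fn = sum(1 for s in gold if gold[s] == 1 and pred[s] == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example: three sessions with three windows each.
window_preds = [
    ("s1", 1), ("s1", 1), ("s1", 0),
    ("s2", 0), ("s2", 0), ("s2", 1),
    ("s3", 1), ("s3", 0), ("s3", 0),
]
gold = {"s1": 1, "s2": 0, "s3": 1}
session_preds = vote_by_session(window_preds)
voted_f1 = f1_positive(gold, session_preds)
```

In this toy case session s3 is a false negative: it is labeled empathic but most of its windows are predicted as not empathic.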

Learned Fusion Weights λ

Model Architecture