Multimodal Analysis and Assessment of Therapist Empathy
in Motivational Interviews
Trang Tran¹, Yufeng Yin¹, Leili Tavabi¹, Joannalyn Delacruz², Brian Borsari²,
Joshua Woolley², Stefan Scherer¹, Mohammad Soleymani¹
¹ Institute for Creative Technologies, University of Southern California
² San Francisco Veterans Affairs Medical Center, Department of Psychiatry & Behavioral Sciences, UCSF
Discussion & Conclusions
- Best-performing models:
- multimodal > unimodal; the margin is larger in the therapist-independent setting than in the therapist-dependent setting
- Q2 best in therapist-independent settings; Q4 and full session better in therapist-dependent settings
- therapist+client speech > therapist-only in most cases
- Powerful acoustic representations (HuBERT) seem to alleviate language modality dominance in prediction, but differences are small
- Modality weights: audio is assigned larger weight when using therapist-only turns vs. both speakers, likely because audio becomes a more important signal in limited data scenarios (i.e., fewer turns, less context, shorter segments)
- Error analysis:
- more false negatives than false positives in most cases
- most misclassifications: MI-inconsistent utterances (25% misclassified by all models)
- Applications: understanding key aspects of empathy; facilitating therapist training
Code & models: https://github.com/ihp-lab/mm_analysis_empathy
Acknowledgements
This work was supported by NIAAA grants R01 AA027225, R01 AA017427, and R01 AA12518. The content is solely the responsibility of the authors and does not represent the official views of the NIAAA, NIH, Dept. of Veterans Affairs, or the US Government.
References
- Kate Carey, Lori Scott-Sheldon, Lorra Garey, Jennifer Elliott, and Michael Carey. 2016. Alcohol interventions for mandated college students: A meta-analytic review. Journal of Consulting & Clinical Psychology 84, 7 (2016).
- Linda A. Dimeff, John S. Baer, Daniel R. Kivlahan, and G. Alan Marlatt. 2002. Brief alcohol screening and intervention for college students (BASICS): A harm reduction approach. Guilford Press, New York, NY.
- Molly Magill, Tim Janssen, Nadine Mastroleo, Ariel Hoadley, Justin Walthers, Nancy Barnett, and Suzanne Colby. 2019. Motivational interviewing technical process and moderated relational process with underage young adult heavy drinkers. Psychology of Addictive Behaviors 33, 2 (2019).
- James G. Murphy, Kathryn S. Gex, Ashley A. Dennhardt, Alex P. Miller, Susan E. O’Neill, and Brian Borsari. 2022. Beyond BASICS: A scoping review of novel intervention content to enhance the efficacy of brief alcohol interventions for emerging adults. Psychology of Addictive Behaviors 36, 6 (2022).
Background
- Empathy:
- “accurate understanding of the client’s awareness of their own state”
- essential in reaching desired therapeutic outcomes
- Motivational Interviewing: a therapy approach to evoke clients’ intrinsic reasons for behavior change (no “directives” or “judgments”)
- Therapist Utterances: Questions (open ended: QUO, closed ended: QUC), Reflections (simple: RES, complex: REC), Giving Information (GI), Facilitate (FA), MI-consistent (MICO), MI-inconsistent (MIIN), Other
- Client Utterances: Change Talk (CT), Sustain Talk (ST), Follow Neutral (FN)
- Session stages (quartiles):
- Q1 Introduction, session information
- Q2 Discussion of drinking behavior, role of alcohol, rapport building
- Q3 Personalized feedback, statistics and assessment of behavior
- Q4 Plan for action in a collaborative conversation
- Q2 and Q4: best opportunities for expressions of empathy; supported by previous studies (Dimeff et al., 2002; Carey et al., 2016; Magill et al., 2019; Murphy et al., 2022)
- Task: binary empathy estimation + comprehensive analyses
- Using unimodal and/or bimodal features (transcripts & speech)
- Using only therapist speech vs. therapist+client
- Using subsessions (Q2, Q4) vs. full session
- Dataset 1:
- college students with risky drinking; manual transcriptions; MISC coding
- 219 sessions; avg. 50 minutes, 420 turns
- Dataset 2:
- community underage risky drinkers; Google ASR transcripts; MITI coding
- 82 sessions; avg. 55 minutes, 600 turns
- Binarization: label = 1 for score > 0.5; label = 0 otherwise
- Data setup:
- sample X_i: a window of W turns
- sample X_{i+1}: the next window of W turns, shifted from X_i by a hop of P turns (consecutive windows overlap)
- each sample X_i gets the label y_i of its session
- extracted for {Q2, Q4, full session} × {both speakers, therapist-only} × {therapist-independent, therapist-dependent}
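The windowing and labeling scheme above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `binarize` and `make_windows`, the toy session, and the values chosen for W and P are assumptions. Only the window/hop scheme, the session-level label, and the 0.5 threshold come from the poster.

```python
from typing import List, Sequence, Tuple


def binarize(score: float, threshold: float = 0.5) -> int:
    """Map a session-level empathy score to a binary label (1 if score > threshold)."""
    return 1 if score > threshold else 0


def make_windows(
    turns: Sequence[str], label: int, window: int, hop: int
) -> List[Tuple[List[str], int]]:
    """Slice a session into overlapping windows of `window` turns, `hop` turns apart.

    Every window inherits the session-level label. A session shorter than one
    window yields a single shorter sample, and trailing turns that do not fill
    a whole window are dropped (both are assumptions; the poster does not say
    how these cases are handled).
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    if len(turns) <= window:
        return [(list(turns), label)]
    samples = []
    for start in range(0, len(turns) - window + 1, hop):
        samples.append((list(turns[start:start + window]), label))
    return samples


# Hypothetical session of 10 turns with W = 4 and P = 2
session = [f"turn_{i}" for i in range(10)]
samples = make_windows(session, binarize(0.8), window=4, hop=2)
```

With a hop smaller than the window, neighbouring samples share turns, which multiplies the number of training samples per session while keeping a single label per session.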
- Bimodal features:
- Text: Distil-RoBERTa-Emotion
- Audio: HuBERT, pretrained on MSP-Podcast
- Speaker encoding: learned, marks speaker turn changes
- Sequence representation:
- GRU + mean/max pooling
- MLP classifier
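To illustrate the pooling step, the sketch below collapses a sequence of per-turn hidden states (such as the outputs of a GRU) into one fixed-size vector. Concatenating the mean-pooled and max-pooled vectors is an assumption: the poster only says "mean/max pooling", which could also mean choosing one of the two. `mean_max_pool` is a hypothetical helper name, and the numbers are made up.

```python
from typing import List, Sequence


def mean_max_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Pool a (num_turns x hidden_dim) sequence into a vector of size 2 * hidden_dim."""
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    # Transpose so that each tuple holds one hidden dimension across all turns.
    columns = list(zip(*hidden_states))
    mean_part = [sum(col) / len(col) for col in columns]
    max_part = [max(col) for col in columns]
    return mean_part + max_part


# Three turns with hidden size 2
pooled = mean_max_pool([[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]])
# pooled == [2.0, 2.0, 3.0, 4.0]: means first, then maxima
```

The pooled vector has a fixed length regardless of how many turns the window contains, so it can be passed directly to the classifier.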
- Fusion:
- Early: z = concat(text, audio) → ŷ = MLP(z)
- Late: z_t = MLP(text), z_a = MLP(audio) → ŷ = (1 - λ) * z_t + λ * z_a
- λ: learnable; used as an indicator of modality contribution
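The two fusion schemes can be sketched as below. This is an illustration under stated assumptions rather than the released implementation: a single linear layer with a sigmoid stands in for each MLP, the audio branch output z_a is assumed to be computed the same way as the text branch output z_t, and all feature values and weights are made up.

```python
import math
from typing import Sequence, Tuple

# (weights, bias) of a one-layer stand-in for an MLP head
Head = Tuple[Sequence[float], float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def apply_head(features: Sequence[float], head: Head) -> float:
    """Linear layer followed by a sigmoid, standing in for the MLP classifier."""
    weights, bias = head
    if len(weights) != len(features):
        raise ValueError("weight and feature sizes differ")
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def early_fusion(text: Sequence[float], audio: Sequence[float], head: Head) -> float:
    """z = concat(text, audio); y_hat = MLP(z)."""
    z = list(text) + list(audio)
    return apply_head(z, head)


def late_fusion(
    text: Sequence[float],
    audio: Sequence[float],
    text_head: Head,
    audio_head: Head,
    lam: float,
) -> float:
    """y_hat = (1 - lam) * z_t + lam * z_a, with one head per modality."""
    z_t = apply_head(text, text_head)
    z_a = apply_head(audio, audio_head)
    return (1.0 - lam) * z_t + lam * z_a


text_feat, audio_feat = [0.2, -0.1], [0.5, 0.3, -0.4]
text_head, audio_head = ([1.0, -1.0], 0.0), ([0.5, 0.5, 0.5], 0.1)
y_early = early_fusion(text_feat, audio_feat, ([0.1, 0.2, -0.3, 0.4, 0.5], 0.0))
y_late = late_fusion(text_feat, audio_feat, text_head, audio_head, lam=0.3)
```

With λ near 0 the late-fusion prediction follows the text branch, and with λ near 1 it follows the audio branch, which is why the learned value of λ can be read as a modality weight.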
- Training:
- BCE loss; AdamW optimizer
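As a rough illustration of the training setup, the sketch below applies repeated AdamW updates to the late-fusion weight λ under a BCE loss, holding the two branch outputs fixed. It is a hand-written toy for a single scalar: the hyperparameters, the example values, and the assumption that the branch outputs are probabilities are all illustrative, and in practice a deep learning framework would perform these updates for every model parameter.

```python
import math


def bce(y_hat: float, y: int, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one prediction, clipped for numerical safety."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))


def bce_grad_lambda(lam: float, z_t: float, z_a: float, y: int) -> float:
    """d(BCE)/d(lam) for y_hat = (1 - lam) * z_t + lam * z_a, via the chain rule."""
    y_hat = (1.0 - lam) * z_t + lam * z_a
    dloss_dyhat = (y_hat - y) / (y_hat * (1.0 - y_hat))
    return dloss_dyhat * (z_a - z_t)


def adamw_step(param, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)   # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** t)   # bias-corrected second moment
    param -= lr * weight_decay * param              # decoupled weight decay
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)  # adaptive gradient step
    return param, m, v


# Made-up branch outputs and label: the audio branch is closer to the target.
z_t, z_a, y = 0.4, 0.8, 1
lam, m, v = 0.5, 0.0, 0.0
loss_before = bce((1.0 - lam) * z_t + lam * z_a, y)
for t in range(1, 21):
    grad = bce_grad_lambda(lam, z_t, z_a, y)
    lam, m, v = adamw_step(lam, grad, m, v, t)
loss_after = bce((1.0 - lam) * z_t + lam * z_a, y)
```

In this toy case λ drifts toward the audio branch because that branch agrees better with the label. The sketch does not constrain λ to [0, 1]; the poster does not state how, or whether, the authors constrain it.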