1 of 14

RL‑Powered Mental Health Support Bot

Chirasmayee B, Pallavi Bichpuriya, Diya Shah

Hack With CAIR, April 19 2025

2 of 14

Problem & Motivation

  • Pain point: People need immediate, stigma‑free mental health nudges
  • Challenges:
    • One‑size‑fits‑all tips feel generic
    • Live human coaches don’t scale
  • Our goal: Personalize and adapt suggestions over time

3 of 14

High‑Level Solution

  • User input → GPT classifies mood into 13 labels
  • Map label → numerical mood_score ∈ [0, 1]
  • Pre‑trained PPO picks one of 8 wellness strategies
  • Output: suggestion + live mood‑trend chart (pipeline sketch below)
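A minimal sketch of this pipeline, assuming a Stable‑Baselines3 PPO agent; the mood_score_map values, strategy names, and model path are illustrative assumptions (the real project covers all 13 labels):

```python
import numpy as np
from stable_baselines3 import PPO

# Map each GPT mood label to a numerical score in [0, 1]
# (illustrative values; the real map covers all 13 labels).
mood_score_map = {"sadness": 0.2, "anxiety": 0.3, "neutral": 0.5, "joy": 0.9}

# Eight wellness strategies, indexed by the PPO action id (illustrative names).
strategies = ["meditation", "breathing_ex", "nature_walk", "journaling",
              "gratitude_list", "music", "stretching", "call_a_friend"]

model = PPO.load("ppo_mood_agent")   # pre-trained agent; path is an assumption

label = "sadness"                    # e.g. returned by the GPT classifier
obs = np.array([mood_score_map[label]], dtype=np.float32)
action, _ = model.predict(obs, deterministic=True)
print(strategies[int(action)])       # the suggested wellness strategy
```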

4 of 14

System Architecture

  • Browser sends text to Gradio backend
  • GPT mood classifier + mood_score_map + PPO.predict
  • Suggestion UI: text reply & mood‑trend graph (wiring sketch below)
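A sketch of how the Gradio frontend could wire these pieces together; classify_mood, score_for, and pick_strategy are hypothetical stand‑ins for the GPT classifier, the mood_score_map lookup, and PPO.predict:

```python
import gradio as gr
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the real pipeline pieces
# (GPT classifier, mood_score_map lookup, PPO.predict):
def classify_mood(text):
    return "sadness" if "sad" in text.lower() else "neutral"

def score_for(label):
    return {"sadness": 0.2, "neutral": 0.5}.get(label, 0.5)

def pick_strategy(score):
    return "How about a short walk in nature?"

mood_history = []  # mood_scores collected across the session

def respond(message):
    label = classify_mood(message)
    mood_history.append(score_for(label))

    # Redraw the mood-trend chart each turn
    fig, ax = plt.subplots()
    ax.plot(range(1, len(mood_history) + 1), mood_history, marker="o")
    ax.set_xlabel("Turn")
    ax.set_ylabel("mood_score")
    ax.set_ylim(0, 1)

    return f"I sense you might be feeling {label}. {pick_strategy(mood_history[-1])}", fig

demo = gr.Interface(
    fn=respond,
    inputs=gr.Textbox(label="How are you feeling?"),
    outputs=[gr.Textbox(label="Suggestion"), gr.Plot(label="Mood trend")],
)
demo.launch()
```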

5 of 14

Key Technologies

  • OpenAI GPT-3.5 – Infers emotional tone from user input.
  • PPO (Proximal Policy Optimization) – Reinforcement Learning agent.
  • Gradio + Hugging Face Spaces – Web-based chatbot interface.
  • Gymnasium – Custom environment to simulate user mood responses.

6 of 14

PPO Agent Logic

  • State: mood score (0–1)
  • Action: one of eight wellness strategies
  • Reward: mood improvement
  • Goal: maximize cumulative mood increase over episodes (environment sketch below)
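A minimal sketch of a Gymnasium environment matching this state/action/reward setup, trained with Stable‑Baselines3 PPO; the mood‑response dynamics here are a toy assumption, not the project’s actual simulator:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

class MoodEnv(gym.Env):
    """Toy env: state = mood score in [0, 1], action = one of 8 strategies,
    reward = change in mood after applying the strategy."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(8)
        self.mood = 0.5

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.mood = float(self.np_random.uniform(0.1, 0.9))
        self.steps = 0
        return np.array([self.mood], dtype=np.float32), {}

    def step(self, action):
        # Assumed dynamics: some strategies help a given mood range more than
        # others, plus noise. A real simulator would encode domain knowledge.
        boost = 0.05 if (action % 4) == int(self.mood * 4) % 4 else 0.01
        new_mood = float(np.clip(self.mood + boost + self.np_random.normal(0, 0.02), 0.0, 1.0))
        reward = new_mood - self.mood          # reward = mood improvement
        self.mood = new_mood
        self.steps += 1
        terminated = self.mood >= 0.95
        truncated = self.steps >= 20
        return np.array([self.mood], dtype=np.float32), reward, terminated, truncated, {}

model = PPO("MlpPolicy", MoodEnv(), verbose=0)
model.learn(total_timesteps=20_000)            # maximize cumulative mood increase
model.save("ppo_mood_agent")
```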

7 of 14

Live Demo

  • Example:
    • User: “I’m sad.”
    • Bot: “I sense you might be feeling sadness. How about a short walk in nature? A little fresh air might help.”
    • Mood chart updates

8 of 14

9 of 14

The PPO Algorithm (Formulas)

  • Probability ratio – how PPO measures how far the new policy has moved from the old one:

    $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$

    • $s_t$ – the user’s current mood_score (a number between 0 and 1)
    • $a_t$ – one of the 8 wellness suggestions (meditation, breathing_ex, etc.)
    • $\pi_\theta(a_t \mid s_t)$ – the probability the new policy assigns to picking suggestion $a_t$ in mood $s_t$
    • $\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ – the same probability under the policy before the update
  • Clipped surrogate objective – the core innovation that keeps updates “proximal”:

    $L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]$

    • $\hat{A}_t$ (the “advantage”) measures how much better it was to pick suggestion $a_t$ in mood $s_t$ than the policy expected.
    • Taking the min of the unclipped and clipped terms caps the update when the new policy would otherwise swing too far (numeric example below).
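A tiny numeric illustration of the clipping; the numbers and ε = 0.2 are chosen for the example, not taken from the project:

```python
import numpy as np

eps = 0.2          # clip range (0.2 is the common PPO default)
advantage = 1.0    # suppose "nature_walk" worked better than expected in this mood
ratio = 1.6        # the new policy now picks it 1.6x as often as the old one

unclipped = ratio * advantage                            # 1.6
clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage   # 1.2
print(min(unclipped, clipped))                           # 1.2 -> the update is capped
```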

10 of 14

  • Full PPO loss (with value and entropy terms) – how policy, value, and exploration are balanced:

    $L_t(\theta) = \hat{\mathbb{E}}_t\!\left[L_t^{\text{CLIP}}(\theta) - c_1\,L_t^{\text{VF}}(\theta) + c_2\,S[\pi_\theta](s_t)\right]$

  1. Policy term $L^{\text{CLIP}}$ – pushes the suggestion probabilities toward actions that yielded high advantage.
  2. Value term $L^{\text{VF}} = (V_\theta(s_t) - V_t^{\text{targ}})^2$ – trains a value network to predict how good each mood state is, which is used to compute $\hat{A}_t$.
  3. Entropy bonus $S$ – keeps the policy from collapsing, so the bot doesn’t always repeat the same tip (combined‑loss sketch below).
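A PyTorch sketch of how the three terms could be combined into a single loss to minimize; the ppo_loss helper and coefficient values are assumptions for illustration (libraries such as Stable‑Baselines3 implement this internally):

```python
import torch

def ppo_loss(ratio, advantage, value_pred, value_target, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Combine the three terms into one loss to minimize
    (coefficients are illustrative; the project's hyperparameters may differ)."""
    # 1. Clipped policy term (an objective to maximize, hence the minus sign)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    policy_loss = -torch.min(unclipped, clipped).mean()

    # 2. Value term: squared error of the critic's "how good is this mood state" estimate
    value_loss = (value_pred - value_target).pow(2).mean()

    # 3. Entropy bonus: subtracting it rewards keeping some randomness over the 8 tips
    return policy_loss + c1 * value_loss - c2 * entropy.mean()

# Dummy batch of two transitions, just to show the call shape
loss = ppo_loss(torch.tensor([1.6, 0.9]), torch.tensor([1.0, -0.5]),
                torch.tensor([0.40, 0.60]), torch.tensor([0.50, 0.55]),
                torch.tensor([2.0, 2.0]))
```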

11 of 14

Key Implementation Snippets
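One such snippet is the mood‑classification call; a minimal sketch assuming the openai Python client (v1.x) and an illustrative subset of the 13 labels (the exact prompt and label list in the project may differ):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative subset; the project uses 13 mood labels.
MOOD_LABELS = ["joy", "sadness", "anxiety", "anger", "loneliness", "neutral"]

def classify_mood(user_text: str) -> str:
    """Ask GPT-3.5 to map free-form text to a single mood label."""
    prompt = (f"Classify the user's mood as exactly one of: {', '.join(MOOD_LABELS)}.\n"
              f"Reply with the label only.\nUser: {user_text}")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in MOOD_LABELS else "neutral"  # fall back on unexpected output

print(classify_mood("I'm sad."))   # -> "sadness"
```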

12 of 14

Tests & Results

  • Functional Test – Does the bot respond to the user’s mood?
  • Multi‑Turn Logic Test – Does GPT pick up context from previous turns?
  • Mood Trend Tracker – Does the mood chart update correctly?
  • Edge Case Test – Empty, repetitive, or extremely long input (test sketch below)
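A minimal pytest‑style sketch of the functional and edge‑case checks; the app module and respond() handler names are hypothetical:

```python
# test_bot.py -- sketch only; `app.respond` is an assumed (text -> reply, chart) handler
from app import respond

def test_functional_mood_response():
    reply, _ = respond("I'm sad.")
    assert isinstance(reply, str) and len(reply) > 0   # the bot answers a stated mood

def test_edge_cases_do_not_crash():
    for text in ["", "ok " * 500, "I'm sad. " * 50]:   # empty, long, repetitive inputs
        reply, _ = respond(text)
        assert isinstance(reply, str)
```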

13 of 14

Future Work

  • Online fine‑tuning with personal data
  • Multi‑modal input: voice, emojis, images
  • Mobile app + push notifications
  • A/B test vs rule‑based baselines

14 of 14

Thank You & Q&A