1 of 14

RL‑Powered Mental Health Support Bot

Chirasmayee B, Pallavi Bichpuriya, Diya Shah

Hack With CAIR, April 19 2025

2 of 14

Problem & Motivation

  • Pain point: People need immediate, stigma‑free mental health nudges
  • Challenges:
    • One‑size‑fits‑all tips feel generic
    • Live human coaches don’t scale
  • Our goal: Personalize and adapt suggestions over time

3 of 14

High‑Level Solution

  • User input → GPT classifies mood into 13 labels
  • Map label → numerical mood_score ∈ [0, 1]
  • Pre‑trained PPO picks one of 8 wellness strategies
  • Output: suggestion + live mood‑trend chart (pipeline sketch below)
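A minimal sketch of this pipeline, assuming a Stable‑Baselines3 PPO agent; the mood_score_map values, strategy names, and model path are illustrative assumptions (the real project covers all 13 labels):

```python
import numpy as np
from stable_baselines3 import PPO

# Map each GPT mood label to a numerical score in [0, 1]
# (illustrative values; the real map covers all 13 labels).
mood_score_map = {"sadness": 0.2, "anxiety": 0.3, "neutral": 0.5, "joy": 0.9}

# Eight wellness strategies, indexed by the PPO action id (illustrative names).
strategies = ["meditation", "breathing_ex", "nature_walk", "journaling",
              "gratitude_list", "music", "stretching", "call_a_friend"]

model = PPO.load("ppo_mood_agent")   # pre-trained agent; path is an assumption

label = "sadness"                    # e.g. returned by the GPT classifier
obs = np.array([mood_score_map[label]], dtype=np.float32)
action, _ = model.predict(obs, deterministic=True)
print(strategies[int(action)])       # the suggested wellness strategy
```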

4 of 14

System Architecture

  • Browser sends text to Gradio backend
  • GPT mood classifier + mood_score_map + PPO.predict
  • Suggestion UI: text reply & mood‑trend graph (wiring sketch below)
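A sketch of how the Gradio frontend could wire these pieces together; classify_mood, score_for, and pick_strategy are hypothetical stand‑ins for the GPT classifier, the mood_score_map lookup, and PPO.predict:

```python
import gradio as gr
import matplotlib.pyplot as plt

# Hypothetical stand-ins for the real pipeline pieces
# (GPT classifier, mood_score_map lookup, PPO.predict):
def classify_mood(text):
    return "sadness" if "sad" in text.lower() else "neutral"

def score_for(label):
    return {"sadness": 0.2, "neutral": 0.5}.get(label, 0.5)

def pick_strategy(score):
    return "How about a short walk in nature?"

mood_history = []  # mood_scores collected across the session

def respond(message):
    label = classify_mood(message)
    mood_history.append(score_for(label))

    # Redraw the mood-trend chart each turn
    fig, ax = plt.subplots()
    ax.plot(range(1, len(mood_history) + 1), mood_history, marker="o")
    ax.set_xlabel("Turn")
    ax.set_ylabel("mood_score")
    ax.set_ylim(0, 1)

    return f"I sense you might be feeling {label}. {pick_strategy(mood_history[-1])}", fig

demo = gr.Interface(
    fn=respond,
    inputs=gr.Textbox(label="How are you feeling?"),
    outputs=[gr.Textbox(label="Suggestion"), gr.Plot(label="Mood trend")],
)
demo.launch()
```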

5 of 14

Key Technologies

  • OpenAI GPT-3.5 – Infers emotional tone from user input.
  • PPO (Proximal Policy Optimization) – Reinforcement Learning agent.
  • Gradio + Hugging Face Spaces – Web-based chatbot interface.
  • Gymnasium – Custom environment to simulate user mood responses.

6 of 14

PPO Agent Logic

  • State: mood score (0–1)
  • Action: one of eight wellness strategies
  • Reward: mood improvement
  • Goal: maximize cumulative mood increase over episodes (environment sketch below)
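A minimal sketch of a Gymnasium environment matching this state/action/reward setup, trained with Stable‑Baselines3 PPO; the mood‑response dynamics here are a toy assumption, not the project’s actual simulator:

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces
from stable_baselines3 import PPO

class MoodEnv(gym.Env):
    """Toy env: state = mood score in [0, 1], action = one of 8 strategies,
    reward = change in mood after applying the strategy."""

    def __init__(self):
        super().__init__()
        self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)
        self.action_space = spaces.Discrete(8)
        self.mood = 0.5

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.mood = float(self.np_random.uniform(0.1, 0.9))
        self.steps = 0
        return np.array([self.mood], dtype=np.float32), {}

    def step(self, action):
        # Assumed dynamics: some strategies help a given mood range more than
        # others, plus noise. A real simulator would encode domain knowledge.
        boost = 0.05 if (action % 4) == int(self.mood * 4) % 4 else 0.01
        new_mood = float(np.clip(self.mood + boost + self.np_random.normal(0, 0.02), 0.0, 1.0))
        reward = new_mood - self.mood          # reward = mood improvement
        self.mood = new_mood
        self.steps += 1
        terminated = self.mood >= 0.95
        truncated = self.steps >= 20
        return np.array([self.mood], dtype=np.float32), reward, terminated, truncated, {}

model = PPO("MlpPolicy", MoodEnv(), verbose=0)
model.learn(total_timesteps=20_000)            # maximize cumulative mood increase
model.save("ppo_mood_agent")
```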

7 of 14

Live Demo

  • Example:
    • User: “I’m sad.”
    • Bot: “I sense you might be feeling sadness. How about a short walk in nature? A little fresh air might help.”
    • Mood chart updates

8 of 14

9 of 14

The PPO Algorithm (Formulas)

  • Probability ratio – how PPO measures how far the new policy has moved from the old one:

    $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$

    • $s_t$ – the user’s current mood_score (a number between 0 and 1)
    • $a_t$ – one of the 8 wellness suggestions (meditation, breathing_ex, etc.)
    • $\pi_\theta(a_t \mid s_t)$ – the probability the new policy assigns to picking suggestion $a_t$ in mood $s_t$
    • $\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ – the same probability under the policy before the update
  • Clipped surrogate objective – the core innovation that keeps updates “proximal”:

    $L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]$

    • $\hat{A}_t$ (the “advantage”) measures how much better it was to pick suggestion $a_t$ in mood $s_t$ than the policy expected.
    • Taking the min of the unclipped and clipped terms caps the update when the new policy would otherwise swing too far (numeric example below).
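A tiny numeric illustration of the clipping; the numbers and ε = 0.2 are chosen for the example, not taken from the project:

```python
import numpy as np

eps = 0.2          # clip range (0.2 is the common PPO default)
advantage = 1.0    # suppose "nature_walk" worked better than expected in this mood
ratio = 1.6        # the new policy now picks it 1.6x as often as the old one

unclipped = ratio * advantage                            # 1.6
clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage   # 1.2
print(min(unclipped, clipped))                           # 1.2 -> the update is capped
```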

10 of 14

  • Full PPO loss (with value and entropy terms) – how policy, value, and exploration are balanced:

    $L_t(\theta) = \hat{\mathbb{E}}_t\!\left[L_t^{\text{CLIP}}(\theta) - c_1\,L_t^{\text{VF}}(\theta) + c_2\,S[\pi_\theta](s_t)\right]$

  1. Policy term $L^{\text{CLIP}}$ – pushes the suggestion probabilities toward actions that yielded high advantage.
  2. Value term $L^{\text{VF}} = (V_\theta(s_t) - V_t^{\text{targ}})^2$ – trains a value network to predict how good each mood state is, which is used to compute $\hat{A}_t$.
  3. Entropy bonus $S$ – keeps the policy from collapsing, so the bot doesn’t always repeat the same tip (combined‑loss sketch below).
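A PyTorch sketch of how the three terms could be combined into a single loss to minimize; the ppo_loss helper and coefficient values are assumptions for illustration (libraries such as Stable‑Baselines3 implement this internally):

```python
import torch

def ppo_loss(ratio, advantage, value_pred, value_target, entropy,
             clip_eps=0.2, c1=0.5, c2=0.01):
    """Combine the three terms into one loss to minimize
    (coefficients are illustrative; the project's hyperparameters may differ)."""
    # 1. Clipped policy term (an objective to maximize, hence the minus sign)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantage
    policy_loss = -torch.min(unclipped, clipped).mean()

    # 2. Value term: squared error of the critic's "how good is this mood state" estimate
    value_loss = (value_pred - value_target).pow(2).mean()

    # 3. Entropy bonus: subtracting it rewards keeping some randomness over the 8 tips
    return policy_loss + c1 * value_loss - c2 * entropy.mean()

# Dummy batch of two transitions, just to show the call shape
loss = ppo_loss(torch.tensor([1.6, 0.9]), torch.tensor([1.0, -0.5]),
                torch.tensor([0.40, 0.60]), torch.tensor([0.50, 0.55]),
                torch.tensor([2.0, 2.0]))
```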

11 of 14

Key Implementation Snippets
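One such snippet is the mood‑classification call; a minimal sketch assuming the openai Python client (v1.x) and an illustrative subset of the 13 labels (the exact prompt and label list in the project may differ):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative subset; the project uses 13 mood labels.
MOOD_LABELS = ["joy", "sadness", "anxiety", "anger", "loneliness", "neutral"]

def classify_mood(user_text: str) -> str:
    """Ask GPT-3.5 to map free-form text to a single mood label."""
    prompt = (f"Classify the user's mood as exactly one of: {', '.join(MOOD_LABELS)}.\n"
              f"Reply with the label only.\nUser: {user_text}")
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in MOOD_LABELS else "neutral"  # fall back on unexpected output

print(classify_mood("I'm sad."))   # -> "sadness"
```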

12 of 14

Tests & Results

  • Functional Test – Does the bot respond to the user’s mood?
  • Multi‑Turn Logic Test – Does GPT pick up context from previous turns?
  • Mood Trend Tracker – Does the mood chart update correctly?
  • Edge Case Test – Empty, repetitive, or extremely long input (test sketch below)
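A minimal pytest‑style sketch of the functional and edge‑case checks; the app module and respond() handler names are hypothetical:

```python
# test_bot.py -- sketch only; `app.respond` is an assumed (text -> reply, chart) handler
from app import respond

def test_functional_mood_response():
    reply, _ = respond("I'm sad.")
    assert isinstance(reply, str) and len(reply) > 0   # the bot answers a stated mood

def test_edge_cases_do_not_crash():
    for text in ["", "ok " * 500, "I'm sad. " * 50]:   # empty, long, repetitive inputs
        reply, _ = respond(text)
        assert isinstance(reply, str)
```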

13 of 14

Future Work

  • Online fine‑tuning with personal data
  • Multi‑modal input: voice, emojis, images
  • Mobile app + push notifications
  • A/B test vs rule‑based baselines

14 of 14

Thank You & Q&A