Irina Rish
Canada Excellence Research Chair in Autonomous AI
University of Montreal & Mila
AI 4 Psychology and Psychology 4 AI: Towards Better Alignment Among Humans and Machines
“Psychiatric research is in crisis” - Wiecki, Poland, Frank, 2015
“Imagine going to a doctor because of chest pain that has been bothering you for a couple of weeks. The doctor would sit down with you, listen carefully to your description of symptoms, and prescribe medication to lower blood pressure in case you have a heart condition. After a couple of weeks, your pain has not subsided. The doctor now prescribes medication against reflux, which finally seems to help. In this scenario, not a single medical analysis (e.g., electrocardiogram, blood work, or a gastroscopy) was performed, and medication with potentially severe side effects was prescribed on a trial-and-error basis.
…This scenario resembles much of contemporary psychiatric diagnosis and treatment.”
Psychiatry lacks objective clinical tests routinely used in other medical fields!
Goal: augment current approaches to diagnosis and treatment of mental disorders with objective measurements and AI
Multi-modal data:
COMPUTATIONAL PSYCHIATRY?
AI 4 NEURO: NEUROIMAGING DATA ANALYSIS
[Cecchi et al., NIPS 2009], [Rish et al., PLOS One 2013], [Gheiratmand et al., npj Schizophrenia 2017], [Carroll et al., NeuroImage 2009], [Scheinberg & Rish, ECML 2010]
Schizophrenia classification: 74% to 93% accuracy; symptom severity prediction

[Rish et al., Brain Informatics 2010], [Rish et al., SPIE Med. Imaging 2012], [Cecchi et al., PLOS Comp Bio 2012]
Mental states in videogames: sparse regression, 70-95% accuracy
Pain perception: sparse regression, 70-80% accuracy, “holographic” patterns

[Honorio et al., AISTATS 2012], [Rish et al., SPIE Med. Imaging 2016]
Cocaine addiction: sparse Markov net biomarkers; MPH effects analysis (“stimulant 4 stimulant”)

[Bashivan et al., ICLR 2016]
Cognitive load prediction: 91% accuracy with recurrent ConvNets
“Statistical biomarkers”: a predictive model is trained on labeled examples (+: mental disorder, -: healthy) to discriminate patients from controls.
Nonlinear dynamical models of calcium imaging (CaI) and fMRI data [Abrevaya et al., 2018]
Beyond the Scanner: Using ‘Cheaper’ Sensors?
NeuroSky, Muse EEG: EEG, accelerometer
Hexoskin: heart rate, respiration, heart-rate variability
Jawbone UP3: heart rate, respiration, galvanic skin response (GSR), skin temperature, ambient temperature, accelerometer
Can we detect mental states using wearables?
Other cheap sensors: speech, transcribed text, video, etc.
QUIZ: CAN YOU DIAGNOSE?
The dream, I was on my way there, I tripped on a fence and fell into the water; I was struggling, then I could get out, I could get out. I got out by myself.
I had to, to, I dreamed about my neighbor, I looked in the box, I told you, didn’t I? I went, I went there, when I remembered, I saw it was not at home. It was because I was at home, there came a neighbor, we started talking, I started talking I said I'm out of time, I'll have to wash the house. Then when I said “make a point" I thought: oh my God, I'm not at home, no.
With my evangelic daughter. She is crying.
No, I only saw Jesus. She sometimes appears to me laughing, or sometimes she appears crying.
SPEECH GRAPHS
Mota, Natalia, et al. “Speech graphs provide a quantitative measure of thought disorder in psychosis.” PLoS One 2012; 7.
Speech graph: each word becomes a node, and consecutive words are linked by directed edges. Nodes in the example: I, walked, place, found, grandma, hugged, strongly, woke up.
Transcribed speech (description of a recent dream): “I walked into a place, and I found my grandma. I hugged her strongly. I woke up.”
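The construction above can be sketched in a few lines of Python. This is a minimal illustration, assuming simple regex tokenization rather than the exact preprocessing of Mota et al. (2012):

```python
# Sketch of speech-graph construction in the spirit of Mota et al. (2012):
# each word is a node, each pair of consecutive words is a directed edge.
# Tokenization here is an illustrative assumption, not the authors' pipeline.
import re

def speech_graph(text):
    words = re.findall(r"[a-z']+", text.lower())
    nodes = set(words)
    edges = set(zip(words, words[1:]))  # directed edges between consecutive words
    return nodes, edges

text = "I walked into a place, and I found my grandma. I hugged her strongly. I woke up."
nodes, edges = speech_graph(text)
# Graph-theoretic features (node count, edge count, loops, connectivity)
# serve as quantitative markers of speech connectedness.
print(len(nodes), len(edges))  # 14 16
```

Features such as the edge-to-node ratio then quantify how repetitive or disconnected the speech is.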
Image courtesy of Mota, Natalia, et al. “Speech graphs provide a quantitative measure of thought disorder in psychosis.” PLoS One 2012; 7.
COMPUTING COHERENCE
Pipeline for automated extraction of the semantic coherence features: split the transcript into sentences, map each sentence to a semantic vector (Latent Semantic Analysis embeddings in Bedi et al., 2015), and measure coherence as the similarity between vectors of consecutive sentences.
Example: “I cannot think of them all offhand. They were the ones I always considered my best songs. They were the ones I really wrote from experience.” → [sentence]1 [sentence]2 [sentence]3 → vectors v1, v2, v3 → coherence = similarity(vi, vi+1).
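A toy end-to-end version of this pipeline, using bag-of-words vectors as a simplifying stand-in for the LSA embeddings used by Bedi et al. (2015):

```python
# Toy coherence pipeline: split text into sentences, embed each one, and
# score coherence as cosine similarity of consecutive sentence vectors.
# Bag-of-words vectors are an assumption; real systems use LSA embeddings.
import math
import re
from collections import Counter

def sentence_vectors(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [Counter(re.findall(r"[a-z']+", s.lower())) for s in sentences]

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def coherence_scores(text):
    vecs = sentence_vectors(text)
    return [cosine(a, b) for a, b in zip(vecs, vecs[1:])]

text = ("I cannot think of them all offhand. "
        "They were the ones I always considered my best songs. "
        "They were the ones I really wrote from experience.")
scores = coherence_scores(text)
print(scores)  # one first-order coherence score per adjacent sentence pair
```

Low minimum or average coherence across a transcript is the kind of feature used downstream as a predictor of psychosis onset.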
RESULTS
Bedi, Gillinder et al. "Automated analysis of free speech predicts
psychosis onset in high-risk youths." npj Schizophrenia 1 (2015): 15030
Text features differ noticeably between controls and subjects who later developed psychosis.
Using these features, 100% accurate classification was achieved in predicting the first psychotic episode 1-2 years in advance from transcribed patient interviews.
Subjects who did not develop psychosis: blue; subjects who developed psychosis: red.
OTHER EXAMPLES
AI FOR PSYCHOTHERAPY?
S. Garg et al, Infogain-Driven Dialogue Modeling via Hash Functions (submitted)
Irina Rish, S. Garg @USC, Guillermo Cecchi
Slide credit: Sahil Garg
INFOGAIN-DRIVEN DIALOGUE VIA HASHCODE REPRESENTATIONS
Construct hash codes of patient and therapist responses;
optimize the hashing model to maximize mutual information between patient and therapist hashcodes;
learn a predictive model to infer the therapist’s response to the patient.
Outperforms deep net systems, including cases where deep nets failed.
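To make the mutual-information objective concrete, here is a heavily simplified sketch: utterance vectors are binarized with random hyperplane hashing, and a plug-in estimate of mutual information is computed between one patient hash bit and one therapist hash bit. The random projections and toy data are illustrative assumptions; Garg et al.'s actual model learns the hash functions to maximize this quantity.

```python
# Illustrative sketch only: random-hyperplane hashing plus a plug-in MI
# estimate between binary hash bits of "patient" and "therapist" vectors.
import math
import random

random.seed(0)

def hash_bit(vec, hyperplane):
    # Sign of a random projection gives one bit of the hash code.
    return int(sum(v * h for v, h in zip(vec, hyperplane)) >= 0)

def mutual_information(xs, ys):
    # Plug-in MI estimate (in nats) for two binary sequences.
    n = len(xs)
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            pxy = sum(1 for x, y in zip(xs, ys) if x == a and y == b) / n
            px = sum(1 for x in xs if x == a) / n
            py = sum(1 for y in ys if y == b) / n
            if pxy > 0:
                mi += pxy * math.log(pxy / (px * py))
    return mi

# Toy "patient" vectors; "therapist" vectors are correlated copies.
patients = [[random.gauss(0, 1) for _ in range(5)] for _ in range(200)]
therapists = [[p + random.gauss(0, 0.1) for p in vec] for vec in patients]
h = [random.gauss(0, 1) for _ in range(5)]
px_bits = [hash_bit(v, h) for v in patients]
py_bits = [hash_bit(v, h) for v in therapists]
mi = mutual_information(px_bits, py_bits)
print(mi)  # high MI: the hash codes carry shared information
```

In the real system, maximizing this MI over learnable hash functions yields compact codes that are predictive across the patient-therapist exchange.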
FUTURE: VIRTUAL THERAPIST?
A virtual AI assistant on a smartphone implementing four main steps: (1) data collection; (2) mental-state recognition; (3) taking action to improve the mental state; (4) receiving feedback from the person to improve future actions.
Sensor data: text, audio, video, EEG signal, temperature, heart rate.
AI algorithms: classification of mental states; detection of emotional and cognitive changes.
Decision-making: choosing the best feedback or another action (call a friend? tell a joke? send a reminder?).
Take an action; obtain feedback from the person.
Roles: 24/7 personal coach, assistant, therapist, caretaker, or just a “digital friend”.
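The four-step loop could be sketched as follows; every component here (the sensor threshold, the action list, the feedback-averaging rule) is a hypothetical stand-in, not a proposed design:

```python
# Minimal sketch of the loop: (1) collect data, (2) recognize mental state,
# (3) act, (4) learn from user feedback. All rules are illustrative stand-ins.
ACTIONS = ["call a friend", "tell a joke", "send a reminder"]

class VirtualAssistant:
    def __init__(self):
        self.value = {}   # average feedback per (state, action) pair
        self.counts = {}

    def recognize_state(self, sensors):
        # Step 2: trivial threshold rule in place of a real classifier.
        return "stressed" if sensors["heart_rate"] > 90 else "calm"

    def choose_action(self, state):
        # Step 3: greedily pick the action with the highest learned value.
        return max(ACTIONS, key=lambda a: self.value.get((state, a), 0.0))

    def update(self, state, action, feedback):
        # Step 4: incremental average of user feedback (+1 good, -1 bad).
        key = (state, action)
        self.counts[key] = self.counts.get(key, 0) + 1
        old = self.value.get(key, 0.0)
        self.value[key] = old + (feedback - old) / self.counts[key]

assistant = VirtualAssistant()
state = assistant.recognize_state({"heart_rate": 105})  # step 1: sensor data
action = assistant.choose_action(state)
assistant.update(state, action, feedback=1.0)
print(state, action)
```

The feedback-averaging step makes this a simple bandit-style learner: over time, actions that users rate well are chosen more often for the corresponding mental state.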
FUTURE RESEARCH
The Holy Grail of AI: Generalization
AGI ⇔ “General” AI ⇔ Multi-task, “Broad” AI
“Highly autonomous systems that outperform humans at most economically valuable work” (OpenAI definition)
“Cambrian Explosion” of Large-Scale Models
Foundation Models: Jump Towards AGI?
“Train one model on a huge amount of data and adapt it to many applications.
We call such a model a foundation model.”
CRFM: Stanford’s Center for Research on Foundation Models
“On the Opportunities and Risks of Foundation Models”
Application example: healthcare
Scaling Laws as “Investment Tools”
An example: Vision Transformers are dominated by ConvNets in lower-data regimes but outperform them given more data: https://arxiv.org/pdf/2010.11929.pdf
Brief History of Neural Scaling Laws
1994: Cortes et al., “Learning curves: Asymptotic values and rate of convergence,” NIPS. First to observe power-law scaling of ANNs: x = dataset size, y = test error.
2017: Hestness et al., “Deep Learning Scaling is Predictable, Empirically.” Showed that data-size-dependent scaling laws given by power laws hold over many orders of magnitude.
2019: Rosenfeld et al., “A constructive prediction of the generalization error across scales.” Applied power laws to model-size-dependent scaling, i.e., when x = number of parameters.
2020: Kaplan et al., “Scaling Laws for Neural Language Models.” Showed that the power law also applies when x = compute, besides x = data and x = model size; this paper brought “neural” scaling laws to the mainstream, in the context of GPT-3 training.
Neural Scaling Laws: Kaplan et al
Jared Kaplan et al, Scaling Laws for Neural Language Models, 2020.
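Fitting such a power law L(x) = a * x^(-b) amounts to ordinary least squares in log-log space. A minimal sketch with synthetic data (the scale 5 and exponent 0.3 are arbitrary assumptions, standing in for measured test errors):

```python
# Fit a power law L(x) = a * x**(-b) by linear regression of log(y) on log(x).
import math

def fit_power_law(xs, ys):
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(xs)
    mx = sum(lx) / n
    my = sum(ly) / n
    slope = (sum((u - mx) * (v - my) for u, v in zip(lx, ly))
             / sum((u - mx) ** 2 for u in lx))
    a = math.exp(my - slope * mx)
    return a, -slope  # so that L(x) = a * x**(-b)

# Synthetic "test error vs. dataset size" following L(x) = 5 * x**(-0.3).
xs = [10 ** k for k in range(3, 9)]
ys = [5.0 * x ** -0.3 for x in xs]
a, b = fit_power_law(xs, ys)
print(round(a, 3), round(b, 3))  # recovers a ≈ 5, b ≈ 0.3
```

Extrapolating the fitted line to larger x is exactly how scaling laws are used as “investment tools”: predicting the payoff of more data, parameters, or compute before spending it.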
Scale and Inductive Biases
More Complex Scaling Behavior: “Phase Transitions”, Emergent Phenomena
Broken Neural Scaling Laws:
A Universal Functional Form for Neural Scaling Laws?
Ethan Caballero et al, 2022
https://arxiv.org/abs/2210.14891
BNSL accurately fits and extrapolates a very wide range of scaling behaviors
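For reference, the broken neural scaling law proposed in Caballero et al. (2022) can be written, up to notation, as a power law modulated by smooth “breaks”:

```latex
y = a + b\, x^{-c_0} \prod_{i=1}^{n} \left( 1 + \left( \tfrac{x}{d_i} \right)^{1/f_i} \right)^{-c_i f_i}
```

Here x is the scaled quantity (data, parameters, or compute), y is the performance metric, n is the number of breaks, the d_i locate the breaks, and the c_i and f_i control the slope change and sharpness of each transition; with n = 0 it reduces to an ordinary power law.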
Training Foundation Models
“We think the most benefits will go to whoever has the biggest computer.” - Greg Brockman, OpenAI’s CTO, Financial Times
Most compute is owned by AI companies (Google, OpenAI, etc.), not academia and nonprofit research; this “compute gap” continues to widen.
We need to “democratize AI”!
INCITE award to Train Open Foundation Models
5.9M V100 GPU hrs on Summit
Supercomputers: Summit and Frontier
Growing International Collaboration
nolano.org
Farama
Ongoing Projects
Language Models: Pretraining and Continual Learning
Aligned Multimodal Language-Vision Models
Time-series Transformers
Multimodal “Generalist” Agent
Ultimate goal:
Interactive, Continually Learning “Open ChatX” model
Should Pretraining be Continual?
Standard pre-training:
multiple datasets available at once; mixed into one dataset
(or, sampled uniformly into each minibatch)
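The standard setup above can be sketched as follows; the dataset names and sizes are purely illustrative:

```python
# Standard pre-training: several datasets available at once, with examples
# sampled uniformly into each minibatch (in contrast to continual pretraining,
# where datasets arrive sequentially). Names and sizes are assumptions.
import random

random.seed(0)

datasets = {
    "web": [f"web_{i}" for i in range(1000)],
    "code": [f"code_{i}" for i in range(500)],
    "books": [f"books_{i}" for i in range(250)],
}

def sample_minibatch(datasets, batch_size):
    # Uniform over examples: pool everything, then draw without replacement.
    pool = [ex for data in datasets.values() for ex in data]
    return random.sample(pool, batch_size)

batch = sample_minibatch(datasets, 8)
print(len(batch))  # 8 examples drawn from the mixed pool
```

In continual pretraining the pool is not available all at once, so the question becomes how to keep performance on earlier datasets while training on new ones.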
Example: A Generalist Agent
Aligning Vision-Language Models
???
Aligning Multimodal Models with Human Values
JC Layoun, A Roger, I Rish. Aligning MAGMA by Few-Shot Learning and Finetuning. Montreal AI Symp 2022, arXiv:2210.14161
A Roger, E Aïmeur, I Rish, Towards Ethical Multimodal Systems, arXiv:2304.13765
We evaluate “commonsense morality” (Hendrycks et al., 2020) either (1) manually or (2) using a RoBERTa-large commonsense classifier trained only on the text of the ETHICS dataset by Hendrycks et al. (2020). However, the latter tends to be less reliable than the former.
Promising preliminary result: fine-tuning on just 30 hand-made “good” samples, for only 4 epochs, improves the morality score by 10%.
Ongoing work:
Project Direction 1: AI for Psychology
Project Direction 2: Psychology for AI
Apply existing psycho-evaluation tests (e.g., PsychoBench and others) to evaluate the output of an LLM or a vision-language model (e.g., Robin).
Which variables affecting the output can we vary?
Which metrics can we evaluate?
Thank you!