1 of 18

Evaluating Artificial Social Intelligence in an Urban Search and Rescue Task Environment

AAAI Fall Symposium Series: Theory of Mind for Teams

4-5 Nov 2021

Jared Freeman¹, Lixiao Huang², Matt Wood¹, Stephen J. Cauffman²

Aptima Inc.¹, Arizona State University²

freeman@aptima.com, lixiao.huang@asu.edu, mwood@aptima.com, scauffma@asu.edu

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001119C0130. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.

2 of 18

Overview

Training, Talk, Technology, and Theory of Mind
The ASIST Team Task Environment
Experimental Design
Artificial Social Intelligence
ASI Evaluation & Findings
Future Research

3 of 18

Training, Talk, Technology, and Theory of Mind

Teams coordinate through Training, Talk, Tech, ToM
ToM

Builds from Training, Talk, & Technology…

The better the TTT, the better ToM

Compensates for deficiencies of TT&T…

The worse the TTT, the more critical the ToM

ToM is inevitable, errorful

Training

Talk

Technology

ToM

I’d like to motivate the work we report here with some theory or philosophy

Teams coordinate through Training, Talk, Technology, and inference, especially about teammates, or ToM

Training conveys the decisions & behaviors we should expect of teammates
Talk confirms or alters those decisions & behaviors in missions
Tech informs decisions & behaviors by partitioning and provisioning information differentially between teammates

All 3 of these coordination mechanisms are imperfect (severely so in adverse and adversarial conditions)
All 3 are effortful and slow relative to the speed of inference. They are the System 2 mechanisms of coordination (in Kahneman’s terms)
We know from the literature and from life that effortless, speedy System 1 mechanisms eat System 2 for breakfast.
Theory of Mind is that System 1 mechanism.
ToM is the ability to infer the state and predict the actions of teammates
This supports coordination of independent actions and corrective interventions
No one of these four Ts is sufficient for a team to function
They are interdependent

The better the TT&T, the richer the ToM, and the more accurate it is when ToM inference is needed
The worse the TT&T, the poorer the ToM, but the more often it will be essential

ToM needs to be very accurate, especially when any of these is true: training is inadequate, OR tech fails, OR talk is difficult. That logical expression is true most of the time.

===================================================

We are agnostic to the mechanism or algorithm of ToM: theory theory, simulation theory
We are saying that Training, Talk, and Technology are inputs to that mechanism

Training

Definition: education, planning, rehearsal, and repeated mission execution
Limitations: Training cannot perfectly anticipate a specific mission, nor should it if it is to ensure generalizability

Talk

Definition: Spoken language (e.g., military air control communications protocols) and symbolic language (e.g., the marking conventions used by search and rescue teams). These communication protocols range from the formal to the informal.
Limitations: Inaccurate, incomplete, untimely, or unavailable (e.g., in military operations)

Technology

Definition: Technologies to design teams that are better sized and synchronized, plan missions, represent mission state, enact plans (AEGIS doctrine), assess team state
Limitations: Not available in all domains at all times, nor competent or trusted by their users in all situations.

4 of 18

Training, Talk, Technology, and Theory of Mind

Human-built ToM

Infer the cognitive and affective state of others, their goals, and their needs
Predict others’ actions
...to develop guidance and actions that coordinate work.
ToM inference is ineluctable and errorful.

Machine-built ToM

of artificial agents (Rabinowitz, et al., 2018)
of humans & human teams (ASIST)

5 of 18

DARPA ASIST

Artificial Social Intelligence for Successful Teams

Objective

Model team members (MToM) and teams (MToT) well enough to offer reliably useful advice
Important when the necessity and difficulty of human-built ToM are high because:

members are highly varied in their capabilities and capacity
task synchronization is complex
risk of failure at high stakes
preparation and information are incomplete

Interdisciplinary

Six teams of social scientists create theory and analytic agents
Six teams of computer scientists create ASI agents
One evaluation team (Aptima+ASU+COS)

Large: 320 individuals from 31 organizations
Long: Fall 2019 - Summer 2024

6 of 18

ASIST & Team Tasks

Objective: Create a task environment in which

Human team members must coordinate well to succeed
ASI must formulate MToM to make inferences & predictions
ASI generates interventions to improve coordination (in 2022)
The accuracy and utility of ASI can be evaluated

Characteristics of team tasks

Skills & roles are distinct
Synchronization benefits performance
Risk & reward trade off
Planning matters
Information is imperfect
Communication is necessary

7 of 18

The ASIST USAR Team Task Environment

Bird’s-eye view of three participants

Skills & roles: Three members in swappable roles
Reward: Rescue critical victims vs. regular victims (high vs. low reward)
Task synchronization: Async, sequential, and simultaneous teamwork
Communication: Audio & marker blocks for comms
Risk: Hidden freeze plate in rooms

Zoom

Picture in Picture

1st person view

Building layout and player locations

We developed an Urban Search and Rescue task
Three members with swappable roles and tools
Must rescue victims in different states of injury
Using methods requiring asynchronous, sequential, or simultaneous actions
Communication occurs through speech and markers
Traps pose a risk that rewards both deliberate action and coordinated rescue

===================================================

Skills & roles: Three team members and three possible roles; individual and team tasks
Task synchronization: 15 min missions to rescue victims in tasks requiring asynchronous, sequential, and simultaneous teamwork
Risk: Hidden freeze plate in rooms
Reward: Critical victims vs. regular victims (high vs. low reward)
Preparation: Planning session was manipulated
Information: Maps are incomplete
Communication: Audio & marker blocks for comms (Divergent marker keys create conflicting mental models)

8 of 18

The ASIST Task from the Participant’s View

Map -- Shared, unique, & missing info about locations of victims and rubble. No mark of team member locations.
Marker block legend -- Identical legend for 2 participants. The meanings of No Victim & Regular Victim are swapped for the third.
Gamespace -- First-person view of self (arm), space, time, victim count by type, score (RV+n*CV)
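
The score formula shown in the gamespace (RV + n*CV) can be sketched as a small function. This is only an illustration: the function name `team_score` is hypothetical, and the slides do not state the value of the critical-victim weight n, so the value used in the example is an assumption.

```python
def team_score(regular_rescued: int, critical_rescued: int, n: int) -> int:
    """Score = RV + n * CV, where n weights critical victims over regular ones."""
    if min(regular_rescued, critical_rescued, n) < 0:
        raise ValueError("counts and weight must be non-negative")
    return regular_rescued + n * critical_rescued


# Hypothetical example: n = 5 is an assumed weight, not taken from the study.
example = team_score(regular_rescued=10, critical_rescued=3, n=5)
```

With any n greater than 1, a critical victim is worth more than a regular victim, which is the risk and reward trade-off the task is built around.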

Figure callouts: (1) Info Map, (2) Marker block legend, (3) Minecraft world

9 of 18

The ASIST Experimental Design

Trial maps (within-group): SaturnA and SaturnB, across Trial 1 and Trial 2
Manipulation	Condition	Teams
Shared mental model manipulation (between-group)	Condition 1: Team planning	32 teams
Shared mental model manipulation (between-group)	Condition 2: No planning (math control task)	32 teams
Sample data for study 3	*No planning + human advisor	4 teams

10 of 18

Experimental Design

Participants (all remote)

192 participants in 64 teams from Reddit, Discord, ASU
141 males, 49 females, and 2 other or no response
Mean age 22.04 (SD=5.22, ranging from 18 to 49)
Ethnicities were white/Caucasian (54.2%; 104), Asian (25.8%; 49), and Hispanic or Latino (13%; 25)
All participants had at least a high school level education
All claimed Minecraft expertise, which we tested

Procedure (3.5hrs)

Software installation & Surveys (60min)
Consent, slides training, hands-on practice, competency test, trials, multiple surveys (150min)

Data collection

Surveys: 469 items re: 22 constructs
Testbed messages re: team, trial, experimental conditions, events
Human observer measurements on test trials (only)
2079 files, 280GB

11 of 18

Artificial Social Intelligence

University of Arizona -- Dynamic Bayes networks (DBNs) model individuals & teams from behavior, NLP, & speech acts
SIFT -- MC Tree Search over learnable action grammars
University of Southern California -- Recursive POMDPs constructed using RDDL domain rep + perturbations. Bayesian inference to update beliefs
DOLL/MIT -- Narratives from stories, inverse planning, probabilistic ToM, probabilistic conditional preference, story understanding (Genesis), and learned player capability.
Carnegie Mellon University -- Modular neural network models individual, introspection resolves deviations between predicted and observed behaviors.
Charles River Analytics -- Probabilistic programming to model goals & states. Strategic Coach selects interventions.

Developers of Artificial Social Intelligence used methods varying from

Dynamic Bayes nets + speech act analysis
Recursive POMDPs
Narrative modeling
Probabilistic programming

=============================================

The University of Arizona team, led by Adarsh Pyarelal, used dynamic Bayes networks (DBNs) to model individual and team activity states and mental states (ToM), using in-game participant behavior, natural language processing, and speech analysis.
The SIFT team, led by Chris Geib, employed MC Tree Search over learnable action grammars to generate multiple candidate explanations for observed behavior. Explanations included explicit ascriptions of ToM beliefs for each agent. The system then used weighted model counting over the explanations to probabilistically infer the most likely mental states and asymmetric beliefs between team members.
The team from University of Southern California, led by David Pynadath, applied recursive POMDPs as candidate participant models with ToM, constructed by combining an RDDL specification of the domain with perturbations along domain-independent dimensions. The ASI agent performed Bayesian inference to update beliefs over these candidate models based on observed team and individual behavior.
The team from DOLL/MIT, led by Paul Robertson, generated narratives from stories that represent, for each team member, a story of the team. The narrative provided a rationale for the past and predictions for the future. This ASI agent also used mechanisms for inverse planning, probabilistic ToM, probabilistic conditional preference, story understanding (Genesis), and learned player capability, such as speed.
The team from Carnegie Mellon, led by Katia Sycara, implemented a modular neural network Theory of Mind (ToM) model that infers an individual's beliefs, goals, and intentions from observations and environmental context; introspection resolves deviations between predicted and observed behaviors. Combined ToM models of teammates provided reasoning over shared mental models and team processes, and produced appropriate individual and team interventions.
The team from Charles River Analytics, led by Bryan Loyall, created a Cognitive Inverter that uses probabilistic programming to recognize goals, behaviors, and mental states from open world observations. A Strategic Coach will select the most effective interventions, based on principles from interactive narrative research.

12 of 18

ASI Evaluation & Findings

Human observers and ASI agents evaluated on the same four metrics relevant to MToM
Few (4) evaluation metrics ensured that ASI developers had spare capacity to create and test alternative MToM capabilities.

Metric ID: Function	ASI Agent & Human Observer infer / predict	Measure
M1: Prediction of effects of future interventions	Team score (3x per trial at fixed times)	Normalized RMSE
M3: Inference of member mental model / knowledge	Given map information (3x)	Mean accuracy
M6: Inference of member mental model / knowledge (conflicting knowledge)	Given marker block meanings (3x)	Mean accuracy
M7: Prediction of action given member beliefs (Sally-Anne)	Room entry in response to another participant’s marker block (many per trial)	Mean accuracy
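
The two measures named in the table, normalized RMSE and mean accuracy, can be sketched as follows. This is a minimal sketch under stated assumptions: the slides do not say how RMSE was normalized, so dividing by a caller-supplied `scale` (for example, a maximum possible score) is an assumption, and the function names are hypothetical.

```python
import math
from typing import Sequence


def normalized_rmse(predicted: Sequence[float],
                    actual: Sequence[float],
                    scale: float) -> float:
    """RMSE divided by a scale factor (assumed normalization, see lead-in)."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    if scale <= 0:
        raise ValueError("scale must be positive")
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)
    return math.sqrt(mse) / scale


def mean_accuracy(predicted: Sequence[object],
                  actual: Sequence[object]) -> float:
    """Fraction of inferences or predictions that match ground truth."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)
```

Lower normalized RMSE is better for M1, while higher mean accuracy is better for M3, M6, and M7.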

13 of 18

ASI Evaluation & Findings

Findings

All agents performed above chance
Humans outperformed ASI agents

Accuracy of ASI agents (yellow) & human observers (blue) on four tests of social intelligence.

14 of 18

ASI Evaluation & Findings

Findings

No one ASI agent reliably outperformed the others

Accuracy of human observers (triangle) and artificial agents (circles) on four tests of social intelligence.

15 of 18

ASI Evaluation & Findings

Finding

Accuracy of ASI agents and human observers improved rapidly though modestly during the brief trials.

Accuracy predicting final score (M1), 3x per trial

Percent accuracy for inferring marker block semantics (M6), 3x per trial

ASI accuracy generally increased between 3 measurements within trials on two measures

Prediction of final score -- Note that the opportunity for prediction error diminishes across trials as scores rise towards the maximum possible score
Inference of marker block semantics -- Note that marker block use and emergent ground truth converged or diverged

=============================================

M1: ASI may have been able to take advantage of decreasing variance in scores as the trial progressed and the diminishing likelihood of accruing more points by rescuing victims.
M6: ASI agents had additional opportunities to observe participant behavior related to marker block placement and movement given others’ placement, which allowed them to update prior beliefs about the likely marker block assignment for each participant.
M3: Agents did not reliably increase the accuracy of their inferences concerning divergent map information, probably because participants made less use of maps than anticipated.
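
The idea of updating prior beliefs about a participant's marker block assignment as more behavior is observed can be illustrated with a plain Bayes rule update. This sketch is not any ASIST team's agent: the hypothesis names (`legend_A`, `legend_B`) and all likelihood values are invented for illustration.

```python
def bayes_update(prior: dict[str, float],
                 likelihood: dict[str, float]) -> dict[str, float]:
    """Return the posterior over hypotheses after one observation."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under all hypotheses")
    return {h: v / total for h, v in unnormalized.items()}


# Start undecided between two hypothetical marker block legends.
belief = {"legend_A": 0.5, "legend_B": 0.5}

# Each entry is P(observed behavior | legend); the values are made up.
observations = [
    {"legend_A": 0.8, "legend_B": 0.3},
    {"legend_A": 0.7, "legend_B": 0.4},
    {"legend_A": 0.9, "legend_B": 0.2},
]

for likelihood in observations:
    belief = bayes_update(belief, likelihood)
```

Each additional observation that is more likely under one legend shifts belief toward it, which is one way accuracy could rise across the three measurements within a trial.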

16 of 18

Future Research

Support the claim that	With quantitative measurements of
Social science constructs drive	Analytic agent use, influence, effect
Design of ASI MToM/T to enable	MToM/T Existence, Inference, Prediction
ASI interventions on	Intervention (non)existence, Compliance, Explanations, Perceived Utility of ASI, Trust in ASI
Team processes that improve	Synchronization, Error Reduction, Resilience, Coordinative Comms
Mission effects	Mission score (weighted to team tasks)

17 of 18

Goal

Technology: MToM / MToT
Talks to human team members
In Training and missions
To

improve coordination & mission outcomes
enhance the accuracy of human ToM

Training

Talk

Technology

ToM

18 of 18

Acknowledgement

Contact:

Jared Freeman <freeman@aptima.com>