1 of 17

Voice2Action: Language Models as Agent for Efficient Real-Time Interaction in Virtual Reality

Yang Su

Student Researcher @ Cornell University

SuperAGI Leap Summit 2024

This work is supported by

2 of 17

Voice2Action

System overview: Voice2Action runs on the Unity Engine (VR mode) and the Unreal Engine (simulation mode). The user gives an instruction, e.g. “Select buildings from 6 to 10 meters and make them orange” or “Generate computers on the table”. The environment config is drawn from the game engine (runtime code, 3D scene) together with user-defined primary functions, and the system produces the corresponding output in the scene.

4 of 17

Motivation & Contribution

  • Deploying existing LLM agent systems to virtual reality (VR) is challenging
    • The large number of functions and interaction categories in 3D environments
    • The lack of efficiency in real-time interactions
  • Voice2Action follows user instructions with
    • Low-cost function calls, by focusing only on relevant tools
    • Parallelizable code execution for efficient interaction
    • Customizable interaction sets (user-defined tools)

5 of 17

Research Question

  • Given a textual user input, how do we perform the user's instructions in VR efficiently?
  • Revisit the key components of current agent systems
    • Planning - task decomposition and self-refinement
    • Memory - short-term and long-term
    • Tool use & actions
  • Game engines (the execution environment of VR) have
    • Hierarchical function categories and a fixed event execution order
    • Functional impact analysis tools (dependency analysis, profiling, etc.)
    • A simulation environment for code execution
  • We can leverage these properties to build more efficient systems

6 of 17

Full Pipeline

7 of 17

Low-cost by Planning from Environment Config

  • The agent system is fully aware of
    • User-defined and built-in primary function implementations
    • Function categories and event execution orders in game engines

Example: “Generate computers on the tables”. The planning LLM routes the instruction to the relevant engine event categories: Game Initialization (instant static object modifications), Input Events (user-triggered, e.g. gaze/gesture-based modifications), Game Logic Update (instant static property modifications), and Each Frame Update. Only the related scripts (e.g. Object Location, Object Instantiation) are attached as prompts/documentation for the next stage.
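
As a rough sketch of this planning step (Python, with hypothetical names such as plan_instruction and RELATED_SCRIPTS; the real system targets Unity/Unreal scripts), the idea is to classify the instruction into one engine event category and attach only the scripts registered for that category to the downstream prompt:

from dataclasses import dataclass, field

# Engine event categories the planner can route an instruction to.
EVENT_CATEGORIES = [
    "GameInitialization",  # instant static object modifications
    "InputEvents",         # user-triggered (gaze/gesture) modifications
    "GameLogicUpdate",     # instant static property modifications
    "EachFrameUpdate",     # continuous, per-frame modifications
]

# Illustrative mapping from category to related user-defined scripts.
RELATED_SCRIPTS = {
    "GameInitialization": ["ObjectInstantiation.cs"],
    "GameLogicUpdate": ["ObjectLocation.cs"],
}

@dataclass
class Plan:
    category: str
    scripts: list = field(default_factory=list)

def plan_instruction(instruction: str, llm) -> Plan:
    """Ask the planning LLM for one event category, then load only the
    scripts belonging to that category as prompts/documentation."""
    category = llm(
        f"Instruction: {instruction}\nChoose one category from {EVENT_CATEGORIES}."
    ).strip()
    return Plan(category, RELATED_SCRIPTS.get(category, []))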

8 of 17

Planning Reference

Engine event execution order used as the planning reference: Event Initialization, Physics (Animation), Input Events (user-triggered), Game Logic Update, Scene Rendering, and Decommissioning, with the Each Frame Update loop in between. User input enters at the Input Events stage, and each stage is associated with the kind of modification it handles (static, dynamic, or environment modification).
9 of 17

Parallelized Tool Use from Environment Feedback

  • Function calls are run in parallel whenever possible as decomposed by the planning stage
  • A reordering stage is run after all function calls are executed
  • We get feedback from
    • Engine code execution traces
    • Visual scene changes (Grounded SAM, etc.)
    • Engine rendering feedback (profiling, dependency analysis) - work in progress

Example: “Generate computers on the tables”. The tool-use LLM plans Instantiate(object = computer) and GetLocation(object = table), which are parallelizable, while SetLocation(object = computer) depends on the result of GetLocation(object = table).
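
A minimal sketch of the parallel execution idea (assumed helper names, not the actual Voice2Action API): calls with no unmet dependencies run concurrently, and dependent calls wait for their inputs.

from concurrent.futures import ThreadPoolExecutor

def run_calls(calls, deps, execute):
    """calls: {name: engine function call}; deps: {name: prerequisite names};
    execute(call, results): runs one call given the results collected so far."""
    results, remaining = {}, dict(deps)
    with ThreadPoolExecutor() as pool:
        while remaining:
            # Every call whose prerequisites are already finished is ready.
            ready = [n for n, d in remaining.items() if all(p in results for p in d)]
            if not ready:
                raise ValueError("cyclic dependency in planned calls")
            futures = {n: pool.submit(execute, calls[n], dict(results)) for n in ready}
            for name, fut in futures.items():
                results[name] = fut.result()
                del remaining[name]
    return results

# For "Generate computers on the tables" (dependencies as in the example above):
# deps = {"Instantiate(computer)": [],
#         "GetLocation(table)": [],
#         "SetLocation(computer)": ["Instantiate(computer)", "GetLocation(table)"]}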

10 of 17

Evaluation

  • We want to reduce the inference cost of the LLMs
  • Efficiency is measured by the total number of tokens consumed: N_trial * sum(N_i), where N_i is the token count of the i-th LLM call and N_trial is the number of trials (see the sketch below)
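
For concreteness, a tiny illustrative computation of this metric; when every trial issues the same calls it reduces to N_trial * sum(N_i):

def total_tokens(trials):
    """trials: one list of per-call token counts N_i for each trial."""
    return sum(sum(per_call) for per_call in trials)

# Two trials, each with planning / extraction / execution calls:
assert total_tokens([[350, 120, 200], [350, 110, 180]]) == 1310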

11 of 17

Fine-Tuning (Work In Progress)

We can efficiently align this system to domain-specific use cases!

  • Since each component is structured to have minimal dependencies
    • Each component acts as a neuron
    • Tree-structured multi-agent collaboration framework
  • We are able to get feedback from each intermediate stage (see the sketch below)
    • Extraction model - alignment from vision feedback
    • Execution model - alignment from engine feedback
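
A rough sketch of what this could look like (the class and field names below are assumptions, not the actual implementation): each component is a node in a tree and records per-stage feedback that can later be used for alignment.

from dataclasses import dataclass, field

@dataclass
class ComponentNode:
    name: str                                   # e.g. "extraction", "execution"
    children: list = field(default_factory=list)
    traces: list = field(default_factory=list)  # (input, output, feedback) triples

    def record(self, inp, out, feedback):
        """Feedback differs per node: vision feedback for the extraction
        model, engine feedback for the execution model."""
        self.traces.append((inp, out, feedback))

pipeline = ComponentNode("planning", children=[
    ComponentNode("extraction"),
    ComponentNode("execution"),
])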

12 of 17

Detailed pipeline diagram: a voice command is converted to a text command by a Voice Recognition SDK and pre-processed by an LLM into a pre-processed text T0 of atomic actions and entities. An LLM for Classification assigns each action an action type and arguments, an LLM for Extraction resolves the atomic action type, entity type, and arguments against the VR environment configuration (which changes per frame), and an LLM for Execution turns them into executable code using execution examples. Collected actions are checked against the environment frame by frame (Do / Skip, Pass / Fail), and wrongly executed actions are collected as negative examples.
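
To make the stage order above concrete, here is a compact sketch of the control flow (the function signature, stage prompts, and the llm/engine callables are placeholders, not the actual Voice2Action API):

def voice2action(text_command, env_config, llm, engine):
    """One instruction through the four LLM stages, with engine feedback."""
    t0 = llm("preprocess", text_command)                     # LLM for Pre-Processing
    action_type = llm("classify", t0)                        # LLM for Classification
    entities = llm("extract", t0, action_type, env_config)   # LLM for Extraction
    code = llm("execute", action_type, entities)             # LLM for Execution
    passed = engine.run(code)                                # Pass / Fail per frame
    return code, passed  # failed runs can be kept as negative examples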

13 of 17

Related Work

  • Agent Planning
    • Chain-of-Thought / Tree-of-Thought
    • PDDL (Planning Domain Definition Language) in robotics
  • Agents in Interactive Simulation Environments
    • Voyager (NVIDIA Minecraft Exploration)
    • Generative Agents (Stanford Town)
  • Multi-Agent Collaboration
    • AutoGPT
    • MetaGPT
  • etc.

14 of 17

Discussion

  • Broader Impact
    • Interactive data generation (e.g., to train world models such as Sora or Genie with active intervention)
    • Simulation engines for LLMs (as opposed to LLMs for game engines)
  • Current Limitations
    • Restricted to game engines
    • Evaluating this system is expensive
    • Users still have to define their primary functions
    • Human instructions are vague and unpredictable

15 of 17

Next Step

  • Deploying LLMs in interactive environments is challenging due to
    • the vague and unpredictable nature of human instructions
  • Running the interaction loop separately enables accurate execution by letting the model interact with the user over multiple turns to clarify their needs

Example: “Generate a building”. The plan is Instantiate(object = building, position = ?, scale = ?), so the interaction LLM asks the user “Where / What size?” before executing.
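
A minimal sketch of the proposed clarification loop (hypothetical helper names): when a planned call has unresolved arguments, the model asks the user instead of guessing, then executes with the answers filled in.

def clarify_and_execute(call_name, args, ask_user, execute):
    """args: dict whose values are None when the instruction left them unspecified."""
    missing = [k for k, v in args.items() if v is None]
    if missing:
        answers = ask_user(f"{call_name}: please specify {', '.join(missing)}")
        args = {**args, **answers}
    return execute(call_name, args)

# "Generate a building" ->
# clarify_and_execute("Instantiate",
#                     {"object": "building", "position": None, "scale": None},
#                     ask_user, execute)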

16 of 17

Thank you!

17 of 17

Acknowledgement

Datasets

  • Unreal Engine data is from Palatial XR

We thank the following advisors for their advice and suggestions throughout this work

  • Harald Haraldsson

We thank the following organizations for supporting this work

  • Cornell XRC - Cornell Tech
