1 of 17

Voice2Action: Language Models as Agent for Efficient Real-Time Interaction in Virtual Reality

Yang Su

Student Researcher @ Cornell University

SuperAGI Leap Summit 2024

This work is supported by

2 of 17

Voice2Action

System overview: Voice2Action runs on the Unity Engine (VR mode) and the Unreal Engine (simulation mode). The user gives an instruction, e.g. “Select buildings from 6 to 10 meters and make them orange” or “Generate computers on the table”. The environment config is drawn from the game engine (runtime code, 3D scene) together with user-defined primary functions, and the system produces the corresponding output in the scene.

4 of 17

Motivation & Contribution

  • Deploying existing LLM agent systems to virtual reality (VR) is challenging
    • The large number of functions and interaction categories in 3D environments
    • The lack of efficiency in real-time interactions
  • Voice2Action follows user instructions with
    • Low-cost function calls, by focusing only on relevant tools
    • Parallelizable code execution for efficient interaction
    • Customizable interaction sets (user-defined tools)

5 of 17

Research Question

  • Given a textual user input, how do we perform the user's instructions in VR efficiently?
  • Revisit the key components of current agent systems
    • Planning - task decomposition and self-refinement
    • Memory - short-term and long-term
    • Tool use & actions
  • Game engines (the execution environment of VR) have
    • Hierarchical function categories and a fixed event execution order
    • Functional impact analysis tools (dependency analysis, profiling, etc.)
    • A simulation environment for code execution
  • We can leverage these properties to build more efficient systems

6 of 17

Full Pipeline

7 of 17

Low-cost by Planning from Environment Config

  • The agent system is fully aware of
    • User-defined and built-in primary function implementations
    • Function categories and event execution orders in game engines

Example: “Generate computers on the tables”. The planning LLM routes the instruction to the relevant engine event categories: Game Initialization (instant static object modifications), Input Events (user-triggered, e.g. gaze/gesture-based modifications), Game Logic Update (instant static property modifications), and Each Frame Update. Only the related scripts (e.g. Object Location, Object Instantiation) are attached as prompts/documentation for the next stage.
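
As a rough sketch of this planning step (Python, with hypothetical names such as plan_instruction and RELATED_SCRIPTS; the real system targets Unity/Unreal scripts), the idea is to classify the instruction into one engine event category and attach only the scripts registered for that category to the downstream prompt:

from dataclasses import dataclass, field

# Engine event categories the planner can route an instruction to.
EVENT_CATEGORIES = [
    "GameInitialization",  # instant static object modifications
    "InputEvents",         # user-triggered (gaze/gesture) modifications
    "GameLogicUpdate",     # instant static property modifications
    "EachFrameUpdate",     # continuous, per-frame modifications
]

# Illustrative mapping from category to related user-defined scripts.
RELATED_SCRIPTS = {
    "GameInitialization": ["ObjectInstantiation.cs"],
    "GameLogicUpdate": ["ObjectLocation.cs"],
}

@dataclass
class Plan:
    category: str
    scripts: list = field(default_factory=list)

def plan_instruction(instruction: str, llm) -> Plan:
    """Ask the planning LLM for one event category, then load only the
    scripts belonging to that category as prompts/documentation."""
    category = llm(
        f"Instruction: {instruction}\nChoose one category from {EVENT_CATEGORIES}."
    ).strip()
    return Plan(category, RELATED_SCRIPTS.get(category, []))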

8 of 17

Planning Reference

Engine event execution order used as the planning reference: Event Initialization, Physics (Animation), Input Events (user-triggered), Game Logic Update, Scene Rendering, and Decommissioning, with the Each Frame Update loop in between. User input enters at the Input Events stage, and each stage is associated with the kind of modification it handles (static, dynamic, or environment modification).
9 of 17

Parallelized Tool Use from Environment Feedback

  • Function calls are run in parallel whenever possible as decomposed by the planning stage
  • A reordering stage is run after all function calls are executed
  • We get feedback from
    • Engine code execution traces
    • Visual scene changes (Grounded SAM, etc.)
    • Engine rendering feedback (profiling, dependency analysis) - work in progress

Example: “Generate computers on the tables”. The tool-use LLM plans Instantiate(object = computer) and GetLocation(object = table), which are parallelizable, while SetLocation(object = computer) depends on the result of GetLocation(object = table).
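
A minimal sketch of the parallel execution idea (assumed helper names, not the actual Voice2Action API): calls with no unmet dependencies run concurrently, and dependent calls wait for their inputs.

from concurrent.futures import ThreadPoolExecutor

def run_calls(calls, deps, execute):
    """calls: {name: engine function call}; deps: {name: prerequisite names};
    execute(call, results): runs one call given the results collected so far."""
    results, remaining = {}, dict(deps)
    with ThreadPoolExecutor() as pool:
        while remaining:
            # Every call whose prerequisites are already finished is ready.
            ready = [n for n, d in remaining.items() if all(p in results for p in d)]
            if not ready:
                raise ValueError("cyclic dependency in planned calls")
            futures = {n: pool.submit(execute, calls[n], dict(results)) for n in ready}
            for name, fut in futures.items():
                results[name] = fut.result()
                del remaining[name]
    return results

# For "Generate computers on the tables" (dependencies as in the example above):
# deps = {"Instantiate(computer)": [],
#         "GetLocation(table)": [],
#         "SetLocation(computer)": ["Instantiate(computer)", "GetLocation(table)"]}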

10 of 17

Evaluation

  • We want to reduce the inference cost of the LLMs
  • Efficiency is measured by the total number of tokens consumed: N_trial * sum(N_i), where N_i is the token count of the i-th LLM call and N_trial is the number of trials (see the sketch below)
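
For concreteness, a tiny illustrative computation of this metric; when every trial issues the same calls it reduces to N_trial * sum(N_i):

def total_tokens(trials):
    """trials: one list of per-call token counts N_i for each trial."""
    return sum(sum(per_call) for per_call in trials)

# Two trials, each with planning / extraction / execution calls:
assert total_tokens([[350, 120, 200], [350, 110, 180]]) == 1310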

11 of 17

Fine-Tuning (Work In Progress)

We can efficiently align this system to domain-specific use cases!

  • Since each component is structured to have minimal dependencies
    • Each component acts as a neuron
    • Tree-structured multi-agent collaboration framework
  • We are able to get feedback from each intermediate stage (see the sketch below)
    • Extraction model - alignment from vision feedback
    • Execution model - alignment from engine feedback
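
A rough sketch of what this could look like (the class and field names below are assumptions, not the actual implementation): each component is a node in a tree and records per-stage feedback that can later be used for alignment.

from dataclasses import dataclass, field

@dataclass
class ComponentNode:
    name: str                                   # e.g. "extraction", "execution"
    children: list = field(default_factory=list)
    traces: list = field(default_factory=list)  # (input, output, feedback) triples

    def record(self, inp, out, feedback):
        """Feedback differs per node: vision feedback for the extraction
        model, engine feedback for the execution model."""
        self.traces.append((inp, out, feedback))

pipeline = ComponentNode("planning", children=[
    ComponentNode("extraction"),
    ComponentNode("execution"),
])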

12 of 17

Detailed pipeline diagram: a voice command is converted to a text command by a Voice Recognition SDK and pre-processed by an LLM into a pre-processed text T0 of atomic actions and entities. An LLM for Classification assigns each action an action type and arguments, an LLM for Extraction resolves the atomic action type, entity type, and arguments against the VR environment configuration (which changes per frame), and an LLM for Execution turns them into executable code using execution examples. Collected actions are checked against the environment frame by frame (Do / Skip, Pass / Fail), and wrongly executed actions are collected as negative examples.
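
To make the stage order above concrete, here is a compact sketch of the control flow (the function signature, stage prompts, and the llm/engine callables are placeholders, not the actual Voice2Action API):

def voice2action(text_command, env_config, llm, engine):
    """One instruction through the four LLM stages, with engine feedback."""
    t0 = llm("preprocess", text_command)                     # LLM for Pre-Processing
    action_type = llm("classify", t0)                        # LLM for Classification
    entities = llm("extract", t0, action_type, env_config)   # LLM for Extraction
    code = llm("execute", action_type, entities)             # LLM for Execution
    passed = engine.run(code)                                # Pass / Fail per frame
    return code, passed  # failed runs can be kept as negative examples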

13 of 17

Related Work

  • Agent Planning
    • Chain-of-Thought / Tree-of-Thought
    • PDDL (Planning Domain Definition Language) in robotics
  • Agents in Interactive Simulation Environments
    • Voyager (NVIDIA Minecraft Exploration)
    • Generative Agents (Stanford Town)
  • Multi-Agent Collaboration
    • AutoGPT
    • MetaGPT
  • etc.

14 of 17

Discussion

  • Broader Impact
    • Interactive data generation (e.g., to train world models such as Sora or Genie with active intervention)
    • Simulation engines for LLMs (as opposed to LLMs for game engines)
  • Current Limitations
    • Restricted to game engines
    • Evaluating this system is expensive
    • Users still have to define their primary functions
    • Human instructions are vague and unpredictable

15 of 17

Next Step

  • Deploying LLMs in interactive environments is challenging due to
    • the vague and unpredictable nature of human instructions
  • Running the interaction loop separately enables accurate execution by letting the model interact with the user over multiple turns to clarify their needs

Example: “Generate a building”. The plan is Instantiate(object = building, position = ?, scale = ?), so the interaction LLM asks the user “Where / What size?” before executing.
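
A minimal sketch of the proposed clarification loop (hypothetical helper names): when a planned call has unresolved arguments, the model asks the user instead of guessing, then executes with the answers filled in.

def clarify_and_execute(call_name, args, ask_user, execute):
    """args: dict whose values are None when the instruction left them unspecified."""
    missing = [k for k, v in args.items() if v is None]
    if missing:
        answers = ask_user(f"{call_name}: please specify {', '.join(missing)}")
        args = {**args, **answers}
    return execute(call_name, args)

# "Generate a building" ->
# clarify_and_execute("Instantiate",
#                     {"object": "building", "position": None, "scale": None},
#                     ask_user, execute)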

16 of 17

Thank you!

17 of 17

Acknowledgement

Datasets

  • Unreal Engine data is from Palatial XR

We thank the following advisors for their advice and suggestions throughout this work

  • Harald Haraldsson

We thank the following organizations for supporting this work

  • Cornell XRC - Cornell Tech
