Voice2Action: Language Models as Agent for Efficient Real-Time Interaction in Virtual Reality
Yang Su
Student Researcher @ Cornell University
SuperAGI Leap Summit 2024
This work is supported by
Voice2Action
| | Unity Engine (VR mode) | Unreal Engine (simulation mode) |
| User Instruction | “Select buildings from 6 to 10 meters and make them orange” | “Generate computers on the table” |
| Environment Config | Game Engine (Runtime Code, 3D Scene) + User Defined (Primary Functions) | |
| Output | | |
Motivation & Contribution
Research Question
Full Pipeline
Low-cost by Planning from Environment Config
“Generate computers on the tables”
LLM for Planning
Game Initialization
The instruction contains instant static object modifications.
Input Events (User-Triggered)
Gaze/Gesture/etc.-based modifications.
Game Logic Update
The instruction contains instant static property modifications.
Each Frame Update
(Related Scripts)
Object Location
Prompts / Documentation
…
(Related Scripts)
Object Instantiation
Planning Reference
Event Initialization
Physics (Animation)
Input Events (User-Triggered)
Game Logic Update
Scene Rendering
Decommissioning
Each Frame Update
Static modification
Dynamic modification
Static modification
Static modification
Environment modification
User Input comes in
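A minimal sketch of how the stage descriptions on this slide could be turned into a planning prompt. The `Stage` type, the `STAGES` table, and the `build_planning_prompt` helper are hypothetical names used for illustration only, the prompt wording is an assumption, and no LLM is actually called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str     # game-loop stage, as named on the slide
    handles: str  # kind of instruction the stage is responsible for


# Stage descriptions copied from the slide; stages without a description are omitted.
STAGES = [
    Stage("Game Initialization", "instant static object modifications"),
    Stage("Input Events (User-Triggered)", "gaze/gesture/etc.-based modifications"),
    Stage("Game Logic Update", "instant static property modifications"),
]


def build_planning_prompt(instruction, stages=STAGES):
    """Build a text prompt asking a planner to pick one stage (assumed wording)."""
    lines = ["Choose the game-loop stage that should handle the instruction.", ""]
    for index, stage in enumerate(stages, start=1):
        lines.append(f"{index}. {stage.name}: {stage.handles}")
    lines += ["", f'Instruction: "{instruction}"', "Answer with the stage name only."]
    return "\n".join(lines)


prompt = build_planning_prompt("Generate computers on the tables")
print(prompt)
```

Because the prompt lists only the stages and their short descriptions, the planner sees a compact summary of the environment config rather than the full runtime code.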
Parallelized Tool Use from Environment Feedback
“Generate computers on the tables”
Instantiate(object = computer)
SetLocation(object = computer)
GetLocation(object = table)
depends
LLM for Tool Use
parallelizable
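The dependency idea on this slide can be sketched with a small scheduler. This is an illustration under stated assumptions, not the system's implementation: it assumes `SetLocation` depends on both `Instantiate` and `GetLocation`, which are independent of each other; the three lowercase functions are hypothetical stand-ins for engine calls; and the returned values are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical stand-ins for the engine-side tools named on the slide.
def instantiate(obj):
    return f"{obj}#1"


def get_location(obj):
    return (1.0, 0.8, 2.0)  # made-up coordinates


def set_location(handle, position):
    return {"object": handle, "position": position}


# Assumed dependency graph: each tool maps to the set of tools it waits for.
DEPENDS_ON = {
    "instantiate": set(),
    "get_location": set(),
    "set_location": {"instantiate", "get_location"},
}


def run_plan():
    results = {}
    waves = []  # records which tools ran together
    tasks = {
        "instantiate": lambda: instantiate("computer"),
        "get_location": lambda: get_location("table"),
        "set_location": lambda: set_location(
            results["instantiate"], results["get_location"]
        ),
    }
    sorter = TopologicalSorter(DEPENDS_ON)
    sorter.prepare()
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())  # tools whose dependencies are done
            waves.append(ready)
            futures = {name: pool.submit(tasks[name]) for name in ready}
            for name, future in futures.items():
                results[name] = future.result()
                sorter.done(name)
    return waves, results


waves, results = run_plan()
print(waves)
print(results["set_location"])
```

Under these assumptions the first wave contains the two independent calls, and `set_location` runs alone in the second wave once both results are available.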
Evaluation
Fine-Tuning (Work in Progress)
We can align this system efficiently to our domain-specific use cases!
Voice Command
LLM for Pre-Processing
VR Environment Configuration (changes per frame)
Voice Recognition SDK
T (Text Command)
Action 1
Action a
Collect All
(Supposed | Wrongly)
T0 (Pre-Processed Text)
…
T0 (Pre-Processed Text)
Actions
LLM for Classification
Action Type
Action Arg i
Text
Atomic Actions + Entities
Action Arg i
Text
LLM for Extraction
Atomic Action Type
Entity Type
Atomic Action Arg i
Text
LLM for Execution
Text
…
Text
Execution Examples
+
+
+
Do
…
Skip
VR Environment Configuration by frame (as time passes): f0, f1, f2
Pass
Fail
Negative Examples
f
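To make the data flow between the four stages in this diagram concrete, here is a rough sketch in which every LLM stage is replaced by a simple rule-based placeholder. All function names, the splitting rule, and the output format are assumptions made for illustration; the actual stages are LLM calls whose prompts and outputs are not shown on the slide.

```python
import re


def preprocess(text):
    # Placeholder for "LLM for Pre-Processing": normalize case and whitespace.
    return " ".join(text.strip().lower().split())


def classify(text):
    # Placeholder for "LLM for Classification": split the command into
    # (action type, argument text) pairs, one per clause joined by "and".
    actions = []
    for clause in re.split(r"\band\b", text):
        clause = clause.strip()
        if clause:
            verb, _, rest = clause.partition(" ")
            actions.append((verb, rest))
    return actions


def extract(action_type, arg_text):
    # Placeholder for "LLM for Extraction": pull numeric arguments from the text.
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", arg_text)]
    return {"action": action_type, "text": arg_text, "numbers": numbers}


def execute(step):
    # Placeholder for "LLM for Execution": render the step as a call string
    # instead of invoking the game engine.
    return f'{step["action"]}({step["text"]!r}, numbers={step["numbers"]})'


def run_pipeline(text_command):
    cleaned = preprocess(text_command)
    steps = [extract(kind, args) for kind, args in classify(cleaned)]
    return [dict(step, call=execute(step)) for step in steps]


for step in run_pipeline("Select buildings from 6 to 10 meters and make them orange"):
    print(step["call"])
```

Keeping each stage as a separate function mirrors the diagram: the output of one stage is the only input to the next, so any placeholder can be swapped for a model call without touching the others.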
Related Work
Discussion
Next Step
“Generate a building”
Instantiate(object = building, position = ?, scale = ?)
LLMs for Interaction
“Where / What size?”
User
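The interaction step sketched on this slide, asking the user about arguments the instruction left open, could look roughly like the following. The `REQUIRED` tuple, the `QUESTIONS` mapping, and both helper functions are hypothetical; in the proposed system an LLM would phrase the question rather than a lookup table.

```python
# Arguments assumed to be required by Instantiate, following the slide.
REQUIRED = ("object", "position", "scale")

# Hypothetical mapping from a missing argument to a follow-up question.
QUESTIONS = {"position": "Where?", "scale": "What size?"}


def missing_args(call_args, required=REQUIRED):
    """Return the required argument names that are absent or set to None."""
    return [name for name in required if call_args.get(name) is None]


def clarification(call_args):
    """Return a follow-up question for the user, or None if nothing is missing."""
    asks = [QUESTIONS.get(name, f"Which {name}?") for name in missing_args(call_args)]
    return " / ".join(asks) if asks else None


# "Generate a building" leaves position and scale unspecified.
partial_call = {"object": "building", "position": None, "scale": None}
print(clarification(partial_call))
```

Once the user answers, the missing values can be filled in and the call executed as usual.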
Thank you!
Acknowledgement
Datasets
We thank the following advisors for their advice and suggestions throughout this work
We thank the following organizations for supporting this work
ALICE