1 of 19

Combining Vocal and Gesture Interactions

Caitlin Kollar, Vivian Zhao

2 of 19

Problem

  • Dobby only responds to voice commands
  • Gesturing in a direction is simpler than saying it
  • People going on tours may not know the names of the landmarks/objects they see

3 of 19

Background - Previous Research

  • Mobile robots require high-quality human-robot interaction, motivating adaptable gesture recognition
  • Prior research focuses on robust gesture recognition in 3D environments using advanced computer vision
  • Integrating gesture recognition makes human-robot collaboration more intuitive
  • In HCI research, understanding gestures supports teamwork and communication between humans and robots

4 of 19

Background

  • The Dobby robot is used to give tours and navigate around AHG, interacting with the user to find target locations and move throughout the building
    • Capable of performing a task while simultaneously communicating with the user
    • Utilizes LLMs to create dialogue between itself and the user
    • Promotes flexible, user-driven interactions

5 of 19

Approach - Direction

  • Recognize a ‘pointing’ gesture and categorize it as left, right, or forward
    • MediaPipe hand landmarker to identify points on user’s hand from a webcam frame
    • Find angles between the vectors representing the lines between index finger joints
    • Determine if finger is straight enough to be considered a point
      • Angles are within ~12 degrees of each other
    • Determine direction the user is pointing
      • By finding the angle between the finger vector and x- and y-axis references
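The straightness check and direction classification above reduce to simple 2D geometry on the landmark coordinates. A minimal sketch, assuming (x, y) image coordinates for the index finger's MCP, PIP, and tip landmarks as MediaPipe's hand landmarker returns them (function names are illustrative; the ~12° tolerance matches the threshold on the previous slide):

```python
import math

def is_straight(mcp, pip, tip, tol_deg=12.0):
    """Check the index finger is extended: the MCP->PIP and PIP->TIP
    segments should be nearly collinear (within ~12 degrees)."""
    a1 = math.degrees(math.atan2(pip[1] - mcp[1], pip[0] - mcp[0]))
    a2 = math.degrees(math.atan2(tip[1] - pip[1], tip[0] - pip[0]))
    diff = abs(a1 - a2) % 360
    return min(diff, 360 - diff) <= tol_deg

def finger_direction(mcp, tip):
    """Classify a pointing direction from two index-finger landmarks.
    Image y grows downward, so dy is negated: 0 deg = right, 90 deg = up.
    'Up' in the frame is treated as pointing forward."""
    angle = math.degrees(math.atan2(-(tip[1] - mcp[1]), tip[0] - mcp[0]))
    if 45 <= angle <= 135:
        return "forward"
    if -45 < angle < 45:
        return "right"
    return "left"
```

The sector boundaries (±45°, 135°) are an assumption for the sketch; the three-way split is the classification the slide describes.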

6 of 19

Approach - Object Detection

  • Extend the vector representing the finger into a ray
  • Determine which objects’ bounding boxes are intersected by the ray
  • Calculate which object is closest based on point of intersection and position of hand
  • YOLOv8 object recognition
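The ray-versus-bounding-box step can be sketched with a standard 2D slab test; the detection format `(label, (x1, y1, x2, y2))` is an assumption standing in for parsed YOLOv8 output:

```python
def ray_box_t(origin, direction, box):
    """Distance t along the ray to the nearest hit on an axis-aligned
    box (x1, y1, x2, y2), or None if the ray misses it (2D slab test)."""
    x1, y1, x2, y2 = box
    tmin, tmax = 0.0, float("inf")
    for o, d, lo, hi in ((origin[0], direction[0], x1, x2),
                         (origin[1], direction[1], y1, y2)):
        if abs(d) < 1e-9:
            if not (lo <= o <= hi):
                return None  # ray is parallel to and outside this slab
        else:
            t1, t2 = (lo - o) / d, (hi - o) / d
            if t1 > t2:
                t1, t2 = t2, t1
            tmin, tmax = max(tmin, t1), min(tmax, t2)
            if tmin > tmax:
                return None
    return tmin

def closest_pointed_object(origin, direction, detections):
    """Pick the detection whose box the pointing ray hits first,
    i.e. the object closest to the hand along the ray."""
    hits = [(t, label) for label, box in detections
            if (t := ray_box_t(origin, direction, box)) is not None]
    return min(hits)[1] if hits else None
```

Sorting by the entry distance t implements the "closest based on point of intersection and position of hand" rule from the bullets above.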

7 of 19

Approach - Verbal Commands

  • Use keywords/semantics to distinguish whether the user wants the robot to move or identify an object
    • OpenAI Whisper speech-to-text to turn microphone input into text strings
    • Compare similarity to phrases/keywords “go in that direction move there pointing” (move) and “what is what’s this unfamiliar object pointing at” (identify)
      • Scikit Learn’s vectorizer and cosine similarity formula
    • Identify which command the user says
      • In case of invalid verbal command, prompt user to say a command that the robot is capable of acting on
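The phrase matching above is bag-of-words cosine similarity. The project uses scikit-learn's vectorizer and cosine similarity; the dependency-free sketch below implements the same formula by hand with the keyword phrases from the previous bullet (the 0.2 threshold is an assumption, not the project's tuned value):

```python
from collections import Counter
import math

# Keyword phrases from the slide, one per supported command.
COMMANDS = {
    "move": "go in that direction move there pointing",
    "identify": "what is what's this unfamiliar object pointing at",
}

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify_command(transcript, threshold=0.2):
    """Match a Whisper transcript against the keyword phrases; return the
    best command, or None if nothing is similar enough (the robot would
    then prompt the user to repeat a supported command)."""
    words = Counter(transcript.lower().split())
    scores = {cmd: cosine(words, Counter(phrase.split()))
              for cmd, phrase in COMMANDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```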

8 of 19

Approach - Response

  • Combine these parts and create a response from the robot
    • Rely on user’s vocal command to determine what information to extract from the frame: direction or an object
    • Respond with affirmation of action and additional information (direction of movement / identified object)
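Putting the pieces together, the response step is a small dispatch on the classified command plus whichever cue was extracted from the frame; the wording below is illustrative, not Dobby's exact phrasing:

```python
def build_response(command, direction=None, obj_label=None):
    """Compose the robot's spoken reply from the classified command and
    the information extracted from the webcam frame."""
    if command == "move" and direction:
        return f"Okay, moving {direction}."
    if command == "identify" and obj_label:
        return f"You are pointing at a {obj_label}."
    # Invalid or incomplete cue: re-prompt, per the verbal-commands slide.
    return "Sorry, I didn't catch that. Please try a command I can act on."
```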

9 of 19

Evaluation

  • 2 participants, 3 locations (dorm room, library, outdoors)
  • 6 trials each for the move left/right/forward and identify-object commands
  • Different lighting, backgrounds, hand shape, phrasing, and objects
  • Determine whether robot output matched with desired instruction and if additional extracted information was correct

10 of 19

Evaluation - Go

  • Phrasings
    • “Go there”
    • “Move where I’m pointing”
    • “Go in that direction”
  • Hand Position
    • Index finger only
    • Flat palm

11 of 19

12 of 19

Evaluation - Go

  • The ‘forward’ direction was the least accurate, and we found that it tended to generate ‘invalid’ visual cue responses
  • In the outdoor setting, wind and background noise sometimes distorted the audio, and the microphone input and speech-to-text transcription struggled to accurately capture the verbal cue

13 of 19

Go demo video here

14 of 19

Evaluation - Identify

  • Phrasings
    • “What is that”
    • “What’s this object”
    • “I’m pointing at that”
  • Hand Position
    • Index finger only
    • ‘L’ shape with index finger and thumb
  • Objects: Cell Phone, Fork, Book

15 of 19

16 of 19

Evaluation - Identify

  • Like with the Go tests, we found that the outdoor context was less accurate for the identify model due to external audio factors
  • Many of the ‘correct audio only’ results came from the visual cue being misdirected to a different landmark rather than a failure to properly identify an object

17 of 19

Identify demo video here

18 of 19

Conclusion

  • This work provides a foundation for implementing gesture recognition and response in Dobby
  • By combining this work with robot navigation, we can allow users to utilize gestures to guide their interactions with Dobby
  • This increases accessibility and improves the user experience

19 of 19

Conclusion - Further Work

  • Add additional / more complex gestures, such as pointing backwards or waving
  • Utilize more advanced object recognition software
  • Implement more sophisticated semantic analysis
  • Train Dobby to have knowledge of the projects in the lab, so that it can provide information about them when users point at them