1 of 19

Combining Vocal and Gesture Interactions

Caitlin Kollar, Vivian Zhao

2 of 19

Problem

  • Dobby only responds to voice commands
  • Gesturing in a direction is simpler than saying it
  • People going on tours may not know the names of the landmarks/objects they see

3 of 19

Background - Previous Research

  • Mobile robots require high-quality human-robot interaction, motivating adaptable gesture recognition
  • Prior research focuses on robust gesture recognition in 3D environments using advanced computer vision
  • Integrating gesture recognition makes human-robot collaboration more intuitive
  • In HCI research, understanding gestures supports teamwork and communication between humans and robots

4 of 19

Background

  • The Dobby robot is used to give tours and navigate around AHG, interacting with the user to find target locations and move throughout the building
    • Capable of performing a task while simultaneously communicating with the user
    • Utilizes LLMs to create dialogue between itself and the user
    • Promotes flexible, user-driven interactions

5 of 19

Approach - Direction

  • Recognize a ‘pointing’ gesture and categorize it as left, right, or forward
    • MediaPipe hand landmarker to identify points on user’s hand from a webcam frame
    • Find angles between the vectors representing the lines between index finger joints
    • Determine if finger is straight enough to be considered a point
      • Angles are within ~12 degrees of each other
    • Determine direction the user is pointing
      • By finding the angle between the finger vector and x- and y-axis references
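The straightness check and direction classification above reduce to simple 2D geometry on the landmark coordinates. A minimal sketch, assuming (x, y) image coordinates for the index finger's MCP, PIP, and tip landmarks as MediaPipe's hand landmarker returns them (function names are illustrative; the ~12° tolerance matches the threshold on the previous slide):

```python
import math

def is_straight(mcp, pip, tip, tol_deg=12.0):
    """Check the index finger is extended: the MCP->PIP and PIP->TIP
    segments should be nearly collinear (within ~12 degrees)."""
    a1 = math.degrees(math.atan2(pip[1] - mcp[1], pip[0] - mcp[0]))
    a2 = math.degrees(math.atan2(tip[1] - pip[1], tip[0] - pip[0]))
    diff = abs(a1 - a2) % 360
    return min(diff, 360 - diff) <= tol_deg

def finger_direction(mcp, tip):
    """Classify a pointing direction from two index-finger landmarks.
    Image y grows downward, so dy is negated: 0 deg = right, 90 deg = up.
    'Up' in the frame is treated as pointing forward."""
    angle = math.degrees(math.atan2(-(tip[1] - mcp[1]), tip[0] - mcp[0]))
    if 45 <= angle <= 135:
        return "forward"
    if -45 < angle < 45:
        return "right"
    return "left"
```

The sector boundaries (±45°, 135°) are an assumption for the sketch; the three-way split is the classification the slide describes.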

6 of 19

Approach - Object Detection

  • Extend the vector representing the finger into a ray
  • Determine which objects’ bounding boxes are intersected by the ray
  • Calculate which object is closest based on point of intersection and position of hand
  • YOLOv8 object recognition
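The ray-versus-bounding-box step can be sketched with a standard 2D slab test; the detection format `(label, (x1, y1, x2, y2))` is an assumption standing in for parsed YOLOv8 output:

```python
def ray_box_t(origin, direction, box):
    """Distance t along the ray to the nearest hit on an axis-aligned
    box (x1, y1, x2, y2), or None if the ray misses it (2D slab test)."""
    x1, y1, x2, y2 = box
    tmin, tmax = 0.0, float("inf")
    for o, d, lo, hi in ((origin[0], direction[0], x1, x2),
                         (origin[1], direction[1], y1, y2)):
        if abs(d) < 1e-9:
            if not (lo <= o <= hi):
                return None  # ray is parallel to and outside this slab
        else:
            t1, t2 = (lo - o) / d, (hi - o) / d
            if t1 > t2:
                t1, t2 = t2, t1
            tmin, tmax = max(tmin, t1), min(tmax, t2)
            if tmin > tmax:
                return None
    return tmin

def closest_pointed_object(origin, direction, detections):
    """Pick the detection whose box the pointing ray hits first,
    i.e. the object closest to the hand along the ray."""
    hits = [(t, label) for label, box in detections
            if (t := ray_box_t(origin, direction, box)) is not None]
    return min(hits)[1] if hits else None
```

Sorting by the entry distance t implements the "closest based on point of intersection and position of hand" rule from the bullets above.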

7 of 19

Approach - Verbal Commands

  • Use keywords/semantics to distinguish whether the user wants the robot to move or identify an object
    • OpenAI Whisper speech-to-text to turn microphone input into text strings
    • Compare similarity to phrases/keywords “go in that direction move there pointing” (move) and “what is what’s this unfamiliar object pointing at” (identify)
      • Scikit Learn’s vectorizer and cosine similarity formula
    • Identify which command the user says
      • In case of invalid verbal command, prompt user to say a command that the robot is capable of acting on
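The phrase matching above is bag-of-words cosine similarity. The project uses scikit-learn's vectorizer and cosine similarity; the dependency-free sketch below implements the same formula by hand with the keyword phrases from the previous bullet (the 0.2 threshold is an assumption, not the project's tuned value):

```python
from collections import Counter
import math

# Keyword phrases from the slide, one per supported command.
COMMANDS = {
    "move": "go in that direction move there pointing",
    "identify": "what is what's this unfamiliar object pointing at",
}

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify_command(transcript, threshold=0.2):
    """Match a Whisper transcript against the keyword phrases; return the
    best command, or None if nothing is similar enough (the robot would
    then prompt the user to repeat a supported command)."""
    words = Counter(transcript.lower().split())
    scores = {cmd: cosine(words, Counter(phrase.split()))
              for cmd, phrase in COMMANDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```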

8 of 19

Approach - Response

  • Combine these parts and create a response from the robot
    • Rely on user’s vocal command to determine what information to extract from the frame: direction or an object
    • Respond with affirmation of action and additional information (direction of movement / identified object)
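Putting the pieces together, the response step is a small dispatch on the classified command plus whichever cue was extracted from the frame; the wording below is illustrative, not Dobby's exact phrasing:

```python
def build_response(command, direction=None, obj_label=None):
    """Compose the robot's spoken reply from the classified command and
    the information extracted from the webcam frame."""
    if command == "move" and direction:
        return f"Okay, moving {direction}."
    if command == "identify" and obj_label:
        return f"You are pointing at a {obj_label}."
    # Invalid or incomplete cue: re-prompt, per the verbal-commands slide.
    return "Sorry, I didn't catch that. Please try a command I can act on."
```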

9 of 19

Evaluation

  • 2 participants, 3 locations (dorm room, library, outdoors)
  • 6 trials each for the move left/right/forward and identify-object commands
  • Different lighting, backgrounds, hand shape, phrasing, and objects
  • Determine whether robot output matched with desired instruction and if additional extracted information was correct

10 of 19

Evaluation - Go

  • Phrasings
    • “Go there”
    • “Move where I’m pointing”
    • “Go in that direction”
  • Hand Position
    • Index finger only
    • Flat palm

11 of 19

12 of 19

Evaluation - Go

  • The ‘forward’ direction was the least accurate, and we found that it tended to generate ‘invalid’ visual cue responses
  • In the outdoor setting, wind and background noise sometimes distorted the audio, and the microphone input and speech-to-text transcription struggled to accurately capture the verbal cue

13 of 19

Go demo video here

14 of 19

Evaluation - Identify

  • Phrasings
    • “What is that”
    • “What’s this object”
    • “I’m pointing at that”
  • Hand Position
    • Index finger only
    • ‘L’ shape with index finger and thumb
  • Objects: Cell Phone, Fork, Book

15 of 19

16 of 19

Evaluation - Identify

  • Like with the Go tests, we found that the outdoor context was less accurate for the identify model due to external audio factors
  • Many of the ‘correct audio only’ results came from the visual cue being misdirected to a different landmark rather than a failure to properly identify an object

17 of 19

Identify demo video here

18 of 19

Conclusion

  • This work provides a foundation for implementing gesture recognition and response in Dobby
  • By combining this work with robot navigation, we can allow users to utilize gestures to guide their interactions with Dobby
  • This increases accessibility and improves the user experience

19 of 19

Conclusion - Further Work

  • Add additional / more complex gestures, such as pointing backwards or waving
  • Utilize more advanced object recognition software
  • Implement more sophisticated semantic analysis
  • Train Dobby to have knowledge of the projects in the lab, so that it can provide information about them when users point at them