KAIST & Korea University
MLV Lab
Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI detection
Chanhyeong Yang¹, Taehoon Song², Jihwan Park¹, Hyunwoo J. Kim²
¹Korea University ²KAIST
Task definition
< Person, Riding, Horse >
Example of HOI detection
Human-Object Interaction detection
Motivation
1. Intra-class diversity
“Person holding a baseball glove”
[Diversity score comparison between object and verb: object class diversity score vs. verb class diversity score]
Diverse visual patterns
Verbs are more diverse than objects
Need to capture visual variance of each verb class.
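The slides do not give the formula behind the diversity score. As a rough illustration only, one plausible definition (an assumption, not the authors' metric) is the mean per-dimension variance of the features belonging to a class. The function name `diversity_score` and the toy features below are hypothetical.

```python
from statistics import fmean, pvariance


def diversity_score(features):
    """Mean per-dimension variance of one class's feature vectors.

    `features` is a list of equal-length vectors (hypothetical CLS features).
    This definition is an assumption; the slides do not specify the formula.
    """
    dims = list(zip(*features))  # transpose: one tuple per feature dimension
    return fmean(pvariance(d) for d in dims)


# Toy data (made up): a tight "object" class and a spread-out "verb" class.
object_feats = [[1.0, 0.0], [1.1, 0.1], [0.9, -0.1]]
verb_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]

obj_score = diversity_score(object_feats)
verb_score = diversity_score(verb_feats)
print(f"object: {obj_score:.4f}, verb: {verb_score:.4f}")
```

Under this assumed definition, a class whose samples look alike scores near zero, while a class with many visual modes scores higher, which matches the slide's claim in spirit.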
Motivation
2. Inter-class visual entanglement
[Inter-class visual overlap of verb classes in t-SNE. Markers: CLS feature, prototype. Classes: boarding, sitting at, licking, eating, riding]
“Person eating an object.”
“Person licking an object.”
“Person sitting at an object.”
Similar, but different classes
Overlap in visual embedding space
Requires fine-grained information of each verb class.
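To make the overlap concrete, here is a minimal sketch that assumes a prototype is the mean of a class's CLS features (the slides do not define it) and measures entanglement as the cosine similarity between prototypes. The helper names and the 2-D toy features are made up for illustration.

```python
import math


def prototype(features):
    """Class prototype as the element-wise mean of its feature vectors (assumed)."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Made-up 2-D features for three verb classes.
feats = {
    "eating": [[1.0, 0.2], [0.9, 0.3]],
    "licking": [[0.9, 0.25], [1.0, 0.35]],
    "riding": [[-0.2, 1.0], [-0.1, 0.9]],
}
protos = {verb: prototype(f) for verb, f in feats.items()}

sim_eat_lick = cosine(protos["eating"], protos["licking"])
sim_eat_ride = cosine(protos["eating"], protos["riding"])
print(f"eating vs licking: {sim_eat_lick:.3f}")
print(f"eating vs riding:  {sim_eat_ride:.3f}")
```

In this toy setup, "eating" and "licking" have nearly identical prototypes even though they are different classes, which is the kind of overlap the t-SNE plot illustrates.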
Methods
Then, how can we handle both challenges?
Capture visual variance of each verb
Leverage fine-grained concepts about verbs
Methods
So, we need a dual-module design: VDRP
Visual Diversity and Region-aware Prompt Learning
Methods
1. We need VDP: Visual Diversity-aware Prompts
1. VDP: inject and reflect visual variance for each verb
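The slides do not spell out how the variance is injected. One way to picture the idea, offered purely as an assumption and not as the authors' implementation, is to perturb a verb's prompt embedding with Gaussian noise scaled by the per-dimension spread of that verb's visual features. The function `variance_aware_prompts` and all values below are hypothetical.

```python
import random
from statistics import pstdev


def variance_aware_prompts(prompt_emb, visual_feats, num_samples=4, seed=0):
    """Sample perturbed copies of a verb prompt embedding.

    Noise in each dimension is scaled by the standard deviation of that
    dimension across the verb's visual features, so visually diverse verbs
    get a wider spread of prompts. Hypothetical scheme, not the authors' code.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    stds = [pstdev(col) for col in zip(*visual_feats)]
    return [
        [p + rng.gauss(0.0, 1.0) * s for p, s in zip(prompt_emb, stds)]
        for _ in range(num_samples)
    ]


prompt = [0.5, -0.5]
visual = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]  # dimension 1 has zero variance
samples = variance_aware_prompts(prompt, visual)
for s in samples:
    print(s)
```

In this sketch the second dimension never moves because the toy visual features do not vary there, while the first dimension spreads in proportion to the observed variance.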
Methods
2. We need RAP: Region-Aware Prompts
2. RAP: augmentation with region-level concepts for each verb
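As a rough sketch of region-level augmentation, the code below assumes that concepts are weighted by a softmax over their dot-product similarity to a region feature and that the weighted mixture is added to the verb prompt embedding. This retrieval rule, the function `region_aware_prompt`, and the toy vectors are all assumptions for illustration; the slides do not specify the actual mechanism.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def region_aware_prompt(prompt_emb, region_feat, concept_embs):
    """Augment a verb prompt with concepts weighted by region similarity.

    Hypothetical scheme: weights are a softmax over dot products between the
    region feature and each concept embedding.
    """
    sims = [sum(r * c for r, c in zip(region_feat, emb)) for emb in concept_embs]
    weights = softmax(sims)
    mixed = [
        sum(w * emb[i] for w, emb in zip(weights, concept_embs))
        for i in range(len(prompt_emb))
    ]
    augmented = [p + m for p, m in zip(prompt_emb, mixed)]
    return augmented, weights


# Toy example: the region feature matches the first concept more closely.
prompt = [0.0, 0.0]
region = [1.0, 0.0]
concepts = [[1.0, 0.0], [0.0, 1.0]]
augmented, weights = region_aware_prompt(prompt, region, concepts)
print(weights, augmented)
```

The weights sum to one and favor the concept closest to the region feature, which loosely mirrors how per-concept scores are listed next to retrieved concepts in the qualitative results.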
Experiments
HICO-DET: NF-UC / RF-UC
Experiments
HICO-DET: Unseen Object / Unseen Verb
Experiments
Qualitative results
CMMP causes modality mismatch with over-separated prompts
VDRP maintains balanced alignment between the two modalities
Experiments
Qualitative results
[Legend: human region concepts, object region concepts, union region concepts, human box, object box]
Retrieved concepts
Kicking an object.
(0.27) Buttocks are tense and slightly lifted.
(0.27) Person is extending leg to make contact with object.
(0.25) Lower leg is straight and foot is flat on ground.
(0.21) Back is straight and shoulders are relaxed.
(0.50) Object appears to be in mid-air or suspended.
(0.35) Object's surface appears to be vibrating or trembling.
(0.07) Object's position changes or shifts.
(0.27) The object is propelled through the air.
(0.14) The person's foot is cocked back before making contact.
(0.13) The object is sent flying in a diagonal direction.
(0.13) The object is struck with a firm, direct kick.
Retrieved concepts
Rowing an object.
(0.59) Arms are extended and pulling on oars.
(0.20) Legs are bent and feet are planted firmly on the platform.
(0.14) Back is straight and facing forward.
(0.67) Rowing motion creates ripples in the water.
(0.20) Object is positioned at an angle in the water.
(0.13) Object is partially submerged in water.
(0.82) The person's legs are tucked in to maintain balance.
(0.07) The person's face is focused on the task at hand.
(0.05) The water is calm and reflective of the surrounding scenery.
Conclusions
Github
Paper