1 of 16

KAIST & Korea University

MLV Lab

Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI detection

Chanhyeong Yang¹, Taehoon Song², Jihwan Park¹, Hyunwoo J. Kim²

¹Korea University ²KAIST

2 of 16

Task definition

< Person, Riding, Horse >

Example of HOI detection

    • Localizing humans and objects in an image
    • Recognizing their interactions

    • Zero-shot HOI detection
      • Addressing unseen combinations
      • Ex) unseen combination/object/verb
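The three zero-shot cases can be illustrated with a minimal sketch. The `HOITriplet` type, the box format, and the toy `SEEN` set are hypothetical and do not come from the slides; the point is only that an unseen combination pairs a verb and an object that were each seen in training, but never together.

```python
from dataclasses import dataclass

# Hypothetical box format: (x1, y1, x2, y2) in pixels.
Box = tuple[float, float, float, float]


@dataclass(frozen=True)
class HOITriplet:
    """One detected <human, verb, object> interaction (illustrative type)."""
    human_box: Box
    verb: str
    object_label: str
    object_box: Box


# Toy set of (verb, object) pairs seen during training; illustrative only.
SEEN = {("riding", "bicycle"), ("feeding", "horse")}


def zero_shot_kind(t: HOITriplet, seen=SEEN) -> str:
    """Classify a triplet as seen, or as one of the three unseen cases."""
    seen_verbs = {v for v, _ in seen}
    seen_objects = {o for _, o in seen}
    if (t.verb, t.object_label) in seen:
        return "seen"
    if t.verb not in seen_verbs:
        return "unseen verb"
    if t.object_label not in seen_objects:
        return "unseen object"
    return "unseen combination"


# <Person, Riding, Horse>: both parts were seen, but never as a pair.
pred = HOITriplet((10, 20, 110, 220), "riding", "horse", (5, 90, 200, 260))
print(zero_shot_kind(pred))
```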

Human-Object Interaction detection

3 of 16

Motivation

1. Intra-class diversity

“Person holding a baseball glove”

Diverse visual patterns

[Diversity score comparison between object and verb: object class diversity score vs. verb class diversity score]

Verbs are more diverse than objects

Need to capture the visual variance of each verb class.
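The slides do not define how the diversity score is computed. As one plausible reading, the sketch below assumes it is the mean squared distance of L2-normalized features to their class mean; the function names and the toy features are illustrative only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def diversity_score(features):
    """Mean squared distance of unit-normalized features to their class mean.

    This definition is an assumption, not the one used by the authors.
    """
    feats = [l2_normalize(f) for f in features]
    n, dim = len(feats), len(feats[0])
    mean = [sum(f[d] for f in feats) / n for d in range(dim)]
    return sum(
        sum((f[d] - mean[d]) ** 2 for d in range(dim)) for f in feats
    ) / n


# Toy features: the "object" class is tightly clustered, the "verb" class is not.
object_feats = [[1.0, 0.0], [0.98, 0.2], [0.97, -0.1]]
verb_feats = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.2]]

print(diversity_score(object_feats), diversity_score(verb_feats))
```

Under this definition, a class whose samples look alike scores near zero, while a class with varied appearance scores higher.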

5 of 16

Motivation

2. Inter-class visual entanglement

[Inter-class visual overlap of verb classes in t-SNE. Markers: CLS feature, prototype. Classes: boarding, sitting at, licking, eating, riding]

“Person eating an object.”

“Person licking an object.”

“Person sitting at an object.”

Similar, but different classes

Overlap in visual embedding space

Requires fine-grained information about each verb class.

7 of 16

Methods

Then, how can we handle both challenges?

Capture the visual variance of each verb

Leverage fine-grained concepts about verbs

8 of 16

Methods

So we need a dual-module design: VDRP

Visual Diversity and Region-aware Prompt Learning

9 of 16

Methods

1. We need VDP: Visual Diversity-aware Prompts

1. VDP: inject and reflect the visual variance of each verb
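The slide does not spell out the injection mechanism. The sketch below shows one way to reflect per-verb visual variance in a prompt: estimate a per-dimension variance from a verb's visual features, then perturb the verb's prompt embedding with Gaussian noise of matching spread. The function names, the Gaussian form, and the `scale` parameter are all assumptions for illustration, not the authors' implementation.

```python
import math
import random


def per_dim_variance(features):
    """Per-dimension (population) variance of a verb's visual features."""
    n, dim = len(features), len(features[0])
    mean = [sum(f[d] for f in features) / n for d in range(dim)]
    return [sum((f[d] - mean[d]) ** 2 for f in features) / n for d in range(dim)]


def inject_variance(prompt, variance, scale=1.0, rng=random):
    """Perturb a prompt embedding with noise scaled by the verb's visual std.

    Assumed mechanism: prompt + scale * sqrt(variance) * N(0, 1) per dimension.
    """
    return [
        p + scale * math.sqrt(v) * rng.gauss(0.0, 1.0)
        for p, v in zip(prompt, variance)
    ]


# Toy example: the verb varies along the first dimension only.
verb_feats = [[1.0, 0.0], [3.0, 0.0]]
prompt = [0.5, 0.5]
variance = per_dim_variance(verb_feats)
print(inject_variance(prompt, variance, rng=random.Random(0)))
```

In this sketch, dimensions where the verb shows no visual variance leave the prompt unchanged, so a visually diverse verb receives a broader set of perturbed prompts than a visually uniform one.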

10 of 16

Methods

2. We need RAP: Region-Aware Prompts

2. RAP: augment prompts with region-level concepts for each verb
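The slide does not give the retrieval or fusion rule. The sketch below shows one way region-level concepts could augment a verb prompt: rank concept embeddings by cosine similarity to a region feature (human, object, or union), turn the top-k similarities into softmax weights, and add the weighted concepts to the prompt. The function names, `top_k`, `temperature`, and the additive fusion are assumptions for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_concepts(region_feat, concept_embs, top_k=2, temperature=0.1):
    """Return (index, weight) pairs for the top_k most similar concepts.

    Weights are a softmax over the retrieved similarities and sum to 1.
    """
    sims = [(cosine(region_feat, c), i) for i, c in enumerate(concept_embs)]
    top = sorted(sims, reverse=True)[:top_k]
    exps = [math.exp(s / temperature) for s, _ in top]
    total = sum(exps)
    return [(i, e / total) for (_, i), e in zip(top, exps)]


def augment_prompt(prompt, concept_embs, retrieved):
    """Add the weighted retrieved concept embeddings to the verb prompt."""
    out = list(prompt)
    for i, w in retrieved:
        for d in range(len(out)):
            out[d] += w * concept_embs[i][d]
    return out


# Toy concept embeddings and a region feature closest to concept 0.
concept_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
region_feat = [1.0, 0.1]
retrieved = retrieve_concepts(region_feat, concept_embs)
print(retrieved)
print(augment_prompt([0.0, 0.0], concept_embs, retrieved))
```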

11 of 16

Experiments

HICO-DET: NF-UC / RF-UC

12 of 16

Experiments

HICO-DET: Unseen Object / Unseen Verb

13 of 16

Experiments

Qualitative results

CMMP causes modality mismatch with over-separated prompts

VDRP maintains balanced alignment between the two modalities

14 of 16

Experiments

Qualitative results

 

[Legend: human region concepts, object region concepts, union region concepts, human box, object box]

Retrieved concepts

Kicking an object.

(0.27) Buttocks are tense and slightly lifted.

(0.27) Person is extending leg to make contact with object.

(0.25) Lower leg is straight and foot is flat on ground.

(0.21) Back is straight and shoulders are relaxed.

(0.50) Object appears to be in mid-air or suspended.

(0.35) Object's surface appears to be vibrating or trembling.

(0.07) Object's position changes or shifts.

(0.27) The object is propelled through the air.

(0.14) The person's foot is cocked back before making contact.

(0.13) The object is sent flying in a diagonal direction.

(0.13) The object is struck with a firm, direct kick.

Retrieved concepts

Rowing an object.

(0.59) Arms are extended and pulling on oars.

(0.20) Legs are bent and feet are planted firmly on the platform.

(0.14) Back is straight and facing forward.

(0.67) Rowing motion creates ripples in the water.

(0.20) Object is positioned at an angle in the water.

(0.13) Object is partially submerged in water.

(0.82) The person's legs are tucked in to maintain balance.

(0.07) The person's face is focused on the task at hand.

(0.05) The water is calm and reflective of the surrounding scenery.

15 of 16

Conclusions

    • We propose VDRP, a dual-module prompt learning framework for zero-shot HOI detection.

    • By combining visual diversity-aware prompts and region-aware prompts, VDRP addresses both intra-class diversity and inter-class entanglement at once.

16 of 16

GitHub

Paper
