1 of 16

KAIST & Korea University

MLV Lab

Visual Diversity and Region-aware Prompt Learning for Zero-shot HOI detection

Chanhyeong Yang¹, Taehoon Song², Jihwan Park¹,

Hyunwoo J. Kim²

¹Korea University ²KAIST

2 of 16

Task definition

< Person, Riding, Horse >

Example of HOI detection

Localizing humans and objects in an image
Recognizing their interactions

Zero-shot HOI detection

Addressing unseen combinations
e.g., unseen combination, object, or verb

Human-Object Interaction detection

Human-Object Interaction (HOI) detection aims to localize humans and objects in an image and then recognize their interactions.

For example, in the image on the right, the model needs to localize the person and the horse and then recognize their interaction, “Riding”.

Zero-shot HOI detection is even more challenging, because the model must also handle combinations that were not seen during training.
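
To make the zero-shot setting concrete, here is a minimal sketch of an HOI triplet and an unseen-combination check. The `HOITriplet` type, the `is_unseen_combination` helper, and the toy training pairs are hypothetical names chosen for illustration; they are not taken from the VDRP code.

```python
from typing import NamedTuple


class HOITriplet(NamedTuple):
    """One HOI annotation: <human, verb, object>, e.g. <Person, Riding, Horse>."""
    human: str
    verb: str
    obj: str


def is_unseen_combination(triplet, seen_pairs):
    """Return True if the (verb, object) pair never appeared during training."""
    return (triplet.verb, triplet.obj) not in seen_pairs


# Hypothetical training set: "riding" and "horse" are each seen,
# but never together.
seen_pairs = {("riding", "bicycle"), ("feeding", "horse")}
query = HOITriplet("person", "riding", "horse")
print(is_unseen_combination(query, seen_pairs))  # True
```

In this toy case the verb and the object are both familiar, yet the model has never observed them as a pair, which is the unseen-combination case listed on the slide.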

3 of 16

Motivation

1. Intra-class diversity

“Person holding a baseball glove”

[Figure: Diversity score comparison between object and verb (object class diversity score vs. verb class diversity score)]

Diverse visual patterns

Verbs are more diverse than objects

We identified two core challenges in HOI detection.

The first is intra-class diversity: the same verb can exhibit diverse visual patterns.

As the images on the left show, “Person holding a baseball glove” can look very different from one image to the next.

Also, as the figure on the right shows, the verbs to be recognized in HOI detection are more diverse than the objects.
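
The slides do not define how the diversity score is computed. As a rough illustration only, one plausible proxy is the average distance of a class's visual features to their centroid. The function names and the toy 2-D features below are assumptions, not the authors' metric.

```python
import math


def centroid(features):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(features)
    return [sum(f[i] for f in features) / n for i in range(len(features[0]))]


def diversity_score(features):
    """Assumed proxy: mean Euclidean distance of features to their centroid."""
    center = centroid(features)
    return sum(math.dist(f, center) for f in features) / len(features)


# Toy features: the object class is tightly clustered, the verb class is spread out.
object_features = [[1.0, 0.0], [1.0, 0.2], [1.0, -0.2]]
verb_features = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(diversity_score(object_features) < diversity_score(verb_features))  # True
```

Under this proxy, a higher score means the class covers a wider area of the embedding space, which matches the claim that verbs are more diverse than objects.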

4 of 16

Motivation

1. Intra-class diversity

Need to capture the visual variance of each verb class.

5 of 16

Motivation

2. Inter-class visual entanglement

[Figure: Inter-class visual overlap of verb classes in t-SNE. Markers: CLS feature, prototype. Classes: boarding, sitting at, licking, eating, riding.]

“Person eating an object.”

“Person licking an object.”

“Person sitting at an object.”

Similar, but different classes

Overlap in visual embedding space

The second is inter-class visual entanglement: visually similar interactions can belong to different verb classes.

As the images on the left show, “Person eating an object.”, “Person licking an object.”, and “Person sitting at an object.” have similar visual patterns but belong to different classes.

Also, as the t-SNE plot on the right shows, the features of these classes overlap in the visual embedding space.
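
One simple way to quantify this kind of overlap is to compare class prototypes. The sketch below assumes a prototype is the mean of a class's CLS features and uses cosine similarity between prototypes; both choices, and the toy 3-D features, are illustrative assumptions rather than the procedure behind the t-SNE figure.

```python
import math


def prototype(features):
    """Assumed prototype: element-wise mean of a class's CLS features."""
    n = len(features)
    return [sum(f[i] for f in features) / n for i in range(len(features[0]))]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy CLS features: "eating" and "licking" look alike, "riding" does not.
cls_features = {
    "eating": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "licking": [[0.85, 0.15, 0.05], [0.9, 0.2, 0.0]],
    "riding": [[0.0, 0.1, 0.9], [0.1, 0.0, 1.0]],
}
protos = {verb: prototype(feats) for verb, feats in cls_features.items()}

entangled = cosine(protos["eating"], protos["licking"])
separated = cosine(protos["eating"], protos["riding"])
print(entangled > separated)  # True
```

A prototype similarity close to 1 between two different verbs is a sign that the classifier needs finer-grained cues than the global feature alone.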

6 of 16

Motivation

2. Inter-class visual entanglement

Requires fine-grained information of each verb class.

7 of 16

Methods

Then, how can we handle both challenges?

Capture visual variance of each verb

Leverage fine-grained concepts about verbs

We regard these two problems as the core challenges of HOI detection.

Then, how can we handle both challenges at once?

A typical two-stage HOI detection model first detects the humans and objects in an image and then classifies their interactions.

In the interaction classification stage, we consider two points.

First, we need to capture the visual variance of each verb.

Second, we need to leverage fine-grained concepts about the verbs.
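
The two-stage flow described here can be sketched as follows. This is a simplified illustration: the detection dictionaries, the `pair_detections` and `classify_verb` helpers, and the toy 2-D verb embeddings are all hypothetical, and a real model would use a detector and learned features instead.

```python
def pair_detections(detections):
    """Stage 1 output -> candidate (human, object) pairs.

    Every detected person is paired with every other detection.
    """
    pairs = []
    for human in detections:
        if human["label"] != "person":
            continue
        for obj in detections:
            if obj is not human:
                pairs.append((human, obj))
    return pairs


def classify_verb(pair_feature, verb_embeddings):
    """Stage 2: pick the verb whose embedding has the highest dot product."""
    def score(verb):
        return sum(x * y for x, y in zip(pair_feature, verb_embeddings[verb]))
    return max(verb_embeddings, key=score)


# Hypothetical detections and toy embeddings, for illustration only.
detections = [
    {"label": "person", "box": (10, 20, 110, 220)},
    {"label": "horse", "box": (40, 90, 260, 300)},
]
verb_embeddings = {"riding": [1.0, 0.0], "feeding": [0.0, 1.0]}

pairs = pair_detections(detections)
predicted = classify_verb([0.9, 0.1], verb_embeddings)
print(len(pairs), predicted)  # 1 riding
```

Both points raised above concern the second stage: what the verb embeddings should look like so that they cover each verb's visual variance and still separate similar verbs.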

8 of 16

Methods

So we need a dual-module design: VDRP

Visual Diversity and Region-aware Prompt Learning

9 of 16

Methods

1. We need VDP: Visual Diversity-aware Prompts

1. VDP: injects and reflects the visual variance of each verb
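
The slide does not spell out how the variance is injected. One way to read it, offered purely as an assumption, is to estimate a per-dimension spread from a verb's visual features and perturb that verb's prompt embedding with noise of matching scale. The helper names and toy values below are hypothetical and are not the VDRP implementation.

```python
import random
import statistics


def per_dim_std(features):
    """Population standard deviation of each embedding dimension."""
    return [statistics.pstdev(dim) for dim in zip(*features)]


def diversify_prompt(prompt, std, rng, scale=1.0):
    """Perturb a prompt embedding with Gaussian noise scaled by `std`."""
    return [p + scale * s * rng.gauss(0.0, 1.0) for p, s in zip(prompt, std)]


# Toy visual features for one verb: dimension 0 is stable, dimension 1 varies.
holding_features = [[1.0, 0.0], [1.0, 2.0]]
holding_prompt = [0.5, 0.5]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
std = per_dim_std(holding_features)  # [0.0, 1.0]
sample = diversify_prompt(holding_prompt, std, rng)
print(sample[0])  # 0.5, because dimension 0 has zero variance
```

In this reading, dimensions where the verb looks the same in every image stay fixed, while dimensions that vary across images are allowed to move, so the prompt covers a region rather than a single point.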

10 of 16

Methods

2. We need RAP: Region-Aware Prompts

2. RAP: augments each verb with region-level concepts
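
As a hedged sketch of what region-level augmentation could look like, the code below scores candidate concepts against a region feature, turns the scores into softmax weights, keeps the top few, and adds their weighted embeddings to the verb prompt. The retrieval rule, the `retrieve_concepts` and `augment_prompt` helpers, and the toy embeddings are assumptions; only the concept sentences come from the qualitative results slide.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_concepts(region_feature, concepts, top_k=2):
    """Rank concepts by softmax-normalized dot product with the region feature."""
    names = list(concepts)
    sims = [sum(x * y for x, y in zip(region_feature, concepts[n])) for n in names]
    ranked = sorted(zip(names, softmax(sims)), key=lambda nw: nw[1], reverse=True)
    return ranked[:top_k]


def augment_prompt(prompt, retrieved, concepts):
    """Add the weighted concept embeddings to the verb prompt embedding."""
    out = list(prompt)
    for name, weight in retrieved:
        for i, value in enumerate(concepts[name]):
            out[i] += weight * value
    return out


# Toy 2-D embeddings for human-region concepts of "rowing an object".
concepts = {
    "Arms are extended and pulling on oars.": [1.0, 0.0],
    "Legs are bent and feet are planted firmly on the platform.": [0.5, 0.5],
    "Back is straight and facing forward.": [0.0, 1.0],
}
human_region_feature = [2.0, 0.0]

retrieved = retrieve_concepts(human_region_feature, concepts)
rowing_prompt = augment_prompt([0.0, 0.0], retrieved, concepts)
print(retrieved[0][0])  # Arms are extended and pulling on oars.
```

The same routine would be run separately for the human, object, and union regions, each with its own concept pool.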

11 of 16

Experiments

HICO-DET: NF-UC / RF-UC

12 of 16

Experiments

HICO-DET: Unseen Object / Unseen Verb

13 of 16

Experiments

Qualitative results

CMMP causes modality mismatch with over-separated prompts

VDRP maintains balanced alignment between the two modalities

14 of 16

Experiments

Qualitative results

Legend: human region concepts, object region concepts, union region concepts, human box, object box

Retrieved concepts

Kicking an object.

(0.27) Buttocks are tense and slightly lifted.

(0.27) Person is extending leg to make contact with object.

(0.25) Lower leg is straight and foot is flat on ground.

(0.21) Back is straight and shoulders are relaxed.

(0.50) Object appears to be in mid-air or suspended.

(0.35) Object's surface appears to be vibrating or trembling.

(0.07) Object's position changes or shifts.

(0.27) The object is propelled through the air.

(0.14) The person's foot is cocked back before making contact.

(0.13) The object is sent flying in a diagonal direction.

(0.13) The object is struck with a firm, direct kick.

Retrieved concepts

Rowing an object.

(0.59) Arms are extended and pulling on oars.

(0.20) Legs are bent and feet are planted firmly on the platform.

(0.14) Back is straight and facing forward.

(0.67) Rowing motion creates ripples in the water.

(0.20) Object is positioned at an angle in the water.

(0.13) Object is partially submerged in water.

(0.82) The person's legs are tucked in to maintain balance.

(0.07) The person's face is focused on the task at hand.

(0.05) The water is calm and reflective of the surrounding scenery.

15 of 16

Conclusions

We propose VDRP, a dual-module prompt learning framework for zero-shot HOI detection.

By combining visual diversity-aware prompts and region-aware prompts, VDRP addresses both intra-class diversity and inter-class entanglement at once.

16 of 16

Github

Paper
