1 of 17

Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection

Jongha Kim*, Jihwan Park*, Jinyoung Park*, Jinyoung Kim, Sehyung Kim, Hyunwoo J. Kim

Department of Computer Science and Engineering, Korea University

Korea University

MLV Lab

2 of 17

Visual Relationship Detection (VRD)

Visual Relationship Detection (VRD) is the task of detecting <subject, predicate, object> triplets in an image. It covers both Scene Graph Generation (SGG) and Human-Object Interaction (HOI) detection.

Scene Graph Generation by Iterative Message Passing, Xu et al., CVPR 2017

Example of a <s, p, o> triplet

3 of 17

Transformer-based Visual Relationship Detection

Following the success of DETR in object detection, Transformer-based detectors have recently gained attention for VRD tasks. A Transformer-based VRD detector consists of a backbone, a Transformer encoder, and a Transformer decoder.

Iterative Scene Graph Generation, Khandelwal et al., NeurIPS 2022

HOTR: End-to-End Human-Object Interaction Detection with Transformers, Kim et al., CVPR 2021 (Oral)

4 of 17

Label Assignment for training of Transformer-based VRD detectors

Label assignment is the process of assigning each ground-truth (GT) triplet to a prediction so that a Transformer-based VRD detector can be trained. Following DETR, the Hungarian matching algorithm is widely adopted as the assignment strategy.

NMS Strikes Back, Ouyang-Zhang et al., arXiv 2022

An example of label assignment result
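As a rough illustration of the one-to-one assignment described above, the sketch below picks the minimum-cost pairing between GTs and queries. It is not the detectors' actual matcher: the cost values are made up, the function name is hypothetical, and brute-force search stands in for the Hungarian algorithm, which finds the same optimum far more efficiently.

```python
from itertools import permutations


def one_to_one_assignment(cost):
    """Return the minimum-cost one-to-one GT -> query assignment.

    cost[g][q] is a hypothetical matching cost between GT g and query q
    (lower is better). Brute force is used here only for readability; the
    Hungarian algorithm reaches the same optimum in polynomial time.
    Assumes there are at least as many queries as GTs.
    """
    num_gt, num_queries = len(cost), len(cost[0])
    best_total, best_queries = float("inf"), None
    # Each permutation picks a distinct query for every GT.
    for queries in permutations(range(num_queries), num_gt):
        total = sum(cost[g][q] for g, q in enumerate(queries))
        if total < best_total:
            best_total, best_queries = total, queries
    return dict(enumerate(best_queries))


# Made-up costs for 2 GTs and 4 queries.
cost = [
    [0.9, 0.2, 0.3, 0.8],
    [0.7, 0.6, 0.1, 0.9],
]
match = one_to_one_assignment(cost)
# Queries left unmatched are trained with 'no relation' as their target.
unmatched = [q for q in range(len(cost[0])) if q not in match.values()]
print(match, unmatched)
```

In this toy case GT 0 goes to query 1 and GT 1 goes to query 2, while queries 0 and 3 receive 'no relation'.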

5 of 17

Problem 1: Unspecialized Training Signals

Since a GT may be assigned to any query, every query is expected to detect every predicate. As a result, each query learns a non-specific, vague role, because it struggles to cover all predicate types.

6 of 17

Problem 2: Insufficient Training Signals

Although multiple high-quality predictions may exist for a single GT, conventional assignment assigns each GT to only one prediction. The remaining promising predictions are therefore suppressed, because they are assigned ‘no relation’ as their target.

7 of 17

Enhanced label assignment for Transformer-based VRD models

To address these problems with conventional assignment, we propose an enhanced label assignment strategy named SpeaQ. SpeaQ promotes query specialization by assigning to each query only GTs with specific predicate labels. It also provides sufficient supervision by adaptively assigning each GT to multiple predictions according to prediction quality.

8 of 17

Method 1: Groupwise Query Specialization

First, predicates and queries are each divided into multiple groups. Then, a GT is assigned only to a query in the matching query group: for example, a GT whose predicate label belongs to the first predicate group can only be assigned to a query in the first query group. As a result, each query specializes in detecting a small number of target predicates rather than struggling to detect every predicate.

9 of 17

Frequency-based predicate grouping

Predicates are divided into multiple groups. To relieve the optimization difficulties caused by the long-tailed predicate distribution, predicates with similar frequencies are grouped together.

Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation, Dong et al., CVPR 2022
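One simple way to picture frequency-based grouping is to sort predicates by how often they occur and cut the sorted list into contiguous chunks. The slide does not state the exact grouping rule, so the equal-size cut, the function name, and the predicate counts below are all assumptions made for illustration.

```python
def group_predicates_by_frequency(freq, num_groups):
    """Split predicates into at most num_groups groups of similar frequency.

    freq maps a predicate name to its GT count. Predicates are sorted from
    most to least frequent and cut into contiguous chunks, so each group
    holds predicates with neighbouring frequencies.
    ASSUMPTION: the equal-size cut is a choice of this sketch only.
    """
    ordered = sorted(freq, key=freq.get, reverse=True)
    size = -(-len(ordered) // num_groups)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


# Made-up, long-tailed predicate counts.
freq = {
    "on": 700,
    "has": 280,
    "wearing": 140,
    "riding": 30,
    "eating": 12,
    "painted on": 2,
}
groups = group_predicates_by_frequency(freq, num_groups=3)
print(groups)
```

Here the head predicates ("on", "has") end up together, and the rare ones ("eating", "painted on") form their own group.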

10 of 17

Proportional query grouping

Queries are also divided into multiple groups. So that every query receives a similar number of GTs on average, the size of each query group is set proportional to the number of GTs in the corresponding predicate group.
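The proportional sizing can be sketched as follows. The rounding scheme (largest remainder, so the group sizes sum exactly to the number of queries), the function name, and the GT counts are assumptions of this sketch; the slide only states that group size is proportional to the GT count.

```python
def proportional_query_groups(group_gt_counts, num_queries):
    """Split query indices into groups sized proportionally to GT counts.

    group_gt_counts[i] is the number of GTs in predicate group i.
    ASSUMPTION: leftover queries after flooring go to the groups with the
    largest fractional parts (largest-remainder rounding).
    """
    total = sum(group_gt_counts)
    exact = [num_queries * c / total for c in group_gt_counts]
    sizes = [int(x) for x in exact]
    leftover = num_queries - sum(sizes)
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - sizes[i], reverse=True
    )
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    # Turn the sizes into contiguous ranges of query indices.
    groups, start = [], 0
    for size in sizes:
        groups.append(list(range(start, start + size)))
        start += size
    return groups


# Made-up GT counts for three predicate groups, shared among 8 queries.
query_groups = proportional_query_groups([500, 300, 200], num_queries=8)
print(query_groups)
```

With these numbers the exact shares are 4.0, 2.4, and 1.6 queries, which round to groups of 4, 2, and 2.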

11 of 17

Groupwise query specialization

With predicate and query groups defined, groupwise query specialization adds a constraint to the assignment: a GT whose predicate label belongs to a given predicate group can only be assigned to queries in the corresponding query group. This constraint makes each query focus on a small set of target predicates.
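A minimal sketch of this constraint, under assumptions: cross-group pairs are given infinite cost before a one-to-one matching is run. The costs and group ids are made up, the function name is hypothetical, and brute-force search again stands in for Hungarian matching to keep the example self-contained.

```python
from itertools import permutations

INF = float("inf")


def specialized_assignment(cost, gt_group, query_group):
    """One-to-one matching restricted to same-group (GT, query) pairs.

    cost[g][q] is a hypothetical matching cost; gt_group[g] is the
    predicate group of GT g and query_group[q] is the group of query q.
    """
    # Forbid every pair whose groups differ by giving it infinite cost.
    masked = [
        [c if gt_group[g] == query_group[q] else INF for q, c in enumerate(row)]
        for g, row in enumerate(cost)
    ]
    best_total, best_queries = INF, None
    for queries in permutations(range(len(query_group)), len(cost)):
        total = sum(masked[g][q] for g, q in enumerate(queries))
        if total < best_total:
            best_total, best_queries = total, queries
    if best_queries is None:
        raise ValueError("a query group has fewer queries than GTs")
    return dict(enumerate(best_queries))


# Made-up example: both GTs carry predicates from predicate group 1.
cost = [
    [0.1, 0.5, 0.4, 0.9],
    [0.8, 0.7, 0.2, 0.3],
]
gt_group = [1, 1]
query_group = [0, 0, 1, 1]  # queries 0-1 serve group 0, queries 2-3 group 1
match = specialized_assignment(cost, gt_group, query_group)
print(match)
```

Without the constraint, GT 0 would be matched to query 0, its cheapest query. With it, both GTs must be matched inside query group 1, so GT 0 goes to query 2 and GT 1 to query 3.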

12 of 17

Method 2: Quality-Aware Multi-Assignment

13 of 17

Definition of triplet-level prediction quality

The triplet-level quality of a prediction with respect to a GT is defined from the box localization quality (i.e., IoU) on the subject and object and the classification quality (i.e., score) on the predicate.
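The exact quality formula and assignment rule are not given in this text, so the following is only a sketch under stated assumptions. It assumes quality is the product of subject IoU, object IoU, and the predicted score of the GT predicate, and that a GT is assigned to the top predictions whose quality clears a threshold. The dictionary keys, `min_quality`, and `max_k` are hypothetical names introduced for illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def triplet_quality(pred, gt):
    """ASSUMPTION: quality = subject IoU * object IoU * predicate score."""
    localization = box_iou(pred["sub_box"], gt["sub_box"]) * box_iou(
        pred["obj_box"], gt["obj_box"]
    )
    classification = pred["predicate_scores"][gt["predicate"]]
    return localization * classification


def multi_assign(preds, gt, min_quality=0.3, max_k=3):
    """Indices of predictions assigned to gt, best first.

    ASSUMPTION: keep up to max_k predictions with quality >= min_quality,
    and fall back to the single best prediction if none qualifies.
    """
    scored = sorted(
        ((triplet_quality(p, gt), i) for i, p in enumerate(preds)), reverse=True
    )
    keep = [i for quality, i in scored[:max_k] if quality >= min_quality]
    return keep or [scored[0][1]]


# Made-up GT and predictions.
gt = {"sub_box": (0, 0, 10, 10), "obj_box": (10, 10, 20, 20), "predicate": "riding"}
preds = [
    {"sub_box": (0, 0, 10, 10), "obj_box": (10, 10, 20, 20),
     "predicate_scores": {"riding": 0.9}},
    {"sub_box": (0, 0, 10, 10), "obj_box": (10, 10, 20, 15),
     "predicate_scores": {"riding": 0.8}},
    {"sub_box": (0, 0, 10, 5), "obj_box": (10, 10, 20, 20),
     "predicate_scores": {"riding": 0.2}},
]
assigned = multi_assign(preds, gt)
print(assigned)
```

In this toy case the first two predictions have quality 0.9 and 0.4 and both receive the GT, while the third (quality 0.1) is left as 'no relation'. Under one-to-one assignment only the first would have been supervised.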

14 of 17

15 of 17

Results on Visual Genome benchmark

Applied to HOTR and ISG, SpeaQ obtains the best results on the VG dataset. SpeaQ is the first method to achieve the best results on both R@k and mR@k, two conflicting metrics that are biased toward frequent and rare predicates, respectively.

16 of 17

Results on HICO-DET benchmark

Similarly, applying SpeaQ to GEN-VLKT yields consistent performance gains. Notably, these gains come at zero additional inference cost, since SpeaQ is applied only during training.

17 of 17

Conclusion

Under conventional label assignment, a query learns a vague role and is insufficiently trained.

We propose SpeaQ, an enhanced label assignment strategy that provides specialized and abundant training signals to queries.

SpeaQ improves performance across multiple visual relationship detection tasks and architectures with zero additional inference cost.