Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection
Jongha Kim*, Jihwan Park*, Jinyoung Park*, Jinyoung Kim, Sehyung Kim, Hyunwoo J. Kim
Department of Computer Science and Engineering, Korea University
Korea University
MLV Lab
Visual Relationship Detection (VRD)
Visual Relationship Detection (VRD) is the task of detecting <subject, predicate, object> triplets in an image. It includes the Scene Graph Generation (SGG) and Human-Object Interaction (HOI) Detection tasks.
Scene Graph Generation by Iterative Message Passing, Xu et al., CVPR 2017
Example of a <s, p, o> triplet
Transformer-based Visual Relationship Detection
Following the success of DETR in object detection, Transformer-based detectors have recently gained attention for VRD tasks. A Transformer-based VRD detector consists of a backbone, a Transformer encoder, and a Transformer decoder.
Iterative Scene Graph Generation, Khandelwal et al., NeurIPS 2022
HOTR: End-to-End Human-Object Interaction Detection with Transformers, Kim et al., CVPR 2021 (Oral)
Label Assignment for training of Transformer-based VRD detectors
Label assignment is the process of assigning each ground truth (GT) to a corresponding prediction in order to train Transformer-based VRD detectors. Following DETR, the Hungarian matching algorithm is widely adopted as the assignment strategy.
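As a minimal sketch of Hungarian matching for label assignment, the snippet below uses SciPy's `linear_sum_assignment` on a small cost matrix. The cost values are hypothetical and chosen only for illustration; in practice the cost would be computed from the model's predictions and the GTs.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical matching costs: rows are predictions (queries), columns are GTs.
# A lower cost means the prediction agrees better with the GT.
cost = np.array([
    [0.90, 0.20],
    [0.10, 0.80],
    [0.30, 0.25],
    [0.70, 0.60],
])

# Hungarian matching: one-to-one assignment that minimizes the total cost.
pred_idx, gt_idx = linear_sum_assignment(cost)

# Every prediction starts as 'no relation' (-1); matched predictions get a GT index.
assigned_gt = np.full(cost.shape[0], -1)
assigned_gt[pred_idx] = gt_idx

print(assigned_gt.tolist())
```

In this toy example, prediction 2 matches GT 1 almost as well as prediction 0 does (cost 0.25 versus 0.20), yet one-to-one matching still labels it 'no relation'.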
NMS Strikes Back, Ouyang-Zhang et al., arXiv 2022
An example of label assignment result
Problem 1: Unspecialized Training Signals
Since a GT can be assigned to an arbitrary query, every query is expected to detect every predicate. As a result, a query learns a non-specific, vague role, because it struggles to learn to detect every type of predicate.
Problem 2: Insufficient Training Signals
Although multiple high-quality predictions may correspond to a GT, conventional assignment assigns each GT to only a single prediction. Promising predictions are therefore suppressed, because they are assigned ‘no relation’ as their GT.
Enhanced label assignment for Transformer-based VRD models
To address these problems in conventional assignment, we propose an enhanced label assignment strategy named SpeaQ. SpeaQ promotes query specialization by assigning to each query only GTs with specific predicate labels. It also provides sufficient supervision to queries by adaptively assigning a GT to multiple predictions according to prediction quality.
Method 1: Groupwise Query Specialization
First, predicates and queries are each divided into multiple groups. Then, a GT is assigned to a query in a given query group only if its predicate label belongs to the corresponding predicate group; for example, a GT can be assigned to a query in the first query group only if its predicate label belongs to the first predicate group. By doing so, a query specializes in detecting only a small number of target predicates rather than struggling to detect every predicate.
Frequency-based predicate grouping
Predicates are divided into multiple groups. To relieve the optimization difficulties caused by the long-tailed predicate distribution, predicates with similar frequencies are grouped together.
Stacked Hybrid-Attention and Group Collaborative Learning for Unbiased Scene Graph Generation, Dong et al., CVPR 2022
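One simple way to group predicates with similar frequencies is sketched below. It assumes that grouping means sorting predicates by frequency and cutting the ranking into contiguous, roughly equal-sized chunks; the exact grouping rule is not stated here, so this rule, the function name `group_predicates_by_frequency`, and the counts in `freq` are all hypothetical.

```python
def group_predicates_by_frequency(freq, num_groups):
    """Sort predicates by frequency and cut the ranking into contiguous groups.

    freq:       dict mapping predicate name -> number of training instances
    num_groups: number of predicate groups to create
    """
    ranked = sorted(freq, key=freq.get, reverse=True)
    base, extra = divmod(len(ranked), num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        # The first `extra` groups receive one additional predicate.
        size = base + (1 if g < extra else 0)
        groups.append(ranked[start:start + size])
        start += size
    return groups


# Hypothetical long-tailed predicate counts, for illustration only.
freq = {
    "on": 5000, "has": 4200, "wearing": 900,
    "riding": 300, "eating": 40, "painted on": 12,
}
groups = group_predicates_by_frequency(freq, num_groups=3)
print(groups)
```

Each resulting group holds predicates from a similar frequency range, so rare predicates are kept apart from the frequent ones.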
Proportional query grouping
Queries are also divided into multiple groups. So that every query is assigned a similar number of GTs on average, the size of each query group is set proportional to the number of GTs in the corresponding predicate group.
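The proportional sizing can be sketched as follows. The rounding scheme (largest remainder), the function name `allocate_queries`, and the GT counts are assumptions made for illustration; only the proportionality itself comes from the text above.

```python
def allocate_queries(gt_counts, num_queries):
    """Split num_queries into groups with sizes proportional to gt_counts.

    gt_counts:   number of GTs in each predicate group
    num_queries: total number of queries to distribute
    """
    total = sum(gt_counts)
    shares = [num_queries * c / total for c in gt_counts]
    sizes = [int(s) for s in shares]  # round each share down first
    # Hand the leftover queries to the groups with the largest fractional parts.
    leftover = num_queries - sum(sizes)
    order = sorted(range(len(shares)),
                   key=lambda i: shares[i] - sizes[i], reverse=True)
    for i in order[:leftover]:
        sizes[i] += 1
    return sizes


# Hypothetical GT counts for three predicate groups (frequent -> rare).
gt_counts = [9200, 1200, 52]
sizes = allocate_queries(gt_counts, num_queries=100)
print(sizes)
```

With strongly long-tailed counts, pure proportionality can leave a rare group with very few queries, so a practical implementation might also enforce a minimum group size; that choice is outside what is described here.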
Groupwise query specialization
With predicate and query groups defined, groupwise query specialization places an additional constraint on assignment: a GT whose predicate label belongs to a specific predicate group can be assigned only to queries in the corresponding query group. This constraint makes each query focus on a small set of specific target predicates.
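A minimal sketch of this constrained assignment is given below, assuming the constraint is realized by running Hungarian matching separately inside each group. The function name `groupwise_match`, the group ids, and the cost values are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def groupwise_match(cost, query_group, gt_group):
    """Hungarian matching restricted to (query, GT) pairs from the same group.

    cost:        (num_queries, num_gts) matching cost, lower is better
    query_group: group id of each query
    gt_group:    predicate-group id of each GT's predicate label
    Returns a sorted list of (query_index, gt_index) pairs.
    """
    query_group = np.asarray(query_group)
    gt_group = np.asarray(gt_group)
    matches = []
    for g in np.unique(gt_group):
        q_ids = np.flatnonzero(query_group == g)
        t_ids = np.flatnonzero(gt_group == g)
        if len(q_ids) == 0:
            continue  # no query is responsible for this predicate group
        # Match only within the sub-matrix of this group's queries and GTs.
        rows, cols = linear_sum_assignment(cost[np.ix_(q_ids, t_ids)])
        matches += [(int(q_ids[r]), int(t_ids[c])) for r, c in zip(rows, cols)]
    return sorted(matches)


# Hypothetical example: 4 queries in two groups, 2 GTs (one per predicate group).
cost = np.array([
    [0.5, 0.1],
    [0.4, 0.2],
    [0.1, 0.6],
    [0.9, 0.3],
])
matches = groupwise_match(cost, query_group=[0, 0, 1, 1], gt_group=[0, 1])
print(matches)
```

Without the constraint, the cheapest assignment here would pair GT 0 with query 2 and GT 1 with query 0; with it, each GT is matched to the best query inside its own group.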
Method 2: Quality-Aware Multi-Assignment
Definition of triplet-level prediction quality
The triplet-level quality of a prediction with respect to a GT is defined from the box localization quality (i.e., IoU) on the subject and object and the classification quality (i.e., score) on the predicate.
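The sketch below illustrates one possible quality measure built from these ingredients. The exact formula is not given here, so the plain product of subject IoU, object IoU, and predicate score is an assumption, as are the function names and the dictionary layout of predictions and GTs.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def triplet_quality(pred, gt):
    """Assumed form: subject IoU * object IoU * score of the GT predicate."""
    iou_s = box_iou(pred["subject_box"], gt["subject_box"])
    iou_o = box_iou(pred["object_box"], gt["object_box"])
    score = pred["predicate_scores"][gt["predicate"]]
    return iou_s * iou_o * score


# Hypothetical GT triplet and prediction, for illustration only.
gt = {
    "subject_box": (0, 0, 10, 10),
    "object_box": (20, 20, 30, 30),
    "predicate": "riding",
}
pred = {
    "subject_box": (0, 0, 10, 10),    # perfect subject localization (IoU 1.0)
    "object_box": (20, 20, 30, 25),   # covers half of the GT object (IoU 0.5)
    "predicate_scores": {"riding": 0.8, "on": 0.1},
}
quality = triplet_quality(pred, gt)
print(quality)
```

Under this assumed product form, a prediction scores high only when the subject box, the object box, and the predicate are all accurate at the same time.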
Results on Visual Genome benchmark
Applied to HOTR and ISG, SpeaQ obtains the best results on the VG dataset. SpeaQ is the first method to achieve the best results on both R@k and mR@k, two conflicting metrics that are biased toward frequent and rare predicates, respectively.
Results on HICO-DET benchmark
Similarly, applying SpeaQ to GEN-VLKT yields consistent performance gains. Notably, these gains come with zero additional inference cost, since SpeaQ is applied only during training.
Conclusion