1 of 18

Deformable DETR: Deformable Transformers for End-to-End Object Detection

ICLR 2021

April 4, 2023

Presenter, Jeongwan On

2 of 18

0. Background - Transformer

Performs the task with attention alone, dropping the RNNs used in earlier RNN-based encoder-decoder architectures
The embedded input sequence (query and key elements) is fed into the encoder-decoder to produce the output
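
As a rough illustration of the attention step mentioned above, the sketch below computes scaled dot-product attention for a single query in plain Python. The function names and the toy keys and values are assumptions made for this example; a real Transformer also applies learned projections and uses several heads, which are omitted here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector (no learned projections)."""
    d = len(query)
    # Similarity of the query to every key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # Output is the attention-weighted sum of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

# Toy data (hypothetical): three key/value pairs, one query.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[10.0], [20.0], [30.0]]
output, weights = attention([1.0, 0.0], keys, values)
```

The query matches the first and third keys equally, so those two receive the same weight and the second key receives less.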

3 of 18

0. Background – DETR

A new detection architecture based on bipartite matching

Treats object detection as a direct set-prediction problem; a fully end-to-end model (no RPN, no NMS)

Uses the attention mechanism and outperforms earlier models on large objects

4 of 18

0. Background – DETR(Encoder & Decoder)

Takes the feature map produced by a CNN backbone as input
The encoder uses self-attention to capture relationships between pixels
The decoder uses self- and cross-attention to learn the locations and classes of the actual objects

5 of 18

0. Background – DETR(Set prediction)

Outputs one embedding per object query (one per slot)
Each object query predicts a class (including whether an object is present at all) and a box; together these form the set prediction

6 of 18

0. Background – DETR(Bipartite Matching)
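
A minimal sketch of the matching idea, assuming a simplified cost (negative class probability plus L1 box distance). DETR solves this assignment with the Hungarian algorithm and its cost also includes a GIoU term; the brute-force search over permutations below is only a stand-in that is practical for a handful of boxes. All names and numbers are hypothetical.

```python
from itertools import permutations

def match_cost(pred, target):
    """Simplified cost: negative class probability plus L1 box distance."""
    class_cost = -pred["probs"][target["label"]]
    box_cost = sum(abs(p - t) for p, t in zip(pred["box"], target["box"]))
    return class_cost + box_cost

def best_matching(preds, targets):
    """Return, for each target, the index of the prediction assigned to it.

    Assumes there are at least as many predictions as targets, as in DETR.
    """
    best_total, best_assignment = float("inf"), None
    for assignment in permutations(range(len(preds)), len(targets)):
        total = sum(match_cost(preds[p], targets[t]) for t, p in enumerate(assignment))
        if total < best_total:
            best_total, best_assignment = total, assignment
    return list(best_assignment), best_total

# Three predictions (object queries) and two ground-truth objects; boxes are (cx, cy, w, h).
preds = [
    {"probs": [0.1, 0.8], "box": [0.50, 0.50, 0.20, 0.20]},
    {"probs": [0.7, 0.2], "box": [0.20, 0.30, 0.10, 0.10]},
    {"probs": [0.3, 0.3], "box": [0.80, 0.80, 0.30, 0.30]},
]
targets = [
    {"label": 0, "box": [0.22, 0.31, 0.10, 0.12]},
    {"label": 1, "box": [0.48, 0.52, 0.20, 0.18]},
]
assignment, total = best_matching(preds, targets)
```

Here target 0 is matched to prediction 1 and target 1 to prediction 0; the unmatched prediction 2 would be trained to predict "no object".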

7 of 18

0. Background – DETR(GIOU Loss)

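When two boxes do not overlap, IoU is 0, so the IoU loss says nothing about how far apart they are
GIoU additionally constructs C, the smallest box enclosing both A and B, and uses it when computing the loss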
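
The point above can be checked numerically. The sketch below assumes boxes in (x1, y1, x2, y2) form and uses the standard definition GIoU = IoU - (|C| - |A ∪ B|) / |C|; the function name and the example boxes are chosen for illustration only.

```python
def giou(a, b):
    """Return (IoU, GIoU) for two boxes given as (x1, y1, x2, y2)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    # Intersection: clamp to zero when the boxes do not overlap.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter
    iou = inter / union
    # C: the smallest box enclosing both A and B.
    cw = max(a[2], b[2]) - min(a[0], b[0])
    ch = max(a[3], b[3]) - min(a[1], b[1])
    area_c = cw * ch
    return iou, iou - (area_c - union) / area_c

a = (0.0, 0.0, 1.0, 1.0)
near = (2.0, 0.0, 3.0, 1.0)  # does not overlap a, one unit away
far = (5.0, 0.0, 6.0, 1.0)   # does not overlap a, four units away
iou_near, giou_near = giou(a, near)
iou_far, giou_far = giou(a, far)
```

Both pairs have IoU 0, but GIoU is lower for the farther box, so a loss of the form 1 - GIoU still distinguishes the two cases.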

8 of 18

0. Background - Problem

1. Slow convergence

As the number of keys grows,
attention takes longer to focus on a specific object

2. Memory complexity

Multi-head attention attends over every pixel
Memory and time complexity increase

3. Lower performance on small objects

Because of the high computational cost, only a low-resolution feature map is used
Small objects are detected poorly
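
To make the cost argument concrete, the sketch below counts the attention weights that dense self-attention needs for one head in one layer when every pixel attends to every pixel. The feature-map sizes are hypothetical and chosen only to show the scaling.

```python
def num_attention_weights(height, width):
    """Dense self-attention: every pixel attends to every pixel."""
    tokens = height * width
    return tokens * tokens

# Hypothetical feature-map sizes: doubling each side gives 4x the pixels.
low = num_attention_weights(25, 34)   # 850 tokens
high = num_attention_weights(50, 68)  # 3400 tokens
ratio = high // low
```

Doubling the resolution multiplies the number of attention weights by 16, which is why higher-resolution feature maps quickly become impractical with dense attention.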

9 of 18

1. Architecture

Deformable attention: attention is computed only over the sampling points predicted by a dedicated layer
Multi-scale feature maps: feature maps at several scales are used

10 of 18

1. Architecture – Deformable Attention
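
A minimal single-head, single-scale sketch of the idea: instead of attending to every pixel, the query samples a few points around a reference point and combines them with softmax weights. In the model, the offsets and the attention logits come from linear projections of the query feature and the sampled values pass through a learned projection; here they are passed in directly, and all names and numbers are assumptions for illustration.

```python
import math

def bilinear_sample(feature_map, x, y):
    """Bilinearly interpolate an H x W x C feature map at (x, y); outside is zero."""
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    x0, y0 = math.floor(x), math.floor(y)
    out = [0.0] * channels
    for yi, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < height and 0 <= xi < width:
                for c in range(channels):
                    out[c] += wy * wx * feature_map[yi][xi][c]
    return out

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def deformable_attention(feature_map, reference, offsets, logits):
    """Weighted sum of features sampled at reference + offset for each point."""
    weights = softmax(logits)
    channels = len(feature_map[0][0])
    out = [0.0] * channels
    for (dx, dy), w in zip(offsets, weights):
        sample = bilinear_sample(feature_map, reference[0] + dx, reference[1] + dy)
        for c in range(channels):
            out[c] += w * sample[c]
    return out

# 4x4 map with one channel whose value equals the x coordinate.
feature_map = [[[float(x)] for x in range(4)] for y in range(4)]
offsets = [(0.5, 0.0), (-1.0, 0.0), (1.0, 1.0)]  # hypothetical predicted offsets
out = deformable_attention(feature_map, (1.0, 1.0), offsets, [0.0, 0.0, 0.0])
```

With K sampling points per query, the cost per query depends on K rather than on the number of pixels in the feature map.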

11 of 18

1. Architecture – Multi-scale Deformable Attention

12 of 18

1. Architecture – Encoder

Figure: sampling points and attention weights for Layer 1, Layer 4, and the final layer (point, head 3)

13 of 18

1. Architecture – Decoder

Self attention

Finds the optimal bipartite matching for the object queries

Cross attention

Finds the optimal reference points

14 of 18

2. Variants for Deformable DETR

Iterative Bounding Box Refinement

- Box coordinates are corrected each time the box passes through a decoder layer

- each decoder layer refines the bounding boxes based on the predictions from the previous layer.
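
A small sketch of the refinement step, assuming normalized box coordinates in (0, 1) and the update b_d = sigmoid(delta_d + inverse_sigmoid(b_{d-1})). The per-layer deltas below are hypothetical stand-ins for what each decoder layer's prediction head would output.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def inverse_sigmoid(p, eps=1e-5):
    """Logit of p, clamped away from 0 and 1 for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def refine_box(prev_box, delta):
    """Apply one layer's predicted correction in logit space."""
    return [sigmoid(d + inverse_sigmoid(b)) for b, d in zip(prev_box, delta)]

box = [0.5, 0.5, 0.2, 0.2]  # initial (cx, cy, w, h), normalized
layer_deltas = [             # hypothetical outputs of two decoder layers
    [0.2, -0.1, 0.3, 0.0],
    [0.05, 0.0, -0.1, 0.1],
]
for delta in layer_deltas:
    box = refine_box(box, delta)
```

Working in logit space keeps every refined coordinate inside (0, 1), and a zero delta leaves the previous box unchanged.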

Two-stage Deformable DETR

- The decoder's object queries are not randomly initialized
- An encoder-only Deformable DETR first generates region proposals, which are then used as the decoder's object queries
- we remove the decoder and form an encoder-only Deformable DETR for region proposal generation.

15 of 18

3. Experiments – with DETR

16 of 18

3. Experiments – Ablations

FPNs

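Adding an FPN on top of multi-scale attention brings no significant performance gain
Attention is computed across feature maps of several scales → the model already captures image information at multiple scales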

17 of 18

4. Conclusion

Deformable DETR is an end-to-end object detector, which is efficient and fast-converging.

It enables us to explore more interesting and practical variants of end-to-end object detectors.

At the core of Deformable DETR are the (multi-scale) deformable attention modules, which are an efficient attention mechanism for processing image feature maps.

We hope our work opens up new possibilities in exploring end-to-end object detection.