KAIST
MLV Lab
Captioning for Text-Video Retrieval via
Dual-Group Direct Preference Optimization
Ji Soo Lee¹, Byungoh Ko¹, Jaewon Cho¹
Howoong Lee², Jaewoon Byun², Hyunwoo J. Kim³
¹Korea University, ²Hanwha Vision, ³KAIST
Motivation
Text-Video retrieval
Text Query: First time cat turns head toward camera. Cat reaches for string. Cat moves to left and grabs at string…
Video captions (candidate videos: Video 1, Video 2, Video 3, Video 4, …):
The video features a serene scene of an otter floating on its back in clear, blue water. The otter is surrounded by green plants and leaves that gently float around it as the camera captures different angles of this peaceful …
The video features a gray tabby cat sitting on a red and beige patterned rug in front of white doors, with its attention focused intently to the left. The scene remains consistent as the cat occasionally shifts slightly but …
The video begins with a cat walking around the room, exploring its surroundings. The scene transitions to show more of the wooden floor and bookshelves filled with various books in red …
The video features a small orange kitten playing with a pink toy on a wooden floor. The kitten moves around, occasionally looking up and down as it interacts with the toy. In some frames, another object is visible near the …
❶ Generic captions across similar videos
Baseline 😵💫
Vid 1: The video begins with a cat walking around the room, exploring its surroundings. The scene transitions to …
Vid 2: The kitten moves around, occasionally looking up and down as it interacts with the toy. In some frames …
Ours ☺️
Vid 1: It then sits up, stretches its body, stands on all fours, and moves towards an object resembling a wheel …
Vid 2: … kitten playing with a pink toy on a wooden floor, moving around and interacting playfully. The …
❷ Misalignment between captioning and retrieval objectives
Method
Conventional approach of MLLM-based retrieval model
MLLM-based retrieval model
Single-Group Direct Preference Optimization (SG-DPO)
DPO aims to increase the probability of the winning sample while decreasing the probability of the losing sample
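The objective described above can be sketched with the standard DPO loss for a single (winning, losing) pair. This is a minimal illustration of the generic DPO formulation, not the authors' training code; the function name `dpo_loss`, its argument names, and the default `beta` are assumptions made for the example.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (winning, losing) pair.

    logp_*     : sequence log-probabilities under the policy being trained
    ref_logp_* : the same quantities under the frozen reference model
    beta       : temperature on the implicit reward (value here is illustrative)
    """
    # Implicit reward margin: how much more the policy favors the winner
    # over the loser, measured relative to the reference model.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# No preference learned yet: policy equals reference, loss is log(2).
loss_neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Winner made more likely and loser less likely: the loss drops.
loss_improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

Minimizing this loss pushes the winning sample's log-probability up and the losing sample's down, both measured against the reference model.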
Retrieval score-driven preference dataset
Adopt retrieval scores for preference dataset construction
Local retrieval rank preference
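One possible reading of a local, retrieval-score-driven preference is sketched below: for each video, its candidate captions are compared by retrieval score, and the best and worst become the winning and losing samples. The slides do not state how scores are computed or how pairs are selected, so the function name `local_preference_pairs`, the data layout, the best-versus-worst rule, and the scores themselves are all hypothetical.

```python
def local_preference_pairs(captions_by_video):
    """Build one (video_id, winning_caption, losing_caption) triple per video.

    captions_by_video: {video_id: [(caption, retrieval_score), ...]}
    Assumption: a higher retrieval_score means the caption helps retrieval more.
    """
    pairs = []
    for video_id, scored in captions_by_video.items():
        if len(scored) < 2:
            continue  # a preference needs at least two candidates
        ranked = sorted(scored, key=lambda cs: cs[1], reverse=True)
        best, worst = ranked[0], ranked[-1]
        if best[1] == worst[1]:
            continue  # identical scores carry no preference signal
        pairs.append((video_id, best[0], worst[0]))
    return pairs

# Made-up scores, for illustration only.
candidates = {
    "vid1": [("a cat walks around a room", 0.41),
             ("a cat sits up and moves toward a wheel", 0.78)],
    "vid2": [("a kitten moves around", 0.35),
             ("a kitten plays with a pink toy", 0.52)],
}
pairs = local_preference_pairs(candidates)
```

Each pair here is formed inside a single video's caption group, which is what makes the preference local.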
Dual-Group Direct Preference Optimization (DG-DPO)
Winning sample in SG-DPO → Losing sample in DG-DPO
Global retrieval rank preference
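A toy example can illustrate how a sample that wins within its own group may still lose under a global ranking. The sketch assumes, hypothetically, that the global view pools (video, caption) pairs from different videos and orders them by retrieval score. The pooling rule, function names, captions, and scores are illustrative assumptions, not the authors' exact procedure.

```python
def local_winners(captions_by_video):
    """Best-scoring caption within each video's own group (local view)."""
    return {
        vid: max(scored, key=lambda cs: cs[1])
        for vid, scored in captions_by_video.items()
    }

def global_ranking(captions_by_video):
    """Pool every (video, caption, score) triple and sort by score (global view)."""
    pooled = [
        (vid, cap, score)
        for vid, scored in captions_by_video.items()
        for cap, score in scored
    ]
    return sorted(pooled, key=lambda t: t[2], reverse=True)

# Made-up scores, for illustration only.
candidates = {
    "vid1": [("caption A", 0.30), ("caption B", 0.45)],
    "vid2": [("caption C", 0.70), ("caption D", 0.90)],
}
winners = local_winners(candidates)
ranking = global_ranking(candidates)
# "caption B" wins inside vid1's group, yet globally it sits below
# both of vid2's captions.
position_of_b = [cap for _, cap, _ in ranking].index("caption B")
```

In this toy data, "caption B" would be a winning sample if only vid1's group were considered, but it falls in the lower half once captions from both videos are ranked together.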
Main Results
Performance of state-of-the-art text-video retrieval models
Experiments
Effectiveness of DG-DPO in challenging retrieval cases
Qualitative Result
Conclusion
Github
Paper