1 of 24

KAIST

MLV Lab

Captioning for Text-Video Retrieval via Dual-Group Direct Preference Optimization

Ji Soo Lee¹, Byungoh Ko¹, Jaewon Cho¹

Howoong Lee², Jaewoon Byun², Hyunwoo J. Kim³

¹Korea University ²Hanwha Vision ³KAIST

2 of 24

Motivation

Text-Video retrieval

Text Query: First time cat turns head toward camera. Cat reaches for string. Cat moves to left and grabs at string…

Candidate videos: Video 1, Video 2, Video 3, Video 4, …

3 of 24

Motivation

Text-Video retrieval

Text Query: First time cat turns head toward camera. Cat reaches for string. Cat moves to left and grabs at string…

Video captions (candidates: Video 1, Video 2, Video 3, Video 4, …)

The video features a serene scene of an otter floating on its back in clear, blue water. The otter is surrounded by green plants and leaves that gently float around it as the camera captures different angles of this peaceful …

The video features a gray tabby cat sitting on a red and beige patterned rug in front of white doors, with its attention focused intently to the left. The scene remains consistent as the cat occasionally shifts slightly but …

The video begins with a cat walking around the room, exploring its surroundings. The scene transitions to show more of the wooden floor and bookshelves filled with various books in red …

The video features a small orange kitten playing with a pink toy on a wooden floor. The kitten moves around, occasionally looking up and down as it interacts with the toy. In some frames, another object is visible near the …

4 of 24

Motivation

❶ Generic captions across similar videos

Captions produced by pretrained models fail to capture the fine-grained distinctions necessary for retrieval

Baseline

Vid 1: The video begins with a cat walking around the room, exploring its surroundings. The scene transitions to …

Vid 2: The kitten moves around, occasionally looking up and down as it interacts with the toy. In some frames …

5 of 24

Motivation

We want to generate captions that provide discriminative cues critical for retrieval among similar videos

Baseline 😵‍💫

Vid 1: The video begins with a cat walking around the room, exploring its surroundings. The scene transitions to …

Vid 2: The kitten moves around, occasionally looking up and down as it interacts with the toy. In some frames …

Ours ☺️

Vid 1: It then sits up, stretches its body, stands on all fours, and moves towards an object resembling a wheel …

Vid 2: … kitten playing with a pink toy on a wooden floor, moving around and interacting playfully. The …

6 of 24

Motivation

❷ Misalignment between captioning and retrieval objectives

The top-1 caption selected by a conventional captioning metric (e.g., BLEU) often differs from the top-1 caption under the retrieval relevance score; in the example shown, it is placed at the bottom of the retrieval ranking

7 of 24

Method

Conventional approach of MLLM-based retrieval model

The model does not effectively distinguish the caption’s role as auxiliary context from the text query as the retrieval target, resulting in inefficient use of the additional information.

8 of 24

Method

MLLM-based retrieval model

We introduce role-embeddings during retrieval model training to differentiate the roles of heterogeneous textual inputs
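One possible reading of role-embeddings, as a minimal sketch: each text token gets a role id (query or caption), and a role vector is added to its token embedding. The additive combination, the role ids, and all names below are assumptions for illustration; the slides do not specify how the role-embeddings are injected.

```python
from typing import List, Sequence

# Hypothetical role ids; the names are illustrative, not taken from the slides.
ROLE_QUERY = 0    # text query (the retrieval target)
ROLE_CAPTION = 1  # video caption (auxiliary context)


def add_role_embeddings(
    token_embeddings: Sequence[Sequence[float]],
    role_ids: Sequence[int],
    role_table: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Add the role vector selected by each token's role id to its embedding."""
    if len(token_embeddings) != len(role_ids):
        raise ValueError("one role id is required per token")
    out = []
    for emb, role in zip(token_embeddings, role_ids):
        role_vec = role_table[role]
        if len(role_vec) != len(emb):
            raise ValueError("role and token embeddings must share a dimension")
        out.append([e + r for e, r in zip(emb, role_vec)])
    return out


# Toy example: two query tokens followed by two caption tokens, dimension 3.
tokens = [[0.1, 0.2, 0.3], [0.0, 0.5, 0.1], [0.4, 0.4, 0.4], [0.9, 0.0, 0.2]]
roles = [ROLE_QUERY, ROLE_QUERY, ROLE_CAPTION, ROLE_CAPTION]
role_table = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # would be learned in practice
mixed = add_role_embeddings(tokens, roles, role_table)
```

In this toy setup, query tokens are shifted along the first dimension and caption tokens along the second, so downstream layers can tell the two text sources apart.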

As the retrieval relevance score, we adopt the pairwise score margin between the probabilities of generating 'True' and 'False' (contrastive strategy)
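A minimal sketch of such a margin score, assuming the model exposes one logit for the 'True' token and one for the 'False' token at the answer position. Renormalizing over only these two tokens (rather than the full vocabulary) and taking the margin on probabilities are assumptions made here; the function names are hypothetical.

```python
import math
from typing import Tuple


def true_false_probs(logit_true: float, logit_false: float) -> Tuple[float, float]:
    """Softmax restricted to the 'True' and 'False' logits (numerically stable)."""
    m = max(logit_true, logit_false)
    e_true = math.exp(logit_true - m)
    e_false = math.exp(logit_false - m)
    z = e_true + e_false
    return e_true / z, e_false / z


def relevance_score(logit_true: float, logit_false: float) -> float:
    """Margin P('True') - P('False'); lies in [-1, 1], higher means more relevant."""
    p_true, p_false = true_false_probs(logit_true, logit_false)
    return p_true - p_false


# Rank hypothetical candidate videos for one query by their margin score.
candidate_logits = {"video_1": (0.2, 1.1), "video_3": (2.4, -0.3), "video_4": (0.9, 0.8)}
ranking = sorted(
    candidate_logits,
    key=lambda vid: relevance_score(*candidate_logits[vid]),
    reverse=True,
)
```

With two-token normalization the margin equals tanh of half the logit difference, so ranking by this score is equivalent to ranking by the raw logit gap.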

9 of 24

Method

Single-Group Direct Preference Optimization

SG-DPO: Preference pairs are constructed by comparing outputs (captions) conditioned on a single input V

Leverages the retrieval relevance scores as preference signals

DPO aims to increase the probability of the winning sample while decreasing the probability of the losing sample

DPO:
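For reference, the standard DPO objective can be written for this setting as follows. The notation is assumed here rather than taken from the slide: $V$ is the video, $c_w$ and $c_l$ are the winning and losing captions, $\pi_\theta$ is the captioner being trained, $\pi_{\text{ref}}$ is the frozen reference captioner, $\beta$ is a temperature, and $\sigma$ is the logistic function.

```latex
\mathcal{L}_{\text{DPO}}
= -\,\mathbb{E}_{(V,\, c_w,\, c_l) \sim \mathcal{D}}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(c_w \mid V)}{\pi_{\text{ref}}(c_w \mid V)}
    - \beta \log \frac{\pi_\theta(c_l \mid V)}{\pi_{\text{ref}}(c_l \mid V)}
  \right)
\right]
```

Minimizing this loss raises the policy's log-probability ratio for $c_w$ relative to that for $c_l$, which matches the description of increasing the winning sample's probability while decreasing the losing sample's.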

11 of 24

Method

Retrieval score-driven preference dataset

Adopt retrieval scores for preference dataset construction

SG-DPO: Preference pairs are constructed by comparing outputs conditioned on a single input V

Local retrieval rank preference
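A minimal sketch of building single-group preference pairs from retrieval scores. It assumes several candidate captions have been sampled per video and scored by the retrieval model, and that the best-scored caption is paired against the worst-scored one; that pairing rule, the `min_margin` filter, and all names are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def build_sg_pairs(
    scores_by_video: Dict[str, Dict[str, float]],
    min_margin: float = 0.0,
) -> List[Tuple[str, str, str]]:
    """Return (video_id, winning_caption, losing_caption), at most one per video.

    Captions are ranked only against other captions of the same video
    (local retrieval rank).
    """
    pairs = []
    for video_id, caption_scores in scores_by_video.items():
        if len(caption_scores) < 2:
            continue  # a preference pair needs at least two candidates
        ranked = sorted(caption_scores.items(), key=lambda kv: kv[1], reverse=True)
        (best, best_score), (worst, worst_score) = ranked[0], ranked[-1]
        if best_score - worst_score > min_margin:
            pairs.append((video_id, best, worst))
    return pairs


# Hypothetical retrieval relevance scores for sampled captions.
scores = {
    "vid_1": {"cat sits up and reaches for a wheel": 0.8, "a cat walks around": 0.1},
    "vid_2": {"kitten plays with a pink toy": 0.6, "a kitten moves around": 0.5},
    "vid_3": {"only one caption": 0.4},
}
sg_pairs = build_sg_pairs(scores, min_margin=0.2)
```

Here `vid_2` is dropped because its two captions are scored too closely to give a clear preference, and `vid_3` is dropped because it has a single candidate.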

13 of 24

Method

Dual-Group Direct Preference Optimization

DG-DPO: extends SG-DPO to consider preferences across distinct video-caption pairs

Global retrieval rank preference

A winning sample in SG-DPO can become a losing sample in DG-DPO
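The shift from local to global ranking can be illustrated with a small sketch. It assumes every (video, caption) sample carries a retrieval relevance score and that samples from different videos are paired by their global order; the exhaustive pairing rule and all names are assumptions, not the authors' procedure.

```python
from typing import Dict, List, Tuple

Sample = Tuple[str, str, float]  # (video_id, caption, retrieval relevance score)


def local_winners(scores_by_video: Dict[str, Dict[str, float]]) -> List[Sample]:
    """Best-scored caption of each video (the winner under local ranking)."""
    winners = []
    for video_id, caption_scores in scores_by_video.items():
        caption, score = max(caption_scores.items(), key=lambda kv: kv[1])
        winners.append((video_id, caption, score))
    return winners


def build_dg_pairs(samples: List[Sample], min_margin: float = 0.0) -> List[Tuple[Sample, Sample]]:
    """Pair samples from different videos; the globally higher-ranked one wins."""
    ranked = sorted(samples, key=lambda s: s[2], reverse=True)
    pairs = []
    for i, win in enumerate(ranked):
        for lose in ranked[i + 1:]:
            if win[0] != lose[0] and win[2] - lose[2] > min_margin:
                pairs.append((win, lose))
    return pairs


# Hypothetical scores: "b1" wins inside vid_b but ranks below "a1" globally.
scores = {
    "vid_a": {"a1": 0.9, "a2": 0.3},
    "vid_b": {"b1": 0.5, "b2": 0.1},
}
winners = local_winners(scores)
dg_pairs = build_dg_pairs(winners)
```

In this toy data, `b1` is the winning caption for `vid_b` under local ranking, yet it is the losing sample in the cross-video pair against `a1`.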

14 of 24

Method

Dual-Group Direct Preference Optimization

DG-DPO: extends to consider preferences across distinct video-caption pairs
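One way to express a cross-pair preference is a DPO-style loss in which the winning and losing captions are each conditioned on their own video. This is a sketch under that assumption; the exact DG-DPO objective is not shown in the text here, and the function names, arguments, and `beta` default are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_style_loss(
    policy_logp_win: float,   # log pi_theta(c_w | V_w)
    ref_logp_win: float,      # log pi_ref(c_w | V_w)
    policy_logp_lose: float,  # log pi_theta(c_l | V_l)
    ref_logp_lose: float,     # log pi_ref(c_l | V_l)
    beta: float = 0.1,
) -> float:
    """DPO-style loss for one pair; V_w and V_l may be different videos (assumed)."""
    win_reward = beta * (policy_logp_win - ref_logp_win)
    lose_reward = beta * (policy_logp_lose - ref_logp_lose)
    return -log_sigmoid(win_reward - lose_reward)


# When the policy equals the reference, the loss is log(2) for any pair.
baseline = dpo_style_loss(-12.0, -12.0, -15.0, -15.0)
# Raising the winner's log-probability relative to the reference lowers the loss.
improved = dpo_style_loss(-10.0, -12.0, -15.0, -15.0)
```

Setting both captions to the same video recovers the usual single-input DPO pair, so the same function can serve the within-video case.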

15 of 24

Main Results

Performance of state-of-the-art text-video retrieval models

CaRe-DPO outperforms baseline models across multiple datasets

19 of 24

Experiments

Effectiveness of DG-DPO in challenging retrieval cases

22 of 24

Qualitative Result

With the caption generated by the DG-DPO-trained model (Ours), the retrieval model correctly retrieves the ground-truth video, guided by discriminative details in the caption that closely match the text query.

23 of 24

Conclusion

To the best of our knowledge, we are the first to address the misalignment between conventional captioning metrics and retrieval objectives, and to tackle the challenge of leveraging captions to improve retrieval performance.

We propose CaRe-DPO, a retrieval framework that integrates role-embeddings with retrieval-aligned caption optimization to leverage auxiliary captions in MLLM-based text-video retrieval.

Our DG-DPO supervises caption generation using retrieval relevance scores with both local (within-video) and global (cross-video-caption) ranking, enabling generation of captions that better reflect retrieval importance.
