
Human: Locate the woman in a blue jacket skiing in the image. Output must be bounding box coordinates with the format [xmin, ymin, xmax, ymax].

LLaVA-v1.5-7B: [0.30, 0.21, 0.50, 0.87]

LLaVA-World-7B (ours): [0.34, 0.13, 0.60, 0.85]

Note: LLaVA-World-7B (ours) provides better localization than the base LLaVA-v1.5-7B model.
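The coordinates above are normalized to [0, 1]. Below is a minimal sketch of converting such a reply back to pixel coordinates for visualization; the parse_bbox helper, the regular expression, and the 640x480 image size are illustrative assumptions, not part of either model.

import re

def parse_bbox(reply: str):
    """Extract the first [xmin, ymin, xmax, ymax] list of floats from a model reply."""
    match = re.search(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]", reply)
    if match is None:
        raise ValueError("no bounding box found in reply")
    return tuple(float(g) for g in match.groups())

def to_pixels(bbox, width: int, height: int):
    """Scale normalized [0, 1] coordinates to pixel coordinates for drawing."""
    xmin, ymin, xmax, ymax = bbox
    return (xmin * width, ymin * height, xmax * width, ymax * height)

# Example with the LLaVA-World-7B reply shown above and a hypothetical 640x480 image.
reply = "[0.34, 0.13, 0.60, 0.85]"
print(to_pixels(parse_bbox(reply), 640, 480))  # (217.6, 62.4, 384.0, 408.0)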

Improving Open-Vocabulary Object Detection in a Vision Language Model

Nikko Carlo A. Yabut, Rowel O. Atienza, PhD

Solution

Our model, LLaVA-World, enhances LLaVA-v1.5 by fine-tuning it on a dataset auto-generated with FireLLaVA and YOLO-World, boosting its open-vocabulary detection from 4.8% to 25% AP on the LVIS benchmark, a roughly 420% relative improvement.
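As a rough illustration of the data-generation step, the sketch below converts a single detector output into a LLaVA-style instruction-tuning record. The prompt template, JSON field names, and detection values are assumptions for illustration, not the exact pipeline used to build the LLaVA-World dataset.

import json

def detection_to_record(image_id: str, label: str, box, width: int, height: int) -> dict:
    """Turn one detection into a LLaVA-style instruction-tuning conversation.

    `box` is (xmin, ymin, xmax, ymax) in pixels; coordinates are normalized to
    [0, 1] so the target string matches the format requested at inference time.
    """
    xmin, ymin, xmax, ymax = box
    norm = [round(xmin / width, 2), round(ymin / height, 2),
            round(xmax / width, 2), round(ymax / height, 2)]
    return {
        "image": image_id,
        "conversations": [
            {"from": "human",
             "value": f"Locate the {label} in the image. Output must be bounding box "
                      f"coordinates with the format [xmin, ymin, xmax, ymax]."},
            {"from": "gpt", "value": str(norm)},
        ],
    }

# Hypothetical YOLO-World detection on a 640x480 image.
record = detection_to_record("000123.jpg", "woman in a blue jacket skiing",
                             (218, 62, 384, 408), 640, 480)
print(json.dumps(record, indent=2))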

Problem Statement

Vision Language Models (VLMs) like LLaVA-v1.5 perform poorly on open-vocabulary object detection, with low average precision scores on novel object benchmarks like LVIS.

Features of Solution

  • Fine-Tuning on Auto-Generated Dataset
  • Increases AP for OVD from 4.8% to 25% (LVIS)
  • Achieves 25.0 AP on LVIS, competitive with leading specialist OVD models like OWL-ViT and YOLO-World (see the evaluation sketch below)
  • Enhances OVD while preserving LLaVA-v1.5's general vision-language task abilities.
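For reference, LVIS AP can be computed with the official lvis-api toolkit. The sketch below assumes detections have already been exported to a COCO-style results JSON; the file names are placeholders, not artifacts from this project.

from lvis import LVIS, LVISEval, LVISResults

# Placeholder paths: the official LVIS validation annotations and a
# COCO-style results file exported from the model's predicted boxes.
gt = LVIS("lvis_v1_val.json")
dt = LVISResults(gt, "detections.json")

evaluator = LVISEval(gt, dt, iou_type="bbox")
evaluator.run()
evaluator.print_results()  # reports AP plus rare/common/frequent breakdowns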


Capstone Project: “Improving Open-Vocabulary Object Detection in a Vision-Language Model.”

One of the First Graduates of the Master of Engineering in Artificial Intelligence