Human: Locate the woman in a blue jacket skiing in the image. Output must be bounding box coordinates with the format [xmin, ymin, xmax, ymax].
LLaVA-v1.5-7B: [0.30, 0.21, 0.50, 0.87]
LLaVA-World-7B (ours): [0.34, 0.13, 0.60, 0.85]
Note: LLaVA-World-7B (ours) provides better localization than the base LLaVA-v1.5-7B model.
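A minimal sketch of how such a text reply could be turned into pixel coordinates; the parsing helper, regular expression, and 640x480 image size below are illustrative assumptions, not part of LLaVA-World:

import re

def parse_bbox(reply: str, img_w: int, img_h: int):
    # Extract a normalized [xmin, ymin, xmax, ymax] box from the model's text reply.
    match = re.search(r"\[\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*,\s*([\d.]+)\s*\]", reply)
    if match is None:
        return None
    xmin, ymin, xmax, ymax = (float(v) for v in match.groups())
    # Coordinates are normalized to [0, 1]; scale them to absolute pixels.
    return [xmin * img_w, ymin * img_h, xmax * img_w, ymax * img_h]

print(parse_bbox("[0.34, 0.13, 0.60, 0.85]", 640, 480))
# -> [217.6, 62.4, 384.0, 408.0]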
Improving Open-Vocabulary Object Detection in a Vision Language Model
Nikko Carlo A. Yabut, Rowel O. Atienza, PhD
Solution
Our model, LLaVA-World, enhances LLaVA-v1.5 by fine-tuning it on a dataset auto-generated with FireLLaVA and YOLO-World, boosting its open-vocabulary detection from 4.8% to 25% AP on the LVIS benchmark, a roughly 420% relative improvement.
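As a rough illustration of how such a dataset could be assembled, the sketch below converts detector-style boxes into LLaVA-style grounding conversations; the function name, record schema, and sample values are hypothetical and not taken from the actual FireLLaVA/YOLO-World pipeline:

def to_instruction_pairs(image_id, detections, img_w, img_h):
    # detections: list of dicts like {"label": str, "box": [x1, y1, x2, y2] in pixels}.
    pairs = []
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        # Normalize to [0, 1] to match the output format shown in the example above.
        norm = [round(x1 / img_w, 2), round(y1 / img_h, 2),
                round(x2 / img_w, 2), round(y2 / img_h, 2)]
        pairs.append({
            "image": image_id,
            "conversations": [
                {"from": "human",
                 "value": f"Locate the {det['label']} in the image. "
                          "Output must be bounding box coordinates with the format "
                          "[xmin, ymin, xmax, ymax]."},
                {"from": "gpt", "value": str(norm)},
            ],
        })
    return pairs

# Example usage with one detection on a 640x480 image.
sample = to_instruction_pairs("000123.jpg",
                              [{"label": "woman in a blue jacket", "box": [218, 62, 384, 408]}],
                              640, 480)
print(sample[0]["conversations"][1]["value"])  # -> "[0.34, 0.13, 0.6, 0.85]"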
Problem Statement
Vision Language Models (VLMs) like LLaVA-v1.5 perform poorly on open-vocabulary object detection, with low average precision scores on novel object benchmarks like LVIS.
Features of the Solution
Capstone Project: “Improving Open-Vocabulary Object Detection in a Vision-Language Model.”
One of the First Graduates of the Master of Engineering in Artificial Intelligence