All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages (CVPR 2025 Highlight)
Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, …, Aman Chadha, Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Jorma Laaksonen, Thamar Solorio, Monojit Choudhury, Ivan Laptev, Mubarak Shah, Salman Khan, Fahad Shahbaz Khan
University of Central Florida, Mohamed bin Zayed University of AI, Amazon, Aalto University, Australian National University, Linköping University
https://mbzuai-oryx.github.io/ALM-Bench/
Overall Picture
Motivation
Problem
ALM-Bench Overview
Various Question Types
Benchmark Comparison
Dataset Statistics
Dataset Curation
Dataset Curation – Generic QAs
Outdoor
Food
Meme
Painting
Indoor
Sketch
[1] Liu, Haotian, et al. "Visual instruction tuning." Advances in neural information processing systems 36 (2024).
Dataset Curation
Dataset Curation – Cultural QAs
Language | Cultural Domain | Country |
Afrikaans | Religion | South Africa |
Albanian | Customs | Albania |
Amharic | Festivals | Ethiopia |
Armenian | Heritage | Armenia |
… | … | … |
Dataset Curation – Cultural QAs
PicdeFacer: https://picdefacer.com/en/
Dataset Curation – Cultural QAs
Prompt Includes:
Dataset Curation – Cultural QAs
User Interface
Guidelines:
GPT-4o Translation Issues - Human Verification
Most Errors
Evaluation Setup
Various Question Types:
Evaluation Criterion:
Evaluation Judge
Scoring
Benchmarking LMM on ALM-Bench
Analysis of ALM-Bench Evaluation
Analysis of ALM-Bench – Language Scripts
Error Analysis on Language Scripts
Qualitative Examples of Error Analysis on Scripts
Correct Cases
Incorrect Cases
Qualitative Examples of Error Analysis on Scripts
Incorrect Cases
Analysis of ALM-Bench – Language Families
Analysis of ALM-Bench – Question Types
Analysis of ALM-Bench – Cultural Categories
Analysis of ALM-Bench – Location Aware Prompts
Models | With Country Info. | Without Country Info. |
GPT-4o | 83.57% | 80.96% |
Gemini-1.5-Pro | 81.52% | 76.19% |
GLM-4V-9B | 56.78% | 56.41% |
Qwen2-VL | 53.97% | 52.57% |
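The effect of the country hint can be read off the table as a per-model difference. The sketch below is only an illustration of that arithmetic: the scores are copied from the table, while the dictionary layout and the helper name `country_info_gain` are hypothetical and not part of ALM-Bench.

```python
# Scores copied from the table above, in percent:
# (with country info, without country info).
scores = {
    "GPT-4o": (83.57, 80.96),
    "Gemini-1.5-Pro": (81.52, 76.19),
    "GLM-4V-9B": (56.78, 56.41),
    "Qwen2-VL": (53.97, 52.57),
}


def country_info_gain(with_info: float, without_info: float) -> float:
    """Gain in percentage points from adding the country to the prompt."""
    return round(with_info - without_info, 2)


# Hypothetical helper applied to every row of the table.
gains = {model: country_info_gain(*pair) for model, pair in scores.items()}

# Print models from largest to smallest gain.
for model, gain in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{model:<15} +{gain:.2f} pp")
```

On these numbers the two closed-source models gain the most from the country hint (Gemini-1.5-Pro 5.33 points, GPT-4o 2.61 points), while GLM-4V-9B and Qwen2-VL gain 0.37 and 1.40 points.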
ALM-Bench Insights
Annotators Demographics
Qualitative Examples of ALM-Bench
Word Cloud of Categories
Summary
Thank you for listening. Any questions?
Ashmal Vayani
University of Central Florida
Masters in Computer Vision
linkedin.com/in/ashmal-vayani/
https://ashmalvayani.github.io/