1 of 44

All Languages Matter: Evaluating LMMs on Culturally Diverse 100 Languages (CVPR 2025 Highlight)

Ashmal Vayani, Dinura Dissanayake, Hasindri Watawana, Noor Ahsan, Nevasini Sasikumar, Omkar Thawakar, …, Aman Chadha, Hisham Cholakkal, Rao Muhammad Anwer, Michael Felsberg, Jorma Laaksonen, Thamar Solorio, Monojit Choudhury, Ivan Laptev, Mubarak Shah, Salman Khan, Fahad Shahbaz Khan

University of Central Florida, Mohamed bin Zayed University of AI, Amazon, Aalto University, Australian National University, Linköping University

https://mbzuai-oryx.github.io/ALM-Bench/

2 of 44

  • Motivation
  • Problem Statement
  • Dataset Comparison
  • Dataset Curation
  • Experimental Setup
  • Evaluation and Results
  • Qualitative Figures

Overall Picture

 

3 of 44

  • LMMs lack global inclusivity and are biased toward high-resource languages.
  • Current benchmarks lack evaluation of underrepresented cultures.
  • Lack of diverse question types, cultural domains, and language coverage.
  • Need for natively verified, multilingual benchmarks to ensure cultural inclusivity.

Motivation

 

4 of 44

  • Existing benchmarks focus on high-resource languages, overlooking the cultural richness of low-resource languages.
  • These benchmarks are often unfair, limited to a single question type (MCQs).
  • Existing work ignores cultural inclusivity in LMMs from a multilingual perspective.
  • ALM-Bench fills this gap: a global visual-cultural benchmark for 100 languages.

 

Problem

5 of 44

ALM-Bench Overview

  • Contains 22.7K culturally diverse multilingual LMM questions in 100 languages.
  • Contains 19 generic and culture-specific categories.
  • Four diverse question types (MCQs, Short QAs, Long QAs, True/False).
  • Over 800 hours of native-language expert annotations.
  • Over 60 volunteers from 53 countries.
  • Benchmarked 16 LMMs (open-source and closed-source), identifying performance gaps and areas for improvement.

 

6 of 44

 

7 of 44

Various Question Types

 

8 of 44

Benchmark Comparison

 

9 of 44

Dataset Statistics

  • ALM-Bench Spans:
    • 100 Languages
    • 73 Countries
    • 24 Language Scripts
    • 15 Language Families

 

10 of 44

Dataset Curation

 

11 of 44

Dataset Curation – Generic QAs

 

  • Generic VQA Pairs:
    • Images sourced from the LLaVA-Bench-in-the-Wild dataset [1].
    • 6 categories (Indoor, Outdoor, Food Items, Memes, Painting, Sketch).

(Example images shown for each category: Outdoor, Food, Meme, Painting, Indoor, Sketch.)

[1] Liu, Haotian, et al. "Visual instruction tuning." Advances in neural information processing systems 36 (2024).

12 of 44

Dataset Curation

 

13 of 44

Dataset Curation – Cultural QAs

 

14 of 44

Dataset Curation – Cultural QAs

 

  • Web scraping using the Google search engine.
  • Pass a country-language pair in the search query.
  • E.g., “{language} {cultural domain description} in {country}” (a query-construction sketch follows the table below).
  • Selected only images with public and open licenses.

Language   | Cultural Domain | Country
Afrikaans  | Religion        | South Africa
Albanian   | Customs         | Albania
Amharic    | Festivals       | Ethiopia
Armenian   | Heritage        | Armenia
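
A minimal sketch of the query construction referenced above, assuming simple string templates; the domain descriptions and helper names are illustrative, not the exact wording used for ALM-Bench:

```python
# Minimal sketch: building image-search queries from (language, cultural domain, country)
# triples like those in the table above. The template follows the slide:
# "{language} {cultural domain description} in {country}". The domain descriptions
# below are illustrative placeholders, not the authors' exact phrasing.

triples = [
    ("Afrikaans", "religious practices and places of worship", "South Africa"),
    ("Albanian",  "local customs and traditions",              "Albania"),
    ("Amharic",   "festivals and celebrations",                "Ethiopia"),
    ("Armenian",  "heritage sites and monuments",              "Armenia"),
]

def build_query(language: str, domain_description: str, country: str) -> str:
    """Compose one web-search query for a country-language-domain triple."""
    return f"{language} {domain_description} in {country}"

for lang, domain, country in triples:
    print(build_query(lang, domain, country))
# e.g. "Afrikaans religious practices and places of worship in South Africa"
```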

15 of 44

Dataset Curation – Cultural QAs

 

16 of 44

Dataset Curation – Cultural QAs

 

17 of 44

Dataset Curation – Cultural QAs

 

  • Manual filtering of irrelevant cultural images.
  • Blurring personally identifiable content (e.g., faces) and watermarks using the PicdeFacer tool.
  • Fetch additional metadata (e.g., image title, image captions) for the retrieved images.

PicdeFacer: https://picdefacer.com/en/

18 of 44

Dataset Curation – Cultural QAs

 

19 of 44

Dataset Curation – Cultural QAs

 

Prompt includes (see the assembly sketch after this list):

  • Image
  • Added Caption
  • Language
  • Country
  • Cultural Category
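
A minimal sketch of how such a QA-generation prompt could be assembled from these fields; the template wording and function name are assumptions for illustration, not the authors' exact prompt (the image itself would be attached separately when querying the LMM):

```python
# Minimal sketch of assembling the QA-generation prompt from the fields listed above.
# The wording is illustrative; the authors' exact prompt is not reproduced here.

def build_generation_prompt(caption: str, language: str, country: str, category: str) -> str:
    """Compose the text part of the prompt; the image is attached separately
    when the request is sent to the LMM (e.g., GPT-4o)."""
    return (
        f"You are shown an image from {country} related to the cultural category "
        f"'{category}'. Image caption: {caption}\n"
        f"Generate culturally grounded question-answer pairs about this image in {language}: "
        f"multiple-choice, true/false, short-answer, and long-answer questions."
    )

prompt = build_generation_prompt(
    caption="A traditional festival procession in the city center.",
    language="Amharic",
    country="Ethiopia",
    category="Festivals",
)
print(prompt)
```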

20 of 44

Dataset Curation – Cultural QAs

 

21 of 44

Dataset Curation – Cultural QAs

 

User Interface

Guidelines:

  • Question Relevance: The question must not be answerable without looking at the image (i.e., it requires the VLM).
    • Where is UCF located? ❌
    • Where is the place shown in the image located? ✅

  • Translation Verification: If either translation (question or answer) is incorrect,
    • Identify the translation error.
    • Rewrite it correctly.

22 of 44

GPT-4o Translation Issues - Human Verification

 

(Figure highlights the languages with the most GPT-4o translation errors found during human verification.)

23 of 44

Evaluation Setup

 

Various Question Types:

    • Different prompts for different question types (illustrative templates sketched below).

Evaluation Criteria:

    • MCQs: Accuracy
    • True/False: Accuracy
    • Short QA: Correctness
    • Long QA: Consistency, Fluency, Relevance
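
A minimal sketch of per-question-type prompt templates, assuming one instruction per question type; the templates and names below are illustrative, not the exact prompts used in the paper:

```python
# Sketch of per-question-type evaluation prompts. Each question type gets its own
# instruction so the model's output format matches the corresponding metric.

EVAL_PROMPTS = {
    "mcq":  "Answer the multiple-choice question by replying with the letter of the correct option only.\n{question}\nOptions:\n{options}",
    "tf":   "Answer the following statement with 'True' or 'False' only.\n{question}",
    "svqa": "Answer the following question in one short sentence.\n{question}",
    "lvqa": "Answer the following question in detail, in the same language as the question.\n{question}",
}

def format_eval_prompt(question_type: str, question: str, options: str = "") -> str:
    """Select and fill the template for the given question type."""
    return EVAL_PROMPTS[question_type].format(question=question, options=options)

print(format_eval_prompt("tf", "Injera is a traditional Ethiopian flatbread."))
```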

24 of 44

Evaluation Judge

 

Scoring

    • GPT-4o as a judge.
    • Scores between 0-10.
    • Compares the ground truth with the predicted answer.
    • Results verified by native experts.
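
A minimal sketch of this judge step, assuming a free-text judge reply from which a 0-10 score is parsed; the prompt wording and parsing logic are illustrative assumptions, not the authors' exact setup:

```python
# Sketch of the LLM-as-judge scoring described above: the judge compares the
# ground-truth answer with the model's prediction and returns a 0-10 score.
import re

def build_judge_prompt(question: str, ground_truth: str, prediction: str) -> str:
    return (
        "You are an impartial judge. Compare the predicted answer with the ground-truth "
        "answer for the question below and rate the prediction from 0 to 10, "
        "where 10 means fully correct and consistent.\n"
        f"Question: {question}\nGround truth: {ground_truth}\nPrediction: {prediction}\n"
        "Reply with the score only."
    )

def parse_score(judge_reply: str) -> float:
    """Extract the first number from the judge's reply; clamp to the 0-10 range."""
    match = re.search(r"\d+(?:\.\d+)?", judge_reply)
    score = float(match.group()) if match else 0.0
    return max(0.0, min(10.0, score))

print(parse_score("8"))           # 8.0
print(parse_score("Score: 9.5"))  # 9.5
```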

25 of 44

Benchmarking LMMs on ALM-Bench

 

26 of 44

Benchmarking LMMs on ALM-Bench

 

  • 100 languages on the x-axis.
  • Open-source and closed-source models on the y-axis.
  • Each cell shows the model's performance on the respective language.
  • Higher color intensity represents better results, and vice versa.
  • 16 LMMs (14 open-source, 2 closed-source); a plotting sketch follows this list.
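
A minimal plotting sketch of such a language-vs-model heat map, using random placeholder scores instead of the actual benchmark results:

```python
# Sketch of the language-vs-model heat map described above. Scores here are random
# placeholders; the real figure uses the per-language accuracies of the 16 LMMs.
import numpy as np
import matplotlib.pyplot as plt

languages = [f"lang_{i}" for i in range(100)]   # 100 evaluated languages (x-axis)
models = [f"model_{i}" for i in range(16)]      # 16 LMMs (y-axis)
scores = np.random.uniform(20, 90, size=(len(models), len(languages)))  # placeholder accuracies

fig, ax = plt.subplots(figsize=(20, 4))
im = ax.imshow(scores, aspect="auto", cmap="Blues")   # darker cell = higher score
ax.set_yticks(range(len(models)), labels=models)
ax.set_xticks(range(0, len(languages), 10), labels=languages[::10], rotation=90)
fig.colorbar(im, label="Accuracy (%)")
plt.tight_layout()
plt.savefig("alm_bench_heatmap.png")
```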

27 of 44

Benchmarking LMMs on ALM-Bench

 

28 of 44

Analysis of ALM-Bench Evaluation

  • Overall Results:
    • Closed-source models (GPT-4o and Gemini-1.5 Pro) performed better than the open-source models.
    • The best open-source model was GLM-4V-9B (51.9%).
    • The best closed-source model was GPT-4o (78.8%).
    • Both open- and closed-source models struggled with Amharic, Kinyarwanda, Burmese, and Sanskrit.

 

29 of 44

Analysis of ALM-Bench – Language Scripts

 

  • Both GPT-4o and Qwen2-VL struggle on low-resource languages.

  • E.g., Ge’ez (Amharic), Sinhalese (Sinhala), Oriya (Odia), Myanmar (Burmese).

  • Native speakers carried out an error analysis on these results.

30 of 44

Error Analysis on Language Scripts

 

31 of 44

Qualitative Examples of Error Analysis on Scripts

Correct Cases

Incorrect Cases

 

32 of 44

Qualitative Examples of Error Analysis on Scripts

Incorrect Cases

 

33 of 44

Analysis of ALM-Bench – Language Families

 

  • Both GPT-4o and Qwen2-VL struggle on African language families.

  • E.g., the Atlantic-Congo language family (Igbo, Kinyarwanda, Swahili, Yoruba).

  • Evaluated 15 language families in total.

34 of 44

Analysis of ALM-Bench – Question Types

 

  • Closed-source models perform better on long VQAs (LVQA).

  • Open-source models perform better on short VQAs (SVQA).

  • Both fare better on decision-making questions (MCQs, T/F).

35 of 44

Analysis of ALM-Bench – Cultural Categories

 

  • The closed-source model GPT-4o performs best with 80.3%.

  • GPT-4o achieves 83.7% on Education and Heritage but drops to 72.7% on the Notable Key Figures category.

  • Categories such as Notable Key Figures and Customs are often culturally specific and under-represented in training datasets.

36 of 44

Analysis of ALM-Bench – Location Aware Prompts

 

  • Adding country-related information to the prompts for LMM evaluation (a prompt sketch follows the table below).

  • Closed-source models like GPT-4o and Gemini-1.5-Pro better utilize the added geographical context.

  • Open-source models do not utilize this information as well.

Models         | With Country Info. | Without Country Info.
GPT-4o         | 83.57%             | 80.96%
Gemini-1.5-Pro | 81.52%             | 76.19%
GLM-4V-9B      | 56.78%             | 56.41%
Qwen2-VL       | 53.97%             | 52.57%
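
A minimal sketch of the location-aware prompting compared in the table above, assuming the country is simply prepended as context; the template is illustrative, not the exact prompt used in the ablation:

```python
# Sketch of location-aware prompting: the same question is asked with and without
# an extra sentence of country context. Template wording is illustrative.
from typing import Optional

def build_question_prompt(question: str, country: Optional[str] = None) -> str:
    """Optionally prepend geographical context to an evaluation question."""
    context = f"This image is from {country}. " if country else ""
    return context + question

q = "What traditional dish is shown in the image?"
print(build_question_prompt(q))                      # without country info
print(build_question_prompt(q, country="Ethiopia"))  # with country info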

37 of 44

ALM-Bench Insights

 

  • Near-equal distribution across all cultural categories.

  • Cultural samples were hardest to find for the Economy category.

  • Verification was done by native speakers, removing culturally irrelevant information.

38 of 44

Annotators Demographics

 

  • Over 60 volunteers contributed to the research project.

  • Volunteers represent over 50 countries.

  • About one-fourth of the volunteers are female.

  • Over 46% fall in the 18-25 age range.

  • Over 80% of annotators are either native speakers or bilingual.

  • Over 88.5% of annotators are culturally familiar with their languages.

  • Over 87.9% of annotators have lived in the country their cultural examples come from.

39 of 44

Qualitative Examples of ALM-Bench

 

40 of 44

Qualitative Examples of ALM-Bench

 

41 of 44

Qualitative Examples of ALM-Bench

 

42 of 44

Word Cloud of Categories

 

43 of 44

Summary

 

  • We introduce ALM-Bench, a novel multilingual, multimodal cultural benchmark.

  • We propose over 22.7K human-verified examples across 19 domains.

  • ALM-Bench spans 100 languages across 73 countries, 24 language scripts, and 15 language families.

  • We benchmark 16 LMMs with four different question types (MCQs, T/F, SVQA, LVQA).

  • We highlight several important insights for curating better, culturally diverse pre-training datasets.

44 of 44

Thank you for listening. Any questions?

 

Ashmal Vayani

University of Central Florida

Master's in Computer Vision

linkedin.com/in/ashmal-vayani/

https://ashmalvayani.github.io/