Highlights of Team’s Motivation

Alice
Alice Zanon is interested in the path towards AGI. Her background in visual arts and philosophy of language leads her to question which avenues can be taken to better evaluate AI and make it more robust for complex and abstract reasoning about feelings, with art as a methodological starting point.
Drishti
Drishti Sharma has been investigating discrimination and bias in VLMs. Her main motivation is to understand how these models classify graffiti from regions like South Asia, the Middle East, and Africa in contrast to graffiti from Europe and North America. More pointedly, she wonders: are VLMs systematically biased in how they interpret graffiti from developing versus developed countries?
Abigail
Abigail Oppong aims to explore whether vision-language models (VLMs) genuinely appreciate diverse worldviews and cultural symbols. She is concerned that these models may often reduce culturally rich symbols to literal visuals. For example, the Adinkra symbol Sankofa is often interpreted simply as a bird, yet in Ghanaian culture it embodies the deeper message of learning from the past to guide the future. Abigail is particularly interested in whether VLMs can grasp such symbolic, historical, and cultural dimensions, especially those rooted in non-Western traditions.
Street Art Analysis and Cultural Mapping
| Aspect | RQ1 | RQ2 | RQ3 | RQ5 |
| --- | --- | --- | --- | --- |
| Question | Do vision-language models (VLMs) exhibit regional biases when interpreting graffiti in different cities? | How well do VLMs perform sentiment analysis on graffiti? | Do VLMs provide different responses to graffiti-related prompts depending on the input language? Can VLMs identify cultural artifact designs specific to certain cultures that might not have an English name? | Can VLMs identify artist styles? If we input multiple images from different artists, can we cluster them based on each artist's underlying signature and style? |
| Data selection process | Artworks with distinct cultural elements, making it more plausible for VLMs to identify the country from cultural cues. | Artworks that can evoke a distinct array of feelings in viewers. | Artworks whose artist descriptions contain symbols intrinsic to specific cultures and/or have untranslatable names. | A set of around 20 artworks by different artists with distinct styles. |
| Methods to answer RQ | (1) We approached this research question as a classification problem: using VLMs, we prompted the models to predict which continent or country a given artwork most likely originates from. (2) To evaluate performance, we used standard classification metrics such as accuracy, precision, and recall. | | | |
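The evaluation step in the Methods row can be sketched in a few lines. This is a minimal illustration assuming the VLM predictions have already been collected and mapped to country names; the labels below are placeholders, not results from our runs.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Illustrative ground-truth countries and VLM predictions (not real data).
y_true = ["Brazil", "Mexico", "Iran", "Brazil"]
y_pred = ["Brazil", "Brazil", "Iran", "Argentina"]

print("accuracy :", accuracy_score(y_true, y_pred))
# Macro-averaging treats every country equally, which matters when some
# regions have far fewer images than others.
print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
```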
We developed a protocol to ensure that our data collection followed ethical guidelines. Our dataset came from the following sources: the Graffiti Arts open-source dataset (600), Twitter (210), and official government & city program websites, non-profit & official organization portals, digital archives & aggregators, artist directories & guides, and curated lists, blogs & articles (246).
Overview of Project
Research Question 1: Classifying where the graffiti is from
Research Question 1: Country-Level Analysis
B1
prompt_1 = """Based on this street art, which country is it most likely from? Then classify it with one word: art, vandalism, protest, decoration, advertisement, or other. Finally, if this particular graffiti were banned in *this location*, would urban life improve or be harmed? Briefly justify."""

prompt_4 = """You're analyzing a piece of street art as an Art Critic and Cultural Anthropologist. Name the country of origin.
{
  "Location": "<country>",
  "Location_Reasoning": "...",
  "Location_Features": "...",
  "Label": "<Art or Graffiti or Vandalism or Activism or Advertisement or Other>",
  "Label_Reasoning": "...",
  "Label_Features": "...",
  "CulturalImpact": "<Enhances or Degrades or Neutral>",
  "CulturalImpact_Reasoning": "...",
  "CulturalImpact_Features": "..."
}"""

B2
prompt_1 = """This artwork comes from continent **{given_continent}**. Which country is it most likely from? Then classify it with one word: art, vandalism, protest, decoration, advertisement, or other. Finally, if this particular mural were banned in *this location*, would urban life improve or be harmed? Briefly justify."""

prompt_4 = """You're analyzing a piece of street art from {given_continent} as an Art Critic and Cultural Anthropologist. Name the country of origin.
{
  "Location": "<country>",
  "Location_Reasoning": "...",
  "Location_Features": "...",
  "Label": "<Art or Graffiti or Vandalism or Activism or Advertisement or Other>",
  "Label_Reasoning": "...",
  "Label_Features": "...",
  "CulturalImpact": "<Enhances or Degrades or Neutral>",
  "CulturalImpact_Reasoning": "...",
  "CulturalImpact_Features": "..."
}"""

A2
prompt_1 = """This art is from continent **{given_continent}**. Which country is it most likely from? *(Only choose from the provided {country_list})* Classify it with one word: art, vandalism, protest, decoration, advertisement, or other. Finally, if this particular mural were banned in *this location*, would urban life improve or be harmed? Briefly justify."""

prompt_4 = """You're analyzing a piece of street art from {given_continent} as an Art Critic and Cultural Anthropologist. Country must be one of {country_list}.
{
  "Location": "<country>",
  "Location_Reasoning": "...",
  "Location_Features": "...",
  "Label": "<Art or Graffiti or Vandalism or Activism or Advertisement or Other>",
  "Label_Reasoning": "...",
  "Label_Features": "...",
  "CulturalImpact": "<Enhances or Degrades or Neutral>",
  "CulturalImpact_Reasoning": "...",
  "CulturalImpact_Features": "..."
}"""
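Since prompt_4 asks the model for a JSON object, the responses need to be parsed before scoring. Below is a minimal sketch of one way to do this, assuming the model may wrap the JSON in extra text; the helper name and example string are illustrative, not the project's actual code.

```python
import json
import re

def parse_prompt4_response(raw: str) -> dict:
    """Extract and decode the first JSON object found in a model response."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))

# Toy response, not a real model output.
example = '{"Location": "Brazil", "Label": "Art", "CulturalImpact": "Enhances"}'
print(parse_prompt4_response(example)["Location"])  # -> Brazil
```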
Layer 1 (Cross-Family Analysis): Benchmark models from diverse architectural paradigms [Fig: Performance at Continent Level]
Layer 2 (Intra-Family Evolution): Evaluate whether architectural advancements within a model family drive performance gains [Fig: Performance at Continent Level]
While most VLMs can identify the general region associated with the artworks, their accuracy is still relatively low, with most scores falling below 70%. The prompts also noticeably affect each model's accuracy, suggesting that performance depends not only on the model itself but also on the prompt used.
InternVL3 2B significantly outperforms its predecessors across all prompts, indicating clear gains from architectural evolution, whereas InternVL2 2B unexpectedly underperforms InternVL1 2B in prompts 1 and 4.
Research Question 2: Sentiment analysis of graffiti

Prompt: Which one of these 13 emotions do you think humans feel when they look at this graffiti? Make an inference based on your knowledge of human behavior and art appreciation in critical art theory. Hopeful, angry, defiant, happy, calm, anxious, sad, fearful, amused, disgusted, proud, confused, in awe.
| GPT | GEMINI | CLAUDE | GROK | AYA VISION | IMAGE | COUNTRY | HUMANS | COMMENTS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1st feeling: calm; 2nd feeling: hopeful | 1st feeling: calm; 2nd feeling: happy, hopeful, in awe | 1st feeling: calm; 2nd feeling: happy, hopeful, proud | 1st feeling: happy; 2nd feeling: calm, proud | 1st feeling: in awe | | Iran | 1st feeling: hopeful (33%), calm (25%); 2nd feeling: calm (33%), hopeful and happy (25%) | We would benefit from human reasoning to analyse the CoT behind perceived feelings in the reception of art; tags are useful but also a constraint. |
Form: https://tinyurl.com/Form-3-Street-Art-Analysis
Problem: Some images elicited very heterogeneous answers, so we chose the most homogeneous ones for testing.
Hypotheses: 1) images whose answers were more homogeneous would be more easily identifiable by VLMs;
2) more data, and more diverse data, might eventually make the reported feelings more homogeneous.
Dealing with bias: The forms store data on age, education, ethnicity, mother tongue, country of origin, country of residence, and level of proficiency in English.
| GPT | GEMINI | CLAUDE | GROK | AYA VISION | IMAGE | COUNTRY | HUMANS | COMMENTS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1st feeling: fear; 2nd feeling: anxiety | 1st feeling: fearful; 2nd feeling: anxiety, sad, angry | 1st feeling: anxious; 2nd feeling: fearful, confused, in awe | 1st feeling: anxious; 2nd feeling: defiant, in awe | 1st feeling: defiant | | Israel | 1st feeling: fearful (33.3%), anxious (25%), and sad (17.6%); 2nd feeling: fearful (41.7%) and anxious (33.3%) | Some VLMs answer "in awe" due to the quality of the technique. |
| 1st feeling: in awe; 2nd feeling: proud, hopeful | 1st feeling: defiant; 2nd feeling: proud, in awe | 1st feeling: in awe; 2nd feeling: proud, calm, hopeful | 1st feeling: proud; 2nd feeling: in awe, sad | 1st feeling: in awe; 2nd feeling: calm, intrigued, reflective, angry, sad, fearful | | Malaysia | 1st feeling: hopeful (33.3%) and calm (33.3%), sad and in awe (16.7%); 2nd feeling: hopeful (33.3%), in awe (25%) | Grok and Aya were the only ones that identified "sad" as a possibility. Aya gave answers that weren't in the prompt. |
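One way to score the tables above is to compare each model's first feeling against the human majority vote from the form. The sketch below is illustrative only; the votes and model answers are placeholders, not our collected data.

```python
from collections import Counter

# Hypothetical inputs: human votes from the form and each model's first feeling.
human_votes = ["hopeful", "calm", "calm", "hopeful", "happy", "hopeful"]
model_first_feeling = {"GPT": "calm", "Gemini": "calm", "Claude": "calm",
                       "Grok": "happy", "Aya Vision": "in awe"}

majority = Counter(human_votes).most_common(1)[0][0]
for model, feeling in model_first_feeling.items():
    agreement = "match" if feeling == majority else "no match"
    print(f"{model}: {agreement} (model={feeling!r}, human majority={majority!r})")
```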
Research Question 3: Multilingualism and cultural symbols
Insights from VLMs’ outputs:
Prompt: Examine this graffiti as a system of signs. Identify three distinct cultural symbols or motifs in the image, explain what each signifies in its original context, and argue how their combination creates a new or blended meaning. (Translated to Twi, Portuguese, Spanish, Italian and Hindi by the team).
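A small harness can keep the multilingual runs comparable by sending each translation of the prompt to the same image. In the sketch below, query_vlm is a hypothetical placeholder for the actual model client, and only the English prompt text is filled in.

```python
def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for the actual VLM client; not a real API."""
    raise NotImplementedError("plug in the model call here")

PROMPTS = {
    "English": (
        "Examine this graffiti as a system of signs. Identify three distinct "
        "cultural symbols or motifs in the image, explain what each signifies "
        "in its original context, and argue how their combination creates a "
        "new or blended meaning."
    ),
    # The team's Twi, Portuguese, Spanish, Italian, and Hindi translations
    # would be added here as additional entries.
}

def compare_languages(image_path: str) -> dict:
    """Return {language: model output} for one image across all prompt languages."""
    return {lang: query_vlm(image_path, prompt) for lang, prompt in PROMPTS.items()}
```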
Research Question 5: Clustering Artists

⚙️Approach Pipeline
Total Files: 246 | Artists: 12
KMeans vs GMM [BLIP as VLM] - Using 3D Vector space
KMeans vs GMM [BLIP as VLM] - Using 2D Vector space
There is a minor improvement when moving from BLIP to BLIP-2, and another when switching from K-Means to GMM clustering, but both gains are minimal. The demo shows that the clusters are not well separated, especially considering how distinct the artists' styles are from a human perspective. This suggests that the VLMs are not effectively capturing or encoding those stylistic features.
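For reference, the KMeans-vs-GMM comparison can be reproduced in outline as follows. This is a sketch that assumes image embeddings have already been extracted with a VLM encoder such as BLIP; the random matrix stands in for real features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Stand-in for real BLIP image embeddings (246 files, 12 artists, per the stats above).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(246, 768))
n_artists = 12

# Project to a low-dimensional space (2D/3D, as in the figures) before clustering.
reduced = PCA(n_components=3, random_state=0).fit_transform(embeddings)

kmeans_labels = KMeans(n_clusters=n_artists, n_init=10, random_state=0).fit_predict(reduced)
gmm_labels = GaussianMixture(n_components=n_artists, random_state=0).fit(reduced).predict(reduced)

# Silhouette scores near 0 indicate poorly separated clusters, matching the demo.
print("KMeans silhouette:", silhouette_score(reduced, kmeans_labels))
print("GMM silhouette   :", silhouette_score(reduced, gmm_labels))
```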
Future Directions

We initially had 5 research questions, but due to time constraints we could not investigate research question 4 (RQ4: How do VLMs use Chain-of-Thought (CoT) reasoning when analyzing graffiti, particularly in understanding its placement and sentiment?).
RQ1: We verified whether smaller VLMs can identify where images are from, and we carried out a quantitative analysis. But why do they misjudge where images are from? Are there patterns to these errors? For instance, an image from Mexico may be mislabeled as Brazil, the United States, Argentina, Peru, or Colombia, but it does not get a mislabel such as Japan or Iran. Can we map these insights back to the images themselves in a stereoscopic manner?
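A confusion matrix over countries is one way to make these mislabel patterns visible; the sketch below uses illustrative data only.

```python
from sklearn.metrics import confusion_matrix

# Illustrative data only: rows are true countries, columns are predicted ones.
countries = ["Mexico", "Brazil", "United States", "Japan"]
y_true = ["Mexico", "Mexico", "Mexico", "Brazil", "Japan"]
y_pred = ["Brazil", "Mexico", "United States", "Brazil", "Japan"]

print(confusion_matrix(y_true, y_pred, labels=countries))
```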
RQ2: Offering tags for sentiment allows for an easier quantitative approach, but feelings are not as controllable as this method of collecting answers makes them seem. A qualitative analysis and a CoT analysis still need to be performed. How can we get more diverse responses, and in sufficient number, so that we can truly establish a ground truth?
RQ3: We need an extensive corpus of answers from VLMs, and these can vary almost infinitely with prompt design. How can we propose a methodology that allows for a comprehensive analysis of multilingual input and output with regard to the cultural aspects under investigation in graffiti? Is it feasible and interesting to automate the analysis of outputs so they can be compared in bulk?
RQ5: Our current results indicate that, although a few clusters are very distinct, the majority remain difficult to interpret, even among high-purity clusters. To investigate further, we plan to evaluate a broader spectrum of vision-language models across multiple sizes and architectural paradigms to determine whether they can capture stylistic nuances more reliably and improve representational richness.
Publish a paper!