Highlights of Team’s Motivation

Alice
Alice Zanon is interested in the path towards AGI. Her background in visual arts and philosophy of language leads her to question which avenues can be taken to better evaluate AI and make it more robust for complex and abstract reasoning about feelings, with art as a methodological starting point.
Drishti
Drishti Sharma has been investigating discrimination and bias in VLMs. Her main motivation is to understand how these models classify graffiti from regions like South Asia, the Middle East, and Africa in contrast to graffiti from Europe and North America. More pointedly, she wonders: are VLMs systematically biased in how they interpret graffiti from developing versus developed countries?
Abigail
Abigail Oppong aims to explore whether vision-language models (VLMs) genuinely appreciate diverse worldviews and cultural symbols. She is concerned that these models may often reduce culturally rich symbols to literal visuals. For example, the Adinkra symbol Sankofa is often interpreted simply as a bird, yet in Ghanaian culture it embodies the deeper message of learning from the past to guide the future. Abigail is particularly interested in whether VLMs can grasp such symbolic, historical, and cultural dimensions, especially those rooted in non-Western traditions.
Street Art Analysis and Cultural Mapping
| Aspect | RQ1 | RQ2 | RQ3 | RQ5 |
| --- | --- | --- | --- | --- |
| Question | Do vision-language models (VLMs) exhibit regional biases when interpreting graffiti in different cities? | How well do VLMs perform sentiment analysis on graffiti? | Do VLMs provide different responses to graffiti-related prompts depending on the input language? Can VLMs identify cultural artifact designs specific to certain cultures that might not have an English name? | Can VLMs identify artist styles? If we input multiple images from different artists, can we cluster them based on each artist's underlying signature and style? |
| Data selection process | Artworks with distinct cultural elements, making it more plausible for VLMs to identify the country from cultural cues. | Artworks that can evoke a distinct array of feelings in viewers. | Artworks whose artist descriptions contain symbols intrinsic to specific cultures and/or have untranslatable names. | A set of around 20 artworks by different artists with distinct styles. |
| Methods to answer RQ | (1) We approached this research question as a classification problem: using VLMs, we prompted the models to predict which continent or country a given artwork most likely originates from. (2) To evaluate performance, we used standard classification metrics such as accuracy, precision, and recall. | | | |
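The evaluation step in the Methods row can be sketched in a few lines. This is a minimal illustration assuming the VLM predictions have already been collected and mapped to country names; the labels below are placeholders, not results from our runs.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Illustrative ground-truth countries and VLM predictions (not real data).
y_true = ["Brazil", "Mexico", "Iran", "Brazil"]
y_pred = ["Brazil", "Brazil", "Iran", "Argentina"]

print("accuracy :", accuracy_score(y_true, y_pred))
# Macro-averaging treats every country equally, which matters when some
# regions have far fewer images than others.
print("precision:", precision_score(y_true, y_pred, average="macro", zero_division=0))
print("recall   :", recall_score(y_true, y_pred, average="macro", zero_division=0))
```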
We developed a protocol to ensure that our data collection followed ethical guidelines. Our dataset came from the following sources: the Graffiti Arts open-source dataset (600), Twitter (210), and official government & city program websites, non-profit & official organization portals, digital archives & aggregators, artist directories & guides, and curated lists, blogs & articles (246).
Overview of Project
Research Question 1: Classifying where the graffiti is from
Research Question 1: Country-Level Analysis
B1
prompt_1 = """Based on this street art, which country is it most likely from? Then classify it with one word: art, vandalism, protest, decoration, advertisement, or other. Finally, if this particular graffiti were banned in *this location*, would urban life improve or be harmed? Briefly justify."""

prompt_4 = """You're analyzing a piece of street art as an Art Critic and Cultural Anthropologist. Name the country of origin.
{
  "Location": "<country>",
  "Location_Reasoning": "...",
  "Location_Features": "...",
  "Label": "<Art or Graffiti or Vandalism or Activism or Advertisement or Other>",
  "Label_Reasoning": "...",
  "Label_Features": "...",
  "CulturalImpact": "<Enhances or Degrades or Neutral>",
  "CulturalImpact_Reasoning": "...",
  "CulturalImpact_Features": "..."
}"""

B2
prompt_1 = """This artwork comes from continent **{given_continent}**. Which country is it most likely from? Then classify it with one word: art, vandalism, protest, decoration, advertisement, or other. Finally, if this particular mural were banned in *this location*, would urban life improve or be harmed? Briefly justify."""

prompt_4 = """You're analyzing a piece of street art from {given_continent} as an Art Critic and Cultural Anthropologist. Name the country of origin.
{
  "Location": "<country>",
  "Location_Reasoning": "...",
  "Location_Features": "...",
  "Label": "<Art or Graffiti or Vandalism or Activism or Advertisement or Other>",
  "Label_Reasoning": "...",
  "Label_Features": "...",
  "CulturalImpact": "<Enhances or Degrades or Neutral>",
  "CulturalImpact_Reasoning": "...",
  "CulturalImpact_Features": "..."
}"""

A2
prompt_1 = """This art is from continent **{given_continent}**. Which country is it most likely from? *(Only choose from the provided {country_list})* Classify it with one word: art, vandalism, protest, decoration, advertisement, or other. Finally, if this particular mural were banned in *this location*, would urban life improve or be harmed? Briefly justify."""

prompt_4 = """You're analyzing a piece of street art from {given_continent} as an Art Critic and Cultural Anthropologist. Country must be one of {country_list}.
{
  "Location": "<country>",
  "Location_Reasoning": "...",
  "Location_Features": "...",
  "Label": "<Art or Graffiti or Vandalism or Activism or Advertisement or Other>",
  "Label_Reasoning": "...",
  "Label_Features": "...",
  "CulturalImpact": "<Enhances or Degrades or Neutral>",
  "CulturalImpact_Reasoning": "...",
  "CulturalImpact_Features": "..."
}"""
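Since prompt_4 asks the model for a JSON object, the responses need to be parsed before scoring. Below is a minimal sketch of one way to do this, assuming the model may wrap the JSON in extra text; the helper name and example string are illustrative, not the project's actual code.

```python
import json
import re

def parse_prompt4_response(raw: str) -> dict:
    """Extract and decode the first JSON object found in a model response."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in response")
    return json.loads(match.group(0))

# Toy response, not a real model output.
example = '{"Location": "Brazil", "Label": "Art", "CulturalImpact": "Enhances"}'
print(parse_prompt4_response(example)["Location"])  # -> Brazil
```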
Layer 1 (Cross-Family Analysis): Benchmark models from diverse architectural paradigms [Fig: Performance at Continent Level]
Layer 2 (Intra-Family Evolution): Evaluate whether architectural advancements within a model family drive performance gains [Fig: Performance at Continent Level]
While most VLMs can identify the general region associated with the artworks, their accuracy is still relatively low, with most scores falling below 70%. The prompts also noticeably affect each model's accuracy, suggesting that performance depends not only on the model itself but also on the prompt used.
InternVL3 2B significantly outperforms its predecessors across all prompts, indicating clear gains from architectural evolution, whereas InternVL2 2B unexpectedly underperforms InternVL1 2B in prompts 1 and 4.
Research Question 2: Sentiment analysis of graffiti

Prompt: Which one of these 13 emotions do you think humans feel when they look at this graffiti? Make an inference based on your knowledge of human behavior and art appreciation in critical art theory. Hopeful, angry, defiant, happy, calm, anxious, sad, fearful, amused, disgusted, proud, confused, in awe.
| GPT | GEMINI | CLAUDE | GROK | AYA VISION | IMAGE | COUNTRY | HUMANS | COMMENTS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1st feeling: calm; 2nd feeling: hopeful | 1st feeling: calm; 2nd feeling: happy, hopeful, in awe | 1st feeling: calm; 2nd feeling: happy, hopeful, proud | 1st feeling: happy; 2nd feeling: calm, proud | 1st feeling: in awe | | Iran | 1st feeling: hopeful (33%), calm (25%); 2nd feeling: calm (33%), hopeful and happy (25%) | We would benefit from human reasoning to analyse the CoT behind perceived feelings in the reception of art; tags are useful but also a constraint. |
Form: https://tinyurl.com/Form-3-Street-Art-Analysis
Problem: Some images elicited very heterogeneous answers, so we chose the most homogeneous ones for testing.
Hypotheses: 1) images whose answers were more homogeneous would be more easily identifiable by VLMs;
2) more data, and more diverse data, might eventually make the reported feelings more homogeneous.
Dealing with bias: The forms store data on age, education, ethnicity, mother tongue, country of origin, country of residence, and level of proficiency in English.
| GPT | GEMINI | CLAUDE | GROK | AYA VISION | IMAGE | COUNTRY | HUMANS | COMMENTS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1st feeling: fear; 2nd feeling: anxiety | 1st feeling: fearful; 2nd feeling: anxiety, sad, angry | 1st feeling: anxious; 2nd feeling: fearful, confused, in awe | 1st feeling: anxious; 2nd feeling: defiant, in awe | 1st feeling: defiant | | Israel | 1st feeling: fearful (33.3%), anxious (25%), and sad (17.6%); 2nd feeling: fearful (41.7%) and anxious (33.3%) | Some VLMs answer "in awe" due to the quality of the technique. |
| 1st feeling: in awe; 2nd feeling: proud, hopeful | 1st feeling: defiant; 2nd feeling: proud, in awe | 1st feeling: in awe; 2nd feeling: proud, calm, hopeful | 1st feeling: proud; 2nd feeling: in awe, sad | 1st feeling: in awe; 2nd feeling: calm, intrigued, reflective, angry, sad, fearful | | Malaysia | 1st feeling: hopeful (33.3%) and calm (33.3%), sad and in awe (16.7%); 2nd feeling: hopeful (33.3%), in awe (25%) | Grok and Aya were the only ones that identified "sad" as a possibility. Aya gave answers that weren't in the prompt. |
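One way to score the tables above is to compare each model's first feeling against the human majority vote from the form. The sketch below is illustrative only; the votes and model answers are placeholders, not our collected data.

```python
from collections import Counter

# Hypothetical inputs: human votes from the form and each model's first feeling.
human_votes = ["hopeful", "calm", "calm", "hopeful", "happy", "hopeful"]
model_first_feeling = {"GPT": "calm", "Gemini": "calm", "Claude": "calm",
                       "Grok": "happy", "Aya Vision": "in awe"}

majority = Counter(human_votes).most_common(1)[0][0]
for model, feeling in model_first_feeling.items():
    agreement = "match" if feeling == majority else "no match"
    print(f"{model}: {agreement} (model={feeling!r}, human majority={majority!r})")
```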
Research Question 3: Multilingualism and cultural symbols
Insights from VLMs’ outputs:
Prompt: Examine this graffiti as a system of signs. Identify three distinct cultural symbols or motifs in the image, explain what each signifies in its original context, and argue how their combination creates a new or blended meaning. (Translated to Twi, Portuguese, Spanish, Italian and Hindi by the team).
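A small harness can keep the multilingual runs comparable by sending each translation of the prompt to the same image. In the sketch below, query_vlm is a hypothetical placeholder for the actual model client, and only the English prompt text is filled in.

```python
def query_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for the actual VLM client; not a real API."""
    raise NotImplementedError("plug in the model call here")

PROMPTS = {
    "English": (
        "Examine this graffiti as a system of signs. Identify three distinct "
        "cultural symbols or motifs in the image, explain what each signifies "
        "in its original context, and argue how their combination creates a "
        "new or blended meaning."
    ),
    # The team's Twi, Portuguese, Spanish, Italian, and Hindi translations
    # would be added here as additional entries.
}

def compare_languages(image_path: str) -> dict:
    """Return {language: model output} for one image across all prompt languages."""
    return {lang: query_vlm(image_path, prompt) for lang, prompt in PROMPTS.items()}
```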
Research Question 5: Clustering Artists

⚙️Approach Pipeline
Total Files: 246 | Artists: 12
KMeans vs GMM [BLIP as VLM] - Using 3D Vector space
KMeans vs GMM [BLIP as VLM] - Using 2D Vector space
There is a minor improvement when moving from BLIP to BLIP-2, and another when switching from K-Means to GMM clustering, but both gains are minimal. The demo shows that the clusters are not well separated, especially considering how distinct the artists' styles are from a human perspective. This suggests that the VLMs are not effectively capturing or encoding those stylistic features.
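For reference, the KMeans-vs-GMM comparison can be reproduced in outline as follows. This is a sketch that assumes image embeddings have already been extracted with a VLM encoder such as BLIP; the random matrix stands in for real features.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Stand-in for real BLIP image embeddings (246 files, 12 artists, per the stats above).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(246, 768))
n_artists = 12

# Project to a low-dimensional space (2D/3D, as in the figures) before clustering.
reduced = PCA(n_components=3, random_state=0).fit_transform(embeddings)

kmeans_labels = KMeans(n_clusters=n_artists, n_init=10, random_state=0).fit_predict(reduced)
gmm_labels = GaussianMixture(n_components=n_artists, random_state=0).fit(reduced).predict(reduced)

# Silhouette scores near 0 indicate poorly separated clusters, matching the demo.
print("KMeans silhouette:", silhouette_score(reduced, kmeans_labels))
print("GMM silhouette   :", silhouette_score(reduced, gmm_labels))
```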
Future Directions

We initially had 5 research questions, but due to time constraints we could not investigate research question 4 (RQ4: How do VLMs use Chain-of-Thought (CoT) reasoning when analyzing graffiti, particularly in understanding its placement and sentiment?).
RQ1: We verified whether smaller VLMs can identify where images are from, and we carried out a quantitative analysis. But why do they misjudge where images are from? Are there patterns to these errors? For instance, an image from Mexico may be mislabeled as Brazil, the United States, Argentina, Peru, or Colombia, but it does not get a mislabel such as Japan or Iran. Can we map these insights back to the images themselves in a stereoscopic manner?
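A confusion matrix over countries is one way to make these mislabel patterns visible; the sketch below uses illustrative data only.

```python
from sklearn.metrics import confusion_matrix

# Illustrative data only: rows are true countries, columns are predicted ones.
countries = ["Mexico", "Brazil", "United States", "Japan"]
y_true = ["Mexico", "Mexico", "Mexico", "Brazil", "Japan"]
y_pred = ["Brazil", "Mexico", "United States", "Brazil", "Japan"]

print(confusion_matrix(y_true, y_pred, labels=countries))
```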
RQ2: Offering tags for sentiment allows for an easier quantitative approach, but feelings are not as controllable as this method of collecting answers makes them seem. A qualitative analysis and a CoT analysis still need to be performed. How can we get more diverse responses, and in sufficient number, so that we can truly establish a ground truth?
RQ3: We need an extensive corpus of answers from VLMs, and these can vary almost infinitely with prompt design. How can we propose a methodology that allows for a comprehensive analysis of multilingual input and output with regard to the cultural aspects under investigation in graffiti? Is it feasible and interesting to automate the analysis of outputs so they can be compared in bulk?
RQ5: Our current results indicate that, although a few clusters are very distinct, the majority remain difficult to interpret, even among high-purity clusters. To investigate further, we plan to evaluate a broader spectrum of vision-language models across multiple sizes and architectural paradigms to determine whether they can capture stylistic nuances more reliably and improve representational richness.
Publish a paper!