GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning
Shikhhar Siingh*1, Abhinav Rawat*1, Chitta Baral1, Vivek Gupta1
1 Arizona State University *equal contribution
Motivation
Publicly significant images are rich in context
Recovering this context requires reasoning beyond the surface-level visual content of the image
Existing frameworks fail to extract this contextual information
What is this image about?
Who are the people in this image?
President Rene Preval, Michelle Obama, Jill Biden
Where could this image be from?
Port-au-Prince, Haiti
What is this image about?
Visit of Michelle Obama and Jill Biden to Haiti after the 2010 Haiti earthquake
Annotated context: Event, Location, Time
Problem Statement
Given an image, extract:
Location (Geospatial)
Time (Temporal)
Event (sociopolitical significance)
→ Move beyond object recognition to real-world reasoning
GETReason
Geospatial Event and Temporal Reasoning
Hierarchical Multi-Agent Framework
Structured Outputs
Prompt Engineering
Architecture
Scene Graph Generation
Prompt Generation
Multi Agentic Extraction
Scene Graph Generation
Abstract Agent
Scene Graph Agent
Prompt Generation
Prompt Generator
Event Prompt
Temporal Prompt
Geospatial Prompt
Multi Agentic Extraction
Event Agent
Temporal Agent
Geospatial Agent
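The three stages above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the `Model` callable, the function names, and the prompt wording are all hypothetical, and the cross-agent iteration mentioned in the ablation is left out.

```python
from typing import Callable, Dict

# Hypothetical model interface: takes a text prompt, returns text.
# A real system would wrap a vision-language model that also sees the image.
Model = Callable[[str], str]


def generate_scene_description(model: Model, image_ref: str) -> Dict[str, str]:
    """Stage 1: an abstract agent and a scene graph agent describe the image."""
    return {
        "abstract": model(f"Summarize the image {image_ref}."),
        "scene_graph": model(f"List objects and relations in the image {image_ref}."),
    }


def generate_prompts(scene: Dict[str, str]) -> Dict[str, str]:
    """Stage 2: build one specialised prompt per target field."""
    context = f"Abstract: {scene['abstract']}\nScene graph: {scene['scene_graph']}"
    return {
        "event": f"{context}\nWhat event does this image depict?",
        "temporal": f"{context}\nWhen was this image taken?",
        "geospatial": f"{context}\nWhere was this image taken?",
    }


def extract(model: Model, prompts: Dict[str, str]) -> Dict[str, str]:
    """Stage 3: one agent per field answers its own prompt."""
    return {field: model(prompt) for field, prompt in prompts.items()}


def run_pipeline(model: Model, image_ref: str) -> Dict[str, str]:
    """Chain the three stages for a single image."""
    scene = generate_scene_description(model, image_ref)
    prompts = generate_prompts(scene)
    return extract(model, prompts)


if __name__ == "__main__":
    # Stub model so the sketch runs without any external service.
    def echo_model(prompt: str) -> str:
        return f"<answer to: {prompt.splitlines()[-1]}>"

    print(run_pipeline(echo_model, "example.jpg"))
```

Keeping each stage a plain function makes it easy to swap the stub for a real model call or to rerun a single agent.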
Dataset
TARA: 11,241 images
WikiTiLo: 6,296 images
→ JSON-based structure: location, time, event & reasoning
[Figure: TARA and WikiTiLo pass through Restructuring & Augmentation (Event, Temporal, Geospatial) to produce TARA* and WikiTiLo*]
{
"id": "",
"temporal_information": {
"century": "",
"decade": "",
"year": "",
"month": "",
"day": ""
},
"geospatial_information": {
"country": "",
"state_or_province": "",
"city": ""
}
}
{
"id": "",
"event": {
"value": "",
"reasoning": ""
},
"background": {
"value": "",
"reasoning": ""
},
"geospatial_information": {
"city": "",
"country": "",
"state_or_province": ""
},
"temporal_information": {
"century": "",
"day": "",
"decade": "",
"month": "",
"year": ""
}
}
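A structured record like the one above can be checked mechanically before it is scored. The following is a minimal sketch, assuming a model reply arrives as a JSON string; `SCHEMA`, `missing_fields`, and `parse_response` are hypothetical names introduced here, not part of the presented framework.

```python
import json
from typing import Any, Dict, List

# Expected keys, copied from the annotated record shown above.
SCHEMA: Dict[str, List[str]] = {
    "event": ["value", "reasoning"],
    "background": ["value", "reasoning"],
    "geospatial_information": ["city", "country", "state_or_province"],
    "temporal_information": ["century", "day", "decade", "month", "year"],
}


def missing_fields(record: Dict[str, Any]) -> List[str]:
    """Return dotted paths of fields the record lacks (empty list means complete)."""
    missing = [] if "id" in record else ["id"]
    for section, keys in SCHEMA.items():
        block = record.get(section)
        if not isinstance(block, dict):
            missing.append(section)
            continue
        missing.extend(f"{section}.{key}" for key in keys if key not in block)
    return missing


def parse_response(raw: str) -> Dict[str, Any]:
    """Parse a JSON reply and raise ValueError if the structure is incomplete."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("top-level JSON value must be an object")
    problems = missing_fields(record)
    if problems:
        raise ValueError(f"missing fields: {problems}")
    return record
```

Rejecting incomplete replies early keeps malformed outputs from being silently scored as wrong answers.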
Event Augmentation
Spatio-Temporal Augmentation
Deduction Augmentation
Evaluation
GREAT (Geospatial Reasoning Event Accuracy with Temporal alignment)
Event: Semantic cosine similarity
Geospatial: Haversine distance + hierarchy
Temporal: Weighted unit-wise scoring
Event Evaluation
Cosine Similarity of Sentence Embeddings (Event + Background)
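As a minimal sketch of the event score, cosine similarity between two embedding vectors can be computed as below. The embeddings are assumed to come from some sentence encoder that is not specified here, and returning 0.0 for a zero vector is a convention chosen for this example.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)
```

In use, the predicted and reference Event + Background texts would each be embedded first and the two vectors passed to `cosine_similarity`.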
Geospatial Evaluation
Haversine Distance-Based Similarity
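The Haversine distance itself is standard and can be sketched as follows. The mapping from distance to a similarity in [0, 1] (`distance_similarity`, a linear falloff up to half the Earth's circumference) is an assumption made for illustration, not the metric's actual definition, and the hierarchy component is omitted.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_similarity(distance_km: float,
                        max_km: float = math.pi * EARTH_RADIUS_KM) -> float:
    """Hypothetical mapping: 1.0 at zero distance, falling linearly to 0.0 at max_km."""
    return max(0.0, 1.0 - distance_km / max_km)
```

A prediction at the exact reference coordinates scores 1.0 under this mapping, and one on the opposite side of the globe scores 0.0.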
Temporal Evaluation
Granularity-Weighted Temporal Accuracy
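One way to read weighted unit-wise scoring is sketched below: each temporal unit earns its weight on an exact match, and the total is normalised over the units that have ground truth. The weights in `WEIGHTS` and the exact-match rule are assumptions for illustration; the metric's real weights are not given in these slides.

```python
from typing import Dict

# Hypothetical weights: coarser units count more.
WEIGHTS: Dict[str, float] = {
    "century": 0.30,
    "decade": 0.25,
    "year": 0.20,
    "month": 0.15,
    "day": 0.10,
}


def temporal_score(pred: Dict[str, str], gold: Dict[str, str],
                   weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted fraction of ground-truth temporal units the prediction matches."""
    total = 0.0
    earned = 0.0
    for unit, weight in weights.items():
        truth = gold.get(unit, "")
        if truth == "":
            continue  # skip units with no ground truth
        total += weight
        if pred.get(unit, "") == truth:
            earned += weight
    return earned / total if total else 0.0
```

Skipping units with empty ground truth means an image annotated only to the year is not penalised for a missing month or day.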
Results Summary
GETReason achieves the highest scores on event, geospatial, and temporal inference
Superior performance on TARA and WikiTiLo datasets
Ablation confirms value of cross-agent iteration and structured outputs
Results for the best-performing model, Gemini 1.5 Pro (of the three models tried: Gemini 1.5 Pro, GPT-4o-mini, Qwen 2.5-VL), on the two datasets:
TARA*
WikiTiLo*
Ablation: impact of cross-extraction, images in the prompt, and the multi-extraction layer
Error analysis (relative error): net improvement of GETReason over the baselines on each task
Conclusion & Takeaways
Hierarchical multi-agent design improves contextual reasoning
Structured responses help constrain the output of an LLM
GREAT metric evaluates reasoning, not just overlap