1 of 28

GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning

Shikhhar Siingh*1, Abhinav Rawat*1, Chitta Baral1, Vivek Gupta1

1 Arizona State University *equal contribution

2 of 28

Motivation

Publicly significant images are rich in context

Need to go beyond surface-level visual content and reason about the context behind the image

Existing frameworks fail to extract this contextual information

3 of 28

Motivation

What is this image about?

4 of 28

Motivation

Who are the people in this image?

5 of 28

Motivation

Who are the people in this image?

President Rene Preval

Michelle Obama

Jill Biden

6 of 28

Motivation

Where could this image be from?

President Rene Preval

Michelle Obama

Jill Biden

7 of 28

Motivation

Where could this image be from?

Port-au-Prince, Haiti

President Rene Preval

Michelle Obama

Jill Biden

8 of 28

Motivation

What is this image about?

President Rene Preval

Michelle Obama

Jill Biden

Port-au-Prince, Haiti

9 of 28

Motivation

What is this image about?

Visit of Michelle Obama and Jill Biden to Haiti after the 2010 Haiti earthquake

President Rene Preval

Michelle Obama

Jill Biden

10 of 28

Motivation

What is this image about?

Visit of Michelle Obama and Jill Biden to Haiti after the 2010 Haiti earthquake

President Rene Preval

Michelle Obama

Jill Biden

Event

11 of 28

Motivation

What is this image about?

Visit of Michelle Obama and Jill Biden to Haiti after the 2010 Haiti earthquake

President Rene Preval

Michelle Obama

Jill Biden

Event

Location

12 of 28

Motivation

What is this image about?

Visit of Michelle Obama and Jill Biden to Haiti after the 2010 Haiti earthquake

President Rene Preval

Michelle Obama

Jill Biden

Event

Location

Time

13 of 28

Problem Statement

Given an image, extract:

Location (Geospatial)

Time (Temporal)

Event (Sociopolitical significance)

→ Move beyond object recognition to real-world reasoning

14 of 28

GETReason

Geospatial Event and Temporal Reasoning

Hierarchical Multi-Agent Framework

Structured Outputs

Prompt Engineering

15 of 28

GETReason

Geospatial Event and Temporal Reasoning

Architecture

Scene Graph Generation

Prompt Generation

Multi-Agentic Extraction

16 of 28

GETReason

Geospatial Event and Temporal Reasoning

Scene Graph Generation

Abstract Agent

Scene Graph Agent

17 of 28

GETReason

Geospatial Event and Temporal Reasoning

Prompt Generation

Prompt Generator

Event Prompt

Temporal Prompt

Geospatial Prompt

18 of 28

GETReason

Geospatial Event and Temporal Reasoning

Multi-Agentic Extraction

Event Agent

Temporal Agent

Geospatial Agent
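
The three stages above (scene graph generation, prompt generation, multi-agentic extraction) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `ModelFn`, `SceneContext`, the prompt wording, and `fake_model` are all hypothetical stand-ins, and the cross-agent iteration mentioned in the ablation is left out for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in for a vision-language model call: (prompt, image_path) -> text.
ModelFn = Callable[[str, str], str]


@dataclass
class SceneContext:
    abstract: str     # free-text summary, as an Abstract Agent might produce
    scene_graph: str  # entities and relations, as a Scene Graph Agent might produce


def generate_scene_context(model: ModelFn, image: str) -> SceneContext:
    """Stage 1: describe the image at two levels of detail."""
    abstract = model("Summarise this image in a few sentences.", image)
    graph = model("List the people, objects, and relations in this image.", image)
    return SceneContext(abstract=abstract, scene_graph=graph)


def generate_prompts(ctx: SceneContext) -> Dict[str, str]:
    """Stage 2: build one specialised prompt per downstream agent."""
    shared = f"Summary: {ctx.abstract}\nScene graph: {ctx.scene_graph}\n"
    questions = {  # assumed wording, for illustration only
        "event": "What event does this image show?",
        "temporal": "When was this image taken?",
        "geospatial": "Where was this image taken?",
    }
    return {name: shared + question for name, question in questions.items()}


def extract(model: ModelFn, image: str) -> Dict[str, str]:
    """Stage 3: run the event, temporal, and geospatial agents."""
    prompts = generate_prompts(generate_scene_context(model, image))
    return {name: model(prompt, image) for name, prompt in prompts.items()}


def fake_model(prompt: str, image: str) -> str:
    """Toy model that echoes the last line of the prompt, so the sketch runs offline."""
    return f"[answer to: {prompt.splitlines()[-1]}]"


result = extract(fake_model, "photo.jpg")
```

With a real model plugged in for `fake_model`, each agent would see the shared scene context plus its own question.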

19 of 28

Dataset

TARA: 11,241 images

WikiTiLo: 6,296 images

JSON-based structure: location, time, event & reasoning

Event

Temporal

Geospatial

20 of 28

Restructuring & Augmentation

TARA*

WikiTiLo*

{
  "id": "",
  "temporal_information": {
    "century": "",
    "decade": "",
    "year": "",
    "month": "",
    "day": ""
  },
  "geospatial_information": {
    "country": "",
    "state_or_province": "",
    "city": ""
  }
}

{
  "id": "",
  "event": {
    "value": "",
    "reasoning": ""
  },
  "background": {
    "value": "",
    "reasoning": ""
  },
  "geospatial_information": {
    "city": "",
    "country": "",
    "state_or_province": ""
  },
  "temporal_information": {
    "century": "",
    "day": "",
    "decade": "",
    "month": "",
    "year": ""
  }
}

Event Augmentation

Spatio-Temporal Augmentation

Deduction Augmentation
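
As a rough illustration of the larger record layout shown on this slide, the sketch below builds an empty record and reports which fields a given record lacks. The field names are taken from the JSON above; the helper names `empty_record` and `missing_fields` are hypothetical and not part of the released dataset tooling.

```python
TEMPORAL_KEYS = ("century", "decade", "year", "month", "day")
GEOSPATIAL_KEYS = ("country", "state_or_province", "city")
REASONED_KEYS = ("event", "background")  # each carries a value plus its reasoning


def empty_record(record_id: str) -> dict:
    """Return a blank record with every field from the slide's schema."""
    return {
        "id": record_id,
        "event": {"value": "", "reasoning": ""},
        "background": {"value": "", "reasoning": ""},
        "geospatial_information": {key: "" for key in GEOSPATIAL_KEYS},
        "temporal_information": {key: "" for key in TEMPORAL_KEYS},
    }


def missing_fields(record: dict) -> list:
    """List dotted paths of schema fields that are absent from the record."""
    missing = [] if "id" in record else ["id"]
    for key in REASONED_KEYS:
        for sub in ("value", "reasoning"):
            if sub not in record.get(key, {}):
                missing.append(f"{key}.{sub}")
    for section, keys in (
        ("geospatial_information", GEOSPATIAL_KEYS),
        ("temporal_information", TEMPORAL_KEYS),
    ):
        for sub in keys:
            if sub not in record.get(section, {}):
                missing.append(f"{section}.{sub}")
    return missing


# A record with only the temporal and geospatial sections lacks the
# event and background fields.
partial = {
    "id": "example",
    "temporal_information": {key: "" for key in TEMPORAL_KEYS},
    "geospatial_information": {key: "" for key in GEOSPATIAL_KEYS},
}
gaps = missing_fields(partial)
```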

21 of 28

Evaluation

GREAT (Geospatial Reasoning Event Accuracy with Temporal alignment)

Event: Semantic cosine similarity

Geospatial: Haversine distance + hierarchy

Temporal: Weighted unit-wise scoring

22 of 28

GREAT

(Geospatial Reasoning Event Accuracy with Temporal alignment)

Event Evaluation

Cosine Similarity of Sentence Embeddings (Event + Background)
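
For reference, cosine similarity between two embedding vectors can be computed as in the sketch below. The slides do not say which sentence-embedding model is used, so the vectors here are toy values; only the similarity formula itself is standard.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here: a zero vector matches nothing
    return dot / (norm_a * norm_b)


# Toy "embeddings" standing in for predicted and reference event descriptions.
predicted = [0.2, 0.7, 0.1]
reference = [0.2, 0.7, 0.1]
score = cosine_similarity(predicted, reference)
```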

23 of 28

GREAT

(Geospatial Reasoning Event Accuracy with Temporal alignment)

Geospatial Evaluation

Haversine Distance-Based Similarity
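
The haversine formula gives the great-circle distance between two latitude/longitude points. The sketch below implements it and then maps distance to a similarity in (0, 1]. The exponential mapping and the `scale_km` parameter are assumptions made for illustration; the slides do not give the exact function, nor how the country/state/city hierarchy is folded in.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against rounding just above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def distance_similarity(distance_km: float, scale_km: float = 1000.0) -> float:
    """Assumed mapping: 1.0 at zero distance, decaying exponentially with distance."""
    return math.exp(-distance_km / scale_km)


# Two antipodal points on the equator are half a circumference apart.
half_circumference = haversine_km(0.0, 0.0, 0.0, 180.0)
```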

24 of 28

GREAT

(Geospatial Reasoning Event Accuracy with Temporal alignment)

Temporal Evaluation

Granularity-Weighted Temporal Accuracy
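
One way to read "granularity-weighted" scoring is sketched below: each temporal unit contributes a weight when the prediction matches the reference, and units the reference leaves blank are skipped. The weights in `ASSUMED_WEIGHTS` are placeholders chosen for this example, not the values used by GREAT, and exact string matching is likewise an assumption.

```python
UNITS = ("century", "decade", "year", "month", "day")

# Placeholder weights for illustration only; the actual GREAT weights are not on the slide.
ASSUMED_WEIGHTS = {"century": 1, "decade": 2, "year": 3, "month": 3, "day": 3}


def temporal_score(pred: dict, gold: dict, weights: dict = ASSUMED_WEIGHTS) -> float:
    """Weighted fraction of reference temporal units that the prediction matches."""
    total = 0
    earned = 0
    for unit in UNITS:
        gold_value = gold.get(unit, "")
        if not gold_value:
            continue  # the reference does not specify this unit, so skip it
        total += weights[unit]
        if pred.get(unit, "") == gold_value:
            earned += weights[unit]
    return earned / total if total else 0.0


gold = {"century": "21", "decade": "2010", "year": "2010", "month": "", "day": ""}
pred = {"century": "21", "decade": "2010", "year": "2011"}
score = temporal_score(pred, gold)  # century and decade match, year does not
```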

25 of 28

Results Summary

GETReason achieves the highest scores in event, geospatial, and temporal inference

Superior performance on TARA and WikiTiLo datasets

Ablation confirms value of cross-agent iteration and structured outputs

26 of 28

Results Summary

Best-performing model: Gemini 1.5 Pro (among the three models tried: Gemini 1.5 Pro, GPT-4o-mini, Qwen 2.5-VL), evaluated on the two datasets.

TARA*

WikiTiLo*

27 of 28

Results Summary

Ablation

Error Analysis

Ablation: impact of cross-extraction, images in the prompt, and the multi-extraction layer

Relative error: net improvement of GETReason over baselines across tasks

28 of 28

Conclusion & Takeaways

Hierarchical multi-agent design improves contextual reasoning

Structured responses help control LLM output

GREAT metric evaluates reasoning, not just overlap