- Hallucination in VQA
- HVQA: average of three scores
- HVQA-count (captures over-counting of existing objects): applies to simple counting, spatial counting and value-based questions.
- HVQA-in-domain (captures predictions of non-existing in-domain objects)
- HVQA-out-domain (captures predictions with out-of-domain objects)
- HVQA-in-domain and HVQA-out-domain are both applicable for position-based questions.
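The aggregation above can be sketched as follows. This is a minimal illustration, assuming each sub-score is a number on a common scale; the function name is hypothetical, but the source states that HVQA is the average of the three sub-scores.

```python
def hvqa(count_score, in_domain_score, out_domain_score):
    """Aggregate hallucination score: simple mean of the three
    sub-scores (HVQA-count, HVQA-in-domain, HVQA-out-domain)."""
    return (count_score + in_domain_score + out_domain_score) / 3
```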
- BLIP provides the best accuracy, while LLaVA and GPT-4V provide the lowest hallucination scores.
- Our best model is fine-tuned BLIP (acc = 91.7) when paired with prompts containing a visual description of the component (BLIP-Desc).
- It also hallucinates less on in-domain objects compared to its fine-tuned counterparts GIT and Pix2Struct.
- Fine-tuning broadly ensures that the models (BLIP, GIT and Pix2Struct) do not hallucinate out-of-domain objects.
- LLaVA predicts out-of-domain objects like ‘circle’, ‘square’, ‘A’, ‘B’, ‘D’, ‘F’, ‘triangle’, ‘carlin’, ‘nano’, ‘peizo-keeper’, ‘trigger’, ‘Snake’, ‘Snake Detector’.
- There is a need for domain-specific datasets (e.g., medical, finance) so that ML systems can be thoroughly tested before being deployed in the corresponding industry.
- The limitations of state-of-the-art transformer-based models on domain-specific visual question answering datasets are not well understood.
- Currently, no such dataset or study exists for electrical circuits.
- Base: Image and text as input
- OCR text: Google Vision API
- Prompt format: Question [OCR] OCR output
- OCR-Post
- Keep only the numbers and units typically expected in electrical measurements.
- Retain OCR output tokens that contain any of [‘Ω’, ‘H’, ‘A’, ‘F’, ‘V’, ‘W’, ‘k’, ‘K’, ‘.’, ‘κ’, ‘M’] or a combination of these symbols with a digit or only digits.
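The retention rule above can be sketched as a small token filter. This is one possible reading of the rule, assuming a token is kept if it is all digits or combines digits with the expected unit symbols; function names are hypothetical.

```python
# Symbols typically found in electrical measurements (ohms, henries,
# amps, farads, volts, watts, and k/M multipliers).
UNIT_CHARS = set('ΩHAFVWkK.κM')

def keep_token(token: str) -> bool:
    """Retain a token if it is only digits, or a combination of digits
    with the expected unit symbols (e.g. '10kΩ', '4.7F')."""
    if token.isdigit():
        return True
    has_digit = any(c.isdigit() for c in token)
    allowed = all(c.isdigit() or c in UNIT_CHARS for c in token)
    return has_digit and allowed

def postprocess(ocr_tokens):
    """Filter raw OCR tokens down to measurement-like tokens."""
    return [t for t in ocr_tokens if keep_token(t)]
```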
- Bounding box information
- Use human annotated bounding boxes in metadata to fine-tune YOLO v8
- Precision of 78.1, recall of 63.9, mAP@50 of 69.8 and mAP@50-95 of 51.3.
- Bounding Box Segments
- Segment bounding boxes into 9 categories based on a 3×3 grid: top-left, top-middle, top-right, etc.
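The grid assignment can be sketched as mapping a box centre to one of the nine cells. This is a minimal sketch under assumed conventions: cell names follow a row-column scheme (so the centre cell comes out as 'middle-middle'), which may differ from the dataset's actual labels.

```python
def grid_position(bbox, img_w, img_h):
    """Map a bounding box (x1, y1, x2, y2) to one of 9 cells of a
    3x3 grid, based on where the box centre falls in the image."""
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    cols = ['left', 'middle', 'right']
    rows = ['top', 'middle', 'bottom']
    # min(..., 2) clamps points lying exactly on the right/bottom edge.
    col = cols[min(int(3 * cx / img_w), 2)]
    row = rows[min(int(3 * cy / img_h), 2)]
    return f'{row}-{col}'
```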
- Visual description of components
- ChatGPT: “Describe the electrical component ⟨component⟩ in 50 words”
- Desc: Pass the question, [DESC], and the description of the relevant circuit component.
- Source: Roboflow and Kaggle
- 5 question types: Simple Counting, Spatial Counting, Position-based, Value-based and Junction-based.
- Question Templates - For each type, obtain question templates using ChatGPT.
- Question Generation - Instantiate questions using these templates, and image metadata like the associated components and their bounding boxes.
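The template-instantiation step can be sketched as below. The template string and metadata layout are hypothetical stand-ins; the source only states that templates come from ChatGPT and are filled from image metadata (components and bounding boxes).

```python
import random

# Hypothetical template; the actual templates per question type are
# obtained from ChatGPT.
TEMPLATES = {
    'simple_counting': 'How many {component}s are present in the circuit?',
}

def generate_question(qtype, metadata, rng=random):
    """Instantiate a template with a component drawn from image
    metadata; the answer for counting questions is the stored count."""
    component = rng.choice(sorted(metadata['components']))
    question = TEMPLATES[qtype].format(component=component)
    answer = metadata['components'][component]
    return question, answer
```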
Hallucination scores. A=count, B=in-domain, C=out-domain.
Results per question type for the Desc variants of the models on CircuitVQA test set.
| Question generation approach | Skills tested |
|---|---|
| Image metadata contains component names and their counts | Object recognition and counting skills |
| | Object detection and localization skills |
| For a positive answer (i.e., answer = "yes"), we need valid triples of component X, junction Y and junction Z. Choose a random junction Y, then choose the junction Z closest to Y. | Object detection, localization and spatial reasoning skills |
| | Optical character recognition; link text labels with components |
| Use bounding boxes of the components to decide the left-most, right-most, top-most or bottom-most components | Object detection and localization skills |
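The nearest-junction selection used for positive junction-based answers can be sketched as follows. This is a minimal sketch assuming junctions carry bounding boxes and "closest" means Euclidean distance between box centres; the data layout is an assumption.

```python
import math

def nearest_junction(y, junctions):
    """Return the junction Z closest to junction Y, measured by
    Euclidean distance between bounding-box centres."""
    def centre(j):
        x1, y1, x2, y2 = j['bbox']
        return ((x1 + x2) / 2, (y1 + y2) / 2)
    yc = centre(y)
    return min((j for j in junctions if j is not y),
               key=lambda j: math.dist(centre(j), yc))
```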
CircuitVQA: A visual question answering dataset for Electrical Circuit Images
Rahul Mehta¹, Bhavyajeet Singh¹², Vasudeva Varma¹, Manish Gupta¹²
IIIT Hyderabad, Microsoft
Contact: rahul008mehta@gmail.com