
Motivation

Dataset

Methods

Input Representations

Question Generation

Results

  • Hallucination in VQA
    • HVQA: average of three scores
      • HVQA-count (captures over-counting of existing objects): applies to simple counting, spatial counting and value-based questions.
      • HVQA-in-domain (captures predictions of non-existing in-domain objects)
      • HVQA-out-domain (captures predictions with out-of-domain objects)
    • HVQA-in-domain and HVQA-out-domain are both applicable for position-based questions.
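The "average of three scores" rule above can be sketched as follows; the function and argument names, and the assumption that each sub-score lies in [0, 1], are illustrative (only the averaging itself comes from the text):

```python
def hvqa(count_score, in_domain_score, out_domain_score):
    """Overall HVQA score as the mean of the three sub-scores.

    Hypothetical sub-score semantics (each assumed to be in [0, 1]):
      - count_score:      over-counting of existing objects (HVQA-count)
      - in_domain_score:  predicting non-existing in-domain objects
      - out_domain_score: predicting out-of-domain objects
    """
    return (count_score + in_domain_score + out_domain_score) / 3.0
```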

  • BLIP provides the best accuracy, while LLaVA and GPT-4V provide the lowest hallucination scores.

  • Our best model is fine-tuned BLIP paired with prompts containing a visual description of the component (BLIP-Desc), reaching acc=91.7.
    • It also hallucinates less on in-domain objects compared to its fine-tuned counterparts GIT and Pix2Struct.

  • Fine-tuning broadly ensures that the models (BLIP, GIT and Pix2Struct) do not hallucinate out-of-domain objects.

  • LLaVA predicts out-of-domain objects like ‘circle’, ‘square’, ‘A’, ‘B’, ‘D’, ‘F’, ‘triangle’, ‘carlin’, ‘nano’, ‘peizo-keeper’, ‘trigger’, ‘Snake’, ‘Snake Detector’.
    • There is a need for domain-specific datasets (e.g., medical, finance) so that ML systems can be thoroughly tested before being deployed in the corresponding industry.
    • The limitations of state-of-the-art transformer-based models on domain-specific visual question answering datasets are not well understood.
    • Currently, no such dataset or study exists for electrical circuits.

  • Base: Image and text as input

  • OCR text: Google Vision API
    • Question [OCR] OCR output
  • OCR-Post
    • Keep only the numbers and units typically expected in electrical measurements.
    • Retain OCR output tokens that contain any of [‘Ω’, ‘H’, ‘A’, ‘F’, ‘V’, ‘W’, ‘k’, ‘K’, ‘.’, ‘κ’, ‘M’] or a combination of these symbols with a digit or only digits.
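A minimal sketch of the OCR-Post filter, assuming a token is retained when it is digits-only or combines digits with the listed unit symbols (the exact matching rule is an assumption):

```python
# Unit/prefix symbols from the OCR-Post rule.
UNIT_CHARS = {'Ω', 'H', 'A', 'F', 'V', 'W', 'k', 'K', '.', 'κ', 'M'}

def ocr_post_filter(tokens):
    """Keep OCR tokens that are only digits, or that combine at least
    one digit with the unit symbols above; drop everything else."""
    kept = []
    for tok in tokens:
        if tok.isdigit():
            kept.append(tok)
        elif any(ch.isdigit() for ch in tok) and all(
                ch.isdigit() or ch in UNIT_CHARS for ch in tok):
            kept.append(tok)
    return kept
```

For example, a raw OCR stream like ['10kΩ', 'R1', '5', 'resistor', '2.2F'] would be reduced to the measurement-like tokens only.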

  • Bounding box information
    • Use human annotated bounding boxes in metadata to fine-tune YOLO v8
    • Precision of 78.1, recall of 63.9, mAP@50 of 69.8 and mAP@50-95 of 51.3.
  • Bounding Box Segments
    • Segment bounding boxes into 9 categories based on a 3×3 grid (e.g., top-left, top-middle, top-right).
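The grid assignment can be sketched as below; mapping a box by its centre point is an assumption (the poster does not state the exact assignment rule):

```python
def grid_position(box, img_w, img_h):
    """Map a bounding box to one of 9 cells of a 3x3 grid over the
    image, using the box centre (a sketch; centre-based assignment
    is an assumption).

    box: (x_min, y_min, x_max, y_max) in pixel coordinates.
    """
    cx = (box[0] + box[2]) / 2.0
    cy = (box[1] + box[3]) / 2.0
    cols = ['left', 'middle', 'right']
    rows = ['top', 'middle', 'bottom']
    col = cols[min(int(3 * cx / img_w), 2)]   # clamp edge case cx == img_w
    row = rows[min(int(3 * cy / img_h), 2)]
    return f'{row}-{col}'
```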

  • Visual description of components
    • ChatGPT: “Describe the electrical component ⟨component⟩ in 50 words”
    • Desc: Pass question, [DESC], component description of relevant circuit component
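Assembling the Desc-variant input can be sketched as follows; the [DESC] separator and ordering come from the poster, while the helper name and the description-lookup structure are assumptions:

```python
def build_desc_prompt(question, component, descriptions):
    """Concatenate the question, the [DESC] separator token, and the
    ChatGPT-generated description of the relevant circuit component.

    descriptions: hypothetical dict mapping component name -> description.
    """
    return f"{question} [DESC] {descriptions[component]}"
```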

  • Source: Roboflow and Kaggle
  • 5 question types: Simple Counting, Spatial Counting, Position-based, Value-based and Junction-based.
  • Question Templates - For each type, obtain question templates using ChatGPT.
  • Question Generation - Instantiate questions using these templates, and image metadata like the associated components and their bounding boxes.
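The template-plus-metadata generation step can be sketched as below; the template strings, metadata keys and helper name are hypothetical (the real templates are obtained from ChatGPT per question type):

```python
import random

# Hypothetical templates; the real ones are obtained from ChatGPT.
TEMPLATES = {
    'simple_counting': ['How many {component}s are present in the circuit?'],
    'position_based': ['Which component is the {direction} one in the circuit?'],
}

def generate_question(qtype, metadata, rng=random):
    """Instantiate a question template with image metadata such as the
    associated components and their bounding boxes (keys are assumptions)."""
    template = rng.choice(TEMPLATES[qtype])
    if qtype == 'simple_counting':
        return template.format(component=rng.choice(metadata['components']))
    directions = ['left-most', 'right-most', 'top-most', 'bottom-most']
    return template.format(direction=rng.choice(directions))
```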

Answer Generation

Dataset Statistics

Hallucination scores. A=count, B=in-domain, C=out-domain.

Results per question type for the Desc variants of the models on CircuitVQA test set.

Question Type    | Answer Creation                                                                                                              | Skills Tested
-----------------|------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------
Simple Counting  | Image metadata contains component names and their counts                                                                       | Object recognition and counting skills
Spatial Counting | Human annotation                                                                                                               | Object detection and localization skills
Junction based   | For a positive answer (answer="yes"), we need valid triples of component X, junction Y and junction Z: choose a random junction Y, then the junction Z closest to Y. | Object detection, localization and spatial reasoning skills
Value based      | Human annotation                                                                                                               | Optical character recognition; linking text labels with components
Position based   | Use bounding boxes of the components to decide the left-most, right-most, top-most or bottom-most components                   | Object detection and localization skills
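The junction-based answer creation (pick a random junction Y, then the junction Z closest to it) can be sketched as follows; the coordinate representation and the Euclidean distance metric are assumptions:

```python
import math

def nearest_junction(y, junctions):
    """Return the junction Z closest to junction Y, excluding Y itself.

    y, junctions: (x, y) pixel coordinates; Euclidean distance is an
    assumption about the poster's notion of 'closest'.
    """
    return min((z for z in junctions if z != y),
               key=lambda z: math.dist(y, z))
```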

CircuitVQA: A visual question answering dataset for Electrical Circuit Images

Rahul Mehta¹, Bhavyajeet Singh¹², Vasudeva Varma¹, Manish Gupta¹²

IIIT Hyderabad, Microsoft

Contact: rahul008mehta@gmail.com