- Hallucination in VQA
- HVQA: average of three scores
- HVQA-count (captures over-counting of existing objects): applies to simple counting, spatial counting and value-based questions.
- HVQA-in-domain (captures predictions of non-existing in-domain objects)
- HVQA-out-domain (captures predictions with out-of-domain objects)
- HVQA-in-domain and HVQA-out-domain are both applicable for position-based questions.
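The aggregation above can be sketched as follows. This is a minimal illustration, assuming each sub-score is a number on a common scale; the function name is hypothetical, but the source states that HVQA is the average of the three sub-scores.

```python
def hvqa(count_score, in_domain_score, out_domain_score):
    """Aggregate hallucination score: simple mean of the three
    sub-scores (HVQA-count, HVQA-in-domain, HVQA-out-domain)."""
    return (count_score + in_domain_score + out_domain_score) / 3
```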
- BLIP provides the best accuracy, while LLaVA and GPT-4V provide the lowest hallucination scores.
- Our best model is fine-tuned BLIP (acc = 91.7) when paired with prompts containing a visual description of the component (BLIP-Desc).
- It also hallucinates less on in-domain objects compared to its fine-tuned counterparts GIT and Pix2Struct.
- Fine-tuning broadly ensures that the models (BLIP, GIT and Pix2Struct) do not hallucinate out-of-domain objects.
- LLaVA predicts out-of-domain objects like ‘circle’, ‘square’, ‘A’, ‘B’, ‘D’, ‘F’, ‘triangle’, ‘carlin’, ‘nano’, ‘peizo-keeper’, ‘trigger’, ‘Snake’, ‘Snake Detector’.
- There is a need for domain-specific datasets (e.g., medical, finance) so that ML systems can be thoroughly tested before being deployed in the corresponding industry.
- The limitations of state-of-the-art transformer-based models on domain-specific visual question answering datasets are not well understood.
- Currently, no such dataset or study exists for electrical circuits.
- Base: Image and text as input
- OCR text: Google Vision API
- Prompt format: Question [OCR] OCR output
- OCR-Post
- Keep only the numbers and units typically expected in electrical measurements.
- Retain OCR output tokens that contain any of [‘Ω’, ‘H’, ‘A’, ‘F’, ‘V’, ‘W’, ‘k’, ‘K’, ‘.’, ‘κ’, ‘M’] or a combination of these symbols with a digit or only digits.
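The retention rule above can be sketched as a small token filter. This is one possible reading of the rule, assuming a token is kept if it is all digits or combines digits with the expected unit symbols; function names are hypothetical.

```python
# Symbols typically found in electrical measurements (ohms, henries,
# amps, farads, volts, watts, and k/M multipliers).
UNIT_CHARS = set('ΩHAFVWkK.κM')

def keep_token(token: str) -> bool:
    """Retain a token if it is only digits, or a combination of digits
    with the expected unit symbols (e.g. '10kΩ', '4.7F')."""
    if token.isdigit():
        return True
    has_digit = any(c.isdigit() for c in token)
    allowed = all(c.isdigit() or c in UNIT_CHARS for c in token)
    return has_digit and allowed

def postprocess(ocr_tokens):
    """Filter raw OCR tokens down to measurement-like tokens."""
    return [t for t in ocr_tokens if keep_token(t)]
```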
- Bounding box information
- Use human annotated bounding boxes in metadata to fine-tune YOLO v8
- Precision of 78.1, recall of 63.9, mAP@50 of 69.8 and mAP@50-95 of 51.3.
- Bounding Box Segments
- Segment bounding boxes into 9 categories based on a 3×3 grid: top-left, top-middle, top-right, etc.
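The grid assignment can be sketched as mapping a box centre to one of the nine cells. This is a minimal sketch under assumed conventions: cell names follow a row-column scheme (so the centre cell comes out as 'middle-middle'), which may differ from the dataset's actual labels.

```python
def grid_position(bbox, img_w, img_h):
    """Map a bounding box (x1, y1, x2, y2) to one of 9 cells of a
    3x3 grid, based on where the box centre falls in the image."""
    cx = (bbox[0] + bbox[2]) / 2
    cy = (bbox[1] + bbox[3]) / 2
    cols = ['left', 'middle', 'right']
    rows = ['top', 'middle', 'bottom']
    # min(..., 2) clamps points lying exactly on the right/bottom edge.
    col = cols[min(int(3 * cx / img_w), 2)]
    row = rows[min(int(3 * cy / img_h), 2)]
    return f'{row}-{col}'
```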
- Visual description of components
- ChatGPT: “Describe the electrical component ⟨component⟩ in 50 words”
- Desc: Pass the question, [DESC], and the description of the relevant circuit component.
- Source: Roboflow and Kaggle
- 5 question types: Simple Counting, Spatial Counting, Position-based, Value-based and Junction-based.
- Question Templates - For each type, obtain question templates using ChatGPT.
- Question Generation - Instantiate questions using these templates, and image metadata like the associated components and their bounding boxes.
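The template-instantiation step can be sketched as below. The template string and metadata layout are hypothetical stand-ins; the source only states that templates come from ChatGPT and are filled from image metadata (components and bounding boxes).

```python
import random

# Hypothetical template; the actual templates per question type are
# obtained from ChatGPT.
TEMPLATES = {
    'simple_counting': 'How many {component}s are present in the circuit?',
}

def generate_question(qtype, metadata, rng=random):
    """Instantiate a template with a component drawn from image
    metadata; the answer for counting questions is the stored count."""
    component = rng.choice(sorted(metadata['components']))
    question = TEMPLATES[qtype].format(component=component)
    answer = metadata['components'][component]
    return question, answer
```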
Hallucination scores. A=count, B=in-domain, C=out-domain.
Results per question type for the Desc variants of the models on CircuitVQA test set.
| Question generation approach | Skills tested |
|---|---|
| Image metadata contains component names and their counts | Object recognition and counting skills |
| | Object detection and localization skills |
| For a positive answer (i.e., answer = "yes"), we need valid triples of component X, junction Y and junction Z. Choose a random junction Y, then choose the junction Z closest to Y. | Object detection, localization and spatial reasoning skills |
| | Optical character recognition; link text labels with components |
| Use bounding boxes of the components to decide the left-most, right-most, top-most or bottom-most components | Object detection and localization skills |
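The nearest-junction selection used for positive junction-based answers can be sketched as follows. This is a minimal sketch assuming junctions carry bounding boxes and "closest" means Euclidean distance between box centres; the data layout is an assumption.

```python
import math

def nearest_junction(y, junctions):
    """Return the junction Z closest to junction Y, measured by
    Euclidean distance between bounding-box centres."""
    def centre(j):
        x1, y1, x2, y2 = j['bbox']
        return ((x1 + x2) / 2, (y1 + y2) / 2)
    yc = centre(y)
    return min((j for j in junctions if j is not y),
               key=lambda j: math.dist(centre(j), yc))
```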
CircuitVQA: A visual question answering dataset for Electrical Circuit Images
Rahul Mehta¹, Bhavyajeet Singh¹², Vasudeva Varma¹, Manish Gupta¹²
IIIT Hyderabad, Microsoft
Contact: rahul008mehta@gmail.com