Evaluating ChatGPT-4o’s Performance on a Radiology Training Exam: Implications for Diagnostic Artificial Intelligence
Imran Bitar, BSᵃ, Richard Olawoyin, PhDᵇ
ᵃOakland University William Beaumont School of Medicine, Rochester, MI
ᵇOakland University Department of Industrial and Systems Engineering, Rochester, MI
Introduction
Artificial intelligence (AI) has rapidly expanded within medicine, with radiology among the most significantly affected specialties. AI-powered models have demonstrated potential to improve diagnostic accuracy, increase workflow efficiency, and support clinical decision-making. Recent developments in large language models (LLMs) have further broadened the potential applications of AI, particularly in areas such as medical education.
Despite these advances, current AI systems remain limited in their ability to interpret complex medical images. While many models perform strongly on text-based reasoning and knowledge recall, accurate image interpretation demands advanced pattern recognition and contextual understanding.
Aims and Objectives
Evaluating AI performance on standardized medical examinations provides insight into both the strengths and limitations of these technologies. This study evaluates the performance of ChatGPT-4o on a standardized radiology training examination to assess its potential role as a supportive tool in radiology education and clinical practice.
Methods
A total of 106 multiple-choice questions were obtained from the 2022 American College of Radiology (ACR) in-training examination, which was available on the ACR website at the time of data collection. The exam included questions from multiple radiology subspecialties: breast, cardiac, chest, genitourinary, musculoskeletal, gastrointestinal, neuroradiology, nuclear, pediatric, radiation physics, and ultrasound.
Each question was manually entered into ChatGPT-4o, and the model’s responses were recorded. Responses were scored as correct or incorrect against the official answer key, and overall and subspecialty-specific accuracy were calculated.
To further analyze performance, questions were categorized into two groups: image-dependent questions, which required interpretation of medical images, and image-independent questions, which could be answered from conceptual knowledge alone. Performance between the two categories was compared using Fisher’s exact test; a sketch of this analysis follows below.
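As an illustration of the scoring pipeline described above, the following is a minimal sketch of how category-level accuracy and the Fisher’s exact test comparison could be computed, assuming the graded responses were recorded as (subspecialty, image-dependence, correctness) tuples. The records shown are hypothetical placeholders, not study data.

```python
from scipy.stats import fisher_exact

# Hypothetical graded responses: (subspecialty, image_dependent, correct).
# The actual study scored all 106 ACR questions in this manner.
graded = [
    ("ultrasound", True, True),
    ("radiation physics", False, True),
    ("neuroradiology", True, False),
    ("chest", False, True),
]

def accuracy(records):
    """Fraction of records scored correct."""
    return sum(r[2] for r in records) / len(records)

dependent = [r for r in graded if r[1]]
independent = [r for r in graded if not r[1]]

# 2x2 contingency table: rows = question category, columns = correct / incorrect.
table = [
    [sum(r[2] for r in independent), sum(not r[2] for r in independent)],
    [sum(r[2] for r in dependent), sum(not r[2] for r in dependent)],
]

odds_ratio, p_value = fisher_exact(table)
print(f"overall accuracy: {accuracy(graded):.1%}")
print(f"Fisher's exact test p-value: {p_value:.3f}")
```

Fisher’s exact test is a reasonable choice over a chi-squared test here because some cells of the contingency table have small counts.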
Results
ChatGPT-4o achieved an overall accuracy of 69.8% (74/106). Accuracy was significantly higher on image-independent questions (95.2%, 40/42) than on image-dependent questions (53.1%, 34/64; p < 0.001). Subspecialty performance varied, with the highest accuracy in ultrasound (90.0%) and radiation physics (88.9%) and the lowest in musculoskeletal (50.0%) and pediatric radiology (55.6%). Notably, the model achieved 100% accuracy on image-independent questions in nine subspecialties.
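The reported significance can be reproduced directly from the counts above (40/42 correct image-independent versus 34/64 correct image-dependent); the check below is a sketch using SciPy’s implementation of Fisher’s exact test.

```python
from scipy.stats import fisher_exact

# Correct / incorrect counts reported above, by question category.
image_independent = [40, 2]   # 95.2% accuracy (40/42)
image_dependent = [34, 30]    # 53.1% accuracy (34/64)

_, p_value = fisher_exact([image_independent, image_dependent])
print(f"p = {p_value:.1e}")   # far below 0.001, consistent with p < 0.001
```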
Conclusions
ChatGPT-4o achieved an overall accuracy of 69.8% on a standardized radiology training examination, performing significantly better on image-independent questions (95.2%) than on image-dependent questions (53.1%).
These findings suggest that current large language models possess strong conceptual knowledge and text-based reasoning capabilities but still struggle to interpret complex medical images.
Overall, ChatGPT-4o shows promise as a supportive tool for radiology education and knowledge-based tasks, while further advances in AI development will be needed to improve performance in medical image interpretation and diagnostic applications.
Table 1. Overall performance of ChatGPT-4o on all exam questions
Table 2. ChatGPT-4o performance on image-dependent questions
Table 3. ChatGPT-4o performance on image-independent questions