CLEAR: A Clinically Grounded Tabular Framework for Radiology Report Evaluation
Yuyang Jiang 1, Chacha Chen 1, Shengyuan Wang 2, Feng Li 1, Zecong Tang 3, Benjamin M. Mervak 4, Lydia Chelala 1, Christopher M. Straus 1, Reve Chahine 4, Samuel G. Armato III 1*, Chenhao Tan 1*
1 University of Chicago, 2 Tsinghua University,
3 Zhejiang University, 4 University of Michigan
Nowadays, LLMs/VLLMs rapidly hill-climb on benchmarks…
◀️ Stanford HAI The 2025 AI Index Report
🔼 Wu, C., Zhang, X., Zhang, Y. et al. Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Nat Commun 16, 7866 (2025).
But do these appealing numbers truly capture clinically aligned qualities?
🤔
Evaluation Methodology: Overview
Evaluation Methodology: Lexical Metrics
Evaluation Methodology: Clinical Efficacy Metrics
Pattern: Label / Entity Extraction + Metric Calculation
Fixed value set of Entities (Anatomy, Observation) and Relations
Jain, Saahil, et al. "RadGraph: Extracting Clinical Entities and Relations from Radiology Reports." NeurIPS (2021).
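The extraction-plus-metric pattern can be sketched as follows. This is a minimal illustration, not RadGraph itself: `extract_entities` is a hypothetical stand-in that matches tokens against a tiny fixed vocabulary, and the vocabulary terms are assumptions chosen for the example. The score is a set-level F1 between the entities found in the candidate and reference reports.

```python
from typing import Set, Tuple

# Hypothetical fixed value set; a real extractor would use a trained model.
VOCAB = {
    "anatomy": {"lung", "heart", "pleura"},
    "observation": {"opacity", "effusion", "cardiomegaly"},
}


def extract_entities(report: str) -> Set[Tuple[str, str]]:
    """Return (entity_type, term) pairs whose term appears as a token in the report."""
    tokens = set(report.lower().replace(".", " ").replace(",", " ").split())
    return {
        (etype, term)
        for etype, terms in VOCAB.items()
        for term in terms
        if term in tokens
    }


def entity_f1(candidate: str, reference: str) -> float:
    """Set-level F1 between entities extracted from the two reports."""
    cand, ref = extract_entities(candidate), extract_entities(reference)
    if not cand and not ref:
        return 1.0  # nothing to find and nothing reported
    overlap = len(cand & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the vocabulary is fixed, any finding outside it is invisible to the metric, which is the limitation this pattern carries.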
Evaluation Methodology: LLM-based Metrics
Pattern: Expertise-distilled Taxonomy + LLM-as-Judge
When we define the taxonomy,
we also define the structure of the metric alignment test,
which directly limits the structure of LLM-based metrics.
Yu, Feiyang, et al. "Evaluating progress in automatic chest x-ray radiology report generation." Patterns 4.9 (2023).
GREEN: Generative Radiology Report Evaluation and Error Notation (Ostmeier et al., EMNLP Findings 2024)
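To make the coupling between taxonomy and metric concrete, here is a minimal sketch of how a fixed error taxonomy shapes a judge's output. The category names are paraphrased for illustration only, `stub_judge` is a hypothetical placeholder for an LLM call, and the tally is a plain count per category; none of this is the implementation of the cited metrics.

```python
from collections import Counter
from enum import Enum
from typing import Callable, Dict, List


class ErrorType(Enum):
    # Illustrative categories; an actual taxonomy is defined by clinical experts.
    FALSE_FINDING = "false prediction of finding"
    OMITTED_FINDING = "omission of finding"
    WRONG_LOCATION = "incorrect location of finding"
    WRONG_SEVERITY = "incorrect severity of finding"


# A judge maps (candidate, reference) to a list of taxonomy errors.
Judge = Callable[[str, str], List[ErrorType]]


def tally_errors(candidate: str, reference: str, judge: Judge) -> Dict[ErrorType, int]:
    """Count judged errors per taxonomy category, including zero counts."""
    counts = Counter(judge(candidate, reference))
    return {etype: counts.get(etype, 0) for etype in ErrorType}


def stub_judge(candidate: str, reference: str) -> List[ErrorType]:
    """Hypothetical stand-in for an LLM judge: flags one omitted keyword."""
    errors: List[ErrorType] = []
    if "effusion" in reference.lower() and "effusion" not in candidate.lower():
        errors.append(ErrorType.OMITTED_FINDING)
    return errors
```

The output is keyed by the taxonomy, so a kind of error that has no category cannot be reported, and the metric can be no finer-grained than the taxonomy it is built on.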
Evaluation Methodology: LLM-based Metrics
Why good?
Why improve?
Curtis P. Langlotz. 2015. The Radiology Report: A Guide to Thoughtful Communication for Radiologists and Other Medical Professionals. CreateSpace Independent Publishing Platform, North Charleston, SC.
CLEAR: Taxonomy Overview
CLEAR: Attribute-Level Radiology Report Evaluator
1️⃣ CLEAR transforms the coarse, single-dimensional taxonomy into a fine-grained, multidimensional structure.
2️⃣ CLEAR also provides interpretable outputs to assess report quality at the level of condition-attribute pairs.
3️⃣ CLEAR supports both commercial models (AzureOpenAI backend) and open-source local models (vLLM backend).
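One way to picture evaluation at the level of condition-attribute pairs is a small table with one row per condition. The sketch below is an assumption-laden illustration, not CLEAR's code: the `ConditionRow` type, the label values, and the attribute names used in the example are hypothetical, and exact string equality stands in for whatever scoring the actual system applies.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ConditionRow:
    """One table row: a condition, its label, and free-text attributes."""
    condition: str
    label: str  # hypothetical values, e.g. "positive", "negative", "unclear"
    attributes: Dict[str, str] = field(default_factory=dict)


def compare_tables(
    candidate: List[ConditionRow], reference: List[ConditionRow]
) -> Dict[Tuple[str, str], bool]:
    """Return a match flag for every (condition, field) pair in the reference."""
    cand_by_name = {row.condition: row for row in candidate}
    result: Dict[Tuple[str, str], bool] = {}
    for ref_row in reference:
        cand_row = cand_by_name.get(ref_row.condition)
        # The label itself is treated as one more comparable field.
        result[(ref_row.condition, "label")] = (
            cand_row is not None and cand_row.label == ref_row.label
        )
        for attr, ref_value in ref_row.attributes.items():
            cand_value = cand_row.attributes.get(attr) if cand_row else None
            # Exact match is a simplification of attribute-level scoring.
            result[(ref_row.condition, attr)] = cand_value == ref_value
    return result
```

Each entry of the result points at one condition-attribute pair, so a disagreement can be traced to, say, the severity of an effusion rather than to the report as a whole.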
How to evaluate an evaluator?
No new taxonomy → no new metric → no better reports!
Yu, Feiyang, et al. "Evaluating progress in automatic chest x-ray radiology report generation." Patterns 4.9 (2023).
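A common way to test whether a metric aligns with experts is a rank correlation between metric scores and expert judgments on the same reports. The sketch below computes Kendall's tau-b in plain Python; treating tau-b as the alignment statistic is an assumption made here for illustration, and the numbers in the usage note are made up.

```python
from itertools import combinations
from math import sqrt
from typing import Sequence


def kendall_tau_b(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: excluded from every term
        elif dx == 0:
            ties_x_only += 1
        elif dy == 0:
            ties_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt(
        (concordant + discordant + ties_x_only)
        * (concordant + discordant + ties_y_only)
    )
    return (concordant - discordant) / denom if denom else 0.0
```

For example, with hypothetical metric scores `[0.9, 0.7, 0.4, 0.2]` and expert error counts `[0, 1, 3, 5]`, every pair is discordant: higher scores always go with fewer errors, so the correlation is perfectly negative, which is the desired direction for a quality score compared against error counts.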
CLEAR-Bench: Attribute-Level Expert Alignment Dataset
Irvin, Jeremy, et al. "CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison." Proceedings of the AAAI conference on artificial intelligence. Vol. 33. No. 01. 2019.
Yu, Feiyang, et al. "Evaluating progress in automatic chest x-ray radiology report generation." Patterns 4.9 (2023).
Reliability Test: Evaluating CLEAR on CLEAR-Bench
✅ Label Extraction Module
✅ Description Extraction Module
❤️🔥 Scoring Module
Smit, Akshay, et al. "Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.
Irvin, Jeremy, et al. "CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison." Proceedings of the AAAI conference on artificial intelligence. Vol. 33. No. 01. 2019.
Scan For Open-Sourced Artifacts!
Find Us at the Poster Session!
Or yuyang2001@uchicago.edu