1 of 16

CLEAR: A Clinically Grounded Tabular Framework for Radiology Report Evaluation

Yuyang Jiang 1, Chacha Chen 1, Shengyuan Wang2, Feng Li 1, Zecong Tang 3, Benjamin M. Mervak4, Lydia Chelala 1, Christopher M. Straus 1, Reve Chahine4, Samuel G. Armato III 1*, Chenhao Tan 1*

1University of Chicago, 2Tsinghua University,

3Zhejiang University, 4University of Michigan

2 of 16

LLMs and VLLMs are rapidly hill-climbing on benchmarks…

◀️ Stanford HAI The 2025 AI Index Report

🔼 Wu, C., Zhang, X., Zhang, Y. et al. Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Nat Commun 16, 7866 (2025).

3 of 16

But do these appealing numbers truly capture clinically aligned qualities?


🤔

4 of 16

Evaluation Methodology: Overview

5 of 16

Evaluation Methodology: Lexical Metrics

6 of 16

Evaluation Methodology: Clinical Efficacy Metrics

Pattern: Label / Entity Extraction + Metric Calculation

Fixed value set of Entities (Anatomy, Observation) and Relations

Jain, Saahil, et al. "RadGraph: Extracting Clinical Entities and Relations from Radiology Reports." NeurIPS (2021).
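The label-extraction + metric-calculation pattern above can be sketched in a few lines. This is an illustrative toy, not the actual CheXbert or RadGraph pipeline: the condition list and the substring-matching labeler are stand-in assumptions for a trained extraction model operating over a fixed value set.

```python
# Toy sketch of the clinical-efficacy-metric pattern:
# (1) map each report to labels over a fixed set of conditions,
# (2) compute a set-overlap metric (here, F1) against the reference.
CONDITIONS = ["Cardiomegaly", "Edema", "Pleural Effusion"]  # fixed value set

def extract_labels(report: str) -> dict[str, bool]:
    """Placeholder labeler: a real system (e.g., CheXbert) would use a
    trained model; here we just check for the condition name."""
    text = report.lower()
    return {c: c.lower() in text for c in CONDITIONS}

def f1_against_reference(generated: str, reference: str) -> float:
    gen, ref = extract_labels(generated), extract_labels(reference)
    tp = sum(gen[c] and ref[c] for c in CONDITIONS)
    fp = sum(gen[c] and not ref[c] for c in CONDITIONS)
    fn = sum(ref[c] and not gen[c] for c in CONDITIONS)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Candidate mentions only cardiomegaly; reference also has an effusion.
print(round(f1_against_reference(
    "Cardiomegaly is present.",
    "Cardiomegaly with small pleural effusion.",
), 3))
```

The key property of this pattern is that the metric can only see what the fixed label set can express, which motivates the finer-grained taxonomy discussed next.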

7 of 16

Evaluation Methodology: LLM-based Metrics

Pattern: Expertise-distilled Taxonomy + LLM-as-Judge

When we define the taxonomy, we also define the structure of the metric alignment test, directly constraining the structure of LLM-based metrics.

Yu, Feiyang, et al. "Evaluating progress in automatic chest x-ray radiology report generation." Patterns 4.9 (2023).
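The taxonomy + LLM-as-judge pattern can be sketched as follows. The category names and prompt wording here are illustrative assumptions, not the actual taxonomy of Yu et al. (2023) or any specific judge prompt; the point is that the taxonomy fixes the rubric, and the rubric fixes the shape of the judge's output.

```python
# Hypothetical error taxonomy; real taxonomies are distilled from
# radiologist expertise.
TAXONOMY = [
    "false prediction of finding",
    "omission of finding",
    "incorrect location of finding",
    "incorrect severity of finding",
]

def build_judge_prompt(reference: str, candidate: str) -> str:
    """Build an LLM-as-judge prompt whose structure mirrors the taxonomy:
    one rubric line (and hence one expected score) per category."""
    rubric = "\n".join(f"- {cat}" for cat in TAXONOMY)
    return (
        "You are a radiologist grading a generated report.\n"
        f"Reference report:\n{reference}\n\n"
        f"Candidate report:\n{candidate}\n\n"
        "Count the errors in each category below and return one "
        f"integer per line, in order:\n{rubric}"
    )

prompt = build_judge_prompt("No acute findings.", "Small right effusion.")
print(prompt.count("- "))  # one rubric line per taxonomy category
```

Note how the alignment test inherits the same structure: expert ratings must be collected per taxonomy category to compare against the judge's per-category counts.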


9 of 16

Evaluation Methodology: LLM-based Metrics

  1. Expertise-distilled Taxonomy

Why good?

  • Expertise Distillation: if we want the model to generate a good report, we first need to know what a good report looks like.
  • Structured Reporting

Why improve?

  • We want to generate an examination sheet for reports, just like the report from a physical examination: structured and interpretable

Curtis P. Langlotz. 2015. The Radiology Report: A Guide to Thoughtful Communication for Radiologists and Other Medical Professionals. CreateSpace Independent Publishing Platform, North Charleston, SC.

10 of 16

Evaluation Methodology: LLM-based Metrics

  2. LLM-as-Judge

Why good?

  • Evaluation is multidimensional, and we want a unified model type to operationalize it. → More efficient; robust across different NLP tasks
  • Agent as Evaluator → Easy to design; supports various models

Why improve?

  • Adapt the evaluator structure to the taxonomy structure
  • Optimize the software workflow and backend models to support stronger capabilities

11 of 16

CLEAR: Taxonomy Overview

12 of 16

CLEAR: Attribute-Level Radiology Report Evaluator

1️⃣ CLEAR transforms the coarse, single-dimensional taxonomy into a fine-grained, multidimensional structure.

2️⃣ CLEAR also provides interpretable outputs to assess report quality at the level of condition-attribute pairs.

3️⃣ CLEAR supports both commercial models (AzureOpenAI backend) and open-source local models (vLLM backend).
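Point 2️⃣ can be made concrete with a sketch of what interpretable, attribute-level output might look like. This is an illustrative data structure, not CLEAR's actual schema; the attribute names other than Severity and Recommendation (mentioned on the results slide) are hypothetical examples.

```python
from dataclasses import dataclass

@dataclass
class PairJudgment:
    """One interpretable judgment on a condition-attribute pair."""
    condition: str   # e.g., "Pleural Effusion"
    attribute: str   # e.g., "severity", "recommendation"
    match: bool      # does the candidate agree with the reference?

def attribute_accuracy(judgments: list[PairJudgment], attribute: str) -> float:
    """Aggregate pair-level judgments into a per-attribute score."""
    rows = [j for j in judgments if j.attribute == attribute]
    return sum(j.match for j in rows) / len(rows) if rows else 0.0

judgments = [
    PairJudgment("Pleural Effusion", "severity", True),
    PairJudgment("Pleural Effusion", "location", False),
    PairJudgment("Cardiomegaly", "severity", True),
]
print(attribute_accuracy(judgments, "severity"))  # 1.0
```

Because scores are kept at the condition-attribute level before aggregation, an error can be traced to a specific pair (here, the effusion's location) rather than being hidden inside a single scalar.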

13 of 16

How to evaluate an evaluator?

  1. Report Selection
  2. Radiologist Evaluation
  3. Metric / Evaluator Evaluation
  4. Alignment Test

No new taxonomy → no new metric → no better reports!

Yu, Feiyang, et al. "Evaluating progress in automatic chest x-ray radiology report generation." Patterns 4.9 (2023).

14 of 16

CLEAR-Bench: Attribute-Level Expert Alignment Dataset

Irvin, Jeremy, et al. "Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison." Proceedings of the AAAI conference on artificial intelligence. Vol. 33. No. 01. 2019.

Yu, Feiyang, et al. "Evaluating progress in automatic chest x-ray radiology report generation." Patterns 4.9 (2023).

15 of 16

Reliability Test: Evaluating CLEAR on CLEAR-Bench

✅ Label Extraction Module

  • The LLM-based labeler achieves substantial gains over existing systems (CheXbert, CheXpert), improving positive condition detection (0.805 F1) by 15.8% and negative condition detection (0.744 F1) by 42.5%.

✅ Description Extraction Module

  • LLMs, especially GPT-4o, excel at fine-grained attribute extraction, reaching the highest radiologist rating of 0.940 / 1.000 for Recommendation and the lowest of 0.809 / 1.000 for Severity.

❤️‍🔥 Scoring Module

  • Metrics produced by CLEAR align well with radiologist ratings (up to 0.994).

Smit, Akshay, et al. "Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.

Irvin, Jeremy, et al. "Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison." Proceedings of the AAAI conference on artificial intelligence. Vol. 33. No. 01. 2019.

16 of 16

Scan For Open-Sourced Artifacts!

Find Us at the Poster Session!

Or email yuyang2001@uchicago.edu