1 of 16

CLEAR: A Clinically Grounded Tabular Framework for Radiology Report Evaluation

Yuyang Jiang 1, Chacha Chen 1, Shengyuan Wang2, Feng Li 1, Zecong Tang 3, Benjamin M. Mervak4, Lydia Chelala 1, Christopher M. Straus 1, Reve Chahine4, Samuel G. Armato III 1*, Chenhao Tan 1*

1University of Chicago, 2Tsinghua University,

3Zhejiang University, 4University of Michigan

2 of 16

LLMs and VLLMs are rapidly hill-climbing on benchmarks…

◀️ Stanford HAI The 2025 AI Index Report

🔼 Wu, C., Zhang, X., Zhang, Y. et al. Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. Nat Commun 16, 7866 (2025).

3 of 16

But do these appealing numbers truly capture clinically aligned qualities?


🤔

4 of 16

Evaluation Methodology: Overview

5 of 16

Evaluation Methodology: Lexical Metrics

6 of 16

Evaluation Methodology: Clinical Efficacy Metrics

Pattern: Label / Entity Extraction + Metric Calculation

Fixed value set of Entities (Anatomy, Observation) and Relations

Jain, Saahil, et al. "RadGraph: Extracting Clinical Entities and Relations from Radiology Reports." NeurIPS (2021).
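The label-extraction + metric-calculation pattern above can be sketched in a few lines. This is an illustrative toy, not the actual CheXbert or RadGraph pipeline: the condition list and the substring-matching labeler are stand-in assumptions for a trained extraction model operating over a fixed value set.

```python
# Toy sketch of the clinical-efficacy-metric pattern:
# (1) map each report to labels over a fixed set of conditions,
# (2) compute a set-overlap metric (here, F1) against the reference.
CONDITIONS = ["Cardiomegaly", "Edema", "Pleural Effusion"]  # fixed value set

def extract_labels(report: str) -> dict[str, bool]:
    """Placeholder labeler: a real system (e.g., CheXbert) would use a
    trained model; here we just check for the condition name."""
    text = report.lower()
    return {c: c.lower() in text for c in CONDITIONS}

def f1_against_reference(generated: str, reference: str) -> float:
    gen, ref = extract_labels(generated), extract_labels(reference)
    tp = sum(gen[c] and ref[c] for c in CONDITIONS)
    fp = sum(gen[c] and not ref[c] for c in CONDITIONS)
    fn = sum(ref[c] and not gen[c] for c in CONDITIONS)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Candidate mentions only cardiomegaly; reference also has an effusion.
print(round(f1_against_reference(
    "Cardiomegaly is present.",
    "Cardiomegaly with small pleural effusion.",
), 3))
```

The key property of this pattern is that the metric can only see what the fixed label set can express, which motivates the finer-grained taxonomy discussed next.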

7 of 16

Evaluation Methodology: LLM-based Metrics

Pattern: Expertise-distilled Taxonomy + LLM-as-Judge

When we define the taxonomy, we also define the structure of the metric alignment test, directly constraining the structure of LLM-based metrics.

Yu, Feiyang, et al. "Evaluating progress in automatic chest x-ray radiology report generation." Patterns 4.9 (2023).
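The taxonomy + LLM-as-judge pattern can be sketched as follows. The category names and prompt wording here are illustrative assumptions, not the actual taxonomy of Yu et al. (2023) or any specific judge prompt; the point is that the taxonomy fixes the rubric, and the rubric fixes the shape of the judge's output.

```python
# Hypothetical error taxonomy; real taxonomies are distilled from
# radiologist expertise.
TAXONOMY = [
    "false prediction of finding",
    "omission of finding",
    "incorrect location of finding",
    "incorrect severity of finding",
]

def build_judge_prompt(reference: str, candidate: str) -> str:
    """Build an LLM-as-judge prompt whose structure mirrors the taxonomy:
    one rubric line (and hence one expected score) per category."""
    rubric = "\n".join(f"- {cat}" for cat in TAXONOMY)
    return (
        "You are a radiologist grading a generated report.\n"
        f"Reference report:\n{reference}\n\n"
        f"Candidate report:\n{candidate}\n\n"
        "Count the errors in each category below and return one "
        f"integer per line, in order:\n{rubric}"
    )

prompt = build_judge_prompt("No acute findings.", "Small right effusion.")
print(prompt.count("- "))  # one rubric line per taxonomy category
```

Note how the alignment test inherits the same structure: expert ratings must be collected per taxonomy category to compare against the judge's per-category counts.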


9 of 16

Evaluation Methodology: LLM-based Metrics

  1. Expertise-distilled Taxonomy

Why good?

  • Expertise Distillation: if we want the model to generate a good report, we first need to know what a good report looks like.
  • Structured Reporting

Why improve?

  • We want to generate an examination sheet for reports, just like the report from a physical examination: structured and interpretable

Curtis P. Langlotz. 2015. The Radiology Report: A Guide to Thoughtful Communication for Radiologists and Other Medical Professionals. CreateSpace Independent Publishing Platform, North Charleston, SC.

10 of 16

Evaluation Methodology: LLM-based Metrics

  2. LLM-as-Judge

Why good?

  • Evaluation is multidimensional, and we want a unified model type to operationalize it. → More efficient; robust across different NLP tasks
  • Agent as Evaluator → Easy to design; supports various models

Why improve?

  • Adapt the evaluator structure to the taxonomy structure
  • Optimize the software workflow and backend models to support stronger capabilities

11 of 16

CLEAR: Taxonomy Overview

12 of 16

CLEAR: Attribute-Level Radiology Report Evaluator

1️⃣ CLEAR transforms the coarse, single-dimensional taxonomy into a fine-grained, multidimensional structure.

2️⃣ CLEAR also provides interpretable outputs to assess report quality at the level of condition-attribute pairs.

3️⃣ CLEAR supports both commercial models (AzureOpenAI backend) and open-source local models (vLLM backend).
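Point 2️⃣ can be made concrete with a sketch of what interpretable, attribute-level output might look like. This is an illustrative data structure, not CLEAR's actual schema; the attribute names other than Severity and Recommendation (mentioned on the results slide) are hypothetical examples.

```python
from dataclasses import dataclass

@dataclass
class PairJudgment:
    """One interpretable judgment on a condition-attribute pair."""
    condition: str   # e.g., "Pleural Effusion"
    attribute: str   # e.g., "severity", "recommendation"
    match: bool      # does the candidate agree with the reference?

def attribute_accuracy(judgments: list[PairJudgment], attribute: str) -> float:
    """Aggregate pair-level judgments into a per-attribute score."""
    rows = [j for j in judgments if j.attribute == attribute]
    return sum(j.match for j in rows) / len(rows) if rows else 0.0

judgments = [
    PairJudgment("Pleural Effusion", "severity", True),
    PairJudgment("Pleural Effusion", "location", False),
    PairJudgment("Cardiomegaly", "severity", True),
]
print(attribute_accuracy(judgments, "severity"))  # 1.0
```

Because scores are kept at the condition-attribute level before aggregation, an error can be traced to a specific pair (here, the effusion's location) rather than being hidden inside a single scalar.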

13 of 16

How to evaluate an evaluator?

  1. Report Selection
  2. Radiologist Evaluation
  3. Metric / Evaluator Evaluation
  4. Alignment Test

No new taxonomy → no new metric → no better reports!

Yu, Feiyang, et al. "Evaluating progress in automatic chest x-ray radiology report generation." Patterns 4.9 (2023).

14 of 16

CLEAR-Bench: Attribute-Level Expert Alignment Dataset

Irvin, Jeremy, et al. "Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison." Proceedings of the AAAI conference on artificial intelligence. Vol. 33. No. 01. 2019.

Yu, Feiyang, et al. "Evaluating progress in automatic chest x-ray radiology report generation." Patterns 4.9 (2023).

15 of 16

Reliability Test: Evaluating CLEAR on CLEAR-Bench

✅ Label Extraction Module

  • The LLM-based labeler achieves substantial gains over existing systems (CheXbert, CheXpert), improving positive condition detection (0.805 F1) by 15.8% and negative condition detection (0.744 F1) by 42.5%.

✅ Description Extraction Module

  • LLMs, especially GPT-4o, excel at fine-grained attribute extraction, reaching the highest radiologist rating of 0.940 / 1.000 for Recommendation and the lowest of 0.809 / 1.000 for Severity.

❤️‍🔥 Scoring Module

  • Metrics produced by CLEAR align well with radiologist ratings (up to 0.994).

Smit, Akshay, et al. "Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT." Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2020.

Irvin, Jeremy, et al. "Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison." Proceedings of the AAAI conference on artificial intelligence. Vol. 33. No. 01. 2019.

16 of 16

Scan For Open-Sourced Artifacts!

Find Us at the Poster Session!

Or email yuyang2001@uchicago.edu