Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?

Alex Argese, Pasquale Lisena, Raphael Troncy
alex.argese@eurecom.fr

Generative AI can transform scientific papers into audience-adapted stories, but evaluating these narratives challenges standard metrics. We propose StoryScore, a composite metric for AI-generated scientific stories, and analyse why hallucination detectors fail to distinguish pedagogical creativity from factual errors.

RESEARCH GOALS

RQ1) How can we evaluate AI story quality beyond surface similarity, taking coherence and structure into account?

RQ2) How can we reliably detect hallucinations while still encouraging creative reformulation?

RQ3) Do existing detectors conflate legitimate abstraction with factual inconsistency?

SCIENTIFIC STORY GENERATION

Scientific papers are transformed into persona-adapted stories via a two-stage pipeline: (1) outline generation, then (2) story generation that expands the outline for the target persona. A minimal sketch of this pipeline is given below.
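The following sketch assumes Python and a hypothetical `generate(prompt)` helper standing in for any LLM call; the prompts and the 5-section outline length are illustrative, not the authors' exact setup:

```python
def generate(prompt: str) -> str:
    """Placeholder for an LLM text-generation call (plug in your own model here)."""
    raise NotImplementedError

def make_outline(paper_text: str, persona: str) -> list[str]:
    """Stage 1: propose story section titles adapted to the target persona."""
    prompt = (
        f"Read the paper below and propose 5 story section titles for a {persona} audience.\n"
        f"Return one title per line.\n\n{paper_text}"
    )
    return [line.strip() for line in generate(prompt).splitlines() if line.strip()]

def write_story(paper_text: str, persona: str, outline: list[str]) -> str:
    """Stage 2: expand each outline title into a persona-adapted story section."""
    sections = []
    for title in outline:
        prompt = (
            f"Write the story section '{title}' for a {persona} audience, "
            f"grounded only in this paper:\n\n{paper_text}"
        )
        sections.append(f"{title}\n{generate(prompt)}")
    return "\n\n".join(sections)
```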

THE StoryScore METRIC

A unified score in [0,1] integrating six complementary signals:

METRIC             | TYPE       | OBJECTIVE
Context Recall     | Lexical    | Token overlap of paper vocabulary covered by story
BERTScore          | Semantic   | Contextual embedding alignment (cosine, F1)
Prompt Cleanliness | Structural | Absence of instruction leakage, JSON artifacts, fences
Title Coverage     | Structural | Soft similarity of generated vs. outline section titles
No Redundancy      | Fluency    | Penalises degenerative trigram loops and repetition
No Hallucination   | Factuality | SpaCy NER: PERSON/ORG entities absent from source
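A minimal sketch of how such signals could be computed and combined; the distinct-trigram redundancy measure and the unweighted-mean aggregation are assumptions for illustration, not the exact StoryScore formulas:

```python
from statistics import mean

def no_redundancy(text: str) -> float:
    """Fraction of distinct trigrams among all trigrams in the story.

    Degenerative loops repeat the same trigrams and push this toward 0;
    varied text stays near 1. StoryScore's actual penalty may differ.
    """
    tokens = text.lower().split()
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    if not trigrams:
        return 1.0
    return len(set(trigrams)) / len(trigrams)

def story_score(signals: dict[str, float]) -> float:
    """Combine the six per-signal scores (each in [0,1]) into one composite value.

    Assumption: an unweighted mean; the aggregation and any per-signal
    weights are not specified here.
    """
    expected = {
        "context_recall", "bertscore", "prompt_cleanliness",
        "title_coverage", "no_redundancy", "no_hallucination",
    }
    missing = expected - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    return mean(min(1.0, max(0.0, signals[name])) for name in sorted(expected))
```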

HALLUCINATION DETECTION

Unlike summarisation, storytelling intentionally diverges from the source through metaphors and simplifications. We evaluated five detection approaches on our story corpus:

METHOD              | WHAT IT DETECTS                 | KEY WEAKNESS                                                 | VERDICT
Capitalised Words   | Surface-form mismatch           | Flags abbreviations & creative capitalisations as errors     | Too noisy
SpaCy NER           | Incorrect PERSON/ORG entities   | Misses conceptual errors (wrong claims, invented datasets)   | ✓ Chosen
MIRAGE              | Rewrite-consistency instability | Penalises analogies and audience-adapted rephrasing          | Too rigid
LLM-Judge (Qwen 7B) | Factual consistency             | "Hallucinates hallucination": labels correct facts as errors | Unstable
LLM-Judge (GPT 5.1) | High-level reasoning errors     | Overcautious: flags benign contextual expansions             | Too strict
HHD (hybrid)        | Entity + retrieval alignment    | Dominant false positives; threshold unstable                 | Unreliable
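The chosen entity-based check can be sketched as follows, assuming the standard spaCy API and the en_core_web_sm model; the matching and scoring rules are illustrative assumptions, not the authors' exact implementation:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def unsupported_entities(story: str, paper: str) -> set[str]:
    """PERSON/ORG entities that appear in the story but never in the source paper."""
    paper_lower = paper.lower()
    return {
        ent.text
        for ent in nlp(story).ents
        if ent.label_ in {"PERSON", "ORG"} and ent.text.lower() not in paper_lower
    }

def no_hallucination_score(story: str, paper: str) -> float:
    """Map unsupported entities to a [0,1] signal (more flags -> lower score)."""
    story_entities = [e for e in nlp(story).ents if e.label_ in {"PERSON", "ORG"}]
    if not story_entities:
        return 1.0
    return 1.0 - len(unsupported_entities(story, paper)) / len(story_entities)
```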

RESULTS & KEY FINDINGS

  • Fine-tuning Qwen2.5 on 76 stories improved StoryScore and eliminated prompt leakage.
  • The pre-trained model scored well on the semantic metric (BERTScore), but manual review revealed prompt leakage and redundancy.
  • StoryScore correlated strongly with qualitative assessments of quality, control, and factual grounding.

LIMITATIONS & FUTURE WORKS

  • StoryScore is useful for comparing systems, but it does not provide an absolute quality assessment; detecting hallucination remains difficult, both because stories are intentionally creative and because current detection methods are limited.
  • Future work will focus on improving StoryScore, validating it against human judgment, and improving the hallucination detector so it distinguishes acceptable abstraction from factual distortion.

Text2Story’26