Generative AI can transform scientific papers into audience-adapted stories, but evaluating these narratives strains standard metrics. We propose StoryScore, a composite metric for AI-generated scientific stories, and analyse why hallucination detectors fail to distinguish pedagogical creativity from factual errors.
RQ1) How can AI story quality be evaluated beyond surface similarity, accounting for coherence and structure?
RQ2) How can hallucinations be detected reliably while still encouraging creative reformulation?
RQ3) Do existing detectors conflate legitimate abstraction with factual inconsistency?
RESEARCH GOALS
SCIENTIFIC STORY GENERATION
THE StoryScore METRIC
Repository: bit.ly/ai-sci-storyteller
KEY FINDINGS
LIMITATIONS & FUTURE WORKS
RESULTS
HALLUCINATION DETECTION
Hallucination or Creativity:
How to Evaluate AI-Generated Scientific Stories?
Alex Argese, Pasquale Lisena, Raphael Troncy
Scientific papers are transformed into persona-adapted stories via a two-stage pipeline:
(outline generation)
A unified score in [0,1] integrating six complementary signals:
METRIC | TYPE | OBJECTIVE
Context Recall | Lexical | Token overlap: fraction of paper vocabulary covered by the story
BERTScore | Semantic | Contextual embedding alignment (cosine similarity, F1)
Prompt Cleanliness | Structural | Absence of instruction leakage, JSON artifacts, and code fences
Title Coverage | Structural | Soft similarity of generated vs. outline section titles
No Redundancy | Fluency | Penalises degenerate trigram loops and repetition
No Hallucination | Factuality | spaCy NER: PERSON/ORG entities absent from the source
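The six signals above can be combined as follows; this is a minimal sketch, assuming each signal is already normalised to [0, 1] and weighted equally (the actual weighting and function names are illustrative assumptions, not the authors' implementation). The `context_recall` and `no_redundancy` helpers follow the table's definitions; the remaining signals are taken as given.

```python
def context_recall(paper_tokens: set[str], story_tokens: set[str]) -> float:
    """Fraction of the paper's vocabulary that the story covers."""
    if not paper_tokens:
        return 1.0
    return len(paper_tokens & story_tokens) / len(paper_tokens)


def no_redundancy(tokens: list[str]) -> float:
    """Penalise degenerate trigram loops: unique trigrams / total trigrams."""
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    if not trigrams:
        return 1.0
    return len(set(trigrams)) / len(trigrams)


def story_score(signals: dict[str, float]) -> float:
    """Unweighted mean of the six signals, each assumed to lie in [0, 1]."""
    expected = {"context_recall", "bertscore", "prompt_cleanliness",
                "title_coverage", "no_redundancy", "no_hallucination"}
    assert set(signals) == expected, "all six signals are required"
    return sum(signals.values()) / len(signals)
```

A story that repeats the same trigram in a loop drives `no_redundancy` towards 1/n, pulling the composite score down even when the other signals are perfect.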
Unlike summarisation, storytelling intentionally diverges from the source through metaphors and simplifications. We evaluated six detection approaches on our story corpus:
METHOD | WHAT IT DETECTS | KEY WEAKNESS | VERDICT
Capitalised Words | Surface-form mismatch | Flags abbreviations and creative capitalisations as errors | Too noisy
spaCy NER | Incorrect PERSON/ORG entities | Misses conceptual errors (wrong claims, invented datasets) | ✓ Chosen
MIRAGE | Rewrite-consistency instability | Penalises analogies and audience-adapted rephrasing | Too rigid
LLM-Judge (Qwen 7B) | Factual consistency | "Hallucinates hallucination": labels correct facts as errors | Unstable
LLM-Judge (GPT 5.1) | High-level reasoning errors | Overcautious: flags benign contextual expansions | Too strict
HHD (hybrid) | Entity + retrieval alignment | Dominant false positives; unstable threshold | Unreliable
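The chosen detector checks whether PERSON/ORG entities in the story are grounded in the source paper. A minimal sketch of that grounding logic, with entity extraction itself abstracted away (in practice it would come from an NER model such as spaCy's); the function names and the scoring formula are illustrative assumptions:

```python
def hallucinated_entities(source_ents: set[str], story_ents: set[str]) -> set[str]:
    """Entities mentioned in the story but absent from the source paper."""
    return story_ents - source_ents


def no_hallucination_score(source_ents: set[str], story_ents: set[str]) -> float:
    """1.0 when every story entity is grounded in the source; proportionally
    lower as ungrounded (potentially hallucinated) entities appear."""
    if not story_ents:
        return 1.0  # a story with no named entities cannot hallucinate one
    flagged = hallucinated_entities(source_ents, story_ents)
    return 1.0 - len(flagged) / len(story_ents)
```

Note that this check inherits the weakness listed above: an invented dataset or a wrong claim that introduces no new PERSON/ORG entity passes undetected.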
Text2Story’26