Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?

Alex Argese, Pasquale Lisena, Raphael Troncy
alex.argese@eurecom.fr

Generative AI can transform scientific papers into audience-adapted stories, but evaluating these narratives challenges standard metrics. We propose StoryScore, a composite metric for AI-generated scientific stories, and analyse why hallucination detectors fail to distinguish pedagogical creativity from factual errors.

RESEARCH GOALS

RQ1) How can we evaluate AI story quality beyond surface similarity, taking coherence and structure into account?

RQ2) How can we reliably detect hallucinations while still encouraging creative reformulation?

RQ3) Do existing detectors conflate legitimate abstraction with factual inconsistency?

SCIENTIFIC STORY GENERATION

Scientific papers are transformed into persona-adapted stories via a two-stage pipeline: (1) outline generation, then (2) story generation that expands the outline for the target persona. A minimal sketch of this pipeline is given below.
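The following sketch assumes Python and a hypothetical `generate(prompt)` helper standing in for any LLM call; the prompts and the 5-section outline length are illustrative, not the authors' exact setup:

```python
def generate(prompt: str) -> str:
    """Placeholder for an LLM text-generation call (plug in your own model here)."""
    raise NotImplementedError

def make_outline(paper_text: str, persona: str) -> list[str]:
    """Stage 1: propose story section titles adapted to the target persona."""
    prompt = (
        f"Read the paper below and propose 5 story section titles for a {persona} audience.\n"
        f"Return one title per line.\n\n{paper_text}"
    )
    return [line.strip() for line in generate(prompt).splitlines() if line.strip()]

def write_story(paper_text: str, persona: str, outline: list[str]) -> str:
    """Stage 2: expand each outline title into a persona-adapted story section."""
    sections = []
    for title in outline:
        prompt = (
            f"Write the story section '{title}' for a {persona} audience, "
            f"grounded only in this paper:\n\n{paper_text}"
        )
        sections.append(f"{title}\n{generate(prompt)}")
    return "\n\n".join(sections)
```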

THE StoryScore METRIC

A unified score in [0,1] integrating six complementary signals:

METRIC             | TYPE       | OBJECTIVE
Context Recall     | Lexical    | Token overlap of paper vocabulary covered by story
BERTScore          | Semantic   | Contextual embedding alignment (cosine, F1)
Prompt Cleanliness | Structural | Absence of instruction leakage, JSON artifacts, fences
Title Coverage     | Structural | Soft similarity of generated vs. outline section titles
No Redundancy      | Fluency    | Penalises degenerative trigram loops and repetition
No Hallucination   | Factuality | SpaCy NER: PERSON/ORG entities absent from source
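A minimal sketch of how such signals could be computed and combined; the distinct-trigram redundancy measure and the unweighted-mean aggregation are assumptions for illustration, not the exact StoryScore formulas:

```python
from statistics import mean

def no_redundancy(text: str) -> float:
    """Fraction of distinct trigrams among all trigrams in the story.

    Degenerative loops repeat the same trigrams and push this toward 0;
    varied text stays near 1. StoryScore's actual penalty may differ.
    """
    tokens = text.lower().split()
    trigrams = list(zip(tokens, tokens[1:], tokens[2:]))
    if not trigrams:
        return 1.0
    return len(set(trigrams)) / len(trigrams)

def story_score(signals: dict[str, float]) -> float:
    """Combine the six per-signal scores (each in [0,1]) into one composite value.

    Assumption: an unweighted mean; the aggregation and any per-signal
    weights are not specified here.
    """
    expected = {
        "context_recall", "bertscore", "prompt_cleanliness",
        "title_coverage", "no_redundancy", "no_hallucination",
    }
    missing = expected - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    return mean(min(1.0, max(0.0, signals[name])) for name in sorted(expected))
```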

HALLUCINATION DETECTION

Unlike summarisation, storytelling intentionally diverges from the source through metaphors and simplifications. We evaluated five detection approaches on our story corpus:

METHOD              | WHAT IT DETECTS                 | KEY WEAKNESS                                                 | VERDICT
Capitalised Words   | Surface-form mismatch           | Flags abbreviations & creative capitalisations as errors     | Too noisy
SpaCy NER           | Incorrect PERSON/ORG entities   | Misses conceptual errors (wrong claims, invented datasets)   | ✓ Chosen
MIRAGE              | Rewrite-consistency instability | Penalises analogies and audience-adapted rephrasing          | Too rigid
LLM-Judge (Qwen 7B) | Factual consistency             | "Hallucinates hallucination": labels correct facts as errors | Unstable
LLM-Judge (GPT 5.1) | High-level reasoning errors     | Overcautious: flags benign contextual expansions             | Too strict
HHD (hybrid)        | Entity + retrieval alignment    | Dominant false positives; threshold unstable                 | Unreliable
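The chosen entity-based check can be sketched as follows, assuming the standard spaCy API and the en_core_web_sm model; the matching and scoring rules are illustrative assumptions, not the authors' exact implementation:

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def unsupported_entities(story: str, paper: str) -> set[str]:
    """PERSON/ORG entities that appear in the story but never in the source paper."""
    paper_lower = paper.lower()
    return {
        ent.text
        for ent in nlp(story).ents
        if ent.label_ in {"PERSON", "ORG"} and ent.text.lower() not in paper_lower
    }

def no_hallucination_score(story: str, paper: str) -> float:
    """Map unsupported entities to a [0,1] signal (more flags -> lower score)."""
    story_entities = [e for e in nlp(story).ents if e.label_ in {"PERSON", "ORG"}]
    if not story_entities:
        return 1.0
    return 1.0 - len(unsupported_entities(story, paper)) / len(story_entities)
```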

RESULTS & KEY FINDINGS

  • Fine-tuning Qwen2.5 on 76 stories improved StoryScore and eliminated prompt leakage.
  • The pre-trained model scored well on the semantic metric (BERTScore), but manual review revealed prompt leakage and redundancy.
  • StoryScore correlated strongly with qualitative assessments of quality, control, and factual grounding.

LIMITATIONS & FUTURE WORKS

  • StoryScore is useful for comparing systems, but it does not provide an absolute quality assessment; detecting hallucination remains difficult, both because stories are intentionally creative and because current detection methods are limited.
  • Future work will focus on improving StoryScore, validating it against human judgment, and improving the hallucination detector so it distinguishes acceptable abstraction from factual distortion.

Text2Story’26