1 of 17

Chat-Ghosting: Methods for Auto-Completion in Dialog Systems

Anubhav Mandal, Sandeep Mishra, Bishal Santra, Tushar Abhishek, Pawan Goyal, Manish Gupta

Main Conference Track

2 of 17

Chat-Ghosting: Methods for Auto-Completion in Dialog Systems. EACL 2026.

Motivation: Why Chat-Ghosting?

The Rise of Conversational AI

  • LLM-based interfaces (ChatGPT, Microsoft Copilot, Gemini) are ubiquitous
  • Users need faster, easier ways to interact with chatbots
  • Ghosting: type-ahead completion that predicts the user's intended input
  • Benefits users with:
    • Slow typing speeds
    • Disabilities
    • Limited language proficiency

Problem: Chat-ghosting has received little attention from the NLP/ML community

[Figure: ghosting in Microsoft Copilot]

3 of 17


What is Chat-Ghosting?

Task Definition

Input

  • Partial prefix typed by the user (p)
  • Optional dialog history/context (h)

Output

  • A single completion suggestion (c)

Goal: Predict a completion such that [prefix; completion] is a valid response

Key Difference from Query Auto-Completion (QAC):

  • QAC shows ~10 ranked suggestions
  • Ghosting shows 1 inline suggestion (like Microsoft Copilot)
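The input/output contract above can be sketched as a tiny function (the function name and the lookup table are illustrative stand-ins, not the paper's implementation):

```python
from typing import Optional

def ghost(prefix: str, history: Optional[list] = None) -> Optional[str]:
    """Return at most ONE inline completion c such that prefix + c is a
    plausible full response, or None to show nothing (no trigger).
    `history` is the optional dialog context h; the toy table below
    stands in for a real model."""
    examples = {
        "how are ": "you doing today?",
        "thank ": "you so much!",
    }
    # Unlike QAC's ~10 ranked suggestions: one suggestion or nothing.
    return examples.get(prefix)

print(ghost("how are "))  # single inline suggestion
print(ghost("zzz"))       # None: suggestion not triggered
```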

[Figure: ghosting in Microsoft Copilot]

4 of 17


Research Gaps & Contributions

Gaps:

  • No standardized benchmarks for chat-ghosting
  • No comparative analysis of deep learning vs non-deep learning methods

Our Contributions:

  1. First comprehensive study on chat-ghosting for dialog systems
  2. Investigation of diverse methodologies (tries, n-grams, neural models, prompt engineering)
  3. Novel entropy-based dynamic early stopping strategy
  4. Benchmark on 4 public datasets with multiple evaluation metrics

5 of 17


Benchmark Datasets

1. Daily Dialog (DD) - Open-Domain Human Conversations

  • Natural, everyday English conversations between humans (daily activities, emotions, opinions, etc.)
  • Challenge: High novelty (94% unseen queries in test set)

2. DSTC7-Ubuntu (DU) - Technical Domain Conversations

  • Linux technical-support dialogues from the Ubuntu IRC channel (technical vocabulary, commands, troubleshooting discussions, etc.)
  • Challenge: Balancing seen (51.8%) vs unseen queries (48.2%)

3. Open Assistant (OASST) - Human-Bot Interactions

  • Human-annotated assistant-style conversations with AI systems
  • Focus on human utterances only (bot responses excluded)
  • Challenge: Increased utterance length and complexity

4. ShareGPT (SGPT) - Real-World LLM Conversations

  • Authentic user interactions with LLM chatbots via ShareGPT API
  • Longest utterances among all datasets
  • Challenge: Most difficult due to length (53 words avg) and diversity

S.No  Dataset  Type         Domain  Avg Words  Avg Chars  Train Size  Test Size
1     DD       Human-Human  Open    12.37      57         69,216      7,986
2     DU       Human-Human  Tech    13.48      74         549,002     5,588
3     OASST    Human-Bot    Mixed   20.36      115        19,421      981
4     SGPT     Human-Bot    Mixed   53.27      341        328,078     1,088

6 of 17


Methods Evaluated

We evaluated the following methods with/without context (previous utterances in conversation):

Standard QAC methods:

  • Most Popular Completions (MPC): Character-level trie built from training utterances
  • MPC++: Enhanced version combining a main trie + suffix trie
  • QueryBlazer (QB): Generative QAC system with a subword-level n-gram language model

Neural Language Models

  • GPT-2 (~124M params) – decoder-only model
  • T5-base (~220M params) – sequence-to-sequence model

Prompt Engineering

  • Phi-2 (~2.7B params) – Pretrained & Finetuned
  • Mistral-7B – Pretrained only
  • GPT-4

Context Integration: Prepending for neural models, reranking for tries/n-grams
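As a concrete illustration of the trie-based MPC idea above, here is a minimal character-level trie that completes a prefix with the most frequent matching training utterance (a sketch, not the paper's implementation):

```python
class CompletionTrie:
    """Minimal character-level trie in the spirit of MPC: store training
    utterances, then complete a prefix with the most popular utterance
    that extends it."""
    def __init__(self):
        self.children = {}
        self.count = 0        # frequency of utterances ending at this node
        self.utterance = None

    def insert(self, utt: str):
        node = self
        for ch in utt:
            node = node.children.setdefault(ch, CompletionTrie())
        node.count += 1
        node.utterance = utt

    def complete(self, prefix: str):
        node = self
        for ch in prefix:
            if ch not in node.children:
                return None   # unseen prefix: the trie cannot trigger
            node = node.children[ch]
        # DFS for the most frequent full utterance below this node.
        best, best_count = None, 0
        stack = [node]
        while stack:
            n = stack.pop()
            if n.count > best_count:
                best, best_count = n.utterance, n.count
            stack.extend(n.children.values())
        return best[len(prefix):] if best else None

trie = CompletionTrie()
for utt in ["how are you?", "how are you?", "how old are you?"]:
    trie.insert(utt)
print(trie.complete("how a"))  # most popular continuation: "re you?"
print(trie.complete("zzz"))    # unseen prefix: None
```

This also shows why tries excel on seen prefixes but cannot trigger at all on unseen ones, which is where generative models take over.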

7 of 17


Evaluation Metrics

Quality Metrics:

  • Match Rate (MR): Exact match with the ground truth
  • Partial Recall (P-Rec): Fraction of the ground truth that is matched
  • Partial Precision (P-Prec): Fraction of the prediction that is correct

User Experience Metrics

  • Trigger Rate (TR): How often suggestions are shown to users
  • Typing Effort Saved (TES): Reduction in user keystrokes, 1 − (keystrokes typed / total length of entered query)
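The quality and user-experience metrics can be illustrated on a single example as follows. The character-level, common-prefix matching used here is an assumption for illustration; the paper may match at a different granularity:

```python
def lcp_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def ghosting_metrics(pred: str, truth: str,
                     keystrokes_typed: int, total_len: int) -> dict:
    """Single-example versions of the slide's metrics."""
    m = lcp_len(pred, truth)
    return {
        "MR": float(pred == truth),                 # exact match
        "P-Rec": m / len(truth) if truth else 0.0,  # fraction of truth matched
        "P-Prec": m / len(pred) if pred else 0.0,   # fraction of prediction correct
        "TES": 1 - keystrokes_typed / total_len,    # typing effort saved
    }

m = ghosting_metrics(pred="you doing?", truth="you doing today?",
                     keystrokes_typed=8, total_len=24)
print(m)  # P-Prec = 0.9, TES ≈ 0.67: precise but incomplete suggestion
```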

Additional Evaluations (For Completeness)

  • METEOR: Semantic similarity metric for partial semantic matches
  • GPT-4o as Judge: LLM-based evaluation on 5-point scale (1=bad, 5=exact match)
    • Correlation with human evaluation: 0.57 (p-value: 2.2e-88)
  • Human Evaluation: 100 instances each from seen/unseen sets rated by annotators

8 of 17


Key Results on DD and DU datasets

Chat-Ghosting task for the Daily Dialog (DD) dataset. PT=Pretrained, FT=Finetuned

Chat-Ghosting task for the DSTC7-Ubuntu (DU) dataset. PT=Pretrained, FT=Finetuned

9 of 17


Key Results on OASST and SGPT datasets

Chat-Ghosting task for the Open Assistant (OASST) dataset. PT=Pretrained, FT=Finetuned

Chat-Ghosting task for the Share GPT (SGPT) dataset. PT=Pretrained, FT=Finetuned

10 of 17


Key Results on DD and DU datasets with Context

Contextual Chat-Ghosting task for the Daily Dialog (DD) dataset. PT=Pretrained, FT=Finetuned

Contextual Chat-Ghosting task for the DSTC7-Ubuntu (DU) dataset. PT=Pretrained, FT=Finetuned

11 of 17


Key Results on OASST and SGPT datasets with Context

Contextual Chat-Ghosting task for the Open Assistant (OASST) dataset. PT=Pretrained, FT=Finetuned

Contextual Chat-Ghosting task for the Share GPT (SGPT) dataset. PT=Pretrained, FT=Finetuned

12 of 17


Key Takeaways

On Human-Human Interactions datasets: DD & DU

  • N-gram > Neural (Unseen Prefixes): QB achieves the highest P-Prec (~43 DD, ~40 DU) and TES due to its short, precise, memory-driven completions.
  • Tries Dominate Seen Prefixes: MPC/MPC++ retrieve exact historical completions. MPC++ improves coverage via a suffix trie (with a slight trade-off on seen performance)
  • Zero-shot LLMs Underperform: Mistral-7B, GPT-4 ≪ Fine-tuned models. Training on domain logs is essential for production ghosting.

On Human-Bot Interactions datasets: OASST & SGPT

  • Longer utterances change the dynamics: Pretrained knowledge becomes more useful.
  • Neural models dominate on unseen prefixes: Mistral has the highest P-Rec and Phi-2 the highest MR & TES
  • Memory still wins on Seen Prefixes: MPC, MPC++, and QB outperform neural models even in human–bot settings.

Practical recommendation: use a hybrid system. Serve seen prefixes with MPC (memory-based precision), and unseen prefixes with T5 or QB depending on the latency budget.

    • T5 → longer completions, higher recall
    • QB → shorter, more precise completions with lower latency
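The hybrid recommendation above amounts to a simple router (all names here are illustrative; the memory is a toy dict standing in for an MPC trie, and the fallback stands in for T5/QB):

```python
def hybrid_ghost(prefix, seen_completions, generative_model):
    """Route seen prefixes to exact memory-based retrieval (MPC-style),
    and unseen prefixes to a generative fallback (T5 for longer,
    high-recall completions; QB for shorter, lower-latency ones)."""
    if prefix in seen_completions:        # seen prefix: memory wins
        return seen_completions[prefix]
    return generative_model(prefix)       # unseen prefix: generate

seen = {"how are ": "you doing?"}                  # toy MPC memory
fallback = lambda p: "<generated completion>"      # stand-in for T5/QB
print(hybrid_ghost("how are ", seen, fallback))    # retrieved from memory
print(hybrid_ghost("tell me ", seen, fallback))    # generated
```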

13 of 17


Entropy-Based Early Stopping for Neural Language Models

Problem: Long predictions are less likely to be accepted in full by the end user.

Solution:

  • Stop generation when model confidence drops (entropy threshold).
  • We apply entropy-based dynamic early stopping with thresholds of 0.6 and 3 during suggestion generation using T5 and GPT-2 on the unseen test sets of DD and DU in the non-contextual setting.

Benefit: +75% P-Prec improvement, +120% TES improvement with shorter predictions.
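The stopping rule can be sketched as follows: compute the Shannon entropy of the next-token distribution at each decoding step and halt once it exceeds the threshold. How this check is wired into T5/GPT-2 decoding is an assumption here; the thresholds 0.6 and 3 come from the slide:

```python
import math

def next_token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_stop(probs, threshold):
    """Entropy-based dynamic early stopping (sketch): halt generation once
    next-token entropy exceeds `threshold`, i.e., once the model's
    confidence drops."""
    return next_token_entropy(probs) > threshold

# Peaked distribution: model is confident, keep generating.
print(should_stop([0.97, 0.01, 0.01, 0.01], threshold=0.6))  # False
# Near-uniform distribution: model is uncertain, stop and emit
# the shorter (more precise) suggestion.
print(should_stop([0.25, 0.25, 0.25, 0.25], threshold=0.6))  # True
```

Stopping early trades completion length for precision, which is exactly the +75% P-Prec / +120% TES effect reported above.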

14 of 17


Trade-offs: Accuracy vs. Inference Latency

  • Low latency is a critical requirement for a practical ghosting feature.

  • Efficient Models: T5 and GPT-2 achieve inference within 100–150 ms on a V100 GPU (batch size = 1), making them viable for production.

  • Heavy Models: While billion-parameter models perform strongly on human-bot data, they suffer from high inference latency (1–3 seconds).

  • Future Deployment: Scaling these LLM solutions requires techniques like post-training quantization (GPTQ, AWQ) and optimized inference engines (TensorRT, vLLM)

Inference Latency for each chat-ghosting method (in milliseconds) at max Trigger Rate. PT=Pretrained, FT=Finetuned. cDD and cDU are Contextual datasets.

15 of 17


Non-Contextual Ghosting Examples

Examples of suffixes predicted by various non-contextual models for different prefixes.

16 of 17


Contextual Ghosting Examples

Examples of suffixes predicted by various contextual models for different prefixes.

17 of 17


Conclusion

  • We formalized chat-ghosting, established benchmarks across four datasets, and showed that a blend of memorization (tries) and generalization (T5/QB) is currently necessary.
  • We introduced entropy-based dynamic early stopping that adaptively halts generation when model confidence drops, achieving +75% improvement in P-Prec and +120% in TES.
  • Context integration significantly improves performance, especially for human-bot datasets (OASST, ShareGPT), with T5 benefiting most from conversational history.
  • Fine-tuning on domain logs is essential: zero-shot LLMs are ineffective; models with inherent memory capacity trained on historical data deliver reliable ghosting performance.
  • Open Source: Code, prompts, data, and models are publicly available

Thank You

Anubhav Mandal, Sandeep Mishra, Bishal Santra, Tushar Abhishek, Pawan Goyal, Manish Gupta