1 of 17

Chat-Ghosting: Methods for Auto-Completion in Dialog Systems

Anubhav Mandal, Sandeep Mishra, Bishal Santra, Tushar Abhishek, Pawan Goyal, Manish Gupta

Main Conference Track

2 of 17

Chat-Ghosting: Methods for Auto-Completion in Dialog Systems. EACL 2026.

Motivation: Why Chat-Ghosting?

The Rise of Conversational AI

  • LLM-based interfaces (ChatGPT, Microsoft Copilot, Gemini) are ubiquitous
  • Users need faster, easier ways to interact with chatbots
  • Ghosting: type-ahead completion that predicts the user's intended input
  • Benefits users with:
    • Slow typing speeds
    • Disabilities
    • Limited language proficiency

Problem: Chat-ghosting has received little attention from the NLP/ML community

[Figure: ghosting in Microsoft Copilot]

3 of 17


What is Chat-Ghosting?

Task Definition

Input

  • Partial prefix typed by the user (p)
  • Optional dialog history/context (h)

Output

  • A single completion suggestion (c)

Goal: Predict a completion such that [prefix; completion] is a valid response

Key Difference from Query Auto-Completion (QAC):

  • QAC shows ~10 ranked suggestions
  • Ghosting shows 1 inline suggestion (like Microsoft Copilot)
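The input/output contract above can be sketched as a tiny function (the function name and the lookup table are illustrative stand-ins, not the paper's implementation):

```python
from typing import Optional

def ghost(prefix: str, history: Optional[list] = None) -> Optional[str]:
    """Return at most ONE inline completion c such that prefix + c is a
    plausible full response, or None to show nothing (no trigger).
    `history` is the optional dialog context h; the toy table below
    stands in for a real model."""
    examples = {
        "how are ": "you doing today?",
        "thank ": "you so much!",
    }
    # Unlike QAC's ~10 ranked suggestions: one suggestion or nothing.
    return examples.get(prefix)

print(ghost("how are "))  # single inline suggestion
print(ghost("zzz"))       # None: suggestion not triggered
```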

[Figure: ghosting in Microsoft Copilot]

4 of 17


Research Gaps & Contributions

Gaps:

  • No standardized benchmarks for chat-ghosting
  • No comparative analysis of deep learning vs non-deep learning methods

Our Contributions:

  1. First comprehensive study on chat-ghosting for dialog systems
  2. Investigation of diverse methodologies (tries, n-grams, neural models, prompt engineering)
  3. Novel entropy-based dynamic early stopping strategy
  4. Benchmark on 4 public datasets with multiple evaluation metrics

5 of 17


Benchmark Datasets

1. Daily Dialog (DD) - Open-Domain Human Conversations

  • Natural, everyday English conversations between humans (daily activities, emotions, opinions, etc.)
  • Challenge: High novelty (94% unseen queries in test set)

2. DSTC7-Ubuntu (DU) - Technical Domain Conversations

  • Linux technical-support dialogues from the Ubuntu IRC channel (technical vocabulary, commands, troubleshooting discussions, etc.)
  • Challenge: Balancing seen (51.8%) vs unseen queries (48.2%)

3. Open Assistant (OASST) - Human-Bot Interactions

  • Human-annotated assistant-style conversations with AI systems
  • Focus on human utterances only (bot responses excluded)
  • Challenge: Increased utterance length and complexity

4. ShareGPT (SGPT) - Real-World LLM Conversations

  • Authentic user interactions with LLM chatbots via ShareGPT API
  • Longest utterances among all datasets
  • Challenge: Most difficult due to length (53 words avg) and diversity

S.No  Dataset  Type         Domain  Avg Words  Avg Chars  Train Size  Test Size
1     DD       Human-Human  Open    12.37      57         69,216      7,986
2     DU       Human-Human  Tech    13.48      74         549,002     5,588
3     OASST    Human-Bot    Mixed   20.36      115        19,421      981
4     SGPT     Human-Bot    Mixed   53.27      341        328,078     1,088

6 of 17


Methods Evaluated

We evaluated the following methods with/without context (previous utterances in conversation):

Standard QAC methods:

  • Most Popular Completions (MPC): Character-level trie built from training utterances
  • MPC++: Enhanced version combining a main trie + suffix trie
  • QueryBlazer (QB): Generative QAC system with a subword-level n-gram language model

Neural Language Models

  • GPT-2 (~124M params) – decoder-only model
  • T5-base (~220M params) – sequence-to-sequence model

Prompt Engineering

  • Phi-2 (~2.7B params) – Pretrained & Finetuned
  • Mistral-7B – Pretrained only
  • GPT-4

Context Integration: Prepending for neural models, reranking for tries/n-grams
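As a concrete illustration of the trie-based MPC idea above, here is a minimal character-level trie that completes a prefix with the most frequent matching training utterance (a sketch, not the paper's implementation):

```python
class CompletionTrie:
    """Minimal character-level trie in the spirit of MPC: store training
    utterances, then complete a prefix with the most popular utterance
    that extends it."""
    def __init__(self):
        self.children = {}
        self.count = 0        # frequency of utterances ending at this node
        self.utterance = None

    def insert(self, utt: str):
        node = self
        for ch in utt:
            node = node.children.setdefault(ch, CompletionTrie())
        node.count += 1
        node.utterance = utt

    def complete(self, prefix: str):
        node = self
        for ch in prefix:
            if ch not in node.children:
                return None   # unseen prefix: the trie cannot trigger
            node = node.children[ch]
        # DFS for the most frequent full utterance below this node.
        best, best_count = None, 0
        stack = [node]
        while stack:
            n = stack.pop()
            if n.count > best_count:
                best, best_count = n.utterance, n.count
            stack.extend(n.children.values())
        return best[len(prefix):] if best else None

trie = CompletionTrie()
for utt in ["how are you?", "how are you?", "how old are you?"]:
    trie.insert(utt)
print(trie.complete("how a"))  # most popular continuation: "re you?"
print(trie.complete("zzz"))    # unseen prefix: None
```

This also shows why tries excel on seen prefixes but cannot trigger at all on unseen ones, which is where generative models take over.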

7 of 17


Evaluation Metrics

Quality Metrics:

  • Match Rate (MR): Exact match with the ground truth
  • Partial Recall (P-Rec): Fraction of the ground truth that is matched
  • Partial Precision (P-Prec): Fraction of the prediction that is correct

User Experience Metrics

  • Trigger Rate (TR): How often suggestions are shown to users
  • Typing Effort Saved (TES): Reduction in user keystrokes, 1 − (keystrokes typed / total length of entered query)
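The quality and user-experience metrics can be illustrated on a single example as follows. The character-level, common-prefix matching used here is an assumption for illustration; the paper may match at a different granularity:

```python
def lcp_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def ghosting_metrics(pred: str, truth: str,
                     keystrokes_typed: int, total_len: int) -> dict:
    """Single-example versions of the slide's metrics."""
    m = lcp_len(pred, truth)
    return {
        "MR": float(pred == truth),                 # exact match
        "P-Rec": m / len(truth) if truth else 0.0,  # fraction of truth matched
        "P-Prec": m / len(pred) if pred else 0.0,   # fraction of prediction correct
        "TES": 1 - keystrokes_typed / total_len,    # typing effort saved
    }

m = ghosting_metrics(pred="you doing?", truth="you doing today?",
                     keystrokes_typed=8, total_len=24)
print(m)  # P-Prec = 0.9, TES ≈ 0.67: precise but incomplete suggestion
```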

Additional Evaluations (For Completeness)

  • METEOR: Semantic similarity metric for partial semantic matches
  • GPT-4o as Judge: LLM-based evaluation on 5-point scale (1=bad, 5=exact match)
    • Correlation with human evaluation: 0.57 (p-value: 2.2e-88)
  • Human Evaluation: 100 instances each from seen/unseen sets rated by annotators

8 of 17


Key Results on DD and DU datasets

Chat-Ghosting task for the Daily Dialog (DD) dataset. PT=Pretrained, FT=Finetuned

Chat-Ghosting task for the DSTC7-Ubuntu (DU) dataset. PT=Pretrained, FT=Finetuned

9 of 17


Key Results on OASST and SGPT datasets

Chat-Ghosting task for the Open Assistant (OASST) dataset. PT=Pretrained, FT=Finetuned

Chat-Ghosting task for the Share GPT (SGPT) dataset. PT=Pretrained, FT=Finetuned

10 of 17


Key Results on DD and DU datasets with Context

Contextual Chat-Ghosting task for the Daily Dialog (DD) dataset. PT=Pretrained, FT=Finetuned

Contextual Chat-Ghosting task for the DSTC7-Ubuntu (DU) dataset. PT=Pretrained, FT=Finetuned

11 of 17


Key Results on OASST and SGPT datasets with Context

Contextual Chat-Ghosting task for the Open Assistant (OASST) dataset. PT=Pretrained, FT=Finetuned

Contextual Chat-Ghosting task for the Share GPT (SGPT) dataset. PT=Pretrained, FT=Finetuned

12 of 17


Key Takeaways

On Human-Human Interactions datasets: DD & DU

  • N-gram > Neural (Unseen Prefixes): QB achieves the highest P-Prec (~43 DD, ~40 DU) and TES due to its short, precise, memory-driven completions.
  • Tries Dominate Seen Prefixes: MPC/MPC++ retrieve exact historical completions. MPC++ improves coverage via a suffix trie (with a slight trade-off on seen performance)
  • Zero-shot LLMs Underperform: Mistral-7B, GPT-4 ≪ Fine-tuned models. Training on domain logs is essential for production ghosting.

On Human-Bot Interactions datasets: OASST & SGPT

  • Longer utterances change the dynamics: Pretrained knowledge becomes more useful.
  • Neural models dominate on unseen prefixes: Mistral has the highest P-Rec and Phi-2 the highest MR & TES
  • Memory still wins on Seen Prefixes: MPC, MPC++, and QB outperform neural models even in human–bot settings.

Practical recommendation: use a hybrid system. Serve seen prefixes with MPC (memory-based precision), and unseen prefixes with T5 or QB depending on the latency budget.

    • T5 → longer completions, higher recall
    • QB → shorter, more precise completions with lower latency
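The hybrid recommendation above amounts to a simple router (all names here are illustrative; the memory is a toy dict standing in for an MPC trie, and the fallback stands in for T5/QB):

```python
def hybrid_ghost(prefix, seen_completions, generative_model):
    """Route seen prefixes to exact memory-based retrieval (MPC-style),
    and unseen prefixes to a generative fallback (T5 for longer,
    high-recall completions; QB for shorter, lower-latency ones)."""
    if prefix in seen_completions:        # seen prefix: memory wins
        return seen_completions[prefix]
    return generative_model(prefix)       # unseen prefix: generate

seen = {"how are ": "you doing?"}                  # toy MPC memory
fallback = lambda p: "<generated completion>"      # stand-in for T5/QB
print(hybrid_ghost("how are ", seen, fallback))    # retrieved from memory
print(hybrid_ghost("tell me ", seen, fallback))    # generated
```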

13 of 17


Entropy-Based Early Stopping for Neural Language Models

Problem: Long predictions are less likely to be accepted in full by the end user.

Solution:

  • Stop generation when model confidence drops (entropy threshold).
  • We apply entropy-based dynamic early stopping with thresholds of 0.6 and 3 during suggestion generation using T5 and GPT-2 on the unseen test sets of DD and DU in the non-contextual setting.

Benefit: +75% P-Prec improvement, +120% TES improvement with shorter predictions.
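The stopping rule can be sketched as follows: compute the Shannon entropy of the next-token distribution at each decoding step and halt once it exceeds the threshold. How this check is wired into T5/GPT-2 decoding is an assumption here; the thresholds 0.6 and 3 come from the slide:

```python
import math

def next_token_entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_stop(probs, threshold):
    """Entropy-based dynamic early stopping (sketch): halt generation once
    next-token entropy exceeds `threshold`, i.e., once the model's
    confidence drops."""
    return next_token_entropy(probs) > threshold

# Peaked distribution: model is confident, keep generating.
print(should_stop([0.97, 0.01, 0.01, 0.01], threshold=0.6))  # False
# Near-uniform distribution: model is uncertain, stop and emit
# the shorter (more precise) suggestion.
print(should_stop([0.25, 0.25, 0.25, 0.25], threshold=0.6))  # True
```

Stopping early trades completion length for precision, which is exactly the +75% P-Prec / +120% TES effect reported above.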

14 of 17


Trade-offs: Accuracy vs. Inference Latency

  • Low latency is a critical requirement for a practical ghosting feature.

  • Efficient Models: T5 and GPT-2 achieve inference within 100–150 ms on a V100 GPU (batch size = 1), making them viable for production.

  • Heavy Models: While billion-parameter models perform strongly on human-bot data, they suffer from high inference latency (1–3 seconds).

  • Future Deployment: Scaling these LLM solutions requires techniques like post-training quantization (GPTQ, AWQ) and optimized inference engines (TensorRT, vLLM)

Inference Latency for each chat-ghosting method (in milliseconds) at max Trigger Rate. PT=Pretrained, FT=Finetuned. cDD and cDU are Contextual datasets.

15 of 17


Non-Contextual Ghosting Examples

Examples of suffixes predicted by various non-contextual models for different prefixes.

16 of 17


Contextual Ghosting Examples

Examples of suffixes predicted by various contextual models for different prefixes.

17 of 17


Conclusion

  • We formalized chat-ghosting, established benchmarks across four datasets, and showed that a blend of memorization (tries) and generalization (T5/QB) is currently necessary.
  • We introduced entropy-based dynamic early stopping that adaptively halts generation when model confidence drops, achieving +75% improvement in P-Prec and +120% in TES.
  • Context integration significantly improves performance, especially for human-bot datasets (OASST, ShareGPT), with T5 benefiting most from conversational history.
  • Fine-tuning on domain logs is essential: zero-shot LLMs are ineffective; models with inherent memory capacity trained on historical data deliver reliable ghosting performance.
  • Open Source: Code, prompts, data, and models are publicly available

Thank You

Anubhav Mandal, Sandeep Mishra, Bishal Santra, Tushar Abhishek, Pawan Goyal, Manish Gupta