1 of 15

EnhanceMyPrompt: Rewriting Chat Queries for Effective Response Generation from LLMs

Tushar Abhishek*, Manas Jain*, Shishir Hardia, Shreevignesh Suriyanarayanan, Sandra Anil, Rushabh Gandhi, Manish Gupta

Applied Research Paper Track

2 of 15

EnhanceMyPrompt: Rewriting Chat Queries for Effective Response Generation from LLMs. CIKM 2025.

Problem Statement: Short & Ambiguous Chat Queries

Observations: ~70% of Microsoft Copilot queries are broad and short (< 10 words).

    • “write a poem”
    • “create an image of a dog”
    • “swiss vacation plan”
    • “write a good introduction for my lab report”

Consequences:

    • Vague LLM responses → Generic, unhelpful answers
    • Multiple iterations needed → Users follow up with clarifying queries
    • Longer task completion times → Poor user experience

Key Question: Can we semi-automatically enhance short queries into specific, well-formed ones with clear intent?

Figure: example chat interfaces of Microsoft Copilot, ChatGPT, and Google Gemini.

3 of 15


Our Solution: Enhance My Prompt (EMP)

Novel Problem Formulation

  • Unlike query expansion: We add relevant sub-intents/constraints, not just related terms
  • Unlike prompt optimization: We enhance with detail, not compress
  • Unlike prompt engineering: We don't change prompting strategy

Enhance My Prompt System

  1. Uses Small Language Models (SLMs) for efficiency
  2. Human-in-the-loop approach for editing enhanced drafts
  3. 4 levels of enhancement with increasing detail
  4. Practical deployment with low latency (~1.8s per query)

Example Transformation:

Original: “create an image of rhino playing football”

Enhanced (Level 4): “create a <photo-realistic> image in <bright colors> of a rhinoceros wearing a <red jersey> playing in a <football stadium>”

Users can click the “Enhance My Prompt” button to invoke the system.
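The angle-bracket placeholders in the enhanced draft above could be handled client-side with a small helper like this sketch (the regex and function names are our own, not from the paper):

```python
import re

# Hypothetical helper: extract <placeholder> spans from an enhanced query
# draft and substitute user-chosen values, following the format shown above.
PLACEHOLDER_RE = re.compile(r"<([^<>]+)>")

def extract_placeholders(draft: str) -> list[str]:
    """Return the editable placeholder texts in order of appearance."""
    return PLACEHOLDER_RE.findall(draft)

def fill_placeholders(draft: str, choices: dict[str, str]) -> str:
    """Replace each placeholder with the user's choice, or keep its default text."""
    return PLACEHOLDER_RE.sub(lambda m: choices.get(m.group(1), m.group(1)), draft)

draft = ("create a <photo-realistic> image in <bright colors> of a rhinoceros "
         "wearing a <red jersey> playing in a <football stadium>")
print(extract_placeholders(draft))
print(fill_placeholders(draft, {"red jersey": "blue jersey"}))
```

A UI would surface `extract_placeholders` as editable chips and call `fill_placeholders` once the user confirms their choices.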

4 of 15


Users can analyze, edit, or choose from the placeholder options in the enhanced prompt before submitting their query to Copilot.

5 of 15


Four Levels of Enhancements

Key Advantages:

  1. Users can edit drafts before submission
  2. Minimal cognitive load on users.
  3. EMP can predict future intents: saves up to 3 conversation turns in ~23% of cases.

6 of 15


Dataset & Enhanced Query Generation

  1. Microsoft Copilot Dataset (Proprietary)
    • Randomly sampled 500K English queries from Bing Chat logs (Sep-Oct 2023)
    • Query length: 4-40 words (avg 9.45 words)
    • 71% queries < 10 words
    • Train: 400K, Val: 50K, Test: 50K

  2. LMSYS+NQ Dataset (Public)
    • We filtered queries from LMSYS-Chat-1M (Real-world conversations with LLMs) and Google Natural Questions (Entity-centric questions)
    • Filter criteria: more than 4 words, first-turn queries, and non-toxic (tagged via the OpenAI moderation API)
    • Randomly sampled 50K instances from each of the data sources.
    • Train: 80K, Val: 10K, Test: 10K
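The filter criteria above can be sketched as a simple predicate; the toxicity check is stubbed here, whereas the paper tags queries via the OpenAI moderation API:

```python
# Sketch of the LMSYS+NQ filtering criteria. The toxicity check is a stub;
# the actual pipeline calls the OpenAI moderation API.
def is_toxic(query: str) -> bool:
    return False  # placeholder for an OpenAI moderation API call

def keep_query(query: str, turn_index: int) -> bool:
    """Keep first-turn, non-toxic queries longer than 4 words."""
    return len(query.split()) > 4 and turn_index == 0 and not is_toxic(query)

print(keep_query("write a poem", 0))                         # too short
print(keep_query("plan a two week trip across Switzerland", 0))
```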

Enhanced Query Generation

  • We utilized GPT-4 Turbo with Chain-of-Thought prompting and few-shot examples to generate enhanced queries across all levels and datasets. All prompts are publicly available: GitHub - tushar117/EYP-25: Enhance Your Prompt

  • Added specific instructions for sub-intent identification, placeholder generation, and value suggestion.

  • On average, enhanced drafts from GPT-4T are ∼2.9× and ∼2.6× longer than the original queries for the Copilot and LMSYS+NQ datasets respectively.

  • The enhanced query draft and the final enhanced query differ by around 1 word, indicating a small edit distance between them.

Average query sizes (in words) across different levels of enhancement for both datasets using GPT-4T.

7 of 15


Model Training & Architecture

We evaluated the following models:

  1. Supervised Finetuning
    • GPT2-Medium (355M parameters)
    • T5-Large (770M parameters)

  2. Instruction Finetuning
    • Phi-3-Mini-Instruct (3.8B parameters)
    • LLaMA-3-Instruct (8B parameters)

Training Setup:

  • Trained a separate model for each dataset and each level (4 levels)
  • Input: original query; output: enhanced query draft
  • We use LoRA for efficient fine-tuning of the Phi-3-Mini-Instruct and LLaMA-3-Instruct models
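The LoRA update used for the instruction-tuned models can be illustrated in miniature: instead of updating the full weight matrix W, a low-rank pair (A, B) is trained and the scaled product B·A is added back. Dimensions below are illustrative, not the actual Phi-3/LLaMA-3 shapes:

```python
# LoRA in miniature: W' = W + (alpha / r) * B @ A, where A and B have rank r.
# Only A and B are trained, which is what makes the fine-tuning efficient.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

d_out, d_in, r, alpha = 4, 4, 2, 4           # rank r << d; scaling alpha / r
W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen weight
B = [[0.1] * r for _ in range(d_out)]        # trained low-rank factor, d_out x r
A = [[0.2] * d_in for _ in range(r)]         # trained low-rank factor, r x d_in

delta = matmul(B, A)                         # rank-r update, d_out x d_in
scale = alpha / r
W_adapted = [[W[i][j] + scale * delta[i][j] for j in range(d_in)] for i in range(d_out)]
print(W_adapted[0][0])
```

In practice this is done with a library such as Hugging Face PEFT rather than by hand; the point here is only the low-rank structure of the update.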

Why SLMs?

  • Efficiency: Low latency for real-time deployment
  • Cost-effective: Lower inference costs than GPT-4
  • Practical: Can run on standard hardware (e.g. T4, V100 GPU, etc.)

8 of 15


Evaluation Metrics

To design the relevant metrics, we performed a user study with 10 users from diverse backgrounds and ethnicities worldwide. Based on the feedback, we propose three novel metrics to measure enhanced query drafts (evaluated using GPT-4 Turbo):

Enhanced Query Draft Quality (EQDQ)

  • Relevance - with original query
  • Engagement – creative & informative
  • Detail – appropriate length & info
  • Clarity – coherent & clear
  • Placeholder Quality
  • Placeholder Choices Quality

Additional User Effort (AUE)

  • Cognitive Load – Complexity of comprehending the draft
  • Edit Effort - Effort required to modify placeholders/text

LLM Response Quality Improvement (LRQI)

  • Relevance
  • Engagement
  • Detail
  • Clarity
  • Perceived Intelligence

EQDQ takes the user query and the enhanced query draft as input to measure 6 dimensions on a 4-point Likert scale.

AUE takes the enhanced query draft and the final enhanced query (after user/agent edits) as input.

LRQI measures how much the LLM’s response improves when the final enhanced query is used, by comparing the response quality against that from the original user query.
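One way LRQI could be aggregated is as the mean per-dimension score difference between the two responses. The actual judging is done by GPT-4 Turbo; this aggregation is our assumption, not the paper's exact formula:

```python
# Hypothetical LRQI aggregation: average the per-dimension quality difference
# between the response to the final enhanced query and the response to the
# original query. Positive values mean the enhanced query helped.
DIMENSIONS = ["relevance", "engagement", "detail", "clarity", "perceived_intelligence"]

def lrqi(orig_scores: dict[str, float], enh_scores: dict[str, float]) -> float:
    return sum(enh_scores[d] - orig_scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

orig = {d: 2.5 for d in DIMENSIONS}  # judge scores for the original-query response
enh = {d: 2.6 for d in DIMENSIONS}   # judge scores for the enhanced-query response
print(round(lrqi(orig, enh), 2))
```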

9 of 15


Detailed Results - Copilot

10 of 15


Detailed Results - LMSYS+NQ

11 of 15


Key Results : Overall Model Performance

Best Performing Model: Instruction-tuned Phi-3-Mini

Overall EQDQ Scores - Copilot Dataset (Higher is Better):

| Model              | Level-1 | Level-2 | Level-3 | Level-4 | Average |
|--------------------|---------|---------|---------|---------|---------|
| Pretrained Phi-3   | 2.57    | 2.59    | 2.60    | 2.60    | 2.59    |
| Pretrained LLaMA-3 | 2.79    | 2.84    | 2.82    | 2.84    | 2.82    |
| GPT2-Medium (FT)   | 2.66    | 2.61    | 2.64    | 2.73    | 2.66    |
| T5-Large (FT)      | 2.73    | 2.68    | 2.71    | 2.80    | 2.73    |
| Phi-3 (IT)         | 2.80    | 2.79    | 2.81    | 2.86    | 2.82    |
| LLaMA-3 (IT)       | 2.77    | 2.75    | 2.79    | 2.84    | 2.79    |
| GPT-4T (reference) | 2.81    | 2.80    | 2.82    | 2.87    | 2.83    |

Overall LRQI Scores - Copilot Dataset (Higher is Better):

| Model              | Level-1 | Level-2 | Level-3 | Level-4 | Average |
|--------------------|---------|---------|---------|---------|---------|
| Pretrained Phi-3   | 0.06    | 0.08    | 0.05    | 0.05    | 0.06    |
| Pretrained LLaMA-3 | -0.01   | -0.08   | -0.07   | -0.14   | -0.08   |
| GPT2-Medium (FT)   | 0.13    | 0.09    | 0.10    | 0.12    | 0.11    |
| T5-Large (FT)      | 0.12    | 0.10    | 0.11    | 0.11    | 0.11    |
| Phi-3 (IT)         | 0.12    | 0.09    | 0.11    | 0.12    | 0.11    |
| LLaMA-3 (IT)       | 0.13    | 0.08    | 0.09    | 0.11    | 0.10    |
| GPT-4T (reference) | 0.13    | 0.09    | 0.11    | 0.13    | 0.12    |

AUE Scores – Copilot Dataset (Lower is Better):

  • Phi-3 (IT) achieves the lowest cognitive load and edit effort across all levels
  • Better than GPT-4T on user effort metrics

12 of 15


Impact & Practical Insights

Coverage of Future User Intents using best performing model: Instruction-tuned Phi-3-Mini

Experiment: 10K LMSYS conversations with ≥3 turns

Key Finding: EnhanceMyPrompt can predict user intents up to 3 turns ahead in ~23% of conversations

Latency (Phi-3 IT on V100 GPU):

  • Average: ~1.8 seconds per query
  • Batch size: 1
  • Practically deployable for real-time applications

Edit Distance Analysis (Enhanced Draft Query and Enhanced Final Query)

  • Enhanced drafts require minimal editing
  • Average edit distance: 2-3 words
  • Exact matches in 60-70% of cases for the best models
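The word-level edit distance reported above can be computed with standard Levenshtein dynamic programming, e.g.:

```python
# Word-level Levenshtein distance: minimum number of word insertions,
# deletions, and substitutions turning string a into string b.
def word_edit_distance(a: str, b: str) -> int:
    aw, bw = a.split(), b.split()
    prev = list(range(len(bw) + 1))          # distances for the empty prefix of a
    for i, wa in enumerate(aw, 1):
        cur = [i]
        for j, wb in enumerate(bw, 1):
            cur.append(min(prev[j] + 1,               # delete wa
                           cur[j - 1] + 1,            # insert wb
                           prev[j - 1] + (wa != wb))) # substitute (free if equal)
        prev = cur
    return prev[-1]

print(word_edit_distance(
    "create a photo-realistic image of a rhinoceros",
    "create a cartoon image of a rhinoceros"))
```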

Average inference times for 100 samples on an NVIDIA V100 GPU with batch size 1.

Edit distance (in words) between enhanced query draft and final enhanced query.

13 of 15


Examples

Enhanced query drafts from EnhanceMyPrompt models for the query “can you recommend some investment opportunities?”

14 of 15


Qualitative Analysis

Examples of errors made by our best EnhanceMyPrompt SLM model, instruction-tuned Phi-3.

15 of 15


Conclusion

  • Novel Problem: First to propose semi-automatic prompt enhancement for web copilots
  • 4-Level Framework: Progressive enhancement with sub-intents, placeholders, and value suggestions
  • Novel Metrics: EQDQ, AUE, and LRQI for comprehensive evaluation
  • Practical System: Instruction-tuned Phi-3-Mini achieves near-GPT-4T performance with low latency (~1.8s per query, deployable on a V100 GPU)
  • Open Source: Code, prompts, data and models publicly available: GitHub - tushar117/EYP-25: Enhance Your Prompt

Thank You

Tushar Abhishek, Manas Jain, Shishir Hardia, Shreevignesh Suriyanarayanan, Sandra Anil, Rushabh Gandhi, Manish Gupta