1 of 15

EnhanceMyPrompt: Rewriting Chat Queries for Effective Response Generation from LLMs

Tushar Abhishek*, Manas Jain*, Shishir Hardia, Shreevignesh Suriyanarayanan, Sandra Anil, Rushabh Gandhi, Manish Gupta

Applied Research Paper Track

2 of 15

EnhanceMyPrompt: Rewriting Chat Queries for Effective Response Generation from LLMs. CIKM 2025.

Problem Statement: Short & Ambiguous Chat Queries

Observations: ~70% of Microsoft Copilot queries are broad and short (< 10 words).

    • “write a poem”
    • “create an image of a dog”
    • “swiss vacation plan”
    • “write a good introduction for my lab report”

Consequences:

    • Vague LLM responses → Generic, unhelpful answers
    • Multiple iterations needed → Users follow up with clarifying queries
    • Longer task completion times → Poor user experience

Key Question: Can we semi-automatically enhance short queries into specific, well-formed ones with clear intent?

Figure: example chat interfaces of Microsoft Copilot, ChatGPT, and Google Gemini.

3 of 15


Our Solution: Enhance My Prompt (EMP)

Novel Problem Formulation

  • Unlike query expansion: We add relevant sub-intents/constraints, not just related terms
  • Unlike prompt optimization: We enhance with detail, not compress
  • Unlike prompt engineering: We don't change prompting strategy

Enhance My Prompt System

  1. Uses Small Language Models (SLMs) for efficiency
  2. Human-in-the-loop approach for editing enhanced drafts
  3. 4 levels of enhancement with increasing detail
  4. Practical deployment with low latency (~1.8s per query)

Example Transformation:

Original: “create an image of rhino playing football”

Enhanced (Level 4): “create a <photo-realistic> image in <bright colors> of a rhinoceros wearing a <red jersey> playing in a <football stadium>”

Users can click the “Enhance My Prompt” button to invoke the system.
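The angle-bracket placeholders in the enhanced draft above could be handled client-side with a small helper like this sketch (the regex and function names are our own, not from the paper):

```python
import re

# Hypothetical helper: extract <placeholder> spans from an enhanced query
# draft and substitute user-chosen values, following the format shown above.
PLACEHOLDER_RE = re.compile(r"<([^<>]+)>")

def extract_placeholders(draft: str) -> list[str]:
    """Return the editable placeholder texts in order of appearance."""
    return PLACEHOLDER_RE.findall(draft)

def fill_placeholders(draft: str, choices: dict[str, str]) -> str:
    """Replace each placeholder with the user's choice, or keep its default text."""
    return PLACEHOLDER_RE.sub(lambda m: choices.get(m.group(1), m.group(1)), draft)

draft = ("create a <photo-realistic> image in <bright colors> of a rhinoceros "
         "wearing a <red jersey> playing in a <football stadium>")
print(extract_placeholders(draft))
print(fill_placeholders(draft, {"red jersey": "blue jersey"}))
```

A UI would surface `extract_placeholders` as editable chips and call `fill_placeholders` once the user confirms their choices.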

4 of 15


Users can analyze, edit, or choose from the placeholder options in the enhanced prompt before submitting their query to Copilot.

5 of 15


Four Levels of Enhancements

Key Advantages:

  1. Users can edit drafts before submission
  2. Minimal cognitive load on users.
  3. EMP can predict future intents: saves up to 3 conversation turns in ~23% of cases.

6 of 15


Dataset & Enhanced Query Generation

  1. Microsoft Copilot Dataset (Proprietary)
    • Randomly sampled 500K English queries from Bing Chat logs (Sep-Oct 2023)
    • Query length: 4-40 words (avg 9.45 words)
    • 71% queries < 10 words
    • Train: 400K, Val: 50K, Test: 50K

  2. LMSYS+NQ Dataset (Public)
    • We filtered queries from LMSYS-Chat-1M (Real-world conversations with LLMs) and Google Natural Questions (Entity-centric questions)
    • Filter criteria: more than 4 words, first-turn queries, and non-toxic (tagged via the OpenAI moderation API)
    • Randomly sampled 50K instances from each of the data sources.
    • Train: 80K, Val: 10K, Test: 10K
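The filter criteria above can be sketched as a simple predicate; the toxicity check is stubbed here, whereas the paper tags queries via the OpenAI moderation API:

```python
# Sketch of the LMSYS+NQ filtering criteria. The toxicity check is a stub;
# the actual pipeline calls the OpenAI moderation API.
def is_toxic(query: str) -> bool:
    return False  # placeholder for an OpenAI moderation API call

def keep_query(query: str, turn_index: int) -> bool:
    """Keep first-turn, non-toxic queries longer than 4 words."""
    return len(query.split()) > 4 and turn_index == 0 and not is_toxic(query)

print(keep_query("write a poem", 0))                         # too short
print(keep_query("plan a two week trip across Switzerland", 0))
```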

Enhanced Query Generation

  • We utilized GPT-4 Turbo with Chain-of-Thought prompting and few-shot examples to generate enhanced queries across all levels and datasets. All prompts are publicly available: GitHub - tushar117/EYP-25: Enhance Your Prompt

  • Added specific instructions for sub-intent identification, placeholder generation, and value suggestion.

  • On average, enhanced drafts from GPT-4T are ∼2.9× and ∼2.6× longer than the original queries for the Copilot and LMSYS+NQ datasets respectively.

  • The enhanced query draft and the final enhanced query differ by around 1 word, indicating a small edit distance between them.

Average query sizes (in words) across different levels of enhancement for both datasets using GPT-4T.

7 of 15


Model Training & Architecture

We evaluated the following models:

  1. Supervised Finetuning
    • GPT2-Medium (355M parameters)
    • T5-Large (770M parameters)

  2. Instruction Finetuning
    • Phi-3-Mini-Instruct (3.8B parameters)
    • LLaMA-3-Instruct (8B parameters)

Training Setup:

  • Trained a separate model for each dataset and each level (4 levels)
  • Input: original query; output: enhanced query draft
  • We use LoRA for efficient fine-tuning of the Phi-3-Mini-Instruct and LLaMA-3-Instruct models
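The LoRA update used for the instruction-tuned models can be illustrated in miniature: instead of updating the full weight matrix W, a low-rank pair (A, B) is trained and the scaled product B·A is added back. Dimensions below are illustrative, not the actual Phi-3/LLaMA-3 shapes:

```python
# LoRA in miniature: W' = W + (alpha / r) * B @ A, where A and B have rank r.
# Only A and B are trained, which is what makes the fine-tuning efficient.
def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

d_out, d_in, r, alpha = 4, 4, 2, 4           # rank r << d; scaling alpha / r
W = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]  # frozen weight
B = [[0.1] * r for _ in range(d_out)]        # trained low-rank factor, d_out x r
A = [[0.2] * d_in for _ in range(r)]         # trained low-rank factor, r x d_in

delta = matmul(B, A)                         # rank-r update, d_out x d_in
scale = alpha / r
W_adapted = [[W[i][j] + scale * delta[i][j] for j in range(d_in)] for i in range(d_out)]
print(W_adapted[0][0])
```

In practice this is done with a library such as Hugging Face PEFT rather than by hand; the point here is only the low-rank structure of the update.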

Why SLMs?

  • Efficiency: Low latency for real-time deployment
  • Cost-effective: Lower inference costs than GPT-4
  • Practical: Can run on standard hardware (e.g. T4, V100 GPU, etc.)

8 of 15


Evaluation Metrics

To design the relevant metrics, we performed a user study with 10 users from diverse backgrounds and ethnicities worldwide. Based on the feedback, we propose three novel metrics to measure enhanced query drafts (evaluated using GPT-4 Turbo):

Enhanced Query Draft Quality (EQDQ)

  • Relevance - with original query
  • Engagement – creative & informative
  • Detail – appropriate length & info
  • Clarity – coherent & clear
  • Placeholder Quality
  • Placeholder Choices Quality

Additional User Effort (AUE)

  • Cognitive Load – Complexity of comprehending the draft
  • Edit Effort - Effort required to modify placeholders/text

LLM Response Quality Improvement (LRQI)

  • Relevance
  • Engagement
  • Detail
  • Clarity
  • Perceived Intelligence

EQDQ takes the user query and the enhanced query draft as input to measure 6 dimensions on a 4-point Likert scale.

AUE takes the enhanced query draft and the final enhanced query (after user/agent edits) as input.

LRQI measures how much the LLM’s response improves when the final enhanced query is used, by comparing the response quality against that from the original user query.
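One way LRQI could be aggregated is as the mean per-dimension score difference between the two responses. The actual judging is done by GPT-4 Turbo; this aggregation is our assumption, not the paper's exact formula:

```python
# Hypothetical LRQI aggregation: average the per-dimension quality difference
# between the response to the final enhanced query and the response to the
# original query. Positive values mean the enhanced query helped.
DIMENSIONS = ["relevance", "engagement", "detail", "clarity", "perceived_intelligence"]

def lrqi(orig_scores: dict[str, float], enh_scores: dict[str, float]) -> float:
    return sum(enh_scores[d] - orig_scores[d] for d in DIMENSIONS) / len(DIMENSIONS)

orig = {d: 2.5 for d in DIMENSIONS}  # judge scores for the original-query response
enh = {d: 2.6 for d in DIMENSIONS}   # judge scores for the enhanced-query response
print(round(lrqi(orig, enh), 2))
```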

9 of 15


Detailed Results - Copilot

10 of 15


Detailed Results - LMSYS+NQ

11 of 15


Key Results : Overall Model Performance

Best Performing Model: Instruction-tuned Phi-3-Mini

Overall EQDQ Scores - Copilot Dataset (Higher is Better):

| Model              | Level-1 | Level-2 | Level-3 | Level-4 | Average |
|--------------------|---------|---------|---------|---------|---------|
| Pretrained Phi-3   | 2.57    | 2.59    | 2.60    | 2.60    | 2.59    |
| Pretrained LLaMA-3 | 2.79    | 2.84    | 2.82    | 2.84    | 2.82    |
| GPT2-Medium (FT)   | 2.66    | 2.61    | 2.64    | 2.73    | 2.66    |
| T5-Large (FT)      | 2.73    | 2.68    | 2.71    | 2.80    | 2.73    |
| Phi-3 (IT)         | 2.80    | 2.79    | 2.81    | 2.86    | 2.82    |
| LLaMA-3 (IT)       | 2.77    | 2.75    | 2.79    | 2.84    | 2.79    |
| GPT-4T (reference) | 2.81    | 2.80    | 2.82    | 2.87    | 2.83    |

Overall LRQI Scores - Copilot Dataset (Higher is Better):

| Model              | Level-1 | Level-2 | Level-3 | Level-4 | Average |
|--------------------|---------|---------|---------|---------|---------|
| Pretrained Phi-3   | 0.06    | 0.08    | 0.05    | 0.05    | 0.06    |
| Pretrained LLaMA-3 | -0.01   | -0.08   | -0.07   | -0.14   | -0.08   |
| GPT2-Medium (FT)   | 0.13    | 0.09    | 0.10    | 0.12    | 0.11    |
| T5-Large (FT)      | 0.12    | 0.10    | 0.11    | 0.11    | 0.11    |
| Phi-3 (IT)         | 0.12    | 0.09    | 0.11    | 0.12    | 0.11    |
| LLaMA-3 (IT)       | 0.13    | 0.08    | 0.09    | 0.11    | 0.10    |
| GPT-4T (reference) | 0.13    | 0.09    | 0.11    | 0.13    | 0.12    |

AUE Scores – Copilot Dataset (Lower is Better):

  • Phi-3 (IT) achieves the lowest cognitive load and edit effort across all levels
  • Better than GPT-4T on user effort metrics

12 of 15


Impact & Practical Insights

Coverage of Future User Intents using best performing model: Instruction-tuned Phi-3-Mini

Experiment: 10K LMSYS conversations with ≥3 turns

Key Finding: EnhanceMyPrompt can predict user intents up to 3 turns ahead in ~23% of conversations

Latency (Phi-3 IT on V100 GPU):

  • Average: ~1.8 seconds per query
  • Batch size: 1
  • Practically deployable for real-time applications

Edit Distance Analysis (Enhanced Draft Query and Enhanced Final Query)

  • Enhanced drafts require minimal editing
  • Average edit distance: 2-3 words
  • Exact matches in 60-70% of cases for the best models
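The word-level edit distance reported above can be computed with standard Levenshtein dynamic programming, e.g.:

```python
# Word-level Levenshtein distance: minimum number of word insertions,
# deletions, and substitutions turning string a into string b.
def word_edit_distance(a: str, b: str) -> int:
    aw, bw = a.split(), b.split()
    prev = list(range(len(bw) + 1))          # distances for the empty prefix of a
    for i, wa in enumerate(aw, 1):
        cur = [i]
        for j, wb in enumerate(bw, 1):
            cur.append(min(prev[j] + 1,               # delete wa
                           cur[j - 1] + 1,            # insert wb
                           prev[j - 1] + (wa != wb))) # substitute (free if equal)
        prev = cur
    return prev[-1]

print(word_edit_distance(
    "create a photo-realistic image of a rhinoceros",
    "create a cartoon image of a rhinoceros"))
```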

Average inference times for 100 samples on an NVIDIA V100 GPU with batch size 1.

Edit distance (in words) between enhanced query draft and final enhanced query.

13 of 15


Examples

Enhanced query drafts from EnhanceMyPrompt models for the query “can you recommend some investment opportunities?”

14 of 15


Qualitative Analysis

Examples of errors made by our best EnhanceMyPrompt SLM model, instruction-tuned Phi-3.

15 of 15


Conclusion

  • Novel Problem: First to propose semi-automatic prompt enhancement for web copilots
  • 4-Level Framework: Progressive enhancement with sub-intents, placeholders, and value suggestions
  • Novel Metrics: EQDQ, AUE, and LRQI for comprehensive evaluation
  • Practical System: Instruction-tuned Phi-3-Mini achieves near-GPT-4T performance with low latency (~1.8s per query, deployable on a V100 GPU)
  • Open Source: Code, prompts, data and models publicly available: GitHub - tushar117/EYP-25: Enhance Your Prompt

Thank You

Tushar Abhishek, Manas Jain, Shishir Hardia, Shreevignesh Suriyanarayanan, Sandra Anil, Rushabh Gandhi, Manish Gupta