1 of 20

Leveraging In-Context Learning and Retrieval-Augmented Generation for Automatic Question Generation in Educational Domains

Subhankar Maity, Aniket Deroy and Sudeshna Sarkar

Indian Institute of Technology Kharagpur

The 16th meeting of the Forum for Information Retrieval Evaluation (FIRE 2024)

DA-IICT, Gandhinagar

12th–15th December 2024

2 of 20

Motivation

  • Educational question generation is essential for personalized learning and assessment but remains time-consuming and cognitively demanding for educators, requiring a balance between contextual relevance and pedagogical soundness.

  • Many existing automated question generation methods produce out-of-context questions, reducing their effectiveness in educational applications.

  • Advanced techniques like In-Context Learning (ICL) and Retrieval-Augmented Generation (RAG) offer promising solutions for generating contextually relevant and high-quality questions.

  • Combining these techniques into a Hybrid Model can address existing limitations, enabling more reliable and context-aware automated question generation systems.

3 of 20

Contributions

  • A comparative analysis of In-Context Learning (ICL) and Retrieval-Augmented Generation (RAG) for automated question generation in the educational domain. We evaluate how each method performs individually in generating questions from textbook passages and assess their respective strengths and weaknesses.

  • A novel hybrid model that combines ICL and RAG to generate higher-quality, contextually accurate, and pedagogically aligned questions. Our hybrid approach leverages external knowledge through retrieval and incorporates in-context learning to guide the generation process.

  • A comprehensive evaluation of the proposed methods using both automated metrics (e.g., BLEU-4, ROUGE-L, METEOR, chrF, BERTScore) and human evaluation by educators. We assess grammaticality, appropriateness, relevance, complexity, and answerability, providing insights into the pedagogical value of each approach.

4 of 20

TASK DEFINITION

5 of 20

Problem Statement & Input-Output Representation

  • The task of Automatic Question Generation (AQG) can be formulated as follows: given an input passage 𝑃, the objective is to generate a question 𝑄 that is relevant to the content of 𝑃, contextually accurate, and aligned with educational goals.

  • The input to the model is a passage 𝑃, which can be a sentence, a paragraph, or a longer text excerpt from educational material. The output is a question 𝑄 that is relevant to 𝑃 and suitable for use in an educational setting. Formally, we define the task as learning a function:

𝑓 (𝑃) = 𝑄

where 𝑃 is the input passage, and 𝑄 is the generated question.

6 of 20

In-Context Learning (ICL)

In the In-Context Learning (ICL) paradigm, given an input passage 𝑃new and a set of 𝑘 few-shot examples {(𝑃1, 𝑄1), (𝑃2, 𝑄2), . . . , (𝑃𝑘, 𝑄𝑘)}, the model generates a new question 𝑄new corresponding to 𝑃new:

𝑄new = 𝑓 (𝑃new, {(𝑃1, 𝑄1), . . . , (𝑃𝑘, 𝑄𝑘)})

Here, the few-shot examples serve as prompts to guide the question generation process for the new passage.
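The few-shot setup above can be sketched as a simple prompt builder: the 𝑘 (passage, question) pairs are serialized ahead of the new passage, and a trailing "Question:" cues the model to complete with 𝑄new. The exact prompt wording used in the paper is an assumption here.

```python
def build_icl_prompt(examples, new_passage):
    """Format k (passage, question) pairs plus a new passage as a few-shot prompt."""
    parts = []
    for passage, question in examples:
        parts.append(f"Passage: {passage}\nQuestion: {question}\n")
    # The trailing "Question:" cues the model to generate Q_new.
    parts.append(f"Passage: {new_passage}\nQuestion:")
    return "\n".join(parts)

examples = [
    ("The Indus Valley Civilization flourished around 2500 BCE.",
     "Around which period did the Indus Valley Civilization flourish?"),
    ("Photosynthesis converts light energy into chemical energy.",
     "What does photosynthesis convert light energy into?"),
]
prompt = build_icl_prompt(examples, "The Mughal Empire was founded by Babur in 1526.")
```

The resulting string is what would be sent to the language model as a single prompt.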

7 of 20

Retrieval-Augmented Generation (RAG)

For Retrieval-Augmented Generation (RAG), the task is extended by incorporating an external retrieval mechanism.

Given a passage 𝑃, the model retrieves a set of relevant documents {𝑅1, 𝑅2, . . . , 𝑅𝑘} from an external corpus. These documents provide additional context, and the final question 𝑄 is generated as:

𝑄 = 𝑓 (𝑃, {𝑅1, 𝑅2, . . . , 𝑅𝑘})
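The retrieval step can be illustrated with a minimal sketch that scores corpus documents by word overlap with the passage and returns the top-𝑘. This toy lexical scorer stands in for a real retriever, which would use dense vector search (e.g., a FAISS index) instead.

```python
import re

def tokenize(text):
    """Lowercase and split a string into a set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(passage, corpus, k=2):
    """Return the k corpus documents with the highest word overlap with the passage."""
    query = tokenize(passage)
    return sorted(corpus, key=lambda d: len(query & tokenize(d)), reverse=True)[:k]

corpus = [
    "The Ganges river originates in the Himalayas.",
    "Economics studies the production and distribution of goods.",
    "The Himalayas form a mountain range in Asia.",
]
docs = retrieve("Where do the Himalayas rivers originate?", corpus, k=2)
```

Here the two Himalaya-related documents outrank the unrelated Economics one.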

8 of 20

Hybrid Model

Our proposed Hybrid Model combines the advantages of both ICL and RAG.

The model first retrieves a set of documents {𝑅1, . . . , 𝑅𝑘} for the input passage 𝑃, and then uses few-shot learning to generate the question 𝑄 based on both the passage and the retrieved documents:

𝑄 = 𝑓 (𝑃, {𝑅1, . . . , 𝑅𝑘}, {(𝑃1, 𝑄1), . . . , (𝑃𝑚, 𝑄𝑚)})

Here, the retrieval step enriches the context for question generation, while the 𝑚 few-shot examples help guide the model towards generating pedagogically relevant questions.

9 of 20

Dataset

  • The EduProbe dataset [1] contains 3,502 question-answer pairs across five subjects: History (858 pairs), Geography (861 pairs), Economics (802 pairs), Environmental Studies (606 pairs), and Science (375 pairs).

  • The dataset is curated from NCERT textbooks for standards 6th to 12th, covering diverse chapters with varying segment lengths.

  • For our experiments, we extracted only the context (or passage) and question, focusing on evaluating how well the models generate questions based on the provided contexts (or passages).

[1] Maity et al.: Harnessing the Power of Prompt-based Techniques for Generating School-Level Questions using Large Language Models. In Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation (Panjim, India) (FIRE ’23). Association for Computing Machinery, New York, NY, USA, 30–39. https://doi.org/10.1145/3632754.3632755
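The preprocessing described above can be sketched as follows: from each EduProbe record, keep only the context (passage) and the question, discarding the answer and any other fields. The field names ("context", "question", "answer", "subject") are assumptions about the dataset's schema.

```python
def to_passage_question_pairs(records):
    """Keep only (context, question) from question-answer records."""
    return [(r["context"], r["question"]) for r in records]

records = [
    {"context": "The Harappan cities had planned drainage systems.",
     "question": "What did the Harappan cities have?",
     "answer": "Planned drainage systems",
     "subject": "History"},
]
pairs = to_passage_question_pairs(records)
```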

10 of 20

METHODOLOGY

11 of 20

In-Context Learning (ICL) Approach

For the ICL approach, we use the GPT-4 model to generate questions based on a few-shot prompt.

Each prompt contains 𝑘 example input-output pairs {(𝑃1, 𝑄1), . . . , (𝑃𝑘, 𝑄𝑘)}, each pairing a passage with its corresponding question.

Given a new passage 𝑃new, the model generates a question 𝑄new using the few-shot examples (𝑘 = 3, 5, 7) as context. The general structure of the ICL prompt is as follows:
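Since the prompt itself is not reproduced in this text version, the following is a plausible illustration of the structure described above (the exact wording on the original slide is an assumption):

```text
Passage: <P_1>
Question: <Q_1>

...

Passage: <P_k>
Question: <Q_k>

Passage: <P_new>
Question:
```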

12 of 20

Retrieval-Augmented Generation (RAG) Approach

  • The RAG model utilizes BART as its generative backbone, enhanced with a FAISS-based retrieval module that searches through an extensive corpus of educational materials tailored for school-level subjects like History, Geography, Economics, Science, and Environmental Studies.

  • For a given passage 𝑃, the retrieval module identifies the most relevant documents {𝑅1, . . . , 𝑅𝑘} from the curated corpus. These documents are concatenated with 𝑃 and fed into a BART model fine-tuned on the EduProbe training set to generate a question 𝑄.

  • Fine-tuning BART on EduProbe adapts the model to generating questions from educational content, further enhancing its performance. Formally, question generation in RAG (𝑘 = 5) is defined as:

𝑄 = 𝑓 (𝑃, {𝑅1, . . . , 𝑅𝑘})
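The concatenation step above can be sketched as assembling a single input string for the fine-tuned BART model from the passage and its retrieved documents. The separator token and ordering are assumptions; BART conventionally uses "</s>" as a segment separator.

```python
def build_rag_input(passage, retrieved_docs, sep=" </s> "):
    """Concatenate the passage with its retrieved documents into one model input."""
    return sep.join([passage] + list(retrieved_docs))

text = build_rag_input(
    "The Constitution of India came into effect in 1950.",
    ["India became a republic on 26 January 1950.",
     "The Constituent Assembly drafted the Constitution."],
)
```

The resulting string is what would be tokenized and passed to the generative model.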

13 of 20

Hybrid Model

Our Hybrid Model combines the retrieval-based context enrichment of RAG with the few-shot learning mechanism of ICL using GPT-4.

First, relevant documents are retrieved for a given passage 𝑃.

Then, the few-shot learning mechanism uses these retrieved documents alongside the input passage to generate a more contextually accurate and pedagogically meaningful question. The hybrid approach can be mathematically defined as:

𝑄 = 𝑓 (𝑃, {𝑅1, . . . , 𝑅𝑘}, {(𝑃1, 𝑄1), . . . , (𝑃𝑚, 𝑄𝑚)})

Here, 𝑃 is the input passage, {𝑅1, . . . , 𝑅𝑘} are the retrieved documents (𝑘 = 5), and {(𝑃1, 𝑄1), . . . , (𝑃𝑚, 𝑄𝑚)} are the few-shot examples (𝑚 = 5) used to guide the question generation process.
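The Hybrid Model's prompt assembly can be sketched as follows: the 𝑘 retrieved documents are prepended as background context, followed by the 𝑚 few-shot (passage, question) examples and then the target passage. The exact layout fed to GPT-4 is an assumption.

```python
def build_hybrid_prompt(passage, retrieved_docs, examples):
    """Combine retrieved background, few-shot examples, and the target passage."""
    lines = ["Background:"]
    lines += [f"- {doc}" for doc in retrieved_docs]
    lines.append("")
    for p, q in examples:
        lines.append(f"Passage: {p}")
        lines.append(f"Question: {q}")
        lines.append("")
    lines.append(f"Passage: {passage}")
    lines.append("Question:")  # cue for the model to generate Q
    return "\n".join(lines)

prompt = build_hybrid_prompt(
    "Monsoon winds bring seasonal rainfall to the Indian subcontinent.",
    ["The southwest monsoon arrives in Kerala around June."],
    [("Soil erosion degrades agricultural land.",
      "What does soil erosion degrade?")],
)
```

Retrieval enriches the "Background" section, while the examples steer the question style, mirroring the two terms in the hybrid formulation.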

14 of 20

Baseline Models

We fine-tune the best-performing models from prior work (based on automated evaluation), namely the T5-large [1] and BART-large [1] architectures, on the EduProbe training dataset.

[1] Maity et al.: Harnessing the Power of Prompt-based Techniques for Generating School-Level Questions using Large Language Models. In Proceedings of the 15th Annual Meeting of the Forum for Information Retrieval Evaluation (Panjim, India) (FIRE ’23). Association for Computing Machinery, New York, NY, USA, 30–39. https://doi.org/10.1145/3632754.3632755

15 of 20

Automatic Evaluation Results

16 of 20

Human Evaluation Results

17 of 20

Output Samples

18 of 20

Output Samples

19 of 20

Conclusions

  • This study investigated advanced methods for automated question generation (AQG) in education, focusing on In-Context Learning (ICL), Retrieval-Augmented Generation (RAG), and a novel Hybrid Model.

  • Results demonstrated that ICL (𝑘 = 7) excels on automated metrics (ROUGE-L, METEOR, chrF, BERTScore), while the Hybrid Model (𝑘 = 5, 𝑚 = 5) performed best in human evaluations of grammaticality, appropriateness, relevance, and complexity.

  • The Hybrid Model effectively combines retrieval and in-context learning, generating questions with greater depth and contextual alignment, surpassing baseline models like T5-large and BART-large.

  • Future work will evaluate the models on diverse datasets, integrate feedback from educators and students, and explore cutting-edge LLMs like Gemini and Llama-2-70B to further enhance AQG systems.

20 of 20

Thank you!