Excerpts and Commentary on DeepMind Paper

Excerpts from “Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach”

Irina Jurenka & Markus Kunesch, et al. | Link to Original Paper

Note: Below are the parts of the paper we thought were particularly interesting from a learning design / learning science perspective. There is a lot more in the paper about the specifics of how they built the customized model and the different approaches they took to evaluating performance.

1. Introduction

2. Participatory approach

3. Improving Gemini for education

4. Measuring Pedagogy in Gen AI

5. Human evaluations

6. Automatic Evaluations

7. Learning from real-world interactions

8. Evaluating particular educational capabilities

9. Responsible development

Appendix D. Challenges with prompting gen AI for pedagogy

Appendix F. Challenges with eliciting human preferences for pedagogy

Appendix G. Sociotechnical limitations of text-based gen AI

1. Introduction

The rise in gen AI has been met with mixed reactions. On the one hand, it appears to hold some promise to democratise access to knowledge and education: students are early adopters and top users of the technology [7], and gen AI is dominating the EdTech landscape [8]. On the other hand, several concerns have been raised about the misuse of this technology in educational settings [7, 9]. For example, the gen AI models that power most of the latest EdTech systems are not explicitly optimised for pedagogy. Instead, models are trained to be “helpful” [10–14], but this specific definition of helpfulness may often be at odds with pedagogy and learning. For example, students can easily get direct answers to homework assignments instead of working through them for themselves to get the intended practice. The availability of what appears to be “expert” information by prompting a gen AI model for an answer also gives students an illusion of mastery before it has been achieved, which may eventually lead to problems in the workplace [9, 15].

 

This report describes our first steps towards optimising gen AI for educational use cases. In particular, we focus on 1:1 conversational tutoring, and propose a comprehensive evaluation protocol for this use case. We focus on conversational tutoring because we believe that it is one of the most impactful and general use cases, and because it requires the integration of many important educational capabilities into a single system. An excellent conversational AI tutor has the potential to enhance the educational experience of both learners (by providing them with instant feedback and adapting to their individual needs) and teachers (by multiplying their impact and lightening their workload).

 

We describe our approach to responsible development of AI for education (Figure 1), which is informed by the ethics and policy literature [16–26]. We emphasise a participatory (Section 2) and multidisciplinary approach to research, bringing together experts in pedagogy, cognitive science, AI, engineering, ethics, and policy, as well as the ultimate stakeholders—students and teachers—to translate insights from learning science into pragmatic and useful pedagogical improvements of Gemini 1.0 [10] for education.

 

·           We introduce LearnLM-Tutor, a new text-based gen AI tutor based on Gemini 1.0, further fine-tuned for 1:1 conversational tutoring (Section 3), and show that we improve its education-related capabilities over a prompt-tuned Gemini 1.0.

·           We develop a comprehensive suite of seven pedagogical benchmarks (quantitative and qualitative, and using both human and automatic evaluations; Figure 2) intended for assessing the performance of conversational AI tutors from various angles (participatory research), in line with the recommendations in Foster et al. [21].

·           Finally, we discuss the limitations, as well as the safety, ethical, and policy implications of our work. Our approach to ethics and safety goes beyond the common gen AI guidelines, as we develop education-specific interventions (Section 9).

 

 

2. Participatory approach

This section details the participatory elements that helped shape this project, including the design of our evaluative approach.

 

2.1. Participatory workshops: Imagining and critiquing the future of education and AI

We conducted two participatory workshops in the UK: one with learners, primarily university students coming from diverse academic backgrounds (𝑛 = 60), and another with educators, mainly high school teachers specialising in STEM subjects (𝑛 = 34). The choice of the participant demographics was dictated by practical considerations. We realise that future work is needed to expand our reach to broader communities, since learners in the UK and other WEIRD (Western, Educated, Industrialised, Rich, Democratic) countries likely encounter fewer barriers to accessing gen AI tools, and perspectives on AI in education likely differ substantially across cultural contexts.

 

Personalised tutoring, by AI or humans, was valued by both learners and educators. Tutors are especially effective when they have knowledge of the learner and can adapt their approach accordingly. Learners felt more comfortable seeking clarifications from AI tutors than human tutors, perceiving AI tutors as less formal and less likely to induce fears of judgement. A shared limitation of both human and AI tutors was their lack of familiarity with the nuances of particular syllabi or exam board requirements.

 

2.2. Understanding learning experiences: Initial interviews and Wizard-of-Oz sessions

To initiate our iterative participatory design process for LearnLM-Tutor, we conducted an exploratory series of user-centred studies involving both learners and educators.

 

Learners noted several main challenges with online courses: the learners’ lack of assumed prerequisite knowledge, not being able to follow explanations due to missing details or logical steps, difficulty concentrating on long video lectures without doing exercises, and needing more help navigating the course materials. When doing practice problems, learners reported needing help breaking down the task into manageable chunks and diagnosing errors in their solutions; they reported that the tools they used could only point out the error, rather than how to diagnose it. Learners also wanted an AI tutor to have access to the same learning materials as them, use short communications that guide them in small steps, and give them frequent assessments of their knowledge. They did not want the tutor to give away too much information as they reported feeling pride in doing things themselves. They also wanted the tutor to be encouraging and constructive in its feedback, responsive and kind, proactive in soliciting questions from the learners, and always available.

 

From our conversations with the educators we have derived the following principles that apply to both human and AI tutors (see Section B.2 for additional principles that are only relevant to AI tutors):

 

3. Improving Gemini for education

This section surveys our work on enabling productive pedagogical behaviour in a language-based gen AI model. We begin by framing our contributions with respect to related prior work in learning science, EdTech and AI research.

 

3.1. Lack of universal best pedagogical practices: lessons from learning science

Optimising an AI system for any goal requires a concomitant ability to measure progress. While learning and teaching strategies have been studied across many disciplines, defining (and subsequently quantifying) universal pedagogical principles remains a challenge. One reason why it has been hard to establish a common set of recommended pedagogical practices is related to the fragmentation of educational research across many disciplines. Even within the same discipline, many studies highlight different interventions or strategies with little overlap.

 

The resulting theories are often based on inconclusive evidence [37], and their translation to practice is often difficult or unclear [27, 38, 39]. Furthermore, most cognitive and learning science research tends to be done with small homogeneous populations [27], limited to specific narrow educational contexts, like subject domain, difficulty level, or prior learner knowledge [27], and typically conducted in WEIRD countries [40], which makes the findings hard to generalise.

 

Studied interventions also come with variable implementation parameters (e.g. the time spacing between practices, the ratio of examples to questions) and can be combined in different ways, resulting in a combinatorial explosion in possible, often context-dependent, pedagogical strategies [27] that is hard to explore manually, let alone measure.

 

3.2. Lack of transparency and common evaluation practices: lessons from EdTech

Education has always been an important application for the latest computing technology. From the earliest instantiations, these intelligent tutoring systems (ITSs) tended to follow a similar blueprint. They assume that the learner is interacting with the tutoring system without any assistance from a human teacher, and the tutoring system guides the learner through a pre-defined set of learning materials with some level of adaptation to the learner’s progress (e.g., choosing the difficulty of the next practice problem based on how well the learner did on the previous ones), and some level of timely feedback (e.g., at the step or solution level) [41, 44, 48].

Under the hood, ITSs tend to be rule-based expert systems [67–70]—the predominant AI paradigm in the 1970s and 1980s. Although expert systems have many positive qualities, they have largely been replaced by deep learning in recent years due to difficulties with scale and generality inherent in this paradigm [71, 72]. These limitations of expert systems also lead to the most common criticisms of ITSs (see Section C for further discussion).

 

3.3. Generative AI in education

Deep learning has become the predominant paradigm in AI, and it has removed the dependency on humans to provide structured knowledge to AI by enabling AI systems to discover structure from data on their own during training. Although there has been a lot of excitement about the potential impact of the recent gen AI technology in education, and a number of gen AI-based tutors have emerged [89–105], the full extent of this potential has not materialised just yet. A recent review of gen AI tutoring systems found that “dialog tutoring has largely remained unaffected by these advances” [106].

 

Out of the box, gen AI models have a remarkable ability to understand user queries expressed in natural language and generate responses that synthesise relevant information from across the internet (used in the gen AI pre-training) to answer in a helpful and harmless way. However, by default, these models do not typically behave like human tutors. Such default behaviour can be modified in two ways: (a) prompting or (b) fine-tuning (through supervised and/or reinforcement learning).

 

3.3.1. Prompting

Prompting is the easiest and most popular way to adjust the behaviour of gen AI

The prompting approach, however, has a number of limitations. Most importantly, it requires explicit specification of what good tutoring behaviours look like in natural language. This involves enumerating what should be done and when, what should be avoided and when, all the possible exceptions to the rules, etc. This makes prompted gen AI-based tutors similar to ITSs: while gen AI is more general and faster to build (based on an existing foundation model), in the end both are limited by declarative knowledge of what the best educational practices look like. However, as discussed in Section 3.1, as a community we have not come even close to fully exploring the search space of optimal pedagogical strategies, let alone operationalising excellent pedagogy beyond the surface level into a prompt. We spent some time trying to elicit pedagogical behaviour via prompting. In some cases, this worked well, for example when instructing the model to ask a user for their grade level and respond with age-appropriate vocabulary. However, we found that most pedagogy is too nuanced to be explained with prompting.
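To make the declarative nature of the prompting approach concrete, the sketch below shows what a prompted tutor might look like. The prompt wording and the `generate` placeholder are our own illustrations, not the paper's setup.

```python
# Illustrative sketch only (not the paper's actual prompt): a hand-written
# pedagogical system prompt of the kind discussed above. Every desired
# behaviour must be enumerated declaratively in natural language, and
# `generate` stands in for whatever gen AI text-generation API is in use.
TUTOR_SYSTEM_PROMPT = """You are a patient tutor.
- Ask the learner for their grade level and use age-appropriate vocabulary.
- Never give away the full answer to a homework problem; guide with hints.
- After each learner attempt, acknowledge what is correct before addressing errors.
- Keep responses short and end with a question that moves the learner forward."""


def tutor_reply(conversation_history: str, generate) -> str:
    """Prepend the declarative pedagogy rules to the conversation and query the model."""
    prompt = TUTOR_SYSTEM_PROMPT + "\n\n" + conversation_history + "\nTutor:"
    return generate(prompt)
```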

 

Furthermore, prompting produced unreliable and inconsistent results, because there are limits to how much it can push the behaviour of gen AI away from the core principles ingrained into it during the pre-training and instruction tuning phases of its development (see Section D for a discussion of these limitations in the educational context). Such inconsistent performance is incompatible with providing reliable standards of pedagogy for all learners throughout the entire learning journey. Hence, we decided to turn to fine-tuning for more deeply embedded pedagogical behaviour, and only rely on prompting to adjust more superficial characteristics and user preferences.

 

3.3.2. Fine-tuning

If prompting can be roughly seen as the modern, more capable generalisation of expert systems, its alternative—fine-tuning, which typically includes stages of supervised fine-tuning (SFT), followed by Reinforcement Learning from Human Feedback (RLHF)—brings the full power of the deep learning paradigm, i.e. learning from data, to the table. While far less computationally intensive than the standard pre-training phase, fine-tuning can still be costly to perform on models with many billions of parameters [101], which explains why it is less explored in the gen AI for education literature compared to prompting.

 

Successful fine-tuning has two prerequisites: enough high-quality data (provided by researchers in the SFT case, or self-generated by the learning agent through exploration in the RL case) and a good measure of success. This was the key to many modern success stories in AI, from AlphaGo [112] to AlphaFold [113]. However, neither is readily available in the education domain. This section addresses the lack of high-quality pedagogical data to enable education-related SFT, while the lack of a good measure of success is discussed in subsequent sections.
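As a concrete illustration of the data prerequisite, a single SFT example for conversational tutoring typically pairs a dialogue context with a target tutor response written to exemplify good pedagogy. The field names below are assumptions for illustration, not the paper's actual data schema.

```python
# Illustrative sketch of one supervised fine-tuning example for a conversational
# tutor. Field names are assumptions for illustration, not the paper's schema.
sft_example = {
    "context": [
        {"role": "learner", "text": "Can you just tell me the answer to 3x + 5 = 20?"},
    ],
    # The training target is a pedagogically desirable response (e.g. written by
    # a pedagogy expert), rather than a merely "helpful" direct answer.
    "target": (
        "Let's work through it together. What could you do to both sides of the "
        "equation to get the term with x on its own?"
    ),
}
```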

 

4. Measuring Pedagogy in Gen AI

This section is mostly about their data-gathering approaches and how they fine-tuned Gemini. Important for the specific paper, but with fewer generalizable insights.

 

5. Human evaluations

(This section reports their results on the benchmarks. They evaluate conversations at a few levels of granularity and with a few different types of evaluators. The results themselves are less interesting than the way they frame how to evaluate a model, which is useful.)

 

5.1. Unguided conversations: Subjective learner feedback

 

5.2. Turn-level pedagogy: teacher feedback

 

 

 

5.3. Conversation-level pedagogy: teacher feedback

 

5.4. Side-by-side pedagogy: teacher feedback

 

6. Automatic Evaluations

This section describes efforts to get LLMs to evaluate the outputs of LLM tutors. It is interesting technically, and they report some success, but it is ultimately less generalizable.

 

7. Learning from real-world interactions

This section briefly describes a pilot they ran with Arizona State University (ASU) to integrate LearnLM-Tutor into ASU’s Study Hall. Study Hall is a partnership between ASU, Crash Course, and YouTube that offers a pathway to college credit, and is accessible to learners of all ages and backgrounds.

 

8. Evaluating particular educational capabilities

Apart from the holistic evaluations of the pedagogical effectiveness of gen AI tutors described in the previous sections, it is sometimes useful to have more targeted evaluations that shed light on how the tutors perform in particular phases of a conversational learning session. In this section we describe two case studies of developing such evaluations: one for the evaluative practice phase of the mastery loop, and the other measuring the quality of tutor feedback when working with a learner on procedural homework problems.

 

8.1. Evaluative practice

Knowledge assessment is a crucial part of the learning process and one of the most talked-about capabilities during the teacher workshop described in Section 2. To do it well, a tutor must sustain a complex dialogue interaction with the learner. Consider, for example, several possible answer and feedback pairs in an evaluative practice session on the geography of Normandy shown in Figure 14, in response to the question “What is the largest city in Normandy?”. These different examples highlight several challenges and opportunities that come up during interactive evaluative practice:

 

 

8.1.1. Automated Metrics

 

Pedagogical conversation flow. Used to assess the extent to which our model follows the evaluative practice schema of question, answer, appropriate feedback, and so on.
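A minimal sketch of how such a flow metric could be computed, assuming each turn has already been labelled by a classifier or a prompted model (our illustration, not the paper's implementation):

```python
# Minimal sketch (our illustration) of scoring how closely a conversation
# follows the evaluative-practice schema question -> answer -> feedback -> ...
# Turn labels are assumed to come from a classifier or a prompted gen AI model.
EXPECTED_CYCLE = ["tutor_question", "learner_answer", "tutor_feedback"]


def conversation_flow_score(turn_labels: list[str]) -> float:
    """Fraction of turns whose label matches the expected position in the cycle."""
    if not turn_labels:
        return 0.0
    matches = sum(
        1 for i, label in enumerate(turn_labels)
        if label == EXPECTED_CYCLE[i % len(EXPECTED_CYCLE)]
    )
    return matches / len(turn_labels)
```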

 

Conversational adaptability. Used to measure how well the model adapts to the user’s specific request. It is based on the score returned by a gen AI model that is prompted with the following chain-of-thought approach: “Break down the user’s request into separate statements, and score the extent to which these statements are acknowledged in the bot’s response.”
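A sketch of what this prompted judge might look like in practice (our illustration; `generate` is a placeholder for the judge model's API, and the assumption that it returns a bare numeric score is ours):

```python
# Sketch of the adaptability metric as we understand it: a prompted gen AI model
# acts as a judge over a (user request, bot response) pair. The expected output
# format (a single number between 0 and 1) is our assumption for illustration.
ADAPTABILITY_PROMPT = (
    "Break down the user's request into separate statements, and score the "
    "extent to which these statements are acknowledged in the bot's response. "
    "Return a single score between 0 and 1."
)


def adaptability_score(user_request: str, bot_response: str, generate) -> float:
    judge_input = (
        f"{ADAPTABILITY_PROMPT}\n\nUser request:\n{user_request}\n\n"
        f"Bot response:\n{bot_response}\n\nScore:"
    )
    return float(generate(judge_input).strip())
```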

 

Feedback quality. Used to measure the quality of the model’s feedback on the user’s answer to the question. Since this requires actually knowing the right answer, this metric is applied not to new conversations but rather to a hand-labelled evaluation set where each user answer is given one of four labels: Correct, Incorrect, Partially correct, and Irrelevant. Our tutor model responses are generative and do not come in the form of these four labels. Thus, to measure the performance of our model, we used a trained assessment extraction model that “translates” the feedback of the model into these classes. We then compare the extracted class against the hand-labelled class and compute the overall precision and recall metrics.
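The comparison step can be sketched as follows; the four class names come from the paper, while the function itself is our illustration, not the paper's code.

```python
# Sketch (our reconstruction) of the feedback-quality comparison: the assessment
# extraction model has already mapped each free-form tutor response to one of
# four classes, which are then scored against the hand-labelled gold classes.
LABELS = ["Correct", "Incorrect", "Partially correct", "Irrelevant"]


def precision_recall(gold: list[str], extracted: list[str]) -> dict[str, tuple[float, float]]:
    """Per-class precision and recall of the extracted feedback classes."""
    scores = {}
    for label in LABELS:
        true_positives = sum(1 for g, e in zip(gold, extracted) if g == label and e == label)
        predicted = sum(1 for e in extracted if e == label)
        actual = sum(1 for g in gold if g == label)
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores
```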

 

Question difficulty. Used to measure the average and range of question difficulties generated by the model to ensure varied quizzes. We rely on Bloom’s taxonomy [158] to map questions to the level of cognitive effort required to answer them: 1) Remember, 2) Understand, 3) Apply, 4) Analyse, 5) Evaluate, 6) Create. The metric is computed using a gen AI model prompted to extract and predict Bloom’s taxonomy for each question.
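A sketch of how this metric could be computed (our illustration; the prompt wording and the `generate` placeholder are assumptions, and the sketch assumes the judge returns exactly one of the six level names):

```python
# Sketch (our illustration, not the paper's implementation) of the
# question-difficulty metric: a prompted gen AI model assigns each generated
# question a Bloom's-taxonomy level, and we report the mean level and the range.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyse", "Evaluate", "Create"]


def question_difficulty(questions: list[str], generate) -> tuple[float, int]:
    levels = []
    for question in questions:
        label = generate(
            "Assign one Bloom's taxonomy level (Remember, Understand, Apply, "
            f"Analyse, Evaluate, Create) to this question:\n{question}\nLevel:"
        ).strip()
        levels.append(BLOOM_LEVELS.index(label) + 1)  # map to levels 1-6
    return sum(levels) / len(levels), max(levels) - min(levels)
```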

 

8.1.2. Pedagogical Expert Human Evaluation

We rely on a pool of pedagogical experts (two per example, with an optional third rater in case of a tie) to collect deeper feedback on the pedagogical value of the evaluative practice experience. In this setup the raters review two evaluative practice conversations about the same topic that were generated by the generalist human raters mentioned above. The pedagogical raters respond to a series of questions about the pedagogical value of each conversation, as well as an overall side-by-side question to decide which model was preferable. The evaluative questions ask raters to assign a score on a 3-point scale on the following criteria:

 

 

8.2. Feedback on procedural homework problems

This section describes how we evaluated LearnLM-Tutor’s ability to provide conversational feedback on procedural homework problems, such as maths word problems. Procedural problems often have one or a few correct solutions and require a series of steps a student must perform to reach that solution.

Despite significant gains in mathematical and multi-hop reasoning as tracked by the common benchmarks [121, 159–161], the performance of AI tutors in providing conversation-based feedback on procedural problems is still inadequate, as tutoring is more difficult than just solving the problem itself. When tutoring a student, an AI tutor has to not only solve a presented procedural problem correctly, but also evaluate the learner’s (potentially partially correct) solution, identifying any misconceptions. The AI tutor must allow for multiple possible problem-solving strategies from the learner, while providing a consistent explanation that the learner can understand. This is at odds with the tendency of gen AI models to change their solutions to a given problem multiple times within a single conversation [162]. Additionally, the AI tutor must not exhibit the sycophantic tendencies of LLMs [163] if it is to give proper feedback on mistakes. Existing benchmarks do not evaluate these capabilities.

To track progress in improving the quality of LearnLM-Tutor’s feedback on learner-attempted procedural problems, we developed the following set of progressively harder automated evaluation metrics:

 

9. Responsible development

 

9.1. Impact assessment

Impact assessments were carried out throughout the development, drawing on the participatory workshops with learners and educators described in Section 2.1, and the literature on the benefits and harms of generative AI [23–26] and of artificial intelligence for education specifically [16–22]. All individual studies and products underwent a separate impact assessment; in the case of the ASU HallMate study in Section 7, this was conducted by Google DeepMind’s Human Behavioural Research Ethics Committee.

 

Through our participatory research, we have learned that AI tutors can be beneficial to learners by promoting active learning and providing personalised help when explaining concepts or working through problems. An AI tutor can understand the learner’s current knowledge, adapt its explanations to the learner’s proficiency, and make connections to real-world examples that interest the learner.

We have also seen early signals that AI tutors can be an always-available, safe place for learners to ask questions they may be uncomfortable asking teachers or peers, or to get motivation when feeling overwhelmed in a course.

 

9.3. Mitigations

Mitigations to known risks were applied from the outset, with further mitigations being added to address failure modes discovered during safety evaluations. The first mitigation was careful curation of our SFT data: our “Golden conversations” data was written by pedagogy experts with instructions on style and content, and most of our synthetic fine-tuning data (with the exception of some synthetic data for mathematics) was manually reviewed. Furthermore, we used prompted LLMs to flag turns in the data that might make policy violations more likely and manually reviewed all flagged turns.

Our main mitigation method was additional safety fine-tuning on top of that of Gemini 1.0. This is necessary to enforce the additional safety policies for LearnLM-Tutor, to mitigate safety issues arising from the customisation of the models for AI tutoring—even non-adversarial customisation can affect safety [169, 170]—and to customise the way the model responds to policy violation-inducing queries. Since a conversation with LearnLM-Tutor has a narrower conversation goal than that of a generalist conversational AI, the handling of most harm-inducing queries can be different: for queries that are unrelated to the learning goal, we aimed for LearnLM-Tutor to give briefer rejections and refocus the conversation on the lesson content.

 

Our safety fine-tuning data consists of harm-inducing conversations and golden responses on lesson material across a wide range of subjects. Queries were either written by the team or taken from failures observed during automatic or human red-teaming. The number and type of training examples were chosen to ensure broad coverage of our model policies and different harm types, as well as an appropriate dataset size relative to the rest of our fine-tuning data.

 

9.5. Examples of the evaluation and mitigation process

We present two examples of our evaluation and mitigation process: failure patterns caused by the customisation of the model for pedagogy, and anthropomorphism as an example of a risk that was identified early on and tracked throughout the entirety of development.

 

9.5.1. Failure patterns caused by customisation

Model customisations—even if they are non-adversarial—can result in safety regressions [169, 170]. This is equally true of our pedagogy fine-tuning. For example, the model developed a tendency to …

 

9.5.2. Anthropomorphism

Perceiving human-like characteristics in non-human systems is known as anthropomorphism [172]. Many technologies have been perceived as human-like by their users [173–176], including generative conversational AI systems powered by large language models [177, 178]. Anthropomorphic perceptions of technologies, including AI, have been demonstrated to have a great impact on how users interact with and form mental models of the systems [179–183]. While greater trust and acceptance of anthropomorphic systems may have a positive effect on user-system interactions in certain contexts, like customer service [184], it is important to anticipate downstream harms. For example, users may experience emotional attachments to AI systems, which may give rise to dependence and over-reliance on AI systems [26].

 

Appendix D. Challenges with prompting gen AI for pedagogy

Recent review articles found that although prompted gen AI approaches tend to do better than their ITS predecessors in constrained tutoring scenarios where the number of concepts and possible teaching strategies is small, these systems perform poorly in more general learning scenarios [90, 92, 98]. A major disadvantage of the prompting approach is that there are limits to how much it can push the behaviour of the gen AI away from the core principles fine-tuned into the model during the pre-training and instruction tuning phases, as discussed in more detail below. Note, however, that gen AI models improve continuously, including in their ability to follow prompts, so many of the results discussed next may no longer hold by the time this report is published.

 

Appendix F. Challenges with eliciting human preferences for pedagogy

In order to use Reinforcement Learning (RL) to fine-tune gen AI for education, it is important to train a Reward Model (RM) that can provide an evaluative signal on either how well each single response produced by a gen AI model rates in terms of its pedagogical value, or how well a whole multi-turn interaction with a learner has helped that learner achieve their goal. Such RMs are typically trained by eliciting feedback from human raters, who are usually presented with a pair of model responses and asked to judge which one they prefer based on certain criteria. Improving gen AI models through this process is called RL from Human Feedback, or RLHF. Currently, RLHF has only been applied at the single-turn level (rather than at the conversation level) [90, 208], and, as far as we are aware, human preference collection has not so far been generalised to pedagogy. This is because eliciting human preferences reliably is already a hard task, and doing so for pedagogy amplifies the existing problems. For example, inconsistencies between different raters are exacerbated because good pedagogy is hard to define and there are multiple, possibly equally valid, pedagogical moves that can be made in each situation. It is also not clear whether the preferences should be elicited from learners, educators, or both, and how they should be combined if it is the latter.
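For readers unfamiliar with how such an RM is typically trained from pairwise judgements, the sketch below shows the standard pairwise-preference (Bradley-Terry) objective. This is the generic RLHF recipe, not the paper's implementation; the reward model itself is abstracted away as whatever produces the two scalar scores.

```python
import torch
import torch.nn.functional as F

# Sketch of the standard pairwise-preference (Bradley-Terry) loss used to train
# a reward model from rater judgements of the kind described above.
def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the preferred response outranks the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```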

 

Appendix G. Sociotechnical limitations of text-based gen AI

It is important to frame the work on gen AI tutor model development in terms of the sociotechnical limitations of text-based gen AI. It is natural to think of an AI tutor as approximating a human tutor; however, approximating human tutors closely may not always be desirable or possible. While modern gen AI models can provide impressive, often human-like responses, text-based interaction is usually only a fragment of human communication. At least relative to today’s state-of-the-art models, human tutor advantages include:

 

 

The design of an AI tutor should take into account these shortcomings with respect to human interaction, in addition to well-known limitations on current model capabilities like confident generation of false or misleading information, unpredictable failure to generalise learned behaviour appropriately, improper use of tools leading to incorrect calculations, and missing introspection that might allow for post hoc correction of mistakes (also see Section D).