Excerpts and Commentary on DeepMind Paper

Excerpts from “Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach”

Irina Jurenka & Markus Kunesch, et al. | Link to Original Paper

Note: Below are the parts of the paper we thought were particularly interesting from a learning design / learning science perspective. There is a lot more in the paper about the specifics of how they built the customized model and the different approaches they took to evaluating performance.

1. Introduction

2. Participatory approach

3. Improving Gemini for education

4. Measuring Pedagogy in Gen AI

5. Human evaluations

6. Automatic Evaluations

7. Learning from real-world interactions

8. Evaluating particular educational capabilities

9. Responsible development

Appendix D. Challenges with prompting gen AI for pedagogy

Appendix F. Challenges with eliciting human preferences for pedagogy

Appendix G. Sociotechnical limitations of text-based gen AI

1. Introduction

The rise in gen AI has been met with mixed reactions. On the one hand, it appears to hold some promise to democratise access to knowledge and education: students are early adopters and top users of the technology [7], and gen AI is dominating the EdTech landscape [8]. On the other hand, several concerns have been raised about the misuse of this technology in educational settings [7, 9]. For example, the gen AI models that power most of the latest EdTech systems are not explicitly optimised for pedagogy. Instead, models are trained to be “helpful” [10–14], but this specific definition of helpfulness may often be at odds with pedagogy and learning. For example, students can easily get direct answers to homework assignments instead of working through them for themselves to get the intended practice. The availability of what appears to be “expert” information by prompting a gen AI model for an answer also gives students an illusion of mastery before it has been achieved, which may eventually lead to problems in the workplace [9, 15].

 

This report describes our first steps towards optimising gen AI for educational use cases. In particular, we focus on 1:1 conversational tutoring, and propose a comprehensive evaluation protocol for this use case. We focus on conversational tutoring because we believe that it is one of the most impactful and general use cases, and because it requires the integration of many important educational capabilities into a single system. An excellent conversational AI tutor has the potential to enhance the educational experience of both learners (by providing them with instant feedback and adapting to their individual needs) and teachers (by multiplying their impact and lightening their workload).

 

We describe our approach to responsible development of AI for education (Figure 1), which is informed by the ethics and policy literature [16–26]. We emphasise a participatory (Section 2) and multidisciplinary approach to research, bringing together experts in pedagogy, cognitive science, AI, engineering, ethics, and policy, as well as the ultimate stakeholders—students and teachers—to translate insights from learning science into pragmatic and useful pedagogical improvements of Gemini 1.0 [10] for education.

 

·           We introduce LearnLM-Tutor, a new text-based gen AI tutor based on Gemini 1.0, further fine-tuned for 1:1 conversational tutoring (Section 3), and show that we improve its education-related capabilities over a prompt-tuned Gemini 1.0.

·           We develop a comprehensive suite of seven pedagogical benchmarks (quantitative and qualitative, and using both human and automatic evaluations; Figure 2) intended for assessing the performance of conversational AI tutors from various angles (participatory research), in line with the recommendations in Foster et al. [21].

·           Finally, we discuss the limitations, as well as the safety, ethical, and policy implications of our work. Our approach to ethics and safety goes beyond the common gen AI guidelines, as we develop education-specific interventions (Section 9).

 

 

2. Participatory approach

This section details the participatory elements that helped shape this project, including the design of our evaluative approach.

 

2.1. Participatory workshops: Imagining and critiquing the future of education and AI

We conducted two participatory workshops in the UK: one with learners, primarily university students coming from diverse academic backgrounds (𝑛 = 60), and another with educators, mainly high school teachers specialising in STEM subjects (𝑛 = 34). The choice of the participant demographics was dictated by practical considerations. We realise that future work is needed to expand our reach to broader communities, since learners in the UK and other WEIRD (Western, Educated, Industrialised, Rich, Democratic) countries likely encounter fewer barriers to accessing gen AI tools, and perspectives on AI in education likely differ substantially across cultural contexts.

 

Personalised tutoring, by AI or humans, was valued by both learners and educators. Tutors are especially effective when they have knowledge of the learner and can adapt their approach accordingly. Learners felt more comfortable seeking clarifications from AI tutors than human tutors, perceiving AI tutors as less formal and less likely to induce fears of judgement. A shared limitation of both human and AI tutors was their lack of familiarity with the nuances of particular syllabi or exam board requirements.

 

2.2. Understanding learning experiences: Initial interviews and Wizard-of-Oz sessions

To initiate our iterative participatory design process for LearnLM-Tutor, we conducted an exploratory series of user-centred studies involving both learners and educators.

 

Learners noted several main challenges with online courses: the learners’ lack of assumed prerequisite knowledge, not being able to follow explanations due to missing details or logical steps, difficulty concentrating on long video lectures without doing exercises, and needing more help navigating the course materials. When doing practice problems, learners reported needing help breaking down the task into manageable chunks and diagnosing errors in their solutions; they reported that the tools they used could only point out the error, rather than how to diagnose it. Learners also wanted an AI tutor to have access to the same learning materials as them, use short communications that guide them in small steps, and give them frequent assessments of their knowledge. They did not want the tutor to give away too much information as they reported feeling pride in doing things themselves. They also wanted the tutor to be encouraging and constructive in its feedback, responsive and kind, proactive in soliciting questions from the learners, and always available.

 

From our conversations with the educators we have derived the following principles that apply to both human and AI tutors (see Section B.2 for additional principles that are only relevant to AI tutors):

 

3. Improving Gemini for education

This section surveys our work on enabling productive pedagogical behaviour in a language-based gen AI model. We begin by framing our contributions with respect to related prior work in learning science, EdTech and AI research.

 

3.1. Lack of universal best pedagogical practices: lessons from learning science

Optimising an AI system for any goal requires a concomitant ability to measure progress. While learning and teaching strategies have been studied across many disciplines, defining (and subsequently quantifying) universal pedagogical principles remains a challenge. One reason why it has been hard to establish a common set of recommended pedagogical practices is related to the fragmentation of educational research across many disciplines. Even within the same discipline, many studies highlight different interventions or strategies with little overlap.

 

The resulting theories are often based on inconclusive evidence [37], and their translation to practice is often difficult or unclear [27, 38, 39]. Furthermore, most cognitive and learning science research tends to be done with small homogeneous populations [27], limited to specific narrow educational contexts, like subject domain, difficulty level, or prior learner knowledge [27], and typically conducted in WEIRD countries [40], which makes the findings hard to generalise.

 

Studied interventions also come with variable implementation parameters (e.g. the time spacing between practices, the ratio of examples to questions) and can be combined in different ways, resulting in a combinatorial explosion in possible, often context-dependent, pedagogical strategies [27] that is hard to explore manually, let alone measure.

 

3.2. Lack of transparency and common evaluation practices: lessons from EdTech

Education has always been an important application for the latest computing technology. From the earliest instantiations, these intelligent tutoring systems (ITSs) tended to follow a similar blueprint. They assume that the learner is interacting with the tutoring system without any assistance from a human teacher, and the tutoring system guides the learner through a pre-defined set of learning materials with some level of adaptation to the learner’s progress (e.g., choosing the difficulty of the next practice problem based on how well the learner did on the previous ones), and some level of timely feedback (e.g., at the step or solution level) [41, 44, 48].

Under the hood, ITSs tend to be rule-based expert systems [67–70]—the predominant AI paradigm in the 1970s and 1980s. Although expert systems have many positive qualities, they have largely been replaced by deep learning in recent years due to difficulties with scale and generality inherent in this paradigm [71, 72]. These limitations of expert systems also lead to the most common criticisms of ITSs (see Section C for further discussion).

 

3.3. Generative AI in education

Deep learning has become the predominant paradigm in AI, and it has removed the dependency on humans to provide structured knowledge to AI by enabling AI systems to discover structure from data on their own during training. Although there has been a lot of excitement about the potential impact of the recent gen AI technology in education, and a number of gen AI-based tutors have emerged [89–105], the full extent of this potential has not materialised just yet. A recent review of gen AI tutoring systems found that “dialog tutoring has largely remained unaffected by these advances” [106].

 

Out of the box, gen AI models have a remarkable ability to understand user queries expressed in natural language and generate responses that synthesise relevant information from across the internet (used in the gen AI pre-training) to answer in a helpful and harmless way. However, by default, these models do not typically behave like human tutors. Such default behaviour can be modified in two ways: (a) prompting or (b) fine-tuning (through supervised and/or reinforcement learning).

 

3.3.1. Prompting

Prompting is the easiest and most popular way to adjust the behaviour of gen AI

The prompting approach, however, has a number of limitations. Most importantly, it requires explicit specification of what good tutoring behaviours look like in natural language. This involves enumerating what should be done and when, what should be avoided and when, all the possible exceptions to the rules, etc. This makes prompted gen AI-based tutors similar to ITSs: while gen AI is more general and faster to build (based on an existing foundation model), in the end both are limited by declarative knowledge of what the best educational practices look like. However, as discussed in Section 3.1, as a community we have not come even close to fully exploring the search space of optimal pedagogical strategies, let alone operationalising excellent pedagogy beyond the surface level into a prompt. We spent some time trying to elicit pedagogical behaviour via prompting. In some cases, this worked well, for example when instructing the model to ask a user for their grade level and respond with age-appropriate vocabulary. However, we found that most pedagogy is too nuanced to be explained with prompting.
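To make the declarative nature of the prompting approach concrete, the sketch below shows what a prompted tutor might look like. The prompt wording and the `generate` placeholder are our own illustrations, not the paper's setup.

```python
# Illustrative sketch only (not the paper's actual prompt): a hand-written
# pedagogical system prompt of the kind discussed above. Every desired
# behaviour must be enumerated declaratively in natural language, and
# `generate` stands in for whatever gen AI text-generation API is in use.
TUTOR_SYSTEM_PROMPT = """You are a patient tutor.
- Ask the learner for their grade level and use age-appropriate vocabulary.
- Never give away the full answer to a homework problem; guide with hints.
- After each learner attempt, acknowledge what is correct before addressing errors.
- Keep responses short and end with a question that moves the learner forward."""


def tutor_reply(conversation_history: str, generate) -> str:
    """Prepend the declarative pedagogy rules to the conversation and query the model."""
    prompt = TUTOR_SYSTEM_PROMPT + "\n\n" + conversation_history + "\nTutor:"
    return generate(prompt)
```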

 

Furthermore, prompting produced unreliable and inconsistent results, because there are limits to how much it can push the behaviour of gen AI away from the core principles ingrained into it during the pre-training and instruction tuning phases of its development (see Section D for a discussion of these limitations in the educational context). Such inconsistent performance is incompatible with providing reliable standards of pedagogy for all learners throughout the entire learning journey. Hence, we decided to turn to fine-tuning for more deeply embedded pedagogical behaviour, and only rely on prompting to adjust more superficial characteristics and user preferences.

 

3.3.2. Fine-tuning

If prompting can be roughly seen as the modern, more capable generalisation of expert systems, its alternative—fine-tuning, which typically includes stages of supervised fine-tuning (SFT), followed by Reinforcement Learning from Human Feedback (RLHF)—brings the full power of the deep learning paradigm, i.e. learning from data, to the table. While far less computationally intensive than the standard pre-training phase, fine-tuning can still be costly to perform on models with many billions of parameters [101], which explains why it is less explored in the gen AI for education literature compared to prompting.

 

Successful fine-tuning has two prerequisites: enough high-quality data (provided by researchers in the SFT case, or self-generated by the learning agent through exploration in the RL case) and a good measure of success. This was the key to many modern success stories in AI, from AlphaGo [112] to AlphaFold [113]. However, neither is readily available in the education domain. This section addresses the lack of high-quality pedagogical data to enable education-related SFT, while the lack of a good measure of success is discussed in subsequent sections.
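As a concrete illustration of the data prerequisite, a single SFT example for conversational tutoring typically pairs a dialogue context with a target tutor response written to exemplify good pedagogy. The field names below are assumptions for illustration, not the paper's actual data schema.

```python
# Illustrative sketch of one supervised fine-tuning example for a conversational
# tutor. Field names are assumptions for illustration, not the paper's schema.
sft_example = {
    "context": [
        {"role": "learner", "text": "Can you just tell me the answer to 3x + 5 = 20?"},
    ],
    # The training target is a pedagogically desirable response (e.g. written by
    # a pedagogy expert), rather than a merely "helpful" direct answer.
    "target": (
        "Let's work through it together. What could you do to both sides of the "
        "equation to get the term with x on its own?"
    ),
}
```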

 

4. Measuring Pedagogy in Gen AI

This section is mostly about their data-gathering approaches and how they fine-tuned Gemini. Important for the specific paper, but with fewer generalizable insights.

 

5. Human evaluations

(This section reports their results on the benchmarks. They evaluate conversations at a few levels of granularity and with a few different types of evaluators. The results themselves are less interesting than the way they frame how to evaluate a model, which is useful.)

 

5.1. Unguided conversations: Subjective learner feedback

 

5.2. Turn-level pedagogy: teacher feedback

 

 

 

5.3. Conversation-level pedagogy: teacher feedback

 

5.4. Side-by-side pedagogy: teacher feedback

 

6. Automatic Evaluations

This section describes efforts to get LLMs to evaluate the outputs of LLM tutors. It is interesting technically, and they report some success, but it is ultimately less generalizable.

 

7. Learning from real-world interactions

This section briefly describes a pilot they ran with Arizona State University (ASU) to integrate LearnLM-Tutor into ASU’s Study Hall. Study Hall is a partnership between ASU, Crash Course, and YouTube that offers a pathway to college credit, and is accessible to learners of all ages and backgrounds.

 

8. Evaluating particular educational capabilities

Apart from the holistic evaluations of the pedagogical effectiveness of gen AI tutors described in the previous sections, it is sometimes useful to have more targeted evaluations that shed light on how the tutors perform in particular phases of a conversational learning session. In this section we describe two case studies of developing such evaluations: one for the evaluative practice phase of the mastery loop, and the other measuring the quality of tutor feedback when working with a learner on procedural homework problems.

 

8.1. Evaluative practice

Knowledge assessment is a crucial part of the learning process and one of the most talked-about capabilities during the teacher workshop described in Section 2. To do it well, a tutor must sustain a complex dialogue interaction with the learner. Consider, for example, several possible answer and feedback pairs in an evaluative practice session on the geography of Normandy shown in Figure 14, in response to the question “What is the largest city in Normandy?”. These different examples highlight several challenges and opportunities that come up during interactive evaluative practice:

 

 

8.1.1. Automated Metrics

 

Pedagogical conversation flow. Used to assess the extent to which our model follows the evaluative practice schema of question, answer, appropriate feedback, and so on.
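A minimal sketch of how such a flow metric could be computed, assuming each turn has already been labelled by a classifier or a prompted model (our illustration, not the paper's implementation):

```python
# Minimal sketch (our illustration) of scoring how closely a conversation
# follows the evaluative-practice schema question -> answer -> feedback -> ...
# Turn labels are assumed to come from a classifier or a prompted gen AI model.
EXPECTED_CYCLE = ["tutor_question", "learner_answer", "tutor_feedback"]


def conversation_flow_score(turn_labels: list[str]) -> float:
    """Fraction of turns whose label matches the expected position in the cycle."""
    if not turn_labels:
        return 0.0
    matches = sum(
        1 for i, label in enumerate(turn_labels)
        if label == EXPECTED_CYCLE[i % len(EXPECTED_CYCLE)]
    )
    return matches / len(turn_labels)
```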

 

Conversational adaptability. Used to measure how well the model adapts to the user’s specific request. It is based on the score returned by a gen AI model that is prompted with the following chain-of-thought approach: “Break down the user’s request into separate statements, and score the extent to which these statements are acknowledged in the bot’s response.”
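A sketch of what this prompted judge might look like in practice (our illustration; `generate` is a placeholder for the judge model's API, and the assumption that it returns a bare numeric score is ours):

```python
# Sketch of the adaptability metric as we understand it: a prompted gen AI model
# acts as a judge over a (user request, bot response) pair. The expected output
# format (a single number between 0 and 1) is our assumption for illustration.
ADAPTABILITY_PROMPT = (
    "Break down the user's request into separate statements, and score the "
    "extent to which these statements are acknowledged in the bot's response. "
    "Return a single score between 0 and 1."
)


def adaptability_score(user_request: str, bot_response: str, generate) -> float:
    judge_input = (
        f"{ADAPTABILITY_PROMPT}\n\nUser request:\n{user_request}\n\n"
        f"Bot response:\n{bot_response}\n\nScore:"
    )
    return float(generate(judge_input).strip())
```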

 

Feedback quality. Used to measure the quality of the model’s feedback on the user’s answer to the question. Since this requires actually knowing the right answer, this metric is applied not to new conversations but rather to a hand-labelled evaluation set where each user answer is given one of four labels: Correct, Incorrect, Partially correct, and Irrelevant. Our tutor model responses are generative and do not come in the form of these four labels. Thus, to measure the performance of our model, we used a trained assessment extraction model that “translates” the feedback of the model into these classes. We then compare the extracted class against the hand-labelled class and compute the overall precision and recall metrics.
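The comparison step can be sketched as follows; the four class names come from the paper, while the function itself is our illustration, not the paper's code.

```python
# Sketch (our reconstruction) of the feedback-quality comparison: the assessment
# extraction model has already mapped each free-form tutor response to one of
# four classes, which are then scored against the hand-labelled gold classes.
LABELS = ["Correct", "Incorrect", "Partially correct", "Irrelevant"]


def precision_recall(gold: list[str], extracted: list[str]) -> dict[str, tuple[float, float]]:
    """Per-class precision and recall of the extracted feedback classes."""
    scores = {}
    for label in LABELS:
        true_positives = sum(1 for g, e in zip(gold, extracted) if g == label and e == label)
        predicted = sum(1 for e in extracted if e == label)
        actual = sum(1 for g in gold if g == label)
        precision = true_positives / predicted if predicted else 0.0
        recall = true_positives / actual if actual else 0.0
        scores[label] = (precision, recall)
    return scores
```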

 

Question difficulty. Used to measure the average and range of question difficulties generated by the model to ensure varied quizzes. We rely on Bloom’s taxonomy [158] to map questions to the level of cognitive effort required to answer them: 1) Remember, 2) Understand, 3) Apply, 4) Analyse, 5) Evaluate, 6) Create. The metric is computed using a gen AI model prompted to extract and predict Bloom’s taxonomy for each question.
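A sketch of how this metric could be computed (our illustration; the prompt wording and the `generate` placeholder are assumptions, and the sketch assumes the judge returns exactly one of the six level names):

```python
# Sketch (our illustration, not the paper's implementation) of the
# question-difficulty metric: a prompted gen AI model assigns each generated
# question a Bloom's-taxonomy level, and we report the mean level and the range.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyse", "Evaluate", "Create"]


def question_difficulty(questions: list[str], generate) -> tuple[float, int]:
    levels = []
    for question in questions:
        label = generate(
            "Assign one Bloom's taxonomy level (Remember, Understand, Apply, "
            f"Analyse, Evaluate, Create) to this question:\n{question}\nLevel:"
        ).strip()
        levels.append(BLOOM_LEVELS.index(label) + 1)  # map to levels 1-6
    return sum(levels) / len(levels), max(levels) - min(levels)
```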

 

8.1.2. Pedagogical Expert Human Evaluation

We rely on a pool of pedagogical experts (two per example, with an optional third rater in case of a tie) to collect deeper feedback on the pedagogical value of the evaluative practice experience. In this setup the raters review two evaluative practice conversations about the same topic that were generated by the generalist human raters mentioned above. The pedagogical raters respond to a series of questions about the pedagogical value of each conversation, as well as an overall side-by-side question to decide which model was preferable. The evaluative questions ask raters to assign a score on a 3-point scale on the following criteria:

 

 

8.2. Feedback on procedural homework problems

This section describes how we evaluated LearnLM-Tutor’s ability to provide conversational feedback on procedural homework problems, such as maths word problems. Procedural problems often have one or a few correct solutions and require a series of steps a student must perform to reach that solution.

Despite significant gains in mathematical and multi-hop reasoning as tracked by the common benchmarks [121, 159–161], the performance of AI tutors in providing conversation-based feedback on procedural problems is still inadequate, as tutoring is more difficult than just solving the problem itself. When tutoring a student, an AI tutor has to not only solve a presented procedural problem correctly, but also evaluate the learner’s (potentially partially correct) solution, identifying any misconceptions. The AI tutor must allow for multiple possible problem-solving strategies from the learner, while providing a consistent explanation that the learner can understand. This is at odds with the tendency of gen AI models to change their solutions to a given problem multiple times within a single conversation [162]. Additionally, the AI tutor must not exhibit the sycophantic tendencies of LLMs [163] if it is to give proper feedback on mistakes. Existing benchmarks do not evaluate these capabilities.

To track progress in improving the quality of LearnLM-Tutor’s feedback on learner-attempted procedural problems, we developed the following set of progressively harder automated evaluation metrics:

 

9. Responsible development

 

9.1. Impact assessment

Impact assessments were carried out throughout the development, drawing on the participatory workshops with learners and educators described in Section 2.1, and the literature on the benefits and harms of generative AI [23–26] and of artificial intelligence for education specifically [16–22]. All individual studies and products underwent a separate impact assessment; in the case of the ASU HallMate study in Section 7, this was conducted by Google DeepMind’s Human Behavioural Research Ethics Committee.

 

Through our participatory research, we have learned that AI tutors can be beneficial to learners by promoting active learning and providing personalised help when explaining concepts or working through problems. An AI tutor can understand the learner’s current knowledge, adapt its explanations to the learner’s proficiency, and make connections to real-world examples that interest the learner.

We have also seen early signals that AI tutors can be an always-available, safe place for learners to ask questions they may be uncomfortable asking teachers or peers, or to get motivation when feeling overwhelmed in a course.

 

9.3. Mitigations

Mitigations to known risks were applied from the outset, with further mitigations being added to address failure modes discovered during safety evaluations. The first mitigation was careful curation of our SFT data: our “Golden conversations” data was written by pedagogy experts with instructions on style and content, and most of our synthetic fine-tuning data (with the exception of some synthetic data for mathematics) was manually reviewed. Furthermore, we used prompted LLMs to flag turns in the data that might make policy violations more likely and manually reviewed all flagged turns.

Our main mitigation method was additional safety fine-tuning on top of that of Gemini 1.0. This is necessary to enforce the additional safety policies for LearnLM-Tutor, to mitigate safety issues arising from the customisation of the models for AI tutoring—even non-adversarial customisation can affect safety [169, 170]—and to customise the way the model responds to policy violation-inducing queries. Since a conversation with LearnLM-Tutor has a narrower conversation goal than that of a generalist conversational AI, the handling of most harm-inducing queries can be different: for queries that are unrelated to the learning goal, we aimed for LearnLM-Tutor to give briefer rejections and refocus the conversation on the lesson content.

 

Our safety fine-tuning data consists of harm-inducing conversations and golden responses on lesson material across a wide range of subjects. Queries were either written by the team or taken from failures observed during automatic or human red-teaming. The number and type of training examples were chosen to ensure broad coverage of our model policies and different harm types, as well as an appropriate dataset size relative to the rest of our fine-tuning data.

 

9.5. Examples of the evaluation and mitigation process

We present two examples of our evaluation and mitigation process: failure patterns caused by the customisation of the model for pedagogy, and anthropomorphism as an example of a risk that was identified early on and tracked throughout the entirety of development.

 

9.5.1. Failure patterns caused by customisation

Model customisations—even if they are non-adversarial—can result in safety regressions [169, 170]. This is equally true of our pedagogy fine-tuning. For example, the model developed a tendency to …

 

9.5.2. Anthropomorphism

Perceiving human-like characteristics in non-human systems is known as anthropomorphism [172]. Many technologies have been perceived as human-like by their users [173–176], including generative conversational AI systems powered by large language models [177, 178]. Anthropomorphic perceptions of technologies, including AI, have been demonstrated to have a great impact on how users interact with and form mental models of the systems [179–183]. While greater trust and acceptance of anthropomorphic systems may have a positive effect on user-system interactions in certain contexts, like customer service [184], it is important to anticipate downstream harms. For example, users may experience emotional attachments to AI systems, which may give rise to dependence and over-reliance on AI systems [26].

 

Appendix D. Challenges with prompting gen AI for pedagogy

Recent review articles found that although prompted gen AI approaches tend to do better than their ITS predecessors in constrained tutoring scenarios where the number of concepts and possible teaching strategies is small, these systems perform poorly in more general learning scenarios [90, 92, 98]. A major disadvantage of the prompting approach is that there are limits to how much it can push the behaviour of the gen AI away from the core principles fine-tuned into the model during the pre-training and instruction tuning phases, as discussed in more detail below. Note, however, that gen AI models improve continuously, including in their ability to follow prompts, so many of the results discussed next may no longer hold by the time this report is published.

 

Appendix F. Challenges with eliciting human preferences for pedagogy

In order to use Reinforcement Learning (RL) to fine-tune gen AI for education, it is important to train a Reward Model (RM) that can provide an evaluative signal on either how well each single response produced by a gen AI model rates in terms of its pedagogical value, or how well a whole multi-turn interaction with a learner has helped that learner achieve their goal. Such RMs are typically trained by eliciting feedback from human raters, who are usually presented with a pair of model responses and asked to judge which one they prefer based on certain criteria. Improving gen AI models through this process is called RL from Human Feedback, or RLHF. Currently, RLHF has only been applied at the single-turn level (rather than at the conversation level) [90, 208], and, as far as we are aware, human preference collection has not so far been generalised to pedagogy. This is because eliciting human preferences reliably is already a hard task, and doing so for pedagogy amplifies the existing problems. For example, inconsistencies between different raters are exacerbated because good pedagogy is hard to define and there are multiple, possibly equally valid, pedagogical moves that can be made in each situation. It is also not clear whether the preferences should be elicited from learners, educators, or both, and how they should be combined if it is the latter.
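For readers unfamiliar with how such an RM is typically trained from pairwise judgements, the sketch below shows the standard pairwise-preference (Bradley-Terry) objective. This is the generic RLHF recipe, not the paper's implementation; the reward model itself is abstracted away as whatever produces the two scalar scores.

```python
import torch
import torch.nn.functional as F

# Sketch of the standard pairwise-preference (Bradley-Terry) loss used to train
# a reward model from rater judgements of the kind described above.
def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the preferred response outranks the rejected one."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```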

 

Appendix G. Sociotechnical limitations of text-based gen AI

It is important to frame the work on gen AI tutor model development in terms of the sociotechnical limitations of text-based gen AI. It is natural to think of an AI tutor as approximating a human tutor; however, approximating human tutors closely may not always be desirable or possible. While modern gen AI models can provide impressive, often human-like responses, text-based interaction is usually only a fragment of human communication. At least relative to today’s state-of-the-art models, human tutor advantages include:

 

 

The design of an AI tutor should take into account these shortcomings with respect to human interaction, in addition to well-known limitations on current model capabilities like confident generation of false or misleading information, unpredictable failure to generalise learned behaviour appropriately, improper use of tools leading to incorrect calculations, and missing introspection that might allow for post hoc correction of mistakes (also see Section D).