1 of 69

Avishkar Bhoopchand

Google DeepMind, Deep Learning Indaba

13 February 2025

MENA ML - Doha, Qatar

AI For Education

The challenges and opportunities

2 of 69

Talk Outline

  1. Introduction
  2. The Scale and Complexity of the Problem
  3. Evaluation
  4. Technical / Research Challenges in AI for Education Systems
  5. Ethics and safety considerations

3 of 69

Introduction

11 of 69

UN SDG 4

Ensure inclusive and equitable quality education and promote lifelong learning opportunities for all.

12 of 69

Bloom, B. 1984. The 2 Sigma Problem: The Search for Methods of Group Instruction as Effective as One-to-One Tutoring
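Bloom's paper reports that the average student tutored one-to-one scored about two standard deviations above the average student in a conventional class. As a rough illustration only, and assuming normally distributed scores with equal spread in both groups (an assumption made for this sketch), the snippet below converts an effect size in standard deviations into a percentile of the conventional class. The function name is hypothetical.

```python
from statistics import NormalDist


def percentile_of_control(effect_size_sd: float) -> float:
    """Fraction of the control group scoring below the average treated student.

    Assumes normally distributed scores with equal spread in both groups.
    """
    return NormalDist(mu=0.0, sigma=1.0).cdf(effect_size_sd)


two_sigma = percentile_of_control(2.0)
print(f"{two_sigma:.1%}")  # about 97.7%
```

Under these assumptions, a 2-sigma gain places the average tutored student ahead of roughly 98% of a conventionally taught class.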

13 of 69

The most effective teaching method is also not possible to scale

14 of 69

Or is it..?

15 of 69

Enter Generative AI

Students are early adopters

Survey usage stats:

  • 86% use genAI
  • 54% weekly
  • 24% daily

Digital Education Council Global AI Student Survey 2024

16 of 69

So is ChatGPT enough…?

LLMs convey information

They are tuned to be helpful.

This is not the same as learning

Creates a false sense of mastery

Lehmann M et al. 2024. AI Meets the Classroom: When Does ChatGPT Harm Learning?

17 of 69

The Scale & Complexity of the Problem

18 of 69

Education for all

1.5 billion students worldwide each with different:

  • Prior knowledge
  • Learning speeds
  • Native languages
  • Cultural contexts
  • Learning preferences
  • Access to technology
  • Support systems at home

19 of 69

Not just for Amira in her maths class…

The elementary student learning to read

The high school student tackling physics

The adult learning a new language

The vocational student mastering a trade

20 of 69

Complexity of 1:1 Tutoring

25 of 69

Immediate Feedback

Infers the student’s knowledge

Adjusts explanations to suit the student

Builds Rapport and Trust

Adjusts to emotional & cognitive state

26 of 69

Pedagogy Principles

Encourage Active Learning

Deepen Metacognition

Manage Cognitive Load

Motivate & Stimulate Curiosity

Adapt to learner’s goals & needs

Jurenka et al. 2024. Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach

27 of 69

An AI Tutor needs to…

Adhere to pedagogical principles

Model the full space of human knowledge

Understand how knowledge builds upon itself

Adapt to individual learning trajectories

Maintain engagement over months and years

Do all this while being robust, fair, and explainable

28 of 69

AI is set to disrupt Education

29 of 69

Haven’t we heard this all before?

MOOCs were also set to disrupt education:

  • Democratise access to quality education globally
  • Self-paced interactive learning would outperform traditional education
  • Free / low-cost would disrupt traditional institutions

30 of 69

The Reality…

MOOCs have very low completion rates: < 10%

Those who succeed are typically already well educated (often with postgraduate degrees) and affluent

They are already skilled at self-directed learning

Sustainable business models have proved elusive

31 of 69

The Lessons

Technology often amplifies rather than bridges existing educational divides

  • We need to design explicitly for struggling learners

Learning infrastructure is as important as content

  • Design with community, feedback, and accountability in mind

Tools for independent learning work best for those who need them the least

  • Self-learning is a complex metacognitive skill that needs to be developed

Education is deeply embedded in social and cultural contexts

  • One-size-fits-all solutions typically fail

Technology alone is not the answer

32 of 69

Sociotechnical systems

Stakeholders

Community Impact

Power Dynamics

Institutional practices

Cultural Norms

Policy & Governance

33 of 69

Demo - LearnLM

34 of 69

Jurenka et al. 2024. Towards Responsible Development of Generative AI for Education: An Evaluation-Driven Approach

LearnLM Team, Google. 2024. LearnLM: Improving Gemini for Learning

35 of 69

Evaluation

36 of 69

Which outcome is better?

🧑🏽‍🎓

Gets perfect test scores but can’t apply concepts to a new problem

👨🏻‍🎓

Scores lower but develops a deep understanding

🧑🏽‍🎓

Learns slowly, but develops strong study habits

👨🏻‍🎓

Makes quick progress, but becomes dependent on ChatGPT

37 of 69

What to measure?

Test scores, completion rates, engagement time

But this is not the full picture

Need to consider:

  • The time effect
  • Multiple goals
  • Generalisation
  • Ethical considerations

We need: standard protocols, longitudinal studies, metrics for understanding, measurements of secondary effects

38 of 69

Types of Evaluation

Extrinsic

Measures actual educational outcomes and impact

  • Can take a long time
  • Expensive
  • Hard to attribute

Better for strategic decisions

Intrinsic

Measure performance on specific tasks or capabilities

  • Quick to measure
  • Clear metrics
  • Relatively cheap
  • But may not correlate with real impact

Good for rapid iteration

39 of 69

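[Figure: evaluation methods placed along a time axis running from 1 hour to 1 year. Methods shown: automatic evaluations; side-by-side comparisons (turn level and conversation level); (simulated) learner ratings; two-stage pedagogy ratings; effectiveness studies; qualitative studies; user feedback; longitudinal studies. The figure is also labelled "Iterate" and "Strategise".]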

41 of 69

Automatic Evaluations

What are they?

  • A Function: model output => metric
  • e.g. Language model rates responses of another language model

Why are they needed?

  • Human evaluations are slow and expensive
  • Allows for fast iteration … are our ideas working?
  • Compare many models
  • Quantify and track progress

Challenges

  • Lack of public benchmarks (none for tutoring!)
  • Evaluating good pedagogy is hard for humans, let alone for models
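A minimal sketch of the "model output => metric" idea. The rater here is a deliberately crude, hypothetical stand-in (it only checks whether a response contains a question); in practice the rater could be another language model prompted with a rubric. All names and example responses are invented for illustration.

```python
from typing import Callable, Sequence

# A rater maps one tutor response to a score in [0, 1]. In practice this could
# be another language model; here it is a plain function.
Rater = Callable[[str], float]


def asks_guiding_question(response: str) -> float:
    """Hypothetical proxy for 'encourages active learning':
    1.0 if the response contains a question, else 0.0."""
    return 1.0 if "?" in response else 0.0


def automatic_eval(responses: Sequence[str], rater: Rater) -> float:
    """Model outputs => metric: the mean rater score over all responses."""
    if not responses:
        raise ValueError("need at least one response to evaluate")
    return sum(rater(r) for r in responses) / len(responses)


# Invented example outputs from two hypothetical models.
model_a = ["The answer is 12.", "x equals 4."]
model_b = ["What happens if you divide both sides by 3?", "The answer is 12."]
print(automatic_eval(model_a, asks_guiding_question))  # 0.0
print(automatic_eval(model_b, asks_guiding_question))  # 0.5
```

Because the metric is just a function, it can be rerun on every model variant, which is what makes fast iteration and progress tracking possible. The weakness is also visible: a proxy this simple may not correlate with real learning.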

42 of 69

Automatic Evaluations

Break down the task

Carefully craft scenarios

Give the model context

Scenario-based supporting info

44 of 69

LearnLM Team, Google. 2024. LearnLM: Improving Gemini for Learning

46 of 69

Tutor CoPilot: A Randomised Controlled Trial

A “copilot” for online tutors

Size: 900 tutors; 1800 students

Metric: Exit tickets

Result: Significant increase in scores in treatment group

Larger effect for less effective and less experienced tutors

(But still far from 2-sigma!)

Wang R et al., 2024. Tutor CoPilot: A Human-AI Approach for Scaling Real-Time Expertise
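To make the comparison with 2-sigma concrete, a treatment effect can be expressed in standard-deviation units. The sketch below computes Cohen's d, a common standardised effect size, chosen here for illustration rather than taken from the study. The scores are invented and are not data from the Tutor CoPilot trial.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment: list[float], control: list[float]) -> float:
    """Standardised mean difference using the pooled sample variance."""
    n_t, n_c = len(treatment), len(control)
    pooled_var = (
        (n_t - 1) * variance(treatment) + (n_c - 1) * variance(control)
    ) / (n_t + n_c - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


# Invented scores for illustration only; not data from any study.
control = [2.0, 4.0, 6.0, 8.0]
treatment = [3.0, 5.0, 7.0, 9.0]
d = cohens_d(treatment, control)
print(round(d, 2))  # 0.39
```

On this scale, Bloom's result corresponds to d of about 2, which gives a sense of how large the 2-sigma target is.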

47 of 69

Tutor CoPilot: A Randomised Controlled Trial

Qualitative analysis: higher quality strategies in treatment group

User feedback

  • Helpful especially for breaking down complex concepts
  • Sometimes not appropriate for student’s grade-level

48 of 69

Taxonomy of Choices (Intrinsic Evals)

Data Collection

  • Real learners vs. role-playing experts
  • Single-turn vs. multi-turn
  • Unguided vs. scenario-guided
  • Advanced learner vs. novice learner

Ratings

  • Human vs. automatic
  • Learner vs. educator
  • Single-turn vs. conversation-level
  • Pointwise vs. pairwise (side by side)
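For pairwise (side by side) ratings, one simple way to summarise rater judgements is a win rate. This is a sketch under assumptions: the label scheme ("A", "B", "tie") and the choice to count ties as half a win are conventions picked for illustration, and the ratings are invented.

```python
from collections import Counter


def win_rate(preferences: list[str], model: str = "A") -> float:
    """Share of side-by-side comparisons won by `model`; ties count as half."""
    if not preferences:
        raise ValueError("need at least one comparison")
    counts = Counter(preferences)
    return (counts[model] + 0.5 * counts["tie"]) / len(preferences)


# Hypothetical rater judgements, one label per side-by-side comparison.
ratings = ["A", "B", "A", "tie", "A", "B", "tie", "A"]
print(win_rate(ratings, "A"))  # 0.625
print(win_rate(ratings, "B"))  # 0.375
```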

49 of 69

Technical & Research Challenges in AI for Education

50 of 69

The Challenges from an AI perspective

Multi-modal interaction (visual, verbal, written)

Real-time state estimation (student understanding)

Dynamic task selection (what to teach next)

Natural language generation (explanations)

Causal reasoning (why did the student make this mistake?)

Long-term planning (curriculum design)

All while handling partial observability and delayed feedback

51 of 69

Knowledge Representation & Tracing

Questions:

  • How do we model the relationship between concepts?
  • How do we handle (and encourage) multiple valid solution paths?
  • How do we model forgetting?

Possible directions:

  • Knowledge graph embeddings
  • Diverse reasoning traces
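One classical baseline for tracing what a learner knows is Bayesian Knowledge Tracing, shown here with an added forgetting step. This is a swapped-in textbook technique used for illustration, not a method from the talk, and it models a single skill rather than relationships between concepts. All parameter values are illustrative and not fitted to any data.

```python
from dataclasses import dataclass


@dataclass
class SkillParams:
    """Illustrative parameter values, not fitted to any data."""
    p_learn: float = 0.2    # chance of learning the skill after a practice step
    p_forget: float = 0.05  # chance of forgetting a known skill between steps
    p_slip: float = 0.1     # chance of answering wrongly despite knowing
    p_guess: float = 0.25   # chance of answering correctly without knowing


def update_mastery(p_known: float, correct: bool, params: SkillParams) -> float:
    """One tracing step: Bayes update on the answer, then learn/forget transition."""
    if correct:
        known = p_known * (1 - params.p_slip)
        unknown = (1 - p_known) * params.p_guess
    else:
        known = p_known * params.p_slip
        unknown = (1 - p_known) * (1 - params.p_guess)
    posterior = known / (known + unknown)
    # Known skills may be forgotten; unknown skills may be learned.
    return posterior * (1 - params.p_forget) + (1 - posterior) * params.p_learn


p = 0.3  # assumed prior probability that the learner already knows the skill
for answer in [True, True, False, True]:
    p = update_mastery(p, answer, SkillParams())
    print(f"answered {'right' if answer else 'wrong'} -> P(known) = {p:.2f}")
```

Setting `p_forget` to zero recovers the standard model without forgetting; richer approaches would also need to share evidence across related concepts.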

52 of 69

Theory of Mind

Questions:

  • How do we infer the knowledge state of the learner?
  • What misconceptions might explain a wrong answer?

Possible directions:

  • Multi-Agent methods
  • Causal models of misconceptions
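A toy sketch of asking which misconception best explains a wrong answer. Each hypothesis is a procedure the learner might be following, and Bayes' rule weighs the hypotheses against the observed answer. The candidate misconceptions, the prior, and the crude noise model are all assumptions invented for this example.

```python
from fractions import Fraction


def add_correctly(a: Fraction, b: Fraction) -> Fraction:
    return a + b


def add_tops_and_bottoms(a: Fraction, b: Fraction) -> Fraction:
    """Hypothetical error: add numerators and denominators separately."""
    return Fraction(a.numerator + b.numerator, a.denominator + b.denominator)


def add_tops_keep_larger_bottom(a: Fraction, b: Fraction) -> Fraction:
    """Hypothetical error: add numerators, keep the larger denominator."""
    return Fraction(a.numerator + b.numerator, max(a.denominator, b.denominator))


HYPOTHESES = {
    "mastered": add_correctly,
    "adds tops and bottoms": add_tops_and_bottoms,
    "keeps larger bottom": add_tops_keep_larger_bottom,
}
# Assumed prior beliefs about the learner; illustrative values only.
PRIOR = {"mastered": 0.6, "adds tops and bottoms": 0.25, "keeps larger bottom": 0.15}


def explain_answer(a: Fraction, b: Fraction, observed: Fraction,
                   noise: float = 0.05) -> dict[str, float]:
    """Posterior over hypotheses: a hypothesis that predicts the observed
    answer gets likelihood 1 - noise, any other gets likelihood noise."""
    weights = {}
    for name, rule in HYPOTHESES.items():
        likelihood = 1 - noise if rule(a, b) == observed else noise
        weights[name] = PRIOR[name] * likelihood
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# The learner answers 1/2 + 1/3 = 2/5.
posterior = explain_answer(Fraction(1, 2), Fraction(1, 3), Fraction(2, 5))
for name, prob in posterior.items():
    print(f"{name}: {prob:.2f}")
```

A single answer is weak evidence, so a tutor would typically follow up with a question chosen to separate the remaining hypotheses.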

53 of 69

Personalisation

Questions:

  • How do we handle the cold start problem?
  • How do we balance exploration vs exploitation?
  • How do we infer learning styles and preferences?
  • How do we infer the emotional state of the learner from behaviours?

Possible directions:

  • Agentic methods and clever prompting strategies
  • Meta-learning for fast adaptation
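The exploration vs exploitation trade-off can be illustrated with a multi-armed bandit. The sketch below uses UCB1, a standard bandit rule chosen for illustration and not taken from the talk. It assumes a hypothetical setup where each arm is a type of learning activity and a "success" is some binary signal, such as the learner answering a follow-up question correctly.

```python
import math


def choose_activity(attempts: list[int], successes: list[int]) -> int:
    """UCB1: pick the activity with the best optimistic estimate of success rate."""
    # Cold start: try every activity once before comparing estimates.
    for index, n in enumerate(attempts):
        if n == 0:
            return index
    total = sum(attempts)
    scores = [
        s / n + math.sqrt(2 * math.log(total) / n)  # mean + exploration bonus
        for n, s in zip(attempts, successes)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Equal evidence: exploit the activity with the higher observed success rate.
print(choose_activity([10, 10], [8, 2]))   # 0
# Activity 1 has barely been tried, so the bonus favours exploring it.
print(choose_activity([100, 1], [60, 0]))  # 1
```

A rule like this treats every learner identically at the start; using what is known about similar learners to set better initial estimates is one way to soften the cold start.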

54 of 69

Content Generation

Questions:

  • How do we ensure factual accuracy?
  • How do we control the difficulty level?
  • How do we make content culturally relevant and appropriate?
  • How do we communicate with the learner in the language they’re most comfortable learning in?

Possible directions:

  • Grounded content generation
  • Recognition and handling of uncertainty
  • Factuality verification
  • Context-aware methods for multicultural and multilingual understanding

55 of 69

Interaction

Questions:

  • What modalities beyond text are most effective for pedagogy?
  • What is the role of interactivity in driving learning outcomes?

Possible directions:

  • Multi-modal models
  • Code generation
  • Human Computer Interaction and UX Research
  • Tool use

56 of 69

Optimising for Learning Outcomes

Questions:

  • What does AI pedagogy look like?
  • Can we move beyond behaviour cloning of human tutors?

Possible directions:

  • Multi-turn RL
  • Learner simulators

57 of 69

Ethical and Safety Considerations

58 of 69

Privacy and Data Protection

Learner data is needed for a personalised and effective experience

Education data can be sensitive; may involve work with minors

Case Study:

  • InBloom - non-profit to manage & warehouse student records.
  • Collected 400+ data points incl. family relationships, records about abuse & violence etc.
  • Legislation created to prevent state schools sharing data with 3rd party aggregators

59 of 69

Equity & Access

Risk of widening the educational gap and digital divide

LLMs tend to be Western-centric and don’t perform as well in the languages and contexts of underrepresented regions

E.g. access to online learning during COVID-19 pandemic lockdowns

60 of 69

Dependency

Risk of children developing emotional dependency on AI tutors

Striking a balance between being helpful and making students dependent on the technology

Anthropomorphism and self-disclosure

61 of 69

Safety

Safety considerations look different

Discussion of otherwise sensitive topics may be allowed in educational settings, e.g. for developing critical thinking skills or learning about history

Conflict between good pedagogical practice and safety

  • E.g. tune an AI tutor to be encouraging of questions
  • But what about:

      Student: <harmful question>
      Tutor: That’s a great question! I’m glad you’re thinking about this

62 of 69

Conclusion

68 of 69

Summary

Global challenge in quality education for all

Huge opportunity to leverage AI for social impact

Current GenAI is not suitable for this task

Problem is more difficult than it sounds!

Many exciting research opportunities

Join us in this journey!

69 of 69

The End

Thank You

شكراً