1 of 52

AI Advancements for Literacy Learning and Beyond

Presenters:

Tamara Tate, PhD

Youngsun Moon

Jiali Wang

This material is based on work supported by the National Science Foundation under Grant No. 23152984 and the Institute of Education Sciences, U.S. Department of Education, through Grant R305C190007 to University of California, Irvine.

2 of 52

Paper 1: Can ChatGPT Provide Useful Holistic Essay Scoring?

Tate, Steiss, Warschauer, Bailey, Ritchie, Tseng, Moon, and Graham

Paper 2: Comparing the Quality of Human and ChatGPT Feedback on Student Writing

Steiss, Tate, Cruz, Graham, Wang, Moon, Tseng, Warschauer, and Hebert

3 of 52

Outline

Section	Topic	Time (min)
Introduction	What is ChatGPT and how does it work?	3
Introduction	Why AI might have promise in classrooms	3
Method	Participants and Measures	3
Paper 1 Results	Scoring: Consistency & Validity	8
Paper 2 Results	Feedback: Qualities of Good Feedback	5
Paper 2 Results	Human vs. AI Feedback	8
Discussion	Contextualizing Results & What’s next	5

Digital Learning Lab

4 of 52

Introduction


5 of 52

What is ChatGPT and how does it work?

An artificial intelligence model trained to generate text by predicting what text comes next.
Serious flaws include the risk of data privacy breaches, unclear intellectual property protection, algorithmic biases that replicate biases in the training data, and “hallucinations,” in which generative AI gives completely wrong answers.
Our studies used GPT-3.5 (the same model as the currently free version of ChatGPT).


6 of 52

Cycle of Scoring & Formative Feedback

Human scoring and feedback are resource-intensive.

Generative AI tools might improve this cycle:

24/7 availability
Unlimited attempts
Best available human standard
Little or no training required, unlike prior automated writing evaluation (AWE) tools

7 of 52

Cycle of Scoring & Formative Feedback

But first, we need to understand:

Quality of scoring and feedback

8 of 52

Method


9 of 52

Student Essay Samples for Paper 1

Sample 1: Source-based argumentative writing in secondary history classrooms (n = 493)

WRITE Center field trial
Two-day writing assessment
Scored on holistic rubric 1-6
Trained raters, double scored

Sample 2: Source-based argumentative writing in secondary ELA classrooms (n = 344)

Pathway study
Two-day writing assessment
Scored on holistic rubric 1-6
Trained raters, double scored

Sample 3: PERSUADE 2.0 corpus (ELA), publicly available, sampled randomly (n = 949) out of over 25,000 essays

Argumentative essays, grades 6-12
SAT rubric
Trained raters, double scored

Paper 1 (Scoring)

Subset (n = 30) for RQ1

10 of 52

Sample Writing Prompt: WRITE Center, History


11 of 52

Paper 1:

Can ChatGPT Provide Useful Holistic Essay Scoring?

12 of 52

Paper 1: Can ChatGPT Provide Useful Holistic Essay Scoring?

How internally consistent are two humans and two iterations of the AI?
How does ChatGPT scoring compare to human raters on the same data?
Do scores systematically vary by English language status differently for ChatGPT ratings than for human ratings?


13 of 52

Internal Consistency of Scoring

Consistency (weighted kappa)

GPT-4 > Humans > GPT-3.5
.80/.88 > .73 > .73/.63 (high/low temperature)

Exact agreement: two humans 43%; AI 59%-82%, depending on model and temperature
Temperature (randomness) of the AI did not clearly affect consistency
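For readers unfamiliar with the consistency statistic, here is a minimal sketch of Cohen's weighted kappa in plain Python. The slides do not say whether linear or quadratic weights were used, so the quadratic default below is an assumption, and the scores in the usage lines are hypothetical, not study data.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories, power=2):
    """Cohen's weighted kappa for two raters on an ordinal scale.

    power=1 gives linear weights, power=2 quadratic weights.
    The choice of quadratic weights here is an assumption.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(rater_a)
    span = max(categories) - min(categories)
    observed = Counter(zip(rater_a, rater_b))  # joint counts of score pairs
    marg_a = Counter(rater_a)                  # marginal counts, rater A
    marg_b = Counter(rater_b)                  # marginal counts, rater B
    obs_disagreement = 0.0
    exp_disagreement = 0.0
    for i in categories:
        for j in categories:
            weight = (abs(i - j) / span) ** power
            obs_disagreement += weight * observed[(i, j)] / n
            exp_disagreement += weight * (marg_a[i] / n) * (marg_b[j] / n)
    if exp_disagreement == 0:
        return float("nan")  # undefined when both raters use a single score
    return 1.0 - obs_disagreement / exp_disagreement


# Hypothetical scores on a 1-6 holistic rubric (not study data).
first_pass = [3, 4, 2, 5, 3, 1, 4, 6]
second_pass = [3, 4, 3, 5, 2, 1, 4, 5]
print(weighted_kappa(first_pass, second_pass, categories=range(1, 7)))
```

A value of 1 means perfect agreement and 0 means agreement no better than chance; larger disagreements are penalized more heavily than adjacent ones.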

Subset (n = 30) for RQ1

14 of 52

Human & AI Population-Level Scores Similar

Sample and rater	Mean Score	SE	95% Confidence Interval
Sample 1: Human	3.16	.06	3.05-3.28
Sample 1: AI	2.69	.06	3.32-3.55
Sample 2: Human	2.69	.05	2.60-2.78
Sample 2: AI	2.71	.04	2.63-2.79
Sample 3: Human	3.38	.04	3.30-3.46
Sample 3: AI	3.36	.03	3.30-3.42
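The intervals in this table can be sanity-checked with a few lines of Python. The sketch below assumes they are normal-approximation 95% intervals (mean ± 1.96 × SE), which is consistent with most rows but is not stated on the slide. As transcribed, the Sample 1: AI row fails the check: its interval is centered near 3.44 while the listed mean is 2.69, which suggests a transcription error in that row.

```python
Z_95 = 1.96  # assumed critical value for a 95% normal-approximation interval

# (label, mean, SE, reported interval) exactly as transcribed from the slide.
rows = [
    ("Sample 1: Human", 3.16, 0.06, (3.05, 3.28)),
    ("Sample 1: AI", 2.69, 0.06, (3.32, 3.55)),
    ("Sample 2: Human", 2.69, 0.05, (2.60, 2.78)),
    ("Sample 2: AI", 2.71, 0.04, (2.63, 2.79)),
    ("Sample 3: Human", 3.38, 0.04, (3.30, 3.46)),
    ("Sample 3: AI", 3.36, 0.03, (3.30, 3.42)),
]


def normal_ci(mean, se, z=Z_95):
    """Return (lower, upper) bounds of mean +/- z * SE."""
    return mean - z * se, mean + z * se


def inconsistent_rows(table):
    """Labels of rows whose reported interval does not contain the mean."""
    return [label for label, mean, _se, (low, high) in table
            if not low <= mean <= high]


for label, mean, se, reported in rows:
    low, high = normal_ci(mean, se)
    print(f"{label}: computed {low:.2f}-{high:.2f}, reported {reported}")
print("Rows to double-check:", inconsistent_rows(rows))
```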

15 of 52

Individual Scores

Measure	Human-Human	Human-AI (Sample 1)	Human-AI (Sample 2)	Human-AI (Sample 3)
Weighted Kappa	.79	.52	.23	.52
Agreement within 1 point	74%	76%	83%	89%
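The agreement row counts how often two ratings of the same essay differ by at most one rubric point. A minimal sketch of that calculation, alongside exact agreement, using hypothetical scores rather than study data:

```python
def agreement_rates(rater_a, rater_b):
    """Return (exact agreement, agreement within 1 point) as proportions."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [abs(a - b) for a, b in zip(rater_a, rater_b)]
    n = len(diffs)
    exact = sum(d == 0 for d in diffs) / n       # identical scores
    within_one = sum(d <= 1 for d in diffs) / n  # identical or adjacent scores
    return exact, within_one


# Hypothetical 1-6 rubric scores for six essays (not study data).
human = [1, 3, 4, 6, 2, 5]
ai = [2, 3, 4, 4, 3, 5]
exact, within_one = agreement_rates(human, ai)
print(f"exact: {exact:.0%}, within 1 point: {within_one:.0%}")
```

Because this measure ignores how often raters would land within a point of each other by chance, it can stay high even when weighted kappa is low.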

16 of 52

Individual Scores

As of now, AI scoring is not consistent enough to substitute for human ratings in higher-stakes assessments, but it might be useful formatively
AI was less likely than humans to assign scores at the extremes (fewer 1s and 6s), which accounts for its higher percentage of agreement within 1 point

17 of 52

Regression: No Significant Difference, AI vs. Human

Predictor	B	SE	t	p
Constant	3.36	.07	51.18	0.000
AI scoring	.07	.11	0.65	0.584
Corpus 1	-.09	.02	-4.97	0.038
Corpus 2	-.51	.12	-4.46	0.049
English learner	-.28	.26	-1.11	0.382
AI x EL	-.03	.12	-.25	0.829
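To make the coefficients concrete, the sketch below turns them into predicted mean scores. It assumes every predictor is a 0/1 indicator and that the third corpus is the reference category; the slide does not state the coding, so treat both as assumptions. Under them, the AI-minus-human gap is .07 for non-English learners and .07 + (-.03) = .04 for English learners, and neither term is statistically significant.

```python
# Coefficients as reported on the slide.
COEF = {
    "constant": 3.36,
    "ai": 0.07,
    "corpus_1": -0.09,
    "corpus_2": -0.51,
    "el": -0.28,
    "ai_x_el": -0.03,
}


def predicted_score(ai=False, el=False, corpus=3):
    """Predicted mean score, assuming 0/1 dummy coding with corpus 3 as reference."""
    if corpus not in (1, 2, 3):
        raise ValueError("corpus must be 1, 2, or 3")
    score = COEF["constant"]
    if ai:
        score += COEF["ai"]
    if el:
        score += COEF["el"]
    if ai and el:
        score += COEF["ai_x_el"]  # interaction applies only to AI-scored EL essays
    if corpus == 1:
        score += COEF["corpus_1"]
    elif corpus == 2:
        score += COEF["corpus_2"]
    return score


for el in (False, True):
    human = predicted_score(ai=False, el=el)
    machine = predicted_score(ai=True, el=el)
    print(f"EL={el}: human {human:.2f}, AI {machine:.2f}, gap {machine - human:.2f}")
```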

18 of 52

Difference by English Language Status

Comparison	Non-EL: Wtd. Kappa	Non-EL: Agreement within 1 point	EL: Wtd. Kappa	EL: Agreement within 1 point
Sample 1: Human-Human	.82	73%	.81	80%
Sample 1: Human-AI	.51	77%	.40	75%
Sample 3: Human-AI	.52	89%	.43	92%

19 of 52

Difference by English Language Status

Human consistency did not differ much between ELs and non-ELs
AI scores were closer to human scores for non-ELs than for ELs
No specific harm to English learners in using AI scoring found


20 of 52

Implications of AI for Scoring

Humans are better than AI at scoring papers the way other humans do
AI is not bad at scoring the way humans do
Philosophical Q: Should students be writing for a non-human audience?
No additional detrimental effects to English learners
Best available human standard


21 of 52

Limitations

The experiment was a zero-shot exercise with no training

Teachers will likely not want to take the time to train a model for each essay prompt
Future research should consider how much training is needed to significantly improve scoring
Models continue to improve, as does our prompting (but agreement with humans is still only slight to moderate):

Model	Non-English Learners	English Learners
GPT-3.5	.52	.36
GPT-4	.58	.37

22 of 52

Paper 2:

Comparing the Quality of Human and ChatGPT Feedback on Student Writing

23 of 52

Example of ChatGPT & Human feedback for Student Writing

Human Feedback: Thank you for writing this essay on how the boycott succeeded! I think your claim, “Well the reason the boycott was successful was because Rosa Parks started to give flyers out to black people to stop using the buses” is interesting. But did Parks distribute the flyers? Better check that source again (I think it was Jo Ann Robinson and her students).

Next time, you might use this sentence starter when adding evidence to support your claim, “According to the Bus Boycott Timeline,___________.” This will help you support your claim with clear evidence directly from the source. Your reader might also be convinced because you get your evidence from a source too.

You are on the right track to becoming a great writer!

ChatGPT Feedback: Great start! You have identified some key factors that contributed to the success of the Montgomery Bus Boycott. However, your essay could benefit from more detail and explanation. For example, you mention that Rosa Parks gave out flyers, but you don't explain what was on the flyers or how they helped the boycott. Try to provide more specific examples and evidence to support your argument.

Additionally, it's important to note that Rosa Parks' refusal to give up her seat was not the only reason for the boycott's success. Can you think of any other factors that contributed to the boycott's success?

Keep up the good work!


24 of 52

Student Essay

Why was the Montgomery Bus Boycott successful? Well the reason the boycott was successful was because Rosa Parks started to give flyers out to black people to stop using the buses. It did work because after a while the buses were not getting the money they used to get when the white and black people were there. Only the white were on the buses.

The second way they got the boycott to be successful is, Rosa Parks was in the front of the bus but black people were in the back but she sat on the front which they were not supposed to be on. She…


26 of 52

Providing Feedback for Student Writing with ChatGPT

Providing formative feedback is important for student writing development (Graham et al., 2011; MacArthur, 2018)
Giving quality feedback is very time-consuming and demanding
Previous AI-based scoring tools were limited in providing feedback
ChatGPT can potentially provide targeted, adaptive, and useful feedback for student writing

Lessen the burden on teachers
Increase the timeliness of feedback

Introduction: Why AI might have promise in classrooms

27 of 52

Research Questions

How does the quality of formative feedback provided by AI compare to the quality of feedback provided by humans?
Does the quality of feedback vary for certain subgroups of essays (high- vs. low-scoring essays; ELs vs. non-ELs)?

28 of 52

Coding Scheme for Rating the Quality of Feedback (on a scale of 1-5)

Standard	Quality of feedback ranges from…
Criteria-based	does not explicitly reference criteria (generic) → explicitly references criteria of source-based argumentative writing (e.g., claim, evidence)
Clear directions for improvement	vague → always offers clear directions
Accuracy	inaccurate → accurate description of student essays
Prioritization of essential features	nonessential → essential, attainable, and reasonable for students’ immediate next step
Supportive tone	unsupportive and directive → affirming and supportive

29 of 52

Example of Human vs. ChatGPT feedback

Human Feedback: Thank you for writing this essay on how the boycott succeeded! I think your claim, “Well the reason the boycott was successful was because Rosa Parks started to give flyers out to black people to stop using the buses” is interesting. But did Parks distribute the flyers? Better check that source again (I think it was Jo Ann Robinson and her students).

Next time, you might use this sentence starter when adding evidence to support your claim, According to the “Bus Boycott Timeline,”___________. This will help you support your claim with clear evidence directly from the source. Your reader might also be convinced because you get your evidence from a source too.

You are on the right track to becoming a great writer!

ChatGPT Feedback: Great start! You have identified some key factors that contributed to the success of the Montgomery Bus Boycott. However, your essay could benefit from more detail and explanation. For example, you mention that Rosa Parks gave out flyers, but you don't explain what was on the flyers or how they helped the boycott. Try to provide more specific examples and evidence to support your argument.

Additionally, it's important to note that Rosa Parks' refusal to give up her seat was not the only reason for the boycott's success. Can you think of any other factors that contributed to the boycott's success?

Keep up the good work!

clear directions & prioritizes essential features

criteria-based

30 of 52

Coding Scheme for Rating Feedback


31 of 52

Analysis Methods

How does the quality of formative feedback provided by AI compare to the quality of feedback provided by human raters?

Descriptive statistics of coded scores for ChatGPT & Human scorers
ANOVA to compare between groups

Does the quality of feedback vary for certain subgroups of writers (high vs low scoring students; ELs vs. non ELs)?

ANOVA to compare three groups in each analysis: by essay score (low, average, high) and by language status (EL, RFEP, IFEP/EO)
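As an illustration of the comparison described above, here is a minimal one-way ANOVA in plain Python that returns the F statistic and its degrees of freedom. The ratings are hypothetical 1-5 feedback-quality scores, not study data, and the p-value lookup is left out to keep the sketch dependency-free.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or any(len(g) == 0 for g in groups) or n_total <= k:
        raise ValueError("need at least two non-empty groups and n > k")
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    if ss_within == 0:
        raise ValueError("F is undefined when there is no within-group variance")
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical accuracy ratings of feedback on low-, average-, and high-scoring essays.
low_essays = [4, 5, 4, 4]
average_essays = [4, 4, 3, 4]
high_essays = [3, 3, 4, 2]
f_stat, df_b, df_w = one_way_anova_f([low_essays, average_essays, high_essays])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

In practice a statistics package such as SciPy (`scipy.stats.f_oneway`) would also supply the p-value, and a post hoc correction would be applied when comparing specific pairs of groups.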


32 of 52

Comparing Scores for Human vs ChatGPT Feedback

Results

[Figure: scores for human vs. ChatGPT feedback, with significance markers (***, *) on the chart]

35 of 52

Whether the Quality of Feedback from ChatGPT varied for High- and Low-Scoring Essays


36 of 52

Whether the Quality of Feedback from Humans varied for High- and Low-Scoring Essays


37 of 52

Does the quality of feedback vary for essays of different levels of writing quality?


38 of 52

Does the quality of feedback vary for students with different language statuses?

Using a Scheffé correction for multiple comparisons, feedback given to students who were either initially fluent or English-only speakers (collectively, EO/IFEP) did not differ significantly from feedback given to students classified as English learners (ELs) or reclassified fluent speakers (RFEP).

Results

39 of 52

Does the quality of feedback vary for students with different language statuses?

No significant differences in the quality of feedback provided to ELs and non-ELs by either humans or AI

Results

40 of 52

Discussion: Implications of AI for formative feedback

Our human evaluators were well trained, experienced, and paid, and they had 3 weeks to complete the task
ChatGPT's feedback scored relatively close to theirs even under these conditions, and without the opportunity for multiple rounds of prompting
Feedback generated by ChatGPT can likely serve valuable instructional purposes, particularly in the early stages of writing, by motivating timely revision work

Be aware of potential challenges with accuracy for high-quality essays and with tone for low-quality essays


41 of 52

Contextualizing Results & What’s Next?

What do writers actually do with the feedback given by AI?
Benefits of iterative prompting if students/teachers have requisite AI literacy
Effects on teacher knowledge of student writing

Conclusion

42 of 52

Questions?

43 of 52

Acknowledgements


44 of 52

Thank you!

48 of 52

Pretend you are a secondary school teacher. Provide 2-3 pieces of specific, actionable feedback on each of the following essays written to this rubric that highlight what the student has done well and what they could improve on. Use a friendly and encouraging tone. If needed, provide examples of how the student could improve the essay.
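For anyone who wants to try a similar prompt, the sketch below assembles the instructions above together with a rubric and a batch of essays into a single prompt string. The ordering, the "Rubric:" and "Essay N:" labels, and the function name are illustrative assumptions; the slides do not show how the rubric and essays were attached, and no API call is made here.

```python
INSTRUCTIONS = (
    "Pretend you are a secondary school teacher. Provide 2-3 pieces of "
    "specific, actionable feedback on each of the following essays written "
    "to this rubric that highlight what the student has done well and what "
    "they could improve on. Use a friendly and encouraging tone. If needed, "
    "provide examples of how the student could improve the essay."
)


def build_feedback_prompt(rubric, essays):
    """Join instructions, rubric, and numbered essays (hypothetical layout)."""
    if not essays:
        raise ValueError("at least one essay is required")
    parts = [INSTRUCTIONS, "Rubric:\n" + rubric.strip()]
    for number, essay in enumerate(essays, start=1):
        parts.append(f"Essay {number}:\n{essay.strip()}")
    return "\n\n".join(parts)


# Placeholder rubric and essay text for illustration only.
prompt = build_feedback_prompt(
    rubric="(paste the scoring rubric here)",
    essays=["(first student essay)", "(second student essay)"],
)
print(prompt)
```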

49 of 52

Great job on providing a brief background on the Montgomery Bus Boycott and highlighting the key players involved. You have also provided some reasons why the boycott succeeded, such as the loss of revenue for the bus company and the strong leadership of King, MIA, NAACP, WPC, and churches. To improve your essay, consider providing more specific examples to support your points. For instance, you could mention how much revenue the bus company lost during the boycott or provide specific actions taken by the leaders and organizations involved. Additionally, you could expand on the impact of the boycott beyond Montgomery and how it influenced the Civil Rights Movement as a whole.

You did a wonderful job explaining what the Montgomery Bus Boycott was historically and you have a strong claim toward the end of your first paragraph: “The Bus Boycott has succeeded because most of the people who rode Montgomery's buses were African Americans and this made the company lose a lot of business and revenue from the boycott so this left them with no easy solutions to fix.” Next time, I would love to see some evidence from the sources to help back up your claim. This will show your readers that your claims are supported! Keep on practicing your writing and you’ll keep improving.

50 of 52

ChatGPT: Great job on providing a brief background on the Montgomery Bus Boycott and highlighting the key players involved. You have also provided some reasons why the boycott succeeded, such as the loss of revenue for the bus company and the strong leadership of King, MIA, NAACP, WPC, and churches. To improve your essay, consider providing more specific examples to support your points. For instance, you could mention how much revenue the bus company lost during the boycott or provide specific actions taken by the leaders and organizations involved. Additionally, you could expand on the impact of the boycott beyond Montgomery and how it influenced the Civil Rights Movement as a whole.

H: You did a wonderful job explaining what the Montgomery Bus Boycott was historically and you have a strong claim toward the end of your first paragraph: “The Bus Boycott has succeeded because most of the people who rode Montgomery's buses were African Americans and this made the company lose a lot of business and revenue from the boycott so this left them with no easy solutions to fix.” Next time, I would love to see some evidence from the sources to help back up your claim. This will show your readers that your claims are supported! Keep on practicing your writing and you’ll keep improving.

51 of 52

Scoring the Feedback

Experts blindly coded all human and AI feedback

Criteria-based
Clarity
Accuracy
Developmentally appropriate
Supportive tone

52 of 52

All were within 1 point