1 of 99

Authenticity and Assessment in the Age of AI

Dominik.Lukes@ctl.ox.ac.uk

2 of 99

Reading & Writing Innovation Lab

bit.ly/ox-rewrilab

Consultations for staff and students

E-readers

Tablets & styluses

Reading apps

Writing tools

Note taking tools

Banbury Road

In-person visits

Online consultations

3 of 99

Beyond ChatGPT: State of AI, October 2023

4 of 99

Download slides

bit.ly/3yOlMAP

5 of 99

Language pedagogy is my passion

6 of 99

The three As

Authenticity

Assessment

AI

7 of 99

Three pairs

Authenticity

Assessment

AI

Authenticity

Assessment

AI

8 of 99

Assessment and authenticity

9 of 99

Assessment has to solve more than authenticity.

10 of 99

Assessment has to be 3 things at once

Authentic

Represents a real-world task

Reliable

Reflects actual performance

Scalable

Can be done quickly enough for large numbers

11 of 99

But let’s remember the engineering triangle

Pick any two

Good

Fast

Cheap

12 of 99

Engineering triangle in practice

13 of 99

Do we have an assessment triangle?

Pick any two

Authentic

Scalable

Reliable

14 of 99

Assessment triangle in practice

15 of 99

Note: Not my original idea. Overheard at a conference.

16 of 99

Cognitive break and questions

In the triangle of assessment, we try,

To be scalable, reliable, and fly,

Yet authenticity’s claim,

Is a part of the game,

Balancing three is the aim, oh my!

17 of 99

How I made the previous slide

18 of 99

AI and authenticity

19 of 99

What is AI? Brief terminological detour

20 of 99

Often heard

ChatGPT is just a fancy autocomplete. It just predicts the next word.
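
To make "predicts the next word" concrete, here is a minimal sketch of literal next-word prediction by counting word pairs. The toy corpus and the function name `predict_next` are made up for illustration. Real LLMs are neural networks trained on vastly more text, not lookup tables of counts.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for this illustration.
corpus = "the teacher sets the exam and the student sits the exam".split()

# For every word, count which words follow it in the corpus.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1


def predict_next(word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = following.get(word)
    return counts.most_common(1)[0][0] if counts else None


print(predict_next("the"))
```

A predictor like this only knows what it has literally seen, which is one reason the "fancy autocomplete" picture undersells what current models do.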

21 of 99

Literally true, but not a useful way to think about AI.

22 of 99

Generative AI is a universal (semantic) translator

Language to language

Style to style

Structured text to unstructured

Unstructured to structured

Question to answer

Text to label

Image to text

Text to image

Text to code

Code to text

23 of 99

Translating languages, styles, modalities

24 of 99

Some things don’t feel like translation but AI treats them that way

25 of 99

This approach can even be used to replace old specialised systems

26 of 99

But why semantic? Doesn’t it just predict the next word?

27 of 99

The relations look something like this

28 of 99

When AI is generating text it is doing something like this

29 of 99

This may look simple in 3D space

| Word | Vector |
|---|---|
| teacher | (0.8, 0.1, 0.1) |
| student | (0.8, 0.1, 0.2) |
| classroom | (0.1, 0.8, 0.1) |
| homework | (0.3, 0.3, 0.8) |
| school | (0.2, 0.9, 0.1) |
| textbook | (0.5, 0.1, 0.5) |
| curriculum | (0.5, 0.2, 0.5) |
| exam | (0.3, 0.2, 0.9) |
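
The toy vectors above can be compared with a little geometry. The sketch below uses cosine similarity, a common way to measure how closely two vectors point in the same direction. Using cosine similarity here is an assumption made for illustration, and the helper names `cosine_similarity` and `nearest` are hypothetical.

```python
import math

# Toy 3D vectors copied from the list above.
vectors = {
    "teacher": (0.8, 0.1, 0.1),
    "student": (0.8, 0.1, 0.2),
    "classroom": (0.1, 0.8, 0.1),
    "homework": (0.3, 0.3, 0.8),
    "school": (0.2, 0.9, 0.1),
    "textbook": (0.5, 0.1, 0.5),
    "curriculum": (0.5, 0.2, 0.5),
    "exam": (0.3, 0.2, 0.9),
}


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word):
    """Return the other word whose vector points in the most similar direction."""
    return max(
        (other for other in vectors if other != word),
        key=lambda other: cosine_similarity(vectors[word], vectors[other]),
    )


for word in ("teacher", "classroom", "exam"):
    print(word, "->", nearest(word))
```

With these numbers, words about people, places and tasks end up closest to their own kind. The same arithmetic works unchanged on vectors with thousands of dimensions.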

30 of 99

What is inside an LLM is not statistics but geometry.

31 of 99

BUT it’s all geometry in 10,000 dimensions

32 of 99

Each item is one giant vector

[-0.08041892945766449, -0.023566607385873795, 0.04585130512714386, 0.037420596927404404, -0.09120217710733414, 0.022545181214809418, -0.0019880179315805435, 0.0587424635887146, 0.07126272469758987, -0.02159898355603218, -0.07145281136035919, 0.09984641522169113, -0.05501342564821243, 0.02485564909875393, 0.01755008101463318, 0.014556304551661015, -0.15110555291175842, -0.000567720562685281, 0.10030079632997513, -0.045505933463573456, -0.06274029612541199, -0.0683555155992508, 0.0008911662152968347, 0.01842709816992283, 0.06299598515033722, 0.02255615033209324, -0.09917508065700531, -0.07070962339639664, 0.08635025471448898, 0.06686452776193619, -0.0407336950302124, -0.04072333127260208, -0.01974628120660782, 0.07472220063209534, -0.024722471833229065, -0.13420116901397705, -0.01812688820064068, -0.07096941769123077, -0.05353084206581116, -0.10960721969604492, -0.017906684428453445, -0.04733094945549965, -0.02091103047132492, 0.1269848346710205, -0.05413510650396347, -0.046787846833467484, 0.0024005023296922445, -0.07217800617218018, -0.029329143464565277, 0.007498500403016806, -0.034666456282138824, -0.03568940982222557, 0.03427724167704582, 0.02315753698348999, -0.008645392023026943, 0.05333952233195305, 0.07456360012292862, 0.147796630859375, -0.006483903620392084, -0.08905889838933945, 0.03265034034848213, -0.0732979029417038, 0.04066538065671921, 0.023211032152175903, -0.012049349024891853, -0.02828565053641796, 0.019329581409692764, 0.09989447146654129, 0.1430598795413971, -0.061100199818611145, -0.030345138162374496, -0.02984507940709591, -0.028366880491375923, 0.052052564918994904, 0.036766957491636276, 0.003982939291745424, -0.077084481716156, 0.05044832453131676, -0.11687757074832916, 0.06646141409873962, 0.016255078837275505, -0.06982151418924332, -0.000822143629193306, -0.0026820336934179068, -0.004263593349605799, 0.09659365564584732, 0.06130471080541611, -0.06840908527374268, 0.06686245650053024, -0.04831290245056152, 0.08598440140485764, 
0.08331689983606339, 0.08026000112295151, 0.05451888591051102, -0.03798443824052811, 0.04084145650267601, -0.12311697751283646, 0.023645302280783653, 0.005237551871687174, 0.03906212002038956, 0.037468183785676956, -0.05121520534157753, -0.10456130653619766, 0.009842721745371819, 0.04819759353995323, -0.13286681473255157, 0.02991127222776413, -0.06024811416864395, 0.04108288511633873, -0.008447377011179924, -0.07916080206632614, 0.06436653435230255, 0.017831943929195404, -0.054629500955343246, 0.027066148817539215, -0.030593710020184517, -0.10156133025884628, -0.0013401528121903539, 0.0011191506637260318, 0.009616676717996597, -0.02962290495634079, 0.0042936066165566444, 0.013841508887708187, -0.047656722366809845, -0.003912750165909529, 0.06500802934169769, 0.001283025718294084, -0.0816996768116951, 0.06566621363162994, -0.010957532562315464, -0.028156422078609467, 0.08978854864835739, -0.0003194105520378798, -0.02697799727320671, -0.006005867850035429, 0.07932088524103165, 0.021490609273314476, 0.013727870769798756, -0.019940776750445366, 0.031798265874385834, -0.0457642637193203, 0.03235720098018646, -0.022082772105932236, -0.04902353510260582, -0.11819718778133392, -0.04506421089172363, -0.046244461089372635, 0.029877550899982452, -0.07711911201477051, 0.05314543470740318, -0.09000932425260544, -0.023750705644488335, -0.05107633396983147, 0.001467616413719952, -0.02442317083477974, 0.01248782780021429, 0.06548482179641724, 0.043813593685626984, 0.06102786585688591, 0.021692050620913506, -0.052160654217004776, -0.009674523957073689, -0.072069451212883, -0.08633119612932205, -0.05121589079499245, -0.08108754456043243, 0.03608304262161255, 0.06553766876459122, -0.0727415531873703, -0.09346839785575867, -0.07251054048538208, 0.04504929482936859, -0.01773262582719326, -0.0005254627903923392, -0.0035706141497939825, 0.09068302065134048, 0.0152428038418293, 0.009525319561362267, 0.02502918615937233, 0.02807294949889183, -0.08951258659362793, 0.018022941425442696, 
0.04113161191344261, -0.09941867738962173, 0.03642140328884125, 0.07755865901708603, 0.014834643341600895, -0.05757498741149902, -0.0052739898674190044, -0.03217893838882446, 0.029460914433002472, -0.03587955981492996, 0.016881171613931656, -0.015574142336845398, -0.10131996870040894, -0.01736866682767868, 0.014807181432843208, -0.03830776736140251, -0.0307577196508646, -0.04063287377357483, 0.0017508701421320438, 0.06622152030467987, 0.06959225982427597, 0.03921446576714516, -0.029292205348610878, -0.07731080055236816, -0.0757351890206337, 0.008267058990895748, 0.10628201067447662, -0.006961626932024956, -0.060704007744789124, -0.024280674755573273, -0.011232278309762478, 0.02305467799305916, -0.040246833115816116, 0.03551888465881348, -0.12048669904470444, -0.0057440041564404964, -0.008801680989563465, -0.038733456283807755, -0.0384967215359211, -0.0059003811329603195, 0.07543318718671799, 0.0029998512472957373, 0.11148137599229813, 0.0560586079955101, -0.01694066822528839, -0.020253779366612434, -0.11995487660169601, 0.10403268039226532, -0.022030610591173172, 0.019188301637768745, -0.03581297770142555, -0.04047590494155884, -0.03492145985364914, 0.027967417612671852, -0.07497915625572205, 0.032431814819574356, -0.025854842737317085, -0.10595495998859406, -0.09982465207576752, -0.05515384301543236, 0.02156943641602993, 0.05118619278073311, -0.03904290497303009, -0.022826874628663063, -0.053247928619384766, -0.10935184359550476, 0.0006719367229379714, -0.016026955097913742, 0.13483813405036926, 0.1173691526055336, -0.01902260072529316, -0.09690848737955093, -0.07585378736257553, 0.007626112550497055, 0.019889818504452705, -0.008633404038846493, 0.010355712845921516, 0.035737670958042145, 0.011519350111484528, -0.005264237057417631, -0.06305427849292755, -0.026263760402798653, 0.008310412988066673, -0.0068666874431073666, -0.13443514704704285, -0.025350390002131462, -0.0079041114076972, 0.014966381713747978, 0.01571144163608551, 0.06266333907842636, 
0.05788900703191757, -0.022854981943964958, 0.09513315558433533, 0.1284472942352295, -0.061813995242118835, -0.049407169222831726, -0.10701776295900345, 0.06945358961820602, -0.07409369200468063, -0.028664348646998405, -0.0144350565969944, 0.029182329773902893, 0.007034373469650745, -0.026693496853113174, 0.0590004064142704, -0.002902168082073331, 0.12047384679317474, 0.023063501343131065, -0.05780957639217377, 0.058589596301317215, 0.02074800431728363, -0.030389118939638138, -0.002812192542478442, 0.06409497559070587, -0.0015993582783266902, 0.007702010218054056, 0.013223372399806976, 0.012501182034611702]

33 of 99

This gives AI enormous power but it has limitations.

34 of 99

Big 3 AI limitations

Hallucination

Plausible but not real

Replicability

Different every time

Introspection

No access to own processes or training data
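
One common reason for "different every time" is that text generators typically sample from a probability distribution over possible next words rather than always taking the top one. A minimal sketch, assuming made-up probabilities and hypothetical function names (real tools may add other sources of variation):

```python
import random

# Hypothetical probabilities for the word following "The exam was".
next_word_probs = {"hard": 0.4, "easy": 0.3, "fair": 0.2, "cancelled": 0.1}


def sample_next_word(probs, rng):
    """Pick one word at random, weighted by its probability."""
    words = list(probs)
    weights = list(probs.values())
    return rng.choices(words, weights=weights, k=1)[0]


def greedy_next_word(probs):
    """Always pick the single most probable word (fully repeatable)."""
    return max(probs, key=probs.get)


rng = random.Random()  # unseeded, so the samples can differ on every run
samples = [sample_next_word(next_word_probs, rng) for _ in range(5)]
print("sampled:", samples)
print("greedy: ", greedy_next_word(next_word_probs))
```

Sampling gives variety at the cost of replicability. Always taking the top word is repeatable but tends to be repetitive.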

35 of 99

Biggest caveat

36 of 99

It is not always possible to tell ahead of time what AI will be good at.

37 of 99

Dell’Acqua et al. 2023: Navigating the Jagged Technological Frontier

“We suggest that the capabilities of AI create a “jagged technological frontier” where some tasks are easily done by AI, while others, though seemingly similar in difficulty level, are outside the current capability of AI.”

38 of 99

Can we tell how good AI would be at something?

39 of 99

Illustrating the Jagged Frontier - @techczech

Inside frontier

Expect hard for AI

Outside the frontier

Expect easy for AI

Speak any language (mostly) grammatically

Label grammar terms (metalanguage)

Explain E = mc²

Multiply numbers

Write a poem

Reverse a random string of letters

Count people in a story

Count words in a paragraph

Generate complex photos

Place things to the right of other things
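
For contrast, several of the tasks that this slide appears to place on the "unexpectedly hard for AI" side (multiplying numbers, reversing a string, counting words) are one-liners in conventional code. A minimal sketch with hypothetical function names:

```python
def reverse_string(text):
    """Reverse a string character by character."""
    return text[::-1]


def count_words(paragraph):
    """Count whitespace-separated words."""
    return len(paragraph.split())


def multiply(a, b):
    """Exact integer multiplication, however many digits are involved."""
    return a * b


print(reverse_string("qwzrt"))
print(count_words("Assessment has to be three things at once"))
print(multiply(1234, 5678))
```

What is trivial for a program and what is easy for a language model are different questions, which is part of why the frontier looks jagged.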

40 of 99

Metalanguage fail example (Claude Opus 17 May 2024)

41 of 99

How people experience the jagged frontier @techczech

😍

ChatGPT is amazing, there’s nothing it can’t do!

😡

ChatGPT is useless, it can’t get even the basics right!

42 of 99

Most tasks sit at an intersection.

AI is good as expected

AI is unexpectedly bad

AI is much better than expected

AI cannot do this as expected

43 of 99

A conversation I’ve had

Academic: Why are the references ChatGPT gives me wrong?

Me: ChatGPT hallucinates links and references. Do not use it to find them.

44 of 99

A conversation I’ve had

Me: ChatGPT hallucinates links and references. Don’t use it to find them.

Student: Every time I clicked on a link I asked for, it worked.

45 of 99

Walters & Wilder 2023 Fabrication and errors in the bibliographic citations generated by ChatGPT

46 of 99

AI has its own assessment triangle problem

47 of 99

All AI tools are powered by models.

48 of 99

What is a model

LLM

Large Language Model

49 of 99

One model many tools

GPT (3.5 or 4)

ChatGPT

TeacherMatic

Notion AI

MS Copilot

...

50 of 99

Different kinds of models

LLMs (Text generation)

Code generation

Speech recognition

Image generation

Text to speech

Voice cloning

Video generation

Music generation

51 of 99

More than a few

52 of 99

Which ones are best?

53 of 99

ChatGPT vs ChatGPT Plus

ChatGPT

    • GPT 3.5
    • free

ChatGPT Plus

    • GPT 4 / 4o
    • $20 / mo

54 of 99

Difference between GPT-3.5 and GPT-4

| Exam | GPT-4 Score | GPT-3.5 Score |
|---|---|---|
| Uniform Bar Exam | 298/400 (90th percentile) | 213/400 (10th percentile) |
| LSAT | 161 | 149 |
| SAT (total) | 1410 | 1260 |
| AP World History | 5 (89th-100th percentile) | 4 (74th-89th percentile) |
| AP Physics 2 | 4 (66th-84th percentile) | 3 (30th-66th percentile) |
| AP Psychology | 5 (83rd-100th percentile) | 5 (83rd-100th percentile) |
| AP Statistics | 5 (85th-100th percentile) | 3 (40th-63rd percentile) |
| Medical Final Examination (English) | 79.6% accuracy | 58.3% accuracy |
| Medical Final Examination (Polish) | 80.7% accuracy | 56.6% accuracy |

55 of 99

Example of improvements

56 of 99

GPT-3.5 is also less powerful in language

57 of 99

Spectrum of capabilities

Basic models

(GPT 3.5 Class)

    • Free
    • Many models
    • Open Source available
    • Text only
    • On device
    • <13b params

Sub-frontier models

    • Free/Paid
    • Some multimodal
    • Some Open Source/Weights
    • Beats GPT-3.5
    • Approach GPT-4 on some benchmarks

Frontier Models (GPT 4 Class)

    • Paid
    • 3 models, 4 providers
    • Closed source
    • Multimodal
    • >100b parameters

58 of 99

Three types of model assessment

Benchmarks

Head-to-head

Vibes

59 of 99

Benchmarks

60 of 99

MMLU: Most popular benchmark
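
MMLU is a multiple-choice test, so a headline score boils down to the share of questions answered correctly. A minimal sketch of that calculation, with invented answers and a hypothetical `accuracy` function (real evaluation setups differ in how they prompt the model and extract its answer):

```python
def accuracy(predictions, answer_key):
    """Fraction of multiple-choice answers that match the key."""
    if not answer_key or len(predictions) != len(answer_key):
        raise ValueError("predictions and answer_key must be non-empty and equal in length")
    correct = sum(pred == gold for pred, gold in zip(predictions, answer_key))
    return correct / len(answer_key)


# Invented example: four questions, each with options A to D.
model_answers = ["A", "C", "B", "D"]
answer_key = ["A", "B", "B", "D"]
print(f"{accuracy(model_answers, answer_key):.0%}")
```

A single percentage like this hides which questions were missed, which is one reason different benchmarks can rank the same models differently.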

61 of 99

Top models on MMLU in May 2024

62 of 99

But other benchmarks give different scores

63 of 99

Head-to-head

64 of 99

A good place to learn about them is LMSys Arena
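
Head-to-head leaderboards turn many pairwise "which answer was better?" votes into a ranking. The slides do not say how the Arena computes its scores, so the sketch below uses the classic Elo update purely as an illustrative assumption. The model names, votes and the `k` value are invented.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def update(ratings, winner, loser, k=32):
    """Shift ratings after one vote; surprising wins move ratings more."""
    gain = k * (1 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += gain
    ratings[loser] -= gain


# Invented models, all starting from the same rating.
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}

# Invented votes as (winner, loser) pairs.
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
for winner, loser in votes:
    update(ratings, winner, loser)

leaderboard = sorted(ratings, key=ratings.get, reverse=True)
print(leaderboard)
```

Because every vote is a direct comparison by a person, this kind of ranking captures preferences that fixed-answer benchmarks miss, but it also inherits the tastes of whoever is voting.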

65 of 99

Recent leaderboard – April 2024

66 of 99

Vibes

67 of 99

Let’s keep this in mind

68 of 99

How do you know what is on which side of the jagged frontier?

Here?

Here?

Here?

Here?

Here?

69 of 99

What can you do to get a “feel” for the AI jagged frontier?

Experiment

Tools

Prompts

Retries

Impressions

10+ hours

50+ tasks

2+ frontier models

Learning from others

Other users

Newsletters / YouTube / X

Research

70 of 99

Cognitive break and questions

Why did the robot chAIcken cross the road?

To get to the jagged technological frontier—because speaking any language is a breeze, but counting words in a paragraph? Now that's a real challenge!

71 of 99

AI and Assessment

72 of 99

What is the new ‘authenticity’ of the tasks we assess?

73 of 99

There’s an authenticity lag in assessment

1980s

Times tables

Calculators

1990s

Handwriting

Typing

2000s

Spelling

Spell check

74 of 99

Often heard

Spelling does not matter in the age of the spell check!

75 of 99

Spelling does not matter

Spelling becomes a reliable indicator of general knowledge and skill

76 of 99

What is authentic to core academic practice?

Engaging with text

Reading

Listening

Creating text

Writing

Dictation

Engaging with knowledge

Memory

Encyclopedia

77 of 99

The future of writing is conversation?

78 of 99

Dictation vs listening

👎 Dictation

Listening 👍

79 of 99

You are not dictating. You are sharing your thoughts.

80 of 99

Conversation

👎 Dictation

Conversation 👍

81 of 99

The future of writing is conversation

82 of 99

What this means: New form of writing

83 of 99

Output

84 of 99

85 of 99

How does reading change with AI cognitive scaffolding?

86 of 99

Step 1: Ask for bullets

87 of 99

Step 2: Ask for propositions

88 of 99

Step 3: Ask for tables

89 of 99

Step 4: Ask for questions

90 of 99

Step 5: Ask for examples

91 of 99

Step 6: Ask for poetry

92 of 99

Cognitive break and questions

AI shapes learning,

Authentic tasks redefine,

Future in our hands.

93 of 99

Final dilemma

94 of 99

Google announced watermarking

95 of 99

Should AI companies watermark all text generated by AI?

96 of 99

Should spell checkers report the errors you make to their maker, which then passes a list on to your employer?

97 of 99

Questions

98 of 99

Thank you

Dominik.Lukes@ctl.ox.ac.uk @techczech

99 of 99

This presentation is licensed under a Creative Commons Attribution license except where otherwise noted.

Icons and stock photos are licensed under Microsoft Premium Content and cannot be reused outside this document.