Authenticity and Assessment in the Age of AI
Dominik.Lukes@ctl.ox.ac.uk
Reading & Writing Innovation Lab
bit.ly/ox-rewrilab
Consultations for staff and students
E-readers
Tablets & styluses
Reading apps
Writing tools
Note taking tools
Banbury Road
In-person visits
Online consultations
Beyond ChatGPT: State of AI, October 2023
Download slides
bit.ly/3yOlMAP
Language pedagogy is my passion
The three As
Authenticity
Assessment
AI
Three pairs
Authenticity
Assessment
AI
Authenticity
Assessment
AI
Assessment and authenticity
Assessment has to solve more than authenticity.
Assessment has to be 3 things at once
Authentic
Represent real world task
Reliable
Reflect actual performance
Scalable
Done quickly enough for large numbers
But let’s remember the engineering triangle
Pick any two
Good
Fast
Cheap
Engineering triangle in practice
Do we have an assessment triangle?
Pick any two
Authentic
Scalable
Reliable
Assessment triangle in practice
Note: Not my original idea. Overheard at a conference.
Cognitive break and questions
In the triangle of assessment, we try,
To be scalable, reliable, and fly,
Yet authenticity’s claim,
Is a part of the game,
Balancing three is the aim, oh my!
How I made the previous slide
AI and authenticity
What is AI? Brief terminological detour
Often heard
ChatGPT is just a fancy autocomplete. It just predicts the next word.
Literally true, but not a useful way to think about AI.
Generative AI is a universal (semantic) translator
Language to language
Style to style
Structured text to unstructured
Unstructured to structured
Question to answer
Text to label
Image to text
Text to image
Text to code
Code to text
Translating languages, styles, modalities
Some things don’t feel like translation but AI treats them that way
This approach can even be used to replace old specialised systems
But why semantic? Doesn’t it just predict the next word?
The relations look something like this
When AI is generating text it is doing something like this
This may look simple in 3D space
Word | Vector |
teacher | (0.8, 0.1, 0.1) |
student | (0.8, 0.1, 0.2) |
classroom | (0.1, 0.8, 0.1) |
homework | (0.3, 0.3, 0.8) |
school | (0.2, 0.9, 0.1) |
textbook | (0.5, 0.1, 0.5) |
curriculum | (0.5, 0.2, 0.5) |
exam | (0.3, 0.2, 0.9) |
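To make the geometry concrete, here is a minimal sketch that measures how close the toy vectors in the table are to each other. The numbers are the illustrative ones from the table, not real model embeddings, and cosine similarity is an assumed choice of measure (a common one, but not something the slides specify). The function names are made up for this example.

```python
import math

# Toy 3D vectors copied from the table above (illustrative, not real embeddings).
VECTORS = {
    "teacher": (0.8, 0.1, 0.1),
    "student": (0.8, 0.1, 0.2),
    "classroom": (0.1, 0.8, 0.1),
    "homework": (0.3, 0.3, 0.8),
    "school": (0.2, 0.9, 0.1),
    "textbook": (0.5, 0.1, 0.5),
    "curriculum": (0.5, 0.2, 0.5),
    "exam": (0.3, 0.2, 0.9),
}


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(word):
    """Return the other word whose vector points in the most similar direction."""
    others = (w for w in VECTORS if w != word)
    return max(others, key=lambda w: cosine_similarity(VECTORS[word], VECTORS[w]))


if __name__ == "__main__":
    for word in ("teacher", "classroom", "exam"):
        print(word, "->", nearest(word))
```

With these toy values, "teacher" lands next to "student", "classroom" next to "school", and "exam" next to "homework": related words sit close together in the space.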
What is inside an LLM is not statistics but geometry.
BUT it’s all geometry in 10,000 dimensions
Each item is one giant vector
[-0.08041892945766449, -0.023566607385873795, 0.04585130512714386, 0.037420596927404404, -0.09120217710733414, 0.022545181214809418, -0.0019880179315805435, 0.0587424635887146, 0.07126272469758987, -0.02159898355603218, -0.07145281136035919, 0.09984641522169113, -0.05501342564821243, 0.02485564909875393, 0.01755008101463318, 0.014556304551661015, -0.15110555291175842, -0.000567720562685281, 0.10030079632997513, -0.045505933463573456, -0.06274029612541199, -0.0683555155992508, 0.0008911662152968347, 0.01842709816992283, 0.06299598515033722, 0.02255615033209324, -0.09917508065700531, -0.07070962339639664, 0.08635025471448898, 0.06686452776193619, -0.0407336950302124, -0.04072333127260208, -0.01974628120660782, 0.07472220063209534, -0.024722471833229065, -0.13420116901397705, -0.01812688820064068, -0.07096941769123077, -0.05353084206581116, -0.10960721969604492, -0.017906684428453445, -0.04733094945549965, -0.02091103047132492, 0.1269848346710205, -0.05413510650396347, -0.046787846833467484, 0.0024005023296922445, -0.07217800617218018, -0.029329143464565277, 0.007498500403016806, -0.034666456282138824, -0.03568940982222557, 0.03427724167704582, 0.02315753698348999, -0.008645392023026943, 0.05333952233195305, 0.07456360012292862, 0.147796630859375, -0.006483903620392084, -0.08905889838933945, 0.03265034034848213, -0.0732979029417038, 0.04066538065671921, 0.023211032152175903, -0.012049349024891853, -0.02828565053641796, 0.019329581409692764, 0.09989447146654129, 0.1430598795413971, -0.061100199818611145, -0.030345138162374496, -0.02984507940709591, -0.028366880491375923, 0.052052564918994904, 0.036766957491636276, 0.003982939291745424, -0.077084481716156, 0.05044832453131676, -0.11687757074832916, 0.06646141409873962, 0.016255078837275505, -0.06982151418924332, -0.000822143629193306, -0.0026820336934179068, -0.004263593349605799, 0.09659365564584732, 0.06130471080541611, -0.06840908527374268, 0.06686245650053024, -0.04831290245056152, 0.08598440140485764, 
0.08331689983606339, 0.08026000112295151, 0.05451888591051102, -0.03798443824052811, 0.04084145650267601, -0.12311697751283646, 0.023645302280783653, 0.005237551871687174, 0.03906212002038956, 0.037468183785676956, -0.05121520534157753, -0.10456130653619766, 0.009842721745371819, 0.04819759353995323, -0.13286681473255157, 0.02991127222776413, -0.06024811416864395, 0.04108288511633873, -0.008447377011179924, -0.07916080206632614, 0.06436653435230255, 0.017831943929195404, -0.054629500955343246, 0.027066148817539215, -0.030593710020184517, -0.10156133025884628, -0.0013401528121903539, 0.0011191506637260318, 0.009616676717996597, -0.02962290495634079, 0.0042936066165566444, 0.013841508887708187, -0.047656722366809845, -0.003912750165909529, 0.06500802934169769, 0.001283025718294084, -0.0816996768116951, 0.06566621363162994, -0.010957532562315464, -0.028156422078609467, 0.08978854864835739, -0.0003194105520378798, -0.02697799727320671, -0.006005867850035429, 0.07932088524103165, 0.021490609273314476, 0.013727870769798756, -0.019940776750445366, 0.031798265874385834, -0.0457642637193203, 0.03235720098018646, -0.022082772105932236, -0.04902353510260582, -0.11819718778133392, -0.04506421089172363, -0.046244461089372635, 0.029877550899982452, -0.07711911201477051, 0.05314543470740318, -0.09000932425260544, -0.023750705644488335, -0.05107633396983147, 0.001467616413719952, -0.02442317083477974, 0.01248782780021429, 0.06548482179641724, 0.043813593685626984, 0.06102786585688591, 0.021692050620913506, -0.052160654217004776, -0.009674523957073689, -0.072069451212883, -0.08633119612932205, -0.05121589079499245, -0.08108754456043243, 0.03608304262161255, 0.06553766876459122, -0.0727415531873703, -0.09346839785575867, -0.07251054048538208, 0.04504929482936859, -0.01773262582719326, -0.0005254627903923392, -0.0035706141497939825, 0.09068302065134048, 0.0152428038418293, 0.009525319561362267, 0.02502918615937233, 0.02807294949889183, -0.08951258659362793, 0.018022941425442696, 
0.04113161191344261, -0.09941867738962173, 0.03642140328884125, 0.07755865901708603, 0.014834643341600895, -0.05757498741149902, -0.0052739898674190044, -0.03217893838882446, 0.029460914433002472, -0.03587955981492996, 0.016881171613931656, -0.015574142336845398, -0.10131996870040894, -0.01736866682767868, 0.014807181432843208, -0.03830776736140251, -0.0307577196508646, -0.04063287377357483, 0.0017508701421320438, 0.06622152030467987, 0.06959225982427597, 0.03921446576714516, -0.029292205348610878, -0.07731080055236816, -0.0757351890206337, 0.008267058990895748, 0.10628201067447662, -0.006961626932024956, -0.060704007744789124, -0.024280674755573273, -0.011232278309762478, 0.02305467799305916, -0.040246833115816116, 0.03551888465881348, -0.12048669904470444, -0.0057440041564404964, -0.008801680989563465, -0.038733456283807755, -0.0384967215359211, -0.0059003811329603195, 0.07543318718671799, 0.0029998512472957373, 0.11148137599229813, 0.0560586079955101, -0.01694066822528839, -0.020253779366612434, -0.11995487660169601, 0.10403268039226532, -0.022030610591173172, 0.019188301637768745, -0.03581297770142555, -0.04047590494155884, -0.03492145985364914, 0.027967417612671852, -0.07497915625572205, 0.032431814819574356, -0.025854842737317085, -0.10595495998859406, -0.09982465207576752, -0.05515384301543236, 0.02156943641602993, 0.05118619278073311, -0.03904290497303009, -0.022826874628663063, -0.053247928619384766, -0.10935184359550476, 0.0006719367229379714, -0.016026955097913742, 0.13483813405036926, 0.1173691526055336, -0.01902260072529316, -0.09690848737955093, -0.07585378736257553, 0.007626112550497055, 0.019889818504452705, -0.008633404038846493, 0.010355712845921516, 0.035737670958042145, 0.011519350111484528, -0.005264237057417631, -0.06305427849292755, -0.026263760402798653, 0.008310412988066673, -0.0068666874431073666, -0.13443514704704285, -0.025350390002131462, -0.0079041114076972, 0.014966381713747978, 0.01571144163608551, 0.06266333907842636, 
0.05788900703191757, -0.022854981943964958, 0.09513315558433533, 0.1284472942352295, -0.061813995242118835, -0.049407169222831726, -0.10701776295900345, 0.06945358961820602, -0.07409369200468063, -0.028664348646998405, -0.0144350565969944, 0.029182329773902893, 0.007034373469650745, -0.026693496853113174, 0.0590004064142704, -0.002902168082073331, 0.12047384679317474, 0.023063501343131065, -0.05780957639217377, 0.058589596301317215, 0.02074800431728363, -0.030389118939638138, -0.002812192542478442, 0.06409497559070587, -0.0015993582783266902, 0.007702010218054056, 0.013223372399806976, 0.012501182034611702]
This gives AI enormous power but it has limitations.
Big 3 AI limitations
Hallucination
Plausible but not real
Replicability
Different every time
Introspection
No access to own processes or training data
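One way to see why output can be "different every time" is that text generators usually sample the next word from a probability distribution rather than always taking the top candidate. The sketch below is a toy illustration under that assumption: the candidate words and scores are invented, and the function names are hypothetical, not any real tool's API.

```python
import math
import random

# Made-up scores for candidate next words (not taken from any real model).
SCORES = {"student": 2.0, "teacher": 1.5, "exam": 0.5, "banana": -1.0}


def softmax(scores, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the peak."""
    scaled = {word: score / temperature for word, score in scores.items()}
    top = max(scaled.values())  # subtract the max for numerical stability
    exps = {word: math.exp(value - top) for word, value in scaled.items()}
    total = sum(exps.values())
    return {word: value / total for word, value in exps.items()}


def sample_next_word(scores, temperature=1.0, rng=random):
    """Draw one word at random, weighted by its probability."""
    probs = softmax(scores, temperature)
    words = list(probs)
    return rng.choices(words, weights=[probs[w] for w in words], k=1)[0]


def greedy_next_word(scores):
    """Always pick the highest-scoring word: same answer on every run."""
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print("greedy: ", [greedy_next_word(SCORES) for _ in range(5)])
    print("sampled:", [sample_next_word(SCORES) for _ in range(5)])
```

Greedy selection repeats itself, while sampling varies from run to run, which is the behaviour the replicability limitation describes.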
Biggest caveat
It is not always possible to tell ahead of time what AI will be good at.
Dell’Acqua et al. 2023: Navigating the Jagged Technological Frontier
“We suggest that the capabilities of AI create a “jagged technological frontier” where some tasks are easily done by AI, while others, though seemingly similar in difficulty level, are outside the current capability of AI.”
Can we tell how good AI would be at something?
Illustrating the Jagged Frontier - @techczech
Inside the frontier (expected to be hard for AI) | Outside the frontier (expected to be easy for AI) |
Speak any language (mostly) grammatically | Label grammar terms (metalanguage) |
Explain E = mc² | Multiply numbers |
Write a poem | Reverse a random string of letters |
Count people in a story | Count words in a paragraph |
Generate complex photos | Place things to the right of other things |
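Several tasks in the right-hand column are one-liners in conventional code, which may be part of why people expect them to be easy for AI. A minimal sketch (the function names are made up for this illustration, and counting words by splitting on whitespace is a simplifying assumption):

```python
def count_words(paragraph):
    """Count whitespace-separated words in a paragraph."""
    return len(paragraph.split())


def reverse_string(text):
    """Reverse a string character by character."""
    return text[::-1]


def multiply(a, b):
    """Exact integer multiplication, however large the numbers."""
    return a * b


if __name__ == "__main__":
    print(count_words("Most tasks sit at an intersection."))
    print(reverse_string("qzvkp"))
    print(multiply(1234, 5678))
```

Each function gives the same exact answer on every run, which is a different kind of behaviour from a language model generating a plausible-looking answer.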
Metalanguage fail example (Claude Opus 17 May 2024)
How people experience the jagged frontier @techczech
😍
ChatGPT is amazing, there’s nothing it can’t do!
😡
ChatGPT is useless, it can’t get even the basics right!
Most tasks sit at an intersection.
AI is good as expected
AI is unexpectedly bad
AI is much better than expected
AI cannot do this, as expected
A conversation I’ve had
Academic: Why are the references ChatGPT gives me wrong?
Me: ChatGPT hallucinates links and references. Do not use it to find them.
A conversation I’ve had
Me: ChatGPT hallucinates links and references. Don’t use it to find them.
Student: Every time I clicked on a link I asked for, it worked.
Walters & Wilder 2023 Fabrication and errors in the bibliographic citations generated by ChatGPT
AI has its own assessment triangle problem
All AI tools are powered by models.
What is a model
LLM
Large Language Model
One model many tools
GPT (3.5 or 4)
ChatGPT
TeacherMatic
Notion AI
MS Copilot
...
Different kinds of models
LLMs (Text generation)
Code generation
Speech recognition
Image generation
Text to speech
Voice cloning
Video generation
Music generation
More than a few
Which ones are best?
ChatGPT vs ChatGPT Plus
ChatGPT
ChatGPT Plus
Difference between GPT-3.5 and GPT-4
Exam | GPT-4 Score | GPT-3.5 Score |
Uniform Bar Exam | 298/400 (90th percentile) | 213/400 (10th percentile) |
LSAT | 161 | 149 |
SAT (Reading & Writing + Math, out of 1600) | 1410 | 1260 |
AP World History | 5 (89th-100th percentile) | 4 (74th-89th percentile) |
AP Physics 2 | 4 (66th-84th percentile) | 3 (30th-66th percentile) |
AP Psychology | 5 (83rd-100th percentile) | 5 (83rd-100th percentile) |
AP Statistics | 5 (85th-100th percentile) | 3 (40th-63rd percentile) |
Medical Final Examination (English) | 79.6% accuracy | 58.3% accuracy |
Medical Final Examination (Polish) | 80.7% accuracy | 56.6% accuracy |
Example of improvements
GPT-3.5 is also less powerful in language
Spectrum of capabilities
Basic models
(GPT-3.5 class)
Sub-frontier models
Frontier models (GPT-4 class)
Three types of model assessment
Benchmarks
Head-to-head
Vibes
Benchmarks
MMLU: Most popular benchmark
Top models on MMLU in May 2024
But other benchmarks give different scores
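A benchmark score of this kind is, at its simplest, the share of questions a model answers correctly. The sketch below assumes a multiple-choice format with one letter per question; the answer key and model answers are invented for illustration and are not real benchmark data.

```python
def accuracy(answer_key, model_answers):
    """Fraction of multiple-choice questions answered correctly."""
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    if len(answer_key) != len(model_answers):
        raise ValueError("answer_key and model_answers must have the same length")
    correct = sum(1 for key, answer in zip(answer_key, model_answers) if key == answer)
    return correct / len(answer_key)


if __name__ == "__main__":
    # Hypothetical four-question test: the model gets the last one wrong.
    print(accuracy(["A", "B", "C", "D"], ["A", "B", "C", "A"]))
```

Because the score depends entirely on which questions are in the set, two benchmarks can rank the same models differently.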
Head-to-head
A good place to learn about them is LMSys Arena
Recent leaderboard – April 2024
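Head-to-head leaderboards are built from pairwise votes: two models answer the same prompt and a person picks the better answer. A minimal sketch of how such votes could become a ranking, using an Elo-style update as in chess ratings. Whether this matches the exact method behind any particular leaderboard is an assumption; the model names, starting rating, and K-factor are invented for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a, rating_b, a_won, k=32):
    """Return new ratings for A and B after one head-to-head vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def rank(votes, start=1000.0):
    """Process (winner, loser) votes in order and return models, best first."""
    ratings = {}
    for winner, loser in votes:
        rating_w = ratings.get(winner, start)
        rating_l = ratings.get(loser, start)
        ratings[winner], ratings[loser] = update(rating_w, rating_l, a_won=True)
    return sorted(ratings, key=ratings.get, reverse=True)


if __name__ == "__main__":
    # Hypothetical votes between three made-up models.
    votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
    print(rank(votes))
```

A win against a higher-rated model moves the ratings more than a win against a lower-rated one, so the ranking reflects who was beaten, not just how often.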
Vibes
Let’s keep this in mind
How do you know what is on which side of the jagged frontier?
Here?
Here?
Here?
Here?
Here?
What can you do to get a “feel” for the AI jagged frontier?
Experiment
Tools
Prompts
Retries
Impressions
10+ hours
50+ tasks
2+ frontier models
Learning from others
Other users
Newsletters / YouTube / X
Research
Cognitive break and questions
Why did the robot chAIcken cross the road?
To get to the jagged technological frontier—because speaking any language is a breeze, but counting words in a paragraph? Now that's a real challenge!
AI and Assessment
What is the new ‘authenticity’ of the tasks we assess?
There’s an authenticity lag in assessment
1980s
Times tables
Calculators
1990s
Handwriting
Typing
2000s
Spelling
Spell check
Often heard
Spelling does not matter in the age of the spell check!
Spelling does not matter
Spelling becomes a reliable indicator of general knowledge and skill
What is authentic to core academic practice?
Engaging with text
Reading
Listening
Creating text
Writing
Dictation
Engaging with knowledge
Memory
Encyclopedia
The future of writing is conversation?
Dictation vs listening
👎 Dictation
Listening 👍
You are not dictating. You are sharing your thoughts.
Conversation
👎 Dictation
Conversation 👍
The future of writing is conversation
What this means: New form of writing
Output
How does reading change with AI cognitive scaffolding?
Step 1: Ask for bullets
Step 2: Ask for propositions
Step 3: Ask for tables
Step 4: Ask for questions
Step 5: Ask for examples
Step 6: Ask for poetry
Cognitive break and questions
AI shapes learning,
Authentic tasks redefine,
Future in our hands.
Final dilemma
Google announced watermarking
Should AI companies watermark all text generated by AI?
Should spell checkers report the errors you make to a company that then passes the list to your employer?
Questions
Thank you
Dominik.Lukes@ctl.ox.ac.uk @techczech
This presentation is licensed under a Creative Commons Attribution license except where otherwise noted.
Icons and stock photos are licensed under Microsoft Premium Content and cannot be reused outside this document.