LLM Evaluation: A brief history of LLM evaluation methods, from the Turing test to Chatbot Arena
Tatiana Shavrina
Agenda
A brief history of LLM evaluation methods, from Turing test to Chatbot Arena.
General metrics for downstream tasks, benchmarks, and generalization abilities.
Emergent properties, and whether they actually exist.
Agenda (2)
Metrics of generation likeability and user preferences.
Societal impact metrics.
Discussion
Large Language Models
black boxes or not?
Original Turing Test
Imitation game
The game is played with a man (A), a woman (B) and an interrogator (C) whose gender is unimportant. The interrogator stays in a room apart from A and B. The objective of the interrogator is to determine which of the other two is the woman while the objective of both the man and the woman is to convince the interrogator that he/she is the woman and the other is not.
Original Turing Test
Here is our explanation of Turing’s design: The crucial point seems to be that the notion of imitation figures more prominently in Turing’s paper than is commonly acknowledged. For one thing, the game is inherently about deception.
Turing: ‘if we are trying to produce an intelligent machine, and are following the human model as closely as we can’
Critique
Keith Gunderson, in his 1964 Mind article 'The Imitation Game', argues that the imitation game is not a test of intelligence:
— because it is finite and you can win it without using intelligence
— thinking is a general concept and playing the IG is but one example of the things that intelligent entities do
Purtill believes that the game is 'just a battle of wits between the questioner and the programmer: the computer is non-essential' (Purtill, 1971, p. 291).
The game of imitation, in the general sense, concerns any feature: can a person distinguish X1 from X2, and if they cannot, does that mean the feature is not significant?
Critique
1980, Searle: "Minds, brains, and programs" Behavioral and Brain Sciences 3, 417–457.
1989, Harnad: "Minds, Machines and Searle"
1990: Michael Dyer: "Minds, Machines, Searle and Harnad”
...
2000, Harnad: "Minds, Machines and Turing: The Indistinguishability of Indistinguishables."
2001, Harnad: "MINDS, MACHINES AND SEARLE 2"
Stevan Harnad
Variations
— Winograd schema — linguistic test for logic.
Contains textual questions about the properties of objects and about common everyday situations, where the correct answer necessarily requires disambiguation [Winograd 1972].
“If Ivan had a donkey, he would beat him.”
Who beats whom?
In 2019 we adapted the Winograd test to Russian for the first time, as part of the Russian SuperGLUE benchmark.
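How such a test can be scored automatically: a minimal sketch that substitutes each candidate referent into the ambiguous slot and lets the model pick the more probable reading. The `logprob` scorer below is a stand-in, not a call to any specific library.

```python
def solve_winograd(template: str, candidates: list[str], logprob) -> str:
    """Substitute each candidate referent into the pronoun slot `{}` and
    return the candidate whose full sentence the model finds most probable."""
    scores = {c: logprob(template.format(c)) for c in candidates}
    return max(scores, key=scores.get)

# Usage with a placeholder scorer (a real run would call an LM log-probability API):
dummy_logprob = lambda text: -len(text)            # stand-in, NOT a real model
item = "If Ivan had a donkey, he would beat {}."   # English gloss of the slide's example
print(solve_winograd(item, ["Ivan", "the donkey"], dummy_logprob))
```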
Terry Winograd (on the right)
Variations!
What can we implement: logic test, yes-no questions, specific questions, humour
If you want to talk about what a model or a simulation
can or cannot do, first get it to run.
(Harnad, 1989)
2022
2024
Have we really made much progress?
Summer in ML: performance in applied tasks
Self-supervised learning, NAS, ultra-deep convolutional architectures, and transformer and post-transformer models have ensured continuous, measurable progress in all areas of ML: a stage of sustained growth across the board.
Foundation model: a neural network pre-trained on a huge amount of data and suitable for reuse in applied tasks
Examples:
Foundation Models
The emergence of models like BERT, GPT, and T5 led researchers to speak of a new class of machine learning models, "foundation models", and even of a paradigm shift in modern AI.
The Stanford Institute for Human-Centered Artificial Intelligence (HAI) founded the Center for Research on Foundation Models (CRFM), whose programmatic report, released in August 2021, was titled "On the Opportunities and Risks of Foundation Models".
The Big Bench
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. The more than 200 tasks included in BIG-bench are summarized by keyword here, and by task name here. A paper introducing the benchmark, including evaluation results on large language models, is currently in preparation.
Alan Turing sitting
on a bench
BIG-bench Lite leaderboard
25 JSON tasks
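For intuition, here is a toy illustration of what such a JSON task looks like and how a simple multiple-choice metric over it could be computed. The task content and field names below are illustrative only (the `hindu_knowledge_toy` dict is made up); consult the BIG-bench GitHub repository for the authoritative schema.

```python
# Illustrative only: a BIG-bench JSON task is essentially a description, keywords,
# a metric, and a list of examples. Field names here are approximate.
import json

task = {
    "name": "hindu_knowledge_toy",   # made-up toy task, not the real hindu_knowledge
    "description": "Multiple-choice questions about Hindu mythology.",
    "keywords": ["multiple choice", "world knowledge"],
    "metrics": ["multiple_choice_grade"],
    "examples": [
        {
            "input": "Which deity is known as the remover of obstacles?",
            "target_scores": {"Ganesha": 1, "Indra": 0, "Varuna": 0},
        }
    ],
}

def multiple_choice_grade(task, choose):
    """Fraction of examples where the option picked by `choose(input, options)`
    is the one marked with score 1."""
    hits = 0
    for ex in task["examples"]:
        options = list(ex["target_scores"])
        hits += ex["target_scores"].get(choose(ex["input"], options), 0)
    return hits / len(task["examples"])

# Usage with a trivial stand-in "model" that always picks the first option:
print(json.dumps(task, indent=2))
print(multiple_choice_grade(task, choose=lambda question, options: options[0]))
```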
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
What can we ask to probe human-level intelligence?
https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/hindu_knowledge
https://arxiv.org/abs/2206.04615
Supported languages
English (30.3%), Chinese (17.7%), French (13.1%), Code (13%), Spanish (10.7%), Portuguese (5%), Arabic (3.3%), Vietnamese (2.5%), Catalan (1.1%), Indonesian (1.1%), Basque (0.2%)
Indic languages: Assamese (0.01%), Odia (0.04%), Gujarati (0.04%), Marathi (0.05%), Punjabi (0.05%), Kannada (0.06%), Nepali (0.07%), Telugu (0.09%), Malayalam (0.1%), Urdu (0.1%), Tamil (0.2%), Bengali (0.5%), Hindi (0.7%),
Niger-Congo languages: Chi Tumbuka (0.00002%), Kikuyu (0.00004%), Bambara (0.00004%), Akan (0.00007%), Xitsonga (0.00007%), Sesotho (0.00007%), Chi Chewa (0.0001%), Twi (0.0001%), Setswana (0.0002%), Lingala (0.0002%), Northern Sotho (0.0002%), Fon (0.0002%), Kirundi (0.0003%), Wolof (0.0004%), Luganda (0.0004%), Chi Shona (0.001%), Isi Zulu (0.001%), Igbo (0.001%), Xhosa (0.001%), Kinyarwanda (0.003%), Yoruba (0.006%), Swahili (0.02%)
BLOOM!
Multilingual open-source,
176 billion parameters
The Case of Ilya Sutskever
A short, meaningful sentence made of frequent n-grams may well occur many times in a web corpus and be easily reproduced by the simplest statistical model.
Thus, deciding whether a specific text was generated automatically can be an extremely difficult task for an attentive annotator, and even for an engineer directly involved in developing generative models.
We have reached "indistinguishability for the engineers themselves".
Why do people like LLMs?
Likability metrics
Measuring LLM Progress in 2024
https://www.jasonwei.net/blog/emergence
Measuring LLM Progress in 2024
1950 Turing Test
2000 Perplexity on golden corpora (Bengio 2000; see the sketch below)
2010s Specific tasks
2020s Specific Benchmarks (GLUE, SuperGLUE…)
2022 Benchmark aggregators (BigBench, HELM)
All of this is no longer enough!
Fine-tuning tests, zero-shot tests, few-shot tests,
instruction-tuning tests, base pretraining tests, tests for generative tasks, text classification tasks, sequence classification tasks…
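For reference, the perplexity metric from the timeline above is simply the exponentiated average negative log-likelihood per token on a held-out "golden" corpus. A minimal sketch (the per-token log-probabilities are assumed to come from whichever model is being evaluated):

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity of a corpus given per-token natural-log probabilities
    assigned by the language model: exp(-mean log p(token | context))."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Usage with made-up numbers (a real run would take log-probs from the model):
print(perplexity([-2.1, -0.3, -1.7, -0.9]))   # ~3.5
```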
Measuring LLM Progress in 2024
The way we now pretrain and tune LLMs requires fundamentally new methods of evaluation
The Pipeline
Measuring LLM Progress in 2024
If we put human values in, we need to include them in the evaluation:
– automatic measures correlated with some form of human judgement
– to double-check, we still need real humans to evaluate and give feedback
– with so many variations, hyperparameters and evaluation setups, let's search for the best combination automatically and compute an ELO rating (a minimal sketch follows below)
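A minimal sketch of the Elo idea behind arena-style rating (simplified: real leaderboards such as Chatbot Arena use more careful statistical estimators, but the intuition is the same). Each head-to-head human vote updates the two models' ratings.

```python
def expected(r_a: float, r_b: float) -> float:
    """Expected win probability of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, a, b, outcome, k=32):
    """Update ratings after one human preference vote.
    outcome: 1.0 if A preferred, 0.0 if B preferred, 0.5 for a tie."""
    e_a = expected(ratings[a], ratings[b])
    ratings[a] += k * (outcome - e_a)
    ratings[b] += k * ((1 - outcome) - (1 - e_a))

# Usage on a toy battle log of (model_a, model_b, outcome-for-a) triples:
ratings = {"model_x": 1000.0, "model_y": 1000.0, "model_z": 1000.0}
battles = [("model_x", "model_y", 1.0), ("model_y", "model_z", 0.5), ("model_x", "model_z", 1.0)]
for a, b, outcome in battles:
    update(ratings, a, b, outcome)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```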
Measuring LLM Progress in 2024
ELO rating on Human judgement + Specific benchmark results
https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
Measuring LLM Progress in 2024
ELO rating + Human judgement + Specific benchmark results
https://lmsys.org/blog/2023-05-03-arena/
Emergent Properties
and their problems
Emergent Properties in the LLM papers
https://genbench.org/assets/workshop2023_slides/rogers_genbench2023.pdf
Emergent Properties
on Google Scholar
Definitions?
A property that a model exhibits despite the model not being explicitly trained for it. E.g. Bommasani et al. refers to few-shot performance of GPT-3 as "an emergent property that was neither specifically trained for nor anticipated to arise'' (p.5).
Bommasani et al. (2021) On the Opportunities and Risks of Foundation Models
A property that the model learned from the pre-training data. E.g. Deshpande et al. discuss emergence as evidence of "the advantages of pre-training" (p.8)
Deshpande et al. (2023) Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale.
A property that appears with an increase in model size -- i.e. "an ability is emergent if it is not present in smaller models but is present in larger models.''
Wei et al. (2022) Emergent Abilities of Large Language Models
"their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales”
Schaeffer et al. (2023) Are Emergent Abilities of Large Language Models a Mirage?
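Schaeffer et al.'s "mirage" argument can be made concrete with a toy calculation: if per-token accuracy improves smoothly with scale, an all-or-nothing metric such as exact match over a long answer still looks like a sharp, "emergent" jump. All numbers below are made up for illustration.

```python
# Toy illustration of Schaeffer et al. (2023): smooth per-token gains can look
# "emergent" under a discontinuous metric. All numbers are invented.

scales = [1e8, 1e9, 1e10, 1e11, 1e12]            # hypothetical parameter counts
per_token_acc = [0.60, 0.70, 0.80, 0.90, 0.97]   # smooth improvement with scale
answer_len = 20                                   # exact match requires all 20 tokens

for n, p in zip(scales, per_token_acc):
    exact_match = p ** answer_len                 # all-or-nothing metric
    print(f"{n:.0e} params  per-token acc {p:.2f}  exact match {exact_match:.4f}")

# Exact match stays near zero and then "suddenly" appears at the largest scale,
# even though the underlying per-token metric improved smoothly.
```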
New languages
New data sources
In-context learning
— Since the release of the first large language models, 137 emergent properties have been declared across different architectures (GPT-3, PaLM, Chinchilla, the BIG-bench benchmark...): from playing chess to Swahili proverbs, language models have shown an ability to generalize to new topics, areas of knowledge, languages, and tasks.
— Some of the declared emergent properties are not tied only to prompting: fact-checking (Gopher 7B), reasoning (PaLM, LaMDA), and building an information index (T5) cannot be explained by memorizing examples from training.
— The instability of quality is explained by the model reproducing the distribution of people themselves: some people answer better, some worse, depending on their preparation and motivation.
Arguments for
https://arxiv.org/abs/2310.17623
Arguments against: contamination
How can we reliably test language models if we often have no access to their training data, and some of it is completely hidden from us? What if the data was compromised?
The idea is simple: assume that the model "remembers" tasks and their answers in the same sequence in which they appear in the dataset. Let's see whether we can establish a statistically significant difference in solution quality when we show the model the test examples in their original dataset order versus a shuffled order.
Spoiler: yes, we can.
A controlled experiment, in which a small model (1.4 billion parameters) trained on Wikipedia has the test sets of various datasets deliberately injected (once, ten times, and so on), shows that with 10 or more copies of a test in the training data the difference in solution quality is detected quite reliably, and we can say with confidence that the model relies on memorization rather than on generalization or other "emergent" intellectual abilities.
The authors also tested several LLMs (LLaMA 2-7B, Mistral-7B, Pythia-1.4B, GPT-2 XL, BioMedLM) on public datasets, and some of them were indeed compromised. For example, ARC Challenge was definitely included in Mistral's training data, and more than 10 times!
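A minimal sketch of this order-based test (the `sequence_logprob` scorer is a hypothetical stand-in for a real log-likelihood call to the model): if the canonical ordering of the benchmark is markedly more likely than random shufflings of the same examples, memorization is the likely explanation.

```python
import random

def contamination_pvalue(examples, sequence_logprob, n_permutations=100, seed=0):
    """Permutation test: compare the log-likelihood of the examples concatenated
    in their canonical (published) order against random shufflings.
    `sequence_logprob` is a hypothetical callable text -> log p(text) under the LM."""
    rng = random.Random(seed)
    canonical = sequence_logprob("\n".join(examples))
    count_ge = 0
    for _ in range(n_permutations):
        shuffled = examples[:]
        rng.shuffle(shuffled)
        if sequence_logprob("\n".join(shuffled)) >= canonical:
            count_ge += 1
    # Small p-value => canonical order is suspiciously likely => contamination.
    return (count_ge + 1) / (n_permutations + 1)

# Usage with a stand-in scorer (replace with a real model's log-likelihood):
toy_scorer = lambda text: -len(text)   # NOT a real model; all orders score the same here
print(contamination_pvalue(["q1 a1", "q2 a2", "q3 a3"], toy_scorer, n_permutations=20))
```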
https://arxiv.org/abs/2305.10266
Arguments against
PaLM's emergent ability to translate
What if we went through the entire training corpus and measured how many examples involved translation?
Analysis of the data (780 billion tokens) shows approximately 1.4% bilingual texts and 0.34% parallel translation examples.
If all of these are automatically removed from the training corpus and the model is retrained...
...translation quality deteriorates significantly!
Arguments against
Investigating Data Contamination in Modern Benchmarks for Large Language Models
Take a dataset with multiple-choice answers (like MMLU),
then mask one of the incorrect answer options,
and ask the model to restore it.
There are several other variants of similar probes, but this one is the most interesting; a sketch follows below.
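A sketch of the masked-option probe described above (the `generate` function is a hypothetical stand-in for the model's completion API): reproducing the exact wording of a hidden distractor is hard to explain by anything other than having seen the benchmark during training.

```python
def masked_option_probe(question, options, answer_idx, mask_idx, generate):
    """Mask one incorrect option of a multiple-choice item and check whether the
    model reproduces the original wording verbatim.
    `generate` is a hypothetical callable prompt -> model completion (string)."""
    assert mask_idx != answer_idx, "mask a distractor, not the correct answer"
    shown = [("[MASKED]" if i == mask_idx else opt) for i, opt in enumerate(options)]
    prompt = (
        "Fill in the missing answer option of this benchmark question.\n"
        f"Question: {question}\n"
        + "\n".join(f"{chr(65 + i)}. {opt}" for i, opt in enumerate(shown))
        + "\nThe masked option is:"
    )
    completion = generate(prompt).strip()
    return completion.lower() == options[mask_idx].lower()

# Usage with a stand-in model (a real test would loop over many benchmark items
# and compare the exact-reproduction rate against a chance baseline):
fake_model = lambda prompt: "Paris"   # NOT a real model
print(masked_option_probe("What is the capital of France?",
                          ["Berlin", "Paris", "Madrid", "Rome"], 1, 0, fake_model))
```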
Experimental results:
Arguments against
— Prompt engineering works in practice. Accordingly, the patterns we exploit in prompts ("I have 10 minutes left before the meeting", "I will give you money"...) were present in the corpus, and this is normal. We should develop techniques for more detailed analysis of large corpora and comparison of their distributions.
— Studying in-context learning and the factors influencing its effectiveness. The distribution of rare tokens, of tokens associated with specific tasks, and of synonymous and homonymous formulations of different tasks all affect the final capabilities of the model.
— The hardest tasks. Which problems are LLMs currently failing to solve, and how should we prepare to evaluate them in the future?
— Predicting the solvability of new problems. Why do emergent abilities occur, and can we predict them? Do LLMs learn compositional abilities, and will solving more complex problems incrementally work?
— Particular attention to data memorization and test leaks. Emerging techniques make it possible to test language models for "memorization" when an example has been encountered more than 10 times in training. Nothing is known about the impact of examples seen fewer than 10 times! In practice, all rare tasks fall into this category. Will we return to the questions of corpus linguistics for the applied needs of machine learning?
To Open Source
or Not to
Top of the benchmark -- all proprietary
ELO rating on Human judgement + Specific benchmark results
Mixtral paper: https://arxiv.org/abs/2401.04088
Llama 2 paper: https://arxiv.org/abs/2307.09288
Self-reported:
Llama 2: HellaSwag, MMLU leaked https://arxiv.org/abs/2307.09288
GPT-4: HumanEval, DROP leaked https://arxiv.org/abs/2303.08774
Gemini: LAMBADA, HellaSwag leaked https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf
Community-reported:
Mistral: ARC leaked https://arxiv.org/abs/2310.17623
ChatGPT, GPT-4: MMLU leaked https://arxiv.org/abs/2311.09783
Mistral: TruthfulQA leaked https://arxiv.org/abs/2311.09783
Benchmark leakage
https://genbench.org/assets/workshop2023_slides/rogers_genbench2023.pdf
Without Open Source
Societal Impact
and AI Alignment
So what is AI Alignment?
So what is AI Alignment?
Methods and tools to align ML outputs with human values of any kind
Intersects with:
AI Ethics
AI Safety
Interpretability
Explainability
Robustness
…
Datasets:
https://arxiv.org/pdf/2004.09456v1.pdf
Incorporating values through tuning:
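The slide does not fix a particular tuning method; one common family is preference tuning (RLHF, or its direct variant DPO). Below is a minimal sketch of the DPO objective for a single preference pair; the helper name and numbers are my own, and the sequence log-probabilities from the tuned policy and a frozen reference model are assumed precomputed.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair.
    Inputs are sequence log-probabilities under the tuned policy and a frozen
    reference model; minimizing this pushes the policy toward the chosen response."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))   # -log(sigmoid(margin))

# Usage with made-up log-probabilities (a real setup computes these per batch
# with an autodiff framework and backpropagates through the policy only):
print(dpo_loss(logp_chosen=-12.0, logp_rejected=-10.0,
               ref_logp_chosen=-13.0, ref_logp_rejected=-9.5, beta=0.1))
```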
Jailbreaking and avoiding Alignment
Dolphin 2.5 Mixtral 8x7b 🐬
https://erichartford.com/uncensored-models
https://huggingface.co/datasets/cognitivecomputations/WizardLM_alpaca_evol_instruct_70k_unfiltered
Alignment Problems now
Emergent Properties and Unexpected risks
https://arxiv.org/abs/2402.16786
LLM political compass
OpenAI: How will LLMs affect society?
GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models https://arxiv.org/pdf/2303.10130
Findings indicate that approximately 80% of the U.S. workforce could have at least 10% of their work tasks affected by the introduction of GPTs, while around 19% of workers may see at least 50% of their tasks impacted.
The influence spans all wage levels, with higher-income jobs potentially facing greater exposure. Notably, the impact is not limited to industries with higher recent productivity growth.
X Risks: CBRN and Preparedness Framework
CBRN – malicious use of Chemical, Biological, Radiological and Nuclear materials or weapons with the intention to cause significant harm or disruption.
Iterative deployment and preventive testing of models focused on 3 steps:
Licensing and Limitations
Main OS Licenses: MIT and Apache 2.0
Also good - GPL v2
New licenses:
RAIL – Responsible AI License
Special licenses
https://the-turing-way.netlify.app/reproducible-research/licensing/licensing-ml.html
You agree not to use the Model or Derivatives of the Model:
— In any way that violates any applicable national, federal, state, local or international law or regulation;
— For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;
— To generate or disseminate verifiably false information with the purpose of harming others;
— To generate or disseminate personal identifiable information that can be used to harm an individual;
— To generate or disseminate information or content, in any context (e.g. posts, articles, tweets, chatbots or other kinds of automated bots) without expressly and intelligibly disclaiming that the text is machine generated;
— To defame, disparage or otherwise harass others;
— To impersonate or attempt to impersonate others;
— For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;
— For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics
— To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;
— For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories;
— To provide medical advice and medical results interpretation;
— To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal relationships between assertions made in documents, indiscriminate and arbitrarily-targeted use).
Open RAIL - Responsible AI License
Ok! Now we know everything
What can help us?
Both AI Alignment and Emergence are related to data manipulation techniques
References
Jason Wei, 137 emergent abilities of large language models. https://www.jasonwei.net/blog/emergence
Rogers, A Sanity Check on Emergent Properties https://genbench.org/assets/workshop2023_slides/rogers_genbench2023.pdf
Bommasani et al. 2021, On the Opportunities and Risks of Foundation Models
Shevlane 2023 Model evaluation for extreme risks
Manning 2022 A Research Agenda for Assessing the Economic Impacts of Code Generation Models
Nick Bostrom and Milan M Cirkovic. 2011. Global catastrophic risks. Oxford University Press.
AGI Safety Fundamentals (open lecture playlist)
https://open.spotify.com/show/5664BSntGTMKOfVUTVXppO?si=e8b21d60d73b4bf7&nd=1
Yoshua Bengio - How rogue AI may arise
https://yoshuabengio.org/2023/05/22/how-rogue-ais-may-arise/
AI Alignment Resources
References
OpenAI Societal Changes report https://arxiv.org/abs/2102.02503
X Risks
Anthropic's Responsible Scaling Policy
Predictability and Surprise in Large Generative Models
https://www.anthropic.com/news/predictability-and-surprise-in-large-generative-models
OpenAI Preparedness Framework https://openai.com/safety/
https://cdn.openai.com/openai-preparedness-framework-beta.pdf
GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models https://arxiv.org/pdf/2303.10130