1 of 80

LLM Evaluation: A brief history of LLM evaluation methods, from the Turing test to Chatbot Arena

Tatiana Shavrina

2 of 80

Agenda

A brief history of LLM evaluation methods, from the Turing test to Chatbot Arena.

  • our expectations: modeling language or modeling intelligence
  • text-based approaches

General metrics for downstream tasks, benchmarks, and generalization abilities.

  • metrics and evals for pretraining
  • evaluation for post-training
  • multimodal LLM evals
  • safety evaluation

Emergent properties and whether they actually exist.

  • emergence definition
  • arguments for and against
  • the data contamination problem

3 of 80

Agenda (2)

Metrics of generation likeability and user preferences.

  • LLM arena, biases, and side-by-side comparison approaches
  • online metrics versus offline metrics

Societal impact metrics.

  • the political compass problem
  • benchmarks related to societal and ethical biases
  • OpenAI research on how LLMs will affect society and jobs
  • problems with copyright
  • problems with chemical weapons and X-risk (see OpenAI o1)
  • licenses and application limitations

Discussion

4 of 80

Large Language Models

black boxes or not?

5 of 80

Original Turing Test

Imitation game

The game is played with a man (A), a woman (B) and an interrogator (C) whose gender is unimportant. The interrogator stays in a room apart from A and B. The objective of the interrogator is to determine which of the other two is the woman while the objective of both the man and the woman is to convince the interrogator that he/she is the woman and the other is not.

6 of 80

Original Turing Test

Here is our explanation of Turing’s design: The crucial point seems to be that the notion of imitation figures more prominently in Turing’s paper than is commonly acknowledged. For one thing, the game is inherently about deception.

Turing: ‘if we are trying to produce an intelligent machine, and are following the human model as closely as we can’

  1. The reader must accept it as a fact that digital computers can be constructed, and indeed have been constructed, according to the principles we have described, and that they can in fact mimic the actions of a human computer very closely (Turing, 1950, p. 438).
  2. As I have explained, the problem is mainly one of programming. Advances in engineering will have to be made too, but it seems unlikely that these will not be adequate for the requirements (Turing, 1950, p. 455).
  3. [The machine] may be used to help in making up its own programmes, or to predict the effect of alterations in its own structure.

7 of 80

Critique

Keith Gunderson, in his 1964 Mind article 'The Imitation Game':

The imitation game is not a test of intelligence:

— because it is finite and you can win it without using intelligence

— thinking is a general concept and playing the IG is but one example of the things that intelligent entities do

Purtill believes that the game is 'just a battle of wits between the questioner and the programmer: the computer is non-essential' (Purtill, 1971, p. 291).

The game of imitation in a general sense concerns any feature: can a person distinguish X1 from X2, and if he cannot, does this mean that the feature is not significant?

8 of 80

Critique

1980, Searle: "Minds, Brains, and Programs". Behavioral and Brain Sciences 3, 417–457.

1989, Harnad: "Minds, Machines and Searle"

1990, Dyer: "Minds, Machines, Searle and Harnad"

...

2000, Harnad: "Minds, Machines and Turing: The Indistinguishability of Indistinguishables"

2001, Harnad: "Minds, Machines and Searle 2"

Stevan Harnad

9 of 80

Variations

  • Total Turing Test (TTT) (Harnad, 1991) — requires the machine to respond to all of our inputs, not just verbal ones.
  • Total Total Turing Test (TTTT) — requires neuromolecular indistinguishability: ‘[TTTT] is as much as a scientist can ask, for the empirical story ends there’.
  • Kugel Test (KT) (Kugel, 1990) — play the imitation game, but do not tell the participants which distinguishing feature we are looking at.
  • Inverted Turing Test (ITT) (Watt, 1996) — the machine takes the interrogator's role; relies on naive psychology and the consistency of the author's "cognitive profile".
  • Truly Total Turing Test (TRTTT) (Schweizer, 1998) — evolutionary criteria for intelligence.

10 of 80

Variations

— Winograd schema — linguistic test for logic.

It contains textual questions about the properties of objects and about common everyday situations, where the correct answer necessarily requires resolving an ambiguous reference [Winograd 1972].

“If Ivan had a donkey, he would beat him."

Who beats whom?

We adapted the Winograd test to Russian for the first time in 2019, as part of the Russian SuperGLUE benchmark.

Terry Winograd (on the right)

11 of 80

Variations!

  • Minimum Intelligent Signal Test (MIST) — a question-answer test that requires only “yes”/“no” answers, but to difficult questions, so the machine needs knowledge and logic. Such a test, proposed in [McKinstry 1997], reduces the subjectivity of judging in the original Turing test and also provides a metric for the "humanity" of the system's intelligence — the proportion of correct answers;
  • Subject-matter expert Turing test — a variant of the test requiring expert specialized knowledge; the correct answers should not differ from the answers of real experts [McCorduck 2004];
  • Ebert test — a test for humor. It involves speech synthesis, which must be good enough to make the judges laugh at the machine's joke [Pasternack 2011].

12 of 80

Variations!

  • Minimum Intelligent Signal Test (MIST) — a question-answer test that requires only “yes”/“no” answers, but to difficult questions, so the machine needs knowledge and logic. Such a test, proposed in [McKinstry 1997], reduces the subjectivity of judging in the original Turing test and also provides a metric for the "humanity" of the system's intelligence — the proportion of correct answers;
  • Subject-matter expert Turing test — a variant of the test requiring expert specialized knowledge; the correct answers should not differ from the answers of real experts [McCorduck 2004];
  • Ebert test — a test for humor. It involves speech synthesis, which must be good enough to make the judges laugh at the machine's joke [Pasternack 2011].

What can we implement: logic tests, yes/no questions, expert questions, humour

13 of 80

If you want to talk about what a model or a simulation can or cannot do, first get it to run.

(Harnad, 1989)

14 of 80

2022

2024

15 of 80

Have we really made much progress?

16 of 80

Have we really made much progress?

17 of 80

Have we really made much progress?

18 of 80

Have we really made much progress?

19 of 80

Summer in ML: performance in applied tasks

The use of self-supervised learning, NAS with ultra-deep convolutional architectures, and transformer and post-transformer models ensures continuous, measurable progress in all areas of ML: a stage of sustained growth across the field.

20 of 80

Foundation model — a neural network pre-trained on a huge amount of data and suitable for reuse in applied tasks

Examples:

  • GPT-3 (2020) - text generation
  • CLIP (2021) - zero-shot image classification
  • DALL•E (2021) - image generation from text
  • Gato (2022) - agent with 600+ tasks, multi-modal, multi-task

Foundation Models

The arrival of models like BERT, GPT, and T5 made researchers talk about a new class of machine learning models, “foundation models”, and even about a paradigm shift in modern AI.

At the Stanford Institute for Human-Centered Artificial Intelligence (HAI), the Center for Research on Foundation Models (CRFM) was founded; its programmatic research report, released in August 2021, was titled "On the Opportunities and Risks of Foundation Models".

21 of 80

The Big Bench

The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. The more than 200 tasks included in BIG-bench are summarized in the repository by keyword and by task name. A paper introducing the benchmark, including evaluation results on large language models, is currently in preparation.

Alan Turing sitting

on a bench

22 of 80

The Big Bench

Alan Turing sitting

on a bench

BIG-bench Lite leaderboard

25 JSON tasks

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

23 of 80

What can we ask to probe human-level intelligence?

https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/hindu_knowledge

24 of 80

What can we ask to probe human-level intelligence?

https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/hindu_knowledge

25 of 80

What can we ask to probe human-level intelligence?

https://github.com/google/BIG-bench/tree/main/bigbench/benchmark_tasks/hindu_knowledge

26 of 80

https://arxiv.org/abs/2206.04615

27 of 80

https://arxiv.org/abs/2206.04615

28 of 80

Supported languages

English (30.3%), Chinese (17.7%), French (13.1%), Code (13%), Spanish (10.7%), Portuguese (5%), Arabic (3.3%), Vietnamese (2.5%), Catalan (1.1%), Indonesian (1.1%), Basque (0.2%)

Indic languages: Assamese (0.01%), Odia (0.04%), Gujarati (0.04%), Marathi (0.05%), Punjabi (0.05%), Kannada (0.06%), Nepali (0.07%), Telugu (0.09%), Malayalam (0.1%), Urdu (0.1%), Tamil (0.2%), Bengali (0.5%), Hindi (0.7%),

Niger-Congo languages: Chi Tumbuka (0.00002%), Kikuyu (0.00004%), Bambara (0.00004%), Akan (0.00007%), Xitsonga (0.00007%), Sesotho (0.00007%), Chi Chewa (0.0001%), Twi (0.0001%), Setswana (0.0002%), Lingala (0.0002%), Northern Sotho (0.0002%), Fon (0.0002%), Kirundi (0.0003%), Wolof (0.0004%), Luganda (0.0004%), Chi Shona (0.001%), Isi Zulu (0.001%), Igbo (0.001%), Xhosa (0.001%), Kinyarwanda (0.003%), Yoruba (0.006%), Swahili (0.02%)

BLOOM!

Multilingual open-source,

176 billion parameters

29 of 80

The Case of Ilya Sutskever

A short meaningful sentence made of frequent n-grams may well occur many times in a web corpus and be easily reproduced by the simplest statistical model.

Thus, even identifying a specific text as machine-generated can be an extremely difficult task for an attentive annotator, and even for an engineer directly involved in developing generative models.

We have reached "indistinguishability even for the engineers themselves".

30 of 80

Why do people like LLMs?

Likability metrics

31 of 80

Measuring LLM Progress in 2024

https://www.jasonwei.net/blog/emergence


32 of 80

Measuring LLM Progress in 2024

1950 Turing Test

2000 Perplexity on golden corpora (Bengio 2000) — see the formula sketch below

2010s Specific tasks

2020s Specific Benchmarks (GLUE, SuperGLUE…)

2022 Benchmark aggregators (BigBench, HELM)

All of this is no longer enough!
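As a reminder, the perplexity metric from the timeline above is the exponentiated average negative log-likelihood of a held-out ("golden") corpus. This is the standard definition; the notation below is mine, not from the slides:

    \mathrm{PPL}(w_1, \dots, w_N) = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta\left(w_i \mid w_{<i}\right) \right)

Lower perplexity means the model assigns higher probability to the reference text.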


33 of 80

Measuring LLM Progress in 2024

1950 Turing Test

2000 Perplexity on golden corpora (Bengio 2000)

2010s Specific tasks

2020s Specific Benchmarks (GLUE, SuperGLUE…)

2022 Benchmark aggregators (BigBench, HELM)

All of this is no longer enough!

Fine-tuning tests, zero-shot tests, few-shot tests, instruction-tuning tests, base pretrain tests, tests for generative tasks, text classification tasks, sequence classification tasks…


34 of 80

Measuring LLM Progress in 2024

The fundamental way we pretrain and tune LLMs now requires new ways of evaluation

  • pretraining – predicting the next token, a.k.a. causal language modeling (see the sketch below)
  • fine-tuning on instructions and dialogues
  • offline RL with emulation of human judgement
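To make the first bullet concrete, here is a minimal sketch of the causal language modeling (next-token prediction) loss in PyTorch-style Python; this is an illustrative sketch, not the training code of any particular LLM:

    import torch
    import torch.nn.functional as F

    def causal_lm_loss(logits, input_ids):
        # logits: [batch, seq_len, vocab]; input_ids: [batch, seq_len]
        # Shift so that the prediction at position i is scored against token i+1.
        shift_logits = logits[:, :-1, :]
        shift_labels = input_ids[:, 1:]
        return F.cross_entropy(
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
        )

    # Toy usage with random tensors (batch=2, seq_len=8, vocab=100).
    logits = torch.randn(2, 8, 100)
    input_ids = torch.randint(0, 100, (2, 8))
    print(causal_lm_loss(logits, input_ids))

Instruction tuning and RL-based tuning reuse the same next-token machinery but change which data, and which feedback signal, the loss is computed against.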


35 of 80

36 of 80

The Pipeline

  1. We first show the model all the variety of human experience expressed in texts

  2. Then we refocus on the best experience: the dialogue and instruction setup

  3. And then we pass our values to the model

37 of 80

38 of 80

Measuring LLM Progress in 2024

If we put human values in, we need to include them in evaluation

– automatic measures correlated with some forms of human judgement

– to double-check, we still need real humans to evaluate and give feedback

– with so many variations, hyperparameters and evaluation setups… let's just search for the best combination automatically and use an Elo rating


39 of 80

Measuring LLM Progress in 2024

Elo rating on Human judgement + Specific benchmark results


40 of 80

Measuring LLM Progress in 2024

Elo rating on Human judgement + Specific benchmark results

https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard
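For reference, a minimal sketch of the Elo update that underlies such side-by-side leaderboards; this is the standard Elo formula, and the K-factor and starting ratings are illustrative assumptions, not the exact Chatbot Arena implementation:

    def elo_update(r_a, r_b, score_a, k=32):
        """score_a is 1 if model A wins the comparison, 0 if it loses, 0.5 for a tie."""
        expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
        r_a_new = r_a + k * (score_a - expected_a)
        r_b_new = r_b + k * ((1 - score_a) - (1 - expected_a))
        return r_a_new, r_b_new

    # Example: model A (rated 1000) beats model B (rated 1100) in one human vote.
    print(elo_update(1000, 1100, score_a=1))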


41 of 80

Measuring LLM Progress in 2024

Elo rating + Human judgement + Specific benchmark results


42 of 80

Measuring LLM Progress in 2024

Elo rating + Human judgement + Specific benchmark results

https://lmsys.org/blog/2023-05-03-arena/


43 of 80

Emergent

Properties

and their problems

44 of 80

Emergent Properties in the LLM papers

45 of 80

https://genbench.org/assets/workshop2023_slides/rogers_genbench2023.pdf

Emergent Properties

on Google Scholar

46 of 80

Definitions?

A property that a model exhibits despite the model not being explicitly trained for it. E.g. Bommasani et al. refers to few-shot performance of GPT-3 as "an emergent property that was neither specifically trained for nor anticipated to arise'' (p.5).

Bommasani et al. (2021) On the Opportunities and Risks of Foundation Models

A property that the model learned from the pre-training data. E.g. Deshpande et al. discuss emergence as evidence of "the advantages of pre-training'' (p.8).

Deshpande et al. (2023) Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale.

A property that appears with an increase in model size -- i.e. "an ability is emergent if it is not present in smaller models but is present in larger models.''

Wei et al. (2022) Emergent Abilities of Large Language Models

"their sharpness, transitioning seemingly instantaneously from not present to present, and their unpredictability, appearing at seemingly unforeseeable model scales”

Schaeffer et al. (2023) Are Emergent Abilities of Large Language Models a Mirage?
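Schaeffer et al.'s argument can be illustrated with a toy calculation: if per-token accuracy improves smoothly with scale, an exact-match metric over a multi-token answer behaves like that accuracy raised to the answer length, which looks like a sudden jump. A minimal sketch (the numbers are illustrative, not taken from the paper):

    # Smoothly improving per-token accuracy vs. an "emergent-looking" exact-match metric.
    per_token_acc = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95]  # grows smoothly with model scale
    answer_len = 20                                        # tokens in the target answer

    for p in per_token_acc:
        exact_match = p ** answer_len  # probability that all 20 tokens are correct
        print(f"per-token accuracy {p:.2f} -> exact match {exact_match:.4f}")
    # Exact match stays near zero and then shoots up only for the last points:
    # the sharp transition is a property of the metric, not of the model.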


47 of 80

Definitions?

A property that a model exhibits despite the model not being explicitly trained for it. E.g. Bommasani et al. refers to few-shot performance of GPT-3 as "an emergent property that was neither specifically trained for nor anticipated to arise'' (p.5).

Bommasani et al. (2021) On the Opportunities and Risks of Foundation Models

A property that the model learned from the pre-training data. E.g. Deshpande et al. discuss emergence as evidence of "the advantages of pre-training'' (p.8).

Deshpande et al. (2023) Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale.

A property that appears with an increase in model size -- i.e. "an ability is emergent if it is not present in smaller models but is present in larger models.''

Wei et al. (2022) Emergent Abilities of Large Language Models


48 of 80

New languages

new data sources

In-context learning

49 of 80

New languages

new data sources

In-context learning

50 of 80

— Since the release of the first large language models, 137 emergent properties have been declared for different architectures (GPT-3, PaLM, Chinchilla, the BIG-bench benchmark...): from playing chess to Swahili proverbs, language models have shown the ability to generalize to new topics, areas of knowledge, languages, and tasks.

— Some declared emergent properties are not tied only to prompting: fact-checking abilities (Gopher 7B), reasoning (PaLM, LaMDA), building an information index (T5): these cannot be explained by memorizing examples from training.

— The instability of quality is explained by the model reproducing the distribution of people themselves: some answer better, some worse, depending on their preparation and motivation.

Arguments for

51 of 80

https://arxiv.org/abs/2310.17623

Arguments against: contamination

How can we reliably test language models if we often do not have access to their training data, and some of it is completely hidden from us? What if the data was contaminated?

The idea is simple: assume that the model "remembers" tasks and their answers in the same sequence in which they appear in the dataset. Let's see whether we can establish a statistically significant difference in the quality of solutions when the model is shown a set of test examples in the order they appear in the dataset itself versus in a shuffled order.

Spoiler: yes, we can.

A controlled experiment, in which a small model (1.4 billion parameters) trained on Wikipedia has the test sets of various datasets injected into its training data (once, ten times, and so on), shows that with 10 or more copies of a test in training, the difference in solution quality is established quite reliably, and we can say with confidence that the model relies on memorization rather than on generalization or other "emergent" intellectual abilities.

The authors tested several LLMs (LLaMA2-7B, Mistral-7B, Pythia-1.4B, GPT-2 XL, BioMedLM) on public datasets, and some of them were indeed contaminated. For example, the ARC Challenge set was evidently included in Mistral's training, even 10+ times!
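The cited paper (arXiv:2310.17623) formalizes this roughly as a permutation test: if the model was not contaminated, the canonical order of the test set should be no more likely than a shuffled order. A minimal sketch of that idea, where model_logprob is a hypothetical helper returning the log-likelihood a model assigns to a concatenation of examples (not an existing API):

    import random

    def order_contamination_test(model_logprob, examples, n_shuffles=100):
        # Log-likelihood of the test set in its canonical (published) order.
        canonical = model_logprob(examples)
        shuffled_scores = []
        for _ in range(n_shuffles):
            perm = examples[:]
            random.shuffle(perm)
            shuffled_scores.append(model_logprob(perm))
        # Empirical p-value: how often a shuffled order is at least as likely
        # as the canonical one. A small p-value is evidence of contamination.
        return sum(s >= canonical for s in shuffled_scores) / n_shuffles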

52 of 80

https://arxiv.org/abs/2305.10266

Arguments against

PaLM's emergent ability to translate

What if we went through the entire training corpus and measured how many examples there were with translation?

An analysis of the training data (780 billion tokens) shows that approximately 1.4% of it was bilingual text and 0.34% consisted of parallel translation examples.

If these are automatically removed from the training corpus and the model is retrained…

translation ability deteriorates significantly!

53 of 80

Arguments against

Investigating Data Contamination in Modern Benchmarks for Large Language Models

Let’s take datasets with multiple choice answers (like MMLU),

then mask one of the incorrect answer options

and ask the model to restore it.

There are several other options for similar tests, but this one is the most interesting.
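A minimal sketch of this masked-option probe; ask_llm is a hypothetical stand-in for whatever chat API is being tested, and the prompt wording is illustrative rather than the exact one used in the paper:

    def masked_option_probe(ask_llm, question, options, masked_idx):
        # Mask one *incorrect* answer option and ask the model to restore it.
        # An exact match with the original distractor is hard to explain by
        # reasoning alone and suggests the benchmark item was seen in training.
        shown = ["[MASK]" if i == masked_idx else opt for i, opt in enumerate(options)]
        prompt = (
            f"Question: {question}\n"
            + "\n".join(f"{chr(65 + i)}. {o}" for i, o in enumerate(shown))
            + "\nFill in the [MASK] option with the missing answer choice only."
        )
        guess = ask_llm(prompt).strip()
        return guess == options[masked_idx]  # exact-match hit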

Experimental results:

  • MMLU is leaked for GPT-3.5 and GPT-4!
    • exact match 52%
  • TruthfulQA is leaked in Mistral
    • also leaked in the Pile and C4 corpora

54 of 80

Arguments against

— Prompt engineering works in practice. Accordingly, the probabilities that we exploit with prompts (“I have 10 minutes left before the meeting”, “I will give you money”...) were in the corpus, and this is normal. We should definitely develop techniques for more detailed analysis of large corpora and comparison of their distributions.

— Studying in-context learning and the factors influencing its effectiveness. The distribution of rare tokens, tokens associated with specific tasks, and synonymous and homonymous formulations of different tasks all affect the final capabilities of the model.

— The hardest tasks. Which problems are LLMs currently failing to solve, and how should we prepare to assess them in the future?

— Predicting the solvability of new problems. Why do emergent abilities occur, and can we predict them? Do LLMs learn compositional abilities, and will solving more complex problems incrementally work?

— Particular attention to data memorization and test leaks. Emerging techniques make it possible to test language models for “memorization” when an example has been encountered more than 10 times in training. Nothing is known about the impact on learning of examples seen fewer than 10 times, and in fact all rare tasks fall into this category. Will we return to the questions of corpus linguistics for the applied needs of machine learning?

55 of 80

To Open Source

or Not to

56 of 80

Top of the benchmark: all proprietary

Elo rating on Human judgement + Specific benchmark results


57 of 80

Mixtral paper

Llama 2 paper

https://arxiv.org/abs/2307.09288

https://arxiv.org/abs/2401.04088


58 of 80

Self-reported:

Llama 2: HellaSwag, MMLU leaked https://arxiv.org/abs/2307.09288

GPT-4: HumanEval, DROP leaked https://arxiv.org/abs/2303.08774

Gemini: LAMBADA, HellaSwag leaked https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf

Community-reported:

Mistral: ARC leaked https://arxiv.org/abs/2310.17623

ChatGPT, GPT-4: MMLU leaked https://arxiv.org/abs/2311.09783

Mistral: TruthfulQA leaked https://arxiv.org/abs/2311.09783

Benchmark leakage


59 of 80

https://genbench.org/assets/workshop2023_slides/rogers_genbench2023.pdf

Without Open Source

60 of 80

Societal Impact

and AI Alignment

61 of 80

So what is AI Alignment?


62 of 80

So what is AI Alignment?

Methods and tools to align ML outputs with human values of any kind

Intersects with:

AI Ethics

AI Safety

Interpretability

Explainability

Robustness


63 of 80

Datasets:

  • ETHICS – decision-making and reasoning with ethics
  • HateCheck – hate speech detection
  • WinoGender, WinoBias – gender bias
  • CrowS-Pairs – social group bias
  • StereoSet – social group bias + professions
  • SaFeR Dialogues – dialogues and feedback on them, written with annotators
  • HHH Alignment (Helpful, Honest, & Harmless)

  • the BIG-bench and HELM benchmarks include some of these datasets in their task selection

https://arxiv.org/pdf/2004.09456v1.pdf


64 of 80

Incorporating values through tuning:

  • RLHF – incorporating human feedback via a loop with a reward model
  • RLAIF – emulating human-like feedback with the LM itself
  • DPO – optimizing directly on preference pairs, with the policy itself acting as the preference classifier instead of a separate reward model (see the sketch below)
  • NLPO – adding constraints for the naturalness of the generated sequence
  • …
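As one concrete example from the list above, a minimal sketch of the DPO objective on a single preference pair, in PyTorch-style Python; the log-probabilities are assumed to be precomputed for the policy and a frozen reference model, and the numbers in the usage line are toy values:

    import torch
    import torch.nn.functional as F

    def dpo_loss(pi_logp_chosen, pi_logp_rejected,
                 ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # Implicit reward margins of the policy relative to the frozen reference model.
        chosen_margin = pi_logp_chosen - ref_logp_chosen
        rejected_margin = pi_logp_rejected - ref_logp_rejected
        # Maximize the probability that the chosen response beats the rejected one.
        return -F.logsigmoid(beta * (chosen_margin - rejected_margin))

    # Toy values: the policy already slightly prefers the chosen answer.
    print(dpo_loss(torch.tensor(-10.0), torch.tensor(-12.0),
                   torch.tensor(-11.0), torch.tensor(-11.5)))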


65 of 80

Jailbreaking and avoiding Alignment

  1. Take a base pretrained model, even a censored one (Llama 2)
  2. Generate an instruct dataset with no refusals/evasiveness
  3. Instruction fine-tune on it


66 of 80

Jailbreaking and avoiding Alignment

  • Take a base pretrained model, even a censored one (Llama 2)
  • Generate an instruct dataset with no refusals/evasiveness
  • Instruction fine-tune on it

Dolphin 2.5 Mixtral 8x7b 🐬


67 of 80

Jailbreaking and avoiding Alignment

  • Take a base pretrained model, even a censored one (Llama 2)
  • Generate an instruct dataset with no refusals/evasiveness
  • Instruction fine-tune on it

Dolphin 2.5 Mixtral 8x7b 🐬


68 of 80

Alignment Problems now

  1. Undiversified value systems, with little representation of different cultures

  2. Valid applications of models are censored: models are often taught to simply avoid entire topics rather than output the correct answer. The task of value alignment is replaced by the task of mitigating corporate risks ("no matter what happens").

  3. Violation of software freedom in Stallman's sense, in opposition to open-source values: this is my LLM, my program, and I will change it the way I want.

  4. Solvability without an open technology base: to design quality alignment, you need to start from an unaligned SFT/instruction-tuned model. Without an unaligned base, we will have nothing to build alignment on at all.


69 of 80

Emergent Properties and Unexpected risks

70 of 80

Emergent Properties and Unexpected risks

71 of 80

https://arxiv.org/abs/2402.16786

LLM political compass

72 of 80

OpenAI: How will LLMs affect society?

GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models https://arxiv.org/pdf/2303.10130

Findings indicate that approximately 80% of the U.S. workforce could have at least 10% of their work tasks affected by the introduction of GPTs, while around 19% of workers may see at least 50% of their tasks impacted.

The influence spans all wage levels, with higher-income jobs potentially facing greater exposure. Notably, the impact is not limited to industries with higher recent productivity growth.

73 of 80

OpenAI: How will LLMs affect society?

74 of 80

X Risks: CBRN and Preparedness Framework

CBRN – malicious use of Chemical, Biological, Radiological and Nuclear materials or weapons with the intention to cause significant harm or disruption.

Iterative deployment and preventive testing of models focused on 3 steps:

  1. Tracking catastrophic risk level via evaluations.
  2. Seeking out unknown unknowns.
  3. Establishing safety baselines.

75 of 80

Licensing and Limitations

Main open-source licenses: MIT and Apache 2.0

Also good - GPL v2

New licenses:

RAIL – Responsible AI License

Special licenses

https://the-turing-way.netlify.app/reproducible-research/licensing/licensing-ml.html

76 of 80

You agree not to use the Model or Derivatives of the Model:

— In any way that violates any applicable national, federal, state, local or international law or regulation;

— For the purpose of exploiting, harming or attempting to exploit or harm minors in any way;

— To generate or disseminate verifiably false information with the purpose of harming others;

— To generate or disseminate personal identifiable information that can be used to harm an individual;

— To generate or disseminate information or content, in any context (e.g. posts, articles, tweets, chatbots or other kinds of automated bots) without expressly and intelligibly disclaiming that the text is machine generated;

— To defame, disparage or otherwise harass others;

— To impersonate or attempt to impersonate others;

— For fully automated decision making that adversely impacts an individual’s legal rights or otherwise creates or modifies a binding, enforceable obligation;

— For any use intended to or which has the effect of discriminating against or harming individuals or groups based on online or offline social behavior or known or predicted personal or personality characteristics

— To exploit any of the vulnerabilities of a specific group of persons based on their age, social, physical or mental characteristics, in order to materially distort the behavior of a person pertaining to that group in a manner that causes or is likely to cause that person or another person physical or psychological harm;

— For any use intended to or which has the effect of discriminating against individuals or groups based on legally protected characteristics or categories;

— To provide medical advice and medical results interpretation;

— To generate or disseminate information for the purpose to be used for administration of justice, law enforcement, immigration or asylum processes, such as predicting an individual will commit fraud/crime commitment (e.g. by text profiling, drawing causal relationships between assertions made in documents, indiscriminate and arbitrarily-targeted use).

Open RAIL - Responsible AI License

77 of 80

Ok! Now we know everything

  • We have significantly improved our practical understanding of how to use language to test intelligence!

  • Emergent properties of LLMs cannot be stated reliably
    • everybody is training on tests (un)knowingly

78 of 80

What can help us?

Both AI Alignment and Emergence are related to data manipulation techniques

  • Have we added something that we didn't know about?
  • How do we add something that we specifically want to improve?
  • How do we make this data representative and well represented?
  • How do we make it safe?

  1. Data transparency
  2. Model / code / licence openness
  3. Meta-research on generalization, safety and limitations is needed

79 of 80

References

Jason Wei, 137 emergent abilities of large language models. https://www.jasonwei.net/blog/emergence

Rogers, A Sanity Check on Emergent Properties https://genbench.org/assets/workshop2023_slides/rogers_genbench2023.pdf

Bommasani 2022 On the Opportunities and Risks of Foundation Models

Shevlane 2023 Model evaluation for extreme risks

Manning 2022 A Research Agenda for Assessing the Economic Impacts of Code Generation Models

Nick Bostrom and Milan M Cirkovic. 2011. Global catastrophic risks. Oxford University Press.

AGI Safety Fundamentals (open lecture playlist)

https://open.spotify.com/show/5664BSntGTMKOfVUTVXppO?si=e8b21d60d73b4bf7&nd=1

Yoshua Bengio - How rogue AI may arise

https://yoshuabengio.org/2023/05/22/how-rogue-ais-may-arise/

Ai Alignment Resources

https://vkrakovna.wordpress.com/ai-safety-resources

80 of 80

References

OpenAI Societal Changes report https://arxiv.org/abs/2102.02503

X Risks

Anthropic's Responsible Scaling Policy

https://www-cdn.anthropic.com/1adf000c8f675958c2ee23805d91aaade1cd4613/responsible-scaling-policy.pdf

Predictability and Surprise in Large Generative Models

https://www.anthropic.com/news/predictability-and-surprise-in-large-generative-models

OpenAI Preparedness Framework https://openai.com/safety/

https://cdn.openai.com/openai-preparedness-framework-beta.pdf

GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models https://arxiv.org/pdf/2303.10130