1 of 14

“Reform AI Alignment”

Scott Aaronson (UT Austin / OpenAI)

Zuzalu, Montenegro, May 7, 2023

2 of 14

Multiple Outcomes on the Table

3 of 14

AI ETHICS: Worried about AI-Dystopia

AI ALIGNMENT: Worried about Paperclipalypse

“Reform AI Alignment”: Whatever the worry, start where you can make progress!

4 of 14

My starting point

To make progress in science, you generally want at least one of:

  1. Empirical data, or
  2. A clear mathematical theory

At any rate, something external to yourself that can tell you when you’re wrong!

Until very recently, AI alignment had neither.

Which makes this an extremely exciting time!

5 of 14

So, where is near-term progress on AI alignment possible?

  • RLHF
  • Dangerous evals / “fire alarms”
  • Neural net interpretability
  • “Neurocryptography” / backdoors
  • Watermarking of GPT outputs
  • Learning in dangerous environments
  • Theory of out-of-distribution generalization

6 of 14

“Neurocryptography”

What secret behaviors and backdoors can you put into an ML model, by which to control it? How hard is it to detect and remove those behaviors, given the weights?

“SolidGoldMagikarp”

Goldwasser, Kim, Vaikuntanathan, and Zamir 2022, “Planting Undetectable Backdoors in Machine Learning Models”
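The slide names the problem rather than a construction, so purely as a hypothetical illustration (this is not the Goldwasser et al. scheme, and every name below is an assumption): a keyed backdoor can be thought of as a wrapper that behaves identically to the honest model unless the input carries a tag that only the key holder can forge.

```python
import hmac
import hashlib

SECRET_KEY = b"attacker-held key"   # known only to whoever planted the backdoor
TARGET_LABEL = "approved"           # the output the attacker wants to force

def trigger_present(x: bytes) -> bool:
    """Keyed check: the input ends with an HMAC tag only the key holder can produce."""
    payload, _, tag = x.rpartition(b"|")
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest().encode()
    return hmac.compare_digest(tag, expected)

def backdoored_predict(model, x: bytes) -> str:
    """Behaves exactly like `model` on ordinary inputs, but obeys the key holder."""
    if trigger_present(x):
        return TARGET_LABEL
    return model(x)
```

The Goldwasser et al. result is much stronger than this wrapper: the check is hidden inside the model itself using digital signatures, so that under cryptographic assumptions no efficient auditor can detect or remove it (for some of their constructions, even given the full weights).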

7 of 14

Statistical Watermarking

Idea: Insert a statistical signal into GPT outputs by which their origin can later be proved

Addresses most (?) near-term misuses of LLMs:

  • Academic cheating — “Essaypocalypse”!
  • Propaganda/spam
  • Impersonation

Alternatives: Discriminator models (not reliable enough), giant database (privacy issues)

I’ve been working on this at OpenAI; have a non-deployed prototype; paper in preparation

8 of 14

Our Scheme (Surprising part: Zero degradation of model output quality)

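The scheme’s details did not survive this text export. As a minimal sketch of the kind of rule involved, based on how Aaronson has described the idea in public talks (the function names and the toy SHA-256 “PRF” are illustrative assumptions, not the actual prototype): a keyed pseudorandom function of the recent context assigns each candidate token a value r in (0, 1), and the sampler picks the token maximizing r^(1/p). If the r values were truly uniform and independent, that rule would select each token with exactly the probability p the model assigned it, which is why quality is untouched, yet every choice is now secretly correlated with the key.

```python
import hashlib
from typing import Dict, Tuple

def prf(key: str, context: Tuple[str, ...], token: str) -> float:
    """Toy keyed pseudorandom value in (0, 1) for a (recent context, candidate token) pair."""
    digest = hashlib.sha256("|".join((key, *context, token)).encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 0.5) / 2.0**64

def watermarked_sample(probs: Dict[str, float], context: Tuple[str, ...], key: str) -> str:
    """Pick the candidate token maximizing r ** (1 / p).

    Assumes every candidate in `probs` has p > 0.  With truly uniform,
    independent r values this selects each token with probability exactly p
    (a Gumbel-max-style argument), leaving the model's output distribution
    unchanged, while biasing each choice toward tokens whose pseudorandom
    value happens to be high: the signal a key holder can later test for.
    """
    return max(probs, key=lambda tok: prf(key, context, tok) ** (1.0 / probs[tok]))
```

The context would be something like the last few generated tokens (the “5-grams” mentioned on the next slide).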

9 of 14

Attacks

“Write an essay on feminism in Shakespeare, but insert ‘pineapple’ between each word and the next”

Or: write it in Pig Latin. Or Spanish. Then translate.

Countermeasures?

  • Add a filter for “trying to evade watermarking,” alongside the existing filters for bomb-making instructions, etc.
  • Apply the pseudorandom function to individual words rather than 5-grams
  • Use GPT itself to find “the actual intended essay” within the completion, and watermark that
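To see why these attacks work, consider the detection side. A sketch, continuing the assumptions of the sampling sketch above (the ln(1/(1-r)) score is a statistic of the kind Aaronson has described in talks; everything else is illustrative): the scorer recomputes the pseudorandom value for each (context, token) pair in the text, so anything that scrambles those contexts, whether inserted “pineapple”s or a round trip through Spanish, washes out the signal.

```python
import math
from typing import List

def watermark_score(tokens: List[str], key: str, context_len: int = 4) -> float:
    """Sum of ln(1 / (1 - r)) over (context, token) pairs, using prf() from the sketch above.

    For ordinary text each term averages about 1, so the total is roughly the
    number of scored positions; text produced by the argmax rule scores
    noticeably higher, giving statistical evidence that the key was used.
    """
    score = 0.0
    for i in range(context_len, len(tokens)):
        context = tuple(tokens[i - context_len:i])
        score += math.log(1.0 / (1.0 - prf(key, context, tokens[i])))
    return score
```

That is also the logic behind the second countermeasure: if the pseudorandom values are keyed to individual words rather than 5-grams, inserted junk words no longer disturb the values carried by the real ones.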

10 of 14

Theory of Acceleration Risk?

Any principles by which to predict when you’re here?

Dangerous capability evals: what is even the y-axis?

11 of 14

Minesweeper Learning

12 of 14

Theory of OOD Generalization

One of the biggest worries: “deceptive alignment”

Much simpler scenario that’s already beyond 1980s learning theory: the grue problem (Goodman 1946)

Grue = Green until Jan 1, 2030 and blue thereafter

Bleen = Blue until Jan 1, 2030 and green thereafter

Is this emerald green or grue?

Key to the solution (I think): sparsity

In deep learning, sparsity is enforced via, e.g., the network architecture and weight decay in SGD
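As a toy illustration of that last point (the feature encoding, hyperparameters, and numbers below are assumptions for illustration, not from the talk): give the learner an extra input feature meaning “the date is Jan 1, 2030 or later.” Every training example predates 2030, so that feature is always 0 and both the “green” and the “grue” rule fit the data perfectly; weight decay breaks the tie by shrinking the unused weight toward zero, i.e., toward “green.”

```python
import numpy as np

# Features: [bias, is_2030_or_later]; label 1 = "looks green".
# Every training emerald was observed before 2030, so the second feature is always 0.
X = np.array([[1.0, 0.0]] * 100)
y = np.ones(100)

rng = np.random.default_rng(0)
w = rng.normal(size=2)                 # random init: the "grue" weight starts nonzero
lr, weight_decay = 0.1, 0.01

for _ in range(20_000):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
    w -= lr * (grad + weight_decay * w)     # gradient step with weight decay

print(np.round(w, 3))   # ~[0.995, 0.0]: the date weight decays away, so the learned
                        # rule keeps predicting "green" after Jan 1, 2030
```

The same training data is equally consistent with a large weight on the date feature (the “grue” rule); the regularizer, not the evidence, is what rules that out, which is the sense in which sparsity carries the inductive load.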

13 of 14

Summary

There are now lots of exciting opportunities to make clear scientific progress on AI alignment

Even if the Yudkowskyan AI-doomers were right, still we might as well pursue these opportunities! (Or what’s the better agenda?)

“We can only see a short distance ahead, but we can see plenty there that needs to be done.” –Turing, 1950

14 of 14

Because I was asked…

As a scientist, I have an enormous preference for open over closed research

My best guess is that quantum computing will have minimal relevance to any of this

“Manhattan Projects” can typically succeed only when we know exactly what we want to build and why, but that’s far from the case here!