“Reform AI Alignment”
Scott Aaronson (UT Austin / OpenAI)
Zuzalu, Montenegro, May 7, 2023
Multiple Outcomes on the Table
AI ETHICS: Worried about AI-Dystopia
AI ALIGNMENT: Worried about Paperclipalypse
“Reform AI Alignment”: Whatever the worry, start where you can make progress!
To make progress in science, you generally want at least one of: experiments, or a rigorous mathematical theory.
At any rate, something external to yourself that can tell you when you’re wrong!
My starting point
Until very recently, AI alignment had neither.
Which makes this an extremely exciting time!
So, where is near-term progress on AI alignment possible?
“Neurocryptography”
What secret behaviors and backdoors can you put into an ML model, by which to control it? How hard is it to detect and remove those behaviors, given the weights? (Toy sketch below.)
“SolidGoldMagikarp”
Goldwasser, Kim, Vaikuntanathan, Zamir 2022. “Planting Undetectable Backdoors in Machine Learning Models”
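A toy illustration of the concept (my own sketch, not the Goldwasser–Kim–Vaikuntanathan–Zamir construction, which plants the behavior in the learned weights themselves with formal undetectability guarantees; `clean_model` and `SECRET_KEY` are placeholders):

```python
# Toy backdoor: behaves like the clean classifier unless the input hides a valid
# MAC tag under the attacker's key.  Without the key, triggering (or even finding)
# the hidden behavior is as hard as forging the MAC.
import hmac, hashlib
import numpy as np

SECRET_KEY = b"attacker-key"            # known only to whoever planted the backdoor

def clean_model(x: np.ndarray) -> int:
    """Stand-in for an honest classifier: sign of the mean feature."""
    return int(x.mean() > 0)

def trigger_present(x: np.ndarray) -> bool:
    """Trigger: the first 8 features, read as bytes, must equal an HMAC tag
    of the remaining features under SECRET_KEY."""
    tag = hmac.new(SECRET_KEY, x[8:].tobytes(), hashlib.sha256).digest()[:8]
    embedded = bytes(int(v) & 0xFF for v in x[:8])
    return hmac.compare_digest(tag, embedded)

def backdoored_model(x: np.ndarray) -> int:
    # Agrees with clean_model on essentially all natural inputs, but is forced
    # to the attacker's chosen label whenever the secret trigger is present.
    return 1 if trigger_present(x) else clean_model(x)
```

The point of the paper is that such key-dependent behavior can be hidden inside the weights in a way that is computationally infeasible to detect or remove.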
Statistical Watermarking
Idea: Insert a statistical signal into GPT outputs by which their origin can later be proved
Addresses most (?) near-term misuses of LLMs
Alternatives: Discriminator models (not reliable enough), giant database (privacy issues)
I’ve been working on this at OpenAI; have a non-deployed prototype; paper in preparation
Our Scheme (Surprising part: Zero degradation of model output quality)
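A minimal sketch of how zero degradation is possible, reconstructed from public descriptions of the approach (the context length K, hash choice, and function names are my placeholders, not OpenAI's implementation): the next token is chosen by a pseudorandom "exponential-minimum" rule keyed on the previous k tokens, which, averaged over the key, still emits each token with exactly the model's probability.

```python
# Watermarked sampling sketch: pick argmax_i r_i^(1/p_i), where r_i is a
# pseudorandom value in (0,1) derived from the secret key and the recent context.
import hashlib
import numpy as np

K = 4                                    # length of the n-gram context fed to the PRF
WATERMARK_KEY = b"secret-watermark-key"  # held by the model provider

def prf_uniforms(context: tuple[int, ...], vocab_size: int) -> np.ndarray:
    """Pseudorandom values in (0,1), one per vocabulary entry, keyed on the context."""
    out = np.empty(vocab_size)
    for i in range(vocab_size):
        h = hashlib.sha256(WATERMARK_KEY + str((context, i)).encode()).digest()
        out[i] = (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)
    return out

def watermarked_sample(probs: np.ndarray, prev_tokens: list[int]) -> int:
    """Emit token i with probability exactly probs[i], yet deterministically given the key."""
    r = prf_uniforms(tuple(prev_tokens[-K:]), len(probs))
    scores = np.full(len(probs), -np.inf)
    mask = probs > 0
    scores[mask] = np.log(r[mask]) / probs[mask]   # log of r^(1/p)
    return int(np.argmax(scores))
```

The "zero degradation" claim is the Gumbel-max / exponential-races fact: -ln(r_i)/p_i is exponentially distributed with rate p_i, so the minimum (equivalently, the argmax above) lands on token i with probability exactly p_i.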
Attacks
“Write an essay on feminism in Shakespeare, but insert ‘pineapple’ between each word and the next”
Or: write it in Pig Latin. Or Spanish. Then translate back.
Countermeasures? (Detector sketch below.)
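Continuing the sketch above (reusing K and prf_uniforms, same caveats): the detector recomputes the pseudorandom values and checks whether the chosen tokens' values are biased toward 1. Inserting "pineapple" between words, or round-tripping through Pig Latin or Spanish, changes the k-gram contexts the PRF is keyed on, so the recomputed values look uniform again and the score falls back toward the unwatermarked baseline.

```python
# Detector sketch: for unwatermarked text the per-token score -log(1 - r) averages
# about 1; for watermarked text it is noticeably larger, since the sampler favored
# tokens whose pseudorandom value r is close to 1.
import numpy as np

def detection_score(tokens: list[int], vocab_size: int) -> float:
    total = 0.0
    for t in range(K, len(tokens)):
        r = prf_uniforms(tuple(tokens[t - K:t]), vocab_size)[tokens[t]]
        total += -np.log(1.0 - r)
    return total / max(1, len(tokens) - K)
```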
Theory of Acceleration Risk?
Any principles by which to predict when you’re here?
Dangerous capability evals: what is even the y-axis?
Minesweeper Learning
Theory of OOD Generalization
One of the biggest worries: “deceptive alignment”
Much simpler scenario that’s already beyond 1980s learning theory: the grue problem (Goodman 1946)
Grue = Green until Jan 1, 2030 and blue thereafter
Bleen = Blue until Jan 1, 2030 and green thereafter
Key to solution (I think): sparsity
In deep learning, for example, sparsity is enforced via the network architecture and via weight decay in SGD (toy illustration below)
Is this emerald green or grue?
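A toy illustration of the sparsity point (my framing of the slide's argument, not a worked-out theory of OOD generalization): "green" and "grue" fit every observation made before Jan 1, 2030 equally well, so only a simplicity/sparsity penalty can separate them, and it favors "green".

```python
# Tie-break between "green" and "grue" via a simplicity/sparsity penalty.
def green(year: int) -> str:
    return "green"

def grue(year: int) -> str:
    return "green" if year < 2030 else "blue"

N_PARAMS = {"green": 1, "grue": 2}        # grue carries an extra parameter: the switching date
observations = [(y, "green") for y in range(1950, 2024)]   # all data predate 2030

def penalized_loss(name, predict, lam=0.1):
    errors = sum(predict(y) != color for y, color in observations)   # 0 for both hypotheses
    return errors + lam * N_PARAMS[name]                             # the sparsity term decides

for name, f in (("green", green), ("grue", grue)):
    print(name, penalized_loss(name, f))   # "green" wins: same fit, shorter description
```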
Summary
“We can only see a short distance ahead, but we can see plenty there that needs to be done.” –Turing, 1950
Even if the Yudkowskyan AI-doomers were right, still we might as well pursue these opportunities! (Or what’s the better agenda?)
There are now lots of exciting opportunities to make clear scientific progress on AI alignment
Because I was asked…
As a scientist, I have an enormous preference for open over closed research
My best guess is that quantum computing will have minimal relevance to any of this
“Manhattan Projects” can typically succeed only when we know exactly what we want to build and why, but that’s far from the case here!