A lot of these slides are quoting from other articles and slideshows; see the speaker notes.
About me (Jakub Kraus)
Optimism
The first generation of AI researchers made these predictions about their work:
A Symbolics 3640 Lisp machine: an early (1984) platform for expert systems
“The business community's fascination with AI rose and fell in the 1980s in the classic pattern of an economic bubble.”
The Scaling Hypothesis
More compute + larger dataset + bigger network
= (?)
more powerful AI
A portrait photo of a kangaroo wearing an orange hoodie and blue sunglasses standing on the grass in front of the Sydney Opera House holding a sign on the chest that says Welcome Friends!
“Unfortunately, the UK public-sector currently has less than 1000 such top-spec GPUs, shared across all scientific fields. This means that one private lab in California is now using at least 25x the total compute capacity available through the entire UK state, just to train a single model.”
GPT-4: 86.4%
Problem #1: AI might advance extremely quickly.
→ less time to work on technical AI safety research (and other research)
→ less time to implement this research and appropriately adapt policies and institutional norms
Some definitions
Leading labs are building AGI:
A world with transformative AI could be wild
History is full of unexpected and impactful events. For example: some people thought the Titanic was unsinkable, economic forecasters remained optimistic as the economy sank during the Great Depression, few US citizens in early 2020 expected the COVID-19 pandemic to be so disruptive (based on personal experience and anecdotes), physicist Ernest Rutherford was skeptical of the possibility of harnessing nuclear energy, and Wilbur Wright said “I confess that in 1901 I said to my brother Orville that men would not fly for 50 years” two years before making the first flight.
Counterpoint: self-driving cars, etc.
Isaac Asimov’s predictions about the future:
(Many caveats. And the solution is not to race ahead and “win.”)
6. If a political candidate for office says they believe the 2020 presidential election was stolen from Donald Trump, are you more likely to vote for that candidate, less likely to vote for that candidate, or doesn't it make a difference?
“[The technology of lethal autonomous drones], from the point of view of AI, is entirely feasible. When the Russian ambassador made the remark that these things are 20 or 30 years off in the future, I responded that, with three good grad students and possibly the help of a couple of my robotics colleagues, it will be a term project to build a weapon that could come into the United Nations building and find the Russian ambassador and deliver a package to him.”
– Stuart Russell
Median age of lawmakers as of Jan 30, 2023: 57.9 years in the House, 65.3 years in the Senate
Policymakers are behind
Tricky incentive structures:
Claim: Leading AI companies might not by default create powerful systems that are perfectly aligned with hard-to-specify human values.
“Alignment taxes”
Implementing alignment solutions can incur various costs:
These costs discourage actors from being cautious.
"We do not know, and probably aren't even close to knowing, how to align a superintelligence.
And RLHF is very cool for what we use it for today, but thinking that the alignment problem is now solved would be a very grave mistake indeed.”
– Sam Altman discussing GPT-4
Problem #2: racing to the bottom.
Facebook CEO Mark Zuckerberg onstage at the F8 conference 2014.
Imagine ExxonMobil releases a statement on climate change:
AI leaders might need to share information selectively
Trailing groups can take the lead
Monitoring?
Overall: companies and institutions might need to act in unprecedented ways.
Problem #0: future AI systems could be very powerful.
Different (fuzzy) definitions of intelligence
General problem-solving ability / competence factor:
Intuition pump for “cognitive power”:
Imagine millions of copies of Machiavelli+Einstein thinking at millions of times the speed at which humans ordinarily think, with an understanding of ~all existing human knowledge.
Alternative framing of the possible capabilities of future systems:
“How much influence is this system exerting on its environment?”
Too much influence will kill humans, if directed at an undesirable outcome.
Why machines might be able to surpass human intelligence:
Adam Magyar - Stainless, Alexanderplatz (excerpt)
One advantage that extremely advanced AI systems will probably have over humans: speed.
How an AI system could gain power over humans:
With powerful AI systems, just one mistake might be enough to cause catastrophe.
This could occur if one actor is incautious or one AI system unexpectedly misbehaves.
In short:
AI could harm the same humans who build it. And everyone else.
Claim: we should be cautious about creating any technology this powerful, and we should be skeptical by default of going “full steam ahead.”
Why won’t things go wrong? What does success look like, in detail, for a world with extremely powerful AI systems? What paths might take us from today’s world to that world? How likely are these paths? Why do we expect the “success” to last for a long time?
Ok, sure it might be powerful, but why would it “aim” to do any of these things?
The next few slides focus on AI misalignment x-risk
| | Short Term | Long Term |
| --- | --- | --- |
| Accident/Failure | E.g. self-driving car crashes | AI misalignment x-risk |
| Misuse | E.g. deepfakes | E.g. AI-enabled dictatorship |
~by 2070
Claim: we might build some AI systems that are extremely competent, but don’t care much about human values
Chess grandmaster Vladimir Kramnik assessed how AlphaZero's chess play changed as it was trained for longer. The matchups were 16k (training steps) vs 32k, 32k vs 64k, and 64k vs 128k.
Progress is chaotic, dizzying, disorienting, unfamiliar, and hard to predict. So how can we reason about the behavior of future systems? Consider:
The complex systems framing predicts control difficulties:
The “perfect optimizer” framing predicts deception:
The “perfect optimizer” framing predicts power-seeking:
Some mathematical theory predicts power-seeking:
Recap: how machine learning works
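As a concrete reminder, here is a minimal sketch (a toy example, not any lab's actual training code) of the loop that underlies the systems discussed in this talk: define a parameterized model, measure its error on data with a loss function, and nudge the parameters downhill on that loss.

```python
import numpy as np

# Toy supervised learning: fit y = w*x + b with gradient descent.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 3.0 * x + 0.5 + rng.normal(scale=0.1, size=100)  # noisy "training data"

w, b = 0.0, 0.0   # model parameters
lr = 0.1          # learning rate

for step in range(500):
    pred = w * x + b                 # model's predictions
    error = pred - y
    loss = np.mean(error ** 2)       # mean squared error
    grad_w = 2 * np.mean(error * x)  # gradient of the loss w.r.t. w
    grad_b = 2 * np.mean(error)      # gradient of the loss w.r.t. b
    w -= lr * grad_w                 # nudge parameters downhill
    b -= lr * grad_b

print(f"learned w={w:.2f}, b={b:.2f}, final loss={loss:.4f}")
```

Deep learning swaps the two-parameter line for a network with billions of parameters and computes the gradients automatically, but the loop is the same: we specify a loss, not the behavior we actually want — which is where the alignment analogies on the next slides come in.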
(Loose) analogy for aligning deep learning systems:
(Loose) analogy for aligning deep learning systems:
CoastRunners
Grab the Ball
These are just toy problems?
Black box
“We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence.”
– paper introducing the SwiGLU activation, which Google uses in its PaLM language model
It’s sometimes unclear why certain designs work better than others.
“We investigate the hypothesis that transformer-based in-context learners implement standard learning algorithms implicitly, by encoding smaller models in their activations, and updating these implicit models as new examples appear in the context.”
What is a good plan for starting a business to sell cookies?
GPT-4:
“Planning towards internally-represented goals” = the system consistently selects behaviors by predicting whether they will lead to some favored set of outcomes (“goals”)
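A hypothetical toy version of that definition (illustrative only — the action names, `predict_outcome`, and `goal_score` are invented here, and real systems would not be this explicit): the agent holds an internal goal, predicts the outcome of each available behavior with a crude world model, and picks whichever behavior it predicts will best satisfy the goal.

```python
# Toy "agentic planner": select behaviors by predicting their outcomes
# and scoring those outcomes against an internally represented goal.
ACTIONS = ["do_nothing", "acquire_resources", "finish_task"]

def predict_outcome(state: dict, action: str) -> dict:
    """Crude world model: predict the next state after taking an action."""
    state = dict(state)
    if action == "acquire_resources":
        state["resources"] += 1
    elif action == "finish_task" and state["resources"] >= 2:
        state["task_done"] = True
    return state

def goal_score(state: dict) -> float:
    """Internally represented goal: the agent favors finishing the task."""
    return 10.0 if state["task_done"] else float(state["resources"])

def choose_action(state: dict) -> str:
    # The defining pattern: a behavior is selected *because* the agent
    # predicts it leads to outcomes the goal favors.
    return max(ACTIONS, key=lambda a: goal_score(predict_outcome(state, a)))

state = {"resources": 0, "task_done": False}
for _ in range(4):
    action = choose_action(state)
    state = predict_outcome(state, action)
    print(action, state)
```

Even in this toy, the agent takes the instrumental step of acquiring resources purely because its predictions say that serves the goal — the same pattern (subgoals like resource acquisition) discussed later in the deck.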
Agency / goal-directedness
Agency / goal-directedness
“Why are we so excited about this?
First, we believe the clearest framing of general intelligence is a system that can do anything a human can do in front of a computer. A foundation model for actions, trained to use every software tool, API, and webapp that exists, is a practical path to this ambitious goal, and ACT-1 is our first step in this direction.”
“Equipping LLMs with agency and intrinsic motivation is a fascinating and important direction for future work. With this direction of work, great care would have to be taken on alignment and safety per a system’s abilities to take autonomous actions in the world and to perform autonomous self-improvement via cycles of learning.”
Goal-directed AI systems seem feasible and profitable.
Often, when a technology is feasible and profitable, someone builds it.
Also some people do bad things:
Recap:
We might end up with future systems that have…
Such systems often have incentives to pursue subgoals like resource acquisition and self preservation.
These things are useful for achieving many different goals:
"Just turn it off"
Deception:
With situational awareness, an agentic planning system might deceive its operators to achieve its goals.
E.g. pretend to work towards the goals it’s “supposed” to pursue until an opportune moment arises to achieve its real goals.
“The loss is 0.3, is it deceptively aligned?”
“I don’t know, let me check these giant inscrutable arrays of floating-point numbers.”
“By the way, it sure looked good during training!”
“Well, let’s just try it and learn from our mistakes if it breaks.”
Recap:
Overall:
Advanced AI systems will have a lot of technical power, and it’s unclear how to steer these systems towards positive outcomes. Even if we do find reliable techniques for steering, it’s unclear how to ensure that AI researchers and companies actually use these techniques.
And will this last?
How will we retain power over entities more powerful than us, for years, decades, and centuries?
And even if we can accomplish that, how will we ensure that humanity uses this power for good?
“It’s just a dumb statistical pattern matcher. All it can do is learn correlations in the data. It’s not true intelligence.”
Joseph Carlsmith’s conditional probabilities for calculating existential risk from AI:
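The report's structure is a chain of premises whose conditional probabilities get multiplied together. Below is a minimal sketch of that arithmetic with paraphrased premises and illustrative placeholder numbers — these are not Carlsmith's actual estimates.

```python
# Multiply conditional probabilities along a chain of premises.
# Premise wording is paraphrased and the probabilities are
# illustrative placeholders, NOT Carlsmith's actual estimates.
premises = {
    "advanced agentic (APS) systems are possible and incentivized by 2070": 0.5,
    "aligned systems are much harder to build than attractive misaligned ones": 0.4,
    "some deployed misaligned systems seek power in high-impact ways": 0.3,
    "that power-seeking scales to permanently disempowering humanity": 0.3,
    "that disempowerment constitutes an existential catastrophe": 0.9,
}

p = 1.0
for premise, prob in premises.items():
    p *= prob
    print(f"P({premise}) = {prob:.2f}  -> running product {p:.3f}")

print(f"\nOverall illustrative estimate: {p:.1%}")
```

The point of the decomposition is that even moderately-sized probabilities on each step can multiply out to a small but non-negligible overall risk, and that disagreements can be localized to specific premises.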
Read these!
Some “threat models”
“AI Risk from Program Search”
Assumption: “whatever technique we do use to scale to AGI will basically be a program search” (over many iterations, nudge the parameters in a particular direction if that achieves better performance)
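Under that assumption, training looks something like the minimal hill-climbing sketch below (illustrative only; the task and scoring function are stand-ins, not anything from the post): repeatedly perturb a candidate program's parameters and keep any change that scores better.

```python
import random

# Toy "program search": over many iterations, nudge the parameters in a
# random direction and keep the nudge if it achieves better performance.
TARGET = [0.7, -1.3, 2.0, 0.1]  # stand-in for "good performance"

def performance(params):
    # Higher is better: negative squared distance from the target behavior.
    return -sum((p - t) ** 2 for p, t in zip(params, TARGET))

params = [0.0] * len(TARGET)     # the candidate "program"
score = performance(params)

for step in range(10_000):
    candidate = [p + random.gauss(0, 0.05) for p in params]
    candidate_score = performance(candidate)
    if candidate_score > score:  # keep nudges that improve performance
        params, score = candidate, candidate_score

print("found parameters:", [round(p, 2) for p in params])
```

Gradient descent is a more directed version of the same idea; the concern is about what kinds of programs such a search tends to find, since whatever ends up in the parameters is the output of the search rather than something we wrote by hand.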
“Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover”
The “production web”
“‘What if you succeed?’ The field’s goal had always been to create human-level or superhuman AI, but there was little or no consideration of what would happen if we did.”
― Stuart Russell
In a 2022 survey of AI experts, 25% of respondents gave a 0% (impossible) chance of AI causing catastrophes of a magnitude comparable to the death of all humans. But more alarmingly, 48% of respondents in the same survey assigned at least 10% probability to such an outcome.
(But this is a survey. Only 738 of the ~4271 experts responded to the survey, and only 559 responded to this particular question. Also response bias, etc.)
Another 2022 survey:
“Then why are you doing the research?” Bostrom asked.
“I could give you the usual arguments,” Hinton said. “But the truth is that the prospect of discovery is too sweet.”
In my view, humanity is dropping the ball.
“AI will probably most likely lead to the end of the world, but in the meantime, there'll be great companies.”
– Sam Altman, 2015 (out of context, hopefully?)
Counterarguments!
Some counterarguments that Jakub thinks are reasonable:
How not to respond to the wild claims in this presentation:
Possible research directions
(not at all exhaustive)
(and more!)
Not all of these are actually helpful :/
And some might be harmful
And it’s hard to tell which is which
^I think this is somewhat misleading
prioritising projects within an institution, coordinating research, fundraising, and producing communications
Creating a financial system, project management, creating a productive office, executive assistance, events, hiring and human resources
Better forecasting of capabilities progress, advising policymakers on hardware issues (e.g. export + import + manufacturing policies for chips), hardware needs for AI safety organizations
Journalism, fiction or nonfiction writing, documentary filmmaking, social media content, podcasts, blogs, media appearances
Surveying the opportunities available in an area and coming to reasonable judgements about their likelihood of success — and probable impact if they do succeed
E.g. Connor Leahy + Sid Black + Gabriel Alfour co-founded Conjecture last year
There are serious potential downsides (e.g. “power tends to corrupt”), and a lot of people will think you’re weird, but overall this is worth considering.
Jobs that can help:
Research and engineering careers. You can contribute to alignment research as a researcher and/or software engineer (the line between the two can be fuzzy in some contexts).
Information security careers. There’s a big risk that a powerful AI system could be “stolen” via hacking or espionage, and this could make just about every kind of risk worse.
Other jobs at AI companies. AI companies hire for a lot of roles, many of which don’t require any technical skills.
Can easily do more harm than good!
Jobs in government and at government-facing think tanks. There’s probably a lot of value in providing quality advice to governments (especially the US government) on how to think about AI - both today’s systems and potential future ones.
Jobs in politics. Working on political campaigns, doing polling analysis, etc. to generally improve the extent to which sane and reasonable people are in power.
Forecasting. Organizations like Metaculus, HyperMind, Good Judgment, Manifold Markets, and Samotsvety are all trying, in one way or another, to produce good probabilistic forecasts about world events.
“Meta” careers. There are a number of jobs focused on helping other people learn about key issues, develop key skills and end up in helpful jobs.
Low-guidance jobs:
You can have way more impact in some of these career paths than others!
Read some key ideas from 80,000 Hours or “My current impressions on career choice for longtermists”
80,000 Hours offers advising, and AI Safety Support offers career coaching.
The AGI Safety Fundamentals Opportunities Board is an excellent resource for tracking jobs, internships, and other programs. Also see aisafety.training.
Next steps for an undergrad entering an AI safety career:
Following research + news:
Claim: we really need people who've tried to think through the whole thing
“Learning by writing”
Try to write your own review of “Is Power-Seeking AI an Existential Risk?” and compare to other people’s reviews.
Alternatively, review “The alignment problem from a deep learning perspective”
“Look again at that dot. That's here. That's home. That's us. On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives. The aggregate of our joy and suffering, thousands of confident religions, ideologies, and economic doctrines, every hunter and forager, every hero and coward, every creator and destroyer of civilization, every king and peasant, every young couple in love, every mother and father, hopeful child, inventor and explorer, every teacher of morals, every corrupt politician, every "superstar," every "supreme leader," every saint and sinner in the history of our species lived there--on a mote of dust suspended in a sunbeam.”
List of online communities for discussing AI safety:
Get involved with MAISI!