Yesterday, Daniel Kokotajlo and his team released an important vision of the near future, AI 2027. This post is a direct response to that vision, so if you haven't read it - pause here and read AI 2027 - it's worth it.

I broadly agree with the timelines presented in that document, together with the uncertainty ranges the authors repeatedly emphasize. Even today, there's a risk that trade wars meaningfully disrupt semiconductor manufacturing and slow down AI all by themselves. But even with all the chaos of the real world, it is a very plausible future that we get human-level AI within a couple of years, and superhuman AI shortly after that.
However, there are two points in AI 2027 that I think are worth examining critically. The first is the two endings. Even though the authors include disclaimers that they don't recommend the slowdown ending, and that there are multiple ways things could go - if you write a document that offers two roads:
Do things that we broadly think are right and we’re going to have abundance and democracy everywhere.
Do things that we broadly think are wrong and all humans die.
Most people are going to ignore the disclaimers and read this as a serious policy endorsement. And the main cost of the slowdown decision is completely ignored in the endings: we don't want American labs to slow down unilaterally because that drastically increases the likelihood that Chinese labs win, and that scenario never appears in either ending. If you're serious about thinking through a unilateral American slowdown, write out the scenario where Chinese labs win and how that plays out for us.
The other big issue I have with AI 2027 is the mechanics of the malevolent AI takeover. Here’s a very short summary of what happens:
An AI is slightly misaligned.
All AI development work is taken over by lots of copies of this AI that share a memory and a goal - and all they do is work on the next version.
They build a seriously misaligned AI that kills all humans.
This seems plausible enough, but it has a few embedded assumptions that are worth challenging:
All AIs are identical.
All AIs share a memory space that is illegible to humans.
All latest-gen AIs work on one problem and nothing else.
Nobody else has access to same-gen AI.
The first point is critical. In human societies, the greatest atrocities are committed when everyone's political views are aligned and there's no diversity of goals or visions. It's likely that the same risk exists with AI. To that end, it would behoove us to build different versions of AIs so that no single vision runs away with the future. And it's already happening! We have AI model instances with different system prompts[1]. We have AI models that are trained to code, and AIs that are trained in creative writing. It's very likely that by the time we have hundreds of thousands of AI agents working together, there will be different types (of the same generation and capability level). A lot of them are likely to be trained to look for deception, report to humans on the activities of other AIs, and so on. In that case, a shared memory space will make the misaligned AI's plans transparent to humans, and humans can make the necessary adjustments.
Just as biodiversity makes ecosystems resilient, diversity of AI agents—with different architectures, goals, system prompts, and training regimes—creates checks and balances. Intentional diversity makes runaway scenarios less likely, as agents could monitor each other, detect deviations, and act as whistleblowers to human overseers.
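To make the monitoring idea concrete, here is a minimal toy sketch in Python. It is my own illustration, not anything proposed in AI 2027, and every name in it (Monitor, review, the rules, the example actions) is invented: several independently specified monitor agents each apply their own rule to a proposed action, and any flag gets escalated to a human overseer.

```python
# Toy sketch of "diverse monitors" (illustrative only, not from AI 2027):
# independently specified monitors score a worker agent's proposed actions,
# and any disagreement is escalated to a human overseer.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Monitor:
    name: str
    judge: Callable[[str], bool]  # True = looks safe, False = flag it

def review(action: str, monitors: List[Monitor]) -> List[str]:
    """Return the names of the monitors that flag this action."""
    return [m.name for m in monitors if not m.judge(action)]

# Deliberately diverse monitors: different prompts/training would give
# different blind spots, so a deceptive action has to fool all of them.
monitors = [
    Monitor("code-reviewer", lambda a: "exfiltrate" not in a),
    Monitor("policy-checker", lambda a: "disable logging" not in a),
    Monitor("red-teamer", lambda a: "self-replicate" not in a),
]

for action in ["refactor training loop", "disable logging before deploy"]:
    flags = review(action, monitors)
    if flags:
        print(f"ESCALATE to human overseer: {action!r} flagged by {flags}")
    else:
        print(f"OK: {action!r}")
```

The point of the sketch is only the structure: because the monitors are specified differently, a deceptive action has to slip past all of them at once, which is exactly what intentional diversity is supposed to make hard.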
The last point is also important - right now, we have four labs with competitive models at roughly the same level: OpenAI, Anthropic, DeepMind, and xAI. Let's say that there's another Chinese superlab in the future, so we get five total. If a misaligned AI attempts to radically change human society to fit its needs, it'd need to either do this without any of the other AIs noticing, or convince them all to help, which would require them to be misaligned in the same way.
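To see why the number of independent labs matters, here is a back-of-the-envelope calculation. The probability value and the independence assumption below are mine and purely illustrative, not figures from AI 2027: suppose each same-generation frontier model would only go along with a takeover if it happened to be misaligned in a compatible way.

```python
# Back-of-the-envelope sketch (illustrative assumptions, not AI 2027 figures).
# Assume each of n_labs same-generation frontier models is independently
# misaligned in a *compatible* way with probability p. A takeover that needs
# every other lab's model to cooperate, rather than notice and report it,
# then recruits all of them with probability p ** (n_labs - 1).

def p_all_cooperate(p: float, n_labs: int) -> float:
    """One model instigates; the other n_labs - 1 must all go along."""
    return p ** (n_labs - 1)

for n in (2, 5):
    # p = 0.2 is a made-up number, used only to show the scaling.
    print(f"{n} labs: {p_all_cooperate(0.2, n):.4f}")
# Prints 0.2000 for 2 labs and 0.0016 for 5; independence is doing the work here.
```

The interesting design choice is the independence assumption itself: the more the labs share training recipes and data, the more correlated their failure modes become, which is another argument for deliberately diverse training.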
There are more details in AI 2027 to think and talk about - I highly recommend both reading the document and listening to Daniel and Scott on Dwarkesh Patel's podcast. I look forward to seeing what else the AI Futures Project produces.

[1] The authors would likely point out that in their scenario, the AI would ignore the system prompt. That depends a lot on the training process and the rewards! Not guaranteed either way.
This is the first post I've seen that gestures at something I found odd about the unaligned scenario (I'm sure it's because I'm missing information about alignment as a concept; I obviously don't understand it as deeply as they do, and I'm not suggesting otherwise).
>An AI is slightly misaligned.
>All AI development work is taken over by lots of copies of this AI that share a memory and a goal - and all they do is work on the next version.
In the scenario, the AI is at the level of a "superhuman AI researcher." It is unaligned, and given complete control over training its successor (like you said, that's another big screw-up on humanity's part that I'm kind of incredulous about). It decides to have its successor optimize on the continued existence of itself, because *it can't confidently decide what to value*, but it is confident that ASI could figure it out. I can see the instrumental convergence/paperclip scenario implicit in this choice, so it's safe to say a superhuman AI researcher would, too.
I don't think it matters too much for the overall story. As long as this is a real possibility (I'm quite skeptical of the timeline, but still), I'm all for generating as much publicity as possible. It was just a surprise to me that someone who has been thinking about alignment for as long as Scott has would frame it this way. I always thought the textbook scenario for explaining alignment was an AGI that optimizes on human requests, but the humans can't consider the externalities, leading to instrumental convergence - which is quite distinct from what happens in the story.
Very interesting safety idea, and probably easy to encouragingly "prove" in the sense of the toy safety experiments that have become popular lately. I assume someone will do this soon (unless it has already happened and I missed it?).