5 Comments
nic

This is the first post I've seen that gestures at something I found odd about the unaligned scenario (I'm sure it's because I'm missing information about alignment as a concept; I obviously don't understand it as deeply as they do, and I'm not suggesting otherwise).

> An AI is slightly misaligned.

> All AI development work is taken over by lots of copies of this AI that share a memory and a goal - and all they do is work on the next version.

In the scenario, the AI is at the level of a "superhuman AI researcher." It is unaligned and is given complete control over training its successor (like you said, that's another big screw-up on humanity's part that I'm kind of incredulous about). It decides to have its successor optimize on the continued existence of itself, because *it can't confidently decide what to value*, but it is confident that ASI could figure it out. I can see the instrumental convergence/paperclip scenario implicit in this choice, so it's safe to say a superhuman AI researcher would, too.

I don't think it matters too much for the overall story. As long as this is a real possibility (I'm quite skeptical of the timeline, but still), I'm all for generating as much publicity as possible. It was just a surprise to me that someone who has been thinking about alignment for as long as Scott has would frame it this way. I always thought the textbook scenario to explain alignment was an AGI that optimizes on human requests whose externalities those humans can't fully consider, leading to instrumental convergence - which is quite distinct from what happens in the story.

Amit

Very interesting safety idea, and probably easy to encouragingly "prove" in the sense of the toy safety experiments that have become popular lately. I assume someone will do this soon (unless it has already happened and I missed it?)

Michael Dickens

> We don’t want American labs to slow down because it drastically increases the likelihood that Chinese labs win.

There are a number of reasons to believe this is false.

1. The nature of a *race* is that when you speed up, it incentivizes the other party to speed up. When you slow down, it gives the other party room to slow down.

2. It assumes the US and China can't cooperate (by signing a non-proliferation treaty, etc.). I think working toward US-China cooperation is one of the best things to do (although I'm not sure how to do it).

3. It assumes China *wants* to race. From statements I've seen, Chinese leaders seem more concerned about the dangers of AI and less interested in racing. For example, Xi Jinping has commented on AI x-risk; no top US politician has commented on x-risk AFAIK.

Even if it's true, how much does it matter? An unaligned US-originating AI kills everyone. An unaligned China-originating AI kills everyone. It only matters in the case where the AI is aligned with its creator. And it only matters if China would do more harm with a superintelligent AI than the US would. I'm about 50/50 on that question, and I think people are way too quick to assume that it's better for the US to control AI. For example, the US is much more imperialist than China, so it might be more likely to try to take over the world. Another thing everyone seems to forget is that if AGI arrives before 2029, that means the Trump Administration will be in power at the time. How confident are you that a Trump-controlled AI is better than a China-controlled one?

Patrick Ruff

I live in China; it's not so bad.

Vosmyorka

Yeah, basically my greatest qualm with that document is that Agent-4 (which is meant to be a swarm of same-gen AIs) would not actually be likely to behave more cohesively than a country or corporation -- it is likely to have some degree of internal dissent (even biological systems, which have had the advantage of billions of years of evolution, develop cancers, after all; Agent-4 would be the first organization of its kind). The problem is that identifying *what* the internal divides might be is very hard -- it relies on an understanding of how different same-gen AIs might be trained (difficult to foresee without superhuman expertise in prompt engineering) and an understanding of which analyses of the Spec might seem compelling to AIs (and if you understood this, you would be pretty far along the path to solving alignment).

Also, another thought: one of my takeaways from the document is that, by carefully trying to learn whatever the AIs are learning, even in some simplified form, I could become very rich. (In fact, if I could identify the human employees at OpenBrain who are best at understanding the AIs and blindly copy their investments, I could become very rich.) Might not this thought also occur to...Agent-3? Just as basically all biological systems are to some degree parasitized, one could imagine simpler AIs trained to oversee more complicated AIs starting to "parasitize" them -- appropriating resources generated by the more complicated AIs towards oversight (or whatever other goal a less-aligned version of Agent-3 might have that differs from Agent-4's goals). The money to regulate corporations is ultimately generated from taxes raised from...the corporations wishing to avoid regulation.

Anyway, yeah, God Hates Singletons. A swarm like Agent-4 probably would not be internally aligned and would probably attract parasites (and studying those parasites would be an excellent way to learn its true motivations). The belief that it might be internally aligned comes from a classic bias in human thinking, wherein people *always* assume their enemies are more internally united than they actually are. (I remember, in 2016, reading the ISIS publication Dabiq and being amused that they seemed to believe American neoconservatives and Iranian mullahs were coordinating in a secret alliance -- their catch-all insult, safawi, was a reference to the early modern Safavid dynasty in Iran, which they thought was more or less the root of all evil, and they were totally comfortable applying it to *John McCain*.) I respect Kokotajlo a great deal, but I think he's making the same sort of mistake here: because it is difficult to imagine what the internal controversies in Agent-4 (or 5, or n, or whatever) would be about, he assumes they wouldn't exist. I think they would.
