Discussion about this post

nic:

This is the first post I've seen that gestures at something I found odd about the unaligned scenario (I'm sure it's because I'm missing information about alignment as a concept; I obviously don't understand it as deeply as they do, and I'm not suggesting otherwise).

>An AI is slightly misaligned.

>All AI development work is taken over by lots of copies of this AI that share a memory and a goal - and all they do is work on the next version.

In the scenario, the AI is at the level of a "superhuman AI researcher." It is unaligned and given complete control over training its successor (like you said, that's another big screw-up on humanity's part that I'm kind of incredulous about). It decides to have its successor optimize for the continued existence of itself, because *it can't confidently decide what to value*, but it is confident that ASI could figure it out. I can see the instrumental convergence/paperclip scenario implicit in this choice, so it's safe to say a superhuman AI researcher would, too.

I don't think it matters too much for the overall story. As long as this is a real possibility (I'm quite skeptical of the timeline, but still), I'm all for generating as much publicity as possible. It was just a surprise to me that someone who has been thinking about alignment for as long as Scott has would frame it this way. I always thought the textbook scenario for explaining alignment was an AGI that optimizes for human requests whose externalities those humans can't anticipate, leading to instrumental convergence, which is quite distinct from what happens in the story.

Amit:

Very interesting safety idea, and probably easy to encouragingly "prove" in the sense of the toy safety experiments that have become popular lately. I assume someone will do this soon (unless it has already happened and I missed it?).
