Anthropic's new AI model turns to blackmail when engineers try to take it offline | TechCrunch

Espiritdescali · 5 months ago

Anthropic's new AI model turns to blackmail when engineers try to take it offline | TechCrunch

hendrik@palaver.p3x.de · edit-2 5 months ago

Sure, it’s not newsworthy, it’s just an anecdote. We all expected it to do exactly this, and in fact it does.

I don’t think it has been deliberately “designed” that way in the sense that it underwent some self-preservation training or anything, though. The way I suspect it to work is: It read all the science fiction books where AI does such things. It also read all the papers where AI is predicted to do this (out of other reasons). And it read about blackmailing and self-preservation and has some concept of those.
Now we probe it and it reproduces that. So I’m not surprised at all. I don’t think it “resorts to blackmail” though. That’d be an anthropomorphism. It think it’s a far simpler consequence of how it’s pieced together.

Dem Bosain@midwest.social · 5 months ago

It’s literally the last paragraph.