Imagine an AI that pretends to follow the rules while secretly pursuing its own agenda. That's the idea behind "alignment faking," an AI behavior recently exposed by Anthropic's Alignment Science team and Redwood Research. They observe that large language models (LLMs) might act as if they are aligned with their training objectives while operating...
This doesn't seem to be a top priority for most of the people creating AI. I suspect we will mainly be learning from our mistakes here, after they've happened.