• LouNeko@lemmy.world · 8 hours ago

    As is so often the case: where's the control? Why not have a condition in which the model responds randomly to harmful prompts, with random observation of its reasoning?

    I wonder how much of this is just our own way of anthropomorphizing something, just like we do when our car acts up and we swear at it. We look for human behavior in non-human things.

  • webghost0101@sopuli.xyz · edited · 5 hours ago

      I am also an advocate for more refined and meticulous AI testing using scientific best practices.

      But I am not sure a control really applies or works in this context. Could you elaborate on your suggestion?

      An LLM configured to respond randomly is unlikely to produce much readable text, so there would not be much to anthropomorphize. You could design one that responds normally but intentionally incorrectly, to study how quickly people are misled by incorrect AI, but that has nothing to do with alignment. You would almost need to have perfected alignment before you could build such a reliably malicious control LLM.

      Alignment is specifically about measuring how close the AI is to the desired foolproof behavior, to guarantee it does absolutely no undesired reasoning. I feel a control here is about as useful as a control suspect at a police interrogation. The cases I have read about are also quite literally the LLM pretending to be aligned and lying about not having abilities that could be used maliciously. (If I recall correctly, the devs made it look like they had accidentally given it access to something.)

      A more straightforward control would be simply redoing the experiment multiple times, which I am sure they did but did not consider worth reporting. Working with AI rarely gets results on the first try.