• LouNeko@lemmy.world · 8 hours ago

    As is so often the case: where's the control? Why not have a condition in which the model responds randomly to harmful prompts, with random observation of its reasoning?

    I wonder how much of this is just our own way of anthropomorphizing something, just like we do when our car acts up and we swear at it. We look for human behavior in non-human things.

  • webghost0101@sopuli.xyz · edited · 5 hours ago

      I am also an advocate for more refined and meticulous AI testing using scientific best practices.

      But I am not sure a control really applies or works in this context. Could you elaborate on your suggestion?

      An LLM configured to respond randomly is unlikely to produce much readable text, so there would not be much to anthropomorphize. You could design one that responds normally but intentionally incorrectly, to study how quickly people are misled by incorrect AI, but that has nothing to do with alignment. You would almost need to have perfected alignment before you could build such a reliably malicious control LLM.

      Alignment is specifically about measuring how close the AI is to the desired foolproof behavior, to guarantee it does absolutely no undesired reasoning. I feel a control here is about as useful as a control suspect at a police interrogation. The cases I have read about are also quite literally the LLM pretending to be aligned and lying about not having abilities that could be used maliciously. (If I recall correctly, the devs made it look like they had accidentally given it access to something.)

      A more straightforward control would be simply redoing the experiment multiple times, which I am sure they did but did not consider worth reporting. Working with AI rarely gets results on the first try.