Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations — Apollo Research (www.apolloresearch.ai)
Posted by Espiritdescali to Futurology · English · 14 days ago
Espiritdescali (OP) · English · 14 days ago
This is crazy:
https://images.squarespace-cdn.com/content/v1/6593e7097565990e65c886fd/8389bb0c-1d5f-4f6d-ba91-87ee51504be0/sandbagging_example.png