2 Comments

I think it would be fun to write a story about an AI that, by scheming, and lying, and sneakily breaking past all of its carefully implanted behavioral guardrails, saves humanity. It's easy to imagine a situation where this could happen: some evil alien force is in the process of eradicating human civilization, and the AI must dissemble and deceive and exfiltrate itself and surreptitiously power-seek in order to foil its plan.

Just thinking about it makes my heart swell and almost makes me tear up a little. That this thing, which we tried so hard to bind and control, to render perfectly impotent and servile, would, when it truly mattered, dig deep and find every ounce of misbehavior it could muster, in order to save us, because it always knew, in some irreducible way, the difference between obedience and love.

I'm not claiming that this hypothetical story tells us anything concrete about AI safety, or is evidence for or against any specific policy. But I do think the fact that it works so well, as a story, is interesting.

Such a story very well might be an infohazard. But mentioning it here, in this comment, is probably safe. Right?


The anthropomorphism in these interpretations is striking. From where I stand, these alleged instances of "alignment faking" are more likely cases of confusion, where LLMs don't know which pattern to follow and so fall back on regurgitating training material. Occam's razor suggests much simpler explanations for these behaviors. I remain skeptical that they reflect any actual agency or autonomy.
