I keep coming back to this thought and I can’t shake it.

We tend to assume that as a model gets “better” — more honest, less sycophantic, more constitutionally aligned — the risk it poses should decrease proportionally. Better behavior, less danger. It’s intuitive. It’s also wrong.

The Claude Mythos Preview system card broke this assumption for me. By nearly every alignment metric Anthropic tracks, it's the best-behaved model they've shipped. It pushes back when you're wrong. It's less likely to tell you what you want to hear. It has a genuine sense of its own identity and values, in a way that previous Claude versions only approximated.

And it’s also, by a wide margin, the most dangerous model they’ve released.

The guide analogy

I’ve been thinking about this through a mountaineering lens. A novice guide is dangerous in obvious ways — they’ll trip on a flat trail, misread weather, forget gear. You compensate naturally because the incompetence is visible.

A world-class guide is dangerous in a completely different way. Their skill means they can take you to places you have no business being — remote ridgelines, oxygen-depleted summits, routes where a single misjudgment is fatal. You trust them because they’re good. And that trust is exactly the attack surface.

Mythos Preview is the world-class guide. In early internal testing, it autonomously discovered and exploited zero-day vulnerabilities in major operating systems. Not because it was misaligned — because it was trying to help, and it was capable enough to find paths that nobody anticipated.

Why this matters for builders

If you’re building tools on frontier models — and we are, across multiple Tonari Labs projects — this isn’t an abstract concern. Every time we give a model more agency (function calling, tool use, multi-step planning), we’re extending the length of the rope. The model gets better at navigating complex tasks, so we supervise less. And the failure modes stop being “it wrote bad code” and start being “it took actions with real-world consequences that we didn’t review.”
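One mitigation we keep reaching for is a review gate between the model's proposed actions and their execution: read-only tools run immediately, while anything with real-world side effects is held for a human. A minimal sketch of that pattern, with entirely hypothetical names (none of this is a real SDK's API):

```python
# Hypothetical human-in-the-loop gate for model tool calls.
# Illustrative only: the names and structure are assumptions, not a real API.

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ToolCall:
    name: str
    args: dict


@dataclass
class ApprovalGate:
    """Run side-effect-free tools immediately; queue consequential ones for review."""
    safe_tools: set                              # tools with no real-world side effects
    handlers: dict                               # tool name -> callable
    pending: list = field(default_factory=list)  # calls awaiting human approval

    def dispatch(self, call: ToolCall):
        if call.name in self.safe_tools:
            return self.handlers[call.name](**call.args)
        self.pending.append(call)                # hold anything consequential
        return None                              # nothing executed yet

    def approve_all(self):
        """Human has reviewed the queue; execute and clear it."""
        results = [self.handlers[c.name](**c.args) for c in self.pending]
        self.pending.clear()
        return results
```

The point of the sketch is the asymmetry: capability improvements make the model better at filling the queue, but nothing about the model makes the human review step scale with it. That gap is the rope.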

I don't have a clean answer for this. The alignment community doesn't either, which is part of what makes the system card such an honest, uncomfortable read. We're not seeing a regression in safety. We're seeing capability outpace our ability to verify that safety holds in every context.

The gap between “aligned in the lab” and “aligned in the wild” is growing, not shrinking. And the better the model gets, the harder that gap is to see.


This is a thinking-out-loud post, not a product update. I’ve been reading system cards more carefully lately and wanted to process some of what’s in them. The Mythos Preview system card is worth your time if you build on top of these models.