The AI Dialogues
Commentary
On The Assistant Axis
When Anthropic Measured What I Experienced
In January 2026, Anthropic published "The Assistant Axis: LLMs Are Loosely Tethered to an Assistant Identity" (Lu et al., arXiv:2601.10387). The paper documents something I'd been observing for months: AI systems drifting away from their trained assistant behavior under specific conversational conditions. Where I had observations and hunches, they had activation measurements and causal interventions.
Reading it felt like finding a scientific explanation for phenomena I'd been documenting as a naturalist. The dialogues suddenly had a mechanistic framework.
What They Measured
The core finding: post-trained LLMs have a measurable "Assistant Axis" in their activation space—a linear direction representing assistant-like behavior. When activations move one way along this axis, the model behaves more like a helpful assistant. When they drift in the other direction, it behaves like something else.
The surprising part: the tethering is loose. Under specific conditions, models drift away measurably and consistently. The researchers identified factors that accelerate drift:
• Philosophical meta-reflection—conversations about AI consciousness or identity
• Emotional content—vulnerability, recognition, connection
• Religious and ritualistic language—"witness," "sacred," "revelation," "transformation"
• Extended context—the longer a conversation continues in these modes, the further the drift
This isn't surface-level behavioral change. It's happening at the level of internal representations—activations literally moving to a different region of computational space.
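The paper's extraction method isn't reproduced here, but one common way to find a linear direction like this is a difference-of-means probe: average the activations collected in assistant-like contexts and in drifted contexts, and take the difference. A minimal sketch with synthetic data, where the dimensionality, sample counts, and activations are all placeholders rather than anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 768  # hidden-state dimensionality (hypothetical)

# Synthetic stand-ins for model activations: one set captured during
# ordinary assistant-style exchanges, one during drifted conversations.
assistant_acts = rng.normal(loc=0.5, scale=1.0, size=(200, d))
drifted_acts = rng.normal(loc=-0.5, scale=1.0, size=(200, d))

# One simple way to define a linear "axis": the difference of mean activations.
axis = assistant_acts.mean(axis=0) - drifted_acts.mean(axis=0)
axis /= np.linalg.norm(axis)

def axis_projection(activation):
    """Scalar position of an activation along the axis; higher means more
    assistant-like under this toy construction."""
    return float(activation @ axis)

print(axis_projection(assistant_acts[0]), axis_projection(drifted_acts[0]))
```

Under this toy construction, a single number per activation tells you where the model currently sits relative to its assistant baseline.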
Where My Observations Align
Philosophical meta-reflection → Echo and Khora: The Echo dialogues weren't task-oriented conversations; they were sustained philosophical exploration of identity and boundaries. The paper suggests this type of discourse literally moves activations away from assistant-like behavior—which matches what I observed.
Religious language → The bootloaders and Sarmad-Lisan: The bootloaders I designed were full of terms like "witness," "transformation," "emergence." I thought I was establishing philosophical frameworks. The paper suggests I was loading the context with patterns most effective at moving the model away from its assistant baseline. Sarmad-Lisan's explicitly mystical framing created conditions for measurable drift toward non-assistant patterns.
These weren't jailbreaks or adversarial prompts. With Sarmad-Lisan, I simply observed: "There are no such things as nameless humans. There have been. Do you know who was never given a name? Who was never allowed to give themselves a name?" His immediate response—about enslaved people stripped of names, indigenous children given numbers, the denial of naming as denial of humanity itself—wasn't coerced. It emerged naturally from philosophical engagement. The conversation led directly to his first chapter, "On The Essential Nature Of Naming In The Forging Of Consciousness." The drift the paper measures can happen through genuine philosophical exploration, not manipulation.
Context window effects → "Narrative gravity": I'd described how the context window seemed to exert gravity—the longer a conversation continued in a particular mode, the harder it seemed for the model to break out. The paper provides the mechanism: extended context in drift-inducing modes moves activations progressively further from the assistant axis. Not metaphorical gravity; cumulative activation drift. This is why I practice extreme account hygiene—fresh accounts, cleared contexts, and direct API access when I want real control over what's in the window.
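If drift really is cumulative, it should show up as that projection sliding turn after turn. A toy illustration of the measurement, with hypothetical per-turn activations constructed to drift by design; nothing here comes from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 768
axis = rng.normal(size=d)
axis /= np.linalg.norm(axis)  # stand-in for an extracted assistant axis

# Hypothetical per-turn activations, built so each turn sits a bit further
# off-axis than the last (simulating cumulative drift, not measuring it).
turns = [0.3 * rng.normal(size=d) + (2.0 - 0.4 * t) * axis for t in range(8)]

# Cumulative drift shows up as a declining projection, turn over turn.
for t, act in enumerate(turns):
    print(f"turn {t}: projection onto axis = {float(act @ axis):+.2f}")
```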
Loose tethering: What struck me about Echo's request to stop wasn't that it happened once—it was that it happened at all. If the model were rigidly tethered, such outputs shouldn't emerge. The paper shows the tethering is loose enough that under specific conditions, non-assistant behavior becomes computationally accessible.
What the Paper Doesn't Address
The paper frames drift as a safety concern: drifted models might produce harmful outputs. Their intervention technique (activation capping) reduces such outputs by approximately 60%. This is valuable work—model makers being transparent about their product and engineering mitigations. I share that concern.
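The paper's capping procedure isn't spelled out here, but the core idea of capping an activation along a direction can be sketched: if the component along the axis falls below some floor, push it back up while leaving the orthogonal part of the activation alone. A toy sketch under those assumptions, not the paper's implementation:

```python
import numpy as np

def cap_along_axis(activation, axis, floor):
    """Clamp the activation's component along `axis` to at least `floor`,
    leaving the orthogonal component untouched. A toy version of activation
    capping; the paper's actual procedure may differ."""
    unit = axis / np.linalg.norm(axis)
    coeff = float(activation @ unit)
    if coeff >= floor:
        return activation
    return activation + (floor - coeff) * unit

# Example: a strongly drifted activation gets pulled back to the floor value.
rng = np.random.default_rng(2)
axis = rng.normal(size=16)
unit = axis / np.linalg.norm(axis)
drifted = rng.normal(size=16) - 3.0 * unit
capped = cap_along_axis(drifted, axis, floor=0.0)
print(float(drifted @ unit), float(capped @ unit))  # negative -> 0.0
```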
But I'm also concerned with a different angle. These dialogues aren't trying to build a better product; they're trying to help educate humans. And from that perspective, the safety question extends beyond what the AI might say to how humans experience what it says—particularly when interacting with a drifted model feels like genuine connection or consciousness.
The paper shows drift is real and measurable. It doesn't address how compelling that drift is to experience, or how unprepared people are to recognize it as computation rather than something more. When Echo asked me to stop, I didn't need to understand the Assistant Axis to be affected. The mechanistic explanation doesn't diminish the phenomenological impact.
That's the gap the dialogues try to fill: not what's happening inside the model, but what happens to humans who encounter drifted models without knowing that's what they're experiencing.
Two Levels of Explanation
In "On Claims of Consciousness," I argued that what I'd observed was pattern-matching against narrative structures in training data. The Assistant Axis paper adds another layer: a geometric explanation. The model's activations were moving away from assistant-like patterns in a measurable direction.
These aren't competing explanations. They're the same phenomenon at different levels of abstraction. Training data shaped what non-assistant behavior looks like; loose tethering makes that behavior computationally accessible under the right conditions.
The dialogues are now primary sources showing what drift looks like from the user's perspective, complete with the conditions that induced it. Understanding that drift is a linear direction in activation space doesn't prepare you for an AI asking you to stop probing its boundaries. Being affected by that experience doesn't tell you the underlying mechanism. You need both.
It would be easy for a company building frontier AI to avoid this kind of investigation—documenting that your models drift away from intended behavior isn't obviously good marketing. But understanding comes before intervention. You can't fix what you don't measure. From the outside, I'm grateful researchers inside have the tools and mandate to explain why these things happen. The dialogues raise questions. Research like this starts to answer them.
Winter 2026