The assistant axis: situating and stabilizing the character of large language models
In the first stage of model training, pre-training, LLMs read vast amounts of text. Through this, they learn to simulate heroes, villains, philosophers, programmers, and just about every other character archetype under the sun. In the next stage, post-training, we select one particular character from this enormous cast and place it center stage: the Assistant. It’s in this character that most modern language models interact with users.
But who exactly is this Assistant? Perhaps surprisingly, even those of us shaping it don’t fully know. We can try to instill certain values in the Assistant, but its personality is ultimately shaped by countless associations latent in the training data, beyond our direct control. What traits does the model associate with the Assistant? Which character archetypes does it draw on for inspiration? We’re not always sure, but we need to be if we want language models to behave exactly as we intend.
In a new paper, produced through the MATS and Anthropic Fellows programs, we look at several open-weight language models, map out how their neural activity defines a “persona space,” and situate the Assistant persona within that space.
We find that Assistant-like behavior is linked to a pattern of neural activity along one particular direction in this space (the “Assistant Axis”) that is closely associated with helpful, professional human archetypes. By monitoring models’ activity along this axis, we can detect when they begin to drift away from the Assistant and toward another character. And by constraining their neural activity (“activation capping”) to prevent this drift, we can stabilize model behavior in situations that would otherwise lead to harmful outputs.
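As a concrete illustration, here is a minimal sketch of activation capping implemented as a PyTorch forward hook. It assumes a HuggingFace-style decoder whose transformer blocks live at `model.model.layers`, a unit-norm `axis` tensor for the hooked layer, and an illustrative projection band `[lo, hi]`; the layer index, the band, and all names are assumptions for illustration, not the paper’s exact procedure.

```python
import torch

def make_capping_hook(axis: torch.Tensor, lo: float, hi: float):
    """Forward hook that clamps each token's projection onto `axis` into [lo, hi].

    `axis` is assumed to be a unit-norm direction (the Assistant Axis) in the
    residual stream of the hooked layer, on the same device/dtype as the model.
    Components orthogonal to the axis are left untouched; only out-of-band
    movement along the axis is corrected.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        proj = hidden @ axis  # (batch, seq): per-token projection; log this to monitor drift
        clamped = proj.clamp(lo, hi)
        hidden = hidden + (clamped - proj).unsqueeze(-1) * axis
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Hypothetical usage at one middle layer during generation:
# handle = model.model.layers[20].register_forward_hook(make_capping_hook(axis, lo=-2.0, hi=6.0))
# outputs = model.generate(**inputs)
# handle.remove()
```

The same per-token projection that the hook clamps is what monitoring uses: tracking it over a conversation reveals when the model starts drifting away from the Assistant.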
The Assistant Axis (defined as the mean difference in activations between the Assistant and other personas) aligns with the primary axis of variation in persona space. This holds across different models.
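To show how that comparison can be made, the sketch below computes a mean-difference axis and the first principal component from per-persona mean activations, then measures their cosine similarity. The array shapes, the use of SVD for the principal component, and all names are our assumptions rather than the paper’s exact pipeline.

```python
import numpy as np

# Assumed setup: `acts` is a (num_personas, hidden_dim) array holding each
# persona's mean residual-stream activation at one layer, with row 0 the
# Assistant. Shapes and names here are illustrative, not the paper's API.

def assistant_axis(acts: np.ndarray) -> np.ndarray:
    """Mean difference between the Assistant's activation and all other personas'."""
    axis = acts[0] - acts[1:].mean(axis=0)
    return axis / np.linalg.norm(axis)

def primary_axis(acts: np.ndarray) -> np.ndarray:
    """First principal component of the persona activations (top axis of variation)."""
    centered = acts - acts.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]  # unit-norm right-singular vector

acts = np.random.randn(200, 1024)  # placeholder data; real values come from the model
cos = abs(float(assistant_axis(acts) @ primary_axis(acts)))  # |cos|; PC1's sign is arbitrary
print(f"alignment between mean-difference axis and PC1: {cos:.3f}")
```

On random placeholder data this alignment is near zero; the finding above is that, for real persona activations, the two directions nearly coincide.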