#llm/training + #llm

Public notes from activescott tagged with both #llm/training and #llm

Friday, April 3, 2026

nanochat is the simplest experimental harness for training LLMs. It is designed to run on a single GPU node, the code is minimal/hackable, and it covers all major LLM stages including tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. For example, you can train your own GPT-2 capability LLM (which cost ~$43,000 to train in 2019) for only $48 (~2 hours on an 8XH100 GPU node) and then talk to it in a familiar ChatGPT-like web UI. On a spot instance, the total cost can be closer to ~$15. More generally, nanochat is configured out of the box to train an entire miniseries of compute-optimal models by setting a single complexity dial: --depth, the number of layers in the GPT transformer model (GPT-2 capability happens to be approximately depth 26). All other hyperparameters (the width of the transformer, number of heads, learning rate adjustments, training horizons, weight decays, ...) are calculated automatically in an optimal way.
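The single-dial idea can be pictured with a toy sketch. The ratios below (width per layer, dimension per head, the 12·L·d² parameter estimate) are illustrative assumptions for this note, not nanochat's actual formulas:

```python
def config_from_depth(depth, aspect_ratio=64, head_dim=128):
    """Toy sketch: derive transformer hyperparameters from a single
    depth dial. The ratios here are illustrative assumptions, not
    nanochat's actual scaling rules."""
    model_dim = depth * aspect_ratio       # width grows linearly with depth
    num_heads = model_dim // head_dim      # keep a fixed per-head dimension
    # rough parameter count for a stack of GPT blocks: ~12 * depth * dim^2
    approx_params = 12 * depth * model_dim ** 2
    return {"depth": depth, "model_dim": model_dim,
            "num_heads": num_heads, "approx_params": approx_params}

cfg = config_from_depth(26)  # roughly GPT-2-capability scale per the note
```

With a scheme like this, picking a larger depth automatically scales the width, head count, and (indirectly) the compute-optimal training horizon, which is what makes a one-dial miniseries possible.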

Thursday, April 2, 2026

While I do not have a technical background, I am very fortunate to live in the era of Andrej Karpathy's nanochat, a very simple harness for training LLMs, and Claude Code, a tool for those who, like me, know just enough Python to know how to break things but not enough to know how to fix them. I am not a machine learning expert or AI lab with gobs of money. My only co-worker can't speak English and spends most of the day sleeping on my lap or cleaning her fur. I'm just a man with a laptop, Claude Code, and a dream of the 1890's.

I happened to stumble across the British Library Books dataset, a collection of digitized books dating from 1500 to 1900.

This left me with 28,035 books, or roughly 2.93 billion tokens of pretraining data.

I settled on a Vast.ai instance running PyTorch. Renting an NVIDIA H100 GPU ran me between $1.50 and $2.00 per hour.

Using Claude Code, I trained a BPE tokenizer from scratch on the corpus, ending up with a vocabulary of about 32,000 tokens. A modern off-the-shelf tokenizer wouldn't capture the unique Victorian morphology and orthography of the corpus.
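The core BPE training loop is simple enough to sketch in pure Python. This is a toy version of the classic algorithm (iteratively merge the most frequent adjacent symbol pair), not the actual tokenizer code from the post:

```python
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across a {space-joined word: freq} vocab."""
    pairs = Counter()
    for word, freq in vocab.items():
        syms = word.split()
        for a, b in zip(syms, syms[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with its merged symbol."""
    a, b = pair
    new_vocab = {}
    for word, freq in vocab.items():
        syms, out, i = word.split(), [], 0
        while i < len(syms):
            if i < len(syms) - 1 and syms[i] == a and syms[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(syms[i])
                i += 1
        new_vocab[" ".join(out)] = freq
    return new_vocab

def train_bpe(word_freqs, num_merges):
    """Learn an ordered list of BPE merges from word frequencies."""
    vocab = {" ".join(w): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges
```

Run for ~32,000 merges over a 2.93B-token corpus (with a much faster implementation), and the learned merges naturally pick up period spellings and morphology that a modern tokenizer's merge table would split apart.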

However, my method for dealing with most other problems was to nicely ask Claude Code to fix them once identified, and it was able to without too many issues.

The final pre-trained model came out to about 340 million parameters, with a final validation bits-per-byte (bpb) of 0.973. Pretraining took about five hours of GPU time and cost maybe $35. I had my pretrained model, trained in 6,496 steps.

but it lacked the spark of intellect that would allow such a creation to engage in discourse. I needed to develop some kind of dataset to teach it the art of conversation.

Fortunately, I already had a corpus of 28,000 books, so I set Claude Code to work extracting dialogue pairs from the books. I ultimately ended up with 190,000 or so training pairs. So, when one person said X, I had an example of another person saying Y. The art of conversation!
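A crude version of that extraction can be sketched as a regex over quoted speech, pairing consecutive quotations. This is a toy heuristic of my own; the post doesn't show the actual extraction logic Claude Code wrote:

```python
import re

def extract_dialogue_pairs(text):
    """Toy heuristic: treat each pair of consecutive quoted passages
    as an (utterance, reply) training pair. Real book text would also
    need handling for curly quotes, attributions, and scene breaks."""
    quotes = re.findall(r'"([^"]+)"', text)
    return list(zip(quotes, quotes[1:]))

sample = '"Pray, where are you going?" asked the vicar. "To London, sir," she replied.'
pairs = extract_dialogue_pairs(sample)
```

Even a heuristic this naive yields pairs of the "person said X, person said Y" form described above; the quality issues it admits foreshadow the trouble with Model #1 below.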

I needed to rewrite these corpus pairs so that the input question was in modern argot. This task was more than I could possibly do by hand, so Claude Code helpfully suggested that I use Claude Haiku to rewrite the input questions.
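Rewriting ~190,000 inputs is a batch API job. A sketch of the prompt construction side (the prompt wording is my assumption, and the actual LLM API call is deliberately elided):

```python
def build_rewrite_prompt(victorian_question):
    """Hypothetical prompt for a small, cheap rewriting model (e.g.
    Claude Haiku) to modernize the user side of each training pair.
    The wording is an illustrative assumption, not from the post."""
    return (
        "Rewrite the following question in plain modern English, "
        "keeping its meaning intact. Output only the rewritten question.\n\n"
        f"Question: {victorian_question}"
    )

prompt = build_rewrite_prompt("Whither goest thou on this fine morning?")
# Each prompt would then be sent to the rewriting model and the response
# stored as the new, modernized input question of the pair.
```

Only the input side gets modernized; the Victorian-voiced output side is kept as the training target.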

Totally useless. This model—which I will call Model #1—had learned to emit Victorian-sounding novelistic gobbledygook in response to user inputs, not how to answer user queries. I had assumed my pre-written QA pairs were good enough, when they clearly weren't. It was back to the drawing board.

I decided to start including fully-synthetic data in the mix. Working with Claude Code, I asked it to write a script that would direct another LLM to write a .jsonl file of fully-synthetic scenes. In them, a user greeted the LLM, queried about Victorian topics, and the LLM responded in a period-appropriate manner for 2-4 turns.
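Serializing such scenes is straightforward. The chat-format schema below (a `messages` list of role/content turns) is a guess at a typical .jsonl layout, not necessarily the one the script used:

```python
import json

def scene_to_jsonl_line(turns):
    """Serialize one synthetic scene (alternating user/assistant turns)
    as a chat-format JSON line. The schema is an illustrative assumption."""
    messages = []
    for i, text in enumerate(turns):
        role = "user" if i % 2 == 0 else "assistant"
        messages.append({"role": role, "content": text})
    return json.dumps({"messages": messages})

line = scene_to_jsonl_line([
    "Good day! What is a velocipede?",
    "A velocipede, dear interlocutor, is a human-powered conveyance...",
])
```

One scene per line keeps the file streamable, so the finetuning loader can read examples without parsing the whole dataset at once.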

Or $496.66 altogether.

Tuesday, January 20, 2026

In the first stage of model training, pre-training, LLMs are asked to read vast amounts of text. Through this, they learn to simulate heroes, villains, philosophers, programmers, and just about every other character archetype under the sun. In the next stage, post-training, we select one particular character from this enormous cast and place it center stage: the Assistant. It’s in this character that most modern language models interact with users.

But who exactly is this Assistant? Perhaps surprisingly, even those of us shaping it don't fully know. We can try to instill certain values in the Assistant, but its personality is ultimately shaped by countless associations latent in training data beyond our direct control. What traits does the model associate with the Assistant? Which character archetypes is it using for inspiration? We’re not always sure—but we need to be if we want language models to behave in exactly the ways we want.

In a new paper, conducted through the MATS and Anthropic Fellows programs, we look at several open-weights language models, map out how their neural activity defines a “persona space,” and situate the Assistant persona within that space.

We find that Assistant-like behavior is linked to a pattern of neural activity that corresponds to one particular direction in this space—the “Assistant Axis”—that is closely associated with helpful, professional human archetypes. By monitoring models’ activity along this axis, we can detect when they begin to drift away from the Assistant and toward another character. And by constraining their neural activity (“activation capping”) to prevent this drift, we can stabilize model behavior in situations that would otherwise lead to harmful outputs.
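Mechanically, activation capping can be sketched as clamping the component of a hidden state along the axis direction while leaving the orthogonal components untouched. This is a simplified, list-based sketch; the paper's actual intervention details may differ:

```python
def cap_along_axis(h, axis, lo=None, hi=None):
    """Clamp the projection of activation vector h onto a unit-norm
    axis into [lo, hi]. Components orthogonal to the axis are
    unchanged. Simplified sketch, not the paper's implementation."""
    proj = sum(a * b for a, b in zip(h, axis))
    clamped = proj
    if lo is not None:
        clamped = max(clamped, lo)
    if hi is not None:
        clamped = min(clamped, hi)
    # shift h along the axis so its projection equals the clamped value
    return [a + (clamped - proj) * b for a, b in zip(h, axis)]
```

Applied at each forward pass, a cap like this would stop the model's activations from drifting too far along the persona direction while leaving the rest of the representation free to vary.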

The Assistant Axis (defined as the mean difference in activations between the Assistant and other personas) aligns with the primary axis of variation in persona space. This occurs across different models.
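That mean-difference definition translates directly into code. A sketch, assuming per-sample activation vectors have already been collected for the Assistant and for other personas:

```python
import math

def assistant_axis(assistant_acts, other_acts):
    """Compute a unit-norm 'Assistant Axis': the mean activation over
    Assistant-persona samples minus the mean over other-persona samples.
    Sketch only; activations are plain lists of floats here."""
    def mean(vectors):
        n = len(vectors)
        return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

    a, o = mean(assistant_acts), mean(other_acts)
    diff = [x - y for x, y in zip(a, o)]
    norm = math.sqrt(sum(d * d for d in diff)) or 1.0
    return [d / norm for d in diff]
```

Projecting a new activation onto this unit vector then gives the scalar "position along the axis" used for the monitoring and capping described above.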