activescott's Notes

Public notes from activescott

Friday, April 3, 2026

karpathy/nanochat at estragon.news

github.com/karpathy/nanochat?ref=estragon.news

nanochat is the simplest experimental harness for training LLMs. It is designed to run on a single GPU node, the code is minimal/hackable, and it covers all major LLM stages including tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. For example, you can train your own GPT-2 capability LLM (which cost ~$43,000 to train in 2019) for only $48 (~2 hours of 8XH100 GPU node) and then talk to it in a familiar ChatGPT-like web UI. On a spot instance, the total cost can be closer to ~$15. More generally, nanochat is configured out of the box to train an entire miniseries of compute-optimal models by setting one single complexity dial: --depth, the number of layers in the GPT transformer model (GPT-2 capability happens to be approximately depth 26). All other hyperparameters (the width of the transformer, number of heads, learning rate adjustments, training horizons, weight decays, ...) are calculated automatically in an optimal way.
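The cost figures in the excerpt above imply an hourly rate, which is easy to check. This is only arithmetic on the quoted numbers (about $48, about 2 hours, 8 GPUs, about $43,000 in 2019); the variable names are my own.

```python
# Figures quoted in the nanochat excerpt above (all approximate).
cost_2019_usd = 43_000   # GPT-2 training cost in 2019
cost_now_usd = 48        # on-demand 8xH100 node
hours = 2                # approximate wall-clock training time
gpus = 8                 # GPUs in the node

node_rate = cost_now_usd / hours          # USD per node-hour
gpu_rate = node_rate / gpus               # USD per GPU-hour
reduction = cost_2019_usd / cost_now_usd  # cost reduction factor

print(f"{node_rate:.2f} USD/node-hour, {gpu_rate:.2f} USD/GPU-hour")
print(f"roughly {reduction:.0f}x cheaper than in 2019")
```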

#12:12 AM

llm code llm/training karpathy

Thursday, April 2, 2026

Define success criteria and build evaluations - Claude API Docs

platform.claude.com/docs/en/test-and-evaluate/develop-tests

#8:54 PM

llm evaluations code testing qa

Backroads Active Adventure Travel: Bike Tours, Walking Trips, Hiking Vacations

www.backroads.com/

#5:48 AM

vacation

Mr. Chatterbox - a Hugging Face Space by tventurella

huggingface.co/spaces/tventurella/mr_chatterbox?ref=estragon.news

#5:47 AM

self-hosted llm/training code llm

Mr. Chatterbox, or, The Modern Prometheus

www.estragon.news/mr-chatterbox-or-the-modern-prometheus/

While I do not have a technical background, I am very fortunate to live in the era of Andrej Karpathy's nanochat, a very simple harness for training LLMs, and Claude Code, a tool for those who, like me, know just enough Python to know how to break things but not enough to know how to fix them. I am not a machine learning expert or AI lab with gobs of money. My only co-worker can't speak English and spends most of the day sleeping on my lap or cleaning her fur. I'm just a man with a laptop, Claude Code, and a dream of the 1890's.

happened to stumble across the British Library Books dataset, a dataset of digitized books dating from between 1500 and 1900

This left me with 28,035 books, or roughly 2.93 billion tokens for pretraining data

I settled on using a Vast.ai instance that used PyTorch. Renting an NVIDIA H100 GPU ran me between $1.50 and $2.00 per hour.

Using Claude Code, I trained a BPE tokenizer from scratch on the corpus, ending up with a vocabulary of about 32,000 words. Using a modern tokenizer wouldn't capture the unique Victorian morphology and orthography of the corpus.
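For my own reference, a minimal sketch of what training a BPE tokenizer means. This is a toy character-level version and not the tokenizer code used in the post or in nanochat; `train_bpe` and the sample text are hypothetical.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn up to num_merges merge rules from whitespace-split words."""
    # Each word starts out as a tuple of single characters.
    words = Counter(tuple(w) for w in text.split())
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite every word with the chosen pair fused into one symbol.
        merged = Counter()
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        words = merged
    return merges

# Hypothetical sample text, not taken from the corpus.
merges = train_bpe("thee thee thou thou thy the the", num_merges=3)
print(merges[:2])
```

The learned merges depend entirely on the training text, which is presumably why a tokenizer trained on Victorian books would split period spellings differently from a modern one.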

However, my method for dealing with most other problems was to nicely ask Claude Code to fix them once identified, and it was able to without too many issues.

the final pre-trained model came out to about 340 million parameters, and had a final validation bpb of 0.973. The pretraining process took about five hours on-chip, and cost maybe $35. I had my pretrained model, trained in 6496 steps
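A note on "bpb" (bits per byte): as I understand the usual definition, it is the total cross-entropy over the tokens, converted from nats to bits, divided by the number of UTF-8 bytes in the text. A sketch under that assumption; nanochat's own evaluation code may differ in detail, and `bits_per_byte` is a name I made up.

```python
import math

def bits_per_byte(token_nll_nats: list[float], text: str) -> float:
    """Total token loss (in nats) converted to bits, per UTF-8 byte of text."""
    total_bits = sum(token_nll_nats) / math.log(2)
    return total_bits / len(text.encode("utf-8"))

# Toy check: two tokens, each costing ln(4) nats (2 bits), over 4 bytes.
example = bits_per_byte([math.log(4), math.log(4)], "abcd")
print(example)
```

Because it normalizes by bytes rather than tokens, the number stays comparable across tokenizers with different vocabularies.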

but it lacked the spark of intellect that would allow such a creation to engage in discourse. I needed to develop some kind of dataset to teach it the art of conversation

Fortunately, I already had a corpus of 28,000 books, so I set Claude Code to work extracting dialogue pairs from the books. I ultimately ended up with 190,000 or so training pairs. So, when one person said X, I had an example of another person saying Y. The art of conversation!
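The post does not show how the dialogue pairs were extracted. One naive way to do it, sketched here as an assumption, is to pull quoted utterances with a regex and pair each one with the next. The `dialogue_pairs` helper and the sample passage are hypothetical.

```python
import re

# Matches text between straight double quotes only; real scans often use
# curly or single quotes, which this sketch ignores.
QUOTE = re.compile(r'"([^"]+)"')

def dialogue_pairs(passage: str) -> list[tuple[str, str]]:
    """Pair each quoted utterance with the one that follows it."""
    utterances = [u.strip() for u in QUOTE.findall(passage)]
    return list(zip(utterances, utterances[1:]))

# Invented sample sentence, not from the British Library corpus.
sample = '"Are you unwell?" asked the doctor. "Only fatigued, sir," she replied.'
pairs = dialogue_pairs(sample)
print(pairs)
```

This pairs consecutive quotes even when the same speaker continues, which may be one reason raw pairs like these make weak question-and-answer data.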

I needed to rewrite these corpus pairs so that the input question was in modern argot. This task was more than I could possibly do by hand, so Claude Code suggested, helpfully, that I use Claude Haiku to rewrite the input questions

Totally useless. This model—which I will call Model #1—had learned to emit Victorian-sounding novelistic gobbledygook in response to user inputs, not how to answer user queries. I had assumed my pre-written QA pairs were good enough, when they clearly weren't. It was back to the drawing board

I decided to start including fully-synthetic data in the mix. Working with Claude Code, I asked it to write a script that would direct another LLM to write a .jsonl file of fully-synthetic scenes. In them, a user greeted the LLM, queried about Victorian topics, and the LLM responded in a period-appropriate manner for 2-4 turns. We

Or $496.66 all together.

#5:28 AM

self-hosted llm/training code llm

Wednesday, April 1, 2026

BEM — Block Element Modifier

getbem.com/

Blocks, Elements and Modifiers

Block: Standalone entity that is meaningful on its own.

Examples: header, container, menu, checkbox, input

Element: A part of a block that has no standalone meaning and is semantically tied to its block.

Examples: menu item, list item, checkbox caption, header title

Modifier: A flag on a block or element. Use them to change appearance or behavior.

Examples: disabled, highlighted, checked, fixed, size big, color yellow
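The excerpt above does not show the class-name syntax. As I recall the convention, elements are joined with two underscores and modifiers with two dashes. A small sketch under that assumption; the `bem` helper is hypothetical.

```python
from typing import Optional

def bem(block: str, element: Optional[str] = None,
        modifier: Optional[str] = None) -> str:
    """Build a class name in the block__element--modifier pattern."""
    name = block
    if element:
        name += f"__{element}"
    if modifier:
        name += f"--{modifier}"
    return name

print(bem("menu"))
print(bem("menu", "item"))
print(bem("menu", "item", "disabled"))
print(bem("checkbox", modifier="checked"))
```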

#11:46 PM

code css

Ways to Watch - NASA

www.nasa.gov/ways-to-watch/

#11:31 PM

nasa video space

ultrabug/uhashring: Full featured consistent hashing python library compatible with ketama

github.com/ultrabug/uhashring

uhashring implements consistent hashing in pure Python.

Consistent hashing is mostly used in distributed systems, caches, and databases because it avoids a total reshuffling of your key-node mappings when you add or remove a node in your ring (called the continuum in libketama). More information and details about this can be found in the literature section.

This full featured implementation offers:

a lot of convenient methods to use your consistent hash ring in real world applications.
simple integration with other libs such as memcache through monkey patching.
a full ketama compatibility if you need to use it (see important mention below).
all the missing functions in the libketama C python binding (which is not even available on pypi) for ketama users.
possibility to use your own weight and hash functions if you don't care about the ketama compatibility.
instance-oriented usage so you can use your consistent hash ring object directly in your code (see advanced usage).
native pypy support, since this is a pure python library.
tests of implementation, key distribution and ketama compatibility.
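To remind myself why consistent hashing avoids a total reshuffle, here is a minimal ring using only the standard library. It is not uhashring's implementation and is not ketama compatible; the `Ring` class, the SHA-256 hash, and the 100 virtual points per node are my own assumptions.

```python
import bisect
import hashlib

class Ring:
    """Toy consistent hash ring with virtual points per node."""

    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self._points = []   # sorted hash positions on the ring
        self._owner = {}    # hash position -> node name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.sha256(value.encode()).hexdigest(), 16)

    def add(self, node: str) -> None:
        for i in range(self.replicas):
            point = self._hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def remove(self, node: str) -> None:
        self._points = [p for p in self._points if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def get_node(self, key: str) -> str:
        # First point clockwise from the key's hash, wrapping at the end.
        i = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[i]]

ring = Ring(["node1", "node2", "node3"])
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.get_node(k) for k in keys}
ring.remove("node3")
after = {k: ring.get_node(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

Only keys that lived on the removed node change owner; every other key keeps its mapping.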

#10:54 PM

code python distributed-systems

Bildbot - Grow, Monetize and Secure your App

www.bildbot.com/

Solo founders and small teams are shipping real products faster than ever.

Launching is the easy part. Once you're live, the hard part starts: growth, monetization, security, stability. That's where most apps stall out or die.

We've scaled products to millions of users and hundreds of millions in revenue. Now we're building the tools we wish we had when we started.

Think of us as the cofounder you wish you had.

#8:32 PM

code marketing business

Guided Colorado River Overnight Rafting | Colorado Adventure Guides

fareharbor.com/embeds/book/cbstadventures/items/605961/calendar/2026/04/?full-items=yes&back=https://coloradoraftingcompany.com/&flow=1331793&g4=yes

#6:24 PM

travel vacation

Escorted Tours, Small Ship and River Cruises and Family Travel | Tauck

www.tauck.com/

#6:22 PM

vacation travel

Argo Workflows | Argo

argoproj.github.io/workflows/

Argo Workflows is an open source container-native workflow engine for orchestrating parallel jobs on Kubernetes.

Define workflows where each step in the workflow is a container. Model multi-step workflows as a sequence of tasks or capture the dependencies between tasks using a graph (DAG). Easily run compute intensive jobs for machine learning or data processing in a fraction of the time using Argo Workflows on Kubernetes. Run CI/CD pipelines natively on Kubernetes without configuring complex software development products.
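The DAG idea can be illustrated without Kubernetes. This sketch is not Argo (Argo workflows are declared as Kubernetes resources); it only uses Python's standard `graphlib` to show how dependencies determine which tasks can run in parallel. The task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dag = {
    "fetch": set(),
    "train-a": {"fetch"},
    "train-b": {"fetch"},
    "compare": {"train-a", "train-b"},
}

sorter = TopologicalSorter(dag)
sorter.prepare()
waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are done
    waves.append(ready)                 # these could run in parallel
    sorter.done(*ready)

print(waves)
```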

#3:00 PM

kubernetes ci/cd

Tuesday, March 31, 2026

Slackdown - Markdown to Slack Converter

slackdown.com/

God bless the person who did this. Slack is so frustrating for having their own markdown.

#11:55 PM

tools markdown

AdguardTeam/AdGuardSDNSFilter: AdGuard DNS filter

github.com/AdguardTeam/AdGuardSDNSFilter

A filter composed of several other filters (AdGuard Base filter, Social media filter, Tracking Protection filter, Mobile Ads filter, EasyList and EasyPrivacy) and simplified specifically to be better compatible with DNS-level ad blocking.

The direct link to the filter: https://adguardteam.github.io/AdGuardSDNSFilter/Filters/filter.txt.

Please note, that to use this filter it is necessary to support basic ad blocking rules syntax. It does not make much sense to extract just the hosts file.
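A sketch of why the rule syntax matters. It handles only three forms that I am confident about: `!` comments, `||domain^` block rules (which also cover subdomains), and `@@||domain^` exceptions. The real syntax has more (modifiers, regular expressions, hosts-style lines), and the helper names and sample rules are hypothetical.

```python
def parse_rules(lines):
    """Collect domains from ||domain^ and @@||domain^ rules only."""
    blocked, allowed = set(), set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # blank line or comment
        is_exception = line.startswith("@@")
        rule = line[2:] if is_exception else line
        if rule.startswith("||") and rule.endswith("^"):
            (allowed if is_exception else blocked).add(rule[2:-1].lower())
    return blocked, allowed

def is_blocked(host, blocked, allowed):
    """A rule for a domain also applies to all of its subdomains."""
    parts = host.lower().rstrip(".").split(".")
    suffixes = {".".join(parts[i:]) for i in range(len(parts))}
    if suffixes & allowed:
        return False  # exception rules win
    return bool(suffixes & blocked)

rules = ["! sample rules", "||ads.example.org^", "@@||good.ads.example.org^"]
blocked, allowed = parse_rules(rules)
print(is_blocked("x.ads.example.org", blocked, allowed))
print(is_blocked("good.ads.example.org", blocked, allowed))
```

A plain hosts file has no way to express the subdomain match or the exception, which seems to be the point of the note above.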

#5:25 AM

advertising privacy dns networking

Monday, March 30, 2026

Post Affiliate Pro

www.postaffiliatepro.com/?fh_id=af6a1b0bea664bab

affiliate

#4:54 AM

marketing affiliate-marketing

DevInsight - A powerful engineering platform to drive DevOps performance

www.devinsight.ai/

#3:09 AM

software-engineering-intelligence-platforms code metrics dora

The SPACE of Developer Productivity - ACM Queue

queue.acm.org/detail.cfm?id=3454124

SPACE was created explicitly to address the limitations of single-dimension productivity metrics (including DORA). Its core argument is that developer productivity is multidimensional and cannot be captured by any single metric or even a single category of metrics. You need to measure across multiple dimensions and combine perceptual (self-reported) data with behavioral (system-observed) data.

S — Satisfaction and Well-being

What it measures: How fulfilled, happy, and healthy developers feel about their work, team, tools, and culture.

Why it matters: Developer satisfaction is both an outcome worth caring about and a leading indicator of future productivity. Dissatisfied developers leave, disengage, or burn out, all of which destroy team productivity over time. Satisfaction is also the dimension most likely to surface problems that system metrics miss (e.g., "our CI is technically fast but the developer experience of debugging failures is awful").

Example metrics:

Developer satisfaction surveys (NPS-style or Likert scale)
Retention and turnover rates
Burnout indicators (after-hours work patterns, survey responses)
Tool satisfaction ratings

P — Performance

What it measures: The outcomes of the work: not how much was done, but whether what was done achieved its intended result.

Why it matters: Activity without outcomes is waste. A team can be very busy (high activity) and still underperform (low performance) if they're working on the wrong things, producing low-quality output, or failing to deliver customer value.

Example metrics:

Customer satisfaction / NPS
Feature adoption rates
Reliability (uptime, error rates)
Code quality indicators (defect density, code review quality)
Revenue or business KPIs tied to engineering output

A — Activity

What it measures: The count or volume of actions and outputs produced by developers and teams.

Why it matters (with caveats): Activity metrics are the most straightforward to collect from systems (commits, PRs, deployments, reviews). They're useful as a component of productivity measurement but dangerous as the primary measure because they incentivize volume over value. The SPACE authors explicitly warn against using activity metrics in isolation.

Example metrics:

Number of PRs opened, reviewed, merged
Number of commits
Number of code reviews completed
Number of deployments
Number of incidents responded to
CI/CD pipeline runs

C — Communication and Collaboration

What it measures: How effectively people and teams share information, coordinate work, review each other's contributions, and work together.

Why it matters: Software development is a team sport. Individual velocity means little if coordination overhead is high. Teams with poor communication have longer cycle times, more rework, and more integration conflicts, even if individual developers are productive in isolation.

Example metrics:

Code review turnaround time (time from review request to first review)
PR review depth (number of review comments, reviewers per PR)
Knowledge distribution (bus factor: how many people can work on a given area?)
Cross-team PR review frequency
Meeting load and interruption frequency
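As a concrete example of one of these metrics, code review turnaround can be computed from two timestamps per PR. A sketch with invented records; the tuple layout and the choice of the median as the summary are my own assumptions, not something the SPACE article prescribes.

```python
from datetime import datetime
from statistics import median

# Hypothetical records: (review requested, first review submitted).
reviews = [
    (datetime(2026, 3, 2, 9, 0), datetime(2026, 3, 2, 11, 0)),
    (datetime(2026, 3, 2, 14, 0), datetime(2026, 3, 3, 10, 0)),
    (datetime(2026, 3, 4, 8, 30), datetime(2026, 3, 4, 9, 30)),
]

# Turnaround in hours for each PR.
turnaround = [(first - asked).total_seconds() / 3600 for asked, first in reviews]

# The median is less distorted than the mean by one slow overnight review.
median_hours = median(turnaround)
print(turnaround, median_hours)
```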

E — Efficiency and Flow

What it measures: Whether developers can do their work with minimal interruptions, delays, and friction. This dimension captures the experience of getting work done: are there unnecessary handoffs, tool-switching, waiting periods, or manual steps?

Why it matters: This is the heart of the "developer experience" concept. Two teams with identical DORA metrics can have radically different developer experiences if one team's pipeline is smooth and automated while the other requires manual interventions, workarounds, and waiting.

Example metrics:

Time spent waiting (for CI, for reviews, for environments)
Handoffs between teams or tools
Manual steps in automated workflows
Context switches per day
"Flow state" time (uninterrupted coding time)
Toil and workaround frequency

#2:34 AM

software-engineering-intelligence-platforms code metrics dora

DORA | Get Better at Getting Better

dora.dev/

DORA is the largest and longest-running research program of its kind. It seeks to understand the capabilities that drive software delivery and operations performance. DORA helps teams apply those capabilities, leading to better organizational performance.

#2:15 AM

software-engineering-intelligence-platforms code dora metrics

Engineering Productivity Software for Delivery Leaders | Middleware

middlewarehq.com/

#2:04 AM

software-engineering-intelligence-platforms code
