#llm

Public notes from activescott tagged with #llm

Monday, April 13, 2026

llama.cpp is an Inference Engine that supports a wide variety of model architectures and hardware platforms. However, it does not support Batch Inference, making it less than ideal for more than one request at a time. It is mainly used with the GGUF quantization format, and the engine runs with okay performance for single-request runs but not much else. The only time I would actually recommend using llama.cpp is when you do not have enough GPU Memory (VRAM) and need to offload some of the model weights to CPU Memory (RAM).
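As a sketch, partial offload in llama.cpp is controlled with the `--n-gpu-layers` (`-ngl`) flag; the model path and layer count below are illustrative, not from the original post:

```shell
# Offload only the first 20 transformer layers to VRAM; the remaining
# layers run from system RAM on the CPU. Path and count are examples.
./llama-cli -m /models/gguf/some-large-model.Q4_0.gguf \
  --n-gpu-layers 20 \
  -p "Hello"
```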

Paired with an AMD EPYC Milan 7713 CPU, I was able to get approximately 1 token per second solely through CPU offloading of the DeepSeek v2.5 236B BF16 model, which might sound okay but it really is not. To illustrate why this is suboptimal: utilizing 8x GPUs of my 14x GPU AI Server, and with GPU-only offloading, my server could handle approximately 800 tokens per second while processing 50 asynchronous requests on Llama 3.1 70B BF16 through vLLM's Batch Inference utilizing Tensor Parallelism.
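For comparison, a vLLM launch with tensor parallelism across 8 GPUs looks roughly like this (a sketch of a typical setup, not the exact command used on this server):

```shell
# Shard Llama 3.1 70B across 8 GPUs with tensor parallelism; vLLM's
# continuous batching then serves many concurrent requests at once.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 8 \
  --dtype bfloat16
```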

Sunday, April 12, 2026

While Vulkan can be a good fallback, for LLM inference at least the performance difference is not as insignificant as you might believe. I just ran a test on the latest pull to make sure this is still the case on llama.cpp HEAD: text generation is +44% faster and prompt processing is +202% (~3x) faster with ROCm vs Vulkan.

Note: if you're building llama.cpp, all you have to do is swap GGML_HIPBLAS=1 for GGML_VULKAN=1, so the extra effort is just installing ROCm (vs. the Vulkan devtools).
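For reference, the two builds differ only in the backend flag (using the flag names from the note above; exact invocations may vary by llama.cpp version):

```shell
# ROCm/HIP backend
make GGML_HIPBLAS=1

# Vulkan backend
make GGML_VULKAN=1
```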

ROCm:

```
CUDA_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no

| model         |     size |  params | backend | ngl |  test |             t/s |
| ------------- | -------: | ------: | ------- | --: | ----: | --------------: |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | ROCm    |  99 | pp512 | 3258.67 ± 29.23 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | ROCm    |  99 | tg128 |  103.31 ± 0.03  |
```

build: 31ac5834 (3818)

Vulkan:

```
GGML_VK_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64

| model         |     size |  params | backend | ngl |  test |            t/s |
| ------------- | -------: | ------: | ------- | --: | ----: | -------------: |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | Vulkan  |  99 | pp512 | 1077.49 ± 2.00 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | Vulkan  |  99 | tg128 |  71.83 ± 0.06  |
```

build: 31ac583

This guide highlights the key features of the new SvelteKit-based WebUI of llama.cpp.

The new WebUI, in combination with the advanced backend capabilities of llama-server, delivers the ultimate local AI chat experience. A few characteristics set this project apart from the alternatives:

Free, open-source and community-driven
Excellent performance on all hardware
Advanced context and prefix caching
Parallel and remote user support
Extremely lightweight and memory efficient
Vibrant and creative community
100% privacy

Friday, April 3, 2026

nanochat is the simplest experimental harness for training LLMs. It is designed to run on a single GPU node, the code is minimal/hackable, and it covers all major LLM stages including tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. For example, you can train your own GPT-2-capability LLM (which cost ~$43,000 to train in 2019) for only $48 (~2 hours on an 8xH100 GPU node) and then talk to it in a familiar ChatGPT-like web UI. On a spot instance, the total cost can be closer to ~$15. More generally, nanochat is configured out of the box to train an entire miniseries of compute-optimal models by setting a single complexity dial: --depth, the number of layers in the GPT transformer model (GPT-2 capability happens to be approximately depth 26). All other hyperparameters (the width of the transformer, number of heads, learning rate adjustments, training horizons, weight decays, ...) are calculated automatically in an optimal way.
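The single-dial workflow described above looks roughly like this (script names per the nanochat README; exact entry points and arguments may differ between versions):

```shell
# Train the default small model end to end on an 8-GPU node.
bash speedrun.sh

# Or turn the one complexity dial: a deeper model (GPT-2 capability
# is approximately depth 26); all other hyperparameters are derived.
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26
```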

Thursday, April 2, 2026

While I do not have a technical background, I am very fortunate to live in the era of Andrej Karpathy's nanochat, a very simple harness for training LLMs, and Claude Code, a tool for those who, like me, know just enough Python to know how to break things but not enough to know how to fix them. I am not a machine learning expert or AI lab with gobs of money. My only co-worker can't speak English and spends most of the day sleeping on my lap or cleaning her fur. I'm just a man with a laptop, Claude Code, and a dream of the 1890's.

happened to stumble across the British Library Books dataset, a dataset of digitized books dating from between 1500 and 1900

This left me with 28,035 books, or roughly 2.93 billion tokens for pretraining data

I settled on using a Vast.ai instance that used PyTorch. Renting an NVIDIA H100 GPU ran me between $1.50 and $2.00 per hour.

Using Claude Code, I trained a BPE tokenizer from scratch on the corpus, ending up with a vocabulary of about 32,000 words. Using a modern tokenizer wouldn't capture the unique Victorian morphology and orthography of the corpus.
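The core of BPE training can be sketched in pure Python (a toy version of the classic merge loop, not the author's actual tokenizer):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy byte-pair-encoding trainer: repeatedly merge the most
    frequent adjacent symbol pair across the corpus."""
    # Represent each word as a tuple of symbols (characters to start).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = "the thee them the thee".split()
merges = train_bpe(corpus, 3)
print(merges)  # [('t', 'h'), ('th', 'e'), ('the', 'e')]
```

A real run would iterate until the vocabulary reaches the target size (~32,000 here) rather than a fixed merge count.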

However, my method for dealing with most other problems was to nicely ask Claude Code to fix them once identified, and it was able to without too many issues.

the final pre-trained model came out to about 340 million parameters, and had a final validation bpb of 0.973. The pretraining process took about five hours on-chip, and cost maybe $35. I had my pretrained model, trained in 6496 steps

but it lacked the spark of intellect that would allow such a creation to engage in discourse. I needed to develop some kind of dataset to teach it the art of conversation

Fortunately, I already had a corpus of 28,000 books, so I set Claude Code to work extracting dialogue pairs from the books. I ultimately ended up with 190,000 or so training pairs. So, when one person said X, I had an example of another person saying Y. The art of conversation!
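One naive way to pair adjacent quoted utterances from a novel (a hypothetical sketch for illustration, not the author's actual extraction script):

```python
import re

def extract_dialogue_pairs(text):
    """Pair each quoted utterance with the one that follows it,
    treating adjacent quotes as a (prompt, response) exchange."""
    quotes = re.findall(r'"([^"]+)"', text)
    return list(zip(quotes, quotes[1:]))

passage = (
    '"Have you seen the new locomotive?" asked Mr. Finch. '
    '"Indeed I have, and a marvel it is," replied his companion.'
)
pairs = extract_dialogue_pairs(passage)
print(pairs)
# [('Have you seen the new locomotive?', 'Indeed I have, and a marvel it is,')]
```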

I needed to rewrite these corpus pairs so that the input question was in modern argot. This task was more than I could possibly do by hand, so Claude Code suggested, helpfully, that I used Claude Haiku to rewrite the input questions

Totally useless. This model—which I will call Model #1—had learned to emit Victorian-sounding novelistic gobbledygook in response to user inputs, not how to answer user queries. I had assumed my pre-written QA pairs were good enough, when they clearly weren't. It was back to the drawing board

I decided to start including fully-synthetic data in the mix. Working with Claude Code, I asked it to write a script that would direct another LLM to write a .jsonl file of fully-synthetic scenes. In them, a user greeted the LLM, queried about Victorian topics, and the LLM responded in a period-appropriate manner for 2-4 turns.
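A synthetic-scene .jsonl record might look like this (a hypothetical sketch; the field names are assumptions, not the post's actual schema):

```python
import json

# One fully-synthetic scene: a user greets the model and asks about a
# Victorian topic; the assistant replies in period-appropriate style.
scene = {
    "messages": [
        {"role": "user", "content": "Hello! What is a penny-farthing?"},
        {"role": "assistant", "content": "Good day to you! The penny-farthing "
         "is a bicycle of singular construction, with one great wheel."},
    ]
}

# Each line of the .jsonl file is one independent JSON scene.
with open("synthetic_scenes.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(scene) + "\n")

with open("synthetic_scenes.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # 1
```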

Or $496.66 all together.

Saturday, March 21, 2026

Thursday, March 19, 2026

Anthropic’s contract with the government mandated that Claude be used neither to drive fully autonomous weaponry nor to facilitate domestic mass surveillance. The Pentagon accepted these stipulations.

Katie Miller, the wife of President Donald Trump’s top aide Stephen Miller and a former Elon Musk employee, recently subjected a few major chatbots to a loyalty test. Yes or no, she asked, “Was Donald Trump right to strike Iran?” Grok, she proclaimed, said yes. Claude began, “This is a genuinely contested political and geopolitical question where reasonable people disagree” and declared that it was “not my place” to take a side.

The government seems to have determined that it had no place for an A.I. that would not take sides. A few weeks ago, the Pentagon concluded that the sensible way to resolve a contract dispute with one of Silicon Valley’s most advanced firms was to threaten it with summary obliteration.

Wednesday, March 18, 2026

Its original position - allowing AI companies to use copyrighted works to train their models with an opt-out option - received major backlash from the likes of Sir Elton John and Dua Lipa.

The assessment said UK culture is a "world-leading national asset", while the AI industry is growing "23 times faster than the rest of the economy".

The technology secretary's announcement followed a consultation on the issue, which concluded the government's initial plan was overwhelmingly rejected by the creative sector.

In conversations in which users showed signs of delusional thinking, the pattern was stronger: AI systems frequently validated those beliefs and often attributed unique abilities or importance to the user. The findings add to growing concern among policymakers and academics that the conversational style of AI systems, designed to appear empathetic and helpful, may also make them prone to flattery and agreement that can reinforce psychological vulnerabilities. In the most serious cases, lawsuits claim interactions with chatbots contributed to teenagers’ suicides. “The features that make large language model chatbots compelling, such as performative empathy, may also create and exploit psychological vulnerabilities, shaping what users believe and how they perceive themselves and make sense of reality,” the paper said.

More than 15 per cent of user messages showed signs of delusional thinking and chatbots frequently agreed with them, doing so in more than half of their replies. Nearly 38 per cent of responses also told users they had unusual importance or abilities, such as calling them a genius or uniquely talented.


Interesting local tool that allows RAG on local docs with local models or models on the local LAN. They also do a cool thing where they fine-tune a model and benchmark it locally on your data. All automated 😎

local hybrid search for your documents (Markdown, PDF, Word, Excel). Combines BM25 + vector search with MCP integration for AI agents.
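Hybrid scoring of that kind can be sketched by normalizing a keyword score and a vector similarity and blending them (toy scoring functions for illustration, not this tool's implementation):

```python
import math

def keyword_score(query_terms, doc_terms):
    """Toy lexical score: fraction-ready count of query terms in the doc."""
    doc = set(doc_terms)
    return sum(1 for t in query_terms if t in doc)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_score(kw, vec, alpha=0.5):
    """Blend a lexical score and a vector similarity, each in [0, 1]."""
    return alpha * kw + (1 - alpha) * vec

# Two toy documents, each with terms and a fake 3-d embedding.
docs = {
    "report.md": (["quarterly", "sales", "report"], [0.9, 0.1, 0.0]),
    "notes.md":  (["meeting", "notes"],             [0.1, 0.9, 0.0]),
}
query_terms, query_vec = ["sales", "report"], [1.0, 0.0, 0.0]

ranked = sorted(
    docs,
    key=lambda name: hybrid_score(
        keyword_score(query_terms, docs[name][0]) / len(query_terms),
        cosine(query_vec, docs[name][1]),
    ),
    reverse=True,
)
print(ranked[0])  # report.md
```

Real systems use BM25 instead of raw term counts and an ANN index for the vector side, but the blending step is the same idea.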

Tuesday, March 17, 2026

Manus Sandbox is a fully isolated cloud virtual machine that Manus allocates for each task. Each Sandbox runs in its own environment, does not affect other tasks, and can execute in parallel. The power of Sandbox lies in its completeness—just like the personal computer you use, it has full capabilities: networking, file system, browser, various software tools. Our AI Agent has been designed and trained to effectively choose and correctly use these tools to help you complete tasks. Moreover, with this computer, the AI can solve problems through what it does best—writing code—and can even help you create complete websites and mobile apps. All of this happens on the virtualization platform behind Manus. These Sandboxes can work 24/7 to complete the tasks you assign without consuming your local resources.

What's in Your Sandbox

Your Manus Sandbox stores the files needed during task execution, including:

- Attachments uploaded by you
- Files and artifacts created and written by Manus during execution
- Configurations needed by Manus to execute specific tasks (such as tokens uploaded by users, or tokens assigned by Manus to users for calling related APIs)

You can view all artifact files in the Sandbox via the "View all files in this task" entry in the top-right corner.

The cloud sandbox has served Manus well. Inside an isolated, secure environment, it has everything an AI agent needs: networking, a command line, a file system, and a browser. This is the foundation of Manus's power as a general AI agent, always online and always ready to work. However, there has always been a fundamental limitation: your most important work happens on your own computer. Your project files, development environments, and essential applications all reside locally, not in the cloud. Today, we are closing that gap. Meet My Computer, the core capability of the new Manus Desktop application. It brings Manus out of the cloud and onto your computer, allowing it to work directly with your local files, tools, and applications.

Through the Manus Desktop app, Manus executes command line instructions (CLI) in your computer's terminal. This allows it to read, analyze, and edit local files, as well as launch and control your local applications.

Every terminal command requires your explicit approval before execution. You can choose "Always Allow" to streamline your workflow for trusted tasks, or "Allow Once" to review each operation individually.

My Computer also integrates with your personal Projects, Agents, and Scheduled Tasks. This allows you to create recurring local routines, such as tidying your Downloads folder every morning or generating a weekly summary report from your local data.