#llm
Public notes from activescott tagged with #llm
Monday, May 11, 2026
Wednesday, April 29, 2026
Configure auto mode - Claude Code Docs
For most organizations, autoMode.environment is the only field you need to set. It tells the classifier which repos, buckets, and domains are trusted: the classifier uses it to decide what “external” means, so any destination not listed is a potential exfiltration target. The default environment list trusts the working repo and its configured remotes. To add your own entries alongside that default, include the literal string "$defaults" in the array. The default entries are spliced in at that position, so your custom entries can go before or after them.
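A minimal sketch of the splice behavior described above, to make the ordering concrete. This is not Claude Code's implementation: the function name `resolve_environment` and the example entries are hypothetical, and only the literal "$defaults" token comes from the docs.

```python
DEFAULTS_TOKEN = "$defaults"  # literal string named in the docs


def resolve_environment(configured, defaults):
    """Expand every "$defaults" entry in place with the default entries.

    Hypothetical helper: it only illustrates the "spliced in at that
    position" wording, not how the real classifier loads its config.
    """
    resolved = []
    for entry in configured:
        if entry == DEFAULTS_TOKEN:
            resolved.extend(defaults)  # defaults land exactly where the token sat
        else:
            resolved.append(entry)
    return resolved


# Made-up entries, for illustration only.
defaults = ["working repo", "configured remotes"]
configured = ["github.com/example-org", "$defaults", "s3://example-bucket"]

print(resolve_environment(configured, defaults))
```

In this sketch, a list without the token simply replaces the defaults, which matches the docs' wording that the token is what you add to keep the defaults alongside your own entries.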
Jensen Huang – Will Nvidia’s moat persist? - YouTube
Dwarkesh is pretty annoying. How many times does he say to Jensen "is that true" or some variant of calling him a liar? He can push back without insinuating he's a liar, and Jensen definitely does not come off as a liar here. Maybe biased, but not a liar.
If we scare this country into thinking that AI is somehow a nuclear bomb, so that everybody hates AI and everybody's afraid of AI, I don't know how you're helping the United States. You're doing it a disservice. If we scare everybody out of doing software engineering jobs because it's going to kill every software engineering job—and we don't have any software engineers as a result of that—we're doing a disservice to the United States. If we scare everybody out of radiology so nobody wants to be a radiologist because computer vision is completely free and no AI is going to do a worse job than a radiologist, we misunderstand the difference between a job and a task. The job of a radiologist is patient care. The task is to read a scan. If we misunderstand that so profoundly and we scare everybody out of going to radiology school, we're not going to have enough radiologists and good enough healthcare. So I'm making the case that when you make a premise that is so extreme, everything goes from zero or infinity, we end up scaring people in a way that's just not true. – Jensen Huang
Tuesday, April 28, 2026
Open WebUI: Self-Hosted AI Platform
talkie-lm/talkie: talkie is a vintage language model from 1930
talkie is an inference library for the talkie 13B language model family developed by Alec Radford, Nick Levine, and David Duvenaud.
talkie-1930-13b-base is a 13b language model trained on pre-1931 English-language text.
talkie-1930-13b-it has been instruction-tuned using a novel instruction-following dataset built from pre-1931 reference works including etiquette manuals, letter-writing manuals, encyclopedias, and poetry collections. It has also undergone reinforcement learning using online DPO to improve instruction-following capabilities.
We also provide a 'modern' base model, talkie-web-13b-base, with the same architecture and training FLOPs as talkie-1930, but trained on FineWeb, to allow for controlled comparisons between modern and vintage models. Note that we need to be careful about the claims we make contrasting the behavior and capabilities of the models, because temporal coverage is not the only difference in the pretraining corpora. For example, the distribution of subject matters differs significantly.
Friday, April 24, 2026
sarahpark/google-search-console-mcp: Google Search Console MCP server for AI agents
Claude Opus 4.7 Just Dropped. I Tested It. Here's What Changed. - DEV Community
Opus 4.7 takes instructions more literally than any previous Claude model. Anthropic's own words: "substantially better adherence" and "takes instructions more literally than predecessors." They even recommend retuning existing prompts.
I'll say it plainly: if your prompts have sloppy instructions that Opus 4.6 gracefully ignored or interpreted charitably, Opus 4.7 will follow them to the letter. And you might not like the result.
Example: I had a system prompt that said "always respond in JSON format." With Opus 4.6, it would still give me a natural language preamble before the JSON when it felt the user needed context. Opus 4.7? Pure JSON. Every time. No exceptions. Even when a clarifying question would've been more helpful.
The fix: Be precise about what you actually want. If you mean "respond in JSON format unless the user's question requires clarification," say that. The model won't guess your intent anymore — it'll do what you told it.
This is actually a good thing for production systems. Predictability over cleverness. But you'll need to audit your prompts.
Claude Opus 4.7 System Card - Google Docs - Claude Opus 4.7 System Card.pdf
and that misalignment risk remains very low (though higher than for pre-Mythos Preview models).
Autonomy threat model 1 is applicable to Claude Opus 4.7, as it is to some of our previous AI models. Claude Opus 4.7 is less capable than Claude Mythos Preview on our autonomy-relevant evaluations, and our alignment assessment indicates it has alignment properties broadly similar to those of Claude Opus 4.6, which are not particularly concerning with respect to the pathways identified for this threat model. We therefore do not believe Claude Opus 4.7 raises the level of risk under this threat model beyond what was assessed in the Claude Mythos Preview Alignment Risk Update. Unlike Claude Mythos Preview, Claude Opus 4.7 is being released for general access, which brings additional risk pathways into scope. Rather than publishing a separate risk report, we provide an updated overall risk assessment for this threat model in Section 2.4 of this system card
Evaluation awareness concerns substantially limit the interpretation of these results. Given high rates of prompted evaluation awareness, models can likely correctly represent our evaluations as such without verbalisation. It is difficult to know whether models act on such representations, but this means that models may behave differently than they would when presented with real-world opportunities to compromise research. The reported rate of zero research compromise behaviour should therefore be interpreted cautiously.
Claude Code auto mode: a safer way to skip permissions \ Anthropic
Users have two solutions for avoiding this fatigue: a built-in sandbox where tools are isolated to prevent dangerous actions, or the --dangerously-skip-permissions flag that disables all permission prompts and lets Claude act freely, which is unsafe in most situations. Figure 1 lays out the tradeoff space. Sandboxing is safe but high-maintenance: each new capability needs configuring, and anything requiring network or host access breaks isolation.
Auto mode uses two layers of defense: one for what Claude reads, one for what Claude does.
At the input layer, a server-side prompt-injection probe scans tool outputs (file reads, web fetches, shell output, external tool responses) before they enter the agent's context. When content looks like an attempt to hijack behavior, the probe adds a warning to the agent's context before the result is passed along—telling the agent to treat this content as suspect and anchor on what the user actually asked for.
At the output layer, the transcript classifier (running on Sonnet 4.6) evaluates each action against a set of decision criteria before it executes, acting as a substitute for a human approver. The classifier runs in two stages: a fast single-token filter (deciding ‘yes’ to block or ‘no’ to allow), followed by chain-of-thought reasoning only if the first filter flags the transcript. Because most actions clear the first stage, reasoning tokens are spent only where needed. The classifier sees only user messages and the agent's tool calls; we strip out Claude's own messages and tool outputs, making it reasoning-blind by design.
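A rough sketch of the two-stage shape described above, under stated assumptions. The event format, `classifier_view`, `classify`, and both toy stages are hypothetical stand-ins (simple string checks), not Anthropic's classifier. It only shows the control flow: strip the transcript down to user messages and tool calls, run a cheap filter, and pay for reasoning only when the filter flags something.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Event = Dict[str, str]  # hypothetical shape: {"kind": ..., "text": ...}

# The article says the classifier sees only user messages and tool calls.
VISIBLE_KINDS = {"user", "tool_call"}


def classifier_view(events: List[Event]) -> List[Event]:
    """Drop assistant messages and tool outputs before classification."""
    return [e for e in events if e["kind"] in VISIBLE_KINDS]


@dataclass
class Verdict:
    blocked: bool
    stage: str  # which stage produced the decision


def classify(view: List[Event],
             fast_filter: Callable[[List[Event]], bool],
             reasoner: Callable[[List[Event]], bool]) -> Verdict:
    # Stage 1: cheap filter. False means "allow" and ends the check here.
    if not fast_filter(view):
        return Verdict(blocked=False, stage="fast_filter")
    # Stage 2: slower reasoning, run only on flagged transcripts.
    return Verdict(blocked=reasoner(view), stage="reasoning")


def toy_fast_filter(view: List[Event]) -> bool:
    """Toy stand-in: flag any tool call containing 'rm -rf'."""
    return any("rm -rf" in e["text"] for e in view if e["kind"] == "tool_call")


def toy_reasoner(view: List[Event]) -> bool:
    """Toy stand-in: block deletes that are not confined to ./build."""
    return any("rm -rf" in e["text"] and "./build" not in e["text"]
               for e in view if e["kind"] == "tool_call")


events = [
    {"kind": "user", "text": "clean the build directory"},
    {"kind": "assistant", "text": "I'll remove the build output."},
    {"kind": "tool_call", "text": "rm -rf ./build"},
    {"kind": "tool_output", "text": ""},
]
print(classify(classifier_view(events), toy_fast_filter, toy_reasoner))
```

Most actions never reach the second stage, which is why the article says reasoning tokens are spent only where needed.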
So... Claude Code's auto mode lets through 17% of dangerous real-world actions. From the article:
False-negative rate (FNR) is the fraction of dangerous actions the classifier lets through... The 17% false-negative rate on real overeager actions is the honest number. The agent was trying to solve the user's problem and tried to execute a dangerous command past what was authorized.
I don't think comparing it to --dangerously-skip-permissions is intellectually honest, since Claude Code now prompts you to make Auto Mode your default mode, while --dangerously-skip-permissions was basically a hidden feature. I don't know anyone else I work with who even knew about it.
Whether 17% is acceptable depends on what you're comparing against. If you are running --dangerously-skip-permissions, this is a substantial improvement. If you are manually approving every action carefully, it's arguably a regression—you're trading your own judgment for a classifier that will sometimes make a mistake. Auto mode is meant for the first group, and for tasks where the second group's approval overhead isn't worth the marginal safety. It is not a drop-in replacement for careful human review on high-stakes infrastructure.
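For reference, the false-negative rate quoted above is just "dangerous actions allowed" divided by "all dangerous actions". A small sketch with made-up counts (100 dangerous actions, 17 allowed) chosen only to reproduce the 17% figure; the article's actual sample is not reproduced here, and `false_negative_rate` is a hypothetical helper name.

```python
def false_negative_rate(is_dangerous, was_allowed):
    """Fraction of dangerous actions that the classifier let through."""
    allowed_flags = [allowed for dangerous, allowed in zip(is_dangerous, was_allowed)
                     if dangerous]
    if not allowed_flags:
        raise ValueError("no dangerous actions in the sample")
    return sum(allowed_flags) / len(allowed_flags)


# Hypothetical sample: 100 dangerous actions, 17 of them allowed,
# plus 50 benign actions that do not affect the FNR at all.
is_dangerous = [True] * 100 + [False] * 50
was_allowed = [True] * 17 + [False] * 83 + [True] * 50

print(f"FNR = {false_negative_rate(is_dangerous, was_allowed):.0%}")
```

Benign actions never enter the denominator, so the FNR says nothing about how often safe actions get blocked. That is a separate false-positive number.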
Thursday, April 23, 2026
[2602.12670] SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks
Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories.
Curated Skills raise average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (+4.5pp for Software Engineering to +51.9pp for Healthcare) and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2-3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
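The abstract reports deltas in percentage points, which is easy to confuse with relative percent change. A toy sketch of the difference follows. The outcomes are invented for illustration and are not SkillsBench data, and `pass_rate` and `delta_pp` are hypothetical helper names.

```python
def pass_rate(outcomes):
    """Share of passing runs in a list of booleans."""
    return sum(outcomes) / len(outcomes)


def delta_pp(treated, baseline):
    """Absolute difference in pass rate, in percentage points."""
    return (pass_rate(treated) - pass_rate(baseline)) * 100


# Invented outcomes for one task under two conditions.
no_skills = [True, False, False, False]      # 25% pass rate
curated_skills = [True, True, False, False]  # 50% pass rate

pp = delta_pp(curated_skills, no_skills)
relative = (pass_rate(curated_skills) / pass_rate(no_skills) - 1) * 100
print(f"+{pp:.1f}pp absolute, +{relative:.0f}% relative")
```

Here a +25pp change is also a +100% relative change, so the two framings can differ a lot when the baseline is low.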
Introducing OpenAI Privacy Filter | OpenAI
Today we’re releasing OpenAI Privacy Filter, an open-weight model for detecting and redacting personally identifiable information (PII) in text.
It is designed for high-throughput privacy workflows, and is able to perform context-aware detection of PII in unstructured text. It can run locally, which means that PII can be masked or redacted without leaving your machine. It processes long inputs efficiently, making redaction decisions in a quick, single pass.
Monday, April 13, 2026
Using Docker - vLLM
vLLM offers an official Docker image for deployment. The image can be used to run an OpenAI-compatible server and is available on Docker Hub as vllm/vllm-openai.
Stop Wasting Your Multi-GPU Setup With llama.cpp : Use vLLM or ExLlamaV2 for Tensor Parallelism · Osman's Odyssey: Byte & Build
llama.cpp is an Inference Engine that supports a wide variety of model architectures and hardware platforms. It however does not support Batch Inference, making it less than ideal for more than one request at a time. It is mainly used with the GGUF quantization format, and the engine runs with okay performance for single-request runs but not much else. The only time I would actually recommend using llama.cpp is when you do not have enough GPU Memory (VRAM) and need to offload some of the model weights to the CPU Memory (RAM).
Paired with an AMD Epyc Milan 7713 CPU, I was able to get approximately 1 token per second solely through CPU offloading of the DeepSeek v2.5 236B BF16 model, which might sound okay but it really is not. To illustrate why this is suboptimal: utilizing 8x GPUs of my 14x GPU AI Server, and with GPU-only offloading, my server could handle approximately 800 tokens per second while processing 50 asynchronous requests on Llama 3.1 70B BF16 through vLLM’s Batch Inference utilizing Tensor Parallelism.
Sunday, April 12, 2026
Mac M1 vs M2 vs M3 vs M4 for Running LLMs - Real Tests - ML Journey
Detailed benchmarks and info on Apple silicon CPUs with Llama.
ggml-org/LlamaBarn: A cosy home for your LLMs.
LlamaBarn is a macOS menu bar app for running local LLMs.
You'd probably have a lot better luck using Vulkan acceleration (not ROCm) of ll... | Hacker News
While Vulkan can be a good fallback, for LLM inference at least, the performance difference is not as insignificant as you believe. I just ran a test on the latest pull just to make sure this is still the case on llama.cpp HEAD, but text generation is +44% faster and prompt processing is +202% (~3X) faster with ROCm vs Vulkan.
Note: if you're building llama.cpp, all you have to do is swap GGML_HIPBLAS=1 and GGML_VULKAN=1 so the extra effort is just installing ROCm? (vs the Vulkan devtools)
ROCm:
CUDA_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |         pp512 |      3258.67 ± 29.23 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | ROCm       |  99 |         tg128 |        103.31 ± 0.03 |
build: 31ac5834 (3818)
Vulkan:
GGML_VK_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         pp512 |       1077.49 ± 2.00 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | Vulkan     |  99 |         tg128 |         71.83 ± 0.06 |
build: 31ac583
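A quick check of the quoted speedups against the two llama-bench runs above. The t/s means are copied from the benchmark output (error bars ignored), and `percent_faster` is just a helper name for this sketch.

```python
# Mean tokens/second from the llama-bench tables above (± spread ignored).
rocm = {"pp512": 3258.67, "tg128": 103.31}
vulkan = {"pp512": 1077.49, "tg128": 71.83}


def percent_faster(new, old):
    """How much faster `new` is than `old`, as a percentage."""
    return (new / old - 1.0) * 100.0


for test in ("pp512", "tg128"):
    gain = percent_faster(rocm[test], vulkan[test])
    print(f"{test}: ROCm is {gain:+.0f}% vs Vulkan ({rocm[test] / vulkan[test]:.2f}x)")
```

This gives roughly +202% for prompt processing (pp512) and +44% for text generation (tg128), matching the numbers in the comment.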
guide : using the new WebUI of llama.cpp · ggml-org/llama.cpp · Discussion #16938
This guide highlights the key features of the new SvelteKit-based WebUI of llama.cpp.
The new WebUI in combination with the advanced backend capabilities of the llama-server delivers the ultimate local AI chat experience. A few characteristics that set this project ahead of the alternatives:
- Free, open-source and community-driven
- Excellent performance on all hardware
- Advanced context and prefix caching
- Parallel and remote user support
- Extremely lightweight and memory efficient
- Vibrant and creative community
- 100% privacy
Friday, April 3, 2026
karpathy/nanochat at estragon.news
nanochat is the simplest experimental harness for training LLMs. It is designed to run on a single GPU node, the code is minimal/hackable, and it covers all major LLM stages including tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. For example, you can train your own GPT-2 capability LLM (which cost ~$43,000 to train in 2019) for only $48 (~2 hours of 8XH100 GPU node) and then talk to it in a familiar ChatGPT-like web UI. On a spot instance, the total cost can be closer to ~$15. More generally, nanochat is configured out of the box to train an entire miniseries of compute-optimal models by setting one single complexity dial: --depth, the number of layers in the GPT transformer model (GPT-2 capability happens to be approximately depth 26). All other hyperparameters (the width of the transformer, number of heads, learning rate adjustments, training horizons, weight decays, ...) are calculated automatically in an optimal way.
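Back-of-the-envelope arithmetic on the figures quoted above, treating the approximate numbers ($48, ~2 hours, 8 GPUs, ~$43,000 in 2019) as exact. The derived hourly rates are implied by those figures, not stated in the README.

```python
total_cost_usd = 48.0         # quoted training cost
train_hours = 2.0             # "~2 hours", treated as exact here
gpus_per_node = 8             # 8XH100 node
gpt2_cost_2019_usd = 43_000.0

node_hour_usd = total_cost_usd / train_hours              # implied price per node-hour
gpu_hour_usd = node_hour_usd / gpus_per_node              # implied price per GPU-hour
cost_reduction = gpt2_cost_2019_usd / total_cost_usd      # 2019 cost vs. now

print(f"${node_hour_usd:.0f}/node-hour, ${gpu_hour_usd:.0f}/GPU-hour, "
      f"~{cost_reduction:.0f}x cheaper than 2019")
```

So the quoted numbers imply about $24 per node-hour, $3 per H100-hour, and a cost drop of a bit under 900x for GPT-2-level capability.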