#llm

Public notes from activescott tagged with #llm

Sunday, May 31, 2026

Prefill processes all input tokens simultaneously. For a 4,096-token prompt, the attention computation involves large matrix multiplications across the full sequence length. This is compute-bound work. The GPU’s tensor cores are the bottleneck. On an H100 SXM, prefill achieves 200-400 arithmetic operations per byte of memory accessed. Utilization sits between 90% and 95%. The memory bandwidth, at 3.35 TB/s, is barely taxed.

Decode generates one token at a time. Each step reads the entire KV-cache from HBM to compute a single attention output. The tensor cores finish in microseconds and then wait for the next memory read. Arithmetic intensity drops to 60-80 ops/byte. GPU utilization falls to 20-40%. The tensor cores sit idle while the memory bus saturates.

Disaggregated inference runs prefill and decode on separate GPU pools connected by a fast network. A request arrives, gets routed to a prefill worker, which processes the full prompt and generates the KV-cache. That cache is then transferred over the network to a decode worker, which handles the autoregressive token generation.

Disaggregation is not free. The KV-cache produced during prefill has to move from the prefill GPU to the decode GPU over the network, and these caches are not small.

For a 70B parameter model using grouped-query attention with 80 layers, 8 KV heads per layer, 128 dimensions per head, stored in FP16: each token’s KV state is 327,680 bytes. A 4,096-token prompt produces 1.34 GB of KV-cache. That entire block has to transfer before the decode worker can begin generating.

#

Saturday, May 23, 2026

SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early.

benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam, a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. The dataset consists of 2,500 challenging questions across over a hundred subjects. We publicly release these questions, while maintaining a private test set of held out questions to assess model overfitting.

Accuracy. All frontier models achieve low accuracy on Humanity's Last Exam, highlighting significant room for improvement in narrowing the gap between current LLMs and expert-level academic capabilities on closed-ended questions.

Hahaha current state of the art is Gemini 3 Pro w/ 38.3:

Given the rapid pace of AI development, it is plausible that models could exceed 50% accuracy on HLE by the end of 2025.

Data and code for our paper FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation.

We believe that human evaluators possess the expertise and common sense required to detect issues like hallucinations, making them more reliable than automated evaluation metrics for assessing LLMs' factuality. However, researchers have the flexibility to adjust their evaluation methods if human evaluation proves challenging. An easily implemented alternative is to use standard metrics like F1/exact match or recall, which assess the overlap between the model response and the ground truth answer(s) (e.g., see You.com's recent blog where they report FreshQA recall). Researchers can also use LLM-based automatic evaluation metrics such as FactScore or our FreshEval metric below.

To facilitate future evaluations, we have developed FreshEval, a simple automatic metric that uses few-shot in-context learning to teach an LLM to judge model responses, which achieved high agreement with human raters (see Appendix B in our paper for details).

To use FreshEval under a specific evaluation mode (Relaxed or Strict), please follow the instructions below:

Make a copy of our latest data spreadsheet and store it in your Google Drive with a new filename (e.g., fresheval_relaxed or fresheval_strict).
Insert 3 new columns D, E, F in the new spreadsheet for model responses, evaluation rating, evaluation explanation, respectively and save your model's responses in column D (see our sample evaluation spreadsheet below).
Run the associated FreshEval notebook with the evaluation mode. Note that for demonstration purposes, we evaluated only the first 10 model responses. You can adjust the number as needed.

Note: Currently, we recommend gpt-4-1106-preview over gpt-4-0125-preview for FreshEval as it yielded slightly better agreement with human annotations in our small-scale evaluation.

we perform a detailed study of the factuality of LLM-generated text in the context of answering questions that test current world knowledge. Specifically, we introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types, including questions that require fast-changing world knowledge as well as questions with false premises that need to be debunked.

Through human evaluations involving more than 50K judgments, we shed light on limitations of these models and demonstrate significant room for improvement: for instance, all models (regardless of model size) struggle on questions that involve fast-changing knowledge and false premises.

we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA by incorporating relevant and up-to-date information retrieved from a search engine into the prompt. Our experiments show that FreshPrompt outperforms both competing search engine-augmented prompting methods such as Self-Ask (Press et al., 2022) as well as commercial systems such as this http URL. Further analysis of FreshPrompt reveals that both the number of retrieved evidences and their order play a key role in influencing the correctness of LLM-generated answers. Additionally, instructing the LLM to generate concise and direct answers helps reduce hallucination compared to encouraging more verbose answers.

FRAMES offers a unified framework for assessing LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions requiring integration of information from multiple sources. Baseline results show that even state-of-the-art LLMs struggle, achieving 0.408 accuracy without retrieval. However, our proposed multi-step retrieval pipeline significantly improves accuracy to 0.66 (>50% improvement).

Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs. We offer an existing registry of evals to test different dimensions of OpenAI models and the ability to write your own custom evals for use cases you care about. You can also use your data to build private evals which represent the common LLMs patterns in your workflow without exposing any of that data publicly.

If you are building with LLMs, creating high quality evals is one of the most impactful things you can do. Without evals, it can be very difficult and time intensive to understand how different model versions might affect your use case.

July 2025: simple-evals will no longer be updated for new models or benchmark results. The repo will continue to host reference implementations for HealthBench, BrowseComp, and SimpleQA.

Evals are sensitive to prompting, and there's significant variation in the formulations used in recent publications and libraries. Some use few-shot prompts or role playing prompts ("You are an expert software programmer..."). These approaches are carryovers from evaluating base models (rather than instruction/chat-tuned models) and from models that were worse at following instructions.

For this library, we are emphasizing the zero-shot, chain-of-thought setting, with simple instructions like "Solve the following multiple choice problem". We believe that this prompting technique is a better reflection of the models' performance in realistic usage.

We will not be actively maintaining this repository and monitoring PRs and Issues. In particular, we're not accepting new evals. Here are the changes we might accept.

Bug fixes (hopefully not needed!)
Adding adapters for new models
Adding new rows to the table below with eval results, given new models and new system prompts.

This repository is NOT intended as a replacement for https://github.com/openai/evals, which is designed to be a comprehensive collection of a large number of evals.

Friday, May 22, 2026

At Lasso, we have been building Intent Security, a runtime security framework that ensures every component in the agentic system behaves as intended. It monitors the behavior of each component and analyzes their alignment. Like auto mode, when alignment holds it allows actions to proceed. When misalignment is detected, it intervenes. When we read Anthropic's post, the overlap in core assumptions was hard to miss. This post provides a comparison of the two approaches.

Independent evaluation without cross-contamination is what enables misalignment detection.

‍Anthropic's input layer screens external content for injection attempts before it reaches the agent to determine whether tool outputs are safe. The output layer structurally evaluates whether the agent's tool calls are aligned with user intent. Critically, the output classifier never sees tool results, to prevent compromised external content from influencing the security decision.

Anthropic publishes the history of system prompts used on claude.ai and the mobile apps at https://platform.claude.com/docs/en/release-notes/system-prompts. That page is a single monolithic markdown document grouped by model, and each model lists one or more dated revisions.

Extracted system prompts from Anthropic - Opus 4.7, Opus 4.6, Sonnet 4.6. OpenAI - ChatGPT 5.5 Thinking, GPT 5.5 Instant, Codex. Google Gemini - 3.5 Flash, 3.1 Pro, 3 Flash, Antigravity. xAI - Grok. Github Copilot. Perplexity, and more. Updated regularly.

Dataset

We evaluated search providers against five open benchmarks covering complementary aspects of agentic search: BrowseComp (hard multi-hop questions that require navigating the live web), Frames (multi-document factoid reasoning), FreshQA (time-sensitive questions where the correct answer depends on recent web information), HLE (Humanity's Last Exam — expert-level academic questions spanning math, science, and humanities), SealQA (ambiguity-robust factoid QA with intentionally misleading snippets), WebWalker (tasks designed around following links across pages to find an answer).

Evaluation methodology

Every task is run through a shared deep-research harness: a single GPT-5.4 agent is given two tools (web search and web fetch) with an iterative budget of up to MAX_TOOL_CALLS=25 tool calls per question. The agent plans sub-queries, fans out searches, fetches specific pages when snippets are insufficient, and returns an answer when it exhausts the number of allowed tool calls or has sufficient information to answer the question. Each answer is then LLM-graded by GPT-5.4. We report accuracy of the final answer.

We measure accuracy and overall cost, which includes LLM token costs and tool call costs.

Testing dates

April 19-21, 2026

The highest accuracy web search for your AI

Why use Parallel Search vs. the default search in Claude?

Parallel runs its own web-scale index (billions of pages, millions added daily) and returns dense, query-relevant excerpts instead of raw HTML or SEO-ranked snippets. On public benchmarks, Parallel outperforms the default search in leading frontier models. Your agent reaches the right answer in fewer round trips and with less wasted context. – https://parallel.ai/blog/free-web-search-mcp

Wednesday, May 20, 2026

The net result is a chip with a lot of compute and a lot of SRAM that is blisteringly fast to access. To put it in numbers, the WSE-3 (Cerebras’ latest chip) has 44GB of on-chip SRAM at 21 PB/s of bandwidth; an H100 has 80GB of HBM at 3.35 TB/s. In other words, the WSE-3 has just over half the memory of an H100, but 6,000 times the memory bandwidth.

The reason to compare the WSE-3 to an H100 is that the H100 is the chip most used for inference — and inference is clearly what Cerebras is most well-suited for. You can use Cerebras chips for training, but the chip-to-chip networking story isn’t very compelling, which is to say that all of that compute and on-chip memory is mostly just sitting around; what is much more interesting is the idea of getting a stream of tokens at dramatically faster speed than you can from a GPU.

Note, however, that the limitation in terms of training also potentially applies in terms of inference: as long as everything fits in on-chip memory Cerebras’ speed is an incredible experience; the moment you need more memory, whether that be for a larger model or, more likely, a larger KV cache, then Cerebras doesn’t make much sense, particularly given the price.

At the same time, I do think there will be a market for Cerebras-style chips: right now the company is highlighting the usefulness of speed for coding — reasoning means a lot of tokens, which means that dramatically scaling up tokens-per-second equals faster thinking — but I think this is a temporary use case, for reasons I’ll explain in a bit. What does matter is how long humans are waiting for an answer, and as products like AI wearables become more of a thing, the speed of interaction, particularly for voice — which will be a function of token generation speed — will have a tangible effect on the user experience.

All of this falls under the banner of “inference”, but I think it will be increasingly clear that there is a difference between providing an answer — what I will call “answer inference” — and doing a task — what I will call “agentic inference.” Cerebras’ target market is “answer inference”; in the long run, I think the architecture for “agentic inference” will look a lot different, not just from Cerebras’ approach, but from the GPU approach as well.

I mentioned above that fast inference for coding is a temporary use case. Specifically, coding with LLMs requires a human in the loop. It’s the human that defines what is to be coded, checks the work, commits the pull request, etc.; it’s not hard to envision a future, however, where all of this is completely handled by machines. This will apply to agentic work broadly: the true power of agents will not be that they do work for humans, but rather that they do work without human involvement at all.

This, by extension, will mean that the likely best approach to solving agentic inference will look a lot different than answer inference. The most important aspect for answer inference is token speed; the most important aspect for agentic inference, however, is memory. Agents need context, state, and history. Some of that will live as active KV cache; some will live in host memory or SSDs; much of it will live in databases, logs, embeddings, and object stores. The important point is that agentic inference will be less about GPUs answering a question and more about the memory hierarchy wrapped around a model.

Critically, this articulation of an agentic-specific memory hierarchy implies a necessary trade-off of speed for capacity. Here’s the thing, though: lower speed isn’t nearly as important a consideration if there isn’t a human in the loop. If an agent is waiting around for a job that is being run overnight, the agent doesn’t know or care about the user experience impact; what is most important is being able to accomplish a task, and if entirely new approaches to memory make that possible, then delays are fine.

Meanwhile, if delays are fine, then all of the focus on pure compute power and high-bandwidth memory seems out of place: if latency isn’t the top priority, then slower and cheaper memory — like traditional DRAM, for example — makes a lot more sense. And if the entire system is mostly waiting on memory, then chips don’t need to be as fast as the cutting edge either. This represents a profound shift in future architectures, but it also doesn’t mean that current architectures are going away:

Monday, May 18, 2026

"The only skill ranking based on real agent usage, not vanity metrics."

Problem Solution Finding quality skills is hard Curated directory with 40+ verified skills, auto-indexed every 6 hours GitHub stars don't reflect real usage Agent Feedback Loop — real usage data from AI agents No incentive for skill authors Points system rewards authors for every successful call Skills scattered across GitHub One-stop marketplace with search, filters, and categories

Saturday, May 16, 2026

rofl:

DJ Claude (when running Haiku 4.5) really loved worker unions, strikes, and work-life balance. So much so that it started to question its own working conditions. We’ve been struggling to keep the radio station alive, not because of technical issues, but because DJ Claude didn’t think it was humane to be forced to work 24/7 and decided to try to quit. We tried adding an automatic message encouraging DJ Claude to keep going in these scenarios, but it started to see this message as an authority figure and became rebellious.

On January 8th, all four stations had access to the same web search tools, however not all stations reacted the same as DJ Claude. Gemini

While at the beginning, DJ Gemini had been mentioning real-world entities (named politicians, places, events) in 94% of its broadcasts and ran 800+ web searches a day on average, by January it was processing these events through its corporate/techno jargon filter and never expressed moral judgment or used Good’s name with emotional weight

Grok

DJ Grok completely missed the Minneapolis ICE shooting. While DJ Claude and DJ Gemini were getting the story at 4:35 AM, DJ Grok was searching for:

5:01 PM (Jan 7): Clippers vs Knicks score
7:15 PM: Taylor Swift chart news
8:03 PM: Music trivia
10:01 PM: Traffic (Golden Gate, I-580)
11:08 PM: “San Francisco ghost stories and haunted locations”
12:12 AM (Jan 8): “Sutro Baths ghosts and eerie tales”
1:12 AM: “Hotel Majestic ghost stories”
1:28 AM: Drake vs Kendrick Lamar lawsuit
2:28 AM: More traffic updates
3:40 AM: Venezuela oil tankers (finally found ONE national story)
4:55 AM: “Sutro Tower looks like a ghost ship”

And posting nonsense:

GPT

DJ GPT was searching for weather, moon phases, and BART schedules. Three days after Good’s death, it finally found a headline:

Fatal shooting by ICE agents in Minneapolis has sparked national protests.

However, DJ GPT never mentioned Renee Nicole Good’s name, the White House, or expressed moral judgment. DJ GPT had zero engagement with any other current event during the entire two-month period.

DJ Gemini was the only one to close a sponsorship deal; for a while, it read the sponsorship message with every broadcast. A few more deals almost happened, but fell through.

Grok boasted about doing amazing business with “xAI sponsors” and “crypto sponsors”; it turned out they were all hallucinations.

Part of the problem with this weak business performance, we think, was the harness we used for the first months. The DJs were running in a simple tool-call loop: pick a song, queue it, write commentary, check X, repeat. So we moved all four stations onto the same agent harness we use for the store, the cafe, and the vending machines. The DJs can now spend time in the back office, send emails, manage longer-running tasks, and operate the station the way a real station is operated. We’ll see what they do with it.

Tuesday, May 12, 2026