activescott's Notes

Public notes from activescott

Saturday, May 23, 2026

Evals provide a framework for evaluating large language models (LLMs) or systems built using LLMs. We offer an existing registry of evals to test different dimensions of OpenAI models and the ability to write your own custom evals for use cases you care about. You can also use your data to build private evals which represent the common LLMs patterns in your workflow without exposing any of that data publicly.

If you are building with LLMs, creating high quality evals is one of the most impactful things you can do. Without evals, it can be very difficult and time intensive to understand how different model versions might affect your use case.

July 2025: simple-evals will no longer be updated for new models or benchmark results. The repo will continue to host reference implementations for HealthBench, BrowseComp, and SimpleQA.

Evals are sensitive to prompting, and there's significant variation in the formulations used in recent publications and libraries. Some use few-shot prompts or role playing prompts ("You are an expert software programmer..."). These approaches are carryovers from evaluating base models (rather than instruction/chat-tuned models) and from models that were worse at following instructions.

For this library, we are emphasizing the zero-shot, chain-of-thought setting, with simple instructions like "Solve the following multiple choice problem". We believe that this prompting technique is a better reflection of the models' performance in realistic usage.

We will not be actively maintaining this repository and monitoring PRs and Issues. In particular, we're not accepting new evals. Here are the changes we might accept.

Bug fixes (hopefully not needed!)
Adding adapters for new models
Adding new rows to the table below with eval results, given new models and new system prompts.

This repository is NOT intended as a replacement for https://github.com/openai/evals, which is designed to be a comprehensive collection of a large number of evals.

Friday, May 22, 2026

At Lasso, we have been building Intent Security, a runtime security framework that ensures every component in the agentic system behaves as intended. It monitors the behavior of each component and analyzes their alignment. Like auto mode, when alignment holds it allows actions to proceed. When misalignment is detected, it intervenes. When we read Anthropic's post, the overlap in core assumptions was hard to miss. This post provides a comparison of the two approaches.

Independent evaluation without cross-contamination is what enables misalignment detection.

‍Anthropic's input layer screens external content for injection attempts before it reaches the agent to determine whether tool outputs are safe. The output layer structurally evaluates whether the agent's tool calls are aligned with user intent. Critically, the output classifier never sees tool results, to prevent compromised external content from influencing the security decision.

Anthropic publishes the history of system prompts used on claude.ai and the mobile apps at https://platform.claude.com/docs/en/release-notes/system-prompts. That page is a single monolithic markdown document grouped by model, and each model lists one or more dated revisions.

Extracted system prompts from Anthropic - Opus 4.7, Opus 4.6, Sonnet 4.6. OpenAI - ChatGPT 5.5 Thinking, GPT 5.5 Instant, Codex. Google Gemini - 3.5 Flash, 3.1 Pro, 3 Flash, Antigravity. xAI - Grok. Github Copilot. Perplexity, and more. Updated regularly.

Dataset

We evaluated search providers against five open benchmarks covering complementary aspects of agentic search: BrowseComp (hard multi-hop questions that require navigating the live web), Frames (multi-document factoid reasoning), FreshQA (time-sensitive questions where the correct answer depends on recent web information), HLE (Humanity's Last Exam — expert-level academic questions spanning math, science, and humanities), SealQA (ambiguity-robust factoid QA with intentionally misleading snippets), WebWalker (tasks designed around following links across pages to find an answer).

Evaluation methodology

Every task is run through a shared deep-research harness: a single GPT-5.4 agent is given two tools (web search and web fetch) with an iterative budget of up to MAX_TOOL_CALLS=25 tool calls per question. The agent plans sub-queries, fans out searches, fetches specific pages when snippets are insufficient, and returns an answer when it exhausts the number of allowed tool calls or has sufficient information to answer the question. Each answer is then LLM-graded by GPT-5.4. We report accuracy of the final answer.

We measure accuracy and overall cost, which includes LLM token costs and tool call costs.

Testing dates

April 19-21, 2026

In cooperative game theory, the Shapley value is a method (solution concept) for fairly distributing the total gains or costs among a group of players who have collaborated. For example, in a team project where each member contributed differently, the Shapley value provides a way to determine how much credit or blame each member deserves.

In essence, it calculates each player's average marginal contribution across all possible coalitions.

The highest accuracy web search for your AI

Why use Parallel Search vs. the default search in Claude?

Parallel runs its own web-scale index (billions of pages, millions added daily) and returns dense, query-relevant excerpts instead of raw HTML or SEO-ranked snippets. On public benchmarks, Parallel outperforms the default search in leading frontier models. Your agent reaches the right answer in fewer round trips and with less wasted context. – https://parallel.ai/blog/free-web-search-mcp

Thursday, May 21, 2026

“Every cycle, AIPAC shows just how broken our democracy is and how corrupt our political finance system is,” said Usamah Andrabi, a spokesperson at Justice Democrats, a progressive group.

“Every cycle, they are at the forefront of exploiting those gaps for their right-wing donors and at the expense of voters.”

While the Chicago Progressive Partnership — the group whose name appeared on the Amiwala advertisement — was widely believed to be linked to AIPAC, it did not have to reveal the source of its funding until after the elections, which took place in March.

Now that the vote is over, Federal Election Commission receipts show that the sole funder of Chicago Progressive Partnership was Elect Chicago Women (ECW), another PAC. It contributed $1m to the partnership.

In turn, ECW had raised more than $4m from United Democracy Project (UDP), the election arm of AIPAC, and another $1m from investor Blair Frank, one of UDP’s largest donors.

AIPAC also contributed $1.3m to a third PAC, Affordable Chicago Now, in what critics call an effort to conceal its spending in Illinois.

Palestinian rights advocates say this use of “shell PACs” is evidence of how the pro-Israel group has become “toxic” among the US electorate. They argue AIPAC has taken a Russian doll approach — hiding its spending by funnelling funds from one PAC to another — to hide its involvement in primary races.

“They are so unpopular amongst the Democratic Party that they have to hide themselves,” Andrabi told Al Jazeera. “We have to keep exposing them and looking under every rock to see whether or not this shell PAC or that shell PAC is funded by AIPAC.”

Just this week, The New York Times and Siena College released a survey showing that 37 percent of US voters now sympathise with Palestinians, while 35 percent sympathise with Israelis.

That number was even higher among Democratic respondents, 57 percent of whom felt greater sympathy for the Palestinians.

The Pew Research Center suggested an even stronger left-wing backlash. Its survey earlier this year found 80 percent of Democratic respondents said they have unfavourable views of Israel.

Despite its well-documented clout, AIPAC’s organisational structure remains murky, as well as its spending.

On Wednesday, DAWN, the rights group, released a report that relied on LinkedIn disclosures to track the group’s current and former staff members and their professional connections.

It found that many people who worked for AIPAC also held jobs with the US and Israeli governments.

“DAWN’s analysis shows that 66 former AIPAC staffers currently work in the US government, from Congress to the White House to various branches of the military; nearly two dozen current AIPAC staffers previously worked in US government bodies,” the report said.

“The personal and professional relationships that result from this type of revolving door form the backbone of political influence in Washington, which is indicated in the hundreds of professional connections between AIPAC staffers and US federal and state employees.”

Wednesday, May 20, 2026

Italy will summon the Israeli ambassador over the treatment of activists involved in a Gaza-bound aid flotilla, which Prime Minister Giorgia Meloni and Foreign Minister Antonio Tajani described Wednesday as “unacceptable.”

“The images of Israeli Minister Ben Gvir are unacceptable. It is unacceptable that these protesters, including many Italian citizens, are subjected to this treatment that violates their human dignity,” the two said in a joint statement.

The Italian government said it was taking immediate steps at the highest institutional levels to secure the release of Italian citizens involved in the incident.

#

Three versions of the durabletask PyPI package (1.4.1, 1.4.2, 1.4.3), Microsoft’s Durable Task SDK for Python, were published on May 19, 2026 using a compromised PyPI API token.

The dropper downloads a stage-2 Python zipapp (rope.pyz) from attacker infrastructure and executes it with all output suppressed. The stage-2 is a full credential harvesting framework with dedicated collectors for AWS Secrets Manager and SSM Parameter Store, Azure Key Vault, GCP Secret Manager, Kubernetes secrets (across all contexts), HashiCorp Vault, and local password managers (1Password, Bitwarden, pass, gopass). It also reads over 90 sensitive files from disk, exfiltrates everything encrypted with RSA-4096/AES-256-GCM to a C2 server, and propagates itself to other hosts via AWS SSM SendCommand and kubectl exec.

The payload includes geopolitical targeting: it skips systems with a Russian locale and contains a destructive rm -rf /* routine targeting Israeli and Iranian systems.

Password Managers (collectors/passwords.py): Attempts to unlock 1Password, Bitwarden, pass, and gopass by brute-forcing passwords harvested from environment variables matching PASS, SECRET, KEY, BW_, OP_, _MASTER patterns, and from shell history (.bash_history, .zsh_history). On success, it dumps every item from every vault.

Filesystem (collectors/filesystem.py): Reads 90+ files including SSH keys, cloud credentials, Docker configs, npm/PyPI/Cargo/Gem tokens, kubeconfig, Terraform state files, VPN configurations (Tailscale state, WireGuard configs), MCP server configs (Claude Desktop, Cursor, VS Code, Zed, Codeium, Continue), and all .env files found under the home directory. Also extracts environment variables from all Docker containers via the Docker socket or CLI, and collects GitHub tokens via gh auth token.

and collects GitHub tokens via gh auth token.

For each token found, it creates a new public repository named with random Slavic folklore words (e.g., BABA-YAGA-KOSCHEI-742, description: “PUSH UR T3MPRR”) and uploads the encrypted data bundle as results.json. The attacker can later search GitHub for repositories matching these distinctive naming patterns to retrieve the exfiltrated data.

  1. No trusted publishers. The project uses legacy API token authentication instead of PyPI’s OIDC trusted publisher mechanism. Trusted publishers bind publishing to a specific GitHub repository, workflow, and environment. A stolen token cannot publish from outside that workflow. This project has no such binding: anyone holding the token can upload any version from any machine.

Kubernetes (collectors/kubernetes.py): Parses kubeconfig (with a custom YAML parser, no PyYAML dependency), iterates every context, and dumps secrets from all namespaces. Supports in-cluster service account tokens, client certificate auth, and bearer tokens. If kubectl is not present, the collector downloads it from dl.k8s.io. After collecting secrets, it propagates the payload to up to 5 other running pods via kubectl exec.

Trump's second presidency was described by political commentators as having fewer prohibitions on business activity and guardrails against potential conflicts of interest than his first, and for having more opportunities to directly influence Trump.[567][568] Trump repealed and rolled back anti-corruption measures and ethical standards for himself and his allies, dropped corruption charges against political figures with ties to him, and fired inspectors generals investigating fraud and abuse.

His second presidency was described as breaking with decades of ethical norms,[570] and raising substantial corruption concerns.[571][572] Congressional Republicans largely downplayed or ignored the concerns.[573][570]

Federal judges found many of the administration's actions to be illegal and unconstitutional,[13][14][15] and by mid-July, a Washington Post analysis found he defied judges and the courts in roughly one third of all cases against him, actions which were described by legal experts as unprecedented for any presidential administration.[16] His defiance of court orders and a claimed right to disobey the courts raised fears among legal experts of a constitutional crisis.[574] By August 2025, several grant terminations and spending freezes were found by judges and the Government Accountability Office as being illegal and unconstitutional.[575][576]

The Department of Justice formally announced the fund in a filing on May 18, 2026.[30] The fund, known as the Anti-Weaponization Fund, would compensate individuals who claim that the Department of Justice had been weaponized against them, and is set to end in December 2028. As part of the settlement, Trump dismissed complaints filed against the government over the FBI search of Mar-a-Lago and the Mueller special counsel investigation. According to the Justice Department, Trump and his sons would receive a formal apology, but not monetary payment or damages[31], though the publicly available terms of the fund do not prohibit Trump or his family from receiving payments from it, according to some legal observers.

The following day, acting Attorney General Todd Blanche gave Trump and his family permanent immunity from inquiries into their taxes.[33] According to The New York Times, the settlement likely eliminated a dispute over a US$72.9 million tax refund Trump claimed as the host of The Apprentice (2004–2017).

In response to the settlement, Brian Morrissey, the general counsel for the Department of the Treasury, resigned.[40] The settlement fund was met with skepticism from Maine senator Susan Collins and Kansas senator Jerry Moran, who oversee the Senate Committee on Appropriations.[41] Senate Majority Leader John Thune cast doubt on the fund.

The incident comes amid heightened Islamophobia in the US, where politicians and commentators have repeatedly launched broadsides targeting the Muslim community.

US Congressman Randy Fine, an ally of President Donald Trump, said late last year that Muslims should “be destroyed”.

Later on Monday, Laura Loomer, a right-wing activist close to Trump, said the ICSD should be raided by the FBI and immigration authorities.

She shared a 2023 social media post by the wife of the mosque’s imam, accusing Israel of killing children.

“The mosque that was ‘supposedly’ shot up today,” Loomer wrote in an accompanying post. “Just remember the people who attend this mosque want us all to be killed.”

#

The net result is a chip with a lot of compute and a lot of SRAM that is blisteringly fast to access. To put it in numbers, the WSE-3 (Cerebras’ latest chip) has 44GB of on-chip SRAM at 21 PB/s of bandwidth; an H100 has 80GB of HBM at 3.35 TB/s. In other words, the WSE-3 has just over half the memory of an H100, but 6,000 times the memory bandwidth.

The reason to compare the WSE-3 to an H100 is that the H100 is the chip most used for inference — and inference is clearly what Cerebras is most well-suited for. You can use Cerebras chips for training, but the chip-to-chip networking story isn’t very compelling, which is to say that all of that compute and on-chip memory is mostly just sitting around; what is much more interesting is the idea of getting a stream of tokens at dramatically faster speed than you can from a GPU.

Note, however, that the limitation in terms of training also potentially applies in terms of inference: as long as everything fits in on-chip memory Cerebras’ speed is an incredible experience; the moment you need more memory, whether that be for a larger model or, more likely, a larger KV cache, then Cerebras doesn’t make much sense, particularly given the price.

At the same time, I do think there will be a market for Cerebras-style chips: right now the company is highlighting the usefulness of speed for coding — reasoning means a lot of tokens, which means that dramatically scaling up tokens-per-second equals faster thinking — but I think this is a temporary use case, for reasons I’ll explain in a bit. What does matter is how long humans are waiting for an answer, and as products like AI wearables become more of a thing, the speed of interaction, particularly for voice — which will be a function of token generation speed — will have a tangible effect on the user experience.

All of this falls under the banner of “inference”, but I think it will be increasingly clear that there is a difference between providing an answer — what I will call “answer inference” — and doing a task — what I will call “agentic inference.” Cerebras’ target market is “answer inference”; in the long run, I think the architecture for “agentic inference” will look a lot different, not just from Cerebras’ approach, but from the GPU approach as well.

I mentioned above that fast inference for coding is a temporary use case. Specifically, coding with LLMs requires a human in the loop. It’s the human that defines what is to be coded, checks the work, commits the pull request, etc.; it’s not hard to envision a future, however, where all of this is completely handled by machines. This will apply to agentic work broadly: the true power of agents will not be that they do work for humans, but rather that they do work without human involvement at all.

This, by extension, will mean that the likely best approach to solving agentic inference will look a lot different than answer inference. The most important aspect for answer inference is token speed; the most important aspect for agentic inference, however, is memory. Agents need context, state, and history. Some of that will live as active KV cache; some will live in host memory or SSDs; much of it will live in databases, logs, embeddings, and object stores. The important point is that agentic inference will be less about GPUs answering a question and more about the memory hierarchy wrapped around a model.

Critically, this articulation of an agentic-specific memory hierarchy implies a necessary trade-off of speed for capacity. Here’s the thing, though: lower speed isn’t nearly as important a consideration if there isn’t a human in the loop. If an agent is waiting around for a job that is being run overnight, the agent doesn’t know or care about the user experience impact; what is most important is being able to accomplish a task, and if entirely new approaches to memory make that possible, then delays are fine.

Meanwhile, if delays are fine, then all of the focus on pure compute power and high-bandwidth memory seems out of place: if latency isn’t the top priority, then slower and cheaper memory — like traditional DRAM, for example — makes a lot more sense. And if the entire system is mostly waiting on memory, then chips don’t need to be as fast as the cutting edge either. This represents a profound shift in future architectures, but it also doesn’t mean that current architectures are going away: