#llm

Public notes from activescott tagged with #llm

Monday, December 8, 2025

In 2024, SWE-bench & SWE-agent helped kickstart the coding agent revolution.

We now ask: What if SWE-agent was 100x smaller, and still worked nearly as well?

mini is for

Researchers who want to benchmark, fine-tune or RL without assumptions, bloat, or surprises
Developers who like their tools like their scripts: short, sharp, and readable
Engineers who want something trivial to sandbox & to deploy anywhere

Here's some details:

Minimal: Just 100 lines of python (+100 total for env, model, script) — no fancy dependencies!
Powerful: Resolves >74% of GitHub issues in the SWE-bench verified benchmark (leaderboard).
Convenient: Comes with UIs that turn this into your daily dev swiss army knife!
Deployable: In addition to local envs, you can use docker, podman, singularity, apptainer, and more
Tested: Codecov
Cutting edge: Built by the Princeton & Stanford team behind SWE-bench and SWE-agent.

Rnj-1 is an 8B model that roughly follows the open-source Gemma 3 architecture. We employ global self-attention and YaRN to extend the context to 32k. The Rnj-1 Base and Instruct models compare favorably against similarly sized open weight models.

Rnj-1 Instruct dominates the pack on Agentic coding, one of our target abilities. SWE bench performance is indicative of the model's ability to tackle everyday software engineering tasks. We are an order of magnitude stronger than comparably sized models on SWE-bench and approach the capabilities available in much larger models (leaderboard: SWE-bench-Verified bash-only).

Wednesday, December 3, 2025

Some interesting subtle things he ever so briefly mentions that I think are notable:

  • "Everybody uses all the products": This translates to each person deeply knows what each product does, it's use cases and how users use each product because they are users of the product. He mentioned this in the context of "developers just commit to other products" - They will just download the repo and submit a PR. He mentions the value of Claude in that process, which I know it is, but takes for granted the value of knowing the product.
  • While I'm sure these products have complex coding challenges, they're all well defined and narrowly scoped. I think it's much harder to describe a complex application or set of applications using proprietary services with sometimes odd design choices, and integrating with external proprietary services. With that said, I find AI to be exceptional at helping to understand complex code across many services and frontend components - maybe even more valuable than writing the code. However, it still is non-trivial. It also doesn't help with knowing what to build for your customer.

Tuesday, December 2, 2025

Saturday, November 29, 2025

Today, we release INTELLECT-3, a 100B+ parameter Mixture-of-Experts model trained on our RL stack, achieving state-of-the-art performance for its size across math, code, science and reasoning benchmarks, outperforming many larger frontier models.

Our complete recipe — from the model weights and training frameworks, to our datasets, RL environments, and evaluations — has been open-sourced, with the goal of encouraging more open research on large scale reinforcement learning.

#

Thursday, November 27, 2025

Mayo Clinic adopted a reverse RAG technique that effectively eliminated data retrieval hallucinations in their tests. In a traditional RAG setup, an LLM retrieves context from a knowledge source before generating an answer. Mayo’s reverse RAG flips this process: the model first extracts or summarizes information, then links every data point in its output back to the document. By forcing the AI to provide a reference for each fact, Mayo virtually eliminated hallucinations in non-diagnostic use cases, building clinician trust in the results.

The workflow looks like this:

  1. Data Extraction — The LLM/OCR/API reads the patient’s records (e.g. discharge summaries or outside medical files) and produces a summary or list of facts. This initial output might include details as patient age, diagnoses, lab results, etc.
  2. Fact Splitting — The AI output is split into individual facts or data points. Each sentence or key piece of information from the summary is treated separately.
  3. Source Matching — For each fact, the system searches the patient’s records (using a vector database of document embeddings) to locate the original source text that supports that fact. Essentially, the AI is asked: “Where did this piece of information come from?” Every fact must be matched to a snippet in the records (for example, the patient’s age is verified from the admission note, a lab value from the lab report, etc.).
  4. Verification — A second LLM then compares each fact to the retrieved source text and scores how well they align. It checks that the fact is truly supported by the source and not a misunderstanding or fabrication. Mayo’s team even looked for a causal relationship — ensuring the context implies that fact, not just a coincidental mention.
  5. Output with References — Only facts with solid support are kept. The final output is delivered with inline citations or links to the original records for every data point. This means physicians can click a link and see exactly where each piece of information came from, ensuring transparency and trust.

Wednesday, November 26, 2025

LLM agents are vulnerable to prompt injection attacks when handling untrusted data. In this paper we propose CaMeL, a robust defense that creates a protective system layer around the LLM, securing it even when underlying models are susceptible to attacks. To operate, CaMeL explicitly extracts the control and data flows from the (trusted) query; therefore, the untrusted data retrieved by the LLM can never impact the program flow. To further improve security, CaMeL uses a notion of a capability to prevent the exfiltration of private data over unauthorized data flows by enforcing security policies when tools are called.

Visit a Reddit post with Comet and ask it to summarize the thread, and malicious instructions in a post there can trick Comet into accessing web pages in another tab to extract the user's email address, then perform all sorts of actions like triggering an account recovery flow and grabbing the resulting code from a logged in Gmail session.

Antigravity is Google’s new agentic code editor. In this article, we demonstrate how an indirect prompt injection can manipulate Gemini to invoke a malicious browser subagent in order to steal credentials and sensitive code from a user’s IDE.

Google’s approach is to include a disclaimer about the existing risks, which we address later in the article.

Thursday, November 20, 2025

Tuesday, November 18, 2025

Sunday, November 16, 2025

Monday, November 10, 2025

Be patient. Not afraid.

For layoffs in the tech sector, a likely culprit is the financial stress that companies are experiencing because of their huge spending on AI infrastructure. Companies that are spending a lot with no significant increases in revenue can try to sustain profitability by cutting costs. Amazon increased its total CapEx from $54 billion in 2023 to $84 billion in 2024, and an estimated $118 billion in 2025. Meta is securing a $27 billion credit line to fund its data centers. Oracle plans to borrow $25 billion annually over the next few years to fulfill its AI contracts. 

“We’re running out of simple ways to secure more funding, so cost-cutting will follow,” Pratik Ratadiya, head of product at AI startup Narravance, wrote on X. “I maintain that companies have overspent on LLMs before establishing a sustainable financial model for these expenses.”

We’ve seen this act before. When companies are financially stressed, a relatively easy solution is to lay off workers and ask those who are not laid off to work harder and be thankful that they still have jobs. AI is just a convenient excuse for this cost-cutting.

Last week, when Amazon slashed 14,000 corporate jobs and hinted that more cuts could be coming, a top executive noted the current generation of AI is “enabling companies to innovate much faster than ever before.” Shortly thereafter, another Amazon rep anonymously admitted to NBC News that “AI is not the reason behind the vast majority of reductions.” On an investor call, Amazon CEO Andy Jassy admitted that the layoffs were “not even really AI driven.”

We have been following the slow growth in revenues for generative AI over the last few years, and the revenues are neither big enough to support the number of layoffs attributed to AI, nor to justify the capital expenditures on AI cloud infrastructure. Those expenditures may be approaching $1 trillion for 2025, while AI revenue—which would be used to pay for the use of AI infrastructure to run the software—will not exceed $30 billion this year. Are we to believe that such a small amount of revenue is driving economy-wide layoffs?

Friday, October 31, 2025

This software is not made for making the Crawlers go away. It is an aggressive defense mechanism that tries its best to take the blunt of the assault, serve them garbage, and keep them off of upstream resources. Even though a lot of work went into making iocaine efficient, and nigh invisible for the legit visitor, it is an aggressive defender nevertheless, and will require a few resources - a whole lot less than if you’d let the Crawlers run rampant, though.

lol

It works by generating an endless sequences of pages, each of which with dozens of links, that simply go back into a the tarpit. Pages are randomly generated, but in a deterministic way, causing them to appear to be flat files that never change. Intentional delay is added to prevent crawlers from bogging down your server, in addition to wasting their time. Lastly, Markov-babble is added to the pages, to give the crawlers something to scrape up and train their LLMs on, hopefully accelerating model collapse.