#llm + #agents

Public notes from activescott tagged with both #llm and #agents

Sunday, February 15, 2026

Sunday, February 1, 2026

Wednesday, January 28, 2026

An interesting tool that uses Playwright to extract page structure, apparently based on the accessibility roles and geometry of "important" elements, and feeds that structure to an execution agent that processes the page results. Important elements are ranked somehow, and geometry is inferred from those elements.
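As a rough sketch of how such ranking might work (the tool's actual heuristics aren't documented here; `rank_elements`, the role weights, and the element fields below are all assumptions):

```python
# Hypothetical ranking of page elements by accessibility role and geometry.
# Role weights are illustrative assumptions, not the tool's actual rules.
ROLE_WEIGHTS = {"button": 3.0, "textbox": 2.5, "link": 2.0, "heading": 1.5}

def rank_elements(elements):
    """Score each element by role importance times on-screen area."""
    def score(el):
        area = el["width"] * el["height"]
        return ROLE_WEIGHTS.get(el["role"], 0.5) * area
    return sorted(elements, key=score, reverse=True)

elements = [
    {"role": "heading", "name": "Results", "width": 300, "height": 40},
    {"role": "button", "name": "Submit", "width": 120, "height": 40},
    {"role": "link", "name": "Next page", "width": 80, "height": 20},
]
ranked = rank_elements(elements)
```

In a real implementation the roles and bounding boxes would come from Playwright's accessibility tree and `bounding_box()` rather than hand-written dicts.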

It also relies on Jest-style assertions to explicitly assert whether each step succeeded or failed.
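A minimal sketch of what Jest-style step assertions could look like in Python (the `expect` helper below is hypothetical, not the tool's API):

```python
class Expectation:
    """Tiny Jest-style matcher: expect(actual).to_be(expected)."""
    def __init__(self, actual):
        self.actual = actual

    def to_be(self, expected):
        if self.actual != expected:
            raise AssertionError(f"expected {expected!r}, got {self.actual!r}")
        return True

    def to_contain(self, item):
        if item not in self.actual:
            raise AssertionError(f"{self.actual!r} does not contain {item!r}")
        return True

def expect(actual):
    return Expectation(actual)

# A step "passes" if its assertion does not raise.
step_ok = expect("Checkout complete").to_contain("complete")
```

The appeal for agents is that success is decided by an explicit check rather than by asking the model whether the step looked right.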

Friday, January 23, 2026

Monday, December 8, 2025

We introduce the Berkeley Function Calling Leaderboard (BFCL), the first comprehensive and executable function call evaluation dedicated to assessing Large Language Models' (LLMs) ability to invoke functions. Unlike previous evaluations, BFCL accounts for various forms of function calls, diverse scenarios, and executability.
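The "executable" angle can be illustrated with a toy checker that parses a model's emitted function call and compares it against an expected call using Python's `ast` module (a sketch of the general idea, not BFCL's actual harness):

```python
import ast

def parse_call(call_str):
    """Extract the function name and keyword arguments from a call string
    like 'get_weather(city="Paris")'."""
    node = ast.parse(call_str, mode="eval").body
    if not isinstance(node, ast.Call):
        raise ValueError("not a function call")
    name = node.func.id
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return name, kwargs

def calls_match(model_output, expected):
    """Compare calls structurally, so quoting/whitespace differences don't matter."""
    return parse_call(model_output) == parse_call(expected)
```

Structural comparison like this is more forgiving than string matching, which is one reason function-call evals tend to parse rather than diff raw output.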

In 2024, SWE-bench & SWE-agent helped kickstart the coding agent revolution.

We now ask: What if SWE-agent was 100x smaller, and still worked nearly as well?

mini is for

Researchers who want to benchmark, fine-tune or RL without assumptions, bloat, or surprises
Developers who like their tools like their scripts: short, sharp, and readable
Engineers who want something trivial to sandbox & to deploy anywhere

Here are some details:

Minimal: Just 100 lines of python (+100 total for env, model, script) — no fancy dependencies!
Powerful: Resolves >74% of GitHub issues on the SWE-bench Verified benchmark (leaderboard).
Convenient: Comes with UIs that turn this into your daily dev swiss army knife!
Deployable: In addition to local envs, you can use docker, podman, singularity, apptainer, and more
Tested: Codecov
Cutting edge: Built by the Princeton & Stanford team behind SWE-bench and SWE-agent.
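The core of such a minimal agent can be sketched as a loop that asks the model for a shell command, runs it, and feeds the output back (the stubbed `model` and the `DONE` stop convention below are assumptions for illustration, not mini's actual implementation):

```python
import subprocess

def run_agent(model, task, max_steps=10):
    """Minimal agent loop: the model proposes shell commands; we execute
    each one and append its output to the transcript."""
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        command = model(history)
        if command == "DONE":  # assumed stop convention
            break
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        history.append(f"$ {command}\n{result.stdout}{result.stderr}")
    return history

# Stubbed "model" that scripts two steps, standing in for an LLM call.
def scripted_model(history):
    return "echo hello from the agent" if len(history) == 1 else "DONE"

transcript = run_agent(scripted_model, "say hello")
```

Keeping the environment interface down to "run a shell command, read the output" is also what makes an agent like this trivial to sandbox in docker, podman, or apptainer.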

Sunday, November 16, 2025