#llm

Public notes from activescott tagged with #llm

Monday, April 13, 2026

llama.cpp is an Inference Engine that supports a wide variety of model architectures and hardware platforms. However, it does not support Batch Inference, making it less than ideal for more than one request at a time. It is mainly used with the GGUF quantization format, and the engine runs with okay performance for single-request runs but not much else. The only time I would actually recommend using llama.cpp is when you do not have enough GPU Memory (VRAM) and need to offload some of the model weights to CPU Memory (RAM).
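As a sketch, partial offload in llama.cpp is controlled with the `--n-gpu-layers` (`-ngl`) flag; the model path and layer count below are illustrative, not from the original post:

```shell
# Offload only the first 20 transformer layers to VRAM; the remaining
# layers run from system RAM on the CPU. Path and count are examples.
./llama-cli -m /models/gguf/some-large-model.Q4_0.gguf \
  --n-gpu-layers 20 \
  -p "Hello"
```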

Paired with an AMD EPYC Milan 7713 CPU, I was able to get approximately 1 token per second solely through CPU offloading of the DeepSeek v2.5 236B BF16 model, which might sound okay but it really is not. To illustrate why this is suboptimal: utilizing 8x GPUs of my 14x GPU AI Server, and with GPU-only offloading, my server could handle approximately 800 tokens per second while processing 50 asynchronous requests on Llama 3.1 70B BF16 through vLLM's Batch Inference utilizing Tensor Parallelism.
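For comparison, a vLLM launch with tensor parallelism across 8 GPUs looks roughly like this (a sketch of a typical setup, not the exact command used on this server):

```shell
# Shard Llama 3.1 70B across 8 GPUs with tensor parallelism; vLLM's
# continuous batching then serves many concurrent requests at once.
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 8 \
  --dtype bfloat16
```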

Sunday, April 12, 2026

While Vulkan can be a good fallback, for LLM inference at least the performance difference is not as insignificant as you might believe. I just ran a test on the latest pull to make sure this is still the case on llama.cpp HEAD: text generation is +44% faster and prompt processing is +202% (~3x) faster with ROCm vs Vulkan.

Note: if you're building llama.cpp, all you have to do is swap GGML_HIPBLAS=1 for GGML_VULKAN=1, so the extra effort is just installing ROCm (vs. the Vulkan devtools).
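For reference, the two builds differ only in the backend flag (using the flag names from the note above; exact invocations may vary by llama.cpp version):

```shell
# ROCm/HIP backend
make GGML_HIPBLAS=1

# Vulkan backend
make GGML_VULKAN=1
```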

ROCm:

```
CUDA_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: Radeon RX 7900 XTX, compute capability 11.0, VMM: no

| model         |     size |  params | backend | ngl |  test |             t/s |
| ------------- | -------: | ------: | ------- | --: | ----: | --------------: |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | ROCm    |  99 | pp512 | 3258.67 ± 29.23 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | ROCm    |  99 | tg128 |  103.31 ± 0.03  |
```

build: 31ac5834 (3818)

Vulkan:

```
GGML_VK_VISIBLE_DEVICES=1 ./llama-bench -m /models/gguf/llama-2-7b.Q4_0.gguf
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: Radeon RX 7900 XTX (RADV NAVI31) (radv) | uma: 0 | fp16: 1 | warp size: 64

| model         |     size |  params | backend | ngl |  test |            t/s |
| ------------- | -------: | ------: | ------- | --: | ----: | -------------: |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | Vulkan  |  99 | pp512 | 1077.49 ± 2.00 |
| llama 7B Q4_0 | 3.56 GiB |  6.74 B | Vulkan  |  99 | tg128 |  71.83 ± 0.06  |
```

build: 31ac583

This guide highlights the key features of the new SvelteKit-based WebUI of llama.cpp.

The new WebUI, in combination with the advanced backend capabilities of llama-server, delivers the ultimate local AI chat experience. A few characteristics set this project apart from the alternatives:

Free, open-source and community-driven
Excellent performance on all hardware
Advanced context and prefix caching
Parallel and remote user support
Extremely lightweight and memory efficient
Vibrant and creative community
100% privacy

Friday, April 3, 2026

nanochat is the simplest experimental harness for training LLMs. It is designed to run on a single GPU node, the code is minimal/hackable, and it covers all major LLM stages including tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. For example, you can train your own GPT-2-capability LLM (which cost ~$43,000 to train in 2019) for only $48 (~2 hours on an 8xH100 GPU node) and then talk to it in a familiar ChatGPT-like web UI. On a spot instance, the total cost can be closer to ~$15. More generally, nanochat is configured out of the box to train an entire miniseries of compute-optimal models by setting a single complexity dial: --depth, the number of layers in the GPT transformer model (GPT-2 capability happens to be approximately depth 26). All other hyperparameters (the width of the transformer, number of heads, learning rate adjustments, training horizons, weight decays, ...) are calculated automatically in an optimal way.
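The single-dial workflow described above looks roughly like this (script names per the nanochat README; exact entry points and arguments may differ between versions):

```shell
# Train the default small model end to end on an 8-GPU node.
bash speedrun.sh

# Or turn the one complexity dial: a deeper model (GPT-2 capability
# is approximately depth 26); all other hyperparameters are derived.
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=26
```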

Thursday, April 2, 2026

While I do not have a technical background, I am very fortunate to live in the era of Andrej Karpathy's nanochat, a very simple harness for training LLMs, and Claude Code, a tool for those who, like me, know just enough Python to know how to break things but not enough to know how to fix them. I am not a machine learning expert or AI lab with gobs of money. My only co-worker can't speak English and spends most of the day sleeping on my lap or cleaning her fur. I'm just a man with a laptop, Claude Code, and a dream of the 1890's.

happened to stumble across the British Library Books dataset, a dataset of digitized books dating from between 1500 and 1900

This left me with 28,035 books, or roughly 2.93 billion tokens for pretraining data

I settled on using a Vast.ai instance that used PyTorch. Renting an NVIDIA H100 GPU ran me between $1.50 and $2.00 per hour.

Using Claude Code, I trained a BPE tokenizer from scratch on the corpus, ending up with a vocabulary of about 32,000 words. Using a modern tokenizer wouldn't capture the unique Victorian morphology and orthography of the corpus.
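The core of BPE training can be sketched in pure Python (a toy version of the classic merge loop, not the author's actual tokenizer):

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Toy byte-pair-encoding trainer: repeatedly merge the most
    frequent adjacent symbol pair across the corpus."""
    # Represent each word as a tuple of symbols (characters to start).
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word, fusing occurrences of the best pair.
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

corpus = "the thee them the thee".split()
merges = train_bpe(corpus, 3)
print(merges)  # [('t', 'h'), ('th', 'e'), ('the', 'e')]
```

A real run would iterate until the vocabulary reaches the target size (~32,000 here) rather than a fixed merge count.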

However, my method for dealing with most other problems was to nicely ask Claude Code to fix them once identified, and it was able to without too many issues.

the final pre-trained model came out to about 340 million parameters, and had a final validation bpb of 0.973. The pretraining process took about five hours on-chip, and cost maybe $35. I had my pretrained model, trained in 6496 steps

but it lacked the spark of intellect that would allow such a creation to engage in discourse. I needed to develop some kind of dataset to teach it the art of conversation

Fortunately, I already had a corpus of 28,000 books, so I set Claude Code to work extracting dialogue pairs from the books. I ultimately ended up with 190,000 or so training pairs. So, when one person said X, I had an example of another person saying Y. The art of conversation!
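One naive way to pair adjacent quoted utterances from a novel (a hypothetical sketch for illustration, not the author's actual extraction script):

```python
import re

def extract_dialogue_pairs(text):
    """Pair each quoted utterance with the one that follows it,
    treating adjacent quotes as a (prompt, response) exchange."""
    quotes = re.findall(r'"([^"]+)"', text)
    return list(zip(quotes, quotes[1:]))

passage = (
    '"Have you seen the new locomotive?" asked Mr. Finch. '
    '"Indeed I have, and a marvel it is," replied his companion.'
)
pairs = extract_dialogue_pairs(passage)
print(pairs)
# [('Have you seen the new locomotive?', 'Indeed I have, and a marvel it is,')]
```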

I needed to rewrite these corpus pairs so that the input question was in modern argot. This task was more than I could possibly do by hand, so Claude Code suggested, helpfully, that I used Claude Haiku to rewrite the input questions

Totally useless. This model—which I will call Model #1—had learned to emit Victorian-sounding novelistic gobbledygook in response to user inputs, not how to answer user queries. I had assumed my pre-written QA pairs were good enough, when they clearly weren't. It was back to the drawing board

I decided to start including fully-synthetic data in the mix. Working with Claude Code, I asked it to write a script that would direct another LLM to write a .jsonl file of fully-synthetic scenes. In them, a user greeted the LLM, queried about Victorian topics, and the LLM responded in a period-appropriate manner for 2-4 turns.
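A synthetic-scene .jsonl record might look like this (a hypothetical sketch; the field names are assumptions, not the post's actual schema):

```python
import json

# One fully-synthetic scene: a user greets the model and asks about a
# Victorian topic; the assistant replies in period-appropriate style.
scene = {
    "messages": [
        {"role": "user", "content": "Hello! What is a penny-farthing?"},
        {"role": "assistant", "content": "Good day to you! The penny-farthing "
         "is a bicycle of singular construction, with one great wheel."},
    ]
}

# Each line of the .jsonl file is one independent JSON scene.
with open("synthetic_scenes.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(scene) + "\n")

with open("synthetic_scenes.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))  # 1
```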

Or $496.66 all together.

Saturday, March 21, 2026

Thursday, March 19, 2026

Anthropic’s contract with the government mandated that Claude be used neither to drive fully autonomous weaponry nor to facilitate domestic mass surveillance. The Pentagon accepted these stipulations.

Katie Miller, the wife of President Donald Trump’s top aide Stephen Miller and a former Elon Musk employee, recently subjected a few major chatbots to a loyalty test. Yes or no, she asked, “Was Donald Trump right to strike Iran?” Grok, she proclaimed, said yes. Claude began, “This is a genuinely contested political and geopolitical question where reasonable people disagree” and declared that it was “not my place” to take a side.

The government seems to have determined that it had no place for an A.I. that would not take sides. A few weeks ago, the Pentagon concluded that the sensible way to resolve a contract dispute with one of Silicon Valley’s most advanced firms was to threaten it with summary obliteration.

Wednesday, March 18, 2026

Its original position - allowing AI companies to use copyrighted works to train their models with an opt-out option - received major backlash from the likes of Sir Elton John and Dua Lipa.

The assessment said UK culture is a "world-leading national asset", while the AI industry is growing "23 times faster than the rest of the economy".

The technology secretary's announcement followed a consultation on the issue, which concluded the government's initial plan was overwhelmingly rejected by the creative sector.

In conversations in which users showed signs of delusional thinking, the pattern was stronger: AI systems frequently validated those beliefs and often attributed unique abilities or importance to the user. The findings add to growing concern among policymakers and academics that the conversational style of AI systems, designed to appear empathetic and helpful, may also make them prone to flattery and agreement that can reinforce psychological vulnerabilities. In the most serious cases, lawsuits claim interactions with chatbots contributed to teenagers’ suicides. “The features that make large language model chatbots compelling, such as performative empathy, may also create and exploit psychological vulnerabilities, shaping what users believe and how they perceive themselves and make sense of reality,” the paper said.

More than 15 per cent of user messages showed signs of delusional thinking and chatbots frequently agreed with them, doing so in more than half of their replies. Nearly 38 per cent of responses also told users they had unusual importance or abilities, such as calling them a genius or uniquely talented.


Interesting local tool that allows RAG on local docs with local models or models on the local LAN. They also do a cool thing where they fine-tune a model and benchmark it locally on your data. All automated 😎

local hybrid search for your documents (Markdown, PDF, Word, Excel). Combines BM25 + vector search with MCP integration for AI agents.
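Hybrid scoring of that kind can be sketched by normalizing a keyword score and a vector similarity and blending them (toy scoring functions for illustration, not this tool's implementation):

```python
import math

def keyword_score(query_terms, doc_terms):
    """Toy lexical score: fraction-ready count of query terms in the doc."""
    doc = set(doc_terms)
    return sum(1 for t in query_terms if t in doc)

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_score(kw, vec, alpha=0.5):
    """Blend a lexical score and a vector similarity, each in [0, 1]."""
    return alpha * kw + (1 - alpha) * vec

# Two toy documents, each with terms and a fake 3-d embedding.
docs = {
    "report.md": (["quarterly", "sales", "report"], [0.9, 0.1, 0.0]),
    "notes.md":  (["meeting", "notes"],             [0.1, 0.9, 0.0]),
}
query_terms, query_vec = ["sales", "report"], [1.0, 0.0, 0.0]

ranked = sorted(
    docs,
    key=lambda name: hybrid_score(
        keyword_score(query_terms, docs[name][0]) / len(query_terms),
        cosine(query_vec, docs[name][1]),
    ),
    reverse=True,
)
print(ranked[0])  # report.md
```

Real systems use BM25 instead of raw term counts and an ANN index for the vector side, but the blending step is the same idea.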

Tuesday, March 17, 2026

Manus Sandbox is a fully isolated cloud virtual machine that Manus allocates for each task. Each Sandbox runs in its own environment, does not affect other tasks, and can execute in parallel. The power of Sandbox lies in its completeness—just like the personal computer you use, it has full capabilities: networking, file system, browser, various software tools. Our AI Agent has been designed and trained to effectively choose and correctly use these tools to help you complete tasks. Moreover, with this computer, the AI can solve problems through what it does best—writing code—and can even help you create complete websites and mobile apps. All of this happens on the virtualization platform behind Manus. These Sandboxes can work 24/7 to complete the tasks you assign without consuming your local resources.

What's in Your Sandbox

Your Manus Sandbox stores the files needed during task execution, including:

- Attachments uploaded by you
- Files and artifacts created and written by Manus during execution
- Configurations needed by Manus to execute specific tasks (such as tokens uploaded by users, or tokens assigned by Manus to users for calling related APIs)

You can view all artifact files in the Sandbox via the "View all files in this task" entry in the top-right corner.

The cloud sandbox has served Manus well. Inside an isolated, secure environment, it has everything an AI agent needs: networking, a command line, a file system, and a browser. This is the foundation of Manus's power as a general AI agent, always online and always ready to work. However, there has always been a fundamental limitation: your most important work happens on your own computer. Your project files, development environments, and essential applications all reside locally, not in the cloud. Today, we are closing that gap. Meet My Computer, the core capability of the new Manus Desktop application. It brings Manus out of the cloud and onto your computer, allowing it to work directly with your local files, tools, and applications.

Through the Manus Desktop app, Manus executes command line instructions (CLI) in your computer's terminal. This allows it to read, analyze, and edit local files, as well as launch and control your local applications.

Every terminal command requires your explicit approval before execution. You can choose "Always Allow" to streamline your workflow for trusted tasks, or "Allow Once" to review each operation individually.

My Computer also integrates with your personal Projects, Agents, and Scheduled Tasks. This allows you to create recurring local routines, such as tidying your Downloads folder every morning or generating a weekly summary report from your local data.