#search + #llm

Public notes from activescott tagged with both #search and #llm

Friday, May 22, 2026

Dataset

We evaluated search providers against five open benchmarks covering complementary aspects of agentic search: BrowseComp (hard multi-hop questions that require navigating the live web), Frames (multi-document factoid reasoning), FreshQA (time-sensitive questions where the correct answer depends on recent web information), HLE (Humanity's Last Exam — expert-level academic questions spanning math, science, and humanities), SealQA (ambiguity-robust factoid QA with intentionally misleading snippets), WebWalker (tasks designed around following links across pages to find an answer).

Evaluation methodology

Every task is run through a shared deep-research harness: a single GPT-5.4 agent is given two tools (web search and web fetch) with an iterative budget of up to MAX_TOOL_CALLS=25 tool calls per question. The agent plans sub-queries, fans out searches, fetches specific pages when snippets are insufficient, and returns an answer when it exhausts the number of allowed tool calls or has sufficient information to answer the question. Each answer is then LLM-graded by GPT-5.4. We report accuracy of the final answer.

We measure accuracy and overall cost, which includes LLM token costs and tool call costs.

Testing dates

April 19-21, 2026

The highest accuracy web search for your AI

Why use Parallel Search vs. the default search in Claude?

Parallel runs its own web-scale index (billions of pages, millions added daily) and returns dense, query-relevant excerpts instead of raw HTML or SEO-ranked snippets. On public benchmarks, Parallel outperforms the default search in leading frontier models. Your agent reaches the right answer in fewer round trips and with less wasted context. – https://parallel.ai/blog/free-web-search-mcp