Fact, Fetch, and Reason: A Unified Evaluation of Retrieval-Augmented Generation

Created 5/23/2026 at 5:12:41 PM

FRAMES offers a unified framework for assessing LLM performance in end-to-end RAG scenarios. Our dataset comprises challenging multi-hop questions that require integrating information from multiple sources. Baseline results show that even state-of-the-art LLMs struggle, achieving 0.408 accuracy without retrieval. Our proposed multi-step retrieval pipeline raises accuracy to 0.66, a relative improvement of more than 50% over that baseline.
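
The improvement figure quoted above is easiest to read as a relative gain over the no-retrieval baseline, not as a difference in accuracy points. The sketch below reproduces that arithmetic from the two accuracies reported in the summary; the function and variable names are illustrative assumptions, not part of the FRAMES release.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Accuracies as reported in the summary above.
no_retrieval_accuracy = 0.408
multi_step_accuracy = 0.66

absolute_gain = multi_step_accuracy - no_retrieval_accuracy
relative_gain = relative_improvement(no_retrieval_accuracy, multi_step_accuracy)

print(f"absolute gain: {absolute_gain:.3f}")
print(f"relative gain: {relative_gain:.1%}")
```

Under these numbers the absolute gain is about 0.25 accuracy points and the relative gain is roughly 62%, which is consistent with the ">50%" claim.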

benchmarks evaluations llm

Public