freshllms/freshqa: Data and code for FreshLLMs (https://arxiv.org/abs/2310.03214)
Data and code for our paper FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation.
We believe that human evaluators possess the expertise and common sense required to detect issues like hallucinations, making them more reliable than automated evaluation metrics for assessing LLMs' factuality. However, researchers have the flexibility to adjust their evaluation methods if human evaluation proves challenging. An easily implemented alternative is to use standard metrics like F1/exact match or recall, which assess the overlap between the model response and the ground truth answer(s) (e.g., see You.com's recent blog where they report FreshQA recall). Researchers can also use LLM-based automatic evaluation metrics such as FactScore or our FreshEval metric below.
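The overlap-based metrics mentioned above can be sketched in a few lines. This is a minimal illustration, not the scoring code used in the paper or by any third party: the normalization (lowercasing, stripping punctuation and English articles) follows common QA-evaluation practice and is an assumption here, and the function names `normalize`, `exact_match`, `recall`, and `f1` are hypothetical. Each metric takes the best score over all ground-truth answers.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace.

    Assumption: this normalization scheme is a common convention, not one
    prescribed by FreshQA.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def _overlap(response, answer):
    """Return (shared token count, response length, answer length)."""
    resp_tokens = normalize(response).split()
    ans_tokens = normalize(answer).split()
    shared = sum((Counter(resp_tokens) & Counter(ans_tokens)).values())
    return shared, len(resp_tokens), len(ans_tokens)


def exact_match(response, answers):
    """True if the normalized response equals any normalized answer."""
    return any(normalize(response) == normalize(a) for a in answers)


def recall(response, answers):
    """Best fraction of answer tokens that appear in the response."""
    best = 0.0
    for answer in answers:
        shared, _, n_ans = _overlap(response, answer)
        if n_ans:
            best = max(best, shared / n_ans)
    return best


def f1(response, answers):
    """Best token-level F1 between the response and any answer."""
    best = 0.0
    for answer in answers:
        shared, n_resp, n_ans = _overlap(response, answer)
        if shared == 0:
            continue
        precision = shared / n_resp
        rec = shared / n_ans
        best = max(best, 2 * precision * rec / (precision + rec))
    return best
```

Recall tends to suit long-form responses better than exact match or F1, because a verbose but correct answer is not penalized for the extra tokens. None of these metrics can detect a hallucinated claim that sits next to the correct answer string.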
To facilitate future evaluations, we have developed FreshEval, a simple automatic metric that uses few-shot in-context learning to teach an LLM to judge model responses. FreshEval achieved high agreement with human raters (see Appendix B in our paper for details).
To use FreshEval under a specific evaluation mode (Relaxed or Strict), please follow the instructions below:
1. Make a copy of our latest data spreadsheet and store it in your Google Drive with a new filename (e.g., fresheval_relaxed or fresheval_strict).
2. Insert 3 new columns D, E, and F in the new spreadsheet for model responses, evaluation rating, and evaluation explanation, respectively, and save your model's responses in column D (see our sample evaluation spreadsheet below).
3. Run the associated FreshEval notebook with the evaluation mode. Note that for demonstration purposes, we evaluated only the first 10 model responses. You can adjust the number as needed.
Note: Currently, we recommend gpt-4-1106-preview over gpt-4-0125-preview for FreshEval as it yielded slightly better agreement with human annotations in our small-scale evaluation.
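If you prefer to prepare the sheet offline before uploading it to Google Drive, the column layout described above (model responses in D, rating in E, explanation in F) can be sketched as follows. This is only an illustration under stated assumptions: it assumes the spreadsheet has been exported to rows of strings with a single header row, and the header labels in `NEW_HEADERS` and the helper name `insert_eval_columns` are hypothetical. The actual data spreadsheet may use a different header layout, so check it before relying on this.

```python
# Hypothetical labels for the three inserted columns (D, E, F).
NEW_HEADERS = ["model_response", "rating", "explanation"]


def insert_eval_columns(rows, responses, at=3):
    """Insert three columns starting at zero-based index `at` (3 == column D).

    rows:      list of rows, each a list of strings; rows[0] is assumed to be
               the single header row (an assumption about the export).
    responses: one model response per data row, in the same order.
    Existing columns from index `at` onward shift three places to the right.
    The rating and explanation cells are left empty for the evaluator to fill.
    """
    header, body = rows[0], rows[1:]
    if len(responses) != len(body):
        raise ValueError("need exactly one response per data row")
    out = [header[:at] + NEW_HEADERS + header[at:]]
    for row, response in zip(body, responses):
        out.append(row[:at] + [response, "", ""] + row[at:])
    return out
```

The resulting rows can be written with Python's `csv.writer` and imported into a new Google Sheet.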