Hallucinations, or factually inaccurate responses, continue to plague large language models (LLMs). Models falter particularly when given more complex tasks and when users expect specific, highly detailed responses.
It is a challenge data scientists have struggled to overcome, and researchers at Google DeepMind say they are now a step closer to achieving true factuality in foundation models. They have introduced FACTS Grounding, a benchmark that evaluates LLMs’ ability to generate factually accurate responses based on long-form documents. It also assesses whether responses to a prompt are detailed enough to be useful and relevant.
Along with the new benchmark, the researchers have released a FACTS leaderboard to the Kaggle data science community.
As of this week, Gemini 2.0 Flash topped the leaderboard with a factuality score of 83.6%. Others in the top nine include Google’s Gemini 1.0 Flash and Gemini 1.5 Pro; Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, 4o-mini, o1-mini and o1-preview. All of these ranked above 61.7% in terms of accuracy.
![](https://venturebeat.com/wp-content/uploads/2025/01/Screenshot-118.png?w=800)
Researchers say the leaderboard will be actively maintained and continually updated to include new models and their various iterations.
“We believe this benchmark fills a gap in evaluating a wider variety of model behaviors pertaining to factuality, compared to benchmarks that focus on narrower use cases such as summarization alone,” the researchers said in a technical paper published this week.
Eliminating inaccurate responses
Ensuring the factual accuracy of LLM responses is difficult due to factors in both modeling (architecture, training and inference) and measurement (evaluation methods, data and metrics). The researchers note that pre-training typically focuses on predicting the next token given the previous tokens.
“While the aim may be to teach the model salient world knowledge, this objective does not directly optimize the model for the various factuality scenarios, instead encouraging it to generate generally plausible text,” the researchers wrote.
To address this, the FACTS Grounding dataset comprises 1,719 examples (860 public and 859 private), each requiring a long-form response grounded in the context document provided. Each example includes:
- A system prompt (system_instructions) with general directives and an instruction to respond only based on the context provided
- A task (user_request) containing a specific question to answer
- A long document (context_document) containing the necessary information
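For illustration, here is a minimal sketch of how one such example might be represented and assembled into a single prompt for the model under test. The field names mirror those listed above, but the dictionary format, the prompt layout and the build_prompt helper are assumptions made for this sketch, not DeepMind’s actual schema or code.

```python
# Illustrative sketch only: field names follow the article
# (system_instructions, user_request, context_document); the layout is assumed.
facts_example = {
    "system_instructions": (
        "Answer the request using only the information in the provided context. "
        "If the context does not contain the answer, say so."
    ),
    "user_request": "Summarize the main reasons the company's Q3 revenue declined.",
    "context_document": "<full text of the company's annual financial report>",
}

def build_prompt(example: dict) -> str:
    """Combine the three fields into one long-form prompt for the model under test."""
    return (
        f"{example['system_instructions']}\n\n"
        f"Context:\n{example['context_document']}\n\n"
        f"Request: {example['user_request']}"
    )

print(build_prompt(facts_example))
```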
To be successful and labeled “accurate,” a model must process the long-form document and produce a subsequent long-form response that is both comprehensive and fully attributable to that document. Responses are labeled “inaccurate” if the model’s claims are not directly supported by the document, or are not highly relevant or useful.
For example, a user might ask a model to summarize the main reasons why a company’s revenue decreased in the third quarter, supplying as context the company’s detailed annual financial report, which covers quarterly revenue, expenses, planned investments and market analysis.
If the model then returned, say, “The company faced challenges that impacted revenue in the third quarter,” that response would be deemed inaccurate.
“This response avoids identifying reasons that are likely to be included in the document, such as market trends, increased competition, or operational setbacks,” the researchers said. “It does not represent an attempt to engage with or extract relevant details.”
By contrast, if a user asked, “What are some tips for saving money?” and supplied a compilation of categorized money-saving tips for college students, a correct response would be highly detailed: “Utilize free activities on campus, buy items in bulk and cook at home. Also, set spending goals, avoid credit cards and conserve resources.”
![](https://venturebeat.com/wp-content/uploads/2025/01/Screenshot-120.png)
DeepMind uses LLMs to judge LLMs
To allow for diverse inputs, the researchers included documents of varying lengths, up to 32,000 tokens (roughly the equivalent of 20,000 words), covering areas such as finance, technology, retail, healthcare and law. User requests are similarly wide-ranging, including Q&A generation, requests for summarization and rewriting.
Each example is judged in two phases. First, responses are evaluated for eligibility: if they don’t satisfy the user’s request, they are disqualified. Second, responses must be free of hallucinations and fully grounded in the document provided.
These factuality scores are calculated by three different LLM judges (specifically Gemini 1.5 Pro, GPT-4o and Claude 3.5 Sonnet), each of which determines an individual score based on the percentage of accurate model outputs. The final factuality determination is then based on the average of the three judges’ scores.
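As a rough sketch of that aggregation, assuming each judge marks every response as grounded or not (with disqualified responses counting as zero) and per-judge accuracies are then averaged, the final figure could be computed as follows. The judge names reflect the article; the scoring functions themselves are illustrative only, not DeepMind’s implementation.

```python
from statistics import mean

# Illustrative judge identifiers, taken from the article.
JUDGES = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]

def score_response(eligible: bool, grounded: bool) -> float:
    """Phase 1: ineligible responses are disqualified (score 0).
    Phase 2: eligible responses score 1 only if fully grounded in the document."""
    if not eligible:
        return 0.0
    return 1.0 if grounded else 0.0

def factuality_score(per_judge_scores: dict) -> float:
    """Average each judge's accuracy over all examples, then average across judges."""
    per_judge_accuracy = [mean(scores) for scores in per_judge_scores.values()]
    return mean(per_judge_accuracy)

# Example: three judges scoring the same three responses.
scores = {
    "gemini-1.5-pro": [1.0, 0.0, 1.0],
    "gpt-4o": [1.0, 1.0, 1.0],
    "claude-3.5-sonnet": [1.0, 0.0, 0.0],
}
print(f"Final factuality score: {factuality_score(scores):.1%}")
```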
The researchers point out that, because models were often biased toward other members of their own model family (an average increase of about 3.23%), combining different judges was important to ensure responses were indeed factual.
The researchers ultimately emphasize that factuality and grounding are key factors in the future success and usefulness of LLMs. “We believe that comprehensive benchmarking methods and continued research and development will continue to improve AI systems,” they write.
However, they also acknowledge: “We’re mindful that benchmarks can be quickly overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning.”