A new benchmarking tool, FACTS Grounding, was recently announced as a collaboration between Google DeepMind and Google Research. It evaluates the factual accuracy of large language models (LLMs).
Introducing FACTS Grounding. New benchmark launched in collaboration with @GoogleDeepMind to evaluate LLM’s factual accuracy on over 1700 tasks. 🧠📐 pic.twitter.com/MvyRbbuMwK
— Kaggle (@kaggle) December 17, 2024
The FACTS Grounding benchmark and associated leaderboards aim to measure how well an AI model generates responses based on the source material provided. This initiative addresses challenges such as misinformation and hallucinations in AI-generated content.
“To track your progress, we’re also launching FACTS leaderboards on Kaggle,” the developer announced on its blog.
The initiative is intended to increase confidence in LLMs, whose tendency to hallucinate false information, especially when given complex inputs, currently limits their real-world application.
Results with 95% confidence intervals
The FACTS Grounding evaluation process reveals detailed insights into the factual accuracy of leading language models.
Models tested include Gemini 1.5 Pro, Gemini 1.5 Flash, and Gemini 2.0 Flash Experimental (Gemini Team), GPT-4o, o1-preview, and o1-mini (OpenAI), and Claude 3.5 Haiku and Claude 3.5 Sonnet (Anthropic).
During the aggregation process, each judge model was found to favor outputs from its own model family, inflating their scores by an average of 3.23%, a bias also observed in previous studies. To counter it, multiple judge models were used, which increases the computational cost but keeps the evaluation fair.
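As a rough illustration of that idea, the sketch below averages grounding verdicts from several judge models so that no single judge's self-preference dominates the score. This is not the official FACTS pipeline; the judge pool and the `judge_grounding` helper are assumptions for illustration.

```python
# Minimal sketch: pool verdicts from several judge models to dilute
# any one judge's bias toward its own model family.
from statistics import mean

JUDGES = ["gemini-1.5-pro", "gpt-4o", "claude-3.5-sonnet"]  # assumed judge pool

def judge_grounding(judge: str, response: str, source: str) -> float:
    """Stand-in for a real judge-LLM call: return 1.0 if `judge` deems
    `response` fully supported by `source`, else 0.0."""
    return 1.0  # placeholder verdict

def factuality_score(response: str, source: str) -> float:
    # Average the binary verdicts across judges; a judge tends to rate its
    # own model family higher, so pooling several judges dilutes that bias.
    return mean(judge_grounding(j, response, source) for j in JUDGES)

print(factuality_score("The report says revenue rose 4%.", "Revenue rose 4% in Q3."))
```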
Disqualifying ineligible answers decreased final factuality scores by 1% to 5%. This adjustment also caused slight changes in the model rankings, with Gemini 1.5 Flash dropping from 1st to 2nd place. All results were reported with 95% confidence intervals.
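To make the adjustment and the confidence intervals concrete, here is a rough worked example with hypothetical counts (not figures from the benchmark), using a normal approximation for a proportion:

```python
# Hypothetical numbers showing how disqualification lowers a factuality score
# and how a 95% confidence interval can be attached to the adjusted score.
import math

total_examples = 1719
grounded_ignoring_eligibility = 1460  # hypothetical count before disqualification
grounded_and_eligible = 1400          # hypothetical count of accepted answers

raw_score = grounded_ignoring_eligibility / total_examples   # ~0.849
adjusted_score = grounded_and_eligible / total_examples      # ~0.814, a few points lower

# 95% CI via the normal approximation: p +/- 1.96 * sqrt(p * (1 - p) / n)
se = math.sqrt(adjusted_score * (1 - adjusted_score) / total_examples)
low, high = adjusted_score - 1.96 * se, adjusted_score + 1.96 * se
print(f"adjusted score: {adjusted_score:.3f} (95% CI {low:.3f}-{high:.3f})")
```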
Google tells its Gemini AI testers to “address” any prompts they don’t understand, suggesting they assess their understanding and note any confusion.
The company assures us that this approach does not compromise Gemini’s accuracy, pointing to the newly introduced FACTS Grounding… pic.twitter.com/VcmSIZqR8t
— Daniel Gabai (@DanielGabai_) December 20, 2024
Model rankings were determined through a “fusion rank” metric that aggregates the individual rankings from the various task splits and uses the Condorcet method to produce the final ordering.
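The article does not spell out the aggregation in detail, but a Condorcet-style combination of per-split rankings can be sketched as follows. The splits, model names, and the Copeland-style counting of pairwise wins (which selects the Condorcet winner when one exists) are illustrative assumptions, not the benchmark's exact procedure.

```python
# Toy example: combine per-split rankings via pairwise majorities.
from itertools import combinations

# Hypothetical per-split rankings (best first); real FACTS splits differ.
split_rankings = [
    ["gemini-1.5-flash", "gpt-4o", "claude-3.5-sonnet"],
    ["gemini-1.5-flash", "claude-3.5-sonnet", "gpt-4o"],
    ["gpt-4o", "gemini-1.5-flash", "claude-3.5-sonnet"],
]

models = split_rankings[0]
wins = {m: 0 for m in models}

for a, b in combinations(models, 2):
    # a beats b if a is ranked above b in a majority of splits
    a_above = sum(r.index(a) < r.index(b) for r in split_rankings)
    if a_above > len(split_rankings) / 2:
        wins[a] += 1
    elif a_above < len(split_rankings) / 2:
        wins[b] += 1

final_order = sorted(models, key=lambda m: -wins[m])
print(final_order)  # ['gemini-1.5-flash', 'gpt-4o', 'claude-3.5-sonnet']
```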
How was the test conducted?
This benchmark consists of 1,719 examples that test models on tasks such as summarization, question answering, and rewriting.
Datasets and methodologies prioritize real-world applicability, with tasks spanning finance, law, and technology. Automated evaluation relies on multiple judge models to assess each response.
A response is disqualified if it fails to adequately address the request or if its claims are not supported by the provided source material.
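The two checks can be pictured as a simple two-stage grader, sketched below. This is not the official grader; the `addresses_request` and `is_fully_grounded` helpers are hypothetical stand-ins for judge-model calls.

```python
# Minimal sketch: disqualify responses that ignore the request, then score
# the rest on whether they are grounded in the supplied document.
from dataclasses import dataclass

@dataclass
class Example:
    document: str   # source material the answer must be grounded in
    request: str    # summarization / QA / rewriting instruction
    response: str   # model output under evaluation

def addresses_request(ex: Example) -> bool:
    """Stand-in for a judge-LLM call: does the response answer the request at all?"""
    return len(ex.response.strip()) > 0  # placeholder heuristic

def is_fully_grounded(ex: Example) -> bool:
    """Stand-in for a judge-LLM call: is every claim supported by the document?"""
    return True  # placeholder verdict

def grade(ex: Example) -> float:
    if not addresses_request(ex):
        return 0.0                      # ineligible answers are disqualified outright
    return 1.0 if is_fully_grounded(ex) else 0.0
```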
Will Google take the lead?
Google has launched several other major developments this year, positioning Google DeepMind as the leader in the AGI race ahead of OpenAI and other rivals.
The company announced a series of breakthrough innovations, including its latest quantum chip Willow and the advanced Gemini 2.0 Flash and Pro models with new agent capabilities. It also introduced Project Astra and Project Mariner, demonstrating its commitment to cutting-edge research.
Further advances include the text-to-video model Veo 2 and the text-to-image model Imagen 3, demonstrating progress in generative AI. Additionally, Gemini 2.0 Flash Thinking represents a major advance in model reasoning, alongside progress in robotics.
This latest FACTS Grounding benchmark is seen as an important step towards promoting trustworthiness and accuracy of AI-generated content.