Google DeepMind's new benchmark assesses the factuality of LLMs

A new benchmarking tool, FACTS Grounding, was recently announced as a collaboration between Google DeepMind and Google Research. It evaluates the factual accuracy of LLMs.

Introducing FACTS Grounding. New benchmark launched in collaboration with @GoogleDeepMind to evaluate LLM’s factual accuracy on over 1700 tasks. 🧠📐 pic.twitter.com/MvyRbbuMwK

— Kaggle (@kaggle) December 17, 2024

The FACTS Grounding benchmark and associated leaderboards aim to measure how well an AI model generates responses based on the source material provided. This initiative addresses challenges such as misinformation and hallucinations in AI-generated content.

“To track your progress, we’re also launching FACTS leaderboards on Kaggle,” Google DeepMind announced on its blog.

The benchmark is intended to increase confidence in LLMs, whose tendency to hallucinate false information, especially when given complex inputs, limits their application in the real world.

Results with 95% confidence intervals

The FACTS Grounding evaluation process reveals detailed insights into the factual accuracy of leading language models.

Models tested include Gemini 1.5 Pro and Flash (Gemini Team), Gemini 2.0 Flash Experimental, GPT-4o (OpenAI), OpenAI o1-preview and o1-mini, and Claude 3.5 Haiku and Sonnet (Anthropic).

During score aggregation, judge models were found to rate their own outputs higher than those of competing models, by 3.23% on average. This trend has also been observed in previous studies. To counter this bias, multiple judge models were used, which increases the computational cost but makes the evaluation fairer.

Disqualifying ineligible responses decreased the final factuality scores by 1% to 5%. This adjustment also caused a slight change in the model rankings, with Gemini 1.5 Flash dropping from 1st to 2nd place. All scores were reported with 95% confidence intervals.
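As a rough illustration of how disqualification and a confidence interval interact, the sketch below scores an invented set of 100 responses before and after ineligible ones are counted as failures. The verdict counts are made up, and the normal-approximation interval is an assumption; the article does not say how the benchmark's intervals were computed.

```python
import math

def score_with_ci(outcomes, z=1.96):
    """Return (score, low, high) for 0/1 outcomes.

    Uses a normal-approximation interval; z=1.96 corresponds to 95%.
    This is an illustrative choice, not the benchmark's documented method.
    """
    n = len(outcomes)
    p = sum(outcomes) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical per-response verdicts as (grounded, eligible) pairs.
responses = [(True, True)] * 80 + [(True, False)] * 5 + [(False, True)] * 15

# Before the adjustment, only grounding counts.
unadjusted = [int(grounded) for grounded, _ in responses]
# After the adjustment, an ineligible response scores zero even if grounded.
adjusted = [int(grounded and eligible) for grounded, eligible in responses]

for label, outcomes in (("unadjusted", unadjusted), ("adjusted", adjusted)):
    score, low, high = score_with_ci(outcomes)
    print(f"{label}: {score:.3f} (95% CI {low:.3f} to {high:.3f})")
```

In this toy data, five grounded but ineligible responses are what separate the two scores, which mirrors the kind of drop the article describes.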

Google tells its Gemini AI testers to “address” any prompts they don’t understand, suggesting they assess their understanding and note any confusion.

The company assures us that this approach does not compromise Gemini’s accuracy, pointing to the newly introduced FACTS Grounding… pic.twitter.com/VcmSIZqR8t

— Daniel Gabai (@DanielGabai_) December 20, 2024

Model rankings were determined through a “fusion rank” metric that aggregates the individual rankings from the various splits, using the Condorcet algorithm to order the models.
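The article does not say which Condorcet variant is used, so the following is only a minimal sketch of the general idea, using a Copeland-style count of pairwise wins. The model names and per-split rankings are hypothetical, and every split is assumed to rank the same set of models.

```python
from itertools import combinations

def condorcet_fusion(rankings):
    """Fuse several rankings (each a list, best first) into one ordering.

    For every pair of models, the one ranked higher in a majority of the
    input rankings wins the pairing. Models are then ordered by number of
    pairwise wins (Copeland-style), with ties broken alphabetically.
    """
    models = sorted(set(rankings[0]))
    wins = {m: 0 for m in models}
    for a, b in combinations(models, 2):
        a_preferred = sum(r.index(a) < r.index(b) for r in rankings)
        b_preferred = len(rankings) - a_preferred
        if a_preferred > b_preferred:
            wins[a] += 1
        elif b_preferred > a_preferred:
            wins[b] += 1
    return sorted(models, key=lambda m: (-wins[m], m))

# Hypothetical rankings from three evaluation splits.
split_rankings = [
    ["model_a", "model_b", "model_c"],
    ["model_b", "model_a", "model_c"],
    ["model_a", "model_c", "model_b"],
]
fused = condorcet_fusion(split_rankings)
print(fused)
```

Here model_a beats both rivals head to head and model_b beats model_c, so the fused order matches the majority preference even though no two splits agree completely.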

How was the test conducted?

The benchmark consists of 1,719 examples that test models on tasks such as summarization, question answering, and rewriting.

The dataset and methodology prioritize real-world applicability, and tasks span finance, law, and technology. The automated evaluation uses multiple judge models to assess model performance.

A response is disqualified if it fails to adequately address the user's request or if its content is not supported by the provided source material.
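A minimal sketch of this two-part check with several judges might look like the following. The judge names, the verdicts, and the rule of averaging per-judge scores are all assumptions made for illustration; the article does not describe the exact aggregation.

```python
from statistics import mean

def judge_score(verdicts):
    """Fraction of responses one judge accepts.

    Each verdict is an (eligible, grounded) pair; a response counts only
    if it addresses the request AND is supported by the source document.
    """
    return mean(1.0 if eligible and grounded else 0.0
                for eligible, grounded in verdicts)

def final_score(verdicts_by_judge):
    """Average the per-judge scores so no single judge dominates."""
    return mean(judge_score(v) for v in verdicts_by_judge.values())

# Hypothetical verdicts from two judges on the same four responses.
verdicts_by_judge = {
    "judge_a": [(True, True), (True, False), (False, True), (True, True)],
    "judge_b": [(True, True), (True, True), (False, True), (True, True)],
}

for name, verdicts in verdicts_by_judge.items():
    print(name, judge_score(verdicts))
print("final", final_score(verdicts_by_judge))
```

Averaging over judges is one simple way to dilute the self-preference bias of any single judge model, at the cost of running every response through each judge.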

Will Google take the lead?

Google has made several other major announcements this year, putting Google DeepMind ahead of OpenAI and other rivals in the AGI race.

The company announced a series of breakthrough innovations, including its latest quantum chip Willow, and the advanced Gemini Flash 2, Pro, and Agent. It also introduced Project Astra and Project Mariner, demonstrating its commitment to cutting-edge research.

Further advances include the text-to-video model Veo 2 and the text-to-image model Imagen 3, demonstrating progress in generative AI. Additionally, the Gemini 2.0 Flash Thinking model represents a major advance in model reasoning and robotics.

This latest FACTS Grounding benchmark is seen as an important step towards promoting trustworthiness and accuracy of AI-generated content.
