Google DeepMind's new benchmark assesses the factuality of LLMs

FACTS Grounding, a new benchmark for evaluating the factual accuracy of LLMs, was recently announced as a collaboration between Google DeepMind and Google Research.

Introducing FACTS Grounding. New benchmark launched in collaboration with @GoogleDeepMind to evaluate LLM’s factual accuracy on over 1700 tasks. 🧠📐 pic.twitter.com/MvyRbbuMwK

— Kaggle (@kaggle) December 17, 2024

The FACTS Grounding benchmark and associated leaderboards aim to measure how well an AI model generates responses based on the source material provided. This initiative addresses challenges such as misinformation and hallucinations in AI-generated content.

“To track your progress, we’re also launching FACTS leaderboards on Kaggle,” Google DeepMind announced on its blog.

The goal is to increase trust in LLMs, whose real-world application is limited by their tendency to hallucinate false information, especially when given complex inputs.

Results reported with 95% confidence intervals

The FACTS Grounding evaluation process reveals detailed insights into the factual accuracy of leading language models.

Models tested include Gemini 1.5 Pro and Flash (Gemini Team), Gemini 2.0 Flash Experimental, GPT-4o (OpenAI), OpenAI o1-preview and o1-mini, and Claude 3.5 Haiku and Sonnet (Anthropic).

During score aggregation, judge models were found to rate their own outputs higher than those of competing models, by 3.23% on average. This trend has also been observed in previous studies. To counter the bias, multiple judge models were used, which increases computational cost but makes the evaluation fairer.
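As a rough illustration of how using several judges dilutes the bias of any single one, the sketch below averages per-judge scores. The judge names and verdicts are hypothetical, and the article does not describe the benchmark's actual aggregation code, so this is only one plausible reading.

```python
from statistics import mean

# Hypothetical verdicts: verdicts[judge][i] is True when that judge
# considers response i fully grounded in the source document.
verdicts = {
    "judge_a": [True, True, False, True],
    "judge_b": [True, False, False, True],
    "judge_c": [True, True, False, False],
}

def judge_score(judge_verdicts):
    """Fraction of responses that one judge marked as grounded."""
    return sum(judge_verdicts) / len(judge_verdicts)

def ensemble_score(verdicts_by_judge):
    """Average the per-judge scores so no single judge dominates."""
    return mean(judge_score(v) for v in verdicts_by_judge.values())

per_judge = {name: judge_score(v) for name, v in verdicts.items()}
final = ensemble_score(verdicts)
print(per_judge, round(final, 4))
```

Each additional judge means another full pass over every response, which is where the extra computational cost comes from.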

Disqualifying ineligible responses reduced the final factuality scores by 1% to 5%. This adjustment also shifted the model rankings slightly, with Gemini 1.5 Flash dropping from 1st to 2nd place. All scores were reported with 95% confidence intervals.
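To make the effect of disqualification and the meaning of a 95% confidence interval concrete, here is a minimal sketch. It assumes a normal-approximation interval for a proportion, which the article does not confirm as the benchmark's method, and the response counts are invented for illustration.

```python
import math

def score_with_ci(grounded, eligible=None, z=1.96):
    """Return (score, low, high) using a normal-approximation 95% interval.

    grounded[i] -- True if response i was judged grounded in the source.
    eligible[i] -- True if response i adequately addressed the request;
                   when given, ineligible responses count as failures.
    """
    if eligible is None:
        eligible = [True] * len(grounded)
    passed = [g and e for g, e in zip(grounded, eligible)]
    n = len(passed)
    p = sum(passed) / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical data: 100 responses, 85 judged grounded,
# 4 of the grounded ones ruled ineligible.
grounded = [True] * 85 + [False] * 15
eligible = [False] * 4 + [True] * 96

raw = score_with_ci(grounded)                 # score before disqualification
adjusted = score_with_ci(grounded, eligible)  # score after disqualification
print(raw, adjusted)
```

With these made-up numbers the score falls from 0.85 to 0.81, a drop of four percentage points, which sits inside the 1% to 5% range reported above.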

Google tells its Gemini AI testers to “address” any prompts they don’t understand, suggesting they assess their understanding and note any confusion.

The company assures us that this approach does not compromise Gemini’s accuracy, pointing to the newly introduced FACTS Grounding… pic.twitter.com/VcmSIZqR8t

— Daniel Gabai (@DanielGabai_) December 20, 2024

Model rankings were determined through a “fusion rank” metric, which aggregates the individual rankings from the various splits and judge models using the Condorcet algorithm.
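The idea behind Condorcet-style rank fusion can be sketched with a simple pairwise-majority count (often called Copeland counting). This is an assumed variant chosen for brevity, not necessarily the exact procedure the benchmark uses, and the model names and rankings are hypothetical.

```python
from itertools import combinations

def condorcet_fuse(rankings):
    """Fuse several rankings (best first) into one, Copeland-style.

    Model A beats model B when A is placed above B in more rankings
    than the reverse. Models are ordered by number of pairwise wins.
    """
    models = sorted(rankings[0])
    wins = {m: 0 for m in models}
    for a, b in combinations(models, 2):
        a_ahead = sum(r.index(a) < r.index(b) for r in rankings)
        b_ahead = len(rankings) - a_ahead
        if a_ahead > b_ahead:
            wins[a] += 1
        elif b_ahead > a_ahead:
            wins[b] += 1
    # Ties in win count are broken alphabetically for determinism.
    return sorted(models, key=lambda m: (-wins[m], m))

# Hypothetical rankings from three evaluation splits.
rankings = [
    ["model_x", "model_y", "model_z"],
    ["model_y", "model_x", "model_z"],
    ["model_x", "model_z", "model_y"],
]
fused = condorcet_fuse(rankings)
print(fused)
```

Here model_x is preferred to each rival in a majority of the rankings, so it comes first in the fused order even though one split placed it second.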

How was the test conducted?

This benchmark consists of 1,719 examples that test the model on various tasks such as summarization, question answering, and rewriting.

The dataset and methodology prioritize real-world applicability, with tasks spanning finance, law, and technology. Automated evaluation uses multiple judge models to assess model performance.

A response is disqualified if it fails to adequately address the user's request or if its content is not supported by the provided source material.

Will Google take the lead?

Google has announced several other major developments this year, strengthening Google DeepMind's claim to lead the AGI race ahead of OpenAI and other rivals.

The company announced a series of breakthrough innovations, including its latest quantum chip, Willow, and the advanced Gemini 2.0 Flash, Pro, and Agent. It also introduced Project Astra and Project Mariner, demonstrating its commitment to cutting-edge research.

Further advances include the text-to-video model Veo 2 and the text-to-image model Imagen 3, demonstrating progress in generative AI. Additionally, the Gemini 2.0 Flash Thinking model represents a major advance in model reasoning and robotics.

This latest FACTS Grounding benchmark is seen as an important step towards promoting trustworthiness and accuracy of AI-generated content.
