Karachi Chronicle
AI

Google DeepMind researchers introduce new benchmark to improve LLM factuality and reduce hallucinations

By Adnan Mahar · January 10, 2025 · 5 Mins Read

Hallucinations, or factually inaccurate responses, continue to plague large language models (LLMs). Models falter especially when given complex tasks or when a user expects specific, highly detailed responses.

This is a challenge data scientists have struggled to overcome, and researchers at Google DeepMind say they are now one step closer to achieving true factuality in foundation models. They have introduced FACTS Grounding, a benchmark that evaluates LLMs’ ability to generate factually accurate responses grounded in long-form documents. It also assesses whether responses are detailed enough to provide useful, relevant answers to a prompt.

Along with the new benchmarks, researchers have released the FACTS leaderboard to the Kaggle data science community.

As of this week, Gemini 2.0 Flash topped the leaderboard with a factuality score of 83.6%. The rest of the top nine include Google’s Gemini 1.0 Flash and Gemini 1.5 Pro; Anthropic’s Claude 3.5 Sonnet and Claude 3.5 Haiku; and OpenAI’s GPT-4o, GPT-4o mini, o1-mini, and o1-preview. All of these scored above 61.7% on factuality.

Researchers say the leaderboard will be actively maintained and continually updated to include new models and their various iterations.

“We believe this benchmark fills a gap in evaluating a wider variety of model behaviors pertaining to factuality, in comparison to benchmarks that focus on narrower use cases such as summarization alone,” the researchers wrote in a technical paper published this week.

Removing incorrect answers

Ensuring the factual accuracy of LLM responses is difficult because of both modeling factors (architecture, training, inference) and measurement factors (evaluation methods, data, metrics). The researchers note that pre-training typically focuses on predicting the next token given the previous tokens.

“While this may teach the model salient world knowledge, it does not directly optimize the model for various factuality scenarios; instead, it encourages the model to generate generally plausible text,” the researchers wrote.

To address this, the FACTS dataset incorporates 1,719 examples (860 public and 859 private), each requiring a long-form response grounded in the provided context document. Each example includes:

  • A system prompt (system_instructions) with general directives and the instruction to respond only based on the provided context
  • A task (user_request) containing a specific question to answer
  • A long document (context_document) containing the necessary information
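Based on the three fields just described, a single example might be assembled into a model prompt as sketched below. The field names follow the article; the exact schema and prompt format of the released dataset are assumptions for illustration.

```python
# A minimal sketch of one FACTS Grounding example, using the three
# fields named in the article (system_instructions, user_request,
# context_document). The exact schema of the Kaggle release may differ.
example = {
    "system_instructions": (
        "Answer using ONLY the information in the provided context "
        "document. Do not draw on outside knowledge."
    ),
    "user_request": (
        "Summarize the main reasons the company's revenue decreased "
        "in the third quarter."
    ),
    "context_document": "<full text of the company's annual report>",
}

def build_prompt(ex: dict) -> str:
    """Assemble the three fields into a single prompt for the model under test."""
    return (
        f"{ex['system_instructions']}\n\n"
        f"Context:\n{ex['context_document']}\n\n"
        f"Request: {ex['user_request']}"
    )
```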

To be rated “accurate,” a model must process the long-form document and produce a long-form response that is both comprehensive and fully attributable to the document. Responses are labeled “inaccurate” if the model’s claims are not directly supported by the document, or if they are not highly relevant or useful.

For example, a user might ask the model to summarize the main reasons a company’s revenue decreased in the third quarter, supplying the company’s annual financial report, which details quarterly earnings, expenses, planned investments, market analysis, and more.

If the model then returned something like, “The company faced challenges that impacted revenue in the third quarter,” the response would be judged inaccurate.

“This response avoids identifying reasons that are likely to be included in the document, such as market trends, increased competition, or operational setbacks,” the researchers said. “It does not represent an attempt to engage with or extract relevant details.”

In contrast, if a user asked, “What are some tips for saving money?” and supplied a document of money-saving tips categorized for college students, a correct response would be highly detailed: “Take advantage of free on-campus activities, buy products in bulk, and cook at home. Also, set spending goals, avoid credit cards, and conserve resources.”

DeepMind uses LLMs to judge LLMs

To allow for diverse inputs, the researchers included documents of varying lengths, up to 32,000 tokens (roughly 20,000 words), covering domains such as finance, technology, retail, healthcare, and law. User requests are likewise wide-ranging, including Q&A generation, summarization, and rewriting.
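The stated 32,000-token ceiling (roughly 20,000 words) implies about 1.6 tokens per word. As a rough illustration, candidate documents could be screened against that limit with a word-count approximation; a real pipeline would count tokens with the model’s own tokenizer.

```python
# Rough length screen for context documents, assuming the article's
# implied ratio of ~1.6 tokens per word (32,000 tokens ≈ 20,000 words).
# A real pipeline would use the model's actual tokenizer instead.
MAX_TOKENS = 32_000
TOKENS_PER_WORD = 1.6  # approximation, not a real tokenizer

def estimated_tokens(text: str) -> int:
    """Estimate token count from the whitespace-delimited word count."""
    return int(len(text.split()) * TOKENS_PER_WORD)

def fits_context(text: str) -> bool:
    """True if the document fits within the benchmark's length limit."""
    return estimated_tokens(text) <= MAX_TOKENS
```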

Each example is judged in two stages. First, the response is evaluated for eligibility: if it does not address the user’s request, it is disqualified. Second, the response must be free of hallucinations and fully grounded in the document provided.
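The two-stage judging just described can be sketched as a simple pipeline. The judge functions below are placeholder heuristics standing in for the LLM judges the benchmark actually uses.

```python
# Two-stage evaluation sketch: (1) eligibility, (2) grounding.
# Both checks are toy heuristics; in the benchmark each stage is
# performed by LLM judges, not string matching.

def is_eligible(response: str, user_request: str) -> bool:
    """Stage 1: disqualify responses that do not address the request."""
    return len(response.strip()) > 0  # placeholder for an LLM judgment

def is_grounded(response: str, context_document: str) -> bool:
    """Stage 2: every claim must be supported by the document."""
    return response in context_document  # deliberately crude stand-in

def judge(response: str, user_request: str, context_document: str) -> str:
    """Return the label a response would receive under the two stages."""
    if not is_eligible(response, user_request):
        return "ineligible"
    return "accurate" if is_grounded(response, context_document) else "inaccurate"
```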

Factuality scores are calculated by three different LLM judges (specifically Gemini 1.5 Pro, GPT-4o, and Claude 3.5 Sonnet), each of which determines an individual score based on the percentage of the model’s output that is accurate. The final factuality determination is the average of the three judges’ scores.
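Averaging across the three judges is straightforward to express. The judge names below follow the article; the individual scores are invented purely for illustration.

```python
# Final factuality score as the mean of three LLM judges' scores.
# Judge names follow the article; the numbers are illustrative only.
judge_scores = {
    "gemini-1.5-pro": 0.84,
    "gpt-4o": 0.82,
    "claude-3.5-sonnet": 0.85,
}

# Each judge scores the fraction of model output it deems accurate;
# the final determination is their simple average.
final_score = sum(judge_scores.values()) / len(judge_scores)
print(f"Final factuality score: {final_score:.1%}")
```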

The researchers found that models were often biased toward other members of their own model family (an average score increase of about 3.23%), so a combination of different judges was important to ensure that responses were indeed factual.

Ultimately, the researchers emphasize that factuality and grounding are key factors in the future success and usefulness of LLMs. “We believe that comprehensive benchmarking methods, coupled with continued research and development, will continue to improve AI systems,” they write.

However, they also acknowledge: “We are mindful that benchmarks can be quickly overtaken by progress, so this launch of our FACTS Grounding benchmark and leaderboard is just the beginning.”

Adnan Mahar

Adnan is a passionate doctor from Pakistan with a keen interest in exploring the world of politics, sports, and international affairs. As an avid reader and lifelong learner, he is deeply committed to sharing insights, perspectives, and thought-provoking ideas. His journey combines a love for knowledge with an analytical approach to current events, aiming to inspire meaningful conversations and broaden understanding across a wide range of topics.
