Karachi Chronicle
These researchers benchmarked AI “reasoning” models using NPR Sunday Puzzle questions

By Adnan Mahar, February 16, 2025


Every Sunday, NPR host Will Shortz, a leading figure behind the New York Times crossword puzzle, quizzes thousands of listeners in a long-running segment called the Sunday Puzzle. The brainteasers are written to be solvable without much prior knowledge, but they are usually challenging even for skilled contestants.

That is why some experts see them as a promising way to test the limits of AI's problem-solving abilities.

In a recent study, a team of researchers from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor used riddles from Sunday Puzzle episodes to create an AI benchmark. The team says its test revealed surprising insights, such as that reasoning models (OpenAI's o1 among them) sometimes “give up” and provide answers they know are wrong.

“We wanted to develop a benchmark with problems that humans can understand with just general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and one of the study's co-authors, told TechCrunch.

The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, such as competency on PhD-level math and science questions, that are not relevant to the average user. Meanwhile, many benchmarks, even relatively recent ones, are quickly approaching the saturation point.

The advantage of a public radio quiz game like the Sunday Puzzle is that it does not test for esoteric knowledge, and the challenges are phrased so that models cannot draw on “rote memory” to solve them, Guha explained.

“What makes these problems difficult is that it's really difficult to make meaningful progress on a problem until you solve it. That's when it all clicks at once,” Guha said. “That requires a combination of insight and a process of elimination.”

Of course, no benchmark is perfect. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, models trained on them could “cheat” in a sense, although Guha says he has not seen evidence of this.

“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”

On the researchers' benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek's R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before producing results, which helps them avoid some of the pitfalls that usually trip up AI models. The trade-off is that reasoning models take a little longer to reach a solution, typically seconds to minutes longer.

At least one model, DeepSeek's R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim “I give up,” followed by an incorrect answer chosen seemingly at random. It is behavior humans can certainly relate to.

The models make other strange choices, such as giving a wrong answer only to retract it, attempting to tease out a better one, and failing again. They also get stuck “thinking” forever and give nonsensical explanations for their answers, or they arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.

“On hard problems, R1 literally says that it's getting ‘frustrated,’” Guha said. “It was interesting to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”

R1 getting “frustrated” on a question from the Sunday Puzzle challenge set. Image credit: Guha et al.

The current best-performing model on the benchmark is o1, with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models.

The scores of the models the team tested on its benchmark. Image credit: Guha et al.

“You don't need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don't require PhD-level knowledge,” Guha said. “A benchmark with broader access allows a wider range of researchers to understand and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as cutting-edge models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are and are not capable of.”


