Following the success of large language models (LLMs), current research extends beyond text-based understanding to multimodal reasoning tasks that integrate vision and language, a capability widely considered essential for artificial general intelligence (AGI). Cognitive benchmarks such as PuzzleVQA and AlgoPuzzleVQA evaluate an AI system's ability to process abstract visual information and perform algorithmic reasoning. Despite recent advances, LLMs still struggle with multimodal reasoning, particularly pattern recognition and spatial problem solving, and high computational costs exacerbate these challenges.
Previous evaluations relied on established benchmarks such as ARC-AGI and visual assessments such as Raven's Progressive Matrices. However, these do not adequately challenge an AI system's ability to handle multimodal inputs. More recently, datasets such as PuzzleVQA and AlgoPuzzleVQA were introduced to evaluate abstract visual reasoning and algorithmic problem solving. These datasets require models to integrate visual perception, logical deduction, and structured reasoning. Earlier models such as GPT-4-Turbo and GPT-4o demonstrated improvements but still faced limitations in abstract reasoning and multimodal interpretation.
Researchers at the Singapore University of Technology and Design (SUTD) have introduced a systematic evaluation of OpenAI's GPT-[n] and o-[n] model series on multimodal puzzle-solving tasks. Their study examines how reasoning ability has evolved across model generations, aiming to identify gaps in perception, abstract reasoning, and problem-solving skills. The team compared the performance of models such as GPT-4-Turbo, GPT-4o, and o1 on the PuzzleVQA and AlgoPuzzleVQA datasets, which include abstract visual puzzles and algorithmic reasoning tasks.
The researchers conducted a structured assessment using two main datasets:
PuzzleVQA: focuses on abstract visual reasoning, requiring models to recognize patterns in numbers, shapes, colors, and sizes.
AlgoPuzzleVQA: covers algorithmic problem-solving tasks that require logical deduction and computational reasoning.
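To make the setup more concrete, the sketch below shows one way a single multimodal puzzle instance could be represented for evaluation. The field names and example values are illustrative assumptions, not the datasets' actual schema.

```python
# Minimal sketch of how a single multimodal puzzle instance might be
# represented for evaluation. The field names below are illustrative
# assumptions and do not reflect the datasets' actual schema.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PuzzleInstance:
    image_path: str                       # rendered puzzle image (e.g., a grid of shapes)
    question: str                         # natural-language question about the puzzle
    options: Optional[List[str]] = None   # answer choices; None for the open-ended format
    answer: str = ""                      # ground-truth answer used for scoring

def is_multiple_choice(instance: PuzzleInstance) -> bool:
    """An instance is scored as multiple-choice only when options are provided."""
    return instance.options is not None

# Example: the same abstract visual puzzle posed in both formats.
mc_item = PuzzleInstance(
    image_path="puzzles/shape_pattern_012.png",
    question="Which shape completes the pattern in the missing cell?",
    options=["circle", "square", "triangle", "hexagon"],
    answer="triangle",
)
open_item = PuzzleInstance(
    image_path="puzzles/shape_pattern_012.png",
    question="Which shape completes the pattern in the missing cell?",
    answer="triangle",
)
```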
Evaluations were performed using both multiple-choice and open-ended question formats. The study adopted zero-shot Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning and analyzed the performance degradation when switching from multiple-choice to open-ended responses. The models were also tested under conditions where visual perception details and inductive reasoning guidance were provided separately, in order to diagnose specific weaknesses.
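The sketch below illustrates how such evaluation prompts might be assembled under zero-shot CoT, with optional perception and reasoning injections for the ablations described above. The prompt wording is an assumption for illustration and does not reproduce the study's exact prompts.

```python
# A minimal sketch of zero-shot CoT prompt construction for the two answer
# formats, with optional "perception" / "guidance" injections. The phrasing
# below is assumed for illustration, not taken from the study.
from typing import List, Optional

def build_prompt(
    question: str,
    options: Optional[List[str]] = None,    # None -> open-ended format
    perception_hint: Optional[str] = None,  # explicit description of the image
    reasoning_hint: Optional[str] = None,   # inductive reasoning guidance
) -> str:
    parts = []
    if perception_hint:
        parts.append(f"Visual description: {perception_hint}")
    if reasoning_hint:
        parts.append(f"Hint: {reasoning_hint}")
    parts.append(f"Question: {question}")
    if options is not None:
        letters = "ABCD"
        parts.append("Options: " + ", ".join(
            f"({letters[i]}) {opt}" for i, opt in enumerate(options)))
    # Zero-shot CoT trigger: ask the model to reason before answering.
    parts.append("Let's think step by step, then state the final answer.")
    return "\n".join(parts)

# Multiple-choice prompt with the perception ablation enabled.
print(build_prompt(
    question="Which shape completes the pattern in the missing cell?",
    options=["circle", "square", "triangle", "hexagon"],
    perception_hint="A 3x3 grid of shapes; the bottom-right cell is empty.",
))
```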
The study observed a steady improvement in reasoning ability across model generations. GPT-4o outperformed GPT-4-Turbo, while o1 achieved the most notable gains, particularly on algorithmic reasoning tasks. However, these gains came with a sharp increase in computational cost. Despite the overall progress, the models still struggled with tasks requiring precise visual interpretation, such as recognizing missing shapes and deciphering abstract patterns. o1 performed well on numerical reasoning but had difficulty with shape-based puzzles. The gap in accuracy between multiple-choice and open-ended tasks indicates a strong dependence on answer prompts, and perception remained a major challenge across all models, as accuracy improved substantially when explicit visual details were provided.
The key findings of the study can be summarized in the following points.
The study observed a clear upward trend in reasoning ability from GPT-4-Turbo to GPT-4o to o1. GPT-4o showed moderate gains, while the transition to o1 brought substantial improvements at roughly 750 times the computational cost of GPT-4o.
Overall, o1 achieved an average accuracy of 79.2% in the multiple-choice setting, surpassing GPT-4o's 60.6% and GPT-4-Turbo's 54.2%. In the open-ended format, all models dropped: o1 to 66.3%, GPT-4o to 46.8%, and GPT-4-Turbo to 38.6%.
On AlgoPuzzleVQA, o1 improved markedly over earlier models, especially on puzzles requiring numerical and spatial deduction, scoring 55.3% on multiple-choice tasks compared with 43.6% for GPT-4o and 36.5% for GPT-4-Turbo. On open-ended tasks, however, its accuracy fell by 23.1%.
Perception was identified as a major limitation across all models. Injecting explicit visual details improved accuracy by 22%-30%, indicating a reliance on external perceptual aids. Inductive reasoning guidance raised performance by 6%-19%, particularly on numerical and spatial pattern recognition.
o1 excelled at numerical reasoning but struggled with shape-based puzzles, showing a 4.5% drop relative to GPT-4o on shape-recognition tasks. It also performed well on structured problem solving but faced challenges in open-ended scenarios requiring independent deduction.
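To make the multiple-choice vs. open-ended gap concrete, the short sketch below tabulates the accuracies quoted above and computes each model's drop in percentage points. The figures are those reported in the summary; the tabulation itself is only an illustrative way to view them.

```python
# Recomputing the multiple-choice vs. open-ended gap from the figures quoted
# above. The accuracies are those reported in the summary; this tabulation is
# only an illustration.
reported = {
    # model: (multiple-choice accuracy %, open-ended accuracy %)
    "o1":          (79.2, 66.3),
    "GPT-4o":      (60.6, 46.8),
    "GPT-4-Turbo": (54.2, 38.6),
}

for model, (mc, open_ended) in reported.items():
    drop = mc - open_ended
    print(f"{model:12s}  MC: {mc:5.1f}%  Open-ended: {open_ended:5.1f}%  Drop: {drop:.1f} pts")
```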
Check out the paper and the GitHub page. All credit for this research goes to the researchers of this project.

Sana Hassan, a consulting intern at MarkTechPost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.