URM shows that small-scale recursive models can outperform large-scale LLMs on reasoning tasks

By Adnan Mahar | December 22, 2025



This article is part of our coverage of the latest AI research.

Ubiquant researchers have proposed a new deep learning architecture that improves the ability of AI models to solve complex reasoning tasks. Their architecture, the Universal Reasoning Model (URM), builds on the Universal Transformer (UT) framework that other research teams have used to tackle difficult benchmarks such as ARC-AGI and Sudoku.

While recent models such as the Hierarchical Reasoning Model (HRM) and the Tiny Recursive Model (TRM) highlight the potential of recurrent architectures, the Ubiquant team identified key areas where these models can be optimized. The resulting approach significantly improves reasoning performance over these existing small-scale models and achieves best-in-class results on reasoning benchmarks.

Universal transformers

To understand URM, you first need to understand the Universal Transformer (UT) and how it differs from the standard architecture used by most large language models (LLMs). A standard transformer processes data by passing it through a stack of separate layers, each with its own set of parameters.

In contrast, a UT applies a single layer (often called a transition block) repeatedly in a loop to refine the token representations. This weight-sharing mechanism allows the model to perform iterative reasoning without increasing the number of parameters, making it theoretically more expressive for tasks that require deep, multi-step computation.
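
To make the contrast concrete, here is a minimal PyTorch-style sketch: a standard transformer stacks N distinct blocks, while a universal transformer loops one shared block N times. The block internals and dimensions are illustrative, not the paper's exact architecture.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """A generic pre-norm self-attention + MLP block (details kept minimal)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))

class VanillaTransformer(nn.Module):
    """Standard transformer: N distinct layers, N sets of parameters."""
    def __init__(self, d_model: int, n_layers: int):
        super().__init__()
        self.layers = nn.ModuleList([TransformerBlock(d_model) for _ in range(n_layers)])

    def forward(self, x):
        for layer in self.layers:        # every pass uses different weights
            x = layer(x)
        return x

class UniversalTransformer(nn.Module):
    """Universal transformer: one shared block applied n_steps times."""
    def __init__(self, d_model: int, n_steps: int):
        super().__init__()
        self.block = TransformerBlock(d_model)  # a single set of parameters
        self.n_steps = n_steps

    def forward(self, x):
        for _ in range(self.n_steps):    # every pass reuses the same weights
            x = self.block(x)
        return x
```

With n_steps equal to n_layers, the two models spend roughly the same compute per token, but the universal transformer carries only one block's worth of parameters.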

Recent iterations of this concept, such as HRM and TRM, have shown that small UT-based models can outperform much larger standard transformers on reasoning tasks.

Tiny Recursive Model (TRM) (Source: arXiv)

However, the authors of the URM paper argue that the specific causes of these performance gains have been misunderstood. While previous work attributed the success to sophisticated architectural design, the Ubiquant researchers found that the improvement primarily stems from the inductive bias that recurrence itself introduces in universal transformers. In other words, the advantage comes from the model’s ability to reuse the exact same parameters to iteratively refine its representations.

Furthermore, their analysis revealed that nonlinear computation along the depth dimension plays a much larger role than previously realized. Specifically, the feedforward network (MLP), rather than the attention mechanism, is the main source of the representational nonlinearity required for complex reasoning.

Strengthening the reasoning loop

Based on these insights, URM introduces two key innovations to the UT framework: the ConvSwiGLU module and Truncated Backpropagation Through Loops (TBPTL).

Universal Reasoning Model (URM) (Source: arXiv)

The first innovation addresses a limitation of the standard SwiGLU activation function used in the MLP blocks of modern transformers. SwiGLU provides the necessary nonlinearity, but it is a pointwise operation that processes each token independently, so no information is mixed between tokens within that layer. The researchers address this by inserting a short depthwise convolution inside the MLP block. This ConvSwiGLU mechanism introduces local context interactions, letting the model mix information between neighboring tokens without significantly increasing computational cost.
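
The paper's exact layer layout is not reproduced here, but a rough sketch of the idea, assuming a standard SwiGLU gate with a short depthwise Conv1d over the sequence dimension (the kernel size and placement are assumptions), looks like this:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSwiGLU(nn.Module):
    """SwiGLU MLP with a short depthwise convolution over the sequence.

    Plain SwiGLU is pointwise: every token is transformed independently.
    The depthwise Conv1d lets neighboring tokens interact inside the MLP
    while keeping the extra parameter and compute cost small.
    """
    def __init__(self, d_model: int, d_hidden: int, kernel_size: int = 3):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)
        # groups=d_hidden makes the convolution depthwise: each channel only
        # mixes information from nearby sequence positions, not other channels.
        self.dwconv = nn.Conv1d(d_hidden, d_hidden, kernel_size,
                                padding=kernel_size // 2, groups=d_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        h = F.silu(self.w_gate(x)) * self.w_up(x)            # standard SwiGLU gating
        h = self.dwconv(h.transpose(1, 2)).transpose(1, 2)   # local token mixing
        return self.w_down(h)
```

Because the convolution is depthwise, it adds only a handful of parameters per hidden channel, which is why the extra cost stays negligible.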

The second innovation, Truncated Backpropagation Through Loops, addresses the training instability inherent in recurrent models. Because URM reasons by looping over the same layer, deep reasoning requires many iterations, but computing gradients over long chains of loops leads to noise accumulation and optimization problems. To solve this, the researchers split the rollout into a forward-only segment and a trainable segment.


For a setup with eight inner loops, the researchers found that the best balance was to run the first two loops forward-only and compute gradients only for the last six. This technique stabilizes training by ignoring noisy gradients from the early stages of the reasoning process, allowing the model to focus on the later stages of refinement.
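
A minimal sketch of that training scheme, assuming a generic shared block and using the two-forward-only / six-trainable split described above (everything else is illustrative):

```python
import torch

def recurrent_rollout(block, x, n_forward_only: int = 2, n_trainable: int = 6):
    """Loop a shared block, backpropagating only through the last iterations.

    The first n_forward_only loops refine the representation without tracking
    gradients, so their noisy gradients never reach the optimizer; only the
    final n_trainable loops participate in backpropagation.
    """
    with torch.no_grad():               # forward-only segment
        for _ in range(n_forward_only):
            x = block(x)
    x = x.detach()                      # explicitly cut the graph
    for _ in range(n_trainable):        # trainable segment
        x = block(x)
    return x
```

Running the early loops under no_grad also means their activations are not stored for the backward pass, so the truncation saves memory in addition to stabilizing training.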

URM in action

Together, these architectural changes resulted in significant performance improvements over previous UT-based approaches. On the ARC-AGI-1 benchmark, URM achieved 53.8% pass@1, significantly outperforming TRM (40.0%) and HRM (34.4%). URM’s lead grew on ARC-AGI-2, where it reached 16.0% pass@1, nearly tripling HRM’s score and more than doubling TRM’s. The Sudoku benchmark showed a similar advantage, with URM reaching an accuracy of 77.6%.

Beyond raw accuracy, the results highlight the efficiency of the iterative approach. The researchers showed that a UT with only as many parameters as four basic transformer blocks can reach a pass@1 score of 40.0, dramatically outperforming a vanilla transformer with 32 distinct blocks.

The researchers also note that “simply scaling depth and width in vanilla transformers yields diminishing returns and can even lead to performance degradation. This highlights a fundamental inefficiency in how parameters are used to support multi-step inference.”

The researchers have published the URM code on GitHub.

URM outperforms other universal transformers on key reasoning benchmarks (Source: arXiv)

This finding shows that iterative computation is often more beneficial than simply adding independent layers. As the authors explain, “In a standard Transformer, the additional FLOPs are often spent on redundant refinement in upper layers, whereas in iterative computations, the same budget translates into an effective increase in depth.”

It is worth noting that URM and other universal transformers still lag far behind frontier models on reasoning benchmarks such as ARC-AGI. Poetiq recently developed an improved approach that reached 54% on ARC-AGI-2, far ahead of URM. Moreover, universal transformer models are trained specifically for ARC-AGI-style problems (even if they are not overfitted to a particular dataset), which makes them unsuitable for the general-purpose applications that frontier LLMs handle. Still, these experiments show how new architectures and approaches can tackle complex problems at a fraction of the compute and memory budgets previously required. It will be interesting to see what new research directions and applications this line of models leads to.

