
This article is part of our coverage of the latest AI research.
Ubiquant researchers have proposed a new deep learning architecture that improves the ability of AI models to solve complex reasoning tasks. Their architecture, the Universal Reasoning Model (URM), builds on the Universal Transformer (UT) framework that other research teams have used to tackle difficult benchmarks such as ARC-AGI and Sudoku.
While recent models such as hierarchical reasoning models (HRMs) and tiny recursive models (TRMs) highlight the potential of recurrent architectures, the Ubiquant team identified key areas where these models can be optimized. The resulting approach significantly improves reasoning performance compared to these existing small-scale reasoning models and achieves best-in-class results on reasoning benchmarks.
Universal transformers
To understand URM, you first need to understand Universal Transformer (UT) and how it differs from the standard architecture used by most large language models (LLMs). Standard transformer models process data by passing it through a stack of separate layers. Each layer has its own set of parameters.
In contrast, UT applies a single layer (often called a transition block) repeatedly in a loop to refine the token representations. This weight-sharing mechanism allows the model to perform iterative computation without increasing the number of parameters, making it theoretically more expressive for tasks that require deep, multi-step reasoning.
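To make the contrast concrete, here is a minimal PyTorch sketch of the two designs. The class names and the use of nn.TransformerEncoderLayer as the shared transition block are illustrative choices, not details taken from the URM codebase.

```python
import torch.nn as nn

class StandardTransformer(nn.Module):
    """Standard design: a stack of separate layers, each with its own weights."""
    def __init__(self, d_model, n_heads, n_layers):
        super().__init__()
        self.layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
             for _ in range(n_layers)]
        )

    def forward(self, x):
        for layer in self.layers:      # every iteration uses different parameters
            x = layer(x)
        return x

class UniversalTransformer(nn.Module):
    """Universal transformer: one shared transition block applied in a loop."""
    def __init__(self, d_model, n_heads, n_loops):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.n_loops = n_loops

    def forward(self, x):
        for _ in range(self.n_loops):  # the same parameters are reused every iteration
            x = self.block(x)
        return x
```

Both designs can spend the same amount of compute per forward pass, but the universal transformer does so with the parameters of a single block.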
Recent iterations of this concept, such as HRM and TRM, have shown that small-scale UT-based models can outperform much larger standard transformers on reasoning tasks.

However, the authors of the URM paper argue that the specific causes of these performance improvements have been misunderstood. While previous research attributed the success to sophisticated architectural design, the Ubiquant researchers found that the improvement primarily stems from the inductive bias introduced by recurrence in universal transformers. In other words, the advantage comes from the model’s ability to reuse the exact same parameters to iteratively refine its reasoning process.
Furthermore, their analysis revealed that nonlinear computation across depth plays a much larger role than previously realized. Specifically, the feedforward network (MLP), rather than the attention mechanism usually seen as the key component of the transformer block, constitutes the main source of the representational nonlinearity required for complex reasoning.
Strengthening the reasoning loop
Based on these insights, URM introduces two key innovations to the UT framework: the ConvSwiGLU module and Truncated Backpropagation Through Loops (TBPTL).

The first innovation addresses a limitation of the standard SwiGLU activation function used in the MLP blocks of modern transformers. SwiGLU provides the necessary nonlinearity, but it is a pointwise operation that processes every token independently, so no information is mixed between tokens within the MLP. The researchers enhanced it by inserting a short depthwise convolution inside the MLP block. This ConvSwiGLU mechanism introduces local context interactions, letting each token mix information with its neighbors in the sequence without significantly increasing computational cost.
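The sketch below shows one way such a module could look in PyTorch, assuming the depthwise convolution is applied along the sequence dimension to the gated hidden states. The class name, layer shapes, and exact placement of the convolution are assumptions for illustration, not details from the URM implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class ConvSwiGLU(nn.Module):
    """SwiGLU feedforward block with a short depthwise convolution over the
    sequence dimension, so neighboring tokens can exchange information inside
    the MLP (placement of the convolution is an assumption)."""
    def __init__(self, d_model, d_hidden, kernel_size=3):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_hidden, bias=False)
        self.w_up = nn.Linear(d_model, d_hidden, bias=False)
        self.w_down = nn.Linear(d_hidden, d_model, bias=False)
        # Depthwise convolution: groups == channels, so each channel is
        # filtered independently over a short window of neighboring tokens.
        self.dwconv = nn.Conv1d(d_hidden, d_hidden, kernel_size,
                                padding=kernel_size // 2, groups=d_hidden)

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        h = F.silu(self.w_gate(x)) * self.w_up(x)   # standard SwiGLU gating
        h = self.dwconv(h.transpose(1, 2)).transpose(1, 2)  # mix local token context
        return self.w_down(h)
```

Because the convolution is depthwise and its kernel is short, the extra cost is small compared to the linear projections, while each token’s hidden state now depends on its immediate neighbors.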
The second innovation, Truncated Backpropagation Through Loops, addresses the training instability inherent in recurrent models. Because URM relies on looping through the same layer, deep reasoning requires many iterations, and computing gradients over long chains of loops can lead to noise accumulation and optimization problems. To solve this, the researchers split the rollout into a forward-only segment and a trainable segment.
In their setup with eight inner loops, the researchers found that the best balance was to run the first two loops forward-only and compute gradients only for the last six. This technique stabilizes training by ignoring noisy gradients from the early stages of the iterative process, allowing the model to focus learning on the later stages of refinement.
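Here is a minimal PyTorch sketch of this idea, assuming the split is implemented by disabling gradient tracking for the first iterations; the function and argument names are hypothetical and not taken from the URM code.

```python
import torch

def tbptl_rollout(block, x, n_loops=8, n_forward_only=2):
    """Truncated backpropagation through loops (sketch): the first iterations
    run without gradient tracking, so only the remaining iterations
    contribute to the backward pass."""
    with torch.no_grad():
        for _ in range(n_forward_only):         # forward-only segment
            x = block(x)
    for _ in range(n_loops - n_forward_only):   # trainable segment
        x = block(x)
    return x
```

During training, the loss is computed on the output of this rollout, so the optimizer only sees gradients from the later, trainable iterations.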
URM in action
Together, these architectural changes resulted in significant performance improvements over previous UT-based approaches. On the ARC-AGI 1 benchmark, URM achieved 53.8% in a single attempt (pass@1), significantly outperforming TRM (40.0%) and HRM (34.4%). URM’s lead was even bigger on ARC-AGI 2, where it scored 16.0% pass@1, nearly tripling HRM’s score and more than doubling TRM’s. The Sudoku benchmark showed a similar advantage, with URM reaching 77.6% accuracy.
Beyond raw accuracy, the results highlight the efficiency of the iterative approach. The researchers showed that a UT with only four transformer blocks’ worth of parameters can achieve a pass@1 score of 40.0, dramatically outperforming a vanilla transformer with 32 blocks.
The researchers also note that “simply scaling depth and width in vanilla transformers yields diminishing returns and can even lead to performance degradation. This highlights a fundamental inefficiency in how parameters are used to support multi-step inference.”
The researchers have published the URM code on GitHub.

This finding shows that iterative computation is often more beneficial than simply adding independent layers. As the authors explain, “In a standard Transformer, the additional FLOPs are often spent on redundant refinement in upper layers, whereas in iterative computations, the same budget translates into an effective increase in depth.”

It is worth noting that URM and other UT-based models are still far behind frontier models on reasoning benchmarks such as ARC-AGI. Poetiq recently presented an approach that scores 54% on ARC-AGI-2, significantly outperforming URM. Moreover, universal transformer models are trained specifically for ARC-AGI-style problems (even if they are not overfit to a particular dataset), which makes them unsuitable for the general-purpose applications that frontier LLMs tackle.

Still, these are early experiments that show how new architectures and approaches can address complex problems at a fraction of the compute and memory budgets previously required. It will be interesting to see which new research directions and applications universal transformer models lead to.
