RG-LRU: A breakthrough recurrent layer that redefines the efficiency of NLP models

By Adnan Mahar | January 13, 2025


Authors:

(1) Soham De, Google DeepMind, equal contribution;

(2) Samuel L. Smith, Google DeepMind, equal contribution;

(3) Anushan Fernando, Google DeepMind, equal contribution;

(4) Aleksandar Botev, Google DeepMind, equal contribution;

(5) George Cristian-Muraru, Google DeepMind, equal contribution;

(6) Albert Gu, work done while at Google DeepMind;

(7) Ruba Haroun, Google DeepMind;

(8) Leonard Berrada, Google DeepMind;

(9) Yutian Chen, Google DeepMind;

(10) Srivatsan Srinivasan, Google DeepMind;

(11) Guillaume Desjardins, Google DeepMind;

(12) Arnaud Doucet, Google DeepMind;

(13) David Budden, Google DeepMind;

(14) Yee Whye Teh, Google DeepMind;

(15) Razvan Pascanu, Google DeepMind;

(16) Nando de Freitas, Google DeepMind;

(17) Caglar Gulcehre, Google DeepMind.

Table of Links

1. Introduction

2. Model Architecture

3. Recurrent Models Can Scale as Efficiently as Transformers

3.1. Scaling Curves

3.2. Evaluation on Downstream Tasks

4. Efficient Training of Recurrent Models on Device and 4.1. Model Parallelism for Large-Scale Training

4.2. Efficient Linear Recurrences on Device

4.3. Training Speed on Longer Sequences

5. Inference Speed

5.1. A Simple Model of the Decode Step

5.2. Results

6. Long Context Modeling and 6.1. Improving Next Token Prediction with Longer Contexts

6.2. Copy and Retrieval Capabilities

7. Related Works

8. Conclusion, Acknowledgments, and References

A. RG-LRU Recurrence Gate

B. Complex-Gated Linear Recurrent Unit (CG-LRU)

C. Model-Scale Hyperparameters

D. Efficient Linear Recurrences on Device

E. Griffin's Local Attention Window Size

F. Inference Speed

G. Improving Next Token Prediction with Longer Contexts: Additional Results

H. Additional Details for Copy and Retrieval Tasks

2. Model architecture

All of the models contain the following components: (i) a residual block, (ii) an MLP block, and (iii) a temporal-mixing block. While (i) and (ii) are the same across all models, we consider three temporal-mixing blocks: global Multi-Query Attention (MQA), local (sliding-window) MQA, and the proposed recurrent block. As part of the recurrent block, we use the Real-Gated Linear Recurrent Unit (RG-LRU), a new recurrent layer inspired by the Linear Recurrent Unit (Orvieto et al., 2023b).

The residual block, shown in Figure 2(a), defines the global structure of the model and is inspired by pre-norm Transformers (Xiong et al., 2020). After embedding the input sequence, we pass it through 𝑁 such blocks (where 𝑁 denotes the depth of the model) and then apply RMSNorm (Zhang and Sennrich, 2019) to produce the final activations. To compute token probabilities, we apply a final linear layer followed by a softmax. This layer's weights are shared with the input embedding layer.
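As a rough illustration of this backbone, here is a minimal NumPy sketch. It is not the authors' implementation; the names rms_norm, forward, and final_gain are placeholders, and the residual blocks are assumed to be provided as callables.

import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Normalize each feature vector by its root-mean-square, then rescale.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def forward(token_ids, embed, blocks, final_gain):
    # embed: (vocab, D) embedding table, shared with the output (readout) layer.
    x = embed[token_ids]                              # embed the input sequence -> (T, D)
    for block in blocks:                              # pass through N residual blocks
        x = block(x)
    x = rms_norm(x, final_gain)                       # final RMSNorm on the activations
    logits = x @ embed.T                              # tied weights: reuse the embedding matrix
    logits -= logits.max(axis=-1, keepdims=True)      # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)  # per-position token probabilities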

2.1. Residual Block

The residual block contains two components that are applied in sequence. The first component takes the hidden state 𝑥 and applies RMSNorm (Zhang and Sennrich, 2019), followed by the temporal-mixing block.

Figure 2 | a) The main backbone of our model architecture is the residual block, which is stacked 𝑁 times. b) The gated MLP block that we use. c) The recurrent block proposed as an alternative to Multi-Query Attention (MQA). It uses our proposed RG-LRU layer, defined in Section 2.4.

Next, we merge the output with a skip connection from 𝑥 through addition. Similarly, the second component applies RMSNorm, followed by an MLP block, and merges its output with a skip connection from the input of the RMSNorm. This block is shown in Figure 2(a).
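For concreteness, the two sub-components and their skip connections can be sketched as follows. This is a simplified illustration: norm1, norm2, temporal_mix, and mlp are placeholder callables standing in for the RMSNorm layers, the temporal-mixing block, and the gated MLP block.

def residual_block(x, norm1, norm2, temporal_mix, mlp):
    # First component: RMSNorm -> temporal-mixing block -> add skip connection from x.
    h = x + temporal_mix(norm1(x))
    # Second component: RMSNorm -> MLP block -> add skip connection from the RMSNorm input h.
    return h + mlp(norm2(h))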

2.2. MLP Block

We use a gated MLP block (Dauphin et al., 2017), shown in Figure 2(b), which creates two branches from an input of dimension 𝐷. We apply a linear layer with output dimension 𝑀𝐷 to each branch, where 𝑀 denotes the expansion factor; for simplicity, we use 𝑀 = 3 throughout this work. Similar to GeGeLU (Shazeer, 2020), we apply a GeLU non-linearity (Hendrycks and Gimpel, 2016) to one of the branches before merging them via element-wise multiplication. The MLP block then applies a final linear layer with output dimension 𝐷 to the output of the GeGeLU layer.
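A minimal sketch of this branch structure is shown below. The weight names (W_gate, W_up, W_down) are illustrative, and biases are omitted for brevity; only the shapes and the GeGeLU-style gating follow the description above.

import numpy as np

def gelu(x):
    # tanh approximation of the GeLU non-linearity (Hendrycks and Gimpel, 2016)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gated_mlp(x, W_gate, W_up, W_down):
    # x: (T, D); W_gate, W_up: (D, M*D); W_down: (M*D, D), with expansion factor M = 3.
    gate = gelu(x @ W_gate)            # branch passed through the GeLU non-linearity
    linear = x @ W_up                  # plain linear branch
    return (gate * linear) @ W_down    # element-wise merge, then final linear layer back to D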

2.3. Temporal-Mixing Blocks

The temporal-mixing block is the component of the model that aggregates hidden-layer activations at different temporal positions within a sequence. We consider three temporal-mixing blocks: global MQA (Shazeer, 2019), local MQA (Beltagy et al., 2020), and our proposed recurrent block.

Local sliding-window attention. One of the main drawbacks of global attention is that its computational complexity grows quadratically with the sequence length. To address this, several works have adopted local attention (Beltagy et al., 2020), also known as sliding-window attention, which allows each position to attend to only a fixed number of tokens in the past. This not only reduces the FLOPs of the computation but also bounds the size of the KV cache to the size of the window, so the cost is no longer quadratic in the sequence length. All other details are the same as for global MQA.
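The only difference from global attention is the mask: each position may attend to at most W previous tokens. A simple sketch of such a causal sliding-window mask is given below (the function name and window convention are illustrative; real kernels also handle the rolling KV cache, which is not shown).

import numpy as np

def sliding_window_mask(T, W):
    # mask[i, j] is True iff position i may attend to position j: j must not be in
    # the future and must lie within the most recent W positions (including i itself).
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    return (j <= i) & (j > i - W)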

2.4. Real-Gated Linear Recurrent Unit (RG-LRU)

The RG-LRU layer we propose has a simple recurrence inspired by the Linear Recurrent Unit (LRU) (Orvieto et al., 2023b), but incorporates a gating mechanism motivated by the literature on non-linear RNNs, in particular LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Chung et al., 2014). The equations describing the layer are as follows:
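(The equation image from the original paper did not survive this repost. The following is a reconstruction of the RG-LRU recurrence based on the original Griffin paper; minor notational details may differ from the published version.)

\begin{aligned}
r_t &= \sigma(W_a x_t + b_a) && \text{(recurrence gate)} \\
i_t &= \sigma(W_x x_t + b_x) && \text{(input gate)} \\
a_t &= a^{\,c\,r_t}, \qquad a = \sigma(\Lambda) \\
h_t &= a_t \odot h_{t-1} + \sqrt{1 - a_t^2} \odot (i_t \odot x_t)
\end{aligned}

Here σ denotes the sigmoid function, ⊙ element-wise multiplication, Λ a learnable parameter, and c a fixed scalar constant; the output of the layer is y_t = h_t.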

(1) We anticipate that complex numbers could be useful in other modalities; we provide more information about a complex-valued version of the RG-LRU layer in Appendix B.
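A minimal NumPy sketch of the recurrence above, under the same reconstruction, is shown below. The function name rg_lru, the weight shapes, and the default value c = 8.0 are illustrative assumptions drawn from my reading of the paper, not the authors' optimized implementation.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rg_lru(x, W_a, b_a, W_x, b_x, Lam, c=8.0):
    # x: (T, D); W_a, W_x: (D, D); b_a, b_x, Lam: (D,).
    # a = sigmoid(Lam) in (0, 1) is the per-channel base decay; c is a scalar constant
    # (c = 8.0 here follows the value reported in the paper and is an assumption).
    a = sigmoid(Lam)
    h = np.zeros(x.shape[1])
    out = np.zeros_like(x)
    for t in range(x.shape[0]):                   # sequential scan over time
        r = sigmoid(x[t] @ W_a + b_a)             # recurrence gate
        i = sigmoid(x[t] @ W_x + b_x)             # input gate
        a_t = a ** (c * r)                        # gated per-channel decay in (0, 1)
        h = a_t * h + np.sqrt(1.0 - a_t**2) * (i * x[t])
        out[t] = h                                # the layer output is y_t = h_t
    return out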



Adnan Mahar
Adnan is a passionate doctor from Pakistan with a keen interest in exploring the world of politics, sports, and international affairs. As an avid reader and lifelong learner, he is deeply committed to sharing insights, perspectives, and thought-provoking ideas. His journey combines a love for knowledge with an analytical approach to current events, aiming to inspire meaningful conversations and broaden understanding across a wide range of topics.
