Authors:
(1) Soham De, Google DeepMind, with equal contribution.
(2) Samuel L. Smith, Google DeepMind, with equal contribution.
(3) Anushan Fernando, Google DeepMind, with equal contribution.
(4) Aleksandar Botev, Google DeepMind, with equal contribution.
(5) George Cristian-Muraru, Google DeepMind, with equal contribution.
(6) Albert Gu, work done while at Google DeepMind.
(7) Ruba Haroun, Google DeepMind.
(8) Leonard Berrada, Google DeepMind.
(9) Yutian Chen, Google DeepMind.
(10) Srivatsan Srinivasan, Google DeepMind.
(11) Guillaume Desjardins, Google DeepMind.
(12) Arnaud Doucet, Google DeepMind.
(13) David Budden, Google DeepMind.
(14) Yee Whye Teh, Google DeepMind.
(15) Razvan Pascanu, Google DeepMind.
(16) Nando de Freitas, Google DeepMind.
(17) Caglar Gulcehre, Google DeepMind.
Table of Links
1 Introduction
2 Model architecture
3 Recurrent models can scale as efficiently as transformers
3.1. Scaling curves
3.2. Evaluation on downstream tasks
4 Training Recurrent Models Efficiently on Device and 4.1. Model Parallelism for Large-Scale Training
4.2. Efficient linear recurrences on device
4.3. Training speed on longer sequences
5. Inference speed
5.1. A simple model of the decode step
5.2. Results
6. Long Context Modeling and 6.1. Improving Next Token Prediction with Longer Contexts
6.2. Copy and retrieval capabilities
7. Related works
8. Conclusion, Acknowledgments, and References
A. RG-LRU Recurrence Gate
B. Complex-Gated Linear Recurrent Unit (CG-LRU)
C. Model-scale hyperparameters
D. Efficient linear recurrences on device
E. Griffin’s Local Attention Window Size
F. Inference speed
G. Improving Next Token Prediction with Longer Context: Additional Results
H. Additional details for copy and retrieval tasks
2. Model architecture
All our models contain the following components: (i) a residual block, (ii) an MLP block, and (iii) a temporal-mixing block. While (i) and (ii) are the same across all models, we consider three temporal-mixing blocks: global Multi-Query Attention (MQA), local (sliding-window) MQA, and our proposed recurrent block. As part of the recurrent block we use the Real-Gated Linear Recurrent Unit (RG-LRU), a novel recurrent layer inspired by the Linear Recurrent Unit (Orvieto et al., 2023b).
The residual block, shown in Figure 2(a), defines the global structure of the model and is inspired by pre-norm Transformers (Xiong et al., 2020). After embedding the input sequence, we pass it through 𝑁 such blocks (where 𝑁 denotes the depth of the model) and then apply RMSNorm (Zhang and Sennrich, 2019) to produce the final activations. To compute the token probabilities, we apply a final linear layer followed by a softmax. The weights of this layer are shared with the input embedding layer.
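To make this concrete, the global structure can be sketched in JAX as follows. The parameter layout and the residual_block argument are illustrative placeholders rather than the actual implementation; the later sketches in this section reuse the imports and the rms_norm helper defined here.

import jax
import jax.numpy as jnp

def rms_norm(x, scale, eps=1e-6):
    # RMSNorm (Zhang and Sennrich, 2019): rescale by the root-mean-square of the activations.
    return x * scale / jnp.sqrt(jnp.mean(jnp.square(x), axis=-1, keepdims=True) + eps)

def language_model(params, token_ids, residual_block):
    # Embed the tokens, apply N residual blocks, a final RMSNorm, and a linear
    # readout whose weights are tied to the input embedding table.
    x = params['embedding'][token_ids]        # (sequence length, D)
    for block_params in params['blocks']:     # N residual blocks
        x = residual_block(block_params, x)
    x = rms_norm(x, params['final_scale'])
    logits = x @ params['embedding'].T        # weight sharing with the embedding layer
    return jax.nn.softmax(logits, axis=-1)    # token probabilities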
2.1. Residual block
The residual block contains two components, applied in order. The first component takes the hidden state 𝑥 and applies an RMSNorm (Zhang and Sennrich, 2019), followed by the temporal-mixing block.
We then merge the output with a skip connection from 𝑥 through addition. Similarly, the second component applies RMSNorm, followed by the MLP block, and then merges its output with a skip connection from the input of the RMSNorm. This block is shown in Figure 2(a).
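A minimal sketch of this block, reusing the rms_norm helper above and taking the temporal-mixing and MLP blocks as arguments (the parameter names are again illustrative):

def residual_block(p, x, temporal_mixing_block, mlp_block):
    # First component: RMSNorm, then the temporal-mixing block, merged with a skip connection from x.
    x = x + temporal_mixing_block(p['mixing'], rms_norm(x, p['scale_1']))
    # Second component: RMSNorm, then the MLP block, merged with a skip connection from the RMSNorm input.
    x = x + mlp_block(p['mlp'], rms_norm(x, p['scale_2']))
    return x

In the language_model sketch above, this function would be passed in with the temporal-mixing and MLP blocks already bound, for example via functools.partial.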
2.2. MLP Block
We use a gated MLP block (Dauphin et al., 2017) (shown in Figure 2(b)), which creates two branches from its input of dimension 𝐷. We apply a linear layer with output dimension 𝑀𝐷 on each branch, where 𝑀 denotes the expansion factor. For simplicity, we use 𝑀 = 3 throughout this work. Similar to GeGeLU (Shazeer, 2020), we apply a GeLU non-linearity (Hendrycks and Gimpel, 2016) on one of the branches before merging them by element-wise multiplication. However, our MLP block applies a final linear layer with output dimension 𝐷 to the output of the GeGeLU layer.
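In code, the gated MLP amounts to the following, with illustrative weight names: W_gate and W_value map dimension 𝐷 to 𝑀𝐷, and W_out maps 𝑀𝐷 back to 𝐷.

def mlp_block(p, x):
    # Two branches of width M*D; GeLU on one branch, then element-wise
    # multiplication (GeGeLU), followed by a final linear layer back to dimension D.
    gate = jax.nn.gelu(x @ p['W_gate'])    # (T, M*D)
    value = x @ p['W_value']               # (T, M*D)
    return (gate * value) @ p['W_out']     # (T, D)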
2.3. Temporal-mixing blocks
The temporal-mixing block is the component of the model that aggregates hidden-layer activations at different temporal locations in the sequence. We consider three temporal-mixing blocks: global MQA (Shazeer, 2019), local MQA (Beltagy et al., 2020), and our proposed recurrent block.
Local sliding window attention. One of the key drawbacks of global attention is that its computational complexity grows quadratically with the sequence length. To address this, several works have adopted local attention (Beltagy et al., 2020), also known as sliding window attention, which allows each position to attend only to a fixed number of tokens in the past. This not only reduces the computational FLOPs but also bounds the size of the KV cache to the size of the window, so that it is no longer quadratic in the sequence length. All other details are the same as for global MQA.
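As an illustration, this constraint corresponds to a causal band mask over the attention scores; window_size here stands for the fixed number of past tokens (including the current one) that each position may attend to.

def sliding_window_mask(seq_len, window_size):
    # Position t may attend only to positions in [t - window_size + 1, t]:
    # causal, and at most window_size tokens into the past.
    q = jnp.arange(seq_len)[:, None]   # query positions
    k = jnp.arange(seq_len)[None, :]   # key positions
    return (k <= q) & (k > q - window_size)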
2.4. Real-Gated Linear Recurrent Unit (RG-LRU)
Our proposed RG-LRU layer has a simple recurrence inspired by the Linear Recurrent Unit (LRU) (Orvieto et al., 2023b), but incorporates a gating mechanism motivated by the literature on non-linear RNNs, in particular LSTMs (Hochreiter and Schmidhuber, 1997) and GRUs (Chung et al., 2014). The equations describing the layer are as follows:
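r_t = σ(W_a x_t + b_a)   (recurrence gate)
i_t = σ(W_x x_t + b_x)   (input gate)
a_t = a^(c·r_t)
h_t = a_t ⊙ h_{t−1} + √(1 − a_t²) ⊙ (i_t ⊙ x_t)

Here σ is the sigmoid function, ⊙ denotes element-wise multiplication, and the output of the layer is y_t = h_t. The recurrent weight a is diagonal, so all operations are element-wise; it is parameterized as a = σ(Λ) for a learnable parameter Λ, which guarantees 0 ≤ a ≤ 1 and keeps the recurrence stable, while c is a scalar-valued constant.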
(1) We discuss the use of complex numbers, including their potential benefits in other modalities, and provide more details on complex-valued versions of the RG-LRU layer in Appendix B.
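A minimal sketch of this recurrence in JAX, continuing the illustrative parameter naming used above: the per-channel weights W_a, b_a, W_x, b_x and the parameter Λ are placeholders, the decay a_t is computed directly rather than in log-space, and the recurrence is unrolled with jax.lax.scan for clarity rather than with the efficient on-device implementation discussed in Section 4.2.

def rg_lru(p, x, c=8.0):
    # x has shape (T, D); the gates and the recurrence all act element-wise per channel.
    r = jax.nn.sigmoid(x @ p['W_a'] + p['b_a'])     # recurrence gate
    i = jax.nn.sigmoid(x @ p['W_x'] + p['b_x'])     # input gate
    a = jax.nn.sigmoid(p['Lambda']) ** (c * r)      # per-channel, per-step decay in (0, 1)
    gated_x = i * x

    def step(h, inputs):
        a_t, gx_t = inputs
        h = a_t * h + jnp.sqrt(1.0 - a_t ** 2) * gx_t
        return h, h

    _, y = jax.lax.scan(step, jnp.zeros(x.shape[-1]), (a, gated_x))
    return y                                        # y_t = h_t, shape (T, D)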