Table of Links
1.2 Hardness of long compositions
1.3 Hardness of global reasoning
Results on the local reasoning barrier
2.1 Defining locality and auto-regressive locality
Scratchpads to break the locality
C. Experiment and implementation details
F. Discussion on circuit complexity connections
G. More experiments with ChatGPT
C Experiment and implementation details
C.1 Architecture and datasets
In all of the experiments, we use GPT-2-style [29] decoder-only Transformers and train them from scratch in an autoregressive manner with the cross-entropy loss and the AdamW [90] optimizer. Our implementation uses the PyTorch framework [91] and is mostly built on NanoGPT's implementation [92]. In particular, our Transformers use causal attention masking and absolute learnable positional embeddings. For most experiments, we use a small model with 6 layers, 6 heads, and an embedding dimension of 384, which results in approximately 10M parameters.[12] We only change the model size in Figure 1b, where we use models with 8 layers, 8 heads, and an embedding dimension of 512 (approximately 25M parameters), and with 12 layers, 12 heads, and an embedding dimension of 768 (roughly 85M parameters).
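To make this setup concrete, below is a minimal sketch of a comparable decoder-only model in PyTorch. It is not the authors' code (which builds on NanoGPT), and the vocabulary size, context length, and learning rate are placeholder assumptions.

```python
# Minimal GPT-2-style decoder-only Transformer sketch: 6 layers, 6 heads,
# embedding dimension 384, learnable absolute positional embeddings, causal
# masking, cross-entropy loss, and AdamW. Vocabulary size, block size, and
# learning rate are illustrative placeholders, not values from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x, attn_mask):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a
        return x + self.mlp(self.ln2(x))

class TinyGPT(nn.Module):
    def __init__(self, vocab_size=2048, block_size=256,
                 n_layers=6, n_heads=6, d_model=384):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(block_size, d_model)  # absolute, learnable
        self.blocks = nn.ModuleList(Block(d_model, n_heads) for _ in range(n_layers))
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx, targets=None):
        B, T = idx.shape
        x = self.tok_emb(idx) + self.pos_emb(torch.arange(T, device=idx.device))
        # additive causal mask: -inf above the diagonal blocks attention to future tokens
        causal = torch.triu(torch.full((T, T), float("-inf"), device=idx.device), diagonal=1)
        for blk in self.blocks:
            x = blk(x, causal)
        logits = self.head(self.ln_f(x))
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

model = TinyGPT()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
```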
For the cycles task, we use 1000 node names formatted as v123. For simplicity of analysis, we regard each node name as a single token. For all other tasks, and for everything other than node names, we treat each character as a single token.
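As an illustration of this tokenization (a sketch under our own assumptions about the exact string format, not the released code), node names such as v123 can be matched as whole tokens while everything else falls back to characters:

```python
# Illustrative tokenizer: node names like "v123" become single tokens, every
# other character is its own token. The edge syntax in the example is hypothetical.
import re

NODE_PATTERN = re.compile(r"v\d+")  # node names v0 ... v999

def tokenize(text: str) -> list[str]:
    tokens, i = [], 0
    while i < len(text):
        m = NODE_PATTERN.match(text, i)
        if m:                          # whole node name as one token
            tokens.append(m.group())
            i = m.end()
        else:                          # character-level fallback
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("v12>v345 v7>v12"))
# ['v12', '>', 'v345', ' ', 'v7', '>', 'v12']
```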
For the length generalization experiments, we noticed that the model's performance depends on the random seed, so we repeated each experiment 10 times and report the median in the plots (along with the individual runs). For the other experiments, we did not observe much variation between seeds; we repeated each experiment 5 to 10 times and report 95% confidence intervals. We ran our experiments on various NVIDIA GPUs, including H100, A100, and RTX 4090 devices. We estimate the total runtime of the experiments presented in this paper to be around 200 hours (excluding hyperparameter search).
Our code is publicly available at https://github.com/aryol/inductive-scratchpad.
C.1.1 Hyperparameter tuning and sensitivity
In general, we tried different values for the learning rate (with different schedules), batch size, dropout, and weight decay, and for each task we used the hyperparameters that gave the most stable and fastest training. The most significant sensitivity is to the batch size: we often find that larger batch sizes help training, to the point that some tasks cannot be learned with batch sizes smaller than 256.[13] In the length generalization experiments, we also observed sensitivity to the dropout and weight decay parameters. Generally, strong regularization can increase the model's uncertainty; given the large number of tokens in the scratchpads of the length generalization experiments and their sensitivity, this uncertainty can increase the probability of error. Of course, weak regularization can also result in worse generalization. In addition, for most experiments we used either fresh samples or a rather large number of samples, as our objective is mostly to measure time complexity or OOD generalization. The exact hyperparameter values for each task are available in our code.
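To illustrate the effective-batch-size point (and footnote 13 below), the following fragment continues the TinyGPT sketch above; `loader` is a hypothetical iterator over (input, target) micro-batches of size 64, and the numbers are assumptions rather than the paper's settings.

```python
# Gradient accumulation sketch: 4 micro-batches of 64 emulate a batch size of 256
# with less memory. `model` and `optimizer` come from the earlier sketch; `loader`
# is assumed to yield (x, y) token tensors of shape (64, T).
accum_steps = 4
optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    _, loss = model(x, y)
    (loss / accum_steps).backward()      # scale so accumulated gradients match one large batch
    if (step + 1) % accum_steps == 0:
        optimizer.step()                 # one AdamW update per effective batch of 256
        optimizer.zero_grad()
```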
C.2 Implementation of the inductive scratchpad
Here we describe the implementation of the inductive scratchpad. Assume that the question and scratchpad sequence are given by Q#s[1]@s[2]@...@s[K]. where Q denotes the question tokens, the special token # marks the beginning of the scratchpad, s[1], . . . , s[K] are the states of the scratchpad separated by the special token @, and the final '.' token marks the end of the generation. For the scratchpad to be inductive, each state should be generated from the question and the previous state alone, in the same manner for every state.

First, we provide two solutions for training time. One simple way is to break the original sequence Q#s[1]@s[2]@...@s[K]. into K shorter sequences

Q#s[1]@,   Q#s[1]@s[2]@,   Q#s[2]@s[3]@,   . . . ,   Q#s[K−1]@s[K].

so that each new state is predicted from the question and the previous state only. We also need to make sure that no loss is computed for the Q#s[i]@ prefix of these sequences, i.e., the loss is only taken on the tokens of the new state in each sequence.

The second solution packs the same information into a single training sequence. Each state is also given as the input for generating the next state, so that a state and the state generated from it form one attention group, while the question tokens form group 0; no loss is computed on the question tokens or on the repeated input states. Moreover, the positional indices are reindexed so that the question keeps positions 1, . . . , t and the tokens of every group restart at position t + 1, where we have assumed that t tokens have appeared before the first state s[1]. Note that this number may vary across samples. We can easily create an attention mask based on the attention groups. All groups only attend to tokens in their group and tokens in group 0 (everything that comes before the first state, i.e., the question tokens). In this way, every state is generated from the question and the previous state, and it always sees them at the same positions.

Now, we discuss token generation. Assume that we are generating the tokens of the i-th state s[i], i.e., Q#s[1]@...@s[i−1]@ has been generated so far. To keep the generation inductive, the model only needs the question and the previous state in its context: whenever a state is completed (the model emits the @ token), the context can be reset to the question followed by the state that was just generated, with positions reindexed as above, so that every state is generated under identical conditions. Generation stops when the model outputs the end token '.' instead of @.
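As an illustration, here is a minimal sketch of one way to build such a group-based attention mask (an illustrative realization, not necessarily the authors' exact implementation; the group layout in the example is hypothetical):

```python
import torch

def group_attention_mask(groups: list[int]) -> torch.Tensor:
    """Boolean mask (True = may attend); rows are queries, columns are keys.

    groups[k] is the attention group of token k: 0 for the question tokens and
    i for the tokens of the i-th (input state, new state) block of the scratchpad.
    Each token attends causally to its own group and to group 0 only.
    """
    g = torch.tensor(groups)
    n = g.numel()
    same_group = g.unsqueeze(1) == g.unsqueeze(0)            # query and key share a group
    to_question = (g == 0).unsqueeze(0).expand(n, n)         # every token may see group 0
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))  # no attention to future tokens
    return (same_group | to_question) & causal

# Hypothetical toy layout: question "Q Q #" -> group 0, "s[1] @" -> group 1,
# "s[1](copy) @ s[2] @" -> group 2, and so on.
groups = [0, 0, 0, 1, 1, 2, 2, 2, 2]
mask = group_attention_mask(groups)
print(mask.int())
# Note: nn.MultiheadAttention uses the opposite convention for boolean masks
# (True = *not* allowed to attend), so pass `~mask` there.
```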
C.3 Comparison with relative positional embeddings
In this section, we discuss whether it is possible to induce an inductive structure for the scratchpad using relative positional embeddings [46, 47] instead of the explicit inductive scratchpad format introduced in this paper. For an inductive structure to work, we want each state to be generated using only the tokens of the previous state and the question, in an identical manner across states.

More precisely, assume as before that the question and scratchpad are given by Q#s[1]@s[2]@...@s[K]. and that each state consists of T tokens. Ideally, the tokens of each state s[i] should attend only to the question tokens and the tokens of the previous state s[i−1], and should do so in the same way for every i.
There are a few obstacles, however:

(1) The distance between the states and the question tokens increases as each state is generated. However, for an inductive structure we want the attention pattern between the states and the question not to change. One could think of using an encoder-decoder architecture where the question is placed in the encoder and the cross-attention between the states and the question ignores the positions of the state tokens in the decoder. However, even this approach loses some information: the attention between the scratchpad tokens and the question tokens can no longer use the position of the scratchpad tokens within a state.

(2) Most relative positional embeddings [48, 49, 50] allow tokens to attend to all other tokens. Even if one uses an attention mechanism that limits attention to the D previous tokens, after L Transformer layers, tokens at distance up to DL can influence each other. So, in general, it is likely that the tokens of s[i] attend not only to the tokens of s[i−1] but also to tokens of other states, which hinders the inductive behavior.

(3) Similarly, assume that the T-th (last) token of s[i] attends to all tokens of s[i−1], including its first token. Note that the distance between the last token of s[i] and the first token of s[i−1] is 2T − 1. As a result, the first token of s[i] would, in the same manner, also attend to all tokens of s[i−2] except the first one (because the distance between the first token of s[i] and the second token of s[i−2] is also 2T − 1), which is undesirable; the short check below illustrates this.
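A quick numerical check of these two distances (with an arbitrary choice of T and 0-indexed positions; the layout assumes T consecutive tokens per state):

```python
# Verify the distance claims in obstacle (3) for states laid out consecutively,
# T tokens each. Positions are 0-indexed; any question offset cancels in the differences.
T = 4

def first(i):   # position of the first token of state s[i]
    return i * T

def last(i):    # position of the last token of state s[i]
    return i * T + T - 1

i = 5
assert last(i) - first(i - 1) == 2 * T - 1         # last token of s[i] vs. first token of s[i-1]
assert first(i) - (first(i - 2) + 1) == 2 * T - 1  # first token of s[i] vs. second token of s[i-2]
print("both relative distances equal", 2 * T - 1)
```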
Note that in the analysis above we assumed a fixed number of tokens per state. If the number of tokens per state varies, issues like (2) and (3) above become even harder to handle for a model that relies only on relative positional embeddings. To sum up, there is a minor similarity between relative positional embeddings and the idea of induction in the scratchpad: factors such as pretraining and the recency bias of attention may promote inductive behavior in the model to some extent. Nevertheless, there is no reason to think that a model can implement inductive behavior for a general inductive task by relying only on relative positional embeddings. In general, one can use the inductive scratchpad idea along with relative positional embeddings to get better length generalization: the inductive scratchpad generalizes in the number of induction steps, whereas understanding an input of growing length still requires relying on relative positional embeddings.
As an alternative solution, independent of whether the positional embeddings are absolute or relative, one can use special tokens to mark the beginning of the scratchpad and the boundaries between states (as the # and @ tokens do in our inductive scratchpad format) and enforce the inductive structure explicitly through the attention masking and position reindexing described in Section C.2.
:::info Authors:
(1) Emmanuel Abbe, Apple and EPFL;
(2) Samy Bengio, Apple;
(3) Aryo Lotfi, EPFL;
(4) Colin Sandon, EPFL;
(5) Omid Saremi, Apple.
:::
:::info This paper is available on arXiv under a CC BY 4.0 license.
:::
[12] The exact number depends on the task and its vocabulary size.
[13] For some experiments, we increased the number of gradient accumulation steps instead of the batch size to obtain the same effect with less memory consumption.