Multi-Token Prediction: Mastering Algorithmic Reasoning with Enhanced Resource Use

Discover how multi-token prediction improves LLM algorithmic reasoning, potentially by learning to allocate computational resources more efficiently across token positions, as explored through experiments with pause tokens.


This content originally appeared on HackerNoon and was authored by Cosmological thinking: time, space and universal causation

Abstract and 1. Introduction

2. Method

3. Experiments on real data

4. Ablations on synthetic data

5. Why does it work? Some speculation

6. Related work

7. Conclusion, Impact statement, Environmental impact, Acknowledgements and References

A. Additional results on self-speculative decoding

B. Alternative architectures

C. Training speeds

D. Finetuning

E. Additional results on model scaling behavior

F. Details on CodeContests finetuning

G. Additional results on natural language benchmarks

H. Additional results on abstractive text summarization

I. Additional results on mathematical reasoning in natural language

J. Additional results on induction learning

K. Additional results on algorithmic reasoning

L. Additional intuitions on multi-token prediction

M. Training hyperparameters

K. Additional results on algorithmic reasoning

We investigate the following computation-sharing hypothesis for explaining the efficacy of multi-token prediction as training loss.

The prediction difficulty of different tokens in natural text varies greatly. Some tokens are continuations of partial words that are uniquely determined by their preceding context without any effort, while others may require predicting theorem names in difficult mathematical proofs or the correct answer to an exam question. Language models with residual connections have been shown to refine their output token distribution with each successive layer, and can be trained with early exit strategies that spend variable amounts of computational resources per token position. Multi-token prediction losses explicitly encourage information-sharing between adjacent token positions and can thus be viewed as a method for learning to allocate computational resources in language models more efficiently to the tokens that benefit most from them.
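To make the notion of a multi-token prediction loss concrete, the following is a minimal sketch, assuming the loss sums one cross-entropy term per future offset at each position. The names `multi_token_targets`, `multi_token_loss`, and `prob_fn` are hypothetical and introduced only for illustration; a real model would replace `prob_fn` with its per-offset output heads.

```python
import math
from typing import Callable, List, Tuple


def multi_token_targets(tokens: List[int], n: int) -> List[Tuple[int, List[int]]]:
    """Pair each position t with the n tokens that follow it.

    Assumption: positions with fewer than n future tokens are dropped.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return [(t, tokens[t + 1 : t + 1 + n]) for t in range(len(tokens) - n)]


def multi_token_loss(
    tokens: List[int],
    n: int,
    prob_fn: Callable[[List[int], int, int], float],
) -> float:
    """Sum of negative log-probabilities over all positions and offsets.

    prob_fn(context, offset, target) is a hypothetical stand-in for the
    probability that the model assigns to `target` at distance `offset`
    after `context`.
    """
    total = 0.0
    for t, future in multi_token_targets(tokens, n):
        context = tokens[: t + 1]
        for offset, target in enumerate(future, start=1):
            total -= math.log(prob_fn(context, offset, target))
    return total


def uniform_prob(context: List[int], offset: int, target: int) -> float:
    """Toy model: uniform distribution over a vocabulary of size 4."""
    return 1.0 / 4.0


tokens = [0, 1, 2, 3, 1]
print(multi_token_targets(tokens, 2))
print(multi_token_loss(tokens, 2, uniform_prob))
```

With `n = 1` the sketch reduces to ordinary next-token prediction. For larger `n`, each position is also trained on tokens further ahead, which is the coupling between adjacent positions that the hypothesis refers to.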

To test this hypothesis, we augment the polynomial arithmetic task from Section 4.2 with a varying number of pause tokens (Goyal et al., 2023) inserted between the question and a token that denotes the beginning of the answer. Pause tokens introduce additional computational resources that can be spent on computations expected to be useful later in the sequence, in other words, to start thinking about the answer. According to the computation-sharing hypothesis, multi-token prediction models learn information-sharing and thus computation-sharing between token positions more easily, and may be better at making use of these additional computational resources than next-token prediction models are. In Figure S15, we show the evaluation results on the polynomial arithmetic task with a fixed number of pause tokens inserted at both training and evaluation time. Multi-token prediction models likewise outperform next-token prediction models on these task variants across task difficulties and model sizes. However, we do not see strong evidence of a widening or shrinking of this gap, i.e., we cannot draw conclusions about the validity of the computation-sharing hypothesis from these experiments.
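The pause-token augmentation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' data pipeline: the token string `<pause>`, the function name `insert_pause_tokens`, and the example expression are hypothetical, while the use of an equality sign as the answer marker follows the caption of Figure S15.

```python
from typing import List

PAUSE = "<pause>"    # hypothetical name for the pause token
ANSWER_START = "="   # equality sign marking the beginning of the answer


def insert_pause_tokens(
    question: List[str], answer: List[str], num_pause: int
) -> List[str]:
    """Place a constant number of pause tokens between question and answer marker."""
    if num_pause < 0:
        raise ValueError("num_pause must be non-negative")
    return list(question) + [PAUSE] * num_pause + [ANSWER_START] + list(answer)


# Made-up example expression; the real task format is defined in Section 4.2.
question = ["(", "x", "+", "1", ")", "*", "x"]
answer = ["x^2", "+", "x"]

for k in (0, 5, 10):
    sequence = insert_pause_tokens(question, answer, k)
    print(k, len(sequence), sequence)
```

The same `num_pause` would be used at training and evaluation time, so that the model sees a consistent number of extra positions before it has to produce the answer.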

In Table S11, we report results from another experiment in the same spirit: by adding spaces and newlines to HumanEval and MBPP prompts, we add “pause tokens” in a somewhat natural way. According to these results, multi-token prediction models have a slight advantage in using this additional compute, but the effect is marginal.
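A rough sketch of such whitespace padding is shown below. The text does not specify where the spaces and newlines are inserted or how many are used, so appending them to the end of the prompt is an assumption made purely for illustration, and `pad_prompt` is a hypothetical helper.

```python
def pad_prompt(prompt: str, num_newlines: int = 0, num_spaces: int = 0) -> str:
    """Append extra whitespace to a code prompt (assumed placement: at the end).

    The added characters carry no semantic content; they only give the model
    additional token positions before it starts generating the solution.
    """
    if num_newlines < 0 or num_spaces < 0:
        raise ValueError("padding counts must be non-negative")
    return prompt + " " * num_spaces + "\n" * num_newlines


# Made-up prompt in the style of a function-completion benchmark.
prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
padded = pad_prompt(prompt, num_newlines=3)
print(repr(padded))
```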

Figure S15: Accuracy on a polynomial arithmetic task with varying number of operations per expression and pause tokens. We train and evaluate models on the polynomial arithmetic task described in Section 4.2, modified by the addition of pause tokens (Goyal et al., 2023): between the question and the equality sign that indicates the beginning of the answer, we add a constant number of pause tokens in both training and evaluation. For the variants with five and with ten pause tokens, we observe improvements from multi-token prediction comparable to those obtained without pause tokens (Figure 8).

Table S11: Utilization of additional whitespace tokens in code benchmarks.

Figure S16: Accuracy on a polynomial arithmetic task for two model sizes. We train and evaluate models with 30M and 100M parameters on the polynomial arithmetic task described in Section 4.2. Tripling the model size has a smaller effect on performance than replacing the next-token prediction loss with multi-token prediction. Shown are two independent runs per configuration and their means; the 100M parameter models are identical to the ones in Figure 8.

Table S12: Optimal temperatures for all numbers in Table 1.


:::info Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech and Equal contribution;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay and Equal contribution;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta and a last author;

(5) Gabriel Synnaeve, FAIR at Meta and a last author.

:::


:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::
