Training Time Comparison: Multi-Token vs. Next-Token Prediction

This table (S5) quantifies the training-time overhead of multi-token prediction relative to next-token prediction, showing that the method adds only a small computational cost across different model sizes.


This content originally appeared on HackerNoon and was authored by Large Models (dot tech)

Abstract and 1. Introduction

2. Method

3. Experiments on real data

3.1. Benefits scale with model size and 3.2. Faster inference

3.3. Learning global patterns with multi-byte prediction and 3.4. Searching for the optimal n

3.5. Training for multiple epochs and 3.6. Finetuning multi-token predictors

3.7. Multi-token prediction on natural language

4. Ablations on synthetic data and 4.1. Induction capability

4.2. Algorithmic reasoning

5. Why does it work? Some speculation and 5.1. Lookahead reinforces choice points

5.2. Information-theoretic argument

6. Related work

7. Conclusion, Impact statement, Environmental impact, Acknowledgements, and References

A. Additional results on self-speculative decoding

B. Alternative architectures

C. Training speeds

D. Finetuning

E. Additional results on model scaling behavior

F. Details on CodeContests finetuning

G. Additional results on natural language benchmarks

H. Additional results on abstractive text summarization

I. Additional results on mathematical reasoning in natural language

J. Additional results on induction learning

K. Additional results on algorithmic reasoning

L. Additional intuitions on multi-token prediction

M. Training hyperparameters

C. Training speeds

Table S5: Training time relative to next-token prediction training. The slight overhead of multi-token prediction is explained by a suboptimal use of Fully Sharded Data Parallel (FSDP): because our implementation runs a separate backward pass for each head, it loses the overlap between layer-weight communication and computation, incurring a small overhead that a more careful implementation could remove.
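The per-head backward pass the caption refers to can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed names, not the paper's code: `trunk`, `heads`, `unembed`, and all shapes are hypothetical stand-ins. The detach trick lets each head's logits be freed right after its own backward pass, but each extra `backward()` call is exactly what prevents FSDP from overlapping weight communication with computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the shared trunk, the n output heads,
# and the shared unembedding matrix.
vocab, d_model, n_heads = 100, 32, 4
trunk = nn.Linear(d_model, d_model)
heads = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_heads))
unembed = nn.Linear(d_model, vocab)

x = torch.randn(8, 16, d_model)                    # (batch, seq, d_model)
targets = torch.randint(0, vocab, (8, 16, n_heads))  # one target token per head

z = trunk(x)                              # single trunk forward, graph kept
z_det = z.detach().requires_grad_(True)   # heads backprop only into z_det.grad

for i, head in enumerate(heads):
    logits = unembed(head(z_det))         # head-i forward
    loss = F.cross_entropy(logits.flatten(0, 1), targets[..., i].flatten())
    loss.backward()                       # separate backward per head: frees the
                                          # logits immediately, but under FSDP each
                                          # extra backward call breaks the usual
                                          # communication/computation overlap

z.backward(z_det.grad)                    # one final backward through the trunk
```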


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

:::info Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech, equal contribution;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay, equal contribution;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta, a last author;

(5) Gabriel Synnaeve, FAIR at Meta, a last author.

:::
