Strategic LLM Training: Multi-Token Prediction’s Data Efficiency in Mathematical Reasoning

Figure S13 shows how training scale changes the GSM8K performance of multi-token prediction models relative to a next-token baseline, which bears on their data efficiency for mathematical reasoning.


This content originally appeared on HackerNoon and was authored by Cosmological thinking: time, space and universal causation

Abstract and 1. Introduction

2. Method

3. Experiments on real data

4. Ablations on synthetic data

5. Why does it work? Some speculation

6. Related work

7. Conclusion, Impact statement, Environmental impact, Acknowledgements and References

A. Additional results on self-speculative decoding

B. Alternative architectures

C. Training speeds

D. Finetuning

E. Additional results on model scaling behavior

F. Details on CodeContests finetuning

G. Additional results on natural language benchmarks

H. Additional results on abstractive text summarization

I. Additional results on mathematical reasoning in natural language

J. Additional results on induction learning

K. Additional results on algorithmic reasoning

L. Additional intuitions on multi-token prediction

M. Training hyperparameters

I. Additional results on mathematical reasoning in natural language

Figure S13: Performance on the mathematical reasoning benchmark GSM8K (Cobbe et al., 2021). We evaluate pretrained next-token and multi-token prediction models trained on 200B and 500B tokens of natural language in 8-shot mode using nucleus sampling (Holtzman et al., 2020) with probability mass 0.95 and various sampling temperatures. We report how often the correct final answer appears among k samples, for k = 1, 10, 100, estimated from 200 samples as in code generation benchmarks (Chen et al., 2021). After 200B tokens, the 2-token prediction model has a clear advantage over the next-token baseline, but the order reverses after 500B tokens. The 4-token prediction model is worse throughout. We interpret this similarly to the findings in Section 4.1: the follow-your-nose chains of thought required for GSM8K may be difficult to learn from a limited amount of data, and the 2-token model's advantage at 200B tokens attests to the data efficiency of multi-token prediction training. Once the correct circuits for autoregressive chains of thought in this domain have formed, however, multi-token prediction comes at a cost.
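The caption's "estimated from 200 samples" most likely refers to the unbiased pass@k estimator introduced by Chen et al. (2021) for code generation: given n samples of which c are correct, the probability that at least one of k samples is correct is estimated as 1 − C(n−c, k) / C(n, k). The sketch below is a minimal illustration under that assumption; the function name `pass_at_k` and the counts in the usage note are hypothetical and do not come from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples is correct.

    n: total samples drawn per problem (200 in the caption's setup)
    c: number of those samples whose final answer is correct
    k: sample budget being evaluated (1, 10, or 100 in the caption)

    Assumption: this follows the estimator of Chen et al. (2021),
    1 - C(n - c, k) / C(n, k); the paper's own evaluation code may differ.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

For example, with a hypothetical problem where 50 of 200 samples are correct, `pass_at_k(200, 50, 1)` gives 0.25, which is simply the fraction of correct samples. For larger k the estimate rises toward 1, which is why the k = 100 curves in such plots sit above the k = 1 curves.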


:::info Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech and Equal contribution;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay and Equal contribution;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta and a last author;

(5) Gabriel Synnaeve, FAIR at Meta and a last author.

:::


:::info This paper is available on arXiv under the CC BY 4.0 DEED license.

:::


