Multi-Token Prediction: Architecture for Memory-Efficient LLM Training

Learn how predicting multiple future tokens at once reduces LLM training memory use and speeds up inference, via a simple, powerful architectural modification.


This content originally appeared on HackerNoon and was authored by Large Models (dot tech)

Abstract and 1. Introduction

2. Method

3. Experiments on real data

3.1. Benefits scale with model size and 3.2. Faster inference

3.3. Learning global patterns with multi-byte prediction and 3.4. Searching for the optimal n

3.5. Training for multiple epochs and 3.6. Finetuning multi-token predictors

3.7. Multi-token prediction on natural language

4. Ablations on synthetic data and 4.1. Induction capability

4.2. Algorithmic reasoning

5. Why does it work? Some speculation and 5.1. Lookahead reinforces choice points

5.2. Information-theoretic argument

6. Related work

7. Conclusion, Impact statement, Environmental impact, Acknowledgements, and References

A. Additional results on self-speculative decoding

B. Alternative architectures

C. Training speeds

D. Finetuning

E. Additional results on model scaling behavior

F. Details on CodeContests finetuning

G. Additional results on natural language benchmarks

H. Additional results on abstractive text summarization

I. Additional results on mathematical reasoning in natural language

J. Additional results on induction learning

K. Additional results on algorithmic reasoning

L. Additional intuitions on multi-token prediction

M. Training hyperparameters

2. Method

Standard language modeling learns from a large text corpus x1, . . ., xT by implementing a next-token prediction task. Formally, the learning objective is to minimize the cross-entropy loss

$$L_1 = -\sum_t \log P_\theta(x_{t+1} \mid x_{t:1}),$$

where $P_\theta$ is the model under training and $x_{t:1} = x_t, \ldots, x_1$ denotes the history of past tokens.

In this work, we generalize the above by implementing a multi-token prediction task, where at each position of the training corpus the model predicts n future tokens at once. This translates into the cross-entropy loss

$$L_n = -\sum_t \log P_\theta(x_{t+n:t+1} \mid x_{t:1}).$$

The architecture consists of a shared trunk $f_s$ producing a latent representation $z_{t:1} = f_s(x_{t:1})$, n independent output heads $f_{h_i}$, and a shared unembedding $f_u$, so that the probability of the i-th future token is

$$P_\theta(x_{t+i} \mid x_{t:1}) = \operatorname{softmax}\big(f_u(f_{h_i}(z_{t:1}))\big).$$
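The loss above can be sketched in a minimal PyTorch module. All names and sizes here are illustrative (ours, not the paper's), and a GRU stands in for the transformer trunk; the point is the structure: one shared trunk, n independent heads, one shared unembedding, and a cross-entropy term per head where head i predicts the token i steps ahead.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictor(nn.Module):
    """Toy sketch of the multi-token prediction architecture:
    shared trunk f_s, n independent heads f_{h_i}, shared unembedding f_u."""

    def __init__(self, vocab_size=100, d_model=32, n_future=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Stand-in for the transformer trunk f_s.
        self.trunk = nn.GRU(d_model, d_model, batch_first=True)
        # n independent output heads f_{h_i}.
        self.heads = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_future))
        # Shared unembedding f_u.
        self.unembed = nn.Linear(d_model, vocab_size)

    def loss(self, tokens):
        # tokens: (batch, seq_len). z is the shared representation z_{t:1}.
        z, _ = self.trunk(self.embed(tokens))
        T = tokens.shape[1]
        total = 0.0
        for i, head in enumerate(self.heads, start=1):
            # Head i predicts x_{t+i}: only positions with a target i steps ahead.
            logits = self.unembed(head(z[:, : T - i]))
            targets = tokens[:, i:]
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return total
```

Note that with `n_future = 1` this reduces exactly to the standard next-token loss, which is what makes the generalization a drop-in change.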

\ Figure 2: Order of the forward/backward in an n-token prediction model with n = 2 heads. By performing the forward/backward on the heads in sequential order, we avoid materializing all unembedding layer gradients in memory simultaneously and reduce peak GPU memory usage.
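The scheduling in Figure 2 can be sketched as follows (the helper name and shapes are ours, and a toy linear trunk replaces the transformer). The trick: detach the trunk output, run forward + backward through one head at a time so only that head's logits and unembedding gradients are alive, accumulate the gradient with respect to the trunk output, then backpropagate through the shared trunk once.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def heads_sequential_backward(z_trunk, heads, unembed, tokens):
    """Hypothetical helper illustrating the Figure-2 schedule: sequential
    forward/backward per head keeps only one head's logits and unembedding
    gradients materialized at a time, reducing peak memory."""
    # Cut the autograd graph at the trunk output; gradients from each head
    # accumulate in z.grad instead of forcing one giant joint backward.
    z = z_trunk.detach().requires_grad_(True)
    T = tokens.shape[1]
    total = 0.0
    for i, head in enumerate(heads, start=1):
        # Head i predicts the token i steps ahead.
        logits = unembed(head(z[:, : T - i]))
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), tokens[:, i:].reshape(-1)
        )
        loss.backward()  # frees this head's activations before the next head runs
        total += loss.item()
    # One backward pass through the shared trunk with the accumulated gradient.
    z_trunk.backward(z.grad)
    return total
```

Because gradients are additive, the accumulated `z.grad` equals the gradient the joint backward would have produced, so the schedule changes peak memory but not the resulting parameter updates.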


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::

:::info Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech, and contributed equally;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay, and contributed equally;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta and a last author;

(5) Gabriel Synnaeve, FAIR at Meta and a last author.

:::


