Direct Preference Optimization: Your Language Model is Secretly a Reward Model


The first author is Rafael Rafailov (Stanford).

The proposed method improves on RLHF fine-tuning with Proximal Policy Optimization (PPO). Direct Preference Optimization (DPO) simplifies how the policy is updated: it optimizes the policy directly on preference data instead of first training a reward model and then running RL.

RL fine-tuning is conducted as follows:

$$\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(y \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta(y \mid x)\,\Vert\,\pi_{\mathrm{ref}}(y \mid x)\big]$$
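In PPO-style RLHF, this KL penalty is typically folded into the reward that the RL algorithm maximizes. As a minimal sketch of that shaping step (the function name, argument names, and shapes below are my own assumptions, not from the paper):

```python
import torch

def kl_shaped_reward(task_reward: torch.Tensor,
                     policy_logp: torch.Tensor,
                     ref_logp: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """Fold the KL penalty into the scalar reward for RL.

    task_reward: reward-model score r_phi(x, y) per sample, shape (batch,)
    policy_logp / ref_logp: log pi_theta(y|x) and log pi_ref(y|x), shape (batch,)
    """
    # Maximizing this in expectation matches the KL-constrained objective
    # above, since E[log pi_theta - log pi_ref] is the KL divergence.
    return task_reward - beta * (policy_logp - ref_logp)
```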

Using the partition function $Z(x)$, the optimal policy under this objective has a closed form:

$$\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big), \qquad Z(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big)$$

We can eliminate $Z(x)$, which is intractable to compute because it sums over all possible completions $y$. Rearranging the closed form expresses the reward in terms of the policy:

$$r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$$

Substituting this into the Bradley-Terry preference model, the $\beta \log Z(x)$ terms cancel, leaving the DPO loss:

$$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

Thus we do not need to train a separate reward model; we can optimize this loss function directly on the preference data.
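As a concrete illustration, here is a minimal sketch of that loss in PyTorch. It assumes the per-sequence log-probabilities (summed over tokens) have already been computed for the chosen and rejected completions under both the policy and the frozen reference model; the function and argument names are mine, not the paper's.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss over a batch of (prompt, chosen, rejected) triples.

    Each input is log pi(y|x) summed over a completion's tokens,
    shape (batch,). beta sets the implicit KL constraint strength.
    """
    # Implicit rewards beta * log(pi_theta / pi_ref); log Z(x) has cancelled.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-likelihood of the Bradley-Terry preference model.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Note that only the `policy_*` tensors carry gradients; the reference log-probabilities are computed once with the frozen SFT model.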


This content originally appeared on DEV Community and was authored by Takara Taniguchi

