Extending Direct Nash Optimization for Regularized Preferences

This section presents an extension of the Direct Nash Optimization (DNO) framework to regularized preferences. A key difference between SPO and Nash-MD is that the latter uses smoothed policies, which is what yields its last-iterate convergence guarantee. The section introduces a regularized version of DNO (Algorithm 3), designed to converge to the Nash equilibrium of the KL-regularized preference. The algorithm proceeds iteratively, updating the policy through a reward function and the corresponding partition function and refining it at each iteration, which handles regularized preferences while preserving stable convergence.
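To make the iterative update concrete, below is a minimal numerical sketch of a soft-policy-iteration-style step over a finite set of candidate responses. The function name, the exponential form of the update, and the toy numbers are illustrative assumptions; the exact update in Algorithm 3 additionally involves KL-regularization toward a reference policy and may take a different form.

```python
import numpy as np

def soft_policy_iteration_step(pi_t, reward_t, eta):
    """One soft-policy-iteration-style update over a finite response set.

    Illustrative sketch only: assumes the generic form
    pi_{t+1}(y) ∝ pi_t(y) * exp(reward_t(y) / eta), normalized by a
    partition function Z_t. The exact update of Algorithm 3 in the paper
    may differ in how the reference policy and the regularization
    coefficient enter.
    """
    unnormalized = pi_t * np.exp(reward_t / eta)   # reweight by the reward
    Z_t = unnormalized.sum()                       # partition function
    return unnormalized / Z_t                      # refined policy pi_{t+1}

# Toy usage: three candidate responses for a single prompt.
pi_t = np.array([0.5, 0.3, 0.2])
reward_t = np.array([0.6, 0.5, 0.1])   # e.g., win rates against pi_t
print(soft_policy_iteration_step(pi_t, reward_t, eta=0.5))
```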


This content originally appeared on HackerNoon and was authored by Language Models (dot tech)

:::info Authors:

(1) Corby Rosset, Microsoft Research and Correspondence to corbyrosset@microsoft.com;

(2) Ching-An Cheng, Microsoft Research;

(3) Arindam Mitra, Microsoft Research;

(4) Michael Santacroce, Microsoft Research;

(5) Ahmed Awadallah, Microsoft Research and Correspondence to hassanam@microsoft.com;

(6) Tengyang Xie, Microsoft Research and Correspondence to tengyangxie@microsoft.com.

:::

Abstract and 1 Introduction

2 Preliminaries

2.1 RLHF Based on Reward Models

2.2 RLHF with General Preferences

3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1

3.2 Theoretical Analysis

4 Practical Algorithm – Iterative Contrastive Self-Improvement

5 Experiments and 5.1 Experimental Setup

5.2 Results and Analysis

6 Related Work

7 Conclusion and References

Appendix

A Extension to Regularized Preferences

B Detailed Proofs

C Additional Experimental Details

A Extension to Regularized Preferences

In this section, we discuss how to extend the DNO framework to the case of regularized preferences (defined in Eq. (5)), which was first introduced and solved by Munos et al. (2023) via the Nash-MD algorithm introduced earlier.
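For reference, the following is a LaTeX sketch of a KL-regularized preference in the form used by Munos et al. (2023); the symbols μ (reference policy) and τ (regularization coefficient), and the exact placement of the KL terms, are assumptions about the paper's notation for Eq. (5).

```latex
% Sketch of the KL-regularized preference, following Munos et al. (2023).
% The reference policy \mu and coefficient \tau are assumed notation.
\begin{equation}
  \mathcal{P}_\tau(\pi \succ \pi')
    := \mathcal{P}(\pi \succ \pi')
       - \tau\,\mathrm{KL}(\pi \,\|\, \mu)
       + \tau\,\mathrm{KL}(\pi' \,\|\, \mu)
\end{equation}
% where KL(pi || mu) = E_x E_{y ~ pi(.|x)} [ log( pi(y|x) / mu(y|x) ) ].
```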


:::info This paper is available on arxiv under CC BY 4.0 DEED license.

:::
