Publication
Mitigating Reward Over-optimization in Direct Alignment Algorithms with Importance Sampling
Phuc Minh Nguyen; Ngoc-Hieu Nguyen; Ho Minh Duy Nguyen; Anji Liu; An Mai; Binh T. Nguyen; Daniel Sonntag; Khoa D. Doan
In: The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS 2025), December 2-12, 2025, USA. Advances in Neural Information Processing Systems, 2025.
Abstract
Direct Alignment Algorithms (DAAs) such as Direct Preference Optimization (DPO) have emerged as alternatives to standard Reinforcement Learning from Human Feedback (RLHF) for aligning large language models (LLMs) with human values. However, these methods are more susceptible to over-optimization, in which the model drifts away from the reference policy, leading to degraded performance as training progresses. This paper proposes a novel importance-sampling approach, called IS-DAAs, to mitigate the over-optimization problem of offline DAAs. IS-DAAs multiply the DAA objective by an importance ratio that accounts for the reference policy distribution, and they avoid the high variance associated with importance sampling by clipping the importance ratio to a maximum value. Our extensive experiments demonstrate that IS-DAAs can effectively mitigate over-optimization, especially under low regularization strength, and achieve better performance than other methods designed to address this problem.
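The abstract suggests a simple form: weight each example's DAA loss by a clipped importance ratio involving the reference policy. The PyTorch sketch below illustrates this for a DPO-style objective; the direction of the ratio, the clip threshold, and all names (is_dpo_loss, clip_max) are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F

def is_dpo_loss(policy_logp_w, policy_logp_l,
                ref_logp_w, ref_logp_l,
                beta=0.1, clip_max=10.0):
    # Inputs are per-response summed token log-probabilities for the
    # chosen (w) and rejected (l) responses under the current policy
    # and the frozen reference model.
    pi_logratios = policy_logp_w - policy_logp_l
    ref_logratios = ref_logp_w - ref_logp_l
    # Standard DPO loss on the implicit reward margin.
    dpo_loss = -F.logsigmoid(beta * (pi_logratios - ref_logratios))

    # Importance ratio re-weighting the objective toward the reference
    # distribution; clipped to bound its variance and detached so it
    # acts as a fixed per-example weight.
    log_ratio = (ref_logp_w + ref_logp_l) - (policy_logp_w + policy_logp_l)
    weight = torch.exp(log_ratio).clamp(max=clip_max).detach()

    return (weight * dpo_loss).mean()

Detaching the ratio treats it as a fixed sample weight, and clipping trades a small bias for bounded variance, in the spirit of PPO-style ratio clipping.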
