
Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level

  • Additional information
    • Subject:
      2024
    • Collection:
      Computer Science
    • Abstract:
      Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO - improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a 50.5% length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard and OpenLLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback. (A sketch of the length-regularized loss follows this record.)
    • Identifier:
      edsarx.2406.11817
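
To make the length penalty described in the abstract concrete, below is a minimal PyTorch sketch of a length-regularized DPO loss. The function name `lr_dpo_loss`, the hyperparameter values, and the exact form of the penalty (subtracting a term proportional to the length difference from the DPO margin) are illustrative assumptions, not taken from the paper itself.

```python
import torch
import torch.nn.functional as F

def lr_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps,
                chosen_lengths, rejected_lengths,
                beta=0.1, alpha=0.01):
    """Length-regularized DPO loss (illustrative sketch).

    Standard DPO maximizes the margin between the chosen and rejected
    log-probability ratios. The length term penalizes the margin when
    the chosen response is longer than the rejected one, so the model
    is not rewarded for winning merely by being more verbose. The
    penalty form and `alpha` are assumptions for illustration.
    """
    # Log-ratio of policy to reference model for each response.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    margin = beta * (chosen_ratio - rejected_ratio)
    # Length penalty: positive when the chosen response is longer.
    length_penalty = alpha * (chosen_lengths - rejected_lengths)
    return -F.logsigmoid(margin - length_penalty).mean()

# Example call with dummy values for a batch of 2 (illustrative only):
t = torch.tensor
loss = lr_dpo_loss(t([-12.0, -9.5]), t([-14.0, -11.0]),
                   t([-13.0, -10.0]), t([-13.5, -10.5]),
                   t([120.0, 80.0]), t([90.0, 95.0]))
```

In the iterative setting the abstract describes, a loss of this shape would be applied in each round on fresh online preference pairs labeled by the trained reward model, with the policy from the previous round serving as the new reference.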