Decoupled Alignment for Robust Plug-and-Play Adaptation

Authors: Luo, Haozheng; Yu, Jiahao; Zhang, Wenxin; Li, Jialong; Hu, Jerry Yao-Chieh; Xing, Xinyu; Liu, Han
Subject:
Computer Science - Computation and Language; Computer Science - Artificial Intelligence; Computer Science - Cryptography and Security
Record type:
Working Paper
Online access:
http://arxiv.org/abs/2406.01514

Additional information
- Publication year:
  2024
- Collection:
  Computer Science
- Abstract:
  We introduce a low-resource safety enhancement method for aligning large language models (LLMs) without the need for supervised fine-tuning (SFT) or reinforcement learning from human feedback (RLHF). Our main idea is to exploit knowledge distillation to extract the alignment information from existing well-aligned LLMs and integrate it into unaligned LLMs in a plug-and-play fashion. Methodologically, we employ delta debugging to identify the critical components of knowledge necessary for effective distillation. On the harmful question dataset, our method significantly enhances the average defense success rate by approximately 14.41%, reaching as high as 51.39%, across 17 unaligned pre-trained LLMs, without compromising performance.
- Accession number:
  edsarx.2406.01514
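
The abstract mentions delta debugging as the way critical components are identified. As a rough illustration of that general technique only (not the authors' implementation, which this record does not describe), the following is a minimal sketch of a simplified ddmin-style reduction. The component names and the `still_works` predicate are hypothetical stand-ins; in practice the predicate would be some expensive check, such as whether a model still refuses harmful prompts when only the given components are transferred.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def ddmin(items: Sequence[T], still_works: Callable[[List[T]], bool]) -> List[T]:
    """Shrink `items` to a smaller subset for which `still_works` stays True.

    Simplified delta debugging: repeatedly try to keep a single chunk,
    then try to drop a single chunk, refining the chunk size when neither helps.
    """
    current = list(items)
    if not still_works(current):
        raise ValueError("the full set must satisfy the predicate")

    n = 2  # number of chunks to split into (granularity)
    while len(current) >= 2:
        chunk = max(1, len(current) // n)
        subsets = [current[i:i + chunk] for i in range(0, len(current), chunk)]
        reduced = False

        # Step 1: does any single chunk suffice on its own?
        for subset in subsets:
            if still_works(subset):
                current, n, reduced = subset, 2, True
                break

        # Step 2: otherwise, can any single chunk be dropped?
        if not reduced:
            for i in range(len(subsets)):
                complement = [x for j, s in enumerate(subsets) if j != i for x in s]
                if still_works(complement):
                    current, n, reduced = complement, max(n - 1, 2), True
                    break

        # Step 3: otherwise, split finer, or stop at single-item granularity.
        if not reduced:
            if n >= len(current):
                break
            n = min(len(current), n * 2)

    return current


# Hypothetical component names; assume only two of them carry the behavior.
components = ["attn.0", "mlp.0", "attn.1", "mlp.1", "attn.2", "mlp.2", "attn.3", "mlp.3"]
needed = {"attn.1", "mlp.2"}  # assumption made up for this toy example
minimal = ddmin(components, lambda subset: needed <= set(subset))
print(minimal)
```

The predicate is called many times, so in a realistic setting its cost dominates; the search itself only decides which subsets to test.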
