Safety Alignment

Safe RLHF: Safe Reinforcement Learning from Human Feedback
Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
ICLR 2024 (Spotlight).
Safety Alignment, Reinforcement Learning from Human Feedback
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, Yaodong Yang
NeurIPS 2023.
Large Language Models, Safety Alignment, Reinforcement Learning from Human Feedback