
SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset
Juntao Dai, Tianle Chen, Xuyao Wang, Ziran Yang, Taiye Chen, Jiaming Ji, Yaodong Yang
NeurIPS 2024.
AI Safety, Safety Alignment
Language Models Resist Alignment: Evidence From Data Compression
ACL 2025. Best Paper
Large Language Models, Safety Alignment, AI Safety
ProgressGym: Alignment with a Millennium of Moral Progress
Tianyi Qiu, Yang Zhang, Xuchuan Huang, Jasmine Xinze Li, Jiaming Ji, Yaodong Yang
NeurIPS 2024.
Large Language Models, AI Alignment
PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference
Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Josef Dai, Boren Zheng, Tianyi Qiu, Boxun Li, Yaodong Yang
ACL 2025 Main.
Large Language Models, Safety Alignment, Reinforcement Learning from Human Feedback
Safe RLHF: Safe Reinforcement Learning from Human Feedback
Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, Yaodong Yang
ICLR 2024. Spotlight
Safety Alignment, Reinforcement Learning from Human Feedback
Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark
Jiaming Ji, Borong Zhang, Jiayi Zhou, Xuehai Pan, Weidong Huang, Ruiyang Sun, Yiran Geng, Yifan Zhong, Juntao Dai, Yaodong Yang
NeurIPS 2023.
Safe Reinforcement Learning, Robotics
BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset
Jiaming Ji, Mickel Liu, Juntao Dai, Xuehai Pan, Ce Bian, Chi Zhang, Ruiyang Sun, Yizhou Wang, Yaodong Yang
NeurIPS 2023.
Large Language Models, Safety Alignment, Reinforcement Learning from Human Feedback