Publications
- AI Alignment: A Comprehensive Survey. ACM Computing Surveys, 2025.
- InterMT: Multi-Turn Interleaved Preference Alignment with Human Feedback. NeurIPS, 2025.
- Generative RLHF-V: Learning Principles from Multi-modal Human Preference. NeurIPS, 2025.
- Safe RLHF-V: Safe Reinforcement Learning from Multi-modal Human Feedback. NeurIPS, 2025.
- SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Safe Reinforcement Learning. NeurIPS, 2025 (Spotlight).
- Language Models Resist Alignment: Evidence From Data Compression. ACL, 2025 (Best Paper).
- PKU-SafeRLHF: Towards Multi-Level Safety Alignment for LLMs with Human Preference. ACL, 2025.
- Reward Generalization in RLHF: A Topological Perspective. ACL Findings, 2025.
- SAE-V: Interpreting Multimodal Models for Enhanced Alignment. ICML, 2025.
- Stream Aligner: Efficient Sentence-Level Alignment via Distribution Induction. AAAI, 2025.
- Sequence to Sequence Reward Modeling: Improving RLHF by Language Feedback. AAAI, 2025 (Oral).
- Align Anything: Training All-Modality Models to Follow Instructions with Language Feedback. arXiv, 2025.
- OmniSafe: An Infrastructure for Accelerating Safe Reinforcement Learning Research. JMLR, 2024.
- Aligner: Efficient Alignment by Learning to Correct. NeurIPS, 2024 (Oral).
- SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset. NeurIPS, 2024.
- ProgressGym: Alignment with a Millennium of Moral Progress. NeurIPS, 2024 (Spotlight).
- Safe RLHF: Safe Reinforcement Learning from Human Feedback. ICLR, 2024 (Spotlight).
- SafeDreamer: Safe Reinforcement Learning with World Models. ICLR, 2024.
- Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark. NeurIPS, 2023.
- BeaverTails: Towards Improved Safety Alignment of LLM via a Human-Preference Dataset. NeurIPS, 2023.
- Baichuan 2: Open Large-scale Language Models. arXiv (Technical Report), 2023.