The PKU-Alignment Group, under the PKU Pair-Lab, is a pioneering research interest group dedicated to advancing the frontiers of AI safety and alignment. Our mission is to explore the fundamental algorithms and mechanisms that underpin AI alignment, driving both theoretical innovation and practical deployment.
We work to ensure that AI systems remain consistently aligned with human goals. The team actively shares the latest advances in AI research while fostering the development and real-world adoption of safety and alignment practices. Our key research directions include safety alignment, human preference learning, and their application to multimodal and embodied AI systems.

The PKU-Alignment Group has 4 papers accepted to NeurIPS 2025, including 2 Spotlights; full abstracts and project links are provided below on a single page. These papers advance safety alignment and human preference learning across multimodal and embodied AI:

- SafeVLA proposes an integrated safety approach that formulates safety-constrained policy optimization for vision-language-action models (a generic formulation is sketched after this list), achieving substantial safety gains while preserving task performance.
- InterMT introduces the first multi-turn, interleaved multimodal preference dataset with expert oversight and establishes InterMT-Bench, revealing multi-turn scaling behavior for judge models.
- Generative RLHF-V unifies generative reward modeling with multimodal RLHF in a two-stage pipeline, delivering consistent performance improvements and analyzing reward hacking and generalization.
- Safe RLHF-V pioneers multimodal safety alignment with dual-preference data and a multi-level guardrail system, significantly improving both safety and helpfulness.

Together, these works contribute scalable datasets, principled algorithms, and evaluation protocols that move alignment closer to robust, reliable deployment in complex real-world settings.
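For readers unfamiliar with safety-constrained policy optimization, the following is a generic constrained-MDP formulation of the kind such methods build on. This is a minimal illustrative sketch, not SafeVLA's exact objective: the reward $r$, safety cost $c$, cost budget $d$, and the specific solver are assumptions here and are defined in the paper itself.

```latex
% Generic safety-constrained policy optimization (illustrative sketch):
% maximize expected return subject to an expected safety-cost budget d.
\max_{\pi}\; J_r(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
J_c(\pi) = \mathbb{E}_{\tau \sim \pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, c(s_t, a_t)\Big] \le d

% A standard way to handle the constraint is a Lagrangian relaxation,
% alternating updates of the policy and the multiplier \lambda \ge 0:
\min_{\lambda \ge 0}\; \max_{\pi}\;\; J_r(\pi) - \lambda\,\big(J_c(\pi) - d\big)
```

Driving $J_c(\pi)$ below the budget $d$ while still maximizing $J_r(\pi)$ captures the trade-off described above: improving safety without sacrificing task performance.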