PKU-Alignment

All

23 repositories

align-anything
Public
Align Anything: Training All-modality Model with Feedback
chameleon multimodal dpo large-language-models rlhf vision-language-model
Jupyter Notebook
•
Apache License 2.0
•504•4.5k•25•3•Updated Aug 25, 2025Aug 25, 2025
safe-rlhf
Public
Safe RLHF: Constrained Value Alignment via Safe Reinforcement Learning from Human Feedback
reinforcement-learning transformers transformer safety llama gpt datasets beaver alpaca ai-safety
Python
•
Apache License 2.0
•124•1.5k•17•1•Updated Aug 18, 2025Aug 18, 2025
safety-gymnasium
Public
NeurIPS 2023: Safety-Gymnasium: A Unified Safe Reinforcement Learning Benchmark
reinforcement-learning constraint-satisfaction-problem safety-critical safety-critical-systems safe-reinforcement-learning safe-reinforcement-learning-environments constraint-rl safe-policy-optimization
Python
•
Apache License 2.0
•67•485•8•2•Updated Aug 12, 2025Aug 12, 2025
eval-anything
Public
Python
•
Apache License 2.0
•16•21•1•2•Updated Jul 26, 2025Jul 26, 2025
llms-resist-alignment
Public
[ACL2025 Best Paper] Language Models Resist Alignment
alignment llama safe alpaca ai-safety vicuna llm llms rlhf safe-rlhf
Python
•1•22•0•0•Updated Jun 11, 2025Jun 11, 2025
SAE-V
Public
[ICML 2025 Poster] SAE-V: Interpreting Multimodal Models for Enhanced Alignment
0•3•0•0•Updated Jun 5, 2025Jun 5, 2025
SafeVLA
Public
Python
•
Other
•2•53•5•1•Updated Jun 4, 2025Jun 4, 2025
ProgressGym
Public
Alignment with a millennium of moral progress. Spotlight@NeurIPS 2024 Track on Datasets and Benchmarks.
Python
•
MIT License
•4•24•0•0•Updated Mar 30, 2025Mar 30, 2025
s1-m
Public
S1-M: Simple Test-time Scaling in Multimodal Reasoning
Python
•
Apache License 2.0
•504•3•0•0•Updated Mar 25, 2025Mar 25, 2025
omnisafe
Public
JMLR: OmniSafe is an infrastructural framework for accelerating SafeRL research.
benchmark machine-learning reinforcement-learning deep-learning deep-reinforcement-learning constraint-satisfaction-problem pytorch safety-critical saferl safe-reinforcement-learning
Python
•
Apache License 2.0
•139•982•15•4•Updated Mar 17, 2025Mar 17, 2025
ProAgent
Public
AAAI24(Oral) ProAgent: Building Proactive Cooperative Agents with Large Language Models
language-model cooperative human-ai overcooked human-ai-interaction cooperative-ai llm-agent
JavaScript
•
MIT License
•10•90•0•0•Updated Mar 4, 2025Mar 4, 2025
Beaver-zh-hk
Public
Python
•0•0•0•0•Updated Feb 23, 2025Feb 23, 2025
TransformerLens-V
Public
Python
•
MIT License
•1•2•0•0•Updated Jan 31, 2025Jan 31, 2025
SAELens-V
Public
Python
•
MIT License
•0•3•2•0•Updated Jan 31, 2025Jan 31, 2025
aligner
Public
[NeurIPS 2024 Oral] Aligner: Efficient Alignment by Learning to Correct
alignment aligner interpretability aisafety llm rlhf weak-to-strong mecinterp
Python
•8•184•0•0•Updated Jan 16, 2025Jan 16, 2025
.github
Public
0•0•0•0•Updated Jan 16, 2025Jan 16, 2025
Aligner2024.github.io
Public
HTML
•1•0•0•0•Updated Oct 31, 2024Oct 31, 2024
safe-sora
Public
SafeSora is a human preference dataset designed to support safety alignment research in the text-to-video generation field, aiming to enhance the helpfulness and harmlessness of Large Vision Models (LVMs).
alignment human-preferences text-to-video-generation large-vision-models
Python
•5•32•0•0•Updated Aug 20, 2024Aug 20, 2024
SafeDreamer
Public
ICLR 2024: SafeDreamer: Safe Reinforcement Learning with World Models
reinforcement-learning constraint-satisfaction-problem safety-critical-systems safe-reinforcement-learning constraint-rl safe-policy-optimization
Python
•
Apache License 2.0
•5•76•1•0•Updated Apr 8, 2024Apr 8, 2024
Safe-Policy-Optimization
Public
NeurIPS 2023: Safe Policy Optimization: A benchmark repository for safe reinforcement learning algorithms
benchmarks reinforcement-learning-algorithms safe safe-reinforcement-learning constrained-reinforcement-learning
Python
•
Apache License 2.0
•54•375•1•0•Updated Mar 20, 2024Mar 20, 2024
AlignmentSurvey
Public
AI Alignment: A Comprehensive Survey
awesome reinforcement-learning ai deep-learning survey alignment papers interpretability red-teaming large-language-models
1•135•0•0•Updated Nov 2, 2023Nov 2, 2023
beavertails
Public
BeaverTails is a collection of datasets designed to facilitate research on safety alignment in large language models (LLMs).
safety llama gpt datasets language-model beaver ai-safety human-feedback-data llm llms
Makefile
•
Apache License 2.0
•5•154•3•0•Updated Oct 27, 2023Oct 27, 2023
ReDMan
Public
ReDMan is an open-source simulation platform that provides a standardized implementation of safe RL algorithms for Reliable Dexterous Manipulation.
Python
•
Apache License 2.0
•2•20•0•0•Updated May 2, 2023May 2, 2023