v0.2 release
Highlights
New algorithms and features
- GRPO (a usage sketch for the new algorithms follows this list)
- ReMax
- REINFORCE++
- Checkpoint manager for FSDP backend
- Sandbox for reward verification and scoring in PRIME
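The new algorithms plug into the existing PPO trainer through configuration. A minimal sketch of switching the advantage estimator is shown below; the entry point (`verl.trainer.main_ppo`), the estimator names, and the placeholder paths are assumptions, so check the algorithm documentation for the exact values shipped in v0.2.

```bash
# Sketch: selecting one of the new algorithms via a Hydra-style override.
# The entry point, estimator names, and placeholder paths are assumptions;
# consult the v0.2 docs for the exact values.
python3 -m verl.trainer.main_ppo \
    data.train_files=/path/to/train.parquet \
    data.val_files=/path/to/test.parquet \
    actor_rollout_ref.model.path=/path/to/base_model \
    algorithm.adv_estimator=grpo   # or: remax, reinforce_plus_plus
```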
Performance optimizations:
- Remove padding tokens (i.e., sequence packing). A significant throughput increase is expected for Llama, Mistral, Gemma, and Qwen2 transformer models (see the combined sketch after this list). Documentation
actor_rollout_ref.model.use_remove_padding=True
critic.model.use_remove_padding=True
- Dynamic batch size. Significant throughput increase for variable-length sequences. Documentation and example
actor_rollout_ref.actor.ppo_max_token_len_per_gpu
actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu
actor_rollout_ref.ref.log_prob_max_token_len_per_gpu
critic.ppo_max_token_len_per_gpu
critic.forward_micro_batch_size_per_gpu
reward_model.forward_micro_batch_size_per_gpu
- Sequence parallelism for long-context training. Documentation and example
actor_rollout_ref.actor.ulysses_sequence_parallel_size
critic.ulysses_sequence_parallel_size
reward_model.ulysses_sequence_parallel_size
- vLLM v0.7+ integration (preview). For the Qwen2 PPO example, rollout time is reduced by 25% compared to vLLM v0.6.3, and by 45% when the CUDA graph is enabled. Documentation
actor_rollout_ref.rollout.enforce_eager=False
actor_rollout_ref.rollout.free_cache_engine=False
- Liger-kernel integration for SFT (see the SFT sketch below). Documentation
model.use_liger=True
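All of the performance features above are opt-in configuration overrides, so they compose on a single trainer invocation. The following is a rough sketch, assuming the PPO entry point is `verl.trainer.main_ppo` and using placeholder token budgets and parallel sizes; only the override keys themselves come from the lists above.

```bash
# Sketch: combining the v0.2 performance features on one PPO run.
# The entry point and the numeric values are assumptions; tune them for your hardware.
python3 -m verl.trainer.main_ppo \
    actor_rollout_ref.model.use_remove_padding=True \
    critic.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=16384 \
    critic.ppo_max_token_len_per_gpu=16384 \
    actor_rollout_ref.actor.ulysses_sequence_parallel_size=2 \
    critic.ulysses_sequence_parallel_size=2 \
    actor_rollout_ref.rollout.enforce_eager=False \
    actor_rollout_ref.rollout.free_cache_engine=False
```

The last two flags apply to the vLLM v0.7+ preview: keeping the CUDA graph (`enforce_eager=False`) and the cache engine (`free_cache_engine=False`) enabled is what yields the larger rollout speedup quoted above.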
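For SFT, the Liger-kernel integration is likewise a single flag. The sketch below assumes the FSDP SFT trainer module name and a torchrun launch, both of which should be checked against the SFT documentation; `model.use_liger=True` is the documented switch.

```bash
# Sketch: enabling Liger kernels for SFT. The trainer module name and launch
# command are assumptions; model.use_liger=True is the flag documented above.
torchrun --nproc_per_node=8 -m verl.trainer.fsdp_sft_trainer \
    model.use_liger=True   # plus your usual data/model overrides
```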
Changelog
New Features
- Algorithm Support:
- Performance Improvements:
  - Enabled dynamic batch size support (#118).
  - Added meta device initialization and parallel load for FSDP to avoid OOMs during init (#123).
  - Improved gradient accumulation in sequence balance (#141).
  - Added ref/RM offload support (#121).
  - Added LoRA support for SFT (#127).
  - Added support for rmpad/data-packing in FSDP with transformers (#91).
  - Added Liger kernel integration (#133).
- Experiment Tracking:
Bug Fixes
- Critical Fixes:
- Code Fixes:
Improvements
- Performance:
- Miscellaneous:
  - Added option to log validation generations to wandb (#177).
Deprecations and Breaking Changes
- Breaking Changes:
Contributors
A big thank you to all the contributors who made this release possible:
@zhanluxianshen @xingyaoww @fzyzcjy @emergenz @openhands-agent @ZSL98 @YSLIU627 @ZefanW @corbt @jaysonfrancis @hiyouga @Jiayi-Pan @hongpeng-guo @eltociear @chujiezheng @PanAndy @zwhe99 @pcmoritz @huiyeruzhou @VPeterV @uygnef @zhiqi-0 @ExtremeViscent @liziniu @nch0w @Cppowboy @TonyLianLong @4332001876 @tyler-romero @ShaohonChen @kinman0224 @willem-bd @bebetterest @WeiXiongUST @dignfei
A PyPI package will be available soon! Please let us know on GitHub if there's a problem extending an RL training recipe based on the pip-installed version of verl.
Full Changelog: v0.1...v0.2