
[BREAKING][misc] feat: change micro_batch_size to micro_batch_size_per_gpu #136

Merged: 27 commits into volcengine:main, Jan 27, 2025

Conversation

@PeterSH6 (Collaborator) commented Jan 26, 2025

Summary

This PR renames all occurrences of micro_batch_size to micro_batch_size_per_gpu.

The core logic for setting batch sizes:

  • All algorithmic parameters (train batch size, PPO mini batch size) are global from the perspective of the single controller, and are normalized inside each Worker.
  • All performance-related parameters (micro batch size, max token length for dynamic batch size) are local and describe the data size per GPU (i.e., per Worker); see the sketch after this list.
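
A minimal sketch of that split, assuming field names based on the PR description (the helper itself is illustrative, not lifted from the patch):

```python
def normalize_batch_sizes(config, world_size):
    """Illustrative only: global algorithmic sizes are divided by the
    number of GPUs inside each Worker, while per-GPU performance sizes
    are consumed as-is."""
    # Algorithmic, global: the single controller sets one value for the
    # whole job; each Worker derives its local share.
    assert config.ppo_mini_batch_size % world_size == 0, \
        "global mini batch size must be divisible by the number of GPUs"
    mini_bsz_per_gpu = config.ppo_mini_batch_size // world_size

    # Performance-related, local: already expressed per GPU, no division.
    micro_bsz_per_gpu = config.ppo_micro_batch_size_per_gpu

    return mini_bsz_per_gpu, micro_bsz_per_gpu
```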

Main Changes

  1. Update the scripts and configs, and remove the per-Worker normalization of micro_batch_size
  2. Fix CI for SFT

PeterSH6 requested a review from vermouth1992 on January 26, 2025 10:30
@vermouth1992 (Collaborator) left a comment


It's not a good idea to break users' API. We should introduce a per-GPU batch size in addition to the original parameter. When both of them are present, we should assert false. When only the old parameter is used, we should print a deprecation warning.

@PeterSH6 (Collaborator, Author) commented Jan 26, 2025

Sure. We should set the default value of micro_batch_size_per_gpu to null to implement that.

Moreover, for the examples we provide, I think we should simply use micro_batch_size_per_gpu rather than the original micro_batch_size.
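
A minimal sketch of the agreed-upon compatibility scheme (the helper, its name, and the exact messages are assumptions; only the rules come from the discussion above):

```python
import warnings

def resolve_micro_bsz_per_gpu(config, world_size):
    """Hypothetical helper showing the proposed rules: the new key
    defaults to None (null in the YAML config), setting both keys is an
    error, and using only the old key emits a deprecation warning."""
    old = getattr(config, "micro_batch_size", None)          # deprecated, global
    new = getattr(config, "micro_batch_size_per_gpu", None)  # preferred, per GPU

    # Both keys present: fail fast, as proposed in the review.
    assert not (old is not None and new is not None), \
        "Set micro_batch_size_per_gpu only; micro_batch_size is deprecated."

    if new is not None:
        return new
    if old is not None:
        warnings.warn(
            "micro_batch_size is deprecated; use micro_batch_size_per_gpu.",
            DeprecationWarning,
        )
        assert old % world_size == 0
        return old // world_size  # normalize the old global value per GPU
    raise ValueError("One of the two micro batch size keys must be set.")
```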

PeterSH6 merged commit f2a76ac into volcengine:main on Jan 27, 2025
10 checks passed
Chendong98 pushed a commit to Chendong98/verl that referenced this pull request on Feb 4, 2025.
as12138 pushed two commits to as12138/verl that referenced this pull request on Feb 20, 2025.