[BREAKING][misc] feat: change micro_batch_size to micro_batch_size_per_gpu #136
Conversation
It's not a good idea to break users' API. We should introduce the per-GPU batch size in addition to the original parameter. When both of them are present, we should assert false. When the old parameter is used, we should print a warning.
Sure. We should set the default value of micro_batch_size_per_gpu to null to implement that. Moreover, for the examples we provide, I think we should simply use micro_batch_size_per_gpu rather than the original micro_batch_size.
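The deprecation scheme discussed above can be sketched as a small config check. This is an illustrative helper, not verl's actual implementation; the parameter names follow the PR, but the function name and config access are assumptions.

```python
# Hypothetical sketch of the backward-compatibility check discussed above:
# micro_batch_size_per_gpu defaults to None (null), setting both keys is an
# error, and the deprecated global key is normalized with a warning.
import warnings


def resolve_micro_batch_size(config: dict, n_gpus: int) -> int:
    old = config.get("micro_batch_size")          # deprecated, global
    new = config.get("micro_batch_size_per_gpu")  # preferred, per-GPU

    # Setting both is ambiguous, so fail loudly.
    assert not (old is not None and new is not None), \
        "Set only one of micro_batch_size / micro_batch_size_per_gpu"

    if old is not None:
        warnings.warn(
            "micro_batch_size is deprecated; use micro_batch_size_per_gpu",
            DeprecationWarning,
        )
        return old // n_gpus  # normalize the old global value per GPU

    return new  # already per-GPU; used as-is
```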
…r_gpu (volcengine#136)

## Summary

This PR changes all the micro_batch_size to micro_batch_size_per_gpu.

**The core logic of setting batch size:**
- **All algorithmic metrics** (train batch size, PPO mini batch size) are global (from the perspective of the single controller) and will be normalized in each Worker.
- **All performance-related parameters** (micro batch size, max token length in dynamic batch size) are local parameters, representing the data size per GPU (i.e., per Worker).

## Main Changes

1. Change the scripts and config, and delete the normalization for micro_batch_size
2. Fix CI for SFT
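The global-vs-local convention above can be sketched as follows. All names here are illustrative assumptions, not verl's actual API: algorithmic sizes arrive global and are divided by world size inside each Worker, while performance knobs like micro_batch_size_per_gpu are used as-is.

```python
# Minimal sketch (assumed names) of the batch-size convention the PR adopts.
def worker_batch_sizes(train_batch_size: int,
                       ppo_mini_batch_size: int,
                       micro_batch_size_per_gpu: int,
                       world_size: int) -> dict:
    return {
        # Algorithmic metrics: global, normalized per Worker.
        "train_batch_size_per_gpu": train_batch_size // world_size,
        "ppo_mini_batch_size_per_gpu": ppo_mini_batch_size // world_size,
        # Performance parameter: already per-GPU, no division.
        "micro_batch_size_per_gpu": micro_batch_size_per_gpu,
    }
```

For example, with a global train batch size of 1024 on 8 GPUs, each Worker sees 128 samples, while the micro batch size per GPU stays whatever the user set.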