
[bugfix] promote state in bf16_optimizer #5767

Merged
merged 3 commits into microsoft:master on Jul 16, 2024

Conversation

billishyahao (Contributor)
This patch promotes `state` in `bf16_optimizer` so that it is accessible to downstream DeepSpeed use cases.

For example, without the patch, the Megatron-DeepSpeed Llama showcase fails with:

[rank3]: Traceback (most recent call last):
[rank3]:   File "/yahao/Megatron-DeepSpeed/pretrain_gpt.py", line 356, in <module>
[rank3]:     pretrain(train_valid_test_datasets_provider,
[rank3]:   File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 222, in pretrain
[rank3]:     iteration = train(forward_step_func,
[rank3]:   File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 1264, in train
[rank3]:     report_memory_flag = training_log(loss_dict, total_loss_dict,
[rank3]:   File "/yahao/Megatron-DeepSpeed/megatron/training.py", line 999, in training_log
[rank3]:     opt_stats[0] += (torch.norm(optimizer.state[param]['exp_avg_sq']).item())**2
[rank3]: AttributeError: 'BF16_Optimizer' object has no attribute 'state'

With the patch, the invocation passes smoothly.
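The fix itself is small. Below is a minimal, runnable sketch of the idea, assuming the wrapper keeps the inner torch optimizer as `self.optimizer` (the class and attribute names mirror DeepSpeed's `bf16_optimizer.py`, but this is an illustration, not the exact patch):

```python
import torch

class BF16_OptimizerSketch:
    """Hypothetical, simplified stand-in for DeepSpeed's BF16_Optimizer,
    showing only the state promotion this PR adds."""

    def __init__(self, init_optimizer):
        # The real wrapper keeps the inner optimizer here.
        self.optimizer = init_optimizer

    @property
    def state(self):
        # Promote the inner optimizer's per-parameter state so downstream
        # code like optimizer.state[param]['exp_avg_sq'] works through the wrapper.
        return self.optimizer.state


# Usage: the access pattern from Megatron-DeepSpeed's training_log above.
param = torch.nn.Parameter(torch.randn(4))
inner = torch.optim.Adam([param])
param.grad = torch.randn(4)
inner.step()  # populates state['exp_avg_sq'] for Adam

wrapper = BF16_OptimizerSketch(inner)
norm_sq = torch.norm(wrapper.state[param]['exp_avg_sq']).item() ** 2
print(norm_sq)
```

Exposing `state` as a read-only property, rather than copying the dict, keeps the wrapper in sync with the inner optimizer as training progresses.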

@HeyangQin HeyangQin added this pull request to the merge queue Jul 12, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 12, 2024
@loadams loadams added this pull request to the merge queue Jul 12, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 12, 2024
@loadams loadams added this pull request to the merge queue Jul 15, 2024
@loadams loadams removed this pull request from the merge queue due to a manual request Jul 15, 2024
@loadams loadams enabled auto-merge July 15, 2024 23:24
@loadams loadams added this pull request to the merge queue Jul 16, 2024
Merged via the queue into microsoft:master with commit 98272d1 Jul 16, 2024
13 checks passed