Finetune loss and acc is poor #11
Comments
Could you please share the training logs? Thanks!
Sorry for the late reply. c3 finetune args:
It seems that the pre-trained weights are not loaded: the first printed lm loss should be less than 5, while in the provided log it is around 10. If the model is loaded successfully, the output printed to stdout (not log.txt) should contain a message like
Thanks. In terms of the initial loss, I think you are right, but 'successfully loaded' does appear in the train_log.
Did you use the docker we provided, or the deepspeed from GitHub?
I used the latest deepspeed (v0.4.3 or v0.4.5).
We have reproduced the problem with deepspeed (v0.5.0). We think it is caused by a bug in deepspeed that we have already fixed in the docker we provided. Specifically, the deepspeed zero optimizer performs a copy from the optimizer's fp32 states to the model's fp16 states. This works fine when the optimizer states (zero_pp_rank_0_mp_rank_01_optim_states.pt) are loaded. But when the optimizer states are not provided, the fp32 states in the optimizer are randomly initialized and override the pre-trained states in the model. This is where the problem occurs. To fix it, we recommend using our docker directly. If you would like to use the latest deepspeed instead, you can fix the bug by adding a few lines of code in the zero optimizer. If you have any problems, please let us know. Thanks!
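To make the direction of the copy concrete, below is a minimal, self-contained sketch of the failure mode and of one way to guard against it. All names here (`fp16_params`, `fp32_master_params`, `optim_ckpt`) are illustrative stand-ins, not DeepSpeed's actual internals, and the actual patch in the provided docker may differ in detail.

```python
import torch

# Illustrative sketch only: the function and argument names are hypothetical
# and do not mirror DeepSpeed's real API.
def load_checkpoint(fp16_params, fp32_master_params, optim_ckpt=None):
    """fp16_params: model weights already filled from the pre-trained checkpoint.
    fp32_master_params: the zero optimizer's fp32 copies, randomly initialized at construction."""
    if optim_ckpt is not None:
        # Optimizer states are available: restore the fp32 master weights from the
        # checkpoint, then copying fp32 -> fp16 into the model is harmless.
        for master, saved in zip(fp32_master_params, optim_ckpt["fp32_states"]):
            master.data.copy_(saved)
        for p16, p32 in zip(fp16_params, fp32_master_params):
            p16.data.copy_(p32.data.half())
    else:
        # No optimizer states: copying fp32 -> fp16 here would overwrite the
        # pre-trained weights with randomly initialized master weights (the bug above).
        # Instead, refresh the fp32 master weights FROM the loaded fp16 model.
        for p16, p32 in zip(fp16_params, fp32_master_params):
            p32.data.copy_(p16.data.float())
```

The point is only the direction of the copy when no optimizer states are present: the loaded fp16 weights must be the source, not the target.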
👍🏻 Thank you!
Has this been fixed in the latest version of deepspeed (v0.7.7)? It seems to work OK for me.
@t1101675 Does cpm1-finetune have the same problem? How can I check whether loading succeeded, by looking at the loss? And what should the initial loss be?
Thanks very much!