@DanielHesslow has opened PR #212. This allows us to evaluate Megatron-Deepspeed models using the EAI harness directly in this repo, without needing to convert models into HF format.
The current issue is that we train models using deepspeed (*ModelPipe) but the evaluation script loads a model without deepspeed (*Model). This creates issues where we might have discrepancies between the two. Ex: #222
We need a test making sure that the outputs of both models are equal given an arbitrary configuration (regardless of whether we merge #212 or not).
cc @SaulLu @DanielHesslow @TevenLeScao
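A minimal sketch of what such an equivalence test could look like (the `build_gpt_model` / `build_gpt_model_pipe` helpers below are hypothetical placeholders for however we end up constructing the two variants from the same arguments and checkpoint):

```python
import torch

def test_pipe_and_plain_model_outputs_match(args, checkpoint_path):
    """Hypothetical test: the deepspeed pipeline model (*ModelPipe) and the
    plain model (*Model) should produce (near-)identical logits given the
    same weights and the same inputs."""
    torch.manual_seed(0)

    # Hypothetical helpers: build each variant from the same config and
    # load the same checkpoint into both.
    plain_model = build_gpt_model(args, checkpoint_path).eval()
    pipe_model = build_gpt_model_pipe(args, checkpoint_path).eval()

    # Random token batch with the shape the models expect.
    tokens = torch.randint(0, args.vocab_size, (2, args.seq_length))

    with torch.no_grad():
        plain_logits = plain_model(tokens)
        pipe_logits = pipe_model(tokens)

    # Bitwise equality may be too strict once fp16 / reordered reductions are
    # involved; allclose with a small tolerance is a reasonable starting point.
    assert torch.allclose(plain_logits, pipe_logits, atol=1e-5, rtol=1e-5)
```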
thomasw21 changed the title from "Make sure Deepspeed power models and equivalent with their non deepspeed version" to "Make sure deepspeed powered models are equivalent with their non deepspeed version" on Jan 7, 2022
Hi, thanks for providing this amazing Bloom model!
Just a quick question: has anyone verified that the Deepspeed and Huggingface checkpoints lead to identical outputs?
Hi @philippmtk, unfortunately they don't lead to identical outputs.
Resharding the checkpoints changes the order of operations, and that is a problem with floating-point arithmetic: fp operations are not associative.
Please refer to this issue: pytorch/pytorch#76232
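As a quick illustration of the non-associativity (plain Python floats, nothing specific to this repo), the two groupings below round differently and are not bit-identical:

```python
a, b, c = 0.1, 0.2, 0.3

print((a + b) + c)                  # 0.6000000000000001
print(a + (b + c))                  # 0.6
print((a + b) + c == a + (b + c))   # False
```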