Hi, I am currently studying the Megatron framework. I noticed that with bfloat16, Megatron requires gradient accumulation and all-reduce to be done in fp32, so gradients are communicated in fp32 format.
With fp16, on the other hand, gradient accumulation and all-reduce can be done in fp16, so gradients are communicated in fp16 format.
I would like to understand the specific reasons behind these two different approaches.
See lines 159-160 of the megatron/arguments.py file.
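For context, here is a minimal sketch of what I understand the precision concern to be (this is my own illustration using plain PyTorch, not Megatron code): fp16 stores 10 mantissa bits while bf16 stores only 7, so when many small gradient-sized values are accumulated directly in bf16, additions that are tiny relative to the running sum get rounded away much earlier than in fp16 or fp32.

```python
import torch

# Illustration only (not Megatron code): accumulate 1000 increments of 1e-3
# in different dtypes. The exact sum is 1.0. Once the running sum's ULP
# exceeds twice the increment, further additions are rounded away.
for dtype in (torch.float32, torch.float16, torch.bfloat16):
    acc = torch.zeros((), dtype=dtype)
    step = torch.tensor(1e-3, dtype=dtype)
    for _ in range(1000):
        acc += step  # accumulation happens in `dtype`
    print(f"{str(dtype):>15}: {acc.item():.4f}")

# On a typical run (illustrative):
#   float32 ends very close to 1.0,
#   float16 ends close to 1.0 with small rounding error,
#   bfloat16 stalls around 0.5, since at that magnitude its spacing
#   (2^-8) is already larger than twice the 1e-3 increment.
```

If this is indeed the motivation, it would explain keeping a separate fp32 copy for accumulation and all-reduce under bf16, but I would appreciate confirmation of the actual reasoning.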