Add support for Llama3-70b #101
Conversation
      state_dict_for_key[key] = torch.cat(tensors, 0)
    else:
      if not all(
-         torch.allclose(tensors[0], tensor, atol=1e-6)
+         torch.allclose(tensors[0], tensor, atol=1e-2)
Any reason to loosen the condition by four orders of magnitude?
The layer norm weights in llama-3 are not consistent across shards; I don't know why this is the case. These weights are expected to be replicated. Checkpoint conversion errors out if we don't relax the tolerance here.
@qihqi are you OK with the 1e-2 gap? I feel it's risky to loosen the condition by four orders of magnitude for a single tensor.
yeah that is fine
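For context, a minimal sketch of the consolidation pattern this diff touches (the helper name merge_weights and its signature are hypothetical, not the actual convert_checkpoints.py code): sharded weights are concatenated across shards, while replicated weights are checked for consistency and taken from the first shard.

import torch

def merge_weights(key, tensors, shard_axis, atol=1e-2):
  """Merge one weight that appears in every checkpoint shard.

  tensors: the same key loaded from each checkpoint shard.
  shard_axis: axis to concatenate along, or None if the weight is replicated.
  """
  if shard_axis is not None:
    # Sharded weight: stitch the per-shard slices back together.
    return torch.cat(tensors, shard_axis)
  # Replicated weight: every shard should hold (nearly) the same tensor.
  # llama-3 norm weights differ slightly across shards, hence the loose atol.
  if not all(torch.allclose(tensors[0], t, atol=atol) for t in tensors):
    raise ValueError(f"Replicated weight {key} differs across shards")
  return tensors[0]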
default_shardings/llama-3.yaml (outdated)
freqs_cis : -1            # torch.complex64 (2048, 64)
tok_embeddings.weight : 0 # torch.float32 (vocab_size, 4096)
The sharding file seems to be the same as llama-2's. What's the difference between the llama-2 and llama-3 sharding files?
From the change in convert_checkpoints.py, it seems that the llama-3 weights are sharded differently when converting checkpoints, while this sharding file is only used for model sharding at runtime.
If that is the case, we don't need another sharding yaml file.
tok_embeddings.weight is sharded differently between llama-2 and llama-3: for llama-2 the embeddings are sharded along axis 1, and for llama-3 along axis 0. But I agree that it shouldn't make a difference in accuracy at runtime. If you think it is better to keep the same sharding for both, I can revert this change.
They shouldn't be sharded differently; the only difference would be performance. Let's run with both and keep the faster one.
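To illustrate the axis-0 vs. axis-1 choice being discussed, a small standalone sketch (sizes are illustrative; this is not the project's sharding code): axis 0 splits the vocabulary across shards, axis 1 splits the embedding dimension, and either choice reassembles to the same table.

import torch

vocab_size, dim, num_shards = 128256, 4096, 8
tok_embeddings = torch.randn(vocab_size, dim)

# Axis 0 (llama-3 style here): each shard holds a slice of the vocabulary.
shards_axis0 = torch.chunk(tok_embeddings, num_shards, dim=0)  # 8 x (16032, 4096)
# Axis 1 (llama-2 style here): each shard holds a slice of the embedding dim.
shards_axis1 = torch.chunk(tok_embeddings, num_shards, dim=1)  # 8 x (128256, 512)

# Concatenating along the matching axis recovers the full table either way.
assert torch.equal(torch.cat(shards_axis0, 0), tok_embeddings)
assert torch.equal(torch.cat(shards_axis1, 1), tok_embeddings)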
The output quality of Llama3-70B dropped compared with Llama2-7B. Can you create a bug to track it? There is repeated output in the example:
Sorry, can you explain the problem a little more? From my previous runs of Llama2-7B, I have seen it give different output that can also repeat, as in this gist: https://gist.github.com/bhavya01/40a344e671a2e5dde980f163141545db
I see, it looks like there are accuracy issues with quantization. When I mentioned the quality drop, I was comparing against bfloat16. The quantization accuracy issue is not related to this CL.
Test run output: https://gist.github.com/bhavya01/07dd88d76f3d339de664ebecc3dc035a
Llama3 shards the embeddings differently than llama2, so I created a new default sharding file for it.
The attention_norm weights are expected to be identical across shards, but they were slightly off, so I increased the tolerance while converting checkpoints.