
Add support for Llama3-70b #101

Merged
merged 7 commits into main from llama3-70b on Jun 10, 2024
Conversation

bhavya01 (Collaborator)

Test run output: https://gist.github.com/bhavya01/07dd88d76f3d339de664ebecc3dc035a

llama3 shards the embeddings differently than llama2. So, I created a new default_sharding file for it.

The attention_norm weights are expected to be identical across shards, but they were slightly off, so I increased the tolerance while converting checkpoints.
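
A hypothetical sketch of what that sharding difference means in practice; the dict form and names below are illustrative only, not the actual default_sharding file format:

```python
# Illustrative only: map each weight name to the axis it is split on across
# devices, where -1 is assumed to mean "replicated" (as with freqs_cis in the
# sharding snippet discussed later in this PR).
llama2_sharding = {
    "freqs_cis": -1,             # replicated on every device
    "tok_embeddings.weight": 1,  # split along the hidden dimension
}
llama3_sharding = {
    "freqs_cis": -1,             # replicated on every device
    "tok_embeddings.weight": 0,  # split along the vocab dimension
}
```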

bhavya01 requested review from FanhaiLu1 and lsy323 on May 24, 2024 01:43
convert_checkpoints.py (diff context):

    state_dict_for_key[key] = torch.cat(tensors, 0)
  else:
    if not all(
-       torch.allclose(tensors[0], tensor, atol=1e-6)
+       torch.allclose(tensors[0], tensor, atol=1e-2)

Collaborator

Any reason to loosen the tolerance by four orders of magnitude?

Collaborator Author

The layer norm weights in llama-3 are not consistent across shards; I don't know why this is the case. These weights are expected to be replicated, and checkpoint conversion errors out if we don't relax the tolerance here.

Collaborator

@qihqi are you OK with the 1e-2 gap? I feel it's risky to loosen the tolerance by four orders of magnitude for a single tensor.

Collaborator

yeah that is fine
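
For context, a minimal sketch of the replicated-weight check being discussed; the function name merge_replicated_weight is hypothetical, and the real logic lives in convert_checkpoints.py:

```python
import torch

def merge_replicated_weight(key, tensors, atol=1e-2):
  """Sketch: merge a weight that should be identical on every checkpoint shard.

  The llama-3 attention_norm weights were observed to differ slightly across
  shards, hence the tolerance relaxed from 1e-6 to 1e-2.
  """
  if not all(torch.allclose(tensors[0], t, atol=atol) for t in tensors[1:]):
    raise ValueError(f"{key} is expected to be replicated across shards")
  # All shards hold (nearly) the same values, so keep a single copy.
  return tensors[0]
```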

jetstream_pt/third_party/llama/model_exportable.py (outdated review thread, resolved)


llama-3 default_sharding file (context):

freqs_cis : -1 # torch.complex64 (2048, 64)
tok_embeddings.weight : 0 # torch.float32 (vocab_size, 4096)

Collaborator

The sharding file seems to be the same as llama-2's. What's the difference between the llama-2 and llama-3 sharding files?

From the change in convert_checkpoints.py, it seems that the llama-3 weights are sharded in a different way, but this sharding file is only used for model sharding at runtime.

If that is the case, we don't need another sharding yaml file.

Collaborator Author

The tok_embeddings.weight is sharded differently between llama-2 and llama-3: for llama-2, embeddings are sharded along axis 1, and for llama-3, they are sharded along axis 0. But I agree that it shouldn't make a difference in accuracy at runtime. If you think it is better to keep the same sharding for both, I can revert this change.

Collaborator

They shouldn't be sharded differently -- the only difference would be performance; let's run with both and keep the faster one.
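
A quick illustration of the point above, as a standalone sketch with made-up sizes (not the project's code): the shard axis only changes how the embedding table is split and re-joined, so accuracy should be unaffected.

```python
import torch

vocab_size, hidden_dim = 1024, 512  # made-up sizes for illustration
weight = torch.randn(vocab_size, hidden_dim)

# llama-3 style: shard along axis 0; each device holds a slice of the vocab.
row_shards = torch.chunk(weight, 4, dim=0)

# llama-2 style: shard along axis 1; each device holds a slice of the hidden dim.
col_shards = torch.chunk(weight, 4, dim=1)

# A converted checkpoint must be concatenated along the axis it was split on.
assert torch.equal(torch.cat(row_shards, dim=0), weight)
assert torch.equal(torch.cat(col_shards, dim=1), weight)
```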

bhavya01 requested review from FanhaiLu1 and lsy323 on May 25, 2024 00:17

FanhaiLu1 (Collaborator)

> Test run output: https://gist.github.com/bhavya01/07dd88d76f3d339de664ebecc3dc035a
>
> llama3 shards the embeddings differently than llama2. So, I created a new default_sharding file for it.
>
> The attention_norm weights are expected to be identical across shards, but they were slightly off, so I increased the tolerance while converting checkpoints.

The output quality of Llama3-70B dropped compared with Llama2-7B. Can you create a bug to track it?

There is repeated output in the example:
---- All output text.
to give life a meaning. -Paul Thoreau
I believe the meaning of life is to give life a meaning. -Paul Thoreau
I believe the meaning of life is to give life a meaning. -Paul Thoreau
I believe the meaning of life is to give life a meaning. -Paul Thoreau

bhavya01 self-assigned this on May 29, 2024

bhavya01 (Collaborator Author)

> Test run output: https://gist.github.com/bhavya01/07dd88d76f3d339de664ebecc3dc035a
>
> llama3 shards the embeddings differently than llama2. So, I created a new default_sharding file for it.
>
> The attention_norm weights are expected to be identical across shards, but they were slightly off, so I increased the tolerance while converting checkpoints.
>
> The output quality of Llama3-70B dropped compared with Llama2-7B. Can you create a bug to track it?
>
> There is repeated output in the example: ---- All output text. to give life a meaning. -Paul Thoreau I believe the meaning of life is to give life a meaning. -Paul Thoreau I believe the meaning of life is to give life a meaning. -Paul Thoreau I believe the meaning of life is to give life a meaning. -Paul Thoreau

Sorry, can you explain the problem a little bit more? From my previous runs of Llama2-7B, I have seen it give different output, which can also be repeated, as in this gist: https://gist.github.com/bhavya01/40a344e671a2e5dde980f163141545db

FanhaiLu1 (Collaborator)

> Test run output: https://gist.github.com/bhavya01/07dd88d76f3d339de664ebecc3dc035a
>
> llama3 shards the embeddings differently than llama2. So, I created a new default_sharding file for it.
>
> The attention_norm weights are expected to be identical across shards, but they were slightly off, so I increased the tolerance while converting checkpoints.
>
> The output quality of Llama3-70B dropped compared with Llama2-7B. Can you create a bug to track it?
> There is repeated output in the example: ---- All output text. to give life a meaning. -Paul Thoreau I believe the meaning of life is to give life a meaning. -Paul Thoreau I believe the meaning of life is to give life a meaning. -Paul Thoreau I believe the meaning of life is to give life a meaning. -Paul Thoreau
>
> Sorry, can you explain the problem a little bit more? From my previous runs of Llama2-7B, I have seen it give different output, which can also be repeated, as in this gist: https://gist.github.com/bhavya01/40a344e671a2e5dde980f163141545db

I see, it looks like there is an accuracy issue with quantization. When I mentioned the quality drop, I was comparing against bf16. The quantization accuracy issue is not related to this change.

FanhaiLu1 requested a review from qihqi on May 29, 2024 21:00
bhavya01 merged commit 4535bdf into main on Jun 10, 2024
4 checks passed
bhavya01 deleted the llama3-70b branch on June 10, 2024 18:52