Fix alibi #222

Merged: 1 commit merged into main on Jan 6, 2022

Conversation

thomasw21 (Member)

No description provided.

Comment on lines +313 to +315
if not hasattr(self, "logged_alibi"):
logger.debug("Using Alibi.")
self.logged_alibi = True
thomasw21 (Member Author)

Testing purposes.

@SaulLu (Collaborator) commented Jan 5, 2022

Just for information: on my side I was going to propose isolating the three ways of computing the score according to the positional embedding, by creating a method for each positional embedding type and then wrapping these methods in log_debug_usage, so we get a log (as for the activation functions) that the tests can detect.

The small advantage is that it also allows testing rotary and absolute (in all cases), but I don't mind if you think it's easier to keep it the way you did.

thomasw21 (Member Author)

Yeah, it's true that your version looks nice. I just think it's an incorrect abstraction: there's no reason to group all the positional embeddings together (they get applied in different places, they do different things, and they have different constraints, despite sharing the common purpose of providing sequential information). One could argue that using a pure causal mask is a position embedding mechanism.

What I was thinking of is abstracting only the alibi logic into a separate function so it can use the pretty decorator @log_debug_usage(logger, msg), but I was lazy ^^'
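
A minimal sketch of that idea, assuming a log_debug_usage decorator that logs on first use; the stand-in decorator and the helper name slice_alibi below are illustrative, not the repo's actual code:

```python
import functools
import logging

logger = logging.getLogger(__name__)


def log_debug_usage(log, msg):
    # Stand-in for the repo's existing log_debug_usage decorator: emit the
    # debug message only the first time the wrapped function is called.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not wrapper._logged:
                log.debug(msg)
                wrapper._logged = True
            return func(*args, **kwargs)
        wrapper._logged = False
        return wrapper
    return decorator


@log_debug_usage(logger, "Using Alibi.")
def slice_alibi(alibi, output_size):
    # Hypothetical helper that isolates the alibi slice so its use gets logged,
    # replacing the inline hasattr(self, "logged_alibi") guard from the diff.
    return alibi[:output_size[0] * output_size[1], :, :output_size[3]]
```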

@ofirpress (Contributor) commented Jan 7, 2022

"using a pure causal mask is a position embedding mechanism"

I agree with this statement :) Models that don't have any position embeddings (like sinusoidal or learned or alibi) are actually able to achieve good (but not great) PPL because the causal mask encodes some kind of order.

@SaulLu (Collaborator) left a comment

Thanks a lot for the fix! 🎊

Can we also add a test in test_model.py in order to test alibi when we use it without megatron?

@thomasw21 (Member Author)

I'm not super clear on what you mean by "without megatron". The tests in test_model test invariants that we can have using prefix lm or gpt. I don't see any invariants for the alibi version?

@SaulLu (Collaborator) commented Jan 5, 2022

Oh sorry, I meant without deepspeed (I understand that the tests in test_model.py use GPTModel and not GPTModelPipe)!

@thomasw21 (Member Author)

Ah makes sense ... I'll try something out

@thomasw21 (Member Author) commented Jan 6, 2022

Okay, after thinking about it, I think we need a test that checks that all the models we have match their deepspeed version. Basically, the only reason I see for adding this test is that our EAI evaluation #212 uses the Megatron-LM version of the models, so I think this applies to all the models we have.

I don't think we should implement that test in this PR. I'll open an issue to create a test for all models. #226
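
As a rough illustration of what such an equivalence test could check (the builder fixtures and the call signature below are hypothetical, and real Megatron/deepspeed initialization is elided):

```python
import torch


def test_gpt_model_matches_pipe_variant(build_gpt_model, build_gpt_model_pipe, sample_batch):
    # Hypothetical fixtures: both builders are assumed to construct the model
    # from identical args and the same random seed, so the weights line up.
    torch.manual_seed(42)
    model = build_gpt_model().eval()
    torch.manual_seed(42)
    model_pipe = build_gpt_model_pipe().eval()

    with torch.no_grad():
        logits = model(*sample_batch)
        logits_pipe = model_pipe(*sample_batch)

    # Both implementations should agree numerically on the same inputs.
    torch.testing.assert_close(logits, logits_pipe)
```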

@SaulLu (Collaborator) left a comment

Following the latest offline discussions: if we ever have to run the evaluation on the Megatron version without deepspeed, we will add a test to verify that the architectures are identical.

In the meantime, I have full confidence in this fix and I think we can move forward and re-launch training.

Thanks again Thomas! 🚀

thomasw21 merged commit 7ab5c05 into main on Jan 6, 2022
@@ -470,9 +486,19 @@ def __init__(self, init_method, output_layer_init_method,
        self.mlp = ParallelMLP(init_method,
                               output_layer_init_method)

        # Alibi
        if args.position_embedding_type == PositionEmbeddingType.alibi:
            self.alibi = self._build_alibi_tensor(args.seq_length, args.num_attention_heads, args.micro_batch_size).to(torch.cuda.current_device())
Collaborator

Does args.micro_batch_size work with batch size rampup? And if not, do we care?

thomasw21 (Member Author)

The micro batch size doesn't increase during batch size rampup; it's constant.

We should care, since all our experiments run with batch size rampup, but I would expect it to crash badly if it doesn't match.
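
For context on the tensor built and sliced in the diff above: a sketch of how the ALiBi bias tensor is typically constructed, following the recipe from the ALiBi paper. The actual _build_alibi_tensor in this PR may differ in details; the point is that both the batch size and the heads end up baked into the first dimension.

```python
import math

import torch


def build_alibi_tensor(seq_length, num_attention_heads, batch_size):
    # Per-head slopes form a geometric sequence, as in the ALiBi paper.
    def get_slopes(n):
        def power_of_2_slopes(n):
            start = 2 ** (-2 ** -(math.log2(n) - 3))
            return [start * start ** i for i in range(n)]

        if math.log2(n).is_integer():
            return power_of_2_slopes(n)
        closest = 2 ** math.floor(math.log2(n))
        return power_of_2_slopes(closest) + get_slopes(2 * closest)[0::2][: n - closest]

    slopes = torch.tensor(get_slopes(num_attention_heads))
    # Bias grows linearly with key position; shape (num_heads, 1, seq_length).
    alibi = slopes[:, None, None] * torch.arange(seq_length)[None, None, :]
    # Tile over the batch: final shape (batch_size * num_heads, 1, seq_length).
    return alibi.repeat(batch_size, 1, 1)
```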

@DanielHesslow (Collaborator)

Have we tested that we get the same results with/without tensor parallelism? Correct me if I'm wrong, but I think the attention heads are distributed across workers since self.query_key_value is ColumnParallel. We might need to extract the correct part of the alibi tensor depending on the tensor parallel index, since different attention heads should have different alibi slopes?

i.e.

matmul_result = alibi[:output_size[0]*output_size[1], :, :output_size[3]]

might be wrong?

@thomasw21 (Member Author)

That's actually a great point. Skimming through it, I think you're completely right! Though we are lucky, as we're using TP=1 in our experiments. If I understand correctly, you believe :output_size[0]*output_size[1] is wrong, right? Because it doesn't take into account the ordering of the heads, i.e. there should be an offset linked to the tp_rank?

If so, I would suggest you open an issue/PR to introduce such a test, and to fix it if it fails?

@DanielHesslow (Collaborator)

Yup, exactly: there should be some offset depending on the tp_rank, and possibly also a stride, since the batch_size is baked into the first dimension as well.

I've opened an issue for it here: #227
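
A rough sketch of the kind of slicing such a fix might need, assuming the batch-major layout from the build sketch above and hypothetical tp_rank/tp_world_size arguments; the real fix is tracked in #227:

```python
import torch


def slice_alibi_for_tp_rank(alibi, batch_size, num_heads, tp_rank, tp_world_size):
    # alibi is assumed to have shape (batch_size * num_heads, 1, seq_length),
    # laid out batch-major (index = batch_idx * num_heads + head_idx).
    heads_per_partition = num_heads // tp_world_size
    seq_length = alibi.shape[-1]
    # Make the batch and head dimensions explicit, keep only this rank's heads,
    # then flatten back to the (batch * heads_per_partition, 1, seq) layout
    # that the attention baddbmm expects.
    alibi = alibi.reshape(batch_size, num_heads, 1, seq_length)
    start = tp_rank * heads_per_partition
    return alibi[:, start:start + heads_per_partition].reshape(
        batch_size * heads_per_partition, 1, seq_length
    )
```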
