Fix alibi #222
Conversation
if not hasattr(self, "logged_alibi"):
    logger.debug("Using Alibi.")
    self.logged_alibi = True
Testing purposes.
Just for information: on my side I was going to propose isolating the three ways of calculating the score according to the positional embedding, by creating a method for each positional-embedding type and then wrapping these methods in log_debug_usage, so we get a log (as for the activation functions) that can be detected in the tests.
The minor advantage is that it would also let us test rotary and absolute embeddings (in all cases), but I don't mind if you think it's easier to keep it the way you did.
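For concreteness, here is a rough sketch of that idea: one scoring method per positional-embedding type, each wrapped in a log_debug_usage-style decorator so the tests can detect which path ran. The decorator, class, and method names below are illustrative stand-ins, not the repo's actual API.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def log_debug_usage(logger, msg):
    """Emit `msg` at DEBUG level the first time the wrapped function is called."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not wrapper._logged:  # log only once, like the hasattr check above
                logger.debug(msg)
                wrapper._logged = True
            return func(*args, **kwargs)
        wrapper._logged = False
        return wrapper
    return decorator


class ParallelAttentionSketch:
    @log_debug_usage(logger, "Using absolute position embeddings.")
    def _score_absolute(self, query_layer, key_layer):
        ...

    @log_debug_usage(logger, "Using rotary position embeddings.")
    def _score_rotary(self, query_layer, key_layer):
        ...

    @log_debug_usage(logger, "Using Alibi.")
    def _score_alibi(self, query_layer, key_layer):
        ...
```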
Yeah, it's true that your version looks nice. I just think it's the wrong abstraction. There's no reason to group all the positional embeddings together (they get applied in different places, they do different things, and they have different constraints, despite sharing the common purpose of providing sequential information). One could argue that using a pure causal mask is a position embedding mechanism.
What I was thinking of was abstracting only the alibi computation into a separate function so it can use the pretty @log_debug_usage(logger, msg) decorator, but I was lazy ^^'
using a pure causal mask is a position embedding mechanism
I agree with this statement :) Models that don't have any position embeddings (like sinusoidal or learned or alibi) are actually able to achieve good (but not great) PPL because the causal mask encodes some kind of order.
Thanks a lot for the fix! 🎊
Can we also add a test in test_model.py in order to test alibi when we use it without megatron?
Not super clear on what you mean by …
Oh sorry, I meant …
Ah, makes sense ... I'll try something out.
Okay, after thinking about it, I think we need to create a test so that all the models we have match their deepspeed version. Basically, the only reason I see for adding this test is that our EAI evaluation #212 uses the Megatron-LM version of the models, so it applies to all the models we have. I don't think we should implement that test in this PR; I'll open an issue to create a test for all models. #226
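For illustration, a minimal sketch of what such a Megatron-vs-DeepSpeed parity test might look like. All helper names (get_default_args, build_megatron_gpt, build_deepspeed_gpt, make_dummy_batch) are placeholders, not existing utilities in the repo.

```python
import torch


def test_megatron_matches_deepspeed_version():
    # Build the same architecture twice from the same seed: once through the
    # plain Megatron path and once through the DeepSpeed path (placeholders).
    args = get_default_args(position_embedding_type="alibi")
    torch.manual_seed(0)
    megatron_model = build_megatron_gpt(args)
    torch.manual_seed(0)
    deepspeed_model = build_deepspeed_gpt(args)

    # Identical dummy batch for both models (placeholder helper).
    tokens, position_ids, attention_mask = make_dummy_batch(args)
    with torch.no_grad():
        megatron_logits = megatron_model(tokens, position_ids, attention_mask)
        deepspeed_logits = deepspeed_model(tokens, position_ids, attention_mask)

    # The two paths should produce the same outputs.
    torch.testing.assert_close(megatron_logits, deepspeed_logits)
```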
Following the latest offline discussions: if we ever have to run the evaluation on the Megatron version without deepspeed, we will add a test to verify that the architectures are identical.
In the meantime, I have full confidence in this fix and I think we can move forward and relaunch a training run.
Thanks again Thomas! 🚀
@@ -470,9 +486,19 @@ def __init__(self, init_method, output_layer_init_method,
        self.mlp = ParallelMLP(init_method,
                               output_layer_init_method)

        # Alibi
        if args.position_embedding_type == PositionEmbeddingType.alibi:
            self.alibi = self._build_alibi_tensor(args.seq_length, args.num_attention_heads, args.micro_batch_size).to(torch.cuda.current_device())
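For context on the call above, here is roughly what an alibi-tensor builder does, following the ALiBi paper: compute one slope per attention head, multiply it by the relative positions, and tile the result across the micro batch. This is a sketch of the general recipe, not necessarily the exact implementation in this PR.

```python
import math

import torch


def build_alibi_tensor(seq_length, num_heads, batch_size):
    """Return an alibi bias of shape [batch_size * num_heads, 1, seq_length]."""
    def get_slopes(n):
        # Geometric sequence of per-head slopes: exact for powers of two,
        # interpolated from the two nearest powers of two otherwise.
        start = 2 ** (-(2 ** -(math.log2(n) - 3)))
        if math.log2(n).is_integer():
            return [start * start ** i for i in range(n)]
        closest = 2 ** math.floor(math.log2(n))
        return get_slopes(closest) + get_slopes(2 * closest)[0::2][: n - closest]

    slopes = torch.tensor(get_slopes(num_heads))
    # Relative positions 0..seq_length-1 scaled by each head's slope:
    # shape [num_heads, 1, seq_length].
    alibi = slopes.unsqueeze(1).unsqueeze(1) * torch.arange(seq_length).unsqueeze(0).unsqueeze(0)
    # Tile over the batch so the first dimension matches the attention-score layout.
    return alibi.repeat(batch_size, 1, 1)
```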
Does args.micro_batch_size work with batch size rampup? And if not, do we care?
The micro batch size doesn't increase during batch size rampup; it's constant.
We should care, since all our experiments run with batch size rampup, but I would expect it to crash loudly if it didn't match.
Have we tested that we get the same results with and without tensor parallelism? Correct me if I'm wrong, but I think the attention heads are distributed across workers, since self.query_key_value is ColumnParallel. We might need to extract the correct part of the alibi tensor depending on the tensor-parallel index, since different attention heads should have different alibi slopes? i.e. …
might be wrong?
That's actually a great point. Skimming through it, I feel you're completely right! Though we are lucky, as we're using TP=1 in our experiments. If I understand correctly, you believe … If so, I would suggest you open an issue/PR to introduce such a test, and if it fails, fix it?
Yup, exactly: there should be some offset depending on the tp_rank, and possibly also a stride, since the batch_size is baked into the first dimension as well. I've opened an issue for it here: #227
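To make the concern concrete, here is a hypothetical sketch of the kind of slicing tracked in #227: each tensor-parallel rank keeps only the alibi rows for the heads it owns. The shape assumptions follow the builder sketch above, and the function and argument names are illustrative.

```python
def slice_alibi_for_tp_rank(alibi, num_heads, batch_size, tp_rank, tp_world_size):
    """Keep only this rank's heads from an alibi tensor of shape
    [batch_size * num_heads, 1, seq_length], where heads vary fastest."""
    heads_per_rank = num_heads // tp_world_size
    seq_length = alibi.shape[-1]
    # Undo the batch tiling so the head dimension is explicit.
    alibi = alibi.view(batch_size, num_heads, 1, seq_length)
    start = tp_rank * heads_per_rank
    local = alibi[:, start:start + heads_per_rank]
    # Re-flatten to the layout the attention scores expect.
    return local.reshape(batch_size * heads_per_rank, 1, seq_length)
```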