[BLOOM] Clean modeling code #18344
Conversation
The documentation is not available anymore as the PR was closed or merged.
Ran a quick test on the generation speed:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from timeit import timeit


def main():
    model_name = "bigscience/bloom-350m"
    max_length = 50

    model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    text = "Hello my name is"
    input_ids = tokenizer.encode(text, return_tensors="pt").cuda()

    print(timeit(lambda: model.generate(input_ids, max_length=max_length), number=100))


if __name__ == "__main__":
    main()
```

The results are the following on A100:
Awesome!
matmul_result = alibi.baddbmm(
    batch1=query_layer,
    batch2=key_layer,
    beta=self.beta,
Can we remove the `beta` if it's 1 by default anyway?
beta=self.beta,
I don't like relying on defaults :D
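For context, `Tensor.baddbmm(batch1, batch2, beta=..., alpha=...)` computes `beta * input + alpha * (batch1 @ batch2)`, so keeping `beta` explicit documents that the alibi bias is added with weight 1. A minimal sketch of the equivalence (shapes are arbitrary and chosen only for illustration, not taken from the BLOOM code):

```python
import torch

batch_size_times_heads, seq_len, head_dim = 4, 8, 16

# alibi bias, query and (already transposed) key laid out for batched matmul
alibi = torch.randn(batch_size_times_heads, seq_len, seq_len)
query_layer = torch.randn(batch_size_times_heads, seq_len, head_dim)
key_layer = torch.randn(batch_size_times_heads, head_dim, seq_len)

inv_norm_factor = 1.0 / head_dim ** 0.5

# fused: beta * alibi + alpha * (query_layer @ key_layer)
fused = alibi.baddbmm(batch1=query_layer, batch2=key_layer, beta=1.0, alpha=inv_norm_factor)

# unfused reference computation
reference = alibi + inv_norm_factor * torch.bmm(query_layer, key_layer)

torch.testing.assert_close(fused, reference)
```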
Really nice job! Some small nits. Did you already try rerunning any of the evals with this branch, e.g. HumanEval?
self.layer_number = max(1, layer_number)
self.norm_factor = math.sqrt(self.head_dim) * self.layer_number
self.inv_norm_factor = 1.0 / math.sqrt(self.head_dim)
self.beta = 1.0
Is this change because we did not use `apply_query_key_layer_scaling` for the training of the BLOOM models?
No, we did, but I actually don't think we need it. I think the trick was used to be able to train fp16 models, but since our fp16 models are small and the big ones are bf16, that trick shouldn't be used anymore IMO.
But don't we still need it then, as the checkpoints were trained with layer scaling?
Why? So that you perfectly mimic the logits? Not sure this matters, and we should make sure to obtain the maximum precision we can get from the weights IMO. I'm running a bunch of `greedy_search` runs on 175b right now to assess whether that matters.
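For background, here is a rough sketch of the Megatron-style `apply_query_key_layer_scaling` trick being discussed (an illustration only, not the exact BLOOM/Meg-DS code): the attention scores are divided by an extra factor of the layer number before the softmax, and the same factor is multiplied back in right before the softmax, so the result is unchanged in exact arithmetic while fp16 intermediates stay small:

```python
import torch

def scores_plain(q, k, head_dim):
    # standard scaled dot-product attention scores
    return (q @ k.transpose(-1, -2)) / head_dim ** 0.5

def scores_layer_scaled(q, k, head_dim, layer_number):
    # divide by an extra `layer_number` so fp16 intermediate scores stay small ...
    scaled = (q @ k.transpose(-1, -2)) / (head_dim ** 0.5 * layer_number)
    # ... then multiply it back right before the softmax
    # (Megatron does this inside its fused softmax, in fp32)
    return scaled * layer_number

q, k = torch.randn(2, 4, 16), torch.randn(2, 4, 16)

a = torch.softmax(scores_plain(q, k, head_dim=16), dim=-1)
b = torch.softmax(scores_layer_scaled(q, k, head_dim=16, layer_number=12), dim=-1)
torch.testing.assert_close(a, b)  # identical up to floating-point rounding
```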
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Agree with @sgugger regarding the shape of the past key values: fine for me for the shape to be changed. We did change the format for this output two years ago in #9391 and #9596, which could also have been considered breaking. I'd like for this change to be contained to BLOOM (i.e., not ported to GPT-2) until @patrickvonplaten can comment as he was very involved in that reshape.
Thanks @LysandreJik! Yeah, this PR only changes the BLOOM modeling code.
This reverts commit 50e1f2f.
Okay, tested 176b on LAMBADA:
I think this PR is ready to be checked.
All looks good to me, and the multi-GPU tests pass with and without Accelerate. Thanks!
Just wanna say I especially love the change regarding the new way the attention mask is computed and used.
It is much cleaner and could potentially avoid a few issues we get from using large negative values and summing different masks.
While working on a previous PR (#17306), I had a discussion on Slack which suggested a similar approach.
However, there are other considerations before applying this change to the long-existing models in transformers.
For the record (the discussion on Slack):
I am wondering if another approach might be better: what if we don't add large negative values as currently done, so there is no need to do something like `(1.0 - extended_attention_mask) * -10000.0` etc. We just pass `attention_mask` as 0 and 1, and use it as a bool mask, like `torch.where(attn_mask, attn_weights, -torch.inf)`, before sending it to softmax.
(We already do this for `causal_mask`, but there we still use -1e4.) The disadvantage of this approach is that it requires more changes than just using `torch.finfo(tensor.dtype).min`.
IMO this requires too many changes and could confuse users that are used to our way of changing the `attention_mask` into large negative values. I'd prefer `torch.finfo(tensor.dtype).min`. Also, I think XLA likes the add operation more than `tf.where` / `jnp.where`.
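For illustration, here is a minimal sketch (tensor names and shapes are made up) of the two styles discussed above: the additive mask built from `torch.finfo(dtype).min`, and the boolean mask consumed via `torch.where` right before the softmax:

```python
import torch

attn_weights = torch.randn(1, 2, 4, 4)          # [batch, num_heads, q_length, kv_length]
attention_mask = torch.tensor([[1, 1, 1, 0]])   # 1 = attend, 0 = padding

# Style 1: additive mask using the most negative representable value
additive = (1.0 - attention_mask[:, None, None, :].to(attn_weights.dtype)) \
    * torch.finfo(attn_weights.dtype).min
probs_additive = torch.softmax(attn_weights + additive, dim=-1)

# Style 2: boolean mask applied right before the softmax
bool_mask = attention_mask[:, None, None, :].bool()
min_value = torch.tensor(torch.finfo(attn_weights.dtype).min)
probs_bool = torch.softmax(torch.where(bool_mask, attn_weights, min_value), dim=-1)
```

Both give (near-)zero probability on the padded position; the boolean variant avoids summing several large negative masks together.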
* replace triu operation for ONNX support
* remove in-place addition because `seq_length` could be a tensor and we don't want to change its value
* fix ONNX config past shapes
* nit: refactor head_dim
* add comment for why we replaced `triu`
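As an aside, the commit above does not show the replacement itself; a common ONNX-friendly way to build a causal mask without `triu`/`tril` is a broadcast comparison of `torch.arange`, sketched here (an illustration of the general technique, not necessarily the exact change):

```python
import torch

def make_causal_mask(seq_length: int, device=None) -> torch.Tensor:
    # True where a query position is allowed to attend to a key position
    seq_ids = torch.arange(seq_length, device=device)
    return seq_ids[None, :] <= seq_ids[:, None]   # [seq_length, seq_length], lower-triangular

# Equivalent to a lower-triangular matrix built with torch.tril
expected = torch.tril(torch.ones(5, 5, dtype=torch.bool))
assert torch.equal(make_causal_mask(5), expected)
```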
- latest accelerate, transformers, huggingface_hub
- rearrange attention caches to support huggingface/transformers#18344
- remove unused code
- fix edge case where session crashes when receiving seq length 0
- assert transformers version when importing WrappedBloomBlock

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
What does this PR do?
THERE'S A BREAKING CHANGE: `past_key_values` changes in terms of format, is that acceptable @sgugger @patrickvonplaten ?

Changes

Make a pass at cleaning up some code:
- `causal_mask`: convert to a bool tensor and use it via mask.
- `causal_mask` creation function.
- `baddbmm` instead of `bmm`; it was unclear why this was changed, and the training codebase uses `baddbmm`.
- `attention_scores` upcasting to fp32, as it mimics the training procedure the closest.
- `self.layer_number` normalization. It was introduced in Meg-DS to prevent overflow when computing attention scores in float16, but it doesn't seem to impact the model at all here.
- `reshape` in `.split_head()`, which reduces memory footprint and increases throughput.
- QKV with past values. Though this introduces a BREAKING CHANGE where `past_key`, instead of being stored in `[batch_size, num_heads, seq_length, head_dim]`, is stored in `[batch_size * num_heads, head_dim, seq_length]` (see the sketch after this list).
- `.view`, `.reshape`
- `batch_size`, `seq_length` etc ...
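To make the breaking change above concrete, here is a small sketch (an illustration only, not a library utility) of converting a cached key between the previous `[batch_size, num_heads, seq_length, head_dim]` layout and the new `[batch_size * num_heads, head_dim, seq_length]` layout:

```python
import torch

batch_size, num_heads, seq_length, head_dim = 2, 4, 7, 16

# previous layout: [batch_size, num_heads, seq_length, head_dim]
past_key_old = torch.randn(batch_size, num_heads, seq_length, head_dim)

# new layout: [batch_size * num_heads, head_dim, seq_length]
# (head_dim and seq_length are swapped so the key is ready for baddbmm/bmm with the query)
past_key_new = past_key_old.reshape(batch_size * num_heads, seq_length, head_dim).permute(0, 2, 1)

# and back again
restored = past_key_new.permute(0, 2, 1).view(batch_size, num_heads, seq_length, head_dim)
torch.testing.assert_close(restored, past_key_old)
```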
Speed-up in generation

Using the following script:
Results on A100: