[BLOOM] Clean modeling code #18344

Merged: 51 commits from thomas/bloom_clean_code into main on Aug 4, 2022
Conversation

@thomasw21 (Contributor) commented on Jul 28, 2022

What does this PR do?

THERE'S A BREAKING CHANGE: the format of past_key_values changes. Is that acceptable @sgugger @patrickvonplaten?

Changes

Make a pass at cleaning up some code:

  • convert causal_mask to a bool tensor and apply it directly as a mask.
  • simplify the causal_mask creation function.
  • revert to baddbmm instead of bmm; it was unclear why this had been changed, and the training codebase uses baddbmm.
  • switch attention_scores upcasting back to fp32, as that mimics the training procedure most closely.
  • remove the self.layer_number normalization. It was introduced in Meg-DS to prevent overflow when computing attention scores in float16, but it doesn't seem to impact the model at all here.
  • remove a reshape in .split_head(), which reduces the memory footprint and increases throughput.
  • remove multiple reshapes when computing QKV with past values. This introduces a BREAKING CHANGE: past_key is now stored as [batch_size * num_heads, head_dim, seq_length] instead of [batch_size, num_heads, seq_length, head_dim] (see the sketch after this list).
  • make all dimensions explicit when calling .view / .reshape.
  • standardize naming for batch_size, seq_length, etc.
  • improve type hints.
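For illustration, here is a minimal sketch of the cache layout change and how to convert between the two formats (the tensor values and variable names are made up for the example, not taken from the modeling code):

import torch

batch_size, num_heads, seq_length, head_dim = 2, 16, 10, 64

# Old format: [batch_size, num_heads, seq_length, head_dim]
old_past_key = torch.randn(batch_size, num_heads, seq_length, head_dim)

# New format: [batch_size * num_heads, head_dim, seq_length]
# (the key is stored pre-transposed, which matches what baddbmm's batch2 argument expects)
new_past_key = old_past_key.permute(0, 1, 3, 2).reshape(batch_size * num_heads, head_dim, seq_length)

# Converting back, e.g. for downstream code written against the old format:
restored = new_past_key.view(batch_size, num_heads, head_dim, seq_length).permute(0, 1, 3, 2)
assert torch.equal(restored, old_past_key)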

Speed-up in generation

Using the following script:

from transformers import AutoTokenizer, AutoModelForCausalLM
from timeit import timeit

def main():
    model_name = "bigscience/bloom-350m"
    max_length = 50
    batch_size = 16
    model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    texts = ["Hello my name is"] * batch_size
    input_ids = tokenizer.batch_encode_plus(texts, return_tensors="pt").to("cuda")

    print(timeit(lambda: model.generate(**input_ids, max_length=max_length), number=100))


if __name__ == "__main__":
    main()

Results on A100:

This branch: 67.85658120410517
main: 78.72558180289343

@HuggingFaceDocBuilderDev commented on Jul 28, 2022

The documentation is not available anymore as the PR was closed or merged.

@thomasw21 force-pushed the thomas/bloom_clean_code branch from 2550054 to ddbe33e on July 28, 2022 at 19:44
@thomasw21 (Contributor, Author) commented on Jul 28, 2022

Ran a quick test of the generation speed:

from transformers import AutoTokenizer, AutoModelForCausalLM
from timeit import timeit

def main():
    model_name = "bigscience/bloom-350m"
    max_length = 50
    model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    text = "Hello my name is"
    input_ids = tokenizer.encode(text, return_tensors="pt").cuda()

    print(timeit(lambda: model.generate(input_ids, max_length=max_length), number=100))


if __name__ == "__main__":
    main()

The results on A100 are the following:

This branch: 62.44300797022879
Main: 70.0128909018822

@michaelbenayoun (Member) left a comment:

Awesome!

matmul_result = alibi.baddbmm(
    batch1=query_layer,
    batch2=key_layer,
    beta=self.beta,
Review comment (Contributor):

Can we remove beta, if it's 1 by default anyway?

Suggested change: remove the `beta=self.beta,` line.

Reply (@thomasw21, Contributor, Author):

I don't like relying on defaults :D
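For context on what baddbmm is doing here, a minimal standalone sketch (the shapes, the alpha value, and the variable names are assumptions for illustration, not the actual modeling code):

import torch

# baddbmm computes: out = beta * input + alpha * (batch1 @ batch2)
# The alibi bias plays the role of `input`, so keeping beta=1.0 explicit documents
# that the bias is added with weight 1 rather than relying on the default.
batch_size_x_heads, seq_length, head_dim = 32, 10, 64
alibi = torch.randn(batch_size_x_heads, seq_length, seq_length)
query_layer = torch.randn(batch_size_x_heads, seq_length, head_dim)
key_layer = torch.randn(batch_size_x_heads, head_dim, seq_length)

inv_norm_factor = 1.0 / head_dim**0.5  # assumed to mirror 1/sqrt(head_dim)
matmul_result = alibi.baddbmm(
    batch1=query_layer,
    batch2=key_layer,
    beta=1.0,
    alpha=inv_norm_factor,
)
print(matmul_result.shape)  # torch.Size([32, 10, 10])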

@Muennighoff (Contributor) left a comment:

Really nice job! Some small nits. Did you already try rerunning any of the evals with this branch, e.g. HumanEval?

Comment on lines -233 to +231

- self.layer_number = max(1, layer_number)
- self.norm_factor = math.sqrt(self.head_dim) * self.layer_number
+ self.inv_norm_factor = 1.0 / math.sqrt(self.head_dim)
+ self.beta = 1.0
Review comment (Contributor):

Is this change because we did not use apply_query_key_layer_scaling for the training of the BLOOM models?

Reply (@thomasw21, Contributor, Author):

No, we did use it, but I don't think we actually need it. I think the trick was there to be able to train fp16 models, but since our fp16 models are small and the big ones are bf16, that trick shouldn't be used anymore IMO.

Review comment (Contributor):

But don't we still need it then, since the checkpoints were trained with layer scaling?

Reply (@thomasw21, Contributor, Author):

Why? So that the logits are mimicked perfectly? I'm not sure that matters, and we should make sure to obtain the maximum precision we can get from the weights IMO. I'm running a bunch of greedy_search runs on the 175b model right now to assess whether it matters.
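For context, a minimal sketch (my own illustration, not the Megatron-DeepSpeed or BLOOM code) of why the layer-number scaling is a numerics trick rather than a modeling change, under the assumption that the scores are divided by an extra layer_number factor and then scaled back up before the fp32 softmax:

import torch

torch.manual_seed(0)
head_dim, layer_number = 64, 7
raw_scores = torch.randn(2, 4, 10, 10) * head_dim**0.5  # stand-in for q @ k^T

# Without the trick: the usual 1/sqrt(head_dim) scaling.
probs_plain = torch.softmax(raw_scores / head_dim**0.5, dim=-1)

# With the trick: divide by an extra layer_number, then scale back just before softmax.
scaled = raw_scores / (head_dim**0.5 * layer_number)
probs_trick = torch.softmax(scaled * layer_number, dim=-1)

# Mathematically identical; only the dynamic range of the intermediate `scaled`
# tensor changes, which is what mattered for fp16 training stability.
print(torch.allclose(probs_plain, probs_trick, atol=1e-5))  # True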

(Resolved review thread on src/transformers/models/bloom/modeling_bloom.py, now outdated.)
thomasw21 and others added 2 commits on August 1, 2022 at 11:45
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
@LysandreJik (Member) commented:

Agree with @sgugger regarding the shape of the past key values: it's fine for me for the shape to change. We changed the format of this output two years ago in #9391 and #9596, which could also have been considered breaking.

I'd like this change to stay contained to BLOOM (i.e., not ported to GPT-2) until @patrickvonplaten can comment, as he was very involved in that reshape.

@thomasw21 (Contributor, Author) replied:

Thanks @LysandreJik! Yeah, this PR only changes the BLOOM modeling code.

@thomasw21 (Contributor, Author) commented on Aug 1, 2022

Okay, tested the 176b model on LAMBADA:

## This branch
{
  "results": {
    "lambada": {
      "ppl": 3.9275610127374367,
      "ppl_stderr": 0.08456328894237608,
      "acc": 0.6716475839316903,
      "acc_stderr": 0.006542638265686496
    }
  },
  "versions": {
    "lambada": 0
  }
}

## 4.21.0
{
  "results": {
    "lambada": {
      "ppl": 3.933199250271968,
      "ppl_stderr": 0.08471163223917309,
      "acc": 0.6708713370851931,
      "acc_stderr": 0.006546580975553108
    }
  },
  "versions": {
    "lambada": 0
  }
}

I think this PR is ready for review.

@sgugger (Collaborator) left a comment:

All looks good to me, and the multi-GPU tests pass with and without Accelerate. Thanks!

@ydshieh (Collaborator) left a comment:

Just wanted to say I especially love the change to how the attention mask is computed and used.
It is much cleaner and could potentially avoid a few issues that arise from using large negative values and summing different masks.

While working on a previous PR, #17306, I had a discussion on Slack which suggested a similar approach.
However, there are other considerations when applying this change to the long-existing models in transformers.

For the record (the discussion on Slack):

I am wondering if another approach might be better: what if we don't add large negative values as is currently done, and so don't need to do something like (1.0 - extended_attention_mask) * -10000.0? We would just pass attention_mask as 0s and 1s and use it as a bool mask, like torch.where(attn_mask, attn_weights, -torch.inf), before sending it to softmax.
(We already do this for causal_mask, but there we still use -1e4.) The disadvantage of this approach is that it requires more changes than just using torch.finfo(tensor.dtype).min.

IMO this requires too many changes and could confuse users who are used to our way of converting the attention_mask into large negative values. I'd prefer torch.finfo(tensor.dtype).min. Also, I think XLA likes the add operation more than tf.where / jnp.where.
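To make the two approaches concrete, here is a small self-contained sketch (my own illustration; the shapes and variable names are made up and this is not the transformers internals):

import torch

# Toy attention scores and a padding mask (True = attend, False = masked out).
attn_weights = torch.randn(2, 4, 5, 5)              # [batch, heads, query_len, key_len]
attn_mask = torch.tensor([[1, 1, 1, 0, 0],
                          [1, 1, 1, 1, 1]]).bool()  # [batch, key_len]
attn_mask = attn_mask[:, None, None, :]             # broadcast over heads and queries

# Additive approach (the long-standing pattern): turn the mask into large negatives.
additive = (1.0 - attn_mask.float()) * torch.finfo(attn_weights.dtype).min
probs_additive = torch.softmax(attn_weights + additive, dim=-1)

# Bool-mask approach (roughly what this PR moves BLOOM towards): mask the scores directly.
masked = attn_weights.masked_fill(~attn_mask, torch.finfo(attn_weights.dtype).min)
probs_bool = torch.softmax(masked, dim=-1)

print(torch.allclose(probs_additive, probs_bool, atol=1e-6))  # True on this toy input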

* replace the triu operation for ONNX support (see the sketch below)
* remove an in-place addition, because `seq_length` could be a tensor and we don't want to change its value
* fix ONNX config past shapes
* nit: refactor head_dim
* add a comment for why we replaced `triu`
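For reference, a common ONNX-friendly way to build a causal mask without torch.triu is a plain index comparison; a minimal sketch below (my own illustration, not necessarily the exact replacement used in the commit):

import torch

def make_causal_mask(seq_length: int, device=None) -> torch.Tensor:
    # True where a query position may attend to a key position (key index <= query index).
    # The index comparison avoids torch.triu, which some ONNX export paths handle poorly.
    arange = torch.arange(seq_length, device=device)
    return arange[None, :] <= arange[:, None]  # [seq_length, seq_length], bool

print(make_causal_mask(4).int())
# tensor([[1, 0, 0, 0],
#         [1, 1, 0, 0],
#         [1, 1, 1, 0],
#         [1, 1, 1, 1]], dtype=torch.int32)
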
@thomasw21 merged commit b69a62d into main on Aug 4, 2022
@thomasw21 deleted the thomas/bloom_clean_code branch on August 4, 2022 at 09:08
This was referenced Aug 10, 2022
justheuristic added a commit to bigscience-workshop/petals that referenced this pull request Dec 13, 2022
- latest accelerate, transformers, huggingface_hub
- rearrange attention caches to support huggingface/transformers#18344
- remove unused code
- fix edge case where session crashes when receiving seq length 0
- assert transformer version when importing WrappedBloomBlock

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>