[BLOOM] Clean modeling code #18344
Conversation
The documentation is not available anymore as the PR was closed or merged.
Ran a quick test on the generation speed:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from timeit import timeit


def main():
    model_name = "bigscience/bloom-350m"
    max_length = 50

    model = AutoModelForCausalLM.from_pretrained(model_name).cuda()
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    text = "Hello my name is"
    input_ids = tokenizer.encode(text, return_tensors="pt").cuda()

    print(timeit(lambda: model.generate(input_ids, max_length=max_length), number=100))


if __name__ == "__main__":
    main()
```

The results are the following on A100:
Awesome!
matmul_result = alibi.baddbmm(
    batch1=query_layer,
    batch2=key_layer,
    beta=self.beta,
Can we remove the `beta` if it's 1 by default anyway?
beta=self.beta,
I don't like relying on defaults :D
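For context, `Tensor.baddbmm(batch1, batch2, beta=..., alpha=...)` computes `beta * input + alpha * (batch1 @ batch2)`, so keeping `beta` explicit documents that the alibi bias is added with weight 1. A minimal sketch of the equivalence (shapes are arbitrary and chosen only for illustration, not taken from the BLOOM code):

```python
import torch

batch_size_times_heads, seq_len, head_dim = 4, 8, 16

# alibi bias, query and (already transposed) key laid out for batched matmul
alibi = torch.randn(batch_size_times_heads, seq_len, seq_len)
query_layer = torch.randn(batch_size_times_heads, seq_len, head_dim)
key_layer = torch.randn(batch_size_times_heads, head_dim, seq_len)

inv_norm_factor = 1.0 / head_dim ** 0.5

# fused: beta * alibi + alpha * (query_layer @ key_layer)
fused = alibi.baddbmm(batch1=query_layer, batch2=key_layer, beta=1.0, alpha=inv_norm_factor)

# unfused reference computation
reference = alibi + inv_norm_factor * torch.bmm(query_layer, key_layer)

torch.testing.assert_close(fused, reference)
```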
Really nice job! Some small nits. Did you already try rerunning any of the evals with this branch, e.g. HumanEval?
self.layer_number = max(1, layer_number)
self.norm_factor = math.sqrt(self.head_dim) * self.layer_number
self.inv_norm_factor = 1.0 / math.sqrt(self.head_dim)
self.beta = 1.0
Is this change because we did not use `apply_query_key_layer_scaling` for the training of the BLOOM models?
No, we did, but I actually don't think we need it. I think the trick was used to be able to train fp16 models, but since our fp16 models are small and the big ones are bf16, that trick shouldn't be used anymore IMO.
But don't we still need it then, as the checkpoints were trained with layer scaling?
Why? So that you perfectly mimic the logits? Not sure this matters, and we should make sure to obtain the maximum precision we can get from the weights IMO. I'm running a bunch of `greedy_search` runs on 175b right now to assess whether that matters.
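For background, here is a rough sketch of the Megatron-style `apply_query_key_layer_scaling` trick being discussed (an illustration only, not the exact BLOOM/Meg-DS code): the attention scores are divided by an extra factor of the layer number before the softmax, and the same factor is multiplied back in right before the softmax, so the result is unchanged in exact arithmetic while fp16 intermediates stay small:

```python
import torch

def scores_plain(q, k, head_dim):
    # standard scaled dot-product attention scores
    return (q @ k.transpose(-1, -2)) / head_dim ** 0.5

def scores_layer_scaled(q, k, head_dim, layer_number):
    # divide by an extra `layer_number` so fp16 intermediate scores stay small ...
    scaled = (q @ k.transpose(-1, -2)) / (head_dim ** 0.5 * layer_number)
    # ... then multiply it back right before the softmax
    # (Megatron does this inside its fused softmax, in fp32)
    return scaled * layer_number

q, k = torch.randn(2, 4, 16), torch.randn(2, 4, 16)

a = torch.softmax(scores_plain(q, k, head_dim=16), dim=-1)
b = torch.softmax(scores_layer_scaled(q, k, head_dim=16, layer_number=12), dim=-1)
torch.testing.assert_close(a, b)  # identical up to floating-point rounding
```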
Co-authored-by: Niklas Muennighoff <n.muennighoff@gmail.com>
Agree with @sgugger regarding the shape of the past key values: fine for me for the shape to be changed. We did change the format for this output two years ago in #9391 and #9596, which could also have been considered breaking. I'd like for this change to be contained to BLOOM (i.e., not ported to GPT-2) until @patrickvonplaten can comment as he was very involved in that reshape.
Thanks @LysandreJik! Yeah, this PR only changes the BLOOM modeling code.
This reverts commit 50e1f2f.
Okay, tested 176b on LAMBADA:
I think this PR is ready to be checked.
All looks good to me, and the multi-GPU tests pass with and without Accelerate. Thanks!
Just wanna say I especially love the change regarding the new way the attention mask is computed and used.
It is much cleaner and could potentially avoid a few issues we get from using large negative values and summing different masks.
While working on a previous PR (#17306), I had a discussion on Slack which suggested a similar approach.
However, there are other considerations before applying this change to the long-existing models in transformers.
For the record (the discussion on Slack):
I am wondering if another approach might be better: what if we don't add large negative values as currently done, so there is no need to do something like `(1.0 - extended_attention_mask) * -10000.0` etc. We just pass `attention_mask` as 0 and 1, and use it as a bool mask, like `torch.where(attn_mask, attn_weights, -torch.inf)`, before sending it to softmax.
(We already do this for `causal_mask`, but there we still use -1e4.) The disadvantage of this approach is that it requires more changes than just using `torch.finfo(tensor.dtype).min`.
IMO this requires too many changes and could confuse users that are used to our way of changing the `attention_mask` into large negative values. I'd prefer `torch.finfo(tensor.dtype).min`. Also, I think XLA likes the add operation more than `tf.where` / `jnp.where`.
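For illustration, here is a minimal sketch (tensor names and shapes are made up) of the two styles discussed above: the additive mask built from `torch.finfo(dtype).min`, and the boolean mask consumed via `torch.where` right before the softmax:

```python
import torch

attn_weights = torch.randn(1, 2, 4, 4)          # [batch, num_heads, q_length, kv_length]
attention_mask = torch.tensor([[1, 1, 1, 0]])   # 1 = attend, 0 = padding

# Style 1: additive mask using the most negative representable value
additive = (1.0 - attention_mask[:, None, None, :].to(attn_weights.dtype)) \
    * torch.finfo(attn_weights.dtype).min
probs_additive = torch.softmax(attn_weights + additive, dim=-1)

# Style 2: boolean mask applied right before the softmax
bool_mask = attention_mask[:, None, None, :].bool()
min_value = torch.tensor(torch.finfo(attn_weights.dtype).min)
probs_bool = torch.softmax(torch.where(bool_mask, attn_weights, min_value), dim=-1)
```

Both give (near-)zero probability on the padded position; the boolean variant avoids summing several large negative masks together.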
* replace triu operation for ONNX support
* remove in-place addition because `seq_length` could be a tensor and we don't want to change its value
* fix ONNX config past shapes
* nit: refactor head_dim
* add comment for why we replaced `triu`
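As an aside, the commit above does not show the replacement itself; a common ONNX-friendly way to build a causal mask without `triu`/`tril` is a broadcast comparison of `torch.arange`, sketched here (an illustration of the general technique, not necessarily the exact change):

```python
import torch

def make_causal_mask(seq_length: int, device=None) -> torch.Tensor:
    # True where a query position is allowed to attend to a key position
    seq_ids = torch.arange(seq_length, device=device)
    return seq_ids[None, :] <= seq_ids[:, None]   # [seq_length, seq_length], lower-triangular

# Equivalent to a lower-triangular matrix built with torch.tril
expected = torch.tril(torch.ones(5, 5, dtype=torch.bool))
assert torch.equal(make_causal_mask(5), expected)
```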
- latest accelerate, transformers, huggingface_hub
- rearrange attention caches to support huggingface/transformers#18344
- remove unused code
- fix edge case where session crashes when receiving seq length 0
- assert transformers version when importing WrappedBloomBlock

Co-authored-by: Alexander Borzunov <borzunov.alexander@gmail.com>
Co-authored-by: Max Ryabinin <mryabinin0@gmail.com>
What does this PR do?
THERE'S A BREAKING CHANGE: `past_key_values` changes in terms of format, is that acceptable @sgugger @patrickvonplaten ?

Changes

Make a pass at cleaning up some code:
- `causal_mask`: convert to a bool tensor and use it via mask.
- `causal_mask` creation function.
- `baddbmm` instead of `bmm`; it was unclear why this was changed, and the training codebase uses `baddbmm`.
- `attention_scores` upcasting to fp32, as it mimics the training procedure the closest.
- `self.layer_number` normalization. It was introduced in Meg-DS to prevent overflow when computing attention scores in float16, but it doesn't seem to impact the model at all here.
- `reshape` in `.split_head()`, which reduces memory footprint and increases throughput.
- QKV with past values. Though this introduces a BREAKING CHANGE where `past_key`, instead of being stored in `[batch_size, num_heads, seq_length, head_dim]`, is stored in `[batch_size * num_heads, head_dim, seq_length]` (see the sketch after this list).
- `.view`, `.reshape`
- `batch_size`, `seq_length` etc ...
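To make the breaking change above concrete, here is a small sketch (an illustration only, not a library utility) of converting a cached key between the previous `[batch_size, num_heads, seq_length, head_dim]` layout and the new `[batch_size * num_heads, head_dim, seq_length]` layout:

```python
import torch

batch_size, num_heads, seq_length, head_dim = 2, 4, 7, 16

# previous layout: [batch_size, num_heads, seq_length, head_dim]
past_key_old = torch.randn(batch_size, num_heads, seq_length, head_dim)

# new layout: [batch_size * num_heads, head_dim, seq_length]
# (head_dim and seq_length are swapped so the key is ready for baddbmm/bmm with the query)
past_key_new = past_key_old.reshape(batch_size * num_heads, seq_length, head_dim).permute(0, 2, 1)

# and back again
restored = past_key_new.permute(0, 2, 1).view(batch_size, num_heads, seq_length, head_dim)
torch.testing.assert_close(restored, past_key_old)
```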
Speed-up in generation

Using the following script:
Results on A100: