
Efficient BLOOM Inference on PyTorch #432

Open
stas00 wants to merge 18 commits into main from bloom-inference-pytorch
Conversation

@stas00 (Contributor) commented Jul 20, 2022

WIP: Efficient BLOOM Inference on PyTorch

Covers Accelerate, DS-ZeRO, DS-Inference and hopefully a whole new implementation Nicolas created in Rust for servers.

Preview: https://github.com/stas00/blog/blob/bloom-inference-pytorch/bloom-inference-pytorch.md

cc: @sgugger, @Narsil

@sgugger (Contributor) left a comment

Thanks for drafting this! Left some nits on the text.

bloom-inference-pytorch.md — 10 review threads (outdated, resolved)
@Narsil (Contributor) left a comment

I left a few comments.

I think for the big picture we need an extra paragraph explaining that for a live server / a few queries we're bottlenecked by memory bandwidth, but that as batch_size increases this becomes less and less relevant because we're more and more compute bound.

This is a very important fact to keep in mind, especially when comparing throughputs like we're doing, since measuring throughput while increasing batch_size is essentially measuring how well the GPU can distribute the extra work across its cores (so what I'm trying to say is that it's not really a fair comparison once we start to become compute bound).

Since memory bandwidth is the limiting factor, it should be a little more present in the message we're trying to convey IMO (because that's how DS obtains better performance, not really by using TP).
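To make the memory-bandwidth point concrete, here is a back-of-envelope estimate (a sketch with assumed round numbers, not measurements from this PR): at small batch sizes every generated token requires streaming all of the weights through the GPUs once, so the model size divided by the aggregate memory bandwidth gives a rough floor on per-token latency.

```python
# Roofline-style estimate; all numbers are rounded assumptions for illustration.
params = 176e9                  # BLOOM parameter count
bytes_per_param = 2             # bf16/fp16 weights
weights_bytes = params * bytes_per_param            # ~352 GB

hbm_bandwidth_per_gpu = 2.0e12  # ~2 TB/s for an A100-80GB (approximate)
num_gpus = 8
aggregate_bandwidth = hbm_bandwidth_per_gpu * num_gpus

# At small batch sizes each new token requires reading every weight once,
# so memory bandwidth sets a floor on the single-forward latency.
min_seconds_per_token = weights_bytes / aggregate_bandwidth
print(f"~{min_seconds_per_token * 1000:.0f} ms/token floor at small batch sizes")
```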

bloom-inference-pytorch.md (review thread)

For doing inference on a server the two important metrics are the overall latency and the throughput of tokens. The loading time is less important since once loaded, the model becomes persistent.

The time to generate a token is straightforward - this is just how long it takes for a new token to be generated. As long as `use_cache=True` (the default argument to `generate`) all the previous logits are saved and therefore each token takes the same amount of time to generate. This of course comes at a memory cost - as the model is huge it can easily take GBs of memory to keep the cache. One can turn the caching off, and save memory, but it'd mean recalculating all the previous logits - which would make a long sequence generation very slow.
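For readers who want to see the `use_cache` trade-off in code, here is a minimal timing sketch. The small `bigscience/bloom-560m` checkpoint is only a stand-in so the snippet runs on a single GPU; the 176B model needs the multi-GPU setups discussed in this post.

```python
# Minimal sketch: timing generate() with and without the KV cache.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"  # small stand-in for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")

inputs = tok("DeepSpeed is a machine learning framework", return_tensors="pt").to("cuda")
new_tokens = 100

for use_cache in (True, False):
    torch.cuda.synchronize()
    t0 = time.time()
    model.generate(**inputs, max_new_tokens=new_tokens, use_cache=use_cache)
    torch.cuda.synchronize()
    elapsed = time.time() - t0
    print(f"use_cache={use_cache}: {elapsed / new_tokens * 1000:.1f} ms/token")
```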
@Narsil (Contributor) commented Jul 27, 2022

> which would make a long sequence generation very slow

Very surprisingly, disabling the past cache actually speeds up inference with accelerate + pipeline.
I don't have a real explanation, but it was ~350ms/token with the cache and ~320ms/token without (on small sequences + 20 tokens, no batching).

This is very surprising to me but really true. I can do the experiments again if you want.
I suspect the issue was even more tensor copies everywhere, and since this model is memory-bandwidth bound, it only adds up to the overall latency.

@stas00 (Contributor Author):

yes please - perhaps it was a very short output - otherwise it's super strange!

@Narsil (Contributor):

It is a very short input/output, still surprising to me.



The latency is more difficult to define. If the server is idle and the model is ready to generate, and one or more prompts are sent from the same user at once, the latency would be just the time to generate a single token multiplied by the number of new tokens to generate. However, if you have multiple users sending prompts at different times, things become much more complicated. The first user gets the fastest response and thus a low latency, but subsequent users, if they are unlucky enough to submit their request while a previous request is being processed, may have to wait until that earlier request has finished generating before their own request is attended to. It's possible to design a very efficient system that uses pipeline parallelism, where a new request can join the pipeline at any moment. But this of course will make the overall throughput lower. Then of course there is a small overhead of processing the input, sending data to the server and sending the results back to the user, but this is usually insignificant compared to the cost of generation on a huge model.
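A toy calculation of how the queue changes perceived latency (illustrative numbers only, not from the post):

```python
# Illustrative numbers, not measurements.
s_per_token = 0.1          # assumed time for one forward pass over the whole batch
tokens_requested = 100     # new tokens the user asked for

# Idle server: latency is just the generation time.
idle_latency = tokens_requested * s_per_token                             # 10 s

# Unlucky user: a previous request still has 50 tokens left to generate
# and the server only admits new requests between generate() calls.
tokens_still_running = 50
queued_latency = (tokens_still_running + tokens_requested) * s_per_token  # 15 s

print(idle_latency, queued_latency)
```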
@Narsil (Contributor):

> of course will make the overall throughput lower

I disagree with that statement. Why would it reduce throughput in any manner? There are some caveats for sure, but it's not obvious either way. Actually, if done properly you should get MORE throughput (since you're batching more, and batching is essentially free while in the low-batch-size regime).

@stas00 (Contributor Author):

It'd be slower since

  1. in the simple case you give all gpus to a single batch.
  2. with the pipeline you share gpus to process multiple batches.

So the time to process a single batch is now longer, and the throughput of a single batch is lower.

However, the back-to-back latency for a user entering the busy queue is lower.

Are we on the same page now?

That paragraph was there to specifically disambiguate the definition of latency in this context. So clearly it needs some work if it was unclear.

@Narsil (Contributor):

> It'd be slower since
>
>   1. in the simple case you give all gpus to a single batch.
>   2. with the pipeline you share gpus to process multiple batches.

I think I need to devise drawings, since in both cases you send the batch to all GPUs (it doesn't matter how you share the weights/computation across them). The only thing that matters is the overhead added by having another user on the same machine as you, and whether a user blocks the machine for a single forward pass or for the full generate loop.

@stas00 (Contributor Author) commented Aug 3, 2022

The logic is quite simple - the GPUs will process BS=1 and BS=64 at about the same speed. So if you can't fit the inputs into large batches, the processing time of the sequential batches that start after the first batch will lead to a lower overall throughput no matter how you cut it.

bloom-inference-pytorch.md (review thread)

In a situation where there are multiple GPUs with enough space to accommodate the whole model, it switches control from one GPU to the next until all layers have run. Only one GPU works at any given time, which sounds very inefficient, but it does produce excellent throughput despite the idling of the GPUs.

It is also very flexible since the same code can run on any given setup. Accelerate will use all available GPUs first, then offload to the CPU until the RAM is full, and finally to the disk. Offloading to CPU or disk will make things slower. As an example, users have reported running BLOOM with no code changes on just 2 A100s, with a throughput of 15s per token as compared to 10 msecs on 8x80 A100s.
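As a rough sketch of the Accelerate-based loading path described above (argument names may differ across transformers/accelerate versions, and the full 176B checkpoint needs several hundred GB of combined GPU/CPU/disk space):

```python
# Hedged sketch of loading BLOOM with Accelerate's big-model inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom"       # the 176B checkpoint, ~352 GB in bf16
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",          # fill the GPUs first, then CPU RAM, then disk
    torch_dtype=torch.bfloat16,
    offload_folder="offload",   # only used if weights spill to disk
)

inputs = tok("Hello, my name is", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(output[0]))
```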
Contributor:

> 15s per token as compared to 10 msecs on 8x80 A100s.

Do we know the batch size that was used for this?
Batch size will improve this metric linearly until you are compute bound.
So if the person got 15s/token at BS=1, then we need to compare against a throughput of 300ms/token on 16x A100 (so maybe 150ms/token on 8, but I am not sure).
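In other words (an illustrative restatement, not numbers from the PR): while the model is memory-bandwidth bound, the per-forward time barely moves with batch size, so comparisons should be normalized per sample.

```python
# Valid only while memory-bandwidth bound, i.e. before compute saturates.
def per_sample_s_per_token(forward_s: float, batch_size: int) -> float:
    """Seconds per token per sample when a single forward serves the whole batch."""
    return forward_s / batch_size

print(per_sample_s_per_token(15.0, 1))    # hypothetical 15 s/token at BS=1
print(per_sample_s_per_token(15.0, 64))   # ~0.23 s/token/sample at BS=64
```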

Contributor:

Yes it was with batch size 1.




## DeepSpeed-ZeRO
Contributor:

Given the results and the fact that it wasn't designed for inference, do you still think we should mention it?

@stas00 (Contributor Author):

Not sure, perhaps we can ask the deepspeed team and see if they want it or not

@stas00 (Contributor Author):

@jeffra, @tjruwase - what do you guys think? Should we not mention DS-ZeRO as an inference solution and only discuss DS-Inference? Thank you!

@tjruwase:

@stas00, I am fine with dropping DS-ZeRO because when there are enough GPUs to fit the model, DS-Inference is the DS recommendation. We find that DS-ZeRO is more suitable when there are not enough GPUs to fit the model and offloading to CPU or NVMe is necessary. Since the blog focuses on non-offload scenarios, I think including just one (the best) DS solution for such scenarios enhances clarity and avoids reader confusion. Just my two cents.

@stas00 (Contributor Author):

Thank you for the feedback, @tjruwase - I will remove it then.

In general the offloading should work, but with 176B it's just too slow to be practical.

Perhaps leave a mention of DS-ZeRO to say that its offloading features might be useful for smaller models? What do you think?

@tjruwase:

> I don't dare to even try doing offload with a single GPU, after my experiment with ZeRO/NVMe offload took 8h to process one batch. Surely, CPU offload will be better, but how many people who have only one GPU have 512GB of CPU RAM - so this doesn't quite make sense.

@stas00, sorry you had such a horror experience with NVMe offload. I agree that if BLOOM token generation with zero-inference took 8 hours per token then it wouldn't make sense. However, this is neither the expected behavior nor my experience, so zero-inference was still quite broken when you tried this. The expectation for NVMe offload is that since most devices are capable of 3GB/sec reads, streaming ~360GB BLOOM weights from a single NVMe into GPU to generate a token should take about 120 seconds, and not 8 hours. Using more NVMe devices should improve performance up to the PCIe bandwidth (e.g., ~16GB/sec) of a single GPU. These are the kind of numbers that I get in my experiments. We can sync offline if needed.
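Writing out the arithmetic behind those numbers (the bandwidth figures are the assumptions stated above):

```python
# Back-of-envelope check of the NVMe-offload numbers above.
weights_gb = 360            # BLOOM weights streamed from NVMe for every token
nvme_read_gb_s = 3          # a single NVMe device
pcie_gb_s = 16              # rough ceiling of one GPU's PCIe link

s_per_token_one_nvme = weights_gb / nvme_read_gb_s   # ~120 s/token
s_per_token_pcie_cap = weights_gb / pcie_gb_s        # ~22 s/token at best

print(s_per_token_one_nvme, s_per_token_pcie_cap)
print(f"{100 * s_per_token_one_nvme / 3600:.1f} h for 100 tokens at bs=1")
```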

@stas00 (Contributor Author) commented Aug 4, 2022

Oh really? My experiment was done just recently - I think 2 months ago - has something changed since then? I'd be happy to try again. I think it'd be great if anybody with a small GPU could run generation, even if it takes a longish time, but not hours.

It was 100 tokens, not one, with bs=1, on 1x A100 with a very fast, very recent Gen4 NVMe and PCIe-4. The GPU was idle most of the time and the NVMe IO was very heavy the whole time.

And yes we can discuss this for sure if you think we can speed it up significantly.

@tjruwase:

Yeah, I think the code stabilized about 2 weeks ago. Hmm, generating 100 tokens for a bs=1 prompt will require reading the model 100 times, which brings the e2e latency to 100*120secs = ~3.3 hours. Did I understand the scenario correctly? Perhaps this is the hours-long latency you are concerned about. Of course, we could amortize it by increasing the batch size to around 64, as the extra compute time is negligible compared to the PCIe transfer time. My experiments were focused on the time per token generation, and not on all the tokens.

@tjruwase:

@stas00, but I am interested in exploring a bit further, so please share your script with me. I can try on my side. Thanks!

@stas00 (Contributor Author):

It's the same script from the inference PR here:

https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/308/files#diff-b772a56ca7676205ef4bcf14ad5fd704246e81d7e35e6d41807775ebbb8b6bbf

The scenario is of a practical need - how long will it take a user who wants to do a single 100-token generation - and the desire is for the script to finish in a reasonable length of time. If we can do it in 3h with a single small GPU, I think that would be quite amazing, given the size of the model.

@stas00 (Contributor Author) commented Aug 2, 2022

> I think for the big picture we need an extra paragraph explaining ...

I was hoping you'd write a whole section on the server-style inference. I have even added you as an author ahead of time ;)

Perhaps this article could just discuss the server inference, summarizing your discoveries but not bothering with the code, and perhaps another article down the road could go into the details of the code if you'd be inspired to do so?

Because otherwise this article so far focuses on fast inference, and the server solution is a related but very different problem to solve, even though the token generation approach is the same w/ or w/o the server, IMHO.

@Narsil Narsil force-pushed the bloom-inference-pytorch branch 2 times, most recently from 4f90534 to 6151d4c Compare August 4, 2022 10:10
stas00 and others added 10 commits August 4, 2022 12:15
Added a small drawing in the first part that hopefully conveys a little
better what is happening.
@Narsil (Contributor) commented Aug 4, 2022

@stas00 I added a few things, don't hesitate to comment or modify directly.

I felt that the first part and the clarification about single-forward latency vs throughput (token/ms) was needed, otherwise readers would think we were actually getting 1 token in 0.7ms (which isn't correct afaik).

Latency = total run time / 5 (number of cycles) / 100 (tokens per cycle) should be roughly the time T shown in my diagram.
(Past vs initial inference are different, but I'm ignoring that for the sake of brevity, and you already mention it in the blog.)
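A quick numeric check of that formula (the total run time is a placeholder, not a measured value):

```python
# Placeholder numbers; only the formula is the point.
total_run_time_s = 350.0    # hypothetical end-to-end time of the benchmark
cycles = 5                  # generate() cycles in the benchmark
tokens_per_cycle = 100      # new tokens per cycle

T = total_run_time_s / cycles / tokens_per_cycle
print(f"single-forward latency T ≈ {T:.2f} s/token")   # 0.70 s, not 0.7 ms
```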

Usually I prefer to do PR over yours to not modify your work directly, but you seemed ok with the idea of pushing on top.

I checked the images on the blog on light/dark theme and everything should be readable.
I also did a Grammarly pass in a separate PR; hopefully I integrated the correct fixes/improvements.

@stas00 (Contributor Author) commented Aug 5, 2022

Hmm, editing the PR directly is fine, but you force-pushed, which means I now have no idea what you changed other than by manually diffing against the last version before your force-pushes. If you had edited the files and committed them directly, GitHub would have shown the diff.

Please avoid force-pushing in PRs that are a collaborative effort. Besides the diff issue, it also makes it very difficult to push local changes that were made before the force-push but weren't committed.

But what's done is done, I will just read it as a new text.

@stas00 (Contributor Author) commented Aug 5, 2022

I think the images you added weren't linked correctly - you can see in the preview https://github.com/stas00/blog/blob/bloom-inference-pytorch/bloom-inference-pytorch.md that they are 404.

@Narsil (Contributor) commented Aug 5, 2022

> but you force-pushed, which means I now have no idea what you changed other than by manually diffing against the last version before your force-pushes.

That was simply a rebase. But I understand. I never get why GitHub forces us to force-push to rebase, but there's no real way around it afaik. Here I shouldn't have, but I did it out of habit because of the merge conflict.

And the commit list is the same, so my changes are still readable as independent commits: 60d5fb8 (of course I could have pulled the rug out from under you, but I didn't, at least to the best of my knowledge).

@Narsil (Contributor) commented Aug 5, 2022

> I think the images you added weren't linked correctly - you can see in the preview https://github.com/stas00/blog/blob/bloom-inference-pytorch/bloom-inference-pytorch.md that they are 404.

Actually, the links are correct. I verified on a local build of the blog. It seems your original PR had the same URL and the same failure for your image: https://github.com/stas00/blog/blob/1a62600fa3ce36a9949b00908e4ed3a7681a3b68/bloom-inference-pytorch.md

The thing is that GitHub's markdown rendering is confused by the absolute path of the image (which isn't correct when viewing on GitHub, but is correct once on the hub).

@stas00 (Contributor Author) commented Aug 5, 2022

You don't need to force-push to rebase; here is a tool I wrote that I use everywhere to rebase:
https://github.com/stas00/git-tools/tree/master/git-rebase

And now anybody with a previous clone has to figure out how to solve the wrong branch issue:

```
$ git pull
hint: You have divergent branches and need to specify how to reconcile them.
hint: You can do so by running one of the following commands sometime before
hint: your next pull:
hint:
hint:   git config pull.rebase false  # merge (the default strategy)
hint:   git config pull.rebase true   # rebase
hint:   git config pull.ff only       # fast-forward only
hint:
hint: You can replace "git config" with "git config --global" to set a default
hint: preference for all repositories. You can also pass --rebase, --no-rebase,
hint: or --ff-only on the command line to override the configured default per
hint: invocation.
fatal: Need to specify how to reconcile divergent branches.
```

@stas00 (Contributor Author) commented Aug 5, 2022

Basically my local clone is completely broken right now after these force-pushes. I'm just going to make a new clone.

@stas00 (Contributor Author) commented Aug 5, 2022

I'm not sure how to proceed - you wiped out all the editorial work that had been done so far, deleting all the fixes :(

edit: not all - only my uncommitted changes got lost

I'm going to rewind the text to the last good version before your force-pushes, and then we will need to carefully re-add your changes.

Fixed the image paths as well, so it all gets rendered correctly here: https://github.com/stas00/blog/blob/bloom-inference-pytorch/bloom-inference-pytorch.md

Overall a great addition - and thanks for the visuals, which help a lot. Though I think one needs to put oneself into the shoes of a reader in some areas, as I wasn't able to follow everywhere.


![initial_drawing](assets/bloom-inference-pytorch/initial_drawing.png)

In this image, **T** is the simple forward latency. This is determined by how efficiently the **forward** pass is coded. But as you can see, using `8xA100` gives plenty of computing power, so we can increase the batch size, and that's (to a point) not going to change **T**. But you will be generating more tokens within **T**, so your throughput is effectively going up. As you increase your batch size, your memory consumption will likely go up too, so you might hit OOM errors **before** actually reaching your computational boundary.
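A hedged benchmarking sketch of that effect, using a small stand-in checkpoint and made-up batch sizes: the per-forward latency **T** stays roughly flat while per-sample throughput climbs, until memory runs out.

```python
# Sketch: sweep the batch size and watch T stay ~flat until OOM.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"   # stand-in so this runs on one GPU
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")
new_tokens = 20

for bs in (1, 8, 32, 128):
    try:
        batch = tok(["Hello world"] * bs, return_tensors="pt", padding=True).to("cuda")
        torch.cuda.synchronize()
        t0 = time.time()
        model.generate(**batch, max_new_tokens=new_tokens)
        torch.cuda.synchronize()
        t = (time.time() - t0) / new_tokens
        print(f"bs={bs}: T≈{t * 1000:.0f} ms/forward, {t / bs * 1000:.1f} ms/token/sample")
    except torch.cuda.OutOfMemoryError:
        print(f"bs={bs}: OOM before reaching the compute bound")
        break
```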
@stas00 (Contributor Author):

Your definition of T is very ambiguous - you first defined it here as all of the forwards together, and later in the images you used a single forward as T.

Moreover, not all forwards are created equal: some are faster, some are slower. You probably need to state which of them you're referring to here.


So now, what we're going to do is to have each request handle the `generate` part on its own, and simply send back the required tensors only for the `forward` pass. A single thread will be responsible for batching those simple requests.

Since each request is handling entirely its own way of choosing next ids, there is no problem in handling an infinite amount of different parameters.
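Since the real server is the Rust implementation mentioned at the top of this PR, here is only a rough Python sketch of the idea in this paragraph: each request drives its own generation loop and picks its own next token, while a single worker thread batches the pending `forward` calls. All names here are illustrative, and padding/attention-mask and KV-cache handling are ignored for brevity.

```python
# Rough illustrative sketch, not the real server: one worker batches forwards,
# each request keeps full control over its own sampling.
import queue
import threading
import torch

forward_queue = queue.Queue()   # holds (input_ids, result_slot) pairs

def batching_worker(model):
    """Single thread that batches whatever forward requests are currently pending."""
    while True:
        pending = [forward_queue.get()]
        while not forward_queue.empty():
            pending.append(forward_queue.get())
        # Right-pad to a rectangular batch (masks and the KV cache are omitted here).
        input_ids = torch.nn.utils.rnn.pad_sequence(
            [ids for ids, _ in pending], batch_first=True
        )
        with torch.no_grad():
            logits = model(input_ids).logits            # (batch, seq_len, vocab)
        for (ids, slot), row in zip(pending, logits):
            slot["logits"] = row[len(ids) - 1]          # logits at this request's last token
            slot["done"].set()

def generate_one(input_ids, max_new_tokens, pick_next):
    """Each request runs its own loop and chooses its next ids however it likes."""
    ids = input_ids
    for _ in range(max_new_tokens):
        slot = {"done": threading.Event()}
        forward_queue.put((ids, slot))
        slot["done"].wait()
        next_id = pick_next(slot["logits"])             # greedy, sampling, top-k, ...
        ids = torch.cat([ids, next_id.view(1)])
    return ids
```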
@stas00 (Contributor Author) commented Aug 5, 2022

what is "next ids"?

what is "different parameters"?

I think it's very difficult to follow, as the text uses a lot of names without defining which is which. I lost you several times while reading this text.
