
Efficient BLOOM Inference on PyTorch #432

Open
stas00 wants to merge 18 commits into main from bloom-inference-pytorch
Conversation

@stas00 (Contributor) commented Jul 20, 2022

WIP: Efficient BLOOM Inference on PyTorch

Covers Accelerate, DS-ZeRO, DS-Inference and hopefully a whole new implementation Nicolas created in Rust for servers.

Preview: https://github.com/stas00/blog/blob/bloom-inference-pytorch/bloom-inference-pytorch.md

cc: @sgugger, @Narsil

@sgugger (Contributor) left a comment

Thanks for drafting this! Left some nits on the text.

bloom-inference-pytorch.md — 10 review threads (outdated, resolved)
@Narsil (Contributor) left a comment

I left a few comments.

I think for the big picture we need an extra paragraph explaining that for a live server / a few queries we're bottlenecked by memory bandwidth, but that as batch_size increases this becomes less and less relevant because we're more and more compute bound.

This is a very important fact to keep in mind, especially when comparing throughputs like we're doing, since measuring throughput while increasing batch_size is essentially measuring how well the GPU can distribute the extra work across its cores (so what I'm trying to say is that it's not really a fair comparison once we start to become compute bound).

Since memory bandwidth is the limiting factor, it should be a little more present in the message we're trying to convey IMO (because that's how DS obtains better performance, not really by using TP).
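To make the memory-bandwidth point concrete, here is a back-of-envelope estimate (a sketch with assumed round numbers, not measurements from this PR): at small batch sizes every generated token requires streaming all of the weights through the GPUs once, so the model size divided by the aggregate memory bandwidth gives a rough floor on per-token latency.

```python
# Roofline-style estimate; all numbers are rounded assumptions for illustration.
params = 176e9                  # BLOOM parameter count
bytes_per_param = 2             # bf16/fp16 weights
weights_bytes = params * bytes_per_param            # ~352 GB

hbm_bandwidth_per_gpu = 2.0e12  # ~2 TB/s for an A100-80GB (approximate)
num_gpus = 8
aggregate_bandwidth = hbm_bandwidth_per_gpu * num_gpus

# At small batch sizes each new token requires reading every weight once,
# so memory bandwidth sets a floor on the single-forward latency.
min_seconds_per_token = weights_bytes / aggregate_bandwidth
print(f"~{min_seconds_per_token * 1000:.0f} ms/token floor at small batch sizes")
```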

bloom-inference-pytorch.md (review thread)

For doing inference on a server the two important metrics are the overall latency and the throughput of tokens. The loading time is less important since once loaded, the model becomes persistent.

The time to generate a token is straightforward - this is just how long it takes for a new token to be generated. As long as `use_cache=True` (the default argument to `generate`) all the previous logits are saved and therefore each token takes the same amount of time to generate. This of course comes at a memory cost - as the model is huge it can easily take GBs of memory to keep the cache. One can turn the caching off, and save memory, but it'd mean recalculating all the previous logits - which would make a long sequence generation very slow.
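For readers who want to see the `use_cache` trade-off in code, here is a minimal timing sketch. The small `bigscience/bloom-560m` checkpoint is only a stand-in so the snippet runs on a single GPU; the 176B model needs the multi-GPU setups discussed in this post.

```python
# Minimal sketch: timing generate() with and without the KV cache.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"  # small stand-in for illustration
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")

inputs = tok("DeepSpeed is a machine learning framework", return_tensors="pt").to("cuda")
new_tokens = 100

for use_cache in (True, False):
    torch.cuda.synchronize()
    t0 = time.time()
    model.generate(**inputs, max_new_tokens=new_tokens, use_cache=use_cache)
    torch.cuda.synchronize()
    elapsed = time.time() - t0
    print(f"use_cache={use_cache}: {elapsed / new_tokens * 1000:.1f} ms/token")
```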
@Narsil (Contributor) commented Jul 27, 2022

> which would make a long sequence generation very slow

Very surprisingly, disabling the past cache actually speeds up inference with accelerate + pipeline.
I don't have a real explanation, but it was ~350ms/token with the cache and ~320ms/token without (on small sequences + 20 tokens, no batching).

This is very surprising to me but really true. I can do the experiments again if you want.
I suspect the issue was even more tensor copies everywhere, and since this model is memory-bandwidth bound, it only adds up to the overall latency.

@stas00 (Contributor Author):

yes please - perhaps it was a very short output - otherwise it's super strange!

@Narsil (Contributor):

It is a very short input/output, still surprising to me.



The latency is more difficult to define. If the server is idle and the model is ready to generate, and one or more prompts are sent from the same user at once, the latency would be just the time to generate a single token multiplied by the number of new tokens to generate. However, if you have multiple users sending prompts at different times, things become much more complicated. The first user gets the fastest response and thus a low latency, but subsequent users, if they are unlucky enough to submit their request while a previous request is being processed, may have to wait until that earlier request has finished generating before their own request is attended to. It's possible to design a very efficient system that uses pipeline parallelism, where a new request can join the pipeline at any moment. But this of course will make the overall throughput lower. Then of course there is a small overhead of processing the input, sending data to the server and sending the results back to the user, but this is usually insignificant compared to the cost of generation on a huge model.
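A toy calculation of how the queue changes perceived latency (illustrative numbers only, not from the post):

```python
# Illustrative numbers, not measurements.
s_per_token = 0.1          # assumed time for one forward pass over the whole batch
tokens_requested = 100     # new tokens the user asked for

# Idle server: latency is just the generation time.
idle_latency = tokens_requested * s_per_token                             # 10 s

# Unlucky user: a previous request still has 50 tokens left to generate
# and the server only admits new requests between generate() calls.
tokens_still_running = 50
queued_latency = (tokens_still_running + tokens_requested) * s_per_token  # 15 s

print(idle_latency, queued_latency)
```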
@Narsil (Contributor):

> of course will make the overall throughput lower

I disagree with that statement. Why would it reduce throughput in any manner? There are some caveats for sure, but it's not obvious either way. Actually, if done properly you should get MORE throughput (since you're batching more, and batching is essentially free while in the low-batch-size regime).

@stas00 (Contributor Author):

It'd be slower since

  1. in the simple case you give all gpus to a single batch.
  2. with the pipeline you share gpus to process multiple batches.

So the time to process a single batch is now longer, and the throughput of a single batch is lower.

However, the back-to-back latency for a user entering the busy queue is lower.

Are we on the same page now?

That paragraph was there to specifically disambiguate the definition of latency in this context. So clearly it needs some work if it was unclear.

@Narsil (Contributor):

> It'd be slower since
>
>   1. in the simple case you give all gpus to a single batch.
>   2. with the pipeline you share gpus to process multiple batches.

I think I need to devise drawings, since in both cases you send the batch to all GPUs (it doesn't matter how you share the weights/computation across them). The only thing that matters is the overhead added by having another user on the same machine as you, and whether a user blocks the machine for a single forward pass or for the full generate loop.

@stas00 (Contributor Author) commented Aug 3, 2022

The logic is quite simple - the GPUs will process BS=1 and BS=64 at about the same speed. So if you can't fit the inputs into large batches, the processing time of the sequential batches that start after the first batch will lead to a lower overall throughput no matter how you cut it.

bloom-inference-pytorch.md (review thread)

In a situation where there are multiple GPUs with enough space to accommodate the whole model, it switches control from one GPU to the next until all layers have run. Only one GPU works at any given time, which sounds very inefficient, but it does produce excellent throughput despite the idling of the GPUs.

It is also very flexible since the same code can run on any given setup. Accelerate will use all available GPUs first, then offload to the CPU until the RAM is full, and finally to the disk. Offloading to CPU or disk will make things slower. As an example, users have reported running BLOOM with no code changes on just 2 A100s, with a throughput of 15s per token as compared to 10 msecs on 8x80 A100s.
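As a rough sketch of the Accelerate-based loading path described above (argument names may differ across transformers/accelerate versions, and the full 176B checkpoint needs several hundred GB of combined GPU/CPU/disk space):

```python
# Hedged sketch of loading BLOOM with Accelerate's big-model inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom"       # the 176B checkpoint, ~352 GB in bf16
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(
    name,
    device_map="auto",          # fill the GPUs first, then CPU RAM, then disk
    torch_dtype=torch.bfloat16,
    offload_folder="offload",   # only used if weights spill to disk
)

inputs = tok("Hello, my name is", return_tensors="pt").to(0)
output = model.generate(**inputs, max_new_tokens=20)
print(tok.decode(output[0]))
```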
Contributor:

> 15s per token as compared to 10 msecs on 8x80 A100s.

Do we know the batch size that was used for this?
Batch size will improve this metric linearly until you are compute bound.
So if the person got 15s/token at BS=1, then we need to compare against a throughput of 300ms/token on 16x A100 (so maybe 150ms/token on 8, but I am not sure).
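In other words (an illustrative restatement, not numbers from the PR): while the model is memory-bandwidth bound, the per-forward time barely moves with batch size, so comparisons should be normalized per sample.

```python
# Valid only while memory-bandwidth bound, i.e. before compute saturates.
def per_sample_s_per_token(forward_s: float, batch_size: int) -> float:
    """Seconds per token per sample when a single forward serves the whole batch."""
    return forward_s / batch_size

print(per_sample_s_per_token(15.0, 1))    # hypothetical 15 s/token at BS=1
print(per_sample_s_per_token(15.0, 64))   # ~0.23 s/token/sample at BS=64
```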

Contributor:

Yes it was with batch size 1.




## DeepSpeed-ZeRO
Contributor:

Given the results and the fact that it wasn't designed for inference, do you still think we should mention it?

@stas00 (Contributor Author):

Not sure, perhaps we can ask the deepspeed team and see if they want it or not

@stas00 (Contributor Author):

@jeffra, @tjruwase - what do you guys think? Should we not mention DS-ZeRO as an inference solution and only discuss DS-Inference? Thank you!

@tjruwase:

@stas00, I am fine with dropping DS-ZeRO because when there are enough GPUs to fit the model, DS-Inference is the DS recommendation. We find that DS-ZeRO is more suitable when there are not enough GPUs to fit the model and offloading to CPU or NVMe is necessary. Since the blog focuses on non-offload scenarios, I think including just one (the best) DS solution for such scenarios enhances clarity and avoids reader confusion. Just my two cents.

@stas00 (Contributor Author):

Thank you for the feedback, @tjruwase - I will remove it then.

In general the offloading should work, but with 176B it's just too slow to be practical.

Perhaps leave a mention of DS-ZeRO to say that its offloading features might be useful for smaller models? What do you think?

@tjruwase:

> I don't dare to even try doing offload with a single GPU, after my experiment with ZeRO/NVMe offload took 8h to process one batch. Surely, CPU offload will be better, but how many people who have only one GPU have 512GB of CPU RAM - so this doesn't quite make sense.

@stas00, sorry you had such a horror experience with NVMe offload. I agree that if BLOOM token generation with zero-inference took 8 hours per token then it wouldn't make sense. However, this is neither the expected behavior nor my experience, so zero-inference was still quite broken when you tried this. The expectation for NVMe offload is that since most devices are capable of 3GB/sec reads, streaming ~360GB BLOOM weights from a single NVMe into GPU to generate a token should take about 120 seconds, and not 8 hours. Using more NVMe devices should improve performance up to the PCIe bandwidth (e.g., ~16GB/sec) of a single GPU. These are the kind of numbers that I get in my experiments. We can sync offline if needed.
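Writing out the arithmetic behind those numbers (the bandwidth figures are the assumptions stated above):

```python
# Back-of-envelope check of the NVMe-offload numbers above.
weights_gb = 360            # BLOOM weights streamed from NVMe for every token
nvme_read_gb_s = 3          # a single NVMe device
pcie_gb_s = 16              # rough ceiling of one GPU's PCIe link

s_per_token_one_nvme = weights_gb / nvme_read_gb_s   # ~120 s/token
s_per_token_pcie_cap = weights_gb / pcie_gb_s        # ~22 s/token at best

print(s_per_token_one_nvme, s_per_token_pcie_cap)
print(f"{100 * s_per_token_one_nvme / 3600:.1f} h for 100 tokens at bs=1")
```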

@stas00 (Contributor Author) commented Aug 4, 2022

Oh really? My experiment was done just recently - I think 2 months ago - has something changed since then? I'd be happy to try again. I think it'd be great if anybody with a small GPU could run generation, even if it takes a longish time, but not hours.

It was 100 tokens, not one, with bs=1, on 1x A100 with a very fast, very recent Gen4 NVMe and PCIe-4. The GPU was idle most of the time and the NVMe IO was very heavy the whole time.

And yes we can discuss this for sure if you think we can speed it up significantly.

@tjruwase:

Yeah, I think the code stabilized about 2 weeks ago. Hmm, generating 100 tokens for a bs=1 prompt will require reading the model 100 times, which brings the e2e latency to 100*120secs = ~3.3 hours. Did I understand the scenario correctly? Perhaps this is the hours-long latency you are concerned about. Of course, we could amortize it by increasing the batch size to around 64, as the extra compute time is negligible compared to the PCIe transfer time. My experiments were focused on the time per token generation, and not on all the tokens.

@tjruwase:

@stas00, but I am interested in exploring a bit further, so please share your script with me. I can try on my side. Thanks!

@stas00 (Contributor Author):

It's the same script from the inference PR here:

https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/308/files#diff-b772a56ca7676205ef4bcf14ad5fd704246e81d7e35e6d41807775ebbb8b6bbf

The scenario is of a practical need - how long will it take a user who wants to do a single 100-token generation - and the desire is for the script to finish in a reasonable length of time. If we can do it in 3h with a single small GPU, I think that would be quite amazing, given the size of the model.

@stas00 (Contributor Author) commented Aug 2, 2022

> I think for the big picture we need an extra paragraph explaining ...

I was hoping you'd write a whole section on the server-style inference. I have even added you as an author ahead of time ;)

Perhaps this article could just discuss the server inference, summarizing your discoveries but not bothering with the code, and perhaps another article down the road could go into the details of the code if you'd be inspired to do so?

Because otherwise this article so far focuses on fast inference, and the server solution is a related but very different problem to solve, even though the token generation approach is the same w/ or w/o the server, IMHO.

@Narsil Narsil force-pushed the bloom-inference-pytorch branch 2 times, most recently from 4f90534 to 6151d4c Compare August 4, 2022 10:10
stas00 and others added 10 commits August 4, 2022 12:15
Added a small drawing in the first part that hopefully conveys a little
better what is happening.
@Narsil (Contributor) commented Aug 4, 2022

@stas00 I added a few things, don't hesitate to comment or modify directly.

I felt that the first part and the clarification about single-forward latency vs throughput (token/ms) was needed, otherwise readers would think we were actually getting 1 token in 0.7ms (which isn't correct afaik).

Latency = total run time / 5 (number of cycles) / 100 (tokens per cycle) should be roughly the time T shown in my diagram.
(Past vs initial inference are different, but I'm ignoring that for the sake of brevity, and you already mention it in the blog.)
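A quick numeric check of that formula (the total run time is a placeholder, not a measured value):

```python
# Placeholder numbers; only the formula is the point.
total_run_time_s = 350.0    # hypothetical end-to-end time of the benchmark
cycles = 5                  # generate() cycles in the benchmark
tokens_per_cycle = 100      # new tokens per cycle

T = total_run_time_s / cycles / tokens_per_cycle
print(f"single-forward latency T ≈ {T:.2f} s/token")   # 0.70 s, not 0.7 ms
```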

Usually I prefer to do PR over yours to not modify your work directly, but you seemed ok with the idea of pushing on top.

I checked the images on the blog on light/dark theme and everything should be readable.
I also did a Grammarly pass in a separate PR; hopefully I integrated the correct fixes/improvements.

@stas00 (Contributor Author) commented Aug 5, 2022

Hmm, editing the PR directly is fine, but you force-pushed, which means I now have no idea what you changed other than by manually diffing against the last version before your force-pushes. If you had edited the files and committed them directly, GitHub would have shown the diff.

Please avoid force-pushing in PRs that are a collaborative effort. Besides the diff issue, it also makes it very difficult to push local changes that were made before the force-push but weren't committed.

But what's done is done, I will just read it as a new text.

@stas00 (Contributor Author) commented Aug 5, 2022

I think the images you added weren't linked correctly - you can see in the preview https://github.com/stas00/blog/blob/bloom-inference-pytorch/bloom-inference-pytorch.md that they are 404.

@Narsil (Contributor) commented Aug 5, 2022

> but you force-pushed, which means I now have no idea what you changed other than by manually diffing against the last version before your force-pushes.

That was simply a rebase. But I understand. I never get why GitHub forces us to force-push to rebase, but there's no real way around it afaik. Here I shouldn't have, but I did it out of habit because of the merge conflict.

And the commit list is the same, so my changes are still readable as independent commits: 60d5fb8 (of course I could have pulled the rug out from under you, but I didn't, at least to the best of my knowledge).

@Narsil (Contributor) commented Aug 5, 2022

> I think the images you added weren't linked correctly - you can see in the preview https://github.com/stas00/blog/blob/bloom-inference-pytorch/bloom-inference-pytorch.md that they are 404.

Actually, the links are correct. I verified on a local build of the blog. It seems your original PR had the same URL and the same failure for your image: https://github.com/stas00/blog/blob/1a62600fa3ce36a9949b00908e4ed3a7681a3b68/bloom-inference-pytorch.md

The thing is that GitHub's markdown rendering is confused by the absolute path of the image (which isn't correct when viewing on GitHub, but is correct once on the hub).

@stas00 (Contributor Author) commented Aug 5, 2022

You don't need to force-push to rebase; here is a tool I wrote that I use everywhere to rebase:
https://github.com/stas00/git-tools/tree/master/git-rebase

And now anybody with a previous clone has to figure out how to solve the wrong branch issue:

```
$ git pull
hint: You have divergent branches and need to specify how to reconcile them.
hint: You can do so by running one of the following commands sometime before
hint: your next pull:
hint:
hint:   git config pull.rebase false  # merge (the default strategy)
hint:   git config pull.rebase true   # rebase
hint:   git config pull.ff only       # fast-forward only
hint:
hint: You can replace "git config" with "git config --global" to set a default
hint: preference for all repositories. You can also pass --rebase, --no-rebase,
hint: or --ff-only on the command line to override the configured default per
hint: invocation.
fatal: Need to specify how to reconcile divergent branches.
```

@stas00 (Contributor Author) commented Aug 5, 2022

Basically my local clone is completely broken right now after these force-pushes. I'm just going to make a new clone.

@stas00 (Contributor Author) commented Aug 5, 2022

I'm not sure how to proceed - you wiped out all the editorial work that had been done so far, deleting all the fixes :(

edit: not all - only my uncommitted changes got lost

I'm going to rewind the text to the last good version before your force-pushes, and then we will need to carefully re-add your changes.

Fixed the image paths as well, so it all gets rendered correctly here: https://github.com/stas00/blog/blob/bloom-inference-pytorch/bloom-inference-pytorch.md

Overall a great addition - and thanks for the visuals, which help a lot. Though I think one needs to put oneself into the shoes of a reader in some areas, as I wasn't able to follow everywhere.


![initial_drawing](assets/bloom-inference-pytorch/initial_drawing.png)

In this image, **T** is the simple forward latency. This is determined by how efficiently the **forward** pass is coded. But as you can see, using `8xA100` gives plenty of computing power, so we can increase the batch size, and that's (to a point) not going to change **T**. But you will be generating more tokens within **T**, so your throughput is effectively going up. As you increase your batch size, your memory consumption will likely go up too, so you might hit OOM errors **before** actually reaching your computational boundary.
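A hedged benchmarking sketch of that effect, using a small stand-in checkpoint and made-up batch sizes: the per-forward latency **T** stays roughly flat while per-sample throughput climbs, until memory runs out.

```python
# Sketch: sweep the batch size and watch T stay ~flat until OOM.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "bigscience/bloom-560m"   # stand-in so this runs on one GPU
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).to("cuda")
new_tokens = 20

for bs in (1, 8, 32, 128):
    try:
        batch = tok(["Hello world"] * bs, return_tensors="pt", padding=True).to("cuda")
        torch.cuda.synchronize()
        t0 = time.time()
        model.generate(**batch, max_new_tokens=new_tokens)
        torch.cuda.synchronize()
        t = (time.time() - t0) / new_tokens
        print(f"bs={bs}: T≈{t * 1000:.0f} ms/forward, {t / bs * 1000:.1f} ms/token/sample")
    except torch.cuda.OutOfMemoryError:
        print(f"bs={bs}: OOM before reaching the compute bound")
        break
```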
@stas00 (Contributor Author):

Your definition of T is very ambiguous - you first defined it here as all of the forwards together, and later in the images you used a single forward as T.

Moreover, not all forwards are created equal: some are faster, some are slower. You probably need to state which of them you're referring to here.


So now, what we're going to do is to have each request handle the `generate` part on its own, and simply send back the required tensors only for the `forward` pass. A single thread will be responsible for batching those simple requests.

Since each request is handling entirely its own way of choosing next ids, there is no problem in handling an infinite amount of different parameters.
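Since the real server is the Rust implementation mentioned at the top of this PR, here is only a rough Python sketch of the idea in this paragraph: each request drives its own generation loop and picks its own next token, while a single worker thread batches the pending `forward` calls. All names here are illustrative, and padding/attention-mask and KV-cache handling are ignored for brevity.

```python
# Rough illustrative sketch, not the real server: one worker batches forwards,
# each request keeps full control over its own sampling.
import queue
import threading
import torch

forward_queue = queue.Queue()   # holds (input_ids, result_slot) pairs

def batching_worker(model):
    """Single thread that batches whatever forward requests are currently pending."""
    while True:
        pending = [forward_queue.get()]
        while not forward_queue.empty():
            pending.append(forward_queue.get())
        # Right-pad to a rectangular batch (masks and the KV cache are omitted here).
        input_ids = torch.nn.utils.rnn.pad_sequence(
            [ids for ids, _ in pending], batch_first=True
        )
        with torch.no_grad():
            logits = model(input_ids).logits            # (batch, seq_len, vocab)
        for (ids, slot), row in zip(pending, logits):
            slot["logits"] = row[len(ids) - 1]          # logits at this request's last token
            slot["done"].set()

def generate_one(input_ids, max_new_tokens, pick_next):
    """Each request runs its own loop and chooses its next ids however it likes."""
    ids = input_ids
    for _ in range(max_new_tokens):
        slot = {"done": threading.Event()}
        forward_queue.put((ids, slot))
        slot["done"].wait()
        next_id = pick_next(slot["logits"])             # greedy, sampling, top-k, ...
        ids = torch.cat([ids, next_id.view(1)])
    return ids
```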
@stas00 (Contributor Author) commented Aug 5, 2022

what is "next ids"?

what is "different parameters"?

I think it's very difficult to follow, as the text uses a lot of names without defining which is which. I lost you several times while reading this text.
