Commit 4f90534: Grammarly pass.
Narsil committed Aug 4, 2022 · 1 parent 726052e
Showing 1 changed file: bloom-inference-pytorch.md (21 additions, 21 deletions)
@@ -37,13 +37,13 @@ thumbnail: /blog/assets/86_bloom_megatron_deepspeed/thumbnail.png
</a>
</div>

This article discusses various PyTorch-based solutions for running the 176B parameter [BLOOM model](https://huggingface.co/bigscience/bloom) efficiently.

As the model needs 352GB in bf16 (bfloat16) weights (`176*2`), the most efficient set-up is 8x80GB A100s. The second best is 2x8x40GB A100s. The main reason for using A100s is that, at the time of this writing, they provide the largest GPU memory, but other GPUs can be used as well; it'd probably take 24x32GB V100s as another possibility.
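
To make the arithmetic explicit, here is a quick back-of-the-envelope check (the GPU counts below cover the weights only; real setups need extra headroom for activations, the cache, and communication buffers, hence the larger node counts above):

```python
import math

params = 176e9           # BLOOM has ~176B parameters
bytes_per_param = 2      # bf16 = 2 bytes per parameter
weights_gb = params * bytes_per_param / 1e9
print(f"weights: ~{weights_gb:.0f} GB")  # ~352 GB

for gpu_gb in (80, 40, 32):
    n = math.ceil(weights_gb / gpu_gb)
    print(f"{gpu_gb}GB GPUs needed for the weights alone: {n}")
```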

Using a single node is ideal since PCIe speed is typically much faster than the inter-node network.

If you don't have that much hardware, it's still possible to run BLOOM inference on smaller GPUs by using CPU or NVMe offload, but of course the generation time will be much slower.

For the sake of consistency, unless stated differently, the benchmarks in this article were all done on the same 8x80GB A100 node on [Jean Zay HPC](http://www.idris.fr/eng/jean-zay/index.html). HPC users there enjoy very fast IO of about 3GB/s read speed (GPFS). This is important for model loading time: a slow disk will result in slow loading, especially since we are concurrently doing IO in multiple processes.

@@ -55,13 +55,13 @@ For doing inference on a server the two important metrics are the overall latency

The time to generate a token is straightforward - this is just how long it takes for a new token to be generated. As long as `use_cache=True` (the default argument to `generate`), all the previous logits are saved and therefore each token takes the same amount of time to generate. This of course comes at a memory cost - as the model is huge, it can easily take GBs of memory to keep the cache. One can turn the caching off and save memory, but it'd mean recalculating all the previous logits - which would make long sequence generation very slow.
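
As a rough illustration of that trade-off, here is a small sketch using `gpt2` as a stand-in for BLOOM (timings are machine-dependent and purely illustrative):

```python
import time

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tokenizer("The quick brown fox", return_tensors="pt")

for use_cache in (True, False):
    start = time.time()
    with torch.no_grad():
        # With the cache on, each step reuses past computations; with it off,
        # every step reprocesses the whole sequence so far.
        model.generate(**inputs, max_new_tokens=128, use_cache=use_cache)
    print(f"use_cache={use_cache}: {time.time() - start:.2f}s for 128 new tokens")
```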

The latency is more difficult to define. If the server is idle and the model is ready to generate, and one or more prompts are sent by the same user at once, the latency is just the time to generate a single token multiplied by the number of new tokens to generate. However, if you have multiple users sending prompts at different times, things become much more complicated. The first user gets the fastest response and thus low latency, but subsequent users who are unlucky enough to submit their request while the previous one is being processed may first have to wait until the first request has finished generating before their own request is attended to. It's possible to design a very efficient system that uses pipeline parallelism, where a new request can join the pipeline at any moment, but this has caveats. Then of course there is a small overhead for processing the input, sending data to the server, and sending the results back to the user, but this is usually insignificant compared to the cost of generation on a huge model.

Let's try and show the metrics in a drawing:

<img class="avatar avatar-user" src="/blog/assets/bloom-inference-pytorch/initial_drawing.png">

In this image, **T** is the simple forward latency. This is determined by how efficiently the **forward** pass is coded. But as you can see, using `8xA100` gives plenty of compute headroom, so we can increase the batch size, and that is (up to a point) not going to change **T**. But you will be generating more tokens within **T**, so your throughput effectively goes up.
As you increase your batch_size, your memory consumption will likely go up too, so you might hit OOM errors **before** actually reaching your computational boundary.
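
A back-of-the-envelope way to see this (the numbers are hypothetical, not measurements):

```python
# Hypothetical forward latency T for one batched forward pass, assumed constant
# while the batch still fits in memory and compute.
T = 1.0  # seconds
for batch_size in (1, 8, 32, 128):
    throughput = batch_size / T  # tokens generated per second across the batch
    print(f"batch_size={batch_size:>3}: ~{throughput:.0f} tokens/s, per-token latency still ~{T:.1f}s")
```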

This article primarily discusses simple solutions in PyTorch. More efficient solutions are being worked on as well.
@@ -80,7 +80,7 @@ Accelerate handles big models for inference in the following way:

It then ensures the model runs properly, with hooks that transfer the inputs and outputs to the right device, and that the model weights offloaded to the CPU (or even the disk) are loaded onto a GPU just before the forward pass, before being offloaded again once the forward pass is finished.

In a situation where there are multiple GPUs with enough space to accommodate the whole model, it switches control from one GPU to the next until all layers have run. Only one GPU works at any given time, which sounds very inefficient, but it does produce decent throughput despite the idling of the GPUs.

It is also very flexible since the same code can run on any given setup. Accelerate will use all available GPUs first, then offload to the CPU until the RAM is full, and finally to the disk. Offloading to CPU or disk will make things slower. As an example, users have reported running BLOOM, with no code changes, on just 2 A100s with a throughput of 15s per token, as compared to 10 msecs on 8x80 A100s.
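
A minimal sketch of what this looks like in code (assuming `transformers` with `accelerate` installed; the offload settings are illustrative, not a recommendation):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom")
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom",
    device_map="auto",           # let Accelerate spread layers over GPUs, then CPU, then disk
    torch_dtype=torch.bfloat16,  # 2 bytes per parameter
    offload_folder="./offload",  # only used if GPUs + CPU RAM are not enough
)

inputs = tokenizer("Hello my name is", return_tensors="pt").to(0)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```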

@@ -115,7 +115,7 @@ The highest batch size we were able to run without OOM was 40 in this case.

## DeepSpeed-Inference

[DeepSpeed-Inference](https://www.deepspeed.ai/tutorials/inference-tutorial/) uses Tensor-Parallelism and efficient fused CUDA kernels to deliver a super-fast <1msec per token inference on a large batch size of 128.
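
A hedged sketch of how DeepSpeed-Inference is typically set up, based on the generic DeepSpeed inference tutorial rather than the exact BLOOM script (flags and dtypes may differ across versions; the script is launched with something like `deepspeed --num_gpus 8 script.py`):

```python
import os

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"
world_size = int(os.getenv("WORLD_SIZE", "8"))

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

# Shard the model across GPUs (tensor parallelism) and inject fused CUDA kernels.
model = deepspeed.init_inference(
    model,
    mp_size=world_size,               # number of GPUs to split tensors over
    dtype=torch.float16,              # the fused kernels typically run in fp16
    replace_with_kernel_inject=True,  # swap in DeepSpeed's optimized kernels
)
model = model.module  # the injected model; `generate` works as usual

inputs = tokenizer("DeepSpeed is", return_tensors="pt").to(torch.cuda.current_device())
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0]))
```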



@@ -144,17 +144,17 @@ Start to finish: 601.323 secs
The highest batch size we were able to run without OOM was 128 in this case.
You can see two factors at play leading to better performance here.

1/ T was reduced by using Tensor Parallelism (TP) instead of the Pipeline Parallelism (PP) of `accelerate`.
Because accelerate is meant to be very generic, it is unfortunately also hard to maximize GPU usage with it.
All computations are done first on GPU 0, then on GPU 1, etc., until GPU 8, which means
7 GPUs are idle all the time.
`DeepSpeed`, on the other hand, uses TP, meaning it will send tensors to all GPUs, each GPU computes its part of the computation,
then all GPUs communicate the results to each other before moving on to the next layer. That means all
GPUs are active at once but they need to communicate much more. Overall that decreases **T** by a theoretical
factor of `8x` (the theoretical limit would be `1.3s`).

2/ DeepSpeed also uses custom CUDA kernels to avoid allocating too much memory and doing tensor copies.
The effect of this is less memory allocation and fewer kernel launches, which reduces **T** but also
allows for bigger batch sizes (here **128**), which also participates in raising the overall throughput.
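
To make the TP idea concrete, here is a toy, single-process illustration (this is conceptual, not DeepSpeed's implementation): a Linear layer's weight is split column-wise across devices, each device computes its slice, and the slices are gathered back.

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 16)   # a batch of activations
w = torch.randn(16, 32)  # the full weight of one Linear layer
full = x @ w             # what a single device would compute

n_shards = 8  # stands in for 8 GPUs
# Each "GPU" holds 1/8 of the columns and computes its slice of the output.
partials = [x @ shard for shard in w.chunk(n_shards, dim=1)]
# In the real multi-GPU case this concatenation is a communication step (all-gather).
combined = torch.cat(partials, dim=1)

print(torch.allclose(full, combined))  # True: same result, ~1/8 of the work per device
```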


@@ -163,25 +163,25 @@

Another approach we took was revisiting the overall behavior of the webserver in generation mode.

In practice, when users send requests to a `text-generation` model, we don't run a single `forward` loop but multiple ones.
They also don't send the same parameters: some want `greedy`, some `sampling`, and some `beam_search`, for instance, and probably
never with exactly the same values.

In theory, if we're using `.generate` directly, we are **not** able to do any batching on different parameter sets. So what happens
is that we're less likely to use batching in production, and that, as mentioned above, is a *big* source of throughput improvement.

Let's take a simple example where there's only 1 query running, and another query comes in while the first is being processed.

<img class="avatar avatar-user" src="/blog/assets/bloom-inference-pytorch/overall_latency.png">

As you can see here, the first request gets a latency of `3 x T`, which is what we would expect. But **request 2** has to wait
for `4.5 x T`, which is longer. If there was a **request 3** with yet another parameter set, it would have to wait for `7.5 x T`.
And it piles up. The simple solution is to force a single parameter set (or a handful of them) so that we can ensure we're capping the overall latency.

Another approach is to split the generation loop entirely. We're going to ignore the `use_cache=True` complexity for the sake of readability.

So now, what we're going to do is have each request handle the `generate` part on its own and simply send back only the required tensors for
the `forward` pass. A single thread will be responsible for batching those simple requests.

Since each request entirely handles its own way of choosing the next ids, there is no problem in handling an infinite number of different parameter sets.
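
Here is a minimal Python sketch of that idea (the real server is in Rust, see below; `gpt2` stands in for BLOOM, and padding and `use_cache` are ignored for readability):

```python
import queue
import threading

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# Requests drop (input_ids, reply_queue) pairs here; one worker batches them.
work_queue = queue.Queue()

def batching_worker():
    while True:
        # Grab everything currently waiting and pad it into one batch.
        items = [work_queue.get()]
        while not work_queue.empty():
            items.append(work_queue.get())
        max_len = max(ids.shape[1] for ids, _ in items)
        pad_id = tokenizer.eos_token_id
        batch = torch.cat(
            [torch.nn.functional.pad(ids, (max_len - ids.shape[1], 0), value=pad_id)
             for ids, _ in items]
        )
        with torch.no_grad():
            logits = model(batch).logits[:, -1, :]  # one forward pass for everyone
        for row, (_, reply) in zip(logits, items):
            reply.put(row)

threading.Thread(target=batching_worker, daemon=True).start()

def generate(prompt, max_new_tokens=10, do_sample=False):
    # Each request owns its generation loop and sampling parameters;
    # only the forward pass goes through the shared batching worker.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    reply = queue.Queue()
    for _ in range(max_new_tokens):
        work_queue.put((input_ids, reply))
        next_logits = reply.get()
        if do_sample:
            next_id = torch.multinomial(torch.softmax(next_logits, dim=-1), 1)
        else:
            next_id = next_logits.argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)
    return tokenizer.decode(input_ids[0])

print(generate("Hello, my name is"))
```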

@@ -192,7 +192,7 @@ You're getting a compute model like this.
Here you can see that **request 2** has to wait for at most 1 forward pass to join the batch. So even if **request 1** is super long (as in many tokens need to be generated),
**request 2** can start executing almost immediately, and might even finish before **request 1** is done - we actually see that happening regularly in practice.

As with any sort of optimization, the trick is to use all sources of optimization at once. For instance, here, since every request is doing its own `.generate`,
the tensors have to be copied a bunch of times (including the past key values), and any additional overhead adds up to increase **T**. You need to take great care that doing this generation split is not offsetting all the other benefits.


@@ -201,7 +201,7 @@ It's in Rust, mostly because threading/async/multiprocessing have a lot of caveats
Rust doesn't provide any performance benefits here (it uses the same PyTorch under the hood).
This code is not meant to be portable: it should run on `16xA100(40)`. It also uses PP (like `accelerate`)
and uses a technique to interleave GPU usage so that the throughput should be more comparable to DeepSpeed,
even if the latency (**T**) is the same as `accelerate`.


There are other issues you need to take into account, like padding, which might also hurt your performance, but this is
@@ -220,7 +220,7 @@ TORCH_CUDA_VERSION=cu116 cargo run --release

### Run

**This experiment does not have a 1-1 benchmark analysis since it's running a full-fledged webserver and was never run on the same machines as before.**

```
*** Performance stats (NUMBERS ARE ESTIMATES FROM QUERIES):
@@ -232,5 +232,5 @@ Start to finish: N/A
```

The faster load speed is due to `safetensors` (not a production-ready library yet), which uses `mmap`, plus there is 1 loading thread per GPU.
Overall the latency should be the same as `accelerate`, the throughput is ~10X better, and **most importantly, it can support any number of different parameter sets**
without any latency cost.
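
For reference, loading a shard with `safetensors` looks roughly like this (the file name is hypothetical and the API may evolve, since the library is not production-ready yet):

```python
from safetensors.torch import load_file

# Each loading thread would run something like this for its own shard and GPU;
# the file is mmap-ed, so tensors can be materialized directly on the device.
state_dict = load_file("bloom-shard-00001.safetensors", device="cuda:0")
print(sum(t.numel() for t in state_dict.values()), "parameters loaded on cuda:0")
```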
