
Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate #521

Merged
stas00 merged 9 commits into main from bloom-inference-scripts on Sep 15, 2022

Conversation

@stas00 (Contributor) commented on Sep 15, 2022

As the server solutions may take some time to complete, we agreed to split off the client-side solutions into their own post, so this is an extraction of #432 with a major update to the benchmarks and scripts.


As the model needs 352GB in bf16 (bfloat16) weights (`176*2`), the most efficient set-up is 8x80GB A100 GPUs. 2x8x40GB A100s or 2x8x48GB A6000s can also be used. The main reason for using these GPUs is that at the time of this writing they provide the largest GPU memory, but other GPUs can be used as well; for example, 24x32GB V100s.
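
For context, a minimal back-of-the-envelope sketch (my own, not from the post) of where the 352GB figure comes from; the `weight_memory_gb` helper is purely illustrative:

```python
# Rough estimate of the memory needed just for the model weights (excludes activations/KV cache).
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB; bfloat16 stores each parameter in 2 bytes."""
    return num_params * bytes_per_param / 1e9

# BLOOM has ~176 billion parameters, hence the `176*2` = 352GB figure above.
print(weight_memory_gb(176e9))  # ~352.0 GB, which fits on 8x80GB A100s (640GB total)
```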

Using a single node will deliver the fastest throughput since PCIe speed is typically much faster than the inter-node network.

Review comment:

Do you mean NVLink speed rather than PCIe speed?

@stas00 (Contributor, Author) replied:

That's if the user has NVLink.

But you're making a good point - I will mention both.

Also, perhaps mention NVSwitch, which makes inter-node connectivity as fast as intra-node? Perhaps this is out of scope.

Review comment:

Is intra-node PCIe faster than inter-node InfiniBand? Or is InfiniBand not considered for this?

@stas00 (Contributor, Author) replied on Sep 15, 2022:

The latest PCIe generations should be faster, but there are so many different set-ups that it's very difficult to compare:

https://en.wikipedia.org/wiki/InfiniBand#Performance
https://en.wikipedia.org/wiki/PCI_Express#History_and_revisions

And if one uses NVSwitch, shouldn't it make inter-node as fast as intra-node? I think it can connect up to 256 GPUs at the same speed as NVLink - but that's Hopper.

Given that we are having this discussion, I think it's best to remove any specific suggestions and just replace them with something very generic:

Using a single node will typically deliver the fastest throughput, since most of the time intra-node GPU interconnect hardware is faster than inter-node, but that's not always the case.

@philschmid (Member) left a comment:

Left some comments

Review threads on bloom-inference-pytorch-scripts.md (resolved)
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
@sgugger (Contributor) left a comment:

Thanks for moving forward on this!

@stas00 merged commit 1928066 into main on Sep 15, 2022
@stas00 deleted the bloom-inference-scripts branch on September 15, 2022, 17:54
- nlp

- local: bloom-inference-pytorch-scripts
@julien-c (Member) commented:

I'm late to the party @stas00, but I would have favored a shorter URL, like bloom-inference-pytorch or even just bloom-inference.

Let's keep it that way for now though! Just something to keep in mind.

cc @osanseviero

A Member replied:

@julien-c AFAIK, there is another blog post in the making focusing on API I/O; I guess that's why they went with *-pytorch-scripts.

@stas00 (Contributor, Author) replied on Sep 19, 2022:

  1. May I ask why super short URLs are important? It's not like anybody types them manually.
  2. Should we add a little guidelines doc to the repo where you can outline the best practices?

E.g. one of the constant conflicts is trying to keep sequential asset ids, which constantly collide when several new blog posts are being worked on concurrently. So I am trying to remove this conflict by dropping the id altogether, since each post's title is already unique.

A Member replied:

#507 is adding some guidelines, and we'll iterate on it cc @simoninithomas
