
Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate #521

Merged
stas00 merged 9 commits into main from bloom-inference-scripts on Sep 15, 2022

Conversation

@stas00 (Contributor) commented on Sep 15, 2022

As the server solutions may take some time to complete, we agreed to split off the client-side solutions into their own post, so this is an extraction of #432 with a major update to the benchmarks and scripts.


As the model needs 352GB in bf16 (bfloat16) weights (`176*2`), the most efficient set-up is 8x80GB A100 GPUs. 2x8x40GB A100s or 2x8x48GB A6000s can also be used. The main reason for using these GPUs is that at the time of this writing they provide the largest GPU memory, but other GPUs can be used as well; for example, 24x32GB V100s.
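
For context, a minimal back-of-the-envelope sketch (my own, not from the post) of where the 352GB figure comes from; the `weight_memory_gb` helper is purely illustrative:

```python
# Rough estimate of the memory needed just for the model weights (excludes activations/KV cache).
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB; bfloat16 stores each parameter in 2 bytes."""
    return num_params * bytes_per_param / 1e9

# BLOOM has ~176 billion parameters, hence the `176*2` = 352GB figure above.
print(weight_memory_gb(176e9))  # ~352.0 GB, which fits on 8x80GB A100s (640GB total)
```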

Using a single node will deliver the fastest throughput since PCIe speed is typically much faster than the inter-node network.

Review comment:

Do you mean NVLink speed rather than PCIe speed?

@stas00 (Contributor, Author) replied:

That's if the user has NVLink.

But you're making a good point - I will mention both.

Also, perhaps mention NVSwitch, which makes inter-node connectivity as fast as intra-node? Perhaps this is out of scope.

Review comment:

Is intra-node PCIe faster than inter-node InfiniBand? Or is InfiniBand not considered for this?

@stas00 (Contributor, Author) replied on Sep 15, 2022:

The latest PCIe generations should be faster, but there are so many different set-ups that it's very difficult to compare:

https://en.wikipedia.org/wiki/InfiniBand#Performance
https://en.wikipedia.org/wiki/PCI_Express#History_and_revisions

And if one uses NVSwitch, shouldn't it make inter-node as fast as intra-node? I think it can connect up to 256 GPUs at the same speed as NVLink - but that's Hopper.

Given that we are having this discussion, I think it's best to remove any specific suggestions and just replace them with something very generic:

Using a single node will typically deliver the fastest throughput, since most of the time intra-node GPU interconnect hardware is faster than inter-node, but that's not always the case.

@philschmid (Member) left a comment:

Left some comments

Review threads on bloom-inference-pytorch-scripts.md (resolved)
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
@sgugger (Contributor) left a comment:

Thanks for moving forward on this!

@stas00 merged commit 1928066 into main on Sep 15, 2022
@stas00 deleted the bloom-inference-scripts branch on September 15, 2022, 17:54
- nlp

- local: bloom-inference-pytorch-scripts
@julien-c (Member) commented:

I'm late to the party @stas00, but I would have favored a shorter URL, like bloom-inference-pytorch or even just bloom-inference.

Let's keep it that way for now though! Just something to keep in mind.

cc @osanseviero

A Member replied:

@julien-c AFAIK, there is another blog post in the making focusing on API I/O; I guess that's why they went with *-pytorch-scripts.

@stas00 (Contributor, Author) replied on Sep 19, 2022:

  1. May I ask why super short URLs are important? It's not like anybody types them manually.
  2. Should we add a little guidelines doc to the repo where you can outline the best practices?

E.g. one of the constant conflicts is trying to keep sequential asset ids, which constantly collide when several new blog posts are being worked on concurrently. So I am trying to remove this conflict by dropping the id altogether, since each post's title is already unique.

A Member replied:

#507 is adding some guidelines, and we'll iterate on it cc @simoninithomas
