Incredibly Fast BLOOM Inference with DeepSpeed and Accelerate #521
Conversation
bloom-inference-pytorch-scripts.md
Outdated
As the model needs 352GB in bf16 (bfloat16) weights (`176*2`), the most efficient set-up is 8x80GB A100 GPUs. Also 2x8x40GB A100s or 2x8x48GB A6000 can be used. The main reason for using these GPUs is that at the time of this writing they provide the largest GPU memory, but other GPUs can be used as well. For example, 24x32GB V100s can be used.
Using a single node will deliver the fastest throughput since PCIe speed is typically much faster than inter-node network.
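The `176*2` arithmetic in the quoted hunk is just parameter count times bytes per parameter. A minimal sketch of that calculation (the 176B parameter count comes from the quoted text; the dtype sizes are the standard ones, and the helper name is illustrative, not from the post):

```python
# Rough memory footprint of model WEIGHTS ALONE (no activations, no
# optimizer state, no KV cache). Matches the quoted text: 176B params
# in bf16 at 2 bytes each -> 352 GB.
def weights_gb(n_params_billion: float, bytes_per_param: int) -> float:
    # billions of params * bytes per param == GB of weights
    return n_params_billion * bytes_per_param

print(weights_gb(176, 2))  # bf16/fp16: 352.0 GB -> fits on 8x80GB A100s
print(weights_gb(176, 4))  # fp32:      704.0 GB
print(weights_gb(176, 1))  # int8:      176.0 GB
```

This is why the quoted text lists set-ups whose aggregate GPU memory is comfortably above 352GB (8x80GB = 640GB, 2x8x40GB = 640GB, 24x32GB = 768GB) — the headroom beyond the weights is needed for activations and buffers.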
Do you mean NVLink speed rather than PCIe speed?
that's if the user has NVLink.
but you're making a good point - I will mention both.
Also perhaps mention NVSwitch which makes the internode connectivity as fast as intranode? perhaps this is out of scope.
Is intranode using PCIe faster than inter-node using infiniband? Or is infiniband not considered for this?
The latest PCIe gens should be faster, but then there are so many different set-ups that it's super difficult to compare:
https://en.wikipedia.org/wiki/InfiniBand#Performance
https://en.wikipedia.org/wiki/PCI_Express#History_and_revisions
and if one uses NVSwitch, shouldn't it make inter-node as fast as intra-node? I think it can connect up to 256 GPUs at the same speed as NVLink - but that's Hopper.
Given that we are having this discussion, I think it's best to remove any specific suggestions and just replace it with something very generic:
Using a single node will typically deliver the fastest throughput, since intra-node GPU linking hardware is usually faster than inter-node links, but it's not always the case.
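To make the "it depends on the set-up" point concrete, here is a hedged back-of-the-envelope sketch: it estimates how long moving the 352GB of weights once would take over a few interconnects. The GB/s figures are APPROXIMATE ballpark values of the kind listed in the linked Wikipedia tables, not measurements, and real inference traffic is activations rather than full weights:

```python
# Back-of-the-envelope transfer-time comparison. All bandwidths are
# ASSUMED, illustrative per-direction figures -- real throughput varies
# with topology, generation, and link count.
WEIGHTS_GB = 352  # bf16 BLOOM weights, from the post

interconnects_gb_per_s = {
    "PCIe 4.0 x16 (~32 GB/s/dir)":        32,
    "NVLink (A100, ~600 GB/s aggregate)": 600,
    "InfiniBand HDR (200 Gb/s ~= 25 GB/s)": 25,
}

for name, bw in interconnects_gb_per_s.items():
    print(f"{name}: ~{WEIGHTS_GB / bw:.1f} s to move all weights once")
```

The point is only the order-of-magnitude gap between intra-node NVLink and typical inter-node fabrics; as the thread notes, a fast InfiniBand set-up or NVSwitch-class fabric can narrow or erase it.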
Left some comments
Co-authored-by: Philipp Schmid <32632186+philschmid@users.noreply.github.com>
Thanks for moving forward on this!
- nlp
- local: bloom-inference-pytorch-scripts
i'm late to the party @stas00, but i would have favored a shorter URL, like bloom-inference-pytorch
or even just bloom-inference
Let's keep that way now though! Just something to keep in mind
cc @osanseviero
@julien-c AFAIK, there is another blog post in the making focusing on API I/O, I guess that's why they went with *-pytorch-script
- may I ask why super short urls are important? it's not like anybody types them manually.
- should we add a little guidelines doc to the repo where you can outline the best practices?
e.g. one constant source of conflicts is trying to keep sequential asset ids, which collide when several new blog posts are worked on concurrently. So I am trying to remove this conflict by dropping the id altogether, since each post's title is already unique.
#507 is adding some guidelines, and we'll iterate on it cc @simoninithomas
As the server solutions may take some time to complete, we agreed to split the client-side solutions into their own post, so this is an extraction of #432 with a major update to the benchmarks and scripts.