This could be done with monkey patching first and then later added upstream.
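A minimal sketch of the monkey-patching approach, assuming DeepSpeed still keeps the partitioning logic in PipelineModule._partition_layers; the _weighted_partition helper is just a placeholder for the weighting scheme described in the issue:

```python
from deepspeed.runtime.pipe.module import PipelineModule

# Keep a handle on the original so every existing partition_method keeps working.
_orig_partition_layers = PipelineModule._partition_layers

def _weighted_partition(module, method):
    # Placeholder for the new logic: parse 'weighted_type:...' into per-layer
    # weights and balance them across pipeline stages.
    raise NotImplementedError(f"weighted partitioning for {method!r} goes here")

def _patched_partition_layers(self, method='uniform'):
    # Intercept only the proposed weighted syntax; defer everything else to DeepSpeed.
    if isinstance(method, str) and method.lower().startswith('weighted_type:'):
        return _weighted_partition(self, method)
    return _orig_partition_layers(self, method=method)

PipelineModule._partition_layers = _patched_partition_layers
```

The patch would have to be installed before the pipeline model is constructed, since the partitioning runs inside PipelineModule.__init__.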
I'm just not sure we should start working on it until this issue is fixed: microsoft/DeepSpeed#1522.
As I commented in #166 (comment), we could use BNB to compensate for ZeRO-1, but BNB has issues of its own at the moment.
Meanwhile it has been proposed to use a 150k vocab instead of 250k. I'm going to see how it scales over the next few days, which will tell us whether this feature is required or not, so I will update this issue once I have more information.
We will need to hack https://github.com/microsoft/DeepSpeed/blob/master/deepspeed/runtime/pipe/module.py#L378-L384 to support a partition_method such as
type:embed:2|transformer:1
- or something like that. With this, the embed layers get 2x the partitioning weight, so each embedding gets its own stage and all stages end up more balanced. For context please see: #166 (comment)
It's actually not complicated at all - it's just a simple weighting scheme. Let's look at the partitioning weights produced by the code I quoted in the first paragraph:
With 4 layers and 4 GPUs:

type:transformer
[0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
gets partitioned as [0, 0, 0, 1], [1], [1], [1, 0, 0, 0, 0]

type:embed|transformer
[0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0]
gets partitioned as [0, 1, 0, 1], [1], [1], [1, 0, 0, 1, 0]
(or something similar - I haven't validated), but what we want is this: the initial weights should be
[0, 2, 0, 1, 1, 1, 1, 0, 0, 2, 0]
which should now get partitioned as [0, 2], [0, 1, 1], [1, 1], [0, 0, 2, 0]
(note: I'm not exactly sure where the 0's belong; it should be easy to see with a print debug or a debugger)
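To make the desired split concrete, here is a minimal sketch of a prefix-sum partitioner applied to the weighted example above (the function name partition_by_weights is made up; DeepSpeed's own partition_balanced in deepspeed/runtime/utils.py is what would actually be used and may not produce identical boundaries):

```python
def partition_by_weights(weights, num_parts):
    """Split weights into num_parts contiguous chunks of roughly equal total
    weight, by cutting where the cumulative weight reaches total * p / num_parts."""
    total = sum(weights)
    prefix, running = [], 0
    for w in weights:
        running += w
        prefix.append(running)

    bounds = [0]
    for p in range(1, num_parts):
        target = total * p / num_parts
        # first index whose cumulative weight reaches the target
        idx = next(i for i, s in enumerate(prefix) if s >= target)
        bounds.append(max(idx + 1, bounds[-1] + 1))  # keep boundaries increasing
    bounds.append(len(weights))

    return [weights[bounds[i]:bounds[i + 1]] for i in range(num_parts)]

weights = [0, 2, 0, 1, 1, 1, 1, 0, 0, 2, 0]
print(partition_by_weights(weights, 4))
# -> [[0, 2], [0, 1, 1], [1, 1], [0, 0, 2, 0]]
```

Each chunk carries a total weight of 2, which also answers where the 0-weight layers land: they simply get absorbed by whichever stage the nearest boundary falls into.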
For context: the 250k vocab for mt5 gives a huge embedding - it's 2x bigger than a single transformer layer (in the 104B setup) - which is why we want the partitioning done so that each embedding gets its own stage and every 2 transformer layers share another stage. This is for the case of 60 layers, 2 embeddings and 32 pipe stages: 2 embeddings at weight 2 plus 60 layers at weight 1 gives a total weight of 64, i.e. exactly 2 per stage.
And once we are happy with it, we can contribute this to DeepSpeed.
p.s. We need to think about the best syntax to use, probably:
weighted_type:embed:2|transformer:1
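A sketch of how such a spec string could be turned into per-layer weights. The layer names below are made up just to reproduce the 11-slot example from above, and re.search only approximates DeepSpeed's type: matching, which runs a regex against the layer class names:

```python
import re

def parse_weighted_spec(spec):
    """'weighted_type:embed:2|transformer:1' -> {'embed': 2, 'transformer': 1};
    a class with no explicit weight ('type:embed|transformer') defaults to 1."""
    body = spec.split(':', 1)[1]  # drop the 'weighted_type'/'type' prefix
    class_weights = {}
    for chunk in body.split('|'):
        parts = chunk.split(':')
        class_weights[parts[0]] = int(parts[1]) if len(parts) > 1 else 1
    return class_weights

def layer_weights(layer_names, class_weights):
    """Give each layer the weight of the first class pattern matching its name, else 0."""
    weights = []
    for name in layer_names:
        w = 0
        for pattern, pw in class_weights.items():
            if re.search(pattern, name, re.IGNORECASE):
                w = pw
                break
        weights.append(w)
    return weights

# Hypothetical layer names matching the 11-slot example in this issue:
layers = ['preprocess', 'word_embed', 'dropout',
          'transformer', 'transformer', 'transformer', 'transformer',
          'final_norm', 'dropout', 'word_embed', 'loss_fn']
spec = 'weighted_type:embed:2|transformer:1'
print(layer_weights(layers, parse_weighted_spec(spec)))
# -> [0, 2, 0, 1, 1, 1, 1, 0, 0, 2, 0]
```

Feeding these weights into the balanced partitioner sketched above then yields the desired [0, 2], [0, 1, 1], [1, 1], [0, 0, 2, 0] split.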