Motivation
We have three main components related to process group initialization:
Global parallel context
Device mesh
Process group manager
The global parallel context is compatible with all common kinds of parallelism, but it has the following drawbacks:
It's global, which makes it inflexible
It's deeply coupled with the parallel methods, which makes it hard to extend
Some names are confusing, e.g. local_rank
The device mesh describes how a tensor is stored. It works well for tensor parallelism, but not for other kinds of parallelism.
The process group manager is just a dict of process groups, which is too simple to handle complex ND-parallelism scenarios.
In conclusion, we need a component which is:
Totally decoupled from the parallel methods
Not global
Able to handle complex ND-parallelism easily
Process group mesh
The process group mesh describes how process groups are organized. It is not coupled with any parallel method, yet it makes it easy to initialize process groups in ND-parallelism scenarios.
It's a helper/utility class. It only initializes process groups and caches them. The parallel methods themselves manage the groups.
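The "initialize and cache" behaviour can be illustrated with a minimal sketch. Everything named here is hypothetical: `GroupCache`, its `get` method, and the choice of a sorted rank tuple as the cache key are assumptions for illustration, not the proposed API. The `create_group` callable stands in for whatever actually builds a group (in a real job this could be something like `torch.distributed.new_group`); a plain function is used so the sketch runs without a distributed backend.

```python
from typing import Callable, Dict, Iterable, List, Tuple


class GroupCache:
    """Hypothetical helper: creates a process group once per set of ranks, then reuses it."""

    def __init__(self, create_group: Callable[[List[int]], object]):
        # create_group is an assumed factory; it receives the member ranks.
        self._create_group = create_group
        self._groups: Dict[Tuple[int, ...], object] = {}

    def get(self, ranks: Iterable[int]) -> object:
        # Assumption: a group is identified by its sorted member ranks.
        key = tuple(sorted(ranks))
        if key not in self._groups:
            self._groups[key] = self._create_group(list(key))
        return self._groups[key]


# Stand-in factory that records how often a group is actually created.
created = []

def fake_new_group(ranks: List[int]) -> str:
    created.append(ranks)
    return f"group{ranks}"

cache = GroupCache(fake_new_group)
g1 = cache.get([0, 1])
g2 = cache.get([1, 0])  # same members, served from the cache
```

The cache only owns creation and lookup; deciding which groups to request, and what to do with them, stays with the parallel method.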
We can use an ND-tuple to describe a process group mesh. E.g. ProcessGroupMesh(2, 2, 2) means a 3D cube process group mesh. We can further use an ND-coordinate to describe each process. E.g. (0, 1, 0) means the process whose rank is 2 in the above process group mesh. In the classic 3D-parallelism scenario, each parallel method takes one axis. E.g. data parallelism takes axis-0, pipeline parallelism takes axis-1 and tensor parallelism takes axis-2. The process group mesh will provide a method to create groups along an axis, which makes 3D-parallelism easy to handle.
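The coordinate-to-rank mapping and the "groups along an axis" idea can be sketched in plain Python. This is an illustration under stated assumptions, not the proposed implementation: the function names `coord_to_rank`, `rank_to_coord` and `groups_along_axis` are hypothetical, and row-major ordering is assumed (the example above, where (0, 1, 0) is rank 2, is consistent with it but does not by itself fix the ordering).

```python
from itertools import product
from typing import List, Sequence, Tuple


def coord_to_rank(coord: Sequence[int], shape: Sequence[int]) -> int:
    """Flatten an ND coordinate into a rank, assuming row-major order."""
    rank = 0
    for c, s in zip(coord, shape):
        if not 0 <= c < s:
            raise ValueError(f"coordinate {tuple(coord)} is outside mesh {tuple(shape)}")
        rank = rank * s + c
    return rank


def rank_to_coord(rank: int, shape: Sequence[int]) -> Tuple[int, ...]:
    """Inverse of coord_to_rank under the same row-major assumption."""
    coord = []
    for s in reversed(shape):
        coord.append(rank % s)
        rank //= s
    return tuple(reversed(coord))


def groups_along_axis(shape: Sequence[int], axis: int) -> List[List[int]]:
    """List the rank groups formed by varying `axis` while fixing every other axis."""
    # Pin the chosen axis to 0 and enumerate all combinations of the other axes.
    other = [range(s) if i != axis else [0] for i, s in enumerate(shape)]
    groups = []
    for base in product(*other):
        group = []
        for k in range(shape[axis]):
            coord = list(base)
            coord[axis] = k
            group.append(coord_to_rank(coord, shape))
        groups.append(group)
    return groups


shape = (2, 2, 2)
print(coord_to_rank((0, 1, 0), shape))  # 2
print(groups_along_axis(shape, 0))      # [[0, 4], [1, 5], [2, 6], [3, 7]]
print(groups_along_axis(shape, 2))      # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

With the axis assignment described above, the axis-0 lists would be the data parallel groups and the axis-2 lists the tensor parallel groups. A real implementation would pass each rank list to the backend's group constructor instead of printing it.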