
Commit 929af79

Author: Baizhou Zhang
Message: update tp doc
Parent: 3536460

10 files changed, +100 -711 lines

docs/source/en/features/1D_tensor_parallel.md

+3 -77

@@ -2,14 +2,12 @@
 
 Author: Zhengda Bian, Yongbin Li
 
-> ⚠️ The information on this page is outdated and will be deprecated. Please check [Shardformer](./shardformer.md) for more information.
-
 **Prerequisite**
 - [Define Your Configuration](../basics/define_your_config.md)
 - [Configure Parallelization](../basics/configure_parallelization.md)
 
 **Example Code**
-- [ColossalAI-Examples 1D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/blob/main/features/tensor_parallel/README.md)
+- [Tensor Parallelism with Shardformer](https://github.com/hpcaitech/ColossalAI/tree/main/colossalai/shardformer/examples)
 
 **Related Paper**
 - [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https://deepakn94.github.io/assets/papers/megatron-sc21.pdf)
@@ -44,79 +42,7 @@ Given $P$ processors, we present the theoretical computation and memory cost, as
 
 ## Usage
 
-To enable 1D tensor parallelism for our model, e.g. on 2 GPUs, we need to configure the parallelism setting as below.
-```python
-CONFIG = dict(parallel=dict(
-    data=1,
-    pipeline=1,
-    tensor=dict(size=2, mode='1d'),
-))
-```
-Then Colossal-AI will automatically apply 1D parallelism to all the layers from `colossalai.nn`.
-
-Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
-```python
-import colossalai
-import colossalai.nn as col_nn
-import torch
-from colossalai.utils import print_rank_0
-
-class MLP(torch.nn.Module):
-    def __init__(self, dim: int = 256):
-        super().__init__()
-        intermediate_dim = dim * 4
-        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
-        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.transpose(0, 1).shape}')
-        self.activation = torch.nn.GELU()
-        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
-        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.transpose(0, 1).shape}')
-        self.dropout = col_nn.Dropout(0.1)
-
-    def forward(self, x):
-        x = self.dense_1(x)
-        print_rank_0(f'Output of the first linear layer: {x.shape}')
-        x = self.activation(x)
-        x = self.dense_2(x)
-        print_rank_0(f'Output of the second linear layer: {x.shape}')
-        x = self.dropout(x)
-        return x
-```
-
-Launch Colossal-AI on 2 GPUs and build the model.
-
-```python
-parser = colossalai.get_default_parser()
-colossalai.launch(config=CONFIG,
-                  rank=args.rank,
-                  world_size=args.world_size,
-                  local_rank=args.local_rank,
-                  host=args.host,
-                  port=args.port)
-
-m = MLP()
-```
-We will see the shapes of partitioned parameters(e.g. weights) in the MLP model.
-```shell
-Weight of the first linear layer: torch.Size([256, 512])
-Weight of the second linear layer: torch.Size([512, 256])
-```
-The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the column-parallel partitioning, it becomes `[256, 512]`.
-Similarly, the second row-parallel layer partitions the weight `[1024, 256]` into `[512, 256]`.
-
-We can run the model with some random inputs.
-```python
-from colossalai.utils import get_current_device
-
-x = torch.randn((16, 256), device=get_current_device())
-torch.distributed.broadcast(x, src=0) # synchronize input
-
-x = m(x)
-```
-Then we can see the shapes of activation results.
-```shell
-Output of the first linear layer: torch.Size([16, 512])
-Output of the second linear layer: torch.Size([16, 256])
-```
-The output of the first linear layer is split into 2 partitions (each has the shape `[16, 512]`), while the second layer has identical outputs across the GPUs.
+1D tensor parallelism is implemented by `Shardformer` feature in the newest version of ColossalAI.
+For more details about ideas and usages of `Shardformer`, please refer to [Shardformer Doc](./shardformer.md).
 
 <!-- doc-test-command: echo -->
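Editor's note: to make the new pointer concrete, here is a minimal, hedged sketch (not part of this commit) of how 1D tensor parallelism can be enabled through the Shardformer-backed `HybridParallelPlugin` and `Booster` API of recent ColossalAI releases. The argument names (`tp_size`, `pp_size`) and the BERT example model are assumptions for illustration; the Shardformer documentation linked above is the authoritative reference.

```python
# Illustrative sketch only (assumed API of recent ColossalAI releases, not part of this commit).
# Launch with e.g.: torchrun --nproc_per_node 2 train.py
import torch
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import HybridParallelPlugin
from transformers import BertForSequenceClassification

colossalai.launch_from_torch(config={})

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)

def criterion(outputs, inputs):
    # HuggingFace models return the loss on the outputs object
    return outputs.loss

# tp_size=2 shards each layer's attention/MLP weights across 2 GPUs via Shardformer
# (Megatron-style 1D tensor parallelism); pp_size=1 keeps pipeline parallelism off.
plugin = HybridParallelPlugin(tp_size=2, pp_size=1)
booster = Booster(plugin=plugin)

# boost() returns the sharded model plus wrapped optimizer/criterion ready for training.
model, optimizer, criterion, _, _ = booster.boost(model, optimizer, criterion)
```

Compared with the removed `CONFIG = dict(parallel=...)` style above, the tensor-parallel degree now lives in the plugin object rather than in a global configuration dict.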

docs/source/en/features/2D_tensor_parallel.md

+6 -80

@@ -60,83 +60,9 @@ Given $P=q\times q$ processors, we present the theoretical computation and memor
 
 ## Usage
 
-To enable 2D tensor parallelism for our model, e.g. on 4 GPUs, we need to configure the parallelism setting as below.
-```python
-CONFIG = dict(parallel=dict(
-    data=1,
-    pipeline=1,
-    tensor=dict(size=4, mode='2d'),
-))
-```
-Then Colossal-AI will automatically apply 2D parallelism to all the layers from `colossalai.nn`.
-
-Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
-```python
-import colossalai
-import colossalai.nn as col_nn
-import torch
-from colossalai.utils import print_rank_0
-
-class MLP(torch.nn.Module):
-    def __init__(self, dim: int = 256):
-        super().__init__()
-        intermediate_dim = dim * 4
-        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
-        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
-        self.activation = torch.nn.GELU()
-        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
-        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
-        self.dropout = col_nn.Dropout(0.1)
-
-    def forward(self, x):
-        x = self.dense_1(x)
-        print_rank_0(f'Output of the first linear layer: {x.shape}')
-        x = self.activation(x)
-        x = self.dense_2(x)
-        print_rank_0(f'Output of the second linear layer: {x.shape}')
-        x = self.dropout(x)
-        return x
-```
-Launch Colossal-AI on 4 GPUs and build the model
-```python
-parser = colossalai.get_default_parser()
-colossalai.launch(config=CONFIG,
-                  rank=args.rank,
-                  world_size=args.world_size,
-                  local_rank=args.local_rank,
-                  host=args.host,
-                  port=args.port)
-
-m = MLP()
-```
-We will see the shapes of partitioned parameters(e.g. weights) in the MLP model.
-```shell
-Weight of the first linear layer: torch.Size([128, 512])
-Weight of the second linear layer: torch.Size([512, 128])
-```
-The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 2D parallelism, it becomes `[128, 512]` on each GPU.
-Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 128]`.
-
-We can run the model with some random inputs.
-```python
-from colossalai.context import ParallelMode
-from colossalai.core import global_context as gpc
-from colossalai.utils import get_current_device
-
-x = torch.randn((16, 256), device=get_current_device())
-# partition input
-torch.distributed.broadcast(x, src=0)
-x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2D_COL)]
-x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_2D_ROW)]
-print_rank_0(f'Input: {x.shape}')
-
-x = m(x)
-```
-Then we can see the shapes of activation results.
-```shell
-Input: torch.Size([8, 128])
-Output of the first linear layer: torch.Size([8, 512])
-Output of the second linear layer: torch.Size([8, 128])
-```
-The activation tensors in 2D parallelism are all split in both row and column.
-E.g. the output of the first linear layer has the shape `[8, 512]`, while the second layer has the output of `[8, 128]`.
+Currently the newest version of ColossalAI doesn't support 2D tensor parallelism, but this feature will be integrated into `Shardformer` in future releases.
+For more details about ideas and usages of `Shardformer`, please refer to [Shardformer Doc](./shardformer.md).
+
+For users of older version of ColossalAI, please refer to [ColossalAI-Examples - 2D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/blob/main/features/tensor_parallel/README.md).
+
+<!-- doc-test-command: echo -->

docs/source/en/features/2p5D_tensor_parallel.md

+6 -83

@@ -58,86 +58,9 @@ Given $P=q \times q \times d$ processors, we present the theoretical computation
 
 ## Usage
 
-To enable 2.5D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
-```python
-CONFIG = dict(parallel=dict(
-    data=1,
-    pipeline=1,
-    tensor=dict(size=8, mode='2.5d', depth=2),
-))
-
-```
-Then Colossal-AI will automatically apply 2.5D parallelism to all the layers from `colossalai.nn`.
-
-Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
-```python
-import colossalai
-import colossalai.nn as col_nn
-import torch
-from colossalai.utils import print_rank_0
-
-class MLP(torch.nn.Module):
-    def __init__(self, dim: int = 256):
-        super().__init__()
-        intermediate_dim = dim * 4
-        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
-        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
-        self.activation = torch.nn.GELU()
-        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
-        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
-        self.dropout = col_nn.Dropout(0.1)
-
-    def forward(self, x):
-        x = self.dense_1(x)
-        print_rank_0(f'Output of the first linear layer: {x.shape}')
-        x = self.activation(x)
-        x = self.dense_2(x)
-        print_rank_0(f'Output of the second linear layer: {x.shape}')
-        x = self.dropout(x)
-        return x
-```
-Launch Colossal-AI on 8 GPUs and build the model
-```python
-parser = colossalai.get_default_parser()
-colossalai.launch(config=CONFIG,
-                  rank=args.rank,
-                  world_size=args.world_size,
-                  local_rank=args.local_rank,
-                  host=args.host,
-                  port=args.port)
-
-m = MLP()
-```
-We will see the shapes of partitioned parameters(e.g. weights) in the MLP model.
-```shell
-Weight of the first linear layer: torch.Size([128, 512])
-Weight of the second linear layer: torch.Size([512, 128])
-```
-The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 2.5D parallelism, it becomes `[128, 512]` on each GPU.
-Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 128]`.
-
-We can run the model with some random inputs.
-```python
-from colossalai.context import ParallelMode
-from colossalai.core import global_context as gpc
-from colossalai.utils import get_current_device
-
-x = torch.randn((16, 256), device=get_current_device())
-# partition input
-torch.distributed.broadcast(x, src=0)
-x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_DEP)]
-x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_COL)]
-x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_2P5D_ROW)]
-print_rank_0(f'Input: {x.shape}')
-
-x = m(x)
-```
-Then we can see the shapes of activation results.
-```shell
-Input: torch.Size([4, 128])
-Output of the first linear layer: torch.Size([4, 512])
-Output of the second linear layer: torch.Size([4, 128])
-```
-The activation tensors in 2.5D parallelism are all split by $d \times q$ in the row and $q$ in the column.
-E.g. the output of the first linear layer has the shape `[4, 512]`), while the second layer has the output of `[4, 128]`.
-Note, 2.5D parallelism use the same partition method as 2D parallelism for weights, where the difference is the partition of input.
+Currently the newest version of ColossalAI doesn't support 2.5D tensor parallelism, but this feature will be integrated into `Shardformer` in future releases.
+For more details about ideas and usages of `Shardformer`, please refer to [Shardformer Doc](./shardformer.md).
+
+For users of older version of ColossalAI, please refer to [ColossalAI-Examples - 2.5D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/blob/main/features/tensor_parallel/README.md).
+
+<!-- doc-test-command: echo -->

docs/source/en/features/3D_tensor_parallel.md

+6 -82

@@ -67,85 +67,9 @@ Given $P=q \times q \times q$ processors, we present the theoretical computation
 
 ## Usage
 
-To enable 3D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
-```python
-CONFIG = dict(parallel=dict(
-    data=1,
-    pipeline=1,
-    tensor=dict(size=8, mode='3d'),
-))
-```
-Then Colossal-AI will automatically apply 3D parallelism to all the layers from `colossalai.nn`.
-
-Let's define a model that consists of a two-layer multi-layer perceptron (MLP) as below.
-```python
-import colossalai
-import colossalai.nn as col_nn
-import torch
-from colossalai.utils import print_rank_0
-
-class MLP(torch.nn.Module):
-    def __init__(self, dim: int = 256):
-        super().__init__()
-        intermediate_dim = dim * 4
-        self.dense_1 = col_nn.Linear(dim, intermediate_dim)
-        print_rank_0(f'Weight of the first linear layer: {self.dense_1.weight.shape}')
-        self.activation = torch.nn.GELU()
-        self.dense_2 = col_nn.Linear(intermediate_dim, dim)
-        print_rank_0(f'Weight of the second linear layer: {self.dense_2.weight.shape}')
-        self.dropout = col_nn.Dropout(0.1)
-
-    def forward(self, x):
-        x = self.dense_1(x)
-        print_rank_0(f'Output of the first linear layer: {x.shape}')
-        x = self.activation(x)
-        x = self.dense_2(x)
-        print_rank_0(f'Output of the second linear layer: {x.shape}')
-        x = self.dropout(x)
-        return x
-```
-Launch Colossal-AI on 8 GPUs and build the model
-```python
-parser = colossalai.get_default_parser()
-colossalai.launch(config=CONFIG,
-                  rank=args.rank,
-                  world_size=args.world_size,
-                  local_rank=args.local_rank,
-                  host=args.host,
-                  port=args.port)
-
-m = MLP()
-```
-We will see the shapes of partitioned parameters(e.g. weights) in the MLP model.
-```shell
-Weight of the first linear layer: torch.Size([128, 256])
-Weight of the second linear layer: torch.Size([512, 64])
-```
-The complete weight of the first linear layer is supposed to have the shape `[256, 1024]`. After the partitioning of 3D parallelism, it becomes `[128, 256]` on each GPU.
-Similarly, the second layer partitions the weight `[1024, 256]` into `[512, 64]`.
-
-We can run the model with some random inputs.
-```python
-from colossalai.context import ParallelMode
-from colossalai.core import global_context as gpc
-from colossalai.utils import get_current_device
-
-x = torch.randn((16, 256), device=get_current_device())
-# partition input
-torch.distributed.broadcast(x, src=0)
-x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_WEIGHT)]
-x = torch.chunk(x, 2, dim=0)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_INPUT)]
-x = torch.chunk(x, 2, dim=-1)[gpc.get_local_rank(ParallelMode.PARALLEL_3D_OUTPUT)]
-print_rank_0(f'Input: {x.shape}')
-
-x = m(x)
-```
-Then we can see the shapes of activation results.
-```shell
-Input: torch.Size([4, 128])
-Output of the first linear layer: torch.Size([4, 512])
-Output of the second linear layer: torch.Size([4, 128])
-```
-The activation tensors in 3D parallelism are all split by $q^2$ in the row and $q$ in the column.
-E.g. the output of the first linear layer has the shape `[4, 512]`), while the second layer has the output of `[4, 128]`.
-Note, although the results of 3D parallelism have the same shape as that of 2.5D parallelism for weights here, the content of each partition is different.
+Currently the newest version of ColossalAI doesn't support 3D tensor parallelism, but this feature will be integrated into `Shardformer` in future releases.
+For more details about ideas and usages of `Shardformer`, please refer to [Shardformer Doc](./shardformer.md).
+
+For users of older version of ColossalAI, please refer to [ColossalAI-Examples - 3D Tensor Parallelism](https://github.com/hpcaitech/ColossalAI-Examples/blob/main/features/tensor_parallel/README.md).
+
+<!-- doc-test-command: echo -->
