Sync layer norm #271

Draft · wants to merge 38 commits into base: thomas/test_different_layer_norm
Conversation

thomasw21
Member

Force sync layer norms

@thomasw21 thomasw21 changed the base branch from main to thomas/test_different_layer_norm March 24, 2022 22:39
@stas00 stas00 mentioned this pull request Mar 25, 2022
# 2. test training from checkpoint: resume
# now do it again, this time resuming from the checkpoint
with CaptureStdout() as cs:
    execute_subprocess_async(cmd, env=self.get_env())
Member

so it crashes on resume:

Traceback (most recent call last):
  File "/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/mnt/nvme0/code/huggingface/Megatron-DeepSpeed-master-4/pretrain_gpt.py", line 245, in main
    pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
  File "/mnt/nvme0/code/huggingface/Megatron-DeepSpeed-master-4/megatron/training.py", line 188, in pretrain
    iteration = train(forward_step_func,
  File "/mnt/nvme0/code/huggingface/Megatron-DeepSpeed-master-4/megatron/training.py", line 857, in train
    train_step(forward_step_func,
  File "/mnt/nvme0/code/huggingface/Megatron-DeepSpeed-master-4/megatron/training.py", line 441, in train_step
    loss = model[0].train_batch(data_iter=data_iterator)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/runtime/pipe/engine.py", line 346, in train_batch
    self._exec_schedule(sched)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/runtime/pipe/engine.py", line 1363, in _exec_schedule
    self._exec_instr(**cmd.kwargs)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/runtime/pipe/engine.py", line 1149, in _exec_optimizer_step
    self._take_model_step(lr_kwargs)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/runtime/engine.py", line 1787, in _take_model_step
    self.optimizer.step()
  File "/home/stas/anaconda3/envs/py38-pt111/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/mnt/nvme0/code/github/00optimize/deepspeed/deepspeed/runtime/bf16_optimizer.py", line 239, in step
    assert all_groups_norm > 0.
AssertionError

Comment on lines 88 to 94
tp_world_size = mpu.get_tensor_model_parallel_world_size()
# TODO: hack in order to synchronize all layer norms despite them being unsynched
weight = mpu.reduce_from_tensor_model_parallel_region(self.weight) / tp_world_size
bias = mpu.reduce_from_tensor_model_parallel_region(self.bias) / tp_world_size

return FusedLayerNormAffineFunction.apply(
-   input, self.weight, self.bias, self.normalized_shape,self.eps)
+   input, weight, bias, self.normalized_shape,self.eps)
Member

@stas00 stas00 Mar 25, 2022

@tjruwase, this is the main workaround (an all_reduce mean over the layer norm's weight and bias) that we want to put in until we can fix the fp32 weights.

Comment on lines 90 to 93
weight = torch.clone(self.weight)
bias = torch.clone(self.bias)
weight = mpu.reduce_from_tensor_model_parallel_region(weight) / tp_world_size
bias = mpu.reduce_from_tensor_model_parallel_region(bias) / tp_world_size
Member Author

@thomasw21 thomasw21 Mar 25, 2022

@stas00

Essentially the reduce is an in-place operation, which means that at each forward pass self.weight was updated with the sum of the weights from all tp ranks. We could try to find a better fix by doing an average reduce, but I'm worried backpropagation doesn't play well with that in-place logic.
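
For illustration, a minimal self-contained sketch of the pitfall (single process, gloo backend, so the group size is 1 here, but the in-place semantics of all_reduce are the same for larger groups): all_reduce mutates its argument, so reducing self.weight directly would overwrite the parameter on every forward pass, while cloning first only averages the copy.

import os
import torch
import torch.distributed as dist

# Single-process setup so the example runs anywhere.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

weight = torch.nn.Parameter(torch.ones(4))

# Wrong: dist.all_reduce(weight.data) would accumulate the other ranks' weights
# into the stored parameter on every forward pass.

# Safer: reduce a clone and average it, leaving the parameter untouched.
tp_world_size = dist.get_world_size()
synced = weight.detach().clone()
dist.all_reduce(synced)          # in-place sum across the group
synced = synced / tp_world_size  # turn the sum into a mean

print(weight.data)  # unchanged
print(synced)       # averaged copy

dist.destroy_process_group()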

New test fails with:

E               raise StopIteration
E           StopIteration

This one is more expected, since the previous run should have consumed all the tokens. Going to update #272 and restart the training.

Member

Should we extend:

def _reduce(input_):
    """All-reduce the input tensor across model parallel group."""
    # Bypass the function if we are using only 1 GPU.
    if get_tensor_model_parallel_world_size()==1:
        return input_
    # All-reduce.
    torch.distributed.all_reduce(input_, group=get_tensor_model_parallel_group())

to support an optional ReduceOp.AVG?
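
As a rough sketch of what that extension might look like (assuming the same get_tensor_model_parallel_* helpers that the existing _reduce uses; ReduceOp.AVG needs a recent NCCL, so summing and dividing by the tp world size stays the portable fallback):

import torch
from torch.distributed import ReduceOp

def _reduce(input_, op=ReduceOp.SUM):
    """All-reduce the input tensor across the model parallel group."""
    # Bypass the function if we are using only 1 GPU.
    if get_tensor_model_parallel_world_size() == 1:
        return input_
    # All-reduce with the requested op (e.g. an averaging op for the layer-norm sync).
    torch.distributed.all_reduce(input_, op=op, group=get_tensor_model_parallel_group())
    return input_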

Member Author

I think this is tricky. The reason is that it means we need to implement a custom backward function (since we compute the average, the gradient needs to be divided by the tp world size).
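
A hedged sketch of the kind of custom Function this would need (hypothetical name, reusing the same mpu helpers): the forward averages across the tensor-parallel group, so the backward has to scale the incoming gradient by 1 / tp_world_size.

import torch

class _AvgReduceFromModelParallelRegion(torch.autograd.Function):
    """Hypothetical averaging counterpart of reduce_from_tensor_model_parallel_region."""

    @staticmethod
    def forward(ctx, input_):
        world_size = get_tensor_model_parallel_world_size()
        if world_size == 1:
            return input_
        output = input_.clone()  # avoid mutating the input in place
        torch.distributed.all_reduce(output, group=get_tensor_model_parallel_group())
        return output / world_size

    @staticmethod
    def backward(ctx, grad_output):
        # The forward divided by the tp world size, so the gradient must be
        # scaled down by the same factor relative to the plain sum-reduce.
        return grad_output / get_tensor_model_parallel_world_size()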

Member Author

Also I don't think we save much compute by supporting that.

adammoody pushed a commit to adammoody/Megatron-DeepSpeed that referenced this pull request Dec 18, 2023
* Enable universal ckpting

* Update run scripts

* Address PR feedback

* Remove line

* Fix white lines

* Remove redundant changes

* Apply to gpt_model only

* Code cleanup

* Code cleanup

* Update training.py

Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>

* Update training.py

Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>

* Log loss_scale only valid for fp16

* Add README and bf16 scripts

* Visualization docsts

* Support older DS

* Handle uni_ckpt import error

* Revert changes

---------

Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>