Universal ckpt: Visualize Z1 examples (NVIDIA#267)
* Enable universal ckpting

* Update run scripts

* Address PR feedback

* Remove line

* Fix white lines

* Remove redundant changes

* Apply to gpt_model only

* Code cleanup

* Code cleanup

* Update training.py

Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>

* Update training.py

Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>

* Log loss_scale only valid for fp16

* Add README and bf16 scripts

* Visualization docstrings

---------

Co-authored-by: Michael Wyatt <mrwyattii@gmail.com>
tjruwase and mrwyattii authored Oct 25, 2023
1 parent 796866f commit 036ff77
Showing 1 changed file with 3 additions and 1 deletion.
Changed file: examples_deepspeed/universal_checkpointing/README.md
````diff
@@ -47,12 +47,14 @@ drwxr-xr-x 2 user group 4096 Oct 21 09:01 global_step200
 -rwxr--r-- 1 user group 24177 Oct 21 09:50 zero_to_fp32.py
 ```
 
-### Step3: Resume training with Universal checkpoint of iteration 100
+### Step 3: Resume training with Universal checkpoint of iteration 100
 ```bash
 bash examples_deepspeed/universal_checkpointing/run_universal_bf16.sh
 ```
 This resumption script effects the loading of universal checkpoint rather than the ZeRO checkpoint in the folder by passing `--universal-checkpoint` command line flag to the main training script (i.e., `pretrain_gpt.py`).
 
+Please see the corresponding [pull request](https://github.com/microsoft/Megatron-DeepSpeed/pull/265) for visualizations of matching loss values between original and universal checkpoint runs for bf16 and fp16 examples.
 
 ## ZeRO stage 2 training (**Coming soon**)
 
 ## ZeRO stage 3 training (**Coming soon**)
````
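The resumption step in the diff hinges on the `--universal-checkpoint` flag reaching the main training script, `pretrain_gpt.py`. A minimal sketch of how a resume launcher might assemble that invocation follows; only `--universal-checkpoint` and `pretrain_gpt.py` come from the README excerpt, while the checkpoint directory, the `--load` and `--bf16` options, and the overall structure are illustrative assumptions, not the actual contents of `run_universal_bf16.sh`:

```shell
#!/bin/bash
# Hypothetical resume launcher sketch. Everything except --universal-checkpoint
# and pretrain_gpt.py is an assumption for illustration.
CKPT_DIR="z1_uni_ckpt"   # assumed directory holding the converted universal checkpoint
options=" \
  --load ${CKPT_DIR} \
  --universal-checkpoint \
  --bf16"
# A real script would launch training, e.g.: deepspeed pretrain_gpt.py ${options} ...
echo "pretrain_gpt.py${options}"
```

The key design point the README describes is that resumption is controlled purely by the extra flag: the same training entry point is reused, and `--universal-checkpoint` switches checkpoint loading from the ZeRO-partitioned format to the universal format.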
