From 036ff7721d59bb288888e6385ba1b89af148ae1e Mon Sep 17 00:00:00 2001
From: Olatunji Ruwase
Date: Wed, 25 Oct 2023 14:45:26 -0400
Subject: [PATCH] Universal ckpt: Visualize Z1 examples (#267)

* Enable universal ckpting

* Update run scripts

* Address PR feedback

* Remove line

* Fix white lines

* Remove redudant changes

* Apply to gpt_model only

* Code cleanup

* Code cleanup

* Update training.py

Co-authored-by: Michael Wyatt

* Update training.py

Co-authored-by: Michael Wyatt

* Log loss_scale only valid for fp16

* Add README and bf16 scripts

* Visualization docsts

---------

Co-authored-by: Michael Wyatt
---
 examples_deepspeed/universal_checkpointing/README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/examples_deepspeed/universal_checkpointing/README.md b/examples_deepspeed/universal_checkpointing/README.md
index fc4babc535..f691561752 100644
--- a/examples_deepspeed/universal_checkpointing/README.md
+++ b/examples_deepspeed/universal_checkpointing/README.md
@@ -47,12 +47,14 @@
 drwxr-xr-x 2 user group  4096 Oct 21 09:01 global_step200
 -rwxr--r-- 1 user group 24177 Oct 21 09:50 zero_to_fp32.py
 ```
 
-### Step3: Resume training with Universal checkpoint of iteration 100
+### Step 3: Resume training with Universal checkpoint of iteration 100
 ```bash
 bash examples_deepspeed/universal_checkpointing/run_universal_bf16.sh
 ```
 This resumption script effects the loading of universal checkpoint rather than the ZeRO checkpoint in the folder by passing `--universal-checkpoint` command line flag to the main training script (i.e., `pretrain_gpt.py`).
 
+Please see the corresponding [pull request](https://github.com/microsoft/Megatron-DeepSpeed/pull/265) for visualizations of matching loss values between original and universal checkpoint runs for bf16 and fp16 examples.
+
 ## ZeRO stage 2 training (**Coming soon**)
 ## ZeRO stage 3 training (**Coming soon**)
\ No newline at end of file
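
A minimal sketch of the flag-forwarding pattern the patched README describes: the resumption script selects universal-checkpoint loading by appending `--universal-checkpoint` to the arguments passed to `pretrain_gpt.py`. Everything here other than that flag and the script name is an illustrative assumption; the command is only assembled and echoed, not executed.

```shell
#!/usr/bin/env bash
# Sketch: build the resume command the way run_universal_bf16.sh
# conceptually does. UNIVERSAL_CKPT_ARG is the real flag from the
# README; the rest is hypothetical scaffolding.
UNIVERSAL_CKPT_ARG="--universal-checkpoint"
TRAIN_SCRIPT="pretrain_gpt.py"

# Echo rather than run, so the sketch is inspectable without a cluster.
echo "python ${TRAIN_SCRIPT} ${UNIVERSAL_CKPT_ARG}"
```

Omitting `--universal-checkpoint` would instead resume from the ordinary ZeRO checkpoint in the folder, which is the behavioral difference the README's Step 3 is pointing at.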