From 036ff7721d59bb288888e6385ba1b89af148ae1e Mon Sep 17 00:00:00 2001
From: Olatunji Ruwase
Date: Wed, 25 Oct 2023 14:45:26 -0400
Subject: [PATCH] Universal ckpt: Visualize Z1 examples (#267)

* Enable universal ckpting

* Update run scripts

* Address PR feedback

* Remove line

* Fix white lines

* Remove redudant changes

* Apply to gpt_model only

* Code cleanup

* Code cleanup

* Update training.py

Co-authored-by: Michael Wyatt

* Update training.py

Co-authored-by: Michael Wyatt

* Log loss_scale only valid for fp16

* Add README and bf16 scripts

* Visualization docsts

---------

Co-authored-by: Michael Wyatt
---
 examples_deepspeed/universal_checkpointing/README.md | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/examples_deepspeed/universal_checkpointing/README.md b/examples_deepspeed/universal_checkpointing/README.md
index fc4babc535..f691561752 100644
--- a/examples_deepspeed/universal_checkpointing/README.md
+++ b/examples_deepspeed/universal_checkpointing/README.md
@@ -47,12 +47,14 @@
 drwxr-xr-x 2 user group  4096 Oct 21 09:01 global_step200
 -rwxr--r-- 1 user group 24177 Oct 21 09:50 zero_to_fp32.py
 ```
 
-### Step3: Resume training with Universal checkpoint of iteration 100
+### Step 3: Resume training with Universal checkpoint of iteration 100
 ```bash
 bash examples_deepspeed/universal_checkpointing/run_universal_bf16.sh
 ```
 This resumption script effects the loading of universal checkpoint rather than the ZeRO checkpoint in the folder by passing `--universal-checkpoint` command line flag to the main training script (i.e., `pretrain_gpt.py`).
 
+Please see the corresponding [pull request](https://github.com/microsoft/Megatron-DeepSpeed/pull/265) for visualizations of matching loss values between original and universal checkpoint runs for bf16 and fp16 examples.
+
 ## ZeRO stage 2 training (**Coming soon**)
 ## ZeRO stage 3 training (**Coming soon**)
\ No newline at end of file
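
A minimal sketch of the flag-forwarding pattern the patched README describes: the resumption script selects universal-checkpoint loading by appending `--universal-checkpoint` to the arguments passed to `pretrain_gpt.py`. Everything here other than that flag and the script name is an illustrative assumption; the command is only assembled and echoed, not executed.

```shell
#!/usr/bin/env bash
# Sketch: build the resume command the way run_universal_bf16.sh
# conceptually does. UNIVERSAL_CKPT_ARG is the real flag from the
# README; the rest is hypothetical scaffolding.
UNIVERSAL_CKPT_ARG="--universal-checkpoint"
TRAIN_SCRIPT="pretrain_gpt.py"

# Echo rather than run, so the sketch is inspectable without a cluster.
echo "python ${TRAIN_SCRIPT} ${UNIVERSAL_CKPT_ARG}"
```

Omitting `--universal-checkpoint` would instead resume from the ordinary ZeRO checkpoint in the folder, which is the behavioral difference the README's Step 3 is pointing at.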