Commit 6fde13b

Author: Yifu Wang (committed)
Update on "Add the option to turn on async-TP"
This PR adds the option to turn on async-TP (`--experimental.enable_async_tensor_parallel`). The feature is implemented as compiler passes over the relevant patterns, so the option currently takes effect only when compile is enabled.

Some trace samples from llama3_70b with TP degree = 8:

**all-gather -> qkv projection**

Baseline:
<img width="420" alt="image" src="https://github.com/pytorch/torchtitan/assets/4156752/df6980c3-4a2f-4455-bdd3-9079b538123f">

With async-TP:
<img width="513" alt="image" src="https://github.com/pytorch/torchtitan/assets/4156752/635c3dee-660d-4452-809b-32620343080a">

**ffn -> reduce-scatter**

Baseline:
<img width="537" alt="image" src="https://github.com/pytorch/torchtitan/assets/4156752/6b045c84-48df-4798-a786-4f57e3f4345a">

With async-TP:
<img width="451" alt="image" src="https://github.com/pytorch/torchtitan/assets/4156752/63f13859-97f7-48ea-aef6-4e8861b207ac">

**all-gather -> ffn**

Baseline:
<img width="494" alt="image" src="https://github.com/pytorch/torchtitan/assets/4156752/b1636055-9b5b-43b1-b98e-b91f06af995e">

With async-TP:
<img width="536" alt="image" src="https://github.com/pytorch/torchtitan/assets/4156752/3edaedf4-3780-423d-ba86-5aa1cc5e69df">

[ghstack-poisoned]
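For readers who want a rough picture of what the option controls, here is a minimal, hypothetical sketch of how an async-TP toggle is typically wired up on top of torch.compile. The helper name `maybe_enable_async_tp`, the Inductor knob `_micro_pipeline_tp`, and the symmetric-memory registration call are assumptions about the underlying PyTorch machinery, not code from this PR; only the `experimental.enable_async_tensor_parallel` option name comes from the PR itself.

```python
# Hypothetical sketch, not this PR's implementation.
from torch._inductor import config as inductor_config
from torch.distributed._symmetric_memory import enable_symm_mem_for_group


def maybe_enable_async_tp(job_config, tp_mesh):
    """Turn on async-TP compiler passes if the experimental option is set."""
    if not job_config.experimental.enable_async_tensor_parallel:
        return
    # Inductor pass (assumed knob) that decomposes all-gather/reduce-scatter
    # around the matmuls they feed and overlaps communication with compute.
    inductor_config._micro_pipeline_tp = True
    # Async-TP exchanges shards through symmetric-memory buffers, so the TP
    # process group has to be registered for symmetric memory first.
    enable_symm_mem_for_group(tp_mesh.get_group().group_name)
```

Because the rewrites happen as compiler passes, a toggle like this only changes the trace when the model is actually compiled, e.g. when a compile flag such as `--training.compile` (name assumed) is set alongside `--experimental.enable_async_tensor_parallel`.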
1 parent: 236aa92

File tree

1 file changed: +0 −5 lines


train.py (−5 lines)

@@ -252,11 +252,6 @@ def loss_fn(pred, labels):
         for m in model_parts
     ]
 
-    # for ease of testing TP in lieu of FSDP
-    if job_config.training.tensor_parallel_degree == world_size:
-        for model in model_parts:
-            model.to(torch.bfloat16)
-
     init_device = "cpu" if job_config.checkpoint.create_seed_checkpoint else "cuda"
     for model in model_parts:
         model.to_empty(device=init_device)