Update on "Reordered TP parallel plan to follow execution order"
- Llama uses pre-norm (the norm runs before attention and before the FFN), so the norm entries can be moved up in the plan.
- The root norm runs before the output projection, so that order can be swapped too.



[ghstack-poisoned]
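For context, here is a minimal sketch (assumed module names and parallel styles, not the actual torchtitan plan) of a per-TransformerBlock TP plan whose entries follow Llama's pre-norm execution order, with the root-level norm listed before the output projection:

from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    SequenceParallel,
)

# Per-TransformerBlock plan, listed in execution order for a pre-norm block:
# attention_norm -> attention -> ffn_norm -> feed_forward.
layer_plan = {
    "attention_norm": SequenceParallel(),   # pre-norm runs before attention
    "attention.wq": ColwiseParallel(),
    "attention.wk": ColwiseParallel(),
    "attention.wv": ColwiseParallel(),
    "attention.wo": RowwiseParallel(),
    "ffn_norm": SequenceParallel(),         # pre-norm runs before the FFN
    "feed_forward.w1": ColwiseParallel(),
    "feed_forward.w2": RowwiseParallel(),
    "feed_forward.w3": ColwiseParallel(),
}

# At the model root, the final norm runs before the output projection,
# so "norm" is listed before "output".
root_plan = {
    "norm": SequenceParallel(),
    "output": ColwiseParallel(),
}

Reordering the plan is presumably a readability change: parallelize_module matches plan keys against submodule names, so the dict order itself should not change behavior, but listing entries in forward-pass order makes the plan easier to check against the model.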
awgu committed Jul 10, 2024
1 parent 6165d3d commit 74304ba
Showing 1 changed file with 1 addition and 0 deletions.
torchtitan/parallelisms/parallelize_llama.py
@@ -332,6 +332,7 @@ def apply_tp(model, world_mesh, parallel_dims, job_config: JobConfig):
     """
     Apply tensor parallelism.
     """
+
     tp_mesh = world_mesh["tp"]
     (
         row_parallel_strategy,
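As a rough usage sketch (mesh sizes and the driver loop are assumed, not torchtitan's exact code), the world mesh is built with named dimensions, the "tp" submesh is sliced out exactly as in the hunk above, and parallelize_module applies the plan to each block:

# Run under torchrun with dp * tp ranks, e.g. torchrun --nproc_per_node=8 ...
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import parallelize_module

world_mesh = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))  # assumed 2x4 layout
tp_mesh = world_mesh["tp"]  # same slicing as in apply_tp above

# `layer_plan` as in the earlier sketch; keys are matched against submodule
# names relative to the module passed in, e.g. each TransformerBlock:
# for block in model.layers:
#     parallelize_module(block, tp_mesh, layer_plan)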
