Commit 42589ae

tianyu-l, Andrew Gu, sanketpurandare, Yifu Wang, and vkuzo authored
merge upstream changes and add support for torchbench (#9)
* Set `record_shapes=True` for profiler
  ghstack-source-id: 6f1ed49d15ce311f1bf118820965cdb5309a8030
  Pull Request resolved: pytorch#419
* Improved `repeat_kv` eager perf
  ghstack-source-id: 39e484954814e61cdfb2ba661f0a98c83bc0ce60
  Pull Request resolved: pytorch#418
* Adding FSDP Memory Tracking and Estimation
  ghstack-source-id: c8ed20fc585957bd164dd963307616a53991615d
  Pull Request resolved: pytorch#425
* Adding integration test for FSDP Memory Tracking and Estimation
  ghstack-source-id: cc224db8951ec7a133fd769845a4765cbedc6454
  Pull Request resolved: pytorch#426
* by default disable heavy memory profiling
  ghstack-source-id: cad7b3c41fd60ec19c0e6e7d058e8aa00602a187
  Pull Request resolved: pytorch#430
* Add the option to turn on async-TP
  ghstack-source-id: 0a03379eeb3a63b2d1ad4dff84d0e61ca82b1bbf
  Pull Request resolved: pytorch#429
* Modifying memory estimation options and minor changes
  ghstack-source-id: 5f09824cddaed6585cc094095e1e95dd070d76f4
  Pull Request resolved: pytorch#435
* add comment pointing to Sequence Parallel optimization example
  ghstack-source-id: 6fa0dcd4bca876e10a6a8349283fb940a59ad234
  Pull Request resolved: pytorch#438
* switch float8 logic from Float8DynamicLinear to Float8Linear (pytorch#436)
  Summary: After pytorch-labs/float8_experimental#300, `Float8Linear` with default settings is equivalent to `Float8DynamicLinear`. This PR changes `torchtitan` to use `Float8Linear`. To support the new UX of `float8_experimental` better, I also switched the `fp8_linear` configuration to be a boolean on whether to swap the linears or not. In the future we can add new options on how to configure each linear (scaling type, scaling granularity, etc.) - saving that for a future PR.
  Test Plan:
  ```
  // run baseline (Float8DynamicLinear) for llama3_8b for 50 iterations on 4 GPUs,
  // verify performance and loss values do not change meaningfully between
  // baseline and this PR
  // baseline (before this PR)
  // 1. compile, bf16
  // 2. compile, float8
  // 3. compile, float8, fdsp_fp8_allgather=True
  // 4. compile, float8, fdsp_fp8_allgather=True, tp=2
  // logs: https://gist.github.com/vkuzo/e6d5f3b15349862bfad3706baad8c9ce
  // experiment (this PR): repeat all of the above, but with Float8Linear
  // logs: https://gist.github.com/vkuzo/a4d6754358facffa64df931654459631
  ```
  Reviewers: Subscribers: Tasks: Tags:
* Removed `_experimental_support_context_fn_in_torch_utils_checkpoint`
  ghstack-source-id: 50b2d0c2b4c22e2f045cafd8630c16f3a8c6d35f
  Pull Request resolved: pytorch#444
* Reordered TP parallel plan to follow execution order
  ghstack-source-id: b4924952adeb5f16d08b60faa54690762841c422
  Pull Request resolved: pytorch#445
* Made some stylistic changes to `apply_dp`
  ghstack-source-id: fb78e9eb8aa406ba87d6ad6cf2229c1027dae42f
  Pull Request resolved: pytorch#446
* Refactored activation checkpointing
  ghstack-source-id: 785c7e47651cda97ea22d0147d14b8d061ce042d
  Pull Request resolved: pytorch#447
* compiled RMSNorm
  ghstack-source-id: c4efb81ec6acc5442955908cc376df3e6d889af3
  Pull Request resolved: pytorch#442
* Renamed parallel styles for transformer block weights
  ghstack-source-id: 5fb0bf3d08cacf27242ec0f85d5dd3cdc03b739e
  Pull Request resolved: pytorch#448
* Added type annotations and more stylistic changes
  ghstack-source-id: 1bd5b9d5abc8644785132f8eb2baaf8b1cfc5fb5
  Pull Request resolved: pytorch#449
* [Cleanup] Remove libuv from run_llama_train.sh
  libuv is now enabled by default. We can probably do without the educational blurb there, and don't need the env either since the default has landed.
  ghstack-source-id: 68c8d2abe7eb0777e2add8df7634367c31b7ec06
  Pull Request resolved: pytorch#453
* [Cleanup] Organize run_llama_train.sh options
  Just a little code motion, but it looks cleaner to me this way.
  ghstack-source-id: 055fbd557cd9cf189e6b9bd6a7048f1204e1dc5c
  Pull Request resolved: pytorch#454
* [Cleanup] Split run_llama_train.sh and run_memory_estimation.sh
  Make each script simpler to read.
  ghstack-source-id: ba3aa65feb6e304736c73daf5bc8ab5fb254f196
  Pull Request resolved: pytorch#455
* [Cleanup] Remove unused TRAINER_DIR
  This argument seems to be left over from older times - it is not used anywhere in the codebase.
  ghstack-source-id: abbcf82ed4d1b8fbb71c6a6b48acbc1296dbec64
  Pull Request resolved: pytorch#456
* Add educational code pointers to top level README
  ghstack-source-id: 522aa2fa0bf1679f55d9f3a8a38fdcd319d5e3df
  Pull Request resolved: pytorch#457
* enable FSDP2 + fp8 all-gather and fix TP fp8 all-gather (pytorch#413)
  We have landed fp8 all-gather optimizations in float8_experimental (pytorch-labs/float8_experimental#266). This PR proposes the torchtitan changes, and also includes fp8 in CI.
  ```
  from float8_experimental.fsdp_utils import precompute_float8_dynamic_scale_for_fsdp
  # inside the training loop
  model(input).sum().backward()
  optim.step()
  precompute_float8_dynamic_scale_for_fsdp(model)
  ```
  FSDP2 fp8 all-gather is added to CI:
  ```
  CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear
  CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather
  CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_fp8_linear --training.enable_fsdp_fp8_all_gather --training.precompute_float8_dynamic_scale_for_fsdp
  ```
  TP fp8 all-gather is locally tested. Will add it to CI after uploading a new tokenizer with vocab size 2560 (divisible by 16):
  ```
  CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 1 --training.tensor_parallel_degree 4
  CONFIG_FILE="./train_configs/llama3_8b.toml" NGPU=4 ./run_llama_train.sh --training.enable_fp8_linear --training.data_parallel_degree 2 --training.tensor_parallel_degree 2
  ```
  precompute scales after optimizer.step:
  <img width="319" alt="Screenshot 2024-07-12 at 5 11 14 PM" src="https://github.com/user-attachments/assets/1c55bd89-9183-42ca-9445-23f3b95e0817">
  FSDP2 pre-all-gather does not have any small all-reduces:
  <img width="794" alt="Screenshot 2024-07-12 at 5 13 04 PM" src="https://github.com/user-attachments/assets/1a00dc70-a8ca-4ce1-a93c-316f22efdb08">
  TODO:
  * upload tokenizer with vocab size 2560 to enable CI on TP fp8 all-gather
  * torch.compile complains about fp8
  * add delayed scaling and brainstorm about best config option to express fp8
  * compare perf between delayed scaling and dynamic scaling https://github.com/pytorch-labs/float8_experimental/pull/312/files
* import float8_experimental only when fp8 is enabled and install it in CI (pytorch#464)
  Make sure to only import float8_experimental when fp8 is enabled. For 4 GPU CI, make sure we can import float8_experimental correctly in CI: `python -m pip install git+https://github.com/pytorch-labs/float8_experimental.git`
* skip fp8 CI on non-H100 GPUs (pytorch#465)
  Skip fp8 tests on non-H100 GPUs by checking `torch.cuda.get_device_capability() >= (9, 0)`. This makes 4 GPU CI healthy again.
* clean up float8 configs in torchtitan (pytorch#466)
  Summary:
  1. standardizes on `float8` instead of `fp8` for config names
  2. removes usage of non-public objects such as `Float8Linear`
  Test Plan:
  ```
  with-proxy NGPU=1 CUDA_VISIBLE_DEVICES=7 CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.compile --training.enable_float8_linear
  ```
  Reviewers: Subscribers: Tasks: Tags:
* Add support of DDP and experimental CompiledAutograd
  Summary: Address the comments in pytorch#319 and resubmit the PR to fit the current code base.
  Test Plan:
  ```
  CONFIG_FILE=./train_configs/debug_model.toml ./run_llama_train.sh --comm.train_timeout_seconds=3600 --training.tensor_parallel_degree=1 --training.data_parallel_degree=8 --experimental.data_parallel_type=ddp --training.steps=1000 --metrics.log_freq=10 --profiling.profile_freq=1000
  ```
  ghstack-source-id: 81dc85d42df13df4ed727bebd825681879af936b
  Pull Request resolved: pytorch#432
* add torch.compile + FSDP2 float8 all-gather in CI (pytorch#468)
  Fixed my bug in float8_experimental. Now we can torch.compile transformer blocks with FSDP float8 all-gather (pytorch-labs/float8_experimental#321).
  Local test: `CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_float8_linear --training.enable_fsdp_float8_all_gather --training.precompute_float8_dynamic_scale_for_fsdp --training.compile`
  Profiler traces: I can see the compiled region in the CPU thread and the float8 matmul `sm90_xmma_gemm_e4m3bf16...` in the CUDA stream.
  <img width="1468" alt="Screenshot 2024-07-18 at 4 22 17 PM" src="https://github.com/user-attachments/assets/0cf58dee-aae1-4582-a3f1-b8aa48b45129">
* [float8] keep model.output as `nn.Linear` (high precision, not fp8) (pytorch#469)
  **keep model.output as nn.Linear**: it's a common practice to NOT apply fp8 on the final output layer
  * specify `skip_fqn_list` in swapping
  * when applying TP to model.output, use plain `ColwiseParallel` instead of `Float8ColwiseParallel`
  Credit to @awgu, we do not need tokenizer vocab size to be divisible by 16 (pytorch#461).
  1D TP + float8 all-gather, eager mode: `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4 ./run_llama_train.sh --training.enable_float8_linear --training.data_parallel_degree 1 --training.tensor_parallel_degree 4`
  1D TP + float8 all-gather, compile mode: `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4 ./run_llama_train.sh --training.enable_float8_linear --training.data_parallel_degree 1 --training.tensor_parallel_degree 4 --training.compile`
  2D FSDP2 + TP + float8 all-gather, eager mode: `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4 ./run_llama_train.sh --training.enable_float8_linear --training.enable_fsdp_float8_all_gather --training.precompute_float8_dynamic_scale_for_fsdp --training.tensor_parallel_degree 2`
  2D FSDP2 + TP + float8 all-gather, compile mode: `CONFIG_FILE="./train_configs/debug_model.toml" NGPU=4 ./run_llama_train.sh --training.enable_float8_linear --training.enable_fsdp_float8_all_gather --training.precompute_float8_dynamic_scale_for_fsdp --training.tensor_parallel_degree 2 --training.compile`
  1D TP + float8 all-gather trace: see float8 and all-gather in the trace
  <img width="1611" alt="Screenshot 2024-07-19 at 1 16 59 PM" src="https://github.com/user-attachments/assets/9a95dfd9-40e0-4133-b2bb-e22ddf5b8472">
  2D + float8 all-gather trace: see float8 and FSDP collectives and TP collectives
  <img width="1038" alt="Screenshot 2024-07-19 at 1 29 59 PM" src="https://github.com/user-attachments/assets/6a34bcaa-bcae-402b-9994-cc892554fec7">
* remove CI for FSDP2 + fp8 all-gather (pytorch#470)
  Per discussion from pytorch#469 (comment), we are planning BC-breaking changes in float8_experimental. Remove CI for FSDP2 + fp8 all-gather for now. When public APIs are finalized, we can discuss bringing it back.
* dynamically update torch.compile cache config to ensure async tp support, enhance async tp UX (pytorch#471)
  This PR adds some enhancements for supporting async tp:
  1. If async tp is active, auto-updates the torch.dynamo cache limit to 10K. If this is not updated, async tp will not be activated on larger models as it will quietly stop compilation due to 'cache limit reached' with no info for the user. This config update is logged.
  2. If async tp is enabled, verifies that torch.compile is set to true for this job config. If not, it warns and then activates torch.compile to ensure the user gets working async tp. (see WARNING in the screenshot below)
  <img width="1345" alt="Screenshot 2024-07-20 at 4 33 04 PM" src="https://github.com/user-attachments/assets/26e5a48e-4bb8-4f33-b1b5-8939c1517c1d">
  3. Updates the 'Applied Tensor Parallel' log message to 'Applied Async Tensor Parallel' when async tp is active, to make it clear in the logs which TP is active. (see above screenshot)
* Fix 8gpu PP failure due to 2D DCP disablement
  DCP recently added safeties to avoid using it for 2D/3D since strided sharding (a feature needed for safe 2D/3D resharding) is not ready yet. PP uses DCP to load a seed checkpoint. Disabling the safety mechanism is enough to make 3D/PP still work (for the case where we train from the beginning or do not re-shard). (Resharding refers to saving a checkpoint from one world size/parallelism config and loading/resuming under a different one.)
  ghstack-source-id: c069d2186c79517c72f5b3c99485cebdc15df08f
  Pull Request resolved: pytorch#460
* update float8 integration after UX changes (pytorch#484)
  Summary: float8_experimental landed various BC-breaking UX changes last week. This PR updates torchtitan to work with the version of float8_experimental after pytorch-labs/float8_experimental#332 and pytorch-labs/float8_experimental#337.
  Test Plan:
  ```
  with-proxy CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NGPU=8 CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.enable_float8_linear --training.compile
  ```
  Reviewers: Subscribers: Tasks: Tags:
* Re-enable FSDP2 Mem Tracker integration tests
  ghstack-source-id: 8344603f7a5596cb2909c9bf04dd1b9e4730c9b8
  Pull Request resolved: pytorch#485
* Used `partial` instead of global vars for LR scheduling
  ghstack-source-id: 12c4418b0574d93e1441f4ca3d1de79c8aad7a40
  Pull Request resolved: pytorch#487
* [EZ] Add logs for some basic training params so that we can verify in… (pytorch#491)
  As title. While testing on the 405B model, I found that we need the logs for some basic training params, so added some here. Tested locally and the logging is shown as in the screenshot:
  <img width="900" alt="image" src="https://github.com/user-attachments/assets/b94e34f5-3e88-4c5f-94ed-75f50dde9786">
* make float8 scaling type configurable (pytorch#489)
  Summary: Adds config options to configure the float8 scaling type for input, weight, grad_output. Performance is not ideal yet, but that's because we have not optimized it.
  Test Plan:
  ```
  // repeat for input, weight, grad_out
  with-proxy CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --training.enable_float8_linear --training.float8_scaling_type_weight delayed --training.compile
  ```
  Reviewers: Subscribers: Tasks: Tags:
* [PP] add flexible interleaved 1f1b schedule pytorch#490 (pytorch#493)
  This was approved in pytorch#490, but merged into the wrong branch; merging this into main.
* move float8 callsites to torchao.float8 (pytorch#492)
  Summary: The `float8_experimental` repository moved to `torchao.float8` in pytorch/ao#551. This PR updates `torchtitan` to use float8 from the new location.
  Test Plan:
  ```
  with-proxy CONFIG_FILE="./train_configs/debug_model.toml" ./run_llama_train.sh --training.enable_float8_linear --training.compile
  ```
  Reviewers: Subscribers: Tasks: Tags:
* [BE][1/n] simplify train.py
  ghstack-source-id: 3879e764e7b33afde5d778810c71d1d2a8f82f6d
  Pull Request resolved: pytorch#494
* [BE][2/n] use proper method signatures in parallelize_llama
  ghstack-source-id: 17a1ee9f03f13423a30183c5c8d7ad30f8c8dbfc
  Pull Request resolved: pytorch#495
* [BE][3/n] wrap fp8 logic using Float8Handler
  ghstack-source-id: e94c7f6f4fad87c5432262c54beabd02de5541b8
  Pull Request resolved: pytorch#496
* Bring LLaMa 3.1 405B to TorchTitan family (pytorch#481)
  With the official launch of the LLaMa 3.1 model, we want to add the config to TorchTitan. Of course, there is more work to be done, but we want to go in an incremental way, so more PRs will be needed. For now, we tried on 128 GPUs with the current config (TP=8, FSDP=16). The perf number is wps: 109, mfu: 29%.
  Loss curve for 3000 steps with 600 warmup (lr = 0.8e-4).
  <img width="1037" alt="image" src="https://github.com/user-attachments/assets/f57dd3fa-07d8-4ef4-8f68-8f7a08e9652e">
  Loss curve for 3000 steps with 600 warmup (lr = 1.1e-4).
  ![image](https://github.com/user-attachments/assets/429b9738-94cb-4b37-90ef-049a5587ddd0)
* [TP] Infer local n_heads instead of ad-hoc model changes
  ghstack-source-id: 587e3d6e5270714ca734b8031ce41a962e6394ea
  Pull Request resolved: pytorch#498
* some compile-related updates
  ghstack-source-id: 63af8025c184fd5ad34f2f57bf78a37dda2cd33d
  Pull Request resolved: pytorch#443
* [EZ][405B] Use scientific notation for 405B model lr (pytorch#504)
  As title, use `8e-5` rather than `0.8e-4`.
* [BE][4/n] split pipeline_llama into a separate file
  ghstack-source-id: 5ebb4adf3152f413fa33a923c272c9aa3ce1f775
  Pull Request resolved: pytorch#499
* [fix] float8 should be applied on all model_parts
  ghstack-source-id: 52ed6836de39e82c4c5824a40ecfc1d9ec7ed2bd
  Pull Request resolved: pytorch#500
* Add warning to compile rmsnorm (pytorch#505)
  As titled, add warning to compile rmsnorm as it's not fully ready yet, i.e. this issue pytorch#497. We can remove this warning once we fix the issue.
* add float8 to README (pytorch#509)
  Add the float8 link in README so we can redirect people from the dev-discuss post to the torchtitan repo.
  README looks like this after rendering
  <img width="518" alt="Screenshot 2024-08-06 at 5 42 10 PM" src="https://github.com/user-attachments/assets/50af99d7-93be-459a-89d7-8c08b8fb95d4">
  float8.md looks like this
  <img width="563" alt="Screenshot 2024-08-06 at 5 04 17 PM" src="https://github.com/user-attachments/assets/06d30aad-4133-4cec-9037-cfcf155b45c4">
  I tried the command locally and traces are looking good
  <img width="726" alt="Screenshot 2024-08-06 at 5 00 00 PM" src="https://github.com/user-attachments/assets/bdfa3d7e-efe1-4009-92a1-0f5c310013fb">
* address TODOs as 2D recompiles is fixed
  ghstack-source-id: 2927f0a8082171da3e9f59a5d04f8325cbdf3653
  Pull Request resolved: pytorch#508
* [BE][5/n] simply pp vs. non-pp set up
  ghstack-source-id: 003bfbfbcf1511ddbd18e15d031b39f597d8e7db
  Pull Request resolved: pytorch#510
* [BE][6/n] replace large c4_mini datasets by c4_test with the first 2K entries
  ghstack-source-id: 319f4961b092778703101b98937803073132afa1
  Pull Request resolved: pytorch#512
* Create composability.md (pytorch#511)
  Explain the rationale and challenges behind certain changes we made to the llama model to support 3D parallelism.
  ---------
  Co-authored-by: tianyu-l <150487191+tianyu-l@users.noreply.github.com>
* depend on torchdata 0.8.0 instead of nightly
  ghstack-source-id: 1965d3122885fed3c28e2e058c55581187e7816c
  Pull Request resolved: pytorch#513
* add support for torchbench

---------

Co-authored-by: Andrew Gu <andgu@fb.com>
Co-authored-by: Sanket Jayant Purandare <sanketpurandare@meta.com>
Co-authored-by: Yifu Wang <yifu@fb.com>
Co-authored-by: Vasiliy Kuznetsov <vkuzo@users.noreply.github.com>
Co-authored-by: Will Constable <whc@meta.com>
Co-authored-by: Wei (Will) Feng <134637289+weifengpy@users.noreply.github.com>
Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Less Wright <lessw@etrillium.com>
Co-authored-by: Sanket Jayant Purandare <sanketpurandare@fb.com>
Co-authored-by: Hugo <6937752+fduwjj@users.noreply.github.com>
Co-authored-by: Howard Huang <howardhuang96@gmail.com>
Co-authored-by: Ke Wen <kw2501@meta.com>
Co-authored-by: Wanchao <wanchaol@users.noreply.github.com>
Co-authored-by: Will Constable <willconstable@gmail.com>
1 parent d86885f commit 42589ae


50 files changed (+3672, -945 lines)

.ci/docker/requirements.txt

+1
@@ -1,4 +1,5 @@
 torch >= 2.3.0
+torchdata >= 0.8.0
 datasets >= 2.19.0
 tomli >= 1.1.0 ; python_version < "3.11"
 tensorboard

.github/workflows/integration_test_4gpu.yaml

+1-1
@@ -38,6 +38,6 @@ jobs:
 pip config --user set global.progress_bar off
 
 python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
-python -m pip install --pre torchdata --index-url https://download.pytorch.org/whl/nightly/
+USE_CPP=0 python -m pip install git+https://github.com/pytorch/ao.git
 mkdir artifacts-to-be-uploaded
 python ./test_runner.py artifacts-to-be-uploaded --ngpu 4

.github/workflows/integration_test_8gpu.yaml

-1
@@ -37,6 +37,5 @@ jobs:
 pip config --user set global.progress_bar off
 
 python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
-python -m pip install --pre torchdata --index-url https://download.pytorch.org/whl/nightly/
 mkdir artifacts-to-be-uploaded
 python ./test_runner.py artifacts-to-be-uploaded --ngpu 8

.github/workflows/unit_test_cpu.yaml

-1
@@ -25,5 +25,4 @@ jobs:
 pip config --user set global.progress_bar off
 
 pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
-pip install --pre torchdata --index-url https://download.pytorch.org/whl/nightly
 pytest test --cov=. --cov-report=xml --durations=20 -vv

README.md

+16-7
@@ -18,6 +18,16 @@ Our guiding principles when building `torchtitan`:
 
 [![Welcome to torchtitan!](assets/images/titan_play_video.png)](https://youtu.be/ee5DOEqD35I?si=_B94PbVv0V5ZnNKE "Welcome to torchtitan!")
 
+### Dive into the code
+
+You may want to see how the model is defined or how parallelism techniques are applied. For a guided tour, see these files first:
+* [train.py](https://github.com/pytorch/torchtitan/blob/main/train.py) - the main training loop and high-level setup code
+* [torchtitan/parallelisms/parallelize_llama.py](https://github.com/pytorch/torchtitan/blob/main/torchtitan/parallelisms/parallelize_llama.py) - helpers for applying Data Parallel, Tensor Parallel, activation checkpointing, and `torch.compile` to the model
+* [torchtitan/parallelisms/pipeline_llama.py](https://github.com/pytorch/torchtitan/blob/main/torchtitan/parallelisms/pipeline_llama.py) - helpers for applying Pipeline Parallel to the model
+* [torchtitan/checkpoint.py](https://github.com/pytorch/torchtitan/blob/main/torchtitan/checkpoint.py) - utils for saving/loading distributed checkpoints
+* [torchtitan/float8.py](https://github.com/pytorch/torchtitan/blob/main/torchtitan/float8.py) - utils for applying Float8 techniques
+* [torchtitan/models/llama/model.py](https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama/model.py) - the Llama model definition (shared for Llama2 and Llama3 variants)
+
 ## Pre-Release Updates:
 #### (4/25/2024): `torchtitan` is now public but in a pre-release state and under development.
 Currently we showcase pre-training **Llama 3 and Llama 2** LLMs of various sizes from scratch. `torchtitan` is tested and verified with the PyTorch nightly version `torch-2.4.0.dev20240412`. (We recommend latest PyTorch nightly).
@@ -33,18 +43,18 @@ Currently we showcase pre-training **Llama 3 and Llama 2** LLMs of various sizes
 6. Learning rate scheduler, meta init, Optional Fused RMSNorm
 7. All options easily configured via [toml files](train_configs/)
 8. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine tuning
+9. [Float8 support](docs/float8.md)
 
 We report our [Performance](docs/performance.md) verified on 64 A100 GPUs
 
 
 ### Coming soon
 
 1. Async checkpointing
-2. FP8 support
-3. Context Parallel
-4. 3D Pipeline Parallel
-5. `torch.compile` support
-6. Scalable data loading solution
+2. Context Parallel
+3. 3D Pipeline Parallel
+4. `torch.compile` support
+5. Scalable data loading solution
 
 
 ## Installation
@@ -54,7 +64,6 @@ git clone https://github.com/pytorch/torchtitan
 cd torchtitan
 pip install -r requirements.txt
 pip3 install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121 # or cu118
-pip3 install --pre torchdata --index-url https://download.pytorch.org/whl/nightly
 ```
 
 ### Downloading a tokenizer
@@ -66,7 +75,7 @@ Once you have confirmed access, you can run the following command to download th
 ```bash
 # Get your HF token from https://huggingface.co/settings/tokens
 
-# llama3 tokenizer.model
+# llama3 or 3.1 tokenizer.model
 python torchtitan/datasets/download_tokenizer.py --repo_id meta-llama/Meta-Llama-3-8B --tokenizer_path "original" --hf_token=...
 
 # llama2 tokenizer.model

benchmark.py

+232
@@ -0,0 +1,232 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

import os
import time
from datetime import timedelta

import torch
from torch.distributed.elastic.multiprocessing.errors import record

from torchbenchmark.util.experiment.instantiator import (
    load_model,
    TorchBenchModelConfig,
)
from torchbenchmark.util.experiment.metrics import get_model_flops
from torchbenchmark.util.input import input_cast

from torchtitan import utils
from torchtitan.checkpoint import TrainState
from torchtitan.config_manager import JobConfig, TORCH_DTYPE_MAP
from torchtitan.logging import init_logger, logger
from torchtitan.metrics import build_gpu_memory_monitor
from torchtitan.parallelisms import ParallelDims
from torchtitan.parallelisms.parallelize_llama import torch_spmd_parallelize
from torchtitan.profiling import maybe_enable_memory_snapshot, maybe_enable_profiling


# Enable debug tracing on failure: https://pytorch.org/docs/stable/elastic/errors.html
@record
def main(job_config: JobConfig):
    init_logger()
    logger.info(f"Starting job: {job_config.job.description}")

    # used for colorful printing
    color = utils.Color if job_config.metrics.enable_color_printing else utils.NoColor

    # take control of garbage collection to avoid stragglers
    gc_handler = utils.GarbageCollection(gc_freq=job_config.training.gc_freq)

    # init distributed
    world_size = int(os.environ["WORLD_SIZE"])
    parallel_dims = ParallelDims(
        dp=job_config.training.data_parallel_degree,
        tp=job_config.training.tensor_parallel_degree,
        pp=job_config.experimental.pipeline_parallel_degree,
        world_size=world_size,
        enable_loss_parallel=job_config.training.enable_loss_parallel,
        dp_type=job_config.training.data_parallel_type,
    )
    device = torch.device(f"cuda:{int(os.environ['LOCAL_RANK'])}")
    torch.cuda.set_device(device)
    utils.init_distributed(job_config)
    # initialize GPU memory monitor and get peak flops for MFU calculation
    gpu_memory_monitor = build_gpu_memory_monitor()
    gpu_peak_flops = utils.get_peak_flops(gpu_memory_monitor.device_name)

    # build meshes
    world_mesh = parallel_dims.build_mesh(device_type="cuda")
    if parallel_dims.dp_enabled:
        dp_mesh = world_mesh["dp"]
        dp_degree, dp_rank = dp_mesh.size(), dp_mesh.get_local_rank()
    else:
        dp_degree, dp_rank = 1, 0

    if parallel_dims.pp_enabled:
        pp_mesh = world_mesh["pp"]

    model_name = job_config.model.name

    # initiate model from torchbench
    config = TorchBenchModelConfig(
        name=model_name,
        test="train",
        device="cuda",
        batch_size=job_config.training.batch_size,
        extra_args=[],
    )
    model_flops = get_model_flops(config)
    benchmark_model = load_model(config)
    model, _ = benchmark_model.get_module()

    # TODO: there seems to be a bug with dtype conversion (e.g. use resnet50)
    # cast input dtype if needed
    param_dtype = TORCH_DTYPE_MAP[job_config.training.mixed_precision_param]
    input_cond = lambda x: x.dtype == torch.float32
    input_action = lambda x: x.to(param_dtype)
    if hasattr(benchmark_model, "example_inputs"):
        benchmark_model.example_inputs = input_cast(
            input_cond, input_action, benchmark_model.example_inputs
        )
    else:
        logger.warning(
            f"{model_name} example inputs haven't been cast to {param_dtype} yet!"
        )

    # log model size
    model_param_count = utils.get_num_params(model)
    logger.info(
        f"{color.blue}Model {model_name} "
        f"{color.red}size: {model_param_count:,} total parameters{color.reset}"
    )

    # apply PT-D Tensor Parallel, activation checkpointing, torch.compile, Data Parallel
    model = torch_spmd_parallelize(model, world_mesh, parallel_dims, job_config)

    # update model and optimizer after applying parallelisms
    benchmark_model.set_module(model)
    optimizer = benchmark_model.get_optimizer()
    optimizer.add_param_group({"params": model.parameters()})

    model.train()

    gpu_mem_stats = gpu_memory_monitor.get_peak_stats()
    logger.info(
        f"GPU memory usage for model: "
        f"{gpu_mem_stats.max_reserved_gib:.2f}GiB"
        f"({gpu_mem_stats.max_reserved_pct:.2f}%)"
    )

    train_state = TrainState()

    # variables used to keep info for metrics logging
    losses_since_last_log = []
    gpu_memory_monitor.reset_peak_stats()

    # train loop
    logger.info(
        f"Training starts at step {train_state.step + 1}, "
        f"with local batch size {job_config.training.batch_size}, "
        f"global batch size {job_config.training.batch_size * dp_degree}, "
        f"total steps {job_config.training.steps}"
    )
    with maybe_enable_profiling(
        job_config, global_step=train_state.step
    ) as torch_profiler, maybe_enable_memory_snapshot(
        job_config, global_step=train_state.step
    ) as memory_profiler:
        while train_state.step < job_config.training.steps:
            train_state.step += 1
            gc_handler.run(train_state.step)

            torch.cuda.synchronize()
            start_event = torch.cuda.Event(enable_timing=True)
            end_event = torch.cuda.Event(enable_timing=True)

            # Collect time_ns() instead of time(), which does not provide better precision than 1
            # second according to https://docs.python.org/3/library/time.html#time.time.
            t0 = time.time_ns()
            start_event.record()

            is_staged = (
                hasattr(benchmark_model, "forward")
                and hasattr(benchmark_model, "backward")
                and hasattr(benchmark_model, "optimizer_step")
            )
            if is_staged and (getattr(benchmark_model, "train", None) is None):
                if optimizer is not None:
                    optimizer.zero_grad()
                loss = benchmark_model.forward()
                benchmark_model.backward(loss)
                if optimizer is not None:
                    benchmark_model.optimizer_step()
            else:
                loss = benchmark_model.train()

            end_event.record()
            torch.cuda.synchronize()
            t1 = time.time_ns()
            time_delta = start_event.elapsed_time(end_event), (t1 - t0) / 1_000_000

            # log metrics
            losses_since_last_log.append(loss)
            if (
                train_state.step == 1
                or train_state.step % job_config.metrics.log_freq == 0
            ):
                losses = [
                    loss.item() if isinstance(loss, torch.Tensor) else loss
                    for loss in losses_since_last_log
                ]
                avg_loss, max_loss = sum(losses) / len(losses), max(losses)
                if parallel_dims.dp_enabled:
                    global_avg_loss, global_max_loss = (
                        utils.dist_mean(avg_loss, dp_mesh),
                        utils.dist_max(max_loss, dp_mesh),
                    )
                else:
                    global_avg_loss, global_max_loss = avg_loss, max_loss

                gpu_mem_stats = gpu_memory_monitor.get_peak_stats()

                logger.info(
                    f"{color.cyan}step: {train_state.step:2} "
                    f"{color.green}loss: {global_avg_loss:7.4f} "
                    f"{color.yellow}memory: {gpu_mem_stats.max_reserved_gib:5.2f}GiB"
                    f"({gpu_mem_stats.max_reserved_pct:.2f}%) "
                    f"{color.blue}GPU time: {time_delta[0]:.3f}ms "
                    f"CPU wall time: {time_delta[1]:.3f}ms{color.reset}"
                )

                losses_since_last_log.clear()
                gpu_memory_monitor.reset_peak_stats()

            # signal the profiler that the next profiling step has started
            if torch_profiler:
                torch_profiler.step()
            if memory_profiler:
                memory_profiler.step()

            # reduce timeout after first train step for faster signal
            # (assuming lazy init and compilation are finished)
            if train_state.step == 1:
                utils.set_pg_timeouts(
                    timeout=timedelta(seconds=job_config.comm.train_timeout_seconds),
                    world_mesh=world_mesh,
                )

    if torch.distributed.get_rank() == 0:
        logger.info("Sleeping 2 seconds for other ranks to complete")
        time.sleep(2)

    logger.info("Training completed")


if __name__ == "__main__":
    config = JobConfig()
    config.parse_args()
    main(config)
    torch.distributed.destroy_process_group()
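
For reference, the step timing above pairs CUDA events (device time) with `time.time_ns()` (host wall time). A stripped-down sketch of just that measurement pattern, assuming a CUDA device and any callable `train_step`:

```python
import time
import torch

def time_one_step(train_step) -> tuple[float, float]:
    """Return (gpu_ms, cpu_ms) for a single call to train_step()."""
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    torch.cuda.synchronize()  # drain pending GPU work so timing starts clean
    t0 = time.time_ns()
    start_event.record()

    train_step()  # e.g. forward + backward + optimizer step

    end_event.record()
    torch.cuda.synchronize()  # wait for end_event before reading elapsed time
    gpu_ms = start_event.elapsed_time(end_event)
    cpu_ms = (time.time_ns() - t0) / 1_000_000
    return gpu_ms, cpu_ms
```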

create_seed_checkpoint.sh

-2
@@ -18,8 +18,6 @@
 
 set -ex
 
-export USE_LIBUV=1
-TRAINER_DIR=${1:-/home/$USER/local/torchtitan}
 NGPU=1
 LOG_RANK=0
 CONFIG_FILE=${CONFIG_FILE:-"./train_configs/debug_model.toml"}

docs/composability.md

+23
@@ -0,0 +1,23 @@
# Building a Clean, Readable Distributed LLM
One of the main goals for TorchTitan was to provide a version of a distributed LLM that is not only high performance, but also uses native PyTorch techniques and readable code. The challenge is how to compose together so many individual library components (FSDP, TP, PP, FP8, Compile, DCP, ...), to name just a few, while avoiding too many changes to the model guts in the process. A lot of the work is behind the scenes, designing individual components to make fewer assumptions, use common abstractions (e.g. DTensor), and generally 'get along'. But we found a few tweaks to the model code invaluable as well, and wanted to share those changes and the rationale for them.


# Making the model "pipeline friendly"
When applying Pipeline Parallelism, you will have to construct nn.Module objects representing the portion of the model that runs on a given pipeline stage. Whether you plan to manually edit your model code, or use techniques like tracing to extract model chunks, a few changes to the original model code can go a long way to making this process easier.

### Simplifying the top-level model forward
Most likely, you can write your model in such a way that the top-level nn.Module owns a sequence of child modules that it calls during forward, delegating most of the complexity to the child module forwards. If you can reduce your top level forward to mostly a for-loop over child module calls, then you'll simplify the pipeline-partitioning task to choosing the set of submodules to keep per stage. If you have non-trivial logic in the top-level forward, you'll have to find a way to patch that logic back onto the resulting pipeline stage model, which can be annoying.

Example ([PR #321](https://github.com/pytorch/torchtitan/pull/321)):
We used to slice the `freqs_cis` buffer by `seq_len` in the top-level forward, pass that into child modules, and expect that inside the child modules the `seq_len` would match up with the size of other local tensors. But we don't know whether TP was applied or not when we consider PP splitting, and could create a mismatch. It's just as easy to perform the `freqs_cis` slicing inside the child submodule, using the runtime-accurate local `seq_len`, and this sidesteps the issue at PP slicing time.

Example ([PR #322](https://github.com/pytorch/torchtitan/pull/322)): We decided to actually reuse the top-level model object on every PP stage, just delete the layers we don't want, and make sure that the top-level forward would do the right thing. This means we don't have to make a separate runtime pp_forward that glues together child modules per stage. The first change was using a ModuleDict instead of a ModuleList to store layers. This preserves layer Fully Qualified Names (FQNs) even when deleting some layers - e.g. layers.1 stays layers.1 even if you remove layers.0, which isn't true for a list - and this matters for checkpoint save/load. Preserving FQNs is a requirement for using Distributed Checkpointing (DCP), since it uses FQNs as globally unique IDs for sharding metadata. The second change was making the input and output layers optional - if the layer exists, we run it, otherwise we feed the input through to bypass it. With these two changes, we can just (meta-)initialize the whole model, delete the unused parts per stage, then materialize the remaining part on GPU before loading a checkpoint.
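
To make the FQN point concrete, here is a tiny self-contained sketch (toy layers, not the actual torchtitan model code):

```python
import torch.nn as nn

# With nn.ModuleDict, deleting a layer keeps the surviving layers' FQNs stable;
# with nn.ModuleList, the survivors are renumbered and no longer match a checkpoint.
layers_dict = nn.ModuleDict({str(i): nn.Linear(8, 8) for i in range(3)})
layers_list = nn.ModuleList(nn.Linear(8, 8) for _ in range(3))

del layers_dict["0"]  # drop a stage-local subset: FQNs of the survivors are unchanged
del layers_list[0]    # survivors shift down: old layers.1 becomes layers.0

print([name for name, _ in layers_dict.named_parameters()])
# ['1.weight', '1.bias', '2.weight', '2.bias']  -> still matches the seed checkpoint
print([name for name, _ in layers_list.named_parameters()])
# ['0.weight', '0.bias', '1.weight', '1.bias']  -> no longer matches
```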

# Using a seed checkpoint for init
Initializing the pipeline-parallel model is challenging because we assume the model could be so large as to not fit on a local GPU (or possibly, even on CPU), and we also want to use the (bitwise) same initialization as we use for 1D or 2D parallel models, to ease debugging or comparisons between runs. It's not that easy to rewrite the original model's `init_weights` function to be tolerant of initializing only some layers, while also serializing initialization operations globally for a consistent RNG order.

For now, we sidestep all these problems with a simple but brutal solution: initialize the whole model on some CPU instance, save a checkpoint file, and then lean on Distributed Checkpointing's "load" functionality to initialize the FQNs that are present on a given PP stage after stage creation. For future work, we are considering adding a more elaborate initialization scheme to `torch.pipelining`.

One issue with seed checkpoints is that we rely on initializing _every_ model state from the checkpoint, which means the model can't have any non-persistent buffers; otherwise we have to specially initialize those in `train.py` after pipeline splitting. `freqs_cis` was originally a non-persistent buffer, and we changed this to persistent in order to load it from the seed checkpoint.
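
A condensed sketch of that flow, assuming the public `torch.distributed.checkpoint` save/load entry points with `checkpoint_id`; the toy model, checkpoint path, and per-stage split below are hypothetical placeholders, not torchtitan's actual checkpoint code:

```python
import torch
import torch.distributed.checkpoint as dcp
import torch.nn as nn

class ToyModel(nn.Module):
    # Placeholder stand-in for the real model: layers live in a ModuleDict so
    # FQNs survive per-stage deletion (see the ModuleDict sketch above).
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleDict({str(i): nn.Linear(16, 16) for i in range(4)})

# Step 1 (one-off, e.g. on a CPU node): fully initialize and save a seed checkpoint.
dcp.save({"model": ToyModel().state_dict()}, checkpoint_id="seed_ckpt")

# Step 2 (per PP rank): meta-init, delete the layers this stage doesn't own,
# materialize the rest, then load only the surviving FQNs from the seed checkpoint.
with torch.device("meta"):
    stage = ToyModel()
for name in ("0", "1"):       # placeholder split: this stage keeps layers 2 and 3
    del stage.layers[name]
stage.to_empty(device="cpu")  # or the local GPU once assigned
sd = {"model": stage.state_dict()}
dcp.load(sd, checkpoint_id="seed_ckpt")
stage.load_state_dict(sd["model"])
```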

docs/float8.md

+18
@@ -0,0 +1,18 @@
## Enable Float8 Training on H100s

Please install the latest [TorchAO](https://github.com/pytorch/ao/tree/main/torchao/float8) to support the float8 dtype:
```
USE_CPP=0 python -m pip install git+https://github.com/pytorch/ao.git
```

Launch the training job with the following command (or alternatively set the configs in the toml files):
```
CONFIG_FILE="./train_configs/llama3_8b.toml" ./run_llama_train.sh --float8.enable_float8_linear --float8.enable_fsdp_float8_all_gather --float8.precompute_float8_dynamic_scale_for_fsdp
```
* `--float8.enable_float8_linear`: swap `nn.Linear` with `Float8Linear` to perform float8 matmul.
* `--float8.enable_fsdp_float8_all_gather`: cast `Float8Linear.weight` from high precision to float8 before FSDP all-gather, so we communicate in float8 to save bandwidth.
* `--float8.precompute_float8_dynamic_scale_for_fsdp` (optional): communicate AMAX/scales efficiently in a single all-reduce for all parameters instead of doing many small all-reduces, one per parameter.

For parallelisms, we support float8 all-gather for FSDP (optional) and for TP (by default for `Float8Linear`).

For scaling strategy, we currently support tensor-wise scaling with dynamic scales, and are actively working on tensor-wise scaling with delayed scales. Row-wise scaling is under exploration.
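
As a rough illustration of what these flags do, here is a minimal sketch built on the `torchao.float8` APIs this commit imports (`convert_to_float8_training`, `precompute_float8_dynamic_scale_for_fsdp`); the toy model, sizes, and optimizer are placeholders rather than torchtitan's actual `Float8Handler` code, and an H100-class GPU is assumed:

```python
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

# Placeholder model; torchtitan applies the swap to the Llama model parts instead.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()

# --float8.enable_float8_linear: swap eligible nn.Linear modules for Float8Linear
# (torchtitan keeps the final output projection in high precision via a filter).
convert_to_float8_training(model)

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(8, 1024, device="cuda")

model(x).sum().backward()  # float8 matmuls run inside the swapped linears
optimizer.step()
# With --float8.enable_fsdp_float8_all_gather (plus FSDP2), torchtitan additionally calls
# torchao.float8.precompute_float8_dynamic_scale_for_fsdp(model) right after
# optimizer.step(), so per-parameter scales are produced with one fused all-reduce.
```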
