merge upstream changes (#15)
* Load missing keys default from argparse (#111)

```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] 
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
[rank0]:2024-03-04 17:01:28,834 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank1]:2024-03-04 17:01:28,857 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Starting job: debug training
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Building llama
[rank0]:2024-03-04 17:01:29,719 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-04 17:01:29,719 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-03-04 17:01:31,187 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-03-04 17:01:31,188 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-04 17:01:31,347 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-04 17:01:31,347 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-03-04 17:01:32,502 - root - INFO - Applied FSDP to the model...
[rank0]:2024-03-04 17:01:32,503 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-03-04 17:01:32,504 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240304-1701.
[rank0]:2024-03-04 17:01:32,901 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-03-04 17:01:34,806 - root - INFO - step:  1  loss: 10.8424  iter:  1.8688  data: 0.0316  lr: 0.00026667
[rank0]:2024-03-04 17:01:34,891 - root - INFO - step:  2  loss: 10.7581  iter:  0.0476  data: 0.0357  lr: 0.00053333
[rank0]:2024-03-04 17:01:34,970 - root - INFO - step:  3  loss: 10.6239  iter:   0.045  data: 0.0333  lr: 0.0008
[rank0]:2024-03-04 17:01:35,048 - root - INFO - step:  4  loss: 10.4163  iter:  0.0455  data: 0.0323  lr: 0.0007
[rank0]:2024-03-04 17:01:35,127 - root - INFO - step:  5  loss: 10.1529  iter:  0.0459  data: 0.032  lr: 0.0006
[rank0]:2024-03-04 17:01:35,206 - root - INFO - step:  6  loss:  9.8899  iter:  0.0468  data: 0.0311  lr: 0.0005
[rank0]:2024-03-04 17:01:35,284 - root - INFO - step:  7  loss:  9.7204  iter:  0.0461  data: 0.0312  lr: 0.0004
[rank0]:2024-03-04 17:01:35,425 - root - INFO - step:  8  loss:  9.3757  iter:  0.0457  data: 0.0319  lr: 0.0003
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-03-04 17:01:35,537 - root - INFO - step:  9  loss:  9.1883  iter:  0.0762  data: 0.0318  lr: 0.0002
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-03-04 17:01:35,958 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-03-04 17:01:35,971 - root - INFO - step: 10  loss:  9.1212  iter:  0.0808  data: 0.0319  lr: 0.0001
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average iter time: 0.0553 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average data load time: 0.0317 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

* Add meta_init, enable it as default init process (#84)

This PR enables meta_init functionality to avoid OOM'ing on cpu for
larger models.
The core functionality is in meta_init.py, and a few changes in
parallelization and train.py.
Key items:
1 - this is largely the same as the earlier PR I had for meta_init, but I
made a new one because that was faster than reworking the old one through
all the interim changes.
2 - to address feedback in previous PR:
a - why do we need meta_init.py, can't we just do:
~~~
with torch.device("meta"):
    model = Model.from_args(...)
~~~
Unfortunately this does not work because the rope embeddings are treated
differently (as a buffer), so the simple lambda passed as param_init_fn in
FSDP (lambda module: module.to_device('cuda')) will not invoke on or move
the rope embeddings, and the model fails on the first forward.
This issue relates to the nn.embeddings not being moved, and to the device
being referenced in the forward pass of the current rope class.
I have opened https://github.com/pytorch/torchtrain/issues/110 to track and
investigate this, rather than hold up landing a working meta_init.
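
For context, a minimal sketch of the general meta-device pattern this builds on; it is illustrative only (not the repo's meta_init.py), and the helper names are assumptions:

```
import torch
import torch.nn as nn

def build_on_meta(model_cls, *args, **kwargs):
    # parameters/buffers are created as meta tensors: shapes only, no memory
    with torch.device("meta"):
        return model_cls(*args, **kwargs)

def materialize(module: nn.Module, device: str = "cuda") -> nn.Module:
    # to_empty() allocates real (uninitialized) storage for params AND buffers;
    # a plain .to()-style lambda in param_init_fn can miss buffer state
    module.to_empty(device=device)
    # re-run each submodule's own init so parameters hold real values
    for m in module.modules():
        if hasattr(m, "reset_parameters"):
            m.reset_parameters()
    # note: non-persistent buffers such as the rope freqs are NOT re-created
    # by reset_parameters(), which is exactly why a dedicated meta_init helper
    # (or an explicit recompute step for those buffers) is needed
    return module
```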

b - per earlier feedback - meta init is now 'not optional' but simply
the default. This should ensure all models leverage it and ensure we
aren't missing things for future meta_init aspects.

3 - misc change - I switched model_params to report the normal all-parameters
count instead of 'unique params', because the latter does not match what
people perceive as model size.

Testing:
tested both debugmodel and 26B model with and without meta init to
confirm same loss curves.
Note for future reference - if you get a bad init (meta init failure)
you will simply not train (loss is same every iter).
If you fail to call reset params after FSDP, then you will train (b/c we
default to torch.randn_like) but your starting loss will be 5x+ higher
(telling you that you have not properly init'ed the model).

* Fix feedback from PR 111 (#113)

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

* fix SP minor issues

ghstack-source-id: 5133a8d97ad209b569e0fc528e58daafdd31d80d
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/114

* enable loss parallel in SP

ghstack-source-id: a0c8b4454f75ad1cd9824ac89a1df0182f6a7d8c
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/112

* Float8_experimental option for training (#102)

* add miniPile dataset for pretraining, 1M entries (solves the 'out of data' at 40 iters issue) (#88)

This PR adds the minipile dataset (1M entries, 6GB) as an option for
pretraining with torchtrain.
It resolves the issue where we run out of data after 40 iterations with
the default alpaca dataset.
Per @tianyu-l's excellent suggestion, I refactored to a single
hf_datasets.py file that supports both minipile and alpaca, since it
turned out no different tokenizer etc. is needed.
Also cleaned up the datasets package so that create_tokenizer is exposed
directly, and thus all public APIs can be used directly from
torchtrain.datasets.
Lastly - added a warning for when a dataset is being re-looped, so users
don't get burned by overfitting:
<img width="1294" alt="Screenshot 2024-03-06 at 5 11 09 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/82480b6f-c677-4794-80c5-5c10b037732a">


Adds a color highlight to showcase what dataloader was built:
<img width="1360" alt="Screenshot 2024-03-05 at 9 19 10 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/4717ec6a-14bb-4283-a3ae-fa40c27deee0">
and
<img width="1360" alt="Screenshot 2024-03-05 at 9 22 01 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/dbf32d51-2dd4-4526-8855-9b33b627559e">


Usage:
just add "minipile" or "alpaca" as the dataset in the training config
toml file.
<img width="439" alt="Screenshot 2024-02-25 at 12 35 26 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/1afbaed1-07f8-4e37-b8cc-80190db7fb27">

Testing:
verified training loss is improving, and ran minipile for 100 iters to
verify the out-of-data issue is gone.
Reran with alpaca and saw the expected out-of-data at 40 iters with the
infinite-loop option off; it runs to 100 iters with it on.

Notes:
I did not make this the default dataset, since for debugmodel we mostly
run 10 iters anyway and there is 6GB to pull down.
<img width="869" alt="Screenshot 2024-02-25 at 12 30 29 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/1070a80a-ad20-4f0f-a860-e13caa3120a0">

* add data loading option to load from local file system

ghstack-source-id: 3c930054d3b04faf3866048740a2ef887d066dd6
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/117

* add llama 13B configs

ghstack-source-id: 733bf85716cda3a5b9af780eba79c9b5dd66abad
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/121

* add llama 70B toml

ghstack-source-id: d7cd26d84aa2442ac45223992e1766397e52c8d8
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/122

* set betas and weight decay for optimizers

according to suggestions in https://github.com/pytorch/torchtrain/issues/118#issuecomment-1986470746

ghstack-source-id: 357f0872cd1c9bad2c4c256d47adbd3f716a7651
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/123

* Add c4 dataset (177M, streaming), update multi-node support for latest job configs (#124)

This PR:
1 - adds the English-language portion of the c4 dataset, which has 177M
entries (https://huggingface.co/datasets/allenai/c4).

Due to the size, streaming is enabled as the default.
This is allenai/c4, as apparently the original c4 is being deprecated and
HF advises using allenai now.

For comparison per @tianyu-l request - 40 iterations avg time:
alpaca cached loading: Average data load time: 0.0279 seconds
c4 streaming loading: Average data load time: 0.0290 seconds

There is a longer initial delay during the 'preparing c4' vs alpaca
(i.e. 45 seconds vs 10 seconds), but after that speed is similar.
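
For reference, a minimal sketch of the streaming load this is built around (only the load_dataset call is standard HF datasets API; the surrounding loop is illustrative):

```
from datasets import load_dataset

# streaming=True avoids downloading all 177M entries up front; samples are
# fetched lazily as the dataloader iterates
c4 = load_dataset("allenai/c4", name="en", split="train", streaming=True)

for i, sample in enumerate(c4):
    print(sample["text"][:80])  # peek at a few documents
    if i == 2:
        break
```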

Dataset sample (not displayed in training, just an excerpt I pulled to
double check the data flow):
<img width="1233" alt="Screenshot 2024-03-08 at 5 31 06 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/94915f80-da70-48d1-8c43-43f874fef121">

2 - I also updated the multi-node slurm file to account for the new job
config.

Test:
verified no looping over 100 iterations,
and sampled the streamed data to verify it.

* Add openwebtext dataset for larger scale training without shuffling (#130)

This PR adds the openwebtext 1M dataset.
This is a homogeneous dataset, so we are able to train successfully without
any shuffling in our dataset loader.

1 - adds the dataset to hf_datasets
2 - makes openwebtext the default dataset for 13B and 70B, since that is
the preferred choice for larger scale training.

Testing - ran 5K iters (9 nodes) to verify no spiking issues:

<img width="787" alt="Screenshot 2024-03-12 at 9 50 57 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/420fa1fc-50f8-47bc-9b07-02c8fa132e7c">

* [TorchTrain][Checkpoint] Fix TrainState state_dict to unblock loading (#131)

This fix would temporarily unblock loading. So we won't run into the
issue of:

```
[rank0]:[rank0]:     train_state.losses.append(train_state.current_loss)
[rank0]:[rank0]: AttributeError: 'float' object has no attribute 'append'
```

However, current_loss and losses are still not correct: with the current
setup, losses and current_loss differ across ranks. Also, we don't know
the size of losses, because it depends on the number of steps. So loading
still works, but the values of current_loss and losses are not loaded
correctly.

I will follow up with further fixes.

* improve logging

ghstack-source-id: de61ec093b43a2ccbf1156c76ba81ecd698a6a8a
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/132

* use SequenceParallel style in tp/sp (#133)

simplify things, given we already have the SequenceParallel style landed in
main
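
A hedged sketch of what using the built-in style looks like (module names are illustrative, not the exact parallelize plan in the repo):

```
from torch.distributed.tensor.parallel import parallelize_module, SequenceParallel

def apply_sp_to_norms(block, tp_mesh):
    # the norm layers are the part covered by the built-in SequenceParallel
    # style; the attention/MLP linears keep their col/row-wise plans (not shown)
    return parallelize_module(
        block,
        tp_mesh,
        {
            "attention_norm": SequenceParallel(),
            "ffn_norm": SequenceParallel(),
        },
    )
```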

* support TP-only parallelism

ghstack-source-id: c13ebb8de8e8e9203624b5dd710a046d17311b0f
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/137

* disable verbose print from profiling

ghstack-source-id: ca6eb8f42bf3c2a59d8e6389e7fe94ed85103099
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/136

* add Selective layer  activation checkpointing, single control for turning AC on or off. (#125)

This PR:
1 - adds selective layer checkpointing - this lets the user checkpoint every
x-th layer, i.e. 2 = every other layer is checkpointed.

The config spec was updated by Wanchao, so we now have this layout for
AC, which is hopefully self-explanatory (covers None, full, Selective Op,
or Selective Layer plus the layer filtering policy):
<img width="941" alt="Screenshot 2024-03-13 at 6 09 52 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/4b992286-1fbd-4a14-957a-4325f81a9ab4">


Thus, it lets the user toggle from the traditional 'all layers' down to more
and more fine-grained checkpointing.
Note that I implemented this for IBM last summer, and in their llama
testing every 2nd layer was the best bang for the buck, so I have made that
the default.
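
A minimal sketch of the every-x-layer idea, assuming a model with a `layers` ModuleList (the repo's config plumbing and op-level policy are not shown):

```
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
)

def apply_selective_layer_ac(model, ac_freq: int = 2):
    # ac_freq=2 -> every other TransformerBlock is activation-checkpointed
    for layer_id, block in enumerate(model.layers):
        if layer_id % ac_freq == 0:
            model.layers[layer_id] = checkpoint_wrapper(block)
    return model
```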

2 - the config file has been updated to make a new
[activation_checkpointing] section and make it easier to modify vs being
dumped into the training section.

Testing and results:
I tested all the AC options to ensure they work, and that we fail if both
types are set to true in the config:
<img width="608" alt="Screenshot 2024-03-09 at 3 43 52 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/e3c20fbf-73e2-492d-9fb9-f32e772e239e">

* remove per-iter synchronize

ghstack-source-id: 581c9115e89d3de57e558175b527c12c06a6808c
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/134

* Shorten nccl comm  timeout and enable flight recorder dumping (#103)

Timeout
-------

Whether during iterative debugging or long-running training, it's convenient
to find out ASAP about a failure. The default timeout is way too long and
leads to wasted cluster time or developer frustration.
Timeout can be adjusted via cmdline or in .toml if it needs to be larger
for a particular model.

Another useful pattern can be to set a large timeout for initialization
and then tighten it after iteration 1. We can add this later if desired.

Ideally we could pass the timeout to the device mesh ctor, but it's not
ready yet. Also, we can change timeouts of the existing PGs after
creating them, but that's more LOC and not necessary unless we want to
change the timeouts at runtime.

Dumps
-----

Dumping on timeout should be a safe default for everyone. It has the
side-effect of requiring a dump path which defaults to ~/pgnccl_dump but
can be overridden via DUMP_PATH env.

The raw content of the dump is a pickle that is intended to be consumed
through scripts/tools which are under development, so it may not be easy
to know how to use these for now. As the tooling matures, we should
provide reference docs and probably print out pointers in the logs when
we perform the dump.


Test plan:
tested locally by adding a rank0 sleep for 10sec inside the training
loop, validating all 8 ranks dumped a trace.

* fix up gpu memory monitoring and logging

ghstack-source-id: 2f79d081c7724dbc34f357913671e8aefdf437b1
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/147

* Separate timeout during init and training (#149)

Allow a tighter timeout during training than during init.

Init includes the first train step, as well as any loading and setup. It
can be slower and less predictable due to various factors including lazy
initialization or jit compilation.

After the first train step, we expect more predictable runtime and
benefit from a tighter timeout to give quick feedback on a hang.
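
A rough sketch of the two-phase timeout idea (the `set_pg_timeout` helper referenced below is assumed here; its exact signature is not shown in this PR text):

```
from datetime import timedelta
import torch.distributed as dist

INIT_TIMEOUT = timedelta(minutes=10)    # generous: first step, lazy init, JIT
TRAIN_TIMEOUT = timedelta(seconds=100)  # tight: quick feedback on a hang

dist.init_process_group(backend="nccl", timeout=INIT_TIMEOUT)

# ... build the model and run train step 1 ...

# after step 1, tighten the timeout on the existing process groups,
# e.g. via the set_pg_timeout helper this PR adds (signature assumed):
# set_pg_timeout(TRAIN_TIMEOUT, world_mesh)
```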

Tested by pasting this code in 2 places
```
if dp_mesh.get_local_rank() == 0 and train_state.step == 1:
   import time
   time.sleep(10)
```

(a) before calling set_pg_timeout, which did not cause a timeout (b)
after calling set_pg_timeout, which timed out

* Update activation check with updates to config manager (#152)

* Refactor to clean up parallelisms/__init__.py

(second attempt, didn't land correctly)

ghstack-source-id: 3dfec3ed134105cc5a951f8db160c8c2a9b3349b
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/154

* enable gc control scheduling to help avoid stragglers (#148)

This PR adds control over Python garbage collection to help avoid
stragglers during large-scale training.
Update - this feature is now exposed as a controllable option,
gc_schedule, with a default of 50:
0 = not enabled.
int = schedule gc every `int` iters during the training loop.
<img width="1078" alt="Screenshot 2024-03-15 at 12 39 26 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/1ee387c5-f0a6-4366-936c-a1e281dad88f">

Effectively, we disable automatic gc, run one collection to ensure a good
starting point, and then at the start of every gc_schedule-th iter we
trigger a collection to free things up.
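
A minimal sketch of that scheduling (class and method names are illustrative):

```
import gc

class GCHandler:
    def __init__(self, gc_schedule: int = 50):
        self.gc_schedule = gc_schedule  # 0 disables scheduled collection
        if gc_schedule:
            gc.disable()   # stop automatic, unsynchronized collections
            gc.collect(1)  # one collection up front for a clean start

    def run(self, step: int):
        # collect on a fixed schedule so all ranks pause at the same steps
        if self.gc_schedule and step % self.gc_schedule == 0:
            gc.collect(1)
```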

By enforcing a fixed schedule for collection, it helps all ranks stay
more in sync.
Point of reference - on 512-GPU FSDP, adding this (gc_schedule=1) gave a
perf boost of ~1.5% per iter just by virtue of better sync.

(this was originally developed during the dist compiler work to resolve
stragglers; I believe @fegin came up with this solution).

* Add float8 specific parallel strategies (#153)

* add MFU to metrics

ghstack-source-id: 995efd6f460f3fe83ecf8d72c2178554f325485b
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/151

* disable buffer reuse for compile for now (#156)

disable buffer reuse for compile to keep numerics close to eager mode,
as suggested by @Chillee

This is probably only a temporary change until the buffer-reuse fix lands in inductor

* refactor config manager and support cmd overrides (#157)

This PR supports explicit cmd overrides, to allow infra layers to
override certain options (the most important one is dump_folder)

* Add support for generating debug traces on failure

* rename sequence_parallel to tensor_parallel (#162)

This PR renames sequence_parallel to tensor_parallel: since sequence
parallel is only applied to the rmsnorm layers, the broader name should be
tensor_parallel, possibly with sequence_parallel enabled on top.

ghstack broken :( so using a direct branch push instead

* add basic AC configs for 13B and 70B (#169)

as titled: currently 13B uses selective op and 70B uses selective layer;
we can run some more experiments and adjust the configs later

* [TorchTrain][Checkpoint] Update train state to include global_avg_losses and global_max_losses (#167)

Based on discussion with @tianyu-l, we decided to only checkpoint
`global_avg_losses` and `global_max_losses` per log frequency iteration
to avoid all_reduce and device sync every iteration.
`TrainState.current_loss` and `TrainState.losses` are removed from
TrainState `state_dict()` and `load_state_dict()` call.
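
A hedged sketch of the reduced state, matching the fields shown in the log below (methods and layout are illustrative, not the exact torchtrain class):

```
from dataclasses import dataclass, field
from typing import List

@dataclass
class TrainState:
    step: int = 0
    # appended only every log-frequency steps, after the all_reduce at log time
    global_avg_losses: List[float] = field(default_factory=list)
    global_max_losses: List[float] = field(default_factory=list)
    log_steps: List[int] = field(default_factory=list)

    def state_dict(self):
        # current_loss / per-step losses are intentionally excluded: they
        # differ across ranks and their length depends on the step count
        return {
            "step": self.step,
            "global_avg_losses": self.global_avg_losses,
            "global_max_losses": self.global_max_losses,
            "log_steps": self.log_steps,
        }

    def load_state_dict(self, sd):
        self.step = sd["step"]
        self.global_avg_losses = sd["global_avg_losses"]
        self.global_max_losses = sd["global_max_losses"]
        self.log_steps = sd["log_steps"]
```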


Tested with saving/loading with 30 steps with log_frequency = 10 and
loading with 40 steps to resume training. The numerics in
global_avg_losses and global_max_losses in the list aligns with
expected.

```
Step 30 save:
[rank0]:before save: 
self.states['train_state']=TrainState(step=30, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555], log_steps=[1, 11, 21])


Step 30 load:
[rank0]:after load:
self.states['train_state']=TrainState(step=30, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555], log_steps=[1, 11, 21])


Step 40 load and resume training:
[rank0]:before save: 
self.states['train_state']=TrainState(step=40, global_avg_losses=[10.8023841381073, 9.556332552433012, 6.9460043668746945, 5.596909999847412], global_max_losses=[10.813274383544922, 10.74332332611084, 7.8649702072143555, 5.6796345710754395], log_steps=[1, 11, 21, 31])
```

* Basic integration test infra (#170)

Summary:
This PR adds an option `use_for_integration_test`. When set to `True`, it
adds the config to the integration test suite. A test runner picks up all
the configs marked for integration test and runs them.
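
A rough sketch of the runner idea (the flag's location in the .toml and the launch command wiring are assumptions):

```
import glob
import os
import subprocess
import tomllib  # Python 3.11+; use the tomli package on older versions

for path in sorted(glob.glob("./train_configs/*.toml")):
    with open(path, "rb") as f:
        config = tomllib.load(f)
    # assumed key placement: [job] use_for_integration_test = true
    if config.get("job", {}).get("use_for_integration_test", False):
        print(f"=====Integration test: CONFIG_FILE={path} NGPU=4 ./run_llama_train.sh=====")
        subprocess.run(
            ["./run_llama_train.sh"],
            env={**os.environ, "CONFIG_FILE": path, "NGPU": "4"},
            check=True,
        )
```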

Test Plan:
```
=====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh=====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757]
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0327 09:46:30.440000 140712921650176 torch/distributed/run.py:757] *****************************************
[rank0]:2024-03-27 09:46:32,214 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-03-27 09:46:32,372 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-03-27 09:46:32,375 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-03-27 09:46:32,377 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-27 09:46:32,384 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-03-27 09:46:32,384 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-03-27 09:46:34,015 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-27 09:46:34,024 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-27 09:46:34,025 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-03-27 09:46:34,147 - root - INFO - Applied FSDP to the model
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Model fully initialized via reset_parameters
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Gradient scaling not enabled
[rank0]:2024-03-27 09:46:34,171 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-0946
[rank0]:2024-03-27 09:46:34,809 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
[rank0]:  warnings.warn(
[rank0]:2024-03-27 09:46:35,627 - root - INFO - step:  1  loss: 10.9486  memory:  9.42GiB(9.91%)  wps: 20,066  mfu: 0.25%
[rank0]:2024-03-27 09:46:35,627 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank0]:2024-03-27 09:46:35,705 - root - INFO - step:  2  loss: 10.8786  memory: 11.38GiB(11.97%)  wps: 212,046  mfu: 2.60%
[rank0]:2024-03-27 09:46:35,786 - root - INFO - step:  3  loss: 10.7362  memory: 11.38GiB(11.97%)  wps: 204,441  mfu: 2.50%
[rank0]:2024-03-27 09:46:35,863 - root - INFO - step:  4  loss: 10.5094  memory: 11.38GiB(11.97%)  wps: 216,800  mfu: 2.66%
[rank0]:2024-03-27 09:46:35,939 - root - INFO - step:  5  loss: 10.2755  memory: 11.38GiB(11.97%)  wps: 216,527  mfu: 2.65%
[rank0]:2024-03-27 09:46:36,016 - root - INFO - step:  6  loss: 10.0318  memory: 11.38GiB(11.97%)  wps: 214,117  mfu: 2.62%
[rank0]:2024-03-27 09:46:36,093 - root - INFO - step:  7  loss:  9.7929  memory: 11.38GiB(11.97%)  wps: 216,509  mfu: 2.65%
[rank0]:2024-03-27 09:46:36,192 - root - INFO - step:  8  loss:  9.5539  memory: 11.38GiB(11.97%)  wps: 166,639  mfu: 2.04%
[rank0]:2024-03-27 09:46:36,329 - root - INFO - step:  9  loss:  9.3909  memory: 11.38GiB(11.97%)  wps: 120,381  mfu: 1.47%
[rank0]:[rank0]:[W327 09:46:36.744143018 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-03-27 09:46:36,409 - root - INFO - step: 10  loss:  9.2749  memory: 11.38GiB(11.97%)  wps: 207,613  mfu: 2.54%
[rank0]:NCCL version 2.20.5+cuda12.0

```

Reviewers:

Subscribers:

Tasks:

Tags:

---------

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

* Add 2D integration test (FSDP + TP) (#171)

Summary:
Add a 2D test to integration test suite

Test Plan:

```

=====Integration test: CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh=====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757]
W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] *****************************************
W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0327 14:29:47.734000 140642626999296 torch/distributed/run.py:757] *****************************************
[rank0]:2024-03-27 14:29:49,466 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-03-27 14:29:49,615 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-03-27 14:29:49,621 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-03-27 14:29:49,623 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-27 14:29:49,630 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-03-27 14:29:49,630 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-03-27 14:29:51,114 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-27 14:29:51,124 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-27 14:29:51,124 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-03-27 14:29:51,259 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-03-27 14:29:51,259 - root - INFO - Applied FSDP to the model
[rank0]:2024-03-27 14:29:51,284 - root - INFO - Model fully initialized via reset_parameters
[rank0]:2024-03-27 14:29:51,284 - root - INFO - Gradient scaling not enabled
[rank0]:2024-03-27 14:29:51,285 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-1429
[rank0]:2024-03-27 14:29:52,056 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
[rank0]:  warnings.warn(
[rank0]:2024-03-27 14:29:52,825 - root - INFO - step:  1  loss: 10.7425  memory:  9.42GiB(9.91%)  wps: 21,337  mfu: 0.26%
[rank0]:2024-03-27 14:29:52,825 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank0]:2024-03-27 14:29:52,905 - root - INFO - step:  2  loss: 10.6722  memory: 11.38GiB(11.97%)  wps: 208,060  mfu: 2.55%
[rank0]:2024-03-27 14:29:52,982 - root - INFO - step:  3  loss: 10.5435  memory: 11.38GiB(11.97%)  wps: 213,622  mfu: 2.62%
[rank0]:2024-03-27 14:29:53,060 - root - INFO - step:  4  loss: 10.3359  memory: 11.38GiB(11.97%)  wps: 212,856  mfu: 2.61%
[rank0]:2024-03-27 14:29:53,139 - root - INFO - step:  5  loss: 10.0965  memory: 11.38GiB(11.97%)  wps: 209,326  mfu: 2.56%
[rank0]:2024-03-27 14:29:53,215 - root - INFO - step:  6  loss:  9.8806  memory: 11.38GiB(11.97%)  wps: 216,808  mfu: 2.66%
[rank0]:2024-03-27 14:29:53,292 - root - INFO - step:  7  loss:  9.6442  memory: 11.38GiB(11.97%)  wps: 214,874  mfu: 2.63%
[rank0]:2024-03-27 14:29:53,367 - root - INFO - step:  8  loss:  9.4349  memory: 11.38GiB(11.97%)  wps: 220,877  mfu: 2.70%
[rank0]:2024-03-27 14:29:53,500 - root - INFO - step:  9  loss:  9.2674  memory: 11.38GiB(11.97%)  wps: 123,924  mfu: 1.52%
[rank0]:[rank0]:[W327 14:29:53.248291822 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-03-27 14:29:53,577 - root - INFO - step: 10  loss:  9.1404  memory: 11.38GiB(11.97%)  wps: 214,910  mfu: 2.63%
[rank0]:NCCL version 2.20.5+cuda12.0

=====Integration test: CONFIG_FILE=./train_configs/debug_model_2d.toml NGPU=4 ./run_llama_train.sh=====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model_2d.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_2d.toml
W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757]
W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] *****************************************
W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0327 14:29:58.902000 140021143774208 torch/distributed/run.py:757] *****************************************
[rank0]:2024-03-27 14:30:00,872 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-03-27 14:30:01,177 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-03-27 14:30:01,182 - root - INFO - Building 2-D device mesh with ['dp', 'tp'], [2, 2]
[rank0]:2024-03-27 14:30:01,185 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-27 14:30:01,194 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-03-27 14:30:01,195 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-03-27 14:30:02,807 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-27 14:30:02,818 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-03-27 14:30:02,819 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-03-27 14:30:02,830 - root - INFO - Applied Sequence Parallelism to the model
[rank0]:2024-03-27 14:30:02,975 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-03-27 14:30:02,975 - root - INFO - Applied FSDP to the model
[rank0]:2024-03-27 14:30:03,004 - root - INFO - Model fully initialized via reset_parameters
[rank0]:2024-03-27 14:30:03,004 - root - INFO - Gradient scaling not enabled
[rank0]:2024-03-27 14:30:03,005 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240327-1430
[rank0]:2024-03-27 14:30:03,642 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:/data/users/gnadathur/a/pytorch/torch/utils/checkpoint.py:144: UserWarning: Tensor arguments, excluding CPU tensors, are detected on at least two types of devices. Device state will only be saved for devices of a single device type, and the remaining devices will be ignored. Consequently, if any checkpointed functions involve randomness, this may result in incorrect gradients. (Note that if CUDA devices are among the devices detected, it will be prioritized; otherwise, the first device encountered will be selected.)
[rank0]:  warnings.warn(
[rank0]:2024-03-27 14:30:04,528 - root - INFO - step:  1  loss: 10.8502  memory:  5.71GiB(6.01%)  wps: 9,259  mfu: 0.11%
[rank0]:2024-03-27 14:30:04,528 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank0]:2024-03-27 14:30:04,679 - root - INFO - step:  2  loss: 10.7671  memory:  6.69GiB(7.04%)  wps: 54,430  mfu: 0.67%
[rank0]:2024-03-27 14:30:04,773 - root - INFO - step:  3  loss: 10.6390  memory:  6.69GiB(7.04%)  wps: 88,457  mfu: 1.08%
[rank0]:2024-03-27 14:30:04,864 - root - INFO - step:  4  loss: 10.4210  memory:  6.69GiB(7.04%)  wps: 90,384  mfu: 1.11%
[rank0]:2024-03-27 14:30:04,954 - root - INFO - step:  5  loss: 10.1648  memory:  6.69GiB(7.04%)  wps: 93,058  mfu: 1.14%
[rank0]:2024-03-27 14:30:05,067 - root - INFO - step:  6  loss:  9.9451  memory:  6.69GiB(7.04%)  wps: 72,642  mfu: 0.89%
[rank0]:2024-03-27 14:30:05,165 - root - INFO - step:  7  loss:  9.7004  memory:  6.69GiB(7.04%)  wps: 85,096  mfu: 1.04%
[rank0]:2024-03-27 14:30:05,251 - root - INFO - step:  8  loss:  9.4422  memory:  6.69GiB(7.04%)  wps: 95,860  mfu: 1.17%
[rank0]:2024-03-27 14:30:05,399 - root - INFO - step:  9  loss:  9.2144  memory:  6.69GiB(7.04%)  wps: 55,837  mfu: 0.68%
[rank0]:[rank0]:[W327 14:30:05.148473462 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-03-27 14:30:05,496 - root - INFO - step: 10  loss:  9.1710  memory:  6.69GiB(7.04%)  wps: 86,136  mfu: 1.05%
[rank0]:NCCL version 2.20.5+cuda12.0
```

Reviewers:

Subscribers:

Tasks:

Tags:

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

* Used per-parameter FSDP (#165)

**Numeric Parity**
1D FSDP
- Eager: 1k steps of minipile on 8 H100 GPUs, local batch size 8,
sequence length 2048, AC/SAC, bf16 mixed precision, fp32 reduce-scatter
- FSDP1 (AC): 24.81% peak active, 33.82% peak reserved, 6100-6200 WPS
- FSDP1 (SAC): 52.98% peak active, 67.23% peak reserved, 6500-6700 WPS
- FSDP2 (AC): 23.92% peak active, 32.64% peak reserved, 6100-6300 WPS
- FSDP2 (SAC): 52.13% peak active, 62.51% peak reserved, 6600-6800 WPS
    - Loss curves match between FSDP1 and FSDP2
- Memory numbers reported as percentage since that is how they are
logged; can convert against 95.0396 GiB GPU memory
- Compile: same setup as eager
- FSDP2 (AC), buffer reuse disabled: 28.72 GiB (30.22%) peak reserved,
7200-7500 WPS, 33% MFU
- FSDP2 (AC), buffer reuse enabled: 28.90 GiB (30.40%) peak reserved,
7200-7500 WPS, 33% MFU
- FSDP2 (SAC), buffer reuse enabled: 53.83 GiB (56.64%) peak reserved,
8100-8400 WPS, 36% MFU
    - Loss curves slightly better than eager
    - For fun -- how much can we push MFU?
- If we use FSDP2 (SAC) with 16 local batch size (doubled), we get 88.23
GiB (92.84%) peak reserved, 8600 WPS, 38% MFU.
- If we use FSDP2 (no AC) with 8 local batch size, we get 90.28 GiB
(94.99%) peak reserved, 9100-9300 WPS, 40% MFU.
- Why is FSDP2 faster? (1) fp32 reduce-scatter only uses one div kernel
instead of two, and (2) `reshard_after_forward=False` for the last
transformer block (see the sketch after the loss curves)

2D FSDP
- Eager (2-way SP, 4-way FSDP): 1k steps of minipile on 8 H100 GPUs,
local batch size 16 (to preserve global batch size), sequence length
2048, bf16 mixed precision, fp32 reduce-scatter
- FSDP2 (AC): 50.12% peak active, 60.97% peak reserved, 5800-5900 WPS
- FSDP2 (SAC): 76.49% peak active, 90.14% peak reserved, 6100-6300 WPS
- Loss curves match 8-way FSDP
- FSDP1 + SP has incorrect numerics due to the `FSDP.clip_grad_norm_`
not all-reducing over TP mesh dimension

<details>
<summary> Loss curves </summary>

<img width="732" alt="Screenshot 2024-03-26 at 3 31 19 PM"
src="https://github.com/pytorch/torchtrain/assets/31054793/59ec71cc-ad0a-4dd1-b5c6-a8cbf9ab5e85">

</details>
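
As referenced above, a hedged sketch of applying per-parameter FSDP with bf16 params, fp32 reduce-scatter, and `reshard_after_forward=False` on the last block (API names are from the `torch.distributed._composable.fsdp` prototype; the model/mesh wiring is illustrative):

```
import torch
from torch.distributed._composable.fsdp import fully_shard, MixedPrecisionPolicy

def apply_fsdp2(model, dp_mesh):
    mp_policy = MixedPrecisionPolicy(
        param_dtype=torch.bfloat16,   # bf16 mixed precision
        reduce_dtype=torch.float32,   # fp32 reduce-scatter
    )
    n_layers = len(model.layers)
    for layer_id, block in enumerate(model.layers):
        # the last block's params are needed right away in backward, so
        # skipping the reshard saves an extra all-gather
        reshard = layer_id < n_layers - 1
        fully_shard(block, mesh=dp_mesh, mp_policy=mp_policy,
                    reshard_after_forward=reshard)
    fully_shard(model, mesh=dp_mesh, mp_policy=mp_policy)
    return model
```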


**Meta-Device Initialization**
- The PyTorch Core guideline is for `module.reset_parameters()` to only
initialize parameters/buffers immediately owned by `module` (i.e.
`module.parameters(recurse=False)` and `module.buffers(recurse=False)`).
- This makes it challenging to specify custom initializations for core
modules like `nn.Linear` and `nn.Embedding`. For example, in
@lessw2020's depth-wise truncated normal initialization, the
`trunc_normal_` standard deviation depends on the layer ID, which is a
property of the `TransformerBlock` but affects the child `nn.Linear`s.
- To disambiguate, I suggest avoiding the name `reset_parameters()` in
the case that we violate the PyTorch Core guideline and instead use a
different name (e.g. `init_weights`).
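
A small sketch of the suggested `init_weights` direction, where the block initializes its own children so depth-dependent scaling can reach them (the 0.02 base std and module names are illustrative):

```
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, dim: int):
        super().__init__()
        self.layer_id = layer_id
        self.wo = nn.Linear(dim, dim, bias=False)  # stand-in for attn/MLP proj

    def init_weights(self):
        # depth-dependent std: deeper layers get a smaller init, a property
        # of the block that plain nn.Linear.reset_parameters() cannot know
        std = 0.02 / (2 * (self.layer_id + 1)) ** 0.5
        nn.init.trunc_normal_(self.wo.weight, mean=0.0, std=std)
```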

**DCP & Save/Load**
- Tested 1D and 2D by specifying `checkpoint_folder =
"/tmp/checkpoint_andgu` in the `.toml`, training until saving a
checkpoint, terminating the run, and restarting the training to load the
checkpoint -- the loss after loading looks reasonable

* plot losses in loaded TrainState to TensorBoard

ghstack-source-id: f13612ce1f739219c31aa2b9222259f9f586126b
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/173

* Removed setting global flag for `swap_tensors` since not needed anymore

ghstack-source-id: 484237b30ba8bf8bb9e7a9cf2c97180d9fb21295
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/178

* Add integration test with compile enabled (#183)

Summary:
same as title

Test Plan:
```

+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model_compile.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model_compile.toml
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757]
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] *****************************************
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0401 17:54:33.567000 139955931223040 torch/distributed/run.py:757] *****************************************
[rank0]:2024-04-01 17:54:35,779 - root - INFO - Starting job: LLaMA debug training
[rank1]:2024-04-01 17:54:35,797 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-04-01 17:54:36,063 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-04-01 17:54:36,069 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-04-01 17:54:36,071 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-04-01 17:54:36,078 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-04-01 17:54:36,078 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank1]:2024-04-01 17:54:36,449 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank1]:2024-04-01 17:54:36,454 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank1]:2024-04-01 17:54:36,456 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-04-01 17:54:36,463 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank1]:2024-04-01 17:54:36,463 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-04-01 17:54:37,631 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-04-01 17:54:37,643 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank0]:2024-04-01 17:54:37,644 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-04-01 17:54:37,653 - root - INFO - Applied FSDP to the model
[rank1]:2024-04-01 17:54:38,310 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank1]:2024-04-01 17:54:38,324 - root - INFO - Model llama debugmodel size: 18,089,216 total parameters
[rank1]:2024-04-01 17:54:38,325 - root - INFO - GPU capacity: NVIDIA H100 (1) with 95.04GiB memory
[rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied selective activation checkpointing to the model
[rank1]:2024-04-01 17:54:38,335 - root - INFO - Applied FSDP to the model
[rank1]:2024-04-01 17:54:38,699 - root - INFO - Gradient scaling not enabled
[rank1]:2024-04-01 17:54:38,699 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754
[rank1]:2024-04-01 17:54:38,701 - root - INFO - Compiling model with torch.compile
[rank0]:2024-04-01 17:54:38,692 - root - INFO - Gradient scaling not enabled
[rank0]:2024-04-01 17:54:38,693 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240401-1754
[rank0]:2024-04-01 17:54:38,694 - root - INFO - Compiling model with torch.compile
[rank0]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank1]:2024-04-01 17:54:39,390 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank1]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank1]:  warnings.warn(
[rank0]:/data/users/gnadathur/a/pytorch/torch/_inductor/lowering.py:1789: UserWarning: Torchinductor does not support code generation for complex operators. Performance may be worse than eager.
[rank0]:  warnings.warn(
[rank1]:2024-04-01 17:54:40,498 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:40,493 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:41,992 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:41,985 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:42,180 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:42,187 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:43,947 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:43,963 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:43,971 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:43,920 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:43,951 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:43,974 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:44,029 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:44,033 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:45,907 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:45,933 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:47,561 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:47,667 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:47,649 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:47,706 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:49,084 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:49,108 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:49,110 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:49,086 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:49,114 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:49,131 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:50,546 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:50,638 - root - INFO - running build_ext
[rank0]:2024-04-01 17:54:51,901 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:52,025 - root - INFO - running build_ext
[rank1]:2024-04-01 17:54:52,734 - root - INFO - step:  1  loss: 10.9746  memory:  9.53GiB(10.03%)  wps: 1,228  mfu: 0.02%
[rank1]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank1]:2024-04-01 17:54:52,813 - root - INFO - step:  2  loss: 10.9091  memory:  9.54GiB(10.03%)  wps: 208,739  mfu: 2.56%
[rank0]:2024-04-01 17:54:52,734 - root - INFO - step:  1  loss: 10.9746  memory:  9.53GiB(10.03%)  wps: 1,228  mfu: 0.02%
[rank0]:2024-04-01 17:54:52,734 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:00:05
[rank0]:2024-04-01 17:54:52,813 - root - INFO - step:  2  loss: 10.9091  memory:  9.54GiB(10.03%)  wps: 208,501  mfu: 2.55%
[rank1]:2024-04-01 17:54:52,889 - root - INFO - step:  3  loss: 10.7722  memory:  9.54GiB(10.03%)  wps: 219,416  mfu: 2.69%
[rank0]:2024-04-01 17:54:52,889 - root - INFO - step:  3  loss: 10.7722  memory:  9.54GiB(10.03%)  wps: 219,182  mfu: 2.68%
[rank1]:2024-04-01 17:54:52,965 - root - INFO - step:  4  loss: 10.5428  memory:  9.54GiB(10.03%)  wps: 218,226  mfu: 2.67%
[rank0]:2024-04-01 17:54:52,965 - root - INFO - step:  4  loss: 10.5428  memory:  9.54GiB(10.03%)  wps: 218,015  mfu: 2.67%
[rank1]:2024-04-01 17:54:53,045 - root - INFO - step:  5  loss: 10.3063  memory:  9.54GiB(10.03%)  wps: 207,094  mfu: 2.54%
[rank0]:2024-04-01 17:54:53,045 - root - INFO - step:  5  loss: 10.3063  memory:  9.54GiB(10.03%)  wps: 207,220  mfu: 2.54%
[rank1]:2024-04-01 17:54:53,123 - root - INFO - step:  6  loss: 10.0707  memory:  9.54GiB(10.03%)  wps: 210,814  mfu: 2.58%
[rank1]:2024-04-01 17:54:53,202 - root - INFO - step:  7  loss:  9.8302  memory:  9.54GiB(10.03%)  wps: 209,649  mfu: 2.57%
[rank0]:2024-04-01 17:54:53,123 - root - INFO - step:  6  loss: 10.0707  memory:  9.54GiB(10.03%)  wps: 210,849  mfu: 2.58%
[rank0]:2024-04-01 17:54:53,202 - root - INFO - step:  7  loss:  9.8302  memory:  9.54GiB(10.03%)  wps: 209,542  mfu: 2.57%
[rank0]:2024-04-01 17:54:53,281 - root - INFO - step:  8  loss:  9.5918  memory:  9.54GiB(10.03%)  wps: 211,690  mfu: 2.59%
[rank1]:2024-04-01 17:54:53,281 - root - INFO - step:  8  loss:  9.5918  memory:  9.54GiB(10.03%)  wps: 211,786  mfu: 2.59%
[rank1]:2024-04-01 17:54:53,412 - root - INFO - step:  9  loss:  9.4299  memory:  9.54GiB(10.03%)  wps: 125,833  mfu: 1.54%
[rank1]:[rank1]:[W401 17:54:53.242673953 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:2024-04-01 17:54:53,412 - root - INFO - step:  9  loss:  9.4299  memory:  9.54GiB(10.03%)  wps: 125,765  mfu: 1.54%
[rank0]:[rank0]:[W401 17:54:53.240925776 CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:2024-04-01 17:54:53,492 - root - INFO - step: 10  loss:  9.2955  memory:  9.54GiB(10.03%)  wps: 207,661  mfu: 2.54%
[rank0]:2024-04-01 17:54:53,492 - root - INFO - step: 10  loss:  9.2955  memory:  9.54GiB(10.03%)  wps: 207,426  mfu: 2.54%
[rank0]:NCCL version 2.20.5+cuda12.0
```

Reviewers:

Subscribers:

Tasks:

Tags:

---------

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

* remove folding and unfolding of sequence dim in model.py

ghstack-source-id: 5d299adcd766baad6a36e63be4acc01fb2fd36db
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/190

* bump comm.train_timeout_seconds (#189)

this PR bumps this default config to a larger value: profiling is a
pretty heavy step, and a 5-second default would likely trigger the
watchdog unintentionally

* fix checkpoint parser

ghstack-source-id: 47ee7b5e2228705e5215195ac9ff13e1b168f93e
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/197

* support sequence of tests and add checkpoint test

address comments

ghstack-source-id: 7d6c51a5ef68dea06ba7d64741a554165c79f1d3
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/198

* Make freqs_cis a persistent buffer for pp init

Currently, the plan is to use a 'seed checkpoint' to initialize the
pipeline-parallel model chunks after moving them from the meta device to
cuda/empty.

Non-persistent buffers are incompatible with this approach, as they are
missing from the checkpoint and thus require manual init.

An alternative is to manually run the initializer for just the
non-persistent buffers after loading a seed checkpoint, but this
approach is nearly equivalent with fewer code changes.
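
A minimal sketch of the change (the holder class and precompute helper are illustrative; the point is register_buffer's persistent flag):

```
import torch
import torch.nn as nn

class RotaryEmbedding(nn.Module):  # illustrative holder, not the repo's class
    def __init__(self, head_dim: int, max_seq_len: int):
        super().__init__()
        freqs_cis = self._precompute_freqs_cis(head_dim, max_seq_len)
        # persistent=True (the change): the buffer now lands in state_dict(),
        # so a seed checkpoint can restore it after meta-device init
        self.register_buffer("freqs_cis", freqs_cis, persistent=True)

    @staticmethod
    def _precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
        freqs = 1.0 / (theta ** (torch.arange(0, dim, 2).float() / dim))
        angles = torch.outer(torch.arange(end).float(), freqs)
        return torch.polar(torch.ones_like(angles), angles)
```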

ghstack-source-id: b48228488d4c3924fffef4237f4106383c14a934
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/201

* Delete grad scaler, which is unsupported/unused

grad scaler currently doesn't work with FSDP2, and isn't enabled anyway
because bf16 training is the norm and doesn't require it.

Remove it for simplicity. It will be easier to enable pipeline
parallelism with a simpler loss function setup, but if desired, it's
still possible to support pipeline parallelism with the scaler added
back in.

ghstack-source-id: 82b0e4324eac88ee62723a6d832182d4e6c76e0f
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/202

* Factor out loss_fn to share code with pipeline par

PP requires feeding a loss_fn into the schedule's step so that loss can
be computed per microbatch as part of the forward/backward scheduling.

As such, it is nice to define the loss once and use it both in the non-PP
code that manually calls forward/loss/backward and in the PP step().
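
A minimal sketch of such a shared loss_fn (the signature and flattening are illustrative):

```
import torch.nn.functional as F

def loss_fn(pred, labels):
    # pred: [batch, seq, vocab], labels: [batch, seq]
    return F.cross_entropy(pred.flatten(0, 1).float(), labels.flatten(0, 1))
```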

ghstack-source-id: 9bedd5103e23627d5e268c287d49f0759442ba12
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/203

* [TorchTrain] Minor fix for #197 (#204)

The changes made in the GitHub editor didn't go in when doing ghstack land.

* Add FusedRMSNorm (Triton kernel, +15% eager), Add NPLayerNorm, Enable config selectable Norm Type (#181)

This PR has multiple aspects:
1 - Adds a new Triton-based fused RMSNorm I wrote. I've verified its
numerical accuracy on both forward and backward with a unit test.
It improves MFU by +15% with FSDP2 7B in eager, and slightly (+1.2%) when compiled:
<img width="545" alt="Screenshot 2024-03-29 at 5 18 14 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/8f16fae9-947b-4720-a370-b954779c33a7">

2 - Adds norms.py to house all 4 norm types, and standardizes to
[layernorm / np_layernorm / rmsnorm / fused_rmsnorm]. Norms.py has a
create_norms function that then creates the appropriate norm.

3 - Adds np_layernorm, which is layernorm with no affine transformation.

4 - Updates model.py to now support plug and play of any supported norm.

Thus instead of this type of if/then logic in the model class:
<img width="928" alt="Screenshot 2024-03-30 at 1 52 07 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/ba7cb976-580f-4471-a79b-a584f7d20693">

We simply have this:
<img width="1129" alt="Screenshot 2024-03-30 at 1 52 23 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/aba48b4d-1620-4059-840d-e620468f00f2">

This then allows for easy plug and play of any norm type with no
fiddling around in the model code.
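
A hedged sketch of such a factory (the function name, and the plain-PyTorch RMSNorm standing in for the Triton kernel, are illustrative):

```
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.float().pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return (x.float() * rms).type_as(x) * self.weight

def build_norm(norm_type: str, dim: int, eps: float = 1e-6) -> nn.Module:
    norm_type = norm_type.lower()
    if norm_type == "layernorm":
        return nn.LayerNorm(dim, eps=eps)
    if norm_type == "np_layernorm":
        # layernorm without the learnable affine transform
        return nn.LayerNorm(dim, eps=eps, elementwise_affine=False)
    if norm_type == "rmsnorm":
        return RMSNorm(dim, eps=eps)
    raise NotImplementedError(f"Unknown norm_type: '{norm_type}'")
```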

5 - updates run_llama_train.sh to randomly select a port vs previous
fixed port number. (thanks @yifuwang for this tip!)


6 - Now users can quickly select the norm of their choice via the config
file:
<img width="774" alt="Screenshot 2024-03-30 at 3 01 43 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/3238b375-dc21-4ee2-a5fa-f6571da79edb">

7 - adds a NotImplementedError if users try to run TP + fused_rmsnorm, to
avoid any confusion (per @tianyu-l's feedback):
~~~
NotImplementedError: fused_rmsnorm not yet compatible with TP. Please
use rmsnorm.
~~~
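
A minimal, hypothetical sketch of the factory pattern described in item 2 above (names mirror the description and are not necessarily the actual norms.py API; the Triton fused_rmsnorm kernel is omitted):

```
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # normalize by the root-mean-square of the last dim, then scale
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

def create_norms(norm_type: str, dim: int, eps: float = 1e-6) -> nn.Module:
    if norm_type == "layernorm":
        return nn.LayerNorm(dim, eps=eps)
    if norm_type == "np_layernorm":  # layernorm with no affine transformation
        return nn.LayerNorm(dim, eps=eps, elementwise_affine=False)
    if norm_type == "rmsnorm":
        return RMSNorm(dim, eps=eps)
    raise NotImplementedError(f"unsupported norm_type: {norm_type}")

norm = create_norms("rmsnorm", dim=256)
print(norm(torch.randn(2, 8, 256)).shape)  # torch.Size([2, 8, 256])
```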

* remove .item() per iter

ghstack-source-id: ab29c214604fd76cefdfe70149ecf07a2e03103e
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/206
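
A hedged illustration of the likely rationale (generic code, not the actual loop): calling `.item()` every iteration forces a GPU-to-CPU sync, while keeping losses on device and syncing only at logging time avoids the stall.

```
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
losses = []
for step in range(100):
    loss = torch.randn((), device=device)        # stand-in for the real loss
    losses.append(loss.detach())                 # no .item(): stays on device
    if (step + 1) % 10 == 0:
        avg = torch.stack(losses).mean().item()  # one sync per log interval
        print(f"step {step + 1}: avg loss {avg:.4f}")
        losses.clear()
```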

* Removed cache_k and cache_v comments

ghstack-source-id: 8bc66c683a801189b152b0ef4301579ec1ec17e7
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/213

* Some more cleanups

ghstack-source-id: a53cbbecc35eac2a62d8ebc241462ac418666336
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/212

* avoid record streams and make color printing a config

ghstack-source-id: 1c7cb2710330ec3fb2384793b5ad77c65b107cbc
Pull Request resolved: https://github.com/pytorch/torchtrain/pull/195

* fix SAC to use the correct reduce_scatter op (#215)

As titled: we migrated to the native functional collectives, so the SAC
save list should capture the new reduce_scatter op instead of the old one.
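
A hedged sketch of what such a save list looks like (the op paths are my assumption about the relevant PyTorch versions, and the policy function is schematic rather than a specific checkpoint API):

```
import torch

aten = torch.ops.aten
save_list = {
    aten.mm.default,
    # after migrating to native functional collectives, SAC must match the
    # new reduce_scatter op rather than the legacy c10d_functional one:
    torch.ops._c10d_functional.reduce_scatter_tensor.default,
}

def save_policy(op) -> bool:
    # save (don't recompute) the outputs of ops in the save list
    return op in save_list
```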

* Test runner raises exception on failures (#216)

Summary: The test runner should raise an exception on failures.

Test Plan: 

```
=====Integration test, flavor : , command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh  =====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ overrides=
+ '[' 0 -ne 0 ']'

=====Integration test, flavor : 1D compile, command : CONFIG_FILE=./train_configs/debug_model.toml NGPU=4 ./run_llama_train.sh --training.compile=====
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=--training.compile
+ NGPU=4
+ LOG_RANK=0
+ CONFIG_FILE=./train_configs/debug_model.toml
+ overrides=
+ '[' 1 -ne 0 ']'
+ overrides=--training.compile
+ torchrun --nproc_per_node=4 --rdzv_backend c10d --rdzv_endpoint=localhost:0 --local-ranks-filter 0 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml --training.compile
W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757]
W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] *****************************************
W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0410 13:32:42.926000 139839630783488 torch/distributed/run.py:757] *****************************************
[rank0]:2024-04-10 13:32:45,243 - root - INFO - Starting job: LLaMA debug training
[rank0]:2024-04-10 13:32:45,676 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2024-04-10 13:32:46,028 - root - INFO - Building 1-D device mesh with ['dp'], [4]
[rank0]:2024-04-10 13:32:46,030 - root - INFO - Building sentencepiece tokenizer locally from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-04-10 13:32:46,038 - root - INFO - SentencePieceTokenizer built: #words 32000, BOS ID 1, EOS ID 2
[rank0]:2024-04-10 13:32:46,038 - root - INFO - Preparing alpaca dataset from HuggingFace
[rank0]:2024-04-10 13:32:47,813 - root - INFO - Building llama debugmodel with ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True, norm_type='fused_rmsnorm')
[rank0]:2024-04-10 13:32:47,826 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m
[rank0]:2024-04-10 13:32:47,826 - root - INFO - GPU capacity: NVIDIA H100 (0) with 95.04GiB memory
[rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2024-04-10 13:32:47,836 - root - INFO - Applied FSDP to the model
[rank0]:2024-04-10 13:32:48,582 - root - INFO - GPU memory usage for model: 0.04GiB(0.05%)
[rank0]:2024-04-10 13:32:48,582 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240410-1332
[rank0]:2024-04-10 13:32:48,584 - root - INFO - Compiling model with torch.compile
[rank0]:2024-04-10 13:32:49,384 - root - INFO - Training starts at step 1
[rank0]:2024-04-10 13:32:49,385 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces
[rank0]:[rank0]:W0410 13:32:49.487000 139672077292544 torch/_logging/_internal.py:1016] [0/0] Profiler function <class 'torch.autograd.profiler.record_function'> will be ignored
[rank0]:[rank0]: Traceback (most recent call last):
[rank0]:[rank0]:   File "/data/users/gnadathur/a/torchtitan/train.py", line 394, in <module>
[rank0]:[rank0]:     main(config)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
[rank0]:[rank0]:     return f(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/torchtitan/train.py", line 287, in main
[rank0]:[rank0]:     pred = model(input_ids)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1541, in _call_impl
[rank0]:[rank0]:     return forward_call(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/eval_frame.py", line 410, in _fn
[rank0]:[rank0]:     return fn(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
[rank0]:[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/nn/modules/module.py", line 1582, in _call_impl
[rank0]:[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 966, in catch_errors
[rank0]:[rank0]:     return callback(frame, cache_entry, hooks, frame_state, skip=1)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 809, in _convert_frame
[rank0]:[rank0]:     result = inner_convert(
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 404, in _convert_frame_assert
[rank0]:[rank0]:     return _compile(
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_utils_internal.py", line 70, in wrapper_function
[rank0]:[rank0]:     return function(*args, **kwargs)
[rank0]:[rank0]:   File "/home/gnadathur/local/a/pytorch-env/lib/python3.10/contextlib.py", line 79, in inner
[rank0]:[rank0]:     return func(*args, **kwds)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 691, in _compile
[rank0]:[rank0]:     guarded_code = compile_inner(code, one_graph, hooks, transform)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/utils.py", line 265, in time_wrapper
[rank0]:[rank0]:     r = func(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 546, in compile_inner
[rank0]:[rank0]:     out_code = transform_code_object(code, transform)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/bytecode_transformation.py", line 1103, in transform_code_object
[rank0]:[rank0]:     transformations(instructions, code_options)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 168, in _fn
[rank0]:[rank0]:     return fn(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/gnadathur/a/pytorch/torch/_dynamo/convert_frame.py", line 508, in transform
[rank0]:[rank0]:     tracer.run()
[rank0]:[rank0]:   File "/data/u…
```
1 parent 7aa3e80 commit 169d1c7
Showing 21 changed files with 349 additions and 365 deletions.
3 changes: 1 addition & 2 deletions .ci/docker/requirements.txt
@@ -1,6 +1,5 @@
torch >= 2.3.0
torchdata >= 0.8.0
datasets >= 2.19.0
datasets >= 2.21.0
tomli >= 1.1.0 ; python_version < "3.11"
tensorboard
sentencepiece
1 change: 0 additions & 1 deletion .github/workflows/integration_test_4gpu.yaml
@@ -38,6 +38,5 @@ jobs:
pip config --user set global.progress_bar off
python -m pip install --force-reinstall --pre torch --index-url https://download.pytorch.org/whl/nightly/cu121
USE_CPP=0 python -m pip install git+https://github.com/pytorch/ao.git
mkdir artifacts-to-be-uploaded
python ./test_runner.py artifacts-to-be-uploaded --ngpu 4
66 changes: 60 additions & 6 deletions CONTRIBUTING.md
@@ -1,13 +1,13 @@
# Contributing to torchtitan
We want to make contributing to this project as easy and transparent as
possible.
possible. Contributions should follow the [Contributing Guidelines](#contributing-guidelines) below.

## Setup
### Setup
```
pip install -r dev-requirements.txt
```

## Pull Requests
### Pull Requests
We actively welcome your pull requests.

1. Fork the repo and create your branch from `main`.
@@ -17,20 +17,74 @@ We actively welcome your pull requests.
5. Make sure your code lints (`pre-commit run --all-files`).
6. If you haven't already, complete the Contributor License Agreement ("CLA").

## Contributor License Agreement ("CLA")
### Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Meta's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues
### Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.

Meta has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## License
### License
By contributing to `torchtitan`, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.

---

## Contributing Guidelines

### Principles of contribution

- Apply PyTorch-native training techniques.
- The technique should be of general interests for distributed training.
- A technique with moderate to large complexity should be sitting in the proper repo (e.g. pytorch/pytorch for a new parallelism, or pytorch/data for a new data loader) instead of `torchtitan`.
- The main branch of `torchtitan` should have minimal dependency on non-PyTorch libraries. Interesting models/techniques that depend on external libraries can be demonstrated in forks of `torchtitan`.
- Aim for minimum (if not zero) code change to the model. For the Llama model in `torchtitan`, if one has to make (justifiable) model change:
- After the model change, it should still load the original checkpoint correctly.
- Document the reasons for the code change, similar to [composability.md](docs/composability.md).
- Keep code modularized, especially for [train.py](train.py), so that it remains easy to copy-paste into a minimal code example. If necessary:
- Introduce new config options/category in [config_manager.py](torchtitan/config_manager.py).
- Create separate functions/files.

### Proof of Value

It is the contributor’s responsibility to justify the change. The requirements include, but are not limited to:

#### Loss

- If a change does not impact computation results, one should see identical loss before vs. after, with fixed random seeds. An example is activation checkpointing.
```
import torch

seed = 0
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```
- If a change is expected to impact computation results, loss convergence should be verified via end-to-end training on representative datasets (e.g. Llama 3 models on the C4 dataset). One can refer to the example jobs reported in [performance.md](docs/performance.md) on what configs and how many steps to run.
- The resulting loss curve should be compared with a verified baseline.
- 1D FSDP is preferred; its effectiveness can be proven by comparing with 1D DDP and single-GPU training.
- 2D FSDP + TP can be used as the baseline when 1D FSDP does not suffice to make comparisons due to limited scalability. For example, this should be the baseline when experimenting with 3D parallelisms on the Llama 3.1 405B model.

#### Performance
- Memory and WPS / MFU, which are available from logging, should meet expectations.
- It is worth noting that performance expectations vary from case to case. For example, there are cases when a technique targeting memory reduction may cause throughput regression but still be acceptable (e.g. activation checkpointing). Again, it is the contributor's job to justify the feature, whether by achieving hypothetical performance, or by comparing with existing well-known implementations, etc.
- If necessary, verify the numbers on jobs spanning multiple nodes (e.g. on 64 GPUs). Please reach out to the `torchtitan` team for help if you are resource-constrained.
- When appropriate, one should show profile traces and/or memory snapshots to prove the effectiveness.

### Best practices

When appropriate, one should consider

- Adding CPU/GPU unit/integration tests.
- To add a unit test, put it in the [test](test/) folder and follow the existing test files.
- To add a GPU integration test, create a new `OverrideDefinitions` in [test_runner.py](test_runner.py). It will override the default config to run on the [debug model](train_configs/debug_model.toml).
- Updating [README](README.md) and writing a new note in the [docs](docs/) folder on installation and usage, similar to [float8.md](docs/float8.md).
- Updating [performance.md](docs/performance.md) with new performance results.
- Creating GitHub issues for things that cannot be addressed at the moment.
- Writing a post on [PyTorch Dev Discussions](https://dev-discuss.pytorch.org/c/distributed/6) forum and linking to it.
37 changes: 18 additions & 19 deletions README.md
@@ -21,12 +21,12 @@ Our guiding principles when building `torchtitan`:
### Dive into the code

You may want to see how the model is defined or how parallelism techniques are applied. For a guided tour, see these files first:
* [train.py](https://github.com/pytorch/torchtitan/blob/main/train.py) - the main training loop and high-level setup code
* [torchtitan/parallelisms/parallelize_llama.py](https://github.com/pytorch/torchtitan/blob/main/torchtitan/parallelisms/parallelize_llama.py) - helpers for applying Data Parallel, Tensor Parallel, activation checkpointing, and `torch.compile` to the model
* [torchtitan/parallelisms/pipeline_llama.py](https://github.com/pytorch/torchtitan/blob/main/torchtitan/parallelisms/pipeline_llama.py) - helpers for applying Pipeline Parallel to the model
* [torchtitan/checkpoint.py](https://github.com/pytorch/torchtitan/blob/main/torchtitan/checkpoint.py) - utils for saving/loading distributed checkpoints
* [torchtitan/float8.py](https://github.com/pytorch/torchtitan/blob/main/torchtitan/float8.py) - utils for applying Float8 techniques
* [torchtitan/models/llama/model.py](https://github.com/pytorch/torchtitan/blob/main/torchtitan/models/llama/model.py) - the Llama model definition (shared for Llama2 and Llama3 variants)
* [train.py](train.py) - the main training loop and high-level setup code
* [torchtitan/parallelisms/parallelize_llama.py](torchtitan/parallelisms/parallelize_llama.py) - helpers for applying Data Parallel, Tensor Parallel, activation checkpointing, and `torch.compile` to the model
* [torchtitan/parallelisms/pipeline_llama.py](torchtitan/parallelisms/pipeline_llama.py) - helpers for applying Pipeline Parallel to the model
* [torchtitan/checkpoint.py](torchtitan/checkpoint.py) - utils for saving/loading distributed checkpoints
* [torchtitan/float8.py](torchtitan/float8.py) - utils for applying Float8 techniques
* [torchtitan/models/llama/model.py](torchtitan/models/llama/model.py) - the Llama model definition (shared for Llama2 and Llama3 variants)

## Pre-Release Updates:
#### (4/25/2024): `torchtitan` is now public but in a pre-release state and under development.
@@ -35,26 +35,25 @@ Currently we showcase pre-training **Llama 3 and Llama 2** LLMs of various sizes
### Key features available

1. [FSDP2 with per param sharding](docs/fsdp.md)
2. [Tensor Parallel](https://pytorch.org/docs/stable/distributed.tensor.parallel.html)
2. [Tensor Parallel](https://pytorch.org/docs/stable/distributed.tensor.parallel.html) (including async TP)
3. Selective layer and operator activation checkpointing
4. Distributed checkpointing
5. 2 datasets pre-configured (45K - 144M)
6. GPU usage, MFU, tokens per second and more displayed via TensorBoard
6. Learning rate scheduler, meta init, Optional Fused RMSNorm
7. All options easily configured via [toml files](train_configs/)
8. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine tuning
9. [Float8 support](docs/float8.md)
4. Distributed checkpointing (including async checkpointing)
5. Checkpointable data-loading, with the C4 dataset pre-configured (144M entries)
6. Loss, GPU memory, tokens-per-second, and MFU displayed and logged via TensorBoard
7. Learning rate scheduler, meta-init, optional Fused RMSNorm
8. [Float8 support](docs/float8.md)
9. `torch.compile` support
10. All options easily configured via [toml files](train_configs/)
11. [Interoperable checkpoints](docs/checkpoint.md) which can be loaded directly into [`torchtune`](https://github.com/pytorch/torchtune) for fine tuning

We report our [Performance](docs/performance.md) verified on 64 A100 GPUs


### Coming soon

1. Async checkpointing
2. Context Parallel
3. 3D Pipeline Parallel
4. `torch.compile` support
5. Scalable data loading solution
1. Context Parallel
2. Pipeline Parallel (and 3D parallelism)
3. HSDP


## Installation
Binary file added assets/images/llama3_1_405B_loss_curves.png
18 changes: 8 additions & 10 deletions docs/composability.md
@@ -1,23 +1,21 @@
# Building a Clean, Readable Distributed LLM
One of the main goals for TorchTitan was to provide a version of distributed LLM that was not only high performance, but utilized native pytorch techniques and readable code. The challenge is how to compose together so many individual library components (FSDP, TP, PP, FP8, Compile, DCP, ...) just to name a few, and avoid having to make too many changes to the model guts in the process. A lot of the work is behind the scenes, designing individual components to make fewer assumptions, use common abstractions (e.g. DTensor) and generally 'get along'. But we found a few tweaks to the model code invaluable as well, and wanted to share those changes and the rationale for them.
One of the main goals for torchtitan was to provide a version of distributed LLM that was not only high performance, but utilized native PyTorch techniques and readable code. The challenge is how to compose together so many individual library components (FSDP, TP, PP, Float8, Compile, DCP, ..., just to name a few), and avoid having to make too many changes to the model guts in the process. A lot of the work is behind the scenes, designing individual components to make fewer assumptions, use common abstractions (e.g. DTensor) and generally "get along". But we found a few tweaks to the model code invaluable as well, and wanted to share those changes and the rationale for them.



# Making the model "pipeline friendly"
## Making the model "pipeline friendly"
When applying Pipeline Parallelism, you will have to construct nn.Module objects representing the portion of the model that runs on a given pipeline stage. Whether you plan to manually edit your model code, or use techniques like tracing to extract model chunks, a few changes to the original model code can go a long way to making this process easier.

### Simplifying the top-level model forward
Most likely, you can write your model in such a way that the top-level nn.Module owns a sequence of child modules that it calls during forward, delegating most of the complexity to the child module forwards. If you can reduce your top level forward to mostly a for-loop over child module calls, then you'll simplify the pipeline-partitioning task to choosing the set of submodules to keep per stage. If you have non-trivial logic in the top-level forward, you'll have to find a way to patch that logic back onto the resulting pipeline stage model, which can be annoying.

example ([PR #321](https://github.com/pytorch/torchtitan/pull/321)):
we used to slice the `freqs_cis` buffer by `seq_len` in the top level forward, pass that into child modules, and expect that inside the child modules the `seq_len` would match up with the size of other local tensors. But we don't know about whether TP was applied or not when we consider PP splitting and could create a mismatch. Its just as easy to perform the `freqs_cis` slicing inside the child submodule, using the runtime-accurate local `seq_len`, and this sidesteps the issue at PP slicing time.
Example ([PR #321](https://github.com/pytorch/torchtitan/pull/321)):
We used to slice the `freqs_cis` buffer by `seq_len` in the top level forward, pass that into child modules, and expect that inside the child modules the `seq_len` would match up with the size of other local tensors. But we don't know whether TP was applied when we consider PP splitting, so we could create a mismatch. It's just as easy to perform the `freqs_cis` slicing inside the child submodule, using the runtime-accurate local `seq_len`, and this sidesteps the issue at PP slicing time.

example ([PR #322])https://github.com/pytorch/torchtitan/pull/322)): We decided to actually reuse the top-level model object on every PP stage, just delete the layers we don't want, and make sure that the top-level forward would do the right thing. This means we don't have to make a separate runtime pp_forward that glues together child modules per stage. The first change was using a moduledict instead of modulelist to store layers. This preserves layer Fully Qualified Names (FQNs) even when deleting some layers - e.g. layers.1 stays layers.1 even if you remove layers.0, which isn't true for a list- this matters for checkpoint save/load. Preserving FQNs is a requirement for using Distributed Checkpointing (DCP) since it uses FQNs as globally unique IDs for sharding metadata. The second change was making the input and output layers optional- if the layer exists, we run it, otherwise we feed the input through to bypass it. With these two changes, we can just (meta)-initialize the whole model, delete the unused parts per stage, then materialize the remaining part on GPU before loading a checkpoint.
Example ([PR #322](https://github.com/pytorch/torchtitan/pull/322)):
We decided to actually reuse the top-level model object on every PP stage, just delete the layers we don't want, and make sure that the top-level forward would do the right thing. This means we don't have to make a separate runtime pp_forward that glues together child modules per stage. The first change was using a moduledict instead of modulelist to store layers. This preserves layer Fully Qualified Names (FQNs) even when deleting some layers - e.g. layers.1 stays layers.1 even if you remove layers.0, which isn't true for a list- this matters for checkpoint save/load. Preserving FQNs is a requirement for using Distributed Checkpointing (DCP) since it uses FQNs as globally unique IDs for sharding metadata. The second change was making the input and output layers optional- if the layer exists, we run it, otherwise we feed the input through to bypass it. With these two changes, we can just (meta)-initialize the whole model, delete the unused parts per stage, then materialize the remaining part on GPU before loading a checkpoint.
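
A small sketch of the ModuleDict point, with illustrative module names (not the actual torchtitan model): deleting a stage's unused layers keeps the surviving FQNs unchanged, which DCP relies on.

```
import torch.nn as nn

class TinyModel(nn.Module):
    def __init__(self, n_layers: int = 3):
        super().__init__()
        # keyed by string index, so FQNs are stable under deletion
        self.layers = nn.ModuleDict(
            {str(i): nn.Linear(4, 4) for i in range(n_layers)}
        )

m = TinyModel()
del m.layers["0"]  # drop a layer this PP stage doesn't own
print([name for name, _ in m.named_parameters()])
# ['layers.1.weight', 'layers.1.bias', 'layers.2.weight', 'layers.2.bias']
# with a ModuleList, the remaining layers would be renumbered 0 and 1.
```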

# Using a seed checkpoint for init
## Using a seed checkpoint for init
Initializing the pipeline-parallel model is challenging because we assume the model could be so large as to not fit on a local GPU (or possibly, even on CPU), and we also want to use the (bitwise) same initialization as we use for 1D or 2D parallel models, to ease debugging or comparisons between runs. It's not that easy to rewrite the original model's `init_weights` function to be tolerant of initializing only some layers, while also serializing initialization operations globally for a consistent RNG order.

For now, we sidestep all these problems with a simple but brutal solution: Initialize the whole model on some CPU instance, save a checkpoint file, and then lean on Distributed Checkpointing's "load" functionality to initialize the FQNs that are present on a given PP stage after stage creation. For future work, we consider adding a more elaborate initialization scheme to `torch.pipelining`.

One issue with seed checkpoints is that we rely on initializing _every_ model state from the checkpoint, which means the model can't have any non-persistent buffers, or else we have to specially initialize those in `train.py` after pipeline splitting. `freqs_cis` was originally a non-persistent buffer, and we changed this to persistent in order to load it from the seed checkpoint.

One issue with seed checkpoints is that we rely on initializing _every_ model state from the checkpoint, which means the model can't have any non-persistent buffers, or else we have to specially initialize those in [train.py](../train.py) after pipeline splitting. `freqs_cis` was originally a non-persistent buffer, and we changed this to persistent in order to load it from the seed checkpoint.
4 changes: 2 additions & 2 deletions docs/fsdp.md
@@ -49,7 +49,7 @@ def fully_shard(
- `fully_shard(module)` is similar to `FullyShardedDataParallel(module)`, constructing one communication bucket from `module.parameters()` except those already assigned to a nested `fully_shard`/`FullyShardedDataParallel` call.
- `fully_shard(module)` adds an `FSDPState` object on `module`, accessible via `fully_shard.state(module)`, instead of being an `nn.Module` wrapper. This is done via the `@contract` decorator.
- Calling `model.named_parameters()` for a `model` with FSDP2 applied returns unchanged parameter names and `DTensor` sharded parameters. This means that the optimizer and gradient norm clipping see `DTensor`s.
- `fully_shard(module)` performs a dynamic class swap on `module`. E.g., if `type(module) is Transformer`, then FSDP2 constructs a new class `FSDPTransformer` that inherits from a class `FSDP` and `Transformer` and sets `module.__class__` to be `FSDPTransformer`. This allows us to add new methods and override methods via `FSDP` without constructing an `nn.Module` wrapper.
- `fully_shard(module)` performs a dynamic class swap on `module`. E.g., if `type(module) is Transformer`, then FSDP2 constructs a new class `FSDPTransformer` that inherits from a class `FSDPModule` and `Transformer` and sets `module.__class__` to be `FSDPTransformer`. This allows us to add new methods and override methods via `FSDPModule` without constructing an `nn.Module` wrapper.
- FSDP1's `sharding_strategy` and `process_group`/`device_mesh` maps to FSDP2's `mesh` and `reshard_after_forward`.
- `mesh` should be 1D for FSDP and 2D for HSDP. For HSDP, we assume replication on the 0th mesh dim and sharding on the 1st mesh dim. If `mesh is None`, then FSDP2 initializes a 1D global mesh over the default process group.
- `reshard_after_forward=True` or `False` determines whether parameters are resharded (freed) after forward. If `True`, then they are re-all-gathered in backward. This trades off saving memory at the cost of extra communication.
@@ -106,7 +106,7 @@ fully_shard(model)
for tensor in itertools.chain(model.parameters(), model.buffers()):
assert tensor.device == torch.device("meta")
# Allocate buffers and sharded parameters on GPU
model.to_empty("cuda")
model.to_empty(device="cuda")
# Run user-defined initializers
model.init_weights() # or `model.apply(init_weights)`
```
