IBM experimental dataloaders #376

daviswer · 2024-05-31T07:27:31Z

This PR introduces an experimental PyTorch-native dataloader from IBM that is distributed, stateful, checkpointable, composable and rescalable. It is intended for use in large-scale model pretraining, particularly in research settings where rapid iteration between datasets may be required. It automatically and invisibly handles data sharding, shuffling, subdataset weighting, checkpoint saving and loading, and more, with minimal overhead and high throughput.

Add experimental dataset source file
Add experimental dataloader builder, hooked into torchtitan cfg
Update torchtitan cfg with additional dataset arg fields
Update train script to build experimental dataloader instead of hf depending on cfg flags
Replace the existing C4-mini example dataset with one that matches the expected formatting for the experimental dataloader
TODO: port over unit tests as well
TODO: preprocessing script(s) for the new dataset format
TODO: further cleanup/iteration

it's a small thing and can be download from OSS, we can just check in

This PR adds the following: 1 - via reset parameters, a full layerwise init for the llama models under /llama. This uses the total model depth as part of the init via: self.weight_init_std = 0.02 / (2 * self.num_layers) ** 0.5 2 - The final output ffn (head) is init with sqrt of the dim of the model itself and a slightly wider cutoff factor of 3. 3 - tangential change - updates run_llama_train.sh with updated MODEL and MODEL_CONF params to allow for direct model control via the sh script. (there was a MODEL already but it was incorrectly using that in place of MODEL_CONF...though we should update this as it's not intuitive). 4 - made the debugmodel default to 2 layers as an improved debug check. 5 - added a 1B and 40B for additional testing configs. I can't currently run 70B on my H100 due to OOM, but can run 40B. Testing: Verified proper init and training with 7B, 13B and ~40B: <img width="1085" alt="Screenshot 2024-02-11 at 10 39 12 PM" src="https://github.com/pytorch-labs/torchtrain/assets/46302957/049037ed-63a4-4ab0-bebc-f297857aab72">

This PR is the start of adding perf related metrics. 1 - This PR adds function for logging the total num of unique model params, with option for only counting trainable params as well. (for future peft/qlora type work). 2 - logs it with comma formatted logging and model name ala: <img width="716" alt="Screenshot 2024-02-12 at 4 12 22 PM" src="https://github.com/pytorch-labs/torchtrain/assets/46302957/8eb48870-ab1e-4b70-9159-92864ff6c0e5"> this helps de-mistify for example the size of our debug model as well: <img width="716" alt="Screenshot 2024-02-12 at 4 10 17 PM" src="https://github.com/pytorch-labs/torchtrain/assets/46302957/77475306-54bc-48a6-bf28-9c9a542577fd"> **additional updates** - added in gpu mem tracking. We want to show the user peak memory stats, as well as monitor and alert for any cudacachealloc retries which are a perf hindrance. Thus, added class GPUMemoryMonitor: usage: 1 - instantiate <img width="1329" alt="Screenshot 2024-02-13 at 9 32 11 AM" src="https://github.com/pytorch-labs/torchtrain/assets/46302957/95610386-6fde-47bb-bbdc-bb7c399c5895"> 2 - start of training = start_monitoring() 3 - end of training = stop_monitoring() 4 - show results = get_peak_stats_str() and rank0_log it. <img width="1074" alt="Screenshot 2024-02-13 at 9 12 45 AM" src="https://github.com/pytorch-labs/torchtrain/assets/46302957/b6c7c854-7d83-436a-bea9-a67109422381">

ghstack-source-id: d0828f16c06747a5af2586630e5205bf786de1c4 Pull Request resolved: pytorch#57

ghstack-source-id: da7e02b1c2f21a7471ce1dda8bd4d0ee888ad9ac Pull Request resolved: pytorch#60

ghstack-source-id: e23d5e0b70abc427a13bc8bf195c876c007f4939 Pull Request resolved: pytorch#65

…ix (pytorch#63) This PR 1 - adds multi-node training support via a multinode_trainer.slurm file. Verified llama 7b on 20 nodes / 160 A100s. 2 - It also corrects a race condition that can occur on larger scale training in profiling, where the check for the trace dir existence fails for process 1, but in the interim another process 2 makes the directory, and then when process 1 tries to make the dir it errors out as the dir exists. This is a simple fix of adding exist_ok=True to both of the makedir command (dump folder, trace folder). <img width="1047" alt="Screenshot 2024-02-15 at 10 53 18 PM" src="https://github.com/pytorch-labs/torchtrain/assets/46302957/20378637-4adb-425b-91d8-7fd36289d3b5"> <img width="545" alt="Screenshot 2024-02-15 at 10 55 02 PM" src="https://github.com/pytorch-labs/torchtrain/assets/46302957/28658614-cff6-42b5-ab57-bac578393d5c">

…orch#64) Small PR: 1 - add configurable init style in model_args - 'use_unique_init' will use the layer_id in the init stddev denom, otherwise uses the original init style of total layer count. (verified both work on 7B llama...not clear yet if one is better vs other). 2 - clean up lr and loss display formatting - lr display was spanning out to 12+ digits which isn't that informative, and was wrapped in list format. This PR rounds it to max of 8 digits precision and removes the []'s that were around the lr rate display. (note this is purely UI...the full float precision is still used in actual lr calcs). 3 - clean up loss display - rounds the loss display to 4 digits precision to make it more readable and informative. previously: <img width="1198" alt="Screenshot 2024-02-16 at 2 33 34 PM" src="https://github.com/pytorch-labs/torchtrain/assets/46302957/77733af0-42db-4fab-a047-fccc7d404278"> Now: <img width="1063" alt="Screenshot 2024-02-16 at 2 51 53 PM" src="https://github.com/pytorch-labs/torchtrain/assets/46302957/4eb75b98-67f4-41ec-83d8-dd84a0e8b29e">

Summary: PR implements an unfied config manager. - Command line args and toml file args are now unified. - Defaults can be loaded from either. options like `training.batchsize` will be available as `config.training.batchsize` where `config` is a config manager object. Test Plan: Test Plan: ============================= test session starts ============================== platform linux -- Python 3.10.13, pytest-8.0.1, pluggy-1.4.0 -- /home/gnadathur/local/a/pytorch-env/bin/python cachedir: .pytest_cache rootdir: /data/users/gnadathur/a/torchtrain configfile: pyproject.toml plugins: cov-4.1.0 collecting ... collected 5 items test/test_job_config.py::TestJobConfig::test_command_line_args PASSED [ 20%] test/test_job_config.py::TestJobConfig::test_command_line_args_with_override PASSED [ 40%] test/test_job_config.py::TestJobConfig::test_job_config_file PASSED [ 60%] test/test_job_config.py::TestJobConfig::test_job_config_file_with_override PASSED [ 80%] test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist PASSED [100%] ---------- coverage: platform linux, python 3.10.13-final-0 ---------- Coverage XML written to file coverage.xml ============================= slowest 20 durations ============================= 0.01s call test/test_job_config.py::TestJobConfig::test_job_config_file_with_override 0.00s call test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s call test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s call test/test_job_config.py::TestJobConfig::test_command_line_args_with_override 0.00s call test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist 0.00s setup test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s teardown test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s setup test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist 0.00s setup test/test_job_config.py::TestJobConfig::test_command_line_args_with_override 0.00s teardown test/test_job_config.py::TestJobConfig::test_command_line_args_with_override 0.00s setup test/test_job_config.py::TestJobConfig::test_job_config_file_with_override 0.00s setup test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s teardown test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist 0.00s teardown test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s teardown test/test_job_config.py::TestJobConfig::test_job_config_file_with_override ============================== 5 passed in 0.10s =============================== Reviewers: Subscribers: Tasks: Tags: Co-authored-by: gnadathur <gnadathur@devgpu051.cln3.facebook.com>

Add the linter back using a different changed-files plugin which doesn't have permission issues on pytorch/ org. Also change the linter job to use py 3.10 to match our unit test runner.

For now this literally just runs `NGPU=4 ./run_llama_train.sh` but I verified at least it catches problems. As a follow up, we should integrate mgpu test infra from pytorch and set up actual unit tests to run in this job. We should probably also keep testing the run_llama_train.sh script, and add other combinations of 2D parallelism to ensure they all keep working. <img width="2120" alt="image" src="https://github.com/pytorch/torchtrain/assets/4984825/2c235e9a-04ed-4f2d-9915-67de39d78e1c">

mostly testing if new repo works or not

as titled, move the config files to the root folder, where it decouples with the torchtrain package build, and allow easier navigations

@tianyu-l

…olumnar display to show both, show avg iter & data loading times at end of training (pytorch#87) This PR adds basic perf timing and display for 'per iter' and 'final iter average' display. (in part based on Andrew's comment about having to open the trace to compare iter timing). 1. tracking list is housed in TrainState, but I do not save it as part of the state dict as I view this as useful but not saveable info. 2. iter times are tracked after dataloading is done each iter and after optimizer step. The idea is to make this timing expressly the model training iter (not data loading or post iter other metrics calcs). 3. 'time' is now displayed at each iter along with the usual loss and lr. 4. at the end of training, assuming more than 3 iters run, then the average iter time is calculated by igoring the first three iters (consider these as warmup esp as cudaCacheAllocator gets warmed up) and displayed. 5. based on @tianyu-l feedback: I have added data loading times as well. I used the same timeit.default_timer() from timeit to be consistent. (cpu side so no synch's needed :) 6 - after fiddling with printf width formatting options, added beautiful aligned columnar display for the per iter updates: Now: <img width="1282" alt="Screenshot 2024-02-26 at 9 39 25 AM" src="https://github.com/pytorch/torchtrain/assets/46302957/9ee2ea7b-5c28-4d41-ba91-d4176c64fc66"> before: <img width="1282" alt="Screenshot 2024-02-26 at 8 39 46 AM" src="https://github.com/pytorch/torchtrain/assets/46302957/37cbfa20-7f1d-4d94-be94-3505ef4498c0">

Summary: Summary: Follow up on config unification, options not available in config file are picked from command line defaults. Test Plan: ============================= test session starts ============================== platform linux -- Python 3.10.13, pytest-8.0.1, pluggy-1.4.0 -- /home/gnadathur/local/a/pytorch-env/bin/python cachedir: .pytest_cache rootdir: /data/users/gnadathur/a/torchtrain configfile: pyproject.toml plugins: cov-4.1.0 collecting ... collected 3 items test/test_job_config.py::TestJobConfig::test_command_line_args PASSED [ 33%] test/test_job_config.py::TestJobConfig::test_job_config_file PASSED [ 66%] test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist PASSED [100%] ---------- coverage: platform linux, python 3.10.13-final-0 ---------- Coverage XML written to file coverage.xml ============================= slowest 20 durations ============================= 0.00s call test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s call test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s call test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist 0.00s setup test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s teardown test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s setup test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s setup test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist 0.00s teardown test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s teardown test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist ============================== 3 passed in 0.06s =============================== Test Plan: Reviewers: Subscribers: Tasks: Tags: --------- Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

ghstack-source-id: 38cbc277e2a177bc0baf35450a661835b97a7f22 Pull Request resolved: pytorch#92

…g on slurm (pytorch#93) This PR adds the ability to do colored console outputs in order to highlight the training data outputs. It also adds a check to not use this color formatting on slurm, where it will add 33= instead of the color if not avoided. Note that I've just added some color to highlight the main training data. Users that fork/clone can use it to enhance their outputs as desired. <img width="1372" alt="Screenshot 2024-02-26 at 10 20 15 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/44849821-1677-40bf-896c-39344cd661d6"> Note that on slurm it remains plain: <img width="847" alt="Screenshot 2024-02-26 at 10 46 24 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/172eaa58-4f5c-48f5-8ec1-bc349e3e82f2"> if you dont' check this, then it would otherwise look like this (this does not happen with this PR, just showing if we didn't check and credit to Yifu for noting this would be an issue): <img width="847" alt="Screenshot 2024-02-26 at 10 39 23 PM" src="https://github.com/pytorch/torchtrain/assets/46302957/4a87fb9a-dd3a-417c-a29e-286ded069358">

@awgu

this PR updates the GPU metrics to labelling as GiB - we were calculating GiB but calling it GB. (credit to @awgu for flagging this - issue pytorch#94) function names and member vars in metrics.py have been updated to _gib instead of _gb for clarity, and the logging output now labels as GiB: <img width="851" alt="Screenshot 2024-02-27 at 11 28 23 AM" src="https://github.com/pytorch/torchtrain/assets/46302957/85eb260a-77e9-4c49-be8a-b1aaa10dc3e2">

ghstack-source-id: 7dc4a80cf9c32f4dca3d00bcef019d256bdf58f7 Pull Request resolved: pytorch#96

Enable libUV for torchtrain. Test: ``` + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0,1 + CONFIG_FILE=./train_configs/debug_model.toml + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] ***************************************** W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] ***************************************** [rank0]:2024-02-28 09:12:04,581 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4] [rank1]:2024-02-28 09:12:04,708 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4] [rank0]:2024-02-28 09:12:05,647 - root - INFO - Building llama [rank0]:2024-02-28 09:12:05,655 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-02-28 09:12:05,655 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2 [rank1]:2024-02-28 09:12:07,299 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model [rank1]:2024-02-28 09:12:07,299 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2 [rank0]:2024-02-28 09:12:07,565 - root - INFO - Model fully initialized via reset_params [rank0]:2024-02-28 09:12:07,566 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank0]:2024-02-28 09:12:07,566 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-02-28 09:12:07,567 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use [rank0]:2024-02-28 09:12:08,769 - root - INFO - Applied FSDP to the model... [rank0]:2024-02-28 09:12:08,770 - root - INFO - Gradient scaling not enabled. [rank0]:2024-02-28 09:12:08,770 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240228-0912. [rank0]:2024-02-28 09:12:08,977 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank0]:2024-02-28 09:12:10,956 - root - INFO - �[36mstep: 1 �[32mloss: 10.9229 �[39miter: �[34m 1.9386�[39m data: �[34m0.0368 �[39mlr: �[33m0.00026667�[39m [rank0]:2024-02-28 09:12:11,045 - root - INFO - �[36mstep: 2 �[32mloss: 10.8673 �[39miter: �[34m 0.0562�[39m data: �[34m0.0316 �[39mlr: �[33m0.00053333�[39m [rank0]:2024-02-28 09:12:11,130 - root - INFO - �[36mstep: 3 �[32mloss: 10.7145 �[39miter: �[34m 0.0523�[39m data: �[34m0.0322 �[39mlr: �[33m0.0008�[39m [rank0]:2024-02-28 09:12:11,219 - root - INFO - �[36mstep: 4 �[32mloss: 10.5038 �[39miter: �[34m 0.0559�[39m data: �[34m0.0319 �[39mlr: �[33m0.0007�[39m [rank0]:2024-02-28 09:12:11,304 - root - INFO - �[36mstep: 5 �[32mloss: 10.2228 �[39miter: �[34m 0.0537�[39m data: �[34m0.031 �[39mlr: �[33m0.0006�[39m [rank0]:2024-02-28 09:12:11,391 - root - INFO - �[36mstep: 6 �[32mloss: 9.9677 �[39miter: �[34m 0.0562�[39m data: �[34m0.0302 �[39mlr: �[33m0.0005�[39m [rank0]:2024-02-28 09:12:11,478 - root - INFO - �[36mstep: 7 �[32mloss: 9.7762 �[39miter: �[34m 0.0544�[39m data: �[34m0.0317 �[39mlr: �[33m0.0004�[39m [rank0]:2024-02-28 09:12:11,676 - root - INFO - �[36mstep: 8 �[32mloss: 9.4359 �[39miter: �[34m 0.0509�[39m data: �[34m0.0322 �[39mlr: �[33m0.0003�[39m [rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:314] Completed Stage: Warm Up [rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:314] Completed Stage: Warm Up [rank0]:2024-02-28 09:12:11,813 - root - INFO - �[36mstep: 9 �[32mloss: 9.2326 �[39miter: �[34m 0.1007�[39m data: �[34m0.0321 �[39mlr: �[33m0.0002�[39m [rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:320] Completed Stage: Collection [rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:324] Completed Stage: Post Processing [rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:320] Completed Stage: Collection [rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:324] Completed Stage: Post Processing [rank0]:2024-02-28 09:12:12,195 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10 [rank0]:2024-02-28 09:12:12,207 - root - INFO - �[36mstep: 10 �[32mloss: 9.1641 �[39miter: �[34m 0.0971�[39m data: �[34m0.031 �[39mlr: �[33m0.0001�[39m [rank0]:2024-02-28 09:12:12,207 - root - INFO - Average iter time: 0.0670 seconds [rank0]:2024-02-28 09:12:12,207 - root - INFO - Average data load time: 0.0314 seconds [rank0]:2024-02-28 09:12:12,208 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2% [rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44% [rank0]:num retries: 0, num ooms: 0 [rank0]:NCCL version 2.19.3+cuda12.0 ``` --------- Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

as titled, we don't want to allow steps == -1 case as it would blow up the lr scheduler

Add 7b config and adjust options to be more realistic didn't add this to the train scripts as default as it's expensive to init, whoever use it can adjust it accordingly

ghstack-source-id: f7ee3c867bfcdcae5dbb490982920606191e6f40 Pull Request resolved: pytorch#97

Summary: Adding a description field, useful for integration tests to describe the test. Test Plan: ``` + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0,1 + CONFIG_FILE=./train_configs/debug_model.toml + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] ***************************************** W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] ***************************************** [rank1]:2024-02-29 17:05:04,269 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4] [rank0]:2024-02-29 17:05:04,510 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4] [rank0]:2024-02-29 17:05:05,327 - root - INFO - Starting job: debug training [rank0]:2024-02-29 17:05:05,327 - root - INFO - Building llama [rank0]:2024-02-29 17:05:05,335 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-02-29 17:05:05,335 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2 [rank1]:2024-02-29 17:05:06,782 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model [rank1]:2024-02-29 17:05:06,782 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2 [rank0]:2024-02-29 17:05:07,347 - root - INFO - Model fully initialized via reset_params [rank0]:2024-02-29 17:05:07,349 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank0]:2024-02-29 17:05:07,349 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-02-29 17:05:07,349 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use [rank0]:2024-02-29 17:05:08,375 - root - INFO - Applied FSDP to the model... [rank0]:2024-02-29 17:05:08,376 - root - INFO - Gradient scaling not enabled. [rank0]:2024-02-29 17:05:08,376 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240229-1705. [rank0]:2024-02-29 17:05:08,610 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank0]:2024-02-29 17:05:10,570 - root - INFO - �[36mstep: 1 �[32mloss: 10.9183 �[39miter: �[34m 1.9258�[39m data: �[34m0.0303 �[39mlr: �[33m0.00026667�[39m [rank0]:2024-02-29 17:05:10,653 - root - INFO - �[36mstep: 2 �[32mloss: 10.8347 �[39miter: �[34m 0.0487�[39m data: �[34m0.0336 �[39mlr: �[33m0.00053333�[39m [rank0]:2024-02-29 17:05:10,733 - root - INFO - �[36mstep: 3 �[32mloss: 10.6861 �[39miter: �[34m 0.045�[39m data: �[34m0.0334 �[39mlr: �[33m0.0008�[39m [rank0]:2024-02-29 17:05:10,812 - root - INFO - �[36mstep: 4 �[32mloss: 10.4672 �[39miter: �[34m 0.0453�[39m data: �[34m0.0336 �[39mlr: �[33m0.0007�[39m [rank0]:2024-02-29 17:05:10,893 - root - INFO - �[36mstep: 5 �[32mloss: 10.2154 �[39miter: �[34m 0.0466�[39m data: �[34m0.033 �[39mlr: �[33m0.0006�[39m [rank0]:2024-02-29 17:05:10,975 - root - INFO - �[36mstep: 6 �[32mloss: 9.9573 �[39miter: �[34m 0.0496�[39m data: �[34m0.0314 �[39mlr: �[33m0.0005�[39m [rank0]:2024-02-29 17:05:11,056 - root - INFO - �[36mstep: 7 �[32mloss: 9.7627 �[39miter: �[34m 0.0486�[39m data: �[34m0.0321 �[39mlr: �[33m0.0004�[39m [rank0]:2024-02-29 17:05:11,201 - root - INFO - �[36mstep: 8 �[32mloss: 9.437 �[39miter: �[34m 0.0457�[39m data: �[34m0.0333 �[39mlr: �[33m0.0003�[39m [rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:314] Completed Stage: Warm Up [rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:314] Completed Stage: Warm Up [rank0]:2024-02-29 17:05:11,317 - root - INFO - �[36mstep: 9 �[32mloss: 9.2446 �[39miter: �[34m 0.0794�[39m data: �[34m0.0324 �[39mlr: �[33m0.0002�[39m [rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:320] Completed Stage: Collection [rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:324] Completed Stage: Post Processing [rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:320] Completed Stage: Collection [rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:324] Completed Stage: Post Processing [rank0]:2024-02-29 17:05:11,748 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10 [rank0]:2024-02-29 17:05:11,762 - root - INFO - �[36mstep: 10 �[32mloss: 9.1772 �[39miter: �[34m 0.0893�[39m data: �[34m0.0324 �[39mlr: �[33m0.0001�[39m [rank0]:2024-02-29 17:05:11,763 - root - INFO - Average iter time: 0.0578 seconds [rank0]:2024-02-29 17:05:11,763 - root - INFO - Average data load time: 0.0326 seconds [rank0]:2024-02-29 17:05:11,763 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2% [rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44% [rank0]:num retries: 0, num ooms: 0 [rank0]:NCCL version 2.19.3+cuda12.0 ``` Reviewers: Subscribers: Tasks: Tags: Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

ghstack-source-id: 1c5bf790d7473f6a24124051fcfa1fd2585a56f9 Pull Request resolved: pytorch#105

``` + export USE_LIBUV=1 + USE_LIBUV=1 + TRAINER_DIR=/home/gnadathur/local/torchtrain + NGPU=4 + LOG_RANK=0,1 + CONFIG_FILE=./train_configs/debug_model.toml + torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] ***************************************** W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] ***************************************** [rank0]:2024-03-04 17:01:28,834 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4] [rank1]:2024-03-04 17:01:28,857 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4] [rank0]:2024-03-04 17:01:29,712 - root - INFO - Starting job: debug training [rank0]:2024-03-04 17:01:29,712 - root - INFO - Building llama [rank0]:2024-03-04 17:01:29,719 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model [rank0]:2024-03-04 17:01:29,719 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2 [rank1]:2024-03-04 17:01:31,187 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model [rank1]:2024-03-04 17:01:31,188 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2 [rank0]:2024-03-04 17:01:31,346 - root - INFO - Model fully initialized via reset_params [rank0]:2024-03-04 17:01:31,346 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True) [rank0]:2024-03-04 17:01:31,347 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m [rank0]:2024-03-04 17:01:31,347 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use [rank0]:2024-03-04 17:01:32,502 - root - INFO - Applied FSDP to the model... [rank0]:2024-03-04 17:01:32,503 - root - INFO - Gradient scaling not enabled. [rank0]:2024-03-04 17:01:32,504 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240304-1701. [rank0]:2024-03-04 17:01:32,901 - root - INFO - Profiling active. Traces will be saved at ./outputs/profiling/traces [rank0]:2024-03-04 17:01:34,806 - root - INFO - �[36mstep: 1 �[32mloss: 10.8424 �[39miter: �[34m 1.8688�[39m data: �[34m0.0316 �[39mlr: �[33m0.00026667�[39m [rank0]:2024-03-04 17:01:34,891 - root - INFO - �[36mstep: 2 �[32mloss: 10.7581 �[39miter: �[34m 0.0476�[39m data: �[34m0.0357 �[39mlr: �[33m0.00053333�[39m [rank0]:2024-03-04 17:01:34,970 - root - INFO - �[36mstep: 3 �[32mloss: 10.6239 �[39miter: �[34m 0.045�[39m data: �[34m0.0333 �[39mlr: �[33m0.0008�[39m [rank0]:2024-03-04 17:01:35,048 - root - INFO - �[36mstep: 4 �[32mloss: 10.4163 �[39miter: �[34m 0.0455�[39m data: �[34m0.0323 �[39mlr: �[33m0.0007�[39m [rank0]:2024-03-04 17:01:35,127 - root - INFO - �[36mstep: 5 �[32mloss: 10.1529 �[39miter: �[34m 0.0459�[39m data: �[34m0.032 �[39mlr: �[33m0.0006�[39m [rank0]:2024-03-04 17:01:35,206 - root - INFO - �[36mstep: 6 �[32mloss: 9.8899 �[39miter: �[34m 0.0468�[39m data: �[34m0.0311 �[39mlr: �[33m0.0005�[39m [rank0]:2024-03-04 17:01:35,284 - root - INFO - �[36mstep: 7 �[32mloss: 9.7204 �[39miter: �[34m 0.0461�[39m data: �[34m0.0312 �[39mlr: �[33m0.0004�[39m [rank0]:2024-03-04 17:01:35,425 - root - INFO - �[36mstep: 8 �[32mloss: 9.3757 �[39miter: �[34m 0.0457�[39m data: �[34m0.0319 �[39mlr: �[33m0.0003�[39m [rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:314] Completed Stage: Warm Up [rank0]:2024-03-04 17:01:35,537 - root - INFO - �[36mstep: 9 �[32mloss: 9.1883 �[39miter: �[34m 0.0762�[39m data: �[34m0.0318 �[39mlr: �[33m0.0002�[39m [rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:314] Completed Stage: Warm Up [rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event [rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:320] Completed Stage: Collection [rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:324] Completed Stage: Post Processing [rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:320] Completed Stage: Collection [rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:324] Completed Stage: Post Processing [rank0]:2024-03-04 17:01:35,958 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10 [rank0]:2024-03-04 17:01:35,971 - root - INFO - �[36mstep: 10 �[32mloss: 9.1212 �[39miter: �[34m 0.0808�[39m data: �[34m0.0319 �[39mlr: �[33m0.0001�[39m [rank0]:2024-03-04 17:01:35,972 - root - INFO - Average iter time: 0.0553 seconds [rank0]:2024-03-04 17:01:35,972 - root - INFO - Average data load time: 0.0317 seconds [rank0]:2024-03-04 17:01:35,972 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2% [rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44% [rank0]:num retries: 0, num ooms: 0 [rank0]:NCCL version 2.19.3+cuda12.0 ``` Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

This PR enables meta_init functionality to avoid OOM'ing on cpu for larger models. The core functionality is in meta_init.py, and a few changes in parallelization and train.py. Key items: 1 - this is largely the same as the earlier PR I had for meta_init, but I did a new one b/c faster than reworking it with all the interim changes. 2 - to address feedback in previous PR: a - why do we need meta_init.py, can't we just do: ~~~ with torch.device("meta"): model = Model.from_args(...) ~~~ Unfortunately this does not work b/c the rope embeddings are treated differently (buffer) and thus the simple lambda call from param_init_fn in FSDP (lambda module: module.to_device('cuda') ) will not invoke or move the rope embeddings and the model will fail on first forward. This issue relates to the nn.embeddings not being moved, and that the device is referenced in the forward pass for the current rope class. Have opened pytorch#110 to track this and investigate while not holding up meta init that is working from landing. b - per earlier feedback - meta init is now 'not optional' but simply the default. This should ensure all models leverage it and ensure we aren't missing things for future meta_init aspects. 3 - misc change - I switched the model_params to just do the normal all params count instead of 'unique params' b/c it does not mesh with what people perceive model size as. Testing: tested both debugmodel and 26B model with and without meta init to confirm same loss curves. Note for future reference - if you get a bad init (meta init failure) you will simply not train (loss is same every iter). If you fail to call reset params after FSDP, then you will train (b/c we default to torch.randn_like) but your starting loss will be 5x+ higher (telling you that you have not properly init'ed the model).

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

…ch#268) This way we could temporarily enable 2-D parallel compile, and it might make sense to do transformer block compile in the future with PP (which we'll see). We should figure out: 1. dynamic shape issue when turning on 2D parallel 2. full model compile issue for 2D parallel compile 3. cache reusing currently does not work, enable it later

previous change to use logging from torchtitan caused stdout not to show up. ghstack-source-id: 30a77c59ba68043ffa844be0443d5351d9584fab Pull Request resolved: pytorch#352

@wanchaol

…orch#350) cc @wanchaol @lessw2020 @wconstab

mostly harmless bug, since output shape of last layer is not used for send/recv purpose, the runtime value overrides it no matter what value you configured it with. However, since adding in/out shape validation to pipeline lib in torch, this raises an error and has to be fixed. ghstack-source-id: 950e41529b7b506085ab280d8a492e345eaefd24 Pull Request resolved: pytorch#354

APIs conform to the pytorch rules. This PR should be able to land safely after tonight's nightly pytorch build which includes the above PR. ghstack-source-id: c575bc7835472128c09798544caa38bf1908e5ca Pull Request resolved: pytorch#356

After updating today, I found a whole slew of various new temp files clogging up my source tab. This PR screens these out so that they don't accidentally get added in a PR and keeps your source tab change count correct. Example of issue without this PR: <img width="780" alt="Screenshot 2024-05-23 at 9 21 55 PM" src="https://github.com/pytorch/torchtitan/assets/46302957/41b7061a-41a0-4a95-938b-3fd9292a2f38"> vs with this PR: <img width="661" alt="Screenshot 2024-05-23 at 10 07 16 PM" src="https://github.com/pytorch/torchtitan/assets/46302957/cccf8c5f-368d-40a8-b10f-f11ca37df2bc">

- switch to using public PipelineStage API - clean up some asserts in tracer codepath ghstack-source-id: 2d069b7d45c4f3c788dec8fc85d8a7e83e463fcd Pull Request resolved: pytorch#357

ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243 Pull Request resolved: pytorch#339

ghstack-source-id: 8cbd62b97816ae8185b8a7e1aa9a7505f2780525 Pull Request resolved: pytorch#372

Usage: `--test <test_id>` Acceptable values: `test_id` in `build_test_list` (default: all) Example: ``` rm -rf outputs && python test_runner.py outputs --test pp_gpipe ```

ghstack-source-id: 775591945ff5427cb7e5e9fc7592952b4c746341 Pull Request resolved: pytorch#373

Co-authored-by: Linsong Chu <lchu@us.ibm.com>

facebook-github-bot · 2024-05-31T07:27:40Z

Hi @daviswer!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

fegin · 2024-05-31T07:30:14Z

torchtitan/checkpoint.py

@@ -120,7 +120,6 @@ def __init__(
                "model": ModelWrapper(model),
                "optimizer": OptimizerWrapper(model, optimizer),
                "lr_scheduler": lr_scheduler,
-                "dataloader": dataloader,


This will break the usage of the existing dataloader checkpoint.

facebook-github-bot · 2024-05-31T07:38:34Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

gokulavasan · 2024-06-03T18:19:24Z

Thanks for the PR! Reviewing it now

gokulavasan

Finished one pass through the PR. Left comments to understand the code better

gokulavasan · 2024-06-05T08:14:47Z

torchtitan/config_manager.py

+        self.parser.add_argument(
+            "--dataset.datasets",
+            type=str,
+            default="c4_mini",


For multiple (sub) datasets, would this be specified as a list of str?

For multiple subdatasets, we have it set up to take a string of comma-separated names. There's an example comment in train_configs/llama3_8b.toml:

datasets = "CommonCrawl,C4,Github,Wikipedia,Books,ArXiv,StackExchange" dataset_weights = "67,15,4.5,4.5,4.5,2.5,2"

Of course this can be changed if another format is more convenient

gokulavasan · 2024-06-05T08:27:38Z

torchtitan/datasets/experimental/llama3/meta/c4_llama3_counts.csv

@@ -0,0 +1,2 @@
+dataset/filename,documents,tokens
+/c4_mini/c4_mini.arrow,45000,20505558


i) Are all the tokens for a single doc present in one row+col?
ii) What is the need to pretokenize the data instead of storing raw text alone?

The expected format is for each document to occupy one row in the pyarrow file. Because docs are different lengths, there's not really any consistent notion of columns anymore (or viewed another way, the doc rows all occupy the single "tokens" column of the shard file)

Right now we pretokenize to enable chunking - currently, we can yield portions of a document at a time, without ever having to load the whole doc into memory at once, which is helpful in the case of extremely long docs. This gets more difficult for raw text since we have to split on tokens rather than on characters, but we don't know the token boundaries until after tokenization. Support for dynamic tokenization is on the todo list though!

gokulavasan · 2024-06-05T09:01:11Z

torchtitan/datasets/experimental_datasets.py

+        """
+        os.makedirs(path, exist_ok=True)
+        state = self.state_dict()
+        torch.save(state, os.path.join(path, f"loader_state_{self.rank}.pth"))


From the description, it looks like only num_workers=1 is supported per rank. Thus self.rank would be sufficient in this case. Just to confirm, for num_workers > 1, do you foresee each worker storing their state from the worker process directly to local disk?

Yes that's the eventual plan for supporting num_workers > 1

gokulavasan · 2024-06-05T09:15:53Z

torchtitan/datasets/experimental_datasets.py

+
+    def _reshard(self, sharded_list):
+        """
+        Sharded_list is a list of lists, where each "shard" sublist must have the same length.


Can you elaborate on why "shard" sublist must have the same length? For example, when the input dataset folder is created with arrow files inside - does each file correspond to a shard? In that case, is each file expected to have the same length?

I suppose it's not strictly necessary that shards have the same length, we just take that shortcut because this function is only ever used to reassign per-worker data structures from load_worldsize to worldsize. So far we've been able to assume that workers all hold the same fundamental data structures so this hasn't been an issue. This doesn't get called on any actual data shard files - apologies for unclear naming/commenting here

gokulavasan · 2024-06-05T09:20:41Z

torchtitan/datasets/experimental_datasets.py

+            int(n_items * (self.rank + 1) / self.worldsize) - item_offset,
+        )
+        # Pull out owned items
+        return [sharded_list[i // shard_len][i % shard_len] for i in my_items]


Does the sharded list is a basically a list of list of indices? And does this return a list of indices that is owned by this rank? If that is the case, could this be a really large list for huge datasets?

Update: Looking at the code further down, this seems to be for in-memory buffers instead of sample/doc indices?

Yes your update is correct. We never actually materialize the full list of doc indices owned by each worker, because as you say, this gets very large.

gokulavasan · 2024-06-05T09:27:48Z

torchtitan/datasets/experimental_datasets.py

+                for flag in self.state_params + self.reshard_params
+            ]
+        else:
+            for flag in self.reshard_params:


For all parameters that are set as reshard params, this logic figures out which subset of indices in that list of ranges that it owns? Apart from indices assigned to a worker, what are the other states tend to lend well with re-sharding?

And looks like state params are discarded during reshard? Would RNG state be a typical state param?

Yes RNG state is a typical state param. Reshard params are generally buffers or sets of stateful subworkers

gokulavasan · 2024-06-05T09:34:48Z

torchtitan/datasets/experimental_datasets.py

+            for flag in self.state_params + self.reshard_params
+        }
+
+    def _reshard(self, sharded_list):


So this function doesn't place any restriction on what world_size could be in relation to load_world_size - is that correct?

That's correct

gokulavasan · 2024-06-05T09:50:16Z

torchtitan/datasets/experimental_datasets.py

+            # Swap out randomly sampled value from buffer
+            i = torch.randint(self.buffer_size, (1,), generator=self.generator).item()
+            out = self.buffer[i]
+            self.buffer[i] = next(dataset)


once the dataset runs out of data, how is the remaining items in the buffer drained?

The expectation is a sublayer that never runs out - currently, we just loop over epochs forever

gokulavasan · 2024-06-05T10:25:36Z

torchtitan/datasets/experimental_datasets.py

+                newpath = os.path.join(self.data, shardid)
+                path, reader = self._get_reader(path, newpath, reader)
+                # Map id in range of owned docs to new (consistently) shuffled id
+                doclcg = self._random_map_docid(docrange)


Just to confirm, once a shard is chosen, all the docs within that shard is visited in random order but that shard is exhausted before moving to the next shard?

Technically, once a shard is chosen, all the docs within the fraction of the shard owned by this worker are visited. But yes that's broadly correct.

gokulavasan · 2024-06-05T10:27:20Z

torchtitan/datasets/experimental_datasets.py

+        self.percent_seen = 0
+        self.lcg_state = seed + rank
+
+        self.state_params = [


I see no reshard_params, so this dataset doesn't support resharding?

Correct, this one particular dataset needs to sit under a Scalable_Shard_Dataset to support resharding

gokulavasan

Finished one pass through the PR. Left comments to understand the code better

gokulavasan

Finished one pass through the PR. Left comments to understand the code better

daviswer · 2024-06-05T22:18:06Z

Thanks for taking a look! Left a bunch of responses, hopefully this brings clarity

andrewkho · 2024-06-24T18:40:51Z

torchtitan/datasets/experimental_datasets.py

+    return itemlist[start:end]
+
+
+class _Stateful_Dataset(data.IterableDataset):


nit: can we stick with Camel case and remove underscores between words?

Re-sync fork to main, and add the latest features from fms-fsdp: - Remove dependence on the separate metadata "count file" in the dataset directory - Support n_workers > 1. This is accomplished by shunting all path/rank-dependent setup out of initialization and into a new setup() method, which runs after init but before any other op - Support HF-style parquet raw text datasets, with tokenization on the fly (for reasonably sized documents/shardfiles) - Support non-flat data directories: all legal files under the specified location will be included, regardless of depth or location. Enables simple weight-free dataset mixing via a single `StreamingDocDataset` on the parent directory - `SamplingDataset` and `ScalableShardDataset`are now implemented as proper `_WrapperDatasets`, reflecting the intended modular usage - Fix Weird_Separated_Camel_Case naming convention in favor of ProperClassNaming - Allow `PreloadBufferDataset` to shrink back down to the desired size after rescaling to a smaller number of workers

daviswer · 2024-08-13T21:41:32Z

Updating the dataloader to match the latest version in our public repo. Re-syncs to latest main, and incorporates many previously discussed features:

Remove dependence on the separate metadata "count file" in the dataset directory
Support n_workers > 1. This is accomplished by shunting all path/rank-dependent setup out of initialization and into a new setup() method, which runs after init but before any other op
Support HF-style parquet raw text datasets, with tokenization on the fly (for reasonably sized documents/shardfiles)
Support non-flat data directories: all legal files under the specified location will be included, regardless of depth or location. Enables simple weight-free dataset mixing via a single StreamingDocDataset on the parent directory
SamplingDataset and ScalableShardDatasetare now implemented as proper _WrapperDatasets, reflecting the intended modular usage
Fix Weird_Separated_Camel_Case naming convention in favor of ProperClassNaming
Allow PreloadBufferDataset to shrink back down to the desired size after rescaling to a smaller number of workers

wanchaol and others added 30 commits February 13, 2024 14:10

check in tokenizer.model for ease of dev setup (pytorch#59)

054f088

it's a small thing and can be download from OSS, we can just check in

add TensorBoard logging with loss and wps

ad69e62

ghstack-source-id: d0828f16c06747a5af2586630e5205bf786de1c4 Pull Request resolved: pytorch#57

add memory metrics to TensorBoard

a4663b1

ghstack-source-id: da7e02b1c2f21a7471ce1dda8bd4d0ee888ad9ac Pull Request resolved: pytorch#60

modify data split to use HF api

2daf53f

ghstack-source-id: e23d5e0b70abc427a13bc8bf195c876c007f4939 Pull Request resolved: pytorch#65

add bunch of cleanups and design principle section (pytorch#71)

8097c26

delete the linter to see if re-adding it helps (pytorch#80)

28f431f

Whc/add linter (pytorch#81)

bccad90

Add the linter back using a different changed-files plugin which doesn't have permission issues on pytorch/ org. Also change the linter job to use py 3.10 to match our unit test runner.

update readme (pytorch#74)

468ce8f

mostly testing if new repo works or not

move config folder to root and adjust options (pytorch#83)

3fce6bb

as titled, move the config files to the root folder, where it decouples with the torchtrain package build, and allow easier navigations

support infinite loop over alpaca dataset

325951f

ghstack-source-id: 38cbc277e2a177bc0baf35450a661835b97a7f22 Pull Request resolved: pytorch#92

improve TensorBoard instructions in README

4c03475

ghstack-source-id: 7dc4a80cf9c32f4dca3d00bcef019d256bdf58f7 Pull Request resolved: pytorch#96

use warmup steps for lr scheduler, ban steps == -1 (pytorch#99)

e60c573

as titled, we don't want to allow steps == -1 case as it would blow up the lr scheduler

Add llama 7B config (pytorch#100)

900b215

Add 7b config and adjust options to be more realistic didn't add this to the train scripts as default as it's expensive to init, whoever use it can adjust it accordingly

add selective activation checkpointing

6e87471

ghstack-source-id: f7ee3c867bfcdcae5dbb490982920606191e6f40 Pull Request resolved: pytorch#97

fix 2D parallel crash caused by all-reduce on 2D world_mesh

42f8907

ghstack-source-id: 1c5bf790d7473f6a24124051fcfa1fd2585a56f9 Pull Request resolved: pytorch#105

Fix feedback from PR 111 (pytorch#113)

5f0eaea

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>

wanchaol and others added 13 commits May 21, 2024 22:10

Make test_runner use separate logger with default INFO

910662c

previous change to use logging from torchtitan caused stdout not to show up. ghstack-source-id: 30a77c59ba68043ffa844be0443d5351d9584fab Pull Request resolved: pytorch#352

Fix llama_13b.toml -> llama2_13b.toml in multinode_trainer.slurm (pyt…

6807909

…orch#350) cc @wanchaol @lessw2020 @wconstab

Update pipelining import after change on pytorch

c87e8bc

APIs conform to the pytorch rules. This PR should be able to land safely after tonight's nightly pytorch build which includes the above PR. ghstack-source-id: c575bc7835472128c09798544caa38bf1908e5ca Pull Request resolved: pytorch#356

Add test for PP tracer frontend

1ceaa4e

- switch to using public PipelineStage API - clean up some asserts in tracer codepath ghstack-source-id: 2d069b7d45c4f3c788dec8fc85d8a7e83e463fcd Pull Request resolved: pytorch#357

only produce tensorboard logs on rank 0 by default

6a8455e

ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243 Pull Request resolved: pytorch#339

replace old torch dependency in requirements.txt

5831e81

ghstack-source-id: 8cbd62b97816ae8185b8a7e1aa9a7505f2780525 Pull Request resolved: pytorch#372

Add --test option to specify test to run (pytorch#368)

3343d1d

Usage: `--test <test_id>` Acceptable values: `test_id` in `build_test_list` (default: all) Example: ``` rm -rf outputs && python test_runner.py outputs --test pp_gpipe ```

use integration test as the badge shown on the homepage

07fa503

ghstack-source-id: 775591945ff5427cb7e5e9fc7592952b4c746341 Pull Request resolved: pytorch#373

Add FMS datasets (#1)

2e733d4

Co-authored-by: Linsong Chu <lchu@us.ibm.com>

Merge branch 'pytorch:main' into main

27de38c

fegin reviewed May 31, 2024

View reviewed changes

facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 31, 2024

gnadathur requested a review from gokulavasan May 31, 2024 17:26

gokulavasan reviewed Jun 5, 2024

View reviewed changes

andrewkho reviewed Jun 24, 2024

View reviewed changes

daviswer mentioned this pull request Aug 12, 2024

Update IBM experimental dataloader #515

Closed

tianyu-l force-pushed the main branch from 88ace96 to 81c555f Compare August 16, 2024 21:00

Fix ckp load path handling

c8dc004

daviswer mentioned this pull request Nov 23, 2024

Rescalability via IBM dataset layers pytorch/data#1372

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

IBM experimental dataloaders #376

IBM experimental dataloaders #376

daviswer commented May 31, 2024 •

edited

Loading

facebook-github-bot commented May 31, 2024

fegin May 31, 2024

facebook-github-bot commented May 31, 2024

gokulavasan commented Jun 3, 2024

gokulavasan left a comment

gokulavasan Jun 5, 2024

daviswer Jun 5, 2024

gokulavasan Jun 5, 2024

daviswer Jun 5, 2024

gokulavasan Jun 5, 2024

daviswer Jun 5, 2024 •

edited

Loading

gokulavasan Jun 5, 2024

daviswer Jun 5, 2024

gokulavasan Jun 5, 2024

daviswer Jun 5, 2024

gokulavasan Jun 5, 2024

daviswer Jun 5, 2024

gokulavasan Jun 5, 2024

daviswer Jun 5, 2024

gokulavasan Jun 5, 2024

daviswer Jun 5, 2024 •

edited

Loading

gokulavasan Jun 5, 2024

daviswer Jun 5, 2024

gokulavasan Jun 5, 2024

daviswer Jun 5, 2024

gokulavasan left a comment

gokulavasan left a comment

daviswer commented Jun 5, 2024

andrewkho Jun 24, 2024

daviswer commented Aug 13, 2024

		@@ -0,0 +1,2 @@
		dataset/filename,documents,tokens
		/c4_mini/c4_mini.arrow,45000,20505558

		return itemlist[start:end]


		class _Stateful_Dataset(data.IterableDataset):

IBM experimental dataloaders #376

Are you sure you want to change the base?

IBM experimental dataloaders #376

Conversation

daviswer commented May 31, 2024 • edited Loading

facebook-github-bot commented May 31, 2024

Action Required

Process

Choose a reason for hiding this comment

facebook-github-bot commented May 31, 2024

gokulavasan commented Jun 3, 2024

gokulavasan left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

daviswer Jun 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

daviswer Jun 5, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

gokulavasan left a comment

Choose a reason for hiding this comment

gokulavasan left a comment

Choose a reason for hiding this comment

daviswer commented Jun 5, 2024

Choose a reason for hiding this comment

daviswer commented Aug 13, 2024

daviswer commented May 31, 2024 •

edited

Loading

daviswer Jun 5, 2024 •

edited

Loading

daviswer Jun 5, 2024 •

edited

Loading