Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IBM experimental dataloaders #376

Open
wants to merge 175 commits into
base: main
Choose a base branch
from
Open

IBM experimental dataloaders #376

wants to merge 175 commits into from

Conversation

daviswer
Copy link

@daviswer daviswer commented May 31, 2024

This PR introduces an experimental PyTorch-native dataloader from IBM that is distributed, stateful, checkpointable, composable and rescalable. It is intended for use in large-scale model pretraining, particularly in research settings where rapid iteration between datasets may be required. It automatically and invisibly handles data sharding, shuffling, subdataset weighting, checkpoint saving and loading, and more, with minimal overhead and high throughput.

  • Add experimental dataset source file
  • Add experimental dataloader builder, hooked into torchtitan cfg
  • Update torchtitan cfg with additional dataset arg fields
  • Update train script to build experimental dataloader instead of hf depending on cfg flags
  • Replace the existing C4-mini example dataset with one that matches the expected formatting for the experimental dataloader
  • TODO: port over unit tests as well
  • TODO: preprocessing script(s) for the new dataset format
  • TODO: further cleanup/iteration

wanchaol and others added 30 commits February 13, 2024 14:10
it's a small thing and can be download from OSS, we can just check in
This PR adds the following:
1 - via reset parameters, a full layerwise init for the llama models
under /llama. This uses the total model depth as part of the init via:
self.weight_init_std = 0.02 / (2 * self.num_layers) ** 0.5

2 - The final output ffn (head) is init with sqrt of the dim of the
model itself and a slightly wider cutoff factor of 3.

3 - tangential change - updates run_llama_train.sh with updated MODEL
and MODEL_CONF params to allow for direct model control via the sh
script. (there was a MODEL already but it was incorrectly using that in
place of MODEL_CONF...though we should update this as it's not
intuitive).

4 - made the debugmodel default to 2 layers as an improved debug check.

5 - added a 1B and 40B for additional testing configs. I can't currently
run 70B on my H100 due to OOM, but can run 40B.

Testing:
Verified proper init and training with 7B, 13B and ~40B:

<img width="1085" alt="Screenshot 2024-02-11 at 10 39 12 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/049037ed-63a4-4ab0-bebc-f297857aab72">
This PR is the start of adding perf related metrics. 
1 - This PR adds function for logging the total num of unique model
params, with option for only counting trainable params as well. (for
future peft/qlora type work).
2 - logs it with comma formatted logging and model name ala:
<img width="716" alt="Screenshot 2024-02-12 at 4 12 22 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/8eb48870-ab1e-4b70-9159-92864ff6c0e5">

this helps de-mistify for example the size of our debug model as well:
<img width="716" alt="Screenshot 2024-02-12 at 4 10 17 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/77475306-54bc-48a6-bf28-9c9a542577fd">

**additional updates** - added in gpu mem tracking. We want to show the
user peak memory stats, as well as monitor and alert for any
cudacachealloc retries which are a perf hindrance.

Thus, added class GPUMemoryMonitor:
usage:
1 - instantiate
<img width="1329" alt="Screenshot 2024-02-13 at 9 32 11 AM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/95610386-6fde-47bb-bbdc-bb7c399c5895">

2 - start of training = start_monitoring()
3 - end of training = stop_monitoring()
4 - show results = get_peak_stats_str() and rank0_log it.
<img width="1074" alt="Screenshot 2024-02-13 at 9 12 45 AM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/b6c7c854-7d83-436a-bea9-a67109422381">
ghstack-source-id: d0828f16c06747a5af2586630e5205bf786de1c4
Pull Request resolved: pytorch#57
ghstack-source-id: da7e02b1c2f21a7471ce1dda8bd4d0ee888ad9ac
Pull Request resolved: pytorch#60
ghstack-source-id: e23d5e0b70abc427a13bc8bf195c876c007f4939
Pull Request resolved: pytorch#65
…ix (pytorch#63)

This PR 
1 - adds multi-node training support via a multinode_trainer.slurm file.
Verified llama 7b on 20 nodes / 160 A100s.
2 - It also corrects a race condition that can occur on larger scale
training in profiling, where the check for the trace dir existence fails
for process 1, but in the interim another process 2 makes the directory,
and then when process 1 tries to make the dir it errors out as the dir
exists.
This is a simple fix of adding exist_ok=True to both of the makedir
command (dump folder, trace folder).

<img width="1047" alt="Screenshot 2024-02-15 at 10 53 18 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/20378637-4adb-425b-91d8-7fd36289d3b5">
<img width="545" alt="Screenshot 2024-02-15 at 10 55 02 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/28658614-cff6-42b5-ab57-bac578393d5c">
…orch#64)

Small PR:
1 - add configurable init style in model_args - 'use_unique_init' will
use the layer_id in the init stddev denom, otherwise uses the original
init style of total layer count. (verified both work on 7B llama...not
clear yet if one is better vs other).

2 - clean up lr and loss display formatting - lr display was spanning
out to 12+ digits which isn't that informative, and was wrapped in list
format. This PR rounds it to max of 8 digits precision and removes the
[]'s that were around the lr rate display.
(note this is purely UI...the full float precision is still used in
actual lr calcs).

3 - clean up loss display - rounds the loss display to 4 digits
precision to make it more readable and informative.
previously:
<img width="1198" alt="Screenshot 2024-02-16 at 2 33 34 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/77733af0-42db-4fab-a047-fccc7d404278">

Now:
<img width="1063" alt="Screenshot 2024-02-16 at 2 51 53 PM"
src="https://github.com/pytorch-labs/torchtrain/assets/46302957/4eb75b98-67f4-41ec-83d8-dd84a0e8b29e">
Summary:

PR implements an unfied config manager.

- Command line args and toml file args are now unified.
- Defaults can be loaded from either.

options like `training.batchsize` will be available as
`config.training.batchsize` where `config` is a config manager object.

Test Plan:

Test Plan:
============================= test session starts
============================== platform linux -- Python 3.10.13,
pytest-8.0.1, pluggy-1.4.0 --
/home/gnadathur/local/a/pytorch-env/bin/python cachedir: .pytest_cache
rootdir: /data/users/gnadathur/a/torchtrain
configfile: pyproject.toml
plugins: cov-4.1.0
collecting ... collected 5 items

test/test_job_config.py::TestJobConfig::test_command_line_args PASSED [
20%]
test/test_job_config.py::TestJobConfig::test_command_line_args_with_override
PASSED [ 40%]
test/test_job_config.py::TestJobConfig::test_job_config_file PASSED [
60%]
test/test_job_config.py::TestJobConfig::test_job_config_file_with_override
PASSED [ 80%]
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
PASSED [100%]

---------- coverage: platform linux, python 3.10.13-final-0 ----------
Coverage XML written to file coverage.xml

============================= slowest 20 durations
============================= 0.01s call
test/test_job_config.py::TestJobConfig::test_job_config_file_with_override
0.00s call test/test_job_config.py::TestJobConfig::test_job_config_file
0.00s call
test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s
call
test/test_job_config.py::TestJobConfig::test_command_line_args_with_override
0.00s call
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s setup
test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s
teardown test/test_job_config.py::TestJobConfig::test_command_line_args
0.00s setup
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s setup
test/test_job_config.py::TestJobConfig::test_command_line_args_with_override
0.00s teardown
test/test_job_config.py::TestJobConfig::test_command_line_args_with_override
0.00s setup
test/test_job_config.py::TestJobConfig::test_job_config_file_with_override
0.00s setup test/test_job_config.py::TestJobConfig::test_job_config_file
0.00s teardown
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s teardown
test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s
teardown
test/test_job_config.py::TestJobConfig::test_job_config_file_with_override
============================== 5 passed in 0.10s
===============================

Reviewers:

Subscribers:

Tasks:

Tags:

Co-authored-by: gnadathur <gnadathur@devgpu051.cln3.facebook.com>
Add the linter back using a different changed-files plugin which doesn't have permission issues on pytorch/ org.

Also change the linter job to use py 3.10 to match our unit test runner.
For now this literally just runs `NGPU=4 ./run_llama_train.sh` but I
verified at least it catches problems.

As a follow up, we should integrate mgpu test infra from pytorch and set
up actual unit tests to run in this job.

We should probably also keep testing the run_llama_train.sh script, and
add other combinations of 2D parallelism to ensure they all keep
working.

<img width="2120" alt="image"
src="https://github.com/pytorch/torchtrain/assets/4984825/2c235e9a-04ed-4f2d-9915-67de39d78e1c">
mostly testing if new repo works or not
as titled, move the config files to the root folder, where it decouples
with the torchtrain package build, and allow easier navigations
…olumnar display to show both, show avg iter & data loading times at end of training (pytorch#87)

This PR adds basic perf timing and display for 'per iter' and 'final
iter average' display. (in part based on Andrew's comment about having
to open the trace to compare iter timing).

1. tracking list is housed in TrainState, but I do not save it as part
of the state dict as I view this as useful but not saveable info.
2. iter times are tracked after dataloading is done each iter and after
optimizer step. The idea is to make this timing expressly the model
training iter (not data loading or post iter other metrics calcs).

3. 'time' is now displayed at each iter along with the usual loss and
lr.

4. at the end of training, assuming more than 3 iters run, then the
average iter time is calculated by igoring the first three iters
(consider these as warmup esp as cudaCacheAllocator gets warmed up) and
displayed.
5. based on @tianyu-l feedback: I have added data loading times as well.
I used the same timeit.default_timer() from timeit to be consistent.
(cpu side so no synch's needed :)

6 - after fiddling with printf width formatting options, added beautiful
aligned columnar display for the per iter updates:
Now: 
<img width="1282" alt="Screenshot 2024-02-26 at 9 39 25 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/9ee2ea7b-5c28-4d41-ba91-d4176c64fc66">

before: 
<img width="1282" alt="Screenshot 2024-02-26 at 8 39 46 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/37cbfa20-7f1d-4d94-be94-3505ef4498c0">
Summary:

Summary:
Follow up on config unification, options not available in config file
are picked from command line defaults.

Test Plan:
============================= test session starts
============================== platform linux -- Python 3.10.13,
pytest-8.0.1, pluggy-1.4.0 --
/home/gnadathur/local/a/pytorch-env/bin/python cachedir: .pytest_cache
rootdir: /data/users/gnadathur/a/torchtrain
configfile: pyproject.toml
plugins: cov-4.1.0
collecting ... collected 3 items

test/test_job_config.py::TestJobConfig::test_command_line_args PASSED [
33%] test/test_job_config.py::TestJobConfig::test_job_config_file PASSED
[ 66%]
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
PASSED [100%]

---------- coverage: platform linux, python 3.10.13-final-0 ----------
Coverage XML written to file coverage.xml

============================= slowest 20 durations
============================= 0.00s call
test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s call
test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s
call
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s setup
test/test_job_config.py::TestJobConfig::test_command_line_args 0.00s
teardown test/test_job_config.py::TestJobConfig::test_command_line_args
0.00s setup test/test_job_config.py::TestJobConfig::test_job_config_file
0.00s setup
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
0.00s teardown
test/test_job_config.py::TestJobConfig::test_job_config_file 0.00s
teardown
test/test_job_config.py::TestJobConfig::test_job_file_does_not_exist
============================== 3 passed in 0.06s
===============================

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

---------

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>
ghstack-source-id: 38cbc277e2a177bc0baf35450a661835b97a7f22
Pull Request resolved: pytorch#92
…g on slurm (pytorch#93)

This PR adds the ability to do colored console outputs in order to
highlight the training data outputs.
It also adds a check to not use this color formatting on slurm, where it
will add 33= instead of the color if not avoided.

Note that I've just added some color to highlight the main training
data. Users that fork/clone can use it to enhance their outputs as
desired.

<img width="1372" alt="Screenshot 2024-02-26 at 10 20 15 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/44849821-1677-40bf-896c-39344cd661d6">


Note that on slurm it remains plain:
<img width="847" alt="Screenshot 2024-02-26 at 10 46 24 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/172eaa58-4f5c-48f5-8ec1-bc349e3e82f2">

if you dont' check this, then it would otherwise look like this (this
does not happen with this PR, just showing if we didn't check and credit
to Yifu for noting this would be an issue):
<img width="847" alt="Screenshot 2024-02-26 at 10 39 23 PM"
src="https://github.com/pytorch/torchtrain/assets/46302957/4a87fb9a-dd3a-417c-a29e-286ded069358">
this PR updates the GPU metrics to labelling as GiB - we were
calculating GiB but calling it GB.
(credit to @awgu for flagging this - issue
pytorch#94)

function names and member vars in metrics.py have been updated to _gib
instead of _gb for clarity, and the logging output now labels as GiB:
<img width="851" alt="Screenshot 2024-02-27 at 11 28 23 AM"
src="https://github.com/pytorch/torchtrain/assets/46302957/85eb260a-77e9-4c49-be8a-b1aaa10dc3e2">
ghstack-source-id: 7dc4a80cf9c32f4dca3d00bcef019d256bdf58f7
Pull Request resolved: pytorch#96
Enable libUV for torchtrain.

Test:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] 
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0228 09:12:02.564000 140353616004096 torch/distributed/run.py:717] *****************************************
[rank0]:2024-02-28 09:12:04,581 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank1]:2024-02-28 09:12:04,708 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-28 09:12:05,647 - root - INFO - Building llama
[rank0]:2024-02-28 09:12:05,655 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-28 09:12:05,655 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-02-28 09:12:07,299 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-02-28 09:12:07,299 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-02-28 09:12:07,565 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-02-28 09:12:07,566 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-02-28 09:12:07,566 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m
[rank0]:2024-02-28 09:12:07,567 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-02-28 09:12:08,769 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-28 09:12:08,770 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-28 09:12:08,770 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240228-0912.
[rank0]:2024-02-28 09:12:08,977 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-02-28 09:12:10,956 - root - INFO - �[36mstep:  1  �[32mloss: 10.9229  �[39miter: �[34m 1.9386�[39m  data: �[34m0.0368  �[39mlr: �[33m0.00026667�[39m
[rank0]:2024-02-28 09:12:11,045 - root - INFO - �[36mstep:  2  �[32mloss: 10.8673  �[39miter: �[34m 0.0562�[39m  data: �[34m0.0316  �[39mlr: �[33m0.00053333�[39m
[rank0]:2024-02-28 09:12:11,130 - root - INFO - �[36mstep:  3  �[32mloss: 10.7145  �[39miter: �[34m 0.0523�[39m  data: �[34m0.0322  �[39mlr: �[33m0.0008�[39m
[rank0]:2024-02-28 09:12:11,219 - root - INFO - �[36mstep:  4  �[32mloss: 10.5038  �[39miter: �[34m 0.0559�[39m  data: �[34m0.0319  �[39mlr: �[33m0.0007�[39m
[rank0]:2024-02-28 09:12:11,304 - root - INFO - �[36mstep:  5  �[32mloss: 10.2228  �[39miter: �[34m 0.0537�[39m  data: �[34m0.031  �[39mlr: �[33m0.0006�[39m
[rank0]:2024-02-28 09:12:11,391 - root - INFO - �[36mstep:  6  �[32mloss:  9.9677  �[39miter: �[34m 0.0562�[39m  data: �[34m0.0302  �[39mlr: �[33m0.0005�[39m
[rank0]:2024-02-28 09:12:11,478 - root - INFO - �[36mstep:  7  �[32mloss:  9.7762  �[39miter: �[34m 0.0544�[39m  data: �[34m0.0317  �[39mlr: �[33m0.0004�[39m
[rank0]:2024-02-28 09:12:11,676 - root - INFO - �[36mstep:  8  �[32mloss:  9.4359  �[39miter: �[34m 0.0509�[39m  data: �[34m0.0322  �[39mlr: �[33m0.0003�[39m
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-28 09:12:11,813 - root - INFO - �[36mstep:  9  �[32mloss:  9.2326  �[39miter: �[34m 0.1007�[39m  data: �[34m0.0321  �[39mlr: �[33m0.0002�[39m
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-02-28 09:12:11 3161834:3161834 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-28 09:12:11 3161833:3161833 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-28 09:12:12,195 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-02-28 09:12:12,207 - root - INFO - �[36mstep: 10  �[32mloss:  9.1641  �[39miter: �[34m 0.0971�[39m  data: �[34m0.031  �[39mlr: �[33m0.0001�[39m
[rank0]:2024-02-28 09:12:12,207 - root - INFO - Average iter time: 0.0670 seconds
[rank0]:2024-02-28 09:12:12,207 - root - INFO - Average data load time: 0.0314 seconds
[rank0]:2024-02-28 09:12:12,208 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

---------

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>
as titled, we don't want to allow steps == -1 case as it would blow up
the lr scheduler
Add 7b config and adjust options to be more realistic

didn't add this to the train scripts as default as it's expensive to
init, whoever use it can adjust it accordingly
ghstack-source-id: f7ee3c867bfcdcae5dbb490982920606191e6f40
Pull Request resolved: pytorch#97
Summary:
Adding a description field, useful for integration tests to describe the
test.

Test Plan:
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] 
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0229 17:05:02.466000 140187679912960 torch/distributed/run.py:717] *****************************************
[rank1]:2024-02-29 17:05:04,269 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:04,510 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Starting job: debug training
[rank0]:2024-02-29 17:05:05,327 - root - INFO - Building llama
[rank0]:2024-02-29 17:05:05,335 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-02-29 17:05:05,335 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-02-29 17:05:06,782 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-02-29 17:05:06,782 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-02-29 17:05:07,347 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-02-29 17:05:07,349 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-02-29 17:05:07,349 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m
[rank0]:2024-02-29 17:05:07,349 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-02-29 17:05:08,375 - root - INFO - Applied FSDP to the model...
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-02-29 17:05:08,376 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240229-1705.
[rank0]:2024-02-29 17:05:08,610 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-02-29 17:05:10,570 - root - INFO - �[36mstep:  1  �[32mloss: 10.9183  �[39miter: �[34m 1.9258�[39m  data: �[34m0.0303  �[39mlr: �[33m0.00026667�[39m
[rank0]:2024-02-29 17:05:10,653 - root - INFO - �[36mstep:  2  �[32mloss: 10.8347  �[39miter: �[34m 0.0487�[39m  data: �[34m0.0336  �[39mlr: �[33m0.00053333�[39m
[rank0]:2024-02-29 17:05:10,733 - root - INFO - �[36mstep:  3  �[32mloss: 10.6861  �[39miter: �[34m  0.045�[39m  data: �[34m0.0334  �[39mlr: �[33m0.0008�[39m
[rank0]:2024-02-29 17:05:10,812 - root - INFO - �[36mstep:  4  �[32mloss: 10.4672  �[39miter: �[34m 0.0453�[39m  data: �[34m0.0336  �[39mlr: �[33m0.0007�[39m
[rank0]:2024-02-29 17:05:10,893 - root - INFO - �[36mstep:  5  �[32mloss: 10.2154  �[39miter: �[34m 0.0466�[39m  data: �[34m0.033  �[39mlr: �[33m0.0006�[39m
[rank0]:2024-02-29 17:05:10,975 - root - INFO - �[36mstep:  6  �[32mloss:  9.9573  �[39miter: �[34m 0.0496�[39m  data: �[34m0.0314  �[39mlr: �[33m0.0005�[39m
[rank0]:2024-02-29 17:05:11,056 - root - INFO - �[36mstep:  7  �[32mloss:  9.7627  �[39miter: �[34m 0.0486�[39m  data: �[34m0.0321  �[39mlr: �[33m0.0004�[39m
[rank0]:2024-02-29 17:05:11,201 - root - INFO - �[36mstep:  8  �[32mloss:   9.437  �[39miter: �[34m 0.0457�[39m  data: �[34m0.0333  �[39mlr: �[33m0.0003�[39m
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-02-29 17:05:11,317 - root - INFO - �[36mstep:  9  �[32mloss:  9.2446  �[39miter: �[34m 0.0794�[39m  data: �[34m0.0324  �[39mlr: �[33m0.0002�[39m
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-02-29 17:05:11 3368103:3368103 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-02-29 17:05:11 3368102:3368102 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-02-29 17:05:11,748 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-02-29 17:05:11,762 - root - INFO - �[36mstep: 10  �[32mloss:  9.1772  �[39miter: �[34m 0.0893�[39m  data: �[34m0.0324  �[39mlr: �[33m0.0001�[39m
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average iter time: 0.0578 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Average data load time: 0.0326 seconds
[rank0]:2024-02-29 17:05:11,763 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

Reviewers:

Subscribers:

Tasks:

Tags:

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>
ghstack-source-id: 1c5bf790d7473f6a24124051fcfa1fd2585a56f9
Pull Request resolved: pytorch#105
```
+ export USE_LIBUV=1
+ USE_LIBUV=1
+ TRAINER_DIR=/home/gnadathur/local/torchtrain
+ NGPU=4
+ LOG_RANK=0,1
+ CONFIG_FILE=./train_configs/debug_model.toml
+ torchrun --nproc_per_node=4 --rdzv_endpoint=localhost:5972 --local-ranks-filter 0,1 --role rank --tee 3 train.py --job.config_file ./train_configs/debug_model.toml
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] 
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0304 17:01:26.766000 140549371597824 torch/distributed/run.py:717] *****************************************
[rank0]:2024-03-04 17:01:28,834 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank1]:2024-03-04 17:01:28,857 - torchtrain.parallelisms - INFO - Building 1-D device mesh with ('dp',), [4]
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Starting job: debug training
[rank0]:2024-03-04 17:01:29,712 - root - INFO - Building llama
[rank0]:2024-03-04 17:01:29,719 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank0]:2024-03-04 17:01:29,719 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank1]:2024-03-04 17:01:31,187 - root - INFO - Reloaded SentencePiece model from ./torchtrain/datasets/tokenizer/tokenizer.model
[rank1]:2024-03-04 17:01:31,188 - root - INFO - #words: 32000 - BOS ID: 1 - EOS ID: 2
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model fully initialized via reset_params
[rank0]:2024-03-04 17:01:31,346 - root - INFO - Model built with: ModelArgs(dim=256, n_layers=2, n_heads=16, n_kv_heads=None, vocab_size=32000, multiple_of=256, ffn_dim_multiplier=None, norm_eps=1e-05, max_batch_size=32, max_seq_len=32768, depth_init=True)
[rank0]:2024-03-04 17:01:31,347 - root - INFO - �[34mModel llama debugmodel �[31msize: 18,089,216 total parameters�[39m
[rank0]:2024-03-04 17:01:31,347 - root - INFO - GPU memory usage: NVIDIA H100 (0): 95.0396 GiB capacity, 0.0 GiB in-use, 0.0% in-use
[rank0]:2024-03-04 17:01:32,502 - root - INFO - Applied FSDP to the model...
[rank0]:2024-03-04 17:01:32,503 - root - INFO - Gradient scaling not enabled.
[rank0]:2024-03-04 17:01:32,504 - root - INFO - Metrics logging active. Tensorboard logs will be saved at ./outputs/tb/20240304-1701.
[rank0]:2024-03-04 17:01:32,901 - root - INFO - Profiling active.  Traces will be saved at ./outputs/profiling/traces
[rank0]:2024-03-04 17:01:34,806 - root - INFO - �[36mstep:  1  �[32mloss: 10.8424  �[39miter: �[34m 1.8688�[39m  data: �[34m0.0316  �[39mlr: �[33m0.00026667�[39m
[rank0]:2024-03-04 17:01:34,891 - root - INFO - �[36mstep:  2  �[32mloss: 10.7581  �[39miter: �[34m 0.0476�[39m  data: �[34m0.0357  �[39mlr: �[33m0.00053333�[39m
[rank0]:2024-03-04 17:01:34,970 - root - INFO - �[36mstep:  3  �[32mloss: 10.6239  �[39miter: �[34m  0.045�[39m  data: �[34m0.0333  �[39mlr: �[33m0.0008�[39m
[rank0]:2024-03-04 17:01:35,048 - root - INFO - �[36mstep:  4  �[32mloss: 10.4163  �[39miter: �[34m 0.0455�[39m  data: �[34m0.0323  �[39mlr: �[33m0.0007�[39m
[rank0]:2024-03-04 17:01:35,127 - root - INFO - �[36mstep:  5  �[32mloss: 10.1529  �[39miter: �[34m 0.0459�[39m  data: �[34m0.032  �[39mlr: �[33m0.0006�[39m
[rank0]:2024-03-04 17:01:35,206 - root - INFO - �[36mstep:  6  �[32mloss:  9.8899  �[39miter: �[34m 0.0468�[39m  data: �[34m0.0311  �[39mlr: �[33m0.0005�[39m
[rank0]:2024-03-04 17:01:35,284 - root - INFO - �[36mstep:  7  �[32mloss:  9.7204  �[39miter: �[34m 0.0461�[39m  data: �[34m0.0312  �[39mlr: �[33m0.0004�[39m
[rank0]:2024-03-04 17:01:35,425 - root - INFO - �[36mstep:  8  �[32mloss:  9.3757  �[39miter: �[34m 0.0457�[39m  data: �[34m0.0319  �[39mlr: �[33m0.0003�[39m
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank0]:2024-03-04 17:01:35,537 - root - INFO - �[36mstep:  9  �[32mloss:  9.1883  �[39miter: �[34m 0.0762�[39m  data: �[34m0.0318  �[39mlr: �[33m0.0002�[39m
[rank0]:[rank0]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:314] Completed Stage: Warm Up
[rank1]:[rank1]:[W CPUAllocator.cpp:249] Memory block of unknown size was allocated before the profiling started, profiler results will not include the deallocation event
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank0]:STAGE:2024-03-04 17:01:35 3850444:3850444 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:320] Completed Stage: Collection
[rank1]:STAGE:2024-03-04 17:01:35 3850445:3850445 ActivityProfilerController.cpp:324] Completed Stage: Post Processing
[rank0]:2024-03-04 17:01:35,958 - root - INFO - exporting profile traces to ./outputs/profiling/traces/iteration_10
[rank0]:2024-03-04 17:01:35,971 - root - INFO - �[36mstep: 10  �[32mloss:  9.1212  �[39miter: �[34m 0.0808�[39m  data: �[34m0.0319  �[39mlr: �[33m0.0001�[39m
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average iter time: 0.0553 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Average data load time: 0.0317 seconds
[rank0]:2024-03-04 17:01:35,972 - root - INFO - Current Memory: NVIDIA H100 (0): Reserved: 9.6465%, Alloc 2.1969%, Active: 2.2%
[rank0]:Peak Memory: Reserved 9.65%, Alloc 8.43%, Active: 8.44%
[rank0]:num retries: 0, num ooms: 0
[rank0]:NCCL version 2.19.3+cuda12.0
```

Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>
This PR enables meta_init functionality to avoid OOM'ing on cpu for
larger models.
The core functionality is in meta_init.py, and a few changes in
parallelization and train.py.
Key items:
1 - this is largely the same as the earlier PR I had for meta_init, but
I did a new one b/c faster than reworking it with all the interim
changes.
2 - to address feedback in previous PR:
a - why do we need meta_init.py, can't we just do:
~~~
with torch.device("meta"):
    model = Model.from_args(...)
~~~
Unfortunately this does not work b/c the rope embeddings are treated
differently (buffer) and thus the simple lambda call from param_init_fn
in FSDP (lambda module: module.to_device('cuda') ) will not invoke or
move the rope embeddings and the model will fail on first forward.
This issue relates to the nn.embeddings not being moved, and that the
device is referenced in the forward pass for the current rope class.
Have opened pytorch#110 to track
this and investigate while not holding up meta init that is working from
landing.

b - per earlier feedback - meta init is now 'not optional' but simply
the default. This should ensure all models leverage it and ensure we
aren't missing things for future meta_init aspects.

3 - misc change - I switched the model_params to just do the normal all
params count instead of 'unique params' b/c it does not mesh with what
people perceive model size as.

Testing:
tested both debugmodel and 26B model with and without meta init to
confirm same loss curves.
Note for future reference - if you get a bad init (meta init failure)
you will simply not train (loss is same every iter).
If you fail to call reset params after FSDP, then you will train (b/c we
default to torch.randn_like) but your starting loss will be 5x+ higher
(telling you that you have not properly init'ed the model).
Co-authored-by: gnadathur <gnadathur@devvm4378.nao0.facebook.com>
wanchaol and others added 13 commits May 21, 2024 22:10
…ch#268)

This way we could temporarily enable 2-D parallel compile, and it might
make sense to do transformer block compile in the future with PP (which
we'll see).

We should figure out:
1. dynamic shape issue when turning on 2D parallel
2. full model compile issue for 2D parallel compile
3. cache reusing currently does not work, enable it later
previous change to use logging from torchtitan caused stdout not
to show up.

ghstack-source-id: 30a77c59ba68043ffa844be0443d5351d9584fab
Pull Request resolved: pytorch#352
mostly harmless bug, since output shape of last layer is not used for
send/recv purpose, the runtime value overrides it no matter what value
you configured it with.

However, since adding in/out shape validation to pipeline lib in torch,
this raises an error and has to be fixed.

ghstack-source-id: 950e41529b7b506085ab280d8a492e345eaefd24
Pull Request resolved: pytorch#354
APIs conform to the pytorch rules.  This PR should be able to land
safely after tonight's nightly pytorch build which includes the above
PR.

ghstack-source-id: c575bc7835472128c09798544caa38bf1908e5ca
Pull Request resolved: pytorch#356
After updating today, I found a whole slew of various new temp files
clogging up my source tab.
This PR screens these out so that they don't accidentally get added in a
PR and keeps your source tab change count correct.

Example of issue without this PR:
<img width="780" alt="Screenshot 2024-05-23 at 9 21 55 PM"
src="https://github.com/pytorch/torchtitan/assets/46302957/41b7061a-41a0-4a95-938b-3fd9292a2f38">

vs with this PR:
<img width="661" alt="Screenshot 2024-05-23 at 10 07 16 PM"
src="https://github.com/pytorch/torchtitan/assets/46302957/cccf8c5f-368d-40a8-b10f-f11ca37df2bc">
- switch to using public PipelineStage API
- clean up some asserts in tracer codepath

ghstack-source-id: 2d069b7d45c4f3c788dec8fc85d8a7e83e463fcd
Pull Request resolved: pytorch#357
ghstack-source-id: 4255cc792b9a221bc5a012e91db92533dcfa2243
Pull Request resolved: pytorch#339
ghstack-source-id: 8cbd62b97816ae8185b8a7e1aa9a7505f2780525
Pull Request resolved: pytorch#372
Usage:
`--test <test_id>`

Acceptable values: `test_id` in `build_test_list` (default: all)

Example:
```
rm -rf outputs && python test_runner.py outputs --test pp_gpipe
```
ghstack-source-id: 775591945ff5427cb7e5e9fc7592952b4c746341
Pull Request resolved: pytorch#373
Co-authored-by: Linsong Chu <lchu@us.ibm.com>
@facebook-github-bot
Copy link

Hi @daviswer!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@@ -120,7 +120,6 @@ def __init__(
"model": ModelWrapper(model),
"optimizer": OptimizerWrapper(model, optimizer),
"lr_scheduler": lr_scheduler,
"dataloader": dataloader,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break the usage of the existing dataloader checkpoint.

@facebook-github-bot
Copy link

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Meta Open Source bot. label May 31, 2024
@gnadathur gnadathur requested a review from gokulavasan May 31, 2024 17:26
@gokulavasan
Copy link
Contributor

Thanks for the PR! Reviewing it now

Copy link
Contributor

@gokulavasan gokulavasan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished one pass through the PR. Left comments to understand the code better

self.parser.add_argument(
"--dataset.datasets",
type=str,
default="c4_mini",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For multiple (sub) datasets, would this be specified as a list of str?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For multiple subdatasets, we have it set up to take a string of comma-separated names. There's an example comment in train_configs/llama3_8b.toml:

datasets = "CommonCrawl,C4,Github,Wikipedia,Books,ArXiv,StackExchange"
dataset_weights = "67,15,4.5,4.5,4.5,2.5,2"

Of course this can be changed if another format is more convenient

@@ -0,0 +1,2 @@
dataset/filename,documents,tokens
/c4_mini/c4_mini.arrow,45000,20505558
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i) Are all the tokens for a single doc present in one row+col?
ii) What is the need to pretokenize the data instead of storing raw text alone?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. The expected format is for each document to occupy one row in the pyarrow file. Because docs are different lengths, there's not really any consistent notion of columns anymore (or viewed another way, the doc rows all occupy the single "tokens" column of the shard file)
  2. Right now we pretokenize to enable chunking - currently, we can yield portions of a document at a time, without ever having to load the whole doc into memory at once, which is helpful in the case of extremely long docs. This gets more difficult for raw text since we have to split on tokens rather than on characters, but we don't know the token boundaries until after tokenization. Support for dynamic tokenization is on the todo list though!

"""
os.makedirs(path, exist_ok=True)
state = self.state_dict()
torch.save(state, os.path.join(path, f"loader_state_{self.rank}.pth"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the description, it looks like only num_workers=1 is supported per rank. Thus self.rank would be sufficient in this case. Just to confirm, for num_workers > 1, do you foresee each worker storing their state from the worker process directly to local disk?

Copy link
Author

@daviswer daviswer Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that's the eventual plan for supporting num_workers > 1


def _reshard(self, sharded_list):
"""
Sharded_list is a list of lists, where each "shard" sublist must have the same length.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on why "shard" sublist must have the same length? For example, when the input dataset folder is created with arrow files inside - does each file correspond to a shard? In that case, is each file expected to have the same length?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose it's not strictly necessary that shards have the same length, we just take that shortcut because this function is only ever used to reassign per-worker data structures from load_worldsize to worldsize. So far we've been able to assume that workers all hold the same fundamental data structures so this hasn't been an issue. This doesn't get called on any actual data shard files - apologies for unclear naming/commenting here

int(n_items * (self.rank + 1) / self.worldsize) - item_offset,
)
# Pull out owned items
return [sharded_list[i // shard_len][i % shard_len] for i in my_items]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the sharded list is a basically a list of list of indices? And does this return a list of indices that is owned by this rank? If that is the case, could this be a really large list for huge datasets?

Update: Looking at the code further down, this seems to be for in-memory buffers instead of sample/doc indices?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes your update is correct. We never actually materialize the full list of doc indices owned by each worker, because as you say, this gets very large.

for flag in self.state_params + self.reshard_params
]
else:
for flag in self.reshard_params:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For all parameters that are set as reshard params, this logic figures out which subset of indices in that list of ranges that it owns? Apart from indices assigned to a worker, what are the other states tend to lend well with re-sharding?

And looks like state params are discarded during reshard? Would RNG state be a typical state param?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes RNG state is a typical state param. Reshard params are generally buffers or sets of stateful subworkers

for flag in self.state_params + self.reshard_params
}

def _reshard(self, sharded_list):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this function doesn't place any restriction on what world_size could be in relation to load_world_size - is that correct?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct

# Swap out randomly sampled value from buffer
i = torch.randint(self.buffer_size, (1,), generator=self.generator).item()
out = self.buffer[i]
self.buffer[i] = next(dataset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

once the dataset runs out of data, how is the remaining items in the buffer drained?

Copy link
Author

@daviswer daviswer Jun 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The expectation is a sublayer that never runs out - currently, we just loop over epochs forever

newpath = os.path.join(self.data, shardid)
path, reader = self._get_reader(path, newpath, reader)
# Map id in range of owned docs to new (consistently) shuffled id
doclcg = self._random_map_docid(docrange)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to confirm, once a shard is chosen, all the docs within that shard is visited in random order but that shard is exhausted before moving to the next shard?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, once a shard is chosen, all the docs within the fraction of the shard owned by this worker are visited. But yes that's broadly correct.

self.percent_seen = 0
self.lcg_state = seed + rank

self.state_params = [
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see no reshard_params, so this dataset doesn't support resharding?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, this one particular dataset needs to sit under a Scalable_Shard_Dataset to support resharding

Copy link
Contributor

@gokulavasan gokulavasan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished one pass through the PR. Left comments to understand the code better

Copy link
Contributor

@gokulavasan gokulavasan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Finished one pass through the PR. Left comments to understand the code better

@daviswer
Copy link
Author

daviswer commented Jun 5, 2024

Thanks for taking a look! Left a bunch of responses, hopefully this brings clarity

return itemlist[start:end]


class _Stateful_Dataset(data.IterableDataset):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we stick with Camel case and remove underscores between words?

Re-sync fork to main, and add the latest features from fms-fsdp:

- Remove dependence on the separate metadata "count file" in the dataset directory
- Support n_workers > 1. This is accomplished by shunting all path/rank-dependent setup out of initialization and into a new setup() method, which runs after init but before any other op
- Support HF-style parquet raw text datasets, with tokenization on the fly (for reasonably sized documents/shardfiles)
- Support non-flat data directories: all legal files under the specified location will be included, regardless of depth or location. Enables simple weight-free dataset mixing via a single `StreamingDocDataset` on the parent directory
- `SamplingDataset` and `ScalableShardDataset`are now implemented as proper `_WrapperDatasets`, reflecting the intended modular usage
- Fix Weird_Separated_Camel_Case naming convention in favor of ProperClassNaming
- Allow `PreloadBufferDataset` to shrink back down to the desired size after rescaling to a smaller number of workers
@daviswer
Copy link
Author

Updating the dataloader to match the latest version in our public repo. Re-syncs to latest main, and incorporates many previously discussed features:

  • Remove dependence on the separate metadata "count file" in the dataset directory
  • Support n_workers > 1. This is accomplished by shunting all path/rank-dependent setup out of initialization and into a new setup() method, which runs after init but before any other op
  • Support HF-style parquet raw text datasets, with tokenization on the fly (for reasonably sized documents/shardfiles)
  • Support non-flat data directories: all legal files under the specified location will be included, regardless of depth or location. Enables simple weight-free dataset mixing via a single StreamingDocDataset on the parent directory
  • SamplingDataset and ScalableShardDatasetare now implemented as proper _WrapperDatasets, reflecting the intended modular usage
  • Fix Weird_Separated_Camel_Case naming convention in favor of ProperClassNaming
  • Allow PreloadBufferDataset to shrink back down to the desired size after rescaling to a smaller number of workers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CLA Signed This label is managed by the Meta Open Source bot.
Projects
None yet
Development

Successfully merging this pull request may close these issues.