FSDP checkpoints don't load when run is restarted with greater world size #811
Comments
Yes, the dataloader and learning rate scheduler do not support resharding, but model and optimizer states do, so we may be able to add selective resharding support.
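For reference, a minimal sketch of the DCP save/load pattern that makes model/optimizer resharding possible. This is an illustration using the public `torch.distributed.checkpoint` APIs, not torchtitan's actual checkpoint code; the tiny model, optimizer, and checkpoint path are made up:

```python
# Minimal sketch (not torchtitan's checkpoint code) of the DCP pattern that gives
# model and optimizer states their resharding behavior. In torchtitan these state
# dicts hold FSDP-sharded DTensors, and dcp.load() redistributes them onto the
# device mesh of the *current* run, which is why a different dp_shard degree is
# fine for model/optimizer states but not for the per-dp-rank dataloader state.
import torch
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import get_state_dict, set_state_dict

model = torch.nn.Linear(16, 16)                    # stands in for the FSDP-wrapped model
optimizer = torch.optim.AdamW(model.parameters())
model(torch.randn(2, 16)).sum().backward()
optimizer.step()                                   # populate optimizer state before saving

# Save: entries are keyed by fully qualified tensor names, not by rank, so the
# on-disk layout does not bake in the dp degree used at save time.
model_sd, optim_sd = get_state_dict(model, optimizer)
dcp.save({"model": model_sd, "optim": optim_sd}, checkpoint_id="outputs/sketch-step-10")

# Load: the same call works even if the current world size differs from the one
# used at save time, because DCP reshards tensors onto the current placements.
model_sd, optim_sd = get_state_dict(model, optimizer)
dcp.load({"model": model_sd, "optim": optim_sd}, checkpoint_id="outputs/sketch-step-10")
set_state_dict(model, optimizer, model_state_dict=model_sd, optim_state_dict=optim_sd)
```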
Before supporting this, we should error out in the data loader if the world size becomes smaller; otherwise it is a silent error.
We added error messages in #816 for loading with a lower dp_degree from a checkpoint saved with a higher dp_degree.
For the error summarized here, which loads with a higher dp_degree from a checkpoint saved with a lower dp_degree, we plan to support optional checkpoint loading as the next step.
…m checkpoint (#816)

Solves the issue here (#811) by preventing users from running with data loader resharding, which is not supported yet. Checkpoint loading behavior before this PR:

- Case 1 (save dp:4 -> load dp:4): checkpoint loads successfully, as expected.
- Case 2 (save dp:4 -> load dp:2): runs successfully, but `dataloader.dp_rank_2` and `dataloader.dp_rank_3` are silently missing from the loaded state.
- Case 3 (save dp:2 -> load dp:4): raises an error that `dataloader.dp_rank_2` and `dataloader.dp_rank_3` are not found in the checkpoint state_dict.

This PR raises an error in Case 2, since dataloader info is missing. We store `dp_degree` (i.e. `dp_world_size`) in the dataloader state_dict; after loading from a checkpoint, we compare it with the current `dp_degree`. Tested with Case 2, loading from the checkpoint at step 3:

```
[rank0]:2025-02-03 13:39:06,055 - root - INFO - Starting job: Llama 3 8B training
[rank0]:2025-02-03 13:39:06,866 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:2025-02-03 13:39:06,868 - root - INFO - CUDA capacity: NVIDIA H100 with 95.00GiB memory
[rank0]:2025-02-03 13:39:06,920 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:2025-02-03 13:39:06,920 - root - INFO - Building 2-D device mesh with ['dp_shard', 'tp'], [2, 4]
[rank0]:2025-02-03 13:39:08,099 - root - INFO - Building tiktoken tokenizer locally from ./torchtitan/datasets/tokenizer/original/tokenizer.model
[rank0]:2025-02-03 13:39:08,283 - root - INFO - TikTokenizer built: #words 128256, BOS ID 128000, EOS ID 128001
[rank0]:2025-02-03 13:39:08,284 - root - INFO - Preparing c4 dataset from allenai/c4
[rank0]:2025-02-03 13:39:13,047 - root - INFO - Building llama3 8B with ModelArgs(dim=4096, n_layers=32, n_heads=32, n_kv_heads=8, vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3, norm_eps=1e-05, rope_theta=500000, max_seq_len=8192, depth_init=True, norm_type='rmsnorm')
[rank0]:2025-02-03 13:39:13,182 - root - INFO - Model llama3 8B size: 8,030,261,248 total parameters
[rank0]:2025-02-03 13:39:13,252 - root - INFO - Applied Tensor Parallelism to the model
[rank0]:2025-02-03 13:39:13,253 - root - INFO - Applied selective activation checkpointing to the model
[rank0]:2025-02-03 13:39:13,296 - root - INFO - Compiling each TransformerBlock with torch.compile
[rank0]:2025-02-03 13:39:13,386 - root - INFO - Applied FSDP to the model
[rank0]:NCCL version 2.21.5+cuda12.0
[rank0]:2025-02-03 13:39:13,606 - root - INFO - CUDA memory usage for model: 3.77GiB(3.97%)
[rank0]:2025-02-03 13:39:13,607 - root - INFO - Checkpointing active. Checkpoints will be loaded from and saved to ./outputs/checkpoint
[rank0]:2025-02-03 13:39:13,607 - root - INFO - Loading the checkpoint at step 2.
[rank0]:[rank0]: Traceback (most recent call last):
[rank0]:[rank0]:   File "/data/users/.../torchtitan/train.py", line 433, in <module>
[rank0]:[rank0]:     main(config)
[rank0]:[rank0]:   File "/data/users/.../pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
[rank0]:[rank0]:     return f(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/.../torchtitan/train.py", line 214, in main
[rank0]:[rank0]:     checkpoint.load(step=job_config.checkpoint.load_step)
[rank0]:[rank0]:   File "/data/users/.../torchtitan/torchtitan/checkpoint.py", line 441, in load
[rank0]:[rank0]:     dcp.load(
[rank0]:[rank0]:   File "/data/users/.../pytorch/torch/distributed/checkpoint/logger.py", line 83, in wrapper
[rank0]:[rank0]:     result = func(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/.../pytorch/torch/distributed/checkpoint/utils.py", line 438, in inner_func
[rank0]:[rank0]:     return func(*args, **kwargs)
[rank0]:[rank0]:   File "/data/users/.../pytorch/torch/distributed/checkpoint/state_dict_loader.py", line 188, in load
[rank0]:[rank0]:     elem.load_state_dict(statetful_sd[key])
[rank0]:[rank0]:   File "/data/users/.../torchtitan/torchtitan/datasets/hf_datasets.py", line 178, in load_state_dict
[rank0]:[rank0]:     self._world_size == state_dict["world_size"]
[rank0]:[rank0]: AssertionError: dp_degree is inconsistent before and after checkpoint, DataLoader resharding is not supported yet.
```
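A rough sketch of the guard described above; the class and the `token_offset` field are illustrative stand-ins (the real change lives in `torchtitan/datasets/hf_datasets.py`, see #816), and only the `world_size` comparison mirrors the behavior shown in the traceback:

```python
# Illustrative sketch of the dp_degree consistency check described in the PR.
# The class and the "token_offset" field are hypothetical; only the world_size
# comparison mirrors the actual #816 behavior shown in the traceback above.
from typing import Any, Dict


class ShardedDataLoaderState:
    """Per-dp-rank dataloader state that records the dp degree it was sharded for."""

    def __init__(self, dp_rank: int, dp_world_size: int):
        self._dp_rank = dp_rank
        self._world_size = dp_world_size
        self._token_offset = 0  # stand-in for real per-rank dataset progress

    def state_dict(self) -> Dict[str, Any]:
        # Persist the dp degree alongside per-rank progress so a mismatch is
        # detected at load time instead of failing silently.
        return {"world_size": self._world_size, "token_offset": self._token_offset}

    def load_state_dict(self, state_dict: Dict[str, Any]) -> None:
        assert self._world_size == state_dict["world_size"], (
            "dp_degree is inconsistent before and after checkpoint, "
            "DataLoader resharding is not supported yet."
        )
        self._token_offset = state_dict["token_offset"]
```

This catches Case 2 at load time; Case 3 already fails earlier because the `dataloader.dp_rank_N` keys for the extra ranks are not present in the checkpoint.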
We can close this issue after #819 is landed.
A checkpoint is saved from an 8-GPU run with `dp_shard` set to 8 and all other parallelisms set to 1. My understanding is that this is configured as an FSDP run. The checkpoint is then resumed on 16 GPUs with `dp_shard` now set to 16. When loading the checkpoint, we get this error:

My understanding is that torch distributed checkpoints are supposed to support dynamic resharding at load time. Does this not work with torchtitan?
I was able to successfully resume a checkpoint going down from 32 GPUs to 16.