Run big models with DDP/FSDP instead of torch.nn.DataParallel
#683
Labels: discussion, enhancement, help wanted, new feature
1. Feature description
Enable PyPOTS to train models across multiple GPUs with DDP (Distributed Data Parallel, https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) or FSDP (Fully Sharded Data Parallel, https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html).
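As a rough illustration of what the target looks like, here is a minimal, self-contained DDP sketch. It is not PyPOTS code: the toy model, random data, and training loop are placeholders, and how PyPOTS would expose its inner `nn.Module` for wrapping is exactly what this issue needs to work out. Each GPU gets its own process, launched with `torchrun --nproc_per_node=<num_gpus> ddp_sketch.py`:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # placeholder model standing in for a PyPOTS model's inner nn.Module
    model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1)).to(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    # DistributedSampler shards the dataset so each rank sees a distinct slice
    dataset = TensorDataset(torch.randn(1024, 32), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset)
    loader = DataLoader(dataset, batch_size=64, sampler=sampler)

    optimizer = torch.optim.Adam(ddp_model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for epoch in range(2):
        sampler.set_epoch(epoch)  # reshuffle the shards each epoch
        for x, y in loader:
            x, y = x.to(local_rank), y.to(local_rank)
            optimizer.zero_grad()
            loss = loss_fn(ddp_model(x), y)
            loss.backward()  # DDP all-reduces gradients across ranks here
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

Unlike `torch.nn.DataParallel`, which replicates the model inside one process and bottlenecks on GIL and scatter/gather overhead, DDP keeps one process per GPU and only synchronizes gradients.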
2. Motivation
The current multi-GPU training in PyPOTS, implemented with `torch.nn.DataParallel`, is not sufficient for training big models like Time-LLM (e.g. #675: Time-LLM easily runs OOM even on short-length TS samples). We need a more advanced mechanism like DDP or FSDP; a rough FSDP sketch is included at the end of this issue.

3. Your contribution
I would like to lead or arrange the development task. Please leave comments below to start a discussion if you're interested; more comments will help prioritize this feature.
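For models whose parameters do not even fit on a single GPU, FSDP goes further than DDP by sharding parameters, gradients, and optimizer state across ranks, gathering layers only while they are needed in forward/backward. Below is a hedged sketch under the same assumptions as the DDP example above (toy model, `torchrun` launch); the wrapping policy and how this would hook into PyPOTS's trainers are open questions:

```python
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(local_rank)

    # placeholder stack standing in for a big PyPOTS model (e.g. an LLM backbone)
    model = nn.Sequential(*[nn.Linear(1024, 1024) for _ in range(8)])
    # FSDP shards the wrapped module; each rank holds only its shard plus
    # whatever is temporarily gathered during forward/backward
    fsdp_model = FSDP(model.to(local_rank))

    # the optimizer must be built AFTER wrapping so it sees the sharded params
    optimizer = torch.optim.AdamW(fsdp_model.parameters(), lr=1e-4)
    for _ in range(10):
        x = torch.randn(16, 1024, device=local_rank)
        loss = fsdp_model(x).pow(2).mean()  # dummy objective for the sketch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

In practice we would likely pass an `auto_wrap_policy` so each transformer block becomes its own FSDP unit rather than sharding the whole model as one flat unit, but that tuning belongs to the actual implementation work.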