
multi-node distributed training with spark #935

Merged
merged 17 commits into main from multi-node2
Apr 10, 2024
Conversation

jmoralez
Member

@jmoralez jmoralez commented Mar 19, 2024

Adds the functionality to perform distributed data-parallel training with Spark. The logic is as follows:

  • The user provides a Spark dataframe and sets a configuration specifying how many nodes there are and how many GPUs each node has.
  • We'll then have one task per GPU in the cluster, so we partition the dataframe accordingly.
  • We save the partitioned dataframe and get the names of the generated parquet files.
  • Each task computes its global rank, loads its corresponding file, and uses it for training.
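The rank/partition bookkeeping in the steps above can be sketched roughly as follows (the function names and file layout here are illustrative, not the actual neuralforecast API):

```python
# Hypothetical sketch: with N nodes x G GPUs, the dataframe is written as
# N*G parquet partitions, and each task derives a unique global rank from
# its node index and local GPU index to pick its file.

def global_rank(node_rank: int, local_rank: int, gpus_per_node: int) -> int:
    """Unique rank for each GPU task across the whole cluster."""
    return node_rank * gpus_per_node + local_rank

def assign_file(files: list, node_rank: int, local_rank: int, gpus_per_node: int) -> str:
    """Each task loads exactly the parquet partition matching its global rank."""
    return files[global_rank(node_rank, local_rank, gpus_per_node)]

# Example: 2 nodes x 2 GPUs -> 4 partitions
files = [f"part-{i:05d}.parquet" for i in range(4)]
print(assign_file(files, node_rank=1, local_rank=0, gpus_per_node=2))  # part-00002.parquet
```

This one-file-per-rank scheme avoids any shuffling at training time: each worker only reads its own partition.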

There were a couple of challenges:

  • The final model is serialized and sent back to the driver, so we have to make sure it doesn't contain anything exotic (to avoid pickling errors); we therefore remove the _trainer attribute (and thus the trainer property) from the model.
  • The models' save method used the trainer's save_checkpoint method. Since we no longer have a trainer, this PR implements very simple methods to save and load models using only the init params and the weights (which also makes the files smaller). The premise is that we don't actually need everything the checkpoint contains in order to load the model for inference. This tries to maintain backward compatibility by using the same key names as PyTorch Lightning (hyper_parameters and state_dict).
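A minimal sketch of that save/load scheme, assuming a toy model class (the real code would serialize an actual PyTorch state_dict, e.g. via torch.save; TinyModel and its weights dict here are stand-ins):

```python
import io
import pickle

# Sketch (not the actual neuralforecast implementation): persist only the
# init params and the weights, under the same keys PyTorch Lightning uses,
# so the saved file still looks like a regular checkpoint dict.

class TinyModel:
    def __init__(self, input_size, h):
        self.input_size = input_size
        self.h = h
        self.weights = {"layer.weight": [0.0] * input_size}  # stand-in for a state_dict

    def save(self, buf):
        # Keep the Lightning key names for backward compatibility.
        ckpt = {
            "hyper_parameters": {"input_size": self.input_size, "h": self.h},
            "state_dict": self.weights,
        }
        pickle.dump(ckpt, buf)

    @classmethod
    def load(cls, buf):
        ckpt = pickle.load(buf)
        model = cls(**ckpt["hyper_parameters"])  # rebuild from init params...
        model.weights = ckpt["state_dict"]       # ...then restore the weights
        return model

buf = io.BytesIO()
TinyModel(input_size=3, h=2).save(buf)
buf.seek(0)
restored = TinyModel.load(buf)
```

Because only hyper_parameters and state_dict are stored, the file carries no trainer state, which is what keeps it small and pickle-safe.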

Also makes the following change, which isn't strictly necessary and could be made in a separate PR:

  • Ensures that the original aliases are preserved when saving and loading models. Right now, loading a saved model falls back to the default alias, so if an AutoNHITS was trained with the alias 'my_model', predictions made after loading it will come back in a column named 'NHITS' instead of 'my_model'.
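The fix amounts to storing the alias alongside the other init params. A hypothetical sketch (AutoNHITS and the alias attribute come from the PR description; the checkpoint plumbing here is illustrative):

```python
# Sketch of preserving a user-set alias across save/load.

class Model:
    def __init__(self, alias=None):
        self.alias = alias  # prediction column name; defaults to the class name

    @property
    def name(self):
        return self.alias if self.alias is not None else type(self).__name__

class NHITS(Model):
    pass

def to_ckpt(model):
    # Persist the alias alongside the other init params...
    return {"hyper_parameters": {"alias": model.alias}}

def from_ckpt(cls, ckpt):
    # ...so the restored model keeps its original prediction column name.
    return cls(**ckpt["hyper_parameters"])

m = NHITS(alias="my_model")
restored = from_ckpt(NHITS, to_ckpt(m))
print(restored.name)  # my_model
```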


@jmoralez jmoralez marked this pull request as ready for review March 20, 2024 17:34
Contributor
@cchallu cchallu left a comment

Amazing work! We should still wait for Azul's review before merging.

Member
@AzulGarza AzulGarza left a comment

awesome @jmoralez!🎉

@AzulGarza AzulGarza self-requested a review April 10, 2024 05:02
@jmoralez jmoralez merged commit 8121bfc into main Apr 10, 2024
17 checks passed
@jmoralez jmoralez deleted the multi-node2 branch April 10, 2024 16:21

3 participants