Specify map location when loading model #272

Merged 10 commits into master on May 12, 2023
Conversation

ejm714 (Collaborator) commented May 5, 2023

Fixes #270
Fixes #271

netlify bot commented May 5, 2023

Deploy Preview for silly-keller-664934 ready!

Latest commit: 754afa2
Latest deploy log: https://app.netlify.com/sites/silly-keller-664934/deploys/645ac38c8b0a7800081422b6
Deploy Preview: https://deploy-preview-272--silly-keller-664934.netlify.app

codecov bot commented May 5, 2023

Codecov Report

Merging #272 (754afa2) into master (333f874) will not change coverage.
The diff coverage is 100.0%.

Additional details and impacted files
@@          Coverage Diff           @@
##           master    #272   +/-   ##
======================================
  Coverage    87.6%   87.6%           
======================================
  Files          26      26           
  Lines        2178    2178           
======================================
  Hits         1908    1908           
  Misses        270     270           
Impacted Files                          Coverage Δ
zamba/models/efficientnet_models.py     100.0% <100.0%> (ø)
zamba/models/model_manager.py           84.1% <100.0%> (ø)
zamba/models/slowfast_models.py         88.8% <100.0%> (ø)
zamba/pytorch_lightning/utils.py        97.7% <100.0%> (ø)

@ejm714 ejm714 requested a review from pjbull May 5, 2023 18:53
pjbull (Member) commented May 8, 2023

Tests fail for me on a GPU machine with these logs:

===================================================================== FAILURES ======================================================================
_________________________________________________________ test_not_use_default_model_labels _________________________________________________________

dummy_trained_model_checkpoint = PosixPath('/tmp/pytest-of-bull/pytest-0/dummy-model-dir3/my_model/version_0/dummy_model.ckpt')

    def test_not_use_default_model_labels(dummy_trained_model_checkpoint):
        """Tests that training a model using labels that are a subset of the model species but
        with use_default_model_labels=False replaces the model head."""
        original_model = DummyZambaVideoClassificationLightningModule.from_disk(
            dummy_trained_model_checkpoint
        )

        model = instantiate_model(
            checkpoint=dummy_trained_model_checkpoint,
            scheduler_config="default",
            labels=pd.DataFrame([{"filepath": "gorilla.mp4", "species_gorilla": 1}]),
            use_default_model_labels=False,
        )

>       assert (model.head.weight != original_model.head.weight).all()
E       RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

Not sure, but maybe the from_disk path needs to also check the device and load to the appropriate one?

Actually, I checked with pdb and here is the device each model is on. Looks like instantiate_model may not have done the right thing in this case, since the model it returns is on CPU even when a GPU is available.

(Pdb) model.device
device(type='cpu')
(Pdb) original_model.device
device(type='cuda', index=0)
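
For illustration, a minimal sketch of the device-aware load suggested here; the helper name is hypothetical and this is not the change the PR ends up making (the discussion below settles on always loading to CPU).

    # Sketch only: one way a checkpoint load could pick a map_location based on
    # available hardware, as suggested above. The PR ultimately standardizes on
    # loading to CPU instead (see below).
    import torch

    def load_checkpoint_device_aware(model_cls, path, **kwargs):
        # Prefer the GPU when one is available, otherwise fall back to CPU.
        map_location = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
        return model_cls.load_from_checkpoint(path, map_location=map_location, **kwargs)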

@@ -110,10 +110,8 @@ def instantiate_model(
    return resume_training(
        scheduler_config=scheduler_config,
        hparams=hparams,
        species=species,
ejm714 (Collaborator Author) commented on the diff:

[cleanup] removed unused params

ejm714 (Collaborator Author) commented May 9, 2023

@pjbull I learned a couple of useful things in debugging this.

1)

A LightningModule already keeps track of whether a GPU is available. We need to specify map_location to satisfy torch, but when we are finetuning we can use the self.device property rather than figuring out separately whether there's a GPU (see the sketch at the end of this comment). This simplifies the code changes needed a bit.

2)

instantiate_model is inconsistent about whether the returned model is on the CPU or the GPU.

  • If we are replacing the head or training from scratch (i.e. not just doing a plain load_from_checkpoint), the model is returned on CPU.
  • If we are just loading from a checkpoint to resume training, the model is returned on GPU when a GPU is available.
  • None of this matters functionally, because putting the model onto the desired device is handled by pl.Trainer in train_model and predict_model. In addition, the purpose of the affected tests is to check that the weights are correct (i.e. that the head has or hasn't been replaced), not where the model lives. Therefore, I just added a .to("cpu") for the single case where this discrepancy occurs.

Our options are to:

  • enforce that the model is always loaded onto CPU for instantiate_model
  • leave the discrepancy as is since devices will be sorted out by pl.Trainer
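
To make (1) and (2) concrete, a rough sketch under the assumption that the module exposes self.device (as any LightningModule does) and that the checkpoint stores a standard "state_dict" key; the class and method names here are illustrative, not the actual zamba API.

    # Sketch only: illustrates points (1) and (2) above; names are illustrative.
    import torch
    import pytorch_lightning as pl

    class FinetunableModule(pl.LightningModule):
        def load_pretrained_weights(self, checkpoint_path):
            # (1) The module already tracks its own device, so self.device can
            # be passed as the map_location that torch now requires.
            checkpoint = torch.load(checkpoint_path, map_location=self.device)
            self.load_state_dict(checkpoint["state_dict"], strict=False)

For (2), the test-side change is just moving the instantiated model to CPU before comparing head weights (e.g. model.to("cpu")), since only the weight values matter there.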

ejm714 (Collaborator Author) commented May 9, 2023

I've gone ahead and made some simplifications.

  • Now, anytime we're loading from a checkpoint in the code, we use the .from_disk method, which all models have. This always loads onto CPU.
  • Moving to GPU is handled by the devices parameter in pl.Trainer:

        accelerator, devices = configure_accelerator_and_devices_from_gpus(train_config.gpus)
        trainer = pl.Trainer(
            accelerator=accelerator,
            devices=devices,
            max_epochs=train_config.max_epochs,
            logger=tensorboard_logger,
            callbacks=callbacks,
            fast_dev_run=train_config.dry_run,
            strategy=DDPStrategy(find_unused_parameters=False)
            if (data_module.multiprocessing_context is not None) and (train_config.gpus > 1)
            else "auto",
        )
  • The utility function configure_accelerator_and_devices_from_gpus is where we ensure the user-specified number of GPUs gets passed to the Trainer (and that the accelerator matches); a rough sketch of such a helper follows below. The pydantic validation earlier ensures that you can't specify devices you don't have.
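
A hedged sketch of the shape such a helper could take; the signature and return values are assumptions, not the actual zamba implementation.

    # Sketch only: an assumed shape for the helper named above, not the actual
    # zamba implementation.
    from typing import Tuple, Union

    def configure_accelerator_and_devices_from_gpus(gpus: int) -> Tuple[str, Union[int, str]]:
        # Translate the user-specified GPU count into pl.Trainer's
        # accelerator/devices arguments.
        if gpus > 0:
            return "gpu", gpus
        return "cpu", "auto"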

Tests are passing locally in a fresh Python 3.9 environment on both a GPU and a non-GPU machine.

pjbull (Member) left a comment

It still seems a little weird to force CPU in all the from_disk calls, and it seems preferable to avoid that if possible. If not, the rest of the change looks great to me!

    # before:
    return cls.load_from_checkpoint(path)

    # after:
    def from_disk(cls, path: os.PathLike, **kwargs):
        # note: we always load models onto CPU; moving to GPU is handled by `devices` in pl.Trainer
        return cls.load_from_checkpoint(path, map_location="cpu", **kwargs)
pjbull (Member) commented:

Maybe this is fine, but forcing cpu here still feels a little weird to me. What if someone isn't using the trainer? Then they have to know and move devices themselves?

Just wondering if letting things happen automatically is a problem?

ejm714 (Collaborator Author) replied:

Just wondering if letting things happen automatically is a problem?

With the recent torch changes, we need to specify a map location when loading from a checkpoint (map_location cannot be None; that is the source of the failing tests). Trying to figure out at this point in the code whether the user has a GPU and wants to use it just adds redundancy and complexity in a way that is not helpful (it's hard to summarize succinctly here, but that is the route we initially tried and it was not a good option).

What if someone isn't using the trainer?

Whether they're using our train_model code or writing their own PTL code, they'll still be using the trainer. If for some reason they're using our code to load the model but then not using PTL, then yes, they'll have to know about and move devices themselves.
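
As a usage note for that case, a minimal illustrative snippet; MyZambaModule is a hypothetical stand-in for any model class, and the checkpoint path is illustrative.

    # Illustrative only: MyZambaModule stands in for any model class with the
    # `from_disk` classmethod shown above.
    import torch

    model = MyZambaModule.from_disk("dummy_model.ckpt")  # always loads to CPU
    if torch.cuda.is_available():
        model = model.to("cuda")  # without pl.Trainer, the user moves devices explicitly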

@ejm714 ejm714 merged commit ee693e7 into master May 12, 2023
@ejm714 ejm714 deleted the device-update branch May 12, 2023 17:10
Successfully merging this pull request may close these issues:

  • Failed build on master branch (tests #579)
  • Failed build on master branch (tests #578)