
change Checkpoint callback's save_best_only to save_top_k #128

Merged (50 commits) Nov 19, 2019

Conversation

Ir1d
Contributor

@Ir1d Ir1d commented Aug 17, 2019

This PR should close #70.
Moving save_function outside of pytorch_lightning/callbacks/pt_callbacks.py is really confusing, since then pt_callbacks.py cannot be run on its own 🌚
Not sure where to add the tests, so I appended them at the end of the file.

@Ir1d Ir1d changed the title change Checkpoint callback's save_best_only to `save_top_k change Checkpoint callback's save_best_only to save_top_k Aug 17, 2019
@williamFalcon
Contributor

add the test to test_models.py

@williamFalcon
Contributor

To test, set save_function manually. Run a few cases of k (k=0, k=1, k=2) and check that the folder contents match expectations for every case (those are the assertions).
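A sketch of that test pattern, using a stand-in saver rather than the real ModelCheckpoint (the class name TopKSaver and its bookkeeping are hypothetical, for illustration only):

```python
import os
import tempfile

class TopKSaver:
    """Stand-in illustrating save_top_k bookkeeping; not the actual ModelCheckpoint."""

    def __init__(self, dirpath, save_top_k=1):
        self.dirpath = dirpath
        self.save_top_k = save_top_k
        self.best = {}  # filepath -> monitored value (lower is better)

    def save_function(self, filepath):
        # stand-in for the trainer-provided save_function
        with open(filepath, "w") as f:
            f.write("checkpoint")

    def on_epoch_end(self, epoch, val_loss):
        if self.save_top_k == 0:
            return  # k=0 saves nothing
        filepath = os.path.join(self.dirpath, f"epoch_{epoch}.ckpt")
        if len(self.best) < self.save_top_k:
            self.best[filepath] = val_loss
            self.save_function(filepath)
        else:
            # evict the worst of the current top k if this one beats it
            worst = max(self.best, key=self.best.get)
            if val_loss < self.best[worst]:
                os.remove(worst)
                del self.best[worst]
                self.best[filepath] = val_loss
                self.save_function(filepath)

# run a few cases of k and inspect the folder contents, as suggested above
losses = [0.5, 0.3, 0.4, 0.2]
results = {}
for k in (0, 1, 2):
    with tempfile.TemporaryDirectory() as tmpdir:
        saver = TopKSaver(tmpdir, save_top_k=k)
        for epoch, loss in enumerate(losses):
            saver.on_epoch_end(epoch, loss)
        results[k] = sorted(os.listdir(tmpdir))

print(results)  # k=0 -> no files; k=1 -> best only; k=2 -> the two best
```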

@williamFalcon
Contributor

@Ir1d i'll take a look at the PR once the tests are added. awesome PR!

(Review threads on pytorch_lightning/callbacks/pt_callbacks.py, resolved)
try:
    shutil.rmtree(path_to_delete)
except OSError:
    os.remove(path_to_delete)
Contributor

You should only remove files this callback saved. As written, this would also remove checkpoints the user drags in manually, or ones saved by SLURM.

Contributor Author

I'm not sure what del_model and save_model are intended to do. In my experiments I noticed that the original implementation simply deletes all the models in the corresponding folder. I modified the functions so that they delete only the model at filepath. AFAIK, these two functions are called with the exact model filepath.
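A minimal version of that behavior, deleting only the exact checkpoint path the callback wrote so manually added or SLURM checkpoints survive (the helper name _del_model here is just illustrative):

```python
import os
import tempfile

def _del_model(filepath):
    # delete only the checkpoint this callback saved, never the whole
    # directory, so user-provided or SLURM checkpoints are left alone
    if os.path.isfile(filepath):
        os.remove(filepath)

# example: a manually dropped-in checkpoint is left untouched
tmpdir = tempfile.mkdtemp()
ours = os.path.join(tmpdir, "epoch_2.ckpt")
manual = os.path.join(tmpdir, "manual_backup.ckpt")
for path in (ours, manual):
    with open(path, "w") as f:
        f.write("checkpoint")

_del_model(ours)
remaining = sorted(os.listdir(tmpdir))
print(remaining)
```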

@Ir1d
Contributor Author

Ir1d commented Nov 5, 2019

@williamFalcon ping

(Review threads on docs/Trainer/Checkpointing.md, resolved)
@williamFalcon
Contributor

@Ir1d looks like GPU tests fail

__________________________________________________________________________________ test_amp_gpu_dp __________________________________________________________________________________

    def test_amp_gpu_dp():
        """
        Make sure DP + AMP work
        :return:
        """
        testing_utils.reset_seed()

        if not testing_utils.can_run_gpu_test():
            return

        model, hparams = testing_utils.get_model()
        trainer_options = dict(
            max_nb_epochs=1,
            gpus='0, 1',  # test init with gpu string
            distributed_backend='dp',
            use_amp=True
        )
        with pytest.raises(MisconfigurationException):
>           testing_utils.run_gpu_model_test(trainer_options, model, hparams)

tests/test_z_amp.py:204:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/testing_utils.py:62: in run_gpu_model_test
    checkpoint = init_checkpoint_callback(logger)
tests/testing_utils.py:238: in init_checkpoint_callback
    checkpoint = ModelCheckpoint(ckpt_dir)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <pytorch_lightning.callbacks.pt_callbacks.ModelCheckpoint object at 0x7efdb03242e8>
filepath = '/private/home/falc/Developer/pytorch-lightning/tests/save_dir/lightning_logs/version_22/checkpoints', monitor = 'val_loss', verbose = 0, save_top_k = 1
save_weights_only = False, mode = 'auto', period = 1, prefix = ''

    def __init__(self, filepath, monitor='val_loss', verbose=0,
                 save_top_k=1, save_weights_only=False,
                 mode='auto', period=1, prefix=''):
        super(ModelCheckpoint, self).__init__()
        if (
>           save_best_only and
            os.path.isdir(filepath) and
            len(os.listdir(filepath)) > 0
        ):
E       NameError: name 'save_best_only' is not defined

pytorch_lightning/callbacks/pt_callbacks.py:189: NameError
================================================================================= warnings summary ==================================================================================
/private/home/falc/.conda/envs/lightning/lib/python3.7/site-packages/av/container/__init__.py:1
  /private/home/falc/.conda/envs/lightning/lib/python3.7/site-packages/av/container/__init__.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
    from .core import Container, open

tests/test_a_restore_models.py::test_running_test_pretrained_model_ddp
  /private/home/falc/.conda/envs/lightning/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216, got 192
    return f(*args, **kwds)

tests/test_a_restore_models.py::test_running_test_pretrained_model_ddp
tests/test_a_restore_models.py::test_running_test_pretrained_model_ddp
tests/test_a_restore_models.py::test_running_test_pretrained_model_ddp
  /private/home/falc/.conda/envs/lightning/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
    return f(*args, **kwds)

-- Docs: https://docs.pytest.org/en/latest/warnings.html
================================================================= 38 failed, 48 passed, 5 warnings in 27.44 seconds ====
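For context, the NameError above is a leftover reference to the removed save_best_only flag inside the renamed constructor's guard. A corrected check under the new name might look like this (a sketch, not the actual fix commit; the function name is hypothetical):

```python
import os
import tempfile

def warn_if_dir_not_empty(filepath, save_top_k):
    # the guard that raised NameError above, rewritten against save_top_k:
    # flag the case where checkpoints will actually be written
    # (save_top_k != 0) into a directory that already contains files
    return bool(
        save_top_k
        and os.path.isdir(filepath)
        and len(os.listdir(filepath)) > 0
    )

tmpdir = tempfile.mkdtemp()
empty_dir_warns = warn_if_dir_not_empty(tmpdir, save_top_k=1)
with open(os.path.join(tmpdir, "old.ckpt"), "w") as f:
    f.write("checkpoint")
nonempty_warns = warn_if_dir_not_empty(tmpdir, save_top_k=1)
disabled_warns = warn_if_dir_not_empty(tmpdir, save_top_k=0)
print(empty_dir_warns, nonempty_warns, disabled_warns)
```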

@Ir1d
Contributor Author

Ir1d commented Nov 5, 2019

It's because master changed. A fix will be ready soon.

@Ir1d
Contributor Author

Ir1d commented Nov 5, 2019

[image: preview of the edits on the docs]

@awaelchli
Contributor

@Ir1d how about just writing k >= 1 instead of k (k >= 1)? It's cleaner.

@williamFalcon
Contributor

@Ir1d i don't think this can go in this release. let's land it soon and test for a bit on master before releasing.

@Ir1d
Contributor Author

Ir1d commented Nov 6, 2019

Fine. I just don't feel like merging the master branch into this PR again.

@williamFalcon
Contributor

williamFalcon commented Nov 6, 2019

totally. but i imagine the longer we delay, the harder it'll become to get this PR added.

checkpointing is very important for research, i don’t want to rush this and mess up a bunch of projects


Also, if `save_top_k` >= 2 and the callback is called multiple
times inside an epoch, the name of the saved file will be
appended with a version count starting with `v0`.
Contributor

I understand this PR has been out for a while, so it's okay if this change isn't implemented. But at Latent Space we've implemented a checkpointing system within an epoch (we have a use case of epochless datasets), and we rely heavily on having the iteration in the name for some of our checkpoint analysis. If we don't think we're saving multiple times in the same iteration, we could consider using the iteration number.

Contributor Author

+1 on considering iteration numbers.
But I think that's too much for this PR; perhaps open a separate issue?
You see, we expect so many things from this single PR, and it has been blocked practically forever. And I have to merge master from time to time.
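The version-suffix behavior quoted above (a v0, v1, ... count appended when save_top_k >= 2 and the callback is called multiple times inside an epoch) could be sketched like this; the helper name is hypothetical:

```python
import os

def versioned_filepath(dirpath, base, existing):
    # first save of this name gets no suffix; later saves of the same name
    # get _v0, _v1, ... appended, mirroring the docstring quoted above
    filepath = os.path.join(dirpath, f"{base}.ckpt")
    if filepath not in existing:
        return filepath
    version = 0
    while os.path.join(dirpath, f"{base}_v{version}.ckpt") in existing:
        version += 1
    return os.path.join(dirpath, f"{base}_v{version}.ckpt")

existing = set()
first = versioned_filepath("ckpts", "epoch_3", existing)
existing.add(first)
second = versioned_filepath("ckpts", "epoch_3", existing)
existing.add(second)
third = versioned_filepath("ckpts", "epoch_3", existing)
print(first, second, third)
```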

@williamFalcon
Contributor

@jeffling feel free to bring this PR up to speed. Add your use case here as well, as other production teams will find it helpful.

Ir1d added 2 commits November 17, 2019 01:09
accidentally pressed wrong button when solving conflicts
@Ir1d
Contributor Author

Ir1d commented Nov 17, 2019

@williamFalcon ping

@Borda Borda mentioned this pull request Nov 19, 2019
@williamFalcon williamFalcon merged commit 7324dd9 into Lightning-AI:master Nov 19, 2019
@jeffling
Contributor

> @jeffling feel free to bring this PR up to speed. Add your use case here as well, as other production teams will find it helpful.

Will be adding in a follow-up PR.

@5n7-sk 5n7-sk mentioned this pull request Mar 24, 2020