change Checkpoint callback's save_best_only to save_top_k #128
Conversation
add the test to test_models.py
to test, set the save_function manually. do a few cases of k (k=0, k=1, k=2) and check that the folder contents are as expected for each case (those are the assertions)
@Ir1d i'll take a look at the PR once the tests are added. awesome PR!
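A minimal sketch of what such a test could look like (assumed API: the callback keeps a Keras-style `on_epoch_end(epoch, logs)` hook and a pluggable `save_function`, matching the `ModelCheckpoint(filepath, monitor, verbose, save_top_k, ...)` signature visible in the traceback below; the helper name is illustrative):

```python
import os
import torch
from pytorch_lightning.callbacks import ModelCheckpoint


def mock_save_function(filepath):
    # stand-in for the trainer's real save routine: just write a dummy file
    torch.save('test', filepath)


def run_save_top_k_case(save_dir, k, losses, expected_count):
    checkpoint = ModelCheckpoint(save_dir, save_top_k=k, monitor='val_loss')
    checkpoint.save_function = mock_save_function  # set manually, per the suggestion
    for epoch, loss in enumerate(losses):
        checkpoint.on_epoch_end(epoch, logs={'val_loss': loss})
    # the assertion: the folder holds exactly the expected number of checkpoints
    assert len(os.listdir(save_dir)) == expected_count
```

With `k=0` no files should appear, with `k=1` only the single best, and with `k=2` the two best.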
```python
try:
    shutil.rmtree(path_to_delete)
except OSError:
    os.remove(path_to_delete)
```
you should only remove files this callback saved. For instance, this would remove other checkpoints the user drags in manually, or the ones saved by SLURM
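A hypothetical sketch of that suggestion (illustrative names, not the PR's final code): have the callback remember every path it wrote and refuse to delete anything else.

```python
import os


class TrackedDeletionMixin:
    """Only delete checkpoints this callback created."""

    def __init__(self):
        self._saved_paths = set()  # every path this callback wrote

    def _save_model(self, filepath):
        self.save_function(filepath)
        self._saved_paths.add(filepath)

    def _del_model(self, filepath):
        if filepath not in self._saved_paths:
            return  # user-dragged or SLURM-saved checkpoint: leave it alone
        if os.path.isfile(filepath):
            os.remove(filepath)
        self._saved_paths.discard(filepath)
```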
I'm not sure what `del_model` and `save_model` are intended to be. Through my experiments I noticed that the original implementation simply deleted all the models in the corresponding folder. I modified the functions so that they delete only the model at `filepath`. AFAIK, these two functions are called with the exact model filepath.
@williamFalcon ping
@Ir1d looks like GPU tests fail:
__________________________________________________________________________________ test_amp_gpu_dp __________________________________________________________________________________
def test_amp_gpu_dp():
"""
Make sure DP + AMP work
:return:
"""
testing_utils.reset_seed()
if not testing_utils.can_run_gpu_test():
return
model, hparams = testing_utils.get_model()
trainer_options = dict(
max_nb_epochs=1,
gpus='0, 1', # test init with gpu string
distributed_backend='dp',
use_amp=True
)
with pytest.raises(MisconfigurationException):
> testing_utils.run_gpu_model_test(trainer_options, model, hparams)
tests/test_z_amp.py:204:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/testing_utils.py:62: in run_gpu_model_test
checkpoint = init_checkpoint_callback(logger)
tests/testing_utils.py:238: in init_checkpoint_callback
checkpoint = ModelCheckpoint(ckpt_dir)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <pytorch_lightning.callbacks.pt_callbacks.ModelCheckpoint object at 0x7efdb03242e8>
filepath = '/private/home/falc/Developer/pytorch-lightning/tests/save_dir/lightning_logs/version_22/checkpoints', monitor = 'val_loss', verbose = 0, save_top_k = 1
save_weights_only = False, mode = 'auto', period = 1, prefix = ''
def __init__(self, filepath, monitor='val_loss', verbose=0,
save_top_k=1, save_weights_only=False,
mode='auto', period=1, prefix=''):
super(ModelCheckpoint, self).__init__()
if (
> save_best_only and
os.path.isdir(filepath) and
len(os.listdir(filepath)) > 0
):
E NameError: name 'save_best_only' is not defined
pytorch_lightning/callbacks/pt_callbacks.py:189: NameError
================================================================================= warnings summary ==================================================================================
/private/home/falc/.conda/envs/lightning/lib/python3.7/site-packages/av/container/__init__.py:1
/private/home/falc/.conda/envs/lightning/lib/python3.7/site-packages/av/container/__init__.py:1: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated, and in 3.8 it will stop working
from .core import Container, open
tests/test_a_restore_models.py::test_running_test_pretrained_model_ddp
/private/home/falc/.conda/envs/lightning/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216, got 192
return f(*args, **kwds)
tests/test_a_restore_models.py::test_running_test_pretrained_model_ddp
tests/test_a_restore_models.py::test_running_test_pretrained_model_ddp
tests/test_a_restore_models.py::test_running_test_pretrained_model_ddp
/private/home/falc/.conda/envs/lightning/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
return f(*args, **kwds)
-- Docs: https://docs.pytest.org/en/latest/warnings.html
================================================================= 38 failed, 48 passed, 5 warnings in 27.44 seconds ====
it's because master changed.. a fix will be ready soon
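For reference, the failure is a stale reference to the removed `save_best_only` argument inside `__init__`. A hedged sketch of the corrected guard under the new `save_top_k` API (assuming `save_top_k=0` disables saving):

```python
import os
import warnings


def warn_if_dir_not_empty(filepath, save_top_k):
    # sketch: the old `save_best_only` check rewritten against save_top_k
    if save_top_k > 0 and os.path.isdir(filepath) and len(os.listdir(filepath)) > 0:
        warnings.warn(
            f'Checkpoint directory {filepath} exists and is not empty; '
            'existing files there may be overwritten or deleted.'
        )
```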
@Ir1d how about just writing k >= 1 instead of k (k >= 1)? It's cleaner.
@Ir1d i don't think this can go in this release. let's land it soon and test for a bit on master before releasing.
Fine. I just don't feel like merging the master branch into this PR again.
totally. but i imagine the longer we delay that the harder it’ll become to get this PR added. checkpointing is very important for research, i don’t want to rush this and mess up a bunch of projects |
Also, if `save_top_k` >= 2 and the callback is called multiple times inside an epoch, the name of the saved file will be appended with a version count starting with `v0`.
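For illustration (hypothetical filenames, assuming the callback's default naming scheme): two saves during epoch 3 would yield something like `_ckpt_epoch_3.ckpt` followed by `_ckpt_epoch_3_v0.ckpt`.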
I understand this PR has been out for a while, so it's okay if this change isn't implemented, but at latent space we've implemented a checkpointing system within an epoch (we have a use case of epochless datasets), and we rely heavily on having the iteration in the name for some of our checkpoint analysis. If we don't think we're saving multiple times in the same iteration, we could consider using the iteration number.
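A hedged sketch of what iteration-aware naming could look like (hypothetical helper, not part of this PR):

```python
import os


def checkpoint_name(ckpt_dir, epoch, global_step):
    # hypothetical: embed the global step so within-epoch saves stay
    # distinguishable for later checkpoint analysis
    return os.path.join(ckpt_dir, f'epoch={epoch}_step={global_step}.ckpt')
```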
+1 on considering iteration numbers.
But I think it's too much for this PR, perhaps consider a separate issue?
You see, we expect so many things from this single PR, and it has been blocked forever. And I have to merge master from time to time.
@jeffling feel free to bring this PR up to speed. Add your use case here as well, as other production teams will find it helpful.
accidentally pressed the wrong button when resolving conflicts
@williamFalcon ping
Will be adding in a follow-up PR
this PR should close #70
bringing `save_function` outside of `pytorch_lightning/callbacks/pt_callbacks.py` is really confusing, since in that case `pt_callbacks.py` cannot be run on its own 🌚
not sure where to add the tests, so I appended them at the end of the file.
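For context, a minimal sketch of the wiring being discussed (assumed, based on how the suggested tests set `save_function` manually): the trainer injects its own save routine into the callback, so `pt_callbacks.py` never has to import the Trainer.

```python
# assumed wiring, illustrative only
from pytorch_lightning.callbacks import ModelCheckpoint

checkpoint = ModelCheckpoint('checkpoints/', save_top_k=2)
# the Trainer (constructed elsewhere) hands its save routine to the callback;
# pt_callbacks.py itself stays trainer-agnostic
checkpoint.save_function = trainer.save_checkpoint
```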