
dvc: deprecate persistent outputs #2340

Closed

ghost opened this issue Jul 30, 2019 · 7 comments
Labels
question I have a question? research

Comments

@ghost

ghost commented Jul 30, 2019

I don't see a clear benefit in having them; this concern was already expressed in #1884 (comment) and in #1214 (comment).

Supporting them is not much of a hassle, but we can also go with the argument that "less code is better code": if a feature doesn't offer a clear benefit, it just makes the interface more complex.

Current types of outputs:

  • cached
  • not-cached
  • persisted
  • persisted-not-cached
  • external
  • external not-cached
  • external persisted / external persisted-not-cached? (I've never seen this one in the wild 😅)
@ghost ghost added question I have a question? research labels Jul 30, 2019
@shcheklein
Member

@guysmoilov did you guys end up using these outputs?

@guysmoilov
Contributor

guysmoilov commented Jul 30, 2019

Yes: https://dagshub.com/Guy/fairseq/src/dvc/dvc-example/train.dvc

I don't see any other sane way to allow for checkpoints and resuming training.

@ghost
Author

ghost commented Jul 30, 2019

@guysmoilov, what about not tracking the checkpoints with dvc? Why do you need to resume training?
I'm more used to the queue-worker architecture, and I'm not familiar with "training infrastructures".

Still, I'm not sure whether this is a hack or something we should pay more attention to and attach to a specific use case with proper documentation.

@guysmoilov
Contributor

@MrOutis It's a very common pattern in deep learning.

  1. Pick a model & hyperparameters
  2. Train for a few epochs to get intermediate results.
  3. Try a few different configurations, get intermediate results for them.
  4. Pick the most promising configuration, give it more budget for training, resuming from the checkpoint where you paused.
  5. Rinse & repeat until you can claim SOTA on Arxiv 🥇

Hyperparameter optimization algorithms such as Hyperband even formalize and automate this process.
Existing frameworks such as TF have pretty well-established norms for working with checkpoints: usually, you'll pick a checkpoint directory and save the model to a checkpoint file every N epochs (with a filename like checkpoints/checkpoint_N).
Special names will be given to the following checkpoints:

  • checkpoint_best (according to a defined metric, probably validation loss)
  • checkpoint_latest (so it's easier for the resuming code to know which checkpoint to resume from)

And sometimes you'll do fancier things, like only keep the latest 5 checkpoints, or keep the best 3 checkpoints at any given time, so you can later average out their weights or turn them into an ensemble.
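For concreteness, here is a minimal, framework-agnostic sketch of that pattern in plain Python (the pickle-based save_checkpoint helper, paths, intervals, and placeholder metric are illustrative assumptions, not TF or DVC API):

```python
import os
import pickle
import shutil

CKPT_DIR = "checkpoints"
SAVE_EVERY = 5      # save a numbered checkpoint every N epochs
KEEP_LAST = 5       # only keep the latest 5 numbered checkpoints

os.makedirs(CKPT_DIR, exist_ok=True)

def save_checkpoint(state, path):
    """Serialize training state (weights, optimizer state, epoch) to disk."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

best_val_loss = float("inf")
state = {"weights": None, "epoch": 0}          # stand-in for real model state

for epoch in range(1, 51):
    # ... train one epoch, then evaluate ...
    val_loss = 1.0 / epoch                     # placeholder validation metric
    state.update(epoch=epoch, weights=f"weights-after-epoch-{epoch}")

    if epoch % SAVE_EVERY == 0:
        numbered = os.path.join(CKPT_DIR, f"checkpoint_{epoch}")
        save_checkpoint(state, numbered)
        # checkpoint_latest tells the resuming code where to pick up from
        shutil.copyfile(numbered, os.path.join(CKPT_DIR, "checkpoint_latest"))

        # retention: drop numbered checkpoints older than the last KEEP_LAST
        numbered_ckpts = sorted(
            (p for p in os.listdir(CKPT_DIR)
             if p.startswith("checkpoint_") and p.split("_")[-1].isdigit()),
            key=lambda p: int(p.split("_")[-1]),
        )
        for old in numbered_ckpts[:-KEEP_LAST]:
            os.remove(os.path.join(CKPT_DIR, old))

    # checkpoint_best is tracked by the chosen metric (here: validation loss)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_checkpoint(state, os.path.join(CKPT_DIR, "checkpoint_best"))
```

In a DVC stage, the checkpoints/ directory would be the output you want to survive between runs, which is exactly the use case persistent outputs were meant to cover.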

Obviously, you need something like DVC to be able to resume from checkpoints, switch contexts to a different configuration, and keep track of the runs.

Hope this explains things better.

@guysmoilov
Contributor

@ghost
Author

ghost commented Aug 1, 2019

@guysmoilov, thanks a lot for your comments! First time hearing about early stopping, and I didn't know that you can keep training a model that was already trained (it makes a lot of sense, tho)!

I'll close this issue, then :)

@shcheklein, what do you think about adding this info to the docs?

@ghost ghost closed this as completed Aug 1, 2019
@shcheklein
Member

@MrOutis So, I'm still hesitant about merging the docs for this. I totally understand the problem, but I'm not sure that adding a flag and saving outputs this way is the right way of solving it. We need to brainstorm a little bit, and it can be part of a broader discussion around experiment management in general.

This issue was closed.