
dvc: deprecate persistent outputs #2340

Closed

ghost opened this issue Jul 30, 2019 · 7 comments
Labels
question I have a question? research

Comments

@ghost

ghost commented Jul 30, 2019

I don't see a clear benefit in having them; this concern was already expressed in #1884 (comment) and in #1214 (comment).

Supporting them is not much of a hassle, but we can also go with the argument that "less code is better code": if a feature doesn't offer a clear benefit, it just makes the interface more complex.

Current types of outputs:

  • cached
  • not-cached
  • persisted
  • persisted-not-cached
  • external
  • external not-cached
  • external persisted / external persisted-not-cached? (I've never seen this one in the wild 😅)
@ghost ghost added question I have a question? research labels Jul 30, 2019
@shcheklein
Member

@guysmoilov did you guys end up using these outputs?

@guysmoilov
Contributor

guysmoilov commented Jul 30, 2019

Yes: https://dagshub.com/Guy/fairseq/src/dvc/dvc-example/train.dvc

I don't see any other sane way to allow for checkpoints and resuming training.

@ghost
Author

ghost commented Jul 30, 2019

@guysmoilov, what about not tracking the checkpoints with dvc? Why do you need to resume training?
I'm more used to the queue-worker architecture, and I'm not familiar with "training infrastructures".

Still, I'm not sure whether this is a hack or something we should pay more attention to and attach to a specific use case with proper documentation.

@guysmoilov
Contributor

@MrOutis It's a very common pattern in deep learning.

  1. Pick a model & hyperparameters
  2. Train for a few epochs to get intermediate results.
  3. Try a few different configurations, get intermediate results for them.
  4. Pick the most promising configuration, give it more budget for training, resuming from the checkpoint where you paused.
  5. Rinse & repeat until you can claim SOTA on Arxiv 🥇

Hyperparameter optimization algorithms such as Hyperband even formalize and automate this process.
Existing frameworks such as TF have pretty well-established norms for working with checkpoints: usually, you'll pick a checkpoint directory and save the model to a checkpoint file every N epochs (with a filename like checkpoints/checkpoint_N).
Special names will be given to the following checkpoints:

  • checkpoint_best (according to a defined metric, probably validation loss)
  • checkpoint_latest (so it's easier for the resuming code to know which checkpoint to resume from)

And sometimes you'll do fancier things, like only keep the latest 5 checkpoints, or keep the best 3 checkpoints at any given time, so you can later average out their weights or turn them into an ensemble.
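For concreteness, here is a minimal, framework-agnostic sketch of that pattern in plain Python (the pickle-based save_checkpoint helper, paths, intervals, and placeholder metric are illustrative assumptions, not TF or DVC API):

```python
import os
import pickle
import shutil

CKPT_DIR = "checkpoints"
SAVE_EVERY = 5      # save a numbered checkpoint every N epochs
KEEP_LAST = 5       # only keep the latest 5 numbered checkpoints

os.makedirs(CKPT_DIR, exist_ok=True)

def save_checkpoint(state, path):
    """Serialize training state (weights, optimizer state, epoch) to disk."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

best_val_loss = float("inf")
state = {"weights": None, "epoch": 0}          # stand-in for real model state

for epoch in range(1, 51):
    # ... train one epoch, then evaluate ...
    val_loss = 1.0 / epoch                     # placeholder validation metric
    state.update(epoch=epoch, weights=f"weights-after-epoch-{epoch}")

    if epoch % SAVE_EVERY == 0:
        numbered = os.path.join(CKPT_DIR, f"checkpoint_{epoch}")
        save_checkpoint(state, numbered)
        # checkpoint_latest tells the resuming code where to pick up from
        shutil.copyfile(numbered, os.path.join(CKPT_DIR, "checkpoint_latest"))

        # retention: drop numbered checkpoints older than the last KEEP_LAST
        numbered_ckpts = sorted(
            (p for p in os.listdir(CKPT_DIR)
             if p.startswith("checkpoint_") and p.split("_")[-1].isdigit()),
            key=lambda p: int(p.split("_")[-1]),
        )
        for old in numbered_ckpts[:-KEEP_LAST]:
            os.remove(os.path.join(CKPT_DIR, old))

    # checkpoint_best is tracked by the chosen metric (here: validation loss)
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        save_checkpoint(state, os.path.join(CKPT_DIR, "checkpoint_best"))
```

In a DVC stage, the checkpoints/ directory would be the output you want to survive between runs, which is exactly the use case persistent outputs were meant to cover.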

Obviously, you need something like DVC to be able to resume from checkpoints, switch contexts to a different configuration, and keep track of the runs.

Hope this explains things better.

@guysmoilov
Contributor

@ghost
Author

ghost commented Aug 1, 2019

@guysmoilov, thanks a lot for your comments! First time hearing about early stopping, and I didn't know that you can keep training a model that was already trained (it makes a lot of sense, tho)!

I'll close this issue, then :)

@shcheklein, what do you think about adding this info to the docs?

@ghost ghost closed this as completed Aug 1, 2019
@shcheklein
Member

@MrOutis So, I'm still hesitant about merging the docs for this. I totally understand the problem, but I'm not sure that adding a flag and saving outputs this way is the right way of solving it. We need to brainstorm a little bit, and it can be part of a broader discussion around experiment management in general.

This issue was closed.