dvc stage foreach ordered execution #5644


Closed
gabrieljdcoelho opened this issue Mar 18, 2021 · 5 comments
Labels: "A: pipelines" (related to the pipelines feature), "discussion" (requires active participation to reach a conclusion)

Comments

@gabrieljdcoelho

When using foreach loops in dvc.yaml, I noticed that executions do not follow the expected order. In my particular case, this is a serious problem.

I use foreach loops for two reasons:

  1. to iterate over several datasets (say, 5) and apply data preparation tasks to all of them (this does not need to follow the order defined in a params.yaml list);
  2. to apply a rolling-window process (https://www.mathworks.com/help/econ/rolling-window-estimation-of-state-space-models.html), where the order must be followed because there are dependencies between iterations.
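For reference, the first use case maps naturally onto foreach. A minimal dvc.yaml sketch (the dataset names, script, and paths are hypothetical, not taken from the reporter's project):

```yaml
# Hypothetical sketch of use case 1: one preparation stage per dataset.
# The execution order across list items is exactly what this issue is about.
stages:
  prepare:
    foreach:
      - dataset_a
      - dataset_b
      - dataset_c
    do:
      cmd: python prepare.py data/${item}.csv prepared/${item}.csv
      deps:
        - prepare.py
        - data/${item}.csv
      outs:
        - prepared/${item}.csv
```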

Consider that I have 10 rolling-window iterations that simulate an online learning environment. In the first one, a model is trained for X iterations and stored. In the second one, I load the generated model and re-train it on new data for X/2 iterations; since the model is already trained, I only need to adapt it to the new data rather than train it from scratch.
Thus, execution order is mandatory. If DVC starts the foreach loop at the 3rd iteration (for example), it will not fulfill the mentioned requirements.
I store all models, predictions, results, datasets, etc. in every iteration, and I use DVC pipelines because they make it easier both to manage all the data and to run only parts of my pipeline.
Note that I already have to duplicate my rolling-window stage code for each of my datasets, since nested foreach loops are not possible in DVC, so I have 5 stages with the same code, changing only the dataset.
If DVC does not follow the foreach order, I'll need 50 stages with duplicated code, and that is assuming I have only 1 model; with more, it would be impracticable.
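Since DVC orders execution by the dependency DAG rather than by list position, the inter-iteration dependency in use case 2 can only be expressed today by chaining explicit stages, each declaring the previous model as a dep. A hypothetical hand-written sketch (all names and paths are illustrative):

```yaml
# Each window stage depends on the model produced by the previous one,
# so `dvc repro` is forced to run them in order via the DAG.
stages:
  window_1:
    cmd: python train.py --window 1 --out models/model_1.pt
    deps: [train.py, data/window_1.csv]
    outs: [models/model_1.pt]
  window_2:
    cmd: python train.py --window 2 --init models/model_1.pt --out models/model_2.pt
    deps: [train.py, data/window_2.csv, models/model_1.pt]
    outs: [models/model_2.pt]
  # ... repeated for each of the 10 windows, per dataset
```

This is exactly the duplication the reporter wants to avoid, which motivates the metascript idea discussed below in the thread.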

@skshetry
Member

Hmm, could the second use case (the rolling window) be solved by checkpoints?

@dberenbaum
Contributor

@gabrieljdcoelho Thanks for the detailed explanation of an interesting use case. As you note, these are two different scenarios, so I'm not sure that foreach is the right way to go about simulating a rolling window/online learning process. Did you actually try to write this in your dvc.yaml? I'd be curious how you handled the dependencies between iterations, but regardless it seems like it's not going to work for this use case.

It should be noticed that I already have to duplicate my rolling window stage code to all my datasets, since it is not possible to have nested foreach loops in DVC, so I have 5 stages with the same code, only changing the dataset.

Would it work to have 10 rolling window stages and use foreach to apply those to the 5 different datasets? This isn't ideal, but it would be better than 50 stage definitions. I'm unclear on whether the lack of nesting means that you can't use foreach at all?

Thinking more holistically about the problem, could you have a metascript to generate your dvc.yaml? My understanding is that the goal of features like foreach was to give some flexibility while maintaining yaml structure. If that's insufficient, it might be easier to use something like jinja to write a template that can generate a more complex dvc.yaml without needing to hard code everything.
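To make the metascript suggestion concrete, here is a minimal sketch that generates the chained stages programmatically. All dataset names, script names, and paths are hypothetical; it uses plain string formatting so it has no dependencies, but a jinja template (as suggested above) would be analogous:

```python
# Hypothetical metascript: emits a dvc.yaml with one stage per
# (dataset, window), chaining each window to the previous model so
# DVC's DAG enforces execution order. Names and paths are made up.
datasets = ["dataset_a", "dataset_b"]
n_windows = 3

stages = {}
for ds in datasets:
    for w in range(1, n_windows + 1):
        deps = ["train.py", f"data/{ds}/window_{w}.csv"]
        cmd = (f"python train.py --data data/{ds}/window_{w}.csv"
               f" --out models/{ds}/model_{w}.pt")
        if w > 1:
            prev = f"models/{ds}/model_{w - 1}.pt"
            deps.append(prev)      # chain: window w depends on model w-1
            cmd += f" --init {prev}"
        stages[f"{ds}_window_{w}"] = {
            "cmd": cmd,
            "deps": deps,
            "outs": [f"models/{ds}/model_{w}.pt"],
        }

def to_yaml(stages):
    # Tiny hand-rolled YAML emitter so the sketch stays dependency-free;
    # in practice you would use pyyaml or a jinja template instead.
    lines = ["stages:"]
    for name, spec in stages.items():
        lines.append(f"  {name}:")
        lines.append(f"    cmd: {spec['cmd']}")
        for key in ("deps", "outs"):
            lines.append(f"    {key}:")
            lines.extend(f"      - {p}" for p in spec[key])
    return "\n".join(lines) + "\n"

print(to_yaml(stages))
```

Regenerating dvc.yaml this way keeps the stage definitions in one place while still giving DVC an explicit, ordered DAG to execute.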

@dberenbaum
Contributor

@skshetry Checkpoints are an interesting idea. It raises a few questions:

  1. Not only the model output but also the data dependency gets updated each iteration. I'm not sure if checkpoints could work with a changing data dependency. If it's all one dataset and the script is specifying what date range to select at each iteration, it might be possible to have a checkpoint output that tracks the iteration number and uses it to read in a specific date range (for example, iter_start_dt = start_dt + 30 * iter_num).
  2. The 10 iterations would need to be defined in the parameters or the script itself since there is no way to define the number of checkpoints in dvc.yaml.
  3. dvc repro does not support checkpoints, so the pipeline would need to be executed with dvc exp run. These do mostly the same thing, but the workflow might differ a little depending on whether you need to keep all the intermediate iterations or just the final result.

@skshetry
Member

@dberenbaum, we had some discussions regarding this a few months back on #5181.

@shcheklein added the "discussion" label (requires active participation to reach a conclusion) Mar 19, 2021
@daavoo added the "A: pipelines" label (related to the pipelines feature) Oct 20, 2021
@mattseddon closed this as not planned Mar 1, 2024