dvc stage foreach ordered execution #5644


Closed
gabrieljdcoelho opened this issue Mar 18, 2021 · 5 comments
Labels: "A: pipelines" (related to the pipelines feature), "discussion" (requires active participation to reach a conclusion)

Comments

@gabrieljdcoelho

When using foreach loops in dvc.yaml, I noticed that executions do not follow the expected order. In my particular case, this is a serious problem.

I use foreach loops for two reasons:

  1. to iterate over several datasets (say, 5) and apply data preparation tasks to all of them (this does not need to follow the order defined in a params.yaml list);
  2. to apply a rolling-window process (https://www.mathworks.com/help/econ/rolling-window-estimation-of-state-space-models.html), where the order must be followed because there are dependencies between iterations.
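For reference, the first use case maps naturally onto foreach. A minimal dvc.yaml sketch (the dataset names, script, and paths are hypothetical, not taken from the reporter's project):

```yaml
# Hypothetical sketch of use case 1: one preparation stage per dataset.
# The execution order across list items is exactly what this issue is about.
stages:
  prepare:
    foreach:
      - dataset_a
      - dataset_b
      - dataset_c
    do:
      cmd: python prepare.py data/${item}.csv prepared/${item}.csv
      deps:
        - prepare.py
        - data/${item}.csv
      outs:
        - prepared/${item}.csv
```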

Consider that I have 10 rolling-window iterations that simulate an online learning environment. In the first one, a model is trained for X iterations and stored. In the second one, I load the generated model and re-train it on new data for X/2 iterations; since the model is already trained, I only need to adapt it to the new data rather than train it from scratch.
Thus, execution order is mandatory. If DVC starts the foreach loop at the 3rd iteration (for example), it will not fulfill the mentioned requirements.
I store all models, predictions, results, datasets, etc. in every iteration, and I use DVC pipelines because they make it easier both to manage all the data and to run only parts of my pipeline.
Note that I already have to duplicate my rolling-window stage code for each of my datasets, since nested foreach loops are not possible in DVC, so I have 5 stages with the same code, changing only the dataset.
If DVC does not follow the foreach order, I'll need 50 stages with duplicated code, and that is assuming I have only 1 model; with more, it would be impracticable.
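Since DVC orders execution by the dependency DAG rather than by list position, the inter-iteration dependency in use case 2 can only be expressed today by chaining explicit stages, each declaring the previous model as a dep. A hypothetical hand-written sketch (all names and paths are illustrative):

```yaml
# Each window stage depends on the model produced by the previous one,
# so `dvc repro` is forced to run them in order via the DAG.
stages:
  window_1:
    cmd: python train.py --window 1 --out models/model_1.pt
    deps: [train.py, data/window_1.csv]
    outs: [models/model_1.pt]
  window_2:
    cmd: python train.py --window 2 --init models/model_1.pt --out models/model_2.pt
    deps: [train.py, data/window_2.csv, models/model_1.pt]
    outs: [models/model_2.pt]
  # ... repeated for each of the 10 windows, per dataset
```

This is exactly the duplication the reporter wants to avoid, which motivates the metascript idea discussed below in the thread.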

@skshetry
Member

Hmm, could the second use case (the rolling window) be solved by checkpoints?

@dberenbaum
Contributor

@gabrieljdcoelho Thanks for the detailed explanation of an interesting use case. As you note, these are two different scenarios, so I'm not sure that foreach is the right way to go about simulating a rolling window/online learning process. Did you actually try to write this in your dvc.yaml? I'd be curious how you handled the dependencies between iterations, but regardless it seems like it's not going to work for this use case.

It should be noticed that I already have to duplicate my rolling window stage code to all my datasets, since it is not possible to have nested foreach loops in DVC, so I have 5 stages with the same code, only changing the dataset.

Would it work to have 10 rolling window stages and use foreach to apply those to the 5 different datasets? This isn't ideal, but it would be better than 50 stage definitions. I'm unclear on whether the lack of nesting means that you can't use foreach at all?

Thinking more holistically about the problem, could you have a metascript to generate your dvc.yaml? My understanding is that the goal of features like foreach was to give some flexibility while maintaining yaml structure. If that's insufficient, it might be easier to use something like jinja to write a template that can generate a more complex dvc.yaml without needing to hard code everything.
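To make the metascript suggestion concrete, here is a minimal sketch that generates the chained stages programmatically. All dataset names, script names, and paths are hypothetical; it uses plain string formatting so it has no dependencies, but a jinja template (as suggested above) would be analogous:

```python
# Hypothetical metascript: emits a dvc.yaml with one stage per
# (dataset, window), chaining each window to the previous model so
# DVC's DAG enforces execution order. Names and paths are made up.
datasets = ["dataset_a", "dataset_b"]
n_windows = 3

stages = {}
for ds in datasets:
    for w in range(1, n_windows + 1):
        deps = ["train.py", f"data/{ds}/window_{w}.csv"]
        cmd = (f"python train.py --data data/{ds}/window_{w}.csv"
               f" --out models/{ds}/model_{w}.pt")
        if w > 1:
            prev = f"models/{ds}/model_{w - 1}.pt"
            deps.append(prev)      # chain: window w depends on model w-1
            cmd += f" --init {prev}"
        stages[f"{ds}_window_{w}"] = {
            "cmd": cmd,
            "deps": deps,
            "outs": [f"models/{ds}/model_{w}.pt"],
        }

def to_yaml(stages):
    # Tiny hand-rolled YAML emitter so the sketch stays dependency-free;
    # in practice you would use pyyaml or a jinja template instead.
    lines = ["stages:"]
    for name, spec in stages.items():
        lines.append(f"  {name}:")
        lines.append(f"    cmd: {spec['cmd']}")
        for key in ("deps", "outs"):
            lines.append(f"    {key}:")
            lines.extend(f"      - {p}" for p in spec[key])
    return "\n".join(lines) + "\n"

print(to_yaml(stages))
```

Regenerating dvc.yaml this way keeps the stage definitions in one place while still giving DVC an explicit, ordered DAG to execute.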

@dberenbaum
Contributor

@skshetry Checkpoints are an interesting idea. It raises a few questions:

  1. Not only the model output but also the data dependency gets updated each iteration. I'm not sure if checkpoints could work with a changing data dependency. If it's all one dataset and the script is specifying what date range to select at each iteration, it might be possible to have a checkpoint output that tracks the iteration number and uses it to read in a specific date range (for example, iter_start_dt = start_dt + 30 * iter_num).
  2. The 10 iterations would need to be defined in the parameters or the script itself since there is no way to define the number of checkpoints in dvc.yaml.
  3. dvc repro does not support checkpoints, so the pipeline would need to be executed with dvc exp run. These do mostly the same thing, but the workflow might differ a little depending on whether you need to keep all the intermediate iterations or just the final result.

@skshetry
Member

@dberenbaum, we had some discussions regarding this a few months back on #5181.

@shcheklein added the "discussion" label (requires active participation to reach a conclusion) Mar 19, 2021
@daavoo added the "A: pipelines" label (related to the pipelines feature) Oct 20, 2021
@mattseddon closed this as not planned Mar 1, 2024