Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

option --always-changed of dvc run #2531

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dashohoxha opened this issue Sep 24, 2019 · 11 comments
Closed

option --always-changed of dvc run #2531

dashohoxha opened this issue Sep 24, 2019 · 11 comments
Labels
discussion requires active participation to reach a conclusion

Comments

@dashohoxha
Copy link
Contributor

There is a discussion that illustrates the usage of --always-changed with this example:

dvc run -d ... -o s3://my-bucket/folder.parquet ./create_dir.sh
dvc run -d s3://my-bucket/folder.parquet -o marker --always-changed ./generate_effective_md5.sh
dvc run -d marker -o ... ./do_something_with_dir.sh

I was thinking if we could solve the same problem with a callback stage, like this:

dvc run -d ... -o s3://my-bucket/folder.parquet ./create_dir.sh
dvc run -o marker ./generate_effective_md5.sh
dvc run -d marker -o ... ./do_something_with_dir.sh

or like this:

dvc run -o marker './create_dir.sh && ./generate_effective_md5.sh'
dvc run -d marker -o ... ./do_something_with_dir.sh

It seems like these are equivalent to the first example.

Then I realized that maybe the option --always-changed is not needed at all, since in all the cases we can convert it to a callback stage that is equivalent.
Indeed, if --always-changed ignores the dependencies, then just don't specify any dependencies to the stage, making it a callback stage, and it will always be executed.

I wonder if there are any cases when an --always-changed stage cannot be converted to an equivalent callback stage.

@efiop
Copy link
Contributor

efiop commented Sep 24, 2019

@dashohoxha In both cases the connection between create_dir and do_something_with_dir would be lost, as generate_effective_md5 does't have dependencies listed, so dvc won't be able to build a DAG involving all of these three stages. Plus --always-changed is a much better explicit way of setting callback stages than implicit no-dep logic that we have. Please see the discussion about it at #2378 .

Closing as it seems to be resolved. Please feel free to reopen if I've missed anything.

@efiop efiop closed this as completed Sep 24, 2019
@dashohoxha
Copy link
Contributor Author

In both cases the connection between create_dir and do_something_with_dir would be lost, as generate_effective_md5 does't have dependencies listed, so dvc won't be able to build a DAG involving all of these three stages.

What about this:

dvc run -d ... -o s3://my-bucket/folder.parquet -o marker './create_dir.sh && ./generate_effective_md5.sh'
dvc run -d s3://my-bucket/folder.parquet -d marker -o ... './do_something_with_dir.sh'

I think this is equivalent to the original example (and still doesn't need --always-changed):

dvc run -d ... -o s3://my-bucket/folder.parquet ./create_dir.sh
dvc run -d s3://my-bucket/folder.parquet -o marker --always-changed ./generate_effective_md5.sh
dvc run -d marker -o ... ./do_something_with_dir.sh

Plus --always-changed is a much better explicit way of setting callback stages than implicit no-dep logic that we have.

I don't see why this is the case. It seems more like a personal preference.

Please see the discussion about it at #2378 .

This is a mind-boggling discussion for me, I am not sure I can follow it. But it seems that there is no general agreement about that issue about what to do. So, I don't know why that issue is still open, and this one was closed immediately, without any proper discussions.

Please feel free to reopen if I've missed anything.

Maybe I would have done it, but I don't have permissions to reopen issues.
Please feel free to leave it closed if you feel like this is not an important issue.

@efiop
Copy link
Contributor

efiop commented Sep 24, 2019

What about this:

It would always run do_something_with_dir if s3://my-bucket/folder.parquet has changed (hash computed by dvc has changed), which defeats the purpose.

I don't see why this is the case. It seems more like a personal preference.

Because the definition of "considered always changed if there are no deps" is implicit as it doesn't contain clear marker, and "considered always changed if always_changed == True" is an explicit one. Not sure how this is a personal preference.

This is a mind-boggling discussion for me, I am not sure I can follow it. But it seems that there is no general agreement about that issue about what to do. So, I don't know why that issue is still open, and this one was closed immediately, without any proper discussions.
Maybe I would have done it, but I don't have permissions to reopen issues.
Please feel free to leave it closed if you feel like this is not an important issue.

I guess I was wrong about this one being resolved. Reopening.

@efiop efiop reopened this Sep 24, 2019
@efiop efiop added the question I have a question? label Sep 24, 2019
@dashohoxha
Copy link
Contributor Author

It would always run do_something_with_dir if s3://my-bucket/folder.parquet has changed (hash computed by dvc has changed), which defeats the purpose.

Maybe this one is better:

dvc run -d ... -o s3://my-bucket/folder.parquet -o marker './create_dir.sh && ./generate_effective_md5.sh'
dvc run -d marker -o ... './do_something_with_dir.sh'

So, marker decides whether should be done something with the dir, and marker is computed as soon as the dir is created.

The point is that with callback stages it is possible to solve all the problems that can be solved with --always-execute stages.
On the other hand I don't see how callback stages can be deprecated. If the user creates a stage without dependencies, are we going to complain that this is an error? It doesn't seem reasonable to me.

Because the definition of "considered always changed if there are no deps" is implicit as it doesn't contain clear marker, and "considered always changed if always_changed == True" is an explicit one.

Maybe we have a terminology problem. If we call a stage that has no dependencies "unconditioned" (or "unrestricted", or "unconstrained", or "independent", etc.) it would feel natural to execute its command all the times.

@efiop
Copy link
Contributor

efiop commented Sep 26, 2019

@dashohoxha Good point, that one would work, but at the expense of mashing stages together :)

The point is that with callback stages it is possible to solve all the problems that can be solved with --always-execute stages.

With callback stages and mashing commands together, yes :)

On the other hand I don't see how callback stages can be deprecated. If the user creates a stage without dependencies, are we going to complain that this is an error? It doesn't seem reasonable to me.

In 1.0 we'll consider dropping that behavior in favor of explicit --always-changed.

Maybe we have a terminology problem. If we call a stage that has no dependencies "unconditioned" (or "unrestricted", or "unconstrained", or "independent", etc.) it would feel natural to execute its command all the times.

That is a nice way to put it! But that is definitely not something very obvious, unlike dead simple --always-changed.

@dashohoxha
Copy link
Contributor Author

Good point, that one would work, but at the expense of mashing stages together :)

Also at the benefit of using what is available :)

In 1.0 we'll consider dropping that behavior in favor of explicit --always-changed.

Do what you think is best. But I am in favor of not changing callback stages, except for their name (for example calling them "unconditioned" changes).

@ghost
Copy link

ghost commented Sep 30, 2019

@dashohoxha concerns are valid.

Not sure how this is a personal preference.

@efiop , I think it is, since we don't have a design philosophy for DVC (at least, not a written one).

There are people that prefer a minimal design:

  • The less flags I need to remember, the better
  • The less edge cases to handle, the better

Others are on the opposite side, rooting for a monolithic tool to do everything;

@efiop , I'm still not sure what are the next steps for this issue, tho; agreeing that we are not going to deprecate --always-changed and close it?

@ghost ghost added discussion requires active participation to reach a conclusion and removed question I have a question? labels Sep 30, 2019
@mdekstrand
Copy link

The Zen of Python, "explicit is better than implicit", is IMO a good argument for --always-changed and deprecating the implicit callback behavior.

However, to the root of this issue: the proposed solutions may solve most of the problems --always-changed does solve (I am not entirely sure), but not all the problems it was designed to solve (I've added a comment to #2378 with the results of actually using --always-changed for my use case, and finding a semantic hole that doesn't seem pluggable with callbacks).

Issue #2378 is about giving DVC the ability to check the status of files that are not entirely under its control. Integrating commands together to produce the marker file assumes that the only file changes we need to be able to detect are changes that are caused by a dvc run; however, if that is not the case (e.g. I'm running the repo against a different copy of the database, that has a different status of my data processing steps), I need the ability for DVC to detect that and rerun things appropriately.

@Suor
Copy link
Contributor

Suor commented Oct 15, 2019

I am also for removing --always-changed I don't see any use for it outside of strange optimization, i.e. we have dep, which is slow to test for change.

BTW,

./create_dir.sh && ./generate_effective_md5.sh

is not smashing stages together. This is a single stage, separating it is an error, which led to the original issue and confusion.

This is how dvc operates, outputs glue stages together. If some stage generates non-file, i.e. updates database state or something, then we simulate output with a marker, which is effectively an output from dvc POV.

So I am for removing --always-changed, forgetting the idea of callback deps and documenting the marker approach.

@Suor
Copy link
Contributor

Suor commented Oct 15, 2019

Thinking a bit more about it, we should handle a situation when a non-file dep (A) might be modified by some external activity to dvc, in this case we will have a stale marker, which will no longer correspond to non-file dep state.

The solution is a callback state updating a marker each time, then any stage dependent on A may still rely on depending on the marker.

There goes an order issue. Say A might be changed both in dvc process and externally, then we will have several stages:

An update A stage:

cmd: update_A && update_marker
outs: 
    path: marker
deps: maybe some

An A status stage:

cmd: update_marker
outs: 
    path: marker

And a dependent stage:

cmd: work_with_A
deps:
    path: marker

We have several issues here, which, I think we all can ignore:

  • order of stage runs are not fully determined,
  • marker is updated twice,
  • need to use shell things like && to dedup marker creating code.

The real reqs are always update the marker (to handle outside activity) and always update marker after updating A. Both are satisfied here.

The real issue, which we can't simply handle is reproducibility though. Say we want to rerun a dependent stage on an older A state, we can't, we only do that for files. We may discuss what should (or shouldn't) we do about that.

@Suor
Copy link
Contributor

Suor commented Oct 15, 2019

A general takeout from all of this - we should define where dvc responsibility ends, like say non-file non-dvc managed deps. Trying to stretch our responsibility over those limits means bringing together mutually contradictory objectives and won't lead to anything good.

@efiop efiop closed this as completed May 3, 2021
@iterative iterative locked and limited conversation to collaborators May 3, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
discussion requires active participation to reach a conclusion
Projects
None yet
Development

No branches or pull requests

4 participants