Remove the content_hashes? #432
Ya, I think it's mainly to ensure that the file hasn't been hand-modified. Personally, I vaguely think it's good, but I don't have strong feelings. I think @mariusvniekerk feels a bit more strongly. See mamba-org/mamba#1209 (comment) for more context
I think that discussion is more about the input specification than about the content of the lock file itself.

Additionally, I also don't think including a hash of the input provides much value. First of all, you can easily check whether the contents of the lock-file are "valid" for a given input: simply check, for each input matchspec, whether there is a locked package in the lockfile that matches it. You can even validate that the entire lock-file is consistent by also traversing the dependencies of the locked packages. Secondly, only hashing the input specification is not enough to create a reproducible conda-lock file, since the …
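The semantic check sketched in this comment can be written in a few lines of Python. Everything below (`LockedPackage`, `matches`, `lockfile_satisfies`, and the toy matchspec syntax) is a hypothetical illustration, not conda-lock's actual API:

```python
# Hypothetical sketch: validate a lock file against its input specs
# without any content hash. None of these names come from conda-lock.
from dataclasses import dataclass, field


@dataclass
class LockedPackage:
    name: str
    version: str
    dependencies: list = field(default_factory=list)  # dependency names


def matches(spec: str, pkg: LockedPackage) -> bool:
    # Toy matchspec: either "name" or "name>=version" (naive string compare).
    if ">=" in spec:
        name, version = spec.split(">=")
        return pkg.name == name and pkg.version >= version
    return pkg.name == spec


def lockfile_satisfies(specs: list, locked: list) -> bool:
    """Every input spec must match a locked package, and every dependency
    reachable from those matches must itself be present in the lock file."""
    by_name = {p.name: p for p in locked}
    roots = []
    for spec in specs:
        hits = [p for p in locked if matches(spec, p)]
        if not hits:
            return False  # an input spec has no locked package
        roots.extend(hits)
    # Walk the dependency graph to check internal consistency.
    seen, stack = set(), [p.name for p in roots]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        pkg = by_name.get(name)
        if pkg is None:
            return False  # a dependency is missing from the lock file
        stack.extend(pkg.dependencies)
    return True
```

For example, with `locked = [LockedPackage("numpy", "1.24", ["python"]), LockedPackage("python", "3.11")]`, the spec list `["numpy>=1.20"]` is satisfied, while `["scipy"]` is not.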
The input_hash comes somewhat from a time before mamba, when recomputing a lock was very expensive and this was a cheap way to know whether to even bother attempting to recompute it. This is also handy in some CI workflows that make git commits when the input spec for a lock has changed from a prior attempt.
@mariusvniekerk Are you talking about the …
You could also very cheaply do the same thing by checking whether the input already matches the locked-down packages. This can all be done in Python without ever touching a solver.

My concern is that computing hashes is a fragile thing. If you add or remove a field, or format it slightly differently, the hashes will not match. This makes it very hard to reproduce in a different environment or in another language, or to iterate on the design. I think doing semantic checks rather than checking hashes is a more stable approach. I would prefer to remove this rather complex piece from the lock file altogether in favor of keeping the file format simple and easily exchangeable with other implementations. WDYT?
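The fragility argued here is easy to demonstrate: two serializations of the same logical content hash differently, so exact-hash comparison breaks as soon as any implementation formats its output slightly differently. The spec dict below is an invented example:

```python
# Same logical content, two serializations: the hashes no longer match.
import hashlib
import json

spec = {"channels": ["conda-forge"], "dependencies": ["python >=3.10"]}

compact = json.dumps(spec, separators=(",", ":"))
pretty = json.dumps(spec, indent=2)

h_compact = hashlib.sha256(compact.encode()).hexdigest()
h_pretty = hashlib.sha256(pretty.encode()).hexdigest()

assert json.loads(compact) == json.loads(pretty)  # identical content...
assert h_compact != h_pretty                      # ...but mismatched hashes
```

A semantic check (parse both documents and compare the resulting data) is immune to this kind of drift, which is the point being made above.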
Pulling the discussion of #449 in here.
Given the information in the lock-file, one can very easily verify whether the specs in an environment.yml are still satisfied by the locked packages in the lock-file. Even without the hash in the lock-file, people can still install from the lock-file years later without having to re-lock any of the dependencies. I may be wrong here, but I think this covers most use-cases.

The only thing the content-hash provides is knowing whether the environment.yml matches exactly the original environment.yml file it was created from. But in that case, doesn't it make more sense to just include the specs, instead of using an unstable hash? If even one thing changes in the years after the lock file was created and you use an updated …

I'm very curious about everyone's thoughts!
Actually, I have given what you say some thought and I think there would be value in retaining the "original requested spec" for packages listed in the lockfile. I've been meaning to make a PR for it. |
Even though I already linked it at the top, I think it may be worth repeating how much this reminds me of mamba-org/mamba#1209 (comment) |
This question has been kicking around in the back of my head, and I finally had an idea that may be worth sharing.

I am making the assumption that the idea of the content_hash is to verify that the lockspec is still consistent with the environment spec. What we really want to compute is that consistency check itself; the comparison `compute_hash(environment_spec) == lockspec.content_hash` is only an approximation to it. It seems to me that the "right" way to do this would be something like the following pseudocode:

```python
# check if lockspec satisfies environment_spec
if not environment_spec.is_satisfied_by(lockspec):
    return False

# check that lockspec is minimal
for pkg in lockspec:
    if pkg.is_root_node_in_dependency_dag:
        if environment_spec.is_satisfied_by(lockspec.minus(pkg)):
            return False

# lockspec satisfies environment_spec and is minimal
return True
```

TANGENTIAL NOTE: I think it would be really helpful for the next lockfile version to contain a field to indicate which dependencies are non-transitive, and perhaps also the corresponding spec from the environment spec that they fulfill.

SUBNOTE: On the other hand, this might be too much information; if a lockfile is already consistent with the environment spec, but the environment spec has changed (in a compatible way), then including the spec in the lockfile would mean that the lockfile should probably be updated, even if the locked packages are the same. (An example I have in mind is using a …)
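The pseudocode above can be made runnable under a deliberately toy model: a spec is just a required package name, and a lock file is a mapping from package name to the names of its dependencies. All names here are illustrative, not any real conda-lock API:

```python
# Runnable sketch of the satisfies-and-is-minimal consistency check.

def is_satisfied_by(specs, locked_names):
    # Toy satisfiability: every required name must be locked.
    return all(s in locked_names for s in specs)


def is_consistent(specs, lockfile):
    names = set(lockfile)
    # 1. The lock file must satisfy the environment spec.
    if not is_satisfied_by(specs, names):
        return False
    # 2. The lock file must be minimal: no root of the dependency DAG
    #    can be removed while still satisfying the spec.
    depended_on = {d for deps in lockfile.values() for d in deps}
    for root in names - depended_on:
        if is_satisfied_by(specs, names - {root}):
            return False  # this root package is surplus
    return True
```

With `lock = {"numpy": ["python"], "python": []}`, `is_consistent(["numpy"], lock)` holds; adding an unrequested `"scipy"` root to the lock file makes the minimality check fail.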
@maresb Yes, that is exactly my point! Thanks for sharing! This is precisely what we implemented in pixi. We check if the lockfile is minimal by walking the dependency graph from the specs using the locked packages in the lock file. If we don't visit a package in the lock file, we know it's surplus and we mark the lock file as inconsistent. If the spec is updated but still compatible with the lockfile, we don't update the lock file. If relocking is required, we resolve the environment while keeping the previous lockfile as the "installed" packages, which ensures lockfiles are minimally updated. This would simply remove surplus packages without changing anything else. We can do all of this without the content-hash.
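The graph walk described in this comment can be sketched as follows; the function name and the lock-file shape (name to dependency names) are assumptions for illustration, not pixi's actual code:

```python
# Walk the dependency graph from the input specs; anything never visited
# is surplus, so the lock file would be flagged as inconsistent.

def surplus_packages(specs, lockfile):
    seen = set()
    stack = [s for s in specs if s in lockfile]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        stack.extend(lockfile[name])
    # Packages not reachable from any spec are not needed.
    return set(lockfile) - seen
```

For `lock = {"numpy": ["python"], "python": [], "scipy": ["python"]}` and specs `["numpy"]`, the walk never visits `scipy`, so it is reported as surplus.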
What is the idea?
While working on porting the conda-lock format to Rust, I'm having a tough time implementing the `content_hash`. The implementation relies heavily on the serialization of the model to JSON, which is hard to get precisely compatible with `conda-lock`. In the broader sense, I wonder what the use of `content_hash` is. Is it to verify the integrity of the file?

Why is this needed?
To make integrating the conda-lock file format into other languages much easier.
What should happen?
Basically, drop the `content_hash` field from the format.