Lockfiles: Add invalidation logic #12415
Comments
It seems to me that exhaustive lockfile validation only needs to happen periodically, not every time a build is run (particularly since many individual builds will target just a single interpreter). Perhaps exhaustive invalidation is something that's added to a daily scheduled task, with per-interpreter invalidation being the default?
Unfortunately, I don't think so: if you changed any of the inputs, the lockfile could already be stale before a scheduled check runs.

But do note that the impact of Pantsd should be that you don't need to do this check very often. Pantsd is meant to be long-lived, and we've been trying to reduce the number of times it gets invalidated. Once you do that initial check, you'd only need to recheck if your inputs have changed or Pantsd is killed.
OK, I'll wrap my head around the issue a bit better, and then I'll give a more informed opinion.
We should probably default to not worrying too much about this. Note that we already have to query all of the repo for dependency inference to build the dependency map.

If we can do something clever to avoid the cost, great. But we shouldn't give up correctness, because Pantsd means you should not need to pay the cost frequently. (And we should keep improving Pantsd so that remains the case.)
Firstly, I want the invalidation digest to be somewhat resilient to edits, at the very least so that it doesn't need to be the first thing in the file. Using an approach similar to PGP/OpenSSH keys would give us something like this:

This way, if someone were to want to edit the file, it would be reasonably clear where they could do so, and we can also add extra metadata in the future if we want to. Thoughts on this as far as format/naming conventions go?
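A hedged sketch of what locating and parsing such a delimited block could look like. The marker text and the `key: value` line format here are assumptions for illustration, not Pants' actual format:

```python
import re

# Hypothetical marker text; the real delimiters would be whatever Pants settles on.
HEADER_RE = re.compile(
    r"^# --- BEGIN PANTS LOCKFILE METADATA: DO NOT EDIT OR REMOVE ---\n"
    r"(?P<body>(?:^# .*\n)+)"
    r"^# --- END PANTS LOCKFILE METADATA ---$",
    re.MULTILINE,
)


def read_metadata(lockfile_text: str):
    """Find the metadata block anywhere in the file; return key/value pairs or None."""
    match = HEADER_RE.search(lockfile_text)
    if match is None:
        return None
    metadata = {}
    for line in match.group("body").splitlines():
        key, _, value = line.lstrip("# ").partition(": ")
        metadata[key] = value
    return metadata
```

Because the block is located by its markers rather than by position, users could keep their own comments above or below it without disturbing the digest.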
Maybe something with
Yes, I like this. And we will probably want to expand it to include things like "Input interpreter constraints", for example.
OK.
Hrm, ok. platform (OS + arch) may matter too.
This WIP adds the lockfile invalidation header, but does not yet consume it. Partially addresses #12415.
I'm actually pretty concerned about the performance of determining the hash for lockfiles every Pants run. In particular, computing the interpreter constraints for the Pytest lockfile is slow because we have to get the transitive targets for every single test target.

I'm thinking maybe we do add an opt-out. cc @stuhood, thoughts?

EDIT: I had an idea that makes me comfortable with this!
From #12491, there are currently 4 proposals to improve performance. All should be compatible with each other:
@chrisjrn and I discussed this morning that the first one seems particularly important.

I also think opting out of mixed interpreter constraints (#12495) is an easy win and makes semantic sense, beyond this particular performance consideration.
Problem
We need to know when a lockfile is stale and then warn/error/regenerate. (Until #12014 is solved, we will not regenerate.)
Usually, we would simply rely on the engine's caching and memoization: i.e., attempt to run a `Process` and hope for a cache hit. However, this does not work for two reasons; instead, we need a different caching scheme.
The inputs to lockfile generation
Currently, this is our experimental request type to generate a lockfile:
pants/src/python/pants/backend/experimental/python/lockfile.py
Lines 67 to 72 in 338ea37
The requirement strings and interpreter constraints are what really matter; `dest` and `description` should not impact the lockfile content.

The requirement strings obviously matter: they determine what to resolve. If reqs are changed, added, or removed, we need to regenerate.

Interpreter constraints matter because they can impact which requirements are resolved, per #12200. However, it is a little tricky to performantly determine every Pants run what the ICs are, because some contexts need to consult the entire repository to calculate which ICs are used: #12412.
Finally, we might determine that the platform (OS + arch) matters. Technically, resolves can vary between platforms, e.g. if certain requirements are only needed on Linux vs. macOS. This part can be punted on until we figure out how we want to address this problem.

Proposal
Calculate a SHA-256 hash of the input requirements + interpreter constraints every time a lockfile is used. When lockfiles are generated, write it to a custom header.
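As a sketch (not Pants' actual implementation), computing such a digest could look like the following. The function name and the JSON payload shape are illustrative assumptions:

```python
import hashlib
import json


def calculate_invalidation_digest(requirements, interpreter_constraints):
    """Hash the inputs that determine lockfile content (hypothetical helper)."""
    # Sort both inputs so that ordering differences alone don't invalidate the lockfile.
    payload = json.dumps(
        {
            "requirements": sorted(requirements),
            "interpreter_constraints": sorted(interpreter_constraints),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Note that `dest` and `description` are deliberately excluded, since they don't affect the resolve.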
If the calculated SHA does not match, warn or error saying the lockfile is stale and how to regenerate it. This behavior can be triggered by an option (not sure where makes sense, but maybe `[python-setup].stale_lockfile_behavior`) using an `Enum` with `warn` and `error` variants.

There are two parts to implement:
Pt 1: generation
Decide how to format the header section. Probably add a `@property` on `PythonLockfileRequest` for the hash.

Then, change the `result.output_digest` using the filesystem API: get the `DigestContents`, then use `CreateDigest` with a `FileContent` containing the new header + original content.

pants/src/python/pants/backend/experimental/python/lockfile.py
Lines 107 to 127 in 338ea37
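A minimal sketch of the byte-level transformation described above. In the real rule this would happen via `Get(DigestContents, Digest, result.output_digest)` followed by `CreateDigest` with a `FileContent`; here a plain function stands in, and the header text is an assumption:

```python
# Hypothetical header format; the real marker text is undecided.
HEADER_TEMPLATE = """\
# --- BEGIN PANTS LOCKFILE METADATA: DO NOT EDIT OR REMOVE ---
# invalidation digest: {digest}
# --- END PANTS LOCKFILE METADATA ---
"""


def add_invalidation_header(lockfile_content: bytes, digest: str) -> bytes:
    """Prepend the metadata header to pip-compile's output bytes."""
    return HEADER_TEMPLATE.format(digest=digest).encode("utf-8") + lockfile_content
```

Generating pip-compile's output with `--no-header` (as suggested below) would keep this prepend step from colliding with pip-compile's own banner.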
Semi-blocked by #12382, in that we need to safely write a header to the generated files. We probably want to start using pip-compile's `--no-header` option to make it easier to add.

Pt 2: Consumption
This will be a little trickier to figure out how to factor, as this code is somewhat in flight. It might make sense to pair on this.
Consumption of lockfiles is intentionally abstracted from callers. All they do is set `PexRequirements` to point to a file on the system when installing a Pex via `PexRequest`:

pants/src/python/pants/backend/python/util_rules/pex.py
Lines 61 to 68 in 338ea37

It's possible this factoring won't work, and we need to start explicitly linking the `PythonLockfileRequest` to the `PexRequest` at the call site so that our `pex.py` code can look at the hash and do the right thing.
Missing hashes
We should probably error if the lockfile does not have any hash in it. Should we suggest what the SHA should be, or force the user to regenerate?
Note: we're not hashing the requirements.txt file itself, only the inputs that were presumably used to generate the lockfile. If a user changes the lockfile on us, we wouldn't detect it, which is probably acceptable and allows users to do things like add custom comments.
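The warn/error behavior proposed above could be sketched as follows, treating a missing header digest the same as a mismatch. The enum and function names are hypothetical:

```python
from enum import Enum


class StaleLockfileBehavior(Enum):
    warn = "warn"
    error = "error"


def check_lockfile(stored_digest, expected_digest, behavior):
    """Compare the header digest against the freshly computed one (sketch).

    Returns None when valid, a warning message when stale (warn mode),
    or raises when stale (error mode). A missing digest counts as stale.
    """
    if stored_digest == expected_digest:
        return None
    msg = (
        "Lockfile is stale (or missing its invalidation header). "
        "Regenerate it after changing requirements or interpreter constraints."
    )
    if behavior is StaleLockfileBehavior.error:
        raise ValueError(msg)
    return msg  # the caller would log this as a warning
```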
Performance of determining interpreter constraints
As mentioned above, calculating the ICs for tools like Flake8 and Bandit is slow because we have to consult the whole repo (#12412). That perf is acceptable when generating lockfiles, as that's not done very frequently.
But it's not great for when you're simply trying to use the tools. Even though it will get memoized w/ Pantsd, this will probably add to the startup time of using Pants :/
I don't know what the right solution is. I think we do need to use ICs to determine if the lockfile is valid or not. I'm wondering if we could skip that piece, but that seems too risky 🤔