performantly redeploying PEX files by sharding their requirements #789
Comments
This is precisely how several of our machine learning teams are working around the problem internally. A few issues with that workaround spurred this new approach:
As mentioned above, we were considering using a virtualenv created via pip containing 3rdparty requirements, and using that and then running the pex file with
Either way, downloading maybe-cached individual requirements in parallel from within the datacenter seemed easier to implement, and likely to be more performant, than stuffing all of that into a monolithic pex which always has to be copied over in full to each machine before it can run the user's job. The current approach (of resolving "dehydrated" requirements in pex at runtime) was considered useful enough to make a PR out of because:
Two trains of thought we've been thinking of adopting are:
Both of these are intended to make it much more evident that this different type of "dehydrated" pex file isn't reproducible. Finally, it's important to note that the current implementation in #787 specifically makes sure to resolve all transitive requirements fully, and then stores them in the pex file. This ensures that the intransitive "resolve" at runtime (with
Thinking about it now, one alternative approach that might let us iterate on this idea internally, without requiring explicit pex support or separate wrapper scripts at runtime, is to make an internal pants task which modifies the bootstrap script in the generated pex file instead of doing the resolve within pex itself. However, support within the pex tool would still be required for the first commit of #787 (6390b49), to produce "dehydrated" pex files with the There are a lot of alternatives we have to the actual runtime resolve, but separating out the diff to cover just introducing the
I'm still stuck back on why PEX is the right tool for the job at all here. It sounds like your requirements / acceptable actions include:
A virtualenv + a pinned requirements file with hashes and a |
To put a finer point on the last - with #781 in flight, the pex resolver will == the pip resolver. When Pants upgrades to pex 2.0.0 with the pip resolver, would pants generating a lockfile with hashes be enough here? You deploy the lockfile and run |
Yes, specifically pex using the pip resolver as per #781 would make virtualenv a much more feasible solution! You correctly divined that a major concern with that approach was that this would lead to dissonance between the pex and pip resolves (didn't state this explicitly). However, the reason for focusing on the It's possible, however, that pants v2 + remote execution may be able to replace this entirely, as it: That's something @stuhood has been discussing, but we haven't delved into super thoroughly yet. A previous implementation of this idea (at https://github.com/cosmicexplorer/pex/tree/incremental-pex-production-with-fingerprinting) was intended to be more v2-friendly, which attempted to fingerprint individual source modules and requirements. There would need to be some design work to understand how to hook up pex to the pants v2 distributed file store, since the only reason that could be useful is if each requirement could be downloaded separately, in parallel, at runtime. Currently, pex resolving and downloading all requirements monolithically is not something we are able to break down post hoc, and we continue to pay the cost of uploading and downloading a massive pex each time. The @stuhood also mentioned that |
Yes. Fundamentally, we're looking for a way to separate layers. And fundamentally what we need is a Where we're not so clear is preferences between (at least) three ways of using that information:
I don't have a strong preference; I think we can make any of the three work. Does anyone else have a strong preference (looking specifically at @jsirois and @kwlzn for this one :))? A separate axis of decision-making is whether the file downloads should be done by using
My first attempt at addressing this issue did something like this, but adorning the PEX-INFO with content digests for source files. While that might be interesting later, it doesn’t address the issue we’re actually concerned about, which is downloading the 3rdparty requirements piece by piece (it still resolves requirements monolithically). So I don’t think this work should be used at all, but it did introduce a model that is potentially similar to what you’re describing here, just for source files instead of requirements. While this branch is incredibly complex, I believe that scoping the changes to just cover digesting 3rdparty requirements might be a way to support the It’s definitely possible to do that entirely outside of pex, but in investigating this type of implementation, I might at first experiment with having pex itself shard + digest 3rdparty requirements. I’ll post to this issue if we follow that route and find that it works.
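As a rough illustration of what "shard + digest" could mean for 3rdparty requirements, the sketch below maps each distribution file in a directory to a SHA-256 content digest, so that each file can be stored and fetched independently. The function names and the manifest shape are hypothetical; this is not the branch's implementation or a pex API.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict


def digest_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    hasher = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def shard_manifest(dist_dir: Path) -> Dict[str, str]:
    """Map every distribution file in dist_dir to its content digest.

    Hypothetical manifest shape: {file name: sha256 hex digest}.
    """
    return {
        p.name: digest_file(p)
        for p in sorted(dist_dir.iterdir())
        if p.is_file()
    }


# Toy stand-ins for wheel files; the contents are placeholders.
with tempfile.TemporaryDirectory() as td:
    dist_dir = Path(td)
    (dist_dir / "a-1.0-py3-none-any.whl").write_bytes(b"aaa")
    (dist_dir / "b-2.0-py3-none-any.whl").write_bytes(b"bbb")
    manifest = shard_manifest(dist_dir)

for name, digest in manifest.items():
    print(name, digest)
```

A manifest like this is what would let a runtime step decide, per requirement, whether a shard is already present before downloading it.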
I've created a Google Doc to discuss the alternate implementations of this idea in greater depth at https://docs.google.com/document/d/1B_g0Ofs8aQsJtrePPR1PCtSAKgBG1o59AhS_NwfFnbI/edit?usp=sharing!
Since work is well underway to implement all this on top of pex in pantsbuild/pants#8793 I'm going to close. Thanks for working through this @cosmicexplorer.
### Problem

See pex-tool/pex#789 for a description of the issue, and https://docs.google.com/document/d/1B_g0Ofs8aQsJtrePPR1PCtSAKgBG1o59AhS_NwfFnbI/edit for a google doc with pros and cons of different approaches. @jsirois was extremely helpful throughout the development of this feature, and pex-tool/pex#819 and pex-tool/pex#821 in pex `2.0.3` will help to optimize several other aspects of this process when we can unrevert #8787.

**Note:** `src/python/pants/backend/python/subsystems/pex_build_util.py` was removed in this PR, along with all floating references to it.

### Solution

With `--binary-py-generate-ipex`, a `.ipex` file will be created when `./pants binary` is run against a `python_binary()` target. This `.ipex` archive will create a `.pex` file and run it when first executed. The `.ipex` archive contains:

- in `IPEX-INFO`: the source files to inject into the resulting `.pex`, and pypi indices to resolve requirements from.
- in `BOOSTRAP-PEX-INFO`: the `PEX-INFO` of the pex file that *would* have been generated if `--generate-ipex` was False.
- in `ipex.py`: A bootstrap script which will generate a `.pex` file when the `.ipex` file is first executed.

### Result

For a `.ipex` file which hydrates the `tensorflow==1.14.0` dependency when it is first run, this translates to a >100x decrease in file size:

```bash
X> ls dist
total 145M
-rwxr-xr-x 1 dmcclanahan staff 267k Dec 10 21:11 dehydrated.ipex*
-rwxr-xr-x 1 dmcclanahan staff 134M Dec 10 21:11 dehydrated.pex*
```
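To make the archive layout described in that PR a little more concrete, here is a toy sketch that writes and reads back a zip containing an `IPEX-INFO` entry and an `ipex.py` entry. Only those two entry names come from the description; the JSON encoding and the `sources` / `indices` keys are assumptions made for illustration, and the second metadata entry is left out for brevity. The real pants implementation may differ.

```python
import io
import json
import zipfile
from typing import Any, Dict, List


def build_toy_ipex(sources: List[str], indices: List[str]) -> bytes:
    """Write a toy archive with an IPEX-INFO entry and a placeholder ipex.py.

    The JSON keys ("sources", "indices") are hypothetical.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("IPEX-INFO", json.dumps({"sources": sources, "indices": indices}))
        zf.writestr("ipex.py", "# bootstrap script placeholder\n")
    return buf.getvalue()


def read_ipex_info(archive: bytes) -> Dict[str, Any]:
    """Load the IPEX-INFO entry from an in-memory archive."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return json.loads(zf.read("IPEX-INFO"))


archive = build_toy_ipex(["src/main.py"], ["https://pypi.org/simple/"])
info = read_ipex_info(archive)
print(info["sources"], info["indices"])
```

The point is only that the archive carries metadata and sources rather than the resolved distributions, which is where the size difference comes from.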
This issue should be conclusively resolved before #787 can be reviewed.
From that PR:
Problem
Redeploying pex files full of many extremely large 3rdparty requirements (tensorflow, etc) into our datacenter at Twitter currently takes a very long time, since we upload them all at once into an internal artifact resolution utility, and then pull down the entire pex file before executing it. This slow redeploy also affects several of our internal Python development workflows and tooling for machine learning (including a Jupyter wrapper developed by @kwlzn) which depend on executing a pex file within the datacenter. In that case, modifying any Python source files in our monorepo currently requires waiting several minutes for changes to be usable within that Jupyter notebook.
As far as we are aware, other users of pex who package large machine learning applications also suffer from this issue and do not have an easy workaround.
(Initial) Proposed Solution
This is implemented in #787.
We would like to be able to ship around "dehydrated" pex files without 3rdparty requirements embedded in the pex, and resolve ("hydrate") them before executing the pex. This removes one half of the current process of synchronously waiting to upload and download 3rdparty requirements, and moves the remaining download part off the critical path of the entire redeploy process. Because the requirements to hydrate were already resolved when building the pex, we know all the exact versions of all the transitive dependencies to resolve at bootstrap time.
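A minimal sketch of the pinning step, assuming the build-time resolve is available as a list of (name, version) pairs: each resolved distribution becomes an exact `name==version` requirement that can be stored in the dehydrated pex. `ResolvedDist` and `pinned_requirements` are hypothetical names, and the versions other than `tensorflow==1.14.0` are illustrative, not a real resolve.

```python
from typing import Iterable, List, NamedTuple


class ResolvedDist(NamedTuple):
    """A distribution chosen by the build-time resolve (hypothetical type)."""

    name: str
    version: str


def pinned_requirements(resolved: Iterable[ResolvedDist]) -> List[str]:
    """Turn a full transitive resolve into sorted, de-duplicated exact pins."""
    pins = {f"{dist.name.lower()}=={dist.version}" for dist in resolved}
    return sorted(pins)


# Illustrative input: the same transitive dependency reached via two paths.
build_time_resolve = [
    ResolvedDist("tensorflow", "1.14.0"),
    ResolvedDist("six", "1.12.0"),
    ResolvedDist("six", "1.12.0"),
]
print(pinned_requirements(build_time_resolve))
```

Because every transitive dependency is already pinned, the bootstrap step only has to fetch exact versions and never has to make version choices of its own.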
There are many ways we could potentially make the bootstrap resolve process faster -- #787 just uses a CachingResolver, with the idea that the machines that execute our pex files will eventually have most of the large distributions cached and won't need to redownload them on every redeploy (or that we can provision machines to have these requirements already contained within their local pex cache).
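The caching idea can be sketched roughly as follows: pinned requirements already present in a local cache directory are reused, and only the missing ones are fetched, here concurrently. This is not pex's CachingResolver API; `hydrate`, the `fetch` callback, and the one-file-per-pin cache layout are all assumptions for illustration.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def hydrate(
    pins: Iterable[str],
    cache_dir: Path,
    fetch: Callable[[str], bytes],
    max_workers: int = 4,
) -> Dict[str, Path]:
    """Ensure every pinned requirement has a file in cache_dir.

    Cached pins are reused; missing pins are fetched in parallel.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    unique_pins = list(dict.fromkeys(pins))  # de-duplicate, keep order
    paths = {pin: cache_dir / pin for pin in unique_pins}
    missing = [pin for pin in unique_pins if not paths[pin].exists()]

    def fetch_one(pin: str) -> None:
        # Write to a temporary name first, then rename into place, so a
        # partially written file is never mistaken for a cache hit.
        tmp = paths[pin].with_name(paths[pin].name + ".tmp")
        tmp.write_bytes(fetch(pin))
        tmp.replace(paths[pin])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(fetch_one, missing))  # consume to surface exceptions
    return paths


# Stand-in for a real download; records which pins were requested.
fetched: List[str] = []


def fake_fetch(pin: str) -> bytes:
    fetched.append(pin)
    return pin.encode()


with tempfile.TemporaryDirectory() as td:
    cache = Path(td)
    pins = ["tensorflow==1.14.0", "six==1.12.0"]
    hydrate(pins, cache, fake_fetch)
    first_run = sorted(fetched)
    fetched.clear()
    hydrate(pins, cache, fake_fetch)  # everything is cached now
    second_run = list(fetched)

print(first_run, second_run)
```

On the second call nothing is fetched, which is the behavior the redeploy case relies on: machines that have already seen the large distributions pay nothing for them again.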
An alternative implementation of "hydration" that was considered was to use a virtualenv to hydrate requirements before running the pex with PEX_INHERIT_PATH=fallback, but it would be extremely helpful for us to avoid having to maintain separate tooling with virtualenv, and it would be really nice if pex could do that itself at bootstrap.
Feedback
From #787 (comment):
Alternative Solutions
From #787 (comment):
Responses to the above feedback to follow as comments in this issue.