-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-4032]Support staging binary distributions of dependency packages #16633
Conversation
7b374c6
to
2965473
Compare
Codecov Report
@@ Coverage Diff @@
## master #16633 +/- ##
==========================================
+ Coverage 74.64% 83.59% +8.94%
==========================================
Files 655 452 -203
Lines 82325 62274 -20051
==========================================
- Hits 61454 52058 -9396
+ Misses 19871 10216 -9655
+ Partials 1000 0 -1000
Continue to review full report at Codecov.
|
2965473
to
7c42c08
Compare
7c42c08
to
e310036
Compare
R: @tvalentyn |
Run Python PreCommit |
cc @robertwb |
sdks/python/container/piputil.go
Outdated
@@ -51,7 +51,7 @@ func pipInstallRequirements(files []string, dir, name string) error { | |||
// used without following their dependencies. | |||
args := []string{"install", "-r", filepath.Join(dir, name), "--disable-pip-version-check", "--no-index", "--no-deps", "--find-links", dir} | |||
if err := execx.Execute(pip, args...); err != nil { | |||
return err | |||
fmt.Println("Requirements cache " + dir + " is empty. Downloading packages from PyPI") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Printing this in case if the first Pip call fails(when user skips populating requirements cache) to provide more context if user looks into the logs
'the requirements file using the --requirements_file option.')) | ||
'the requirements file using the --requirements_file option.' | ||
'If you want to skip populating requirements cache, please ' | ||
'specify --requirements_cache skip. This would install all' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would remove the last sentence.
'--requirements_cache_only_sources', | ||
action='store_true', | ||
help=( | ||
'Enable this flag to populate requirements cache with Source' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'Enable this flag to populate requirements cache with Source' | |
'Enable this flag to populate requirements cache only with Source' |
'Enable this flag to populate requirements cache with Source' | ||
'distributions(sdists) of the dependencies mentioned in the ' | ||
'--requirements_file' | ||
'Note: This step would slow down the worker startup time' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note (BEAM-XXXX): This flag may significantly slow down the pipeline submission. It is added to preserve the requirements cache behavior prior to 2.37.0 and will likely be removed in future releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added BEAM-4032, since it has the discussion thread, and comments on this issue
if setup_options.requirements_cache is None else | ||
os.path.join(tempfile.gettempdir(), 'dataflow-requirements-cache') if | ||
(setup_options.requirements_cache is None) and | ||
(setup_options.requirements_cache != SKIP_REQUIREMENTS_CACHE) else |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and...
part seems not necessary
# TODO(anandinguva): When https://github.com/pypa/pip/issues/10760 is | ||
# addressed download wheel based on glib version in Beam's Python Base image | ||
pip_version = pkg_resources.get_distribution('pip').version | ||
if float(pip_version[0:4]) >= 19.3: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not reliable, for example it would fail on pip < 10; let's use an off-the-shelf helper that knows how to parse
and compare versions, see: pkg_resources.parse_version
or packaging.version.parse
os.path.join(temp_dir, 'nothing.tar.gz'), 'Fake tarball') | ||
else: | ||
self.create_temp_file( | ||
os.path.join(temp_dir, 'nothing.tar.gz'), 'Fake tarball') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
appears in both branches
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we download whls and if a whl is not present on PyPI, we download sdist. That is the reason why it is present in both branches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant that you can add nothing.tar.gz
outside the if
since this code is in both branches. But feel free to keep as is if you think it's more clear this way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -342,6 +343,47 @@ def test_with_requirements_file_and_cache(self): | |||
self.assertTrue(os.path.isfile(os.path.join(staging_dir, 'abc.txt'))) | |||
self.assertTrue(os.path.isfile(os.path.join(staging_dir, 'def.txt'))) | |||
|
|||
def test_with_requirements_file_skipping_cache(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's possible to rename the test method to show what's being tested and expected outcome, this makes test cases easier to follow. for example:
test_requirements_cache_not_populated_when_cache_disabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
||
resources = self.stager.create_and_stage_job_resources( | ||
options, | ||
populate_requirements_cache=self.populate_requirements_cache, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively, you could pass a mock method and verify that it was never called.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
@@ -723,6 +765,65 @@ def test_remove_dependency_from_requirements(self): | |||
self.assertEqual(['apache_beam\n', 'avro-python3\n', 'numpy\n'], | |||
sorted(lines)) | |||
|
|||
def _create_file( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def _create_file( | |
def _populate_requitements_cache_fake( |
Co-authored-by: tvalentyn <tvalentyn@users.noreply.github.com>
PTAL: @tvalentyn |
Run Python 3.8 PostCommit |
1 similar comment
Run Python 3.8 PostCommit |
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
############################################################################### | ||
numpy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example already requires numpy in
'numpy', |
Given that you are modifying the example, users may be confused why numpy needs to be specified twice: in a requirements.txt file and in setup.py.
Let's try to use a different pipeline for this test. You could create an example where a requirements file is used, or just a test pipeline that does something trivial, like checking whether a particular library is installed and doesn't do any data processing.
This reverts commit d95f2d0.
PTAL @tvalentyn. Added a trivial Integration test to check if the specified packages in the requirements.txt are installed on the worker container.
I will resolve conflicts once this code is good to go |
sdks/python/apache_beam/examples/complete/check_requirements_file_packages.py
Outdated
Show resolved
Hide resolved
sdks/python/apache_beam/examples/complete/check_requirements_file_packages.py
Outdated
Show resolved
Hide resolved
sdks/python/apache_beam/examples/complete/check_requirements_file_packages.py
Outdated
Show resolved
Hide resolved
Creates requirements text for the integration test during runtime
Run Python 3.8 PostCommit |
see also lint failures: https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Commit/12436/ |
Yes, I already created a commit on that locally. I used to have a lint checker installed but later I disabled it for some reason. I have to enable it now(it would make my life a lot easier) |
Portable precommit fails due to a side effect from my testing on #16008. Sorry about that. |
Run Portable_Python PreCommit |
PTAL @tvalentyn. Thanks |
Run Python 3.8 PostCommit |
PTAL @tvalentyn |
Merging, thanks! Please update the PR description. |
This PR changes how we stage packages mentioned in the
--requirements_file
. Prior to this, the default behavior was to download the source distributions of the packages specified in the--requirements_file
during staging. The current behavior would be to download binary distributions(wheels) for packages specified in the--requirements_file
without their sub-dependencies. To follow the prior behavior, set--requirements_cache_only_sources
. Staging of the packages/pipeline dependencies can be skipped by setting--requirements_cache=skip
during job submission.Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
R: @username
).[BEAM-XXX] Fixes bug in ApproximateQuantiles
, where you replaceBEAM-XXX
with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.CHANGES.md
with noteworthy changes.See the Contributor Guide for more tips on how to make review process smoother.
ValidatesRunner
compliance status (on master branch)Examples testing status on various runners
Post-Commit SDK/Transform Integration Tests Status (on master branch)
Pre-Commit Tests Status (on master branch)
See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.
GitHub Actions Tests Status (on master branch)
See CI.md for more information about GitHub Actions CI.