-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add parallel download support to BatchDownloader #12388
Conversation
This parameter is used to set the number of parallel downloads in BatchDownloader as well as to set the pool_connections in the HTTPAdapter to prevent 'Connection pool full' warnings
src/pip/_internal/cli/cmdoptions.py
Outdated
dest="parallel_downloads", | ||
type="int", | ||
metavar="n", | ||
default=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would a default of the number of cores
be acceptable? I think most users want this by default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR doesn't have a solution for showing progress when downloading in parallel. It just doesn't show the progress. Setting parallel downloads on by default would cause unexpected behaviour for end users by default, which I would like to avoid
I do have an open PR for a progress bar that supports parallel downloads(still a work in progress) but even there, I don't see a solution to having clean, non breaking parallel progress bars in Jupyter. So I'd prefer to keep sequential download as the default, having expected behaviour and allowing the option to enable parallel downloads explicitly with the expectation of some weirdness(either no progress bar for parallel downloads or a broken parallel progress bar only in Jupyter)
If I'm missing something here and someone knows of a way to have parallel progress bars in Jupyter please let me know :)
self.parallel_downloads = ( | ||
parallel_downloads if (parallel_downloads is not None) else 1 | ||
) | ||
pool_maxsize = max(self.parallel_downloads, 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this not a configurable limit? It seems reasonable to allow more than 10 threads for downloading as each package should be able to be downloading in parallel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Edit: I see now that we actually do set it below, so I think this is just unnecessary now.
Please consider removing/changing to just
pool_maxsize = self.parallel_downloads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default pool_maxsize
is 10 in requests.Session
. My thought process was to not change it if it wasn't necessitated by the user asking for more than 10 connections to be opened. If the value of self.parallel_downloads
is greater than 10 it sets pool_maxsize
to self.parallel_downloads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Neil, you're right. On a second review that seems a wise decision. I actually misread it as min
not max
. Finally got some coffee and this looks ready to land if the project maintainers can review it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old comment:
This PR looks good. Made two comments, one suggesting a changed default for the CLI args and another hoping to remove the restriction to a thread pool size of 10 (which would limit concurrent downloads). Otherwise, this is awesome
Edit: Both comments are retracted, the CLI default is a careful incremental change and the limit does not exist.
This change has been carefully thought through and provides a path forward for pip parallelization. I'm keen to see some performance numbers soon and hope pypa will approve it.
Great work! Do you have a link to your other PR for the progress bar change? |
Yep here it is: #12404 |
@NeilBotelho Thanks for filing these PRs! They're on my personal radar to review, but I'm unlikely to have the bandwidth to review pip PRs in the coming weeks and will likely only look at this early next year. One of the other maintainers might have bandwidth to look at this before then, of course. I'm not going to volunteer someone else but just erring on the side of communicating my (lack of) availability so that the lack of a response isn't misjudged to be a lack of interest! :) |
@pradyunsg no issues! I totally undestand, and I appreciate the transparency. I'm just happy to finally be contributing back to pip :) |
I'll also note that I have a number of personal commitments right now that mean I'll be unable to review this in the near future. Sorry! |
This PR is very exciting for the speed-up it could offer pip in multi-core systems! I'll keep an eye on it |
src/pip/_internal/cli/cmdoptions.py
Outdated
dest="parallel_downloads", | ||
type="int", | ||
metavar="n", | ||
default=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if it’s easier to use 0 as the default instead of None. This would make checks below a bit simpler.
By the way, what’s the difference between setting this to 1 and not at all? We should probably add something in documentation about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the option is not set, then in pip._internal.network.session
self.parallel_downloads gets set to 1. So there isn't really a difference between setting it to 1 and not at all. I wasn't sure if it would be confusing setting parallel_downloads to 0 or 1 by default to indicate no parallel downloads. At the time of writing I thought a default of none to indicate no parallel_downloads made sense, but I'm happy to change it if you think 0 is better.
And thinking about it a bit more, it might make more sense to mention 2 as the minimum number of parallel downloads, as that is when the behaviour would change. What do you think @uranusjr ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it’s beneficial to have a value that simply means the default. This makes the CLI easier to interface, e.g. when writing a Bash script to conditionally switch parallel downloads on and off. So I’d want this to allow 1 and up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That makes sense to me. I'll update it to make 1 the default value.
74e283c
to
e754fd6
Compare
else: | ||
parallel_downloads = 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this branch ever get used? I thought default
would cover the case where the user does not supply a value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the parallel_downloads
option is only added to the install
and download
sub-commands, so in other sub-commands (index, list etc.) there is no options.parallel_downloads
. So in those cases I'm setting parallel_downloads
to 1.
I wonder how this should be unit-tested. |
Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
One option to test this would be to unit test the I do see some examples of packages being downloaded in |
@@ -76,6 +78,9 @@ def add_options(self) -> None: | |||
|
|||
@with_cleanup | |||
def run(self, options: Values, args: List[str]) -> int: | |||
if options.parallel_downloads < 1: | |||
raise CommandError("Value of '--parallel-downloads' must be greater than 0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be use friendly, by instead of raising CommandError, surfacing the error to the user, but defaulting to 1, even here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean raise a warning if the user sets --parallel_downloads
to 0, then set it to 1 and continue the install/download operation?
That might be more user friendly but I think it'd be better to be more explicit and throw an error with a clear message.
(I am not a pip maintainer but trying to do some optimizations in pip now) Hello, Before doing parallel download, you will need this PR to be merged first to fix/optimize the download code https://github.com/pypa/pip/pull/12810/files turns out, the code that's doing the download was downloading in small 10k bytes chunk and upgrading the progress bar after every chunk. If you see performance improvements with this PR by adding parallel downloads, I'm sorry to say this might be in large part because it removed the progress bar. I'm afraid parallelization of downloads is unlikely to give huge performance improvements until the linked PR to fix download is merged, otherwise most of the time is taken in pip python code that's holding the GIL. Cheers. |
(not a pip maintainer) my performance improvements to download code have been merged in 24.2. I see download going from ~60 to ~460 MB/s on internal network, mostly outsourcing to C code in requests/ssl that does not hold the GIL. now would be a good time to rebase and bring this PR back to life. on the code review, you will want to remove the 2 functions _sequential_download() + _download_parallel() |
I haven't had a chance to work on this in a while and I see that someone recently opened a PR to address this same issue, so I'll close this in favour of it. |
This PR adds parallel download support to
BatchDownloader
and adds a cli option--parallel-downloads <n>
to the install and download commands. If the option is not specified, or if set to 1, it doesn't change pip's behaviour or UI.BatchDownloader
is only called by_complete_partial_requirements
after resolution is completed, so it wouldn't affect the resolution process as far as I can tell. It is used to download wheels where only metadata access was available for resolution. So as more packages adopt PEP 658 the number of wheels we would be able to download in parallel would also increase :)Note
Importantly, this PR does not add UI/progress bars for parallel downloads, it just logs the
Downloading <wheel-file-name>
message and disables the progress bar.I am working on a separate PR that adds a progress bar for parallel downloads. These 2 PR's can then be combined. I don't expect this PR to be merged until that PR is opened.I have opened a separate PR to add UI support for parallel downloadsI've done it this way because the UI requires some discussion, due to limitations of how rich and even tqdm render parallel progress bars in Jupyter notebooks (see this issue). Currently the best way I can think to circumvent this issue is to add a flag (something like --jupyter) to be used when downloading in parallel in jupyter notebooks that will disable the progress bar only for the parallel downloads. But this isn't ideal.
Also, if someone could tell me where I should ideally initiate this sort of discussion(discourse/IRC/discord etc.) I would appreciate it :)