
Add parallel download support to BatchDownloader #12388

Closed · wants to merge 11 commits

Conversation

@NeilBotelho (Contributor) commented Nov 4, 2023

This PR adds parallel download support to BatchDownloader and adds a cli option --parallel-downloads <n> to the install and download commands. If the option is not specified, or if set to 1, it doesn't change pip's behaviour or UI.

BatchDownloader is only called by _complete_partial_requirements after resolution is completed, so it wouldn't affect the resolution process as far as I can tell. It is used to download wheels where only metadata access was available for resolution. So as more packages adopt PEP 658 the number of wheels we would be able to download in parallel would also increase :)
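The approach described above can be sketched without pip's internals: hand each wheel to a thread pool and collect the results, where a pool of size 1 degenerates to the existing sequential behaviour. This is a minimal illustration, not pip's actual code; `fetch` stands in for an HTTP download by copying a local file, and all names are hypothetical.

```python
# Minimal sketch of parallel downloads via a thread pool. NOT pip's actual
# implementation; fetch() stands in for an HTTP download by copying a local
# file, and every name here is illustrative.
from concurrent.futures import ThreadPoolExecutor
import shutil
from typing import Iterable, List, Tuple

def fetch(src: str, dest: str) -> str:
    # Stand-in for downloading one wheel to its destination path.
    shutil.copyfile(src, dest)
    return dest

def fetch_all(pairs: Iterable[Tuple[str, str]], parallel_downloads: int = 1) -> List[str]:
    # parallel_downloads=1 behaves like the existing sequential path.
    with ThreadPoolExecutor(max_workers=parallel_downloads) as pool:
        futures = [pool.submit(fetch, src, dest) for src, dest in pairs]
        return [f.result() for f in futures]
```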

Note

Importantly, this PR does not add UI/progress bars for parallel downloads; it just logs the Downloading <wheel-file-name> message and disables the progress bar. I have been working on a separate PR that adds a progress bar for parallel downloads, so the two PRs can then be combined; I don't expect this PR to be merged until then. I have since opened a separate PR to add UI support for parallel downloads.

I've done it this way because the UI requires some discussion, due to limitations of how rich and even tqdm render parallel progress bars in Jupyter notebooks (see this issue). Currently the best way I can think to circumvent this issue is to add a flag (something like --jupyter) to be used when downloading in parallel in jupyter notebooks that will disable the progress bar only for the parallel downloads. But this isn't ideal.

Also, if someone could tell me where I should ideally initiate this sort of discussion (Discourse/IRC/Discord, etc.) I would appreciate it :)

dest="parallel_downloads",
type="int",
metavar="n",
default=None,

Would a default of the number of cores be acceptable? I think most users want this by default.

@NeilBotelho (Contributor, Author) replied:

This PR doesn't have a solution for showing progress when downloading in parallel; it just doesn't show the progress. Setting parallel downloads on by default would therefore cause unexpected behaviour for end users, which I would like to avoid.

I do have an open PR for a progress bar that supports parallel downloads (still a work in progress), but even there I don't see a solution for clean, non-breaking parallel progress bars in Jupyter. So I'd prefer to keep sequential download as the default, with its expected behaviour, and allow users to enable parallel downloads explicitly with the expectation of some weirdness (either no progress bar for parallel downloads, or a parallel progress bar that breaks only in Jupyter).

If I'm missing something here and someone knows of a way to have parallel progress bars in Jupyter please let me know :)

self.parallel_downloads = (
parallel_downloads if (parallel_downloads is not None) else 1
)
pool_maxsize = max(self.parallel_downloads, 10)

Is this not a configurable limit? It seems reasonable to allow more than 10 threads for downloading as each package should be able to be downloading in parallel.

@ghost commented Nov 21, 2023:

Please consider removing this, or changing it to just
pool_maxsize = self.parallel_downloads

Edit: I see now that we actually do set it below, so I think this is just unnecessary now.

@NeilBotelho (Contributor, Author) commented Nov 21, 2023:

The default pool_maxsize in requests.Session is 10. My thought process was to leave it unchanged unless the user asks for more than 10 connections to be opened: if self.parallel_downloads is greater than 10, pool_maxsize is set to self.parallel_downloads.
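The sizing rule described in this comment fits in a couple of lines. This is a sketch of the logic only; the constant 10 is requests/urllib3's default pool_maxsize, and `effective_pool_maxsize` is a hypothetical name, not pip's code.

```python
# Sketch of the pool sizing rule discussed above: keep requests' default
# pool size of 10 unless the user asks for more parallel connections.
DEFAULT_POOL_MAXSIZE = 10  # requests.Session / urllib3 default

def effective_pool_maxsize(parallel_downloads: int) -> int:
    return max(parallel_downloads, DEFAULT_POOL_MAXSIZE)
```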

A reviewer replied:

Thanks Neil, you're right. On a second review that seems a wise decision. I actually misread it as min not max. Finally got some coffee and this looks ready to land if the project maintainers can review it.

@ghost left a comment:

Old comment:

This PR looks good. I made two comments: one suggesting a changed default for the CLI argument, and another hoping to remove the restriction to a thread pool size of 10 (which would limit concurrent downloads). Otherwise, this is awesome.

Edit: Both comments are retracted, the CLI default is a careful incremental change and the limit does not exist.

This change has been carefully thought through and provides a path forward for pip parallelization. I'm keen to see some performance numbers soon and hope pypa will approve it.

@ofek (Sponsor, Contributor) commented Nov 27, 2023

Great work! Do you have a link to your other PR for the progress bar change?

@NeilBotelho (Contributor, Author) replied:

> Great work! Do you have a link to your other PR for the progress bar change?

Yep here it is: #12404

@pradyunsg (Member) commented Nov 27, 2023

@NeilBotelho Thanks for filing these PRs! They're on my personal radar to review, but I'm unlikely to have the bandwidth to review pip PRs in the coming weeks and will likely only look at this early next year.

One of the other maintainers might have bandwidth to look at this before then, of course. I'm not going to volunteer someone else but just erring on the side of communicating my (lack of) availability so that the lack of a response isn't misjudged to be a lack of interest! :)

@NeilBotelho (Contributor, Author) replied:

@pradyunsg no issues! I totally understand, and I appreciate the transparency. I'm just happy to finally be contributing back to pip :)

@pfmoore (Member) commented Nov 27, 2023

I'll also note that I have a number of personal commitments right now that mean I'll be unable to review this in the near future. Sorry!

@mkleinbort-ic commented:

This PR is very exciting for the speed-up it could offer pip in multi-core systems! I'll keep an eye on it

dest="parallel_downloads",
type="int",
metavar="n",
default=None,
A maintainer (Member) commented:

I wonder if it’s easier to use 0 as the default instead of None. This would make checks below a bit simpler.

By the way, what’s the difference between setting this to 1 and not at all? We should probably add something in documentation about this.

@NeilBotelho (Contributor, Author) commented Feb 25, 2024:

If the option is not set, then in pip._internal.network.session, self.parallel_downloads gets set to 1, so there isn't really a difference between setting it to 1 and not setting it at all. I wasn't sure whether a default of 0 or 1 to indicate no parallel downloads would be confusing. At the time of writing, I thought a default of None to indicate no parallel downloads made sense, but I'm happy to change it if you think 0 is better.

And thinking about it a bit more, it might make more sense to mention 2 as the minimum number of parallel downloads, as that is when the behaviour would change. What do you think @uranusjr ?

The maintainer (Member) replied:

I think it’s beneficial to have a value that simply means the default. This makes the CLI easier to interface, e.g. when writing a Bash script to conditionally switch parallel downloads on and off. So I’d want this to allow 1 and up.

@NeilBotelho (Contributor, Author) replied:

That makes sense to me. I'll update it to make 1 the default value.

Comment on lines +124 to +125
else:
parallel_downloads = 1
A maintainer (Member) commented:

Does this branch ever get used? I thought default would cover the case where the user does not supply a value.

@NeilBotelho (Contributor, Author) replied:

The parallel_downloads option is only added to the install and download sub-commands, so in other sub-commands (index, list, etc.) there is no options.parallel_downloads attribute. In those cases I'm setting parallel_downloads to 1.
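A hedged sketch of that fallback (`resolve_parallel_downloads` is a hypothetical helper, not pip's code): commands that never define the option simply have no such attribute on their options object, so getattr with a default covers both cases.

```python
# Sketch: default to 1 (sequential) for sub-commands that don't define
# --parallel-downloads. Not pip's actual code; names are illustrative.
from optparse import Values

def resolve_parallel_downloads(options: Values) -> int:
    value = getattr(options, "parallel_downloads", None)
    return value if value is not None else 1
```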

@uranusjr (Member) commented:

I wonder how this should be unit-tested.

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
@NeilBotelho (Contributor, Author) replied:

> I wonder how this should be unit-tested.

One option would be to unit test the _download_parallel method of BatchDownloader by adding Links for two packages, though I don't know if this is best practice.

I do see some examples of packages being downloaded in tests/functional/test_install_extras.py, so a functional test of pip install end-to-end with the --parallel-downloads option passed might be better.

@@ -76,6 +78,9 @@ def add_options(self) -> None:

@with_cleanup
def run(self, options: Values, args: List[str]) -> int:
if options.parallel_downloads < 1:
raise CommandError("Value of '--parallel-downloads' must be greater than 0")

Could this be more user friendly: instead of raising CommandError, surface the error to the user but default to 1?

@NeilBotelho (Contributor, Author) commented Apr 3, 2024:

Do you mean raise a warning if the user sets --parallel-downloads to 0, then set it to 1 and continue the install/download operation? That might be more user friendly, but I think it'd be better to be explicit and raise an error with a clear message.

@morotti (Contributor) commented Jul 3, 2024

(I am not a pip maintainer but trying to do some optimizations in pip now)

Hello,

Before doing parallel downloads, you will need this PR to be merged first to fix/optimize the download code: https://github.com/pypa/pip/pull/12810/files

It turns out the code doing the download was reading in small 10 KB chunks and updating the progress bar after every chunk. As much as 30% of the download time was actually wasted just re-rendering the progress bar ^^

If you see performance improvements with this PR from adding parallel downloads, I'm sorry to say it might be in large part because it removed the progress bar.

I'm afraid parallelization of downloads is unlikely to give huge performance improvements until the linked PR to fix downloads is merged; otherwise most of the time is spent in pip's Python code, which holds the GIL. (I think the download in openssl/urllib and the hash calculation should release the GIL and allow other threads to run; unfortunately they only get chunks of 10 KB and 8 KB respectively before returning to Python code.)

Cheers.
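The per-chunk overhead morotti describes can be illustrated without any networking: a fixed cost (such as a progress-bar re-render) is paid once per chunk, so a 10 KB chunk size multiplies the Python-side work roughly a hundredfold on a 1 MiB payload. The numbers and names below are illustrative only, not pip's actual values or code.

```python
# Illustration of per-chunk overhead: one "progress update" per chunk means
# small chunks multiply the Python-side work. Illustrative only.
import io

def count_progress_updates(payload: bytes, chunk_size: int) -> int:
    stream = io.BytesIO(payload)
    updates = 0
    while stream.read(chunk_size):
        updates += 1  # one progress-bar update per chunk read
    return updates

data = b"x" * (1024 * 1024)                   # 1 MiB payload
small = count_progress_updates(data, 10 * 1024)    # 10 KiB chunks -> 103 updates
large = count_progress_updates(data, 1024 * 1024)  # 1 MiB chunk   -> 1 update
print(small, large)
```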

@morotti (Contributor) commented Aug 15, 2024

(not a pip maintainer)

My performance improvements to the download code were merged in 24.2. I see downloads going from ~60 to ~460 MB/s on an internal network, mostly by outsourcing work to C code in requests/ssl that does not hold the GIL. That should leave room for concurrent threads to run.

Now would be a good time to rebase and bring this PR back to life.

On the code review: you will want to remove the two functions _sequential_download() and _download_parallel(). IMO there should be a single function; sequential download is simply the effect of using workers=1. That way there is a single code path to write and maintain, and it gets exercised by all the tests.

@NeilBotelho (Contributor, Author) replied:

I haven't had a chance to work on this in a while and I see that someone recently opened a PR to address this same issue, so I'll close this in favour of it.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 7, 2024