
Add parallel download support to BatchDownloader #12388

Closed · wants to merge 11 commits

Conversation

@NeilBotelho (Contributor) commented Nov 4, 2023

This PR adds parallel download support to BatchDownloader and adds a cli option --parallel-downloads <n> to the install and download commands. If the option is not specified, or if set to 1, it doesn't change pip's behaviour or UI.

BatchDownloader is only called by _complete_partial_requirements after resolution is completed, so it wouldn't affect the resolution process as far as I can tell. It is used to download wheels where only metadata access was available for resolution. So as more packages adopt PEP 658 the number of wheels we would be able to download in parallel would also increase :)
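The approach described above can be sketched without pip's internals: hand each wheel to a thread pool and collect the results, where a pool of size 1 degenerates to the existing sequential behaviour. This is a minimal illustration, not pip's actual code; `fetch` stands in for an HTTP download by copying a local file, and all names are hypothetical.

```python
# Minimal sketch of parallel downloads via a thread pool. NOT pip's actual
# implementation; fetch() stands in for an HTTP download by copying a local
# file, and every name here is illustrative.
from concurrent.futures import ThreadPoolExecutor
import shutil
from typing import Iterable, List, Tuple

def fetch(src: str, dest: str) -> str:
    # Stand-in for downloading one wheel to its destination path.
    shutil.copyfile(src, dest)
    return dest

def fetch_all(pairs: Iterable[Tuple[str, str]], parallel_downloads: int = 1) -> List[str]:
    # parallel_downloads=1 behaves like the existing sequential path.
    with ThreadPoolExecutor(max_workers=parallel_downloads) as pool:
        futures = [pool.submit(fetch, src, dest) for src, dest in pairs]
        return [f.result() for f in futures]
```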

Note

Importantly, this PR does not add UI/progress bars for parallel downloads; it just logs the Downloading <wheel-file-name> message and disables the progress bar. I have been working on a separate PR that adds a progress bar for parallel downloads, so the two PRs can then be combined; I don't expect this PR to be merged until then. I have since opened a separate PR to add UI support for parallel downloads.

I've done it this way because the UI requires some discussion, due to limitations of how rich and even tqdm render parallel progress bars in Jupyter notebooks (see this issue). Currently the best way I can think to circumvent this issue is to add a flag (something like --jupyter) to be used when downloading in parallel in jupyter notebooks that will disable the progress bar only for the parallel downloads. But this isn't ideal.

Also, if someone could tell me where I should ideally initiate this sort of discussion (Discourse/IRC/Discord, etc.) I would appreciate it :)

dest="parallel_downloads",
type="int",
metavar="n",
default=None,

Would a default of the number of cores be acceptable? I think most users want this by default.

@NeilBotelho (Contributor, Author) replied:

This PR doesn't have a solution for showing progress when downloading in parallel; it just doesn't show the progress. Setting parallel downloads on by default would therefore cause unexpected behaviour for end users, which I would like to avoid.

I do have an open PR for a progress bar that supports parallel downloads (still a work in progress), but even there I don't see a solution for clean, non-breaking parallel progress bars in Jupyter. So I'd prefer to keep sequential download as the default, with its expected behaviour, and allow users to enable parallel downloads explicitly with the expectation of some weirdness (either no progress bar for parallel downloads, or a parallel progress bar that breaks only in Jupyter).

If I'm missing something here and someone knows of a way to have parallel progress bars in Jupyter please let me know :)

self.parallel_downloads = (
parallel_downloads if (parallel_downloads is not None) else 1
)
pool_maxsize = max(self.parallel_downloads, 10)

Is this not a configurable limit? It seems reasonable to allow more than 10 threads for downloading as each package should be able to be downloading in parallel.

@ghost commented Nov 21, 2023:

Please consider removing this, or changing it to just
pool_maxsize = self.parallel_downloads

Edit: I see now that we actually do set it below, so I think this is just unnecessary now.

@NeilBotelho (Contributor, Author) commented Nov 21, 2023:

The default pool_maxsize in requests.Session is 10. My thought process was to leave it unchanged unless the user asks for more than 10 connections to be opened: if self.parallel_downloads is greater than 10, pool_maxsize is set to self.parallel_downloads.
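The sizing rule described in this comment fits in a couple of lines. This is a sketch of the logic only; the constant 10 is requests/urllib3's default pool_maxsize, and `effective_pool_maxsize` is a hypothetical name, not pip's code.

```python
# Sketch of the pool sizing rule discussed above: keep requests' default
# pool size of 10 unless the user asks for more parallel connections.
DEFAULT_POOL_MAXSIZE = 10  # requests.Session / urllib3 default

def effective_pool_maxsize(parallel_downloads: int) -> int:
    return max(parallel_downloads, DEFAULT_POOL_MAXSIZE)
```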

A reviewer replied:

Thanks Neil, you're right. On a second review that seems a wise decision. I actually misread it as min not max. Finally got some coffee and this looks ready to land if the project maintainers can review it.

@ghost left a comment:

Old comment:

This PR looks good. I made two comments: one suggesting a changed default for the CLI argument, and another hoping to remove the restriction to a thread pool size of 10 (which would limit concurrent downloads). Otherwise, this is awesome.

Edit: Both comments are retracted, the CLI default is a careful incremental change and the limit does not exist.

This change has been carefully thought through and provides a path forward for pip parallelization. I'm keen to see some performance numbers soon and hope pypa will approve it.

@ofek (Sponsor, Contributor) commented Nov 27, 2023

Great work! Do you have a link to your other PR for the progress bar change?

@NeilBotelho (Contributor, Author) replied:

> Great work! Do you have a link to your other PR for the progress bar change?

Yep here it is: #12404

@pradyunsg (Member) commented Nov 27, 2023

@NeilBotelho Thanks for filing these PRs! They're on my personal radar to review, but I'm unlikely to have the bandwidth to review pip PRs in the coming weeks and will likely only look at this early next year.

One of the other maintainers might have bandwidth to look at this before then, of course. I'm not going to volunteer someone else but just erring on the side of communicating my (lack of) availability so that the lack of a response isn't misjudged to be a lack of interest! :)

@NeilBotelho (Contributor, Author) replied:

@pradyunsg no issues! I totally understand, and I appreciate the transparency. I'm just happy to finally be contributing back to pip :)

@pfmoore (Member) commented Nov 27, 2023

I'll also note that I have a number of personal commitments right now that mean I'll be unable to review this in the near future. Sorry!

@mkleinbort-ic commented:

This PR is very exciting for the speed-up it could offer pip in multi-core systems! I'll keep an eye on it

dest="parallel_downloads",
type="int",
metavar="n",
default=None,
A maintainer (Member) commented:

I wonder if it’s easier to use 0 as the default instead of None. This would make checks below a bit simpler.

By the way, what’s the difference between setting this to 1 and not at all? We should probably add something in documentation about this.

@NeilBotelho (Contributor, Author) commented Feb 25, 2024:

If the option is not set, then in pip._internal.network.session, self.parallel_downloads gets set to 1, so there isn't really a difference between setting it to 1 and not setting it at all. I wasn't sure whether a default of 0 or 1 to indicate no parallel downloads would be confusing. At the time of writing, I thought a default of None to indicate no parallel downloads made sense, but I'm happy to change it if you think 0 is better.

And thinking about it a bit more, it might make more sense to mention 2 as the minimum number of parallel downloads, as that is when the behaviour would change. What do you think @uranusjr ?

The maintainer (Member) replied:

I think it’s beneficial to have a value that simply means the default. This makes the CLI easier to interface, e.g. when writing a Bash script to conditionally switch parallel downloads on and off. So I’d want this to allow 1 and up.

@NeilBotelho (Contributor, Author) replied:

That makes sense to me. I'll update it to make 1 the default value.

Comment on lines +124 to +125
else:
parallel_downloads = 1
A maintainer (Member) commented:

Does this branch ever get used? I thought default would cover the case where the user does not supply a value.

@NeilBotelho (Contributor, Author) replied:

The parallel_downloads option is only added to the install and download sub-commands, so in other sub-commands (index, list, etc.) there is no options.parallel_downloads attribute. In those cases I'm setting parallel_downloads to 1.
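A hedged sketch of that fallback (`resolve_parallel_downloads` is a hypothetical helper, not pip's code): commands that never define the option simply have no such attribute on their options object, so getattr with a default covers both cases.

```python
# Sketch: default to 1 (sequential) for sub-commands that don't define
# --parallel-downloads. Not pip's actual code; names are illustrative.
from optparse import Values

def resolve_parallel_downloads(options: Values) -> int:
    value = getattr(options, "parallel_downloads", None)
    return value if value is not None else 1
```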

@uranusjr (Member) commented:

I wonder how this should be unit-tested.

Co-authored-by: Tzu-ping Chung <uranusjr@gmail.com>
@NeilBotelho (Contributor, Author) replied:

> I wonder how this should be unit-tested.

One option would be to unit test the _download_parallel method of BatchDownloader by adding Links for two packages, though I don't know if this is best practice.

I do see some examples of packages being downloaded in tests/functional/test_install_extras.py, so a functional test of pip install end-to-end with the --parallel-downloads option passed might be better.

@@ -76,6 +78,9 @@ def add_options(self) -> None:

@with_cleanup
def run(self, options: Values, args: List[str]) -> int:
if options.parallel_downloads < 1:
raise CommandError("Value of '--parallel-downloads' must be greater than 0")

Could this be more user friendly: instead of raising CommandError, surface the error to the user but default to 1?

@NeilBotelho (Contributor, Author) commented Apr 3, 2024:

Do you mean raise a warning if the user sets --parallel-downloads to 0, then set it to 1 and continue the install/download operation? That might be more user friendly, but I think it'd be better to be explicit and raise an error with a clear message.

@morotti (Contributor) commented Jul 3, 2024

(I am not a pip maintainer but trying to do some optimizations in pip now)

Hello,

Before doing parallel downloads, you will need this PR to be merged first to fix/optimize the download code: https://github.com/pypa/pip/pull/12810/files

It turns out the code doing the download was reading in small 10 KB chunks and updating the progress bar after every chunk. As much as 30% of the download time was actually wasted just re-rendering the progress bar ^^

If you see performance improvements with this PR from adding parallel downloads, I'm sorry to say it might be in large part because it removed the progress bar.

I'm afraid parallelization of downloads is unlikely to give huge performance improvements until the linked PR to fix downloads is merged; otherwise most of the time is spent in pip's Python code, which holds the GIL. (I think the download in openssl/urllib and the hash calculation should release the GIL and allow other threads to run; unfortunately they only get chunks of 10 KB and 8 KB respectively before returning to Python code.)

Cheers.
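The per-chunk overhead morotti describes can be illustrated without any networking: a fixed cost (such as a progress-bar re-render) is paid once per chunk, so a 10 KB chunk size multiplies the Python-side work roughly a hundredfold on a 1 MiB payload. The numbers and names below are illustrative only, not pip's actual values or code.

```python
# Illustration of per-chunk overhead: one "progress update" per chunk means
# small chunks multiply the Python-side work. Illustrative only.
import io

def count_progress_updates(payload: bytes, chunk_size: int) -> int:
    stream = io.BytesIO(payload)
    updates = 0
    while stream.read(chunk_size):
        updates += 1  # one progress-bar update per chunk read
    return updates

data = b"x" * (1024 * 1024)                   # 1 MiB payload
small = count_progress_updates(data, 10 * 1024)    # 10 KiB chunks -> 103 updates
large = count_progress_updates(data, 1024 * 1024)  # 1 MiB chunk   -> 1 update
print(small, large)
```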

@morotti (Contributor) commented Aug 15, 2024

(not a pip maintainer)

My performance improvements to the download code were merged in 24.2. I see downloads going from ~60 to ~460 MB/s on an internal network, mostly by outsourcing work to C code in requests/ssl that does not hold the GIL. That should leave room for concurrent threads to run.

Now would be a good time to rebase and bring this PR back to life.

On the code review: you will want to remove the two functions _sequential_download() and _download_parallel(). IMO there should be a single function; sequential download is simply the effect of using workers=1. That way there is a single code path to write and maintain, and it gets exercised by all the tests.

@NeilBotelho (Contributor, Author) replied:

I haven't had a chance to work on this in a while and I see that someone recently opened a PR to address this same issue, so I'll close this in favour of it.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 7, 2024