Add parallel download support to BatchDownloader #12388
Changes from 8 commits
@@ -0,0 +1 @@
+Add parallel download support to BatchDownloader
@@ -764,6 +764,19 @@ def _handle_no_cache_dir(
     help="Check the build dependencies when PEP517 is used.",
 )
 
+parallel_downloads: Callable[..., Option] = partial(
+    Option,
+    "--parallel-downloads",
+    dest="parallel_downloads",
+    type="int",
+    metavar="n",
+    default=None,
+    help=(
+        "Use up to <n> threads to download packages in parallel. "
+        "<n> must be greater than 0."
+    ),
+)
+
 
 def _handle_no_use_pep517(
     option: Option, opt: str, value: str, parser: OptionParser

[Review thread on the `default=None` line]

I wonder if it's easier to use 0 as the default instead of None. This would make the checks below a bit simpler. By the way, what's the difference between setting this to 1 and not setting it at all? We should probably add something about this in the documentation.

If the option is not set, then in the session it falls back to 1, so the behaviour is the same as passing 1 explicitly. And thinking about it a bit more, it might make more sense to mention 2 as the minimum number of parallel downloads, as that is when the behaviour would change. What do you think @uranusjr?

I think it's beneficial to have a value that simply means the default. This makes the CLI easier to interface with, e.g. when writing a Bash script to conditionally switch parallel downloads on and off. So I'd want this to allow 1 and up.

That makes sense to me. I'll update it to make 1 the default value.
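To make the thread above concrete, here is a small standalone sketch (plain optparse, not pip internals) of the pattern used in this hunk: a module-level partial acts as an Option factory, and an unset --parallel-downloads falls back to 1, so passing 1 and not passing the flag behave the same and 2 is the first value that changes behaviour.

```python
# Standalone illustration only; not pip code.
from functools import partial
from optparse import Option, OptionParser

parallel_downloads = partial(
    Option,
    "--parallel-downloads",
    dest="parallel_downloads",
    type="int",
    metavar="n",
    default=None,
    help="Use up to <n> threads to download packages in parallel.",
)

parser = OptionParser()
parser.add_option(parallel_downloads())

# Flag not passed: fall back to 1, i.e. sequential downloads.
opts, _ = parser.parse_args([])
print(opts.parallel_downloads if opts.parallel_downloads is not None else 1)  # 1

# Flag passed explicitly: use up to 4 download threads.
opts, _ = parser.parse_args(["--parallel-downloads", "4"])
print(opts.parallel_downloads)  # 4
```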
@@ -7,6 +7,7 @@
 from pip._internal.cli.cmdoptions import make_target_python
 from pip._internal.cli.req_command import RequirementCommand, with_cleanup
 from pip._internal.cli.status_codes import SUCCESS
+from pip._internal.exceptions import CommandError
 from pip._internal.operations.build.build_tracker import get_build_tracker
 from pip._internal.req.req_install import check_legacy_setup_py_options
 from pip._internal.utils.misc import ensure_dir, normalize_path, write_output
@@ -52,6 +53,7 @@ def add_options(self) -> None:
         self.cmd_opts.add_option(cmdoptions.no_use_pep517())
         self.cmd_opts.add_option(cmdoptions.check_build_deps())
         self.cmd_opts.add_option(cmdoptions.ignore_requires_python())
+        self.cmd_opts.add_option(cmdoptions.parallel_downloads())
 
         self.cmd_opts.add_option(
             "-d",
@@ -76,6 +78,11 @@ def add_options(self) -> None:
 
     @with_cleanup
     def run(self, options: Values, args: List[str]) -> int:
+        if (options.parallel_downloads is not None) and (
+            options.parallel_downloads < 1
+        ):
+            raise CommandError("Value of '--parallel-downloads' must be greater than 0")
+
         options.ignore_installed = True
         # editable doesn't really make sense for `pip download`, but the bowels
         # of the RequirementSet code require that property.

[Review thread on the `raise CommandError(...)` line]

Could this be more user-friendly: instead of raising CommandError, surface the error to the user but default to 1, even here?

Do you mean raise a warning if the user sets `--parallel-downloads` to a value below 1, and then fall back to the default?
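A rough sketch of the alternative raised in the thread above: warn and fall back to sequential downloads instead of aborting with CommandError. The helper name is hypothetical and this is not part of the PR.

```python
# Hypothetical alternative to the CommandError above; not part of the PR.
import logging
from optparse import Values

logger = logging.getLogger(__name__)


def normalize_parallel_downloads(options: Values) -> None:
    value = getattr(options, "parallel_downloads", None)
    if value is not None and value < 1:
        logger.warning(
            "Ignoring invalid --parallel-downloads=%s; falling back to 1 "
            "(sequential downloads).",
            value,
        )
        options.parallel_downloads = 1
```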
@@ -326,6 +326,7 @@ def __init__(
         trusted_hosts: Sequence[str] = (),
         index_urls: Optional[List[str]] = None,
         ssl_context: Optional["SSLContext"] = None,
+        parallel_downloads: Optional[int] = None,
         **kwargs: Any,
     ) -> None:
         """

@@ -362,12 +363,24 @@ def __init__(
             backoff_factor=0.25,
         )  # type: ignore
 
+        # Used to set the number of parallel downloads in
+        # pip._internal.network.BatchDownloader and to set pool_connections in
+        # the HTTPAdapter, to prevent the connection pool from hitting the
+        # default limit (10) and throwing 'Connection pool is full' warnings.
+        self.parallel_downloads = (
+            parallel_downloads if (parallel_downloads is not None) else 1
+        )
+        pool_maxsize = max(self.parallel_downloads, 10)
+
         # Our Insecure HTTPAdapter disables HTTPS validation. It does not
         # support caching so we'll use it for all http:// URLs.
         # If caching is disabled, we will also use it for
         # https:// hosts that we've marked as ignoring
         # TLS errors for (trusted-hosts).
-        insecure_adapter = InsecureHTTPAdapter(max_retries=retries)
+        insecure_adapter = InsecureHTTPAdapter(
+            max_retries=retries,
+            pool_connections=pool_maxsize,
+            pool_maxsize=pool_maxsize,
+        )
 
         # We want to _only_ cache responses on securely fetched origins or when
         # the host is specified as trusted. We do this because

[Review thread on `pool_maxsize = max(self.parallel_downloads, 10)`]

Is this not a configurable limit? It seems reasonable to allow more than 10 threads for downloading, as each package should be able to download in parallel.

Edit: I see now that we actually do set it below, so I think this is just unnecessary now. Please consider removing it, or changing it to just the parallel_downloads value.

The default pool size is 10; taking the maximum only ensures the pool never shrinks below that default when fewer than 10 parallel downloads are requested.

Thanks Neil, you're right. On a second review that seems a wise decision. I actually misread it as min().

@@ -385,7 +398,12 @@ def __init__(
                 max_retries=retries,
             )
         else:
-            secure_adapter = HTTPAdapter(max_retries=retries, ssl_context=ssl_context)
+            secure_adapter = HTTPAdapter(
+                max_retries=retries,
+                ssl_context=ssl_context,
+                pool_connections=pool_maxsize,
+                pool_maxsize=pool_maxsize,
+            )
         self._trusted_host_adapter = insecure_adapter
 
         self.mount("https://", secure_adapter)
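For readers unfamiliar with requests' connection pooling, a minimal standalone sketch (not pip code) of what the pool_connections and pool_maxsize arguments above control. requests' default pool size is 10, and when more threads than pool_maxsize hit the same host, urllib3 discards the extra connections and logs "Connection pool is full" warnings.

```python
# Standalone sketch (not pip code) of the adapter pool sizing tuned above.
import requests
from requests.adapters import HTTPAdapter

parallel_downloads = 32                     # e.g. value of --parallel-downloads
pool_maxsize = max(parallel_downloads, 10)  # never shrink below requests' default of 10

session = requests.Session()
adapter = HTTPAdapter(
    pool_connections=pool_maxsize,  # number of per-host connection pools to cache
    pool_maxsize=pool_maxsize,      # connections kept per pool; excess connections are
                                    # discarded with "Connection pool is full" warnings
)
session.mount("https://", adapter)
session.mount("http://", adapter)
```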
[Review thread on the default value]

Would a default of the number of cores be acceptable? I think most users would want this by default.

This PR doesn't have a solution for showing progress when downloading in parallel; it simply doesn't show progress. Turning parallel downloads on by default would therefore cause unexpected behaviour for end users, which I would like to avoid.

I do have an open PR for a progress bar that supports parallel downloads (still a work in progress), but even there I don't see a solution for clean, non-breaking parallel progress bars in Jupyter. So I'd prefer to keep sequential download as the default, which gives the expected behaviour, and allow parallel downloads to be enabled explicitly with the expectation of some weirdness (either no progress bar for parallel downloads, or a parallel progress bar that breaks only in Jupyter).

If I'm missing something here and someone knows of a way to have parallel progress bars in Jupyter, please let me know :)
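For context on what the option actually enables, a rough self-contained sketch of thread-pooled downloads. This is not pip's BatchDownloader; download_one and the URL handling are illustrative stand-ins only.

```python
# Illustration only: NOT pip's BatchDownloader. Shows the basic mechanics of
# downloading several files with up to `parallel_downloads` threads.
import os
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def download_one(url: str, dest_dir: str) -> str:
    filename = os.path.join(dest_dir, url.rsplit("/", 1)[-1] or "download")
    urllib.request.urlretrieve(url, filename)
    return filename


def download_all(urls: Iterable[str], dest_dir: str, parallel_downloads: int = 1) -> List[str]:
    # 1 (or an unset option that falls back to 1) keeps today's sequential behaviour.
    if parallel_downloads <= 1:
        return [download_one(url, dest_dir) for url in urls]
    with ThreadPoolExecutor(max_workers=parallel_downloads) as pool:
        return list(pool.map(lambda url: download_one(url, dest_dir), urls))
```

Keeping 1 as the default, as discussed in the thread above, preserves the current sequential behaviour and progress display unless the user explicitly opts in to parallel downloads.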