docs: Clarify meaning of named requirements #2964

Closed
1 task done
polarathene opened this issue Jun 23, 2024 · 4 comments · Fixed by #2967
Labels
📖 documentation (Improvements or additions to documentation), ❓ help wanted (Extra attention is needed)

Comments

@polarathene

  • I have searched the issue tracker and believe that this is not a duplicate.

Description

The docs for centralized caching have a note that it's only applicable to named requirements, but this is not well defined?:

[screenshot of the centralized cache admonition in the docs]


Searching this repo's issues for "named requirements" returns very few results, and a web search for python "named requirements" didn't seem to help either; many results are about a file named requirements.txt.

Is the intention to refer to explicitly declared dependencies, rather than those that are installed implicitly as a result of resolution?

Just to confirm: when using another source, such as the PyTorch index, those packages do appear to be added to the cache (pdm cache info). Is this documentation note saying they don't qualify as cache-friendly because they aren't sourced from the PyPI index?

When I run pdm install --frozen-lockfile, despite the ~5 GB cache that pdm cache info reports, I still see these dependencies install with Downloading xx%. This happens even when I explicitly add them to the pyproject.toml dependencies list, and even when I prefer those dependencies to be pulled from PyPI. I'm not sure whether this actually refers to a network transfer, or whether something else is happening during this step.
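For what it's worth, one way to check whether the install.cache_method hardlink setting is actually taking effect (independent of the Downloading xx% display) is to inspect link counts in the virtualenv. This is just a diagnostic sketch I put together; the `.venv` path is an assumption about the project layout:

```python
import os
from pathlib import Path

# Hypothetical path; adjust to your project layout.
VENV_DIR = Path(".venv")

def is_hardlinked(path: Path) -> bool:
    """A file installed via hardlink shares its inode with the cache copy,
    so its link count (st_nlink) is greater than 1."""
    return path.stat().st_nlink > 1

def count_hardlinked(root: Path) -> tuple:
    """Return (hardlinked, total) counts for regular files under root."""
    linked = total = 0
    for p in root.rglob("*"):
        if p.is_file() and not p.is_symlink():
            total += 1
            linked += is_hardlinked(p)
    return (linked, total)

if VENV_DIR.exists():
    linked, total = count_hardlinked(VENV_DIR)
    print(f"{linked}/{total} files in {VENV_DIR} are hardlinked")
```

If most files report a link count above 1, the packages were linked from the central cache rather than copied, regardless of what the progress output says.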

polarathene added the 🐛 bug label Jun 23, 2024
@frostming
Collaborator

Can you help with it?

These are named requirements:

foo
foo>=2.1.0
foo[extra]>=2.1.0

While these are not:

file:///some/path/foo-0.1.0.tar.gz
foo @ git+https://github.com/example/foo.git

Named requirements are in contrast to those defined via URLs. Feel free to pick any proper name.
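The distinction above can be sketched as a small classifier. This is only an illustrative heuristic based on the examples given (not PDM's actual implementation, and not a full PEP 508 parser): a named requirement starts with a package name and carries no direct URL, while the non-named forms use a bare URL/path or the `name @ url` direct reference syntax:

```python
import re

# Rough shape of a named requirement: a package name, optional extras,
# optional version specifiers -- and crucially, no URL anywhere.
NAMED_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*([<>=!~].*)?$")

def is_named_requirement(spec: str) -> bool:
    """Illustrative heuristic: True for `foo`, `foo>=2.1.0`, `foo[extra]>=2.1.0`;
    False for direct references (`foo @ <url>`) and bare URLs/paths."""
    spec = spec.strip()
    if "@" in spec or "://" in spec:
        return False  # direct reference or URL/path
    return bool(NAMED_RE.match(spec))
```

Running it over the examples in this comment classifies the first three as named and the last two as not.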

frostming added the 📖 documentation and ❓ help wanted labels and removed the 🐛 bug label Jun 23, 2024
@polarathene
Author

polarathene commented Jun 23, 2024

Can you help with it?

Help in what way? I am just trying to get familiar with PDM and competitors (I have minimal Python experience).

I think the example you gave is more than enough. Just add that within the docs admonition. If you feel it's a bit too verbose, collapse it by default with ??? instead of !!!.
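For reference, since the PDM docs are built with mkdocs-material, the two admonition forms look like this (title and body here are placeholders):

```markdown
!!! note "Named requirements"
    Always-expanded admonition body.

??? note "Named requirements"
    Collapsed-by-default admonition body; the reader clicks the title to expand it.
```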

Your response did not address the query regarding PyPI being mentioned in the admonition. From my reproduction below, it doesn't seem to be restricted to PyPI?

  • The Downloading xx% seems a bit misleading. I'm not sure what is going on there compared to uv with hardlink; perhaps extraction from an archive into .venv or something?
  • Overall disk size doesn't grow by 5 GB though (while .venv shows about that size via hardlink usage), but whatever is happening, it takes notably more time than uv to leverage the cached packages.

My final query regarding implicit packages is still unanswered (e.g. if I remove the nvidia-* packages from the pyproject.toml below).

  • Your response suggests that they may not be cached? (pdm cache info doesn't seem to change much regardless, so I'm doubtful)
  • It doesn't seem to make any notable difference to the install/sync time from what I can see. Perhaps the way I configured pdm.lock to use static URLs is relevant to cache support? It's unclear whether "named requirements" matter there; the torch dep in the lockfile still references packages by name, so I assume it doesn't.

Reproduction

NOTE: Follow these steps if you'd like to reproduce in the same environment I was using and you're familiar with Docker.

  • I used docker run --rm -it --workdir /app fedora:40 bash, which comes with Python 3.12.2.
  • To create the pyproject.toml just use dnf install -y nano + nano pyproject.toml, or volume mount the file locally.

pyproject.toml:

[project]
name = "example"

dependencies = [
    "torch", # Implicitly resolves to `2.3.1+cu121` via configured PyTorch source below
    "torchvision",
    "torchaudio",
    # Implicit packages that should be cached instead of downloaded?
    # The PyTorch source needs to include these via `nvidia-*`, otherwise different versions from PyPi are resolved
    "nvidia-cublas-cu12",
    "nvidia-cuda-cupti-cu12",
    "nvidia-cuda-nvrtc-cu12",
    "nvidia-cuda-runtime-cu12",
    "nvidia-cudnn-cu12",
    "nvidia-cufft-cu12",
    "nvidia-curand-cu12",
    "nvidia-cusolver-cu12",
    "nvidia-cusparse-cu12",
    "nvidia-nccl-cu12",
    "nvidia-nvjitlink-cu12",
    "nvidia-nvtx-cu12"
]
requires-python = ">=3.10"

[tool.pdm.resolution]
respect-source-order = true

[tool.pdm]
distribution = false

[[tool.pdm.source]]
name = "pytorch"
url = "https://download.pytorch.org/whl/cu121"
include_packages = ["torch", "torchvision", "torchaudio", "nvidia-*"]

Shell session:

# Install PDM if necessary:
curl -sSL https://pdm-project.org/install-pdm.py | python3 -

# PDM cache setup:
pdm cache clear
pdm config install.cache true
pdm config install.cache_method hardlink

# Prepare for minimized/optimized lockfile for `Dockerfile` build:
# NOTE: As a reference to my other timings for install, this takes approx 2 minutes to complete
pdm lock -S no_cross_platform,static_urls

# Install and cache
time pdm install --frozen-lockfile

# Clear the `.venv` to install again, this time with cache:
rm -rf .venv
pdm install --frozen-lockfile

# 1st vs 2nd (cached) install times:
real    1m41.960s
real    1m15.564s

# Cache size:
$ pdm cache info

Cache Root: /root/.cache/pdm, Total size: 4730.6 MB
  File Hash Cache: /root/.cache/pdm/hashes
    Files: 36, Size: 2.6 kB
  HTTP Cache: /root/.cache/pdm/http
    Files: 32, Size: 32.8 MB
  Wheels Cache: /root/.cache/pdm/wheels
    Files: 0, Size: 0 bytes
  Metadata Cache: /root/.cache/pdm/metadata
    Files: 1, Size: 3.6 kB
  Package Cache: /root/.cache/pdm/packages
    Packages: 25, Size: 4697.8 MB

If not using the nvidia-* packages from the PyTorch index, it looks like this (larger cache; faster install, but roughly the same saving from the cache):

$ pdm install --frozen-lockfile
# 1st vs 2nd (cached) install times:
real    1m14.046s
real    0m47.405s

# Cache size:
$ pdm cache info

Cache Root: /root/.cache/pdm, Total size: 6662.2 MB
  File Hash Cache: /root/.cache/pdm/hashes
    Files: 48, Size: 3.4 kB
  HTTP Cache: /root/.cache/pdm/http
    Files: 68, Size: 1909.7 MB
  Wheels Cache: /root/.cache/pdm/wheels
    Files: 0, Size: 0 bytes
  Metadata Cache: /root/.cache/pdm/metadata
    Files: 1, Size: 4.7 kB
  Package Cache: /root/.cache/pdm/packages
    Packages: 26, Size: 4752.4 MB

This is in contrast to uv (without any lockfile to speed up the resolution process), which defaults to a central cache with hardlinks in this reproduction environment:

# Clean slate, empty directory:
$ uv cache clean

# 1st time install:
$ uv venv
$ time uv pip install torch==2.3.1+cu121 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
real    1m14.939s

# Similar cache size:
$ du -sx --bytes --si "$(uv cache dir)"
4.7G    /root/.cache/uv

# 2nd install (cached):
$ rm -rf .venv
$ uv venv
$ time uv pip install torch==2.3.1+cu121 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
real    0m3.518s
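Conceptually, the hardlink install method both tools offer boils down to linking each cached file into the venv and falling back to a copy when the cache lives on a different filesystem. A minimal sketch of that idea (illustrative only; not either tool's actual code, and all names here are hypothetical):

```python
import os
import shutil
from pathlib import Path

def link_or_copy(src: Path, dst: Path) -> str:
    """Hardlink a cached file into the target; fall back to copying when
    the cache and target are on different filesystems (EXDEV)."""
    dst.parent.mkdir(parents=True, exist_ok=True)
    try:
        os.link(src, dst)
        return "hardlink"
    except OSError:  # e.g. cross-device link not permitted
        shutil.copy2(src, dst)
        return "copy"

def install_from_cache(cache_pkg: Path, site_packages: Path) -> int:
    """Materialize every file of an unpacked cached package into the venv.
    Returns the number of files installed."""
    n = 0
    for f in cache_pkg.rglob("*"):
        if f.is_file():
            link_or_copy(f, site_packages / f.relative_to(cache_pkg))
            n += 1
    return n
```

This is why the second uv install above takes seconds and barely grows disk usage: hardlinked files share the cache's data blocks, so "installing" is mostly metadata work.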

Future integration with uv

I'm aware that you have plans to adopt uv to some extent, so that may affect the behaviour comparison I'm documenting above with PDM 2.15.4.

The PDM source support for include_packages is rather interesting compared to what I've seen with hatch and uv, which I don't think are as flexible: they require the explicit local version identifier suffix (cu121) to resolve correctly, that version pinning is less flexible, and neither presently has an equivalent of pdm update --update-all -u as a workaround for getting the latest patch version. I'm not sure how compatible that will be once you integrate with uv.

@frostming
Collaborator

frostming commented Jun 23, 2024

Help in what way? I am just trying to get familiar with PDM and competitors (I have minimal Python experience).

Help with the docs

Your response did not address the query regarding PyPI being mentioned in the admonition. From my reproduction below, it doesn't seem to be restricted to PyPI?

PyPI doesn't mean exactly pypi.org; it is a synonym for all package sources. Again, it is opposed to dependencies with direct URLs.

so I'm not sure if it's actually referring to a network transfer here, or something else is actually happening during this step?

No network transfer; the cache layer is built inside the HTTP session, so we display Downloading... anyway and the session decides internally whether to use the cache.

In contrast to uv, we still consult the index (in this case, the pytorch source), download the wheel (using the HTTP cache if available) and link from the local package cache for installation.

uv, meanwhile, skips the HTTP layer entirely, as it assumes the local copy MUST be identical.
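To illustrate why Downloading xx% appears even on a cache hit, here is a toy sketch of a cache layer living inside the HTTP session. It is not PDM's actual code; the point is only that the caller above the session cannot distinguish a cache hit from a real transfer, so any progress reporting it does is unconditional:

```python
class CachingSession:
    """Toy HTTP session with an internal cache (a dict standing in for
    the on-disk HTTP cache). The progress reporter sits ABOVE this layer,
    so it prints "Downloading xx%" whether or not the network is used."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # the real network fetch callable
        self._cache = {}                   # url -> response bytes

    def get(self, url: str) -> bytes:
        if url in self._cache:
            return self._cache[url]  # cache hit: no network, caller can't tell
        data = self._fetch_remote(url)
        self._cache[url] = data
        return data
```

With a real session this would also involve cache validation (ETags, revalidation requests), which is one reason a "cached" install still spends time talking to the index.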

@polarathene
Author

Help with the docs

I can contribute your feedback here to the docs admonition, sure 👍


In contrast to uv, we still consult the index (in this case, the pytorch source), download the wheel (using the HTTP cache if available) and link from the local package cache for installation.

uv, meanwhile, skips the HTTP layer entirely, as it assumes the local copy MUST be identical.

I'm assuming PDM provides no way to get that equivalent uv behaviour?

Is the lockfile or cache lacking sufficient information to know that it should be able to use the cache?

  • An extra minute spent on redundant activity seems unnecessary for a sync/install when I have the cache + pdm.lock + --frozen-lockfile. In other languages, I think the lockfile stores a content hash that can be used to match a cache entry.
  • If the lockfile isn't being refreshed (as --frozen-lockfile implies), I'm not sure why you'd need to consult the index and download something the cache should already have (assuming a cache hit). From the sound of it, you're relying on an HTTP cache mechanism that doesn't seem to help much in this scenario (25 seconds less than an uncached install, but still over a minute).
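The content-hash idea mentioned above can be sketched simply: the lockfile pins a sha256 per artifact, so an installer could answer "do I already have this exact wheel?" from the local cache alone, without consulting the index. All names here are hypothetical (this is not PDM's cache layout or API):

```python
import hashlib
from pathlib import Path
from typing import Optional

def cache_path_for(cache_root: Path, sha256: str) -> Path:
    # Shard by hash prefix to keep directories small (a common layout).
    return cache_root / sha256[:2] / sha256

def store(cache_root: Path, data: bytes) -> str:
    """Write artifact bytes into the cache, keyed by their sha256 digest."""
    digest = hashlib.sha256(data).hexdigest()
    p = cache_path_for(cache_root, digest)
    p.parent.mkdir(parents=True, exist_ok=True)
    p.write_bytes(data)
    return digest

def lookup(cache_root: Path, sha256: str) -> Optional[bytes]:
    """Return cached artifact bytes if present and the hash still verifies;
    None means the installer must fall back to downloading."""
    p = cache_path_for(cache_root, sha256)
    if not p.is_file():
        return None
    data = p.read_bytes()
    if hashlib.sha256(data).hexdigest() != sha256:
        return None  # corrupted entry; treat as a miss
    return data
```

Under this model, a frozen lockfile plus a warm cache would need no index round-trips at all, which matches the uv timings above.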

PyPI doesn't mean exactly pypi.org; it is a synonym for all package sources. Again, it is opposed to dependencies with direct URLs.

I am familiar with these as package indexes/sources, and PDM has an index type for sources with an implicit pypi default.

If it's not incorrect, referring to them more agnostically as "resolved from a package index (e.g. PyPI)" instead of "resolved from PyPI" would communicate the intent more clearly (along with the examples you provided for added context).

The expectation that PDM's cache doesn't provide the same performance benefit should probably also be documented, to quell any potential confusion (a few people seem to have reported this concern previously). Unless the uv feature work will resolve the issue?
