Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exclude unnecssary data from pvlib packaging #2271

Open
AdamRJensen opened this issue Oct 20, 2024 · 14 comments · May be fixed by #2277
Open

Exclude unnecssary data from pvlib packaging #2271

AdamRJensen opened this issue Oct 20, 2024 · 14 comments · May be fixed by #2277

Comments

@AdamRJensen
Copy link
Member

AdamRJensen commented Oct 20, 2024

Most of the data in the data folder are only used for testing and not useful for any examples. However, they add up a considerable amount of space (MBs) relative to the pvlib package itself. For applications such as WASM (code running in the browser), all this clutter slows down loading significantly. The data folder is currently 57 MB out of a total of 63 MB.

Of course, there are data files that we want to keep such as the Linke turbidity and CEC database. However, a rough estimation shows that removing unused data files can reduce the pvlib package but half!

Excluding/including data files can be specified in the pyproject.toml file, for example:

[tool.setuptools.packages.find]
where = ["src"]  # list of folders that contain the packages (["."] by default)
include = ["my_package*"]  # package names should match these glob patterns (["*"] by default)
exclude = ["my_package.tests*"]  # exclude packages matching these glob patterns (empty by default)

This can also be done in the MANIFEST.in file, which is pvlib's current approach, and seems to have finer control.

The pvlib/tests folder can also be excluded, and save 2.5 MB.

@cwhanse
Copy link
Member

cwhanse commented Oct 20, 2024

#1056 this has been needed for some time.

@adriesse
Copy link
Member

Good initiative. Is it common to exclude the tests from a package?

@kandersolar
Copy link
Member

kandersolar commented Oct 20, 2024

scipy and pandas both include their tests in their distribution files (although some pandas maintainers have agreed that it would be better to exclude them, pandas-dev/pandas#30741).

I see little value in including the tests in the wheels. If we exclude most data files, then the test functions become even less useful to include. It may make sense to continue including the tests (and data files?) in the sdist distributions, I am not sure.

@adriesse
Copy link
Member

Few more thoughts:

  • @AdamRJensen is it really the volume in MB that is slowing down WASM? 63 MB isn't much nowadays.
  • Some (many?) files could be reduced in size by truncation (e.g. trailing zeros) and/or compression (e.g., zip)
  • Perhaps some files could serve their purpose with fewer records.

@echedey-ls
Copy link
Contributor

I've finished classifying the files in #1056's spreadsheet and the result shows the pvlib/data folder would reduce to 24MB (41% of the original) if only necessary files for the API code are shipped in that folder.

Although I understand facilitating the tests can be a source of easing and validating the working environments, it is a driving bottleneck in the installation process. I would just remove them from wheels due to that.

Some (many?) files could be reduced in size by truncation (e.g. trailing zeros) and/or compression (e.g., zip)

That would add extra time at running tests and adding/maintaining them. Last alternative does not sound too bad, but it's not a small effort considering the amount of files used in the tests (58) and that an unknown amount of expected results are hardcoded, I can't quantify how many.

@cwhanse
Copy link
Member

cwhanse commented Oct 21, 2024

If we remove data files for tests from the distribution, does that mean that pytest test_xxxx.py will return a bunch of errors unrelated to the contributor's new tests?

@echedey-ls
Copy link
Contributor

If we remove data files for tests from the distribution, does that mean that pytest test_xxxx.py will return a bunch of errors unrelated to the contributor's new tests?

No since pytest.yml workflow already checks out the repo, so all files are available at run time, it's just a matter of telling the distribution builder whether to include them or not before posting it to PyPI.

https://github.com/pvlib/pvlib-python/blob/c52e600b996733a4a3371623a3784ccda5667e68/.github/workflows/pytest.yml#L33C1-L36C27

I just wonder if the conda recipe will need to be changed, IDK how conda-forge recipes work in the background:

https://github.com/conda-forge/pvlib-python-feedstock/blob/f37d40d81becdb025578d0e09ca44ebe3cbeb466/recipe/meta.yaml#L60C1-L62C16

@cwhanse
Copy link
Member

cwhanse commented Oct 22, 2024

If we remove data files for tests from the distribution, does that mean that pytest test_xxxx.py will return a bunch of errors unrelated to the contributor's new tests?

No since pytest.yml workflow already checks out the repo, so all files are available at run time, it's just a matter of telling the distribution builder whether to include them or not before posting it to PyPI.

But wouldn't the errors raise if the test is run locally instead of using a GH action via a pull request?

@echedey-ls
Copy link
Contributor

echedey-ls commented Oct 22, 2024

Ah I see, you refer to that if I install pvlib (non editable mode, either from the cloned repo or PyPI), then go to the package folder in Python's site-packages/pvlib/tests and run that. Yeah, you are right, it would fail. For that [edge-]case Adam's statement:

The pvlib/tests folder can also be excluded, and save 2.5 MB.

would solve it by not allowing any tests to be run from the distribution installation. I personally support that.

Edit: but still on the cloned repo everything should work (it's essentially the same as the workflow), so a contributor shouldn't find any problem running them.

@cwhanse
Copy link
Member

cwhanse commented Oct 22, 2024

I understand my own question better. Excluding the test files from the distribution wouldn't affect running tests locally on a fork, since the fork clones the repository rather than installs from the distribution. I don't have a concern here.

@AdamRJensen
Copy link
Member Author

AdamRJensen commented Oct 22, 2024

  • @AdamRJensen is it really the volume in MB that is slowing down WASM? 63 MB isn't much nowadays.

I'd like to develop widgets that can run using WASM, and these need to be relatively fast so as not to annoy the user with a wait time. As long as there is no downsides to the package, I think it's preferred to have it as small as possible.

  • Some (many?) files could be reduced in size by truncation (e.g. trailing zeros) and/or compression (e.g., zip)

I don't think the required files have significant trailing zeros (at least not the files I inspected). The big culprit is the LinkeTurbidities.h5 file(15.3 MB), so if we were to do something, it should start with this file.

  • Perhaps some files could serve their purpose with fewer records.

This predominantly applies to test files, which can just be excluded.

@echedey-ls echedey-ls linked a pull request Oct 22, 2024 that will close this issue
5 tasks
@adriesse
Copy link
Member

adriesse commented Oct 23, 2024

My strong impression is that the suggestion to remove all tests is being promoted and implemented much too rapidly. Why the furious flurry of activity?

@kandersolar
Copy link
Member

Order of operations here should be something like this:

  1. Merge Show distribution file sizes in publish.yml GH Actions workflow #2278, and update [MAINT]: Remove tests from wheel distro #2277 accordingly, so the CI can check that the "official" wheel file sizes are reduced as expected and that the resulting wheel is still installable
  2. Manual inspection/verification of the resulting wheel & sdist, similar to what we have done in other PRs that change our packaging (Support PEP621 #1910, Remove setup.cfg in favor of pyproject.toml twoaxistracking#58)
  3. If automated and manual checks look good, merge [MAINT]: Remove tests from wheel distro #2277
  4. Sometime before the next release (mid December), make a pre-release to verify that the new wheels don't run into any issues downstream of us with PyPI/conda.

the suggestion to remove all tests

Just to clarify for anyone reading this issue: the proposed changes to pvlib's tests is to exclude them only from distribution bundles, i.e. the python files delivered when running pip install pvlib. Tests will not be removed from the repository itself, and they will continue to be a central aspect of pvlib's development.

@kandersolar
Copy link
Member

By the way, a fun estimate: reducing wheel size by 1/3 (as suggested by #2277), multiplied by 14k pvlib downloads per day, eliminates >100GB/day data transfer from PyPI. Maybe not world-changing, but if there is no downside...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants