Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring of benchmarks #133

Merged
merged 36 commits into from
Jul 5, 2024
Merged

Conversation

Alexsandruss
Copy link
Contributor

@Alexsandruss Alexsandruss commented Mar 30, 2023

Benchmarks rework

Entry points: README, Developer Guide

Key features:

  • Benchmarks runner, report generator and separate benchmarks are implemented to run as python modules
  • Benchmarking config specification change
  • Report generator based on handling results as pd.DataFrame's to perform comparison operations and use openpyxl functionality to write dataframes natively

Solved issues

@samir-nasibli
Copy link
Contributor

@Alexsandruss Thank you for your hard work here!

I think this pull request is getting quite large and includes multiple features. It contains too much code to be reviewed. It's useful to have a way to split it into smaller tasks for your reviewers so that they can focus on specific portions of it.
I am pretty sure that we already can merge some of features here into master, if we would have separate PRs for them.

Wouldn't it be better to move an already finished feature to a separate PR? In this case we can quickly and efficiently review and merge it.

README.md Show resolved Hide resolved
@napetrov
Copy link
Contributor

napetrov commented Jun 9, 2023

This CODEOWNERS file contains errors

.github/CODEOWNERS Outdated Show resolved Hide resolved
sklbench/utils/common.py Outdated Show resolved Hide resolved
sklbench/utils/common.py Outdated Show resolved Hide resolved
sklbench/utils/common.py Outdated Show resolved Hide resolved
sklbench/utils/data.py Outdated Show resolved Hide resolved
sklbench/runner.py Outdated Show resolved Hide resolved
sklbench/runner.py Outdated Show resolved Hide resolved
sklbench/utils/data.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@icfaust icfaust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am running Python3.9 and this was required to get it to work (requires changing some unions to an older style)

requirements-bench.txt Outdated Show resolved Hide resolved
@icfaust
Copy link
Collaborator

icfaust commented Jul 12, 2023

So far I am also running into an issue downloading datasets in 3.9 with urllib, specifically with the SSL certs. To get it to download I had to export

SSL_CERT_DIR=/etc/ssl/certs

To get it to properly download. Somehow this wasnt a problem with 3.10 ¯\(ツ)

@icfaust
Copy link
Collaborator

icfaust commented Jul 12, 2023

Also, some of the errors seen in the PR associated with KNN/sklearn in the CI checks have been solved with this PR:
scikit-learn/scikit-learn@f473d7e

But likely a handling of Errors needs to occur to make sure that it will run through even though part of sklearn doesn't

@Alexsandruss
Copy link
Contributor Author

So far I am also running into an issue downloading datasets in 3.9 with urllib, specifically with the SSL certs. To get it to download I had to export

SSL_CERT_DIR=/etc/ssl/certs

To get it to properly download. Somehow this wasnt a problem with 3.10 ¯**(ツ)**/¯

It looks like problem of specific python environment. Python 3.9 from conda-forge works fine.

@Alexsandruss
Copy link
Contributor Author

Also, some of the errors seen in the PR associated with KNN/sklearn in the CI checks have been solved with this PR: scikit-learn/scikit-learn@f473d7e

But likely a handling of Errors needs to occur to make sure that it will run through even though part of sklearn doesn't

There is no external fix for this sklearn version, so previous version is pinned to dependencies from now.

@icfaust
Copy link
Collaborator

icfaust commented Jul 21, 2023

Naive question: could we catch Errors in such a way to allow for reports to be generated even in the case of a failure in a certain test case? Something like how its done in pytest, or would that be out of scope of this PR.

Copy link
Collaborator

@icfaust icfaust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Alexsandruss , just as couple follow ups to my general comment.

sklbench/benchs/sklearn_estimator.py Outdated Show resolved Hide resolved
sklbench/benchs/sklearn_estimator.py Outdated Show resolved Hide resolved
sklbench/benchs/sklearn_estimator.py Outdated Show resolved Hide resolved
sklbench/benchs/sklearn_estimator.py Outdated Show resolved Hide resolved
@Alexsandruss Alexsandruss force-pushed the dev/refactor branch 4 times, most recently from 8e9dde5 to a3776ef Compare June 7, 2024 10:08
- change parameters in sklearn and xgboost example configs
- fix default value of `result-files` argument
- change right border of `n_informative` parameter
Copy link

@ethanglaser ethanglaser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partially reviewed. Overall have some questions/clarifications, minor suggestions, and potential revisions for examples that may not be functional but its functional and not much left to do before merge

sklbench/datasets/README.md Show resolved Hide resolved
envs/conda-env-sklearn.yml Outdated Show resolved Hide resolved
envs/requirements-sklearn.txt Show resolved Hide resolved
sklbench/benchmarks/estimator_task_map.json Show resolved Hide resolved
sklbench/report/README.md Outdated Show resolved Hide resolved
configs/common/sklearn.json Show resolved Hide resolved

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@razdoburdin please review xgboost part

configs/regular/dbscan.json Outdated Show resolved Hide resolved
configs/regular/ensemble.json Outdated Show resolved Hide resolved
configs/spmd/knn.json Show resolved Hide resolved
.github/CODEOWNERS Outdated Show resolved Hide resolved
Copy link
Contributor

@samir-nasibli samir-nasibli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Alexsandruss for the work done.
I am confident that new features and capabilities will only bring benefits to the projects we are working on.

For me, the description does not provide more argumentation and motivation for why changes were made in the whole project and in its individual parts. The changes affected practically every part of the libraries. Thus, I can call this implementation not “refactoring”, but scikit-learn_bench 2.0.

This PR goes beyond the usual contributions to the project. It is just remaking it. Therefore, I looked at it from the point of view of not a change, but rather as a new project that was written from scratch and can be compared with the previous main.

I would suggest:

  1. Taking measurements of the current main branch and this one. And compare what discrepancies there are. If there are no critical cases, then there are no blocks for merging.
  2. I also think it’s necessary to do initial checks using valgrind or other tools to track resources.
  3. Add new code owners, and try to split ownership of the project between several people.
  4. It makes sense to collect comments from the current review, add them to the backlog and add minor changes to the updated repo in small commits/PRs.

Assuming also review of people who already touch this code and contributed in the branch. @icfaust @ethanglaser @Vika-F

Many thanks!

@samir-nasibli samir-nasibli requested a review from icfaust June 20, 2024 09:25
- Add `false` and `true` to CLI parameters parser
- Sync PyPI and conda-forge envs
- Remove tabs from configs
configs/README.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@icfaust icfaust left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think some expansion of the CI checking and clarification of the python versions are necessary before merging. As far as I can tell only 3.10 is tested, though python 3.8+ is supported. I added some suggestions for some english corrections, and have slightly touched the code in some places.

Should the codebase use sklearnex as a submodule, and is it possible to link the dependencies to it to save maintenance time?

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
configs/README.md Outdated Show resolved Hide resolved
test-configuration-linux.yml Outdated Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
envs/requirements-sklearn.txt Outdated Show resolved Hide resolved
envs/conda-env-sklearn.yml Outdated Show resolved Hide resolved
envs/conda-env-rapids.yml Show resolved Hide resolved
sklbench/report/compatibility.py Show resolved Hide resolved
@icfaust
Copy link
Collaborator

icfaust commented Jun 25, 2024

@Alexsandruss After discussions with others, for this PR, just limit everything to 3.10 so as to be able to merge it ASAP Please also change things related to private CI related to the python versions (so some of my previous comments can be disregarded short-term).

README.md Outdated Show resolved Hide resolved
Copy link

@ethanglaser ethanglaser left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving pending a few minor revisions from comments. It's ready enough and can revise issues in follow-ups. Infra branch also looks like it supports (at least on CPU).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
8 participants