Refactoring of benchmarks #133

Alexsandruss · 2023-03-30T10:52:26Z

Benchmarks rework

Entry points: README, Developer Guide

Key features:

Benchmarks runner, report generator and separate benchmarks are implemented to run as Python modules
Benchmarking config specification change
Report generator that handles results as pd.DataFrames to perform comparison operations and uses openpyxl to write the dataframes natively

Solved issues

reporting format of benchmarks #12 (better reporting format with HW info and speedup output)
Benchmarks silently execute stock version if scikit-learn-intelex is not installed #75 (ensures sklearnex is used and outputs warning if not)
HistGradientBoostingEstimator #82 (implements support for all classes with sklearn estimator API)
lot of memory allocations becomes bottleneck #120 (allows prefetching of datasets in parallel before running of benchmarks)
Add support for single row inference cases #131 (implements single row inference mode)
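
The check described for #75 could look roughly like the following. This is a minimal sketch and not the code from this PR: the helper name `sklearnex_available` and the warning text are assumptions, and only the `sklearnex` module name comes from the discussion.

```python
import importlib.util
import warnings


def sklearnex_available() -> bool:
    """Return True if the sklearnex package can be found (hypothetical helper)."""
    found = importlib.util.find_spec("sklearnex") is not None
    if not found:
        # Warn loudly instead of silently benchmarking stock scikit-learn.
        warnings.warn(
            "scikit-learn-intelex (sklearnex) is not installed; "
            "benchmarks will measure stock scikit-learn.",
            RuntimeWarning,
        )
    return found
```

A runner could call this once at startup and decide whether to continue or abort.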

samir-nasibli · 2023-05-19T01:22:17Z

@Alexsandruss Thank you for your hard work here!

I think this pull request is getting quite large and includes multiple features. It contains too much code to be reviewed. It's useful to have a way to split it into smaller tasks for your reviewers so that they can focus on specific portions of it.
I am pretty sure that we could already merge some of the features here into master if we had separate PRs for them.

Wouldn't it be better to move an already finished feature to a separate PR? In this case we can quickly and efficiently review and merge it.

README.md

napetrov · 2023-06-09T14:39:12Z

This CODEOWNERS file contains errors

.github/CODEOWNERS

sklbench/utils/common.py

sklbench/utils/data.py

sklbench/runner.py

sklbench/utils/data.py

icfaust

I am running Python3.9 and this was required to get it to work (requires changing some unions to an older style)
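
For context, the `X | Y` union syntax in annotations (PEP 604) only evaluates at runtime on Python 3.10+, so code that must run on 3.9 usually falls back to `typing.Union` and `typing.Optional`. A minimal sketch of the older style follows; the function `load_limit` is hypothetical and is not taken from this PR.

```python
from typing import Optional, Union


# Python 3.10+ style would be:
#   def load_limit(value: int | str | None) -> int | None
def load_limit(value: Union[int, str, None]) -> Optional[int]:
    """Convert a config value to int, passing None through (hypothetical helper)."""
    if value is None:
        return None
    return int(value)
```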

requirements-bench.txt

icfaust · 2023-07-12T11:01:28Z

So far I am also running into an issue downloading datasets in 3.9 with urllib, specifically with the SSL certs. To get it to download properly I had to export

SSL_CERT_DIR=/etc/ssl/certs

Somehow this wasn't a problem with 3.10 ¯\_(ツ)_/¯
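
An alternative to exporting the variable is to point the SSL context at the certificate directory from Python. This is only a sketch under assumptions: the helper name `make_ssl_context` is hypothetical, and whether the benchmark downloader accepts a custom context is not confirmed by this thread.

```python
import os
import ssl


def make_ssl_context(cert_dir: str = "/etc/ssl/certs") -> ssl.SSLContext:
    """Build a verifying SSL context, adding cert_dir when it exists (hypothetical helper)."""
    context = ssl.create_default_context()
    if os.path.isdir(cert_dir):
        # Add the directory of hashed CA certificates to the trust store.
        context.load_verify_locations(capath=cert_dir)
    return context
```

The resulting context can be passed as `urllib.request.urlopen(url, context=make_ssl_context())`.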

icfaust · 2023-07-12T11:43:00Z

Also, some of the errors seen in the PR associated with KNN/sklearn in the CI checks have been solved with this PR:
scikit-learn/scikit-learn@f473d7e

But error handling is probably needed so that the run completes even when part of sklearn fails.

Alexsandruss · 2023-07-12T15:52:28Z

> So far I am also running into an issue downloading datasets in 3.9 with urllib, specifically with the SSL certs. To get it to download properly I had to export
>
> SSL_CERT_DIR=/etc/ssl/certs
>
> Somehow this wasn't a problem with 3.10 ¯\_(ツ)_/¯

It looks like a problem with a specific Python environment. Python 3.9 from conda-forge works fine.

Alexsandruss · 2023-07-12T15:56:26Z

> Also, some of the errors seen in the PR associated with KNN/sklearn in the CI checks have been solved with this PR: scikit-learn/scikit-learn@f473d7e
>
> But error handling is probably needed so that the run completes even when part of sklearn fails.

There is no external fix for this sklearn version, so the previous version is pinned in the dependencies from now on.

icfaust · 2023-07-21T15:27:22Z

Naive question: could we catch errors in such a way that reports can still be generated when a certain test case fails? Something like how it's done in pytest, or would that be out of scope for this PR?
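
One way to picture this is a loop that records a failure per case instead of aborting the whole run. The sketch below is illustrative only: `run_cases` and the result dictionary layout are hypothetical and do not reflect the runner in this PR.

```python
import traceback
from typing import Any, Callable, Dict, List


def run_cases(cases: Dict[str, Callable[[], Any]]) -> List[Dict[str, Any]]:
    """Run every case, recording failures instead of aborting (hypothetical helper)."""
    results = []
    for name, case in cases.items():
        try:
            results.append({"case": name, "status": "ok", "value": case()})
        except Exception:
            # Keep going so that a report can still be built from the other cases.
            results.append(
                {"case": name, "status": "failed", "error": traceback.format_exc()}
            )
    return results
```

A report generator could then list failed cases alongside the successful measurements, much like a pytest summary.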

icfaust

Hey @Alexsandruss, just a couple of follow-ups to my general comment.

sklbench/benchs/sklearn_estimator.py

sklbench/runner/implementation.py

- change parameters in sklearn and xgboost example configs
- fix default value of `result-files` argument
- change right border of `n_informative` parameter

ethanglaser

Partially reviewed. Overall I have some questions/clarifications, minor suggestions, and potential revisions for examples that may not be functional, but overall it's functional and there is not much left to do before merge.

sklbench/datasets/README.md

envs/conda-env-sklearn.yml

envs/requirements-sklearn.txt

sklbench/benchmarks/estimator_task_map.json

sklbench/report/README.md

configs/common/sklearn.json

ethanglaser · 2024-06-13T21:29:55Z

configs/common/xgboost.json

@razdoburdin please review xgboost part

configs/regular/dbscan.json

configs/regular/ensemble.json

configs/spmd/knn.json

.github/CODEOWNERS

samir-nasibli

Thank you @Alexsandruss for the work done.
I am confident that new features and capabilities will only bring benefits to the projects we are working on.

For me, the description does not provide enough argumentation and motivation for why changes were made in the whole project and in its individual parts. The changes affected practically every part of the libraries. Thus, I would call this implementation not “refactoring”, but scikit-learn_bench 2.0.

This PR goes beyond the usual contributions to the project. It effectively remakes it. Therefore, I looked at it not as a change, but rather as a new project that was written from scratch and can be compared with the previous main.

I would suggest:

- Take measurements of the current main branch and this one, and compare what discrepancies there are. If there are no critical cases, then there are no blockers for merging.
- I also think it’s necessary to do initial checks using valgrind or other tools to track resources.
- Add new code owners, and try to split ownership of the project between several people.
- It makes sense to collect comments from the current review, add them to the backlog, and make minor changes to the updated repo in small commits/PRs.

I am also assuming a review from the people who have already touched this code and contributed to the branch: @icfaust @ethanglaser @Vika-F

Many thanks!

- Add `false` and `true` to CLI parameters parser
- Sync PyPI and conda-forge envs
- Remove tabs from configs

configs/README.md

icfaust

I think some expansion of the CI checking and clarification of the Python versions are necessary before merging. As far as I can tell, only 3.10 is tested, though Python 3.8+ is supported. I added some suggestions for English corrections and have slightly touched the code in some places.

Should the codebase use sklearnex as a submodule, and is it possible to link the dependencies to it to save maintenance time?

README.md

configs/README.md

test-configuration-linux.yml

pyproject.toml

envs/requirements-sklearn.txt

envs/conda-env-sklearn.yml

envs/conda-env-rapids.yml

sklbench/report/compatibility.py

icfaust · 2024-06-25T04:58:32Z

@Alexsandruss After discussions with others, for this PR, just limit everything to 3.10 so that it can be merged ASAP. Please also change the things in the private CI that relate to the Python versions (so some of my previous comments can be disregarded short-term).

.github/CODEOWNERS

envs/conda-env-sklearn.yml

README.md

ethanglaser

Approving pending a few minor revisions from comments. It's ready enough, and remaining issues can be revised in follow-ups. Infra branch also looks like it supports this (at least on CPU).

Alexsandruss mentioned this pull request Apr 27, 2023

Make use of "--device(s)" for XGBoost #102

Closed

This was referenced May 16, 2023

xgboost benchmark datasets missing #117

Open

lot of memory allocations becomes bottleneck #120

Closed

HistGradientBoostingEstimator #82

Closed

This was referenced May 17, 2023

adding parameters for device context and patching of Scikit-Learn #23

Closed

XPU cases for sklearnex DF #119

Closed

napetrov reviewed Jun 8, 2023

README.md

samir-nasibli reviewed Jun 19, 2023

.github/CODEOWNERS