Refactoring of benchmarks #133

Merged 36 commits on Jul 5, 2024
Changes from 34 commits
Commits
23a7df5 Refactor draft (Alexsandruss, Dec 7, 2023)
f865330 Fix dev.guide link (Alexsandruss, Dec 7, 2023)
94a2add Configs update and minor fixes (Alexsandruss, Dec 10, 2023)
6831628 Copyright and doc fixes (Alexsandruss, Jan 26, 2024)
2cb6fa1 Add argument aliases (Alexsandruss, Jan 26, 2024)
5299104 Update configs and docs with corresponding code changes (Alexsandruss, Feb 14, 2024)
6a73712 Change INCLUDE directive in config spec (Alexsandruss, Feb 20, 2024)
ceb813e Basic daal4py modelbuilders support (Alexsandruss, Feb 20, 2024)
2a1e134 Correction of configs (Alexsandruss, Feb 20, 2024)
610dc8e Add basic sklearn-like emulation of approx. kNN (Alexsandruss, Feb 29, 2024)
80ce836 Change configs structure (add common sets); (Alexsandruss, Mar 8, 2024)
11d4a56 Linting (Alexsandruss, Mar 8, 2024)
6dd91ff Update online computation mode (Alexsandruss, Mar 12, 2024)
780a141 Update for ANN emulators (Alexsandruss, Mar 14, 2024)
c44943c Update xgboost configs; (Alexsandruss, Mar 15, 2024)
917cc32 Remove mutex from envs (Alexsandruss, Mar 15, 2024)
509dbba Add modin format; fix for faiss ivf_pq compatibility (Alexsandruss, Mar 15, 2024)
37b21d3 Add modin support; fixes for ANN emulators (Alexsandruss, Mar 15, 2024)
22ce12e Add SVS NearestNeighbors emulator (Alexsandruss, Mar 16, 2024)
1c6bd66 Update CI and minor code rework (Alexsandruss, Apr 22, 2024)
e318f64 Add dpnp and dpctl support (Alexsandruss, Apr 22, 2024)
6c8a08b Intermediate changes: apply comments, bug fixes (Alexsandruss, May 21, 2024)
836ffcc Pin CI Python version to 3.10 (Alexsandruss, May 21, 2024)
8453243 CI command fix and doc links fix (Alexsandruss, May 22, 2024)
06ae1ab Shell usage fix (Alexsandruss, May 22, 2024)
9d62d5d SPMD support (Alexsandruss, May 30, 2024)
765a5c3 CI fixes (Alexsandruss, May 30, 2024)
6798d01 Update sklbench args info (Alexsandruss, Jun 5, 2024)
1510e96 Conda envs and CI conf update (Alexsandruss, Jun 7, 2024)
ef0b7c5 Example configs update and fixes: (Alexsandruss, Jun 13, 2024)
831df21 Fixes and comments applying: (Alexsandruss, Jun 20, 2024)
8359dea Fix doctree link and add missing config warning (Alexsandruss, Jun 20, 2024)
96b7e15 CI matrix update and doc fixes (Alexsandruss, Jul 3, 2024)
dc00f5f Docs, configs and codeowners changes (Alexsandruss, Jul 3, 2024)
f2fd91e Update codeowners and doc fix (Alexsandruss, Jul 4, 2024)
7d144bc Add examples run to CI (Alexsandruss, Jul 5, 2024)
16 changes: 7 additions & 9 deletions .github/CODEOWNERS
@@ -1,9 +1,7 @@
-#owners and reviewers
-cuml_bench/* @Alexsandruss
-daal4py_bench/* @Alexsandruss @samir-nasibli
-datasets/* @Alexsandruss
-modelbuilders_bench/* @Alexsandruss
-report_generator/* @Alexsandruss
-sklearn_bench/* @Alexsandruss @samir-nasibli
-xgboost_bench/* @Alexsandruss
-*.md @Alexsandruss @maria-Petrova
+# owners and reviewers
+configs @Alexsandruss
+configs/spmd* @Alexsandruss @ethanglaser
+sklbench @Alexsandruss
+*.md @Alexsandruss @samir-nasibli
+requirements*.txt @Alexsandruss
+conda-env-*.yml @Alexsandruss
16 changes: 8 additions & 8 deletions .gitignore
@@ -1,18 +1,18 @@
# Logs
*.log

# Release and work directories
__pycache__*
__work*

# Visual Studio related files, e.g., ".vscode"
.vs*

# Datasets
data
# Dataset files
data_cache
*.csv
*.npy
*.npz

# Results
results*.json
*.xlsx
# Results at repo root
vtune_results
/*.json
/*.xlsx
/*.ipynb
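
The new result rules start with `/`, which anchors them to the repository root, so JSON files in subdirectories such as `configs/` stay tracked. The following is a minimal sketch of that behavior, not Git's actual matcher: `is_ignored` is a hypothetical helper that models only root-anchored patterns and slash-free patterns, and ignores negation, `**`, and directory rules.

```python
from fnmatch import fnmatch
from posixpath import basename


def is_ignored(path, patterns):
    """Simplified gitignore check for repo-relative POSIX paths (illustrative only)."""
    for pattern in patterns:
        if pattern.startswith("/"):
            # Anchored pattern: the path must sit directly at the repo root.
            if "/" not in path and fnmatch(path, pattern[1:]):
                return True
        elif fnmatch(basename(path), pattern):
            # Slash-free pattern: compare against the file name at any depth.
            return True
    return False


new_rules = ["/*.json", "/*.xlsx", "/*.ipynb"]
print(is_ignored("result.json", new_rules))                   # True
print(is_ignored("configs/sklearn_example.json", new_rules))  # False
```

For authoritative answers on a real checkout, `git check-ignore -v <path>` reports which rule, if any, matches a path.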
27 changes: 27 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,27 @@
#===============================================================================
# Copyright 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#===============================================================================

repos:
  - repo: https://github.com/psf/black
    rev: 23.7.0
    hooks:
      - id: black
        language_version: python3.10
  - repo: https://github.com/PyCQA/isort
    rev: 5.12.0
    hooks:
      - id: isort
        language_version: python3.10
174 changes: 66 additions & 108 deletions README.md
@@ -1,147 +1,105 @@

# Machine Learning Benchmarks <!-- omit in toc -->
# Machine Learning Benchmarks

[![Build Status](https://dev.azure.com/daal/scikit-learn_bench/_apis/build/status/IntelPython.scikit-learn_bench?branchName=main)](https://dev.azure.com/daal/scikit-learn_bench/_build/latest?definitionId=8&branchName=main)

**Machine Learning Benchmarks** contains implementations of machine learning algorithms
across data analytics frameworks. Scikit-learn_bench can be extended to add new frameworks
and algorithms. It currently supports the [scikit-learn](https://scikit-learn.org/),
[DAAL4PY](https://intelpython.github.io/daal4py/), [cuML](https://github.com/rapidsai/cuml),
and [XGBoost](https://github.com/dmlc/xgboost) frameworks for commonly used
[machine learning algorithms](#supported-algorithms).

## Follow us on Medium <!-- omit in toc -->

We publish blogs on Medium, so [follow us](https://medium.com/intel-analytics-software/tagged/machine-learning) to learn tips and tricks for more efficient data analysis. Here are our latest blogs:
**Scikit-learn_bench** is a benchmark tool for libraries and frameworks implementing Scikit-learn-like APIs and other workloads.

- [Save Time and Money with Intel Extension for Scikit-learn](https://medium.com/intel-analytics-software/save-time-and-money-with-intel-extension-for-scikit-learn-33627425ae4)
- [Superior Machine Learning Performance on the Latest Intel Xeon Scalable Processors](https://medium.com/intel-analytics-software/superior-machine-learning-performance-on-the-latest-intel-xeon-scalable-processor-efdec279f5a3)
- [Leverage Intel Optimizations in Scikit-Learn](https://medium.com/intel-analytics-software/leverage-intel-optimizations-in-scikit-learn-f562cb9d5544)
- [Optimizing CatBoost Performance](https://medium.com/intel-analytics-software/optimizing-catboost-performance-4f73f0593071)
- [Intel Gives Scikit-Learn the Performance Boost Data Scientists Need](https://medium.com/intel-analytics-software/intel-gives-scikit-learn-the-performance-boost-data-scientists-need-42eb47c80b18)
- [From Hours to Minutes: 600x Faster SVM](https://medium.com/intel-analytics-software/from-hours-to-minutes-600x-faster-svm-647f904c31ae)
- [Improve the Performance of XGBoost and LightGBM Inference](https://medium.com/intel-analytics-software/improving-the-performance-of-xgboost-and-lightgbm-inference-3b542c03447e)
- [Accelerate Kaggle Challenges Using Intel AI Analytics Toolkit](https://medium.com/intel-analytics-software/accelerate-kaggle-challenges-using-intel-ai-analytics-toolkit-beb148f66d5a)
- [Accelerate Your scikit-learn Applications](https://medium.com/intel-analytics-software/improving-the-performance-of-xgboost-and-lightgbm-inference-3b542c03447e)
- [Optimizing XGBoost Training Performance](https://medium.com/intel-analytics-software/new-optimizations-for-cpu-in-xgboost-1-1-81144ea21115)
- [Accelerate Linear Models for Machine Learning](https://medium.com/intel-analytics-software/accelerating-linear-models-for-machine-learning-5a75ff50a0fe)
- [Accelerate K-Means Clustering](https://medium.com/intel-analytics-software/accelerate-k-means-clustering-6385088788a1)
- [Fast Gradient Boosting Tree Inference](https://medium.com/intel-analytics-software/fast-gradient-boosting-tree-inference-for-intel-xeon-processors-35756f174f55)
Benefits:
- Full control of benchmarks suite through CLI
- Flexible and powerful benchmark config structure
- Available with advanced profiling tools, such as Intel(R) VTune* Profiler
- Automated benchmarks report generation

## Table of content <!-- omit in toc -->
### 📜 Table of Contents

- [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking)
- [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script)
- [Benchmark supported algorithms](#benchmark-supported-algorithms)
- [Scikit-learn benchmakrs](#scikit-learn-benchmakrs)
- [Algorithm parameters](#algorithm-parameters)
- [Machine Learning Benchmarks](#machine-learning-benchmarks)
- [🔧 Create a Python Environment](#-create-a-python-environment)
- [🚀 How To Use Scikit-learn\_bench](#-how-to-use-scikit-learn_bench)
- [Benchmarks Runner](#benchmarks-runner)
- [Report Generator](#report-generator)
- [Scikit-learn\_bench High-Level Workflow](#scikit-learn_bench-high-level-workflow)
- [📚 Benchmark Types](#-benchmark-types)
- [📑 Documentation](#-documentation)

## How to create conda environment for benchmarking
## 🔧 Create a Python Environment

Create a suitable conda environment for each framework to test. Each item in the list below links to instructions to create an appropriate conda environment for the framework.
How to create a usable Python environment with the following required frameworks:

- [**scikit-learn**](sklearn_bench#how-to-create-conda-environment-for-benchmarking)
- **sklearn, sklearnex, and gradient boosting frameworks**:

```bash
pip install -r sklearn_bench/requirements.txt
# or
conda install -c intel scikit-learn scikit-learn-intelex pandas tqdm
# with pip
pip install -r envs/requirements-sklearn.txt
# or with conda
conda env create -n sklearn -f envs/conda-env-sklearn.yml
```

- [**daal4py**](daal4py_bench#how-to-create-conda-environment-for-benchmarking)
- **RAPIDS**:

```bash
conda install -c conda-forge scikit-learn daal4py pandas tqdm
conda env create -n rapids --solver=libmamba -f envs/conda-env-rapids.yml
```

- [**cuml**](cuml_bench#how-to-create-conda-environment-for-benchmarking)
## 🚀 How To Use Scikit-learn_bench

```bash
conda install -c rapidsai -c conda-forge cuml pandas cudf tqdm
```
### Benchmarks Runner

- [**xgboost**](xgboost_bench#how-to-create-conda-environment-for-benchmarking)
How to run benchmarks using the `sklbench` module and a specific configuration:

```bash
pip install -r xgboost_bench/requirements.txt
# or
conda install -c conda-forge xgboost scikit-learn pandas tqdm
python -m sklbench --config configs/sklearn_example.json
```

## Running Python benchmarks with runner script

Run `python runner.py --configs configs/config_example.json [--output-file result.json --verbose INFO --report]` to launch benchmarks.

Options:

- ``--configs``: specify the path to a configuration file or a folder that contains configuration files.
- ``--no-intel-optimized``: use Scikit-learn without [Intel(R) Extension for Scikit-learn*](#intelr-extension-for-scikit-learn-support). Now available for [scikit-learn benchmarks](https://github.com/IntelPython/scikit-learn_bench/tree/main/sklearn_bench). By default, the runner uses Intel(R) Extension for Scikit-learn.
- ``--output-file``: specify the name of the output file for the benchmark result. The default name is `result.json`
- ``--report``: create an Excel report based on benchmark results. The `openpyxl` library is required.
- ``--dummy-run``: run configuration parser and dataset generation without benchmarks running.
- ``--verbose``: *WARNING*, *INFO*, *DEBUG*. Print out additional information when the benchmarks are running. The default is *INFO*.

| Level | Description |
|-----------|---------------|
| *DEBUG* | Detailed information, typically of interest only when diagnosing problems. Usually at this level the logging output is so low level that it’s not useful to users who are not familiar with the software’s internals. |
| *INFO* | Confirmation that things are working as expected. |
| *WARNING* | An indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected. |

Benchmarks currently support the following frameworks:
The default output is a file with JSON-formatted results of benchmarking cases. To generate a better human-readable report, use the following command:

- **scikit-learn**
- **daal4py**
- **cuml**
- **xgboost**
```bash
python -m sklbench --config configs/sklearn_example.json --report
```

The configuration of benchmarks allows you to select the frameworks to run, select datasets for measurements and configure the parameters of the algorithms.
By default, output and report file paths are `result.json` and `report.xlsx`. To specify custom file paths, run:

You can configure benchmarks by editing a config file. Check [config.json schema](https://github.com/IntelPython/scikit-learn_bench/blob/main/configs/README.md) for more details.
```bash
python -m sklbench --config configs/sklearn_example.json --report --result-file result_example.json --report-file report_example.xlsx
```

## Benchmark supported algorithms
For a description of all benchmarks runner arguments, refer to [documentation](sklbench/runner/README.md#arguments).
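
The runner invocations above can also be assembled from a script. This is a minimal sketch, assuming a hypothetical helper `build_sklbench_command` that is not part of `sklbench`; it uses only the flags shown in this README (`--config`, `--report`, `--result-file`, `--report-file`).

```python
import sys


def build_sklbench_command(config, report=False, result_file=None, report_file=None):
    """Assemble the argv list for `python -m sklbench` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "sklbench", "--config", config]
    if report:
        cmd.append("--report")
    if result_file is not None:
        cmd += ["--result-file", result_file]
    if report_file is not None:
        cmd += ["--report-file", report_file]
    return cmd


cmd = build_sklbench_command(
    "configs/sklearn_example.json",
    report=True,
    result_file="result_example.json",
    report_file="report_example.xlsx",
)
# Show the arguments without the interpreter path.
print(" ".join(cmd[1:]))

# To actually launch the benchmarks (requires sklbench and its dependencies):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting issues when file paths contain spaces.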

| algorithm | benchmark name | sklearn (CPU) | sklearn (GPU) | daal4py | cuml | xgboost |
|---|---|---|---|---|---|---|
|**[DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)**|dbscan|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)**|df_clfs|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
|**[RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)**|df_regr|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
|**[pairwise_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html)**|distances|:white_check_mark:|:x:|:white_check_mark:|:x:|:x:|
|**[KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)**|kmeans|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)**|knn_clsf|:white_check_mark:|:x:|:x:|:white_check_mark:|:x:|
|**[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)**|linear|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)**|log_reg|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)**|pca|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
|**[Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)**|ridge|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
|**[SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)**|svm|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
|**[TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)**|tsne|:white_check_mark:|:x:|:x:|:white_check_mark:|:x:|
|**[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)**|train_test_split|:white_check_mark:|:x:|:x:|:white_check_mark:|:x:|
|**[GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)**|gbt|:x:|:x:|:x:|:x:|:white_check_mark:|
|**[GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)**|gbt|:x:|:x:|:x:|:x:|:white_check_mark:|
### Report Generator

### Scikit-learn benchmakrs
To combine raw result files gathered from different environments, call the report generator:

When you run scikit-learn benchmarks on CPU, [Intel(R) Extension for Scikit-learn](https://github.com/intel/scikit-learn-intelex) is used by default. Use the ``--no-intel-optimized`` option to run the benchmarks without the extension.
```bash
python -m sklbench.report --result-files result_1.json result_2.json --report-file report_example.xlsx
```

For the algorithms with both CPU and GPU support, you may use the same [configuration file](https://github.com/IntelPython/scikit-learn_bench/blob/main/configs/skl_xpu_config.json) to run the scikit-learn benchmarks on CPU and GPU.
For a description of all report generator arguments, refer to [documentation](sklbench/report/README.md#arguments).
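
When result files from several environments are collected in one directory, the report generator call can be built from a glob. This is a sketch under stated assumptions: `build_report_command` and the `result_*.json` naming pattern are hypothetical, and only the `--result-files` and `--report-file` flags shown above are used. The placeholder files written in the demo stand in for real runner output.

```python
import sys
import tempfile
from pathlib import Path


def build_report_command(result_dir, report_file, pattern="result_*.json"):
    """Assemble the argv list for `python -m sklbench.report` (hypothetical helper)."""
    result_files = sorted(str(p) for p in Path(result_dir).glob(pattern))
    if not result_files:
        raise FileNotFoundError(f"no files matching {pattern!r} in {result_dir}")
    return [
        sys.executable, "-m", "sklbench.report",
        "--result-files", *result_files,
        "--report-file", report_file,
    ]


# Demo with placeholder files; real inputs would be JSON results from the runner.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("result_2.json", "result_1.json"):
        (Path(tmp) / name).write_text("{}")
    report_cmd = build_report_command(tmp, "report_example.xlsx")
    print(" ".join(report_cmd[1:]))
```

Sorting the file list keeps the command deterministic across runs; whether input order affects the generated report is not stated in this README.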

## Algorithm parameters
### Scikit-learn_bench High-Level Workflow

You can launch benchmarks for each algorithm separately.
To do this, go to the directory with the benchmark:
```mermaid
flowchart TB
A[User] -- High-level arguments --> B[Benchmarks runner]
B -- Generated benchmarking cases --> C["Benchmarks collection"]
C -- Raw JSON-formatted results --> D[Report generator]
D -- Human-readable report --> A

```bash
cd <framework>
classDef userStyle fill:#44b,color:white,stroke-width:2px,stroke:white;
class A userStyle
```

Run the following command:
## 📚 Benchmark Types

```bash
python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
```
**Scikit-learn_bench** supports the following types of benchmarks:

The list of supported parameters for each algorithm you can find here:
- **Scikit-learn estimator** - Measures performance and quality metrics of the [sklearn-like estimator](https://scikit-learn.org/stable/glossary.html#term-estimator).
- **Function** - Measures performance metrics of specified function.
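
To make the two benchmark types concrete, here is a minimal sketch in plain Python. It is not how `sklbench` implements its benchmarks: `MeanRegressor`, `time_call`, `benchmark_estimator`, and `benchmark_function` are hypothetical names, and the toy estimator only imitates the `fit`/`predict` interface of an sklearn-like estimator.

```python
import time


class MeanRegressor:
    """Toy sklearn-like estimator: predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def benchmark_estimator(estimator, X, y):
    """Estimator benchmark: performance (timings) plus a quality metric (MSE)."""
    _, fit_time = time_call(estimator.fit, X, y)
    y_pred, predict_time = time_call(estimator.predict, X)
    mse = sum((p - t) ** 2 for p, t in zip(y_pred, y)) / len(y)
    return {"fit_time_s": fit_time, "predict_time_s": predict_time, "mse": mse}


def benchmark_function(func, *args, **kwargs):
    """Function benchmark: performance only."""
    _, elapsed = time_call(func, *args, **kwargs)
    return {"time_s": elapsed}


X = [[0.0], [1.0], [2.0], [3.0]]
y = [0.0, 1.0, 2.0, 3.0]
estimator_result = benchmark_estimator(MeanRegressor(), X, y)
function_result = benchmark_function(sorted, [3, 1, 2])
print(estimator_result["mse"])  # 1.25
```

The split mirrors the list above: an estimator benchmark reports both timings and a quality metric, while a function benchmark reports timings only.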

- [**scikit-learn**](sklearn_bench#algorithms-parameters)
- [**daal4py**](daal4py_bench#algorithms-parameters)
- [**cuml**](cuml_bench#algorithms-parameters)
- [**xgboost**](xgboost_bench#algorithms-parameters)
## 📑 Documentation
[Scikit-learn_bench](README.md):
- [Configs](configs/README.md)
- [Benchmarks Runner](sklbench/runner/README.md)
- [Report Generator](sklbench/report/README.md)
- [Benchmarks](sklbench/benchmarks/README.md)
- [Data Processing](sklbench/datasets/README.md)
- [Emulators](sklbench/emulators/README.md)
- [Developer Guide](docs/README.md)