Refactoring of benchmarks (#133)
Alexsandruss authored Jul 5, 2024
1 parent 1d29e5c commit eddb9e8
Showing 219 changed files with 8,158 additions and 20,705 deletions.
16 changes: 7 additions & 9 deletions .github/CODEOWNERS
@@ -1,9 +1,7 @@
#owners and reviewers
cuml_bench/* @Alexsandruss
daal4py_bench/* @Alexsandruss @samir-nasibli
datasets/* @Alexsandruss
modelbuilders_bench/* @Alexsandruss
report_generator/* @Alexsandruss
sklearn_bench/* @Alexsandruss @samir-nasibli
xgboost_bench/* @Alexsandruss
*.md @Alexsandruss @maria-Petrova
# owners and reviewers
configs @Alexsandruss
configs/spmd* @Alexsandruss @ethanglaser
sklbench @Alexsandruss
*.md @Alexsandruss @samir-nasibli
requirements*.txt @Alexsandruss @ethanglaser
conda-env-*.yml @Alexsandruss @ethanglaser
16 changes: 8 additions & 8 deletions .gitignore
@@ -1,18 +1,18 @@
# Logs
*.log

# Release and work directories
__pycache__*
__work*

# Visual Studio related files, e.g., ".vscode"
.vs*

# Datasets
data
# Dataset files
data_cache
*.csv
*.npy
*.npz

# Results
results*.json
*.xlsx
# Results at repo root
vtune_results
/*.json
/*.xlsx
/*.ipynb
27 changes: 27 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,27 @@
#===============================================================================
# Copyright 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#===============================================================================

repos:
- repo: https://github.com/psf/black
rev: 23.7.0
hooks:
- id: black
language_version: python3.10
- repo: https://github.com/PyCQA/isort
rev: 5.12.0
hooks:
- id: isort
language_version: python3.10
174 changes: 66 additions & 108 deletions README.md
@@ -1,147 +1,105 @@

# Machine Learning Benchmarks <!-- omit in toc -->
# Machine Learning Benchmarks

[![Build Status](https://dev.azure.com/daal/scikit-learn_bench/_apis/build/status/IntelPython.scikit-learn_bench?branchName=main)](https://dev.azure.com/daal/scikit-learn_bench/_build/latest?definitionId=8&branchName=main)

**Machine Learning Benchmarks** contains implementations of machine learning algorithms
across data analytics frameworks. Scikit-learn_bench can be extended to add new frameworks
and algorithms. It currently supports the [scikit-learn](https://scikit-learn.org/),
[DAAL4PY](https://intelpython.github.io/daal4py/), [cuML](https://github.com/rapidsai/cuml),
and [XGBoost](https://github.com/dmlc/xgboost) frameworks for commonly used
[machine learning algorithms](#supported-algorithms).

## Follow us on Medium <!-- omit in toc -->

We publish blogs on Medium, so [follow us](https://medium.com/intel-analytics-software/tagged/machine-learning) to learn tips and tricks for more efficient data analysis. Here are our latest blogs:
**Scikit-learn_bench** is a benchmark tool for libraries and frameworks implementing Scikit-learn-like APIs and other workloads.

- [Save Time and Money with Intel Extension for Scikit-learn](https://medium.com/intel-analytics-software/save-time-and-money-with-intel-extension-for-scikit-learn-33627425ae4)
- [Superior Machine Learning Performance on the Latest Intel Xeon Scalable Processors](https://medium.com/intel-analytics-software/superior-machine-learning-performance-on-the-latest-intel-xeon-scalable-processor-efdec279f5a3)
- [Leverage Intel Optimizations in Scikit-Learn](https://medium.com/intel-analytics-software/leverage-intel-optimizations-in-scikit-learn-f562cb9d5544)
- [Optimizing CatBoost Performance](https://medium.com/intel-analytics-software/optimizing-catboost-performance-4f73f0593071)
- [Intel Gives Scikit-Learn the Performance Boost Data Scientists Need](https://medium.com/intel-analytics-software/intel-gives-scikit-learn-the-performance-boost-data-scientists-need-42eb47c80b18)
- [From Hours to Minutes: 600x Faster SVM](https://medium.com/intel-analytics-software/from-hours-to-minutes-600x-faster-svm-647f904c31ae)
- [Improve the Performance of XGBoost and LightGBM Inference](https://medium.com/intel-analytics-software/improving-the-performance-of-xgboost-and-lightgbm-inference-3b542c03447e)
- [Accelerate Kaggle Challenges Using Intel AI Analytics Toolkit](https://medium.com/intel-analytics-software/accelerate-kaggle-challenges-using-intel-ai-analytics-toolkit-beb148f66d5a)
- [Accelerate Your scikit-learn Applications](https://medium.com/intel-analytics-software/improving-the-performance-of-xgboost-and-lightgbm-inference-3b542c03447e)
- [Optimizing XGBoost Training Performance](https://medium.com/intel-analytics-software/new-optimizations-for-cpu-in-xgboost-1-1-81144ea21115)
- [Accelerate Linear Models for Machine Learning](https://medium.com/intel-analytics-software/accelerating-linear-models-for-machine-learning-5a75ff50a0fe)
- [Accelerate K-Means Clustering](https://medium.com/intel-analytics-software/accelerate-k-means-clustering-6385088788a1)
- [Fast Gradient Boosting Tree Inference](https://medium.com/intel-analytics-software/fast-gradient-boosting-tree-inference-for-intel-xeon-processors-35756f174f55)
Benefits:
- Full control of benchmarks suite through CLI
- Flexible and powerful benchmark config structure
- Available with advanced profiling tools, such as Intel(R) VTune* Profiler
- Automated benchmarks report generation

## Table of content <!-- omit in toc -->
### 📜 Table of Contents

- [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking)
- [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script)
- [Benchmark supported algorithms](#benchmark-supported-algorithms)
- [Scikit-learn benchmakrs](#scikit-learn-benchmakrs)
- [Algorithm parameters](#algorithm-parameters)
- [Machine Learning Benchmarks](#machine-learning-benchmarks)
- [🔧 Create a Python Environment](#-create-a-python-environment)
- [🚀 How To Use Scikit-learn\_bench](#-how-to-use-scikit-learn_bench)
- [Benchmarks Runner](#benchmarks-runner)
- [Report Generator](#report-generator)
- [Scikit-learn\_bench High-Level Workflow](#scikit-learn_bench-high-level-workflow)
- [📚 Benchmark Types](#-benchmark-types)
- [📑 Documentation](#-documentation)

## How to create conda environment for benchmarking
## 🔧 Create a Python Environment

Create a suitable conda environment for each framework to test. Each item in the list below links to instructions to create an appropriate conda environment for the framework.
Create a usable Python environment with the required frameworks as follows:

- [**scikit-learn**](sklearn_bench#how-to-create-conda-environment-for-benchmarking)
- **sklearn, sklearnex, and gradient boosting frameworks**:

```bash
pip install -r sklearn_bench/requirements.txt
# or
conda install -c intel scikit-learn scikit-learn-intelex pandas tqdm
# with pip
pip install -r envs/requirements-sklearn.txt
# or with conda
conda env create -n sklearn -f envs/conda-env-sklearn.yml
```

- [**daal4py**](daal4py_bench#how-to-create-conda-environment-for-benchmarking)
- **RAPIDS**:

```bash
conda install -c conda-forge scikit-learn daal4py pandas tqdm
conda env create -n rapids --solver=libmamba -f envs/conda-env-rapids.yml
```

- [**cuml**](cuml_bench#how-to-create-conda-environment-for-benchmarking)
## 🚀 How To Use Scikit-learn_bench

```bash
conda install -c rapidsai -c conda-forge cuml pandas cudf tqdm
```
### Benchmarks Runner

- [**xgboost**](xgboost_bench#how-to-create-conda-environment-for-benchmarking)
Run benchmarks using the `sklbench` module and a specific configuration:

```bash
pip install -r xgboost_bench/requirements.txt
# or
conda install -c conda-forge xgboost scikit-learn pandas tqdm
python -m sklbench --config configs/sklearn_example.json
```

## Running Python benchmarks with runner script

Run `python runner.py --configs configs/config_example.json [--output-file result.json --verbose INFO --report]` to launch benchmarks.

Options:

- ``--configs``: specify the path to a configuration file or a folder that contains configuration files.
- ``--no-intel-optimized``: use Scikit-learn without [Intel(R) Extension for Scikit-learn*](#intelr-extension-for-scikit-learn-support). Now available for [scikit-learn benchmarks](https://github.com/IntelPython/scikit-learn_bench/tree/main/sklearn_bench). By default, the runner uses Intel(R) Extension for Scikit-learn.
- ``--output-file``: specify the name of the output file for the benchmark result. The default name is `result.json`
- ``--report``: create an Excel report based on benchmark results. The `openpyxl` library is required.
- ``--dummy-run``: run configuration parser and dataset generation without benchmarks running.
- ``--verbose``: *WARNING*, *INFO*, *DEBUG*. Print out additional information when the benchmarks are running. The default is *INFO*.

| Level | Description |
|-----------|---------------|
| *DEBUG* | Detailed information, typically of interest only when diagnosing problems. Usually at this level the logging output is so low level that it’s not useful to users who are not familiar with the software’s internals. |
| *INFO* | Confirmation that things are working as expected. |
| *WARNING* | An indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected. |

Benchmarks currently support the following frameworks:
The default output is a file with JSON-formatted results of benchmarking cases. To generate a more human-readable report, use the following command:

- **scikit-learn**
- **daal4py**
- **cuml**
- **xgboost**
```bash
python -m sklbench --config configs/sklearn_example.json --report
```

The configuration of benchmarks allows you to select the frameworks to run, select datasets for measurements and configure the parameters of the algorithms.
By default, output and report file paths are `result.json` and `report.xlsx`. To specify custom file paths, run:

You can configure benchmarks by editing a config file. Check [config.json schema](https://github.com/IntelPython/scikit-learn_bench/blob/main/configs/README.md) for more details.
```bash
python -m sklbench --config configs/sklearn_example.json --report --result-file result_example.json --report-file report_example.xlsx
```

## Benchmark supported algorithms
For a description of all benchmarks runner arguments, refer to the [documentation](sklbench/runner/README.md#arguments).
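
The runner commands above can also be assembled programmatically, for example when launching several configs from a script. The following is a minimal sketch, not part of `sklbench` itself: the helper name `build_runner_command` is hypothetical, and it uses only the flags shown in this README (`--config`, `--report`, `--result-file`, `--report-file`).

```python
import sys


def build_runner_command(config, report=False, result_file=None, report_file=None):
    """Assemble the argv list for `python -m sklbench` (hypothetical helper).

    Only flags that appear in the README are used; nothing is executed here.
    """
    cmd = [sys.executable, "-m", "sklbench", "--config", config]
    if report:
        cmd.append("--report")
    if result_file is not None:
        cmd += ["--result-file", result_file]
    if report_file is not None:
        cmd += ["--report-file", report_file]
    return cmd


if __name__ == "__main__":
    cmd = build_runner_command(
        "configs/sklearn_example.json",
        report=True,
        result_file="result_example.json",
        report_file="report_example.xlsx",
    )
    # Print the command instead of running it; pass `cmd` to
    # subprocess.run(cmd, check=True) from a repository checkout to execute.
    print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when file paths contain spaces.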

| algorithm | benchmark name | sklearn (CPU) | sklearn (GPU) | daal4py | cuml | xgboost |
|---|---|---|---|---|---|---|
|**[DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)**|dbscan|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)**|df_clfs|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
|**[RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)**|df_regr|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
|**[pairwise_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html)**|distances|:white_check_mark:|:x:|:white_check_mark:|:x:|:x:|
|**[KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)**|kmeans|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)**|knn_clsf|:white_check_mark:|:x:|:x:|:white_check_mark:|:x:|
|**[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)**|linear|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)**|log_reg|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
|**[PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)**|pca|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
|**[Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)**|ridge|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
|**[SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)**|svm|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
|**[TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)**|tsne|:white_check_mark:|:x:|:x:|:white_check_mark:|:x:|
|**[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)**|train_test_split|:white_check_mark:|:x:|:x:|:white_check_mark:|:x:|
|**[GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)**|gbt|:x:|:x:|:x:|:x:|:white_check_mark:|
|**[GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)**|gbt|:x:|:x:|:x:|:x:|:white_check_mark:|
### Report Generator

### Scikit-learn benchmakrs
To combine raw result files gathered from different environments, call the report generator:

When you run scikit-learn benchmarks on CPU, [Intel(R) Extension for Scikit-learn](https://github.com/intel/scikit-learn-intelex) is used by default. Use the ``--no-intel-optimized`` option to run the benchmarks without the extension.
```bash
python -m sklbench.report --result-files result_1.json result_2.json --report-file report_example.xlsx
```

For the algorithms with both CPU and GPU support, you may use the same [configuration file](https://github.com/IntelPython/scikit-learn_bench/blob/main/configs/skl_xpu_config.json) to run the scikit-learn benchmarks on CPU and GPU.
For a description of all report generator arguments, refer to the [documentation](sklbench/report/README.md#arguments).
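
When result files come from several environments, it can help to check that they all exist before calling the report generator. The sketch below is an illustration under stated assumptions: the helper names `missing_files` and `build_report_command` are hypothetical, and only the `--result-files` and `--report-file` flags shown above are used.

```python
import sys
from pathlib import Path


def missing_files(paths):
    """Return the subset of paths that do not point to an existing file."""
    return [str(p) for p in paths if not Path(p).is_file()]


def build_report_command(result_files, report_file):
    """Assemble the argv list for `python -m sklbench.report` (hypothetical helper)."""
    if not result_files:
        raise ValueError("at least one result file is required")
    return [
        sys.executable, "-m", "sklbench.report",
        "--result-files", *[str(p) for p in result_files],
        "--report-file", str(report_file),
    ]


if __name__ == "__main__":
    results = ["result_1.json", "result_2.json"]
    absent = missing_files(results)
    if absent:
        print("Missing result files:", ", ".join(absent))
    else:
        # Print only; use subprocess.run(..., check=True) to actually execute.
        print(" ".join(build_report_command(results, "report_example.xlsx")))
```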

## Algorithm parameters
### Scikit-learn_bench High-Level Workflow

You can launch benchmarks for each algorithm separately.
To do this, go to the directory with the benchmark:
```mermaid
flowchart TB
A[User] -- High-level arguments --> B[Benchmarks runner]
B -- Generated benchmarking cases --> C["Benchmarks collection"]
C -- Raw JSON-formatted results --> D[Report generator]
D -- Human-readable report --> A
```bash
cd <framework>
classDef userStyle fill:#44b,color:white,stroke-width:2px,stroke:white;
class A userStyle
```

Run the following command:
## 📚 Benchmark Types

```bash
python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
```
**Scikit-learn_bench** supports the following types of benchmarks:

The list of supported parameters for each algorithm you can find here:
- **Scikit-learn estimator** - Measures performance and quality metrics of a [sklearn-like estimator](https://scikit-learn.org/stable/glossary.html#term-estimator).
- **Function** - Measures performance metrics of a specified function.
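
To give a rough idea of what a function benchmark measures, here is a minimal timing sketch. It is not how `sklbench` implements its measurements; the helper `time_function` and the returned keys are assumptions made for illustration, using only the Python standard library.

```python
import statistics
import time


def time_function(func, *args, repeats=5, **kwargs):
    """Call `func` several times and summarize wall-clock timings in seconds.

    Illustrative only: real benchmark tools typically add warm-up runs,
    outlier handling, and richer metrics.
    """
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        timings.append(time.perf_counter() - start)
    return {
        "repeats": repeats,
        "mean_s": statistics.mean(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


if __name__ == "__main__":
    # Time sorting a reversed list as a stand-in for a real workload.
    data = list(range(100_000, 0, -1))
    print(time_function(sorted, data, repeats=3))
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations.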

- [**scikit-learn**](sklearn_bench#algorithms-parameters)
- [**daal4py**](daal4py_bench#algorithms-parameters)
- [**cuml**](cuml_bench#algorithms-parameters)
- [**xgboost**](xgboost_bench#algorithms-parameters)
## 📑 Documentation
[Scikit-learn_bench](README.md):
- [Configs](configs/README.md)
- [Benchmarks Runner](sklbench/runner/README.md)
- [Report Generator](sklbench/report/README.md)
- [Benchmarks](sklbench/benchmarks/README.md)
- [Data Processing](sklbench/datasets/README.md)
- [Emulators](sklbench/emulators/README.md)
- [Developer Guide](docs/README.md)