Refactoring of benchmarks (#133)

IntelPython · Jul 5, 2024 · eddb9e8
1 parent 1d29e5c
Showing 219 changed files with 8,158 additions and 20,705 deletions.
diff --git a/.github/CODEOWNERS b/.github/CODEOWNERS
@@ -1,9 +1,7 @@
-#owners and reviewers
-cuml_bench/*          @Alexsandruss
-daal4py_bench/*       @Alexsandruss @samir-nasibli
-datasets/*            @Alexsandruss
-modelbuilders_bench/* @Alexsandruss
-report_generator/*    @Alexsandruss
-sklearn_bench/*       @Alexsandruss @samir-nasibli
-xgboost_bench/*       @Alexsandruss
-*.md                  @Alexsandruss @maria-Petrova
+# owners and reviewers
+configs             @Alexsandruss
+configs/spmd*       @Alexsandruss @ethanglaser
+sklbench            @Alexsandruss
+*.md                @Alexsandruss @samir-nasibli
+requirements*.txt   @Alexsandruss @ethanglaser
+conda-env-*.yml     @Alexsandruss @ethanglaser
diff --git a/.gitignore b/.gitignore
@@ -1,18 +1,18 @@
-# Logs
-*.log
-
 # Release and work directories
 __pycache__*
 __work*
 
 # Visual Studio related files, e.g., ".vscode"
 .vs*
 
-# Datasets
-data
+# Dataset files
+data_cache
 *.csv
 *.npy
+*.npz
 
-# Results
-results*.json
-*.xlsx
+# Results at repo root
+vtune_results
+/*.json
+/*.xlsx
+/*.ipynb
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -0,0 +1,27 @@
+#===============================================================================
+# Copyright 2024 Intel Corporation
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#===============================================================================
+
+repos:
+  - repo: https://github.com/psf/black
+    rev: 23.7.0
+    hooks:
+      - id: black
+        language_version: python3.10
+  - repo: https://github.com/PyCQA/isort
+    rev: 5.12.0
+    hooks:
+      - id: isort
+        language_version: python3.10
diff --git a/README.md b/README.md
@@ -1,147 +1,105 @@
-
-# Machine Learning Benchmarks <!-- omit in toc -->
+# Machine Learning Benchmarks
 
 [![Build Status](https://dev.azure.com/daal/scikit-learn_bench/_apis/build/status/IntelPython.scikit-learn_bench?branchName=main)](https://dev.azure.com/daal/scikit-learn_bench/_build/latest?definitionId=8&branchName=main)
 
-**Machine Learning Benchmarks** contains implementations of machine learning algorithms
-across data analytics frameworks.  Scikit-learn_bench can be extended to add new frameworks
-and algorithms. It currently supports the [scikit-learn](https://scikit-learn.org/),
-[DAAL4PY](https://intelpython.github.io/daal4py/), [cuML](https://github.com/rapidsai/cuml),
-and [XGBoost](https://github.com/dmlc/xgboost) frameworks for commonly used
-[machine learning algorithms](#supported-algorithms).
-
-## Follow us on Medium <!-- omit in toc -->
-
-We publish blogs on Medium, so [follow us](https://medium.com/intel-analytics-software/tagged/machine-learning) to learn tips and tricks for more efficient data analysis. Here are our latest blogs:
+**Scikit-learn_bench** is a benchmark tool for libraries and frameworks implementing Scikit-learn-like APIs and other workloads.
 
-- [Save Time and Money with Intel Extension for Scikit-learn](https://medium.com/intel-analytics-software/save-time-and-money-with-intel-extension-for-scikit-learn-33627425ae4)
-- [Superior Machine Learning Performance on the Latest Intel Xeon Scalable Processors](https://medium.com/intel-analytics-software/superior-machine-learning-performance-on-the-latest-intel-xeon-scalable-processor-efdec279f5a3)
-- [Leverage Intel Optimizations in Scikit-Learn](https://medium.com/intel-analytics-software/leverage-intel-optimizations-in-scikit-learn-f562cb9d5544)
-- [Optimizing CatBoost Performance](https://medium.com/intel-analytics-software/optimizing-catboost-performance-4f73f0593071)
-- [Intel Gives Scikit-Learn the Performance Boost Data Scientists Need](https://medium.com/intel-analytics-software/intel-gives-scikit-learn-the-performance-boost-data-scientists-need-42eb47c80b18)
-- [From Hours to Minutes: 600x Faster SVM](https://medium.com/intel-analytics-software/from-hours-to-minutes-600x-faster-svm-647f904c31ae)
-- [Improve the Performance of XGBoost and LightGBM Inference](https://medium.com/intel-analytics-software/improving-the-performance-of-xgboost-and-lightgbm-inference-3b542c03447e)
-- [Accelerate Kaggle Challenges Using Intel AI Analytics Toolkit](https://medium.com/intel-analytics-software/accelerate-kaggle-challenges-using-intel-ai-analytics-toolkit-beb148f66d5a)
-- [Accelerate Your scikit-learn Applications](https://medium.com/intel-analytics-software/improving-the-performance-of-xgboost-and-lightgbm-inference-3b542c03447e)
-- [Optimizing XGBoost Training Performance](https://medium.com/intel-analytics-software/new-optimizations-for-cpu-in-xgboost-1-1-81144ea21115)
-- [Accelerate Linear Models for Machine Learning](https://medium.com/intel-analytics-software/accelerating-linear-models-for-machine-learning-5a75ff50a0fe)
-- [Accelerate K-Means Clustering](https://medium.com/intel-analytics-software/accelerate-k-means-clustering-6385088788a1)
-- [Fast Gradient Boosting Tree Inference](https://medium.com/intel-analytics-software/fast-gradient-boosting-tree-inference-for-intel-xeon-processors-35756f174f55)
+Benefits:
+- Full control of benchmarks suite through CLI
+- Flexible and powerful benchmark config structure
+- Available with advanced profiling tools, such as Intel(R) VTune* Profiler
+- Automated benchmarks report generation
 
-## Table of content <!-- omit in toc -->
+### 📜 Table of Contents
 
-- [How to create conda environment for benchmarking](#how-to-create-conda-environment-for-benchmarking)
-- [Running Python benchmarks with runner script](#running-python-benchmarks-with-runner-script)
-- [Benchmark supported algorithms](#benchmark-supported-algorithms)
-  - [Scikit-learn benchmakrs](#scikit-learn-benchmakrs)
-- [Algorithm parameters](#algorithm-parameters)
+- [Machine Learning Benchmarks](#machine-learning-benchmarks)
+  - [🔧 Create a Python Environment](#-create-a-python-environment)
+  - [🚀 How To Use Scikit-learn\_bench](#-how-to-use-scikit-learn_bench)
+    - [Benchmarks Runner](#benchmarks-runner)
+    - [Report Generator](#report-generator)
+    - [Scikit-learn\_bench High-Level Workflow](#scikit-learn_bench-high-level-workflow)
+  - [📚 Benchmark Types](#-benchmark-types)
+  - [📑 Documentation](#-documentation)
 
-## How to create conda environment for benchmarking
+## 🔧 Create a Python Environment
 
-Create a suitable conda environment for each framework to test. Each item in the list below links to instructions to create an appropriate conda environment for the framework.
+How to create a usable Python environment with the following required frameworks:
 
-- [**scikit-learn**](sklearn_bench#how-to-create-conda-environment-for-benchmarking)
+- **sklearn, sklearnex, and gradient boosting frameworks**:
 
 ```bash
-pip install -r sklearn_bench/requirements.txt
-# or
-conda install -c intel scikit-learn scikit-learn-intelex pandas tqdm
+# with pip
+pip install -r envs/requirements-sklearn.txt
+# or with conda
+conda env create -n sklearn -f envs/conda-env-sklearn.yml
 ```
 
-- [**daal4py**](daal4py_bench#how-to-create-conda-environment-for-benchmarking)
+- **RAPIDS**:
 
 ```bash
-conda install -c conda-forge scikit-learn daal4py pandas tqdm
+conda env create -n rapids --solver=libmamba -f envs/conda-env-rapids.yml
 ```
 
-- [**cuml**](cuml_bench#how-to-create-conda-environment-for-benchmarking)
+## 🚀 How To Use Scikit-learn_bench
 
-```bash
-conda install -c rapidsai -c conda-forge cuml pandas cudf tqdm
-```
+### Benchmarks Runner
 
-- [**xgboost**](xgboost_bench#how-to-create-conda-environment-for-benchmarking)
+How to run benchmarks using the `sklbench` module and a specific configuration:
 
 ```bash
-pip install -r xgboost_bench/requirements.txt
-# or
-conda install -c conda-forge xgboost scikit-learn pandas tqdm
+python -m sklbench --config configs/sklearn_example.json
 ```
 
-## Running Python benchmarks with runner script
-
-Run `python runner.py --configs configs/config_example.json [--output-file result.json --verbose INFO --report]` to launch benchmarks.
-
-Options:
-
-- ``--configs``: specify the path to a configuration file or a folder that contains configuration files.
-- ``--no-intel-optimized``: use Scikit-learn without [Intel(R) Extension for Scikit-learn*](#intelr-extension-for-scikit-learn-support). Now available for [scikit-learn benchmarks](https://github.com/IntelPython/scikit-learn_bench/tree/main/sklearn_bench). By default, the runner uses Intel(R) Extension for Scikit-learn.
-- ``--output-file``: specify the name of the output file for the benchmark result. The default name is `result.json`
-- ``--report``: create an Excel report based on benchmark results. The `openpyxl` library is required.
-- ``--dummy-run``: run configuration parser and dataset generation without benchmarks running.
-- ``--verbose``: *WARNING*, *INFO*, *DEBUG*. Print out additional information when the benchmarks are running. The default is *INFO*.
-
-|   Level   |  Description  |
-|-----------|---------------|
-| *DEBUG*   | etailed information, typically of interest only when diagnosing problems. Usually at this level the logging output is so low level that it’s not useful to users who are not familiar with the software’s internals. |
-| *INFO*    | Confirmation that things are working as expected. |
-| *WARNING* | An indication that something unexpected happened, or indicative of some problem in the near future (e.g. ‘disk space low’). The software is still working as expected. |
-
-Benchmarks currently support the following frameworks:
+The default output is a file with JSON-formatted results of benchmarking cases. To generate a better human-readable report, use the following command:
 
-- **scikit-learn**
-- **daal4py**
-- **cuml**
-- **xgboost**
+```bash
+python -m sklbench --config configs/sklearn_example.json --report
+```
 
-The configuration of benchmarks allows you to select the frameworks to run, select datasets for measurements and configure the parameters of the algorithms.
+By default, output and report file paths are `result.json` and `report.xlsx`. To specify custom file paths, run:
 
- You can configure benchmarks by editing a config file. Check  [config.json schema](https://github.com/IntelPython/scikit-learn_bench/blob/main/configs/README.md) for more details.
+```bash
+python -m sklbench --config configs/sklearn_example.json --report --result-file result_example.json --report-file report_example.xlsx
+```
 
-## Benchmark supported algorithms
+For a description of all benchmarks runner arguments, refer to [documentation](sklbench/runner/README.md#arguments).
 
-| algorithm  | benchmark name | sklearn (CPU) | sklearn (GPU) | daal4py | cuml | xgboost |
-|---|---|---|---|---|---|---|
-|**[DBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)**|dbscan|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
-|**[RandomForestClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)**|df_clfs|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
-|**[RandomForestRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html)**|df_regr|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
-|**[pairwise_distances](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html)**|distances|:white_check_mark:|:x:|:white_check_mark:|:x:|:x:|
-|**[KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)**|kmeans|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
-|**[KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)**|knn_clsf|:white_check_mark:|:x:|:x:|:white_check_mark:|:x:|
-|**[LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html)**|linear|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
-|**[LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)**|log_reg|:white_check_mark:|:white_check_mark:|:white_check_mark:|:white_check_mark:|:x:|
-|**[PCA](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html)**|pca|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
-|**[Ridge](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html)**|ridge|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
-|**[SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html)**|svm|:white_check_mark:|:x:|:white_check_mark:|:white_check_mark:|:x:|
-|**[TSNE](https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html)**|tsne|:white_check_mark:|:x:|:x:|:white_check_mark:|:x:|
-|**[train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)**|train_test_split|:white_check_mark:|:x:|:x:|:white_check_mark:|:x:|
-|**[GradientBoostingClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html)**|gbt|:x:|:x:|:x:|:x:|:white_check_mark:|
-|**[GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)**|gbt|:x:|:x:|:x:|:x:|:white_check_mark:|
+### Report Generator
 
-### Scikit-learn benchmakrs
+To combine raw result files gathered from different environments, call the report generator:
 
-When you run scikit-learn benchmarks on CPU, [Intel(R) Extension for Scikit-learn](https://github.com/intel/scikit-learn-intelex) is used by default. Use the ``--no-intel-optimized`` option to run the benchmarks without the extension.
+```bash
+python -m sklbench.report --result-files result_1.json result_2.json --report-file report_example.xlsx
+```
 
-For the algorithms with both CPU and GPU support, you may use the same [configuration file](https://github.com/IntelPython/scikit-learn_bench/blob/main/configs/skl_xpu_config.json) to run the scikit-learn benchmarks on CPU and GPU.
+For a description of all report generator arguments, refer to [documentation](sklbench/report/README.md#arguments).
 
-## Algorithm parameters
+### Scikit-learn_bench High-Level Workflow
 
-You can launch benchmarks for each algorithm separately.
-To do this, go to the directory with the benchmark:
+```mermaid
+flowchart TB
+    A[User] -- High-level arguments --> B[Benchmarks runner]
+    B -- Generated benchmarking cases --> C["Benchmarks collection"]
+    C -- Raw JSON-formatted results --> D[Report generator]
+    D -- Human-readable report --> A
 
-```bash
-cd <framework>
+    classDef userStyle fill:#44b,color:white,stroke-width:2px,stroke:white;
+    class A userStyle
 ```
 
-Run the following command:
+## 📚 Benchmark Types
 
-```bash
-python <benchmark_file> --dataset-name <path to the dataset> <other algorithm parameters>
-```
+**Scikit-learn_bench** supports the following types of benchmarks:
 
-The list of supported parameters for each algorithm you can find here:
+ - **Scikit-learn estimator** - Measures performance and quality metrics of the [sklearn-like estimator](https://scikit-learn.org/stable/glossary.html#term-estimator).
+ - **Function** - Measures performance metrics of specified function.
 
-- [**scikit-learn**](sklearn_bench#algorithms-parameters)
-- [**daal4py**](daal4py_bench#algorithms-parameters)
-- [**cuml**](cuml_bench#algorithms-parameters)
-- [**xgboost**](xgboost_bench#algorithms-parameters)
+## 📑 Documentation
+[Scikit-learn_bench](README.md):
+- [Configs](configs/README.md)
+- [Benchmarks Runner](sklbench/runner/README.md)
+- [Report Generator](sklbench/report/README.md)
+- [Benchmarks](sklbench/benchmarks/README.md)
+- [Data Processing](sklbench/datasets/README.md)
+- [Emulators](sklbench/emulators/README.md)
+- [Developer Guide](docs/README.md)
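
Note on the new CLI: the README added in this commit documents two entry points, `python -m sklbench` and `python -m sklbench.report`. As a rough illustration of how they could be driven from a script, the sketch below assembles both command lines using only the flags shown in the README above (`--config`, `--report`, `--result-file`, `--report-file`, `--result-files`). The helper names `build_runner_command` and `build_report_command` are hypothetical and are not part of `sklbench`; the sketch only builds argument lists and does not run anything.

```python
import shlex
import sys
from typing import List, Optional, Sequence


def build_runner_command(
    config: str,
    report: bool = False,
    result_file: Optional[str] = None,
    report_file: Optional[str] = None,
) -> List[str]:
    """Build argv for the benchmarks runner (hypothetical helper).

    Only flags that appear in the README are used.
    """
    cmd = [sys.executable, "-m", "sklbench", "--config", config]
    if report:
        cmd.append("--report")
    if result_file is not None:
        cmd += ["--result-file", result_file]
    if report_file is not None:
        cmd += ["--report-file", report_file]
    return cmd


def build_report_command(result_files: Sequence[str], report_file: str) -> List[str]:
    """Build argv for the report generator (hypothetical helper)."""
    if not result_files:
        raise ValueError("at least one result file is required")
    return [
        sys.executable, "-m", "sklbench.report",
        "--result-files", *result_files,
        "--report-file", report_file,
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    runner = build_runner_command(
        "configs/sklearn_example.json",
        report=True,
        result_file="result_example.json",
        report_file="report_example.xlsx",
    )
    merged = build_report_command(
        ["result_1.json", "result_2.json"], "report_example.xlsx"
    )
    print(shlex.join(runner))
    print(shlex.join(merged))
```

To actually launch a run, the returned list could be passed to `subprocess.run(cmd, check=True)` from an environment where `sklbench` is importable, for example one created from `envs/requirements-sklearn.txt`.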