adding parameters for device context and patching of Scikit-Learn #23

Closed
wants to merge 5 commits

Conversation

Alexander-Makaryev
Contributor

This is an initial suggestion intended to start the discussion.
Parameters for device context and Scikit-Learn patching are added.
Some benchmarks are changed to use the new parameters.

@Alexander-Makaryev
Contributor Author

Example of the config file:

{
    "common": {
        "lib": ["sklearn"],
        "data-format": ["numpy"],
        "data-order": ["C"],
        "device": ["None", "host", "cpu", "gpu"],
        "patch_sklearn": ["False", "True"],
        "dtype": ["float64"]
    },
    "cases": [
        {
            "algorithm": "kmeans",
            "dataset": [
                {
                    "source": "synthetic",
                    "type": "kmeans",
                    "n_clusters": 10,
                    "n_features": 50,
                    "training": {
                        "n_samples": 1000000
                    }
                }
            ],
            "n-clusters": [10]
        },
        {
            "algorithm": "dbscan",
            "dataset": [
                {
                    "source": "synthetic",
                    "type": "blobs",
                    "n_clusters": 10,
                    "n_features": 50,
                    "training": {
                        "n_samples": 10000
                    }
                }
            ],
            "min-samples": [5000],
            "eps": [1]
        },
        {
            "algorithm": "linear",
            "dataset": [
                {
                    "source": "synthetic",
                    "type": "regression",
                    "n_features": 50,
                    "training": {
                        "n_samples": 1000000
                    }
                }
            ]
        },
        {
            "algorithm": "log_reg",
            "solver":["lbfgs", "newton-cg"],
            "dataset": [
                {
                    "source": "synthetic",
                    "type": "classification",
                    "n_classes": 2,
                    "n_features": 100,
                    "training": {
                        "n_samples": 100000
                    }
                },
                {
                    "source": "synthetic",
                    "type": "classification",
                    "n_classes": 5,
                    "n_features": 100,
                    "training": {
                        "n_samples": 100000
                    }
                }
            ]
        }
    ]
}

@@ -153,6 +153,12 @@ def parse_args(parser, size=None, loop_types=(),
                        help='Seed to pass as random_state')
    parser.add_argument('--dataset-name', type=str, default=None,
                        help='Dataset name')
    parser.add_argument('--device', type=str, default='None',
Contributor

I guess these params should be enabled in the higher-level scripts that will execute them, but I don't see this in make?

Another question: are we going to pass a single value to bench.py, or should this be a tuple that bench.py will iterate over?

Contributor Author


At this moment I have used only the runner.py script. It is possible that additional changes should be made in make or somewhere else. Thanks.

Devices are provided as a list in the config: "device": ["None", "host", "cpu", "gpu"]. See the example above. Benchmarks are executed with all devices in the list.
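
For illustration, here is a minimal sketch of how runner.py could expand the list-valued parameters into individual benchmark runs (assumed behavior; the flag names --device and --patch-sklearn are taken from this discussion, not from the actual runner code):

import itertools

devices = ["None", "host", "cpu", "gpu"]
patch_options = ["False", "True"]

# Every combination of device and patching state becomes one bench.py run.
for device, patch in itertools.product(devices, patch_options):
    print(f"python bench.py --device {device} --patch-sklearn {patch}")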

Contributor

@PetrovKP PetrovKP left a comment


Don't these changes break the launch of traditional DAAL? Can you check?


if args.patch_sklearn is not None and args.patch_sklearn != 'None':
    from daal4py.sklearn import patch_sklearn, unpatch_sklearn
    if args.patch_sklearn == "True":
Contributor


Why are "True", "False" string, not boolean?

Contributor Author

@Alexander-Makaryev Alexander-Makaryev May 19, 2020


There is one more option, "None", which leaves the patching state as is. I think it can be useful for backward compatibility. Possibly we should change it to a boolean.

Contributor


What will be broken in this case?

Contributor


It's not common to have to write --patch-sklearn True. It might be better to instead have two possible options: --patch-sklearn setting patch_sklearn to True, --no-patch-sklearn setting patch_sklearn to False, and then None by default. This can be done by adding two arguments with the same dest='patch_sklearn', but with different actions store_true and store_false, and the same default None.
You may also want to put both of these arguments into a mutually exclusive group.

For the --device option, I would keep it the way you have it right now, but just let the default be None (not the string), and not allow specifying None (unless you call it something like auto). --device None is a bit confusing, but reading --device cpu or --device auto makes much more sense

If possible, it would be nice to not have the exact same construction of argument parsing in two places for both the device and for the patching. What if you added the same tri-state patch argument as a kwarg to patch_sklearn, and the device as a kwarg to run_with_context?
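
A minimal argparse sketch of the suggestion above (a sketch under the assumptions in this comment, not the PR's actual implementation):

import argparse

parser = argparse.ArgumentParser()

# Tri-state patching flag: --patch-sklearn -> True, --no-patch-sklearn -> False,
# neither flag -> None (leave the patching state as is).
group = parser.add_mutually_exclusive_group()
group.add_argument('--patch-sklearn', dest='patch_sklearn',
                   action='store_true', default=None,
                   help='Patch scikit-learn with daal4py')
group.add_argument('--no-patch-sklearn', dest='patch_sklearn',
                   action='store_false',
                   help='Run with stock scikit-learn')

# The device defaults to a real None instead of the string 'None'.
parser.add_argument('--device', type=str, default=None,
                    choices=('host', 'cpu', 'gpu'),
                    help='Device context to run on')

args = parser.parse_args()
if args.patch_sklearn is not None:
    from daal4py.sklearn import patch_sklearn, unpatch_sklearn
    if args.patch_sklearn:
        patch_sklearn()
    else:
        unpatch_sklearn()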

@PetrovKP
Contributor

Example of the config file:

[quoted config omitted; identical to the example above]

Why isn't the config added to this repository?

@Alexander-Makaryev
Contributor Author

@PetrovKP At this moment we have only one example in the repository. I propose to create a directory with various configs. Please comment on what you think about it.

@PetrovKP
Contributor

@PetrovKP At this moment we have only one example in the repository. I propose to create a directory with various configs. Please comment on what you think about it.

We wanted to store all the configs here. This is correct and convenient. I don’t know why we still haven’t moved the configs ...

Contributor

@bibikar bibikar left a comment


Generally looks good. See the comment about command-line options.
In the future, it might be worth considering turning this project into a more standard Python module structure, with a sklearn_bench module containing sklearn, daal4py, etc. as submodules. Currently, bench.py is duplicated for each implementation of benchmarks, and it could just be moved to the sklearn_bench module.
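
A rough sketch of what such a layout could look like (directory and file names here are illustrative assumptions, not an agreed structure):

sklearn_bench/
    __init__.py
    bench.py       # shared argument parsing, patching and device helpers
    sklearn/       # scikit-learn benchmarks
    daal4py/       # daal4py benchmarks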



@samir-nasibli samir-nasibli self-assigned this May 16, 2023
@Alexsandruss
Contributor

Implemented in #133
