adding parameters for device context and patching of Scikit-Learn #23

Closed
wants to merge 5 commits

Conversation

Alexander-Makaryev
Contributor

This is an initial suggestion intended to start the discussion.
Parameters for device context and Scikit-Learn patching are added.
Some benchmarks are changed to use the new parameters.

@Alexander-Makaryev
Contributor Author

Example of the config file:

{
    "common": {
        "lib": ["sklearn"],
        "data-format": ["numpy"],
        "data-order": ["C"],
        "device": ["None", "host", "cpu", "gpu"],
        "patch_sklearn": ["False", "True"],
        "dtype": ["float64"]
    },
    "cases": [
        {
            "algorithm": "kmeans",
            "dataset": [
                {
                    "source": "synthetic",
                    "type": "kmeans",
                    "n_clusters": 10,
                    "n_features": 50,
                    "training": {
                        "n_samples": 1000000
                    }
                }
            ],
            "n-clusters": [10]
        },
        {
            "algorithm": "dbscan",
            "dataset": [
                {
                    "source": "synthetic",
                    "type": "blobs",
                    "n_clusters": 10,
                    "n_features": 50,
                    "training": {
                        "n_samples": 10000
                    }
                }
            ],
            "min-samples": [5000],
            "eps": [1]
        },
        {
            "algorithm": "linear",
            "dataset": [
                {
                    "source": "synthetic",
                    "type": "regression",
                    "n_features": 50,
                    "training": {
                        "n_samples": 1000000
                    }
                }
            ]
        },
        {
            "algorithm": "log_reg",
            "solver":["lbfgs", "newton-cg"],
            "dataset": [
                {
                    "source": "synthetic",
                    "type": "classification",
                    "n_classes": 2,
                    "n_features": 100,
                    "training": {
                        "n_samples": 100000
                    }
                },
                {
                    "source": "synthetic",
                    "type": "classification",
                    "n_classes": 5,
                    "n_features": 100,
                    "training": {
                        "n_samples": 100000
                    }
                }
            ]
        }
    ]
}

@@ -153,6 +153,12 @@ def parse_args(parser, size=None, loop_types=(),
                        help='Seed to pass as random_state')
    parser.add_argument('--dataset-name', type=str, default=None,
                        help='Dataset name')
    parser.add_argument('--device', type=str, default='None',
Contributor

I guess these params should be enabled in the higher-level scripts that will execute them, but I don't see this in make?

Another question: are we going to pass a single value to bench.py, or should this be a tuple that bench.py will iterate over?

Contributor Author


At this moment I have used only the runner.py script. It is possible that additional changes should be made in make or somewhere else. Thanks.

Devices are provided as a list in the config: "device": ["None", "host", "cpu", "gpu"]. See the example above. Benchmarks are executed with all devices in the list.
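
For illustration, here is a minimal sketch of how runner.py could expand the list-valued parameters into individual benchmark runs (assumed behavior; the flag names --device and --patch-sklearn are taken from this discussion, not from the actual runner code):

import itertools

devices = ["None", "host", "cpu", "gpu"]
patch_options = ["False", "True"]

# Every combination of device and patching state becomes one bench.py run.
for device, patch in itertools.product(devices, patch_options):
    print(f"python bench.py --device {device} --patch-sklearn {patch}")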

Contributor

@PetrovKP PetrovKP left a comment


Don't these changes break the launch of traditional DAAL? Can you check?


if args.patch_sklearn is not None and args.patch_sklearn != 'None':
    from daal4py.sklearn import patch_sklearn, unpatch_sklearn
    if args.patch_sklearn == "True":
Contributor


Why are "True", "False" string, not boolean?

Contributor Author

@Alexander-Makaryev Alexander-Makaryev May 19, 2020


There is one more option, "None", which leaves the patching state as is. I think it can be useful for backward compatibility. Possibly we should change it to a boolean.

Contributor


What will be broken in this case?

Contributor


It's not common to have to write --patch-sklearn True. It might be better to instead have two possible options: --patch-sklearn setting patch_sklearn to True, --no-patch-sklearn setting patch_sklearn to False, and then None by default. This can be done by adding two arguments with the same dest='patch_sklearn', but with different actions store_true and store_false, and the same default None.
You may also want to put both of these arguments into a mutually exclusive group.

For the --device option, I would keep it the way you have it right now, but just let the default be None (not the string), and not allow specifying None (unless you call it something like auto). --device None is a bit confusing, but reading --device cpu or --device auto makes much more sense

If possible, it would be nice to not have the exact same construction of argument parsing in two places for both the device and for the patching. What if you added the same tri-state patch argument as a kwarg to patch_sklearn, and the device as a kwarg to run_with_context?
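
A minimal argparse sketch of the suggestion above (a sketch under the assumptions in this comment, not the PR's actual implementation):

import argparse

parser = argparse.ArgumentParser()

# Tri-state patching flag: --patch-sklearn -> True, --no-patch-sklearn -> False,
# neither flag -> None (leave the patching state as is).
group = parser.add_mutually_exclusive_group()
group.add_argument('--patch-sklearn', dest='patch_sklearn',
                   action='store_true', default=None,
                   help='Patch scikit-learn with daal4py')
group.add_argument('--no-patch-sklearn', dest='patch_sklearn',
                   action='store_false',
                   help='Run with stock scikit-learn')

# The device defaults to a real None instead of the string 'None'.
parser.add_argument('--device', type=str, default=None,
                    choices=('host', 'cpu', 'gpu'),
                    help='Device context to run on')

args = parser.parse_args()
if args.patch_sklearn is not None:
    from daal4py.sklearn import patch_sklearn, unpatch_sklearn
    if args.patch_sklearn:
        patch_sklearn()
    else:
        unpatch_sklearn()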

@PetrovKP
Contributor

Example of the config file:

[quoted config omitted; identical to the example above]

Why isn't the config added to this repository?

@Alexander-Makaryev
Contributor Author

@PetrovKP At this moment we have only one example in the repository. I propose to create a directory with various configs. Please comment on what you think about it.

@PetrovKP
Contributor

@PetrovKP At this moment we have only one example in the repository. I propose to create a directory with various configs. Please comment on what you think about it.

We wanted to store all the configs here. This is correct and convenient. I don’t know why we still haven’t moved the configs ...

Contributor

@bibikar bibikar left a comment


Generally looks good. See the comment about command-line options.
In the future, it might be worth considering turning this project into a more standard Python module structure, with a sklearn_bench module containing sklearn, daal4py, etc. as submodules. Currently, bench.py is duplicated for each implementation of benchmarks, and it could just be moved to the sklearn_bench module.
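
A rough sketch of what such a layout could look like (directory and file names here are illustrative assumptions, not an agreed structure):

sklearn_bench/
    __init__.py
    bench.py       # shared argument parsing, patching and device helpers
    sklearn/       # scikit-learn benchmarks
    daal4py/       # daal4py benchmarks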



@samir-nasibli samir-nasibli self-assigned this May 16, 2023
@Alexsandruss
Contributor

Implemented in #133
