Machine Learning Engineer with Microsoft Azure - Capstone Project

This project is part of the Machine Learning Engineer with Microsoft Azure Nanodegree Program by Udacity and Microsoft.

Dataset

The dataset used in this project is the Wisconsin Breast Cancer dataset from Kaggle. Two different models are developed: one trained using Automated ML (AutoML) and the other trained and tuned with HyperDrive. The performance of both models is compared, the best model is deployed, and the deployed model is then consumed through the generated REST endpoint.

Overview

The dataset is the Wisconsin Breast Cancer dataset from Kaggle. The features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. The class distribution is 357 benign, 212 malignant.

Attribute Information:

  1. ID number
  2. Diagnosis (M = malignant, B = benign)
  3-32. Ten real-valued features computed for each cell nucleus:

a) radius (mean of distances from center to points on the perimeter)
b) texture (standard deviation of gray-scale values)
c) perimeter
d) area
e) smoothness (local variation in radius lengths)
f) compactness (perimeter^2 / area - 1.0)
g) concavity (severity of concave portions of the contour)
h) concave points (number of concave portions of the contour)
i) symmetry
j) fractal dimension ("coastline approximation" - 1)

Task

The dataset provides a binary classification task: classify the given FNA image features into two classes, Malignant and Benign. All attributes except the ID number are used for training the model. The column diagnosis is the target variable.

Access

The dataset is uploaded to this GitHub repository and exposed through its raw GitHub content URL. This URL is used to access the dataset from the workspace:
https://raw.githubusercontent.com/MonicaSai7/Capstone-Project---Azure-Machine-Learning-Engineer/main/BreastCancerWisconsinDataset.csv
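
A minimal sketch of how this URL can be turned into a registered TabularDataset in the workspace (the registration name used below is illustrative):

from azureml.core import Workspace, Dataset

ws = Workspace.from_config()

data_url = ("https://raw.githubusercontent.com/MonicaSai7/"
            "Capstone-Project---Azure-Machine-Learning-Engineer/main/"
            "BreastCancerWisconsinDataset.csv")

# Read the CSV at the raw GitHub URL into a TabularDataset and register it
# so that later experiments in this workspace can reference it by name.
dataset = Dataset.Tabular.from_delimited_files(path=data_url)
dataset = dataset.register(workspace=ws,
                           name="breast-cancer-wisconsin",
                           description="Wisconsin Breast Cancer dataset (Kaggle)")

df = dataset.to_pandas_dataframe()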

Automated ML

The major AutoML Settings are defined as:

  1. "experiment_timeout_minutes" = 20
    The maximum amount of time in minutes that all iterations combined can take before the experiment terminates. This configuration has been set to 20 due to on-demand lab duration.

  2. "primary_metric" = 'AUC_weighted'
    The metric that Azure AutoML optimizes during model selection.

The configurations for the AutoMLConfig instance are defined as follows (a combined sketch is given after this list):

  1. task = "classification"
    The type of task to run; this experiment is a binary classification task.
  2. label_column_name="diagnosis"
    The name of the label column which needs to be predicted. The target column in this dataset is the "diagnosis" column whose values are either Malignant or Benign.
  3. enable_early_stopping= True
    Whether to enable early termination if the score is not improving in the short term. Early stopping is triggered if the absolute value of best score calculated is the same for past early_stopping_n_iters iterations, that is, if there is no improvement in score for early_stopping_n_iters iterations.
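
As referenced above, a minimal sketch of how these settings might be combined into an AutoMLConfig is shown below; the dataset and compute_target variables are assumed to have been created earlier (e.g. when registering the dataset and provisioning the cluster):

from azureml.train.automl import AutoMLConfig

automl_settings = {
    "experiment_timeout_minutes": 20,
    "primary_metric": "AUC_weighted",
}

# dataset: the registered TabularDataset; compute_target: an existing
# AmlCompute cluster in the workspace (both assumed to exist already).
automl_config = AutoMLConfig(
    task="classification",
    training_data=dataset,
    label_column_name="diagnosis",
    enable_early_stopping=True,
    compute_target=compute_target,
    **automl_settings,
)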

The run details of the AutoML run are shown below:





The different models run during the experiment:



Results

The AutoML experiment produced the VotingEnsemble algorithm as the best model, with an accuracy of 0.9841478696741854.

A voting ensemble (or "majority voting ensemble") is an ensemble machine learning model that combines the predictions from multiple other models. It is a technique that may be used to improve model performance, ideally achieving better performance than any single model used in the ensemble. A voting ensemble can be used for classification or regression. In the case of regression, it calculates the average of the predictions from the models. In the case of classification, the predictions for each label are summed and the label with the majority vote is predicted.
Voting ensembles are most effective when:

  • Combining multiple fits of a model trained using stochastic learning algorithms.
  • Combining multiple fits of a model with different hyperparameters.
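
To illustrate the idea (this is not the exact ensemble AutoML produced), a soft-voting classifier in scikit-learn might look like the following; X_train and y_train are assumed to be the prepared training split:

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Illustrative only: AutoML's VotingEnsemble combines its own iteration
# models with learned weights; this sketch shows the general technique.
voting_clf = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100)),
        ("knn", KNeighborsClassifier(n_neighbors=9)),
    ],
    voting="soft",  # average predicted probabilities; "hard" takes a majority vote
)
voting_clf.fit(X_train, y_train)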

The best model details are given as below:



The parameters generated by the AutoML best model include the following KNeighborsClassifier, one of the estimators inside the voting ensemble:

KNeighborsClassifier(algorithm='auto',
                     leaf_size=30,
                     metric='l1',
                     metric_params=None,
                     n_jobs=1,
                     n_neighbors=9,
                     p=2,
                     weights='distance')

The metrics of the best run model can be seen as below:







Hyperparameter Tuning

The model used in this experiment is the RandomForestClassifier from sklearn library. A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default), otherwise the whole dataset is used to build each tree.

Random forest adds additional randomness to the model, while growing the trees. Instead of searching for the most important feature while splitting a node, it searches for the best feature among a random subset of features. This results in a wide diversity that generally results in a better model.

The dataset is loaded as a dataframe using the pandas library and split into train and test sets, with the test set being 30% of the data and a random state of 42. The scikit-learn pipeline uses the RandomForestClassifier, which takes the following hyperparameters (a sketch of such a training script follows this list):

  1. --n_estimators - Number of trees in the forest.
  2. --min_samples_split - Minimum number of samples required to split an internal node.
  3. --max_features - {'auto', 'sqrt', 'log2'}
    The number of features to consider when looking for the best split:
      • If int, then consider max_features features at each split.
      • If float, then max_features is a fraction and round(max_features * n_features) features are considered at each split.
      • If "auto", then max_features=sqrt(n_features).
      • If "sqrt", then max_features=sqrt(n_features) (same as "auto").
      • If "log2", then max_features=log2(n_features).
      • If None, then max_features=n_features.
  4. --bootstrap - Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
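
A hypothetical sketch of such a training script is given below; the column handling and logging details are assumptions and the actual train.py in this repository may differ:

import argparse
import os

import joblib
import pandas as pd
from azureml.core.run import Run
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument("--n_estimators", type=int, default=100)
parser.add_argument("--min_samples_split", type=int, default=2)
parser.add_argument("--max_features", type=str, default="sqrt")
parser.add_argument("--bootstrap", type=lambda s: str(s).lower() == "true", default=True)
args = parser.parse_args()

run = Run.get_context()

# Column names assumed from the Kaggle CSV ("id" column plus "diagnosis" target);
# any fully empty column in the export is dropped defensively.
df = pd.read_csv("BreastCancerWisconsinDataset.csv").dropna(axis=1, how="all")
X = df.drop(columns=["id", "diagnosis"])
y = df["diagnosis"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=args.n_estimators,
                               min_samples_split=args.min_samples_split,
                               max_features=args.max_features,
                               bootstrap=args.bootstrap).fit(X_train, y_train)

# Log the metric that HyperDrive optimizes and save the fitted model.
run.log("Accuracy", model.score(X_test, y_test))
os.makedirs("outputs", exist_ok=True)
joblib.dump(model, "outputs/model.joblib")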

A compute cluster with vm_size STANDARD_D2_V2 and a maximum of 4 nodes is used to run the experiment. The HyperDriveConfig is created with the sampler, estimator, and policy described below, along with the maximum total runs set to 20 and the maximum concurrent runs set to 5.

There are three types of sampling in the hyperparameter space:

  1. Random Sampling
    Random sampling supports discrete and continuous hyperparameters. It supports early termination of low-performance runs. In random sampling, hyperparameter values are randomly selected from the defined search space.

  2. Grid Sampling
    Grid sampling supports discrete hyperparameters. Use grid sampling if you can budget to exhaustively search over the search space. Supports early termination of low-performance runs. Performs a simple grid search over all possible values. Grid sampling can only be used with choice hyperparameters.

  3. Bayesian Sampling
    Bayesian sampling is based on the Bayesian optimization algorithm. It picks samples based on how previous samples performed, so that new samples improve the primary metric. Bayesian sampling is recommended if you have enough budget to explore the hyperparameter space. For best results, we recommend a maximum number of runs greater than or equal to 20 times the number of hyperparameters being tuned.

Random sampling is chosen because it supports both discrete and continuous hyperparameters, giving a wider range of possible parameter combinations. Grid sampling supports only discrete hyperparameters and performs an exhaustive search over the parameter space, which requires high computational resources. Bayesian sampling is only justified when the maximum number of runs is at least 20 times the number of hyperparameters being tuned, which demands a larger budget; it also does not support early termination, which is a requirement for this project. So random sampling is an efficient choice for this dataset.

Bandit policy is based on a slack factor/slack amount and an evaluation interval. Bandit terminates runs whose primary metric is not within the specified slack factor/slack amount of the best performing run. A Bandit policy with a smaller allowable slack gives more aggressive savings, since underperforming runs are cancelled earlier. Since this project does not need runs to continue once they fall behind, such an aggressive savings policy is preferable to a more conservative one.
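
A sketch of how the sampler, policy, and HyperDriveConfig described above might be wired together; the parameter ranges, training environment, and primary metric name are assumptions rather than values copied from this repository:

from azureml.core import Environment, ScriptRunConfig
from azureml.train.hyperdrive import (BanditPolicy, HyperDriveConfig,
                                      PrimaryMetricGoal,
                                      RandomParameterSampling, choice)

# Illustrative search space over the four hyperparameters discussed above.
param_sampling = RandomParameterSampling({
    "--n_estimators": choice(10, 50, 100, 200),
    "--min_samples_split": choice(2, 5, 10),
    "--max_features": choice("auto", "sqrt", "log2"),
    "--bootstrap": choice(True, False),
})

# Cancel runs whose primary metric falls outside 10% slack of the best run so far.
early_termination_policy = BanditPolicy(slack_factor=0.1, evaluation_interval=1)

# compute_target: the STANDARD_D2_V2 cluster; the curated sklearn environment
# name is an assumption and may need to be adjusted.
src = ScriptRunConfig(source_directory=".",
                      script="train.py",
                      compute_target=compute_target,
                      environment=Environment.get(ws, "AzureML-sklearn-0.24-ubuntu18.04-py37-cpu"))

hyperdrive_config = HyperDriveConfig(run_config=src,
                                     hyperparameter_sampling=param_sampling,
                                     policy=early_termination_policy,
                                     primary_metric_name="Accuracy",
                                     primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                     max_total_runs=20,
                                     max_concurrent_runs=5)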

The run details of the experiment:





The details of the best HyperDrive-tuned model are given below:


Results

The best HyperDrive-tuned model achieved an accuracy of 0.9766081871345029 with the following configuration:
  • No. of estimators: 200
  • Min. no. of samples to split: 2
  • No. of features considered: sqrt
  • Bootstrap: True

All the iteration/child runs of the experiment with different hyperparameters are:



Model Deployment

Deploying the best model will allow us to interact with the HTTP API service by sending data over POST requests.
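
A sketch of how the best model might be deployed as an Azure Container Instances web service; the service name, environment variable, and registered model variable are assumptions:

from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# best_model: the registered best model; env: an environment with sklearn installed.
inference_config = InferenceConfig(entry_script="predict.py", environment=env)

deployment_config = AciWebservice.deploy_configuration(cpu_cores=1,
                                                       memory_gb=1,
                                                       auth_enabled=True)

service = Model.deploy(workspace=ws,
                       name="breast-cancer-endpoint",
                       models=[best_model],
                       inference_config=inference_config,
                       deployment_config=deployment_config)
service.wait_for_deployment(show_output=True)
print(service.scoring_uri)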

The following screenshot shows the real-time endpoint created after the best model is deployed:



The endpoint.py script can be used to make a POST request to predict the label of given records. It contains the data payload that is passed in the body of the HTTP request.

  • After the endpoint is deployed, a scoring URI and secret key will be generated.
  • The generated scoring URI and secret key must be added in the endpoint.py script.
  • Then run the endpoint.py script to consume the deployed endpoint.
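
A minimal sketch of what endpoint.py does; the scoring URI, key, and feature values shown are placeholders, not real values from this project:

import json

import requests

# Fill these in with the values generated when the endpoint is deployed.
scoring_uri = "<scoring-uri>"
key = "<primary-key>"

# One record with the numeric features expected by the model; only two of the
# 30 features are shown here and the values are illustrative.
data = {"data": [{"radius_mean": 17.99, "texture_mean": 10.38}]}

headers = {"Content-Type": "application/json",
           "Authorization": f"Bearer {key}"}

response = requests.post(scoring_uri, data=json.dumps(data), headers=headers)
print(response.json())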

The predict.py Python script is used by the endpoint to interact with the registered best model. The data passed in the payload must be JSON serializable; predict.py extracts the data from the JSON body of the request and passes it to the model as input.
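
The entry script follows the usual Azure ML scoring pattern; a hedged sketch is shown below, where the model file name and input format are assumptions:

import json
import os

import joblib
import pandas as pd


def init():
    global model
    # AZUREML_MODEL_DIR points at the registered model's folder inside the container.
    model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR"), "model.joblib")
    model = joblib.load(model_path)


def run(raw_data):
    # Extract the records from the JSON body and hand them to the model.
    data = pd.DataFrame(json.loads(raw_data)["data"])
    predictions = model.predict(data)
    return predictions.tolist()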

The sample input can be given as:





Screen Recording

The screencast of the project demo can be viewed here

Standout Suggestions

A limitation of the voting ensemble is that it treats all models the same, meaning all models contribute equally to the prediction. This is a problem if some models are good in some situations and poor in others. In order to improve the model performance in the future:

  1. Prevent overfitting

An over-fitted model will assume that the feature value combinations seen during training will always result in the exact same output for the target.
The best way to prevent overfitting is to follow ML best-practices including:

  • Using more training data, and eliminating statistical bias
  • Preventing target leakage
  • Using fewer features
  • Regularization and hyperparameter optimization
  • Model complexity limitations
  • Cross-validation

In the context of automated ML, the first three items above are best-practices you implement. The last three items are best-practices automated ML implements by default to protect against over-fitting. In settings other than automated ML, all six best-practices are worth following to avoid over-fitting models.
More information can be found at https://docs.microsoft.com/en-us/azure/machine-learning/concept-manage-ml-pitfalls

  2. Use wider-ranging hyperparameter sampling in the scikit-learn pipeline.

  3. Enable deep learning for classification while creating the AutoML experiment.

  4. Perform data preprocessing such as feature selection by observing the influence of features on different models.
