This is the Capstone project for the Udacity Course - ML Engineer with Microsoft Azure.
This project provides the opportunity to use the knowledge obtained from the course to solve an interesting problem. Two models are created: one using AutoML and one custom model whose hyperparameters are tuned using HyperDrive. The performance of the two models is compared, and the better-performing model is deployed. The project demonstrates the ability to use external datasets, to train a model using the different AzureML tools available, and to deploy the model as a web service.
Both the HyperDrive and AutoML APIs are used in this project.
In this project we attempt to classify penguin species from the given input data. First, we apply AutoML, which trains multiple models to fit the training data. Second, we train a LogisticRegression model while tuning its hyperparameters with HyperDrive. Finally, the better model from the two approaches (in terms of accuracy) is chosen and deployed as a web service.
- Created new workspace called udacity-capstone
- Created new compute instance (DS-3) to be used by workspace/notebooks
- Forked nd00333-capstone project to my github from Udacity's instance
- Uploaded code from github.com/aspatton/nd00333-capstone repo to my workspace
- Imported all needed dependencies in the notebooks.
- Imported dataset in the workspace (penguins.csv)
- Train model using AutoML
- Train model using HyperDrive
- Compare model performance - AutoML vs HyperDrive
- Select the best performing model via the comparison
- Deploy the best performing model as a web service
- Test the model
- **Added screenshots throughout the project**
- Edited/updated readme.md
This project uses the Palmer Archipelago penguin data from Kaggle: https://www.kaggle.com/datasets/parulpandey/palmer-archipelago-antarctica-penguin-data
Palmer Archipelago (Antarctica) penguin data
Data collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
The dataset contains the following columns:
- species: penguin species, the column to be predicted.
- island: The island where the penguin was found.
- culmen_length_mm: The length of the penguin's bill in millimeters.
- culmen_depth_mm: The depth of the penguin's bill in millimeters.
- flipper_length_mm: The length of the penguin's flipper in millimeters.
- body_mass_g: The penguin's body mass in grams.
- sex: Penguin's sex.
The distribution of data categories:
- species: There are 3 species of penguins in the dataset: Adelie (152), Gentoo (124), and Chinstrap (68).
- island: Penguins are from 3 islands: Biscoe (168), Dream (124), and Torgersen (52).
- sex: There are 168 male, 165 female, and 1 of ambiguous sex.
Two models are trained on this penguin data:
- HyperDrive - This uses a LogisticRegression model from the sklearn library.
  - The C parameter is the inverse of the regularization strength (smaller values mean stronger regularization), which helps avoid overfitting the model.
  - The max_iter parameter specifies the maximum number of iterations allowed for the solver to converge.
  - Preprocessing is handled with imputers from sklearn to get the data into the proper format for the model.
- AutoML
  - The task parameter is set to classification, as this is a classification problem.
  - label_column_name is set to 'species', as this is the target field for prediction.
  - featurization is set to 'auto', allowing AutoML to control this step.
A Dataset is uploaded from the external data source to Azure during creation.
Our objective is to build prediction models that predict a penguin's species from the set of given penguin features.
For the AutoML run we specify the classification task, accuracy as the primary metric, the training dataset, and the target column name. Featurization is set to "auto", meaning that the featurization step is done automatically. Early stopping is enabled so that the experiment ends when the score stops improving, and debugging information is written to a local log file.
automl_config = AutoMLConfig(compute_target=cpu_cluster,
                             task="classification",
                             training_data=ds,
                             label_column_name="species",
                             enable_early_stopping=True,
                             featurization="auto",
                             debug_log="automl_errors.log",
                             experiment_timeout_minutes=15,
                             max_concurrent_iterations=4,
                             primary_metric="accuracy")
NOTE AGAIN - The RunDetails widget does not work in the latest Azure AI Machine Learning Studio (Python 3.8 with AzureML).
A ticket has been logged with Microsoft; the issue did not occur with the previous two project postings.
The next three screens show the job running and the details of the runs, with the output shown without the RunDetails widget.
Output details:
{'runId': 'HD_3e829474-1309-43d0-a49b-412e08bc6ae0', 'target': 'capstone-compute', 'status': 'Completed', 'startTimeUtc': '2024-03-22T21:15:41.277773Z', 'endTimeUtc': '2024-03-22T21:19:44.111593Z', 'services': {}, 'properties': {'primary_metric_config': '{"name":"Accuracy","goal":"maximize"}', 'resume_from': 'null', 'runTemplate': 'HyperDrive', 'azureml.runsource': 'hyperdrive', 'platform': 'AML', 'ContentSnapshotId': '185e47a0-47a3-4a28-a433-77aa4773e99b', 'user_agent': 'python/3.8.5 (Linux-5.15.0-1040-azure-x86_64-with-glibc2.10) msrest/0.7.1 Hyperdrive.Service/1.0.0 Hyperdrive.SDK/core.1.51.0', 'space_size': '9', 'best_child_run_id': 'HD_3e829474-1309-43d0-a49b-412e08bc6ae0_1', 'score': '0.7536231884057971', 'best_metric_status': 'Succeeded', 'best_data_container_id': 'dcid.HD_3e829474-1309-43d0-a49b-412e08bc6ae0_1'}, 'inputDatasets': [], 'outputDatasets': [], 'runDefinition': {'configuration': None, 'attribution': None, 'telemetryValues': {'amlClientType': 'azureml-sdk-train', 'amlClientModule': '[Scrubbed]', 'amlClientFunction': '[Scrubbed]', 'tenantId': '41b3409a-40cf-45e8-8201-ddfa1984ab3d', 'amlClientRequestId': '41ff6797-074a-436a-b118-bc3c29ee435c', 'amlClientSessionId': 'b2739a00-4a8b-487b-9ac8-aa49ce70f7e8', 'subscriptionId': 'a022d83d-6229-4b5e-b039-b680692436b5', 'estimator': 'NoneType', 'samplingMethod': 'RANDOM', 'terminationPolicy': 'Bandit', 'primaryMetricGoal': 'maximize', 'maxTotalRuns': 8, 'maxConcurrentRuns': 4, 'maxDurationMinutes': 10080, 'vmSize': None}, 'snapshotId': '185e47a0-47a3-4a28-a433-77aa4773e99b', 'snapshots': [], 'sourceCodeDataReference': None, 'parentRunId': None, 'dataContainerId': None, 'runType': None, 'displayName': None, 'environmentAssetId': None, 'properties': {}, 'tags': {}, 'aggregatedArtifactPath': None}, 'logFiles': {'azureml-logs/hyperdrive.txt': 
'https://udacitymlazure9709269552.blob.core.windows.net/azureml/ExperimentRun/dcid.HD_3e829474-1309-43d0-a49b-412e08bc6ae0/azureml-logs/hyperdrive.txt?sv=2019-07-07&sr=b&sig=ufrLNXSCk259vE9d9OxwslQJ8F02SJjNcsg1pDOeBHQ%3D&skoid=fca8080a-91d7-47f4-9bee-f9c2cdae4035&sktid=41b3409a-40cf-45e8-8201-ddfa1984ab3d&skt=2024-03-22T21%3A06%3A58Z&ske=2024-03-24T05%3A16%3A58Z&sks=b&skv=2019-07-07&st=2024-03-22T21%3A10%3A35Z&se=2024-03-23T05%3A20%3A35Z&sp=r'}, 'submittedBy': 'Tony Patton'}
The following screen shows the code/model running.
Once the model is finished, the following screen shows the details of the run.
AutoML's best-performing model is VotingEnsemble, with 0.82 accuracy and 0.96 weighted AUC.
The model was registered:
We compare the AutoML run above with a traditional ML approach: a LogisticRegression model whose hyperparameters are tuned with HyperDrive.
The following parameters were used:
param_sampling = RandomParameterSampling({
    '--C': choice(0.1, 0.2, 0.3),
    '--max_iter': choice(10, 20, 30)
})
Early termination policy: BanditPolicy defines an early termination policy based on a slack criterion and a frequency interval for evaluation. Any run whose primary metric does not fall within the specified slack factor of the best-performing run is terminated.
early_termination_policy = BanditPolicy(evaluation_interval=5, slack_factor=0.1)
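The slack rule can be illustrated with a little arithmetic. The sketch below is not part of the project code; it assumes a metric that is being maximized, for which the cutoff is the best metric divided by (1 + slack_factor). The helper names and the example accuracies are hypothetical.

```python
def bandit_cutoff(best_metric: float, slack_factor: float) -> float:
    """Lowest metric a run may report without being cancelled (maximize goal)."""
    return best_metric / (1.0 + slack_factor)


def should_terminate(run_metric: float, best_metric: float,
                     slack_factor: float = 0.1) -> bool:
    """True when the run falls outside the slack allowed around the best run."""
    return run_metric < bandit_cutoff(best_metric, slack_factor)


# Illustrative values only: a best accuracy of 0.75 with slack_factor=0.1
# gives a cutoff of roughly 0.68.
cutoff = bandit_cutoff(0.75, 0.1)
print(f"cutoff={cutoff:.4f}")
print(should_terminate(0.65, 0.75))  # below the cutoff, so it would be cancelled
print(should_terminate(0.70, 0.75))  # inside the slack, so it keeps running
```

With evaluation_interval=5, this check is only applied at every fifth reporting of the metric.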
ScriptRunConfig is used to set up the script configuration for the runs.
src = ScriptRunConfig(source_directory="./",
                      script='train.py',
                      compute_target=cpu_cluster,
                      environment=sklearn_env)
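HyperDrive hands each sampled value to the training script as a command-line argument, so train.py has to accept --C and --max_iter. The following is a minimal sketch of that parsing, not the author's actual script; the defaults and help strings are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the two hyperparameters that HyperDrive samples for each run."""
    parser = argparse.ArgumentParser(description="Train a LogisticRegression model")
    # Assumed defaults; they mirror sklearn's own LogisticRegression defaults.
    parser.add_argument("--C", type=float, default=1.0,
                        help="Inverse of regularization strength")
    parser.add_argument("--max_iter", type=int, default=100,
                        help="Maximum number of solver iterations")
    return parser.parse_args(argv)


# Simulate the arguments HyperDrive would pass for one sampled configuration.
args = parse_args(["--C", "0.2", "--max_iter", "30"])
print(args.C, args.max_iter)
```

The parsed values would then be passed to LogisticRegression(C=args.C, max_iter=args.max_iter) inside the script.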
To initialize the HyperDriveConfig class, we specify the following:
hyperdrive_run_config = HyperDriveConfig(run_config=src,
                                         hyperparameter_sampling=param_sampling,
                                         policy=early_termination_policy,
                                         primary_metric_name='Accuracy',
                                         primary_metric_goal=PrimaryMetricGoal.MAXIMIZE,
                                         max_total_runs=8,
                                         max_concurrent_runs=4)
Hyperparameter space: RandomParameterSampling defines random sampling over the hyperparameter search space. Its advantages are that it does not have to try every combination exhaustively and that it does not bias the search toward any part of the space. It is a good first choice.
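To make the size of the search concrete, the sketch below enumerates the full grid for the two choice lists and then draws eight random configurations, matching max_total_runs=8. It is plain Python rather than the HyperDrive implementation: the helper name and the seed are hypothetical, and this version samples with replacement, so it may repeat a combination.

```python
import itertools
import random

# The two discrete choices defined for the HyperDrive run.
search_space = {
    "--C": [0.1, 0.2, 0.3],
    "--max_iter": [10, 20, 30],
}

# Exhaustive grid: every combination of the two lists (3 x 3).
all_combinations = list(itertools.product(*search_space.values()))


def sample_configuration(space, rng):
    """Pick one value per hyperparameter, independently and uniformly."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
sampled_runs = [sample_configuration(search_space, rng) for _ in range(8)]

print(len(all_combinations), "possible combinations")
for config in sampled_runs:
    print(config)
```

The grid size agrees with the 'space_size': '9' value reported in the run details.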
An important step is performed in the train.py file, where the data is preprocessed by removing missing values and transforming the columns:
- OrdinalEncoder on the species column.
- Categorical encoder/transformer on the island column.
- Numeric transformer on the rest of the columns.
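The preprocessing steps listed above can be sketched with sklearn as follows. This is an illustration under assumptions, not the contents of train.py: the toy rows are invented (they are not taken from penguins.csv), the sex column is left out for brevity, and the choice of OneHotEncoder, a median SimpleImputer, and StandardScaler is a guess at what the categorical and numeric transformers might be.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Toy rows invented for illustration; NOT real records from penguins.csv.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "island": ["Torgersen", "Dream", "Biscoe", "Biscoe", "Dream", "Dream"],
    "culmen_length_mm": [39.0, 38.5, 47.0, None, 49.5, 50.0],
    "culmen_depth_mm": [18.5, 18.0, 14.5, 15.0, 18.5, 19.0],
    "flipper_length_mm": [185.0, 190.0, 215.0, 220.0, 195.0, 200.0],
    "body_mass_g": [3700.0, 3600.0, 5000.0, 5200.0, 3700.0, 3800.0],
})
numeric_cols = ["culmen_length_mm", "culmen_depth_mm",
                "flipper_length_mm", "body_mass_g"]

# Target: map species names to integer codes (categories sort alphabetically).
y = OrdinalEncoder().fit_transform(toy[["species"]]).ravel()
features = toy.drop(columns="species")

preprocess = ColumnTransformer([
    # Assumed categorical transformer for the island column.
    ("island", OneHotEncoder(handle_unknown="ignore"), ["island"]),
    # Assumed numeric handling: fill gaps with the median, then scale.
    ("numeric", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(C=0.1, max_iter=30)),
])
model.fit(features, y)
predictions = model.predict(features)
print(predictions)
```

Keeping the transformers and the classifier in one Pipeline means the same preprocessing is applied at training and at scoring time.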
Parameters:
The following screen shows the parameters set up via Python:
The HyperDrive approach with the specified parameters resulted in 0.75 accuracy.
Hyperdrive run output:
[2024-03-20T18:38:01.601970][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space
[2024-03-20T18:38:02.2603411Z][SCHEDULER][INFO]Scheduling job, id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_1'
[2024-03-20T18:38:02.2265898Z][SCHEDULER][INFO]Scheduling job, id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_0'
[2024-03-20T18:38:02.3812686Z][SCHEDULER][INFO]Scheduling job, id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_2'
[2024-03-20T18:38:02.4720932Z][SCHEDULER][INFO]Scheduling job, id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_3'
[2024-03-20T18:38:02.442419][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.
[2024-03-20T18:38:02.9263687Z][SCHEDULER][INFO]Successfully scheduled a job. Id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_0'
[2024-03-20T18:38:03.0099833Z][SCHEDULER][INFO]Successfully scheduled a job. Id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_2'
[2024-03-20T18:38:02.9645510Z][SCHEDULER][INFO]Successfully scheduled a job. Id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_1'
[2024-03-20T18:38:03.2713601Z][SCHEDULER][INFO]Successfully scheduled a job. Id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_3'
[2024-03-20T18:40:01.141075][GENERATOR][INFO]Trying to sample '4' jobs from the hyperparameter space
[2024-03-20T18:40:01.5384437Z][SCHEDULER][INFO]Scheduling job, id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_4'
[2024-03-20T18:40:01.6793582Z][SCHEDULER][INFO]Scheduling job, id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_5'
[2024-03-20T18:40:01.7996859Z][SCHEDULER][INFO]Successfully scheduled a job. Id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_4'
[2024-03-20T18:40:01.8064952Z][SCHEDULER][INFO]Scheduling job, id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_6'
[2024-03-20T18:40:01.9185047Z][SCHEDULER][INFO]Scheduling job, id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_7'
[2024-03-20T18:40:01.9174646Z][SCHEDULER][INFO]Successfully scheduled a job. Id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_5'
[2024-03-20T18:40:01.822364][GENERATOR][INFO]Successfully sampled '4' jobs, they will soon be submitted to the execution target.
[2024-03-20T18:40:02.0763872Z][SCHEDULER][INFO]Successfully scheduled a job. Id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_6'
[2024-03-20T18:40:02.1592897Z][SCHEDULER][INFO]Successfully scheduled a job. Id='HD_8ce5ea1a-0a6b-4f56-9230-12e6f697dd7d_7'
[2024-03-20T18:40:31.154938][GENERATOR][INFO]Max number of jobs '8' reached for experiment.
[2024-03-20T18:40:31.300672][GENERATOR][INFO]All jobs generated.
[2024-03-20T18:41:33.4559368Z][CONTROLLER][INFO]Changing Run Status from Running to Completed
Best Hyperdrive Model:
The model was registered as hd_best_model, visible under Models in Azure ML Studio.
AutoML delivered the best-performing model, with 0.82 accuracy and 0.96 weighted AUC, using the VotingEnsemble algorithm. The model was registered and rolled out within Azure ML Studio.
Call the endpoint:
The endpoint is called by posting a request to the service with the requests library. Test data is taken as a sample from the overall dataset, formatted as JSON, and passed to the web service URI along with the necessary headers.
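A minimal sketch of such a call is shown below. The scoring URI is a placeholder, the sample row is invented rather than taken from the dataset, and the {"data": [...]} payload shape is an assumption about what the scoring script expects; check score_it.py for the actual input format.

```python
import json

# Hypothetical placeholder; the real URI comes from the deployed web service.
SCORING_URI = "http://<your-endpoint>/score"


def build_request(sample_rows):
    """Serialize sample rows to JSON and build the headers for the POST."""
    body = json.dumps({"data": sample_rows})  # assumed payload shape
    headers = {"Content-Type": "application/json"}
    return body, headers


def call_endpoint(uri, sample_rows):
    """POST the sample rows to the service and return the parsed response."""
    import requests  # third-party; imported here so the payload code runs without it

    body, headers = build_request(sample_rows)
    response = requests.post(uri, data=body, headers=headers)
    response.raise_for_status()
    return response.json()


# Invented row for illustration; column names follow the dataset description.
sample = [{
    "island": "Biscoe",
    "culmen_length_mm": 47.0,
    "culmen_depth_mm": 14.5,
    "flipper_length_mm": 215.0,
    "body_mass_g": 5000.0,
    "sex": "FEMALE",
}]
body, headers = build_request(sample)
print(body)
# call_endpoint(SCORING_URI, sample) would send the request to a live service.
```

If authentication is enabled on the service, an Authorization header carrying the service key would also be needed.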
Endpoint log info from the test:
Delete the endpoint:
Files:
- score_it.py : Scoring file for best run
- envfile.yml : Environment file for best run
- auto_best_run.pkl : Best run model from AutoML saved
Screen recording of the following items --> https://youtu.be/Ixdtd0__eFM
The screen recording covers these points:
- Overview and comparison of two models - 1 from HyperDrive and 1 from AutoML.
- A working model which is the best model observed from AutoML run.
- Demo of the deployed model
- Demo of a sample request sent to the endpoint and its response
- Reworking the HyperDrive model and its parameters with Keras/TensorFlow might improve accuracy; this has not been tested.
- Vary the AutoML parameters across several runs and track the resulting accuracy to identify a trend.