This directory contains the documentation and configuration files needed to help onboard your Deeprank model to AML.
Please ensure you join the "aml-deeprank" Security Group on idweb.
Scope:
- Regular training
- Distributed training
- PRS
- Hyperparameter optimization (sweep)
For information on installation and job submission, please see here.
Steps:
- Set up a compute instance and open a Jupyter notebook
- Run the notebook aml-deeprank.ipynb
- All the deeprank supported models can be found in the deeprank/configs folder
- For every model deeprank supports, there is a separate JSON file for each type of job submission (regular training, distributed training, sweep, or PRS job) in the aml folder for that model. Example: deeprank/configs/meb/qr_embedding_bag_nested/aml/regular_job_adls_mount.json
- You will need to customize the JSON file to set the input paths, output paths, and user command. The customized JSON is then used to submit regular training, distributed training, inferencing (PRS), sweep, or scope jobs to AML.
Parameter definitions in the JSON configuration file (an illustrative sketch follows the list):
- module - In the module section, specify the type of job you would like to run (training, distributed, PRS, etc.).
- inputs - Each input path is a combination of a datastore and a path (directory).
- datastore - Datastores are attached to workspaces and store connection information to Azure storage services, so you can refer to them by name without remembering the connection information and secrets used to connect to the storage services.
- outputs - The location where all outputs of the run will be uploaded.
- user_command - The deeprank command to run. Be sure to reference the input paths and output path that you specified in the JSON file, and include any Python script arguments here.
- target - A designated compute resource or environment where you run your training script or host your service deployment.
- instance_count - GPU count
- instance_type - Virtual Machine sizes or SKUs
- process_count_per_node - Number of processes executed on each node. (optional, default value is number of cores on node.)
- node_count - Number of nodes in the compute target used for running the ParallelRunStep.
- error_threshold - The number of record failures for TabularDataset and file failures for FileDataset that should be ignored during processing. If the error count goes above this value, then the job will be aborted. Error threshold is for the entire input and not for individual mini-batches sent to run() method. The range is [-1, int.max]. -1 indicates ignore all failures during processing.
- mini_batch_size - The amount of data passed to each invocation of the run() method: for FileDataset input, the number of files; for TabularDataset input, the approximate size of data.
- logging_level - A string of the logging level name, which is defined in 'logging'. Possible values are 'WARNING', 'INFO', and 'DEBUG'. (optional, default value is 'INFO'.)
- run_max_try - The maximum number of retries for a mini-batch that fails or times out.
- run_invocation_timeout - Timeout in seconds for each invocation of the run() method. (optional, default value is 60.)
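To make the definitions above concrete, here is a minimal sketch of what a regular training job configuration might contain, written as a Python dict that is dumped to JSON. Only the field names come from the definitions above; the nesting, input/output names, datastore name, paths, command, compute target, and VM SKU are illustrative placeholders, and the `{name}` substitution syntax in the command is an assumption, not a confirmed deeprank convention.

```python
import json

# Hypothetical regular-training config; field names follow the parameter
# definitions above, all values are illustrative placeholders.
regular_job = {
    "module": "training",                  # type of job to run
    "inputs": {
        "train_data": {                    # hypothetical input name
            "datastore": "deeprank_adls",  # placeholder datastore name
            "path": "datasets/meb/train",  # placeholder directory
        },
        "valid_data": {
            "datastore": "deeprank_adls",
            "path": "datasets/meb/valid",
        },
    },
    "outputs": {
        "model_dir": {                     # run outputs are uploaded here
            "datastore": "deeprank_adls",
            "path": "outputs/qr_embedding_bag_nested",
        },
    },
    # Reference the inputs/outputs declared above in the deeprank command;
    # the {name} substitution syntax is an assumption, not confirmed.
    "user_command": (
        "python train.py --train {train_data} "
        "--valid {valid_data} --out {model_dir}"
    ),
    "target": "itp-gpu-cluster",           # placeholder compute target
    "instance_count": 4,
    "instance_type": "Standard_ND40rs_v2", # placeholder VM SKU
}

with open("regular_job_adls_mount.json", "w") as f:
    json.dump(regular_job, f, indent=2)
```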
In Aether, by replacing your ITP training subgraph with a single Deeprank AML ITP training module, you can continue to submit to the ITP cluster and/or AML compute and get all the added benefits of AML.
Current Training Module submitted to ITP
AML Deeprank Training Module (Work in Progress)
There will be a new Aether module that you can use to specify your input and output paths, similarly to what was done in approach #2. The AML module will have a command parameter and 10 input paths as parameters so that you can reference the input paths in the command.
[need screenshot]
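As a rough illustration of the command-plus-input-paths idea (the module is still work in progress, so the parameter names input_path_1 through input_path_10 and the paths below are hypothetical), the command parameter might reference the module's input path parameters like this:

```python
# Hypothetical Aether module parameters: a command plus numbered input paths.
# The names input_path_1 ... input_path_10 are placeholders, not the real
# parameter names of the (work-in-progress) module.
module_params = {
    "command": "python -m deeprank.train --data {input_path_1} --vocab {input_path_2}",
    "input_path_1": "adls://placeholder/datasets/train",
    "input_path_2": "adls://placeholder/datasets/vocab",
    # input_path_3 ... input_path_10 left unset in this sketch
}

# Substitute the input path parameters into the command string.
resolved = module_params["command"].format(**module_params)
print(resolved)
```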
Benefits:
- Improved experiment tracking
- Automated hyperparameter tuning
- Increased resource utilization (AML Compute + existing ITP compute)
- Replace long commands with configs and overridable params
In Aether, by replacing your inferencing subgraph with a single Deeprank AML ITP inferencing module, you can continue to submit to the ITP cluster and/or AML compute and get all the added benefits of AML.
AML Deeprank PRS Module [need screenshot]
Configurable Parallelization
- Configurable parallelization lets you specify numNodes and numGPUs, which can reduce Aether graph complexity by up to 100x (see the sketch after the benefits list below).
Benefits:
- Increased resource utilization (AML Compute + existing ITP)
- Automatic retries and non-redundant code/modules for batch inferencing
- Aether graph for inferencing simplified by 100x
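For reference, here is a hedged sketch of how the PRS-related parameters from the parameter definitions above might be set. The "prs" module value and all numbers are illustrative assumptions, not values from a real config.

```python
import json

# Hypothetical PRS job settings using the ParallelRunStep-related parameters
# defined in the parameter list above; all values are illustrative.
prs_job = {
    "module": "prs",                # assumed job-type value
    "node_count": 4,                # nodes used by the ParallelRunStep
    "process_count_per_node": 8,    # processes per node (default: cores on node)
    "mini_batch_size": "10",        # e.g., files per run() call for FileDataset
    "error_threshold": -1,          # -1 ignores all record/file failures
    "logging_level": "INFO",
    "run_invocation_timeout": 600,  # seconds per run() invocation
    "run_max_try": 3,               # retries for a failed/timed-out mini-batch
}

print(json.dumps(prs_job, indent=2))
```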
Hyperparameter tuning, also called hyperparameter optimization, is the process of finding the hyperparameter configuration that results in the best performance. The process is typically computationally expensive and manual. Use this notebook to walk through how to set up your resources and learn how to submit an AML pipeline using the sweep component. The Hyperparameter Optimization section of the notebook covers how to configure and submit a pipeline sweep job. See here for more details on the Sweep component.
Use the configuration file sweep.json to edit the parameters and settings for your job. This file outlines how to do the following tasks (an illustrative sketch of sweep.json follows the list).
- Define the parameter search space. In this example, we are tuning on batch_size_per_gpu and learning_rate.
- Specify a primary metric to optimize. In this example, we are optimizing the metric onedcg_3.
- Specify an early termination policy for low-performing runs
- Create and assign resources
- Launch an experiment with the defined configuration
- Visualize the training runs
- Select the best configuration for your model
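The sketch below shows what sweep.json might contain for the tasks above. Only batch_size_per_gpu, learning_rate, and onedcg_3 come from this document; the section names, the bandit policy, the value ranges, and the trial budget are assumptions modeled on general AML hyperparameter-sweep concepts, not the actual deeprank sweep.json schema.

```python
import json

# Hypothetical sweep configuration; structure and key names are assumptions
# based on AML sweep concepts, not the actual deeprank sweep.json schema.
sweep = {
    # Parameter search space: the two hyperparameters named in this document
    "search_space": {
        "batch_size_per_gpu": {"type": "choice", "values": [16, 32, 64]},
        "learning_rate": {"type": "uniform", "min": 1e-5, "max": 1e-3},
    },
    # Primary metric to optimize (metric name from this document)
    "objective": {
        "primary_metric": "onedcg_3",
        "goal": "maximize",
    },
    # Early termination for low-performing runs (policy name/values assumed)
    "early_termination": {
        "policy": "bandit",
        "slack_factor": 0.1,
        "evaluation_interval": 1,
    },
    "max_total_trials": 20,  # assumed trial budget knob
}

with open("sweep.json", "w") as f:
    json.dump(sweep, f, indent=2)
```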
- After you submit your training job, you can visualize your hyperparameter tuning runs in the Azure Machine Learning studio UI, or you can use a notebook widget.
- In the Experiments tab, if you submitted a pipeline run, select the run and then select the Steps tab.
- Navigate to the Child runs tab to view each hyperdrive child run. This visualization tracks the metrics logged for each hyperdrive child run over the duration of hyperparameter tuning. Each line represents a child run, and each point measures the primary metric value at that iteration of runtime.
[Coming soon]
Contact Aashna Garg (aagarg@microsoft.com) or Shané Winner (shwinne@microsoft.com) with any questions or feedback.
Contacts for issues with the respective areas:
- ADLS issues:
- ComponentSDK issues:
- Sweep issues:
- PRS issues: