Skip to content

Azure/cyclecloud-scalelib

Repository files navigation

Azure CycleCloud Autoscaling Library

The cyclecloud-scalelib project provides Python helpers to simplify autoscaler development for any scheduler in Azure using Azure CycleCloud and the Azure CycleCloud REST API to orchestrate resource creation in Microsoft Azure.

Autoscale Example

The primary use-case of this library is to facilitate and standardize scheduler autoscale integrations. An example of such an integration with Celery is included in this project.

Building the project

The cyclecloud-scalelib project is generally used in a Python 3 virtualenv and has several standard python dependencies, but it also depends on the Azure CycleCloud Python Client Library.

Pre-requisites

The instructions below assume that:

  • you have python 3 available on your system
  • you have access to an Azure CycleCloud installation

Before attempting to build the project, obtain a copy of the Azure CycleCloud Python Client library. You can get the wheel distribution from the /opt/cycle_server/tools/ directory in your Azure CycleCloud installation or you can download the wheel from the CycleCloud UI following the instructions here.

The instructions below assume that you have copied the cyclecloud-api.tar.gz to your working directory.

Creating the virtualenv

    # If Cyclecloud is installed on the current machine:
    # cp /opt/cycle_server/tools/cyclecloud_api*.whl .

    python3 -m venv ~/.virtualenvs/autoscale/
    . ~/.virtualenvs/autoscale/bin/activate
    pip install -r ./dev-requirements.txt
    pip install ./cyclecloud_api*.whl
    python setup.py build
    pip install -e .

Testing the project:

The project includes several helpers for contributors to validate, test and format changes to the code.

    # OPTIONAL: use the following to type check / reformat code
    python setup.py types
    python setup.py format
    python setup.py test

Resources

Default Resources

The cyclecloud-scalelib application matches scheduler resources to azure cloud resources to provide rich autoscaling and cluster configuration tools. We call these default_resources because they exist even for nodes that have not been materialized, versus resources that are defined after the node has joined the cluster, and potentially overrides these.

Here is an example for matching PBSPro:

{"default_resources": [
   {
      "select": {},
      "name": "ncpus",
      "value": "node.vcpu_count"
   },
   {
      "select": {},
      "name": "group_id",
      "value": "node.placement_group"
   },
   {
      "select": {},
      "name": "host",
      "value": "node.hostname"
   },
   {
      "select": {},
      "name": "mem",
      "value": "node.memory"
   },
   {
      "select": {},
      "name": "vm_size",
      "value": "node.vm_size"
   },
   {
      "select": {},
      "name": "disk",
      "value": "size::20g"
   }]
}

Note that disk is currently hardcoded to size::20g because of platform limitations to determine how much disk a node will have. In the select statement, we can filter how the resources are applied, i.e. by VM Size or nodearray etc. Here is an example of handling VM Size specific disk size and gpus for a nodearray.

   {
      "select": {"node.vm_size": "Standard_F2"},
      "name": "disk",
      "value": "size::20g"
   },
   {
      "select": {"node.vm_size": "Standard_H44rs"},
      "name": "disk",
      "value": "size::2t"
   },
   {
      "select": {"node.nodearray": "gpuarray"},
      "name": "ngpus",
      "value": 8
   }

Note that these are applied in order , and once a default value is defined for a matching potential node, other defaults are ignored. This means that you should always use your most restrictive select filters first.

In other words, if we want to override the pcpu_count for just one nodearray, doing this will work:

{
 "select": "node.nodearray": "special-nodearray",
 "name": "ncpus",
 "value": 42
},
{"select": {},
"name": "ncpus",
"value": "node.pcpu_count"
}

However, this will ignore the second definition.

{"select": {},
"name": "ncpus",
"value": "node.pcpu_count"
},
{
 "select": "node.nodearray": "special-nodearray",
 "name": "ncpus",
 "value": 42
},

Node Properties

Property Type Description
node.bucket_id BucketId UUID for this combination of NodeArray, VM Size and Placement Group
node.colocated bool Will this node be put into a VMSS with a placement group, allowing infiniband
node.cores_per_socket int CPU cores per CPU socket
node.create_time datetime When was this node created, as a datetime
node.create_time_remaining float How much time is remaining for this node to reach the Ready state
node.create_time_unix float When was this node created, in unix time
node.delete_time datetime When was this node deleted, as datetime
node.delete_time_unix float When was this node deleted, in unix time
node.exists bool Has this node actually been created in CycleCloud yet
node.gpu_count int GPU Count
node.hostname Optional[Hostname] Hostname for this node. May be None if the node has not been given one yet
node.hostname_or_uuid Optional[Hostname] Hostname or a UUID. Useful when partitioning a mixture of real and potential nodes by hostname
node.infiniband bool Does the VM Size of this node support infiniband
node.instance_id Optional[InstanceId] Azure VM Instance Id, if the node has a backing VM.
node.keep_alive bool Is this node protected by CycleCloud to prevent it from being terminated.
node.last_match_time datetime The last time this node was matched with a job, as datetime.
node.last_match_time_unix float The last time this node was matched with a job, in unix time.
node.location Location Azure location for this VM, e.g. westus2
node.memory Memory Amount of memory, as reported by the Azure API. OS reported memory will differ.
node.name NodeName Name of the node in CycleCloud, e.g. execute-1
node.nodearray NodeArrayName NodeArray name associated with this node, e.g. execute
node.pcpu_count int Physical CPU count
node.placement_group Optional[PlacementGroup] If set, this node is put into a VMSS where all nodes with the same placement group are tightly coupled
node.private_ip Optional[IpAddress] Private IP address of the node, if it has one.
node.spot bool If true, this node is taking advantage of unused capacity for a cheaper rate
node.state NodeStatus State of the node, as reported by CycleCloud.
node.vcpu_count int Virtual CPU Count
node.version str Internal version property to handle upgrades
node.vm_family VMFamily Azure VM Family of this node, e.g. standardFFamily
node.vm_size VMSize Azure VM Size of this node, e.g. Standard_F2

Constraints

And

The logical 'and' operator ensures that all of its child constraints are met.

{"and": [{"ncpus": 1}, {"mem": "memory::4g"}]}

Note that and is implied when combining multiple resource definitions in the same dictionary. e.g. the following have identical semantic meaning, the latter being shorthand for the former.

{"and": [{"ncpus": 1}, {"mem": "memory::4g"}]}
{"ncpus": 1, "mem": "memory::4g"}

ExclusiveNode

Defines whether, when allocating, if a node will exclusively run this job.

{"exclusive": true}

-> One and only one iteration of the job can run on this node.

{"exclusive-task": true}

-> One or more iterations of the same job can run on this node.

InAPlacementGroup

Ensures that all nodes allocated will be in any placement group or not. Typically this is most useful to prevent a job from being allocated to a node in a placement group.

{"in-a-placement-group": true}
{"in-a-placement-group": false}

MinResourcePerNode

Filters for nodes that have at least a certain amount of a resource left to allocate.

{"ncpus": 1}
{"mem": "memory::4g"}
{"ngpus": 4}

Or, shorthand for combining the above into one expression

{"ncpus": 1, "mem": "memory::4g", "ngpus": 4}

Never

Rejects every node. Most useful when generating a complex node constraint that cannot be determined to be satisfiable until it is generated. For example, say a scheduler supports an 'excluded_users' list for scheduler specific "projects". When constructing a set of constraints you may realize that this user will never be able to run a job on a node with that project.

{"or":
    [{"project": "open"},
     {"project": "restricted",
      "never": "User is denied access to this project"}
    ]
}

NodePropertyConstraint

Similar to NodeResourceConstraint, but these are constraints based purely on the read only node properties, i.e. those starting with 'node.'

{"node.vm_size": ["Standard_F16", "Standard_E32"]}
{"node.location": "westus2"}
{"node.pcpu_count": 44}

Note that the last example does not allocate 44 node.pcpu_count, but simply matches nodes that have a pcpu_count of exactly 44.

NodeResourceConstraint

These are constraints that filter out which node is matched based on read-only resources.

{"custom_string1": "custom_value"}
{"custom_string2": ["custom_value1", "custom_value2"]}

For read-only integers you can programmatically call NodeResourceConstraint("custom_int", 16) or use a list of integers

{"custom_integer": [16, 32]}

For shorthand, you can combine the above expressions

{
    "custom_string1": "custom_value",
    "custom_string2": ["custom_value1", "custom_value2"],
    "custom_integer": [16, 32]
}

Not

Logical 'not' operator negates the single child constraint.

Only allocate machines with GPUs

{"not": {"node.gpu_count": 0}}

Only allocate machines with no GPUs available

{"not": {"ngpus": 1}}

Or

Logical 'or' for matching a set of child constraints. Given a list of child constraints, the first constraint that matches is the one used to decrement the node. No further constraints are considered after the first child constraint has been satisfied. For example, say we want to use a GPU instance if we can get a spot instance, otherwise we want to use a non-spot CPU instance.

{"or": [{"node.vm_size": "Standard_NC6", "node.spot": true},
        {"node.vm_size": "Standard_F16", "node.spot": false}]
}

SharedConsumableConstraint

Represent a shared consumable resource, for example a queue quota or number of licenses. Please use the SharedConsumableResource object to represent this resource.

While there is a json representation of this object, it is up to the author to create the SharedConsumableResources programmatically so programmatic creation of this constraint is recommended.

# global value
SHARED_RESOURCES = [SharedConsumableConstraint(resource_name="licenses",
                                               source="/path/to/license_limits",
                                               initial_value=1000,
                                               current_value=1000)]
    
def make_constraint(value: int) -> SharedConsumableConstraint:
    return SharedConsumableConstraint(SHARED_RESOURCES, value)

SharedNonConsumableConstraint

Similar to a SharedConsumableConstraint, except that the resource that is shared is not consumable (like a string etc). Please use the SharedNonConsumableResource object to represent this resource.

While there is a json representation of this object, it is up to the author to create the SharedNonConsumableResource programmatically so programmatic creation of this constraint is recommended.

# global value
SHARED_RESOURCES = [SharedNonConsumableResource(resource_name="prodversion",
                                                source="/path/to/prod_version",
                                                current_value="1.2.3")]
    
def make_constraint(value: str) -> SharedConstraint:
    return SharedConstraint(SHARED_RESOURCES, value)

XOr

Similar to the or operation, however one and only one of the child constraints may satisfy the node. Here is a trivial example where we have a failover for allocating to a second region but we ensure that only one of them is valid at a time.

{"xor": [{"node.location": "westus2"},
         {"node.location": "eastus"}]
}

Timeouts

By default we set idle and boot timeouts across all nodes.

   "boot_timeout": 3600

You can also set these per nodearray.

   "boot_timeout": {"default": 3600, "nodearray1": 7200, "nodearray2": 900},

Incorrect VM Size Information

In some regions or subscriptions, CycleCloud cannot get the proper VM Size information for all VM Sizes. Often this results in an incorrect number of GPUs being reported, otherwise all attributes are incorrect. By default, as of 1.0.2, an internal record of all public regions and vm_sizes - hpc/autoscale/node/vm_sizes.json - will fallback on a common US region for US Gov/DOD regions.

At the top of this file, you will find the following

{
  "proxied-locations": {
    "_comment_": "This is a mapping of locations that are not available in the Azure API, but are proxied to another location.",
    "usdodcentral": "southcentralus",
    "usdodeast": "southcentralus",
    "usdodtexas": "southcentralus",
    "usgovarizona": "southcentralus",
    "usgoviowa": "southcentralus",
    "usgovtexas": "southcentralus",
    "usgovvirginia": "southcentralus",
    "usseceast": "southcentralus",
    "ussecwest": "southcentralus"
  },

This states that for these locations we should just use the data on hand for southcentralus. A user can modify this however they want after installation, there is no requirement that these map to southcentralus.

Note Please remember that you can also always define a default_resource for your gpu resource with an explicit integer. This is the preferred way to deal with this issue, however this can become painful when dealing with many VM Sizes that are simply missing basic information.

Lastly, it should be noted that if the GPUs defined in this file are higher than CycleCloud reports, then this file takes precedence. This is due to the subscription that CycleCloud is using is being told that the VM Size has 0 GPUs in some locked down regions when the GPU count should be higher.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.