
Add document for bash script.sh cannot do conda activate problem #422

Merged (15 commits) on Feb 24, 2022
21 changes: 17 additions & 4 deletions README.md
@@ -13,19 +13,32 @@ sky launch -c mycluster hello_sky.yaml

```yaml
# hello_sky.yaml

resources:
  # Optional; if left out, pick from the available clouds.
  cloud: aws

  # Get 1 K80 GPU. Use <name>:<n> to get more (e.g., "K80:8").
  accelerators: K80

# Working directory (optional) containing the project codebase.
# This directory will be synced to ~/sky_workdir on the provisioned cluster.
workdir: .

setup: |
  # Typical use: pip install -r requirements.txt
  echo "running setup"
  # If using a `my_setup.sh` script that requires conda,
  # invoke it as below to ensure `conda activate` works:
  # bash -i my_setup.sh

run: |
  # Typical use: make use of resources, such as running training.
  echo "hello sky!"
  conda env list
  # If using a `my_run.sh` script that requires conda,
  # invoke it as below to ensure `conda activate` works:
  # bash -i my_run.sh
```
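The `bash -i` hint above can be demonstrated without conda: a non-interactive `bash script.sh` skips `~/.bashrc` (the file where `conda init` installs the hook that defines `conda activate`), while `bash -i` sources it first. A self-contained sketch, with a `greet` function standing in for `conda activate` (the temp-dir setup and names are ours, for illustration only):

```shell
# Stand-in demo: `greet` plays the role of `conda activate`, a shell function
# that only exists after ~/.bashrc has been sourced.
tmp_home=$(mktemp -d)
echo 'greet() { echo "hello from bashrc"; }' > "$tmp_home/.bashrc"
printf '%s\n' 'if declare -f greet >/dev/null; then greet; else echo "greet: not found"; fi' \
  > "$tmp_home/script.sh"

# Non-interactive shell: ~/.bashrc is NOT sourced, so the function is missing.
without_i=$(HOME="$tmp_home" bash "$tmp_home/script.sh")
# Interactive shell (-i): ~/.bashrc IS sourced before the script runs.
with_i=$(HOME="$tmp_home" bash -i "$tmp_home/script.sh" 2>/dev/null)

echo "bash    script.sh -> $without_i"
echo "bash -i script.sh -> $with_i"
```

The same mechanism explains why `bash -i my_setup.sh` makes `conda activate` work inside Sky's `setup` and `run` sections.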

## Getting Started
31 changes: 18 additions & 13 deletions docs/source/examples/distributed-jobs.rst
@@ -11,29 +11,34 @@ For example, here is a simple PyTorch Distributed training example:
name: resnet-distributed-app

resources:
  accelerators: V100

num_nodes: 2

setup: |
  pip3 install --upgrade pip
  git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
  cd pytorch-distributed-resnet && pip3 install -r requirements.txt
  mkdir -p data && mkdir -p saved_models && cd data && \
    wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
  tar -xvzf cifar-10-python.tar.gz

run: |
  cd pytorch-distributed-resnet

  num_nodes=`echo "$SKY_NODE_IPS" | wc -l`
  master_addr=`echo "$SKY_NODE_IPS" | sed -n 1p`
  python3 -m torch.distributed.launch --nproc_per_node=1 \
    --nnodes=$num_nodes --node_rank=${SKY_NODE_RANK} --master_addr=$master_addr \
    --master_port=8008 resnet_ddp.py --num_epochs 20

Review comment (Member): Does the simpler `head -n1` do the job?

Reply (@Michaelvll, Collaborator, author, Feb 24, 2022): Yeah, but I was planning to hint the user how to get the IP of other workers. I changed it to `head -n 1` here and for the examples since it is simpler, and left the document for `$SKY_NODE_IPS` with a hint about how to retrieve the IP of node-k.
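For intuition, the launcher flags above combine into a global rank for each worker process; a small sketch of the arithmetic (the `global_rank` helper is ours for illustration, not a Sky or PyTorch API):

```shell
# global_rank = node_rank * nproc_per_node + local_rank
global_rank() {
  node_rank=$1; nproc_per_node=$2; local_rank=$3
  echo $((node_rank * nproc_per_node + local_rank))
}

# With --nnodes=2 and --nproc_per_node=1 as above, node 1's single worker
# is global rank 1 (node 0's worker is rank 0).
global_rank 1 1 0
```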

In the above, :code:`num_nodes: 2` specifies that this task is to be run on 2
nodes. The commands in :code:`run` are executed on both nodes. Several useful
environment variables are available to distinguish per-node commands:

- :code:`SKY_NODE_RANK`: rank (an integer ID from 0 to :code:`num_nodes-1`) of
the node executing the task
- :code:`SKY_NODE_IPS`: a string of IP addresses of the nodes reserved to execute
  the task, where each line contains one IP address. You can retrieve the number of
  nodes with :code:`echo "$SKY_NODE_IPS" | wc -l` and the IP address of node-3 with
  :code:`echo "$SKY_NODE_IPS" | sed -n 3p`
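As a concrete sketch of these retrieval patterns (the IP addresses below are invented for illustration):

```shell
# Hypothetical SKY_NODE_IPS value: one IP address per line, as described above.
SKY_NODE_IPS='10.0.0.1
10.0.0.2
10.0.0.3'

num_nodes=$(echo "$SKY_NODE_IPS" | wc -l)        # number of nodes
master_addr=$(echo "$SKY_NODE_IPS" | head -n 1)  # node-1, e.g. the master
node3_ip=$(echo "$SKY_NODE_IPS" | sed -n 3p)     # node-3

echo "$num_nodes nodes, master=$master_addr, node-3=$node3_ip"
```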
10 changes: 5 additions & 5 deletions docs/source/examples/grid-search.rst
@@ -12,11 +12,11 @@ Submitting multiple trials with different hyperparameters is simple:

.. code-block:: bash

   $ # Launch 4 trials in parallel
   $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3
   $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3
   $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4
   $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2

   $ # Gets queued and will run once a GPU is available
   $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-6
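Since the four parallel trials above differ only in `--lr`, they can also be generated with a loop; this sketch just echoes the commands (a dry run) instead of invoking `sky`:

```shell
# Dry run: print one `sky exec` line per learning rate in the sweep.
cmds=$(for lr in 1e-3 3e-3 1e-4 1e-2; do
  echo "sky exec mycluster --gpus V100:1 -d -- python train.py --lr $lr"
done)
echo "$cmds"
```

Dropping the `echo` before `sky exec` turns the dry run into the real submissions.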
10 changes: 5 additions & 5 deletions docs/source/examples/iterative-dev-project.rst
@@ -48,8 +48,8 @@ Use the familiar scp/rsync to transfer files between your local machine and remo

.. code-block::

   $ rsync -Pavz my_code/ dev:/path/to/destination  # copy files to remote VM
   $ rsync -Pavz dev:/path/to/source my_code/       # copy files from remote VM
Sky **simplifies code syncing** by automatically transferring a working directory
to the cluster. The working directory can be configured with the
@@ -58,8 +58,8 @@ option:

.. code-block::

   $ sky launch --workdir=/path/to/code task.yaml
   $ sky exec --workdir=/path/to/code task.yaml
These commands sync the working directory to a location on the remote VM, and
the task is run under that working directory (e.g., to invoke scripts, access
@@ -80,4 +80,4 @@ To restart a stopped cluster:

.. code-block:: console

   $ sky start dev
60 changes: 30 additions & 30 deletions docs/source/getting-started/installation.rst
@@ -7,11 +7,11 @@ Install Sky using pip:

.. code-block:: console

   $ # Clone the sky codebase
   $ git clone ssh://git@github.com/sky-proj/sky.git
   $ cd sky
   $ # Sky requires python >= 3.6. 3.10+ is currently NOT supported.
   $ pip install ".[all]"
If you only want the dependencies for certain clouds, you can also use
:code:`".[aws,azure,gcp]"`.
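The Python constraint stated above (at least 3.6, below 3.10) can be checked up front; a small sketch (the `supported` helper is ours, not part of Sky):

```shell
# Does a given Python major.minor version satisfy 3.6 <= version < 3.10?
supported() {
  major=$1; minor=$2
  [ "$major" -eq 3 ] && [ "$minor" -ge 6 ] && [ "$minor" -lt 10 ]
}

supported 3 9 && echo "3.9: supported"
supported 3 10 || echo "3.10: not supported"
```

Running a check like this before `pip install ".[all]"` avoids a partially failed install on an unsupported interpreter.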
@@ -28,27 +28,27 @@ To get the **AWS Access Key** required by the :code:`aws configure`, please refe

.. code-block:: console

   $ # Install boto
   $ pip install boto3

   $ # Configure your AWS credentials
   $ aws configure
**GCP**:

.. code-block:: console

   $ pip install google-api-python-client
   $ # Install `gcloud`; see https://cloud.google.com/sdk/docs/quickstart
   $ conda install -c conda-forge google-cloud-sdk

   $ # Init.
   $ gcloud init

   $ # Run this if you don't have a credentials file.
   $ # This will generate ~/.config/gcloud/application_default_credentials.json.
   $ gcloud auth application-default login
If you encounter the error (*RemoveError: 'requests' is a dependency of conda and cannot be removed from conda's operating environment*) while running :code:`conda install -c conda-forge google-cloud-sdk`, try :code:`conda update --force conda` and run the install again.

@@ -57,12 +57,12 @@

.. code-block:: console

   $ # Install the Azure CLI
   $ pip install azure-cli==2.30.0
   $ # Login azure
   $ az login
   $ # Set the subscription to use
   $ az account set -s <subscription_id>
**Verifying cloud setup**

@@ -71,16 +71,16 @@ the CLI:

.. code-block:: console

   $ # Verify cloud account setup
   $ sky check
This will produce output verifying the correct setup of each supported cloud.

.. code-block:: text

   Checking credentials to enable clouds for Sky.
   AWS: enabled
   GCP: enabled
   Azure: enabled

   Sky will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
82 changes: 39 additions & 43 deletions docs/source/getting-started/quickstart.rst
@@ -31,23 +31,23 @@ Let's provision an instance with a single K80 GPU.

.. code-block:: bash

   # Provisions/reuses an interactive node with a single K80 GPU.
   # Any of the interactive node commands (gpunode, tpunode, cpunode)
   # will automatically log in to the cluster.
   sky gpunode -c mygpu --gpus K80

   Last login: Wed Feb 23 22:35:47 2022 from 136.152.143.101
   ubuntu@ip-172-31-86-108:~$ gpustat
   ip-172-31-86-108  Wed Feb 23 22:42:43 2022  450.142.00
   [0] Tesla K80 | 31°C, 0 % | 0 / 11441 MB |
   ubuntu@ip-172-31-86-108:~$
   ^D

   # View the machine in the cluster table.
   sky status

   NAME   LAUNCHED        RESOURCES                     COMMAND                          STATUS
   mygpu  a few secs ago  1x Azure(Standard_NC6_Promo)  sky gpunode -c mygpu --gpus K80  UP

After you are done, run :code:`sky down mygpu` to terminate the cluster. Find more details
on managing the lifecycle of your cluster :ref:`here <interactive-nodes>`.
@@ -78,37 +78,33 @@ requiring an NVIDIA Tesla K80 GPU on AWS. See more example yaml files in the `re

.. code-block:: yaml

   # hello_sky.yaml

   resources:
     # Optional; if left out, pick from the available clouds.
     cloud: aws

     # Get 1 K80 GPU. Use <name>:<n> to get more (e.g., "K80:8").
     accelerators: K80

   # Working directory (optional) containing the project codebase.
   # This directory will be synced to ~/sky_workdir on the provisioned cluster.
   workdir: .

   setup: |
     # Typical use: pip install -r requirements.txt
     echo "running setup"
     # If using a `my_setup.sh` script that requires conda,
     # invoke it as below to ensure `conda activate` works:
     # bash -i my_setup.sh

   run: |
     # Typical use: make use of resources, such as running training.
     echo "hello sky!"
     conda env list
     # If using a `my_run.sh` script that requires conda,
     # invoke it as below to ensure `conda activate` works:
     # bash -i my_run.sh

Sky handles selecting an appropriate VM based on user-specified resource
constraints, launching the cluster on an appropriate cloud provider, and
@@ -118,7 +114,7 @@ To launch a task based on our above YAML spec, we can use :code:`sky launch`.

.. code-block:: console

   $ sky launch -c mycluster hello_sky.yaml

The :code:`-c` option allows us to specify a cluster name. If a cluster with the
same name already exists, Sky will reuse that cluster. If no such cluster
@@ -130,7 +126,7 @@ We can view our existing clusters by running :code:`sky status`:

.. code-block:: console

   $ sky status

This may show multiple clusters, if you have created several:

@@ -144,7 +140,7 @@ If you would like to log in to a cluster, Sky provides convenient SSH access

.. code-block:: console

   $ ssh mycluster

If you would like to transfer files to and from the cluster, *rsync* or *scp* can be used:
