
Commit 6249847

Add document for bash script.sh cannot do conda activate problem (#422)
* Add conda activate support to bashrc
* Add doc and make sure conda activate works
* bring back conda activate command for GCP
* Move comment to quickstart
* format
* Fix comments
* Add test/example of using user_script
* Fix indents
* bash -i only for conda activate
* Fix the SKY_NODE_IPS fail to pass to the shell script
* Update readme
* update env_check
* Fix comments
* Change to head -n1
1 parent d71e2d2 commit 6249847

13 files changed: +227 −148

README.md

+16 −4
@@ -13,19 +13,31 @@ sky launch -c mycluster hello_sky.yaml
 
 ```yaml
 # hello_sky.yaml
+
 resources:
-  accelerators: V100:1 # 1x NVIDIA V100 GPU
+  # Optional; if left out, pick from the available clouds.
+  cloud: aws
+
+  accelerators: V100:1 # 1x NVIDIA V100 GPU
 
-workdir: . # Sync code dir to cloud
+# Working directory (optional) containing the project codebase.
+# This directory will be synced to ~/sky_workdir on the provisioned cluster.
+workdir: .
 
+# Typical use: pip install -r requirements.txt
 setup: |
-  # Typical use: pip install -r requirements.txt
   echo "running setup"
+  # If using a `my_setup.sh` script that requires conda,
+  # invoke it as below to ensure `conda activate` works:
+  # bash -i my_setup.sh
 
+# Typical use: make use of resources, such as running training.
 run: |
-  # Typical use: make use of resources, such as running training.
   echo "hello sky!"
   conda env list
+  # If using a `my_run.sh` script that requires conda,
+  # invoke it as below to ensure `conda activate` works:
+  # bash -i my_run.sh
 ```
 
 ## Getting Started
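To make the new comments concrete: `conda activate` is a shell function defined by the `conda init` block in `~/.bashrc`, and a non-interactive `bash my_setup.sh` does not source `~/.bashrc`, while `bash -i` does. A minimal sketch, assuming a hypothetical script `my_setup.sh` and an existing environment named `myenv`:

    # my_setup.sh (hypothetical) -- a setup script that needs conda
    set -e
    conda activate myenv              # defined only once ~/.bashrc's conda init block has run
    pip install -r requirements.txt

    # Inside the task's setup/run sections:
    #   bash my_setup.sh       # non-interactive shell skips ~/.bashrc; `conda activate` is undefined
    #   bash -i my_setup.sh    # interactive shell sources ~/.bashrc; `conda activate` works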

docs/source/examples/distributed-jobs.rst

+17 −12
@@ -11,22 +11,25 @@ For example, here is a simple PyTorch Distributed training example:
 name: resnet-distributed-app
 
 resources:
-  accelerators: V100
+  accelerators: V100
 
 num_nodes: 2
 
 setup: |
-  pip3 install --upgrade pip
-  git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
-  cd pytorch-distributed-resnet && pip3 install -r requirements.txt
-  mkdir -p data && mkdir -p saved_models && cd data && \
-  wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
-  tar -xvzf cifar-10-python.tar.gz
+  pip3 install --upgrade pip
+  git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
+  cd pytorch-distributed-resnet && pip3 install -r requirements.txt
+  mkdir -p data && mkdir -p saved_models && cd data && \
+  wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
+  tar -xvzf cifar-10-python.tar.gz
 
 run: |
-  cd pytorch-distributed-resnet
-  python3 -m torch.distributed.launch --nproc_per_node=1 \
-  --nnodes=${#SKY_NODE_IPS[@]} --node_rank=${SKY_NODE_RANK} --master_addr=${SKY_NODE_IPS[0]} \
+  cd pytorch-distributed-resnet
+
+  num_nodes=`echo "$SKY_NODE_IPS" | wc -l`
+  master_addr=`echo "$SKY_NODE_IPS" | head -n1`
+  python3 -m torch.distributed.launch --nproc_per_node=1 \
+  --nnodes=$num_nodes --node_rank=${SKY_NODE_RANK} --master_addr=$master_addr \
     --master_port=8008 resnet_ddp.py --num_epochs 20
 
 In the above, :code:`num_nodes: 2` specifies that this task is to be run on 2
@@ -35,5 +38,7 @@ environment variables are available to distinguish per-node commands:
 
 - :code:`SKY_NODE_RANK`: rank (an integer ID from 0 to :code:`num_nodes-1`) of
   the node executing the task
-- :code:`SKY_NODE_IPS`: a list of IP addresses of the nodes reserved to execute
-  the task
+- :code:`SKY_NODE_IPS`: a string of IP addresses of the nodes reserved to execute
+  the task, where each line contains one IP address. You can retrieve the number of
+  nodes by :code:`echo "$SKY_NODE_IPS" | wc -l` and the IP address of node-3 by
+  :code:`echo "$SKY_NODE_IPS" | sed -n 3p`
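A short bash sketch of these lookups as they might appear in a `run` section (the variable names are illustrative):

    # $SKY_NODE_IPS holds one IP address per line.
    num_nodes=$(echo "$SKY_NODE_IPS" | wc -l)        # number of reserved nodes
    master_addr=$(echo "$SKY_NODE_IPS" | head -n1)   # IP of the first node
    node3_ip=$(echo "$SKY_NODE_IPS" | sed -n 3p)     # IP of node-3 (empty if fewer than 3 nodes)
    echo "rank $SKY_NODE_RANK of $num_nodes nodes; master at $master_addr"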

docs/source/examples/grid-search.rst

+5 −5
@@ -12,11 +12,11 @@ Submitting multiple trials with different hyperparameters is simple:
 
 .. code-block:: bash
 
-   # Launch 4 trials in parallel
-   sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3
-   sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3
-   sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4
-   sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2
+   $ # Launch 4 trials in parallel
+   $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3
+   $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3
+   $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4
+   $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2
 
    # gets queued and will run once a GPU is available
    sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-6
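The four parallel trials above can also be submitted from a small loop; a sketch, assuming the cluster `mycluster` is already up:

    # Submit one detached (-d) trial per learning rate; each is queued for a V100.
    for lr in 1e-3 3e-3 1e-4 1e-2; do
      sky exec mycluster --gpus V100:1 -d -- python train.py --lr "$lr"
    done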

docs/source/examples/iterative-dev-project.rst

+5 −5
@@ -48,8 +48,8 @@ Use the familiar scp/rsync to transfer files between your local machine and remo
 
 .. code-block::
 
-   $ rsync -Pavz my_code/ dev:/path/to/destination  # copy files to remote VM
-   $ rsync -Pavz dev:/path/to/source my_code/       # copy files from remote VM
+   $ rsync -Pavz my_code/ dev:/path/to/destination  # copy files to remote VM
+   $ rsync -Pavz dev:/path/to/source my_code/       # copy files from remote VM
 
 Sky **simplifies code syncing** by the automatic transfer of a working directory
 to the cluster. The working directory can be configured with the
@@ -58,8 +58,8 @@ option:
 
 .. code-block::
 
-   $ sky launch --workdir=/path/to/code task.yaml
-   $ sky exec --workdir=/path/to/code task.yaml
+   $ sky launch --workdir=/path/to/code task.yaml
+   $ sky exec --workdir=/path/to/code task.yaml
 
 These commands sync the working directory to a location on the remote VM, and
 the task is run under that working directory (e.g., to invoke scripts, access
@@ -80,4 +80,4 @@ To restart a stopped cluster:
 
 .. code-block:: console
 
-   $ sky start dev
+   $ sky start dev
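Putting these commands together, one iteration of the edit-run loop might look like the following sketch (the `outputs/` directory is hypothetical):

    # Re-sync the local working directory and re-run the task on the existing cluster.
    sky exec --workdir=/path/to/code task.yaml
    # Copy results back; the working directory is synced to ~/sky_workdir on the cluster.
    rsync -Pavz dev:~/sky_workdir/outputs/ ./outputs/
    # If the cluster was stopped in the meantime, restart it first with `sky start dev`.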

docs/source/getting-started/installation.rst

+30 −30
@@ -7,11 +7,11 @@ Install Sky using pip:
 
 .. code-block:: console
 
-   $ # Clone the sky codebase
-   $ git clone ssh://git@github.com/sky-proj/sky.git
-   $ cd sky
-   $ # Sky requires python >= 3.6. 3.10+ is currently NOT supported.
-   $ pip install ".[all]"
+   $ # Clone the sky codebase
+   $ git clone ssh://git@github.com/sky-proj/sky.git
+   $ cd sky
+   $ # Sky requires python >= 3.6. 3.10+ is currently NOT supported.
+   $ pip install ".[all]"
 
 If you only want the dependencies for certain clouds, you can also use
 :code:`".[aws,azure,gcp]"`.
@@ -28,27 +28,27 @@ To get the **AWS Access Key** required by the :code:`aws configure`, please refe
 
 .. code-block:: console
 
-   $ # Install boto
-   $ pip install boto3
+   $ # Install boto
+   $ pip install boto3
 
-   $ # Configure your AWS credentials
-   $ aws configure
+   $ # Configure your AWS credentials
+   $ aws configure
 
 
 **GCP**:
 
 .. code-block:: console
 
-   $ pip install google-api-python-client
-   $ # Install `gcloud`; see https://cloud.google.com/sdk/docs/quickstart
-   $ conda install -c conda-forge google-cloud-sdk
+   $ pip install google-api-python-client
+   $ # Install `gcloud`; see https://cloud.google.com/sdk/docs/quickstart
+   $ conda install -c conda-forge google-cloud-sdk
 
-   $ # Init.
-   $ gcloud init
+   $ # Init.
+   $ gcloud init
 
-   $ # Run this if you don't have a credentials file.
-   $ # This will generate ~/.config/gcloud/application_default_credentials.json.
-   $ gcloud auth application-default login
+   $ # Run this if you don't have a credentials file.
+   $ # This will generate ~/.config/gcloud/application_default_credentials.json.
+   $ gcloud auth application-default login
 
 If you meet the following error (*RemoveError: 'requests' is a dependency of conda and cannot be removed from conda's operating environment*) while running :code:`conda install -c conda-forge google-cloud-sdk`, please try :code:`conda update --force conda` and run it again.
 
@@ -57,12 +57,12 @@ If you meet the following error (*RemoveError: 'requests' is a dependency of con
 
 .. code-block:: console
 
-   $ # Install the Azure CLI
-   $ pip install azure-cli==2.30.0
-   $ # Login azure
-   $ az login
-   $ # Set the subscription to use
-   $ az account set -s <subscription_id>
+   $ # Install the Azure CLI
+   $ pip install azure-cli==2.30.0
+   $ # Login azure
+   $ az login
+   $ # Set the subscription to use
+   $ az account set -s <subscription_id>
 
 **Verifying cloud setup**
 
@@ -71,16 +71,16 @@ the CLI:
 
 .. code-block:: console
 
-   $ # Verify cloud account setup
-   $ sky check
+   $ # Verify cloud account setup
+   $ sky check
 
 This will produce output verifying the correct setup of each supported cloud.
 
 .. code-block:: text
 
-   Checking credentials to enable clouds for Sky.
-   AWS: enabled
-   GCP: enabled
-   Azure: enabled
+   Checking credentials to enable clouds for Sky.
+   AWS: enabled
+   GCP: enabled
+   Azure: enabled
 
-   Sky will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
+   Sky will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
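As a sketch of the cloud-specific install path mentioned above (assuming the individual extras match the bracketed names `aws`, `azure`, `gcp`):

    # Install only the AWS and GCP dependencies instead of ".[all]".
    pip install ".[aws,gcp]"
    # Then confirm which clouds were enabled.
    sky check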

docs/source/getting-started/quickstart.rst

+38 −33
@@ -31,23 +31,23 @@ Let's provision an instance with a single K80 GPU.
 
 .. code-block:: bash
 
-   # Provisions/reuses an interactive node with a single K80 GPU.
-   # Any of the interactive node commands (gpunode, tpunode, cpunode)
-   # will automatically log in to the cluster.
-   sky gpunode -c mygpu --gpus K80
+   # Provisions/reuses an interactive node with a single K80 GPU.
+   # Any of the interactive node commands (gpunode, tpunode, cpunode)
+   # will automatically log in to the cluster.
+   sky gpunode -c mygpu --gpus K80
 
-   Last login: Wed Feb 23 22:35:47 2022 from 136.152.143.101
-   ubuntu@ip-172-31-86-108:~$ gpustat
-   ip-172-31-86-108  Wed Feb 23 22:42:43 2022  450.142.00
-   [0] Tesla K80 | 31°C, 0 % | 0 / 11441 MB |
-   ubuntu@ip-172-31-86-108:~$
-   ^D
+   Last login: Wed Feb 23 22:35:47 2022 from 136.152.143.101
+   ubuntu@ip-172-31-86-108:~$ gpustat
+   ip-172-31-86-108  Wed Feb 23 22:42:43 2022  450.142.00
+   [0] Tesla K80 | 31°C, 0 % | 0 / 11441 MB |
+   ubuntu@ip-172-31-86-108:~$
+   ^D
 
-   # View the machine in the cluster table.
-   sky status
+   # View the machine in the cluster table.
+   sky status
 
-   NAME   LAUNCHED        RESOURCES                     COMMAND                          STATUS
-   mygpu  a few secs ago  1x Azure(Standard_NC6_Promo)  sky gpunode -c mygpu --gpus K80  UP
+   NAME   LAUNCHED        RESOURCES                     COMMAND                          STATUS
+   mygpu  a few secs ago  1x Azure(Standard_NC6_Promo)  sky gpunode -c mygpu --gpus K80  UP
 
 After you are done, run :code:`sky down mygpu` to terminate the cluster. Find more details
 on managing the lifecycle of your cluster :ref:`here <interactive-nodes>`.
@@ -78,27 +78,32 @@ requiring an NVIDIA Tesla K80 GPU on AWS. See more example yaml files in the `re
 
 .. code-block:: yaml
 
-   # hello_sky.yaml
+   # hello_sky.yaml
 
-   resources:
-     # Optional; if left out, pick from the available clouds.
-     cloud: aws
+   resources:
+     # Optional; if left out, pick from the available clouds.
+     cloud: aws
 
-     # Get 1 K80 GPU. Use <name>:<n> to get more (e.g., "K80:8").
-     accelerators: K80
+     accelerators: V100:1 # 1x NVIDIA V100 GPU
 
-   # Working directory (optional) containing the project codebase.
-   # This directory will be synced to ~/sky_workdir on the provisioned cluster.
-   workdir: .
+   # Working directory (optional) containing the project codebase.
+   # This directory will be synced to ~/sky_workdir on the provisioned cluster.
+   workdir: .
 
-   # Typical use: pip install -r requirements.txt
-   setup: |
-     echo "running setup"
+   # Typical use: pip install -r requirements.txt
+   setup: |
+     echo "running setup"
+     # If using a `my_setup.sh` script that requires conda,
+     # invoke it as below to ensure `conda activate` works:
+     # bash -i my_setup.sh
 
-   # Typical use: make use of resources, such as running training.
-   run: |
-     echo "hello sky!"
-     conda env list
+   # Typical use: make use of resources, such as running training.
+   run: |
+     echo "hello sky!"
+     conda env list
+     # If using a `my_run.sh` script that requires conda,
+     # invoke it as below to ensure `conda activate` works:
+     # `bash -i my_run.sh`
 
 Sky handles selecting an appropriate VM based on user-specified resource
 constraints, launching the cluster on an appropriate cloud provider, and
@@ -108,7 +113,7 @@ To launch a task based on our above YAML spec, we can use :code:`sky launch`.
 
 .. code-block:: console
 
-   $ sky launch -c mycluster hello_sky.yaml
+   $ sky launch -c mycluster hello_sky.yaml
 
 The :code:`-c` option allows us to specify a cluster name. If a cluster with the
 same name already exists, Sky will reuse that cluster. If no such cluster
@@ -120,7 +125,7 @@ We can view our existing clusters by running :code:`sky status`:
 
 .. code-block:: console
 
-   $ sky status
+   $ sky status
 
 This may show multiple clusters, if you have created several:
 
@@ -134,7 +139,7 @@ If you would like to log into the a cluster, Sky provides convenient SSH access
 
 .. code-block:: console
 
-   $ ssh mycluster
+   $ ssh mycluster
 
 If you would like to transfer files to and from the cluster, *rsync* or *scp* can be used:
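For reference, the quickstart flow covered by this file reduces to a few commands; a sketch using the cluster name `mycluster` from above:

    sky launch -c mycluster hello_sky.yaml   # provision (or reuse) the cluster; runs setup, then run
    sky status                               # confirm the cluster shows as UP
    ssh mycluster                            # log in interactively if needed
    sky down mycluster                       # terminate the cluster when finished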
