
Commit 6249847

Add document for bash script.sh cannot do conda activate problem (#422)
* Add conda activate support to bashrc
* Add doc and make sure conda activate works
* bring back conda activate command for GCP
* Move comment to quickstart
* format
* Fix comments
* Add test/example of using user_script
* Fix indents
* bash -i only for conda activate
* Fix the SKY_NODE_IPS fail to pass to the shell script
* Update readme
* update env_check
* Fix comments
* Change to head -n1
1 parent d71e2d2 commit 6249847

13 files changed: +227 −148

README.md

+16 −4
@@ -13,19 +13,31 @@ sky launch -c mycluster hello_sky.yaml
 
 ```yaml
 # hello_sky.yaml
+
 resources:
-  accelerators: V100:1 # 1x NVIDIA V100 GPU
+  # Optional; if left out, pick from the available clouds.
+  cloud: aws
+
+  accelerators: V100:1 # 1x NVIDIA V100 GPU
 
-workdir: . # Sync code dir to cloud
+# Working directory (optional) containing the project codebase.
+# This directory will be synced to ~/sky_workdir on the provisioned cluster.
+workdir: .
 
+# Typical use: pip install -r requirements.txt
 setup: |
-  # Typical use: pip install -r requirements.txt
   echo "running setup"
+  # If using a `my_setup.sh` script that requires conda,
+  # invoke it as below to ensure `conda activate` works:
+  # bash -i my_setup.sh
 
+# Typical use: make use of resources, such as running training.
 run: |
-  # Typical use: make use of resources, such as running training.
   echo "hello sky!"
   conda env list
+  # If using a `my_run.sh` script that requires conda,
+  # invoke it as below to ensure `conda activate` works:
+  # bash -i my_run.sh
 ```
 
 ## Getting Started
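To make the new comments concrete: `conda activate` is a shell function defined by the `conda init` block in `~/.bashrc`, and a non-interactive `bash my_setup.sh` does not source `~/.bashrc`, while `bash -i` does. A minimal sketch, assuming a hypothetical script `my_setup.sh` and an existing environment named `myenv`:

    # my_setup.sh (hypothetical) -- a setup script that needs conda
    set -e
    conda activate myenv              # defined only once ~/.bashrc's conda init block has run
    pip install -r requirements.txt

    # Inside the task's setup/run sections:
    #   bash my_setup.sh       # non-interactive shell skips ~/.bashrc; `conda activate` is undefined
    #   bash -i my_setup.sh    # interactive shell sources ~/.bashrc; `conda activate` works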

docs/source/examples/distributed-jobs.rst

+17 −12
@@ -11,22 +11,25 @@ For example, here is a simple PyTorch Distributed training example:
 name: resnet-distributed-app
 
 resources:
-  accelerators: V100
+  accelerators: V100
 
 num_nodes: 2
 
 setup: |
-  pip3 install --upgrade pip
-  git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
-  cd pytorch-distributed-resnet && pip3 install -r requirements.txt
-  mkdir -p data && mkdir -p saved_models && cd data && \
-  wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
-  tar -xvzf cifar-10-python.tar.gz
+  pip3 install --upgrade pip
+  git clone https://github.com/michaelzhiluo/pytorch-distributed-resnet
+  cd pytorch-distributed-resnet && pip3 install -r requirements.txt
+  mkdir -p data && mkdir -p saved_models && cd data && \
+  wget -c --quiet https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
+  tar -xvzf cifar-10-python.tar.gz
 
 run: |
-  cd pytorch-distributed-resnet
-  python3 -m torch.distributed.launch --nproc_per_node=1 \
-  --nnodes=${#SKY_NODE_IPS[@]} --node_rank=${SKY_NODE_RANK} --master_addr=${SKY_NODE_IPS[0]} \
+  cd pytorch-distributed-resnet
+
+  num_nodes=`echo "$SKY_NODE_IPS" | wc -l`
+  master_addr=`echo "$SKY_NODE_IPS" | head -n1`
+  python3 -m torch.distributed.launch --nproc_per_node=1 \
+  --nnodes=$num_nodes --node_rank=${SKY_NODE_RANK} --master_addr=$master_addr \
     --master_port=8008 resnet_ddp.py --num_epochs 20
 
 In the above, :code:`num_nodes: 2` specifies that this task is to be run on 2
@@ -35,5 +38,7 @@ environment variables are available to distinguish per-node commands:
 
 - :code:`SKY_NODE_RANK`: rank (an integer ID from 0 to :code:`num_nodes-1`) of
   the node executing the task
-- :code:`SKY_NODE_IPS`: a list of IP addresses of the nodes reserved to execute
-  the task
+- :code:`SKY_NODE_IPS`: a string of IP addresses of the nodes reserved to execute
+  the task, where each line contains one IP address. You can retrieve the number of
+  nodes by :code:`echo "$SKY_NODE_IPS" | wc -l` and the IP address of node-3 by
+  :code:`echo "$SKY_NODE_IPS" | sed -n 3p`
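A short bash sketch of these lookups as they might appear in a `run` section (the variable names are illustrative):

    # $SKY_NODE_IPS holds one IP address per line.
    num_nodes=$(echo "$SKY_NODE_IPS" | wc -l)        # number of reserved nodes
    master_addr=$(echo "$SKY_NODE_IPS" | head -n1)   # IP of the first node
    node3_ip=$(echo "$SKY_NODE_IPS" | sed -n 3p)     # IP of node-3 (empty if fewer than 3 nodes)
    echo "rank $SKY_NODE_RANK of $num_nodes nodes; master at $master_addr"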

docs/source/examples/grid-search.rst

+5 −5
@@ -12,11 +12,11 @@ Submitting multiple trials with different hyperparameters is simple:
 
 .. code-block:: bash
 
-   # Launch 4 trials in parallel
-   sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3
-   sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3
-   sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4
-   sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2
+   $ # Launch 4 trials in parallel
+   $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-3
+   $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 3e-3
+   $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-4
+   $ sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-2
 
    # gets queued and will run once a GPU is available
    sky exec mycluster --gpus V100:1 -d -- python train.py --lr 1e-6
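The four parallel trials above can also be submitted from a small loop; a sketch, assuming the cluster `mycluster` is already up:

    # Submit one detached (-d) trial per learning rate; each is queued for a V100.
    for lr in 1e-3 3e-3 1e-4 1e-2; do
      sky exec mycluster --gpus V100:1 -d -- python train.py --lr "$lr"
    done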

docs/source/examples/iterative-dev-project.rst

+5 −5
@@ -48,8 +48,8 @@ Use the familiar scp/rsync to transfer files between your local machine and remo
 
 .. code-block::
 
-   $ rsync -Pavz my_code/ dev:/path/to/destination  # copy files to remote VM
-   $ rsync -Pavz dev:/path/to/source my_code/       # copy files from remote VM
+   $ rsync -Pavz my_code/ dev:/path/to/destination  # copy files to remote VM
+   $ rsync -Pavz dev:/path/to/source my_code/       # copy files from remote VM
 
 Sky **simplifies code syncing** by the automatic transfer of a working directory
 to the cluster. The working directory can be configured with the
@@ -58,8 +58,8 @@ option:
 
 .. code-block::
 
-   $ sky launch --workdir=/path/to/code task.yaml
-   $ sky exec --workdir=/path/to/code task.yaml
+   $ sky launch --workdir=/path/to/code task.yaml
+   $ sky exec --workdir=/path/to/code task.yaml
 
 These commands sync the working directory to a location on the remote VM, and
 the task is run under that working directory (e.g., to invoke scripts, access
@@ -80,4 +80,4 @@ To restart a stopped cluster:
 
 .. code-block:: console
 
-   $ sky start dev
+   $ sky start dev
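Putting these commands together, one iteration of the edit-run loop might look like the following sketch (the `outputs/` directory is hypothetical):

    # Re-sync the local working directory and re-run the task on the existing cluster.
    sky exec --workdir=/path/to/code task.yaml
    # Copy results back; the working directory is synced to ~/sky_workdir on the cluster.
    rsync -Pavz dev:~/sky_workdir/outputs/ ./outputs/
    # If the cluster was stopped in the meantime, restart it first with `sky start dev`.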

docs/source/getting-started/installation.rst

+30 −30
@@ -7,11 +7,11 @@ Install Sky using pip:
 
 .. code-block:: console
 
-   $ # Clone the sky codebase
-   $ git clone ssh://git@github.com/sky-proj/sky.git
-   $ cd sky
-   $ # Sky requires python >= 3.6. 3.10+ is currently NOT supported.
-   $ pip install ".[all]"
+   $ # Clone the sky codebase
+   $ git clone ssh://git@github.com/sky-proj/sky.git
+   $ cd sky
+   $ # Sky requires python >= 3.6. 3.10+ is currently NOT supported.
+   $ pip install ".[all]"
 
 If you only want the dependencies for certain clouds, you can also use
 :code:`".[aws,azure,gcp]"`.
@@ -28,27 +28,27 @@ To get the **AWS Access Key** required by the :code:`aws configure`, please refe
 
 .. code-block:: console
 
-   $ # Install boto
-   $ pip install boto3
+   $ # Install boto
+   $ pip install boto3
 
-   $ # Configure your AWS credentials
-   $ aws configure
+   $ # Configure your AWS credentials
+   $ aws configure
 
 
 **GCP**:
 
 .. code-block:: console
 
-   $ pip install google-api-python-client
-   $ # Install `gcloud`; see https://cloud.google.com/sdk/docs/quickstart
-   $ conda install -c conda-forge google-cloud-sdk
+   $ pip install google-api-python-client
+   $ # Install `gcloud`; see https://cloud.google.com/sdk/docs/quickstart
+   $ conda install -c conda-forge google-cloud-sdk
 
-   $ # Init.
-   $ gcloud init
+   $ # Init.
+   $ gcloud init
 
-   $ # Run this if you don't have a credentials file.
-   $ # This will generate ~/.config/gcloud/application_default_credentials.json.
-   $ gcloud auth application-default login
+   $ # Run this if you don't have a credentials file.
+   $ # This will generate ~/.config/gcloud/application_default_credentials.json.
+   $ gcloud auth application-default login
 
 If you meet the following error (*RemoveError: 'requests' is a dependency of conda and cannot be removed from conda's operating environment*) while running :code:`conda install -c conda-forge google-cloud-sdk`, please try :code:`conda update --force conda` and run it again.
 
@@ -57,12 +57,12 @@ If you meet the following error (*RemoveError: 'requests' is a dependency of con
 
 .. code-block:: console
 
-   $ # Install the Azure CLI
-   $ pip install azure-cli==2.30.0
-   $ # Login azure
-   $ az login
-   $ # Set the subscription to use
-   $ az account set -s <subscription_id>
+   $ # Install the Azure CLI
+   $ pip install azure-cli==2.30.0
+   $ # Login azure
+   $ az login
+   $ # Set the subscription to use
+   $ az account set -s <subscription_id>
 
 **Verifying cloud setup**
 
@@ -71,16 +71,16 @@ the CLI:
 
 .. code-block:: console
 
-   $ # Verify cloud account setup
-   $ sky check
+   $ # Verify cloud account setup
+   $ sky check
 
 This will produce output verifying the correct setup of each supported cloud.
 
 .. code-block:: text
 
-   Checking credentials to enable clouds for Sky.
-   AWS: enabled
-   GCP: enabled
-   Azure: enabled
+   Checking credentials to enable clouds for Sky.
+   AWS: enabled
+   GCP: enabled
+   Azure: enabled
 
-   Sky will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
+   Sky will use only the enabled clouds to run tasks. To change this, configure cloud credentials, and run sky check.
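As a sketch of the cloud-specific install path mentioned above (assuming the individual extras match the bracketed names `aws`, `azure`, `gcp`):

    # Install only the AWS and GCP dependencies instead of ".[all]".
    pip install ".[aws,gcp]"
    # Then confirm which clouds were enabled.
    sky check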

docs/source/getting-started/quickstart.rst

+38 −33
@@ -31,23 +31,23 @@ Let's provision an instance with a single K80 GPU.
 
 .. code-block:: bash
 
-   # Provisions/reuses an interactive node with a single K80 GPU.
-   # Any of the interactive node commands (gpunode, tpunode, cpunode)
-   # will automatically log in to the cluster.
-   sky gpunode -c mygpu --gpus K80
+   # Provisions/reuses an interactive node with a single K80 GPU.
+   # Any of the interactive node commands (gpunode, tpunode, cpunode)
+   # will automatically log in to the cluster.
+   sky gpunode -c mygpu --gpus K80
 
-   Last login: Wed Feb 23 22:35:47 2022 from 136.152.143.101
-   ubuntu@ip-172-31-86-108:~$ gpustat
-   ip-172-31-86-108  Wed Feb 23 22:42:43 2022  450.142.00
-   [0] Tesla K80 | 31°C, 0 % | 0 / 11441 MB |
-   ubuntu@ip-172-31-86-108:~$
-   ^D
+   Last login: Wed Feb 23 22:35:47 2022 from 136.152.143.101
+   ubuntu@ip-172-31-86-108:~$ gpustat
+   ip-172-31-86-108  Wed Feb 23 22:42:43 2022  450.142.00
+   [0] Tesla K80 | 31°C, 0 % | 0 / 11441 MB |
+   ubuntu@ip-172-31-86-108:~$
+   ^D
 
-   # View the machine in the cluster table.
-   sky status
+   # View the machine in the cluster table.
+   sky status
 
-   NAME   LAUNCHED        RESOURCES                     COMMAND                          STATUS
-   mygpu  a few secs ago  1x Azure(Standard_NC6_Promo)  sky gpunode -c mygpu --gpus K80  UP
+   NAME   LAUNCHED        RESOURCES                     COMMAND                          STATUS
+   mygpu  a few secs ago  1x Azure(Standard_NC6_Promo)  sky gpunode -c mygpu --gpus K80  UP
 
 After you are done, run :code:`sky down mygpu` to terminate the cluster. Find more details
 on managing the lifecycle of your cluster :ref:`here <interactive-nodes>`.
@@ -78,27 +78,32 @@ requiring an NVIDIA Tesla K80 GPU on AWS. See more example yaml files in the `re
 
 .. code-block:: yaml
 
-   # hello_sky.yaml
+   # hello_sky.yaml
 
-   resources:
-     # Optional; if left out, pick from the available clouds.
-     cloud: aws
+   resources:
+     # Optional; if left out, pick from the available clouds.
+     cloud: aws
 
-     # Get 1 K80 GPU. Use <name>:<n> to get more (e.g., "K80:8").
-     accelerators: K80
+     accelerators: V100:1 # 1x NVIDIA V100 GPU
 
-   # Working directory (optional) containing the project codebase.
-   # This directory will be synced to ~/sky_workdir on the provisioned cluster.
-   workdir: .
+   # Working directory (optional) containing the project codebase.
+   # This directory will be synced to ~/sky_workdir on the provisioned cluster.
+   workdir: .
 
-   # Typical use: pip install -r requirements.txt
-   setup: |
-     echo "running setup"
+   # Typical use: pip install -r requirements.txt
+   setup: |
+     echo "running setup"
+     # If using a `my_setup.sh` script that requires conda,
+     # invoke it as below to ensure `conda activate` works:
+     # bash -i my_setup.sh
 
-   # Typical use: make use of resources, such as running training.
-   run: |
-     echo "hello sky!"
-     conda env list
+   # Typical use: make use of resources, such as running training.
+   run: |
+     echo "hello sky!"
+     conda env list
+     # If using a `my_run.sh` script that requires conda,
+     # invoke it as below to ensure `conda activate` works:
+     # `bash -i my_run.sh`
 
 Sky handles selecting an appropriate VM based on user-specified resource
 constraints, launching the cluster on an appropriate cloud provider, and
@@ -108,7 +113,7 @@ To launch a task based on our above YAML spec, we can use :code:`sky launch`.
 
 .. code-block:: console
 
-   $ sky launch -c mycluster hello_sky.yaml
+   $ sky launch -c mycluster hello_sky.yaml
 
 The :code:`-c` option allows us to specify a cluster name. If a cluster with the
 same name already exists, Sky will reuse that cluster. If no such cluster
@@ -120,7 +125,7 @@ We can view our existing clusters by running :code:`sky status`:
 
 .. code-block:: console
 
-   $ sky status
+   $ sky status
 
 This may show multiple clusters, if you have created several:
 
@@ -134,7 +139,7 @@ If you would like to log into the a cluster, Sky provides convenient SSH access
 
 .. code-block:: console
 
-   $ ssh mycluster
+   $ ssh mycluster
 
 If you would like to transfer files to and from the cluster, *rsync* or *scp* can be used:
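For reference, the quickstart flow covered by this file reduces to a few commands; a sketch using the cluster name `mycluster` from above:

    sky launch -c mycluster hello_sky.yaml   # provision (or reuse) the cluster; runs setup, then run
    sky status                               # confirm the cluster shows as UP
    ssh mycluster                            # log in interactively if needed
    sky down mycluster                       # terminate the cluster when finished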
