Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add document for bash script.sh cannot do conda activate problem #422

Merged
merged 15 commits into from
Feb 24, 2022
Merged
8 changes: 8 additions & 0 deletions docs/source/getting-started/quickstart.rst
Original file line number Diff line number Diff line change
Expand Up @@ -48,10 +48,18 @@ requiring an NVIDIA Tesla K80 GPU on AWS. (See more example yaml files in the `r

setup: |
# Typical use: pip install -r requirements.txt

# If using a `my_setup.sh` script to setup, please use
# `bash -i my_setup.sh` to capture the environment
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to automatically do this later for users?

Copy link
Collaborator Author

@Michaelvll Michaelvll Feb 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could not think of a better way to do this. I feel like we should respect the setup section written by the user, without modifying any code in it. Otherwise, more unexpected behaviors will occur for the user.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok we should figure this out later... (e.g. how to fix backend so that bash will emulate local...)

# variable and make sure `conda activate` works
echo "running setup"

run: |
# Typical use: make use of resources, such as running training.

# If using a my_run.sh script to run commands, please use
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is twice the burden on the user

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is not an ideal solution. I am not sure if there is any better one for this problem, as slurm also does not handle conda activate well.

# `bash -i my_run.sh` to capture the environment variable
# and make sure `conda activate` works
echo "hello sky!"
conda env list

Expand Down
17 changes: 7 additions & 10 deletions sky/backends/cloud_vm_ray_backend.py
Original file line number Diff line number Diff line change
Expand Up @@ -732,9 +732,8 @@ def _retry_region_zones(self, to_provision: Resources, num_nodes: int,

cluster_name = config_dict['cluster_name']
plural = '' if num_nodes == 1 else 's'
logger.info(
f'{style.BRIGHT}Successfully provisioned or found'
f' existing VM{plural}. Setup completed.{style.RESET_ALL}')
logger.info(f'{style.BRIGHT}Successfully provisioned or found'
f' existing VM{plural}.{style.RESET_ALL}')
return config_dict
message = ('Failed to acquire resources in all regions/zones'
f' (requested {to_provision}).'
Expand Down Expand Up @@ -1219,13 +1218,10 @@ def setup(self, handle: ResourceHandle, task: Task) -> None:

if task.setup is None:
return
codegen = textwrap.dedent(f"""\
#!/bin/bash
# TODO(zhwu): Move this to bashrc
. $(conda info --base)/etc/profile.d/conda.sh
{task.setup}""")

setup_script = log_lib.make_task_bash_script(task.setup)
with tempfile.NamedTemporaryFile('w', prefix='sky_setup_') as f:
f.write(codegen)
f.write(setup_script)
f.flush()
setup_sh_path = f.name
setup_file = os.path.basename(setup_sh_path)
Expand All @@ -1249,7 +1245,8 @@ def setup(self, handle: ResourceHandle, task: Task) -> None:
with_outputs=False)
backend_utils.run_command_on_ip_via_ssh(
ip,
f'cd {SKY_REMOTE_WORKDIR}; /bin/bash /tmp/{setup_file}',
# -i will make sure `conda activate` works
f'/bin/bash -i /tmp/{setup_file}',
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if a Task.run is also running bash user_script_needing_conda.sh? Do we need -i there?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Collaborator Author

@Michaelvll Michaelvll Feb 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the user needs to use -i for run as well.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test case/example where the bash script would fail without -i and with -i it works

Copy link
Collaborator Author

@Michaelvll Michaelvll Feb 23, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great point! I found a problem with distributed training with user_script during doing this, that $SKY_NODE_IPS cannot be retrieved from the user_script, due to the bash bug.

ssh_user=ssh_user,
ssh_private_key=ssh_private_key,
log_path=os.path.join(self.log_dir, 'setup.log'),
Expand Down