[k8s] Parallelize multi-node setup #4297
…o k8s_disable_ssh
Thanks for making the change @romilbhardwaj! Looks mostly good to me.
sky/provision/kubernetes/instance.py (outdated)
    if not provider_config.get('disable_ssh', False):
        ssh_install_cmd = install_ssh_k8s_cmd
Do we still need this? If so, how much of a speedup does this change offer?
Ah, I forgot to push my changes last night. The disable_ssh flag has been removed now.
Thanks for the update @romilbhardwaj! LGTM.
…o k8s_multinode_perf # Conflicts: # sky/templates/kubernetes-ray.yml.j2
Supersedes #4261 and #4270.
We identified that apt update is the slowest operation, and triggering it after setup takes a lot of time, so we now inline it in the container init args to let Kubernetes run it in parallel.
Testing on nemo image:
sky launch -y -c test --num-nodes 100 --cloud kubernetes --image-id nvcr.io/nvidia/nemo:24.05.01
Master branch: 19:56.21 total
This branch: 15:26.56 total
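For reference, the two nemo timings above can be compared with a few lines of Python. This is only a sketch: it assumes the totals are in MM:SS.ss form (as printed by a shell `time`), and `to_seconds` is a hypothetical helper, not something from this PR.

```python
def to_seconds(elapsed: str) -> float:
    """Convert an 'MM:SS.ss' wall-clock string to seconds (assumed format)."""
    minutes, seconds = elapsed.split(':')
    return int(minutes) * 60 + float(seconds)


# Totals reported above for the 100-node nemo launch.
master = to_seconds('19:56.21')
branch = to_seconds('15:26.56')

saved = master - branch
reduction_pct = 100 * saved / master

print(f'saved {saved:.2f}s ({reduction_pct:.1f}% less wall-clock time)')
# prints: saved 269.65s (22.5% less wall-clock time)
```

That is roughly four and a half minutes saved on this single run; it is one measurement, so treat the percentage as indicative rather than a benchmark.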
Testing on default image:
sky launch -y -c test --num-nodes 100 --cloud kubernetes
This branch: 4:13.41 total
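To illustrate the idea of inlining apt update into the container args (as described at the top of this PR), here is a minimal sketch of a container spec built as a plain Python dict. It is not the actual sky/templates/kubernetes-ray.yml.j2 content: `build_container_spec`, the container name, and the `ubuntu:22.04` image are all hypothetical, and only the standard Kubernetes `command`/`args` container fields are assumed.

```python
from typing import Dict, List


def build_container_spec(image: str, init_cmds: List[str]) -> Dict:
    """Hypothetical helper: inline init commands into the container args.

    Each pod runs these commands itself when its container starts, so the
    work happens on all pods concurrently instead of being triggered
    against each node after provisioning.
    """
    # Chain the init commands, then keep the container alive afterwards.
    script = ' && '.join(init_cmds) + '; sleep infinity'
    return {
        'name': 'node',  # hypothetical container name
        'image': image,
        'command': ['/bin/bash', '-c'],
        'args': [script],
    }


spec = build_container_spec('ubuntu:22.04', ['apt-get update'])
print(spec['args'][0])
```

The design point is that the slow step moves from a post-setup command issued per node to something every container does on its own at startup, which is what lets it overlap across a 100-node launch.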
Tested (run the relevant ones):
bash format.sh