[k8s] Parallelize multi-node setup #4297

Merged
merged 17 commits into from
Nov 11, 2024

Conversation

romilbhardwaj
Collaborator

Supersedes #4261 and #4270.

We identified apt update as the slowest operation. Triggering it after setup takes a long time, so we now inline it in the container init args to let Kubernetes run it in parallel.
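To illustrate the idea, here is a minimal sketch of what inlining setup into the container args could look like. It is not the code from this PR (the real change lives in sky/templates/kubernetes-ray.yml.j2); the function name `build_container_spec`, the container name `ray-node`, and the trailing `sleep infinity` are assumptions made for illustration only.

```python
from typing import Any, Dict, List


def build_container_spec(image: str, setup_cmds: List[str]) -> Dict[str, Any]:
    """Build a container spec that runs setup commands at container start.

    Hypothetical helper: because every pod runs the script in its own
    container as soon as it starts, the slow steps run on all pods at
    once instead of being triggered one node at a time after setup.
    """
    # Chain with && so a failing step stops the remaining steps.
    script = ' && '.join(setup_cmds)
    # Assumption: keep the container alive once setup has finished.
    script += ' && sleep infinity'
    return {
        'name': 'ray-node',  # hypothetical container name
        'image': image,
        'command': ['/bin/bash', '-c'],
        'args': [script],
    }


spec = build_container_spec('nvcr.io/nvidia/nemo:24.05.01', ['apt-get update'])
print(spec['args'][0])
```

The resulting dict mirrors the `command` and `args` fields of a Kubernetes container spec, so the setup cost is paid during pod startup rather than in a later per-node step.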

Testing on nemo image

sky launch -y -c test --num-nodes 100 --cloud kubernetes --image-id nvcr.io/nvidia/nemo:24.05.01

Master branch: 19:56.21 total

This branch: 15:26.56 total
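For a rough sense of scale, the two nemo timings above can be compared with a small script. This is only a sketch: it assumes the figures are wall-clock times in MM:SS.ss form (as printed by a shell `time` builtin), and `parse_elapsed` is a hypothetical helper.

```python
def parse_elapsed(text: str) -> float:
    """Convert 'MM:SS.ss' (or 'HH:MM:SS.ss') into seconds."""
    seconds = 0.0
    for part in text.split(':'):
        # Each field to the left is worth 60x the next one.
        seconds = seconds * 60 + float(part)
    return seconds


master = parse_elapsed('19:56.21')  # master branch, nemo image
branch = parse_elapsed('15:26.56')  # this branch, nemo image
saved = master - branch

print(f'saved {saved:.2f}s, {saved / master:.1%} less wall time')
```

Under that assumption the branch saves about 270 seconds on the 100-node nemo launch, roughly 22.5% of the master-branch time.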

Testing on default image:

sky launch -y -c test --num-nodes 100 --cloud kubernetes

This branch: 4:13.41 total

Tested (run the relevant ones):

  • Code formatting: bash format.sh
  • Manual testing with 100 nodes on nemo image, our base image and miniconda image.
  • Need to run relevant smoke tests.

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for making the change @romilbhardwaj! Looks mostly good to me.

Comment on lines 462 to 463
if not provider_config.get('disable_ssh', False):
    ssh_install_cmd = install_ssh_k8s_cmd
Collaborator

Do we still need this? If we do, how much of a speedup does this change offer?

Collaborator Author

Ah I forgot to push my changes last night. The disable_ssh flag has been removed now.

Collaborator

@Michaelvll Michaelvll left a comment

Thanks for the update @romilbhardwaj! LGTM.

…o k8s_multinode_perf

# Conflicts:
#	sky/templates/kubernetes-ray.yml.j2
@romilbhardwaj romilbhardwaj added this pull request to the merge queue Nov 11, 2024
Merged via the queue into master with commit 24982a1 Nov 11, 2024
20 checks passed
@romilbhardwaj romilbhardwaj deleted the k8s_multinode_perf branch November 11, 2024 03:54
2 participants