-
Notifications
You must be signed in to change notification settings - Fork 589
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change Ray's default port to avoid conflicts with Redis #1790
Conversation
we should use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Quick look. ray start -h
shows some other ports to check. Is it ok to share them with an app's Ray?
--object-manager-port INTEGER the port to use for starting the object
manager
--node-manager-port INTEGER the port to use for starting the node
manager
--gcs-server-port INTEGER Port number for the GCS server.
--min-worker-port INTEGER the lowest port number that workers will
bind on. If not set, random ports will be
chosen.
--max-worker-port INTEGER the highest port number that workers will
bind on. If set, '--min-worker-port' must
also be set.
--worker-port-list TEXT a comma-separated list of open ports for
workers to bind on. Overrides '--min-worker-
port' and '--max-worker-port'.
--ray-client-server-port INTEGER
the port number the ray client server binds
on, default to 10001.
...
--dashboard-port INTEGER the port to bind the dashboard server to--
defaults to 8265
--dashboard-agent-listen-port INTEGER
the port for dashboard agents to listen for
http on.
--dashboard-agent-grpc-port INTEGER
the port for dashboard agents to listen for
grpc on.
--metrics-export-port INTEGER the port to use to expose Ray metrics
through a Prometheus endpoint.
May want to run
- https://github.com/skypilot-org/skypilot/tree/master/examples/ray_tune_examples
- AutoGluon
- back compat
sky/templates/azure-ray.yml.j2
Outdated
@@ -143,12 +143,12 @@ head_start_ray_commands: | |||
# Start skylet daemon. (Should not place it in the head_setup_commands, otherwise it will run before skypilot is installed.) | |||
# NOTE: --disable-usage-stats in `ray start` saves 10 seconds of idle wait. | |||
- ((ps aux | grep -v nohup | grep -v grep | grep -q -- "python3 -m sky.skylet.skylet") || nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &); | |||
ray stop; RAY_SCHEDULER_EVENTS=0 ray start --disable-usage-stats --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--resources='%s'" % custom_resources if custom_resources}} || exit 1; | |||
ray stop; RAY_SCHEDULER_EVENTS=0 ray start --disable-usage-stats --head --port={{ray_port}} --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it ok that our Ray continues to use --object-manager-port=8076
?
To my understanding, from their doc, I'll upload an AutoGluon example I run before. Back compat should be fine if we keep our dashboard running on the same port Btw, I'm also blocked by this issue in vicuna project in which I use ray to distribute work to multiple gpus. Now my sky yaml fails to run my app. We may want to merge this PR soon. |
For the backward compatibility of the 8265 port, in our new implementation, we can start the ray cluster with dashboard port 8266 and note down the port in a file. When we are submitting the task or querying the ray cluster for the job status, we can look at the file first and decide which port to use for them: For job submission here skypilot/sky/backends/cloud_vm_ray_backend.py Lines 2611 to 2615 in 15f01d6
We can do something like the following:
|
Tested and confirmed that the following backwards compatibility cases work:
|
A kind bump for this PR @infwinston. Any updates? |
Will get back to office and look into this tomorrow. |
I've updated this PR to ensure backward compatibility! and tested it. seems to work fine. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @infwinston! This would be very useful for users with their own ray cluster. Left several comments.
port_dict_str = f'{{"ray_port":{ray_port}, "ray_dashboard_port":{ray_dashboard_port}}}' | ||
dump_port_command = f'python -c \'import json, os; json.dump({port_dict_str}, open(os.path.expanduser("{constants.SKY_REMOTE_RAY_PORT_FILE}"), "w"))\'' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Since we are already using python, should we just do the following:
port_json = json.dumps({'ray_port': ray_port, 'ray_dashboard_port': ray_dashboard_port})
dump_port_command = f'echo {port_json!r} > {constants.SKY_REMOTE_RAY_PORT_FILE}'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I tried this before. The issue is json.dumps
will add a space between ":" and the value. which will cause the YAML parser to fail... we need "{'ray_port':6380}"
not "{'ray_port': 6380}"
..
I've added some comments to explain it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh, got it. I suppose if we add a |
after the -
in the head_start_ray_commands
, the problem should be fixed. : )
sky/backends/cloud_vm_ray_backend.py
Outdated
'RAY_DASHBOARD_PORT=$(jq ".ray_dashboard_port // empty" ' | ||
f'{ray_port_file} 2> /dev/null); ' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does jq
installed by default on all the images we used for all the clouds? If not, we may want to use python
instead? Something like the following:
'RAY_DASHBOARD_POART=$(python -c "import json; print(json.load(open(\'{ray_port_file}\')).get(\'ray_dashboard_port\', 8265);" 2> /dev/null || echo 8265)'
or just
'RAY_DASHBOARD_POART=$(python -c "from sky.skylet import job_lib; print(job_lib.get_job_submission_port()) 2> /dev/null || echo 8265)'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point! updated.
sky/backends/cloud_vm_ray_backend.py
Outdated
address={ray_address!r}, | ||
namespace='__sky__{job_id}__', | ||
log_to_driver=True, | ||
_temp_dir={constants.SKY_REMOTE_RAY_TEMPDIR!r} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this work if we are submitting the job through sky exec
to an existing cluster with the ray cluster running under the default _temp_dir
? Should we check the existence of constants.SKY_REMOTE_TEMPDIR
on the remote VM before we ray.init
?
kwargs = {}
if os.path.exists(constants.SKY_REMOTE_RAY_TEMPDIR):
kwargs['_temp_dir'] = constants.SKY_REMOTE_RAY_TEMPDIR
ray.init(..., **kwargs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Added!
I've merged the master and addressed the comment. Tested:
@Michaelvll could you review if there's anything left? thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix @infwinston! LGTM. It should be good to go if the tests are passed.
Co-authored-by: Zhanghao Wu <zhanghao.wu@outlook.com>
Thanks a lot for the reviews @Michaelvll! Now all the smoke tests passed. Merging this. |
Fixes #1789. I looked up common ports used by well-known services here
https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
and 6380 seems safe.
This PR also enables Ray applications on SkyPilot by making our Ray cluster invisible by changing the default
temp_dir
to/tmp/ray_skypilot
. (related #1305, #671)Tested (run the relevant ones):
pytest tests/test_smoke.py
bash tests/backward_comaptibility_tests.sh