
Change Ray's default port to avoid conflicts with Redis #1790

Merged: 19 commits, May 28, 2023

Conversation

@infwinston (Member) commented Mar 16, 2023

Fixes #1789. I looked up common ports used by well-known services here: https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers, and 6380 seems safe.

This PR also enables users' own Ray applications on SkyPilot by making our Ray cluster invisible to them: it changes the default temp_dir to /tmp/ray_skypilot. (Related: #1305, #671)
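For context on the port choice: Ray's old default head port, 6379, is also Redis's default port, which is exactly the clash reported in #1789. A minimal, hypothetical sketch (not part of this PR) that checks whether the old and new ports are taken on the local machine:

import socket

# True if something is already listening on the port (e.g. Redis on 6379).
# connect_ex() returns 0 on a successful connection.
def port_in_use(port: int, host: str = '127.0.0.1') -> bool:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        return sock.connect_ex((host, port)) == 0

for port in (6379, 6380):
    print(f'port {port}:', 'in use' if port_in_use(port) else 'free')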

Tested (run the relevant ones):

  • All smoke tests: pytest tests/test_smoke.py
  • Backward compatibility tests: bash tests/backward_comaptibility_tests.sh

@suquark (Collaborator) commented Mar 18, 2023

We should use ray_skypilot instead of ray_patch as the temp dir name, in case this becomes our long-term solution.

@concretevitamin added this to the 0.3 milestone Mar 22, 2023
@concretevitamin (Member) left a comment

Awesome! Quick look: ray start -h shows some other ports to check. Is it OK to share them with an app's Ray?

  --object-manager-port INTEGER   the port to use for starting the object
                                  manager
  --node-manager-port INTEGER     the port to use for starting the node
                                  manager
  --gcs-server-port INTEGER       Port number for the GCS server.
  --min-worker-port INTEGER       the lowest port number that workers will
                                  bind on. If not set, random ports will be
                                  chosen.
  --max-worker-port INTEGER       the highest port number that workers will
                                  bind on. If set, '--min-worker-port' must
                                  also be set.
  --worker-port-list TEXT         a comma-separated list of open ports for
                                  workers to bind on. Overrides '--min-worker-
                                  port' and '--max-worker-port'.
  --ray-client-server-port INTEGER
                                  the port number the ray client server binds
                                  on, default to 10001.
...
  --dashboard-port INTEGER        the port to bind the dashboard server to--
                                  defaults to 8265
  --dashboard-agent-listen-port INTEGER
                                  the port for dashboard agents to listen for
                                  http on.
  --dashboard-agent-grpc-port INTEGER
                                  the port for dashboard agents to listen for
                                  grpc on.
  --metrics-export-port INTEGER   the port to use to expose Ray metrics
                                  through a Prometheus endpoint.

May want to run

@@ -143,12 +143,12 @@ head_start_ray_commands:
# Start skylet daemon. (Should not place it in the head_setup_commands, otherwise it will run before skypilot is installed.)
# NOTE: --disable-usage-stats in `ray start` saves 10 seconds of idle wait.
- ((ps aux | grep -v nohup | grep -v grep | grep -q -- "python3 -m sky.skylet.skylet") || nohup python3 -m sky.skylet.skylet >> ~/.sky/skylet.log 2>&1 &);
-ray stop; RAY_SCHEDULER_EVENTS=0 ray start --disable-usage-stats --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--resources='%s'" % custom_resources if custom_resources}} || exit 1;
+ray stop; RAY_SCHEDULER_EVENTS=0 ray start --disable-usage-stats --head --port={{ray_port}} --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} || exit 1;
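For reference, the {{ray_port}} / {{ray_temp_dir}} placeholders above are jinja2 template variables; a minimal sketch (jinja2 assumed from the {{...}} syntax; 6380 and /tmp/ray_skypilot are the values this PR introduces) of how the new line renders:

import jinja2

# Abbreviated version of the templated line from the diff above.
line = ('ray start --disable-usage-stats --head --port={{ray_port}} '
        '--object-manager-port=8076 --temp-dir {{ray_temp_dir}}')
print(jinja2.Template(line).render(ray_port=6380,
                                   ray_temp_dir='/tmp/ray_skypilot'))
# -> ray start --disable-usage-stats --head --port=6380
#    --object-manager-port=8076 --temp-dir /tmp/ray_skypilot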
@concretevitamin (Member) commented on the diff:

Is it ok that our Ray continues to use --object-manager-port=8076?

@infwinston (Member, Author) commented Apr 2, 2023

Is it ok that our Ray continues to use --object-manager-port=8076?

To my understanding from their docs, object-manager-port is randomly chosen at runtime by default (so the user's Ray will use a port different from 8076), so it shouldn't cause an issue. But @Michaelvll, do you know why we use the specific port 8076 in the first place?

I'll upload an AutoGluon example I ran before.

Back compat should be fine as long as we keep our dashboard running on the same port 8265, which is what we use to submit our Ray jobs. However, if a user's Ray app needs the dashboard (default port: 8265), there will still be a conflict. We may need to decide whether to change our dashboard port; each option comes with pros/cons.

Btw, I'm also blocked by this issue in the Vicuna project, where I use Ray to distribute work across multiple GPUs. Right now my sky YAML fails to run my app, so we may want to merge this PR soon.

@Michaelvll self-requested a review Apr 8, 2023
@Michaelvll (Collaborator) commented Apr 11, 2023

For backward compatibility of the 8265 port: in the new implementation, we can start the Ray cluster with dashboard port 8266 and note the port down in a file. When submitting a task or querying the Ray cluster for job status, we can read that file first to decide which port to use.
In the ray YAML:
ray stop; RAY_SCHEDULER_EVENTS=0 ray start --disable-usage-stats --head --port={{ray_port}} --object-manager-port=8076 --dashboard-port 8266 --autoscaling-config=~/ray_bootstrap_config.yaml {{"--resources='%s'" % custom_resources if custom_resources}} --temp-dir {{ray_temp_dir}} && echo '8266' > ~/.sky/ray-dashboard-port || exit 1;

For the job submission here:

job_submit_cmd = (
f'{cd} && mkdir -p {remote_log_dir} && ray job submit '
f'--address=http://127.0.0.1:8265 --submission-id {ray_job_id} '
'--no-wait '
f'"{executable} -u {script_path} > {remote_log_path} 2>&1"')

We can do something like the following:

job_submit_cmd = (
    # Read the recorded dashboard port; fall back to the legacy 8265 for
    # clusters launched before this change. (Plain strings here, so the
    # shell's ${...} braces don't need f-string escaping.)
    'RAY_DASHBOARD_PORT=$(cat ~/.sky/ray-dashboard-port 2> /dev/null); '
    'RAY_DASHBOARD_PORT=${RAY_DASHBOARD_PORT:-8265}; '
    f'{cd} && mkdir -p {remote_log_dir} && ray job submit '
    f'--address=http://127.0.0.1:$RAY_DASHBOARD_PORT --submission-id {ray_job_id} '
    '--no-wait '
    f'"{executable} -u {script_path} > {remote_log_path} 2>&1"')

@iojw (Collaborator) commented Apr 13, 2023

Tested and confirmed that the following backward compatibility cases work:

  • exec on an existing, up cluster: as discussed, this depends on the job manager port, so changing the GCS port does not affect job submission
  • start on an existing stopped cluster: this re-provisions the cluster with the new GCS port, so there is no issue here

@Michaelvll linked an issue May 9, 2023 that may be closed by this pull request
@Michaelvll (Collaborator) commented May 9, 2023

A kind bump for this PR @infwinston. Any updates?

@infwinston (Member, Author) commented:

Will get back to the office and look into this tomorrow.

@infwinston (Member, Author) commented May 13, 2023

I've updated this PR to ensure backward compatibility, and tested it. Seems to work fine.

@Michaelvll (Collaborator) left a comment

Thanks for the fix @infwinston! This would be very useful for users with their own ray cluster. Left several comments.

Comment on lines +851 to +852
port_dict_str = f'{{"ray_port":{ray_port}, "ray_dashboard_port":{ray_dashboard_port}}}'
dump_port_command = f'python -c \'import json, os; json.dump({port_dict_str}, open(os.path.expanduser("{constants.SKY_REMOTE_RAY_PORT_FILE}"), "w"))\''
@Michaelvll (Collaborator) commented:

nit: Since we are already using python, should we just do the following:

port_json = json.dumps({'ray_port': ray_port, 'ray_dashboard_port': ray_dashboard_port})
dump_port_command = f'echo {port_json!r} > {constants.SKY_REMOTE_RAY_PORT_FILE}'

@infwinston (Member, Author) replied:

Ah, I tried this before. The issue is that json.dumps adds a space between ":" and the value, which causes the YAML parser to fail... we need "{'ray_port':6380}", not "{'ray_port': 6380}".

I've added some comments to explain it.
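For the record, the space is standard json.dumps behavior, and the stdlib can also suppress it via the separators argument (whether that fits the templating here is a separate question); runnable as-is:

import json

d = {'ray_port': 6380, 'ray_dashboard_port': 8266}
# Default separators are (', ', ': '), so a space follows each colon.
print(json.dumps(d))                         # {"ray_port": 6380, "ray_dashboard_port": 8266}
print(json.dumps(d, separators=(',', ':')))  # {"ray_port":6380,"ray_dashboard_port":8266}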

@Michaelvll (Collaborator) commented May 27, 2023

Ahh, got it. I suppose if we add a | after the - in the head_start_ray_commands, the problem should be fixed. :)
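A quick sketch of that point (pyyaml assumed; the port-file name is made up for illustration): with a literal block scalar ('|'), YAML keeps the whole command verbatim, so the ': ' inside the JSON can no longer be misread as YAML structure:

import yaml

doc = '''
head_start_ray_commands:
  - |
    echo '{"ray_port": 6380, "ray_dashboard_port": 8266}' > ~/.sky/ray-port.json
'''
# Prints the echo line exactly as written, ': ' and all.
print(yaml.safe_load(doc)['head_start_ray_commands'][0])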

Comment on lines 2656 to 2657
'RAY_DASHBOARD_PORT=$(jq ".ray_dashboard_port // empty" '
f'{ray_port_file} 2> /dev/null); '
@Michaelvll (Collaborator) commented:

Is jq installed by default on all the images we use across clouds? If not, we may want to use python instead. Something like the following:

'RAY_DASHBOARD_PORT=$(python -c "import json; print(json.load(open(\'{ray_port_file}\')).get(\'ray_dashboard_port\', 8265))" 2> /dev/null || echo 8265)'

or just

'RAY_DASHBOARD_PORT=$(python -c "from sky.skylet import job_lib; print(job_lib.get_job_submission_port())" 2> /dev/null || echo 8265)'

@infwinston (Member, Author) replied:

Good point! Updated.

address={ray_address!r},
namespace='__sky__{job_id}__',
log_to_driver=True,
_temp_dir={constants.SKY_REMOTE_RAY_TEMPDIR!r}
@Michaelvll (Collaborator) commented:

Will this work if we submit a job through sky exec to an existing cluster whose Ray is running under the default _temp_dir? Should we check the existence of constants.SKY_REMOTE_RAY_TEMPDIR on the remote VM before we ray.init?

kwargs = {}
# Only use the custom temp dir if it exists on the remote VM; clusters
# provisioned before this change still run Ray under the default temp dir.
if os.path.exists(constants.SKY_REMOTE_RAY_TEMPDIR):
    kwargs['_temp_dir'] = constants.SKY_REMOTE_RAY_TEMPDIR
ray.init(..., **kwargs)

@infwinston (Member, Author) replied:

Good point. Added!

@infwinston (Member, Author) commented May 27, 2023

I've merged master and addressed the comments.

Tested:

  • sky launch examples/autogluon.yaml works.
  • Backward compatibility: I switched to master, launched a VM, then switched back to this branch; sky exec still works.
  • All smoke tests passed.

@Michaelvll could you check if there's anything left? Thanks!

@Michaelvll (Collaborator) left a comment

Thanks for the fix @infwinston! LGTM. It should be good to go once the tests pass.

@infwinston (Member, Author) commented:
Thanks a lot for the reviews @Michaelvll! All the smoke tests have now passed. Merging this.
