-
Notifications
You must be signed in to change notification settings - Fork 6.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Core][deprecate run_on_all_workers 1/n] set worker's sys.path through JobConfig._py_driver_sys_path #31383
Conversation
@@ -2027,6 +2027,29 @@ def connect( | |||
runtime_env.pop("excludes", None) | |||
job_config.set_runtime_env(runtime_env) | |||
|
|||
if mode == SCRIPT_MODE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you help me understand why this block needed to move before core worker initializatilon?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are two (relevant) types of workers: drivers and task/actor execution workers.
the job_config flow from drivers -> raylet (worker_pool) -> executor-workers, both implicitly happens through coreworker constructor worker.core_worker = ray._raylet.CoreWorker
, where driver set the job_config to raylet and executor receive the job_config from the raylet.
That's why we need to change it before core_worker initialization, as otherwise the job_config has already being sent to raylet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
btw how executor-worker get job_config will change by #31375 soon
Thanks for fixing this. As synced offline, my only concern is that we make the code_search_path and the new introduced ones doing similar works. Maybe we should try to unify them.
Not very strong opinion here if this require a lot of work. |
Btw, could you also delete import thread which is not useful anymore with this work. |
Prototyped it and found it a bit more confusing to me... I'd propose we stick with with current implementation, since ultimately we will replace it by runtime env. |
…h JobConfig._py_driver_sys_path (ray-project#31383) Why are these changes needed? Today we use run_on_all_workers to set worker's system path, where the run_on_all_workers suffers from weak ordering guarantees and will be deprecated. Instead, we should use JobConfig._py_driver_sys_path to set worker's system path, where the worker will add these paths into its sys.path on startup. Note we have JobConfig.code_search_path which servers similar functionality, however it uses a different load_code_from_local code path and behaves differently and introduced bugs that failed tests (ray-project#17605 (comment)).
…h JobConfig._py_driver_sys_path (#31383) Why are these changes needed? Today we use run_on_all_workers to set worker's system path, where the run_on_all_workers suffers from weak ordering guarantees and will be deprecated. Instead, we should use JobConfig._py_driver_sys_path to set worker's system path, where the worker will add these paths into its sys.path on startup. Note we have JobConfig.code_search_path which servers similar functionality, however it uses a different load_code_from_local code path and behaves differently and introduced bugs that failed tests (#17605 (comment)).
…h JobConfig._py_driver_sys_path (#31383) Why are these changes needed? Today we use run_on_all_workers to set worker's system path, where the run_on_all_workers suffers from weak ordering guarantees and will be deprecated. Instead, we should use JobConfig._py_driver_sys_path to set worker's system path, where the worker will add these paths into its sys.path on startup. Note we have JobConfig.code_search_path which servers similar functionality, however it uses a different load_code_from_local code path and behaves differently and introduced bugs that failed tests (#17605 (comment)).
…h JobConfig._py_driver_sys_path (ray-project#31383) Why are these changes needed? Today we use run_on_all_workers to set worker's system path, where the run_on_all_workers suffers from weak ordering guarantees and will be deprecated. Instead, we should use JobConfig._py_driver_sys_path to set worker's system path, where the worker will add these paths into its sys.path on startup. Note we have JobConfig.code_search_path which servers similar functionality, however it uses a different load_code_from_local code path and behaves differently and introduced bugs that failed tests (ray-project#17605 (comment)). Signed-off-by: mohamed <mohamed.nidabdella@tweag.io>
…h JobConfig._py_driver_sys_path (ray-project#31383) Why are these changes needed? Today we use run_on_all_workers to set worker's system path, where the run_on_all_workers suffers from weak ordering guarantees and will be deprecated. Instead, we should use JobConfig._py_driver_sys_path to set worker's system path, where the worker will add these paths into its sys.path on startup. Note we have JobConfig.code_search_path which servers similar functionality, however it uses a different load_code_from_local code path and behaves differently and introduced bugs that failed tests (ray-project#17605 (comment)). Signed-off-by: tmynn <hovhannes.tamoyan@gmail.com>
…_on_all_workers (ray-project#30895) This function is deprecated. The only use of run_funcion_on_all_workers in ray core has been replaced by ray-project#31383 Delete it to make our worker prestart change simpler. This PR depends on ray-project#31383 ray-project#31528 Signed-off-by: e428265 <arvind.chandramouli@lmco.com>
In #31383 we add _py_driver_sys_path to all workers, making them search from the driver's path. This makes the workers able to find modules near the driver file. However, in these cases it's not working: if there's working_dir. The code should search from working_dir, not from the driver's dir. e.g. lib code in driver's dir can update after job submission and before worker starts, making the runs inconsistent among workers. if the worker is in a different node than the driver's node. The code would search "the driver's dir as in the node where the driver runs in", but in a different node. If there happen to be a stale code file in the worker node, the stale code is used. (already handled) if it's dashboard, it's client mode, or it's interactive mode. This PR addresses the first issue. Left a TODO for the second issue. Part of #42863
Why are these changes needed?
Today we use
run_on_all_workers
to set worker's system path, where therun_on_all_workers
suffers from weak ordering guarantees and will be deprecated.Instead, we should use JobConfig._py_driver_sys_path to set worker's system path, where the worker will add these paths into its sys.path on startup.
Note we have JobConfig.code_search_path which servers similar functionality, however it uses a different
load_code_from_local
code path and behaves differently and introduced bugs that failed tests (#17605 (comment)).Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.