No output in windows machine. #15970
Check my answer: I got the same problem as you, and I tried stopping all the Ray processes and restarting them. After that, it works. |
@maycuatroi Thank you for taking the time to answer my issue.
I have tried running these commands in that order, but I still did not receive any output. Running the second command produces this output.
I have placed the
Is there anything I'm missing or doing wrong? |
There can only be one instance of ray runtime, started using |
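A minimal sketch of the stop-and-restart advice above (assuming the standard Ray CLI and Python API): run `ray stop` from a terminal to kill any leftover Ray processes, optionally `ray start --head` to start a fresh head node, and then in Python:

import ray

# Make sure this process is not still attached to a stale runtime.
if ray.is_initialized():
    ray.shutdown()

# Start (or connect to) a single local Ray instance.
ray.init()
print(ray.cluster_resources())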
I do not know how to start only one instance (I just ran

2021-06-14 02:56:12,391 INFO worker.py:640 -- Connecting to existing Ray cluster at address: 192.168.1.100:6379
2021-06-14 02:56:30,856 WARNING worker.py:1115 -- The actor or task with ID a67dc375e60ddd1affffffffffffffffffffffff01000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {64.000000/64.000000 CPU, 19.777052 GiB/19.777052 GiB memory, 1.000000/1.000000 GPU, 9.888526 GiB/9.888526 GiB object_store_memory, 1.000000/1.000000 node:192.168.1.100}. In total there are 1 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale. |
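A quick way to see what the cluster believes is free versus claimed is the public resource API; a hedged sketch (resource names and totals depend on the machine):

import ray

# Connect to the already-running cluster rather than starting a new one.
ray.init(address="auto")

print(ray.cluster_resources())    # everything registered with the cluster
print(ray.available_resources())  # what is currently unclaimed by tasks/actors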
I can reproduce both a working version and a version of the demo that hangs. In both cases, when starting the demo I see two

More details: I built Ray locally from HEAD, and the demo works:
The same demo does not work when using the wheel I downloaded with pip install (using a stock Python 3.8).
The demo (repeated from above):
|
In
There was no error in |
A couple more things I noticed when working on this with @czgdp1807:
|
I built

In this case I cloned

c:\users\gagan\gsingh\ray\python\ray\_private\services.py:238: UserWarning: Not all Ray Dashboard dependencies were found. To use the dashboard please install Ray using `pip install ray[default]`. To disable this message, set RAY_DISABLE_IMPORT_WARNING env var to '1'.
  warnings.warn(warning_message)
sys.version_info(major=3, minor=8, micro=11, releaselevel='final', serial=0)
2.0.0.dev0
[0, 1, 4, 9]

C:\ProgramData\Anaconda3\envs\ray_stable\lib\site-packages\ray\_private\services.py:238: UserWarning: Not all Ray Dashboard dependencies were found. To use the dashboard please install Ray using `pip install ray[default]`. To disable this message, set RAY_DISABLE_IMPORT_WARNING env var to '1'.
  warnings.warn(warning_message)
sys.version_info(major=3, minor=8, micro=11, releaselevel='final', serial=0)
1.6.0
[0, 1, 4, 9]

OS Information
Edition: Windows 10 Pro
Version: 20H2
Installed on: 9/2/2021
OS build: 19042.1165
Experience: Windows Feature Experience Pack 120.2212.3530.0

I can try doing a clean build and update the status here. Let me know.

Update - Works fine with 1.3.0 and 1.0.0 as well. |
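The exact test script was not captured here, but the [0, 1, 4, 9] output matches the standard Ray quickstart demo; a hypothetical reconstruction (not necessarily the reporter's actual code):

import sys
import ray

print(sys.version_info)
print(ray.__version__)

ray.init()

@ray.remote
def f(x):
    return x * x

# Squares of 0..3 -> [0, 1, 4, 9]
print(ray.get([f.remote(i) for i in range(4)]))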
Using @czgdp1807's setup, I can see a difference between a working environment and a failing one: in the working environment, |
Building ray-1.6.0 from source with a virtualenv (not a conda env) does not pass the demo. Here are the steps; it seems bazel is very sensitive to the
This hangs when running
So it seems the worker_pool thinks it is starting a worker process with pid 16980, but that process thinks its pid is 3020. |
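To make the PID mismatch concrete, here is a standalone illustration (plain Python, not Ray code) of how a shim that re-launches the interpreter leaves the parent holding a different PID than the one the real worker reports, which is the pattern a virtualenv launcher can follow on Windows:

import subprocess
import sys

# The parent (playing the role of the raylet/worker_pool) launches what it
# believes is the worker and records Popen.pid.
shim_code = (
    "import subprocess, sys; "
    # The "shim" immediately re-launches the real worker as its own child.
    "subprocess.run([sys.executable, '-c', "
    "'import os; print(\"pid the worker reports:\", os.getpid())'])"
)
shim = subprocess.Popen([sys.executable, "-c", shim_code])
shim.wait()

# The PID the parent stored is the shim's, not the worker's.
print("pid recorded by parent:", shim.pid)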
I am able to reproduce the issue in a virtual env on my system (described in #15970 (comment)) |
On my system the code hangs because the following line is getting repeated in the logs:

...
[2021-09-20 19:03:23,889 W 9432 2156] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-20 19:03:33,951 W 9432 2156] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-20 19:03:44,014 W 9432 2156] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-20 19:03:54,015 W 9432 2156] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-20 19:04:04,109 W 9432 2156] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-20 19:04:14,203 W 9432 2156] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
...
Upon digging a bit, I found that the above log warning is coming from this call. So I applied the following changes and the code doesn't hang anymore. The final diff:

diff --git a/src/ray/stats/metric_exporter.cc b/src/ray/stats/metric_exporter.cc
index fea2ee4f8..370028f47 100644
--- a/src/ray/stats/metric_exporter.cc
+++ b/src/ray/stats/metric_exporter.cc
@@ -206,6 +206,9 @@ void OpenCensusProtoExporter::ExportViewData(
       RAY_LOG(WARNING)
           << "Export metrics to agent failed: " << status
           << ". This won't affect Ray, but you can lose metrics from the cluster.";
+      if (status.ShouldExitWorker() || status.IsIOError()) {
+        _Exit(1);
+      }
     }
   });
 }

The above diff is not a fix. It just shows that the code hangs somewhere due to the repeated logging of the above message. We need to exit the process that is doing this if we are exiting the other processes as well, i.e., if this line is executed.

A common IOError due to connection failure issues (not relevant to the hanging of

[2021-09-20 19:14:27,458 E 2580 10464] core_worker.cc:411: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. Unknown error

In the conda environment the IOError is never thrown, and hence the metric exporter never hangs, so the processes exit normally. |
Gist - https://gist.github.com/czgdp1807/75e9850b8143ba0b789cf587436a29b7

The above gist is the result of a bunch of experiments that I did after observing two things:
A brief analysis and summary of the above details,
The reason for the code hanging appears to be returning a status when an unknown worker shim process is encountered. The registration timeout is not associated with the code hanging, as increasing

Possible Directions - The configurations in this file should be platform specific.

Questions - Are unknown shim processes critical enough to leave the |
Hmm, this seems like a sign that there's something internally inconsistent. For reference, we start worker processes around here:

ray/src/ray/raylet/worker_pool.cc Line 372 in 7c99aae
I wonder if there's something Windows-specific that we aren't taking into account here? |
Btw, do you have any intuition on whether 450ms to start a worker process sounds reasonable? It's surprisingly high, but then again, I'm not a Windows developer. In any case, if there's no issue other than startup speed, we can probably just bump that timeout to 1s or something. |
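For experimenting with a longer timeout without rebuilding, one option may be overriding the raylet config at startup; a sketch, assuming the `worker_register_timeout_seconds` key from Ray's config defaults (the exact key, and whether it governs this particular timeout, should be verified against the version in use):

import ray

# _system_config is an internal/experimental knob; the key name below is an
# assumption based on Ray's config defaults, not a confirmed fix for this issue.
ray.init(_system_config={"worker_register_timeout_seconds": 60})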
Yeah. Though, while registering the worker,
Same. I have never used Windows before. Though the primary reason for this, from my perspective, is that Windows performs security checks on every process, and hence the startup time increases. |
Inconsistency - The output of

The above inconsistency only happens in the virtual environment. In the conda environment both things are the same, and that is probably why the code doesn't hang there. |
Ah! Exactly. I was thinking the same when I was fixing this in the morning. First a Python process is launched whose PID is stored inside |
Well, it seems like the above theory turned out to be true. After the following block in

Lines 123 to 126 in 72cc0c9
What to do next?
ray/src/ray/raylet/worker_pool.cc Lines 507 to 510 in 7c99aae
cc: @wuisawesome |
@czgdp1807 Since the most common failure is folks installing via pip in a venv, I think your second option is a better approach. Does your trick with |
If we are going for the second approach (i.e., a proper fix instead of just documentation), then I would suggest the following be done.
ray/src/ray/raylet/worker_pool.cc Lines 507 to 510 in 7c99aae
|
This should be fixed by #19014 now |
What is the problem?
I can't run ray code on my Windows 10 (Build:19041) system.
ray==1.3.0 and ray==2.0.0.dev0
python==3.8.10
Reproduction (REQUIRED)
Code used for testing.
The terminal remains blank and no output is generated.
I tried running it using the Python Console and got this output that repeats every second indefinitely.
I have installed VCRUNTIME140_1.dll on my Windows machine.
Terminal Output with nightly version: