No output in windows machine. #15970

Closed
2 tasks done
RashiqAzhan opened this issue May 21, 2021 · 24 comments
Labels: bug (Something that is supposed to be working; but isn't), P1 (Issue that should be fixed within a few weeks), windows

Comments

@RashiqAzhan

RashiqAzhan commented May 21, 2021

What is the problem?

I can't run ray code on my Windows 10 (Build:19041) system.
ray==1.3.0 and ray==2.0.0.dev0
python==3.8.10

Reproduction (REQUIRED)

Code used for testing.

import ray
ray.init(include_dashboard=False)

@ray.remote
def f(x):
    return x * x

futures = [f.remote(i) for i in range(4)]
print(ray.get(futures))

The terminal remains blank and no output is generated.

I tried running it using the Python Console and got this output that repeats every second indefinitely.

Waiting longer for RedisShards to be populated.

I have installed VCRUNTIME140_1.dll on my Windows machine.

Terminal Output with nightly version:

[2021-05-21 05:42:41,959 C 21012 12748] service_based_gcs_client.cc:74:  Check failed: GetGcsServerAddressFromRedis( redis_client_->GetPrimaryContext()->sync_context(), &current_gcs_server_address_, RayConfig::instance().gcs_service_connect_retries()) Failed to get gcs server address when init gcs client.
*** StackTrace Information ***
    @   00007FFF6441C22A  public: void __cdecl google::NullStreamFatal::`vbase destructor'(void) __ptr64
    @   00007FFF6441ACA1  public: virtual __cdecl google::NullStreamFatal::~NullStreamFatal(void) __ptr64
    @   00007FFF6428C753  PyInit__raylet
    @   00007FFF641955DC  PyInit__raylet
    @   00007FFF640DD5BF  (unknown)
    @   00007FFF917E40FE  _PyEval_EvalFrameDefault
    @   00007FFF917DE618  _PyEval_EvalCodeWithName
    @   00007FFF917E39B2  _PyEval_EvalFrameDefault
    @   00007FFF917E2D24  _PyEval_EvalFrameDefault
    @   00007FFF917E2D24  _PyEval_EvalFrameDefault
    @   00007FFF917DE618  _PyEval_EvalCodeWithName
    @   00007FFF917DFD5F  _PyFunction_Vectorcall
    @   00007FFF917E6B9E  _PyEval_EvalFrameDefault
    @   00007FFF917DE618  _PyEval_EvalCodeWithName
    @   00007FFF917DFD5F  _PyFunction_Vectorcall
    @   00007FFF917E6B9E  _PyEval_EvalFrameDefault
    @   00007FFF917DE618  _PyEval_EvalCodeWithName
    @   00007FFF917DFD5F  _PyFunction_Vectorcall
    @   00007FFF918AA48D  PyFloat_GetInfo
    @   00007FFF918068B6  PyTuple_New
    @   00007FFF918053AA  _PyObject_MakeTpCall
    @   00007FFF917E7E59  _PyEval_EvalFrameDefault
    @   00007FFF917DE618  _PyEval_EvalCodeWithName
    @   00007FFF917DFD5F  _PyFunction_Vectorcall
    @   00007FFF917FA6A2  PyVectorcall_Call
    @   00007FFF917FA533  PySequence_GetItem
    @   00007FFF917E67A5  _PyEval_EvalFrameDefault
    @   00007FFF917DE618  _PyEval_EvalCodeWithName
    @   00007FFF917DFD5F  _PyFunction_Vectorcall
    @   00007FFF917E6B9E  _PyEval_EvalFrameDefault
    @   00007FFF917DE618  _PyEval_EvalCodeWithName
    @   00007FFF917F315B  PyEval_EvalCodeEx
Process finished with exit code -1073740791 (0xC0000409)
  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
@RashiqAzhan RashiqAzhan added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 21, 2021
@maycuatroi

Check my answer: I had the same problem as you, so I stopped all the Ray processes and restarted them. After that, it worked.

#16137 (comment)

@RashiqAzhan
Author

RashiqAzhan commented Jun 1, 2021

@maycuatroi Thank you for taking the time to answer my issue.

ray stop --force
ray start --head

I have tried running these commands in that order, but I still did not receive any output. Running the second command produces this output.

Local node IP: 192.168.1.100
2021-06-01 14:04:31,365 INFO services.py:1267 -- View the Ray dashboard at http://127.0.0.1:8265

--------------------
Ray runtime started.
--------------------

Next steps
  To connect to this Ray runtime from another node, run
    ray start --address='192.168.1.100:6379' --redis-password='5241590000000000'

  Alternatively, use the following Python code:
    import ray
    ray.init(address='auto', _redis_password='5241590000000000')

  If connection fails, check your firewall settings and network configuration.

  To terminate the Ray runtime, run
    ray stop

I placed the address and _redis_password in my Python file's ray.init argument list and then ran the script.
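
Roughly, the modified script looked like this (a sketch; the address and password are the ones printed by ray start above):

import ray

# Connect to the already-running cluster started by `ray start --head`
# (values copied from the "Next steps" output above).
ray.init(address='192.168.1.100:6379', _redis_password='5241590000000000')

@ray.remote
def f(x):
    return x * x

futures = [f.remote(i) for i in range(4)]
print(ray.get(futures))

This produced the output below.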

2021-06-01 14:10:14,842 INFO worker.py:640 -- Connecting to existing Ray cluster at address: 192.168.1.100:6379
Waiting longer for RedisShards to be populated.
Waiting longer for RedisShards to be populated.
Waiting longer for RedisShards to be populated.
Waiting longer for RedisShards to be populated.
Waiting longer for RedisShards to be populated.
Waiting longer for RedisShards to be populated.
Waiting longer for RedisShards to be populated.
Waiting longer for RedisShards to be populated.
# Repeat "Waiting longer for RedisShards to be populated." indefinitely.

Is there anything I'm missing or doing wrong?

@saihas

saihas commented Jun 2, 2021

There can only be one instance of the Ray runtime, started using ray.init(), running on your OS at a time. Make sure that you have initialized only one and run again.
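
For example, a minimal guard before initializing (just a sketch; adjust to your setup):

import ray

# If a Ray runtime is already attached to this process, shut it down first
# so that only a single instance is active when ray.init() is called.
if ray.is_initialized():
    ray.shutdown()
ray.init(include_dashboard=False)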

@RashiqAzhan
Author

I don't know how to ensure only one instance is started (I only ran ray start --head once), but I restarted my device and repeated my last attempt. Now I get a different output in the terminal.

2021-06-14 02:56:12,391	INFO worker.py:640 -- Connecting to existing Ray cluster at address: 192.168.1.100:6379
2021-06-14 02:56:30,856	WARNING worker.py:1115 -- The actor or task with ID a67dc375e60ddd1affffffffffffffffffffffff01000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {64.000000/64.000000 CPU, 19.777052 GiB/19.777052 GiB memory, 1.000000/1.000000 GPU, 9.888526 GiB/9.888526 GiB object_store_memory, 1.000000/1.000000 node:192.168.1.100}
. In total there are 1 pending tasks and 0 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale.

@richardliaw richardliaw added P3 Issue moderate in impact or severity and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 2, 2021
@pcmoritz pcmoritz added P2 Important issue, but not time-critical P1 Issue that should be fixed within a few weeks and removed P2 Important issue, but not time-critical P3 Issue moderate in impact or severity labels Jul 30, 2021
@mattip
Contributor

mattip commented Sep 2, 2021

I can reproduce both a working version and a hanging version of the demo. In both cases, when starting the demo I see two redis-server.exe processes, a raylet.exe process, and multiple python processes in Task Manager. How would I go about debugging the interactions between the processes to see what is going on?

More details:

I built ray locally from HEAD, and the demo works:

conda create -n ray_dev python=3.8
conda activate ray_dev
conda install bazel=3.6.0
conda install cython=0.29.24 pytest pip
set BAZEL_SH=%CONDA_PREFIX%\Library\usr\bin\bash.exe
pip install .\python --use-feature=in-tree-build 

The same demo does not work when using the wheel I downloaded with pip install (using a stock Python 3.8):

python -m venv \temp\ray_venv
\temp\ray_venv\Scripts\activate.bat
pip install ray   # installs ray 1.6.0

The demo (repeated from above)

import ray
ray.init(include_dashboard=False)

@ray.remote
def f(x):
    import time
    return x * x

futures = [f.remote(i) for i in range(4)]
print(ray.get(futures))

@mattip
Contributor

mattip commented Sep 2, 2021

In the %TEMP%/ray/session_latest/logs/python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_9036.log file I see this error:

core_worker.cc:408: Failed to register worker \
01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. \
IOError: [RayletClient] Unable to register worker with raylet. Unknown error

There was no error in raylet.err or raylet.out and no real difference between those log files on the successful and failing runs.
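
For reference, a throwaway snippet (not part of Ray) to dump the tail of the driver logs mentioned above:

import glob
import os

# Ray writes the logs for the latest session under %TEMP%\ray\session_latest\logs.
log_dir = os.path.join(os.environ["TEMP"], "ray", "session_latest", "logs")
for path in glob.glob(os.path.join(log_dir, "python-core-driver-*.log")):
    print("====", path)
    with open(path, errors="replace") as f:
        print("".join(f.readlines()[-20:]))  # last 20 lines of each driver log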

@mattip
Contributor

mattip commented Sep 3, 2021

A couple more things I noticed when working on this with @czgdp1807:

  • When doing ray start --head --include-dashboard False, the pip version starts the dashboard agent and log monitors (seen when calling ray stop -v): 10 processes are started in total. When doing the same thing with the local build, no dashboard agent or log monitors are started: 6 processes in total.
  • The first difference in python-core-driver-*.log between a successful run and a failing run is a line saying "grpc_server.cc:92: driver server started, listening on port 51388.", which is missing from the failed run. But I do see a grpc_server.exe running.

@czgdp1807
Contributor

czgdp1807 commented Sep 4, 2021

I built ray from master and also installed it from pip, in different conda environments with Python 3.8. The demo works in both cases and gives the same output (see below).

master

In this case I cloned ray using git clone -c core.symlinks=true https://github.com/ray-project/ray.git, and the rest of the steps were the same as in #15970 (comment).

c:\users\gagan\gsingh\ray\python\ray\_private\services.py:238: UserWarning: Not all Ray Dashboard dependencies were found. To use the dashboard please install Ray using `pip install ray[default]`. To disable this message, set RAY_DISABLE_IMPORT_WARNING env var to '1'.
  warnings.warn(warning_message)
sys.version_info(major=3, minor=8, micro=11, releaselevel='final', serial=0)
2.0.0.dev0
[0, 1, 4, 9]

pip

C:\ProgramData\Anaconda3\envs\ray_stable\lib\site-packages\ray\_private\services.py:238: UserWarning: Not all Ray Dashboard dependencies were found. To use the dashboard please install Ray using `pip install ray[default]`. To disable this message, set RAY_DISABLE_IMPORT_WARNING env var to '1'.
  warnings.warn(warning_message)
sys.version_info(major=3, minor=8, micro=11, releaselevel='final', serial=0)
1.6.0
[0, 1, 4, 9]

OS Information

Edition	Windows 10 Pro
Version	20H2
Installed on	‎9/‎2/‎2021
OS build	19042.1165
Experience	Windows Feature Experience Pack 120.2212.3530.0

I can try doing a clean build and update the status here. Let me know.

Update - Works fine with 1.3.0, 1.0.0 as well.

@mattip
Contributor

mattip commented Sep 12, 2021

Using @czgdp1807's setup, I can see a difference between a working environment and a failing one: in the working environment, ray.init(include_dashboard=False) starts up some python processes, raylet.exe and gcs_server.exe, a single redis_server.exe, and a plasma_store_server.exe. In the failing setup, I see python processes, raylet.exe and gcs_server.exe, two redis_server.exe and no plasma_store_server.exe. I wonder how the plasma_store_server.exe is being switched for a second redis_server.exe.

@mattip
Contributor

mattip commented Sep 12, 2021

plasma_store_server.exe was removed in #14950; @czgdp1807's setup uses ray 1.2.0, which still used the exe form and not the threaded form.

@mattip
Contributor

mattip commented Sep 12, 2021

Building ray-1.6.0 from source with a virtualenv (not a conda env) does not pass the demo. Here are the steps. It seems bazel is very sensitive to the bash.exe used; the one from WSL2 did not work.

  1. Download https://github.com/bazelbuild/bazel/releases/download/3.6.0/bazel-3.6.0-windows-x86_64.zip and extract bazel.exe to C:\bazel\bazel.exe. Then set environment variables:
set BAZEL_PATH=c:\bazel\bazel.exe
set PATH=%PATH%;C:\bazel
  2. Set BAZEL_SH: set BAZEL_SH=C:\Program Files\Git\bin\bash.exe
  3. Create a virtualenv from a python downloaded from python.org. I used Python 3.8.10.
  4. Build ray as above: pip install .\python --use-feature=in-tree-build

This hangs when running python -c "import ray; ray.init(num_cpus=1, include_dashboard=False)". When looking at the logs in %TEMP%\ray\session_latest, I see a few differences between the successful run (using conda) and this failed run:

  • In debug_state.txt I see a section WorkerPool. The failed run has - num PYTHON drivers: 1, while the successful run has - num PYTHON drivers: 0.
  • logs\python-core-worker-<uid>_3020.log in the failed run reports core_worker.cc:408: Failed to register worker <uid> to Raylet. Invalid: Invalid: Unknown worker
  • in logs\raylet.out there is the other half of the error from above:
    worker_pool.cc:369: Started worker process of 1 worker(s) with pid 16980
    worker_pool.cc:511: Received a register request from an unknown worker shim process:3020
    

So it seems the worker_pool thinks it is starting a worker process with pid 16980, but that process somehow thinks its pid is 3020.

@czgdp1807
Contributor

I am able to reproduce the issue in a virtual env on my system (described in #15970 (comment))

@czgdp1807
Contributor

czgdp1807 commented Sep 20, 2021

On my system the code hangs while the following line is repeated in the logs:

.
.
.
[2021-09-20 19:03:23,889 W 9432 2156] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-20 19:03:33,951 W 9432 2156] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-20 19:03:44,014 W 9432 2156] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-20 19:03:54,015 W 9432 2156] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-20 19:04:04,109 W 9432 2156] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-09-20 19:04:14,203 W 9432 2156] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
.
.
.

Upon digging a bit, I found that the above log warning is coming from this call. So I applied the following change, and the code doesn't hang anymore. The final IOError is common to both cases (with and without the change).

diff --git a/src/ray/stats/metric_exporter.cc b/src/ray/stats/metric_exporter.cc
index fea2ee4f8..370028f47 100644
--- a/src/ray/stats/metric_exporter.cc
+++ b/src/ray/stats/metric_exporter.cc
@@ -206,6 +206,9 @@ void OpenCensusProtoExporter::ExportViewData(
           RAY_LOG(WARNING)
               << "Export metrics to agent failed: " << status
               << ". This won't affect Ray, but you can lose metrics from the cluster.";
+          if (status.ShouldExitWorker() || status.IsIOError()) {
+            _Exit(1);
+          }
         }
       });
 }

The above diff is not a fix. It just shows that the code hangs somewhere due to the repeated logging of the above message. We need to exit the process that is doing this if we are exiting the other processes as well, i.e., if this line is executed.

Common IOError due to connection failure (not relevant to the hanging of ray.init()):

[2021-09-20 19:14:27,458 E 2580 10464] core_worker.cc:411: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. Unknown error

In the conda environment the IOError is never thrown; hence the metric exporter never hangs and the processes exit normally.

@wuisawesome wuisawesome self-assigned this Sep 21, 2021
@czgdp1807
Contributor

Gist - https://gist.github.com/czgdp1807/75e9850b8143ba0b789cf587436a29b7

The above gist is the result of a bunch of experiments that I did after observing two things:

  1. The worker registration timeout limit is not the same for all platforms. As of now in master it is set to 30 ms. However, on Windows a worker process takes around 450 ms to start (give or take 10-20 ms; see duration.count() in the above gist). A reference to the config in ray is available here. I will refer to this configuration as worker_reg_timeout from now on.
  2. A warning is logged and an invalid status is returned when an unknown shim process is encountered. A reference to the code in ray is available here.

A brief analysis and summary of the above details:

  1. CASE 1 - worker_reg_timeout is set to 30 ms. An unknown worker shim process warning is received and a registration timeout can be seen (last line under raylet.out). The code hangs.
  2. CASE 2 - worker_reg_timeout is increased to 800 ms. The registration timeout is gone. The code still hangs.
  3. CASE 3 - worker_reg_timeout is set to 30 ms and an invalid status is not returned for an unknown worker shim process; only a warning is logged. A dict is returned as output and the code doesn't hang.
  4. CASE 4 - worker_reg_timeout is set to 30 ms but an OK status is returned as well. An unknown worker shim process warning is received and the code hangs. A registration timeout also happens.
  5. CASE 5 - worker_reg_timeout is set to 800 ms and an OK status is returned. An unknown worker shim warning is thrown and the code hangs.
  6. CASE 6 - worker_reg_timeout is set to 800 ms and only a warning is logged. Only an unknown worker shim warning is thrown. The code doesn't hang. Worker log files (see comments) are created and an output is received.

The reason for the hang appears to be the invalid status returned when an unknown worker shim process is encountered. The registration timeout is not associated with the hang, as increasing worker_reg_timeout just removes the timeout message from the logs. When we increase worker_reg_timeout and just log a warning for an unknown worker shim process (i.e., CASE 6), the code runs fine and the initialisation seems to complete successfully.

Possible Directions - The configurations in this file should be platform specific.

Questions - Are unknown shim processes critical enough to exit the WorkerPool::RegisterWorker function early by returning an invalid status? If they are not that critical, then state.starting_worker_processes can simply be updated when this condition (i.e., the shim process is unknown) is hit. Or an even simpler approach can be taken by just logging the warning and moving on.

@wuisawesome
Contributor

Are unknown shim processes critical enough to exit the WorkerPool::RegisterWorker function early by returning an invalid status?

Hmm this seems like a sign that there's something internally inconsistent.

For reference, we start worker processes around here:

state.starting_worker_processes.emplace(

I wonder if there's something windows-specific that we aren't taking into account here?

@wuisawesome
Contributor

Btw, do you have any intuition on whether 450 ms to start a worker process sounds reasonable to you? It's surprisingly high, but then again, I'm not a Windows developer. In any case, if there's no issue other than startup speed, we can probably just bump that timeout to 1 s or something.

@czgdp1807
Contributor

For reference, we start worker processes around here:

Yeah. Though, while registering the worker, worker_shim_pid/pid is not found in the starting_worker_processes map, which leads to the error. So the interesting thing to discover is how the registration of worker processes is connected to the starting of worker processes, and where worker_shim_pid/pid becomes incorrect.

It's a little surprisingly high, but then again, I'm not a windows developer.

Same. I have never used Windows before. Though, my best guess is that Windows performs security checks on every process it launches, which increases the startup time.

@czgdp1807
Contributor

czgdp1807 commented Sep 25, 2021

Inconsistency - The output of getpid in this line is different from pi.dwProcessId (here) for the same process. pid = pi.dwProcessId is guarded by #ifdef _WIN32 and #endif. So maybe this difference in PIDs is the cause of the whole problem on Windows?

The above inconsistency only happens in a virtual environment. In a conda environment both values are the same, which is probably why the code doesn't hang there.
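
A quick way to see this mismatch outside of Ray (a hypothetical standalone script, not Ray code): spawn the interpreter via subprocess.Popen (which uses CreateProcess on Windows, like the raylet does) and compare the PID the parent gets back with the PID the child reports for itself.

import subprocess
import sys

# Ask the child interpreter which PID it thinks it has.
proc = subprocess.Popen(
    [sys.executable, "-c", "import os; print(os.getpid())"],
    stdout=subprocess.PIPE,
)
out, _ = proc.communicate()

print("PID returned to the parent (pi.dwProcessId):", proc.pid)
print("PID reported by os.getpid() in the child:", int(out))
# In a Windows venv these can differ, because the venv's python.exe is a small
# launcher that starts the real base interpreter as another process.
# In a conda environment both numbers should match.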

@dharhas
Member

dharhas commented Sep 27, 2021

@czgdp1807
Contributor

czgdp1807 commented Sep 27, 2021

Ah! Exactly. I was thinking the same when I was fixing this in the morning. First a Python process is launched whose PID is stored inside ProcessFD::pid, and then this Python process launches another subprocess whose PID is given by getpid and is stored in pid and worker_shim_pid. This is a good theory for the above inconsistency. I will investigate it. Thanks for sharing the link.

@czgdp1807
Contributor

czgdp1807 commented Sep 27, 2021

Well, it seems the above theory turned out to be true. After the following block in the src/ray/util/process.cc file, I deliberately made args[0] point to my conda environment's interpreter, args[0] = "C:\\ProgramData\\Anaconda3\\envs\\ray_dev\\python.exe";, and the code worked without any issues. However, for environments that don't have all the dependencies installed (such as the root and base conda environments), using their python executables will not work.

ray/src/ray/util/process.cc

Lines 123 to 126 in 72cc0c9

std::vector<std::string> args;
for (size_t i = 0; argv[i]; ++i) {
args.push_back(argv[i]);
}

What to do next?

  1. Document this, along with an explanation of how to make ray work.
  2. Or, instead of checking the following condition, just check whether there is any process running on Windows with the ID worker_shim_process (maybe using GetExitCode() on Windows).

auto &state = GetStateForLanguage(worker->GetLanguage());
auto shim_process = Process::FromPid(worker_shim_pid);
worker->SetShimProcess(shim_process);
if (state.starting_worker_processes.count(shim_process) == 0) {

cc: @wuisawesome

@dharhas
Member

dharhas commented Sep 28, 2021

@czgdp1807 Since the most common failure is folks installing via pip in a venv, I think your second option is the better approach.

Does your trick with args[0] work with virtual envs?

@czgdp1807
Contributor

czgdp1807 commented Sep 28, 2021

If we are going for the second approach (i.e., a proper fix instead of just documentation), then I would suggest the following.

  • Consider the PID of the worker process and its parent to determine the truthfulness of the following condition.

auto &state = GetStateForLanguage(worker->GetLanguage());
auto shim_process = Process::FromPid(worker_shim_pid);
worker->SetShimProcess(shim_process);
if (state.starting_worker_processes.count(shim_process) == 0) {

  • The reason is that in a virtual environment, python.exe first launches a wrapper process, and then a child process is created, which is our worker process. So the pid/worker_shim_pid passed to NodeManager::ProcessRegisterClientRequestMessage via the message_data argument is that of the worker process, whereas the process stored as a key in state.starting_worker_processes is the wrapper process. In fact, when I checked the PID of the worker process's parent (via os.getppid()), it matched that of the wrapper process. Hence, we should append parent_pid to message_data and consider it as well when checking the truthfulness of the above condition.
  • A more general fix would be to enumerate all the children of the processes present in state.starting_worker_processes and then check whether any of them has the pid/worker_shim_pid. If so, we are good; otherwise return an Invalid status. Basically, this approach is the inverse of what I suggested above (a rough illustration of the idea follows this list). It might not be that easy to do because of the limitations of the Windows API, and I am not sure if it's theoretically possible.
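
A rough Python illustration of that parent/child check, using psutil (a hypothetical helper, not Ray code; the real fix would live in WorkerPool::RegisterWorker in C++):

import psutil  # third-party: pip install psutil

def registered_pid_is_known(worker_pid, starting_pids):
    """Return True if worker_pid, or any of its ancestors, is one of the PIDs
    the worker pool recorded when it launched worker processes."""
    try:
        proc = psutil.Process(worker_pid)
    except psutil.NoSuchProcess:
        return False
    # In a Windows venv the registering worker is a child of the wrapper
    # process that the worker pool actually started, so walk up the chain.
    while proc is not None:
        if proc.pid in starting_pids:
            return True
        proc = proc.parent()
    return False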

@pcmoritz
Contributor

This should be fixed by #19014 now
