What is the problem?
I have been trying to get Ray working on Windows for a few days now, but I keep running into the same problem: Ray hangs on init.
The following error message is logged in the worker log:
[2021-05-27 23:27:35,441 E 25468 3004] core_worker.cc:390: Failed to register worker 11baac3114b0e5ec6797733be05ecfeeb3cca79520cff01f14712d28 to Raylet. Invalid: Invalid: Unknown worker
The following is logged in raylet.out:
[2021-05-27 23:34:14,449 I 22100 25172] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2021-05-27 23:34:14,661 I 22100 25172] store_runner.cc:29: Allowing the Plasma store to use up to 1.85846GB of memory.
[2021-05-27 23:34:14,662 I 22100 25172] store_runner.cc:42: Starting object store with directory C:\Users\Bram\AppData\Local\Temp and huge page support disabled
[2021-05-27 23:34:14,664 I 22100 25172] grpc_server.cc:71: ObjectManager server started, listening on port 51896.
[2021-05-27 23:34:14,666 I 22100 25172] node_manager.cc:230: Initializing NodeManager with ID 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5
[2021-05-27 23:34:14,666 I 22100 25172] grpc_server.cc:71: NodeManager server started, listening on port 51898.
[2021-05-27 23:34:14,786 I 22100 25172] raylet.cc:146: Raylet of id, 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5 started. Raylet consists of node_manager and object_manager. node_manager address: 192.168.0.25:51898 object_manager address: 192.168.0.25:51896 hostname: 192.168.0.25
[2021-05-27 23:34:14,787 I 22100 15128] agent_manager.cc:76: Monitor agent process with pid 24972, register timeout 30000ms.
[2021-05-27 23:34:14,792 I 22100 25172] service_based_accessor.cc:579: Received notification for node id = 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5, IsAlive = 1
[2021-05-27 23:34:15,544 I 22100 25172] worker_pool.cc:289: Started worker process of 1 worker(s) with pid 18128
[2021-05-27 23:34:16,228 W 22100 25172] worker_pool.cc:418: Received a register request from an unknown worker 22252
[2021-05-27 23:34:16,230 I 22100 25172] node_manager.cc:1132: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-05-27 23:34:16,230 I 22100 25172] node_manager.cc:1146: Ignoring client disconnect because the client has already been disconnected.
[2021-05-27 23:34:26,551 W 22100 9452] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:34,612 W 22100 9452] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:44,700 W 22100 9452] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:44,788 W 22100 25172] agent_manager.cc:82: Agent process with pid 24972 has not registered, restart it.
[2021-05-27 23:34:44,789 W 22100 15128] agent_manager.cc:92: Agent process with pid 24972 exit, return value 1067
[2021-05-27 23:34:45,545 I 22100 25172] worker_pool.cc:315: Some workers of the worker process(18128) have not registered to raylet within timeout.
[2021-05-27 23:34:45,793 I 22100 18252] agent_manager.cc:76: Monitor agent process with pid 25032, register timeout 30000ms.
And this is logged in gcs_server.out:
[2021-05-27 23:34:14,163 I 7164 3876] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2021-05-27 23:34:14,165 I 7164 3876] gcs_redis_failure_detector.cc:30: Starting redis failure detector.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:44: Loading job table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:56: Loading node table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:68: Loading object table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:81: Loading cluster resources table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:108: Loading actor table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:94: Loading placement group table data.
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:48: Finished loading job table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:60: Finished loading node table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:73: Finished loading object table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:85: Finished loading cluster resources table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:112: Finished loading actor table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:99: Finished loading placement group table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_heartbeat_manager.cc:30: GcsHeartbeatManager start, num_heartbeats_timeout=300
[2021-05-27 23:34:14,385 I 7164 3876] grpc_server.cc:71: GcsServer server started, listening on port 51888.
[2021-05-27 23:34:14,391 I 7164 3876] gcs_server.cc:276: Gcs server address = 192.168.0.25:51888
[2021-05-27 23:34:14,392 I 7164 3876] gcs_server.cc:280: Finished setting gcs server address: 192.168.0.25:51888
[2021-05-27 23:34:14,392 I 7164 3876] gcs_server.cc:379: GcsNodeManager: {RegisterNode request count: 0, UnregisterNode request count: 0, GetAllNodeInfo request count: 0, GetInternalConfig request count: 0}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, KillActor request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
GcsPubSub:
- num channels subscribed to: 0
- total commands queued: 0
DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
[2021-05-27 23:34:14,786 I 7164 3876] gcs_node_manager.cc:34: Registering node info, node id = 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5, address = 192.168.0.25
[2021-05-27 23:34:14,786 I 7164 3876] gcs_node_manager.cc:39: Finished registering node info, node id = 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5, address = 192.168.0.25
[2021-05-27 23:34:14,792 I 7164 3876] gcs_job_manager.cc:93: Getting all job info.
[2021-05-27 23:34:14,792 I 7164 3876] gcs_job_manager.cc:99: Finished getting all job info.
[2021-05-27 23:34:15,544 I 7164 3876] gcs_job_manager.cc:26: Adding job, job id = 01000000, driver pid = 21792
[2021-05-27 23:34:15,544 I 7164 3876] gcs_job_manager.cc:36: Finished adding job, job id = 01000000, driver pid = 21792
[2021-05-27 23:34:26,246 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:34,310 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:44,398 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:54,467 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:35:04,538 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:35:14,392 I 7164 3876] gcs_server.cc:379: GcsNodeManager: {RegisterNode request count: 1, UnregisterNode request count: 0, GetAllNodeInfo request count: 3, GetInternalConfig request count: 1}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, KillActor request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
Ray version and other system information (Python version, TensorFlow version, OS):
Python version: 3.8.7
Ray: latest release for Windows
Reproduction (REQUIRED)
The problem occurs on this call.
If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".
I have verified my script runs in a clean environment and reproduces the issue.
I have verified the issue also occurs with the latest wheels.
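A minimal reproduction of the hang described above would look roughly like the sketch below (assuming the hang happens on a bare ray.init() call; the real script may differ):

import ray

# Sketch of a minimal reproduction (assumption: the hang occurs on a bare
# ray.init() call on Windows with Python 3.8 and the latest Ray wheel).
if __name__ == "__main__":
    ray.init()                      # hangs here; the worker log reports
                                    # "Failed to register worker ... to Raylet"
    print(ray.cluster_resources())  # never reached on the affected machine
    ray.shutdown()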
bramhoven added the bug (Something that is supposed to be working; but isn't) and triage (Needs triage (eg: priority, bug/not-bug, and owning component)) labels on May 28, 2021.
@maycuatroi That at least got me a bit further!
I now get the warning:
The actor or task with ID ffffffffffffffff9e1217f0b41378c338db8e2001000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {4.000000/4.000000 CPU, 3.636816 GiB/3.636816 GiB memory, 1.000000/1.000000 GPU, 1.818408 GiB/1.818408 GiB object_store_memory, 1.000000/1.000000 node:192.168.0.25}.
In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.
Any tips for this error message? I cannot seem to get it working even with different amounts of CPU and memory.
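For reference, the limits this warning checks are the ones declared at startup and per actor; a minimal sketch of how they are usually set (an illustrative example only, not the script from this report) is:

import ray

# Illustrative sketch only (not the script from this report): declaring node
# resources explicitly and the per-actor CPU request the warning refers to.
ray.init(num_cpus=4, num_gpus=1)

@ray.remote(num_cpus=1)   # each actor instance claims 1 of the 4 CPUs above
class Worker:
    def ping(self):
        return "pong"

w = Worker.remote()       # two pending actors like this would claim 2 CPUs
print(ray.get(w.ping.remote()))
ray.shutdown()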
richardliaw added the P3 (Issue moderate in impact or severity) label and removed the triage (Needs triage (eg: priority, bug/not-bug, and owning component)) label on Jul 2, 2021.