Cannot get ray to start on Windows (stuck on init) #16137

Closed
2 tasks done
bramhoven opened this issue May 28, 2021 · 3 comments
Labels
bug: Something that is supposed to be working; but isn't
P3: Issue moderate in impact or severity
windows

Comments

@bramhoven

What is the problem?

I have been trying to get Ray working on Windows for a few days now, but I keep running into the same problem.
Ray keeps hanging on init.
The following error message is logged in the worker log:

[2021-05-27 23:27:35,441 E 25468 3004] core_worker.cc:390: Failed to register worker 11baac3114b0e5ec6797733be05ecfeeb3cca79520cff01f14712d28 to Raylet. Invalid: Invalid: Unknown worker

The following is logged in raylet.out:

[2021-05-27 23:34:14,449 I 22100 25172] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2021-05-27 23:34:14,661 I 22100 25172] store_runner.cc:29: Allowing the Plasma store to use up to 1.85846GB of memory.
[2021-05-27 23:34:14,662 I 22100 25172] store_runner.cc:42: Starting object store with directory C:\Users\Bram\AppData\Local\Temp and huge page support disabled
[2021-05-27 23:34:14,664 I 22100 25172] grpc_server.cc:71: ObjectManager server started, listening on port 51896.
[2021-05-27 23:34:14,666 I 22100 25172] node_manager.cc:230: Initializing NodeManager with ID 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5
[2021-05-27 23:34:14,666 I 22100 25172] grpc_server.cc:71: NodeManager server started, listening on port 51898.
[2021-05-27 23:34:14,786 I 22100 25172] raylet.cc:146: Raylet of id, 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5 started. Raylet consists of node_manager and object_manager. node_manager address: 192.168.0.25:51898 object_manager address: 192.168.0.25:51896 hostname: 192.168.0.25
[2021-05-27 23:34:14,787 I 22100 15128] agent_manager.cc:76: Monitor agent process with pid 24972, register timeout 30000ms.
[2021-05-27 23:34:14,792 I 22100 25172] service_based_accessor.cc:579: Received notification for node id = 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5, IsAlive = 1
[2021-05-27 23:34:15,544 I 22100 25172] worker_pool.cc:289: Started worker process of 1 worker(s) with pid 18128
[2021-05-27 23:34:16,228 W 22100 25172] worker_pool.cc:418: Received a register request from an unknown worker 22252
[2021-05-27 23:34:16,230 I 22100 25172] node_manager.cc:1132: NodeManager::DisconnectClient, disconnect_type=0, has creation task exception = 0
[2021-05-27 23:34:16,230 I 22100 25172] node_manager.cc:1146: Ignoring client disconnect because the client has already been disconnected.
[2021-05-27 23:34:26,551 W 22100 9452] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:34,612 W 22100 9452] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:44,700 W 22100 9452] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:44,788 W 22100 25172] agent_manager.cc:82: Agent process with pid 24972 has not registered, restart it.
[2021-05-27 23:34:44,789 W 22100 15128] agent_manager.cc:92: Agent process with pid 24972 exit, return value 1067
[2021-05-27 23:34:45,545 I 22100 25172] worker_pool.cc:315: Some workers of the worker process(18128) have not registered to raylet within timeout.
[2021-05-27 23:34:45,793 I 22100 18252] agent_manager.cc:76: Monitor agent process with pid 25032, register timeout 30000ms.

And this is logged in gcs_server.out:

[2021-05-27 23:34:14,163 I 7164 3876] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2021-05-27 23:34:14,165 I 7164 3876] gcs_redis_failure_detector.cc:30: Starting redis failure detector.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:44: Loading job table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:56: Loading node table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:68: Loading object table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:81: Loading cluster resources table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:108: Loading actor table data.
[2021-05-27 23:34:14,167 I 7164 3876] gcs_init_data.cc:94: Loading placement group table data.
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:48: Finished loading job table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:60: Finished loading node table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:73: Finished loading object table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:85: Finished loading cluster resources table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:112: Finished loading actor table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_init_data.cc:99: Finished loading placement group table data, size = 0
[2021-05-27 23:34:14,171 I 7164 3876] gcs_heartbeat_manager.cc:30: GcsHeartbeatManager start, num_heartbeats_timeout=300
[2021-05-27 23:34:14,385 I 7164 3876] grpc_server.cc:71: GcsServer server started, listening on port 51888.
[2021-05-27 23:34:14,391 I 7164 3876] gcs_server.cc:276: Gcs server address = 192.168.0.25:51888
[2021-05-27 23:34:14,392 I 7164 3876] gcs_server.cc:280: Finished setting gcs server address: 192.168.0.25:51888
[2021-05-27 23:34:14,392 I 7164 3876] gcs_server.cc:379: GcsNodeManager: {RegisterNode request count: 0, UnregisterNode request count: 0, GetAllNodeInfo request count: 0, GetInternalConfig request count: 0}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, KillActor request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}
GcsPubSub:
- num channels subscribed to: 0
- total commands queued: 0
DefaultTaskInfoHandler: {AddTask request count: 0, GetTask request count: 0, AddTaskLease request count: 0, GetTaskLease request count: 0, AttemptTaskReconstruction request count: 0}
[2021-05-27 23:34:14,786 I 7164 3876] gcs_node_manager.cc:34: Registering node info, node id = 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5, address = 192.168.0.25
[2021-05-27 23:34:14,786 I 7164 3876] gcs_node_manager.cc:39: Finished registering node info, node id = 5f4f53b3891a61c81991e86f1fc3dda550d6bd43ffe8a4ef054b49c5, address = 192.168.0.25
[2021-05-27 23:34:14,792 I 7164 3876] gcs_job_manager.cc:93: Getting all job info.
[2021-05-27 23:34:14,792 I 7164 3876] gcs_job_manager.cc:99: Finished getting all job info.
[2021-05-27 23:34:15,544 I 7164 3876] gcs_job_manager.cc:26: Adding job, job id = 01000000, driver pid = 21792
[2021-05-27 23:34:15,544 I 7164 3876] gcs_job_manager.cc:36: Finished adding job, job id = 01000000, driver pid = 21792
[2021-05-27 23:34:26,246 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:34,310 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:44,398 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:34:54,467 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:35:04,538 W 7164 12960] metric_exporter.cc:206: Export metrics to agent failed: IOError: 14: failed to connect to all addresses. This won't affect Ray, but you can lose metrics from the cluster.
[2021-05-27 23:35:14,392 I 7164 3876] gcs_server.cc:379: GcsNodeManager: {RegisterNode request count: 1, UnregisterNode request count: 0, GetAllNodeInfo request count: 3, GetInternalConfig request count: 1}
GcsActorManager: {RegisterActor request count: 0, CreateActor request count: 0, GetActorInfo request count: 0, GetNamedActorInfo request count: 0, KillActor request count: 0, Registered actors count: 0, Destroyed actors count: 0, Named actors count: 0, Unresolved actors count: 0, Pending actors count: 0, Created actors count: 0}
GcsObjectManager: {GetObjectLocations request count: 0, GetAllObjectLocations request count: 0, AddObjectLocation request count: 0, RemoveObjectLocation request count: 0, Object count: 0}
GcsPlacementGroupManager: {CreatePlacementGroup request count: 0, RemovePlacementGroup request count: 0, GetPlacementGroup request count: 0, GetAllPlacementGroup request count: 0, WaitPlacementGroupUntilReady request count: 0, Registered placement groups count: 0, Named placement group count: 0, Pending placement groups count: 0}

Ray version and other system information (Python version, TensorFlow version, OS):
Python version: 3.8.7
Ray: latest release for Windows

Reproduction (REQUIRED)

The problem occurs on this call.

ray.init(local_mode=True, include_dashboard=False, num_gpus=1, num_cpus=1, logging_level=logging.DEBUG)
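A self-contained version of that call, with imports added (a minimal sketch of the reproduction; the final print is only there to show the script never gets past init):

import logging
import ray

# Hangs here on Windows; the worker log shows
# "Failed to register worker ... to Raylet. Invalid: Invalid: Unknown worker"
ray.init(local_mode=True, include_dashboard=False, num_gpus=1, num_cpus=1,
         logging_level=logging.DEBUG)

print("init finished")  # never reached while init hangs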

If the code snippet cannot be run by itself, the issue will be closed with "needs-repro-script".

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.
@bramhoven added the bug and triage labels May 28, 2021
@maycuatroi

maycuatroi commented Jun 1, 2021

I tried entering these commands as Administrator and restarting Ray, and it works 😄

ray stop --force
ray start --head
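Once the head node is running this way, the driver can attach to it instead of starting its own cluster; a minimal sketch, assuming the default local head started by `ray start --head`:

import ray

# Connect to the already-running local head node rather than
# letting ray.init() spawn a fresh one.
ray.init(address="auto")
print(ray.cluster_resources())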

@bramhoven (Author)

@maycuatroi That at least got me a bit further!
I now get the warning:

The actor or task with ID ffffffffffffffff9e1217f0b41378c338db8e2001000000 cannot be scheduled right now. It requires {CPU: 1.000000} for placement, but this node only has remaining {4.000000/4.000000 CPU, 3.636816 GiB/3.636816 GiB memory, 1.000000/1.000000 GPU, 1.818408 GiB/1.818408 GiB object_store_memory, 1.000000/1.000000 node:192.168.0.25}.
In total there are 0 pending tasks and 2 pending actors on this node. This is likely due to all cluster resources being claimed by actors. To resolve the issue, consider creating fewer actors or increase the resources available to this Ray cluster. You can ignore this message if this Ray cluster is expected to auto-scale or if you specified a runtime_env for this task or actor because it takes time to install.

Any tips for this error message? I cannot seem to get it working even with different amounts of CPU and memory.
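For reference, a way to compare what the actors request against what the node actually advertises (a sketch; the Worker actor and its resource numbers are illustrative, not from my actual script):

import ray

ray.init(address="auto")
print(ray.cluster_resources())    # everything the node advertises
print(ray.available_resources())  # what is still unclaimed

# Ask for fewer resources per actor so that two actors fit on a
# 4-CPU / 1-GPU node (fractional GPUs are allowed).
@ray.remote(num_cpus=1, num_gpus=0.5)
class Worker:
    def ping(self):
        return "ok"

workers = [Worker.remote() for _ in range(2)]
print(ray.get([w.ping.remote() for w in workers]))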

@richardliaw added the P3 label and removed the triage label Jul 2, 2021
@pcmoritz (Contributor)

Should be fixed by #19014.
