I am running a Ray cluster on Google Cloud Platform, using RaySGD to train my model. Every 30 minutes I see an error in the `ray monitor [yaml]` log: `Error during autoscaling` (stack trace attached). The error itself does no harm and the cluster keeps working fine; nothing is scaled or killed. Finally, after `AUTOSCALER_MAX_NUM_FAILURES` attempts, the error is followed by `StandardAutoscaler: Too many errors, abort.` and all the worker nodes are killed.
Ray version and other system information (Python version, TensorFlow version, OS):
Ray 0.8.6, Ubuntu 18.04 on GCP, PyTorch 1.4, RaySGD TorchTrainer launched via a remote() call
Reproduction (REQUIRED)
Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
I am running my project using a remote RaySGD trainer (a minimal mock sketch is shown below):
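(Mock sketch, not the original script, of the setup described: a RaySGD `TorchTrainer` driven from a `ray.remote` task with fake data. `run_training` and the creator functions are illustrative placeholders, not names from the real project.)

```python
import ray
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from ray.util.sgd import TorchTrainer


def model_creator(config):
    # Tiny stand-in model for the real network.
    return nn.Linear(1, 1)


def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=1e-2)


def data_creator(config):
    # Random regression data standing in for the real dataset.
    x = torch.randn(256, 1)
    y = 2 * x + 0.1 * torch.randn(256, 1)
    return DataLoader(TensorDataset(x, y), batch_size=32)


@ray.remote
def run_training(num_epochs=100):
    # Long-running training launched from a remote task, as in the report.
    trainer = TorchTrainer(
        model_creator=model_creator,
        data_creator=data_creator,
        optimizer_creator=optimizer_creator,
        loss_creator=nn.MSELoss,
        num_workers=2,
        use_gpu=True)
    for _ in range(num_epochs):
        trainer.train()
    trainer.shutdown()


ray.init(address="auto")
ray.get(run_training.remote())
```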
The error happens exactly every 30 minutes (last 3 occurrences shown):
2020-07-08 23:27:17,537 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.
2020-07-08 23:57:43,427 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.
2020-07-09 00:27:35,044 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.
and then
2020-07-09 00:27:35,046 CRITICAL autoscaler.py:116 -- StandardAutoscaler: Too many errors, abort.
The recurring `Error during autoscaling` itself does not seem to do any harm at all (all workers keep doing their job), but the killing of the workers is of course very problematic, as new ones won't be created.
I have verified my script runs in a clean environment and reproduces the issue.
I have verified the issue also occurs with the latest wheels.
Full stack trace for the last autoscaling error and for the error that causes the abort:
2020-07-09 00:27:20,049 INFO autoscaler.py:187 -- Ending bringup phase
2020-07-09 00:27:20,049 INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:20,049 INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.15.233': 0.34097719192504883, '10.164.15.235': 0.3409607410430908, '10.164.0.14': 0.34094786643981934}, NodeIdleSeconds=Min=0 Mean=4477 Max=13432, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:25,142 INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:25,143 INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.4190995693206787, '10.164.15.233': 0.4190821647644043, '10.164.15.235': 0.4190678596496582}, NodeIdleSeconds=Min=0 Mean=4479 Max=13437, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:25,144 INFO autoscaler.py:187 -- Ending bringup phase
2020-07-09 00:27:25,144 INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:25,144 INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.420978307723999, '10.164.15.233': 0.4209609031677246, '10.164.15.235': 0.4209465980529785}, NodeIdleSeconds=Min=0 Mean=4479 Max=13437, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:30,226 INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:30,227 INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.40857815742492676, '10.164.15.233': 0.4085671901702881, '10.164.15.235': 0.4085569381713867}, NodeIdleSeconds=Min=0 Mean=4481 Max=13442, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:30,228 INFO autoscaler.py:187 -- Ending bringup phase
2020-07-09 00:27:30,228 INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:30,229 INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.4104330539703369, '10.164.15.233': 0.41042208671569824, '10.164.15.235': 0.4104118347167969}, NodeIdleSeconds=Min=0 Mean=4481 Max=13442, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:35,044 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 110, in update
self._update()
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 129, in _update
nodes = self.workers()
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 385, in workers
tag_filters={TAG_RAY_NODE_TYPE: NODE_TYPE_WORKER})
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/gcp/node_provider.py", line 85, in non_terminated_nodes
filter=filter_expr,
File "/opt/conda/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
return wrapped(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 901, in execute
headers=self.headers,
File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 204, in _retry_request
raise exception
File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 177, in _retry_request
resp, content = http.request(uri, method, *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 187, in request
self._request, method, uri, request_headers)
File "/opt/conda/lib/python3.7/site-packages/google/auth/credentials.py", line 124, in before_request
self.refresh(request)
File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 96, in refresh
self._retrieve_info(request)
File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 77, in _retrieve_info
request, service_account=self._service_account_email
File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 226, in get_service_account_info
recursive=True,
File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 144, in get
response = request(url=url, method="GET", headers=_METADATA_HEADERS)
File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 116, in __call__
url, method=method, body=body, headers=headers, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1994, in request
cachekey,
File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1651, in _request
conn, request_uri, method, body, headers
File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1558, in _conn_request
conn.request(method, request_uri, body, headers)
File "/opt/conda/lib/python3.7/http/client.py", line 1252, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/conda/lib/python3.7/http/client.py", line 1298, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/conda/lib/python3.7/http/client.py", line 1247, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/conda/lib/python3.7/http/client.py", line 1026, in _send_output
self.send(msg)
File "/opt/conda/lib/python3.7/http/client.py", line 987, in send
self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe
2020-07-09 00:27:35,046 CRITICAL autoscaler.py:116 -- StandardAutoscaler: Too many errors, abort.
Error in monitor loop
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 283, in run
self._run()
File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 238, in _run
self.autoscaler.update()
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 118, in update
raise e
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 110, in update
self._update()
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 129, in _update
nodes = self.workers()
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 385, in workers
tag_filters={TAG_RAY_NODE_TYPE: NODE_TYPE_WORKER})
File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/gcp/node_provider.py", line 85, in non_terminated_nodes
filter=filter_expr,
File "/opt/conda/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
return wrapped(*args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 901, in execute
headers=self.headers,
File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 204, in _retry_request
raise exception
File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 177, in _retry_request
resp, content = http.request(uri, method, *args, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 187, in request
self._request, method, uri, request_headers)
File "/opt/conda/lib/python3.7/site-packages/google/auth/credentials.py", line 124, in before_request
self.refresh(request)
File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 96, in refresh
self._retrieve_info(request)
File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 77, in _retrieve_info
request, service_account=self._service_account_email
File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 226, in get_service_account_info
recursive=True,
File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 144, in get
response = request(url=url, method="GET", headers=_METADATA_HEADERS)
File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 116, in __call__
url, method=method, body=body, headers=headers, **kwargs)
File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1994, in request
cachekey,
File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1651, in _request
conn, request_uri, method, body, headers
File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1558, in _conn_request
conn.request(method, request_uri, body, headers)
File "/opt/conda/lib/python3.7/http/client.py", line 1252, in request
self._send_request(method, url, body, headers, encode_chunked)
File "/opt/conda/lib/python3.7/http/client.py", line 1298, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "/opt/conda/lib/python3.7/http/client.py", line 1247, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "/opt/conda/lib/python3.7/http/client.py", line 1026, in _send_output
self.send(msg)
File "/opt/conda/lib/python3.7/http/client.py", line 987, in send
self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe
2020-07-09 00:27:35,047 ERROR autoscaler.py:415 -- StandardAutoscaler: kill_workers triggered
2020-07-09 00:27:35,325 INFO node_provider.py:207 -- NodeProvider: ray-sgd-pytorch-worker-65665abb: Terminating node
2020-07-09 00:27:35,953 INFO node_provider.py:22 -- wait_for_compute_zone_operation: Waiting for operation operation-1594254455341-5a9f74a7e0197-43345963-b0c773d2 to finish...
2020-07-09 00:27:57,311 INFO node_provider.py:33 -- wait_for_compute_zone_operation: Operation operation-1594254455341-5a9f74a7e0197-43345963-b0c773d2 finished.
2020-07-09 00:27:57,312 INFO node_provider.py:207 -- NodeProvider: ray-sgd-pytorch-worker-75eafb86: Terminating node
2020-07-09 00:27:57,811 INFO node_provider.py:22 -- wait_for_compute_zone_operation: Waiting for operation operation-1594254477325-5a9f74bcd7556-4ba10d54-e3ddb3f4 to finish...
2020-07-09 00:28:24,453 INFO node_provider.py:33 -- wait_for_compute_zone_operation: Operation operation-1594254477325-5a9f74bcd7556-4ba10d54-e3ddb3f4 finished.
2020-07-09 00:28:24,454 ERROR autoscaler.py:420 -- StandardAutoscaler: terminated 2 node(s)
2020-07-09 00:28:25,344 INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...
Monitor: Cleanup exception. Trying again...
2020-07-09 00:28:28,161 INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...
Monitor: Cleanup exception. Trying again...
2020-07-09 00:28:31,082 INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...
Monitor: Cleanup exception. Trying again...
2020-07-09 00:28:34,096 INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...
My cluster setup YAML:
cluster_name: sgd-pytorch

provider:
    type: gcp
    region: europe-west4
    availability_zone: europe-west4-b
    project_id: [MYID] # Globally unique project id

auth: {ssh_user: ubuntu}

min_workers: 2
initial_workers: 2
max_workers: 2
idle_timeout_minutes: 25

head_node:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 80
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-cpu
    scheduling:
      - onHostMaintenance: TERMINATE

worker_nodes:
    machineType: n1-standard-4
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-gpu
    guestAccelerators:
      - acceleratorType: projects/[MYPROJECT]/zones/europe-west4-b/acceleratorTypes/nvidia-tesla-t4
        acceleratorCount: 1
    metadata:
      items:
        - key: install-nvidia-driver
          value: "True"
    # Run workers on preemptible instances by default.
    # Comment this out to use on-demand.
    scheduling:
      - preemptible: true
      - onHostMaintenance: TERMINATE

file_mounts: {
    "/home/ubuntu/mypointnet": "/home/[LOCAL]/project/[project]",
    "/home/ubuntu/model/": "/home/[local]/model/"
}

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is set up.
initialization_commands: #[]
    - sudo apt-get -y install wget nano unzip zip man

# List of shell commands to run to set up nodes.
setup_commands:
    # Note: if you're developing Ray, you probably want to create an AMI that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc
    - pip install --upgrade pip
    - pip install -U oauth2client
    - pip install -U visdom jsonpatch
    - pip install -U shapely descartes ray[rllib]
    - cd [MYFOLDER] && pip install -e . # install my package
    - wget https://storage.googleapis.com/[MYFILE].zip
    - mkdir -p /home/ubuntu/data
    - unzip -uq [MYFILE].zip -d /home/ubuntu/data
    - rm [MYFILE].zip

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
    - pip install -U google-api-python-client
    - git config --global user.email "[email]"
    - git config --global user.name "[name]"
    - visdom -port=8999 &> /dev/null &
    - mkdir -p /home/ubuntu/checkpoint

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
# NCCL_TIMEOUT_S is the maximum time one training (ray submit) can run.
head_start_ray_commands:
    - ray stop
    - >-
        ulimit -n 65536;
        export NCCL_TIMEOUT_S=1209600;
        export NCCL_BUFFSIZE=8388608;
        ray start
        --head
        --redis-port=6379
        --object-manager-port=8076
        --autoscaling-config=~/ray_bootstrap_config.yaml
        --num-cpus=1
        --num-redis-shards=3

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
        ulimit -n 65536;
        export NCCL_TIMEOUT_S=604800;
        export NCCL_BUFFSIZE=8388608;
        ray start
        --address=$RAY_HEAD_IP:6379
        --object-manager-port=8076
        --num-cpus=1
@richardliaw I solved it by adding `export AUTOSCALER_MAX_NUM_FAILURES=100;` to my cluster YAML before the `ray start` commands (I added it for both the head and the workers). This works well for me, but for new users, discovering the problem and the workaround might be rather frustrating.
So the error still pops up, but because the maximum number of failures is now high, it no longer kills the job. A sketch of the change is shown below.
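For reference, this is roughly what the change looks like in `head_start_ray_commands` from the YAML above (a sketch only; the same export goes into `worker_start_ray_commands`):

```yaml
head_start_ray_commands:
    - ray stop
    - >-
        ulimit -n 65536;
        export NCCL_TIMEOUT_S=1209600;
        export NCCL_BUFFSIZE=8388608;
        export AUTOSCALER_MAX_NUM_FAILURES=100;
        ray start
        --head
        --redis-port=6379
        --object-manager-port=8076
        --autoscaling-config=~/ray_bootstrap_config.yaml
        --num-cpus=1
        --num-redis-shards=3
```

The autoscaler monitor runs on the head node, so the export in the head command is the one that actually raises the failure limit.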