
[autoscaler] Autoscaler prints (harmless) errors every 30 mins and then kills the workers in GCP cluster #9368

martinrebane opened this issue Jul 9, 2020 · 5 comments
Labels: bug (Something that is supposed to be working; but isn't) · P2 (Important issue, but not time-critical)

martinrebane commented Jul 9, 2020

What is the problem?

I am running a Ray cluster on Google Cloud Platform and using RaySGD to train my model. Every 30 minutes I see an error in the ray monitor [yaml] log: Error during autoscaling (stack trace attached). By itself this error does no harm and the cluster keeps working fine; nothing is scaled or killed. Eventually, after AUTOSCALER_MAX_NUM_FAILURES attempts, the error is followed by StandardAutoscaler: Too many errors, abort. and all the worker nodes are killed.

Ray version and other system information (Python version, TensorFlow version, OS):
Ray 0.8.6, Ubuntu 18.04 on GCP, PyTorch 1.4, RaySGD TorchTrainer via a remote() call

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
I am running my project via a remote RaySGD trainer:

stats = ray.get(trainer.train.remote(max_retries=100, num_steps=50))
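For context, this is roughly how such a remote trainer is wired up. This is a minimal sketch only, not the reporter's actual code: the creator functions, resource counts, and hyperparameters are hypothetical placeholders, and the constructor arguments follow the Ray 0.8.x RaySGD docs.

# Sketch only: wrapping a RaySGD TorchTrainer as a Ray actor so that
# train() can be invoked with .remote(). All creators, resource counts,
# and hyperparameters below are hypothetical placeholders.
import ray
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from ray.util.sgd import TorchTrainer

def model_creator(config):
    return nn.Linear(1, 1)  # placeholder model

def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=1e-2)

def data_creator(config):
    x = torch.randn(256, 1)
    return DataLoader(TensorDataset(x, 2 * x), batch_size=32)  # placeholder data

ray.init(address="auto")

# Turn the trainer class into an actor class so training runs off the driver.
RemoteTrainer = ray.remote(num_cpus=1)(TorchTrainer)
trainer = RemoteTrainer.remote(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_workers=2,
    use_gpu=True,
)

stats = ray.get(trainer.train.remote(max_retries=100, num_steps=50))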

The error happens exactly every 30 minutes:

[last 3 shown]
2020-07-08 23:27:17,537 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.
2020-07-08 23:57:43,427 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.
2020-07-09 00:27:35,044 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.

and then

2020-07-09 00:27:35,046	CRITICAL autoscaler.py:116 -- StandardAutoscaler: Too many errors, abort.

The recurring Error during autoscaling itself does not seem to do any harm (all workers keep doing their jobs), but killing the workers is of course very problematic, since no new ones will be created.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Full stack trace for the last autoscaling error and for the error that triggers the abort:

2020-07-09 00:27:20,049	INFO autoscaler.py:187 -- Ending bringup phase
2020-07-09 00:27:20,049	INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:20,049	INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.15.233': 0.34097719192504883, '10.164.15.235': 0.3409607410430908, '10.164.0.14': 0.34094786643981934}, NodeIdleSeconds=Min=0 Mean=4477 Max=13432, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:25,142	INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:25,143	INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.4190995693206787, '10.164.15.233': 0.4190821647644043, '10.164.15.235': 0.4190678596496582}, NodeIdleSeconds=Min=0 Mean=4479 Max=13437, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:25,144	INFO autoscaler.py:187 -- Ending bringup phase
2020-07-09 00:27:25,144	INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:25,144	INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.420978307723999, '10.164.15.233': 0.4209609031677246, '10.164.15.235': 0.4209465980529785}, NodeIdleSeconds=Min=0 Mean=4479 Max=13437, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:30,226	INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:30,227	INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.40857815742492676, '10.164.15.233': 0.4085671901702881, '10.164.15.235': 0.4085569381713867}, NodeIdleSeconds=Min=0 Mean=4481 Max=13442, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:30,228	INFO autoscaler.py:187 -- Ending bringup phase
2020-07-09 00:27:30,228	INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:30,229	INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.4104330539703369, '10.164.15.233': 0.41042208671569824, '10.164.15.235': 0.4104118347167969}, NodeIdleSeconds=Min=0 Mean=4481 Max=13442, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:35,044	ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 110, in update
    self._update()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 129, in _update
    nodes = self.workers()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 385, in workers
    tag_filters={TAG_RAY_NODE_TYPE: NODE_TYPE_WORKER})
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/gcp/node_provider.py", line 85, in non_terminated_nodes
    filter=filter_expr,
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 901, in execute
    headers=self.headers,
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 204, in _retry_request
    raise exception
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 177, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 187, in request
    self._request, method, uri, request_headers)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/credentials.py", line 124, in before_request
    self.refresh(request)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 96, in refresh
    self._retrieve_info(request)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 77, in _retrieve_info
    request, service_account=self._service_account_email
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 226, in get_service_account_info
    recursive=True,
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 144, in get
    response = request(url=url, method="GET", headers=_METADATA_HEADERS)
  File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 116, in __call__
    url, method=method, body=body, headers=headers, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1994, in request
    cachekey,
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1651, in _request
    conn, request_uri, method, body, headers
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1558, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/opt/conda/lib/python3.7/http/client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/opt/conda/lib/python3.7/http/client.py", line 987, in send
    self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe
2020-07-09 00:27:35,046	CRITICAL autoscaler.py:116 -- StandardAutoscaler: Too many errors, abort.
Error in monitor loop
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 283, in run
    self._run()
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 238, in _run
    self.autoscaler.update()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 118, in update
    raise e
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 110, in update
    self._update()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 129, in _update
    nodes = self.workers()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 385, in workers
    tag_filters={TAG_RAY_NODE_TYPE: NODE_TYPE_WORKER})
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/gcp/node_provider.py", line 85, in non_terminated_nodes
    filter=filter_expr,
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 901, in execute
    headers=self.headers,
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 204, in _retry_request
    raise exception
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 177, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 187, in request
    self._request, method, uri, request_headers)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/credentials.py", line 124, in before_request
    self.refresh(request)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 96, in refresh
    self._retrieve_info(request)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 77, in _retrieve_info
    request, service_account=self._service_account_email
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 226, in get_service_account_info
    recursive=True,
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 144, in get
    response = request(url=url, method="GET", headers=_METADATA_HEADERS)
  File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 116, in __call__
    url, method=method, body=body, headers=headers, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1994, in request
    cachekey,
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1651, in _request
    conn, request_uri, method, body, headers
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1558, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/opt/conda/lib/python3.7/http/client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/opt/conda/lib/python3.7/http/client.py", line 987, in send
    self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe
2020-07-09 00:27:35,047	ERROR autoscaler.py:415 -- StandardAutoscaler: kill_workers triggered
2020-07-09 00:27:35,325	INFO node_provider.py:207 -- NodeProvider: ray-sgd-pytorch-worker-65665abb: Terminating node
2020-07-09 00:27:35,953	INFO node_provider.py:22 -- wait_for_compute_zone_operation: Waiting for operation operation-1594254455341-5a9f74a7e0197-43345963-b0c773d2 to finish...
2020-07-09 00:27:57,311	INFO node_provider.py:33 -- wait_for_compute_zone_operation: Operation operation-1594254455341-5a9f74a7e0197-43345963-b0c773d2 finished.
2020-07-09 00:27:57,312	INFO node_provider.py:207 -- NodeProvider: ray-sgd-pytorch-worker-75eafb86: Terminating node
2020-07-09 00:27:57,811	INFO node_provider.py:22 -- wait_for_compute_zone_operation: Waiting for operation operation-1594254477325-5a9f74bcd7556-4ba10d54-e3ddb3f4 to finish...
2020-07-09 00:28:24,453	INFO node_provider.py:33 -- wait_for_compute_zone_operation: Operation operation-1594254477325-5a9f74bcd7556-4ba10d54-e3ddb3f4 finished.
2020-07-09 00:28:24,454	ERROR autoscaler.py:420 -- StandardAutoscaler: terminated 2 node(s)
2020-07-09 00:28:25,344	INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...
Monitor: Cleanup exception. Trying again...
2020-07-09 00:28:28,161	INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...
Monitor: Cleanup exception. Trying again...
2020-07-09 00:28:31,082	INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...
Monitor: Cleanup exception. Trying again...
2020-07-09 00:28:34,096	INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...

My cluster setup YAML:

cluster_name: sgd-pytorch

provider:
    type: gcp
    region: europe-west4
    availability_zone: europe-west4-b
    project_id: [MYID] # Globally unique project id

auth: {ssh_user: ubuntu}

min_workers: 2
initial_workers: 2
max_workers: 2
idle_timeout_minutes: 25

head_node:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 80
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-cpu
    scheduling:
      - onHostMaintenance: TERMINATE

worker_nodes:
    machineType: n1-standard-4
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-gpu

    guestAccelerators:
      - acceleratorType: projects/[MYPROJECT]/zones/europe-west4-b/acceleratorTypes/nvidia-tesla-t4
        acceleratorCount: 1
    metadata:
      items:
        - key: install-nvidia-driver
          value: "True"
    # Run workers on preemptible instances by default.
    # Comment this out to use on-demand.
    scheduling:
      - preemptible: true
      - onHostMaintenance: TERMINATE

file_mounts: {
    "/home/ubuntu/mypointnet": "/home/[LOCAL]/project/[project]",
    "/home/ubuntu/model/": "/home/[local]/model/"
}

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: #[]
    - sudo apt-get -y install wget nano unzip zip man

# List of shell commands to run to set up nodes.
setup_commands:
    # Note: if you're developing Ray, you probably want to create an AMI that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc

    - pip install --upgrade pip
    - pip install -U oauth2client
    - pip install -U visdom jsonpatch
    - pip install -U shapely descartes ray[rllib]
    - cd [MYFOLDER] && pip install -e . # install my package
    - wget https://storage.googleapis.com/[MYFILE].zip
    - mkdir -p /home/ubuntu/data
    - unzip -uq [MYFILE].zip -d /home/ubuntu/data
    - rm [MYFILE].zip

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - pip install -U google-api-python-client
  - git config --global user.email "[email]"
  - git config --global user.name "[name]"
  - visdom -port=8999 &> /dev/null &
  - mkdir -p /home/ubuntu/checkpoint

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
# NCCL_TIMEOUT_S is the maximum time one training (ray submit) can run
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      export NCCL_TIMEOUT_S=1209600;
      export NCCL_BUFFSIZE=8388608;
      ray start
      --head
      --redis-port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --num-cpus=1
      --num-redis-shards=3

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      export NCCL_TIMEOUT_S=604800;
      export NCCL_BUFFSIZE=8388608;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
      --num-cpus=1
@martinrebane martinrebane added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 9, 2020
martinrebane (Author) commented

PS: I do not know whether this is relevant, but I am submitting the job with the --tmux option:

ray submit train_gcp.yaml ray_trainer.py --tmux

rkooo567 (Contributor) commented

cc @ijrsvt @henktillman

@richardliaw richardliaw added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 17, 2020
richardliaw (Contributor) commented

cc @martinrebane did you end up resolving this?

martinrebane (Author) commented

@richardliaw I solved it by adding export AUTOSCALER_MAX_NUM_FAILURES=100; to my cluster YAML before the ray start commands (I added it for both the head and the workers). This works well for me, but for new users, discovering the problem and the workaround could be rather frustrating.

So the error still pops up, but because the maximum number of failures is high, it no longer kills the job.
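For reference, here is roughly how that workaround maps onto the cluster YAML above. This is a sketch only, merging the export into the existing head_start_ray_commands; the same export would go into worker_start_ray_commands. Since the autoscaler monitor runs on the head node, the head-side export is presumably the one that matters.

head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      # Raise the autoscaler failure budget so transient GCP API errors
      # do not abort the cluster (per the workaround described above).
      export AUTOSCALER_MAX_NUM_FAILURES=100;
      export NCCL_TIMEOUT_S=1209600;
      export NCCL_BUFFSIZE=8388608;
      ray start
      --head
      --redis-port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --num-cpus=1
      --num-redis-shards=3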

thomasdesr (Contributor) commented

googleapis/google-api-python-client#218 this seems related? I don't see an obvious workaround though :S
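Not a fix for the autoscaler itself, but for what it's worth, googleapiclient can retry transient socket failures (which, as far as I can tell, include broken pipes) if execute() is given num_retries. A standalone sketch with placeholder project, zone, and filter values, not the autoscaler's actual call:

# Sketch only: listing Compute Engine instances with client-side retries on
# transient connection errors. Project, zone, and filter are placeholders.
import googleapiclient.discovery

compute = googleapiclient.discovery.build("compute", "v1")

request = compute.instances().list(
    project="my-project",                       # placeholder project id
    zone="europe-west4-b",
    filter='labels.ray-node-type = "worker"',   # placeholder label filter
)

# execute() re-issues the request up to num_retries times when the HTTP
# layer raises transient socket/SSL errors instead of failing immediately.
response = request.execute(num_retries=5)
for instance in response.get("items", []):
    print(instance["name"])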
