
[autoscaler] Autoscaler prints (harmless) errors every 30 mins and then kills the workers in GCP cluster #9368

martinrebane opened this issue Jul 9, 2020 · 5 comments
Labels: bug (Something that is supposed to be working; but isn't) · P2 (Important issue, but not time-critical)

martinrebane commented Jul 9, 2020

What is the problem?

I am running a Ray cluster on Google Cloud Platform and using RaySGD to train my model. Every 30 minutes I see an error in the ray monitor [yaml] log: Error during autoscaling (stack trace attached). By itself this error does no harm and the cluster keeps working fine; nothing is scaled or killed. Eventually, after AUTOSCALER_MAX_NUM_FAILURES attempts, the error is followed by StandardAutoscaler: Too many errors, abort. and all the worker nodes are killed.

Ray version and other system information (Python version, TensorFlow version, OS):
Ray 0.8.6, Ubuntu 18.04 on GCP, PyTorch 1.4, RaySGD TorchTrainer via a remote() call

Reproduction (REQUIRED)

Please provide a script that can be run to reproduce the issue. The script should have no external library dependencies (i.e., use fake or mock data / environments):
I am running my project via a remote RaySGD trainer:

stats = ray.get(trainer.train.remote(max_retries=100, num_steps=50))
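For context, this is roughly how such a remote trainer is wired up. This is a minimal sketch only, not the reporter's actual code: the creator functions, resource counts, and hyperparameters are hypothetical placeholders, and the constructor arguments follow the Ray 0.8.x RaySGD docs.

# Sketch only: wrapping a RaySGD TorchTrainer as a Ray actor so that
# train() can be invoked with .remote(). All creators, resource counts,
# and hyperparameters below are hypothetical placeholders.
import ray
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from ray.util.sgd import TorchTrainer

def model_creator(config):
    return nn.Linear(1, 1)  # placeholder model

def optimizer_creator(model, config):
    return torch.optim.SGD(model.parameters(), lr=1e-2)

def data_creator(config):
    x = torch.randn(256, 1)
    return DataLoader(TensorDataset(x, 2 * x), batch_size=32)  # placeholder data

ray.init(address="auto")

# Turn the trainer class into an actor class so training runs off the driver.
RemoteTrainer = ray.remote(num_cpus=1)(TorchTrainer)
trainer = RemoteTrainer.remote(
    model_creator=model_creator,
    data_creator=data_creator,
    optimizer_creator=optimizer_creator,
    loss_creator=nn.MSELoss,
    num_workers=2,
    use_gpu=True,
)

stats = ray.get(trainer.train.remote(max_retries=100, num_steps=50))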

The error happens exactly every 30 minutes:

[last 3 shown]
2020-07-08 23:27:17,537 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.
2020-07-08 23:57:43,427 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.
2020-07-09 00:27:35,044 ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.

and then

2020-07-09 00:27:35,046	CRITICAL autoscaler.py:116 -- StandardAutoscaler: Too many errors, abort.

The recurring Error during autoscaling itself does not seem to do any harm (all workers keep doing their jobs), but killing the workers is of course very problematic, since no new ones will be created.

  • I have verified my script runs in a clean environment and reproduces the issue.
  • I have verified the issue also occurs with the latest wheels.

Full stack trace for the last autoscaling error and for the error that triggers the abort:

2020-07-09 00:27:20,049	INFO autoscaler.py:187 -- Ending bringup phase
2020-07-09 00:27:20,049	INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:20,049	INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.15.233': 0.34097719192504883, '10.164.15.235': 0.3409607410430908, '10.164.0.14': 0.34094786643981934}, NodeIdleSeconds=Min=0 Mean=4477 Max=13432, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:25,142	INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:25,143	INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.4190995693206787, '10.164.15.233': 0.4190821647644043, '10.164.15.235': 0.4190678596496582}, NodeIdleSeconds=Min=0 Mean=4479 Max=13437, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:25,144	INFO autoscaler.py:187 -- Ending bringup phase
2020-07-09 00:27:25,144	INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:25,144	INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.420978307723999, '10.164.15.233': 0.4209609031677246, '10.164.15.235': 0.4209465980529785}, NodeIdleSeconds=Min=0 Mean=4479 Max=13437, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:30,226	INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:30,227	INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.40857815742492676, '10.164.15.233': 0.4085671901702881, '10.164.15.235': 0.4085569381713867}, NodeIdleSeconds=Min=0 Mean=4481 Max=13442, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:30,228	INFO autoscaler.py:187 -- Ending bringup phase
2020-07-09 00:27:30,228	INFO autoscaler.py:389 -- StandardAutoscaler: 2/2 target nodes (0 pending) (2 failed to update)
2020-07-09 00:27:30,229	INFO autoscaler.py:390 -- LoadMetrics: MostDelayedHeartbeats={'10.164.0.14': 0.4104330539703369, '10.164.15.233': 0.41042208671569824, '10.164.15.235': 0.4104118347167969}, NodeIdleSeconds=Min=0 Mean=4481 Max=13442, NumNodesConnected=3, NumNodesUsed=2.0, ResourceUsage=2.0/3.0 CPU, 2.0/2.0 GPU, 0.0 GiB/26.06 GiB memory, 0.0/1.0 node:10.164.0.14, 0.0/1.0 node:10.164.15.233, 0.0/1.0 node:10.164.15.235, 0.0 GiB/7.92 GiB object_store_memory, TimeSinceLastHeartbeat=Min=0 Mean=0 Max=0
2020-07-09 00:27:35,044	ERROR autoscaler.py:112 -- StandardAutoscaler: Error during autoscaling.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 110, in update
    self._update()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 129, in _update
    nodes = self.workers()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 385, in workers
    tag_filters={TAG_RAY_NODE_TYPE: NODE_TYPE_WORKER})
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/gcp/node_provider.py", line 85, in non_terminated_nodes
    filter=filter_expr,
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 901, in execute
    headers=self.headers,
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 204, in _retry_request
    raise exception
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 177, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 187, in request
    self._request, method, uri, request_headers)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/credentials.py", line 124, in before_request
    self.refresh(request)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 96, in refresh
    self._retrieve_info(request)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 77, in _retrieve_info
    request, service_account=self._service_account_email
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 226, in get_service_account_info
    recursive=True,
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 144, in get
    response = request(url=url, method="GET", headers=_METADATA_HEADERS)
  File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 116, in __call__
    url, method=method, body=body, headers=headers, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1994, in request
    cachekey,
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1651, in _request
    conn, request_uri, method, body, headers
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1558, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/opt/conda/lib/python3.7/http/client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/opt/conda/lib/python3.7/http/client.py", line 987, in send
    self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe
2020-07-09 00:27:35,046	CRITICAL autoscaler.py:116 -- StandardAutoscaler: Too many errors, abort.
Error in monitor loop
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 283, in run
    self._run()
  File "/opt/conda/lib/python3.7/site-packages/ray/monitor.py", line 238, in _run
    self.autoscaler.update()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 118, in update
    raise e
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 110, in update
    self._update()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 129, in _update
    nodes = self.workers()
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/autoscaler.py", line 385, in workers
    tag_filters={TAG_RAY_NODE_TYPE: NODE_TYPE_WORKER})
  File "/opt/conda/lib/python3.7/site-packages/ray/autoscaler/gcp/node_provider.py", line 85, in non_terminated_nodes
    filter=filter_expr,
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/_helpers.py", line 134, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 901, in execute
    headers=self.headers,
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 204, in _retry_request
    raise exception
  File "/opt/conda/lib/python3.7/site-packages/googleapiclient/http.py", line 177, in _retry_request
    resp, content = http.request(uri, method, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 187, in request
    self._request, method, uri, request_headers)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/credentials.py", line 124, in before_request
    self.refresh(request)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 96, in refresh
    self._retrieve_info(request)
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/credentials.py", line 77, in _retrieve_info
    request, service_account=self._service_account_email
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 226, in get_service_account_info
    recursive=True,
  File "/opt/conda/lib/python3.7/site-packages/google/auth/compute_engine/_metadata.py", line 144, in get
    response = request(url=url, method="GET", headers=_METADATA_HEADERS)
  File "/opt/conda/lib/python3.7/site-packages/google_auth_httplib2.py", line 116, in __call__
    url, method=method, body=body, headers=headers, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1994, in request
    cachekey,
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1651, in _request
    conn, request_uri, method, body, headers
  File "/opt/conda/lib/python3.7/site-packages/httplib2/__init__.py", line 1558, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/opt/conda/lib/python3.7/http/client.py", line 1252, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1298, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1247, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1026, in _send_output
    self.send(msg)
  File "/opt/conda/lib/python3.7/http/client.py", line 987, in send
    self.sock.sendall(data)
BrokenPipeError: [Errno 32] Broken pipe
2020-07-09 00:27:35,047	ERROR autoscaler.py:415 -- StandardAutoscaler: kill_workers triggered
2020-07-09 00:27:35,325	INFO node_provider.py:207 -- NodeProvider: ray-sgd-pytorch-worker-65665abb: Terminating node
2020-07-09 00:27:35,953	INFO node_provider.py:22 -- wait_for_compute_zone_operation: Waiting for operation operation-1594254455341-5a9f74a7e0197-43345963-b0c773d2 to finish...
2020-07-09 00:27:57,311	INFO node_provider.py:33 -- wait_for_compute_zone_operation: Operation operation-1594254455341-5a9f74a7e0197-43345963-b0c773d2 finished.
2020-07-09 00:27:57,312	INFO node_provider.py:207 -- NodeProvider: ray-sgd-pytorch-worker-75eafb86: Terminating node
2020-07-09 00:27:57,811	INFO node_provider.py:22 -- wait_for_compute_zone_operation: Waiting for operation operation-1594254477325-5a9f74bcd7556-4ba10d54-e3ddb3f4 to finish...
2020-07-09 00:28:24,453	INFO node_provider.py:33 -- wait_for_compute_zone_operation: Operation operation-1594254477325-5a9f74bcd7556-4ba10d54-e3ddb3f4 finished.
2020-07-09 00:28:24,454	ERROR autoscaler.py:420 -- StandardAutoscaler: terminated 2 node(s)
2020-07-09 00:28:25,344	INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...
Monitor: Cleanup exception. Trying again...
2020-07-09 00:28:28,161	INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...
Monitor: Cleanup exception. Trying again...
2020-07-09 00:28:31,082	INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...
Monitor: Cleanup exception. Trying again...
2020-07-09 00:28:34,096	INFO commands.py:94 -- teardown_cluster: Keeping 2 nodes...

My cluster setup YAML:

cluster_name: sgd-pytorch

provider:
    type: gcp
    region: europe-west4
    availability_zone: europe-west4-b
    project_id: [MYID] # Globally unique project id

auth: {ssh_user: ubuntu}

min_workers: 2
initial_workers: 2
max_workers: 2
idle_timeout_minutes: 25

head_node:
    machineType: n1-standard-2
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 80
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-cpu
    scheduling:
      - onHostMaintenance: TERMINATE

worker_nodes:
    machineType: n1-standard-4
    disks:
      - boot: true
        autoDelete: true
        type: PERSISTENT
        initializeParams:
          diskSizeGb: 50
          # See https://cloud.google.com/compute/docs/images for more images
          sourceImage: projects/deeplearning-platform-release/global/images/family/pytorch-latest-gpu

    guestAccelerators:
      - acceleratorType: projects/[MYPROJECT]/zones/europe-west4-b/acceleratorTypes/nvidia-tesla-t4
        acceleratorCount: 1
    metadata:
      items:
        - key: install-nvidia-driver
          value: "True"
    # Run workers on preemptible instances by default.
    # Comment this out to use on-demand.
    scheduling:
      - preemptible: true
      - onHostMaintenance: TERMINATE

file_mounts: {
    "/home/ubuntu/mypointnet": "/home/[LOCAL]/project/[project]",
    "/home/ubuntu/model/": "/home/[local]/model/"
}

# List of commands that will be run before `setup_commands`. If docker is
# enabled, these commands will run outside the container and before docker
# is setup.
initialization_commands: #[]
    - sudo apt-get -y install wget nano unzip zip man

# List of shell commands to run to set up nodes.
setup_commands:
    # Note: if you're developing Ray, you probably want to create an AMI that
    # has your Ray repo pre-cloned. Then, you can replace the pip installs
    # below with a git checkout <your_sha> (and possibly a recompile).
    # - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc

    - pip install --upgrade pip
    - pip install -U oauth2client
    - pip install -U visdom jsonpatch
    - pip install -U shapely descartes ray[rllib]
    - cd [MYFOLDER] && pip install -e . # install my package
    - wget https://storage.googleapis.com/[MYFILE].zip
    - mkdir -p /home/ubuntu/data
    - unzip -uq [MYFILE].zip -d /home/ubuntu/data
    - rm [MYFILE].zip

# Custom commands that will be run on the head node after common setup.
head_setup_commands:
  - pip install -U google-api-python-client
  - git config --global user.email "[email]"
  - git config --global user.name "[name]"
  - visdom -port=8999 &> /dev/null &
  - mkdir -p /home/ubuntu/checkpoint

# Custom commands that will be run on worker nodes after common setup.
worker_setup_commands: []

# Command to start ray on the head node. You don't need to change this.
# NCCL_TIMEOUT_S is the maximum time one training (ray submit) can run
head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      export NCCL_TIMEOUT_S=1209600;
      export NCCL_BUFFSIZE=8388608;
      ray start
      --head
      --redis-port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --num-cpus=1
      --num-redis-shards=3

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      export NCCL_TIMEOUT_S=604800;
      export NCCL_BUFFSIZE=8388608;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076
      --num-cpus=1
@martinrebane martinrebane added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 9, 2020
martinrebane (Author) commented

PS: I do not know whether this is relevant, but I am submitting the job with the --tmux option:

ray submit train_gcp.yaml ray_trainer.py --tmux

rkooo567 (Contributor) commented

cc @ijrsvt @henktillman

@richardliaw richardliaw added P2 Important issue, but not time-critical and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Jul 17, 2020
richardliaw (Contributor) commented

cc @martinrebane did you end up resolving this?

martinrebane (Author) commented

@richardliaw I solved it by adding export AUTOSCALER_MAX_NUM_FAILURES=100; to my cluster YAML before the ray start commands (I added it for both the head and the workers). This works well for me, but for new users, discovering the problem and the workaround could be rather frustrating.

So the error still pops up, but because the maximum number of failures is high, it no longer kills the job.
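For reference, here is roughly how that workaround maps onto the cluster YAML above. This is a sketch only, merging the export into the existing head_start_ray_commands; the same export would go into worker_start_ray_commands. Since the autoscaler monitor runs on the head node, the head-side export is presumably the one that matters.

head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      # Raise the autoscaler failure budget so transient GCP API errors
      # do not abort the cluster (per the workaround described above).
      export AUTOSCALER_MAX_NUM_FAILURES=100;
      export NCCL_TIMEOUT_S=1209600;
      export NCCL_BUFFSIZE=8388608;
      ray start
      --head
      --redis-port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml
      --num-cpus=1
      --num-redis-shards=3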

thomasdesr (Contributor) commented

googleapis/google-api-python-client#218 this seems related? I don't see an obvious workaround though :S
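Not a fix for the autoscaler itself, but for what it's worth, googleapiclient can retry transient socket failures (which, as far as I can tell, include broken pipes) if execute() is given num_retries. A standalone sketch with placeholder project, zone, and filter values, not the autoscaler's actual call:

# Sketch only: listing Compute Engine instances with client-side retries on
# transient connection errors. Project, zone, and filter are placeholders.
import googleapiclient.discovery

compute = googleapiclient.discovery.build("compute", "v1")

request = compute.instances().list(
    project="my-project",                       # placeholder project id
    zone="europe-west4-b",
    filter='labels.ray-node-type = "worker"',   # placeholder label filter
)

# execute() re-issues the request up to num_retries times when the HTTP
# layer raises transient socket/SSL errors instead of failing immediately.
response = request.execute(num_retries=5)
for instance in response.get("items", []):
    print(instance["name"])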
