
Grid Hub (3.141.59) - Grid Console and status endpoints unresponsive after node terminated/killed #8055

Closed · jantman opened this issue Feb 25, 2020 · 3 comments

jantman commented Feb 25, 2020

🐛 Bug Report

Grid Hub Console (/grid/console) and /wd/hub/status are unavailable (requests time out) for a long time (minutes) after a node is killed/terminated.

Note that the length of unavailability differs by infrastructure; with hub and nodes running on the same host in Docker, this lasts for approximately 30 seconds (reproduction scenario below). With hub and nodes running via Docker images but on separate AWS EC2 instances, this lasts for 2-4 minutes.

To Reproduce

  1. Run a Hub instance, preferably using the 3.141.59 Docker image (though this is not specific to Docker); a plain docker run sketch follows this list.
  2. Run a Node instance, either Chrome or Firefox, using the 3.141.59 Docker image (again, this is not specific to Docker). Ensure that the Node registers with the Hub and shows up on the Console.
  3. Terminate the Node instance.
  4. The Hub console (/grid/console) and status (/wd/hub/status) hang/time out on all connections/requests for ~30 seconds. Note that static pages, such as the 404 page for the / endpoint, are served normally during this time.
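
For completeness, here is a rough plain-docker equivalent of these steps (a sketch only; the network name "grid" and the container names are my own choices, while the images and environment variables are the same ones used in the compose file further down):

# Create a user-defined network so the node can reach the hub by name.
docker network create grid
docker run -d --net grid --name selenium-hub -p 4444:4444 \
  -e "SE_OPTS=-debug" selenium/hub:3.141.59
docker run -d --net grid --name node-chrome \
  -e HUB_HOST=selenium-hub -e HUB_PORT=4444 \
  -v /dev/shm:/dev/shm selenium/node-chrome:3.141.59
# Step 3: terminate the node, then watch /grid/console and /wd/hub/status hang.
docker stop node-chrome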

Running the same Docker images, but without docker-compose and on separate AWS EC2 Instances, results in a 120-second outage, a short (~5-second) period of responsiveness, and then another 120-second outage.

Expected behavior

Nodes disappearing from the grid should not affect the Hub; specifically not the Console, and definitely not the /wd/hub/status endpoint (which is used in the official docker health check script).

Alternatively, as a workaround, the Hub could expose an API endpoint that forcibly and immediately removes a node from the Grid, which a node could call just before it terminates.
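
For context, a minimal sketch of a /wd/hub/status-based health check similar in spirit to the official one (this is my own approximation, not the actual docker-selenium check script; the .value.ready field is what I believe the Grid 3 hub returns in its status JSON):

# Exits non-zero when the status endpoint hangs, errors, or reports not ready.
curl -sf --max-time 5 http://localhost:4444/wd/hub/status \
  | jq -e '.value.ready == true' > /dev/null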

Test script or set of commands reproducing this issue

I've used a docker-compose file for ease of reproduction, but this issue does not appear to be Docker-specific. I happen to deploy Selenium via Docker, and it also seems like the best way to reproduce the issue while eliminating as much variability in system configuration as possible.

To reproduce the issue:

  1. Save the below docker-compose file somewhere locally as docker-compose.yml.
  2. docker-compose up to start the containers. Ensure that both start and the node registers; verify via http://127.0.0.1:4444/grid/console and http://127.0.0.1:4444/wd/hub/status
  3. In another shell/terminal, run: while true; do if curl -s -m 5 -o /dev/null http://127.0.0.1:4444/wd/hub/status; then echo "OK: $(date)"; else echo "FAIL: $(date)"; fi; sleep 5; done
  4. In yet another shell, run: docker stop node-chrome

The output of the curl commands will be something like below, where the first FAIL occurs immediately after the node is stopped:

OK: Tue 25 Feb 2020 09:10:38 AM EST
OK: Tue 25 Feb 2020 09:10:43 AM EST
OK: Tue 25 Feb 2020 09:10:48 AM EST
FAIL: Tue 25 Feb 2020 09:10:58 AM EST
FAIL: Tue 25 Feb 2020 09:11:08 AM EST
FAIL: Tue 25 Feb 2020 09:11:18 AM EST
FAIL: Tue 25 Feb 2020 09:11:28 AM EST
OK: Tue 25 Feb 2020 09:11:37 AM EST
OK: Tue 25 Feb 2020 09:11:45 AM EST
OK: Tue 25 Feb 2020 09:11:53 AM EST
OK: Tue 25 Feb 2020 09:12:01 AM EST
OK: Tue 25 Feb 2020 09:12:09 AM EST

The corresponding log output from docker-compose:

selenium-hub    | 14:02:52.932 INFO [DefaultGridRegistry.add] - Registered a node http://172.18.0.3:5555
chrome_1        | 14:02:52.933 INFO [SelfRegisteringRemote.registerToHub] - The node is registered to the hub and ready to use
selenium-hub    | 14:02:52.940 DEBUG [RegistrationServlet.lambda$process$0] - proxy added http://172.18.0.3:5555
chrome_1        | Trapped SIGTERM/SIGINT/x so shutting down supervisord...
chrome_1        | 2020-02-25 14:10:52,303 WARN received SIGTERM indicating exit request
chrome_1        | 2020-02-25 14:10:52,303 INFO waiting for xvfb, selenium-node to die
chrome_1        | 2020-02-25 14:10:52,304 INFO stopped: selenium-node (terminated by SIGTERM)
chrome_1        | 2020-02-25 14:10:52,305 INFO stopped: xvfb (terminated by SIGTERM)
chrome_1        | Shutdown complete
grid-ng_chrome_1 exited with code 143
selenium-hub    | 14:12:01.203 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub    | 14:12:09.257 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub    | 14:12:09.257 INFO [DefaultRemoteProxy.onEvent] - Marking the node http://172.18.0.3:5555 as down: cannot reach the node for 2 tries
selenium-hub    | 14:12:17.310 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub    | 14:12:25.363 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub    | 14:12:33.416 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub    | 14:12:41.470 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub    | 14:12:49.523 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub    | 14:12:57.577 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub    | 14:13:05.630 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub    | 14:13:13.683 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: No route to host (Host unreachable)
selenium-hub    | 14:13:13.683 INFO [DefaultRemoteProxy.onEvent] - Unregistering the node http://172.18.0.3:5555 because it's been down for 64426 milliseconds
selenium-hub    | 14:13:13.684 WARN [DefaultGridRegistry.removeIfPresent] - Cleaning up stale test sessions on the unregistered node http://172.18.0.3:5555

docker compose file

version: "3"
services:
  selenium-hub:
    image: selenium/hub:3.141.59
    container_name: selenium-hub
    ports:
      - "4444:4444"
    environment:
      - SE_OPTS=-debug
  chrome:
    image: selenium/node-chrome:3.141.59
    container_name: node-chrome
    volumes:
      - /dev/shm:/dev/shm
    depends_on:
      - selenium-hub
    environment:
      - HUB_HOST=selenium-hub
      - HUB_PORT=4444
      - SE_OPTS=-servlets org.openqa.grid.web.servlet.LifecycleServlet

Alternate Issue Reproduction in AWS EC2

  1. Start an EC2 instance from the Amazon Linux 2 AMI, in a private subnet of a VPC. We'll call this "HUB_INSTANCE". Apply security groups to allow access to port 4444 from the VPC.
  2. On this instance, docker run --net bridge -m 0b -e "SE_OPTS=-debug" -p 4444:4444 selenium/hub:3.141.59
  3. Verify that http://HUB_INSTANCE_IP:4444/grid/console and http://HUB_INSTANCE_IP:4444/wd/hub/status are available.
  4. Start a second EC2 instance from the Amazon Linux 2 AMI, in the same private subnet of a VPC. We'll call this "NODE_INSTANCE". Apply security groups to allow access to port 5555 from the VPC.
  5. On this instance, docker run --net bridge -m 0b -e "REMOTE_HOST=http://<NODE_INSTANCE IP>:5555" -e "HUB_HOST=<HUB_INSTANCE IP>" -e "HUB_PORT=4444" -p 5555:5555 -v /dev/shm:/dev/shm selenium/node-chrome:3.141.59
  6. Verify that http://HUB_INSTANCE_IP:4444/grid/console shows the node instance connected and http://HUB_INSTANCE_IP:4444/wd/hub/status is available (a command-line check via the hub API is also sketched after this list).
  7. On a third instance (or locally if you have VPN into your VPC), run: while true; do if curl -s -m 5 -o /dev/null http://<HUB_INSTANCE IP>:4444/wd/hub/status; then echo "OK: $(date)"; else echo "FAIL: $(date)"; fi; sleep 5; done
  8. Terminate NODE_INSTANCE: aws ec2 terminate-instances --instance-ids <NODE_INSTANCE ID>
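
As an optional extra check (my addition, not part of the original steps), the hub's proxy API can be queried from the command line; I believe /grid/api/proxy exists in Grid 3 and answers with a "success" field reflecting whether the node is still registered:

curl -s "http://<HUB_INSTANCE IP>:4444/grid/api/proxy?id=http://<NODE_INSTANCE IP>:5555"
# Expected to contain "success": true before termination, and "success": false
# once the node has been unregistered.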

The /wd/hub/status endpoint becomes unreachable for roughly four and a half minutes, as shown below (output from the command in step 7). Requests to /grid/console similarly time out.

OK: Tue 25 Feb 2020 11:54:43 AM EST
OK: Tue 25 Feb 2020 11:54:48 AM EST
OK: Tue 25 Feb 2020 11:54:53 AM EST
FAIL: Tue 25 Feb 2020 11:55:03 AM EST
FAIL: Tue 25 Feb 2020 11:55:13 AM EST
FAIL: Tue 25 Feb 2020 11:55:23 AM EST
FAIL: Tue 25 Feb 2020 11:55:33 AM EST
FAIL: Tue 25 Feb 2020 11:55:43 AM EST
FAIL: Tue 25 Feb 2020 11:55:53 AM EST
FAIL: Tue 25 Feb 2020 11:56:03 AM EST
FAIL: Tue 25 Feb 2020 11:56:13 AM EST
FAIL: Tue 25 Feb 2020 11:56:23 AM EST
FAIL: Tue 25 Feb 2020 11:56:33 AM EST
FAIL: Tue 25 Feb 2020 11:56:43 AM EST
FAIL: Tue 25 Feb 2020 11:56:53 AM EST
FAIL: Tue 25 Feb 2020 11:57:03 AM EST
FAIL: Tue 25 Feb 2020 11:57:13 AM EST
FAIL: Tue 25 Feb 2020 11:57:23 AM EST
FAIL: Tue 25 Feb 2020 11:57:33 AM EST
FAIL: Tue 25 Feb 2020 11:57:43 AM EST
FAIL: Tue 25 Feb 2020 11:57:53 AM EST
FAIL: Tue 25 Feb 2020 11:58:03 AM EST
FAIL: Tue 25 Feb 2020 11:58:13 AM EST
FAIL: Tue 25 Feb 2020 11:58:23 AM EST
FAIL: Tue 25 Feb 2020 11:58:33 AM EST
FAIL: Tue 25 Feb 2020 11:58:43 AM EST
FAIL: Tue 25 Feb 2020 11:58:53 AM EST
FAIL: Tue 25 Feb 2020 11:59:03 AM EST
FAIL: Tue 25 Feb 2020 11:59:13 AM EST
FAIL: Tue 25 Feb 2020 11:59:23 AM EST
OK: Tue 25 Feb 2020 11:59:28 AM EST
OK: Tue 25 Feb 2020 11:59:33 AM EST
OK: Tue 25 Feb 2020 11:59:38 AM EST

The hub's logs are as follows:

11:47:46.541 INFO [DefaultGridRegistry.add] - Registered a node http://10.230.92.112:5555
11:47:46.542 DEBUG [RegistrationServlet.lambda$process$0] - proxy added http://10.230.92.112:5555
# Terminated instance here
11:54:51.851 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: Failed to connect to /10.230.92.112:5555
11:57:06.811 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: Failed to connect to /10.230.92.112:5555
11:57:06.811 INFO [DefaultRemoteProxy.onEvent] - Marking the node http://10.230.92.112:5555 as down: cannot reach the node for 2 tries
11:59:21.979 DEBUG [DefaultRemoteProxy.isAlive] - Failed to check status of node: Failed to connect to /10.230.92.112:5555
11:59:21.979 INFO [DefaultRemoteProxy.onEvent] - Unregistering the node http://10.230.92.112:5555 because it's been down for 135168 milliseconds
11:59:21.979 WARN [DefaultGridRegistry.removeIfPresent] - Cleaning up stale test sessions on the unregistered node http://10.230.92.112:5555

Environment

OS: Amazon Linux 2
Browser: N/A
Browser version: N/A
Browser Driver version: N/A
Language Bindings version: N/A
Selenium Grid version (if applicable): 3.141.59

PS

I think this issue might be a duplicate of SeleniumHQ/docker-selenium#113, which says in one of its comments that it was caused by "a regression introduced in selenium > 3.8, should be fixed in 3.13". But 3.13 was quite a while before 3.141.59.

porn commented Apr 22, 2020

Very same problem here 😞

Our use case:

  • we autoscale (in AWS) based on current Selenium usage
  • the usage is determined from the Selenium Grid console page (see the sketch after this list)
  • the non-responding grid leads to missing metrics and "INSUFFICIENT_DATA" alarms (screenshot omitted)
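
For reference, a minimal sketch of pulling a usage metric from the hub API instead of scraping the console page; it assumes the Grid 3 hub's /grid/api/hub endpoint returns a slotCounts object (which it appears to do in 3.141.x), and the jq filters are my own illustration rather than what we actually run:

# Sketch: derive used/total slot counts from the hub API.
STATUS="$(curl -s -m 5 http://<hub-host>:4444/grid/api/hub)" || exit 1
FREE="$(echo "$STATUS" | jq -r '.slotCounts.free')"
TOTAL="$(echo "$STATUS" | jq -r '.slotCounts.total')"
echo "used=$((TOTAL - FREE)) total=$TOTAL"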

Hotfix:
we dramatically decreased the node time limits to minimize the "blind" intervals (see the sketch after this list):

  • -nodePolling: polling more often
  • -nodeStatusCheckTimeout: lower polling timeout
  • -unregisterIfStillDownAfter: remove from hub much sooner
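
A minimal sketch of passing these options to a node container via SE_OPTS; the flag names are the real Grid 3 node options, but the values below are illustrative assumptions, not the ones we settled on:

# Illustrative values (all in milliseconds); assumes a user-defined Docker network
# named "grid" on which the hub is reachable as "selenium-hub".
docker run -d --net grid --name node-chrome \
  -e HUB_HOST=selenium-hub -e HUB_PORT=4444 \
  -e "SE_OPTS=-nodePolling 5000 -nodeStatusCheckTimeout 5000 -unregisterIfStillDownAfter 10000" \
  -v /dev/shm:/dev/shm \
  selenium/node-chrome:3.141.59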

But this wasn't enough 🤔

A node being removed from the autoscaled group is simply terminated, so the hub can no longer connect to it, and somehow nodeStatusCheckTimeout doesn't apply here: the node ends up being unregistered much later than the value set in the unregisterIfStillDownAfter option.
To fix this we added ExecStopPost=/bin/sleep XX to /etc/systemd/system/selenium.node.service; this way the hub unregistered the node in time ✔️ (a sketch of the unit file follows below).
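
For context, a minimal sketch of what that unit file might look like; only the ExecStopPost line comes from the description above, while the unit layout, the ExecStart command, and the concrete sleep duration are illustrative assumptions:

# /etc/systemd/system/selenium.node.service (sketch)
[Unit]
Description=Selenium Grid node
After=docker.service
Requires=docker.service

[Service]
# How the node is actually started is deployment-specific; this line is a placeholder.
ExecStart=/usr/bin/docker run --rm --name selenium-node selenium/node-chrome:3.141.59
# Keep the stop sequence alive long enough for the hub to notice the node is gone
# and unregister it; the duration here is illustrative.
ExecStopPost=/bin/sleep 30
Restart=on-failure

[Install]
WantedBy=multi-user.target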

The result: (screenshot omitted)

diemol (Member) commented Apr 22, 2020

Hi @jantman and @porn

I completely understand the issue; it was one of the things I had to learn to work around while developing Zalenium. We are very much aware of this undesirable behavior, which has its roots in the several checks the hub performs against the node and the attempts the node makes to register.

The first thing is -nodePolling. If it is a small number, the hub will poll the nodes for their status very often. This uses threads in the hub and can block if there are many nodes to check. If the network is reliable, why do we need to check the nodes so often? But I understand that in an autoscaling scenario we need to know the node status often. An alternative is to use a different -nodePolling value for each node, so the hub polls the nodes at different times. This is what we were doing with Zalenium, and it improved performance.

The second thing is -downPollingLimit, which marks the node as down after it has failed to respond X times. In the autoscaling scenario, maybe it is enough to set this value to 1. Otherwise, detecting a dead node takes roughly -nodePolling x -downPollingLimit (for example, 5000 ms polling with a limit of 2 means at least 10 seconds before the node is marked down).

The third thing is -unregisterIfStillDownAfter, which comes into play after the node has been marked as down. With this parameter, a node that has still not responded after X ms gets unregistered. Again, in an autoscaling scenario a very small value would make more sense.

And the last one is -registerCycle, which is how often the node will try to register (even if it is already registered). If the network is reliable, maybe we can put a very high number here, because if we leave the default, the hub will receive registration requests quite often.
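
Putting these four options together, a minimal sketch of a node command line tuned for an autoscaling setup; the flags are the real Grid 3 node options, while the concrete values are illustrative assumptions rather than recommendations:

# All values are in milliseconds. Stagger -nodePolling per node so the hub does not
# check every node at the same time; -downPollingLimit 1 marks a node down after a
# single failed check; -unregisterIfStillDownAfter then drops it quickly; a high
# -registerCycle keeps re-registration traffic low on a reliable network.
java -jar selenium-server-standalone-3.141.59.jar -role node \
  -hub http://selenium-hub:4444/grid/register \
  -nodePolling 7000 \
  -downPollingLimit 1 \
  -unregisterIfStillDownAfter 10000 \
  -registerCycle 600000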

These are hints on how you could improve the performance of your Grids. Now, the complicated part: Selenium Grid 3.x is no longer under development, so these issues won't be fixed. But as mentioned, we are aware of them and are avoiding them during the development of Grid 4 (in alpha at the time of writing).

Having said that, I will close this issue since we won't work on it for Grid 3, but if there are more questions, feel free to reach out through the channels listed at https://www.selenium.dev/support/.

diemol closed this as completed Apr 22, 2020

porn commented Apr 23, 2020

Okay, thanks @diemol for the detailed explanation! Looking forward to Grid 4 😉 🤞

github-actions bot locked and limited conversation to collaborators on Sep 5, 2021