Grid Hub (3.141.59) - Grid Console and status endpoints unresponsive after node terminated/killed #8055
I completely understand the issue; it was one of the things I learned to adapt to while developing Zalenium. We are very much aware of the undesirable behavior, which has its root in the several checks the hub performs on the node and the attempts the node makes to register.

- The first thing is
- The second thing is
- The third thing is
- And the last one is

These are hints on how you could improve the performance of your Grids. Now, the complicated part is that Selenium Grid 3.x is not under development anymore, so these issues won't be fixed. But, as mentioned, we are aware of them, and we are avoiding them during the development of Grid 4 (in alpha at the moment of writing this). Having said that, I will close this issue since we won't work on it for Grid 3, but if there are more questions, feel free to reach out through the channels mentioned here: https://www.selenium.dev/support/
Okay, thanks @diemol for the detailed explanation! Looking forward to Grid 4 😉 🤞
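For reference, the hub-node health checks and registration retries that @diemol describes are controlled by flags in Grid 3's node configuration. Below is a sketch of a node launch showing those knobs; the flag names come from Grid 3's node configuration, but the values are illustrative assumptions, not the (elided) recommendations from the comment above:

```sh
# Grid 3 node startup showing the timers behind the hub<->node checks
# and re-registration attempts (values shown are examples only):
java -jar selenium-server-standalone-3.141.59.jar -role node \
  -hub "http://<HUB_INSTANCE IP>:4444/grid/register" \
  -registerCycle 5000 \
  -nodePolling 5000 \
  -nodeStatusCheckTimeout 5000 \
  -downPollingLimit 2 \
  -unregisterIfStillDownAfter 20000
# -registerCycle:              how often (ms) the node retries registering with the hub
# -nodePolling:                how often (ms) the hub polls the node's status
# -nodeStatusCheckTimeout:     timeout (ms) for each hub->node status check
# -downPollingLimit:           failed polls before the hub marks the node as down
# -unregisterIfStillDownAfter: how long (ms) a down node is kept before being unregistered
```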
🐛 Bug Report
Grid Hub Console (`/grid/console`) and `/wd/hub/status` unavailable (time out) for a long time (minutes) after a node is killed/terminated.

Note that the length of the unavailability differs by infrastructure: with the hub and nodes running on the same host in Docker, it lasts for approximately 30 seconds (reproduction scenario below). With the hub and nodes running via Docker images but on separate AWS EC2 instances, it lasts for 2-4 minutes.
To Reproduce

The console (`/grid/console`) and status (`/wd/hub/status`) endpoints hang/time out all connections/requests for ~30 seconds. Note that static pages, such as the 404 page for the `/` endpoint, are served normally during this time.

Running the same Docker images, but without docker-compose and on separate AWS EC2 instances, results in a 120-second outage, a short (~5-second) period of responsiveness, and then another 120-second outage.
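While the outage is in progress, the contrast between static content and the hanging endpoints can be seen with two quick checks (a sketch, assuming the hub is reachable at 127.0.0.1:4444 as in the reproduction below):

```sh
# Static content (the 404 page for "/") returns promptly with a status code:
curl -s -o /dev/null -m 5 -w "%{http_code}\n" http://127.0.0.1:4444/
# The status endpoint hangs until the 5-second curl timeout fires:
curl -s -o /dev/null -m 5 -w "%{http_code}\n" http://127.0.0.1:4444/wd/hub/status
```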
Expected behavior

Nodes disappearing from the grid should not affect the Hub, specifically not the Console, and definitely not the `/wd/hub/status` endpoint (which is used in the official docker health check script).

Alternatively, as a workaround: a hub API endpoint to forcibly and immediately remove a node from the Grid, which can be called by a node just before it terminates.
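A node-side pre-stop hook along these lines could implement that workaround. This is only a sketch: the `/grid/unregister` URL below is an assumption (it has circulated in Grid 2/3 setups), not a documented Grid 3 API, and whether such an endpoint reliably exists is part of what this issue is asking for:

```sh
#!/bin/sh
# Hypothetical pre-stop hook for a node container: ask the hub to drop this
# node before it goes away, so the hub never has to time out its own checks.
# NOTE: /grid/unregister is an ASSUMPTION, not a documented Grid 3 endpoint.
HUB="http://<HUB_INSTANCE IP>:4444"
NODE="http://<NODE_INSTANCE IP>:5555"
curl -s "$HUB/grid/unregister?id=$NODE"
```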
Test script or set of commands reproducing this issue
I've used a docker-compose file for this for ease of reproduction, but this issue does not appear to be Docker-specific. However, I happen to be deploying Selenium via Docker and that also seems like the best way to reproduce the issue and eliminate as much variability in system configuration as possible.
To reproduce the issue:
docker-compose.yml
.docker-compose up
to start the containers. Ensure that both start and the node registers; verify via http://127.0.0.1:4444/grid/console and http://127.0.0.1:4444/wd/hub/statuswhile true; do if curl -s -m 5 -o /dev/null http://127.0.0.1:4444/wd/hub/status; then echo "OK: $(date)"; else echo "FAIL: $(date)"; fi; sleep 5; done
docker stop node-chrome
The output of the curl commands will be something like below, where the first FAIL occurs immediately after the node is stopped:
The corresponding log output from docker-compose:
docker-compose file:
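A minimal compose file consistent with the rest of this report might look like the following sketch (the image tags, the `node-chrome` container name, and the `HUB_HOST`/`HUB_PORT` variables are taken from the commands elsewhere in this report; everything else is an assumption):

```yaml
version: "3"
services:
  hub:
    image: selenium/hub:3.141.59
    ports:
      - "4444:4444"
  node-chrome:
    image: selenium/node-chrome:3.141.59
    # Explicit name so that `docker stop node-chrome` (step 4) works as written
    container_name: node-chrome
    environment:
      - HUB_HOST=hub
      - HUB_PORT=4444
    volumes:
      - /dev/shm:/dev/shm
    depends_on:
      - hub
```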
Alternate Issue Reproduction in AWS EC2
On the hub EC2 instance, start the hub:

docker run --net bridge -m 0b -e "SE_OPTS=-debug" -p 4444:4444 selenium/hub:3.141.59

On the node EC2 instance, start the node and point it at the hub:

docker run --net bridge -m 0b -e "REMOTE_HOST=http://<NODE_INSTANCE IP>:5555" -e "HUB_HOST=<HUB_INSTANCE IP>" -e "HUB_PORT=4444" -p 5555:5555 -v /dev/shm:/dev/shm selenium/node-chrome:3.141.59

Poll the status endpoint as before:

while true; do if curl -s -m 5 -o /dev/null http://<HUB_INSTANCE IP>:4444/wd/hub/status; then echo "OK: $(date)"; else echo "FAIL: $(date)"; fi; sleep 5; done

Terminate the node instance:

aws ec2 terminate-instances --instance-ids <NODE_INSTANCE ID>
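Before terminating, the node's registration can be confirmed from the hub side (a sketch using Grid 3's `/grid/api/proxy` query endpoint; the exact response shape is an assumption):

```sh
# Ask the hub whether it knows about the node; the JSON reply includes a
# "success" field indicating whether the proxy (node) was found.
curl -s "http://<HUB_INSTANCE IP>:4444/grid/api/proxy?id=http://<NODE_INSTANCE IP>:5555"
```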
The `/wd/hub/status` endpoint becomes unreachable for XXX seconds, as shown below (output from the curl loop above). Requests to `/grid/console` similarly time out.

The hub's logs are as follows:
Environment
OS: Amazon Linux 2
Browser: N/A
Browser version: N/A
Browser Driver version: N/A
Language Bindings version: N/A
Selenium Grid version (if applicable): 3.141.59
PS

I think this issue might be a duplicate of SeleniumHQ/docker-selenium#113, which says, in this comment, that it was caused by "a regression introduced in selenium > 3.8, should be fixed in 3.13". But 3.13 was released quite a while before 3.141.59.