-
-
Notifications
You must be signed in to change notification settings - Fork 8.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[🐛 Bug]: Timeout exception thrown randomly with RemoteWebDriver #12558
Comments
@Abhijith2893, thank you for creating this issue. We will troubleshoot it as soon as we can. Info for maintainersTriage this issue by using labels.
If information is missing, add a helpful comment and then
If the issue is a question, add the
If the issue is valid but there is no time to troubleshoot it, consider adding the
If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C),
add the applicable
After troubleshooting the issue, please add the Thank you! |
Do you have the logs for the grid? From my tests recently, Python does not seem to have a Read timeout as such (#12368). Java read timeout happens at 200 seconds. Default script timeout is 30 seconds, have you changed that anywhere in your code? Would need to see logs. |
Including hub and node logs for an occurrence Node logs
Hub logs
In our custom framework we capture a screenshot on failure. We observe that after the first timeout issue (on the click). All subsequent requests also timeout like the screenshot request. We trying tweaking the scriptTimeout and set it to 5 mins but from the DEBUG logs it looks like the 500 is returned after exactly 3 mins. |
Adding some more logs from our custom framework showing the subsequent request failures, all after exactly 3mins
|
Can you set grid to log with level FINE? |
Aha. The error is throwing us off - #12484 |
Thank you for the update. Due to the traffic we run on the grid I'll try capture FINE level logs during the weekend and attach them here. |
Attaching grid FINE level logs. I tried extracting the relevant portion. Searching for |
I cannot figure out much from the logs. Is there a way to reproduce this? |
Unfortunately we are unable to find a consistent repro for this similar to #9528 |
Any updates on how to reproduce this? |
This might be related to closing the httpclient to fast, it might still be waiting for a response. It should only be closed when the session is gone, not on a time base in my mind. selenium/java/src/org/openqa/selenium/grid/router/HandleSession.java Lines 74 to 79 in 8c4f48c
@Abhijith2893 is it possible the navigation (until the page is loaded) takes more than 60 seconds? |
Yes, that could be a possibility. We haven't been able to get a consistent repro but during debugging we noticed that adding a |
@diemol I did increase the timeout to 4 minutes in 5f3f7d0, we should not close potentially waiting HttpClients in any case. @Abhijith2893 can you use the nightly build to check if this fixes this issue? |
If we change defaults for the languages, #12368, we should keep this value in mind as well. |
Yes, thanks for the hint. I have a change prepared to completly remove this timeout. We can check every minute and if it has not been used for some time, if not check the session is alive or not. If the session is alive let the client alive, if not close it. But lets see if this issue is resolved with the increased timeout first. |
I've been running some tests with the nightly build but have not faced the issue yet. The increased timeout looks promising. |
After updating the hub to But unfortunately it does not give any more context.
|
Okay this NullPointer is verry interesting, i have reviewed the code and ended up in the I could not find the actual root cause of the NPE from reviewing the code. The Failsafe mechanism used here makes it hard for me ro read the code and follow the execution. Using a loop might simplify the code drastically and might make the dependency obsulete? |
@pujagani i think the |
@joerg1985 I think your observation regarding the two threads hitting max retries concurrently might be right. |
Actually, the Fallback is threadsafe https://failsafe.dev/javadoc/core/dev/failsafe/Fallback.html. I am not sure how to reproduce the NPE. It would be nice if we could reproduce it. |
I also don't see RetryRequest in the stack trace shared in #12558 (comment). Am I missing something? |
@pujagani here is some code to provoke this, the fallback (aka null) is taken as the operation itself failing.
|
Thank you @joerg1985 for having this handly. I am going to take a look. |
The library itself is fine, i think it's our custom code that makes this issue.
PS: in the real world handler1 and handler2 are created in different threads. |
The null error here is due to the final result not available to the RetryPolicy to create a fallback. https://github.com/SeleniumHQ/selenium/blob/trunk/java/src/org/openqa/selenium/remote/http/RetryRequest.java#L100 The value at that line is null, hence the response object is not created. |
So the original exception is silently swallowed and the code will continue execution with a null response and failing with the NPE we see above. |
@titusfortner i just pushed the timeout less version in e8c77b8. So #12368 must not care about this timeout any more. |
Yes, I think so. So now that it is narrowed down, I am going to try to fix our code tomorrow. I also observed that if the ErrorFilter is added then the response contains the exception details. So the error is not swallowed.
|
The ErrorFilter should handle the exception and transform into a response, the RetryRequest should not retry it then. But i might be wrong :D |
Yes! You are right. That is exactly what happens. |
Hey, updated to the latest 14.14.1 version for the hub but are still seeing the original issue. From this comment it seemed like the fix for the original issue is still being looked at ? Attached is the DEBUG logs from version 4.14.1
We also noticed that in this version the subsequent requests were not timing out. Looking at the failure screenshot generated, it seems the click action was successful but the response being returned is 500 leading to test failure. |
Thank you for trying the latest version. I am a little confused, are you still seeing timeouts? Based on the logs you are still seeing NPE. If only there was a way to reproduce this. |
I have one action point left in this, to not close the client based on a timeout. I think a @pujagani I think there is general topic with timeouts (pageLoadTimeout > default read timeout of the http client) too, but this another story... |
@joerg1985 Your second point makes sense, but if the client has reached the timeout, it retries again if the maximum retries are not exceeded then it will be able to get the response. It is just a safety net to ensure that we don't respond with an error immediately and give the server a few chances to respond. |
@pujagani i think i have some code to provoke this, i will create a new issue for the click retry on timeout. |
* fixed merge conflict * [grid] use a URI as key to the map * [java] make the json parsing exception text more helpful * [java] fixed the get session url endpoint * [java] use sessions.getUri when only the session uri is needed * Revert "[java] make the json parsing exception text more helpful" This reverts commit ce7cfc8. * [java] format script --------- Co-authored-by: Diego Molina <diemol@users.noreply.github.com>
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
I think this should be resolved via b5a2e11 |
What happened?
When executing tests at scale there are some random failures with read timeouts (seems to wait for 3 mins). We are observing this primarily when clicking an element but have also observed it during navigate to page. Seems to be a reoccurrence of #9528
We are using dockerized grid v4.10 with the hub/node model and we confirmed that the docker process starts with
jdk-http-client
java -Dwebdriver.http.factory=jdk-http-client -jar /opt/selenium/selenium-server.jar --ext /opt/selenium/selenium-http-jdk-client.jar hub --session-request-timeout 3600 --session-retry-interval 15 --relax-checks true --bind-host false --config /opt/selenium/config.toml
How can we reproduce the issue?
Relevant log output
The text was updated successfully, but these errors were encountered: