High load concurrent queries hang #36
Comments
@vmarkovtsev Can you explain which command you use to reproduce the bug?
I've found some instances of the python-client hanging randomly during my debugging of #34. It seems to happen more frequently after the previous request has failed for any reason; @smola and I suspect that the driver containers might not be closing correctly after some errors, which could leave all the provisioned containers in a non-responsive state. I'll look into it as part of that bug, but the fix could well be the same for this one. I will post an update when I have more information.
I'll be fixing bblfsh/python-driver/issues/27 first, since it could also be related (the problem seems to be worsened by driver failures).
Update: now that I have the Docker setup to test in a controlled environment, I'm seeing different behaviours in different tests:
Update: This bug has surfaced some existing problems that need to be fixed. Currently there is a problem with some files that produce an error in the Go part of the driver (the SDK): the error is correctly returned, but the next request after that one hangs. That's because the driver reads from the driver container's stdin but doesn't write anything to stdout, so the server-to-driver communication hangs (and thus the client-to-server one as well). This doesn't mean the server as a whole hangs; other connections made while the first one is waiting will still work, since the server will instantiate more containers to handle those new connections.
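To make the failure mode concrete, here is a minimal hypothetical sketch (not the actual SDK or server code): the server-side loop writes a request to the driver's stdin and then blocks waiting for a reply on stdout, so a driver that errors out without writing a response stalls the whole chain.

```python
import subprocess

# Hypothetical sketch of the deadlock described above; "some-driver" and the
# request payload are placeholders, not the real bblfsh protocol.
driver = subprocess.Popen(
    ["some-driver"],
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
)

# The "server" side writes one request and then expects exactly one reply.
driver.stdin.write(b'{"content": "def f(): pass"}\n')
driver.stdin.flush()

# If the driver hits an internal error and never writes a reply line,
# this readline() blocks forever, so the client -> server -> driver chain
# appears to hang even though the server process itself is still alive.
response = driver.stdout.readline()
print(response)
```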
Fixed the cause of all the tests I have that resulted in container hangs: bblfsh/sdk/pull/135. We'll test together on Monday and see if we can close it. The timeout and the container killing will be done in a separate PR, since I want to check some things with my team.
PS: @bzz
@EgorBu and I just checked that this (already merged) fix and the previous ones make the earlier problems unreproducible. We banged the server hard with three processes sending it all the TensorFlow files. I'll release new Docker images of the server and the Python & Java drivers today or tomorrow with all these fixes incorporated. Closing the issue; if the new Docker images don't work we'll reopen it.
Here is the code that we used to test:
New Python, Java and Server Docker images have been released with all the latest fixes.
When I execute many (3000) concurrent queries (4 threads), the Babelfish server either hangs or drops some requests without answering them. CPU load is 0%. After that, the server becomes completely unresponsive and I have to restart it.
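A minimal sketch of this kind of load test, assuming the bblfsh Python client (`BblfshClient`), the default endpoint `0.0.0.0:9432`, and a local checkout of TensorFlow as the source of files; the file list, thread count, and query count are illustrative, not the exact script that was used:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from bblfsh import BblfshClient  # assumed client API; pip install bblfsh

# Any large set of Python files works; here a TensorFlow checkout is assumed.
FILES = list(Path("tensorflow").rglob("*.py"))[:3000]

def parse_one(path):
    # One connection per request keeps the sketch simple (not efficient).
    client = BblfshClient("0.0.0.0:9432")
    return client.parse(str(path))

# 4 worker threads hammering the server concurrently, as in the report above.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(parse_one, FILES))

print(f"parsed {len(results)} files")
```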