-
Notifications
You must be signed in to change notification settings - Fork 569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Concurrency bug in read for cancel #761
Comments
@ns-gzhang im not clear how the connection can be reused since on cancel we don't release it. Re #727 (comment) The connection is not released back into the idle pool since it has an error. Debugging the following shows a new connection is created.
|
Ok release is called when the deadline is exceeded but its not pushing it back into the pool because of
i.e. err is contextdeadlineexceeded maybe I can simulate a situation where this occurs. |
Same test using std interface and OpenDB - connection is successfully marked as Bad since its a timeout. @ns-gzhang how are the timeouts occurring? I'm using context so maybe this a server setting? |
Thank you Dale for debugging this. It's been a challenge to recreate. We've also been putting a lot of efforts trying to recreate, in a similar environment, as well as a controlled environment, and we were not successful up to this point. So the trouble spot is only our speculation from some external observations when the problem occurred. |
Are you possibly sharing the same context across two queries...i could possibly see how that could cause an issue on cancel. Specifically:
I'll see if I can reproduce but it sounds possible. |
@ns-gzhang i dont think its reuse of a context - the client guards against this. I'm going to put up a branch that closes the connection on cancel. Can you test in your environment? if it resolves the issue we'll close this - frustrating if we can't reproduce however. |
@ns-gzhang can you test #764 and see if it runs without error? If it does, we will merge. |
Thanks Dale. We will keep trying to recreate it and can only test after recreation. Will let you know. Thanks. |
any chance you could try the branch https://github.com/ClickHouse/clickhouse-go/tree/issue_727 @ns-gzhang - see if this solves it. I've added explicit close on cancel. |
@gingerwizard question: can a canceled connection be reused? If yes, then I'm thinking one scenario: the server sends the result at the same time the client sends the cancel, and they cross over. The client connection buffer would contain the result. The connection is canceled but later reused for another request without cleaning? (this was the scenario implicitly implied during the discussion of #727 ) |
No, a cancelled connection was marked as dirty and would not be reused in the current approach. It would require two queries to effectively take the same connection from the pool. dial and acquire prevent this - a new connection is created if < max open connections, otherwise one is taken off the idle pool. All buffers are local to the connection. The only real difference with the new code is the buffer is cleared and underlying connection explicitly closed before its returned. I was hoping this might resolve issues at a lower level, where connections might be reused. |
If cancel is not the path to the trouble, under what condition, database/sql standard could release a Conn, and get it reused without possibly cleaning the connection? (ResetSession is called to reset a connection if used before, but that does nothing if the connection is not bad, in the driver code) |
so i was refering to the native interface @ns-gzhang its possible there is something in the indirection of sql/database that makes this possible. We could close the connection on ResetSession. Are you able to test these changes under your production load? otherwise we're making changes and speculating without a reproducible test case |
@ns-gzhang did the branch version with connection cancellation work? |
Hi Dale @gingerwizard I have not been able to recreate in a controlled environment yet so I couldn't verify. We will keep trying... It's important to us (and others). Thanks. |
@ns-gzhang feel free to reopen this issue if the issue is still reproducible. thank you |
Sure. Thanks @gingerwizard @jkaflik and all for your support and changes made. Really appreciate it. |
@ns-gzhang note that we did merge the connection reset code in the event of error. If you update your driver in production maybe you can tell us if it still occurs? This hasn't been released yet but will ping you when it is. |
Thanks again. I saw that merge. We tried hard to simulate the possible scenarios in our controlled environments using exactly the same CH server version and the client-driver version. I couldn't recreate. I still suspect that concurrency of std database/sql interface and the driver played some role. Friends at Altinity also helped looking into this. |
From #727 (comment)
#727 (comment)
The text was updated successfully, but these errors were encountered: