Concurrency bug in read for cancel #761

Closed
gingerwizard opened this issue Sep 23, 2022 · 20 comments

@gingerwizard (Collaborator)

From #727 (comment)

We had an interesting scenario as more background for your investigation. It may or may not lead to the problem we had above. Here is the observation for one request (with the v1 driver):

we set the timeout limit to 90s and sent the query to the clickhouse-go client.
from the CH server query_log, we found the query arrived at the server (QueryStart event) about 89s later.
we assumed that the clickhouse-go driver was waiting for an open connection (during a relatively busy time, all connections were in use).
the clickhouse-go client timed out after 90s and went away (the CH server noticed that event).
the CH server finished the query in 6+s, tried to send the results back, and found the client had already left, resulting in a 210 error code.

The reason I'd like to describe this scenario is that there might be a possibility that, in the v2 driver, a timed-out connection could be reused by a future request, since the connection didn't get "reset" when the client timed it out earlier. Is that possible? Thanks.

#727 (comment)

@gingerwizard gingerwizard added this to the 2.4.0 milestone Sep 23, 2022
@gingerwizard gingerwizard self-assigned this Sep 23, 2022
@gingerwizard gingerwizard changed the title from Concurrency bug in read to Concurrency bug in read for cancel Sep 24, 2022
@gingerwizard (Collaborator, Author) commented Sep 25, 2022

@ns-gzhang I'm not clear how the connection can be reused, since on cancel we don't release it. Re #727 (comment)

The connection is not released back into the idle pool since it has an error. Debugging the following shows a new connection is created.

func TestQueryCancel(t *testing.T) {
	env, err := GetNativeTestEnvironment()
	require.NoError(t, err)
	useSSL, err := strconv.ParseBool(GetEnv("CLICKHOUSE_USE_SSL", "false"))
	require.NoError(t, err)
	port := env.Port
	var tlsConfig *tls.Config
	if useSSL {
		port = env.SslPort
		tlsConfig = &tls.Config{}
	}
	conn, err := GetConnectionWithOptions(&clickhouse.Options{
		Addr: []string{fmt.Sprintf("%s:%d", env.Host, port)},
		Auth: clickhouse.Auth{
			Database: "default",
			Username: env.Username,
			Password: env.Password,
		},
		Compression: &clickhouse.Compression{
			Method: clickhouse.CompressionLZ4,
		},
		TLS:          tlsConfig,
		MaxOpenConns: 1,
		MaxIdleConns: 1,
		Debug:        true,
	})
	require.NoError(t, err)
	// Issue a query which will take 3 secs but with a 100ms context deadline, then reissue a query which takes 3 secs - check the response is from query 2, not query 1

	ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(time.Millisecond*time.Duration(100)))
	defer cancel()
	var queryId uint16
	// query 1
	err = conn.QueryRow(ctx, "SELECT sleep(3) + 1 as query_id").Scan(&queryId)
	require.Error(t, err)
	// query 2
	err = conn.QueryRow(context.Background(), "SELECT sleep(3) + 2 as query_id").Scan(&queryId)
	require.NoError(t, err)
	require.Equal(t, uint16(2), queryId)
}

@gingerwizard (Collaborator, Author) commented Sep 25, 2022

OK, release is called when the deadline is exceeded, but it's not pushing the connection back into the pool because of

	if err != nil || time.Since(conn.connectedAt) >= ch.opt.ConnMaxLifetime {

i.e. err is context.DeadlineExceeded.

Maybe I can simulate a situation where this occurs.
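
For anyone following along, here is a minimal, self-contained sketch of the kind of release logic quoted above - the pool/conn types and field names are illustrative only, not clickhouse-go's actual structs. The point is just that any error (including context.DeadlineExceeded) or an expired ConnMaxLifetime means the connection is closed rather than returned to the idle pool:

package sketch

import "time"

// Illustrative types only - not the driver's real structs.
type conn struct {
	connectedAt time.Time
}

func (c *conn) close() {
	// tear down the underlying net.Conn here
}

type pool struct {
	idle            chan *conn
	connMaxLifetime time.Duration
}

// release mirrors the check quoted above: an error (e.g. context.DeadlineExceeded)
// or an expired lifetime means the connection never re-enters the idle pool.
func (p *pool) release(c *conn, err error) {
	if err != nil || time.Since(c.connectedAt) >= p.connMaxLifetime {
		c.close()
		return
	}
	select {
	case p.idle <- c: // healthy connection goes back into the idle pool
	default:
		c.close() // idle pool is full
	}
}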

@gingerwizard (Collaborator, Author)

Same test using the std interface and OpenDB - the connection is successfully marked as Bad since it's a timeout.

(screenshot of the debug output omitted)

@ns-gzhang how are the timeouts occurring? I'm using a context, so maybe this is a server setting?
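
For completeness, a rough sketch of the std-interface variant I'm describing - not the exact test in the repo, the test name is made up, it assumes the same test-environment helpers as the native test above, and it skips the TLS/compression handling. The first query exceeds the context deadline, database/sql should mark that connection as Bad, and the second query must come back on a fresh connection with its own result:

func TestStdQueryCancelSketch(t *testing.T) {
	env, err := GetNativeTestEnvironment()
	require.NoError(t, err)
	db := clickhouse.OpenDB(&clickhouse.Options{
		Addr: []string{fmt.Sprintf("%s:%d", env.Host, env.Port)},
		Auth: clickhouse.Auth{
			Database: "default",
			Username: env.Username,
			Password: env.Password,
		},
	})
	db.SetMaxOpenConns(1)
	db.SetMaxIdleConns(1)

	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	var queryId uint16
	// query 1 - exceeds the deadline; the timed-out connection should be marked Bad
	require.Error(t, db.QueryRowContext(ctx, "SELECT sleep(3) + 1 as query_id").Scan(&queryId))
	// query 2 - must be served by a fresh connection and return its own result
	require.NoError(t, db.QueryRow("SELECT sleep(3) + 2 as query_id").Scan(&queryId))
	require.Equal(t, uint16(2), queryId)
}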

@ns-gzhang

Thank you Dale for debugging this. It's been a challenge to recreate. We've also been putting a lot of effort into trying to recreate it, both in a similar environment and in a controlled environment, and we have not been successful up to this point. So the trouble spot is only our speculation from some external observations when the problem occurred.
In a controlled environment, I have tried a few different timeout combinations, including: the context deadline exceeded in the driver sooner than on the server (the main scenario); the app caller went away without fetching the rows (due to a context timeout in the caller after the return from the client driver); the timeout occurred after rows were returned to the caller, and the caller was trying to fetch the result but the connection timed out...
The problem occurred during heavy load times, and the server seemed to finish the query execution while the client had already gone away. Then a new victim query came along and just got that result back. That's why we suspected a cancellation problem. I don't know under what conditions a connection would be returned to the pool and get reused. And what kinds of race conditions are possible - is there a possible race condition across the standard database/sql and driver boundaries (the order of marking a connection free vs. marking an error vs. reuse)? Just some food for thought.

@gingerwizard (Collaborator, Author)

Are you possibly sharing the same context across two queries? I could see how that might cause an issue on cancel. Specifically:

  1. query q1 is cancelled on timeout. It tries to write the cancel signal.
  2. As q1 is writing its cancel signal, q2 modifies the context marking it as not Done() (not sure where this will happen - I need to dig)
  3. q1 completes cancel and then drops into https://github.com/ClickHouse/clickhouse-go/blob/main/conn_process.go#L68 returning nil (as ctx is no longer finished).

I'll see if I can reproduce but it sounds possible.
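
To make the question concrete, this is the caller-side pattern I mean by "sharing the same context across two queries" - purely hypothetical code, continuing with the native conn from the test above; both queries observe the same Done() channel, so a cancellation raised while q1's cancel signal is being written is still in play when q2 starts on the connection:

	// Hypothetical caller code - not taken from the reporter's application.
	ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
	defer cancel()

	var r1, r2 uint8
	// q1 and q2 share ctx and therefore the same Done() channel
	_ = conn.QueryRow(ctx, "SELECT sleep(3) AS q1").Scan(&r1)
	_ = conn.QueryRow(ctx, "SELECT sleep(3) AS q2").Scan(&r2)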

@gingerwizard (Collaborator, Author)

@ns-gzhang I don't think it's reuse of a context - the client guards against this. I'm going to put up a branch that closes the connection on cancel. Can you test it in your environment? If it resolves the issue we'll close this - frustrating if we can't reproduce it, however.

@gingerwizard (Collaborator, Author)

@ns-gzhang can you test #764 and see if it runs without error? If it does, we will merge.

@ns-gzhang

Thanks Dale. We will keep trying to recreate it and can only test after recreation. Will let you know. Thanks.

@gingerwizard (Collaborator, Author) commented Sep 27, 2022

Any chance you could try the branch https://github.com/ClickHouse/clickhouse-go/tree/issue_727 @ns-gzhang and see if this solves it? I've added an explicit close on cancel.

@ns-gzhang commented Sep 27, 2022

@gingerwizard question: can a cancelled connection be reused? If yes, then I'm thinking of one scenario: the server sends the result at the same time the client sends the cancel, and they cross over. The client connection buffer would then contain the result. The connection is cancelled but later reused for another request without being cleaned? (this was the scenario implied during the discussion of #727)
In this case, your explicit close will definitely fix the problem!

@gingerwizard (Collaborator, Author)

No, a cancelled connection is marked as dirty and will not be reused in the current approach. It would require two queries to effectively take the same connection from the pool; dial and acquire prevent this - a new connection is created if we're below max open connections, otherwise one is taken off the idle pool. All buffers are local to the connection.

The only real difference with the new code is that the buffer is cleared and the underlying connection is explicitly closed before it's returned. I was hoping this might resolve issues at a lower level, where connections might be reused.

@ns-gzhang

OK, we are using the database/sql standard interface for the most part. And it looks like the driver maintains another level of connection pooling? Not sure if two connection pools at two levels could cause problems? Here is what database/sql says:
(screenshot of the database/sql documentation omitted)
If the lifecycle of a connection is different at the two levels, there could be some state that does not satisfy the invariants?
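
To illustrate what I mean by the two levels: the driver's Options expose MaxOpenConns/MaxIdleConns for its own pooling (as in the native test earlier in this thread), while database/sql keeps a separate pool configured on the *sql.DB handle. A small sketch of the database/sql side, with a placeholder address and limits:

	db := clickhouse.OpenDB(&clickhouse.Options{
		Addr: []string{"127.0.0.1:9000"},
	})
	// database/sql's own pool settings, independent of the driver's options.
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(5)
	db.SetConnMaxLifetime(time.Hour)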

@ns-gzhang commented Sep 28, 2022

If cancel is not the path to the trouble, under what conditions could the standard database/sql release a Conn and get it reused without the connection being cleaned? (ResetSession is called to reset a connection if it was used before, but in the driver code that does nothing if the connection is not bad.)
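For reference, these are the hooks I'm referring to: database/sql consults the optional driver.SessionResetter and driver.Validator interfaces before handing a pooled connection back out. A minimal, hypothetical sketch - the badConn type and its bad flag are made up, not the clickhouse-go implementation:

package sketch

import (
	"context"
	"database/sql/driver"
)

// badConn is a hypothetical driver connection that remembers whether it hit an error.
type badConn struct {
	driver.Conn // embedded underlying connection
	bad         bool
}

// ResetSession runs before the pooled connection is reused; returning
// driver.ErrBadConn makes database/sql discard it and dial a fresh one.
func (c *badConn) ResetSession(ctx context.Context) error {
	if c.bad {
		return driver.ErrBadConn
	}
	return nil // no-op when the connection is believed healthy
}

// IsValid is checked before the connection is placed back into the pool;
// returning false causes it to be discarded rather than kept idle.
func (c *badConn) IsValid() bool {
	return !c.bad
}

var (
	_ driver.SessionResetter = (*badConn)(nil)
	_ driver.Validator       = (*badConn)(nil)
)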

@gingerwizard (Collaborator, Author)

So I was referring to the native interface @ns-gzhang - it's possible there is something in the indirection of database/sql that makes this possible.

We could close the connection on ResetSession. Are you able to test these changes under your production load? Otherwise we're making changes and speculating without a reproducible test case.

@gingerwizard (Collaborator, Author)

@ns-gzhang did the branch version with connection cancellation work?

@ns-gzhang

Hi Dale @gingerwizard, I have not been able to recreate this in a controlled environment yet, so I couldn't verify. We will keep trying... It's important to us (and others). Thanks.

@gingerwizard gingerwizard modified the milestones: 2.4.0, 2.5.0 Nov 23, 2022
@jkaflik jkaflik added stale requires a follow-up and removed help wanted labels Dec 15, 2022
@jkaflik (Contributor) commented Dec 16, 2022

@ns-gzhang feel free to reopen this issue if it is still reproducible. Thank you.

@jkaflik jkaflik closed this as completed Dec 16, 2022
@ns-gzhang

Sure. Thanks @gingerwizard @jkaflik and all for your support and changes made. Really appreciate it.

@gingerwizard (Collaborator, Author)

@ns-gzhang note that we did merge the connection reset code in the event of an error. If you update your driver in production, maybe you can tell us if it still occurs? This hasn't been released yet, but we will ping you when it is.

@ns-gzhang commented Dec 17, 2022

Thanks again. I saw that merge. We tried hard to simulate the possible scenarios in our controlled environments, using exactly the same CH server version and client driver version, but I couldn't recreate it. I still suspect that concurrency between the std database/sql interface and the driver played some role. Friends at Altinity also helped look into this.
