Concurrency bug in read for cancel #761

Closed
gingerwizard opened this issue Sep 23, 2022 · 20 comments

@gingerwizard (Collaborator)

From #727 (comment)

We had an interesting scenario as more background for your investigation. It may or may not lead to the problem we had above. Here is the observation for one request (with the v1 driver):

we set the timeout limit to 90s and sent the query to the clickhouse-go client.
from the CH server query_log, we found the query arrived at the server (QueryStart event) about 89s later.
we assumed that the clickhouse-go driver was waiting for an open connection (during a relatively busy time, all connections were in use).
the clickhouse-go client timed out after 90s and went away (the CH server noticed that event).
the CH server finished the query in 6+s, tried to send the results back, and found the client had already left, resulting in a 210 error code.

The reason I'd like to describe this scenario is that there might be a possibility that, in the v2 driver, a timed-out connection could be reused by a future request, since the connection didn't get "reset" when the client timed it out earlier. Is that possible? Thanks.

#727 (comment)

@gingerwizard gingerwizard added this to the 2.4.0 milestone Sep 23, 2022
@gingerwizard gingerwizard self-assigned this Sep 23, 2022
@gingerwizard gingerwizard changed the title from Concurrency bug in read to Concurrency bug in read for cancel Sep 24, 2022
@gingerwizard (Collaborator, Author) commented Sep 25, 2022

@ns-gzhang I'm not clear how the connection can be reused, since on cancel we don't release it. Re #727 (comment)

The connection is not released back into the idle pool since it has an error. Debugging the following shows a new connection is created.

func TestQueryCancel(t *testing.T) {
	env, err := GetNativeTestEnvironment()
	require.NoError(t, err)
	useSSL, err := strconv.ParseBool(GetEnv("CLICKHOUSE_USE_SSL", "false"))
	require.NoError(t, err)
	port := env.Port
	var tlsConfig *tls.Config
	if useSSL {
		port = env.SslPort
		tlsConfig = &tls.Config{}
	}
	conn, err := GetConnectionWithOptions(&clickhouse.Options{
		Addr: []string{fmt.Sprintf("%s:%d", env.Host, port)},
		Auth: clickhouse.Auth{
			Database: "default",
			Username: env.Username,
			Password: env.Password,
		},
		Compression: &clickhouse.Compression{
			Method: clickhouse.CompressionLZ4,
		},
		TLS:          tlsConfig,
		MaxOpenConns: 1,
		MaxIdleConns: 1,
		Debug:        true,
	})
	require.NoError(t, err)
	// Issue a query which will take 3 secs but with a 100ms context deadline, then reissue a query which takes 3 secs - check the response is from query 2, not query 1

	ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(time.Millisecond*time.Duration(100)))
	defer cancel()
	var queryId uint16
	// query 1
	err = conn.QueryRow(ctx, "SELECT sleep(3) + 1 as query_id").Scan(&queryId)
	require.Error(t, err)
	// query 2
	err = conn.QueryRow(context.Background(), "SELECT sleep(3) + 2 as query_id").Scan(&queryId)
	require.NoError(t, err)
	require.Equal(t, uint16(2), queryId)
}

@gingerwizard (Collaborator, Author) commented Sep 25, 2022

OK, release is called when the deadline is exceeded, but it's not pushing the connection back into the pool because of

	if err != nil || time.Since(conn.connectedAt) >= ch.opt.ConnMaxLifetime {

i.e. err is context.DeadlineExceeded.

Maybe I can simulate a situation where this occurs.
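
For anyone following along, here is a minimal, self-contained sketch of the kind of release logic quoted above - the pool/conn types and field names are illustrative only, not clickhouse-go's actual structs. The point is just that any error (including context.DeadlineExceeded) or an expired ConnMaxLifetime means the connection is closed rather than returned to the idle pool:

package sketch

import "time"

// Illustrative types only - not the driver's real structs.
type conn struct {
	connectedAt time.Time
}

func (c *conn) close() {
	// tear down the underlying net.Conn here
}

type pool struct {
	idle            chan *conn
	connMaxLifetime time.Duration
}

// release mirrors the check quoted above: an error (e.g. context.DeadlineExceeded)
// or an expired lifetime means the connection never re-enters the idle pool.
func (p *pool) release(c *conn, err error) {
	if err != nil || time.Since(c.connectedAt) >= p.connMaxLifetime {
		c.close()
		return
	}
	select {
	case p.idle <- c: // healthy connection goes back into the idle pool
	default:
		c.close() // idle pool is full
	}
}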

@gingerwizard (Collaborator, Author)

Same test using the std interface and OpenDB - the connection is successfully marked as Bad since it's a timeout.

(screenshot of the debug output omitted)

@ns-gzhang how are the timeouts occurring? I'm using a context, so maybe this is a server setting?
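
For completeness, a rough sketch of the std-interface variant I'm describing - not the exact test in the repo, the test name is made up, it assumes the same test-environment helpers as the native test above, and it skips the TLS/compression handling. The first query exceeds the context deadline, database/sql should mark that connection as Bad, and the second query must come back on a fresh connection with its own result:

func TestStdQueryCancelSketch(t *testing.T) {
	env, err := GetNativeTestEnvironment()
	require.NoError(t, err)
	db := clickhouse.OpenDB(&clickhouse.Options{
		Addr: []string{fmt.Sprintf("%s:%d", env.Host, env.Port)},
		Auth: clickhouse.Auth{
			Database: "default",
			Username: env.Username,
			Password: env.Password,
		},
	})
	db.SetMaxOpenConns(1)
	db.SetMaxIdleConns(1)

	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()
	var queryId uint16
	// query 1 - exceeds the deadline; the timed-out connection should be marked Bad
	require.Error(t, db.QueryRowContext(ctx, "SELECT sleep(3) + 1 as query_id").Scan(&queryId))
	// query 2 - must be served by a fresh connection and return its own result
	require.NoError(t, db.QueryRow("SELECT sleep(3) + 2 as query_id").Scan(&queryId))
	require.Equal(t, uint16(2), queryId)
}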

@ns-gzhang

Thank you Dale for debugging this. It's been a challenge to recreate. We've also been putting a lot of effort into trying to recreate it, both in a similar environment and in a controlled environment, and we have not been successful up to this point. So the trouble spot is only our speculation from some external observations when the problem occurred.
In a controlled environment, I have tried a few different timeout combinations, including: the context deadline exceeded in the driver sooner than on the server (the main scenario); the app caller went away without fetching the rows (due to a context timeout in the caller after the return from the client driver); the timeout occurred after rows were returned to the caller, and the caller was trying to fetch the result but the connection timed out...
The problem occurred during heavy load times, and the server seemed to finish the query execution while the client had already gone away. Then a new victim query came along and just got that result back. That's why we suspected a cancellation problem. I don't know under what conditions a connection would be returned to the pool and get reused. And what kinds of race conditions are possible - is there a possible race condition across the standard database/sql and driver boundaries (the order of marking a connection free vs. marking an error vs. reuse)? Just some food for thought.

@gingerwizard (Collaborator, Author)

Are you possibly sharing the same context across two queries? I could see how that might cause an issue on cancel. Specifically:

  1. query q1 is cancelled on timeout. It tries to write the cancel signal.
  2. As q1 is writing its cancel signal, q2 modifies the context marking it as not Done() (not sure where this will happen - I need to dig)
  3. q1 completes cancel and then drops into https://github.com/ClickHouse/clickhouse-go/blob/main/conn_process.go#L68 returning nil (as ctx is no longer finished).

I'll see if I can reproduce but it sounds possible.
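
To make the question concrete, this is the caller-side pattern I mean by "sharing the same context across two queries" - purely hypothetical code, continuing with the native conn from the test above; both queries observe the same Done() channel, so a cancellation raised while q1's cancel signal is being written is still in play when q2 starts on the connection:

	// Hypothetical caller code - not taken from the reporter's application.
	ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
	defer cancel()

	var r1, r2 uint8
	// q1 and q2 share ctx and therefore the same Done() channel
	_ = conn.QueryRow(ctx, "SELECT sleep(3) AS q1").Scan(&r1)
	_ = conn.QueryRow(ctx, "SELECT sleep(3) AS q2").Scan(&r2)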

@gingerwizard (Collaborator, Author)

@ns-gzhang I don't think it's reuse of a context - the client guards against this. I'm going to put up a branch that closes the connection on cancel. Can you test it in your environment? If it resolves the issue we'll close this - frustrating if we can't reproduce it, however.

@gingerwizard (Collaborator, Author)

@ns-gzhang can you test #764 and see if it runs without error? If it does, we will merge.

@ns-gzhang

Thanks Dale. We will keep trying to recreate it and can only test after recreation. Will let you know. Thanks.

@gingerwizard (Collaborator, Author) commented Sep 27, 2022

Any chance you could try the branch https://github.com/ClickHouse/clickhouse-go/tree/issue_727 @ns-gzhang and see if this solves it? I've added an explicit close on cancel.

@ns-gzhang commented Sep 27, 2022

@gingerwizard question: can a cancelled connection be reused? If yes, then I'm thinking of one scenario: the server sends the result at the same time the client sends the cancel, and they cross over. The client connection buffer would then contain the result. The connection is cancelled but later reused for another request without being cleaned? (this was the scenario implied during the discussion of #727)
In this case, your explicit close will definitely fix the problem!

@gingerwizard (Collaborator, Author)

No, a cancelled connection is marked as dirty and will not be reused in the current approach. It would require two queries to effectively take the same connection from the pool; dial and acquire prevent this - a new connection is created if we're below max open connections, otherwise one is taken off the idle pool. All buffers are local to the connection.

The only real difference with the new code is that the buffer is cleared and the underlying connection is explicitly closed before it's returned. I was hoping this might resolve issues at a lower level, where connections might be reused.

@ns-gzhang

OK, we are using the database/sql standard interface for the most part. And it looks like the driver maintains another level of connection pooling? Not sure if two connection pools at two levels could cause problems? Here is what database/sql says:
(screenshot of the database/sql documentation omitted)
If the lifecycle of a connection is different at the two levels, there could be some state that does not satisfy the invariants?
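
To illustrate what I mean by the two levels: the driver's Options expose MaxOpenConns/MaxIdleConns for its own pooling (as in the native test earlier in this thread), while database/sql keeps a separate pool configured on the *sql.DB handle. A small sketch of the database/sql side, with a placeholder address and limits:

	db := clickhouse.OpenDB(&clickhouse.Options{
		Addr: []string{"127.0.0.1:9000"},
	})
	// database/sql's own pool settings, independent of the driver's options.
	db.SetMaxOpenConns(10)
	db.SetMaxIdleConns(5)
	db.SetConnMaxLifetime(time.Hour)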

@ns-gzhang commented Sep 28, 2022

If cancel is not the path to the trouble, under what conditions could the standard database/sql release a Conn and get it reused without the connection being cleaned? (ResetSession is called to reset a connection if it was used before, but in the driver code that does nothing if the connection is not bad.)
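For reference, these are the hooks I'm referring to: database/sql consults the optional driver.SessionResetter and driver.Validator interfaces before handing a pooled connection back out. A minimal, hypothetical sketch - the badConn type and its bad flag are made up, not the clickhouse-go implementation:

package sketch

import (
	"context"
	"database/sql/driver"
)

// badConn is a hypothetical driver connection that remembers whether it hit an error.
type badConn struct {
	driver.Conn // embedded underlying connection
	bad         bool
}

// ResetSession runs before the pooled connection is reused; returning
// driver.ErrBadConn makes database/sql discard it and dial a fresh one.
func (c *badConn) ResetSession(ctx context.Context) error {
	if c.bad {
		return driver.ErrBadConn
	}
	return nil // no-op when the connection is believed healthy
}

// IsValid is checked before the connection is placed back into the pool;
// returning false causes it to be discarded rather than kept idle.
func (c *badConn) IsValid() bool {
	return !c.bad
}

var (
	_ driver.SessionResetter = (*badConn)(nil)
	_ driver.Validator       = (*badConn)(nil)
)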

@gingerwizard (Collaborator, Author)

So I was referring to the native interface @ns-gzhang - it's possible there is something in the indirection of database/sql that makes this possible.

We could close the connection on ResetSession. Are you able to test these changes under your production load? Otherwise we're making changes and speculating without a reproducible test case.

@gingerwizard (Collaborator, Author)

@ns-gzhang did the branch version with connection cancellation work?

@ns-gzhang

Hi Dale @gingerwizard, I have not been able to recreate this in a controlled environment yet, so I couldn't verify. We will keep trying... It's important to us (and others). Thanks.

@gingerwizard gingerwizard modified the milestones: 2.4.0, 2.5.0 Nov 23, 2022
@jkaflik jkaflik added stale requires a follow-up and removed help wanted labels Dec 15, 2022
@jkaflik (Contributor) commented Dec 16, 2022

@ns-gzhang feel free to reopen this issue if it is still reproducible. Thank you.

@jkaflik jkaflik closed this as completed Dec 16, 2022
@ns-gzhang

Sure. Thanks @gingerwizard @jkaflik and all for your support and changes made. Really appreciate it.

@gingerwizard (Collaborator, Author)

@ns-gzhang note that we did merge the connection reset code in the event of an error. If you update your driver in production, maybe you can tell us if it still occurs? This hasn't been released yet, but we will ping you when it is.

@ns-gzhang commented Dec 17, 2022

Thanks again. I saw that merge. We tried hard to simulate the possible scenarios in our controlled environments, using exactly the same CH server version and client driver version, but I couldn't recreate it. I still suspect that concurrency between the std database/sql interface and the driver played some role. Friends at Altinity also helped look into this.
