
Fix cluster/pool release of connection #6361

Merged: 1 commit merged into influxdata:master from cluster-pool-hang on Apr 25, 2016

Conversation

@PierreF (Contributor) commented Apr 13, 2016

  • Rebased/mergable
  • Tests pass
  • CHANGELOG.md updated
  • Sign CLA (if not already signed)

This PR fixes a possible resource exhaustion in connections to other nodes.
I can reproduce this issue on v0.11.1, and this PR fixes it. On 0.12/master the issue should no longer exist, at least in the open-source version. That is why I did not update CHANGELOG.md.

The issue happens when a load spike causes multiple threads to create new connections on the pool. This can cause total to exceed capacity: see the code of boundedPool.Get.
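
Below is a minimal sketch of the shape of that Get path, with names and details simplified (this is not the verbatim upstream code): the unsynchronized check-then-increment is what lets concurrent callers push total past capacity.

```go
// Sketch of the racy pool, loosely following the shape of the 0.11
// cluster pool code (simplified, not the verbatim upstream code).
package pool

import (
	"errors"
	"net"
	"sync/atomic"
	"time"
)

var ErrClosed = errors.New("pool is closed")

type boundedPool struct {
	conns    chan net.Conn            // idle connections, buffered to capacity
	total    int32                    // number of live connections (atomic)
	capacity int32                    // maximum allowed connections
	timeout  time.Duration            // how long Get waits when at capacity
	factory  func() (net.Conn, error) // dials the remote node
}

func (p *boundedPool) Get() (net.Conn, error) {
	select {
	case conn := <-p.conns:
		if conn == nil {
			return nil, ErrClosed
		}
		return conn, nil
	default:
		// Pool is empty: open a new connection if we are under capacity.
		// RACE: under a load spike, several goroutines can all observe
		// total < capacity here before any of them increments it, so
		// total can end up above capacity (e.g. 6 with capacity 3).
		if atomic.LoadInt32(&p.total) < p.capacity {
			conn, err := p.factory()
			if err != nil {
				return nil, err
			}
			atomic.AddInt32(&p.total, 1)
			return conn, nil
		}
	}
	// At capacity and nothing idle: wait for a returned connection,
	// or give up after a timeout.
	select {
	case conn := <-p.conns:
		if conn == nil {
			return nil, ErrClosed
		}
		return conn, nil
	case <-time.After(p.timeout):
		return nil, errors.New("timeout waiting for free connection")
	}
}
```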

Later, when the excess connections are returned to the pool, they are discarded (since the pool is full) but total is NOT decremented: see the code of boundedPool.put.
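
A sketch of the corresponding put, continuing the boundedPool type from the sketch above (again simplified, not the verbatim code); the discard branch is where the accounting leaks:

```go
// put returns a connection to the pool, or discards it when the pool
// is already full (continuing the boundedPool sketch above).
func (p *boundedPool) put(conn net.Conn) error {
	select {
	case p.conns <- conn:
		// Parked in the idle channel; still counted in total.
		return nil
	default:
		// Pool is full: the connection is closed and dropped, but
		// BUG: total is NOT decremented, so the counter now includes
		// "ghost" connections that no longer exist anywhere.
		return conn.Close()
	}
}
```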

At this point, with the default values, the pool has 3 working connections (== capacity), but total may be larger than that, let's say 6.

Now, if the target node goes down, the connections will be marked as unusable, which closes them and removes them from the pool: see the code of boundedPool.MarkUnusable.

Here total is decremented, but since it was above capacity, we end up with an empty pool (all connections fail/time out since the target node is down) while total is still above 0. It will be 3 in our example.
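
Roughly, the unusable path behaves like the helper below (condensed onto the sketch type; in the real code MarkUnusable sets a flag on a wrapped connection and its Close then really closes it instead of calling put):

```go
// Simplified version of the mark-unusable/close path on the sketch type.
func (p *boundedPool) closeUnusable(conn net.Conn) error {
	// Each real close subtracts one from total. But the "ghost"
	// connections discarded in put were never subtracted, so with
	// capacity 3 and total 6, closing the 3 real connections leaves
	// an empty pool with total still at 3.
	atomic.AddInt32(&p.total, -1)
	return conn.Close()
}
```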

From then on, every call to boundedPool.Get will fail:

  • The pool is empty, so no existing connection can be retrieved
  • total = 3, which is NOT less than capacity, so Get does not open a new connection
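
In that state the pool is wedged for good. The repair, in the spirit of this PR (a sketch, not the exact patch), is to keep total honest by decrementing it whenever a connection is discarded rather than returned:

```go
// Fixed put (sketch in the spirit of this PR, not the verbatim patch):
// a connection that is dropped instead of pooled must also leave the
// accounting, so total never exceeds the number of live connections.
func (p *boundedPool) put(conn net.Conn) error {
	select {
	case p.conns <- conn:
		return nil
	default:
		// Pool is full: close the connection AND decrement total,
		// so a later Get can open a fresh connection again.
		atomic.AddInt32(&p.total, -1)
		return conn.Close()
	}
}
```

With that invariant restored, once the dead node's connections are all closed, total drops back to 0 and Get is free to dial new connections when the node returns.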

@PierreF PierreF mentioned this pull request Apr 13, 2016
@jwilder (Contributor) commented Apr 19, 2016

@PierreF Is there some way I can reproduce this locally?

@PierreF (Contributor, Author) commented Apr 20, 2016

As said in #5876, I was able to reproduce it with the following steps:

  • Restart all cluster nodes (I always used a 3-node cluster)
  • Start influx_stress on node1, keep it running for about 30 seconds, then stop it. The objective of this step is to create a rush in boundedPool.Get and have total exceed the capacity. We stop the stress tool to be sure it puts all connections back in the pool (thus discarding the excess connections without reducing the total counter)
  • Kill/stop node2
  • Restart influx_stress on node1. This should mark all connections as unusable and close them, since node2 is down. Once all connections are closed, we end up with no connections in the pool and total > 0

With those steps, I was able to reproduce it every time on my laptop, using 0.11.1 with a 3-node cluster. Needless to say, with this PR I was no longer able to reproduce the issue using the above steps.

@jwilder (Contributor) commented Apr 25, 2016

@PierreF Can you add a changelog entry under 0.13?

@jwilder jwilder added this to the 0.13.0 milestone Apr 25, 2016
@PierreF PierreF force-pushed the cluster-pool-hang branch from 909b089 to 9f9a3fc on April 25, 2016 18:08
@PierreF (Contributor, Author) commented Apr 25, 2016

Changelog entry added + rebased on master

@jwilder (Contributor) commented Apr 25, 2016

👍 Thanks @PierreF!

@jwilder jwilder merged commit 9a3a52a into influxdata:master Apr 25, 2016