Fix race condition in ExponentiallyDecayingReservoir's rescale method #1033

Merged: 2 commits merged into dropwizard:3.2-development on Jan 5, 2017
Conversation

ho3rexqj
Contributor

Fixes issues #1005 and #793 - note that in #793 the longPeriodsOfInactivityShouldNotCorruptSamplingState test did not cover multiple concurrent requests after a long idle period.

The race condition currently occurs when two threads update the reservoir concurrently: after the first thread has successfully updated nextScaleTime (line 162) but before it obtains the write lock (line 163), a second thread can update the reservoir (line 95+). With several threads pushing updates concurrently, this presents a narrow window during which the second thread may obtain the lock before the first, resulting in a corrupt reservoir state. Note, however, that if a snapshot is requested just before the two concurrent updates, the read lock will already be held when the two threads enter rescaleIfNeeded, which widens the window for this race considerably.
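A minimal sketch of the pre-fix ordering being described (field and method names follow ExponentiallyDecayingReservoir, but the bodies are simplified illustrations, not the library's exact source):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Illustrative sketch of the pre-fix control flow; simplified, not the exact source.
class RescaleSketch {
    private static final long RESCALE_THRESHOLD = 3_600_000_000_000L; // roughly one hour in ticks
    private final AtomicLong nextScaleTime = new AtomicLong();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void rescaleIfNeeded(long now) {
        final long next = nextScaleTime.get();
        if (now >= next) {
            rescale(now, next);
        }
    }

    private void rescale(long now, long next) {
        // Exactly one thread wins the CAS...
        if (nextScaleTime.compareAndSet(next, now + RESCALE_THRESHOLD)) {
            // ...but between the CAS above and acquiring the write lock below,
            // other threads already see the advanced nextScaleTime, skip the
            // rescale, and proceed to update the reservoir under the read lock;
            // one of them may acquire its lock before this thread does.
            lock.writeLock().lock();
            try {
                // advance startTime and reweight the stored samples here
            } finally {
                lock.writeLock().unlock();
            }
        }
    }
}
```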

@arteam arteam merged commit 4abcd4d into dropwizard:3.2-development Jan 5, 2017
@arteam
Member

arteam commented Jan 5, 2017

Thank you very much for the fix and very detailed unit test!

@ryantenney
Contributor

I have an issue with this. The compareAndSet operation is atomic, so it shouldn't allow more than one thread into that if block. Granted, 'lockForRescale()' effectively does the same thing; however, it is a more expensive operation than the compareAndSet.

@arteam
Member

arteam commented Jan 5, 2017

The issue as I see it:

  • Thread 1 calls update
  • Thread 2 calls update
  • Thread 1 calls rescaleIfNeeded
  • Thread 2 calls rescaleIfNeeded
  • Thread 1 reads the CAS and updates it
  • Thread 1 gets swapped out
  • Thread 2 reads the updated CAS and exits
  • Thread 2 obtains a read lock and works with a new weight, but old (not scaled) values
  • Thread 1 gets swapped in and acquires a write lock, but it's too late

Basically, the CAS only guarantees that nextScaleTime gets updated atomically; it does not guarantee anything about the rest of the reservoir's state.
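A minimal sketch of how that window can be closed, in the spirit of this PR (an illustration of the approach, not necessarily the exact patch): perform the deadline check and the CAS only while the write lock is held, so the entire rescale is serialized against concurrent updates.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of the fix direction: CAS and rescale happen under the write lock,
// so no update() can interleave between them. Simplified, not the exact patch.
class RescaleUnderLockSketch {
    private static final long RESCALE_THRESHOLD = 3_600_000_000_000L; // roughly one hour in ticks
    private final AtomicLong nextScaleTime = new AtomicLong();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    void rescaleIfNeeded(long now) {
        final long next = nextScaleTime.get();
        if (now >= next) {
            lock.writeLock().lock(); // excludes concurrent update()s holding the read lock
            try {
                // Re-check under the lock: only the first thread in wins the CAS
                // and rescales; latecomers see the advanced deadline and do nothing.
                if (nextScaleTime.compareAndSet(next, now + RESCALE_THRESHOLD)) {
                    // advance startTime and reweight the stored samples here
                }
            } finally {
                lock.writeLock().unlock();
            }
        }
    }
}
```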

@arteam
Member

arteam commented Jan 9, 2017

If the fix is correct, I would like to cherry-pick this change to the 3.1.3-maintenance branch to give users who are affected by this bug an easy upgrade path.

@arteam
Member

arteam commented Jan 23, 2017

I haven't heard any further objections to this fix, so I'm going to move this bugfix to the 3.1.3 branch. In the worst case we have a performance regression; in the best case we fix a concurrency bug.

@arteam arteam added this to the 3.1.3 milestone Jan 23, 2017
@arteam arteam mentioned this pull request Jan 23, 2017
@arteam arteam modified the milestones: 3.2.0, 3.1.3 Jan 23, 2017
@ho3rexqj ho3rexqj deleted the fix/exponentially_decaying_reservoir_rescale_race branch February 3, 2017 03:13
@ahadadi

ahadadi commented May 1, 2017

@arteam, I think your comment "Thread 2 obtains a read lock and works with a new weight, but old (not scaled) values" is incorrect.
The weight only becomes new when startTime gets updated, and startTime is updated only under the write lock held by thread 1.
I still think this is a good change, as it simplifies reasoning about the code.
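For context, a simplified sketch of the relationship being pointed out here, assuming the reservoir's usual scheme (illustrative, not the library's exact code): a sample's weight is derived from the elapsed time since startTime, so a "new" weight can only appear after startTime has been advanced, which happens only under the write lock during a rescale.

```java
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified sketch: the weight is a function of (timestamp - startTime), and
// startTime moves only while the write lock is held. Illustrative bodies only.
class WeightSketch {
    private final double alpha = 0.015;
    private volatile long startTime;                       // seconds; advanced only under the write lock
    private final ConcurrentSkipListMap<Double, Long> values = new ConcurrentSkipListMap<>();
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    private double weight(long elapsedSeconds) {
        return Math.exp(alpha * elapsedSeconds);           // exponential decay weight
    }

    void update(long value, long timestampSeconds) {
        lock.readLock().lock();                            // many updaters may run concurrently
        try {
            // Until a rescale moves startTime, every updater computes weights
            // against the same baseline, whether or not a rescale is pending.
            final double itemWeight = weight(timestampSeconds - startTime);
            final double priority = itemWeight / Math.random();
            values.put(priority, value);                   // the real reservoir also bounds the map size
        } finally {
            lock.readLock().unlock();
        }
    }
}
```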
