-
Notifications
You must be signed in to change notification settings - Fork 691
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SOLR-16257: ZkStateReader
changes to avoid race condition between collectionWatches
and watchedCollectionStates
#909
Conversation
|
…ntain consistent behavior with calls to `keySet` on the now refactored `watchedCollectionStates`
…tches instead of watched collection states, as newly created collection might not yet be reflected in watched collection states and should be refreshed.
…ntain consistent behavior with calls to `keySet` on the now refactored `watchedCollectionStates`
Forbidden method invocation: java.util.concurrent.Executors#newSingleThreadExecutor() [Spawns threads with vague names; use a custom thread factory (Lucene's NamedThreadFactory, Solr's SolrNamedThreadFactory) and name threads so that you can tell (by its name) which executor it is associated with] in org.apache.solr.cloud.overseer.ZkStateReaderTest (ZkStateReaderTest.java:359)
Forbidden method invocation: java.util.concurrent.Executors#newSingleThreadExecutor(java.util.concurrent.ThreadFactory) [Spawns threads without MDC logging context; use ExecutorUtil.newMDCAwareSingleThreadExecutor instead]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tackling this Patson!
I haven't gone over everything, but took a first-pass look. I think we should probably keep the helper methods instead of sub-classing ConcurrentHashMap. But that's a separate issue than what you are trying to fix really.
solr/core/src/test/org/apache/solr/cloud/overseer/ZkStateReaderTest.java
Outdated
Show resolved
Hide resolved
solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java
Outdated
Show resolved
Hide resolved
solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java
Outdated
Show resolved
Hide resolved
The original junit error was
Not sure if it's related to this change. This failures seems to happen occasionally http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.cloud.TestTlogReplica.testRebalanceLeaders . Not sure if my mac book makes a diff tho 😓 (1 out of 3 X 10 |
Fixed formatting
Yes we can discuss about that, I don't have any strong opinion here. My design of putting it into a separate class is based on the reasoning that updating the doc collection is tightly tied with the state of the of the watcher, especially that the With the helper approach, it might be slightly harder to figure out the helper modifies the field by inspecting the method signature alone. It's more of a personal preference that I found methods with side effect a bit hard to track sometimes 😅 (even my current approach it still modifies the state ie has side effects too! But I think it's more obvious that the call/change is applied to the Thank you for the helpful code review! |
Friendly ping @HoustonPutman . Wondering if you have any extra thoughts about this PR? 😊 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure if it's related to this change. This failures seems to happen occasionally http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.cloud.TestTlogReplica.testRebalanceLeaders . Not sure if my mac book makes a diff tho 😓 (1 out of 3 X 10
tests.dups
)
Yeah if it's on that list you generally don't need to worry about your likely unrelated PR making it worse.
My design of putting it into a separate class is based on the reasoning that updating the doc collection is tightly tied with the state of the of the watcher, especially that the
collectionsWatched
contains the DocCollection state now 😄 . Therefore I feel that might be reasonable encapsulation.
Ok I think I'm on your side now. But I do think there is some confusion between having a null
value in collectionsWatches
and having a null
currentDoc
in the value. Can you explain that difference?
Also can we rename currentDoc
to currentState
? Also DocCollectionWatch
would be better named StatefulCollectionWatch
, because the important characteristic is that its saving the current state of the collection.
Thanks again for the work here!
solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java
Outdated
Show resolved
Hide resolved
solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java
Outdated
Show resolved
Hide resolved
solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/cloud/overseer/ZkStateReaderTest.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/cloud/overseer/ZkStateReaderTest.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/cloud/overseer/ZkStateReaderTest.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/cloud/overseer/ZkStateReaderTest.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/cloud/overseer/ZkStateReaderTest.java
Outdated
Show resolved
Hide resolved
solr/core/src/test/org/apache/solr/cloud/overseer/ZkStateReaderTest.java
Outdated
Show resolved
Hide resolved
I have a commit to fix the return signature, could you check-off the |
It seems that the PR might be from an organisation repo and possibly the isaacs/github#1681 might still apply then, just FYI in case it does. |
Ahh thanks for reminding me about that @cpoerschke! |
solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java
Outdated
Show resolved
Hide resolved
….java Co-authored-by: Houston Putman <houstonputman@gmail.com>
@madrob @HoustonPutman I added 2 commits to better reproduce the race condition (copied the test changes to Seems like we have addressed all the review comments? If so, what are the next steps? Many thanks again! 😊 |
currentCollections = | ||
reader.getCurrentCollections(); // should detect both collections (c1 watched, c2 lazy | ||
// loaded) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this is strange wrapping, separate the comment onto its own line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I will try again, I ran both ./gradlew :solr:solrj:spotlessApply
and ./gradlew :solr:core:spotlessApply
and sometimes it produces some weird wrapping...
try { | ||
log.info("Start: artificial delay for {}ms", delay); | ||
Thread.sleep(delay); | ||
log.info("Finish: artificial delay for {}ms", delay); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: move this to a finally block so that we still log even if we were interrupted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point !
lazyCollectionStates.put(collection, new LazyCollectionRef(collection)); | ||
reconstructState.set(true); | ||
CommonTestInjection.injectDelay(); // To unit test race condition |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
our common pattern is to hide this behind a java assert
. See also https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/query/SolrRangeQuery.java#L183
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
O missed that one. Thanks for catching!
if (log.isDebugEnabled()) { | ||
log.debug( | ||
"clusterStateSet: interesting [{}] watched [{}] lazy [{}] total [{}]", | ||
collectionWatches.keySet().size(), | ||
watchedCollectionStates.keySet().size(), | ||
collectionWatches.watchedCollections().size(), | ||
collectionWatches.activeCollectionCount(), | ||
lazyCollectionStates.keySet().size(), | ||
clusterState.getCollectionStates().size()); | ||
} | ||
|
||
if (log.isTraceEnabled()) { | ||
log.trace( | ||
"clusterStateSet: interesting [{}] watched [{}] lazy [{}] total [{}]", | ||
collectionWatches.keySet(), | ||
watchedCollectionStates.keySet(), | ||
collectionWatches.watchedCollections(), | ||
collectionWatches.activeCollections(), | ||
lazyCollectionStates.keySet(), | ||
clusterState.getCollectionStates()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there still technically a race condition here where the watched collections and active collections could change between the successive calls?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe it could change but it should be okay.
As far as I understand, we only need a strong guarantee/synchronization on the write ops. As for read op, it appears none of the read operation requires such guarantee, ie even if the value read has changed or is inconsistent with the watchedCollectionStates
, it should not cause any serious issues. (it might print value that does not reflect the latest set or fetch/not fetch something - but these are not new behaviors)
The core of this PR is to remove the race conditions between write operations of 2 separate maps, which can cause stale state. And now since all write ops are on the same map with sync, we should be safe 😊 . As for read ops, it's pretty much the same as before, which does not need to have such strong guarantee I suppose?
solr/solrj/src/java/org/apache/solr/common/cloud/ZkStateReader.java
Outdated
Show resolved
Hide resolved
int oldCVersion = | ||
oldState.getPerReplicaStates() == null | ||
? -1 | ||
: oldState.getPerReplicaStates().cversion; | ||
int newCVersion = | ||
newState.getPerReplicaStates() == null | ||
? -1 | ||
: newState.getPerReplicaStates().cversion; | ||
if (oldState.getZNodeVersion() < newState.getZNodeVersion() | ||
|| oldCVersion < newCVersion) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a test that creates c1, gets the ref, deletes c1, creates c1 again, and then checks that the new ref is correct? The node version for both instances of c1 would be 0, but the cversion should be larger, so I think your logic still works, but I'd like to see a unit test for it. Or maybe this is already covered and I missed where it is in the tests? Thanks! Similar to what we saw in SOLR-16143.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@madrob Yes I think a test case to verify different version handling makes sense!
Added a new test case for that. And since we are testing cversion
, i added PRS into the unit test (otherwise it will always be -1).
Does this address your concern?
https://github.com/apache/solr/pull/909/files#diff-c92a116c89a724226fb6efbea4c4f90ab70b40a779e1c50155c8eca646520358R227-R342
@HoustonPutman Thanks for triggering the workflow (I assume it was you? 😅 ), Now that all the checks have passed, what are the next steps? (trying to read up on https://cwiki.apache.org/confluence/display/SOLR/Git+commit+process and https://cwiki.apache.org/confluence/display/solr/HowToContribute#HowToContribute-Contributingyourwork) There's no rush for this PR, but just want to make sure I am not holding off the process because of any missing items from me. Big thank you again @HoustonPutman and @madrob ! |
@patsonluk I think this is good to go! I'll add an entry in the changelog, then commit and backport this! (Gonna run the full test suite first though) |
Actually I can't push to your PR. Can you add this at the end of the 9.1.0 "Improvements" section:
|
Tests pass, this is good to go when we get the Changelog entry |
@HoustonPutman done! (hopefully this is the right place 😅 https://github.com/apache/solr/pull/909/files#diff-cdd1ae1b418ea74ca689c6e68e61f4e5b4c6f77f6408a4e6cf76ed8421b24d09) |
Thanks so much for the work here @patsonluk , it was a long one, but I think very beneficial to the |
Race condition existed between collectionWatches and watchedCollectionStates. (cherry picked from commit 97458a5)
Thank you so much for all the great reviews and code suggestions! @HoustonPutman @madrob 🎉 Hopefully such fix can eliminate occasional issue with Solr cluster state having the stale state, we do run into it once in a while in our prod and just once earlier this morning 😓 |
FYI this has been causing NPE in ZkStateReaderTest - #966 |
Thanks for fixing it! 🙇🏼 |
…n states Downstream changes from apache/solr#909 and apache/solr#966
…d collection states (#176) * Changes to avoid race condition in ZkStateReader on watched collection states Downstream changes from apache/solr#909 and apache/solr#966 * Resolve merge conflict
https://issues.apache.org/jira/browse/SOLR-16257
Description
As described in the Jira issue,
ZkStateReader
fieldscollectionWatches
andwatchedCollectionStates
can run into race conditions.And such condition might produce stale
clusterState
for some collection which can no longer be updated (until a new watch is registered for such collection)The existing design is intended to synchronize the 2 map fields
watchedCollectionStates
andcollectionWatches
(ie if a collection is no longer watched incollectionWatches
, then no entry for that collection should exist inwatchedCollectionStates
). However, it is hard to guarantee such "synchronization", one way to break it is demonstrated in "Steps to reproduce a race condition" within the Jira issueTake note that even though in most cases
watchedCollectionStates
andcollectionWatches
should have the same keys, there seems to one valid case which if a collection to be created could be incollectionWatches
but not yet inwatchedCollectionStates
.Solution
Perhaps it's easier to simply eliminate
watchedCollectionStates
and add suchstate
(asDocCollection
) to theCollectionWatch
itself (now added a new classDocCollectionWatch
with contains a ref to theDocCollection
) , such that we no longer need to worry about entry removed incollectionWatches
somehow still remains inwatchedCollectionStates
.Take note that there were 2 cases which used
watchedCollectionStates.keySet()
, that requires some extra consideration:forciblyRefreshAllClusterStateSlow
, which we are proposing to just replacewatchedCollectionStates.keySet()
withcollectionWatches.keySet()
, the reasoning is that we should forcibly update every collection that is registered incollectionWatches
, even if previous fetch on such collection returned null (iewatchedCollectionStates
would have no entry on it). This is a minor change of behavior but I think it is more "correct"? Would really need more experienced Solr dev to share their thoughts here 🙏🏼getCurrentCollections
, which we are proposing to add ONLY collections incollectionWatches
withDocCollection currentDoc
that is non null (ie, only collection that does exist from last update)Tests
Added
testForciblyRefreshAllClusterState
,testGetCurrentCollections
andtestWatchRaceCondition
toZkStateReaderTest
.The first 2 are to verify the refactoring maintain the same behaviors for those 2 methods. And the last one is to test specifically race condition mentioned (hard to reproduce tho).
Since I am very new to this repo, I don't know the convention for junit testing on private methods (or is it a nono to test on them?). I usually would just make them package private for other projects but I see that
ZkStateReaderTest
is not within the same package ofZkStateReader
hence I cannot do that. For now I changed those 2 methods topublic
, which feels wrong 😅Please kindly advise the common approach for testing private methods 🙇🏼
Checklist
Please review the following and check all that apply:
main
branch../gradlew check
.Remarks
The
./gradlew check
failed once on test caseTestTlogReplica
with warning ofRe-running
./gradlew check
, the test no longer fails. Going to spend a bit more time to see whether the failure is related to this change.so far no luck reproducing it with
gradlew :solr:core:test --tests "org.apache.solr.cloud.TestTlogReplica.testRebalanceLeaders" -Ptests.jvms=6 "-Ptests.jvmargs=-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=120m" -Ptests.seed=C004E5CD6B2D226F -Ptests.file.encoding=US-ASCII