Shutdown ClusterTopologyRefreshTask properly #2985

Open
wants to merge 7 commits into
base: main
from

thachlp
Contributor

@thachlp thachlp commented Sep 12, 2024

Issue: #2904
Make sure that:

  • You have read the contribution guidelines.
  • You have created a feature request first to discuss your contribution intent. Please reference the feature request ticket number in the pull request.
  • You applied code formatting rules using the mvn formatter:format target. Don’t submit any formatting related changes.
  • You submit test cases (unit or integration tests) that back your changes.

Comment on lines 723 to 724
Delay.delay(Duration.ofMillis(1500));
assertThat(clusterClient.isTopologyRefreshInProgress()).isTrue();
Contributor

Hi, I'm not sure about this.
Topology refresh in the test environment is quick, and there is no guarantee that we are in the right state for the test.
Most likely the refresh will either not have started yet or will already have completed when the assert is performed, making the test flaky. We are also adding a delay to the test suite as a whole.

I ran the suggested test and it failed 10/10 times on the assert.

Contributor Author

Do you have any idea how to reproduce the issue?

Contributor

Do you mean how to reproduce the failing test locally or the actual issue?
For the actual issue "java.util.concurrent.RejectedExecutionException", I tried to reproduce it but could not.
I will spend some more time on it next week and see if I can come up with an approach.

Contributor

@ggivo ggivo Sep 23, 2024

Hi @thachlp ,
I took a brief look and I think this issue can be tested more easily with a unit test. A similar one that tests the client shutdown order is already available. You can take a look at it here.

I suggest using a unit test to confirm that the pending ClusterTopologyRefreshTask is canceled/completed before shutting down the Executor group. We can inject a mock of the ClusterTopologyRefreshTask and complete it after client shutdown is initiated.

Hope it helps

@tishun tishun changed the title Shutdonw clustertopologyrefreshtask properly Shutdown ClusterTopologyRefreshTask properly Sep 13, 2024
Collaborator

@tishun tishun left a comment

Hey @thachlp,

Thanks for giving this fix a go. I think, however, you may be on the wrong path.

Judging from the stack trace in #2904 the ClusterTopologyRefreshScheduler attempts to refresh the topology AFTER the connections have been closed and the client is shutting down.

The suspendTopologyRefresh() is supposed to suspend any topology refresh tasks, but it seems there is some case (race condition perhaps?) where a task is still executed during shutdown.

@tishun tishun added the status: waiting-for-feedback We need additional information before we can continue label Oct 18, 2024
Contributor Author

thachlp commented Nov 4, 2024

Hey @thachlp,

Thanks for giving this fix a go. I think, however, you may be on the wrong path.

Judging from the stack trace in #2904 the ClusterTopologyRefreshScheduler attempts to refresh the topology AFTER the connections have been closed and the client is shutting down.

The suspendTopologyRefresh() is supposed to suspend any topology refresh tasks, but it seems there is some case (race condition perhaps?) where a task is still executed during shutdown.

From the Javadoc of suspendTopologyRefresh:

    /**
     * Suspend periodic topology refresh if it was activated previously. Suspending cancels the periodic schedule without
     * interrupting any running topology refresh. Suspension is in place until obtaining a new {@link #connect connection}.
     *
     * @since 6.3
     */
    public void suspendTopologyRefresh() {
        topologyRefreshScheduler.suspendTopologyRefresh();
    }
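
The "cancels the periodic schedule without interrupting any running topology refresh" wording can be illustrated with plain JDK scheduling. This is only a sketch under assumptions: it uses `ScheduledExecutorService` and `ScheduledFuture.cancel(false)` rather than Lettuce's own scheduler, and the class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Illustration only: plain JDK scheduling, not Lettuce's ClusterTopologyRefreshScheduler.
class SuspendSemanticsDemo {

    /** Returns how many periodic runs finished when the schedule was cancelled mid-run. */
    static int completedRunsAfterCancel() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch mayFinish = new CountDownLatch(1);
        AtomicInteger completedRuns = new AtomicInteger();

        ScheduledFuture<?> schedule = scheduler.scheduleAtFixedRate(() -> {
            started.countDown();
            try {
                mayFinish.await(); // simulate a refresh that is still in flight
                completedRuns.incrementAndGet();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 0, 10, TimeUnit.MILLISECONDS);

        try {
            started.await();          // a run is now in progress
            schedule.cancel(false);   // cancel future runs without interrupting this one
            mayFinish.countDown();    // the in-flight run is still free to finish
            scheduler.shutdown();
            scheduler.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return completedRuns.get();
    }

    public static void main(String[] args) {
        System.out.println("runs completed after cancel: " + completedRunsAfterCancel());
    }
}
```

The run that was already in progress still completes after the cancel, which is the window in which a refresh can outlive the suspend call.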

In my view, when we shut down RedisClusterClient, we should STOP running tasks and CANCEL scheduled tasks; that is why I wrote code to STOP running tasks.

Thanks @tishun for explaining. Do you have any suggestions for the fix?

Collaborator

tishun commented Nov 5, 2024

I will try to get back to you by the end of the week.

Collaborator

mp911de commented Nov 6, 2024

This PR introduces a check for a very specific scenario. The change doesn't necessarily lead to a proper cancellation, as the task itself is composed of a series of refresh steps that are coupled through CompletableFutures. Specifically, it calls RedisClusterClient.refreshPartitionsAsync(…), which has no notion of being interrupted.

I think conceptually the easiest approach is to synchronize (and wait) until ClusterTopologyRefreshTask has finished before shutting down ClientResources. ClusterTopologyRefreshTask would require a CompletableFuture<Void> that is completed upon completion of the Supplier<CompletionStage<?>>.

It would also require a bit of housekeeping, e.g.

if (isEventLoopActive()) {
    clientResources.eventExecutorGroup().submit(clusterTopologyRefreshTask);
    return true;
}

isn't atomic, and EventExecutorGroup.submit(…) could return a failed future, which requires consideration as well.
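
A rough sketch of that idea, using only `java.util.concurrent` types, might look like the following. `TrackedRefreshTask`, `submitTo` and `whenIdle` are hypothetical names, not Lettuce API. One difference to keep in mind: a plain `ExecutorService` signals rejection by throwing `RejectedExecutionException`, whereas the Netty executor group mentioned above may return a failed future instead, so the real handling would differ.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Hypothetical stand-in for ClusterTopologyRefreshTask; names and structure are assumptions.
class TrackedRefreshTask {

    private final Supplier<CompletionStage<?>> refresh;

    // Completed once the most recently accepted refresh has finished, successfully or not.
    private final AtomicReference<CompletableFuture<Void>> inFlight =
            new AtomicReference<>(CompletableFuture.completedFuture(null));

    TrackedRefreshTask(Supplier<CompletionStage<?>> refresh) {
        this.refresh = refresh;
    }

    /** Returns true only if the refresh was actually handed to the executor. */
    boolean submitTo(ExecutorService executor) {
        CompletableFuture<Void> previous = inFlight.get();
        CompletableFuture<Void> done = new CompletableFuture<>();
        // Atomic claim: only one caller may replace a finished future with a pending one.
        if (!previous.isDone() || !inFlight.compareAndSet(previous, done)) {
            return false;
        }
        try {
            executor.execute(() -> runRefresh(done));
            return true;
        } catch (RejectedExecutionException e) {
            done.complete(null); // executor is already shut down; nothing is in flight
            return false;
        }
    }

    private void runRefresh(CompletableFuture<Void> done) {
        try {
            refresh.get().whenComplete((result, error) -> done.complete(null));
        } catch (RuntimeException e) {
            done.complete(null); // the supplier itself failed; do not leave shutdown waiting
        }
    }

    /** Future that is complete whenever no refresh is running. */
    CompletableFuture<Void> whenIdle() {
        return inFlight.get();
    }
}

class TrackedRefreshDemo {
    public static void main(String[] args) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        CompletableFuture<String> pendingRefresh = new CompletableFuture<>();
        TrackedRefreshTask task = new TrackedRefreshTask(() -> pendingRefresh);

        task.submitTo(executor);
        pendingRefresh.complete("topology"); // the refresh finishes at some later point
        task.whenIdle().join();              // shutdown path: wait for the refresh first
        executor.shutdown();                 // only then release the executor
    }
}
```

A real shutdown path would presumably also have to stop new submissions before waiting on `whenIdle()`, otherwise a refresh could still slip in between the wait and the executor shutdown.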


Kvicii commented Dec 19, 2024

@tishun @mp911de @thachlp
Is there any follow-up? I have the same problem.
issue-3089

Collaborator

tishun commented Dec 23, 2024

As @mp911de mentioned we need to devise a better solution to this problem.
He has explained this in his comment here, I also elaborated more in #2904

@thachlp thachlp closed this Dec 31, 2024
@thachlp thachlp deleted the shutdonw-clustertopologyrefreshtask-properly branch December 31, 2024 04:14
@thachlp thachlp restored the shutdonw-clustertopologyrefreshtask-properly branch December 31, 2024 04:16
@thachlp thachlp reopened this Dec 31, 2024
@thachlp thachlp force-pushed the shutdonw-clustertopologyrefreshtask-properly branch from fbb3951 to d8507a5 Compare December 31, 2024 04:34
Contributor Author

thachlp commented Jan 2, 2025

As @mp911de mentioned we need to devise a better solution to this problem. He has explained this in his comment here, I also elaborated more in #2904

Thanks @mp911de @tishun for the reviews.
As I understand it, we should add a CompletableFuture to track the completion (or cancellation) of ClusterTopologyRefreshTask.
Please help review my implementation 🙇
By the way, is the failing test a random failure? When I ran it locally, it passed.

@thachlp thachlp requested review from tishun and ggivo January 2, 2025 10:24
Collaborator

tishun commented Jan 6, 2025

Hey @thachlp ,
apologies, but I think this would still not work. Let me elaborate.

The suspendTopologyRefresh is not the problem itself, because all it does is make sure that no new refresh is scheduled. However, any refresh that has already been initiated is going to continue anyway.

Then when the event loop is shut down it would print this message.

What we need to do is:

  • when we initiate a refresh we need to indicate this with a lock
  • at the point where the event loop is being shut down we need to block on the said lock
  • if there is no holder of the lock (no refresh currently) the event loop will close normally
  • if there is a holder of the lock (a refresh is currently happening) the shutdown would block and wait
  • when the refresh is complete we should release the lock (only after the job is complete)
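
The steps above could be sketched roughly as follows. This is an assumption-laden illustration rather than a proposed patch: `RefreshGuard` and its methods are hypothetical, a `Semaphore` stands in for "the lock" because the permit is released by whichever thread completes the asynchronous refresh, and the timeout on the shutdown side is an addition that is not part of the list.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper; not part of Lettuce.
class RefreshGuard {

    // A single permit plays the role of the lock described above.
    private final Semaphore permit = new Semaphore(1);

    /** Starts a refresh unless one is already running or shutdown holds the permit. */
    boolean tryRefresh(Supplier<CompletionStage<?>> refresh) {
        if (!permit.tryAcquire()) {
            return false;
        }
        try {
            // Release only after the asynchronous job has completed, successfully or not.
            refresh.get().whenComplete((result, error) -> permit.release());
            return true;
        } catch (RuntimeException e) {
            permit.release(); // the refresh never started
            throw e;
        }
    }

    /**
     * Shutdown side: waits until no refresh holds the permit, then keeps the permit
     * so that no later refresh can start. Returns false on timeout or interruption.
     */
    boolean awaitNoRefresh(long timeout, TimeUnit unit) {
        try {
            return permit.tryAcquire(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}

class RefreshGuardDemo {
    public static void main(String[] args) {
        RefreshGuard guard = new RefreshGuard();
        CompletableFuture<String> pendingRefresh = new CompletableFuture<>();

        guard.tryRefresh(() -> pendingRefresh);
        // Refresh still running: the shutdown side has to keep waiting.
        System.out.println(guard.awaitNoRefresh(50, TimeUnit.MILLISECONDS)); // false
        pendingRefresh.complete("topology");
        // Refresh finished: the event loop could now be shut down.
        System.out.println(guard.awaitNoRefresh(1, TimeUnit.SECONDS)); // true
    }
}
```

Because the shutdown side never returns the permit, any refresh attempted after that point is refused instead of being submitted to an executor that is going away.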

Labels
status: waiting-for-feedback We need additional information before we can continue
5 participants