Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix namespace unload might be blocked too long with extensible load manager #23433

Merged

Conversation

BewareMyPower
Copy link
Contributor

@BewareMyPower BewareMyPower commented Oct 10, 2024

Fixes #23412

Motivation

#23382 brings a regression to the broker stop phase, which could be blocked by waiting:

  • healthCheckBrokerAsync in ServiceUnitStateChannelImpl#doCleanup, which could be blocked for metadataStoreOperationTimeoutSeconds seconds (default: 30s)
  • unloadNamespaceBundle in BrokerService#unloadNamespaceBundlesGracefully, which could be blocked for namespaceBundleUnloadingTimeoutMs milliseconds (default: 1 min) for each problematic bundle

The 2nd method could be blocked when healthCheckBrokerAsync failed with NotFoundException but the following steps were not skipped.

Modifications

  • Check if the service unit state channel is disabled or closed via channelDisabled before healthCheckBrokerAsync
  • Optimize some logs to avoid printing the whole stacks
  • Improve the test to run it for 10 times in case of any flakiness introduced again. The metadataStoreOperationTimeoutSeconds and namespaceBundleUnloadingTimeoutMs configs are reduced to avoid waiting too long if the test will fail.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@BewareMyPower BewareMyPower self-assigned this Oct 10, 2024
@BewareMyPower BewareMyPower added type/bug The PR fixed a bug or issue reported a bug release/3.3.3 labels Oct 10, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Oct 10, 2024
@BewareMyPower BewareMyPower added this to the 4.0.0 milestone Oct 10, 2024
@BewareMyPower
Copy link
Contributor Author

BewareMyPower commented Oct 10, 2024

I've investigating #23413 as well and might push another PR for it if this PR was reviewed quickly. Otherwise, I might include the fix in this PR.

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, good work @BewareMyPower. I asked one question.

@BewareMyPower
Copy link
Contributor Author

ServiceUnitStateChannelTest.testOverrideOrphanStateData failed, I will take a look.

@BewareMyPower BewareMyPower marked this pull request as draft October 10, 2024 12:15
@BewareMyPower BewareMyPower marked this pull request as ready for review October 10, 2024 13:20
@BewareMyPower
Copy link
Contributor Author

I fixed the test. And I also found the process on NotFoundException is unnecessary.

image

The testCloseAfterLoadingBundles could still succeed even if we don't skip the orphan cleanup. So I kept the original behavior that when doCleanup(broker, false) is called and the broker is not healthy, continue the orphan cleanup.


However, from the failure of testOverrideOrphanStateData, I found ServiceUnitStateChannelTest is really confusing. In this test, disableChannels is called before all overrideTableViews calls. However, the overrideTableViews method is very complicated that it accesses the private fields via reflection and some operations will be performed on these fields. I guess this method is to simulate the ServiceUnitStateChannel#overrideOwnership method. But if there is a regression introduced to overrideOwnership, the existing test won't be able to detect the regression.

Besides, the disable() method, which set the channel state to Disabled, was never called other than tests before #23349. After #23349, this method is only called in cleanOwnerships() and doCleanup(broker, true) will always be called after that. Should we still allow doCleanup(broker, false) scheduled by handleBrokerDeletionEvent be called after disable()?

Anyway, these details are complicated and it could be a new topic to discuss.

@codecov-commenter
Copy link

codecov-commenter commented Oct 10, 2024

Codecov Report

Attention: Patch coverage is 72.72727% with 6 lines in your changes missing coverage. Please review.

Project coverage is 74.27%. Comparing base (bbc6224) to head (35fd689).
Report is 662 commits behind head on master.

Files with missing lines Patch % Lines
...xtensions/channel/ServiceUnitStateChannelImpl.java 68.42% 4 Missing and 2 partials ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #23433      +/-   ##
============================================
+ Coverage     73.57%   74.27%   +0.70%     
- Complexity    32624    34963    +2339     
============================================
  Files          1877     1952      +75     
  Lines        139502   147150    +7648     
  Branches      15299    16201     +902     
============================================
+ Hits         102638   109296    +6658     
- Misses        28908    29424     +516     
- Partials       7956     8430     +474     
Flag Coverage Δ
inttests 27.41% <22.72%> (+2.82%) ⬆️
systests 24.35% <0.00%> (+0.03%) ⬆️
unittests 73.64% <72.72%> (+0.80%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...g/apache/pulsar/broker/admin/impl/BrokersBase.java 74.28% <100.00%> (+1.82%) ⬆️
...xtensions/channel/ServiceUnitStateChannelImpl.java 86.98% <68.42%> (+1.67%) ⬆️

... and 621 files with indirect coverage changes

@lhotari lhotari merged commit 5506f50 into apache:master Oct 10, 2024
54 checks passed
@BewareMyPower BewareMyPower deleted the bewaremypower/fix-extensible-lm-tests branch October 11, 2024 01:25
private boolean channelDisabled() {
final var channelState = this.channelState;
if (channelState == Disabled || channelState == Closed) {
log.warn("[{}] Skip scheduleCleanup because the state is {} now", brokerId, channelState);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need to make this log generic or remove it as it is used in multiple places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's right, I'll change it in the new PR

@heesung-sn
Copy link
Contributor

When a broker shutdown, the expectation is that the connected admin clients should fail all pending requests as their connections are closed.

It is a bit surprising the admin client was hung in this case, and the cleanup process waited too long until timed out.

I wonder why the admin client didn't fail the health checks sooner.

@@ -1462,6 +1478,10 @@ private CompletableFuture<Void> healthCheckBrokerAsync(String brokerId) {
}

private void doHealthCheckBrokerAsyncWithRetries(String brokerId, int retry, CompletableFuture<Void> future) {
if (channelDisabled()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means that this leader broker is shutting down and trying to handle the broker deletion notification. It is counterintuitive that this shutting down leader broker thinks the target broker is healthy here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we should check channelDisabled() in doCleanup before calling healthCheckBrokerAsync in doCleanup(..., false).

The reason to make the change here is that the future of healthCheckBrokerAsync could be blocked until timeout during the close phase.

# this log was added by me
2024-10-11T10:59:40,184 - INFO  - [pulsar-385-8:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:61361 2
2024-10-11T10:59:45,130 - INFO  - [pulsar-load-manager-384-1:ServiceUnitStateChannelImpl] - Failed to check broker:localhost:61361 health
java.util.concurrent.TimeoutException: null
	at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960) ~[?:?]
	at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095) ~[?:?]
	at org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl.doCleanup(ServiceUnitStateChannelImpl.java:1525) ~[classes/:?]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more logs and checked it again. The root cause is that this method is still scheduled when the web service is closed.

                            pulsar.getExecutor()
                                    .schedule(() -> doHealthCheckBrokerAsyncWithRetries(brokerId, retry + 1, future),
                                            Math.min(MAX_BROKER_HEALTH_CHECK_DELAY_IN_MILLIS, retry * retry * 50),
                                            MILLISECONDS);

Logs:

2024-10-11T11:28:58,966 - INFO  - [pulsar-load-manager-1209-1:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58117 0
2024-10-11T11:28:58,970 - WARN  - [AsyncHttpClient-1250-1:ServiceUnitStateChannelImpl] - XYZ localhost:58117 Failed to do health check
2024-10-11T11:28:58,971 - INFO  - [pulsar-1210-9:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58117 1
2024-10-11T11:28:58,972 - WARN  - [AsyncHttpClient-1250-1:ServiceUnitStateChannelImpl] - XYZ localhost:58117 Failed to do health check
2024-10-11T11:28:59,015 - INFO  - [main:PulsarService] - XYZ start closing web service
2024-10-11T11:28:59,019 - INFO  - [main:PulsarService] - XYZ finished closing web service
2024-10-11T11:28:59,022 - INFO  - [pulsar-1210-9:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58117 2
2024-10-11T11:28:59,040 - WARN  - [AsyncHttpClient-1334-2:ServiceUnitStateChannelImpl] - XYZ localhost:58117 Failed to do health check
2024-10-11T11:28:59,150 - INFO  - [pulsar-load-manager-1251-1:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58158 0
2024-10-11T11:28:59,157 - WARN  - [AsyncHttpClient-1292-1:ServiceUnitStateChannelImpl] - XYZ localhost:58158 Failed to do health check
2024-10-11T11:28:59,157 - INFO  - [pulsar-1252-9:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58158 1
2024-10-11T11:28:59,159 - WARN  - [AsyncHttpClient-1292-1:ServiceUnitStateChannelImpl] - XYZ localhost:58158 Failed to do health check
2024-10-11T11:28:59,213 - INFO  - [pulsar-1252-4:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58158 2
2024-10-11T11:28:59,215 - WARN  - [AsyncHttpClient-1292-1:ServiceUnitStateChannelImpl] - XYZ localhost:58158 Failed to do health check
2024-10-11T11:28:59,420 - INFO  - [pulsar-1252-1:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58158 3

So the doCleanup should be skipped.

Actually, the key question that since doCleanup(..., true) is called during the close. Should we skip doCleanup(..., false) after that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to do the following change to avoid testOverrideOrphanStateData fail. i.e. doCleanup(..., false) won't be skipped. Only the healthCheckBrokerAsync call will be skipped

diff --git a/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java b/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java
index ea1bf01be5..30d19415e7 100644
--- a/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java
+++ b/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java
@@ -1334,17 +1334,14 @@ public class ServiceUnitStateChannelImpl implements ServiceUnitStateChannel {
 
     private boolean channelDisabled() {
         final var channelState = this.channelState;
-        if (channelState == Disabled || channelState == Closed) {
-            log.warn("[{}] Skip scheduleCleanup because the state is {} now", brokerId, channelState);
-            return true;
-        }
-        return false;
+        return channelState == Disabled || channelState == Closed;
     }
 
     private void scheduleCleanup(String broker, long delayInSecs) {
         var scheduled = new MutableObject<CompletableFuture<Void>>();
         try {
             if (channelDisabled()) {
+                log.warn("[{}] Skip scheduleCleanup because the state is {} now", brokerId, channelState);
                 return;
             }
             cleanupJobs.computeIfAbsent(broker, k -> {
@@ -1478,10 +1475,6 @@ public class ServiceUnitStateChannelImpl implements ServiceUnitStateChannel {
     }
 
     private void doHealthCheckBrokerAsyncWithRetries(String brokerId, int retry, CompletableFuture<Void> future) {
-        if (channelDisabled()) {
-            future.complete(null);
-            return;
-        }
         try {
             var admin = getPulsarAdmin();
             admin.brokers().healthcheckAsync(TopicVersion.V2, Optional.of(brokerId))
@@ -1519,7 +1512,9 @@ public class ServiceUnitStateChannelImpl implements ServiceUnitStateChannel {
         }
 
         // if not gracefully, verify the broker is inactive by health-check.
-        if (!gracefully) {
+        // When the channel is disabled, the broker is during the close phase where the web service might not be
+        // available so that calling healthCheckBrokerAsync() could fail with timeout.
+        if (!gracefully && !channelDisabled()) {
             try {
                 healthCheckBrokerAsync(broker).get(
                         pulsar.getConfiguration().getMetadataStoreOperationTimeoutSeconds(), SECONDS);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm

heesung-sn pushed a commit to heesung-sn/pulsar that referenced this pull request Oct 25, 2024
heesung-sn added a commit that referenced this pull request Oct 28, 2024
…ensible load manager (#23433) (#23517)

Co-authored-by: Yunze Xu <xyzinfernity@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-3.3 doc-not-needed Your PR changes do not impact docs ready-to-test release/3.3.3 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky-test: ExtensibleLoadManagerCloseTest.testCloseAfterLoadingBundles
4 participants