[fix][broker] Fix namespace unload might be blocked too long with extensible load manager #23433

BewareMyPower · 2024-10-10T10:59:24Z

Motivation

#23382 brings a regression to the broker stop phase, which could be blocked by waiting:

healthCheckBrokerAsync in ServiceUnitStateChannelImpl#doCleanup, which could be blocked for metadataStoreOperationTimeoutSeconds seconds (default: 30s)
unloadNamespaceBundle in BrokerService#unloadNamespaceBundlesGracefully, which could be blocked for namespaceBundleUnloadingTimeoutMs milliseconds (default: 1 min) for each problematic bundle

~~The 2nd method could be blocked when healthCheckBrokerAsync failed with NotFoundException but the following steps were not skipped.~~

Modifications

Check if the service unit state channel is disabled or closed via channelDisabled before healthCheckBrokerAsync
Optimize some logs to avoid printing the whole stacks
Improve the test to run it for 10 times in case of any flakiness introduced again. The metadataStoreOperationTimeoutSeconds and namespaceBundleUnloadingTimeoutMs configs are reduced to avoid waiting too long if the test will fail.

Documentation

doc
doc-required
doc-not-needed
doc-complete

Matching PR in forked repository

PR in forked repository:

…ensible load manager

BewareMyPower · 2024-10-10T11:03:03Z

I've investigating #23413 as well and might push another PR for it if this PR was reviewed quickly. Otherwise, I might include the fix in this PR.

...est/java/org/apache/pulsar/broker/loadbalance/extensions/ExtensibleLoadManagerCloseTest.java

lhotari

LGTM, good work @BewareMyPower. I asked one question.

BewareMyPower · 2024-10-10T12:14:57Z

ServiceUnitStateChannelTest.testOverrideOrphanStateData failed, I will take a look.

…is not found

BewareMyPower · 2024-10-10T13:34:49Z

I fixed the test. And I also found the process on NotFoundException is unnecessary.

The testCloseAfterLoadingBundles could still succeed even if we don't skip the orphan cleanup. So I kept the original behavior that when doCleanup(broker, false) is called and the broker is not healthy, continue the orphan cleanup.

However, from the failure of testOverrideOrphanStateData, I found ServiceUnitStateChannelTest is really confusing. In this test, disableChannels is called before all overrideTableViews calls. However, the overrideTableViews method is very complicated that it accesses the private fields via reflection and some operations will be performed on these fields. I guess this method is to simulate the ServiceUnitStateChannel#overrideOwnership method. But if there is a regression introduced to overrideOwnership, the existing test won't be able to detect the regression.

Besides, the disable() method, which set the channel state to Disabled, was never called other than tests before #23349. After #23349, this method is only called in cleanOwnerships() and doCleanup(broker, true) will always be called after that. Should we still allow doCleanup(broker, false) scheduled by handleBrokerDeletionEvent be called after disable()?

Anyway, these details are complicated and it could be a new topic to discuss.

codecov-commenter · 2024-10-10T14:58:52Z

Codecov Report

Attention: Patch coverage is 72.72727% with 6 lines in your changes missing coverage. Please review.

Project coverage is 74.27%. Comparing base (bbc6224) to head (35fd689).
Report is 662 commits behind head on master.

Files with missing lines	Patch %	Lines
...xtensions/channel/ServiceUnitStateChannelImpl.java	68.42%	4 Missing and 2 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #23433      +/-   ##
============================================
+ Coverage     73.57%   74.27%   +0.70%     
- Complexity    32624    34963    +2339     
============================================
  Files          1877     1952      +75     
  Lines        139502   147150    +7648     
  Branches      15299    16201     +902     
============================================
+ Hits         102638   109296    +6658     
- Misses        28908    29424     +516     
- Partials       7956     8430     +474

Flag	Coverage Δ
inttests	`27.41% <22.72%> (+2.82%)`	⬆️
systests	`24.35% <0.00%> (+0.03%)`	⬆️
unittests	`73.64% <72.72%> (+0.80%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines	Coverage Δ
...g/apache/pulsar/broker/admin/impl/BrokersBase.java	`74.28% <100.00%> (+1.82%)`	⬆️
...xtensions/channel/ServiceUnitStateChannelImpl.java	`86.98% <68.42%> (+1.67%)`	⬆️

... and 621 files with indirect coverage changes

heesung-sn · 2024-10-11T01:41:10Z

...ava/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java

+    private boolean channelDisabled() {
+        final var channelState = this.channelState;
+        if (channelState == Disabled || channelState == Closed) {
+            log.warn("[{}] Skip scheduleCleanup because the state is {} now", brokerId, channelState);


We may need to make this log generic or remove it as it is used in multiple places.

It's right, I'll change it in the new PR

heesung-sn · 2024-10-11T01:59:38Z

When a broker shutdown, the expectation is that the connected admin clients should fail all pending requests as their connections are closed.

It is a bit surprising the admin client was hung in this case, and the cleanup process waited too long until timed out.

I wonder why the admin client didn't fail the health checks sooner.

heesung-sn · 2024-10-11T02:13:24Z

...ava/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java

@@ -1462,6 +1478,10 @@ private CompletableFuture<Void> healthCheckBrokerAsync(String brokerId) {
    }

    private void doHealthCheckBrokerAsyncWithRetries(String brokerId, int retry, CompletableFuture<Void> future) {
+        if (channelDisabled()) {


This means that this leader broker is shutting down and trying to handle the broker deletion notification. It is counterintuitive that this shutting down leader broker thinks the target broker is healthy here.

Yes, I think we should check channelDisabled() in doCleanup before calling healthCheckBrokerAsync in doCleanup(..., false).

The reason to make the change here is that the future of healthCheckBrokerAsync could be blocked until timeout during the close phase.

# this log was added by me 2024-10-11T10:59:40,184 - INFO - [pulsar-385-8:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:61361 2 2024-10-11T10:59:45,130 - INFO - [pulsar-load-manager-384-1:ServiceUnitStateChannelImpl] - Failed to check broker:localhost:61361 health java.util.concurrent.TimeoutException: null at java.base/java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1960) ~[?:?] at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:2095) ~[?:?] at org.apache.pulsar.broker.loadbalance.extensions.channel.ServiceUnitStateChannelImpl.doCleanup(ServiceUnitStateChannelImpl.java:1525) ~[classes/:?]

I added more logs and checked it again. The root cause is that this method is still scheduled when the web service is closed.

pulsar.getExecutor() .schedule(() -> doHealthCheckBrokerAsyncWithRetries(brokerId, retry + 1, future), Math.min(MAX_BROKER_HEALTH_CHECK_DELAY_IN_MILLIS, retry * retry * 50), MILLISECONDS);

Logs:

2024-10-11T11:28:58,966 - INFO - [pulsar-load-manager-1209-1:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58117 0 2024-10-11T11:28:58,970 - WARN - [AsyncHttpClient-1250-1:ServiceUnitStateChannelImpl] - XYZ localhost:58117 Failed to do health check 2024-10-11T11:28:58,971 - INFO - [pulsar-1210-9:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58117 1 2024-10-11T11:28:58,972 - WARN - [AsyncHttpClient-1250-1:ServiceUnitStateChannelImpl] - XYZ localhost:58117 Failed to do health check 2024-10-11T11:28:59,015 - INFO - [main:PulsarService] - XYZ start closing web service 2024-10-11T11:28:59,019 - INFO - [main:PulsarService] - XYZ finished closing web service 2024-10-11T11:28:59,022 - INFO - [pulsar-1210-9:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58117 2 2024-10-11T11:28:59,040 - WARN - [AsyncHttpClient-1334-2:ServiceUnitStateChannelImpl] - XYZ localhost:58117 Failed to do health check 2024-10-11T11:28:59,150 - INFO - [pulsar-load-manager-1251-1:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58158 0 2024-10-11T11:28:59,157 - WARN - [AsyncHttpClient-1292-1:ServiceUnitStateChannelImpl] - XYZ localhost:58158 Failed to do health check 2024-10-11T11:28:59,157 - INFO - [pulsar-1252-9:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58158 1 2024-10-11T11:28:59,159 - WARN - [AsyncHttpClient-1292-1:ServiceUnitStateChannelImpl] - XYZ localhost:58158 Failed to do health check 2024-10-11T11:28:59,213 - INFO - [pulsar-1252-4:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58158 2 2024-10-11T11:28:59,215 - WARN - [AsyncHttpClient-1292-1:ServiceUnitStateChannelImpl] - XYZ localhost:58158 Failed to do health check 2024-10-11T11:28:59,420 - INFO - [pulsar-1252-1:ServiceUnitStateChannelImpl] - XYZ doHealthCheckBrokerAsyncWithRetries localhost:58158 3

So the doCleanup should be skipped.

Actually, the key question that since doCleanup(..., true) is called during the close. Should we skip doCleanup(..., false) after that?

I'd like to do the following change to avoid testOverrideOrphanStateData fail. i.e. doCleanup(..., false) won't be skipped. Only the healthCheckBrokerAsync call will be skipped

diff --git a/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java b/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java index ea1bf01be5..30d19415e7 100644 --- a/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java +++ b/pulsar-broker/src/main/java/org/apache/pulsar/broker/loadbalance/extensions/channel/ServiceUnitStateChannelImpl.java @@ -1334,17 +1334,14 @@ public class ServiceUnitStateChannelImpl implements ServiceUnitStateChannel { private boolean channelDisabled() { final var channelState = this.channelState; - if (channelState == Disabled || channelState == Closed) { - log.warn("[{}] Skip scheduleCleanup because the state is {} now", brokerId, channelState); - return true; - } - return false; + return channelState == Disabled || channelState == Closed; } private void scheduleCleanup(String broker, long delayInSecs) { var scheduled = new MutableObject<CompletableFuture<Void>>(); try { if (channelDisabled()) { + log.warn("[{}] Skip scheduleCleanup because the state is {} now", brokerId, channelState); return; } cleanupJobs.computeIfAbsent(broker, k -> { @@ -1478,10 +1475,6 @@ public class ServiceUnitStateChannelImpl implements ServiceUnitStateChannel { } private void doHealthCheckBrokerAsyncWithRetries(String brokerId, int retry, CompletableFuture<Void> future) { - if (channelDisabled()) { - future.complete(null); - return; - } try { var admin = getPulsarAdmin(); admin.brokers().healthcheckAsync(TopicVersion.V2, Optional.of(brokerId)) @@ -1519,7 +1512,9 @@ public class ServiceUnitStateChannelImpl implements ServiceUnitStateChannel { } // if not gracefully, verify the broker is inactive by health-check. - if (!gracefully) { + // When the channel is disabled, the broker is during the close phase where the web service might not be + // available so that calling healthCheckBrokerAsync() could fail with timeout. + if (!gracefully && !channelDisabled()) { try { healthCheckBrokerAsync(broker).get( pulsar.getConfiguration().getMetadataStoreOperationTimeoutSeconds(), SECONDS);

…ensible load manager (apache#23433) (cherry picked from commit 5506f50)

…ensible load manager (#23433) (#23517) Co-authored-by: Yunze Xu <xyzinfernity@163.com>

[fix][broker] Fix namespace unload might be blocked too long with ext…

2e5dab7

…ensible load manager

BewareMyPower self-assigned this Oct 10, 2024

BewareMyPower requested review from lhotari, codelipenghui, Demogorgon314 and heesung-sn October 10, 2024 10:59

BewareMyPower added type/bug The PR fixed a bug or issue reported a bug release/3.3.3 labels Oct 10, 2024

github-actions bot added the doc-not-needed Your PR changes do not impact docs label Oct 10, 2024

BewareMyPower added this to the 4.0.0 milestone Oct 10, 2024

lhotari added the ready-to-test label Oct 10, 2024

lhotari reviewed Oct 10, 2024

View reviewed changes

...est/java/org/apache/pulsar/broker/loadbalance/extensions/ExtensibleLoadManagerCloseTest.java Show resolved Hide resolved

lhotari approved these changes Oct 10, 2024

View reviewed changes

BewareMyPower marked this pull request as draft October 10, 2024 12:15

Don't skip the orphan cleanup when channel is disabled or the broker …

35fd689

…is not found

BewareMyPower marked this pull request as ready for review October 10, 2024 13:20

lhotari merged commit 5506f50 into apache:master Oct 10, 2024
54 checks passed

BewareMyPower deleted the bewaremypower/fix-extensible-lm-tests branch October 11, 2024 01:25

heesung-sn reviewed Oct 11, 2024

View reviewed changes

heesung-sn pushed a commit to heesung-sn/pulsar that referenced this pull request Oct 25, 2024

[fix][broker] Fix namespace unload might be blocked too long with ext…

8e77a77

…ensible load manager (apache#23433) (cherry picked from commit 5506f50)

heesung-sn added cherry-picked/branch-3.3 and removed cherry-picked/branch-3.3 labels Oct 25, 2024

heesung-sn added a commit that referenced this pull request Oct 28, 2024

[fix][broker] Fix namespace unload might be blocked too long with ext…

0411d8d

…ensible load manager (#23433) (#23517) Co-authored-by: Yunze Xu <xyzinfernity@163.com>

heesung-sn added the cherry-picked/branch-3.3 label Oct 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[fix][broker] Fix namespace unload might be blocked too long with extensible load manager #23433

[fix][broker] Fix namespace unload might be blocked too long with extensible load manager #23433

BewareMyPower commented Oct 10, 2024 •

edited

Loading

BewareMyPower commented Oct 10, 2024 •

edited

Loading

lhotari left a comment

BewareMyPower commented Oct 10, 2024

BewareMyPower commented Oct 10, 2024

codecov-commenter commented Oct 10, 2024 •

edited

Loading

heesung-sn Oct 11, 2024

BewareMyPower Oct 11, 2024

heesung-sn commented Oct 11, 2024

heesung-sn Oct 11, 2024

BewareMyPower Oct 11, 2024

BewareMyPower Oct 11, 2024

BewareMyPower Oct 11, 2024

heesung-sn Oct 14, 2024

[fix][broker] Fix namespace unload might be blocked too long with extensible load manager #23433

[fix][broker] Fix namespace unload might be blocked too long with extensible load manager #23433

Conversation

BewareMyPower commented Oct 10, 2024 • edited Loading

Motivation

Modifications

Documentation

Matching PR in forked repository

BewareMyPower commented Oct 10, 2024 • edited Loading

lhotari left a comment

Choose a reason for hiding this comment

BewareMyPower commented Oct 10, 2024

BewareMyPower commented Oct 10, 2024

codecov-commenter commented Oct 10, 2024 • edited Loading

Codecov Report

heesung-sn Oct 11, 2024

Choose a reason for hiding this comment

BewareMyPower Oct 11, 2024

Choose a reason for hiding this comment

heesung-sn commented Oct 11, 2024

heesung-sn Oct 11, 2024

Choose a reason for hiding this comment

BewareMyPower Oct 11, 2024

Choose a reason for hiding this comment

BewareMyPower Oct 11, 2024

Choose a reason for hiding this comment

BewareMyPower Oct 11, 2024

Choose a reason for hiding this comment

heesung-sn Oct 14, 2024

Choose a reason for hiding this comment

BewareMyPower commented Oct 10, 2024 •

edited

Loading

BewareMyPower commented Oct 10, 2024 •

edited

Loading

codecov-commenter commented Oct 10, 2024 •

edited

Loading