Fix[MQB]: Select active node quickly when no cluster node in same DC #606

chrisbeard · 2025-02-11T21:02:10Z

Issue number of the reported bug or feature request: #

Describe your changes
When a broker connects to a cluster and attempts to select an active node, the broker attempts to select a node in the same data center. After 10 seconds, if no active node has been selected, the broker relaxes the data center requirement and picks any available node. If a cluster does not have any nodes in the broker's data center, this leads to a 10 second delay on the first open queue request to a cluster.

This commit changes the active node selection logic for the case where there is no cluster node within the same data center as a broker. In this case, the broker will ignore the data center may use any available node as the active node.

Testing performed
Describe the testing you have performed to ensure that the bug has been addressed, or that the new feature works as planned.

Additional context
Add any other context about your contribution here.

678098 · 2025-02-12T17:55:47Z

src/groups/mqb/mqbnet/mqbnet_clusteractivenodemanager.cpp

    for (mqbnet::Cluster::NodesList::const_iterator it = nodes.begin();
         it != nodes.end();
         ++it) {
        mqbnet::ClusterNode* node = *it;
        d_nodes[node].d_status    = bmqp_ctrlmsg::NodeStatus::E_UNAVAILABLE;
+        clusterNodeInLocalDC      = clusterNodeInLocalDC &&
+                               d_dataCenter == node->dataCenter();


I am not sure if the logic here is correct.
The clusterNodeInLocalDC flag is initially true, but it can only turn false in this loop. So if we have at least one node in another datacenter, the flag will turn false and we will do extended selection.

So it seems the proposed patch will only speed up the case when all the cluster nodes are in the same datacenter, right?

I would revert the flag here and do smth like

bool clusterNodeInLocalDC = false; for (...) { ... clusterNodeInLocalDC = clusterNodeInLocalDC || d_dataCenter == node->dataCenter(); // will turn `true` if at least one node is in the same datacenter }

It's not sufficient to re-use this boolean, if my understanding of the logic in findNewActiveNode is correct.

It looks like this bool is flipped to true via enableExtendedSelection() by ClusterProxy when it wants to say "give up waiting, try to pick something right now because I've been waiting 10 seconds and need an active node ASAP". When the bool is true, ClusterActiveNodeManager::findNewActiveNode() no longer returns early, and if there is no candidate, immediately causes a cluster panic alarm. So this boolean doesn't just control candidate selection behavior, it actually causes panic logs. I might need to split this into a separate boolean in an effort to not refactor that logic too much.

Signed-off-by: Christopher Beard <cbeard9@bloomberg.net>

When a broker connects to a cluster and attempts to select an active node, the broker attempts to select a node in the same data center. After 10 seconds, if no active node has been selected, the broker relaxes the data center requirement and picks any available node. If a cluster does not have any nodes in the broker's data center, this leads to a 10 second delay on the first open queue request to a cluster. This commit changes the active node selection logic for the case where there is no cluster node within the same data center as a broker. In this case, the broker will ignore the data center may use any available node as the active node. Signed-off-by: Christopher Beard <cbeard9@bloomberg.net>

chrisbeard · 2025-02-12T23:08:38Z

src/groups/mqb/mqbnet/mqbnet_clusteractivenodemanager.h

+    bool d_ignoreDataCenter;
+    // If true, remove the data center
+    // requirement when selecting active
+    // node. Set to true when the cluster
+    // does not have any nodes in the
+    // current machine's data center.
+


While this does work, it's not elegant. The bool below claims to do what this bool is doing, but the bool below actually affects much more (as noted earlier in the discussion). I might try to do some more refactoring here to clean this up a bit more.

chrisbeard force-pushed the active-node-no-dc-match branch from e13b6bb to 456b4e2 Compare February 11, 2025 21:03

dorjesinpo previously approved these changes Feb 12, 2025

View reviewed changes

678098 reviewed Feb 12, 2025

View reviewed changes

Feat[MQB]: Add unit test for ClusterActiveNodeManager

ac07c15

Signed-off-by: Christopher Beard <cbeard9@bloomberg.net>

chrisbeard dismissed dorjesinpo’s stale review via 2316baf February 12, 2025 22:22

chrisbeard force-pushed the active-node-no-dc-match branch 2 times, most recently from 2316baf to 540274b Compare February 12, 2025 22:23

chrisbeard force-pushed the active-node-no-dc-match branch from 540274b to f27fda8 Compare February 12, 2025 22:25

chrisbeard commented Feb 12, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix[MQB]: Select active node quickly when no cluster node in same DC #606

Fix[MQB]: Select active node quickly when no cluster node in same DC #606

chrisbeard commented Feb 11, 2025

678098 Feb 12, 2025 •

edited

Loading

chrisbeard Feb 12, 2025

chrisbeard Feb 12, 2025

Fix[MQB]: Select active node quickly when no cluster node in same DC #606

Are you sure you want to change the base?

Fix[MQB]: Select active node quickly when no cluster node in same DC #606

Conversation

chrisbeard commented Feb 11, 2025

678098 Feb 12, 2025 • edited Loading

Choose a reason for hiding this comment

chrisbeard Feb 12, 2025

Choose a reason for hiding this comment

chrisbeard Feb 12, 2025

Choose a reason for hiding this comment

678098 Feb 12, 2025 •

edited

Loading