Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix[MQB]: Select active node quickly when no cluster node in same DC #606

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

chrisbeard
Copy link
Contributor

Issue number of the reported bug or feature request: #

Describe your changes
When a broker connects to a cluster and attempts to select an active node, the broker attempts to select a node in the same data center. After 10 seconds, if no active node has been selected, the broker relaxes the data center requirement and picks any available node. If a cluster does not have any nodes in the broker's data center, this leads to a 10 second delay on the first open queue request to a cluster.

This commit changes the active node selection logic for the case where there is no cluster node within the same data center as a broker. In this case, the broker will ignore the data center may use any available node as the active node.

Testing performed
Describe the testing you have performed to ensure that the bug has been addressed, or that the new feature works as planned.

Additional context
Add any other context about your contribution here.

@chrisbeard chrisbeard force-pushed the active-node-no-dc-match branch from e13b6bb to 456b4e2 Compare February 11, 2025 21:03
dorjesinpo
dorjesinpo previously approved these changes Feb 12, 2025
for (mqbnet::Cluster::NodesList::const_iterator it = nodes.begin();
it != nodes.end();
++it) {
mqbnet::ClusterNode* node = *it;
d_nodes[node].d_status = bmqp_ctrlmsg::NodeStatus::E_UNAVAILABLE;
clusterNodeInLocalDC = clusterNodeInLocalDC &&
d_dataCenter == node->dataCenter();
Copy link
Collaborator

@678098 678098 Feb 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if the logic here is correct.
The clusterNodeInLocalDC flag is initially true, but it can only turn false in this loop. So if we have at least one node in another datacenter, the flag will turn false and we will do extended selection.

So it seems the proposed patch will only speed up the case when all the cluster nodes are in the same datacenter, right?

I would revert the flag here and do smth like

bool clusterNodeInLocalDC = false;
for (...) {
   ...
   clusterNodeInLocalDC = clusterNodeInLocalDC ||
                               d_dataCenter == node->dataCenter();
   // will turn `true` if at least one node is in the same datacenter
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not sufficient to re-use this boolean, if my understanding of the logic in findNewActiveNode is correct.

It looks like this bool is flipped to true via enableExtendedSelection() by ClusterProxy when it wants to say "give up waiting, try to pick something right now because I've been waiting 10 seconds and need an active node ASAP". When the bool is true, ClusterActiveNodeManager::findNewActiveNode() no longer returns early, and if there is no candidate, immediately causes a cluster panic alarm. So this boolean doesn't just control candidate selection behavior, it actually causes panic logs. I might need to split this into a separate boolean in an effort to not refactor that logic too much.

Signed-off-by: Christopher Beard <cbeard9@bloomberg.net>
@chrisbeard chrisbeard force-pushed the active-node-no-dc-match branch 2 times, most recently from 2316baf to 540274b Compare February 12, 2025 22:23
When a broker connects to a cluster and attempts to select an active
node, the broker attempts to select a node in the same data center.
After 10 seconds, if no active node has been selected, the broker
relaxes the data center requirement and picks any available node. If a
cluster does not have any nodes in the broker's data center, this leads
to a 10 second delay on the first open queue request to a cluster.

This commit changes the active node selection logic for the case where
there is no cluster node within the same data center as a broker. In
this case, the broker will ignore the data center may use any available
node as the active node.

Signed-off-by: Christopher Beard <cbeard9@bloomberg.net>
@chrisbeard chrisbeard force-pushed the active-node-no-dc-match branch from 540274b to f27fda8 Compare February 12, 2025 22:25
Comment on lines +261 to +267
bool d_ignoreDataCenter;
// If true, remove the data center
// requirement when selecting active
// node. Set to true when the cluster
// does not have any nodes in the
// current machine's data center.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this does work, it's not elegant. The bool below claims to do what this bool is doing, but the bool below actually affects much more (as noted earlier in the discussion). I might try to do some more refactoring here to clean this up a bit more.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants