-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix[MQB]: Select active node quickly when no cluster node in same DC #606
base: main
Are you sure you want to change the base?
Conversation
e13b6bb
to
456b4e2
Compare
for (mqbnet::Cluster::NodesList::const_iterator it = nodes.begin(); | ||
it != nodes.end(); | ||
++it) { | ||
mqbnet::ClusterNode* node = *it; | ||
d_nodes[node].d_status = bmqp_ctrlmsg::NodeStatus::E_UNAVAILABLE; | ||
clusterNodeInLocalDC = clusterNodeInLocalDC && | ||
d_dataCenter == node->dataCenter(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if the logic here is correct.
The clusterNodeInLocalDC
flag is initially true
, but it can only turn false
in this loop. So if we have at least one node in another datacenter, the flag will turn false
and we will do extended selection.
So it seems the proposed patch will only speed up the case when all the cluster nodes are in the same datacenter, right?
I would revert the flag here and do smth like
bool clusterNodeInLocalDC = false;
for (...) {
...
clusterNodeInLocalDC = clusterNodeInLocalDC ||
d_dataCenter == node->dataCenter();
// will turn `true` if at least one node is in the same datacenter
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not sufficient to re-use this boolean, if my understanding of the logic in findNewActiveNode
is correct.
It looks like this bool is flipped to true via enableExtendedSelection()
by ClusterProxy when it wants to say "give up waiting, try to pick something right now because I've been waiting 10 seconds and need an active node ASAP". When the bool is true, ClusterActiveNodeManager::findNewActiveNode()
no longer returns early, and if there is no candidate, immediately causes a cluster panic alarm. So this boolean doesn't just control candidate selection behavior, it actually causes panic logs. I might need to split this into a separate boolean in an effort to not refactor that logic too much.
Signed-off-by: Christopher Beard <cbeard9@bloomberg.net>
2316baf
to
540274b
Compare
When a broker connects to a cluster and attempts to select an active node, the broker attempts to select a node in the same data center. After 10 seconds, if no active node has been selected, the broker relaxes the data center requirement and picks any available node. If a cluster does not have any nodes in the broker's data center, this leads to a 10 second delay on the first open queue request to a cluster. This commit changes the active node selection logic for the case where there is no cluster node within the same data center as a broker. In this case, the broker will ignore the data center may use any available node as the active node. Signed-off-by: Christopher Beard <cbeard9@bloomberg.net>
540274b
to
f27fda8
Compare
bool d_ignoreDataCenter; | ||
// If true, remove the data center | ||
// requirement when selecting active | ||
// node. Set to true when the cluster | ||
// does not have any nodes in the | ||
// current machine's data center. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this does work, it's not elegant. The bool below claims to do what this bool is doing, but the bool below actually affects much more (as noted earlier in the discussion). I might try to do some more refactoring here to clean this up a bit more.
Issue number of the reported bug or feature request: #
Describe your changes
When a broker connects to a cluster and attempts to select an active node, the broker attempts to select a node in the same data center. After 10 seconds, if no active node has been selected, the broker relaxes the data center requirement and picks any available node. If a cluster does not have any nodes in the broker's data center, this leads to a 10 second delay on the first open queue request to a cluster.
This commit changes the active node selection logic for the case where there is no cluster node within the same data center as a broker. In this case, the broker will ignore the data center may use any available node as the active node.
Testing performed
Describe the testing you have performed to ensure that the bug has been addressed, or that the new feature works as planned.
Additional context
Add any other context about your contribution here.