Clustered routers fail to add new routee after leader becomes unreachable, new routee node joins, leader becomes reachable, node is up #1700
@annymsMthd might want to take a look at #1189 as well.
I reverted these files back to roughly their 1.0.3 state and the issue is fixed. Something in these changes broke the rejoining of a node to a cluster.
Looks like it's the changes to EndpointManager and EndpointRegistry here: Still narrowing it down.
Narrowed it down further to the refuseUid logic in the registry. When I revert the registry to the last commit before this one, it works properly with the changes in the manager. Going to look at how refuseUid is being used and why it would cause this issue.
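(For readers following along: a minimal sketch of the idea behind refuseUid, under the assumption that it acts as a per-address quarantine marker. This is illustrative Scala, not Akka's actual EndpointRegistry internals; `Address`, `EndpointRegistrySketch`, and `accept` are hypothetical names.)

```scala
// Illustrative sketch only; not Akka's real EndpointRegistry code.
// The idea: when a remote system incarnation (identified by its UID) is
// quarantined, reconnects presenting that same UID must be refused,
// while a restarted node arriving with a fresh UID should be accepted.
final case class Address(host: String, port: Int)

class EndpointRegistrySketch {
  // address -> UID that must be refused (the quarantined incarnation)
  private var refusedUids = Map.empty[Address, Long]

  def quarantine(address: Address, uid: Long): Unit =
    refusedUids += (address -> uid)

  // Refuse only the quarantined UID. A bug that also rejected fresh
  // UIDs would block legitimate rejoins, matching the symptom above.
  def accept(address: Address, uid: Long): Boolean =
    !refusedUids.get(address).contains(uid)
}
```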
I've got a similar issue, though not exactly the same, and I wonder if it's related.
I tried different routers (broadcast, round-robin); the whole system behaves the same.
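(A minimal sketch of the kind of cluster-aware router configuration being exercised here, written against the JVM Akka classic API in Scala since Akka.NET's C# API mirrors it; `Worker`, the system name, and the instance counts are hypothetical.)

```scala
import akka.actor.{Actor, ActorSystem, Props}
import akka.cluster.routing.{ClusterRouterPool, ClusterRouterPoolSettings}
import akka.routing.RoundRobinPool

// Hypothetical routee; the bug concerns the routers, not routee logic.
class Worker extends Actor {
  def receive = { case msg => sender() ! msg }
}

object RouterSetup extends App {
  val system = ActorSystem("ClusterSystem")

  // totalInstances caps routees cluster-wide; maxInstancesPerNode is how
  // many routees the pool deploys onto each node that joins. Deploying
  // new routees onto a joining node is exactly what fails in this issue.
  val settings = ClusterRouterPoolSettings(
    totalInstances = 100,
    maxInstancesPerNode = 3,
    allowLocalRoutees = false)

  // Swapping RoundRobinPool for BroadcastPool goes through the same
  // routee-management path, consistent with both routers misbehaving.
  val router = system.actorOf(
    ClusterRouterPool(RoundRobinPool(0), settings).props(Props[Worker]()),
    "workerRouter")
}
```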
@Aaronontheweb these file changes fix the problem. I'm not sure exactly which part, and I'm wondering what features this change added.
Never mind my previous comment: I didn't specify the number of instances, which seems to default to 1.
I've been diagnosing this a bit more, and it seems to be a timing problem. If I attach a debugger to the "dead" node or pause the process, it recovers immediately. Also, every now and then, when the node recovers quickly, I get this exception: (this is before my own changes to the DistributedPubSubMediator).
@Aaronontheweb Probably fixed by #2103.
I think that between #2103, the failure-detection changes we just merged in #2099, and the UID improvements we've added in a couple of Akka core and remoting patches, this should be resolved. There were a myriad of issues at work here, the biggest of which were implementation faults in the routers (resolved in #2103). In a pool router setting this would have been further compounded by the UID issues we had with remote deployments; we've made several fixes for those and added tests to cover both scenarios, which now pass. Going to mark this issue as resolved per the 1.1 milestone.
Been able to replicate this bug inside WebCrawler by killing off Lighthouse, launching a new crawler instance, bringing Lighthouse back, and watching the new crawler node join the cluster. The cluster appears not to gossip about the new node to the other existing nodes.
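(A small diagnostic sketch for this repro, assuming the JVM Akka Cluster API; `MembershipLogger` is a hypothetical name. Running one of these on each existing node shows whether a MemberUp for the new crawler ever arrives via gossip.)

```scala
import akka.actor.{Actor, ActorLogging}
import akka.cluster.Cluster
import akka.cluster.ClusterEvent._

// Logs cluster membership events so we can see whether the existing
// nodes ever learn about the newly joined crawler node.
class MembershipLogger extends Actor with ActorLogging {
  private val cluster = Cluster(context.system)

  override def preStart(): Unit =
    cluster.subscribe(self, initialStateMode = InitialStateAsEvents,
      classOf[MemberUp], classOf[UnreachableMember], classOf[ReachableMember])

  override def postStop(): Unit = cluster.unsubscribe(self)

  def receive = {
    case MemberUp(m)          => log.info("Member up: {}", m.address)
    case UnreachableMember(m) => log.info("Unreachable: {}", m.address)
    case ReachableMember(m)   => log.info("Reachable again: {}", m.address)
  }
}
// Start on every node, e.g. system.actorOf(Props[MembershipLogger](), "membership-logger").
// If the bug is present, the old nodes never log MemberUp for the new
// crawler, even after the Lighthouse seed node comes back.
```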