Clustered routers fail to add new routee after leader becomes unreachable, new routee node joins, leader becomes reachable, node is up #1700

Closed
Aaronontheweb opened this issue Feb 10, 2016 · 10 comments

Comments

@Aaronontheweb
Member

I've been able to replicate this bug inside WebCrawler by killing off lighthouse, launching a new crawler instance, bringing lighthouse back, and watching the new crawler node join the cluster. The cluster does not appear to gossip about the new node to the other existing nodes.

@Aaronontheweb
Member Author

@annymsMthd might want to take a look at #1189 as well

@annymsMthd
Contributor

https://github.com/akkadotnet/akka.net/compare/dev...syncromatics:feature/reverted-endpoint-managment-files?diff=split&expand=1&name=feature%2Freverted-endpoint-managment-files

I reverted these files back to around 1.0.3 and the issue is fixed. Something in these changes broke the rejoin of a node to a cluster.

@annymsMthd
Contributor

Looks like the changes to EndpointManager and EndpointRegistry here:
b642984#diff-1ac936535300cfff2e1ff58c1ad7b75d

Still narrowing it down

@annymsMthd
Contributor

Narrowed it down further to the refuseUid stuff in the registry. When I revert the registry to the last commit before this one, it works properly with the changes in the manager. Going to look at how refuseUid is being used and why it would cause this issue.

@maxcherednik
Contributor

I've got a similar issue, though not exactly the same. I wonder if it is related.
Here is my scenario:

  1. I have a cluster of 1 'Master' node and 3 'Slave' nodes
  2. Each Slave node hosts one Actor called 'widget'
  3. Master node creates a group router with the following config:

    deployment {
      /widgetmanager/widgetsRouter {
        router = broadcast-group
        routees.paths = ["/user/widget"]
        cluster {
          enabled = on
          allow-local-routees = off
          use-role = riskenginewidget
        }
      }
    }

  4. I am trying to send the message through the Router.
  5. It's received only by one Actor
  6. I kill it and messages are received by the next Actor and so on

I tried different routers - broadcast, round robin. The whole system behaves the same.

@annymsMthd
Contributor

https://github.com/akkadotnet/akka.net/compare/dev...syncromatics:feature/reverted-endpoint-managment-files?expand=1

@Aaronontheweb these file changes fix the problem. I'm not sure exactly which part does it, and I'm wondering what features this change added.

@maxcherednik
Contributor

Never mind my previous comment: I didn't specify the number of instances, which seems to default to 1.
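
For anyone hitting the same symptom, here is a minimal sketch of the router config from the earlier comment with the instance count set explicitly. It assumes `nr-of-instances` is the setting meant here and that 3 (one routee per Slave node) is the intended total; neither is confirmed in this thread.

```hocon
deployment {
  /widgetmanager/widgetsRouter {
    router = broadcast-group
    # Assumption: raise the routee cap from its apparent default of 1
    # so that all three Slave nodes can be used as routees.
    nr-of-instances = 3
    routees.paths = ["/user/widget"]
    cluster {
      enabled = on
      allow-local-routees = off
      use-role = riskenginewidget
    }
  }
}
```

With the cap left at 1, the router would keep a single routee and only pick up another one when the current one goes away, which matches the behavior described in steps 5 and 6.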

@corneliutusnea

I've been diagnosing this a bit more. This seems to be a timing problem. If I attach a debugger to the "dead" node or pause the process it recovers immediately.

Also every now and then if the node recovers quickly I get this exception:
[INFO][2/05/2016 11:43:00 PM][Thread 0019][[akka://OneSaas/system/cluster/core/daemon]] Marking node [akka.tcp://OneSaas@127.0.0.1:9911] as Down
[ERROR][2/05/2016 11:43:00 PM][Thread 0006][akka://OneSaas/system/distributedPubSubMediator] An element with the same key but a different value already exists. Key: /user/clustermanager
Cause: System.ArgumentException: An element with the same key but a different value already exists. Key: /user/clustermanager
at System.Collections.Immutable.ImmutableDictionary`2.HashBucket.Add(TKey key, TValue value, IEqualityComparer`1 keyOnlyComparer, IEqualityComparer`1 valueComparer, KeyCollisionBehavior behavior, OperationResult& result)
at System.Collections.Immutable.ImmutableDictionary`2.AddRange(IEnumerable`1 items, MutationInput origin, KeyCollisionBehavior collisionBehavior)
at System.Collections.Immutable.ImmutableDictionary`2.AddRange(IEnumerable`1 pairs, Boolean avoidToHashMap)
at System.Collections.Immutable.ImmutableDictionary`2.AddRange(IEnumerable`1 pairs)
at System.Collections.Immutable.ImmutableDictionary`2.System.Collections.Immutable.IImmutableDictionary<TKey,TValue>.AddRange(IEnumerable`1 pairs)
at Akka.Cluster.Tools.PublishSubscribe.DistributedPubSubMediator.<.ctor>b__13_12(Delta delta) in E:\Projects\Akka\akka.net\src\contrib\cluster\Akka.Cluster.Tools\PublishSubscribe\DistributedPubSubMediator.cs:line 280
at lambda_method(Closure , Object , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Object[] )
at Akka.Tools.MatchHandler.PartialHandlerArgumentsCapture`16.Handle(T message) in E:\Projects\Akka\akka.net\src\core\Akka\Util\MatchHandler\PartialHandlerArgumentsCapture.cs:line 423
at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction) in E:\Projects\Akka\akka.net\src\core\Akka\Actor\ReceiveActor.cs:line 68
at Akka.Actor.ReceiveActor.OnReceive(Object message) in E:\Projects\Akka\akka.net\src\core\Akka\Actor\ReceiveActor.cs:line 63
at Akka.Actor.UntypedActor.Receive(Object message) in E:\Projects\Akka\akka.net\src\core\Akka\Actor\UntypedActor.cs:line 21
at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message) in E:\Projects\Akka\akka.net\src\core\Akka\Actor\ActorBase.cs:line 155
at Akka.Actor.ActorCell.ReceiveMessage(Object message) in E:\Projects\Akka\akka.net\src\core\Akka\Actor\ActorCell.DefaultMessages.cs:line 145
at Akka.Actor.ActorCell.Invoke(Envelope envelope) in E:\Projects\Akka\akka.net\src\core\Akka\Actor\ActorCell.DefaultMessages.cs:line 62
[WARNING][2/05/2016 11:43:04 PM][Thread 0016][[akka://OneSaas/system/cluster/core/daemon]] Cluster Node [akka.tcp://OneSaas@127.0.0.1:9912] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://OneSaas@127.0.0.1:9911, status = Down]

(This is before my own changes to the DistributedPubSubMediator.)

@alexvaluyskiy alexvaluyskiy added this to the Akka.NET v1.1 milestone Jun 15, 2016
@alexvaluyskiy
Contributor

@Aaronontheweb Probably fixed by #2103

@Aaronontheweb
Member Author

I think this should be resolved between #2103, the failure detection changes we just merged in #2099, and the UID improvements we've added in a couple of Akka core and remoting patches. There were a myriad of issues at work here, the biggest of which were implementation faults with the routers (resolved in #2103). In a pool router setting this would have been further compounded by the UID issues we had with remote deployments. We've made several fixes to those and added tests to cover both scenarios, which now pass.

Going to mark this issue as resolved per the 1.1 milestone.
