Clustered routers fail to add new routee after leader becomes unreachable, new routee node joins, leader becomes reachable, node is up #1700

Aaronontheweb · 2016-02-10T21:18:11Z

Been able to replicate this bug inside WebCrawler by killing off lighthouse, launching a new crawler instance, bringing lighthouse back, and watching the new crawler node join the cluster. Cluster appears to not gossip about the new node to other existing nodes.

Aaronontheweb · 2016-02-19T22:04:55Z

@annymsMthd might want to take a look at #1189 as well

annymsMthd · 2016-03-04T01:04:28Z

https://github.com/akkadotnet/akka.net/compare/dev...syncromatics:feature/reverted-endpoint-managment-files?diff=split&expand=1&name=feature%2Freverted-endpoint-managment-files

I reverted these files back to around 1.0.3 and the issue is fixed. Something in these changes broke the rejoin of a node to a cluster.

annymsMthd · 2016-03-04T03:08:38Z

Looks like the changes to EndpointManager and EndpointRegistry here:
b642984#diff-1ac936535300cfff2e1ff58c1ad7b75d

Still narrowing it down

annymsMthd · 2016-03-04T03:55:16Z

Narrowed it down further to the refuseUid stuff in the registry. When I revert the registry to the last commit before this one, it works properly even with the changes in the manager. Going to look at how refuseUid is being used and why it would cause this issue.

maxcherednik · 2016-03-04T10:43:03Z

I've got a similar issue, though not exactly the same. I wonder if it is related.
Here is my scenario:

I have a cluster of 1 'Master' node and 3 'Slave' nodes
Each Slave node hosts one Actor called 'widget'
Master node creates a group router with following config:

    deployment {
      /widgetmanager/widgetsRouter {
        router = broadcast-group
        routees.paths = ["/user/widget"]
        cluster {
          enabled = on
          allow-local-routees = off
          use-role = riskenginewidget
        }
      }
    }

I am trying to send the message through the Router.
It's received only by one Actor
I kill it and messages are received by the next Actor and so on

I tried different routers - broadcast, round robin. The whole system behaves the same.

annymsMthd · 2016-03-04T17:15:48Z

https://github.com/akkadotnet/akka.net/compare/dev...syncromatics:feature/reverted-endpoint-managment-files?expand=1

@Aaronontheweb these file changes fix the problem. Not sure exactly which part does it, and I'm wondering what features this change added.

maxcherednik · 2016-03-07T07:54:39Z

Never mind my previous comment. I didn't specify the number of instances, which seems to default to 1.
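
For anyone hitting the same symptom, here is a minimal sketch of that deployment with the routee count set explicitly. It assumes the setting referred to above is the `nr-of-instances` key, that the `deployment` block sits under `akka.actor`, and that 3 (one routee per Slave node in the scenario) is the intended count. None of these is stated in the original config.

```hocon
akka.actor.deployment {
  /widgetmanager/widgetsRouter {
    router = broadcast-group
    # Assumed key and value: one routee per Slave node.
    # Left unset, the router appeared to use a single routee.
    nr-of-instances = 3
    routees.paths = ["/user/widget"]
    cluster {
      enabled = on
      allow-local-routees = off
      use-role = riskenginewidget
    }
  }
}
```

With the count left at its default, a broadcast or round-robin group would deliver to only one `widget` actor at a time, which matches the behaviour described earlier.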

corneliutusnea · 2016-05-02T23:43:53Z

I've been diagnosing this a bit more. This seems to be a timing problem. If I attach a debugger to the "dead" node or pause the process it recovers immediately.

Also every now and then if the node recovers quickly I get this exception:
[INFO][2/05/2016 11:43:00 PM][Thread 0019][[akka://OneSaas/system/cluster/core/daemon]] Marking node [akka.tcp://OneSaas@127.0.0.1:9911] as Down
[ERROR][2/05/2016 11:43:00 PM][Thread 0006][akka://OneSaas/system/distributedPubSubMediator] An element with the same key but a different value already exists. Key: /user/clustermanager
Cause: System.ArgumentException: An element with the same key but a different value already exists. Key: /user/clustermanager
at System.Collections.Immutable.ImmutableDictionary`2.HashBucket.Add(TKey key, TValue value, IEqualityComparer`1 keyOnlyComparer, IEqualityComparer`1 valueComparer, KeyCollisionBehavior behavior, OperationResult& result)
at System.Collections.Immutable.ImmutableDictionary`2.AddRange(IEnumerable`1 items, MutationInput origin, KeyCollisionBehavior collisionBehavior)
at System.Collections.Immutable.ImmutableDictionary`2.AddRange(IEnumerable`1 pairs, Boolean avoidToHashMap)
at System.Collections.Immutable.ImmutableDictionary`2.AddRange(IEnumerable`1 pairs)
at System.Collections.Immutable.ImmutableDictionary`2.System.Collections.Immutable.IImmutableDictionary<TKey,TValue>.AddRange(IEnumerable`1 pairs)
at Akka.Cluster.Tools.PublishSubscribe.DistributedPubSubMediator.<.ctor>b__13_12(Delta delta) in E:\Projects\Akka\akka.net\src\contrib\cluster\Akka.Cluster.Tools\PublishSubscribe\DistributedPubSubMediator.cs:line 280
at lambda_method(Closure , Object , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Action`1 , Object[] )
at Akka.Tools.MatchHandler.PartialHandlerArgumentsCapture`16.Handle(T message) in E:\Projects\Akka\akka.net\src\core\Akka\Util\MatchHandler\PartialHandlerArgumentsCapture.cs:line 423
at Akka.Actor.ReceiveActor.ExecutePartialMessageHandler(Object message, PartialAction`1 partialAction) in E:\Projects\Akka\akka.net\src\core\Akka\Actor\ReceiveActor.cs:line 68
at Akka.Actor.ReceiveActor.OnReceive(Object message) in E:\Projects\Akka\akka.net\src\core\Akka\Actor\ReceiveActor.cs:line 63
at Akka.Actor.UntypedActor.Receive(Object message) in E:\Projects\Akka\akka.net\src\core\Akka\Actor\UntypedActor.cs:line 21
at Akka.Actor.ActorBase.AroundReceive(Receive receive, Object message) in E:\Projects\Akka\akka.net\src\core\Akka\Actor\ActorBase.cs:line 155
at Akka.Actor.ActorCell.ReceiveMessage(Object message) in E:\Projects\Akka\akka.net\src\core\Akka\Actor\ActorCell.DefaultMessages.cs:line 145
at Akka.Actor.ActorCell.Invoke(Envelope envelope) in E:\Projects\Akka\akka.net\src\core\Akka\Actor\ActorCell.DefaultMessages.cs:line 62
[WARNING][2/05/2016 11:43:04 PM][Thread 0016][[akka://OneSaas/system/cluster/core/daemon]] Cluster Node [akka.tcp://OneSaas@127.0.0.1:9912] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://OneSaas@127.0.0.1:9911, status = Down]

(This is before my own changes to the DistributedPubSubMediator.)

alexvaluyskiy · 2016-06-28T11:24:15Z

@Aaronontheweb Probably fixed by #2103

Aaronontheweb · 2016-06-30T19:53:05Z

I think that between #2103, the failure detection changes we just merged in #2099, and the UID improvements we've added in a couple of Akka core and remoting patches, this should be resolved. There were a myriad of issues at work here, the biggest of which were implementation faults in the routers (resolved in #2103). In a pool router setting this would have been further compounded by the UID issues we had with remote deployments; we've made several fixes to those and added tests to cover both scenarios, which now pass.

Going to mark this issue as resolved per the 1.1 milestone.

Aaronontheweb added confirmed bug akka-cluster labels Feb 10, 2016

annymsMthd self-assigned this Feb 17, 2016

Aaronontheweb mentioned this issue Feb 19, 2016

Number of routees reported on Clustered Group routers doesn't reflect actual number of routees cluster-wide #1189

Closed

This was referenced Mar 4, 2016

Web application does not rejoin akka.cluster after restart #1670

Closed

#1700 Feature/reverted endpoint managment files to fix rejoin issues #1755

Closed

Aaronontheweb assigned Aaronontheweb and unassigned annymsMthd Mar 24, 2016

This was referenced Mar 24, 2016

Remote + Cluster = what? #748

Closed

Port Akka.Cluster.Tests.MultiNode.NodeRestartSpec #1821

Closed

Port Akka.Cluster.Tests.MultiNode.UnreachableNodeJoinsAgainSpec #1830

Closed

Aaronontheweb mentioned this issue Apr 7, 2016

EndpointRegistry fixes #1862

Merged

alexvaluyskiy added this to the Akka.NET v1.1 milestone Jun 15, 2016

Aaronontheweb closed this as completed Jun 30, 2016
