Web application does not rejoin akka.cluster after restart #1670
Comments
Can you post the configuration information for your web service as well? |
It's basically the same, apart from port = 0 and role = [webapp]; analogously for [wcfservice]. |
Recapping what we discussed in Gitter: your issue is using port 0 for the services |
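To make the port issue concrete, here is a hypothetical HOCON sketch of the change being discussed: giving a role a fixed listen port instead of port = 0 (the system name, hostname, and seed port below are illustrative assumptions, not values from the actual project):

```hocon
akka {
  actor.provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
  remote {
    helios.tcp {
      hostname = "127.0.0.1"
      port = 6001   # fixed port for [webapp] instead of port = 0
    }
  }
  cluster {
    roles = ["webapp"]
    # illustrative seed-node address; the real one comes from the project's config
    seed-nodes = ["akka.tcp://voltcode@127.0.0.1:5001"]
  }
}
```

With port = 0 the node binds to a random port on each restart, so it comes back with a different address than the one the cluster last saw it on. |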
I changed the ports to hardcoded values ([webapp] to 6001 and [wcfservice] to 7001) and I still experience the same erroneous outcome when [webapp] is restarted. |
New log from [mainservice]:
My hocon from [webapp]
|
I tried waiting 1 min before restarting the webapp, and in that case it works fine. I suppose I must find the right config setting and set it to a value smaller than the usual IIS app pool recycle time. However, when I look at the logs I still see lots of Akka.Remote.EndpointDisassociatedException: Disassociated ERROR messages even after the restart and successful message passing. It would be good to know how to get rid of those errors after the rejoin, so that the logs reflect the real cluster state and old messages are ignored after a while. |
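The timing the comment above is hunting for is likely governed by settings along these lines (a hedged sketch; the setting names exist in the Akka reference config, but the values shown are illustrative and worth verifying against the version in use):

```hocon
akka {
  remote {
    # how long a failed endpoint stays gated before reconnects are allowed
    retry-gate-closed-for = 5 s
  }
  cluster {
    # automatically mark an unreachable node as down after this interval;
    # off by default - use with care, or handle downing explicitly instead
    auto-down-unreachable-after = 30 s
    failure-detector {
      # how long a heartbeat pause is tolerated before a node is
      # considered unreachable
      acceptable-heartbeat-pause = 3 s
    }
  }
}
```

The interaction to watch for: until the old incarnation of the node is downed and removed, the cluster may refuse the restarted incarnation's join attempt. |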
@Aaronontheweb Correct me if I am wrong or help me understand :)
I can validate this in my Cluster Monitor example by removing that RemoveMember call. |
As per the Gitter discussion, this issue still persists and I have created a minimal reproduction sample, located here: https://github.com/voltcode/AkkaSample The scenario to replicate the bug is as follows (assuming the solution is open in VS): 1. mark the service project as startup, launch the service (best with CTRL+F5 so it stays outside the debugger).
Please consider reopening this issue @Aaronontheweb. @cgstevens do you mind having a look as well, to see if this minimal scenario correlates with your case? |
Another experiment: I changed the sample app to use pure remoting without the cluster, and the connection gets re-established properly, so I believe the problem is in the Cluster code, not in Remote or the network stack below. The Remote scenario was repeatedly scheduling a message from a service actor to a web actor, then killing the web application after messages went through (verified with a breakpoint in the web code). I can push the Remote sample to the repo if it helps. |
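The remoting-only variant described above would drop the cluster section entirely, roughly like this (an illustrative sketch; hostname and port are assumptions):

```hocon
akka {
  # plain remoting provider - no Akka.Cluster involved
  actor.provider = "Akka.Remote.RemoteActorRefProvider, Akka.Remote"
  remote {
    helios.tcp {
      hostname = "127.0.0.1"
      port = 7101   # illustrative fixed port for the web actor's system
    }
  }
}
```

Since this configuration reconnects cleanly after a restart, it helps isolate the fault to the cluster join/gating logic rather than the transport. |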
I have created a branch of my sample https://github.com/voltcode/AkkaSample/tree/AkkaSampleRemote |
Possibly related akka/akka#16224 |
I'll reopen this and see what could be possibly going on here |
I began debugging the issue. I found that at one point during the reconnection attempt, the EndpointManager of the webapp seems to try ungating the connection, but its Ungate message is never handled; the logs confirm this. I wonder if ignoring Ungate is the desired behavior.
|
More clarification: the first time Ungate is sent (and not handled) during the reconnect attempt is actually in a different place in the same method as above; it doesn't get handled there either, and subsequent "ungates" are sent from the place mentioned in the previous comment. |
Exact log message (proof of Ungate going unhandled): 2016-02-20 00:00:43,714 [11] DEBUG Akka.Actor.LocalActorRef [(null)] - Unhandled message from akka://voltcode/system/endpointManager : Akka.Remote.ReliableDeliverySupervisor+Ungate |
a longer excerpt from the logs:
|
corresponding log from the other application:
|
adding full logs from both sides |
I found that the unexpected "Ungate" is sent when ReliableDeliverySupervisor (recipient) is in Receiving state. |
Another observation: on the second (failed) join after the process restart, the message on the seed node is AssociateUnderlyingRefuseUid with refuseUid = null. That's because the handle inside EndpointWriter is null on the failed join (it's not null on the initial join). |
I have created a new version of the sample project for bug replication: the non-seed node that's trying to join is now a console project, which is much quicker to reproduce with than restarting the web app. https://github.com/voltcode/AkkaSample/tree/AkkaSampleConsole To reproduce: run Service first, Console second; kill Console; start Console again. |
#1700 addresses this. |
I just wanted to say that I retested the scenario on nightly build 1.0.9.238
and the issue no longer occurs - thank you very much! Keeping my fingers crossed for official 1.1 build - I will retest when it's fully cooked. |
Hooray!!! |
My setup:
A [wcfservice] role talks to a [mainservice] role (a Windows service), which talks to a [webapp] role (an MVC app).
1. I start up [mainservice], [wcfservice], [webapp] (in this order); all looks good in the logs and the cluster is formed.
2. A message is sent from [wcfservice] to [mainservice]; [mainservice] processes it and sends a message to [webapp]. All works as expected.
3. I kill [webapp]; gossip failures show up in the [mainservice] logs (as expected).
4. I restart [webapp]; there is no message confirming that it rejoined the cluster, and sending a new message as in step 2 does not work.
I would expect [webapp] to rejoin the cluster automatically on startup.
If I kill [mainservice] together with [webapp] and start them both up ([mainservice] first), then the cluster is formed correctly (as in step 1).
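For reference, a minimal sketch of what the three-node setup above could look like in HOCON (the system name, host, and ports are illustrative assumptions, not copied from the attached configs):

```hocon
# [mainservice] - acting as the seed node in this sketch
akka {
  actor.provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
  remote.helios.tcp {
    hostname = "127.0.0.1"
    port = 5001
  }
  cluster {
    roles = ["mainservice"]
    seed-nodes = ["akka.tcp://voltcode@127.0.0.1:5001"]
  }
}
# [wcfservice] and [webapp] would use the same block with their own
# role and port, pointing seed-nodes at the [mainservice] address above.
```

In this topology, restarting [webapp] means a new incarnation must rejoin via the seed node, which is exactly the step that fails in the scenario described. |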
Akka stdout from the [mainservice] logs, edited to show when steps 2, 3, and 4 occur (look for !!!>>>)
[mainservice] hocon