Faulty behavior of forwarder-vpp blocks the heal process #1161
Could you try v1.14.0-rc.1?
It occurs on a customer deployment. So far we have not managed to reproduce this situation in lab environments, but we are working on it. If we succeed, we will be able to try v1.14.0-rc.1.
@ljkiraly I found that healing can be triggered by deleting the network service, so you may run `kubectl delete networkservicemesh $networkserviceName`. If the endpoint has self-service registration enabled, it is then enough to run `kubectl apply -f $networkserviceName`.
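For reference, the delete/re-apply cycle above can be sketched as a small shell script. The service name and manifest path are hypothetical placeholders added here, and the commands are wrapped in `echo` as a dry run; remove `echo` to actually run them against a cluster.

```shell
# Dry-run sketch of the heal-trigger cycle described in the comment above.
# The service name and manifest path are hypothetical placeholders.
networkserviceName="my-network-service"   # hypothetical NetworkService name
manifest="${networkserviceName}.yaml"     # hypothetical manifest file

# Deleting the network service triggers healing on its clients:
echo kubectl delete networkservicemesh "${networkserviceName}"

# If the endpoint self-registers its service, re-applying the manifest
# is enough to restore it:
echo kubectl apply -f "${manifest}"
```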
Unfortunately I cannot reproduce this issue in my environment (neither with NSM v1.13.2 nor with rc.2). The "dsc-fdr-7bdfc7b8b5-f7js4" pod cannot connect to NS "proxy.vpn1-b.sc-diam.deascmqv02". Once again: the failure popped up randomly during a long-running traffic case where no elements were restarted.
I've managed to reproduce the problem on NSM v1.13.2, but it looks like the issue is gone on NSM v1.14.0-rc.2. Changes in
@NikitaSkrynnik This is good news!
@ljkiraly, sure
|
@NikitaSkrynnik, thank you. Based on this description I was able to get different connection errors with NSM v1.13.2. Sometimes only 6 NSCs connected successfully. After checking the forwarder logs, I have concerns about whether we are reproducing the same fault as originally reported. As I noted in the previous comment, the last TRACE-level log output related to a faulty connection was "beginServer.Request()" (#1161 (comment)). Please confirm whether you have seen any connection stuck in 'beginServer' based on the forwarder logs.
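As a side note, one quick way to do this check is to save the forwarder logs to a file and grep for the trace entry. In the sketch below, the pod name in the comment is an assumption, and the here-doc is a placeholder line standing in for real forwarder-vpp output purely so the snippet is self-contained:

```shell
# Grep saved forwarder-vpp logs for trace entries that reached
# beginServer.Request(). Capture logs first with something like:
#   kubectl logs <forwarder-vpp-pod> > forwarder.log   # pod name is an assumption
# The here-doc below is a placeholder standing in for forwarder.log.
grep -n 'beginServer.Request()' <<'EOF'
TRACE beginServer.Request()
EOF
```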
@ljkiraly, you can also try the scripts I used: healbug copy.zip. Run
@NikitaSkrynnik Could you please attach the results from our last internal test?
@denis-tingaikin, @ljkiraly after 30 runs on
@NikitaSkrynnik And how often does it reproduce without the fix? About every 1-3 runs, right?
@denis-tingaikin, every 3-5 runs usually
@NikitaSkrynnik Could you please also check it with rc.7?
Ran tests 30 times on
Also verified on v1.14.0-rc.7 (with 1.11.2, 1.13.2 and 1.14.0-rc.7 endpoints) and the connections seem much more stable. I cannot reproduce the problem. Thank you for the fix.
Hi @NikitaSkrynnik,
Hello @zolug, we used the patch from the PR networkservicemesh/sdk-kernel#679. We are currently not planning to merge it into main since we'd like to get a fix via a NetLink release.
Hi @denis-tingaikin, hmm, if I got it right, it's not resolved in NSM 1.14.0 yet, is that correct?
To be clear, we actually have two solutions:
As far as we know at this moment, both fixes work and can be used together, but the begin chain element was not tested enough, so we decided not to include it in the final release. Situation at this moment: on the main branch, the problem is resolved via the begin chain element changes. For releases: since we did not have enough time for testing on a customer-like environment, we chose the netlink workaround for release v1.14.0, as it looks more stable and safe. The future release (v1.15.0) is planned to contain the begin fixes plus a newly released version of the NetLink library. Please let me know if anything is still unclear.
Seems like this one is resolved.
We observed a problem with `forwarder-vpp`, which randomly starts ignoring requests for certain node-internal connections. For example, a refresh seems to fail randomly, which triggers the healing mechanism. As part of it, in most cases even the `Close` is unsuccessful. Then, no matter how hard the heal tries to repair the connection, every request times out in `nsmgr`, because even though `forwarder-vpp` receives the requests, it does not handle them; for instance, it does not forward them to the `NSE`. So there are request timeouts in `nsmgr` every 15 seconds (which comes from the request timeout on the `NSC` side). However, in the `forwarder-vpp` logs we only saw that the request was received 15 seconds earlier; `forwarder-vpp` did nothing with it. Even though this results in a leaked interface, in our opinion that is just a consequence and not the root cause of the problem.

It is important to mention that so far this behavior has been observed only for node-internal connections. Also note that the request never reaches `discover/discoverCandidateServer.Request()` in the `forwarder-vpp`, where the first debug message would be emitted, so that message did not appear. The situation recovers after a restart of `forwarder-vpp`.

The `NSM` version used is v1.13.2, while the `NSC`s use SDK version v1.11.2. Unfortunately the `DEBUG` logs (as they are in the used `NSM` release) are not sufficient to analyze the case more deeply. However, since the problem is encountered quite frequently (within a couple of days), the `TRACE` log level has been set on the system and we are waiting for a reproduction; hopefully it will appear soon and I can add detailed logs to this issue.
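The restart-based recovery mentioned above can be sketched as a one-liner; the namespace and label selector are assumptions (they vary between NSM deployments), and the command is wrapped in `echo` as a dry run — remove `echo` to execute it for real.

```shell
# Dry-run sketch: recover by deleting the forwarder-vpp pod so its
# DaemonSet recreates it. Namespace and label are hypothetical.
ns="nsm-system"                  # hypothetical NSM namespace
selector="app=forwarder-vpp"     # hypothetical pod label
echo kubectl -n "${ns}" delete pod -l "${selector}"
```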