Faulty behavior of forwarder-vpp blocks the heal process #1161
Could you try v1.14.0-rc.1?
It occurs on a customer deployment. So far we have not managed to reproduce this situation in lab environments, but we are working on it. If we succeed, we will be able to try v1.14.0-rc.1.
@ljkiraly I found that healing can be triggered by deleting the network service, so you may run `kubectl delete networkservicemesh $networkserviceName`. If the endpoint has self-service registration enabled, it is then enough to run `kubectl apply -f $networkserviceName`.
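For reference, the delete/re-apply cycle above can be sketched as a small shell script. The service name and manifest path are hypothetical placeholders added here, and the commands are wrapped in `echo` as a dry run; remove `echo` to actually run them against a cluster.

```shell
# Dry-run sketch of the heal-trigger cycle described in the comment above.
# The service name and manifest path are hypothetical placeholders.
networkserviceName="my-network-service"   # hypothetical NetworkService name
manifest="${networkserviceName}.yaml"     # hypothetical manifest file

# Deleting the network service triggers healing on its clients:
echo kubectl delete networkservicemesh "${networkserviceName}"

# If the endpoint self-registers its service, re-applying the manifest
# is enough to restore it:
echo kubectl apply -f "${manifest}"
```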
Unfortunately I cannot reproduce this issue in my environment (neither with NSM v1.13.2 nor with rc.2). The "dsc-fdr-7bdfc7b8b5-f7js4" pod cannot connect to NS "proxy.vpn1-b.sc-diam.deascmqv02". Once again: the failure popped up randomly during a long-running traffic case where no elements were restarted.
I've managed to reproduce the problem on NSM v1.13.2, but it looks like the issue is gone on NSM v1.14.0-rc.2. Changes in
@NikitaSkrynnik This is good news!
@ljkiraly, sure
|
@NikitaSkrynnik, thank you. Based on this description I was able to get different connection errors with NSM v1.13.2. Sometimes only 6 NSCs connected successfully. After checking the forwarder logs, I have concerns about whether we are reproducing the same fault as originally reported. As I noted in the previous comment, the last TRACE-level log output related to a faulty connection was "beginServer.Request()" (#1161 (comment)). Please confirm whether you have seen any connection stuck in 'beginServer' based on the forwarder logs.
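As a side note, one quick way to do this check is to save the forwarder logs to a file and grep for the trace entry. In the sketch below, the pod name in the comment is an assumption, and the here-doc is a placeholder line standing in for real forwarder-vpp output purely so the snippet is self-contained:

```shell
# Grep saved forwarder-vpp logs for trace entries that reached
# beginServer.Request(). Capture logs first with something like:
#   kubectl logs <forwarder-vpp-pod> > forwarder.log   # pod name is an assumption
# The here-doc below is a placeholder standing in for forwarder.log.
grep -n 'beginServer.Request()' <<'EOF'
TRACE beginServer.Request()
EOF
```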
@ljkiraly, you can also try the scripts I used: healbug copy.zip. Run
@NikitaSkrynnik Could you please attach the results from our last internal test?
@denis-tingaikin, @ljkiraly after 30 runs on
@NikitaSkrynnik And how often does it reproduce without the fix? About every 1-3 runs, right?
@denis-tingaikin, every 3-5 runs usually
@NikitaSkrynnik Could you please also check it with rc.7?
Ran tests 30 times on
Also verified on v1.14.0-rc.7 (with 1.11.2, 1.13.2 and 1.14.0-rc.7 endpoints) and the connections seem much more stable. I cannot reproduce the problem. Thank you for the fix.
Hi @NikitaSkrynnik,
Hello @zolug, we used the patch from the PR networkservicemesh/sdk-kernel#679. We are currently not planning to merge it into main since we'd like to get a fix via a NetLink release.
Hi @denis-tingaikin, hmm, if I got it right, it's not resolved in NSM 1.14.0 yet, is that correct?
To be clear, we actually have two solutions:
As far as we know at this moment, both fixes work and can be used together, but the begin chain element was not tested enough, so we decided not to include it in the final release. Situation at this moment: on the main branch, the problem is resolved via the begin chain element changes. For releases: since we did not have enough time for testing on a customer-like environment, we chose the netlink workaround for release v1.14.0, as it looks more stable and safe. The future release (v1.15.0) is planned to contain the begin fixes plus a newly released version of the NetLink library. Please let me know if anything is still unclear.
Seems like this one is resolved.
We observed a problem with `forwarder-vpp`, which randomly starts ignoring requests for certain node-internal connections. For example, a refresh seems to fail randomly, which triggers the healing mechanism. As part of it, in most cases even the `Close` is unsuccessful. Then, no matter how hard the heal tries to repair the connection, every request times out in `nsmgr`, because even though `forwarder-vpp` receives the requests, it does not handle them; for instance, it does not forward them to the `NSE`. So there are request timeouts in `nsmgr` every 15 seconds (which comes from the request timeout on the `NSC` side). However, in the `forwarder-vpp` logs we only saw that the request was received 15 seconds earlier; `forwarder-vpp` did nothing with it. Even though this results in a leaked interface, in our opinion that is just a consequence and not the root cause of the problem.

It is important to mention that so far this behavior has been observed only for node-internal connections. Also note that the request never reaches `discover/discoverCandidateServer.Request()` in the `forwarder-vpp`, where the first debug message would be emitted, so that message did not appear. The situation recovers after a restart of `forwarder-vpp`.

The `NSM` version used is v1.13.2, while the `NSC`s use SDK version v1.11.2. Unfortunately the `DEBUG` logs (as they are in the used `NSM` release) are not sufficient to analyze the case more deeply. However, since the problem is encountered quite frequently (within a couple of days), the `TRACE` log level has been set on the system and we are waiting for a reproduction; hopefully it will appear soon and I can add detailed logs to this issue.
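The restart-based recovery mentioned above can be sketched as a one-liner; the namespace and label selector are assumptions (they vary between NSM deployments), and the command is wrapped in `echo` as a dry run — remove `echo` to execute it for real.

```shell
# Dry-run sketch: recover by deleting the forwarder-vpp pod so its
# DaemonSet recreates it. Namespace and label are hypothetical.
ns="nsm-system"                  # hypothetical NSM namespace
selector="app=forwarder-vpp"     # hypothetical pod label
echo kubectl -n "${ns}" delete pod -l "${selector}"
```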