Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Segfault in forwarder on endpoint uninstall #504

Open
richardstone opened this issue Feb 11, 2022 · 5 comments
Open

Segfault in forwarder on endpoint uninstall #504

richardstone opened this issue Feb 11, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@richardstone
Copy link

richardstone commented Feb 11, 2022

Forwarder unexpectedly restarted with a segfault when then endpoint got removed.

The issue is rarely reproducible..

Feb 10 20:09:19.978�[36m [INFO] [cmd:vpp] �[0mvpp[14]: received signal SIGSEGV, PC 0x7f274a25b520, faulting address 0x0
Feb 10 20:09:19.978�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #0  0x00007f2749cef640 unix_signal_handler + 0x100
Feb 10 20:09:19.978�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #1  0x00007f2749602050 0x7f2749602050
Feb 10 20:09:19.978�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #2  0x00007f274a25b520 vnet_hw_if_update_runtime_data + 0x1a30
Feb 10 20:09:19.978�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #3  0x00007f274a226347 vnet_delete_hw_interface + 0x87
Feb 10 20:09:19.978�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #4  0x00007f274a26d238 ethernet_delete_interface + 0x548
Feb 10 20:09:19.978�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #5  0x00007f274a7be8ed tap_delete_if + 0x13d
Feb 10 20:09:19.978�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #6  0x00007f274a7c3a30 vl_api_tap_delete_v2_t_handler + 0x70
Feb 10 20:09:19.978�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #7  0x00007f274af611a1 vl_msg_api_handler_no_free + 0xd1
Feb 10 20:09:19.978�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #8  0x00007f274af399a5 vl_socket_process_api_msg + 0x15
Feb 10 20:09:19.978�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #9  0x00007f274af437cb vl_api_clnt_process + 0xdeb
Feb 10 20:09:19.978�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #10 0x00007f2749c96966 vlib_process_bootstrap + 0x16
Feb 10 20:09:19.978�[36m [INFO] [cmd:vpp] �[0mvpp[14]: #11 0x00007f27491d51c8 0x7f27491d51c8
Feb 10 20:17:13.298�[31m [ERRO] [cmd:/bin/forwarder] �[0msignal: aborted
Feb 10 20:17:13.299�[31m [ERRO] [cmd:/bin/forwarder] �[0m%!s(<nil>)

NSM component versions are v1.2.0-rc.1

forwarder-vpp-5r59k-forwarder-vpp-prev.txt

@denis-tingaikin
Copy link
Member

@richardstone Could you enable trace logs in the forwarder https://github.com/networkservicemesh/deployments-k8s/blob/main/apps/forwarder-vpp/forwarder.yaml#L29-L30?
This can super help us to figure out what is going on.

Could also you point out what applications are used? (Is it just https://github.com/networkservicemesh/cmd-nse-icmp-responder and https://github.com/networkservicemesh/cmd-nsc or something else?)

@denis-tingaikin denis-tingaikin added the bug Something isn't working label Feb 11, 2022
@richardstone
Copy link
Author

@denis-tingaikin
I've used https://github.com/networkservicemesh/cmd-nse-remote-vlan and https://github.com/networkservicemesh/cmd-nsc

I'll try to reproduce with trace logs, but as this is an intermittent issue, I'm not sure when will that be..

@denis-tingaikin
Copy link
Member

@edwarnicke , @ljkiraly , @richardstone I think the best solution here is to try to add testing for VLAN remote case on CI.
CI is running every day for a few times, so we'll catch any rare issue in some time :)

@ljkiraly
Copy link
Contributor

ljkiraly commented Feb 15, 2022

Hello @denis-tingaikin, @edwarnicke,

I added more remote vlan mechanism related TCs in pull request https://github.com/networkservicemesh/deployment_k8s#4465
Also contains some fix for the failing test issue networkservicemesh/integration-k8s-kind#584 (comment)
So the automation can be added after the deployment_k8s#4465 was merged.

Laszlo

@richardstone
Copy link
Author

This issues is still popping up from time to time, it happened again last night.

I suspect VPP does not update the timestamp as logs from VPP looks to be having the very same timestamp for all lines it prints. In case of the forwarder where the segfault appeared, the pod was alive from Nov 2 20:45:00.588 until Nov 2 21:13:02.651, but this is all of the logs from VPP:

Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: buffer: numa[0] falling back to non-hugepage backed buffer pool (vlib_physmem_shared_map_create: pmalloc_map_pages: Unable to lock pages: Cannot allocate memory)
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: perfmon: skipping source 'intel-uncore' - intel_uncore_init: no uncore units found
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: perfmon: skipping source 'intel-core' - intel_core_init: not a IA-32 CPU
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: tls_init_ca_chain:609: Could not initialize TLS CA certificates
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: tls_mbedtls_init:644: failed to initialize TLS CA chain
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: tls_init_ca_chain:976: Could not initialize TLS CA certificates
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: tls_openssl_init:1050: failed to initialize TLS CA chain
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: vat-plug/load: vat_plugin_register: oddbuf plugin not loaded...
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: vl_api_want_interface_events_t_handler:1635: pid 1: already enabled...
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: capo_sw_interface_add_del:103: unconfiguring policies for if 5 deleted
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: capo_sw_interface_add_del:110: error deleting caiop (output): -2
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: capo_sw_interface_add_del:116: error deleting caiop (input): -2
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: received signal SIGSEGV, PC 0x7f7bc9687129, faulting address 0x4
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #0  0x00007f7bc905bf20 unix_signal_handler + 0x150
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #1  0x00007f7bc898a8c0 0x7f7bc898a8c0
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #2  0x00007f7bc9687129 vnet_hw_if_generate_rxq_int_poll_vector + 0x69
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #3  0x00007f7bca357ae2 virtio_input_node_fn_hsw + 0x38352
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #4  0x00007f7bc8ffa35f vlib_main_loop + 0x108f
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #5  0x00007f7bc8ff9272 vlib_main + 0x582
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #6  0x00007f7bc905b8e6 thread0 + 0x26
Nov  2 20:45:00.590 [INFO] [cmd:vpp] vpp[14]: #7  0x00007f7bc851c294 0x7f7bc851c294

@edwarnicke Should I open a ticket for the VPP timestamp issue?

nsmbot pushed a commit that referenced this issue May 15, 2024
…k-k8s@main

PR link: networkservicemesh/sdk-k8s#504

Commit: 51fb1b9
Author: Network Service Mesh Bot
Date: 2024-05-15 10:40:06 -0500
Message:
  - Update go.mod and go.sum to latest version from networkservicemesh/sdk@main (#504)
PR link: networkservicemesh/sdk#1624
Commit: c2a3414
Author: Nikita Skrynnik
Date: 2024-05-16 02:24:48 +1100
Message:
    - Fix some leaks in opa policies (#1624)
* Fix some leaks in opa policies
* fix go linter issues
---------
Signed-off-by: NSMBot <nsmbot@networkservicmesh.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: Blocked
Development

No branches or pull requests

3 participants