SWSS Orchagent crashed for "soft-reboot" command. #8301

Open
rajkumar38 opened this issue Jul 27, 2021 · 5 comments

@rajkumar38
Contributor

The issue occurs during testbed testing in the t0-52 topology, while executing the reboot test suite:

py.test platform_tests/test_reboot.py   --testbed=ptf5-m --inventory=../ansible/lab --testbed_file=../ansible/testbed.csv --host-pattern=str-marvell-acs-1 --module-path=../ansible/library -v --topology=t0-52,any -

As part of this, syncd is stopped while orchagent is still pushing requests to it. In this flow, a bulk request gets no response (expected, as syncd is down) and an exception is thrown that swss orchagent does not handle. The uncaught exception aborts the orchagent process with SIGABRT, and supervisord logs the exit as unexpected.
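
To illustrate the failure mode described above, here is a minimal Python sketch. Orchagent itself is C++, and every name here (`wait_for_bulk_response`, `bulk_get_unhandled`, `bulk_get_handled`) is hypothetical and not taken from sonic-swss or sonic-sairedis. It only models a bulk wait that receives zero replies and raises, and contrasts letting the error propagate with reporting it to the caller.

```python
# Hypothetical model of the failure; not the real orchagent/sairedis code.

def wait_for_bulk_response(replies, expected):
    """Raise if the number of replies does not match the request size."""
    if len(replies) != expected:
        raise RuntimeError(
            f"waitForBulkResponse: wrong number of counters, "
            f"got {len(replies)}, expected {expected}"
        )
    return replies


def bulk_get_unhandled(replies, expected):
    # Mirrors the reported behaviour: nothing catches the error, so it
    # propagates to the top of the process, which then aborts.
    return wait_for_bulk_response(replies, expected)


def bulk_get_handled(replies, expected):
    # One possible alternative (an assumption, not a proposed fix):
    # report the failure to the caller instead of letting it propagate.
    try:
        return True, wait_for_bulk_response(replies, expected)
    except RuntimeError as err:
        return False, str(err)


if __name__ == "__main__":
    # syncd is down, so no replies arrive for a 1000-entry bulk request.
    ok, detail = bulk_get_handled([], 1000)
    print(ok, detail)
```

Whether orchagent should tolerate this error during a shutdown, or stop issuing requests once syncd has been asked to shut down, is a design question this sketch does not answer.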

Jul 13 01:44:02.117199 ixs-7215-pizza4 INFO ansible-command: Invoked with creates=None executable=None _uses_shell=False strip_empty_ends=True _raw_params=soft-reboot removes=None argv=None warn=True chdir=None stdin_add_newline=True stdin=None
Jul 13 01:44:04.570998 ixs-7215-pizza4 NOTICE admin: Collecting logs to check ssd health before soft-reboot...

**syncd down log**

Jul 13 01:44:10.133423 ixs-7215-pizza4 NOTICE syncd#syncd_request_shutdown: :- loadFromFile: no context config specified, will load default context config
Jul 13 01:44:10.133423 ixs-7215-pizza4 NOTICE syncd#syncd_request_shutdown: :- insert: added switch: 0:
Jul 13 01:44:10.144254 ixs-7215-pizza4 NOTICE syncd#syncd_request_shutdown: :- send: requested COLD shutdown

**Orchagent pushing requests and timing out**

Jul 13 01:50:13.555322 ixs-7215-pizza4 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
Jul 13 01:50:13.555322 ixs-7215-pizza4 ERR swss#orchagent: :- wait: failed to get response for getresponse
Jul 13 01:50:13.555322 ixs-7215-pizza4 ERR swss#orchagent: :- set: set status: SAI_STATUS_FAILURE
Jul 13 01:50:13.555322 ixs-7215-pizza4 ERR swss#orchagent: :- setRouterIntfsMtu: Failed to set router interface PortChannel0004 MTU to 9100, rv:-1
Jul 13 01:50:16.451985 ixs-7215-pizza4 ERR monit[451]: 'container_checker' status failed (3) -- Expected containers not running: pmon
Jul 13 01:50:26.487275 ixs-7215-pizza4 ERR monit[451]: 'container_checker' status failed (3) -- Expected containers not running: pmon
Jul 13 01:50:36.516905 ixs-7215-pizza4 ERR monit[451]: 'container_checker' status failed (3) -- Expected containers not running: pmon
Jul 13 01:50:46.548083 ixs-7215-pizza4 ERR monit[451]: 'container_checker' status failed (3) -- Expected containers not running: pmon
Jul 13 01:50:46.769762 ixs-7215-pizza4 DEBUG /disk_check.py: /etc is Read-Write
Jul 13 01:50:46.770963 ixs-7215-pizza4 DEBUG /disk_check.py: /home is Read-Write
Jul 13 01:50:56.600778 ixs-7215-pizza4 ERR monit[451]: 'container_checker' status failed (3) -- Expected containers not running: pmon
Jul 13 01:51:06.625046 ixs-7215-pizza4 ERR monit[451]: 'container_checker' status failed (3) -- Expected containers not running: pmon
Jul 13 01:51:13.600351 ixs-7215-pizza4 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
Jul 13 01:51:13.600351 ixs-7215-pizza4 ERR swss#orchagent: :- wait: failed to get response for getresponse
Jul 13 01:51:13.600351 ixs-7215-pizza4 ERR swss#orchagent: :- remove: remove status: SAI_STATUS_FAILURE
Jul 13 01:51:13.600351 ixs-7215-pizza4 ERR swss#orchagent: :- removeLagMember: Failed to remove member Ethernet51 from LAG PortChannel0004 lid:2000000000594 lmid:1b000000000598

Jul 13 01:52:14.035422 ixs-7215-pizza4 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse
Jul 13 01:52:14.035814 ixs-7215-pizza4 ERR swss#orchagent: :- wait: failed to get response for getresponse
Jul 13 01:52:14.036177 ixs-7215-pizza4 ERR swss#orchagent: :- waitForBulkResponse: wrong number of counters, got 0, expected 1000
Jul 13 01:52:14.036422 ixs-7215-pizza4 INFO swss#/supervisord: orchagent terminate called after throwing an instance of 'std::runtime_error'
Jul 13 01:52:14.036626 ixs-7215-pizza4 INFO swss#/supervisord: orchagent   what():  :- waitForBulkResponse: wrong number of counters, got 0, expected 1000
Jul 13 01:52:15.447453 ixs-7215-pizza4 INFO ansible-command: Invoked with creates=None executable=None _uses_shell=False strip_empty_ends=True _raw_params=date +"%Y-%m-%d %H:%M:%S" removes=None argv=None warn=True chdir=None stdin_add_newline=True stdin=None
Jul 13 01:52:15.987562 ixs-7215-pizza4 INFO swss#supervisord 2021-07-13 01:52:15,986 INFO exited: orchagent (terminated by SIGABRT (core dumped); not expected)
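
For triage, a small Python sketch like the following can pull the relevant events out of a syslog in this format and show how long after the syncd shutdown request each orchagent timeout was logged. The function names and matched substrings are choices made for this example, and the year is assumed to be 2021 because the syslog timestamps omit it.

```python
from datetime import datetime

# Substrings taken from the log excerpts in this issue.
SHUTDOWN_MARK = "requested COLD shutdown"
TIMEOUT_MARK = "SELECT operation result: TIMEOUT"


def parse_ts(line, year=2021):
    """Parse the 'Jul 13 01:44:10.144254' prefix; the year is an assumption."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(
        f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S.%f"
    )


def timeouts_after_shutdown(lines):
    """Return seconds between the syncd shutdown request and each timeout."""
    shutdown = None
    gaps = []
    for line in lines:
        if SHUTDOWN_MARK in line and shutdown is None:
            shutdown = parse_ts(line)
        elif TIMEOUT_MARK in line and shutdown is not None:
            gaps.append((parse_ts(line) - shutdown).total_seconds())
    return gaps


if __name__ == "__main__":
    sample = [
        "Jul 13 01:44:10.144254 ixs-7215-pizza4 NOTICE syncd#syncd_request_shutdown: :- send: requested COLD shutdown",
        "Jul 13 01:50:13.555322 ixs-7215-pizza4 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse",
        "Jul 13 01:52:14.035422 ixs-7215-pizza4 ERR swss#orchagent: :- wait: SELECT operation result: TIMEOUT on getresponse",
    ]
    for gap in timeouts_after_shutdown(sample):
        print(f"timeout {gap:.1f}s after syncd shutdown request")
```

In the excerpt above, the first orchagent timeout is logged roughly six minutes after syncd was asked to shut down.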


**root@str-marvell-acs-1:~# show version**

SONiC Software Version: SONiC.202012.23973-1d3939b7f
Distribution: Debian 10.10
Kernel: 4.19.0-12-2-armmp
Build commit: 1d3939b7f
Build date: Sat Jul 17 06:38:27 UTC 2021
Built by: AzDevOps@sonic-build-workers-0000DC

Attaching complete syslog.
syslog.txt

@lguohan lguohan transferred this issue from sonic-net/sonic-swss Jul 31, 2021
@lguohan
Collaborator

lguohan commented Jul 31, 2021

Can you help us understand why syncd is killed first during the soft-reboot?

@rajkumar38
Contributor Author

I can't find any documentation related to soft-reboot.
@sujinmkang, can you please help answer @lguohan's comment?

@sujinmkang
Collaborator

@lguohan and @rajkumar38, the syncd shutdown comes from the cold reboot script, which soft-reboot is based on. As I understand it, it was added to prevent some rare reboot failures; sonic-net/sonic-utilities#223 is one of the changes that introduced the syncd shutdown.

@rajkumar38
Contributor Author

Issue fixed. Verified with latest commit.

@rajkumar38
Contributor Author

Observing the issue again with the SONiC 202205 branch.

@rajkumar38 rajkumar38 reopened this Dec 16, 2022