log_ssd_health timeouts when executing from shell and during warm-reboot #9114

Closed
dgsudharsan opened this issue Oct 29, 2021 · 3 comments · Fixed by sonic-net/sonic-utilities#1904

Comments

@dgsudharsan
Collaborator

dgsudharsan commented Oct 29, 2021

Description

This issue was introduced by sonic-net/sonic-utilities#1850, which added a timeout command around smartctl. However, the smartctl command on the host is a wrapper that calls docker exec -it pmon smartctl. In interactive mode the command does not return, so it runs until the timeout expires. This behavior is documented in moby/moby#28207 (comment); the recommendation there is to pass --foreground to timeout, which solves the issue.
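
A minimal sketch of the recommended pattern follows. The helper name run_with_limit is hypothetical, and the exact command line used inside log_ssd_health is not shown in this issue, so the smartctl invocation in the comment is only an assumed illustration of where --foreground would go.

```shell
#!/bin/bash
# Hypothetical helper (not part of sonic-utilities): run a command under a
# time limit using GNU timeout with --foreground, so that a command needing
# the TTY (such as a wrapper around "docker exec -it") can still read from it.
run_with_limit() {
    local limit="$1"
    shift
    timeout --foreground "$limit" "$@"
}

# Assumed usage on a SONiC host; the device path is a placeholder:
#   run_with_limit 30 smartctl -a /dev/sda

# Stand-in demonstration that needs neither docker nor smartctl.
# GNU timeout exits with status 124 when the limit is reached.
run_with_limit 1 sleep 3 || echo "timed out or failed: exit status $?"
```

Note that, per the timeout man page, children of the command are not timed out in --foreground mode.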

In the logs below, log_ssd_health takes 30 seconds between successive commands and then times out:
Oct 20 09:19:23.134719 arc-switch1025 NOTICE admin: Collecting logs to check ssd health before fast-reboot...
Oct 20 09:19:53.154005 arc-switch1025 NOTICE admin: Stopping nat ...

Steps to reproduce the issue:

  1. Run log_ssd_health in bash. It does not return until the timeout expires (30 seconds).

Describe the results you received:

The command hangs

Describe the results you expected:

The command shouldn't hang
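
One way to check this expectation is to time the command. The sketch below uses a hypothetical helper, returns_within, that succeeds only if a command finishes in fewer than the given number of seconds; the 30-second figure is taken from the timeout described above.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the given command returns in fewer
# than <limit> whole seconds. The command's own output is discarded.
returns_within() {
    local limit="$1"
    shift
    local start end
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    [ $((end - start)) -lt "$limit" ]
}

# Assumed usage on the switch:
#   returns_within 30 log_ssd_health && echo "log_ssd_health did not hang"
```

The helper only measures elapsed time; it does not inspect the exit status of the command it runs.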

Output of show version:

show version

SONiC Software Version: SONiC.master.209-b0c73d9a7_Internal
Distribution: Debian 10.11
Kernel: 4.19.0-12-2-amd64
Build commit: b0c73d9a7
Build date: Tue Oct 26 16:48:20 UTC 2021
Built by: sw-r2d2-bot@r-build-sonic-ci03-243

Platform: x86_64-mlnx_msn2700-r0
HwSKU: Mellanox-SN2700
ASIC: mellanox
ASIC Count: 1
Serial Number: MT2020T04244
Model Number: MSN2700-CS2FO
Hardware Revision: A2
Uptime: 05:53:42 up 1 day, 21:36,  1 user,  load average: 0.70, 0.94, 1.03

Docker images:
REPOSITORY                                         TAG                             IMAGE ID            SIZE
docker-dhcp-relay                                  latest                          e2f35a076316        429MB
docker-syncd-mlnx                                  latest                          2652a4c657a9        996MB
docker-syncd-mlnx                                  master.209-b0c73d9a7_Internal   2652a4c657a9        996MB
docker-database                                    latest                          fe2aefdfb8c3        415MB
docker-database                                    master.209-b0c73d9a7_Internal   fe2aefdfb8c3        415MB
docker-snmp                                        latest                          822131202195        457MB
docker-snmp                                        master.209-b0c73d9a7_Internal   822131202195        457MB
docker-teamd                                       latest                          ea6bda719ec0        428MB
docker-teamd                                       master.209-b0c73d9a7_Internal   ea6bda719ec0        428MB
docker-nat                                         latest                          1cbc8173f2e3        430MB
docker-nat                                         master.209-b0c73d9a7_Internal   1cbc8173f2e3        430MB
docker-router-advertiser                           latest                          ea7bb1f6d3f0        415MB
docker-router-advertiser                           master.209-b0c73d9a7_Internal   ea7bb1f6d3f0        415MB
docker-platform-monitor                            latest                          03027e633d3f        746MB
docker-platform-monitor                            master.209-b0c73d9a7_Internal   03027e633d3f        746MB
docker-macsec                                      latest                          de1ea43df7cb        431MB
docker-macsec                                      master.209-b0c73d9a7_Internal   de1ea43df7cb        431MB
docker-lldp                                        latest                          f13c1b763180        455MB
docker-lldp                                        master.209-b0c73d9a7_Internal   f13c1b763180        455MB
docker-orchagent                                   latest                          6849170aa3cd        446MB
docker-orchagent                                   master.209-b0c73d9a7_Internal   6849170aa3cd        446MB
docker-sonic-telemetry                             latest                          22a2702c6b34        504MB
docker-sonic-telemetry                             master.209-b0c73d9a7_Internal   22a2702c6b34        504MB
docker-sonic-mgmt-framework                        latest                          84e0d03cae25        570MB
docker-sonic-mgmt-framework                        master.209-b0c73d9a7_Internal   84e0d03cae25        570MB
docker-mux                                         latest                          6e6637e4ea92        468MB
docker-mux                                         master.209-b0c73d9a7_Internal   6e6637e4ea92        468MB
docker-fpm-frr                                     latest                          a48280de2de7        446MB
docker-fpm-frr                                     master.209-b0c73d9a7_Internal   a48280de2de7        446MB
docker-sflow                                       latest                          0b2be7d286b9        428MB
docker-sflow                                       master.209-b0c73d9a7_Internal   0b2be7d286b9        428MB
urm.nvidia.com/sw-nbu-sws-sonic-docker/sonic-wjh   1.1.0-202106-internal-13        a1808db49408        462MB
harbor.mellanox.com/sonic/cpu-report               10.0.0                          5314b41a2a5e        413MB


@dgsudharsan
Collaborator Author

@yxieca FYI

@yxieca
Contributor

yxieca commented Oct 29, 2021

@dgsudharsan Thanks so much for pointing this out. I'll add a fix soon.

@dgsudharsan
Collaborator Author

@yxieca I have raised a fix for it: sonic-net/sonic-utilities#1904. It would be great if you could review it.

yxieca pushed a commit to sonic-net/sonic-utilities that referenced this issue Oct 29, 2021
What I did
Fix sonic-net/sonic-buildimage#9114
The log_ssd_health command hangs because timeout is used with docker exec -i, which also affects the warm-boot flow.

How I did it
Added the --foreground option to timeout. The man page recommends it when timeout is not run directly from a shell prompt:
https://man7.org/linux/man-pages/man1/timeout.1.html

How to verify it
Run log_ssd_health and verify it does not hang

Signed-off-by: Sudharsan Dhamal Gopalarathnam sudharsand@nvidia.com
qiluo-msft pushed a commit to sonic-net/sonic-utilities that referenced this issue Nov 5, 2021
judyjoseph pushed a commit to sonic-net/sonic-utilities that referenced this issue Nov 6, 2021
malletvapid23 added a commit to malletvapid23/Sonic-Utility that referenced this issue Aug 3, 2023