[log_ssd_health]Fix log_ssd_health hang issue #1904

Merged: 1 commit, Oct 29, 2021

@dgsudharsan dgsudharsan commented Oct 29, 2021

What I did

Fix sonic-net/sonic-buildimage#9114
The log_ssd_health command hangs because timeout is used together with docker exec -i. The hang also affects the warmboot flow.

How I did it

Added the --foreground option to timeout. The man page recommends this option when timeout is not run directly from a shell prompt:
https://man7.org/linux/man-pages/man1/timeout.1.html
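As an illustration of the option described above, here is a minimal sketch using GNU coreutils `timeout`. The helper name `run_with_limit` and the durations are hypothetical and are not taken from the log_ssd_health script; in the real script the wrapped command would be the `docker exec -i` invocation mentioned in the description.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): run a command under a time limit.
# --foreground lets the command read from the TTY and receive TTY signals
# when timeout is not started directly from a shell prompt.
run_with_limit() {
    local limit="$1"
    shift
    timeout --foreground "$limit" "$@"
}

# A command that finishes in time returns its own exit status (0 here).
run_with_limit 5s true
echo "fast command exit status: $?"

# A command that overruns the limit is terminated; timeout returns 124.
run_with_limit 1s sleep 3
echo "slow command exit status: $?"
```

Note that, per the man page, children of the command are not timed out in `--foreground` mode, so this suits a single long-running command rather than a process tree.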

How to verify it

Run log_ssd_health and verify it does not hang
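One way to automate this check is to run the command under an outer time limit and treat a timeout as a hang. The sketch below assumes GNU coreutils `timeout`; the function name `check_no_hang` and the 30s limit in the usage note are illustrative choices, not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical check (not from the PR): fail if a command does not
# finish within the given limit.
check_no_hang() {
    local limit="$1"
    shift
    # Discard the command's own output; only the exit status matters here.
    timeout --foreground "$limit" "$@" >/dev/null 2>&1
    local rc=$?
    # GNU timeout exits with 124 when the time limit was reached.
    if [ "$rc" -eq 124 ]; then
        echo "HANG: '$*' exceeded ${limit}"
        return 1
    fi
    echo "OK: '$*' exited with status ${rc}"
}

# Demonstration with stand-in commands.
check_no_hang 5s true
check_no_hang 1s sleep 3 || echo "hang detected as expected"
```

On a device, the call would look like `check_no_hang 30s log_ssd_health`, with the limit chosen to suit the platform.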

Signed-off-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>

Signed-off-by: Sudharsan Dhamal Gopalarathnam <sudharsand@nvidia.com>
@dgsudharsan
Collaborator Author

This fix is required for the 202012 and 202106 branches.

@dgsudharsan
Collaborator Author

@yxieca FYI

@dgsudharsan
Collaborator Author

Also verified the fix in the warmboot flow. The logs are below:
Oct 29 16:05:20.836944 r-anaconda-15 NOTICE admin: Pausing orchagent ...
Oct 29 16:05:20.994356 r-anaconda-15 NOTICE admin: Collecting logs to check ssd health before fastfast-reboot...
Oct 29 16:05:21.151082 r-anaconda-15 NOTICE admin: smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-12-2-amd64] (local build)
Oct 29 16:05:21.151440 r-anaconda-15 NOTICE admin: Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
Oct 29 16:05:21.151707 r-anaconda-15 NOTICE admin:
Oct 29 16:05:21.151963 r-anaconda-15 NOTICE admin: === START OF INFORMATION SECTION ===
Oct 29 16:05:21.152203 r-anaconda-15 NOTICE admin: Device Model: StorFly VSFBM4XC030G-MLX1
Oct 29 16:05:21.152442 r-anaconda-15 NOTICE admin: Serial Number: P1T13004870812030294
Oct 29 16:05:21.152689 r-anaconda-15 NOTICE admin: Firmware Version: 0202-000
Oct 29 16:05:21.152950 r-anaconda-15 NOTICE admin: User Capacity: 30,016,659,456 bytes [30.0 GB]
Oct 29 16:05:21.153191 r-anaconda-15 NOTICE admin: Sector Size: 512 bytes logical/physical
Oct 29 16:05:21.153489 r-anaconda-15 NOTICE admin: Rotation Rate: Solid State Device
Oct 29 16:05:21.153786 r-anaconda-15 NOTICE admin: Device is: Not in smartctl database [for details use: -P showall]
Oct 29 16:05:21.154117 r-anaconda-15 NOTICE admin: ATA Version is: ACS-2 (minor revision not indicated)
Oct 29 16:05:21.154394 r-anaconda-15 NOTICE admin: SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Oct 29 16:05:21.154484 r-anaconda-15 NOTICE admin: Local Time is: Fri Oct 29 16:05:21 2021 UTC
Oct 29 16:05:21.154568 r-anaconda-15 NOTICE admin: SMART support is: Available - device has SMART capability.
Oct 29 16:05:21.154650 r-anaconda-15 NOTICE admin: SMART support is: Enabled
Oct 29 16:05:21.154732 r-anaconda-15 NOTICE admin:
Oct 29 16:05:21.154815 r-anaconda-15 NOTICE admin: === START OF READ SMART DATA SECTION ===
Oct 29 16:05:21.154902 r-anaconda-15 NOTICE admin: SMART overall-health self-assessment test result: PASSED
Oct 29 16:05:21.154985 r-anaconda-15 NOTICE admin:
Oct 29 16:05:21.155066 r-anaconda-15 NOTICE admin: General SMART Values:
Oct 29 16:05:21.155147 r-anaconda-15 NOTICE admin: Offline data collection status: (0x00)#011Offline data collection activity
Oct 29 16:05:21.155229 r-anaconda-15 NOTICE admin: #011#011#011#011#011was never started.
Oct 29 16:05:21.155310 r-anaconda-15 NOTICE admin: #011#011#011#011#011Auto Offline Data Collection: Disabled.
Oct 29 16:05:21.155391 r-anaconda-15 NOTICE admin: Self-test execution status: ( 0)#011The previous self-test routine completed
Oct 29 16:05:21.155472 r-anaconda-15 NOTICE admin: #011#011#011#011#011without error or no self-test has ever
Oct 29 16:05:21.155555 r-anaconda-15 NOTICE admin: #011#011#011#011#011been run.
Oct 29 16:05:21.155635 r-anaconda-15 NOTICE admin: Total time to complete Offline
Oct 29 16:05:21.155723 r-anaconda-15 NOTICE admin: data collection: #011#011( 0) seconds.
Oct 29 16:05:21.155805 r-anaconda-15 NOTICE admin: Offline data collection
Oct 29 16:05:21.155886 r-anaconda-15 NOTICE admin: capabilities: #011#011#011 (0x71) SMART execute Offline immediate.
Oct 29 16:05:21.155971 r-anaconda-15 NOTICE admin: #011#011#011#011#011No Auto Offline data collection support.
Oct 29 16:05:21.156062 r-anaconda-15 NOTICE admin: #011#011#011#011#011Suspend Offline collection upon new
Oct 29 16:05:21.156144 r-anaconda-15 NOTICE admin: #011#011#011#011#011command.
Oct 29 16:05:21.156227 r-anaconda-15 NOTICE admin: #011#011#011#011#011No Offline surface scan supported.
Oct 29 16:05:21.156307 r-anaconda-15 NOTICE admin: #011#011#011#011#011Self-test supported.
Oct 29 16:05:21.156388 r-anaconda-15 NOTICE admin: #011#011#011#011#011Conveyance Self-test supported.
Oct 29 16:05:21.156470 r-anaconda-15 NOTICE admin: #011#011#011#011#011Selective Self-test supported.
Oct 29 16:05:21.156553 r-anaconda-15 NOTICE admin: SMART capabilities: (0x0002)#011Does not save SMART data before
Oct 29 16:05:21.156635 r-anaconda-15 NOTICE admin: #011#011#011#011#011entering power-saving mode.
Oct 29 16:05:21.156717 r-anaconda-15 NOTICE admin: #011#011#011#011#011Supports SMART auto save timer.
Oct 29 16:05:21.156798 r-anaconda-15 NOTICE admin: Error logging capability: (0x01)#011Error logging supported.
Oct 29 16:05:21.156879 r-anaconda-15 NOTICE admin: #011#011#011#011#011General Purpose Logging supported.
Oct 29 16:05:21.156960 r-anaconda-15 NOTICE admin: Short self-test routine
Oct 29 16:05:21.157043 r-anaconda-15 NOTICE admin: recommended polling time: #011 ( 1) minutes.
Oct 29 16:05:21.157127 r-anaconda-15 NOTICE admin: Extended self-test routine
Oct 29 16:05:21.157209 r-anaconda-15 NOTICE admin: recommended polling time: #011 ( 1) minutes.
Oct 29 16:05:21.157289 r-anaconda-15 NOTICE admin: Conveyance self-test routine
Oct 29 16:05:21.157371 r-anaconda-15 NOTICE admin: recommended polling time: #011 ( 1) minutes.
Oct 29 16:05:21.157452 r-anaconda-15 NOTICE admin:
Oct 29 16:05:21.157534 r-anaconda-15 NOTICE admin: SMART Attributes Data Structure revision number: 1
Oct 29 16:05:21.157614 r-anaconda-15 NOTICE admin: Vendor Specific SMART Attributes with Thresholds:
Oct 29 16:05:21.157695 r-anaconda-15 NOTICE admin: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
Oct 29 16:05:21.157777 r-anaconda-15 NOTICE admin: 1 Raw_Read_Error_Rate 0x0000 100 100 070 Old_age Offline - 0
Oct 29 16:05:21.157880 r-anaconda-15 NOTICE admin: 5 Reallocated_Sector_Ct 0x0000 100 100 000 Old_age Offline - 0
Oct 29 16:05:21.157965 r-anaconda-15 NOTICE admin: 9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 1142
Oct 29 16:05:21.158052 r-anaconda-15 NOTICE admin: 12 Power_Cycle_Count 0x0000 100 100 000 Old_age Offline - 3947
Oct 29 16:05:21.158151 r-anaconda-15 NOTICE admin: 160 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
Oct 29 16:05:21.158232 r-anaconda-15 NOTICE admin: 161 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 262
Oct 29 16:05:21.158313 r-anaconda-15 NOTICE admin: 163 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 21
Oct 29 16:05:21.158393 r-anaconda-15 NOTICE admin: 164 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 1843067
Oct 29 16:05:21.158474 r-anaconda-15 NOTICE admin: 165 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 1054
Oct 29 16:05:21.158554 r-anaconda-15 NOTICE admin: 166 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 861
Oct 29 16:05:21.158635 r-anaconda-15 NOTICE admin: 167 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 887
Oct 29 16:05:21.158716 r-anaconda-15 NOTICE admin: 168 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 20000
Oct 29 16:05:21.158796 r-anaconda-15 NOTICE admin: 177 Wear_Leveling_Count 0x0000 100 100 050 Old_age Offline - 11065
Oct 29 16:05:21.158877 r-anaconda-15 NOTICE admin: 178 Used_Rsvd_Blk_Cnt_Chip 0x0000 100 100 000 Old_age Offline - 0
Oct 29 16:05:21.158960 r-anaconda-15 NOTICE admin: 181 Program_Fail_Cnt_Total 0x0000 100 100 000 Old_age Offline - 0
Oct 29 16:05:21.159041 r-anaconda-15 NOTICE admin: 182 Erase_Fail_Count_Total 0x0000 100 100 000 Old_age Offline - 0
Oct 29 16:05:21.159123 r-anaconda-15 NOTICE admin: 187 Reported_Uncorrect 0x0000 100 100 000 Old_age Offline - 0
Oct 29 16:05:21.159205 r-anaconda-15 NOTICE admin: 192 Power-Off_Retract_Count 0x0000 100 100 000 Old_age Offline - 1182
Oct 29 16:05:21.159327 r-anaconda-15 NOTICE admin: 194 Temperature_Celsius 0x0000 100 100 000 Old_age Offline - 36
Oct 29 16:05:21.159416 r-anaconda-15 NOTICE admin: 195 Hardware_ECC_Recovered 0x0000 100 100 000 Old_age Offline - 0
Oct 29 16:05:21.159498 r-anaconda-15 NOTICE admin: 196 Reallocated_Event_Count 0x0000 100 100 016 Old_age Offline - 0
Oct 29 16:05:21.159581 r-anaconda-15 NOTICE admin: 198 Offline_Uncorrectable 0x0000 100 100 000 Old_age Offline - 0
Oct 29 16:05:21.159661 r-anaconda-15 NOTICE admin: 199 UDMA_CRC_Error_Count 0x0000 100 100 050 Old_age Offline - 1
Oct 29 16:05:21.159741 r-anaconda-15 NOTICE admin: 232 Available_Reservd_Space 0x0000 100 100 000 Old_age Offline - 100
Oct 29 16:05:21.159826 r-anaconda-15 NOTICE admin: 241 Total_LBAs_Written 0x0000 100 100 000 Old_age Offline - 455796
Oct 29 16:05:21.159908 r-anaconda-15 NOTICE admin: 242 Total_LBAs_Read 0x0000 100 100 000 Old_age Offline - 3921978
Oct 29 16:05:21.159989 r-anaconda-15 NOTICE admin: 248 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 96
Oct 29 16:05:21.160070 r-anaconda-15 NOTICE admin: 249 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 100
Oct 29 16:05:21.160153 r-anaconda-15 NOTICE admin:
Oct 29 16:05:21.160235 r-anaconda-15 NOTICE admin: SMART Error Log Version: 1
Oct 29 16:05:21.160315 r-anaconda-15 NOTICE admin: No Errors Logged
Oct 29 16:05:21.160396 r-anaconda-15 NOTICE admin:
Oct 29 16:05:21.160477 r-anaconda-15 NOTICE admin: SMART Self-test log structure revision number 1
Oct 29 16:05:21.160557 r-anaconda-15 NOTICE admin: No self-tests have been logged. [To run self-tests, use: smartctl -t]
Oct 29 16:05:21.160641 r-anaconda-15 NOTICE admin:
Oct 29 16:05:21.160722 r-anaconda-15 NOTICE admin: SMART Selective self-test log data structure revision number 1
Oct 29 16:05:21.160803 r-anaconda-15 NOTICE admin: SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
Oct 29 16:05:21.160883 r-anaconda-15 NOTICE admin: 1 0 0 Not_testing
Oct 29 16:05:21.160963 r-anaconda-15 NOTICE admin: 2 0 0 Not_testing
Oct 29 16:05:21.161044 r-anaconda-15 NOTICE admin: 3 0 0 Not_testing
Oct 29 16:05:21.161125 r-anaconda-15 NOTICE admin: 4 0 0 Not_testing
Oct 29 16:05:21.161206 r-anaconda-15 NOTICE admin: 5 0 0 Not_testing
Oct 29 16:05:21.161287 r-anaconda-15 NOTICE admin: 6 0 65535 Read_scanning was never started
Oct 29 16:05:21.161371 r-anaconda-15 NOTICE admin: Selective self-test flags (0x0):
Oct 29 16:05:21.161457 r-anaconda-15 NOTICE admin: After scanning selected spans, do NOT read-scan remainder of disk.
Oct 29 16:05:21.161539 r-anaconda-15 NOTICE admin: If Selective self-test is pending on power-up, resume after 0 minute delay.
Oct 29 16:05:21.161623 r-anaconda-15 NOTICE admin:
Oct 29 16:05:21.161826 r-anaconda-15 NOTICE admin: Stopping nat ...

@yxieca yxieca self-requested a review October 29, 2021 16:52
@yxieca yxieca merged commit 80a10dc into sonic-net:master Oct 29, 2021
qiluo-msft pushed a commit that referenced this pull request Nov 5, 2021
judyjoseph pushed a commit that referenced this pull request Nov 6, 2021
stepanblyschak pushed a commit to stepanblyschak/sonic-utilities that referenced this pull request Apr 18, 2022
Submodule update for sonic-utilities

```
48035d7 [202012] [techsupport] Techsupport Error Reporting pending fixes (sonic-net#1854)
8b2ec09 Fix log_ssd_health hang issue (sonic-net#1904)
ac9c425 Fix the option missing in kernel config issue (sonic-net#1888)
5cc9417 disk_check: Script updated to run good in 201811 & 201911 (sonic-net#1747)
```
@dgsudharsan dgsudharsan deleted the ssd_health_fix branch March 9, 2023 02:05
Development

Successfully merging this pull request may close these issues.

log_ssd_health timeouts when executing from shell and during warm-reboot