Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[reboot-cause] Fix a broken symlink of previous-reboot-cause file removal issue #46

Merged
merged 2 commits into from
Mar 28, 2023

Conversation

mlok-nokia
Copy link
Contributor

Why I did it

"show reboot-cause history" shows empty history. When the previous-reboot-cause has a broken symlink, And rebooting the system will not be able to generate a new symlink of the new previous-reboot-cause.

admin@sonic:~$ show reboot-cause history 
Name    Cause    Time    User    Comment
------  -------  ------  ------  ---------

How I did it

Somehow, when the symlink file /host/reboot-cause/previous-reboot-cause is broken (which its destination files doesn't exist in this case), the current condition check "if os.path,exists(PREVIOUS_REBOOT_CAUSE_FILE)" will return False in determine-reboot-cause script. Hence, the current previous-reboot-cause is not been removed and the recreation of the new previous-reboot-cause failed. In case of previous-reboot-cause is a broken synlink file, add condition os.path.islink(PREVIOUS_REBOOT_CAUSE) to check and allow the remove operation happens.

How to verify it

  1. Manually make the /host/reboot-cause/previous-reboot-cause to be a broken symlink file by removing its destination file
  2. reboot the system. "show reboot-cause history" should show the correct info
admin@ixre-egl-board25:~$ show reboot-cause history 
Name                 Cause                                    Time                             User     Comment
-------------------  ---------------------------------------  -------------------------------  -------  ----------
2023_03_01_17_41_31  reboot                                   Wed 01 Mar 2023 05:39:27 PM UTC  admin    N/A
2023_03_01_17_14_47  reboot                                   Wed 01 Mar 2023 05:12:05 PM UTC  admin    N/A
2023_03_01_17_14_18  reboot                                   Wed 01 Mar 2023 05:12:05 PM UTC  admin    N/A
2023_03_01_16_20_52  reboot                                   Wed 01 Mar 2023 04:18:22 PM UTC  admin    N/A
2023_03_01_16_20_27  reboot                                   Wed 01 Mar 2023 04:18:22 PM UTC  admin    N/A

Which release branch to backport (provide reason below if selected)

  • 201811
  • 201911
  • 202006
  • 202012
  • 202106
  • 202111
  • 202205
  • 202211

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

@mlok-nokia
Copy link
Contributor Author

@judyjoseph @abdosi Hi Judy and Abhishek, Please review this PR to fix the "show reboot-cause history" shows no info. Thanks.

@mlok-nokia
Copy link
Contributor Author

This is fix which has been reverted before. -- sonic-net/sonic-buildimage#10751. Per discussion, we need to have this PR to fix the issue in case of it happens to be a broken symlink

@prgeor prgeor self-requested a review March 1, 2023 19:11
@prgeor
Copy link
Contributor

prgeor commented Mar 1, 2023

@mlok-nokia did we investigate why the symlink is getting broken at the first place?

@mlok-nokia
Copy link
Contributor Author

@mlok-nokia did we investigate why the symlink is getting broken at the first place?

For the platform which doesn't have the RTC and super-cab, when system reboot or shutdown for a long period of time, the clock will go back to the old time. If the system doesn't have NTP enabled, the system will start with the old time. The filename of the reboot-cause history is constructed with the current timestamp. The "previous-reboot-cause" is created and softlink to that filename. When system has more than 10 histories, if the newer history with the older timesatmp, it will be removed because it filename has oldest timestamp. Hence, the file "previous-reboot-cause" point to a file which is just been removed. And it becomes a broken link. Once this broken symlink exists, it will be stay forever until user manually removes it. This PR just recovers it once the problem happens.

@abdosi
Copy link
Contributor

abdosi commented Mar 4, 2023

please create separate PR for 202205 as master has submodule while 202205 does not have

@mlok-nokia
Copy link
Contributor Author

please create separate PR for 202205 as master has submodule while 202205 does not have

A new PR has been created on 202205 branch -- sonic-net/sonic-buildimage#14106

@prgeor
Copy link
Contributor

prgeor commented Mar 6, 2023

previous-reboot-cause

@mlok-nokia where is the code which removes When system has more than 10 histories, if the newer history with the older timesatmp, it will be removed because it filename has oldest timestamp.
Can you point me to the code?

@mlok-nokia
Copy link
Contributor Author

previous-reboot-cause

@mlok-nokia where is the code which removes When system has more than 10 histories, if the newer history with the older timesatmp, it will be removed because it filename has oldest timestamp. Can you point me to the code?

This has been done in the process-reboot-cause script in fucntion read_reboot_cause_files_and_save_state_db()

@prgeor
Copy link
Contributor

prgeor commented Mar 7, 2023

read_reboot_cause_files_and_save_state_db

@mlok-nokia Thanks for pointing me the code. The reboot-cause_files_ with timestamp gets generated before the reboot i.e the current stamp after reboot has no impact on the deletion of file because the code removes the OLDEST files and keeps only the ten latest one. So, how is the NTP causing wrong deletion of files? Is it that during the long period of system operation and reboots, the timestamp of various reboot cause file gets out of order?

@mlok-nokia
Copy link
Contributor Author

I don't mean NTP cause this issue. Somehow, the real time clock set back to initial time (or old time) during booting up (such as No RTC clock and NO supercab to keep the hwclock, or others). The reboot cause history will be generated with old time filename. But, if NTP is enabled and the time can be synced before the determine-reboot-cause is running, the history file name will be generated with correct timestamp. This issue will not occurred.

yxieca pushed a commit to sonic-net/sonic-buildimage that referenced this pull request Mar 8, 2023
…cause file removal issue (sonic-host-services #46) (#14106)

Why I did it
Porting/cherry-pick PR sonic-net/sonic-host-services#46
"show reboot-cause history" shows empty history. When the previous-reboot-cause has a broken symlink, And rebooting the system will not be able to generate a new symlink of the new previous-reboot-cause.

admin@sonic:~$ show reboot-cause history 
Name    Cause    Time    User    Comment
------  -------  ------  ------  ---------
How I did it
Somehow, when the symlink file /host/reboot-cause/previous-reboot-cause is broken (which its destination files doesn't exist in this case), the current condition check "if os.path,exists(PREVIOUS_REBOOT_CAUSE_FILE)" will return False in determine-reboot-cause script. Hence, the current previous-reboot-cause is not been removed and the recreation of the new previous-reboot-cause failed. In case of previous-reboot-cause is a broken synlink file, add condition os.path.islink(PREVIOUS_REBOOT_CAUSE) to check and allow the remove operation happens.

How to verify it
Manually make the /host/reboot-cause/previous-reboot-cause to be a broken symlink file by removing its destination file
reboot the system. "show reboot-cause history" should show the correct info

Signed-off-by: mlok <marty.lok@nokia.com>
@prgeor
Copy link
Contributor

prgeor commented Mar 8, 2023

@mlok-nokia I deleted the symlink itself, and I don't see the issue that you are seeing. I dont' understand your explanation that show reboot cause history depends upon the symlink

root@str2-7215-acs-1:~# ls -l /host/reboot-cause/
total 16
drwxr-xr-x 2 root root 4096 Mar  6 23:25 history
drwxr-xr-x 2 root root 4096 Dec  6 08:34 platform
lrwxrwxrwx 1 root root   64 Mar  6 23:24 previous-reboot-cause.json -> /host/reboot-cause/history/reboot-cause-2023_03_06_23_24_13.json
-rw-r--r-- 1 root root    7 Mar  6 23:24 reboot-cause.txt
root@str2-7215-acs-1:~# show reboot-cause history
Name                 Cause    Time                          User    Comment
-------------------  -------  ----------------------------  ------  ---------
2023_03_06_23_24_13  reboot   Mon Mar  6 23:22:28 UTC 2023  admin   N/A
2023_03_06_14_17_12  reboot   Mon Mar  6 14:15:31 UTC 2023  admin   N/A
2023_03_06_13_59_19  reboot   Mon Mar  6 13:57:36 UTC 2023  admin   N/A
2023_03_06_13_21_42  reboot   Mon Mar  6 13:20:02 UTC 2023  admin   N/A
2023_03_06_13_08_56  reboot   Mon Mar  6 13:07:16 UTC 2023  admin   N/A
2023_03_06_10_21_01  Unknown  N/A                           N/A     N/A
2023_03_06_06_55_46  reboot   Mon Mar  6 06:54:00 UTC 2023  admin   N/A
2023_03_06_06_48_59  reboot   Mon Mar  6 06:47:13 UTC 2023  admin   N/A
2023_03_06_06_42_09  reboot   Mon Mar  6 06:40:04 UTC 2023  admin   N/A
2023_03_06_06_35_01  reboot   Mon Mar  6 06:33:20 UTC 2023  admin   N/A
root@str2-7215-acs-1:~# rm -f /host/reboot-cause/previous-reboot-cause.json
root@str2-7215-acs-1:~# ls -l /host/reboot-cause/
total 12
drwxr-xr-x 2 root root 4096 Mar  6 23:25 history
drwxr-xr-x 2 root root 4096 Dec  6 08:34 platform
-rw-r--r-- 1 root root    7 Mar  6 23:24 reboot-cause.txt
root@str2-7215-acs-1:~# show reboot-cause history
Name                 Cause    Time                          User    Comment
-------------------  -------  ----------------------------  ------  ---------
2023_03_06_23_24_13  reboot   Mon Mar  6 23:22:28 UTC 2023  admin   N/A
2023_03_06_14_17_12  reboot   Mon Mar  6 14:15:31 UTC 2023  admin   N/A
2023_03_06_13_59_19  reboot   Mon Mar  6 13:57:36 UTC 2023  admin   N/A
2023_03_06_13_21_42  reboot   Mon Mar  6 13:20:02 UTC 2023  admin   N/A
2023_03_06_13_08_56  reboot   Mon Mar  6 13:07:16 UTC 2023  admin   N/A
2023_03_06_10_21_01  Unknown  N/A                           N/A     N/A
2023_03_06_06_55_46  reboot   Mon Mar  6 06:54:00 UTC 2023  admin   N/A
2023_03_06_06_48_59  reboot   Mon Mar  6 06:47:13 UTC 2023  admin   N/A
2023_03_06_06_42_09  reboot   Mon Mar  6 06:40:04 UTC 2023  admin   N/A
2023_03_06_06_35_01  reboot   Mon Mar  6 06:33:20 UTC 2023  admin   N/A
root@str2-7215-acs-1:~# 

Copy link
Contributor

@prgeor prgeor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mlok-nokia please help me to understand better

@mlok-nokia
Copy link
Contributor Author

mlok-nokia commented Mar 9, 2023

@mlok-nokia please help me to understand better

You need to remove the src file -- /host/reboot-cause/history/reboot-cause-2023_03_06_23_24_13.json. Then , it becomes a broken. After that, reboot the system, the problem will occur.
This PR is to allow the determine-reboot-cause script to remove previous-reboot-cause file when it became broken which happened before the last reboot.

@prgeor
Copy link
Contributor

prgeor commented Mar 9, 2023

@mlok-nokia I rebooted after deleting the history file. Because the symlink is broken, the reboot cause is 'unknown' . But I don't see the issue. If there are history files under /host/reboot-cause/history , there is no issue. It does not depend upon the symlink presence

root@str2-7215-acs-1:~# ls -l /host/reboot-cause/history/
total 40
-rw-r--r-- 1 root root 129 Mar  6 06:35 reboot-cause-2023_03_06_06_35_01.json
-rw-r--r-- 1 root root 129 Mar  6 06:42 reboot-cause-2023_03_06_06_42_09.json
-rw-r--r-- 1 root root 129 Mar  6 06:48 reboot-cause-2023_03_06_06_48_59.json
-rw-r--r-- 1 root root 129 Mar  6 06:55 reboot-cause-2023_03_06_06_55_46.json
-rw-r--r-- 1 root root 103 Mar  6 10:21 reboot-cause-2023_03_06_10_21_01.json
-rw-r--r-- 1 root root 129 Mar  6 13:08 reboot-cause-2023_03_06_13_08_56.json
-rw-r--r-- 1 root root 129 Mar  6 13:21 reboot-cause-2023_03_06_13_21_42.json
-rw-r--r-- 1 root root 129 Mar  6 13:59 reboot-cause-2023_03_06_13_59_19.json
-rw-r--r-- 1 root root 129 Mar  6 14:17 reboot-cause-2023_03_06_14_17_12.json
-rw-r--r-- 1 root root 132 Mar  9 18:46 reboot-cause-2023_03_09_18_46_46.json
root@str2-7215-acs-1:~# ls -l /host/reboot-cause/
total 12
drwxr-xr-x 2 root root 4096 Mar  9 18:46 history
drwxr-xr-x 2 root root 4096 Dec  6 08:34 platform
lrwxrwxrwx 1 root root   47 Mar  9 18:43 previous-reboot-cause.json -> ./history/reboot-cause-2023_03_06_23_24_13.json
-rw-r--r-- 1 root root   82 Mar  9 18:45 reboot-cause.txt
root@str2-7215-acs-1:~# show reboot-cause history                   <<< After reboot
Name                 Cause    Time                             User    Comment
-------------------  -------  -------------------------------  ------  ---------
2023_03_09_18_46_46  reboot   Thu 09 Mar 2023 06:45:05 PM UTC  admin   N/A
2023_03_06_14_17_12  reboot   Mon Mar  6 14:15:31 UTC 2023     admin   N/A
2023_03_06_13_59_19  reboot   Mon Mar  6 13:57:36 UTC 2023     admin   N/A
2023_03_06_13_21_42  reboot   Mon Mar  6 13:20:02 UTC 2023     admin   N/A
2023_03_06_13_08_56  reboot   Mon Mar  6 13:07:16 UTC 2023     admin   N/A
2023_03_06_10_21_01  Unknown  N/A                              N/A     N/A
2023_03_06_06_55_46  reboot   Mon Mar  6 06:54:00 UTC 2023     admin   N/A
2023_03_06_06_48_59  reboot   Mon Mar  6 06:47:13 UTC 2023     admin   N/A
2023_03_06_06_42_09  reboot   Mon Mar  6 06:40:04 UTC 2023     admin   N/A
2023_03_06_06_35_01  reboot   Mon Mar  6 06:33:20 UTC 2023     admin   N/A
root@str2-7215-acs-1:~# show reboot-cause
Unknown
root@str2-7215-acs-1:~# 

@mlok-nokia
Copy link
Contributor Author

@prgeor The following is what I got by using the 202205 which is built on 02/24/2023.

  1. When the file previous-reboot-cause link is broken. The determine-reboot-cause script is not able to remove this file by the following code
    # Remove stale PREVIOUS_REBOOT_CAUSE_FILE if it exists
    if os.path.exists(PREVIOUS_REBOOT_CAUSE_FILE):       <----- it returns False since the history file doesn't exist
        os.remove(PREVIOUS_REBOOT_CAUSE_FILE)

  1. Since the previous-reboot-cause file is not removed in step 1). The following code in determine-reboot-cause cannot create a new previous-reboot-cause symlink and fails with Traceback. And determine-reboot-cause.service exited with status FAILURE
# Create a symbolic link to previous-reboot-cause.json file
    os.symlink(REBOOT_CAUSE_HISTORY_FILE, PREVIOUS_REBOOT_CAUSE_FILE)
Stack Trace
Mar  9 19:57:28.477216 ixre-cpm-chassis8 INFO determine-reboot-cause[2289]: Traceback (most recent call last):
Mar  9 19:57:28.477280 ixre-cpm-chassis8 INFO determine-reboot-cause[2289]:   File "/usr/local/bin/determine-reboot-cause", line 262, in <module>
Mar  9 19:57:28.477449 ixre-cpm-chassis8 INFO determine-reboot-cause[2289]:     main()
Mar  9 19:57:28.477502 ixre-cpm-chassis8 INFO determine-reboot-cause[2289]:   File "/usr/local/bin/determine-reboot-cause", line 249, in main
Mar  9 19:57:28.477640 ixre-cpm-chassis8 INFO determine-reboot-cause[2289]:     os.symlink(REBOOT_CAUSE_HISTORY_FILE, PREVIOUS_REBOOT_CAUSE_FILE)
Mar  9 19:57:28.477716 ixre-cpm-chassis8 INFO determine-reboot-cause[2289]: FileExistsError: [Errno 17] File exists: '/host/reboot-cause/history/reboot-cause-2023_03_09_19_57_28.json' -> '/host/reboot-cause/previous-reboot-cause.json'
Mar  9 19:57:28.516293 ixre-cpm-chassis8 NOTICE systemd[1]: determine-reboot-cause.service: Main process exited, code=exited, status=1/FAILURE
Mar  9 19:57:28.516448 ixre-cpm-chassis8 WARNING systemd[1]: determine-reboot-cause.service: Failed with result 'exit-code'.
Mar  9 19:57:28.517475 ixre-cpm-chassis8 ERR systemd[1]: Failed to start Reboot cause determination service.
  1. Because the determine-reboot-cause.service exited with FAILURE. The process-reboot-cause.service will be stuck in the loaded state due to the Requires dependency. And no history files have been processed and stored into the database. And the command "show reboot-cause history" shows empty.
â process-reboot-cause.service - Retrieve the reboot cause from the history fil>
     Loaded: loaded (/lib/systemd/system/process-reboot-cause.service; static)
     Active: inactive (dead)
TriggeredBy: â process-reboot-cause.timer


Name                 Cause    Time                             User    Comment
-------------------  -------  -------------------------------  ------  ---------
2023_03_07_21_55_48  reboot   Tue 07 Mar 2023 09:54:02 PM UTC  admin   N/A
2023_02_24_15_41_58  reboot   Fri 24 Feb 2023 03:40:11 PM UTC  admin   N/A
2023_02_23_20_53_20  reboot   Thu 23 Feb 2023 08:50:21 PM UTC  admin   N/A
2023_02_23_17_47_58  reboot   Thu 23 Feb 2023 05:45:01 PM UTC  admin   N/A
2023_02_22_16_04_39  reboot   Wed 22 Feb 2023 04:02:50 PM UTC  admin   N/A
2023_02_21_16_14_02  reboot   Tue 21 Feb 2023 04:12:18 PM UTC  admin   N/A
2023_02_21_00_51_20  reboot   Tue 21 Feb 2023 12:49:47 AM UTC  admin   N/A
2023_02_21_00_47_28  reboot   Tue 21 Feb 2023 12:45:41 AM UTC  admin   N/A
2023_02_21_00_44_35  reboot   Tue 21 Feb 2023 12:42:57 AM UTC  admin   N/A
2023_02_21_00_27_14  reboot   Tue 21 Feb 2023 12:25:26 AM UTC  admin   N/A
admin@ixs-7215-pizza9:~$ ls -l /host/reboot-cause/
total 16
drwxr-xr-x 2 root root 4096 Mar  7 21:56 history
drwxr-xr-x 2 root root 4096 Feb 20 17:59 platform
lrwxrwxrwx 1 root root   64 Mar  7 21:55 previous-reboot-cause.json -> /host/reboot-cause/history/reboot-cause-2023_03_07_21_55_48.json
-rw-r--r-- 1 root root    7 Mar  7 21:55 reboot-cause.txt
admin@ixs-7215-pizza9:~$ sudo rm /host/reboot-cause/history/reboot-cause-2023_03_07_21_55_48.json
admin@ixs-7215-pizza9:~$
admin@ixs-7215-pizza9:~$ sudo reboot
requested COLD shutdown

(After reboot)
admin@ixs-7215-pizza9:~$ show reboot-cause history 
Name    Cause    Time    User    Comment
------  -------  ------  ------  ---------
admin@ixs-7215-pizza9:~$

@xumia
Copy link
Contributor

xumia commented Mar 11, 2023

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@judyjoseph
Copy link
Contributor

@mlok-nokia could you take care of test coverage ?

…oval issue

Signed-off-by: mlok <marty.lok@nokia.com>
@mlok-nokia
Copy link
Contributor Author

@mlok-nokia could you take care of test coverage ?

@judyjoseph I have added the case test_determinre_reboot_cause_main() for it. Thanks

@judyjoseph
Copy link
Contributor

@mlok-nokia could you update the test to cover this change -- so as to get the coverage, thanks

@mlok-nokia
Copy link
Contributor Author

@prgeor Hi Prince, I have moved the code to the place which you suggested. But somehow, our conversion have been cleared. I just let you know that. Please review the change. Thanks

… to the pace before its new symlink creation
@mlok-nokia
Copy link
Contributor Author

@mlok-nokia could you update the test to cover this change -- so as to get the coverage, thanks

All set. Thanks.

@prgeor prgeor added the bug Something isn't working label Mar 28, 2023
@prgeor prgeor merged commit 2fdb64b into sonic-net:master Mar 28, 2023
StormLiangMS pushed a commit that referenced this pull request Apr 20, 2023
[reboot-cause] Fix a broken symlink of previous-reboot-cause file removal issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

6 participants