[thermalctld] Enlarge startretries value to avoid thermalctld not able to restart during regression test #5633
@Junchao-Mellanox: Usually we set |
Hi Joe, there is a test case which loads an invalid thermal control configuration and verifies that the thermal control daemon won't crash. In that case, it restarts the daemon after the check, and it could start and kill the daemon within a very short time. |
I see. Thanks for the info. I'm not sure we should change this simply to satisfy a specific test case. It sounds like it might be more proper to modify the test case? |
We could add a delay in the test case, like "time.sleep(10)", but I'm not sure that is a good solution. IMO, thermalctld is designed to help protect the system from overheating, so maybe we could set it to always restart. Any suggestions? |
I think this is a valid concern. thermalctld is a critical process. I think it makes sense to be persistent at trying to get it running. I'm not opposed to setting it to restart always. @sujinmkang: Do you see any issues with configuring thermalctld to restart always? |
retest broadcom please |
retest mellanox please |
@Junchao-Mellanox and @jleveque As we discussed in the email thread, if the thermal control configuration is critical for thermalctld to run, then I think the right behavior is for thermalctld to crash, restart several times, and then stop restarting, rather than restart continuously. If thermalctld can run some minimum checks without the configuration, it should log warning/error messages periodically but keep running without crashing. |
Hi @sujinmkang, the test case is like this:
So thermalctld itself doesn't crash; it is the test case which kills it manually. |
Retest baseimage please |
Retest mellanox please |
@Junchao-Mellanox: Can you please update the PR title and description to match the new change? |
retest mellanox please |
@sujinmkang would you please help to review. |
@abdosi would you please help to cherry-pick? |
…e to restart during regression test (#5633) Increase startretries value from default of 10 to 50 to prevent supervisor from placing thermalctld in FATAL state during regression testing. Also ensures supervisord tries hard to get thermalctld running in production, as thermalctld is critical to prevent device from overheating.
- Why I did it
Found error logs in syslog:
The issue is related to the "startsecs" configuration of thermalctld in /etc/supervisor/conf.d/supervisord.conf. The current configuration sets "startsecs" to 10, which means that the thermalctld process must stay running for at least 10 seconds; otherwise supervisord will not restart it after it exits, even if the exit code is expected.
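To make the interaction between "startsecs" and "startretries" concrete, here is a toy model in Python. It is a simplification for illustration only and is not supervisord's actual code: the names `ProgramConfig` and `gives_up` are hypothetical, and the exact retry threshold is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ProgramConfig:
    # Seconds the process must stay up for a start to count as successful.
    startsecs: int = 10
    # Consecutive failed starts tolerated before giving up (assumed threshold).
    startretries: int = 3


def gives_up(uptimes: Iterable[float], cfg: ProgramConfig) -> bool:
    """Return True if this toy supervisor would stop restarting the process.

    `uptimes` lists how long the process lived on each successive start.
    An exit before `startsecs` counts as a failed start; a successful
    start resets the failure counter.
    """
    failures = 0
    for uptime in uptimes:
        if uptime >= cfg.startsecs:
            failures = 0  # start considered successful
            continue
        failures += 1
        if failures > cfg.startretries:
            return True  # corresponds to the FATAL state in this model
    return False


# A test that starts and kills the daemon within a second, several times.
quick_kills = [1, 1, 1, 1]
print(gives_up(quick_kills, ProgramConfig()))                  # strict settings
print(gives_up(quick_kills, ProgramConfig(startsecs=0)))       # no minimum uptime
print(gives_up(quick_kills, ProgramConfig(startretries=50)))   # many retries
```

In this model, either dropping the minimum uptime or raising the retry budget keeps the process restartable when it is killed shortly after each start.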
See the official documentation for "startsecs" at http://supervisord.org/configuration.html.
- How I did it
The fix is to change the "startsecs" configuration from 10 to 0.
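One way to check the resulting setting is to parse the program section with Python's standard `configparser`, since supervisord configuration files use INI syntax. This is a sketch under assumptions: the section name `program:thermalctld`, the command path, and the helper `load_program_options` are hypothetical, and the real section in supervisord.conf may contain other options.

```python
import configparser

# Hypothetical fragment for illustration; not copied from the real file.
FRAGMENT = """
[program:thermalctld]
command=/usr/local/bin/thermalctld
startsecs=0
"""


def load_program_options(text: str, section: str) -> dict:
    """Parse INI text and return the options of one section as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser[section])


opts = load_program_options(FRAGMENT, "program:thermalctld")
# configparser returns values as strings, so convert before comparing.
print(int(opts["startsecs"]))
```

On a device, the same helper could be fed the contents of /etc/supervisor/conf.d/supervisord.conf instead of the inline fragment.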
- How to verify it
Manual test