added worker error on timeout if restart fails #56

Merged
merged 5 commits into main from common-exit-hook-start on Aug 19, 2024

Conversation

PietroPasotti
Contributor

Fixes #51

Issue

The worker attempts to start the services in the layer as soon as it has all the necessary pieces of config.
If that fails, for whatever reason, the charm currently reports 'active' and does nothing further.
Any follow-up event that does not signal a config change will not even attempt to restart the services.
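
For illustration, a minimal sketch of the failure mode described above, assuming a simplified ops-based charm with a Pebble container; this is not the library's actual code:

```python
import logging

import ops

logger = logging.getLogger(__name__)


def restart(charm: ops.CharmBase, container: ops.Container) -> None:
    """Pre-fix behaviour: a failed start is logged and then forgotten."""
    try:
        container.replan()  # try to start the services in the layer
    except ops.pebble.ChangeError:
        # The failure is effectively swallowed: the unit still ends up 'active',
        # and no later event retries the start unless the config changes.
        logger.exception("failed to start services")
    charm.unit.status = ops.ActiveStatus()
```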

Solution

The worker will retry starting the services for 15 minutes (the default, which is configurable). If it still fails, it raises an exception, letting the charm go into error state, and writes logs that tell the user what is wrong: most likely an issue with an external service the workload depends on to start, such as s3.
The idea is that juju will retry the hook, and the issue will resolve itself once those external services come up.
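
A minimal sketch of this retry-then-error behaviour (not the actual implementation in this PR; the names `restart_services`, `SERVICE_START_TIMEOUT`, `RETRY_INTERVAL`, and `ServiceStartError` are illustrative):

```python
import logging
import time

import ops

logger = logging.getLogger(__name__)

SERVICE_START_TIMEOUT = 15 * 60  # default retry window in seconds, assumed configurable
RETRY_INTERVAL = 30  # pause between attempts; illustrative value


class ServiceStartError(Exception):
    """Raised when the services cannot be started within the timeout."""


def restart_services(charm: ops.CharmBase, container: ops.Container) -> None:
    """Keep trying to start the Pebble services; raise if the timeout elapses."""
    deadline = time.monotonic() + SERVICE_START_TIMEOUT
    while time.monotonic() < deadline:
        charm.unit.status = ops.MaintenanceStatus("waiting for services to start")
        try:
            container.replan()
            return  # services started; the caller can now set ActiveStatus
        except ops.pebble.ChangeError:
            logger.warning(
                "failed to start services; an external dependency (e.g. s3) "
                "may not be up yet; retrying"
            )
            time.sleep(RETRY_INTERVAL)
    # Raising here puts the charm in error state on purpose: juju will retry
    # the hook, and the start will eventually succeed once the external
    # dependency comes up.
    raise ServiceStartError(
        "services did not start within the timeout; "
        "check external dependencies such as s3"
    )
```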

Rejected alternatives:

  • attempting to 'gracefully' handle the situation by putting the charm in Blocked/Waiting is tricky because:
    • we become dependent on update-status or other events to wake us up so we can try again
    • we need to make the 'attempt to restart if not started' logic run unconditionally, which is an overcooked spaghetti code party
  • using resurrect or something similar to wake ourselves up periodically with custom events until the check finally passes
    • hard to predict how those will interact with regular juju events

Context

https://discourse.charmhub.io/t/its-probably-ok-for-a-unit-to-go-into-error-state/13022/30

Testing Instructions

  • Deploy a coordinator/worker charm built with this lib version
  • Deploy an s3 facade so that the config looks valid but s3 is, in practice, broken
  • Watch the charm cling to life for 15 minutes
  • Watch the charm go to error state
  • Now replace the s3 facade with a real s3; the charm should eventually go back to active

PietroPasotti requested a review from a team as a code owner on August 16, 2024 12:16
PietroPasotti merged commit 4f26423 into main on Aug 19, 2024
5 checks passed
PietroPasotti deleted the common-exit-hook-start branch on August 19, 2024 06:52
PietroPasotti added a commit that referenced this pull request Aug 26, 2024
* Catch SecretNotFoundError when privkey shouldn't be here (#48)

* Catch SecretNotFoundError when privkey shouldn't be here

* Bring back if self.tls_available check from pre-refactor times

* lint

* Bump the version that was forgotten in #48 (#50)

* Add issues integration action (#52)

* Add issues integration action

* Order imports differently as lint started to complain

* Update databag model dump return value (#53)

* dump returns databag

* tests

* root ca cert patch

* fixed static checks

* fix

* removed conftest

* added health check logic to worker (#55)

* added health check logic to worker

* adapted status check to be less tempo-specific

* use regular paths for codespell too

* rerun black

* vbump

* added worker error on timeout if restart fails (#56)

* added worker error on timeout if restart fails

* maintenance status throughout retries

* fixed static

* only restart own services

* added tls support for worker checks (#59)

* added tls support for worker checks

* lint

* static fix

* Test coordinator (#60)

* Added the following:
1) Coordinator pytest fixtures
2) Coordinator unit tests
3) Refactoring of roles_config in coordinator.py
  * ClusterRolesConfig was switched to a dataclass with __post_init__ and is_coherent_with methods
4) Created test_roles_config.py tests

* Chore: Fix leftover comments and minor code changes

* * Updates
1) Fmt
2) Merged 'main:coordinator.py' into test-coordinator-cleanup to fix Secrets error

* Added docstrings to ClusterRoleConfig

---------

Co-authored-by: Mateusz Kulewicz <mateusz.kulewicz@canonical.com>
Co-authored-by: PietroPasotti <starfire.daemon@gmail.com>
Successfully merging this pull request may close these issues.

worker won't start if s3 is not configured on pebble-ready, and will never be started