added worker error on timeout if restart fails #56

Merged
merged 5 commits into main from common-exit-hook-start on Aug 19, 2024

Conversation

PietroPasotti
Contributor

Fixes #51

Issue

The worker attempts to start the services in the layer as soon as it has all the necessary pieces of config.
If that fails, for whatever reason, the charm currently reports 'active' and does nothing further.
Any follow-up event that does not signal a config change will not even attempt to restart the services.
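
For illustration, a minimal sketch of the failure mode described above, assuming a simplified ops-based charm with a Pebble container; this is not the library's actual code:

```python
import logging

import ops

logger = logging.getLogger(__name__)


def restart(charm: ops.CharmBase, container: ops.Container) -> None:
    """Pre-fix behaviour: a failed start is logged and then forgotten."""
    try:
        container.replan()  # try to start the services in the layer
    except ops.pebble.ChangeError:
        # The failure is effectively swallowed: the unit still ends up 'active',
        # and no later event retries the start unless the config changes.
        logger.exception("failed to start services")
    charm.unit.status = ops.ActiveStatus()
```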

Solution

The worker will retry starting the services for 15 minutes (the default, which is configurable). If it still fails, it raises an exception, letting the charm go into error state, and writes logs that tell the user what is wrong: most likely an issue with an external service the workload depends on to start, such as s3.
The idea is that juju will retry the hook, and the issue will resolve itself once those external services come up.
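
A minimal sketch of this retry-then-error behaviour (not the actual implementation in this PR; the names `restart_services`, `SERVICE_START_TIMEOUT`, `RETRY_INTERVAL`, and `ServiceStartError` are illustrative):

```python
import logging
import time

import ops

logger = logging.getLogger(__name__)

SERVICE_START_TIMEOUT = 15 * 60  # default retry window in seconds, assumed configurable
RETRY_INTERVAL = 30  # pause between attempts; illustrative value


class ServiceStartError(Exception):
    """Raised when the services cannot be started within the timeout."""


def restart_services(charm: ops.CharmBase, container: ops.Container) -> None:
    """Keep trying to start the Pebble services; raise if the timeout elapses."""
    deadline = time.monotonic() + SERVICE_START_TIMEOUT
    while time.monotonic() < deadline:
        charm.unit.status = ops.MaintenanceStatus("waiting for services to start")
        try:
            container.replan()
            return  # services started; the caller can now set ActiveStatus
        except ops.pebble.ChangeError:
            logger.warning(
                "failed to start services; an external dependency (e.g. s3) "
                "may not be up yet; retrying"
            )
            time.sleep(RETRY_INTERVAL)
    # Raising here puts the charm in error state on purpose: juju will retry
    # the hook, and the start will eventually succeed once the external
    # dependency comes up.
    raise ServiceStartError(
        "services did not start within the timeout; "
        "check external dependencies such as s3"
    )
```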

Rejected alternatives:

  • attempting to 'gracefully' handle the situation by putting the charm in Blocked/Waiting is tricky because:
    • we become dependent on update-status or other events to wake us up so we can try again
    • we need to make the 'attempt to restart if not started' logic run unconditionally, which is an overcooked spaghetti code party
  • using resurrect or something similar to wake ourselves up periodically with custom events until the check finally passes
    • hard to predict how those will interact with regular juju events

Context

https://discourse.charmhub.io/t/its-probably-ok-for-a-unit-to-go-into-error-state/13022/30

Testing Instructions

  • Deploy a coordinator/worker charm built with this lib version
  • Deploy an s3 facade so that the config looks valid but s3 is, in practice, broken
  • Watch the charm cling to life for 15 minutes
  • Watch the charm go to error state
  • Now replace the s3 facade with a real s3; the charm should eventually go back to active

PietroPasotti requested a review from a team as a code owner on August 16, 2024 12:16
PietroPasotti merged commit 4f26423 into main on Aug 19, 2024
5 checks passed
PietroPasotti deleted the common-exit-hook-start branch on August 19, 2024 06:52
PietroPasotti added a commit that referenced this pull request Aug 26, 2024
* Catch SecretNotFoundError when privkey shouldn't be here (#48)

* Catch SecretNotFoundError when privkey shouldn't be here

* Bring back if self.tls_available check from pre-refactor times

* lint

* Bump the version that was forgotten in #48 (#50)

* Add issues integration action (#52)

* Add issues integration action

* Order imports differently as lint started to complain

* Update databag model dump return value (#53)

* dump returns databag

* tests

* root ca cert patch

* fixed static checks

* fix

* removed conftest

* added health check logic to worker (#55)

* added health check logic to worker

* adapted status check to be less tempo-specific

* use regular paths for codespell too

* rerun black

* vbump

* added worker error on timeout if restart fails (#56)

* added worker error on timeout if restart fails

* maintenance status throughout retries

* fixed static

* only restart own services

* added tls support for worker checks (#59)

* added tls support for worker checks

* lint

* static fix

* Test coordinator (#60)

* Added the following:
1) Coordinator pytest fixtures
2) Coordinator unit tests
3) Refactoring of roles_config in coordinator.py
  * ClusterRolesConfig was switched to a dataclass with __post_init__ and is_coherent_with methods
4) Created test_roles_config.py tests

* Chore: Fix leftover comments and minor code changes

* * Updates
1) Fmt
2) Merged 'main:coordinator.py' into test-coordinator-cleanup to fix Secrets error

* Added docstrings to ClusterRoleConfig

---------

Co-authored-by: Mateusz Kulewicz <mateusz.kulewicz@canonical.com>
Co-authored-by: PietroPasotti <starfire.daemon@gmail.com>
Successfully merging this pull request may close these issues.

worker won't start if s3 is not configured on pebble-ready, and will never be started