when changing storage size for Postgres error in playbook causes loop #1642

Open
3 tasks done
tylergmuir opened this issue Nov 17, 2023 · 3 comments
@tylergmuir
Contributor

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that the AWX Operator is open source software provided for free and that I might not receive a timely response.

Bug Summary

Changing the value of postgres_storage_requirements causes an error, because the storage request of an existing StatefulSet cannot be changed. The task Create Database if no database is specified in database_configuration.yml fails, and execution drops into the rescue block, which scales everything down to 0.

Then the task Remove PostgreSQL statefulset for upgrade (which should run in this case) fails to evaluate its when condition because create_statefulset_result.error does not exist. Removing the StatefulSet is exactly what is required here.
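
To illustrate the failing condition, here is a minimal sketch of a guarded version of that task. It assumes the registered variable is named create_statefulset_result, as the operator log indicates, and that the StatefulSet is removed with the kubernetes.core.k8s module. The postgres_statefulset_name and target_namespace variables are hypothetical, and the real task in database_configuration.yml may look different.

```yaml
# Hypothetical sketch; not the operator's actual task.
- name: Remove PostgreSQL statefulset for upgrade
  kubernetes.core.k8s:
    state: absent
    api_version: apps/v1
    kind: StatefulSet
    name: "{{ postgres_statefulset_name }}"   # hypothetical variable
    namespace: "{{ target_namespace }}"       # hypothetical variable
    wait: true
  # default(0) makes a missing `error` key compare as false
  # instead of aborting the conditional with an undefined-attribute error.
  when: create_statefulset_result.error | default(0) == 422
```

A guard like this only stops the conditional from failing. When the error key is absent the task is skipped, so the StatefulSet would still not be removed unless the condition is also keyed off something else, such as a difference between the requested and existing storage size.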

AWX Operator version

2.7.2

AWX version

23.4.0

Kubernetes platform

kubernetes

Kubernetes/Platform version

1.26.7

Modifications

no

Steps to reproduce

Have a functioning AWX environment using a managed Postgres pod.

Change the kustomization for the AWX environment to change the value of postgres_storage_requirements. This can be done by either adding it where it wasn't previously used and setting the values to something other than the default, or by increasing the current allocation.
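
As an illustration, the changed part of the AWX custom resource might look like the fragment below. The instance name and the size are placeholders; any size different from the one the existing PVC was created with should be enough to trigger the problem.

```yaml
# Illustrative AWX custom resource; name and size are placeholders.
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-demo              # hypothetical instance name
spec:
  postgres_storage_requirements:
    requests:
      storage: 16Gi           # any value that differs from the current PVC size
```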

Expected results

The statefulset should be deleted and recreated with the new PVC size as defined.

Actual results

The playbook fails, all pods in the AWX environment are scaled to 0, and the operator gets stuck in a loop attempting to update the statefulset.

Additional information

Once you are stuck in this state, you can manually delete the statefulset. The operator then sees that the statefulset is missing and re-creates it. After that, the deployment continues and the environment is brought back up.

Operator Logs

The conditional check 'create_statefulset_result.error == 422' failed. The error was: error while evaluating conditional (create_statefulset_result.error == 422): 'dict object' has no attribute 'error'. 'dict object' has no attribute 'error'.

The error appears to be in '/opt/ansible/roles/installer/tasks/database_configuration.yml': line 175, column 7, but may be elsewhere in the file depending on the exact syntax problem.

@rooftopcellist
Member

@tylergmuir I am confused about your description above. If we changed the operator so that it deleted the PostgreSQL StatefulSet when the storage size was changed, the PVC that the data is stored in would not be deleted. So when the new StatefulSet was created, it would not enter the running state, because the existing PVC would have the same name as the one the new StatefulSet would dynamically try to create. So I think the StatefulSet would try to use the existing PVC, and would try to change resources.requests.storage on the PVC, which is only allowed if the specified StorageClass supports expansion and has allowVolumeExpansion: true set, if I recall correctly.

The problem is that not all users will have StorageClasses that support dynamic expansion.
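
For reference, a StorageClass that permits expansion sets allowVolumeExpansion at the top level of the object. The fragment below is only an illustration: the name and provisioner are placeholders, and the volume plugin behind the class also has to support resizing.

```yaml
# Illustrative StorageClass; name and provisioner are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable-example
provisioner: example.com/csi      # replace with the cluster's actual provisioner
allowVolumeExpansion: true        # required before a bound PVC's storage request can be increased
```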

So, if I am following correctly, we could potentially add logic to support PVC expansion for the db pvc by doing the following:

  • Add a task that compares the existing statefulset and the postgres_storage_requirements value on the spec, and if they are different, delete the StatefulSet and re-create it
  • Add error handling here so that if a user specifies a new storage size and their StorageClass does not support it, we set an error status on the AWX CR, or make it a noop and intentionally exclude the storage request value change.
    • We could potentially key off of the presence and value of the storageclass.allowVolumeExpansion field; but we would also need to know the default storageclass provided by the cluster, or at least the storageclass used to create the existing PVC, because storage_class on the AWX spec is not a required field.
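
The StorageClass check proposed above could be sketched roughly as follows. This is an assumption-laden outline, not operator code: postgres_pvc_name, target_namespace, and postgres_pvc_expandable are hypothetical names, and it assumes the existing PVC has spec.storageClassName set. Reading the class from the PVC avoids having to know the cluster default.

```yaml
# Hypothetical sketch; variable names are placeholders, not taken from the operator.
- name: Look up the existing PostgreSQL PVC
  kubernetes.core.k8s_info:
    api_version: v1
    kind: PersistentVolumeClaim
    name: "{{ postgres_pvc_name }}"         # hypothetical variable
    namespace: "{{ target_namespace }}"     # hypothetical variable
  register: postgres_pvc

- name: Look up the StorageClass that backs the PVC
  kubernetes.core.k8s_info:
    api_version: storage.k8s.io/v1
    kind: StorageClass
    # Assumes the PVC records its class in spec.storageClassName.
    name: "{{ postgres_pvc.resources[0].spec.storageClassName }}"
  register: postgres_sc
  when: postgres_pvc.resources | length > 0

- name: Record whether the PVC can be expanded
  ansible.builtin.set_fact:
    # Falls back to false when no StorageClass was found or the field is unset.
    postgres_pvc_expandable: >-
      {{ (postgres_sc.resources | default([]) | first | default({})).allowVolumeExpansion
         | default(false) | bool }}
```

A later task could then skip the StatefulSet deletion, or set an error status on the AWX CR, when postgres_pvc_expandable is false and the requested size differs from the existing one.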

What do you think @tylergmuir ? Can you think of any other considerations? Does what I said above make sense/align with what you've seen experimentally?

Also, if you or anyone else has a good idea of how this could work and has time, a PR would be welcome.

@tylergmuir
Contributor Author

@rooftopcellist I believe you have it all right. In my case, I had a PVC that used a storage class that did support being expanded. So all I had to do to get back to a working state was delete the StatefulSet and the rest of the existing code in the Operator handled resizing the PVC, creating the StatefulSet using that resized PVC, and building the pods on top of that.

The main issue that I ran into was that the operator consumed the change to postgres_storage_requirements and brought down the Postgres pod, but then got stuck in a loop of trying and failing to update the StatefulSet (because the storage in a StatefulSet is immutable).

But like you mentioned, if the user has a storage class that isn't resizable, we would need some way of cleanly stopping the process before it starts, so the service isn't taken down to wait for a PVC resize that will never happen.
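
To make the immutability concrete, here is an illustrative StatefulSet (not the one the operator generates; names, image, and sizes are placeholders). Kubernetes accepts in-place updates to only a few spec fields such as replicas and template. A change under volumeClaimTemplates is rejected by the API server, which would be consistent with the 422 the operator's conditional checks for.

```yaml
# Illustrative manifest; names, image, and sizes are placeholders.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: example-postgres
spec:
  serviceName: example-postgres
  replicas: 1                       # may be updated in place
  selector:
    matchLabels:
      app: example-postgres
  template:                         # may be updated in place
    metadata:
      labels:
        app: example-postgres
    spec:
      containers:
        - name: postgres
          image: postgres:15
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:             # cannot be updated after creation
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 8Gi            # changing this on an existing StatefulSet is rejected
```

The PVC created from this template would be named data-example-postgres-0, so a re-created StatefulSet with the same name and template name binds to the existing claim rather than creating a new one.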

@ranvit
Contributor

ranvit commented Jun 14, 2024

Great description @tylergmuir, I had the same issue of my postgres statefulset not spinning up and the same symptoms with conditional (create_statefulset_result.error == 422): 'dict object' has no attribute 'error'. 'dict object' has no attribute 'error'.

I was worried about deleting the statefulset and losing the PVC, but since there's no explicit retention policy defined, the PVC remained after I deleted my statefulset, and then a new statefulset spun up. Thank you!
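
For context, the retention behaviour mentioned here corresponds to the StatefulSet field shown below. This fragment is only an illustration: the field is available only on clusters where the feature is enabled, and when it is omitted Kubernetes keeps the PVCs, which matches the behaviour described above.

```yaml
# Illustrative StatefulSet fragment; Retain is the behaviour when the field is omitted.
spec:
  persistentVolumeClaimRetentionPolicy:
    whenDeleted: Retain     # keep PVCs when the StatefulSet is deleted
    whenScaled: Retain      # keep PVCs when replicas are scaled down
```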
