This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

AKS: prometheus-operator gets installed before default storage class object is created #855

Closed
invidian opened this issue Aug 25, 2020 · 5 comments · Fixed by #886

@invidian
Member

invidian commented Aug 25, 2020

This causes CI to fail, because the prometheus-operator component never converges.

This is related to #559. To solve this one, however, we should perhaps make sure that the default storage class is available before the AKS installation is considered converged. We could, for example, install the storage class on AKS explicitly via a chart, to avoid waiting for AKS to converge.

@invidian invidian added bug Something isn't working area/ci Items related to CI labels Aug 25, 2020
@invidian
Member Author

Another idea would be to have a post-install hook in Go, available to platforms, that waits for the storage class to show up. Similarly, we could wait by default for nodes to become ready. CC @johananl

@johananl
Member

johananl commented Aug 26, 2020

Conceptually it could make more sense to me to have an optional pre-install hook for components rather than a post-install hook for platforms in the context of this problem.

This is how I think about the problem you're describing:

"Before installing component X I want to ensure that condition Y is met".

Translating that into software, I would tie the logic to the component's installation rather than to the cluster's deployment even if chronologically the result is the same (at the moment!).
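
A minimal sketch of that idea in Go, assuming a hypothetical `Component` type with an optional `PreInstall` check. None of these names come from the actual codebase; they only illustrate tying the condition to the component rather than to the cluster deployment.

```go
package main

import (
	"errors"
	"fmt"
)

// Component is a hypothetical, simplified stand-in for an installable component.
type Component struct {
	Name string
	// PreInstall is optional. When set, it must succeed before Install runs.
	PreInstall func() error
	// Install is required and performs the actual installation.
	Install func() error
}

// errNoDefaultStorageClass is an illustrative error for "condition Y is not met".
var errNoDefaultStorageClass = errors.New("no default storage class")

// installComponent runs the optional pre-install check, then the installation.
func installComponent(c Component) error {
	if c.PreInstall != nil {
		if err := c.PreInstall(); err != nil {
			return fmt.Errorf("pre-install check for %q failed: %w", c.Name, err)
		}
	}

	return c.Install()
}

func main() {
	installed := false

	c := Component{
		Name:       "prometheus-operator",
		PreInstall: func() error { return errNoDefaultStorageClass },
		Install:    func() error { installed = true; return nil },
	}

	// The failing check stops the installation before Install is called.
	err := installComponent(c)
	fmt.Println(err, installed)
}
```

With this shape, only components that declare a check pay for it, which matches the "only some of the components" case; halting every component would still need something on the platform side.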

Another thought: maybe this hints at a more generic problem called component dependencies. True, you can do whatever using a hook which allows executing arbitrary logic, but I can imagine more cases where we want to ensure the existence of components and/or their order of deployment, too.

To summarize, I would first consider whether it makes sense to introduce a component dependency mechanism (can Helm help?), and only then I'd look for a less structured solution such as hooks. Lastly, if the hook is related to components, IMO it should be tied to components.

@johananl
Member

Reading a 2nd time, I suspect the storage class is an AKS thing rather than another component here. If so, sounds like a component-specific hook could work, assuming that we want to enforce this logic for only some of the components. If we want to halt all component deployments until some condition is met, in this case it indeed makes sense to me to use a platform post-deployment hook.

@invidian
Member Author

invidian commented Aug 26, 2020

Thanks for your input @johananl.

> Conceptually it could make more sense to me to have an optional pre-install hook for components rather than a post-install hook for platforms in the context of this problem.

I was thinking of shifting the task more towards the platform side, because to me it looks like the cluster has not really converged yet when Terraform reports that the AKS cluster has been created.

> To summarize, I would first consider whether it makes sense to introduce a component dependency mechanism (can Helm help?), and only then I'd look for a less structured solution such as hooks. Lastly, if the hook is related to components, IMO it should be tied to components.

I agree about a pre-install hook for components, and yes, Helm should be able to help us here (though it would most likely require modifying the upstream chart if we decide to use that). However, I would rather expect the component to fail if the default storage class is not defined than to wait for it indefinitely, which would be the case in this scenario.

> Another thought: maybe this hints at a more generic problem called component dependencies. True, you can do whatever using a hook which allows executing arbitrary logic, but I can imagine more cases where we want to ensure the existence of components and/or their order of deployment, too.

We were thinking about dependency management for components, but between components rather than between a component and the cluster state, which is more difficult to express (though such a thing would perhaps also make sense).

EDIT:

> Reading a 2nd time, I suspect the storage class is an AKS thing rather than another component here. If so, sounds like a component-specific hook could work, assuming that we want to enforce this logic for only some of the components. If we want to halt all component deployments until some condition is met, in this case it indeed makes sense to me to use a platform post-deployment hook.

Exactly.

@iaguis iaguis added this to the v0.4.0 milestone Aug 31, 2020
invidian added a commit that referenced this issue Sep 1, 2020
To address the issue with AKS having delayed creation of the default
StorageClass, which affects installing components which depend on
storage, this commit introduces the PlatformPostApplyHook interface,
which platforms will be able to implement to obtain the kubeconfig file
content after the cluster is installed and run their own sanity checks.
In the case of AKS, the hook will loop and wait until the default
storage class appears on the cluster.

Refs #855

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Sep 1, 2020
To address the issue with AKS having delayed creation of the default
StorageClass, which affects installing components which depend on
storage, this commit adds support for running an optional
PlatformPostApplyHook after the cluster has been installed.

Refs #855

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Sep 1, 2020
This commit implements the newly introduced platform.PostApplyHook for
AKS clusters, to address the issue where components depending on storage
get installed before the default storage class has been created by the
AKS controller. This causes the components to get stuck, which makes
cluster provisioning fail.

The implemented hook lists the storage classes available on the cluster
and returns when a class with the default storage class annotation is
found. The default storage class usually appears on the cluster within
5 minutes after cluster creation, so a 10-minute timeout seems like a
sane default for this operation.

Closes #855

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
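
The waiting logic described in this commit message can be sketched with the standard library alone. The `StorageClass` struct, the `Lister` function type, and the `simulate` helper below are hypothetical stand-ins for a real Kubernetes client and cluster; only the two annotation keys are the ones Kubernetes uses to mark a default class. The poll interval and timeouts are shortened for illustration, where the commit message suggests 10 minutes, and the class names are only examples.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Annotation keys Kubernetes uses to mark a StorageClass as the default one.
const (
	defaultAnnotation     = "storageclass.kubernetes.io/is-default-class"
	betaDefaultAnnotation = "storageclass.beta.kubernetes.io/is-default-class"
)

// StorageClass is a hypothetical, trimmed-down stand-in for the Kubernetes object.
type StorageClass struct {
	Name        string
	Annotations map[string]string
}

// Lister is a hypothetical abstraction over "list storage classes on the cluster".
type Lister func(ctx context.Context) ([]StorageClass, error)

// isDefault reports whether the class carries a default-class annotation set to "true".
func isDefault(sc StorageClass) bool {
	return sc.Annotations[defaultAnnotation] == "true" ||
		sc.Annotations[betaDefaultAnnotation] == "true"
}

// waitForDefaultStorageClass polls list until a default class appears or ctx expires.
// In this sketch a listing error aborts the wait instead of being retried.
func waitForDefaultStorageClass(ctx context.Context, list Lister, interval time.Duration) (string, error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		classes, err := list(ctx)
		if err != nil {
			return "", fmt.Errorf("listing storage classes: %w", err)
		}

		for _, sc := range classes {
			if isDefault(sc) {
				return sc.Name, nil
			}
		}

		select {
		case <-ctx.Done():
			return "", fmt.Errorf("waiting for default storage class: %w", ctx.Err())
		case <-ticker.C:
		}
	}
}

// simulate fakes a cluster whose default class shows up on the N-th list call.
func simulate(appearAfterCalls, timeoutMs int) (string, error) {
	calls := 0
	list := func(ctx context.Context) ([]StorageClass, error) {
		calls++
		classes := []StorageClass{{Name: "managed-premium"}}
		if calls >= appearAfterCalls {
			classes = append(classes, StorageClass{
				Name:        "default",
				Annotations: map[string]string{defaultAnnotation: "true"},
			})
		}
		return classes, nil
	}

	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()

	return waitForDefaultStorageClass(ctx, list, 5*time.Millisecond)
}

func main() {
	// The default class appears on the third poll, well within the timeout.
	name, err := simulate(3, 1000)
	fmt.Println(name, err)

	// The default class never appears in time, so the wait ends with a deadline error.
	_, err = simulate(1000000, 50)
	fmt.Println(errors.Is(err, context.DeadlineExceeded))
}
```

Bounding the wait with a context deadline is what turns "the component never converges" into an explicit, reportable failure.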
invidian added a commit that referenced this issue Sep 1, 2020
To address the issue with AKS having delayed creation of the default
StorageClass, which affects installing components which depend on
storage, this commit adds support for running an optional
platform.PostApplyHook after the cluster has been installed.

Refs #855

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
invidian added a commit that referenced this issue Sep 1, 2020
To address the issue with AKS having delayed creation of the default
StorageClass, which affects installing components which depend on
storage, this commit introduces the PostApplyHook interface, which
platforms will be able to implement to obtain the kubeconfig file
content after the cluster is installed and run their own sanity checks.
In the case of AKS, the hook will loop and wait until the default
storage class appears on the cluster.

Refs #855

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
@invidian
Member Author

invidian commented Sep 1, 2020

Created a PR with a fix: #886.

invidian added a commit that referenced this issue Sep 1, 2020
To address the issue with AKS having delayed creation of the default
StorageClass, which affects installing components which depend on
storage, this commit introduces the PlatformWithPostApplyHook interface,
which platforms will be able to implement to obtain the kubeconfig file
content after the cluster is installed and run their own sanity checks.
In the case of AKS, the hook will loop and wait until the default
storage class appears on the cluster.

Refs #855

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>
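
An optional interface like the one named in this commit message is commonly wired up in Go with a type assertion. The sketch below only assumes that shape: the `Platform` interface, the `PostApplyHook` method signature, `applyPlatform`, and both example platform types are hypothetical and are not taken from the actual codebase.

```go
package main

import "fmt"

// Platform is a hypothetical minimal interface every platform implements.
type Platform interface {
	Apply() error
	Kubeconfig() ([]byte, error)
}

// PlatformWithPostApplyHook is the optional extension. The interface name comes
// from the commit message; the method name and signature are assumptions.
type PlatformWithPostApplyHook interface {
	PostApplyHook(kubeconfig []byte) error
}

// applyPlatform applies the platform, then runs the hook only if the platform has one.
func applyPlatform(p Platform) error {
	if err := p.Apply(); err != nil {
		return fmt.Errorf("applying platform: %w", err)
	}

	hook, ok := p.(PlatformWithPostApplyHook)
	if !ok {
		// Platforms without a hook are done once Apply succeeds.
		return nil
	}

	kubeconfig, err := p.Kubeconfig()
	if err != nil {
		return fmt.Errorf("getting kubeconfig: %w", err)
	}

	if err := hook.PostApplyHook(kubeconfig); err != nil {
		return fmt.Errorf("running post-apply hook: %w", err)
	}

	return nil
}

// plainPlatform has no hook.
type plainPlatform struct{}

func (plainPlatform) Apply() error                { return nil }
func (plainPlatform) Kubeconfig() ([]byte, error) { return []byte("fake-kubeconfig"), nil }

// aksLikePlatform adds a hook on top of the basic platform behaviour.
type aksLikePlatform struct {
	plainPlatform
	hookCalls int
}

// PostApplyHook is where an AKS-like platform would wait for the default storage class.
func (a *aksLikePlatform) PostApplyHook(kubeconfig []byte) error {
	a.hookCalls++
	return nil
}

func main() {
	aks := &aksLikePlatform{}
	fmt.Println(applyPlatform(aks), aks.hookCalls)
	fmt.Println(applyPlatform(plainPlatform{}))
}
```

Keeping the hook in a separate interface means existing platforms need no changes, which fits the "optional" wording used in the commit messages.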
invidian added a commit that referenced this issue Sep 1, 2020
This commit implements the newly introduced
platform.PlatformWithPostApplyHook for AKS clusters, to address the
issue where components depending on storage get installed before the
default storage class has been created by the AKS controller. This
causes the components to get stuck, which makes cluster provisioning
fail.

The implemented hook lists the storage classes available on the cluster
and returns when a class with the default storage class annotation is
found. The default storage class usually appears on the cluster within
5 minutes after cluster creation, so a 10-minute timeout seems like a
sane default for this operation.

Closes #855

Signed-off-by: Mateusz Gozdek <mateusz@kinvolk.io>