Name the .cache and kustomize directories based on the KFDef name #242

Closed
vpavlin opened this issue Feb 17, 2020 · 22 comments

Comments

@vpavlin
Member

vpavlin commented Feb 17, 2020

Currently, the Kubeflow operator uses /tmp/.cache and /tmp/kustomize to store the downloaded and generated manifests.

Ideally these directories should be named after the KFDef resource, because once /tmp/kustomize exists, the operator skips the kustomize.Generate step:

time="2020-02-17T12:48:48Z" level=info msg="folder /tmp/kustomize exists, skip kustomize.Generate"

Imagine a scenario where:

  1. A user submits KFDef

  2. Realizes they need to add another manifests repo to the list

  3. Adds the manifests repository and runs kubectl apply -f

  4. The operator fails with:

    time="2020-02-17T12:48:48Z" level=error msg="error evaluating kustomization manifest for component Error absolute path error in '/tmp/kustomize/component' : evalsymlink failure on '/tmp/kustomize/component'
    

    because it did not update /tmp/kustomize

There are probably other scenarios - like someone adding a resource to the Kubeflow manifests and the operator ignoring it - but I guess for development it is ok to just fall back to kfctl directly.
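For illustration, a rough Go sketch of what naming the working directories after the KFDef could look like - the helper name and directory layout below are assumptions, not existing operator code:

```go
// Hypothetical sketch: derive per-KFDef working directories instead of the
// shared /tmp/.cache and /tmp/kustomize. Layout and names are assumptions.
package main

import (
	"fmt"
	"path/filepath"
)

// kfdefDirs returns cache and kustomize paths scoped to a single KFDef,
// e.g. /tmp/<namespace>-<name>/.cache and /tmp/<namespace>-<name>/kustomize.
func kfdefDirs(namespace, name string) (cacheDir, kustomizeDir string) {
	base := filepath.Join("/tmp", fmt.Sprintf("%s-%s", namespace, name))
	return filepath.Join(base, ".cache"), filepath.Join(base, "kustomize")
}

func main() {
	cache, kustomize := kfdefDirs("kubeflow", "kubeflow-operator")
	fmt.Println(cache)     // /tmp/kubeflow-kubeflow-operator/.cache
	fmt.Println(kustomize) // /tmp/kubeflow-kubeflow-operator/kustomize
}
```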

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/bug 0.66


@vpavlin vpavlin changed the title Name the .cache and kustomize directories base don the KFDef name Name the .cache and kustomize directories based on the KFDef name Feb 17, 2020
@adrian555
Member

@vpavlin say we change the path from /tmp/kustomize to /tmp/kubeflow-operator/kustomize, where kubeflow-operator is the KfDef name. I am not sure how this would solve the problem you describe above, though. After step 3, the KfDef name does not change, so /tmp/kubeflow-operator/kustomize would still exist. Am I missing something here? Could you please clarify a bit?

@vpavlin
Member Author

vpavlin commented Feb 29, 2020

Hmm... now that I read my description, it really does not make much sense. Sorry about that.

IIRC I came across this issue when I submitted a KFDef named X, deleted it, and submitted another, different, KFDef named Y, where Y = X + an additional component.

And I conflated this problem with potentially another issue, where we want to submit 2 KFDefs - in case, in the future, we gain the ability to enable components at namespace scope.

So I would now say there are two issues: the one in the title - the paths to .cache and kustomize should be based on the resource name - and another where the cache needs to be updated when the KFDef resource is updated by a user.

What do you think?

@adrian555
Member

@vpavlin yes, I agree, for the case where Kubeflow supports multiple instances within a namespace or a cluster (note that the current operator is cluster scoped). Basing the paths on just the resource name may only partially solve the update problem, though. We can have a PR to append the KfDef name, as I think that is the right thing to do.

I also agree the bigger problem is when the configuration of a certain component gets changed but the KfDef name remains the same. Because the kustomization regeneration is bypassed, the actual Kubeflow deployment remains the same even though the KfDef resource itself is updated.

Letting the operator always regenerate the kustomization directories and files may work for adding additional components. But what if a component is removed from the current KfDef resource? I am still thinking about how to handle these cases and whether the operator should behave the same way as a command-line deployment.
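For the removed-component case, one possible approach (sketched here with illustrative names, not actual operator code) would be to diff the application lists of the old and new KfDef and drop the stale kustomize subdirectories:

```go
// Rough sketch, assuming the application names of the previous and current
// KfDef spec are available as string slices; not actual operator code.
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// removedComponents returns applications present in the old spec but missing
// from the new one.
func removedComponents(prev, curr []string) []string {
	current := make(map[string]bool, len(curr))
	for _, a := range curr {
		current[a] = true
	}
	var removed []string
	for _, a := range prev {
		if !current[a] {
			removed = append(removed, a)
		}
	}
	return removed
}

func main() {
	prev := []string{"jupyter-web-app", "seldon-core-operator"}
	curr := []string{"jupyter-web-app"}
	for _, c := range removedComponents(prev, curr) {
		// Drop the stale generated manifests for the removed component.
		dir := filepath.Join("/tmp/kustomize", c)
		fmt.Println("would remove", dir)
		_ = os.RemoveAll(dir)
	}
}
```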

@vpavlin
Member Author

vpavlin commented Mar 16, 2020

My approach would be:

  • Apply - start from a clean cache (this sounds obvious, but if the Delete event does not clean the cache, apply would not start from scratch; consider:
    • apply
    • operator crash
    • delete
    • operator restart
    • apply - here we would not start from a fresh cache even if delete normally cleans it, so it would be best to always clean the cache on apply)
  • Update - start from a clean cache
  • Delete - we don't do much here since we agreed on using ownerReferences; just delete the cache
  • Full reconciliation - use the existing cache

This way any new KFDef or update would force manifest regeneration, and timed reconciliation would run from a well-known cache.
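A minimal Go sketch of that event handling, with hypothetical event and helper names (not the operator's actual types), might look like this:

```go
// Sketch only: decide per event whether to wipe the per-KFDef working
// directory before the usual kfApply/kustomize.Generate logic runs.
package main

import (
	"fmt"
	"os"
)

type event int

const (
	eventApply event = iota
	eventUpdate
	eventDelete
	eventReconcile
)

func handle(e event, dir string) error {
	switch e {
	case eventApply, eventUpdate, eventDelete:
		// Apply/Update start from a clean cache; Delete removes it entirely.
		if err := os.RemoveAll(dir); err != nil {
			return err
		}
	case eventReconcile:
		// Periodic reconciliation reuses the existing cache as-is.
	}
	if e == eventDelete {
		return nil // nothing to regenerate; ownerReferences clean up the rest
	}
	fmt.Println("regenerating manifests under", dir)
	return nil
}

func main() {
	if err := handle(eventApply, "/tmp/kubeflow-operator"); err != nil {
		fmt.Println("apply failed:", err)
	}
}
```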

WDYT @adrian555

@jtfogarty

/area kfctl
/priority p1

@vpavlin
Member Author

vpavlin commented Apr 29, 2020

Hey! Any progress here?

@adrian555
Member

Thanks @vpavlin for checking back. I am still working on this. Basically only when an update is detected to a KfDef will the existing cache be used; otherwise we will generate a new directory and cache for a new KfDef.

@nakfour
Member

nakfour commented Apr 29, 2020

@adrian555 we are bumping into this with our new ODH 0.6 operator; see issue opendatahub-io/opendatahub-operator#9.
Basically it is not just the cache, but also the kustomize folder. In general terms, the user should be allowed to create multiple instances of the API (in this case KfDef) on the cluster without any issues, even if the operator is cluster wide. Currently, the following use case is broken with this operator:

  1. Cluster admin installs this operator cluster wide
  2. User A creates a namespace A and creates an instance Kfdef (name: A, namespace: A). This works.
  3. User B creates a namespace B and creates an instance of KfDef that is different from the KfDef in step 2 (name: B, namespace: B). This fails with the errors below; in this case seldon was the new component:
msg="folder /tmp/kustomize exists, skip kustomize.Generate"
time="2020-04-27T17:20:19Z" level=info msg="/tmp/.cache/manifests exists; not resyncing "
time="2020-04-28T22:13:55Z" level=error msg="error evaluating kustomization manifest for seldon-core-operator Error absolute path error in '/tmp/kustomize/seldon-core-operator' : evalsymlink failure on '/tmp/kustomize/seldon-core-operator' : lstat /tmp/kustomize/seldon-core-operator: no such file or directory"

@adrian555
Member

@nakfour yes, this is an expected problem with the current implementation, as already mentioned in previous comments. Therefore, we need a different directory for each KfDef instance.

@vpavlin
Member Author

vpavlin commented Apr 30, 2020

Thanks @adrian555, it's good to know this is still on your radar:)

Basically only when an update is detected

So what happens if the update is a new component added into the KFDef? I believe that would fail unless the kustomize dir is regenerated.

@nakfour
Member

nakfour commented Apr 30, 2020

@adrian555 I am not clear on why we need to keep different kustomize folders. What are they used for? From an operator perspective, users create instances of APIs and they delete them. I don't see a need to keep the kustomize folder, since I don't see a user pattern of re-installing the same API instance over and over again. According to kubeflow/kubeflow#4368, the kustomize folder is kept to allow users to insert data into it, which is not the case with operators.

@nakfour
Member

nakfour commented Apr 30, 2020

@adrian555 there is also one use case to add: users might edit the instance yaml and change it, but this still requires a newly generated kustomize folder.

@adrian555
Member

@nakfour what I said is that each instance will have a directory with a different name. This is to support the use case you mentioned (name A and name B coexisting). The directory for each instance will be deleted when the instance is deleted, just like @vpavlin described.

@adrian555
Member

Yes @vpavlin, for a new application added to the KfDef, the cache and kustomize directories will be regenerated/synced. I am hoping I do not need to wipe out the whole directory, but it may come down to removing the whole directory and recreating it.
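One way to decide when regeneration is actually needed, without always wiping the directory, is purely a sketch under the assumption that a hash of the KfDef spec is recorded somewhere between runs:

```go
// Illustrative sketch: regenerate the kustomize output only when the KfDef
// spec changed, by comparing a hash of the spec with the one from the
// previous run. The spec struct and where the hash is stored are assumptions.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

type kfDefSpec struct {
	Applications []string `json:"applications"`
	Repos        []string `json:"repos"`
}

// specHash returns a stable digest of the spec contents.
func specHash(s kfDefSpec) string {
	b, _ := json.Marshal(s)
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

func main() {
	previous := specHash(kfDefSpec{Applications: []string{"jupyter-web-app"}})
	current := specHash(kfDefSpec{Applications: []string{"jupyter-web-app", "seldon-core-operator"}})
	if previous != current {
		fmt.Println("spec changed: wipe and regenerate the kustomize directory")
	}
}
```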

@nakfour
Member

nakfour commented Apr 30, 2020

@adrian555 it is still not clear to me. Your sentence was "each instance will have a directory of different name. This is to support the use case you mentioned (name A and name B coexist)." What does support mean? You can have name A and name B coexisting if you just wipe out the kustomize folder on each install. It is not clear to me why the kustomize folder is needed.

@adrian555
Member

@nakfour the reconcile routine calls the same kfApply function as during the install, so the kfapp directory cannot be wiped out or replaced after the install.

@vpavlin
Member Author

vpavlin commented May 1, 2020

@adrian555 If you can provide even a WIP PR, we'd be happy to test it and document the potential cases. There are probably 2 separate problems:

  • .cache - if something changes in a manifests repository, the cache needs to be updated
  • kustomize - if something changes in the KFDef, the kustomize dir needs to be regenerated

And that is the case for both apply and update; obviously, delete should just clean everything up.

The cache case could be solved by verifying some checksums, if we want to avoid deleting it every time (a rough sketch follows at the end of this comment).

But as soon as we update the cache, we need to clean up kustomize as well, to prevent stale generated manifests

The kustomize case basically means that if the KFDef changes in any way, the kustomize dir needs to be regenerated, otherwise we would not be working with the correct manifests.

So the simplest (lame) solution really is to run from a clean cache and kustomize directory every time any event is triggered.

A more complicated (smarter) solution is to track the manifest source, make sure the cache is not outdated, and make sure each change in the cache or the KFDef regenerates kustomize - but I am not sure this is worth the potential implementation pain.
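The checksum idea could look roughly like this sketch - the archive and marker-file paths are assumptions, not what kfctl actually writes:

```go
// Sketch: after downloading a fresh copy of the manifests archive, compare
// its digest against the digest recorded for the current .cache; only replace
// the cache (and regenerate kustomize) when they differ.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
)

// digestOf returns the sha256 of a file such as the cached manifests tarball.
func digestOf(path string) (string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	fresh, err := digestOf("/tmp/manifests-download.tar.gz") // freshly fetched copy (hypothetical path)
	if err != nil {
		fmt.Println("no fresh download to compare:", err)
		return
	}
	recorded, _ := os.ReadFile("/tmp/.cache/manifests.sha256") // digest saved for the current cache (hypothetical)
	if string(recorded) != fresh {
		fmt.Println("manifests changed: refresh /tmp/.cache and regenerate /tmp/kustomize")
		_ = os.WriteFile("/tmp/.cache/manifests.sha256", []byte(fresh), 0o644)
	}
}
```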

@nakfour
Member

nakfour commented May 1, 2020

@adrian555 when is the reconcile routine called? We have a quick fix in ODH of deleting the kustomize folder on every apply request here: opendatahub-io/opendatahub-operator#14
It seems to work, but maybe we missed the specific use case where reconcile is called. Does this team have a weekly community meeting where we can discuss issues? If not, you are welcome to join our ODH community meetings to discuss.
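That quick fix amounts to something like the following sketch (the surrounding function is illustrative, not the actual ODH code):

```go
// Sketch of the "always clean" approach: remove the generated kustomize
// directory before every apply so it is always rebuilt from the current KfDef.
package main

import (
	"fmt"
	"os"
)

func applyKfDef(kustomizeDir string) error {
	// Start from a clean kustomize tree so stale components never linger.
	if err := os.RemoveAll(kustomizeDir); err != nil {
		return fmt.Errorf("cleaning %s: %w", kustomizeDir, err)
	}
	fmt.Println("regenerating manifests into", kustomizeDir)
	// ... the usual kfApply / kustomize.Generate logic would run here ...
	return nil
}

func main() {
	if err := applyKfDef("/tmp/kustomize"); err != nil {
		fmt.Println(err)
	}
}
```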

@vpavlin
Member Author

vpavlin commented May 19, 2020

@adrian555 Thanks for the fix, closing the issue!

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
kind/feature 0.68


