Configmap removals not being recognized in WATCH mode #25
Comments
Hi @gburton1, that warning is fine given the RBAC you have provided. However, I'm not able to reproduce the behaviour you're seeing... This is what I see from a fresh pod after creating some configmaps and then deleting them:

As you can see, the logs show everything being added/deleted as expected. In your case, I see that you have:

Are the configmaps you're creating/deleting in that namespace?
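For reference, the reason the namespace matters: a namespaced Role/RoleBinding only gives the sidecar visibility into configmaps in that one namespace. A minimal sketch of that kind of setup (names here are illustrative, not taken from this issue) looks like:

```yaml
# Illustrative only: a namespaced Role limits what the sidecar can watch
# to configmaps in the "monitoring" namespace.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: sidecar              # hypothetical name
  namespace: monitoring      # hypothetical namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: sidecar-configmaps
  namespace: monitoring
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: sidecar-configmaps
  namespace: monitoring
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: sidecar-configmaps
subjects:
  - kind: ServiceAccount
    name: sidecar
    namespace: monitoring
```

Configmaps created or deleted outside that namespace would never produce events for the sidecar.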
Ah, the situation is a little different than I thought. The configmap is not actually being deleted and recreated; its contents are just being updated.

Before:

After:
Ah, I see... that really is a different situation. The operator is fully stateless; it will create/delete/update files wherever you tell it to. At that point, it has no memory that the configmap previously had different contents.
Yeah, thanks for the clarity here! I appreciate the response, and your work on this side project to overcome the issue in the original!
Hi @OmegaVVeapon, what behavior do you expect in this case? Let's say a configmap defines two files. When the configmap is created, the sidecar mounts the two files into the desired location. If you delete the configmap, then the sidecar removes the two files from that location.

But if you update the configmap content to remove one of the two files, we see that the sidecar does not remove that file from where it was mounted. Is that expected behavior?
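To make the scenario concrete, here is a hypothetical before/after of such an update (names, labels, and contents are illustrative):

```yaml
# Before: the sidecar writes dashboard-a.json and dashboard-b.json to disk.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-dashboards    # hypothetical name
  labels:
    grafana_dashboard: "1"    # hypothetical label the sidecar watches for
data:
  dashboard-a.json: '{"title": "A"}'
  dashboard-b.json: '{"title": "B"}'
---
# After: dashboard-b.json is removed from the configmap, but the file the
# sidecar previously wrote for it is left behind on disk.
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-dashboards
  labels:
    grafana_dashboard: "1"
data:
  dashboard-a.json: '{"title": "A"}'
```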
Unfortunately, I think this would fall under the statelessness problem... The sidecar has no way to know that there were previously two files in the configmap. I'll have to think about whether there's a way we can re-add some kind of "memory" for cases like these without falling into the issues that persistence generated before.

Just to give you some history: in previous iterations, we relied on kopf's persistence via an annotation on the watched objects. With that annotation, you would be able to "diff" the previous run against the current one and know that things changed. It also created some issues with some k8s validation tools that didn't expect this magical annotation to appear in the object's metadata.

I'll have to see if there's something we can do without adding instability... Is this a blocker for you?
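For context on the history above: kopf's default persistence keeps a diff base in an annotation on the watched object itself. Assuming that is the mechanism being referred to, the annotated configmap would look roughly like this (name and contents are illustrative):

```yaml
# Roughly what kopf's default diff-base annotation looks like on a watched
# configmap (assumption: this is the persistence mechanism mentioned above).
apiVersion: v1
kind: ConfigMap
metadata:
  name: example-dashboards    # hypothetical name
  annotations:
    kopf.zalando.org/last-handled-configuration: |
      {"data": {"dashboard-a.json": "{\"title\": \"A\"}"}}
data:
  dashboard-a.json: '{"title": "A"}'
```

Diffing against that stored state is what made delete detection possible, at the cost of mutating the watched objects.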
We are actually using the sidecar for Grafana dashboards too, and we also use it for Cortex alert rules (basically the same thing as Prometheus alert rules). Our users like the pattern of grouping similar alert rules and dashboards into the same configmap, so the pattern is everywhere. We have a PR check on size, so that people know when they have hit the ConfigMap size limit. We have also faced the annotation size issue when our CD tool (ArgoCD) was decorating these configmaps with a "last-configuration" annotation, so we have disabled those.

Making the sidecar add an annotation comes with the big downsides you described. ArgoCD, being a strict infra-as-code enforcer, will fight to revert any change like that to objects it manages. So I see why you removed state from the current iteration of the sidecar.

So yes, this is a bit of a blocker. I need to find a good plan soon because users are noticing that rules/dashboards are not getting deleted. What about adding a boolean configuration to this sidecar that defaults to FALSE, but when TRUE tells the sidecar that, while handling UPDATE events, it should delete everything in the target folder before writing out the files from the updated configmap?
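Something along these lines on the sidecar container is what I have in mind; the flag name here is purely hypothetical and does not exist today:

```yaml
# Hypothetical sketch only: an opt-in flag that would tell the sidecar to wipe
# the target folder on UPDATE events before re-writing the configmap's current
# contents. Neither the variable name nor the behavior exists in the sidecar.
containers:
  - name: rules-sidecar                      # hypothetical container name
    image: example/kopf-k8s-sidecar:latest   # illustrative image reference
    env:
      - name: LABEL                          # selector label, as in the original k8s-sidecar
        value: cortex_rules
      - name: FOLDER                         # where matching files get written
        value: /tmp/rules
      - name: CLEAR_FOLDER_ON_UPDATE         # hypothetical new flag, default FALSE
        value: "true"
```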
We have a pretty decent workaround that we were already getting for another reason in our dashboards setup, in order to avoid the bulky last-applied annotation. We hadn't needed to worry about the bulky annotations with rules because they are smaller, but now we are applying the same configuration as dashboards anyway, just to get this other benefit.

Now the only problem we are seeing is an occasional flakiness where one instance out of many identical ones, all watching the same k8s resources, will just stop getting events from a resource. So while most of the instances have a flurry of delete and create activity, one instance can just sit there with no activity. We'll try upgrading to 1.3.5, but I'm not sure that's the issue.
Sounds good, glad you found that workaround!

Regarding the occasional flakiness, I'm curious about your setup. You mention you have several instances monitoring the same resources; do you have multiple Grafana instances (one per team, I'd assume), and how do you segregate them?

Do let me know if 1.3.5 solves your issues. There's a new release of kopf, 1.35.0, that includes one more robustness check, so I was planning on bumping to that anyway. Since we haven't seen issues on our end for a fairly long time, my urgency on this was low, but I'll gladly tag a release if you continue to experience issues.
Have you ever heard of Cortex? It's basically a horizontally scalable version of Prometheus that implements all the Prometheus APIs. One component of it is the Ruler, which evaluates alert rules, and it can scale across multiple instances and shard rules across them. So the sidecar on each instance watches all rules (in configmaps) and mounts all of them into a staging directory on each Ruler instance, and then the Ruler instances negotiate who will take which ones, so the runtime config has no duplication. It works great. And when the sidecar deletes one from the staging directory, the instances respond by removing it from the runtime config.

In this bug case, I observed one instance out of three in which the sidecar never removed a file, even though the sidecars on the other two instances, which were watching the same config, removed it fine. The logs reflect that: on the instance that did not remove the file, there is no activity at all in the logs at the time the other instances were responding to the configmap change events.

Now I'm fighting that new healthcheck that was added in this release. It has a port conflict with something else in my pod 😧
The new health check port is hard-coded to 8080 in kopf-k8s-sidecar. Unfortunately, it looks like that port is deeply entrenched in all my stuff. It will take some time to move ports around without breaking things.
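To illustrate why it conflicts: containers in the same pod share one network namespace, so the sidecar's hard-coded health check port collides with anything else in the pod already listening on 8080. A minimal, purely illustrative example:

```yaml
# Illustrative Pod: both processes end up competing for port 8080 because
# containers in a pod share a single network namespace.
apiVersion: v1
kind: Pod
metadata:
  name: ruler-example                        # hypothetical name
spec:
  containers:
    - name: ruler                            # hypothetical main container
      image: example/ruler:latest            # hypothetical image
      ports:
        - containerPort: 8080                # already used by the application
    - name: rules-sidecar                    # hypothetical sidecar container
      image: example/kopf-k8s-sidecar:latest # illustrative image reference
      # The sidecar's health check also binds 8080 (hard-coded at the time of
      # this thread), so one of the two has to move to a different port.
```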
Huh! No, I hadn't heard of Cortex, and now that I'm going over the docs, it sounds incredibly useful!
Opened #27
GitHub Actions completed.
@gburton1 Hey, just checking in.
Everything has been better since we changed to an immutable model for configmaps, so they are always guaranteed to have a delete/create cycle on every change. So I think the sidecar is working great!

We have one more little issue that I'm thinking about. The tool we use (Kustomize) to generate the configmaps that contain dashboards and alert rules has an annoying trait: when a configmap's content changes, the old configmap is deleted and a new one is created, each with a unique suffix in the configmap name (see the sketch below). Unfortunately, the order is not guaranteed, so you get a delete and a create within milliseconds of each other, and sometimes the create happens first. The sidecar then reacts in that order, mounting the new configmap and then deleting the file it just mounted a few milliseconds later! I need to ask in that project whether I can get a guaranteed order somehow. My main workaround for now will be to make the sidecar reset more often via WATCH_SERVER_TIMEOUT, which would reduce the downtime from the default 10 minutes to something better.
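For reference, the Kustomize behavior in question is the configMapGenerator name-hash suffix: every content change produces a configmap with a new name, and the previously generated one gets pruned by the deploy tooling, which is what yields the delete/create pair. A minimal sketch (names and labels are illustrative):

```yaml
# kustomization.yaml (illustrative): the generated configmap gets a name like
# cortex-rules-<content-hash>, so editing rules.yaml yields a brand new
# configmap while the old hashed one is removed on the next deploy.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
configMapGenerator:
  - name: cortex-rules        # hypothetical name; Kustomize appends a content hash
    files:
      - rules.yaml            # hypothetical rules file
generatorOptions:
  labels:
    cortex_rules: "1"         # hypothetical label the sidecar watches for
```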
I've just switched over to kopf-k8s-sidecar due to this issue. It's working fine as a drop-in replacement, but it does not recognize when configmaps are removed. I'm using it for the standard Grafana dashboard sidecar use case, and I've shown the service account and role below. Here is the warning in the logs that looks relevant: