jenkins sync plugin uses okhttp which sends 200 api watch calls per sec #1648
Comments
From @rupalibehera: what is deployed to OSIO is https://github.com/fabric8io/jenkins-sync-plugin/tree/job-to-bc, which is a fork of https://github.com/openshift/jenkins-sync-plugin |
This library uses this version
cc @chmouel |
@jfchevrette, could you tell us which version of OpenShift we are facing this issue on? |
@rupalibehera 3.6 (v3.6.173.0.63 (online version 3.6.0.73.0)), my understanding is that it is currently blocking the upgrade to 3.7 on us-starter-east-2 |
@pbergene, if I understand correctly, we only get this issue after the upgrade? |
@rupalibehera no, it's current. There is also an ongoing email thread. EDIT: My mistake, sorry, you are indeed right. This occurred once the API servers were upgraded to 3.7. |
@rajdavies ^^ see update from @pbergene on 3.7. |
After the upgrade to OpenShift 3.7 we suspect there are major API changes affecting openshift-client/kubernetes-client, which is used by the jenkins-sync-plugin. We tried downgrading the plugin and testing it locally, but could not check the watch requests with Prometheus. |
@rupalibehera this issue is present right now in production on 3.6, so it is not specific to 3.7. It is unfortunately very hard to reproduce locally as we don't have a way to run the OSIO stack locally yet. Hopefully we can bring the plugin on par with upstream, which may already have a fix for this. On the other hand, the plugin doesn't seem to have much in the way of logging, so being able to check logs and see when the plugin is sending those watch API calls would help immensely. |
Could we get the logs of the Jenkins master (on OpenShift 3.7)? We need to check the plugin logs. We tried monitoring the API watch requests but don't see the high numbers you are seeing. The upstream plugin also uses the watcher code for BuildConfigs and Builds; what is different in our fabric8io/openshift-sync-plugin fork is that ConfigMaps and Secrets are synced in addition to BuildConfigs and Builds. |
@rupalibehera I've sent the logs privately as they contain sensitive information. What we're seeing is about 50 watch requests/sec for each of BuildConfigs, Builds, Secrets and ConfigMaps (a total of 200 req/sec). This is across 500 Jenkins pods at the moment. So it's possible that each is doing a request every few seconds (all contributing to the 200 req/sec total), or that only a few of them are in a bad state (due to an invalid authentication token or missing privileges, for instance) causing the plugin to keep retrying very rapidly. This is occurring regardless of the OpenShift version and is causing high stress on the API servers. What prevented the upgrade to 3.7 is that such a high request rate led us to discover a bug in OpenShift 3.7 related to gRPC. That particular bug is being fixed and we also have a temporary workaround to allow us to do the upgrade. However, we still have to fix the issue with Jenkins/the plugin because this will not scale well once we have many more users on the clusters. The problem is probably going to be hard to reproduce locally. What we would need is better logging on the Jenkins instances or the plugin itself so that we know what's going on. In the plugin's code I see that it is capable of logging all those events, but I believe the Jenkins logging level isn't high enough for the logs to show. |
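For reference, one way to surface those events is to raise the logger for the plugin's package to FINE or below. This is only a minimal sketch, assuming the plugin logs through java.util.logging under the io.fabric8.jenkins.openshiftsync package (which matches the source path linked in the issue description); the equivalent can be done from the Jenkins script console or a Jenkins log recorder:

```java
// Minimal sketch: raise the log level for the sync plugin's package so
// per-event watch logging becomes visible. Assumes the plugin logs through
// java.util.logging under io.fabric8.jenkins.openshiftsync.
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

public class RaiseSyncPluginLogLevel {
    public static void main(String[] args) {
        Logger syncLogger = Logger.getLogger("io.fabric8.jenkins.openshiftsync");
        syncLogger.setLevel(Level.FINE); // FINE/FINEST exposes per-event log statements

        // Default handlers usually stop at INFO, so attach one that
        // actually emits FINE records.
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        syncLogger.addHandler(handler);
    }
}
```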
This shouldn't be a SEV1 issue now - can we downgrade it to SEV2? |
@jfchevrette do the dev teams have access to the prometheus charts from your email for prod and prod preview? I expect a certain amount of experimentation will be required during the investigation, so being able to recreate and track using the monitoring tooling is going to help significantly. |
@rawlingsj the prometheus endpoint is only available in production and not something that is widely available; it is in fact an internal tool the ops team uses. |
@jfchevrette ok any chance we can have prometheus + these charts added to prod preview? Or capture / present the data using whatever monitoring solution we do have available to devs in prod / prod preview? The reason this has only come to light now is because of the gRPC issue during the upgrade. Now that we know it's there we really need the info to be accessible to help investigate and test potential solutions. |
I also suspect anything in a tenant namespace that performs a watch could be susceptible to the same issue. I do wonder, if we're scaling tenant namespaces to > 500 on a cluster, whether a watch per tenant can be supported by the API server. I guess different technologies and code are likely going to make a difference, but we really need some scalability testing around this, plus monitoring and alerting, to gain an understanding. |
We need to figure out why we are storming the OpenShift/Kube API - keeping this as SEV1. |
@rawlingsj that will not be possible for now as we would need to grant cluster admin privileges to everyone who wants to look at prometheus; it is not meant for end-user consumption at the moment. Another thing that is very strange and may indicate an issue with the fabric8 k8s client itself: once the masters were upgraded to 3.7 with the workaround to prevent the API DDoS from affecting the API servers (triggering the gRPC bug), all requests became LIST requests, not WATCH requests anymore. For some reason the client switched from doing WATCH requests to LIST requests. I've looked at the code (jenkins-sync-plugin, fabric8 k8s client) for some time last night and couldn't spot anything that would make the API calls fall back to LIST calls. |
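For context, here is a minimal sketch of the two request types using the fabric8 openshift-client API the plugin builds on; the "jenkins" namespace and the BuildConfig resource are placeholders, and this is only an illustration, not the plugin's actual code:

```java
// list() is a one-shot GET of the whole collection; watch() holds a single
// long-lived connection that streams change events to a callback.
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.openshift.api.model.BuildConfig;
import io.fabric8.openshift.client.DefaultOpenShiftClient;
import io.fabric8.openshift.client.OpenShiftClient;

public class ListVersusWatch {
    public static void main(String[] args) {
        try (OpenShiftClient client = new DefaultOpenShiftClient()) {
            // LIST: one request per call; repeated in a tight loop this shows up
            // on the API server exactly like the request rates described above.
            client.buildConfigs().inNamespace("jenkins").list();

            // WATCH: one long-lived request; subsequent events are pushed to the
            // callback, so a healthy watcher generates very little traffic.
            client.buildConfigs().inNamespace("jenkins").watch(new Watcher<BuildConfig>() {
                @Override
                public void eventReceived(Action action, BuildConfig buildConfig) {
                    System.out.println(action + " " + buildConfig.getMetadata().getName());
                }

                @Override
                public void onClose(KubernetesClientException cause) {
                    System.out.println("watch closed: " + cause);
                }
            });
            // Closing the client (via try-with-resources) tears down the watch.
        }
    }
}
```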
Given that list and watch operations are pretty similar, I wouldn't be surprised if there was a bug in the kubernetes-client related to API groups that breaks things, so that watch requests are improperly constructed and end up being list requests. Now, to better understand the issue I'd like to know a few things:
|
From the pom of the branch we use, it's openshift-client 2.5.7: https://github.com/fabric8io/jenkins-sync-plugin/blob/job-to-bc/pom.xml#L44
It's been working, but I don't think we know at what point the high number of requests started happening or if it's always been like this - @jfchevrette correct?
@jfchevrette is this something you might be able to help with? |
is it related to this - kubernetes/kubernetes#45811 ? |
@rawlingsj the requests were there before we attempted the upgrade. They were discovered when we realized they were triggering a bug in gRPC down the line. Then, after the upgrade, the API endpoints are seeing LIST events instead of WATCH events. I was told by @smarterclayton that they had a similar issue with their Jenkins and it was solved by upgrading the fabric8 client dependency. Should we try that first? |
Watchers will poll until the resource they are "watching" becomes available - then they will watch. So while upgrading the kubernetes client may work around the breaking changes in OpenShift between 3.6 and 3.7, there will always be some polling. |
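A rough sketch of that list-then-watch cycle, again using the fabric8 client API purely as an illustration rather than the plugin's actual code; the point is that a watch which keeps closing immediately degenerates into repeated LIST/WATCH attempts unless a backoff is added:

```java
// LIST to obtain a resourceVersion (and to "poll" until the resource type is
// reachable), then WATCH from that version, and re-list + re-watch whenever
// the watch closes.
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.openshift.api.model.Build;
import io.fabric8.openshift.api.model.BuildList;
import io.fabric8.openshift.client.DefaultOpenShiftClient;
import io.fabric8.openshift.client.OpenShiftClient;

public class ListThenWatch {
    private final OpenShiftClient client = new DefaultOpenShiftClient();

    void start() {
        // "Poll": LIST the builds, which also yields the resourceVersion to watch from.
        BuildList builds = client.builds().inNamespace("jenkins").list();
        String resourceVersion = builds.getMetadata().getResourceVersion();

        // Then open the long-lived WATCH from that resourceVersion.
        client.builds().inNamespace("jenkins").watch(resourceVersion, new Watcher<Build>() {
            @Override
            public void eventReceived(Action action, Build build) {
                // react to ADDED / MODIFIED / DELETED events
            }

            @Override
            public void onClose(KubernetesClientException cause) {
                // The watch ended; a real implementation should back off here,
                // otherwise a persistent failure (bad token, missing permissions,
                // incompatible API) turns this loop into a request storm.
                start();
            }
        });
    }

    public static void main(String[] args) {
        new ListThenWatch().start();
    }
}
```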
We can upgrade the client, and that would also bring in this fix, which may or may not be related: fabric8io/kubernetes-client#855. One remaining issue is how we test this. We have no monitoring in prod preview or prod that we can use to check whether this addresses the problem or, worse, to know if it causes an adverse effect. |
@jfchevrette The PR linked above includes the upgrade of the kubernetes (openshift) client to the latest release. The CI job has built a snapshot image and pushed it to dockerhub.
Based on my concerns above, I still don't know how we're going to tell if this works or not. I guess if someone with access to prometheus can put a query together to capture their tenant's current jenkins metrics, they could then manually edit their jenkins tenant DC and change the
@jfchevrette @rawlingsj
Can we consider fabric8io/jenkins-sync-plugin#8? |
@jfchevrette, there was some testing with the above image provided by @rawlingsj - could you let us know how the testing went? Does the issue still exist? |
@rupalibehera Is there a test suite upstream that we can rely on to validate the dependency change? Updating a single jenkins instance in OSIO (among all 500 instances) isn't enough to validate if the api servers are still getting hammered with api calls. 1/500 is only a 0.2% change and is not enough to cause a significant & observable impact on the graphs. Once the changes are released upstream we can update the OSIO tenants to the new jenkins image. |
@jfchevrette, we have merged the changes in fabric8io/openshift-jenkins-s2i-config#124 (some sanity checks on this PR look good: the Jenkins job can import a job, build a quickstart, etc.).
Since then, these changes have also been merged: fabric8io/jenkins-sync-plugin#8, fabric8io/openshift-jenkins-s2i-config#127, fabric8-services/fabric8-tenant-jenkins#59 and https://github.com/fabric8-services/fabric8-tenant/blob/426a97ccaf41e320818ae3fdd9b39a2333692a6c/JENKINS_VERSION. |
That is absolutely awesome. Fixed and verified. Thanks everyone, closing it off :) |
As reported by @jfchevrette:
The OpenShift AOS team discovered that the starter-us-east-2 cluster is receiving 200 API watch calls per second from an 'okhttp/3.8.1' user-agent.
This triggers a race condition in the gRPC/etcd client, causing the API server to lock up. This issue is being reproduced and fixed at the moment, but that will take some time. ref.: openshift/origin#17735
What is using the okhttp library on version 3.8.1?
https://github.com/openshift/jenkins-sync-plugin/blob/master/src/main/java/io/fabric8/jenkins/openshiftsync/GlobalPluginConfiguration.java#L60-L66
From @jfchevrette: the reason for suspecting it is coming from Jenkins is that when we force-idle Jenkins tenants the calls stop/reduce significantly.
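The 'okhttp/3.8.1' user-agent is okhttp's default User-Agent header, and the fabric8 kubernetes/openshift client used by the sync plugin code linked above uses okhttp as its HTTP transport. As a side note, a small hypothetical check like the following (not part of the plugin) could confirm which okhttp version actually ends up on a Jenkins master's classpath:

```java
// Hypothetical classpath check: print where the okhttp classes were loaded
// from and, if the jar manifest carries it, the Implementation-Version
// (e.g. "3.8.1"). Both values may be unavailable depending on how the jar
// was built, so this is only a best effort.
public class OkHttpVersionCheck {
    public static void main(String[] args) throws ClassNotFoundException {
        Class<?> okHttpClient = Class.forName("okhttp3.OkHttpClient");

        java.security.CodeSource source = okHttpClient.getProtectionDomain().getCodeSource();
        System.out.println("okhttp loaded from: " + (source != null ? source.getLocation() : "unknown"));

        Package pkg = okHttpClient.getPackage();
        System.out.println("Implementation-Version: " + (pkg != null ? pkg.getImplementationVersion() : "unknown"));
    }
}
```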