This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

jenkins sync plugin uses okhttp which sends 200 api watch calls per sec #1648

Closed
sbose78 opened this issue Dec 13, 2017 · 32 comments

@sbose78
Collaborator

sbose78 commented Dec 13, 2017

As reported by @jfchevrette:

The OpenShift AOS team discovered that the starter-us-east-2 cluster is receiving 200 API watch calls per second from an 'okhttp/3.8.1' user-agent:

{client="okhttp/3.8.1",code="200",job="kubernetes-apiservers",resource="builds"} 51.80555555555555
{client="okhttp/3.8.1",code="200",job="kubernetes-apiservers",resource="configmaps"} 51.80555555555555
{client="okhttp/3.8.1",code="200",job="kubernetes-apiservers",resource="buildconfigs"} 51.800000000000004
{client="okhttp/3.8.1",code="200",job="kubernetes-apiservers",resource="secrets"} 51.800000000000004

This triggers a race condition in the gRPC/etcd client, causing the API server to lock up. This issue is being reproduced and fixed at the moment, but that will take some time. Ref: openshift/origin#17735

What is using the okhttp library on version 3.8.1?
https://github.com/openshift/jenkins-sync-plugin/blob/master/src/main/java/io/fabric8/jenkins/openshiftsync/GlobalPluginConfiguration.java#L60-L66
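
For context, a minimal sketch of how such a watch is typically opened with the fabric8 openshift-client, which sends its HTTP(S) traffic through okhttp and therefore shows up under the okhttp/3.8.1 user-agent; the namespace and handler below are illustrative assumptions, not the plugin's exact code:

import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.openshift.api.model.BuildConfig;
import io.fabric8.openshift.client.DefaultOpenShiftClient;
import io.fabric8.openshift.client.OpenShiftClient;

public class SyncWatchSketch {
    public static void main(String[] args) {
        // The fabric8 openshift-client performs its HTTP(S) calls through okhttp,
        // which is why these requests appear under the okhttp/3.8.1 user-agent.
        OpenShiftClient client = new DefaultOpenShiftClient();

        // One long-lived watch per resource type, in the spirit of what
        // GlobalPluginConfiguration sets up (shown here only for BuildConfigs;
        // the metrics above also list Builds, ConfigMaps and Secrets).
        client.buildConfigs().inNamespace("myproject").watch(new Watcher<BuildConfig>() {
            @Override
            public void eventReceived(Action action, BuildConfig bc) {
                System.out.println(action + " " + bc.getMetadata().getName());
            }

            @Override
            public void onClose(KubernetesClientException cause) {
                // Every time a watch closes and is re-opened, another API call is made;
                // hundreds of Jenkins masters doing this adds up quickly.
                System.out.println("watch closed: " + cause);
            }
        });
    }
}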

From @jfchevrette: the reason for suspecting it is coming from Jenkins is that when we force-idle Jenkins tenants, the calls stop or reduce significantly.

@maxandersen
Collaborator

@rupalibehera
Collaborator

rupalibehera commented Dec 13, 2017

The plugin pulls in com.squareup.okhttp3:okhttp:jar:3.8.1:compile (found with the help of @hrihsin).
I am trying to debug this issue with @hrihsin, but I am not a Java expert, so I don't understand much of it.

17:20 $ mvn dependency:tree | grep ok
Downloading: http://repo.jenkins-ci.org/public/org/jenkins-ci/plugins/token-macro/1.12.1/token-macro-1.12.1.pom
Downloaded: http://repo.jenkins-ci.org/public/org/jenkins-ci/plugins/token-macro/1.12.1/token-macro-1.12.1.pom (3 KB at 4.2 KB/sec)
Downloading: http://repo.jenkins-ci.org/public/org/jenkins-ci/plugins/token-macro/1.12.1/token-macro-1.12.1.jar
Downloaded: http://repo.jenkins-ci.org/public/org/jenkins-ci/plugins/token-macro/1.12.1/token-macro-1.12.1.jar (76 KB at 20.5 KB/sec)
[INFO] |     +- com.squareup.okhttp3:okhttp:jar:3.8.1:compile
[INFO] |     |  \- com.squareup.okio:okio:jar:1.13.0:compile
[INFO] |     +- com.squareup.okhttp3:logging-interceptor:jar:3.8.1:compile
[INFO] +- org.jenkins-ci.plugins:token-macro:jar:1.12.1:compile
[INFO] |  +- com.squareup.okhttp:okhttp-urlconnection:jar:2.7.5:test
[INFO] |  |  \- com.squareup.okhttp:okhttp:jar:2.7.5:test

cc @chmouel

@rupalibehera
Collaborator

@jfchevrette, could you tell us which version of OpenShift we are facing this issue on?

@pbergene
Collaborator

pbergene commented Dec 13, 2017

@rupalibehera 3.6 (v3.6.173.0.63, online version 3.6.0.73.0); my understanding is that it is currently blocking the upgrade to 3.7 on us-starter-east-2.

@rupalibehera
Collaborator

@pbergene, if I understand correctly, we only see this issue after the upgrade?

@pbergene
Collaborator

pbergene commented Dec 13, 2017

@rupalibehera no, it's current. There is also an ongoing email thread.

EDIT: My mistake, sorry, you are indeed right. This occurred once the API servers were upgraded to 3.7.

@maxandersen
Collaborator

@rajdavies ^^ see update from @pbergene on 3.7.

@rupalibehera
Collaborator

After the upgrade to OpenShift 3.7 we suspect there are major API changes affecting openshift-client/kubernetes-client, which the jenkins-sync-plugin uses. We tried downgrading the plugin and testing it locally, but could not check the watch requests with Prometheus.

@jfchevrette
Contributor

@rupalibehera this issue is present right now in production on 3.6 - it is not specific to 3.7. It is unfortunately very hard to reproduce locally, as we don't have a way to run the OSIO stack locally yet.

Hopefully we can bring the plugin on par with upstream, which may already have a fix for this. On the other hand, the plugin doesn't seem to have much in the way of logging, so being able to check logs and see when the plugin is sending those watch API calls would help immensely.

@rupalibehera
Collaborator

Hi @jfchevrette @pbergene,

Could we get the logs of the Jenkins master (on OpenShift 3.7)? We need to check the plugin logs.
Also, can we have the same Prometheus graphs for 3.6 that you shared in the email for 3.7?

We tried monitoring the API watch requests, but we don't see the high numbers you are seeing.

The upstream plugin also uses the watcher code for BuildConfigs and Builds; what is different in our fabric8io/openshift-sync-plugin fork is that ConfigMaps and Secrets are also synced, in addition to BuildConfigs and Builds.

@jfchevrette
Contributor

@rupalibehera I've sent the logs privately as they contain sensitive information.

What we're seeing is about 50 watch requests/sec for each of BuildConfigs, Builds, Secrets and ConfigMaps (a total of 200 req/sec). This is across 500 Jenkins pods at the moment. So it's possible that each is doing a request every few seconds (all contributing to the 200 req/sec total), or that only a few of them are in a bad state (due to an invalid authentication token or insufficient privileges, for instance), causing the plugin to keep retrying very rapidly.
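
(A rough back-of-the-envelope check on those numbers: if the load were spread evenly, 50 requests/sec per resource divided by 500 pods is about 0.1 requests/sec per pod per resource, i.e. one call roughly every 10 seconds per resource, or one every ~2.5 seconds per pod across the four resource types. If instead only a handful of pods were responsible, those few would have to be retrying many times per second.)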

This is occurring regardless of the OpenShift version and is causing high stress on the API servers. What prevented the upgrade to 3.7 is that such a high rate of requests led us to discover a bug in OpenShift 3.7 related to gRPC. That particular bug is being fixed, and we also have a temporary workaround to allow us to do the upgrade.

However we still have to fix the issue with jenkins/plugin because this will not scale well once we have many more users on the clusters.

The problem is probably going to be hard to reproduce locally. What we would need is better logging on the Jenkins instances or in the plugin itself so that we know what's going on. In the plugin's code I see that it is capable of logging all those events, but I believe the Jenkins logging level isn't high enough for them to show up.
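
For what it's worth, the plugin code lives under the io.fabric8.jenkins.openshiftsync package (see the link in the first comment), so raising the java.util.logging level for that logger should surface those events. A minimal sketch, assuming the plugin uses standard JUL loggers named after its package (the snippet can be pasted into the Jenkins script console, which accepts plain Java statements):

import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Logger name is an assumption based on the plugin's package,
// io.fabric8.jenkins.openshiftsync; adjust it if the plugin logs under another name.
Logger pluginLogger = Logger.getLogger("io.fabric8.jenkins.openshiftsync");
pluginLogger.setLevel(Level.ALL);

// Make sure at least one handler actually emits FINE/FINER/FINEST records.
ConsoleHandler handler = new ConsoleHandler();
handler.setLevel(Level.ALL);
pluginLogger.addHandler(handler);

Alternatively, a log recorder for the same logger name at level ALL can be added under Manage Jenkins > System Log.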

@rajdavies

This shouldn't be a SEV1 issue now - can we downgrade it to SEV2?

@rawlingsj

@jfchevrette do the dev teams have access to the prometheus charts from your email for prod and prod preview? I expect there's going to need to be a certain amount of experimentation required during the investigation so being able to recreate and track using the monitoring tooling is going to help significantly.

@jfchevrette
Contributor

@rawlingsj the Prometheus endpoint is only available in production and is not something that is widely available; it is in fact an internal tool the ops team is using.

@rawlingsj

@jfchevrette ok any chance we can have prometheus + these charts added to prod preview? Or capture / present the data using whatever monitoring solution we do have available to devs in prod / prod preview?

The reason this has only come to light now is because of the gRPC issue during the upgrade. Now that we know it's there we really need the info to be accessible to help investigate and test potential solutions.

@rawlingsj

I also suspect anything in a tenant namespace that performs a watch could be susceptible to the same issue. If we're scaling to more than 500 tenant namespaces on a cluster, I do wonder whether a watch per tenant can be supported by the API server. Different technologies and code will likely make a difference, but we really need some scalability testing around this, plus monitoring and alerting, to gain an understanding.

@cctrieloff
Collaborator

We need to figure out why we are storming the OpenShift/Kube API - that should stay SEV1.

@jfchevrette
Contributor

@rawlingsj that will not be possible for now, as we would need to grant cluster-admin privileges to everyone who wants to look at Prometheus. It is not meant for end-user consumption at the moment.

Another thing that is very strange and may indicate an issue with the fabric8 k8s client itself: once the masters were upgraded to 3.7 with the workaround to prevent the API DDoS from affecting the API servers (triggering the gRPC bug), all requests became LIST requests, not WATCH requests anymore. For some reason the client switched from doing WATCH requests to LIST requests.

I've looked at the code (jenkins-sync-plugin, fabric8 k8s client) for some time last night and couldn't spot anything that would make the API calls fall back to LIST calls.

(screenshot: prom-upgrade-jenkins-watch-to-list)

@iocanel

iocanel commented Dec 14, 2017

Given that list and watch operations are pretty similar, I wouldn't be surprised if there were a bug in the kubernetes-client related to API groups that breaks things, so that watch requests are improperly constructed and end up being list requests.
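
For illustration of how small that difference is on the wire (example paths only, not captured from the affected cluster): a watch and a list hit the same collection endpoint and differ only in the query string, so a mis-built URL is enough for the API server to count the call as a LIST:

GET /oapi/v1/namespaces/myproject/builds?watch=true&resourceVersion=12345   (counted as WATCH)
GET /oapi/v1/namespaces/myproject/builds                                    (counted as LIST)

On 3.7 the same resources are also served under API groups such as /apis/build.openshift.io/v1/..., which is where a client bug around group handling could plausibly lose the watch parameter.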

Now, to better understand the issue I'd like to know a few things:

  1. What is the kubernetes-client version that is used?
  2. Did it ever work with an older version of the client?
  3. Is it possible to get a list of the actual HTTP requests from the affected system (instead of just the entries in Prometheus)?

@rawlingsj

What is the kubernetes-client version that is used?

From the pom of the branch we use, it's openshift-client 2.5.7: https://github.com/fabric8io/jenkins-sync-plugin/blob/job-to-bc/pom.xml#L44
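
For reference, this is roughly how the dependency would appear in the plugin's pom (coordinates as in the linked line; the exact layout of the file is an assumption):

<dependency>
  <groupId>io.fabric8</groupId>
  <artifactId>openshift-client</artifactId>
  <version>2.5.7</version>
</dependency>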

Did it ever work with an older version of the client?

It's been working, but I don't think we know at what point the high number of requests started happening, or whether it's always been like this. @jfchevrette, correct?

Is it possible to get a list of the actual HTTP requests from the affected system (instead of just the entries in Prometheus)?

@jfchevrette is this something you might be able to help with?

@rajdavies

Is it related to this: kubernetes/kubernetes#45811?

@jfchevrette
Contributor

@rawlingsj the requests were there before we attempted the upgrade. They were discovered when we realized they were triggering a bug in gRPC down the line. Then, after the upgrade, the API endpoints are seeing LIST events instead of WATCH events.

I was told by @smarterclayton that they had a similar issue with their jenkins and it was solved by upgrading the fabric8 client dependency. Should we try that first?

@rajdavies

Watchers will poll until the resource they are "watching" becomes available - then they will watch. So while upgrading the kubernetes client may work around the breaking changes in OpenShift between 3.6 and 3.7, there will always be some polling.
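
A minimal sketch of that list-then-watch pattern with the fabric8 client (the resource, namespace and error handling are illustrative, not the plugin's actual code):

import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.openshift.api.model.Build;
import io.fabric8.openshift.api.model.BuildList;
import io.fabric8.openshift.client.DefaultOpenShiftClient;
import io.fabric8.openshift.client.OpenShiftClient;

public class ListThenWatchSketch {
    public static void main(String[] args) {
        OpenShiftClient client = new DefaultOpenShiftClient();

        // 1. LIST once to learn the current state and a resourceVersion to start from.
        BuildList current = client.builds().inNamespace("myproject").list();
        String resourceVersion = current.getMetadata().getResourceVersion();

        // 2. WATCH from that resourceVersion. Until this step succeeds (for example
        //    while the resource type is not yet served after an upgrade), callers
        //    typically fall back to repeating step 1 (the polling described above),
        //    and every retry is one more LIST against the API server.
        Watch watch = client.builds().inNamespace("myproject")
                .watch(resourceVersion, new Watcher<Build>() {
                    @Override
                    public void eventReceived(Action action, Build build) {
                        System.out.println(action + " " + build.getMetadata().getName());
                    }

                    @Override
                    public void onClose(KubernetesClientException cause) {
                        // A watch that closes immediately and is re-opened without a
                        // back-off produces the same kind of request storm.
                        System.out.println("watch closed: " + cause);
                    }
                });
    }
}

If the watch cannot be established, or closes immediately, and the caller loops back to the list step without a back-off, the traffic looks very much like the request rates reported above.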

@rawlingsj

We can upgrade the client, and that would also bring in this fix, which may or may not be related: fabric8io/kubernetes-client#855

One remaining issue is how we test this. We have no monitoring in prod preview or prod that we can use to check whether this addresses the problem or, worse, to know if it causes an adverse effect.

@rawlingsj

@jfchevrette The PR linked above includes the upgrade of the kubernetes (openshift) client to the latest release. The CI job has built a snapshot image and pushed it to Docker Hub.

fabric8/jenkins-openshift:SNAPSHOT-PR-124-2

Based on my concerns above, I still don't know how we're going to tell whether this works or not. I guess someone with access to Prometheus could put a query together to capture their tenant's current Jenkins metrics, then manually edit their Jenkins tenant DC to change the image: to fabric8/jenkins-openshift:SNAPSHOT-PR-124-2, and then check the query again to see if the upgrade helped?

@hrishin

hrishin commented Dec 15, 2017

@jfchevrette @rawlingsj
@rupalibehera and I tried upgrading the client while running Prometheus on Minishift with OpenShift v3.7; it's hard to observe the difference in behavior.
Along with the upgrade, we should:

  • add more logging statements, specifically around creating watch requests
  • enable ALL-level logging

Can we consider fabric8io/jenkins-sync-plugin#8 ?

@rupalibehera
Collaborator

@jfchevrette, there was some testing with the image above provided by @rawlingsj; could you let us know how the testing went? Does the issue still exist?

@jfchevrette
Contributor

@rupalibehera Is there a test suite upstream that we can rely on to validate the dependency change?

Updating a single jenkins instance in OSIO (among all 500 instances) isn't enough to validate if the api servers are still getting hammered with api calls. 1/500 is only a 0.2% change and is not enough to cause a significant & observable impact on the graphs.

Once the changes are released upstream we can update the OSIO tenants to the new jenkins image.

@rupalibehera
Collaborator

@jfchevrette, we have merged the changes: fabric8io/openshift-jenkins-s2i-config#124 (sanity checks in this PR's Jenkins job look good; it can import a job, build a quickstart, etc.),
fabric8-services/fabric8-tenant-jenkins#58, and fabric8-services/fabric8-tenant#504. I am not sure how it goes to prod from here; all the related jobs are successful.

@rupalibehera
Collaborator

After these changes were merged - fabric8io/jenkins-sync-plugin#8, fabric8io/openshift-jenkins-s2i-config#127, fabric8-services/fabric8-tenant-jenkins#59 and https://github.com/fabric8-services/fabric8-tenant/blob/426a97ccaf41e320818ae3fdd9b39a2333692a6c/JENKINS_VERSION - @jfchevrette reports that the API watch calls have reduced significantly.
We can close this issue now.

@jfchevrette
Contributor

A graph is worth a thousand words (or something like that)

(screenshot: prom-jenkins-api-calls-low)

@pbergene
Collaborator

That is absolutely awesome. Fixed and verified. Thanks everyone, closing it off :)
