This repository has been archived by the owner on Jul 23, 2020. It is now read-only.

jenkins sync plugin uses okhttp which sends 200 api watch calls per sec #1648

Closed
sbose78 opened this issue Dec 13, 2017 · 32 comments

@sbose78
Collaborator

sbose78 commented Dec 13, 2017

As reported by @jfchevrette:

The OpenShift AOS team discovered that the starter-us-east-2 cluster is receiving 200 API watch calls per second from an 'okhttp/3.8.1' user-agent:

{client="okhttp/3.8.1",code="200",job="kubernetes-apiservers",resource="builds"} 51.80555555555555
{client="okhttp/3.8.1",code="200",job="kubernetes-apiservers",resource="configmaps"} 51.80555555555555
{client="okhttp/3.8.1",code="200",job="kubernetes-apiservers",resource="buildconfigs"} 51.800000000000004
{client="okhttp/3.8.1",code="200",job="kubernetes-apiservers",resource="secrets"} 51.800000000000004

This triggers a race condition in the gRPC/etcd client, causing the API server to lock up. This issue is being reproduced and fixed at the moment, but that will take some time. Ref: openshift/origin#17735

What is using the okhttp library on version 3.8.1?
https://github.com/openshift/jenkins-sync-plugin/blob/master/src/main/java/io/fabric8/jenkins/openshiftsync/GlobalPluginConfiguration.java#L60-L66
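
For context, a minimal sketch of how such a watch is typically opened with the fabric8 openshift-client, which sends its HTTP(S) traffic through okhttp and therefore shows up under the okhttp/3.8.1 user-agent; the namespace and handler below are illustrative assumptions, not the plugin's exact code:

import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.openshift.api.model.BuildConfig;
import io.fabric8.openshift.client.DefaultOpenShiftClient;
import io.fabric8.openshift.client.OpenShiftClient;

public class SyncWatchSketch {
    public static void main(String[] args) {
        // The fabric8 openshift-client performs its HTTP(S) calls through okhttp,
        // which is why these requests appear under the okhttp/3.8.1 user-agent.
        OpenShiftClient client = new DefaultOpenShiftClient();

        // One long-lived watch per resource type, in the spirit of what
        // GlobalPluginConfiguration sets up (shown here only for BuildConfigs;
        // the metrics above also list Builds, ConfigMaps and Secrets).
        client.buildConfigs().inNamespace("myproject").watch(new Watcher<BuildConfig>() {
            @Override
            public void eventReceived(Action action, BuildConfig bc) {
                System.out.println(action + " " + bc.getMetadata().getName());
            }

            @Override
            public void onClose(KubernetesClientException cause) {
                // Every time a watch closes and is re-opened, another API call is made;
                // hundreds of Jenkins masters doing this adds up quickly.
                System.out.println("watch closed: " + cause);
            }
        });
    }
}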

From @jfchevrette: the reason for suspecting it is coming from Jenkins is that when we force-idle Jenkins tenants, the calls stop or reduce significantly.

@maxandersen
Collaborator

@rupalibehera
Collaborator

rupalibehera commented Dec 13, 2017

The plugin pulls in com.squareup.okhttp3:okhttp:jar:3.8.1:compile (found with the help of @hrihsin).
I am trying to debug this issue with @hrihsin, but I am not a Java expert, so I don't understand much of it.

17:20 $ mvn dependency:tree | grep ok
Downloading: http://repo.jenkins-ci.org/public/org/jenkins-ci/plugins/token-macro/1.12.1/token-macro-1.12.1.pom
Downloaded: http://repo.jenkins-ci.org/public/org/jenkins-ci/plugins/token-macro/1.12.1/token-macro-1.12.1.pom (3 KB at 4.2 KB/sec)
Downloading: http://repo.jenkins-ci.org/public/org/jenkins-ci/plugins/token-macro/1.12.1/token-macro-1.12.1.jar
Downloaded: http://repo.jenkins-ci.org/public/org/jenkins-ci/plugins/token-macro/1.12.1/token-macro-1.12.1.jar (76 KB at 20.5 KB/sec)
[INFO] |     +- com.squareup.okhttp3:okhttp:jar:3.8.1:compile
[INFO] |     |  \- com.squareup.okio:okio:jar:1.13.0:compile
[INFO] |     +- com.squareup.okhttp3:logging-interceptor:jar:3.8.1:compile
[INFO] +- org.jenkins-ci.plugins:token-macro:jar:1.12.1:compile
[INFO] |  +- com.squareup.okhttp:okhttp-urlconnection:jar:2.7.5:test
[INFO] |  |  \- com.squareup.okhttp:okhttp:jar:2.7.5:test

cc @chmouel

@rupalibehera
Collaborator

@jfchevrette, could you tell us which version of OpenShift we are facing this issue on?

@pbergene
Collaborator

pbergene commented Dec 13, 2017

@rupalibehera 3.6 (v3.6.173.0.63, online version 3.6.0.73.0); my understanding is that it is currently blocking the upgrade to 3.7 on us-starter-east-2.

@rupalibehera
Collaborator

@pbergene, if I understand correctly, we only see this issue after the upgrade?

@pbergene
Collaborator

pbergene commented Dec 13, 2017

@rupalibehera no, it's current. There is also an ongoing email thread.

EDIT: My mistake, sorry, you are indeed right. This occurred once the API servers were upgraded to 3.7.

@maxandersen
Collaborator

@rajdavies ^^ see update from @pbergene on 3.7.

@rupalibehera
Collaborator

After the upgrade to OpenShift 3.7 we suspect there are major API changes affecting openshift-client/kubernetes-client, which the jenkins-sync-plugin uses. We tried downgrading the plugin and testing it locally, but could not check the watch requests with Prometheus.

@jfchevrette
Contributor

@rupalibehera this issue is present right now in production on 3.6 - it is not specific to 3.7. It is unfortunately very hard to reproduce locally, as we don't have a way to run the OSIO stack locally yet.

Hopefully we can bring the plugin on par with upstream, which may already have a fix for this. On the other hand, the plugin doesn't seem to have much in the way of logging, so being able to check logs and see when the plugin is sending those watch API calls would help immensely.

@rupalibehera
Collaborator

Hi @jfchevrette @pbergene,

Could we get the logs of the Jenkins master (on OpenShift 3.7)? We need to check the plugin logs.
Also, can we have the same Prometheus graphs for 3.6 that you shared in the email for 3.7?

We tried monitoring the API watch requests, but we don't see the high numbers you are seeing.

The upstream plugin also uses the watcher code for BuildConfigs and Builds; what is different in our fabric8io/openshift-sync-plugin fork is that ConfigMaps and Secrets are also synced, in addition to BuildConfigs and Builds.

@jfchevrette
Contributor

@rupalibehera I've sent the logs privately as they contain sensitive information.

What we're seeing is about 50 watch requests/sec for each of BuildConfigs, Builds, Secrets and ConfigMaps (a total of 200 req/sec). This is across 500 Jenkins pods at the moment. So it's possible that each is doing a request every few seconds (all contributing to the 200 req/sec total), or that only a few of them are in a bad state (due to an invalid authentication token or insufficient privileges, for instance), causing the plugin to keep retrying very rapidly.
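
(A rough back-of-the-envelope check on those numbers: if the load were spread evenly, 50 requests/sec per resource divided by 500 pods is about 0.1 requests/sec per pod per resource, i.e. one call roughly every 10 seconds per resource, or one every ~2.5 seconds per pod across the four resource types. If instead only a handful of pods were responsible, those few would have to be retrying many times per second.)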

This is occurring regardless of the OpenShift version and is causing high stress on the API servers. What prevented the upgrade to 3.7 is that such a high rate of requests led us to discover a bug in OpenShift 3.7 related to gRPC. That particular bug is being fixed, and we also have a temporary workaround to allow us to do the upgrade.

However we still have to fix the issue with jenkins/plugin because this will not scale well once we have many more users on the clusters.

The problem is probably going to be hard to reproduce locally. What we would need is better logging on the Jenkins instances or in the plugin itself so that we know what's going on. In the plugin's code I see that it is capable of logging all those events, but I believe the Jenkins logging level isn't high enough for them to show up.
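
For what it's worth, the plugin code lives under the io.fabric8.jenkins.openshiftsync package (see the link in the first comment), so raising the java.util.logging level for that logger should surface those events. A minimal sketch, assuming the plugin uses standard JUL loggers named after its package (the snippet can be pasted into the Jenkins script console, which accepts plain Java statements):

import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Logger name is an assumption based on the plugin's package,
// io.fabric8.jenkins.openshiftsync; adjust it if the plugin logs under another name.
Logger pluginLogger = Logger.getLogger("io.fabric8.jenkins.openshiftsync");
pluginLogger.setLevel(Level.ALL);

// Make sure at least one handler actually emits FINE/FINER/FINEST records.
ConsoleHandler handler = new ConsoleHandler();
handler.setLevel(Level.ALL);
pluginLogger.addHandler(handler);

Alternatively, a log recorder for the same logger name at level ALL can be added under Manage Jenkins > System Log.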

@rajdavies

This shouldn't be a SEV1 issue now - can we downgrade it to SEV2?

@rawlingsj

@jfchevrette do the dev teams have access to the prometheus charts from your email for prod and prod preview? I expect there's going to need to be a certain amount of experimentation required during the investigation so being able to recreate and track using the monitoring tooling is going to help significantly.

@jfchevrette
Contributor

@rawlingsj the Prometheus endpoint is only available in production and is not something that is widely available; it is in fact an internal tool the ops team is using.

@rawlingsj

@jfchevrette ok any chance we can have prometheus + these charts added to prod preview? Or capture / present the data using whatever monitoring solution we do have available to devs in prod / prod preview?

The reason this has only come to light now is because of the gRPC issue during the upgrade. Now that we know it's there we really need the info to be accessible to help investigate and test potential solutions.

@rawlingsj

I also suspect anything in a tenant namespace that performs a watch could be susceptible to the same issue. If we're scaling to more than 500 tenant namespaces on a cluster, I do wonder whether a watch per tenant can be supported by the API server. Different technologies and code will likely make a difference, but we really need some scalability testing around this, plus monitoring and alerting, to gain an understanding.

@cctrieloff
Collaborator

We need to figure out why we are storming the OpenShift/Kube API - that should stay SEV1.

@jfchevrette
Contributor

@rawlingsj that will not be possible for now, as we would need to grant cluster-admin privileges to everyone who wants to look at Prometheus. It is not meant for end-user consumption at the moment.

Another thing that is very strange and may indicate an issue with the fabric8 k8s client itself: once the masters were upgraded to 3.7 with the workaround to prevent the API DDoS from affecting the API servers (triggering the gRPC bug), all requests became LIST requests, not WATCH requests anymore. For some reason the client switched from doing WATCH requests to LIST requests.

I've looked at the code (jenkins-sync-plugin, fabric8 k8s client) for some time last night and couldn't spot anything that would make the API calls fall back to LIST calls.

(screenshot: prom-upgrade-jenkins-watch-to-list)

@iocanel

iocanel commented Dec 14, 2017

Given that list and watch operations are pretty similar, I wouldn't be surprised if there were a bug in the kubernetes-client related to API groups that breaks things, so that watch requests are improperly constructed and end up being list requests.
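
For illustration of how small that difference is on the wire (example paths only, not captured from the affected cluster): a watch and a list hit the same collection endpoint and differ only in the query string, so a mis-built URL is enough for the API server to count the call as a LIST:

GET /oapi/v1/namespaces/myproject/builds?watch=true&resourceVersion=12345   (counted as WATCH)
GET /oapi/v1/namespaces/myproject/builds                                    (counted as LIST)

On 3.7 the same resources are also served under API groups such as /apis/build.openshift.io/v1/..., which is where a client bug around group handling could plausibly lose the watch parameter.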

Now, to better understand the issue I'd like to know a few things:

  1. What is the kubernetes-client version that is used?
  2. Did it ever work with an older version of the client?
  3. Is it possible to get a list of the actual HTTP requests from the affected system (instead of just the entries in Prometheus)?

@rawlingsj

What is the kubernetes-client version that is used?

From the pom of the branch we use, it's openshift-client 2.5.7: https://github.com/fabric8io/jenkins-sync-plugin/blob/job-to-bc/pom.xml#L44
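
For reference, this is roughly how the dependency would appear in the plugin's pom (coordinates as in the linked line; the exact layout of the file is an assumption):

<dependency>
  <groupId>io.fabric8</groupId>
  <artifactId>openshift-client</artifactId>
  <version>2.5.7</version>
</dependency>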

Did it ever work with an older version of the client?

It's been working, but I don't think we know at what point the high number of requests started happening, or whether it's always been like this. @jfchevrette, correct?

Is it possible to get a list of the actual HTTP requests from the affected system (instead of just the entries in Prometheus)?

@jfchevrette is this something you might be able to help with?

@rajdavies

Is it related to this: kubernetes/kubernetes#45811?

@jfchevrette
Contributor

@rawlingsj the requests were there before we attempted the upgrade. They were discovered when we realized they were triggering a bug in gRPC down the line. Then, after the upgrade, the API endpoints are seeing LIST events instead of WATCH events.

I was told by @smarterclayton that they had a similar issue with their jenkins and it was solved by upgrading the fabric8 client dependency. Should we try that first?

@rajdavies

Watchers will poll until the resource they are "watching" becomes available - then they will watch. So while upgrading the kubernetes client may work around the breaking changes in OpenShift between 3.6 and 3.7, there will always be some polling.
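
A minimal sketch of that list-then-watch pattern with the fabric8 client (the resource, namespace and error handling are illustrative, not the plugin's actual code):

import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watch;
import io.fabric8.kubernetes.client.Watcher;
import io.fabric8.openshift.api.model.Build;
import io.fabric8.openshift.api.model.BuildList;
import io.fabric8.openshift.client.DefaultOpenShiftClient;
import io.fabric8.openshift.client.OpenShiftClient;

public class ListThenWatchSketch {
    public static void main(String[] args) {
        OpenShiftClient client = new DefaultOpenShiftClient();

        // 1. LIST once to learn the current state and a resourceVersion to start from.
        BuildList current = client.builds().inNamespace("myproject").list();
        String resourceVersion = current.getMetadata().getResourceVersion();

        // 2. WATCH from that resourceVersion. Until this step succeeds (for example
        //    while the resource type is not yet served after an upgrade), callers
        //    typically fall back to repeating step 1 (the polling described above),
        //    and every retry is one more LIST against the API server.
        Watch watch = client.builds().inNamespace("myproject")
                .watch(resourceVersion, new Watcher<Build>() {
                    @Override
                    public void eventReceived(Action action, Build build) {
                        System.out.println(action + " " + build.getMetadata().getName());
                    }

                    @Override
                    public void onClose(KubernetesClientException cause) {
                        // A watch that closes immediately and is re-opened without a
                        // back-off produces the same kind of request storm.
                        System.out.println("watch closed: " + cause);
                    }
                });
    }
}

If the watch cannot be established, or closes immediately, and the caller loops back to the list step without a back-off, the traffic looks very much like the request rates reported above.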

@rawlingsj

We can upgrade the client, and that would also bring in this fix, which may or may not be related: fabric8io/kubernetes-client#855

One remaining issue is how we test this. We have no monitoring in prod preview or prod that we can use to check whether this addresses the problem or, worse, to know if it causes an adverse effect.

@rawlingsj

@jfchevrette The PR linked above includes the upgrade of the kubernetes (openshift) client to the latest release. The CI job has built a snapshot image and pushed it to Docker Hub.

fabric8/jenkins-openshift:SNAPSHOT-PR-124-2

Based on my concerns above, I still don't know how we're going to tell whether this works or not. I guess someone with access to Prometheus could put a query together to capture their tenant's current Jenkins metrics, then manually edit their Jenkins tenant DC to change the image: to fabric8/jenkins-openshift:SNAPSHOT-PR-124-2, and then check the query again to see if the upgrade helped?

@hrishin

hrishin commented Dec 15, 2017

@jfchevrette @rawlingsj
@rupalibehera and I tried upgrading the client while running Prometheus on Minishift with OpenShift v3.7; it's hard to observe the difference in behavior.
Along with the upgrade, we should:

  • add more logging statements, specifically around creating watch requests
  • enable ALL-level logging

Can we consider fabric8io/jenkins-sync-plugin#8 ?

@rupalibehera
Collaborator

@jfchevrette, there was some testing with the image above provided by @rawlingsj; could you let us know how the testing went? Does the issue still exist?

@jfchevrette
Contributor

@rupalibehera Is there a test suite upstream that we can rely on to validate the dependency change?

Updating a single jenkins instance in OSIO (among all 500 instances) isn't enough to validate if the api servers are still getting hammered with api calls. 1/500 is only a 0.2% change and is not enough to cause a significant & observable impact on the graphs.

Once the changes are released upstream we can update the OSIO tenants to the new jenkins image.

@rupalibehera
Collaborator

@jfchevrette, we have merged the changes: fabric8io/openshift-jenkins-s2i-config#124 (sanity checks in this PR's Jenkins job look good; it can import a job, build a quickstart, etc.),
fabric8-services/fabric8-tenant-jenkins#58, and fabric8-services/fabric8-tenant#504. I am not sure how it goes to prod from here; all the related jobs are successful.

@rupalibehera
Collaborator

After these changes were merged - fabric8io/jenkins-sync-plugin#8, fabric8io/openshift-jenkins-s2i-config#127, fabric8-services/fabric8-tenant-jenkins#59 and https://github.com/fabric8-services/fabric8-tenant/blob/426a97ccaf41e320818ae3fdd9b39a2333692a6c/JENKINS_VERSION - @jfchevrette reports that the API watch calls have reduced significantly.
We can close this issue now.

@jfchevrette
Contributor

A graph is worth a thousand words (or something like that)

(screenshot: prom-jenkins-api-calls-low)

@pbergene
Collaborator

That is absolutely awesome. Fixed and verified. Thanks everyone, closing it off :)
