Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deploytime api fix #187

Merged
merged 27 commits into from
Sep 15, 2020
Merged

Conversation

malacourse
Copy link
Collaborator

Describe the behavior changes introduced in this PR

For later version of Openshift the "extensions/v1beta1" version of repicasets is not available. This causes the deploytime exporter to fail.

Linked Issues?

resolves #186

Testing Instructions

Test against higher version of OCP 4.4.16 or greater.

Please include any additional commands or pointers in addition to our standard PR testing process.

@redhat-cop/mdt

exporters/deploytime/app.py Show resolved Hide resolved
@malacourse
Copy link
Collaborator Author

Debug message added.

@etsauer
Copy link
Collaborator

etsauer commented Sep 3, 2020

@malacourse I think we need to rethink the logic here a bit. Looking at the looping it's taking 5+ minutes for the exporter to return data. For some reason, it seems to be returning things in odd groups as well, starting with 1 record, and then repetitively spitting out 2, 3, 4, all the way up to 81 lines.

$ time curl localhost:8080

deploy_timestamp{app="jenkins-1",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.593702267e+09
# HELP deploy_timestamp Deployment timestamp
# TYPE deploy_timestamp gauge
deploy_timestamp{app="jenkins-1",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.593702267e+09
deploy_timestamp{app="jenkins-2",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594164843e+09
# HELP deploy_timestamp Deployment timestamp
# TYPE deploy_timestamp gauge
deploy_timestamp{app="jenkins-1",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.593702267e+09
deploy_timestamp{app="jenkins-2",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594164843e+09
deploy_timestamp{app="jenkins-3",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594166264e+09
# HELP deploy_timestamp Deployment timestamp
# TYPE deploy_timestamp gauge
deploy_timestamp{app="jenkins-1",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.593702267e+09
deploy_timestamp{app="jenkins-2",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594164843e+09
deploy_timestamp{app="jenkins-3",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594166264e+09
deploy_timestamp{app="jenkins-4",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594166983e+09
# HELP deploy_timestamp Deployment timestamp
# TYPE deploy_timestamp gauge
deploy_timestamp{app="jenkins-1",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.593702267e+09
deploy_timestamp{app="jenkins-2",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594164843e+09
deploy_timestamp{app="jenkins-3",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594166264e+09
deploy_timestamp{app="jenkins-4",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594166983e+09
deploy_timestamp{app="jenkins-5",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594167412e+09
# HELP deploy_timestamp Deployment timestamp
# TYPE deploy_timestamp gauge
deploy_timestamp{app="jenkins-1",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.593702267e+09
deploy_timestamp{app="jenkins-2",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594164843e+09
deploy_timestamp{app="jenkins-3",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594166264e+09
deploy_timestamp{app="jenkins-4",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594166983e+09
deploy_timestamp{app="jenkins-5",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594167412e+09
deploy_timestamp{app="jenkins-6",image_sha="sha256:c454944825be5359c85728e1ba2d4bc30c73976698ef33f7f72cfb4d4ae2d9fc",namespace="basic-nginx-build"} 1.594168043e+09
# HELP deploy_timestamp Deployment timestamp
# TYPE deploy_timestamp gauge

...[keeps going 81 times, with the last one having 81 lines in it]...

real	5m15.975s
user	0m0.015s
sys	0m0.020s

@themoosman
Copy link
Collaborator

You might need to build a cache so you aren't looking up the same values over and over again. We had to do this in the committime exporter.

@etsauer
Copy link
Collaborator

etsauer commented Sep 3, 2020

Note that it only takes 5+ minutes when one of the apiVersions is missing (I changed extensions/v1beta1 to extensions/v1beta1). When all apiVersions exist, it only takes about 30 seconds, but still shows the same result behavior above.

@etsauer
Copy link
Collaborator

etsauer commented Sep 3, 2020

Another suggestion that I have (possibly for both deploytime and committime) is that maybe we should refactor the code so that we aren't gathering data per namespace. This results in multiplying the number of api calls we have to make by the number of namespaces in the cluster, and if we hit a lot of 404 type errors, that's an expensive api call to have to make multiple times.

@willowmck
Copy link
Contributor

Hey team, I tried this fix and hit the following permissions error -

Traceback (most recent call last):
  File "deploytime/app.py", line 106, in <module>
    REGISTRY.register(DeployTimeCollector(namespaces))
  File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/registry.py", line 26, in register
    names = self._get_names(collector)
  File "/opt/app-root/lib/python3.6/site-packages/prometheus_client/registry.py", line 66, in _get_names
    for metric in desc_func():
  File "deploytime/app.py", line 23, in collect
    metrics = generate_metrics(self._namespaces)
  File "deploytime/app.py", line 74, in generate_metrics
    label_selector=pelorus.get_app_label())
  File "/opt/app-root/lib/python3.6/site-packages/openshift/dynamic/client.py", line 94, in get
    return self.request('get', path, **kwargs)
  File "/opt/app-root/lib/python3.6/site-packages/openshift/dynamic/client.py", line 44, in inner
    raise api_exception(e)
openshift.dynamic.exceptions.ForbiddenError: 403
Reason: Forbidden
HTTP response headers: HTTPHeaderDict({'Audit-Id': '4bc78a78-376a-4571-ad8c-e99489b0696d', 'Content-Type': 'application/json', 'X-Content-Type-Options': 'nosniff', 'Date': 'Thu, 03 Sep 2020 23:52:38 GMT', 'Content-Length': '349'})
HTTP response body: b'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"replicasets.apps is forbidden: User \\"system:serviceaccount:pelorus:pelorus-exporter\\" cannot list resource \\"replicasets\\" in API group \\"apps\\" in the namespace \\"analyticinputs-build\\"","reason":"Forbidden","details":{"group":"apps","kind":"replicasets"},"code":403}\n'

@etsauer
Copy link
Collaborator

etsauer commented Sep 4, 2020

@willowmck did you redeploy with the latest exporter chart? It looks to grant those permissions. https://github.com/redhat-mal/pelorus/blob/deploytime-api-fix/charts/pelorus/charts/exporters/templates/rbac.yaml

@willowmck
Copy link
Contributor

willowmck commented Sep 4, 2020

My goof, I was on the wrong branch. However, now I'm back to the original error.

Traceback (most recent call last):
  File "deploytime/app.py", line 106, in <module>
    REGISTRY.register(DeployTimeCollector(namespaces))
  File "/opt/app-root/lib64/python3.8/site-packages/prometheus_client/registry.py", line 26, in register
    names = self._get_names(collector)
  File "/opt/app-root/lib64/python3.8/site-packages/prometheus_client/registry.py", line 66, in _get_names
    for metric in desc_func():
  File "deploytime/app.py", line 23, in collect
    metrics = generate_metrics(self._namespaces)
  File "deploytime/app.py", line 78, in generate_metrics
    v1_replicationsets = dyn_client.resources.get(api_version='extensions/v1beta1', kind='ReplicaSet')
  File "/opt/app-root/lib64/python3.8/site-packages/openshift/dynamic/discovery.py", line 246, in get
    raise ResourceNotFoundError('No matches found for {}'.format(kwargs))
openshift.dynamic.exceptions.ResourceNotFoundError: No matches found for {'api_version': 'extensions/v1beta1', 'kind': 'ReplicaSet'}

Something weird is happening though, because line 78 in that file is a blank line.

@etsauer
Copy link
Collaborator

etsauer commented Sep 4, 2020

@willowmck in order to run the code from this branch on OpenShift, you need to point your values file to this fork/branch. That would look like:

  - app_name: deploytime-exporter
    source_context_dir: exporters/
    extraEnv:
    - name: APP_FILE
      value: deploytime/app.py
    source_ref: deploytime-api-fix
    source_url: https://github.com/redhat-mal/pelorus.git

@willowmck
Copy link
Contributor

Ok, that's probably the issue. I'll make that change.

@willowmck
Copy link
Contributor

ok, success! Thanks for bearing with me on what is clearly a senior moment.

@malacourse malacourse self-assigned this Sep 11, 2020
Copy link
Collaborator

@etsauer etsauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@malacourse Left a few comments related to debugging. Also, when I run this, I hit a KeyError looking up the owner reference.

Exception has occurred: KeyError
'deploytime-exporter-2'
  File "/home/esauer/src/pelorus/exporters/deploytime/app.py", line 90, in generate_metrics
    rc = replicas_dict[ownerRef.name]
  File "/home/esauer/src/pelorus/exporters/deploytime/app.py", line 25, in collect
    metrics = generate_metrics(self._namespaces)
  File "/home/esauer/src/pelorus/exporters/deploytime/app.py", line 128, in <module>
    REGISTRY.register(DeployTimeCollector(namespaces))

exporters/deploytime/app.py Outdated Show resolved Hide resolved
exporters/deploytime/app.py Outdated Show resolved Hide resolved
exporters/deploytime/app.py Outdated Show resolved Hide resolved
exporters/deploytime/app.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@etsauer etsauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another small change.

logging.debug("Getting replica: %s, kind: %s, namespace: %s",
ownerRef.name, ownerRef.kind, namespace)

if replicas_dict.get(ownerRef.name+namespace):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the standard kubernetes notation for this would be directory or url style: namespace + "/" + ownerRef.name

apiResource = dyn_client.resources.get(api_version=apiVersion, kind=objectName)
replicationobjects = apiResource.get(label_selector=pelorus.get_app_label())
for replica in replicationobjects.items:
replicas[replica.metadata.name+replica.metadata.namespace] = replica
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the standard kubernetes notation for this would be directory or url style: replica.metadata.namespace + "/" + replica.metadata.name

Copy link
Collaborator

@etsauer etsauer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@malacourse ok, code looks good, and not only did we fix the original errors but we cut the run time down from 30s - 3 or 4 minutes (when one apiVersion DNE), to 1 to 2 seconds!

# When all apis exist
$ time curl localhost:8080
...
real	0m1.179s
user	0m0.012s
sys	0m0.007s

# When one apiVersion is missing
$ time curl localhost:8080
real	0m1.179s
user	0m0.012s
sys	0m0.007s

LGTM!

@etsauer etsauer merged commit 919191d into dora-metrics:master Sep 15, 2020
@malacourse malacourse mentioned this pull request Sep 15, 2020
@etsauer etsauer linked an issue Sep 21, 2020 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants