
Datadog scaler is not able to find matching metrics #2657

Closed
rr-paras-patel opened this issue Feb 17, 2022 · 19 comments · Fixed by #2694
Assignees
Labels
bug Something isn't working

Comments

@rr-paras-patel

rr-paras-patel commented Feb 17, 2022

Report

I have the Datadog scaler configured on an AWS EKS cluster with KEDA 2.6.1.
I am using an Nginx requests-per-second metric for scaling, and it works fine at first.
The setup behaves as expected for a few minutes. After that it starts throwing errors about not being able to find matching metrics, then auto-recovers after a few minutes. It keeps cycling like this and stays unstable.

Error events on HPA

AbleToScale     True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetExternalMetric  the HPA was unable to compute the replica count: unable to get external metric proxy-demo/s1-datadog-max-nginx-net-request_per_s/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: datadog-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
  ScalingLimited  False   DesiredWithinRange       the desired count is within the acceptable range
Events:
  Type     Reason                   Age                     From                       Message
  ----     ------                   ----                    ----                       -------
  Warning  FailedGetExternalMetric  59s (x1494 over 6h15m)  horizontal-pod-autoscaler  unable to get external metric proxy-demo/s1-datadog-max-nginx-net-request_per_s/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: datadog-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s1-datadog-max-nginx-net-request_per_s

Expected Behavior

Once KEDA is able to fetch metrics from Datadog, it should keep working in a steady state.

Actual Behavior

It intermittently throws errors about not being able to fetch metrics, then auto-recovers.

Steps to Reproduce the Problem

  1. Deploy the nginx proxy app
  2. Deploy a KEDA ScaledObject with an nginx RPS trigger
  3. Generate traffic
  4. Wait 10 to 15 minutes
  5. Describe the HPA object; it will show the error events

Logs from KEDA operator

Error logs on keda-operator-metrics-api

E0217 16:35:40.580125       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"no matching metrics found for s1-datadog-max-nginx-net-request_per_s"}: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
E0217 16:35:55.656813       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"no matching metrics found for s1-datadog-max-nginx-net-request_per_s"}: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
E0217 16:36:10.733747       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"no matching metrics found for s1-datadog-max-nginx-net-request_per_s"}: no matching metrics found for s1-datadog-max-nginx-net-request_per_s

KEDA Version

2.6.1

Kubernetes Version

1.21

Platform

Amazon Web Services

Scaler Details

Datadog

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: datadog-scaledobject
spec:
  scaleTargetRef:
    name: nginx
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 15
  cooldownPeriod: 10
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 10
  triggers:
  - type: datadog
    metadata:
      query: "avg:nginx.net.request_per_s{cluster:cluster1}.rollup(15)"
      queryValue: "6"
      # Optional: (Global or Average). Whether the target value is global or average per pod. Default: Average
      type: "average"
      # Optional: The time window (in seconds) to retrieve metrics from Datadog. Default: 90
      age: "15"
    authenticationRef:
      name: keda-trigger-auth-datadog-secret

Anything else?

cc : @arapulido

@rr-paras-patel added the "bug / Something isn't working" label on Feb 17, 2022
@rr-paras-patel changed the title from "Datadog scaler is having issue" to "Datadog scaler is not stable" on Feb 17, 2022
@tomkerkhove
Member

This sounds similar to #2632, can you please double check?

@tomkerkhove changed the title from "Datadog scaler is not stable" to "Datadog scaler is not able to find matching metrics" on Feb 18, 2022
@rr-paras-patel
Author

This sounds similar to #2632, can you please double check?

It seems issue #2632 was related to a missing IAM permission; in my case we don't interact with any IAM service. Authentication with Datadog works fine, but after 10 to 15 minutes it suddenly starts throwing errors and then auto-resolves. Maybe it is tied to Datadog rate limiting. One thing we definitely need to get better at is logging the HTTP response in this scenario; I tried log level DEBUG but didn't see any useful info.

@rr-paras-patel
Author

rr-paras-patel commented Feb 18, 2022

@arapulido (who is an active contributor to the Datadog scaler) also confirmed she is able to reproduce this issue in a local environment after it runs for a few minutes.

@benjaminwood

We are seeing the same behavior. Here's an excerpt from the HPA:

Warning  FailedGetExternalMetric       4m31s (x74 over 29m)  horizontal-pod-autoscaler  unable to get external metric analytics/s1-datadog-sum-trace-rack-request-hits-by_http_status/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: rails-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s1-datadog-sum-trace-rack-request-hits-by_http_status

@tomkerkhove
Member

Definitely something to improve. Anyone open to contributing this?

@arapulido
Contributor

Yes, I am already looking into this and I am working on some other improvements.
This is not related to rate limiting, though. If it were, it wouldn't recover that fast. It is caused by sometimes not getting a metric back and KEDA cancelling the context (which is why the HPA logs the warning).

I will work on a patch that makes this more resilient, and also to make it clearer in the error when the user hits rate-limiting.

@tomkerkhove
Member

Awesome, thank you!

@arapulido
Contributor

I just created a PR to fix this (and other improvements) in 2.6.

The problem described in this issue is partially caused by selecting a time window that is too small (15 seconds). In many cases Datadog doesn't have that metric yet and returns an empty value, which the HPA logs as a warning. In general, always try a bigger window; we will use the last (most recent) point returned, as in the sketch below.
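
For illustration, here is the trigger from this issue with a larger window (a sketch only; the age value is an example, and anything at or above the 90 second default avoids most of the empty responses):

  triggers:
  - type: datadog
    metadata:
      query: "avg:nginx.net.request_per_s{cluster:cluster1}.rollup(15)"
      queryValue: "6"
      type: "average"
      # Larger time window so Datadog has at least one point available
      age: "120"
    authenticationRef:
      name: keda-trigger-auth-datadog-secret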

Ideally, we should discourage this behaviour (selecting a specific window), so in 2.7 we should probably introduce a breaking change and remove the "age" parameter.

@tomkerkhove
Member

Makes sense, thanks!

Ideally, we should discourage this behaviour (selecting a specific window), so in 2.7 we should probably introduce a breaking change and remove the "age" parameter.

We can't do that though, so we will have to wait for v3.0 and document this. Thoughts @kedacore/keda-maintainers?

@arapulido
Contributor

We can't do that though, so we will have to wait for v3.0 and document this. Thoughts @kedacore/keda-maintainers?

That's OK, we can wait. Or we could make other changes that don't break the API, e.g. enforce a minimum of 90 seconds for the age parameter: would that be allowed as a 2.7 change? Thanks!

@JorTurFer
Member

If we can wait, that's always better. In the worst case, the Datadog scaler was only introduced 2 months ago, so I guess the user base is not huge yet (but I prefer to wait if possible).

@benjaminwood

I believe there is a legitimate use case where Datadog would return an empty value. For example, imagine a query counting requests with a 500 status. In our case, we have such a trigger that scales an object if a certain threshold (of 500s) is reached. Ideally there would be times when there are no 500s at all (even in a large window of time).

Am I understanding the problem correctly? If so, would it be possible to interpret an empty value as 0? Or, is that difficult because the HPA is where the root problem lies?
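
For reference, a trigger along these lines looks roughly like the sketch below (the query and threshold are illustrative, not our exact configuration):

  triggers:
  - type: datadog
    metadata:
      # Count of errored requests over the window; legitimately empty when nothing fails
      query: "sum:trace.rack.request.errors{env:production}.as_count()"
      queryValue: "10"
      type: "global"
    authenticationRef:
      name: keda-trigger-auth-datadog-secret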

@tomkerkhove
Member

Agreed, that's a valid concern and it will definitely happen. I would argue 0 is a bit misleading because you cannot separate it from a real 0 value, but -1 is also not ideal, so 0 is fine I guess.

@benjaminwood

Yeah, I agree 0 could be misleading. Perhaps a default value could be provided as an argument? In my scenario, I could safely set it to 0. Others may want a different default value in the event that the metric is null.

One thing is for sure: the existing behavior is undesirable under most (all?) circumstances. Logging a warning that the metric is null seems appropriate, but breaking the trigger is not. In my case, the HPA scaled pods up because of another trigger and then never scaled them down, because the null metric broke the comparison.

@tomkerkhove
Member

tomkerkhove commented Mar 3, 2022

Fully agree. @arapulido Would you mind incorporating the following:

  • No metric? Then we log a warning.
  • End users can configure a metricUnavailableValue property to use when the metric is not there. If nothing is configured, I believe 0 is the safest value? Or -1, but I'm not sure how the HPA would react to that.

Thoughts @kedacore/keda-maintainers?

@JorTurFer
Member

I agree with the first point. For the second point, I think that we should raise an error if the metric is not available. KEDA already has a fallback system for this; only a raised error is needed. If we add a scaler-local fallback, I think we are duplicating the responsibilities of the existing fallback system.
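
For context, this is a sketch of the existing ScaledObject-level fallback (values are illustrative): when a scaler fails to return metrics failureThreshold times in a row, KEDA reports a metric that drives the HPA toward the configured replica count.

spec:
  fallback:
    # Number of consecutive scaler failures before falling back
    failureThreshold: 3
    # Replica count to fall back to while the scaler keeps failing
    replicas: 2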

@arapulido
Contributor

I think there are some cases in which not having a metric doesn't mean that it should be 0. I will be doing some more testing on my side, and I will post the findings here.

@arapulido
Contributor

I think that filling with a number in all cases is misleading, and we should only do it if the user explicitly asks for it.

So, what about the following:

Add a new metricUnavailableFiller optional property that the user can set to the value the metric should be filled with when it is not available. If that property is not set in the ScaledObject definition, then return an error as we do now. (By the way, the HPA reacts to the error with a warning; if the errors persist, it will mark the HPA as inactive, but it comes back once a metric is returned.)

@tomkerkhove
Member

Sounds good to me, but I would call it metricUnavailableValue.
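
A sketch of the proposed usage, following the naming suggested above (the property and its value are illustrative until the change lands):

  triggers:
  - type: datadog
    metadata:
      query: "avg:nginx.net.request_per_s{cluster:cluster1}"
      queryValue: "6"
      # Value to report when Datadog returns no data points
      metricUnavailableValue: "0"
    authenticationRef:
      name: keda-trigger-auth-datadog-secret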
