
Datadog scaler is not able to find matching metrics #2657

Closed
rr-paras-patel opened this issue Feb 17, 2022 · 19 comments · Fixed by #2694
Assignees
Labels
bug Something isn't working

Comments

@rr-paras-patel

rr-paras-patel commented Feb 17, 2022

Report

I have the Datadog scaler configured on an AWS EKS cluster with KEDA 2.6.1.
I am using an Nginx requests-per-second metric for scaling, and it works fine at first.
The setup behaves as expected for a few minutes. After that it starts throwing errors about not being able to find matching metrics, then auto-recovers after a few minutes. It keeps cycling like this and stays unstable.

Error events on HPA

AbleToScale     True    SucceededGetScale        the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetExternalMetric  the HPA was unable to compute the replica count: unable to get external metric proxy-demo/s1-datadog-max-nginx-net-request_per_s/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: datadog-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
  ScalingLimited  False   DesiredWithinRange       the desired count is within the acceptable range
Events:
  Type     Reason                   Age                     From                       Message
  ----     ------                   ----                    ----                       -------
  Warning  FailedGetExternalMetric  59s (x1494 over 6h15m)  horizontal-pod-autoscaler  unable to get external metric proxy-demo/s1-datadog-max-nginx-net-request_per_s/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: datadog-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s1-datadog-max-nginx-net-request_per_s

Expected Behavior

Once KEDA is able to fetch metrics from Datadog, it should keep working in a steady state.

Actual Behavior

It intermittently throws errors about not being able to fetch metrics, then auto-recovers.

Steps to Reproduce the Problem

  1. Deploy the nginx proxy app
  2. Deploy a KEDA ScaledObject with an nginx RPS trigger
  3. Generate traffic
  4. Wait 10 to 15 minutes
  5. Describe the HPA object; it will show the error events

Logs from KEDA operator

Error logs on keda-operator-metrics-api

E0217 16:35:40.580125       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"no matching metrics found for s1-datadog-max-nginx-net-request_per_s"}: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
E0217 16:35:55.656813       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"no matching metrics found for s1-datadog-max-nginx-net-request_per_s"}: no matching metrics found for s1-datadog-max-nginx-net-request_per_s
E0217 16:36:10.733747       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"no matching metrics found for s1-datadog-max-nginx-net-request_per_s"}: no matching metrics found for s1-datadog-max-nginx-net-request_per_s

KEDA Version

2.6.1

Kubernetes Version

1.21

Platform

Amazon Web Services

Scaler Details

Datadog

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: datadog-scaledobject
spec:
  scaleTargetRef:
    name: nginx
  minReplicaCount: 1
  maxReplicaCount: 3
  pollingInterval: 15
  cooldownPeriod: 10
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 10
  triggers:
  - type: datadog
    metadata:
      query: "avg:nginx.net.request_per_s{cluster:cluster1}.rollup(15)"
      queryValue: "6"
      # Optional: (Global or Average). Whether the target value is global or average per pod. Default: Average
      type: "average"
      # Optional: The time window (in seconds) to retrieve metrics from Datadog. Default: 90
      age: "15"
    authenticationRef:
      name: keda-trigger-auth-datadog-secret

Anything else?

cc : @arapulido

@rr-paras-patel added the "bug / Something isn't working" label on Feb 17, 2022
@rr-paras-patel changed the title from "Datadog scaler is having issue" to "Datadog scaler is not stable" on Feb 17, 2022
@tomkerkhove
Member

This sounds similar to #2632, can you please double check?

@tomkerkhove changed the title from "Datadog scaler is not stable" to "Datadog scaler is not able to find matching metrics" on Feb 18, 2022
@rr-paras-patel
Author

This sounds similar to #2632, can you please double check?

It seems issue #2632 was related to a missing IAM permission; in my case we don't interact with any IAM service. Authentication with Datadog works fine, but after 10 to 15 minutes it suddenly starts throwing errors and then auto-resolves. Maybe it is tied to Datadog rate limiting. One thing we definitely need to get better at is logging the HTTP response in this scenario; I tried log level DEBUG but didn't see any useful info.

@rr-paras-patel
Author

rr-paras-patel commented Feb 18, 2022

@arapulido (who is an active contributor to the Datadog scaler) also confirmed she is able to reproduce this issue in a local environment after it runs for a few minutes.

@benjaminwood

We are seeing the same behavior. Here's an excerpt from the HPA:

Warning  FailedGetExternalMetric       4m31s (x74 over 29m)  horizontal-pod-autoscaler  unable to get external metric analytics/s1-datadog-sum-trace-rack-request-hits-by_http_status/&LabelSelector{MatchLabels:map[string]string{scaledobject.keda.sh/name: rails-scaledobject,},MatchExpressions:[]LabelSelectorRequirement{},}: unable to fetch metrics from external metrics API: no matching metrics found for s1-datadog-sum-trace-rack-request-hits-by_http_status

@tomkerkhove
Member

Definitely something to improve. Anyone open to contributing this?

@arapulido
Contributor

Yes, I am already looking into this and I am working on some other improvements.
This is not related to rate limiting, though. If it were, it wouldn't recover that fast. It is caused by sometimes not getting a metric back and KEDA cancelling the context (which is why the HPA logs the warning).

I will work on a patch that makes this more resilient, and also to make it clearer in the error when the user hits rate-limiting.

@tomkerkhove
Member

Awesome, thank you!

@arapulido
Contributor

I just created a PR to fix this (and other improvements) in 2.6.

The problem described in this issue is partially caused by selecting a time window that is too small (15 seconds). In many cases Datadog doesn't have that metric yet and returns an empty value, which the HPA logs as a warning. In general, always try a bigger window; we will use the last (most recent) point returned, as in the sketch below.
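
For illustration, here is the trigger from this issue with a larger window (a sketch only; the age value is an example, and anything at or above the 90 second default avoids most of the empty responses):

  triggers:
  - type: datadog
    metadata:
      query: "avg:nginx.net.request_per_s{cluster:cluster1}.rollup(15)"
      queryValue: "6"
      type: "average"
      # Larger time window so Datadog has at least one point available
      age: "120"
    authenticationRef:
      name: keda-trigger-auth-datadog-secret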

Ideally, we should discourage this behaviour (selecting a specific window), so in 2.7 we should probably introduce a breaking change and remove the "age" parameter.

@tomkerkhove
Member

Makes sense, thanks!

Ideally, we should discourage this behaviour (selecting a specific window), so in 2.7 we should probably introduce a breaking change and remove the "age" parameter.

We can't do that though, so we will have to wait for v3.0 and document this. Thoughts @kedacore/keda-maintainers?

@arapulido
Contributor

We can't do that though, so we will have to wait for v3.0 and document this. Thoughts @kedacore/keda-maintainers?

That's OK, we can wait. Or we could make other changes that don't break the API, e.g. enforce a minimum of 90 seconds for the age parameter: would that be allowed as a 2.7 change? Thanks!

@JorTurFer
Member

If we can wait, that's always better. In the worst case, the Datadog scaler was only introduced 2 months ago, so I guess the user base is not huge yet (but I prefer to wait if possible).

@benjaminwood

I believe there is a legitimate use case where Datadog would return an empty value. For example, imagine a query counting requests with a 500 status. In our case, we have such a trigger that scales an object if a certain threshold (of 500s) is reached. Ideally there would be times when there are no 500s at all (even in a large window of time).

Am I understanding the problem correctly? If so, would it be possible to interpret an empty value as 0? Or, is that difficult because the HPA is where the root problem lies?
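
For reference, a trigger along these lines looks roughly like the sketch below (the query and threshold are illustrative, not our exact configuration):

  triggers:
  - type: datadog
    metadata:
      # Count of errored requests over the window; legitimately empty when nothing fails
      query: "sum:trace.rack.request.errors{env:production}.as_count()"
      queryValue: "10"
      type: "global"
    authenticationRef:
      name: keda-trigger-auth-datadog-secret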

@tomkerkhove
Member

Agreed, that's a valid concern and it will definitely happen. I would argue 0 is a bit misleading because you cannot separate it from a real 0 value, but -1 is also not ideal, so 0 is fine I guess.

@benjaminwood

Yeah, I agree 0 could be misleading. Perhaps a default value could be provided as an argument? In my scenario, I could safely set it to 0. Others may want a different default value in the event that the metric is null.

One thing is for sure: the existing behavior is undesirable under most (all?) circumstances. Logging a warning that the metric is null seems appropriate, but breaking the trigger is not. In my case, the HPA scaled pods up because of another trigger and then never scaled them down, because the null metric broke the comparison.

@tomkerkhove
Member

tomkerkhove commented Mar 3, 2022

Fully agree. @arapulido Would you mind incorporating the following:

  • No metric? Then we log a warning.
  • End users can configure a metricUnavailableValue property to use when the metric is not there. If nothing is configured, I believe 0 is the safest value? Or -1, but I'm not sure how the HPA would react to that.

Thoughts @kedacore/keda-maintainers?

@JorTurFer
Member

I agree with the first point. For the second point, I think that we should raise an error if the metric is not available. KEDA already has a fallback system for this; only a raised error is needed. If we add a scaler-local fallback, I think we are duplicating the responsibilities of the existing fallback system.
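
For context, this is a sketch of the existing ScaledObject-level fallback (values are illustrative): when a scaler fails to return metrics failureThreshold times in a row, KEDA reports a metric that drives the HPA toward the configured replica count.

spec:
  fallback:
    # Number of consecutive scaler failures before falling back
    failureThreshold: 3
    # Replica count to fall back to while the scaler keeps failing
    replicas: 2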

@arapulido
Contributor

I think there are some cases in which not having a metric doesn't mean that it should be 0. I will be doing some more testing on my side, and I will post the findings here.

@arapulido
Contributor

I think that filling with a number in all cases is misleading, and we should only do it if the user explicitly asks for it.

So, what about the following:

Add a new metricUnavailableFiller optional property that the user can set to the value the metric should be filled with when it is not available. If that property is not set in the ScaledObject definition, then return an error as we do now. (By the way, the HPA reacts to the error with a warning; if the errors persist, it will mark the HPA as inactive, but it comes back once a metric is returned.)

@tomkerkhove
Member

Sounds good to me, but I would call it metricUnavailableValue.
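
A sketch of the proposed usage, following the naming suggested above (the property and its value are illustrative until the change lands):

  triggers:
  - type: datadog
    metadata:
      query: "avg:nginx.net.request_per_s{cluster:cluster1}"
      queryValue: "6"
      # Value to report when Datadog returns no data points
      metricUnavailableValue: "0"
    authenticationRef:
      name: keda-trigger-auth-datadog-secret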
