
Null values for metrics if interval is set to 1 minute #1290

Open
bluepixbe opened this issue Sep 22, 2020 · 18 comments
Labels
bug Something isn't working end-user:msft-defender prio:P0 All issues that are top priority

Comments

@bluepixbe (Contributor) commented Sep 22, 2020

If we set the interval to 1 minute, we see null metric values most of the time (>90% of cases). I started experimenting with the Azure Metrics API and I think I have found out why.

// 5 minutes interval
GET: https://management.azure.com/subscriptions/SID/resourceGroups/RG/providers/Microsoft.Compute/virtualMachines/VMname/providers/microsoft.insights/metrics?api-version=2018-01-01&interval=PT5M
-> returns complete data

// 1 minute interval
GET: https://management.azure.com/subscriptions/SID/resourceGroups/RG/providers/Microsoft.Compute/virtualMachines/VMname/providers/microsoft.insights/metrics?api-version=2018-01-01&interval=PT1M
Body:

{
    "cost": 0,
    "timespan": "2020-09-22T11:40:00Z/2020-09-22T12:40:00Z",
    "interval": "PT1M",
    "value": [
        {
            "id": "...",
            "type": "Microsoft.Insights/metrics",
            "name": {
                "value": "Percentage CPU",
                "localizedValue": "Percentage CPU"
            },
            "displayDescription": "The percentage of allocated compute units that are currently in use by the Virtual Machine(s)",
            "unit": "Percent",
            "timeseries": [
                {
                    "metadatavalues": [],
                    "data": [
                        ...
                        {
                            "timeStamp": "2020-09-22T12:36:00Z",
                            "average": 4.285
                        },
                        {
                            "timeStamp": "2020-09-22T12:37:00Z",
                            "average": 3.4
                        },
                        {
                            "timeStamp": "2020-09-22T12:38:00Z"
                        },
                        {
                            "timeStamp": "2020-09-22T12:39:00Z"
                        }
                    ]
                }
            ],
            "errorCode": "Success"
        }
    ],
    "namespace": "Microsoft.Compute/virtualMachines",
    "resourceregion": "westeurope"
}

As you can see, the most recent entries don't contain the average attribute. Sometimes the last two are missing, sometimes only the most recent one; only rarely is the response complete.

I'm not at all familiar with the Azure Metrics API. Do you know if this is the usual behavior of Azure, which should be handled in Promitor, or is it more of an Azure issue?
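For illustration, here is a minimal Python sketch (not Promitor code; the `response` dict is abridged from the JSON body above) showing how the trailing gap surfaces when parsing such a response:

```python
# Minimal sketch (not Promitor code): `response` is abridged from the
# Azure Monitor JSON body shown above.
response = {
    "interval": "PT1M",
    "value": [{
        "name": {"value": "Percentage CPU"},
        "timeseries": [{
            "data": [
                {"timeStamp": "2020-09-22T12:36:00Z", "average": 4.285},
                {"timeStamp": "2020-09-22T12:37:00Z", "average": 3.4},
                {"timeStamp": "2020-09-22T12:38:00Z"},  # "average" missing
                {"timeStamp": "2020-09-22T12:39:00Z"},  # "average" missing
            ],
        }],
    }],
}

data = response["value"][0]["timeseries"][0]["data"]
newest = data[-1]
print(newest.get("average"))  # prints None: the newest bucket has no value yet
```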

Expected Behavior

Found value 3.4 for metric azure_virtual_machine_percentage_cpu with aggregation interval 00:01:00

Actual Behavior

Found value null for metric azure_virtual_machine_percentage_cpu with aggregation interval 00:01:00

Steps to Reproduce the Problem

  1. Start the scraper with the scrape metric definition below, or just fire the requests above against a virtual machine resource

Configuration

Provide insights in the configuration that you are using:

  • Used scraping configuration:

- name: azure_virtual_machine_percentage_cpu
  description: "Average percentage cpu usage on an Azure virtual machine"
  resourceType: VirtualMachine
  scraping:
    schedule: "0 */2 * ? * *"
  azureMetricConfiguration:
    metricName: Percentage CPU
    aggregation:
      type: Average
      interval: 00:01:00
  resourceDiscoveryGroups:
  - name: vm-landscape

Specifications

  • Version: 2.0.0-preview-3, but I don't think it's version-related (same behavior in preview-2)
  • Platform: Docker
bluepixbe added the bug Something isn't working label Sep 22, 2020
@tomkerkhove (Owner)

Unfortunately there is a lag of up to 5 minutes before Azure Monitor surfaces metrics, which is what you are seeing here. We fully rely on the Azure API and, as you point out in the issue, we don't get the values yet, so there's nothing we can do about it as far as I know.

What you can do (though you would have to verify the outcome) is set the scraping interval to 1 minute with an aggregation of 5 minutes.
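For reference, the suggested workaround could look roughly like this in the scraping declaration (a sketch only: the every-minute Cron expression is my assumption and should be checked against the Promitor scheduling docs):

```yaml
- name: azure_virtual_machine_percentage_cpu
  resourceType: VirtualMachine
  scraping:
    schedule: "0 * * ? * *"   # every minute (assumed Cron syntax)
  azureMetricConfiguration:
    metricName: Percentage CPU
    aggregation:
      type: Average
      interval: 00:05:00      # 5-minute window so values have had time to surface
```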

@bluepixbe (Contributor, Author)

@tomkerkhove OK, thanks for the feedback. I will try that out.

@tomkerkhove (Owner)

Sorry for the bad news :/

@bluepixbe (Contributor, Author)

No worries. At least with the 5min aggregation you get the values ;)

Nevertheless, I have one suggestion that could make sense in my opinion. Since the 1-minute interval will more or less always lead to null values, which add no value at all, why not walk back up the time series until Promitor finds the newest valid value? With that, the 1-minute interval would work and you would get a more accurate result. What do you think?

Example:

{
    "interval": "PT1M",
    "value": [
        {
            "name": { "value": "Percentage CPU" },
            "timeseries": [
                {
                    "data": [
                        ...
                        { "timeStamp": "2020-09-22T12:36:00Z", "average": 4.285 },
                        { "timeStamp": "2020-09-22T12:37:00Z", "average": 3.4 },
                        { "timeStamp": "2020-09-22T12:38:00Z" },
                        { "timeStamp": "2020-09-22T12:39:00Z" }
                    ]
                }
            ]
        }
    ]
}

In this case, Promitor would use 3.4 from timestamp 2020-09-22T12:37:00Z instead of null from 2020-09-22T12:39:00Z.
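The "walk back to the newest valid value" idea can be sketched in a few lines of Python (a hypothetical helper, not existing Promitor code):

```python
def latest_reported_value(data, aggregation="average"):
    """Walk the time series from newest to oldest and return the first
    data point that actually carries a value. Hypothetical helper
    sketching the fallback proposed above; not Promitor code."""
    for point in reversed(data):
        if aggregation in point:
            return point["timeStamp"], point[aggregation]
    return None, None  # no bucket in the whole timespan had a value

# Data points from the response above:
data = [
    {"timeStamp": "2020-09-22T12:36:00Z", "average": 4.285},
    {"timeStamp": "2020-09-22T12:37:00Z", "average": 3.4},
    {"timeStamp": "2020-09-22T12:38:00Z"},
    {"timeStamp": "2020-09-22T12:39:00Z"},
]
print(latest_reported_value(data))  # prints ('2020-09-22T12:37:00Z', 3.4)
```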

@tomkerkhove (Owner)

We could do that, but if you want 1-minute metric updates, should we report the one from 3 minutes ago? That would lead to stale metric information, which is dangerous/confusing. What's your use case?

We could (and I'm not committing to it yet) give you a flag that says "give me the latest metric that has a value", but that can be tricky as well: some metrics are null because nothing was reported at all, so the flag would go all the way back to the last measured value from, say, last week and report that today. I don't think that's the intent here?

@bluepixbe (Contributor, Author)

I don't think it's tied to a specific use case. It's more that you would like metrics from Azure that are as close to real time and as little aggregated as possible.
Imagine you have a VPN connection whose availability (in percent) you monitor. Depending on the downtime, aggregation could mean you never see that it was down.

But I fully see your concern about showing older metrics as "new". This would become a big issue if Azure did not provide metrics for more than, say, 4-5 minutes. Sure, you could also catch such cases, but I fully agree it wouldn't be a clean solution.

What would be nice is to have this "issue" documented somewhere: setting an interval of 1 minute will always lead to null/NaN values. From my point of view we can close it for now. Maybe I'll have a better idea someday ;)

@tomkerkhove (Owner)

There are still data gaps; for example, Azure Cosmos DB tends to be slow when queried. (FYI @SudhakarNandigam-TomTom)

I'm querying the Azure Monitor API at 3:13 PM with an aggregation of 5 min and find the following 2 time series:

  • 5/6/2021 3:08:00 PM with no data (~5 min ago)
  • 5/6/2021 3:03:00 PM with value 4 (~10 min ago)

Today, Promitor will report null while you might want to see 4. However, this is a data gap if you ask me, so here is what I propose:

  • By default, Promitor keeps on working as it is
  • However, end-users can configure skipDataGaps: true so that Promitor takes the last reported metric value and logs a warning

Would this be something you would enable @bluepixbe @SudhakarNandigam-TomTom @adamconnelly @adam-resdiary ?

Relates to #711
Relates to #1621

@tomkerkhove (Owner)

This happens often when using a very low frequency:
[screenshot]

@adamconnelly (Contributor)

@tomkerkhove I've actually left ResDiary now and I'm not in a position to use this, but @ResDiaryLewis or @elliot-resdiary might be interested.

@tomkerkhove (Owner)

Sorry to hear and best of luck @adamconnelly!

@bluepixbe (Contributor, Author)

@tomkerkhove thanks for asking!
To me this looks like a good compromise, and I think we would enable this option. Promitor would just have to make sure that, for example, for deleted resources you don't get the last reported metric value.

@tomkerkhove (Owner)

This morning I was thinking about two things:

  • Would you want a global flag or a per-metric one?
  • Would you want to define the number of gaps you consider acceptable to skip? (i.e. only one)
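The proposed skipDataGaps behaviour with a configurable gap limit could be sketched like this (hypothetical Python, nothing here is implemented in Promitor):

```python
def value_with_gap_skipping(data, aggregation="average", max_gaps_to_skip=1):
    """Hypothetical sketch of the proposed skipDataGaps behaviour with a
    configurable cap on how many trailing gaps may be skipped."""
    skipped = 0
    for point in reversed(data):
        if aggregation in point:
            return point[aggregation]
        skipped += 1
        if skipped > max_gaps_to_skip:
            break  # too many consecutive gaps: report null, not stale data
    return None

# Time series from the Cosmos DB example above (queried around 3:13 PM):
data = [
    {"timeStamp": "2021-05-06T15:03:00Z", "average": 4},  # ~10 min ago
    {"timeStamp": "2021-05-06T15:08:00Z"},                # ~5 min ago, gap
]
print(value_with_gap_skipping(data, max_gaps_to_skip=1))  # prints 4
print(value_with_gap_skipping(data, max_gaps_to_skip=0))  # prints None
```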

@ResDiaryLewis

Would this be something you would enable @bluepixbe @SudhakarNandigam-TomTom @adamconnelly @adam-resdiary ?

Hey Tom! I don't think we'd enable this at ResDiary. We rarely see any gaps with our current configuration (1-minute scrape interval, 5-minute aggregation), so we're not really affected by this issue, and we don't plan on changing our configuration anytime soon. I also think this behaviour would be pretty confusing on a dashboard: surely seeing two occurrences of 4 would be misleading.

Anyway, if it were opt-in, I see no reason not to implement it if you'd like to!

@tomkerkhove (Owner)

May I ask which Azure services you are scraping? From what I've seen, this depends highly on the Azure service in question not providing consistent/fast metrics.

@ResDiaryLewis

May I ask what Azure services you are using to scrape?

Sure, we're scraping:

  • SQL Database
  • Azure Load Balancer
  • Redis Cache
  • Service Bus

@bluepixbe (Contributor, Author)

This morning I was thinking about two things:

  • Would you want a global flag or a per-metric one?
  • Would you want to define the number of gaps you consider acceptable to skip? (i.e. only one)

@tomkerkhove

  • We would definitely define it globally; being able to override it at the metric level would be a nice-to-have in my opinion
  • That would be a nice option indeed

tomkerkhove added the prio:P0 All issues that are top priority label Oct 12, 2021
@imarkvisser

We experience the same: metrics returning null, sometimes resulting in gaps in our time series. Not sure if skipDataGaps is still something you are considering adding.

@tomkerkhove (Owner)

Yes, but I haven't had time yet, so I'm open to contributions!
