
Null values for metrics if interval is set to 1 minute #1290

Open
bluepixbe opened this issue Sep 22, 2020 · 18 comments
Labels
bug Something isn't working end-user:msft-defender prio:P0 All issues that are top priority

Comments

@bluepixbe (Contributor) commented Sep 22, 2020

If we set the interval to 1 minute, we see null metric values most of the time (>90% of cases). I started experimenting with the Azure Metrics API and I think I have found out why.

// 5 minutes interval
GET: https://management.azure.com/subscriptions/SID/resourceGroups/RG/providers/Microsoft.Compute/virtualMachines/VMname/providers/microsoft.insights/metrics?api-version=2018-01-01&interval=PT5M
-> returns complete data

// 1 minute interval
GET: https://management.azure.com/subscriptions/SID/resourceGroups/RG/providers/Microsoft.Compute/virtualMachines/VMname/providers/microsoft.insights/metrics?api-version=2018-01-01&interval=PT1M
Body:

{
    "cost": 0,
    "timespan": "2020-09-22T11:40:00Z/2020-09-22T12:40:00Z",
    "interval": "PT1M",
    "value": [
        {
            "id": "...",
            "type": "Microsoft.Insights/metrics",
            "name": {
                "value": "Percentage CPU",
                "localizedValue": "Percentage CPU"
            },
            "displayDescription": "The percentage of allocated compute units that are currently in use by the Virtual Machine(s)",
            "unit": "Percent",
            "timeseries": [
                {
                    "metadatavalues": [],
                    "data": [
                        ...
                        {
                            "timeStamp": "2020-09-22T12:36:00Z",
                            "average": 4.285
                        },
                        {
                            "timeStamp": "2020-09-22T12:37:00Z",
                            "average": 3.4
                        },
                        {
                            "timeStamp": "2020-09-22T12:38:00Z"
                        },
                        {
                            "timeStamp": "2020-09-22T12:39:00Z"
                        }
                    ]
                }
            ],
            "errorCode": "Success"
        }
    ],
    "namespace": "Microsoft.Compute/virtualMachines",
    "resourceregion": "westeurope"
}

As you can see, the most recent entries don't contain the average attribute. Sometimes the last two are missing, sometimes only the most recent one; only rarely is the response complete.

I'm not at all familiar with the Azure Metrics API. Do you know if this is the usual behavior of Azure, which should be handled in Promitor, or is it more of an Azure issue?
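For illustration, here is a minimal Python sketch (not Promitor code; the `response` dict is abridged from the JSON body above) showing how the trailing gap surfaces when parsing such a response:

```python
# Minimal sketch (not Promitor code): `response` is abridged from the
# Azure Monitor JSON body shown above.
response = {
    "interval": "PT1M",
    "value": [{
        "name": {"value": "Percentage CPU"},
        "timeseries": [{
            "data": [
                {"timeStamp": "2020-09-22T12:36:00Z", "average": 4.285},
                {"timeStamp": "2020-09-22T12:37:00Z", "average": 3.4},
                {"timeStamp": "2020-09-22T12:38:00Z"},  # "average" missing
                {"timeStamp": "2020-09-22T12:39:00Z"},  # "average" missing
            ],
        }],
    }],
}

data = response["value"][0]["timeseries"][0]["data"]
newest = data[-1]
print(newest.get("average"))  # prints None: the newest bucket has no value yet
```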

Expected Behavior

Found value 3.4 for metric azure_virtual_machine_percentage_cpu with aggregation interval 00:01:00

Actual Behavior

Found value null for metric azure_virtual_machine_percentage_cpu with aggregation interval 00:01:00

Steps to Reproduce the Problem

  1. Start the scraper with the scrape metric definition below, or just fire the requests above against a virtual machine resource

Configuration

Provide insights in the configuration that you are using:

  • Used scraping configuration:

- name: azure_virtual_machine_percentage_cpu
  description: "Average percentage cpu usage on an Azure virtual machine"
  resourceType: VirtualMachine
  scraping:
    schedule: "0 */2 * ? * *"
  azureMetricConfiguration:
    metricName: Percentage CPU
    aggregation:
      type: Average
      interval: 00:01:00
  resourceDiscoveryGroups:
  - name: vm-landscape

Specifications

  • Version: 2.0.0-preview-3, but I don't think it's version-related (same behavior in preview-2)
  • Platform: Docker
bluepixbe added the bug Something isn't working label Sep 22, 2020
@tomkerkhove (Owner)

Unfortunately there is a lag of up to 5 minutes before Azure Monitor surfaces metrics, which is what you are seeing here. We fully rely on the Azure API and, as you point out in the issue, we don't get the values yet, so there's nothing we can do about it as far as I know.

What you can do (though you would have to verify the outcome) is set the scraping interval to 1 minute with an aggregation of 5 minutes.
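For reference, the suggested workaround could look roughly like this in the scraping declaration (a sketch only: the every-minute Cron expression is my assumption and should be checked against the Promitor scheduling docs):

```yaml
- name: azure_virtual_machine_percentage_cpu
  resourceType: VirtualMachine
  scraping:
    schedule: "0 * * ? * *"   # every minute (assumed Cron syntax)
  azureMetricConfiguration:
    metricName: Percentage CPU
    aggregation:
      type: Average
      interval: 00:05:00      # 5-minute window so values have had time to surface
```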

@bluepixbe (Contributor, Author)

@tomkerkhove OK, thanks for the feedback. I will try that out.

@tomkerkhove (Owner)

Sorry for the bad news :/

@bluepixbe (Contributor, Author)

No worries. At least with the 5min aggregation you get the values ;)

Nevertheless, I have one suggestion that could make sense in my opinion. Since the 1-minute interval will more or less always lead to null values, which add no value at all, why not walk back up the time series until Promitor finds the newest valid value? With that, the 1-minute interval would work and you would get a more accurate result. What do you think?

Example:

{
    "interval": "PT1M",
    "value": [
        {
            "name": { "value": "Percentage CPU" },
            "timeseries": [
                {
                    "data": [
                        ...
                        { "timeStamp": "2020-09-22T12:36:00Z", "average": 4.285 },
                        { "timeStamp": "2020-09-22T12:37:00Z", "average": 3.4 },
                        { "timeStamp": "2020-09-22T12:38:00Z" },
                        { "timeStamp": "2020-09-22T12:39:00Z" }
                    ]
                }
            ]
        }
    ]
}

In this case, Promitor would use 3.4 from timestamp 2020-09-22T12:37:00Z instead of null from 2020-09-22T12:39:00Z.
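The "walk back to the newest valid value" idea can be sketched in a few lines of Python (a hypothetical helper, not existing Promitor code):

```python
def latest_reported_value(data, aggregation="average"):
    """Walk the time series from newest to oldest and return the first
    data point that actually carries a value. Hypothetical helper
    sketching the fallback proposed above; not Promitor code."""
    for point in reversed(data):
        if aggregation in point:
            return point["timeStamp"], point[aggregation]
    return None, None  # no bucket in the whole timespan had a value

# Data points from the response above:
data = [
    {"timeStamp": "2020-09-22T12:36:00Z", "average": 4.285},
    {"timeStamp": "2020-09-22T12:37:00Z", "average": 3.4},
    {"timeStamp": "2020-09-22T12:38:00Z"},
    {"timeStamp": "2020-09-22T12:39:00Z"},
]
print(latest_reported_value(data))  # prints ('2020-09-22T12:37:00Z', 3.4)
```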

@tomkerkhove (Owner)

We could do that, but if you want 1-minute metric updates, should we report the one from 3 minutes ago? That would lead to stale metric information, which is dangerous/confusing. What's your use case?

We could (and I'm not committing to it yet) give you a flag that says "give me the latest metric that has a value", but that can be tricky as well: some metrics are null because nothing was reported at all, so the flag would go all the way back to the last measured value from, say, last week and report that today. I don't think that's the intent here?

@bluepixbe (Contributor, Author)

I don't think it's tied to a specific use case. It's more that you would like metrics from Azure that are as close to real time and as little aggregated as possible.
Imagine you have a VPN connection whose availability (in percent) you monitor. Depending on the downtime, aggregation could mean you never see that it was down.

But I fully see your concern about showing older metrics as "new". This would become a big issue if Azure did not provide metrics for more than, say, 4-5 minutes. Sure, you could also catch such cases, but I fully agree it wouldn't be a clean solution.

What would be nice is to have this "issue" documented somewhere: setting an interval of 1 minute will always lead to null/NaN values. From my point of view we can close it for now. Maybe I'll have a better idea someday ;)

@tomkerkhove (Owner)

There are still data gaps; for example, Azure Cosmos DB tends to be slow when queried. (FYI @SudhakarNandigam-TomTom)

I'm querying the Azure Monitor API at 3:13 PM with an aggregation of 5 min and find the following 2 time series:

  • 5/6/2021 3:08:00 PM with no data (~5 min ago)
  • 5/6/2021 3:03:00 PM with value 4 (~10 min ago)

Today, Promitor will report null while you might want to see 4. However, this is a data gap if you ask me, so here is what I propose:

  • By default, Promitor keeps on working as it is
  • However, end-users can configure skipDataGaps: true so that Promitor takes the last reported metric value and logs a warning

Would this be something you would enable @bluepixbe @SudhakarNandigam-TomTom @adamconnelly @adam-resdiary ?

Relates to #711
Relates to #1621

@tomkerkhove (Owner)

This happens often when using a very low frequency:
[screenshot]

@adamconnelly (Contributor)

@tomkerkhove I've actually left ResDiary now and I'm not in a position to use this, but @ResDiaryLewis or @elliot-resdiary might be interested.

@tomkerkhove (Owner)

Sorry to hear and best of luck @adamconnelly!

@bluepixbe (Contributor, Author)

@tomkerkhove thanks for asking!
To me this looks like a good compromise, and I think we would enable this option. Promitor would just have to make sure that, for example, for deleted resources you don't get the last reported metric value.

@tomkerkhove (Owner)

This morning I was thinking about two things:

  • Would you want a global flag or a per-metric one?
  • Would you want to define the number of gaps you consider acceptable to skip? (i.e. only one)
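The proposed skipDataGaps behaviour with a configurable gap limit could be sketched like this (hypothetical Python, nothing here is implemented in Promitor):

```python
def value_with_gap_skipping(data, aggregation="average", max_gaps_to_skip=1):
    """Hypothetical sketch of the proposed skipDataGaps behaviour with a
    configurable cap on how many trailing gaps may be skipped."""
    skipped = 0
    for point in reversed(data):
        if aggregation in point:
            return point[aggregation]
        skipped += 1
        if skipped > max_gaps_to_skip:
            break  # too many consecutive gaps: report null, not stale data
    return None

# Time series from the Cosmos DB example above (queried around 3:13 PM):
data = [
    {"timeStamp": "2021-05-06T15:03:00Z", "average": 4},  # ~10 min ago
    {"timeStamp": "2021-05-06T15:08:00Z"},                # ~5 min ago, gap
]
print(value_with_gap_skipping(data, max_gaps_to_skip=1))  # prints 4
print(value_with_gap_skipping(data, max_gaps_to_skip=0))  # prints None
```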

@ResDiaryLewis

Would this be something you would enable @bluepixbe @SudhakarNandigam-TomTom @adamconnelly @adam-resdiary ?

Hey Tom! I don't think we'd enable this at ResDiary. We rarely see any gaps with our current configuration (1-minute scrape interval, 5-minute aggregation), so we're not really affected by this issue, and we don't plan on changing our configuration anytime soon. I also think this behaviour would be pretty confusing on a dashboard: surely seeing two occurrences of 4 would be misleading.

Anyway, if it were opt-in, I see no reason not to implement it if you'd like to!

@tomkerkhove (Owner)

May I ask which Azure services you are scraping? From what I've seen, this depends highly on the Azure service in question not providing consistent/fast metrics.

@ResDiaryLewis

May I ask what Azure services you are using to scrape?

Sure, we're scraping:

  • SQL Database
  • Azure Load Balancer
  • Redis Cache
  • Service Bus

@bluepixbe (Contributor, Author)

This morning I was thinking about two things:

  • Would you want a global flag or a per-metric one?
  • Would you want to define the number of gaps you consider acceptable to skip? (i.e. only one)

@tomkerkhove

  • We would definitely define it globally; being able to override it at the metric level would be a nice-to-have in my opinion
  • That would be a nice option indeed

tomkerkhove added the prio:P0 All issues that are top priority label Oct 12, 2021
@imarkvisser

We experience the same: metrics returning null, sometimes resulting in gaps in our time series. Not sure if skipDataGaps is still something you are considering adding.

@tomkerkhove (Owner)

Yes, but I haven't had time yet, so I'm open to contributions!
