DiskIo Metricset: Failed to encode event: unsupported float value: NaN #15549

Closed
LucaWintergerst opened this issue Jan 14, 2020 · 8 comments
Labels: bug, Stalled, Team:Integrations (label for the Integrations team)

Comments

@LucaWintergerst

  • Version: metricbeat-7.5.1
  • Operating System: Kubernetes, GKE, Elastic Docker container

If the diskio metricset is enabled, the following ERROR appears in the logs a few seconds after startup. It persists throughout the runtime and often increases in frequency:
2020-01-14T18:32:53.571Z ERROR elasticsearch/client.go:404 Failed to encode event: unsupported float value: NaN
Full logfile: metricbeat.log
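
For context, Go's standard JSON encoder refuses to serialize NaN float values, which is the general class of failure this error points at. A minimal, hypothetical sketch (not Metricbeat's actual encoder code, and the field name is made up for illustration) that reproduces the same kind of error:

package main

import (
	"encoding/json"
	"fmt"
	"math"
)

func main() {
	// A metric value that ended up as NaN, for example from a 0/0 calculation.
	event := map[string]interface{}{
		"system": map[string]interface{}{
			"diskio": map[string]interface{}{
				"await": math.NaN(), // hypothetical field, for illustration only
			},
		},
	}

	// encoding/json refuses to serialize NaN (and +/-Inf) floats.
	_, err := json.Marshal(event)
	fmt.Println(err) // json: unsupported value: NaN
}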

There are other errors logged at INFO level too, but I'm not sure whether they are related:

2020-01-14T16:01:02.010Z	INFO	module/wrapper.go:252	Error fetching data for metricset kubernetes.system: error doing HTTP request to fetch 'system' Metricset data: error making http request: Get http://localhost:10255/stats/summary: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-01-14T16:01:02.067Z	INFO	module/wrapper.go:252	Error fetching data for metricset kubernetes.volume: error doing HTTP request to fetch 'volume' Metricset data: error making http request: Get http://localhost:10255/stats/summary: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-01-14T16:01:02.068Z	INFO	module/wrapper.go:252	Error fetching data for metricset kubernetes.pod: error doing HTTP request to fetch 'pod' Metricset data: error making http request: Get http://localhost:10255/stats/summary: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-01-14T16:01:02.136Z	INFO	module/wrapper.go:252	Error fetching data for metricset kubernetes.container: error doing HTTP request to fetch 'container' Metricset data: error making http request: Get http://localhost:10255/stats/summary: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-01-14T16:01:02.136Z	INFO	module/wrapper.go:252	Error fetching data for metricset kubernetes.node: error doing HTTP request to fetch 'node' Metricset data: error making http request: Get http://localhost:10255/stats/summary: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

metricbeat.yml

metricbeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

processors:
- add_cloud_metadata: ~
- add_docker_metadata: ~

output.elasticsearch:
  hosts: '${ELASTICSEARCH_HOSTS:elasticsearch:9200}'
  username: '${ELASTICSEARCH_USERNAME:}'
  password: '${ELASTICSEARCH_PASSWORD:}'

modules.d

- module: docker
  metricsets:
    - container
    - cpu
    - diskio
    - event
    - healthcheck
    - info
    - memory
    - network
  hosts: ["unix:///var/run/docker.sock"]
  period: 10s

- module: kubernetes
  metricsets:
    - node
    - system
    - pod
    - container
    - volume
  period: 10s
  host: ${NODE_NAME}
  hosts: ["localhost:10255"]
  add_metadata: true
- module: kubernetes
  metricsets:
    - proxy
  period: 10s
  host: ${NODE_NAME}
  hosts: ["localhost:10249"]

- module: system
  period: 10s
  metricsets:
    - cpu
    - load
    - memory
    - network
    - process
    - process_summary
    - core
    - diskio
    - socket
  processes: ['.*']
  process.include_top_n:
    by_cpu: 5      # include top 5 processes by CPU
    by_memory: 5   # include top 5 processes by memory
- module: system
  period: 1m
  metricsets:
    - filesystem
    - fsstat
  processors:
  - drop_event.when.regexp:
      system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib)($|/)'

@fearful-symmetry added the bug and Team:Integrations labels on Jan 14, 2020
@fearful-symmetry (Contributor)

So, @LucaWintergerst, when you can, we'll need logs with just the diskio metricset enabled.

Also, those k8s errors are suspicious, and I wonder if it's something there? Could be unrelated.

@LucaWintergerst (Author)

With diskio disabled, I got 2 NaN errors in 30 minutes across 9 metricbeat pods. However, these errors now have additional output:

2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(34071203)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(34071203)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(10000000)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(10000000)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(10000000)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(10000000)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(4176137860000000)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(4176137860000000)
2020-01-15T07:42:35.174Z	ERROR	elasticsearch/client.go:404	Failed to encode event: unsupported float value: NaN 

All other errors are gone as well now, which is surprising. (Take this with a grain of salt; it may just be luck.)

With diskio enabled, I got hundreds within a few minutes on even a single pod, but without the explanation.

Running a single pod with only diskio enabled and no other metricsets does not reproduce the error behaviour (lots of errors quickly). Take this with a grain of salt too for now, as I can't say for certain that the NaN errors show up for all pods all the time with the original config.
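
For illustration, one way NaN values can end up in events alongside counter-reset errors like the ones above is a rate calculation whose inputs collapse to zero: 0/0 in floating point yields NaN, and that value then fails to encode. This is a hypothetical sketch of that failure mode and its guarded counterpart, not Metricbeat's actual diskio or CPU helper code:

package main

import (
	"fmt"
	"math"
)

// naiveRate computes a per-second rate from two counter samples.
// With no guard, 0/0 silently produces NaN instead of an error.
func naiveRate(oldTotal, newTotal, elapsedSeconds float64) float64 {
	return (newTotal - oldTotal) / elapsedSeconds
}

// safeRate guards against counter resets and zero intervals, mirroring
// the "new stats value is lower than the old one" style of handling.
func safeRate(oldTotal, newTotal uint64, elapsedSeconds float64) (float64, error) {
	if newTotal < oldTotal {
		return 0, fmt.Errorf("new stats value (%d) is lower than the old one (%d)", newTotal, oldTotal)
	}
	if elapsedSeconds <= 0 {
		return 0, fmt.Errorf("no time elapsed between samples")
	}
	return float64(newTotal-oldTotal) / elapsedSeconds, nil
}

func main() {
	// 0/0 in floating point yields NaN rather than an error or a panic.
	v := naiveRate(0, 0, 0)
	fmt.Println(math.IsNaN(v)) // true

	// The guarded version reports the counter reset instead of emitting NaN.
	if _, err := safeRate(34071203, 0, 10); err != nil {
		fmt.Println(err) // new stats value (0) is lower than the old one (34071203)
	}
}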

@fearful-symmetry (Contributor)

@LucaWintergerst Sorry for the delay; I've been distracted the last few days.

Okay, I have to ask: is any valid data coming through to Elasticsearch? It almost looks like nothing is working, and metricsets that aren't handling invalid metrics properly are getting smoked out.

If there's some other configuration issue at play here that's preventing values from being read, those "Error calculating CPU time change for docker module" errors are the correct way to handle it. It's the NaN errors I'm worried about.

> With diskio disabled, I got 2 NaN errors in 30 minutes across 9 metricbeat pods. However, these errors now have additional output:

What other metricsets were enabled? Just the ones listed in your original post minus diskio?

> With diskio enabled, I got hundreds within a few minutes on even a single pod, but without the explanation.

Okay, so it looks like there's definitely an issue with the diskio metricset?

@LucaWintergerst (Author)

Yes, we are getting all other data without any issues.
This is the list of the other metricsets we have enabled:

- module: kubernetes
  metricsets:
    - node
    - system
    - pod
    - container
    - volume
- module: system
  metricsets:
    - cpu
    - load
    - memory
    - network
    - process
    - process_summary
    - core
    #- diskio
    - socket
- module: docker
  metricsets:
    - container
    - cpu
    #- diskio
    - event
    - healthcheck
    - info
    - memory
    - network

@botelastic (bot) commented Dec 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The botelastic bot added the Stalled label on Dec 27, 2020.
@gservat commented Jan 21, 2021

We're on beats 6.8.6 and just had a solid 16 hours (since restart) of no diskio metrics with the same sort of error message:

2021-01-21T00:08:35.933Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:08:35.950Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:08:35.971Z        ERROR   logstash/sync.go:155    Failed to publish events caused by: unsupported float value: NaN
2021-01-21T00:08:37.650Z        ERROR   pipeline/output.go:121  Failed to publish events: unsupported float value: NaN
2021-01-21T00:08:37.650Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:08:37.666Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:08:37.666Z        ERROR   logstash/sync.go:155    Failed to publish events caused by: unsupported float value: NaN
2021-01-21T00:08:39.625Z        ERROR   pipeline/output.go:121  Failed to publish events: unsupported float value: NaN
2021-01-21T00:08:39.625Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:08:39.641Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:08:39.641Z        INFO    [publish]       pipeline/retry.go:155   Drop batch
2021-01-21T00:08:39.641Z        ERROR   logstash/sync.go:155    Failed to publish events caused by: unsupported float value: NaN
2021-01-21T00:08:41.227Z        ERROR   pipeline/output.go:121  Failed to publish events: unsupported float value: NaN
2021-01-21T00:08:45.630Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:08:45.647Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:09:42.958Z        ERROR   logstash/sync.go:155    Failed to publish events caused by: unsupported float value: NaN
2021-01-21T00:09:44.700Z        ERROR   pipeline/output.go:121  Failed to publish events: unsupported float value: NaN
2021-01-21T00:09:44.700Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:09:44.730Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:09:44.735Z        ERROR   logstash/sync.go:155    Failed to publish events caused by: unsupported float value: NaN
2021-01-21T00:09:46.595Z        ERROR   pipeline/output.go:121  Failed to publish events: unsupported float value: NaN
2021-01-21T00:09:46.595Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:09:46.617Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:09:46.621Z        ERROR   logstash/sync.go:155    Failed to publish events caused by: unsupported float value: NaN
2021-01-21T00:09:48.350Z        ERROR   pipeline/output.go:121  Failed to publish events: unsupported float value: NaN
2021-01-21T00:09:48.350Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:09:48.382Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:09:48.397Z        INFO    [publish]       pipeline/retry.go:155   Drop batch

I can see that other metrics were getting published OK (like docker.network), but docker.diskio stopped and magically fixed itself 16 hours later.
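
Worth noting from the log above: batches containing the bad event are retried and eventually dropped ("Drop batch"), so an event with a NaN field never reaches Logstash. Purely for illustration, a hypothetical sanitizer that replaces non-finite floats before encoding might look roughly like this (a sketch of the idea only, not an existing Metricbeat feature, and the field names are made up):

package main

import (
	"fmt"
	"math"
)

// sanitize walks a nested event map and replaces non-finite floats
// (NaN, +Inf, -Inf) with nil so that encoding cannot fail on them.
func sanitize(event map[string]interface{}) {
	for key, value := range event {
		switch v := value.(type) {
		case float64:
			if math.IsNaN(v) || math.IsInf(v, 0) {
				event[key] = nil
			}
		case map[string]interface{}:
			sanitize(v)
		}
	}
}

func main() {
	event := map[string]interface{}{
		"docker": map[string]interface{}{
			"diskio": map[string]interface{}{
				"read_rate": math.NaN(), // hypothetical field
			},
		},
	}
	sanitize(event)
	fmt.Println(event) // map[docker:map[diskio:map[read_rate:<nil>]]]
}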

The botelastic bot removed the Stalled label on Jan 21, 2021.
@fearful-symmetry (Contributor)

@gservat Is there any way you can update metricbeat to a newer version?

@botelastic (bot) commented Dec 30, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The botelastic bot added the Stalled label on Dec 30, 2021.
The botelastic bot closed this as completed on Jan 29, 2022.