DiskIo Metricset: Failed to encode event: unsupported float value: NaN #15549

Closed
LucaWintergerst opened this issue Jan 14, 2020 · 8 comments
Labels: bug, Stalled, Team:Integrations (label for the Integrations team)

Comments

@LucaWintergerst

  • Version: metricbeat-7.5.1
  • Operating System: Kubernetes, GKE, Elastic Docker container

If the diskio metricset is enabled, the following ERROR appears in the logs a few seconds after startup. It persists throughout the runtime and often increases in frequency:
2020-01-14T18:32:53.571Z ERROR elasticsearch/client.go:404 Failed to encode event: unsupported float value: NaN
Full logfile: metricbeat.log
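
For context, Go's standard JSON encoder refuses to serialize NaN float values, which is the general class of failure this error points at. A minimal, hypothetical sketch (not Metricbeat's actual encoder code, and the field name is made up for illustration) that reproduces the same kind of error:

package main

import (
	"encoding/json"
	"fmt"
	"math"
)

func main() {
	// A metric value that ended up as NaN, for example from a 0/0 calculation.
	event := map[string]interface{}{
		"system": map[string]interface{}{
			"diskio": map[string]interface{}{
				"await": math.NaN(), // hypothetical field, for illustration only
			},
		},
	}

	// encoding/json refuses to serialize NaN (and +/-Inf) floats.
	_, err := json.Marshal(event)
	fmt.Println(err) // json: unsupported value: NaN
}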

There are other errors logged at INFO level too, but I'm not sure whether they are related:

2020-01-14T16:01:02.010Z	INFO	module/wrapper.go:252	Error fetching data for metricset kubernetes.system: error doing HTTP request to fetch 'system' Metricset data: error making http request: Get http://localhost:10255/stats/summary: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-01-14T16:01:02.067Z	INFO	module/wrapper.go:252	Error fetching data for metricset kubernetes.volume: error doing HTTP request to fetch 'volume' Metricset data: error making http request: Get http://localhost:10255/stats/summary: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-01-14T16:01:02.068Z	INFO	module/wrapper.go:252	Error fetching data for metricset kubernetes.pod: error doing HTTP request to fetch 'pod' Metricset data: error making http request: Get http://localhost:10255/stats/summary: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-01-14T16:01:02.136Z	INFO	module/wrapper.go:252	Error fetching data for metricset kubernetes.container: error doing HTTP request to fetch 'container' Metricset data: error making http request: Get http://localhost:10255/stats/summary: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2020-01-14T16:01:02.136Z	INFO	module/wrapper.go:252	Error fetching data for metricset kubernetes.node: error doing HTTP request to fetch 'node' Metricset data: error making http request: Get http://localhost:10255/stats/summary: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

metricbeat.yml

metricbeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: false

processors:
- add_cloud_metadata: ~
- add_docker_metadata: ~

output.elasticsearch:
  hosts: '${ELASTICSEARCH_HOSTS:elasticsearch:9200}'
  username: '${ELASTICSEARCH_USERNAME:}'
  password: '${ELASTICSEARCH_PASSWORD:}'

modules.d

- module: docker
  metricsets:
    - container
    - cpu
    - diskio
    - event
    - healthcheck
    - info
    - memory
    - network
  hosts: ["unix:///var/run/docker.sock"]
  period: 10s

- module: kubernetes
  metricsets:
    - node
    - system
    - pod
    - container
    - volume
  period: 10s
  host: ${NODE_NAME}
  hosts: ["localhost:10255"]
  add_metadata: true
- module: kubernetes
  metricsets:
    - proxy
  period: 10s
  host: ${NODE_NAME}
  hosts: ["localhost:10249"]

- module: system
  period: 10s
  metricsets:
    - cpu
    - load
    - memory
    - network
    - process
    - process_summary
    - core
    - diskio
    - socket
  processes: ['.*']
  process.include_top_n:
    by_cpu: 5      # include top 5 processes by CPU
    by_memory: 5   # include top 5 processes by memory
- module: system
  period: 1m
  metricsets:
    - filesystem
    - fsstat
  processors:
  - drop_event.when.regexp:
      system.filesystem.mount_point: '^/(sys|cgroup|proc|dev|etc|host|lib)($|/)'

@fearful-symmetry added the bug and Team:Integrations labels on Jan 14, 2020
@fearful-symmetry (Contributor)

So, @LucaWintergerst, when you can, we'll need logs with just the diskio metricset enabled.

Also, those k8s errors are suspicious, and I wonder if it's something there? Could be unrelated.

@LucaWintergerst (Author)

With diskio disabled, I got 2 NaN errors in 30 minutes across 9 metricbeat pods. However, these errors now have additional output:

2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(34071203)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(34071203)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(10000000)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(10000000)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(10000000)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(10000000)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(4176137860000000)
2020-01-15T07:42:34.674Z	ERROR	cpu/helper.go:199	Error calculating CPU time change for docker module: new stats value (0) is lower than the old one(4176137860000000)
2020-01-15T07:42:35.174Z	ERROR	elasticsearch/client.go:404	Failed to encode event: unsupported float value: NaN 

All other errors are gone as well now, which is surprising. (Take this with a grain of salt; it may just be luck.)

With diskio enabled, I got hundreds within a few minutes on even a single pod, but without the explanation.

Running a single pod with only diskio enabled and no other metricsets does not reproduce the error behaviour (lots of errors quickly). Take this with a grain of salt too for now, as I can't say for certain that the NaN errors show up for all pods all the time with the original config.
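
For illustration, one way NaN values can end up in events alongside counter-reset errors like the ones above is a rate calculation whose inputs collapse to zero: 0/0 in floating point yields NaN, and that value then fails to encode. This is a hypothetical sketch of that failure mode and its guarded counterpart, not Metricbeat's actual diskio or CPU helper code:

package main

import (
	"fmt"
	"math"
)

// naiveRate computes a per-second rate from two counter samples.
// With no guard, 0/0 silently produces NaN instead of an error.
func naiveRate(oldTotal, newTotal, elapsedSeconds float64) float64 {
	return (newTotal - oldTotal) / elapsedSeconds
}

// safeRate guards against counter resets and zero intervals, mirroring
// the "new stats value is lower than the old one" style of handling.
func safeRate(oldTotal, newTotal uint64, elapsedSeconds float64) (float64, error) {
	if newTotal < oldTotal {
		return 0, fmt.Errorf("new stats value (%d) is lower than the old one (%d)", newTotal, oldTotal)
	}
	if elapsedSeconds <= 0 {
		return 0, fmt.Errorf("no time elapsed between samples")
	}
	return float64(newTotal-oldTotal) / elapsedSeconds, nil
}

func main() {
	// 0/0 in floating point yields NaN rather than an error or a panic.
	v := naiveRate(0, 0, 0)
	fmt.Println(math.IsNaN(v)) // true

	// The guarded version reports the counter reset instead of emitting NaN.
	if _, err := safeRate(34071203, 0, 10); err != nil {
		fmt.Println(err) // new stats value (0) is lower than the old one (34071203)
	}
}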

@fearful-symmetry (Contributor)

@LucaWintergerst Sorry for the delay; I've been distracted the last few days.

Okay, I have to ask: is any valid data coming through to Elasticsearch? It almost looks like nothing is working, and metricsets that aren't handling invalid metrics properly are getting smoked out.

If there's some other configuration issue at play here that's preventing values from being read, those "Error calculating CPU time change for docker module" errors are the correct way to handle it. It's the NaN errors I'm worried about.

> With diskio disabled, I got 2 NaN errors in 30 minutes across 9 metricbeat pods. However, these errors now have additional output:

What other metricsets were enabled? Just the ones listed in your original post minus diskio?

> With diskio enabled, I got hundreds within a few minutes on even a single pod, but without the explanation.

Okay, so it looks like there's definitely an issue with the diskio metricset?

@LucaWintergerst (Author)

Yes, we are getting all other data without any issues.
This is the list of the other metricsets we have enabled:

- module: kubernetes
  metricsets:
    - node
    - system
    - pod
    - container
    - volume
- module: system
  metricsets:
    - cpu
    - load
    - memory
    - network
    - process
    - process_summary
    - core
    #- diskio
    - socket
- module: docker
  metricsets:
    - container
    - cpu
    #- diskio
    - event
    - healthcheck
    - info
    - memory
    - network

@botelastic (bot) commented Dec 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The botelastic bot added the Stalled label on Dec 27, 2020.
@gservat commented Jan 21, 2021

We're on beats 6.8.6 and just had a solid 16 hours (since restart) of no diskio metrics with the same sort of error message:

2021-01-21T00:08:35.933Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:08:35.950Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:08:35.971Z        ERROR   logstash/sync.go:155    Failed to publish events caused by: unsupported float value: NaN
2021-01-21T00:08:37.650Z        ERROR   pipeline/output.go:121  Failed to publish events: unsupported float value: NaN
2021-01-21T00:08:37.650Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:08:37.666Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:08:37.666Z        ERROR   logstash/sync.go:155    Failed to publish events caused by: unsupported float value: NaN
2021-01-21T00:08:39.625Z        ERROR   pipeline/output.go:121  Failed to publish events: unsupported float value: NaN
2021-01-21T00:08:39.625Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:08:39.641Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:08:39.641Z        INFO    [publish]       pipeline/retry.go:155   Drop batch
2021-01-21T00:08:39.641Z        ERROR   logstash/sync.go:155    Failed to publish events caused by: unsupported float value: NaN
2021-01-21T00:08:41.227Z        ERROR   pipeline/output.go:121  Failed to publish events: unsupported float value: NaN
2021-01-21T00:08:45.630Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:08:45.647Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:09:42.958Z        ERROR   logstash/sync.go:155    Failed to publish events caused by: unsupported float value: NaN
2021-01-21T00:09:44.700Z        ERROR   pipeline/output.go:121  Failed to publish events: unsupported float value: NaN
2021-01-21T00:09:44.700Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:09:44.730Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:09:44.735Z        ERROR   logstash/sync.go:155    Failed to publish events caused by: unsupported float value: NaN
2021-01-21T00:09:46.595Z        ERROR   pipeline/output.go:121  Failed to publish events: unsupported float value: NaN
2021-01-21T00:09:46.595Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:09:46.617Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:09:46.621Z        ERROR   logstash/sync.go:155    Failed to publish events caused by: unsupported float value: NaN
2021-01-21T00:09:48.350Z        ERROR   pipeline/output.go:121  Failed to publish events: unsupported float value: NaN
2021-01-21T00:09:48.350Z        INFO    pipeline/output.go:95   Connecting to backoff(tcp://logstash-address-redacted:port)
2021-01-21T00:09:48.382Z        INFO    pipeline/output.go:105  Connection to backoff(tcp://logstash-address-redacted:port) established
2021-01-21T00:09:48.397Z        INFO    [publish]       pipeline/retry.go:155   Drop batch

I can see that other metrics were getting published OK (like docker.network), but docker.diskio stopped and magically fixed itself 16 hours later.
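
Worth noting from the log above: batches containing the bad event are retried and eventually dropped ("Drop batch"), so an event with a NaN field never reaches Logstash. Purely for illustration, a hypothetical sanitizer that replaces non-finite floats before encoding might look roughly like this (a sketch of the idea only, not an existing Metricbeat feature, and the field names are made up):

package main

import (
	"fmt"
	"math"
)

// sanitize walks a nested event map and replaces non-finite floats
// (NaN, +Inf, -Inf) with nil so that encoding cannot fail on them.
func sanitize(event map[string]interface{}) {
	for key, value := range event {
		switch v := value.(type) {
		case float64:
			if math.IsNaN(v) || math.IsInf(v, 0) {
				event[key] = nil
			}
		case map[string]interface{}:
			sanitize(v)
		}
	}
}

func main() {
	event := map[string]interface{}{
		"docker": map[string]interface{}{
			"diskio": map[string]interface{}{
				"read_rate": math.NaN(), // hypothetical field
			},
		},
	}
	sanitize(event)
	fmt.Println(event) // map[docker:map[diskio:map[read_rate:<nil>]]]
}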

The botelastic bot removed the Stalled label on Jan 21, 2021.
@fearful-symmetry (Contributor)

@gservat Is there any way you can update metricbeat to a newer version?

@botelastic (bot) commented Dec 30, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

The botelastic bot added the Stalled label on Dec 30, 2021.
The botelastic bot closed this as completed on Jan 29, 2022.