
node_exporter goes into crashLoopBackoff after node reboot and never recovers #2935

Closed
vinay50muddu opened this issue Feb 22, 2024 · 2 comments

@vinay50muddu

Host operating system: output of uname -a

Linux my-cluster 5.14.21-150500.55.44-default #1 SMP PREEMPT_DYNAMIC Mon Jan 15 10:03:40 UTC 2024 (cc7d8b6) x86_64 x86_64 x86_64 GNU/Linux

node_exporter version: output of node_exporter --version

node_exporter, version 1.7.0 (branch: HEAD, revision: 7333465)

node_exporter command line flags

--path.procfs=/host/proc --path.sysfs=/host/sys --web.listen-address=[$(POD_IP)]:9100 --path.rootfs=/host 
--web.config.file=/data/etc/eric-pm-node-exporter-web-config.yaml --collector.textfile.directory=/host/var/lib/eccd/
--collector.disable-defaults --collector.cpu --collector.filesystem --collector.loadavg --collector.meminfo --collector.mountstats
--collector.netclass --collector.netdev --collector.timex --collector.textfile --collector.uname --collector.xfs --collector.diskstats

node_exporter log output

ts=2024-02-22T06:26:53.719Z caller=node_exporter.go:192 level=info msg="Starting node_exporter" version="(version=1.7.0, branch=HEAD, revision=7333465abf9efba81876303bb57e6fadb946041b)"
ts=2024-02-22T06:26:53.719Z caller=node_exporter.go:193 level=info msg="Build context" build_context="(go=go1.21.6, platform=linux/amd64, user=root@f86e9674f8f3, date=20240219-04:42:07, tags=netgo osusergo static_build)"
ts=2024-02-22T06:26:53.719Z caller=filesystem_common.go:111 level=info collector=filesystem msg="Parsed flag --collector.filesystem.mount-points-exclude" flag=^/(dev|proc|run/credentials/.+|sys|var/lib/docker/.+|var/lib/containers/storage/.+)($|/)
ts=2024-02-22T06:26:53.719Z caller=filesystem_common.go:113 level=info collector=filesystem msg="Parsed flag --collector.filesystem.fs-types-exclude" flag=^(autofs|binfmt_misc|bpf|cgroup2?|configfs|debugfs|devpts|devtmpfs|fusectl|hugetlbfs|iso9660|mqueue|nsfs|overlay|proc|procfs|pstore|rpc_pipefs|securityfs|selinuxfs|squashfs|sysfs|tracefs)$
ts=2024-02-22T06:26:53.720Z caller=diskstats_common.go:111 level=info collector=diskstats msg="Parsed flag --collector.diskstats.device-exclude" flag=^(ram|loop|fd|(h|s|v|xv)d[a-z]|nvme\d+n\d+p)\d+$
ts=2024-02-22T06:26:53.720Z caller=diskstats_linux.go:265 level=error collector=diskstats msg="Failed to open directory, disabling udev device properties" path=/run/udev/data
ts=2024-02-22T06:26:53.720Z caller=node_exporter.go:110 level=info msg="Enabled collectors"
ts=2024-02-22T06:26:53.720Z caller=node_exporter.go:117 level=info collector=cpu
ts=2024-02-22T06:26:53.720Z caller=node_exporter.go:117 level=info collector=diskstats
ts=2024-02-22T06:26:53.720Z caller=node_exporter.go:117 level=info collector=filesystem
ts=2024-02-22T06:26:53.720Z caller=node_exporter.go:117 level=info collector=loadavg
ts=2024-02-22T06:26:53.720Z caller=node_exporter.go:117 level=info collector=meminfo
ts=2024-02-22T06:26:53.720Z caller=node_exporter.go:117 level=info collector=mountstats
ts=2024-02-22T06:26:53.720Z caller=node_exporter.go:117 level=info collector=netclass
ts=2024-02-22T06:26:53.720Z caller=node_exporter.go:117 level=info collector=netdev
ts=2024-02-22T06:26:53.720Z caller=node_exporter.go:117 level=info collector=textfile
ts=2024-02-22T06:26:53.720Z caller=node_exporter.go:117 level=info collector=timex
ts=2024-02-22T06:26:53.720Z caller=node_exporter.go:117 level=info collector=uname
ts=2024-02-22T06:26:53.720Z caller=node_exporter.go:117 level=info collector=xfs
ts=2024-02-22T06:26:53.721Z caller=tls_config.go:274 level=info msg="Listening on" address=:9100
ts=2024-02-22T06:26:53.721Z caller=tls_config.go:310 level=info msg="TLS is enabled." http2=true address=:9100

Are you running node_exporter in Docker?

Running as a pod in a Kubernetes cluster.

What did you do that produced an error?

Rebooted the node. When the node-exporter pod comes up, it silently exits after printing the last line of the log above.

What did you expect to see?

I suspect that the host volumes mounted into the pod sandbox are sometimes not yet ready when node_exporter starts. However, I would expect node_exporter to detect any missing prerequisites and complain about them with an error (or at least a warning), which is not the current behavior.

What did you see instead?

node_exporter exits without leaving any trace of why it exited.

@SuperQ
Member

SuperQ commented Feb 22, 2024

This does not sound like a node_exporter issue, but a Kubernetes issue.

There are no "detected prerequisites" in the node_exporter. Every scrape is dynamic.

The node_exporter logs any errors returned; the only way for it to exit with no errors is for the webserver function to exit with no errors, usually due to a SIGTERM or SIGKILL.

@vinay50muddu
Author

IMO this is not a k8s issue, as the other pods are running without any problems. The issue is seen only when we reboot the node and node_exporter tries to come up: the pod moves to CrashLoopBackOff and stays there until we edit the DaemonSet/pod spec. The moment I edit the spec (even a minimal change, such as reducing the probe time, just to force the pod to restart), node_exporter comes up fine. So I suspect that after sandbox creation, the container exits while starting up, possibly because it hits some condition that is not handled. I am mounting the volumes as below:

volumeMounts:
- mountPath: /host
  mountPropagation: HostToContainer
  name: rootfs
  readOnly: true
- mountPath: /host/proc
  name: proc
  readOnly: true
- mountPath: /host/sys
  name: sys
  readOnly: true

volumes:
- hostPath:
    path: /
    type: ""
  name: rootfs
- hostPath:
    path: /proc
    type: ""
  name: proc
- hostPath:
    path: /sys
    type: ""
  name: sys
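Not a diagnosis of this crash, but a Kubernetes-side way to make "host path not ready" visible: with `type: ""`, a hostPath volume mounts whatever happens to be at the path, while `type: Directory` makes the kubelet refuse to start the container until the path exists as a directory, leaving the pod in ContainerCreating with a descriptive event instead of a silent exit. A sketch of the rootfs volume with that change (the other volumes would follow the same pattern):

```yaml
volumes:
- name: rootfs
  hostPath:
    path: /
    type: Directory   # kubelet fails the mount, with an event, if / is not an existing directory
```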

Also, a question in this context: when it crashes, do we get any trace in the log? Let me know if there is any information I have missed providing.
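On the log question: the kubelet keeps the previous container instance's logs and records its exit code, which is usually the first place a trace shows up after a crash. A sketch with kubectl (the namespace and pod name below are placeholders, not values from this thread):

```shell
NS=monitoring            # placeholder namespace
POD=node-exporter-abcde  # placeholder pod name

# Logs of the previous (crashed) container instance:
kubectl logs --previous -n "$NS" "$POD"

# Exit code and reason the kubelet recorded for the last termination:
kubectl get pod -n "$NS" "$POD" \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'

# Events (mount failures, probe failures, OOM kills) show up here:
kubectl describe pod -n "$NS" "$POD"
```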

@SuperQ SuperQ closed this as completed Mar 8, 2024
@prometheus prometheus locked as resolved and limited conversation to collaborators Mar 8, 2024