[Procstat Plugin] If process does not exist, telegraf starts consuming more and more CPU #2472
Comments
Could be related to shirou/gopsutil#320 and shirou/gopsutil#319.
@discoduck2x Are you running 64-bit Linux? Could you try running this binary and see if it fixes your issue? https://6188-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.hjZlqFh/telegraf.gz
@sparrc Yes, 64-bit CentOS. I'll try the file and report back.
@sparrc, I'm not able to get any procstat data with that binary. I'm getting the following error: Mar 1 10:21:16 centos7 telegraf: 2017-03-01T09:21:16Z E! Error: procstat: Failed to open process with pid '4543'. Error: 'open /proc/4543: no such file or directory'. My procstat config:
Does /proc/4543 exist? Did you restart the process? Reloading probably won't work here.
@sparrc No, the pid does not exist. I've restarted a few times, same thing. I have to go off for a while but will try more later today.
Can you try using a pidfile?
@sparrc, with a pidfile I'm getting nothing: no errors, no data.
I can confirm that this is an issue on master; it doesn't appear to be collecting CPU usage properly. Currently investigating.
@sparrc - confirmed by reverting to the original 1.2.1 telegraf binary; data then gets picked up using the pidfile.
@discoduck2x It seems that a fix for a separate bug broke the cpu_usage metric (see #2479). Once I merge that PR, I will provide another build that has both fixes.
@sparrc OK, thanks for your effort!
@discoduck2x Could you try out the following binary? https://6254-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.KlLekoO/telegraf.gz
@sparrc, I'm still seeing the same behaviour:
OK, didn't mean to close this anyway, so reopening.
@discoduck2x Does the process sometimes exist? From #1636 we know that procstat caches the pids indefinitely. If the process was found on occasion, you would eventually build up a large list of pids that need to be checked.
@danielnelson No. I've got one and the same telegraf.conf for all my test hosts, and one host has influxdb, one has grafana, etc., so no, there are no pids coming and going on the hosts, so to say. With regards to #1636, it seems to me that if I have a pattern configured for procstat, let's say "telegraf", and I use nano to edit telegraf.conf, then the "nano" process will be caught by procstat, showing nano as a process. I'm going to test some more because I can't replicate it consistently, but it looks very strange.
If you use the pattern option, I think it should pick up the text editor or anything with the pattern in one of the args, but if you use the
@danielnelson That's true, the pattern option catches them. Running another test now for a while; will get back with hopefully some more details.
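The exe vs. pattern difference discussed above can be approximated in a few lines of Go. This is a sketch under an assumption: that procstat's `exe` option is resolved with plain `pgrep`, which matches a regex against the process name only, and `pattern` with `pgrep -f`, which matches against the full command line. The `proc` type and the `matchExe`/`matchPattern` helpers are hypothetical, and Go's RE2 syntax only approximates pgrep's extended regular expressions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// proc is a hypothetical snapshot of one process: its short name
// (as in /proc/<pid>/comm) and its full command line.
type proc struct {
	name    string
	cmdline []string
}

// matchExe approximates `pgrep <exe>`: the regex is tested against the process name only.
func matchExe(p proc, exe string) bool {
	return regexp.MustCompile(exe).MatchString(p.name)
}

// matchPattern approximates `pgrep -f <pattern>`: the regex is tested against the whole command line.
func matchPattern(p proc, pattern string) bool {
	return regexp.MustCompile(pattern).MatchString(strings.Join(p.cmdline, " "))
}

func main() {
	procs := []proc{
		{name: "telegraf", cmdline: []string{"/usr/bin/telegraf", "-config", "/etc/telegraf/telegraf.conf"}},
		{name: "nano", cmdline: []string{"nano", "/etc/telegraf/telegraf.conf"}},
	}
	for _, p := range procs {
		fmt.Printf("%-8s exe=%v pattern=%v\n", p.name, matchExe(p, "telegraf"), matchPattern(p, "telegraf"))
	}
}
```

With these two sample processes, telegraf matches both ways, while nano matches only the pattern, because "telegraf" appears in its argument list but not in its name.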
@discoduck2x Will you retest with the latest master? This might be fixed.
@danielnelson I would if I could, but I can't build from master. Do you have a binary anywhere I can pull it from?
@discoduck2x Here is the latest build: https://6349-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.eKuio1V/telegraf.gz
@danielnelson, is that built for CentOS? I'm getting this error with that binary:
@discoduck2x Sorry, that was just a bug in my change. Try this one: https://6363-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.H35AC4G/telegraf.gz
@danielnelson Thanks, deployed; I'll let it simmer overnight.
@danielnelson The CPU growth seems to be gone! Nice work. Does this build also include a fix for multiple instances of the same process name?
Just the 3 bugs listed in #2540. I believe this bug was caused by procstat caching the pids and tags forever, which would require more and more memory and CPU to check.
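The caching problem described above (see also #1636) can be illustrated with a minimal Go sketch, assuming a cache keyed by pid that holds that pid's tags. The types `pidTags`, `leakyCache` and `rebuildingCache` are hypothetical and are not telegraf's actual procstat code; they only contrast a cache that keeps every pid it has ever seen with one rebuilt from the pids found on each gather.

```go
package main

import "fmt"

// pidTags is a hypothetical stand-in for the tags procstat attaches to a process.
type pidTags map[string]string

// leakyCache mimics the reported behaviour: every pid ever seen stays cached.
type leakyCache struct {
	procs map[int32]pidTags
}

func (c *leakyCache) update(found []int32) {
	for _, pid := range found {
		if _, ok := c.procs[pid]; !ok {
			c.procs[pid] = pidTags{"pid": fmt.Sprint(pid)}
		}
	}
}

// rebuildingCache keeps only the pids found in the current gather.
type rebuildingCache struct {
	procs map[int32]pidTags
}

func (c *rebuildingCache) update(found []int32) {
	next := make(map[int32]pidTags, len(found))
	for _, pid := range found {
		if tags, ok := c.procs[pid]; ok {
			next[pid] = tags // reuse the cached entry for a pid that is still alive
		} else {
			next[pid] = pidTags{"pid": fmt.Sprint(pid)}
		}
	}
	c.procs = next // pids that disappeared are dropped here
}

func main() {
	leaky := &leakyCache{procs: map[int32]pidTags{}}
	fresh := &rebuildingCache{procs: map[int32]pidTags{}}
	// Simulate a process that shows up with a new pid on each interval.
	for i := int32(0); i < 100; i++ {
		found := []int32{1000 + i}
		leaky.update(found)
		fresh.update(found)
	}
	fmt.Println(len(leaky.procs), len(fresh.procs)) // 100 1
}
```

In the leaky version, the work done on every interval grows with each pid ever seen, which is consistent with the steadily rising CPU reported here; rebuilding from the current lookup keeps the work bounded by the number of live processes.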
hopefully this will fix influxdata#2472
When using the procstat plugin, if a monitored process isn't running, the following error can be observed in /var/log/messages:
"Feb 27 11:37:06 centos7 telegraf: 2017-02-27T10:37:06Z E! Error: procstat getting process, exe: [snapteld] pidfile: [] pattern: [] user: [] Failed to execute /usr/bin/pgrep. Error: 'exit status 1'"
and the CPU usage of the telegraf process steadily increases, as shown in the picture below.
This does not happen if all the processes the procstat plugin is configured to monitor are actually present and running.
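The `Failed to execute /usr/bin/pgrep. Error: 'exit status 1'` message is consistent with pgrep finding no match: per the pgrep man page, exit status 1 means no processes matched, while 2 and 3 signal a syntax or fatal error. Below is a minimal Go sketch of a lookup that treats status 1 as an empty result instead of an error. `findPids` is a hypothetical helper, not telegraf's implementation, and it takes the binary name as a parameter so it can be exercised without pgrep installed.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// findPids runs bin (normally "pgrep") with args and parses one pid per
// whitespace-separated field of its output. Exit status 1 is treated as
// "no matching process" and returns an empty result with no error.
func findPids(bin string, args ...string) ([]int32, error) {
	out, err := exec.Command(bin, args...).Output()
	if err != nil {
		var exitErr *exec.ExitError
		if errors.As(err, &exitErr) && exitErr.ExitCode() == 1 {
			return nil, nil // no match, e.g. the monitored process isn't running
		}
		return nil, fmt.Errorf("%s %v: %w", bin, args, err)
	}
	var pids []int32
	for _, field := range strings.Fields(string(out)) {
		pid, err := strconv.ParseInt(field, 10, 32)
		if err != nil {
			return nil, fmt.Errorf("parsing %s output %q: %w", bin, field, err)
		}
		pids = append(pids, int32(pid))
	}
	return pids, nil
}

func main() {
	// With snapteld not running, this prints an empty slice and a nil error.
	pids, err := findPids("pgrep", "snapteld")
	fmt.Println(pids, err)
}
```

Separating "nothing matched" from real failures keeps a missing process from being logged as an error on every interval; whether that relates to the CPU growth itself is not established by this sketch.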
System info:
CentOS 7, telegraf 1.2.1
telegraf.conf:
[[inputs.procstat]]
exe = "telegraf"
fieldpass = ["cpu_usage"]
[[inputs.procstat]]
exe = "snapteld"
fieldpass = ["cpu_usage"]
[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false
Top-left graph: host CPU usage (system cpu input)
Top-right graph: procstat CPU usage
Bottom bar charts: count of collected metrics by interval (shows issue #2315, but also that it's fine for other inputs, the system cpu input in this case)