[Procstat Plugin] If process does not exist telegraf starts consuming more and more cpu #2472

discoduck2x · 2017-02-27T10:53:28Z

When using the procstat plugin, if a monitored process isn't running, the following error appears in /var/log/messages:

"Feb 27 11:37:06 centos7 telegraf: 2017-02-27T10:37:06Z E! Error: procstat getting process, exe: [snapteld] pidfile: [] pattern: [] user: [] Failed to execute /usr/bin/pgrep. Error: 'exit status 1'"

CPU usage of the telegraf process then increases steadily, as shown in the picture below.
This does not happen if every process the procstat plugin is configured to monitor is actually present and running.

System info:

CentOS 7, telegraf 1.2.1

telegraf.conf:

[[inputs.procstat]]
exe = "telegraf"
fieldpass = ["cpu_usage"]

[[inputs.procstat]]
exe = "snapteld"
fieldpass = ["cpu_usage"]

[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false

Top-left graph: host CPU usage (system cpu input)
Top-right graph: procstat CPU usage
Bottom bar charts: count of collected metrics per interval (shows issue #2315, but also that collection is fine for other inputs, system cpu in this case)
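
The error in the report comes from pgrep exiting with status 1, which the pgrep man page documents as "no processes matched" rather than as a failure of the command itself. A minimal sketch of treating that status as an empty result follows. It is written in Python for illustration (Telegraf itself is written in Go), and `interpret_pgrep` and `find_pids` are hypothetical helper names, not Telegraf functions:

```python
import subprocess


def interpret_pgrep(returncode, stdout):
    """Map a pgrep exit status and its output to a list of pids.

    Per the pgrep man page: 0 = at least one match, 1 = no match,
    2 = syntax error on the command line, 3 = fatal error.
    """
    if returncode == 0:
        return [int(line) for line in stdout.split()]
    if returncode == 1:
        # Nothing matched: an expected state, not an error.
        return []
    raise RuntimeError(f"pgrep failed with exit status {returncode}")


def find_pids(exe):
    """Run `pgrep <exe>` and return matching pids (empty list if none)."""
    result = subprocess.run(["pgrep", exe], capture_output=True, text=True)
    return interpret_pgrep(result.returncode, result.stdout)
```

With this split, a missing process such as `snapteld` yields an empty list, and only exit statuses 2 and 3 are reported as errors.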

sparrc · 2017-02-28T12:16:40Z

could be related to shirou/gopsutil#320 and shirou/gopsutil#319

hopefully this will fix #2472

sparrc · 2017-02-28T12:45:29Z

@discoduck2x Are you running 64-bit Linux? Could you try running this binary and see if it fixes your issue? https://6188-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.hjZlqFh/telegraf.gz

discoduck2x · 2017-03-01T09:12:18Z

@sparrc yes, 64-bit CentOS. I'll try the file and report back.

discoduck2x · 2017-03-01T09:23:48Z

@sparrc, I'm not able to get any procstat data with that binary. I'm getting the following error:

Mar 1 10:21:16 centos7 telegraf: 2017-03-01T09:21:16Z E! Error: procstat: Failed to open process with pid '4543'. Error: 'open /proc/4543: no such file or directory'

and my conf for procstat:
[[inputs.procstat]]
exe = "telegraf"
fieldpass = ["cpu_usage"]

sparrc · 2017-03-01T10:53:25Z

Does /proc/4543 exist? Did you restart the process? Reloading probably won't work here.

discoduck2x · 2017-03-01T11:03:04Z

@sparrc no, the pid does not exist. I've restarted a few times, same thing. I have to go off for a while but will try more later today.

sparrc · 2017-03-01T11:04:19Z

can you try using a pidfile?

discoduck2x · 2017-03-01T12:07:26Z

@sparrc, with a pidfile I'm getting nothing: no errors, no data.

sparrc · 2017-03-01T14:04:14Z

I can confirm that this is an issue on master: it doesn't appear to be collecting cpu usage properly. Currently investigating.

discoduck2x · 2017-03-01T14:06:28Z

@sparrc confirmed: after reverting to the original 1.2.1 telegraf binary, data is picked up again when using a pidfile.

sparrc · 2017-03-01T15:08:08Z

@discoduck2x it seems that a fix for a separate bug resulted in breaking the cpu_usage metric. (see #2479)

Once I merge that PR I will provide another build that has both fixes.

discoduck2x · 2017-03-01T15:57:52Z

@sparrc ok thanks for your effort!

sparrc · 2017-03-08T13:17:54Z

@discoduck2x could you try out the following binary? https://6254-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.KlLekoO/telegraf.gz

discoduck2x · 2017-03-10T16:14:50Z

@sparrc, I'm still seeing the same behaviour:
checksum: 43e159ebe073a14467ac0ce325c68296

sparrc · 2017-03-10T16:41:09Z

OK, I didn't mean to close this anyway, so reopening.

danielnelson · 2017-03-13T21:38:50Z

@discoduck2x Does the process sometimes exist? From #1636 we know that procstat caches the pids indefinitely. If the process were found only occasionally, you would eventually build up a large list of pids that would need to be checked.
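
The hypothesis above (a pid cache that is never pruned makes every later interval more expensive) can be illustrated with a toy model. This is a sketch under stated assumptions, not the procstat implementation: `PidCache` and `simulate` are hypothetical names, and the simulation assumes a process that reappears under a new pid on every interval.

```python
class PidCache:
    """Toy model of a per-interval pid cache (hypothetical, not Telegraf code)."""

    def __init__(self, prune):
        self.prune = prune
        self.pids = {}  # pid -> tags

    def update(self, found_pids):
        """Record the pids found this interval; return how many get checked."""
        if self.prune:
            # Forget pids that were not found in this interval.
            self.pids = {p: t for p, t in self.pids.items() if p in found_pids}
        for pid in found_pids:
            self.pids.setdefault(pid, {"pid": str(pid)})
        return len(self.pids)


def simulate(prune, intervals=100):
    """Simulate a process that reappears under a new pid every interval."""
    cache = PidCache(prune)
    checked = 0
    for i in range(intervals):
        checked = cache.update({1000 + i})
    return checked
```

Here `simulate(prune=False)` returns 100 (every pid ever seen is still checked on the last interval), while `simulate(prune=True)` returns 1.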

discoduck2x · 2017-03-14T08:18:03Z

@danielnelson no. I've got one and the same telegraf.conf for all my test hosts; one host has influxdb, one has grafana, etc. So no, there are no pids coming and going, so to speak, on the hosts.

With regard to #1636: it seems to me that if I have pattern configured for procstat, say for "telegraf", and I then use nano to edit telegraf.conf, the "nano" process is caught by the telegraf procstat plugin and so shows up as a process. I'm going to test some more because I can't replicate it consistently, but it looks very strange.

danielnelson · 2017-03-14T16:09:48Z

If you use the pattern option, I think it should pick up the text editor or anything else with the pattern in one of its args, but if you use the exe option it should only pick up a process whose name (aka arg0) matches. I'll fix #1636 today and then we should retest this to see if it's related.
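
The difference between the two options can be sketched as follows. The procstat README describes `exe` as `pgrep <exe>` and `pattern` as `pgrep -f <pattern>`; the process table, pids, and function names below are made up for illustration, and the sketch is in Python rather than Telegraf's Go.

```python
import re

# Hypothetical process table: (pid, process name, full command line).
PROCESSES = [
    (101, "telegraf", "/usr/bin/telegraf -config /etc/telegraf/telegraf.conf"),
    (202, "nano", "nano /etc/telegraf/telegraf.conf"),
    (303, "influxd", "/usr/bin/influxd -config /etc/influxdb/influxdb.conf"),
]


def match_exe(regex, processes):
    """Like `pgrep <regex>`: match against the process name only."""
    return [pid for pid, name, _ in processes if re.search(regex, name)]


def match_pattern(regex, processes):
    """Like `pgrep -f <regex>`: match against the full command line."""
    return [pid for pid, _, cmdline in processes if re.search(regex, cmdline)]
```

With this table, `match_exe("telegraf", PROCESSES)` returns `[101]`, while `match_pattern("telegraf", PROCESSES)` returns `[101, 202]`, because the editor's command line contains the config path.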

discoduck2x · 2017-03-15T07:48:28Z

@danielnelson that's true, the pattern option catches them. I'm running another test now for a while and will get back with, hopefully, some more details.

danielnelson · 2017-03-17T23:50:58Z

@discoduck2x Will you retest with the latest master? This might be fixed.

discoduck2x · 2017-03-18T19:26:06Z

I would if I could, but I can't build from master. Do you have a binary anywhere I can pull it from? @danielnelson

danielnelson · 2017-03-21T01:07:11Z

@discoduck2x Here is the latest build: https://6349-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.eKuio1V/telegraf.gz

discoduck2x · 2017-03-21T05:11:08Z

@danielnelson, is that built for CentOS? I'm getting this error with that binary:

danielnelson · 2017-03-21T20:28:37Z

@discoduck2x Sorry, that was just a bug in my change. Try this one: https://6363-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.H35AC4G/telegraf.gz

discoduck2x · 2017-03-21T21:14:29Z

@danielnelson thanks, deployed and will let it simmer overnight.
On another note, the "missing collections" issue seems better, judging by the count of metrics being collected (previously telegraf missed procstat process cpu data if cpu usage was low).
In this picture, the new binary shows the correct expected number of metrics collected per interval. So fingers crossed!

discoduck2x · 2017-03-22T07:22:17Z

@danielnelson the CPU growth seems to be gone! Nice work.

Does this build also include a fix for multiple instances of the same process name?

danielnelson · 2017-03-22T17:15:55Z

Just the 3 bugs listed in #2540. I believe this bug was caused by procstat caching the pid and tags forever, which would require more and more memory and cpu to check.

sparrc added the bug label Feb 28, 2017

sparrc added this to the 1.3.0 milestone Feb 28, 2017

sparrc added a commit that referenced this issue Feb 28, 2017

update gopsutil for file close fixes

f6f1416

hopefully this will fix #2472

sparrc mentioned this issue Feb 28, 2017

update gopsutil for file close fixes #2477

Merged

3 tasks

121watts assigned sparrc Feb 28, 2017

121watts added the in progress label Feb 28, 2017

sparrc added a commit that referenced this issue Feb 28, 2017

update gopsutil for file close fixes

fafb1b7

hopefully this will fix #2472

sparrc added a commit that referenced this issue Feb 28, 2017

update gopsutil for file close fixes

e05d078

hopefully this will fix #2472

sparrc added a commit that referenced this issue Mar 8, 2017

update gopsutil for file close fixes

9df2974

hopefully this will fix #2472

sparrc closed this as completed in #2477 Mar 8, 2017

121watts removed the in progress label Mar 8, 2017

sparrc reopened this Mar 10, 2017

danielnelson assigned danielnelson and unassigned sparrc Mar 13, 2017

danielnelson mentioned this issue Mar 17, 2017

Refactor procstat input #2540

Merged

2 tasks

danielnelson closed this as completed Mar 22, 2017

ssorathia pushed a commit to ssorathia/telegraf that referenced this issue Mar 25, 2017

update gopsutil for file close fixes

56ea1a2

hopefully this will fix influxdata#2472

calerogers pushed a commit to calerogers/telegraf that referenced this issue Apr 5, 2017

update gopsutil for file close fixes

997f1e7

hopefully this will fix influxdata#2472

vlamug pushed a commit to vlamug/telegraf that referenced this issue May 30, 2017

update gopsutil for file close fixes

3ecaa97

hopefully this will fix influxdata#2472

maxunt pushed a commit that referenced this issue Jun 26, 2018

update gopsutil for file close fixes

5c3cd82

hopefully this will fix #2472
