[Procstat Plugin] If process does not exist telegraf starts consuming more and more cpu #2472

Closed
discoduck2x opened this issue Feb 27, 2017 · 27 comments · Fixed by #2477

@discoduck2x

When using the procstat plugin, if a monitored process isn't running, the following error appears in /var/log/messages:

Feb 27 11:37:06 centos7 telegraf: 2017-02-27T10:37:06Z E! Error: procstat getting process, exe: [snapteld] pidfile: [] pattern: [] user: [] Failed to execute /usr/bin/pgrep. Error: 'exit status 1'

CPU usage of the telegraf process then steadily increases, as shown in the picture below.
This does not happen if all the processes the procstat plugin is configured to monitor are actually present and running.
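
For context, pgrep exits with status 1 when no process matches, so the error above is what a missing process looks like to a caller that treats every non-zero exit as a failure. Below is a minimal Python sketch of separating "no match" from real failures. The function names are hypothetical and this is not Telegraf's actual code, only an illustration of the exit-status handling.

```python
import subprocess


def parse_pgrep(returncode, stdout):
    """Turn a pgrep result into a list of pids.

    pgrep exits 0 when at least one process matched and 1 when none did;
    other codes (2 = syntax error, 3 = fatal error) are real failures.
    """
    if returncode == 0:
        return [int(tok) for tok in stdout.split()]
    if returncode == 1:
        return []  # no such process: not an error, just nothing to report
    raise RuntimeError(f"pgrep failed with exit status {returncode}")


def find_pids(exe):
    """Run pgrep for an executable name (assumes pgrep is on PATH)."""
    result = subprocess.run(["pgrep", exe], capture_output=True, text=True)
    return parse_pgrep(result.returncode, result.stdout)
```

With this split, a monitored process such as snapteld that is simply not running yields an empty list instead of an error on every collection interval.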

System info:

CentOS 7, telegraf 1.2.1

telegraf.conf:

[[inputs.procstat]]
exe = "telegraf"
fieldpass = ["cpu_usage"]

[[inputs.procstat]]
exe = "snapteld"
fieldpass = ["cpu_usage"]

[[inputs.cpu]]
percpu = true
totalcpu = true
collect_cpu_time = false

image

Top-left graph: host cpu usage (system cpu input)
Top-right graph: procstat cpu usage
Bottom bar charts: count of collected metrics per interval (shows issue #2315, but also that it's fine for other inputs, system cpu in this case)

@sparrc
Contributor

sparrc commented Feb 28, 2017

could be related to shirou/gopsutil#320 and shirou/gopsutil#319

@sparrc sparrc added the bug unexpected problem or unintended behavior label Feb 28, 2017
@sparrc sparrc added this to the 1.3.0 milestone Feb 28, 2017
sparrc added a commit that referenced this issue Feb 28, 2017
@sparrc
Contributor

sparrc commented Feb 28, 2017

@discoduck2x Are you running 64-bit Linux? Could you try running this binary and see if it fixes your issue? https://6188-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.hjZlqFh/telegraf.gz

sparrc added a commit that referenced this issue Feb 28, 2017
sparrc added a commit that referenced this issue Feb 28, 2017
@discoduck2x
Author

@sparrc Yes, 64-bit CentOS. I'll try the file and report back.

@discoduck2x
Author

@sparrc I'm not able to get any procstat data with that binary. I'm getting the following error:

Mar 1 10:21:16 centos7 telegraf: 2017-03-01T09:21:16Z E! Error: procstat: Failed to open process with pid '4543'. Error: 'open /proc/4543: no such file or directory'

and my conf for procstat:
[[inputs.procstat]]
exe = "telegraf"
fieldpass = ["cpu_usage"]

@sparrc
Contributor

sparrc commented Mar 1, 2017

Does /proc/4543 exist? Did you restart the process? Reloading probably won't work here.
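
The check being asked for can be sketched in a few lines of Python. This assumes the Linux procfs layout, where each running process has a /proc/<pid> directory containing a comm file with its name. The helper names and the proc_root parameter are hypothetical, added so the functions can be pointed at a test directory.

```python
import os


def pid_exists(pid, proc_root="/proc"):
    """Return True if <proc_root>/<pid> is a directory."""
    return os.path.isdir(os.path.join(proc_root, str(pid)))


def read_comm(pid, proc_root="/proc"):
    """Return the process name from <proc_root>/<pid>/comm.

    Returns None if the process is gone: a pid can vanish between the
    moment it is looked up and the moment its procfs entry is opened.
    """
    try:
        with open(os.path.join(proc_root, str(pid), "comm")) as fh:
            return fh.read().strip()
    except (FileNotFoundError, ProcessLookupError):
        return None
```

A collector that holds on to a pid from an earlier lookup would hit the "no such file or directory" error above whenever that process has since exited, which is why read_comm treats a missing entry as "gone" rather than as a failure.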

@discoduck2x
Author

@sparrc No, the pid does not exist. I've restarted a few times, same thing. I have to go off for a while but will try more later today.

@sparrc
Contributor

sparrc commented Mar 1, 2017

can you try using a pidfile?

@discoduck2x
Author

@sparrc With a pidfile I'm getting nothing: no errors, no data.

@sparrc
Contributor

sparrc commented Mar 1, 2017

I can confirm that this is an issue on master: it doesn't appear to be collecting cpu usage properly. Currently investigating.

@discoduck2x
Author

@sparrc Confirmed: after reverting to the original 1.2.1 telegraf binary, data is picked up again when using a pidfile.

image

@sparrc
Contributor

sparrc commented Mar 1, 2017

@discoduck2x it seems that a fix for a separate bug resulted in breaking the cpu_usage metric. (see #2479)

Once I merge that PR I will provide another build that has both fixes.

@discoduck2x
Author

@sparrc ok thanks for your effort!

sparrc added a commit that referenced this issue Mar 8, 2017
@sparrc
Contributor

sparrc commented Mar 8, 2017

@discoduck2x
Author

@sparrc I'm still seeing the same behaviour:
checksum: 43e159ebe073a14467ac0ce325c68296

image

@sparrc
Contributor

sparrc commented Mar 10, 2017

OK, didn't mean to close this anyways so reopening

@sparrc sparrc reopened this Mar 10, 2017
@danielnelson
Contributor

@discoduck2x Does the process sometimes exist? From #1636 we know that procstat caches the pids indefinitely. If the process was found only on occasion, you would eventually build up a large list of pids that would need to be checked.

@danielnelson danielnelson assigned danielnelson and unassigned sparrc Mar 13, 2017
@discoduck2x
Author

@danielnelson No. I've got one and the same telegraf.conf for all my test hosts; one host has influxdb, one has grafana, etc. So no, there are no pids coming and going on the hosts, so to speak.

With regard to #1636: it seems to me that if I have pattern configured for procstat, say for "telegraf", and I then use nano to edit telegraf.conf, the "nano" process gets caught by procstat and shows up as a monitored process. I'm going to test some more because I can't replicate it consistently, but it looks very strange.

@danielnelson
Contributor

If you use the pattern option, I think it should pick up the text editor or anything with the pattern in one of its args, but if you use the exe option it should only pick up a process whose name (aka arg0) matches. I'll fix #1636 today and then we should retest this to see if it's related.
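
The difference described here can be illustrated with a small Python sketch. It assumes exe behaves like `pgrep <exe>` (match on the process name only) and pattern behaves like `pgrep -f <pattern>` (match on the full command line). The process table and function names are made up for the example.

```python
import re

# Hypothetical process table: pid -> (process name, full command line).
PROCESSES = {
    4543: ("telegraf", "/usr/bin/telegraf -config /etc/telegraf/telegraf.conf"),
    5120: ("nano", "nano /etc/telegraf/telegraf.conf"),
    5200: ("influxd", "/usr/bin/influxd -config /etc/influxdb/influxdb.conf"),
}


def match_exe(regex, processes):
    """Match against the process name only, like `pgrep <regex>`."""
    return sorted(
        pid for pid, (name, _cmd) in processes.items() if re.search(regex, name)
    )


def match_pattern(regex, processes):
    """Match against the full command line, like `pgrep -f <regex>`."""
    return sorted(
        pid for pid, (_name, cmd) in processes.items() if re.search(regex, cmd)
    )
```

Under these assumptions, `match_exe("telegraf", PROCESSES)` selects only pid 4543, while `match_pattern("telegraf", PROCESSES)` also selects the nano process, because "telegraf" appears in the path it is editing.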

@discoduck2x
Author

@danielnelson That's true, the pattern option catches them. I'm running another test now for a while and will get back with, hopefully, some more details.

@danielnelson
Contributor

@discoduck2x Will you retest with the latest master? This might be fixed.

@discoduck2x
Author

@danielnelson I would if I could, but I can't build from master. Do you have a binary somewhere I can pull it from?

@danielnelson
Contributor

@discoduck2x
Author

@danielnelson Is that built for CentOS? I'm getting this error with that binary:
image

@danielnelson
Contributor

@discoduck2x Sorry, was just a bug in my change, try this one: https://6363-33258973-gh.circle-artifacts.com/0/tmp/circle-artifacts.H35AC4G/telegraf.gz

@discoduck2x
Author

@danielnelson Thanks, deployed. I will let it simmer overnight.
On another note, the "missing collections" seem better, judging by the count of metrics being collected (previously telegraf missed procstat process cpu data if cpu usage was low).
In this picture the new binary shows the expected number of metrics collected per interval, so fingers crossed!

image

@discoduck2x
Author

@danielnelson The cpu growth seems to be gone! Nice work.
image

Does this build also include a fix for multiple instances of the same process name?

@danielnelson
Contributor

Just the 3 bugs listed in #2540. I believe this bug was caused by procstat caching the pid and tags forever, which would require more and more memory and cpu to check.
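
To make the suspected mechanism concrete, here is a toy Python model (not Telegraf's code; the class and function names are invented) of a collector that remembers every pid it has ever seen, compared with one that prunes pids no longer reported. It assumes, for illustration only, a process that comes back with a new pid every interval.

```python
class PidCache:
    """Toy model of a collector that remembers pids between intervals."""

    def __init__(self, prune):
        self.prune = prune   # if True, drop entries not seen this interval
        self.entries = {}    # pid -> tags

    def gather(self, live_pids):
        """Record the current pids and return how many entries get checked."""
        for pid in live_pids:
            self.entries.setdefault(pid, {"pid": str(pid)})
        if self.prune:
            live = set(live_pids)
            self.entries = {p: t for p, t in self.entries.items() if p in live}
        return len(self.entries)


def simulate(prune, intervals=100):
    """Return the number of entries checked on the final interval."""
    cache = PidCache(prune)
    checked = 0
    for i in range(intervals):
        checked = cache.gather([1000 + i])  # a new pid each interval
    return checked
```

In this model `simulate(prune=False)` ends up checking 100 entries per interval while `simulate(prune=True)` checks 1. Without pruning, the per-interval work grows with uptime, which would match the steadily rising cpu usage.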

ssorathia pushed a commit to ssorathia/telegraf that referenced this issue Mar 25, 2017
calerogers pushed a commit to calerogers/telegraf that referenced this issue Apr 5, 2017
vlamug pushed a commit to vlamug/telegraf that referenced this issue May 30, 2017
maxunt pushed a commit that referenced this issue Jun 26, 2018

4 participants