Process metrics for Linux #7870

Closed
kubotat opened this issue Aug 25, 2023 · 14 comments · Fixed by #7943

Comments

@kubotat

kubotat commented Aug 25, 2023

Is your feature request related to a problem? Please describe.

The in_process plugin is available today and can check how healthy a process is. Having process-level CPU and memory metrics in addition to health information would be beneficial for system operation.

Describe the solution you'd like

As far as I can tell from my research, node_exporter does not support process metrics as of today. So I suggest developing a new plugin which captures process-level metrics from /proc//stat. The process_name_regex and process_status_regex options give users great flexibility to control which processes are captured and to reduce the amount of data by cutting off unnecessary metrics. Here is the expected configuration for the plugin:

[INPUT]
    Name  process_metrics_exporter
    scrape_interval  60
    path.procfs  /proc/
    process_name_regex  /fluent-bit/
    process_status_regex  /R/
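To make the proposal concrete, here is a minimal Python sketch of what such a plugin would have to do per process: parse one stat line from procfs and apply the two regex options. It is only an illustration, not the plugin's implementation. The field positions follow the documented Linux stat layout; `ProcStat`, `parse_stat_line`, `matches`, and the sample line are hypothetical, and the regexes are written without the surrounding slashes used in the config above.

```python
import re
from typing import NamedTuple


class ProcStat(NamedTuple):
    pid: int
    name: str
    state: str
    utime_ticks: int   # user-mode CPU time, in clock ticks
    stime_ticks: int   # kernel-mode CPU time, in clock ticks
    vsize_bytes: int   # virtual memory size
    rss_pages: int     # resident set size, in pages


def parse_stat_line(line: str) -> ProcStat:
    """Parse a single per-process stat line (hypothetical helper)."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    lparen = line.index("(")
    rparen = line.rindex(")")
    pid = int(line[:lparen])
    name = line[lparen + 1:rparen]
    # rest[0] is field 3 (state), so field n of the file is rest[n - 3].
    rest = line[rparen + 1:].split()
    return ProcStat(
        pid=pid,
        name=name,
        state=rest[0],
        utime_ticks=int(rest[11]),   # field 14
        stime_ticks=int(rest[12]),   # field 15
        vsize_bytes=int(rest[20]),   # field 23
        rss_pages=int(rest[21]),     # field 24
    )


def matches(stat: ProcStat, name_regex: str, status_regex: str) -> bool:
    """Assumed semantics: both regexes must match for the process to be kept."""
    return (re.search(name_regex, stat.name) is not None
            and re.search(status_regex, stat.state) is not None)


# Made-up sample line for illustration only.
SAMPLE = ("1234 (fluent-bit) R 1 1234 1234 0 -1 4194304 100 0 0 0 "
          "250 75 0 0 20 0 4 0 5000 104857600 2048 18446744073709551615")

stat = parse_stat_line(SAMPLE)
print(stat.name, stat.state, stat.utime_ticks + stat.stime_ticks)
print(matches(stat, r"fluent-bit", r"R"))
```

CPU usage would then come from the difference in `utime_ticks + stime_ticks` between two scrapes, which is why a scrape interval is part of the configuration.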

Describe alternatives you've considered

I considered the in_process plugin as an alternative. It helps me check the health status and memory metrics, but it doesn't capture CPU metrics and doesn't work when users don't know the name of the process.

Additional context

None

@cosmo0920
Contributor

cosmo0920 commented Aug 28, 2023

The official node_exporter can handle process metrics; it provides this feature as one of its collectors. So we need to provide it as one of the metrics implemented in node_exporter_metrics.
Thus, it needs to be implemented as process metrics.

The configuration for process metrics should be as follows:

[INPUT]
    Name  node_metrics_exporter
    collector.process.scrape_interval  60
    metrics process
    path.procfs  /proc/
    ne.process_name_regex  /fluent-bit/
    ne.process_status_regex  /R/

@cosmo0920
Contributor

cosmo0920 commented Aug 30, 2023

I have already opened a PR that implements processes metrics, meaning the system-level statuses of processes and threads, in in_node_exporter_metrics: #7880
So this feature request should be handled by implementing process or some equivalent, but it should use a different metrics name.

@kubotat
Author

kubotat commented Aug 30, 2023

@cosmo0920 Thanks for your help. I looked into #7880 and confirmed that the system level metrics are captured.

2023-08-30T09:15:10.449031472Z node_process_threads = 1668
2023-08-30T09:15:10.449031472Z node_process_max_threads = 513122
2023-08-30T09:15:10.449031472Z node_process_threads_state{thread_state="R"} = 1
2023-08-30T09:15:10.449031472Z node_process_threads_state{thread_state="S"} = 1558
2023-08-30T09:15:10.449031472Z node_process_threads_state{thread_state="D"} = 0
2023-08-30T09:15:10.449031472Z node_process_threads_state{thread_state="Z"} = 3
2023-08-30T09:15:10.449031472Z node_process_threads_state{thread_state="T"} = 0
2023-08-30T09:15:10.449031472Z node_process_threads_state{thread_state="I"} = 106
2023-08-30T09:15:10.449031472Z node_process_state{state="R"} = 0
2023-08-30T09:15:10.449031472Z node_process_state{state="S"} = 367
2023-08-30T09:15:10.449031472Z node_process_state{state="D"} = 0
2023-08-30T09:15:10.449031472Z node_process_state{state="Z"} = 3
2023-08-30T09:15:10.449031472Z node_process_state{state="T"} = 0
2023-08-30T09:15:10.449031472Z node_process_state{state="I"} = 106
2023-08-30T09:15:10.449031472Z node_process_pids = 476
2023-08-30T09:15:10.449031472Z node_process_max_processes = 4194304
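As a quick sanity check on the output above, the per-state counts can be summed and compared with the totals. The sketch below copies the values from the output (timestamps and the max_* lines dropped for brevity); `parse_metrics` is a hypothetical helper, and whether the totals are meant to equal the per-state sums in #7880 is my assumption.

```python
import re
from collections import defaultdict

# Values copied from the output above, without the timestamp prefix.
SAMPLE = """\
node_process_threads = 1668
node_process_threads_state{thread_state="R"} = 1
node_process_threads_state{thread_state="S"} = 1558
node_process_threads_state{thread_state="D"} = 0
node_process_threads_state{thread_state="Z"} = 3
node_process_threads_state{thread_state="T"} = 0
node_process_threads_state{thread_state="I"} = 106
node_process_state{state="R"} = 0
node_process_state{state="S"} = 367
node_process_state{state="D"} = 0
node_process_state{state="Z"} = 3
node_process_state{state="T"} = 0
node_process_state{state="I"} = 106
node_process_pids = 476
"""

LINE = re.compile(r'^(\w+)(?:\{\w+="(\w)"\})? = (\d+)$')


def parse_metrics(text):
    """Split lines into unlabeled totals and per-state counts (hypothetical helper)."""
    totals = {}
    by_state = defaultdict(dict)
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m is None:
            continue
        name, state, value = m.groups()
        if state is None:
            totals[name] = int(value)
        else:
            by_state[name][state] = int(value)
    return totals, by_state


totals, by_state = parse_metrics(SAMPLE)
print(sum(by_state["node_process_state"].values()), totals["node_process_pids"])
print(sum(by_state["node_process_threads_state"].values()), totals["node_process_threads"])
```

Both sums agree with the totals in this sample (476 processes, 1668 threads).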

So this feature request should be handled by implementing process or some equivalent, but it should use a different metrics name.

Do you mean process-level CPU/memory metrics should be discussed in another PR?

@cosmo0920
Contributor

Yes. I wanted to discuss this issue and another PR for process-level metrics.

@cosmo0920
Contributor

For reference, we need to implement process metrics like this: https://github.com/ncabatoff/process-exporter/blob/master/collector/process_collector.go

@kubotat
Author

kubotat commented Aug 31, 2023

@cosmo0920 Thanks.

For reference, we need to implement process metrics like this: https://github.com/ncabatoff/process-exporter/blob/master/collector/process_collector.go

Yes, I was checking exactly the same code :)

Do you think it is possible to implement a feature to scrape the top 10 processes in the input plugin, or should it be implemented in a filter plugin? I would like to hear your thoughts on it.

@cosmo0920
Contributor

cosmo0920 commented Aug 31, 2023

I think scraping only the top 10 processes is costly, because determining the order requires traversing procfs. As in the link above, we should implement it by traversing all of the per-process metrics that belong to each process's procfs entry.
This is because we would need to dig through everything to sort by CPU, memory, network bandwidth, or other points of view.

Ordering the top 10 processes by a metric should be handled on the monitoring solution side.
For instance, Splunk can display the top 10 metrics in each of its graphs. That depends on the configuration but, as far as I remember, showing the top 10 metrics is the default.

Another option: perhaps we need to implement a filtering feature for metrics in cmetrics?
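To illustrate the cost argument, here is a small Python sketch. Every process has to be read before anything can be ranked, and the resulting "top N" differs depending on which dimension is used for ordering. The process names, numbers, and the `top_n` helper are made up for illustration.

```python
import heapq
from typing import NamedTuple


class Sample(NamedTuple):
    name: str
    cpu_seconds: float
    rss_bytes: int


def top_n(samples, n, key):
    """Return the names of the n largest samples by the given field.

    All samples must already have been collected: ranking cannot skip
    any process, because any of them might belong in the top n.
    """
    largest = heapq.nlargest(n, samples, key=lambda s: getattr(s, key))
    return [s.name for s in largest]


# Hypothetical snapshot of every process on the host.
SAMPLES = [
    Sample("fluent-bit", 12.5, 40_000_000),
    Sample("postgres", 90.0, 900_000_000),
    Sample("sshd", 0.3, 8_000_000),
    Sample("java", 45.0, 2_000_000_000),
]

print(top_n(SAMPLES, 2, "cpu_seconds"))  # ['postgres', 'java']
print(top_n(SAMPLES, 2, "rss_bytes"))    # ['java', 'postgres']
```

Since the full set is collected anyway, exporting all of it and letting the monitoring side choose the ordering and cutoff loses nothing.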

@patrick-stephens
Contributor

Agreed @cosmo0920, plus the choice of top 10/9/8/100 will be arbitrary, so it should be left to the user to tune what is required.

@kubotat
Author

kubotat commented Aug 31, 2023

@cosmo0920 @patrick-stephens Thanks.

Ordering the top 10 processes by a metric should be handled on the monitoring solution side.
For instance, Splunk can display the top 10 metrics in each of its graphs. That depends on the configuration but, as far as I remember, showing the top 10 metrics is the default.

That makes sense to me.

@cosmo0920
Contributor

cosmo0920 commented Sep 20, 2023

@kubotat I sent a PR covering this issue: #7943

I have a question about your request. As I noticed, a process's status changes rapidly, so capturing the R (running) status is very timing-sensitive. For now, I dropped the feature for filtering by process status.
Even with this difficulty, do you need support for a regex/parameter to filter by process status?

I mean that the Linux process scheduler depends on this parameter for preemption latency: https://elixir.bootlin.com/linux/v6.5.4/source/kernel/sched/fair.c#L72

This could be too small relative to the interval at which metrics are scraped:
default: 6ms * (1 + ilog(number of online CPUs)) (unit: nanoseconds) vs. 5 seconds (default scrape interval)

This means it is about three orders of magnitude smaller than the scrape interval for collecting metrics.
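The comparison can be worked through numerically. The Python sketch below evaluates the formula as quoted above and compares it with the 5 second scrape interval; the function name is hypothetical, and any additional scaling or capping the kernel applies to the CPU count is not modeled.

```python
SCRAPE_INTERVAL_NS = 5 * 1_000_000_000  # 5 seconds, the default mentioned above
BASE_LATENCY_NS = 6_000_000             # 6 ms, in nanoseconds


def sched_latency_ns(online_cpus: int) -> int:
    """6ms * (1 + ilog2(online CPUs)), following the quoted formula only."""
    # For positive integers, floor(log2(n)) == n.bit_length() - 1.
    ilog2 = online_cpus.bit_length() - 1
    return BASE_LATENCY_NS * (1 + ilog2)


for cpus in (1, 2, 8):
    latency = sched_latency_ns(cpus)
    ratio = SCRAPE_INTERVAL_NS / latency
    print(f"{cpus} CPU(s): {latency / 1e6:.0f} ms, scrape interval is {ratio:.0f}x longer")
```

With one CPU the scrape interval is roughly 833 times the latency target, and with eight CPUs roughly 208 times, so a process can enter and leave the R state many times between two scrapes.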

@kubotat
Author

kubotat commented Sep 21, 2023

@cosmo0920 Thank you so much for your feedback.

Even with this difficulty, do you need support for a regex/parameter to filter by process status?

Filtering processes by status is not a mandatory requirement.
Do process_include_pattern and process_exclude_pattern filter the metrics by process name?

@cosmo0920
Contributor

Filtering processes by status is not a mandatory requirement.
Do process_include_pattern and process_exclude_pattern filter the metrics by process name?

OK, I understand. And yes, they are already implemented in #7943.
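For readers following along, name-based include/exclude filtering could behave roughly like the Python sketch below. The option names come from this thread, but the semantics shown (keep a process if its name matches the include pattern and does not match the exclude pattern) are an assumption, not a description of #7943; `select_processes` and the process names are hypothetical.

```python
import re
from typing import Iterable, List, Optional


def select_processes(names: Iterable[str],
                     include_pattern: str = ".+",
                     exclude_pattern: Optional[str] = None) -> List[str]:
    """Keep names matching include_pattern and not matching exclude_pattern.

    Assumed semantics for illustration only.
    """
    include = re.compile(include_pattern)
    exclude = re.compile(exclude_pattern) if exclude_pattern else None
    kept = []
    for name in names:
        if include.search(name) is None:
            continue
        if exclude is not None and exclude.search(name) is not None:
            continue
        kept.append(name)
    return kept


names = ["fluent-bit", "sshd", "fluentd"]
print(select_processes(names, include_pattern="fluent", exclude_pattern="fluentd"))
```

Unlike status filtering, a process name does not change between scrapes, so this kind of filter is not affected by the timing problem discussed above.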

@kubotat
Author

kubotat commented Oct 16, 2023

@cosmo0920 Is there a timeline for when PR #7943 will be merged into the main branch?

@cosmo0920
Contributor

cosmo0920 commented Oct 17, 2023

Not sure, but we might be able to include this feature in the 2.2 development cycle...
