
Replaced ioutil.ReadAll with bufio.Scanner #1976

Closed
chicknsoup wants to merge 1 commit

Conversation

chicknsoup

When /proc/net/tcp and /proc/net/tcp6 contain hundreds of thousands of lines, Prometheus will not be able to scrape node_exporter with the tcpstat collector enabled (timeout exceeded).

Using bufio.Scanner resolves the issue.


Signed-off-by: chinhnc <chicknsoupuds@gmail.com>
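For context, a minimal sketch of the approach this PR takes, assuming an illustrative countStates helper rather than the collector's real parseTCPStats code: stream the pseudo-file line by line with bufio.Scanner instead of buffering it whole with ioutil.ReadAll.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// countStates tallies the connection-state column (the 4th field, in hex) of
// a /proc/net/tcp-formatted stream without buffering the whole file.
func countStates(r io.Reader) (map[string]int, error) {
	states := make(map[string]int)
	scanner := bufio.NewScanner(r)
	scanner.Scan() // skip the header line
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 4 {
			continue
		}
		states[fields[3]]++
	}
	// bufio.Scanner reports read failures only via Err(), never via Scan().
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	return states, nil
}

func main() {
	f, err := os.Open("/proc/net/tcp")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	states, err := countStates(f)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(states)
}
```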
@@ -60,13 +60,6 @@ func Test_parseTCPStatsError(t *testing.T) {

func TestTCPStat(t *testing.T) {

noFile, _ := os.Open("follow the white rabbit")
Member

This test should probably not be removed. We want to make sure we catch the error.

Author

Because these lines are replaced:

		return nil, err
	}

that test is no longer valid.

Member

The test is not obsolete; you need to check the scanner for errors.

Author

OK, I forgot to check scanner.Err().
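To make that concrete, a standalone sketch of the error check, using a hypothetical countLines helper rather than the real parser: bufio.Scanner only surfaces read failures through Err(), and that check is what keeps the missing-file test above meaningful.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

// countLines is an illustrative helper: it streams a reader with
// bufio.Scanner and propagates read failures via scanner.Err().
func countLines(r io.Reader) (int, error) {
	n := 0
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		n++
	}
	// Scan() simply returns false on a read error; Err() exposes it.
	if err := scanner.Err(); err != nil {
		return 0, err
	}
	return n, nil
}

func main() {
	// Opening a nonexistent path yields a nil *os.File; reading from it
	// fails, and that failure is what the removed test asserts on.
	noFile, _ := os.Open("follow the white rabbit")
	if _, err := countLines(noFile); err != nil {
		fmt.Println("got expected error:", err)
	}
}
```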

@SuperQ
Member

SuperQ commented Feb 22, 2021

This was explicitly changed away from bufio.Scanner in #1380 in order to make sure we read the file in one syscall. Using a scanner leads to poor interaction with the kernel, which allows the contents of the file to change while you read it.

If you have hundreds of thousands of entries in this file, that's going to be many megabytes of output. Is it fast enough to be practical to read this much data? Do you have any benchmarks for this?
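For comparison, a rough sketch of the read-everything-first behaviour described above, with ioutil.ReadAll standing in for however the collector actually buffers the file:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"strings"
)

// readSnapshot buffers the entire pseudo-file before any parsing happens,
// which is the behaviour #1380 wanted to preserve.
func readSnapshot(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	data, err := ioutil.ReadAll(f)
	if err != nil {
		return nil, err
	}
	return strings.Split(strings.TrimSpace(string(data)), "\n"), nil
}

func main() {
	lines, err := readSnapshot("/proc/net/tcp")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d lines (including the header)\n", len(lines))
}
```

The trade-off raised in this thread is memory: the whole file has to fit in the buffer before a single line is parsed.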

@chicknsoup
Author

This was explicitly changed away from bufio.Scanner in #1380 in order to make sure we read the file in one syscall. Using a scanner leads to poor interaction with the kernel, which allows the contents of the file to change while you read it.

If you have hundreds of thousands of entries in this file, that's going to be many megabytes of output. Is it fast enough to be practical to read this much data? Do you have any benchmarks for this?

Actually the file was around 60MB (due to an application's connection leak), and the server was under heavy load. However, it is back to normal now, so I cannot run a benchmark. I will try to do so when I see any of our servers in the same situation.

@SuperQ
Member

SuperQ commented Feb 23, 2021

It sounds like this was working as intended. If the collector started to fail, it should report this via node_scrape_collector_success.

In situations where the node is failing, we don't want the exporter to contribute to the failure.

We're intentionally short-circuiting the read in order to avoid situations exactly like what you're describing.

@SuperQ
Member

SuperQ commented Feb 23, 2021

It looks like this hasn't been converted to the new functions available in the procfs library. We should probably do that.

The procfs library uses an io.LimitReader() to gather the data. The only downside is that the limit is currently 4GiB. This would be a lot of memory in situations like you describe.
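As an illustration of the io.LimitReader() pattern mentioned here (the cap below is a made-up 64MiB for the sketch, not procfs's actual limit):

```go
package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"os"
)

// maxProcRead is a hypothetical cap; per the discussion above, the procfs
// library currently uses a much larger limit.
const maxProcRead = 64 * 1024 * 1024

// readCapped buffers at most maxProcRead bytes of the pseudo-file, so a
// runaway /proc/net/tcp cannot consume unbounded memory.
func readCapped(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	// io.LimitReader returns io.EOF once the limit is reached, so ReadAll
	// stops there instead of growing its buffer indefinitely.
	return ioutil.ReadAll(io.LimitReader(f, maxProcRead))
}

func main() {
	data, err := readCapped("/proc/net/tcp")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("read %d bytes\n", len(data))
}
```

The trade-off of a cap like this is that anything past the limit is silently dropped rather than parsed.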

@chicknsoup
Author

It sounds like this was working as intended. If the collector started to fail, it should report this via node_scrape_collector_success.

In situations where the node is failing, we don't want the exporter to contribute to the failure.

We're intentionally short-circuiting the read in order to avoid situations exactly like what you're describing.

In my situation the tcpstat collector did not fail, but it took too long (exceeding the scrape timeout) and the whole scrape failed.

It looks like this hasn't been converted to the new functions available in the procfs library. We should probably do that.

The procfs library uses an io.LimitReader() to gather the data. The only downside is that the limit is currently 4GiB. This would be a lot of memory in situations like you describe.

I will try this when I have a server under high load.

@SuperQ
Member

SuperQ commented Feb 23, 2021

In my situation the tcpstat collector did not fail, but it took too long (exceeding the scrape timeout) and the whole scrape failed.

This is still working as intended for Prometheus. We can't predict all failures, so by design, we prefer to hard fail and be as noisy as possible.

@chicknsoup
Author

In my situation the tcpstat collector did not fail, but it took too long (exceeding the scrape timeout) and the whole scrape failed.

This is still working as intended for Prometheus. We can't predict all failures, so by design, we prefer to hard fail and be as noisy as possible.

I see, so I'm closing this PR.

@chicknsoup closed this Feb 25, 2021