
Replaced ioutil.ReadAll with bufio.Scanner #1976

Closed
chicknsoup wants to merge 1 commit

Conversation

chicknsoup

When /proc/net/tcp and /proc/net/tcp6 contain hundreds of thousands of lines, Prometheus will not be able to scrape node_exporter with the tcpstat collector enabled (timeout exceeded).

Using bufio.Scanner resolves the issue.


Signed-off-by: chinhnc <chicknsoupuds@gmail.com>
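For context, a minimal sketch of the approach this PR takes, assuming an illustrative countStates helper rather than the collector's real parseTCPStats code: stream the pseudo-file line by line with bufio.Scanner instead of buffering it whole with ioutil.ReadAll.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
	"strings"
)

// countStates tallies the connection-state column (the 4th field, in hex) of
// a /proc/net/tcp-formatted stream without buffering the whole file.
func countStates(r io.Reader) (map[string]int, error) {
	states := make(map[string]int)
	scanner := bufio.NewScanner(r)
	scanner.Scan() // skip the header line
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 4 {
			continue
		}
		states[fields[3]]++
	}
	// bufio.Scanner reports read failures only via Err(), never via Scan().
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	return states, nil
}

func main() {
	f, err := os.Open("/proc/net/tcp")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	states, err := countStates(f)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(states)
}
```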
@@ -60,13 +60,6 @@ func Test_parseTCPStatsError(t *testing.T) {

func TestTCPStat(t *testing.T) {

noFile, _ := os.Open("follow the white rabbit")
Member

This test should probably not be removed. We want to make sure we catch the error.

Author

Because these lines are replaced:

		return nil, err
	}

that test is no longer valid.

Member

The test is not obsolete; you need to check the scanner for errors.

Author

OK, I forgot to check scanner.Err().
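To make that concrete, a standalone sketch of the error check, using a hypothetical countLines helper rather than the real parser: bufio.Scanner only surfaces read failures through Err(), and that check is what keeps the missing-file test above meaningful.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

// countLines is an illustrative helper: it streams a reader with
// bufio.Scanner and propagates read failures via scanner.Err().
func countLines(r io.Reader) (int, error) {
	n := 0
	scanner := bufio.NewScanner(r)
	for scanner.Scan() {
		n++
	}
	// Scan() simply returns false on a read error; Err() exposes it.
	if err := scanner.Err(); err != nil {
		return 0, err
	}
	return n, nil
}

func main() {
	// Opening a nonexistent path yields a nil *os.File; reading from it
	// fails, and that failure is what the removed test asserts on.
	noFile, _ := os.Open("follow the white rabbit")
	if _, err := countLines(noFile); err != nil {
		fmt.Println("got expected error:", err)
	}
}
```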

@SuperQ
Member

SuperQ commented Feb 22, 2021

This was explicitly changed away from bufio.Scanner in #1380 in order to make sure we read the file in one syscall. Using a scanner leads to poor interaction with the kernel, which allows the contents of the file to change while you read it.

If you have hundreds of thousands of entries in this file, that's going to be many megabytes of output. Is it fast enough to be practical to read this much data? Do you have any benchmarks for this?
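For comparison, a rough sketch of the read-everything-first behaviour described above, with ioutil.ReadAll standing in for however the collector actually buffers the file:

```go
package main

import (
	"fmt"
	"io/ioutil"
	"os"
	"strings"
)

// readSnapshot buffers the entire pseudo-file before any parsing happens,
// which is the behaviour #1380 wanted to preserve.
func readSnapshot(path string) ([]string, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	data, err := ioutil.ReadAll(f)
	if err != nil {
		return nil, err
	}
	return strings.Split(strings.TrimSpace(string(data)), "\n"), nil
}

func main() {
	lines, err := readSnapshot("/proc/net/tcp")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%d lines (including the header)\n", len(lines))
}
```

The trade-off raised in this thread is memory: the whole file has to fit in the buffer before a single line is parsed.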

@chicknsoup
Author

This was explicitly changed away from bufio.Scanner in #1380 in order to make sure we read the file in one syscall. Using a scanner leads to poor interaction with the kernel, which allows the contents of the file to change while you read it.

If you have hundreds of thousands of entries in this file, that's going to be many megabytes of output. Is it fast enough to be practical to read this much data? Do you have any benchmarks for this?

Actually the file was around 60MB (due to an application's connection leak), and the server was under heavy load. However, it is back to normal now, so I cannot run a benchmark. I will try to do so when I see any of our servers in the same situation.

@SuperQ
Member

SuperQ commented Feb 23, 2021

It sounds like this was working as intended. If the collector started to fail, it should report this via node_scrape_collector_success.

In situations where the node is failing, we don't want the exporter to contribute to the failure.

We're intentionally short-circuiting the read in order to avoid situations exactly like what you're describing.

@SuperQ
Member

SuperQ commented Feb 23, 2021

It looks like this hasn't been converted to the new functions available in the procfs library. We should probably do that.

The procfs library uses an io.LimitReader() to gather the data. The only downside is that the limit is currently 4GiB. This would be a lot of memory in situations like you describe.
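As an illustration of the io.LimitReader() pattern mentioned here (the cap below is a made-up 64MiB for the sketch, not procfs's actual limit):

```go
package main

import (
	"fmt"
	"io"
	"io/ioutil"
	"os"
)

// maxProcRead is a hypothetical cap; per the discussion above, the procfs
// library currently uses a much larger limit.
const maxProcRead = 64 * 1024 * 1024

// readCapped buffers at most maxProcRead bytes of the pseudo-file, so a
// runaway /proc/net/tcp cannot consume unbounded memory.
func readCapped(path string) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	// io.LimitReader returns io.EOF once the limit is reached, so ReadAll
	// stops there instead of growing its buffer indefinitely.
	return ioutil.ReadAll(io.LimitReader(f, maxProcRead))
}

func main() {
	data, err := readCapped("/proc/net/tcp")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("read %d bytes\n", len(data))
}
```

The trade-off of a cap like this is that anything past the limit is silently dropped rather than parsed.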

@chicknsoup
Author

It sounds like this was working as intended. If the collector started to fail, it should report this via node_scrape_collector_success.

In situations where the node is failing, we don't want the exporter to contribute to the failure.

We're intentionally short-circuiting the read in order to avoid situations exactly like what you're describing.

In my situation the tcpstat collector did not fail, but it took too long (exceeding the scrape timeout) and the whole scrape failed.

It looks like this hasn't been converted to the new functions available in the procfs library. We should probably do that.

The procfs library uses an io.LimitReader() to gather the data. The only downside is that the limit is currently 4GiB. This would be a lot of memory in situations like you describe.

I will try this when I have a server under high load.

@SuperQ
Member

SuperQ commented Feb 23, 2021

In my situation the tcpstat collector did not fail, but it took too long (exceeding the scrape timeout) and the whole scrape failed.

This is still working as intended for Prometheus. We can't predict all failures, so by design, we prefer to hard fail and be as noisy as possible.

@chicknsoup
Author

In my situation the tcpstat collector did not fail, but it took too long (exceeding the scrape timeout) and the whole scrape failed.

This is still working as intended for Prometheus. We can't predict all failures, so by design, we prefer to hard fail and be as noisy as possible.

I see, so I'm closing this PR.

@chicknsoup closed this Feb 25, 2021