lustre_exporter crash after query from cURL #110

Closed
mtds opened this issue Jul 24, 2017 · 24 comments

@mtds commented Jul 24, 2017

Just started lustre_exporter from the command line:

/usr/local/bin# ./lustre_exporter
INFO[0000] Starting lustre_exporter (version=1.0.0-rc.1, branch=master, revision=35894131aa5bef1d01a94c8a0c64c37cdc5ffaaf)  source="lustre_exporter.go:129"
INFO[0000] Build context (go=go1.8.3, user=myuser@myhost, date=20170724-10:21:35)  source="lustre_exporter.go:130"
INFO[0000] Enabled sources:                              source="lustre_exporter.go:140"
INFO[0000]  - procfs                                     source="lustre_exporter.go:142"
INFO[0000]  - procsys                                    source="lustre_exporter.go:142"
INFO[0000] Listening on :9169                            source="lustre_exporter.go:163"

Executed cURL from another host to get a list of available metrics:

curl http://myhost.mydomain:9169/metrics

On the client I got "curl: (52) Empty reply from server", while on the host running the exporter I got the following error:

panic: runtime error: slice bounds out of range [recovered]
        panic: runtime error: slice bounds out of range

goroutine 26 [running]:
io/ioutil.readAll.func1(0xc42031ba38)
        /data.local1/go/src/io/ioutil/ioutil.go:30 +0x119
panic(0x7c0fa0, 0xa15ff0)
        /data.local1/go/src/runtime/panic.go:489 +0x2cf
bytes.(*Buffer).ReadFrom(0xc42031b990, 0x9ece00, 0xc42018d020, 0xc4206cf800, 0x0, 0x200)
        /data.local1/go/src/bytes/buffer.go:180 +0x25c
io/ioutil.readAll(0x9ece00, 0xc42018d020, 0x200, 0x0, 0x0, 0x0, 0x0, 0x0)
        /data.local1/go/src/io/ioutil/ioutil.go:33 +0x150
io/ioutil.ReadFile(0xc42028cab0, 0x2a, 0x0, 0x0, 0x0, 0x0, 0x0)
        /data.local1/go/src/io/ioutil/ioutil.go:70 +0xf7
github.com/HewlettPackard/lustre_exporter/sources.parseStatsFile(0x83771a, 0x1c, 0x830f81, 0xe, 0xc42028cab0, 0x2a, 0x0, 0x0, 0x0, 0x410ad8, ...)
        /data.local1/src/golang_work/src/github.com/HewlettPackard/lustre_exporter/sources/procfs.go:551 +0x42
github.com/HewlettPackard/lustre_exporter/sources.(*lustreProcfsSource).parseFile(0xc42010c270, 0x82de6f, 0x7, 0x832a75, 0x12, 0xc42028cab0, 0x2a, 0x0, 0x83771a, 0x1c, ...)
        /data.local1/src/golang_work/src/github.com/HewlettPackard/lustre_exporter/sources/procfs.go:806 +0x14e
github.com/HewlettPackard/lustre_exporter/sources.(*lustreProcfsSource).Update(0xc42010c270, 0xc420086ae0, 0xa206c0, 0xffffffffffffffff)
        /data.local1/src/golang_work/src/github.com/HewlettPackard/lustre_exporter/sources/procfs.go:408 +0x381
main.collectFromSource(0x82dc02, 0x6, 0x9ec180, 0xc42010c270, 0xc420086ae0)
        /data.local1/src/golang_work/src/github.com/HewlettPackard/lustre_exporter/lustre_exporter.go:70 +0x79
main.LustreSource.Collect.func1(0xc420086ae0, 0xc420114cf0, 0x82dc02, 0x6, 0x9ec180, 0xc42010c270)
        /data.local1/src/golang_work/src/github.com/HewlettPackard/lustre_exporter/lustre_exporter.go:59 +0x53
created by main.LustreSource.Collect
        /data.local1/src/golang_work/src/github.com/HewlettPackard/lustre_exporter/lustre_exporter.go:61 +0x111

I am running the exporter on Debian GNU/Linux 7.11 (wheezy) while I have compiled the binary with Golang 1.8.3 on Debian GNU/Linux 8.8 (jessie).

@joehandzik (Contributor)

Hey @mtds, I'm looking into this one, but it seems like the error is coming from outside the scope of our code (we're panicking inside the ReadFile call here: https://github.com/HewlettPackard/lustre_exporter/blob/master/sources/procfs.go#L551).

Are you by any chance running on a 32 bit version of Debian?

@mtds (Author) commented Jul 24, 2017

Just another thing: if I launch cURL on the command line like this

curl http://myhost.mydomain:9169/

I will get the following output (and the exporter does not crash):

<html>
      <head><title>Lustre Exporter</title></head>
       <body>
           <h1>Lustre Exporter</h1>
            <p><a href="/metrics">Metrics</a></p>
       </body>
</html>

which I believe is consistent with what is declared in the main function of lustre_exporter.go via http.HandleFunc, although I was expecting the exporter to make full use of the promhttp package and present the list of available metrics rather than just a short HTML message.

@mtds (Author) commented Jul 24, 2017

Hi joehandzik.

Sorry, I did not see your comment until I posted mine.

No, I am running 64-bit Debian on all machines used in my test. The host where I compiled
the exporter has an Intel Xeon W3505 CPU, while one of the Lustre file servers I used
has an Intel Xeon E5-2650 CPU.

@joehandzik (Contributor)

@mtds, that second bit is definitely expected behavior. The node_exporter behaves the same way (https://github.com/prometheus/node_exporter); I just tested it myself. You do need to target the /metrics endpoint to get the actual metrics; that's also expected behavior (and it works on my machine for the node_exporter, so the issue you've found is legitimate).

@joehandzik (Contributor)

@mtds Not sure what to make of the fact that you're on a 64 bit box. I can't see any legitimate reason why that ReadFile operation should be overwhelmed by a memory allocation. I'll keep looking, may end up filing a bug against Golang itself.

@joehandzik (Contributor)

@mtds This continues to look like a problem we can't solve within the lustre_exporter, so we need to narrow down the problem to be able to get a fix more effectively. What sort of node were you running this on (client, mds, oss, etc)? Depending on your answer, we'll want to investigate specific procfs files.

@mtds (Author) commented Jul 24, 2017

@joehandzik, I think the problem may be related to a difference in the platform we are using.

The lustre_exporter crashed in the following cases:

  • MDS (Debian Wheezy / kernel 2.6.32 backported from CentOS / lustre: 2.5.3.90)
  • OSS (Debian Wheezy / kernel 3.2.88-1 / lustre: 2.5.3.90)

but it's working correctly on one of our clients, which mounts Lustre.

The client is also a 64-bit machine, but the most important thing is that it is running Debian Jessie
(the same distribution on which I built the binary). Here are the details:

  • Client (Debian Jessie / kernel 3.16.43-2+deb8u1 / lustre: 2.6.92)

I can try to build the exporter on a Debian Wheezy machine and see if I get the same problem.

By the way, you are absolutely right: I had never tested it before, but (as you wrote) a simple curl http://myhost.domain:port/ returns an HTML message. I just tested it with node_exporter,
and you get the same output with the ipmi_exporter as well.

@mtds (Author) commented Jul 24, 2017

Well, actually (as you wrote), it could also be a problem with how the lustre_exporter is accessing those
procfs files.

I can also run a test with a Lustre file system I have built on top of KVM-based virtual machines,
using CentOS and Lustre 2.8, just to see whether this issue is directly related to the type of
nodes involved.

@joehandzik (Contributor)

@mtds It's worth a shot to build on Wheezy just to eliminate that variable, but that shouldn't matter based on everything I (think I) understand about how Go works. We're more interested in the version differences between your MDS/OSS Lustre version and the client Lustre version (we've done most of our testing on Lustre 2.7 and up).

Can you possibly get us the sizes of all files underneath /proc/fs/lustre/? Primarily, I think we need sizes for the files named "stats", "md_stats", and "encrypt_page_pools". Those are the only files that I think we'd be touching. @roclark, let me know if I'm off.

@roclark (Contributor) commented Jul 24, 2017

Yup, those are the appropriate files. The full paths would be:

# On OSS nodes:
/proc/fs/lustre/obdfilter/<OST>/stats
# On MDS nodes:
/proc/fs/lustre/mdt/<MDT>/md_stats
# On all Lustre nodes:
/proc/fs/lustre/sptlrpc/encrypt_page_pools

I am curious whether a bug involving one of these files was fixed in Lustre 2.6. I am not too familiar with the bug reports for pre-2.8, but it may be that we are experiencing something like that. Regardless, looking at those files would be a good next step in debugging this further.

@mtds (Author) commented Jul 24, 2017

Sorry, just to clarify: if you are referring to the size of the files (and not the number of lines),
the closest trick I know to get a meaningful number is to use wc -c, which reports the size
in bytes.

  • OSS stats: we have 7 OSTs on every OSS, and I got more or less the same number for each of them: 932 bytes.
  • MDS: md_stats is reported to be 858 bytes;
  • Lustre client: encrypt_page_pools is reported to be 18 bytes.

@joehandzik (Contributor)

Hmm, those are tiny... I did some research, and the ReadFile implementation shouldn't run into any sort of trouble unless the files are multiple gigabytes in size. Since the MDS is crashing and it only has the one md_stats file, could you send us that file? Feel free to use the email address on my GitHub profile.

@mtds (Author) commented Jul 24, 2017

@joehandzik, I just sent you the email containing our latest md_stats (I am not sure it will be really
helpful, though).

The test with the lustre_exporter built on Wheezy gave exactly the same error message I reported
at the beginning of this thread. Needless to say, it crashes too. I used Golang 1.8.3 and a 64-bit
Wheezy VM to build the binary.

I still have to test it with a CentOS deployment, which uses Lustre 2.8.

@mtds (Author) commented Jul 25, 2017

Just another test: I have compiled the lustre_exporter on a VM with CentOS 6.8, using Golang
1.8.3.

The Lustre test filesystem I am using is based on version 2.8.0, installed directly from RPMs
from CentOS repos. All the VMs I have used in this test are installed with CentOS 6.8 (64 bit).

Here is a summary:

  • MDS / OSS: cURL works as expected, no crashes.
  • Lustre client: cURL works as expected, but lustre_exporter reports the following error (although it did not crash):
ERRO[0037] ERROR: "procfs" source failed after 0.004039 seconds: strconv.ParseUint: parsing "15.31": invalid syntax  source="lustre_exporter.go:73"

So, it looks like the problem is definitely related to MDS and OSS running Debian Wheezy with Lustre 2.5.3.90.

In the client case, I don't think the problem is related to the size of encrypt_page_pools, since it's
just 539 bytes. Anyway, for the sake of completeness of my report, I will copy & paste it here:

physical pages:          125490
pages per pool:          512
max pages:               15686
max pools:               31
total pages:             0
total free:              0
idle index:              0/100
last shrink:             661s
last access:             661s
max pages reached:       0
grows:                   0
grows failure:           0
shrinks:                 0
cache access:            0
cache missing:           0
low free mark:           0
max waitqueue depth:     0
max wait time:           0/1000
out of mem:              0

@joehandzik (Contributor)

@mtds Thank you very much for the thorough investigation! I did get your email last night, by the way; I noticed an unexpected newline at the end of the file, but nothing else out of the ordinary.

The client issue appears to be us trying to parse an unsigned integer out of a value that needs to be parsed as a floating-point number. We've known that we should probably remove integer operations internally, simply because Prometheus operates on floats and we convert to floats eventually anyway, but this is a good reason to move more quickly on that.
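To illustrate, here is a minimal sketch of the failure and the float-based alternative; the value "15.31" is taken from the error message above, and this is a generic illustration rather than the actual patch:

package main

import (
	"fmt"
	"strconv"
)

func main() {
	value := "15.31"

	// This reproduces the reported error:
	// strconv.ParseUint: parsing "15.31": invalid syntax
	if _, err := strconv.ParseUint(value, 10, 64); err != nil {
		fmt.Println("ParseUint failed:", err)
	}

	// Parsing as a float works, and Prometheus operates on float64 anyway.
	f, err := strconv.ParseFloat(value, 64)
	if err != nil {
		fmt.Println("ParseFloat failed:", err)
		return
	}
	fmt.Println("ParseFloat succeeded:", f)
}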

@joehandzik (Contributor)

@mtds FYI, your parsing issue should be solved in current master. I'll take a look at integrating the file you emailed me into our functional testing code, to see if we can replicate the behavior you saw.

@mtds (Author) commented Jul 25, 2017

Thanks @joehandzik! I have just recompiled lustre_exporter on CentOS from the updated master and
I can confirm that the procfs error is gone.

Just for the sake of trying, I also recompiled the latest master branch on Wheezy and tried it again:
same error; unfortunately, the binary crashed again.

@joehandzik (Contributor)

@mtds Great, thanks for taking a look. Though of course not so great that there's still a problem with Wheezy or older Lustre. I'll keep thinking of ways to figure out what the root cause is here.

@mtds (Author) commented Jul 25, 2017

@joehandzik it would be really nice if you manage to find the root cause.

I was thinking of running lustre_exporter through a debugger like GDB, but I don't think I would end up finding anything useful.

@wutz (Contributor) commented Dec 1, 2017

I have run into the same issue.

  • Lustre Version: 2.5.3
  • OS / Kernel Version: RHEL 6.2 / 2.6.32-431

Debugging shows that os.File.Read on /proc/fs/lustre/sptlrpc/encrypt_page_pools returns an n greater than the buffer length, and this causes the panic.

I also tried writing a C program; the read() function has the same issue (it returns a size greater than the buffer size).

The test buffer size is 512 because ioutil.ReadFile allocates a buffer of file_size+512 in the Go library, and for this /proc file the reported size is always 0.

If I use a large enough buffer (e.g. 1024) with read() or os.File.Read, the issue disappears.
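A minimal sketch of that workaround, reading the file with a fixed 1024-byte buffer instead of relying on ioutil.ReadFile's size-based allocation (path and buffer size as above; an illustration, not the exporter's actual code):

package main

import (
	"fmt"
	"os"
)

func main() {
	f, err := os.Open("/proc/fs/lustre/sptlrpc/encrypt_page_pools")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()

	// /proc reports a file size of 0, so size-based allocation ends up with
	// only a 512-byte buffer; use a fixed buffer large enough for the whole file.
	buf := make([]byte, 1024)
	n, err := f.Read(buf)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("read %d bytes:\n%s", n, buf[:n])
}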

@joehandzik (Contributor)

@wutaizeng I think the right answer here is just to disable the generic collector by using the flag setting '--collector.generic=core'. You'll still get the lustre_health metric; you'll only lose the metrics from encrypt_page_pools. As far as I can tell from the issues the two of you have run into, it only happens on older versions of Lustre anyway, so we're covered on 2.10+ (which is what we're targeting master at right now). @wutaizeng and @mtds, is that an acceptable workaround (setting the core flag for generic metrics)?

@wutz (Contributor) commented Dec 3, 2017

accept

@mtds (Author) commented Dec 4, 2017

Hi @joehandzik

Thanks for the suggestion: starting the exporter with this flag and querying it over cURL works perfectly now! :-)

I have just tested the exporter on one of our OSS (with Debian Wheezy):

lustre_exporter --collector.generic=core
INFO[0000] Starting lustre_exporter (version=, branch=, revision=)  source="lustre_exporter.go:116"
INFO[0000] Build context (go=go1.9.2, user=, date=)      source="lustre_exporter.go:117"
INFO[0000] Collector status:                             source="lustre_exporter.go:119"
INFO[0000]  - OST State: extended                        source="lustre_exporter.go:121"
INFO[0000]  - MDT State: extended                        source="lustre_exporter.go:123"
INFO[0000]  - MGS State: extended                        source="lustre_exporter.go:125"
INFO[0000]  - MDS State: extended                        source="lustre_exporter.go:127"
INFO[0000]  - Client State: extended                     source="lustre_exporter.go:129"
INFO[0000]  - Generic State: core                        source="lustre_exporter.go:131"
INFO[0000]  - Lnet State: extended                       source="lustre_exporter.go:133"
INFO[0000] Available sources:                            source="lustre_exporter.go:142"
INFO[0000]  - procfs                                     source="lustre_exporter.go:144"
INFO[0000]  - procsys                                    source="lustre_exporter.go:144"
INFO[0000] Listening on :9169                            source="lustre_exporter.go:165"

Given that we are using quite an old Lustre version (2.5.3.90), I believe it does not make sense
to modify the code just to avoid this problem. Eventually we will reach the point where we
switch to Lustre 2.10+ too (hopefully :-)).

If possible, I would propose amending the README to note that similar crashes
(with old Lustre versions) should be re-tested with this flag before reporting a bug.

@joehandzik (Contributor)

@mtds, that seems reasonable. We'll add a "Troubleshooting" section to the README and suggest toggling the flag categories as a viable workaround. Really appreciate your thoughts!
