lustre_exporter crash after query from cURL #110
Comments
Hey @mtds, I'm looking into this one, but it seems like the error is coming from outside the scope of our code (we're panicking inside the ReadFile call here: https://github.com/HewlettPackard/lustre_exporter/blob/master/sources/procfs.go#L551). Are you by any chance running on a 32-bit version of Debian? |
Just another thing: if I launch cURL on the command line like this
I will get the following output (and the exporter does not crash):
which I believe is consistent with what is declared in the main function in lustre_exporter.go, |
Hi joehandzik. Sorry, I did not see your comment until I posted mine. No, I am running Debian 64 bit on all machines used in my test. On my host, where I have compiled |
@mtds, that second bit is expected behavior, definitely. The node_exporter behaves the same way (https://github.com/prometheus/node_exporter), just tested it myself. You need to target the /metrics endpoint to get the actual metrics themselves; that's also expected behavior (and it works on my machine for the node_exporter, so the issue you've found is legitimate). |
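For readers unfamiliar with the convention described above, here is a minimal Go sketch of a server that answers with an HTML landing page at / and plain-text metrics at /metrics. It uses only the standard library, not the Prometheus client library or the exporter's actual code; the metric name example_up and the helpers newMux and fetch are made up for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newMux wires up a landing page at "/" and plain-text metrics at "/metrics".
// The metric name and value are placeholders, not real exporter output.
func newMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; version=0.0.4")
		fmt.Fprintln(w, "example_up 1")
	})
	mux.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// The "/" pattern matches every path, so reject anything but the root.
		if r.URL.Path != "/" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprintln(w, `<html><body><a href="/metrics">Metrics</a></body></html>`)
	})
	return mux
}

// fetch issues an in-process GET against the handler, so no port is opened.
func fetch(h http.Handler, path string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	mux := newMux()
	for _, p := range []string{"/", "/metrics"} {
		code, body := fetch(mux, p)
		fmt.Printf("GET %s -> %d\n%s", p, code, body)
	}
}
```

A scraper (or cURL) pointed at the root path only ever sees the HTML page, which matches the behavior reported for both exporters in this thread.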
@mtds Not sure what to make of the fact that you're on a 64 bit box. I can't see any legitimate reason why that ReadFile operation should be overwhelmed by a memory allocation. I'll keep looking, may end up filing a bug against Golang itself. |
@mtds This continues to look like a problem we can't solve within the lustre_exporter, so we need to narrow down the problem to be able to get a fix more effectively. What sort of node were you running this on (client, mds, oss, etc)? Depending on your answer, we'll want to investigate specific procfs files. |
@joehandzik, I think the problem may be related to a difference in the platform we are using. The lustre_exporter crashed in the following cases:
but it's working correctly on one of our clients, which is mounting Lustre. The client is still a 64-bit machine, but the most important thing is that it is running Debian Jessie
I can try to build the exporter on a Debian Wheezy machine and see if I get the same problem. By the way, you are absolutely right: I never tested it before but (as you wrote) a simple curl http://myhost.domain:port/ reports an HTML message. I tested it right now with node_exporter |
Well, actually (as you wrote) it could also be a problem in how the lustre_exporter is accessing those I can also make a test with a Lustre file system I have built on top of KVM based virtual machines, |
@mtds It's worth a shot to build on Wheezy just to eliminate that variable, but that shouldn't matter based on everything I (think I) understand about how Go works. We're more interested in the version differences between your MDS/OSS Lustre version and the client Lustre version (we've done most of our testing on Lustre 2.7 and up). Can you possibly get us the sizes of all files underneath /proc/fs/lustre/? Primarily, I think we need sizes for files named "stats", "md_stats", and "encrypt_page_pools". Those are the only files that I think we'd be touching. @roclark, let me know if I'm off. |
Yup, those are the appropriate files. The full paths would be:
I am curious as to whether or not a bug was fixed in Lustre 2.6 with one of these files. I am not too familiar with the pre-2.8 bug reports, but we might be experiencing something like that. Regardless, looking at those files would be a good next step in debugging this further. |
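As a rough illustration of the size check requested above, the following Go sketch walks a directory tree and reports how many bytes can be read from every file named stats, md_stats, or encrypt_page_pools. It counts bytes by reading each file rather than trusting the reported file size, since procfs entries often report a size of 0. The helper names (countBytes, sizesUnder) are hypothetical and this is not code from the exporter; on a system affected by this issue, reading the files this way may itself misbehave.

```go
package main

import (
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
)

// wanted lists the file names mentioned in the thread.
var wanted = map[string]bool{"stats": true, "md_stats": true, "encrypt_page_pools": true}

// countBytes reads the file to the end and returns how many bytes came back.
func countBytes(path string) (int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	return io.Copy(io.Discard, f)
}

// sizesUnder walks root and returns the byte count of every wanted file.
func sizesUnder(root string) (map[string]int64, error) {
	sizes := make(map[string]int64)
	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, err error) error {
		if err != nil {
			return err // includes the case where root itself is missing
		}
		if d.IsDir() || !wanted[d.Name()] {
			return nil
		}
		n, err := countBytes(path)
		if err != nil {
			return err
		}
		sizes[path] = n
		return nil
	})
	return sizes, err
}

func main() {
	root := "/proc/fs/lustre" // directory named in the thread; override via argv
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	sizes, err := sizesUnder(root)
	if err != nil {
		fmt.Fprintln(os.Stderr, "walk failed:", err)
		return
	}
	for path, n := range sizes {
		fmt.Printf("%8d  %s\n", n, path)
	}
}
```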
Sorry but just to clarify: if you are referring exactly to the size of the files and not the number of lines,
|
Hmm, those are tiny... I did some research, and the ReadFile implementation shouldn't run into any sort of trouble unless the files are multiple gigabytes in size. Since the MDS is crashing and it only has the one md_stats file, could you send us that file? Feel free to use the email address I have on my GitHub profile. |
@joehandzik just sent you the email containing our latest md_stats (I am not sure it will be really The test with the lustre_exporter built on Wheezy gave exactly the same error message I reported I still have to test it with a CentOS deployment, which uses Lustre 2.8. |
Just another test: I have compiled the lustre_exporter on a VM with CentOS 6.8, using Golang The Lustre test filesystem I am using is based on version 2.8.0, installed directly from RPMs Here is a summary:
So, it looks like the problem is definitely related to MDS and OSS running Debian Wheezy with Lustre 2.5.3.90. In the client case, I don't think the problem is related to the size of encrypt_page_pools since it's
|
@mtds Thank you very much for the thorough investigation! I did get your email last night, btw, noticed an unexpected newline at the end of the file but nothing else out of the ordinary. The client issue appears to be us trying to parse an unsigned integer out of a number that needs to be parsed as a floating point. We've known that we probably should remove integer operations internally simply because Prometheus operates on floats and we convert to floats eventually anyway, but this is a good reason to move more quickly on that. |
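The parsing problem described above can be reproduced in a few lines of Go. This is a sketch only: the sample tokens are invented (the thread does not show the value that failed), and parseMetric is a hypothetical helper, not the exporter's function. It shows that strconv.ParseUint rejects a token with a fractional part while strconv.ParseFloat accepts both integer and fractional tokens.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMetric converts a raw token to float64, the type Prometheus uses
// for sample values. Surrounding whitespace is trimmed first.
func parseMetric(raw string) (float64, error) {
	return strconv.ParseFloat(strings.TrimSpace(raw), 64)
}

func main() {
	// Made-up sample tokens: an integer, a fraction, and a value with a
	// trailing newline.
	for _, raw := range []string{"42", "0.25", "7\n"} {
		_, uerr := strconv.ParseUint(strings.TrimSpace(raw), 10, 64)
		f, ferr := parseMetric(raw)
		fmt.Printf("%q: ParseUint error=%v; ParseFloat value=%v error=%v\n", raw, uerr, f, ferr)
	}
}
```

Parsing straight to float64 avoids a separate integer code path, which is in line with the reasoning given in the comment above.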
@mtds FYI, your parsing issue should be solved in current master. I'll take a look at integrating the file you emailed me into our functional testing code, to see if we can replicate the behavior you saw. |
Thanks @joehandzik! I have just recompiled lustre_exporter on CentOS from the updated master and Just for the sake of trying I have also recompiled the latest master branch on Wheezy and tried it again: |
@mtds Great, thanks for taking a look. Though of course not so great that there's still a problem with Wheezy or older Lustre. I'll keep thinking of ways to figure out what the root cause is here. |
@joehandzik it would be really nice if you manage to find the root cause. I was thinking of trying to run lustre_exporter through a debugger like GDB, but I don't think I'll end up finding something useful. |
I ran into the same issue.
Debug to find that the os.File.Read Try to write a C program, The test buf size is If I try to use a big enough buf size (e.g. 1024) to |
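The comment above is truncated, so the exact failure it describes cannot be recovered from it. One reading is that a single read with a sufficiently large buffer (1024 bytes is the figure mentioned) behaves differently from other read patterns. Under that assumption, a test program in Go might look like the sketch below; readOnce is a hypothetical helper, and this is not a fix that was adopted in the thread.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// readOnce opens path and issues exactly one Read call with a buffer of
// bufSize bytes, returning whatever that single call produced. It never
// loops, so a file that does not signal EOF cannot make the buffer grow.
func readOnce(path string, bufSize int) ([]byte, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	buf := make([]byte, bufSize)
	n, err := f.Read(buf)
	if err != nil && err != io.EOF {
		return nil, err
	}
	return buf[:n], nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: readonce <file>")
		return
	}
	// 1024 is the buffer size mentioned in the comment above.
	data, err := readOnce(os.Args[1], 1024)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
		return
	}
	fmt.Printf("read %d bytes in one call\n%s", len(data), data)
}
```

Comparing its output for different buffer sizes against the same procfs file would show whether the file's contents depend on how it is read.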
@wutaizeng I think the right answer here is just to disable the generic collector by using the flag setting '--collector.generic=core'. You'll still get the lustre_health metric, and you'll only lose the metrics from encrypt_page_pools. As far as I can tell from the issues that the two of you have run into, it only happens on older versions of Lustre anyway, so we're covered in 2.10+ (which is what we're targeting master at right now). @wutaizeng and @mtds, is that an acceptable workaround (setting the core flag for generic metrics)? |
accept |
Hi @joehandzik Thanks for the suggestion: starting the exporter with this flag and querying it over cURL works perfectly now! :-) I have just tested the exporter on one of our OSS (with Debian Wheezy):
Given the fact that we are using a quite old Lustre version (2.5.3.90) I believe it does not make If possible, I would propose to amend the README to reflect the fact that similar crashes |
@mtds, that seems reasonable. We'll add a "Troubleshooting" headline into the README and suggest toggling the flag categories as a viable workaround. Really appreciate your thoughts! |
Just started lustre_exporter from the command line:
Executed cURL from another host to get a list of available metrics:
On the client, I got curl: (52) Empty reply from server while on the host running the exporter I got the following errors:
I am running the exporter on Debian GNU/Linux 7.11 (wheezy) while I have compiled the binary with Golang 1.8.3 on Debian GNU/Linux 8.8 (jessie).