[0.9.6.1] Panic on writes #5262
@caligula1989 - we really need to see the data. Can you enable write-tracing? https://github.com/influxdb/influxdb/blob/master/etc/config.sample.toml#L220
This is an example:
What version of InfluxDB are you running?
0.9.6.1
@caligula1989 let me see if I can summarize the issue:
What percentage of writes are failing with the error? Is it 1%, 5%, 25%, 95%? Did you delete all data when switching between engines or did you just update the config and restart the process? Have you tried writing one of the failed batches using curl? If so, did the batch succeed? Have you tried resubmitting the same batch via the python client? Can you share more than one example of a failed batch? There's nothing obviously syntactically wrong with the single point example you provided, but perhaps in aggregate something will become clear.
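One quick way to do the curl check above from Python is to post a traced batch straight to the HTTP API. This is only a sketch under assumptions: a stock 0.9.x server on localhost, a hypothetical database name, and a placeholder batch in line protocol.

```python
import requests

# One failed batch, copied from the write-tracing log (placeholder line protocol).
batch = "cpu_load,host=node-1 value=0.64 1452038400000000000"

resp = requests.post(
    "http://localhost:8086/write",   # 0.9.x write endpoint (assumed local install)
    params={"db": "mydb"},           # hypothetical database name
    data=batch.encode("utf-8"),
)

# InfluxDB 0.9 answers 204 No Content when the batch is accepted; any other
# status comes with an error body worth pasting into the issue.
print(resp.status_code, resp.text)
```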
Yeah, you summarized it correctly.
@caligula1989 In addition to trying out the Python client, can you give us more information on how you are writing data? So far I've been unable to reproduce the issue with this simple script (also using the
Where
Though you may be writing data in another way. Also, can you tell us your:
@rossmcdonald
Each "point" is a dictionary of this format:
My system is Debian Wheezy, 64-bit.
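For orientation, a point for the 0.9-era influxdb-python client looks roughly like the following. This is only a sketch with hypothetical measurement names and connection details, not the reporter's actual code:

```python
from influxdb import InfluxDBClient

# Hypothetical connection details.
client = InfluxDBClient(host="localhost", port=8086, database="mydb")

# write_points() takes a list of dictionaries with "measurement",
# "tags", "time" and "fields" keys.
points = [
    {
        "measurement": "cpu_load",
        "tags": {"host": "node-1", "region": "us-west"},
        "time": "2016-01-10T12:00:00Z",
        "fields": {"value": 0.64},
    },
]
client.write_points(points)
```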
@caligula1989 Thanks for sending. I don't see any issues with your methodology above, so I don't think the way you are writing points is the issue. So far I haven't been able to reproduce it, even though I'm using the exact same InfluxDB Python client version and InfluxDB version (though a different data set). Do you have any estimates on how often the panic occurs? Does it happen every time, or is it more of an intermittent issue? I'd be curious to see if you have the same issue using just the
Also, how are you launching the
@rossmcdonald Unfortunately I'm not able to pinpoint what exactly happens on the node when it starts happening - I was not able to reproduce it in a development environment. I'll keep trying. In any case - if I can get some background as to what the error message actually means on the server side, it might be helpful in order to debug it on my end & find a way to reproduce it consistently.
@caligula1989 Regarding background on the error message, unfortunately I can't say for sure without a larger stack trace that pinpoints what routine is throwing the panic. Typically a:
means that the code is trying to access an invalid index from an array, which isn't super helpful on its own. Are you receiving any extra error messages returned by the Python client? The fact that a restart seems to resolve the issue leads me to believe this may be caused by a lock. Are you executing any queries when you start to see issues? And does it get to a point where all of the writes are failing?
No, I'm not getting anything else, except the fact that the status code is 200. Is there a way I can turn on more detailed logs that will show full stack traces?
Closed by mistake
The system logs all stack traces by default. I've not seen a case where only the panic message appears without the rest of the trace. Perhaps this is a new SystemD interaction issue, @rossmcdonald would know better if that's a reasonable guess. It is possible to log all writes, but that's very expensive on the log directory. Set
So do queries eventually start failing? If so, how? Do they log the start but never finish? Do they not even show up in the logs? Do they time out? Does the system ever OOM? When the writes fail, you mentioned that some points still get through. So over time, more batches fail with a panic, but some of those points do persist? Over time, do more points persist from the write batches showing panics, or do fewer points persist, or is it about the same throughout?
@beckettsean
Enabling write-tracing simply logs the HTTP payload, but doesn't provide any additional information such as where the exception occurred. I have copied some failing points in my previous comments - nothing special about them. The queries definitely finish - in fact, they finish MUCH quicker than normal queries. Depending on payload size, it can be anything between microseconds (see above, 689.542µs, for example) and up to around 5ms - much less than successful queries. The system does not OOM. Some points persist, but I believe this is because, as you can see from my code snippet above, when a write results in an error (and getting status code 200 is an error), the "batch" list is not emptied, so the node will try the points again - until a write succeeds and the list is flushed.
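To make that buffering behaviour concrete, here is a minimal sketch of the pattern described above (assumed, not the reporter's actual code): the batch is only emptied after a write the client treats as successful, so a failed batch is retried on the next cycle and its points may eventually land.

```python
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", database="mydb")  # hypothetical connection
batch = []  # points accumulate here between flush attempts


def flush(batch):
    """Write the current batch; only empty it if the write succeeded."""
    if not batch:
        return
    try:
        client.write_points(batch)
    except Exception:
        # The batch is kept on failure, so the same points are retried on the
        # next cycle -- which is why some of them eventually persist even
        # while individual write attempts fail.
        return
    del batch[:]  # success: empty the buffer
```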
That looks like the recovery handler in the HTTP service: https://github.com/influxdata/influxdb/blob/master/services/httpd/handler.go#L842
We need to add some better logging there.
This is the full stack trace:
@caligula1989 are any of the tags you are writing empty? It looks like the line you are panicking on doesn't have a bounds check when it starts reading the tag value, so I think something like this may cause a panic: |
A panic would happen if you wrote the following:
cpu,host=
There was a missing bounds check when scanning the tag value. Fixes #5262.
@jsternberg in an earlier comment, @caligula1989 shared a write that caused the issue. It doesn't have any empty tag values, but there are some escaped spaces in tag values. Perhaps one of those is not evaluating properly?
@beckettsean It could be a new line in the tag value as well right after the equals. Using the cli w/
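To illustrate the two suspected inputs (an empty tag value from the earlier comment, and a newline right after the equals sign from this one), here is a sketch of what the request body would look like; the endpoint and database name are assumptions, and the POST is left commented out because on a pre-fix server it would trigger the panic.

```python
import requests  # only needed for the commented-out POST below

# Case 1: an empty tag value yields a line ending right after "host=".
empty_tag = "cpu,host= value=0.64"

# Case 2: a tag value that starts with a stray newline splits the body so
# that the first physical line is exactly "cpu,host=".
bad_host = "\nnode-1"
newline_tag = "cpu,host={} value=0.64".format(bad_host)
print(repr(empty_tag))
print(repr(newline_tag))  # 'cpu,host=\nnode-1 value=0.64'

# On servers without the bounds check this tripped the panic; after the fix
# it should come back as a parse error instead.
# requests.post("http://localhost:8086/write", params={"db": "mydb"},
#               data=newline_tag.encode("utf-8"))
```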
@beckettsean unfortunately I can't reproduce the issue with the write used above, which is really strange. The stacktrace also points to the issue I mentioned being the problem, but I don't see any evidence that there's an errant newline in the input.
Wouldn't we see a newline in the payload dump? I've reviewed the code of the HTTP handler, and while I have limited knowledge of Go in general and the InfluxDB codebase in particular - it SEEMS to me that the write of the dump doesn't happen at the same time as the request log, so if there are many requests, they wouldn't be consecutive in the log file, and I wouldn't have pasted the data that was actually being saved. Please correct me if I'm wrong (I could very well be).
@caligula1989 you're right that we would expect a newline in the input, but from my (very) limited knowledge of looking at these, I've seen a few other bug reports where the logging output seemed to have nothing to do with the backtrace. So it's possible the request log written right before the panic has nothing to do with it.
A panic would happen if you wrote the following:
cpu,host=
There was a missing bounds check when scanning the tag value. Fixes #5262.
@jsternberg Interesting. That's also what I would expect when looking at this: Only when the request actually ends is the request line printed. If two requests come in simultaneously, what's to ensure that the printed payload actually belongs to the subsequent request line? Seems like the best course of action for me is to log the request data when I get these errors on the client side. I'll report my findings.
So, I have enabled detailed logging on the nodes, and indeed - a tag contained a newline. I've addressed the issue on my end, and will check whether the issue is eliminated. In any case, better handling is essential here. There's absolutely no reason for such a bug to not have a straightforward error message.
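For anyone working around this on the client side before upgrading, a simple guard over tag values is enough; a sketch (the helper name and replacement policy are illustrative, not part of the client library):

```python
def clean_tags(tags):
    """Strip newlines and stray whitespace from tag values before they are
    serialized into line protocol, in case the client does not do it."""
    return {key: str(value).replace("\n", " ").strip() for key, value in tags.items()}


point = {
    "measurement": "cpu",
    "tags": clean_tags({"host": "node-1\n"}),  # newline removed before writing
    "fields": {"value": 0.64},
}
```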
@caligula1989 #5370 will cause writes like that to return a parse error and not panic.
Indeed it solved the issue.
I'm getting a lot of these messages in my influx log:
The way I work is by writing batches from many nodes in the system. Each batch can contain anything between 1 and 40 points.
There's nothing special about the batches that fail - I turned on tracing to figure out which points are failing, and when I write the same data using the admin UI or cURL, everything works as expected.
It happens on both tsm1 and bz1.