[0.9.5] HTTP 500 errors on insertions (after timeout) #4870

Closed
joshliberty opened this issue Nov 22, 2015 · 6 comments

Comments

@joshliberty

I have a cluster of production servers that log web requests to a single InfluxDB node.
Together they make up to 100 write requests per second, which is not something I'd expect to crash a node.

But after the system has been up for a few hours (sometimes days), the clients start getting 500 errors. The log is littered with messages like this one:

[http] 2015/11/21 12:42:01 10.60.195.206 - root [21/Nov/2015:12:41:46 -0600] POST /write?db=mydb HTTP/1.1 500 32 - python-requests/2.4.3 CPython/2.7.3 Linux/3.2.0-4-amd64 84aa8e82-907f-11e5-b414-000000000000 15.000491819s

When this happens, I'm also getting timeouts on continuous queries and the following error:

[monitor] 2015/11/21 19:04:56 failed to store statistics: timeout

I'm not batching my writes, since each one comes from a separate web request and I have no real way to batch them.

If useful, a typical minute looks like this:

[wal] 2015/11/21 22:17:06 Flush due to idle. Flushing 31 series with 31 points and 1544 bytes from partition 1
[wal] 2015/11/21 22:17:16 Flush due to idle. Flushing 31 series with 31 points and 1544 bytes from partition 1
[wal] 2015/11/21 22:17:26 Flush due to idle. Flushing 31 series with 31 points and 1544 bytes from partition 1
[wal] 2015/11/21 22:17:36 Flush due to idle. Flushing 31 series with 31 points and 1544 bytes from partition 1
[wal] 2015/11/21 22:17:46 Flush due to idle. Flushing 31 series with 31 points and 1544 bytes from partition 1
[wal] 2015/11/21 22:17:56 Flush due to idle. Flushing 31 series with 31 points and 1544 bytes from partition 1
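
Even when every point originates from a separate web request, writes can usually be buffered in each client process and sent together, since the `/write` endpoint accepts several newline-separated line-protocol points in one request body. What follows is only a minimal sketch of that idea, not something taken from this report: the `PointBuffer` class, its `max_points` and `max_age` thresholds, and the injected `send` callable are all hypothetical names chosen for illustration.

```python
import threading
import time


class PointBuffer:
    """Collects line-protocol points and hands them to `send` in batches.

    Hypothetical helper: `send` is any callable that takes one string
    (newline-joined points) and performs the actual HTTP write.
    """

    def __init__(self, send, max_points=500, max_age=1.0, clock=time.monotonic):
        self._send = send
        self._max_points = max_points  # flush once this many points are held
        self._max_age = max_age        # ...or once the oldest point is this old (seconds)
        self._clock = clock
        self._lock = threading.Lock()
        self._points = []
        self._oldest = None

    def add(self, line):
        """Buffer one point; flush if the batch is full or stale."""
        with self._lock:
            if not self._points:
                self._oldest = self._clock()
            self._points.append(line)
            full = len(self._points) >= self._max_points
            stale = self._clock() - self._oldest >= self._max_age
            payload = self._drain() if (full or stale) else None
        # Send outside the lock so a slow write does not block other threads.
        if payload is not None:
            self._send(payload)

    def flush(self):
        """Send whatever is buffered, e.g. from a timer or at shutdown."""
        with self._lock:
            payload = self._drain()
        if payload is not None:
            self._send(payload)

    def _drain(self):
        # Caller must hold the lock.
        if not self._points:
            return None
        payload = "\n".join(self._points)
        self._points = []
        self._oldest = None
        return payload
```

With python-requests, `send` could be something like `lambda body: requests.post(url, params={"db": "mydb"}, data=body)`, where `url` is the node's `/write` address. The age check only runs when a new point arrives, so a periodic call to `flush()` is still needed during quiet periods, and points held in the buffer are lost if the process dies before they are sent.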

@joshliberty
Author

This also happens with the new storage engine, tsm1.

@beckettsean
Contributor

@caligula1989 when the system is non-responsive to writes and queries, does the /ping endpoint still respond?

Does a restart clear the issue?
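
The `/ping` check asked about here can be scripted so that it is ready to run the next time writes start failing. This is a rough sketch rather than an official tool: the `ping` function name and its default timeout are assumptions, and it relies on InfluxDB answering `/ping` with HTTP 204 when it is healthy.

```python
import time
import urllib.request


def ping(base_url, timeout=5.0):
    """Return (ok, seconds) for a GET on <base_url>/ping.

    `ok` is True only for an HTTP 204 reply; connection errors,
    timeouts and HTTP error statuses all count as a failed ping.
    """
    start = time.monotonic()
    try:
        url = base_url.rstrip("/") + "/ping"
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = resp.status == 204
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        ok = False
    return ok, time.monotonic() - start
```

Calling `ping("http://localhost:8086")` (the host and port are placeholders) while writes are returning 500 shows whether the HTTP listener itself is still answering and how long it takes.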

@beckettsean
Contributor

@corylanou @rossmcdonald this sounds pretty familiar, is it another occurrence of the potential deadlock?

@beckettsean changed the title from "HTTP 500 errors on insertions (after timeout)" to "[0.9.5] HTTP 500 errors on insertions (after timeout)" on Nov 23, 2015
@joshliberty
Author

A restart clears the issue for a couple of hours, but then it happens again.
I'm not sure about the ping endpoint; we'll check the next time it happens. If I had to venture a guess, however, it would still respond, since queries such as "show measurements" still work, at least on other DBs.

@rossmcdonald
Contributor

@caligula1989 Thank you for the extra information. Can you also try sending a SIGQUIT signal to the process the next time this happens, and then send us the resulting stack trace? You can send a SIGQUIT to InfluxDB by using the command:

kill -SIGQUIT <PID of InfluxDB>

And then capture the stack trace output using the command:

sed -n '/SIGQUIT/,/InfluxDB starting/p' /var/log/influxdb/influxd.log > stack-trace.out

I can send you a link to an S3 bucket if it's too large to paste here.
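
Where `sed` is not available, the same range extraction can be approximated in Python. The sketch below mirrors the behaviour of the `sed -n '/SIGQUIT/,/InfluxDB starting/p'` command above: it keeps every line from one matching `SIGQUIT` through the next line matching `InfluxDB starting`, inclusive. The `extract_ranges` name and its parameters are made up for this example.

```python
import re


def extract_ranges(lines, start=r"SIGQUIT", end=r"InfluxDB starting"):
    """Return the lines inside each start..end range, both ends included.

    Like the sed range address, the end pattern is only tested on lines
    after the one that opened the range, and an unterminated range runs
    to the end of the input.
    """
    start_re, end_re = re.compile(start), re.compile(end)
    out, inside = [], False
    for line in lines:
        if not inside:
            if start_re.search(line):
                inside = True
                out.append(line)
        else:
            out.append(line)
            if end_re.search(line):
                inside = False
    return out
```

To produce the same `stack-trace.out` file, open the log, pass the file object to `extract_ranges`, and write the returned lines with `writelines`; the log path is whatever your installation uses.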

@beckettsean
Contributor

Believed fixed by #4913, which is available in the just-released version 0.9.5.1.
