InfluxDB goes unresponsive #8500
As an additional note, it seems that when this happens and I restart InfluxDB, one of my measurements ends up corrupt. The measurement appears empty, and any attempt to insert into it returns an error.
It just happened again. Working fine all day, and then just a minute ago it went unresponsive. The only thing I was doing (in addition to what's been going on all day long) was skipping backwards in time a lot in Grafana (the little left-arrow next to the time selector at the top). Here's another SIGQUIT dump: https://gist.githubusercontent.com/phemmer/251eab87914681d30f0e0435c664e9f7/raw/a63f6cdf6c54d89771630764a13d42aa07318f88/log2 Note however that this time I was running InfluxDB on a completely different host and OS. This time it's on Linux, and I was running a7ca7b5. I also had a completely brand-new database, as the crash last night completely corrupted my old one and I couldn't recover it. Edit: And another... https://gist.githubusercontent.com/phemmer/251eab87914681d30f0e0435c664e9f7/raw/1008e1e1206e59501193d381fa99fc10af8972ae/log3 Edit: And just in case that's not enough SIGQUIT, here's some more: https://gist.githubusercontent.com/phemmer/251eab87914681d30f0e0435c664e9f7/raw/ac5fcb3757ff47f794dd8711bef51b1618e63b18/log4
Looks like a deadlock in the …
@phemmer Are you seeing any panics in your logs?
No.
Perhaps. I'll see if I can reproduce it in an isolated environment. I rolled back to v1.2 because the issue started happening every few minutes, and was constantly corrupting my database. Though given how frequently it was occurring, it shouldn't be hard.
@phemmer I would run …
I could drop the entire measurement.
I just upgraded my system to 1.3.0 (downloaded from https://dl.influxdata.com/influxdb/releases/influxdb-1.3.0.x86_64.rpm) and then ran into trouble. Things run fine for ~1 min, then the whole system locks up and dies. I have to shut down the AWS instance and bring it back up to get in. I can't find any logging that clearly says what happened, but I can try to dig further if this is helpful. I then tried upgrading to the nightly 1.4.0~n201706240800 / e014dd0 and got the same behavior. I then put back 1.2.4 and things are OK again. I don't know if this is related... let me know if there is any more info you want (and where to find it!)
The setup has active writes and continuous queries running. It would take some work to test without that... but if useful, let me know. |
No CQs |
@phemmer @ryantxu I just realised I didn't ask: do you know if the profile endpoints are responding? @ryantxu, if you're using 1.3 onwards then the new archive profile endpoint would be available to you.
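(For reference: assuming the default bind address of localhost:8086, the archive endpoint added around 1.3 is typically /debug/pprof/all, which bundles the standard Go runtime profiles (goroutine, heap, block) and an optional CPU profile into a single downloadable archive.)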
I must say that, as an InfluxDB user, I find the handling of this issue concerning. Between this issue and #8533, we now have several people confirming the problem.
@phemmer I can assure you that we're looking into this. We had hoped that #8518 would fix the issue, but it appears not, according to @ryantxu's testing of the …
I understand we're working on it now, and I appreciate that, but how come it wasn't worked on until after 1.3.0 was released? That's my concern. As a user, I should be comfortable using stable releases. I only had 1 database, so I was unable to determine if other databases were affected. I'm working on reproducing this in a lab right now. Hopefully I'll be able to do so. |
@phemmer Are you able to share your log files related to the hang? |
@phemmer We are running about 100 production instances of 1.3.0 and have been running 1.3.0 shadows internally for weeks; we have never reproduced this error. We are giving this issue a very high priority - it is blocking our 1.3 GA build and announcement as it stands. I apologize for the frustration it has caused you. |
@phemmer By any chance, do you have the server logs from either of the above two …
Unfortunately not as the logs have already rotated out (it's a personal box, and I don't permanently archive the logs). |
OK no worries. Was the box under significant write load? Roughly how many writes per second do you think? And typically what would your batch sizes be? |
Average WPS is around 6. The writer flushes every 100ms and whenever its buffer is full. Normal batch size is pretty small, probably less than 20 points. However, once every 60 seconds a batch of about 4k points will come through (or multiple batches, if it can't fit in the write buffer; the buffer size is 512KB and the average point size is ~110B).
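(At ~110 B per point, a full 512 KB buffer works out to roughly 4,700 points, which lines up with the ~4k-point batches mentioned above.)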
Oh, probably a very critical bit of information I forgot about: I set …
@phemmer What kind of disks do you have? SSDs? HDDs? Can you run …? Would you be able to test #8567? It may fix the timeouts. For the higher memory usage, if you can grab a memory profile, that might help identify what may be consuming more memory.
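For what it's worth, a heap profile can usually be pulled straight from the pprof endpoints influxd exposes on its HTTP port. A minimal sketch in Go, assuming the default bind address localhost:8086 and that pprof is enabled (the 1.x default); the output filename is arbitrary:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// Assumes influxd is listening on the default HTTP port (8086) and has
	// its pprof endpoints enabled, which is the default in the 1.x line.
	resp, err := http.Get("http://localhost:8086/debug/pprof/heap")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Save the profile so it can be inspected later with `go tool pprof`.
	out, err := os.Create("heap.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote heap.pprof")
}
```

The resulting file can then be inspected with `go tool pprof heap.pprof` to see which allocations dominate.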
Hybrid disks. I'll see about reproducing this again (I downgraded back to 1.2.4 which does not have the issue), but probably not tonight. Maybe tomorrow. |
Can you elaborate on "Hybrid disks"? What are the disks and how are they set up? Are they RAIDed? Directly attached?
Each disk is a combination of SSD & spinning disk. There are two of them in RAID-1. Yes, directly attached.
@phemmer Could you clarify something for us? Throughout all of this ticket have you been running InfluxDB on the same HW (with the RAID-1 hybrid drives), and could you also confirm:
I just want to get a better understanding of your environment. Cheers. |
Yes, same hardware. |
#8577 should fix the case where the writer is blocked indefinitely on the …
@phemmer All of the pending fixes are on the …
@stuartcarnie and I have found the cause of the initial deadlock on …
I might be able to tomorrow night. I'll be intermittently available to test things, as I'm on vacation from yesterday through next week. But that one should be really easy to test. It generally occurred within a few minutes of starting InfluxDB.
🎆 I'm eager to see that PR, as when I skimmed the code, I couldn't see any obvious explanation for it. Everywhere I could see a lock being taken, all code paths released it. |
It's here: https://github.com/influxdata/influxdb/blob/master/tsdb/index/inmem/meta.go#L333 and https://github.com/influxdata/influxdb/blob/master/tsdb/index/inmem/meta.go#L338. The RLock is not released.
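For illustration, a minimal sketch (not the actual InfluxDB code) of the bug pattern being described: an early return that skips the RUnlock leaves the read lock held, so the next caller that tries to take the write lock blocks forever.

```go
package main

import (
	"fmt"
	"sync"
)

type index struct {
	mu     sync.RWMutex
	series map[string]int
}

// lookupBuggy mirrors the reported pattern: the early-return path exits
// while still holding the read lock, so a later call to mu.Lock() (e.g. a
// write that needs to update the index) blocks forever.
func (i *index) lookupBuggy(name string) (int, bool) {
	i.mu.RLock()
	id, ok := i.series[name]
	if !ok {
		return 0, false // BUG: RLock is never released on this path
	}
	i.mu.RUnlock()
	return id, true
}

// lookupFixed releases the lock on every path by deferring the unlock.
func (i *index) lookupFixed(name string) (int, bool) {
	i.mu.RLock()
	defer i.mu.RUnlock()
	id, ok := i.series[name]
	return id, ok
}

func main() {
	idx := &index{series: map[string]int{"cpu": 1}}
	fmt.Println(idx.lookupBuggy("cpu")) // fine: the found path unlocks
	fmt.Println(idx.lookupFixed("mem")) // fine: defer always unlocks
}
```

Deferring the unlock immediately after taking the lock guarantees it is released on every return path, including error and early-return paths.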
Ahha. And that makes sense why I'm able to trigger the issue so easily. I have InfluxDB hooked up to splunk, so splunk can query InfluxDB directly. Splunk is quite fond of killing searches while they're in progress. I didn't even think of that atypical behavior. |
Seems to have the same issue with influxdb:1.5-alpine. Blocking output: …
This issue is still happening in 1.5.2. |
Bug report
System info:
Version: 0b4528b
OS: FreeBSD
Steps to reproduce:
Expected behavior:
Keeps working.
Actual behavior:
Stops responding. Process still running, but no longer responds to any queries.
Additional info:
Happens at random, but it seems the more activity, the more likely the issue will occur. Haven't identified a pattern yet.
Here's the output of SIGQUIT.
https://gist.githubusercontent.com/phemmer/251eab87914681d30f0e0435c664e9f7/raw/e79a847b896a7533586f6b384bd2aeb4c4c98083/log
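(For context: sending SIGQUIT to a Go process makes the runtime print the stack of every goroutine before exiting, so the linked dump shows where each goroutine was blocked at the time.)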