stalled trying to rotate a log, then stalled again #914

felipeceglia · 2014-09-06T02:03:27Z

Hi folks,

I ran into a strange issue where InfluxDB 0.8.1 started stalling after midnight. The process was still running, but nothing was being logged and nothing happened. When I ran it in a terminal, I saw a message that it could not rotate the log, which I immediately assumed was the root of the problem.

Then I shut it down, fixed the permissions and restarted it again.

Now it starts, but it still stalls.

Here you can see the log and stdout:
https://gist.github.com/mehale/528b97a551ff9da2fb3c

Thanks,
Mehale

pauldix · 2014-09-08T14:41:58Z

Did you up your OS file limits?
http://influxdb.com/docs/v0.8/introduction/installation.html#file-limits
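For anyone checking the same thing, here is a minimal sketch that reads the open-file limit from inside a process. It only uses Python's standard `resource` module (Unix only); the helper names `open_file_limits` and `describe` are made up for illustration. Note that it reports the limits of the process that runs it, so a daemon started by an init script may see different values than your shell does.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) limit on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit):
    """Render a limit value, mapping RLIM_INFINITY to the word 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


if __name__ == "__main__":
    soft, hard = open_file_limits()
    print(f"open files: soft={describe(soft)} hard={describe(hard)}")
```

On Linux, the limits of an already running server can be read from `/proc/<pid>/limits` instead, which avoids guessing what the daemon inherited.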

felipeceglia · 2014-09-08T21:06:00Z

Yes, ulimit was already set to unlimited.

Thanks

jvshahid · 2014-09-16T17:00:53Z

This has nothing to do with log rotation. My guess is that this line is causing the process to stall:

[2014/09/06 01:45:37 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:108) Sending change connection string command (localhost.localdomain:18099,stats.reversebeacon.net:18099) (http://localhost.localdomain:18090,http://stats.reversebeacon.net:18090)

It's trying to change the hostname. If this is a single server, I'd suggest manually setting the hostname in the config file to localhost.localdomain to prevent raft from trying to change the hostname and reach consensus. There's currently a hack in place to achieve this, but many users have reported problems with it.
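The log line above shows the connection string moving from localhost.localdomain to stats.reversebeacon.net, which suggests the name the server sees changed between runs. A rough way to check what the OS reports now, and whether it matches the name you intend to pin in the config file, is sketched below. The helpers `hostname_matches` and `resolves` are hypothetical, and whether InfluxDB derives its hostname the same way as `socket.gethostname()` is an assumption.

```python
import socket


def hostname_matches(configured, actual=None):
    """Return True if the configured name equals the OS hostname, ignoring case.

    `actual` defaults to what the OS reports via socket.gethostname().
    """
    if actual is None:
        actual = socket.gethostname()
    return configured.strip().lower() == actual.strip().lower()


def resolves(name):
    """Return True if `name` resolves to at least one address on this machine."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    pinned = "localhost.localdomain"  # the name suggested in the comment above
    print("OS hostname:", socket.gethostname())
    print("matches pinned name:", hostname_matches(pinned))
    print("pinned name resolves:", resolves(pinned))
```

If the OS hostname and the pinned name differ, that would be consistent with the server trying to issue the change-connection-string command on startup.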

mattheworiordan · 2014-09-25T01:39:45Z

@jvshahid after running on a fresh database for around 1.5 weeks now, all of a sudden the database went offline. I issued a reboot, and now I am stuck in the exact same state I was in with #904. You can see the current logs at https://gist.github.com/mattheworiordan/bfafb3225f86a85a19ac.

Couple of points to bear in mind:

I am running Ubuntu 14.04
I am running InfluxDB 0.8.0 in a Docker container with Ubuntu 14.04
I will leave the database in this state for a few days so that hopefully we can debug this issue. Not being able to access my data again is painful for me, and I expect it would be a complete non-starter for others, so I would love to do what I can to help.
I am not sure my issue is the same as this issue, however I am posting here because my issue was closed. I don't have any logs relating to ListenAndServe.

mattheworiordan · 2014-09-25T10:53:36Z

Hi, as an update: my issue is probably not related to this one, so I'm happy to raise a new ticket if you would like, or append all this info to the closed ticket if it is reopened. After around 1 hour, InfluxDB did in fact come online. Please see https://gist.github.com/mattheworiordan/bfafb3225f86a85a19ac for an updated log showing that at 02:20:27 UTC (just under 1 hour after InfluxDB was started), the web interface comes online and the server recovers from the log with this message: [2014/09/25 02:20:27 UTC] [INFO] (github.com/influxdb/influxdb/coordinator.(*RaftServer).startRaft:397) Recovered from log.

I did a check on the database size, and it comes in at 52GB. I have no idea if this is considered very large or not, but clearly waiting an hour to bring a database back online after a failure seems a bit excessive.

Out of interest, how long should one expect InfluxDB to take to start up on EC2 with SSDs and a 50GB database? Is this normal, or do I have a problem? Any advice or steps I can take to help diagnose the issue would be appreciated.
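To put a number on the startup delay, the elapsed time can be computed from the log timestamps themselves. The sketch below assumes every log line starts with a `[YYYY/MM/DD HH:MM:SS UTC]` stamp, as in the line quoted above, and measures from the first stamped line to the first line containing "Recovered from log". The function names are made up for illustration, and the log path is whatever file you pass on the command line.

```python
import re
import sys
from datetime import datetime

# Matches the leading "[2014/09/25 02:20:27 UTC]" stamp of a log line.
STAMP = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) UTC\]")


def parse_stamp(line):
    """Return the line's leading timestamp as a datetime, or None if absent."""
    match = STAMP.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")


def recovery_duration(lines, marker="Recovered from log"):
    """Return the time from the first stamped line to the first marker line.

    Returns None if no stamped line contains the marker.
    """
    start = None
    for line in lines:
        stamp = parse_stamp(line)
        if stamp is None:
            continue  # skip continuation lines without a timestamp
        if start is None:
            start = stamp
        if marker in line:
            return stamp - start
    return None


if __name__ == "__main__":
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        print("time until recovery:", recovery_duration(log_file))
```

This measures from the first line in the file, so if the log spans several restarts it should be trimmed to the most recent start first.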

jvshahid · 2014-09-26T16:36:58Z

@mattheworiordan there's an issue logged for startup times: #945

jvshahid · 2014-09-26T18:25:58Z

I commented on this thread a while back with no further activity. Closing this issue. Let me know if that comment didn't fix the issue and I'll be happy to reopen the issue.

jvshahid added 0 - Backlog and removed 0 - Backlog labels Sep 16, 2014

jvshahid mentioned this issue Sep 16, 2014

Cannot boot after a crash - data is effectively lost now #904

Closed

jvshahid closed this as completed Sep 26, 2014

tau0 mentioned this issue Apr 24, 2015

The points disappeared. #2419

Closed
