stalled trying to rotate a log, then stalled again #914

Closed
felipeceglia opened this issue Sep 6, 2014 · 7 comments

@felipeceglia

Hi folks,

I ran into a strange issue where influxdb 0.8.1 started stalling after midnight. The process was still running, but nothing was being logged and nothing happened. When I ran it in a terminal, I saw a message that it could not rotate the log, which I immediately assumed was the root of the problem.

I then shut it down, fixed the permissions, and restarted it.

Now it starts, but it still stalls.

Here you can see log and stdout:
https://gist.github.com/mehale/528b97a551ff9da2fb3c

Thanks,
Mehale

@pauldix
Member

pauldix commented Sep 8, 2014

Did you up your OS file limits?
http://influxdb.com/docs/v0.8/introduction/installation.html#file-limits
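
For anyone checking the same thing, here is a minimal sketch (not part of InfluxDB) that reports the open-file limit as seen by a process. The function name `describe_nofile_limit` is made up for illustration, and the snippet assumes a Unix-like system where Python's `resource` module is available.

```python
import resource


def describe_nofile_limit():
    """Return the soft and hard open-file limits of the current process.

    Hypothetical helper for illustration; values are strings so that an
    unlimited setting can be reported as "unlimited".
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    def fmt(value):
        # RLIM_INFINITY is the sentinel the OS uses for "no limit".
        return "unlimited" if value == resource.RLIM_INFINITY else str(value)

    return {"soft": fmt(soft), "hard": fmt(hard)}


if __name__ == "__main__":
    limits = describe_nofile_limit()
    print("open files (soft):", limits["soft"])
    print("open files (hard):", limits["hard"])
```

Note that this reports the limits of the process running the script. A daemon started by an init script can have different limits from an interactive shell; on Linux, `/proc/<pid>/limits` shows the values for an already running process.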

@felipeceglia
Author

Yes, ulimit was already set to unlimited.

Thanks

@jvshahid
Contributor

This has nothing to do with log rotation. My guess is that this line is causing the process to stall:

[2014/09/06 01:45:37 UTC] [INFO] (github.com/influxdb/influxdb/server.(*Server).ListenAndServe:108) Sending change connection string command (localhost.localdomain:18099,stats.reversebeacon.net:18099) (http://localhost.localdomain:18090,http://stats.reversebeacon.net:18090)

It's trying to change the hostname. If this is a single server, I'd suggest manually setting the hostname in the config file to localhost.localdomain to prevent raft from trying to change the hostname and reach consensus. There's currently a hack in place to achieve this, but many users have reported problems with it.
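
To spot this condition in a log without reading it by eye, something like the following sketch could be used. It is only an illustration: the helper names `host_of` and `find_connection_change` are made up, and the pattern is inferred from the single log line quoted above, so it assumes the message always has the form `(a,b) (c,d)`.

```python
import re

# Pattern inferred from the one log line quoted in this thread (an assumption,
# not a documented log format).
CHANGE_RE = re.compile(
    r"Sending change connection string command "
    r"\(([^,()]+),([^,()]+)\) \(([^,()]+),([^,()]+)\)"
)


def host_of(endpoint):
    """Strip an optional scheme and the trailing port from an endpoint."""
    without_scheme = endpoint.split("://", 1)[-1]
    return without_scheme.rsplit(":", 1)[0]


def find_connection_change(line):
    """Return the endpoint pairs from a matching log line, or None.

    Hypothetical helper: 'hostname_differs' is True when the two endpoints
    in either pair name different hosts.
    """
    match = CHANGE_RE.search(line)
    if match is None:
        return None
    a, b, c, d = match.groups()
    pairs = [(a, b), (c, d)]
    differs = any(host_of(x) != host_of(y) for x, y in pairs)
    return {"pairs": pairs, "hostname_differs": differs}


if __name__ == "__main__":
    sample = (
        "[2014/09/06 01:45:37 UTC] [INFO] "
        "(github.com/influxdb/influxdb/server.(*Server).ListenAndServe:108) "
        "Sending change connection string command "
        "(localhost.localdomain:18099,stats.reversebeacon.net:18099) "
        "(http://localhost.localdomain:18090,http://stats.reversebeacon.net:18090)"
    )
    print(find_connection_change(sample))
```

If `hostname_differs` is true for a line in the startup log, the server is in the hostname-change situation described above.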

@mattheworiordan

@jvshahid after running on a fresh database for around 1.5 weeks, the database suddenly went offline. I issued a reboot, and now I am stuck in the exact same state I was in with #904. You can see the current logs at https://gist.github.com/mattheworiordan/bfafb3225f86a85a19ac.

A couple of points to bear in mind:

  • I am running Ubuntu 14.04
  • I am running InfluxDB 0.8.0 in a Docker container with Ubuntu 14.04
  • I will leave the database in this state for a few days so that hopefully we can debug this issue. Not being able to access my data again is painful for me, but I expect a complete non-starter for others, so I would love to do what I can to help.
  • I am not sure my issue is the same as this issue, however I am posting here because my issue was closed. I don't have any logs relating to ListenAndServe.

@mattheworiordan

Hi, as an update, my issue is probably not related to this one, so I'm happy to raise a new ticket if you would like, or to append all this info to the closed ticket if it is reopened. After around 1 hour, InfluxDB did in fact come online. Please see https://gist.github.com/mattheworiordan/bfafb3225f86a85a19ac for an updated log showing that at 02:20:27 UTC (just under 1 hour after InfluxDB was started), the web interface comes online and recovers from the log with this log message: [2014/09/25 02:20:27 UTC] [INFO] (github.com/influxdb/influxdb/coordinator.(*RaftServer).startRaft:397) Recovered from log.

I checked the database size, and it comes in at 52GB. I have no idea whether this is considered very large, but waiting an hour to bring a database back online after a failure seems excessive.

Out of interest, how long should one expect InfluxDB to take to start up on EC2 with SSDs and a 50GB database? Is this normal, or do I have a problem? Any advice or steps I can take to help diagnose the issue would be appreciated.

@jvshahid
Contributor

@mattheworiordan there's an issue logged for startup times: #945

@jvshahid
Contributor

I commented on this thread a while back with no further activity. Closing this issue. Let me know if that comment didn't fix the issue and I'll be happy to reopen the issue.
