-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cronicle scheduler stops working after 2^31 milliseconds (~25 days) of continuous running #108
Comments
Cronicle v0.8.26 has a new HTTP JSON REST API called
Example response (with {
"code": 0,
"version": "0.8.26",
"node": "v10.5.0",
"hostname": "dev01.xxx.net",
"ip": "100.xxx.1",
"pid": 2740,
"now": 1535488477.914,
"uptime": 176,
"last_tick": 1535488477.361,
"tick_age": 0.5529999732971191,
"cpu": {
"user": 1684823,
"system": 239750
},
"mem": {
"rss": 49987584,
"heapTotal": 27570176,
"heapUsed": 23715944,
"external": 1119951
}
} I am planning on setting up a 1-minute Crontab to monitor this API and its output via curl, and pipe the response to a log file. If the "tick age" ever exceeds 60 seconds, I can take action, such as restarting the service. This is more of a fact finding mission than a solution, as it is sometimes hard to tell if a server is in a bad state, i.e. slave servers. Example crontab:
|
After examining all my server logs, I can conclude that on all 5 servers where this issue was found, the total run time (daemon lifetime / uptime) from initial startup to going berserk, was exactly 24 days, 20.5 hours. These servers were all started on different days and times of day, too, so this really is looking like a preset length to self destruction. |
Okay, this can't be a coincidence. I think I may actually be onto something here. All of my Cronicle servers didn't glitch out at the same time, but if you consider that they were all started at different days & times, the actual total runtime (total duration of daemon lifespan), is damn near identical. It's always 24 days + 20.5 hours. Using one of my server's logs, I was able to get this estimate closer to exact (+0-1 minutes):
Now, that total number of seconds suddenly caught my eye. Why does that number look so familiar??? 2,147,520.... 2,147,520..... Then it hit me. That looks like a reduced version of
However, that's 2.1 BILLION, not million. But what if we're talking about milliseconds here, and not seconds? The answer is, this is very close, and keep in mind I don't know EXACTLY when my servers bugged out, and stopped firing timers. I am estimating based on when things started going wonky in the logs, presumably some number of seconds after the actual bug out time:
That's within 36,352 milliseconds, or only 36 seconds. And my estimates are based on log events that would occur about a minute after. OMG I think it adds up. Um..... okay, this CANNOT be a coincidence. I think Cronicle is dying after |
@jhuckaby 👏 🙏 for all the fact findings. I |
Hi @jhuckaby , Probably only node js 10.x.x is affected , the same version that you are using :) |
@pioardi Oh wow, look at that. Thank you for tracking that down! I searched the Node.js bug tracker several times, but never found that bug. I am 99.9% certain that is the same issue as this. |
@jhuckaby So..good news :) |
Looks like the bug was introduced in this PR, and released with Node v10.0. nodejs/node#18486 Thankfully the bug is fixed, landed, and will be included in the next release (presumably Node v10.10). PR for fix: nodejs/node#22214 I'll have to put in a special check for Node v10.0 - v10.9, and emit a warning to the console on startup or something. |
okay, the latest version of |
@tckb Yup, the bug is confirmed in Node v10.9, alas. It looks like all released versions of Node v10 are affected, from v10.0.0 all the way to v10.9.0. The bug was introduced in nodejs/node#18486 which was released as part of Node v10.0.0. |
What you can do in the interim, while we wait for the next Node.js release, is schedule a weekly restart of the service. Pick a day/time when you know you don't have any jobs running, for example Sunday at 3:00 AM:
I suppose you could do this inside Cronicle as a scheduled event using the Shell Plugin, but it would have to be detached, so the restart itself didn't abort the job. Might be simpler to just use a crontab ;) |
@tckb I apologize, but it looks like I was given some misinformation. The bug is indeed fixed in Node.js v10.9.0. I verified this myself using libfaketime and simulating 25 days of runtime in a few minutes. I was able to reproduce the bug in 10.5, but not on 10.9. Whew! I have released Cronicle v0.8.27 which emits a warning message on startup if you are running on Node.js v10.0.0 to v10.8.0. |
Ah! that's a relief thanks @jhuckaby |
I am seeing a very disturbing situation brewing. It seems that after 25 or so days of continuous running, Cronicle will randomly "stop" running scheduled jobs. This actually affects more than the scheduler, as it seems that all the Node.js timers and intervals are all magically stopping, while the daemon process itself remains up. No errors, no uncaught exceptions, just a silent timer failure. Here are the facts at a glance:
More details about the bug
I am seeing strong evidence that all the Node.js timers are failing. Here is what I've observed so far:
/opt/cronicle/queue
directory every minute, looking for detached job update JSON files, was clearly not working on several servers. This system works off the master pixl-server 1s interval tick timer.I also see evidence that this isn't limited to the pixl-server 1s interval tick timer. On one of my servers, I see this logged:
This is right when the master server lost connectivity with mgr01 and mgr02. Now, the Cronicle code automatically tries to reconnect to lost slaves right after this happens. It reconnects automatically and retries indefinitely. But it never happened. It never even tried once.
Check out the code: https://github.com/jhuckaby/Cronicle/blob/master/lib/comm.js#L231-L237
That bit of code never ran on the affected server (i.e. it never logged "Reconnecting to slave:"), and this runs under its own separate setTimeout() call (i.e. not the pixl-server 1s interval timer). Meaning, this is evidence that the entire Node.js timer system got borked.
Furthermore, the other question is WHY did it lose connectivity with mgr01/mgr02 and at the same time as everything else? Well, I believe the answer is again timers:
As you can see here, devrelease.stage (Cronicle master) says that mgr01 and mgr02 disconnected unexpectedly. But look at what mgr01 and mgr02 have to say about this:
They say that the MASTER disconnected from THEM. So both are pointing the finger at the other.
Here is what I believe actually happened. Socket.io fired a "disconnect" event for both servers on the master, but it never actually lost connectivity. It disconnected because of a loss of heartbeat. Socket.io has its own heartbeat "echo packet" system that keeps TCP connections alive, and it runs on its own setInterval. This also failed at the same time as everything else, resulting in the sockets hitting idle timeout and dying.
I still don't know WHY yet, but it looks like the entire Node.js timer subsystem failed. One reason could be that the event loop was totally CPU-bound, but I see no other evidence of that. The CPUs were asleep on all affected servers. If the main process was CPU-pinned (which would indeed cause timers to not fire), then we would have seen this in our monitoring tools. But no, the load averages were all very low all morning long, and memory usage is normal.
Investigation ongoing, but I don't have much else to go on at the moment.
The text was updated successfully, but these errors were encountered: