Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cronicle scheduler stops working after 2^31 milliseconds (~25 days) of continuous running #108

Closed
jhuckaby opened this issue Aug 28, 2018 · 13 comments
Assignees
Labels

Comments

@jhuckaby
Copy link
Owner

I am seeing a very disturbing situation brewing. It seems that after 25 or so days of continuous running, Cronicle will randomly "stop" running scheduled jobs. This actually affects more than the scheduler, as it seems that all the Node.js timers and intervals are all magically stopping, while the daemon process itself remains up. No errors, no uncaught exceptions, just a silent timer failure. Here are the facts at a glance:

  • There is no known fix or workaround, other than to restart Cronicle.
  • All reports are for servers that have been running Cronicle continuously for 25+ days.
  • All reports are for Node v.10 (v10.5.0 and v10.9.0).
  • All reports are for CentOS (v6.9 and v7.2).
  • Both AWS (S3) and local filesystem storage backends are affected.
  • This affects both the Cronicle master server, and slave servers.
    • Slave servers can "survive" this issue for the most part, because they don't rely on timers, except when detached jobs are involved. Detached jobs never "complete" on affected slave servers.
  • CPU and memory are both normal on an affected server.
    • Meaning, CPU / load average are very low, almost zero, and memory usage is normal, around 50 - 100 MB. This does not appear to be a memory leak or event loop starvation issue.

More details about the bug

I am seeing strong evidence that all the Node.js timers are failing. Here is what I've observed so far:

  1. The detached job queue system in Cronicle, which polls the /opt/cronicle/queue directory every minute, looking for detached job update JSON files, was clearly not working on several servers. This system works off the master pixl-server 1s interval tick timer.
  2. Cronicle's daily log rotation system, which runs on every server, also uses the pixl-server 1s tick timer, specifically the "day" event which is emitted at midnight, was not working on my affected servers. Both of them had old logs from August 21 (meaning the "day" tick did not fire at midnight on Aug 22). This is also more evidence that points at (at least) the master pixl-server 1s tick timer.
  3. When I logged into Cronicle on one of my affected servers, I noticed that the clock in the browser UI was not ticking up. The animated clock logo was not moving, and the time in the top-right was not changing. These things are all triggered by server-sent events, that work off the master pixl-server 1s tick timer.

I also see evidence that this isn't limited to the pixl-server 1s interval tick timer. On one of my servers, I see this logged:

[1534858789.948][2018-08-21 06:39:49][devrelease.xxx][Cronicle][debug][5][Marking slave as disabled: mgr01.xxx][]
[1534858789.955][2018-08-21 06:39:49][devrelease.xxx][Cronicle][debug][5][Marking slave as disabled: mgr02.xxx][]

This is right when the master server lost connectivity with mgr01 and mgr02. Now, the Cronicle code automatically tries to reconnect to lost slaves right after this happens. It reconnects automatically and retries indefinitely. But it never happened. It never even tried once.

Check out the code: https://github.com/jhuckaby/Cronicle/blob/master/lib/comm.js#L231-L237

That bit of code never ran on the affected server (i.e. it never logged "Reconnecting to slave:"), and this runs under its own separate setTimeout() call (i.e. not the pixl-server 1s interval timer). Meaning, this is evidence that the entire Node.js timer system got borked.

Furthermore, the other question is WHY did it lose connectivity with mgr01/mgr02 and at the same time as everything else? Well, I believe the answer is again timers:

[1534858789.948][2018-08-21 06:39:49][devrelease.xxx][Error][error][server][Slave disconnected unexpectedly: mgr01.xxx][]
[1534858789.954][2018-08-21 06:39:49][devrelease.xxx][Error][error][server][Slave disconnected unexpectedly: mgr02.xxx][]

As you can see here, devrelease.stage (Cronicle master) says that mgr01 and mgr02 disconnected unexpectedly. But look at what mgr01 and mgr02 have to say about this:

[1534858789.946][2018-08-21 06:39:49][mgr02.xxx][Cronicle][debug][5][Socket.io client disconnected: _IXAzsPZvXRS6aEpAAAA (IP: ::ffff:100.xx)][]
[1534858789.946][2018-08-21 06:39:49][mgr01.xxx][Cronicle][debug][5][Socket.io client disconnected: oyMlpxiFAbZp4uSzAAAA (IP: ::ffff:100.xx)][]

They say that the MASTER disconnected from THEM. So both are pointing the finger at the other.

Here is what I believe actually happened. Socket.io fired a "disconnect" event for both servers on the master, but it never actually lost connectivity. It disconnected because of a loss of heartbeat. Socket.io has its own heartbeat "echo packet" system that keeps TCP connections alive, and it runs on its own setInterval. This also failed at the same time as everything else, resulting in the sockets hitting idle timeout and dying.

I still don't know WHY yet, but it looks like the entire Node.js timer subsystem failed. One reason could be that the event loop was totally CPU-bound, but I see no other evidence of that. The CPUs were asleep on all affected servers. If the main process was CPU-pinned (which would indeed cause timers to not fire), then we would have seen this in our monitoring tools. But no, the load averages were all very low all morning long, and memory usage is normal.

Investigation ongoing, but I don't have much else to go on at the moment.

@jhuckaby
Copy link
Owner Author

jhuckaby commented Aug 28, 2018

Cronicle v0.8.26 has a new HTTP JSON REST API called /api/app/status which returns some internal status information, including the current "tick age" (the number of seconds since the last tick). This age should NEVER be longer than 1.0 (1 second). If it is over 60 seconds, Cronicle will log a level 1 debug message as part of this API. You can fetch this API by hitting a URL locally such as:

curl -s http://127.0.0.1:3012/api/app/status

Example response (with ?pretty=1):

{
	"code": 0,
	"version": "0.8.26",
	"node": "v10.5.0",
	"hostname": "dev01.xxx.net",
	"ip": "100.xxx.1",
	"pid": 2740,
	"now": 1535488477.914,
	"uptime": 176,
	"last_tick": 1535488477.361,
	"tick_age": 0.5529999732971191,
	"cpu": {
		"user": 1684823,
		"system": 239750
	},
	"mem": {
		"rss": 49987584,
		"heapTotal": 27570176,
		"heapUsed": 23715944,
		"external": 1119951
	}
}

I am planning on setting up a 1-minute Crontab to monitor this API and its output via curl, and pipe the response to a log file. If the "tick age" ever exceeds 60 seconds, I can take action, such as restarting the service. This is more of a fact finding mission than a solution, as it is sometimes hard to tell if a server is in a bad state, i.e. slave servers.

Example crontab:

* * * * * root /usr/bin/curl -s -m 5 "http://127.0.0.1:3012/api/app/status" >>/opt/cronicle/logs/monitor.log 2>&1

@jhuckaby jhuckaby self-assigned this Aug 28, 2018
@jhuckaby jhuckaby added the bug label Aug 28, 2018
@jhuckaby
Copy link
Owner Author

After examining all my server logs, I can conclude that on all 5 servers where this issue was found, the total run time (daemon lifetime / uptime) from initial startup to going berserk, was exactly 24 days, 20.5 hours.

These servers were all started on different days and times of day, too, so this really is looking like a preset length to self destruction.

@jhuckaby
Copy link
Owner Author

jhuckaby commented Aug 30, 2018

Okay, this can't be a coincidence. I think I may actually be onto something here.

All of my Cronicle servers didn't glitch out at the same time, but if you consider that they were all started at different days & times, the actual total runtime (total duration of daemon lifespan), is damn near identical. It's always 24 days + 20.5 hours.

Using one of my server's logs, I was able to get this estimate closer to exact (+0-1 minutes):

24 days, 20 hours, 32 minutes
or... 2,147,520 seconds

Now, that total number of seconds suddenly caught my eye. Why does that number look so familiar??? 2,147,520.... 2,147,520.....

Then it hit me. That looks like a reduced version of 2 ^ 31 (i.e. 2 raised to the 31st power) -- in other words, the 32-bit signed integer maximum value:

2,147,483,647

However, that's 2.1 BILLION, not million. But what if we're talking about milliseconds here, and not seconds?

The answer is, this is very close, and keep in mind I don't know EXACTLY when my servers bugged out, and stopped firing timers. I am estimating based on when things started going wonky in the logs, presumably some number of seconds after the actual bug out time:

My server's Cronicle lifetime in milliseconds: 
2,147,520,000

32-bit signed integer maximum:
2,147,483,647

That's within 36,352 milliseconds, or only 36 seconds. And my estimates are based on log events that would occur about a minute after. OMG I think it adds up.

Um..... okay, this CANNOT be a coincidence.

I think Cronicle is dying after 2 ^ 31 (2,147,483,647) milliseconds of total runtime. What could possibly cause such a weird thing?

@jhuckaby jhuckaby changed the title Cronicle scheduler randomly stops working after 25+ days of continuous running Cronicle scheduler stops working after 2^31 milliseconds (~25 days) of continuous running Aug 30, 2018
@tckb
Copy link

tckb commented Aug 31, 2018

@jhuckaby 👏 🙏 for all the fact findings. I am was considering cronicle for running some critical scheduling tasks. I will hold it off until some time. Any fixes for this yet? though am not a js guy to find and fix it, I will take a shot.

@pioardi
Copy link

pioardi commented Aug 31, 2018

Hi @jhuckaby ,
I find this nodejs/node#22149 and this issue seems to be a node js issue .

Probably only node js 10.x.x is affected , the same version that you are using :)

@jhuckaby
Copy link
Owner Author

@pioardi Oh wow, look at that. Thank you for tracking that down! I searched the Node.js bug tracker several times, but never found that bug. I am 99.9% certain that is the same issue as this.

@pioardi
Copy link

pioardi commented Aug 31, 2018

@jhuckaby So..good news :)

@jhuckaby
Copy link
Owner Author

Looks like the bug was introduced in this PR, and released with Node v10.0. nodejs/node#18486

Thankfully the bug is fixed, landed, and will be included in the next release (presumably Node v10.10). PR for fix: nodejs/node#22214

I'll have to put in a special check for Node v10.0 - v10.9, and emit a warning to the console on startup or something.

@tckb
Copy link

tckb commented Aug 31, 2018

okay, the latest version of node still has the issue right? 😐

@jhuckaby
Copy link
Owner Author

@tckb Yup, the bug is confirmed in Node v10.9, alas. It looks like all released versions of Node v10 are affected, from v10.0.0 all the way to v10.9.0. The bug was introduced in nodejs/node#18486 which was released as part of Node v10.0.0.

@jhuckaby
Copy link
Owner Author

What you can do in the interim, while we wait for the next Node.js release, is schedule a weekly restart of the service. Pick a day/time when you know you don't have any jobs running, for example Sunday at 3:00 AM:

echo "0 3 * * 0 /opt/cronicle/bin/control.sh restart" > /etc/cron.d/cronicle-weekly-restart.cron

I suppose you could do this inside Cronicle as a scheduled event using the Shell Plugin, but it would have to be detached, so the restart itself didn't abort the job. Might be simpler to just use a crontab ;)

@jhuckaby
Copy link
Owner Author

jhuckaby commented Sep 2, 2018

@tckb I apologize, but it looks like I was given some misinformation. The bug is indeed fixed in Node.js v10.9.0. I verified this myself using libfaketime and simulating 25 days of runtime in a few minutes. I was able to reproduce the bug in 10.5, but not on 10.9. Whew!

I have released Cronicle v0.8.27 which emits a warning message on startup if you are running on Node.js v10.0.0 to v10.8.0.

@jhuckaby jhuckaby closed this as completed Sep 2, 2018
@tckb
Copy link

tckb commented Sep 3, 2018

Ah! that's a relief thanks @jhuckaby

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants