-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
setInterval callback function unexpected halt #22149
Comments
This is interesting. I have an app with some code looking like this: function doSomething() {
promiseReturningFunction().then((data) => {
//do some
}).catch((error) => {
//log error
}).then(() => {
setTimeout(doSomething, 5000);
});
} This is just meant to run these code blocks forever. Then suddenly a few installations stopped working recently. The app still responds to api requests and new timers that are set during for example an api call is executed. However, the doSomething functions are no longer being exectuted. It is as if the event queue at some point just was emptied or something. My installations that are having this problem are running node 10.4.1 or 10.6.0 depending on when they were upgraded. A bug in the core nodejs is not my first thought but I cant find a logical explanation for my app to suddenly drop 5-6 timer constructs like this that has worked for a few years before.... |
@nodejs/timers |
For what it's worth you are not supposed to rely on timers staying running for a month. That said, I haven't run into this sort of thing before - even on Node servers with an uptime of more than a month.
It is very likely that Also note that is not a great way to write Node.js code (and this is core's fault IMO). Are you using async_hooks and some form of instrumentation? Can you check the logs for |
@benjamingr yea that would of course be the obvious conclusion and what I initially thought until I realized it wall all timers at once. I just can't figure out how that problem suddenly occurred on 6 different places in the code. The "background tasks" are not depending on each other and hasn't been changed for a loong time. Had it been one timer that stopped that's one thing but all at the same time? This was also on 7 different servers running the same app in different locations and with a similar up-time. Like 3-4 weeks or so until the issue occurred. Perhaps there are a better construct to keep some "background tasks" executing but I thought this was simple enough to do its job. Not familiar with the async_hooks and logs are not saved (which is a flaw...) :( |
@tkarls to be clear, I am not accusing you of writing code problematically - I am saying Node should guide users to safer ways to write code and make them "obvious". If there is a way to get the
I'm getting "off" timer land and into "promise debugging land" here - just saying this is entirely on our radar and something we're thinking and talking about improving. Also it's possible your problem was entirely different and we do have a bug in
Typically, not on the server and using something that does scheduling - either what your cloud provides, Again, I strongly believe this is an issue with Node's user experience (not your or most people's code) and I feel strongly that Node (well, us) should fix it. |
As tkarls says, I have a similar situation; @benjamingr thank you for your advice on coding, but I can't use cron because I use doSomething to trigger about 400-500 milliseconds. |
@newbreach in a similar situation - what sort of code do you have in the setTimeout? |
@benjamingr no problem, I didn't feel particularly 'accused' because it it was an easy fix I could do to my code I want to know about it so I can fix the problem :D Thanks for your advises. Using external schedule trigging had not even occurred to me! Today the node process is kept alive using a systemd service that limits memory and tasks and restarts the node process if needed while ensuring always up. I'm sure there are more advanced monitors. But back to the issue at hand. I hear what you are saying about unhandled rejections but I do have that catch before the then that sets the timer. So any error or exception should be caught I think (I have double checked that all promises are returned and not just created locally too). One of my "workers" query an api over the network. Another checks a local database and talks to a local service. A third one is querying the NTP status on the server etc. Most of the functions are rather small and easy to overview. So I find it improbable (but I suppose not impossible) that all these functions have a bug in them that manifest at the same time. This happened on several servers during the summer (running 10.4.1-10.6.0) but not on any of our 60+ servers running <= 9.8.0. Granted our sw is also newer on the newer installations but a few of the "workers" hasn't been changed in well over a year. The only thing that doesn't indicate a bug/change in setTimeout/setInterval to me is probably that there isn't 500 people reporting it already due to node being so popular. Which I can agree is a pretty strong indication too... I do appreciate you insights and guesses though. That is all I'm doing right now too :) |
My code logic is probably like this
The part of the code is as follows: setInterval(function () {
let cbs = [];
for (let key of callbackQueue.keys()) {
let item = callbackQueue.get(key)
if (item && Date.now() - item.time >= 40000) {
callbackQueue.delete(key);
cbs.push(item);
}
}
for (let i in cbs) {
cbs[i].cb.call(cbs[i].thisArg, new util.BusinessError('network fail', 4001));
}
}, 200); thanks! Edit by @Fishrock123 - code formatting |
Click the mistake, I'm sorry |
I'm trying to reproduce this (with Edit: there appears to be a way to test it in virtualbox if anyone is interested in taking a stab sooner: https://superuser.com/a/1022317/248212 |
This just became a problem recently? Like, in Node 10 ... or something else? |
I'm not seeing anything in the 10.x code that jumps out at me but will continue investigating and try to reproduce. |
@Fishrock123 Well for me I have not experienced anything like this on node <= 9.8 Today I will try and attach a debugger to one of my failing servers. I said before that new timers set by setTimeout() in api calls were executed correctly. This may have been a premature conclusion. Because it looks like this is not the case in two other API calls. I don't seem to get an error thrown though since the API call ends with success. Just the scheduled task does not run. I will add any info I get from the debugging here |
Code is running with all the debug stuff for 2 days now here - not sure how to debug it otherwise - if anyone has any experience "speeding time" on an emulated system please chime in. |
Ok, so I have successfully been able to connect the debugger over a ssh tunnel. I have 3 timers I can access via the api. 2 is not working in the sense that I can step the code and see that setTimeout() is indeed called with sensible values and does not throw. But the callback is never called. One callback is working as expected which is a bit strange. I called process._getActiveHandles(); before the setTimeout of the working timer and got 21 handles. After the call to setTimeout() I got 22 handles and the last handle in the list is my new timer which I can see from the msecs value. Doing the same thing on a setTimeout that is not working results is: I do have several timers in the list already with timer values that I do recognize and the only thing I can see is really different from the timer that is still working and the one that is not is the timeout value. The working timer uses a unique timeout value only used for that timer. The timers that isn't working is using the same timeout values as several of my "background timers" is using. Not sure if this is a coincidence or not... but I can also see in the debugger that one and exactly one background timer is still working. That one also has a unique timeout value that no other times has in the code. All other timers uses "common values" such as 3000, 5000, 10000 and 30000 and more than one timer uses each value. sounds pretty strange to me... |
That sounds like a rescheduling/reuse/re-enroll bug. (tl;dr some path causes pooled timers to not execute or schedule) @apapirovski You're probably most familiar with this code now, any ideas? |
Nothing off the top of my head. Will investigate. |
@tkarls can you elaborate on "background timers"? Are these using |
@apapirovski Like this: function doSomething() {
promiseReturningFunction().then((data) => {
//do some
}).catch((error) => {
//log error
}).then(() => {
setTimeout(doSomething, 5000);
});
} I call the function once when the app is lifted and then they are supposed to keep running without being poked from the outside. Hence "backgrund task" but I never use anything else than setTimeout to do the scheduling. The api timer for instance is set in response to an api call. For instance I have a call that enables the firewall that sets an timer to auto disable it unless the user confirms the action within 30 seconds. This timer is not working. The relevant code is as follows; enableFirewall: (options) => {
return getFwStatus().then((before) => {
return execp('sudo ufw --force enable').then(()=>{
setDisableTimer(disableTime);
}).then(()=>{
return getFwStatus().then((after) => {
options.before = before;
options.after = after;
return addAuditLogEntry(options);
});
});
}).then(()=>{
return disableTime;
}).catch( (error) => {
options.name = 'firewall-error';
return disableFirewall(options).catch(() => {
return execp('sudo ufw disable')
}).then(() => {
return Promise.reject(error);
});
});
}
function setDisableTimer(timeout) {
cancelDisableTimer();
disableFirewallTimer = setTimeout(() => {
disableFirewall({ user: '--disable timer--'});
}, timeout);
}
function cancelDisableTimer() {
if (disableFirewallTimer != null) {
clearTimeout(disableFirewallTimer);
disableFirewallTimer = null;
}
}
function disableFirewall(options) {
var options = options || { user: '--unknown user--' }
return getFwStatus().then((before) => {
return execp('sudo ufw disable').then(() => {
return getFwStatus().then((after) => {
options.before = before;
options.after = after;
return addAuditLogEntry(options);
});
});
});
} The api call will call enableFirewall and I do see when stepping the code that it reaches and executes without error disableFirewallTimer = setTimeout(() => {
disableFirewall({ user: '--disable timer--'});
}, timeout); and that the timeout parameter indeed is set to 30000 as I expected it to be. But a breakpoint in the anonymous function never get hit (and the firewall doesn't get called either). One of the "background tasks" that are not working any more function keepNtpStatusFresh() {
getNtpStatusOutputInternal().then((data) => {
lastNtpCheck.data = data;
lastNtpCheck.when = new Date();
}).catch((error) => {
lastNtpCheck.data = null;
sails.log.error(error);
}).then(() => {
setTimeout(keepNtpStatusFresh, 5000);
});
}
function getNtpStatusOutputInternal() {
return execp(sails.config.paneda.ntpStatus, { timeout: 30000 }).then((result) => {
var re = new RegExp(/^(.)([^\s=]+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)/gm);
var match = re.exec(result.stdout);
var servers = [];
while (match != null) {
servers.push({
type: match[1],
remote: match[2],
refid: match[3],
st: Number(match[4]),
t: match[5],
when: Number(match[6]),
poll: Number(match[7]),
reach: Number(match[8]),
delay: Number(match[9]),
offset: Number(match[10]),
jitter: Number(match[11]),
});
match = re.exec(result.stdout);
}
return { output: result.stdout, serverList: servers };
});
} |
My test based on the above is still running for several days now by the way. |
Yes, also since our servers were restarted they are working fine for little over 10 days now... I have tried to check when the servers were upgraded and when the error showed up on each server. It looks like they all failed after about 25 days of run-time. If this was a totally random problem or some sort of race condition I would have expected one server to fail after a few days, one to fail after a few weeks and perhaps 1 should still be running. 100% of the upgraded servers did fail. They are also running in completely different environments. My brain can't help noticing that 25 days is almost exactly the number of milliseconds that can fit into a signed 32 bit integer (2 160 000 000 vs 2 147 483 647 ). Not sure if this is a pure coincidence (especially since Javascript doesn't have signed 32 bit integers AFAIK) but I'm mentioning it anyway :) |
@tkarls as far as I know livuv uses 64bit ints for timers from what I can tell and Node just calls into libuv for checking. Maybe @cjihrig or @saghul will know? (since libuv timers) (We do cap the timeout itself to 32bit integers but not the number of repeats). I wonder if this correlates with a libuv update - I'd try and bisect but for a bug that takes ~25 days to reproduce I don't think it'll help. |
@apapirovski @benjamingr @Fishrock123 I was just able to reproduce this with a smaller example 😁 I used libfaketime to speed things up by a factor of 1000 on a linux server and executed a test program containing functions scheduling themselves in the callback and printing the current date. The simulated start date is today (2018-08-09) and the last print is in september 03 wich happened to be 25 days! Here is the script for reference: function timer1 () {
console.log('timer1' + new Date().toString());
setTimeout(timer1, 500);
}
function timer2 () {
console.log('timer2' + new Date().toString());
setTimeout(timer2, 500);
}
function timer3 () {
console.log('timer3' + new Date().toString());
setTimeout(timer3, 500);
}
function timer4 () {
console.log('timer4' + new Date().toString());
setTimeout(timer4, 500);
}
function timer5 () {
console.log('timer5' + new Date().toString());
setTimeout(timer5, 1500);
}
function timer6 () {
console.log('timer6' + new Date().toString());
setTimeout(timer6, 500);
}
function timer7 () {
console.log('timer7' + new Date().toString());
setTimeout(timer7, 500);
}
function timer8 () {
console.log('timer8' + new Date().toString());
setTimeout(timer8, 500);
}
function timer9 () {
console.log('timer9' + new Date().toString());
setTimeout(timer9, 500);
}
function timer10 () {
console.log('timer10' + new Date().toString());
setTimeout(timer10, 500);
}
timer1();
timer2();
timer3();
timer4();
timer5();
timer6();
timer7();
timer8();
timer9();
timer10(); In the middle of september 3 the output stopped. No error other output in the console. Here is how I started it: @benjamingr So, libuv... I'm not at all familiar with that lib. But I suppose the bug could be there instead... |
I think I know the bug... confirming right now. Ugh. |
@tkarls excellent work isolating it. libuv is the underlying i/o library Node uses (it was written for Node but is used by others now) - it sounds like Anatoli has some idea about where the bug is though :) |
Do we know what versions of Node this is affecting? We're currently on 9.x and was going to start implementing some polling using Also do we know if it's only when using |
Hi, |
升级到10.9.0以上吧,就是18年8月15日以后的版本都可以 |
This is fixed in Node 10.9.0 |
@peernohell I guess we're not getting the answer to our questions 😆 |
I’ve got the same issue with node.js 12.x. Either setInterval or setTimeout unexpectedly stop without any error after some few weeks. |
@tkarls Excellent job isolating the problem! We know that Node 10.9 fixes the issue, what about Node 12 and Node 14, can you please comment on those? |
I am having the same issue with timeouts in Node 12.18.3 |
@paynebc Any chance that you could reproduce this with |
I will give this a try |
FYI: I ran my script above with libfaketime again and it does not detect any problem on node 12.18.3 |
Interesting, it must be a different problem then. I'm thinking about renting a VPS and running like 5-10 major versions of NodeJS in parallel with |
I tried the script above now, too. It does indeed run totally fine, so whatever it is I have must be of different origin. But it does show exactly the same symptoms: every couple days, my main loop just stops working. I am using the following function and calling it with await in my main loop instead of passing my mainloop function directly to setTimeout exports.sleep = function(millis) { I am now running my application using NODE_DEBUG=* in the hopes of finding anything useful. |
It's interesting because you are using
Are you absolutely sure that it is the |
yes this is how it works. I now added some more logging to find out if this is indeed where it hangs. I will report back once I have an idea what it might be |
Ok, sounds good! FYI actually the reason why I used |
yes I originally wanted to do it that way, too. but I was not sure what happens, when one loop iteration takes longer than my interval time. Actually I guess I do know what happens and I don't think it would be a good idea :) This way I do not have a constant wait time because the interval will be whatever the iteration uses +500 ms, but that is fine, I don't care about exact times, I just don't want the loop to run constantly |
I understand, it makes sense! I think what would happen depends whether your loop is synchronous or asynchronous?
Therefore yes, I think your approach is better for what you want to achieve :) |
the work that is happening in the loop is asynchronous. so yes, I will get multiple instances running in parallel and I would very much want to avoid that |
Hi, I am currently facing the same issue with v12.18.1, did you manage to find a way to replicate the issue? I am gonna switch to using node-cron, but I wanna make sure it's setInterval that's causing the issue, and not some other part of the process. I opened an issue here: #41037 |
@ThabetSabha as what I know the issue is still there in node 14, and we’ve used node-cron to replace most long-running setInterval in our projects for more than one year, and it’s been working well until now |
@nhathongly great, that's what I will be doing, did you manage to figure out a way to replicate the issue though? |
I have the same problem in node v20.11.0. All of my loops stop working after a a message from a ws server. no errors, even in inspection mode. it just stops. i've tried different solutions: setinterval (sync and async):
circular timeout:
and async delay:
all of the processes are async so blocking should not be possible. has anyone found a solution yet? |
@MvDDD It's been a long time since I've paid attention to this problem, and most likely the cause is numeric type overflow. Here are three tips: 1. Switch node.js versions; 2. Reframe the problem to the development team. Switch to a scheduling library. |
There is a setInterval method in the express project. When the project has been running for a month or so, there is no error, the project does not stop, but the callback of the setInterval is no longer executed.
like this :
setInterval(function(){
//do samething
console.log(Date.now());
},500);
A month later, no output is available, but the application process is robust.
The text was updated successfully, but these errors were encountered: