-
Notifications
You must be signed in to change notification settings - Fork 30.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
timers: fix cleanup of nested same-timeout timers #7827
timers: fix cleanup of nested same-timeout timers #7827
Conversation
const TIMEOUT = 100; | ||
const start = Timer.now(); | ||
|
||
function testNestedTimers() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The contents of this function could probably just be top-level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
I was just about to look for other ways than twitter to reach out to people potentially interested in fixing this issue, and just found this PR. That's great :) I'll take a look asap, thank you! |
Here's a CI run: https://ci.nodejs.org/job/node-test-pull-request/3377/ The test worries me though, we've had troubles relying on timeout duartions in tests too much. Perhaps there is a way to reduce the reliance on timeout durations. |
@Fishrock123 Agree, I don't like relying the timeout duration one bit. I consulted with @julianduque to see if there was a better way. Do you have any suggestions? If |
@@ -211,9 +211,9 @@ function listOnTimeout() { | |||
debug('%d list empty', msecs); | |||
assert(L.isEmpty(list)); | |||
this.close(); | |||
if (list._unrefed === true) { | |||
if (list._unrefed === true && list === unrefedLists[msecs]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mind adding a block comment that describes why these checks are needed? Maybe something very similar to the block comment included with the newly added test would be a good fit.
The goal would be to make it clear to anyone reading the code (including those not familiar with it) why these extra checks are needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
2d9fb7a
to
1cbc063
Compare
// erroneously deleted. If we are able to cancel the timer successfully, | ||
// the bug is fixed. | ||
clearTimeout(handle2); | ||
}, 1), 10); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The , 1
argument to common.mustCall
is optional and can be left out (which I would recommend here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
@erinspice One of the ways to not rely on timeout duration would be to check the output of Something along the lines of: const handle1 = setTimeout(common.mustCall(function() {
// Cause the old TIMEOUT list to be deleted
clearTimeout(handle1);
// Cause a new list with the same key (TIMEOUT) to be created for this timer
const handle2 = setTimeout(function() {
common.fail('Inner callback is not called');
}, TIMEOUT);
setTimeout(common.mustCall(function() {
// Attempt to cancel the second timer. Fix for this bug will keep the
// newer timer from being dereferenced by keeping its list from being
// erroneously deleted. If we are able to cancel the timer successfully,
// the bug is fixed.
clearTimeout(handle2);
setImmediate(function() {
setImmediate(function () {
var activeHandles = process._getActiveHandles();
checkNoTimersListWithTimeoutOf(TIMEOUT); // this would need to be implemented in this test
})
);
}, 1), 10);
// When this callback completes, `listOnTimeout` should now look at the
// correct list and refrain from removing the new TIMEOUT list which
// contains the reference to the newer timer.
}, 1), TIMEOUT); The code above is just an example of what could be done, if you explore this solution feel free to change it :) The problem with that approach is that it relies heavily on the current implementation of the libuv event loop. More specifically, it relies on the fact that the underlying libuv handle that was created for the timers list for timers with delays of It also relies on |
@misterdjules Thanks, that appears to work when running the test singly. Now running the entire suite. Will |
It should not, handles are per process. |
After thinking a bit more about the problems mentioned in the testing approach suggested in my recent comment, it seems that the problem with that approach is that if Maybe combining the two methods, that is keeping the assert on durationOfTest < TIMEOUT * 2 and adding the method mentioned above is the best test we can come up with for now. |
@misterdjules Sounds good. |
@erinspice This PR is looking great so far. The comments in the test were very useful in order to understand how it was testing the change, and very well written. The commit message is also very clear. This is very good work with a great attention to details, and it is very much appreciated. Thank you! |
1cbc063
to
5cfd98d
Compare
My pleasure! I think I have addressed all the feedback so far. |
// the bug is fixed. | ||
clearTimeout(handle2); | ||
setImmediate(function() { | ||
setImmediate(function() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be nitpicking, but I think this callback should also be wrapped in a common.mustCall
call. This would mean that the outer setImmediate
callback would also need to be wrapped in a common.mustCall
call, or that this inner setImmediate
callback's definition be moved at the top level of the file and wrapped in a common.mustCall
call there, and then used here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Done.
I think this is looking good to me once the latest nit is addressed, but I'd like to wait for others, especially @Fishrock123, to weigh in too before giving a formal approval. Thank you once again! |
5cfd98d
to
22241d4
Compare
If a different approach to testing is not found, one thing that might be helpful in mitigating any timer race condition stuff in the current incarnation is Another thing to do might be to have a collaborator run the stress test on Jenkins to make sure the test is not flaky. I'd recommend running the test 9999 times on a Windows machine and a SmartOS machine. If you want to be thorough, maybe one or two other things. Lastly, there is a separate stress test for Raspberry Pi devices. I'd run the test 999 times on that. I'm happy to kick off the stress tests if we think this is more-or-less in its probable final form. |
Just to make sure we're on the same page, there are (among others) two distinct problems with this test:
The use of nested immediates are an attempt at solving 2), but I think your suggestion for using @erinspice Would you mind updating this PR by changing the value of We could get rid of the timing-based assertion if In other words, keeping the timing-based test in combination with using nested immediates seems like a solution that could give false positives, but no false negatives if we consider that the current tests suite ensures that timers fire after Your suggestion about running stress tests should give us a better idea about the likeliness of false positives, but I expect that using There is definitely room for improvement, and we might be able to come up with more robust mechanisms, both in the test added by that PR and in the rest of the timers tests. However I believe that what after @erinspice updates this PR to use
I think these are great suggestions.
After @erinspice makes the change to assigning the value |
setImmediate(common.mustCall(function() { | ||
setImmediate(common.mustCall(function() { | ||
const activeHandles = process._getActiveHandles(); | ||
activeHandles.forEach(common.mustCall(function(handle) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's OK to leave the forEach
callback as is, that is not wrapped by common.mustCall
, unless we're somehow not confident that Array.forEach
will call it activeHandles.length
time.
However we should probably assert that activeHandles.length
> 0, and that there's one instance of Timer
in there (in addition to the assert already present).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the second argument to common.mustCall
here because it was asserting a single call with no argument. There is no "called at least once" assertion. We could remove the mustCall completely from this loop.
The assertion that there is still one Timer remaining at this point fails, though, presumably because of the two calls to setImmediate? There are zero in the list of active handles.
This code passes:
setImmediate(common.mustCall(function() {
setImmediate(common.mustCall(function() {
const activeHandles = process._getActiveHandles();
const activeTimers = activeHandles.filter(function (handle) {
return handle instanceof Timer;
});
// Make sure our clearTimeout succeeded. One timer finished and
// the other was canceled, so none should be active.
assert.equal(activeTimers.length, 0, 'No Timers remain.');
}));
}));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@misterdjules pushed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion that there is still one Timer remaining at this point fails, though, presumably because of the two calls to setImmediate? There are zero in the list of active handles.
Absolutely, my mistake, and sorry for the confusion. It's expected that no timers are left in the list of active handles at this point, so the latest changes are good.
Basically the goal is to have some check that Timer
instances are actually added to the list of active handles at some point, otherwise checking for their absence doesn't tell us much. We know now that these Timer
instances are added to the list of active handles, because we've been working on that test and we checked that ourselves, but future changes might change that and the test might pass just because Timer
instances are never added to the list of active handles anymore.
What do you think of adding a check after the 10 ms timer is added that makes sure that there are at least two instances of Timer
in the active handles list, one with a delay of TIMEOUT
and the other with a delay of 10ms?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@misterdjules Done.
Actually, in its current state we could still have false negatives, because there's no check that ensures that the I added a comment inline to mention that. @erinspice Do you mind updating this PR with that? And then I'll try to stop adding work to this PR :) My apologies for the confusion. |
ae7640f
to
b54d9ab
Compare
b54d9ab
to
818a33f
Compare
@erinspice Thank you for the updates! Changes LGTM. New CI run here: https://ci.nodejs.org/job/node-test-pull-request/3426/. As per @Trott's suggestion: Windows stress test running at https://ci.nodejs.org/job/node-stress-single-test/831/. If the tests above do not identify any significant problem, we'll still wait a couple days so that someone else can also approve the change. After that delay, of if someone else also approves the changes, we'll be able to merge the changes in master. We're getting closer, thank you @erinspice for your patience! |
Only failure on CI is a Windows failure for
This test does not have a history of false positives, at least as far as I am aware. On the other hand, I wouldn't expect it to be affected by this change. Here's another CI: https://ci.nodejs.org/job/node-test-pull-request/3428/ Stress test on Stress test on |
@nodejs/collaborators Does anyone have some time to review this? I would appreciate if another pair of eyes could take a look and formally give feedback. Thank you! |
Very interesting: On master, |
The test takes 50 seconds on some of the project's Windows CI infrastructure. Reducing the test repetitions from 50 to 20 trims that to about 20 seconds. Tests will timeout at 60 seconds, so this helps keep the test reliable. (There was a timeout on CI today when testing an unrelated code change.) Refs: nodejs#7827 (comment)
@nodejs/collaborators Last call to be able to review this, otherwise I'll land it probably later today, maybe tomorrow morning pacific time at the latest. Thanks! |
LGTM. Is it worth running the timers benchmarks in |
@Trott I do not expect these changes to affect performance significantly, but it surely doesn't hurt to run the benchmarks, if only to confirm that theory. |
After running the following command:
with
These results seem to indicate that the only significant impact that we can say these changes possibly have with a significant level of confidence are performance improvements in Regardless, the benchmarks seem to not reject the null-hypothesis for these changes (p-values for the timers benchmarks are well above 0.05, and thus have no stars associated to them). Thus I'd think these changes are safe performance-wise. |
The test takes 50 seconds on some of the project's Windows CI infrastructure. Reducing the test repetitions from 50 to 20 trims that to about 20 seconds. Tests will timeout at 60 seconds, so this helps keep the test reliable. (There was a timeout on CI today when testing an unrelated code change.) Refs: nodejs#7827 (comment) PR-URL: nodejs#7886 Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: JungMinu - Minwoo Jung <jmwsoft@gmail.com> Reviewed-By: Michaël Zasso <mic.besace@gmail.com>
LGTM |
Landed in 7d75338. Thank you @erinspice, @Fishrock123, @Trott and @jasnell! |
For nested timers with the same timeout, we can get into a situation where we have recreated a timer list immediately before we need to clean up an old timer list with the same key. Fix: make sure the list to be deleted is the same instance as the list whose reference was used to determine that a cleanup is necessary. If it's not the same instance, a new list with the same key has been created, and it should not be deleted. Fixes: #7722 PR-URL: #7827 Reviewed-By: Rich Trott <rtrott@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Julien Gilli <jgilli@fastmail.fm>
The test takes 50 seconds on some of the project's Windows CI infrastructure. Reducing the test repetitions from 50 to 20 trims that to about 20 seconds. Tests will timeout at 60 seconds, so this helps keep the test reliable. (There was a timeout on CI today when testing an unrelated code change.) Refs: #7827 (comment) PR-URL: #7886 Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: JungMinu - Minwoo Jung <jmwsoft@gmail.com> Reviewed-By: Michaël Zasso <mic.besace@gmail.com>
For nested timers with the same timeout, we can get into a situation where we have recreated a timer list immediately before we need to clean up an old timer list with the same key. Fix: make sure the list to be deleted is the same instance as the list whose reference was used to determine that a cleanup is necessary. If it's not the same instance, a new list with the same key has been created, and it should not be deleted. Fixes: #7722 PR-URL: #7827 Reviewed-By: Rich Trott <rtrott@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com> Reviewed-By: Julien Gilli <jgilli@fastmail.fm>
@Fishrock123 is this dependent on your timers re-write? |
The test takes 50 seconds on some of the project's Windows CI infrastructure. Reducing the test repetitions from 50 to 20 trims that to about 20 seconds. Tests will timeout at 60 seconds, so this helps keep the test reliable. (There was a timeout on CI today when testing an unrelated code change.) Refs: #7827 (comment) PR-URL: #7886 Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: JungMinu - Minwoo Jung <jmwsoft@gmail.com> Reviewed-By: Michaël Zasso <mic.besace@gmail.com>
The test takes 50 seconds on some of the project's Windows CI infrastructure. Reducing the test repetitions from 50 to 20 trims that to about 20 seconds. Tests will timeout at 60 seconds, so this helps keep the test reliable. (There was a timeout on CI today when testing an unrelated code change.) Refs: #7827 (comment) PR-URL: #7886 Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: JungMinu - Minwoo Jung <jmwsoft@gmail.com> Reviewed-By: Michaël Zasso <mic.besace@gmail.com>
The test takes 50 seconds on some of the project's Windows CI infrastructure. Reducing the test repetitions from 50 to 20 trims that to about 20 seconds. Tests will timeout at 60 seconds, so this helps keep the test reliable. (There was a timeout on CI today when testing an unrelated code change.) Refs: #7827 (comment) PR-URL: #7886 Reviewed-By: Santiago Gimeno <santiago.gimeno@gmail.com> Reviewed-By: JungMinu - Minwoo Jung <jmwsoft@gmail.com> Reviewed-By: Michaël Zasso <mic.besace@gmail.com>
Checklist
make -j4 test
(UNIX), orvcbuild test nosign
(Windows) passesAffected core subsystem(s)
timers
Description of change
For nested timers with the same timeout, we can get into a situation
where we have recreated a timer list immediately before we need to
clean up an old timer list with the same key. Fix: make sure the list
to be deleted is the same instance as the list whose reference was used
to determine that a cleanup is necessary. If it's not the same instance,
a new list with the same key has been created, and it should not be
deleted.
Fixes: #7722