investigate flaky internet/test-uv-threadpool-schedule #25305

Closed
Trott opened this issue Jan 1, 2019 · 11 comments
Labels
flaky-test Issues and PRs related to the tests with unstable failures on the CI.

Comments

@Trott
Member

Trott commented Jan 1, 2019

internet/test-uv-threadpool-schedule failed on node-daily-master (the only place where internet tests are run) last night. Considering it was only added two days ago, that's probably cause for concern?

https://ci.nodejs.org/job/node-test-commit-custom-suites/810/default/console

test-rackspace-ubuntu1604-x64-1

00:02:11 not ok 25 internet/test-uv-threadpool-schedule
00:02:11   ---
00:02:11   duration_ms: 0.613
00:02:11   severity: fail
00:02:11   exitcode: 1
00:02:11   stack: |-
00:02:11     assert.js:351
00:02:11         throw err;
00:02:11         ^
00:02:11     
00:02:11     AssertionError [ERR_ASSERTION]: fast I/O took longer to complete, actual: 17, expected: 3.63
00:02:11         at GetAddrInfoReqWrap.onResolve (/home/iojs/build/workspace/node-test-commit-custom-suites/default/test/internet/test-uv-threadpool-schedule.js:41:12)
00:02:11         at GetAddrInfoReqWrap.callback (/home/iojs/build/workspace/node-test-commit-custom-suites/default/test/common/index.js:376:15)
00:02:11         at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:57:17)
00:02:11   ...

@Trott Trott added the flaky-test label Jan 1, 2019
@gireeshpunathil
Member

@Trott - do you know how many CPUs this (failing) system has? IMO the most probable explanation for this test case failing the way it did is real thread (CPU) congestion.

I will see if I can replicate this locally, or else request CI access if it is easily reproduced there. Thanks!

@gireeshpunathil
Member

If only a single host type runs this test, I can also calibrate the test for that host - meaning, baseline the ratio of pooled workers for fast I/O vs. slow I/O that suits the host based on its CPU characteristics.

@Trott
Member Author

Trott commented Jan 2, 2019

do you know how many CPUs this (failing) system has?

@gireeshpunathil It seems to be a 4-CPU machine:

root@test-rackspace-ubuntu1604-x64-1:~# cat /proc/cpuinfo | grep processor | wc -l
4
root@test-rackspace-ubuntu1604-x64-1:~#

@Trott
Member Author

Trott commented Jan 2, 2019

Failed again today on node-daily-master:

https://ci.nodejs.org/job/node-test-commit-custom-suites/811/default/console

test-rackspace-ubuntu1604-x64-1

00:02:17 not ok 25 internet/test-uv-threadpool-schedule
00:02:17   ---
00:02:17   duration_ms: 1.316
00:02:17   severity: fail
00:02:17   exitcode: 1
00:02:17   stack: |-
00:02:17     assert.js:351
00:02:17         throw err;
00:02:17         ^
00:02:17     
00:02:17     AssertionError [ERR_ASSERTION]: fast I/O took longer to complete, actual: 16, expected: 10.08
00:02:17         at GetAddrInfoReqWrap.onResolve (/home/iojs/build/workspace/node-test-commit-custom-suites/default/test/internet/test-uv-threadpool-schedule.js:41:12)
00:02:17         at GetAddrInfoReqWrap.callback (/home/iojs/build/workspace/node-test-commit-custom-suites/default/test/common/index.js:376:15)
00:02:17         at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:57:17)
00:02:17   ...

@gireeshpunathil
Member

I am able to recreate the failure locally.
Taking strace dumps, I can see that the libuv fast I/O thread is engaged correctly and finishes in time. It looks like there is an issue in the way we compute the time differences.

In short, the test case is at fault and needs to be fixed. I will investigate.

@gireeshpunathil
Member

gireeshpunathil commented Jan 3, 2019

OK, here is the picture:

  • record the start.
  • issue a hundred DNS lookups (async); when the last one finishes, mark the end time of dns.
  • issue a file read (async); when it is done, mark the end time of fs.
  • assert that the time for fs is at most 1/100th of the time for dns.

Intent: make sure DNS threads (slow I/O) do not throttle fs threads (fast I/O).
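For readers not familiar with the test, here is a minimal sketch of the structure described above. It is an approximation, not the actual test file; the hostname, lookup count, and variable names are illustrative.

'use strict';
const assert = require('assert');
const dns = require('dns');
const fs = require('fs');

const slowIOmax = 100;          // number of DNS lookups (slow I/O)
let slowIOcount = 0;
let slowIOtime;
let fastIOtime;

function maybeAssert() {
  // run the check once both the slow and the fast I/O have finished
  if (slowIOtime === undefined || fastIOtime === undefined) return;
  const expectedMax = slowIOtime / 100;
  assert(fastIOtime < expectedMax,
         'fast I/O took longer to complete, ' +
         `actual: ${fastIOtime}, expected: ${expectedMax}`);
}

const start = Date.now();       // record the start

// slow I/O: a burst of DNS lookups keeps the threadpool workers busy
for (let i = 0; i < slowIOmax; i++) {
  dns.lookup('nodejs.org', () => {
    if (++slowIOcount === slowIOmax) {
      slowIOtime = Date.now() - start;    // mark the end time of dns
      maybeAssert();
    }
  });
}

// fast I/O: a small file read also goes through the threadpool
fs.readFile(__filename, () => {
  fastIOtime = Date.now() - start;        // mark the end time of fs
  maybeAssert();
});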

All is good, except for the placement of the time recording. As the 100 DNS lookups are initiated back-to-back, a few of them finish before the fs read starts and fire their results back into the event loop. So when the fs read finishes, two things happen:

  • the event descriptor for fs gets queued behind those for dns
  • the time spent in the dns callbacks is accounted against fs

So:

  • if large numbers of DNS queries pile up events in the loop in this manner, do we think the original issue that the libuv fix was designed for is fully resolved?
  • if not, my test case may be at fault, and I need to measure things more accurately - but how?

Wondering about these. /cc @nodejs/libuv @addaleax

@Trott
Member Author

Trott commented Jan 3, 2019

The fix (for the issue that this test is supposed to detect) is in 11.0 and newer.

But with 11.6.0, running at the command line on a macOS laptop, the test fails for me:

$ node test/internet/test-uv-threadpool-schedule.js 
assert.js:351
    throw err;
    ^

AssertionError [ERR_ASSERTION]: fast I/O took longer to complete, actual: 9, expected: 5.03
    at GetAddrInfoReqWrap.onResolve (/Users/trott/io.js/test/internet/test-uv-threadpool-schedule.js:41:12)
    at GetAddrInfoReqWrap.callback (/Users/trott/io.js/test/common/index.js:376:15)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:57:17)
$

Of course, results may depend on network, DNS, etc.
¯\_(ツ)_/¯

It fails on master too:

$ ./node test/internet/test-uv-threadpool-schedule.js 
assert.js:351
    throw err;
    ^

AssertionError [ERR_ASSERTION]: fast I/O took longer to complete, actual: 10, expected: 4.65
    at GetAddrInfoReqWrap.onResolve (/Users/trott/io.js/test/internet/test-uv-threadpool-schedule.js:41:12)
    at GetAddrInfoReqWrap.callback (/Users/trott/io.js/test/common/index.js:376:15)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:57:17)
$ 

Results don't look significantly different for 10.x:

$ node test/internet/test-uv-threadpool-schedule.js 
assert.js:350
    throw err;
    ^

AssertionError [ERR_ASSERTION]: fast I/O took longer to complete, actual: 9, expected: 4.9
    at GetAddrInfoReqWrap.onResolve (/Users/trott/io.js/test/internet/test-uv-threadpool-schedule.js:41:12)
    at GetAddrInfoReqWrap.callback (/Users/trott/io.js/test/common/index.js:376:15)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:57:17)
$ 

Changing one line of the test:

const slowIOmax = 100 * require('os').cpus().length;

...doesn't change the results other than increasing the times. The test still fails on master, 11.x, and 10.x for me locally.

Is it possible that the test as written is just too dependent on environment? Perhaps we need a different approach, or maybe the test should be removed? (It failed for the third time in a row today on node-daily-master.)

@Trott
Member Author

Trott commented Jan 3, 2019

Oh, the fix is also in 10.12, so I have to run Node.js earlier than that.

The test still fails, but we do see a significant difference!

10.12:

$ node test/internet/test-uv-threadpool-schedule.js 
assert.js:350
    throw err;
    ^

AssertionError [ERR_ASSERTION]: fast I/O took longer to complete, actual: 13, expected: 4.86
    at GetAddrInfoReqWrap.onResolve (/Users/trott/io.js/test/internet/test-uv-threadpool-schedule.js:41:12)
    at GetAddrInfoReqWrap.callback (/Users/trott/io.js/test/common/index.js:376:15)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:57:17)
$ 

10.11:

$ node test/internet/test-uv-threadpool-schedule.js 
assert.js:348
    throw err;
    ^

AssertionError [ERR_ASSERTION]: fast I/O took longer to complete, actual: 239, expected: 2.42
    at GetAddrInfoReqWrap.onResolve (/Users/trott/io.js/test/internet/test-uv-threadpool-schedule.js:41:12)
    at GetAddrInfoReqWrap.callback (/Users/trott/io.js/test/common/index.js:376:15)
    at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:57:17)
$ 

Maybe changing const expectedMax = slowIOtime / 100; to const expectedMax = slowIOtime / 10; is enough to fix the issue while still catching the regression? Still feels awfully environment-dependent, but I'm not sure I have a better idea, unfortunately.
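For illustration, the suggested relaxation would look roughly like this (a sketch only; the variable names follow the error message and the sketch above, and the surrounding code in the actual test may differ):

const expectedMax = slowIOtime / 10;   // was: slowIOtime / 100
assert(fastIOtime < expectedMax,
       'fast I/O took longer to complete, ' +
       `actual: ${fastIOtime}, expected: ${expectedMax}`);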

@gireeshpunathil
Member

@Trott - thanks for running many tests, drawing inferences, and making the recommendation. I agree that being more conservative with the calculation is the most reasonable thing to do here. But let me wait to hear back from the libuv folks first. The key question is: do we think fair sharing of threads between the two types of I/O is sufficient here, or do we need to make sure the completed work reaches the consumer in a timely manner to ensure fairness?

actual: 239, expected: 2.42

This is the kind of problem the test is supposed to catch: 239 milliseconds for a 1K file read! This is definitely thread starvation. I strongly believe this run was either without the libuv fix, or with a fix that had issues (which was re-fixed later).

@Trott
Member Author

Trott commented Jan 4, 2019

I strongly believe this is either without the libuv fix,

Yes, exactly. That was on Node 10.11 (without the fix). The run above it is Node 10.12 where the fix first appeared. The result there is only 13 ms.

@Trott
Member Author

Trott commented Jan 4, 2019

Today's:

00:03:42 not ok 25 internet/test-uv-threadpool-schedule
00:03:42   ---
00:03:42   duration_ms: 0.712
00:03:42   severity: fail
00:03:42   exitcode: 1
00:03:42   stack: |-
00:03:42     assert.js:351
00:03:42         throw err;
00:03:42         ^
00:03:42     
00:03:42     AssertionError [ERR_ASSERTION]: fast I/O took longer to complete, actual: 17, expected: 4.03
00:03:42         at GetAddrInfoReqWrap.onResolve (/home/iojs/build/workspace/node-test-commit-custom-suites/default/test/internet/test-uv-threadpool-schedule.js:41:12)
00:03:42         at GetAddrInfoReqWrap.callback (/home/iojs/build/workspace/node-test-commit-custom-suites/default/test/common/index.js:376:15)
00:03:42         at GetAddrInfoReqWrap.onlookup [as oncomplete] (dns.js:57:17)
00:03:42   ...

Trott added a commit to Trott/io.js that referenced this issue Jan 6, 2019
test-uv-threadpool-schedule has been failing consistently in
node-daily-master CI. It also fails consistently on my personal laptop.
These changes make it pass consistently with current master but fail
consistently with Node.js 10.11 (which was the last release in the 10.x
line before the fix that the test is for). It succeeds/fails as expected
whether or not I am connected to the network.

Fixes: nodejs#25305
@Trott Trott closed this as completed in 9987f1a Jan 8, 2019
addaleax pushed a commit that referenced this issue Jan 9, 2019
test-uv-threadpool-schedule has been failing consistently in
node-daily-master CI. It also fails consistently on my personal laptop.
These changes make it pass consistently with current master but fail
consistently with Node.js 10.11 (which was the last release in the 10.x
line before the fix that the test is for). It succeeds/fails as expected
whether or not I am connected to the network.

Fixes: #25305

PR-URL: #25358
Reviewed-By: Anna Henningsen <anna@addaleax.net>
Reviewed-By: James M Snell <jasnell@gmail.com>
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Sakthipriyan Vairamani <thechargingvolcano@gmail.com>
refack pushed a commit to refack/node that referenced this issue Jan 14, 2019
BridgeAR pushed a commit to BridgeAR/node that referenced this issue Jan 16, 2019
BethGriggs pushed a commit that referenced this issue Apr 28, 2019
BethGriggs pushed a commit that referenced this issue May 10, 2019
MylesBorins pushed a commit that referenced this issue May 16, 2019