sync: Pool tests flaky on arm builders #31422

bcmills · 2019-04-11T19:34:47Z

Possibly related to #24640.

Samples:
https://build.golang.org/log/10c155a9635967f5b3006b6a04b6d5442ff9713a:

--- FAIL: TestPoolDequeue (0.00s)
    pool_test.go:239: popHead never succeeded
FAIL
FAIL	sync	0.864s

https://build.golang.org/log/3fbb17b4083eca9629c97f7b71879c804ecf5d0d and
https://build.golang.org/log/9f98720ace008f8c98f74c0d14049cb67b3c56f5:

##### sync -cpu=10
--- FAIL: TestPoolChain (0.00s)
    pool_test.go:239: popHead never succeeded
FAIL
FAIL	sync	0.864s


ianlancetaylor · 2019-04-11T21:20:41Z

CC @aclements

aclements · 2019-04-16T01:35:28Z

I just got this once in 1,045 runs of all.bash on my linux/amd64 workstation.

--- FAIL: TestPoolChain (0.00s)
    pool_test.go:239: popHead never succeeded
FAIL
FAIL    sync    0.827s

This is certainly a theoretically possible failure, but when I wrote this test I thought the chance of hitting the bad schedule was infinitesimal. Maybe there's a more likely schedule that can cause this.

josharian · 2019-05-14T00:19:15Z

Lots of instances of this on arm and arm64 builders:

$ greplogs -dashboard -E popHead -l
2019-04-29T15:23:10-db1514c/linux-arm64-packet
2019-04-29T21:26:07-d5014ec/linux-arm64-packet
2019-04-29T22:17:05-ccbc9a3/linux-arm64-packet
2019-04-30T15:48:46-4ad1355/netbsd-arm-bsiegert
2019-04-30T16:59:13-f686a28/netbsd-arm-bsiegert
2019-04-30T18:40:06-62ddf7d/linux-arm64-packet
2019-04-30T19:13:43-8e4f1a7/linux-arm64-packet
2019-04-30T20:26:36-85387aa/netbsd-arm-bsiegert
2019-05-01T14:59:51-ab5cee5/netbsd-arm-bsiegert
2019-05-01T16:10:05-e56c73f/netbsd-arm-bsiegert
2019-05-01T16:53:19-f0c383b/netbsd-arm-bsiegert
2019-05-01T16:55:33-07f6894/netbsd-arm-bsiegert
2019-05-01T21:14:28-aaf40f8/netbsd-arm-bsiegert
2019-05-01T22:22:41-e5f0d14/netbsd-arm-bsiegert
2019-05-02T14:04:56-2316784/netbsd-arm-bsiegert
2019-05-02T14:44:05-19f5c23/netbsd-arm-bsiegert
2019-05-02T22:17:31-fe83731/netbsd-arm-bsiegert
2019-05-03T15:17:54-5e404b3/netbsd-arm-bsiegert
2019-05-03T15:20:15-f5c43b9/netbsd-arm-bsiegert
2019-05-03T15:20:41-2c67cdf/linux-arm64-packet
2019-05-03T18:42:04-7fcba81/linux-arm64-packet
2019-05-06T17:06:16-5003b62/netbsd-arm-bsiegert
2019-05-06T18:17:03-cc5eaf9/linux-arm64-packet
2019-05-06T20:09:58-e1f9e70/netbsd-arm-bsiegert
2019-05-06T20:57:39-a62b572/netbsd-arm-bsiegert
2019-05-06T20:59:20-f4a5ae5/netbsd-arm-bsiegert
2019-05-06T21:14:52-5c15ed6/linux-arm
2019-05-06T21:23:29-04845fe/linux-arm64-packet
2019-05-06T21:23:29-04845fe/netbsd-arm-bsiegert
2019-05-06T23:02:29-6b1ac82/netbsd-arm-bsiegert
2019-05-06T23:23:45-53374e7/linux-arm64-packet
2019-05-06T23:23:45-53374e7/netbsd-arm-bsiegert
2019-05-07T12:48:04-a88cb1d/netbsd-arm-bsiegert
2019-05-07T16:59:51-8280455/linux-arm64-packet
2019-05-07T16:59:51-8280455/netbsd-arm-bsiegert
2019-05-08T16:00:05-4cd6c3b/linux-arm64-packet
2019-05-08T16:00:05-4cd6c3b/netbsd-arm-bsiegert
2019-05-08T16:55:59-2625fef/netbsd-arm-bsiegert
2019-05-08T17:11:57-5a2da56/netbsd-arm-bsiegert
2019-05-09T00:02:34-f766b68/netbsd-arm-bsiegert
2019-05-09T16:10:22-d56199d/linux-arm64-packet
2019-05-09T17:11:16-a44c3ed/linux-arm64-packet
2019-05-09T17:49:12-50a1d89/netbsd-arm-bsiegert
2019-05-09T21:13:18-6ed2ec4/netbsd-arm-bsiegert
2019-05-09T21:13:21-1ea7644/netbsd-arm-bsiegert
2019-05-09T21:13:39-13723d4/netbsd-arm-bsiegert
2019-05-09T21:13:56-a4f5c9c/netbsd-arm-bsiegert
2019-05-10T00:14:40-4ae31dc/netbsd-arm-bsiegert
2019-05-10T14:24:43-2aa8971/netbsd-arm-bsiegert
2019-05-11T03:02:33-ce5ae2f/netbsd-arm-bsiegert
2019-05-11T23:19:40-0926701/netbsd-arm-bsiegert

dianhong01 · 2019-06-03T02:42:16Z

When I ran the sync pool tests as below about 2000 times on an arm64 device, they all passed:
../golang/bin/go test sync -cpu=10 -c -o s1
./s1
But when I ran them like this:
../golang/bin/go test sync -cpu=10 -c -o s2
./s2 -test.short
1521 runs passed and 1378 failed.

The all.bash script sets the '-test.short' flag, which makes installation faster. In this test, '-test.short' controls the value of N. A comment in the code says "In theory it's possible in a valid schedule for popHead to never succeed", so my guess is that N is too small in short mode for the test to pass reliably.

func testPoolDequeue(t *testing.T, d PoolDequeue) {
	const P = 10
	// In long mode, do enough pushes to wrap around the 21-bit
	// indexes.
	N := 1<<21 + 1000
	if testing.Short() {
		N = 1e3
	}
	...........

bcmills · 2019-06-26T14:45:03Z

@aclements, is this still on the radar for 1.13? Is this more likely a bug in the test, or in the Pool implementation?

aclements · 2019-06-26T17:35:37Z

Given that the long test doesn't flake, this is almost certainly a bug in the test. In the short test, there are only 100 expected PopHeads. On my linux/amd64 laptop, in 1000 runs, it gets as low as 50 successful PopHeads, but that seems to be a hard floor. It does give me pause that the failure rate is that high, since I would expect these schedules to be quite rare.

aclements · 2019-06-26T18:19:32Z

I added some logging. It looks like the time between the PushHead committing and the PopHead committing is just long enough that the racing PopTail loop can regularly succeed and drain the queue.

This means it's just the test. I'm not sure why it's so flaky on arm64 specifically, but it may be that that window is just larger because of architectural details. I'm still thinking about how to make the test less flaky. We could of course just add retries, but it would be nice to do something better.
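
To make the window described above concrete, here is a minimal sketch of the same shape of race. It is not the sync package's code: `lockedDeque` is a hypothetical mutex-guarded stand-in for the lock-free dequeue, `race` is an illustrative name, and the choices of 9 consumers and a popHead attempt on every 10th push are assumptions picked to match the "100 expected PopHeads" figure for N = 1000. The producer pushes a value and then tries to pop it back from the head, while consumers spin on popTail and may take the value first.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// lockedDeque is a mutex-guarded stand-in for the dequeue under test.
// It is an assumption for illustration, not the sync package's type.
type lockedDeque struct {
	mu    sync.Mutex
	items []int
}

func (d *lockedDeque) pushHead(v int) {
	d.mu.Lock()
	d.items = append(d.items, v)
	d.mu.Unlock()
}

// popHead removes the most recently pushed value, if any.
func (d *lockedDeque) popHead() (int, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.items) == 0 {
		return 0, false
	}
	v := d.items[len(d.items)-1]
	d.items = d.items[:len(d.items)-1]
	return v, true
}

// popTail removes the oldest value, if any.
func (d *lockedDeque) popTail() (int, bool) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if len(d.items) == 0 {
		return 0, false
	}
	v := d.items[0]
	d.items = d.items[1:]
	return v, true
}

// race pushes n values from one producer while the given number of
// consumers spin on popTail. It returns how many popHead attempts
// succeeded and how many values were popped in total.
func race(n, consumers int) (nPopHead, total int) {
	var (
		d      lockedDeque
		wg     sync.WaitGroup
		popped int64
	)
	done := make(chan struct{})

	for i := 0; i < consumers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				// Sample done before popping, so a failed pop after
				// done is closed means the queue is fully drained.
				finished := false
				select {
				case <-done:
					finished = true
				default:
				}
				if _, ok := d.popTail(); ok {
					atomic.AddInt64(&popped, 1)
					continue
				}
				if finished {
					return
				}
				runtime.Gosched()
			}
		}()
	}

	for j := 0; j < n; j++ {
		d.pushHead(j)
		if j%10 == 0 {
			// The window: a consumer may take j via popTail
			// before this popHead gets to run.
			if _, ok := d.popHead(); ok {
				nPopHead++
				atomic.AddInt64(&popped, 1)
			}
		}
	}
	close(done)
	wg.Wait()
	return nPopHead, int(atomic.LoadInt64(&popped))
}

func main() {
	nPopHead, total := race(1000, 9)
	fmt.Printf("popHead succeeded %d of 100 attempts; %d values popped in total\n", nPopHead, total)
}
```

The number of successful popHeads varies from run to run and depends on the machine and scheduler, so the sketch prints it rather than asserting on it. The only invariant is that every pushed value is popped exactly once, whichever end it leaves from.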

aclements · 2019-06-26T18:33:01Z

Or we just remove the nPopHead check.

gopherbot · 2019-06-26T18:43:27Z

Change https://golang.org/cl/183981 mentions this issue: sync: only check for successful PopHeads in long mode
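
Going only by the CL title, the shape of the change can be sketched as below. This is an illustration under assumptions, not the contents of the CL: `checkPopHeads` is a hypothetical helper, and the `short` parameter stands in for what `testing.Short()` would report inside the real test.

```go
package main

import (
	"errors"
	"fmt"
)

// checkPopHeads is a hypothetical helper that reports an error only
// when running in long mode and no popHead ever succeeded. In short
// mode the count is ignored, because so few attempts are made that a
// valid schedule can plausibly starve all of them.
func checkPopHeads(short bool, nPopHead int) error {
	if !short && nPopHead == 0 {
		return errors.New("popHead never succeeded")
	}
	return nil
}

func main() {
	fmt.Println(checkPopHeads(true, 0))  // short mode: prints <nil>
	fmt.Println(checkPopHeads(false, 0)) // long mode: prints popHead never succeeded
}
```

The trade-off is that short mode no longer exercises this particular assertion at all; long mode, which performs enough pushes that zero successes would be far more suspicious, still does.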

bcmills added the NeedsInvestigation label Apr 11, 2019

bcmills added this to the Go1.13 milestone Apr 11, 2019

bradfitz added the NeedsFix, release-blocker, and Testing labels Apr 30, 2019

gopherbot removed the NeedsInvestigation label Apr 30, 2019

bcmills changed the title from "sync: Pool tests flaky on linux-arm64-packet builder" to "sync: Pool tests flaky on arm builders" May 14, 2019

bcmills assigned aclements May 15, 2019

bradfitz mentioned this issue May 28, 2019

sync: TestPoolChain occasionally fails on linux-arm64 #32265

Closed

gopherbot closed this as completed in 9caaac2 Jun 26, 2019

bcmills mentioned this issue Dec 18, 2019

x/build/env/linux-arm64/packet: restrict CPUs to a reasonable number of cores #36170

Closed

golang locked and limited conversation to collaborators Jun 25, 2020

gopherbot added the FrozenDueToAge label Jun 25, 2020

rsc unassigned aclements Jun 23, 2022
