
[3.4] ./test: line 229: kill: (7520) - No such process #14257

Closed
ahrtr opened this issue Jul 22, 2022 · 6 comments

ahrtr commented Jul 22, 2022

I have seen this error multiple times in the 3.4 pipeline:

 {"level":"info","ts":1658452719.9775898,"caller":"tester/cluster.go:585","msg":"closed tester HTTP server","tester-address":"127.0.0.1:9028"}
./test: line 229: kill: (7520) - No such process
./test: line 229: kill: (7521) - No such process
./test: line 229: kill: (7522) - No such process
ETCD_TESTER_EXIT_CODE: 2
FAILED! 'tail -1000 /tmp/etcd-functional-1/etcd.log'
tail: cannot open '/tmp/etcd-functional-1/etcd.log' for reading: No such file or directory
Error: Process completed with exit code 1.

Refer to https://github.com/etcd-io/etcd/runs/7460637442?check_suite_focus=true
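The `kill: (7520) - No such process` lines come from the cleanup step in `./test` sending signals to PIDs whose processes have already exited. A minimal sketch of a defensive pattern (not the actual script; `stop_pids` is a hypothetical helper name) that probes each PID with `kill -0` before signaling, so already-dead processes don't produce noise:

```shell
# Sketch only, not the actual ./test cleanup: signal each background PID
# only if the process is still alive, so kill never prints
# "No such process" for processes that already exited on their own.
stop_pids() {
  for pid in "$@"; do
    if kill -0 "$pid" 2>/dev/null; then
      kill "$pid"
    fi
  done
}
```

The one-liner `kill "$pid" 2>/dev/null || true` achieves the same silencing effect, at the cost of also hiding genuine permission errors.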


xiaods commented Jul 23, 2022

Running the functional test case on the release-3.4 branch produces the error below:

{"level":"info","ts":1658578145.555051,"caller":"tester/stresser.go:56","msg":"creating stresser","type":"KV_TXN_WRITE_DELETE","weight":0.35,"endpoint":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.555056,"caller":"tester/stresser.go:56","msg":"creating stresser","type":"LEASE","weight":0,"endpoint":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.555061,"caller":"tester/cluster.go:298","msg":"added lease stresser","endpoint":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.5550668,"caller":"tester/cluster.go:327","msg":"updated stressers"}
{"level":"info","ts":1658578145.555187,"caller":"tester/cluster.go:458","msg":"sent request","operation":"INITIAL_START_ETCD","to":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.55519,"caller":"tester/cluster.go:458","msg":"sent request","operation":"INITIAL_START_ETCD","to":"127.0.0.1:1379"}
{"level":"info","ts":1658578145.5551932,"caller":"tester/cluster.go:458","msg":"sent request","operation":"INITIAL_START_ETCD","to":"127.0.0.1:2379"}
2022-07-23T20:09:05.555+0800	INFO	agent	agent/handler.go:40	handler success	{"address": "127.0.0.1:39027", "operation": "INITIAL_START_ETCD"}
2022-07-23T20:09:05.555+0800	INFO	agent	agent/handler.go:40	handler success	{"address": "127.0.0.1:29027", "operation": "INITIAL_START_ETCD"}
2022-07-23T20:09:05.555+0800	INFO	agent	agent/handler.go:40	handler success	{"address": "127.0.0.1:19027", "operation": "INITIAL_START_ETCD"}
{"level":"info","ts":1658578145.555926,"caller":"tester/cluster.go:470","msg":"received response","operation":"INITIAL_START_ETCD","from":"127.0.0.1:3379","success":false,"status":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
{"level":"info","ts":1658578145.5559459,"caller":"tester/cluster.go:470","msg":"received response","operation":"INITIAL_START_ETCD","from":"127.0.0.1:2379","success":false,"status":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
{"level":"info","ts":1658578145.5559502,"caller":"tester/cluster.go:470","msg":"received response","operation":"INITIAL_START_ETCD","from":"127.0.0.1:1379","success":false,"status":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
{"level":"fatal","ts":1658578145.55598,"caller":"etcd-tester/main.go:49","msg":"Bootstrap failed","error":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\", \"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\", \"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\"","stacktrace":"main.main\n\t/Users/xiaods/Documents/code/etcd/functional/cmd/etcd-tester/main.go:49\nruntime.main\n\t/usr/local/Cellar/go/1.17.6/libexec/src/runtime/proc.go:255"}
ETCD_TESTER_EXIT_CODE: 1

FAILED! 'tail -1000 /tmp/etcd-functional-1/etcd.log'
tail: cannot open '/tmp/etcd-functional-1/etcd.log' for reading: No such file or directory


lavacat commented Jul 29, 2022

If nobody is looking at this, please assign it to me.


lavacat commented Aug 12, 2022

Still looking at this issue. Quick update: there are a couple of separate investigations.

  1. Two runs failed due to
    {"level":"warn","ts":1658383863.827886,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":2,"case-total":22,"error":"injection error: cluster too slow: only 1961 commits in 6 retries"}
    https://github.com/etcd-io/etcd/runs/7443519017?check_suite_focus=true
    https://github.com/etcd-io/etcd/runs/7436236229?check_suite_focus=true
    in test case BLACKHOLE_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT.
    I was able to reproduce this locally: consistently on a Mac, and once in a docker image. Looking into this further.

  2. This run failed due to
    {"level":"warn","ts":1658869799.140927,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":20,"case-total":22,"error":"wait full health error: etcdserver: too many requests (\"127.0.0.1:1379\")"}
    in test case BLACKHOLE_PEER_PORT_TX_RX_LEADER.
    I haven't tried to reproduce this yet.

  3. There is a setup problem affecting all test cases/branches: SIGQUIT_ETCD_AND_REMOVE_DATA_AND_STOP_AGENT removes the tmp directories that record the cluster logs, so they are never appended to the test output. I'll submit a fix for this independently of the prior two cases.
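The log-loss in point 3 could be avoided by archiving the logs before the cleanup step runs. A sketch under the assumptions that the logs live at the `/tmp/etcd-functional-*/etcd.log` paths seen in the failure output, and that `save_functional_logs` is a hypothetical helper, not part of the actual fix:

```shell
# Sketch: copy functional-test logs to a safe location BEFORE the
# SIGQUIT_ETCD_AND_REMOVE_DATA_AND_STOP_AGENT step deletes the tmp dirs.
# Paths mirror the failure output above; the function name is hypothetical.
save_functional_logs() {
  archive=${1:-/tmp/etcd-functional-logs}
  mkdir -p "$archive"
  for d in /tmp/etcd-functional-*; do
    [ -d "$d" ] || continue
    # Ignore directories that never produced an etcd.log.
    cp "$d/etcd.log" "$archive/$(basename "$d").etcd.log" 2>/dev/null || true
  done
}
```

In CI, the archive directory would then be uploaded as a workflow artifact so the logs survive the cleanup.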


lavacat commented Aug 29, 2022

Let's close this issue.

Quick summary:

  1. No solution for BLACKHOLE_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT (see note below).
  2. Seems to be a one-off.
  3. Filed [test] after failure functional tests can't access logs in /tmp/etcd-functional* #14384

BLACKHOLE_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT

On my local Mac, etcd_disk_wal_fsync_duration_seconds_bucket is quite high, which probably explains the slow cluster and why this test fails for me locally and consistently. It does ~700 commits out of the 2000 needed.
Even if I remove the injected failure 'BLACKHOLE_PEER_PORT_TX_RX', this test never passes; the cluster is too slow for me locally.

From the GitHub Actions logs we can see that it got close to the 2000 commits needed to trigger a snapshot:

cluster too slow: only 1961 commits in 6 retries
cluster too slow: only 1967 commits in 6 retries

So my hypothesis is that it's a noisy-neighbor problem on the GitHub Actions VM, but I can't prove it; I'd probably need to look at metrics.
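One way to gather those metrics: each member exposes the WAL fsync latency histogram in Prometheus format, so a filter over the metrics endpoint would show whether disk fsync is the bottleneck on the runner. A sketch, assuming a member serving metrics on 127.0.0.1:2379 (the address is an assumption, not from the failing runs):

```shell
# Sketch: keep only the WAL fsync latency histogram lines from a
# Prometheus-format metrics dump read on stdin. High counts in the
# large-latency buckets would support the slow-disk / noisy-neighbor theory.
fsync_buckets() {
  grep '^etcd_disk_wal_fsync_duration_seconds'
}

# Example usage (assumes a member exposing metrics on 127.0.0.1:2379):
#   curl -s http://127.0.0.1:2379/metrics | fsync_buckets
```

Comparing the `_bucket{le=...}` counts across local and CI runs would make the "cluster too slow" comparison concrete instead of anecdotal.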


stale bot commented Dec 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 31, 2022
@stale stale bot closed this as completed Apr 2, 2023