
[3.4] ./test: line 229: kill: (7520) - No such process #14257

Closed
ahrtr opened this issue Jul 22, 2022 · 6 comments

ahrtr commented Jul 22, 2022

I have seen this error multiple times in the 3.4 pipeline:

 {"level":"info","ts":1658452719.9775898,"caller":"tester/cluster.go:585","msg":"closed tester HTTP server","tester-address":"127.0.0.1:9028"}
./test: line 229: kill: (7520) - No such process
./test: line 229: kill: (7521) - No such process
./test: line 229: kill: (7522) - No such process
ETCD_TESTER_EXIT_CODE: 2
FAILED! 'tail -1000 /tmp/etcd-functional-1/etcd.log'
tail: cannot open '/tmp/etcd-functional-1/etcd.log' for reading: No such file or directory
Error: Process completed with exit code 1.

Refer to https://github.com/etcd-io/etcd/runs/7460637442?check_suite_focus=true
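The `kill: (7520) - No such process` lines come from the cleanup step in `./test` sending signals to PIDs whose processes have already exited. A minimal sketch of a defensive pattern (not the actual script; `stop_pids` is a hypothetical helper name) that probes each PID with `kill -0` before signaling, so already-dead processes don't produce noise:

```shell
# Sketch only, not the actual ./test cleanup: signal each background PID
# only if the process is still alive, so kill never prints
# "No such process" for processes that already exited on their own.
stop_pids() {
  for pid in "$@"; do
    if kill -0 "$pid" 2>/dev/null; then
      kill "$pid"
    fi
  done
}
```

The one-liner `kill "$pid" 2>/dev/null || true` achieves the same silencing effect, at the cost of also hiding genuine permission errors.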


xiaods commented Jul 23, 2022

Running the functional test case on the release-3.4 branch produces the error below:

{"level":"info","ts":1658578145.555051,"caller":"tester/stresser.go:56","msg":"creating stresser","type":"KV_TXN_WRITE_DELETE","weight":0.35,"endpoint":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.555056,"caller":"tester/stresser.go:56","msg":"creating stresser","type":"LEASE","weight":0,"endpoint":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.555061,"caller":"tester/cluster.go:298","msg":"added lease stresser","endpoint":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.5550668,"caller":"tester/cluster.go:327","msg":"updated stressers"}
{"level":"info","ts":1658578145.555187,"caller":"tester/cluster.go:458","msg":"sent request","operation":"INITIAL_START_ETCD","to":"127.0.0.1:3379"}
{"level":"info","ts":1658578145.55519,"caller":"tester/cluster.go:458","msg":"sent request","operation":"INITIAL_START_ETCD","to":"127.0.0.1:1379"}
{"level":"info","ts":1658578145.5551932,"caller":"tester/cluster.go:458","msg":"sent request","operation":"INITIAL_START_ETCD","to":"127.0.0.1:2379"}
2022-07-23T20:09:05.555+0800	INFO	agent	agent/handler.go:40	handler success	{"address": "127.0.0.1:39027", "operation": "INITIAL_START_ETCD"}
2022-07-23T20:09:05.555+0800	INFO	agent	agent/handler.go:40	handler success	{"address": "127.0.0.1:29027", "operation": "INITIAL_START_ETCD"}
2022-07-23T20:09:05.555+0800	INFO	agent	agent/handler.go:40	handler success	{"address": "127.0.0.1:19027", "operation": "INITIAL_START_ETCD"}
{"level":"info","ts":1658578145.555926,"caller":"tester/cluster.go:470","msg":"received response","operation":"INITIAL_START_ETCD","from":"127.0.0.1:3379","success":false,"status":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
{"level":"info","ts":1658578145.5559459,"caller":"tester/cluster.go:470","msg":"received response","operation":"INITIAL_START_ETCD","from":"127.0.0.1:2379","success":false,"status":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
{"level":"info","ts":1658578145.5559502,"caller":"tester/cluster.go:470","msg":"received response","operation":"INITIAL_START_ETCD","from":"127.0.0.1:1379","success":false,"status":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\""}
{"level":"fatal","ts":1658578145.55598,"caller":"etcd-tester/main.go:49","msg":"Bootstrap failed","error":"\"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\", \"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\", \"INITIAL_START_ETCD\" is not valid; last server operation was \"INITIAL_START_ETCD\"","stacktrace":"main.main\n\t/Users/xiaods/Documents/code/etcd/functional/cmd/etcd-tester/main.go:49\nruntime.main\n\t/usr/local/Cellar/go/1.17.6/libexec/src/runtime/proc.go:255"}
ETCD_TESTER_EXIT_CODE: 1

FAILED! 'tail -1000 /tmp/etcd-functional-1/etcd.log'
tail: cannot open '/tmp/etcd-functional-1/etcd.log' for reading: No such file or directory


lavacat commented Jul 29, 2022

If nobody is looking at this, please assign it to me.


lavacat commented Aug 12, 2022

Still looking at this issue. Quick update: there are a couple of separate investigations.

  1. Two runs failed due to
    {"level":"warn","ts":1658383863.827886,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":2,"case-total":22,"error":"injection error: cluster too slow: only 1961 commits in 6 retries"}
    https://github.com/etcd-io/etcd/runs/7443519017?check_suite_focus=true
    https://github.com/etcd-io/etcd/runs/7436236229?check_suite_focus=true
    in test case BLACKHOLE_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT.
    I was able to reproduce this locally: consistently on a Mac, and once in a docker image. Looking into this further.

  2. This run failed due to
    {"level":"warn","ts":1658869799.140927,"caller":"tester/cluster_run.go:50","msg":"round FAIL","round":0,"case":20,"case-total":22,"error":"wait full health error: etcdserver: too many requests (\"127.0.0.1:1379\")"}
    in test case BLACKHOLE_PEER_PORT_TX_RX_LEADER.
    I haven't tried to reproduce this yet.

  3. There is a setup problem affecting all test cases/branches: SIGQUIT_ETCD_AND_REMOVE_DATA_AND_STOP_AGENT removes the tmp directories that record the cluster logs, so they are never appended to the test output. I'll submit a fix for this independently of the prior two cases.
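The log-loss in point 3 could be avoided by archiving the logs before the cleanup step runs. A sketch under the assumptions that the logs live at the `/tmp/etcd-functional-*/etcd.log` paths seen in the failure output, and that `save_functional_logs` is a hypothetical helper, not part of the actual fix:

```shell
# Sketch: copy functional-test logs to a safe location BEFORE the
# SIGQUIT_ETCD_AND_REMOVE_DATA_AND_STOP_AGENT step deletes the tmp dirs.
# Paths mirror the failure output above; the function name is hypothetical.
save_functional_logs() {
  archive=${1:-/tmp/etcd-functional-logs}
  mkdir -p "$archive"
  for d in /tmp/etcd-functional-*; do
    [ -d "$d" ] || continue
    # Ignore directories that never produced an etcd.log.
    cp "$d/etcd.log" "$archive/$(basename "$d").etcd.log" 2>/dev/null || true
  done
}
```

In CI, the archive directory would then be uploaded as a workflow artifact so the logs survive the cleanup.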


lavacat commented Aug 29, 2022

Let's close this issue.

Quick summary:

  1. No solution for BLACKHOLE_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT (see note below).
  2. Seems to be a one-off.
  3. Filed [test] after failure functional tests can't access logs in /tmp/etcd-functional* #14384

BLACKHOLE_PEER_PORT_TX_RX_LEADER_UNTIL_TRIGGER_SNAPSHOT

On my local Mac, etcd_disk_wal_fsync_duration_seconds_bucket is quite high, which probably explains the slow cluster and why this test fails for me locally and consistently. It does ~700 commits out of the 2000 needed.
Even if I remove the injected failure 'BLACKHOLE_PEER_PORT_TX_RX', this test never passes; the cluster is too slow for me locally.

From the GitHub Actions logs we can see that it got close to the 2000 commits needed to trigger a snapshot:

cluster too slow: only 1961 commits in 6 retries
cluster too slow: only 1967 commits in 6 retries

So my hypothesis is that it's a noisy-neighbor problem on the GitHub Actions VM, but I can't prove it; I'd probably need to look at metrics.
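One way to gather those metrics: each member exposes the WAL fsync latency histogram in Prometheus format, so a filter over the metrics endpoint would show whether disk fsync is the bottleneck on the runner. A sketch, assuming a member serving metrics on 127.0.0.1:2379 (the address is an assumption, not from the failing runs):

```shell
# Sketch: keep only the WAL fsync latency histogram lines from a
# Prometheus-format metrics dump read on stdin. High counts in the
# large-latency buckets would support the slow-disk / noisy-neighbor theory.
fsync_buckets() {
  grep '^etcd_disk_wal_fsync_duration_seconds'
}

# Example usage (assumes a member exposing metrics on 127.0.0.1:2379):
#   curl -s http://127.0.0.1:2379/metrics | fsync_buckets
```

Comparing the `_bucket{le=...}` counts across local and CI runs would make the "cluster too slow" comparison concrete instead of anecdotal.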


stale bot commented Dec 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Dec 31, 2022
@stale stale bot closed this as completed Apr 2, 2023