Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[3.5] member replace e2e test #17123

Merged
merged 1 commit into from
Jan 11, 2024
Merged

Conversation

ZhouJianMS
Copy link
Contributor

Follow up: #17079 (comment)

@k8s-ci-robot
Copy link

Hi @ZhouJianMS. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ahrtr
Copy link
Member

ahrtr commented Dec 15, 2023

can you please also backport the change for test case TestMemberReplace in #17119?

@ZhouJianMS
Copy link
Contributor Author

can you please also backport the change for test case TestMemberReplace in #17119?

Sure, done.

@ahrtr
Copy link
Member

ahrtr commented Dec 15, 2023

@ZhouJianMS Reran the e2e test 4 times, always failed. Can you please download the log and take a look what's the issue?

Screenshot 2023-12-15 at 16 10 21

@ZhouJianMS
Copy link
Contributor Author

ZhouJianMS commented Dec 18, 2023

@ZhouJianMS Reran the e2e test 4 times, always failed. Can you please download the log and take a look what's the issue?

The test is stuck in waiting member killed after executing member remove command. @serathius Does member in etcd 3.5 automatically kill themselves or not?

After adding member.Kill(), the member can be successfully stopped. But I noticed that adding member back failed due to error "the member has been permanently removed from the cluster". Investigating why re-added member got the same ID.

require.NoError(t, err)
require.Equal(t, found, false, "Expected member to be removed")
for member.IsRunning() {
time.Sleep(10 * time.Millisecond)
Copy link
Member

@fuweid fuweid Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For 3.5 release, there is no background goroutine as zombie process's reaper.
I think we can just export Close() here. Otherwise, zombie process is running and member.IsRunning is true until timeout.

// Close waits for the expect process to exit.
// Close currently does not return error if process exited with !=0 status.
// TODO: Close should expose underlying proces failure by default.
func (ep *ExpectProcess) Close() error { return ep.close(false) }
ps -ef | grep 2975576
fuwei    2975576 2973518  0 15:35 pts/1    00:00:03 /tmp/go-build2055634301/b001/e2e.test -test.testlogfile=/tmp/go-build2055634301/b001/testlog.txt -test.paniconexit0 -test.v=true -test.timeout=30m0s -test.run=TestMemberReplace
fuwei    2975587 2975576  0 15:35 ?        00:00:00 [etcd] <defunct>
fuwei    2975588 2975576  0 15:35 pts/39   00:00:02 /home/fuwei/go/src/go.etcd.io/etcd/bin/etcd --name test-0 --listen-client-urls http://localhost:20000 --advertise-client-urls http://localhost:20000 --listen-peer-urls http://localhost:20001 --initial-advertise-peer-urls http://localhost:20001 --initial-cluster-token  --data-dir /tmp/TestMemberReplace206448044/002 --snapshot-count 100000 --experimental-corrupt-check-time 1s --initial-cluster test-0=http://localhost:20001,test-1=http://localhost:20006,test-2=http://localhost:20011
fuwei    2975589 2975576  0 15:35 pts/40   00:00:02 /home/fuwei/go/src/go.etcd.io/etcd/bin/etcd --name test-1 --listen-client-urls http://localhost:20005 --advertise-client-urls http://localhost:20005 --listen-peer-urls http://localhost:20006 --initial-advertise-peer-urls http://localhost:20006 --initial-cluster-token  --data-dir /tmp/TestMemberReplace206448044/003 --snapshot-count 100000 --experimental-corrupt-check-time 1s --initial-cluster test-0=http://localhost:20001,test-1=http://localhost:20006,test-2=http://localhost:20011
fuwei    2975630 2975576  0 15:35 ?        00:00:00 [etcdctl] <defunct>
fuwei    2975657 2975576  0 15:35 ?        00:00:00 [etcdctl] <defunct>
fuwei    2975667 2975576  0 15:35 ?        00:00:00 [etcdctl] <defunct>
fuwei    2982980 2975823  0 15:46 pts/44   00:00:00 grep --color=auto --exclude-dir=.bzr --exclude-dir=CVS --exclude-dir=.git --exclude-dir=.hg --exclude-dir=.svn --exclude-dir=.idea --exclude-dir=.tox 2975576

https://github.com/etcd-io/etcd/blob/5b572f15162d9b61979fb5eed65e65b026917464/pkg/expect/expect.go#L88C8-L88C23

Comment on lines +132 to +143
func patchArgs(args []string, flag, newValue string) []string {
for i, arg := range args {
if strings.Contains(arg, flag) {
args[i] = fmt.Sprintf("--%s=%s", flag, newValue)
return args
}
}
args = append(args, fmt.Sprintf("--%s=%s", flag, newValue))
return args
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the implementation is different from the main branch?

etcd/tests/e2e/utils.go

Lines 139 to 147 in 93530f6

func patchArgs(args []string, flag, newValue string) error {
for i, arg := range args {
if strings.Contains(arg, flag) {
args[i] = fmt.Sprintf("--%s=%s", flag, newValue)
return nil
}
}
return fmt.Errorf("--%s flag not found", flag)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3.5 does not have "initial-cluster-state" as a default argument in e2e framework. So, we need to append it in args.

func (cfg *EtcdProcessClusterConfig) SetInitialOrDiscovery(serverCfg *EtcdServerProcessConfig, initialCluster []string, initialClusterState string) {
if cfg.Discovery == "" && len(cfg.ServerConfig.DiscoveryCfg.Endpoints) == 0 {
serverCfg.InitialCluster = strings.Join(initialCluster, ",")
serverCfg.Args = append(serverCfg.Args, "--initial-cluster="+serverCfg.InitialCluster)
serverCfg.Args = append(serverCfg.Args, "--initial-cluster-state="+initialClusterState)
}

First introduced in commit 6f63f4b

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should backport #16707 to 3.5 if possible.

Please add a comment to explain the difference for now.

Signed-off-by: ZhouJianMS <zhoujian@microsoft.com>
Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Thanks

@ahrtr ahrtr merged commit 880004c into etcd-io:release-3.5 Jan 11, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

5 participants