Bug 1986003: Retry kubeconfig checks, when kube-apiserver is temporarily unavailable #26377

Merged: 1 commit, Aug 18, 2021

soltysh
Contributor

@soltysh soltysh commented Aug 4, 2021

/assign @p0lyn0mial

Since you complained to me about this one some time ago (maybe not necessarily about this particular problem), it's a good starting point 😉 I noticed it fails pretty frequently.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 4, 2021
fmt.Sprintf(`oc --kubeconfig "%s" get namespace kube-system`, kubeconfigPath)).Output()
framework.Logf(out)
// retry error when kube-apiserver was temporarily unavailable
retry := strings.Contains(out, "The connection to the server localhost:6443 was refused")
Contributor

Would you be open to adding a precondition before running the test instead?
I'm not sure this is the only error we might get. For example, on an IPv6 cluster we might get [::1]:6443: connect: connection refused

We could check that the cluster has been in a stable condition (not progressing, pods on the same revision) for X min.
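
The precondition idea above could be sketched roughly as follows. This is only an illustration under stated assumptions: `stableProbe` and `waitForStable` are hypothetical names, the probe stands in for whatever "not progressing, pods on the same revision" check the test would really perform, and "stable for X min" is approximated as a number of consecutive stable polls.

```go
package main

import (
	"fmt"
	"time"
)

// stableProbe is a hypothetical hook. In a real test it would report true
// when the operator is not progressing and all pods are on the same revision.
type stableProbe func() bool

// waitForStable polls probe and returns true once it has reported stable for
// `needed` consecutive polls. It gives up and returns false after maxPolls.
func waitForStable(probe stableProbe, needed, maxPolls int, interval time.Duration) bool {
	consecutive := 0
	for i := 0; i < maxPolls; i++ {
		if probe() {
			consecutive++
			if consecutive >= needed {
				return true
			}
		} else {
			consecutive = 0 // any instability restarts the window
		}
		time.Sleep(interval)
	}
	return false
}

func main() {
	// Simulated cluster: unstable for the first three polls, stable afterwards.
	polls := 0
	probe := func() bool { polls++; return polls > 3 }

	// Prints: true 8
	fmt.Println(waitForStable(probe, 5, 20, time.Millisecond), polls)
}
```

The trade-off raised in the replies applies here: the window adds wall-clock time to every run, even when the cluster was already healthy.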

Contributor Author

Checking the progressing state might slow this down, but we can add that as a last resort. Retries seem simpler, because theoretically we can even pass the test while a rollout is in progress.

Contributor Author

Added a check for IPv6, though.

@p0lyn0mial
Contributor

Since you complained to me about this one some time ago (maybe not necessarily about this particular problem), it's a good starting point 😉 I noticed it fails pretty frequently.

wasn't me :)

framework.Logf("Verifying kubeconfig %q on master %s", master.Name)
out, err := oc.AsAdmin().Run("debug").Args("node/"+master.Name, "--", "chroot", "/host", "/bin/bash", "-euxo", "pipefail", "-c", fmt.Sprintf(`oc --kubeconfig "%s" get namespace kube-system`, kubeconfigPath)).Output()
retry, err := testNode(oc, kubeconfig, master.Name)
if retry {
Contributor

is one retry enough?

Contributor Author

yeah, after consideration I did 2 😉
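
For readers following the thread, a bounded retry along these lines might look like the sketch below. It is not the code from the PR: `checkFunc`, `shouldRetry`, `checkWithRetries`, and `errFailed` are hypothetical names, and the fake check stands in for the `oc debug node/...` call. Only the two substrings are taken from the quoted diff.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errFailed is a placeholder error used by the fake check below.
var errFailed = errors.New("exit status 1")

// checkFunc runs one kubeconfig check and returns the command output.
// It is a hypothetical stand-in for the `oc debug node/...` invocation.
type checkFunc func() (string, error)

// shouldRetry reports whether the output looks like a transient
// "kube-apiserver unavailable" failure (substrings from the quoted diff).
func shouldRetry(out string) bool {
	return strings.Contains(out, "The connection to the server localhost:6443 was refused") ||
		strings.Contains(out, "[::1]:6443: connect: connection refused")
}

// checkWithRetries runs check once, then up to maxRetries more times, but
// only while the failure looks transient. It returns the attempts used.
func checkWithRetries(check checkFunc, maxRetries int) (int, error) {
	attempts := 0
	for {
		attempts++
		out, err := check()
		if err == nil {
			return attempts, nil
		}
		if !shouldRetry(out) || attempts > maxRetries {
			return attempts, fmt.Errorf("check failed after %d attempt(s): %w", attempts, err)
		}
	}
}

func main() {
	// Fake check: refused twice, then succeeds.
	calls := 0
	flaky := func() (string, error) {
		calls++
		if calls <= 2 {
			return "The connection to the server localhost:6443 was refused", errFailed
		}
		return "kube-system   Active", nil
	}

	// Prints: 3 <nil>
	fmt.Println(checkWithRetries(flaky, 2))
}
```

With two retries the check runs at most three times, and any failure that does not look like a refused connection is reported immediately instead of being retried.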

@soltysh
Contributor Author

soltysh commented Aug 9, 2021

@aojea updated, ptal

@soltysh soltysh force-pushed the retry_kubeconfigs branch 2 times, most recently from a3d8ff8 to db592fb Compare August 9, 2021 14:49
@soltysh
Contributor Author

soltysh commented Aug 9, 2021

or @p0lyn0mial

Comment on lines 57 to 58
retry := strings.Contains(out, "The connection to the server localhost:6443 was refused") ||
strings.Contains(out, "[::1]:6443: connect: connection refused")
Contributor

Why are the messages so different between protocols? Just curious; with curl they are more consistent:

$ curl localhost:6443
curl: (7) Failed to connect to localhost port 6443: Connection refused
$ curl [::1]:6443
curl: (7) Failed to connect to ::1 port 6443: Connection refused
$ curl 127.0.0.1:6443
curl: (7) Failed to connect to 127.0.0.1 port 6443: Connection refused

@aojea
Contributor

aojea commented Aug 9, 2021

lgtm, just a question to double check the "retry error messages"

@soltysh soltysh changed the title Retry kubeconfig checks, when kube-apiserver is temporarily unavailable Bug 1986003: Retry kubeconfig checks, when kube-apiserver is temporarily unavailable Aug 16, 2021
@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Aug 16, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2021

@soltysh: This pull request references Bugzilla bug 1986003, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.9.0) matches configured target release for branch (4.9.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Bug 1986003: Retry kubeconfig checks, when kube-apiserver is temporarily unavailable

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from wangke19 August 16, 2021 08:29
@soltysh soltysh force-pushed the retry_kubeconfigs branch from db592fb to dc68551 Compare August 16, 2021 08:56
@soltysh
Contributor Author

soltysh commented Aug 16, 2021

lgtm, just a question to double check the "retry error messages"

I just checked that. You're right, the message should be identical and only the address will change. It's coming from here: https://github.com/kubernetes/kubernetes/blob/cbb5ea8210596ada1efce7e7a271ca4217ae598e/staging/src/k8s.io/kubectl/pkg/cmd/util/helpers.go#L237-L243, so I've updated the PR accordingly.

With:

_, err := net.Dial("tcp", "localhost:6443")
fmt.Println(err)
_, err = net.Dial("tcp6", "[::1]:6443")
fmt.Println(err)

I got:

dial tcp [::1]:6443: connect: connection refused
dial tcp6 [::1]:6443: connect: connection refused
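
Since only the address differs between the IPv4 and IPv6 cases, one way to match both is a single address-agnostic pattern. The sketch below is an assumption for illustration, not the regex that was merged: `connRefused` and `isAPIServerUnavailable` are hypothetical names, and the sample strings are modeled on the message quoted earlier in this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// connRefused matches the "connection to the server ... was refused" message
// for any address. The pattern is illustrative and deliberately ignores the
// rest of the message.
var connRefused = regexp.MustCompile(`The connection to the server \S+ was refused`)

// isAPIServerUnavailable reports whether out contains the refused-connection
// message, regardless of which address was dialed.
func isAPIServerUnavailable(out string) bool {
	return connRefused.MatchString(out)
}

func main() {
	samples := []string{
		"The connection to the server localhost:6443 was refused - did you specify the right host or port?",
		"The connection to the server [::1]:6443 was refused - did you specify the right host or port?",
		`Error from server (Forbidden): namespaces "kube-system" is forbidden`,
	}
	for _, out := range samples {
		fmt.Printf("%t\t%s\n", isAPIServerUnavailable(out), out)
	}
}
```

The first two samples match and the third does not. Leaving the tail of the message out of the pattern also avoids having to escape the trailing question mark.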

fmt.Sprintf(`oc --kubeconfig "%s" get namespace kube-system`, kubeconfigPath)).Output()
framework.Logf(out)
// retry error when kube-apiserver was temporarily unavailable, this matches oc error coming from:
// https://github.com/kubernetes/kubernetes/blob/cbb5ea8210596ada1efce7e7a271ca4217ae598e/staging/src/k8s.io/kubectl/pkg/cmd/util/helpers.go#L237-L243
Contributor

The message you've linked ends with a question mark, but the regex works fine:
https://play.golang.org/p/aBGVR3_3YsQ

Contributor Author

Yeah, I didn't want to be that precise 😉

Contributor

those symbols on regexps 😬
🤣

@aojea
Contributor

aojea commented Aug 16, 2021

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 16, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 16, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aojea, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@soltysh
Contributor Author

soltysh commented Aug 16, 2021

/test e2e-gcp
/test e2e-aws-fips

@soltysh
Contributor Author

soltysh commented Aug 16, 2021

/test e2e-gcp

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@soltysh
Contributor Author

soltysh commented Aug 17, 2021

/override ci/prow/e2e-agnostic-cmd
/override ci/prow/e2e-aws-single-node
These are not required and I'll continue debugging test-cmd separately.

@openshift-ci
Contributor

openshift-ci bot commented Aug 17, 2021

@soltysh: Overrode contexts on behalf of soltysh: ci/prow/e2e-agnostic-cmd, ci/prow/e2e-aws-single-node

In response to this:

/override ci/prow/e2e-agnostic-cmd
/override ci/prow/e2e-aws-single-node
These are not required and I'll continue debugging test-cmd separately.

@soltysh
Contributor Author

soltysh commented Aug 17, 2021

/test e2e-gcp

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@soltysh
Contributor Author

soltysh commented Aug 17, 2021

/test e2e-gcp

@openshift-bot
Contributor

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci openshift-ci bot merged commit 1e66686 into openshift:master Aug 18, 2021
@openshift-ci
Contributor

openshift-ci bot commented Aug 18, 2021

@soltysh: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull request must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh.

Bugzilla bug 1986003 has not been moved to the MODIFIED state.

In response to this:

Bug 1986003: Retry kubeconfig checks, when kube-apiserver is temporarily unavailable

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. lgtm Indicates that a PR is ready to be merged.