Prow jobs are failing with 'Could not resolve host: github.com' #20716

Closed
mborsz opened this issue Feb 3, 2021 · 28 comments
Labels: area/prow, kind/bug, kind/flake, priority/critical-urgent, sig/network, sig/testing

Comments

@mborsz
Member

mborsz commented Feb 3, 2021

What happened:
Many prow jobs started failing with an error like:

Cloning into 'test-infra'...
fatal: unable to access 'https://github.com/kubernetes/test-infra/': Could not resolve host: github.com

e.g. https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability/1356950100881969152
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-build-fast/1356957645537284096/
What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Please provide links to example occurrences, if any:

Anything else we need to know?:
https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-cos-master-scalability-100 suggests that this started happening between 03:43 PST and 04:27 PST.

@mborsz mborsz added the kind/bug Categorizes issue or PR as related to a bug. label Feb 3, 2021
@mborsz
Member Author

mborsz commented Feb 3, 2021

/assign @e-blackwelder

@jkaniuk
Contributor

jkaniuk commented Feb 10, 2021

@k8s-ci-robot k8s-ci-robot added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 10, 2021
@k8s-ci-robot
Contributor

@jkaniuk: You must be a member of the kubernetes/milestone-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your and have them propose you as an additional delegate for this responsibility.

In response to this:

Various jobs are still flaky, examples:

/priority critical-urgent
/sig testing
/sig network
/wg k8s-infra
/milestone v1.21
/kind flake

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added sig/testing Categorizes an issue or PR as relevant to SIG Testing. sig/network Categorizes an issue or PR as relevant to SIG Network. wg/k8s-infra kind/flake Categorizes issue or PR as related to a flaky test. labels Feb 10, 2021
@jprzychodzen
Contributor

It seems that flaking jobs run on cluster k8s-infra-prow-build

@jprzychodzen
Contributor

jprzychodzen commented Feb 16, 2021

It seems that migrating to pod-utils would resolve this problem, since it adds a retry around the git clone command: https://github.com/kubernetes/test-infra/pull/18057/files

EDIT: however, this may just move the DNS problem to other parts of the test.
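
For illustration, a retry around a failing command can be sketched as below. This is a minimal sketch and not the pod-utils implementation; the function name run_with_retry, the attempt count, and the backoff delays are all assumptions made for the example.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, base_delay=1.0,
                   run=subprocess.run, sleep=time.sleep):
    """Run cmd, retrying with exponential backoff while it exits non-zero.

    Returns the number of attempts used; raises RuntimeError if all fail.
    The run and sleep parameters exist only so the helper can be exercised
    without a real process or real waiting.
    """
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Delays grow as base_delay, 2*base_delay, 4*base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"command failed after {attempts} attempts: {cmd}")
```

It could be called as run_with_retry(["git", "clone", "https://github.com/kubernetes/test-infra"]). A retry like this only hides transient DNS failures for the wrapped command; any later step that resolves a hostname is still exposed.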

@jkaniuk
Contributor

jkaniuk commented Feb 16, 2021

It is not only scalability jobs that have this problem.

Shouldn't we enable NodeLocalDNS on the Prow cluster?

@jprzychodzen
Contributor

jprzychodzen commented Feb 16, 2021

/cc @cjwagner as test-infra-oncall

Cole, could you edit kube-dns configmap in cluster k8s-infra-prow-build and scale kube-dns pods? I guess that something like this would help

"data": {
        "linear": "{\"coresPerReplica\":256,\"nodesPerReplica\":8,\"min\":4,\"preventSinglePointFailure\":true}"
    },

Regarding NodeLocalDNS - kubernetes/kubernetes#56903 - it seems that this would increase reliability of DNS, so it should be enabled.

@chaodaiG
Contributor

> /cc @cjwagner as test-infra-oncall
>
> Cole, could you edit kube-dns configmap in cluster k8s-infra-prow-build and scale kube-dns pods? I guess that something like this would help
>
> "data": {
>         "linear": "{\"coresPerReplica\":256,\"nodesPerReplica\":8,\"min\":4,\"preventSinglePointFailure\":true}"
>     },
>
> Regarding NodeLocalDNS - kubernetes/kubernetes#56903 - it seems that this would increase reliability of DNS, so it should be enabled.

This is exactly the patch I mentioned per #20816 (comment), + @BenTheElder for awareness

@cjwagner
Member

@jprzychodzen @chaodaiG There are no existing data entries in the kube-dns configmap, but there is a very similar entry in the kube-dns-autoscaler configmap, is that what you were referring to? I don't want to break DNS by applying this to the wrong configmap so please confirm which is intended before I proceed.

@chaodaiG
Contributor

> @jprzychodzen @chaodaiG There are no existing data entries in the kube-dns configmap, but there is a very similar entry in the kube-dns-autoscaler configmap, is that what you were referring to? I don't want to break DNS by applying this to the wrong configmap so please confirm which is intended before I proceed.

Not entirely sure, but based on my understanding, there are 2 things needed:

  • NodeLocal DNSCache needs to be enabled on the cluster
  • The kube-dns-autoscaler configmap under the kube-system namespace needs to be edited to the value mentioned by @jprzychodzen above (I believe @jprzychodzen mistyped the name of the configmap; please verify).

@jprzychodzen
Contributor

@cjwagner I was just referring to scaling configuration for kube-dns, which of course is controlled in ConfigMap object kube-dns-autoscaler. Please change that value in kube-dns-autoscaler.

For NodeLocal DNSCache, please run:

gcloud container clusters update prow-build --update-addons=NodeLocalDNS=ENABLED

This will enable NodeLocalDNS during the next node upgrade, as mentioned in the GCP documentation.

@spiffxp
Member

spiffxp commented Feb 18, 2021

> For NodeLocal DNSCache, please run gcloud container clusters update prow-build --update-addons=NodeLocalDNS=ENABLED, this will enable NodeLocalDNS during the next node upgrade, as mentioned in GCP documentation.

We use terraform to manage these clusters, will this cause terraform to think it needs to recreate the cluster? e.g.

@chaodaiG
Contributor

> We use terraform to manage these clusters, will this cause terraform to think it needs to recreate the cluster? e.g.

I have drafted a change for this in kubernetes/k8s.io#1680; it shouldn't trigger cluster recreation.

@spiffxp
Member

spiffxp commented Feb 19, 2021

/milestone v1.21

@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 19, 2021
@spiffxp
Member

spiffxp commented Feb 19, 2021

kubernetes/k8s.io#1686 (comment) - the addon should be installed to k8s-infra-prow-build after N hours

regarding editing the config map, where are the numbers coming from? is this something we could check in as a resource to apply automatically?

@spiffxp
Member

spiffxp commented Feb 19, 2021

This is the current setting

$ k --context=gke_k8s-infra-prow-build-trusted_us-central1_prow-build-trusted get configmap -n kube-system kube-dns-autoscaler -o=json | jq .data
{
  "linear": "{\"coresPerReplica\":256,\"nodesPerReplica\":16,\"preventSinglePointFailure\":true}"
}
$ k --context=gke_k8s-infra-prow-build_us-central1_prow-build get configmap -n kube-system kube-dns-autoscaler -o=json | jq .data
{
  "linear": "{\"coresPerReplica\":256,\"nodesPerReplica\":16,\"preventSinglePointFailure\":true}"
}

@chaodaiG
Contributor

> kubernetes/k8s.io#1686 (comment) - the addon should be installed to k8s-infra-prow-build after N hours
>
> regarding editing the config map, where are the numbers coming from? is this something we could check in as a resource to apply automatically?

I have briefly looked at the terraform docs; there doesn't seem to be a way to apply a configmap.

@chaodaiG
Contributor

The numbers came from "trial-and-error", as far as I can guess: basically scaling up the kube-dns pods (which is what kube-dns-autoscaler controls) from 1 pod per 16 nodes to 1 pod per 4-8 nodes.
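
To make the effect of these numbers concrete, here is a rough sketch of how a "linear" setting maps cluster size to a replica count. It assumes the linear mode described in the cluster-proportional-autoscaler documentation (the larger of cores/coresPerReplica and nodes/nodesPerReplica, rounded up, then clamped by min/max, with at least 2 replicas when preventSinglePointFailure is set and there is more than one node). The function name and the cluster sizes are hypothetical, and the real controller may differ in detail.

```python
import json
import math


def linear_replicas(params_json, cores, nodes):
    """Estimate the replica count for a 'linear' autoscaler setting."""
    p = json.loads(params_json)
    replicas = max(
        math.ceil(cores / p["coresPerReplica"]),
        math.ceil(nodes / p["nodesPerReplica"]),
    )
    if "max" in p:
        replicas = min(replicas, p["max"])
    # Assumption: a missing "min" behaves like a floor of 1.
    replicas = max(replicas, p.get("min", 1))
    if p.get("preventSinglePointFailure") and nodes > 1:
        replicas = max(replicas, 2)
    return replicas


# Settings quoted in this thread: the previous one and the proposed one.
OLD = '{"coresPerReplica":256,"nodesPerReplica":16,"preventSinglePointFailure":true}'
NEW = '{"coresPerReplica":256,"nodesPerReplica":8,"min":4,"preventSinglePointFailure":true}'

if __name__ == "__main__":
    # Hypothetical cluster: 64 nodes with 8 cores each.
    print(linear_replicas(OLD, cores=512, nodes=64))
    print(linear_replicas(NEW, cores=512, nodes=64))
```

Under these assumptions, a hypothetical 64-node, 512-core cluster goes from 4 replicas to 8, and a 10-node, 80-core cluster goes from 2 to 4 because of the new "min".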

@spiffxp
Member

spiffxp commented Feb 19, 2021

> I have briefly looked at terraform doc, doesn't seem like there is a way to apply configmap

That's fine, I chose to have the build cluster terraform stop at the "infra" layer. It gets a cluster up and configured, but what is deployed to that cluster is something else's responsibility.

In this case, files in a given cluster's resources dir get kubectl apply'd by a prowjob, much the same way prow.k8s.io is deployed from the cluster/ dir (ref: https://testgrid.k8s.io/wg-k8s-infra-k8sio#post-k8sio-deploy-prow-build-resources)

So if it's just a matter of committing a kube-system configmap file to github and applying it, we're good. I just want to avoid manually editing a file in-cluster.

@chaodaiG
Contributor

Generally sgtm. There is one catch: it's a system configmap whose value is formatted as a string, so a future change to the interface would fail silently.
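
One way to reduce that risk would be a small check run before the configmap is applied. The sketch below is only an illustration: the function name check_linear_params and the set of accepted keys are assumptions (the keys seen in this thread plus two that the cluster-proportional-autoscaler documentation lists for linear mode), not something taken from the k8s.io repo.

```python
import json

# Assumption: keys accepted by the autoscaler's linear mode.
KNOWN_KEYS = {
    "coresPerReplica",
    "nodesPerReplica",
    "min",
    "max",
    "preventSinglePointFailure",
    "includeUnschedulableNodes",
}


def check_linear_params(configmap):
    """Return a list of problems found in configmap['data']['linear']."""
    raw = configmap.get("data", {}).get("linear")
    if not isinstance(raw, str):
        return ["data.linear is missing or not a string"]
    try:
        params = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"data.linear is not valid JSON: {err}"]
    if not isinstance(params, dict):
        return ["data.linear does not decode to an object"]

    problems = [f"unknown key: {key}" for key in sorted(set(params) - KNOWN_KEYS)]
    if not {"coresPerReplica", "nodesPerReplica"} & set(params):
        problems.append("neither coresPerReplica nor nodesPerReplica is set")
    return problems
```

A check like this catches typos in key names and malformed JSON inside the string, but it cannot detect a change in how the autoscaler interprets a key.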

@spiffxp
Member

spiffxp commented Feb 22, 2021

kubernetes/k8s.io#1691 merged, which deployed the configmap changes

# kubectl --context=gke_k8s-infra-prow-build_us-central1_prow-build get configmap -n kube-system kube-dns-autoscaler -o=json | jq -r .data.linear | jq .
{
  "coresPerReplica": 256,
  "nodesPerReplica": 8,
  "min": 4,
  "preventSinglePointFailure": true
}

If you need to make any further changes, please do so by opening PRs against that repo


If I look at https://k8s-testgrid.appspot.com/sig-scalability-gce#gce-cos-master-scalability-100&width=5 as reference, this problem seems to have resolved once the dns cache was enabled (but before this patch was added)?

@spiffxp
Member

spiffxp commented Feb 22, 2021

I'm trying to see if I can find a cloud logging query that shows the extent of this issue, and to confirm it's gone

@spiffxp
Member

spiffxp commented Feb 23, 2021

/close
I'm giving up on the logging query / graph. "Cannot resolve github.com" can appear in a number of different containers/fields, so the logging queries go through TB of data and seem to be erroring out at the moment.

But I also haven't seen any occurrences of "Cannot resolve github.com" since the cluster's nodes were recreated.

Please /reopen if you run into this again
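
For anyone who wants to spot-check for a recurrence from inside a pod without a logging query, a repeated lookup is one option. This is a minimal sketch; the function name probe_dns, the attempt count, and the interval are assumptions, and local resolver caching may hide some failures.

```python
import socket
import time


def probe_dns(host, attempts=10, interval=1.0,
              resolve=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host repeatedly and return the number of failed lookups.

    The resolve and sleep parameters exist only so the probe can be
    exercised without real DNS traffic or real waiting.
    """
    failures = 0
    for i in range(attempts):
        try:
            resolve(host, 443)
        except socket.gaierror:
            # Raised by getaddrinfo when the name cannot be resolved.
            failures += 1
        if i < attempts - 1:
            sleep(interval)
    return failures
```

Calling probe_dns("github.com") returns 0 when every lookup succeeds; a non-zero count on an otherwise healthy node would be a reason to reopen.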

@k8s-ci-robot
Contributor

@spiffxp: Closing this issue.

In response to this:

/close
I cannot ("Cannot resolve github.com" can appear in a number of different containers/fields, so the logging queries go through TB of data and seem to be erroring out at the moment)

But I also haven't seen any occurrences of "Cannot resolve github.com" since the cluster's nodes were recreated.

Please /reopen if you run into this again


@jprzychodzen
Contributor

jprzychodzen commented Feb 23, 2021

Thanks @spiffxp @chaodaiG

Job https://testgrid.k8s.io/sig-scalability-gce#gce-cos-1.19-scalability-100 was flaking a lot; it seems that the problem is now resolved.

EDIT: not -> now

@chaodaiG
Contributor

The most recent failure on https://testgrid.k8s.io/sig-scalability-gce#gce-cos-1.19-scalability-100 was from last Friday.

[Image: Screen Shot 2021-02-23 at 7 04 28 AM]

The only red column was last Friday at 4 AM (https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gci-gce-scalability-stable1/1362733725674115072), which was before @spiffxp updated the nodepool with the fixes (see the timestamp at kubernetes/k8s.io#1686 (comment)).

@jprzychodzen , did you see anything different there?

@jprzychodzen
Contributor

> Jakub Przychodzeń , did you see anything different there?

Sorry, a typo in my comment. I've fixed it. Thanks again for handling this issue.

s/not/now changes meaning a lot ;-)

9 participants