Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobs erroring out due to failure to fetch token #20816

Closed
BenTheElder opened this issue Feb 10, 2021 · 14 comments
Closed

jobs erroring out due to failure to fetch token #20816

BenTheElder opened this issue Feb 10, 2021 · 14 comments
Assignees
Labels
area/prow Issues or PRs related to prow kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/testing Categorizes an issue or PR as relevant to SIG Testing.
Milestone

Comments

@BenTheElder
Copy link
Member

What happened:

Seeing jobs fail early on with no logs more and more often lately, and not signs we've seen before like "failed scheduling" (need to increase nodes) or github clone flakes (kubernetes/kubernetes is a huge clone, still need to mitigate that ##18226).

Instead we're seeing:

 - containerID: containerd://2372ffea321a2dec129d222dc4b43ee8774b4c0528a2e36d9271c6a51819b35e
    image: gcr.io/k8s-prow/initupload:v20210209-12c3370f77
    imageID: gcr.io/k8s-prow/initupload@sha256:4108bc2d7ddb6a2272d525b060d305edd5f9575430ca022ceef22475140ce713
    lastState: {}
    name: initupload
    ready: false
    restartCount: 0
    state:
      terminated:
        containerID: containerd://2372ffea321a2dec129d222dc4b43ee8774b4c0528a2e36d9271c6a51819b35e
        exitCode: 1
        finishedAt: "2021-02-10T07:02:42Z"
        message: |
          iter close error: Post \"https://storage.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026name=pr-logs%2Fpull%2F98948%2Fpull-kubernetes-bazel-build%2Flatest-build.txt\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout writer close error: Post \"https://storage.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026name=pr-logs%2Fpull%2F98948%2Fpull-kubernetes-bazel-build%2F1359393190783750144%2Fclone-log.txt\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout upload error: writer close error: Post \"https://storage.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026name=pr-logs%2Fpull%2F98948%2Fpull-kubernetes-bazel-build%2F1359393190783750144%2Fclone-records.json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout writer close error: Post \"https://storage.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026name=pr-logs%2Fpull%2F98948%2Fpull-kubernetes-bazel-build%2F1359393190783750144%2Fbuild-log.txt\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout writer close error: Post \"https://storage.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026name=pr-logs%2Fpull%2F98948%2Fpull-kubernetes-bazel-build%2F1359393190783750144%2Ffinished.json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart\": oauth2: cannot fetch token: Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout]","file":"prow/cmd/initupload/main.go:45","func":"main.main","level":"fatal","msg":"Failed to initialize job","severity":"fatal","time":"2021-02-10T07:02:42Z"}
        reason: Error
        startedAt: "2021-02-10T06:46:42Z"

Which is not really surfaced to the user, unless they know to go digging into the recorded pod state. Instead you just see:

Test started today at 10:46 PM failed after 16m35s. (more info)

Setting aside the UX, we should fix this error, we're seeing it a lot, even on jobs that are running with guaranteed QoS.

What you expected to happen:

  • Ideally the failure mode should be surfaced in deck
  • Even more ideally this should not be causing jobs to fail / the root cause should be mitigated

How to reproduce it (as minimally and precisely as possible):

This is happening pretty often I think, but it's hard to pin down as the job state is just "failed" and the runtime is ~ 16m36s (which seems oddly specific?)

with https://prow.k8s.io/?repo=kubernetes%2Fkubernetes&type=presubmit&state=failure I see 28 hits for 16m36s

Please provide links to example occurrences, if any:

kubernetes/kubernetes#98768
kubernetes/kubernetes#98948 (comment)

Anything else we need to know?:

/area prow

@BenTheElder BenTheElder added the kind/bug Categorizes issue or PR as related to a bug. label Feb 10, 2021
@k8s-ci-robot k8s-ci-robot added the area/prow Issues or PRs related to prow label Feb 10, 2021
@BenTheElder
Copy link
Member Author

cc @chaodaiG per go.k8s.io/oncall

@BenTheElder
Copy link
Member Author

I would move kubernetes/kubernetes#98768 here, but CI signal is probably tracking that one, and it predates this. Filing here with additional details and for visibility.

@chaodaiG
Copy link
Contributor

The initupload logs are the tail logs, I may need to see the full log for better understanding. Which cluster does this use? How can I get access to it? @BenTheElder

@chaodaiG
Copy link
Contributor

Post \"https://oauth2.googleapis.com/token\": dial tcp: i/o timeout, I have searched a little bit, and there are a couple of possibilities:

  • https://oauth2.googleapis.com/token has intermittent connection issue. The is not very likely, as it could be mitigated by Podutils: Retry uploads #19884 .
  • Network issue on the cluster.
  • DNS issue. I have seen freshly created node having short period of DNS issue, does this build cluster happen to have node autoscaling turned on?

@spiffxp
Copy link
Member

spiffxp commented Feb 10, 2021

@chaodaiG

The initupload logs are the tail logs, I may need to see the full log for better understanding. Which cluster does this use? How can I get access to it?

if the job is using cluster: k8s-infra-prow-build you'll need to look at the k8s-infra-prow-build project in the kuberentes.io gcp org

owner access to this project (and all other prow-related projects in kubernetes.io) is defined by https://github.com/kubernetes/k8s.io/blob/5b2304f875d8b877764a7bde5398eb6f4ffb0494/groups/groups.yaml#L645-L665

viewer access is defined by https://github.com/kubernetes/k8s.io/blob/5b2304f875d8b877764a7bde5398eb6f4ffb0494/groups/groups.yaml#L667-L701

I'll open a PR to reconcile the #test-infra-oncall entries, they look stale to me

@chaodaiG
Copy link
Contributor

Thank you @spiffxp , please tag me on the PR

@spiffxp
Copy link
Member

spiffxp commented Feb 10, 2021

@chaodaiG opened kubernetes/k8s.io#1636 and cc'ed you

@spiffxp
Copy link
Member

spiffxp commented Feb 10, 2021

/priority important-soon
/sig testing
/wg k8s-infra
/milestone v1.21
/kind flake
re "more and more often lately" in description

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Feb 10, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.21 milestone Feb 10, 2021
@k8s-ci-robot k8s-ci-robot added wg/k8s-infra kind/flake Categorizes issue or PR as related to a flaky test. labels Feb 10, 2021
chaodaiG added a commit to chaodaiG/test-infra that referenced this issue Feb 11, 2021
chaodaiG added a commit to chaodaiG/test-infra that referenced this issue Feb 11, 2021
@chaodaiG
Copy link
Contributor

Some stats, scanning through all prow jobs in the past 24 hours (~24000), by downloading podinfo.json from gcs, so far only finished half of them:

  • total: 12476
  • failed: 2874
  • initupload failure: 85

@spiffxp
Copy link
Member

spiffxp commented Feb 11, 2021

oauth2: cannot fetch token started showing up in logs for k8s-infra-prow-build sometime late Feb 2, early Feb 3.

https://console.cloud.google.com/logs/query;query=%22oauth2:%20cannot%20fetch%20token%22%0Aresource.labels.container_name%3D%22initupload%22;timeRange=P14D?project=k8s-infra-prow-build
Screen Shot 2021-02-11 at 1 53 54 AM

But nothing of the sort for k8s-infra-prow-build-trusted
https://console.cloud.google.com/logs/query;query=%22oauth2:%20cannot%20fetch%20token%22%0Aresource.labels.container_name%3D%22initupload%22;timeRange=2021-02-02T22:14:00.000Z%2F2021-02-11T08:00:00.000Z?project=k8s-infra-prow-build-trusted

Nor k8s-prow or k8s-prow-builds

What changed?

  • PR's merged within that time period were irrelevant (k8s.gcr.io manifests, kubernets.io group updates)
  • So unless someone manually ran commands, cluster resources and underlying weren't intentionally changed
  • It may be possible this was our GKE maintenance window, and nodes were upgraded?
  • Would this correspond to a change in code in the initupload container?
  • (deferring looking at audit logs and trying to narrow down the time range to tomorrow)

@BenTheElder
Copy link
Member Author

It may be possible this was our GKE maintenance window, and nodes were upgraded?

Nodes from Jan 22nd exist still in greenhouse's pool, but the autoscaled node pool has all nodes from today.

Checking the cluster operation logs: It looks like @spiffxp upgraded the cluster on the 10th. I don't see any cluster operations around February 2nd at all.
https://console.cloud.google.com/logs/viewer?project=k8s-infra-prow-build&organizationId=758905017065&minLogLevel=0&expandAll=false&timestamp=2021-02-12T04:11:49.687000000Z&customFacets=&limitCustomFacetWidth=false&dateRangeEnd=2021-02-12T04:09:56.858Z&interval=PT1H&resource=gke_cluster%2Fcluster_name%2Fprow-build%2Flocation%2Fus-central1&dateRangeUnbound=backwardInTime&scrollTimestamp=2021-02-11T05:51:11.338541094Z

I don't think this is the case.

@BenTheElder
Copy link
Member Author

BenTheElder commented Feb 12, 2021

Nor k8s-prow or k8s-prow-builds

Is k8s-infra-prow-builds the only one with autoscaling? (I think so?)

If so I suspect this is an issue with autoscaling the in-cluster DNS. @chaodaiG mentioned a workaround for this applied to another prow cluster.

EDIT: and this has been a fairly obvious suspect for some time IMHO, but I forgot to mention.
We should probably try the workaround there (adjusting the DNS scaling parameters).

@spiffxp
Copy link
Member

spiffxp commented Feb 22, 2021

/close
Same query link as above... seems to have stopped happening near 2020-02-19 20:00:00Z
Screen Shot 2021-02-22 at 12 06 05 PM

How does this line up with updating k8s-infra-prow-build?

So the problem stopped due to one of:

  • node recreation
  • adding node local dns (which required node recreation to take effect)
  • upgrading nodes to v1.17 (which required node recreation to take effect)
  • something completely unrelated to updating the cluster

It took about 4.5h to fully recreate nodes, I opted not to try these out one-by-one to determine exactly which one would solve it

@k8s-ci-robot
Copy link
Contributor

@spiffxp: Closing this issue.

In response to this:

/close
Same query link as above... seems to have stopped happening near 2020-02-19 20:00:00Z
Screen Shot 2021-02-22 at 12 06 05 PM

How does this line up with updating k8s-infra-prow-build?

So the problem stopped due to one of:

  • node recreation
  • adding node local dns (which required node recreation to take effect)
  • upgrading nodes to v1.17 (which required node recreation to take effect)
  • something completely unrelated to updating the cluster

It took about 4.5h to fully recreate nodes, I opted not to try these out one-by-one to determine exactly which one would solve it

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/prow Issues or PRs related to prow kind/bug Categorizes issue or PR as related to a bug. kind/flake Categorizes issue or PR as related to a flaky test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. sig/testing Categorizes an issue or PR as relevant to SIG Testing.
Projects
None yet
Development

No branches or pull requests

4 participants