Restore periodic prow job for upgrade tests #5193

lbernick · 2022-07-21T17:35:01Z

Recently, we have had a few issues of regressions that were not detected by CI.

For example, #4834 introduced changes that caused a new client to be incompatible with an old server.

In addition, some breaking changes were introduced where resources already stored on a cluster became invalid when the server version was upgraded. #4779 added new defaults via the mutating webhook, and then #4818 validated the existence of these defaults. However, existing resources didn't receive these defaults, so they failed validation.

We already have some upgrade tests.
These tests have two parts:

Install a new version on a cluster with an older server version, and run integration tests (using the newer version client) on this newer version server
Create some resources on a cluster with an older server version, then upgrade the cluster to use a newer server version. Run integration tests on this cluster.

We should add a step where we run integration tests on the older server version (before upgrading the server). This will address the first type of regression detailed here. When we do make a breaking change purposefully (after deprecating a field), I would imagine we'd remove tests for this functionality, so I don't think adding this check would stop us from removing deprecated fields.

The second type of regression (existing resources becoming invalid) seems like it should have been caught by the upgrade tests. I'm not sure why these tests didn't catch this issue-- maybe upgrading the server reapplies the webhook defaults.

yuzp1996 · 2022-07-28T04:49:22Z

Hello! I run e2e on my machine and try to figure out how the e2e test works. Now I think I am a little familiar with this e2e test. But when I run the e2e test I found some weird things and was not sure if it is a bug.

First, there is some unnormal log and I find the related issue #4777 but seems like the bug is not resolved. But I think it is a good chance to fix it and happy to do this.

Second I did not find the function apply_resources in the repository tektoncd/pipeline or tektoncd/plumbing, I guess maybe the function is deleted but forget to delete the reference here ?

pipeline/test/e2e-tests-upgrade.sh

Line 69 in c1d4702

apply_resources ${test}

yuzp1996 · 2022-07-28T05:02:41Z

I noticed that in the e2e-tests-upgrade.sh the older version is v0.5.2.

pipeline/test/e2e-tests-upgrade.sh

Line 28 in c1d4702

PREVIOUS_PIPELINE_VERSION=v0.5.2

I think maybe v0.5.2 is too old that is not suitable to be the older version in the upgrade e2e test? I guess maybe no one will upgrade the server to the latest from the oldest version in their production environment?

The above are some of my humble opinions, not sure if it is reasonable. 😂

yuzp1996 · 2022-07-28T05:20:11Z

I find that the e2e-tests-upgrade.sh script is referenced here as a cron job but I don't find the related running job in tektoncd/pipeline github actions.

https://github.com/tektoncd/plumbing/blob/c5b0f73a1b5e2ca0025d21ac9b7d69a3f7843506/prow/config.yaml#L1829

Could you help show me where I can get the relevant running job? @lbernick Thanks very much!

afrittoli · 2022-07-28T05:33:40Z

Upgrade testing is something we've been discussing with @abayer in the productivity WG.
It's definitely something that we need to implement, we should have solid tests before v1.0.

We also need to make sure we have enough visibility on periodic jobs. If an upgrade job is running it has not been maintained or seen by anyone for a long time afaik.

We where discussing perhaps we should include it in the release process.

dibyom · 2022-07-28T18:34:56Z

@yuzp1996 Nice detective work! I agree pipeline v0.5.2 is too old and we should definitely make sure the cron job is 1) running and 2) sending notifications somewhere when it fails

lbernick · 2022-07-28T20:14:01Z

Looking at our prow dashboard, upgrade tests have been failing with fatal: destination path '/home/prow/go/src/github.com/google/licenseclassifier' already exists and is not an empty directory. I'm guessing they've been broken for some time and we haven't noticed.

Since these tests appear to be broken anyway, might be a good opportunity to move them to Tekton :)

yuzp1996 · 2022-07-29T22:06:52Z

Looking at our prow dashboard, upgrade tests have been failing with fatal: destination path '/home/prow/go/src/github.com/google/licenseclassifier' already exists and is not an empty directory. I'm guessing they've been broken for some time and we haven't noticed.

Since these tests appear to be broken anyway, might be a good opportunity to move them to Tekton :)

I am sorry I don't understand what moving them to Tekton means. Is it something like dogfooding? Another CI system based on Tekton?

yuzp1996 · 2022-07-31T12:25:01Z

Hi, @lbernick @afrittoli looks like there will be a large change for the e2e upgrade test. I am interested in e2e upgrade test. Is there anything I can do to help now? 😁

lbernick · 2022-08-02T15:34:34Z

I am sorry I don't understand what moving them to Tekton means. Is it something like dogfooding? Another CI system based on Tekton?

What I meant here is that instead of having this test run on prow, we can run it on our dogfooding kubernetes cluster via a Tekton TaskRun or PipelineRun. One of our medium-term goals is to move Tekton test infrastructure away from using prow and use Tekton to test Tekton instead.

I am interested in e2e upgrade test. Is there anything I can do to help now? 😁

A great first step would be seeing if you can get the existing scripts working on a k8s cluster! It looks like the tests are currently broken with the error message I've pasted above. Once we have a working script we can get started on adding this as a recurring TaskRun/PipelineRun in our dogfooding cluster.

yuzp1996 · 2022-08-03T13:53:45Z

I am sorry I don't understand what moving them to Tekton means. Is it something like dogfooding? Another CI system based on Tekton?

What I meant here is that instead of having this test run on prow, we can run it on our dogfooding kubernetes cluster via a Tekton TaskRun or PipelineRun. One of our medium-term goals is to move Tekton test infrastructure away from using prow and use Tekton to test Tekton instead.

I am interested in e2e upgrade test. Is there anything I can do to help now? 😁

A great first step would be seeing if you can get the existing scripts working on a k8s cluster! It looks like the tests are currently broken with the error message I've pasted above. Once we have a working script we can get started on adding this as a recurring TaskRun/PipelineRun in our dogfooding cluster.

OK! I commented the initialize $@ to avoid creating GCP cluster when I run the script locally and found the e2e upgrade test worked well.

pipeline/test/e2e-tests-upgrade.sh

Line 32 in 3c5d33b

initialize $@

Looks like something went wrong in function initialize() I will try to find out why.
https://github.com/tektoncd/plumbing/blob/75f7d7e959de11479a64bb5a05f4f66217a22ef3/scripts/e2e-tests.sh#L357

yuzp1996 · 2022-08-09T05:46:16Z

I find that Prow job will run the command /usr/local/bin/entrypoint.sh first

Then I pull the image to find out if the files /home/prow/go/src/github.com/google/licenseclassifier exist. Looks like we do not need to clone https://github.com/google/licenseclassifier to GOPATH anymore!

Related Issue:
google/licenseclassifier#20
tektoncd/plumbing#349

Related file:
entrypoint.sh
https://github.com/tektoncd/plumbing/blob/5dbf210387217e509550ce1c522f11a854058d10/tekton/images/test-runner/entrypoint.sh#L21

Solution:
I want to add a judgement before cloning the repository. This preserves the greatest possible compatibility.

yuzp1996 · 2022-08-10T01:37:32Z

I made a change in tektoncd/plumbing#1175 and maybe it can resolve the problem when running the e2e-upgrade-test.sh.

If directory ${GOPATH}/src/github.com/google/licenseclassifier exists and is not empty we should skip clone again or will cause an error. Related Issue: tektoncd/pipeline#5193 Signed-off-by: yuzhipeng <yuzp1996@qq.com>

lbernick · 2022-08-10T14:33:05Z

I noticed our unit tests also clone the license classifier but don't run into the same problem-- check out the first log line here. Also, the upgrade tests use a version of the test-runner that is 4 months older than the version the rest of our CI uses, and uses go 1.16 instead of go 1.18. I wonder if this could be related, but I don't have a way to know what was in those images. @afrittoli @abayer can you comment here?

If directory ${GOPATH}/src/github.com/google/licenseclassifier exists and is not empty we should skip clone again or will cause an error. Related Issue: tektoncd/pipeline#5193 Signed-off-by: yuzhipeng <yuzp1996@qq.com>

lbernick · 2022-08-29T13:32:53Z

@abayer are you still able to update the prow config to use the new image? Hoping that upgrade tests can run successfully after this change, but even if they do I think it's worth making their failures more discoverable (e.g. post on tekton slack).

abayer · 2022-08-29T13:34:25Z

I'm pretty sure I already did, lemme check.

abayer · 2022-08-29T13:35:06Z

Yup, I did. =)

lbernick · 2022-08-29T13:48:40Z

It looks like upgrade tests are still failing on the prow dashboard with

Cloning into '/home/prow/go/src/github.com/google/licenseclassifier'...
Activated service account credentials for: [prow-account@tekton-releases.iam.gserviceaccount.com]
fatal: not a git repository (or any of the parent directories): .git

which now that I look at it is a different error than before-- so might need some more updates to the image

JeromeJu · 2022-10-06T14:27:18Z

This comment updates the current progress:

kubetest as dependency
- when run locally kubetest would need to be installed. I think this happened as upgrade tests script is ran differently from other e2e tests.
- kubetest is deprecated as in https://github.com/kubernetes/test-infra/blob/master/kubetest/README.md]
connection timeouts
- I found some of the tests are flakey due to connection timeout in:
  https://paste.googleplex.com/5897102004584448

JeromeJu · 2022-10-13T13:49:31Z

When disabling initialize, the test could run locally.
When set Previous version at v0.40.2 and deploy on the current devel, the following error occurred when upgrading to the current release:

Warning: autoscaling/v2beta1 HorizontalPodAutoscaler is deprecated in v1.22+, unavailable in v1.25+; use autoscaling/v2 HorizontalPodAutoscaler
horizontalpodautoscaler.autoscaling/tekton-pipelines-webhook configured
deployment.apps/tekton-pipelines-webhook configured
service/tekton-pipelines-webhook configured
error when retrieving current configuration of:
Resource: "/v1, Resource=serviceaccounts", GroupVersionKind: "/v1, Kind=ServiceAccount"
Name: "tekton-pipelines-resolvers", Namespace: "tekton-pipelines-resolvers"
from server for: "STDIN": Get "https://35.202.119.160/api/v1/namespaces/tekton-pipelines-resolvers/serviceaccounts/tekton-pipelines-resolvers": dial tcp 35.202.119.160:443: connect: connection timed out
error when retrieving current configuration of:
Resource: "rbac.authorization.k8s.io/v1, Resource=clusterrolebindings", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding"
Name: "tekton-pipelines-resolvers", Namespace: ""
from server for: "STDIN": Get "https://35.202.119.160/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/tekton-pipelines-resolvers": dial tcp 35.202.119.160:443: connect: connection timed out
error when retrieving current configuration of:
Resource: "rbac.authorization.k8s.io/v1, Resource=rolebindings", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=RoleBinding"
Name: "tekton-pipelines-resolvers-namespace-rbac", Namespace: "tekton-pipelines-resolvers"
from server for: "STDIN": Get "https://35.202.119.160/apis/rbac.authorization.k8s.io/v1/namespaces/tekton-pipelines-resolvers/rolebindings/tekton-pipelines-resolvers-namespace-rbac": dial tcp 35.202.119.160:443: connect: connection timed out
error when retrieving current configuration of:
Resource: "/v1, Resource=configmaps", GroupVersionKind: "/v1, Kind=ConfigMap"
Name: "bundleresolver-config", Namespace: "tekton-pipelines-resolvers"
from server for: "STDIN": Get "https://35.202.119.160/api/v1/namespaces/tekton-pipelines-resolvers/configmaps/bundleresolver-config": dial tcp 35.202.119.160:443: connect: connection timed out
error when retrieving current configuration of:
Resource: "/v1, Resource=configmaps", GroupVersionKind: "/v1, Kind=ConfigMap"Name: "config-logging", Namespace: "tekton-pipelines-resolvers"
from server for: "STDIN": Get "https://35.202.119.160/api/v1/namespaces/tekton-pipelines-resolvers/configmaps/config-logging": dial tcp 35.202.119.160:443: connect: connection timed out
error when retrieving current configuration of:
Resource: "/v1, Resource=configmaps", GroupVersionKind: "/v1, Kind=ConfigMap"
Name: "config-observability", Namespace: "tekton-pipelines-resolvers"
from server for: "STDIN": Get "https://35.202.119.160/api/v1/namespaces/tekton-pipelines-resolvers/configmaps/config-observability": dial tcp 35.202.119.160:443: connect: connection timed out
ERROR: Build pipeline installation failed

This seems to be the expected behavior of the existing upgrade tests where the resources of configmaps are missing.

Thanks to pointers from @lbernick , when testing with a fresh kind cluster without the previous config.
The current output of v0.40.2 upgrading to devel:
Terminal Saved Output.txt

Currently status: tried to figure out the reasons why specific tests failed
Approach: running against
Currently when running against a GKE
Next step: replicate the prow job env and why it succeeded

JeromeJu · 2022-11-16T16:51:01Z

Current update:

Trying to debug

/usr/local/bin/runner.sh: line 120: ./test/e2e-tests-upgrade.sh: No such file or directory

Previous assumptions:

might resolve with the kind cluster change

Approaches to this question:

not sure if the --repo arg worked for catalog test or if it's path_alias
why it didn't work for the upgrade test

And my questions are about:

where does it clone the --repo if it's what worked for the catalog test (didn't seem to find it in the runner.sh https://github.com/tektoncd/plumbing/blob/main/tekton/images/test-runner/runner.sh#L100)
Looking at how prow handles the clone (or say how that clone-log.txt got generated)

JeromeJu · 2022-11-22T16:49:14Z

/close
Closing as currently the upgrade test is running in prow at https://prow.tekton.dev/view/gs/tekton-prow/logs/ci-tekton-pipeline-upgrade-tests/1594874572958601216. While #5782 is created to track down the following improvement for regression.

tekton-robot · 2022-11-22T16:49:16Z

@JeromeJu: You can't close an active issue/PR unless you authored it or you are a collaborator.

In response to this:

/close
Closing as currently the upgrade test is running in prow at https://prow.tekton.dev/view/gs/tekton-prow/logs/ci-tekton-pipeline-upgrade-tests/1594874572958601216. While #5782 is created to track down the following improvement for regression.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

lbernick · 2023-06-21T13:35:53Z

commenting for the record, looks like tektoncd/plumbing#1279 is the PR that got upgrade tests working.

Scenario 2 of the upgrade test aims to install pipeline server from the preivous release, create resources from that version and test on those resources against the client of the current release. This commit restores the apply_resources function to enable the 2nd scenario of the upgrade test work by applying the resources from the old server version of pipeline. It was preivously renamed and removed in tektoncd#2685. It fixes the missing piece where the required resources from the previous release have not been created to accomplish scenario 2 of the upgrade test. part of: tektoncd#5193

Scenario 2 of the upgrade test aims to install pipeline server from the preivous release, create resources from that version and test on those resources against the client of the current release. This commit restores the apply_resources function to enable the 2nd scenario of the upgrade test work by applying the resources from the old server version of pipeline. It was preivously renamed and removed in tektoncd#2685. It fixes the missing piece where the required resources from the previous release have not been created to accomplish scenario 2 of the upgrade test. More specifically, it creates resources in v1 examples that are not under beta or no-ci. part of: tektoncd#5193

lbernick added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 21, 2022

lbernick added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 28, 2022

lbernick added this to Pipelines V1 Aug 2, 2022

yuzp1996 mentioned this issue Aug 10, 2022

Fix: determine whether the destination path already exists before clone tektoncd/plumbing#1175

Merged

2 tasks

lbernick moved this to Todo in Pipelines V1 Aug 16, 2022

tekton-robot assigned JeromeJu Sep 14, 2022

xchapter7x added this to Tekton Community Roadmap Sep 20, 2022

xchapter7x moved this to Todo in Tekton Community Roadmap Sep 20, 2022

lbernick mentioned this issue Oct 11, 2022

Integration tests for upgrading API versions #5627

Closed

This was referenced Oct 17, 2022

Add prerequisites for running upgrade tests locally #5649

Merged

Upgrade test runner tektoncd/plumbing#1247

Closed

JeromeJu mentioned this issue Nov 7, 2022

[upgrade test] Change to Kind cluster and Unfixed upgrade test release version #5726

Merged

3 tasks

lbernick moved this from Todo to In Progress in Pipelines V1 Nov 7, 2022

JeromeJu mentioned this issue Nov 22, 2022

Add client/server compatibility tests #5782

Closed

1 task

lbernick closed this as completed Nov 22, 2022

Repository owner moved this from In Progress to Done in Pipelines V1 Nov 22, 2022

Repository owner moved this from Todo to Done in Tekton Community Roadmap Nov 22, 2022

XinruZhang mentioned this issue Dec 20, 2022

Can't update to V0.42 - ResolutionRequest: conversion failed to version v1beta1 #5873

Closed

JeromeJu mentioned this issue May 25, 2023

Cleanup outdated usage for functions in upgrade test #6723

Merged

7 tasks

lbernick changed the title ~~Improve upgrade tests~~ Restore periodic prow job for upgrade tests Jun 26, 2023

lbernick mentioned this issue Jun 26, 2023

Epic: Upgrade test improvements #6870

Open

10 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Restore periodic prow job for upgrade tests #5193

Restore periodic prow job for upgrade tests #5193

lbernick commented Jul 21, 2022

yuzp1996 commented Jul 28, 2022

yuzp1996 commented Jul 28, 2022 •

edited

Loading

yuzp1996 commented Jul 28, 2022 •

edited

Loading

afrittoli commented Jul 28, 2022

dibyom commented Jul 28, 2022

lbernick commented Jul 28, 2022 •

edited

Loading

yuzp1996 commented Jul 29, 2022

yuzp1996 commented Jul 31, 2022

lbernick commented Aug 2, 2022

yuzp1996 commented Aug 3, 2022 •

edited

Loading

yuzp1996 commented Aug 9, 2022 •

edited

Loading

yuzp1996 commented Aug 10, 2022 •

edited

Loading

lbernick commented Aug 10, 2022

lbernick commented Aug 29, 2022

abayer commented Aug 29, 2022

abayer commented Aug 29, 2022

lbernick commented Aug 29, 2022

JeromeJu commented Oct 6, 2022 •

edited

Loading

JeromeJu commented Oct 13, 2022 •

edited

Loading

JeromeJu commented Nov 16, 2022 •

edited

Loading

JeromeJu commented Nov 22, 2022

tekton-robot commented Nov 22, 2022

lbernick commented Jun 21, 2023

Restore periodic prow job for upgrade tests #5193

Restore periodic prow job for upgrade tests #5193

Comments

lbernick commented Jul 21, 2022

yuzp1996 commented Jul 28, 2022

yuzp1996 commented Jul 28, 2022 • edited Loading

yuzp1996 commented Jul 28, 2022 • edited Loading

afrittoli commented Jul 28, 2022

dibyom commented Jul 28, 2022

lbernick commented Jul 28, 2022 • edited Loading

yuzp1996 commented Jul 29, 2022

yuzp1996 commented Jul 31, 2022

lbernick commented Aug 2, 2022

yuzp1996 commented Aug 3, 2022 • edited Loading

yuzp1996 commented Aug 9, 2022 • edited Loading

yuzp1996 commented Aug 10, 2022 • edited Loading

lbernick commented Aug 10, 2022

lbernick commented Aug 29, 2022

abayer commented Aug 29, 2022

abayer commented Aug 29, 2022

lbernick commented Aug 29, 2022

JeromeJu commented Oct 6, 2022 • edited Loading

JeromeJu commented Oct 13, 2022 • edited Loading

JeromeJu commented Nov 16, 2022 • edited Loading

JeromeJu commented Nov 22, 2022

tekton-robot commented Nov 22, 2022

lbernick commented Jun 21, 2023

yuzp1996 commented Jul 28, 2022 •

edited

Loading

yuzp1996 commented Jul 28, 2022 •

edited

Loading

lbernick commented Jul 28, 2022 •

edited

Loading

yuzp1996 commented Aug 3, 2022 •

edited

Loading

yuzp1996 commented Aug 9, 2022 •

edited

Loading

yuzp1996 commented Aug 10, 2022 •

edited

Loading

JeromeJu commented Oct 6, 2022 •

edited

Loading

JeromeJu commented Oct 13, 2022 •

edited

Loading

JeromeJu commented Nov 16, 2022 •

edited

Loading