Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Restore periodic prow job for upgrade tests #5193

Closed
Tracked by #6870
lbernick opened this issue Jul 21, 2022 · 23 comments
Closed
Tracked by #6870

Restore periodic prow job for upgrade tests #5193

lbernick opened this issue Jul 21, 2022 · 23 comments
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.

Comments

@lbernick
Copy link
Member

Recently, we have had a few issues of regressions that were not detected by CI.

For example, #4834 introduced changes that caused a new client to be incompatible with an old server.

In addition, some breaking changes were introduced where resources already stored on a cluster became invalid when the server version was upgraded. #4779 added new defaults via the mutating webhook, and then #4818 validated the existence of these defaults. However, existing resources didn't receive these defaults, so they failed validation.

We already have some upgrade tests.
These tests have two parts:

  1. Install a new version on a cluster with an older server version, and run integration tests (using the newer version client) on this newer version server
  2. Create some resources on a cluster with an older server version, then upgrade the cluster to use a newer server version. Run integration tests on this cluster.

We should add a step where we run integration tests on the older server version (before upgrading the server). This will address the first type of regression detailed here. When we do make a breaking change purposefully (after deprecating a field), I would imagine we'd remove tests for this functionality, so I don't think adding this check would stop us from removing deprecated fields.

The second type of regression (existing resources becoming invalid) seems like it should have been caught by the upgrade tests. I'm not sure why these tests didn't catch this issue-- maybe upgrading the server reapplies the webhook defaults.

@lbernick lbernick added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 21, 2022
@yuzp1996
Copy link
Contributor

Hello! I run e2e on my machine and try to figure out how the e2e test works. Now I think I am a little familiar with this e2e test. But when I run the e2e test I found some weird things and was not sure if it is a bug.

First, there is some unnormal log and I find the related issue #4777 but seems like the bug is not resolved. But I think it is a good chance to fix it and happy to do this.
image

Second I did not find the function apply_resources in the repository tektoncd/pipeline or tektoncd/plumbing, I guess maybe the function is deleted but forget to delete the reference here ?

apply_resources ${test}

@yuzp1996
Copy link
Contributor

yuzp1996 commented Jul 28, 2022

I noticed that in the e2e-tests-upgrade.sh the older version is v0.5.2.

PREVIOUS_PIPELINE_VERSION=v0.5.2

I think maybe v0.5.2 is too old that is not suitable to be the older version in the upgrade e2e test? I guess maybe no one will upgrade the server to the latest from the oldest version in their production environment?

The above are some of my humble opinions, not sure if it is reasonable. 😂

@yuzp1996
Copy link
Contributor

yuzp1996 commented Jul 28, 2022

I find that the e2e-tests-upgrade.sh script is referenced here as a cron job but I don't find the related running job in tektoncd/pipeline github actions.

https://github.com/tektoncd/plumbing/blob/c5b0f73a1b5e2ca0025d21ac9b7d69a3f7843506/prow/config.yaml#L1829

Could you help show me where I can get the relevant running job? @lbernick Thanks very much!

@afrittoli
Copy link
Member

Upgrade testing is something we've been discussing with @abayer in the productivity WG.
It's definitely something that we need to implement, we should have solid tests before v1.0.

We also need to make sure we have enough visibility on periodic jobs. If an upgrade job is running it has not been maintained or seen by anyone for a long time afaik.

We where discussing perhaps we should include it in the release process.

@lbernick lbernick added the priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. label Jul 28, 2022
@dibyom
Copy link
Member

dibyom commented Jul 28, 2022

@yuzp1996 Nice detective work! I agree pipeline v0.5.2 is too old and we should definitely make sure the cron job is 1) running and 2) sending notifications somewhere when it fails

@lbernick
Copy link
Member Author

lbernick commented Jul 28, 2022

Looking at our prow dashboard, upgrade tests have been failing with fatal: destination path '/home/prow/go/src/github.com/google/licenseclassifier' already exists and is not an empty directory. I'm guessing they've been broken for some time and we haven't noticed.

Since these tests appear to be broken anyway, might be a good opportunity to move them to Tekton :)

@yuzp1996
Copy link
Contributor

Looking at our prow dashboard, upgrade tests have been failing with fatal: destination path '/home/prow/go/src/github.com/google/licenseclassifier' already exists and is not an empty directory. I'm guessing they've been broken for some time and we haven't noticed.

Since these tests appear to be broken anyway, might be a good opportunity to move them to Tekton :)

I am sorry I don't understand what moving them to Tekton means. Is it something like dogfooding? Another CI system based on Tekton?

@yuzp1996
Copy link
Contributor

Hi, @lbernick @afrittoli looks like there will be a large change for the e2e upgrade test. I am interested in e2e upgrade test. Is there anything I can do to help now? 😁

@lbernick
Copy link
Member Author

lbernick commented Aug 2, 2022

I am sorry I don't understand what moving them to Tekton means. Is it something like dogfooding? Another CI system based on Tekton?

What I meant here is that instead of having this test run on prow, we can run it on our dogfooding kubernetes cluster via a Tekton TaskRun or PipelineRun. One of our medium-term goals is to move Tekton test infrastructure away from using prow and use Tekton to test Tekton instead.

I am interested in e2e upgrade test. Is there anything I can do to help now? 😁

A great first step would be seeing if you can get the existing scripts working on a k8s cluster! It looks like the tests are currently broken with the error message I've pasted above. Once we have a working script we can get started on adding this as a recurring TaskRun/PipelineRun in our dogfooding cluster.

@yuzp1996
Copy link
Contributor

yuzp1996 commented Aug 3, 2022

I am sorry I don't understand what moving them to Tekton means. Is it something like dogfooding? Another CI system based on Tekton?

What I meant here is that instead of having this test run on prow, we can run it on our dogfooding kubernetes cluster via a Tekton TaskRun or PipelineRun. One of our medium-term goals is to move Tekton test infrastructure away from using prow and use Tekton to test Tekton instead.

I am interested in e2e upgrade test. Is there anything I can do to help now? 😁

A great first step would be seeing if you can get the existing scripts working on a k8s cluster! It looks like the tests are currently broken with the error message I've pasted above. Once we have a working script we can get started on adding this as a recurring TaskRun/PipelineRun in our dogfooding cluster.

OK! I commented the initialize $@ to avoid creating GCP cluster when I run the script locally and found the e2e upgrade test worked well.

initialize $@

Looks like something went wrong in function initialize() I will try to find out why.
https://github.com/tektoncd/plumbing/blob/75f7d7e959de11479a64bb5a05f4f66217a22ef3/scripts/e2e-tests.sh#L357

@yuzp1996
Copy link
Contributor

yuzp1996 commented Aug 9, 2022

I find that Prow job will run the command /usr/local/bin/entrypoint.sh first
image

Then I pull the image to find out if the files /home/prow/go/src/github.com/google/licenseclassifier exist. Looks like we do not need to clone https://github.com/google/licenseclassifier to GOPATH anymore!

image

Related Issue:
google/licenseclassifier#20
tektoncd/plumbing#349

Related file:
entrypoint.sh
https://github.com/tektoncd/plumbing/blob/5dbf210387217e509550ce1c522f11a854058d10/tekton/images/test-runner/entrypoint.sh#L21

Solution:
I want to add a judgement before cloning the repository. This preserves the greatest possible compatibility.

@yuzp1996
Copy link
Contributor

yuzp1996 commented Aug 10, 2022

I made a change in tektoncd/plumbing#1175 and maybe it can resolve the problem when running the e2e-upgrade-test.sh.

yuzp1996 added a commit to yuzp1996/plumbing that referenced this issue Aug 10, 2022
If directory ${GOPATH}/src/github.com/google/licenseclassifier exists and
is not empty we should skip clone again or will cause an error.

Related Issue: tektoncd/pipeline#5193

Signed-off-by: yuzhipeng <yuzp1996@qq.com>
yuzp1996 added a commit to yuzp1996/plumbing that referenced this issue Aug 10, 2022
If directory ${GOPATH}/src/github.com/google/licenseclassifier exists and
is not empty we should skip clone again or will cause an error.

Related Issue: tektoncd/pipeline#5193

Signed-off-by: yuzhipeng <yuzp1996@qq.com>
@lbernick
Copy link
Member Author

I noticed our unit tests also clone the license classifier but don't run into the same problem-- check out the first log line here. Also, the upgrade tests use a version of the test-runner that is 4 months older than the version the rest of our CI uses, and uses go 1.16 instead of go 1.18. I wonder if this could be related, but I don't have a way to know what was in those images. @afrittoli @abayer can you comment here?

yuzp1996 added a commit to yuzp1996/plumbing that referenced this issue Aug 11, 2022
If directory ${GOPATH}/src/github.com/google/licenseclassifier exists and
is not empty we should skip clone again or will cause an error.

Related Issue: tektoncd/pipeline#5193

Signed-off-by: yuzhipeng <yuzp1996@qq.com>
yuzp1996 added a commit to yuzp1996/plumbing that referenced this issue Aug 12, 2022
If directory ${GOPATH}/src/github.com/google/licenseclassifier exists and
is not empty we should skip clone again or will cause an error.

Related Issue: tektoncd/pipeline#5193

Signed-off-by: yuzhipeng <yuzp1996@qq.com>
tekton-robot pushed a commit to tektoncd/plumbing that referenced this issue Aug 12, 2022
If directory ${GOPATH}/src/github.com/google/licenseclassifier exists and
is not empty we should skip clone again or will cause an error.

Related Issue: tektoncd/pipeline#5193

Signed-off-by: yuzhipeng <yuzp1996@qq.com>
@lbernick lbernick moved this to Todo in Pipelines V1 Aug 16, 2022
@lbernick
Copy link
Member Author

@abayer are you still able to update the prow config to use the new image? Hoping that upgrade tests can run successfully after this change, but even if they do I think it's worth making their failures more discoverable (e.g. post on tekton slack).

@abayer
Copy link
Contributor

abayer commented Aug 29, 2022

I'm pretty sure I already did, lemme check.

@abayer
Copy link
Contributor

abayer commented Aug 29, 2022

Yup, I did. =)

@lbernick
Copy link
Member Author

It looks like upgrade tests are still failing on the prow dashboard with

Cloning into '/home/prow/go/src/github.com/google/licenseclassifier'...
Activated service account credentials for: [prow-account@tekton-releases.iam.gserviceaccount.com]
fatal: not a git repository (or any of the parent directories): .git

which now that I look at it is a different error than before-- so might need some more updates to the image

@JeromeJu
Copy link
Member

JeromeJu commented Oct 6, 2022

This comment updates the current progress:

@JeromeJu
Copy link
Member

JeromeJu commented Oct 13, 2022

  • When disabling initialize, the test could run locally.

  • When set Previous version at v0.40.2 and deploy on the current devel, the following error occurred when upgrading to the current release:

Warning: autoscaling/v2beta1 HorizontalPodAutoscaler is deprecated in v1.22+, unavailable in v1.25+; use autoscaling/v2 HorizontalPodAutoscaler
horizontalpodautoscaler.autoscaling/tekton-pipelines-webhook configured
deployment.apps/tekton-pipelines-webhook configured
service/tekton-pipelines-webhook configured
error when retrieving current configuration of:
Resource: "/v1, Resource=serviceaccounts", GroupVersionKind: "/v1, Kind=ServiceAccount"
Name: "tekton-pipelines-resolvers", Namespace: "tekton-pipelines-resolvers"
from server for: "STDIN": Get "https://35.202.119.160/api/v1/namespaces/tekton-pipelines-resolvers/serviceaccounts/tekton-pipelines-resolvers": dial tcp 35.202.119.160:443: connect: connection timed out
error when retrieving current configuration of:
Resource: "rbac.authorization.k8s.io/v1, Resource=clusterrolebindings", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=ClusterRoleBinding"
Name: "tekton-pipelines-resolvers", Namespace: ""
from server for: "STDIN": Get "https://35.202.119.160/apis/rbac.authorization.k8s.io/v1/clusterrolebindings/tekton-pipelines-resolvers": dial tcp 35.202.119.160:443: connect: connection timed out
error when retrieving current configuration of:
Resource: "rbac.authorization.k8s.io/v1, Resource=rolebindings", GroupVersionKind: "rbac.authorization.k8s.io/v1, Kind=RoleBinding"
Name: "tekton-pipelines-resolvers-namespace-rbac", Namespace: "tekton-pipelines-resolvers"
from server for: "STDIN": Get "https://35.202.119.160/apis/rbac.authorization.k8s.io/v1/namespaces/tekton-pipelines-resolvers/rolebindings/tekton-pipelines-resolvers-namespace-rbac": dial tcp 35.202.119.160:443: connect: connection timed out
error when retrieving current configuration of:
Resource: "/v1, Resource=configmaps", GroupVersionKind: "/v1, Kind=ConfigMap"
Name: "bundleresolver-config", Namespace: "tekton-pipelines-resolvers"
from server for: "STDIN": Get "https://35.202.119.160/api/v1/namespaces/tekton-pipelines-resolvers/configmaps/bundleresolver-config": dial tcp 35.202.119.160:443: connect: connection timed out
error when retrieving current configuration of:
Resource: "/v1, Resource=configmaps", GroupVersionKind: "/v1, Kind=ConfigMap"Name: "config-logging", Namespace: "tekton-pipelines-resolvers"
from server for: "STDIN": Get "https://35.202.119.160/api/v1/namespaces/tekton-pipelines-resolvers/configmaps/config-logging": dial tcp 35.202.119.160:443: connect: connection timed out
error when retrieving current configuration of:
Resource: "/v1, Resource=configmaps", GroupVersionKind: "/v1, Kind=ConfigMap"
Name: "config-observability", Namespace: "tekton-pipelines-resolvers"
from server for: "STDIN": Get "https://35.202.119.160/api/v1/namespaces/tekton-pipelines-resolvers/configmaps/config-observability": dial tcp 35.202.119.160:443: connect: connection timed out
ERROR: Build pipeline installation failed

This seems to be the expected behavior of the existing upgrade tests where the resources of configmaps are missing.

Thanks to pointers from @lbernick , when testing with a fresh kind cluster without the previous config.
The current output of v0.40.2 upgrading to devel:
Terminal Saved Output.txt

Currently status: tried to figure out the reasons why specific tests failed
Approach: running against
Currently when running against a GKE
Next step: replicate the prow job env and why it succeeded

@JeromeJu
Copy link
Member

JeromeJu commented Nov 16, 2022

Current update:

  • Trying to debug
/usr/local/bin/runner.sh: line 120: ./test/e2e-tests-upgrade.sh: No such file or directory

Previous assumptions:

  • might resolve with the kind cluster change

Approaches to this question:

  • not sure if the --repo arg worked for catalog test or if it's path_alias
  • why it didn't work for the upgrade test

And my questions are about:

@JeromeJu
Copy link
Member

/close
Closing as currently the upgrade test is running in prow at https://prow.tekton.dev/view/gs/tekton-prow/logs/ci-tekton-pipeline-upgrade-tests/1594874572958601216. While #5782 is created to track down the following improvement for regression.

@tekton-robot
Copy link
Collaborator

@JeromeJu: You can't close an active issue/PR unless you authored it or you are a collaborator.

In response to this:

/close
Closing as currently the upgrade test is running in prow at https://prow.tekton.dev/view/gs/tekton-prow/logs/ci-tekton-pipeline-upgrade-tests/1594874572958601216. While #5782 is created to track down the following improvement for regression.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Repository owner moved this from In Progress to Done in Pipelines V1 Nov 22, 2022
Repository owner moved this from Todo to Done in Tekton Community Roadmap Nov 22, 2022
@lbernick
Copy link
Member Author

commenting for the record, looks like tektoncd/plumbing#1279 is the PR that got upgrade tests working.

JeromeJu added a commit to JeromeJu/pipeline that referenced this issue Jun 22, 2023
Scenario 2 of the upgrade test aims to install pipeline server from the
preivous release, create resources from that version and test on those
resources against the client of the current release.

This commit restores the apply_resources function to enable the 2nd
scenario of the upgrade test work by applying the resources from the old
server version of pipeline. It was preivously renamed and removed in
tektoncd#2685.

It fixes the missing piece where the required resources from the previous
release have not been created to accomplish scenario 2 of the upgrade
test.

part of: tektoncd#5193
JeromeJu added a commit to JeromeJu/pipeline that referenced this issue Jun 22, 2023
Scenario 2 of the upgrade test aims to install pipeline server from the
preivous release, create resources from that version and test on those
resources against the client of the current release.

This commit restores the apply_resources function to enable the 2nd
scenario of the upgrade test work by applying the resources from the old
server version of pipeline. It was preivously renamed and removed in
tektoncd#2685.

It fixes the missing piece where the required resources from the previous
release have not been created to accomplish scenario 2 of the upgrade
test. More specifically, it creates resources in v1 examples that are
not under beta or no-ci.

part of: tektoncd#5193
JeromeJu added a commit to JeromeJu/pipeline that referenced this issue Jun 22, 2023
Scenario 2 of the upgrade test aims to install pipeline server from the
preivous release, create resources from that version and test on those
resources against the client of the current release.

This commit restores the apply_resources function to enable the 2nd
scenario of the upgrade test work by applying the resources from the old
server version of pipeline. It was preivously renamed and removed in
tektoncd#2685.

It fixes the missing piece where the required resources from the previous
release have not been created to accomplish scenario 2 of the upgrade
test. More specifically, it creates resources in v1 examples that are
not under beta or no-ci.

part of: tektoncd#5193
JeromeJu added a commit to JeromeJu/pipeline that referenced this issue Jun 23, 2023
Scenario 2 of the upgrade test aims to install pipeline server from the
preivous release, create resources from that version and test on those
resources against the client of the current release.

This commit restores the apply_resources function to enable the 2nd
scenario of the upgrade test work by applying the resources from the old
server version of pipeline. It was preivously renamed and removed in
tektoncd#2685.

It fixes the missing piece where the required resources from the previous
release have not been created to accomplish scenario 2 of the upgrade
test. More specifically, it creates resources in v1 examples that are
not under beta or no-ci.

part of: tektoncd#5193
@lbernick lbernick changed the title Improve upgrade tests Restore periodic prow job for upgrade tests Jun 26, 2023
JeromeJu added a commit to JeromeJu/pipeline that referenced this issue Jun 28, 2023
Scenario 2 of the upgrade test aims to install pipeline server from the
preivous release, create resources from that version and test on those
resources against the client of the current release.

This commit restores the apply_resources function to enable the 2nd
scenario of the upgrade test work by applying the resources from the old
server version of pipeline. It was preivously renamed and removed in
tektoncd#2685.

It fixes the missing piece where the required resources from the previous
release have not been created to accomplish scenario 2 of the upgrade
test. More specifically, it creates resources in v1 examples that are
not under beta or no-ci.

part of: tektoncd#5193
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
Projects
Status: Done
Status: Done
Development

No branches or pull requests

7 participants