Skip to content
This repository has been archived by the owner on Sep 19, 2022. It is now read-only.

Fix Unit Tests #293

Merged
merged 4 commits into from
Oct 19, 2020
Merged

Fix Unit Tests #293

merged 4 commits into from
Oct 19, 2020

Conversation

andreyvelich
Copy link
Member

Related: #292.

I was able to run unit test locally after these changes .

  1. I changed expectedServiceDeletions from 0 to 1. Here: https://github.com/kubeflow/pytorch-operator/blob/master/pkg/controller.v1/pytorch/job.go#L163-L171, we don't check that job.Spec.CleanPodPolicy == common.CleanPodPolicyRunning before deleting services.
    @jiaqianjing is that correct behaviour? In TF operator we have one loop for services and pods: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/job.go#L162-L172, but in PyTorch operator we analyse services and pods in the different loops.

  2. To fix this error: Fail in goroutine after TestCopyLabelsAndAnnotation has completed, I changed run test controller, like here: https://github.com/kubeflow/pytorch-operator/blob/master/pkg/controller.v1/pytorch/controller_test.go#L341-L347.

Do you know how we can add travis check before PR submission?

/assign @gaocegege @johnugeorge
/cc @jiaqianjing

@k8s-ci-robot
Copy link

@andreyvelich: GitHub didn't allow me to request PR reviews from the following users: jiaqianjing.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

Related: #292.

I was able to run unit test locally after these changes .

  1. I changed expectedServiceDeletions from 0 to 1. Here: https://github.com/kubeflow/pytorch-operator/blob/master/pkg/controller.v1/pytorch/job.go#L163-L171, we don't check that job.Spec.CleanPodPolicy == common.CleanPodPolicyRunning before deleting services.
    @jiaqianjing is that correct behaviour? In TF operator we have one loop for services and pods: https://github.com/kubeflow/tf-operator/blob/master/pkg/controller.v1/tensorflow/job.go#L162-L172, but in PyTorch operator we analyse services and pods in the different loops.

  2. To fix this error: Fail in goroutine after TestCopyLabelsAndAnnotation has completed, I changed run test controller, like here: https://github.com/kubeflow/pytorch-operator/blob/master/pkg/controller.v1/pytorch/controller_test.go#L341-L347.

Do you know how we can add travis check before PR submission?

/assign @gaocegege @johnugeorge
/cc @jiaqianjing

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Member

@gaocegege gaocegege left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@andreyvelich
Copy link
Member Author

/hold
Travis still fails: https://travis-ci.com/github/andreyvelich/pytorch-operator/builds/177732485.
How we can re-sync with coveralls to fix error ?

Bad response status from coveralls: 422
{"message":"Couldn't find a repository matching this job.","error":true}
The command "goveralls -service=travis-ci -v -package ./pkg/... -ignore "pkg/client/*/*.go,pkg/client/*/*/*.go,pkg/client/*/*/*/*.go,pkg/client/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*.go,pkg/client/*/*/*/*/*/*/*.go,pkg/apis/pytorch/*/zz_generated.*.go"" exited with 1.

@gaocegege
Copy link
Member

I think it is caused by the version of the codegen tools

@andreyvelich
Copy link
Member Author

@gaocegege Any ideas which version can affect on this?
Also, do you know how we can add travis to PR presubmit checks ?

@gaocegege
Copy link
Member

Can we merge it now?

@gaocegege
Copy link
Member

No idea about it..

Any ideas which version can affect on this?

@andreyvelich
Copy link
Member Author

@gaocegege We can try to merge it, but unit test can still be failed because of goveralls error.
Maybe we should try to migrate to travis-ci.comfrom travis-ci.org ?
I believe PyTorch operator still uses old version: https://travis-ci.com/organizations/kubeflow/migrate.

@gaocegege
Copy link
Member

https://travis-ci.com/organizations/kubeflow/migrate @jlewi Hi Jeremy, can you help do it?

@jlewi
Copy link
Contributor

jlewi commented Aug 17, 2020

@andreyvelich and @gaocegege As WG leads can you please come up with an approach for maintaining and supporting your testing infrastructure that minimizes the burden on the GitHub org admins. If you need a GitHub org change please open an issue and assign it to one of the org admins (other than me) listed in:
https://github.com/kubeflow/internal-acls/blob/11984d38190b030fecf6886379f1f0eda9fa0000/github-orgs/kubeflow/org.yaml#L7

@gaocegege
Copy link
Member

As WG leads can you please come up with an approach for maintaining and supporting your testing infrastructure that minimizes the burden on the GitHub org admins.

I think the best approach is to give WG Leads the permission to manage projects in these test infras. I will try to promote it with GitHub org admins

@jlewi
Copy link
Contributor

jlewi commented Aug 18, 2020

I think the best approach is to give WG Leads the permission to manage projects in these test infras

What does this mean in terms of GitHub permissions?

@andreyvelich
Copy link
Member Author

@gaocegege @jlewi Will admin permission for pytorch-operator repository be enough to control Travis?
Like what GitHub team has for KF Serving.
Currently, we have write permission.

/cc @johnugeorge

@gaocegege
Copy link
Member

Can we merge it now?

@andreyvelich
Copy link
Member Author

@gaocegege Sure, let's create an issue to track this problem.

@gaocegege
Copy link
Member

/lgtm
/approve

@andreyvelich
Copy link
Member Author

/hold cancel

@andreyvelich
Copy link
Member Author

/retest

@andreyvelich
Copy link
Member Author

@Jeffwan @jinchihe This PR is needed to fix Travis and e2e bugs.

@jinchihe Do you have any ideas how to fix: evel=error msg="No Major.Minor.Patch elements found.
I saw that you have fixed it for tf-operator: kubeflow/training-operator#1179, but we deploy cluster from the bash script in pytorch-operator currently.

@jinchihe
Copy link
Member

@jinchihe Do you have any ideas how to fix: evel=error msg="No Major.Minor.Patch elements found.
I saw that you have fixed it for tf-operator: kubeflow/training-operator#1179, but we deploy cluster from the bash script in pytorch-operator currently.

Yeah, I found that's OK if use old k8s cluster, but for now the k8s 1.14.x cannot be deployed any more.

@andreyvelich
Copy link
Member Author

@jinchihe Do you have any ideas how to fix: evel=error msg="No Major.Minor.Patch elements found.
I saw that you have fixed it for tf-operator: kubeflow/tf-operator#1179, but we deploy cluster from the bash script in pytorch-operator currently.

Yeah, I found that's OK if use old k8s cluster, but for now the k8s 1.14.x cannot be deployed any more.

@jinchihe We use the same command in Katib to deploy cluster and it is working (https://github.com/kubeflow/katib/blob/master/test/scripts/v1beta1/create-cluster.sh#L36), but use kubectl and kustomize to deploy Katib and operators.

I believe it fails on this step.
Do you know how we should specify cluster version in ksonnet app?

@jinchihe
Copy link
Member

jinchihe commented Sep 14, 2020

Do you know how we should specify cluster version in ksonnet app?

Not sure, but guess here, need to confirm. Thanks.

--cluster-version 1.14

@k8s-ci-robot
Copy link

@andreyvelich: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
kubeflow-pytorch-operator-presubmit 152b920 link /test kubeflow-pytorch-operator-presubmit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, gaocegege

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@PatrickXYS
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot merged commit 3ae9808 into kubeflow:master Oct 19, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants