
SIG Network test group for blocking jobs #19160

Closed · cmluciano opened this issue Sep 9, 2020 · 17 comments

Labels: kind/cleanup · lifecycle/rotten · sig/network

Comments

@cmluciano

What should be cleaned up or changed:
A colleague of mine created a public collection of sig-node test jobs that are critical for determining sig-node test health. SIG Network would benefit from a similar collection of network-related jobs that may be blocking a release or consistently failing. We currently have the sig-network-test-failures mailing list, but a testgrid collection should also be created for looking at overall job health.

I think we want a setup similar to SIG Node's, but tuned to sig-network-related tests only.

Provide any links for context:
Mailing list
sig-node informing/blocking jobs
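
As a purely illustrative sketch (hypothetical dashboard names; the real definitions would live in kubernetes/test-infra under config/testgrids), a SIG Network group mirroring the sig-node informing/blocking split might look roughly like this:

```yaml
# Hypothetical sketch only: a testgrid dashboard group for SIG Network,
# modeled on the sig-node informing/blocking layout. The dashboard names
# below are placeholders, not existing dashboards.
dashboard_groups:
- name: sig-network
  dashboard_names:
  - sig-network-release-blocking
  - sig-network-informing
dashboards:
- name: sig-network-release-blocking
- name: sig-network-informing
```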

@cmluciano added the kind/cleanup label Sep 9, 2020
@cmluciano (author)

cc @aojea @danwinship for any further ideas on what we should add or remove to signal overall sig-net test health

@cmluciano (author)

/assign

@aojea (member) commented Sep 10, 2020

I think we should start by listing the "apis" we own and the components, i.e.:

apis:

  • endpoints/endpoints slices
  • ingress/service API
  • services
  • network policy
  • dns?
  • ...

components:

  • kube-proxy (iptables, ipvs, ...)
  • some parts of the kubelet?? (node addresses/kubelet network)
  • ...

and then investigate how well we are covering those and report by area. Ideally we should be able to report that:

  • sig-network APIs are OK / need attention / FAIL (a three-level status), and drill down to say, e.g., "the Services API has X% coverage" (there is an apisnoop tool that reports that; I don't know if we can leverage it)
  • sig-network components are OK (same three-level report); I don't think we need coverage for components, just an aggregation of the success rate of the jobs, as testgrid does today

and block on those reports.
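
To make the per-area reporting idea concrete, here is a purely hypothetical data sketch, not an existing file or format in test-infra, of the kind of inventory such a report could be generated from; the area names, job names, field names, and status values are all illustrative:

```yaml
# Illustrative only: a hand-maintained inventory mapping sig-network areas
# to a coverage source and an aggregated three-level status.
areas:
  apis:
  - name: services
    coverage_source: https://apisnoop.cncf.io/1.20.0/stable/core        # apisnoop, as discussed below
    status: needs-attention        # example value; one of: ok, needs-attention, fail
  - name: network-policy
    coverage_source: https://apisnoop.cncf.io/1.20.0/stable/networking
    status: fail                   # example value
  components:
  - name: kube-proxy-iptables
    jobs:                          # status = aggregated job success rate, as testgrid does today
    - ci-kubernetes-e2e-example    # placeholder job name
    status: ok                     # example value
```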

I've added kind jobs based on the [sig-network] regex: https://testgrid.k8s.io/sig-network-kind,
but honestly I don't think we cover the areas/components listed above well enough.
The most important tests are already covered in:

blocking presubmits: https://testgrid.k8s.io/presubmits-kubernetes-blocking
pr presubmits: https://testgrid.k8s.io/presubmits-kubernetes-blocking

my 2 cents

/cc @thockin

@spiffxp (member) commented Sep 11, 2020

/sig network

@k8s-ci-robot added the sig/network label Sep 11, 2020
@aojea (member) commented Sep 11, 2020

i.e. for API coverage:

networking API coverage: https://apisnoop.cncf.io/1.20.0/stable/networking
I don't know if it is possible to get the Endpoints and Services data from here:
https://apisnoop.cncf.io/1.20.0/stable/core

For the components it's trickier; for example, for Ingress and NetworkPolicy only the API is defined, and the implementation is left to third-party components.
The only option I see is to add a testgrid job using a particular implementation, but this is a big can of worms and can bias the community into believing that implementation is the standard 🤷

@cmluciano (author)

Yes, it looks like we can get the Endpoints tested state from core.

[screenshot: apisnoop coverage for the core API group]

Is it necessary to have a dashboard for the coverage if it is already presented and tracked by the CNCF through apisnoop?

@cmluciano (author)

> for ingress and network policy only the API is defined, and the implementation is left to third-party components.
> The only option I see is to add a testgrid job using a particular implementation, but this is a big can of worms and can bias the community into believing that implementation is the standard 🤷

I tried something like the implementation-specific example when we were still working on kubernetes-anywhere, but it didn't go very far because we did not want to choose one implementation over another.

Testing the APIs themselves with conformance-type tests is probably good enough, IMO.

> I've added kind jobs based on the [sig-network] regex: https://testgrid.k8s.io/sig-network-kind,
> but honestly I don't think we cover the areas/components listed above well enough.
> The most important tests are already covered in:
>
> blocking presubmits: https://testgrid.k8s.io/presubmits-kubernetes-blocking
> pr presubmits: https://testgrid.k8s.io/presubmits-kubernetes-blocking

I agree that filtering on the SIG-Net labels in the presubmit dashboards and putting them under our grouping would be very useful. I will probably start with that.
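
For context on the mechanics, a minimal sketch (not a real job definition) assuming the usual kubernetes/test-infra conventions: prow jobs are attached to testgrid dashboards through the testgrid-* annotations on the job config, so grouping presubmits under a sig-network dashboard would mostly mean adding annotations along these lines. The job name, image, command, and alert address below are placeholders.

```yaml
# Minimal sketch, assuming the usual test-infra conventions; everything
# except the annotation keys is a placeholder.
presubmits:
  kubernetes/kubernetes:
  - name: pull-kubernetes-e2e-sig-network-example     # placeholder job name
    always_run: false
    annotations:
      testgrid-dashboards: sig-network-kind           # existing dashboard linked above
      testgrid-tab-name: pull-e2e-sig-network-example
      testgrid-alert-email: sig-network-test-failures@example.com   # placeholder; use the real list address
      testgrid-num-failures-to-alert: "3"
    spec:
      containers:
      - image: gcr.io/k8s-testimages/kubekins-e2e:latest   # placeholder image
        command: ["runner.sh"]                             # placeholder command
```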

@aojea (member) commented Sep 11, 2020

> Is it necessary to have a dashboard for the coverage if it is already presented and tracked by the CNCF through apisnoop?

Nah, just brainstorming. We are currently blocking on testgrid jobs, but it may be interesting to block on coverage as well, at least per release, i.e. release x+1 can't have less coverage than release x. cc @spiffxp

@aojea (member) commented Sep 12, 2020

Related job to gate on coverage #19173

@cmluciano (author)

> Is it necessary to have a dashboard for the coverage if it is already presented and tracked by the CNCF through apisnoop?
>
> Nah, just brainstorming. We are currently blocking on testgrid jobs, but it may be interesting to block on coverage as well, at least per release, i.e. release x+1 can't have less coverage than release x. cc @spiffxp

Cool. For what it's worth, I agree that we should have conformance coverage tests for the NetworkPolicy API in stable. I think NetworkPolicy went stable before the idea of API conformance testing existed; I will open an issue and add those tests.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Dec 13, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jan 12, 2021
@cmluciano (author)

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Jan 13, 2021
@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 13, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label May 13, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot (contributor)

@fejta-bot: Closing this issue.

In response to this:

> Rotten issues close after 30d of inactivity.
> Reopen the issue with /reopen.
> Mark the issue as fresh with /remove-lifecycle rotten.
>
> Send feedback to sig-contributor-experience at kubernetes/community.
> /close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
