REQUEST: New project roles: "Test Maintainer" and "Subject Matter Expert" #9735
Comments
This issue is currently awaiting triage. If CAPI contributors determines this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Thanks for putting together your thoughts @jackfrancis. Overall I like the idea of having new roles so that CI failures get fixed more efficiently than they do today.
Most of the above responsibilities described already fall into the responsibilities of the current CI team in the Release Team, so we may not need a new dedicated role for it and CI Lead + CI Lead team Members should be able to cover those.
CI team has multiple members and RT already has a concept of membership rotation, in which they take turns watching CI/triaging etc.
This, I think, is the most important role in the context of this issue, and it should help make sure CI issues are resolved in a timely manner without blocking CAPI releases. It would be great to hear others' opinions as well. cc @CecileRobertMichon @fabriziopandini @sbueringer @killianmuldoon @vincepri @enxebre |
cc @kubernetes-sigs/cluster-api-release-team |
I personally like the idea of having people on rotation focused on investigating flakes for a couple of reasons:
Given that, IMO this fits naturally in the CI team (which is on rotation for a release) or in a working group inside it (the triage team), but ultimately everything works for me. My main concern is onboarding enough people willing to do the work, so that the project no longer depends on a very limited set of people to ensure the quality of upcoming releases. In this regard, let's consider whether titles like "Test Maintainer" and "Subject Matter Expert" might be intimidating for folks approaching this job, or for folks available only for a time-boxed/limited commitment. |
I'm not sure about the "Subject Matter Expert" role. Wouldn't this roughly match what we have reviewers / maintainers for? (The problem is just that we are already far from finding enough folks for these roles.) I'm not sure it makes sense to create a new dedicated / separate role of subject matter expert, given that reviewers / approvers should already be subject matter experts of the code they are responsible for, and are also aware of recent changes (which are quite often the cause of newly occurring test failures). |
I think that is what I was describing. How do the current organizational structures support the "of the code they are responsible for" part of that equation? My understanding is that a reviewer or owner of CAPZ is scoped to the entire project. If there are specific |
In core CAPI we have additional maintainer/reviewer aliases for parts of the project: https://github.com/kubernetes-sigs/cluster-api/blob/main/OWNERS_ALIASES These aliases are then referenced by OWNERS files in various directories. |
@sbueringer @fabriziopandini I think this has been a good conversation. It seems to me that the two roles we're discussing largely already exist. Perhaps a good way to better formalize (I'm assuming a little more formality is helpful, but could be wrong there) this is to clarify the following:
The 2nd line item may not be super smooth at first, as lots of things may just result in calls for help to CAPI project maintainers, but this could act as a forcing function to better represent specific areas of the code as owned by certain folks. Over time this could lead to more effective test troubleshooting, and ultimately, better code ownership. We could write the above as a community in a doc, w/ the release team being the key stakeholder in how we decide to do things. WDYT? |
This is what we are doing now, and it doesn't work, because we ultimately depend on a very limited set of people to ensure the quality of upcoming releases (today, "triage to an appropriate owner" means "ping a maintainer"). We need more support from the community if we don't want to put our future releases at risk. What I propose is to assign the responsibility of fixing tests to a team of folks, let's call it the "CI Reliability team"; these people will try to get systems back to a steady state as quickly as possible, or escalate to maintainers when necessary. To get people to volunteer for this role, I suggest that this team be on rotation with a time/effort-bound commitment, like the release team, and to keep things simple, I think this team could actually be an addition to the release team roster. (Note: I understand the objection that this is not usually part of a release team's tasks, but I'm trying to be pragmatic and use the process we already have without adding further complexity.) If we advertise this new "CI Reliability team" properly, it becomes a matter of picking the right folks from the list of candidates for the next release team. The headline for this role could be:
|
Just to be clear, I'm 100% in favor of experimenting in this area to try to get more help, because we clearly need it. I'm just doubting the original idea of this issue as I understood it, which would have been to establish an additional role, specified in OWNERS files alongside reviewers / approvers for specific areas (OWNERS files exist to define review & approve permissions). |
FWIW the Kubernetes release team used to have two separate roles for "CI signal" and It might be interesting to understand their reasoning so we can learn from their experience |
We discussed the idea of introducing a dedicated "Bug Triage" team in #9271 and in the end decided not to do so, since the current CI team already has responsibilities that partly match what we would be introducing. |
I think that in our model "CI signal" and "Bug Triage" are somehow already merged, and it works well; the output of their work is a set of issues covering all the flakes/test failures, with all the details they managed to collect. |
cc @cahillsf FYI (since you are a release lead for 1.7) |
cc @adilGhaffarDev as well since you are CI signal manager for the 1.7 release |
Thanks for putting this together @jackfrancis! I like the concept suggested by this request. However, from my experience, there seems to be a significant overlap between the responsibilities of the "Test Maintainer" and "SME" roles and the existing duties of the CI Team (as @furkatgofurov7 already pointed out in #9735 (comment)). I also like the idea of a "CI Reliability Team," but the expectations for this team appear to coincide with the responsibilities of the existing CI Team. Perhaps a reassessment of the CI Team's role and expectations could provide insights into the core issue? Below is my perspective. (Also, because we need to clarify the CI Team's scope via #9849.)
Summary of CI Duties: The responsibilities section of
The expectations surrounding "Bug Triage" and "Automation" have been somewhat unclear.
I hope my above explanation adds more clarity regarding the practices and functioning of the CI Team. Moreover, I believe that the duties of the proposed "CI Reliability Team" are essentially those of the existing CI Team within the CAPI Release Team. The main difference is that bug triage might fall outside the CI Team's scope, focusing instead on addressing test-related issues. |
I agree with @fabriziopandini on this, but I'm not sure whether it should be a separate team or whether we just add this to the responsibilities of the CI Team. I am OK with both; if we plan to add it to the responsibilities of the "CI Team", we just have to keep it in mind when we are forming the CI Team and communicate it properly to the team. For this release cycle we are doing the following:
|
I don't think this has been done by the CI team for the last 2 release cycles. I also think we should update the documentation and remove bug triaging from the CI team's responsibilities. This can be done weekly in the release team meeting by the Release Lead, when the whole team is present.
I agree. I think this is also documented incorrectly, and we need to fix it. But in practice things are moving in the right direction: automation tasks in the last 2 release cycles, and in this one too, were divided accordingly, and not all tasks were assigned to the CI Team. |
brought up in CAPI office hours (see discussion details here: https://docs.google.com/document/d/1X5_qNUvoY0Tk3BwXODWHAl72L-APM_GiA-IE4WoitYY/edit#heading=h.y2rfv7e8f0wp) Next steps as understood from the release team:
feel free to correct me on any of the above. as advised by Fabrizio, I believe these action items are "small steps" that we can iterate on |
/priority backlog |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned |
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
What would you like to be added (User Story)?
As a release manager, I would like to be able to easily triage test failures by escalating to a "Test Maintainer" in order to resolve test failures and unblock release progress.
As a "Test Maintainer", I would like to be able to rapidly diagnose test failures by escalating to a "Subject Matter Expert" in order to debug the specific underlying failures in a critical CI test.
Detailed Description
At present, the CAPI project has a notion of a "maintainer" (see the cluster-api-maintainers in the OWNERS_ALIASES file in the root of the project repository), who is responsible for approving reviewed PRs and merging them, as well as having various admin-level privileges to execute important tasks such as cutting a new release. Additionally, there is a notion of a "reviewer" as well (see the cluster-api-maintainers in the OWNERS_ALIASES file in the root of the project repository) whose sole responsibility is to review PRs and provide feedback to submitters, and finally to lgtm a PR when it can be accepted into the project.
This request suggests that of the current formal roles defined in CAPI, only "maintainer" suggests a role that can be used to aid test failure triage (basically you'd try your best to understand what test is failing, and perhaps why, and then reach out to a maintainer for help actually fixing things).
Because the CAPI project has grown considerably since these roles were codified, perhaps a more effective way to triage test failures is to introduce a dedicated "Test Maintainer" role with the following responsibilities:
The above describes a new "Subject Matter Expert" role, whose responsibilities are as follows:
Anything else you would like to add?
How would this work in practice?
For "Test Maintainer", we would want to agree upon a set of folks that would participate in a regular rotation. Defining how that rotation works is probably beyond this issue.
For "Subject Matter Expert", we would put an OWNERS file describing one or more maintainers for a piece of code in the parent directory of the source that contains that code. That ownership is inherited by child directories until another OWNERS file is encountered, which overrides the parent OWNERS file.
For example, this describes the current project-spanning ownership (i.e., current maintainers and reviewers):
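To make the inheritance rule concrete, here is a minimal Python sketch of the lookup as described in this request: walk up from a file's directory and use the nearest OWNERS file. The directory-to-owners mapping and the `example-*` alias names are hypothetical placeholders, not the project's real OWNERS content, and the way Prow actually resolves OWNERS files may differ in details.

```python
from pathlib import PurePosixPath

# Hypothetical data: directory -> owners listed in that directory's OWNERS file.
# The "example-*" aliases are invented for illustration only.
OWNERS_BY_DIR = {
    "": ["cluster-api-maintainers"],        # repository root
    "api/v1beta1": ["example-api-experts"],
    "test/e2e": ["example-e2e-experts"],
}


def owners_for(path: str, owners_by_dir: dict[str, list[str]]) -> list[str]:
    """Return the owners from the nearest OWNERS file at or above `path`.

    Implements the override model described in this issue: the closest
    OWNERS file wins and parent entries are not merged in.
    """
    directory = PurePosixPath(path.lstrip("/")).parent
    while True:
        # PurePosixPath renders the repository root as "."; map it to "".
        key = "" if str(directory) == "." else str(directory)
        if key in owners_by_dir:
            return owners_by_dir[key]
        if key == "":
            return []  # reached the root without finding any OWNERS file
        directory = directory.parent
```

With this mapping, `owners_for("test/e2e/clusterctl_upgrade.go", OWNERS_BY_DIR)` resolves to the E2E experts, while a file in a directory without its own OWNERS file falls back to the root entry.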
This would define a set of subject matter experts for the v1beta1 API:
This would define a set of subject matter experts who are responsible for E2E tests:
Taking a concrete example, here is a recently failing CAPI test:
Note the stacktrace:
[FAILED] in [It] - /home/prow/go/src/sigs.k8s.io/cluster-api/test/e2e/clusterctl_upgrade.go:389 @ 11/17/23 11:42:09.424
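As a rough illustration of the first triage step, the failing source file can be extracted from a line like the one above. This is only a sketch: the regular expression is fitted to this single log line, and the assumption that everything after `sigs.k8s.io/cluster-api/` is a repository-relative path is mine, not something the CI system guarantees.

```python
import re

# Matches a failure line shaped like the one quoted above:
#   [FAILED] in [It] - <absolute path>:<line> @ <timestamp>
FAILURE_RE = re.compile(r"\[FAILED\] in \[\w+\] - (?P<path>\S+):(?P<line>\d+) @")

# Assumption: the CI checkout path contains the Go module path, so
# everything after this marker is relative to the repository root.
MODULE_MARKER = "sigs.k8s.io/cluster-api/"


def failing_location(log_line: str):
    """Return (repo_relative_path, line_number), or None if no match."""
    match = FAILURE_RE.search(log_line)
    if match is None:
        return None
    path = match.group("path")
    _, found, relative = path.partition(MODULE_MARKER)
    # Fall back to the full path when the marker is absent.
    return (relative if found else path, int(match.group("line")))
```

The returned path can then be matched against OWNERS files to decide whom to escalate to.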
A person fulfilling the "Test Maintainer" role could then do the following:
/test/e2e/clusterctl_upgrade.go.
Label(s) to be applied
/kind feature
One or more /area label. See https://github.com/kubernetes-sigs/cluster-api/labels?q=area for the list of labels.