-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Managed Jenkins Infrastructure for TVM RFC #49
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I support this RFC! Thanks for this improvement, and as future steps, lets discuss about having more members of the community to be able to contribute with infrastructure machines and infra.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM - thank you for driving this effort @areusch !
This looks good to me--addressing the comment from @leandron, I'll also be doing work in the coming weeks to open-source the CI components of the tvm-ci, packer, and terraform repositories, at which point it should be fairly easy for others to make contributions to the CI/contribute machines to the infrastructure. |
|
||
1. **Maintenance of dedicated executor fleet**. TVM's build is sensitive to the type of hardware used to execute the CI. Using GitHub Actions only alleviates us of the burden of running the Jenkins master. We would still need to run our own fleet of executors with the GitHub agent. | ||
2. **Write access to CI configuration**. GitHub Actions is configured from within the `tvm` repository. While there are many benefits to this, operationally write access to the `tvm` repository is a slow process that is currently granted based on historical contribution to TVM. This process isn't particularly impedance-matched to the needs of a DevOps team, where access checks are routine but low-overhead and the group with write permissions should be controlled but easy to change. And, it's likely that many of the maintenance tasks involved with running TVM executors require the involvement of the current group of TVM Committers—indeed, no TVM committer is on the OctoML Infrastructure team today. This is not to say that any of these things could be changed, but when this project was started, it was considered to be challenging to accommodate these requirements in the TVM committer system. | ||
3. **Private TVM CI instances**. While TVM CI will always remain open and public, there are multiple companies which both contribute to TVM and desire to run their own CI instance internally. Sticking to an open-source CI system avoids any vendor-specific pitfalls (e.g. anyone *could* run Jenkins internally and reference our configuration). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressing the comment from @leandron, I'll also be doing work in the coming weeks to open-source the CI components of the tvm-ci, packer, and terraform repositories, at which point it should be fairly easy for others to make contributions to the CI/contribute machines
In addition to these it'd be nice if we had a single guide on how to deploy everything (from the level of I have a head node and some static machines with a freshly provisioned Ubuntu or something), both for ourselves in the future and to enable this as more than just a possibility
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a great idea--I'll definitely look into doing this in the coming weeks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks like a huge leap forward that TVM's CI system really needs, but I've raised a few concerns I have.
|
||
### Infrastructure-as-Code Repository | ||
|
||
The production TVM CI instance will be managed using an open source Infrastructure-as-Code repository living in GitLab. GitLab is preferable for DevOps workflows due to a slightly nicer pipelines system, particularly one which allows for manual intervention if needed. All configuration except credentials will be stored in this repository. TVM Committers, plus additional delegates of those committers responsible for running the TVM Jenkins infrastructure, will be granted write access to this repository. Any changes to this repository will require review from those individuals with write access who are actively involved in the day-to-day operations of TVM CI. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here it's suggested the code will be in GitLab whereas below is says it'll use GitHub Actions, which are you proposing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, this was a decision we walked back after some consideration (e.g. better to stick with the same platform rather than have two). i missed this mention in my editing; fixed.
|
||
## Ownership | ||
|
||
We propose that the Infrastructure-as-Code repository for this system be open-sourced but that the maintenance be delegated to a set of volunteers in the community. IaC operations will be launched in practice from GitHub Actions inside a new e.g. `tlcpack/ci-*` repositories. Cloud credentials will be provided to the IaC repository (stored privately, accessible to those community volunteers involved with CI operations) to enable maintenance access to the fleet of nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of the problem with current CI is that even as a TVM committer I can't make meaningful changes to the infrastructure. The infrastructure in itself is a part of the TVM project, I'd suggest we encourage people to contribute infrastructure-as-code similar to other contributions, by using the committer system, rather than an alternative one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i agree with this. i think there are a few more obstacles before we can do this and i'd like to solve them in parallel without blocking efforts to improve CI:
- there isn't a path defined right now for folks who contribute only to TVM CI infrastructure to become committers
- nothing is codified right now so we can't use the traditional path
- there are folks who feel comfortable reviewing both Infra-as-Code and TVM, but my perception is that the number is small
what we're proposing is to handle this separately for now, however still grant TVM committers write access to the IaC repo (so the system is essentially still the committer system, just with extra folks who can write/deploy). this will also give us a good idea as to the GH permissions needed for such a repo, so that we can then consider unifying the two systems with a proper proposal later on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there's also some prior art of CI code living outside the main repo (see https://github.com/kubernetes/test-infra, https://github.com/pytorch/builder), afaik for similar reasons (easier to commit to and iterate on)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To clarify, I don't have any issue with the actual code being in a separate respository with different checks and such, that's totally normal in a lot of projects. The point of concern is taking the CI infrastructure, which every commit into TVM depends on, outside of the Apache TVM project. Taking your examples of kubernetes/test-infra
and pytorch/builder
, they all exist within the project itself, so the Kubernetes CI is under the kubernetes
namespace and governed under those rules.
- there isn't a path defined right now for folks who contribute only to TVM CI infrastructure to become committers
- nothing is codified right now so we can't use the traditional path
- there are folks who feel comfortable reviewing both Infra-as-Code and TVM, but my perception is that the number is small
I'm not sure this is true, I believe that the TVM community has a reasonable number of active committers comfortable with reviewing both, it's historically been difficult for them to contribute and continuing to manage it outside of the project seems to continue that practice. The path to becoming a committer does not seem to require comprehensive of knowledge across TVM, as the code owners file demonstrates certain committers have a large preference to a single area. I would support the PMC in guiding those who are interested in solely contributing to CI to becoming comitters as much as those who would contribute to other areas, such as documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @areusch, I've suggested a few more for clarity around this but I believe we're almost there?
|
||
## Ownership | ||
|
||
We propose that the Infrastructure-as-Code repository for this system be open-sourced and that maintenance of the repository be handled by TVM committers. Once the system is operational, IaC operations will be launched from GitHub Actions inside new e.g. `tlcpack/tvm-ci-*` repositories. We will create the following repositories: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We propose that the Infrastructure-as-Code repository for this system be open-sourced and that maintenance of the repository be handled by TVM committers. Once the system is operational, IaC operations will be launched from GitHub Actions inside new e.g. `tlcpack/tvm-ci-*` repositories. We will create the following repositories: | |
We propose that the Infrastructure-as-Code repository for this system be open-sourced and that maintenance of the repositories, as part of the Apache TVM project, be under the same project governance and PMC; IaC will therefore be managed by TVM committers. Once the system is operational, IaC operations will be launched from GitHub Actions inside new e.g. `tlcpack/tvm-ci-*` repositories. We will create the following repositories: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
made changes roughly to this effect
We considered using GitHub Actions to drive the TVM CI instead of Jenkins. While GitHub Actions has several attractive properties (for two, a modern configuration language and management of the "Jenkins master" equivalent), there are a couple of compelling reasons to build our own infrastructure including the Jenkins master: | ||
|
||
1. **Maintenance of dedicated executor fleet**. TVM's build is sensitive to the type of hardware used to execute the CI. Using GitHub Actions only alleviates us of the burden of running the Jenkins master. We would still need to run our own fleet of executors with the GitHub agent. | ||
2. **Write access to CI configuration**. GitHub Actions is configured from within the `tvm` repository. While there are many benefits to this, operationally write access to the `tvm` repository is a slow process that is currently granted based on historical contribution to TVM. This process isn't particularly impedance-matched to the needs of a DevOps team, where access checks are routine but low-overhead and the group with write permissions should be controlled but easy to change. And, it's likely that many of the maintenance tasks involved with running TVM executors require the involvement of the current group of TVM Committers—indeed, no TVM committer is on the OctoML Infrastructure team today. This is not to say that any of these things could be changed, but when this project was started, it was considered to be challenging to accommodate these requirements in the TVM committer system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs rephrasing with the context around choosing tlcpack
and using TVM Committers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i removed the last sentence, but i'd prefer to leave the rest in here actually. i'm very much okay with proceeding using the existing committer promotion strategy--i don't believe we should make a process exception over a perceived fear that the process won't work. however, i think it is pretty plain that a multi-week PMC vote is an extremely heavyweight way to add folks to an oncall rotation. i don't consider this problem perfectly solved in the new system, so i don't think it's worth removing the critique that GH actions locks us into that problem. i would like to revisit this problem in context of real experience operating the IaC repo and motivate any process changes off of that rather than off of a gut feeling. this does mean we're proceeding with a degraded oncall support, but i'm okay with that given it's a CI and the goal is to build a community-driven process.
thanks for the comments @Mousius, PTAL when you have a minute! |
@Mousius I've added language to explicitly describe our intent to host these repos underneath Apache in the medium term. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work you've put in on this @areusch, let's get this going 😸!
Adding the Jenkins RFC.
@jroesch @leandron @driazati @konturn @tqchen