Clarify & document our CI strategy #7279
Comments
We also need to consider:
|
@ewdurbin and @pradyunsg are going to be talking about this issue this week as part of the donor-funded pip work that we want to complete by about the end of May. |
Based on a bunch of discussions lately, I think that we seem to have consensus to:
|
Concretely for the "make it faster" task, I can walk someone through getting performance data on Windows and analyzing it like I had to for #7263. If we refine the tools and process then we'll truly have a lot of visibility into where time is being spent. |
@chrahunt 🙋🏻♂️ This sounds like it would also be a good process to document in some form - either in pip's documentation if it is very-pip-specific or as a blog post somewhere, for broader visibility. |
Adjacent to making the entire test suite faster, there's also the concept of "failing fast." Basically, you want to run the tests which are more likely to fail first. These are usually the tests related to code that was changed since the last test run. With regards to that, it may be worth looking into using something along the lines of testmon. I've not worked with testmon specifically, but I've heard good things about both the tool and the general approach it takes. As a note, a prerequisite of this is to know that the tests aren't order-dependent. So it may be worth trying to use something like pytest-randomly (where it can be used, since it's only Python 3.5+) first, and resolve any problems that show up there. |
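To make the "failing fast" idea concrete, here is a minimal sketch of reordering tests so that the ones touching changed files run first. It is not how testmon works internally; the `tests_to_sources` mapping and all file names are hypothetical and assumed to be supplied from elsewhere.

```python
from typing import Dict, Iterable, List, Set


def order_tests_fail_fast(
    tests_to_sources: Dict[str, Set[str]],
    changed_files: Iterable[str],
) -> List[str]:
    """Return test names with tests that touch changed files first.

    `tests_to_sources` maps a test name to the source files it exercises.
    This mapping is an assumption of the sketch; a real tool would have to
    derive it somehow (for example from coverage data).
    """
    changed = set(changed_files)
    # Tests whose tracked sources overlap the changed files go first.
    affected = sorted(t for t, srcs in tests_to_sources.items() if srcs & changed)
    affected_set = set(affected)
    # Everything else still runs, just later.
    unaffected = sorted(t for t in tests_to_sources if t not in affected_set)
    return affected + unaffected
```

This only helps if the tests are order-independent, which is the prerequisite mentioned above.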
The tricky thing with this, is that our tests invoke a subprocess of pip, which isn't very friendly for tools like this.
We parallelize and isolate our tests extensively, so they're definitely not order-dependent. :) |
Could ask @tarpas for help. I think the problem I faced is that when it's combined with the coverage collection tools (tarpas/pytest-testmon#86 (comment)), the results are totally broken (it may show 2-3 times lower coverage than it is for real). |
One of the ideas I had in the shower today, was to have our tests be split across CI providers by OS. We'll run linters + "latest stable" interpreters first, then if they all pass, we'd run the remaining interpreters. One way we could split-by-OS would be something like:
Then, we can add the constraints of running the full test-suite at least once on:
I don't think it makes sense to make a distinction between unit and integration tests here, but we might get good fail-fast speedups, from needing the unit tests to pass prior to running the integration tests. |
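The staged idea above (linters plus "latest stable" interpreters first, everything else only if those pass) can be sketched as follows. This is an illustration only: the job names, interpreter versions and the `run_job` callback are hypothetical and not tied to any CI provider.

```python
from typing import Callable, Collection, List, Sequence, Tuple


def plan_stages(
    interpreters: Sequence[str], latest_stable: Collection[str]
) -> List[List[str]]:
    """Stage 1: lint + latest stable interpreters. Stage 2: the rest."""
    first = ["lint"] + [f"tests-{py}" for py in interpreters if py in latest_stable]
    rest = [f"tests-{py}" for py in interpreters if py not in latest_stable]
    return [first, rest]


def run_stages(
    stages: Sequence[Sequence[str]], run_job: Callable[[str], bool]
) -> Tuple[bool, List[str]]:
    """Run stages in order; skip later stages once a stage has a failure."""
    executed: List[str] = []
    for stage in stages:
        # Every job in a stage runs, so a contributor sees all failures
        # from that stage rather than only the first one.
        results = [run_job(job) for job in stage]
        executed.extend(stage)
        if not all(results):
            return False, executed
    return True, executed
```

The same shape would apply to requiring unit tests to pass before starting integration tests: put the unit-test jobs in an earlier stage.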
@pradyunsg I'd probably put linters on GH Actions too. If you want them to cancel tests, could think of something like hitting the API for that... |
Also, have you considered using https://bors.tech? |
Since @webknjaz mentioned it: I highly recommend https://bors.tech. I use it for the vast majority of my own projects, and it's caught innumerable semantic merge conflicts before anything got merged. If y'all decide to use it, I'd be more than happy to help you configure it. 🙂 |
I spent some time looking and it doesn't look like any of the CI providers we're using have any plans to drop Python 2 support anytime soon. Most notably, GitHub announced a change for Python 2.7, but that decision got reversed
|
I've been experimenting w/ pip's CI setup in pradyunsg#4 and pradyunsg#5. Based on a whole bunch of experimentation and trial & error, I think I have something nice that we can transition to right now. But first, some context which we should probably move into our docs as well eventually... The number of parallel jobs we get per-CI-provider:
IMO, the best utilization would be to have:
Alas, we have failures on Windows + GitHub Actions, that I don't want to deal with immediately. I think we should be running:
Other than that, I think we should have at least 1 CI job that runs unit and integration tests on:
I think it makes a lot of sense to group our "developer tooling" tasks as:
So, my proposal is:
Based on my experiments, our "bottleneck" CI job would then be MacOS, for which we only have 5 workers (i.e. 1 PR at a time); but that looks like it isn't significantly different from our current affairs, where Azure Pipelines is a similar blocker since we run tests on all platforms there. We can mitigate this in the future by moving toward the "best utilization" situation I described above, by swapping the CI platforms we use to run Windows & MacOS. |
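The "5 workers (i.e. 1 PR at a time)" arithmetic can be written down as a small helper. This is a rough sketch: it assumes every job takes about the same time and that a PR's jobs only contend with jobs on the same provider. The figure of 5 macOS jobs per PR is inferred from the sentence above, and the job count and duration in the usage below are made up.

```python
import math


def concurrent_prs(workers: int, jobs_per_pr: int) -> int:
    """How many PRs can have all of their jobs running at once."""
    if workers <= 0 or jobs_per_pr <= 0:
        raise ValueError("workers and jobs_per_pr must be positive")
    return workers // jobs_per_pr


def wall_clock_minutes(jobs: int, workers: int, minutes_per_job: float) -> float:
    """Estimated wall-clock time when jobs run in waves of `workers`."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    # Jobs beyond the worker limit wait for the next wave.
    return math.ceil(jobs / workers) * minutes_per_job
```

For example, `concurrent_prs(5, 5)` gives 1, and `wall_clock_minutes(10, 5, 30)` gives 60, which is why the provider with the fewest workers per job ends up as the bottleneck.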
But they are green in my old PR: #6953 |
@pradyunsg did you think about wiring up Zuul resources in addition to this? Could test some less conventional distros this way (fedora, centos, debian etc.) |
It's just skipping those tests entirely. :) https://github.com/pypa/pip/pull/6953/files#diff-2deae8ed35e0da386b702aa047e106cbR46-R47
I did, yea. I didn't find most of their documentation approachable or any good "hey, here's how you can get started" document or article. Before someone asks, I also looked at bors; and while it would make sense for us to use it, I don't want to bundle that up with these changes as well. Overall, I figured we should clean up what we have before adopting yet-another-CI-setup. :)
I spent a whole bunch of time thinking about this, and... no, I don't think we should be testing against these platforms on our CI. Most linux distros are basically the same from vanilla pip's PoV, and... the responsibility for making sure pip works on a specific linux distro lands on the distro's maintainers; not pip's maintainers. |
Agreed (i.e. latest CPython x64 interpreter) 👍
I read that as "unit tests on all supported Python interpreter versions (including PyPy) on x64 arch on all platforms" ?
I read that as "integration tests on all supported Python interpreter (including PyPy) on x64 arch on Linux+MacOS" ?
"integration tests on the latest CPython 2 & 3 and latest PyPy 2 & 3 with x64 arch on Windows"
👍
This is unclear as it seems to be included in the previous
👍 |
@xavfernandez Yea, I intended to write CPython, and basically only have 1 PyPy job. I've gone ahead and edited the message to correct that error. :) This stems from the fact that our tests on PyPy are really slow and I'd like to get away with not running tests on it, in too many places. FWIW, I'm going to do these changes incrementally, so there's no reason I can't experiment w/ trying to get PyPy tests on Linux + MacOS working fast-enough. :) |
FTR Travis CI now has support for a number of untypical architectures. It could be a good idea to have some smoke tests there while the main computing power would be elsewhere... |
Oh, and that's on 1 CI provider: all the tests on all the platforms with all the CPythons + Ubuntu PyPy3, in about 30 minutes. I quite like it and am very tempted to move fast on this once 20.3 is out. :) |
If we don't have issues with Azure Pipelines and since it is already set up, why not use it? We should be able to dispatch the test suite across more than 25 workers. |
One good reason is release automation - if the entire pipeline is on one provider, we can use "depends on" and "if" based stuff, to dispatch a deployment that's only run if all the tests pass, and if we've got a tag pushed by the right people. Kinda tricky to do cross-provider. |
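The gating condition described here ("all the tests pass" and "a tag pushed by the right people") boils down to a small predicate. The sketch below is provider-agnostic and hypothetical: the function name, its parameters and the example user names are illustrative, not part of any CI system's API.

```python
from typing import Iterable, Mapping, Optional


def should_deploy(
    job_results: Mapping[str, bool],
    tag: Optional[str],
    pusher: str,
    release_managers: Iterable[str],
) -> bool:
    """Deploy only if every job passed, a tag was pushed, and the pusher is allowed."""
    # An empty result set is treated as "not green": nothing was verified.
    all_green = bool(job_results) and all(job_results.values())
    return all_green and tag is not None and pusher in set(release_managers)
```

With everything on one provider, `job_results` is available in a single place; across providers, something has to collect those results first, which is the tricky part mentioned above.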
A tool like Zuul can solve cross-CI dependencies FWIW |
There's a very simple problem with Zuul, as far as I'm concerned.
|
I just went and had a quick look at the Zuul docs, and I agree. I had no idea where to start. I'd be pretty cautious about adopting extra tooling, we need to be sure we don't end up dependent on one person being the only one who can manage our infrastructure. (I was going to say "a limited number of people" but given the size of the pip team, that seems redundant 🙁) I'm not even sure how many people can manage our existing bots, for example. Keeping things simple, and using well-known and commonly used tools, should be an explicit goal here, IMO. |
Yes, the UX side of it is rather terrible. But folks who get used to it are happy.
Here's how the PR runs are reported: https://github.com/pyca/cryptography/pull/5548/checks?check_run_id=1345075288 I know @mnaser mentioned that Vexxhost would be open to providing a platform, and also there's OpenDev that cryptography uses. I think @ssbarnea may have more info about the opportunities. |
Donald wrote them. Ernest is managing the PSF hosting. I got access for being a whiny kid. Currently, the code at pypa/browntruck is auto-deployed at each commit. |
Honestly, this is reason enough to avoid it then. 🤷🏽
If there's an example of this, I'd like to see that. |
^ Ouch, that sounds very bad. This means that if all cores do happen to like I was hoping that we are all here with a common goal of improving the Python packaging user experience, not finding reasons for avoiding having good test coverage. Good test coverage is about where people are using the tool in the wild (as opposed to where a select group is using it). Testing less and claiming that everything is fine is not a good approach. |
Hi, @bristea! I think we are all indeed trying toward the same goal but have concerns about how to achieve it sustainably, especially given that the grant funding for pip work runs out at the end of this year. Will you be committing your own time to maintain any new parts of pip's test infrastructure, or donating money to fund others to do so? |
Yep, I do happen to use several Zuul servers daily, most notable ones being the OpenDev (OpenStack), Ansible and RDO ones. I have to disagree with @webknjaz about the Zuul UI being terrible; I would describe it as just a little bit behind gh-actions or travis but improving each day. Let's be pragmatic and look at those already using it, pyca/cryptography#5548 -- The downside that I see is that it reports as a single check, but other than this there is nothing that would make it impossible to identify what caused a potential job failure. In fact it does allow the user to look at the history of each job and identify if a failure is random or when it started to appear, much easier than on other systems. It is quite common for sensitive projects to use multiple Being offered such help is like throwing money at you, running CI is costly (not only compute) and if someone is offering to help you with hardware and also maintenance of jobs definitions, one should not say no. pip is probably one of the first projects that you do want to cover with a big test matrix, one covering major linux distributions and multiple architectures. In short, I am offering to keep the pip Zuul jobs maintained and running. Maintaining CI jobs for OpenStack is already my main job so keeping pip ones maintained would be very easy, especially as almost all projects under opendev directly depend on pip. Any pip regression hits us hard so we have a very good incentive in assuring this does not happen. |
@ssbarnea that is wonderful -- thank you very much for the offer! The rest of this somewhat lengthy comment is meant for people who perhaps aren't as familiar with pip's day-to-day context and history. @bristea: In case you are unfamiliar with the current funding situation: the PSF was able to get some funding, $407,000 USD in total, from Mozilla Open Source Support and the Chan Zuckerberg Initiative to hire contractors to work on the pip resolver and related user experience issues in 2020. You can see our roadmap and blog and forum and mailing list posts and notes from recent meetings to keep apprised. We also post updates to distutils-sig and the Packaging forum on Python's Discourse instance. Prior to late 2017, nearly no one was paid to work on any part of Python packaging -- PyPI, pip, or any other tool in the chain. Here's what it looked like in 2016. The Python Software Foundation was able to successfully apply for a few grants and similar funds over the past 3-4 years, which is why the new pypi.org is up, why it has two-factor auth and audit trails, and why pip has a new dependency resolver. Along the way we've been able to shore up some of our related infrastructure, and, for instance, pip's automated test suite is stronger than it was before our current resolver work started. And Bloomberg is paying Bernat Gabor to work on virtualenv, and Dustin Ingram gets to do a little packaging work while paid by Google, and Ernest W. Durbin III does sysadmin and some code review work on Warehouse (PyPI) as part of his work at PSF. But that's nearly it, I think. We are working assuming that, starting in January 2021, practically no one is being paid to contribute to pip. And so new problems that crop up with testrunners, CI configuration, etc. will have to wait till someone can fix them in their spare time, and will block the few volunteer hours that maintainers have available to do code review and merging, much less feature development. 
This is why Sorin's offer is so welcome! @bristea, you were replying to what @pradyunsg said in this comment where Pradyun was specifically considering the question/problem of testing "some less conventional distros this way (fedora, centos, debian etc.)". You suggested this would be a problem if statistics showed that the main supported platform was statistically insignificant in terms of proportion of pip's user base. Yes, it would be a problem if, for instance, pip maintainers concentrated on Ubuntu support to the detriment of Fedora support, but then it turned out we had far more users on Fedora than on Ubuntu! I hope you will take the 2020 Python developers' survey and sign up for user studies on Python packaging and spread the word about these efforts, so we have a better assessment of what operating systems our users use. And then that data will help pip's maintainers decide how much of their scarce time can go into support work for various platforms. |
The word I used is probably too strong. The problem is that there's a huge openstack bubble that just got used to how things are. UI is usable but it is often a bit more sophisticated than what GH/Travis users are used to. Truth be told, it's possible to customize that UI too and this is what probably causes the perception that the UX is bad. I guess if folks from a different bubble set up things like they like, it doesn't mean that it's that bad. OTOH since we don't see any setups using something more familiar, it creates a wrong impression of how things work... I don't have examples of Zuul dep configs myself, I just know that it's possible. Maybe Sorin has better demos. One notable thing about Zuul is that you can declare cross project PR dependencies that are agnostic to where the projects are hosted. For example, if some project on OpenDev depends on a bugfix PR in pip, they can specify such a dependency and Zuul will make sure to trigger the build on that OpenDev project once pip's PR is green. But of course it can follow way more complex dependencies. |
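As a toy model of the cross-project dependency idea (and explicitly not Zuul's actual configuration format or API), one can think of each change as declaring the changes it depends on, with a build only triggered once all of those are green. The change identifiers below are hypothetical.

```python
from typing import Dict, List, Set


def changes_ready_to_build(
    depends_on: Dict[str, List[str]], green: Set[str]
) -> List[str]:
    """Return changes that are not yet green but whose dependencies all are."""
    return sorted(
        change
        for change, deps in depends_on.items()
        if change not in green and all(dep in green for dep in deps)
    )
```

Calling this repeatedly as changes turn green walks a longer dependency chain one step at a time, regardless of where each project is hosted.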
For what it's worth, I think the UX is subjective. Having your pipeline defined declaratively in code, using Ansible for the job logic, provides a different configuration-in-code based workflow that many coming from a Jenkins background find unintuitive. I have worked on OpenStack for over 7 years now, and when I started we used Jenkins to execute the job. In comparison to the UX of that solution, Zuul is much, much better. With that said, you are comparing it to Travis, Azure and GitHub Actions. The main delta with regard to execution, from a contributor's point of view, would be when the CI runs. With Zuul it would run when they open the pull request to the main repo and when they push updates, whereas with a Travis file, unless you limited it to PRs or were using a paid account that was limited to the official repo, it would run when they push the change to their fork before they create the pull request. I'm not sure if it would run again for the PR in the case where it runs on the fork; I honestly have too limited an experience with Travis to say. It's something that would need to be discussed with the OpenDev team, but I know that they provide third-party CI for the Ansible project today and the cryptography module adding . pyca has its own tenant in the OpenDev Zuul today https://github.com/openstack/project-config/blob/master/zuul/main.yaml#L1645-L1667 To support pypa we would also need to create one for pypa. A separate tenant is likely the best way to integrate, as the permissions could be scoped more cleanly, but in general it's not that difficult to do. I don't work for the OpenDev foundation or on the infra team, so I can't really speak on their behalf regarding whether they are willing to provide the resources to run the CI, but to me as an OpenStack contributor it would make sense given our dependency on pip working. This is a bit OpenStack-specific still, but https://docs.opendev.org/opendev/infra-manual/latest/creators.html documents many of the steps required to add new projects.
The Gerrit and PyPI sections won't apply, as you won't be using Gerrit for review or having Zuul publish packages to PyPI (unless you want it to) on your behalf, but it has some good references. https://docs.opendev.org/opendev/infra-manual/latest/testing.html describes the type of CI resources available. Basically it boils down to VMs with 8G RAM, 80GB disk and 8 vCPUs. pyca are using OpenDev to build both x86 and ARM wheels, for example; you can see that on their build page: if you go to the artifacts tab you can download the build output, and if you wanted to, as a post-merge trigger you could write a job to rebuild/upload those artifacts to an external site. Anyway, I'm not sure if this is helpful or not, but I just thought I would provide a little feedback as someone who has worked happily on projects using Zuul and who has deployed it before for third-party CI. I'm obviously biased in favor of it, but hopefully some of this is useful. |
Thanks, that's interesting. If I'm understanding what you're saying, it sounds like most of the setup could be done independently of the pip repository, at least in the first instance, maybe just running a daily CI run against master as a proof of concept. If it proved useful (and the people setting it up and running it were willing to take on the commitment) then we could link it to the pip repo so that it runs on PRs and merges, initially as an optional check, but if it turns out to work well, we could make it mandatory. I'm very much in favour of something like that where we can try the approach incrementally, with the pip developers able to leave the specialists to get on with the basic infrastructure, and without having the work be bottlenecked on pip developer availability. If I've misunderstood what you were saying, then I apologise (and you can ignore this comment). |
Yes, using Zuul is not an all-or-nothing proposition. pyca/cryptography are using it in addition to Travis and Azure for things they can't easily test in another way. In their case it was ARM builds, I believe. Assuming OpenDev are happy to provide the resources, you could totally start with a simple nightly build. OpenStack, to test it properly, needs to deploy an entire cloud with software-defined storage and networking and boot nested VMs in our bigger integration jobs. We also have a lot of smaller jobs that just run tox to execute unit tests or Sphinx to generate docs. So Zuul can handle both well, because it's just running Ansible playbooks against a set of resources provided by an abstraction called a nodeset. If GitHub Actions or Travis do what you need, less complexity is always better than more. OpenStack has a lot of projects with a lot of interdependency and a need for cross gating and very large scale. Zuul was built for that, but that may or may not be what you need for pip. |
I set up the Zuul integration with pyca, where our initial focus was to enable ARM64 testing and manylinux wheel building on the ARM64 resources provided by Linaro. OpenDev is not really looking to become a TravisCI replacement where we host 3rd-party CI for all comers. However, there are projects where there is obvious synergy with collaboration -- obviously OpenDev/OpenStack heavily depends on pip in CI and I think it's fair to say as a project we have historically found and done our best to help with issues in pip/setuptools/etc. pyca was our first integration, and I have an action item to write up much clearer documentation.
We don't really need to keep talking theoretically around this; we can get some jobs working in a pull request easily and the project can evaluate how it would like to continue based on actual CI results. However, both sides need to agree to get things started:
Both projects taking these steps essentially formalises that pip is open to integration, and the OpenDev project is willing to provide the resources. With this done, we can start a proof of concept running jobs in a pull request. It would be good to confirm @ssbarnea and @SeanMooney are willing to help set up some initial jobs; it's only going to be useful if the CI has something to do! pip can see exactly what the configuration and results will look like on that pull request and make a decision about how to move forward. I can give you a heads up of what it will all look like though. The job definitions live under a .zuul.d directory [1]. Zuul will report via the checks API so the results of the jobs just show up in the list like any other CI, e.g. see the checks results on a pull request like pyca/cryptography#5533, where the run results are posted as https://github.com/pyca/cryptography/pull/5533/checks?check_run_id=1348557366. When you click on a job result, it will take you to the Zuul page where all logs, build artifacts, etc. are available, e.g. https://zuul.opendev.org/t/pyca/build/b7056847728149c18ca3a483d72c1a51. This played out in pyca/cryptography#5386 for pyca where we refined things until it was ready to merge and run against all PRs. [1] technically we do not need to have any job definitions or configuration in the pip repository; we could keep it all in OpenDev. However, this means if you want to modify the jobs you have to go searching in a separate repository but, more importantly, this would mean that pip developers can't modify the jobs without signing up for an OpenDev account and being given permissions to modify the jobs there. This is not usually the approach projects want to take; they want their CI configuration under project direct control. |
Folks, since we sorta hijacked the issue dedicated to the docs and it'd be great to keep it on topic, I've created #9103 to discuss Zuul effort there. Let's use it for that from now on. |
With all the Zuul stuff redirected over to #9103 (thanks @webknjaz!), I'd like to get the last bits of #2314 done. Here's my plan: all of pip's current CI moves to GitHub Actions, we deploy to PyPI {when conditions are right -- #2314} and we add a GitHub Action triggered at-release over on That should let us push a tag and have the release go out automagically. |
To address what @bristea said earlier:
I don't think it's bad. As I said in the comment that was quoted -- for pip, most Linux distros look the same. We're not testing against the distro-provided Python (the distros perform their own testing on them, because they have their own policies + they're patching Python+pip anyway) and so there really isn't much to gain by adding additional distributions. Testing multiple architectures also isn't super impactful -- pip is a pure-Python program and so are all of our dependencies. The "main" bit that's architecture-dependent is On the other hand, back when I wrote that, our CI used to take well over an hour to run tests for our existing test matrix per PR (if I remember correctly). That sort of feedback timeline basically KILLS productivity -- it'd suck in general for any software project but especially for a volunteer driven one like pip. I used to push a change and go do something else for an hour, simply because that's how long it took to get a CI result. As @brainwane noted, a fair amount of work has gone into getting those times down and bringing sufficient clarity to the CI (especially when there's failures). And even then, the CI times are in the push-and-go-have-a-meal territory, despite using RAMDisks on certain platforms and a multi-CI-provider approach to maximise the "workers" we use. The trade-offs are CI times vs CI support matrix size. It'd be amazing to have short times and large support matrices but the reality is that we don't have buckets of money earmarked "pip stuff" going around. [1] Adding more stuff to the CI matrix would only make the time situation worse, unless we get additional CI resources -- and as the person who has to wait, I obviously don't like that. ;) Which brings me to things like #9103 -- I'm very on board for so much more of this. External CI resources provided/donated by organizations with an interest in ensuring pip works well. 
As I've used a lot of words to say above -- right now, we're really hard-pressed on the CI resources situation and we could really use additional CI resources, to increase our CI matrix as well as to improve the developer experience. [1]: If you know someone/some organisation that'd be willing to do so, please do let us know. I'm sure PSF's Packaging-WG will figure out some way to put it to good use. As an example related to this issue, ~2-3 weeks of an experienced Python dev working to improve our test suite significantly, allowing for faster feature development and better sustainability for the project. Also, we've got a list for even more impactful projects, if that's more interesting. :) |
Coming with a good number of years of contributing to OpenStack, I find it really funny that a 1h delay sounds long. On OpenStack we do have cases where it takes even more than 24h for a particular change to be checked or gated; it is not common, but we have jobs that run for 2-3h with an empty queue. I think it is a bad development practice to optimize for time to pass CI by lowering the test matrix. Developers should have patience with their patches and also perform a decent amount of local testing before they propose a change. The bigger the risks, the bigger the test matrix should be, and I do find pip to be one of the most important projects in the Python ecosystem. If a bug slips in that affects even 0.1% of users, that is a serious issue. So please do not advocate for quick merges. The reality is that having a patch reviewed by humans, preferably at least two, requires far more time than the CI, so the time to run CI is not the real bottleneck most of the time. Also, I do find it quite dangerous to have merges happening too fast if it does not allow others to review them. In fact I would personally wish GitHub had a configurable cool-down time, which a project can configure, preventing merges unless at least a number of hours have passed. |
The main problem I have with long-running CI checks is that they cause wasted cycles. I work on a lot of projects concurrently, and tend to completely switch to another task after I finish working on pip, and would not check back for a long time. I believe most pip maintainers work the same way as well. This means a failing CI tends to make the code miss one precious opportunity to get reviewed, and it has to sit in the queue for considerably longer than needed. If the CI could report more quickly, I would be able to afford waiting some extra time before switching tasks, to avoid the PR dropping out of the cycle. It may be counter-intuitive, but IMO long CI duration is a problem with pip exactly because pip PRs tend to need more time to get proper reviews, not the other way around. Projects with more review effort can afford longer CI durations because missing some of the review opportunities is less problematic. I am not familiar with OpenStack and do not know how it compares to pip, but “long CI is not a problem since human reviews take longer” does not seem to be the correct conclusion to me. |
As a general note, I do feel like everything that has to be said here in terms of how everyone involved feels about $thing and $approach has been said. Instead of an extended discussion about the trade offs at play here, I’m more interested in breaking this issue into a list of action items, making dedicated issues for those and closing this. If someone else would like to get to making those action items before I do (at least 2 weeks from now), they’re welcome to! |
Here's a starting list:
That's a 5-minute brain dump of high-level, but hopefully small enough to be actionable, ideas. I don't have the time to manage any of these items, so I'm just throwing them in here in the hope that someone has the bandwidth for this sort of meta-activity. |
And, this is done now with #9759. You have been warned about lots of issue tracker churn this weekend. :) |
Of our combinations of supported interpreters, OS and architecture, we are currently only testing a few, without a clear strategy (cf https://github.com/pypa/pip/pull/7247/files)
The goal would be to come up with a bunch of rules like: