Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify UpdateReadyForTesting, publish it on edits that change builds and on update creation #5538

Merged

Conversation

AdamWill
Copy link
Contributor

@AdamWill AdamWill commented Nov 21, 2023

This PR does three things that overall aim to address the issues discussed in https://pagure.io/fedora-ci/general/issue/436 (it's rather complicated for test systems to trigger tests appropriately for updates, and Fedora CI doesn't do the best job of it right now):

  1. It simplifies the UpdateReadyForTesting message (as a new V3), dropping most of the additional data that was included in the message initially at the request of the CI team in https://pagure.io/fedora-ci/general/issue/70 , but which the CI triggers aren't really using.
  2. It no longer publishes UpdateReadyForTesting when an (RPM-based) update's status is set to testing. Instead:
  3. It publishes UpdateReadyForTesting any time an (RPM-based) update is edited and the builds in it change.
  4. It publishes UpdateReadyForTesting when an (RPM-based) update is created.

Taken together, these changes should allow test systems simply to run tests any time they see an UpdateReadyForTesting message, and get optimal scheduling behavior (tests run on update creation, edit with changed builds, and 'Re-Trigger Tests' request from the web UI). Currently Fedora CI relies solely on the UpdateReadyForTesting messages, but this means it doesn't run tests when an update is created or edited until up to 24 hours later (because UpdateReadyForTesting is only published when the update is pushed to testing). openQA and FCOS CI handle scheduling better, but have to listen to multiple messages and implement complex logic to do it; this will allow us to radically simplify their scheduling code.

@AdamWill AdamWill requested a review from a team as a code owner November 21, 2023 23:59
@AdamWill AdamWill force-pushed the consistent-test-scheduling-message branch 5 times, most recently from b5fed07 to 08fee8b Compare November 22, 2023 00:31
@AdamWill AdamWill force-pushed the consistent-test-scheduling-message branch 2 times, most recently from 47554aa to 06e9113 Compare November 22, 2023 00:48
@AdamWill AdamWill force-pushed the consistent-test-scheduling-message branch from 06e9113 to 6c5a0b0 Compare November 22, 2023 00:53
@AdamWill
Copy link
Contributor Author

sigh, why is this so hard. okay, i know what's going on, it's just, ugh. will poke some more tomorrow.

@AdamWill
Copy link
Contributor Author

AdamWill commented Nov 22, 2023

what's going on is, we just basically use the entire Update instance to 'be' the update property of the message (mediated through database model tomfoolery) and the builds attribute of an Update instance is a list of Build model instances, and Build model instances don't have a task_id attribute, so this just isn't working. i wasn't totally grokking that that's really really how this works. i will think about it again tomorrow.

@AdamWill
Copy link
Contributor Author

on the whole, having thought it over a bit, I decided this approach is basically garbage. but I needed to go through it to work that out! gonna try a different approach tomorrow.

@AdamWill AdamWill force-pushed the consistent-test-scheduling-message branch 10 times, most recently from 6350be1 to defba00 Compare November 24, 2023 02:42
@AdamWill
Copy link
Contributor Author

goddamnit, sending a message on update creation turns out to be weirdly difficult!

@AdamWill
Copy link
Contributor Author

sigh, current attempt is going insanely slow as it's attempting to publish messages. working on a workaround for this locally.

@AdamWill AdamWill force-pushed the consistent-test-scheduling-message branch 3 times, most recently from 78d6cd1 to 6cdf46b Compare November 24, 2023 05:02
@AdamWill
Copy link
Contributor Author

This time?!

@AdamWill AdamWill changed the title WIP DO NOT MERGE add task ID to UpdateReadyForTesting WIP: Simplify UpdateReadyForTesting, publish it on edits that change builds and on update creation Nov 25, 2023
@AdamWill
Copy link
Contributor Author

woohoo! Still gonna keep this marked as WIP for now as I need to look over it for anything that needs tidying up, make sure there's enough testing of the new behaviour, and write a changelog note.

@mattiaverga mattiaverga added the 8.0-backports Mergify uses this to backport patches to 8.0 label Nov 29, 2023
@mattiaverga
Copy link
Contributor

I've just branched 8.0 and tagged this to be backported in it... let me know if the needed changes will land in time or if we should move this to 8.1

BTW can you add a note for the release notes to the PR?

@AdamWill
Copy link
Contributor Author

AdamWill commented Nov 29, 2023

yeah, sorry, I've just been distracted by a couple of different things the last couple days so didn't get to polishing this. I did mention above that writing a release note is one of the things that needs doing. edit: duh, I see the trigger PRs were merged! yay for that.

See https://pagure.io/fedora-ci/general/issue/436 . This isn't
strictly necessary, but it simplifies things and makes them
more consistent.

We introduce UpdateReadyForTestingV3, which now inherits from
UpdateMessage and is more similar again to all the other
messages. It drops quite a lot of the extra data that V1 and V2
contained because, AFAICS, the Fedora CI triggers never actually
used it. It retains only what they need - a stripped version of
the artifact dict with stripped build dicts. Really the triggers
only need the task ID and build ID from these build dicts.

Signed-off-by: Adam Williamson <awilliam@redhat.com>
@AdamWill AdamWill force-pushed the consistent-test-scheduling-message branch from becd8ae to db362b1 Compare November 29, 2023 17:27
@AdamWill AdamWill changed the title WIP: Simplify UpdateReadyForTesting, publish it on edits that change builds and on update creation Simplify UpdateReadyForTesting, publish it on edits that change builds and on update creation Nov 29, 2023
@AdamWill
Copy link
Contributor Author

For the historical record here, I'll note #3428 was the original ticket that led to the initial design here (where we publish the message on status change to 'testing').

That ticket states "Updates are manually created in a "Pending" state, where they are signed before they are pushed to a "Testing" where they are tested by the CI system. So we can't trigger the CI system off new updates being created."

I tried to find any further background behind that statement, but couldn't. My best read is that it expects "the CI system" either needs updates to be in updates-testing, or to be signed, before they can be tested.

This is definitely not the case for openQA, and per discussion in https://pagure.io/fedora-ci/general/issue/436 and review of the Fedora CI code, I don't believe it's true for Fedora CI either. AFAICS, like openQA, Fedora CI finds the builds in the update and retrieves them from Koji for testing, which does not require that they are in updates-testing or signed. I believe FCOS CI is similar also.

See https://pagure.io/fedora-ci/general/issue/436 . This is the
behaviour test systems want, based on real-world experience.
We do not want test systems to wait for updates to reach the
updates-testing repo before they test; we want them to test as
soon as the update is created. All the test systems that exist
pull the builds to be tested from Koji, they do not rely on them
being present in updates-testing.

Waiting for the update to reach updates-testing before tests run
can mean a wait of up to 24 hours, which is obviously not ideal
when people want test results fast. It also means we can't gate
push to updates-testing on tests that trigger on this message.

openQA currently listens out for other messages to use as a
proxy for 'update was created' in order to mitigate this problem,
but Fedora CI does not do so, and consequently its tests often
trigger late. This change, along with the earlier change to
publish this message when an update's builds are changed, will
allow us to simplify openQA's logic considerably (it can just
trigger any time it sees this message) and improve CI's
scheduling considerably without any need to make changes to the
trigger logic on the CI side.

Signed-off-by: Adam Williamson <awilliam@redhat.com>
Test systems should re-test an update if it's edited and its
builds change. Before b33609d
we could rely on UpdateReadyForTesting being sent (after a delay)
when this happened, because editing the builds causes the update
to be sent back to pending state and then to testing state, and we
were publishing the message any time an update hit testing state.
But now we publish the message on update creation instead of when
the update reaches testing status, that is no longer the case. So
instead we should directly publish UpdateReadyForTesting when an
update is edited and the builds change.

Signed-off-by: Adam Williamson <awilliam@redhat.com>
Signed-off-by: Adam Williamson <awilliam@redhat.com>
@AdamWill AdamWill force-pushed the consistent-test-scheduling-message branch from 85b9b53 to d6c257a Compare November 29, 2023 18:08
@AdamWill
Copy link
Contributor Author

OK, I think (hope?) this is ready now.

Copy link
Contributor

@mattiaverga mattiaverga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@jlebon
Copy link
Contributor

jlebon commented Feb 1, 2024

Do I understand correctly that this is removing the .artifacts.builds[].component field? Was hoping as part of the simplification of coreos/coreos-ci#53 that I'd be able to always rely on that (which previously wasn't in all the messages that we triggered on). Currently using a gnarly regex on the nvr field. I could keep doing that, but... seems like a desirable field to have for anything listening and triggering on these messages?

@AdamWill
Copy link
Contributor Author

AdamWill commented Feb 1, 2024

yes, it does do that; I wasn't aware you were using it.

You shouldn't need to use a gnarly regex to parse an NVR. RPM does not allow dashes in version or release fields - see the Version and Release paragraphs under Preamble - so you can always just split it from the right. In Python, this should always get the N:

nvr.rsplit("-", 2)[0]

is it difficult to do that in Jenkinsfile syntax?

@jlebon
Copy link
Contributor

jlebon commented Feb 1, 2024

is it difficult to do that in Jenkinsfile syntax?

Right, exactly. The problem is working within the constraints of the message selector syntax. It's the difference between having the job trigger on every Bodhi message and then have imperative code doing this type of checking and exiting early vs having the messaging plugin do the filtering for us.

@jlebon
Copy link
Contributor

jlebon commented Feb 1, 2024

yes, it does do that; I wasn't aware you were using it.

To clarify, we're not using it currently. I had plans to use it but it's definitely not a blocker.

Happy to peel this off into an RFE ticket for now.

@AdamWill
Copy link
Contributor Author

AdamWill commented Feb 1, 2024

Okay, I see.

The behind-the-scenes here is: Bodhi has a really nice, elegant system for producing the update dict in its messages. When constructing the message you literally just pass an instance of its Update class and it all gets serialized for you, mostly magically.

The artifact dict is, uh. Not that! It gets pulled together kind of awkwardly on the fly when publishing just this one specific message which is different from all the other dozen or so messages Bodhi publishes that relate to updates. That function has to exist, anything that publishes this message has to call it, it's just kinda ugly compared to the nice way it works for all the other messages.

In this PR I really actually wanted to throw the artifact dict away entirely, but sadly I decided it was a bit too difficult for now because the CI pipelines use some info from it which isn't easily available otherwise. Your situation is similar, I guess, so we could add it back, I just kinda hate to because it's going in the opposite direction of the one I still do ultimately wish we could go in (get rid of this special dict entirely and have UpdateReadyForTesting be just like all the other messages).

@jlebon
Copy link
Contributor

jlebon commented Feb 5, 2024

Gotcha. I'm not particularly attached to the artifact dict myself. Could it be added to .update.builds[] instead? We already have .nvr there, so we have that info handy. :)

@AdamWill
Copy link
Contributor Author

AdamWill commented Feb 5, 2024

possibly, but the issue with that is because of the neat way the message dict for the update is created, adding things to it isn't necessarily straightforward (it can mean a database schema change). I'll try and find a minute to look into how that works in more detail and see what the implications of adding it there are.

@AdamWill
Copy link
Contributor Author

AdamWill commented Feb 5, 2024

For anyone playing along at home, this is now deployed to prod. jlebon and I have updated the FCOS and openQA schedulers respectively. Fedora CI doesn't need any scheduler changes but should start exhibiting improved behaviour (scheduling tests faster when updates are created or edited).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
8.0-backports Mergify uses this to backport patches to 8.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants