
Consider (and attempt) blessing snapshot runs to "release" status #352

Open · 3 tasks
kltm opened this issue Jan 17, 2024 · 13 comments
@kltm (Member) commented Jan 17, 2024

Look at blessing snapshots to release, to:

No new libraries or technologies. The only "interesting" additions would likely be:

  • extending the current Zenodo scripts to assemble an upload package from a given snapshot and push it
  • giving snapshots an associated date and location
  • figuring out what to do about testing
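The Zenodo-script extension above might start with something like the following minimal sketch. The function name and manifest shape are my own naming, not the pipeline's, and the actual push to the Zenodo deposition API is deliberately omitted.

```python
from pathlib import Path

def assemble_upload_package(snapshot_dir, snapshot_date):
    """Walk a local copy of a snapshot and build a manifest that an
    upload script could iterate over when pushing files to Zenodo.
    The manifest also carries the snapshot's associated date."""
    root = Path(snapshot_dir)
    files = sorted(p for p in root.rglob("*") if p.is_file())
    return {
        "snapshot_date": snapshot_date,
        "files": [str(p.relative_to(root)) for p in files],
    }

# The push itself would go through the Zenodo deposition REST API with an
# access token; that part depends on the real scripts and is omitted here.
```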
@kltm kltm self-assigned this Jan 17, 2024
@kltm (Member, Author) commented Mar 4, 2024

Noting that we already have a week-long holding pen for snapshots built in for debugging, during the "Publish" step. If I switch these over to auto-cleaning by bucket policy, they would give us a clean jump-off point for the manual publication that we're already doing because of the Zenodo instability. This holding pen could be arbitrarily extended from a week to however long we want.
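For reference, an autoclean by bucket policy like the one described can be expressed as an S3 lifecycle rule. A minimal sketch in the shape boto3 accepts; the rule ID and the commented-out bucket name are my assumptions, not the actual pipeline configuration:

```python
def holding_pen_lifecycle(days=7):
    """Lifecycle configuration expiring held snapshot objects after the
    given number of days (the week-long holding pen)."""
    return {
        "Rules": [{
            "ID": "expire-held-snapshots",   # illustrative rule name
            "Status": "Enabled",
            "Filter": {"Prefix": ""},        # apply to the whole bucket
            "Expiration": {"Days": days},
        }]
    }

# Applying it might look like:
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="go-data-products-daily",
#       LifecycleConfiguration=holding_pen_lifecycle())
```

Extending the pen is then just a matter of raising `days`.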

While this falls well short of a full after-the-fact "blessing" system, it is very much in line with current practices, and I believe that by changing a few lines of the current manual release SOP, we could bring up a successful snapshot.

@pgaudet What are the minimum indicators you need before knowing if a snapshot is worthwhile? Would you be able to look at the stats and, if it looks okay, let me know and I could put it out on the experimental AmiGO so you could take a closer look? How would letting you know work? Could I just sign you up for all success snapshot run emails and you get back to me when the timing feels right? If this kind of thing might work for you, I think I have a fairly quick way forward:

  • add better hygiene to the snapshot holding pen (i.e. go-data-products-daily), so that only intended files are kept
  • create a release SOP to move a held daily to Zenodo, publish, and deploy
  • pause/remove release code
  • add bits to warn you of passing snapshots
  • remove the release amigo-exp deployment; create a manual SOP that aims at a specified daily bucket

@kltm (Member, Author) commented Mar 6, 2024

7-day existence rule added; we should see results very soon.

@kltm (Member, Author) commented Mar 6, 2024

The dailies now auto-clean. Moving forward, we can use these as a clean base, within a week, to create a release.

@pgaudet (Contributor) commented Mar 11, 2024

@kltm

> What are the minimum indicators you need before knowing if a snapshot is worthwhile? Would you be able to look at the stats and, if it looks okay, let me know and I could put it out on the experimental AmiGO so you could take a closer look? How would letting you know work? Could I just sign you up for all success snapshot run emails and you get back to me when the timing feels right?

The same procedure as we have now for the release seems appropriate:

  1. I get a notification that a release/snapshot is ready to be checked. Note that having the data on some experimental AmiGO is required for the checks to be carried out.
  2. I look at the stats, and if all is OK, I notify you. Right now this communication is by email; we can change that if needed.

Does that answer all the questions?

Thanks, Pascale

@kltm (Member, Author) commented Mar 11, 2024

Talking to @pgaudet this morning, until we've run through this a couple of times to work out the kinks (or have a machine that gets us back to where we were), we'll:

  • set up Pascale to receive snapshot success emails
  • Pascale will look at reports from a snapshot when the timing feels right for a release
  • within a week, she will let Seth know that things are looking okay
  • Seth will put the candidate onto amigo-exp
  • if it gets a thumbs-up from Pascale, Seth will promote the snapshot to a release

@kltm (Member, Author) commented Mar 16, 2024

Okay, after a little consideration, I think I may have some "easy" ways forward, although any one of them might take a day or so to put together. Essentially, the issue is a bad docker/jenkins interaction. I can now see a few ways to bypass it:

  1. Break the pipeline into two pieces, pre-index and post-index, and do the middle part (essentially) manually. While labor-intensive, this is nearly guaranteed to be tractable.
  2. Set a pipeline (snapshot) to use a single standing docker instance to build the index. The possible issue here is that "remote controlling" docker may be a big PITA, but we would bypass the interaction bug and keep full automation.
  3. Break the Solr load into smaller pieces that should individually not have the footprint to stop things. I think this would likely work, but it would be slow to test.
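Option 3 boils down to batching the load so that no single piece has a large footprint. A minimal sketch of just the splitting step (batch size and the idea of a per-batch load call are my assumptions):

```python
def batches(items, size):
    """Yield successive fixed-size batches of a load list, so that each
    individual Solr load stays below the problematic memory footprint."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

# Each yielded batch would then be handed to its own load invocation,
# rather than loading everything in one shot.
```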

@kltm (Member, Author) commented Mar 19, 2024

Actually, poking around in this, I think I'm going to try something else first:
4. "catch" the error, wait, and then continue; going to take a look at the Jenkins docs but, IIRC, this is supported
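A Jenkinsfile-style pseudocode sketch of that catch-wait-continue idea; the step contents, retry count, and timing are purely illustrative, not the actual pipeline code:

```groovy
// Pseudocode: catch the docker/jenkins failure, wait, and let retry()
// run the body again rather than failing the whole stage.
retry(3) {
    try {
        sh 'make solr-load'          // placeholder for the failing step
    } catch (err) {
        sleep(time: 10, unit: 'MINUTES')
        throw err                    // rethrow so retry() re-runs the body
    }
}
```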

@kltm (Member, Author) commented Mar 19, 2024

Also, clarifying for "3": to make this work, the whole image would have to be dropped and stood back up. If we go that way, there will be some temporary repetition, and we may have to introduce template functions to bypass the string limit we will almost immediately smack into.

kltm added a commit that referenced this issue Mar 19, 2024
@kltm (Member, Author) commented Mar 19, 2024

Looking at the failure messages, and understanding that this is happening at a stage level (not a step level), I think I can change tack a little.
I've created a new pipeline, snapshot-post-fail; it has the following properties:

  • all stages through to the mega-make have been removed
  • blanket replacement of "$BRANCH_NAME" with "snapshot"
  • initialize() and watchdog() removed
  • in script conditionals (if/else), wherever there is a 'snapshot', a 'snapshot-post-fail' is added

I believe what this should allow me to do is "hijack" the snapshot run with the new pipeline, picking up where the failed (but data-wise sound) run terminated.
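The "blanket replacement" above is essentially a textual substitution over the pipeline definition; a trivial sketch (the function name is mine, and the real edit was of course made in the repository, not at runtime):

```python
def derive_post_fail_pipeline(jenkinsfile_text):
    """Blanket-replace "$BRANCH_NAME" with the literal "snapshot", as
    described for deriving snapshot-post-fail from the snapshot pipeline."""
    return jenkinsfile_text.replace("$BRANCH_NAME", "snapshot")
```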

@kltm (Member, Author) commented Mar 19, 2024

@kltm (Member, Author) commented Mar 20, 2024

Cheers to @dustine32 for helping me out with a code review. Issues that I'll fix before proceeding:

  • match metadata to snapshot, specifically TARGET_BUCKET
  • re-add watchdog(), because I screwed up above
  • change the "when" variables in "Publish" to let snapshot-post-fail through

@kltm (Member, Author) commented Mar 26, 2024

@pgaudet I believe a snapshot has now gone through using the modified pipeline. Would you be able to briefly review it? If it seems solid, we can either 1) attempt the new "promotion" procedure, where we try to take a snapshot and make it a release, or 2) do the same thing we did here for release, giving us a very, very high probability of success.

@kltm (Member, Author) commented Apr 4, 2024

Noting that I'm now working towards something between the two options above.
Essentially, I will be taking the release pipeline, removing the first part of it, and replacing it with a "copy from snapshot". We can refine this model and its timing, but it's a huge improvement over what we have now (nothing).
(@dustine32 I'll be hunting you down in the next day or so for a review of that change and as a sanity check.)
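The "copy from snapshot" stage could amount to an object-store sync from the holding pen into the release location. A sketch that just builds the aws-cli invocation; the destination bucket name and the date-prefixed layout are assumptions, not the actual buckets:

```python
def copy_from_snapshot_command(source_bucket, dest_bucket, date):
    """Build the aws-cli call that would copy a held snapshot into the
    release location, replacing the removed front half of the pipeline."""
    return [
        "aws", "s3", "sync",
        f"s3://{source_bucket}/{date}/",
        f"s3://{dest_bucket}/",
    ]
```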

Status: In Progress
No branches or pull requests
2 participants