[kbn-es] Elasticsearch snapshot automation #52016

Closed
tylersmalley opened this issue Dec 2, 2019 · 16 comments · Fixed by #53706

Comments

@tylersmalley
Contributor

Relying on the release manager for our ES snapshots has had some issues we are looking to resolve.

  • A breaking change being merged into Elasticsearch causes CI to be in a failed state
  • The gap between successful release manager builds can be long (over a month)
  • When bumping versions, we must pin the snapshot to a previous artifact until the release manager produces a successful subsequent build.

To address this, we should create a nightly job that builds Elasticsearch and runs the Kibana tests against it. If the tests pass, we will promote the ES snapshot to a Google Cloud Storage bucket. If they fail, we need to be notified, either by an issue being created or by a Slack notification to the operations channel.
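For illustration, the nightly flow could be wired together roughly like this (the helper functions are hypothetical placeholders, not existing kbn-es or CI APIs):

```ts
// Hypothetical orchestration of the nightly job. All four helpers are placeholders
// (not real kbn-es or CI APIs); their actual implementations would live in the pipeline.
type Snapshot = { branch: string; buildId: string };

declare function buildEsSnapshot(branch: string): Promise<Snapshot>;
declare function runKibanaTests(snapshot: Snapshot): Promise<boolean>;
declare function promoteToGcsBucket(snapshot: Snapshot): Promise<void>;
declare function notifyOperations(message: string): Promise<void>;

export async function nightlyEsSnapshot(branch: string): Promise<void> {
  const snapshot = await buildEsSnapshot(branch); // build ES from source for the branch
  const passed = await runKibanaTests(snapshot);  // run the Kibana suites against it

  if (passed) {
    await promoteToGcsBucket(snapshot);           // make this the snapshot kbn-es downloads
  } else {
    await notifyOperations(`ES snapshot failed verification for ${branch}`); // issue or Slack
  }
}
```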

@tylersmalley tylersmalley added the Team:Operations Team label for Operations Team label Dec 2, 2019
@elasticmachine
Contributor

Pinging @elastic/kibana-operations (Team:Operations)

@tylersmalley
Contributor Author

kbn-es stores the archive in .es/cache. If we were to add an option to kbn es source to generate archives for all the platforms, we could push those to the bucket.
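Something like the following could do the push, assuming a hypothetical bucket name and the existing `.es/cache` layout, using the `@google-cloud/storage` client (a sketch, not the actual kbn-es code):

```ts
import { readdirSync } from 'fs';
import { resolve } from 'path';
import { Storage } from '@google-cloud/storage';

// Hypothetical bucket name and destination prefix; .es/cache is where kbn-es
// already keeps the archives it has built or downloaded.
const CACHE_DIR = resolve('.es/cache');
const bucket = new Storage().bucket('kibana-ci-es-snapshots');

export async function uploadCachedArchives(branch: string, commit: string): Promise<void> {
  for (const archive of readdirSync(CACHE_DIR)) {
    await bucket.upload(resolve(CACHE_DIR, archive), {
      destination: `${branch}/${commit}/${archive}`,
    });
  }
}
```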

@tylersmalley
Contributor Author

@spalger @brianseeders feel free to add any additional details.

@brianseeders
Contributor

brianseeders commented Dec 2, 2019

Would we need to run the full suite of tests, except for Firefox, visual regression, and accessibility? e.g. intake tests (no linting, etc.), API integration tests, and Chrome functional tests?

@tylersmalley
Contributor Author

That should be sufficient. Running only what can be affected by ES should limit the number of flaky tests we run into.

@spalger
Contributor

spalger commented Dec 4, 2019

I personally think email notifications are sufficient, but otherwise I agree. I like the name "es snapshot validation" for this job...

@brianseeders
Contributor

Thoughts on this as a game plan? @tylersmalley @spalger

ES Snapshot plan

ES Snapshot Build Job:

  • One job that has a param for which branch to build (instead of one job per tracked branch)

Process:

  • Checkout ES source for given tracked branch
  • Build artifacts for all platforms/licenses (similar to docs here)
  • Possibly run some smoke tests? Are there already some quick ES smoke tests easily invokable via gradle?
  • Upload all artifacts to GCS, identified by branch and commit hash or similar
  • If build or possible smoke tests failed:
    • Alert Operations team
  • If everything successful:
    • Copy artifacts to <branch>-latest (see the sketch after this list)
      • i.e. this always gets updated in-place with the latest successfully-built artifact
    • Kick off downstream Kibana ES Snapshot Verification job
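As a sketch of the copy-to-latest step referenced above (the bucket name and path layout are assumptions, not an existing convention):

```ts
import { Storage } from '@google-cloud/storage';

const bucket = new Storage().bucket('kibana-ci-es-snapshots'); // hypothetical bucket

// Copy every object under <branch>/<commit>/ to <branch>-latest/, so the well-known
// location is updated in place once a build (and any smoke tests) succeeds.
export async function copyToLatest(branch: string, commit: string): Promise<void> {
  const [files] = await bucket.getFiles({ prefix: `${branch}/${commit}/` });
  for (const file of files) {
    const dest = file.name.replace(`${branch}/${commit}/`, `${branch}-latest/`);
    await file.copy(bucket.file(dest));
  }
}
```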

Kibana ES Snapshot Verification job:

  • One job that has a param for which branch to verify

Process:

  • Run mostly the normal pipeline, except:
    • Only run the tests/steps required to validate ES (e.g. can ignore Firefox Smoke)
    • Use the <branch>-latest snapshot built in the previous step
    • Possibly add retry to the test suites
  • If tests fail:
    • Alert Operations team
  • If tests succeed:
    • Promote the snapshot by copying it to <branch>-latest-verified or similar

Misc

  • Create a property in package.json for snapshot version overrides; if it is blank or missing, just use <target branch>-latest-verified (see the resolution sketch after this list)
  • We could build a mechanism for pausing build promotions, but it would only be useful if there is a short-lived problem not caught by the verification job
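A sketch of how that override could be resolved (the `esSnapshotOverride` field name is a hypothetical example; only the fallback to `<target branch>-latest-verified` is part of the plan above):

```ts
import { readFileSync } from 'fs';

// Resolve which snapshot location kbn-es should use: an explicit override from
// package.json if present, otherwise the <branch>-latest-verified default.
// The esSnapshotOverride field name is a hypothetical example, not a real property.
export function resolveSnapshotLocation(branch: string, pkgJsonPath = 'package.json'): string {
  const pkg = JSON.parse(readFileSync(pkgJsonPath, 'utf8'));
  const override: string | undefined = pkg.esSnapshotOverride;
  return override && override.trim() !== '' ? override : `${branch}-latest-verified`;
}
```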

@spalger
Contributor

spalger commented Dec 13, 2019

Sounds good to me, though I'm concerned that by using well-known GCS names we have two issues:

  • no history; rolling back requires restarting the process
  • consumers that pull assets while uploads are in progress might get the wrong shasum for the artifact, and have to wait some unknown amount of time to get the right combination of asset and shasum

I think I'd prefer a slightly more complicated solution: a JSON manifest in a known location that the jobs update once they've uploaded the assets/shasums to a unique location, something like the following (a possible manifest shape is sketched after the list):

  • ES Snapshot Build Job
    • Build (I don't think test is needed)
    • Upload to unique location
    • trigger child job with parameters for unique location and branch name
  • Kibana ES Snapshot Verification Job
    • Pull from unique location
    • Run needed tests
    • Copy JSON manifest to manifest.bak-{jobNum} or something
    • Overwrite JSON manifest to point to verified assets
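For illustration, the manifest could be as small as a pointer to the verified assets; the shape below is a hypothetical sketch, not a settled format:

```ts
// Hypothetical manifest shape: one file per branch at a well-known path, pointing
// to the uniquely-named upload location that passed verification.
interface EsSnapshotManifest {
  branch: string;
  buildId: string;      // the unique location the build job uploaded to
  commitHash: string;   // the ES commit the snapshot was built from
  archives: Array<{
    platform: string;   // e.g. 'linux', 'darwin', 'windows'
    url: string;        // full GCS URL of the archive
    checksum: string;   // shasum uploaded alongside the asset
  }>;
}

// Example values are made up for illustration only.
const example: EsSnapshotManifest = {
  branch: '8.0',
  buildId: '8.0-abc123-20191213',
  commitHash: 'abc123',
  archives: [
    {
      platform: 'linux',
      url: 'https://storage.googleapis.com/example-bucket/8.0-abc123-20191213/elasticsearch-linux-x86_64.tar.gz',
      checksum: '…',
    },
  ],
};
```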

@brianseeders
Contributor

I don't see a way to do atomic operations on a group of files in GCS, so I think you're right, we'll have to use a manifest. It shouldn't add too much more complexity.

@tylersmalley
Contributor Author

My concern with the proposed manifest approach is that we then need to come up with a solution to clean up the builds. We can't simply rely on TTLs like the release manager uses, as we want builds to exist for versions/branches which are no longer being built continuously. Could we limit retention to only the previous, current, and latest builds?

@brianseeders
Contributor

@tylersmalley is there a reasonable timeframe for how long a snapshot should exist after we stop building a branch? I was actually looking at keeping all of the snapshots and just having them automatically deleted from GCS after a certain period of time.
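GCS can handle that automatic deletion with a bucket lifecycle rule; a minimal sketch, assuming a hypothetical bucket name and a 45-day retention period:

```ts
import { Storage } from '@google-cloud/storage';

// Configure the daily-snapshot bucket to auto-delete objects after N days.
// The bucket name and the 45-day age are assumptions for illustration.
async function configureSnapshotRetention(): Promise<void> {
  const bucket = new Storage().bucket('kibana-ci-es-snapshots');
  await bucket.setMetadata({
    lifecycle: {
      rule: [{ action: { type: 'Delete' }, condition: { age: 45 } }],
    },
  });
}
```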

@tylersmalley
Contributor Author

My goal is that yarn es snapshot would work even on old branches which are no longer under active development/support. There are folks that still create plugins for these versions, and we often use these branches to test a support issue. Having a single way to bring ES up is useful, without having to know ahead of time whether the branch is still under maintenance. Currently, we have this note in the contributing docs because the snapshots expire.

I see a few options:

  • We keep a build indefinitely for each version
  • We only keep a manifest, which points to the last public release

@brianseeders
Contributor

I could just upload the most recently verified snapshot for each version to a second bucket that acts as permanent storage. Then, when we check for a manifest, fall back to it if the daily one 404s. Pretty simple, and there wouldn't be any cleanup logic or pinning needed anywhere.
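The lookup order on the kbn-es side could look something like this (the manifest URLs are hypothetical examples, and a global `fetch` is assumed, e.g. Node 18+ or a polyfill):

```ts
// Try the daily bucket's manifest first; if it 404s (e.g. the branch is no longer
// built), fall back to the permanent-storage bucket. URLs are hypothetical examples.
const DAILY_BASE = 'https://storage.googleapis.com/example-es-snapshots-daily';
const PERMANENT_BASE = 'https://storage.googleapis.com/example-es-snapshots-permanent';

export async function fetchManifest(branch: string): Promise<unknown> {
  for (const base of [DAILY_BASE, PERMANENT_BASE]) {
    const res = await fetch(`${base}/${branch}/manifest.json`);
    if (res.ok) {
      return res.json();
    }
    if (res.status !== 404) {
      throw new Error(`Unexpected response ${res.status} fetching manifest from ${base}`);
    }
  }
  throw new Error(`No ES snapshot manifest found for branch ${branch}`);
}
```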

@tylersmalley
Contributor Author

👍 I like that

@brianseeders brianseeders self-assigned this Dec 14, 2019
@brianseeders
Contributor

How should we be notified for ES snapshot build failures?
What about for verification failures?

Start with email for both and go from there?

If email, do you think build-kibana@elastic.co is a good place for these? Or should we make another group?

@tylersmalley
Contributor Author

Yeah, email to build-kibana@elastic.co is a good place to start.
