Support tracks across multiple versions of Elasticsearch #69

Closed
danielmitterdorfer opened this issue Mar 21, 2016 · 0 comments
Labels: enhancement (Improves the status quo), :Usability (Makes Rally easier to use)

danielmitterdorfer commented Mar 21, 2016

Concept

Since b5b4911 we ensure that Rally always uses the most recent mapping file. However, a track has more version-dependent parts than just the mapping:

  • Configuration files
  • The queries that can be executed and possibly also the concrete query syntax
  • Data format

In #26 we will externalize all these parts into a descriptive track specification. In this ticket we will ensure that users can use these tracks across multiple versions of Elasticsearch. For now, the earliest supported version will still be Elasticsearch 5.0.0-alpha1. The limiting factor is not the track specification but cluster provisioning in Rally, which is a separate topic that we will not tackle at the same time.

Assumptions

We will assume that users either benchmark a specific released version of Elasticsearch (e.g. they run esrally --pipeline=from-distribution --distribution-version=5.0.0-alpha2) or actively develop on core and benchmark a source build of Elasticsearch (e.g. esrally --revision=latest). Hence, the only scenarios that we will officially support are:

  • Compatibility with the latest master branch version of Elasticsearch
  • Compatibility with a previously released version of Elasticsearch

Solution

Data Management

All of this can be achieved by using git under the hood for managing track specifications. The track repository needs to contain a few branches:

  • master: we should ensure that it will always work with the latest master branch version of Elasticsearch
  • One branch for each released version of Elasticsearch. The branch name will correspond to the URL-friendly name for the download of the distribution itself (e.g. for the URL-friendly name "5.0.0-alpha2", a branch "5.0.0-alpha2" will also exist in the repo).
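The branch convention above can be sketched as a small lookup. This is only an illustration: the function name track_branch_for and the use of None to signal a source build are assumptions, not part of Rally.

```python
from typing import Optional


def track_branch_for(distribution_version: Optional[str]) -> str:
    """Pick the track repository branch for a benchmark run (hypothetical helper).

    distribution_version is the URL-friendly version name of a released
    distribution (e.g. "5.0.0-alpha2"), or None for a source build.
    """
    if distribution_version is None or not distribution_version.strip():
        # Source builds are benchmarked against the latest track definitions.
        return "master"
    # Released versions map one-to-one onto a branch of the same name.
    return distribution_version.strip()
```

For example, track_branch_for("5.0.0-alpha2") yields the branch "5.0.0-alpha2", while track_branch_for(None) falls back to "master".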

The repo will contain one folder per track, named after the track. Which file(s) each folder contains is defined by #26.

This approach also gives us a few additional benefits:

  • Error corrections or enhancements in track specifications can easily be added over time. If you make a mistake, just fix it on the affected branch and cherry-pick the fix to all earlier branches that are also affected (e.g. fix on master and cherry-pick to the "5.0.0-alpha2" and "5.0.0-alpha1" branches).
  • Core developers who have very intimate knowledge of the development timeline can even benchmark arbitrary prior versions of Elasticsearch. They just need to find the right commit in the track specification repository and check that out.

Version-specific track execution

Rally will manage the git repository all by itself. It will perform the following steps before the start of a race:

  1. Fetch the latest version of the repository (git fetch origin -p). I think -p makes sense, although I wouldn't expect a whole lot of deleted branches.
  2. Determine whether to check out a version branch or master, check it out and rebase (git checkout $BRANCH && git rebase origin/$BRANCH).
  3. Always log the checked-out revision of the track data in the race index in the metrics store, in order to clearly document the revision that has been used.
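The three steps above could look roughly like the following. This is a minimal sketch, not Rally's implementation: the function names, the injectable runner, and the use of git rev-parse HEAD to obtain the revision for step 3 are all assumptions made for illustration.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes an argv list and the repository directory and returns stdout.
Runner = Callable[[Sequence[str], str], str]


def run_git(argv: Sequence[str], repo_dir: str) -> str:
    """Run a git command in repo_dir and return its stripped stdout."""
    result = subprocess.run(
        list(argv), cwd=repo_dir, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def update_commands(branch: str) -> List[List[str]]:
    """The git commands for steps 1 and 2, in execution order."""
    return [
        ["git", "fetch", "origin", "-p"],
        ["git", "checkout", branch],
        ["git", "rebase", f"origin/{branch}"],
    ]


def prepare_track_repo(repo_dir: str, branch: str, run: Runner = run_git) -> str:
    """Update the working copy and return the checked-out revision (step 3)."""
    for argv in update_commands(branch):
        run(argv, repo_dir)
    # The caller would store this hash alongside the race results.
    return run(["git", "rev-parse", "HEAD"], repo_dir)
```

Passing the runner as a parameter keeps the sequence of commands testable without a real repository; with the default runner, any failing git command raises subprocess.CalledProcessError.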

After that, the usual mechanism will do its job (i.e. downloading benchmark data etc.).

Non-standard cases (not necessarily errors):

  • No network connection (git fetch origin will fail): I wouldn't necessarily fail in this case, but I think it justifies a warning on the console and in the log.
  • Local working copy is dirty (i.e. git checkout $BRANCH will fail): Abort and tell the user to clean up. They may currently be working on a new track and we should not mess with their files. This will show during development of the feature, but maybe it makes sense not to force the user to always commit their changes before running a track.
  • The remote branch does not exist (i.e. git rebase will fail): I'd assume in that case that this is a new track that has not yet been pushed to origin (which is very likely for a community benchmark).
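The reactions listed above can be summarized as a mapping from the failing git step to what Rally does next. The enum and function names here are hypothetical, and treating a failed rebase as "continue with the local branch" is one reading of the last bullet rather than a settled decision.

```python
from enum import Enum


class Step(Enum):
    FETCH = "git fetch"
    CHECKOUT = "git checkout"
    REBASE = "git rebase"


class Action(Enum):
    WARN_AND_CONTINUE = "warn and continue"
    ABORT = "abort"
    CONTINUE = "continue"


def action_on_failure(step: Step) -> Action:
    """Map the git step that failed to the reaction proposed above."""
    if step is Step.FETCH:
        # Probably offline: use the local state, but warn on console and in the log.
        return Action.WARN_AND_CONTINUE
    if step is Step.CHECKOUT:
        # Dirty working copy: never touch the user's uncommitted files.
        return Action.ABORT
    # Rebase failed: assume a local-only branch that was never pushed to origin.
    return Action.CONTINUE
```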

We could also add a flag for advanced users that tells Rally not to touch the repo at all. This would allow them to check out a specific revision manually and run a race against that revision.

Depending on #26, mapping files may even be stored in this repo (I'd expect so), but we will most certainly not store the benchmark data there. So there should be some kind of naming convention for benchmark data files in case they differ across branches (e.g. "documents.json.gz" vs. "documents-1.4.1.json.gz"). That said, I wouldn't expect this to be needed in many cases.
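One possible shape for such a naming convention, modeled on the two example file names above. The helper data_file_name and the rule of inserting the version before the first dot are assumptions for illustration only.

```python
from typing import Optional


def data_file_name(base_name: str, version: Optional[str] = None) -> str:
    """Derive a version-specific data file name (hypothetical convention)."""
    if version is None:
        # No version-specific data: use the shared file name unchanged.
        return base_name
    # Split at the first dot so multi-part extensions like ".json.gz" survive.
    stem, dot, extension = base_name.partition(".")
    return f"{stem}-{version}{dot}{extension}"
```

With this rule, data_file_name("documents.json.gz", "1.4.1") gives "documents-1.4.1.json.gz", and omitting the version returns "documents.json.gz" unchanged.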

Finally, a very important assumption: at no point in time will two (or more) Rally processes run on the same machine. Otherwise, one Rally process could use the configuration of version X while another, started during that benchmark, switches to the configuration of version Y, and they could interfere with each other in very bad ways. But running multiple benchmarks simultaneously is a very bad idea anyway because it gives bogus results, so this is just one more way to shoot yourself in the foot.

danielmitterdorfer added this to the Backlog milestone Mar 21, 2016
danielmitterdorfer changed the title from "Support multiple versions of mapping files" to "Support multiple versions of mapping files and benchmark data" May 11, 2016
danielmitterdorfer modified the milestones: 0.4.0, Backlog May 11, 2016
danielmitterdorfer added the enhancement, :Usability and :Benchmark labels and removed the idea label May 11, 2016
danielmitterdorfer changed the title from "Support multiple versions of mapping files and benchmark data" to "Support multiple versions of benchmarks" May 11, 2016
danielmitterdorfer modified the milestones: 0.3.0, 0.4.0 May 12, 2016
danielmitterdorfer changed the title from "Support multiple versions of benchmarks" to "Support multiple versions of tracks" May 12, 2016
danielmitterdorfer changed the title from "Support multiple versions of tracks" to "Support tracks across multiple versions of Elasticsearch" May 12, 2016