Support tracks across multiple versions of Elasticsearch #69

danielmitterdorfer · 2016-03-21T13:08:25Z

Concept

Since b5b4911 we ensure that Rally always uses the most recent mapping file. However, a track has more version-dependent parts than just the mapping:

Configuration files
The queries that can be executed and possibly also the concrete query syntax
Data format

In #26 we will externalize all these parts into a descriptive track specification. In this ticket we will ensure that users can use these tracks across multiple versions of Elasticsearch. For now the earliest supported version will still be Elasticsearch 5.0.0-alpha1. The limiting factor is not so much the track specification as cluster provisioning in Rally, which is a separate topic that we cannot / will not tackle at the same time.

Assumptions

We will assume that users either benchmark a specific released version of Elasticsearch (e.g. they run esrally --pipeline=from-distribution --distribution-version=5.0.0-alpha2) or actively develop on core and benchmark a source build of Elasticsearch (e.g. esrally --revision=latest). Hence, we will officially support only the following scenarios:

Compatibility with the latest master branch version of Elasticsearch
Compatibility with a previously released version of Elasticsearch

Solution

Data Management

All of this can be achieved by using git under the hood for managing track specifications. The track repository needs to contain a few branches:

master: we should ensure that it will always work with the latest master branch version of Elasticsearch
One branch for each released version of Elasticsearch. The branch name will correspond to the URL-friendly name for the download of the distribution itself (e.g. for the URL-friendly name "5.0.0-alpha2", a branch "5.0.0-alpha2" will also exist in the repo).
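
A minimal sketch of how the branch could be chosen under the two supported scenarios. The helper name `branch_for` and its parameters are hypothetical and not part of Rally; raising an error for an unknown version is also an assumption, as the proposal does not say what should happen in that case.

```python
from typing import Iterable, Optional


def branch_for(distribution_version: Optional[str], branches: Iterable[str]) -> str:
    """Pick the track repository branch for a race (hypothetical helper).

    A source build (no distribution version) maps to "master". A released
    version maps to the branch with the same URL-friendly name.
    """
    if distribution_version is None:
        return "master"
    if distribution_version in set(branches):
        return distribution_version
    # Assumption: an unknown version is treated as an error.
    raise ValueError(f"no track branch for Elasticsearch {distribution_version}")
```

For example, `branch_for("5.0.0-alpha2", ["master", "5.0.0-alpha1", "5.0.0-alpha2"])` selects the "5.0.0-alpha2" branch, while `branch_for(None, ...)` selects master.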

The repo will contain one folder for each track, named after the track. Which file(s) the folder will contain is defined by #26.

This approach also gives us a few additional benefits:

Error corrections in track specifications or enhancements can very easily be added over time. If you make a mistake, just fix it on the affected branch and cherry-pick it to all "earlier" branches that are also affected (e.g. fix on master and cherry-pick to the "5.0.0-alpha2" and "5.0.0-alpha1" branches).
Core developers who have very intimate knowledge of the development timeline can even benchmark arbitrary prior versions of Elasticsearch. They just need to find the right commit in the track specification repository and check that out.

Version-specific track execution

Rally will manage the git repository all by itself. It will perform the following steps before the start of a race:

Fetch the latest version of the repository (git fetch origin -p). I think -p makes sense, although I wouldn't expect a whole lot of deleted branches.
Determine whether to check out a version branch or master, check it out and rebase (git checkout $BRANCH && git rebase origin/$BRANCH).
Log the checked out revision of the track data in the race index in the metrics store in any case, in order to clearly document the revision that has been used.
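
The three steps above can be sketched as follows. This is only an illustration under stated assumptions: `run_git` and `prepare_track_repo` are hypothetical names, and using `git rev-parse HEAD` to obtain the revision for logging is an assumption, since the proposal does not name a command for that step. The git runner is passed in as a parameter so the sequence can be exercised without a real repository.

```python
import subprocess
from typing import Callable, List


def run_git(args: List[str], cwd: str) -> str:
    """Run a git command in cwd and return its stripped stdout."""
    result = subprocess.run(["git"] + args, cwd=cwd, check=True,
                            capture_output=True, text=True)
    return result.stdout.strip()


def prepare_track_repo(repo_dir: str, branch: str,
                       git: Callable[[List[str], str], str] = run_git) -> str:
    """Update the working copy to `branch` and return the checked-out revision."""
    # Step 1: fetch the latest state, pruning branches deleted on origin.
    git(["fetch", "origin", "-p"], repo_dir)
    # Step 2: switch to the chosen branch and bring it up to date.
    git(["checkout", branch], repo_dir)
    git(["rebase", f"origin/{branch}"], repo_dir)
    # Step 3 (assumption): the caller records this revision in the metrics store.
    return git(["rev-parse", "HEAD"], repo_dir)
```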

After that, the usual mechanism will do its job (i.e. downloading benchmark data etc.)

Non-standard cases (not necessarily errors):

No network connection (git fetch origin will fail): I wouldn't necessarily fail in this case but I think it justifies a warning on the console and in the log.
Local working copy is dirty (i.e. git checkout $BRANCH will fail): Abort and tell the user to clean up. It could be that they're currently working on a new track and we should not mess with their files. It will show during development of the feature but maybe it makes sense that we don't force the user to always commit their changes before running a track.
The remote branch does not exist (i.e. git rebase will fail): In that case I'd assume that this is a new track that has not yet been pushed to origin (which is very likely for a community benchmark).
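
One possible way to encode these three cases, as a sketch only. All names (`update_with_fallbacks`, `DirtyWorkingCopy`, the logger name) are hypothetical. Two simplifications are assumptions rather than part of the proposal: any failure of a step is attributed to the case listed above, and a missing remote branch is reported as a warning before continuing.

```python
import logging
from typing import Callable, List

logger = logging.getLogger("track-repo")  # hypothetical logger name


class DirtyWorkingCopy(Exception):
    """Raised when the checkout fails and the user has to clean up first."""


def update_with_fallbacks(repo_dir: str, branch: str,
                          git: Callable[[List[str], str], str]) -> List[str]:
    """Apply the three non-standard cases; return the warnings that were issued.

    `git` is any callable that runs a git command in `repo_dir` and raises
    an exception when the command fails.
    """
    warnings = []
    try:
        git(["fetch", "origin", "-p"], repo_dir)
    except Exception:
        # No network connection: warn, but continue with the local state.
        warnings.append("could not fetch from origin; using local track data")
    try:
        git(["checkout", branch], repo_dir)
    except Exception as e:
        # Dirty working copy: never touch the user's files, abort instead.
        raise DirtyWorkingCopy(
            f"cannot check out {branch}; please clean up {repo_dir}") from e
    try:
        git(["rebase", f"origin/{branch}"], repo_dir)
    except Exception:
        # No remote branch: assume a new track that has not been pushed yet.
        warnings.append(f"origin/{branch} not found; assuming an unpushed track")
    for w in warnings:
        logger.warning(w)
    return warnings
```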

We could also add a flag for advanced users to not touch the repo at all. This would allow them to check out a specific revision manually and run a race against that revision.

Depending on #26, mapping files may even be stored in this repo (I'd expect so) but we will most certainly not store the benchmark data there. So there should be some kind of naming convention for benchmark data files should they differ across branches (e.g. "documents.json.gz" vs. "documents-1.4.1.json.gz"). That said, I wouldn't expect that this is needed in a whole lot of cases.
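
A naming convention along the lines of the example could look like this. The helper `data_file_name` is hypothetical and only mirrors the two file names mentioned above; the actual convention is not decided in this ticket.

```python
from typing import Optional


def data_file_name(base: str, version: Optional[str] = None,
                   suffix: str = ".json.gz") -> str:
    """Build a benchmark data file name, optionally tagged with a version."""
    if version is None:
        return f"{base}{suffix}"
    return f"{base}-{version}{suffix}"
```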

Finally, a very important assumption: At no point in time will there be two (or more) Rally processes running on the same machine. Otherwise one Rally process could use the configuration of version X and, while the benchmark is running, another one could use the configuration of version Y, and they could interfere with each other in very bad ways. But running multiple benchmarks simultaneously is obviously a very bad idea anyway, as it gives bogus results, so that's just one more way to shoot yourself in the foot.


danielmitterdorfer added the idea label Mar 21, 2016

danielmitterdorfer added this to the Backlog milestone Mar 21, 2016

danielmitterdorfer mentioned this issue Mar 21, 2016

Have Rally support multiple Elasticsearch versions #68

Closed

danielmitterdorfer changed the title ~~Support multiple versions of mapping files~~ Support multiple versions of mapping files and benchmark data May 11, 2016

danielmitterdorfer modified the milestones: 0.4.0, Backlog May 11, 2016

danielmitterdorfer mentioned this issue May 11, 2016

Add simple script engine data points to geonames benchmark #95

Closed

danielmitterdorfer added the enhancement (Improves the status quo), :Usability (Makes Rally easier to use) and :Benchmark labels and removed the idea label May 11, 2016

danielmitterdorfer changed the title ~~Support multiple versions of mapping files and benchmark data~~ Support multiple versions of benchmarks May 11, 2016

danielmitterdorfer modified the milestones: 0.3.0, 0.4.0 May 12, 2016

danielmitterdorfer mentioned this issue May 12, 2016

Improve flexibility of benchmarks #53

Closed

5 tasks

danielmitterdorfer changed the title ~~Support multiple versions of benchmarks~~ Support multiple versions of tracks May 12, 2016

danielmitterdorfer changed the title ~~Support multiple versions of tracks~~ Support tracks across multiple versions of Elasticsearch May 12, 2016

danielmitterdorfer mentioned this issue May 12, 2016

Support multiple track repositories #99

Closed

danielmitterdorfer closed this as completed Jun 6, 2016

danielmitterdorfer added a commit that referenced this issue Jun 7, 2016

Merge remote-tracking branch 'origin/json-tracks'

ed51d73

Closes #69

danielmitterdorfer mentioned this issue Jun 9, 2016

Add logging data set #90

Closed
