Since b5b4911 we ensure that Rally always uses the most recent mapping file. However, a track has more version-dependent parts than just the mapping:
Configuration files
The queries that can be executed and possibly also the concrete query syntax
Data format
In #26 we will externalize all these parts into a descriptive track specification. In this ticket we will ensure that users will be able to use these tracks across multiple versions of Elasticsearch. For now the earliest supported version will still be Elasticsearch 5.0.0-alpha1. The problem is not so much the track specification as cluster provisioning in Rally (which is just a separate topic that we cannot / will not tackle at the same time).
Assumptions
We will assume that users either benchmark a specific released version of Elasticsearch (e.g. by running esrally --pipeline=from-distribution --distribution-version=5.0.0-alpha2) or actively develop on core and benchmark a source build of Elasticsearch (e.g. esrally --revision=latest). Hence, the only scenarios that we will officially support are:
Compatibility with the latest master branch version of Elasticsearch
Compatibility with a previously released version of Elasticsearch
Solution
Data Management
All of this can be achieved by using git under the hood for managing track specifications. The track repository needs to contain a few branches:
master: we should ensure that it will always work with the latest master branch version of Elasticsearch
One branch for each released version of Elasticsearch. The branch name will match the URL-friendly name used for the download of the distribution itself (e.g. for the URL-friendly name "5.0.0-alpha2", a branch "5.0.0-alpha2" will exist in the repo).
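The branch convention above can be sketched as a small lookup. This is only an illustration: the function name track_branch and its parameter are hypothetical and not part of Rally.

```python
from typing import Optional


def track_branch(distribution_version: Optional[str] = None) -> str:
    """Return the track repository branch for a benchmark candidate.

    Hypothetical helper: a released distribution maps to the branch that
    carries the same URL-friendly name, a source build maps to master.
    """
    if distribution_version:
        return distribution_version
    return "master"
```

With this sketch, track_branch("5.0.0-alpha2") selects the "5.0.0-alpha2" branch, and calling it without a version selects master.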
The repo will contain one folder for each track which also corresponds with its name. Which file(s) the folder will contain is defined by #26.
This approach also gives us a few additional benefits:
Error corrections or enhancements in track specifications can easily be added over time. If you make a mistake, just fix it on the affected branch and cherry-pick the fix to all earlier branches that are also affected (e.g. fix on master and cherry-pick to the "5.0.0-alpha2" and "5.0.0-alpha1" branches).
Core developers who have intimate knowledge of the development timeline can even benchmark arbitrary prior versions of Elasticsearch. They just need to find the right commit in the track specification repository and check it out.
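The cherry-pick workflow mentioned in the first benefit could look roughly like the following sketch. It only builds the git command lines and does not run them; the function name backport_commands and the commit id "abc1234" are placeholders for illustration.

```python
from typing import List


def backport_commands(commit: str, branches: List[str]) -> List[List[str]]:
    """Build the git commands that cherry-pick one fix onto each branch.

    Hypothetical helper: the commit is assumed to exist already on the
    branch where the mistake was fixed (e.g. master).
    """
    commands = []
    for branch in branches:
        # Switch to the affected version branch, then apply the fix there.
        commands.append(["git", "checkout", branch])
        commands.append(["git", "cherry-pick", commit])
    return commands
```

For a fix made on master, backport_commands("abc1234", ["5.0.0-alpha2", "5.0.0-alpha1"]) yields a checkout followed by a cherry-pick for each of the two version branches.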
Version-specific track execution
Rally will manage the git repository all by itself. It will perform the following steps before the start of a race:
Fetch the latest version of the repository (git fetch origin -p). The -p option prunes remote-tracking branches that were deleted on the remote; I think it makes sense although I wouldn't expect a whole lot of deleted branches.
Determine whether to check out a version branch or master, check it out and rebase (git checkout $BRANCH && git rebase origin/$BRANCH).
In any case, log the checked-out revision of the track data in the race index of the metrics store in order to clearly document the revision that has been used.
After that, the usual mechanism will do its job (i.e. downloading benchmark data etc.)
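The steps above can be sketched in Python as follows. This is a minimal sketch, not Rally's implementation: the function names are hypothetical, and using git rev-parse to obtain the revision for logging is an assumption, since the issue does not say how the revision is determined.

```python
import subprocess
from typing import List


def update_commands(branch: str, remote: str = "origin") -> List[List[str]]:
    """Return the git commands from the steps above, in order."""
    return [
        ["git", "fetch", remote, "-p"],            # fetch and prune deleted branches
        ["git", "checkout", branch],               # version branch or master
        ["git", "rebase", f"{remote}/{branch}"],   # bring the local branch up to date
    ]


def update_track_repo(repo_dir: str, branch: str) -> str:
    """Run the update steps in repo_dir and return the checked-out revision.

    Raises subprocess.CalledProcessError if any git command fails.
    """
    for cmd in update_commands(branch):
        subprocess.run(cmd, cwd=repo_dir, check=True)
    # Assumption: the short commit hash is what gets logged to the metrics store.
    result = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_dir, check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()
```

Keeping the command list separate from the code that executes it makes the sequence easy to inspect without touching a real repository.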
Non-standard cases (doesn't necessarily have to be an error):
No network connection (git fetch origin will fail): I wouldn't necessarily fail in this case but I think it justifies a warning on the console and in the log
Local working copy is dirty (i.e. git checkout $BRANCH will fail): Abort and tell the user to clean up. It could be that they're currently working on a new track and we should not mess with their files. This will show during development of the feature, but maybe it makes sense not to force the user to always commit their changes before running a track.
The remote branch does not exist (i.e. git rebase will fail): I'd assume in that case that this is a new track that has not yet been pushed to origin (which is quite likely for a community benchmark).
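The handling of these cases can be summarized as a small decision function. This is only a sketch of one reading of the list above: the names are hypothetical, and continuing with the local branch when the remote branch is missing is an assumption, as the issue only states that a not-yet-pushed track is the likely cause.

```python
from enum import Enum


class Action(Enum):
    PROCEED = "proceed"
    WARN_AND_PROCEED = "warn on console and in the log, then proceed"
    USE_LOCAL_BRANCH = "skip the rebase and use the local branch"
    ABORT = "abort and ask the user to clean up"


def decide(fetch_failed: bool, working_copy_dirty: bool,
           remote_branch_missing: bool) -> Action:
    """Map the non-standard cases to an action (hypothetical helper)."""
    if working_copy_dirty:
        # Never touch uncommitted work on a track under development.
        return Action.ABORT
    if fetch_failed:
        # Most likely no network connection: not fatal, but worth a warning.
        return Action.WARN_AND_PROCEED
    if remote_branch_missing:
        # Assumed to be a new track that has not been pushed to origin yet.
        return Action.USE_LOCAL_BRANCH
    return Action.PROCEED
```

The dirty working copy is checked first because it is the only case in which Rally should refuse to continue.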
We could also add a flag for advanced users to not touch the repo at all. This would allow them to checkout a specific revision manually and run a race against that revision.
Depending on #26, mapping files may even be stored in this repo (I'd expect so) but we will most certainly not store the benchmark data there. So there should be some kind of naming convention for benchmark data files in case they differ across branches (e.g. "documents.json.gz" vs. "documents-1.4.1.json.gz"). That said, I wouldn't expect this to be needed in a whole lot of cases.
Finally, a very important assumption: at no point in time will there be two (or more) Rally processes running on the same machine. Otherwise one Rally process could use the configuration of version X and, while the benchmark is running, another one could use the configuration of version Y, and they could interfere with each other in very bad ways. But running multiple benchmarks simultaneously is obviously a very bad idea anyway because it produces bogus results, so this is just one more way to shoot yourself in the foot.
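One possible shape of such a naming convention, derived only from the two example file names above, is sketched here. The function name and parameters are hypothetical; the actual convention is not decided in this issue.

```python
from typing import Optional


def data_file_name(base: str, version: Optional[str] = None,
                   suffix: str = ".json.gz") -> str:
    """Build a benchmark data file name, optionally tied to a version.

    Hypothetical helper: without a version the plain name is used,
    otherwise the version is appended to the base name.
    """
    if version:
        return f"{base}-{version}{suffix}"
    return f"{base}{suffix}"
```

Here data_file_name("documents") gives "documents.json.gz" and data_file_name("documents", "1.4.1") gives "documents-1.4.1.json.gz".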
danielmitterdorfer changed the title from "Support multiple versions of mapping files" to "Support multiple versions of mapping files and benchmark data" on May 11, 2016
danielmitterdorfer changed the title from "Support multiple versions of mapping files and benchmark data" to "Support multiple versions of benchmarks" on May 11, 2016