Commit

Also allow an uncompressed document corpus
With this commit, Rally no longer requires the document corpus to be
compressed. This is especially helpful when a track is used only locally
and not shared with others.

Closes #331
danielmitterdorfer committed Oct 23, 2017
1 parent cb1a5f0 commit 2e60e2d
Showing 6 changed files with 446 additions and 123 deletions.
68 changes: 47 additions & 21 deletions docs/adding_tracks.rst
@@ -58,25 +58,22 @@ You will note that the file is tab-delimited but we need JSON to bulk-index data

Store the script as ``toJSON.py`` in our tutorial directory (``~/rally-tracks/tutorial``) and invoke the script with ``python3 toJSON.py > documents.json``.

Next we need to compress the JSON file with ``bzip2 -9 -c documents.json > documents.json.bz2``.

We also need a mapping file for our documents. For details on how to write a mapping file, see `the Elasticsearch documentation on mappings <https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html>`_ and look at an `example mapping file <https://github.com/elastic/rally-tracks/blob/master/geonames/mappings.json>`_. Place the mapping file in the tutorial directory.

Finally, add a file called ``track.json`` right next to the mapping file::

    {
      "short-description": "Tutorial benchmark for Rally",
      "description": "This test indexes 8.6M documents (POIs from Geonames, total 2.8 GB json) using 8 client threads and 5000 docs per bulk request against Elasticsearch",
      "description": "This test indexes 8.6 million documents (POIs from Geonames) using 8 clients and 5000 docs per bulk request against Elasticsearch",
      "indices": [
        {
          "name": "geonames",
          "types": [
            {
              "name": "type",
              "mapping": "mappings.json",
              "documents": "documents.json.bz2",
              "documents": "documents.json",
              "document-count": 8647880,
              "compressed-bytes": 197857614,
              "uncompressed-bytes": 2790927196
            }
          ]
@@ -161,9 +158,9 @@ When you invoke ``esrally list tracks --track-path=~/rally-tracks/tutorial``, th
    Name        Description                    Documents    Compressed Size  Uncompressed Size  Default Challenge  All Challenges
    ----------  -----------------------------  -----------  ---------------  -----------------  -----------------  ---------------
    tutorial    Tutorial benchmark for Rally   8647880      188.7 MB         2.6 GB             index-and-query    index-and-query
    tutorial    Tutorial benchmark for Rally   8647880      N/A              2.6 GB             index-and-query    index-and-query

Congratulations, you have created your first track! You can test it with ``esrally --track-path=~/rally-tracks/tutorial --offline`` and run specific challenges with ``esrally --track-path=~/rally-tracks/tutorial --challenge=index-and-query --offline``.
Congratulations, you have created your first track! You can test it with ``esrally --track-path=~/rally-tracks/tutorial`` and run specific challenges with ``esrally --track-path=~/rally-tracks/tutorial --challenge=index-and-query``.

.. _add_track_test_mode:

@@ -177,12 +174,7 @@ When you invoke Rally with ``--test-mode``, it switches to a mode that allows yo

Rally will postprocess all data file names of a track. So instead of ``documents.json.bz2``, Rally will attempt to find ``documents-1k.json.bz2`` and will assume it contains 1,000 documents. However, you need to prepare these data files yourself; otherwise test mode is not supported.

The preparation is very easy and requires these two steps:

1. Pick 1,000 documents from your data set. We choose the first 1,000 here, but it usually does not matter which part you choose: ``head -n 1000 documents.json > documents-1k.json``.
2. Compress it: ``bzip2 -9 -c documents-1k.json > documents-1k.json.bz2``

You have to repeat these steps for all data files of your track.
The preparation is easy: pick 1,000 documents for every data file in your track. We choose the first 1,000 here, but it usually does not matter which part you choose: ``head -n 1000 documents.json > documents-1k.json``.
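
For tracks with several data files, the ``head`` command above can be wrapped in a small script. The following is a minimal sketch, not part of Rally; the helper name ``write_test_mode_corpus`` is hypothetical, and it assumes an uncompressed corpus with one document per line:

```python
import itertools
from pathlib import Path


def write_test_mode_corpus(source, count=1000):
    """Copy the first `count` lines of `source` into a sibling "-1k" file.

    Hypothetical helper: mirrors `head -n 1000 documents.json > documents-1k.json`.
    """
    source = Path(source)
    # documents.json -> documents-1k.json (same directory)
    target = source.with_name(f"{source.stem}-1k{source.suffix}")
    with source.open("rb") as src, target.open("wb") as dst:
        # islice stops after `count` lines, so the rest of the file is never read
        dst.writelines(itertools.islice(src, count))
    return target
```

Calling ``write_test_mode_corpus("documents.json")`` once per data file produces the corresponding ``-1k`` files.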

Structuring your track
----------------------
@@ -221,17 +213,16 @@ Now modify ``track.json`` so it knows about your new file::

    {
      "short-description": "Tutorial benchmark for Rally",
      "description": "This test indexes 8.6M documents (POIs from Geonames, total 2.8 GB json) using 8 client threads and 5000 docs per bulk request against Elasticsearch",
      "description": "This test indexes 8.6 million documents (POIs from Geonames) using 8 clients and 5000 docs per bulk request against Elasticsearch",
      "indices": [
        {
          "name": "geonames",
          "types": [
            {
              "name": "type",
              "mapping": "mappings.json",
              "documents": "documents.json.bz2",
              "documents": "documents.json",
              "document-count": 8647880,
              "compressed-bytes": 197857614,
              "uncompressed-bytes": 2790927196
            }
          ]
@@ -268,18 +259,17 @@ However, if your track consists of multiple challenges it can be cumbersome to i

    {% import "rally.helpers" as rally %}
    {
      "short-description": "Standard benchmark in Rally (8.6M POIs from Geonames)",
      "description": "This test indexes 8.6M documents (POIs from Geonames, total 2.8 GB json) using 8 client threads and 5000 docs per bulk request against Elasticsearch",
      "short-description": "Tutorial benchmark for Rally",
      "description": "This test indexes 8.6 million documents (POIs from Geonames) using 8 clients and 5000 docs per bulk request against Elasticsearch",
      "indices": [
        {
          "name": "geonames",
          "types": [
            {
              "name": "type",
              "mapping": "mappings.json",
              "documents": "documents.json.bz2",
              "documents": "documents.json",
              "document-count": 8647880,
              "compressed-bytes": 197857614,
              "uncompressed-bytes": 2790927196
            }
          ]
@@ -326,7 +316,43 @@ If you use this idiom you can then refer to variables inside your snippets with
Sharing your track with others
------------------------------

At the moment your track is only available on your local machine but maybe you want to share it with other people in your team. You can share the track itself in any way you want, e.g. you can check it into version control. However, you will most likely not want to commit the potentially huge data file. Therefore, you can expose the data via http (e.g. via S3) and Rally can download it from there. To make this work, you need to add an additional property ``data-url`` at the top-level of your ``track.json`` file which contains the URL from where to download your documents. Rally expects that the URL points to the parent path and will append the file name of the compressed documents automatically.
At the moment your track is only available on your local machine, but you may want to share it with other people on your team. You can share the track itself in any way you want, e.g. by checking it into version control. However, you will most likely not want to commit the potentially huge data file. Instead, you can expose the data via HTTP (e.g. via S3) and Rally can download it from there. To make this work, add a property ``data-url`` at the top level of your ``track.json`` file that contains the URL from which to download your documents. Rally expects the URL to point to the parent path and will append the document file name automatically.
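
To illustrate how ``data-url`` and the ``documents`` file name combine, here is a minimal sketch. It is not Rally's implementation; the function name ``document_download_url`` is hypothetical, and tolerating a trailing slash is an assumption made for the example:

```python
def document_download_url(data_url, documents):
    """Join the parent path from `data-url` with the `documents` file name.

    Hypothetical illustration only; Rally's own URL handling may differ.
    """
    # Assumption: a trailing slash on data-url is tolerated and removed
    return f"{data_url.rstrip('/')}/{documents}"
```

Under these assumptions, a ``data-url`` of ``http://example.org/corpora/geonames`` (a placeholder URL) and a ``documents`` value of ``documents.json.bz2`` yield ``http://example.org/corpora/geonames/documents.json.bz2``.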

We also recommend that you compress your document corpus to save network bandwidth, preferably with bzip2. You can create a compressed archive with the following command::

    bzip2 -9 -c documents.json > documents.json.bz2

If you want to support the test mode, don't forget to also compress your test mode corpus with::

    bzip2 -9 -c documents-1k.json > documents-1k.json.bz2

Then upload ``documents.json.bz2`` and ``documents-1k.json.bz2`` to the remote location.

Finally, specify the compressed file name in your ``track.json`` file::

    {
      "short-description": "Tutorial benchmark for Rally",
      "description": "This test indexes 8.6 million documents (POIs from Geonames) using 8 clients and 5000 docs per bulk request against Elasticsearch",
      "data-url": "http://benchmarks.elasticsearch.org.s3.amazonaws.com/corpora/geonames",
      "indices": [
        {
          "name": "geonames",
          "types": [
            {
              "name": "type",
              "mapping": "mappings.json",
              "documents": "documents.json.bz2",
              "document-count": 8647880,
              "compressed-bytes": 197857614,
              "uncompressed-bytes": 2790927196
            }
          ]
        }
      ],
      ...
    }

Specifying ``compressed-bytes`` (file size of ``documents.json.bz2``) and ``uncompressed-bytes`` (file size of ``documents.json``) is optional but helps Rally to provide progress indicators and also verify integrity.
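
If you prefer to script this step, the following minimal sketch compresses the corpus with Python's standard ``bz2`` module and reports the two file sizes. It is an illustration rather than part of Rally; the function name ``compress_and_measure`` is hypothetical:

```python
import bz2
import os
import shutil


def compress_and_measure(documents="documents.json"):
    """Write `<documents>.bz2` and return the size-related track properties.

    Hypothetical helper: equivalent in spirit to
    `bzip2 -9 -c documents.json > documents.json.bz2` followed by two size checks.
    """
    compressed = documents + ".bz2"
    with open(documents, "rb") as src, bz2.open(compressed, "wb", compresslevel=9) as dst:
        # Stream in chunks so a multi-gigabyte corpus is never held in memory
        shutil.copyfileobj(src, dst)
    return {
        "documents": os.path.basename(compressed),
        "compressed-bytes": os.path.getsize(compressed),
        "uncompressed-bytes": os.path.getsize(documents),
    }
```

The returned dictionary can be copied into the type definition in ``track.json``.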

How to contribute a track
-------------------------
4 changes: 2 additions & 2 deletions docs/track.rst
@@ -182,9 +182,9 @@ Each type consists of the following properties:

* ``name`` (mandatory): Name of the type.
* ``mapping`` (mandatory): File name of the corresponding mapping file.
* ``documents`` (optional): File name of the corresponding documents that should be indexed. If you are using parent-child, specify the number of parent documents. This file has to be compressed either as ``.zip``, ``.bz2``, ``.gz``, ``.tar``, ``.tar.gz``, ``.tgz`` or ``.tar.bz2`` and must contain exactly one JSON file with the same name. The preferred file extension for our official tracks is ``.bz2``.
* ``documents`` (optional): File name of the corresponding documents that should be indexed. For local use, this file can be a plain ``.json`` file. If you provide a ``data-url``, we recommend that you specify a compressed file here. The following extensions are supported: ``.zip``, ``.bz2``, ``.gz``, ``.tar``, ``.tar.gz``, ``.tgz`` and ``.tar.bz2``. A compressed file must contain exactly one JSON file with the same name. The preferred file extension for our official tracks is ``.bz2``.
* ``includes-action-and-meta-data`` (optional, defaults to ``false``): Defines whether the documents file already contains an action and meta-data line (``true``) or only documents (``false``).
* ``document-count`` (mandatory if ``documents`` is set): Number of documents in the documents file. This number is used by Rally to determine which client indexes which part of the document corpus (each of the N clients gets one N-th of the document corpus).
* ``document-count`` (mandatory if ``documents`` is set): Number of documents in the documents file. This number is used by Rally to determine which client indexes which part of the document corpus (each of the N clients gets one N-th of the document corpus). If you are using parent-child, specify the number of parent documents.
* ``compressed-bytes`` (optional but recommended if ``documents`` is set): The size in bytes of the compressed document file. This number is used to show users how much data will be downloaded by Rally and also to check whether the download is complete.
* ``uncompressed-bytes`` (optional but recommended if ``documents`` is set): The size in bytes of the documents file after decompression. This number is used by Rally to show users how much disk space the decompressed file will need and to check that the whole file could be decompressed successfully.
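
To make the role of ``document-count`` concrete, the sketch below splits a corpus into one contiguous range per client. It only illustrates the "one N-th per client" idea and is not Rally's actual partitioning code; the function name ``client_ranges`` and the handling of a remainder are assumptions:

```python
def client_ranges(document_count, clients):
    """Return one half-open (start, end) document range per client.

    Hypothetical illustration: any remainder is spread over the first clients.
    """
    base, remainder = divmod(document_count, clients)
    ranges = []
    start = 0
    for client in range(clients):
        # The first `remainder` clients take one extra document
        size = base + (1 if client < remainder else 0)
        ranges.append((start, start + size))
        start += size
    return ranges
```

For the tutorial corpus, ``client_ranges(8647880, 8)`` assigns 1,080,985 documents to each of the 8 clients.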
