Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use zeroes instead of whitespaces as padding bytes #896

Closed
jpountz opened this issue Feb 13, 2020 · 4 comments · Fixed by #899
Closed

Use zeroes instead of whitespaces as padding bytes #896

jpountz opened this issue Feb 13, 2020 · 4 comments · Fixed by #899
Assignees
Labels
enhancement Improves the status quo :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc.
Milestone

Comments

@jpountz
Copy link
Contributor

jpountz commented Feb 13, 2020

When generating ids for conflicts, rally uses spaces as padding bytes. I rarely see spaces in ids, so I think this is less realistic than using zeroes? Furthermore Elasticsearch has some optimizations for ids that look like base64 or numeric ids, which get defeated by the usage of spaces.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 13, 2020

The following patch should do the trick:

diff --git a/esrally/track/params.py b/esrally/track/params.py
index c7c079a..5f8bb3f 100644
--- a/esrally/track/params.py
+++ b/esrally/track/params.py
@@ -622,7 +622,7 @@ def build_conflicting_ids(conflicts, docs_to_index, offset, shuffle=random.shuff
     all_ids = [0] * docs_to_index
     for i in range(docs_to_index):
         # always consider the offset as each client will index its own range and we don't want uncontrolled conflicts across clients
-        all_ids[i] = "%10d" % (offset + i)
+        all_ids[i] = "%010d" % (offset + i)
     if conflicts == IndexIdConflict.RandomConflicts:
         shuffle(all_ids)
     return all_ids

@jpountz
Copy link
Contributor Author

jpountz commented Feb 13, 2020

Changing this makes ids compressed at the Elasticsearch level rather than Lucene. It tends to help because Elaticsearch's level compression does things a bit more efficiently since it doesn't need to preserve ordering (while Lucene needs to for range queries). This means you won't see Lucene's LowercaseAsciiCompression class in the hot classes reported by telemetry.

@jpountz
Copy link
Contributor Author

jpountz commented Feb 13, 2020

As a data point, I can index geonames 5% faster (230k docs/s -> 242k docs/s quite consistently) with this change if no fields are indexed in the mapping (enabled=false on the root object mapper), which is likely a best-case scenario for this change.

@danielmitterdorfer
Copy link
Member

Thanks for the analysis. I think the id structure is a leftover from the pre-Rally days even. :) This definitely makes sense to change. @drawlerr can you please look into this?

@danielmitterdorfer danielmitterdorfer added :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc. enhancement Improves the status quo labels Feb 14, 2020
@danielmitterdorfer danielmitterdorfer added this to the 1.4.1 milestone Feb 14, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improves the status quo :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants