Use zeroes instead of whitespaces as padding bytes #896

jpountz · 2020-02-13T15:00:39Z

When generating ids for conflicts, Rally pads them with spaces. I rarely see spaces in ids, so I think this is less realistic than padding with zeroes. Furthermore, Elasticsearch has some optimizations for ids that look like base64 or numeric ids, and the spaces defeat them.

jpountz · 2020-02-13T15:14:32Z

The following patch should do the trick:

diff --git a/esrally/track/params.py b/esrally/track/params.py
index c7c079a..5f8bb3f 100644
--- a/esrally/track/params.py
+++ b/esrally/track/params.py
@@ -622,7 +622,7 @@ def build_conflicting_ids(conflicts, docs_to_index, offset, shuffle=random.shuff
     all_ids = [0] * docs_to_index
     for i in range(docs_to_index):
         # always consider the offset as each client will index its own range and we don't want uncontrolled conflicts across clients
-        all_ids[i] = "%10d" % (offset + i)
+        all_ids[i] = "%010d" % (offset + i)
     if conflicts == IndexIdConflict.RandomConflicts:
         shuffle(all_ids)
     return all_ids
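
For illustration, here is a minimal sketch of what the one-character change in the format string does to the generated ids. The helper names `space_padded_id` and `zero_padded_id` and the offset value are hypothetical and not part of Rally; only the two format strings come from the patch.

```python
def space_padded_id(n):
    # Old format: right-aligned in a 10-character field, padded with spaces.
    return "%10d" % n


def zero_padded_id(n):
    # New format: same width, but padded with leading zeroes.
    return "%010d" % n


offset = 1000  # hypothetical per-client offset
old_ids = [space_padded_id(offset + i) for i in range(3)]
new_ids = [zero_padded_id(offset + i) for i in range(3)]

print(old_ids)  # ['      1000', '      1001', '      1002']
print(new_ids)  # ['0000001000', '0000001001', '0000001002']

# Only the zero-padded ids consist purely of digits.
print(all(i.isdigit() for i in old_ids))  # False
print(all(i.isdigit() for i in new_ids))  # True
```

Both formats produce fixed-width, 10-character ids, so lexicographic ordering is unchanged; only the padding character differs.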

jpountz · 2020-02-13T15:18:20Z

Changing this makes ids compressed at the Elasticsearch level rather than the Lucene level. It tends to help because Elasticsearch's compression is a bit more efficient, since it doesn't need to preserve ordering (while Lucene does, for range queries). This means you won't see Lucene's LowercaseAsciiCompression class in the hot classes reported by telemetry.

jpountz · 2020-02-13T16:59:49Z

As a data point, I can index geonames 5% faster (230k docs/s -> 242k docs/s quite consistently) with this change if no fields are indexed in the mapping (enabled=false on the root object mapper), which is likely a best-case scenario for this change.
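
As a quick check of the figure above, the relative gain can be computed from the two reported throughput numbers. The variable names are illustrative only; the values are the ones quoted in the comment.

```python
baseline_docs_per_s = 230_000  # throughput with space-padded ids
patched_docs_per_s = 242_000   # throughput with zero-padded ids

speedup_pct = (patched_docs_per_s - baseline_docs_per_s) / baseline_docs_per_s * 100
print(f"{speedup_pct:.1f}%")  # 5.2%
```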

danielmitterdorfer · 2020-02-14T06:23:29Z

Thanks for the analysis. I think the id structure is a leftover from the pre-Rally days even. :) This definitely makes sense to change. @drawlerr can you please look into this?

danielmitterdorfer added the ":Load Driver" (Changes that affect the core of the load driver such as scheduling, the measurement approach etc.) and "enhancement" (Improves the status quo) labels Feb 14, 2020

danielmitterdorfer added this to the 1.4.1 milestone Feb 14, 2020

danielmitterdorfer assigned drawlerr Feb 14, 2020

drawlerr mentioned this issue Feb 14, 2020

Use zeros instead of whitespaces as padding bytes #899

Merged

drawlerr closed this as completed in #899 Feb 18, 2020
