Introduce new encoding of BPV 21 for DocIdsWriter used in BKD Tree #13521

expani · 2024-06-26T10:19:10Z

Background

Lucene stores the docIds of a leaf block in the KDD file of a BKD tree based index in one of 3 ways when the docIds in the block are not sorted:

If the difference between the min and max docId in a block ( usually 512 docIds per block ) can be represented in 16 bits, it packs 2 docIds in a single integer ( BPV_16 ).
If the max docId in a block can be represented in 24 bits, it packs 8 docIds in 3 longs ( BPV_24 ).
Otherwise, it writes the docIds as plain integers ( BPV_32 ).
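
The three rules above can be sketched as a small selection routine. This is only an illustration of the description in this PR, not a copy of DocIdsWriter; the class name, the enum and the method are hypothetical, and the input is assumed to be a non-empty array of non-negative docIds.

```java
final class DocIdEncodingChooser {
  enum Encoding { BPV_16, BPV_24, BPV_32 }

  /** Picks an encoding for an unsorted leaf block using the three rules above. */
  static Encoding choose(int[] docIds) {
    int min = Integer.MAX_VALUE;
    int max = 0;
    for (int docId : docIds) {
      min = Math.min(min, docId);
      max = Math.max(max, docId);
    }
    if (max - min <= 0xFFFF) {
      return Encoding.BPV_16; // the min-to-max range fits in 16 bits
    }
    if (max <= 0xFFFFFF) {
      return Encoding.BPV_24; // every docId fits in 24 bits
    }
    return Encoding.BPV_32; // fall back to plain ints
  }

  public static void main(String[] args) {
    System.out.println(choose(new int[] {1_000_000, 1_000_500, 1_010_000}));
    System.out.println(choose(new int[] {5, 70_000, 16_000_000}));
    System.out.println(choose(new int[] {5, 20_000_000}));
  }
}
```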

BPV_24 uses fewer bitwise/arithmetic operations than BPV_16. If we represent the docIds using any other number of bits ( 21 to 23 or 25 to 31 ), the number of bitwise operations used for encoding and decoding increases. Let's compare 2 encoding schemes as an example,

Note: | delimits one packed long, and , separates the bit counts of the individual docIds stored in that long ( a docId split across two longs appears as two partial counts ).
Only decoding is shown in the examples, but the same applies to encoding as well.

For BPV_24

| 24,24,16 | 8,24,24,8 | 16,24,24 |

Packs 8 docIds in 3 longs. So, 512 docIds require ( 3/8 * 512 ) = 192 longs.
It uses 18 bitwise operations to decode 8 docIds. So, 512 docIds require ( 18/8 * 512 ) = 1152 bitwise operations.
The bitwise operators used for decoding 8 docIds can also be visualised as follows:

S, MS, MSSO, MS, MS, MSSO, MS, M

M is a mask using AND, S is a bitwise left/right shift and O is a bitwise OR that joins the 2 partial halves of a docId stored in different longs.
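
To make the S / MS / MSSO pattern concrete, here is a minimal sketch of the BPV_24 layout operating on plain arrays. It follows the | 24,24,16 | 8,24,24,8 | 16,24,24 | layout shown above; the class and method names are hypothetical, it assumes every docId fits in 24 bits, and it is not copied from Lucene.

```java
final class Bpv24Sketch {
  /** Packs 8 docIds (each < 2^24) into 3 longs. */
  static long[] encode8(int[] d) {
    long l1 = ((long) d[0] << 40) | ((long) d[1] << 16) | (d[2] >>> 8);
    long l2 = ((long) (d[2] & 0xFF) << 56) | ((long) d[3] << 32)
        | ((long) d[4] << 8) | (d[5] >>> 16);
    long l3 = ((long) (d[5] & 0xFFFF) << 48) | ((long) d[6] << 24) | d[7];
    return new long[] {l1, l2, l3};
  }

  /** Unpacks 3 longs into 8 docIds; the comments name the operators used per docId. */
  static int[] decode8(long[] p) {
    long l1 = p[0], l2 = p[1], l3 = p[2];
    int[] d = new int[8];
    d[0] = (int) (l1 >>> 40);                          // S
    d[1] = (int) ((l1 >>> 16) & 0xFFFFFF);             // MS
    d[2] = (int) (((l1 & 0xFFFF) << 8) | (l2 >>> 56)); // MSSO
    d[3] = (int) ((l2 >>> 32) & 0xFFFFFF);             // MS
    d[4] = (int) ((l2 >>> 8) & 0xFFFFFF);              // MS
    d[5] = (int) (((l2 & 0xFF) << 16) | (l3 >>> 48));  // MSSO
    d[6] = (int) ((l3 >>> 24) & 0xFFFFFF);             // MS
    d[7] = (int) (l3 & 0xFFFFFF);                      // M
    return d;
  }

  public static void main(String[] args) {
    int[] in = {0xFFFFFF, 0, 0xABCDEF, 1, 0x123456, 0xFEDCBA, 42, 0x800000};
    int[] out = decode8(encode8(in));
    System.out.println(java.util.Arrays.equals(in, out));
  }
}
```

Counting the operators in the comments gives 1 + 2 + 4 + 2 + 2 + 4 + 2 + 1 = 18, which matches the figure quoted above.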

For BPV_20

| 20,20,20,4 | 16,20,20,8 | 12,20,20,12 | 8,20,20,16 | 4,20,20,20 |

Packs 16 docIds in 5 longs. So, 512 docIds require ( 5/16 * 512 ) = 160 longs.
However, it uses 38 bitwise operations to decode 16 docIds. So, 512 docIds require ( 38/16 * 512 ) = 1216 bitwise operations.

S, MS, MS, MSSO, MS, MS, MSSO, MS, MS, MSSO, MS, MS, MSSO, MS, MS, M

I have analysed the same for other BPV like 21/22/23/25/26/27/28/29/30/31 and in all cases the number of bitwise operations for encoding and decoding is higher than BPV_24.
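
The counting used for BPV_24 and BPV_20 above can be reproduced mechanically. The sketch below assumes a simple cost model read off the notation in this description: a docId that starts a long costs 1 operation (S), one that ends a long costs 1 (M), one in the middle costs 2 (MS), and one that straddles two longs costs 4 (MSSO). The class and its methods are hypothetical helpers, and the model only covers tight packing with no padding bits.

```java
final class BitOpCounter {
  static int gcd(int a, int b) {
    return b == 0 ? a : gcd(b, a % b);
  }

  /** Returns {docIds per group, longs per group, decode ops per group} for tight packing. */
  static int[] groupStats(int bpv) {
    int groupBits = bpv / gcd(bpv, 64) * 64; // lcm(bpv, 64): smallest group ending on a long boundary
    int values = groupBits / bpv;
    int ops = 0;
    for (int k = 0; k < values; k++) {
      int start = k * bpv;       // first bit of docId k, counted from the MSB of the group
      int end = start + bpv - 1; // last bit of docId k
      if (start / 64 != end / 64) {
        ops += 4; // MSSO: split across two longs
      } else if (start % 64 == 0 || end % 64 == 63) {
        ops += 1; // S (leftmost in its long) or M (rightmost in its long)
      } else {
        ops += 2; // MS
      }
    }
    return new int[] {values, groupBits / 64, ops};
  }

  /** Decode operations needed for a 512 docId block under this model. */
  static double opsPer512(int bpv) {
    int[] s = groupStats(bpv);
    return 512.0 * s[2] / s[0];
  }

  public static void main(String[] args) {
    for (int bpv = 20; bpv <= 31; bpv++) {
      int[] s = groupStats(bpv);
      System.out.println("BPV_" + bpv + ": " + s[0] + " docIds in " + s[1]
          + " longs, " + s[2] + " ops, " + opsPer512(bpv) + " ops per 512 docIds");
    }
  }
}
```

Under this model, groupStats(24) yields 8 docIds, 3 longs and 18 operations, and groupStats(20) yields 16 docIds, 5 longs and 38 operations, in line with the two examples above.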

Solution

While analysing BPV_21, I observed that if we pack just 3 docIds in a long, the number of bitwise operations for encoding and decoding drops below that of BPV_24. The extra bit is kept at the leftmost position ( MSB ) as 0, which reduces the number of operations.

| 1,21,21,21 | 1,21,21,21 | 1,21,21,21 |

Decoding

S, MS, M, S, MS, M, S, MS, M

In this case, it requires 12 bitwise operations to decode 9 docIds. So, 512 docIds will require ( 12/9 * 512 ) ~ 683 bitwise ops.
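
A minimal sketch of this layout, assuming each docId fits in 21 bits. The masks and shifts are the same ones that appear in the decode snippets later in this thread; the class and method names are hypothetical and this is not the code from the PR.

```java
final class Bpv21Sketch {
  /** Packs 3 docIds (each < 2^21) into one long; bit 63 stays 0. */
  static long pack3(int a, int b, int c) {
    return ((long) a << 42) | ((long) b << 21) | c;
  }

  /** Unpacks one long into 3 docIds written at out[offset .. offset + 2]. */
  static void unpack3(long packed, int[] out, int offset) {
    out[offset] = (int) (packed >>> 42);                             // S (no mask: the MSB is 0)
    out[offset + 1] = (int) ((packed & 0x000003FFFFE00000L) >>> 21); // MS
    out[offset + 2] = (int) (packed & 0x001FFFFFL);                  // M
  }

  public static void main(String[] args) {
    int[] out = new int[3];
    unpack3(pack3(2_097_151, 0, 1_234_567), out, 0);
    System.out.println(out[0] + " " + out[1] + " " + out[2]);
  }
}
```

Because the padding bit is always 0, the first docId needs only a shift, which is where the saving over a tightly packed 21 bit layout comes from.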

It stores 9 docIds in 3 packed longs. So, 512 docIds require ( 3/9 * 512 ) ~ 171 longs. Compared to BPV_24, this reduces the number of longs required for such leaves by 21 per block ( 192 - 171 ).
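
The per-block figures quoted in this description can be recomputed from the per-group numbers. The helper below is a hypothetical illustration that only repeats the arithmetic ( perGroup / docIdsPerGroup * 512, rounded ); it ignores how the trailing docIds of a block that do not fill a whole group are actually written.

```java
final class BlockCostComparison {
  /** Scales a per-group quantity (longs or ops) to a block of blockSize docIds, rounded. */
  static long perBlock(int perGroup, int docIdsPerGroup, int blockSize) {
    return Math.round((double) perGroup * blockSize / docIdsPerGroup);
  }

  public static void main(String[] args) {
    int block = 512;
    // Arguments: {longs or ops per group}, {docIds per group}, block size.
    System.out.println("BPV_24: " + perBlock(3, 8, block) + " longs, " + perBlock(18, 8, block) + " ops");
    System.out.println("BPV_20: " + perBlock(5, 16, block) + " longs, " + perBlock(38, 16, block) + " ops");
    System.out.println("BPV_21: " + perBlock(3, 9, block) + " longs, " + perBlock(12, 9, block) + " ops");
  }
}
```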

Micro Benchmark

Since BPV_21 will compete with BPV_24, I wrote a micro benchmark to compare the encoding and decoding speeds of both variations.

Bit21With2StepsAddEncoder and Bit21With3StepsAddEncoder both perform encode/decode using the proposed BPV_21 format.
Bit24Encoder is the exact replica of BPV_24 used in Lucene today.

Java version Used

openjdk version "22.0.1" 2024-04-16
OpenJDK Runtime Environment Corretto-22.0.1.8.1 (build 22.0.1+8-FR)
OpenJDK 64-Bit Server VM Corretto-22.0.1.8.1 (build 22.0.1+8-FR, mixed mode, sharing)

Input to the benchmark :

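        // Argument order matches the run commands in this description: encoder name, then iterations, then scale factor.
        private static final String USAGE = "\n USAGE " +
                "\n <1> Input Data File Path " +
                "\n <2> Output Directory Path " +
                "\n <3> Encoder name " +
                "\n <4> Number of Iterations " +
                "\n <5> Input Scale Factor";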

Sample run command

> nohup.out && nohup java -Xms6G -Xmx6G -cp <PathToJar>/lucene-core-10.0.0-SNAPSHOT.jar org.apache.lucene.util.bkd.docIds.DocIdWriterBenchmark <PathToInput>/finalfile_512.txt <OutDir> Bit24Encoder 10 1000 &

The data used in the benchmark is lucene/core/src/java/org/apache/lucene/util/bkd/docIds/data/finalfile.txt, which contains all docId sequences that can be represented in 21 bits ( max <= 0x001FFFFFL, i.e. 2,097,151 ). It was extracted from the first 10 million docs of the NYC Taxi data by indexing only the field fare_amount as a double point.

There are 6509 docId sequences in the input file and 6493 of them contain 512 docIds each. There are a total of 3,328,103 docIds in those 6509 sequences.
The input scale factor multiplies the number of docId sequences by the given factor to increase the load for the benchmark.

The script below was executed 5 times and the numbers are the average of those runs.
10 is the number of iterations for the encoder and 1000 is the input scale factor.

for i in `seq 1 10`
do

 echo -e "\nRunning benchmark for Bit24Encoder at "`date +'%F %T'`"\n"
 java -Xms6G -Xmx6G -cp ./lucene-core-10.0.0-SNAPSHOT.jar org.apache.lucene.util.bkd.docIds.DocIdWriterBenchmark ./finalfile.txt ./Out/ Bit24Encoder 10 1000

 echo -e "\nRunning benchmark for Bit21With2StepsAddEncoder at "`date +'%F %T'`"\n"
 java -Xms6G -Xmx6G -cp ./lucene-core-10.0.0-SNAPSHOT.jar org.apache.lucene.util.bkd.docIds.DocIdWriterBenchmark ./finalfile.txt ./Out/ Bit21With2StepsAddEncoder 10 1000

 echo -e "\nRunning benchmark for Bit21With3StepsAddEncoder at "`date +'%F %T'`"\n"
 java -Xms6G -Xmx6G -cp ./lucene-core-10.0.0-SNAPSHOT.jar org.apache.lucene.util.bkd.docIds.DocIdWriterBenchmark ./finalfile.txt ./Out/ Bit21With3StepsAddEncoder 10 1000

done

Write latency numbers exhibited 2 patterns: one when EBS write latency peaked and another during normal EBS latency. Both variations are captured below.

Architecture	Instance Type	Encoder	Decode Latency	Encode Low Write Latency	Encode High Write Latency
x86_64	m5.xlarge	Bit24Encoder	3303 ms	13023 ms	50599 ms
x86_64	m5.xlarge	Bit21With2StepsAddEncoder	3037 ms	11680 ms	42576 ms
x86_64	m5.xlarge	Bit21With3StepsAddEncoder	3081 ms	11101 ms	42610 ms
aarch64	r6g.large	Bit24Encoder	3454 ms	16208 ms	78946 ms
aarch64	r6g.large	Bit21With2StepsAddEncoder	2954 ms	14968 ms	65165 ms
aarch64	r6g.large	Bit21With3StepsAddEncoder	2792 ms	15777 ms	65177 ms

There was also a size reduction of around 200 MB in the kdd file when indexing the entire NYC Taxi data with this change.

Next steps

I need inputs from the maintainers and contributors on this new BPV format and on other benchmarks that need to be executed ( probably luceneutil ? ) to show that this change doesn't cause regressions like those seen in the SIMD optimisation.

After feedback, I will fix the build failures, add unit tests and remove the org.apache.lucene.util.bkd.docIds package so that this is in a state to be merged.

expani · 2024-06-26T10:22:20Z

Tagging @jpountz @iverase @gf2121 for an initial look, as you have the most context in this area.

jpountz · 2024-06-27T11:31:18Z

The high-level change makes sense to me. We've used this trick to encode 3 21-bit integers in a 64-bit long in the past, it makes sense to me that it's helping here too. I also like that it would apply to all segments that have less than 2M docs, which should cover a large number of segments.

In terms of code organization, can you write your benchmarks as a JMH benchmark and put it under lucene/benchmark-jmh?

github-actions · 2024-07-12T00:19:18Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

expani · 2024-08-20T12:17:42Z

While trying out 2 different approaches ( one a subset of the other ), I found that Bit21With3StepsEncoder is better than Bit21With2StepsEncoder on aarch64 platforms with OpenJDK 21/22, whereas they are similar on x86.

x86 - EC2 instance type c5.9xlarge

Benchmark                                            (encoderName)  Mode  Cnt     Score    Error  Units
DocIdEncodingBenchmark.performEncodeDecode  Bit21With3StepsEncoder  avgt  100  6815.861 ±  9.791  ms/op
DocIdEncodingBenchmark.performEncodeDecode  Bit21With2StepsEncoder  avgt  100  6842.147 ± 10.821  ms/op
DocIdEncodingBenchmark.performEncodeDecode            Bit24Encoder  avgt  100  7788.751 ± 19.213  ms/op

aarch64 - EC2 instance type r6g.large

Benchmark                                            (encoderName)  Mode  Cnt     Score    Error  Units
DocIdEncodingBenchmark.performEncodeDecode  Bit21With3StepsEncoder  avgt  100  7777.416 ± 14.748  ms/op
DocIdEncodingBenchmark.performEncodeDecode  Bit21With2StepsEncoder  avgt  100  8799.267 ± 54.436  ms/op
DocIdEncodingBenchmark.performEncodeDecode            Bit24Encoder  avgt  100  8613.884 ±  8.642  ms/op

JDK

openjdk version "22.0.2" 2024-07-16
OpenJDK Runtime Environment Corretto-22.0.2.9.1 (build 22.0.2+9-FR)
OpenJDK 64-Bit Server VM Corretto-22.0.2.9.1 (build 22.0.2+9-FR, mixed mode, sharing)

@jpountz IMO we should use Bit21With3StepsEncoder in DocIdsWriter, as using Bit21With2StepsEncoder might lead to a performance regression for workloads on aarch64 platforms.
We can replace it with Bit21With2StepsEncoder in the future once its performance on aarch64 is comparable to x86.

Let me know your thoughts on the same.

msfroh · 2024-08-27T21:40:43Z

The approach is pretty neat.

I'm wondering if Bit21With3StepsEncoder does better on aarch64 because of the explicitly unrolled loop? If so, I'm wondering if unrolling to a multiple of 2 longs would better align to processor cache lines.

That is, unrolling the loop to process 3 longs per iteration is faster than processing 1 long per iteration. What about 2 longs per iteration? What about 4 longs per iteration?
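
For readers following along, the two loop shapes being compared might look roughly like the sketch below. This is an assumption about the general structure only: the real Bit21With2StepsEncoder and Bit21With3StepsEncoder read from an IndexInput and may differ in detail, and the class and method names here are hypothetical.

```java
final class Bpv21LoopShapes {
  /** Decodes one packed long (3 docIds) per loop iteration. */
  static void decodeOneLongPerIteration(long[] packed, int[] docIds) {
    for (int i = 0, j = 0; i < packed.length; i++, j += 3) {
      long l = packed[i];
      docIds[j] = (int) (l >>> 42);
      docIds[j + 1] = (int) ((l & 0x000003FFFFE00000L) >>> 21);
      docIds[j + 2] = (int) (l & 0x001FFFFFL);
    }
  }

  /** Decodes three packed longs (9 docIds) per iteration, then the leftover longs one by one. */
  static void decodeThreeLongsPerIteration(long[] packed, int[] docIds) {
    int i = 0;
    int j = 0;
    for (; i + 2 < packed.length; i += 3, j += 9) {
      long l1 = packed[i], l2 = packed[i + 1], l3 = packed[i + 2];
      docIds[j] = (int) (l1 >>> 42);
      docIds[j + 1] = (int) ((l1 & 0x000003FFFFE00000L) >>> 21);
      docIds[j + 2] = (int) (l1 & 0x001FFFFFL);
      docIds[j + 3] = (int) (l2 >>> 42);
      docIds[j + 4] = (int) ((l2 & 0x000003FFFFE00000L) >>> 21);
      docIds[j + 5] = (int) (l2 & 0x001FFFFFL);
      docIds[j + 6] = (int) (l3 >>> 42);
      docIds[j + 7] = (int) ((l3 & 0x000003FFFFE00000L) >>> 21);
      docIds[j + 8] = (int) (l3 & 0x001FFFFFL);
    }
    for (; i < packed.length; i++, j += 3) {
      long l = packed[i];
      docIds[j] = (int) (l >>> 42);
      docIds[j + 1] = (int) ((l & 0x000003FFFFE00000L) >>> 21);
      docIds[j + 2] = (int) (l & 0x001FFFFFL);
    }
  }

  public static void main(String[] args) {
    long[] packed = {(1L << 42) | (2L << 21) | 3L, (4L << 42) | (5L << 21) | 6L};
    int[] a = new int[6];
    int[] b = new int[6];
    decodeOneLongPerIteration(packed, a);
    decodeThreeLongsPerIteration(packed, b);
    System.out.println(java.util.Arrays.equals(a, b));
  }
}
```

Both methods produce the same output; any speed difference would come from how the JIT compiles each loop shape on a given architecture, which is the question raised here.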

Since I've been playing around with the incubating vector API recently, I'm going to try downloading your microbenchmark and adding a vectorized implementation. (I have access to an M1 Mac that should be able to process 2 longs at a time, plus an Intel Xeon whose AVX-512 operations should probably be able to do 8 longs.)

msfroh · 2024-08-28T01:26:50Z

I tried modifying the loop to process 4 longs per iteration and noticed no difference on my Xeon host, which is unsurprising since there was no difference between 1 and 3.

I also tried the following SIMD implementation of decode:

        @Override
        public void decode(IndexInput in, int start, int count, int[] docIDs) throws IOException {
            int i = 0;

            long[] inputScratch = new long[LONG_SPECIES.length()];
            long[] outputScratch = new long[LONG_SPECIES.length() * 3];
            int bound = LONG_SPECIES.loopBound(count / 3) * 3;

            for (; i < bound; i += outputScratch.length) {
                for (int j = 0; j < LONG_SPECIES.length(); j++) {
                    inputScratch[j] = in.readLong();
                }
                LongVector longVector = LongVector.fromArray(LONG_SPECIES, inputScratch, 0);
                longVector.lanewise(VectorOperators.LSHR, 42)
                        .intoArray(outputScratch, 0);
                longVector.lanewise(VectorOperators.AND, 0x000003FFFFE00000L)
                        .lanewise(VectorOperators.LSHR, 21)
                        .intoArray(outputScratch, LONG_SPECIES.length());
                longVector.lanewise(VectorOperators.AND, 0x001FFFFFL)
                        .intoArray(outputScratch, LONG_SPECIES.length() * 2);
                for (int j = 0; j < LONG_SPECIES.length(); j++) {
                    // Lane j holds the 3 docIds of the j-th long in this batch, so the output stride is 3.
                    docIDs[i + j * 3] = (int) outputScratch[j];
                    docIDs[i + j * 3 + 1] = (int) outputScratch[j + LONG_SPECIES.length()];
                    docIDs[i + j * 3 + 2] = (int) outputScratch[j + LONG_SPECIES.length() * 2];
                }
            }
            for (; i < count - 2; i += 3) {
                long packedLong = in.readLong();
                docIDs[i] = (int) (packedLong >>> 42);
                docIDs[i + 1] = (int) ((packedLong & 0x000003FFFFE00000L) >>> 21);
                docIDs[i + 2] = (int) (packedLong & 0x001FFFFFL);
            }
            for (; i < count; i++) {
                docIDs[i] = in.readInt();
            }
        }

Unfortunately, it performs noticeably worse than the other implementations:

Benchmark                               (encoderName)  Mode  Cnt     Score    Error  Units
DocIdEncodingBenchmark.decode    Bit21WithSimdEncoder  avgt    5  2191.040 ± 14.913  ms/op
DocIdEncodingBenchmark.decode  Bit21With3StepsEncoder  avgt    5   850.331 ±  4.576  ms/op
DocIdEncodingBenchmark.decode  Bit21With2StepsEncoder  avgt    5   859.980 ±  4.567  ms/op
DocIdEncodingBenchmark.decode            Bit24Encoder  avgt    5   912.914 ±  5.488  ms/op

Maybe I'm doing it wrong 🤷

msfroh · 2024-08-28T17:31:33Z

Okay -- I was able to speed up the SIMD implementation a fair bit. Honestly, my main stupid mistake was that I hadn't declared LONG_SPECIES as static final, which probably prevented some inlining.

I removed the array allocations in each call, as well as the scalar operations within the vector loop.

        private static final VectorSpecies<Long> LONG_SPECIES = LongVector.SPECIES_MAX;
        private final long[] inputScratch = new long[512 / 3]; // We know that count is <= 512
        private final long[] outputScratch = new long[inputScratch.length * 3];
        @Override
        public void decode(IndexInput in, int start, int count, int[] docIDs) throws IOException {
            int i = 0;

            int bound = LONG_SPECIES.loopBound(count / 3) * 3;
            for (int j = 0; j < bound / 3; j++) {
                inputScratch[j] = in.readLong();
            }

            int inc = LONG_SPECIES.length() * 3;
            for (; i < bound; i += inc) {
                LongVector longVector = LongVector.fromArray(LONG_SPECIES, inputScratch, i/3);
                longVector.lanewise(VectorOperators.LSHR, 42)
                        .intoArray(outputScratch, i);
                longVector.lanewise(VectorOperators.AND, 0x000003FFFFE00000L)
                        .lanewise(VectorOperators.LSHR, 21)
                        .intoArray(outputScratch, i + LONG_SPECIES.length());
                longVector.lanewise(VectorOperators.AND, 0x001FFFFFL)
                        .intoArray(outputScratch, i + LONG_SPECIES.length() * 2);
            }
            for (int j = 0; j < bound; j += LONG_SPECIES.length() * 3) {
                for (int k = 0; k < LONG_SPECIES.length(); k++) {
                    docIDs[j + k * 3] = (int) outputScratch[j + k];
                    docIDs[j + k * 3 + 1] = (int) outputScratch[j + k + LONG_SPECIES.length()];
                    docIDs[j + k * 3 + 2] = (int) outputScratch[j + k + LONG_SPECIES.length() * 2];
                }
            }
            for (; i < count - 2; i += 3) {
                long packedLong = in.readLong();
                docIDs[i] = (int) (packedLong >>> 42);
                docIDs[i + 1] = (int) ((packedLong & 0x000003FFFFE00000L) >>> 21);
                docIDs[i + 2] = (int) (packedLong & 0x001FFFFFL);
            }
            for (; i < count; i++) {
                docIDs[i] = in.readInt();
            }
        }

It's still slower than the scalar implementation, but it's a lot closer:

Benchmark                               (encoderName)  Mode  Cnt     Score    Error  Units
DocIdEncodingBenchmark.decode    Bit21WithSimdEncoder  avgt    5  1032.151 ±  9.343  ms/op
DocIdEncodingBenchmark.decode  Bit21With3StepsEncoder  avgt    5   845.505 ±  5.924  ms/op
DocIdEncodingBenchmark.decode  Bit21With2StepsEncoder  avgt    5   851.975 ±  1.618  ms/op
DocIdEncodingBenchmark.decode            Bit24Encoder  avgt    5   913.055 ± 79.916  ms/op

expani added 4 commits June 21, 2024 11:08

BPV 21 Encoding in DocIdsWriter and it's micro benchmark

11e41b4

Deleted single loop encoders

2376766

Using 2 step bpv 21 encoder in docIdsWriter

886810b

Minor refactor

9e20958

github-actions bot added the Stale label Jul 12, 2024

expani mentioned this pull request Aug 1, 2024

[Feature Request] Improve BKD Tree DocIds Encoding for 24 and 32 bit variations opensearch-project/OpenSearch#13686

Open

expani added 14 commits August 6, 2024 18:22

Initial commit - DocId Benchmark

5034e04

Explicit use of NIOFSDirectory

e06bc3b

Increasing iterations to 10

d2122cd

Added a new encoder

d09e79a

Removed older benchmarks

1ddd597

Added License

9f3792b

Refactoring

4e8a1f2

Gradle Tidy

04361d3

Added teardown to properly cleanup files

f35605a

Added comments

4a02e64

Reading docId sequences from files

c01fb9b

Using charset while reading file

d751e60

Avoid write/read to files per unique sequence

b833e73

Fixing build failure for distro test

db1ed1d

github-actions bot removed the Stale label Aug 7, 2024

expani added 2 commits August 20, 2024 17:16

Added output verification code and cleanup

42eac06

Added sequence smaller than 512 elements

5fb5b25

expani and others added 30 commits October 7, 2024 14:32

Fixed gradle check errors

e20be32

Added hybrid bit21 encoder

16c6c2b

Added versions.lock for unit testing dependency in lucene-jmh module

fa34d4e

Changed decoder for os.arch amd64 ( x86 processor )

a090196

Merge branch 'apache:main' into bpv21_main

dad37c1

Added unit tests to randomly generate docIds for all encoders

fd862b7

Added unit tests to randomly generate docIds for all encoders

68d0bbf

Removed resource file with hardcoded docIds

33232e5

Removed resource file with hardcoded docIds

5853093

Refactoring

8d6f7ca

Merge branch 'apache:main' into bpv21_main

336308e

Fixed all gradle failures

4fb0a87

Auto inferring docIdProvider

33fd619

Made encoding conditional based on architecture

295531c

Merge branch 'apache:main' into bpv21_main

7312d91

Fixed file cleanup for unit tests

359440a

Benchmarking effects of only using writeLong instead of writeInt

4406d75

Closing IndexInput after every use

dc3d6fc

Temporarily added debug loggers to track for larger inputs

0d1d879

Disabling encode benchmark temporarily

5b6b26d

Added extra loggers

fdca13f

Removed print statements for debugging stages

b2c45e5

Made loading docIdSequences parallel to reduce benchmark time

ecf53d9

Excluding lines with comments

7f95cd3

Changed readLong to readInt for trailing docIds in leaf block

af8b914

Adding new encoder with only readLong instead of readInt to easily compare performance

4d10d9d

Adding new encoder with only readLong instead of readInt to easily compare performance

da2325c

Added new hybrid encoder based on readlong vs readint tests

06dd48f

Added a variation of BPV32 using only r/w long

775fb38

Added a variation of BPV32 using only r/w long

702580a
