
Improve FST performance created by Builder [LUCENE-9481] #10520

Open
asfimport opened this issue Aug 25, 2020 · 8 comments


@asfimport

Hi,

 

We are using Lucene 8 and recently we observed a performance issue with FST. When the FST is used directly from org.apache.lucene.util.fst.Builder.finish(), it's about 40-50% slower than when it's first saved to a DataOutput (e.g. the filesystem or heap) and loaded again. The FST we used is about 1.6MB.

 

Using VisualVM, we tracked the issue down to the DataInput.readVInt method, which in turn calls readByte on the reversed BytesReader. When the FST is loaded from the filesystem/heap, it uses OnHeapFSTStore, which uses ReverseBytesReader; but when it's created by the Builder, it's backed by the BytesStore. The problem is that when the BytesStore has more than one block, it uses an anonymous class for reading bytes, and likely because of JVM inlining limits, ReverseBytesReader.readByte is inlined while the anonymous class's readByte is not. We confirmed this by adding the -XX:+PrintInlining flag.

```
@ 80  org.apache.lucene.util.fst.BytesStore$2::readByte (68 bytes)            too big
@ 17  org.apache.lucene.util.fst.ReverseBytesReader::readByte (17 bytes)      inline (hot)
```
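The size difference is easy to model in standalone code (the classes below are hypothetical illustrations, not Lucene's internals): a reader over one contiguous array compiles to a tiny method that the JIT happily inlines, while a paged reader must handle block boundaries, which inflates the bytecode size past the inlining threshold.

```java
// Standalone model contrasting the two readByte shapes seen in the profile.

class SingleArrayReverseReader {
  private final byte[] bytes;
  private int pos;

  SingleArrayReverseReader(byte[] bytes, int startPos) {
    this.bytes = bytes;
    this.pos = startPos;
  }

  // Tiny body, comparable to ReverseBytesReader.readByte: an easy inline candidate.
  byte readByte() {
    return bytes[pos--];
  }
}

class PagedReverseReader {
  private final byte[][] blocks;
  private final int blockMask;
  private int blockIndex;
  private int posInBlock;

  PagedReverseReader(byte[][] blocks, int blockBits, long startPos) {
    this.blocks = blocks;
    this.blockMask = (1 << blockBits) - 1;
    this.blockIndex = (int) (startPos >> blockBits);
    this.posInBlock = (int) (startPos & blockMask);
  }

  // Block-boundary handling adds branching and bytecodes, which is what
  // pushes the real multi-block reader past the JIT's inline size limit.
  byte readByte() {
    byte b = blocks[blockIndex][posInBlock--];
    if (posInBlock < 0) {
      blockIndex--;
      posInBlock = blockMask;
    }
    return b;
  }
}
```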

 

Currently we have 2 workarounds, which can also confirm the above theory:

  • Use a larger bytesPageBits in the Builder, so that the block size is greater than the FST size.
  • Save and load the FST to heap after building, e.g. using GrowableByteArrayDataOutput and ByteArrayDataInput.

But I think it would be better if this optimization were done inside Builder.finish() instead, e.g. by auto-adjusting the block size of the BytesStore or copying it to an FSTStore when possible, since neither of the two workarounds above looks great. We have many FSTs with widely different sizes, so with the first option we would need to maintain bytesPageBits for each dictionary, otherwise we risk wasting heap or using an insufficient block size. I also think this issue affects other Lucene users.

 

For reference, the benchmark we setup is as below:

  • Warm up both FSTs for 10,000 iterations (10,000 seems to be the JVM inlining trigger threshold).
  • Run 100 tests; each test runs 100,000 iterations.
  • Each iteration simply calls org.apache.lucene.util.fst.Util.get() with the same 4-character word.

In 98 out of 100 tests, the FST with the save/load step was 40-50% faster on average.

Thanks


Migrated from LUCENE-9481 by Anh Dung Bui

@asfimport

Anh Dung Bui (migrated from JIRA)

I realize SynonymMap also uses FST and might have the same problem, although Builder has been renamed to FSTCompiler now.

@asfimport

Michael McCandless (@mikemccand) (migrated from JIRA)

Wow, that is a fascinating result!

Maybe, instead of paged byte[] storage, we should have a store that just grows a single byte[] (e.g. using ArrayUtil.grow) up to reasonable limits?  Then the resulting FST could use the same ReverseBytesReader?
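A minimal sketch of this single growing byte[] idea (the class and method names here are illustrative, not Lucene's API) might look like:

```java
// Hypothetical store backed by one contiguous byte[], grown by doubling in
// the spirit of ArrayUtil.grow, so reads need no block arithmetic at all.
class SingleArrayBytesStore {
  private byte[] bytes = new byte[16];
  private int upto;

  void writeByte(byte b) {
    if (upto == bytes.length) {
      // Double the backing array: amortized O(1) appends.
      bytes = java.util.Arrays.copyOf(bytes, bytes.length << 1);
    }
    bytes[upto++] = b;
  }

  int size() {
    return upto;
  }

  // A simple reverse reader (like ReverseBytesReader) can index straight
  // into this single array.
  byte readByte(int pos) {
    return bytes[pos];
  }
}
```

The trade-off Michael notes below still applies: each doubling transiently holds both the old and new arrays on heap.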

@mikemccand

Coming back to this fun FST issue ...

Given that FSTs are now readable off-heap, if we simply fixed FST writing to also be off-heap (this should not be so hard since the writing is already append-only), then we wouldn't have these silly performance problems because we are re-inventing an on-heap filesystem (paged byte[]), badly.

But a less radical fix here would be good too (progress not perfection) -- maybe growing the byte[] (though this may be problematic for larger FSTs, since you need max heap of at least 2X the final size of the FST). Or, when the FST is done being built, allocate a single byte[] and copy the pages to it? (But the same 2X+ heap size requirement applies there too.)
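The second option, copying the pages into one array after building finishes, could be sketched like this (hypothetical helper, not Lucene code); the ~2X transient heap cost is visible in the single `new byte[numBytes]` allocation alongside the still-live blocks:

```java
// Flatten paged storage into one contiguous byte[] once writing is done,
// so the simple single-array reverse reader can take over for reads.
class PageFlattener {
  static byte[] flatten(byte[][] blocks, int numBytes) {
    byte[] contiguous = new byte[numBytes];
    int upto = 0;
    for (byte[] block : blocks) {
      int chunk = Math.min(block.length, numBytes - upto);
      System.arraycopy(block, 0, contiguous, upto, chunk);
      upto += chunk;
      if (upto == numBytes) {
        break; // the final block may be only partially filled
      }
    }
    return contiguous;
  }
}
```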

@dungba88

dungba88 commented Sep 28, 2023

Thanks @mikemccand. I originally also thought of allocating a single byte[], and yes, it would have the 2X heap size requirement.

As our use case is building and using FSTs at runtime, there might be some penalty (I'm not sure how much) if we read/write entirely through the filesystem. In my previous experience with Java I/O, using FileChannel is much faster than InputStream/OutputStream when writing to the filesystem, but even so it is still slower than main memory.

I'm curious if Lucene also has a benchmark for comparison of off-heap and heap mode? (Also by off-heap, I assume you mean filesystem, instead of the direct memory, e.g with ByteBuffer.allocateDirect()?)

@dungba88

Maybe we could first allow FSTCompiler to accept its own DataOutput even when building the tree on the fly, instead of always relying on BytesStore? Then we are free to choose an appropriate implementation, whether off-heap, a growing byte[], etc.

@dungba88

dungba88 commented Oct 3, 2023

I'm planning to refactor the BytesStore into an interface that can be chosen from the FST builder. And one can decide whether on-heap or off-heap or on-heap without blocks is best suited.

@mikemccand

> I'm planning to refactor the BytesStore into an interface that can be chosen from the FST builder. And one can decide whether on-heap or off-heap or on-heap without blocks is best suited.

Awesome, thanks @dungba88!

But, at write time, can we avoid BytesStore entirely, maybe? Just write to a DataOutput? But we would need to first refactor FST#addNode to not use this reverse method on BytesStore, and do its own byte[] buffering/reversal before writing? If possible, can we do this as a first step, which we test/merge before making the change to move to DataOutput?
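The buffering/reversal step being proposed could be sketched roughly as follows (a hypothetical helper under the assumption that each node is first serialized forward into a scratch buffer; this is not the actual FST#addNode code):

```java
import java.io.ByteArrayOutputStream;

// Serialize a node's bytes forward into a scratch buffer, then append them
// reversed to a plain append-only stream, so the writer never needs random
// access into (or reversal of) bytes it has already written.
class NodeWriter {
  static void writeReversed(byte[] scratch, int len, ByteArrayOutputStream out) {
    for (int i = len - 1; i >= 0; i--) {
      out.write(scratch[i]);
    }
  }
}
```

With writes reduced to pure appends like this, any DataOutput implementation could back the build.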

Let's have further discussions on #12543?

@mikemccand

> Maybe we could first allow FSTCompiler to specify its own DataOutput even when building the tree on-the-fly, instead of always relying on BytesStore? And we are free to choose an appropriate implementation, whether off-heap, or growing byte, etc.

Ahh, yes, exactly! (Sorry, catching up and reading comments out-of-order!).

> As our use case are building and using FST at runtime, there might be (some, I'm not sure) penalty we would have if we are to read/write entirely with filesystem. In some of my previous experience with Java I/O, using FileChannel would be much faster than InputStream/OutputStream when writing to filesystem, but even so they are still slower than main memory.

Well, Lucene's DataOutput abstraction can be backed by heap too, e.g. ByteBuffersDataOutput or ByteArrayDataOutput (there may be others!). An application could even make its own custom DataOutput that grows by doubling a byte[].

> I'm curious if Lucene also has a benchmark for comparison of off-heap and heap mode? (Also by off-heap, I assume you mean filesystem, instead of the direct memory, e.g with ByteBuffer.allocateDirect()?)

I think in the original issue that introduced off-heap FST reading, luceneutil benchmarks were run? Not certain though. I don't know of other benchmarks.
