
Improve FST performance created by Builder [LUCENE-9481] #10520

Open
asfimport opened this issue Aug 25, 2020 · 8 comments


@asfimport

Hi,

 

We are using Lucene 8 and recently we observed a performance issue with FST. When the FST is used directly from org.apache.lucene.util.fst.Builder.finish(), it's about 40-50% slower than when it's first saved to a DataOutput (e.g. the filesystem or heap) and loaded again. The FST we used is about 1.6MB.

 

Using VisualVM, we tracked the issue down to the DataInput.readVInt method, which in turn calls readByte on the reversed BytesReader. When the FST is loaded from the filesystem/heap, it uses OnHeapFSTStore, which uses ReverseBytesReader; but when it's created by the Builder, it's backed by the BytesStore. The problem is that when the BytesStore has more than one block, it uses an anonymous class for reading bytes, and likely because of JVM inlining limits, ReverseBytesReader.readByte is inlined while the anonymous class's readByte is not. We confirmed this by adding the -XX:+PrintInlining flag.

```
@ 80  org.apache.lucene.util.fst.BytesStore$2::readByte (68 bytes)            too big
@ 17  org.apache.lucene.util.fst.ReverseBytesReader::readByte (17 bytes)      inline (hot)
```
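The size difference is easy to model in standalone code (the classes below are hypothetical illustrations, not Lucene's internals): a reader over one contiguous array compiles to a tiny method that the JIT happily inlines, while a paged reader must handle block boundaries, which inflates the bytecode size past the inlining threshold.

```java
// Standalone model contrasting the two readByte shapes seen in the profile.

class SingleArrayReverseReader {
  private final byte[] bytes;
  private int pos;

  SingleArrayReverseReader(byte[] bytes, int startPos) {
    this.bytes = bytes;
    this.pos = startPos;
  }

  // Tiny body, comparable to ReverseBytesReader.readByte: an easy inline candidate.
  byte readByte() {
    return bytes[pos--];
  }
}

class PagedReverseReader {
  private final byte[][] blocks;
  private final int blockMask;
  private int blockIndex;
  private int posInBlock;

  PagedReverseReader(byte[][] blocks, int blockBits, long startPos) {
    this.blocks = blocks;
    this.blockMask = (1 << blockBits) - 1;
    this.blockIndex = (int) (startPos >> blockBits);
    this.posInBlock = (int) (startPos & blockMask);
  }

  // Block-boundary handling adds branching and bytecodes, which is what
  // pushes the real multi-block reader past the JIT's inline size limit.
  byte readByte() {
    byte b = blocks[blockIndex][posInBlock--];
    if (posInBlock < 0) {
      blockIndex--;
      posInBlock = blockMask;
    }
    return b;
  }
}
```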

 

Currently we have 2 workarounds, which can also confirm the above theory:

  • Use a larger bytesPageBits in the Builder, so that the block size is greater than the FST size.
  • Save and load the FST to heap after building, e.g. using GrowableByteArrayDataOutput and ByteArrayDataInput.

But I think it would be better if this optimization were done inside Builder.finish() instead, e.g. by auto-adjusting the block size of the BytesStore or copying it to an FSTStore when possible, since neither of the two workarounds above looks great. We have many FSTs with widely different sizes, so with the first option we would need to maintain bytesPageBits for each dictionary, otherwise we risk wasting heap or using an insufficient block size. I also think this issue affects other Lucene users.

 

For reference, the benchmark we setup is as below:

  • Warm up both FSTs for 10,000 iterations (10,000 seems to be the JVM inlining trigger threshold).
  • Run 100 tests; each test runs 100,000 iterations.
  • Each iteration simply calls org.apache.lucene.util.fst.Util.get() with the same 4-character word.

In 98 out of 100 tests, the FST with the save/load step was 40-50% faster on average.

Thanks


Migrated from LUCENE-9481 by Anh Dung Bui

@asfimport

Anh Dung Bui (migrated from JIRA)

I realize SynonymMap also uses FST and might have the same problem, although Builder has been renamed to FSTCompiler now.

@asfimport

Michael McCandless (@mikemccand) (migrated from JIRA)

Wow, that is a fascinating result!

Maybe, instead of paged byte[] storage, we should have a store that just grows a single byte[] (e.g. using ArrayUtil.grow) up to reasonable limits?  Then the resulting FST could use the same ReverseBytesReader?
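A minimal sketch of this single growing byte[] idea (the class and method names here are illustrative, not Lucene's API) might look like:

```java
// Hypothetical store backed by one contiguous byte[], grown by doubling in
// the spirit of ArrayUtil.grow, so reads need no block arithmetic at all.
class SingleArrayBytesStore {
  private byte[] bytes = new byte[16];
  private int upto;

  void writeByte(byte b) {
    if (upto == bytes.length) {
      // Double the backing array: amortized O(1) appends.
      bytes = java.util.Arrays.copyOf(bytes, bytes.length << 1);
    }
    bytes[upto++] = b;
  }

  int size() {
    return upto;
  }

  // A simple reverse reader (like ReverseBytesReader) can index straight
  // into this single array.
  byte readByte(int pos) {
    return bytes[pos];
  }
}
```

The trade-off Michael notes below still applies: each doubling transiently holds both the old and new arrays on heap.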

@mikemccand

Coming back to this fun FST issue ...

Given that FSTs are now readable off-heap, if we simply fixed FST writing to also be off-heap (this should not be so hard since the writing is already append-only), then we wouldn't have these silly performance problems because we are re-inventing an on-heap filesystem (paged byte[]), badly.

But a less radical fix here would be good too (progress not perfection) -- maybe growing the byte[] (though this may be problematic for larger FSTs, since you need max heap of at least 2X the final size of the FST). Or, when the FST is done being built, allocate a single byte[] and copy the pages to it? (But the same 2X+ heap size requirement applies there too.)
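The second option, copying the pages into one array after building finishes, could be sketched like this (hypothetical helper, not Lucene code); the ~2X transient heap cost is visible in the single `new byte[numBytes]` allocation alongside the still-live blocks:

```java
// Flatten paged storage into one contiguous byte[] once writing is done,
// so the simple single-array reverse reader can take over for reads.
class PageFlattener {
  static byte[] flatten(byte[][] blocks, int numBytes) {
    byte[] contiguous = new byte[numBytes];
    int upto = 0;
    for (byte[] block : blocks) {
      int chunk = Math.min(block.length, numBytes - upto);
      System.arraycopy(block, 0, contiguous, upto, chunk);
      upto += chunk;
      if (upto == numBytes) {
        break; // the final block may be only partially filled
      }
    }
    return contiguous;
  }
}
```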

@dungba88

dungba88 commented Sep 28, 2023

Thanks @mikemccand. I originally also thought of allocating a single byte[], and yes, it would have the 2X heap size requirement.

As our use case is building and using FSTs at runtime, there might be some penalty (I'm not sure how much) if we read/write entirely through the filesystem. In my previous experience with Java I/O, using FileChannel is much faster than InputStream/OutputStream when writing to the filesystem, but even so it is still slower than main memory.

I'm curious if Lucene also has a benchmark for comparison of off-heap and heap mode? (Also by off-heap, I assume you mean filesystem, instead of the direct memory, e.g with ByteBuffer.allocateDirect()?)

@dungba88

Maybe we could first allow FSTCompiler to accept its own DataOutput even when building the tree on the fly, instead of always relying on BytesStore? Then we are free to choose an appropriate implementation, whether off-heap, a growing byte[], etc.

@dungba88

dungba88 commented Oct 3, 2023

I'm planning to refactor the BytesStore into an interface that can be chosen from the FST builder. And one can decide whether on-heap or off-heap or on-heap without blocks is best suited.

@mikemccand

> I'm planning to refactor the BytesStore into an interface that can be chosen from the FST builder. And one can decide whether on-heap or off-heap or on-heap without blocks is best suited.

Awesome, thanks @dungba88!

But, at write time, can we avoid BytesStore entirely, maybe? Just write to a DataOutput? But we would need to first refactor FST#addNode to not use this reverse method on BytesStore, and do its own byte[] buffering/reversal before writing? If possible, can we do this as a first step, which we test/merge before making the change to move to DataOutput?
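The buffering/reversal step being proposed could be sketched roughly as follows (a hypothetical helper under the assumption that each node is first serialized forward into a scratch buffer; this is not the actual FST#addNode code):

```java
import java.io.ByteArrayOutputStream;

// Serialize a node's bytes forward into a scratch buffer, then append them
// reversed to a plain append-only stream, so the writer never needs random
// access into (or reversal of) bytes it has already written.
class NodeWriter {
  static void writeReversed(byte[] scratch, int len, ByteArrayOutputStream out) {
    for (int i = len - 1; i >= 0; i--) {
      out.write(scratch[i]);
    }
  }
}
```

With writes reduced to pure appends like this, any DataOutput implementation could back the build.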

Let's have further discussions on #12543?

@mikemccand

> Maybe we could first allow FSTCompiler to specify its own DataOutput even when building the tree on-the-fly, instead of always relying on BytesStore? And we are free to choose an appropriate implementation, whether off-heap, or growing byte, etc.

Ahh, yes, exactly! (Sorry, catching up and reading comments out-of-order!).

> As our use case are building and using FST at runtime, there might be (some, I'm not sure) penalty we would have if we are to read/write entirely with filesystem. In some of my previous experience with Java I/O, using FileChannel would be much faster than InputStream/OutputStream when writing to filesystem, but even so they are still slower than main memory.

Well, Lucene's DataOutput abstraction can be backed by heap too, e.g. ByteBuffersDataOutput or ByteArrayDataOutput (there may be others!). An application could even make its own custom DataOutput that grows by doubling a byte[].

> I'm curious if Lucene also has a benchmark for comparison of off-heap and heap mode? (Also by off-heap, I assume you mean filesystem, instead of the direct memory, e.g with ByteBuffer.allocateDirect()?)

I think in the original issue that introduced off-heap FST reading, luceneutil benchmarks were run? Not certain though. I don't know of other benchmarks.
