
Increase the number of dims for KNN vectors to 2048 [LUCENE-10471] #11507

Closed
asfimport opened this issue Mar 17, 2022 · 51 comments

Comments


The current maximum allowed number of dimensions is 1024, but in practice we see a couple of well-known models that produce vectors with more than 1024 dimensions (e.g. mobilenet_v2 uses 1280-d vectors, OpenAI / GPT-3 Babbage uses 2048-d vectors). Increasing max dims to 2048 would satisfy these use cases.

I am wondering if anybody has strong objections against this.
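For a sense of scale (back-of-the-envelope arithmetic of my own, not from the issue), raw float32 vector storage grows linearly with dimension:

```python
# Rough storage cost of dense float32 vectors (excludes HNSW graph overhead).
BYTES_PER_FLOAT = 4

def raw_vector_bytes(dims: int, num_vectors: int = 1_000_000) -> int:
    """Bytes needed to store num_vectors dense float32 vectors of size dims."""
    return dims * BYTES_PER_FLOAT * num_vectors

for dims in (1024, 1280, 2048):
    gib = raw_vector_bytes(dims) / 1024**3
    print(f"{dims} dims -> {gib:.2f} GiB per 1M vectors")
```

Doubling the limit from 1024 to 2048 doubles the raw per-vector footprint, so the question in this thread is whether that linear cost is for users or for the library to judge.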


Migrated from LUCENE-10471 by Mayya Sharipova (@mayya-sharipova), 6 votes, updated Aug 15 2022
Pull requests: #874


Robert Muir (@rmuir) (migrated from JIRA)

I don't "strongly object" but I question the approach of just raising the limit to satisfy whatever shitty models people come up with. At some point we should have a limit, and people should do dimensionality reduction.


Julie Tibshirani (@jtibshirani) (migrated from JIRA)

I also don't have an objection to increasing it a bit. But along the same lines as Robert's point, it'd be good to think about our decision making process – otherwise we'd be tempted to continuously increase it. I've already heard users requesting 12288 dims (to handle OpenAI DaVinci embeddings).

Two possible approaches I could see:

  1. We do more research on the literature and decide on a reasonable max dimension. If a user wants to go beyond that, they should reconsider the model or perform dimensionality reduction. This would encourage users to think through their embedding strategy to optimize for performance. The improvements can be significant, since search time scales with vector dimensionality.
  2. Or we take a flexible approach where we bump the limit to a high upper bound. This upper bound would be based on how much memory usage is reasonable for one vector (similar to the max term size?)

I feel a bit better about approach 2 because I'm not confident I could come up with a statement about a "reasonable max dimension", especially given the fast-moving research.
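The "perform dimensionality reduction" option in approach 1 can be prototyped with a Gaussian random projection (a hypothetical pure-Python sketch of my own; real pipelines would typically use PCA or a learned projection):

```python
import random

def random_projection_matrix(in_dims: int, out_dims: int, seed: int = 0):
    """Gaussian random projection; N(0, 1/out_dims) entries roughly
    preserve inner products (Johnson-Lindenstrauss style)."""
    rng = random.Random(seed)
    sigma = (1.0 / out_dims) ** 0.5
    return [[rng.gauss(0.0, sigma) for _ in range(in_dims)]
            for _ in range(out_dims)]

def project(vec, matrix):
    """Map vec (length in_dims) into the reduced space (length out_dims)."""
    return [sum(r * x for r, x in zip(row, vec)) for row in matrix]

# Reduce a made-up 2048-dim embedding to 512 dims before indexing.
rng = random.Random(1)
embedding = [rng.uniform(-1.0, 1.0) for _ in range(2048)]
matrix = random_projection_matrix(2048, 512)
reduced = project(embedding, matrix)
print(len(reduced))  # 512
```

The search-time win is proportional to the reduction (distance computations scale with dimension), which is the performance argument behind approach 1.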


Robert Muir (@rmuir) (migrated from JIRA)

I think the major problem is still the lack of a Vector API in the Java APIs. It changes this entire conversation completely when we think about this limit.

If OpenJDK would release this low-level vector API, or barring that, maybe some way to MR-JAR for it, or barring that, maybe some intrinsics such as SloppyMath.dotProduct and SloppyMath.matrixMultiply, maybe Java wouldn't become the next COBOL.


Stanislav Stolpovskiy (migrated from JIRA)

I don't think there is a trend to increase dimensionality. Only a few models have feature dimensions above 2048.

Most modern neural networks (ViT and the whole BERT family) have dimensions below 1k.

However, there are still many models, like ms-resnet or EfficientNet, that operate in the range from 1k to 2048, and they are the most common models for image embedding and vector search.

The current limit forces dimensionality reduction for pretty standard shapes.


Michael Sokolov (@msokolov) (migrated from JIRA)

We should not be imposing an arbitrary limit that prevents people with CNNs (image-processing models) from using this feature. It makes sense to me to increase the limit to the point where we would see actual bugs/failures, or where the large numbers might prevent us from making some future optimization, rather than trying to determine where the performance stops being acceptable – that's a question for users to decide for themselves. Of course we don't know where that place is that we might want to optimize in the future (Rob and I discussed an idea using all-integer math that would suffer from overflow), but still we should not just allow MAX_INT dimensions, I think? To me a limit like 16K makes sense – well beyond any stated use case, but not effectively infinite.
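The all-integer-math aside can be made concrete with some arithmetic of my own (illustrative, not from the thread): with signed 8-bit vector components, how many worst-case product terms fit in a fixed-width accumulator before it overflows?

```python
# Illustrative overflow arithmetic for a hypothetical int8 vector encoding.
MAX_PRODUCT = 128 * 128  # worst case: (-128) * (-128) = 16384

INT32_MAX = 2**31 - 1
safe_dims_int32 = INT32_MAX // MAX_PRODUCT
print(safe_dims_int32)  # a 32-bit accumulator survives ~131K worst-case terms

INT16_MAX = 2**15 - 1
safe_dims_int16 = INT16_MAX // MAX_PRODUCT
print(safe_dims_int16)  # a 16-bit (SIMD-lane) accumulator overflows almost immediately
```

So a plain 32-bit accumulator would tolerate even a 16K limit, but narrower SIMD-friendly accumulations would not, which is exactly the kind of future optimization a generous limit can foreclose.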


Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

@sstolpovskiy @msokolov Thanks for providing your suggestions. It looks like we clearly see the need for up to 2048 dims for images, so I will be merging the linked PR.


Robert Muir (@rmuir) (migrated from JIRA)

My questions are still unanswered. Please don't merge the PR when there are standing objections!


Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

Sorry, maybe I should have provided more explanation.

  • First, this issue is only about raising max dims to 2048. We can create a separate issue to discuss other upper limits if there is a need for them.
  • According to our ML experts, resnet is an industry standard for images, and it can need up to 2048 dims. It would be good if we could support it in Lucene.
  • I can also run a performance test with 1M vectors of 2048 dims to see how much time and memory it may take to index and search such big vectors.


Robert Muir (@rmuir) (migrated from JIRA)

The problem is that nobody will ever want to reduce the limit in the future. Let's be honest: once we support a limit of N, nobody will want to make it smaller, because of the potential users who wouldn't be able to use it anymore.

So because this is a "one-way" decision, it needs serious justification, benchmarks, etc. Regardless of how the picture looks, it's definitely not something we should be "rushing" into 9.3.


Mayya Sharipova (@mayya-sharipova) (migrated from JIRA)

Got it, thanks, I will not rush, and will try to provide benchmarks.


Michael Wechner (@michaelwechner) (migrated from JIRA)

Maybe I do not understand the code base of Lucene well enough, but wouldn't it be possible to have a default limit of 1024 or 2048 and allow setting a different limit programmatically on the IndexWriter/Reader/Searcher?


Marcus Eagan (@MarcusSorealheis) (migrated from JIRA)

@michaelwechner You are free to increase the dimension limit, as it is a static variable and Lucene is your oyster. However, @erikhatcher has seared into my mind that a long-term fork of Lucene is a bad idea, for many reasons.

@rmuir I agree with you on "whatever shitty models." They are here, and more are coming. With respect to the vector API, Oracle is doing interesting work in OpenJDK 17 to improve their Vector API. They've added support for Intel's Short Vector Math Library, which should improve performance. The folks at OpenJDK exploit the Panama APIs. There are several hardware accelerations they have yet to exploit, and many operations still fall back to scalar code.

My argument for increasing the dimension limit is not that there is a better fulcrum in the performance tradeoff, but that more users testing Lucene is good for improving the feature.

OpenAI's DaVinci is one such model, but not the only one.

I've had customers ask for 4096 based on the performance they observe with question answering. I'm waiting on the model and will share when I know. If customers want to introduce rampant numerical errors in their systems, there is little we can do for them. Don't take my word on any of this yet. I need to bring data and complete evidence. I'm asking my customers why they cannot do dimensionality reduction.


Michael Sokolov (@msokolov) (migrated from JIRA)

> Maybe I do not understand the code base of Lucene well enough, but wouldn't it be possible to have a default limit of 1024 or 2048 and allow setting a different limit programmatically on the IndexWriter/Reader/Searcher?

I think the idea is to protect ourselves from accidental booboos; this could eventually get exposed in some shared configuration file, and then if somebody passes MAX_INT it could lead to allocating huge buffers somewhere and taking down a service shared by many people/groups? Hypothetical, but it's basically following the principle that we should be strict to help stop people shooting themselves and others in the feet. We may also want to preserve our ability to introduce optimizations that rely on some limits to the size, which would become difficult if usage of larger sizes became entrenched. (We can't so easily take it back once it's out there). Having said that I still feel a 16K limit, while allowing for models that are beyond reasonable, wouldn't cause any of these sort of issues, so that's the number I'm advocating.


Julie Tibshirani (@jtibshirani) (migrated from JIRA)

> It makes sense to me to increase the limit to the point where we would see actual bugs/failures, or where the large numbers might prevent us from making some future optimization, rather than trying to determine where the performance stops being acceptable - that's a question for users to decide for themselves.

Mike's perspective makes sense to me too. I'd be supportive of increasing the limit to an upper bound. Maybe we could run a test with ~1 million synthetic vectors with the proposed max dimension (~16K) to check there are no failures or unexpected behavior?


Robert Muir (@rmuir) (migrated from JIRA)

My main concern is that it can't be undone, as I mentioned. Nobody will be willing to go backwards.
It impacts more than the current implementation; it impacts future implementations as well (different algorithms and data structures).
If something like 16k dimensions is allowed, it may prevent even simple optimizations (such as 8-bit width).
So it's important to be very conservative.

This is why I make a big deal about it, because of the "one-way" nature of the backwards compatibility associated with this change. It seems this is still not yet understood or appreciated.

Historically, users fight against every limit we have in lucene, so when people complain about this one, it doesn't bother me (esp when it seems related to one or two bad models/bad decisions unrelated to this project). But these limits are important, especially when features are in their infancy, without them, there is less flexibility and you can find yourself easily "locked in" to a particular implementation.


Robert Muir (@rmuir) (migrated from JIRA)

It is also terrible that this issue says 2048 but somehow that has already blown up to 16k here.

-1 to 16K. It's unnecessarily large and puts the project at risk in the future. We can debate 2048.

@aykutfirat

aykutfirat commented Jan 3, 2023

Lots of things happened since Aug, like the arrival of ChatGPT, and people's increased desire to use OpenAI's state of the art embeddings which are of size 1536. Can you at least please increase it to 1536 for now, while you discuss upper limits?

@uschindler

> Lots of things happened since Aug, like the arrival of ChatGPT, and people's increased desire to use OpenAI's state of the art embeddings which are of size 1536. Can you at least please increase it to 1536 for now, while you discuss upper limits?

Actually it is a one-line change (without any guarantees), see https://github.com/apache/lucene/pull/874/files

If you really want to shoot yourself in the foot: download the Lucene source code in the version you need for your Elasticsearch instance (I assume you are coming from elastic/elasticsearch#92458), patch it with #874, and then run `./gradlew distribution`. Copy the JAR files into your ES distribution. Done.

But there is no guarantee this won't blow up, and indexes created that way may no longer be readable with standard Lucene.

@uschindler

Why I made that suggestion: if you are interested, try it out with your dataset and your Elasticsearch server and report back! Maybe you will find that performance does not hold up or memory usage is too high.

@gibrown

gibrown commented Apr 3, 2023

I'll preface this by saying I am also skeptical that going beyond 1024 makes sense for most use cases, and scaling is a concern. However, amidst the current excitement to try to use OpenAI embeddings, the first choice of a system to store and use those embeddings was Elasticsearch. Then people ran into the 1024 limit, and so various folks are looking at other alternatives largely because of this limit.

The use cases tend to be Q/A, summarization, and recommendation systems for WordPress and Tumblr. There are multiple proof-of-concept systems people have built (typically on top of various TypeScript, JavaScript, or Python libs) which use the OpenAI embeddings directly (and give quite impressive results). Even though I am pretty certain that reducing the dimensions will be a better idea for many of these, the ability to build and prototype on higher dimensions would be extremely useful.

@FranciscoBorges

@uschindler @rmuir FWIW We are interested in using Lucene's kNN with 1536 dimensions in order to use OpenAI's embeddings API. We benchmarked a patched Lucene/Solr. We fully understand (we measured it :-P) that there is an increase in memory consumption and latency. Sure thing.

We have applications where dev teams have chosen to work with OpenAI embeddings and where the number of records involved and requests per second make the trade offs of memory and latency perfectly acceptable.

There is a great deal of enthusiasm around OpenAI and releasing a working application ASAP. For many of these the resource cost of 1536 dimensions is perfectly acceptable against the alternative of delaying a pilot to optimize further.

Our work would be a lot easier if Lucene's kNN implementation supported 1536 dimensions without need for a patch.

@dsmiley

dsmiley commented May 5, 2023

I'm reminded of the great maxBooleanClauses debate. At least that limit is user-configurable (by the system deployer, not the end user doing a query), whereas this new one for kNN is not.

I can understand how we got to this point -- limits often start as hard limits. The current limit even seems high based on what has been said. But users have spoken here about a need to configure Lucene for their use case (such as experimentation within a system they are familiar with) and accept the performance consequences. I would like this to be possible with a system property; this hasn't been expressly asked for yet. Why should Lucene, just a library that doesn't know what's best for the user, prevent a user from doing that?

This isn't an inquiry about why limits exist; of course systems need limits.
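A deployer-facing override of the kind described here might look like the following (a hypothetical sketch; `KNN_MAX_DIMS` is an invented knob, and at the time Lucene's limit was a hard-coded constant, not a configurable property):

```python
import os

DEFAULT_MAX_DIMS = 1024  # the limit under discussion

def max_vector_dims() -> int:
    """Resolve the vector-dimension limit, allowing a deployer override.

    Values above the default are permitted but deliberately noisy, so
    raising the limit is a conscious decision rather than an accident.
    """
    raw = os.environ.get("KNN_MAX_DIMS")  # hypothetical deployer knob
    if raw is None:
        return DEFAULT_MAX_DIMS
    limit = int(raw)
    if limit > DEFAULT_MAX_DIMS:
        print(f"warning: vector dim limit raised to {limit}; "
              "performance and index compatibility are on you")
    return limit

print(max_vector_dims())
```

In Java this would be `Integer.getInteger("some.property", DEFAULT_MAX_DIMS)` style resolution; the point is that the default stays safe and the override leaves an audit trail.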

@alessandrobenedetti

Hi @dsmiley I updated the dev discussion on the mailing list:
[Proposal] Remove max number of dimensions for KNN vectors

And proceeded with a pragmatic new mail thread, where we just collect proposals with a motivation (no discussion there):
Dimensions Limit for KNN vectors - Next Steps

Feel free to participate!
My intention is to act relatively fast (and then also operate Solr side).
It's a train we don't need/want to miss!

@ryantbrown

The rabbit hole that is trying to store OpenAI embeddings in Elasticsearch eventually leads here. I read the entire thread, and unless I am missing something, the obvious move is to make the limit configurable (up to a point) or, at a minimum, increase the limit to 1536 to support the text-embedding-ada-002 model. In other words, there should be a compelling reason not to increase the limit beyond the fact that it will be hard to reduce in the future.

@nknize

nknize commented May 15, 2023

Cross posting here because I responded to the PR instead of this issue.

> ...why is it then that GPT-4, which internally represents each token with a vector of more than 8192, still inaccurately recalls information about entities?

I think this comment actually supports @MarcusSorealheis's argument? e.g., what's the point in indexing 8K dimensions if it isn't much better at recall than 768?

If the real issue is the use of HNSW, which isn't suitable for this, and not that high-dimensionality embeddings lack value, then the solution isn't to withhold the feature, but to switch to a technology more suitable for the type of applications that people use Lucene for.

I may be wrong, but it seems like this is where most of the Lucene committers here are settling?

Over a decade ago I wanted a high dimension index for some facial recognition and surveillance applications I was working on. I rejected Lucene at first only because it was written in Java, and I personally felt something like C++ was a better fit for the high-dimension job (no garbage collection to worry about). So I wrote a high dimension indexer for MongoDB inspired by RTree (for the record, its implementation is based on XTree) and wrote it using C++14 preview features (lambda functions were the new hotness on the block and Java didn't even have them yet). Even in C++ back then SIMD wasn't very well supported by the compiler natively, so I had to add all sorts of compiler tricks to squeeze every ounce of vector parallelization to make it performant.

C++ has gotten better since then, but I think Java still lags in this area? Even JEP 426 is a ways off (maybe because OpenJDK is holding these things hostage)? So maybe Java is still not the right fit here?

I wonder though, does that mean Lucene shouldn't provide dimensionality higher than the arbitrary 1024? Maybe not. I agree dimensionality-reduction techniques like PCA should be considered to reduce the storage volume. The problem with that argument is that dimensionality reduction fails when features are weakly correlated: you can't capture the majority of the signal in the first N components and therefore need higher dimensionality. But does that mean that 1024 is still too low to make Lucene a viable option?

Aside from conjecture, does anyone have empirical examples where 1024 is too low, and what specific Lucene capabilities (e.g., scoring?) would make adding support for dimensions higher than 1024 really worth considering over using dimensionality reduction? If Lucene doesn't do this, does it really risk the project becoming irrelevant? That sounds a bit like sensationalism.

Even if higher dimensionality is added to the current vector implementation (I'd actually argue we should explore converting BKD to support higher dimensions instead), are we convinced it will reasonably perform without JEP 426 or better SIMD support that's only available in newer JDKs? Can anyone smart here post their benchmarks to substantiate their claims? I know Pinecone (and others) have blogged about their love for Rust for these kinds of applications. Should Lucene just leave this to alternative search APIs, maybe something like Tantivy or Rucene? Or is it time we explore a new optional Lucene vector module that supports cutting-edge JDK features through gradle tooling for optimizing the vector use case?

Interested what others think.

@alessandrobenedetti

alessandrobenedetti commented Jun 10, 2023

Copying and pasting here, just for visibility:

> Here's an example of why making the vector dimensions configurable is a bad idea: #12281. This issue shows that each added dimension makes the floating point errors larger and sometimes also returns NaN. Do we have tests for when multiplying vectors causes NaN?

I may sound like somebody who contradicts another just for the sake of doing so, but I do genuinely believe these kinds of discoveries support the fact that making it configurable is actually a good idea:
We are not changing a production system here; we are changing a library.
Enabling more users to experiment with higher dimensions increases the probability of finding (and then solving) this sort of issue.
I suspect we are not recommending anywhere here to go to prod with untested and unbenchmarked vector sizes anyway.
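The precision worry behind #12281 is easy to reproduce in miniature. The sketch below (my own toy simulation, not from that issue) rounds every operation to float32 and compares the dot product against an exact sum:

```python
import math
import struct

def f32(x: float) -> float:
    """Round a Python double to the nearest IEEE-754 float32."""
    return struct.unpack("f", struct.pack("f", x))[0]

def dot_f32(a, b):
    """Dot product with float32 rounding after every operation,
    mimicking a float-based scoring loop."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = f32(acc + f32(f32(x) * f32(y)))
    return acc

def relative_error(dims: int) -> float:
    """Relative error of the float32 dot product vs. an exact sum
    over synthetic, strictly positive components."""
    a = [f32(0.001 * ((i % 7) + 1)) for i in range(dims)]
    b = [f32(0.001 * ((i % 5) + 1)) for i in range(dims)]
    exact = math.fsum(x * y for x, y in zip(a, b))
    return abs(dot_f32(a, b) - exact) / exact

for dims in (128, 1024, 8192):
    print(dims, relative_error(dims))
```

Sequential float32 accumulation error tends to grow with the number of terms, so higher-dimensional vectors give rounding more room to accumulate; this is the effect the quoted issue reports.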

@uschindler

> Enabling more users to experiment with higher dimensions increases the probability of finding (and then solving) this sort of issue.

It also shows that this causes a long tail of issues:

  • If we fixed the mentioned issue in the way proposed there, performance would go down: a widening conversion to double would break hotspot optimizations and would also not be possible with SIMD in a performant manner. So this is a reason to stay with the current limit for now.

@uschindler

In addition, if we raise the number of dimensions, people will then start asking for higher precision in calculations, completely forgetting that Lucene is a full-text search engine meant to bring results in milliseconds, not 2 hours. Score calculations introduce rounding anyway, and making them exact is (a) not needed for Lucene (we just sort on those values) and (b) would slow the whole thing down too much.

So keep the current limit and do NOT make it configurable. I agree to raise the maximum to 2048 (while recommending that people use Java 20 for running Lucene and enable incubator vectors).

At the same time, close any issues about calculation precision and, on the other hand, get the JDK people to support half-float calculations.

@dsmiley

dsmiley commented Jun 10, 2023

I think a library should empower a user to discover what works (and doesn't) for them, rather than playing big brother and insist it knows best that there's no way some high setting could ever work for any user. Right? By making it a system property that does not need to be configured for <= 1024, it should raise a red flag to users that they are venturing into unusual territory. i.e. they've been warned. They'd have to go looking for such a setting and see warnings; it's not something a user would do accidentally either.

if we raise the number of dimensions people will then start claiming for higher precision in calculations,

LOL People may ask for whatever they want :-) including using/abusing a system beyond its intended scope. So what? BTW I've thoroughly enjoyed seeing several use cases of my code in Lucene/Solr that I had never considered yet worked really well for a user :-D. Pure joy. Of course not every request makes sense to us. I'd rather endure such than turn users away from Lucene that we can support trivially today.

@alessandrobenedetti

> In addition, if we raise the number of dimensions, people will then start asking for higher precision in calculations, completely forgetting that Lucene is a full-text search engine meant to bring results in milliseconds, not 2 hours. Score calculations introduce rounding anyway, and making them exact is (a) not needed for Lucene (we just sort on those values) and (b) would slow the whole thing down too much.
>
> So keep the current limit and do NOT make it configurable. I agree to raise the maximum to 2048 (while recommending that people use Java 20 for running Lucene and enable incubator vectors).
>
> At the same time, close any issues about calculation precision and, on the other hand, get the JDK people to support half-float calculations.

@uschindler , I am not convinced but it's fine to have different opinions!
I do agree we should improve everything improvable and, in parallel, give users the flexibility to experiment:

  • break it
  • make it super slow with enormous vectors
  • make it super slow with enormous field content
  • make it super slow with a great number of fields
  • ...

I just think that the more you let your users do (with a reasonable effort), the more we'll gain users and, consequently, improvements (it's a fact that the more people you involve in a project as users, the more you increase the probability of developing contributors and contributions).

We may have different opinions here and that's fine, but my intent as a committer is to build the best solution for the community rather than the best solution according to my ideas.

You know, if we wanted sub-ms responses all the time we could set a hard limit to 1024 chars per textual field and allow a very low number of fields, but then would Lucene attract any user at all?

@mayya-sharipova

mayya-sharipova commented Jun 26, 2023

I would like to renew the issue in light of the recent integration of the incubating Panama Vector API, as indexing of vectors with it is much faster.

We ran a benchmark, and indexing a dataset of 1536-dim vectors (with the Panama Vector API enabled) was slightly faster than indexing 1024-dim vectors without it. This gives us enough confidence to extend max dims to 2048 (at least when vectorization is enabled).

Test environment

  • Dataset: nq dataset with text field embedded with OpenAI text-embedding-ada-002 model, 1536 dims
  • KnnGraphTester with maxConn: 16, beamWidthIndex: 100
  • Apple M1 laptop

Test1:

  • Lucene 9.7 branch
  • Panama Vector API not enabled
  • vector dims=1024 (OpenAi vectors that were cut off to first 1024 dims)
  • Results: Indexed 2680961 documents in 3287s
Details
 java -cp  "lib/*:classes" -Xmx16g -Xms16g org.apache.lucene.util.hnsw.KnnGraphTester -dim 1024 -ndoc 2680961 -reindex -docs vectors_dims1024.bin -maxConn 16 -beamWidthIndex 100
creating index in vectors_dims1024.bin-16-100.index
MS 0 [2023-06-26T11:10:24.765857Z; main]: initDynamicDefaults maxThreadCount=4 maxMergeCount=9
IFD 0 [2023-06-26T11:10:24.782017Z; main]: init: current segments file is "segments"; deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@646d64ab
IFD 0 [2023-06-26T11:10:24.783554Z; main]: now delete 0 files: []
IFD 0 [2023-06-26T11:10:24.784291Z; main]: now checkpoint "" [0 segments ; isCommit = false]
IFD 0 [2023-06-26T11:10:24.784338Z; main]: now delete 0 files: []
IFD 0 [2023-06-26T11:10:24.785377Z; main]: 0 ms to checkpoint
IW 0 [2023-06-26T11:10:24.785523Z; main]: init: create=true reader=null
IW 0 [2023-06-26T11:10:24.790087Z; main]:
dir=MMapDirectory@/Users/mayya/Elastic/knn/open_ai_vectors/vectors_dims1024.bin-16-100.index lockFactory=org.apache.lucene.store.NativeFSLockFactory@2c039ac6
index=
version=9.7.0
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
ramBufferSizeMB=1994.0
maxBufferedDocs=-1
mergedSegmentWarmer=null
delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
commit=null
openMode=CREATE
similarity=org.apache.lucene.search.similarities.BM25Similarity
mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=4, maxMergeCount=9, ioThrottle=true
codec=Lucene95
infoStream=org.apache.lucene.util.PrintStreamInfoStream
mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022208E12, noCFSRatio=0.1, deletesPctAllowed=20.0
readerPooling=true
perThreadHardLimitMB=1945
useCompoundFile=false
commitOnClose=true
indexSort=null
checkPendingFlushOnUpdate=true
softDeletesField=null
maxFullFlushMergeWaitMillis=500
leafSorter=null
eventListener=org.apache.lucene.index.IndexWriterEventListener$1@2173f6d9
writer=org.apache.lucene.index.IndexWriter@307f6b8c

IW 0 [2023-06-26T11:10:24.790232Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
DWPT 0 [2023-06-26T11:19:47.652040Z; main]: flush postings as segment _0 numDocs=460521
IW 0 [2023-06-26T11:19:47.653761Z; main]: 1 ms to write norms
IW 0 [2023-06-26T11:19:47.653954Z; main]: 0 ms to write docValues
IW 0 [2023-06-26T11:19:47.654032Z; main]: 0 ms to write points
IW 0 [2023-06-26T11:19:49.152263Z; main]: 1498 ms to write vectors
IW 0 [2023-06-26T11:19:49.166472Z; main]: 14 ms to finish stored fields
IW 0 [2023-06-26T11:19:49.166642Z; main]: 0 ms to write postings and finish vectors
IW 0 [2023-06-26T11:19:49.167167Z; main]: 0 ms to write fieldInfos
DWPT 0 [2023-06-26T11:19:49.167954Z; main]: new segment has 0 deleted docs
DWPT 0 [2023-06-26T11:19:49.168030Z; main]: new segment has 0 soft-deleted docs
DWPT 0 [2023-06-26T11:19:49.169572Z; main]: new segment has no vectors; no norms; no docValues; no prox; freqs
DWPT 0 [2023-06-26T11:19:49.169670Z; main]: flushedFiles=[_0_Lucene95HnswVectorsFormat_0.vem, _0.fdm, _0_Lucene95HnswVectorsFormat_0.vec, _0.fdx, _0_Lucene95HnswVectorsFormat_0.vex, _0.fdt, _0.fnm]
....
Indexed 2680961 documents in 3287s

Test2

  • Lucene 9.7 branch with FloatVectorValues.MAX_DIMENSIONS set to 2048
  • Panama Vector API enabled
  • vector dims=1536
  • Results: Indexed 2680961 documents in 3141s
Details
java --add-modules jdk.incubator.vector -cp  "lib/*:classes" -Xmx16g -Xms16g org.apache.lucene.util.hnsw.KnnGraphTester -dim 1536 -ndoc 2680961 -reindex -docs vectors.bin -maxConn 16 -beamWidthIndex 100

WARNING: Using incubator modules: jdk.incubator.vector
creating index in vectors.bin-16-100.index
Jun 26, 2023 10:34:29 A.M. org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 20; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
MS 0 [2023-06-26T14:34:29.271516Z; main]: initDynamicDefaults maxThreadCount=4 maxMergeCount=9
IFD 0 [2023-06-26T14:34:29.329779Z; main]: init: current segments file is "segments"; deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@64f6106c
IFD 0 [2023-06-26T14:34:29.336415Z; main]: now delete 0 files: []
IFD 0 [2023-06-26T14:34:29.338546Z; main]: now checkpoint "" [0 segments ; isCommit = false]
IFD 0 [2023-06-26T14:34:29.338654Z; main]: now delete 0 files: []
IFD 0 [2023-06-26T14:34:29.347243Z; main]: 2 ms to checkpoint
IW 0 [2023-06-26T14:34:29.348255Z; main]: init: create=true reader=null
IW 0 [2023-06-26T14:34:29.368686Z; main]:
dir=MMapDirectory@/Users/mayya/Elastic/knn/open_ai_vectors/vectors.bin-16-100.index lockFactory=org.apache.lucene.store.NativeFSLockFactory@319b92f3
index=
version=9.7.0
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
ramBufferSizeMB=1994.0
maxBufferedDocs=-1
mergedSegmentWarmer=null
delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
commit=null
openMode=CREATE
similarity=org.apache.lucene.search.similarities.BM25Similarity
mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=4, maxMergeCount=9, ioThrottle=true
codec=Lucene95
infoStream=org.apache.lucene.util.PrintStreamInfoStream
mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022208E12, noCFSRatio=0.1, deletesPctAllowed=20.0
readerPooling=true
perThreadHardLimitMB=1945
useCompoundFile=false
commitOnClose=true
indexSort=null
checkPendingFlushOnUpdate=true
softDeletesField=null
maxFullFlushMergeWaitMillis=500
leafSorter=null
eventListener=org.apache.lucene.index.IndexWriterEventListener$1@10a035a0
writer=org.apache.lucene.index.IndexWriter@67b467e9

IW 0 [2023-06-26T14:34:29.369224Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
Jun 26, 2023 10:34:29 A.M. org.apache.lucene.util.VectorUtilPanamaProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=128
DWPT 0 [2023-06-26T14:40:36.945965Z; main]: flush postings as segment _0 numDocs=314897
IW 0 [2023-06-26T14:40:36.949748Z; main]: 2 ms to write norms
IW 0 [2023-06-26T14:40:36.950336Z; main]: 0 ms to write docValues
IW 0 [2023-06-26T14:40:36.950452Z; main]: 0 ms to write points
IW 0 [2023-06-26T14:40:38.639069Z; main]: 1688 ms to write vectors
IW 0 [2023-06-26T14:40:38.669749Z; main]: 29 ms to finish stored fields
IW 0 [2023-06-26T14:40:38.670044Z; main]: 0 ms to write postings and finish vectors
IW 0 [2023-06-26T14:40:38.670847Z; main]: 0 ms to write fieldInfos
DWPT 0 [2023-06-26T14:40:38.672893Z; main]: new segment has 0 deleted docs
DWPT 0 [2023-06-26T14:40:38.673016Z; main]: new segment has 0 soft-deleted docs
DWPT 0 [2023-06-26T14:40:38.675915Z; main]: new segment has no vectors; no norms; no docValues; no prox; freqs
DWPT 0 [2023-06-26T14:40:38.676120Z; main]: flushedFiles=[_0_Lucene95HnswVectorsFormat_0.vem, _0.fdm, _0_Lucene95HnswVectorsFormat_0.vec, _0.fdx, _0_Lucene95HnswVectorsFormat_0.vex, _0.fdt, _0.fnm]
DWPT 0 [2023-06-26T14:40:38.676311Z; main]: flushed codec=Lucene95
DWPT 0 [2023-06-26T14:40:38.677609Z; main]: flushed: segment=_0 ramUsed=1,945.012 MB newFlushedSize=1,863.46 MB docs/MB=168.985
DWPT 0 [2023-06-26T14:40:38.680696Z; main]: flush time 1735.77025 ms
IW 0 [2023-06-26T14:40:38.682741Z; main]: publishFlushedSegment seg-private updates=null
IW 0 [2023-06-26T14:40:38.683738Z; main]: publishFlushedSegment _0(9.7.0):C314897:[diagnostics={source=flush, lucene.version=9.7.0, os.version=13.2.1, os.arch=x86_64, os=Mac OS X, java.vendor=Oracle Corporation, java.runtime.version=20.0.1+9-29, timestamp=1687790438678}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=717x28qrd00q2ke3d17eerf4x
BD 0 [2023-06-26T14:40:38.687864Z; main]: finished packet delGen=1 now completedDelGen=1
IW 0 [2023-06-26T14:40:38.691420Z; main]: publish sets newSegment delGen=1 seg=_0(9.7.0):C314897:[diagnostics={source=flush, lucene.version=9.7.0, os.version=13.2.1, os.arch=x86_64, os=Mac OS X, java.vendor=Oracle Corporation, java.runtime.version=20.0.1+9-29, timestamp=1687790438678}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=717x28qrd00q2ke3d17eerf4x
IFD 0 [2023-06-26T14:40:38.692639Z; main]: now checkpoint "_0(9.7.0):C314897:[diagnostics={source=flush, lucene.version=9.7.0, os.version=13.2.1, os.arch=x86_64, os=Mac OS X, java.vendor=Oracle Corporation, java.runtime.version=20.0.1+9-29, timestamp=1687790438678}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=717x28qrd00q2ke3d17eerf4y" [1 segments ; isCommit = false]
IFD 0 [2023-06-26T14:40:38.693268Z; main]: now delete 0 files: []
IFD 0 [2023-06-26T14:40:38.693464Z; main]: 1 ms to checkpoint
MP 0 [2023-06-26T14:40:38.700301Z; main]:   seg=_0(9.7.0):C314897:[diagnostics={source=flush, lucene.version=9.7.0, os.version=13.2.1, os.arch=x86_64, os=Mac OS X, java.vendor=Oracle Corporation, java.runtime.version=20.0.1+9-29, timestamp=1687790438678}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=717x28qrd00q2ke3d17eerf4y size=1863.460 MB
MP 0 [2023-06-26T14:40:38.701368Z; main]: findMerges: 1 segments
MP 0 [2023-06-26T14:40:38.701645Z; main]:   allowedSegmentCount=10 vs count=1 (eligible count=1)   
 ...

Indexed 2680961 documents in 3141s

@mikemccand
Member

We ran a benchmark test, and indexing a dataset of vectors of 1536 dims was slightly faster than indexing 1024 dims. This gives us enough confidence to extend max dims to 2048 (at least when vectorization is enabled).

I found this very strange at first :)

But then I read more closely, and I think what you meant is that indexing 1024 dims without Panama (SIMD vector instructions) is slower than indexing 1536 dims with Panama enabled? Which is really quite impressive.

Do we know what gains we see at search time going from 1024 -> 1536?

@uschindler
Contributor

uschindler commented Jun 27, 2023

Interestingly, it was only an Apple M1. This one has only a 128-bit vector size and only 2 PUs (the 128-bit width is in the CPU spec, but Robert told me about the number of PUs; I found no info on that in WikiChip). So I would also like to see the difference on a real AVX-512 machine with 4 PUs.

So unfortunately the Apple M1 is a bit limited, but it is still good enough to outperform the scalar impl. Cool. Now please test on a real Intel server CPU. 😍

In general I am fine with raising vectors to 2048 dims. But apply that limit only to the HNSW codec, so the check should be in the codec, not in the field type.

@mayya-sharipova
Contributor

@mikemccand Indeed, exactly as you said; sorry for being unclear. We have not checked search yet; we will work on that.

@uschindler Thanks, indeed, we need tests on other machines. +1 for raising dims to 2048 in HNSW codec.

@ChrisHegarty
Contributor

I ran @mayya-sharipova's exact same benchmark/test on my machine. Here are the results.

Test environment

  • Dataset: nq dataset with text field embedded with the OpenAI text-embedding-ada-002 model, 1536 dims
  • KnnGraphTester: maxConn: 16, beamWidthIndex: 100
  • Linux, x86_64, 11th Gen Intel Core i5-11400 @ 2.60GHz (AVX-512)
  • JDK 20.0.1

Result

Panama (bits) | dims | time (secs)
------------- | ---- | -----------
No            | 1024 | 3136
Yes (512)     | 1536 | 2633

So the test run with 1536 dims and Panama enabled at AVX 512 was 503 secs (or ~16%) faster than the run with 1024 dims and No Panama.

Test1:

  • Lucene 9.7.0
  • Panama Vector API not enabled
  • vector dims=1024 (OpenAI vectors truncated to the first 1024 dims)
  • Results: Indexed 2680961 documents in 3136s
Details
davekim$ time /home/chegar/binaries/jdk-20.0.1/bin/java  -cp lucene-9.7.0/modules/*:/home/chegar/git/lucene/lucene/core/build/classes/java/test  -Xmx16g -Xms16g  org.apache.lucene.util.hnsw.KnnGraphTester  -dim 1024  -ndoc 2680961  -reindex  -docs vector_search-open_ai_vectors_1024-vectors_dims1024.bin  -maxConn 16  -beamWidthIndex 100
creating index in vector_search-open_ai_vectors_1024-vectors_dims1024.bin-16-100.index
Jun 28, 2023 1:44:34 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 20; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
MS 0 [2023-06-28T12:44:34.340877459Z; main]: initDynamicDefaults maxThreadCount=4 maxMergeCount=9
IFD 0 [2023-06-28T12:44:34.355786340Z; main]: init: current segments file is "segments"; deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@7e9a5fbe
IFD 0 [2023-06-28T12:44:34.358595927Z; main]: now delete 0 files: []
IFD 0 [2023-06-28T12:44:34.359321686Z; main]: now checkpoint "" [0 segments ; isCommit = false]
IFD 0 [2023-06-28T12:44:34.359380405Z; main]: now delete 0 files: []
IFD 0 [2023-06-28T12:44:34.360606701Z; main]: 0 ms to checkpoint
IW 0 [2023-06-28T12:44:34.361060247Z; main]: init: create=true reader=null
IW 0 [2023-06-28T12:44:34.367050357Z; main]:
dir=MMapDirectory@/home/chegar/git/lucene-vector-bench/vector_search-open_ai_vectors_1024-vectors_dims1024.bin-16-100.index lockFactory=org.apache.lucene.store.NativeFSLockFactory@46238e3f
index=
version=9.7.0
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
ramBufferSizeMB=1994.0
maxBufferedDocs=-1
mergedSegmentWarmer=null
delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
commit=null
openMode=CREATE
similarity=org.apache.lucene.search.similarities.BM25Similarity
mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=4, maxMergeCount=9, ioThrottle=true
codec=Lucene95
infoStream=org.apache.lucene.util.PrintStreamInfoStream
mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022208E12, noCFSRatio=0.1, deletesPctAllowed=20.0
readerPooling=true
perThreadHardLimitMB=1945
useCompoundFile=false
commitOnClose=true
indexSort=null
checkPendingFlushOnUpdate=true
softDeletesField=null
maxFullFlushMergeWaitMillis=500
leafSorter=null
eventListener=org.apache.lucene.index.IndexWriterEventListener$1@6c9f5c0d
writer=org.apache.lucene.index.IndexWriter@de3a06f

IW 0 [2023-06-28T12:44:34.367221110Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
Jun 28, 2023 1:44:34 PM org.apache.lucene.util.VectorUtilProvider lookup
WARNING: Java vector incubator module is not readable. For optimal vector performance, pass '--add-modules jdk.incubator.vector' to enable Vector API.
DWPT 0 [2023-06-28T12:53:31.591056430Z; main]: flush postings as segment _0 numDocs=460521
IW 0 [2023-06-28T12:53:31.591842896Z; main]: 0 ms to write norms
IW 0 [2023-06-28T12:53:31.592260907Z; main]: 0 ms to write docValues
IW 0 [2023-06-28T12:53:31.592370750Z; main]: 0 ms to write points
IW 0 [2023-06-28T12:53:32.987321518Z; main]: 1394 ms to write vectors
IW 0 [2023-06-28T12:53:32.997512174Z; main]: 10 ms to finish stored fields
IW 0 [2023-06-28T12:53:32.997693539Z; main]: 0 ms to write postings and finish vectors
IW 0 [2023-06-28T12:53:32.998159715Z; main]: 0 ms to write fieldInfos
DWPT 0 [2023-06-28T12:53:32.999257618Z; main]: new segment has 0 deleted docs
DWPT 0 [2023-06-28T12:53:32.999365945Z; main]: new segment has 0 soft-deleted docs
DWPT 0 [2023-06-28T12:53:33.000456314Z; main]: new segment has no vectors; no norms; no docValues; no prox; freqs
DWPT 0 [2023-06-28T12:53:33.000586334Z; main]: flushedFiles=[_0_Lucene95HnswVectorsFormat_0.vem, _0.fdm, _0_Lucene95HnswVectorsFormat_0.vec, _0.fdx, _0_Lucene95HnswVectorsFormat_0.vex, _0.fdt, _0.fnm]
DWPT 0 [2023-06-28T12:53:33.000673681Z; main]: flushed codec=Lucene95
DWPT 0 [2023-06-28T12:53:33.001725500Z; main]: flushed: segment=_0 ramUsed=1,945.017 MB newFlushedSize=1,824.658 MB docs/MB=252.388
DWPT 0 [2023-06-28T12:53:33.002919290Z; main]: flush time 1412.932331 ms
IW 0 [2023-06-28T12:53:33.004048349Z; main]: publishFlushedSegment seg-private updates=null
IW 0 [2023-06-28T12:53:33.004702334Z; main]: publishFlushedSegment _0(9.7.0):C460521:[diagnostics={os.arch=amd64, os.version=6.2.0-23-generic, lucene.version=9.7.0, source=flush, timestamp=1687956813001, java.runtime.version=20.0.1+9-29, java.vendor=Oracle Corporation, os=Linux}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=1qx5zulv7rcv8o0t4f62zfjjz
BD 0 [2023-06-28T12:53:33.006074639Z; main]: finished packet delGen=1 now completedDelGen=1
IW 0 [2023-06-28T12:53:33.007517182Z; main]: publish sets newSegment delGen=1 seg=_0(9.7.0):C460521:[diagnostics={os.arch=amd64, os.version=6.2.0-23-generic, lucene.version=9.7.0, source=flush, timestamp=1687956813001, java.runtime.version=20.0.1+9-29, java.vendor=Oracle Corporation, os=Linux}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=1qx5zulv7rcv8o0t4f62zfjjz
IFD 0 [2023-06-28T12:53:33.007718974Z; main]: now checkpoint "_0(9.7.0):C460521:[diagnostics={os.arch=amd64, os.version=6.2.0-23-generic, lucene.version=9.7.0, source=flush, timestamp=1687956813001, java.runtime.version=20.0.1+9-29, java.vendor=Oracle Corporation, os=Linux}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=1qx5zulv7rcv8o0t4f62zfjk0" [1 segments ; isCommit = false]
IFD 0 [2023-06-28T12:53:33.008114732Z; main]: now delete 0 files: []
IFD 0 [2023-06-28T12:53:33.008168685Z; main]: 0 ms to checkpoint
MP 0 [2023-06-28T12:53:33.010309939Z; main]:   seg=_0(9.7.0):C460521:[diagnostics={os.arch=amd64, os.version=6.2.0-23-generic, lucene.version=9.7.0, source=flush, timestamp=1687956813001, java.runtime.version=20.0.1+9-29, java.vendor=Oracle Corporation, os=Linux}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=1qx5zulv7rcv8o0t4f62zfjk0 size=1824.659 MB
MP 0 [2023-06-28T12:53:33.010610953Z; main]: findMerges: 1 segments
MP ...
Indexed 2680961 documents in 3136s

Test2

  • Lucene 9.7 with FloatVectorValues.MAX_DIMENSIONS patched to 2048
  • Panama Vector API enabled preferredBitSize=512
  • vector dims=1536
  • Results: Indexed 2680961 documents in 2633s
Details
davekim$ time /home/chegar/binaries/jdk-20.0.1/bin/java \
  --add-modules=jdk.incubator.vector \
  -cp /home/chegar/git/lucene/lucene/core/build/libs/lucene-core-9.7.0-SNAPSHOT.jar:lucene-9.7.0/modules/*:/home/chegar/git/lucene/lucene/core/build/classes/java/test \
  -Xmx16g -Xms16g \
  org.apache.lucene.util.hnsw.KnnGraphTester \
  -dim 1536 \
  -ndoc 2680961 \
  -reindex \
  -docs vector_search-open_ai_vectors-vectors.bin \
  -maxConn 16 \
  -beamWidthIndex 100
WARNING: Using incubator modules: jdk.incubator.vector
creating index in vector_search-open_ai_vectors-vectors.bin-16-100.index
Jun 28, 2023 3:18:08 PM org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 20; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
MS 0 [2023-06-28T14:18:08.783226914Z; main]: initDynamicDefaults maxThreadCount=4 maxMergeCount=9
IFD 0 [2023-06-28T14:18:08.798094830Z; main]: init: current segments file is "segments"; deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@1efee8e7
IFD 0 [2023-06-28T14:18:08.800639373Z; main]: now delete 0 files: []
IFD 0 [2023-06-28T14:18:08.801349082Z; main]: now checkpoint "" [0 segments ; isCommit = false]
IFD 0 [2023-06-28T14:18:08.801461676Z; main]: now delete 0 files: []
IFD 0 [2023-06-28T14:18:08.802987862Z; main]: 0 ms to checkpoint
IW 0 [2023-06-28T14:18:08.803265302Z; main]: init: create=true reader=null
IW 0 [2023-06-28T14:18:08.809406650Z; main]:
dir=MMapDirectory@/home/chegar/git/lucene-vector-bench/vector_search-open_ai_vectors-vectors.bin-16-100.index lockFactory=org.apache.lucene.store.NativeFSLockFactory@1dd02175
index=
version=9.7.0
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
ramBufferSizeMB=1994.0
maxBufferedDocs=-1
mergedSegmentWarmer=null
delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
commit=null
openMode=CREATE
similarity=org.apache.lucene.search.similarities.BM25Similarity
mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=4, maxMergeCount=9, ioThrottle=true
codec=Lucene95
infoStream=org.apache.lucene.util.PrintStreamInfoStream
mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022208E12, noCFSRatio=0.1, deletesPctAllowed=20.0
readerPooling=true
perThreadHardLimitMB=1945
useCompoundFile=false
commitOnClose=true
indexSort=null
checkPendingFlushOnUpdate=true
softDeletesField=null
maxFullFlushMergeWaitMillis=500
leafSorter=null
eventListener=org.apache.lucene.index.IndexWriterEventListener$1@3d3fcdb0
writer=org.apache.lucene.index.IndexWriter@641147d0

IW 0 [2023-06-28T14:18:08.809591811Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
Jun 28, 2023 3:18:08 PM org.apache.lucene.util.VectorUtilPanamaProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=512
DWPT 0 [2023-06-28T14:23:17.927393364Z; main]: flush postings as segment _0 numDocs=314897
IW 0 [2023-06-28T14:23:17.928214793Z; main]: 0 ms to write norms
IW 0 [2023-06-28T14:23:17.928486805Z; main]: 0 ms to write docValues
IW 0 [2023-06-28T14:23:17.928593869Z; main]: 0 ms to write points
IW 0 [2023-06-28T14:23:19.282981254Z; main]: 1354 ms to write vectors
IW 0 [2023-06-28T14:23:19.290000600Z; main]: 6 ms to finish stored fields
IW 0 [2023-06-28T14:23:19.290178853Z; main]: 0 ms to write postings and finish vectors
IW 0 [2023-06-28T14:23:19.290669001Z; main]: 0 ms to write fieldInfos
DWPT 0 [2023-06-28T14:23:19.291053701Z; main]: new segment has 0 deleted docs
DWPT 0 [2023-06-28T14:23:19.291129515Z; main]: new segment has 0 soft-deleted docs
DWPT 0 [2023-06-28T14:23:19.292160606Z; main]: new segment has no vectors; no norms; no docValues; no prox; freqs
DWPT 0 [2023-06-28T14:23:19.292249403Z; main]: flushedFiles=[_0_Lucene95HnswVectorsFormat_0.vem, _0.fdm, _0_Lucene95HnswVectorsFormat_0.vec, _0.fdx, _0_Lucene95HnswVectorsFormat_0.vex, _0.fdt, _0.fnm]
DWPT 0 [2023-06-28T14:23:19.292320403Z; main]: flushed codec=Lucene95
DWPT 0 [2023-06-28T14:23:19.295665508Z; main]: flushed: segment=_0 ramUsed=1,945.012 MB newFlushedSize=1,863.46 MB docs/MB=168.985
DWPT 0 [2023-06-28T14:23:19.296825017Z; main]: flush time 1370.228388 ms
IW 0 [2023-06-28T14:23:19.297541689Z; main]: publishFlushedSegment seg-private updates=null
IW 0 [2023-06-28T14:23:19.298158353Z; main]: publishFlushedSegment _0(9.7.0):C314897:[diagnostics={source=flush, timestamp=1687962199295, java.runtime.version=20.0.1+9-29, java.vendor=Oracle Corporation, os=Linux, os.arch=amd64, os.version=6.2.0-23-generic, lucene.version=9.7.0}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=9b08nbm1nw553b43pa9kzvach
BD 0 [2023-06-28T14:23:19.299549573Z; main]: finished packet delGen=1 now completedDelGen=1
IW 0 [2023-06-28T14:23:19.301085879Z; main]: publish sets newSegment delGen=1 seg=_0(9.7.0):C314897:[diagnostics={source=flush, timestamp=1687962199295, java.runtime.version=20.0.1+9-29, java.vendor=Oracle Corporation, os=Linux, os.arch=amd64, os.version=6.2.0-23-generic, lucene.version=9.7.0}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=9b08nbm1nw553b43pa9kzvach
IFD 0 [2023-06-28T14:23:19.301281180Z; main]: now checkpoint "_0(9.7.0):C314897:[diagnostics={source=flush, timestamp=1687962199295, java.runtime.version=20.0.1+9-29, java.vendor=Oracle Corporation, os=Linux, os.arch=amd64, os.version=6.2.0-23-generic, lucene.version=9.7.0}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=9b08nbm1nw553b43pa9kzvaci" [1 segments ; isCommit = false]
IFD 0 [2023-06-28T14:23:19.301666023Z; main]: now delete 0 files: []
IFD 0 [2023-06-28T14:23:19.301718781Z; main]: 0 ms to checkpoint
MP 0 [2023-06-28T14:23:19.303689024Z; main]:   seg=_0(9.7.0):C314897:[diagnostics={source=flush, timestamp=1687962199295, java.runtime.version=20.0.1+9-29, java.vendor=Oracle Corporation, os=Linux, os.arch=amd64, os.version=6.2.0-23-generic, lucene.version=9.7.0}]:[attributes={Lucene90StoredFieldsFormat.mode=BEST_SPEED}] :id=9b08nbm1nw553b43pa9kzvaci size=1863.460 MB
MP 0 [2023-06-28T14:23:19.303936133Z; main]: findMerges: 1 segments
MP ....
Indexed 2680961 documents in 2633s

Full output from the test runs can be seen here: https://gist.github.com/ChrisHegarty/ef008da196624c1a3fe46578ee3a0a6c.

@rmuir
Member

rmuir commented Jun 28, 2023

Can we run this test with Lucene's defaults (e.g. not a 2GB rambuffer)?
We are still talking about an hour to index < 3M docs, so I think the performance is not good.
As I've said before, I never thought 1024 was a good situation either; 768 is also excruciating.
The purpose of the vectorization is just to alleviate some of the pain. It is like giving the patient an aspirin: it doesn't really fix the problem.

@alessandrobenedetti
Contributor

I am extremely curious: what should we consider good performance for indexing < 3M docs?
I mean, I agree we should always try to improve things and aim for the stars, but as maintainers of a library, who are we to decide what's acceptable and what's not for users?
Is it because of a comparison with other libraries or solutions?
They may have many reasons for being faster (and we should definitely take inspiration from them).
If we look at https://home.apache.org/~mikemccand/lucenebench/indexing.html, we have clearly improved indexing throughput substantially over the years. Does this mean that Lucene back in 2011 should not have committed additional features/improvements because for some people (people from the future) "it was slow"?

@mayya-sharipova
Contributor

@rmuir

Can we run this test with lucene's defaults (e.g. not a 2GB rambuffer)?

I've done the test, and surprisingly indexing time decreased substantially. It is almost 2x faster to index with Lucene's defaults than with a 2 GB RAM buffer, at the expense of ending up with a bigger number of segments.

  • Lucene 9.7 branch with FloatVectorValues.MAX_DIMENSIONS set to 2048
  • preferredBitSize=128
  • Panama Vector API enabled
  • vector dims: 1536
  • num of docs: 2.68M
RAM buffer size | Indexing time | Num of segments
--------------- | ------------- | ---------------
16 MB           | 1877 s        | 19
1994 MB         | 3141 s        | 9
Details
WARNING: Using incubator modules: jdk.incubator.vector
Jul 10, 2023 3:35:25 P.M. org.apache.lucene.store.MemorySegmentIndexInputProvider <init>
INFO: Using MemorySegmentIndexInput with Java 20; to disable start with -Dorg.apache.lucene.store.MMapDirectory.enableMemorySegments=false
Jul 10, 2023 3:35:26 P.M. org.apache.lucene.util.VectorUtilPanamaProvider <init>
INFO: Java vector incubator API enabled; uses preferredBitSize=128

_fc.fdt                             _v6.fnm                             _vj.si                              _vr_Lucene95HnswVectorsFormat_0.vec
_fc.fdx                             _v6.si                              _vj_Lucene95HnswVectorsFormat_0.vec _vr_Lucene95HnswVectorsFormat_0.vem
_fc.fnm                             _v6_Lucene95HnswVectorsFormat_0.vec _vj_Lucene95HnswVectorsFormat_0.vem _vr_Lucene95HnswVectorsFormat_0.vex
_fc.si                              _v6_Lucene95HnswVectorsFormat_0.vem _vj_Lucene95HnswVectorsFormat_0.vex _vs.fdm
_fc_Lucene95HnswVectorsFormat_0.vec _v6_Lucene95HnswVectorsFormat_0.vex _vl.fdm                             _vs.fdt
creating index in vectors.bin-16-100.index
MS 0 [2023-07-10T14:47:25.668178Z; main]: initDynamicDefaults maxThreadCount=4 maxMergeCount=9
IFD 0 [2023-07-10T14:47:25.725823Z; main]: init: current segments file is "segments"; deletionPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy@64f6106c
IFD 0 [2023-07-10T14:47:25.735809Z; main]: now delete 0 files: []
IFD 0 [2023-07-10T14:47:25.738456Z; main]: now checkpoint "" [0 segments ; isCommit = false]
IFD 0 [2023-07-10T14:47:25.738587Z; main]: now delete 0 files: []
IFD 0 [2023-07-10T14:47:25.743719Z; main]: 2 ms to checkpoint
IW 0 [2023-07-10T14:47:25.744195Z; main]: init: create=true reader=null
IW 0 [2023-07-10T14:47:25.779752Z; main]:
dir=MMapDirectory@/Users/mayya/Elastic/knn/open_ai_vectors/vectors.bin-16-100.index lockFactory=org.apache.lucene.store.NativeFSLockFactory@319b92f3
index=
version=9.7.0
analyzer=org.apache.lucene.analysis.standard.StandardAnalyzer
ramBufferSizeMB=16.0
maxBufferedDocs=-1
mergedSegmentWarmer=null
delPolicy=org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy
commit=null
openMode=CREATE
similarity=org.apache.lucene.search.similarities.BM25Similarity
mergeScheduler=ConcurrentMergeScheduler: maxThreadCount=4, maxMergeCount=9, ioThrottle=true
codec=Lucene95
infoStream=org.apache.lucene.util.PrintStreamInfoStream
mergePolicy=[TieredMergePolicy: maxMergeAtOnce=10, maxMergedSegmentMB=5120.0, floorSegmentMB=2.0, forceMergeDeletesPctAllowed=10.0, segmentsPerTier=10.0, maxCFSSegmentSizeMB=8.796093022208E12, noCFSRatio=0.1, deletesPctAllowed=20.0
readerPooling=true
perThreadHardLimitMB=1945
useCompoundFile=false
commitOnClose=true
indexSort=null
checkPendingFlushOnUpdate=true
softDeletesField=null
maxFullFlushMergeWaitMillis=500
leafSorter=null
eventListener=org.apache.lucene.index.IndexWriterEventListener$1@10a035a0
writer=org.apache.lucene.index.IndexWriter@67b467e9

IW 0 [2023-07-10T14:47:25.780320Z; main]: MMapDirectory.UNMAP_SUPPORTED=true
FP 0 [2023-07-10T14:47:27.042597Z; main]: trigger flush: activeBytes=16779458 deleteBytes=0 vs ramBufferMB=16.0
FP 0 [2023-07-10T14:47:27.045564Z; main]: thread state has 16779458 bytes; docInRAM=2589
FP 0 [2023-07-10T14:47:27.049109Z; main]: 1 in-use non-flushing threads states
DWPT 0 [2023-07-10T14:47:27.050859Z; main]: flush postings as segment _0 numDocs=2589
....
Indexed 2680961 documents in 1877s

@dweiss
Contributor

dweiss commented Jul 10, 2023

Leaving a higher number of segments dodges the merge costs, I think.

@jpountz
Contributor

jpountz commented Jul 10, 2023

This benchmark really only measures the flushing cost: since ConcurrentMergeScheduler is used, merges run in background threads. So the improvement makes sense to me, as the cost of adding vectors to an HNSW graph increases as the graph grows. If we want to get a sense of the number of docs per second per core that we support with a 2GB RAM buffer vs. the 16MB default, using SerialMergeScheduler would be a better choice.
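For anyone who wants to reproduce this, the setup jpountz describes can be sketched with Lucene's public API. This is a minimal sketch assuming Lucene 9.x; the index path and the elided document loop are illustrative, not the actual KnnGraphTester harness:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SerialMergeScheduler;
import org.apache.lucene.store.FSDirectory;

public class FlushVsMergeBench {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    // Run merges on the indexing thread so merge cost shows up in wall-clock
    // time instead of hiding in ConcurrentMergeScheduler's background threads.
    config.setMergeScheduler(new SerialMergeScheduler());
    // Compare the 16 MB default against a large buffer (e.g. 1994.0 MB).
    config.setRAMBufferSizeMB(IndexWriterConfig.DEFAULT_RAM_BUFFER_SIZE_MB);
    try (IndexWriter writer =
        new IndexWriter(FSDirectory.open(Paths.get("bench-index")), config)) {
      // ... add documents with a KnnFloatVectorField here and time the run ...
      writer.commit();
    }
  }
}
```

With SerialMergeScheduler, the per-core docs/sec number includes merge work, which is what makes the two RAM buffer settings comparable.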

mkleen added a commit to crate/crate that referenced this issue Sep 20, 2023
This increases the limit of float vectors from 1024 to 2048.
The previous limit was based on what Lucene provided, but current
discussions and benchmarks indicate that 2048 will also
be OK, and the next Lucene version will have 2048 as the default:

apache/lucene#11507 (comment)
apache/lucene#11507 (comment)
@sylph-eu

The last comment is already a couple of months old, so could you please clarify the status of this initiative? Is there a chance it's going to be merged? Is there any blocker or action item that prevents it from being merged?

The context of my inquiry is that Lucene-based solutions (e.g. OpenSearch) are commonly deployed within enterprises, which makes them good candidates for experimenting with vector search and commercial LLM offerings without deploying and maintaining specialized technologies. A max dimensionality of 1024, however, imposes certain restrictions (similar thoughts are here: https://arxiv.org/abs/2308.14963).

@uschindler
Contributor

uschindler commented Sep 25, 2023

Hi,
actually this issue is already resolved, although the DEFAULT did not change (and won't change, due to performance risks); see here: #12436 - this PR allows users of Lucene to raise the limit (at least for the HNSW codec) at the codec level.

To implement this (at your own risk), create your own KnnVectorsFormat and let it return a different number from getMaxDimensions(). Then construct your own codec from it and index your data.

You can do this with Lucene 9.8+.

OpenSearch, Elasticsearch, and Solr will have custom limits in their code (based on this approach).
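A minimal sketch of that approach, assuming the Lucene 9.8+ codec API; the class name and the 4096 limit are illustrative choices, not Lucene defaults, and this carries the same "at your own risk" caveat as above:

```java
// Note: to read such an index back, a named format like this must also be
// registered via SPI (META-INF/services/org.apache.lucene.codecs.KnnVectorsFormat).
import java.io.IOException;

import org.apache.lucene.codecs.KnnVectorsFormat;
import org.apache.lucene.codecs.KnnVectorsReader;
import org.apache.lucene.codecs.KnnVectorsWriter;
import org.apache.lucene.codecs.lucene95.Lucene95HnswVectorsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;

public class HighDimVectorsFormat extends KnnVectorsFormat {
  // Delegate actual reading and writing to the default HNSW format.
  private final KnnVectorsFormat delegate = new Lucene95HnswVectorsFormat();

  public HighDimVectorsFormat() {
    super("HighDimVectorsFormat");
  }

  @Override
  public KnnVectorsWriter fieldsWriter(SegmentWriteState state) throws IOException {
    return delegate.fieldsWriter(state);
  }

  @Override
  public KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException {
    return delegate.fieldsReader(state);
  }

  @Override
  public int getMaxDimensions(String fieldName) {
    return 4096; // raise the per-field dimension limit beyond the 1024 default
  }
}
```

To use it at indexing time, construct a codec from it, e.g. override Lucene95Codec#getKnnVectorsFormatForField to return this format, and pass that codec to IndexWriterConfig#setCodec.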

@uschindler
Contributor

@mayya-sharipova: Should we close this issue or are there any plans to also change the default maximum? I don't think so.

@MarcusSorealheis
Contributor

I think we should close it for sure.

@mayya-sharipova
Contributor

Yes, thanks for the reminder. Now that the Codec is responsible for managing dims, we can close it.
