-
-
Notifications
You must be signed in to change notification settings - Fork 670
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Planned removal of index sorting in 0.23.0 #2352
Comments
I was planning on using index sorting for our full text search at Convex to ensure that results are sorted in a deterministic order that's independent of how documents are split across segments. We're an OLTP database that lets users define text search indexes on their existing tables, much like Postgres with its GIN-based indexes. Indexes are fully transactional, with older data stored across multiple Tantivy segments and recent data stored in a multiversioned in-memory index. Furthermore, we want read transactions in our system to be deterministic so we can cache them automatically and precisely detect when they're invalidated. So, we'd like our text search queries (which are just TopK queries by BM25 of an OR of a set of search terms) to be deterministic, even if there's been an index rebuild since the query was executed. Elsewhere in our system, our rows have a natural sort order based on their system-assigned ID and insertion time. Then, during query execution, ideally each segment would return its top K results based on We've already done some of the work here (e.g. adjusting BM25 statistics to be deterministic as of the transaction's timestamp: #1855, Convex Future featuresA few users have asked us for sort orders other than BM25 for text search results (e.g. insertion time), and having them pick a single field in advance would work well for our use case. Furthermore, letting users specify range queries on this sort field and paginate over it would be very helpful. Since we're an OLTP database, we store our indexes on SSDs and want queries to execute quickly (<100ms in total) with bounded IO. So, turning these range queries into a binary search rather than having to do a full aggregation and sort is highly desirable. Even further in the future, I'd love to use Tantivy for implementing more types of OLTP inverted index algorithms in our system:
In these cases, I think having a well-defined sort order and efficient pagination would be important since these are structured application queries, not user-facing ones. I'm also pretty rusty on the Block WAND paper, but if scoring is simpler in this setting (e.g. an array index scores by the simple count of matches), many documents may have the same score, and being able to use ConclusionI'm curious if any of this resonates with you all -- I'm definitely inexperienced with this domain, and let me know if I'm missing anything or not thinking about things in the right way. We've also benefitted greatly from using Tantivy within Convex, so if there are ways I can help implement some of these optimizations (e.g. using index order for range queries), let me know! |
Index sorting is kinda a misnomer, the index is not sorted, but the segments. This feature will not really provide deterministic search results. We already support JSON as a field type. I'm not sure how index sorting would affect it in any way specifically. Geo-spatial would probably have an BKD tree as index independent of the sorting. Index sorting has specific prerequisites and some performance cost during indexing. To utilize it we would need to add special code paths. So it is pretty niche. The time may be more effective spent by optimizing the more general case. |
Right, I think that's my understanding -- for an index sorted by let settings = IndexSettings {
sort_by_field: Some(IndexSortByField {
field: "intval".to_string(),
order: Order::Asc,
}),
..Default::default()
}; each segment within the index will assign In our system, we represent the state We query the top The concern I have with determinism is when we do a compaction and merge My thought was to use index sorting to ensure that Workarounds
Yeah, I think this should work for our use case. We already want to control the threading and memory consumption in our system, so we'd probably use the lower level tantivy/src/indexer/segment_writer.rs Line 399 in 0e9fced
Then, if we wanted to implement range queries on, say, insertion time ourselves, we'd store insertion time in a fast field, perform a binary search on the fast field reader to get a Let me know if that makes sense. |
A DocId is the ordinal of the Document in the segment. Sorting changes the order of the documents, it does not assign values to DocIds explicitly.
You assume that search results are stable between merges with index sorting, only by the implicit ordering. I'm not sure this is the case today, or will stay like this if that's the case.
You can already control threading and memory consumption with
I've been thinking if we should flag fast fields as almost sorted during creation (e.g. almost sorted in a range of 100 values) and then use that information to do a binary_search + 100 values scan. |
Index sorting provides currently little to no benefit, and there's some confusion about it from tantivy users on how to use it. (e.g. that it is a way to get sorted search results, which is not really the case)
It also adds considerable complexity to the indexing and merging parts of the code.
On the potential pros:
There will also be a deprecation warning added to
0.22.0
with a link to this issue.If you have a use case which requires index sorting, please explain your use case here.
Related #2335
The text was updated successfully, but these errors were encountered: