add iterator over documents in docstore #1044
Conversation
When profiling, I saw that around 8% of the time in a merge was spent in look-ups into the skip index. Since the documents in the merge case are read sequentially, we can replace the random access with an iterator over the documents.

Merge time on a sorted index, before/after: 24s / 19s
Merge time on an unsorted index, before/after: 15s / 13.5s

So we can expect 10-20% faster merges.

This iterator is also important if we add sorting based on a field in the documents.
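Roughly, the change swaps one skip-index lookup per document for a single sequential pass over the doc store. Here is a minimal, self-contained sketch of that idea; the `DocStore` type and its method bodies are stand-ins for illustration, not tantivy's actual `StoreReader` internals:

```rust
// Stand-in doc store: raw documents addressable by doc id.
struct DocStore {
    raw_docs: Vec<Vec<u8>>,
}

impl DocStore {
    // Random access, as in the old merge loop: one lookup per doc id.
    fn get_raw(&self, doc_id: usize) -> Option<&[u8]> {
        self.raw_docs.get(doc_id).map(|d| d.as_slice())
    }

    // Sequential access, as the new iterator provides: one pass, deleted docs skipped.
    fn iter_raw<'a>(&'a self, delete_bitset: &'a [bool]) -> impl Iterator<Item = &'a [u8]> + 'a {
        self.raw_docs
            .iter()
            .enumerate()
            .filter(move |(doc_id, _)| !delete_bitset[*doc_id])
            .map(|(_, raw)| raw.as_slice())
    }
}

fn main() {
    let store = DocStore {
        raw_docs: vec![b"doc0".to_vec(), b"doc1".to_vec(), b"doc2".to_vec()],
    };
    let delete_bitset = vec![false, true, false];

    // Before: random access per surviving doc id.
    let before: Vec<&[u8]> = (0..store.raw_docs.len())
        .filter(|&doc_id| !delete_bitset[doc_id])
        .filter_map(|doc_id| store.get_raw(doc_id))
        .collect();

    // After: a single sequential pass.
    let after: Vec<&[u8]> = store.iter_raw(&delete_bitset).collect();

    assert_eq!(before, after);
}
```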
let store_reader = &store_readers[reader_with_ordinal.ordinal as usize];
let raw_doc = store_reader.get_raw(*old_doc_id)?;
let store_reader = &mut document_iterators[reader_with_ordinal.ordinal as usize];
let raw_doc = store_reader.next().expect(&format!(
can we return an error here? (the error message is great.)
a) this is great.
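To illustrate the suggestion of returning an error instead of panicking, a hedged sketch; the helper name, the iterator item type, and the use of `std::io::Error` are illustrative assumptions, and the real merge code would use tantivy's own error type:

```rust
use std::io;

// Hypothetical helper: pulls the next raw document from the store iterator and
// turns an unexpected end of the iterator into an error instead of a panic.
fn next_raw_doc<'a>(
    iter: &mut impl Iterator<Item = &'a [u8]>,
    old_doc_id: u32,
) -> io::Result<&'a [u8]> {
    iter.next().ok_or_else(|| {
        io::Error::new(
            io::ErrorKind::UnexpectedEof,
            format!("doc store iterator ended before doc id {}", old_doc_id),
        )
    })
}
```

At the call site, `let raw_doc = next_raw_doc(&mut store_reader, old_doc_id)?;` would keep the descriptive message while letting the merge return a `Result` rather than panicking.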
src/store/reader.rs
Outdated
/// Iterator over all RawDocuments in the order in which they are stored in the doc store.
/// Use this if you want to extract all Documents from the doc store.
/// The delete_bitset has to be forwarded from the `SegmentReader`, or the results may be wrong.
pub fn iter_raw<'a: 'b, 'b>(
this should probably be pub(crate)?
let mut num_skipped = 0;
(0..last_docid)
    .filter_map(move |doc_id| {
        // filter_map is only used to resolve lifetime issues between the two closures on
I am incapable of reading this :)
But I think this is ok, considering it is not exposed outside of this function.
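The pattern under discussion, sketched with illustrative names (this is only the shape of the workaround, not the actual reader code): a single `filter_map` closure both skips deleted doc ids and owns the mutable counter, which avoids splitting the work across two closures that would need to borrow the same state.

```rust
// Yields (doc_id, num_docs_skipped_so_far) for every doc id that is not deleted.
fn alive_doc_ids(
    last_docid: u32,
    is_deleted: impl Fn(u32) -> bool,
) -> impl Iterator<Item = (u32, u32)> {
    let mut num_skipped = 0u32;
    (0..last_docid).filter_map(move |doc_id| {
        if is_deleted(doc_id) {
            // The mutable counter lives in the same closure that does the filtering.
            num_skipped += 1;
            None
        } else {
            Some((doc_id, num_skipped))
        }
    })
}
```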