add iterator over documents in docstore #1044
Conversation
When profiling, I saw that around 8% of the time in a merge was spent in look-ups into the skip index. Since the documents in the merge case are read sequentially, we can replace the random access with an iterator over the documents.

Merge time on a sorted index, before/after: 24s / 19s
Merge time on an unsorted index, before/after: 15s / 13.5s

So we can expect 10-20% faster merges.

This iterator is also important if we add sorting based on a field in the documents.
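Roughly, the change swaps one skip-index lookup per document for a single sequential pass over the doc store. Here is a minimal, self-contained sketch of that idea; the `DocStore` type and its method bodies are stand-ins for illustration, not tantivy's actual `StoreReader` internals:

```rust
// Stand-in doc store: raw documents addressable by doc id.
struct DocStore {
    raw_docs: Vec<Vec<u8>>,
}

impl DocStore {
    // Random access, as in the old merge loop: one lookup per doc id.
    fn get_raw(&self, doc_id: usize) -> Option<&[u8]> {
        self.raw_docs.get(doc_id).map(|d| d.as_slice())
    }

    // Sequential access, as the new iterator provides: one pass, deleted docs skipped.
    fn iter_raw<'a>(&'a self, delete_bitset: &'a [bool]) -> impl Iterator<Item = &'a [u8]> + 'a {
        self.raw_docs
            .iter()
            .enumerate()
            .filter(move |(doc_id, _)| !delete_bitset[*doc_id])
            .map(|(_, raw)| raw.as_slice())
    }
}

fn main() {
    let store = DocStore {
        raw_docs: vec![b"doc0".to_vec(), b"doc1".to_vec(), b"doc2".to_vec()],
    };
    let delete_bitset = vec![false, true, false];

    // Before: random access per surviving doc id.
    let before: Vec<&[u8]> = (0..store.raw_docs.len())
        .filter(|&doc_id| !delete_bitset[doc_id])
        .filter_map(|doc_id| store.get_raw(doc_id))
        .collect();

    // After: a single sequential pass.
    let after: Vec<&[u8]> = store.iter_raw(&delete_bitset).collect();

    assert_eq!(before, after);
}
```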
let store_reader = &store_readers[reader_with_ordinal.ordinal as usize];
let raw_doc = store_reader.get_raw(*old_doc_id)?;
let store_reader = &mut document_iterators[reader_with_ordinal.ordinal as usize];
let raw_doc = store_reader.next().expect(&format!(
can we return an error here? (the error message is great.)
a) this is great.
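To illustrate the suggestion of returning an error instead of panicking, a hedged sketch; the helper name, the iterator item type, and the use of `std::io::Error` are illustrative assumptions, and the real merge code would use tantivy's own error type:

```rust
use std::io;

// Hypothetical helper: pulls the next raw document from the store iterator and
// turns an unexpected end of the iterator into an error instead of a panic.
fn next_raw_doc<'a>(
    iter: &mut impl Iterator<Item = &'a [u8]>,
    old_doc_id: u32,
) -> io::Result<&'a [u8]> {
    iter.next().ok_or_else(|| {
        io::Error::new(
            io::ErrorKind::UnexpectedEof,
            format!("doc store iterator ended before doc id {}", old_doc_id),
        )
    })
}
```

At the call site, `let raw_doc = next_raw_doc(&mut store_reader, old_doc_id)?;` would keep the descriptive message while letting the merge return a `Result` rather than panicking.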
src/store/reader.rs
Outdated
/// Iterator over all RawDocuments in the order in which they are stored in the doc store.
/// Use this if you want to extract all Documents from the doc store.
/// The delete_bitset has to be forwarded from the `SegmentReader`, or the results may be wrong.
pub fn iter_raw<'a: 'b, 'b>(
this should probably be pub(crate)?
let mut num_skipped = 0;
(0..last_docid)
    .filter_map(move |doc_id| {
        // filter_map is only used to resolve lifetime issues between the two closures on
I am incapable of reading this :)
But I think this is ok, considering it is not exposed outside of this function.
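The pattern under discussion, sketched with illustrative names (this is only the shape of the workaround, not the actual reader code): a single `filter_map` closure both skips deleted doc ids and owns the mutable counter, which avoids splitting the work across two closures that would need to borrow the same state.

```rust
// Yields (doc_id, num_docs_skipped_so_far) for every doc id that is not deleted.
fn alive_doc_ids(
    last_docid: u32,
    is_deleted: impl Fn(u32) -> bool,
) -> impl Iterator<Item = (u32, u32)> {
    let mut num_skipped = 0u32;
    (0..last_docid).filter_map(move |doc_id| {
        if is_deleted(doc_id) {
            // The mutable counter lives in the same closure that does the filtering.
            num_skipped += 1;
            None
        } else {
            Some((doc_id, num_skipped))
        }
    })
}
```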