Add file skipping with Delta #253
Comments
I have a question regarding this task: if we filter all the files with Delta, does it still make sense to filter again with Qbeast by min/max? For sampling it is of course necessary. But to avoid replicated files in a WHERE predicate, a single filter applied after Delta Data Skipping is enough. There is no need to rebuild the index structure from scratch. What do you think? @cugni @Jiaweihu08 @alexeiakimov
And can you @alexeiakimov take care of this task? Thank you!
After discussion, agreed on:
The target version for this feature is 1.0.0. Supporting it in both 0.x and 1.x is too complex, because the Qbeast metadata format changed in version 1.0.0.
The approach in more detail is the following. There are two main cases:
* If the query defines the SAMPLING clause, apply Delta filtering to the staging area and Qbeast filtering to the normal revisions. It is not necessary to apply Delta filtering to the files returned by Qbeast filtering.
* If the query does not define the SAMPLING clause, apply Delta filtering to all the files, and then exclude the replicated files. The Qbeast index is not used in this case.
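The two cases above can be sketched as a small selection function. This is only an illustrative model in Python, not the project's implementation: `FileEntry`, `select_files`, `delta_keeps` and `qbeast_keeps` are hypothetical names, and the two predicates stand in for Delta data skipping and Qbeast index filtering respectively.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class FileEntry:
    """Hypothetical stand-in for a data file tracked in the table log."""
    path: str
    staging: bool     # True if the file belongs to the staging area
    replicated: bool  # True if the file holds replicated data


# A predicate decides whether a file may contain rows matching the query.
Predicate = Callable[[FileEntry], bool]


def select_files(
    files: List[FileEntry],
    has_sampling: bool,
    delta_keeps: Predicate,   # assumption: models Delta data skipping
    qbeast_keeps: Predicate,  # assumption: models Qbeast index filtering
) -> List[FileEntry]:
    if has_sampling:
        # Case 1: Delta filtering for the staging area only,
        # Qbeast filtering for the normal (indexed) revisions.
        staging = [f for f in files if f.staging and delta_keeps(f)]
        indexed = [f for f in files if not f.staging and qbeast_keeps(f)]
        return staging + indexed
    # Case 2: Delta filtering for every file, then drop replicated files.
    # The Qbeast index is not consulted at all.
    return [f for f in files if delta_keeps(f) and not f.replicated]
```

Note that in the non-sampling branch `qbeast_keeps` is never called, which mirrors the point that the index does not need to be rebuilt for plain WHERE predicates.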
This is a rework of the query implementation. This PR always uses the internal Delta query engine, except for queries with a sampling clause. For the latter, the Qbeast engine is used.
* #253 Initial import of the new implementation of FileIndex
* #253 Logging in DefaultFileIndex and SamplingListFileStrategy are improved
* #253 A test for DefaultFileIndex
* #253 OTreeIndex and test are removed
* #253 EmptyFile index is made serializable
Right now, the filtering works as follows:
We should apply Delta (or any other underlying format) file filtering to the full set of files, not just the subset belonging to the Staging Area.
This is the piece of code:
And the code for the `stagingFiles` method:

This should not apply `.filter(isStagingFile)`; it should return the full set of files filtered by the underlying format. The method's name probably needs updating as well.
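As a rough illustration of the proposed change, the sketch below contrasts a method that restricts the format-level filter to staging files with one that returns every file the format-level filter keeps. It is written in Python purely as a model. Only the names `stagingFiles` and `isStagingFile` come from the issue; the `staging` flag, the `format_keeps` predicate and the new name `format_filtered_files` are assumptions.

```python
from typing import Callable, Dict, List

# Hypothetical file record: a dict with at least "path" and "staging" keys.
File = Dict[str, object]


def is_staging_file(f: File) -> bool:
    # Assumption: staging membership is modelled as a boolean flag.
    return bool(f["staging"])


def staging_files(files: List[File], format_keeps: Callable[[File], bool]) -> List[File]:
    """Current behaviour as described: format filtering, then staging files only."""
    return [f for f in files if format_keeps(f) and is_staging_file(f)]


def format_filtered_files(files: List[File], format_keeps: Callable[[File], bool]) -> List[File]:
    """Proposed behaviour: every file kept by the underlying format's filter."""
    return [f for f in files if format_keeps(f)]
```

The only difference is the dropped staging check, which is why a rename is suggested: the result is no longer limited to the staging area.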