Add file skipping with Delta #253

Closed
osopardo1 opened this issue Jan 8, 2024 · 5 comments
Labels
type: enhancement Improvement of existing feature or code

Comments

@osopardo1
Member

Right now, the filtering works as follows:

  1. Apply Delta Filters for Staging Area files.
  2. Apply Qbeast Filters on all the rest.
  3. Union both sets of files.

We should apply Delta (or any other underlying format) file filtering to the whole set of files, not just the subset belonging to the Staging Area.

This is the piece of code:

  override def listFiles(...) = {

    // FILTER FILES FROM QBEAST
    val qbeastFileStats = qbeastMatchingFiles(partitionFilters, dataFilters)
    // FILTER FILES FROM DELTA
    val stagingFileStats = stagingFiles(partitionFilters, dataFilters)
    // JOIN BOTH FILTERED FILES
    val fileStats = qbeastFileStats ++ stagingFileStats

And the code for the stagingFiles method:

  /**
   * Collect matching staging files from _delta_log and convert them into FileStatuses.
   * The output is merged with those built from QbeastBlocks.
   * @return
   */
  private def stagingFiles(
      partitionFilters: Seq[Expression],
      dataFilters: Seq[Expression]): Seq[FileStatus] = {

    index
      .matchingFiles(partitionFilters, dataFilters)
      .filter(isStagingFile)
      .map { f =>
        new FileStatus(
          /* length */ f.size,
          /* isDir */ false,
          /* blockReplication */ 0,
          /* blockSize */ 1,
          /* modificationTime */ f.modificationTime,
          absolutePath(f.path))
      }
  }

This method should not apply .filter(isStagingFile); it should return the whole set of files filtered by the underlying format. The method's name probably needs to be updated as well.
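To illustrate the difference, here is a minimal, self-contained sketch. It does not use the real Delta or Qbeast classes: LoggedFile, matchingFiles and deltaMatchingFiles are hypothetical stand-ins, and the min/max check on a single integer column is an assumed simplification of Delta data skipping.

```scala
// Hypothetical, simplified stand-in for a file entry in the Delta log.
// This is not the real Delta AddFile or the qbeast-spark API.
final case class LoggedFile(
    path: String,
    size: Long,
    minValue: Int,
    maxValue: Int,
    staging: Boolean)

object DeltaSkippingSketch {

  // Assumed model of min/max data skipping: keep a file only if its
  // value range can contain the value requested by the predicate.
  def matchingFiles(files: Seq[LoggedFile], value: Int): Seq[LoggedFile] =
    files.filter(f => f.minValue <= value && value <= f.maxValue)

  // Current behaviour: the skipping result is narrowed to the staging area.
  def stagingFiles(files: Seq[LoggedFile], value: Int): Seq[LoggedFile] =
    matchingFiles(files, value).filter(_.staging)

  // Proposed behaviour: every file that survives skipping is returned.
  // The name is only a suggestion for the rename mentioned above.
  def deltaMatchingFiles(files: Seq[LoggedFile], value: Int): Seq[LoggedFile] =
    matchingFiles(files, value)

  // Sample data: one staging file and two indexed files.
  val files: Seq[LoggedFile] = Seq(
    LoggedFile("staging-0.parquet", 10L, 0, 9, staging = true),
    LoggedFile("indexed-0.parquet", 20L, 5, 15, staging = false),
    LoggedFile("indexed-1.parquet", 30L, 20, 30, staging = false))
}
```

With this sample data, a predicate on the value 7 matches only the staging file in the current version, while the proposed version also returns indexed-0.parquet, whose range covers 7.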

@osopardo1 osopardo1 added the type: bug Something isn't working label Jan 8, 2024
@osopardo1
Member Author

I have a question regarding this task:

If we filter all the files with Delta, does it still make sense to filter again with Qbeast by min/max?

For Sampling it is of course necessary. But to avoid replicated files in a WHERE predicate, only a filter is needed after applying Delta Data Skipping. There is no need to rebuild the index structure from scratch.

What do you think? @cugni @Jiaweihu08 @alexeiakimov

@osopardo1
Member Author

And can you @alexeiakimov take care of this task?

Thank you!

@osopardo1
Member Author

After discussion, we agreed on the following:

  • When applying WHERE file filtering, let min/max Delta Skipping filter the set of files.
  • When applying SAMPLING, join both sets of files. This allows Qbeast tree navigation to filter more files.

@alexeiakimov alexeiakimov added type: enhancement Improvement of existing feature or code priority: normal This issue has normal priority status: in-progress This issue is in progress and removed type: bug Something isn't working labels Jan 9, 2024
@alexeiakimov
Contributor

The target version for this feature is 1.0.0. Supporting it in both 0.x and 1.x would be too complex, because the Qbeast metadata format was changed in version 1.0.0.

@alexeiakimov
Contributor

The approach, in more detail, is the following. There are two main cases:

  1. The query has a SAMPLING clause
  2. The query has no SAMPLING clause

If the query defines a SAMPLING clause, then apply Delta filtering to the staging area and Qbeast filtering to the normal revisions. It is not necessary to apply Delta filtering to the files returned by Qbeast filtering.

If the query does not define a SAMPLING clause, then apply Delta filtering to all the files, and then exclude the replicated files. The Qbeast index is not used in this case.
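The two cases can be sketched roughly as follows. This is a self-contained toy model, not the actual implementation: TableFile, Query, deltaSkipping and qbeastFiltering are hypothetical names, the minWeight check is only an assumed stand-in for Qbeast tree navigation, and replicated files are not treated specially in the sampling branch because the discussion does not cover that.

```scala
// Hypothetical simplified model; none of these names come from
// qbeast-spark or Delta.
final case class TableFile(
    path: String,
    minValue: Int,
    maxValue: Int,
    staging: Boolean,
    replicated: Boolean,
    minWeight: Double) // assumed stand-in for Qbeast sampling metadata

// A predicate on a single value plus an optional sampling fraction.
final case class Query(value: Int, sampleFraction: Option[Double])

object ListFilesSketch {

  // Assumed model of Delta min/max data skipping.
  private def deltaSkipping(files: Seq[TableFile], value: Int): Seq[TableFile] =
    files.filter(f => f.minValue <= value && value <= f.maxValue)

  // Assumed stand-in for Qbeast filtering: range check plus a weight cut-off.
  private def qbeastFiltering(
      files: Seq[TableFile],
      value: Int,
      fraction: Double): Seq[TableFile] =
    files.filter(f =>
      f.minValue <= value && value <= f.maxValue && f.minWeight <= fraction)

  def listFiles(files: Seq[TableFile], query: Query): Seq[TableFile] =
    query.sampleFraction match {
      case Some(fraction) =>
        // Case 1: Delta filtering on the staging area, Qbeast filtering on the rest.
        val (staging, indexed) = files.partition(_.staging)
        qbeastFiltering(indexed, query.value, fraction) ++
          deltaSkipping(staging, query.value)
      case None =>
        // Case 2: Delta filtering on all files, then drop replicated files.
        deltaSkipping(files, query.value).filterNot(_.replicated)
    }

  // Sample data: one staging file and four indexed files.
  val files: Seq[TableFile] = Seq(
    TableFile("staging-0", 0, 9, staging = true, replicated = false, minWeight = 0.0),
    TableFile("root", 0, 30, staging = false, replicated = false, minWeight = 0.0),
    TableFile("root-replicated", 0, 30, staging = false, replicated = true, minWeight = 0.0),
    TableFile("child", 5, 15, staging = false, replicated = false, minWeight = 0.5),
    TableFile("far", 20, 30, staging = false, replicated = false, minWeight = 0.5))
}
```

In this model the branch is chosen purely by the presence of a sampling fraction, which mirrors the rule that the Qbeast index is not used when the query has no SAMPLING clause.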

alexeiakimov added 5 commits to alexeiakimov/qbeast-spark that referenced this issue Jan 12, 2024
@osopardo1 osopardo1 removed priority: normal This issue has normal priority status: in-progress This issue is in progress labels Mar 13, 2024
fpj pushed a commit that referenced this issue Mar 22, 2024
This is a rework of the query implementation. This PR always uses the internal Delta query engine,
except for queries with a sampling clause. For the latter, the Qbeast engine is used.



* #253 Initial import of the new implementation of FileIndex
* #253 Logging in DefaultFileIndex and SamplingListFileStrategy are improved
* #253 A test for DefaultFileIndex
* #253 OTreeIndex and test are removed
* #253 EmptyFile index is made serializable
@fpj fpj closed this as completed Mar 22, 2024