Add file skipping with Delta #253

osopardo1 · 2024-01-08T13:14:26Z

Right now, the filtering works as follows:

Apply Delta Filters for Staging Area files.
Apply Qbeast Filters on all the rest.
Union both sets of files.

We should apply Delta (or any other underlying format) file filtering to the whole set of files, not just the subset belonging to the Staging Area.

This is the piece of code:

  override def listFiles(...) = {

    // FILTER FILES FROM QBEAST
    val qbeastFileStats = qbeastMatchingFiles(partitionFilters, dataFilters)
    // FILTER FILES FROM DELTA
    val stagingFileStats = stagingFiles(partitionFilters, dataFilters)
    // JOIN BOTH FILTERED FILES
    val fileStats = qbeastFileStats ++ stagingFileStats

And the code for the stagingFiles method:

  /**
   * Collect matching staging files from _delta_log and convert them into FileStatuses.
   * The output is merged with those built from QbeastBlocks.
   * @return
   */
  private def stagingFiles(
      partitionFilters: Seq[Expression],
      dataFilters: Seq[Expression]): Seq[FileStatus] = {

    index
      .matchingFiles(partitionFilters, dataFilters)
      .filter(isStagingFile)
      .map { f =>
        new FileStatus(
          /* length */ f.size,
          /* isDir */ false,
          /* blockReplication */ 0,
          /* blockSize */ 1,
          /* modificationTime */ f.modificationTime,
          absolutePath(f.path))
      }
  }

This method should not apply .filter(isStagingFile); it should return every file that passes the underlying format's filtering. The method's name probably needs updating as well.
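The difference between the current and the proposed behaviour can be sketched with simplified stand-ins. This is only an illustration under stated assumptions: `FileEntry`, `matchingFiles` (a toy min/max range check), and the name `deltaMatchingFiles` are hypothetical and do not come from the code base, which works with Delta's file entries and Hadoop's `FileStatus`.

```scala
// Hypothetical, simplified stand-in for a file entry with min/max column stats.
final case class FileEntry(path: String, min: Int, max: Int, staging: Boolean)

object FileSkippingSketch {

  // Toy model of format-level data skipping: keep files whose [min, max]
  // range overlaps the queried range [lo, hi].
  def matchingFiles(files: Seq[FileEntry], lo: Int, hi: Int): Seq[FileEntry] =
    files.filter(f => f.max >= lo && f.min <= hi)

  // Hypothetical predicate; here it just reads a flag.
  def isStagingFile(f: FileEntry): Boolean = f.staging

  // Current behaviour: skipping is applied, but only staging files are kept.
  def stagingFiles(files: Seq[FileEntry], lo: Int, hi: Int): Seq[FileEntry] =
    matchingFiles(files, lo, hi).filter(isStagingFile)

  // Proposed behaviour: keep every file that survives format-level skipping.
  def deltaMatchingFiles(files: Seq[FileEntry], lo: Int, hi: Int): Seq[FileEntry] =
    matchingFiles(files, lo, hi)
}
```

In this model the only change is dropping the staging filter, so indexed files also benefit from the min/max check.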

osopardo1 · 2024-01-09T07:08:30Z

I have a question regarding this task:

If we filter all the files with Delta, does it still make sense to filter again with Qbeast by min/max?

For Sampling, it is of course necessary. But to avoid reading replicated files under a WHERE predicate, a single filter after Delta Data Skipping is enough. There is no need to rebuild the index structure from scratch.

What do you think? @cugni @Jiaweihu08 @alexeiakimov

osopardo1 · 2024-01-09T07:08:49Z

And can you @alexeiakimov take care of this task?

Thank you!

osopardo1 · 2024-01-09T10:11:40Z

After discussion, agreed on:

When applying WHERE file filtering, let min/max Delta Skipping filter the set of files.
When applying SAMPLING, join both sets of files. This allows Qbeast tree navigation to filter more files.

alexeiakimov · 2024-01-09T12:46:28Z

The target version for this feature is 1.0.0. Supporting it in both 0.x and 1.x is too complex because the Qbeast metadata format changed in version 1.0.0.

alexeiakimov · 2024-01-09T12:53:02Z

The approach with more details is the following. There are two main cases:

The query has a SAMPLING clause
The query has no SAMPLING clause

If the query defines a SAMPLING clause, apply Delta filtering to the staging area and Qbeast filtering to the normal revisions. It is not necessary to apply Delta filtering to the files returned by Qbeast filtering.

If the query does not define a SAMPLING clause, apply Delta filtering to all the files and then exclude the replicated files. The Qbeast index is not used in this case.
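The two cases above can be sketched as a single dispatch function. This is a minimal sketch, not the actual implementation: `DataFile`, `FileFilter`, and `listFiles` with these parameters are hypothetical names, and the Delta and Qbeast filters are passed in as plain functions so the example stays self-contained.

```scala
// Hypothetical, simplified file descriptor used only for this sketch.
final case class DataFile(path: String, staging: Boolean, replicated: Boolean)

object ListFilesSketch {

  // A filter takes candidate files and returns the ones that may match the query.
  type FileFilter = Seq[DataFile] => Seq[DataFile]

  def listFiles(
      files: Seq[DataFile],
      hasSampling: Boolean,
      deltaFilter: FileFilter,
      qbeastFilter: FileFilter): Seq[DataFile] = {
    if (hasSampling) {
      // SAMPLING: Delta filtering for the staging area, Qbeast filtering for
      // the indexed revisions; the two results are joined without re-filtering.
      val (staging, indexed) = files.partition(_.staging)
      deltaFilter(staging) ++ qbeastFilter(indexed)
    } else {
      // No SAMPLING: Delta filtering over all files, then drop replicated files.
      deltaFilter(files).filterNot(_.replicated)
    }
  }
}
```

Passing the filters as functions keeps the sketch independent of the real Delta and Qbeast APIs; in the real code they would be backed by the Delta log and the Qbeast index respectively.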

This is a rework of the query implementation. This PR always uses the internal Delta query engine, except for queries with a sampling clause. For the latter, the Qbeast engine is used.

* #253 Initial import of the new implementation of FileIndex
* #253 Logging in DefaultFileIndex and SamplingListFileStrategy are improved
* #253 A test for DefaultFileIndex
* #253 OTreeIndex and test are removed
* #253 EmptyFile index is made serializable

osopardo1 added the type: bug Something isn't working label Jan 8, 2024

osopardo1 assigned alexeiakimov Jan 9, 2024

alexeiakimov added type: enhancement Improvement of existing feature or code priority: normal This issue has normal priority status: in-progress This issue is in progress and removed type: bug Something isn't working labels Jan 9, 2024

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Jan 12, 2024

Qbeast-io#253 Initial import of the new implementation of FileIndex

9248619

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Jan 12, 2024

Qbeast-io#253 Logging in DefaultFileIndex and SamplingListFileStrateg…

ba01e5d

…y are improved

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Jan 12, 2024

Qbeast-io#253 A test for DefaultFileIndex

1888c43

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Jan 12, 2024

Qbeast-io#253 OTreeIndex and test are removed

8d18484

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Jan 12, 2024

Qbeast-io#253 EmptyFile index is made serializable

4536cc2

osopardo1 added the 1.0.0 label Feb 13, 2024

osopardo1 removed priority: normal This issue has normal priority status: in-progress This issue is in progress labels Mar 13, 2024

fpj closed this as completed Mar 22, 2024
