Rollup compaction #207

alexeiakimov · 2023-08-03T07:24:17Z

What went wrong?

The current version creates too many small files while indexing or replicating data. These small files degrade read performance.

How to reproduce?

Index a data frame whose data is widely dispersed, e.g. the records are distributed uniformly over some range. Optimize the index several times to push the data down the index tree. Every time a part of the data is saved to a cube, that part is stored in a separate file, so after several writes or optimizations in which the data falls into different cubes, the index contains a lot of small files.

Code that triggered the bug, or steps to reproduce:

There is no particular code to blame; the problem is that every write to a cube creates a separate file. This is a problem of the algorithm, not a defect of the implementation.
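
As a rough illustration of the effect (a toy model, not the Qbeast indexing code), the sketch below assumes each write scatters uniformly distributed rows over a fixed set of cubes and stores every touched cube in its own file. The function name `simulate_writes` and its parameters are made up for this example.

```python
import random


def simulate_writes(num_writes, rows_per_write, num_cubes, seed=0):
    """Return the row count of every file produced by a series of writes.

    Toy assumption: a row lands in a uniformly random cube, and each
    (write, cube) pair that received at least one row becomes one file.
    """
    rng = random.Random(seed)
    file_sizes = []
    for _ in range(num_writes):
        rows_per_cube = {}
        for _ in range(rows_per_write):
            cube = rng.randrange(num_cubes)
            rows_per_cube[cube] = rows_per_cube.get(cube, 0) + 1
        # One file per cube touched by this write.
        file_sizes.extend(rows_per_cube.values())
    return file_sizes


file_sizes = simulate_writes(num_writes=10, rows_per_write=1000, num_cubes=64)
print(len(file_sizes), "files,", sum(file_sizes) / len(file_sizes), "rows per file on average")
```

With these numbers the file count grows with the number of writes times the number of touched cubes, while the average file shrinks accordingly.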

How to fix?

The proposal is to use rollup compaction while writing the data, joining the blocks of closely related cubes into one physical file, so that when the data is queried there is a high probability that most of those blocks contribute their data to the query. This approach should reduce the number of small files without a significant impact on query performance. The necessary details can be found in the document https://docs.google.com/document/d/1CY5Gzx46fuatwkAyPRKIRSS7DeD4mOFO697I8_TuTak/edit.
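
To make the idea concrete, here is a naive sketch of a rollup, written under assumptions that are not taken from the design document: cubes are identified by string paths (the parent of `"01"` is `"0"`, the root is `""`), sizes are row counts, and a group that is smaller than `min_file_size` is merged into its parent's group. The function name `rollup` and its parameters are hypothetical.

```python
def rollup(cube_sizes, min_file_size):
    """Map every cube id to the id of the group (file) it is rolled up into.

    Assumption: the parent of a cube is its id without the last character,
    and the root cube has the empty id.
    """
    # Every cube starts in its own group: group id -> (member cubes, total rows).
    groups = {cube: ([cube], size) for cube, size in cube_sizes.items()}
    max_depth = max((len(cube) for cube in cube_sizes), default=0)

    # Go from the deepest level up, so small leaves are merged before their parents.
    for depth in range(max_depth, 0, -1):
        for cube in [c for c in groups if len(c) == depth]:
            members, size = groups[cube]
            if size >= min_file_size:
                continue  # big enough to be written as a file of its own
            parent = cube[:-1]
            parent_members, parent_size = groups.get(parent, ([], 0))
            groups[parent] = (parent_members + members, parent_size + size)
            del groups[cube]

    return {cube: group for group, (members, _) in groups.items() for cube in members}


assignment = rollup({"": 100, "0": 40, "1": 120, "00": 10, "01": 15}, min_file_size=50)
print(assignment)
```

In this example cubes `"0"`, `"00"` and `"01"` end up in one group, so the five cubes are written as three files, and sibling blocks that are likely to be read together share a file.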

osopardo1 · 2023-10-23T12:25:01Z

The next steps are:

1. Separate main development from 1.0.0. Some breaking features affect other parts of the code and should be discussed and analyzed in more depth.
2. Open another PR with only the Roll-Up code, which will mitigate the "small files" problem.
3. Decide whether we implement Block Metadata from the Multi-block Format with or without Ranges. Ranges help select the block inside a file that represents a particular cube. Without these ranges, Replication might be affected, since blocks from different cubes with replicated data can be hard to distinguish within the same file.
   - One solution is to keep Replicated Data in a different folder and apply specific rules for Rolling Up those files.
   - In that way, we can preserve the Ranges (which make writing much more memory-intensive and can cause pitfalls with huge datasets) only in the Replicated blocks, while for the rest the condition is removed.
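
For readers unfamiliar with the Ranges discussed above, the following sketch shows one way block metadata with row ranges could be modelled. It is only an illustration: the class and field names (`Block`, `IndexFile`, `start`, `end`, `replicated`) and the file name are hypothetical and do not describe the actual metadata format.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Block:
    cube_id: str              # cube the block belongs to
    start: int                # first row of the block inside the file (inclusive)
    end: int                  # row after the last row of the block (exclusive)
    replicated: bool = False  # whether the block holds replicated data


@dataclass(frozen=True)
class IndexFile:
    path: str
    blocks: Tuple[Block, ...]

    def ranges_for(self, cube_ids: Iterable[str]) -> List[Tuple[int, int]]:
        """Row ranges to read when only the given cubes are needed."""
        wanted = set(cube_ids)
        return [(b.start, b.end) for b in self.blocks if b.cube_id in wanted]


index_file = IndexFile(
    path="part-0.parquet",  # hypothetical file name
    blocks=(
        Block("0", 0, 50),
        Block("00", 50, 60),
        Block("01", 60, 75, replicated=True),
    ),
)
print(index_file.ranges_for(["00", "01"]))
```

Without the `start` and `end` fields a reader would have to scan the whole file and filter rows by cube, which is where replicated blocks become hard to tell apart.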

alexeiakimov added the type: bug, priority: high, status: in-progress, and type: enhancement labels and removed the type: bug label Aug 3, 2023

alexeiakimov self-assigned this Aug 3, 2023

This was referenced Aug 3, 2023

Revert #200 and new version #208

Merged

68 overhead of qbeast hash filtering when doing a sample #194

Closed

#196 2 files based optimization #197

Closed

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 15, 2023

Qbeast-io#207 QbeastBlock is replaced with Block, File, etc

7aa5488

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 15, 2023

Qbeast-io#207 QbeastFileFormat and related classes.

c85eea9

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 15, 2023

Qbeast-io#207 PathRangesCodec, tests and fixes.

77f380d

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 15, 2023

Qbeast-io#207 Test for RangedColumnarBatchIterator

22dab06

alexeiakimov mentioned this issue Aug 15, 2023

207 rollup compaction #210

Closed

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 20, 2023

Qbeast-io#207 Initial implementation of IndexFileWriter with multiblock support and new file metadata format.

6ef20b1

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 20, 2023

Qbeast-io#207 a typo in the scaladoc is fixed

31ce679

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 23, 2023

Qbeast-io#207 Serialization problems in the core classes are fixed.

0d39edb

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 23, 2023

Qbeast-io#207 Fixes for the QbeastSparkCorrectnessTest.

1960900

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 30, 2023

Qbeast-io#207 Fixes for IndexFile, QbeastBaseRelation, QbeastFileFormat and RangedColumnarBatchIterator

87e8f94

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 1, 2023

Qbeast-io#207 Fixes for CubeDataLoader and IndexTest

0eb9097

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 4, 2023

Qbeast-io#207 Fixes for NormalizedWeightIntegrationTest

63f53e2

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 4, 2023

Qbeast-io#207 fixes for QueryFileBuilder and QbeastFileFormat.

04106af

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 4, 2023

Qbeast-io#207 Optimization and compaction tests are temporarily disabled

06d5949

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 5, 2023

Qbeast-io#207 IndexStatusBuilder is improved according to the PR review

069dc9f

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 5, 2023

Qbeast-io#207 WriteStrategy and legacy implementation, small improvements.

9f4519f

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 6, 2023

Qbeast-io#207 WriteStrategy and its legacy implementation are reworked.

b2214e9

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 12, 2023

Qbeast-io#207 Cube domains are added to the TableChanges and its subclasses.

cba6c82

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 12, 2023

Qbeast-io#207 RollupWriteStrategy initial implementation.

a1215bc

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 12, 2023

Qbeast-io#207 PointWeightIndexerTest is fixed.

5e66a75

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 14, 2023

Qbeast-io#207 Replication is adopted for multi-block files

2c042b5

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 27, 2023

Qbeast-io#207 Small improvements for IndexFile

5457554

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 27, 2023

Qbeast-io#207 RollupWriteStrategy is fixed and improved.

ce2d464

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 27, 2023

Qbeast-io#207 Recent changes from the main branch are merged

da485fd

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 3, 2023

Qbeast-io#207 Writer related files are renamed to make easier refactoring

77d357f

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 3, 2023

Qbeast-io#207 WriteStrategy implementation is reworked

fe50d98

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 3, 2023

Qbeast-io#207 Rollup algorithm is extracted to be an abstraction on its own.

aafa7e5

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 6, 2023

Qbeast-io#207 Initial implementation of the naive rollup based compaction.

db3cbbc

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 6, 2023

Qbeast-io#207 fixes for tests in the core project

d97b365

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 9, 2023

Qbeast-io#207 Old compaction code is removed

5351ecb

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 9, 2023

Qbeast-io#207 SparkDeltaDataWriterTest is improved

6e919f2

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 10, 2023

Qbeast-io#207 Changing fileSize in the index options creates a new ri…

190f4a7

…ndex revision.

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 11, 2023

Qbeast-io#207 Buffered rows are sorted by weight before writing the block

0e67b98

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 19, 2023

Qbeast-io#207 Old analyze, optimize and compact commands are deprecated.

0b06c2e

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 19, 2023

Qbeast-io#207 OptimizeSpec data structure s introduced.

03440b6

alexeiakimov mentioned this issue Oct 23, 2023

2 files based optimization #196

Closed

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 27, 2023

Qbeast-io#207 AddFile metadata format is prepared for multiple blocks.

2d656f7

alexeiakimov mentioned this issue Oct 30, 2023

#207 rollup compaction simplified #225

Closed

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 31, 2023

Qbeast-io#207 QbeastBlock is removed.

ba1df85

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 31, 2023

Qbeast-io#207 Abstract rollup implementation and test.

51fc009

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 31, 2023

Qbeast-io#207 RollupDataWrite as the replacement for SparkDeltaDataWriter

900d611

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 7, 2023

Qbeast-io#207 Fixes for tests.

902e6e9

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 7, 2023

Qbeast-io#207 More fixes for tests, etc

fcb40ea

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 7, 2023

Qbeast-io#207 Formatting and typos are fixed

c961cb5

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 8, 2023

Qbeast-io#207 Replicated set is removed from the metadata

aa1193a

alexeiakimov mentioned this issue Nov 14, 2023

207 rollup compaction simplified #232

Merged

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 22, 2023

Qbeast-io#207 Code coverage is improved.

54b1d73

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 22, 2023

Qbeast-io#207 Unused tags and Qbeast columns are removed

04f517e

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 24, 2023

Qbeast-io#207 QbeastFilterPushdownTest is improved

2d39821

cdelfosse closed this as completed Nov 27, 2023
