Rollup compaction #207

Closed
alexeiakimov opened this issue Aug 3, 2023 · 1 comment
Labels: priority: high, status: in-progress

Comments

@alexeiakimov
Contributor

What went wrong?

The current version creates too many small files while indexing or replicating data. Those small files have a negative impact on read performance.

How to reproduce?

Index a DataFrame whose data is widely dispersed, e.g. the records are distributed uniformly over some range. Optimize the index several times to push the data down the index tree. Each time a part of the data is saved to a cube, that part is stored in a separate file, so after several writes or optimizations in which the data falls into different cubes, the index contains a lot of small files.

Code that triggered the bug, or steps to reproduce:

There is no particular code to blame. The problem is that every write to a cube creates a separate file. This is a problem of the algorithm, not a defect of the implementation.
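
To make the effect concrete, here is a toy simulation in plain Python. It is not qbeast-spark code: the cube ids, the one-dimensional `[0, 1)` value range, and the helper names `cube_for` and `simulate_writes` are assumptions made only for illustration. It just counts how many files appear when every write produces one file per cube it touches.

```python
import random
from collections import defaultdict


def cube_for(value: float, depth: int) -> str:
    """Map a value in [0, 1) to a hypothetical cube id at the given tree depth."""
    # Each level halves the range, so a tree of depth d has 2**d leaf cubes.
    return f"d{depth}/c{int(value * 2 ** depth)}"


def simulate_writes(num_writes: int, rows_per_write: int, depth: int, seed: int = 0):
    """Return a list of (cube_id, row_count) pairs, one pair per file written."""
    rng = random.Random(seed)
    files = []
    for _ in range(num_writes):
        rows_per_cube = defaultdict(int)
        for _ in range(rows_per_write):
            rows_per_cube[cube_for(rng.random(), depth)] += 1
        # One file per cube touched by this write, as described in the issue.
        files.extend(rows_per_cube.items())
    return files


files = simulate_writes(num_writes=5, rows_per_write=1000, depth=4)
average = sum(rows for _, rows in files) / len(files)
print(f"{len(files)} files, {average:.1f} rows per file on average")
```

Because uniformly distributed records reach almost every cube on every write, the file count in this model grows roughly as the number of writes times the number of cubes, while the average file size shrinks as the tree gets deeper.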

How to fix?

The proposal is to apply rollup compaction while writing the data, joining the blocks of closely related cubes into one physical file, so that when the data is queried there is a high probability that most of those blocks contribute data to the query. This approach should reduce the number of small files without a significant impact on query performance. The necessary details can be found in the document https://docs.google.com/document/d/1CY5Gzx46fuatwkAyPRKIRSS7DeD4mOFO697I8_TuTak/edit.
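
One possible reading of the rollup idea can be sketched as follows. This is an assumption-laden illustration, not the algorithm from the linked document: cube ids are modelled as path strings (the parent of `"01"` is `"0"`, the root is `""`), "closely related" is taken to mean "shares an ancestor", and `roll_up` and `min_file_size` are hypothetical names.

```python
from collections import defaultdict


def roll_up(cube_sizes: dict, min_file_size: int) -> dict:
    """Group cube blocks into files by rolling small cubes up into their parents.

    cube_sizes maps a cube id to its row count. The result maps the cube id
    that names each output file to the list of cube ids whose blocks it holds.
    """
    sizes = defaultdict(int)
    groups = defaultdict(list)
    for cube, size in cube_sizes.items():
        sizes[cube] = size
        groups[cube] = [cube]

    max_depth = max((len(cube) for cube in cube_sizes), default=0)
    # Visit the deepest level first so that small leaves are merged upward
    # before their parents are compared against the threshold.
    for depth in range(max_depth, 0, -1):
        for cube in sorted(c for c in list(groups) if len(c) == depth):
            if sizes[cube] < min_file_size:
                parent = cube[:-1]
                groups[parent].extend(groups.pop(cube))
                sizes[parent] += sizes.pop(cube)
    return dict(groups)


layout = roll_up({"": 50, "0": 10, "1": 400, "00": 5, "01": 7, "10": 300}, min_file_size=100)
for file_cube, members in layout.items():
    print(repr(file_cube), "->", members)
```

In this sketch the small cubes `"0"`, `"00"` and `"01"` end up in the same file as the root, while `"1"` and `"10"` are large enough to keep their own files. A query that needs a deep cube usually also needs its ancestors, which is why co-locating them should cost little at read time.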

alexeiakimov added the type: bug, priority: high, status: in-progress and type: enhancement labels and removed the type: bug label on Aug 3, 2023
alexeiakimov self-assigned this on Aug 3, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 15, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 15, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 15, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 15, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 20, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 20, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 23, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 23, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Aug 30, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 1, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 4, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 4, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 4, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 5, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 5, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 6, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 12, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 12, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 12, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 14, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 27, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 27, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Sep 27, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 3, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 3, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 3, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 6, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 6, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 9, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 9, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 10, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 11, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 19, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 19, 2023
@osopardo1
Member

osopardo1 commented Oct 23, 2023

The next steps are:

  1. Separate Main development from 1.0.0. There are some breaking features that affect parts of the code and should be discussed and analyzed in more depth.
  2. Open another PR containing only the Roll-Up code, which will mitigate the "small files" problem.
  3. Decide whether we implement Block Metadata from the Multi-block Format with or without Ranges. Ranges help select the block inside a file that represents a particular cube. Without these ranges, Replication might be affected, since blocks from different cubes with replicated data can be hard to distinguish within the same file.
    • One solution is to keep Replicated Data in a different folder and apply specific rules for Rolling Up those files.
    • In that way, we can preserve the Ranges (which make writing much more memory-intensive and can cause problems with huge datasets) only in the Replicated blocks, while for the rest the condition is removed.
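
The trade-off in step 3 can be illustrated with a small sketch. The comment does not define what a Range contains, so the code below assumes it is a pair of `[start, end)` row offsets inside the file. The `Block` class, its fields, and `rows_to_read` are hypothetical and do not come from the qbeast-spark code base.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Block:
    """Hypothetical metadata for one cube's block inside a multi-block file."""
    cube_id: str
    row_count: int
    # Assumed meaning of a Range: [start, end) row offsets; None if not recorded.
    row_range: Optional[Tuple[int, int]] = None


def rows_to_read(blocks: List[Block], cube_id: str) -> List[Tuple[int, int]]:
    """Return the row ranges a reader must scan to obtain the rows of cube_id."""
    wanted = [block for block in blocks if block.cube_id == cube_id]
    if not wanted:
        return []
    if all(block.row_range is not None for block in wanted):
        # With ranges, only the rows of the requested cube are touched.
        return [block.row_range for block in wanted]
    # Without ranges the block cannot be located, so the whole file is scanned.
    return [(0, sum(block.row_count for block in blocks))]


with_ranges = [Block("0", 10, (0, 10)), Block("01", 5, (10, 15))]
without_ranges = [Block("0", 10), Block("01", 5)]
print(rows_to_read(with_ranges, "01"))
print(rows_to_read(without_ranges, "01"))
```

Under this assumption, recording ranges requires the writer to keep each cube's rows contiguous, which would be consistent with the memory cost mentioned above. Dropping them makes writing cheaper but forces readers to scan and filter the whole file.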

alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 27, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 31, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 31, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Oct 31, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 7, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 7, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 7, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 8, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 22, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 22, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue Nov 24, 2023