Rollup compaction #207
This was referenced Aug 3, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Aug 15, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Aug 15, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Aug 15, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Aug 15, 2023
Closed
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Aug 20, 2023
…ck support and new file metadata format.
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Aug 20, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Aug 23, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Aug 23, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Aug 30, 2023
…at and RangedColumnarBatchIterator
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 1, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 4, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 4, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 4, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 5, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 5, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 6, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 12, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 12, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 12, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 14, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 27, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 27, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Sep 27, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 3, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 3, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 3, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 6, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 6, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 9, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 9, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 10, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 11, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 19, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 19, 2023
The next steps are:
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 27, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 31, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 31, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Oct 31, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Nov 7, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Nov 7, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Nov 7, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Nov 8, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Nov 22, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Nov 22, 2023
alexeiakimov added a commit to alexeiakimov/qbeast-spark that referenced this issue on Nov 24, 2023
What went wrong?
The current version creates too many small files while indexing or replicating data. These small files have a negative impact on read performance.
How to reproduce?
Index a data frame whose data is widely dispersed, e.g. the records are distributed uniformly over some range. Optimize the index several times to push the data down the index tree. Every time part of the data is saved to a cube, that part is stored in a separate file, so after several writes or optimizations in which the data falls into different cubes, the index contains a lot of small files.
Code that triggered the bug, or steps to reproduce:
There is no particular code to blame; the problem is that every write to a cube creates a separate file. This is a problem with the algorithm, not a defect in the implementation.
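The effect can be illustrated with a toy model. This is not qbeast-spark code: the one-dimensional cube ids, the batch size, and the assumption that each later write lands one level deeper in the tree are all made up for illustration. The only behaviour taken from the description above is that every cube touched by a write gets its own file.

```python
import random
from collections import Counter


def cube_of(x, depth):
    """Hypothetical cube id: the depth plus the index of the interval of [0, 1) holding x."""
    return (depth, int(x * 2 ** depth))


def files_written(batch, depth):
    """One file per cube touched by the batch; the value is the record count of that file."""
    return Counter(cube_of(x, depth) for x in batch)


rng = random.Random(0)
all_files = []  # record count of every file produced so far
for write in range(4):
    # Uniformly distributed records, i.e. widely dispersed data.
    batch = [rng.random() for _ in range(1000)]
    # Assumption: each write/optimization pushes the data one level deeper.
    depth = write + 2
    all_files.extend(files_written(batch, depth).values())

print(len(all_files), "files,", sum(all_files) / len(all_files), "records per file on average")
```

In this model the number of files grows with the number of cubes touched at each level, while the total number of records only grows linearly with the number of writes, so the average file keeps shrinking.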
How to fix?
The proposal is to use rollup compaction while writing the data, joining the blocks of closely related cubes into one physical file, so that when the data is queried there is a high probability that most of those blocks contribute data to the query. This approach should reduce the number of small files without a significant impact on query performance. The details can be found in the document https://docs.google.com/document/d/1CY5Gzx46fuatwkAyPRKIRSS7DeD4mOFO697I8_TuTak/edit.
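The linked document holds the actual design. As a rough illustration only, here is a sketch of one way a bottom-up rollup could be computed, under these assumptions: cubes are identified by their path from the root (a hypothetical string encoding, with "" as the root), "closely related" means a cube and its ancestors, and a block that is smaller than a hypothetical `min_file_size` threshold is merged into the file of its parent cube.

```python
import heapq


def parent(cube):
    """Parent of a cube identified by its path from the root, e.g. "01" -> "0"."""
    return cube[:-1]


def rollup(block_sizes, min_file_size):
    """Map every cube to the cube whose file will hold its block.

    block_sizes: dict from cube path ("" is the root) to the record count of its block.
    min_file_size: hypothetical threshold below which a group is merged into its parent.
    """
    # Each group is (cubes whose blocks go into this file, total record count).
    groups = {cube: ([cube], size) for cube, size in block_sizes.items()}
    # Process the deepest cubes first, so small leaves roll up before their parents.
    heap = [(-len(cube), cube) for cube in groups]
    heapq.heapify(heap)
    while heap:
        _, cube = heapq.heappop(heap)
        members, size = groups[cube]
        if cube == "" or size >= min_file_size:
            continue  # the root cannot roll up further; big groups keep their own file
        target = parent(cube)
        del groups[cube]
        if target in groups:
            target_members, target_size = groups[target]
            groups[target] = (target_members + members, target_size + size)
        else:
            groups[target] = (members, size)
            heapq.heappush(heap, (-len(target), target))
    return {m: target for target, (members, _) in groups.items() for m in members}


block_sizes = {"": 50, "0": 40, "00": 10, "01": 5, "1": 100, "10": 20}
assignment = rollup(block_sizes, min_file_size=60)
print(len(set(assignment.values())))  # 6 blocks end up in 2 files
```

In this example the blocks of "00", "01" and "0" are too small on their own and end up in the root's file, while "10" joins its parent "1". A query that reads a deep cube has to read its ancestors anyway, which is why grouping along the parent chain is a natural candidate for keeping the extra data read per query small.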