Issue 292: Merge main-1.0.0 into main #284
…ion-simplified Roll up compaction simplified
Add Domain-Driven Appends
…lafmt, the code is reformatted.
This is taking longer than expected... But to summarize:
After some discussion, we agreed that:
These two things would make review and testing easier. Thanks, folks! On my way to update & merge 😄
This reverts commit 8c2dacd.
Some additional comments.
src/test/scala/io/qbeast/spark/index/SparkColumnsToIndexSelectorTest.scala
+1, lgtm
We published release 0.5.0 on January 9th. That release included changes such as String Indexing and updates to Spark and Delta 2.4.0.
Although those changes significantly improved the distribution of text columns, we wanted to work on algorithm optimizations that could balance the file layout for all kinds of data.
Closes #292
What's changed?
Roll-Up
One of the key operations for distributing the files evenly is the Roll-Up.
Roll-Up compaction solves the small files problem by storing the blocks of closely related cubes in a single physical file. Here “closely related” means that there is a high probability that the cubes contribute data to the same query.
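As a rough illustration of the idea, here is a minimal sketch in Scala. It is not the implementation in this PR: the cube ids (strings in which dropping the last character gives the parent), the `rollUp` name, and the size threshold are all assumptions made for the example. Small cubes are merged upward into their parent's group until a group reaches the desired file size, so parent and child cubes, which are likely to be read by the same query, end up in the same file.

```scala
// Hypothetical sketch of roll-up; names and the id scheme are illustrative only.
// A cube id is a string: the parent of "01" is "0", and the parent of "0" is the root "".
// Assumes every non-root cube's parent is also present in the input.
object RollUpSketch {

  /** Maps every cube to the top cube of the group (file) that will hold its blocks. */
  def rollUp(cubeSizes: Map[String, Long], desiredFileSize: Long): Map[String, String] = {
    // Size accumulated by each group, keyed by the group's top cube.
    val groupSize = scala.collection.mutable.Map(cubeSizes.toSeq: _*)
    // Cubes belonging to each group, keyed by the group's top cube.
    val members =
      scala.collection.mutable.Map(cubeSizes.keys.map(c => c -> Vector(c)).toSeq: _*)

    // Visit the deepest cubes first so children are merged before their parents.
    val bottomUp = cubeSizes.keys.toSeq.sortBy(c => (-c.length, c))
    for (cube <- bottomUp if cube.nonEmpty) {
      val parent = cube.dropRight(1)
      if (groupSize(cube) < desiredFileSize) {
        // The group is too small for its own file: fold it into the parent's group.
        groupSize(parent) = groupSize.getOrElse(parent, 0L) + groupSize(cube)
        members(parent) = members.getOrElse(parent, Vector.empty) ++ members(cube)
        groupSize -= cube
        members -= cube
      }
    }
    members.toSeq.flatMap { case (top, cubes) => cubes.map(_ -> top) }.toMap
  }
}
```

With sizes `"" -> 50`, `"0" -> 30`, `"1" -> 120`, `"00" -> 10`, `"01" -> 15` and a desired size of 100, cubes `""`, `"0"`, `"00"`, and `"01"` share one file while `"1"` keeps its own. Note that in this sketch a merged group can end up somewhat larger than the threshold.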
New Protocol
We've been working on breaking changes in the algorithm that affect the metadata written to the Commit Log.
In summary, instead of each file containing a single cube, one file can now contain multiple cubes, stored in blocks.
Original protocol metadata:
NEW protocol metadata:
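The shape of the change can be sketched with a few Scala case classes. The class and field names below are hypothetical and are not the actual metadata schema written to the Commit Log; they only illustrate the move from "one file, one cube" to "one file, many blocks, each block belonging to one cube".

```scala
// Hypothetical model; names are illustrative, not the real commit-log schema.
object ProtocolSketch {

  // Original layout: a file holds the data of exactly one cube.
  final case class OldFile(path: String, cube: String, elementCount: Long)

  // New layout: a file is a sequence of blocks, and each block belongs to one cube.
  final case class Block(cube: String, elementCount: Long)

  final case class NewFile(path: String, blocks: Seq[Block]) {
    // All cubes that contribute data to this file.
    def cubes: Set[String] = blocks.map(_.cube).toSet
    // Total number of elements across all blocks.
    def elementCount: Long = blocks.map(_.elementCount).sum
  }

  /** Packs several single-cube files into one multi-block file. */
  def pack(path: String, files: Seq[OldFile]): NewFile =
    NewFile(path, files.map(f => Block(f.cube, f.elementCount)))
}
```

Under this model, per-cube statistics move from the file level to the block level, which is why the metadata format has to change.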
Domain-Driven Appends
Another upgrade in the new code is the use of Cube Domains for appending data incrementally. The change uses the existing index during partition-level domain estimation, which helps reduce the share of cubes with outdated max weights from 45% to 0.16%. In other words, more balanced files!
Fixes #226. Full details in #227
Auto-Indexing
Sometimes the .option("columnsToIndex", "a,b") was too hot to handle... That's why we added functionality to automatically choose the best columns to organize the data. The feature is not enabled by default. If you want to use it, you need to add the necessary configuration.
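The description does not spell out how the columns are chosen. As an assumption for illustration only, one heuristic that fits the goal is to prefer columns that are weakly correlated with the rest, since they add the most independent information to a multidimensional index. The sketch below is plain Scala, with hypothetical names (`selectColumns`, `maxColumns`), and is not the selector shipped in this PR.

```scala
// Hypothetical column-selection heuristic; not the selector implemented in this PR.
object AutoIndexSketch {

  // Pearson correlation of two equally sized numeric columns (0.0 if either is constant).
  private def pearson(x: Seq[Double], y: Seq[Double]): Double = {
    val n = x.size
    val mx = x.sum / n
    val my = y.sum / n
    val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
    val sx = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
    val sy = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
    if (sx == 0.0 || sy == 0.0) 0.0 else cov / (sx * sy)
  }

  /** Returns up to maxColumns names, least correlated with the other columns first. */
  def selectColumns(columns: Map[String, Seq[Double]], maxColumns: Int): Seq[String] = {
    val names = columns.keys.toSeq.sorted
    val avgAbsCorr = names.map { name =>
      val others = names.filterNot(_ == name)
      val total = others.map(o => math.abs(pearson(columns(name), columns(o)))).sum
      name -> (if (others.isEmpty) 0.0 else total / others.size)
    }
    avgAbsCorr.sortBy(_._2).take(maxColumns).map(_._1)
  }
}
```

For example, given a column `a`, a column `b` that is an exact multiple of `a`, and a column `c` that is uncorrelated with both, the sketch ranks `c` first because `a` and `b` carry largely the same information.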