Issue 292: Merge main-1.0.0 into main #284

Merged: 148 commits from main-1.0.0 into main on Mar 27, 2024

Conversation

@osopardo1 (Member) commented Mar 13, 2024

We published release 0.5.0 last January 9th. That version included changes such as String Indexing and updates to Spark and Delta 2.4.0.

Although those changes significantly improved the distribution of text columns, we wanted to work on algorithm optimizations that could balance the file layout for all kinds of data.

Closes #292

What's changed?

Roll-Up

One of the key operations for distributing the files evenly is the Roll-Up.

Roll-Up compaction solves the small files problem by storing the blocks of closely related cubes in a single physical file. Here “closely related” means that there is a high probability that the cubes contribute data to the same query.
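
As a rough illustration of the idea (a sketch only, not the actual qbeast-spark code), the roll-up can be seen as a bottom-up pass over the cube tree: cubes whose accumulated element count falls below a threshold are mapped to their parent, so their blocks end up in the same physical file. The `CubeId` type and the threshold below are hypothetical.

```scala
// Illustrative sketch of the roll-up idea; CubeId and the threshold are
// hypothetical, not the actual qbeast-spark types.
case class CubeId(id: String, parent: Option[CubeId])

/** Maps every cube to the cube whose file will hold its block. */
def rollUp(elementCounts: Map[CubeId, Long], threshold: Long): Map[CubeId, CubeId] = {
  // Visit the deepest cubes first so their sizes accumulate into the parents.
  val bottomUp = elementCounts.keys.toSeq.sortBy(c => -c.id.length)
  val sizes = scala.collection.mutable.Map(elementCounts.toSeq: _*)
  val target = scala.collection.mutable.Map.empty[CubeId, CubeId]

  for (cube <- bottomUp) {
    val size = sizes.getOrElse(cube, 0L)
    cube.parent match {
      case Some(p) if size < threshold =>
        // Too small for its own file: roll the cube up into its parent.
        target(cube) = p
        sizes(p) = sizes.getOrElse(p, 0L) + size
      case _ =>
        // Large enough (or the root): the cube keeps its own file.
        target(cube) = cube
    }
  }

  // Follow parent chains so each cube points at its final destination file.
  elementCounts.keys.map { cube =>
    var dest = target(cube)
    while (target.get(dest).exists(_ != dest)) dest = target(dest)
    cube -> dest
  }.toMap
}
```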


New Protocol

We've been working on breaking changes to the algorithm, which affect the metadata written in the Commit Log.

In summary, instead of one single file containing one single cube, a file can now contain multiple cubes stored as blocks.


Original protocol metadata:

"tags": {
  "state": "FLOODED",
  "cube": "w",
  "revision": "1",
  "minWeight": "2",
  "maxWeight": "3",
  "elementCount": "4" 
}

NEW protocol metadata:

"tags": {
  "revision": "1",
  "blocks": [
    {
      "cube": "w",
      "minWeight": 2,
      "maxWeight": 3,
      "replicated": false,
      "elementCount": 4
    },
    {
      "cube": "wg",
      "minWeight": 5,
      "maxWeight": 6,
      "replicated": false,
      "elementCount": 7
    }
  ]
}
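
To make the change concrete, here is a minimal sketch of how the new per-file tags could be modeled in Scala. The Block and FileTags case classes are illustrative only, not the project's actual API; the field names simply mirror the JSON above.

```scala
// Illustrative model of the new per-file tags (a sketch, not the qbeast-spark API).
case class Block(
    cube: String,
    minWeight: Long,
    maxWeight: Long,
    replicated: Boolean,
    elementCount: Long)

case class FileTags(revision: String, blocks: Seq[Block])

// The example above, expressed with the model: one file, two cube blocks.
val tags = FileTags(
  revision = "1",
  blocks = Seq(
    Block(cube = "w", minWeight = 2, maxWeight = 3, replicated = false, elementCount = 4),
    Block(cube = "wg", minWeight = 5, maxWeight = 6, replicated = false, elementCount = 7)
  )
)

assert(tags.blocks.map(_.cube) == Seq("w", "wg")) // one file now holds several cubes
```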

Domain-Driven Appends

Another upgrade in the new code is the use of Cube Domains for appending data incrementally. The change uses the existing index during partition-level domain estimation, which helps reduce the number of cubes with outdated max weights from 45% to 0.16%. In other words, more balanced files!

Fixes #226. Full details in #227
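
A rough sketch of the cube-domain idea, under the assumption that domains are derived from tree sizes (the CubeStats type and the exact formula below are illustrative, not the actual qbeast-spark implementation): a cube's domain estimates how many records of the whole dataset fall inside its space, so an append can scale weights against the existing index rather than only against the incoming batch.

```scala
// Illustrative only: domains derived from tree sizes.
case class CubeStats(id: String, elementCount: Long, children: Seq[CubeStats])

// Tree size: the cube's own elements plus everything below it.
def treeSize(c: CubeStats): Long =
  c.elementCount + c.children.map(treeSize).sum

// Domains, computed top-down: the root's domain is its tree size, and each
// child gets a share of its parent's domain proportional to its tree size
// among the siblings.
def domains(c: CubeStats, parentDomain: Option[Long] = None): Map[String, Long] = {
  val own = parentDomain.getOrElse(treeSize(c))
  val siblingsTotal = c.children.map(treeSize).sum.max(1L)
  c.children.foldLeft(Map(c.id -> own)) { (acc, child) =>
    val share = own * treeSize(child) / siblingsTotal
    acc ++ domains(child, Some(share))
  }
}

// Tiny example: a root with two children of unequal tree sizes.
val root = CubeStats("root", 10, Seq(CubeStats("w", 30, Nil), CubeStats("wg", 60, Nil)))
// domains(root) => Map(root -> 100, w -> 33, wg -> 66)
```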

Auto-Indexing

Sometimes the .option("columnsToIndex", "a,b") was too hot to handle... That's why we added functionality to automatically choose the best columns to organize the data.

The feature is not enabled by default. If you want to use it, add the necessary configuration:

spark.qbeast.index.columnsToIndex.auto=true
spark.qbeast.index.columnsToIndex.auto.max=10
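
For example, a usage sketch (paths, data, and session setup are made up, and it assumes the usual qbeast-spark setup is already in place): with those properties set, the write no longer needs an explicit columnsToIndex option.

```scala
// Usage sketch with auto-indexing enabled; paths and data are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("qbeast-auto-indexing-example")
  .config("spark.qbeast.index.columnsToIndex.auto", "true")
  .config("spark.qbeast.index.columnsToIndex.auto.max", "10")
  .getOrCreate()

val df = spark.read.parquet("/tmp/events") // any existing dataset

// No .option("columnsToIndex", "a,b") needed: columns are chosen automatically.
df.write
  .format("qbeast")
  .save("/tmp/events_qbeast")
```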

alexeiakimov and others added 30 commits on October 27, 2023

Roll up compaction simplified
@osopardo1 requested a review from @Jiaweihu08 on March 25, 2024
@osopardo1 (Member, Author) commented Mar 25, 2024

This is taking longer than expected...

But to summarize:

@osopardo1 (Member, Author) commented:

After some discussion, we agreed that:

  • Compact() command stays where it is.
  • We will open an issue to remove the operation and apply indexing when optimizing the staging area (further details TBD).

These two things would help with easier review and testing.

Thanks, folks. On my way to update & merge 😄

@fpj (Contributor) left a comment:

Some additional comments.

@fpj (Contributor) left a comment:

+1, lgtm

Successfully merging this pull request may close these issues:

  • Merge main-1.0.0 into main
  • Appends that don't update cube weights

5 participants