Issue 292: Merge main-1.0.0 into main #284

Merged: 148 commits from main-1.0.0 into main on Mar 27, 2024

Conversation

@osopardo1 (Member) commented Mar 13, 2024

We published release 0.5.0 last January 9th. That version included changes such as String Indexing and updates to Spark and Delta 2.4.0.

Although those changes significantly improved the distribution of text columns, we wanted to work on algorithm optimizations that could balance the file layout for all kinds of data.

Closes #292

What's changed?

Roll-Up

One of the key operations for distributing the files evenly is the Roll-Up.

Roll-Up compaction solves the small files problem by storing the blocks of closely related cubes in a single physical file. Here “closely related” means that there is a high probability that the cubes contribute data to the same query.
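
As a rough illustration of the idea (a sketch only, not the actual qbeast-spark code), the roll-up can be seen as a bottom-up pass over the cube tree: cubes whose accumulated element count falls below a threshold are mapped to their parent, so their blocks end up in the same physical file. The `CubeId` type and the threshold below are hypothetical.

```scala
// Illustrative sketch of the roll-up idea; CubeId and the threshold are
// hypothetical, not the actual qbeast-spark types.
case class CubeId(id: String, parent: Option[CubeId])

/** Maps every cube to the cube whose file will hold its block. */
def rollUp(elementCounts: Map[CubeId, Long], threshold: Long): Map[CubeId, CubeId] = {
  // Visit the deepest cubes first so their sizes accumulate into the parents.
  val bottomUp = elementCounts.keys.toSeq.sortBy(c => -c.id.length)
  val sizes = scala.collection.mutable.Map(elementCounts.toSeq: _*)
  val target = scala.collection.mutable.Map.empty[CubeId, CubeId]

  for (cube <- bottomUp) {
    val size = sizes.getOrElse(cube, 0L)
    cube.parent match {
      case Some(p) if size < threshold =>
        // Too small for its own file: roll the cube up into its parent.
        target(cube) = p
        sizes(p) = sizes.getOrElse(p, 0L) + size
      case _ =>
        // Large enough (or the root): the cube keeps its own file.
        target(cube) = cube
    }
  }

  // Follow parent chains so each cube points at its final destination file.
  elementCounts.keys.map { cube =>
    var dest = target(cube)
    while (target.get(dest).exists(_ != dest)) dest = target(dest)
    cube -> dest
  }.toMap
}
```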


New Protocol

We've been working on breaking changes to the algorithm, which affect the metadata written in the Commit Log.

In summary, instead of one single file containing one single cube, a file can now contain multiple cubes stored as blocks.


Original protocol metadata:

"tags": {
  "state": "FLOODED",
  "cube": "w",
  "revision": "1",
  "minWeight": "2",
  "maxWeight": "3",
  "elementCount": "4" 
}

NEW protocol metadata:

"tags": {
  "revision": "1",
  "blocks": [
    {
      "cube": "w",
      "minWeight": 2,
      "maxWeight": 3,
      "replicated": false,
      "elementCount": 4
    },
    {
      "cube": "wg",
      "minWeight": 5,
      "maxWeight": 6,
      "replicated": false,
      "elementCount": 7
    }
  ]
}
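
To make the change concrete, here is a minimal sketch of how the new per-file tags could be modeled in Scala. The Block and FileTags case classes are illustrative only, not the project's actual API; the field names simply mirror the JSON above.

```scala
// Illustrative model of the new per-file tags (a sketch, not the qbeast-spark API).
case class Block(
    cube: String,
    minWeight: Long,
    maxWeight: Long,
    replicated: Boolean,
    elementCount: Long)

case class FileTags(revision: String, blocks: Seq[Block])

// The example above, expressed with the model: one file, two cube blocks.
val tags = FileTags(
  revision = "1",
  blocks = Seq(
    Block(cube = "w", minWeight = 2, maxWeight = 3, replicated = false, elementCount = 4),
    Block(cube = "wg", minWeight = 5, maxWeight = 6, replicated = false, elementCount = 7)
  )
)

assert(tags.blocks.map(_.cube) == Seq("w", "wg")) // one file now holds several cubes
```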

Domain-Driven Appends

Another upgrade in the new code is the use of Cube Domains for appending data incrementally. The change uses the existing index during partition-level domain estimation, which helps reduce the number of cubes with outdated max weights from 45% to 0.16%. In other words, more balanced files!

Fixes #226. Full details in #227
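
A rough sketch of the cube-domain idea, under the assumption that domains are derived from tree sizes (the CubeStats type and the exact formula below are illustrative, not the actual qbeast-spark implementation): a cube's domain estimates how many records of the whole dataset fall inside its space, so an append can scale weights against the existing index rather than only against the incoming batch.

```scala
// Illustrative only: domains derived from tree sizes.
case class CubeStats(id: String, elementCount: Long, children: Seq[CubeStats])

// Tree size: the cube's own elements plus everything below it.
def treeSize(c: CubeStats): Long =
  c.elementCount + c.children.map(treeSize).sum

// Domains, computed top-down: the root's domain is its tree size, and each
// child gets a share of its parent's domain proportional to its tree size
// among the siblings.
def domains(c: CubeStats, parentDomain: Option[Long] = None): Map[String, Long] = {
  val own = parentDomain.getOrElse(treeSize(c))
  val siblingsTotal = c.children.map(treeSize).sum.max(1L)
  c.children.foldLeft(Map(c.id -> own)) { (acc, child) =>
    val share = own * treeSize(child) / siblingsTotal
    acc ++ domains(child, Some(share))
  }
}

// Tiny example: a root with two children of unequal tree sizes.
val root = CubeStats("root", 10, Seq(CubeStats("w", 30, Nil), CubeStats("wg", 60, Nil)))
// domains(root) => Map(root -> 100, w -> 33, wg -> 66)
```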

Auto-Indexing

Sometimes the .option("columnsToIndex", "a,b") was too hot to handle... That's why we added functionality to automatically choose the best columns to organize the data.

The feature is not enabled by default. If you want to use it, add the necessary configuration:

spark.qbeast.index.columnsToIndex.auto=true
spark.qbeast.index.columnsToIndex.auto.max=10
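
For example, a usage sketch (paths, data, and session setup are made up, and it assumes the usual qbeast-spark setup is already in place): with those properties set, the write no longer needs an explicit columnsToIndex option.

```scala
// Usage sketch with auto-indexing enabled; paths and data are illustrative.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("qbeast-auto-indexing-example")
  .config("spark.qbeast.index.columnsToIndex.auto", "true")
  .config("spark.qbeast.index.columnsToIndex.auto.max", "10")
  .getOrCreate()

val df = spark.read.parquet("/tmp/events") // any existing dataset

// No .option("columnsToIndex", "a,b") needed: columns are chosen automatically.
df.write
  .format("qbeast")
  .save("/tmp/events_qbeast")
```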

alexeiakimov and others added 30 commits on October 27, 2023

Roll up compaction simplified
@osopardo1 requested a review from @Jiaweihu08 on March 25, 2024
@osopardo1 (Member, Author) commented Mar 25, 2024

This is taking longer than expected...

But to summarize:

@osopardo1 (Member, Author) commented:

After some discussion, we agreed that:

  • Compact() command stays where it is.
  • We will open an issue to remove the operation and apply indexing when optimizing the staging area (further details TBD).

These two things would help with easier review and testing.

Thanks, folks. On my way to update & merge 😄

@fpj (Contributor) left a comment:

Some additional comments.

@fpj (Contributor) left a comment:

+1, lgtm

Successfully merging this pull request may close these issues:

  • Merge main-1.0.0 into main
  • Appends that don't update cube weights

5 participants