
Avoid missing cube files on reading #57

Merged: 25 commits merged into Qbeast-io:main from 47-stop-losing-records on Jan 11, 2022

Conversation

osopardo1 (Member) commented Dec 20, 2021

Due to the bug stated in issue #47 and the reasoning in #48, we should avoid losing cube records at read time.

The greedy estimation in the first step of the writing protocol can leave missing parents in the index structure. All the data is written, but the read operation does not contemplate the case in which cube "AAAA" is present while its parent "AAA" is not. If a parent is absent from the index structure because of poor index quality, the query should nevertheless retrieve all the necessary elements.

To address this, this draft PR aims to:

  • Use a Patricia Trie instead of a Map for resolving the CubeWeights. A Patricia Trie would find any matches for a CubeId or its descendants. => No need, a SortedMap is enough (see the sketch after this list).
  • Possibly change the recursive implementation to an iterative one. This would also help with issue Transform recursive method findSampleFiles to iterative #56.
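
For reference, a minimal sketch of the SortedMap idea, using plain strings as stand-in cube identifiers (not the project's real CubeId/CubeWeights types): because a SortedMap keeps its keys ordered, a cube and all of its descendants are contiguous, so they can be collected with a single iterator scan, and the lookup does not break when an intermediate parent is missing.

```scala
import scala.collection.immutable.SortedMap

object CubeWeightsSketch {
  // Cube ids encoded as strings: "" is the root, "AAAA" is a descendant of "AA".
  // The parent "AAA" is intentionally absent, as in the bug described above.
  val cubeWeights: SortedMap[String, Double] = SortedMap(
    "" -> 0.1,
    "A" -> 0.3,
    "AA" -> 0.5,
    "AAAA" -> 0.9
  )

  // Iterative lookup: every key that starts with `cube` follows it directly in
  // sorted order, so neither recursion nor a Patricia Trie is needed.
  def cubeAndDescendants(cube: String): Seq[(String, Double)] =
    cubeWeights
      .iteratorFrom(cube)
      .takeWhile { case (id, _) => id.startsWith(cube) }
      .toSeq

  def main(args: Array[String]): Unit =
    // "AAAA" is still returned even though its parent "AAA" was never written.
    println(cubeAndDescendants("AA")) // List((AA,0.5), (AAAA,0.9))
}
```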

@eavilaes eavilaes linked an issue Dec 20, 2021 that may be closed by this pull request
@osopardo1 osopardo1 force-pushed the 47-stop-losing-records branch from 6bf0bd0 to 720653c on December 21, 2021 07:11
osopardo1 (Member, Author) commented:
Hey @Jiaweihu08 @cugni, you could check this one for the pair programming session today 😈

@osopardo1 osopardo1 linked an issue Dec 21, 2021 that may be closed by this pull request
@cugni cugni changed the title from "Avoid losing cube files on reading" to "Avoid missing cube files on reading" on Dec 21, 2021
@Jiaweihu08 Jiaweihu08 marked this pull request as ready for review December 22, 2021 10:30
@Jiaweihu08 Jiaweihu08 marked this pull request as draft December 22, 2021 10:32
cugni (Member) commented Dec 30, 2021

Why do we need previouslyMappedFiles? What will Delta filter that we cannot? Is there a specific case? If not, it would be better to pass the QbeastFiles directly to the indexStatus instead of intersecting the two lists. Also, in OTreeIndex we are intersecting the file list twice, which is unnecessary.

@eavilaes eavilaes marked this pull request as ready for review January 4, 2022 15:45
@eavilaes eavilaes requested a review from cugni January 4, 2022 15:46
cugni (Member) left a comment

It seems good!

cugni (Member) left a comment

We still need to ensure that the QueryExecutor skips the files with maxWeight < weightRange.from

Review comment on src/test/scala/io/qbeast/spark/index/FailingTests.scala (outdated, resolved)
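
For context, the skipping rule could look roughly like the following sketch; Block and WeightRange are hypothetical stand-ins here, not the actual qbeast-spark classes:

```scala
// Hypothetical types for illustration only.
case class WeightRange(from: Double, to: Double)
case class Block(path: String, maxWeight: Double)

// A block whose maxWeight is below weightRange.from cannot contain any record
// of the requested sample, so the QueryExecutor can safely skip it.
def selectBlocks(blocks: Seq[Block], weightRange: WeightRange): Seq[Block] =
  blocks.filter(_.maxWeight >= weightRange.from)
```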
eavilaes (Contributor) left a comment

While creating a new test in QueryExecutorTest.scala, I found that most of these tests use a cubeSize = 10.
This leads to the cubeSize being ignored and config.DEFAULT_CUBE_SIZE being used instead.
Is this the desired behavior for these tests? There is already a specific test that checks the behavior with a smallCubeSize.

@eavilaes eavilaes requested a review from alexeiakimov January 7, 2022 11:36
cugni added 6 commits on January 10, 2022 14:54
  • We decided to use a DFS approach instead of a BFS one to reduce memory usage (see the sketch below).
  • As Eric found out, we were ignoring the desired cube size for small values. The problem was that we were comparing a Double with an Int, which led to issues.
  • Now the code works when a parent is missing, so there is no need to check for missing parents.
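
As a rough illustration of the DFS-instead-of-BFS note above (hypothetical Cube type, not the project's real CubeId), an explicit stack replaces both recursion and a BFS queue, so only the current path and its pending siblings are held in memory instead of a whole tree level:

```scala
import scala.collection.mutable

// Hypothetical cube type for illustration; the project's real CubeId type differs.
final case class Cube(id: String) {
  def children: Seq[Cube] = Seq(Cube(id + "A"), Cube(id + "B"))
}

// Iterative depth-first traversal over the cube tree.
def collectCubes(root: Cube, maxDepth: Int): Seq[Cube] = {
  val stack = mutable.Stack(root)
  val visited = Seq.newBuilder[Cube]
  while (stack.nonEmpty) {
    val cube = stack.pop()
    visited += cube
    if (cube.id.length < maxDepth) cube.children.foreach(c => stack.push(c))
  }
  visited.result()
}
```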
@cugni cugni merged commit bcea74f into Qbeast-io:main Jan 11, 2022
@osopardo1 osopardo1 deleted the 47-stop-losing-records branch February 26, 2022 08:01