
Avoid missing cube files on reading #57

Merged: 25 commits merged into Qbeast-io:main from 47-stop-losing-records on Jan 11, 2022

Conversation

osopardo1 (Member) commented Dec 20, 2021

Due to the bug stated in issue #47 and the reasoning in #48, we should avoid losing cube records at read time.

The greedy estimation in the first step of the writing protocol can leave missing parents in the index structure. All the data is written, but the read operation does not contemplate the case in which cube "AAAA" is present while its parent "AAA" is not. If a parent is absent from the index structure because of poor index quality, the query should nevertheless retrieve all the necessary elements.

To address this, this draft PR aims to:

  • Use a Patricia Trie instead of a Map for resolving the CubeWeights. A Patricia Trie would find any matches for a CubeId or its descendants. => No need, a SortedMap is enough (see the sketch after this list).
  • Possibly change the recursive implementation to an iterative one. This would also help with issue Transform recursive method findSampleFiles to iterative #56.
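
For reference, a minimal sketch of the SortedMap idea, using plain strings as stand-in cube identifiers (not the project's real CubeId/CubeWeights types): because a SortedMap keeps its keys ordered, a cube and all of its descendants are contiguous, so they can be collected with a single iterator scan, and the lookup does not break when an intermediate parent is missing.

```scala
import scala.collection.immutable.SortedMap

object CubeWeightsSketch {
  // Cube ids encoded as strings: "" is the root, "AAAA" is a descendant of "AA".
  // The parent "AAA" is intentionally absent, as in the bug described above.
  val cubeWeights: SortedMap[String, Double] = SortedMap(
    "" -> 0.1,
    "A" -> 0.3,
    "AA" -> 0.5,
    "AAAA" -> 0.9
  )

  // Iterative lookup: every key that starts with `cube` follows it directly in
  // sorted order, so neither recursion nor a Patricia Trie is needed.
  def cubeAndDescendants(cube: String): Seq[(String, Double)] =
    cubeWeights
      .iteratorFrom(cube)
      .takeWhile { case (id, _) => id.startsWith(cube) }
      .toSeq

  def main(args: Array[String]): Unit =
    // "AAAA" is still returned even though its parent "AAA" was never written.
    println(cubeAndDescendants("AA")) // List((AA,0.5), (AAAA,0.9))
}
```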

@eavilaes eavilaes linked an issue Dec 20, 2021 that may be closed by this pull request
@osopardo1 osopardo1 force-pushed the 47-stop-losing-records branch from 6bf0bd0 to 720653c on December 21, 2021 07:11
osopardo1 (Member, Author) commented:
Hey @Jiaweihu08 @cugni, you could check this one for the pair programming session today 😈

@osopardo1 osopardo1 linked an issue Dec 21, 2021 that may be closed by this pull request
@cugni cugni changed the title from "Avoid losing cube files on reading" to "Avoid missing cube files on reading" on Dec 21, 2021
@Jiaweihu08 Jiaweihu08 marked this pull request as ready for review December 22, 2021 10:30
@Jiaweihu08 Jiaweihu08 marked this pull request as draft December 22, 2021 10:32
cugni (Member) commented Dec 30, 2021

Why do we need previouslyMappedFiles? What will Delta filter that we cannot? Is there a specific case? If not, it would be better to pass the QbeastFiles directly to the indexStatus instead of intersecting the two lists. Also, in OTreeIndex we are intersecting the file list twice, which is unnecessary.

@eavilaes eavilaes marked this pull request as ready for review January 4, 2022 15:45
@eavilaes eavilaes requested a review from cugni January 4, 2022 15:46
cugni (Member) left a comment

It seems good!

cugni (Member) left a comment

We still need to ensure that the QueryExecutor skips the files with maxWeight < weightRange.from

Review comment on src/test/scala/io/qbeast/spark/index/FailingTests.scala (outdated, resolved)
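
For context, the skipping rule could look roughly like the following sketch; Block and WeightRange are hypothetical stand-ins here, not the actual qbeast-spark classes:

```scala
// Hypothetical types for illustration only.
case class WeightRange(from: Double, to: Double)
case class Block(path: String, maxWeight: Double)

// A block whose maxWeight is below weightRange.from cannot contain any record
// of the requested sample, so the QueryExecutor can safely skip it.
def selectBlocks(blocks: Seq[Block], weightRange: WeightRange): Seq[Block] =
  blocks.filter(_.maxWeight >= weightRange.from)
```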
eavilaes (Contributor) left a comment

While creating a new test in QueryExecutorTest.scala, I found that most of these tests use a cubeSize = 10.
This leads to the cubeSize being ignored and config.DEFAULT_CUBE_SIZE being used instead.
Is this the desired behavior for these tests? There is already a specific test that checks the behavior with a smallCubeSize.

@eavilaes eavilaes requested a review from alexeiakimov January 7, 2022 11:36
cugni added 6 commits on January 10, 2022 14:54
  • We decided to use a DFS approach instead of a BFS one to reduce memory usage (see the sketch below).
  • As Eric found out, we were ignoring the desired cube size for small values. The problem was that we were comparing a Double with an Int, which led to issues.
  • Now the code works when a parent is missing, so there is no need to check for missing parents.
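
As a rough illustration of the DFS-instead-of-BFS note above (hypothetical Cube type, not the project's real CubeId), an explicit stack replaces both recursion and a BFS queue, so only the current path and its pending siblings are held in memory instead of a whole tree level:

```scala
import scala.collection.mutable

// Hypothetical cube type for illustration; the project's real CubeId type differs.
final case class Cube(id: String) {
  def children: Seq[Cube] = Seq(Cube(id + "A"), Cube(id + "B"))
}

// Iterative depth-first traversal over the cube tree.
def collectCubes(root: Cube, maxDepth: Int): Seq[Cube] = {
  val stack = mutable.Stack(root)
  val visited = Seq.newBuilder[Cube]
  while (stack.nonEmpty) {
    val cube = stack.pop()
    visited += cube
    if (cube.id.length < maxDepth) cube.children.foreach(c => stack.push(c))
  }
  visited.result()
}
```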
@cugni cugni merged commit bcea74f into Qbeast-io:main Jan 11, 2022
@osopardo1 osopardo1 deleted the 47-stop-losing-records branch February 26, 2022 08:01