Avoid missing cube files on reading #57
Conversation
Hey @Jiaweihu08 @cugni, you could check this one for the pair programming session today 😈
Resolved review comments on src/main/scala/io/qbeast/spark/index/query/QueryIndexStatusExecutor.scala (outdated).
Why do we need the previouslyMappedFiles? What will Delta filter that we cannot? Is there a specific case? If not, it would be better to pass the QbeastFiles directly to the indexStatus instead of intersecting the two lists. Also, in OTreeIndex we are intersecting the file list twice, which is unnecessary.
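A minimal sketch of the double-intersection point, with purely illustrative file names rather than the actual OTreeIndex / IndexStatus types: intersecting the matching files with the index files a second time cannot change the result, so one pass is enough.

```scala
// Purely illustrative file sets; not the actual OTreeIndex / IndexStatus types.
val indexFiles: Set[String]    = Set("a.parquet", "b.parquet", "c.parquet")
val matchingFiles: Set[String] = Set("a.parquet", "c.parquet", "d.parquet")

val once  = matchingFiles.intersect(indexFiles)
val twice = matchingFiles.intersect(indexFiles).intersect(indexFiles)

assert(once == twice) // the second intersection adds nothing
```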
Tests for `QueryIndexStatusExecutor` have also been restructured to use QueryExecutor directly.
I've also simplified the code a bit and added CubeID.isAnchestorOf.
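As an illustration of the ancestor check, here is a minimal sketch assuming cubes are addressed by string paths where each child appends one character to its parent's path; `SimpleCubeId` and its method are hypothetical stand-ins, not the actual CubeId implementation.

```scala
// Illustrative only: a simplified cube identified by its string path,
// where a cube is an ancestor of another if its path is a proper prefix.
final case class SimpleCubeId(path: String) {
  def isAncestorOf(other: SimpleCubeId): Boolean =
    other.path.length > path.length && other.path.startsWith(path)
}

// Usage: "AAA" is an ancestor of "AAAA", but not of "AAB" or of itself.
val parent = SimpleCubeId("AAA")
assert(parent.isAncestorOf(SimpleCubeId("AAAA")))
assert(!parent.isAncestorOf(SimpleCubeId("AAB")))
assert(!parent.isAncestorOf(parent))
```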
It seems good!
We still need to ensure that the QueryExecutor skips the files with maxWeight < weightRange.from
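A hedged sketch of that skipping condition, with hypothetical `FileEntry` and `WeightRange` case classes standing in for the real qbeast-spark metadata types:

```scala
// Hypothetical stand-ins for the real block metadata and query weight range.
final case class FileEntry(path: String, maxWeight: Double)
final case class WeightRange(from: Double, to: Double)

// A file whose maxWeight is below the lower bound of the queried range cannot
// contain any record with weight >= range.from, so it can be skipped entirely.
def filesToRead(files: Seq[FileEntry], range: WeightRange): Seq[FileEntry] =
  files.filter(_.maxWeight >= range.from)
```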
When creating a new test in QueryExecutorTest.scala, I found out that most of these tests use cubeSize = 10. This leads to the cubeSize being ignored and config.DEFAULT_CUBE_SIZE being used instead.
Is this the desired behavior for these tests? There's already a specific test that checks the behavior using a smallCubeSize.
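For illustration, a sketch of how a test could pin a small cube size explicitly on the write side; the option names, data, and path here are assumptions made for the example, not a prescription for the existing tests.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch only: the option names ("columnsToIndex", "cubeSize"),
// the data, and the output path are assumptions made for the example.
val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val smallCubeSize = 10
val df = (1 to 1000).map(i => (i, i * 2)).toDF("a", "b")

df.write
  .format("qbeast")
  .option("columnsToIndex", "a,b")
  .option("cubeSize", smallCubeSize.toString)
  .save("/tmp/qbeast-test-table")
```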
We decided to use a DFS approach instead of a BFS to reduce memory usage.
As Eric found out, we were ignoring the desired cube size for small values. The problem was that we were comparing a Double with an Int, and that led to issues.
The code now works when there's a missing parent, so there's no need to check whether parents are missing.
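A minimal sketch of the DFS idea and of descending past a missing parent, using a hypothetical string-path cube representation rather than the actual QueryExecutor code:

```scala
// Illustrative only: cubes are keyed by string paths, and files are looked up
// per cube. A stack of pending cubes bounds memory to the current branch,
// unlike a full BFS frontier, and children are visited even when a cube has
// no files of its own (a "missing parent"), so descendants are never skipped.
def collectFiles(
    root: String,
    childrenOf: String => Seq[String],
    filesOf: Map[String, Seq[String]]): Seq[String] = {
  val result = Seq.newBuilder[String]
  var stack = List(root)
  while (stack.nonEmpty) {
    val cube = stack.head
    stack = stack.tail
    result ++= filesOf.getOrElse(cube, Seq.empty) // may be empty: missing parent
    stack = childrenOf(cube).toList ::: stack     // descend regardless
  }
  result.result()
}
```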
Due to the bug stated in issue #47 and according to the reasoning in #48, we should work on avoiding the loss of cube records at read time.
The greedy estimation in the first step of the writing protocol can leave missing parents in the index structure. This means that all the data is written down, but the read operation was not designed to handle a situation in which cube "AAAA" is present while its parent "AAA" is not. Even if a parent is missing from the index structure because of poor index quality, the query should still retrieve all the necessary elements. For that, this Draft PR aims to: