
Query fails to retrieve elements that are in the 1.0 space #111

Closed
osopardo1 opened this issue Jun 23, 2022 · 0 comments · Fixed by #113
Labels: type: bug (Something isn't working)


osopardo1 commented Jun 23, 2022

What went wrong?

After indexing the GitHub dataset by repo_main_language and year, we started running queries to test the tolerance of each group.

In some of those queries, elements were missing. For example:

df_qbeast.where("""repo_main_language == "SAS" and year == 2022""").count()

was not equal to

df_parquet.where("""repo_main_language == "SAS" and year == 2022""").count()
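
Why would a point predicate on the newest year lose rows only on the Qbeast side? A minimal sketch, assuming the indexed columns are min-max scaled into the [0, 1] index space (the scaling function and the 2011..2022 year range below are illustrative assumptions, not the actual Qbeast transformer):

```scala
// Hedged sketch: assume each indexed column is min-max scaled into [0, 1]
// (illustrative only; Qbeast's real transformation may differ).
def normalize(v: Double, min: Double, max: Double): Double =
  (v - min) / (max - min)

// If 2022 is the largest year observed (say years span 2011..2022),
// the point predicate year == 2022 maps onto the boundary of the space:
val point = normalize(2022, 2011, 2022)
assert(point == 1.0) // from = to = 1.0: a degenerate space at the upper bound
```

Any predicate that selects the maximum indexed value therefore produces a query space sitting exactly on the upper bound, which is where the half-open intersection test below goes wrong.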

Digging deeper, we found that the year == 2022 query was translated into the space (from = 1.0, to = 1.0). The condition for deciding whether to retrieve a cube is:

  private def intersects(f: Double, t: Double, cube: CubeId, coordinate: Int): Boolean = {
    val cf = cube.from.coordinates(coordinate) // cube lower bound on this axis
    val ct = cube.to.coordinates(coordinate)   // cube upper bound on this axis
    (f <= cf && cf < t) || (cf <= f && f < ct) // both clauses are half-open
  }

So, in the case of f == t == 1.0, the answer is always false: the first clause requires cf >= 1.0 and cf < 1.0 at the same time, and the second requires ct > 1.0, which no coordinate in [0, 1] can satisfy.
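
The failure can be reproduced in isolation. The sketch below mimics intersects with plain Double bounds instead of a CubeId, confirms that the degenerate space at 1.0 never matches, and shows one hedged candidate fix, comparing inclusively when the query space collapses to a point (this is only an illustration; the actual change merged in #113 may differ):

```scala
// Standalone mimic of the half-open intersection test quoted above;
// cf/ct stand in for cube.from/cube.to coordinates on one axis.
def intersects(f: Double, t: Double, cf: Double, ct: Double): Boolean =
  (f <= cf && cf < t) || (cf <= f && f < ct)

// Root cube covering the whole [0.0, 1.0] axis:
assert(intersects(0.5, 0.6, 0.0, 1.0))  // a normal range query matches
assert(!intersects(1.0, 1.0, 0.0, 1.0)) // the degenerate space at 1.0 never does

// Hedged candidate fix (not necessarily what PR #113 does): treat a
// point query inclusively on both ends so the boundary 1.0 is reachable.
def intersectsFixed(f: Double, t: Double, cf: Double, ct: Double): Boolean =
  if (f == t) cf <= f && f <= ct
  else (f <= cf && cf < t) || (cf <= f && f < ct)

assert(intersectsFixed(1.0, 1.0, 0.0, 1.0)) // elements in the 1.0 space are found
```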

How to reproduce?

  1. Read and write the GitHub Archive Dataset, then compare the counts:

df.write.format("qbeast").option("columnsToIndex", "repo_main_language,year").save("path")

val df_qbeast = spark.read.format("qbeast").load("path")
val df_parquet = spark.read.format("parquet").load("path")

df_qbeast.where("""repo_main_language == "SAS" and year == 2022""").count()
df_parquet.where("""repo_main_language == "SAS" and year == 2022""").count()

2. Branch and commit id:

main bb08083

3. Spark version:

In the Spark shell, run spark.version.

3.1.2

4. Hadoop version:

In the Spark shell, run org.apache.hadoop.util.VersionInfo.getVersion().

3.2.0

5. How are you running Spark?

Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local machine?

6. Stack trace:

Trace of the log/error messages.
