
Query fails to retrieve elements that are in the 1.0 space #111

Closed
osopardo1 opened this issue Jun 23, 2022 · 0 comments · Fixed by #113
Labels: type: bug (Something isn't working)


osopardo1 commented Jun 23, 2022

What went wrong?

After indexing the GitHub dataset by repo_main_language and year, we started running queries to test the tolerance of each group.

In some of those queries, elements were missing. For example:

df_qbeast.where("""repo_main_language == "SAS" and year == 2022""").count()

was not equal to

df_parquet.where("""repo_main_language == "SAS" and year == 2022""").count()
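
Why would a point predicate on the newest year lose rows only on the Qbeast side? A minimal sketch, assuming the indexed columns are min-max scaled into the [0, 1] index space (the scaling function and the 2011..2022 year range below are illustrative assumptions, not the actual Qbeast transformer):

```scala
// Hedged sketch: assume each indexed column is min-max scaled into [0, 1]
// (illustrative only; Qbeast's real transformation may differ).
def normalize(v: Double, min: Double, max: Double): Double =
  (v - min) / (max - min)

// If 2022 is the largest year observed (say years span 2011..2022),
// the point predicate year == 2022 maps onto the boundary of the space:
val point = normalize(2022, 2011, 2022)
assert(point == 1.0) // from = to = 1.0: a degenerate space at the upper bound
```

Any predicate that selects the maximum indexed value therefore produces a query space sitting exactly on the upper bound, which is where the half-open intersection test below goes wrong.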

Digging deeper, we found that the year == 2022 query was translated into the space (from = 1.0, to = 1.0). The condition for deciding whether to retrieve a cube is:

  private def intersects(f: Double, t: Double, cube: CubeId, coordinate: Int): Boolean = {
    val cf = cube.from.coordinates(coordinate) // cube lower bound on this axis
    val ct = cube.to.coordinates(coordinate)   // cube upper bound on this axis
    (f <= cf && cf < t) || (cf <= f && f < ct) // both clauses are half-open
  }

So, in the case of f == t == 1.0, the answer is always false: the first clause requires cf >= 1.0 and cf < 1.0 at the same time, and the second requires ct > 1.0, which no coordinate in [0, 1] can satisfy.
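
The failure can be reproduced in isolation. The sketch below mimics intersects with plain Double bounds instead of a CubeId, confirms that the degenerate space at 1.0 never matches, and shows one hedged candidate fix, comparing inclusively when the query space collapses to a point (this is only an illustration; the actual change merged in #113 may differ):

```scala
// Standalone mimic of the half-open intersection test quoted above;
// cf/ct stand in for cube.from/cube.to coordinates on one axis.
def intersects(f: Double, t: Double, cf: Double, ct: Double): Boolean =
  (f <= cf && cf < t) || (cf <= f && f < ct)

// Root cube covering the whole [0.0, 1.0] axis:
assert(intersects(0.5, 0.6, 0.0, 1.0))  // a normal range query matches
assert(!intersects(1.0, 1.0, 0.0, 1.0)) // the degenerate space at 1.0 never does

// Hedged candidate fix (not necessarily what PR #113 does): treat a
// point query inclusively on both ends so the boundary 1.0 is reachable.
def intersectsFixed(f: Double, t: Double, cf: Double, ct: Double): Boolean =
  if (f == t) cf <= f && f <= ct
  else (f <= cf && cf < t) || (cf <= f && f < ct)

assert(intersectsFixed(1.0, 1.0, 0.0, 1.0)) // elements in the 1.0 space are found
```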

How to reproduce?

  1. Read and write the GitHub Archive Dataset, then compare the counts:

df.write.format("qbeast").option("columnsToIndex", "repo_main_language,year").save("path")

val df_qbeast = spark.read.format("qbeast").load("path")
val df_parquet = spark.read.format("parquet").load("path")

df_qbeast.where("""repo_main_language == "SAS" and year == 2022""").count()
df_parquet.where("""repo_main_language == "SAS" and year == 2022""").count()

2. Branch and commit id:

main bb08083

3. Spark version:

In the Spark shell, run spark.version.

3.1.2

4. Hadoop version:

In the Spark shell, run org.apache.hadoop.util.VersionInfo.getVersion().

3.2.0

5. How are you running Spark?

Are you running Spark inside a container? Are you launching the app on a remote K8s cluster? Or are you just running the tests on a local machine?

6. Stack trace:

Trace of the log/error messages.
