Appending data with new columnStats fails in some cases #195

Closed
Adricu8 opened this issue Jun 8, 2023 · 0 comments
Labels
type: bug Something isn't working

Adricu8 (Contributor) commented Jun 8, 2023

What went wrong?

Appending data while specifying new min/max stats (the columnStats write option) fails in some cases.
In particular, I have identified two cases where the new columnStats are not applied correctly:

  • A table indexed by two columns (age: Integer, val2: Integer), appending with new column stats only on column age: the append throws an error about the missing val2 column.
  • A table indexed by two columns (age: Integer, name: String), appending with new column stats on column age: the append succeeds, but no new revision is created.

How to reproduce?

// Case 1: throws an error because columnStats is missing the val2 column
val names = List("age,val2")
val stats_init_write = """{ "age_min": 0, "age_max": 20 }"""
val stats_append = """{ "age_min": 5, "age_max": 30 }"""

// Case 2: does not throw, but no new revision is created
val names = List("age,name")
val stats_init_write = """{ "age_min": 0, "age_max": 20 }"""
val stats_append = """{ "age_min": 5, "age_max": 30 }"""
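The columnStats strings above follow a "<column>_min"/"<column>_max" key convention. As a minimal, hypothetical sketch (names and parsing logic are mine, not qbeast-spark's actual implementation), this is how such a string maps to per-column ranges, and why case 1 leaves val2 uncovered:

```scala
// Hypothetical sketch (NOT qbeast-spark's actual parser): extracting
// per-column ranges from a columnStats JSON string that follows the
// "<column>_min"/"<column>_max" key convention used above.
object ColumnStatsSketch {

  // Matches entries such as "age_min": 5 or "age_max": 30
  private val Entry = """"(\w+)_(min|max)"\s*:\s*(-?\d+(?:\.\d+)?)""".r

  /** Returns e.g. Map("age" -> (5.0, 30.0)) for the stats_append string above. */
  def parse(stats: String): Map[String, (Double, Double)] = {
    val entries = Entry
      .findAllMatchIn(stats)
      .map(m => (m.group(1), m.group(2), m.group(3).toDouble))
      .toSeq
    entries.groupBy(_._1).map { case (col, vs) =>
      val min = vs.collectFirst { case (_, "min", v) => v }.getOrElse(Double.NegativeInfinity)
      val max = vs.collectFirst { case (_, "max", v) => v }.getOrElse(Double.PositiveInfinity)
      (col, (min, max))
    }
  }

  def main(args: Array[String]): Unit = {
    val statsAppend = """{ "age_min": 5, "age_max": 30 }"""
    val parsed = parse(statsAppend)
    // Only "age" carries stats here; on a table indexed by ("age", "val2")
    // there is nothing for "val2", which matches the first failing case.
    assert(parsed == Map(("age", (5.0, 30.0))))
    println(parsed)
  }
}
```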

Full example to reproduce

  // Imports assumed from the qbeast-spark test sources (exact paths may differ):
  // org.apache.spark.sql.delta.DeltaLog, io.qbeast.spark.delta,
  // io.qbeast.core.transform.LinearTransformation
  it should "create a new revision by appending new columnStats" in
    withQbeastContextSparkAndTmpDir { (spark, tmpDir) =>
      val rdd =
        spark.sparkContext.parallelize(
          Seq(
            Client3(1, "student-1", 10, 1000 + 123, 2567.3432143),
            Client3(2, "student-2", 15, 2 * 1000 + 123, 2 * 2567.3432143)))

      val df = spark.createDataFrame(rdd)

      val names = List("age,name")
      val stats_init_write = """{ "age_min": 0, "age_max": 20 }"""
      val stats_append = """{ "age_min": 5, "age_max": 30 }"""

      // Initial write with the original column stats
      df.write
        .format("qbeast")
        .mode("overwrite")
        .options(Map("columnsToIndex" -> names.mkString(","), "columnStats" -> stats_init_write))
        .save(tmpDir)

      // Append with new (wider) column stats: a new revision is expected
      df.write
        .format("qbeast")
        .mode("append")
        .option("columnsToIndex", names.mkString(","))
        .option("columnStats", stats_append)
        .save(tmpDir)

      val deltaLog = DeltaLog.forTable(spark, tmpDir)
      val qbeastSnapshot = delta.DeltaQbeastSnapshot(deltaLog.snapshot)
      val transformation = qbeastSnapshot.loadLatestRevision.transformations.head

      qbeastSnapshot.loadLatestRevision.revisionID shouldBe 2
      transformation shouldBe a[LinearTransformation]
      transformation.asInstanceOf[LinearTransformation].minNumber shouldBe 5
      transformation.asInstanceOf[LinearTransformation].maxNumber shouldBe 30
    }
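The assertions above expect the appended stats to produce a new revision (ID 2) whose transformation spans the new min/max. As a tiny self-contained model of that expected behaviour (all names here are hypothetical, not qbeast-spark code):

```scala
// Hypothetical model of the EXPECTED behaviour, not qbeast-spark code.
// An append whose columnStats fall outside the current revision's range
// should create a new revision carrying the appended min/max.
final case class RevisionSketch(id: Long, min: Double, max: Double) {

  /** True when the appended stats already fit inside this revision's range. */
  def contains(newMin: Double, newMax: Double): Boolean =
    newMin >= min && newMax <= max

  /** Keep this revision if the stats fit; otherwise create revision id + 1. */
  def appendWithStats(newMin: Double, newMax: Double): RevisionSketch =
    if (contains(newMin, newMax)) this
    else RevisionSketch(id + 1, newMin, newMax)
}

object RevisionSketchDemo {
  def main(args: Array[String]): Unit = {
    val initial  = RevisionSketch(id = 1, min = 0, max = 20) // stats_init_write
    val appended = initial.appendWithStats(5, 30)            // stats_append
    // Mirrors the test's assertions: revision ID 2 spanning the new stats.
    assert(appended == RevisionSketch(2, 5, 30))
    println(appended)
  }
}
```

In the second failing case above, no new revision is created even though (5, 30) does not fit inside (0, 20), which is what this model says should trigger one.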

2. Branch and commit id:

main branch

3. Spark version:

3.2.3

4. Hadoop version:

3.2.3

@Adricu8 Adricu8 added type: bug Something isn't working high labels Jun 8, 2023
osopardo1 added a commit to osopardo1/qbeast-spark that referenced this issue Aug 2, 2023