Appending data with new columnStats fails in some cases #195

Closed
Adricu8 opened this issue Jun 8, 2023 · 0 comments
Labels
type: bug Something isn't working

Adricu8 (Contributor) commented Jun 8, 2023

What went wrong?

Appending data while specifying new min/max stats (the columnStats write option) fails in some cases.
In particular, I have identified two cases where the new columnStats are not applied correctly:

  • A table indexed by two columns (age: Integer, val2: Integer), appending with new column stats only on column age: the append throws an error about the missing val2 column.
  • A table indexed by two columns (age: Integer, name: String), appending with new column stats on column age: the append succeeds, but no new revision is created.

How to reproduce?

// Case 1: throws an error because columnStats is missing the val2 column
val names = List("age,val2")
val stats_init_write = """{ "age_min": 0, "age_max": 20 }"""
val stats_append = """{ "age_min": 5, "age_max": 30 }"""

// Case 2: does not throw, but no new revision is created
val names = List("age,name")
val stats_init_write = """{ "age_min": 0, "age_max": 20 }"""
val stats_append = """{ "age_min": 5, "age_max": 30 }"""
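The columnStats strings above follow a "<column>_min"/"<column>_max" key convention. As a minimal, hypothetical sketch (names and parsing logic are mine, not qbeast-spark's actual implementation), this is how such a string maps to per-column ranges, and why case 1 leaves val2 uncovered:

```scala
// Hypothetical sketch (NOT qbeast-spark's actual parser): extracting
// per-column ranges from a columnStats JSON string that follows the
// "<column>_min"/"<column>_max" key convention used above.
object ColumnStatsSketch {

  // Matches entries such as "age_min": 5 or "age_max": 30
  private val Entry = """"(\w+)_(min|max)"\s*:\s*(-?\d+(?:\.\d+)?)""".r

  /** Returns e.g. Map("age" -> (5.0, 30.0)) for the stats_append string above. */
  def parse(stats: String): Map[String, (Double, Double)] = {
    val entries = Entry
      .findAllMatchIn(stats)
      .map(m => (m.group(1), m.group(2), m.group(3).toDouble))
      .toSeq
    entries.groupBy(_._1).map { case (col, vs) =>
      val min = vs.collectFirst { case (_, "min", v) => v }.getOrElse(Double.NegativeInfinity)
      val max = vs.collectFirst { case (_, "max", v) => v }.getOrElse(Double.PositiveInfinity)
      (col, (min, max))
    }
  }

  def main(args: Array[String]): Unit = {
    val statsAppend = """{ "age_min": 5, "age_max": 30 }"""
    val parsed = parse(statsAppend)
    // Only "age" carries stats here; on a table indexed by ("age", "val2")
    // there is nothing for "val2", which matches the first failing case.
    assert(parsed == Map(("age", (5.0, 30.0))))
    println(parsed)
  }
}
```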

Full example to reproduce

  // Imports assumed from the qbeast-spark test sources (exact paths may differ):
  // org.apache.spark.sql.delta.DeltaLog, io.qbeast.spark.delta,
  // io.qbeast.core.transform.LinearTransformation
  it should "create a new revision by appending new columnStats" in
    withQbeastContextSparkAndTmpDir { (spark, tmpDir) =>
      val rdd =
        spark.sparkContext.parallelize(
          Seq(
            Client3(1, "student-1", 10, 1000 + 123, 2567.3432143),
            Client3(2, "student-2", 15, 2 * 1000 + 123, 2 * 2567.3432143)))

      val df = spark.createDataFrame(rdd)

      val names = List("age,name")
      val stats_init_write = """{ "age_min": 0, "age_max": 20 }"""
      val stats_append = """{ "age_min": 5, "age_max": 30 }"""

      // Initial write with the original column stats
      df.write
        .format("qbeast")
        .mode("overwrite")
        .options(Map("columnsToIndex" -> names.mkString(","), "columnStats" -> stats_init_write))
        .save(tmpDir)

      // Append with new (wider) column stats: a new revision is expected
      df.write
        .format("qbeast")
        .mode("append")
        .option("columnsToIndex", names.mkString(","))
        .option("columnStats", stats_append)
        .save(tmpDir)

      val deltaLog = DeltaLog.forTable(spark, tmpDir)
      val qbeastSnapshot = delta.DeltaQbeastSnapshot(deltaLog.snapshot)
      val transformation = qbeastSnapshot.loadLatestRevision.transformations.head

      qbeastSnapshot.loadLatestRevision.revisionID shouldBe 2
      transformation shouldBe a[LinearTransformation]
      transformation.asInstanceOf[LinearTransformation].minNumber shouldBe 5
      transformation.asInstanceOf[LinearTransformation].maxNumber shouldBe 30
    }
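The assertions above expect the appended stats to produce a new revision (ID 2) whose transformation spans the new min/max. As a tiny self-contained model of that expected behaviour (all names here are hypothetical, not qbeast-spark code):

```scala
// Hypothetical model of the EXPECTED behaviour, not qbeast-spark code.
// An append whose columnStats fall outside the current revision's range
// should create a new revision carrying the appended min/max.
final case class RevisionSketch(id: Long, min: Double, max: Double) {

  /** True when the appended stats already fit inside this revision's range. */
  def contains(newMin: Double, newMax: Double): Boolean =
    newMin >= min && newMax <= max

  /** Keep this revision if the stats fit; otherwise create revision id + 1. */
  def appendWithStats(newMin: Double, newMax: Double): RevisionSketch =
    if (contains(newMin, newMax)) this
    else RevisionSketch(id + 1, newMin, newMax)
}

object RevisionSketchDemo {
  def main(args: Array[String]): Unit = {
    val initial  = RevisionSketch(id = 1, min = 0, max = 20) // stats_init_write
    val appended = initial.appendWithStats(5, 30)            // stats_append
    // Mirrors the test's assertions: revision ID 2 spanning the new stats.
    assert(appended == RevisionSketch(2, 5, 30))
    println(appended)
  }
}
```

In the second failing case above, no new revision is created even though (5, 30) does not fit inside (0, 20), which is what this model says should trigger one.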

2. Branch and commit id:

main branch

3. Spark version:

3.2.3

4. Hadoop version:

3.2.3

@Adricu8 Adricu8 added type: bug Something isn't working high labels Jun 8, 2023
osopardo1 added a commit to osopardo1/qbeast-spark that referenced this issue Aug 2, 2023