Redundant metadata entries #267

Jiaweihu08 · 2024-02-09T15:14:17Z

What went wrong?

We create redundant metadata entries for each write operation, including appends that update neither the schema nor the space Revision.

On top of that, creating metadata when it is not required prevents interleaved concurrent writes from committing (see ConflictChecker.checkNoMetadataUpdates in Delta Lake).
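
To illustrate the condition described above, here is a minimal sketch that assumes a write only needs a new metadata entry when it changes the schema or the Revision. `TableState`, `revisionId`, and `needsMetadataUpdate` are hypothetical names chosen for this example; they are not part of the qbeast-spark or Delta Lake APIs, and this is not the actual fix.

```scala
// Hypothetical, simplified model; not the actual qbeast-spark implementation.
final case class TableState(schema: Seq[(String, String)], revisionId: Long)

object MetadataGuard {
  // A write needs a Metadata action only if it changes the schema or the Revision.
  def needsMetadataUpdate(current: TableState, afterWrite: TableState): Boolean =
    current.schema != afterWrite.schema || current.revisionId != afterWrite.revisionId
}

object MetadataGuardExample {
  val current: TableState =
    TableState(Seq("col_1" -> "int", "col_2" -> "int"), revisionId = 1L)

  // Plain append: same schema and same Revision, so no metadata entry is required.
  val plainAppend: TableState = current

  // Append that adds a column: the schema changes, so metadata must be written.
  val schemaChange: TableState =
    current.copy(schema = current.schema :+ ("col_3" -> "string"))

  // Append that triggers a new Revision: metadata must be written as well.
  val newRevision: TableState = current.copy(revisionId = 2L)
}
```

With a guard of this kind, a plain append would commit only file actions, leaving the metadata untouched for concurrent writers.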

How to reproduce?

Append some data to an existing table. The data should cause neither a schema change nor the creation of a new revision.
Check for the existence of a metadata entry in the _delta_log for this append.

1. Code that triggered the bug, or steps to reproduce:

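import org.apache.spark.sql.delta.DeltaLog
import org.apache.spark.sql.delta.actions.{Action, Metadata}
import org.apache.spark.sql.delta.util.FileNames

// Create table (commit version 0)
df.write.mode("overwrite").format("qbeast").option("columnsToIndex", "col_1,col_2").save(tmpDir)
// Append with no schema change and no Revision update (commit version 1)
df.write.mode("append").format("qbeast").save(tmpDir)

// Read the log file for version 1 and collect any Metadata actions it contains
val deltaLog = DeltaLog.forTable(spark, tmpDir)
val noMetadataForAppend = (deltaLog
  .store.read(FileNames.deltaFile(deltaLog.logPath, 1L), deltaLog.newDeltaHadoopConf())
  .map(Action.fromJson)
  .collect { case a: Metadata => a }
  .isEmpty)

// The append should not have written a Metadata action
assert(noMetadataForAppend, "Redundant metadata detected!")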

2. Branch and commit id: 6a780ea

3. Spark version: `3.5.0`

4. Hadoop version: `3.3.4`

5. How are you running Spark?: Locally and on AWS EMR

6. Stack trace:

The redundant metadata prevents concurrent writes:

io.delta.exceptions.MetadataChangedException: The metadata of the Delta table has been changed by a concurrent update. Please try the operation again.
Conflicting commit: {"timestamp":...,"operation":"WRITE","operationParameters":{"mode":Append},"readVersion":...,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"...","numOutputRows":"...","numOutputBytes":"..."},"engineInfo":"Apache-Spark/3.4.2 Delta-Lake/2.4.0","txnId":"..."}
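
The failure above can be modelled with a small sketch. It assumes that the check referenced earlier (ConflictChecker.checkNoMetadataUpdates) rejects a transaction whenever a commit that landed first contains a metadata update. The types and the function below are hypothetical stand-ins written for this example, not Delta Lake classes.

```scala
// Simplified model of the conflict rule; hypothetical types, not Delta Lake classes.
sealed trait LogAction
case object MetadataAction extends LogAction
final case class AddFileAction(path: String) extends LogAction

object ConflictModel {
  // Returns Left(reason) when any commit that landed first contains a metadata update.
  def checkNoMetadataUpdates(winningCommits: Seq[Seq[LogAction]]): Either[String, Unit] =
    if (winningCommits.exists(_.contains(MetadataAction)))
      Left("metadata changed by a concurrent update")
    else
      Right(())

  // A concurrent append that also wrote a redundant Metadata action.
  val redundantAppend: Seq[LogAction] = Seq(MetadataAction, AddFileAction("part-0.parquet"))

  // The same append without the redundant Metadata action.
  val cleanAppend: Seq[LogAction] = Seq(AddFileAction("part-0.parquet"))
}
```

In this model, a writer racing against `redundantAppend` is rejected, while one racing against `cleanAppend` passes the check, which is why removing the redundant entry lets blind appends interleave.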


osopardo1 · 2024-03-27T13:10:24Z

Merged on #284

Jiaweihu08 added the type: bug and priority: high labels Feb 9, 2024

Jiaweihu08 self-assigned this Feb 9, 2024

Jiaweihu08 mentioned this issue Feb 9, 2024

267 Remove redundant metadata creations #268

Merged

5 tasks

osopardo1 added the 1.0.0 label and removed the priority: high label Feb 13, 2024

osopardo1 added the 1.0.0-merged label Mar 13, 2024

osopardo1 closed this as completed Mar 27, 2024
