Redundant metadata entries #267

Closed
Jiaweihu08 opened this issue Feb 9, 2024 · 1 comment
Labels
type: bug Something isn't working

Comments

@Jiaweihu08 (Member)

What went wrong?

We create a redundant metadata entry for every write operation, including appends that update neither the schema nor the space Revision.

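On top of that, writing a metadata entry when none is required prevents interleaved concurrent writes from committing: Delta Lake rejects a transaction when a commit that landed concurrently contains a metadata update (see ConflictChecker.checkNoMetadataUpdates in Delta Lake).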
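
To make the failure mode concrete, here is a minimal, self-contained model of that kind of check. It is only a sketch: `LogAction`, `Commit`, and `ConflictModel` are illustrative names invented for this example, and the logic is a simplification of the idea, not Delta Lake's actual implementation.

```scala
// Toy model of an optimistic-concurrency metadata check.
// All types and names are hypothetical; this is NOT Delta Lake's code.
sealed trait LogAction
final case class AddFile(path: String) extends LogAction
final case class MetadataUpdate(schema: String) extends LogAction

final case class Commit(version: Long, actions: Seq[LogAction])

object ConflictModel {
  // A writer that read the table at `readVersion` may only commit if no
  // commit that landed after that version carries a metadata update.
  def checkNoMetadataUpdates(readVersion: Long, log: Seq[Commit]): Either[String, Unit] = {
    val winningCommits = log.filter(_.version > readVersion)
    winningCommits.find(_.actions.exists(_.isInstanceOf[MetadataUpdate])) match {
      case Some(c) => Left(s"metadata changed by concurrent commit ${c.version}")
      case None    => Right(())
    }
  }
}

// Table creation legitimately writes metadata.
val create = Commit(0L, Seq(MetadataUpdate("col_1,col_2"), AddFile("part-0")))
// An append that re-writes identical metadata (the behaviour reported here).
val redundantAppend = Commit(1L, Seq(MetadataUpdate("col_1,col_2"), AddFile("part-1")))
// The same append without the redundant metadata entry.
val cleanAppend = Commit(1L, Seq(AddFile("part-1")))

// A second writer that read version 0 is rejected only in the redundant case.
val rejected = ConflictModel.checkNoMetadataUpdates(0L, Seq(create, redundantAppend))
val accepted = ConflictModel.checkNoMetadataUpdates(0L, Seq(create, cleanAppend))
```

In this model the check only looks at whether a metadata action is present, not at whether its content differs, which is why an unchanged but re-written metadata entry is enough to block a concurrent append.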

How to reproduce?

  1. Append some data to an existing table. The data should cause neither a schema change nor the creation of a new revision.
  2. Check for the existence of a metadata entry in the _delta_log for this append.

1. Code that triggered the bug, or steps to reproduce:

import org.apache.spark.sql.delta.DeltaLog
import org.apache.spark.sql.delta.actions.{Action, Metadata}
import org.apache.spark.sql.delta.util.FileNames

// Create table (first commit, version 0 for a fresh tmpDir)
df.write.mode("overwrite").format("qbeast").option("columnsToIndex", "col_1,col_2").save(tmpDir)
// Append with no schema change and no Revision update (version 1)
df.write.mode("append").format("qbeast").save(tmpDir)

// Read the commit file of the append and look for a Metadata action
val deltaLog = DeltaLog.forTable(spark, tmpDir)
val noMetadataForAppend = deltaLog.store
  .read(FileNames.deltaFile(deltaLog.logPath, 1L), deltaLog.newDeltaHadoopConf())
  .map(Action.fromJson)
  .collect { case a: Metadata => a }
  .isEmpty

assert(noMetadataForAppend, "Redundant metadata detected!")
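
The behaviour that the assertion above expects can be sketched as a simple guard: a write only needs a new metadata entry when the schema or the Revision actually changes. The snippet below is a hypothetical illustration of that rule; `TableState`, `MetadataGuard`, and `needsMetadataUpdate` are made-up names and do not reflect the project's real classes or the fix that was eventually merged.

```scala
// Hypothetical sketch of the expected rule; not the project's actual API.
final case class TableState(schema: Seq[String], revisionId: Long)

object MetadataGuard {
  // Emit a metadata entry only if the write changes the schema or the Revision.
  def needsMetadataUpdate(before: TableState, after: TableState): Boolean =
    before.schema != after.schema || before.revisionId != after.revisionId
}

val current = TableState(Seq("col_1", "col_2"), revisionId = 1L)

// Plain append: same schema, same Revision.
val plainAppend = MetadataGuard.needsMetadataUpdate(current, current)
// Append that adds a column.
val schemaChange = MetadataGuard.needsMetadataUpdate(current, current.copy(schema = current.schema :+ "col_3"))
// Append that triggers a new Revision.
val revisionChange = MetadataGuard.needsMetadataUpdate(current, current.copy(revisionId = 2L))
```

Under this rule the append in the reproduction would carry only file actions, so the commit for version 1 would contain no metadata entry.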

2. Branch and commit id: 6a780ea

3. Spark version: 3.5.0

4. Hadoop version: 3.3.4

5. How are you running Spark?: Locally and on AWS EMR

6. Stack trace:

The redundant metadata prevents concurrent writes:

io.delta.exceptions.MetadataChangedException: The metadata of the Delta table has been changed by a concurrent update. Please try the operation again.
Conflicting commit: {"timestamp":...,"operation":"WRITE","operationParameters":{"mode":Append},"readVersion":...,"isolationLevel":"Serializable","isBlindAppend":true,"operationMetrics":{"numFiles":"...","numOutputRows":"...","numOutputBytes":"..."},"engineInfo":"Apache-Spark/3.4.2 Delta-Lake/2.4.0","txnId":"..."}
@Jiaweihu08 added the "type: bug" and "priority: high" labels on Feb 9, 2024
@Jiaweihu08 self-assigned this on Feb 9, 2024
@osopardo1 added 1.0.0 and removed the "priority: high" label on Feb 13, 2024
@osopardo1 (Member)

Merged on #284
