Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Protocol Change] Add the In-Commit Timestamps spec change proposal #2599

Merged
merged 8 commits into from
Feb 8, 2024

Conversation

dhruvarya-db
Copy link
Collaborator

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (Protocol)

Description

Adds the proposal for spec change In-Commit Timestamps (see #2532) to the RFC folder.

How was this patch tested?

Spec-only change.

Does this PR introduce any user-facing changes?

No

protocol_rfcs/in-commit-timestamps.md Outdated Show resolved Hide resolved
### Commit Provenance Information
A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.

Implementations are free to store any valid JSON-formatted data via the `commitInfo` action ~~.~~ **as long as any table feature (see [In-Commit Timestamps](#in-commit-timestamps)) does not impose additional requirements on the data.**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to highlight insertions with <ins>...</ins> tags instead of bold, since bold can occur naturally in the spec?

Suggested change
Implementations are free to store any valid JSON-formatted data via the `commitInfo` action ~~.~~ **as long as any table feature (see [In-Commit Timestamps](#in-commit-timestamps)) does not impose additional requirements on the data.**
Implementations are free to store any valid JSON [object literal](https://www.w3schools.com/js/js_json_objects.asp) as the `commitInfo` action <ins>unless some table feature (e.g. [In-Commit Timestamps](#in-commit-timestamps)) imposes additional requirements on the data</ins>.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meanwhile tho... this seems messy. We should really think about defining some kind of basic schema for commit info, so that table features don't have to completely change the commit info spec just to ensure a field is present.

See also #1548 and delta-io/delta-rs#1017; the latter was caused because Delta spark uses an array-like string: "partitionBy": "[...]" while Delta trino uses an actual array: "partitionBy": [...]". That sort of variation is really hard for clients to parse reliably, if they're using anything more structured that raw JSON parser nodes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meanwhile, I suggest that we correct the spec to say "valid JSON object literal" instead of "valid JSON-formatted data", to make clear that clients can expect commitInfo to parse as an object rather than one of the other JSON data types (array, string, numeric, boolean, null). In other words, only the first line of the following snippet would be a legal commitInfo action:

{ "commitInfo": { } }
{ "commitInfo": [ ] }
{ "commitInfo": 10 }
{ "commitInfo": null }

AFAIK, all clients out there are already using proper object literals, because otherwise they would break interop with mainstream clients like spark.

We should probably track this clarification as a separate issue, tho, because it's fixing a spec erratum (no associated table feature) and so wouldn't need to go through the RFC process?

protocol_rfcs/in-commit-timestamps.md Outdated Show resolved Hide resolved
protocol_rfcs/in-commit-timestamps.md Show resolved Hide resolved
protocol_rfcs/in-commit-timestamps.md Outdated Show resolved Hide resolved
When In-Commit Timestamps is enabled, then:
1. Writers must write the `commitInfo` (see [Commit Provenance Information](#commit-provenance-information)) action in the commit.
2. The `commitInfo` action must be the first action in the commits.
3. The `commitInfo` action must include a field named `inCommitTimestamp`, of type `timestamp` (see [Primitive Types](#primitive-types)), which represents the UTC time when the commit is considered to have succeeded. It is the larger of two values:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK, the timestamp type (with µs resolution) is only used for user data today. All other Delta spec timestamps seem to be long type (ms since the unix epoch):

What do we gain from a µs resolution in-commit timestamp?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like CDC readers are expected to return a Timestamp (https://github.com/delta-io/delta/pull/2599/files#diff-ff95f55525ec15c41f84b2215cd0f45fbdf4a88b2e4c4ef126ed3bea0b900597R28) but I agree, we don't really need µs resolution so a Long should suffice.

Comment on lines +36 to +50
4. If the table has commits from a period when this feature was not enabled, provenance information around when this feature was enabled must be tracked in table properties:
- The property `delta.inCommitTimestampEnablementVersion` must be used to track the version of the table when this feature was enabled.
- The property `delta.inCommitTimestampEnablementTimestamp` must be the same as the `inCommitTimestamp` of the commit when this feature was enabled.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly:

feature enabled property present meaning
no no feature not enabled
no yes illegal properties
yes no ICT has "always" been enabled (at least for all commits within the history retention window)
yes yes ICT was enabled "recently" (possibly within the current history retention window)

Yes?

And it is an error for a commitInfo to not provide the in-commit timestamp, if it lies within the ICT window those properties identify?

Should we just leave the properties in place permanently, for simpler checking? Or are they bulky enough it's helpful to garbage collect them at some point? If we did GC them, we'd probably want to do it only when metadata is anyway changing for some other reason, since table property changes conflict with all concurrent transactions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes?

Yes, this is exactly correct. Should I just add this table to the spec?

And it is an error for a commitInfo to not provide the in-commit timestamp, if it lies within the ICT window those properties identify?

Yes.

Should we just leave the properties in place permanently, for simpler checking? Or are they bulky enough it's helpful to garbage collect them at some point? If we did GC them, we'd probably want to do it only when metadata is anyway changing for some other reason, since table property changes conflict with all concurrent transactions.

Would it be okay to just leave this up to the writer? The spec, as written, allows for both possibilities.


## Recommendations for Readers of Tables with In-Commit Timestamps

For tables with In-Commit timestamps enabled, readers should use the `inCommitTimestamp` as the commit timestamp for operations like time travel.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mention DESCRIBE HISTORY? It's technically a spark-specific command, but it seems reasonable that other clients would want to provide something similar if they don't already.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a references to DESCRIBE HISTORY

Comment on lines 44 to 59
If a table has commits from a period before In-Commit timestamps were enabled, the table property `delta.inCommitTimestampEnablementVersion` would be set and can be used to identify commits that don't have `inCommitTimestamp`.
To correctly determine the commit timestamp for these tables, readers can use the following rules:
1. For commits with version >= `delta.inCommitTimestampEnablementVersion`, readers should use the `inCommitTimestamp` field of the `commitInfo` action.
2. For commits with version < `delta.inCommitTimestampEnablementVersion`, readers should use the file modification timestamp.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be talking about version-based time travel, which isn't really related. If the client is trying to do timestamp-based time travel, they should first check whether the timestamp is before the delta.inCommitTimestampEnablementTimestamp; if so, timestamp resolution should only consider commits before delta.inCommitTimestampEnablementVersion, using file modification timestamps. Otherwise, timestamp resolution should only consider commits starting from delta.inCommitTimestampEnablementVersion, using in-commit timestamps.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this as a separate section: https://github.com/delta-io/delta/pull/2599/files#r1480345698. Could combine these two if that looks better.


When In-Commit Timestamp are enabled, writers are required to include a commitInfo action with every commit, which must include the `inCommitTimestamp` field.

#### Reader Requirements for AddCDCFile
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added new changes to this section.

3. The `commitInfo` action must include a field named `inCommitTimestamp`, of type `long` (see [Primitive Types](#primitive-types)), which represents the UTC time in milliseconds when the commit is considered to have succeeded. It is the larger of two values:
- The UTC wall clock time at which the writer attempted the commit
3. The `commitInfo` action must include a field named `inCommitTimestamp`, of type `long` (see [Primitive Types](#primitive-types)), which represents the Unix time in milliseconds when the commit is considered to have succeeded. It is the larger of two values:
- The Unix wall clock time at which the writer attempted the commit
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re "Unix time" -- most other parts of the spec say "time in milliseconds since the Unix epoch"

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced both instances with "in milliseconds since the Unix epoch"

A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it.

Implementations are free to store any valid JSON-formatted data via the `commitInfo` action ~~.~~ **as long as any table feature (see [In-Commit Timestamps](#in-commit-timestamps)) does not impose additional requirements on the data.**
Implementations are free to store any valid JSON [object literal](https://www.w3schools.com/js/js_json_objects.asp) as the `commitInfo` action <ins>unless some table feature (e.g. [In-Commit Timestamps](#in-commit-timestamps)) imposes additional requirements on the data</ins>.

When In-Commit Timestamp are enabled, writers are required to include a commitInfo action with every commit, which must include the `inCommitTimestamp` field.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing <ins> tag to mark it as new text?

Field Name | Data Type | Description
-|-|-
_commit_version|`Long`| The table version containing the change. This can be derived from the name of the Delta log file that contains actions.
_commit_timestamp|`Timestamp`| The timestamp associated when the commit was created. ~~This can be derived from the file modification time of the Delta log file that contains actions.~~ <ins> Depending on whether [In-Commit Timestamps](#in-commit-timestamps) are enabled, this is either the file modification time or the `inCommitTimestamp` stored in the `commitInfo` action of the Delta log file with the version `__commit_version`.</ins>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_commit_timestamp|`Timestamp`| The timestamp associated when the commit was created. ~~This can be derived from the file modification time of the Delta log file that contains actions.~~ <ins> Depending on whether [In-Commit Timestamps](#in-commit-timestamps) are enabled, this is either the file modification time or the `inCommitTimestamp` stored in the `commitInfo` action of the Delta log file with the version `__commit_version`.</ins>
_commit_timestamp|`Timestamp`| The timestamp associated when the commit was created. ~~This can be derived from the file modification time of the Delta log file that contains actions.~~ <ins> Depending on whether [In-Commit Timestamps](#in-commit-timestamps) are enabled, this is either the file modification time or the `inCommitTimestamp` stored in the `commitInfo` action of the Delta log file with the version `_commit_version`.</ins>

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this metadata column has type Timestamp, we probably need to keep the "derived" language. Specifically, reader needs to convert from ms since unix epoch to µs resolution UTC.

Maybe:

The timestamp associated when the commit was created. This can be derived from the file modification time of the Delta log file that contains actions.

The commit timestamp for _commit_version. Depending on whether In-Commit Timestamps are enabled, this is derived from either the inCommitTimestamp field of the commitInfo action in the version's Delta log, or from the Delta log's file modification time.

Copy link
Collaborator

@ryan-johnson-databricks ryan-johnson-databricks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me except the one question about table property. But we can probably iterate on that as the prototype takes shape and we have more information to work from.

Enablement:
- The table must be on Writer Version 7.
- The feature `inCommitTimestamps` must exist in the table `protocol`'s `writerFeatures`.
- The table property `delta.enableInCommitTimestamps` must be set to `true`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What purpose would this table property serve? There isn't much of an enable/disable story, is there? No metadata or data to rewrite, no one-way doors, etc?

CC @bart-samwel since he's thought a lot about properties/upgrade/downgrade issues.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No metadata or data to rewrite

We might want to remove delta.inCommitTimestampEnablementTimestamp/Version from the metadata on disablement. Though it is technically okay even if we do not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would definitely want to remove them, unless we're willing to track multiple ranges of commits where ICT was enabled. But that can easily be done in the same commit that removes the feature from protocol.

Copy link
Collaborator

@ryan-johnson-databricks ryan-johnson-databricks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, but we do need to add a proposed entry for this to the RFC index.md!

@dhruvarya-db
Copy link
Collaborator Author

Oh, but we do need to add a proposed entry for this to the RFC index.md!

Done

@scottsand-db scottsand-db merged commit 5545f28 into delta-io:master Feb 8, 2024
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants