-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Protocol Change] Add the In-Commit Timestamps spec change proposal #2599
Conversation
### Commit Provenance Information | ||
A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it. | ||
|
||
Implementations are free to store any valid JSON-formatted data via the `commitInfo` action ~~.~~ **as long as any table feature (see [In-Commit Timestamps](#in-commit-timestamps)) does not impose additional requirements on the data.** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest to highlight insertions with <ins>...</ins>
tags instead of bold, since bold can occur naturally in the spec?
Implementations are free to store any valid JSON-formatted data via the `commitInfo` action ~~.~~ **as long as any table feature (see [In-Commit Timestamps](#in-commit-timestamps)) does not impose additional requirements on the data.** | |
Implementations are free to store any valid JSON [object literal](https://www.w3schools.com/js/js_json_objects.asp) as the `commitInfo` action <ins>unless some table feature (e.g. [In-Commit Timestamps](#in-commit-timestamps)) imposes additional requirements on the data</ins>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meanwhile tho... this seems messy. We should really think about defining some kind of basic schema for commit info, so that table features don't have to completely change the commit info spec just to ensure a field is present.
See also #1548 and delta-io/delta-rs#1017; the latter was caused because Delta spark uses an array-like string: "partitionBy": "[...]"
while Delta trino uses an actual array: "partitionBy": [...]"
. That sort of variation is really hard for clients to parse reliably, if they're using anything more structured that raw JSON parser nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meanwhile, I suggest that we correct the spec to say "valid JSON object literal" instead of "valid JSON-formatted data", to make clear that clients can expect commitInfo
to parse as an object rather than one of the other JSON data types (array, string, numeric, boolean, null). In other words, only the first line of the following snippet would be a legal commitInfo
action:
{ "commitInfo": { } }
{ "commitInfo": [ ] }
{ "commitInfo": 10 }
{ "commitInfo": null }
AFAIK, all clients out there are already using proper object literals, because otherwise they would break interop with mainstream clients like spark.
We should probably track this clarification as a separate issue, tho, because it's fixing a spec erratum (no associated table feature) and so wouldn't need to go through the RFC process?
When In-Commit Timestamps is enabled, then: | ||
1. Writers must write the `commitInfo` (see [Commit Provenance Information](#commit-provenance-information)) action in the commit. | ||
2. The `commitInfo` action must be the first action in the commits. | ||
3. The `commitInfo` action must include a field named `inCommitTimestamp`, of type `timestamp` (see [Primitive Types](#primitive-types)), which represents the UTC time when the commit is considered to have succeeded. It is the larger of two values: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK, the timestamp
type (with µs resolution) is only used for user data today. All other Delta spec timestamps seem to be long
type (ms since the unix epoch):
- add.modificationTime (seems especially relevant here, since it means time travel currently has ms resolution)
- remove.deletionTimestamp
- sidecar.modificationTime
- metadata.createdTime
- setTransaction.lastUpdated
What do we gain from a µs resolution in-commit timestamp?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like CDC readers are expected to return a Timestamp
(https://github.com/delta-io/delta/pull/2599/files#diff-ff95f55525ec15c41f84b2215cd0f45fbdf4a88b2e4c4ef126ed3bea0b900597R28) but I agree, we don't really need µs resolution so a Long
should suffice.
4. If the table has commits from a period when this feature was not enabled, provenance information around when this feature was enabled must be tracked in table properties: | ||
- The property `delta.inCommitTimestampEnablementVersion` must be used to track the version of the table when this feature was enabled. | ||
- The property `delta.inCommitTimestampEnablementTimestamp` must be the same as the `inCommitTimestamp` of the commit when this feature was enabled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly:
feature enabled | property present | meaning |
---|---|---|
no | no | feature not enabled |
no | yes | illegal properties |
yes | no | ICT has "always" been enabled (at least for all commits within the history retention window) |
yes | yes | ICT was enabled "recently" (possibly within the current history retention window) |
Yes?
And it is an error for a commitInfo to not provide the in-commit timestamp, if it lies within the ICT window those properties identify?
Should we just leave the properties in place permanently, for simpler checking? Or are they bulky enough it's helpful to garbage collect them at some point? If we did GC them, we'd probably want to do it only when metadata is anyway changing for some other reason, since table property changes conflict with all concurrent transactions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes?
Yes, this is exactly correct. Should I just add this table to the spec?
And it is an error for a commitInfo to not provide the in-commit timestamp, if it lies within the ICT window those properties identify?
Yes.
Should we just leave the properties in place permanently, for simpler checking? Or are they bulky enough it's helpful to garbage collect them at some point? If we did GC them, we'd probably want to do it only when metadata is anyway changing for some other reason, since table property changes conflict with all concurrent transactions.
Would it be okay to just leave this up to the writer? The spec, as written, allows for both possibilities.
|
||
## Recommendations for Readers of Tables with In-Commit Timestamps | ||
|
||
For tables with In-Commit timestamps enabled, readers should use the `inCommitTimestamp` as the commit timestamp for operations like time travel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mention DESCRIBE HISTORY? It's technically a spark-specific command, but it seems reasonable that other clients would want to provide something similar if they don't already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a references to DESCRIBE HISTORY
If a table has commits from a period before In-Commit timestamps were enabled, the table property `delta.inCommitTimestampEnablementVersion` would be set and can be used to identify commits that don't have `inCommitTimestamp`. | ||
To correctly determine the commit timestamp for these tables, readers can use the following rules: | ||
1. For commits with version >= `delta.inCommitTimestampEnablementVersion`, readers should use the `inCommitTimestamp` field of the `commitInfo` action. | ||
2. For commits with version < `delta.inCommitTimestampEnablementVersion`, readers should use the file modification timestamp. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be talking about version-based time travel, which isn't really related. If the client is trying to do timestamp-based time travel, they should first check whether the timestamp is before the delta.inCommitTimestampEnablementTimestamp
; if so, timestamp resolution should only consider commits before delta.inCommitTimestampEnablementVersion
, using file modification timestamps. Otherwise, timestamp resolution should only consider commits starting from delta.inCommitTimestampEnablementVersion
, using in-commit timestamps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added this as a separate section: https://github.com/delta-io/delta/pull/2599/files#r1480345698. Could combine these two if that looks better.
4f6be45
to
c3be37f
Compare
|
||
When In-Commit Timestamp are enabled, writers are required to include a commitInfo action with every commit, which must include the `inCommitTimestamp` field. | ||
|
||
#### Reader Requirements for AddCDCFile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added new changes to this section.
3. The `commitInfo` action must include a field named `inCommitTimestamp`, of type `long` (see [Primitive Types](#primitive-types)), which represents the UTC time in milliseconds when the commit is considered to have succeeded. It is the larger of two values: | ||
- The UTC wall clock time at which the writer attempted the commit | ||
3. The `commitInfo` action must include a field named `inCommitTimestamp`, of type `long` (see [Primitive Types](#primitive-types)), which represents the Unix time in milliseconds when the commit is considered to have succeeded. It is the larger of two values: | ||
- The Unix wall clock time at which the writer attempted the commit |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re "Unix time" -- most other parts of the spec say "time in milliseconds since the Unix epoch"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced both instances with "in milliseconds since the Unix epoch"
A delta file can optionally contain additional provenance information about what higher-level operation was being performed as well as who executed it. | ||
|
||
Implementations are free to store any valid JSON-formatted data via the `commitInfo` action ~~.~~ **as long as any table feature (see [In-Commit Timestamps](#in-commit-timestamps)) does not impose additional requirements on the data.** | ||
Implementations are free to store any valid JSON [object literal](https://www.w3schools.com/js/js_json_objects.asp) as the `commitInfo` action <ins>unless some table feature (e.g. [In-Commit Timestamps](#in-commit-timestamps)) imposes additional requirements on the data</ins>. | ||
|
||
When In-Commit Timestamp are enabled, writers are required to include a commitInfo action with every commit, which must include the `inCommitTimestamp` field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing <ins>
tag to mark it as new text?
Field Name | Data Type | Description | ||
-|-|- | ||
_commit_version|`Long`| The table version containing the change. This can be derived from the name of the Delta log file that contains actions. | ||
_commit_timestamp|`Timestamp`| The timestamp associated when the commit was created. ~~This can be derived from the file modification time of the Delta log file that contains actions.~~ <ins> Depending on whether [In-Commit Timestamps](#in-commit-timestamps) are enabled, this is either the file modification time or the `inCommitTimestamp` stored in the `commitInfo` action of the Delta log file with the version `__commit_version`.</ins> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_commit_timestamp|`Timestamp`| The timestamp associated when the commit was created. ~~This can be derived from the file modification time of the Delta log file that contains actions.~~ <ins> Depending on whether [In-Commit Timestamps](#in-commit-timestamps) are enabled, this is either the file modification time or the `inCommitTimestamp` stored in the `commitInfo` action of the Delta log file with the version `__commit_version`.</ins> | |
_commit_timestamp|`Timestamp`| The timestamp associated when the commit was created. ~~This can be derived from the file modification time of the Delta log file that contains actions.~~ <ins> Depending on whether [In-Commit Timestamps](#in-commit-timestamps) are enabled, this is either the file modification time or the `inCommitTimestamp` stored in the `commitInfo` action of the Delta log file with the version `_commit_version`.</ins> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this metadata column has type Timestamp
, we probably need to keep the "derived" language. Specifically, reader needs to convert from ms since unix epoch to µs resolution UTC.
Maybe:
The timestamp associated when the commit was created. This can be derived from the file modification time of the Delta log file that contains actions.
The commit timestamp for
_commit_version
. Depending on whether In-Commit Timestamps are enabled, this is derived from either theinCommitTimestamp
field of thecommitInfo
action in the version's Delta log, or from the Delta log's file modification time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me except the one question about table property. But we can probably iterate on that as the prototype takes shape and we have more information to work from.
Enablement: | ||
- The table must be on Writer Version 7. | ||
- The feature `inCommitTimestamps` must exist in the table `protocol`'s `writerFeatures`. | ||
- The table property `delta.enableInCommitTimestamps` must be set to `true`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What purpose would this table property serve? There isn't much of an enable/disable story, is there? No metadata or data to rewrite, no one-way doors, etc?
CC @bart-samwel since he's thought a lot about properties/upgrade/downgrade issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No metadata or data to rewrite
We might want to remove delta.inCommitTimestampEnablementTimestamp/Version
from the metadata on disablement. Though it is technically okay even if we do not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we would definitely want to remove them, unless we're willing to track multiple ranges of commits where ICT was enabled. But that can easily be done in the same commit that removes the feature from protocol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, but we do need to add a proposed entry for this to the RFC index.md!
…t considers delta.inCommitTimestampEnablementTimestamp
Done |
Which Delta project/connector is this regarding?
Description
Adds the proposal for spec change In-Commit Timestamps (see #2532) to the RFC folder.
How was this patch tested?
Spec-only change.
Does this PR introduce any user-facing changes?
No