Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SNOW-253] update table comments for userprofile_latest, file_latest, and teammember_latest dynamic tables #158

Merged
merged 5 commits into from
Feb 28, 2025

Conversation

danlu1
Copy link
Contributor

@danlu1 danlu1 commented Feb 28, 2025

Problem:

As requested in SNOW-177, we’d like to update table comments to indicate how we deduplicate and transform data in dynamic tables.

Solution:

Altered table comments for these tables.

@danlu1 danlu1 requested a review from a team as a code owner February 28, 2025 18:38
Copy link
Collaborator

@philerooski philerooski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity and conciseness, I think we can remove some implementation details around how the most recent snapshot was obtained, since:

  • It's not usually relevant to analysts
  • Developers can look at the table definition to get a more precise definition of how the latest snapshot was obtained.
  • We can reframe some of this information in a more useful way, e.g., rather than describing which columns we used to deduplicate the table, we can state that these columns are the table index.

Particularly for data types which capture delete events, it doesn't matter whether these are the most recent snapshots from the past 30 or 300 days, and actually I think specifying this can be confusing, since we go on to explain that within this more or less arbitrary time period we only keep the most recent snapshot. Here's how I think we could streamline the comments:

-- The information about NULL values would be more useful elsewhere (like as a column comment)
ALTER DYNAMIC TABLE USERPROFILE_LATEST SET COMMENT = 'This dynamic table, indexed by the ID column, contains the most recent snapshot of user profiles from the past 14 days. The latest snapshots snapshot are is determined by the most recent CHANGE_TIMESTAMP and SNAPSHOT_TIMESTAMP. If multiple snapshots exist within this period, only the latest snapshot for each ID is retained. Since snapshotting does not capture DELETE changes events, records may still be retained if a user was deleted within the past 14 days. For easier querying, NULL values in the LOCATION, COMPANY, POSITION, and INDUSTRY columns are replaced with empty strings.';

-- updated "files" to "file handles" since this is a common point of confusion among analysts
ALTER DYNAMIC TABLE FILE_LATEST SET COMMENT = 'This dynamic table, indexed by the ID column, contains the most recent snapshot of file handles from the past 30 days. The latest snapshots are determined by the most recent CHANGE_TIMESTAMP and SNAPSHOT_TIMESTAMP. If multiple snapshots exist within this period, only the latest snapshot for each ID is retained.';

ALTER DYNAMIC TABLE TEAMMEMBER_LATEST SET COMMENT = 'This dynamic table, indexed by the TEAM_ID and MEMBER_ID columns, contains the most recent snapshot of team members from the past 14 days. The latest snapshots are determined by the most recent CHANGE_TIMESTAMP and SNAPSHOT_TIMESTAMP. If multiple snapshots exist within this period, only the latest snapshot for each TEAM_ID and MEMBER_ID pair is retained. Since snapshotting does not capture DELETE changes events, records may still be retained if a team member was deleted within the past 14 days.';

lmk what you think.

Copy link
Contributor

@jaymedina jaymedina left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this work 🔥 Deferring to Phil for final review

@danlu1
Copy link
Contributor Author

danlu1 commented Feb 28, 2025

Particularly for data types which capture delete events, it doesn't matter whether these are the most recent snapshots from the past 30 or 300 days, and actually I think specifying this can be confusing, since we go on to explain that within this more or less arbitrary time period we only keep the most recent snapshot.

I see your point, but I think specifying the time window is important for clarity. While it's true that the logic applies regardless of whether we're looking at 30 or 300 days, the time window we define is not arbitrary—it determines the retention period for snapshots. By explicitly stating it, we make it clear how long records might persist, which is especially relevant when explaining why deleted records could still appear. Without this context, it might be unclear how long outdated snapshots are retained, which could lead to confusion rather than prevent it.

@philerooski
Copy link
Collaborator

philerooski commented Feb 28, 2025

the time window we define is not arbitrary—it determines the retention period for snapshots. By explicitly stating it, we make it clear how long records might persist, which is especially relevant when explaining why deleted records could still appear.

I think your objection here is around the non-capture of delete events for some data types? That's absolutely relevant -- we should keep that information as I did in my suggestions:

Since snapshotting does not capture DELETE events, records may still be retained if a user was deleted within the past 14 days.

Since snapshotting does not capture DELETE events, records may still be retained if a team member was deleted within the past 14 days.

But what's less relevant is including that time-window information about snapshots more generally. Actually, by including that information:

This dynamic table contains the most recent snapshot of user profiles from the past 14 days.

You have to go on to clarify that the table doesn't contain the most recent snapshots of user profiles from the past 14 days, but actually just the most recent snapshot.

The latest snapshots are determined by the most recent CHANGE_TIMESTAMP and SNAPSHOT_TIMESTAMP. If multiple snapshots exist within this period, only the latest snapshot for each ID is retained. Since snapshotting does not capture DELETE changes, records may still be retained if a user was deleted within the past 14 days.

It's a more confusing and verbose way of expressing the same thing:

This dynamic table, indexed by the ID column, contains the most recent snapshot of user profiles. Since snapshotting does not capture DELETE events, records may still be retained if a user was deleted within the past 14 days.

@danlu1
Copy link
Contributor Author

danlu1 commented Feb 28, 2025

The time window thing is for file_latest table. Thanks for clarifying. I forgot snapshots were taken independent of changes.

Copy link
Collaborator

@philerooski philerooski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work and discussion!

@danlu1 danlu1 merged commit bf6cb41 into dev Feb 28, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants