-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SNOW-155] Converting userprofile_latest table to dynamic table #100
Changes from 4 commits
fce782c
40d3ce5
d852152
58935aa
5d77da1
8ab919c
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
-- Introduce the dynamic table | ||
USE SCHEMA {{database_name}}.synapse; --noqa: JJ01,PRS,TMP | ||
CREATE OR REPLACE DYNAMIC TABLE USERPROFILE_LATEST | ||
TARGET_LAG = '1 day' | ||
WAREHOUSE = compute_xsmall | ||
AS | ||
WITH dedup_userprofile AS ( | ||
SELECT | ||
* | ||
FROM {{database_name}}.SYNAPSE_RAW.USERPROFILESNAPSHOT --noqa: TMP | ||
WHERE | ||
SNAPSHOT_DATE >= CURRENT_TIMESTAMP - INTERVAL '14 days' | ||
QUALIFY | ||
ROW_NUMBER() OVER ( | ||
PARTITION BY ID | ||
ORDER BY CHANGE_TIMESTAMP DESC, SNAPSHOT_TIMESTAMP DESC | ||
) = 1 | ||
) | ||
SELECT | ||
* exclude (LOCATION, COMPANY, POSITION, INDUSTRY), | ||
NULLIF(LOCATION, '') AS LOCATION, | ||
NULLIF(COMPANY, '') AS COMPANY, | ||
NULLIF(POSITION, '') AS POSITION, | ||
NULLIF(INDUSTRY, '') AS INDUSTRY, | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did we ever figure out why some columns were being recorded as null and others as a blank string? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Nope, not sure why this happened. Besides these columns, First_Name and Last_Name do have both empty strings and nulls. But I feel these 4 columns are more generic for business metrics so I convert empty strings for them. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There's a platform ticket to investigate this here There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I bugged Marco about this in the Slack thread to see if we can figure out whether this is in the Parquet or if Snowflake is introducing these inconsistencies during data load. @danlu1 if you want to merge this before we've figured out the root of the issue can you leave an inline comment with the Jira ticket identifier which Jenny linked so that we have some provenance for these changes? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As Marco's reply, it seems there is no easy answer for the discrepancies since he noticed this in Athena as well. I'd just merge the PR and leave a comment and we can revisit after SWC-7215 is resolved. |
||
FROM | ||
dedup_userprofile; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,24 @@ | ||
-- Add table and column comments to userprofile_latest dynamic table | ||
USE SCHEMA {{database_name}}.synapse; --noqa: JJ01,PRS,TMP | ||
|
||
-- Table comments | ||
COMMENT ON DYNAMIC TABLE USERPROFILE_LATEST IS 'This dynamic table contain the latest snapshot of user-profiles during the past 14 days. Snapshots are taken when user profiles are created or modified. Note: Snapshots are also taken periodically and independently of the changes. The snapshot_timestamp records when the snapshot was taken.'; | ||
|
||
-- Column comments | ||
COMMENT ON COLUMN USERPROFILE_LATEST.CHANGE_TYPE IS 'The type of change that occurred to the user profile, e.g., CREATE, UPDATE (Snapshotting does not capture DELETE change).'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.CHANGE_TIMESTAMP IS 'The time when any change to the user profile was made (e.g. create or update).'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.CHANGE_USER_ID IS 'The unique identifier of the user who made the change to the user profile.'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.SNAPSHOT_TIMESTAMP IS 'The time when the snapshot was taken (It is usually after the change happened).'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.ID IS 'The unique identifier of the user.'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.USER_NAME IS 'The Synapse username.'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.FIRST_NAME IS 'The first name of the user.'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.LAST_NAME IS 'The last name of the user.'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.EMAIL IS 'The primary email of the user.'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.SNAPSHOT_DATE IS 'The data is partitioned for fast and cost effective queries. The snapshot_timestamp field is converted into a date and stored in the snapshot_date field for partitioning. The date should be used as a condition (WHERE CLAUSE) in the queries.'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.CREATED_ON IS 'The creation time of the user profile.'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.IS_TWO_FACTOR_AUTH_ENABLED IS 'Indicates if the user had two factor authentication enabled when the snapshot was captured.'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.TOS_AGREEMENTS IS 'Contains the list of all the term of service that the user agreed to, with their agreed on date and version.'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.LOCATION IS 'The location of the user.'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.COMPANY IS 'The company where the user works.'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.POSITION IS 'The position of the user in the company.'; | ||
COMMENT ON COLUMN USERPROFILE_LATEST.INDUSTRY IS 'The industry/discipline that this person is associated with.'; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
-- Backup the original latest table | ||
USE SCHEMA {{database_name}}.synapse; --noqa: JJ01,PRS,TMP | ||
|
||
-- Clone the USERPROFILE_LATEST table to ``USERPROFILE_LATEST_BACKUP`` for validation purposes | ||
CREATE OR REPLACE TABLE USERPROFILE_LATEST_BACKUP CLONE USERPROFILE_LATEST; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
-- Drop the ``USERPROFILE_LATEST`` table | ||
USE SCHEMA {{database_name}}.synapse; | ||
DROP TABLE USERPROFILE_LATEST; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
-- Drop the snapshot stream | ||
USE SCHEMA {{database_name}}.synapse_raw; | ||
DROP STREAM USERPROFILESNAPSHOT_STREAM; |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
-- Drop any scheduled tasks | ||
USE SCHEMA {{database_name}}.synapse_raw; | ||
-- Suspend ROOT TASK | ||
ALTER TASK REFRESH_SYNAPSE_WAREHOUSE_S3_STAGE_TASK SUSPEND; | ||
-- Drop LATEST_TABLE UPSERTING TASK | ||
DROP TASK UPSERT_TO_USERPROFILE_LATEST_TASK; | ||
-- Resume the ROOT task and its child tasks | ||
SELECT SYSTEM$TASK_DEPENDENTS_ENABLE( 'REFRESH_SYNAPSE_WAREHOUSE_S3_STAGE_TASK' ); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be consistent and use the same day interval for reduplication of all snapshots?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consistency is a good idea. If we do this, we will have to go with
INTERVAL '30 days'
as that is the time window fornode_latest
and we can't make that one shorter.Edit: Coming back to this, it's better to keep the interval as 14 days for all dynamic tables but with exceptions like
node_latest
because not all snapshot tables have thechange_type=DELETE
value, so we will inevitably end up with duplicates in all our tables if we stick to a single window (which would have to be 30+ days because of the exceptions)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I second Jenny. Another reason is that the majority of the existing dynamic table is implementing the 14-days interval. @jaymedina I will add the interval setting in SOP.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jaymedina Another risk that we assume is if snapshotting doesn't occur for 2 weeks in a row.
That said - I think that risk is small except around the holidays. We will need to determine all the tables that we need to be concerned about delete events (even if they don't all track that change type). So do all of these below need to use interval of 30?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like
As well ( although User group and Userprofile don't capture deletions )
Another thought I just had is a lot of these tables are related to each other (file entities in
node_latest
should also be present infile_latest
) sooo maybe for consistency as far as data availability, having all the same windows is better for the users. It would be nice to expedite havingchange_type=DELETE
be available for all snapshots with a delete option on Synapse so we don't have to worry about the time window thing.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jaymedina @thomasyu888 @philerooski
I did a brief check and the following snapshot tables have
change_type=DELETE
:Do we agree to keep time window for dynamic tables derived from snopshots with change_type=DELETE to 30 days and other dynamic tables to 14 days?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! This workaround will do until platform team can address the missing
change_type=DELETE
for some snapshot tables. Thanks!