
[SNOW-155] Converting userprofile_latest table to dynamic table #100

Merged
merged 6 commits into dev on Jan 11, 2025

Conversation

@danlu1 (Contributor) commented Jan 8, 2025

Problem:

Currently, the userprofile_latest table is updated by an upserting task driven by a stream on the userprofilesnapshot table. We want to convert it to a dynamic table so that it gets refreshed automatically.

Solution:
Following the instructions in this SOP:

  • Add a script to create a backup of the original userprofile_latest table, which needs to be removed manually after validation
  • Add a script to drop the upserting task for the original userprofile_latest table
  • Add a script to drop the stream on the userprofilesnapshot table that is referenced by the userprofile_latest upserting task
  • Add a script to drop the original userprofile_latest table so we can create a dynamic table with the same name
  • Create the userprofile_latest dynamic table and convert empty strings in LOCATION, COMPANY, POSITION, and INDUSTRY to NULL for easier searching (see the sketch after this list). Keep TOS_AGREEMENTS as a VARIANT column, per Phil's comments.
  • Add a script to add table and column comments.
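
For context, here is a minimal sketch of what the dynamic table definition could look like; the target lag, warehouse name, and dedup ordering are assumptions for illustration and may differ from the actual migration script:

```sql
-- Minimal sketch only: TARGET_LAG, WAREHOUSE, and the dedup ordering below
-- are assumptions, not necessarily what the real script uses.
CREATE OR REPLACE DYNAMIC TABLE {{database_name}}.SYNAPSE.USERPROFILE_LATEST
    TARGET_LAG = '1 day'          -- assumed refresh lag
    WAREHOUSE = COMPUTE_XSMALL    -- hypothetical warehouse name
    AS
    SELECT
        * EXCLUDE (LOCATION, COMPANY, POSITION, INDUSTRY),
        NULLIF(LOCATION, '') AS LOCATION,
        NULLIF(COMPANY, '') AS COMPANY,
        NULLIF(POSITION, '') AS POSITION,
        NULLIF(INDUSTRY, '') AS INDUSTRY
    FROM {{database_name}}.SYNAPSE_RAW.USERPROFILESNAPSHOT
    WHERE SNAPSHOT_DATE >= CURRENT_TIMESTAMP - INTERVAL '14 days'
    -- keep only the most recent row per user id
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY ID
        ORDER BY CHANGE_TIMESTAMP DESC, SNAPSHOT_TIMESTAMP DESC
    ) = 1;
```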

Validation (edited):
Per the instructions in Validating a "latest" table, run on synapse_data_warehouse_dev:

  1. Ensure there are no duplicates in the latest table
```sql
select
    count(*) as total_rows,
    count(distinct id) as total_unique_ids
from SYNAPSE_DATA_WAREHOUSE_DANLU_STAGE.SYNAPSE.USERPROFILE_LATEST_test2;
```
Screenshot 2025-01-08 at 5 12 49 PM
  2. Ensure no IDs are missing from the latest table compared to the snapshot table
```sql
-- 1. First we get all the unique IDs from the snapshot table with their latest timestamps
with snapshot_ids as (
    select
        id,
        max(snapshot_timestamp) as latest_snapshot_timestamp,
        max(change_timestamp) as latest_change_timestamp
    from 
        SYNAPSE_DATA_WAREHOUSE_DANLU_STAGE.SYNAPSE_RAW.userprofilesnapshot
    group by 
        id
),
-- 2. Then we get all the unique IDs from latest
latest_ids as (
    select 
        id
    from 
        SYNAPSE_DATA_WAREHOUSE_DANLU_STAGE.SYNAPSE.USERPROFILE_LATEST_test2
),
-- 3. Identify missing IDs and exclude those present in both
missing_ids as (
    select 
        coalesce(snapshot_ids.id, latest_ids.id) as id,
        case 
            when snapshot_ids.id is null then 'missing_from_snapshots'
            when latest_ids.id is null then 'missing_from_latest'
        end as missing_status,
        snapshot_ids.latest_snapshot_timestamp,
        snapshot_ids.latest_change_timestamp
    from 
        snapshot_ids
    full outer join 
        latest_ids
    on 
        snapshot_ids.id = latest_ids.id
    where 
        snapshot_ids.id is null 
        or 
        latest_ids.id is null
)
select * from missing_ids
where latest_snapshot_timestamp > current_timestamp - INTERVAL '14 days';
```
Screenshot 2025-01-08 at 5 14 10 PM
  3. Ensure the empty-string replacement works as expected

Before:
Screenshot 2025-01-07 at 8 51 59 PM

After:
Screenshot 2025-01-07 at 8 57 34 PM

Depends on #99

@danlu1 requested a review from a team as a code owner on January 8, 2025 at 20:10
```sql
LOCATION VARCHAR(16777216) COMMENT 'The location of the user.',
COMPANY VARCHAR(16777216) COMMENT 'The company where the user works.',
POSITION VARCHAR(16777216) COMMENT 'The position of the user in the company.',
INDUSTRY VARCHAR(16777216) COMMENT 'The industry/discipline that this person is associated with.'
```
Contributor

I wonder if explicitly listing out the columns like this would ever clash with the SELECT * in the main query if, say, userprofilesnapshot gets a column added to it later down the line. So either we keep the SELECT * (selecting the columns implicitly) in this script and introduce the column comments in a separate script, or the selected columns are explicitly written out in the main query of this script.

Collaborator

If you're not modifying any columns from the upstream table, I recommend using the more succinct style, which you can see in Jenny's PR linked below.

Contributor Author

Good point. I will add the column comments as a separate step since, as Jenny said, new columns might be added to the source table.
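
A rough sketch of what that separate step could look like, assuming Snowflake accepts column comments on the dynamic table after creation (an assumption worth verifying) and reusing the {{database_name}} placeholder and comment strings from this PR:

```sql
-- Hypothetical follow-up script: set column comments without listing the
-- columns in the dynamic table's SELECT. Whether this ALTER form applies to
-- dynamic tables is an assumption to verify against Snowflake's docs.
ALTER TABLE {{database_name}}.SYNAPSE.USERPROFILE_LATEST ALTER (
    COLUMN LOCATION COMMENT 'The location of the user.',
    COLUMN COMPANY COMMENT 'The company where the user works.',
    COLUMN POSITION COMMENT 'The position of the user in the company.',
    COLUMN INDUSTRY COMMENT 'The industry/discipline that this person is associated with.'
);
```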

Contributor

Just a heads up: I noticed that dynamic table comments don't show up in their own little Description box the way regular table comments do. With a dynamic table, you have to dive into the Table Definition itself to see that the comment was added. It would be nice if we could also have dynamic table comments front and center the way they are for normal tables, but for now this is what it is. Dynamic tables were just added last year, so I'm sure this is a QOL improvement they'll add in the future.

Contributor Author

Yes, the only ways to check the table comments are either to hover over the table name (the comment section there is pretty tiny) or to refer to the Table Definition. Hope the Description box gets added soon.
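
Until then, one workaround is to read the comment from SQL instead of the UI; this assumes the standard SHOW output (which includes a comment column) and reuses the {{database_name}} placeholder from the scripts:

```sql
-- The "comment" column in the SHOW output holds the dynamic table comment.
SHOW DYNAMIC TABLES LIKE 'USERPROFILE_LATEST' IN SCHEMA {{database_name}}.SYNAPSE;
```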

Comment on lines 44 to 47
```sql
NULLIF(LOCATION, '') AS LOCATION,
NULLIF(COMPANY, '') AS COMPANY,
NULLIF(POSITION, '') AS POSITION,
NULLIF(INDUSTRY, '') AS INDUSTRY,
```
Collaborator

Did we ever figure out why some columns were being recorded as null and others as a blank string?

Contributor Author

Nope, not sure why this happened. Besides these columns, First_Name and Last_Name also have both empty strings and nulls, but I feel these four columns are more relevant for business metrics, so I only convert empty strings for them.
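
For a quick sense of the scale per column, something like the following sketch could be run against the staging snapshot table used in the validation above (the column list is illustrative):

```sql
-- Count NULLs vs. empty strings for a few affected columns.
SELECT
    COUNT_IF(LOCATION IS NULL)   AS location_nulls,
    COUNT_IF(LOCATION = '')      AS location_empty,
    COUNT_IF(FIRST_NAME IS NULL) AS first_name_nulls,
    COUNT_IF(FIRST_NAME = '')    AS first_name_empty
FROM SYNAPSE_DATA_WAREHOUSE_DANLU_STAGE.SYNAPSE_RAW.USERPROFILESNAPSHOT;
```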

Contributor

There's a platform ticket to investigate this here

Collaborator

I bugged Marco about this in the Slack thread to see if we can figure out whether this is in the Parquet or if Snowflake is introducing these inconsistencies during data load.

@danlu1 if you want to merge this before we've figured out the root of the issue, can you leave an inline comment with the Jira ticket identifier that Jenny linked, so that we have some provenance for these changes?

@danlu1 (Contributor Author) commented Jan 9, 2025

Per Marco's reply, it seems there is no easy answer for the discrepancies, since he noticed this in Athena as well. I'd just merge the PR and leave a comment, and we can revisit after SWC-7215 is resolved.

@thomasyu888 (Member) commented Jan 9, 2025

Depends on #99 due to V script versioning

```sql
*
FROM {{database_name}}.SYNAPSE_RAW.USERPROFILESNAPSHOT --noqa: TMP
WHERE
SNAPSHOT_DATE >= CURRENT_TIMESTAMP - INTERVAL '14 days'
```
Member

Should we be consistent and use the same day interval for deduplication of all snapshots?

@jaymedina (Contributor) commented Jan 9, 2025

Consistency is a good idea. If we do this, we will have to go with INTERVAL '30 days' as that is the time window for node_latest and we can't make that one shorter.

Edit: Coming back to this, it's better to keep the interval at 14 days for all dynamic tables, with exceptions like node_latest, because not all snapshot tables have the change_type=DELETE value; if we stick to a single window (which would have to be 30+ days because of the exceptions), we will inevitably end up with duplicates in all our tables.

@danlu1 (Contributor Author) commented Jan 9, 2025

I second Jenny. Another reason is that the majority of the existing dynamic tables implement the 14-day interval. @jaymedina I will add the interval setting to the SOP.

Contributor

Good idea, thank you!

@thomasyu888 (Member) commented Jan 10, 2025

@jaymedina Another risk we take on is that snapshotting might not occur for 2 weeks in a row.

That said - I think that risk is small except around the holidays. We will need to determine all the tables for which we need to be concerned about delete events (even if they don't all track that change type). So do all of these below need to use an interval of 30 days?

  • Teams
  • Node
  • File
  • ACL
  • teammember?
  • project settings?

Contributor

It looks like these do as well:

  • AR snapshots
  • User group
  • Userprofile

(although User group and Userprofile don't capture deletions)

Another thought I just had: a lot of these tables are related to each other (file entities in node_latest should also be present in file_latest), so maybe, for consistency of data availability, having the same window for all of them is better for users. It would be nice to expedite having change_type=DELETE available for all snapshots of objects with a delete option on Synapse, so we don't have to worry about the time window at all.

Contributor Author

@jaymedina @thomasyu888 @philerooski
I did a brief check (a sketch of the kind of query used is below) and the following snapshot tables have change_type=DELETE:

  • ACCESSREQUIREMENTSNAPSHOTS
  • ACLSNAPSHOTS
  • FILESNAPSHOTS
  • NODESNAPSHOTS
  • PROJECTSETTINGSNAPSHOTS

Do we agree to keep the time window for dynamic tables derived from snapshots with change_type=DELETE at 30 days, and at 14 days for the other dynamic tables?
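
For reference, a sketch of the kind of check this could be (the database placeholder and the choice of tables shown are illustrative; only tables already confirmed above to record deletions are used):

```sql
-- Does a given snapshot table ever record deletions?
SELECT 'NODESNAPSHOTS' AS snapshot_table, COUNT(*) AS delete_rows
FROM {{database_name}}.SYNAPSE_RAW.NODESNAPSHOTS
WHERE CHANGE_TYPE = 'DELETE'
UNION ALL
SELECT 'FILESNAPSHOTS', COUNT(*)
FROM {{database_name}}.SYNAPSE_RAW.FILESNAPSHOTS
WHERE CHANGE_TYPE = 'DELETE';
```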

Contributor

Yes! This workaround will do until the platform team can address the missing change_type=DELETE for some snapshot tables. Thanks!

@danlu1 (Contributor Author) commented Jan 9, 2025

> Depends on #99 due to V script versioning

@thomasyu888 @jaymedina The above PR is still in draft mode. How about I merge this one first?

@thomasyu888 (Member) commented Jan 10, 2025

> The above PR is still in draft mode. How about I merge this one first?

I'll defer to @jaymedina and you to self-organize. The issue is the V script versions: if we merge this first, then it would have to be modified to 2.28, and Jenny's PR would have to be modified to 2.29.

My 2 cents: I am OK with waiting a bit - Jenny's PR seems almost done - but since you'll be traveling for a conference Monday through Thursday, we can help merge if it's not complete.

@jaymedina (Contributor)

@danlu1 I still have some testing to do as well as a review for #99 before it can get merged, so would you mind renaming your scripts to V2.28.X? Then we can merge this in ASAP. Sorry for the initial hold-up! Feel free to add me as validator for this update so I can handle that for you next week.

@danlu1 (Contributor Author) commented Jan 10, 2025

@jaymedina V2.28 has already been taken to refresh userprofilesnapshot. If we want this PR to be merged after #99, I can rename my scripts to V2.30.X.


dpulls bot commented Jan 11, 2025

🎉 All dependencies have been resolved !

@jaymedina (Contributor)

@danlu1 - You're right - I merged my V2.29 scripts in #99. Feel free to rename your scripts to V2.30.X. I'm approving now so you can merge whenever that's done, and I've added myself as validator which I'll start on Monday.

Thank you for these updates!

@danlu1 merged commit 22bc594 into dev on Jan 11, 2025
1 check passed