Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SNOW-107] Add deduplication logic for filehandleassociation_latest #73

Merged
merged 16 commits into from
Sep 6, 2024
Original file line number Diff line number Diff line change
Expand Up @@ -30,18 +30,28 @@ CREATE DYNAMIC TABLE IF NOT EXISTS filehandleassociation_latest
TARGET_LAG = '7 days'
WAREHOUSE = compute_xsmall
AS
with latest_instance as (
WITH latest_unique_rows AS (
Copy link
Member

@thomasyu888 thomasyu888 Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work and testing of the solution! I should've mentioned that it would be a challenge for you to learn about windows functions here instead of using joins. Take a look at some of the other scripts where I de-duplicate.

In particular, look for this in this repo ROW_NUMBER() OVER (

A brief overview, a "window function" will group by a set number of columns and can assign a row number based on a criteria, in this case maybe max timestamp. Take a look!

Copy link
Contributor Author

@jaymedina jaymedina Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @thomasyu888 thanks for the suggestion. I went ahead and adjusted the CTE to use the ROW_NUMBER window function in much the same way that it's done for the ACL_LATEST dynamic table (see solution here). I found that it produces the exact same result, so this method also works. Unfortunately although the query is shorter, it took about 20 seconds longer (total: 1m11s) with the majority of the expense coming from the ROW_NUMBER window function, I assume because of the partitioning that has to happen:

image

The original solution which uses the aggregate function MAX() and the clause GROUP BY to get the latest rows based on TIMESTAMP runs for 45s. Here is the node expense distribution:

image

With this in mind, I'll be moving forward with the first solution. I did however make some structural changes, namely splitting off the dynamic table logic for TEAM_LATEST and FILEHANDLEASSOCIATION_LATEST to be in their own R scripts, since they had different TIME_LAGs and I didn't see the point of them being together. Lmk if it's best to put these back in the same script or if this is fine as is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this! This is very interesting, since there are so many partitions, it becomes more inefficient to use a windows function!

SELECT
max(instance) as max_instance
filehandleid,
associateid,
max(timestamp) as latest_timestamp
FROM
{{database_name}}.synapse_raw.filehandleassociationsnapshots --noqa: TMP
where
{{database_name}}.synapse_raw.filehandleassociationsnapshots --noqa: TMP
WHERE
timestamp >= CURRENT_TIMESTAMP - INTERVAL '14 DAYS'
GROUP BY
filehandleid,
associateid
)
SELECT
filehandleassociationsnapshots.*
FROM
{{database_name}}.synapse_raw.filehandleassociationsnapshots --noqa: TMP
inner JOIN
latest_instance
ON filehandleassociationsnapshots.instance = latest_instance.max_instance;
JOIN
latest_unique_rows
ON
filehandleassociationsnapshots.filehandleid = latest_unique_rows.filehandleid
AND
filehandleassociationsnapshots.associateid = latest_unique_rows.associateid
AND
filehandleassociationsnapshots.timestamp = latest_unique_rows.latest_timestamp;