-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SNOW-107] Add deduplication logic for filehandleassociation_latest
#73
Conversation
@@ -30,18 +30,28 @@ CREATE DYNAMIC TABLE IF NOT EXISTS filehandleassociation_latest | |||
TARGET_LAG = '7 days' | |||
WAREHOUSE = compute_xsmall | |||
AS | |||
with latest_instance as ( | |||
WITH latest_unique_rows AS ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great work and testing of the solution! I should've mentioned that it would be a challenge for you to learn about windows functions here instead of using joins. Take a look at some of the other scripts where I de-duplicate.
In particular, look for this in this repo ROW_NUMBER() OVER (
A brief overview, a "window function" will group by a set number of columns and can assign a row number based on a criteria, in this case maybe max timestamp. Take a look!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @thomasyu888 thanks for the suggestion. I went ahead and adjusted the CTE to use the ROW_NUMBER
window function in much the same way that it's done for the ACL_LATEST
dynamic table (see solution here). I found that it produces the exact same result, so this method also works. Unfortunately although the query is shorter, it took about 20 seconds longer (total: 1m11s) with the majority of the expense coming from the ROW_NUMBER
window function, I assume because of the partitioning that has to happen:

The original solution which uses the aggregate function MAX()
and the clause GROUP BY
to get the latest rows based on TIMESTAMP
runs for 45s. Here is the node expense distribution:

With this in mind, I'll be moving forward with the first solution. I did however make some structural changes, namely splitting off the dynamic table logic for TEAM_LATEST
and FILEHANDLEASSOCIATION_LATEST
to be in their own R scripts, since they had different TIME_LAG
s and I didn't see the point of them being together. Lmk if it's best to put these back in the same script or if this is fine as is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this! This is very interesting, since there are so many partitions, it becomes more inefficient to use a windows function!
filehandleassociation_latest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥 LGTM! I'm going to wait for @philerooski to final review, one note about the team latest script, it's CREATE IF NOT EXISTS
The documentation has it formatted the way it currently exists in the R script. Either way, I'm going to change it to |
|
Problem
The
filehandleassociation_latest
dynamic table was found to have duplicates based onfilehandleid
.Solution
associateid
andfilehandleid
combination within a 14 day windowlatest_instance
filter that was accidentally filtering out valid rows (see Testing & Preview for more context)The main change was replacing the
latest_instance
CTE with one that pull the latest row for everyassociateid
andfilehandleid
combination such thatbecomes
Duplicates are defined as any row with the same combination of
filehandleid
andassociateid
as another existing row, and are handled accordingly with the new CTE.Old query example
Using the definition of a Duplicate from above, we find duplicates from the old query in rows 3-7 for this example:
New query example
With the new query, all duplicates are removed and the LATEST rows from
snapshots
are taken (note the differences ininstance
andtimestamp
, because we are no longer usinginstance
for filtering - we are now usingtimestamp
) :Testing
Test the efficacy of the new CTE
1. Total rows with old query
2. Total rows with old query with duplicates removed
This should be the number of rows we get when we implement our new query...

3. Total rows with new query
4. Addressing the discrepancy
Our final result is a table with 101,741,895 rows. More than the expected 101,716,860 by 25,035 rows. A comparison between tables from 2 and 3 shows the following...
All 25,035 rows that are missing from the expected result are due to the
instance
column not matchingmax(instance)
from the original query, which of 511 as of Aug-30-2024:Further testing...
Ensuring the
WHERE
clause for thetimestamp
window works by getting the MINIMUMtimestamp
from the final table (CURRENT_TIMESTAMP
is Aug-29-2024):14 DAYS
20 DAYS
5 DAYS