Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SNOW-107] Add deduplication logic for filehandleassociation_latest #73

Merged
merged 16 commits into from
Sep 6, 2024

Conversation

jaymedina
Copy link
Contributor

@jaymedina jaymedina commented Aug 29, 2024

Problem

The filehandleassociation_latest dynamic table was found to have duplicates based on filehandleid.

Solution

  • Modify dynamic table query to select the most recent associateid and filehandleid combination within a 14 day window
  • Remove the latest_instance filter that was accidentally filtering out valid rows (see Testing & Preview for more context)

The main change was replacing the latest_instance CTE with one that pull the latest row for every associateid and filehandleid combination such that

filehandleid associateid timestamp instance
1 A 2024-08-15 12:00:00 510
1 A 2024-08-20 14:00:00 511
2 B 2024-08-18 09:00:00 509
2 B 2024-08-22 10:00:00 511
3 C 2024-08-19 15:00:00 510

becomes

filehandleid associateid timestamp instance
1 A 2024-08-20 14:00:00 511
2 B 2024-08-22 10:00:00 511
3 C 2024-08-19 15:00:00 510

Duplicates are defined as any row with the same combination of filehandleid and associateid as another existing row, and are handled accordingly with the new CTE.

Old query example

Using the definition of a Duplicate from above, we find duplicates from the old query in rows 3-7 for this example:

image

New query example

With the new query, all duplicates are removed and the LATEST rows from snapshots are taken (note the differences in instance and timestamp, because we are no longer using instance for filtering - we are now using timestamp) :

image

Testing

Test the efficacy of the new CTE

1. Total rows with old query

image

2. Total rows with old query with duplicates removed

This should be the number of rows we get when we implement our new query...
image

3. Total rows with new query

image

4. Addressing the discrepancy

Our final result is a table with 101,741,895 rows. More than the expected 101,716,860 by 25,035 rows. A comparison between tables from 2 and 3 shows the following...

image

All 25,035 rows that are missing from the expected result are due to the instance column not matching max(instance) from the original query, which of 511 as of Aug-30-2024:

image

From this we can conclude that the new CTE does the following:

  • Removes duplicates as defined by rows having the same pair of associateid and filehandleid
  • Re-introduces rows that were incorrectly filtered out due to their instance value

Further testing...

Ensuring the WHERE clause for the timestamp window works by getting the MINIMUM timestamp from the final table (CURRENT_TIMESTAMP is Aug-29-2024):

14 DAYS

image

20 DAYS

image

5 DAYS

image

@jaymedina jaymedina marked this pull request as ready for review August 30, 2024 16:04
@jaymedina jaymedina requested a review from a team as a code owner August 30, 2024 16:04
@@ -30,18 +30,28 @@ CREATE DYNAMIC TABLE IF NOT EXISTS filehandleassociation_latest
TARGET_LAG = '7 days'
WAREHOUSE = compute_xsmall
AS
with latest_instance as (
WITH latest_unique_rows AS (
Copy link
Member

@thomasyu888 thomasyu888 Aug 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great work and testing of the solution! I should've mentioned that it would be a challenge for you to learn about windows functions here instead of using joins. Take a look at some of the other scripts where I de-duplicate.

In particular, look for this in this repo ROW_NUMBER() OVER (

A brief overview, a "window function" will group by a set number of columns and can assign a row number based on a criteria, in this case maybe max timestamp. Take a look!

Copy link
Contributor Author

@jaymedina jaymedina Sep 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @thomasyu888 thanks for the suggestion. I went ahead and adjusted the CTE to use the ROW_NUMBER window function in much the same way that it's done for the ACL_LATEST dynamic table (see solution here). I found that it produces the exact same result, so this method also works. Unfortunately although the query is shorter, it took about 20 seconds longer (total: 1m11s) with the majority of the expense coming from the ROW_NUMBER window function, I assume because of the partitioning that has to happen:

image

The original solution which uses the aggregate function MAX() and the clause GROUP BY to get the latest rows based on TIMESTAMP runs for 45s. Here is the node expense distribution:

image

With this in mind, I'll be moving forward with the first solution. I did however make some structural changes, namely splitting off the dynamic table logic for TEAM_LATEST and FILEHANDLEASSOCIATION_LATEST to be in their own R scripts, since they had different TIME_LAGs and I didn't see the point of them being together. Lmk if it's best to put these back in the same script or if this is fine as is.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking into this! This is very interesting, since there are so many partitions, it becomes more inefficient to use a windows function!

@jaymedina jaymedina changed the title [SNOW-107] Deduplication logic for filehandleassociation_latest [SNOW-107] Add deduplication logic for filehandleassociation_latest Sep 3, 2024
Copy link
Member

@thomasyu888 thomasyu888 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 LGTM! I'm going to wait for @philerooski to final review, one note about the team latest script, it's CREATE IF NOT EXISTS

@jaymedina
Copy link
Contributor Author

it's CREATE IF NOT EXISTS

The documentation has it formatted the way it currently exists in the R script. Either way, I'm going to change it to CREATE OR REPLACE DYNAMIC TABLE to keep this language consistent with the other R scripts.

Copy link

sonarqubecloud bot commented Sep 6, 2024

@jaymedina jaymedina merged commit 9567cfa into dev Sep 6, 2024
3 checks passed
@jaymedina jaymedina deleted the SNOW-107-dedup-filehandleassociation-latest branch September 6, 2024 17:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants