[SNOW-107] Add deduplication logic for `filehandleassociation_latest` #73

jaymedina · 2024-08-29T18:29:29Z

Problem

The filehandleassociation_latest dynamic table was found to have duplicates based on filehandleid.

Solution

Modify the dynamic table query to select the most recent associateid and filehandleid combination within a 14-day window
Remove the latest_instance filter, which was accidentally filtering out valid rows (see Testing & Preview for more context)

The main change was replacing the latest_instance CTE with one that pulls the latest row for every associateid and filehandleid combination, such that

filehandleid	associateid	timestamp	instance
1	A	2024-08-15 12:00:00	510
1	A	2024-08-20 14:00:00	511
2	B	2024-08-18 09:00:00	509
2	B	2024-08-22 10:00:00	511
3	C	2024-08-19 15:00:00	510

becomes

filehandleid	associateid	timestamp	instance
1	A	2024-08-20 14:00:00	511
2	B	2024-08-22 10:00:00	511
3	C	2024-08-19 15:00:00	510

Duplicates are defined as any row with the same combination of filehandleid and associateid as another existing row, and are handled accordingly with the new CTE.
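The before/after tables above can be reproduced with a minimal sketch, assuming SQLite (via Python's `sqlite3`) stands in for Snowflake and a hypothetical source table named `snapshots`. The CTE name `latest_unique_rows` comes from the diff; the rest of the query is an illustration of the MAX()/GROUP BY approach, not the exact query in this PR.

```python
import sqlite3

# Example rows from the description: (filehandleid, associateid, timestamp, instance)
ROWS = [
    (1, "A", "2024-08-15 12:00:00", 510),
    (1, "A", "2024-08-20 14:00:00", 511),
    (2, "B", "2024-08-18 09:00:00", 509),
    (2, "B", "2024-08-22 10:00:00", 511),
    (3, "C", "2024-08-19 15:00:00", 510),
]

# Find the newest timestamp per (filehandleid, associateid) pair,
# then join back to recover the full row for that timestamp.
DEDUP_SQL = """
WITH latest_unique_rows AS (
    SELECT filehandleid, associateid, MAX(timestamp) AS latest_timestamp
    FROM snapshots
    GROUP BY filehandleid, associateid
)
SELECT s.filehandleid, s.associateid, s.timestamp, s.instance
FROM snapshots AS s
JOIN latest_unique_rows AS l
    ON s.filehandleid = l.filehandleid
    AND s.associateid = l.associateid
    AND s.timestamp = l.latest_timestamp
ORDER BY s.filehandleid
"""


def dedup_latest(rows):
    """Load rows into an in-memory table (hypothetical name) and deduplicate."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snapshots "
        "(filehandleid INTEGER, associateid TEXT, timestamp TEXT, instance INTEGER)"
    )
    conn.executemany("INSERT INTO snapshots VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(DEDUP_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in dedup_latest(ROWS):
        print(row)
```

Timestamps are stored as ISO-formatted text here, so string comparison orders them chronologically; in Snowflake the column would be a real timestamp type.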

Old query example

Using the definition of a Duplicate from above, we find duplicates from the old query in rows 3-7 for this example:

New query example

With the new query, all duplicates are removed and the latest rows from snapshots are taken. Note the differences in instance and timestamp: we no longer filter on instance and now filter on timestamp instead.

Testing

Test the efficacy of the new CTE

1. Total rows with old query

2. Total rows with old query with duplicates removed

This should be the number of rows we get when we implement our new query...

3. Total rows with new query

4. Addressing the discrepancy

Our final result is a table with 101,741,895 rows, which is 25,035 more than the expected 101,716,860. A comparison between the tables from steps 2 and 3 shows the following...

All 25,035 rows that are missing from the expected result were excluded by the original query because their instance value does not match max(instance), which was 511 as of Aug-30-2024:

From this we can conclude that the new CTE does the following:

Removes duplicates as defined by rows having the same pair of associateid and filehandleid

Re-introduces rows that were incorrectly filtered out due to their instance value
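The second conclusion can be illustrated on the small example from the Solution section. This is a sketch under stated assumptions: the old filter is assumed to behave like `instance = MAX(instance)` (the original query is not shown here), the table name `snapshots` is hypothetical, and SQLite stands in for Snowflake.

```python
import sqlite3

# Example rows from the description: (filehandleid, associateid, timestamp, instance)
ROWS = [
    (1, "A", "2024-08-15 12:00:00", 510),
    (1, "A", "2024-08-20 14:00:00", 511),
    (2, "B", "2024-08-18 09:00:00", 509),
    (2, "B", "2024-08-22 10:00:00", 511),
    (3, "C", "2024-08-19 15:00:00", 510),
]

# Assumed shape of the old filter: keep only rows from the newest instance.
OLD_FILTER_SQL = """
SELECT filehandleid, associateid, timestamp, instance
FROM snapshots
WHERE instance = (SELECT MAX(instance) FROM snapshots)
"""

# Timestamp-based selection: newest row per (filehandleid, associateid).
NEW_FILTER_SQL = """
SELECT s.filehandleid, s.associateid, s.timestamp, s.instance
FROM snapshots AS s
JOIN (
    SELECT filehandleid, associateid, MAX(timestamp) AS latest_timestamp
    FROM snapshots
    GROUP BY filehandleid, associateid
) AS l
    ON s.filehandleid = l.filehandleid
    AND s.associateid = l.associateid
    AND s.timestamp = l.latest_timestamp
"""


def rows_reintroduced(rows):
    """Return rows kept by the timestamp-based query but dropped by the instance filter."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snapshots "
        "(filehandleid INTEGER, associateid TEXT, timestamp TEXT, instance INTEGER)"
    )
    conn.executemany("INSERT INTO snapshots VALUES (?, ?, ?, ?)", rows)
    old = set(conn.execute(OLD_FILTER_SQL))
    new = set(conn.execute(NEW_FILTER_SQL))
    conn.close()
    return sorted(new - old)


if __name__ == "__main__":
    print(rows_reintroduced(ROWS))
```

In this toy data, the row for filehandleid 3 is the only row for its pair, yet the instance filter drops it because its instance is 510 rather than 511.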

Further testing...

Ensuring the `WHERE` clause for the `timestamp` window works by getting the MINIMUM `timestamp` from the final table (`CURRENT_TIMESTAMP` is Aug-29-2024):

`14 DAYS`

`20 DAYS`

`5 DAYS`
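The window check above can be sketched as follows. The sample timestamps are made up for illustration, the table name `snapshots` is hypothetical, and SQLite's `datetime()` modifier stands in for whatever date arithmetic the Snowflake query uses; only the idea (the minimum surviving timestamp must not be older than the window) is taken from the test described here.

```python
import sqlite3

# Hypothetical sample data: (filehandleid, timestamp)
ROWS = [
    (1, "2024-08-05 08:00:00"),
    (2, "2024-08-12 08:00:00"),
    (3, "2024-08-16 08:00:00"),
    (4, "2024-08-25 08:00:00"),
    (5, "2024-08-28 08:00:00"),
]

# Stand-in for CURRENT_TIMESTAMP so the result is reproducible.
NOW = "2024-08-29 00:00:00"


def min_timestamp_in_window(rows, now, days):
    """Return the oldest timestamp that survives a `days`-day window ending at `now`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE snapshots (filehandleid INTEGER, timestamp TEXT)")
    conn.executemany("INSERT INTO snapshots VALUES (?, ?)", rows)
    sql = "SELECT MIN(timestamp) FROM snapshots WHERE timestamp >= datetime(?, ?)"
    (result,) = conn.execute(sql, (now, f"-{days} days")).fetchone()
    conn.close()
    return result


if __name__ == "__main__":
    for days in (14, 20, 5):
        print(days, "days ->", min_timestamp_in_window(ROWS, NOW, days))
```

If the window filter works, widening the window can only move the minimum timestamp earlier, and the minimum is never older than `now` minus the window length.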

thomasyu888 · 2024-08-30T21:49:52Z

synapse_data_warehouse/synapse/dynamic_tables/V2.20.1__fileassociation_team_latest.sql

@@ -30,18 +30,28 @@ CREATE DYNAMIC TABLE IF NOT EXISTS filehandleassociation_latest
    TARGET_LAG = '7 days'
    WAREHOUSE = compute_xsmall
 AS
-    with latest_instance as (
+    WITH latest_unique_rows AS (


Thanks for the great work and testing of the solution! I should've mentioned that it would be a good challenge for you to learn about window functions here instead of using joins. Take a look at some of the other scripts where I de-duplicate.

In particular, look for this in this repo: ROW_NUMBER() OVER (

As a brief overview, a "window function" partitions rows by a set of columns and can assign a row number within each partition based on some criterion, in this case perhaps the max timestamp. Take a look!
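A minimal sketch of the window-function approach described above, assuming SQLite 3.25+ (which supports `ROW_NUMBER()`) in place of Snowflake and a hypothetical table named `snapshots`. Rows are numbered within each (filehandleid, associateid) partition from newest to oldest, and only row number 1 is kept. This illustrates the technique and is not the query used elsewhere in this repo.

```python
import sqlite3

# Example rows from the PR description: (filehandleid, associateid, timestamp, instance)
ROWS = [
    (1, "A", "2024-08-15 12:00:00", 510),
    (1, "A", "2024-08-20 14:00:00", 511),
    (2, "B", "2024-08-18 09:00:00", 509),
    (2, "B", "2024-08-22 10:00:00", 511),
    (3, "C", "2024-08-19 15:00:00", 510),
]

# Number rows per pair, newest first, then keep the first row of each partition.
ROW_NUMBER_SQL = """
SELECT filehandleid, associateid, timestamp, instance
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY filehandleid, associateid
            ORDER BY timestamp DESC
        ) AS row_num
    FROM snapshots
)
WHERE row_num = 1
ORDER BY filehandleid
"""


def dedup_with_row_number(rows):
    """Deduplicate on (filehandleid, associateid), keeping the newest row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snapshots "
        "(filehandleid INTEGER, associateid TEXT, timestamp TEXT, instance INTEGER)"
    )
    conn.executemany("INSERT INTO snapshots VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(ROW_NUMBER_SQL).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in dedup_with_row_number(ROWS):
        print(row)
```

In Snowflake the outer subquery can usually be replaced with a `QUALIFY` clause on the window expression; SQLite has no `QUALIFY`, so the sketch filters in an outer query.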

jaymedina · Sep 4, 2024

Hey @thomasyu888 thanks for the suggestion. I went ahead and adjusted the CTE to use the ROW_NUMBER window function in much the same way that it's done for the ACL_LATEST dynamic table (see solution here). I found that it produces the exact same result, so this method also works. Unfortunately, although the query is shorter, it took about 20 seconds longer (total: 1m11s). The majority of the expense comes from the ROW_NUMBER window function, I assume because of the partitioning that has to happen:

The original solution which uses the aggregate function MAX() and the clause GROUP BY to get the latest rows based on TIMESTAMP runs for 45s. Here is the node expense distribution:
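One way to confirm that two candidate queries return the same rows is a symmetric `EXCEPT` check. The helper below is a hypothetical sketch (the name `queries_match` and the table `snapshots` are not from this PR), run against SQLite rather than Snowflake.

```python
import sqlite3

# Example rows from the PR description: (filehandleid, associateid, timestamp, instance)
ROWS = [
    (1, "A", "2024-08-15 12:00:00", 510),
    (1, "A", "2024-08-20 14:00:00", 511),
    (2, "B", "2024-08-18 09:00:00", 509),
    (2, "B", "2024-08-22 10:00:00", 511),
    (3, "C", "2024-08-19 15:00:00", 510),
]

GROUP_BY_SQL = """
SELECT s.filehandleid, s.associateid, s.timestamp, s.instance
FROM snapshots AS s
JOIN (
    SELECT filehandleid, associateid, MAX(timestamp) AS latest_timestamp
    FROM snapshots
    GROUP BY filehandleid, associateid
) AS l
    ON s.filehandleid = l.filehandleid
    AND s.associateid = l.associateid
    AND s.timestamp = l.latest_timestamp
"""

ROW_NUMBER_SQL = """
SELECT filehandleid, associateid, timestamp, instance
FROM (
    SELECT *, ROW_NUMBER() OVER (
        PARTITION BY filehandleid, associateid ORDER BY timestamp DESC
    ) AS row_num
    FROM snapshots
)
WHERE row_num = 1
"""


def make_conn(rows):
    """Create an in-memory table (hypothetical name) holding the example rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE snapshots "
        "(filehandleid INTEGER, associateid TEXT, timestamp TEXT, instance INTEGER)"
    )
    conn.executemany("INSERT INTO snapshots VALUES (?, ?, ?, ?)", rows)
    return conn


def queries_match(conn, sql_a, sql_b):
    """True when neither query returns a row that the other one lacks."""
    only_a = conn.execute(
        f"SELECT * FROM ({sql_a}) EXCEPT SELECT * FROM ({sql_b})"
    ).fetchall()
    only_b = conn.execute(
        f"SELECT * FROM ({sql_b}) EXCEPT SELECT * FROM ({sql_a})"
    ).fetchall()
    return not only_a and not only_b


if __name__ == "__main__":
    conn = make_conn(ROWS)
    print(queries_match(conn, GROUP_BY_SQL, ROW_NUMBER_SQL))
    conn.close()
```

`EXCEPT` compares sets, so it ignores repeated identical rows; comparing row counts as well closes that gap. The two approaches can also diverge when a pair has tied timestamps: the join keeps every tied row, while `ROW_NUMBER()` keeps exactly one.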

With this in mind, I'll be moving forward with the first solution. I did, however, make some structural changes, namely splitting the dynamic table logic for TEAM_LATEST and FILEHANDLEASSOCIATION_LATEST into their own R scripts, since they had different TARGET_LAG values and I didn't see the point of keeping them together. Lmk if it's best to put these back in the same script or if this is fine as is.

thomasyu888 · Sep 4, 2024

Thanks for looking into this! This is very interesting: since there are so many partitions, it becomes less efficient to use a window function!


thomasyu888

🔥 LGTM! I'm going to wait for @philerooski to do the final review. One note about the team latest script: it's CREATE IF NOT EXISTS

jaymedina · 2024-09-06T13:50:26Z

> it's CREATE IF NOT EXISTS

The documentation has it formatted the way it currently exists in the R script. Either way, I'm going to change it to CREATE OR REPLACE DYNAMIC TABLE to keep this language consistent with the other R scripts.
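The practical difference between the two statements can be sketched with an analogy. SQLite has no dynamic tables and no `CREATE OR REPLACE`, so this example uses a view named `demo_latest` (a hypothetical name) and emulates replacement with `DROP VIEW IF EXISTS` followed by `CREATE VIEW`. The point it illustrates, that `IF NOT EXISTS` silently keeps an existing definition, is an assumption carried over by analogy rather than a test of Snowflake itself.

```python
import sqlite3


def current_definition(conn):
    """Return the value produced by whatever definition the view currently has."""
    return conn.execute("SELECT value FROM demo_latest").fetchone()[0]


def compare_create_styles():
    conn = sqlite3.connect(":memory:")

    # First deployment creates the object with the old logic.
    conn.execute("CREATE VIEW IF NOT EXISTS demo_latest AS SELECT 'old logic' AS value")

    # Re-running with IF NOT EXISTS is a no-op: the old definition stays in place.
    conn.execute("CREATE VIEW IF NOT EXISTS demo_latest AS SELECT 'new logic' AS value")
    after_if_not_exists = current_definition(conn)

    # Emulated "create or replace": drop, then create with the new logic.
    conn.execute("DROP VIEW IF EXISTS demo_latest")
    conn.execute("CREATE VIEW demo_latest AS SELECT 'new logic' AS value")
    after_replace = current_definition(conn)

    conn.close()
    return after_if_not_exists, after_replace


if __name__ == "__main__":
    print(compare_create_styles())
```

For a script that is re-applied whenever its contents change, the replace form is the one that actually picks up an edited query.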

sonarqubecloud · 2024-09-06T13:58:15Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarCloud

jaymedina added 7 commits August 29, 2024 14:27

Deduplication logic for filehandleassociation_latest

5e41d4c

New sql file

5be0e9a

FILEHANDLEID and ASSOCIATEID are a key pair

677084f

Patch up syntax

5fb80f4

Newline

16db29d

typo

800f485

Change CTE name for clarity

8b717e9

jaymedina marked this pull request as ready for review August 30, 2024 16:04

jaymedina requested a review from a team as a code owner August 30, 2024 16:04

thomasyu888 reviewed Aug 30, 2024


jaymedina changed the title ~~[SNOW-107] Deduplication logic for filehandleassociation_latest~~ [SNOW-107] Add deduplication logic for filehandleassociation_latest Sep 3, 2024

jaymedina added 7 commits September 4, 2024 16:16

Using ROW_NUMBER. Separating team_latest and fhassoc_latest. Reverting versioned script.

6a53167

Revert back to original solution

ae91e2a

typos

a88e5ff

adding --noqa: TMP back

41ae6f9

newline

e8332a0

removing stack='prod'

c24de67

Rename CTE, reorder JOIN ON AND

b000a9b

thomasyu888 requested a review from philerooski September 4, 2024 22:45

thomasyu888 approved these changes Sep 4, 2024


philerooski approved these changes Sep 5, 2024


jaymedina added 2 commits September 6, 2024 09:51

CREATE OR REPLACE

7881fd7

fix CTE reference

2eb3975

jaymedina merged commit 9567cfa into dev Sep 6, 2024
3 checks passed

jaymedina deleted the SNOW-107-dedup-filehandleassociation-latest branch September 6, 2024 17:48
