-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SNOW-107] Add deduplication logic for filehandleassociation_latest
#73
Merged
Merged
Changes from 7 commits
Commits
Show all changes
16 commits
Select commit
Hold shift + click to select a range
5e41d4c
Deduplication logic for filehandleassociation_latest
jaymedina 5be0e9a
New sql file
jaymedina 677084f
FILEHANDLEID and ASSOCIATEID are a key pair
jaymedina 5fb80f4
Patch up syntax
jaymedina 16db29d
Newline
jaymedina 800f485
typo
jaymedina 8b717e9
Change CTE name for clarity
jaymedina 6a53167
Using ROW_NUMBER. Separating team_latest and fhassoc_latest. Revertin…
jaymedina ae91e2a
Revert back to original solution
jaymedina a88e5ff
typos
jaymedina 41ae6f9
adding --noqa: TMP back
jaymedina e8332a0
newline
jaymedina c24de67
removing stack='prod'
jaymedina b000a9b
Rename CTE, reorder JOIN ON AND
jaymedina 7881fd7
CREATE OR REPLACE
jaymedina 2eb3975
fix CTE reference
jaymedina File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the great work and testing of the solution! I should've mentioned that it would be a challenge for you to learn about windows functions here instead of using joins. Take a look at some of the other scripts where I de-duplicate.
In particular, look for this in this repo
ROW_NUMBER() OVER (
A brief overview, a "window function" will group by a set number of columns and can assign a row number based on a criteria, in this case maybe max timestamp. Take a look!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @thomasyu888 thanks for the suggestion. I went ahead and adjusted the CTE to use the
ROW_NUMBER
window function in much the same way that it's done for theACL_LATEST
dynamic table (see solution here). I found that it produces the exact same result, so this method also works. Unfortunately although the query is shorter, it took about 20 seconds longer (total: 1m11s) with the majority of the expense coming from theROW_NUMBER
window function, I assume because of the partitioning that has to happen:The original solution which uses the aggregate function
MAX()
and the clauseGROUP BY
to get the latest rows based onTIMESTAMP
runs for 45s. Here is the node expense distribution:With this in mind, I'll be moving forward with the first solution. I did however make some structural changes, namely splitting off the dynamic table logic for
TEAM_LATEST
andFILEHANDLEASSOCIATION_LATEST
to be in their own R scripts, since they had differentTIME_LAG
s and I didn't see the point of them being together. Lmk if it's best to put these back in the same script or if this is fine as is.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this! This is very interesting, since there are so many partitions, it becomes more inefficient to use a windows function!