-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix!: handle multiple disputes #88
fix!: handle multiple disputes #88
Conversation
* MagicBot/documentation-updates * Apply suggestions from code review * Apply suggestions from code review --------- Co-authored-by: Jamie Rodriguez <65564846+fivetran-jamie@users.noreply.github.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bramrodenburg thanks for opening this PR to address the issue you raised! This definitely is a change we would like to fold into the broader package; however, I do have a few questions:
- Please see my question below regarding the use of
array_agg
and if we should considerstring_agg
? - Would you be able to create a CHANGELOG.md entry for these changes? Since this will be changing field names, we will need to make this a breaking change (
v0.15.0
).
array_agg(dispute_id) as dispute_ids, | ||
array_agg(dispute_reason) as dispute_reasons, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For these aggs is there a reason you used the array_agg
function instead of string_agg
? Reason I ask is because in BigQuery and Databricks especially I know the array_agg
function can change the nature of the table. Do you find using array_agg
to be easier to leverage in queries down the road? Would there be any concerns with using string_agg
instead, or would that impact the usability of these fields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main reason for using array_agg
is that I find it easier to use the data downstream (e.g. if I need to join back to dispute
using the dispute_id
). If I would use string_agg
, I would first have to split the string again. However, this is a "fictional" use case - I am not doing this now :)
I am perfectly fine with using string_agg
if that's the preferred way of doing it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced array_agg
with string_agg
Makes sense. Will do. |
source_relation, | ||
string_agg(dispute_id) as dispute_ids, | ||
string_agg(dispute_reason) as dispute_reasons, | ||
sum(dispute_amount) as dispute_amount |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the same amount can get disputed more than once, I think it makes more sense to also string_agg the dispute_amount. Here's an example where I have a balance transaction where the full amount has been disputed twice. Summing the dispute_amount in this case will erroneously double the amount of the transaction.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jsnorthrup thanks so much for chiming in and sharing this level of insight. We definitely would not want to have a scenario where we could possibly be double counting the amount.
It looks like the two disputes you shared have a different "status" where it seems only one was "won". Since there is a "status" field in the source table (via the ERD), do you think it would be appropriate to retain the sum, but only sum disputes that are of status='won'
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that would work for when the status is "won", but what about if it hasn't been won yet? Will the dispute_amount
be 0? In that case I think the column needs to be renamed to dispute_amount_won
to indicate it's not showing dispute amounts in other statuses.
If you go that route, it might be good to add another column dispute_amounts
that still has a string_agg of each individual dispute amount to match dispute_ids
and dispute_reasons
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For us, we are not only interested in disputes won
, but also lost
, under_review
etc. If we would go down the route of adding a dispute_amount_won
, then I feel we should also add similar columns for the other statuses. There are currently 7 different statuses in Stripe for disputes (see here). So you would get something like this:
dispute_amount_won
dispute_amount_lost
dispute_amount_under_review
dispute_amount_needs_response
dispute_amount_warning_closed
dispute_amount_warning_under_review
dispute_amount_warning_needs_response
If we wouldn't do this, then the user of the table needs to join back to the dispute
table to get this information, which kind of renders having dispute amounts in this table obsolete. (I actually naively took the sum in PR, because we are actually already doing this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we wouldn't do this, then the user of the table needs to join back to the dispute table to get this information, which kind of renders having dispute amounts in this table obsolete.
@bramrodenburg I agree with this idea. I also worry if we go the string_agg route for the dispute information then it won't provide much analytical value. Whereas, if we take the approach in creating distinct aggregate sums for each status, these values can be used more easily for any downstream analysis. Since we can confirm from Stripe's documentation that there are only 7 statuses, I'd request we take the approach of having distinct dispute aggregate records for each status like you suggest.
My only remaining question here - Is it possible for Stripe to provide multiple entries for a dispute status that could still result in us double counting? For example, could there be some reason the same dispute amount has two won
records in the underlying table? If not, then I'm comfortable with the above approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @jsnorthrup - In that scenario, would you want to only look at the most recent dispute amount for won
and lost
disputes (or any other statuses)? So in the above screenshot, if you won Dispute 2, the model would only include the second $175 in the dispute_amount_won
sum.
If so, I think this would be possible by doing something like row_number() over (partition by charge_id, dispute_status, source_relation order by dispute_created_at desc) = 1 or dispute_status not in ('won', 'lost') as is_dispute_included
and filtering based on that
Otherwise, yes, perhaps joining the balance_transactions
model and with dispute
downstream is the most sensible option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bramrodenburg @jsnorthrup I suppose my bigger question is how do you two think about Disputes? What is the bottom line you'd like to analyze or see surfaced?
Also, are there cases were a singe transaction's disputes can have different dollar amounts? So like in the above example, what if the second dispute only pertained to some items in the charge and therefore had a dispute value $100 instead of $175? Assuming you won Dispute 2, would you want the dispute_won_amount
to reflect the max amount or most recent amount (or both, which would lead me to think that just joining in dispute
downstream may make more sense)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Multiple scenarios are possible. If different parts of the total are disputed, the amounts could be additive. If the total amount is disputed multiple times, you'd only want the max.
I think doing the latest might be a good compromise. Remember, this is a very rare occurrence. I have exactly one instance of multiple disputes on the same transaction. If we have the dispute columns only refer to the latest dispute (in which case I recommend prefixing with latest_
), it will represent the total disputed amount in the vast majority of cases, and in those rare cases where there are multiple disputes, we could have a dispute_count
field to signal that you should reference the dispute
table for more information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like that! @bramrodenburg if this compromise sounds good to you, I'll go ahead and merge your PR into my own working branch and add this extra code to stripe__balance_transactions
, as it's a bit involved.
Specifically, I'll:
- Add code to select the latest dispute in each status to account for the rare (but possible) case of multiple disputes per transaction:
row_number() over (partition by charge_id, dispute_status, source_relation order by dispute_created_at desc) = 1 as is_latest_dispute
. - This filter will be leveraged to create the following
dispute_amount
fields split by status:latest_dispute_amount_won
latest_dispute_amount_lost
latest_dispute_amount_under_review
latest_dispute_amount_needs_response
latest_dispute_amount_warning_closed
latest_dispute_amount_warning_under_review
latest_dispute_amount_warning_needs_response
- Add a
count(distinct dispute_id) as dispute_count
field that indicates whether there are multiple disputes for a transaction. If so, users should then join in thedispute
table for more details
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late reply. @jsnorthrup is completely right that there can be multiple scenarios.
The suggested approach sounds great. Thanks for merging this into your working branch @fivetran-jamie
Joe's going on PTO and I'm making additional changes to this in a working branch before merging to main
…nsactions (#92) * fix!: handle multiple disputes (#88) * Documentation Standard Updates (#85) * MagicBot/documentation-updates * Apply suggestions from code review * Apply suggestions from code review --------- Co-authored-by: Jamie Rodriguez <65564846+fivetran-jamie@users.noreply.github.com> * fix!: handle multiple disputes * replace array_agg with string_agg * update changelog * bump version --------- Co-authored-by: Joe Markiewicz <74217849+fivetran-joemarkiewicz@users.noreply.github.com> Co-authored-by: Jamie Rodriguez <65564846+fivetran-jamie@users.noreply.github.com> * changes * aws and changelog * validations * Update models/stripe__balance_transactions.sql Co-authored-by: Renee Li <91097070+fivetran-reneeli@users.noreply.github.com> * renee feedback * docs * update package ref * changelog --------- Co-authored-by: bramrodenburg <14278376+bramrodenburg@users.noreply.github.com> Co-authored-by: Joe Markiewicz <74217849+fivetran-joemarkiewicz@users.noreply.github.com> Co-authored-by: Renee Li <91097070+fivetran-reneeli@users.noreply.github.com>
Please provide your name and company
Bram - Gigs
Link the issue/feature request which this PR is meant to address
#87
Detail what changes this PR introduces and how this addresses the issue/feature request linked above.
Although it happens rarely (see here), a payment can be disputed more than once. The existing
balance_transactions
model cannot handle this scenario. In this PR, that is fixed by aggregating the dispute_[id/reason/amount] before joining the dispute data back.How did you validate the changes introduced within this PR?
Followed the steps here and built the
dbt_stripe
package against our data.Which warehouse did you use to develop these changes?
BigQuery
Did you update the CHANGELOG?
Did you update the dbt_project.yml files with the version upgrade (please leverage standard semantic versioning)? (In both your main project and integration_tests)
Typically there are additional maintenance changes required before this will be ready for an upcoming release. Are you comfortable with the Fivetran team making a few commits directly to your branch?
If you had to summarize this PR in an emoji, which would it be?
🥸
Feedback
We are so excited you decided to contribute to the Fivetran community dbt package! We continue to work to improve the packages and would greatly appreciate your feedback on our existing dbt packages or what you'd like to see next.
PR Template
Community Pull Request Template (default)
Maintainer Pull Request Template (to be used by maintainers)