[Bug Fix] Remove incremental logic #97

Merged: 11 commits merged on Jan 22, 2025


@fivetran-avinash fivetran-avinash commented Jan 13, 2025

PR Overview

This PR will address the following Issue/Feature: [#95]

This PR will result in the following new package version: v0.16.0

Source release is breaking, necessitating an upgrade.

Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:

Bug Fixes

  • Removed incremental logic in the following end models:
    • shopify__discounts
    • shopify__order_lines
    • shopify__orders
    • shopify__transactions
  • These models utilized the merge incremental strategy on BigQuery and Databricks, as we could not rely on a time series timestamp to implement the insert_overwrite strategy. Using merge is a costly strategy, so it defeats the purpose of leveraging incremental logic.
  • There were also concerns about the incremental logic returning incorrect data in some end models. For example, if a repeat order within the new_vs_repeat CTE logic in shopify__orders was calculated within the specified incremental window but the new order was not in that same time period, it could be incorrectly processed as a new order.
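
The misclassification described above can be illustrated with a minimal Python sketch. It is not the package's SQL: the order records, the `classify_new_vs_repeat` helper, and the `incremental_window` filter are all hypothetical names, and the sketch assumes the classification only sees rows that fall inside the incremental window.

```python
from datetime import date

# Hypothetical order records: (order_id, customer_id, order_date).
ORDERS = [
    ("o1", "c1", date(2024, 12, 1)),   # c1's first order
    ("o2", "c1", date(2025, 1, 10)),   # c1's second order, a true repeat
    ("o3", "c2", date(2025, 1, 11)),   # c2's first order
]


def classify_new_vs_repeat(orders):
    """Label each order 'new' or 'repeat' using only the rows passed in."""
    seen_customers = set()
    labels = {}
    # Walk orders chronologically; a customer's first visible order is 'new'.
    for order_id, customer_id, _ in sorted(orders, key=lambda o: o[2]):
        labels[order_id] = "repeat" if customer_id in seen_customers else "new"
        seen_customers.add(customer_id)
    return labels


def incremental_window(orders, window_start):
    """Mimic an incremental filter: keep rows dated on or after window_start."""
    return [o for o in orders if o[2] >= window_start]


# Full refresh sees every order, so o2 is recognised as a repeat.
full_refresh = classify_new_vs_repeat(ORDERS)

# The incremental run only sees January rows, so o1 is invisible to it.
incremental = classify_new_vs_repeat(incremental_window(ORDERS, date(2025, 1, 1)))

print("full refresh:", full_refresh["o2"])
print("incremental: ", incremental["o2"])
```

In this toy setup the same order receives different labels depending on whether the customer's earlier order falls inside the processed window, which is the kind of inconsistency a full table rebuild avoids.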

Upstream Under-the-Hood Updates from shopify_source Package

  • (Affects Redshift only) Creates new shopify_union_data macro to accommodate Redshift's treatment of empty tables.
    • For each staging model, if the source table is not found in any of your schemas, the package will create an empty table with 0 rows for non-Redshift warehouses and a table with 1 all-null row for Redshift destinations.
    • This is necessary as Redshift will ignore explicit data casts when a table is completely empty and materialize every column as a varchar. This throws errors in downstream transformations in the shopify package. The 1 row will ensure that Redshift will respect the package's datatype casts.
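
The fallback behavior described above can be sketched in Python as a small SQL renderer. This is an illustration only, not the actual `shopify_union_data` macro: the function name `empty_source_fallback_sql`, its parameters, and the use of `limit` to control the row count are assumptions made for the example.

```python
def empty_source_fallback_sql(target_type, columns):
    """Render a placeholder SELECT for a source table that was not found.

    `columns` maps column name -> SQL data type. Every value is a typed NULL,
    so the only difference between warehouses is how many rows come back.
    """
    select_list = ",\n    ".join(
        f"cast(null as {data_type}) as {name}"
        for name, data_type in columns.items()
    )
    # Assumption: one all-null row on Redshift so the casts are respected,
    # zero rows on every other warehouse.
    row_limit = 1 if target_type == "redshift" else 0
    return f"select\n    {select_list}\nlimit {row_limit}"


# Hypothetical column spec for a staging model.
spec = {"id": "bigint", "created_at": "timestamp"}
print(empty_source_fallback_sql("redshift", spec))
print(empty_source_fallback_sql("bigquery", spec))
```

Keeping the select list identical across warehouses and varying only the row count mirrors the stated intent: non-Redshift behavior stays unchanged while Redshift gets a single typed row.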

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

  • dbt run --full-refresh && dbt test
  • dbt run (if incremental models are present) && dbt test

Before marking this PR as "ready for review" the following have been applied:

  • The appropriate issue has been linked, tagged, and properly assigned
  • All necessary documentation and version upgrades have been applied
  • docs were regenerated (unless this PR does not include any code or yml updates)
  • BuildKite integration tests are passing
  • Detailed validation steps have been provided below

Detailed Validation

Please share any and all of your validation steps:

[Image: Screenshot 2025-01-13 at 2:32:14 PM]

If you had to summarize this PR in an emoji, which would it be?

🇪🇺

@fivetran-avinash fivetran-avinash self-assigned this Jan 13, 2025
@fivetran-avinash fivetran-avinash marked this pull request as ready for review January 14, 2025 18:44
Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz left a comment

@fivetran-avinash a few questions and change requests before approval.

@@ -1,8 +1,7 @@
 {{
     config(
-        materialized='table' if target.type in ('bigquery', 'databricks', 'spark') else 'incremental',
+        materialized='table',
Contributor

Do we need any of this configuration anymore now that the incremental strategy is being removed? Why do we need to define the cluster and unique_key if there is no incremental strategy leveraging them?

Contributor

Same question on the other incremental updates in this PR.

Contributor Author

Agreed. Removed.

CHANGELOG.md Outdated
Comment on lines 10 to 11
- These models utilized the `merge` incremental strategy on BigQuery and Databricks, as we could not rely on a time series timestamp to implement the `insert_overwrite` strategy. Using `merge` is a costly strategy, so it defeats the purpose of leveraging incremental logic.
- There were also concerns about the incremental logic returning incorrect data in some end models. For example, if a repeat order within the `new_vs_repeat` CTE logic in `shopify__orders` was calculated within the specified incremental window but the new order was not in that same time period, it could be incorrectly processed as a new order.
Contributor

I worry this is bordering on too much information, and customers may not entirely understand this update. Here is a recommendation for making this a bit more succinct.

Suggested change
- These models utilized the `merge` incremental strategy on BigQuery and Databricks, as we could not rely on a time series timestamp to implement the `insert_overwrite` strategy. Using `merge` is a costly strategy, so it defeats the purpose of leveraging incremental logic.
- There were also concerns about the incremental logic returning incorrect data in some end models. For example, if a repeat order within the `new_vs_repeat` CTE logic in `shopify__orders` was calculated within the specified incremental window but the new order was not in that same time period, it could be incorrectly processed as a new order.
- Incremental strategies were removed from these models due to potential inaccuracies with the `merge` strategy on BigQuery and Databricks. For instance, the `new_vs_repeat` field in `shopify__orders` could produce incorrect results during incremental runs. To ensure consistency, this logic was removed across all warehouses. If the previous incremental functionality was valuable to you, please consider opening a feature request to revisit this approach.

Contributor Author

Updated.

CHANGELOG.md Outdated

## [Upstream Under-the-Hood Updates from `shopify_source` Package](https://github.com/fivetran/dbt_shopify_source/releases/tag/v0.15.0)
- (Affects Redshift only) Creates new `shopify_union_data` macro to accommodate Redshift's treatment of empty tables.
- For each staging model, if the source table is not found in any of your schemas, the package will create an empty table with 0 rows for non-Redshift warehouses and a table with 1 all-`null` row for Redshift destinations.
Contributor

This reads a bit strange. You mention above that this change only impacts Redshift users; however, the first part of this bullet describes the behavior for every warehouse other than Redshift. Can we switch this around so the Redshift change is mentioned first, and then say there should be no change in behavior for other warehouses?

Contributor Author

Updated, as done in the source CHANGELOG.

packages.yml Outdated
Comment on lines 2 to 4
revision: bugfix/redshift-limit-one
warn-unpinned: false
Contributor

Reminder to swap before release

Contributor Author

@fivetran-avinash fivetran-avinash left a comment

@fivetran-joemarkiewicz Changes applied. This is ready for re-review!

Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz left a comment

Looks good, but one change needed in the README to bump the version range of the dbt_shopify package. Once that's applied this will be good for release review.

README.md Outdated
Contributor

The package version for this package is still referencing the old v0.15.0 for dbt_shopify. This needs to be updated to reflect the latest version range.

Contributor Author

@fivetran-avinash fivetran-avinash left a comment

@fivetran-joemarkiewicz Oops. Updated!

CHANGELOG.md Outdated
- `shopify__order_lines`
- `shopify__orders`
- `shopify__transactions`
- Incremental strategies were removed from these models due to potential inaccuracies with the `merge` strategy on BigQuery and Databricks. For instance, the `new_vs_repeat` field in `shopify__orders` could produce incorrect results during incremental runs. To ensure consistency, this logic was removed across all warehouses. If the previous incremental functionality was valuable to you, please consider opening a feature request to revisit this approach.
Contributor

My understanding was that it was the incremental strategy in general and not limited to merge, especially since we had the models in question materializing as tables by default for BQ and Databricks, and delete+insert uses the same unique-key logic. Is this not the case? Unless I am mistaken, I would recommend generalizing it a bit:

Suggested change
- Incremental strategies were removed from these models due to potential inaccuracies with the `merge` strategy on BigQuery and Databricks. For instance, the `new_vs_repeat` field in `shopify__orders` could produce incorrect results during incremental runs. To ensure consistency, this logic was removed across all warehouses. If the previous incremental functionality was valuable to you, please consider opening a feature request to revisit this approach.
- Incremental strategies were removed from these models due to potential inaccuracies with the incremental strategy. For instance, the `new_vs_repeat` field in `shopify__orders` could produce incorrect results during incremental runs. If the previous incremental functionality was valuable to you, please consider opening a feature request to revisit this approach.

Contributor Author

Ah, I wasn't certain, as the majority of the conversation centered around merge. I've tweaked this a little, let me know if this looks good!

Contributor

@fivetran-catfritz fivetran-catfritz left a comment

lgtm!

fivetran-avinash and others added 2 commits January 22, 2025 13:58
* MagicBot/add-model-counts updates

* Update README.md

---------

Co-authored-by: Avinash Kunnath <108772760+fivetran-avinash@users.noreply.github.com>
@fivetran-avinash fivetran-avinash merged commit d9df704 into main Jan 22, 2025
10 checks passed