Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Cartesian Join based deletion is causing performance problems when it hits a certain scale for microbatch models #1228

Closed
2 tasks done
Tracked by #10624
graciegoheen opened this issue Oct 16, 2024 · 1 comment · Fixed by #1192
Assignees
Labels
type:bug Something isn't working

Comments

@graciegoheen
Copy link

Is this a new bug in dbt-core?

  • I believe this is a new bug in dbt-core
  • I have searched the existing issues, and I could not find an existing issue for this bug

Current Behavior

Cartesian Join based deletion is causing data spilling to disk which heavily bogs down performance

The delete statement looks like:

delete from analytics_dev.dbt_aescay.my_model DBT_INTERNAL_TARGET
    using analytics_dev.dbt_aescay.my_model__dbt_tmp DBT_TMP_TARGET
    where (
    DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
    and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-15 00:00:00+00:00'
    
    );

But we are not doing anything with my_model__dbt_tmp in the where clause.

We can simplify this logic and improve the performance, by instead doing:

delete from <existing> where <date range>;
insert into <existing> from <new data for same date range>;

One advantage of microbatch is that we know in advance the exact boundaries of every batch (time range, cf. "static" insert_overwrite).

In a world where we support "microbatch merge" models (= update batches by upserting on unique_key, rather than full batch replacement), then we would want to join (using) based on unique_key match, like so:

delete from analytics_dev.dbt_aescay.my_model DBT_INTERNAL_TARGET
    using analytics_dev.dbt_aescay.my_model__dbt_tmp DBT_TMP_TARGET
    where DBT_INTERNAL_TARGET.event_id = DBT_TMP_TARGET.event_id
    and (
    DBT_INTERNAL_TARGET.event_at >= TIMESTAMP '2024-10-14 00:00:00+00:00'
    and DBT_INTERNAL_TARGET.event_at < TIMESTAMP '2024-10-15 00:00:00+00:00'
    
    );

But this shouldn't be the default assumption.

Expected Behavior

We should delete this line.

Steps To Reproduce

See here.

Relevant log output

No response

Environment

- OS:
- Python:
- dbt:

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

@QuentinCoviaux
Copy link

This seems like a decent deal-breaker at the moment - unless I'm facing some other config issue.

To put some numbers in perspectives (on Snowflake, based on a X-Small warehouse):

  1. Take one table with ~30 millions records (~0.5GB) roughly spread over 30 days
  2. Lookback of 3 days
  3. Delete query takes more than 10 minutes (I cancelled it after that), and this has to be done for each lookback period

Hoping we can get this one prioritized 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type:bug Something isn't working
Projects
None yet
3 participants