Describe the bug
I found an unexpected behaviour with Hudi and incremental loads that use unique_key and the merge strategy on a partitioned table.
If the partition of a record changes, the run produces duplicates: the key used for uniqueness appears more than once, once with the old partition value and once with the new one.
The expected behaviour is that the record is replaced and its new partition takes effect.
Steps To Reproduce
Create a model with this content:
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id',
    partition_by=['status'],
    file_format='hudi'
) }}

WITH data AS (
    SELECT 'A' AS user_id, 'active' AS status
    UNION ALL
    SELECT 'B' AS user_id, 'active' AS status
    UNION ALL
    SELECT 'C' AS user_id, 'disabled' AS status
)
SELECT
    user_id,
    status,
    current_timestamp() AS inserted_at
FROM data
Run the above model. Then change the model to this one:
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id',
    partition_by=['status'],
    file_format='hudi'
) }}

WITH data AS (
    SELECT 'C' AS user_id, 'disabled' AS status
)
SELECT
    user_id,
    status,
    current_timestamp() AS inserted_at
FROM data
Run the updated model.
For user_id='C' we expect only one record, in the partition status='disabled', but two records are returned.
Expected behaviour
When using the incremental materialisation with a unique_key, the model should not produce duplicates. Hence a query like:
select my_unique_id, count(*) as c
from my_model
group by 1
having c > 1
should return an empty result, but it does not.
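To see which partitions hold the duplicated keys, the check above can be extended with Hudi's metadata columns. This is only a sketch against the reproduction model: `my_model` is the placeholder name from the query above, `user_id` is the `unique_key` from the model config, and it assumes the table exposes the standard `_hoodie_partition_path` and `_hoodie_commit_time` metadata columns.

```sql
-- List every key that appears more than once, together with the
-- partitions and commits that hold a copy of it.
-- Assumes Spark SQL and a Hudi table with metadata columns populated.
SELECT
    user_id,
    COUNT(*) AS record_count,
    collect_set(_hoodie_partition_path) AS partitions,
    collect_set(_hoodie_commit_time) AS commits
FROM my_model
GROUP BY user_id
HAVING COUNT(*) > 1
```

If a key is listed with more than one commit time, a later run wrote an additional record instead of replacing the existing one, and the partitions column shows where each copy lives.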
System information
The output of dbt --version:
Core:
- installed: 1.2.1
- latest: 1.3.0 - Update available!
Your version of dbt-core is out of date!
You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation
Plugins:
- spark: 1.2.0 - Update available!
At least one plugin is out of date or incompatible with dbt-core.
You can find instructions for upgrading here:
https://docs.getdbt.com/docs/installation
The operating system you're using: macOS Big Sur
The output of python --version: Python 3.9.6
Additional context
I believe that the issue is in here: if the table already exists, we should just overwrite.
Not sure whether this behaviour is related to the dbt adapter or to Hudi itself, see this.
Also, a possible solution to consider can be found here.
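One thing that might be worth testing, independently of the links above: Hudi's default index types enforce key uniqueness only within a partition, while the global variants enforce it across the whole table. The sketch below switches the reproduction model to a global index through the model's `options`. Both `hoodie.*` keys are Hudi write configs, but whether dbt-spark forwards them to the merge in this way is an untested assumption on my side, not confirmed behaviour.

```sql
{# Assumption: Hudi write configs passed via `options` are honoured by the merge. #}
{# GLOBAL_BLOOM looks up keys across all partitions; update.partition.path asks   #}
{# Hudi to move a record to its new partition instead of keeping the old copy.    #}
{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id',
    partition_by=['status'],
    file_format='hudi',
    options={
        'hoodie.index.type': 'GLOBAL_BLOOM',
        'hoodie.bloom.index.update.partition.path': 'true'
    }
) }}
```

A global index has to compare incoming keys against every partition, so writes may get slower on large tables. The existing table would probably also need a full refresh before a changed index type has any effect.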