Duplicate records when using incremental with hudi merge, unique_key, and partitions #90

Closed
nicor88 opened this issue Oct 14, 2022 · 0 comments · Fixed by #91
Labels
bug Something isn't working

nicor88 commented Oct 14, 2022

Describe the bug

I found an unexpected behaviour with Hudi and incremental loads that use unique_key and the merge strategy when the table is partitioned.

If the partition of a record changes, the run leads to duplicates: the same key used for uniqueness appears multiple times, once with the old partition value and once with the new one.
The expected behaviour is that the record is replaced and its new partition takes effect.
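
As a toy illustration of the symptom (not the adapter's or Hudi's actual code), the sketch below assumes that the merge matches rows on the pair (partition value, unique key) instead of on the unique key alone. Whether that is what really happens here is an assumption; the function names and the sample rows, where user C moves from `active` to `disabled`, are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def upsert_partition_scoped(table: List[Record], incoming: List[Record]) -> List[Record]:
    """Toy upsert that matches rows on (partition, key). Assumed behaviour, not real Hudi code."""
    merged: Dict[Tuple[str, str], Record] = {
        (row["status"], row["user_id"]): row for row in table
    }
    for row in incoming:
        # A row whose partition changed gets a new composite key, so the old row survives.
        merged[(row["status"], row["user_id"])] = row
    return list(merged.values())


def upsert_global(table: List[Record], incoming: List[Record]) -> List[Record]:
    """Toy upsert that matches rows on the unique key only (the expected behaviour)."""
    merged: Dict[str, Record] = {row["user_id"]: row for row in table}
    for row in incoming:
        # The incoming row replaces the old one, whatever partition it was in.
        merged[row["user_id"]] = row
    return list(merged.values())


# Hypothetical data: user C moves from the 'active' to the 'disabled' partition.
existing = [
    {"user_id": "A", "status": "active"},
    {"user_id": "B", "status": "active"},
    {"user_id": "C", "status": "active"},
]
incoming = [{"user_id": "C", "status": "disabled"}]

c_scoped = [r for r in upsert_partition_scoped(existing, incoming) if r["user_id"] == "C"]
c_global = [r for r in upsert_global(existing, incoming) if r["user_id"] == "C"]

print(len(c_scoped))  # 2: one row per partition, i.e. a duplicate key
print(len(c_global))  # 1: the row was replaced and now lives in 'disabled'
```

Under that assumption, only the key-only match gives the behaviour expected from `unique_key`.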

Steps To Reproduce

Create a model with this content:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id',
    partition_by=['status'],
    file_format='hudi'
) }}

WITH data AS (
	SELECT 'A' AS user_id, 'active' AS status
	UNION ALL
	SELECT 'B' AS user_id, 'active' AS status
	UNION ALL
	SELECT 'C' AS user_id, 'disabled' AS status
)

SELECT
	user_id,
	status,
	current_timestamp() AS inserted_at
FROM data

Run the above model. Then change the model to this one:

{{ config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='user_id',
    partition_by=['status'],
    file_format='hudi'
) }}

WITH data AS (
	SELECT 'C' AS user_id, 'disabled' AS status
)

SELECT
	user_id,
	status,
	current_timestamp() AS inserted_at
FROM data

Run the updated model.

For user_id='C' we expect only one record, in the partition status='disabled', but two records are returned.

Expected behaviour

When using the incremental materialisation with unique_key, the model should not produce duplicates. Hence queries like:

select my_unique_id, count(*) as c
from my_model
group by 1
having c > 1

should return an empty result, which is not currently the case.
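
For anyone who wants to try the check without a Spark session, here is a minimal sketch of the same duplicate test against an in-memory SQLite table. The table contents are hypothetical and SQLite stands in for the real Hudi table, so this only shows the shape of the check, not the adapter's behaviour. `find_duplicate_keys` is a helper name invented for this example.

```python
import sqlite3


def find_duplicate_keys(conn: sqlite3.Connection, table: str, key: str) -> list:
    """Return (key, count) pairs for every key that appears more than once."""
    # Identifiers are interpolated directly; only pass trusted table/column names.
    query = (
        f"SELECT {key}, COUNT(*) AS c "
        f"FROM {table} "
        f"GROUP BY {key} "
        f"HAVING COUNT(*) > 1"
    )
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_model (user_id TEXT, status TEXT)")

# Hypothetical table state after the buggy run: user C exists in two partitions.
conn.executemany(
    "INSERT INTO my_model VALUES (?, ?)",
    [("A", "active"), ("B", "active"), ("C", "active"), ("C", "disabled")],
)

duplicates = find_duplicate_keys(conn, "my_model", "user_id")
print(duplicates)  # [('C', 2)]
```

An empty list from `find_duplicate_keys` would be the expected outcome once the merge respects `unique_key`.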

System information

The output of dbt --version:

Core:
  - installed: 1.2.1
  - latest:    1.3.0 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - spark: 1.2.0 - Update available!

  At least one plugin is out of date or incompatible with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using: macOS Big Sur

The output of python --version: Python 3.9.6

Additional context

I believe the issue is in here: if the table already exists, we should just overwrite.
I am not sure whether this behaviour comes from the dbt adapter or from Hudi itself; see this.
A possible solution to consider can also be found here.

@nicor88 nicor88 added the bug Something isn't working label Oct 14, 2022