[Bug] adapter deleted most of my bucket. #405
Comments
I think I found it, in some dev version. There should be some precautions around it, in my opinion:
|
@nbenezri just to clarify - did dbt drop files from bucket-a that were also related to |
There is no relation between the buckets. stg_raw_tab was first created in Athena as an external table on bucket-a. In |
There are a few reasons why this could happen, and you are not the first to spot this complication; I have had many discussions about it with a few users. Another reason why the adapter might delete data from a bucket is when a model is created in the same external location as an existing table. In this case too, the adapter first cleans the target location to avoid issues on creation. IMO neither case is a bug but rather a user misconfiguration, and such behavior must be properly documented. @amychen1776 I hope this helps you triage this issue better; I leave the final decision to you folks at dbt Labs. @nbenezri note that for both Iceberg and Hive tables, we do a delete-objects operation and a delete-table operation using the Glue APIs. The DROP DDL for Iceberg tables leads to situations where not all the S3 objects are removed, and the workaround described allows it to work properly in a dbt context. |
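To make the misconfiguration described above concrete, here is a minimal sketch of the kind of precaution a user could run before pointing a model at an S3 location. It only checks whether two `s3://` locations overlap, so that cleaning one could remove objects belonging to the other. The function names `split_s3_uri` and `locations_overlap` are hypothetical and are not part of dbt-athena; the bucket names are the ones used in this thread.

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split an s3:// URI into (bucket, key prefix ending in '/')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    prefix = parsed.path.strip("/")
    # An empty prefix means the bucket root, which contains every key.
    return parsed.netloc, f"{prefix}/" if prefix else ""


def locations_overlap(target: str, existing: str) -> bool:
    """Return True if cleaning `target` could delete objects under `existing`."""
    t_bucket, t_prefix = split_s3_uri(target)
    e_bucket, e_prefix = split_s3_uri(existing)
    if t_bucket != e_bucket:
        return False
    # If either prefix contains the other, a recursive delete of the target
    # location would touch data that belongs to the existing table.
    return t_prefix.startswith(e_prefix) or e_prefix.startswith(t_prefix)


if __name__ == "__main__":
    print(locations_overlap("s3://bucket-a/raw/", "s3://bucket-a/raw/stg_raw_tab/"))
    print(locations_overlap("s3://bucket-b/iceberg/", "s3://bucket-a/raw/"))
```

Comparing prefixes with a trailing slash avoids treating `raw/` and `raw2/` as overlapping. This is only a string-level check under the assumption that both locations are plain `s3://bucket/prefix` URIs; it does not inspect the Glue catalog or the adapter's actual configuration.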
Thank you @nicor88 for that context! This is super helpful. This to me is expected behavior to maintain dbt's idempotency (not accidentally create duplicate objects). I will look into getting this documented on our docs site. |
Chiming in, the reason the table location S3 data needs to be deleted first (in particular for hive tables) is that you receive an Athena error
Like Nico says, it's better to still do a cleanup for Iceberg as well. You can configure
The adapter just uses boto3 which uses a chain of auth locations. There's no need to configure |
Is this a new bug in dbt-athena?
Current Behavior
I am doing a POC with Athena in which I try to load data from a Parquet file into Iceberg.
The Parquet table was created outside of dbt with CREATE EXTERNAL TABLE syntax on bucket-a.
In the dbt configuration I specify bucket-b as the Iceberg destination, and I don't reference bucket-a anywhere.
I noticed after a few test runs (my first time using this) that most of bucket-a had been deleted. Tracing it back with AWS CloudTrail and Datadog, I found that it was dbt-athena that deleted those files with the DeleteObjects API call.
Since this happened during the initial creation of the repo/POC, I am not sure exactly which configuration led to it, nor do I want to test again, as I am not sure how it happened. Luckily the bucket had versioning enabled. Any idea what in this adapter may cause this?
Expected Behavior
dbt does not touch buckets outside the scope of its models.
Steps To Reproduce
I don't have a way to reproduce this.
What I can say is that the model was:
models/project/staging/stg_raw_tab.sql:
and the table DDL is:
latest profile
Relevant log output
No response
Environment