Diff in schema in google_bigquery_table when using a connection_id and hive_partitioning_options #12465
Sure. I've built a complete working (simplified) example from scratch so you can reproduce. Run it once to apply, and run it again to see the diff problem. I didn't include the backend and provider configurations.

```hcl
data "google_project" "project" {
}

locals {
  project_id     = data.google_project.project.project_id
  project_number = data.google_project.project.number
  bucket_name    = google_storage_bucket.data_lake.name
  bucket_path    = "my_source/my_table"
  gcs_uri_prefix = "gs://${local.bucket_name}/${local.bucket_path}/"
  gcs_uri        = "gs://${local.bucket_name}/${local.bucket_path}/*.csv"
  use_connection = true

  schema = <<-EOT
    [
      {
        "name": "date",
        "type": "DATE",
        "mode": "NULLABLE"
      },
      {
        "name": "index",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "weight",
        "type": "BIGNUMERIC",
        "mode": "NULLABLE"
      }
    ]
  EOT

  csv_content = <<-EOT
    date,index,weight
    2022-09-05,index1,0.63
    2022-09-05,index2,0.27
    2022-09-05,index3,0.1
  EOT
}

resource "google_storage_bucket" "data_lake" {
  name                        = "test-data-lake-ee7e7914" # suffix is part of a random uuid (generate one for yourself)
  location                    = "US"
  uniform_bucket_level_access = true
}

resource "google_storage_bucket_object" "example_csv_daily" {
  bucket  = google_storage_bucket.data_lake.name
  name    = "${local.bucket_path}/environment=prod/frequency=daily/2022-09-01-example.csv"
  content = local.csv_content
}

resource "google_storage_bucket_object" "example_csv_monthly" {
  bucket  = google_storage_bucket.data_lake.name
  name    = "${local.bucket_path}/environment=prod/frequency=monthly/2022-09-01-example.csv"
  content = local.csv_content
}

resource "google_bigquery_connection" "data_lake" {
  connection_id = "test-connection"
  location      = "US"
  cloud_resource {}
}

resource "google_project_iam_member" "data_lake_bigquery_connection" {
  project = local.project_id
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${google_bigquery_connection.data_lake.cloud_resource[0].service_account_id}"
}

resource "google_bigquery_dataset" "main" {
  dataset_id = "my_source"
  location   = "US"

  access {
    role          = "OWNER"
    special_group = "projectOwners"
  }
}

resource "google_bigquery_table" "my_table" {
  depends_on = [
    google_project_iam_member.data_lake_bigquery_connection,
    google_storage_bucket_object.example_csv_daily,
    google_storage_bucket_object.example_csv_monthly,
  ]

  dataset_id          = google_bigquery_dataset.main.dataset_id
  table_id            = "my_table"
  deletion_protection = false
  schema              = (local.use_connection) ? local.schema : null

  external_data_configuration {
    autodetect = false
    # Schema has to be informed outside this block when a connection is used
    schema                = (local.use_connection) ? null : local.schema
    connection_id         = (local.use_connection) ? google_bigquery_connection.data_lake.id : null
    source_format         = "CSV"
    source_uris           = [local.gcs_uri]
    ignore_unknown_values = true

    csv_options {
      quote             = "\""
      encoding          = "UTF-8"
      field_delimiter   = ","
      skip_leading_rows = 1
      allow_jagged_rows = true
    }

    hive_partitioning_options {
      mode                     = "AUTO"
      source_uri_prefix        = local.gcs_uri_prefix
      require_partition_filter = false
    }
  }
}
```
This looks like an API issue. b/245412495
The connection id has changed due to another issue that you yourself referenced above in the comments: #12386
I think that external tables expect the schema to be informed inside the `external_data_configuration` block. However, when you set a connection id, it doesn't accept a schema in `external_data_configuration`. IMO, it makes sense to inform the schema in the same block the connection id is informed.
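As a minimal sketch of the placement the provider currently requires (based on the reproduction config earlier in this thread; `local.use_connection`, `local.schema`, and the referenced resources are assumed to exist as defined there):

```hcl
resource "google_bigquery_table" "my_table" {
  dataset_id          = google_bigquery_dataset.main.dataset_id
  table_id            = "my_table"
  deletion_protection = false

  # With a connection, the schema must be set at the top level...
  schema = local.use_connection ? local.schema : null

  external_data_configuration {
    autodetect = false
    # ...otherwise it must be set inside external_data_configuration.
    schema        = local.use_connection ? null : local.schema
    connection_id = local.use_connection ? google_bigquery_connection.data_lake.id : null
    source_format = "CSV"
    source_uris   = [local.gcs_uri]
  }
}
```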
@melinath check out the response from b/245412495. How do you want to resolve this?
Wasn't able to get to this. I can't tell whether the extra fields are something we should diff suppress or what. It looks like there's also an extraneous change to
b/245412495
@edwardmedia any updates on this? |
This is a consequence of issue #10919 not being fixed.
Hi all, the diff-suppression comparison for the region in the "dot"/normalized format should be case-insensitive.
Hi, any updates on this? This issue makes it really painful to define partitioned BQ data lake tables using Terraform...
@nevzheng just to clarify we're still considering this open due to the complications with figuring out the best long-term fix; is that correct? |
Hi, is there any update on this one? Thank you
Hi, this is still an issue, could someone have a look at it, please? Thank you. It adds unnecessary and annoying steps to deploying the IaC through a CI/CD pipeline.
Hi @wj-chen, |
Hi @Karpisek - I have another suggestion for a workaround that I implemented. We source the schema from a JSON file and calculate a hash of the content with Terraform (e.g. by using the `base64sha256` function):

```hcl
resource "google_bigquery_table" "external_table" {
  for_each = {
    for td in local.table_definitions : "${td.dataset_id}_${td.table_name}_${base64sha256(td.schema)}" => td
  }

  # [...]

  lifecycle {
    ignore_changes = [
      schema
    ]
  }
}
```

This way we force a resource recreation when the actual schema changes. You could also implement a different solution without changing the identifier by depending on a
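One way to get recreation-on-schema-change without encoding the hash in the resource identifier (a sketch, not something from this thread; it assumes Terraform >= 1.4 for the `terraform_data` resource, uses the `replace_triggered_by` lifecycle argument, and a hypothetical `schema.json` file in the module directory):

```hcl
# A helper resource whose stored value changes whenever the schema content changes.
resource "terraform_data" "schema_version" {
  input = base64sha256(file("${path.module}/schema.json"))
}

resource "google_bigquery_table" "external_table" {
  dataset_id = google_bigquery_dataset.main.dataset_id
  table_id   = "my_table"

  # [...]

  lifecycle {
    # Ignore the server-side schema diff caused by the hive partitioning columns.
    ignore_changes = [schema]
    # Still recreate the table when the source schema file actually changes.
    replace_triggered_by = [terraform_data.schema_version]
  }
}
```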
Hi @mdlnr, thank you for the suggestion, it pointed me to a solution which will work for our use case. Although it is not a clean one, it is better than the current situation. Thank you
I received some guidance from the corresponding API team. There may be a way to exclude columns generated by Hive partitioning from diff detection which would solve this issue. I will experiment and update here again. |
[upstream:834d30be6e0212771d6f1757856cd54d6c57451b] Signed-off-by: Modular Magician <magic-modules@google.com>
Hi, just checking, did the diff detection experiments work? |
Community Note
If an issue is assigned to the `modular-magician` user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to `hashibot`, a community member has claimed the issue already.

Terraform Version
Terraform v1.1.7
Affected Resource(s)
google_bigquery_table
Terraform Configuration Files
Debug Output
Panic Output
None.
Expected Behavior
No changes detected in schema after apply.
Actual Behavior
The actual behavior is that the plan will try to remove the columns coming from the hive-partitioning format (`environment` and `frequency` in my example), while not actually removing them after `terraform apply`.

I'd like to point out that although `connection_id` is informed inside the `external_data_configuration` block, the `schema` must be informed outside this block. Notice that in the configuration file I had to put conditionals when assigning the `schema` (both inside and outside `external_data_configuration`) to handle scenarios where I have a `connection_id` and where I don't. Maybe fixing that alone would resolve the issue, i.e. allowing the schema to be informed in `external_data_configuration` when a `connection_id` is informed.

Steps to Reproduce
1. Upload a CSV file to `gs://my-bucket/data/raw/my_source/my_table/environment=prod/frequency=daily/something.csv` containing the columns specified by the following schema (I removed some columns from the original for simplicity):
2. `terraform init`
3. `terraform validate`
4. `terraform plan -out tfplan`
5. `terraform apply -input=false -auto-approve tfplan`
6. `terraform plan -out tfplan` (again)

Important Factoids
None.
References
None
b/300616880