Diff in schema in google_bigquery_table when using a connection_id and hive_partitioning_options #12465

Open
psvaiter opened this issue Sep 2, 2022 · 21 comments

Comments

@psvaiter

psvaiter commented Sep 2, 2022

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v1.1.7

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "4.33.0"
    }
  }

Affected Resource(s)

  • google_bigquery_table

Terraform Configuration Files

locals {
  table_name       = "${var.data_provider}_${var.event_type}"
  schema_file_path = "${path.module}/../../schemas/bigquery/${var.data_provider}/${var.event_type}.json"

  file_format       = upper(var.file_format)
  file_format_lower = lower(var.file_format)

  bucket_path    = (var.bucket_path == null) ? "" : trimsuffix(var.bucket_path, "/")
  gcs_uri_prefix = "gs://${var.gcs_bucket_name}/${local.bucket_path}/"
  gcs_uri        = "gs://${var.gcs_bucket_name}/${local.bucket_path}/*.${local.file_format_lower}"
  use_connection = (local.file_format != "GOOGLE_SHEETS" && var.bigquery_connection_id != null)
}

resource "google_bigquery_table" "external_data_table" {
  dataset_id          = var.destination_dataset_id
  table_id            = local.table_name
  deletion_protection = false

  labels = {
    table_class = (var.is_lookup_table) ? "lookup" : "fact"
  }

  schema = (local.use_connection) ? file(local.schema_file_path) : null

  external_data_configuration {
    autodetect    = false

    # Schema has to be informed outside this block when a connection is used
    schema        = (local.use_connection) ? null : file(local.schema_file_path)
    connection_id = (local.use_connection) ? var.bigquery_connection_id : null

    source_format = local.file_format
    source_uris = [
      (local.file_format == "GOOGLE_SHEETS") ? var.drive_uri : local.gcs_uri
    ]

    ignore_unknown_values = true

    dynamic "csv_options" {
      for_each = (local.file_format == "CSV") ? ["dummy"] : []
      content {
        quote             = "\""
        encoding          = "UTF-8"
        field_delimiter   = var.field_delimiter
        skip_leading_rows = var.header_rows_count
        allow_jagged_rows = true
      }
    }

    dynamic "hive_partitioning_options" {
      for_each = (var.use_hive_partitioning) ? ["dummy"] : []
      content {
        mode                     = "AUTO"
        source_uri_prefix        = local.gcs_uri_prefix
        require_partition_filter = false
      }
    }

    dynamic "google_sheets_options" {
      for_each = (local.file_format == "GOOGLE_SHEETS") ? ["dummy"] : []
      content {
        range             = var.sheet_range
        skip_leading_rows = var.header_rows_count
      }
    }
  }
}

Debug Output

Terraform will perform the following actions:

  # module.ingestion_raw_my_source.module.data_transfer["my_table"].google_bigquery_table.external_data_table must be replaced
-/+ resource "google_bigquery_table" "external_data_table" {
      ~ creation_time       = 1662135144335 -> (known after apply)
      ~ etag                = "YM0wyvc+irEkyo+hrpGwaQ==" -> (known after apply)
      ~ expiration_time     = 0 -> (known after apply)
      ~ id                  = "projects/my-project/datasets/raw/tables/my_source_my_table" -> (known after apply)
      ~ last_modified_time  = 1662135144799 -> (known after apply)
      ~ location            = "US" -> (known after apply)
      ~ num_bytes           = 0 -> (known after apply)
      ~ num_long_term_bytes = 0 -> (known after apply)
      ~ num_rows            = 0 -> (known after apply)
      ~ project             = "my-project" -> (known after apply)
      ~ schema              = jsonencode(
          ~ [ # forces replacement
                # (5 unchanged elements hidden)
                {
                    mode = "NULLABLE"
                    name = "weight"
                    type = "BIGNUMERIC"
                },
              - {
                  - mode = "NULLABLE"
                  - name = "environment"
                  - type = "STRING"
                },
              - {
                  - mode = "NULLABLE"
                  - name = "frequency"
                  - type = "STRING"
                },
            ]
        )
      ~ self_link           = "https://bigquery.googleapis.com/bigquery/v2/projects/my-project/datasets/raw/tables/my_source_my_table" -> (known after apply)
      ~ type                = "EXTERNAL" -> (known after apply)
        # (4 unchanged attributes hidden)

      ~ external_data_configuration {
          ~ connection_id         = "my-project.us.data-lake-connection" -> "projects/my-project/locations/US/connections/data-lake-connection"
          - max_bad_records       = 0 -> null
          + schema                = (known after apply)
            # (5 unchanged attributes hidden)

            # (2 unchanged blocks hidden)
        }
    }

Panic Output

None.

Expected Behavior

No changes detected in schema after apply.

Actual Behavior

The actual behavior is that the plan tries to remove the columns coming from the hive-partitioning layout (environment and frequency in my example), while not actually removing them after terraform apply.

I'd like to point out that although connection_id is specified inside the external_data_configuration block, the schema must be specified outside this block. Notice that in the configuration file I had to use conditionals when assigning the schema (both inside and outside external_data_configuration) to handle the scenarios where I have a connection_id and where I don't; the pattern is distilled below. Maybe fixing that alone would resolve the issue, i.e. allowing the schema to be specified in external_data_configuration when a connection_id is specified.
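For clarity, here is that conditional placement reduced to its essentials. This is only a sketch of the pattern already shown in the full configuration above (local.use_connection, local.schema_file_path and var.bigquery_connection_id are defined there), not a recommended final shape:

resource "google_bigquery_table" "external_data_table" {
  dataset_id          = var.destination_dataset_id
  table_id            = local.table_name
  deletion_protection = false

  # With a connection, the schema has to live at the top level ...
  schema = (local.use_connection) ? file(local.schema_file_path) : null

  external_data_configuration {
    # ... otherwise it has to live inside external_data_configuration.
    schema        = (local.use_connection) ? null : file(local.schema_file_path)
    connection_id = (local.use_connection) ? var.bigquery_connection_id : null

    source_format = local.file_format
    source_uris   = [local.gcs_uri]
  }
}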

Steps to Reproduce

  • Create a file in GCS for the hive-partitioned scenario at the path gs://my-bucket/data/raw/my_source/my_table/environment=prod/frequency=daily/something.csv, containing the columns specified by the following schema (I removed some columns from the original for simplicity):
[
    {
        "name": "date",
        "type": "DATE",
        "mode": "NULLABLE"
    },
    {
        "name": "index",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "weight",
        "type": "BIGNUMERIC",
        "mode": "NULLABLE"
    }
]
  • terraform init
  • terraform validate
  • terraform plan -out tfplan
  • terraform apply -input=false -auto-approve tfplan
  • terraform plan -out tfplan (again)

Important Factoids

None.

References

None

b/300616880

psvaiter added the bug label Sep 2, 2022
edwardmedia self-assigned this Sep 2, 2022
@edwardmedia
Contributor

edwardmedia commented Sep 2, 2022

@psvaiter can you provide the values for the variables (locals) so I can repro?

#12386

@psvaiter
Author

psvaiter commented Sep 5, 2022

Sure. I've built a complete working (simplified) example from scratch so you can reproduce it. Run once to apply, then run again to see the diff problem. I didn't include the backend and provider configurations.

data "google_project" "project" {
}

locals {
  project_id     = data.google_project.project.project_id
  project_number = data.google_project.project.number

  bucket_name    = google_storage_bucket.data_lake.name
  bucket_path    = "my_source/my_table"
  gcs_uri_prefix = "gs://${local.bucket_name}/${local.bucket_path}/"
  gcs_uri        = "gs://${local.bucket_name}/${local.bucket_path}/*.csv"
  use_connection = true

  schema = <<-EOT
    [
      {
        "name": "date",
        "type": "DATE",
        "mode": "NULLABLE"
      },
      {
        "name": "index",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "weight",
        "type": "BIGNUMERIC",
        "mode": "NULLABLE"
      }
    ]
  EOT

  csv_content = <<-EOT
    date,index,weight
    2022-09-05,index1,0.63
    2022-09-05,index2,0.27
    2022-09-05,index3,0.1
  EOT
}

resource "google_storage_bucket" "data_lake" {
  name     = "test-data-lake-ee7e7914" # suffix is part of a random uuid (generate your own)
  location = "US"

  uniform_bucket_level_access = true
}

resource "google_storage_bucket_object" "example_csv_daily" {
  bucket  = google_storage_bucket.data_lake.name
  name    = "${local.bucket_path}/environment=prod/frequency=daily/2022-09-01-example.csv"
  content = local.csv_content
}

resource "google_storage_bucket_object" "example_csv_monthly" {
  bucket  = google_storage_bucket.data_lake.name
  name    = "${local.bucket_path}/environment=prod/frequency=monthly/2022-09-01-example.csv"
  content = local.csv_content
}

resource "google_bigquery_connection" "data_lake" {
  connection_id = "test-connection"
  location      = "US"

  cloud_resource {}
}

resource "google_project_iam_member" "data_lake_bigquery_connection" {
  project = local.project_id
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${google_bigquery_connection.data_lake.cloud_resource[0].service_account_id}"
}

resource "google_bigquery_dataset" "main" {
  dataset_id    = "my_source"
  location      = "US"

  access {
    role          = "OWNER"
    special_group = "projectOwners"
  }
}

resource "google_bigquery_table" "my_table" {
  depends_on = [
    google_project_iam_member.data_lake_bigquery_connection,
    google_storage_bucket_object.example_csv_daily,
    google_storage_bucket_object.example_csv_monthly,
  ]

  dataset_id          = google_bigquery_dataset.main.dataset_id
  table_id            = "my_table"
  deletion_protection = false

  schema = (local.use_connection) ? local.schema : null

  external_data_configuration {
    autodetect    = false

    # Schema has to be informed outside this block when a connection is used
    schema        = (local.use_connection) ? null : local.schema
    connection_id = (local.use_connection) ? google_bigquery_connection.data_lake.id : null

    source_format = "CSV"
    source_uris   = [local.gcs_uri]

    ignore_unknown_values = true

    csv_options {
      quote             = "\""
      encoding          = "UTF-8"
      field_delimiter   = ","
      skip_leading_rows = 1
      allow_jagged_rows = true
    }

    hive_partitioning_options {
      mode                     = "AUTO"
      source_uri_prefix        = local.gcs_uri_prefix
      require_partition_filter = false
    }
  }
}

@edwardmedia
Contributor

edwardmedia commented Sep 6, 2022

This looks like an API issue.

b/245412495

  • Two more fields are added into the schema by the API.
  • Why has connectionId been updated in the response?

@psvaiter
Author

psvaiter commented Sep 7, 2022

The connection id changed due to another issue that you yourself referenced above in the comments: #12386

@psvaiter
Author

psvaiter commented Sep 7, 2022

I think that external tables expect the schema to be specified inside external_data_configuration. Hive partitioning works properly this way.

However, when you set a connection id, the API doesn't accept a schema in external_data_configuration, and that's when things break.

IMO, it makes sense to specify the schema in the same block where the connection id is specified, roughly as sketched below.
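A minimal sketch of what that would look like. Note that this is the configuration being asked for, which is not accepted today when connection_id is set; the names reuse the reproduction example above:

resource "google_bigquery_table" "my_table" {
  dataset_id          = google_bigquery_dataset.main.dataset_id
  table_id            = "my_table"
  deletion_protection = false

  external_data_configuration {
    autodetect = false

    # Desired behavior: schema and connection_id accepted side by side.
    schema        = local.schema
    connection_id = google_bigquery_connection.data_lake.id

    source_format = "CSV"
    source_uris   = [local.gcs_uri]

    hive_partitioning_options {
      mode                     = "AUTO"
      source_uri_prefix        = local.gcs_uri_prefix
      require_partition_filter = false
    }
  }
}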

@edwardmedia
Contributor

@melinath check out the response from b/245412495. How do you want to resolve this?

edwardmedia assigned melinath and unassigned edwardmedia Sep 8, 2022
@melinath
Collaborator

Wasn't able to get to this. I can't tell whether the extra fields are something we should diff suppress or what. It looks like there's also an extraneous change to max_bad_records?

melinath assigned edwardmedia and unassigned melinath Sep 12, 2022
@edwardmedia
Contributor

b/245412495

@psvaiter
Author

psvaiter commented Oct 5, 2022

@edwardmedia any updates on this?

edwardmedia removed their assignment Apr 9, 2023
@pkarlik

pkarlik commented Jun 15, 2023

This is a consequence of issue #10919 not being fixed.
The google_bigquery_table.schema field and the google_bigquery_table.external_data_configuration.schema field behave very differently.
google_bigquery_table.schema is much more sensitive to the schema definition (a sketch follows this list):

  • it requires the partitioning key column to be defined in the schema to avoid constant re-creation of the table (but having the partition key in the schema prevents the initial creation of the table, so you first need to create the table with a schema that omits the partition column and add the partition column to the schema later)
  • it requires the mode attribute to be filled in for all columns in the schema
  • it requires precise data type definitions (e.g. INT64 is not good enough, you must provide INTEGER, otherwise the table is constantly re-created)
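To illustrate those constraints, here is a hypothetical pair of top-level schemas built on the reproduction example above: every column carries an explicit mode, the hive partition columns are only added once the table already exists, and (per the comment above) integer columns would use INTEGER rather than INT64. This is a sketch of the commenter's workaround, not verified provider behavior:

locals {
  # Schema used for the initial terraform apply (no partition columns yet).
  schema_initial = jsonencode([
    { name = "date",   type = "DATE",       mode = "NULLABLE" },
    { name = "index",  type = "STRING",     mode = "NULLABLE" },
    { name = "weight", type = "BIGNUMERIC", mode = "NULLABLE" },
  ])

  # Schema applied in a second apply, once the table exists: the hive
  # partition columns environment and frequency are appended at the end.
  schema_with_partitions = jsonencode([
    { name = "date",        type = "DATE",       mode = "NULLABLE" },
    { name = "index",       type = "STRING",     mode = "NULLABLE" },
    { name = "weight",      type = "BIGNUMERIC", mode = "NULLABLE" },
    { name = "environment", type = "STRING",     mode = "NULLABLE" },
    { name = "frequency",   type = "STRING",     mode = "NULLABLE" },
  ])
}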

@nevzheng

Hi All,
I investigated the connection_id issue and fixed it in GoogleCloudPlatform/magic-modules#9053.

The diff suppression comparison for the region in the "dot"/normalized format should be case-insensitive.

@Karpisek

Karpisek commented Oct 2, 2023

Hi, any updates on this? This issue makes it really painful to define partitioned BQ data lake tables using Terraform...

@melinath
Collaborator

@nevzheng just to clarify we're still considering this open due to the complications with figuring out the best long-term fix; is that correct?

@Karpisek

Karpisek commented Dec 1, 2023

Hi, is there any update on this one? Thank you

@Karpisek

Hi, this is still an issue, could someone have a look at it please? Thank you. It adds unnecessary and annoying steps to deploying the IaC through a CI/CD pipeline.

@wj-chen

wj-chen commented Jan 5, 2024

Yes, as @melinath mentioned, this issue is still open due to complications in figuring out the best long-term fix. It may well result in an API-level change.

For those affected, would you be able to apply the workaround mentioned in #16907 about using a lifecycle config in the meantime?
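For reference, the #16907 workaround amounts to roughly the following; this is a minimal sketch only, and as the next comment points out, it also means deliberate schema changes are ignored:

resource "google_bigquery_table" "external_data_table" {
  # ... existing arguments unchanged ...

  lifecycle {
    # Stop Terraform from planning a replacement because of the
    # hive-partitioning columns the API adds to the schema.
    ignore_changes = [schema]
  }
}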

@Karpisek

Karpisek commented Jan 9, 2024

Hi @wj-chen,
the mentioned workaround is not really a workaround - it just masks the issue. Once one actually wants to change the schema, that change gets ignored too (thanks to the mentioned 'workaround'), so it solves one deployment inconvenience only by replacing it with another.

@mdlnr

mdlnr commented Jan 9, 2024

Hi @Karpisek - I have another suggestion for a workaround that I implemented.

We source the schema from a JSON file and calculate a hash of the content with Terraform (e.g. by using the base64sha256 function). This hash is used as part of the resource identifier of the external tables (in our case set up with for_each):

resource "google_bigquery_table" "external_table" {
  for_each = {
    for td in local.table_definitions : "${td.dataset_id}_${td.table_name}_${base64sha256(td.schema)}" => td
  }

 [...]

  lifecycle {
    ignore_changes = [
      schema
    ]
  }
}

This way we force a resource recreation when the actual schema changes. You could also implement a different solution without changing the identifier by depending on a terraform_data resource that contains the hash of the schema.
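A sketch of that terraform_data variant, assuming Terraform >= 1.4 (which provides the built-in terraform_data resource) and the same local.table_definitions structure as above; the resource names are illustrative:

resource "terraform_data" "schema_version" {
  for_each = { for td in local.table_definitions : "${td.dataset_id}_${td.table_name}" => td }

  # Any change to the schema hash replaces this resource ...
  triggers_replace = base64sha256(each.value.schema)
}

resource "google_bigquery_table" "external_table" {
  for_each = { for td in local.table_definitions : "${td.dataset_id}_${td.table_name}" => td }

  # [...]

  lifecycle {
    ignore_changes = [schema]

    # ... which in turn forces the matching table to be recreated,
    # without the schema hash leaking into the table's resource address.
    replace_triggered_by = [terraform_data.schema_version[each.key]]
  }
}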

@Karpisek

Karpisek commented Jan 15, 2024

Hi @mdlnr, thank you for the suggestion, it pointed me to a solution that will work for our use case. Although it is not a clean one, it is better than what we have now. Thank you

@wj-chen

wj-chen commented Sep 12, 2024

I received some guidance from the corresponding API team. There may be a way to exclude columns generated by Hive partitioning from diff detection which would solve this issue. I will experiment and update here again.

modular-magician added a commit to modular-magician/terraform-provider-google that referenced this issue Dec 23, 2024
[upstream:834d30be6e0212771d6f1757856cd54d6c57451b]

Signed-off-by: Modular Magician <magic-modules@google.com>
modular-magician added a commit that referenced this issue Dec 23, 2024
[upstream:834d30be6e0212771d6f1757856cd54d6c57451b]

Signed-off-by: Modular Magician <magic-modules@google.com>
@cvm-a

cvm-a commented Jan 19, 2025

Hi, just checking, did the diff detection experiments work?
