Diff in schema in google_bigquery_table when using a connection_id and hive_partitioning_options #12465

Open
psvaiter opened this issue Sep 2, 2022 · 21 comments

Comments

@psvaiter

psvaiter commented Sep 2, 2022

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

Terraform v1.1.7

  required_providers {
    google = {
      source  = "hashicorp/google"
      version = "4.33.0"
    }
  }

Affected Resource(s)

  • google_bigquery_table

Terraform Configuration Files

locals {
  table_name       = "${var.data_provider}_${var.event_type}"
  schema_file_path = "${path.module}/../../schemas/bigquery/${var.data_provider}/${var.event_type}.json"

  file_format       = upper(var.file_format)
  file_format_lower = lower(var.file_format)

  bucket_path    = (var.bucket_path == null) ? "" : trimsuffix(var.bucket_path, "/")
  gcs_uri_prefix = "gs://${var.gcs_bucket_name}/${local.bucket_path}/"
  gcs_uri        = "gs://${var.gcs_bucket_name}/${local.bucket_path}/*.${local.file_format_lower}"
  use_connection = (local.file_format != "GOOGLE_SHEETS" && var.bigquery_connection_id != null)
}

resource "google_bigquery_table" "external_data_table" {
  dataset_id          = var.destination_dataset_id
  table_id            = local.table_name
  deletion_protection = false

  labels = {
    table_class = (var.is_lookup_table) ? "lookup" : "fact"
  }

  schema = (local.use_connection) ? file(local.schema_file_path) : null

  external_data_configuration {
    autodetect    = false

    # Schema has to be informed outside this block when a connection is used
    schema        = (local.use_connection) ? null : file(local.schema_file_path)
    connection_id = (local.use_connection) ? var.bigquery_connection_id : null

    source_format = local.file_format
    source_uris = [
      (local.file_format == "GOOGLE_SHEETS") ? var.drive_uri : local.gcs_uri
    ]

    ignore_unknown_values = true

    dynamic "csv_options" {
      for_each = (local.file_format == "CSV") ? ["dummy"] : []
      content {
        quote             = "\""
        encoding          = "UTF-8"
        field_delimiter   = var.field_delimiter
        skip_leading_rows = var.header_rows_count
        allow_jagged_rows = true
      }
    }

    dynamic "hive_partitioning_options" {
      for_each = (var.use_hive_partitioning) ? ["dummy"] : []
      content {
        mode                     = "AUTO"
        source_uri_prefix        = local.gcs_uri_prefix
        require_partition_filter = false
      }
    }

    dynamic "google_sheets_options" {
      for_each = (local.file_format == "GOOGLE_SHEETS") ? ["dummy"] : []
      content {
        range             = var.sheet_range
        skip_leading_rows = var.header_rows_count
      }
    }
  }
}

Debug Output

Terraform will perform the following actions:

  # module.ingestion_raw_my_source.module.data_transfer["my_table"].google_bigquery_table.external_data_table must be replaced
-/+ resource "google_bigquery_table" "external_data_table" {
      ~ creation_time       = 1662135144335 -> (known after apply)
      ~ etag                = "YM0wyvc+irEkyo+hrpGwaQ==" -> (known after apply)
      ~ expiration_time     = 0 -> (known after apply)
      ~ id                  = "projects/my-project/datasets/raw/tables/my_source_my_table" -> (known after apply)
      ~ last_modified_time  = 1662135144799 -> (known after apply)
      ~ location            = "US" -> (known after apply)
      ~ num_bytes           = 0 -> (known after apply)
      ~ num_long_term_bytes = 0 -> (known after apply)
      ~ num_rows            = 0 -> (known after apply)
      ~ project             = "my-project" -> (known after apply)
      ~ schema              = jsonencode(
          ~ [ # forces replacement
                # (5 unchanged elements hidden)
                {
                    mode = "NULLABLE"
                    name = "weight"
                    type = "BIGNUMERIC"
                },
              - {
                  - mode = "NULLABLE"
                  - name = "environment"
                  - type = "STRING"
                },
              - {
                  - mode = "NULLABLE"
                  - name = "frequency"
                  - type = "STRING"
                },
            ]
        )
      ~ self_link           = "https://bigquery.googleapis.com/bigquery/v2/projects/my-project/datasets/raw/tables/my_source_my_table" -> (known after apply)
      ~ type                = "EXTERNAL" -> (known after apply)
        # (4 unchanged attributes hidden)

      ~ external_data_configuration {
          ~ connection_id         = "my-project.us.data-lake-connection" -> "projects/my-project/locations/US/connections/data-lake-connection"
          - max_bad_records       = 0 -> null
          + schema                = (known after apply)
            # (5 unchanged attributes hidden)

            # (2 unchanged blocks hidden)
        }
    }

Panic Output

None.

Expected Behavior

No changes detected in schema after apply.

Actual Behavior

The actual behavior is that the plan tries to remove the columns coming from the hive-partitioning layout (environment and frequency in my example), while not actually removing them after terraform apply.

I'd like to point out that although connection_id is specified inside the external_data_configuration block, the schema must be specified outside this block. Notice that in the configuration file I had to use conditionals when assigning the schema (both inside and outside external_data_configuration) to handle the scenarios where I have a connection_id and where I don't; the pattern is distilled below. Maybe fixing that alone would resolve the issue, i.e. allowing the schema to be specified in external_data_configuration when a connection_id is specified.
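For clarity, here is that conditional placement reduced to its essentials. This is only a sketch of the pattern already shown in the full configuration above (local.use_connection, local.schema_file_path and var.bigquery_connection_id are defined there), not a recommended final shape:

resource "google_bigquery_table" "external_data_table" {
  dataset_id          = var.destination_dataset_id
  table_id            = local.table_name
  deletion_protection = false

  # With a connection, the schema has to live at the top level ...
  schema = (local.use_connection) ? file(local.schema_file_path) : null

  external_data_configuration {
    # ... otherwise it has to live inside external_data_configuration.
    schema        = (local.use_connection) ? null : file(local.schema_file_path)
    connection_id = (local.use_connection) ? var.bigquery_connection_id : null

    source_format = local.file_format
    source_uris   = [local.gcs_uri]
  }
}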

Steps to Reproduce

  • Create a file in GCS for the hive-partitioned scenario at the path gs://my-bucket/data/raw/my_source/my_table/environment=prod/frequency=daily/something.csv, containing the columns specified by the following schema (I removed some columns from the original for simplicity):
[
    {
        "name": "date",
        "type": "DATE",
        "mode": "NULLABLE"
    },
    {
        "name": "index",
        "type": "STRING",
        "mode": "NULLABLE"
    },
    {
        "name": "weight",
        "type": "BIGNUMERIC",
        "mode": "NULLABLE"
    }
]
  • terraform init
  • terraform validate
  • terraform plan -out tfplan
  • terraform apply -input=false -auto-approve tfplan
  • terraform plan -out tfplan (again)

Important Factoids

None.

References

None

b/300616880

psvaiter added the bug label Sep 2, 2022
edwardmedia self-assigned this Sep 2, 2022
@edwardmedia
Contributor

edwardmedia commented Sep 2, 2022

@psvaiter can you provide the values for the variables (locals) so I can repro?

#12386

@psvaiter
Author

psvaiter commented Sep 5, 2022

Sure. I've built a complete working (simplified) example from scratch so you can reproduce it. Run once to apply, then run again to see the diff problem. I didn't include the backend and provider configurations.

data "google_project" "project" {
}

locals {
  project_id     = data.google_project.project.project_id
  project_number = data.google_project.project.number

  bucket_name    = google_storage_bucket.data_lake.name
  bucket_path    = "my_source/my_table"
  gcs_uri_prefix = "gs://${local.bucket_name}/${local.bucket_path}/"
  gcs_uri        = "gs://${local.bucket_name}/${local.bucket_path}/*.csv"
  use_connection = true

  schema = <<-EOT
    [
      {
        "name": "date",
        "type": "DATE",
        "mode": "NULLABLE"
      },
      {
        "name": "index",
        "type": "STRING",
        "mode": "NULLABLE"
      },
      {
        "name": "weight",
        "type": "BIGNUMERIC",
        "mode": "NULLABLE"
      }
    ]
  EOT

  csv_content = <<-EOT
    date,index,weight
    2022-09-05,index1,0.63
    2022-09-05,index2,0.27
    2022-09-05,index3,0.1
  EOT
}

resource "google_storage_bucket" "data_lake" {
  name     = "test-data-lake-ee7e7914" # suffix is part of a random uuid (generate your own)
  location = "US"

  uniform_bucket_level_access = true
}

resource "google_storage_bucket_object" "example_csv_daily" {
  bucket  = google_storage_bucket.data_lake.name
  name    = "${local.bucket_path}/environment=prod/frequency=daily/2022-09-01-example.csv"
  content = local.csv_content
}

resource "google_storage_bucket_object" "example_csv_monthly" {
  bucket  = google_storage_bucket.data_lake.name
  name    = "${local.bucket_path}/environment=prod/frequency=monthly/2022-09-01-example.csv"
  content = local.csv_content
}

resource "google_bigquery_connection" "data_lake" {
  connection_id = "test-connection"
  location      = "US"

  cloud_resource {}
}

resource "google_project_iam_member" "data_lake_bigquery_connection" {
  project = local.project_id
  role    = "roles/storage.objectViewer"
  member  = "serviceAccount:${google_bigquery_connection.data_lake.cloud_resource[0].service_account_id}"
}

resource "google_bigquery_dataset" "main" {
  dataset_id    = "my_source"
  location      = "US"

  access {
    role          = "OWNER"
    special_group = "projectOwners"
  }
}

resource "google_bigquery_table" "my_table" {
  depends_on = [
    google_project_iam_member.data_lake_bigquery_connection,
    google_storage_bucket_object.example_csv_daily,
    google_storage_bucket_object.example_csv_monthly,
  ]

  dataset_id          = google_bigquery_dataset.main.dataset_id
  table_id            = "my_table"
  deletion_protection = false

  schema = (local.use_connection) ? local.schema : null

  external_data_configuration {
    autodetect    = false

    # Schema has to be informed outside this block when a connection is used
    schema        = (local.use_connection) ? null : local.schema
    connection_id = (local.use_connection) ? google_bigquery_connection.data_lake.id : null

    source_format = "CSV"
    source_uris   = [local.gcs_uri]

    ignore_unknown_values = true

    csv_options {
      quote             = "\""
      encoding          = "UTF-8"
      field_delimiter   = ","
      skip_leading_rows = 1
      allow_jagged_rows = true
    }

    hive_partitioning_options {
      mode                     = "AUTO"
      source_uri_prefix        = local.gcs_uri_prefix
      require_partition_filter = false
    }
  }
}

@edwardmedia
Contributor

edwardmedia commented Sep 6, 2022

This looks like an API issue.

b/245412495

  • Two more fields are added into the schema by the API.
  • Why has connectionId been updated in the response?

@psvaiter
Author

psvaiter commented Sep 7, 2022

The connection id changed due to another issue that you yourself referenced above in the comments: #12386

@psvaiter
Author

psvaiter commented Sep 7, 2022

I think that external tables expect the schema to be specified inside external_data_configuration. Hive partitioning works properly this way.

However, when you set a connection id, the API doesn't accept a schema in external_data_configuration, and that's when things break.

IMO, it makes sense to specify the schema in the same block where the connection id is specified, roughly as sketched below.
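A minimal sketch of what that would look like. Note that this is the configuration being asked for, which is not accepted today when connection_id is set; the names reuse the reproduction example above:

resource "google_bigquery_table" "my_table" {
  dataset_id          = google_bigquery_dataset.main.dataset_id
  table_id            = "my_table"
  deletion_protection = false

  external_data_configuration {
    autodetect = false

    # Desired behavior: schema and connection_id accepted side by side.
    schema        = local.schema
    connection_id = google_bigquery_connection.data_lake.id

    source_format = "CSV"
    source_uris   = [local.gcs_uri]

    hive_partitioning_options {
      mode                     = "AUTO"
      source_uri_prefix        = local.gcs_uri_prefix
      require_partition_filter = false
    }
  }
}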

@edwardmedia
Contributor

@melinath check out the response from b/245412495. How do you want to resolve this?

edwardmedia assigned melinath and unassigned edwardmedia Sep 8, 2022
@melinath
Collaborator

Wasn't able to get to this. I can't tell whether the extra fields are something we should diff suppress or what. It looks like there's also an extraneous change to max_bad_records?

melinath assigned edwardmedia and unassigned melinath Sep 12, 2022
@edwardmedia
Contributor

b/245412495

@psvaiter
Author

psvaiter commented Oct 5, 2022

@edwardmedia any updates on this?

edwardmedia removed their assignment Apr 9, 2023
@pkarlik

pkarlik commented Jun 15, 2023

This is a consequence of issue #10919 not being fixed.
The google_bigquery_table.schema field and the google_bigquery_table.external_data_configuration.schema field behave very differently.
google_bigquery_table.schema is much more sensitive to the schema definition (a sketch follows this list):

  • it requires the partitioning key column to be defined in the schema to avoid constant re-creation of the table (but having the partition key in the schema prevents the initial creation of the table, so you first need to create the table with a schema that omits the partition column and add the partition column to the schema later)
  • it requires the mode attribute to be filled in for all columns in the schema
  • it requires precise data type definitions (e.g. INT64 is not good enough, you must provide INTEGER, otherwise the table is constantly re-created)
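To illustrate those constraints, here is a hypothetical pair of top-level schemas built on the reproduction example above: every column carries an explicit mode, the hive partition columns are only added once the table already exists, and (per the comment above) integer columns would use INTEGER rather than INT64. This is a sketch of the commenter's workaround, not verified provider behavior:

locals {
  # Schema used for the initial terraform apply (no partition columns yet).
  schema_initial = jsonencode([
    { name = "date",   type = "DATE",       mode = "NULLABLE" },
    { name = "index",  type = "STRING",     mode = "NULLABLE" },
    { name = "weight", type = "BIGNUMERIC", mode = "NULLABLE" },
  ])

  # Schema applied in a second apply, once the table exists: the hive
  # partition columns environment and frequency are appended at the end.
  schema_with_partitions = jsonencode([
    { name = "date",        type = "DATE",       mode = "NULLABLE" },
    { name = "index",       type = "STRING",     mode = "NULLABLE" },
    { name = "weight",      type = "BIGNUMERIC", mode = "NULLABLE" },
    { name = "environment", type = "STRING",     mode = "NULLABLE" },
    { name = "frequency",   type = "STRING",     mode = "NULLABLE" },
  ])
}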

@nevzheng

Hi All,
I investigated the connection_id issue and fixed it in GoogleCloudPlatform/magic-modules#9053.

The diff suppression comparison for the region in the "dot"/normalized format should be case-insensitive.

@Karpisek

Karpisek commented Oct 2, 2023

Hi, any updates on this? This issue makes it really painful to define partitioned BQ data lake tables using Terraform...

@melinath
Collaborator

@nevzheng just to clarify we're still considering this open due to the complications with figuring out the best long-term fix; is that correct?

@Karpisek

Karpisek commented Dec 1, 2023

Hi, is there any update on this one? Thank you

@Karpisek

Hi, this is still an issue, could someone have a look at it please? Thank you. It adds unnecessary and annoying steps to deploying the IaC through a CI/CD pipeline.

@wj-chen

wj-chen commented Jan 5, 2024

Yes, as @melinath mentioned, this issue is still open due to complications in figuring out the best long-term fix. It may well result in an API-level change.

For those affected, would you be able to apply the workaround mentioned in #16907 about using a lifecycle config in the meantime?
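For reference, the #16907 workaround amounts to roughly the following; this is a minimal sketch only, and as the next comment points out, it also means deliberate schema changes are ignored:

resource "google_bigquery_table" "external_data_table" {
  # ... existing arguments unchanged ...

  lifecycle {
    # Stop Terraform from planning a replacement because of the
    # hive-partitioning columns the API adds to the schema.
    ignore_changes = [schema]
  }
}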

@Karpisek

Karpisek commented Jan 9, 2024

Hi @wj-chen,
the mentioned workaround is not really a workaround - it just masks the issue. Once one actually wants to change the schema, that change gets ignored too (thanks to the mentioned 'workaround'), so it solves one deployment inconvenience only by replacing it with another.

@mdlnr

mdlnr commented Jan 9, 2024

Hi @Karpisek - I have another suggestion for a workaround that I implemented.

We source the schema from a JSON file and calculate a hash of the content with Terraform (e.g. by using the base64sha256 function). This hash is used as part of the resource identifier of the external tables (in our case set up with for_each):

resource "google_bigquery_table" "external_table" {
  for_each = {
    for td in local.table_definitions : "${td.dataset_id}_${td.table_name}_${base64sha256(td.schema)}" => td
  }

 [...]

  lifecycle {
    ignore_changes = [
      schema
    ]
  }
}

This way we force a resource recreation when the actual schema changes. You could also implement a different solution without changing the identifier by depending on a terraform_data resource that contains the hash of the schema.
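A sketch of that terraform_data variant, assuming Terraform >= 1.4 (which provides the built-in terraform_data resource) and the same local.table_definitions structure as above; the resource names are illustrative:

resource "terraform_data" "schema_version" {
  for_each = { for td in local.table_definitions : "${td.dataset_id}_${td.table_name}" => td }

  # Any change to the schema hash replaces this resource ...
  triggers_replace = base64sha256(each.value.schema)
}

resource "google_bigquery_table" "external_table" {
  for_each = { for td in local.table_definitions : "${td.dataset_id}_${td.table_name}" => td }

  # [...]

  lifecycle {
    ignore_changes = [schema]

    # ... which in turn forces the matching table to be recreated,
    # without the schema hash leaking into the table's resource address.
    replace_triggered_by = [terraform_data.schema_version[each.key]]
  }
}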

@Karpisek

Karpisek commented Jan 15, 2024

Hi @mdlnr, thank you for the suggestion, it pointed me to a solution that will work for our use case. Although it is not a clean one, it is better than what we have now. Thank you

@wj-chen

wj-chen commented Sep 12, 2024

I received some guidance from the corresponding API team. There may be a way to exclude columns generated by Hive partitioning from diff detection which would solve this issue. I will experiment and update here again.

modular-magician added a commit to modular-magician/terraform-provider-google that referenced this issue Dec 23, 2024
[upstream:834d30be6e0212771d6f1757856cd54d6c57451b]

Signed-off-by: Modular Magician <magic-modules@google.com>
modular-magician added a commit that referenced this issue Dec 23, 2024
[upstream:834d30be6e0212771d6f1757856cd54d6c57451b]

Signed-off-by: Modular Magician <magic-modules@google.com>
@cvm-a

cvm-a commented Jan 19, 2025

Hi, just checking, did the diff detection experiments work?
