[CT-1044] Leverage BigQuery table clones for a Low Cost CI environment #270

amirbtb · 2022-08-16T14:27:17Z

Describe the feature

The BigQuery table clone feature is in Preview since February 2022. It would allow us to setup a CI environment in a dedicated GCP Project/Bigquery Database at Zero/Low Cost (following the zero-copy-clone logic of Snowflake).

Describe alternatives you've considered

Configure CI to build in a separate project.

Additional context

Limitations

You can't create a clone of a :

view
materialized view
external table
(Source : Table Clones Limitations)

View and external tables being just an abstraction for data stored elsewhere, we could recreate them in the CI Google Project/BigQuery Database.

Workflow

GCP describe how to Create Tables Clones in SQL, I just need to figure out a couple of things about the workflow :

Filter only new/modified materializations (Slim CI)
Clone Tables + Create External Tables & Views
Run dbt test on the new/modified materializations (Slim CI)
Destroy Table Clones, External Tables & Views (optional)

Who will this benefit?

Any user of dbt-bigquery who wishes to setup a secured test environment for CI.

Are you interested in contributing this feature?

Yes.

romanoww · 2023-01-31T15:05:08Z

I'd like to add to this issue/enhancement. I assume it might be fairly easy to add simple table clones within the current copy materialization framework provided by dbt-bigquery.

The CopyJobConfig is already incorporated in the dbt-bigquery framework and used for the materialized='copy' configurations. This CopyJobConfig is implemented at the following location.

connections.py

We can pass the operation_type parameter to the CopyJobConfig, see docs and accepted_strings. The accepted values for this parameter include both COPY and CLONE, my guess is that default is set to COPY.

Adding operation_type to the copy_bq_table and copy_and_results functions would reconfigure our copy job.

def copy_bq_table(self, source, destination, write_disposition, operation_type):
        conn = self.get_thread_connection()
        client = conn.handle

        # -------------------------------------------------------------------------------
        #  BigQuery allows to use copy API using two different formats:
        #  1. client.copy_table(source_table_id, destination_table_id)
        #     where source_table_id = "your-project.source_dataset.source_table"
        #  2. client.copy_table(source_table_ids, destination_table_id)
        #     where source_table_ids = ["your-project.your_dataset.your_table_name", ...]
        #  Let's use uniform function call and always pass list there
        # -------------------------------------------------------------------------------
        if type(source) is not list:
            source = [source]

        source_ref_array = [
            self.table_ref(src_table.database, src_table.schema, src_table.table)
            for src_table in source
        ]
        destination_ref = self.table_ref(
            destination.database, destination.schema, destination.table
        )

        logger.debug(
            'Copying table(s) "{}" to "{}" with disposition: "{}"',
            ", ".join(source_ref.path for source_ref in source_ref_array),
            destination_ref.path,
            write_disposition,
        )

        def copy_and_results():
            job_config = google.cloud.bigquery.CopyJobConfig(write_disposition=write_disposition, 
                operation_type=operation_type)
            copy_job = client.copy_table(source_ref_array, destination_ref, job_config=job_config)
            timeout = self.get_job_execution_timeout_seconds(conn) or 300
            iterator = copy_job.result(timeout=timeout)
            return copy_job, iterator

        self._retry_and_handle(
            msg='copy table "{}" to "{}"'.format(
                ", ".join(source_ref.path for source_ref in source_ref_array),
                destination_ref.path,
            ),
            conn=conn,
            fn=copy_and_results,
        )

Further down the line, in the BigQuery Adapter Class file we could redefine the copy_table function by adding a operation_type input parameters. This way, we can call the bigquery adapter with either copy or clone (should be the accepted values, default is set to clone)

def copy_table(self, source, destination, materialization, operation_type):
        if materialization == "incremental":
            write_disposition = WRITE_APPEND
        elif materialization == "table":
            write_disposition = WRITE_TRUNCATE
        else:
            raise dbt.exceptions.CompilationError(
                'Copy table materialization must be "copy" or "table", but '
                f"config.get('copy_materialization', 'table') was "
                f"{materialization}"
            )

        self.connections.copy_bq_table(source, destination, write_disposition, operation_type)
        
        return "{operation_type} TABLE with materialization: {}".format(materialization)

We can then duplicate the copy materialization and changing it to clone.sql and changing the write_dispositions input to clone copy.sql.

{# Call adapter copy_table function #}
  {%- set result_str = adapter.copy_table(
      source_array,
      destination,
      config.get('copy_materialization', default = 'table'),
      config.get('operation_type', default= 'clone') -%}

This would require the config to include an operation_type (logging should be more extensive).

I'm not an expert by any chance and have never meddled with dbt source code, so I'm probably missing something / underestimating things. However, these are my two cents.
This implementation would not tick all the boxes suggested by @amirbtb, but it's a start.

Fleid · 2023-02-14T00:13:18Z

I'm moving that out of triage and into refinement.
Yes we want to do this, but have to think across the board for the other adapters. In the meantime, all contributions are welcome!

Timfrazer · 2023-02-23T20:34:56Z

This would really help us as our costs for running dbt on bigquery is quite expensive on our increasing data tables. We are going to do something hacky like this.

jtcohen6 · 2023-04-19T10:15:21Z

Let's do this for real :) and not just on BQ (but leveraging BQ-specific copying/cloning capabilities where possible!)

Check out:

[CT-2348] [Feature] dbt clone command dbt-core#7256

github-actions · 2023-10-17T01:49:14Z

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions · 2023-10-25T01:48:48Z

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

amirbtb added enhancement New feature or request triage labels Aug 16, 2022

github-actions bot changed the title ~~Leverage BigQuery table clones for a Low Cost CI environment~~ [CT-1044] Leverage BigQuery table clones for a Low Cost CI environment Aug 16, 2022

jtcohen6 mentioned this issue Aug 20, 2022

[CT-1045] [Feature] Provide option to allow tests to be ran before a model's data is updated dbt-labs/dbt-core#5660

Closed

jtcohen6 mentioned this issue Nov 22, 2022

[CT-1481] [Feature] Expose async arg in run_query() dbt-labs/dbt-snowflake#315

Closed

3 tasks

Fleid added refinement Product or leadership input needed and removed triage labels Feb 14, 2023

Fleid self-assigned this Mar 24, 2023

github-actions bot added the Stale label Oct 17, 2023

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Oct 25, 2023

Fleid removed the Stale label Oct 25, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[CT-1044] Leverage BigQuery table clones for a Low Cost CI environment #270

[CT-1044] Leverage BigQuery table clones for a Low Cost CI environment #270

amirbtb commented Aug 16, 2022 •

edited

Loading

romanoww commented Jan 31, 2023 •

edited

Loading

Fleid commented Feb 14, 2023

Timfrazer commented Feb 23, 2023

jtcohen6 commented Apr 19, 2023

github-actions bot commented Oct 17, 2023

github-actions bot commented Oct 25, 2023

[CT-1044] Leverage BigQuery table clones for a Low Cost CI environment #270

[CT-1044] Leverage BigQuery table clones for a Low Cost CI environment #270

Comments

amirbtb commented Aug 16, 2022 • edited Loading

Describe the feature

Describe alternatives you've considered

Additional context

Limitations

Workflow

Who will this benefit?

Are you interested in contributing this feature?

romanoww commented Jan 31, 2023 • edited Loading

Fleid commented Feb 14, 2023

Timfrazer commented Feb 23, 2023

jtcohen6 commented Apr 19, 2023

github-actions bot commented Oct 17, 2023

github-actions bot commented Oct 25, 2023

amirbtb commented Aug 16, 2022 •

edited

Loading

romanoww commented Jan 31, 2023 •

edited

Loading